Waypath (Idle Words)

05.15.2003

Waypath

Today I owe a shoutout to the Waypath Project. Steve Nieker was kind enough to share his list of about a hundred thousand websites, and suddenly my crawler went from adding 200 blogs per hour to adding 11,000. At this rate, we might hit 200,000 weblogs indexed later in the night.

The sites being added now have the proportions of a martini, in which the gin is represented by Pitas.com, and Movable Type serves as the vermouth. A dry martini. With a Manila olive in it.

The Waypath Project is worth a visit because they are trying to do two very challenging and cool things. The first is to provide a per-post weblog search, rather than the kind of per-page search you can get on Blogdex or Google. The second is to search things based on similarities in content, rather than just doing keyword matches. This is the kind of stuff I've taken to calling 'semantic indexing', just to completely muddy the semantic waters. Their core technique is proprietary, unfortunately, so there's no code you can go poke your big Slavic nose into. But it's still nice to see a content technique actually implemented on live data from the Web.
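
Since Waypath's actual method is not public, the following is only a rough sketch of what per-post, content-based matching can look like, using a textbook stand-in (TF-IDF weighting plus cosine similarity). Nothing here is Waypath's code: the function names and the weighting formula are assumptions made for illustration. Each post is treated as its own document, and posts are compared by their overall word profile rather than by a single query keyword.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(posts):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per post.

    Illustrative weighting only; this is not Waypath's proprietary method.
    """
    token_lists = [tokenize(p) for p in posts]
    n = len(posts)
    # Document frequency: the number of posts in which each term appears.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Smoothed IDF, so a term found in every post keeps a small weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_index, posts):
    """Return the index of the post most similar to posts[query_index]."""
    vectors = tfidf_vectors(posts)
    scores = [
        (cosine(vectors[query_index], v), i)
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    return max(scores)[1]
```

Calling `most_similar(0, posts)` on a list of post texts returns the index of the post whose vocabulary is closest to the first one. A sketch like this only rewards shared words; anything deserving the 'semantic' label would have to go further, for instance by relating different words that tend to occur in similar contexts.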

Thanks in large part to Google, search engine designers have learned the importance of analyzing hyperlinks to improve their search results. On the content side, however, the approaches have remained pretty rudimentary. Various cool methods for calculating content-based similarity are being explored in the academic world, but those algorithms don't often leave the ivory tower, where they are mainly valued for their ability to impress a tenure committee.

Waypath is still quite experimental, so you have to approach it with a good deal of patient understanding. But it's a fascinating site to play with, as you look at your results and try to suss out why the engine made certain connections. And as with all good things, there's a blog attached.

Idle Words

brevity is for the weak

Your Host

Maciej Cegłowski
maciej @ ceglowski.com

Threat

Please ask permission before reprinting full-text posts or I will crush you.