Re: searchable index of the web

mkgray@athena.mit.edu
Wed, 30 Jun 93 16:58:15 EDT


Ok, how "big" is the Web? Here is what W4 has found out.
Actually, first I'd better explain a little bit about what the wanderer does.
It does a simple depth-first search, with an added feature I call 'getting
bored'. That is, if it finds a number of documents that share the same
URL up to the last field (e.g. http://foo/bar/blah, http://foo/bar/baz,
http://foo/bar/more), it will eventually get 'bored' and skip them. This makes
it go a little quicker. Of course, it is potentially losing some documents
here, but probably not many.
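For the curious, the traversal plus the boredom heuristic could be sketched
roughly like this. This is a minimal modern sketch, not the W4 code itself;
the threshold value, the `get_links` callback, and all the names are my own
placeholders:

```python
# Sketch of a depth-first crawl with a 'getting bored' heuristic:
# once too many visited URLs share the same prefix up to the last
# path field, the crawler skips further documents under that prefix.

BOREDOM_THRESHOLD = 10  # assumed cutoff; the real value isn't stated


def url_prefix(url):
    """Everything up to the last field: http://foo/bar/blah -> http://foo/bar."""
    return url.rsplit("/", 1)[0]


def crawl(start_url, get_links):
    """Depth-first walk from start_url; get_links(url) returns outgoing URLs."""
    seen = set()
    prefix_counts = {}
    stack = [start_url]      # LIFO stack => depth-first order
    visited = []
    while stack:
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        prefix = url_prefix(url)
        prefix_counts[prefix] = prefix_counts.get(prefix, 0) + 1
        if prefix_counts[prefix] > BOREDOM_THRESHOLD:
            continue         # 'bored' with this directory: skip it
        visited.append(url)
        stack.extend(get_links(url))
    return visited
```

With a threshold of 10, a directory holding 20 sibling documents would only
contribute its first 10 to the index, which is the trade-off described above.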

W4 took many hours to run (maybe 20), but I don't remember exactly, because it
saves state so I could kill it and restart it whenever I wanted. In
total, W4 found more than 17,000 http documents (it didn't follow any other
kinds of links) on more than 125 unique hosts. In the current version,
it *only* retrieved the URL of each document.
In the next version, I hope to have it do the following other things.

o Get the <title>Title</title> of the document
o Get the length of the document
o Do a 'keyword' analysis of the document
o Count the number of links in a document
o Improve on the boredom system
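The kill-and-restart behavior mentioned earlier boils down to checkpointing
the crawl frontier and the set of URLs already seen. A hypothetical sketch,
with a file format and names entirely of my own invention (not anything from
W4):

```python
# Persist the crawl state so the wanderer can be killed and restarted:
# save the pending stack and the seen set, reload them on startup.
import json
import os

STATE_FILE = "w4.state"  # placeholder filename


def save_state(stack, seen):
    """Write the frontier and the seen set to disk."""
    with open(STATE_FILE, "w") as f:
        json.dump({"stack": stack, "seen": sorted(seen)}, f)


def load_state(start_url):
    """Resume from a saved checkpoint, or start fresh from start_url."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)
        return state["stack"], set(state["seen"])
    return [start_url], set()
```

Calling save_state periodically inside the crawl loop is enough to make the
whole run interruptible at any point.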

By a 'keyword' analysis, I mean looking through the document for words that
appear frequently but aren't normally common words. Additionally, titles
and things appearing in headers would be good candidates for keywords.
I'll try to get the current code at least clean enough that I'm willing to
let everyone in the world see it, but if you *really* want to see it now,
send me mail. Any other suggestions would be welcome.
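That kind of keyword analysis could be sketched as: count the words in a
document, drop the normally common ones, and keep the most frequent of what
remains. The stopword list and cutoff below are placeholders of mine, not
anything from W4:

```python
# Rough sketch of frequency-based keyword extraction: frequent words
# that are not on a common-word (stopword) list.
import re
from collections import Counter

COMMON_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that"}


def keywords(text, n=5):
    """Return up to n candidate keywords, most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in COMMON_WORDS)
    return [w for w, _ in counts.most_common(n)]
```

Words pulled from titles and headers could simply be given extra weight in
the same counter.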

Once this index is produced, it will be searchable via http, and I suppose
by WAIS, though I really detest the way WAIS restricts searches. In any case,
there is a possibility that this will be done by the end of the summer.

Matthew Gray
mkgray@athena.mit.edu