Re: What if we offered a local spider?

Martijn Koster (m.koster@nexor.co.uk)
Mon, 17 Oct 1994 09:17:38 +0100


Brian Behlendorf wrote:

> First off, I think there's a compromise between
> remote-indexer-hitting-HTTP-server-hard-indexing-files and
> local-indexer-running-through-file-system - and that would be
> local-indexer-hitting-HTTP-server-hard. If you don't want to
> skew your access logs then run a parallel server on another port -
> if you're worried about lag run the program in the middle of the night
> (or nights if your site is really big).

Absolutely.
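In practice, Brian's setup might look something like the sketch below
(a minimal illustration, assuming a Unix box, a second httpd config
listening on another port, and a hypothetical `indexer` program that
crawls over HTTP -- the file names, port, and `indexer` command are
all illustrative, not from the original mail):

```shell
# Run a parallel copy of the server on port 8080, using its own
# config and log files, so the indexer's requests never skew the
# main server's access log.
httpd -f /usr/local/etc/httpd/conf/httpd-indexing.conf

# Crontab entry: crawl the local server over HTTP at 3 a.m.,
# when interactive load is lowest.
# min hour dom mon dow  command
  0   3    *   *   *    /usr/local/bin/indexer http://localhost:8080/ >/var/log/indexer.log 2>&1
```

Because the crawl goes through HTTP rather than the file system, it
sees exactly what a remote robot would -- including dynamically
generated pages.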

> Difficult but I don't see an easy way around it - dynamic pages return
> different results depending on their input,

I'm not even worried about input-dependent dynamic pages; they can
probably be ignored. However, dynamic pages without input
(e.g. ISINDEX welcome pages) are quite often important, and quite
often don't change at all, or only minimally. But a file-system-based
bot misses those entirely. In the extreme case, a server with 30
way-cool ISINDEX services would be indexed as empty :-)

> And I say this knowing that very soon our site will become impossible for
> current robots to index.

> But, for example if the file-server-side robot knew it was
> sitting on top of an NCSA server it could know to look in the access.conf
> and for any .htaccess files anywhere up the file system. I could go into
> my problems associated with the fact that many server-side maintenance
> programs are designed for particular servers, but that's another thread
> somewhere else.

I agree completely; that's a waste of effort, and a source of problems.

> Right, it sounds like this is WAIS++, inverted. Not only should the
> index be generated locally, the engines should run locally as well. The
> Internet just doesn't like centralized anything. The only benefit of
> having multiple sites' indices on one machine would be search queries
> that return hits from more than one site - WebCrawler, in other words.
> What's needed is some way to distribute this search, so that search
> engines on all the different machines could return results with the user
> just asking once.

Yeah, but that's just a, ehr, non-trivial problem :-) I believe a
system of index-gatherers can work quite well...

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html