Re: No Nasty Robots! (long)

Paul Everitt (paul@cminds.com)
Thu, 13 Oct 1994 08:42:31 -0500 (CDT)


For a solution to this problem, look at Harvest:
http://rd.cs.colorado.edu/harvest/

Why does this help?

*Distributed Indexing

Collect all of your index information locally using a "gatherer", then
serve it on request to whoever compiles it into queryable indices. Thus,
you are only putting the _index_ over the wire, not the full text.
Moreover, you are only transmitting the *changes* to that index, and
these changes go in gzip format.
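
To make the "only the changes, gzipped" idea concrete, here is a rough
sketch in Python. It is not Harvest's actual wire format or code; the
function names (index_delta, pack_delta, apply_delta) and the JSON
encoding are assumptions made purely for illustration.

```python
import gzip
import json


def index_delta(old, new):
    """Compare two index snapshots (dicts keyed by URL).

    Returns the entries that were added or changed, plus the URLs
    that disappeared. Hypothetical helper, not a Harvest API.
    """
    changed = {
        url: summary
        for url, summary in new.items()
        if url not in old or old[url] != summary
    }
    removed = sorted(url for url in old if url not in new)
    return {"changed": changed, "removed": removed}


def pack_delta(delta):
    """Serialize a delta and gzip it before it goes over the wire."""
    return gzip.compress(json.dumps(delta, sort_keys=True).encode("utf-8"))


def apply_delta(index, packed):
    """Apply a packed delta to a copy of the receiver's index."""
    delta = json.loads(gzip.decompress(packed).decode("utf-8"))
    updated = dict(index)
    updated.update(delta["changed"])
    for url in delta["removed"]:
        updated.pop(url, None)
    return updated
```

The receiver never sees the documents themselves, only the summaries
that changed since the last snapshot.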

*Configurable Indices

Run a local "broker", which goes to various net "gatherers" and builds a
local index of the topics you want. Or, go to other brokers on the net
that build datasets that interest you.
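
As a toy illustration of the broker idea (again in Python, and again
not Harvest's real interface), the sketch below merges records from
several gatherers and keeps only the topics you asked for. The record
fields ("url", "topic", "summary") and both function names are
assumptions for the example.

```python
def build_broker_index(gatherers, wanted_topics):
    """Merge records from several gatherers into one local index.

    `gatherers` is a list of record lists; each record is a dict with
    hypothetical "url", "topic" and "summary" fields. Only records
    whose topic is in `wanted_topics` are kept.
    """
    index = {}
    for records in gatherers:
        for record in records:
            if record["topic"] in wanted_topics:
                index[record["url"]] = record
    return index


def query(index, keyword):
    """Return the URLs whose summary mentions the keyword."""
    keyword = keyword.lower()
    return sorted(
        url for url, record in index.items()
        if keyword in record["summary"].lower()
    )
```

Pointing the same function at a different set of gatherers, or at a
different topic list, gives you a different broker.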

*Customized Indices

By selecting the types of info you want (FAQs, HTML files, etc.) and using
filters for these types, you get __structured indices__ for pertinent
info! Much nicer than full-text indices. Moreover, you can write your own
summarizer for new types (IAFA templates, cc:Mail directory listings,
etc.). You can even "explode" tar files.
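
A minimal sketch of per-type summarizers, including "exploding" a tar
file, might look like the following Python. The summarizer functions,
the SUMMARIZERS table, and the choice of fields are all made up for
illustration; Harvest's own summarizers work differently.

```python
import io
import re
import tarfile


def summarize_html(data):
    """Pull the <title> out of an HTML document."""
    text = data.decode("utf-8", errors="replace")
    match = re.search(r"<title>(.*?)</title>", text, re.IGNORECASE | re.DOTALL)
    return {"type": "html", "title": match.group(1).strip() if match else ""}


def summarize_text(data):
    """Use the first non-blank line of a plain-text file."""
    text = data.decode("utf-8", errors="replace")
    first = next((line.strip() for line in text.splitlines() if line.strip()), "")
    return {"type": "text", "first_line": first}


# Hypothetical registry: file suffix -> summarizer. Add your own types here.
SUMMARIZERS = {".html": summarize_html, ".txt": summarize_text}


def summarize(name, data):
    """Return a list of structured records for one object."""
    if name.endswith(".tar"):
        return explode_tar(data)
    for suffix, summarizer in SUMMARIZERS.items():
        if name.endswith(suffix):
            return [dict(summarizer(data), name=name)]
    return []  # unknown type: nothing to index


def explode_tar(data):
    """Open a tar archive in memory and summarize each regular file."""
    records = []
    with tarfile.open(fileobj=io.BytesIO(data)) as archive:
        for member in archive.getmembers():
            if member.isfile():
                content = archive.extractfile(member).read()
                records.extend(summarize(member.name, content))
    return records
```

Each record carries fields specific to its type, which is what makes
the resulting index structured rather than a bag of words.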

*Other interesting subsystems

The Harvest project has other interesting technologies. For instance,
using its replication features, you could move the entire object into
Harvest (i.e. the full text, not just the index) and replicate it across
sites, with changes propagated back. There is also an Object Cache with
strong performance characteristics. Finally, there is the forthcoming
Harvest Object System, using object-oriented extensions to define types
and methods.

Sorry for the verbosity. The real point is that there is a mechanism to
dramatically lower the load from indexing, while dramatically raising the
functionality.

Disclaimer: I am not part of the Harvest Development Team, merely a
ludicrously-happy beta tester.

Paul Everitt V 703.785.7384 Email Paul.Everitt@cminds.com
Connecting Minds, Inc. F 703.785.7385 WWW http://www.cminds.com/