Re: Resource discovery, replication (WWW Announcements archives?)

Daniel W. Connolly (connolly@hal.com)
Wed, 04 May 1994 05:31:38 -0500


In message <9405041001.AA10860@hal.com>, Martijn Koster writes:
>
>Daniel W. Connolly writes:
>>
>> I dunno... Maybe I just need to play with/read about those systems
>> more. For some reason, they strike me as too centralized: there's the
>> per-site data, and there's the list of all sites. What's in-between?
>> How is the global list maintained? Suppose I want an index of
>> resources related to biochemistry: can I build one? (with my strategy,
>> I can filter the articles however I want and build custom indexes)
>
>In ALIWEB the global list is maintained by a form for subscription,
>and email for deletion (currently by hand). But, by having a
>"standard" location for the per-site info any robot can go around
>looking for that.

How? How does the robot know where to go and look? And does each
robot have to search the entire space? With USENET news distribution,
you only need to talk to one neighbor. And he talks to his neighbors,
and so on, and so on... The whole idea here is to quit doing this
N^2 thing where all clients talk to all servers (or all robots scan
all servers...) and start doing some N log(N) style things.
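
To make the neighbor-to-neighbor idea concrete, here is a rough sketch in Python of flooding one announcement through a network where each site only talks to a few peers. The topology (`ring_with_chords`) and all the names are made up for illustration; this is not how any real news transport is implemented. It only counts offers for a single article, so it shows the per-article cost growing with the number of links rather than with the number of site pairs.

```python
from collections import deque


def flood(neighbors, origin):
    """Flood one announcement from origin.

    Returns (set of sites reached, number of offers sent).
    """
    seen = {origin}
    queue = deque([origin])
    messages = 0
    while queue:
        site = queue.popleft()
        for peer in neighbors[site]:
            messages += 1            # site offers the article to peer
            if peer not in seen:     # a peer that already has it declines
                seen.add(peer)
                queue.append(peer)
    return seen, messages


def ring_with_chords(n):
    """Hypothetical topology: site i peers with i-2, i-1, i+1, i+2 (mod n)."""
    return {
        i: sorted({(i + d) % n for d in (-2, -1, 1, 2)} - {i})
        for i in range(n)
    }


n = 16
net = ring_with_chords(n)
reached, messages = flood(net, origin=0)

# Cost if every site contacted every other site directly instead.
all_pairs = n * (n - 1)
```

With 16 sites and four peers each, every site ends up with the article after one offer per link direction, which is well under the count for every site contacting every other site.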

>If you want an index of resources related to a particular topic you
>can either use ALIWEB simple or form-based search to do it for you,
>copy the data to your machine and do it locally, or even get the
>list of hosts and go and get the information yourself in a standard
>format.

Each of these is an all-or-nothing proposition: in the first
case, I have to locate an ALIWEB server with all the data in the
world on it (scalability test says: BZZZZT). Or I can copy
all the data to my machine (BZZZZT). Or I can get "the list of
hosts" (BZZZZT) and do it myself.

With my broadcast strategy, I just set up a process that gathers new
articles and expires old ones. Server sites wouldn't necessarily renew
their announcements at the same interval, but let's say the maximum
interval for 95% of the sites is one month. Then after two months,
my database reaches steady state. From then on, it maintains itself.
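
A minimal sketch of that gather-and-expire process, in Python. Everything here is an assumption for illustration: the class name `AnnouncementDB`, keeping one announcement per site, the example host names, and the 60-day cutoff (picked to match the two-month figure above). The `select` method stands in for "filter the articles however I want".

```python
DAY = 24 * 60 * 60  # seconds


class AnnouncementDB:
    """Keeps the newest announcement per site and drops stale ones."""

    def __init__(self, max_age=60 * DAY):
        self.max_age = max_age
        self.articles = {}  # site -> (posted_at, text)

    def gather(self, site, posted_at, text):
        """Store an incoming announcement unless we hold a newer one."""
        current = self.articles.get(site)
        if current is None or posted_at > current[0]:
            self.articles[site] = (posted_at, text)

    def expire(self, now):
        """Drop announcements older than max_age; return the dropped sites."""
        stale = sorted(
            site
            for site, (posted_at, _) in self.articles.items()
            if now - posted_at > self.max_age
        )
        for site in stale:
            del self.articles[site]
        return stale

    def select(self, predicate):
        """Build a custom index: sites whose announcement text matches."""
        return sorted(
            site
            for site, (_, text) in self.articles.items()
            if predicate(text)
        )


db = AnnouncementDB()
db.gather("a.example", 0, "biochemistry protein database")
db.gather("b.example", 0, "campus phone book")
db.gather("a.example", 30 * DAY, "biochemistry protein database, updated")

dropped = db.expire(now=70 * DAY)          # b.example never renewed
biochem = db.select(lambda text: "biochem" in text)
```

A site that keeps renewing stays in the database; one that stops renewing drops out on its own, with nobody maintaining a list by hand.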

And it's scalable: everybody has access to everything without anybody
having to do everything.

Dan