Re: The future of meta-indices/libraries?

John Franks (john@math.nwu.edu)
Tue, 15 Mar 1994 20:56:16 --100


According to Peter Deutsch:
>
> Actually, we do plan to add this capability to the archie
> server system in the very near future. For WWW there's the
> obvious problem of what to index, since there is no real
> useful meta-info in the URL itself (how many copies of
> "default.html" are there, anyways? :-) so at this point
> we'd be happy to be told what to collect and serve.
>
..
>
> FYI, we now have a WAIS index search engine internal to
> the system as well, so we can index and serve template
> oriented info as well. We're planning to use this to
> gather and serve IAFA templates (among other things) and
> we can use this for WWW, if the info available requires
> it. If the WWW community can agree on a template
> structure for documents this may be the best way to go.
>
..

> I hope so. The only question is what's the best stuff to
> gather and index for a first pass. For that we need to
> hear from the community, keeping in mind the tradeoff
> between disk space and info desired. Can you all define a
> simple template (or perhaps use one of the IAFA ones)? Is
> the HEAD info enough? Once we know that the rest should be
> fairly easy.
>

I think the WWW community should have addressed this long ago. This
is the main area in which we are well behind the gopher community.

In my opinion, one of the most important design criteria should be to
eliminate the need for indexers (of whom there will likely be many) to
walk the entire server tree. This can be annoying and, in the worst
cases, disruptive.

A second important criterion would be giving the maintainer control
over what is indexed.

I would argue for a very simple document to be provided to indexers
(or created on the fly for them). It should contain the following:

1. A creation date and optional expiration date.
2. A list of the titles of all documents on the server paired with
the corresponding URL.
3. An optional short list of keywords relevant to this server.

Perhaps there are other things which would be useful, but it is
primarily the titles one wants to index, and it is a good idea to keep
it simple. I think this should be served at a standard URL as a
document of type text/plain. There is no need to get mired in
changes/additions to any protocols.
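
To make the three items above concrete, here is a minimal sketch of what such a text/plain document might look like and how an indexer could read it back. The field names ("Created", "Expires", "Keywords"), the blank-line separator, and the tab-separated title/URL lines are all assumptions made for illustration; nothing in this message fixes a layout, and the function names are hypothetical.

```python
from datetime import date


def build_index(created, entries, expires=None, keywords=()):
    """Render the index as plain text.

    The header names and tab-separated layout are hypothetical;
    the proposal itself does not define a concrete format.
    """
    lines = ["Created: " + created.isoformat()]
    if expires is not None:
        lines.append("Expires: " + expires.isoformat())
    if keywords:
        lines.append("Keywords: " + ", ".join(keywords))
    lines.append("")  # blank line separates the header from the entries
    for title, url in entries:
        # Collapse whitespace so a title can never contain a tab or newline.
        lines.append(" ".join(title.split()) + "\t" + url)
    return "\n".join(lines) + "\n"


def parse_index(text):
    """Parse the layout produced by build_index back into a dict."""
    header, _, body = text.partition("\n\n")
    fields = {}
    for line in header.splitlines():
        name, _, value = line.partition(": ")
        fields[name] = value
    # Split on the last tab: the URL is always the final field.
    entries = [tuple(line.rsplit("\t", 1)) for line in body.splitlines() if line]
    expires = fields.get("Expires")
    keywords = fields.get("Keywords", "")
    return {
        "created": date.fromisoformat(fields["Created"]),
        "expires": date.fromisoformat(expires) if expires else None,
        "keywords": [k.strip() for k in keywords.split(",") if k.strip()],
        "entries": entries,
    }
```

An indexer would fetch this one document instead of walking the tree, and could use the expiration date to decide when to fetch it again.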

As a server writer I would implement this by having my server create
this document on the fly when it is first requested and then cache
it for later use until it expires. Subsequent requests would get
the cached version until its expiration after which a new version
would be created and cached. The maintainer would set the expiration
period and could mark any part (or all) of his tree as not to be
indexed. The cached file would also be extremely useful for features
local to the server: for example, a search of all titles on the server,
or WAIS searches which return a menu of *titles* of hits (WWWWais does
this now, but it must search each document corresponding to a hit to
extract its title). Of course, other implementations might work as
well.
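
The create-on-first-request, cache-until-expiry behaviour described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not any particular server's code: the class name CachedIndex, the list_documents callback, the prefix-based exclusion rule, and the tab-separated output are all hypothetical.

```python
import time


class CachedIndex:
    """Build the index on first request and reuse it until it expires."""

    def __init__(self, list_documents, ttl_seconds, excluded_prefixes=(),
                 clock=time.time):
        self.list_documents = list_documents  # callable -> (title, path) pairs
        self.ttl_seconds = ttl_seconds        # expiration period set by maintainer
        # Prefixes should end in "/" so "/private/" does not match "/private-ish".
        self.excluded_prefixes = tuple(excluded_prefixes)
        self.clock = clock                    # injectable for testing
        self._entries = None
        self._text = None
        self._expires_at = 0.0

    def _indexable(self, path):
        # Anything under an excluded prefix is marked "not to be indexed".
        return not any(path.startswith(p) for p in self.excluded_prefixes)

    def get(self):
        """Return the index text, rebuilding it only if missing or expired."""
        now = self.clock()
        if self._text is None or now >= self._expires_at:
            self._entries = [(title, path)
                             for title, path in self.list_documents()
                             if self._indexable(path)]
            self._text = "".join(f"{t}\t{p}\n" for t, p in self._entries)
            self._expires_at = now + self.ttl_seconds
        return self._text

    def search_titles(self, term):
        """Local title search over the cached entries (case-insensitive)."""
        self.get()  # make sure the cache is fresh
        needle = term.lower()
        return [(t, p) for t, p in self._entries if needle in t.lower()]
```

Because search_titles reads from the same cached entries, a local title search never has to open the documents themselves, which is the saving mentioned above for title menus.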

John Franks Dept of Math. Northwestern University
john@math.nwu.edu