> If we require people to do such things by hand it will suffer in
I agree that we shouldn't require this to be done by hand.
I did say that in the message you quoted.
> ... But one person's irrelevant is another person's useful
> data. We've had people mine the current archie collection
> just so they could study such things as the proportion of
> file types and other information about the data. We
> certainly didn't foresee such applications when we started
> so I'd rather not hard-wire in too many assumptions about
> what people might want or need at this point. We
> definitely want to stay flexible.
Of course. But we do need to make the information manageable,
otherwise we might as well give up and let full web robots
do it for us.
> > > > I'm not sure it is going to be sensible
> > > > to index all titles on a server and search those, even though it sounds
> > > > attractive. You do need to retain the context of the titles.
> In theory I agree, although in practice we may find that
> titles alone (machine generated and thus accurate) are
> more useful than full templates which are hand-generated
> and thus inaccurate. Filenames alone have proved of use
> in archie even though in theory descriptive information
> would be more useful.
In Archie you can use directories and tar files to give a context to
information. It is up to the information provider to decide what to
tar up into one file, and where to put it. If all files in FTP sites
were untarred in a single directory Archie would be unusable.
In the Web this is a bit more difficult. _Titles_ of web documents
aren't hierarchically structured like file paths. I suppose the
hierarchical nature of most URLs may help there, but it's dodgy. There
is no concept of grouping as in a tar file: you can't specify "Index
this document but nothing linked from it". This means that additional
info is needed even more than in Archie.
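
To make the URL-hierarchy idea concrete, here is a rough sketch in Python of using the directory part of a URL as a stand-in for context, so that two documents with the same title can still be told apart. The function name, the example.org URLs and the titles are all made up for illustration, and as noted above the heuristic is only as good as the site's URL layout.

```python
from collections import defaultdict
from urllib.parse import urlsplit
import posixpath


def group_titles_by_directory(pages):
    """Group (url, title) pairs by host and by the directory part of the URL path."""
    groups = defaultdict(list)
    for url, title in pages:
        parts = urlsplit(url)
        # Treat the path like a file path; a bare host maps to the root "/".
        directory = posixpath.dirname(parts.path) or "/"
        groups[(parts.netloc, directory)].append(title)
    return dict(groups)


# Hypothetical data: the same title appears in two different contexts.
pages = [
    ("http://example.org/docs/aliweb/intro.html", "Introduction"),
    ("http://example.org/docs/aliweb/faq.html", "FAQ"),
    ("http://example.org/people/index.html", "Introduction"),
]
groups = group_titles_by_directory(pages)
```

Here the two "Introduction" titles end up under different keys, which is the kind of context a flat list of titles loses. It does nothing for the "index this but nothing linked from it" problem.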
> > > The bottom line choice is between an index of 50 servers with
> > > carefully hand-crafted templates and an index of 5000 servers with
> > > machine generated templates which are less well constructed but up to
> > > date. I would certainly opt for the latter.
> Can we aim for both?
> I see wanting both a cheap, relatively useful set of
> machine-generated and accurate titles plus more descriptive info
> where available.
As I said in the previous message, I am completely happy to leave the
decision of what and how to index up to the person providing the
information, as long as it's done in a standard and parseable way.
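
As an illustration of what "standard and parseable" could mean in practice, here is a minimal Python sketch of a reader for template records. It assumes an IAFA-style layout: "Field: value" lines, continuation lines indented with whitespace, and a blank line between records. The field names and values in the sample are hypothetical, not taken from any real index file.

```python
def parse_templates(text):
    """Parse blank-line-separated 'Field: value' records into a list of dicts."""
    records, current, last_field = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record, if there is one.
            if current:
                records.append(current)
            current, last_field = {}, None
        elif line[0] in " \t" and last_field is not None:
            # An indented line continues the previous field's value.
            current[last_field] += " " + line.strip()
        elif ":" in line:
            field, value = line.split(":", 1)
            last_field = field.strip()
            current[last_field] = value.strip()
        else:
            raise ValueError("unparseable line: %r" % line)
    if current:
        records.append(current)
    return records


# Hypothetical sample records (assumed field names).
SAMPLE = """\
Template-Type: DOCUMENT
Title: ALIWEB overview
URI: /aliweb/index.html
Description: An introduction to
  the index format.

Template-Type: DOCUMENT
Title: Archie
URI: /archie.html
"""
records = parse_templates(SAMPLE)
```

The point of the strict `ValueError` is that anything a provider writes is either accepted as a record or rejected outright, whichever fields they choose to fill in.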
> why we added the WAIS indexing capability and worked on
> the IAFA template stuff. All the components are now in
> place to do more detailed descriptive info once we figure
> out where it is.
ALIWEB index files should (really) be on http://.../site.idx
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR Ltd@cn=Martijn Koster