Last-modified date & indexing

Nick Arnett (narnett@verity.com)
Thu, 10 Nov 1994 09:50:05 -0800


I think I should know the answer, but where is an HTTP server supposed to
get the HTTP last-modified date? Is it from the file system (which
produces different results on various OSes), from the HTML header, or...?
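
To make the file-system option concrete, here is a minimal Python sketch.
It assumes the server simply reports the file's modification time as the
last-modified date; the function name last_modified_header is hypothetical,
not taken from any particular server.

```python
import os
from email.utils import formatdate

def last_modified_header(path):
    """Return an HTTP-style date string built from the file's mtime.

    Assumption: the file system's modification time is what the server
    would report; an HTML header would need a different lookup.
    """
    mtime = os.stat(path).st_mtime
    # usegmt=True produces the "GMT" suffix used in HTTP dates.
    return formatdate(mtime, usegmt=True)
```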

As we're building our indexing tools, we're trying to figure out how to
trigger index updates efficiently, which is why I'm asking.

Eventually, I'd like our indexer to be able to ask the server "What's
changed since date XX/XX/XX" and get back either of two things:

* A list of the changed documents that are eligible for indexing
("eligible" as defined in "/robots.txt")

* An indication of whether it's going to be more efficient to do an
incremental update by GETting individual documents or just GET a locally
pre-built index from the server.
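
The first item could be sketched roughly as follows, assuming the documents
live under one directory tree and that file modification times stand in for
the last-modified date. The name changed_since, its parameters, and the
"indexer" agent name are all hypothetical.

```python
import os
from urllib.robotparser import RobotFileParser

def changed_since(doc_root, since, robots_lines, agent="indexer"):
    """List URL paths under doc_root modified after `since` (epoch seconds)
    that the given robots.txt lines allow `agent` to fetch."""
    rules = RobotFileParser()
    rules.parse(robots_lines)  # robots_lines: the lines of /robots.txt
    changed = []
    for dirpath, _dirs, files in os.walk(doc_root):
        for name in files:
            full = os.path.join(dirpath, name)
            if os.stat(full).st_mtime <= since:
                continue  # unchanged since the last indexing run
            rel = os.path.relpath(full, doc_root).replace(os.sep, "/")
            url_path = "/" + rel
            if rules.can_fetch(agent, url_path):
                changed.append(url_path)
    return sorted(changed)
```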

I'm implying something here that I should make clear. We'll have an
indexer that will build a local index and save it into a compressed archive
file in the root of the HTTP server under a standard name. Our remote
indexing tool will be able to retrieve that file and hand it off to our
search server. Obviously, as others have concluded, this will be a far
better use of network resources than the usual spider behavior. (Our
indexes are usually less than half the size of the text they index, they
compress very well, and GETting one archived file is just one transaction
v. many.)

Updating the index raises an efficiency question. If only a few documents
have changed, it may be more efficient to GET just those documents and
update the index incrementally (v. GETting the whole archived index).

With a summary of last-modified dates, we can make an intelligent
decision about when to get individual documents v. when to get the whole
archived index. Further, we can avoid recursing through the documents to
see what's changed.
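
As a rough illustration of that decision, here is a sketch that compares
estimated bytes transferred. The function name, the byte-count cost model,
and the per_request_overhead parameter are my assumptions; a real tool might
weigh transaction counts or server load differently.

```python
def cheaper_to_fetch_documents(changed_sizes, index_size,
                               per_request_overhead=0):
    """Return True when GETting each changed document is estimated to
    move fewer bytes than GETting the whole archived index.

    changed_sizes: sizes in bytes of the documents that changed.
    index_size: size in bytes of the compressed, pre-built index.
    per_request_overhead: assumed fixed cost in bytes per GET.
    """
    incremental_cost = (sum(changed_sizes)
                        + per_request_overhead * len(changed_sizes))
    return incremental_cost < index_size
```

With no changed documents the incremental cost is zero, so the sketch
always prefers the incremental path in that case.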

Thoughts? Answer to the first question about last-modified date?

Nick