Re: Searchable Web info (was Finding CGI spec...)

Nick Arnett (narnett@verity.com)
Mon, 9 Jan 1995 18:39:10 +0100


At 5:41 PM 1/3/95, web@sowebo.charm.net wrote:
>Nick Arnett wrote:
>>[...]
>> > I did a search on "cgi" and got back a doc with a name I didn't
>> > recognise. Now although I have several hundreds of HTML files,
>> > like my children, I know most of them by name :*) I think you got
>> > the href from a file that has a Base tag pointing to another server.
>>
>> Our spider doesn't follow links to servers other than the one where it
>> starts (we trigger each index for each server individually). Documents
>> from other servers would have come from distinct indexing sessions.
>>
>> Having said that, I'm not sure exactly what you're describing here. Can
>> you describe it a bit more?
>>
> OK: as I'm not sure what you're not sure of, pls excuse if I
> explain the obvious :*). Relative URLs are normally understood
> to be relative to the directory the file is in. But the Base tag
> can make the URL be relative to any other directory - and on any
> other server. In the particular instance I had noticed, the file
> was in fact adapted from the TOC of Ian Graham's HTML tutorial;
> I didn't want to move all the sub files over so I just made the
> Base tag point to the original TOC - not on my server. So if the
> spider finds a reference in this file to "server-cgi-bin.html"
> it should realise I don't actually *have* that file - it's where
> Base says it is, i.e. some other server, in this case. If it doesn't
> want to go on sidetrips to other sites I guess it's just going to
> have to ignore relative URLs in files having Bases pointing to other
> servers.

Got it! Now I understand what happened. The spider doesn't know about
Base, so it resolves the relative URL against the server it is indexing
(yours) and tries to retrieve the file from there, but that server
doesn't have it.

I'll pass this on to the developers so that they can modify the spider
code appropriately.
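
For illustration only, here is a minimal sketch of the kind of check
described above. It is written in Python, not taken from the spider's
actual code, and every name in it (resolve_link, same_server,
should_follow, and the example.org / example.net hosts) is made up for
the example. It assumes the spider has already extracted the Base href
from the page, if there is one.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical URLs standing in for the page being indexed and the
# Base href found inside it.
PAGE = "http://example.org/html/toc.html"
BASE = "http://tutorial.example.net/docs/htmlindex.html"


def resolve_link(page_url, href, base_href=None):
    """Resolve href against the page's Base if it has one,
    otherwise against the page's own URL."""
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, href)


def same_server(url_a, url_b):
    """True when both URLs name the same host (and port)."""
    return urlsplit(url_a).netloc.lower() == urlsplit(url_b).netloc.lower()


def should_follow(page_url, href, base_href=None):
    """Follow a link only if, after honouring Base, it still points
    at the server the indexing session started on."""
    target = resolve_link(page_url, href, base_href)
    return same_server(page_url, target)
```

With these definitions, should_follow(PAGE, "server-cgi-bin.html")
is true because the link resolves to the page's own server, while
should_follow(PAGE, "server-cgi-bin.html", BASE) is false because the
Base moves it to another host, so a single-server spider would skip it.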

Thanks!

Nick