Re: Identifying scripts by file extension?

Daniel W. Connolly (connolly@hal.com)
Tue, 15 Feb 1994 12:12:03 --100


In article <2jfcv4$c54@hal.com> nholtz@civeng.carleton.ca (Neal Holtz) writes:

I think this was beat to death a month or two ago, but ...

Currently, the only accepted way to designate a script gateway in a
URL is to use some magic prefix such as "/cgi-bin/myscript/" to the
path. Unfortunately, this co-exists very poorly with the rules
for forming full URL's from relative ones (see below). What is the
general opinion on using the file extension to identify gateways or
scripts?

..

If the type information is encoded in the front portions of path
names, and if the rules for generating full URL's from relative ones
call for replacement of the rear portions of the path name, then it
is impossible to use a relative URL to link to a node of a different
type (because the replacement rules wouldn't allow replacement of the
type information).

Sorry... I haven't been around for a while... I find this interesting.
I barked a lot _long_ ago about the fact that the URL spec said both
(1) a url is of the form scheme:string
where string is opaque, and
(2) a url is of the form scheme://host/dir/dir/dir/file
or just /dir/dir/dir/file
or ../dir/file
or ../dir/file#id
or just #id
blah blah blah

The grammar in the URL spec is highly ambiguous. For example, how
does one parse the following?

news:lkjlsdf#lksjdf@hal.com

The question arises: which part of a URL is opaque, and which parts
does the client get to peek into? If the clients are going to peek
into URL's (for example to resolve <A HREF="../foo/bar.html"> into a
global URL), then the servers can't use arbitrary strings as paths in
URL's. We've already seen news id's and WAIS doc-id's clash with
relative URL's.
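
To make the ambiguity in the news: example concrete, here is a minimal
Python sketch of the two readings. The function names are hypothetical
and chosen only for illustration; neither comes from the URL spec.

```python
def parse_opaque(url):
    """Reading (1): scheme:string, where the string is opaque."""
    scheme, _, rest = url.partition(":")
    return {"scheme": scheme, "string": rest}


def parse_with_fragment(url):
    """Reading (2): the client peeks inside and splits off a '#id' part."""
    scheme, _, rest = url.partition(":")
    body, _, fragment = rest.partition("#")
    return {"scheme": scheme, "string": body, "id": fragment}


url = "news:lkjlsdf#lksjdf@hal.com"
print(parse_opaque(url))         # the whole message id survives intact
print(parse_with_fragment(url))  # the message id is cut at the '#'
```

Under the second reading the client would ask the news server for an
article id that does not exist, which is the clash described above.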

I will again assert that we should use the SGML parser to do
whatever parsing is going to be done on the client side, and make the
results opaque to the client, thereby allowing the server to use _any_
string it wants to encode info. We should also _allow_ a link to
contain content-type information. (How else do I link to a postscript
file on an ftp archive? By file extension? Come on!)

I'm pretty sure we can become HyTime compliant while we're at it. Consider:

instead of:
See <A HREF="#z123">the para below</a> for more.
use:
See <A linkend=z123>the para below</a> for more.

instead of:
See <A HREF="foo.html">the foo section</a> for more.
use:
<httploc id="home" host="host.domain" path="/dir1/dir2/file">
<relloc id=rel1 locsrc="home" path="foo.html">
See <A linkend=rel1>the foo section</a> for more.

instead of:
See <A HREF="ftp://host/dir/file.tex">fred's thesis</a> for more.
use:
<ftploc id=ftp1 host="info.cern.ch" dir="/host/dir" file="file.tex"
content-type="text/x-latex">
See <A linkend="ftp1">fred's thesis</a> for more.

The fact that relative HREFs are so widely used justifies support for
the feature. But I think you should be able to stick a tree of HTML
documents, gif files, postscript documents, etc. on an FTP server and
have it work just as well as putting them on an HTTP server. Also, you
should be able to copy those HTML documents to a local disk and use
them there.

This means we need an interoperable way to combine a relative link with
a global link to form a new global link. At first, it seems this
should be done on the server side so that the link strings can stay
opaque and the client can stay dumb.

But that doesn't work:
* if you want to use FTP, or
* if you want to be able to move the documents around
without changing them, or
* if you want to serve the same files up via HTTP, gopher,
and FTP at the same time without filtering them.

Then perhaps we should just once and for all agree that a URL includes
a path that is a list of names, where the syntax of names is the
intersection of the POSIX portable filename syntax and the SGML token
syntax. (Yuk, but...) URL's can alternatively contain a "selector
string" that is opaque and does not combine with relative locations.

So a client can reliably resolve:

<ftploc id="ftp1" host="think.com" path="pub WAIS src readme.html">
<relloc locsrc="ftp1" path=".. doc xyz.gif">

into:
<ftploc id="ftp1" host="think.com" path="pub WAIS doc xyz.gif">

but the following is an error:

<waisloc id=wais1 host="think.com" doc-id="12l3kjl2k3jlk3jlj">
<relloc locsrc="wais1" path=".. foo.html">
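
The two cases above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: locations are modelled
as plain dicts, the path is a space-separated list of names as in the
examples, and resolve() is a hypothetical helper, not part of any
existing client.

```python
def resolve(base, rel):
    """Combine a base location with a relative path of space-separated names."""
    if "path" not in base:
        # An opaque selector (e.g. a WAIS doc-id) has no structure to combine with.
        raise ValueError("cannot resolve a relative path against an opaque location")
    names = base["path"].split()[:-1]  # drop the file name, keep the directories
    for name in rel.split():
        if name == "..":
            if not names:
                raise ValueError("relative path climbs above the root")
            names.pop()
        else:
            names.append(name)
    return dict(base, path=" ".join(names))


ftp1 = {"scheme": "ftp", "host": "think.com", "path": "pub WAIS src readme.html"}
wais1 = {"scheme": "wais", "host": "think.com", "doc-id": "12l3kjl2k3jlk3jlj"}

print(resolve(ftp1, ".. doc xyz.gif")["path"])  # pub WAIS doc xyz.gif

try:
    resolve(wais1, ".. foo.html")
except ValueError as err:
    print("error:", err)
```

The client never has to guess: a location either carries a path it may
take apart, or a selector it must leave alone.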

I suppose all this could be done with punctuation instead of using
SGML syntax...

traditional style (servers _must_ be careful with syntax):

scheme://host:9999/dir/dir/dir/file.ext#anchor
where scheme =~ /^[a-zA-Z][a-zA-Z0-9]*$/
host =~ m-^[^:/]+$-
dir,file =~ m-^[^/]+$-
ext =~ m-^[^/\.#]+$-
anchor =~ m-^[^/\.#]+$-

"opaque selector" style:
scheme://host:9999|selector
where scheme =~ /^[a-zA-Z][a-zA-Z0-9]*$/
host =~ m-^[^:/]+$-
and selector is _completely_ opaque to the client. The context must
be able to allow a URL to be ANY string.
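
As a rough sketch of how a client might tell the two styles apart, here
are the patterns above translated into Python. parse_url() is a
hypothetical name. Two details are assumptions beyond the patterns as
written: the host pattern also excludes '|' so that it cannot swallow
the selector delimiter, and the anchor is split off at the first '#' in
the path.

```python
import re

SCHEME = r"[a-zA-Z][a-zA-Z0-9]*"
# scheme://host[:port] followed by '|' (opaque style) or '/' (traditional style)
AUTHORITY = re.compile(
    rf"^({SCHEME})://([^:/|]+)(?::([0-9]+))?([|/])(.*)$", re.DOTALL
)
ANCHOR = re.compile(r"^[^/.#]+$")


def parse_url(url):
    m = AUTHORITY.match(url)
    if m is None:
        raise ValueError("not in either style: %r" % url)
    scheme, host, port, delim, rest = m.groups()
    loc = {"scheme": scheme, "host": host, "port": port}
    if delim == "|":
        # Opaque style: hand the selector back untouched.
        loc["selector"] = rest
        return loc
    # Traditional style: the client may look inside the path.
    path, sep, anchor = rest.partition("#")
    names = path.split("/")
    if not all(names):
        raise ValueError("empty name in path")
    if sep and not ANCHOR.match(anchor):
        raise ValueError("bad anchor")
    loc["path"] = names
    loc["anchor"] = anchor if sep else None
    return loc


print(parse_url("scheme://host:9999/dir/dir/dir/file.ext#anchor"))
print(parse_url("wais://think.com:210|12l3kjl2k3jlk3jlj/../x#y"))
```

Note that the selector in the second call contains '/', '..' and '#',
and none of them is interpreted.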

Once you look at how messy the punctuation strategies get in general,
SGML syntax is as good as anything. And since we're already
implementing an SGML parser, why implement a _separate_ URL parser? In
other contexts, we can use other syntaxes to represent URL's, e.g.:

(httploc :host "www.hal.com" :path "/a/b/c.ext")
or
Content-Type: message/external-body; access-type="http";
    site="www.hal.com";
    path="/a/b/c.ext"
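
To illustrate that these are just different spellings of the same
structured location, here is a small Python sketch. The HttpLoc class
and its method names are hypothetical, and the sketch assumes the field
values contain no quote characters, so it does no escaping.

```python
from dataclasses import dataclass


@dataclass
class HttpLoc:
    host: str
    path: str

    def as_sgml(self, ident):
        # SGML locator element, as in the httploc examples earlier
        return f'<httploc id="{ident}" host="{self.host}" path="{self.path}">'

    def as_sexp(self):
        # Lisp-style keyword list
        return f'(httploc :host "{self.host}" :path "{self.path}")'

    def as_mime(self):
        # MIME external-body header, folded over three lines
        return (
            'Content-Type: message/external-body; access-type="http";\n'
            f'    site="{self.host}";\n'
            f'    path="{self.path}"'
        )


loc = HttpLoc(host="www.hal.com", path="/a/b/c.ext")
print(loc.as_sgml("home"))
print(loc.as_sexp())
print(loc.as_mime())
```

Whatever the surface syntax, the fields a client is allowed to inspect
are the same, which is the part that needs standardizing.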

But it's VERY important that we standardize on which parts of a URL
are opaque, and which are not. The current strategy is breaking down.

Dan