Re: URL decisions in Seattle, & changes

Daniel W. Connolly (connolly@hal.com)
Thu, 31 Mar 1994 01:03:03 --100


In message <9403301700.AA00480@ptpc00.cern.ch>, Tim Berners-Lee writes:

[Lots of stuff that, in my view, contradicts current practice. Details
below...]

My perspective on this is that of (1) a commercial implementor trying
to build a product that can claim to be "conforming to the URL spec,"
and (2) a formal-systems purist trying to be sure we've got a
well-defined specification.

The URL specification is getting hopelessly watered down. Everybody
wants to just take their favorite addressing scheme, slap a few
letters and a colon on the front, and call it a URL.

For example, the current working requirements document

<http://www.acl.lanl.gov/URI/archive/uri-archive.messages/1063.html>

is almost completely devoid of content.

If I specify that a URL is "any sequence of letters, followed by a
':' and the characters 'FRED'. Access is defined by choosing a number
between 90 and 2000. If the result is greater than the current year,
access is successful," then I have satisfied all the URL requirements:

5.1 Locators are transient.

The probability with which a given Internet resource locator
leads to successful access decreases over time.

Check.

5.2 Locators have global scope.

The name space of resource locators includes the entire world.
The outcome of a client access attempt using an Internet
locator depends in no way, modulo resource availability, on
the geographical or Internet location of the client.

Check.

5.3 Locators are parsable.

Internet locators can be broken down into complete constituent
parts sufficient for interpreters (software or human) to attempt
access if desired.

Check. All you have to do is see "xyz:FRED" and that tells you enough
to attempt access.

5.4 Locators can be readily distinguished from naming and
descriptive identifiers that may occupy the same name space.

Check. If it says "xyz:FRED," it's a locator. Otherwise, it's not.

5.5 Machines can readily identify locators as such.

Check. The regexp /[a-zA-Z]+:FRED/ will do it (see the sketch after
5.9, below).

5.6 Locators are "transport-friendly".

Internet locators can be transmitted from user to user (e.g.,
via e-mail) across Internet standard communications protocols
without loss or corruption of information.

Check.

5.7 Locators are human transcribable.

Check.

5.8 An Internet locator consists of a service and an opaque
parameter package.

Check. The parameter package is "FRED". The service is named by the
letters.

5.9 The set of services is extensible.

Check. Just substitute different letters. The draft says nothing about
the features available from these services.
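
Just to show how little 5.5 demands, here's the sketch promised
above: a few lines of C (mine, not from the draft -- the scheme and
the candidate strings are made up) that fully satisfy "machines can
readily identify locators" for my toy scheme:

  /* Toy sketch: requirement 5.5 "satisfied" by one POSIX regexp. */
  #include <regex.h>
  #include <stdio.h>

  int main(void)
  {
      regex_t re;
      const char *candidates[] = { "xyz:FRED", "abc:FRED", "not one" };
      int i;

      /* Anchored form of /[a-zA-Z]+:FRED/ from 5.5 above. */
      regcomp(&re, "^[a-zA-Z]+:FRED$", REG_EXTENDED | REG_NOSUB);

      for (i = 0; i < 3; i++)
          printf("%-12s -> %s\n", candidates[i],
                 regexec(&re, candidates[i], 0, NULL, 0) == 0
                     ? "locator" : "not a locator");

      regfree(&re);
      return 0;
  }

That's the whole story on machine identification; the requirements
say nothing about what to do with a locator once you've spotted one.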

Yes, I'm being somewhat pedantic, and I can see that the approach is
to bite off a little of the problem (Internet Information
Architecture) at a time. But the result of the current "we don't know
how it's gonna work yet, so we won't make any constraints" attitude is
a lot of hot air, if you ask me.

In contrast to the above requirements, why don't we look at the
already-deployed applications and then extrapolate to the future?
Let's look at specific scenarios and be sure we've satisfied their
requirements. For example:

* The Campus-Wide Information Service scenario. From this
we realized that allowing aggregation of information spread
over various administrative domains -- not to mention physical
machines -- is a requirement.

* Corporate promotion/support scenario. These folks want
snazzy images and stuff. They want their online collateral
material to enhance their image.

* The Software distribution and support scenario, i.e.
"Here's the 1.2 release of the quartz package. For more
info, see <http://ftp.host.com/quartz/info.html>". We need support
for mirroring archives.

* Online Documentation. Support for man pages is a requirement.
The various hacks involved in putting Info trees online show
some needed features.

* The FAQ distribution scenario: From this we learn about
expiration dates as a mechanism for versioning. There's a
lot more to learn/do here.

* Serving newsgroup archives through WAIS. Here we learn
that while full-text search and relevance feedback are
great tools, we'd like to do SQLish things (e.g. select article
where author="fred" and date>"Jan 1 1994") too.

* Online technical reports and journals. PostScript support
is a requirement. But plain-text abstracts are too. And
searching a database of abstracts with hypertext pointers to
PostScript is nearly optimal.

* Collaborating on WWW specifications. This is pretty painful.
It shows a need for something like the NCSA annotation server.

Before we decide on an addressing standard, we need to establish a
little more context. To me, it's not clear that we need a universal
naming _syntax_. There are lots of uses for a global namespace, but
the syntax isn't important.

I think the basic starting point in all this is a model of
communication: take information, express it as a sequence of bits,
send the bits to somebody else, and extract the information again.

But you can't do that without conventions about what the bits
represent. MIME provides a namespace of content types to fill this
need (except for compression and encryption... Hmmm). Then, once you
know how to interpret the bits, you can (1) present the bits in some
physical form, and/or (2) get at the navigational components like
addresses, names, citations, and structure.

The idea that everything has to be expressible as ASCII text is so
backwards! Surely the widespread deployment of applications like word
processors and spreadsheets shows that the computer should be employed
as a tool to construct information products. And I'm not talking about
emacs html-mode. I'm talking about direct manipulation interfaces.

Let's get off URL syntax and onto features like drag-n-drop and
paste-link between internet information resources.

Anyway... back to the matter at hand...

>The gopher string wants to use the "?" and "/" characters (which
>in WWW have convention syntax role) within a string without any
>special role. This was upheld, so slashes and ? may be put into
>a URL without being encoded. The characters are still in the
>reserved set, but in Gopher, / and %2F have the same effect.

So we're ruling out the possibility of having collections of HTML
files with relative links served through gopher. Oh well... the only
material difference between gopher and HTTP 0.9 is the TCP port number
anyway.
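
To see why, look at the generic "cut the base at its last slash and
append the relative name" step that relative links depend on. A rough
sketch (mine; the gopher selector below is invented for illustration):

  #include <stdio.h>
  #include <string.h>

  /* Generic relative resolution: keep the base up through its last
     '/', then append the relative name.  This is only meaningful
     when '/' marks hierarchy. */
  static void resolve(const char *base, const char *rel, char *out)
  {
      const char *slash = strrchr(base, '/');
      int keep = slash ? (int)(slash - base) + 1 : 0;

      memcpy(out, base, keep);
      strcpy(out + keep, rel);
  }

  int main(void)
  {
      char out[256];

      resolve("http://host.com/quartz/info.html", "pic.gif", out);
      printf("%s\n", out);  /* http://host.com/quartz/pic.gif */

      /* The same step on a gopher selector whose slashes are opaque
         bytes: the "directory" boundary it cuts at means nothing. */
      resolve("gopher://host.com/0opaque/selector/string", "pic.gif", out);
      printf("%s\n", out);  /* gopher://host.com/0opaque/selector/pic.gif */

      return 0;
  }

The second result looks plausible but is garbage: if '/' has no
special role in gopher, there was never a "directory" to resolve
against.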

This completely destroys the concept of a comprehensive URI syntax.
The grammar must include all the special parsing features of each
scheme, and it must be explicitly amended with each new scheme.

In future schemes, will '/' and '%2F' mean the same thing or different
things? I gather that the answer is "it depends." This rules out the
idea of having one algorithm for reducing a URI to canonical form. So
the question of whether

x-my-scheme:abc/def
and
x-my-scheme:abc%2fdef

can occupy the same slot in a cache isn't specified. Bummer.
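
For instance, the obvious canonicalization pass -- decode every %XX
escape before comparing -- is now right for some schemes and wrong
for others. A sketch (mine, purely illustrative):

  #include <ctype.h>
  #include <stdio.h>
  #include <string.h>

  /* One candidate cache key: decode every %XX escape in the URL. */
  static void decode_all(const char *url, char *out)
  {
      static const char hex[] = "0123456789abcdef";

      while (*url) {
          if (url[0] == '%' && isxdigit((unsigned char)url[1])
                            && isxdigit((unsigned char)url[2])) {
              int hi = strchr(hex, tolower((unsigned char)url[1])) - hex;
              int lo = strchr(hex, tolower((unsigned char)url[2])) - hex;

              *out++ = (char)(hi * 16 + lo);
              url += 3;
          } else
              *out++ = *url++;
      }
      *out = '\0';
  }

  int main(void)
  {
      char a[64], b[64];

      decode_all("x-my-scheme:abc/def",   a);
      decode_all("x-my-scheme:abc%2fdef", b);

      /* Same key -> same cache slot.  Correct if the scheme treats
         '/' and '%2f' alike (gopher), wrong if it doesn't. */
      printf("%s\n", strcmp(a, b) == 0 ? "same slot" : "different slots");
      return 0;
  }

Since the answer to "same or different?" now depends on the scheme,
no single decode-and-compare pass can serve as the cache key.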

It also means that one can't build "dumb clients" that count on proxy
gateways to deal with all the various protocols. Suppose I'm viewing

x-new-scheme://host.com/database/cover.html

and it contains the <ISINDEX> tag. How do I construct a query for
"abc def"? I can't unless I know the rules for x-new-scheme.

> I'd also like Dan Connolly's input on the
>grammar -- in which places is it ambiguous in its current form, Dan?

This is an improvement... but there are still problems. (The user
and password parts can contain '@', so the login production is
ambiguous.) I'm going to have to convert it to lex/yacc to find all
the problems. Why not use that format permanently instead of this
"BNF-like" informalism?
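
To illustrate the login ambiguity: do you split at the first '@' or
the last one? Both are defensible readings. A throwaway sketch
(simplified to user@host; the address is invented):

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      const char *login = "alice@wonder@land.com";  /* user + host */
      const char *first = strchr(login, '@');
      const char *last  = strrchr(login, '@');

      printf("first-'@' split: user=\"%.*s\" host=\"%s\"\n",
             (int)(first - login), login, first + 1);
      printf("last-'@' split:  user=\"%.*s\" host=\"%s\"\n",
             (int)(last - login), login, last + 1);
      return 0;
  }

One grammar, two parses; that's the sort of thing lex/yacc will flush
out mechanically.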

Dan