Re: partial URLs ? (was <p> ... </p>)

Arjun Ray (aray@pipeline.com)
Wed, 20 Dec 1995 22:41:38 -0500 (EST)


On Wed, 20 Dec 1995, John Franks wrote:

> According to Arjun Ray:
> >
> > On Wed, 20 Dec 1995, Daniel W. Connolly wrote:
> >
> > > http://www.foo.com/a/b/../gifs/btnhome3.gifs
> > >
> > > (which is _not_ a well-formed HTTP url) and send:
> > >
> > > GET /a/b/../gifs/btnhome3.gif HTTP/1.0
> > >
> > > This is illegal because it is a potential security risk.
> >
> > I think this is illegal simply because it's not a well-formed URL. The
> > question, then, is what the server should do about it.
>
> As I recall the draft RFC for URL's specifies that certain characters
> (like space) are forbidden, certain (like '?') have special meaning
> and otherwise the "path" part of a URL is an opaque string (which, in
> particular, may have nothing to do with a path). Neither '/' nor '.'
> are forbidden or have special meaning. They do have special meaning
> *for some implementations* and no special meaning for others.
> Likewise the colon may have special meaning for some implementations
> and not for others.
>
> The fact that certain strings may represent security risks for
> some implementations does not automatically make them illegal.
> I don't believe that "/../" is forbidden in HTTP URL's. If
> I am wrong I would be interested in a reference.

I confess that I've been relying on memory more than I should have. I was
going on impressions that I had gathered in early '94, when what seemed
like The Final Word(tm) was the latest draft of TBL's URI spec -- now RFC
1630. Here are some excerpts from the RFC version that sorta ring bells:

---8<---
[Page 5]

PATH

The rest of the URI follows the colon in a format depending on the
scheme. The path is interpreted in a manner dependent on the
protocol being used. However, when it contains slashes, these
must imply a hierarchical structure.

[Page 6]

HIERARCHICAL FORMS

The slash ("/", ASCII 2F hex) character is reserved for the
delimiting of substrings whose relationship is hierarchical. This
enables partial forms of the URI. Substrings consisting of single
or double dots ("." or "..") are similarly reserved.

[Page 8-9]

Partial (relative) form

Within a object whose URI is well defined, the URI of another object
may be given in abbreviated form, where parts of the two URIs are the
same. This allows objects within a group to refer to each other
without requiring the space for a complete reference [...] It must be
emphasized that when a reference is passed in anything other than a
well controlled context, the full form must always be used.

In the World-Wide Web applications, the context URI is that of the
document or object containing a reference.
[...]
The partial form relies on a property of the URI syntax that certain
characters ("/") and certain path elements ("..", ".") have a
significance reserved for representing a hierarchical space, and must
be recognized as such by both clients and servers.
[...]
The rules for the use of a partial name relative to the URI of the
context are:
[ on grafting the partial onto the base and then ]
Within the result, all occurrences of "xxx/../" or "/." are
recursively removed, where xxx, ".." and "." are complete path
elements.
---8<---

I understood this to mean that for HTTP urls, ".." and "." were
"reserved" for hierarchy-related specifications of *partial* urls, which
when resolved in context would be removed to form a complete url, which
in turn was what I've always understood by "absolute path" in the HTTP spec.

As others have pointed out, the BNF doesn't forbid ".." in a url, but I
read that as a matter of lexical legitimacy. Before it can be a
"semantically" valid url, however, it has to go through the "recursively
removed" step quoted above.
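
To make the "recursively removed" rule concrete, here is a minimal sketch
of one reading of it, in Python. The function name remove_dot_segments is
my own, not anything from RFC 1630, and I'm assuming the partial form has
already been grafted onto the base so that the input is a full path. A
".." with no preceding "xxx" to cancel against is left in place, since the
quoted text only speaks of removing "xxx/../".

```python
def remove_dot_segments(path):
    """Collapse "." and "xxx/.." path elements, per one reading of RFC 1630.

    Only complete path elements are touched: "b..c" or ".gif" are left alone.
    """
    out = []
    for seg in path.split("/"):
        if seg == ".":
            # A lone "." element names the current level; drop it.
            continue
        if seg == "..":
            if out and out[-1] not in ("", ".."):
                # Cancel the preceding "xxx" element against this "..".
                out.pop()
            else:
                # No "xxx" to cancel against; keep the ".." as it stands.
                out.append(seg)
            continue
        out.append(seg)
    return "/".join(out)


print(remove_dot_segments("/a/b/../gifs/btnhome3.gif"))
```

On the path from Dan's example this prints /a/gifs/btnhome3.gif, which is
what I would have expected a client to send. Something like
/../etc/passwd comes back unchanged, so a server would still see the "..".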

To add more confusion, the language suggests that both clients and servers
should grok this. Except, how is the server supposed to know the "context
URI" that allowed the partial forms? Back then, I concluded (unwarrantedly,
as it now appears) that this normalization was an implied client-side
requirement deducible from the other parts of various specs considered in
their interaction.

And more: the later documents -- RFC 1738, RFC 1808, revisions of the
HTTP spec -- are much more circumspect about the (rather unabashed)
UN*X-isms in RFC 1630. The notion of "Reserved for hierarchical semantics"
has indeed vanished. Sigh.

> It would, of course, be quite reasonable for the HTTP spec to have
> a UNIX-centric warning to implementors that they should make this
> string illegal for their implementation (or risk the consequences).

Agreed.
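
For what it's worth, the check such a warning would ask for is small.
Below is a rough sketch in Python, assuming the server sees the raw path
from the request line; the name has_dotdot_element is hypothetical, and
escaped forms such as %2E%2E are outside what this sketch covers.

```python
def has_dotdot_element(request_path):
    """Return True if ".." appears as a complete path element."""
    # Splitting on "/" means "b..c" or "..gif" do not count as "..".
    return ".." in request_path.split("/")


for p in ("/a/b/../gifs/btnhome3.gif", "/a/gifs/btnhome3.gif", "/a/b..c/x"):
    print(p, has_dotdot_element(p))
```

An implementation for which ".." is dangerous could refuse any request
where this returns True, and leave everything else to be treated as an
opaque string.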

Regards,

Arjun