URL's and SGML document

Gavin Nicol (gtn@ebt.com)
Sun, 29 May 1994 16:20:21 +0500

Here is the document I mentioned earlier. I would appreciate any feedback
anyone might have on this, but you should all realise that I regard it
as something of a kluge, even if a workable one.

There must be something better than URL's...

<TITLE>URL's and structured documents</TITLE>

<SECTION> <TITLE>Introduction</TITLE> <PARA> Recently, the World Wide Web project has gained a great deal of momentum. The World Wide Web proposes to tie together all the information sources available on the Internet, and has had a great deal of success by providing a single hypertext interface to such services as Usenet, FTP, and mail. </PARA> <PARA> One of the crucial elements of the World Wide Web is the URL, or Uniform Resource Locator. Currently, URL's appear much like a Unix filename, with extensions for deciding the type of service to be used, the port number for the server, and other such parameters. While URL's are in wide use, they do suffer from a number of problems, including object uniqueness and equality problems (CORBA is currently facing similar issues). An IETF group is working to overcome these problems, but one problem that remains is that an object is always retrieved in its entirety (except in searches), and that there is no way to take advantage of the inherent structure in a correct SGML document. </PARA> </SECTION>

<SECTION> <TITLE>The URL syntax</TITLE> <PARA> The generic URL has the following structure: <VERBATIM> scheme:path[?search|#fragment_ID] </VERBATIM> where scheme names the service to use (ftp, wais, http, etc.), path specifies the location of the document, the optional search parameter specifies a list of keywords to search for, and the optional fragment_ID identifies a fragment of the document. In theory, this provides a single, simple naming scheme, but in practice, almost all of the different servers use a slightly different syntax. </PARA> </SECTION>
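As a rough illustration of the generic structure above, the following Python sketch splits a URL into its four parts. The names GenericURL and split_generic_url are hypothetical and not part of any URL specification. The sketch is also more lenient than the grammar shown, since it accepts a search part and a fragment ID together.

```python
from typing import NamedTuple, Optional


class GenericURL(NamedTuple):
    scheme: str
    path: str
    search: Optional[str]    # text after '?', or None if absent
    fragment: Optional[str]  # text after '#', or None if absent


def split_generic_url(url: str) -> GenericURL:
    """Split a URL of the form scheme:path[?search|#fragment_ID]."""
    scheme, colon, rest = url.partition(":")
    if not colon or not scheme:
        raise ValueError(f"no scheme in {url!r}")
    # Peel off the fragment first, so that a '?' inside the fragment
    # is not mistaken for the start of a search part.
    rest, hash_sep, fragment = rest.partition("#")
    path, query_sep, search = rest.partition("?")
    return GenericURL(
        scheme,
        path,
        search if query_sep else None,
        fragment if hash_sep else None,
    )
```

Note that the path returned for an http URL still carries the leading "//host" portion, because the generic syntax does not distinguish the host from the rest of the path.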

<SECTION> <TITLE>The URL Path Extensions</TITLE> <PARA> This document specifies extensions to the generic URL which can be used in conjunction with SGML document servers to provide a much finer level of control over what is to be retrieved. The extensions have, as far as possible, been designed to be compatible with the concepts in both the URL and HTTP RFC's. </PARA>

<PARA> The key concept is that an SGML document can be represented as a tree of nodes, in much the same way that files and directories correspond to nodes in the tree of the file system. As such, we can map elements into an extended Unix path to create something like the following: <VERBATIM> http://ebt.com/collection/book/chap=5/para=2 </VERBATIM> where the extension syntax consists of an element GI and a specifier saying which one of possibly multiple elements to choose. The equals sign here is arbitrary. Any character that cannot be used in an element GI in the Reference Concrete Syntax may be used (which agrees with the TEI). In addition to element GI's, the following keywords should be supported:

<TEXTLIST> <TI>toc</TI><TT>Table of contents. If a specifier follows, it is the name of the TOC to use.</TT> <TI>max-bytes</TI><TT>Specify the maximum number of bytes to transfer. If the size of the requested element exceeds this, generate a TOC as a guide to a more specific search. If the requested element is a graphic, scaling might be used, or a small icon attached to a hyperlink with a higher max-bytes value could be sent. </TT> <TI>username</TI><TT>Specify the name of the user.</TT> <TI>passwd</TI><TT>Specify the password to be used. The password is not encoded.</TT> </TEXTLIST> </PARA>
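To make the tree mapping concrete, here is a minimal Python sketch that resolves an extended path against a parsed document. It rests on several assumptions that the text does not settle: XML parsed with xml.etree.ElementTree stands in for SGML, a numeric specifier is read as a 1-based position among same-named children, and a missing specifier means the first such child. The function name select and the sample document are hypothetical, and keywords such as toc are not handled.

```python
import xml.etree.ElementTree as ET


def select(root: ET.Element, extended_path: str) -> ET.Element:
    """Walk down from root, one path member (GI=position) at a time."""
    node = root
    for member in extended_path.split("/"):
        if not member:
            continue  # ignore empty members from leading or doubled slashes
        gi, _, spec = member.partition("=")
        position = int(spec) if spec else 1  # assumption: 1-based, default 1
        matches = [child for child in node if child.tag == gi]
        if not 1 <= position <= len(matches):
            raise LookupError(f"no element {member!r} under <{node.tag}>")
        node = matches[position - 1]
    return node


# A tiny stand-in document: two chapters, the second with two paragraphs.
BOOK = ET.fromstring(
    "<book>"
    "<chap><para>one</para></chap>"
    "<chap><para>two</para><para>three</para></chap>"
    "</book>"
)
```

With this sample, select(BOOK, "chap=2/para=2") returns the paragraph containing "three", which mirrors how chap=5/para=2 would pick one paragraph out of a full book.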

<PARA> The grammar for the path extensions can be specified as:

<VERBATIM>
extended_path_member ::= member_name optional_specifier
member_name          ::= SGML_GI | '!' keyword
keyword              ::= 'toc' | 'max-bytes' | 'username' | 'passwd'
optional_specifier   ::= empty | '=' specifier_list
specifier_list       ::= specifier | specifier ',' specifier_list
specifier            ::= string | number
number               ::= [0-9]* | [0-9]* '.' [0-9]*
string               ::= '"' character_constant '"'
</VERBATIM>
where the character constant rules follow the rules for ANSI C. </PARA> </SECTION>

The group of recognised keywords is currently very small. As time goes by, it will be expanded to include ideas from the TEI and other groups. It is also expected that the specifier data format will be expanded over time to allow for selection based upon attribute values. Finally, the choice of the bang character as a keyword marker is arbitrary, and may need to be reconsidered. A sharp sign might be more suitable, but might confuse Mosaic.
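The grammar for a single extended path member can be exercised with a small parser. The Python sketch below is one possible reading, not a reference implementation: the names Member and parse_member are hypothetical, SGML_GI is not validated beyond being non-empty, string escapes are left undecoded rather than interpreted by the ANSI C rules, and a number must contain at least one digit (the grammar as written would also admit an empty number).

```python
import re
from typing import List, NamedTuple, Union

KEYWORDS = {"toc", "max-bytes", "username", "passwd"}

# One specifier: a double-quoted string (backslash escapes allowed)
# or a number with at least one digit.
_SPECIFIER = re.compile(r'"((?:[^"\\]|\\.)*)"|(\d+(?:\.\d*)?|\.\d+)')


class Member(NamedTuple):
    name: str
    is_keyword: bool
    specifiers: List[Union[str, int, float]]


def parse_member(text: str) -> Member:
    """Parse one extended_path_member, e.g. 'chap=5' or '!toc="figures"'."""
    name, eq, rest = text.partition("=")
    is_keyword = name.startswith("!")
    if is_keyword:
        name = name[1:]
        if name not in KEYWORDS:
            raise ValueError(f"unknown keyword {name!r}")
    if not name:
        raise ValueError(f"missing member name in {text!r}")
    specifiers: List[Union[str, int, float]] = []
    pos = 0
    while eq:  # only entered when an '=' was present
        match = _SPECIFIER.match(rest, pos)
        if match is None:
            raise ValueError(f"bad specifier in {text!r}")
        if match.group(1) is not None:
            specifiers.append(match.group(1))  # string, escapes not decoded
        else:
            number = match.group(2)
            specifiers.append(float(number) if "." in number else int(number))
        pos = match.end()
        if pos == len(rest):
            break
        if rest[pos] != ",":
            raise ValueError(f"expected ',' in {text!r}")
        pos += 1
    return Member(name, is_keyword, specifiers)
```

A full path would be handled by splitting on the slash character and calling parse_member on each piece, provided that quoted strings are not allowed to contain a slash, which the grammar leaves open.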

<SECTION> <TITLE>Other extensions</TITLE>

<PARA> The following are a few extensions, or notes on how the extensions mentioned above will work with the normal HTTP URL's.

<ENUM> <ITEM>In HTTP, anything following the # character is interpreted as an identifier for a fragment of a document. A set of keywords will be used here to specify document fragments. Currently, the supported keywords will be the same as those in the extended URL path space grammar, with the addition of the following three:

<TEXTLIST> <TI>ebt-search</TI><TT>For using the EBT query language in searches</TT> <TI>element</TI><TT>An arbitrary element number</TT> <TI>node</TI><TT>The internal node identifier</TT> </TEXTLIST>


<ITEM> The normal HTTP search mechanism (the '?' character followed by a '+' separated list of keywords) will be supported. </ITEM>

<ITEM> In addition to the normal HTTP search mechanisms, extended searches using the EBT query language will be supported via text following the fragment ID delimiter. Such searches would appear like the following:

<VERBATIM> http://collection/book/section#!ebt-search="foo within bar" </VERBATIM>


<ITEM> There will be some pathological cases where the above extensions can be used to create an illegal URL. In such cases, the server should move back up the path until a legal URL is created. For example:

<VERBATIM> <a href="http://ebt.com/collection/book/!toc/!toc/!toc/list=2"> </VERBATIM>

is basically meaningless. This should be reduced to:

<VERBATIM> <a href="http://ebt.com/collection/book/!toc"> </VERBATIM>

which is legal, but perhaps not what the user wanted. If a legal URL cannot be found, an error should be returned. </ITEM>

<ITEM> By supporting multiple named TOC's, it is possible to easily create alternate views of the document. For example, it should be possible to have lists of figures etc. </ITEM> </ENUM> </PARA> </SECTION> </BOOK>
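The fallback rule for illegal URLs described in the list above (move back up the path until a legal URL is found, otherwise return an error) might be sketched in Python as follows. What counts as legal is left to the server, so the check is passed in as a function. The toy rule used here, that nothing may follow a !toc member, is purely an assumption chosen to reproduce the example, and the names reduce_to_legal and toy_is_legal are hypothetical.

```python
from typing import Callable, Optional


def reduce_to_legal(path: str, is_legal: Callable[[str], bool]) -> Optional[str]:
    """Drop trailing path members until is_legal accepts the result.

    Returns None when no prefix is legal, in which case the server
    would report an error.
    """
    members = [m for m in path.split("/") if m]
    while members:
        candidate = "/" + "/".join(members)
        if is_legal(candidate):
            return candidate
        members.pop()  # move one step back up the path
    return None


def toy_is_legal(path: str) -> bool:
    """Toy rule (an assumption): a !toc member must be the last member."""
    members = path.strip("/").split("/")
    return all(not m.startswith("!toc") for m in members[:-1])
```

Applied to the path /collection/book/!toc/!toc/!toc/list=2 with the toy rule, the function discards list=2 and the two redundant !toc members in turn and settles on /collection/book/!toc.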