How many parsers does it take...?

Daniel W. Connolly (connolly@hal.com)
Fri, 25 Feb 1994 12:41:36 --100


.. to build a WWW client?

* one for SGML (well, enough SGML to parse HTML...)
* one for MIME (well, headers anyway... there doesn't seem to be any
support for MIME multipart body stuff.)
* one for each kind of URL that you support (i.e., one for file:,
one for gopher:, one for wais:, ...)
* one for each authorization method you support (basic, pubkey, ...)

Now, my question is: why?

I've done enough arguing against SGML syntax to find out that there is
a sincere commitment to SGML as an interchange format in the WWW
community. Nuff said.

The motivation for MIME comes from the successful and pervasive
deployment of the internet mail and news applications. Much of the
data sent around the net uses RFC822 format, and MIME is the accepted
way to put multipart/multimedia stuff in RFC822 format.

Now what motivates separate parsers for each kind of URL and access
method? Why is libWWW riddled with blurbs of code that copy data
from a structure to a string (being careful to escape all the right
chars!) and pass the string to another routine that parses the data
back out of the string (unescaping...) into another structure? I don't
see sufficient motivation for this strategy.

Contrast it with the WAIS strategy of adopting the Common Lisp
print/read strategy: there's one printer, one reader, one supported
internal structure, and other structures can be supported without all
sorts of parsing and escaping.
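
To make the "one printer, one reader, one internal structure" idea
concrete, here is a minimal C++ sketch. It handles only a tiny subset of
Lisp syntax (atoms, quoted strings, lists), and the names Node,
print_node and read_node are invented for illustration; they are not
part of WAIS, libWWW, or Common Lisp.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// The single internal structure: an atom, a quoted string, or a list.
struct Node {
    enum Kind { Atom, String, List } kind = Atom;
    std::string text;         // payload for Atom and String
    std::vector<Node> items;  // children for List
};

// The one printer: strings are quoted and escaped here and nowhere else.
std::string print_node(const Node& n) {
    if (n.kind == Node::Atom) return n.text;
    if (n.kind == Node::String) {
        std::string out = "\"";
        for (char c : n.text) {
            if (c == '"' || c == '\\') out += '\\';
            out += c;
        }
        return out + "\"";
    }
    std::string out = "(";
    for (std::size_t i = 0; i < n.items.size(); ++i) {
        if (i > 0) out += ' ';
        out += print_node(n.items[i]);
    }
    return out + ")";
}

void skip_space(const std::string& s, std::size_t& pos) {
    while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos]))) ++pos;
}

// The one reader: parses one form starting at pos and advances pos past it.
Node read_node(const std::string& s, std::size_t& pos) {
    skip_space(s, pos);
    assert(pos < s.size() && "unexpected end of input");
    Node n;
    if (s[pos] == '(') {
        n.kind = Node::List;
        ++pos;
        for (;;) {
            skip_space(s, pos);
            assert(pos < s.size() && "unterminated list");
            if (s[pos] == ')') { ++pos; break; }
            n.items.push_back(read_node(s, pos));
        }
    } else if (s[pos] == '"') {
        n.kind = Node::String;
        ++pos;
        while (pos < s.size() && s[pos] != '"') {
            if (s[pos] == '\\' && pos + 1 < s.size()) ++pos;  // drop the escape
            n.text += s[pos++];
        }
        assert(pos < s.size() && "unterminated string");
        ++pos;  // consume the closing quote
    } else {
        n.kind = Node::Atom;
        while (pos < s.size() && !std::isspace(static_cast<unsigned char>(s[pos])) &&
               s[pos] != '(' && s[pos] != ')')
            n.text += s[pos++];
    }
    return n;
}
```

Any new structure, a URL scheme for instance, is then just another
list shape passed through the same two routines, so no new escaping
rules are needed.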

I've long argued against the current URL syntax in favor of using the
SGML parser, but SGML is unnecessarily verbose for the task. I think
the Common Lisp syntax is just right.

Consider the introduction of a new URL scheme "alternative" used to
point a client to several copies of a resource and allow the client to
choose the "closest," e.g.:

alternative:escaped-url1,escaped-url2,escaped-url3,...

We're faced with inventing a new syntax and a whole new set of parsers
(one for libWWW, one for perl, one for elisp...) for this scheme. On
the other hand, suppose we used Common Lisp syntax:

(:alternative url1 url2 url3)

for example:

(:alternative (:local-file "/austin2/users/connolly/home.html"
                           "austin2.hal.com")
              (:http "austin2.hal.com:8001"
                     "/~connolly/home.html"))

We could also do things like support canonical forms and alternate
forms so that ftp://info.cern.ch/pub/www/src/foo.html could be
written in any of the following ways:

(:ftp (:site "info.cern.ch") (:dir "/pub/www/src") (:name "foo.html"))
(:ftp "foo.html" "/pub/www/src" "info.cern.ch")
(:ftp "/pub/www/src/foo.html" "info.cern.ch")
(:ftp (:path "pub" "www" "src" "foo.html") "info.cern.ch")

The nice thing about supporting Common Lisp style printing/parsing is
that the structure represents a superset of MIME and SGML
printing/parsing!

One could clearly implement the MIME parser as a special kind of Lisp
parser which, on seeing:

From: "Daniel W. Connolly" <connolly@hal.com>
To: www-talk@info.cern.ch
Subject: example
Content-Type: multipart/mixed; boundary="cut-here"

--cut-here
Content-Type: text/html

<html> ... </html>

--cut-here
Content-Type: image/gif
Content-Transfer-Encoding: base64

234k23j4oij234lkj234lkj

--cut-here--

would return the same thing the Common Lisp parser would return on
seeing:

(:part
  (:head
    (From (:mbox "connolly@hal.com" "Daniel W. Connolly"))
    (To (:mbox "www-talk@info.cern.ch"))
    (Subject "example")
    (Content-Type (multipart mixed (:boundary "cut-here")))
  )
  (:body
    (:part
      (:head
        (Content-Type (text html))
      )
      "<html> ... </html>"
    )
    (:part
      (:head
        (Content-Type (image gif))
        (Content-Transfer-Encoding base64)
      )
      (:any 10034 "10034 decoded bytes from 234k23j4oij234lkj234lkj")
    )
  )
)
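
As a rough illustration of the header half of that mapping, the sketch
below turns an RFC822-style header block into a (:head ...) form. It is
deliberately simplified: every value is kept as one flat quoted string
rather than being broken down into (:mbox ...) or (text html) as in the
example above, and folded continuation lines are not handled. The
function names are made up for this sketch.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Strip leading and trailing spaces, tabs, and carriage returns.
std::string trim_ws(const std::string& s) {
    const char* ws = " \t\r";
    std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Quote a header value as a Lisp-style string, escaping '"' and '\'.
std::string quote_value(const std::string& s) {
    std::string out = "\"";
    for (char c : s) {
        if (c == '"' || c == '\\') out += '\\';
        out += c;
    }
    return out + "\"";
}

// Read "Name: value" lines up to the first blank line and emit
// (:head (Name "value") ...). Lines without a colon are skipped.
std::string head_to_sexpr(const std::string& message) {
    std::istringstream in(message);
    std::string line;
    std::string out = "(:head";
    while (std::getline(in, line)) {
        if (trim_ws(line).empty()) break;  // a blank line ends the header
        std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;
        out += " (" + trim_ws(line.substr(0, colon)) + " " +
               quote_value(trim_ws(line.substr(colon + 1))) + ")";
    }
    return out + ")";
}
```

Given a message that starts with "Subject: example" and
"Content-Type: text/html", this yields
(:head (Subject "example") (Content-Type "text/html")).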

And the SGML parser would act on:

<HTML><HEAD><BASE HREF="http://host/dir/file.html">
<TITLE>Example</TITLE>
</HEAD>
<BODY>Example of &lt;SGML&gt; stuff</BODY>
</HTML>

just as the Lisp parser would act on:

(HTML ()
  (HEAD ()
    (BASE ((HREF (:http (:host "host")
                        (:path "dir" "file.html")))
          ))
    (TITLE () "Example") )
  (BODY ()
    "Example of " "<" "SGML" ">" " stuff")
)

The libWWW code is already migrating from the style of:

char *tmp = escape_url_in_attr(url);
char *tmp2 = escape_text_in_cdata(text);
sprintf(buffer, "<A HREF=\"%s\">%s</A>", tmp, tmp2);
free(tmp); free(tmp2);
SGML_parse(HText, buffer);

to the style of:

StartTag(HText, "A",
         "HREF", url_string,
         NULL);
Data(HText, text);
EndTag(HText, "A");
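
The point of the second style is that the caller hands over structure
and never escapes anything itself. A small C++ sketch of a sink in that
style follows; MarkupWriter and its methods are hypothetical names, not
libWWW's actual interface. This particular sink serializes to text, so
it escapes in exactly one place. A sink that builds an in-memory
structure instead would not need to escape at all.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

class MarkupWriter {
public:
    using Attrs = std::vector<std::pair<std::string, std::string>>;

    // Emit <NAME attr="value" ...>, escaping each attribute value.
    void start_tag(const std::string& name, const Attrs& attrs = {}) {
        assert(!name.empty());
        out_ += "<" + name;
        for (const auto& a : attrs)
            out_ += " " + a.first + "=\"" + escape(a.second) + "\"";
        out_ += ">";
    }

    // Emit character data, escaped.
    void data(const std::string& text) { out_ += escape(text); }

    // Emit </NAME>.
    void end_tag(const std::string& name) { out_ += "</" + name + ">"; }

    const std::string& str() const { return out_; }

private:
    // The only place in the program that knows the escaping rules.
    static std::string escape(const std::string& s) {
        std::string r;
        for (char c : s) {
            switch (c) {
            case '&': r += "&amp;"; break;
            case '<': r += "&lt;"; break;
            case '>': r += "&gt;"; break;
            case '"': r += "&quot;"; break;
            default:  r += c;
            }
        }
        return r;
    }

    std::string out_;
};
```

Calling start_tag("A", {{"HREF", url}}), data(text), end_tag("A")
replaces the escape/sprintf/parse round trip shown earlier.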

The current primitives:

obj.startTag(tag_name, attrs...);
obj.data(string);
obj.entity(name);
obj.endTag(tag_name);

aren't bad, but they're awkward for things like MIME and WAIS WSRC
files. I suggest the new base class:

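#include <stddef.h>  /* size_t */

class Atom;  // not defined in this message; forward-declared here

class LispStructured {
public:
    void atom(const Atom* atom);
    void string(const char *null_terminated_string);
    void bytes(size_t length, unsigned char *length_bytes);
    void start(const Atom* tag);  // open a list headed by tag
    void end();                   // close the innermost open list
};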
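
To show how these five calls could cover MIME-like data as well as
markup, here is a minimal sketch of one possible implementation that
renders the calls as Lisp text. Several things are assumptions made for
illustration only: the Atom struct (the type is left undefined above),
the class name LispPrinter, the str() accessor, and the choice to print
raw bytes as (:any LENGTH "hex digits").

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the Atom type, which is left undefined above.
struct Atom {
    std::string name;
};

// Same five calls as the proposed LispStructured, rendered as Lisp text.
class LispPrinter {
public:
    void atom(const Atom* a) {
        assert(a != nullptr);
        item(a->name);
    }
    void string(const char* s) { item(quoted(s)); }
    void bytes(std::size_t length, const unsigned char* data) {
        static const char hex[] = "0123456789abcdef";
        std::string digits;
        for (std::size_t i = 0; i < length; ++i) {
            digits += hex[data[i] >> 4];
            digits += hex[data[i] & 0x0f];
        }
        item("(:any " + std::to_string(length) + " " + quoted(digits) + ")");
    }
    void start(const Atom* tag) {
        assert(tag != nullptr);
        item("(" + tag->name);
    }
    void end() { out_ += ')'; }

    const std::string& str() const { return out_; }

private:
    // Append one item, separated from whatever came before by one space.
    void item(const std::string& text) {
        if (!out_.empty()) out_ += ' ';
        out_ += text;
    }
    // Quote a string, escaping '"' and '\'.
    static std::string quoted(const std::string& s) {
        std::string out = "\"";
        for (char c : s) {
            if (c == '"' || c == '\\') out += '\\';
            out += c;
        }
        return out + "\"";
    }

    std::string out_;
};
```

With atoms named :part and Subject, the call sequence start, start,
string("example"), end, end produces (:part (Subject "example")).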