A thought on implementation...

Daniel W. Connolly (connolly@hal.com)
Thu, 17 Feb 1994 00:58:58 --100


It occurs to me that it is unjustifiably difficult to do things to
HTML documents like:

* list all the URLs in a node
* list all the H1,H2,H3s in a node
* find the title of a node

correctly, because of bleed between the regular, context-free, and
context-sensitive idioms of SGML. For example:

<XMP>this: <A HREF="abc"> looks like a link, but it's
not because <XMP> is an RCDATA element, and STAGO is not
recognized in RCDATA</XMP>

<!-- this: <A HREF="abc"> looks like a link too! -->

And this: a < b > c has no markup at all, even though it
uses the "magic" < and > chars.

<A HREF='<A HREF="weird, but possible">'>I bet this would
break most contemporary implementations!</a>
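
To make the failure concrete, here is a minimal sketch of the kind of
naive scan I mean, run against the first two examples above. The
helper name naive_hrefs is made up for illustration; it is not taken
from any existing client.

```perl
use strict;
use warnings;

# Hypothetical helper: a naive scan that treats every '<A HREF=...'
# it sees as a real anchor, ignoring SGML context entirely.
sub naive_hrefs {
    my ($text) = @_;
    my @found;
    while ($text =~ /<A\s+HREF\s*=\s*["']([^"'>]*)["']/gi) {
        push @found, $1;
    }
    return @found;
}

my @samples = (
    '<XMP>this: <A HREF="abc"> looks like a link</XMP>',
    '<!-- this: <A HREF="abc"> looks like a link too! -->',
);

# Neither sample contains a real link, yet the scan reports one in each.
foreach my $sample (@samples) {
    my @hits = naive_hrefs($sample);
    print scalar(@hits), " false hit(s) in: $sample\n";
}
```

Both samples yield one reported HREF ("abc"), although a conforming
SGML parser would report none.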

Suppose we decide to standardize on two things:

(1) a DTD in the strictest sense of SGML compliance
(or better yet, a set of architectural forms...)
that defines HTML in a somewhat abstract sense in terms
of elements, and character data (and entities?)

(2) a context-free interchange language which is a subset
of the SGML syntax.

You could use the DTD if you have real SGML tools and you want to
use minimization, comments, and < chars as data.

But for interchange within the WWW application, we'd agree that, for
example, the < character is _always_ markup, and we'd use &#60; for
the data character '<'.

Here are (at least some of) the rules we'd adopt over and above SGML:

* No <!-- --> comments
* No <![foo[ .. ]]> marked sections
* Always use numeric character references for '<', '>', and '&'
(no harm in using &lt;, &gt;, &amp; forms, I suppose)
* Use numeric character references for ", \n, \t inside attribute
value literals
* Always quote attribute value literals with double quotes, not single
quotes.
* Don't split attribute values across lines (Hmmm...)
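
On the producing side, the rules above might be applied roughly like
this. It is only a sketch: encode_data and encode_attr are
hypothetical names, and I am assuming an ASCII-compatible character
set, so that '<' is character number 60, and so on.

```perl
use strict;
use warnings;

# Hypothetical helper: encode data characters for element content.
# '<', '>' and '&' become numeric character references.
sub encode_data {
    my ($text) = @_;
    $text =~ s/([<>&])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# Hypothetical helper: encode an attribute value and wrap it in
# double quotes. Also escapes '"', newline and tab, so the literal
# never contains a raw quote and never spans lines.
sub encode_attr {
    my ($value) = @_;
    $value =~ s/([<>&"\n\t])/'&#' . ord($1) . ';'/ge;
    return '"' . $value . '"';
}

print encode_data('a < b > c'), "\n";          # a &#60; b &#62; c
print '<A HREF=', encode_attr('x"y'), ">\n";   # <A HREF="x&#34;y">
```

Each character is replaced in a single pass, so the '&' that starts a
generated reference is never escaped a second time.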

Then the "search for HREF's" problem could be coded in ~20 lines of perl:

    while(<>){                           # read a line
        while(/<[^>]*$/){                # line looks like ...<TAG
            $more = <>;                  # with no >... read another line
            last unless defined($more);  # stop at end of input
            $_ .= $more;
        }
        while(s/<(\w+)([^>]*)>//){       # find (and consume) a start tag
            local($gi) = $1;
            local($attrs) = $2;
            $gi =~ tr/a-z/A-Z/;          # convert to upper-case
            if($gi eq 'A'){
                # for each attr...
                while($attrs =~ s/^\s*(\w+)\s*=\s*"([^"]*)"\s*//){
                    local($name, $val) = ($1, $2);
                    $name =~ tr/a-z/A-Z/;    # attribute names are case-insensitive
                    print "HREF: $val\n" if $name eq 'HREF';
                }
            }
        }
    }

I can almost guarantee that those 20 lines of perl are already in use
as a heuristic solution to that problem now (I looked at the elisp
code for the Emacs w3 client, and believe me: it's all over the place).

It's clear to me that folks are going to write HTML parsers based on
intuition and experience with context-free languages. Learning exactly
how SGML parsing works is sufficiently difficult that folks won't do it.

Instead of declaring that perl code to be busted, why don't we agree
the SGML folks didn't know much about automata theory and tighten up
the definition of HTML a little?

Dan