| It occurs to me that it is unjustifiably difficult to do things to
| HTML documents like:
| * list all the URL's in a node
| * list all the H1,H2,H3s in a node
| * find the title of a node
I have no trouble doing this. Unless you mean by node, "more than
one HTML instance." As the sequel shows, you mean "without using
| correctly because of bleed between the regular, context free, and
| context sensitive idioms of SGML. For example:
| <XMP>this: <A HREF="abc"> looks like a link, but it's
| not because <XMP> is an RCDATA element, and STAGO is not
| recognized in RCDATA</XMP>
Off the point, I'll bet Mosaic sees it as a link. But presumably
the author put this in an XMP because it's part of the example, not
a real link. The doc should make clear that you can't put a
link in an XMP. Of course the </a> tag, which you've omitted
to show, *is* detected.
| <!-- this: <A HREF="abc"> looks like a link too! -->
How so? It's in a comment, and so will be ignored by a parser.
| And this: a < b > c has no markup at all, even though it
| uses the "magic" < and > chars.
But not in the magic combinations <[A-Za-z] etc.
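To make the "magic combinations" point concrete, here is a rough sketch in Python (not code from this thread) of delimiter-in-context recognition. The pattern and the name `markup_offsets` are illustrative assumptions. It is a simplification of the SGML rules: it tracks no state, so it does not know whether it is inside a comment, CDATA, or RCDATA.

```python
import re

# Simplified delimiter-in-context test: "<" starts markup only when followed by
#   a letter              (start tag, STAGO)
#   "/" and a letter      (end tag, ETAGO)
#   "!" then letter, "--", "[" or ">"  (markup declaration or comment)
#   "?"                   (processing instruction)
# Anything else, such as "< " or "<3", is plain data.
MARKUP_START = re.compile(r"<(?:[A-Za-z]|/[A-Za-z]|!(?:[A-Za-z]|--|\[|>)|\?)")

def markup_offsets(text):
    """Return the offsets at which a '<' would be recognized as markup."""
    return [m.start() for m in MARKUP_START.finditer(text)]
```

On the string `a < b > c` this finds nothing, which is why that line carries no markup at all.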
| <A HREF='<A HREF="wierd, but possible">'>I bet this would
| break most contemporary implementations!</a>
But it's valid SGML according to my local version of the HTML DTD.
| Suppose we decide to standardize on two things:
| (1) a DTD in the strictest sense of SGML compliance
| (or better yet, a set of architectural forms...)
| that defines HTML in a somewhat abstract sense in terms
| of elements, and character data (and entities?)
That is, a proper DTD that parses.
| (2) a context-free interchange language which is a subset
| of the SGML syntax.
Your argument so far does not indicate a need for this. You have
simply remarked that some conventions of SGML are not what you'd
like them to be. They're not what I'd like them to be, either,
but SGML is where we are, so far as document markup goes, today.
| You could use the DTD if you have real SGML tools and you want to
| use minimization, comments, and < chars as data.
| But for interchange within the WWW application, we'd agree that, for
| example, the < character is _always_ markup, and we'd use &#60; for
| the data character '<'.
Dan, HTML is defined as an SGML DTD. If that's to continue to be so,
you can't apply these restrictions---unless you want to write a
crippled SGML parser that complains about free-floating < > etc.
Furthermore, there is no reason at all to
use &#60; for <, and it is a weakness of the present DTD that
it doesn't use the standard ISO pub and num entity sets. Read
the newbies on comp.infosystems.www; ask Peter Flynn, who
nobly spends a lot of time answering them.
| Here are (at least some of) the rules we'd adopt over and above SGML:
| * No <!-- --> comments
Over my dead body. This is SGML. Run it through a parser and you'll
never have trouble with comments. Lots of people want them, and
it's a problem now that Mosaic, incorrectly, renders tagged text within
| * No <![foo] .. ]]> marked sections
I don't care about this, but someone else may. Why forbid them?
| * Always use numeric character references for '<', '>', and '&'
| (no harm in using &lt;, &gt;, &amp; forms, I suppose)
So let's do it right and use the &lt; forms.
| * Use numeric character references for ", \n, \t inside attribute
| value literals
| * Always quote attribute value literals with double-quotes, not single
| * Don't split attribute values across lines (Hmmm...)
| Then the "search for HREF's" problem could be coded in ~20 lines of perl:
| I can almost guarantee that those 20 lines of perl are already in use
| as a heuristic solution to that problem now (I looked at the elisp
| code for emacs w3 client, and believe me: it's all over the place).
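For illustration only, here is a rough sketch of what such a heuristic looks like. It is written in Python rather than perl and is not the code referred to above. It assumes the restricted rules listed earlier hold: no comments or marked sections, `<` and `>` never appear as data, and attribute values are double-quoted and not split across lines. The names `TAG`, `HREF`, and `list_hrefs` are hypothetical.

```python
import re
from html import unescape

# Under the restricted rules every "<" opens a tag, so a tag is simply
# "<", a name, anything that is not "<" or ">", and a closing ">".
TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)([^<>]*)>")
# Attribute values are assumed to be double-quoted and on one line.
HREF = re.compile(r'\bHREF\s*=\s*"([^"]*)"', re.IGNORECASE)

def list_hrefs(document):
    """Collect HREF attribute values, decoding character references."""
    hrefs = []
    for tag in TAG.finditer(document):
        for attr in HREF.finditer(tag.group(2)):
            hrefs.append(unescape(attr.group(1)))
    return hrefs
```

Run against unrestricted SGML, the same code misfires: given the comment example quoted earlier, it reports `abc` as a link, because it has no notion of being inside a comment.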
| It's clear to me that folks are going to write HTML parsers based on
| intuition and experience with context-free languages. Learning exactly
| how SGML parsing works is sufficiently difficult that folks won't do it.
Too bad for them. They aren't following the spec, then. Please don't
tell us we can't follow the spec. I fully understand that the Webbers
who first decided to define HTML as SGML bit off far more than they
knew about, but for those of us who want to get our documents online,
the HTML DTD is where the rubber hits the road. Browser writers
have to learn to live with SGML, warts and all.
| Instead of declaring that perl code to be busted, why don't we agree
| the SGML folks didn't know much about autonoma theory and tighten up
| the definition of HTML a little?
You mean define a new "Dan's SGML." I don't think this is a reasonable
solution. But, Dan, you have the energy to write TGML (my pet name for
a hypothetical successor to SGML) that would do these things right, and
some others, too. When someone gets around to writing it, whether it
is part of a standards process or no, TGML will replace SGML in short
order, especially if it has a readable manual and a free parser.
BTW, would you explain "autonoma theory" and how it relates
to such mundane things as the syntax for comments?
--
Terry Allen (firstname.lastname@example.org)
Editor, Digital Media Group
O'Reilly & Associates, Inc.
Sebastopol, Calif., 95472