Re: HTML DTD issues

Tim Berners-Lee
Thu, 19 Nov 92 14:32:11 +0100


Your work on the SGML side (as all your work) is much appreciated!

> Date: Thu, 19 Nov 92 04:37:23 CST
> From: Dan Connolly <>


> The thrust to register HTML with the authorities has
> spurred me to look over the DTD again. I've found some
> problems.

> 1. Currently the NAME attribute of an anchor is declared
> as CDATA, i.e. just about anything. There's an SGML thingy
> called an ID. SGML parsers enforce uniqueness among the
> IDs of a document. Seems like that's what we want for ID
> names.

> But an SGML ID has to start with a letter. So all the
> HTML files that use numbers as anchor names will break.

The enforcement of uniqueness is useful, and it is what we want.
It is unfortunate that the very same constraint led to the use of numbers!
This is a hangup of the NeXT editor (which I still use, until
someone makes a more convenient editor!) but we oughtn't to
worry about it. A future editor could generate Z[0-9]* names.
We could even specify that Z[0-9]* names are related to a NEXTID attribute
somewhere for the generation of time-unique IDs.
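
A minimal sketch of the Z[0-9]* naming idea, in Python. The function names
and the exact role of the NEXTID counter are assumptions made for
illustration, and the name pattern is a simplified approximation of SGML
name rules (it ignores details such as name length limits).

```python
import re

# Simplified approximation of an SGML name: a letter first, then
# letters, digits, '.' or '-'. Not a full statement of the standard.
SGML_NAME = r"[A-Za-z][A-Za-z0-9.-]*"


def is_valid_id(name):
    """Return True if name could serve as an SGML ID under the rule above."""
    return re.fullmatch(SGML_NAME, name) is not None


def next_anchor_name(next_id):
    """Return (anchor_name, updated_counter) for a hypothetical NEXTID value."""
    return "Z%d" % next_id, next_id + 1


def rename_numeric_anchors(names):
    """Map purely numeric anchor names (e.g. "12") to Z-prefixed names."""
    return {n: "Z" + n for n in names if re.fullmatch(r"[0-9]+", n)}
```

Existing files whose anchors are plain numbers could be converted with
something like rename_numeric_anchors, provided the links that point at
those anchors are rewritten at the same time.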

The only neat thing about CDATA is that it would allow a gateway
to put in something which has come from the data. For example,
a glossary generator might generate anchors for each term
whose name equals the term, and then generate index entries
pointing to that.
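
The glossary case can be sketched as follows, in Python. The function name
glossary_html and the choice of DL/DT/DD and LI markup are assumptions for
illustration, not a description of any existing gateway. Terms are used
directly as anchor names, which works with CDATA but would fail under the
ID rules for a term such as "8-bit".

```python
from html import escape
from urllib.parse import quote


def glossary_html(entries):
    """Build (index_html, body_html) from a dict of term -> definition.

    Each term becomes an anchor whose NAME equals the term, and each
    index entry links to that anchor.
    """
    index_lines = []
    body_lines = []
    for term in sorted(entries):
        label = escape(term)
        # The fragment in HREF is percent-encoded so spaces survive.
        index_lines.append('<li><a href="#%s">%s</a>' % (quote(term), label))
        body_lines.append(
            '<dt><a name="%s">%s</a><dd>%s'
            % (label, label, escape(entries[term]))
        )
    return "\n".join(index_lines), "\n".join(body_lines)
```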

What do you think?

> 2. I introduced two tag names when I drafted the DTD:
> HTML contains the whole document. I defined it
> so you can omit both the start and the end tags, so it's
> inferred by SGML parsers. I don't think I can avoid some
> top-level tag.
> DOCUMENT contains most of the "body" -- all the
> headings and paragraphs. I did this to avoid something
> called mixed content, which causes complications. I
> could rename this element as BODY, and introduce an
> omittable HEADING tag to surround the TITLE, NEXTID, and
> ISINDEX tags.

I like the latter idea. Header and Body fit in well with mail
nomenclature, whereas "document" is normally the whole thing.

> 3. I stuck anchors in as an inclusion, meaning they could
> be used just about anywhere. I thought stuff like <a
> name=foo><h1>Foo</h1></a> was legal, but neither linemode
> nor the midas browser groks.

The line mode doesn't? It should. Titles are the only thing I wanted
to insist were plain ASCII text....
Turns out to be a bug in HTML.c -- fixed for next release.

> I'm editing the DTD to restrict the usage of anchors to
> only contain text strings.

I don't like that.... I think that especially as we introduce
highlighting, anchors will want to be general areas of text, so
long as they are nested properly. (An "SGML attitude" restriction
which Frank Kappe objected to I recall).

> 4. The OL tag is disappearing. It's no longer documented
> in the web, and it's not supported by MidasWWW. Should
> I delete it from the DTD?

You say it's useful? If you have implemented it, and no one else
objects, then we could put it back in. In principle, with hypertext, you don't
have to number things, you can refer to them with a link. However, you
can imagine the abstract difference between an ordered list and
a sack of objects being important. [For example, a list of
instructions is ordered.] I'll put it into the HTML2 list of features.
I suggest everyone implement OL as UL in programs which, like the line mode
browser, can't differentiate.
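
A minimal sketch of the OL-as-UL fallback, in Python. The function
render_list and its can_number flag are hypothetical and are not taken from
the line mode browser's code.

```python
def render_list(tag, items, can_number=True):
    """Render list items as text lines.

    An OL is numbered when the program can do so; otherwise it is
    rendered exactly like a UL, with a bullet marker.
    """
    tag = tag.upper()
    if tag not in ("OL", "UL"):
        raise ValueError("expected OL or UL, got %r" % tag)
    lines = []
    for number, item in enumerate(items, start=1):
        if tag == "OL" and can_number:
            marker = "%d." % number
        else:
            marker = "*"
        lines.append("  %s %s" % (marker, item))
    return lines
```

The document keeps the ordered/unordered distinction either way; only the
presentation degrades in a program that cannot number items.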

> 5. What about <HP1> thru <HP5>... should we include them?
> I'd prefer <em>, <tt>, <cite>, ala TeX. Or we could go
> with the O'Reilly/Hal DocBook tags: <Emphasis>,
> <OopsChar>, <wordasword>,<CiteBook>,<Subscript>,
> <Superscript>.

I agree that numbering them is on the verge of useless. The trouble is,
you never have enough. Why CiteBook but not CiteProgram? etc etc.
The DocBook names are on the long side, aren't they? Not very important
I suppose.

> 6. Any more thoughts on the BaseAddress tag?

Yes. It should be in. I think. I've mentioned in

> 7. The HTML tags documentation says Listing sections can
> contain any ISO Latin 1 characters. The SGML standard
> mentions ISO 646, i.e. ascii, as the default, but the
> sgmls parser, the linemode browser, and MidasWWW all seem
> to grok Latin1 just fine.

I suggest we limit it to ASCII unless something outside the
document says otherwise, while strongly recommending that
8-bit character sets should be handled by the apps. I have
seen some funnies when two clients both handle 8-bit characters,
but not the same ones.
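
To illustrate the kind of mismatch described above, here is a small Python
sketch. The two character sets used (Latin-1 and code page 437) are chosen
only as examples and are an assumption, not a record of what any particular
client used.

```python
def non_ascii_offsets(data):
    """Return the byte offsets in data that fall outside 7-bit ASCII."""
    return [i for i, byte in enumerate(data) if byte > 0x7F]


def decode_both(data):
    """Decode the same bytes under two different 8-bit character sets."""
    return data.decode("latin-1"), data.decode("cp437")
```

A document with no offsets reported by non_ascii_offsets reads the same
under both decodings; any byte above 0x7F may be displayed differently by
two clients that assume different character sets.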

Does the SGML standard say how to specify the character set for
the text?

> Dan