Comments on HTMLPLUS.DTD

Betty Harvey (harvey@oasys.dt.navy.mil)
23 Aug 94 15:09 EDT


Below are some comments prepared by Don Gignac of the Advanced
Information Systems Branch, David Taylor Model Basin, NSWC, regarding
the HTMLPLUS.DTD. I hope you find these comments useful. If you have
any questions, please feel free to contact either me or Don
(gignac@oasys.dt.navy.mil).

Regards,

Betty Harvey

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
Betty Harvey <harvey@oasys.dt.navy.mil> | David Taylor Model Basin
Advanced Information Systems Branch | Carderock Division
Code 183 | Naval Surface Warfare
Bethesda, Md. 20084-5000 | Center
| DTMB,CD,NSWC
URL: http://navysgml.dt.navy.mil/betty.html |
/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\\/\/

23 AUGUST 1994

SUBJ: The "htmlplus" DTD

REFS: (1) ISO 8879
(2) James Clark's documentation for the SGMLS parser in the "sgmls.doc"
file
(3) Goldfarb, Charles F., "The SGML Handbook"

1. GENERAL

1.1 The following comments were prepared by the Advanced Information
Systems Branch.

1.2 The page numbers in parentheses following citations of ISO 8879
pertain to the annotated standard in Goldfarb's "The SGML Handbook".

1.3 It is assumed that NAMECASE GENERAL is YES in the SGML declaration
for the "htmlplus" DTD, so case does not matter for element names; i.e.,
"htmlplus" and "HTMLPLUS" are the same element name.

2. SPECIFIC

2.1 The statement:

This DTD is parsed without errors by "sgmls -s -p htmlplus.dtd".

occurs in the SGML comment following the DOCTYPE declaration. When the
"htmlplus" DTD was parsed with the "sgmls -s -p htmlplus.dtd" command
(using version 1.1 of the SGMLS parser), the parser reported the error

sgmls: SGML error at htmlplus.dtd, line 483 in declaration parameter 4:
Content model is ambiguous

It is an error for an element of the DTD to have an ambiguous
content model. See comment 2.3 below.

It is highly recommended that the "htmlplus" DTD be parsed with
the SGMLS command "sgmls htmlplus.dtd -egprsu -f err-log". The "-e"
and "-g" options provide useful information with regard to the subjects
of error returns/warnings. Strictly speaking, specifying the "-p" option
(parse the DTD only) implies the "-s" option (suppress output except
for error returns/warnings), so "-s" need not be given separately. The
"-r" option warns of defaulted entity references. The "-u" option warns
of elements used in the DTD but not declared. See Ref(2).

When the "htmlplus" DTD was parsed with the SGMLS command
"sgmls htmlplus.dtd -egprsu -f err-log" (using version 1.1 of the SGMLS
parser) the following error returns/warnings:

sgmls: SGML error at htmlplus.dtd, line 483 in declaration
parameter 4: Content model is ambiguous
sgmls: Warning at htmlplus.dtd, line 574 at record end:
Element "RCDATA" used in DTD but not defined
sgmls: Warning at htmlplus.dtd, line 574 at record end:
Element "EM" used in DTD but not defined
sgmls: Warning at htmlplus.dtd, line 574 at record end:
Element "EMPTY" used in DTD but not defined
sgmls: Warning at htmlplus.dtd, line 574 at record end:
Element "QUOTE" used in DTD but not defined
sgmls: Warning at htmlplus.dtd, line 574 at record end:
Element "PCDATA" used in DTD but not defined

were provided in the "err-log" file. These error returns/warnings are
discussed at length in the following comments.

2.2 The following classification of "SGML text types":

Various classes of SGML text types:

#CDATA text which doesn't include markup or entity references

#RCDATA text with entity references but no markup

#PCDATA text occurring in a context in which markup and
entity references may occur.

occurs in an SGML comment in the "htmlplus" DTD. While the three descriptions
are correct, two of the three "classes" are incorrect.

"CDATA" (character data) and "RCDATA" (replaceable character data)
are declared content, and neither the reserved name indicator "#" nor
parentheses are used in conjunction with them. Accordingly, "#CDATA"
and "#RCDATA" above are not syntactically correct. See clause 11.2.3
"Declared Content" (page 409). On the other hand "#PCDATA" is a content
token for parsed character data in content models. See clause 11.2.4
"Content Model" (pages 409 to 413). Both declared content and parsed
character data are used to define elements (though not in the same content
model). See [116] in clause 11.2 "Element Declaration" (page 405).

The following should be noted in passing. If the reserved name
indicator "#" is omitted from the "#PCDATA" content token, "PCDATA"
will be considered an element name. See the first NOTE in clause 11.2.4
"Content Model" (page 411). Also if declared content of CDATA, RCDATA,
or EMPTY is enclosed in parentheses in an element definition, it will
be considered to be an element name. See lines 32 to 34 (page 412).
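
As a brief sketch of the distinction above (the element names "x", "y",
"z", and "w" are hypothetical and do not occur in the "htmlplus" DTD):

```sgml
<!-- Declared content: the keyword stands alone, with no "#"
     and no parentheses -->
<!ELEMENT X - - CDATA>
<!ELEMENT Y - - RCDATA>
<!ELEMENT Z - O EMPTY>

<!-- Content model: "#PCDATA" is a content token inside a
     model group -->
<!ELEMENT W - - (#PCDATA)>
```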

2.3 The "text" ENTITY declaration

<!ENTITY % text "#PCDATA | A | ICON | EMPH | EMBED | SP | BR">

and the "fig" ELEMENT declaration

<!ELEMENT FIG - - (EMBED?, FIGA*, (%text;)*)>

will result in a "fig" content model

(EMBED?, FIGA*, (#PCDATA | A | ICON | EMPH | EMBED | SP | BR)*)

which is ambiguous with regard to "embed" content. This means that a
parser cannot determine without looking ahead in the "fig" content
(and perhaps not even then) whether "embed" content corresponds to the
"embed?" token or to the "embed" option of "(%text;)*". Ambiguous content
models are not allowed. See clause 11.2.4.3 "Ambiguous Content Model"
(pages 414 to 415).

It is not clear how the "fig" element should be redefined to eliminate
this problem.
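
The source of the ambiguity can be seen in a reduced form of the same
model. The sketch below uses hypothetical element names "f1" and "f2"
and assumes that "a" and "embed" are declared elsewhere. It is meant only
to show where the ambiguity arises. It is not proposed as a replacement
for "fig", since the second model no longer permits "embed" content after
the optional leading one.

```sgml
<!-- Ambiguous: an "embed" element at the start of "f1" could
     satisfy either the "EMBED?" token or the "EMBED" token in
     the repeatable group -->
<!ELEMENT F1 - - (EMBED?, (A | EMBED)*)>

<!-- Not ambiguous: "embed" can satisfy only one token -->
<!ELEMENT F2 - - (EMBED?, A*)>
```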

2.4 The "em" element used to define the "title" element

<!ELEMENT TITLE - - (#PCDATA | EM)+>

and the "h1" through "h6" elements

<!ENTITY % heading "H1|H2|H3|H4|H5|H6">

<!ELEMENT (%heading;) - - (#PCDATA | EM)+>

has not been defined in an ELEMENT declaration. Using an undeclared
element name in a content model is not a syntax error in and of itself
(and in fact there may be a good reason for doing so at times). See
lines 32 to 34 (page 412) and the following NOTE (page 413) in clause
11.2.4 "Content Model". However, in this case it would appear that a
suitable definition of the "em" element is required.

2.5 The "quote" element used to define the "dd" element

<!ELEMENT DD - O (P|QUOTE|UL|OL|%text;)+ -- definition text -- >

has not been defined in an ELEMENT declaration. As before, it would
appear that a suitable definition of the "quote" element is required.

2.6 The definition of the "a" element

<!ELEMENT A - - (PCDATA | ICON | EMPH | EMBED)*>

contains an undefined element "pcdata". In all likelihood, the reserved
name indicator "#" was omitted from the content token "#PCDATA" for
parsed character data.
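
Assuming that "#PCDATA" was indeed intended, the declaration would
presumably read:

```sgml
<!ELEMENT A - - (#PCDATA | ICON | EMPH | EMBED)*>
```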

2.7 The definition of the "embed" element

<!ELEMENT EMBED - - (RCDATA)>

contains an undefined element "rcdata". In all likelihood, declared
content of RCDATA is intended here. The parentheses around "RCDATA"
must be removed.
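
Assuming that declared content of RCDATA was indeed intended, the
declaration would presumably read:

```sgml
<!ELEMENT EMBED - - RCDATA>
```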

2.8 The definitions of the "tb", "figa", "isindex", and "nextid" elements

<!ELEMENT TB - O (EMPTY) -- vertical break of 1/2 line spacing -->

<!ELEMENT FIGA - O (EMPTY)>

<!ELEMENT ISINDEX - O (EMPTY)>

<!ELEMENT NEXTID - O (EMPTY)>

contain an undefined element "empty". In all likelihood, declared content
of EMPTY is intended here. The parentheses around "EMPTY" must be removed.
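
Assuming that declared content of EMPTY was indeed intended, the four
declarations would presumably read:

```sgml
<!ELEMENT TB - O EMPTY -- vertical break of 1/2 line spacing -->

<!ELEMENT FIGA - O EMPTY>

<!ELEMENT ISINDEX - O EMPTY>

<!ELEMENT NEXTID - O EMPTY>
```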

2.9 There is a problem with mixed content arising from the use of the
"text" parameter entity.

<!ENTITY % text "#PCDATA | A | ICON | EMPH | EMBED | SP | BR">

Carriage returns and white space in the content of elements whose content
models consist of certain combinations of subelements may be regarded
as "data" instead of being ignored with the result that the element's
content will not parse against the element's content model. See the
discussion of mixed content in the second NOTE in clause 11.2.4 "Content
Model" (pages 411 to 412) and clause 7.6.1 "Record Boundaries" (pages 321
to 324). For a simple example of the problems arising from mixed content,
consider the following content

<item>
<emph b="b">DON'T DO THIS!</emph>
</item>

of the "item" element

<!ELEMENT ITEM - O (%text;)>

whose resolved content model is

(#PCDATA | A | ICON | EMPH | EMBED | SP | BR)

The carriage return following the "item" start-tag is not ignored; it
is considered to be "data", i.e., parsed character data. The parser
associates this data with the "#PCDATA" content token. The following
"emph" content is rejected since the choice of parsed character data,
"a" content, an "icon" tag, "emph" content, "embed" content, an "sp"
tag, or a "br" tag may not be repeated. If the above choice could be
repeated, i.e., if the above content model had an occurrence indicator
of "*" or "+", there would be no problem.

The "input" element

<!ELEMENT INPUT - - (%text;)>

has the same mixed content problem. The following elements

<!ELEMENT HTMLPLUS O O ((HEAD, BODY) | ((%setup;), (%main;)*))>

<!ELEMENT TH - O ((%text;)+, TD*) -- a row of headers -->

<!ELEMENT TR - O ((%text;)+, TD*) -- a row of data -->

<!ELEMENT SELECT - - ((%text;)*, ITEM*)>

<!ELEMENT FIG - - (EMBED?, FIGA*, (%text;)*)>

have similar but somewhat different mixed content problems.

Unlike the "item" and "input" elements, it is not clear how these
latter elements should be redefined to eliminate these problems.
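
For the "item" and "input" elements, one possible redefinition follows
from the earlier remark that a repeatable choice avoids the problem. The
sketch below assumes that allowing the choice to occur zero or more times
is acceptable for both elements; "+" could be used instead of "*" if at
least one occurrence is required.

```sgml
<!ELEMENT ITEM - O (%text;)*>

<!ELEMENT INPUT - - (%text;)*>
```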

-------