assorted HTML and SGML questions

Joe Wells (jbw@cs.bu.edu)
Tue, 10 Oct 1995 22:40:50 -0400


Hi, HTML and SGML gurus,

I've got some questions that can probably be answered by an expert without
even thinking but which I haven't been able to find the answers to in my
WWW browsing. Some of these questions are about HTML, some are about
SGML, and some are about HTML as an SGML document type.

Q: (("text/html" Internet Media Type)) Does text/html forbid including the
SGML declaration (<!SGML ...>)? Does it require that a PUBLIC external
identifier (e.g. PUBLIC "-//IETF//DTD HTML Level 2//EN") be included in
the document type declaration (<!DOCTYPE ...>), if that declaration is
present? Does it forbid including an internal declaration subset? I am
not asking how many WWW browsers can handle this; I am
asking instead whether the standard specifies this. The version of the
HTML 2.0 standard (which includes the definition of the text/html media
type) that I read seemed vague on these questions, but perhaps I am
missing something.

Q: ((SGML Marked Sections)) The syntax for marked sections is not clear to
me. I would like to know precisely how to determine when the end of a
marked section has been reached. I've seen two grammars for this, one
from TEI (which is clearly wrong and disagrees with what "sgmls" does)
and one based on the standard which merely says the content of the
marked section is "SGML characters" (which is not helpful). What is
the precise syntax for marked sections? (Pointers to *net* resources
are greatly preferred to paper resources. Pointers to source code
should be to well-commented and clear source code; I've already tried
to figure this out by reading the source of sgmls.)

Q: ((HTML 3.0 with HTML.Recommended vs. Legacy Documents)) In HTML 3.0
with HTML.Recommended enabled in the DTD, it is illegal to put text
directly inside an LI element, like this:

<LI>Here are some words.</LI>

This is legal:

<LI><P>Here are some words.</P></LI>

Is the first fragment supposed to be interpreted (rendered) like the
second one by an HTML browser? I've noticed that many browsers
(e.g. Netscape 1.1N) treat them very differently. Netscape in
particular renders the second, legal version in a truly horrible
fashion. There are other HTML elements with problems like the one I
describe here for the LI element.

Q: ((HTML 3.0 TEXTAREA vs. Inclusion Exceptions)) The HTML 3.0 proposed
draft says that the content of the TEXTAREA element should be used as
follows:

"The text up to the end tag is used to initialize the field's value.
The initialization text can contain SGML entities, e.g. for accented
characters, but is otherwise treated as literal text."

This presumes that the TEXTAREA element's content can only be data
characters. However, using the proposed DTD the following HTML is valid:

<FORM ACTION="http://dev.null.dom">
<P>
<MATH>
<TEXTAREA NAME="foo" ROWS=1 COLS=1>
<SPOT ID="bar">
<BOX>
yyy<SUP>
zzz
</SUP>
</BOX>
</TEXTAREA>
</MATH>
</P>
</FORM>

Thus, the TEXTAREA element can contain subelements. How should a
browser handle this? In particular, what precisely should the browser
send to the server if the user submits the form without changing
anything in the text area?
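
To make the question concrete, here is a minimal Python sketch of what I
take the quoted "literal text" reading to mean: entity references are
expanded and everything else, including anything that looks like a tag,
is kept verbatim. The function name textarea_initial_value is my own
invention, and html.unescape (which uses Python's HTML entity table) is
only a stand-in for real SGML entity resolution. It is one possible
reading, not something the draft spells out.

```python
from html import unescape


def textarea_initial_value(raw_content: str) -> str:
    """Hypothetical helper: compute a TEXTAREA's initial value under the
    'literal text plus entities' reading of the draft.

    Only entity references such as &eacute; are expanded; anything that
    looks like markup (e.g. <SUP>) is left in the value as plain text.
    """
    return unescape(raw_content)


if __name__ == "__main__":
    # Content resembling the BOX/SUP fragment in the example form.
    print(textarea_initial_value("yyy<SUP>\nzzz\n</SUP> caf&eacute;"))
```

Under this reading the tags would be submitted as ordinary characters,
which is exactly what the DTD's inclusion exceptions seem to contradict.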

Q: ((SGML Unclosed Start and End Tags)) Under what circumstances are
unclosed start and end tags allowed?

Q: ((HTML 3.0 Dummy Elements)) In HTML 3.0, what is the purpose of having
the BODYTEXT and FIGTEXT elements at all? They allow both start and
end tags to be omitted and are not intended ever to be used as
markup. Neither of them seems to be documented in the proposed draft
of the standard.

Q: ((My SGML Confusion)) What is "SDATA"?

Q: ((SGML vs. Carriage Returns)) The documentation for the program "sgmls"
says that it does this:

1. each carriage return character is turned into a
non-SGML character;

2. each newline character is turned into a record end
character, and at the same time a record start
character is inserted at the beginning of each
line;

Is this part of the standard? Is it an appropriate thing to do for
Unix compatibility, given that the Unix convention is that lines are not
started by anything and are ended by newlines?
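
In case I am misreading the documentation, here is a small Python sketch
of how I understand the two steps quoted above. The markers <RS>, <RE>
and <NONSGML> are symbolic placeholders of my own so the result is
readable; they are not the actual character codes, and the function name
to_records is hypothetical. This is my interpretation, not the sgmls
source.

```python
# Symbolic stand-ins for the SGML function characters (assumption: chosen
# for readability only, not the real character codes).
RS = "<RS>"            # record start
RE = "<RE>"            # record end
NONSGML = "<NONSGML>"  # placeholder for "a non-SGML character"


def to_records(raw: str) -> str:
    """Rewrite Unix-style text the way the quoted description suggests."""
    out = []
    at_line_start = True
    for ch in raw:
        if at_line_start:
            # Step 2, second half: a record start begins every line.
            out.append(RS)
            at_line_start = False
        if ch == "\n":
            # Step 2, first half: newline becomes a record end.
            out.append(RE)
            at_line_start = True
        elif ch == "\r":
            # Step 1: carriage return becomes a non-SGML character.
            out.append(NONSGML)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(to_records("first line\nsecond line\n"))
```

If this sketch is right, a file with DOS-style line endings would end up
with a non-SGML character just before every record end.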

Q: ((SGML Grammar Confusion)) The grammar of SGML that I have seen says
one alternative for an "attribute value" is "character data". This
seems very open-ended and unspecified. What does this mean?

Thanks for any help you can give me.

-- 
Joe Wells <jbw@cs.bu.edu>