Re: HTML+

Mike Piff (M.Piff@sheffield.ac.uk)
Mon, 5 Sep 1994 16:37:22


Leslie Lamport gives a very cogent account of the problems HTML faces in
trying to expand onto the high ground held by (La)TeX at the moment.

Perhaps I could just add a few words more.

%%>
%%>I will discuss these options below. But first, I need to explain the
%%>difference between markup and formatting. Some people seem to think
%%>that, if they use SGML syntax, they are doing markup. In fact, from
%%>my knowledge of SGML, it seems impossible to do markup for scientific
%%>documents using SGML. Consider the simple mathematical formula written
%%>in TeX as $x_i^2$ (x sub i superscript 2). The document
%%>
%%> http://info.cern.ch/hypertext/WWW/MarkUp/HTMLPlus/htmlplus_1.html
%%>
%%>which is a proposal for an extension of HTML called HTML+, recommends
%%>the following SGML representation for this formula
%%>
%%> <math>
%%> x <sub> i </sub> <sup> 2 </sup>
%%> </math>
%%>
%%>If this is supposed to be a markup language, then the <sub> and <sup>
%%>tags should delimit logical entities. Most any scientist or
%%>mathematician can tell you what those logical entities are: <sub> is
%%>the operation of array indexing, and <sup> is the operation of
%%>exponentiation.

Some amplification.
The subscript is not used only for array indexing; that is just its
commonest use, if one excludes integrals. Similarly, and more
confusingly, the superscript is sometimes used for the same purpose
as the subscript, namely in tensor notation. $x_{i,j}^{k,l}$ does not
mean raise x[i,j] to the power (k,l), whatever that might mean!

And how could one possibly describe $x_{\pi}^{\prime}$ in terms of array
indexing and exponentiation, where $u'$ denotes the derivative of $u$?
Take the entry in $x$ indexed by $\pi$ and raise it to power $\prime$??

Again, $$\sum_{i=0}^n x_i$$ describes a logical summation in TeX, but
uses the same symbolism as sub- and superscripts, just to complicate
matters. Here the sub/superscripts describe the limits of summation, and
are placed below and above the summation sign.

Thus we see that both HTML and TeX are describing formatting instructions
in most cases, not logical markup, and in some cases TeX is describing
neither!
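
To make the distinction concrete, here is a minimal LaTeX sketch in which
each of the meanings discussed above gets its own name, even though all of
them print with the same sub/superscript machinery. Every macro name here
(\entry, \power, \tensor, \deriv, \sumover) is hypothetical and chosen only
for illustration; none is standard LaTeX.

```latex
\documentclass{article}
% All macro names below are hypothetical; none is part of standard LaTeX.
\newcommand{\entry}[2]{#1_{#2}}        % array indexing: entry #2 of #1
\newcommand{\power}[2]{#1^{#2}}        % exponentiation: #1 raised to #2
\newcommand{\tensor}[3]{#1_{#2}^{#3}}  % tensor: lower indices #2, upper #3
\newcommand{\deriv}[1]{#1'}            % derivative of #1
\newcommand{\sumover}[4]{\sum_{#1=#2}^{#3} #4} % sum of #4 for #1 from #2 to #3
\begin{document}
$\power{\entry{x}{i}}{2}$              % typesets like $x_i^2$
$\tensor{x}{i,j}{k,l}$                 % typesets like $x_{i,j}^{k,l}$
$\deriv{\entry{x}{\pi}}$               % typesets like $x_{\pi}^{\prime}$
\[ \sumover{i}{0}{n}{\entry{x}{i}} \]  % typesets like $$\sum_{i=0}^n x_i$$
\end{document}
```

The printed output is unchanged; what changes is that the source now says
which operation is meant, so a different translator could render or
interpret each one differently.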

Perhaps a better example to look at is

<h1>A heading</h1>

This is most definitely typesetting rather than logical markup.
The instruction switches to a large bold font, set flush left, with space
above and below. There is no concept of the logical extent of the section
so delimited, and indeed such a section need not exist. Contrast

<section>
  <title>A heading</title>
  Text of section
  <section>
    <title>A subheading</title>
    Text of subsection. Yes, *sub*section, but that is only apparent
    from the logical nesting of this section in the enclosing one.
  </section>
</section>

Now *that* I would call logical markup, but nothing like that appears
in HTML or LaTeX, although a similar structure almost exists in lists
in both languages. A logical parser reading this would have no trouble
representing it on screen or paper, once told what formatting to use
for either. Again, if this fragment were nested inside another section,
formatting would change accordingly. The same instructions would produce
different output depending on the surrounding logical context.
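
As a rough illustration of formatting that follows from nesting, the
following LaTeX sketch defines an environment whose heading style depends
only on how deeply it is nested. The environment name lsection and the
counter lsecdepth are hypothetical, and the mapping of depths onto
\section, \subsection and so on is just one possible choice.

```latex
\documentclass{article}
% Hypothetical names: lsection and lsecdepth are not part of LaTeX.
\newcounter{lsecdepth}
\newenvironment{lsection}[1]{%
  \addtocounter{lsecdepth}{1}%      one level deeper on entry
  \ifcase\value{lsecdepth}\or
    \section{#1}\or                 % depth 1
    \subsection{#1}\or              % depth 2
    \subsubsection{#1}\else         % depth 3
    \paragraph{#1}\fi               % depth 4 and beyond
}{%
  \addtocounter{lsecdepth}{-1}%     back to the enclosing level on exit
}
\begin{document}
\begin{lsection}{A heading}
Text of section.
\begin{lsection}{A subheading}
Text of subsection.
\end{lsection}
\end{lsection}
\end{document}
```

Moving the whole fragment inside another lsection would shift every heading
down one level without any change to the fragment itself.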

%%>2. HTML as a Formatting Language
%%>
%%>A more ambitious plan is to design a new language in such a way that
%%>most of TeX's typesetting engine can be used to display the output,
%%>but in which the input has more of a markup flavor. A major goal of
%%>this plan would be to integrate the viewer and the document editor, so
%%>the user would have something more "WYSIWYG" when creating a document.
%%>This would fit in with what I call LaTeX4, a long-term successor to
%%>conventional TeX/LaTeX.
%%>

TeX of course is entirely page-based. The text is chopped into optimized
paragraphs consisting of conceptual boxes each containing a line. This is
fed to the output routine that optimizes a page at a time, inserting floats,
page numbers, marginal notes, etc. The page is written to a DVI file, and
the next page is then optimized. Most of this mechanism could be discarded
or replaced for screen applications. Optimizing paragraphs is perhaps
still necessary, but on screen the page is a potentially infinite window.
Floats could be in-lined, or done as hyperlinks, as could footnotes,
marginal notes and cross-references.

Perhaps I could summarize the import of what Leslie is saying, as I see it,
and in a form with which few would disagree.

a) Logical markup---markup according to the logical structure of a document---
is preferable to formatting instructions.

b) Logical markup is infinitely extensible, and so requires an inbuilt
programming language for users to define their basic constructs.

c) Formatting is probably finite in extent, but huge, far huger than HTML,
and extremely complicated. Not the sort of thing to attempt over the weekend.

d) The markup of b) needs to be translated into the formatting of c) in
order to view/print a document. The translation differs depending on which
of the two you want to do.

e) HTML and (La)TeX are currently based mostly round formatting, but
TeX does have the extensibility to do markup, and has all the formatting
instructions you might need built in.
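
Point d) can be sketched in LaTeX as one logical command with two
translations, one for paper and one for the screen. The switch \ifscreen
and the command \note are hypothetical names; in-lining the note is only
one possible screen rendering, a hyperlink being another.

```latex
\documentclass{article}
% Hypothetical names: \ifscreen and \note are not part of LaTeX.
\newif\ifscreen
% \screentrue                   % uncomment to select the screen translation
\newcommand{\note}[1]{%
  \ifscreen
    \ (\emph{Note:} #1)%        screen: the note is in-lined in the text
  \else
    \footnote{#1}%              paper: the note goes to the foot of the page
  \fi
}
\begin{document}
TeX is page-based\note{Each page is optimized before the next is begun.}.
\end{document}
```

The document body says only that a note is attached; which formatting it
receives is decided in one place, by the translation in use.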

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Dr M J Piff, School of Mathematics and Statistics, University of %%
%% Sheffield, UK. +44 114 282 4431 e-mail: M.Piff@sheffield.ac.uk %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%