SGML parsing, moving between formats

Kevin Altis (altis@ibeam.jf.intel.com)
Tue, 15 Feb 1994 21:11:01 --100


At 1:19 PM 2/15/94 +0000, Daniel W. Connolly wrote:
>One of the things I released (or was just about to release when I
>changed jobs...) was an SGML compliant HTML parser in a few hundred
>lines of vanilla ANSI C.

Great! But did this code go into a black hole with the job change, or is it
available somewhere, in need of some testing, or what? :)

>>One issue in formalising HTML+ was in providing an adequate structure
>>while dealing with legacy documents. As you can see in my current DTD,
>>documents have a richer structure than with the old HTML DTD.
>>
>
>Yes... and it seems to me (at first glance... I'll have to look more
>closely...) that we've lost the ability to translate HTML to Microsoft
>Word or FrameMaker without any loss of information.
>
>Let's get formal why don't we: I do not mean that we should be able to
>take any RTF file and convert it to HTMLPLUS, or MIF for that matter.
>But I think it's crucial that there exist invertible mappings
>
> h : HTML -> RTF
>and
> g : HTML -> MIF
>and
> k : HTML -> TeXinfo
>
>so that I can take a given HTML document, convert it to RTF, and
>convert it back and get exactly what I started with (the same ESIS,
>that is... perhaps SGML comments and a few meaningless RE's would get
>lost).

Folks, this is extremely important. It may be a b*tch to move documents
from RTF, MIF, etc. to HTML right now, but it should be a breeze to go the
other direction, at least. If we make it difficult or impossible to go from
HTML to RTF, MIF, etc., then we've blown it. If we make it possible to go
from HTML -> RTF -> HTML without loss of content, formatting, etc., then
we win!
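
To make the round-trip idea concrete, here is a rough sketch of the kind of
check I have in mind. Everything in it is an assumption for illustration: the
"styled" format is a made-up stand-in for RTF or MIF (not the real thing), the
style tables are hypothetical, attributes are ignored, and the event list is
only a simplified cousin of ESIS. Comments are dropped on purpose, as Dan
allows above.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Collect a simplified, ESIS-like event list (attributes are ignored)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only runs; they carry no meaning in this sketch.
        if data.strip():
            self.events.append(("text", data))

    # Comments are not recorded: HTMLParser.handle_comment does nothing
    # unless it is overridden.


def parse_events(html):
    parser = EventCollector()
    parser.feed(html)
    parser.close()
    return parser.events


def to_styled(events, table):
    """Forward mapping: tag events become named-style events.

    Raises KeyError for any tag missing from the (hypothetical) table.
    """
    return [("run", v) if k == "text" else (k, table[v]) for k, v in events]


def from_styled(styled, table):
    """Inverse mapping: only well defined if no two tags share a style."""
    inverse = {style: tag for tag, style in table.items()}
    return [("text", v) if k == "run" else (k, inverse[v]) for k, v in styled]


def round_trips(html, table):
    """True if HTML -> styled -> HTML gives back the same event list."""
    events = parse_events(html)
    return from_styled(to_styled(events, table), table) == events


# Hypothetical style tables, not taken from any real converter.
ONE_TO_ONE = {"h1": "Heading 1", "p": "Body Text", "em": "Emphasis"}
LOSSY = {"p": "Body Text", "em": "Italic", "i": "Italic"}

print(round_trips("<h1>Title</h1><!-- note --><p>Some <em>text</em>.</p>",
                  ONE_TO_ONE))  # True
print(round_trips("<p><em>a</em><i>b</i></p>", LOSSY))  # False
```

The second table is the failure case: two distinct elements collapse onto one
style, so nothing on the far side can tell them apart on the way back. The
more element types a DTD has that a word processor can't distinguish, the
more often that happens.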

Brief reality check sermon follows:
Keep in mind that the world will neither go whole hog into SGML (HTML), nor
continue to deal exclusively with application-specific document formats
such as Microsoft Word or QuarkXPress. We're going to have to deal with
both camps. There isn't a single big company (or small company) or
organization that isn't sick and tired (mad as hell to be precise) of NOT
being able to move formatted information between multiple applications
without loss. Frankly, I'm surprised that the blood of vendors hasn't been
spilt at trade shows (it almost was at Seybold last October). However, it
isn't our job to come up with the ULTIMATE SGML DTD holy grail document
format, etc., though what Dave is doing is extremely useful for us as a
measuring stick and focal point. It IS our job to make it easy for users to
publish information on the Web, HyperText or otherwise. Concentrate on
making authoring/publishing easy and keep in mind that document formats
have to be rendered at some point. If you make rendering too hard, the only
people that will be able to spend the bucks to write clients capable of
doing the rendering will be big companies like HP, Intel, and Microsoft.
Plus, capabilities will vary so much between clients that presentation
independence will be completely lost, because half of what client A can
render can't be rendered by client B in any meaningful way. If you make it
too difficult to author/publish hypertext, then most of the information on
the Web will just end up being documents to be displayed by applications
external to the WWW client browser and the browser ends up being an
elaborate Gopher substitute.
Sermon off

ka