Re: Virtual Pages

Daniel W. Connolly (connolly@hal.com)
Fri, 25 Feb 1994 12:32:20 --100


In message <9402190101.AA17555@hpfcma.fc.hp.com>, Dave Hollander writes:
>
>The attached posting reminded me of another document design and maintenance
>issue. The current model for creating and accessing HTML documents
>forces one to build rather small documents that are heavily linked.
>This can be difficult to maintain, particularly in a production
>environment, because the unit of display (the HTML file) has no relationship
>to the unit of submission/processing/maintenance.

I've been thinking about this lately too... how the parts make the
whole and such... You mentioned that an "HTML document" is a "unit of
display". Actually, it was designed as a "unit of transfer" (shoot...
can't find the source of that quote). We have precedent for several
other entity/object/thingys. Here's my personal terminology for these
things:

* Entity -- unit of data storage/transfer. A sequence of bytes
with an associated formal interpretation.
Examples: an SGML document is often broken into entities
for maintenance reasons. (an entity may be used several
times in a document, so the author assigns it a name and
references the name each time rather than storing copies).

WWW nodes are currently broken into the HTML source plus
separate entities for graphics, sound, etc.

* Node (Page???) -- unit of display. With the advent of the <IMG SRC=...>
element, there is no longer a 1-1 relationship between
transfer and display objects. I expect this situation to
get more complex...

The page isn't necessarily something the information producer and
consumer need agree on. The producer may edit a document in the form
of 26 different pages, but the consumer may want to see an outline
form with the H1s and H2s from all 26 of the author's pages.

* Element -- unit of "information". "Element" is the root of a class
hierarchy containing Documents, Messages, etc. An element
has a type, and depending on the type, may have some
attributes, and some content.

* Document -- unit of composition. A kind of Element.

* Message -- unit of communication. A kind of Document.
A message has an explicit author,
audience, and date of "publication". RFC822 messages
additionally have a globally unique identifier.
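
The terminology above can be sketched as a small class hierarchy. This is
only an illustration in Python of how the terms relate, not a proposed
implementation; the field names (interpretation, message_id, entities, and
so on) are assumptions made up for the sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:
    """Unit of storage/transfer: bytes plus a formal interpretation."""
    data: bytes
    interpretation: str  # hypothetical label, e.g. "text/html"


@dataclass
class Element:
    """Unit of information: a type, maybe attributes, maybe content."""
    type: str
    attributes: dict = field(default_factory=dict)
    content: list = field(default_factory=list)


@dataclass
class Document(Element):
    """Unit of composition; a kind of Element."""


@dataclass
class Message(Document):
    """Unit of communication; a kind of Document."""
    author: str = ""
    audience: str = ""
    date: str = ""
    message_id: Optional[str] = None  # RFC822-style unique identifier


@dataclass
class Node:
    """Unit of display; may draw on several entities (HTML source, images)."""
    entities: list = field(default_factory=list)
```

Keeping Node and Entity outside the Element hierarchy reflects the point
that transfer and display units need not line up one-to-one with units of
information.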

The current WWW architecture (with the exception of the IMG
element...) seems roughly equivalent to the Gopher model, where the
disk file is the ultimate definition of the entity and the node. We
have these constraints that (1) HTML nodes are completely independent
-- they must contain all their context, authorship info, etc. (2)
folks should be able to maintain HTML files with a text editor, (3)
the server should be able to ship that file over the wire verbatim
without processing it, and (4) the client should be able to format and
display it in real time. This doesn't seem scalable to me. Constraints
(2) through (4) seem somewhat reasonable. It's number (1) that I'd
like to do something about.

I'd like to see an architecture where an author can compose a document
consisting of a set of nodes with common features -- perhaps a style
sheet, common navigation features ("back", "forward", "up", "top",
"index", etc.) -- without having to store the features in each disk
file. I've heard of some folks using the C preprocessor as a solution
to this problem! Ackk! Thptptptp!
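
As a rough sketch of the alternative: store only the node bodies, keep the
common features once at the document level, and derive the navigation links
when a node is produced. Everything here (the render_node function, the
dictionary layout, the file names) is hypothetical and chosen only for
illustration.

```python
def render_node(doc, index):
    """Wrap one stored node body with navigation derived from the document."""
    nodes = doc["nodes"]
    _name, body = nodes[index]
    links = [("top", nodes[0][0])]
    if index > 0:
        links.append(("back", nodes[index - 1][0]))
    if index < len(nodes) - 1:
        links.append(("forward", nodes[index + 1][0]))
    nav = " | ".join(f'<A HREF="{href}">{label}</A>' for label, href in links)
    return f"<TITLE>{doc['title']}</TITLE>\n{nav}\n{body}"


# Hypothetical document: the title and node order are stored once, not per file.
doc = {
    "title": "Example Guide",
    "nodes": [
        ("intro.html", "<H1>Intro</H1>"),
        ("usage.html", "<H1>Usage</H1>"),
        ("faq.html", "<H1>FAQ</H1>"),
    ],
}

page = render_node(doc, 1)
```

If this ran as a build step that writes out the finished files, the server
could still ship each file verbatim, so constraint (3) would survive.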

Take the GNN web for an example -- I bet it's a nightmare to
maintain! It seems that the GNN editors should be able to compose one
SGML document containing lots of little HTML entity files. Then an
SGML parser could validate the IDs and IDREFs of all the
intra-document links, as well as the structure of the document.
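
To make the ID/IDREF check concrete, here is a minimal sketch in Python. It
uses regular expressions as a stand-in for a real SGML parser, so it says
nothing about document structure, and the attribute spelling (ID="...",
IDREF="...") plus the entity names are assumptions for the example.

```python
import re

# Stand-ins for what a real SGML parser would report (assumed attribute names).
ID_RE = re.compile(r'\bID="([^"]+)"', re.IGNORECASE)
IDREF_RE = re.compile(r'\bIDREF="([^"]+)"', re.IGNORECASE)


def check_links(entities):
    """Return (duplicate_ids, dangling_refs) across all entity texts."""
    defined = {}
    duplicates = []
    for name, text in entities.items():
        for ident in ID_RE.findall(text):
            if ident in defined:
                duplicates.append((ident, name))
            else:
                defined[ident] = name
    dangling = [
        (ref, name)
        for name, text in entities.items()
        for ref in IDREF_RE.findall(text)
        if ref not in defined
    ]
    return duplicates, dangling


# Hypothetical entity files making up one composed document.
entities = {
    "a.html": '<H1 ID="intro">Intro</H1> <A IDREF="usage">next</A>',
    "b.html": '<H1 ID="usage">Usage</H1> <A IDREF="faq">more</A>',
}
duplicates, dangling = check_links(entities)
```

The point is that the check runs over the whole set of entities at once,
which is only possible when something knows they belong to one document.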

Hmmm... I have to noodle on this one for a little while.

Dan