MIME, SGML, UDIs, HTML and W3

Tim Berners-Lee (timbl@zippy.lcs.mit.edu)
Thu, 11 Jun 92 12:22:56 -0400

Next message: Wije Wathugala: "SGML Converters"
Previous message: Paul Burchard: "Re: WAIS APIs"
Next in thread: Dan Connolly: "Re: MIME, SGML, UDIs, HTML and W3"

I have printed off the recent discussion on the new
HTTP, HTML, MIME and UDIs and done what I can
to disentangle it all in my mind. I will reply
in one message, because many of the points are linked.
I know this should be hypertext, with references, but
(a) I am away from home and (b) we don't yet have a
universal mail/news archive server running to link to.

HTTP and HTML

First of all, Jean-Francois <jfg@dxcern.cern.ch>
points out very properly that the enhanced HTTP
protocol and the enhanced HTML spec are quite
separate things, and should be specified separately.
I agree wholeheartedly about all this, and
I apologize for muddling the levels up till now.

(As a small aside, I would point out that whereas an
HTERR file is not very useful, an HTFWD file IS.
It is like a hypertext soft link. But I am happy to
leave that as a separate type of file. It should
certainly get a different extension so that it gets a
different icon.)

HTTP: SGML vs ASN.1

Let's look at the HTTP protocol first. Carl <barker@cernnext.cern.ch>
is mapping out the requirements for this, and assuming that SGML
would be a reasonable representation for it in practice.
And so it is. When the requirements are clear,
it would certainly be interesting to look at mapping them
onto a Z39.50-style ASN.1 implementation. This would
be useful for two reasons. First, the comparison would
point out to us things in Z39.50 which we might not have thought of
and which would be useful for HTTP. Second, the comparison might give
a nice short, or at least well-defined, list of things which the WAIS
guys might like to take into account in the next version
of their protocol. (I demoed W3 to Brewster, who hadn't
seen it live before and was very keen that WAIS and W3
should merge, changing the WAIS protocol if necessary.)

There is no reason why we shouldn't try both protocols.
If they map well onto each other, it's just a question
of having two separate parsers at the low level, building
the same internal structures.

When we're talking about an SGML representation
and describing a file to come later down the link,
I don't think we have to use the NOTATION= attribute with a notation
type, because we won't in fact be talking about
the notation of an SGML element.
The format in this case is not something which the SGML
parser is aware of.

I must admit I was disappointed to learn that SGML
didn't allow for any way of including 8 bit data. Thanks Eric
<enag@ifi.uio.np> for your explanations.

MIME and SGML

Dan <connolly@pixel.convex.com> rightly points out
the relevance of the coming MIME standards. There
are several things which we must separate here, though:

1. The MIME classification of data formats
2. The MIME format for multi-part messages
3. The MIME format for rich text.
4. The MIME format for external document addresses (MIME UDIs)

1. MIME classification of data formats

We must do for MIME the same disentangling job which JF did
on HTML.

First of all, the MIME job of classifying data formats
is a useful job which is ideally done by just one
bunch of people. There has been some suggestion that
the MIME classifications are not well enough defined,
but they seem to be the best effort yet and one can only
assume they will evolve in the right direction. So I'd
back the use of these for W3.

2. The MIME format for multi-part messages

This is necessary for sending a multi-part
document over a mail link. We have to ask ourselves
whether it is reasonable to use over a binary link.
Personally, my initial impression is that the MIME
stuff, using as it does terminators such as
--xxx-- separated by blank lines, looks more horrible
to work with in this respect than SGML! We would still have
the problem of restrictions on the content:
it must not contain the delimiters, it is limited to a 7 bit
character set, it is line-oriented; in fact, all the things which
email carries as a restriction. This is really taking on board
the legacy of all the mail which has evolved over the years.
Do we need that for our new ultra-fast hypertext access
protocol?
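
As a rough illustration of those restrictions, here is a minimal
sketch using Python's standard email package (a present-day library,
used only for illustration; the boundary string "xxx" and the payload
bytes are made up for the example). It shows that arbitrary 8 bit
data has to be re-encoded as 7 bit text, and that the parts are
framed by textual delimiter lines which the content must never contain.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Hypothetical payload: arbitrary 8 bit data, as a binary link could carry directly.
binary_payload = bytes(range(256))

# "xxx" is an illustrative boundary; a real one must not occur inside any part.
message = MIMEMultipart("mixed", boundary="xxx")
message.attach(MIMEText("A hypertext node would go here.\n", "plain", "us-ascii"))
message.attach(MIMEApplication(binary_payload))  # base64-encoded by default

wire_text = message.as_string()

# Everything on the wire is 7 bit text, framed by --xxx delimiter lines.
is_seven_bit = all(ord(ch) < 128 for ch in wire_text)
delimiter_lines = [line for line in wire_text.splitlines() if line.startswith("--xxx")]
encoded_size = len(message.get_payload()[1].get_payload())

print(is_seven_bit, delimiter_lines, len(binary_payload), encoded_size)
```

The encoded binary part comes out larger than the original bytes,
which is the cost of squeezing 8 bit data through a 7 bit,
line-oriented channel.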

[Compare the MIME format with the rather cleaner NeXT
Mail format which is, as far as I understand, simply
a uuencoded compressed tar file of all the bits, where
uuencoding is designed as an optimal way of getting over
mail transport restrictions, compress does what it says
and tar is a multipart wrapper designed for that only. Not
standard outside Unix, perhaps, but cleaner in that the
mail formatting is done at the last minute and doesn't
affect the other operations.]
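
The layering described here can be sketched in Python as below. This
is only an illustration of the idea, not the actual NeXT Mail format:
gzip stands in for the Unix compress program (which is not in the
Python standard library), and the function names and part names are
hypothetical.

```python
import binascii
import gzip
import io
import tarfile


def pack_parts(parts):
    """tar (multipart wrapper) -> gzip (compression) -> uuencode (mail-safe text)."""
    tar_buffer = io.BytesIO()
    with tarfile.open(fileobj=tar_buffer, mode="w") as archive:
        for name, data in parts.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    compressed = gzip.compress(tar_buffer.getvalue())
    # uuencoding takes at most 45 bytes of input per output line.
    lines = [binascii.b2a_uu(compressed[i:i + 45])
             for i in range(0, len(compressed), 45)]
    return b"".join(lines)


def unpack_parts(encoded):
    """Undo the three layers in reverse order."""
    compressed = b"".join(binascii.a2b_uu(line)
                          for line in encoded.splitlines(keepends=True))
    tar_bytes = gzip.decompress(compressed)
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as archive:
        return {member.name: archive.extractfile(member).read()
                for member in archive.getmembers()}


# Hypothetical parts: one SGML document and one blob of 8 bit data.
parts = {"index.sgml": b"<TITLE>Example</TITLE>\n", "picture.bin": bytes(range(256))}
encoded = pack_parts(parts)
restored = unpack_parts(encoded)
```

Only the last step knows anything about mail; the wrapper and the
compression work on plain bytes and would be unchanged over a binary link.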

Of course, with HTTP2, multipart/alternative shouldn't
be needed.

Multipart for hypertext?

Now, Dan not only suggests the use of this for
multipart messages, but also suggests that a hypertext
document should necessarily contain many parts,
one in SGML and one for each link as a MIME external document.
This means that an SGML hypertext document can never stand
on its own! An SGML parser will always need to have
a MIME parser sitting just outside. I don't like
this: I feel we have to separate these two things.

Suppose that an SGML document does want to
be sent in a MIME message and does want to
refer to other parts of that MIME message. In that case,
it seems reasonable to have a format for that.
However, when an SGML document is seen by itself, and
refers to a news message for example, then there is
no reason for it not to be able to contain a
complete reference within itself.

When SGML documents include other files, then
the SYSTEM value is typically a file name.
It is a reference to something outside. The
precedent is set that SGML documents are allowed
to refer to things outside.

I think part of your objection, Dan, is based on
a dislike of the UDI syntax -- which I'll come to later.

3. The MIME format for rich text.

Here, I am not so impressed. Basically, the MIME
people are at the same level that we were at before we started
this cleanup, in that they have SGML-LIKE stuff which isn't SGML.
As it's not difficult to make it SGML, they should do that.
Comparing MIME's rich text and HTML, I see that
we lack the character formatting attributes BOLD and ITALIC,
but on the other hand I feel that our treatment of
logical heading levels and other structures is much more powerful
and has turned out to provide more flexible formatting
on different platforms than explicit semi-references
to font sizes. This is borne out by all the systems which
use named styles in preference to explicit formatting,
LaTeX or other macros instead of TeX, etc etc.

So technically, HTML has some things to give MIME's rich
text. Are the MIME people still open to additions?
If not, I would suggest we add BOLD and ITALIC (or
two emphasis styles for characters), and keep HTML
separate from MIME's rich text, proposing it as a
MIME text standard.
(HP0 and HP1 were in the HTML spec, but marked as unimplemented.)

4. The MIME format for external document addresses (MIME UDIs)

As Ed <emv@msen.com> says, this is a bit of a non-issue,
as MIME addresses and current style UDIs map onto
each other. However, we have to agree on a "concrete
syntax" (or two... :-) in the end.

It's like the difference between an x400 style mail address
generated from an internet address, and that internet address.
Which do you prefer:

timbl@zippy.lcs.mit.edu

where the sections of the domain name are defined
to have no semantics at all, or

S=timbl; HO=zippy; OU=lcs; O=MIT; SECTOR=edu

(this is not real x400 - don't use it!) or

user=timbl
host=zippy
group=lcs
organization=mit
sector=education

You say, Dan, that you "don't think [UDIs] work".
Do you mean people don't use them in all correspondence?
Well, what DO they use? They use ange-ftp addresses
for FTP (like info.cern.ch:/pub/www/doc/*.ps),
which are even more terse than UDIs! They use news
message-ids, which are UDIs.

Let me say that I personally don't much care about the
arbitrary punctuation. There are a few things, though,
which are important:

- The thing should be printable 7-bit ASCII.

Unlike arbitrary document formats,
UDIs must be sendable in the mail.

- White space should not be significant. I would
accept the presence of some arbitrary white space
as a delimiter, but one cannot distinguish between
different forms and quantities of white space.
This is because things get wrapped and unwrapped.

Dan, you object to UDIs because they don't
contain white space. But that is purely so that
you CAN wrap them onto several lines and still
recover them. You can put white space
in but it shouldn't mean anything. (This is not possible
in W3 as is, but it is in the UDI document.)
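
A minimal sketch of that rule in Python, assuming an address written
in the punctuation-separated style (the function name and the example
address are illustrative only): since white space carries no meaning,
a wrapped address is recovered by simply dropping all of it.

```python
def unwrap_udi(text: str) -> str:
    """Drop all white space, whatever its form or quantity."""
    return "".join(text.split())


# Illustrative address, wrapped by a mailer onto two indented lines.
wrapped = "http://info.cern.ch/hypertext/\n    WWW/TheProject.html"
print(unwrap_udi(wrapped))
```

This only works because the address itself never needs a significant
space; if it did, wrapping and unwrapping could not be undone reliably.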

I don't see why you say they
can't be put as an SGML attribute. They are just
text strings. They will be quoted, of course.
(Yes, I know the old NeXT browser doesn't quote them.)
Is that not allowed? What are the problem characters?
If there are SGML problem characters in the UDI spec, they
probably are ruled out of SGML for a reason.
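
As a sketch of what quoting could look like, the Python below writes
an address as a quoted attribute value. It borrows html.escape, which
targets present-day HTML rather than SGML in general, and it assumes
the DTD declares the entities amp and quot; the attribute name HREF,
the helper name and the example address are illustrative only.

```python
from html import escape, unescape


def as_quoted_attribute(name: str, udi: str) -> str:
    """Write name="value", replacing the characters that would end or confuse the literal."""
    return f'{name}="{escape(udi, quote=True)}"'


# Hypothetical address containing two likely problem characters: & and ".
udi = 'http://example.org/search?who=tim&what="W3"'
attribute = as_quoted_attribute("HREF", udi)
print(attribute)

# Reading the value back recovers the original text string.
value = attribute[len('HREF="'):-1]
recovered = unescape(value)
```

The problem characters here are the quote that delimits the literal
and the ampersand that opens an entity reference; everything else
passes through untouched.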

(I recently saw a galley proof of an article in which
our mail address had been hyphenated! UDIs must be
squeezable into 2 inch columns.)

There is a semantic difference between a tagged
list and a punctuation-divided set, and that is that
the former has defined semantics but the latter doesn't and
can therefore be extended more easily. I suggest that tagging
could be used for the four bits of an address
that must be separable by all sides, which are
limited in number (4). Within those bits, the string should
be transparent as the protocol does not require
every party to understand the innards.

The bits are:

                   MIME         Used by

  name space:      ACCESS       client
  server details:  HOST, PORT   client, protocol-dependent
  local doc id:    PATH         server only
  anchor id:       (none)       presentation application only

It seems useful to maintain the ability to work out which
bits are seen by whom.
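
To make the four bits concrete, here is a small Python sketch. It
assumes an address in the punctuation-separated style that
urllib.parse.urlsplit understands; the type name, field names and
example address are hypothetical, chosen to mirror the list above.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class AddressParts(NamedTuple):
    name_space: str      # MIME ACCESS; used by the client
    server_details: str  # MIME HOST, PORT; used by the client, protocol-dependent
    local_doc_id: str    # MIME PATH; used by the server only
    anchor_id: str       # no MIME equivalent; used by the presentation application only


def split_address(udi: str) -> AddressParts:
    """Separate the four bits without interpreting the inside of any of them."""
    pieces = urlsplit(udi)
    return AddressParts(pieces.scheme, pieces.netloc, pieces.path, pieces.fragment)


parts = split_address("http://info.cern.ch:80/hypertext/WWW/TheProject.html#intro")
print(parts)
```

Each field is left as an opaque string, so a client can pick out the
server details without understanding the local document id, and the
server never needs to see the anchor.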

I only used punctuation to separate these parts in the W3 UDI
because people like internet addresses and mail addresses
and filenames and telephone numbers and message-ids and
room numbers and zip codes, which don't have tags and
do make do with punctuation. If the groundswell of
opinion on this list is that tags are better, then
let's use tags!

Whatever we use, it should be as quotable in an SGML
attribute as in a MIME external reference as in a
scribbled note or a link-pasteboard or whatever.
(The U is for Universal, NOT Unique!)

PHILOSOPHY

In the W3 world, the model is of a dynamic world of
documents which generally have some "home"
(or several), which can be found using sufficient
intelligence and the help of one's friends, given the UDI.

A mail message has no home, and so in principle the parts
of it have no home. When a hypertext multipart message
(really consisting of multiple hypertext documents)
has links between its parts, they refer to each other
within a completely isolated context.

There are now two possibilities when the message is in fact
archived and made readable. One is to say that the parts
are then addressed as parts of the message, wherever it
may be. The other is to say that the parts of the message
are very likely things which had some original home.
In that case, the message is just giving the receiver
a copy to save him the (perhaps insurmountable) trouble
of retrieving it. In this case the parts should be
identified with their original UDIs so that the
receiver is not confused by multiple documents which
are in fact the same thing.

I think that's all the comments I have on what I've read so far.

Tim
________________________________________________________________
Tim Berners-Lee
World-Wide Web initiative
CERN, 1211 Geneva 23, Switzerland timbl@info.cern.ch
Visiting MIT: NE43-513, (617)234 6016 timbl@zippy.lcs.mit.edu
