MIME as a hypertext architecture

Dan Connolly (connolly@pixel.convex.com)
Sat, 06 Jun 92 00:53:20 CDT


NOTE: This message uses existing and proposed MIME structuring
conventions. Some parts of it may look strange on pre-MIME viewers.

---

The WWW project needs an architecture for interchange of structured multimedia hypertext documents. The original architecture, HTML, introduced some structuring conventions and a way of specifying hypertext links.

The HTML format is under stress from several issues:

* We need an SGML DTD so that we can parse HTML using something besides the public implementation of WWW, and so that we can verify documents converted from other authoring systems such as GNU info, Andrew's EZ, or FrameMaker.

* We need to be able to distribute documents and document elements in other formats, including raw 8 bit data streams. The SGML NOTATION feature falls short of providing an adequate mechanism.

* The UDI syntax doesn't match the SGML attribute syntax. There are problems with quoting out-of-band characters, and the length of complex UDI's may exceed SGML limits and/or line-length limits of transport mechanisms. Also, the terse syntax of UDI's conflicts with the goal that they be human-readable.

This is a proposed architecture for global hypertext, addressing the issues raised by the WWW project, but using the MIME architecture.

We define a new subtype of the MIME multipart content type called x-HTDOC. The syntax is the same as multipart/mixed, but the semantics are those of a WWW client: the first part is displayed, and the rest represent links to other documents or other elements of this document.
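
The display rule can be sketched in a few lines. This is only an illustration using Python's standard `email` package, not WWW code; the `split_htdoc` helper and the two-part sample message are assumptions made up for the example.

```python
from email import message_from_string

# Hypothetical sample: one displayed part followed by one link part.
SAMPLE = """\
Content-Type: multipart/x-HTDOC; boundary=cut-here

--cut-here
Content-Type: text/x-HTML

<TITLE>CERN Information</TITLE>
--cut-here
Content-Id: QuickGuide.html
Content-Type: message/external-body; access-type=x-relative; name="QuickGuide.html"

Content-Type: message

--cut-here--
"""


def split_htdoc(raw):
    """Return (displayed_part, link_parts) for a multipart/x-HTDOC message."""
    msg = message_from_string(raw)
    # The parser lowercases content types, so compare in lower case.
    if msg.get_content_type() != "multipart/x-htdoc" or not msg.is_multipart():
        raise ValueError("not a multipart/x-HTDOC message")
    parts = msg.get_payload()
    # First part is what the client shows; the rest are link targets.
    return parts[0], parts[1:]


displayed, links = split_htdoc(SAMPLE)
print(displayed.get_content_type())
print([part["Content-Id"] for part in links])
```

Because the syntax is that of multipart/mixed, a generic MIME parser can already split the message; only the "first part is the document" rule is specific to x-HTDOC.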

Then we define a new subtype of the MIME text content type called x-HTML. This is an SGML markup language using the default SGML declaration (i.e. the reference concrete syntax, default processing limits, etc.) and the HTML DTD (included below).

---

<!-- This DTD was produced by DeveGram on Tue Jun 2 18:58:16 1992 -->
<!-- and hand-edited by connolly@convex.com -->

<!-- Parameter Entities -->

<!-- Terminal symbols -->

<!ENTITY % words "#PCDATA" >

<!-- Non-ELEMENT symbols -->

<!ENTITY % inline "%words | A" >
<!ENTITY % text "%inline | P" >
<!ENTITY % heading "H1|H2|H3|H4|H5|H6" >

<!ENTITY lt "<">
<!ENTITY gt ">">
<!ENTITY amp "&">

<!ENTITY lt. "<">
<!ENTITY gt. ">">
<!ENTITY amp. "&">

<!-- Document structure -->

<!ELEMENT html O O (TITLE, NEXTID?, ISINDEX?, section+, ADDRESS?)>

<!ELEMENT TITLE - - (%inline)+>
<!ELEMENT ADDRESS - - (%text)+>

<!ELEMENT NEXTID - O EMPTY >
<!ATTLIST NEXTID N NUMBER #IMPLIED>

<!ELEMENT ISINDEX - O EMPTY >

<!ELEMENT section O O ((%heading)?, ( %text | section | MENU | UL | OL | DIR | DL)+)>

<!ELEMENT (H1|H2|H3|H4|H5|H6) - - (%inline) >

<!ELEMENT P - O EMPTY -- paragraph SEPARATOR -->

<!ELEMENT A - - (%inline)+>
<!ATTLIST A
    NAME CDATA #IMPLIED
    PART ENTITY #IMPLIED >

<!ELEMENT MENU - - (LI+)>

<!ELEMENT UL - - (LI+)>

<!ELEMENT OL - - (LI+)>

<!ELEMENT DIR - - (LI+)>

<!ELEMENT LI - O (%text)+>

<!ELEMENT DL - - ((DT, DD)+)>

<!ELEMENT DT - O (%inline)+>

<!ELEMENT DD - O (%text)+>

---

An HTML document would use external entities to reference other parts of the multipart message. The system identifier matches the Content-Id field of the intended part. The content-type of the indicated part could be image, audio, or video for multimedia inclusions; text for quotes and similar inclusions; or message/external-body for references to other documents.
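
As a rough sketch of that lookup, the following Python reads the entity declarations out of the first part and finds the part whose Content-Id matches. It follows the sample message in declaring the entities as SDATA; the regular expressions, the `resolve_links` helper, and the tiny sample are illustrative assumptions, and a real client would use an SGML parser rather than pattern matching.

```python
import re
from email import message_from_string

# Hypothetical two-part message in the style proposed here.
SAMPLE = """\
Content-Type: multipart/x-HTDOC; boundary=cut-here

--cut-here
Content-Type: text/x-HTML

<!DOCTYPE HTML SYSTEM [
<!ENTITY part1 SDATA "QuickGuide.html">
]>
<TITLE>Demo</TITLE>
<A PART="part1">Help</A>
--cut-here
Content-Id: QuickGuide.html
Content-Type: message/external-body; access-type=x-relative; name="QuickGuide.html"

Content-Type: message

--cut-here--
"""

# Entity name and the identifier it stands for.
ENTITY_RE = re.compile(r'<!ENTITY\s+(\S+)\s+SDATA\s+"([^"]*)"\s*>')
# Value of the PART attribute on an anchor.
ANCHOR_RE = re.compile(r'<A\s[^>]*PART\s*=\s*"?([\w.-]+)"?', re.IGNORECASE)


def resolve_links(raw):
    """Map each PART entity used in the first part to its MIME part."""
    parts = message_from_string(raw).get_payload()
    body = parts[0].get_payload()
    entities = dict(ENTITY_RE.findall(body))
    by_content_id = {part["Content-Id"]: part for part in parts[1:]}
    resolved = {}
    for name in ANCHOR_RE.findall(body):
        # None means no part carries the Content-Id the entity names.
        resolved[name] = by_content_id.get(entities.get(name))
    return resolved


links = resolve_links(SAMPLE)
target = links["part1"]
print(target.get_content_type(), target.get_param("access-type"))
```

The indirection is the point of the design: the anchor carries only a short entity name, so long identifiers never appear inside SGML attribute values.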

MIME defines access-types for local-file and anon-ftp. We could define x-HTTP, x-NEWS, x-WAIS, and the other UDI access types.
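
To make the mapping concrete, here is a minimal Python sketch that splits a UDI into message/external-body parameters, covering only the http, gopher, and relative cases. The parameter names (site, port, name, type, selector) follow the sample message in this note; `udi_to_params` and `format_header` are hypothetical helpers, not part of any MIME library. Values are quoted on output because characters such as "/" and "?" are tspecials in MIME parameter syntax.

```python
import re
from urllib.parse import unquote

# Optional "scheme:", optional "//site[:port]", then whatever is left.
UDI_RE = re.compile(
    r"^(?:(?P<scheme>\w+):)?"
    r"(?://(?P<site>[^:/]+)(?::(?P<port>\d+))?)?"
    r"(?P<rest>.*)$",
    re.DOTALL,
)


def udi_to_params(udi):
    """Return message/external-body parameters as (name, value) pairs."""
    m = UDI_RE.match(udi)
    scheme = (m.group("scheme") or "").lower()
    rest = m.group("rest")
    if scheme == "":
        # No scheme: a reference relative to the current document.
        return [("access-type", "x-relative"), ("name", unquote(udi))]
    if scheme not in ("http", "gopher"):
        raise ValueError("access type not covered by this sketch: " + scheme)
    params = [("access-type", "x-HTTP" if scheme == "http" else "x-gopher")]
    if m.group("site"):
        params.append(("site", m.group("site")))
    if m.group("port"):
        params.append(("port", m.group("port")))
    if scheme == "http":
        params.append(("name", unquote(rest)))
    else:
        # Gopher paths start with a numeric type, then the selector.
        typed = re.match(r"^/(\d+)/(.*)$", rest, re.DOTALL)
        if typed:
            params.append(("type", typed.group(1)))
            rest = typed.group(2)
        params.append(("selector", unquote(rest)))
    return params


def format_header(params):
    """Render the pairs as a folded Content-Type header, quoting each value."""
    lines = ["Content-Type: message/external-body"]
    for key, value in params:
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append('\t;%s="%s"' % (key, escaped))
    return "\n".join(lines)


print(format_header(udi_to_params(
    "gopher://gopher.micro.umn.edu:70/11/Other%20Gopher%20and%20Information%20Servers")))
```

Splitting the UDI into named parameters is what sidesteps the quoting and line-length problems listed earlier: each piece sits on its own folded header line and can be quoted independently.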

Within HTML documents, SGML IDREFs and IDs are used to reference and define elements of a document. (I think HYTIME defines a way to reference elements without explicit IDs.)

The next part of this message is a default.html from the WWW distribution adapted to use the conventions here.

It should interoperate with existing MIME systems, though they will not be able to do anything intelligent with HTML.

---
Content-Type: multipart/x-HTDOC; boundary=cut-here

--cut-here
Content-Type: text/x-HTML

<!DOCTYPE HTML SYSTEM [
<!ENTITY part1 SDATA "QuickGuide.html">
<!ENTITY part2 SDATA "http://info.cern.ch/hypertext/WWW/TheProject.html">
<!ENTITY part3 SDATA "http://crnvmc.cern.ch./WHO">
<!ENTITY part4 SDATA "http://crnvmc.cern.ch./FIND/yellow?">
<!ENTITY part5 SDATA "http://crnvmc.cern.ch./FIND/jaune?">
<!ENTITY part6 SDATA "http://crnvmc.cern.ch./FIND">
<!ENTITY part7 SDATA "http://crnvmc.cern.ch/NEWS/?">
<!ENTITY part8 SDATA "http://crnvmc.cern.ch./NEWS/cern">
<!ENTITY part9 SDATA "http://crnvmc.cern.ch./NEWS/vmnews">
<!ENTITY part10 SDATA "http://crnvmc.cern.ch/NEWS/student">
<!ENTITY part11 SDATA "http://info.cern.ch/hypertext/DataSources/NewsFromVM/Overview.html">
<!ENTITY part12 SDATA "http://info.cern.ch/hypertext/DataSources/News/Overview.html">
<!ENTITY part13 SDATA "http://info.cern.ch/hypertext/DataSources/bySubject/Overview.html">
<!ENTITY part14 SDATA "http://info.cern.ch./hypertext/DataSources/Overview.html">
<!ENTITY part15 SDATA "http://slacvm.slac.stanford.edu./FIND/spires">
<!ENTITY part16 SDATA "http://crnvmc.cern.ch/FIND/DESY?">
<!ENTITY part17 SDATA "http://info.cern.ch:8001/archive.orst.edu:9000/archie-orst.edu">
<!ENTITY part18 SDATA "http://iicm.tu-graz.ac.at./jargon">
<!ENTITY part19 SDATA "http://info.cern.ch./hypertext/Products/WAIS/Sources/Overview.html">
<!ENTITY part20 SDATA "http://info.cern.ch/rpc/doc/User/UserGuide.html">
<!ENTITY part21 SDATA "http://otax.tky.hut.fi/tky/default.html">
<!ENTITY part22 SDATA "gopher://gopher.micro.umn.edu:70/11/Other%20Gopher%20and%20Information%20Servers">
<!ENTITY part23 SDATA "http://info.cern.ch./hypertext/WWW/LineMode/Defaults/default.html">
]>
<TITLE>CERN Information</TITLE>
<NEXTID N=10>
<SECTION><H1>CERN Information - Select by number</H1>
<DL>
<DT><A PART="part1">Help</A>
<DD>On this program, or the <A PART="part2">World-Wide Web project</A>.
<DT><A PART="part3" NAME=2>Phone book</A>
<DD>People, phone numbers, accounts and email addresses.
See also the analytical <A PART="part4" NAME=yellow>Yellow Pages</A>,
or the same index in French : <A PART="part5" NAME=jaune>Pages Jaunes</A>.
<DT><A PART="part6" NAME=1>"XFIND" index</A>
<DD>Index of computer centre documentation, newsletters, news, help files, etc...
<DT><A PART="part7" NAME=groups>News</A>
<DD>A complete list of all public CERN news groups, such as
<A PART="part8" NAME=3>news from the CERN User's Office</A>,<A PART="part9" NAME=4> CERN computer center news</A>,<A PART="part10"> student news</A>.
See also <A PART="part11" NAME=5>private groups</A> and <A PART="part12" NAME=inews>Internet news</A>.
</dl>
</section>
<section>
<SECTION><H2>From other sites</h2>
See online data by <A PART="part13" NAME=subject>subject</A>,
pointers to <A PART="part14">other forms of online data</a>,
and the following specific databases:
<DL>
<DT><A PART="part15" NAME=spires>SLAC SPIRES</A>
<DD>The High Energy Physics preprint index at Stanford Linear Accelerator, California.
(This is the same information available via the QSPIRES facility on BITNET.
Include the word "FIND" as the first keyword, eg: K FIND AUTHOR FRED.).
<DT><A PART="part16" NAME=desy>DESY documents</a>
<DD>Documents and help files from the DESY lab in Hamburg.
<DT><A PART="part17" NAME=archie> Archie</a>
<DD>An index of almost everything available by "anonymous FTP".
<DT><A PART="part18" NAME=7>Hacker Jargon</a>
<DD>An index to a cross-referenced set of hacker terms.
A demonstration of the WWW gateway to the Graz Technical University Hyper-G database.
<DT><A PART="part19" NAME=9>W.A.I.S.</a>
<DD>All kinds of information available from "Wide Area Information Servers".
<DT><A PART="part20" NAME=6>CERN RPC</A>
<DD>The user guide for the RPC system developed in CERN CN division (not Sun/RPC).
This is an example of documentation (partially) converted into hypertext.
<DT><A PART="part21" NAME=hut>Helsinki</a>
<DD>Helsinki Technical University information service (Mostly Finnish).
<DT><A PART="part22" NAME=gopher>Gophers</a>
<DD>Campus-wide information systems using "Gopher" software.
(Requires www version 1.1 or higher)
</DL>
(This page may be an out of date copy. See the <A PART="part23" NAME=latest>latest version</a>.)

--cut-here
Content-id: QuickGuide.html
Content-type: message/external-body
    ;access-type=x-relative
    ;name="QuickGuide.html"

Content-Type: message

--cut-here
Content-id: http://info.cern.ch/hypertext/WWW/TheProject.html
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=info.cern.ch
    ;name=/hypertext/WWW/TheProject.html

Content-Type: message

--cut-here
Content-id: http://crnvmc.cern.ch./WHO
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=crnvmc.cern.ch.
    ;name=/WHO

Content-Type: message

--cut-here
Content-id: http://crnvmc.cern.ch./FIND/yellow?
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=crnvmc.cern.ch.
    ;name=/FIND/yellow?

Content-Type: message

--cut-here
Content-id: http://crnvmc.cern.ch./FIND/jaune?
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=crnvmc.cern.ch.
    ;name=/FIND/jaune?

Content-Type: message

--cut-here
Content-id: http://crnvmc.cern.ch./FIND
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=crnvmc.cern.ch.
    ;name=/FIND

Content-Type: message

--cut-here
Content-id: http://crnvmc.cern.ch/NEWS/?
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=crnvmc.cern.ch
    ;name=/NEWS/?

Content-Type: message

--cut-here
Content-id: http://crnvmc.cern.ch./NEWS/cern
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=crnvmc.cern.ch.
    ;name=/NEWS/cern

Content-Type: message

--cut-here
Content-id: http://crnvmc.cern.ch./NEWS/vmnews
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=crnvmc.cern.ch.
    ;name=/NEWS/vmnews

Content-Type: message

--cut-here
Content-id: http://crnvmc.cern.ch/NEWS/student
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=crnvmc.cern.ch
    ;name=/NEWS/student

Content-Type: message

--cut-here
Content-id: http://info.cern.ch/hypertext/DataSources/NewsFromVM/Overview.html
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=info.cern.ch
    ;name=/hypertext/DataSources/NewsFromVM/Overview.html

Content-Type: message

--cut-here
Content-id: http://info.cern.ch/hypertext/DataSources/News/Overview.html
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=info.cern.ch
    ;name=/hypertext/DataSources/News/Overview.html

Content-Type: message

--cut-here
Content-id: http://info.cern.ch/hypertext/DataSources/bySubject/Overview.html
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=info.cern.ch
    ;name=/hypertext/DataSources/bySubject/Overview.html

Content-Type: message

--cut-here
Content-id: http://info.cern.ch./hypertext/DataSources/Overview.html
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=info.cern.ch.
    ;name=/hypertext/DataSources/Overview.html

Content-Type: message

--cut-here
Content-id: http://slacvm.slac.stanford.edu./FIND/spires
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=slacvm.slac.stanford.edu.
    ;name=/FIND/spires

Content-Type: message

--cut-here
Content-id: http://crnvmc.cern.ch/FIND/DESY?
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=crnvmc.cern.ch
    ;name=/FIND/DESY?

Content-Type: message

--cut-here
Content-id: http://info.cern.ch:8001/archive.orst.edu:9000/archie-orst.edu
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=info.cern.ch
    ;port=8001
    ;name=/archive.orst.edu:9000/archie-orst.edu

Content-Type: message

--cut-here
Content-id: http://iicm.tu-graz.ac.at./jargon
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=iicm.tu-graz.ac.at.
    ;name=/jargon

Content-Type: message

--cut-here
Content-id: http://info.cern.ch./hypertext/Products/WAIS/Sources/Overview.html
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=info.cern.ch.
    ;name=/hypertext/Products/WAIS/Sources/Overview.html

Content-Type: message

--cut-here
Content-id: http://info.cern.ch/rpc/doc/User/UserGuide.html
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=info.cern.ch
    ;name=/rpc/doc/User/UserGuide.html

Content-Type: message

--cut-here
Content-id: http://otax.tky.hut.fi/tky/default.html
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=otax.tky.hut.fi
    ;name=/tky/default.html

Content-Type: message

--cut-here
Content-id: gopher://gopher.micro.umn.edu:70/11/Other%20Gopher%20and%20Information%20Servers
Content-type: message/external-body
    ;access-type=x-gopher
    ;site=gopher.micro.umn.edu
    ;port=70
    ;type=11
    ;selector="Other Gopher and Information Servers"

Content-Type: message

--cut-here
Content-id: http://info.cern.ch./hypertext/WWW/LineMode/Defaults/default.html
Content-type: message/external-body
    ;access-type=x-HTTP
    ;site=info.cern.ch.
    ;name=/hypertext/WWW/LineMode/Defaults/default.html

Content-Type: message
--cut-here--

---

Here's the perl script I used to convert default.html into the above message. It's full of gross hacks, but it worked this evening.

---

#!/usr/local/bin/perl

print "Content-Type: multipart/x-HTDOC; boundary=cut-here\n\n";
print "--cut-here\n";
print "Content-Type: text/x-HTML\n\n";
print "<!DOCTYPE HTML SYSTEM \n[\n";

$o = 0;      # level of the most recent heading
$/ = ">";    # read the input one tag at a time

while(<>){
    s/(<A[^>]*>)/&fix_anchor($1)/ige;
    s/<NEXTID\s*(\d*)\s*>/<NEXTID N=$1>/g;
    if(/<H(\d)/){
        # Open a SECTION before each heading, closing the previous
        # one unless this heading is nested more deeply.
        local($n) = $1;
        if($n>$o) {
            $rep = "<SECTION>";
        } else {
            $rep = "</SECTION><SECTION>";
        }
        s/(<H\d)/$rep$1/g;
        $o = $n;
    }
    $doc .= $_;
}

# @anchors holds (id, href, name, type) quadruples; declare one
# entity per link in the DOCTYPE internal subset.
@entities = @anchors;
while(@entities){
    local($id) = shift(@entities);
    local($_) = shift(@entities);
    local($name) = shift(@entities);
    local($type) = shift(@entities);

    print "<!ENTITY part$id SDATA \"$_\">\n";
}

print "]>\n", $doc;

# Emit one message/external-body part per link.
while(@anchors){
    local($id) = shift(@anchors);
    local($_) = shift(@anchors);
    local($name) = shift(@anchors);
    local($type) = shift(@anchors);
    local($access_type);

    print "\n\n--cut-here\n";
    print "Content-id: $_\n";
    print "Content-type: message/external-body\n";

    $access_type = $1 if s/^(\w+)://;
    if(s/#([^#]+)$//){
        print "\t;x-element-id=\"$1\"\n";
    }

    if($access_type =~ /file/i){
        print "\t;access-type=LOCAL-FILE\n";
        print "\t;name=$_\n";
    }elsif($access_type =~ /http/i){
        print "\t;access-type=x-HTTP\n";
        if(s-//([^:/]+)--){
            print "\t;site=$1\n";
            print "\t;port=$1\n" if s/^:(\d+)//;
        }
        &unescape;
        print "\t;name=$_\n";
    }elsif($access_type =~ /news/i){
        print "\t;access-type=x-news\n";
        &unescape;
        if(/@/){
            print "\t;message-id=$_\n";
        }else{
            print "\t;group=$_\n";
        }
    }elsif($access_type =~ /telnet/i){
        print "\t;access-type=x-telnet\n";
        &unescape;
        print "\t;user=$1\n" if s/^(.*)@//;
        print "\t;port=$1\n" if s/:(.*)$//;
        print "\t;site=$_\n";
    }elsif($access_type =~ /gopher/i){
        print "\t;access-type=x-gopher\n";
        if(s-^//([^:/]+)--){
            print "\t;site=$1\n";
            print "\t;port=$1\n" if s/:(\d+)//;
        }
        print "\t;type=$1\n" if s-^/(\d+)/--;
        &unescape;
        print "\t;selector=\"$_\"\n";
    }elsif($access_type =~ /wais/i){
        print "\t;access-type=x-wais\n";
        if(s-//([^:/]+)--){
            print "\t;site=$1\n";
            print "\t;port=$1\n" if s/:(\d+)//;
        }
        if(m-^/-){
            print "\t;type=$1\n" if s-^/(\w+)--;
            print "\t;size=$1\n" if s-^/(\d+)--;
            &unescape;
            print "\t;path=\"$_\"\n";
        }else{
            &unescape;
            print "\t;words=\"$1\"\n" if /\?(.*)/;
        }
    }elsif($access_type eq ""){
        print "\t;access-type=x-relative\n";
        &unescape;
        print "\t;name=\"$_\"\n";
    }else{
        warn "unknown access type: $access_type in $_";
    }

    print "\nContent-Type: message\n";
}

print "--cut-here--\n";

# Decode %XX escapes in $_.
sub unescape{
    s/%(\w\w)/sprintf("%c",hex($1))/ge;
}

# Rewrite <A HREF=...> as <A PART="partN"> and record the link.
sub fix_anchor{
    local($_) = @_;
    local($name, $href, $type, $id);

    $href = $1 if /HREF\s*=\s*(\S+)/i;
    return $_ unless $href;
    $href =~ s/>$//;

    $name = $1 if /NAME\s*=\s*(\S+)/i;
    $type = $1 if /TYPE\s*=\s*(\S+)/i;
    # Drop the closing ">" when the attribute is last in the tag.
    $name =~ s/>$//;
    $type =~ s/>$//;

    # Allocate one part number per distinct HREF, so a repeated
    # link reuses its entity instead of the most recent number.
    unless($content_id{$href}){
        $content_id{$href} = ++$content_id;
        push(@anchors, $content_id{$href}, $href, $name, $type);
    }
    $id = $content_id{$href};

    local($ret) = "<A PART=\"part$id\"";
    $ret .= " NAME=$name" if $name;
    $ret .= ">";
    return $ret;
}

-----