Re: Different character sets in one HTML document

Daniel W. Connolly (connolly@hal.com)
Fri, 24 Jun 1994 13:26:43 -0500


Multi-Language HTML Documents
===============================

First, I'd like to thank Mr. van Zee for teaching me a whole bunch
of stuff about character sets that I've been curious about for some
time.

HTML and SGML
---------------

Second, I'll preface my response to his proposals with the
underlying assumption that it is a requirement of HTML that

A HTML document shall be a conforming SGML document.
(as per ISO 8879, definition 4.51 and section 15.1)

I just spent some time wading through the (excellent) comp.text.sgml
archive. For mucky details about SGML and character sets, see, for
example, Erik Naggum's Q&A on Character Sets_:

Newsgroups: comp.text.sgml
From: Erik Naggum <erik@naggum.no>
Message-ID: <23160A@erik.naggum.no>
Date: 22 May 1992 07:06:51 UT
Subject: Character Sets Q&A, Part 2

I've also been reading the TEI Guidelines_, including their stuff on
writing systems and interchange problems with character sets.

I've come across so much stuff that I don't know that I finally
picked up _Understanding_Japanese_Information_Processing_ by Ken
Lunde (O'Reilly & Associates, ISBN 1-56592-043-0).

About the Proposals
---------------------

Now, about Mr. van Zee's proposals:

In message <9406231309.ZM1164@hpcvusm.cv.hp.com>, "Pieter van Zee" writes:
>My objective is to support multi-lingual content, i.e. to move
>away from the assumption that the entire content of an HTML file
>is in a single charset.

Well, as far as SGML is concerned, the entire content of an "HTML
file" _must_ be in a single charset. That charset might support
multiple languages through ISO2022 style escape mechanisms to
different graphic code sets, but it's still one character set, in
SGML terminology.
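
To make that concrete, here is a minimal sketch in Python (the language and
its "iso2022_jp" codec are my choice for illustration, not something from the
thread). It shows a single byte stream that switches graphic sets with escape
sequences while remaining one character set, ISO-2022-JP:

```python
# Sketch: one charset (ISO-2022-JP), two graphic sets selected by escapes.
text = "HTML \u65e5\u672c\u8a9e"   # ASCII followed by three kanji
data = text.encode("iso2022_jp")

ESC_TO_JIS = b"\x1b$B"     # escape sequence designating JIS X 0208
ESC_TO_ASCII = b"\x1b(B"   # escape sequence returning to ASCII

print(data)
# Every byte stays below 0x80, so the stream is safe for 7bit transport.
print(all(b < 0x80 for b in data))
```

The escapes are part of the encoding itself, so a parser sees one charset
from the first byte to the last.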

>Assuming we agree that HTML documents need to support
>multi-lingual content, let's discuss how this might occur. I ran
>the following by our i18n guru to verify my comments.
>
>The phrase "specifying the ISO 2022 mechanism at the MIME level"
>isn't exactly clear to me. I'll take it to mean that whenever a
>HTML document is encapsulated as a MIME object for transport, the
>document must use ISO 2022 encoding for its content.
>
>Let's generalize and call this:
>
> Strategy (a): a HTML document has only ISO 2022-encoded content.

Well, let me attempt to clarify, and suggest this wrinkle on this
strategy (a):

HTML and MIME for the Western European Writing System
-------------------------------------------------------

Currently, there is an implicit SGML declaration_ shared by all HTML
documents that specifies ISO8859-1 as the document character set. So
currently, the HTML spec "specifies ISO 8859-1 at the MIME level";
that is, the conventional HTTP header:

Content-Type: text/html

might be considered short for:

Content-Type: text/html; charset="iso8859-1"
Content-Transfer-Encoding: binary

In addition, there are entities for each of the ISO8859 characters
that are not part of ISO646, so most HTML documents _can_ be written
in 7bit characters. So when most folks send HTML via mail, if they
write:

Content-Type: text/html

they are using the US-ASCII character set (the ISO646 subset of
ISO8859-1), ala:

Content-Type: text/html; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

<!DOCTYPE HTML PUBLIC "-//W3O//DTD WWW HTML 2.0//EN">
<title>german names in 7bit html</title>
<h1>Kurt G&ouml;del</h1>

(note that while individual 8bit characters in HTML can be converted
to 7bit representations, there are some HTML idioms that can't be
represented within the 72 character limit of the 7bit encoding, such
as very long words, very wide PRE lines, and long URLs)
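
A minimal sketch of that 8bit-to-7bit conversion, in Python. The function name
to_7bit_html is hypothetical, and I'm assuming the entity names in Python's
html.entities table match the ones in the HTML DTD for the ISO8859-1 range:

```python
from html.entities import codepoint2name

def to_7bit_html(text):
    """Replace every non-ASCII character with an entity reference.

    Markup characters such as '<' and '&' are left alone; this only
    deals with characters outside ISO646.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                          # already 7bit
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])  # named entity
        else:
            out.append("&#%d;" % cp)                 # numeric fallback
    return "".join(out)

print(to_7bit_html("Kurt G\u00f6del"))
```

Note that this does nothing about line length, so the caveat above about long
words, wide PRE lines, and long URLs still applies.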

HTML and MIME for Other Writing Systems
-----------------------------------------

It would make sense to me, then, to interpret these HTTP headers:

Content-Type: text/html; charset="ISO-2022-JP"
Content-Transfer-Encoding: binary

as meaning "the SGML declaration for this document specifies
ISO-2022-JP, rather than ISO-8859-1 as its document character set."
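
A rough sketch of how a client might apply that interpretation, in Python.
The helper name document_charset is hypothetical, and falling back to
ISO-8859-1 when no charset parameter is present is the convention described
above, not something the MIME library decides for us:

```python
from email import message_from_string

def document_charset(header_block, default="iso-8859-1"):
    """Return the charset named in a Content-Type header, or the default."""
    msg = message_from_string(header_block + "\n\n")
    return msg.get_content_charset(failobj=default)

headers = ('Content-Type: text/html; charset="ISO-2022-JP"\n'
           'Content-Transfer-Encoding: binary')
charset = document_charset(headers)

# Decode the entity body with whatever charset the headers selected.
body = "<title>\u65e5\u672c\u8a9e</title>".encode("iso2022_jp")
print(charset)
print(body.decode(charset))
```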

I don't yet know whether this strategy is sufficient to represent,
for example, a single HTML document containing English, Japanese,
Cyrillic, and Hebrew text. I'm guessing something like:

Content-Type: text/html; charset="ISO-10646"
Content-Transfer-Encoding: binary

might be sufficient, but a parser for such a document would be
vastly different from an ISO8859-1 based HTML parser, since even the
markup characters would be 2-byte characters. And I don't even know
if ISO-10646 (aka UNICODE) can be expressed in an SGML declaration.
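
To illustrate the parsing problem, a small Python sketch. I'm assuming a
big-endian 2-byte form here (Python's "utf-16-be" codec), which is only a
stand-in for whatever encoding of ISO-10646 would actually be used:

```python
# Sketch: the same start tag in a 1-byte and a 2-byte encoding.
tag = "<h1>"
one_byte = tag.encode("iso-8859-1")
two_byte = tag.encode("utf-16-be")   # assumed stand-in for 2-byte ISO-10646

print(one_byte)   # 4 bytes, one per markup character
print(two_byte)   # 8 bytes, each markup character padded with a zero byte

# A parser scanning for the single byte sequence b"<h1>" never finds it.
print(one_byte in two_byte)
```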

SGML can't represent per-element Character Sets
-------------------------------------------------

>And my proposal is:
>
> Strategy (b): every HTML element has optional LANG and CHARSET
> attributes which specify the locale of the element's data.
>
> In other words...A HTML document uses 7-bit ASCII for
> markup but may use any charset for content, and charset is
> specified in two ways: (i) an optional default charset for the
> document, and (ii) an optional charset attribute on every
> element that overrides the document default.
>
>What are the relative merits and pitfalls?

The major pitfall that I see is that changing character sets on a
per-element basis is not expressible in a conforming SGML document.

It's possible to hack things where the parser reports ISO8859-1
characters all the time, and we use NOTATIONs to represent other
writing systems. I'm not sure how that would work just yet, but I
believe the TEI Guidelines include a technique for doing this.

Development tools for Multilanguage Text
------------------------------------------

>Basically, with strategy (a), every program must know how to
>parse a ISO-2022 byte stream and map that to something meaningful
>on their platform.
...
>Also, although the ISO-2022 mechanism supports baseline charset
>specifications, it does not support higher-level specifications
>that combine two or more baseline charsets. These aggregate
>charsets, such as Japanese SJIS and EUC, are the charsets that
>users are exposed to and which have OS infrastructure support.

This is a compelling argument that I'm looking into. I'm interested
to know if there's an "over-the-wire" representation of
multi-language text that's widely supported by development tools.

For example, I checked the ANSI C standard, and while they specify
interfaces to translate between multibyte character and wide
character encodings, they don't specify either of the actual
encodings! So as far as ANSI C goes, there is no portable wide
character or multibyte character encoding.

I've been rooting around the Modula-3 documentation and trying to
find out how other distributed computing platforms do multi-language
text. I used to develop DCE software, and I can't remember their
approach.

It certainly would be a shame if the SGML standard conflicted with
the predominant technique for multilanguage text representation
supported by distributed applications tools.

...

Ok... after reading more of the O'Reilly book, it appears that there
are three predominant encodings of multilanguage text supported by
development tools:

* EUC (Extended Unix Code): specified in ISO2022. Supported by OSF,
Unix International, and USL.

* Shift-JIS: supported by Microsoft and Apple

* Unicode: specified by the Unicode Consortium, coincides with parts of ISO
10646. Supported by AT&T Plan 9 and PenPoint
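
For comparison, a quick Python sketch encoding the same three kanji each way.
The codec names ("euc_jp", "shift_jis", "utf_16_be") are Python's, and using
the Japanese flavour of EUC and a big-endian 2-byte Unicode form are my
assumptions for the example:

```python
# Sketch: one string, three encodings.
text = "\u65e5\u672c\u8a9e"

encoded = {codec: text.encode(codec)
           for codec in ("euc_jp", "shift_jis", "utf_16_be")}

for codec, data in encoded.items():
    print(codec, data.hex())

# EUC sets the high bit on each JIS X 0208 byte; clearing it gives the
# 7bit bytes that ISO-2022-JP sends between its escape sequences.
jis_7bit = bytes(b & 0x7F for b in encoded["euc_jp"])
print(jis_7bit)
```

The last step is why the two travel well together: converting between EUC
and the 7bit form is a matter of flipping one bit and adding escapes.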

My vote for the over-the-wire representation of HTML is EUC, with
support for ISO-2022-JP for 7bit transmission. Now: how do we spell
EUC in an SGML declaration?

.. _Sets ftp://ftp.ifi.uio.no/pub/SGML/comp.text.sgml/19920522/070651.Naggum
.. _Guidelines http://etext.virginia.edu/TEI.html
.. _declaration http://www.hal.com/%7Econnolly/html-spec/html.decl