Re: National characters...again

Richard L. Goerwitz (goer@mithra-orinst.uchicago.edu)
Thu, 2 Feb 1995 20:19:10 +0100


>One would have expected that HTML, from its very early days, would
>have provided the construct
> <CHARSET="XXX"> ...any_8_bit_characters... </CHARSET>
>where XXX could be Latin-2, Latin-3, Latin-4,...

SGML has no mechanism for doing this, so the word I keep hearing is
that we should strangle HTML with the same restrictions.

The fallacy I often hear uttered is that if we can stuff Unicode into
the MIME header as the charset, then we can avoid the problem of having
to define a CHARSET tag (since Unicode encompasses most national
characters). But this way of thinking is WRONG. Unicode doesn't provide
a mechanism for varying sort order and other things that vary according
to locale and language. THE UNICODE STANDARD ITSELF SAYS THAT ADDITIONAL
TAGS ARE NECESSARY for this sort of thing.
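
To make the sort-order point concrete, here is a rough Python sketch. The
two key functions are hand-written, simplified ordering rules that I am
assuming for illustration (German dictionary order filing umlauts with
their base letters, Swedish placing them after z); they are not taken from
the Unicode standard or from any real collation library. The strings, and
so their code points, are identical in both runs; only the language differs.

```python
# Hand-written, simplified ordering rules (assumptions for illustration only).

def german_key(word):
    # German dictionary ordering files ä, ö, ü with a, o, u and ß with ss.
    table = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
    return word.lower().translate(table)

SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def swedish_key(word):
    # Swedish places å, ä, ö after z; characters outside the table sort last.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_ALPHABET)) for ch in word.lower()]

# The same three strings, sorted under two different language rules.
words = ["zebra", "ärta", "arm"]
german_order = sorted(words, key=german_key)
swedish_order = sorted(words, key=swedish_key)

print(german_order)   # ['arm', 'ärta', 'zebra']
print(swedish_order)  # ['arm', 'zebra', 'ärta']
```

Nothing in the characters themselves says which ordering applies; that
information has to come from somewhere else, such as a language tag.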

So although offering Unicode or UTF-8 as a default charset is a good
idea, it does not do away with the need for LANG and CHARSET tags.

Just to do away with one other fallacy: You can't get by with just LANG
or just CHARSET tags. You need both. You can have two different charsets
in a single document (e.g., Shift-JIS and ISO 8859-1), and you can have
two different languages within the same charset (e.g., English and
German for ISO 8859-1; Urdu, Persian, and Arabic for Unicode - they all
use the same Unicode pages).
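
A small Python sketch of why the two labels are independent. The Segment
class and its field names are hypothetical, invented here for illustration;
they are not proposed HTML or SGML syntax. The codec names are the ones
Python's standard library ships.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    charset: str  # how to turn the bytes into characters
    lang: str     # how to treat those characters (sorting, hyphenation, ...)
    data: bytes

    def text(self):
        return self.data.decode(self.charset)

# One document: two languages sharing a charset, plus a second charset.
doc = [
    Segment("iso-8859-1", "en", b"street"),
    Segment("iso-8859-1", "de", b"Stra\xdfe"),
    Segment("shift_jis", "ja", "日本語".encode("shift_jis")),
]

for seg in doc:
    print(seg.charset, seg.lang, seg.text())

# Without a charset label, one and the same byte is ambiguous.
as_latin1 = b"\xb1".decode("iso-8859-1")  # '±'
as_latin2 = b"\xb1".decode("iso-8859-2")  # 'ą'
```

The first two segments carry the same charset but different languages, and
the last byte decodes to two different characters depending on the charset,
so neither label can stand in for the other.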

It may not make sense for all clients to allow all possible
combinations, but this is something they can negotiate with servers. It
is not a reason to cripple HTML.
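
As a very rough sketch of what such negotiation might look like, assuming
the client simply advertises the (charset, language) pairs it can handle
and the server lists its variants in order of preference. The negotiate
function and both data structures are hypothetical; no existing protocol
is being described here.

```python
def negotiate(client_accepts, server_variants):
    """Return the first server variant the client accepts, or None."""
    for variant in server_variants:
        if variant in client_accepts:
            return variant
    return None

# Hypothetical capabilities, as (charset, language) pairs.
client = {("iso-8859-1", "en"), ("iso-8859-1", "de"), ("utf-8", "en")}
server = [("shift_jis", "ja"), ("utf-8", "en"), ("iso-8859-1", "en")]

choice = negotiate(client, server)
print(choice)  # ('utf-8', 'en')
```

A client that supports only a few combinations still gets something usable,
and the markup itself loses nothing.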

If I'm misunderstanding the Unicode standards, HTML, or SGML, someone
please let me know. I'm doing my best to keep up :-).

Richard Goerwitz
goer@midway.uchicago.edu