Putting the "World" back in WWW...

John Ludeman (johnl@microsoft.com)
Thu, 29 Sep 94 15:02:30 TZ


Microsoft has discussed the issue of internationalizing the Web with
some web-related vendors and would like to make a proposal, but first,
let me give a few (unofficial) definitions:

Unicode - A 16-bit character encoding scheme that defines code points
for the characters of virtually every language in the world. All
common languages can be expressed in this scheme. It includes 18,000
Han characters set by industry standards in China, Japan, Korea and
Taiwan; other supported languages include Greek, Hebrew, Latin, Pali,
Sanskrit and literary Chinese. In addition, several hundred common
math symbols, geometric shapes, and basic dingbats are defined. A
Unicode document begins with a two-byte signature called the Byte
Order Mark (BOM), 0xFEFF, so clients can resolve big-endian/
little-endian differences.
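
As a rough illustration of how a client might use the BOM, here is a
minimal Python sketch. The function name decode_unicode_document is
hypothetical, and Python's "utf-16-be"/"utf-16-le" codecs are used as a
present-day stand-in for the 16-bit encoding described above.

```python
import codecs


def decode_unicode_document(data: bytes) -> str:
    """Decode a 16-bit Unicode document, using the leading BOM to pick byte order.

    Hypothetical helper for illustration; not part of any proposal or library.
    """
    bom = data[:2]
    if bom == codecs.BOM_UTF16_BE:   # bytes FE FF: big-endian
        return data[2:].decode("utf-16-be")
    if bom == codecs.BOM_UTF16_LE:   # bytes FF FE: little-endian
        return data[2:].decode("utf-16-le")
    raise ValueError("document does not start with a byte order mark")
```

The same two characters arrive as different byte sequences depending
on the sender's byte order; the BOM lets the receiver decode both to
the same text.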

UTF-8 - A special byte encoding that makes full 8-bit transfer over
7-bit gateways safe. Since HTTP is defined to be 8-bit clean, this is
not needed.

The HTTP protocol (since it is a protocol) will always be in Latin-1
using 8-bit characters (and hopefully this will move to a binary scheme).

A good solution for the multi-lingual problem would be to use Unicode
as the character encoding. Issues of left-to-right and right-to-left
text are resolved by the Unicode character code itself (no language or
locale information needed). A document can contain any language at any
place in the document without special delimiters. The only limitations
are whether the browser has the font for the language and whether it
supports languages that are not written left to right (which a real
international browser should).
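
To make the point about direction concrete, here is a small Python
sketch that looks up a character's direction from its code point
alone, using the standard unicodedata module. The function name
direction_of and the three-way "ltr"/"rtl"/"neutral" result are
simplifications assumed for illustration; a real browser would need
more than a per-character lookup to lay out mixed text.

```python
import unicodedata


def direction_of(ch: str) -> str:
    """Classify one character as "ltr", "rtl", or "neutral".

    Uses only the character's Unicode bidirectional class; no language
    or locale information is consulted. Hypothetical helper.
    """
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):   # right-to-left letters (e.g. Hebrew, Arabic)
        return "rtl"
    if bidi == "L":           # left-to-right letters (e.g. Latin, Greek)
        return "ltr"
    return "neutral"          # digits, spaces, punctuation, and so on
```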

To go along with this, I would suggest adding a new MIME type -
"text/uni-html" - that indicates the document is using Unicode. This
allows locations that don't need the extra information to decline the
Unicode versions of documents (which saves on bandwidth). This also
means that existing servers keep working and don't care that they are
serving up Unicode docs. The HTML definition doesn't change, except
that a remark needs to be added that if the document is a Unicode HTML
document, then the BOM will be the first two bytes of the document.
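
One way a server could decide which version to send is sketched below
in Python. This is only an assumption about how the proposed type
might be negotiated: the function name choose_variant is hypothetical,
and the parsing is deliberately simplified (it ignores quality values
and wildcards in the Accept header).

```python
def choose_variant(accept_header: str) -> str:
    """Pick the MIME type to serve, based on the client's Accept header.

    Simplified sketch: parameters such as ";q=0.5" are stripped and
    ignored, and wildcards like "*/*" are not treated as a match.
    """
    accepted = {
        item.split(";")[0].strip().lower()
        for item in accept_header.split(",")
    }
    if "text/uni-html" in accepted:
        return "text/uni-html"   # client asked for the Unicode version
    return "text/html"           # fall back to the existing type
```

A client that never lists "text/uni-html" keeps receiving plain
"text/html", which is how existing clients would remain unaffected.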

I have been writing Unicode networking applications for many years and
I'm sold on its simplicity and elegance. There is simply no better
way to internationalize a product. If other vendors are interested in
pursuing this with Microsoft, please let me know.

John