Re: Putting the "World" back in WWW...

HALLAM-BAKER Phillip (fxrojas@nlsarch.austin.ibm.com)
Sun, 02 Oct 94 11:09:12 -0600


From: John Ludeman <johnl@microsoft.com>
Date: Thu, 29 Sep 94 15:02:30 +0700

UTF-8 - A special byte encoding that makes full 8-bit transfer over 7-bit
gateways safe. Since HTTP is defined to be 8-bit clean, this is
not needed.

UTF-8 is based on FSS-UTF (File System Safe UTF), developed at X/Open, which
is an 8-bit encoding of UCS-2. My understanding is that this name has been
registered with ECMA.

The 7-bit encoding for UCS, I thought, was called UCS-8.

The key is that FSS-UTF (UTF-8) is an 8-bit encoding that could be used
by HTTP.
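
For the curious, the byte layout is simple enough to sketch in a few lines
of C. This is only an illustration of the FSS-UTF scheme for UCS-2 values;
the function name and the driver are mine, not from any library:

    #include <stdio.h>

    /* Illustration of the FSS-UTF (UTF-8) byte layout for a UCS-2
     * code point: ASCII stays one byte, everything else becomes a
     * two- or three-byte sequence with the high bits set. */
    static int utf8_encode(unsigned int cp, unsigned char *out)
    {
        if (cp < 0x80) {                /* 0xxxxxxx - plain ASCII */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {        /* 110xxxxx 10xxxxxx */
            out[0] = 0xC0 | (cp >> 6);
            out[1] = 0x80 | (cp & 0x3F);
            return 2;
        } else {                        /* 1110xxxx 10xxxxxx 10xxxxxx */
            out[0] = 0xE0 | (cp >> 12);
            out[1] = 0x80 | ((cp >> 6) & 0x3F);
            out[2] = 0x80 | (cp & 0x3F);
            return 3;
        }
    }

    int main(void)
    {
        unsigned char buf[3];
        int i, n = utf8_encode(0x05D0, buf);  /* HEBREW LETTER ALEF */
        for (i = 0; i < n; i++)
            printf("%02X ", buf[i]);          /* prints: D7 90 */
        printf("\n");
        return 0;
    }

The nice property is that the ASCII range passes through untouched, which
is why existing 8-bit-clean servers could carry it today.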

The HTTP protocol (since it is a protocol) will always be in Latin-1
using 8-bit characters (and hopefully this will move to a binary scheme).

Actually, from a "world wide" standpoint, HTTP will always use the
"portable character set" of Latin-1 (i.e., its ASCII subset).

A good solution for the multi-lingual problem would be to use Unicode
as the character encoding.

I personally agree with this.

Issues of left-to-right and right-to-left
text are simply resolved by the Unicode character code (no language or
locale information needed).

Issues of directionality are not critical here, given that we have been
able to do Hebrew/Arabic using a single internationalized Motif.
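
To illustrate the point: direction can be inferred from the code point
alone, because Unicode allocates each script a fixed block. A crude sketch
(a real layout engine, such as the one under Motif, does far more than
this, and the function is mine):

    /* Crude illustration: Hebrew and Arabic occupy fixed Unicode
     * blocks, so a renderer can classify direction from the code
     * point itself, with no language or locale tags in the text. */
    static int is_rtl(unsigned int cp)
    {
        return (cp >= 0x0590 && cp <= 0x05FF)    /* Hebrew block */
            || (cp >= 0x0600 && cp <= 0x06FF);   /* Arabic block */
    }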

A document can contain any language at any
place in the document without special delimiters.

I agree.
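
As a tiny illustration, mixing scripts is just mixing code points.
Assuming UCS-2 storage, English and Hebrew sit side by side in one flat
array, with no shift sequences or delimiters anywhere:

    /* "Hi " followed by two Hebrew letters: one flat array of
     * UCS-2 code units, no escapes or language tags needed. */
    static const unsigned short mixed[] = {
        0x0048, 0x0069, 0x0020,    /* 'H', 'i', ' '           */
        0x05D0, 0x05D1             /* HEBREW ALEF, HEBREW BET */
    };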

The only limitation
is whether the browser has the font for the language and supports
non-left-to-right languages (which a real international browser should).

This is a sticky issue; i.e., we cannot expect every client to be localized
the same way as the server...

To go along with this, I would suggest adding a new MIME type,
"text/uni-html", that indicates the document is using Unicode. This
allows locations that don't need the extra information to decline
Unicode versions of the documents (saving bandwidth). It also
means that existing servers keep working and don't care that they are
serving up Unicode docs. The HTML definition doesn't change, except that
a remark needs to be added: if the document is a Unicode HTML document,
then the BOM will be the first two bytes of the document.

Again, I personally like this, but with the addition that a non-Unicode
browser should handle uni-html by converting it to its local national
code set and reporting any conversion errors in a meaningful way.
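
To make that concrete, here is a rough sketch of what such a fallback path
could look like: check the BOM the proposal puts at the front of the
document, then map the UCS-2 body down to the local code set (Latin-1
here, purely for simplicity), counting what will not convert instead of
dropping it silently. All the names are mine, for illustration only:

    /* Per the proposal, a uni-html body starts with the byte order
     * mark U+FEFF: FE FF means big-endian UCS-2, FF FE byte-swapped. */
    enum bom { BOM_NONE, BOM_BE, BOM_LE };

    static enum bom detect_bom(const unsigned char *p, long len)
    {
        if (len >= 2 && p[0] == 0xFE && p[1] == 0xFF) return BOM_BE;
        if (len >= 2 && p[0] == 0xFF && p[1] == 0xFE) return BOM_LE;
        return BOM_NONE;     /* treat as ordinary Latin-1 HTML */
    }

    /* Fallback for a non-Unicode browser: map UCS-2 code units to
     * Latin-1 (assumed to be the local code set) and return the
     * number of characters that would not convert, so the browser
     * can report the loss to the user in a meaningful way. */
    static long ucs2_to_latin1(const unsigned short *in, long n,
                               unsigned char *out)
    {
        long i, errors = 0;
        for (i = 0; i < n; i++) {
            if (in[i] < 0x100) {
                out[i] = (unsigned char)in[i];
            } else {
                out[i] = '?';
                errors++;
            }
        }
        return errors;
    }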

I have been writing Unicode networking applications for many years and
I'm sold on its simplicity and elegance. There is simply no better
way to internationalize a product.

Hummm, I know of a bunch of I18N folks who would disagree.
But I'm easy, and I'd say it is just as simple as using the standard I18N APIs.

Frank