Re: Putting the "World" back in WWW...

Richard L. Goerwitz (fxrojas@nlsarch.austin.ibm.com)
Mon, 03 Oct 94 17:59:12 -0600


From: hallam@dxal18.cern.ch (HALLAM-BAKER Phillip)
Date: Mon, 03 Oct 94 20:36:56 +0100

It is simply another content encoding to deal with.

A charset module can easily be written to convert fairly arbitrary encodings
into UNICODE tokens. This can also do UTS, ASCII, ISO-8859, JIS, and whacky
Russian etc. encodings.

I'm not sure I follow... excuse me if I missed the point ... but it sounds like
you are suggesting we put "ANY ENCODING" in the document and have each viewer
convert into UNICODE...

If so, this will cause MAJOR interoperability problems across the network.
Expecting every client to convert to and from every possible encoding will never
work. Consider that Latin-1 text alone can turn up as: PC 437, PC 850, EBCDIC,
ISO8859, UTF, UCS, other PC national code pages...

Rather, the document should be supplied in a canonical encoding, i.e. UCS,
so that each client needs to provide at most 1 conversion.

On the other side I am looking into a scheme of `multifonts' which allows
several X11 fonts to be compounded into a single UNICODE mapping.

If this is so, then storing the documents in UNICODE makes more sense, since
such fonts will exist... and there is no conversion at view time.

Because the display module is directly engaged we can translate into the target
font character by character. This scheme means that the UNICODE stuff does not
cause increased internal storage requirements.

But this causes a nightmare for system administrators who need to provide
conversions from any other encoding to UNICODE... and puts the burden of
conversion on the clients each time the document is accessed rather than on the
supplier one time.
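
To make the "supplier one time" point concrete, here is a rough sketch in
Python (not part of any proposal; the function name to_canonical is made up
for illustration). It assumes Python's "utf-16-be" codec as a stand-in for
UCS-2, which holds for characters in the Basic Multilingual Plane. The
supplier converts each stored copy once, and every client then needs only
one decoder.

```python
# Hypothetical supplier-side step: convert a stored document to UCS-2 once,
# so that clients only ever need a single decoder.

def to_canonical(raw: bytes, source_encoding: str) -> bytes:
    """Decode from a legacy code page and re-encode as big-endian UCS-2.

    ASSUMPTION: "utf-16-be" stands in for UCS-2; the two agree for every
    character in the Basic Multilingual Plane.
    """
    return raw.decode(source_encoding).encode("utf-16-be")


# The same word, "café", as stored in three of the encodings named above.
legacy_copies = {
    "cp437": b"caf\x82",       # PC 437
    "cp850": b"caf\x82",       # PC 850
    "iso-8859-1": b"caf\xe9",  # ISO 8859-1 (Latin-1)
}

# One conversion per stored copy, done by the supplier.
canonical = {name: to_canonical(raw, name) for name, raw in legacy_copies.items()}
```

All three entries of canonical end up as the same byte string, so a client
that understands the canonical form never has to know which code page the
document started in.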

I fully realize we can't convert over to a single canonical form overnight.
But we should provide conventions that reinforce simple administration
and enhance interoperability for all systems.

So the content type is

text/html (default to ISO-8859-1)
text/html; charset=UNICODE
text/html; charset=UTF
text/html; charset=ISO..
text/html; charset=JIS
etc.
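
For illustration only, a client could pull the charset parameter out of such
a content type roughly as follows. This is a sketch using Python's standard
email.message module; the helper name charset_of and the constant
DEFAULT_CHARSET are invented here, and the default is taken to be Latin-1
(ISO 8859-1) as in the list above.

```python
from email.message import Message

# Default when no charset parameter is given (Latin-1), per the list above.
DEFAULT_CHARSET = "iso-8859-1"


def charset_of(content_type: str) -> str:
    """Return the charset parameter of a Content-Type value, lower-cased.

    Falls back to DEFAULT_CHARSET when the parameter is absent.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(DEFAULT_CHARSET)


for value in ("text/html", "text/html; charset=UNICODE", "text/html; charset=JIS"):
    print(value, "->", charset_of(value))
```

Note that get_content_charset lower-cases whatever it finds, so a client
dispatching on the result would compare against lower-case names.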

Here is a proposal that would help to converge on a uniform
canonical encoding....

text/html; charset=charset_name

where

charset_name := UCS2_ID'_'UCS_plane
| DCE_ID'_'DCE_encoding
| private_encoding

UCS2_ID := 'UCS2'

UCS_plane := '0x'<4_hexdigits>

DCE_ID := 'DCE'

DCE_encoding := '0x'<8_hexdigits>

private_encoding := portable character string

For UCS2_ID, the characters will be encoded using 1-byte values corresponding to
their UCS_plane within UCS-2. The value of any given character can be combined
with the UCS_plane value to obtain a UNICODE value. For example, UCS2_0x0000 is
Latin-1! For CJK, the DCE_ID form for UCS-2 is recommended (see below).
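
A small sketch of the "combine with the UCS_plane value" step, in Python.
The text above does not pin down how the two values are combined, so this
ASSUMES the 4-hexdigit value is a base code point whose low byte is zero and
that each 1-byte value is OR-ed into it. Only the UCS2_0x0000 case (Latin-1)
is confirmed by the text; the function name decode_ucs2_plane and the 0x0400
example are hypothetical.

```python
def decode_ucs2_plane(data: bytes, plane: int) -> str:
    """Map 1-byte values onto UNICODE code points within one UCS_plane.

    ASSUMPTION: 'plane' is the 4-hexdigit UCS_plane value read as a base
    code point with a zero low byte; each byte fills in the low 8 bits.
    """
    if not 0 <= plane <= 0xFFFF or plane & 0xFF:
        raise ValueError("expected a 16-bit value with a zero low byte")
    return "".join(chr(plane | byte) for byte in data)


# UCS2_0x0000: the bytes are exactly Latin-1.
latin1_text = decode_ucs2_plane(b"caf\xe9", 0x0000)

# Hypothetical UCS2_0x0400 under the same assumption: byte 0x10 -> U+0410.
cyrillic_a = decode_ucs2_plane(b"\x10", 0x0400)
```

Under this reading the document stays at one byte per character, and the
client recovers the full 16-bit value with a single OR per character.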

For any DCE value, the characters will be encoded using the encoding registry
developed for DCE. Refer to DCE RFA 41.1 and the X/Open Federated Naming
Specification. The latter relies on the DCE encoding registry
to tag names of different encodings. The nice thing is that the DCE
registry includes UCS-2 as one of the encodings...
For example, the DCE registered value for UCS-2 level 3 is: charset=DCE_0x00010102
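
As a sanity check on the grammar, here is a minimal parser sketch in Python.
The names CharsetName and parse_charset_name are made up for this example,
and two things are ASSUMED because the grammar does not say: hex digits are
accepted in either case, and anything that does not match the UCS2 or DCE
forms is treated as a private_encoding.

```python
import re
from typing import NamedTuple, Optional


class CharsetName(NamedTuple):
    kind: str             # "UCS2", "DCE" or "private"
    value: Optional[int]  # UCS_plane or DCE_encoding; None for private names
    text: str             # the charset_name exactly as given


# UCS2_ID '_' UCS_plane, with UCS_plane := '0x' <4 hexdigits>
_UCS2 = re.compile(r"UCS2_0x([0-9A-Fa-f]{4})")
# DCE_ID '_' DCE_encoding, with DCE_encoding := '0x' <8 hexdigits>
_DCE = re.compile(r"DCE_0x([0-9A-Fa-f]{8})")


def parse_charset_name(name: str) -> CharsetName:
    """Classify a charset_name according to the three alternatives above."""
    match = _UCS2.fullmatch(name)
    if match:
        return CharsetName("UCS2", int(match.group(1), 16), name)
    match = _DCE.fullmatch(name)
    if match:
        return CharsetName("DCE", int(match.group(1), 16), name)
    # ASSUMPTION: everything else is a private_encoding.
    return CharsetName("private", None, name)


for name in ("UCS2_0x0000", "DCE_0x00010102", "JIS"):
    print(parse_charset_name(name))
```

A client would then need real converters only for the kinds it recognizes,
and could refuse or pass through private names it has never heard of.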

Frank

PS. I really don't like the use of hexdigits in the names but it is more
precise...