WWW and non-English (was ISO charsets; Unicode )

Peter Svanberg (psv@nada.kth.se)
Mon, 26 Sep 1994 18:05:36 +0100

Next message: Henrik Frystyk Nielsen: "ANNOUNCEMENT OF CERN LIBRARY OF COMMON CODE 2.17"
Previous message: Gavin Nicol: "Re: Forms support in clients"
Next in thread: Chris Lilley, Computer Graphics Unit: "Re: WWW and non-English (was ISO charsets; Unicode )"

Quoting: "Richard L. Goerwitz" <goer@midway.uchicago.edu>
>
> Has a formal mechanism been considered for specifying various popular
> coding standards, such as ISO 8859-7, ISO 8859-8, etc., and (perhaps
> off in the future) Unicode?

Good question! The following is from the HTML+ discussion document,
<URL:http://info.cern.ch/hypertext/WWW/MarkUp/HTMLPlus/htmlplus_13.html>:

By default, HTML+ documents are made up of 8-bit characters
from the ISO 8859 Latin-1 character set. The network protocol
used to retrieve documents may translate the character set into
a locally acceptable form, e.g. EBCDIC. The HTTP protocol uses
the MIME standard (RFC 1341) to specify the document type and
character set. ISO SGML entity definitions are used to include
characters which are missing from the character set or which
would otherwise be confused with markup elements...

Appendix II lists a broad range of characters and symbols,
relating their ISO names to the corresponding character codes
in common character sets. They allow authors to include
accented characters in 7-bit ASCII documents. ...

There are a large number of entities defined by the ISO,
covering most languages and symbols for publishing and
mathematics. Requiring all browsers to support these would
be impractical, e.g. how should a dumb terminal show such
symbols. In some cases there will be accepted ways of
mapping them to normal characters, e.g. &aelig; as ae and
&egrave; as e. Perhaps the safest recommendation is that
where authors need to use a specialised character or
symbol, they should use ISO entity names rather than
inventing their own. Browsers should leave unrecognised
entity names untranslated.
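
To make the fallback rule in the quoted text concrete, here is a rough
sketch of my own (not taken from the HTML+ draft). It assumes entities
are written in the usual SGML form &name;, and the function name and the
fallback table are hypothetical; the table holds only the two mappings
mentioned above. Known entities are mapped to plain characters and
unrecognised ones are left untranslated.

```python
import re

# Hypothetical fallback table: only the two mappings named in the quoted text.
ASCII_FALLBACK = {"aelig": "ae", "egrave": "e"}

# Matches SGML-style entity references such as &aelig;
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def render_for_dumb_terminal(text):
    """Replace known entities with plain characters; leave the rest as-is."""
    def replace(match):
        name = match.group(1)
        # Unrecognised entity names are returned untranslated.
        return ASCII_FALLBACK.get(name, match.group(0))
    return _ENTITY.sub(replace, text)
```

A real browser would of course need a far larger table, and probably
one per display type.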

That is all I have found on this subject - not much.

> What ideas have been floated along the lines of making the Web more all-
> encompassing, linguistically speaking? Are there any practical solutions
> the folks mentioned above could be working on now? Where should I direct
> people who have questions about internationalization/multilingualism and
> the Web? Can Humanities people help aid the process, even if many of them
> are not technically oriented?

A very important matter here is the choice of language:

(1) in the client
(2) in the documents

For (1) we must urge the client developers to internationalize
their programs, preferably through the standardized "i18n"
methods. Some work is being done for at least Mosaic (in
Germany and in Sweden), but apparently not in cooperation with
the development team, with all the disadvantages that entails.

Concerning (2), I have looked at the plans for future HTML and
HTTP and found (in
<URL:http://info.cern.ch/hypertext/WWW/Protocols/HTTP/HTRQ_Headers.html>)
that an HTTP request can contain

Accept-Language: <list>

which is a list of "Language values which are preferable in the
response". In
<URL:http://info.cern.ch/hypertext/WWW/Protocols/HTTP/Object_Headers.html>
the parallel specification, part of the "Object MetaInformation"
carried in the "header fields given with or in relation to
objects in HTTP", is given as

Content-Language: <code>

This seems nice. I have just one comment:

Make both of these conform to the proposed Content-Language
header (draft-ietf-mailext-lang-tag-00.txt?), with the semantic
difference that the value of Accept-Language is the user's
priority list of desired languages.
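
To illustrate what I mean by a priority list, here is a minimal sketch
of how a server might pick a document variant. It rests on assumptions
of mine, not on the HTTP drafts: the Accept-Language value is taken to
be a comma-separated list in descending order of preference, language
tags are compared case-insensitively, and the function names are
hypothetical.

```python
def parse_accept_language(header_value):
    """Split a comma-separated Accept-Language value into lowercase tags.

    Assumption: the order of the list is the user's order of preference.
    """
    return [tag.strip().lower() for tag in header_value.split(",") if tag.strip()]


def choose_language(accept_language, available, default=None):
    """Return the first requested language that the server can provide.

    `available` lists the Content-Language codes of the stored variants.
    """
    by_tag = {tag.lower(): tag for tag in available}
    for wanted in parse_accept_language(accept_language):
        if wanted in by_tag:
            return by_tag[wanted]
    # None of the requested languages is available.
    return default
```

With "Accept-Language: sv, en" and variants in "en" and "de", this
picks "en"; the server would then label the response with
"Content-Language: en".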

---
Peter Svanberg, NADA, KTH		    Email: psv@nada.kth.se
Dept of Num An & CS,
Royal Inst of Tech			    Phone: +46 8 790 71 40
S-100 44  Stockholm, SWEDEN		    Fax:   +46 8 790 09 30
