Re: ISO charsets; Unicode

Chris Lilley, Computer Graphics Unit (lilley@v5.cgu.mcc.ac.uk)
Tue, 27 Sep 1994 10:38:25 GMT


In message <9409261553.AA22679@midway.uchicago.edu> Richard L. Goerwitz said:

> Still, is the general problem of multi-language text worth dis-
> cussing?

Worth discussing and solving.

> For my part, I'd love to make a few of my non-English
> databases available online, but I don't know how to tell query
> forms to expect something other than ISO 8859-1.

You can't, currently. There is the internationalised version of Mosaic by
Toshihiro Takada <http://www.ntt.jp/Mosaic-l10n/README.html> which addresses
some of the problems of automatic switching on a *per document* basis.

I agree that switching on a per-element basis is needed but, as I said, SGML
does not appear to support this. Perhaps the SGML standard could do with an
amendment in the light of this?

> Let me just toss off a suggestion here. Say we suddenly move
> from English to Greek text:

> <language Greek encoding="ISO 8859-8">

I see what you are saying, I have said similar things myself in the past. You
correctly separate the problem into two sub-problems: changing the language, and
changing the encoding.

For example, being able to tag the English, French and Italian sections of a
document (all using ISO 8859-1) is doable with changes to the DTD. This would
have desirable consequences in terms of targeted searching, for example.

Changing the encoding apparently cannot be done like this. You cannot express it
in the DTD. This is a severe limitation. As I have said before, the
Much-of-Western-Europe-and-the-USA Wide Web is flourishing, but the "World" bit
is sorely lacking.

A related issue is defining the overlaps between languages and their subsets.
There are actually three things to juggle: the country, the language and the
encoding (there may be more than one encoding possible).
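
To make the three-way split concrete, here is a rough sketch in Python of a tag
that keeps country, language and encoding as independent fields. The names
TextTag and decode_section are invented for illustration; nothing like this
exists in HTML or SGML today, and the default encoding is simply what current
clients assume.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TextTag:
    """Hypothetical per-section tag: three independent properties."""
    language: str                  # e.g. "French", "Greek"
    country: Optional[str] = None  # optional refinement, e.g. "Switzerland"
    encoding: str = "ISO-8859-1"   # assumption: what clients expect today


def decode_section(raw: bytes, tag: TextTag) -> str:
    """Decode one section's bytes using the encoding named in its tag."""
    return raw.decode(tag.encoding)


# Same encoding, different language and country:
swiss_french = TextTag(language="French", country="Switzerland")
# Different encoding (ISO 8859-7 is the Latin/Greek part of the series):
greek = TextTag(language="Greek", encoding="ISO-8859-7")
```

With these tags the same byte 0xE1 decodes to a different character depending
on the section it sits in, which is exactly why the encoding has to be stated
separately from the language.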

For example, you may want to tag a section as US English. Or British English. Or
US Medical English. But it needs to be expressed somehow that people who read
British English can also understand US English (well, mostly ;-) ). Expressing
this as a locale, e.g. Britain, doesn't help because people here speak Scots
Gaelic or Welsh or Manx or Romany or Sheltie, in some cases as a first language,
and we don't want any backdoor cultural imperialism, do we ;-(

In case that example was a little too UK-centric for some tastes: expressing
locale as Switzerland doesn't help because the primary language may be Swiss
German or Swiss French or Swiss Italian (or English if they are at CERN ;-) )
and it may in some cases be desirable to distinguish Swiss German from German
German.

Or again, in the USA a lot of people have Latin American Spanish as a first
language, I am told. In Spain, "Spanish" (Castilian) is widely used but there
are a lot of people who speak Catalan. And so on.
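
One way to express "people who read X can also understand Y" without going
through a locale is an explicit relation between language tags. The sketch
below is only an illustration: the table ALSO_READS and the function
readable_by are made up, and the entry for US Medical English is my assumption
rather than anything agreed.

```python
# Hypothetical table: a reader of the key can also read the listed variants.
ALSO_READS = {
    "British English": {"US English"},
    "US Medical English": {"US English"},  # assumption, for illustration
}


def readable_by(reader_language, also_reads=ALSO_READS):
    """Return every language tag a reader of reader_language can cope with.

    The relation is followed transitively, so chains of entries are honoured.
    """
    seen = {reader_language}
    todo = [reader_language]
    while todo:
        current = todo.pop()
        for other in also_reads.get(current, ()):
            if other not in seen:
                seen.add(other)
                todo.append(other)
    return seen
```

Note that the relation is deliberately one-way and says nothing about country:
a Welsh reader gets only Welsh unless the table says otherwise.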

> The question for me is just how sophisticated we want clients to
> get. The Web is supposed to be worldwide, to be sure, and this
> would seem to imply multilinguality. But how are we supposed to
> be sure that all of the requisite fonts, with all of the requisite
> registries and encodings, are on every machine?

I think you answer that well in your next paragraph - people have the fonts they
use regularly on their machine. People acquire others according to need. Clients
can help by offering to download fonts that are needed, or by recovering
gracefully, or just by saying "this passage is in Ancient Babylonian, you don't
have the fonts". The user may care, or they may not. Clients start off with a
minimal set and build up capabilities for what they are interested in reading.
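
The client behaviour described above might look something like this. It is a
sketch only: plan_for_passage and its arguments are hypothetical, and a real
client would have to work out which fonts a passage needs in the first place.

```python
def plan_for_passage(script, installed, downloadable):
    """Decide what a client does with a passage needing fonts for script.

    installed and downloadable are sets of script names (hypothetical inputs).
    """
    if script in installed:
        return "render"
    if script in downloadable:
        return "offer to download the " + script + " fonts"
    # Graceful recovery: tell the user and carry on with the rest of the page.
    return "this passage is in " + script + ", you don't have the fonts"
```

Accepting a download simply moves that script into the installed set, so the
client's capabilities grow with what the user actually reads.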

> I'm sorry if I seem to be obtruding in a forum without knowing what I
> am doing.

Hardly.

> As I noted above, I'm in the Humanities, and am simply trying to see if I
> can be any help at all....

Good. Keep saying these things. Multilingual capability is essential. The more
people that express opinions about it and keep it in the forefront, the better.

--
Chris