RE: WWW support for Cyrillic (and UNICODE)

Richard L. Goerwitz (goer@midway.uchicago.edu)
Thu, 3 Nov 94 08:37:36 CST


This is a long posting, aimed at answering the following questions
in a non-technical, but practical way - with lots of real-life
examples:

>Would you please explain
>
>1. What the old internationalization/localization model is, and
>2. What the multilingual encoding standard is, and
>3. How UNICODE fits into all of this.

What has frustrated the university community for some time now is
that we've had to purchase special-purpose software to do our work.
This is because software firms - typically US-based - tend to think
of their customer base in terms of local markets. So, if you want
to write a document in Arabic, you first of all have to get the
Arabic version of the operating system and GUI you want to run, then
get Arabic versions of the software you need.

These are so-called "localized" products. Ironically, the term
internationalization refers to the process of storing various
locale-specific information as separate resources, so that a given
package can be recompiled easily for a new local environment.
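
To make the distinction concrete, here's a minimal sketch (in Python,
with invented message catalogs and keys) of what internationalization
looks like in practice: the program looks up every user-visible
string in a per-locale resource table, so producing a "localized"
version is mostly a matter of supplying a new table rather than
touching the code.

    # Sketch of the i18n/l10n split: the code is written once, and
    # locale-specific strings live in separate resource catalogs.
    # The catalogs and keys below are invented for illustration.
    MESSAGES = {
        "en_US": {"greeting": "Hello", "farewell": "Goodbye"},
        "fr_FR": {"greeting": "Bonjour", "farewell": "Au revoir"},
        "de_DE": {"greeting": "Guten Tag", "farewell": "Auf Wiedersehen"},
    }

    def translate(locale, key):
        # Look up a user-visible string in the catalog for this locale,
        # falling back to English if the locale or key is missing.
        catalog = MESSAGES.get(locale, MESSAGES["en_US"])
        return catalog.get(key, MESSAGES["en_US"][key])

    print(translate("fr_FR", "greeting"))   # Bonjour
    print(translate("xx_XX", "greeting"))   # Hello (fallback locale)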

What's wrong with this method? Nothing. It's just that it's not
a complete solution. Imagine an international firm like Caterpillar,
which, despite a strike in the US, is making big money - especially
in overseas markets. In Europe the EEC requires that every document
they produce be made available in all of its members' languages
simultaneously. To do this, Cat has a big machine-translation (MT)
effort underway. But wouldn't it be nice if they could also pass
their documents freely - in any language - from one branch to the
next, without having conversion problems caused by differently
localized software packages?

Consider another case: an academician writing a commentary in German
on the book of Genesis. He must quote his fellow scholars in French,
English, etc. And he must quote his text as well (in Hebrew). It is
also likely that he will need to quote ancient translations like the
Septuagint (Greek), Targum (Aramaic), and Vulgate (Latin). This sort
of thing is common in linguistics and philology. Thus far, American
word processors have, by and large, ignored the possibility that
someone might actually want to use more than one language at a time.
Wouldn't it be nice for philologists and linguists if they could?
Wouldn't it be nice for *any* humanities scholar who has to deal with
original sources in languages other than his native one (that's a lot
of scholars)?

Consider yet another case. In India there are, I think, six official
scripts. They use English for scholarship and many official functions,
but are also required to document things in the local language.
Languages are also freely mixed in many contexts. Note also that Urdu
uses Arabic script. So here we are with six different scripts,
including a Latin-based one and one that runs right to left - and we
may need to mix them in documents. Is a system geared only for
localization going to work? No. Wouldn't it be nice to have a system
that allowed us to move freely in and out of each language? Similar
problems, BTW, are encountered in Canada (French-English), Singapore
(Chinese and Arabic scripts, plus various languages of India,
English, and local languages), and in many other places.

Accommodating all of these languages requires a lot of codepage
switching - OR a single 16-bit standard large enough to encompass all
of the world's major scripts. Unicode is supposed to be that
standard.
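
To make the contrast concrete, here is a minimal sketch in Python
(chosen only for brevity; the particular strings are my own example)
of what the single-standard approach buys you: one document can carry
Latin, Cyrillic, Greek, and Hebrew text side by side, because every
character has its own code point, rather than a meaning that depends
on which codepage happens to be active.

    # A minimal sketch: one string mixing Latin, Cyrillic, Greek, and
    # Hebrew. Under Unicode each character has a unique code point, so
    # no codepage switching is needed to keep the scripts apart.
    mixed = "Genesis / Бытие / Γένεσις / בראשית"

    for ch in mixed:
        # ord() gives the Unicode code point; the same number means the
        # same character no matter what locale the software runs in.
        print(f"U+{ord(ch):04X}  {ch}")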

I hope that this explanation is helpful. It's non-technical because
I tend to think of these problems in non-technical ways, due to my
training and background. Any errors you find are my fault alone.

Richard Goerwitz
goer@midway.uchicago.edu