Re: Ligatures (revisited)...

Glenn Adams (glenn@stonehand.com)
Wed, 26 Apr 1995 20:48:19 +0500


There's a movement afoot to make ISO/IEC 10646-1:1993 the
standard document character set for HTML. This can be done
without affecting existing implementations because:

1. The ISO 8859-1 character repertoire is a subset of 10646
and the code assignments in 10646 are the same as in 8859-1 for
this repertoire; that is, &#0; through &#255; denote the same
characters in both 8859-1 and 10646.

2. SGML (and thus HTML) doesn't require that the representation
of entities (e.g., the document entity) use the document
character set; that is, one can use 8859-1 or ASCII or Shift JIS
or any other character set in the actual representation of a
document. The entity manager is responsible for translating
the actual representation of the entity into a form understood
by the parser in terms of the applicable document character set.
[This translation is partially supported by the CHARSET= parameter
on the CONTENT-TYPE header in HTTP: this parameter identifies
the actual encoding of the entity's representation.]
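
To make the two points above concrete, here is a rough sketch in Python
of what an entity manager might do. It is only an illustration: the
function name entity_to_code_points is made up, and Python's built-in
codecs and Unicode strings stand in for a real entity manager and for
the 10646 code assignments.

```python
def entity_to_code_points(raw: bytes, charset: str = "iso-8859-1") -> list[int]:
    """Hypothetical entity-manager step: turn the bytes of an entity,
    in whatever encoding the CHARSET= parameter names, into code points
    of the document character set."""
    text = raw.decode(charset)          # actual representation -> characters
    return [ord(ch) for ch in text]     # characters -> code points


# Point 1: every 8859-1 byte value maps to the identical code point.
latin1_identity = all(
    entity_to_code_points(bytes([b]), "iso-8859-1") == [b] for b in range(256)
)

# Point 2: the same text in two different encodings yields the same
# code points once the entity manager has done its translation.
sample = "\u65e5\u672c"                 # two kanji, chosen only as sample data
from_sjis = entity_to_code_points(sample.encode("shift_jis"), "shift_jis")
from_utf8 = entity_to_code_points(sample.encode("utf-8"), "utf-8")
```

The parser only ever sees the resulting code points, so it does not need
to know which encoding was on the wire.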

---------

As for ligatures, one needs to be a bit careful about terminology
here. Some ligatures, such as 'ffi', are merely presentation forms
that enhance the aesthetics of rendered text; other 'ligatures', the
so-called 'lexical ligatures', communicate additional lexical
information beyond a mere presentational style. Then again, what
is a lexical ligature to one writing system may be a presentational
ligature to another.

Not all purely presentational ligatures are encoded (or should be encoded)
in 10646 or any other character set. Those that are encoded are merely
aiding in compatibility with older software that can't distinguish
between characters and glyphs.

A general formatting architecture must account for the need of mapping
characters to glyphs in a many-to-many mapping; this is particularly true
for the less simple scripts such as Arabic, Devanagari, etc.
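
As a toy illustration of a many-to-many mapping, the sketch below maps
character sequences to glyph sequences in Python. The tables, the glyph
names, and the shape function are all invented for the example; a real
formatter would take this information from the font and would handle
context (as Arabic requires) rather than doing a simple greedy match.

```python
# Hypothetical tables; glyph names are invented for this example.
MANY_TO_ONE = {"ffi": ["ffi.liga"], "fi": ["fi.liga"]}   # several chars -> one glyph
ONE_TO_MANY = {"\u00e9": ["e", "acutecomb"]}             # one char -> several glyphs


def shape(text: str) -> list[str]:
    """Greedy, longest-match-first mapping of characters to glyph names."""
    sequences = sorted(MANY_TO_ONE, key=len, reverse=True)
    glyphs = []
    i = 0
    while i < len(text):
        for seq in sequences:
            if text.startswith(seq, i):
                glyphs.extend(MANY_TO_ONE[seq])
                i += len(seq)
                break
        else:
            # No ligature applies here: map the single character,
            # possibly to more than one glyph.
            ch = text[i]
            glyphs.extend(ONE_TO_MANY.get(ch, [ch]))
            i += 1
    return glyphs


office = shape("office")
cafe = shape("caf\u00e9")
```

Note that the text itself still contains the plain characters 'f', 'f',
'i'; the ligature exists only on the glyph side of the mapping.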

Regards,
Glenn Adams