From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>
Regarding A, I see three choices:
1. A string is a sequence of code points.
2. A string is a sequence of combining character sequences.
3. A string is a sequence of code points, but it's encouraged
  to process it in groups of combining character sequences.

I'm afraid that anything other than a mixture of 1 and 3 is too
complicated to be widely used. Almost everybody is representing
strings either as code points, or as even lower-level units like
UTF-16 units. And while 2 is nice from the user's point of view,
it's a nightmare from the programmer's point of view:

Consider that the normalization forms are trying to approach choice number 2: they create more predictable combining character sequences which can still be processed by algorithms as plain streams of code points.
Remember also that the total number of possible code points is finite, but the total number of possible combining sequences is not, so text handling will necessarily have to make decisions based on a limited set of properties.
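To make this concrete (a minimal Python sketch, not part of the original discussion, using only the standard unicodedata module): two different code point sequences can be canonically equivalent, and normalization picks one predictable representative for both:

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as a single code point (U+00E9)
decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT (U+0301)

# As raw code-point sequences they differ...
print(precomposed == decomposed)                  # False
print(len(precomposed), len(decomposed))          # 1 2

# ...but NFC folds both into the same composed form,
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
# and NFD expands both into the same decomposed form.
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
```

So an algorithm that normalizes first can keep treating its input as a stream of code points and still get predictable combining sequences.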


Note however that for most Unicode strings, the "composite" character properties are those of the base character of the sequence. Note also that for some languages/scripts, the linguistically correct unit of work is the grapheme cluster; Unicode only defines "default grapheme clusters", which can span several combining sequences. See for example the Hangul script, written with clusters made of multiple combining sequences, where the base character is a Unicode jamo, itself sometimes made of multiple simpler jamos that Unicode does not allow to be decomposed into canonically equivalent strings, even though this decomposition is inherent to the structure of the script itself and not bound to any particular language (which Unicode will not standardize).
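This asymmetry in Hangul can be seen with a minimal Python sketch (standard unicodedata module): a precomposed syllable decomposes canonically into jamos, but a compound jamo such as SSANGKIYEOK has no canonical decomposition into the simpler jamos it is visibly built from:

```python
import unicodedata

syllable = "\uae4c"   # HANGUL SYLLABLE KKA (SSANGKIYEOK + A)

# The syllable decomposes canonically into its jamos...
jamo = unicodedata.normalize("NFD", syllable)
print([hex(ord(c)) for c in jamo])               # ['0x1101', '0x1161']

# ...but the compound jamo SSANGKIYEOK (U+1101) itself has no canonical
# decomposition into two KIYEOK (U+1100), despite the script's structure:
print(unicodedata.decomposition("\u1101"))       # '' (empty: none defined)
print(unicodedata.normalize("NFD", "\u1101") == "\u1100\u1100")   # False
```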

It's hard to create a general model that will work for all scripts encoded in Unicode; there are too many differences. So Unicode just appears to standardize a higher level of processing, with combining sequences and normalization forms that come closer to the linguistics and semantics of the scripts. Consider this level an intermediate tool that helps simplify the identification of processing units.
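As an illustration of such processing units, here is a minimal Python sketch (the helper combining_sequences is hypothetical; real segmentation into default grapheme clusters, per UAX #29, is more involved than this):

```python
import unicodedata

def combining_sequences(s):
    """Split s into combining character sequences: each base code
    point grouped with the combining marks that follow it."""
    out = []
    for ch in s:
        # combining() returns a nonzero class for combining marks.
        if unicodedata.combining(ch) and out:
            out[-1] += ch
        else:
            out.append(ch)
    return out

# "e" + COMBINING ACUTE ACCENT stays together as one unit.
print(combining_sequences("e\u0301tude"))
```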

The reality is that a written language is more complex than any single definition of processing units can capture. For many similar reasons, the ideal working model is one with "simple", enumerable abstract characters assigned a finite number of code points, from which the actual, non-enumerable characters can be composed.

But the situation is not ideal for some scripts, notably ideographic ones, whose very complex and often "inconsistent" composition rules and layout require allocating many code points, one for each combination. Working with ideographic scripts requires far more character properties than other scripts do (see for example the huge and varied properties defined in UniHan, which are still not standardized because they are difficult to represent and because errors, omissions, and contradictions keep being discovered in the various sources for this data...)
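Since each combination is its own code point, software queries per-code-point properties rather than applying composition rules; a minimal Python sketch with the standard unicodedata module (the richer UniHan properties are not exposed there at all):

```python
import unicodedata

# Each ideograph is a single, atomic code point with its own
# algorithmically derived name; its internal composition is invisible.
for ch in "\u4e2d\u6587":
    print(hex(ord(ch)), unicodedata.name(ch))
# 0x4e2d CJK UNIFIED IDEOGRAPH-4E2D
# 0x6587 CJK UNIFIED IDEOGRAPH-6587
```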
