Almost all internationalization functions (upper-, lower-, and titlecasing, case folding,
drawing, measuring, collation, transliteration, grapheme/word/line breaking, etc.)
should take *strings* in the API, NOT single code-points. Single code-point APIs
almost always malfunction once you get outside of simple languages, either because you
need more context to get the right answer, or because you may need to generate a
sequence of characters to return the right answer.
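A quick sketch in Python (whose str.upper() implements the full Unicode case
mappings) shows why a single-code-point return value cannot work:

```python
# Full case mappings can change the length of the text, so an uppercasing
# API that returns one code point per input code point cannot be right.
assert "ß".upper() == "SS"    # U+00DF LATIN SMALL LETTER SHARP S -> "SS"
assert "ﬁ".upper() == "FI"    # U+FB01 LATIN SMALL LIGATURE FI -> "FI"
assert "straße".upper() == "STRASSE"
```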
Take collation, for example. Any Unicode-compliant collation (UTR #10) must be able to
handle sequences of more than one code-point, and treat that sequence as a single
entity. Given that the code has to handle sequences, it makes little difference
whether the string internally is a sequence of UTF-16 code units, or is a sequence of
code-points ( = UTF-32 code units). This is because one of the beauties of UTF-16 and
UTF-8 is that there is no overlap. Unlike SJIS, if I express a code-point as a sequence
of code units, and search for it within a string of those code units, I will never get
a mismatch: I will find a match iff there is a matching code-point at that position.
In SJIS, if I search for an "a", I might find a false match as the second byte of a
two-byte character -- this never happens in UTF-16 or UTF-8.
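The false match is easy to demonstrate with Python's codecs. The classic example
is U+8868 (表), whose second Shift-JIS byte is 0x5C, the ASCII backslash:

```python
# In Shift-JIS, the trail byte of a two-byte character can fall in the
# ASCII range, so a naive byte search produces a false match.
sjis = "表".encode("shift_jis")    # b'\x95\x5c' -- trail byte is ASCII '\'
assert b"\\" in sjis               # false match inside one character

# In UTF-8 (and UTF-16), lead and trail units never overlap with the
# units of any other character, so the same search cannot misfire.
utf8 = "表".encode("utf-8")        # b'\xe8\xa1\xa8'
assert b"\\" not in utf8
```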
If you ever tried to collate by handling a single code-point at a time, you would get
the wrong answer. The same happens if you draw or measure text a single code-point
at a time. Because scripts like Arabic are contextual, the width of x plus the width
of y is not equal to the width of xy. Once you get beyond basic typography, the same
is true for English as well: because of kerning and ligatures, the width of "fi" in
TrueType may differ from the width of "f" plus the width of "i".
Looking at the question at hand: casing operations must return strings, not single
code-points. (See http://www.unicode.org/unicode/reports/tr21/charts/). Moreover, the
titlecasing operation requires strings as input, not single code-points at a time.
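Both points are visible in Python. str.title() uses a deliberately naive,
language-independent notion of a "word", which shows what goes wrong when the
routine lacks real word-boundary context; and there are one-to-one titlecase
mappings that are distinct from the uppercase mappings:

```python
# Titlecasing needs word boundaries (UAX #29 style), not one code point
# at a time; Python's simple str.title() gets apostrophes wrong:
assert "don't".title() == "Don'T"    # the T is wrongly titlecased

# And titlecase is its own mapping: U+01F3 (dz) titlecases to
# U+01F2 (Dz), not to the uppercase U+01F1 (DZ).
assert "ǳ".title() == "ǲ"
```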
Remember also that whenever you store a single code-point in a struct or class instead
of a string, you are excluding the use of graphemes -- a single code-point may not be
sufficient to express what is required: you may need to store a sequence, such as "ch"
for Slovak.
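The Slovak "ch" is also a good collation example. Here is a toy sort-key sketch
(NOT the real UCA, and the weights are purely hypothetical): because "ch" is a
single collating element that sorts after "h", the key builder has to scan the
*string* for the contraction; no single-code-point API can express this.

```python
# Hypothetical weights; in Slovak, the digraph "ch" sorts after "h".
WEIGHTS = {"a": 1, "c": 3, "d": 4, "h": 8, "ch": 8.5}

def sort_key(word):
    key, i = [], 0
    while i < len(word):
        if word[i:i + 2] in WEIGHTS:            # try the contraction first
            key.append(WEIGHTS[word[i:i + 2]])
            i += 2
        else:
            key.append(WEIGHTS[word[i]])
            i += 1
    return key

words = sorted(["cha", "ca", "ha", "da"], key=sort_key)
print(words)    # Slovak-like order: ca, da, ha, cha
```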
In other words, almost all APIs and struct/class fields should *not* take either a
char16 or a char32, they should take a string. And if they take a string, it doesn't
matter what the internal representation of the string is. The main exceptions we've
found are very low-level operations such as getting character properties (e.g. General
Category or Canonical Class in the UCD). For those it is handy to have interfaces that
convert quickly to and from UTF-16 and UTF-32, and that allow you to iterate through
strings returning UTF-32 values (even though the internal format is UTF-16).
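Such an iterator is a few lines. Here is a sketch (assuming well-formed input)
that walks UTF-16 code units but yields whole code points, combining surrogate
pairs on the fly:

```python
# Iterate UTF-16 code units, yielding UTF-32 code-point values.
def code_points(utf16_units):
    it = iter(utf16_units)
    for unit in it:
        if 0xD800 <= unit <= 0xDBFF:        # high (lead) surrogate
            low = next(it)                   # assume well-formed input
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        else:
            yield unit

# "a" + U+1D11E MUSICAL SYMBOL G CLEF, as UTF-16 code units:
units = [0x0061, 0xD834, 0xDD1E]
assert list(code_points(units)) == [0x61, 0x1D11E]
```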
For more information, see "Forms of Unicode" on
http://www2.software.ibm.com/developer/papers.nsf/unicode-papers-bynewest
Mark
Antoine Leca wrote:
> Torsten Mohrin wrote:
> >
> > Antoine Leca <[EMAIL PROTECTED]> wrote:
> >
> > [...]
> > >> > APIs use and return single 16-bit values.
> > >
> > >Ah, that may be a problem (what is the ToUpper return value of ß?)
> >
> > I don't know the mentioned API, but it could return 0x00DF or (to
> > indicate it as an error) 0xFFFF. I don't see a problem.
>
> The problem is that the "correct" answer is a two letter string, "SS".
>
> More generally, character manipulation APIs done on single 16-bit
> values tend to have a number of problems, not very serious
> when we deal with Latin-based West European languages, but which
> go badly wrong when considered in a wider context (example:
> what is the width of character U+064A Arabic yeh? if the context
> is not indicated in some way, the answer is probably wrong...)
>
> Antoine