Almost all internationalization functions (upper-, lower-, and titlecasing, case folding,
drawing, measuring, collation, transliteration, grapheme/word/line breaking, etc.)
should take *strings* in the API, NOT single code-points. Single code-point APIs
almost always malfunction once you get outside of simple languages, either because you
need more context to get the right answer, or because you may need to generate a
sequence of characters to return the right answer.
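A quick sketch in Python (whose str.upper() implements the full Unicode case
mappings) shows why a single-code-point return value cannot work:

```python
# Full case mappings can change the length of the text, so an uppercasing
# API that returns one code point per input code point cannot be right.
assert "ß".upper() == "SS"    # U+00DF LATIN SMALL LETTER SHARP S -> "SS"
assert "ﬁ".upper() == "FI"    # U+FB01 LATIN SMALL LIGATURE FI -> "FI"
assert "straße".upper() == "STRASSE"
```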
Take collation, for example. Any Unicode-compliant collation (UTR #10) must be able to
handle sequences of more than one code-point, and treat that sequence as a single
entity. Given that the code has to handle sequences, it makes little difference
whether the string internally is a sequence of UTF-16 code units, or is a sequence of
code-points ( = UTF-32 code units). This is because one of the beauties of UTF-16 and
UTF-8 is that there is no overlap. Unlike SJIS, if I express a code-point as a sequence
of code units, and search for it within a string of those code units, I will never get
a mismatch: I will find a match iff there is a matching code-point at that position.
In SJIS, if I search for an "a", I might find a false match as the second byte of a
two-byte character -- this never happens in UTF-16 or UTF-8.
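The false match is easy to demonstrate with Python's codecs. The classic example
is U+8868 (表), whose second Shift-JIS byte is 0x5C, the ASCII backslash:

```python
# In Shift-JIS, the trail byte of a two-byte character can fall in the
# ASCII range, so a naive byte search produces a false match.
sjis = "表".encode("shift_jis")    # b'\x95\x5c' -- trail byte is ASCII '\'
assert b"\\" in sjis               # false match inside one character

# In UTF-8 (and UTF-16), lead and trail units never overlap with the
# units of any other character, so the same search cannot misfire.
utf8 = "表".encode("utf-8")        # b'\xe8\xa1\xa8'
assert b"\\" not in utf8
```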
If you ever tried to collate by handling a single code-point at a time, you would get
the wrong answer. The same happens if you draw or measure text a single code-point
at a time. Because scripts like Arabic are contextual, the width of x plus the width
of y is not equal to the width of xy. Once you get beyond basic typography, the same
is true for English as well: because of kerning and ligatures, the width of "fi" in
TrueType may differ from the width of "f" plus the width of "i".
Looking at the question at hand: casing operations must return strings, not single
code-points. (See http://www.unicode.org/unicode/reports/tr21/charts/). Moreover, the
titlecasing operation requires strings as input, not single code-points at a time.
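Both points are visible in Python. str.title() uses a deliberately naive,
language-independent notion of a "word", which shows what goes wrong when the
routine lacks real word-boundary context; and there are one-to-one titlecase
mappings that are distinct from the uppercase mappings:

```python
# Titlecasing needs word boundaries (UAX #29 style), not one code point
# at a time; Python's simple str.title() gets apostrophes wrong:
assert "don't".title() == "Don'T"    # the T is wrongly titlecased

# And titlecase is its own mapping: U+01F3 (dz) titlecases to
# U+01F2 (Dz), not to the uppercase U+01F1 (DZ).
assert "ǳ".title() == "ǲ"
```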
Remember also that whenever you store a single code-point in a struct or class instead
of a string, you are excluding the use of graphemes -- a single code-point may not be
sufficient to express what is required: you may need to store a sequence, such as "ch"
for Slovak.
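The Slovak "ch" is also a good collation example. Here is a toy sort-key sketch
(NOT the real UCA, and the weights are purely hypothetical): because "ch" is a
single collating element that sorts after "h", the key builder has to scan the
*string* for the contraction; no single-code-point API can express this.

```python
# Hypothetical weights; in Slovak, the digraph "ch" sorts after "h".
WEIGHTS = {"a": 1, "c": 3, "d": 4, "h": 8, "ch": 8.5}

def sort_key(word):
    key, i = [], 0
    while i < len(word):
        if word[i:i + 2] in WEIGHTS:            # try the contraction first
            key.append(WEIGHTS[word[i:i + 2]])
            i += 2
        else:
            key.append(WEIGHTS[word[i]])
            i += 1
    return key

words = sorted(["cha", "ca", "ha", "da"], key=sort_key)
print(words)    # Slovak-like order: ca, da, ha, cha
```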
In other words, almost all APIs and struct/class fields should *not* take either a
char16 or a char32, they should take a string. And if they take a string, it doesn't
matter what the internal representation of the string is. The main exceptions we've
found are very low-level operations such as getting character properties (e.g. General
Category or Canonical Class in the UCD). For those it is handy to have interfaces that
convert quickly to and from UTF-16 and UTF-32, and that allow you to iterate through
strings returning UTF-32 values (even though the internal format is UTF-16).
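Such an iterator is a few lines. Here is a sketch (assuming well-formed input)
that walks UTF-16 code units but yields whole code points, combining surrogate
pairs on the fly:

```python
# Iterate UTF-16 code units, yielding UTF-32 code-point values.
def code_points(utf16_units):
    it = iter(utf16_units)
    for unit in it:
        if 0xD800 <= unit <= 0xDBFF:        # high (lead) surrogate
            low = next(it)                   # assume well-formed input
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        else:
            yield unit

# "a" + U+1D11E MUSICAL SYMBOL G CLEF, as UTF-16 code units:
units = [0x0061, 0xD834, 0xDD1E]
assert list(code_points(units)) == [0x61, 0x1D11E]
```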
For more information, see "Forms of Unicode" on
http://www2.software.ibm.com/developer/papers.nsf/unicode-papers-bynewest
Mark
Antoine Leca wrote:
> Torsten Mohrin wrote:
> >
> > Antoine Leca <[EMAIL PROTECTED]> wrote:
> >
> > [...]
> > >> > APIs use and return single 16-bit values.
> > >
> > >Ah, that may be a problem (what is the ToUpper return value of ß?)
> >
> > I don't know the mentioned API, but it could return 0x00DF or (to
> > indicate it as an error) 0xFFFF. I don't see a problem.
>
> The problem is that the "correct" answer is a two letter string, "SS".
>
> More generally, character manipulation APIs done on single 16-bit
> values tend to have a number of problems, not very serious
> when we deal with Latin-based West European languages, but which
> go badly wrong when considered in a wider context (example:
> what is the width of character U+064A Arabic yeh? if the context
> is not indicated in some way, the answer is probably wrong...)
>
> Antoine