Without any fanfare, at least on this public mailing list, the proposed 
Unicode Technical Report #26 defining CESU-8 (Compatibility Encoding Scheme 
for UTF-16: 8-Bit) has been upgraded in the past week from "Proposed Draft" 
status to "Draft" status.  That means CESU-8 is moving forward along the road 
to approval by the UTC, however smooth or rocky that road may be.

So it seems like a sensible time to get back on my soapbox about CESU-8, ask 
the pivotal question once again concerning the motivation for this new 
scheme, and point out a lingering error in the TR while I'm at it.

CESU-8, for those who may have forgotten or repressed it, is a variation of 
UTF-8 which encodes supplementary characters in six bytes instead of four 
bytes.  Essentially, it is UTF-8 applied to UTF-16 code units instead of 
Unicode scalar values.  The UTF-16 transformation is applied to each 
supplementary character, breaking it into a high surrogate and a low 
surrogate, and then the UTF-8 transformation is applied to the two 
surrogates, so that each is encoded in three bytes.
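For concreteness, the two-step transformation just described can be sketched in a few lines of Python.  This is illustrative only; the function name is mine, not from any library or standard:

```python
# Sketch of CESU-8 encoding for one supplementary character, following
# the two-step description above.  Illustrative code, not a reference
# implementation.

def cesu8_encode_supplementary(cp: int) -> bytes:
    """Encode a supplementary code point (U+10000..U+10FFFF) in CESU-8."""
    assert 0x10000 <= cp <= 0x10FFFF
    # Step 1: the UTF-16 transformation -- split into a surrogate pair.
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)        # high surrogate: D800..DBFF
    low  = 0xDC00 | (v & 0x3FF)      # low surrogate:  DC00..DFFF
    # Step 2: apply the three-byte UTF-8 pattern to each surrogate.
    def three_byte(u: int) -> bytes:
        return bytes([0xE0 | (u >> 12),
                      0x80 | ((u >> 6) & 0x3F),
                      0x80 | (u & 0x3F)])
    return three_byte(high) + three_byte(low)

# U+10400 -> UTF-16 pair D801 DC00 -> six CESU-8 bytes ED A0 81 ED B0 80,
# versus the four UTF-8 bytes F0 90 90 80 for the same character.
print(cesu8_encode_supplementary(0x10400).hex())  # prints eda081edb080
```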

CESU-8 was originally called UTF-8S, at least on this list, the "S" 
presumably denoting the variant encoding of Surrogates.  It has been promoted 
by representatives of Oracle, notably Jianping Yang, and PeopleSoft, notably 
Toby Phipps (the author of DUTR #26), as a way to ensure that Unicode data is 
sorted consistently in UTF-16 code-unit binary order.

Several people on this list, including me, have been critical of CESU-8, 
claiming that UTF-16 code-unit order is not a suitable collation order and 
should not serve as the basis of a new (or hacked) UTF.  UTFs are supposed 
to be character encoding forms (cf. UTR #17, "Character Encoding Model") that 
map Unicode scalar values to sequences of bytes, words, double-words, etc.  
You're not supposed to piggyback a UTF on top of another UTF, the way CESU-8 
sits on top of UTF-16.

The critics of CESU-8 claim its reason for existence is that the database 
vendors have been ignoring the designation of the supplementary code space 
and have handled "Unicode" as surrogate-unaware UCS-2.  Now that 
supplementary characters have become a reality (as of Unicode 3.1), the 
vendors have chosen to promote this new encoding scheme instead of either (a) 
fixing the sort order of existing database engines to sort supplementary 
characters properly, AFTER basic characters, or (b) making a small 
modification to their sort routines to sort normal UTF-8 data in the 
idiosyncratic UCS-2-like order.
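To make the sorting discrepancy concrete, here is a small illustration (my own example characters, chosen only because one is a BMP character above the surrogate range and the other is supplementary):

```python
# Illustration of the ordering discrepancy at the heart of the CESU-8
# debate.  In UTF-8 binary order, supplementary characters (lead bytes
# F0..F4) sort after every BMP character; in UTF-16 code-unit binary
# order they sort between U+D7FF and U+E000, because their high
# surrogates fall in D800..DBFF.

bmp_char = "\uFB00"        # U+FB00, a BMP character above the surrogates
supp_char = "\U00010400"   # U+10400, a supplementary character

# UTF-8 binary order: the supplementary character sorts AFTER the BMP one
# (EF AC 80 < F0 90 90 80).
assert bmp_char.encode("utf-8") < supp_char.encode("utf-8")

# UTF-16 (big-endian) code-unit order: the supplementary character sorts
# BEFORE the BMP one (D8 01 DC 00 < FB 00).
assert supp_char.encode("utf-16-be") < bmp_char.encode("utf-16-be")

print("orders disagree")  # prints orders disagree
```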

There is also a concern that CESU-8 is really just a variation of UTF-8, 
allowing (nay, requiring) sequences that are illegal in UTF-8 but otherwise 
looking just like UTF-8.  This could open security holes that the UTC has 
worked hard to close, and is continuing to close in Unicode 3.2.

Finally, although the promoters claim that this mutant form of UTF-8 is only 
for internal use within closed systems (which would make it completely 
unnecessary for the Unicode Consortium to sanction, describe, or even 
acknowledge it), they have not only written a Technical Report to describe it 
to the public but have announced their intent to register it with the IANA, a 
major step toward open interchange of CESU-8 data.  (It was claimed that the 
IANA registration was intended to pre-empt some other party from registering 
CESU-8 with IANA, but I don't see what difference this would make or how the 
pre-emptive action would help anything.)

The promoters of CESU-8 say that data in this format already exists in the 
real world, and the purpose in describing it in a UTR is to codify an 
existing de facto standard.  For me, there is one question whose answer 
would reveal the real motivation behind CESU-8.  We know that basic (BMP) 
characters are encoded exactly the same in UTF-8 and CESU-8.  We also know 
that, although the supplementary space has been designated for many years, no 
actual supplementary characters (with the exception of private use planes 15 
and 16) were encoded, and thus allowed for interchange, until the publication 
of Unicode 3.1 earlier this year.

Furthermore, we know what characters are currently (Unicode 3.1) encoded in 
the supplementary space:  the ancient Old Italic and Gothic scripts; the 
Deseret script, which has not been actively promoted for 130 years; a large 
set of musical and mathematical symbols; the Plane 14 language tags; and 
several thousand Han characters.  The Han characters are generally thought to 
be less commonly used than those in the BMP; otherwise (so the story goes) 
they would have been encoded in Unicode sooner.  Remember that none of these 
non-BMP characters could be conformantly used (e.g. stored in a database) 
until the publication of Unicode 3.1.

So my question is:  What supplementary characters are currently, TODAY, 
stored in Oracle or PeopleSoft databases that require the creation of a new 
encoding scheme to ensure they can continue to be sorted consistently?

I suspect there are none, and the real rationale behind CESU-8 is not to 
guarantee consistent sorting of existing non-BMP data but to validate the 
continued use of surrogate-unaware, UCS-2 mechanisms for handling "Unicode" 
data.  I have asked this question before, and nobody was able to cite an 
example of real-world supplementary characters that require this 
extraordinary handling.

Oh yes, I almost forgot: the lingering error.  The original PDUTR contained 
the following passage:

"The bit pattern 11110xxx is illegal in any CESU-8 byte, effectively 
prohibiting the occurrence of UTF-8 four-byte surrogates in CESU-8."

Somebody, I think it was Markus Scherer, pointed out that this was wrong; the 
bit pattern 1111xxxx (note fifth character 'x' instead of '0') is actually 
illegal.  This has been changed in the DUTR, but not to the correct bit 
pattern:

"The bit pattern 11111xxx is illegal in any CESU-8 byte...."
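The correct statement is that the bit pattern 1111xxxx is illegal: since every CESU-8 sequence is one, two, or three bytes long, no byte may have its top four bits all set.  The patterns 11110xxx and 11111xxx each cover only half of that range, as this little check (my own illustration) shows:

```python
# Illustration of the bit-pattern error: 1111xxxx covers all sixteen
# illegal byte values 0xF0..0xFF, while 11110xxx and 11111xxx each
# match only eight of them.

def matches(byte: int, pattern: int, mask: int) -> bool:
    """True if the byte's masked bits equal the pattern bits."""
    return (byte & mask) == pattern

full_range   = [b for b in range(256) if matches(b, 0xF0, 0xF0)]  # 1111xxxx
pdutr_claim  = [b for b in range(256) if matches(b, 0xF0, 0xF8)]  # 11110xxx
dutr_claim   = [b for b in range(256) if matches(b, 0xF8, 0xF8)]  # 11111xxx

print(len(full_range))   # prints 16 (0xF0..0xFF, the full illegal range)
print(len(pdutr_claim))  # prints 8  (0xF0..0xF7 only)
print(len(dutr_claim))   # prints 8  (0xF8..0xFF only)
```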

-Doug Ewell
 Fullerton, California
