Without any fanfare, at least on this public mailing list, the proposed Unicode Technical Report #26 defining CESU-8 (Compatibility Encoding Scheme for UTF-16: 8-Bit) has been upgraded in the past week from "Proposed Draft" status to "Draft" status. That means CESU-8 is moving forward along the road to approval by the UTC, however smooth or rocky that road may be.
So it seems like a sensible time to get back on my soapbox about CESU-8, ask the pivotal question once again concerning the motivation for this new scheme, and point out a lingering error in the TR while I'm at it.

CESU-8, for those who may have forgotten or repressed it, is a variation of UTF-8 which encodes supplementary characters in six bytes instead of four. Essentially, it is UTF-8 applied to UTF-16 code units instead of Unicode scalar values. The UTF-16 transformation is applied to each supplementary character, breaking it into a high surrogate and a low surrogate, and then the UTF-8 transformation is applied to the two surrogates, so that each is encoded in three bytes.

CESU-8 was originally called UTF-8S, at least on this list, the "S" presumably denoting the variant encoding of Surrogates. It has been promoted by representatives of Oracle, notably Jianping Yang, and of PeopleSoft, notably Toby Phipps (the author of DUTR #26), as a way to ensure that Unicode data is sorted consistently in UTF-16 code-point binary order.

Several people on this list, including me, have been critical of CESU-8, claiming that UTF-16 code-point order is not a suitable collation order and should not serve as the basis of a new (or hacked) UTF. UTFs are supposed to be character encoding forms (cf. UTR #17, "Character Encoding Model") that map Unicode scalar values to sequences of bytes, words, double-words, etc. You're not supposed to piggyback a UTF on top of another UTF, the way CESU-8 sits on top of UTF-16. The critics of CESU-8 claim its reason for existence is that the database vendors have been ignoring the designation of the supplementary code space and have handled "Unicode" as surrogate-unaware UCS-2.
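The two-step transformation described above (UTF-16 surrogate split, then the three-byte UTF-8 form applied to each surrogate) can be sketched in a few lines of Python; the function name is mine, purely for illustration:

```python
# Sketch of the CESU-8 transformation for one supplementary character:
# split it into a UTF-16 surrogate pair, then UTF-8-encode each surrogate
# as a three-byte sequence. (Function name is illustrative, not from the TR.)

def cesu8_encode_supplementary(cp: int) -> bytes:
    """Encode one supplementary code point (U+10000..U+10FFFF) in 6 CESU-8 bytes."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    surrogates = (0xD800 | (v >> 10),    # high surrogate
                  0xDC00 | (v & 0x3FF))  # low surrogate
    out = bytearray()
    for s in surrogates:
        # UTF-8 three-byte form, applied to the surrogate code unit
        out += bytes((0xE0 | (s >> 12),
                      0x80 | ((s >> 6) & 0x3F),
                      0x80 | (s & 0x3F)))
    return bytes(out)

# U+10330 GOTHIC LETTER AHSA:
#   UTF-8  : F0 90 8C B0         (4 bytes)
#   CESU-8 : ED A0 80 ED BC B0   (6 bytes)
```

Note that each three-byte half of the result encodes a surrogate code point (lead byte 0xED), which is exactly the kind of sequence that well-formed UTF-8 forbids.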
Now that supplementary characters have become a reality (as of Unicode 3.1), the vendors have chosen to promote this new encoding scheme instead of either (a) fixing the sort order of existing database engines to sort supplementary characters properly, AFTER basic characters, or (b) making a small modification to their sort routines to sort normal UTF-8 data in the idiosyncratic UCS-2-like order.

There is also a concern that CESU-8 is really just a variation of UTF-8, allowing (nay, requiring) sequences that are illegal in UTF-8 but otherwise looking just like UTF-8. This could open security holes that the UTC has worked hard to close, and is continuing to close in Unicode 3.2.

Finally, although the promoters claim that this mutant form of UTF-8 is only for internal use within closed systems (which would make it completely unnecessary for the Unicode Consortium to sanction, describe, or even acknowledge it), they have not only written a Technical Report to describe it to the public but have announced their intent to register it with the IANA, a major step toward open interchange of CESU-8 data. (It was claimed that the IANA registration was intended to pre-empt some other party from registering CESU-8 with IANA, but I don't see what difference this would make or how the pre-emptive action would help anything.) The promoters of CESU-8 say that data in this format already exists in the real world, and that the purpose of describing it in a UTR is to codify an existing de-facto standard.

For me, there is one question that gets to the heart of the real motivation behind CESU-8. We know that basic (BMP) characters are encoded exactly the same in UTF-8 and CESU-8. We also know that, although the supplementary space has been designated for many years, no actual supplementary characters (with the exception of private-use planes 15 and 16) were encoded, and thus allowed for interchange, until the publication of Unicode 3.1 earlier this year.
Furthermore, we know what characters are currently (Unicode 3.1) encoded in the supplementary space: the ancient Old Italic and Gothic scripts; the Deseret script, which has not been actively promoted for 130 years; a large set of musical and mathematical symbols; the Plane 14 language tags; and several thousand Han characters. The Han characters are generally thought to be less commonly used than those in the BMP; otherwise (so the story goes) they would have been encoded in Unicode sooner. Remember that none of these non-BMP characters could be conformantly used (e.g. stored in a database) until the publication of Unicode 3.1.

So my question is: What supplementary characters are currently, TODAY, stored in Oracle or PeopleSoft databases that require the creation of a new encoding scheme to ensure they can continue to be sorted consistently?

I suspect there are none, and that the real rationale behind CESU-8 is not to guarantee consistent sorting of existing non-BMP data but to validate the continued use of surrogate-unaware, UCS-2 mechanisms for handling "Unicode" data. I have asked this question before, and nobody was able to cite an example of real-world supplementary characters that require this extraordinary handling.

Oh yes, I almost forgot: the lingering error. The original PDUTR contained the following passage: "The bit pattern 11110xxx is illegal in any CESU-8 byte, effectively prohibiting the occurrence of UTF-8 four-byte surrogates in CESU-8." Somebody, I think it was Markus Scherer, pointed out that this was wrong; the bit pattern 1111xxxx (note the fifth character 'x' instead of '0') is actually the illegal one. This has been changed in the DUTR, but not to the correct bit pattern: "The bit pattern 11111xxx is illegal in any CESU-8 byte...."

-Doug Ewell
 Fullerton, California
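P.S. For anyone who wants to verify which bit pattern is right, here is a small illustrative check (the helper name is mine, not from the TR). CESU-8 uses only one-, two-, and three-byte UTF-8 forms, so every byte that can occur in it is below 0xF0; 11110xxx and 11111xxx each cover only half of the impossible byte values 0xF0-0xFF, while 1111xxxx covers them all:

```python
# A small check of the three candidate bit patterns from the TR.
# (The helper name 'matches' is mine, purely for illustration.)

def matches(byte: int, pattern: str) -> bool:
    """True if the byte's 8 bits match a pattern like '11110xxx' (x = any bit)."""
    return all(p in ('x', b) for p, b in zip(pattern, f"{byte:08b}"))

# CESU-8 sequences are at most three bytes long, so every legal CESU-8 byte
# is below 0xF0 (lead bytes up to 0xEF, continuation bytes up to 0xBF):
cesu8_legal = [b for b in range(256) if b < 0xF0]

# The PDUTR's 11110xxx covers only 0xF0-0xF7:
assert [b for b in range(256) if matches(b, "11110xxx")] == list(range(0xF0, 0xF8))
# The DUTR's 11111xxx covers only 0xF8-0xFF:
assert [b for b in range(256) if matches(b, "11111xxx")] == list(range(0xF8, 0x100))
# Only 1111xxxx matches no legal CESU-8 byte and every impossible one (0xF0-0xFF):
assert all(not matches(b, "1111xxxx") for b in cesu8_legal)
assert [b for b in range(256) if matches(b, "1111xxxx")] == list(range(0xF0, 0x100))
```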

