Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
>> What is a shame is that Unicode published a definition of the
>> defective CESU-8 at all.
>
> On that point at least we agree. I wonder why CESU-8 was created, if
> there effectively exists applications needing it.
UTC could have simply acknowledged that certain applications and vendors have created their own transformation formats for internal use, based on, but incompatible with, existing Unicode encoding schemes. Oracle has a UTF-8-like one which encodes supplementary code points with six bytes instead of four.
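To make the six-bytes-versus-four difference concrete, here is a minimal Python sketch. The function name cesu8_encode is made up for illustration and this is not taken from any vendor's code; it assumes the input contains no unpaired surrogates. Each supplementary code point is split into a UTF-16 surrogate pair, and each surrogate is then written as a three-byte UTF-8-style sequence.

```python
def cesu8_encode(text: str) -> bytes:
    """Illustrative encoder: supplementary code points become six bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair.
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)
            low = 0xDC00 | (v & 0x3FF)
            # Write each surrogate as a three-byte sequence.
            for s in (high, low):
                out += bytes([
                    0xE0 | (s >> 12),
                    0x80 | ((s >> 6) & 0x3F),
                    0x80 | (s & 0x3F),
                ])
        else:
            # BMP code points are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)


sample = "\U00010400"
print(sample.encode("utf-8").hex(" "))  # four bytes in UTF-8
print(cesu8_encode(sample).hex(" "))    # six bytes in the CESU-8 style
```

For text confined to the BMP the two outputs are identical, which is why the incompatibility tends to go unnoticed until supplementary characters show up.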
The way UTC formally 'acknowledges' something like that may involve the issuance of a specification for it. That was done for CESU-8, and incidentally also for UTF-EBCDIC.
Sometimes the purpose of creating a label for a format is to be able to clearly identify data as *not* being in conformance to the Unicode specification. I've not seen evidence that UTR#26 has resulted in more or fewer implementations using CESU-8 style data. That is as expected, because the use of that format is driven by specific compatibility requirements, which neither get created nor removed by fiat from the UTC. On the other hand all implementations that do see a need to use that format can now safely warn all others of potential incompatibilities by correctly labelling their data. I see that as a win.
Sun has one like this which also encodes U+0000 as two bytes instead of one. Someone else might decide to use one of the "zany" UTFs invented by Marco Cimarosti or me.
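The U+0000 quirk can be sketched in a few lines of Python. The function name below is hypothetical, and the sketch covers only the two-byte NUL, not the handling of supplementary code points. It relies on the fact that a 0x00 byte occurs in UTF-8 output only as the encoding of U+0000, so a byte-level replacement is safe.

```python
def encode_nul_as_two_bytes(text: str) -> bytes:
    """Illustrative only: emit U+0000 as C0 80 instead of a single 00 byte."""
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


data = encode_nul_as_two_bytes("a\x00b")
print(data.hex(" "))

# A strict UTF-8 decoder rejects the overlong C0 80 sequence.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print("accepted by strict UTF-8 decoder:", strict_ok)
```

The result contains no embedded zero byte, but it is not well-formed UTF-8, so a conformant decoder will refuse it.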
I think people recognize a distinction between zany UTFs invented by some guys with too much time on their hands and the documentation of specific compatibility warts that (unfortunately) afflict a sizable group of users.
Whatever... but there was no need to publish a Technical Report describing Oracle's custom format, giving it a formal-sounding name like "CESU-8" and registering it as an IANA charset for interchange. Not everyone outside this list is familiar with the fine distinction between a UTR, officially approved by UTC, and a UTN, published but not approved by UTC. I hope UTC does not ever go the "CESU-8" route with a UTN describing Sun's broken format.
A UTN is a different animal, as you are well aware. A UTN that says in effect "Java's string serialization is not conformant to UTF-8" (and explains the reason) is well within the parameters set for UTNs by the Unicode Consortium. It would also pass the sniff test for 'information useful to implementers and users of the standard'.
As Sun is discouraging the use of their format for all but Java-specific and reasonably low level serialization of class data - an option not open to the users of CESU-8 or UTF-EBCDIC who face the issue of interchange at least among the components of certain distributed implementations - there's not the same call for a formal specification and label.
But a UTN would make a nice place to capture the information that gets dredged up every so often when this issue percolates on this and related mailing lists. UTNs, after all, are intended to allow for the documentation of such issues without requiring UTC endorsement.
A./

