Asmus Freytag <asmusf at ix dot netcom dot com> wrote:

> The way UTC formally 'acknowledges' something like that may involve
> the issuance of a specification for it. That was done for CESU-8, and
> incidentally also for UTF-EBCDIC.

Throughout all of this, I had completely missed the fact that the Tech Note for CESU-8 had been upgraded to a Tech Report, two and a half years ago, in fact. Perhaps I was in denial. Anyway, that puts CESU-8 on the same plane with UTF-EBCDIC, and invalidates many of my comments which assumed that CESU-8 was defined in a Tech Note, which non-listers might confuse for the relative sanction of a Tech Report.

> Sometimes the purpose of creating a label for a format is to be able
> to clearly identify data as *not* being in conformance to the Unicode
> specification. I've not seen evidence that UTR#26 has resulted in
> more or fewer implementations using CESU-8 style data. That is as
> expected, because the use of that format is driven by specific
> compatibility requirements, which neither get created nor removed by
> fiat from the UTC. On the other hand all implementations that do see a
> need to use that format can now safely warn all others of potential
> incompatibilities by correctly labelling their data. I see that as a
> win.

CESU-8 is the documentation of someone's internal, non-standard implementation of UTF-8. Of course, the "someone" is large and important and their implementation affects a lot of users. If nobody else is motivated by the presence of UTR #26 to adopt this non-standard version, good.

What worries me is that there might be other people in the world like Philippe who think Sun's "modified UTF-8" is a good and useful thing, because it allows arbitrary data to be stored in C-style strings, and who might propagate its use in a way that, thankfully, you haven't seen with CESU-8.

There are perfectly good data structures available for storing arbitrary binary data. Strings of text are not one of them.

> A UTN is a different animal, as you are well aware. A UTN that says in
> effect "Java's string serialization is not conformant to UTF-8" (and
> explains the reason) is well within the parameters set for UTNs by the
> Unicode Consortium. It would also pass the sniff test for 'information
> useful to implementers and users of the standard'.

I am aware of the difference, and so are all (or most) list members. How far that awareness extends beyond this list is left as an exercise for the reader.

But again, everything I said about UTNs is moot, because I assumed CESU-8 was documented in a UTN, which did not confer the appearance of Unicode sanction. The fact that it is a UTR is actually more discouraging.

At least in the case of UTF-EBCDIC, the creators did not merely take an existing, broken implementation of an existing character encoding scheme and get it documented. They created an algorithm similar to and inspired by UTF-8, but not in any way mistakable for it, and added a 1-to-1 EBCDIC translation layer. It's actually quite elegant.

> As Sun is discouraging the use of their format for all but Java-
> specific and reasonably low level serialization of class data - an
> option not open to the users of CESU-8 or UTF-EBCDIC who face the
> issue of interchange at least among the components of certain
> distributed implementations - there's not the same call for a formal
> specification and label.

That's good to know.

> But a UTN would make a nice place that one could use to capture the
> information that gets dredged up every so often when this issue
> percolates on this and related mail lists. UTNs after all, are
> intended to allow for the documentation of such issues, without
> requiring UTC endorsement.

While we're on the subject of UTNs, I think it's a shame that BOCU-1, a genuinely novel and potentially useful compression scheme that was invented from scratch, is only documented in a "no-endorsement" UTN, when a draft UTR-upgrade that adds a white-box algorithm was written almost a year ago but has not been approved. This places BOCU-1 *below* CESU-8 in the food chain, which seems badly wrong.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

