Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

>> What is a shame is that Unicode published a definition of the
>> defective CESU-8 at all.
>
> On that point at least we agree. I wonder why CESU-8 was created, if
> there effectively exists applications needing it.

UTC could have simply acknowledged that certain applications and
vendors have created their own transformation formats for internal use,
based on, but incompatible with, existing Unicode encoding schemes.
Oracle has a UTF-8-like one which encodes supplementary code points
with six bytes instead of four. Sun has one like this which also
encodes U+0000 as two bytes instead of one. Someone else might decide
to use one of the "zany" UTFs invented by Marco Cimarosti or me.

Whatever... but there was no need to publish a Technical Report
describing Oracle's custom format, giving it a formal-sounding name
like "CESU-8" and registering it as an IANA charset for interchange.
Not everyone outside this list is familiar with the fine distinction
between a UTR, officially approved by UTC, and a UTN, published but not
approved by UTC.

I hope UTC does not ever go the "CESU-8" route with a UTN describing
Sun's broken format.

> On the other side, the Java modified UTF-8 (in fact more near from CESU-8)
> has proven to be useful and is widely used... Simply because it is
> compatible with standard C libraries for null-terminated strings.

An unusual type of "compatible" that makes a special allowance for
strings with embedded nulls, impossible by definition in C.

If the Java architects had wanted a variable-length array of arbitrary
byte data, they should have created such a type in the first place,
instead of overloading the string type. Strings are for text. Text
does not need nulls.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
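
For readers who want to see the byte-level differences described above,
here is a minimal Python sketch. The function names (encode_bmp,
surrogate_pair, cesu8, modified_utf8) are made up for this illustration
and do not come from any library. The sketch assumes the six-byte form
is produced by encoding each UTF-16 surrogate separately with the
three-byte UTF-8 bit layout, and that the Sun/Java variant additionally
writes U+0000 as the two bytes C0 80.

```python
def encode_bmp(value: int) -> bytes:
    """Encode a value below 0x10000 with the 1- to 3-byte UTF-8 bit layout."""
    if value < 0x80:
        return bytes([value])
    if value < 0x800:
        return bytes([0xC0 | (value >> 6), 0x80 | (value & 0x3F)])
    return bytes([
        0xE0 | (value >> 12),
        0x80 | ((value >> 6) & 0x3F),
        0x80 | (value & 0x3F),
    ])


def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into its UTF-16 surrogates."""
    v = cp - 0x10000
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)


def cesu8(text: str) -> bytes:
    """Illustrative CESU-8 style encoder: supplementary code points
    become two 3-byte sequences (six bytes) instead of one 4-byte one."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            for unit in surrogate_pair(cp):
                out += encode_bmp(unit)
        else:
            out += encode_bmp(cp)
    return bytes(out)


def modified_utf8(text: str) -> bytes:
    """Illustrative Java-style variant: same as cesu8(), except that
    U+0000 is written as C0 80, so no zero byte appears in the output."""
    out = bytearray()
    for ch in text:
        out += b"\xC0\x80" if ch == "\x00" else cesu8(ch)
    return bytes(out)


if __name__ == "__main__":
    sample = "\U00010400"  # a supplementary code point
    print("UTF-8:         ", sample.encode("utf-8").hex(" "))
    print("CESU-8 style:  ", cesu8(sample).hex(" "))
    print("U+0000, UTF-8: ", "\x00".encode("utf-8").hex(" "))
    print("U+0000, Java:  ", modified_utf8("\x00").hex(" "))
```

For U+10400, standard UTF-8 yields the four bytes F0 90 90 80, while
the six-byte form yields ED A0 81 ED B0 80. A conforming UTF-8 decoder
rejects both that sequence and C0 80, which is the incompatibility at
issue.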

