From: "Doug Ewell" <[EMAIL PROTECTED]>

What is a shame is that Unicode published a definition of the defective
CESU-8 at all.

On that point at least we agree. I wonder why CESU-8 was created, and whether there really exist applications that need it.


On the other side, the Java Modified UTF-8 (in fact closer to CESU-8) has proven to be useful and is widely used, simply because it is compatible with standard C library functions for null-terminated strings. It is historic and has lived well with Unicode, given the previous tolerance of legacy UTF-8 decoders. Even today it is still conformant with Unicode rules, given that Java does not pretend that this is UTF-8 and does not label the encoded data as UTF-8: it is used internally in the Java JNI interfaces and in the Java class file format (which is not plain text), and both are part of the JVM specifications and are not intended for data interchange between distinct hosts or applications.
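
For example, java.io.DataOutputStream.writeUTF is documented to emit this modified form, so both features can be observed directly (a minimal sketch; the expected byte values follow from the encoding rules themselves):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            // U+0000 followed by U+10400, a supplementary character that
            // Java stores as the surrogate pair D801,DC00.
            out.writeUTF("\u0000\uD801\uDC00");
            byte[] bytes = buf.toByteArray();
            for (int i = 2; i < bytes.length; i++) { // skip writeUTF's 2-byte length prefix
                System.out.printf("%02X ", bytes[i] & 0xFF);
            }
            // Expected: C0 80 ED A0 81 ED B0 80
            //   C0 80    -> U+0000 escaped, so no embedded null byte
            //   ED A0 81 -> U+D801 (high surrogate, encoded as in CESU-8)
            //   ED B0 80 -> U+DC00 (low surrogate, encoded as in CESU-8)
        }
    }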

But the tolerance for non-shortest forms did effectively exist, so that the sequence C0,80 would be interpreted safely as NUL (U+0000).
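
To make that tolerance concrete, here is a minimal sketch of the lenient two-byte decoding step (a hypothetical helper, not any real decoder's API): a strict UTF-8 decoder must reject results below 0x80 as non-shortest forms, while the tolerant legacy behaviour simply returns them.

    public class LenientDecodeDemo {
        // Decode a two-byte sequence without the shortest-form check.
        static int decodeTwoByteLenient(int b1, int b2) {
            if ((b1 & 0xE0) != 0xC0 || (b2 & 0xC0) != 0x80) {
                throw new IllegalArgumentException("not a two-byte sequence");
            }
            // A strict decoder would reject values below 0x80 here;
            // the tolerant behaviour returns them, so C0,80 -> U+0000.
            return ((b1 & 0x1F) << 6) | (b2 & 0x3F);
        }

        public static void main(String[] args) {
            System.out.printf("U+%04X%n", decodeTwoByteLenient(0xC0, 0x80));
        }
    }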

Another way to think about the Java Modified UTF-8 is that it could be a transport encoding syntax for CESU-8: it differs mostly by escaping null bytes into the two bytes C0,80 (where the lead byte C0 is otherwise unused in CESU-8), and by supporting the presence of isolated/unpaired surrogates or otherwise invalid UTF-16 code units in the CESU-8-encoded string. So why would Sun change anything there? Replacing something that works with a new API that would create incompatibilities does not look like a good thing.
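
Under that reading the escaping step is tiny; a minimal sketch (hypothetical helper, not a standard API), assuming the input is already valid CESU-8:

    import java.io.ByteArrayOutputStream;

    public class CesuToModifiedUtf8 {
        // Escape each 0x00 byte of a CESU-8 string as the pair C0,80,
        // yielding Java Modified UTF-8 with no embedded null bytes.
        static byte[] escapeNulls(byte[] cesu8) {
            ByteArrayOutputStream out = new ByteArrayOutputStream(cesu8.length);
            for (byte b : cesu8) {
                if (b == 0) {
                    // U+0000 is the only case where CESU-8 emits 0x00.
                    out.write(0xC0);
                    out.write(0x80);
                } else {
                    out.write(b);
                }
            }
            return out.toByteArray();
        }
    }

The reverse direction only has to undo that same escape, which is why the two forms can round-trip through each other so cheaply.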



