From: "Theodore H. Smith" <[EMAIL PROTECTED]>
http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html#modified-utf-8

If only people could sue for suggesting bad coding practices ;o)

It was not bad coding practice at the time Sun designed these APIs, because the design was explicitly based on the ISO/IEC 10646 definition of UTF-8 that was current then: the legacy version published in the RFC, in which non-shortest encodings were allowed. Sun used it simply as a convenience, so that standard C libraries that expect a NUL byte to terminate strings could still be used while String objects could contain NUL (U+0000) characters. Also at that time, Unicode 1.0 was defined only as a 16-bit subset of ISO/IEC 10646, and the definitions for supporting other planes were missing.
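
To make the NUL handling concrete, here is a minimal sketch that compares what DataOutputStream.writeUTF produces with what the standard UTF-8 charset produces for a string containing U+0000. The class name ModifiedUtf8Nul and the helpers hex and modifiedUtf8 are illustrative names, not part of any Sun API, and the code assumes a Java 8 or later runtime rather than the 1.5 release the documentation link refers to.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Nul {
    // Illustrative helper: format bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    // Illustrative helper: the bytes writeUTF emits, including its
    // two-byte big-endian length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        String s = "A\u0000B";
        // Modified UTF: U+0000 becomes the two-byte sequence C0 80,
        // so the payload never contains a zero byte.
        System.out.println(hex(modifiedUtf8(s)));                    // 00 04 41 C0 80 42
        // Standard UTF-8: U+0000 is the single byte 00.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8))); // 41 00 42
    }
}
```

The first two bytes of the writeUTF output are the length prefix, not part of the encoded text. After that prefix the payload contains no 0x00 byte, which is what lets C code treat it as a NUL-terminated string.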


What is a shame is that Unicode did not consider this widely used legacy practice when it defined CESU-8 (which encodes supplementary characters the same way the Java modified UTF encoding does), so that it would also allow encoding NUL (U+0000) as {0xC0,0x80}, something that is very useful for interoperability with standard C libraries.
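
As a rough illustration of the supplementary-character handling mentioned above, the following sketch encodes U+10400 both ways. The class name SurrogateEncoding and its helpers are hypothetical names chosen for this example, and a Java 8 or later runtime is assumed.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class SurrogateEncoding {
    // Illustrative helper: format bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    // Illustrative helper: the bytes writeUTF emits, including its
    // two-byte big-endian length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // U+10400 is stored in a Java String as the surrogate pair D801 DC00.
        String s = new String(Character.toChars(0x10400));
        // writeUTF encodes each surrogate code unit separately as three
        // bytes, the same six-byte form CESU-8 uses for this character.
        System.out.println(hex(modifiedUtf8(s)));                    // 00 06 ED A0 81 ED B0 80
        // Standard UTF-8 encodes the code point once, as four bytes.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8))); // F0 90 90 80
    }
}
```

For supplementary characters the two encodings agree byte for byte once the length prefix is set aside; the only difference between CESU-8 and the Sun encoding discussed here is the treatment of U+0000.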

Now that CESU-8 is fixed and standardized, the Sun modified UTF encoding should have its own encoding label, registered under something less ambiguous than the expression "modified UTF". Has Sun applied to register its encoding (actually an encoding scheme, because the encoding form is plain UTF-16, even though the Sun scheme allows encoding isolated or unpaired surrogates, or the invalid code units 0xFFFE and 0xFFFF) with an IANA/MIME charset identifier? It would then be easier for Sun to reference this encoding by that label, if Sun published a public informative RFC for the IANA charset registration.

Without this RFC, the informative page in the Java SDK documentation could perhaps be used as the reference for the IANA registration. But Sun should ensure that this page remains accessible (that's why extracting it into a standalone plain-text document for an informative RFC would be helpful).

I won't support the idea of Sun suddenly removing one of its APIs, because that would break lots of JNI extensions that need it, even though there are APIs that can be used to pass String data directly in the native UTF-16 format supported by the Java native char datatype. Redefining the API with new names (without the UTF suffix) also seems like overkill, and is not needed for Unicode conformance: the API is self-contained, and Unicode places no restriction on how an API function should be named.

