CESU-8 is the documentation of someone's internal, non-standard implementation of UTF-8. Of course, the "someone" is large and important and their implementation affects a lot of users. If nobody else is motivated by the presence of UTR #26 to adopt this non-standard version, good.
There are some UTF-8/UTF-16 interoperability aspects that are addressed
by CESU-8. These concerns are real, and affect multi-component architectures
that must interchange data across component boundaries. Therefore a standard
specification serves a useful purpose.
What worries me is that there might be other people in the world like Philippe
Philippe doesn't worry me ;-)
I'd like to note that the Java modified UTF-8 format is not purely internal to Java: it is used for interchange of data in a multi-component architecture, namely the JNI interface, which allows external native libraries to exchange data with any compliant Java VM.
So it is not only Sun's implementation, but also part of every other VM implementation that supports JNI and the Java native interface. The same encoding also appears in the class file format, which is standardized and accessible through Java's reflection mechanisms, allowing Java programs to control how classes are loaded or created at run time and exchanged with other hosts. The "component boundaries" mentioned above therefore apply to Java as well.
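For illustration, here is a minimal sketch of that encoding using the standard java.io.DataOutputStream.writeUTF method (which emits the same modified UTF-8 form used in class file CONSTANT_Utf8 entries and returned by JNI's GetStringUTFChars), assuming nothing beyond the standard library:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws IOException {
            // "A", U+0000, and U+10400 (a supplementary character stored as a surrogate pair)
            String s = "A\u0000\uD801\uDC00";

            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            // writeUTF emits a 2-byte length followed by the "modified UTF-8" payload.
            new DataOutputStream(buf).writeUTF(s);

            for (byte b : buf.toByteArray()) {
                System.out.printf("%02X ", b & 0xFF);
            }
            System.out.println();
            // Expected output: 00 09 41 C0 80 ED A0 81 ED B0 80
            // i.e. U+0000 becomes the two bytes C0 80 (never a raw 0x00 byte),
            // and the surrogate pair is encoded as two 3-byte sequences (CESU-8 style).
        }
    }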
Finally, Java's ability to store and exchange valid Unicode strings with embedded nulls (U+0000) is a feature rather than a limitation, notably when the interchange requires fixed-size structures containing variable-length strings, where these nulls serve as padding bytes (for example in fixed-width plain-text table formats, where introducing binary length prefixes would make the text file unreadable).
The NULL character is mostly used in plain-text formats as ignorable padding, with less ambiguity than the spaces commonly used in so many SQL engines or XML formats. Some text editors are broken in that they will not load a text file with embedded nulls correctly: they truncate the data they read instead of handling nulls as ignorable whitespace, because they treat the text as C strings in which a null byte means end of string.
There are also many places in data structures used for interchange where plain-text strings are encoded in data fields without any extra length indication, because the field is extremely small. Nulls are used there as required padding and must not be truncated, otherwise these structures would be desynchronized. Nulls are also used as filler bytes within some communication protocols based on plain-text data.
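As an illustration of such a NUL-padded fixed-width field, here is a small sketch in Java (padField and trimField are hypothetical helper names, not part of any standard API):

    public class NulPadding {
        // Pad a value with U+0000 so it always occupies exactly `width` code units.
        static String padField(String value, int width) {
            StringBuilder sb = new StringBuilder(width);
            sb.append(value, 0, Math.min(value.length(), width));
            while (sb.length() < width) {
                sb.append('\u0000');          // ignorable padding, not a terminator
            }
            return sb.toString();
        }

        // Strip trailing NUL padding when reading the field back.
        static String trimField(String field) {
            int end = field.length();
            while (end > 0 && field.charAt(end - 1) == '\u0000') {
                end--;
            }
            return field.substring(0, end);
        }

        public static void main(String[] args) {
            String field = padField("abc", 8);
            System.out.println(field.length());    // 8: the fixed record width is preserved
            System.out.println(trimField(field));  // "abc"
        }
    }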
Like it or not, nulls are part of almost all character sets, from the oldest to the most recent (with one notable exception in GSM text for SMS, where the null byte is a printable character, as GSM doesn't need or want data fillers). Support for ignorable padding characters will remain needed for a long time (if not forever) in plain text, even if a plain-text *file* does not need it (there are other uses of plain text than complete files). The many people who expect that a file containing any null byte is binary rather than text are restricting themselves to the use of "text/plain" in MIME message formats.
A GSM message can embed null bytes without being considered binary, and contains no data fillers; it could not be interchanged as a MIME "text/plain" datatype even with a "charset" qualifier, but it would still be plain text under the definition accepted at the Unicode or ISO/IEC 10646 level (those standards do not care much about which encoding schemes or transport syntaxes are used to interchange plain text, but about the interpretation of the *decoded* code points; the lower encoding levels in the Unicode standard are mandatory only if applications choose to implement those levels and label their data with the corresponding reserved charset identifiers included in the Unicode standard).
So it is a fact that Unicode's UTF-8 format is fully compatible with Unicode (i.e. it can encode any Unicode text, including text containing NULL characters), but not with C and other applications that cannot rely on the effective text length being specified out of band and instead require an explicit, mandatory end-of-text marker. This is where transport syntaxes are used in MIME to escape reserved bytes that have special functions in the embedding transport: hexadecimal, Base64, quoted-printable, uuencode, COBS, ... or escaping control bytes with the 0xC0 lead byte (unused in standard UTF-8) followed by the control byte offset by 0x80. The string definition in C implies that nulls must be escaped if they are needed, or that the string length be encoded separately out of band (but in that case it is no longer a standard null-terminated C string).
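A minimal sketch of that 0xC0 0x80 escaping of embedded nulls, in Java (encodeNulSafe is a hypothetical name; in practice one would usually just rely on DataOutputStream.writeUTF, which already applies this escape):

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;

    public class NulEscape {
        // Encode a string as UTF-8, except that U+0000 becomes the two bytes C0 80,
        // so the result never contains a 0x00 byte and stays safe as a C string.
        static byte[] encodeNulSafe(String s) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                i += Character.charCount(cp);
                if (cp == 0) {                       // the "modified UTF-8" NUL escape
                    out.write(0xC0);
                    out.write(0x80);
                } else {
                    byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                    out.write(utf8, 0, utf8.length);
                }
            }
            return out.toByteArray();
        }

        public static void main(String[] args) {
            byte[] bytes = encodeNulSafe("a\u0000b");
            for (byte b : bytes) System.out.printf("%02X ", b & 0xFF);  // 61 C0 80 62 - no 0x00 byte
            System.out.println();
        }
    }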
C does not mandate any escaping mechanism, and Java's "modified UTF-8" is perfectly valid in this context as a transport syntax for CESU-8. In fact I don't like the term "modified UTF-8" used by Sun in its revised documentation; it causes confusion, and it would be more exact if Sun said it is a "modified CESU-8" (so that it matches how Java now handles supplementary characters), and if Sun documented that this format includes bijective support of strings with the non-character code units '\uFFFE' and '\uFFFF', and, more critically, of malformed strings with unpaired or isolated surrogates (which are normally not acceptable even in standard CESU-8).
A better term without reference to UTF-8 or even CESU-8 would be useful (even if accompanying information refers to the other standard UTF-8 and CESU-8 encoding schemes). As this encoding is needed for the *serialization* (data interchange over a byte-oriented stream) of Java String objects (which can contain malformed Unicode text with unpaired surrogates, and any valid, invalid, reserved, or unassigned 16-bit code units), why not refer to this encoding as "Java-String-8" (or "JS-8" for short)?
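Whatever the name, the lossless round-trip of arbitrary 16-bit code units (including isolated surrogates and non-characters) can already be seen with the existing DataOutputStream.writeUTF / DataInputStream.readUTF pair; a minimal sketch, assuming only the standard java.io classes:

    import java.io.*;

    public class SurrogateRoundTrip {
        public static void main(String[] args) throws IOException {
            // A malformed UTF-16 string: an isolated high surrogate plus a noncharacter.
            String malformed = "x\uD800y\uFFFF";

            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(malformed);

            String back = new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray())).readUTF();

            // The 16-bit code units come back unchanged, which is exactly the
            // lossless "serialize any Java String" property discussed above.
            System.out.println(back.equals(malformed));   // true
        }
    }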

