I see this sentence in the
last paragraph:
The global interpretation of this paragraph is thus that it
defines Unicode as a subset of ISO/IEC 10646-1:2000, for the first 17 planes,
where Unicode and ISO/IEC 10646-1:2000 will be fully interoperable. So it does
NOT say that the use of five- and six-byte sequences is illegal for the use of
UTF-8 *AS A TRANSFORMATION OF _ISO/IEC 10646-1:2000_
CHARACTERS*.
Due to that, an application needs to specify whether it
will support and comply with the full ISO/IEC 10646-1:2000 character
set or only with the Unicode subset. As both standards specify "UTF-8" as the name
of the transformation, and the transformation is in fact defined in ISO/IEC
10646-1:2000, it seems that there's no restriction on UTF-8 sequence lengths,
only restrictions on their use to encode characters in the Unicode
subset.
This leaves open the opportunity to encode
*non-Unicode* characters of *ISO/IEC
10646-1:2000*, i.e. characters outside its first 17 planes, which
must not be interpreted as valid Unicode characters but can still be
interpreted as valid ISO/IEC 10646-1:2000 characters.
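To make that concrete, here is a minimal sketch (in Python; the function name
and the sample values are mine, not taken from either standard) of the original
ISO/IEC 10646 style UTF-8 encoder, which still produces five- and six-byte
sequences for code points beyond the 17 Unicode planes:

# Minimal sketch of the original ISO/IEC 10646 style "UTF-8" encoder,
# which allows 5- and 6-byte sequences for code points beyond the 17
# Unicode planes (above U+10FFFF, up to U+7FFFFFFF).
def encode_utf8_iso10646(cp: int) -> bytes:
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("outside the ISO/IEC 10646 code space")
    if cp <= 0x7F:
        return bytes([cp])
    # (sequence length, highest code point, leading-byte bits) for 2..6 bytes
    for nbytes, limit, lead in ((2, 0x7FF, 0xC0), (3, 0xFFFF, 0xE0),
                                (4, 0x1FFFFF, 0xF0), (5, 0x3FFFFFF, 0xF8),
                                (6, 0x7FFFFFFF, 0xFC)):
        if cp <= limit:
            shift = 6 * (nbytes - 1)
            out = [lead | (cp >> shift)]
            while shift:
                shift -= 6
                out.append(0x80 | ((cp >> shift) & 0x3F))
            return bytes(out)

print(encode_utf8_iso10646(0x10FFFF).hex(' '))   # f4 8f bf bf  (last Unicode code point)
print(encode_utf8_iso10646(0x110000).hex(' '))   # f4 90 80 80  (first non-Unicode code point)
print(encode_utf8_iso10646(0x3FFFFFF).hex(' '))  # fb bf bf bf bf  (5-byte, ISO 10646 only)

A Unicode-conformant encoder would simply reject anything above U+10FFFF
instead of emitting the last two sequences shown.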
Then later, we have this final sentence:
Here also there is a difference: non-characters are explicitly
said to be *non-Unicode* characters (i.e. they must not be
interpreted as valid Unicode characters, not even as the replacement character),
but they can still be interpreted as valid ISO/IEC 10646-1:2000 characters if
ISO/IEC 10646-1:2000 allows it (and it seems to allow it in UTF-8 transformed
strings). Here also an application will need to specify which character set it
supports. If the application chooses to support and conform to ISO/IEC
10646-1:2000, there's no guarantee that it will conform to Unicode.
As there's a requirement not to interpret non-Unicode
characters as Unicode characters, an application that conforms to Unicode cannot
then remap valid ISO/IEC 10646-1:2000 characters to REPLACEMENT CHARACTER to
make the encoded text interoperable with Unicode. If it chooses to do so, it
uses an algorithm which is invalid within the scope of Unicode (so it's not a
Unicode folding), but which is valid and conforming in the ISO/IEC 10646-1:2000
universe, where it will be considered a fully compliant ISO/IEC 10646-1:2000
folding transformation.
When I say "folding" in the last sentence, it really has the
same meaning as in Unicode, as it does not preserve the semantics of
the string and loses information: such folding operations must then be clearly
specified as being done outside the scope of the Unicode standard, and are not
by themselves identity UTF transformations. Such an application would then have
an ISO/IEC 10646-1 input interface, but not a compliant Unicode input interface,
even though its folded output may conform to Unicode.
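As an illustration, this is a minimal sketch of such a lossy folding step (the
function name is mine; neither standard defines this operation, which is
precisely why it has to be described as out of scope of Unicode):

REPLACEMENT = 0xFFFD

def fold_to_unicode(code_points):
    # Remap ISO/IEC 10646 code points that lie outside the 17 Unicode
    # planes to U+FFFD.  This is lossy and not a Unicode-defined operation.
    return [cp if cp <= 0x10FFFF else REPLACEMENT for cp in code_points]

print([hex(cp) for cp in fold_to_unicode([0x41, 0x3FFFFFF, 0x20AC])])
# prints ['0x41', '0xfffd', '0x20ac']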
Shouldn't texts encoded with strict Unicode conformance then be
labelled differently from ISO/IEC 10646-1 texts, even if they share the same
transformation, simply because they don't formally share the same character
set?
I mean here cases like the:
encoding="UTF-8"
pseudo-attribute in XML declarations, or the:
; charset=UTF-8
option in MIME "text/*" content-types (in RFC822 messages, or
in HTTP headers), or the:
<meta http-equiv="Content-Type"
content="text/html; charset=UTF-8" />
in HTML documents... Here the "charset" (or the XML "encoding") does not
really specify a character set, but only the transformation format.
This is probably not a problem, as long as the MIME
content-type standard clearly states that the "UTF-8" label must only be used to
mean the Unicode character set and not the ISO/IEC 10646-1:2000 character set or
its successors (I think such a thing is specified for the interpretation of
the encoding pseudo-attribute in XML declarations).
However, if such explicit wording is missing in the MIME
definition of the charset option, how can we specify on an interface the
effective charset used by a data file? Note that I'm not saying this is a
problem in the Unicode standard itself or in the ISO/IEC 10646-1:2000 standard,
but a problem specific to the MIME standard, where there's possibly an ambiguity
about the implied character set... What do you think?
Shouldn't Unicode ask for a revised MIME RFC to be published for
this case? If they don't want to, and in fact were referring to the ISO/IEC
10646-1 standard, then we have no choice: the MIME charset="UTF-8" option
indicates ONLY conformance to ISO/IEC 10646-1, but NOT conformance to Unicode,
and we would need to register another option to indicate strict Unicode
conformance.
Why not then register a MIME option such as
"subset=Unicode/4.0", and use it for example like this (a small parsing
sketch follows these examples):
- in RFC822 or HTTP headers:
Content-Type: text/plain; charset=UTF-8;
subset=Unicode/4.0
- in HTML:
<meta http-equiv="Content-Type"
content="text/html; charset=UTF-8; subset=Unicode" />
- in XML declarations???
<?xml version="1.2" charset="UTF-8"
subset="Unicode/4.0" ?>
(this last case is illegal for XML 1.1, as
there's no such subset pseudo-attribute; that's why I use a higher
version)
or:
<?xml version="1.0" charset="UTF-8;
subset=Unicode/4.0" ?>
(but there will also be interoperability
problems with XML parsers, which will be unable to recognize this
encoding name syntax)
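For the MIME case, here is a quick sketch of how such a header could be read
with Python's standard email machinery, assuming the proposed (and currently
unregistered) "subset" parameter were carried as an ordinary Content-Type
parameter:

from email.message import Message

msg = Message()
msg['Content-Type'] = 'text/plain; charset=UTF-8; subset="Unicode/4.0"'
print(msg.get_content_type())     # text/plain
print(msg.get_param('charset'))   # UTF-8
print(msg.get_param('subset'))    # Unicode/4.0  (the hypothetical parameter)

The value is quoted here because "/" is a tspecial in RFC 2045, so an
unquoted Unicode/4.0 would not be a valid parameter token.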
Oooootshhhh! How do you feel about this interoperability
issue?