I see this sentence in the
last paragraph:
The global interpretation of this paragraph is thus that it
defines Unicode as a subset of ISO/IEC 10646-1:2000, for the first 17 planes,
where Unicode and ISO/IEC 10646-1:2000 will be fully interoperable. So it does
NOT say that the use of five- and six-byte sequences is illegal for the use of
UTF-8 *AS A TRANSFORMATION OF _ISO/IEC 10646-1:2000_
CHARACTERS*.
Due to that, an application needs to specify whether it
will support and comply with the full ISO/IEC 10646-1:2000 character
set or only with the Unicode subset. As both standards specify "UTF-8" as the name
of the transformation, and the transformation is in fact defined in ISO/IEC
10646-1:2000, it seems that there's no restriction on UTF-8 sequence lengths,
only restrictions on their use to encode characters in the Unicode
subset.
This leaves open the opportunity to encode
*non-Unicode* characters of *ISO/IEC
10646-1:2000*, i.e. characters outside its first 17 planes, which
must not be interpreted as valid Unicode characters but can still be
interpreted as valid ISO/IEC 10646-1:2000 characters.
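To make that concrete, here is a minimal sketch (in Python; the function name
and the sample values are mine, not taken from either standard) of the original
ISO/IEC 10646 style UTF-8 encoder, which still produces five- and six-byte
sequences for code points beyond the 17 Unicode planes:

# Minimal sketch of the original ISO/IEC 10646 style "UTF-8" encoder,
# which allows 5- and 6-byte sequences for code points beyond the 17
# Unicode planes (above U+10FFFF, up to U+7FFFFFFF).
def encode_utf8_iso10646(cp: int) -> bytes:
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("outside the ISO/IEC 10646 code space")
    if cp <= 0x7F:
        return bytes([cp])
    # (sequence length, highest code point, leading-byte bits) for 2..6 bytes
    for nbytes, limit, lead in ((2, 0x7FF, 0xC0), (3, 0xFFFF, 0xE0),
                                (4, 0x1FFFFF, 0xF0), (5, 0x3FFFFFF, 0xF8),
                                (6, 0x7FFFFFFF, 0xFC)):
        if cp <= limit:
            shift = 6 * (nbytes - 1)
            out = [lead | (cp >> shift)]
            while shift:
                shift -= 6
                out.append(0x80 | ((cp >> shift) & 0x3F))
            return bytes(out)

print(encode_utf8_iso10646(0x10FFFF).hex(' '))   # f4 8f bf bf  (last Unicode code point)
print(encode_utf8_iso10646(0x110000).hex(' '))   # f4 90 80 80  (first non-Unicode code point)
print(encode_utf8_iso10646(0x3FFFFFF).hex(' '))  # fb bf bf bf bf  (5-byte, ISO 10646 only)

A Unicode-conformant encoder would simply reject anything above U+10FFFF
instead of emitting the last two sequences shown.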
Then later, we have this final sentence:
Here also there is a difference: non-characters are explicitly
said to be *non-Unicode* characters (i.e. they must not be
interpreted as valid Unicode characters, not even as the replacement character),
but they can still be interpreted as valid ISO/IEC 10646-1:2000 characters if
ISO/IEC 10646-1:2000 allows it (and it seems to allow it in UTF-8 transformed
strings). Here also an application will need to specify which character set it
supports. If the application chooses to support and conform to ISO/IEC
10646-1:2000, there's no guarantee that it will conform to Unicode.
As there's a requirement not to interpret non-Unicode
characters as Unicode characters, an application that conforms to Unicode cannot
then remap valid ISO/IEC 10646-1:2000 characters to REPLACEMENT CHARACTER to
make the encoded text interoperable with Unicode. If it chooses to do so, it
uses an algorithm which is invalid within the scope of Unicode (so it's not a
Unicode folding), but which is valid and conforming in the ISO/IEC 10646-1:2000
universe, where it will be considered a fully compliant ISO/IEC 10646-1:2000
folding transformation.
When I say "folding" in the last sentence, it really has the
same meaning as in Unicode, as it does not preserve the semantics of
the string and loses information: such folding operations must then be clearly
specified as being done outside the scope of the Unicode standard, and are not
by themselves identity UTF transformations. Such an application would then have
an ISO/IEC 10646-1 input interface, but not a compliant Unicode input interface,
even though its folded output may conform to Unicode.
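As an illustration, this is a minimal sketch of such a lossy folding step (the
function name is mine; neither standard defines this operation, which is
precisely why it has to be described as out of scope of Unicode):

REPLACEMENT = 0xFFFD

def fold_to_unicode(code_points):
    # Remap ISO/IEC 10646 code points that lie outside the 17 Unicode
    # planes to U+FFFD.  This is lossy and not a Unicode-defined operation.
    return [cp if cp <= 0x10FFFF else REPLACEMENT for cp in code_points]

print([hex(cp) for cp in fold_to_unicode([0x41, 0x3FFFFFF, 0x20AC])])
# prints ['0x41', '0xfffd', '0x20ac']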
Shouldn't texts encoded with strict Unicode conformance then be
labelled differently from ISO/IEC 10646-1 texts, even if they share the same
transformation, simply because they don't formally share the same character
set?
I mean here cases like the:
encoding="UTF-8"
pseudo-attribute in XML declarations, or the:
; charset=UTF-8
option in MIME "text/*" content-types (in RFC822 messages, or
in HTTP headers), or the:
<meta http-equiv="Content-Type"
content="text/html; charset=UTF-8" />
in HTML documents... Here the "charset" (or the XML "encoding") does not
really specify a character set, but only the transformation format.
This is probably not a problem, as long as the MIME
content-type standard clearly states that the "UTF-8" label must only be used to
mean the Unicode character set and not the ISO/IEC 10646-1:2000 character set or
its successors (I think such a thing is specified for the interpretation of
the encoding pseudo-attribute in XML declarations).
However, if such explicit wording is missing in the MIME
definition of the charset option, how can we specify on an interface the
effective charset used by a data file? Note that I'm not saying this is a
problem in the Unicode standard itself or in the ISO/IEC 10646-1:2000 standard,
but a problem specific to the MIME standard, where there's possibly an ambiguity
about the implied character set... What do you think?
Shouldn't Unicode ask for a revised MIME RFC to be published for
this case? If they don't want to, and in fact were referring to the ISO/IEC
10646-1 standard, then we have no choice: the MIME charset="UTF-8" option
indicates ONLY conformance to ISO/IEC 10646-1, but NOT conformance to Unicode,
and we would need to register another option to indicate strict Unicode
conformance.
Why not then register a MIME option such as
"subset=Unicode/4.0", and use it for example like this (a small parsing
sketch follows these examples):
- in RFC822 or HTTP headers:
Content-Type: text/plain; charset=UTF-8;
subset=Unicode/4.0
- in HTML:
<meta http-equiv="Content-Type"
content="text/html; charset=UTF-8; subset=Unicode" />
- in XML declarations???
<?xml version="1.2" charset="UTF-8"
subset="Unicode/4.0" ?>
(this last case is illegal for XML 1.1, as
there's no such subset pseudo-attribute; that's why I use a higher
version)
or:
<?xml version="1.0" charset="UTF-8;
subset=Unicode/4.0" ?>
(but there will also be interoperability
problems with XML parsers, which will be unable to recognize this
encoding name syntax)
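For the MIME case, here is a quick sketch of how such a header could be read
with Python's standard email machinery, assuming the proposed (and currently
unregistered) "subset" parameter were carried as an ordinary Content-Type
parameter:

from email.message import Message

msg = Message()
msg['Content-Type'] = 'text/plain; charset=UTF-8; subset="Unicode/4.0"'
print(msg.get_content_type())     # text/plain
print(msg.get_param('charset'))   # UTF-8
print(msg.get_param('subset'))    # Unicode/4.0  (the hypothetical parameter)

The value is quoted here because "/" is a tspecial in RFC 2045, so an
unquoted Unicode/4.0 would not be a valid parameter token.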
Oooootshhhh! How do you feel about this interoperability
issue?