Marcin Kowalczyk noted:

> Unicode has the following property. Consider sequences of valid
> Unicode characters: from the range U+0000..U+10FFFF, excluding
> non-characters (i.e. U+nFFFE and U+nFFFF for n from 0 to 0x10 and
> U+FDD0..U+FDEF) and surrogates. Any such sequence can be encoded
> in any UTF-n, and nothing else is expected from UTF-n.
Actually not quite correct. See Section 3.9 of the standard.

The character encoding forms (UTF-8, UTF-16, UTF-32) are defined on the
range of scalar values for Unicode: 0..D7FF, E000..10FFFF. Each of the
UTFs can represent all of those scalar values, and can be converted
accurately to either of the other UTFs for each of those values. That
*includes* all the code points used for noncharacters.

U+FFFF is a noncharacter. It is not assigned to an encoded abstract
character. However, it has a well-formed representation in each of the
UTF-8, UTF-16, and UTF-32 encoding forms, namely:

  UTF-8:  <EF BF BF>
  UTF-16: <FFFF>
  UTF-32: <0000FFFF>

> With the exception of the set of non-characters being irregular and
> IMHO too large (why to exclude U+FDD0..U+FDEF?!), and a weird top
> limit caused by UTF-16, this gives a precise and unambiguous set of
> values for which encoders and decoders are supposed to work.

Well, since conformant encoders and decoders must work for all the
noncharacter code points as well, and since U+10FFFF, however odd
numerologically, is itself precise and unambiguous, I don't think you
even need these qualifications.

> Well,
> except non-obvious treatment of a BOM (at which level it should be
> stripped? does this include UTF-8?).

The handling of BOM is relevant to the character encoding *schemes*,
where the issues are serialization into byte streams and interpretation
of those byte streams. Whether you include U+FEFF in text or not
depends on your interpretation of the encoding scheme for a Unicode
byte stream.

At the level of the character encoding forms (the UTFs), the handling
of BOM is just as for any other scalar value, and is completely
unambiguous:

  UTF-8:  <EF BB BF>
  UTF-16: <FEFF>
  UTF-32: <0000FEFF>

> A variant of UTF-8 which includes all byte sequences yields a much
> less regular set of abstract string values.
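This is easy to check mechanically. Here is a small Python sketch (Python's built-in codecs are used purely as one example of a conformant UTF implementation; the standard itself is the authority) showing that the noncharacter U+FFFF encodes and round-trips in all three encoding forms exactly as listed above:

```python
# U+FFFF is a noncharacter, but it is a valid Unicode scalar value,
# so every UTF encoding form has a well-formed representation for it.
s = "\uffff"

utf8 = s.encode("utf-8")       # <EF BF BF>
utf16 = s.encode("utf-16-be")  # <FF FF>
utf32 = s.encode("utf-32-be")  # <00 00 FF FF>

assert utf8 == b"\xef\xbf\xbf"
assert utf16 == b"\xff\xff"
assert utf32 == b"\x00\x00\xff\xff"

# Each form decodes back to U+FFFF: the noncharacter round-trips
# with no extension to any of the encoding forms.
assert utf8.decode("utf-8") == s
assert utf16.decode("utf-16-be") == s
assert utf32.decode("utf-32-be") == s

# Likewise <EF BF BE> is well-formed UTF-8; it decodes to U+FFFE.
assert b"\xef\xbf\xbe".decode("utf-8") == "\ufffe"
```

(Surrogate code points, by contrast, are *not* scalar values, and the same codecs reject them; that is the actual boundary of the encoding forms.)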
> Especially if we consider
> that 11101111 10111111 10111110 binary is not valid UTF-8, as much as
> 0xFFFE is not valid UTF-16 (it's a reversed BOM; it must be invalid in
> order for a BOM to fulfill its role).

This is incorrect. <EF BF BE> *is* valid UTF-8, just as <FFFE> is
valid UTF-16. In both cases these are valid representations of a
noncharacter, which should not be used in public interchange, but that
is a separate issue from the fact that the code unit sequences
themselves are "well-formed" by definition of the Unicode encoding
forms.

> Question: should a new programming language which uses Unicode for
> string representation allow non-characters in strings?

Yes.

> Argument for
> allowing them: otherwise they are completely useless at all, except
> U+FFFE for BOM detection. Argument for disallowing them: they make
> UTF-n inappropriate for serialization of arbitrary strings, and thus
> non-standard extensions of UTF-n must be used for serialization.

Incorrect. See above. No extensions of any of the encoding forms are
needed to handle noncharacters correctly.

--Ken

