Marcin Kowalczyk noted:

> Unicode has the following property. Consider sequences of valid
> Unicode characters: from the range U+0000..U+10FFFF, excluding
> non-characters (i.e. U+nFFFE and U+nFFFF for n from 0 to 0x10 and
> U+FDD0..U+FDEF) and surrogates. Any such sequence can be encoded
> in any UTF-n, and nothing else is expected from UTF-n.
Actually not quite correct. See Section 3.9 of the standard.

The character encoding forms (UTF-8, UTF-16, UTF-32) are defined on the
range of scalar values for Unicode: 0..D7FF, E000..10FFFF. Each of the
UTFs can represent all of those scalar values, and can be converted
accurately to either of the other UTFs for each of those values. That
*includes* all the code points used for noncharacters.

U+FFFF is a noncharacter. It is not assigned to an encoded abstract
character. However, it has a well-formed representation in each of the
UTF-8, UTF-16, and UTF-32 encoding forms, namely:

  UTF-8:  <EF BF BF>
  UTF-16: <FFFF>
  UTF-32: <0000FFFF>

> With the exception of the set of non-characters being irregular and
> IMHO too large (why to exclude U+FDD0..U+FDEF?!), and a weird top
> limit caused by UTF-16, this gives a precise and unambiguous set of
> values for which encoders and decoders are supposed to work.

Well, since conformant encoders and decoders must work for all the
noncharacter code points as well, and since U+10FFFF, however odd
numerologically, is itself precise and unambiguous, I don't think you
even need these qualifications.

> Well,
> except non-obvious treatment of a BOM (at which level it should be
> stripped? does this include UTF-8?).

The handling of BOM is relevant to the character encoding *schemes*,
where the issues are serialization into byte streams and interpretation
of those byte streams. Whether you include U+FEFF in text or not
depends on your interpretation of the encoding scheme for a Unicode
byte stream.

At the level of the character encoding forms (the UTFs), the handling
of BOM is just as for any other scalar value, and is completely
unambiguous:

  UTF-8:  <EF BB BF>
  UTF-16: <FEFF>
  UTF-32: <0000FEFF>

> A variant of UTF-8 which includes all byte sequences yields a much
> less regular set of abstract string values.
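This is easy to check mechanically. Here is a small Python sketch (Python's built-in codecs are used purely as one example of a conformant UTF implementation; the standard itself is the authority) showing that the noncharacter U+FFFF encodes and round-trips in all three encoding forms exactly as listed above:

```python
# U+FFFF is a noncharacter, but it is a valid Unicode scalar value,
# so every UTF encoding form has a well-formed representation for it.
s = "\uffff"

utf8 = s.encode("utf-8")       # <EF BF BF>
utf16 = s.encode("utf-16-be")  # <FF FF>
utf32 = s.encode("utf-32-be")  # <00 00 FF FF>

assert utf8 == b"\xef\xbf\xbf"
assert utf16 == b"\xff\xff"
assert utf32 == b"\x00\x00\xff\xff"

# Each form decodes back to U+FFFF: the noncharacter round-trips
# with no extension to any of the encoding forms.
assert utf8.decode("utf-8") == s
assert utf16.decode("utf-16-be") == s
assert utf32.decode("utf-32-be") == s

# Likewise <EF BF BE> is well-formed UTF-8; it decodes to U+FFFE.
assert b"\xef\xbf\xbe".decode("utf-8") == "\ufffe"
```

(Surrogate code points, by contrast, are *not* scalar values, and the same codecs reject them; that is the actual boundary of the encoding forms.)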
> Especially if we consider
> that 11101111 10111111 10111110 binary is not valid UTF-8, as much as
> 0xFFFE is not valid UTF-16 (it's a reversed BOM; it must be invalid in
> order for a BOM to fulfill its role).

This is incorrect. <EF BF BE> *is* valid UTF-8, just as <FFFE> is
valid UTF-16. In both cases these are valid representations of a
noncharacter, which should not be used in public interchange, but that
is a separate issue from the fact that the code unit sequences
themselves are "well-formed" by definition of the Unicode encoding
forms.

> Question: should a new programming language which uses Unicode for
> string representation allow non-characters in strings?

Yes.

> Argument for
> allowing them: otherwise they are completely useless at all, except
> U+FFFE for BOM detection. Argument for disallowing them: they make
> UTF-n inappropriate for serialization of arbitrary strings, and thus
> non-standard extensions of UTF-n must be used for serialization.

Incorrect. See above. No extensions of any of the encoding forms are
needed to handle noncharacters correctly.

--Ken

