On Tue, 24 Jun 2014 09:16:00 -0400 CE Whitehead <[email protected]> wrote:
> ME: if two sequences are canonically equivalent except that one has
> noncharacters in it, are these still canonically equivalent?

Canonical equivalence is defined for all sequences of scalar values; it
is just that, for most unassigned characters, it changes from version to
version. Non-characters only decompose to themselves and do not occur in
the canonical (or indeed compatibility) decomposition of anything else,
so a sequence containing a non-character cannot be canonically
equivalent to a sequence not containing a non-character.

> Regarding the sentinels; I am an outsider but assume that with
> Corrigendum 9 U+FFFE will continue to be mentioned as having
> generally (not always?) standard use throughout; in Chapter 16.7 it
> is currently mentioned; I assume it will still be -- according to
> info. in the FAQ and elsewhere:
> http://www.unicode.org/faq/private_use.html "U+FFFE. The 16-bit
> unsigned hexadecimal value U+FFFE is not a Unicode character value,
> and should be taken as a signal that Unicode characters should be
> byte-swapped before interpretation. U+FFFE should only be interpreted
> as an incorrectly byte-swapped version of U+FEFF"

There is a lot of untruth in that FAQ entry, alas.

I think U+FFFE and possibly U+FFFF should be treated differently to the
other 64 non-characters. At present, when an interchanged file in the
UTF-16 encoding scheme appears to begin with a BOM, there is no
certainty as to whether it really begins with a BOM or with U+FFFE. The
only promise is that such a file contains an even number of data bytes.
Any such sequence is valid! Will the UTF-16 encoding scheme be
withdrawn?

Richard.
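
For anyone who wants to poke at the two points above, here is a minimal Python sketch. It is only an illustration under stated assumptions: the helper names `is_noncharacter` and `canonically_equivalent` are made up for this message, equivalence is tested by comparing NFD forms, and the results reflect whatever Unicode version the local `unicodedata` module was built against.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """Hypothetical helper: True for the 66 non-characters, i.e.
    U+FDD0..U+FDEF plus the last two code points of each of the 17 planes."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: two strings are canonically equivalent
    exactly when their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# A non-character decomposes only to itself, so it survives normalization.
with_nonchar_composed = "\u00e9\ufffe"     # e-acute, then U+FFFE
with_nonchar_decomposed = "e\u0301\ufffe"  # e, combining acute, then U+FFFE
without_nonchar = "\u00e9"

both_have_it = canonically_equivalent(with_nonchar_composed,
                                      with_nonchar_decomposed)  # True
only_one_has_it = canonically_equivalent(with_nonchar_composed,
                                         without_nonchar)       # False

# The BOM ambiguity: the same four bytes read two ways.
data = b"\xff\xfe\x41\x00"
read_with_bom = data.decode("utf-16")     # FF FE taken as a BOM: "A"
read_as_be = data.decode("utf-16-be")     # no BOM assumed: U+FFFE, U+4100
```

Both decodings succeed without error, which is the awkward part: nothing in the bytes says whether the leading FF FE was meant as a byte order mark or as big-endian U+FFFE.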

