On Sat, 31 May 2014 13:21:23 +0200 Philippe Verdy <[email protected]> wrote:
> However CESU-8 can be detected by the initial encoding of another byte
> order mark U+1FFFE (which is a non-character that MUST be stripped
> once detected from the parsed stream of code points). However,
> documents starting with this non-character are supposed to be
> non-interoperable by definition, even though the presence of that
> special byte order mark would be very safe to secure CESU-8 and
> discriminate it from UTF-8.

Where is this tagging defined?

It is in general not true that non-characters must be stripped on input. That would be highly inappropriate in a conversion program that transformed between UTFs. Also, the collations defined in CLDR Version 23 file collation/zh.xml would be severely damaged if the non-characters were stripped out. In version 24 and later the file uses a different syntax and doesn't contain non-characters.

Richard.
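
For illustration only, here is a minimal Python sketch of the two points under discussion: how U+1FFFE would look in UTF-8 versus CESU-8, and a UTF-to-UTF conversion that passes non-characters through untouched. The helper names cesu8_encode and utf8_to_utf16be are hypothetical, not taken from any standard or library, and the "signature" shown is only the proposal quoted above, not a defined convention.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text as CESU-8.

    BMP code points are encoded as in UTF-8; supplementary code points are
    first split into a UTF-16 surrogate pair, and each surrogate is then
    encoded as a separate three-byte sequence.
    """
    out = bytearray()
    for ch in text:
        if ord(ch) < 0x10000:
            # "surrogatepass" lets lone surrogates in the input through.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            units = ch.encode("utf-16-be")  # 4 bytes: high + low surrogate
            for i in range(0, 4, 2):
                unit = int.from_bytes(units[i:i + 2], "big")
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def utf8_to_utf16be(data: bytes) -> bytes:
    """Hypothetical converter between UTFs: nothing is stripped."""
    return data.decode("utf-8").encode("utf-16-be")


NONCHAR = "\U0001FFFE"  # the non-character proposed as a CESU-8 marker

utf8_form = NONCHAR.encode("utf-8")    # F0 9F BF BE
cesu8_form = cesu8_encode(NONCHAR)     # ED A0 BF ED BF BE

# The converter keeps the non-character; it comes out as the
# surrogate pair D83F DFFE in UTF-16BE.
utf16_form = utf8_to_utf16be(utf8_form)
```

The two byte sequences differ, which is what the quoted proposal relies on. The conversion, on the other hand, round-trips U+1FFFE rather than removing it, which is the behaviour argued for in the reply.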

