Re: Unicode Regular Expressions, Surrogate Points and UTF-8

Richard Wordingham Sat, 31 May 2014 05:15:08 -0700

On Sat, 31 May 2014 13:21:23 +0200
Philippe Verdy <[email protected]> wrote:


> However CESU-8 can be detected by the initial encoding of another byte
> order mark U+1FFFE (which is a non-character that MUST be stripped
> once detected from the parsed stream of code points). However,
> documents starting with this non-character are supposed to be
> non-interoperable by definition even though the presence of that
> special byte order mark would be very safe to secure CESU-8 and
> discriminate it from UTF-8.

Where is this tagging defined?

It is in general not true that non-characters must be stripped on
input.  That would be highly inappropriate in a conversion program that
transformed between UTFs.  Also, the collations defined in CLDR Version
23 file collation/zh.xml would be severely damaged if the
non-characters were stripped out.  In version 24 and later the file
uses a different syntax and doesn't contain non-characters. 
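
To illustrate the conversion case, here is a minimal sketch, assuming
Python 3 and its built-in codecs. The helper name `transcode` and the
sample string are hypothetical, chosen only for this example. It passes
a few non-characters through UTF-8, UTF-16 and UTF-32 and checks that
none of them is dropped on the way.

```python
# Sketch only: `transcode` and the sample data are illustrative names,
# not part of any standard API.

# A few non-characters: U+FFFE, U+FFFF, U+FDD0 and U+1FFFE.
NONCHARACTERS = "\ufffe\uffff\ufdd0\U0001fffe"


def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode `data` from `src` and re-encode it as `dst`.

    Nothing is filtered, so every code point in the input,
    non-characters included, is carried over to the output.
    """
    return data.decode(src).encode(dst)


text = "a" + NONCHARACTERS + "b"

# The explicit -le variants are used so that the codecs neither add
# nor consume a byte order mark.
utf8 = text.encode("utf-8")
utf16 = transcode(utf8, "utf-8", "utf-16-le")
utf32 = transcode(utf16, "utf-16-le", "utf-32-le")
round_trip = transcode(utf32, "utf-32-le", "utf-8")

# The converter is lossless only because it strips nothing.
print(round_trip == utf8)
print([f"U+{ord(c):04X}" for c in utf32.decode("utf-32-le")])
```

A converter that stripped non-characters would return a shorter string
here, and the final UTF-8 output would no longer match the input.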

Richard.
_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode
