I think Richard did not speak about that, but about the behavior of a matcher that would start parsing a text using a wrongly guessed encoding. He gave the example of a valid CESU-8 text containing U+10000: when reading it incorrectly as UTF-8, the parser gets the 4 invalid sequences. CESU-8 cannot be easily detected at the start of the stream from the encoding of the byte order mark U+FEFF, because that code point is encoded identically in CESU-8 and in UTF-8.
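To make the byte-level difference concrete, here is a minimal Python sketch. The helper names cesu8_encode and is_strict_utf8 are mine (hypothetical, not part of any library), and it relies on Python's "surrogatepass" error handler to write each surrogate as a 3-byte sequence. It only shows that U+FEFF gives the same bytes in both encodings, while a supplementary code point such as U+10000 gives CESU-8 bytes that a strict UTF-8 decoder rejects; how many separate errors a decoder reports depends on its error-recovery policy, which is not modeled here.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (hypothetical helper, for illustration only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split the supplementary code point into a UTF-16 surrogate pair,
            # then write each surrogate as its own 3-byte sequence.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            # BMP code points are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)


def is_strict_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


bom = cesu8_encode("\ufeff")           # same bytes as the UTF-8 BOM
u10000 = cesu8_encode("\U00010000")    # two 3-byte surrogate sequences
print(bom.hex(" "), is_strict_utf8(bom))
print(u10000.hex(" "), is_strict_utf8(u10000))
```

The same helper can be applied to U+1FFFE: its CESU-8 form is again a pair of 3-byte surrogate sequences, whereas its UTF-8 form is a single 4-byte sequence, so the two leading byte patterns differ.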
However, CESU-8 can be detected from the initial encoding of another byte order mark, U+1FFFE (a noncharacter that MUST be stripped from the parsed stream of code points once detected). Still, documents starting with this noncharacter are supposed to be non-interoperable by definition, even though the presence of that special byte order mark would be a very safe way to identify CESU-8 and discriminate it from UTF-8.

2014-05-31 1:15 GMT+02:00 Markus Scherer <[email protected]>:

> If you use Unicode 16-bit strings, it's easy to "pass through" unpaired
> surrogates and treat them like code points; it's often not productive or
> necessary to check for them all the time, that is, to be strict about
> UTF-16.
>
> On the other hand, I don't think anyone expects you to support invalid
> UTF-8, and especially not to support any and all Unicode 8-bit strings (see
> Unicode 3.9 Unicode Encoding Forms for what I mean here).
>
> If you find UTS #18 unclear or misleading, I suggest you submit feedback
> pointing out specific text issues.
>
> markus
>
> _______________________________________________
> Unicode mailing list
> [email protected]
> http://unicode.org/mailman/listinfo/unicode

