Re: Unicode Regular Expressions, Surrogate Points and UTF-8

Philippe Verdy Sat, 31 May 2014 04:24:23 -0700

I think Richard did not speak about that, but about the behavior of a
matcher that would start parsing a text using a wrongly guessed encoding.
He gave the example of a valid CESU-8 text containing U+10000: when
reading it incorrectly as UTF-8, the parser gets the 4 invalid sequences.
CESU-8 cannot easily be detected at the start of the stream from the encoding
of the byte order mark U+FEFF, because U+FEFF is encoded identically in
UTF-8 and CESU-8.
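
To make the U+10000 case concrete, here is a minimal Python sketch. The
helper names cesu8_encode and decodes_as_strict_utf8 are made up for this
illustration (Python has no built-in CESU-8 codec), and the encoder assumes
the input string contains no lone surrogates.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (hypothetical helper, not a standard codec).

    Supplementary code points are first split into a UTF-16 surrogate pair,
    then each surrogate is written as its own 3-byte sequence.
    Assumes the input contains no lone surrogates.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)    # lead surrogate
            low = 0xDC00 | (v & 0x3FF)   # trail surrogate
            for unit in (high, low):
                out += bytes((
                    0xE0 | (unit >> 12),
                    0x80 | ((unit >> 6) & 0x3F),
                    0x80 | (unit & 0x3F),
                ))
        else:
            # BMP code points are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)


def decodes_as_strict_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = cesu8_encode("\U00010000")
print(sample.hex())                    # eda080edb080
print(decodes_as_strict_utf8(sample))  # False
```

A strict UTF-8 decoder rejects these bytes because they encode surrogate
code points; how many separate invalid sequences get reported depends on
the decoder's error-recovery policy.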


However, CESU-8 can be detected by the initial encoding of another byte
order mark, U+1FFFE (which is a non-character that MUST be stripped from the
parsed stream of code points once detected). But documents starting
with this non-character are supposed to be non-interoperable by definition,
even though the presence of that special byte order mark would be a very safe
way to identify CESU-8 and discriminate it from UTF-8.
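
A rough Python sketch of that detection idea follows. Using U+1FFFE as a
signature is only the suggestion made in this message, not something defined
by a standard, and the function name sniff_leading_signature and its return
labels are invented for the illustration.

```python
# U+FEFF is a BMP code point: same 3 bytes in UTF-8 and in CESU-8.
BOM_FEFF = b"\xef\xbb\xbf"
# U+1FFFE in UTF-8: one 4-byte sequence.
SIG_1FFFE_UTF8 = b"\xf0\x9f\xbf\xbe"
# U+1FFFE in CESU-8: surrogates D83F DFFE, each written as 3 bytes.
SIG_1FFFE_CESU8 = b"\xed\xa0\xbf\xed\xbf\xbe"


def sniff_leading_signature(data: bytes) -> str:
    """Guess the encoding from the first bytes (hypothetical helper).

    Only the proposed U+1FFFE signature tells CESU-8 and UTF-8 apart;
    a leading U+FEFF is ambiguous between the two.
    """
    if data.startswith(SIG_1FFFE_CESU8):
        return "cesu-8"
    if data.startswith(SIG_1FFFE_UTF8):
        return "utf-8"
    if data.startswith(BOM_FEFF):
        return "utf-8 or cesu-8"
    return "unknown"


print(sniff_leading_signature(SIG_1FFFE_CESU8 + b"text"))  # cesu-8
print(sniff_leading_signature(BOM_FEFF + b"text"))         # utf-8 or cesu-8
```

In both signature cases the caller would still have to strip the
non-character from the decoded code points, as noted above.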



2014-05-31 1:15 GMT+02:00 Markus Scherer <[email protected]>:

> If you use Unicode 16-bit strings, it's easy to "pass through" unpaired
> surrogates and treat them like code points; it's often not productive or
> necessary to check for them all the time, that is, to be strict about
> UTF-16.
>
> On the other hand, I don't think anyone expects you to support invalid
> UTF-8, and especially not to support any and all Unicode 8-bit strings (see
> Unicode 3.9 Unicode Encoding Forms for what I mean here).
>
> If you find UTS #18 unclear or misleading, I suggest you submit feedback
> pointing out specific text issues.
>
> markus
>
> _______________________________________________
> Unicode mailing list
> [email protected]
> http://unicode.org/mailman/listinfo/unicode
>
>
