On Sun, 1 Jun 2014 08:58:26 -0700 Markus Scherer <[email protected]> wrote:
> You misunderstand. In Java, \uD808\uDF45 is the only way to escape a
> supplementary code point, but as long as you have a surrogate pair,
> it is treated as a code point in APIs that support them.

Wasn't obvious that in the following paragraph \uD808\uDF45 was a
pattern?

"Bear in mind that a pattern \uD808 shall not match anything in a
well-formed Unicode string. \uD808\uDF45 specifies a sequence of two
codepoints. This sequence can occur in an ill-formed UTF-32 Unicode
string and before Unicode 5.2 could readily be taken to occur in an
ill-formed UTF-8 Unicode string. RL1.7 declares that for a regular
expression engine, the codepoint sequence <U+D808, U+DF45> cannot
occur in a UTF-16 Unicode string; instead, the code unit sequence
<D808 DF45> is the codepoint sequence <U+12345 CUNEIFORM SIGN URU
TIMES KI>."

(It might have been clearer to you if I'd said '8-bit' and '16-bit'
instead of UTF-8 and UTF-16. It does make me wonder what you'd call a
16-bit encoding of arbitrary *codepoint* sequences.)

Richard.
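For concreteness, here is a minimal Java sketch of the behaviour under discussion, assuming the standard java.util.regex engine. The class and method names are hypothetical and exist only for illustration. It shows that the two code units <D808 DF45> form one code point in a Java string, and that the escaped pair in a pattern matches that one code point. The result for the lone pattern \uD808 is printed rather than asserted, since that is the very point in dispute.

```java
import java.util.regex.Pattern;

public class SurrogatePairDemo {
    // U+12345 CUNEIFORM SIGN URU TIMES KI, written as a surrogate pair.
    static final String SIGN = "\uD808\uDF45";

    // Number of 16-bit code units in the string.
    static int codeUnits() {
        return SIGN.length();
    }

    // Number of code points in the string.
    static int codePoints() {
        return SIGN.codePointCount(0, SIGN.length());
    }

    // The code point that starts at index 0.
    static int firstCodePoint() {
        return SIGN.codePointAt(0);
    }

    // The doubled backslashes pass the escapes through to the regex
    // parser, which reads the pair as a single supplementary code point.
    static boolean pairPatternMatches() {
        return Pattern.matches("\\uD808\\uDF45", SIGN);
    }

    // A single dot has to consume both code units for this to hold.
    static boolean dotMatchesWholeString() {
        return Pattern.matches(".", SIGN);
    }

    // Pattern consisting of the lone high surrogate only; the result is
    // reported by main() and deliberately not assumed here.
    static boolean loneSurrogateFound() {
        return Pattern.compile("\\uD808").matcher(SIGN).find();
    }

    public static void main(String[] args) {
        System.out.println("code units:           " + codeUnits());
        System.out.println("code points:          " + codePoints());
        System.out.printf("first code point:     U+%X%n", firstCodePoint());
        System.out.println("pair pattern matches: " + pairPatternMatches());
        System.out.println("dot matches whole:    " + dotMatchesWholeString());
        System.out.println("lone surrogate found: " + loneSurrogateFound());
    }
}
```

An engine that follows RL1.7 would be expected to report no match for the lone surrogate against this string; running the sketch shows what a given Java release actually does.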

