Re: Unicode Regular Expressions, Surrogate Points and UTF-8

Richard Wordingham Sun, 01 Jun 2014 10:09:05 -0700

On Sun, 1 Jun 2014 08:58:26 -0700
Markus Scherer <[email protected]> wrote:


> You misunderstand. In Java, \uD808\uDF45 is the only way to escape a
> supplementary code point, but as long as you have a surrogate pair,
> it is treated as a code point in APIs that support them.

Wasn't it obvious that, in the following paragraph, \uD808\uDF45 was a
pattern?

"Bear in mind that a pattern \uD808 shall not match anything in a
well-formed Unicode string. \uD808\uDF45 specifies a sequence of two
codepoints. This sequence can occur in an ill-formed UTF-32 Unicode
string and before Unicode 5.2 could readily be taken to occur in an
ill-formed UTF-8 Unicode string.  RL1.7 declares that for a regular
expression engine, the codepoint sequence <U+D808, U+DF45> cannot
occur in a UTF-16 Unicode string; instead, the code unit sequence <D808
DF45> is the codepoint sequence <U+12345 CUNEIFORM SIGN URU TIMES
KI>."

(It might have been clearer to you if I'd said '8-bit' and '16-bit'
instead of UTF-8 and UTF-16.  It does make me wonder what you'd call a
16-bit encoding of arbitrary *codepoint* sequences.)
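
For illustration, here is a minimal Java sketch of the behaviour Markus
describes: the two escapes \uD808\uDF45 form a surrogate pair, and
code-point-aware APIs report it as the single code point U+12345. The
class and method names are hypothetical, chosen only for this example,
and it deliberately does not test how a lone \uD808 pattern behaves.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are not from the thread.
final class SurrogatePairDemo {
    // Java source escapes: the string holds two UTF-16 code units.
    static final String TEXT = "\uD808\uDF45";

    // Regex-level escapes: the doubled backslashes reach the regex parser intact.
    static final String PATTERN = "\\uD808\\uDF45";

    // Number of 16-bit code units in the string.
    static int codeUnits() {
        return TEXT.length();
    }

    // Number of code points, counting a surrogate pair as one.
    static int codePoints() {
        return TEXT.codePointCount(0, TEXT.length());
    }

    // The code point that starts at index 0.
    static int firstCodePoint() {
        return TEXT.codePointAt(0);
    }

    // Does the escaped pattern match the whole string?
    static boolean pairPatternMatches() {
        return Pattern.matches(PATTERN, TEXT);
    }

    // Does a single '.' consume the whole surrogate pair?
    static boolean dotMatchesWholePair() {
        return Pattern.matches(".", TEXT);
    }

    public static void main(String[] args) {
        System.out.println("code units:  " + codeUnits());
        System.out.println("code points: " + codePoints());
        System.out.printf("first code point: U+%X%n", firstCodePoint());
        System.out.println("pair pattern matches: " + pairPatternMatches());
        System.out.println("dot matches pair:     " + dotMatchesWholePair());
    }
}
```

The string has two code units but one code point, which is the
distinction the quoted paragraph draws between the code unit sequence
<D808 DF45> and the codepoint sequence <U+12345>.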

Richard.
_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode
