Jianping said:

> The issue comes from unpaired surrogates as <ED A0 80> and <ED B0 80> 

These are not *unpaired* surrogates -- they are *paired* surrogates.
Else your equating them to <F0 90 80 80> or U-00010000 would make no sense.
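
To make the equivalence concrete, here is a minimal Python sketch of the
arithmetic. The helper name combine_surrogates is made up for this
illustration, and the "surrogatepass" error handler is used only so that
Python will agree to decode the three-byte forms at all.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 high/low surrogate pair to the scalar value it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# <ED A0 80> and <ED B0 80> are the three-byte patterns for D800 and DC00.
high = ord(bytes.fromhex("EDA080").decode("utf-8", "surrogatepass"))
low = ord(bytes.fromhex("EDB080").decode("utf-8", "surrogatepass"))

# Taken together as a pair, they name one scalar value ...
scalar = combine_surrogates(high, low)

# ... whose well-formed UTF-8 encoding is the four-byte sequence.
utf8 = chr(scalar).encode("utf-8")
```

Here scalar comes out as 0x10000 and utf8 as the bytes F0 90 80 80, which
is only possible because the two code units form a pair.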

> can be
> in UTF-8 

They cannot be in well-formed UTF-8. They can only be in ill-formed
UTF-8 of the irregular subtype.
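
As a rough check, a strict decoder can serve as a well-formedness test.
The sketch below assumes Python 3, whose default "utf-8" codec rejects
encoded surrogates; the function name is_well_formed_utf8 is invented
for this example.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True

# The six-byte form: a surrogate pair encoded one code unit at a time.
six_byte_form = bytes.fromhex("EDA080EDB080")

# The four-byte form: U-00010000 encoded directly.
four_byte_form = bytes.fromhex("F0908080")

six_ok = is_well_formed_utf8(six_byte_form)
four_ok = is_well_formed_utf8(four_byte_form)
```

Under that assumption six_ok is False and four_ok is True.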

> and your search for <F0 90 80 80> (which is Unicode scalar value
> U-00010000)  cannot find it. But however, when the UTF-8 string converted into
> UTF-16, <ED A0 80> and <ED B0 80> will become
> <D800 DC00>, and you can find the same character by searching <D800 DC00> in
> UTF-16.
> 
> Unless this unpaired surrogate will be totally eliminated from UTF forms, this
> issue could be hit.

*PAIRED* surrogates.
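
The search mismatch described in the quoted text can be reproduced with
a small Python sketch. The sample string and variable names are made up,
and "surrogatepass" stands in for a lenient converter that accepts the
six-byte form; a strict converter would reject the input instead.

```python
# Text containing the six-byte form <ED A0 80> <ED B0 80>.
haystack_utf8 = b"abc" + bytes.fromhex("EDA080EDB080") + b"def"

# Searching the bytes for the four-byte form of U-00010000 finds nothing.
needle_utf8 = bytes.fromhex("F0908080")
found_in_utf8 = needle_utf8 in haystack_utf8

# A lenient conversion to UTF-16 (big-endian, no BOM) yields <D800 DC00>.
haystack_utf16 = (
    haystack_utf8.decode("utf-8", "surrogatepass")
    .encode("utf-16-be", "surrogatepass")
)

# The same character is now found by searching for the surrogate pair.
needle_utf16 = chr(0x10000).encode("utf-16-be")
found_in_utf16 = needle_utf16 in haystack_utf16
```

Here found_in_utf8 is False and found_in_utf16 is True. The match in
UTF-16 succeeds precisely because D800 and DC00 are adjacent and form a
pair.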

--Ken
