> From: Jianping Yang [mailto:[EMAIL PROTECTED]]

> The issue comes from unpaired surrogates as <ED A0 80> and <ED B0 80>
> can be in UTF-8 and your search for <F0 90 80 80> (which is Unicode
> scalar value U-00010000) cannot find it.

        This is good, because <ED A0 80> is U-0000d800 and <ED B0 80> is
U-0000dc00, so they should not match as U-00010000.
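
        To make the arithmetic concrete, here is a minimal sketch in Python
(my choice of language for illustration, nothing from the original
message). The helper name decode_utf8_sequence is hypothetical, and it
deliberately does no validity checking: it only unpacks the payload bits of
a single 3- or 4-byte sequence.

```python
def decode_utf8_sequence(data: bytes) -> int:
    """Unpack the payload bits of one 3- or 4-byte UTF-8 sequence.

    Illustration only: no check for surrogates, overlong forms,
    or malformed continuation bytes.
    """
    if len(data) == 3:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return ((data[0] & 0x0F) << 12) | ((data[1] & 0x3F) << 6) | (data[2] & 0x3F)
    if len(data) == 4:
        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return (((data[0] & 0x07) << 18) | ((data[1] & 0x3F) << 12)
                | ((data[2] & 0x3F) << 6) | (data[3] & 0x3F))
    raise ValueError("this sketch handles only 3- and 4-byte sequences")


high = decode_utf8_sequence(bytes.fromhex("EDA080"))        # 0xD800
low = decode_utf8_sequence(bytes.fromhex("EDB080"))         # 0xDC00
supplementary = decode_utf8_sequence(bytes.fromhex("F0908080"))  # 0x10000
```

The two 3-byte sequences unpack to two separate surrogate code points,
while the 4-byte sequence unpacks to the single value U-00010000.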

> However, when the UTF-8 string is converted into UTF-16, <ED A0 80>
> and <ED B0 80> will become <D800 DC00>, and you can find the same
> character by searching for <D800 DC00> in UTF-16.

        So this solves the problem of not matching the wrong data?  I'm even
more baffled than when we started!
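
        For anyone who wants to see the two behaviours side by side, the
following is a rough sketch in Python (again my own illustration, not
anyone's actual search code). It assumes a lenient converter, modelled here
with Python's "surrogatepass" error handler; a strict UTF-8 decoder would
reject <ED A0 80> outright.

```python
# The two 3-byte sequences under discussion, back to back.
surrogate_bytes = bytes.fromhex("EDA080EDB080")

# The well-formed UTF-8 encoding of U-00010000: F0 90 80 80.
scalar_bytes = "\U00010000".encode("utf-8")

# A byte-level search in UTF-8 does not match.
found_in_utf8 = scalar_bytes in surrogate_bytes

# Lenient conversion to UTF-16 (big-endian, no BOM), letting the
# surrogate code points pass through unchanged.
as_utf16 = (surrogate_bytes
            .decode("utf-8", "surrogatepass")
            .encode("utf-16-be", "surrogatepass"))

# The UTF-16 encoding of U-00010000 is the pair D800 DC00.
target_utf16 = "\U00010000".encode("utf-16-be")

# After conversion, the same search now matches.
found_in_utf16 = target_utf16 in as_utf16

print(found_in_utf8, found_in_utf16)
```

Under those assumptions the search fails on the UTF-8 bytes and succeeds
on the converted UTF-16 data, which is the behaviour described in the
quoted text.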

> Unless this unpaired surrogate will be totally eliminated from UTF
> forms, this issue could be hit.

        I still don't know what the issue is.


/|/|ike
