In a message dated 2001-05-26 16:00:47 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  The issue is this: Unicode's three encoding forms don't sort in the same
>  way when sorting is done using that most basic and
>  valid-in-almost-no-locales-but-easy-and-quick approach of simply comparing
>  binary values of code units. The three give these results:
>  
>  UTF-8:  (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)
>  UTF-16: (U+0000 - U+D7FF), (surrogate),     (U+E000-U+FFFF)
>  UTF-32: (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)

First, everyone take a breath and say it out loud:  "UTF-16 is a hack."  
There, doesn't that feel better?  Whether it is necessary, beneficial, or 
unavoidable is beside the point.  Using pairs of 16-bit "surrogates" together 
with an additive offset to refer to a code point beyond the 16-bit range may be 
a clever solution 
to the problem, but it is still a hack, especially when those surrogate 
values fall in the middle of the range of normal 16-bit values as they do.
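
For the curious, the arithmetic behind the hack is simple enough.  Here is a 
minimal Python sketch (my own illustration, not anything normative) of how a 
code point above U+FFFF is split into a surrogate pair:

    def to_surrogate_pair(cp):
        # Applies only to code points U+10000..U+10FFFF.
        assert 0x10000 <= cp <= 0x10FFFF
        offset = cp - 0x10000             # 20-bit value after the additive offset
        high = 0xD800 + (offset >> 10)    # top 10 bits -> high surrogate
        low  = 0xDC00 + (offset & 0x3FF)  # low 10 bits -> low surrogate
        return high, low

    # U+10000 becomes the pair (0xD800, 0xDC00)
    print([hex(u) for u in to_surrogate_pair(0x10000)])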

UTF-8 and UTF-32 should absolutely not be similarly hacked to maintain some 
sort of bizarre "compatibility" with the binary sorting order of UTF-16.  
Anyone who is using the binary sorting order of UTF-16, and thus concludes 
that (pardon the use of 10646 terms here) Planes 1 through 16 should be 
sorted after U+D7FF but before U+E000 is really missing the point of proper 
collation.  I would state the case even more strongly than Peter, to say that 
such a collation order is valid in NO locale at all.
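
Anyone who doubts that the three forms really do disagree can confirm it with a 
throwaway Python check (again mine, and not production code), comparing a Plane 
1 character against a BMP character just above the surrogate range:

    bmp, astral = '\uE000', '\U00010000'   # U+E000 and U+10000 (Plane 1)
    for form in ('utf-8', 'utf-16-be', 'utf-32-be'):
        order = '<' if astral.encode(form) < bmp.encode(form) else '>'
        print(form, 'puts U+10000', order, 'U+E000')
    # Only utf-16-be reports '<': the surrogates 0xD800/0xDC00 sort
    # below 0xE000, while the other two forms put Plane 1 last.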

If developers expect to sort Unicode text in any meaningful way, they should 
be using the Unicode Collation Algorithm (UAX #10).  Using strict code point 
order as a basis for sorting is generally not appropriate, and applying the 
UTF-16 transformation as a further basis for sorting only compounds the error.
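
In practice that means calling into a UCA implementation rather than comparing 
raw code units.  For example, with the PyICU binding (assuming it is installed; 
the locale choice here is purely illustrative):

    import icu  # PyICU binding to ICU's UCA-based collation

    collator = icu.Collator.createInstance(icu.Locale('en_US'))
    words = ['resume', 'Resume', 'r\u00e9sum\u00e9']
    # Sort by collation keys instead of raw code unit values.
    print(sorted(words, key=collator.getSortKey))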

UTC should not, and almost certainly will not, endorse such a proposal on the 
part of the database vendors.

-Doug Ewell
 Fullerton, California
