In a message dated 2001-06-13 5:29:33 Pacific Daylight Time, 
[EMAIL PROTECTED] (through [EMAIL PROTECTED]) writes:

>  I think Oracle et al. should consider using, instead of UTF-16, what I
>  propose to call UTF-16F (F for "fixed") in their B-trees, to maintain
>  UCS binary sorting order:
>
>  Conversion between UTF-16 and UTF-16F works as follows:
>
>    unsigned short utf16_to_utf16f(unsigned short u)
>    {  
>      assert(u <= 0xffff);
>      /* shift surrogates into the top 0x800 code positions of 16-bit space */
>      if (u >= 0xe000)
>        return u - 0x800;
>      if (u >= 0xd800)
>        return u + 0x2000;
>      return u;
>    }

This is what I alluded to in my earlier message about the user-defined 
function supplied to qsort().  Any sorting mechanism for UTF-16 can easily 
incorporate this efficient transformation to achieve binary order.

If the transformation is coded inline, and things are reordered trivially so that 
the test for (u < 0xe000) -- by far the most common case -- comes first, then 
the transformation degenerates in most cases to:

    if (u < 0xe000)
        ;

and nobody can say that that is not efficient enough, on any hardware built 
since 1985.

If you remove the assert(u <= 0xffff) statement, then the same logic can be 
used for data in either UTF-8 or UTF-16, provided that no unpaired surrogates 
appear in your data (a reasonable constraint).

Oracle and PeopleSoft can use this transformation in their COBOL, in their 
memory cache, on the beaches and in the fields and streets, etc. instead of 
UTF-8s, and it will be much less work for *them* than maintaining two 
separate-but-confusable encoding schemes and fielding all the tech support 
calls from irate customers who have discovered that "UTF8" does not mean 
UTF-8.

-Doug Ewell
 Fullerton, California
