In a message dated 2001-09-16 13:13:38 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  I think that some applications will find it easier to migrate to
>  UTF-32 rather than convert to UTF-16.

I know I have.  Handle everything internally as UTF-32, then read and write 
UTF-8 or UTF-16 as appropriate.
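For the curious, here is roughly what that model looks like as a Python
sketch (the file names are hypothetical, and a list of code point values
stands in for the UTF-32 buffer):

    # Decode any external form into abstract code points (the "UTF-32"
    # buffer), work on those, then serialize back out as needed.

    def read_codepoints(path, encoding):
        with open(path, encoding=encoding) as f:
            return [ord(ch) for ch in f.read()]

    def write_codepoints(path, codepoints, encoding):
        with open(path, "w", encoding=encoding) as f:
            f.write("".join(chr(cp) for cp in codepoints))

    buf = read_codepoints("in.txt", "utf-8")    # hypothetical input file
    write_codepoints("out.txt", buf, "utf-16")  # or "utf-8" going out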

>  CESU-8 breaks that model because it is a form of Unicode with the sole
>  purpose of supporting a sort order other than Unicode code point order.
>  Yes, I could devise a way to sort UTF-32 and UTF-8 in UTF-16 binary
>  sort order, but that is only a matter of some messy code.  The real
>  issue is that I must now handle Unicode that has, as an essential
>  property, the requirement that it survive transforms under two
>  distinctly different sort orders.
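
(As an aside, the "messy code" can be quite small.  A Python sketch,
assuming strings held internally as abstract code points: encoding each
string as big-endian UTF-16 yields fixed-width code units that compare
bytewise in exactly UTF-16 binary order, so the encoded bytes serve
directly as a sort key.)

    # Sort key giving UTF-16 binary order for strings of code points.
    def utf16_binary_key(s):
        return s.encode("utf-16-be")

    strings = ["\uFB01", "\U00010400", "\uE000"]
    ordered = sorted(strings, key=utf16_binary_key)
    assert ordered[0] == "\U00010400"
    # U+10400 (surrogates D801 DC00) sorts before U+E000 here, the
    # reverse of code point order.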

I was glad when Unicode began moving away from the doctrine of "treat all 
characters as 16-bit code units" and toward "treat them as abstract code 
points in the range 0..0x10FFFF."  Make no mistake, UTF-16 can be a useful 
16-bit transformation format; but it should not be considered the essence of 
Unicode, especially not to the point where additional machinery needs to be 
built on top of the Unicode standard solely to support UTF-16.

>  With this standard approved, my applications could be compelled to use
>  CESU-8 in place of UTF-8 if I were to talk to Peoplesoft or other
>  packages that will insist on this sort order of Unicode data.  If I use
>  UTF-8 as well, then I will need two completely different sets of
>  support routines.

Actually, what you will need is *one* routine that works with both UTF-8 
and CESU-8 (and breaks the definition of both in doing so): it permits 
either method of encoding supplementary characters and auto-detects the 
data as UTF-8 or CESU-8 based on whichever method it encounters.
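
Here is a sketch of that merged routine in Python, under the assumption
that the input is otherwise well-formed: it decodes the bytes as UTF-8
while letting 3-byte-encoded surrogates through, then recombines any
CESU-8-style surrogate pairs it finds.

    def decode_utf8_or_cesu8(data):
        # 'surrogatepass' admits the 3-byte-encoded surrogates that a
        # strict UTF-8 decoder would reject; 4-byte sequences decode
        # normally.
        raw = data.decode("utf-8", errors="surrogatepass")
        out = []
        i = 0
        while i < len(raw):
            c = ord(raw[i])
            if 0xD800 <= c <= 0xDBFF and i + 1 < len(raw):
                c2 = ord(raw[i + 1])
                if 0xDC00 <= c2 <= 0xDFFF:
                    # CESU-8 pair: recombine into one supplementary
                    # character.
                    out.append(chr(0x10000 + ((c - 0xD800) << 10)
                                   + (c2 - 0xDC00)))
                    i += 2
                    continue
            out.append(raw[i])
            i += 1
        return "".join(out)

    # U+10400 in UTF-8 (F0 90 90 80) and in CESU-8 (ED A0 81 ED B0 80):
    assert decode_utf8_or_cesu8(b"\xF0\x90\x90\x80") == "\U00010400"
    assert decode_utf8_or_cesu8(b"\xED\xA0\x81\xED\xB0\x80") == "\U00010400"

The detection is per-character rather than per-stream, which is precisely
why such a routine conforms to neither specification.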

>  My problem is that the correct approach is for people like Peoplesoft
>  to fix their code before accepting non-BMP characters.

Still unanswered, in this proposal to sanctify hitherto non-standard 
representations of non-BMP characters in commercial databases, is the 
question of how much non-BMP data even exists in such databases in the 
first place.  I know I personally have some (and will soon have more, now 
that SC UniPad supports Deseret), but what about users of Oracle and 
Peoplesoft databases?  Other than the private-use planes, it was not even 
allowable to use non-BMP characters until the release of Unicode 3.1 
earlier this year.  Where is the great need for a compatibility encoding?

-Doug Ewell
 Fullerton, California
