Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

DougEwell2 Mon, 28 May 2001 22:59:01 -0700
In a message dated 2001-05-28 13:56:50 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  The problem with databases is that you have to have a locale independent
>  sorting sequence.  If you store a record with a key built with one locale,
>  you might not be able to retrieve it using another locale sort sequence.

OK, now I think I understand this particular, specific need for a straight 
binary sorting order.  As long as this stays internal and doesn't filter down 
to users, where they will see Z before A-acute and all the Latin-1 characters 
before A-macron, there is no problem... so far.
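
To make that ordering concrete, here is a minimal sketch in Python (my choice of language for illustration; the sample strings are made up). It assumes "straight binary order" means comparing by code point, which is also the order of the UTF-8 bytes:

```python
# Hypothetical sample keys: plain A, Z, A-acute (U+00C1), A-macron (U+0100).
words = ["Zebra", "\u00c1baco", "Apple", "\u0100cis"]

# Python compares str values code point by code point, so sorted() gives
# a locale-independent binary order rather than a dictionary order.
binary_order = sorted(words)

# Sorting by the raw UTF-8 bytes gives the same sequence, because UTF-8
# preserves code point order.
utf8_order = sorted(words, key=lambda s: s.encode("utf-8"))

for w in binary_order:
    print(w)
```

Z (U+005A) lands before A-acute (U+00C1), and A-acute, a Latin-1 character, lands before A-macron (U+0100). That is fine for an internal key and wrong for anything shown to a user.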

>  The problem is that [UTF-8] is wasteful of space.  For CLOBs where the
>  field are very large allocating 4 bytes per character, wastes space so
>  they used UCS-2.  Converting from UCS-2 to UTF-16 creates a sorting
>  problem.  UTF-16 keys and UTF-8 keys have different sorting sequences.

Converting from UCS-2 to UTF-16 should not create a sorting problem, because 
the only difference between the two is that UCS-2 is ignorant of surrogates 
while UTF-16 is aware of them.  This is where the straight binary sorting 
order, as valid as it may be for locale independence, needs to be modified; 
it needs to take surrogates into account.
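
As an illustration of what "taking surrogates into account" could look like, here is a minimal Python sketch. It is one possible approach, not something proposed in this thread: the function names are hypothetical, the input is assumed to be well-formed UTF-16, and the remapping is the familiar code-unit "fix-up" that moves the surrogate range above U+E000..U+FFFF before comparing.

```python
def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def utf16_fixup_key(s):
    """Sort key over UTF-16 code units that reproduces code point order.

    Surrogates (D800..DFFF) are shifted up to F800..FFFF and the BMP block
    E000..FFFF is shifted down to D800..F7FF, so any supplementary
    character compares greater than any BMP character.
    """
    def fix(unit):
        if 0xD800 <= unit <= 0xDFFF:
            return unit + 0x2000
        if 0xE000 <= unit <= 0xFFFF:
            return unit - 0x0800
        return unit
    return [fix(u) for u in utf16_code_units(s)]


# Hypothetical keys: U+00E9, U+FF5E (BMP, above the surrogate range),
# and U+10000 (first supplementary character, D800 DC00 in UTF-16).
keys = ["\U00010000", "\uff5e", "\u00e9"]

naive_utf16 = sorted(keys, key=utf16_code_units)        # raw code-unit order
fixed_utf16 = sorted(keys, key=utf16_fixup_key)         # surrogate-aware order
utf8_order = sorted(keys, key=lambda s: s.encode("utf-8"))
```

With raw code units, U+10000 sorts before U+FF5E because its lead surrogate D800 is less than FF5E. With the fix-up key, the UTF-16 order agrees with the UTF-8 byte order, so the same key retrieves the same record in either encoding.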

>  UTF-8s would have put the entire surrogate support into the hands of the
>  application.

Which I don't necessarily think is such a hot idea.  The mechanics of 
different encoding forms, surrogates, combining characters, and other Unicode 
details should be handled as early in the chain as possible, so applications 
can just deal with "characters."

>  Converting UCS-2 to UTF-16 support is a lot of work because most to
>  operation are actually using UTF-32.  This will match UTF-8 sorting.

As Michka observed, this may be "a lot of work" but it has to be done.  It 
could have been anticipated many years ago, and it is the right way to solve 
the problem.  Asking the standardizers to introduce a new hack to compensate 
for industry's overreliance on the mechanical details of a previous hack is 
the wrong way to solve the problem.

-Doug Ewell
 Fullerton, California