Doug,

The problem with databases is that you have to have a locale-independent
sorting sequence.  If you store a record with a key built under one locale's
sort sequence, you might not be able to retrieve it under another locale's
sort sequence.
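
A minimal sketch of that retrieval problem, assuming PyICU is available (the
example and the locale choices are mine, not part of the original message):
the same two keys land in opposite orders under Swedish and German collation,
so an index built under one locale cannot be searched reliably under the
other.

    from icu import Collator, Locale

    keys = ['zebra', 'öl']
    swedish = Collator.createInstance(Locale('sv'))
    german = Collator.createInstance(Locale('de'))

    # Swedish places ö at the end of the alphabet, after z;
    # German treats ö like o, before z.
    print(sorted(keys, key=swedish.getSortKey))  # ['zebra', 'öl']
    print(sorted(keys, key=german.getSortKey))   # ['öl', 'zebra']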

The problem with Oracle is that they use both UCS-2 and UTF-8.

UTF-8 is simple to implement because you can reuse all of the existing
multi-byte code page code, with a different algorithm for character length
calculations.  Thus it does not take much to implement UTF-8 support.
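
As a rough illustration of that length calculation (my own sketch, not
Oracle's code), the sequence length can be read off the lead byte alone:

    # Length of a UTF-8 sequence, determined from its lead byte.
    def utf8_sequence_length(lead_byte: int) -> int:
        if lead_byte < 0x80:
            return 1                     # 0xxxxxxx: ASCII
        if lead_byte < 0xC0:
            raise ValueError('continuation byte cannot start a sequence')
        if lead_byte < 0xE0:
            return 2                     # 110xxxxx
        if lead_byte < 0xF0:
            return 3                     # 1110xxxx
        if lead_byte < 0xF8:
            return 4                     # 11110xxx
        raise ValueError('invalid lead byte')

    # Count characters by hopping from lead byte to lead byte.
    def char_count(data: bytes) -> int:
        i = count = 0
        while i < len(data):
            i += utf8_sequence_length(data[i])
            count += 1
        return count

    assert char_count('aé€\U00010000'.encode('utf-8')) == 4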

The problem is that it is wasteful of space.  For CLOBs, where the fields are
very large, allocating 4 bytes per character wastes space, so they used
UCS-2.  Converting from UCS-2 to UTF-16 then creates a sorting problem:
UTF-16 keys and UTF-8 keys have different sorting sequences.
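
To make the difference concrete (my own example, not from the original
message), compare a supplementary-plane character against a BMP character
above the surrogate range, using plain binary comparison of the encoded keys:

    supp = '\U00010000'   # first supplementary-plane code point
    bmp = '\uFB01'        # LATIN SMALL LIGATURE FI, between U+E000 and U+FFFF

    for encoding in ('utf-8', 'utf-16-be', 'utf-32-be'):
        print(encoding, supp.encode(encoding) < bmp.encode(encoding))
    # utf-8      False  (supplementary characters sort after U+E000..U+FFFF)
    # utf-16-be  True   (surrogate code units 0xD800.. sort before 0xFB01)
    # utf-32-be  False  (matches the UTF-8 order)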

UTF-8s would have put surrogate support entirely into the hands of the
application.

Converting a UCS-2 implementation to UTF-16 support is a lot of work, because
most operations actually end up using UTF-32.  UTF-32, in turn, matches the
UTF-8 sorting sequence.

The UTF-8s "short cut" is, in the long run, a bad idea.  It makes most
proper locale-based operations a real problem.  It can also create storage
problems, because UTF-8s characters can be 50% larger than the corresponding
UTF-8 characters.  Oracle does not appreciate the problem that clients have
in sizing fields with its current UTF-8 implementation; it would be worse
with UTF-8s.  But no matter, since this is only an implementation problem,
not a database problem.
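
To make the 50% figure concrete (my own arithmetic; Python's 'surrogatepass'
handler is used here only to imitate the UTF-8s byte form):

    # U+10000 in standard UTF-8: one 4-byte sequence.
    standard = '\U00010000'.encode('utf-8')
    print(len(standard))      # 4

    # UTF-8s style: each UTF-16 surrogate (D800, DC00) becomes its own
    # 3-byte sequence, giving 6 bytes -- 50% larger than standard UTF-8.
    utf8s = ('\ud800'.encode('utf-8', 'surrogatepass')
             + '\udc00'.encode('utf-8', 'surrogatepass'))
    print(len(utf8s))         # 6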

Carl

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On
Behalf Of [EMAIL PROTECTED]
Sent: Monday, May 28, 2001 3:30 AM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and
email)


In a message dated 2001-05-26 16:00:47 Pacific Daylight Time,
[EMAIL PROTECTED] writes:

>  The issue is this: Unicode's three encoding forms don't sort in the same
>  way when sorting is done using that most basic and
>  valid-in-almost-no-locales-but-easy-and-quick approach of simply comparing
>  binary values of code units. The three give these results:
>
>  UTF-8:  (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)
>  UTF-16: (U+0000 - U+D7FF), (surrogate),     (U+E000-U+FFFF)
>  UTF-32: (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)

First, everyone take a breath and say it out loud:  "UTF-16 is a hack."
There, doesn't that feel better?  Whether it is necessary, beneficial, or
unavoidable is beside the point.  Using pairs of 16-bit "surrogates"
together
with an additive offset to refer to a 32-bit value may be a clever solution
to the problem, but it is still a hack, especially when those surrogate
values fall in the middle of the range of normal 16-bit values as they do.
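
For reference, the offset arithmetic involved looks like this (a sketch added
here, not part of the original message):

    # Decode one UTF-16 surrogate pair back to a 32-bit code point.
    def decode_surrogate_pair(high: int, low: int) -> int:
        assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    assert decode_surrogate_pair(0xD800, 0xDC00) == 0x10000
    assert decode_surrogate_pair(0xDBFF, 0xDFFF) == 0x10FFFF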

UTF-8 and UTF-32 should absolutely not be similarly hacked to maintain some
sort of bizarre "compatibility" with the binary sorting order of UTF-16.
Anyone who is using the binary sorting order of UTF-16, and thus concludes
that (pardon the use of 10646 terms here) Planes 1 through 16 should be
sorted after U+D7FF but before U+E000 is really missing the point of proper
collation.  I would state the case even more strongly than Peter, to say
that
such a collation order is valid in NO locale at all.

If developers expect to sort Unicode text in any meaningful way, they should
be using the Unicode Collation Algorithm (UAX #10).  Using strict code point
order as a basis for sorting is generally not appropriate, and applying the
UTF-16 transformation as a further basis for sorting only compounds the
error.
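
A minimal sketch of the difference (assuming PyICU, whose collator implements
UCA-style collation; the example is mine, not part of the original message):

    from icu import Collator, Locale

    words = ['Zebra', 'apple', 'Öl', 'banana']
    uca = Collator.createInstance(Locale('root'))  # default (DUCET-based) table

    print(sorted(words))                       # code point order:
                                               # ['Zebra', 'apple', 'banana', 'Öl']
    print(sorted(words, key=uca.getSortKey))   # collated order:
                                               # ['apple', 'banana', 'Öl', 'Zebra']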

UTC should not, and almost certainly will not, endorse such a proposal on
the
part of the database vendors.

-Doug Ewell
 Fullerton, California

