Re: FSS-UTF, UTF-2, UTF-8, and UTF-16

Jianping Yang Mon, 18 Jun 2001 13:39:08 -0700

Mark Davis wrote:

> You are correct about the published definitions. As I recall, though, we
> were referring to UTF-FSS as UTF-8 in the UTC meetings before it was changed
> to account for UTF-16.
>
> In any event, I don't know whether Oracle was involved in those discussions
> or not, or whether they introduced their tag "UTF8" before or after the
> definition was changed.
>

As a matter of fact, Oracle supported UTF-8 long before surrogates or the 4-byte
encoding were introduced. As a database vendor, Oracle has taken full advantage
of Unicode, and has also been a victim of Unicode in terms of compatibility.
Because a database server that stores Unicode carries no burden of fonts or IME
issues, Oracle supported a very early version of Unicode in its Oracle 7 release
as the database character set AL24UTFFSS, which means a 3-byte encoding for
UTF-FSS. When Unicode came to version 2.1, we found that AL24UTFFSS had trouble
with 2.1 because of the Hangul reallocation, and we could not simply update
AL24UTFFSS to the 2.1 definition, as that would corrupt existing users' data in
their databases. So we came up with a new character set, UTF8, which is still a
3-byte encoding, to support Unicode 2.1. The choice of a 3-byte encoding was
also bound to the AL24UTFFSS implementation, so that nothing would break when
users migrate from AL24UTFFSS to UTF8.

In the 9i release, we could not simply expand UTF8 to 4 bytes, for reasons of
backward compatibility. Although we specifically document that UTF8 does not
support supplementary characters in 8i, users can still input surrogates
through UCS-2 into a UTF8 database as a pair of 3-byte sequences (this is true
of other database vendors as well), which makes it hard for us to simply change
the UTF8 definition to allow 4 bytes. If we made this simple update, a pair of
surrogates from an 8i UTF8 database would be stored into 9i UTF8 without
character set conversion, resulting in irregular forms in the 4-byte encoding
(what is now AL32UTF8). That could make migration even harder, as there would
be two different versions of UTF8 in a distributed system. So what we did in
Oracle 9i was to introduce a new character set, AL32UTF8, for standard UTF-8
with up to 4-byte encoding, and users can easily migrate from UTF8 to AL32UTF8,
either in a database version migration or in a distributed environment.
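
To make the two forms concrete, here is a minimal Python sketch. It is not
Oracle's implementation; the function names are hypothetical, and it assumes
that a 3-byte-only store encodes each 16-bit unit of UCS-2 input separately.
It shows the same supplementary character as one 4-byte sequence (standard
UTF-8) and as a pair of 3-byte sequences, plus a conversion from the paired
form to the standard one.

```python
def encode_as_surrogate_pair(ch: str) -> bytes:
    """Encode each 16-bit unit of `ch` separately, at most 3 bytes per unit.

    Assumption: this mimics surrogates arriving through UCS-2 input into a
    3-byte-only character set. BMP characters come out as ordinary UTF-8.
    """
    units = ch.encode("utf-16-be")
    out = b""
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form of a surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return out


def to_standard_utf8(data: bytes) -> bytes:
    """Convert the paired 3-byte form into standard UTF-8 (up to 4 bytes).

    Unpaired surrogates are rejected by the strict UTF-16 decode step.
    """
    text = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 joins each surrogate pair into one character.
    text = text.encode("utf-16-be", "surrogatepass").decode("utf-16-be")
    return text.encode("utf-8")


ch = "\U00010000"                      # first supplementary character
standard = ch.encode("utf-8")
paired = encode_as_surrogate_pair(ch)

print(standard.hex())                  # f0908080
print(paired.hex())                    # eda080edb080
print(to_standard_utf8(paired) == standard)   # True
```

A strict UTF-8 decoder rejects the 6-byte form, which is why the two forms
cannot be mixed freely under one character set name.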

People may argue that, since no supplementary characters were defined before
Unicode 3.1, it should be OK to simply update UTF8 to support the 4-byte
encoding without any compatibility issue. That is not the case, because we
cannot force every Oracle customer to migrate their database to 9i, which means
there is a certain period during which Oracle 8i and 9i will coexist. You have
to consider their compatibility, and that's the price we have to pay to support
Unicode.

Regards,
Jianping.

begin:vcard 
n:Yang;Jianping
tel;fax:650-506-7225
tel;work:650-506-4865
x-mozilla-html:FALSE
org:Server Globalization Technology;Server Technology
version:2.1
email;internet:[EMAIL PROTECTED]
title:Senior Development Manager
adr;quoted-printable:;;500 Oracle Parkway=0D=0AM/S 659407;Redwood Shores;CA;94065;
fn:Jianping Yang
end:vcard
