There is one statement that appears to want to be framed:

Jianping Yang wrote:
> [...] When Unicode
> came to version 2.1, we found our AL24UTFFSS had trouble for 2.1 as Hangul's
> reallocation, and we could not simply update AL24UTFFSS to 2.1 definition as it
> would  mess existing users' data in their database. So we came up with a new
> character set as UTF8 which is still 3-byte encoding to support Unicode 2.1. [...]

This means that Oracle mis-implemented the UTF-8 standard as it was specified at that 
time, starting at least with Unicode 2.0.

UTF-8, as a part of the ISO/Unicode standards, encoded UCS-4 units (that is, Unicode
scalar values) in up to 6 bytes. Unicode scalar values reached up to 0x10FFFF, which
required 4-byte UTF-8. This was already the case with Unicode 2.0 in 1996.
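
To make the byte counts concrete, here is a rough Python sketch. It is only an
illustration: the function names are made up for this example, and the
"3-byte-only" form is my assumption about what an encoder limited to 3 bytes
would emit when handed UTF-16 surrogates. It is not taken from any Oracle code.

```python
def utf8_encode(cp):
    """Encode one Unicode scalar value as UTF-8 (1 to 4 bytes)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("beyond 0x10FFFF")


def three_byte_only_encode(cp):
    """Hypothetical 3-byte-limited encoder (an assumption for illustration):
    a supplementary code point is split into its UTF-16 surrogate pair and
    each surrogate is written as its own 3-byte sequence."""
    if cp < 0x10000:
        return utf8_encode(cp)
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)    # leading surrogate
    low = 0xDC00 + (v & 0x3FF)   # trailing surrogate
    # utf8_encode() above does not reject surrogates, so it can be reused here
    return utf8_encode(high) + utf8_encode(low)


print(utf8_encode(0x10FFFF).hex())            # 4 bytes
print(three_byte_only_encode(0x10FFFF).hex()) # 6 bytes: two 3-byte sequences
```

The 4-byte result agrees with Python's own codec,
chr(0x10FFFF).encode("utf-8"), while the 6-byte form is what a conforming
UTF-8 decoder would reject.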

According to the above, Oracle implemented its new "UTF8" with the intention of
implementing Unicode 2.1, and did not, in fact, follow the specifications of that time.

With the earliest implementations of Oracle "UTF8" not following their own intentions, 
it might be possible to supply a bug fix for older Oracle versions, if only to detect 
surrogates as Tex suggested.
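
For what it is worth, detecting surrogates in such data is cheap. The sketch
below is only my illustration of the idea, not a proposal for how Oracle would
do it, and the function name is hypothetical. It relies on the fact that in
well-formed UTF-8 the lead byte 0xED is always followed by a byte in
0x80..0x9F; a following byte in 0xA0..0xBF means that a surrogate
(U+D800..U+DFFF) was encoded as a 3-byte sequence.

```python
def find_encoded_surrogates(data):
    """Return the byte offsets at which a 3-byte-encoded surrogate starts.

    0xED can never be a continuation byte, so a plain scan is enough:
    ED A0..BF xx encodes U+D800..U+DFFF, which well-formed UTF-8 forbids.
    """
    offsets = []
    for i in range(len(data) - 1):
        if data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xBF:
            offsets.append(i)
    return offsets


# U+10000 written as two 3-byte surrogates, surrounded by ordinary text
sample = b"a" + b"\xed\xa0\x80\xed\xb0\x80" + "\u00e9".encode("utf-8")
print(find_encoded_surrogates(sample))

# The same code point in proper 4-byte UTF-8 yields no hits
print(find_encoded_surrogates(chr(0x10000).encode("utf-8")))
```

A check like this only reports the problem; what to do with the flagged data
is a separate question.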

markus
