Needs Analysis for UTF-8s

Carl W. Brown Mon, 18 Jun 2001 20:43:43 -0700
Toby,

I believe that Peoplesoft does not have a unique problem.  A "just say no to
UTF-8s" attitude does not solve your problem or the problem of other
companies in your situation.

The problem that I see with the UTF-8s proposal is that it needs different
support than UTF-8.

What do UTF-8 support services add?  UTF-8 support is the same as multi-byte
support.  Most of the same code can be used with one change: UTF-8 calculates
character lengths differently from other multi-byte implementations.  All
other changes, such as case-insensitive compares, are based on being able to
detect the proper character boundaries.
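
To make the "one change" concrete, here is a minimal sketch in C of finding
UTF-8 character boundaries from the first byte alone.  The function names
utf8_seq_len and utf8_char_count are made up for this illustration and do
not come from any particular library.

```c
#include <stddef.h>

/* Hypothetical helper: length in bytes of the UTF-8 sequence that starts
   with lead byte b, or 0 if b cannot start a sequence (a continuation
   byte or a byte that never appears in well-formed UTF-8). */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;                /* 0xxxxxxx: ASCII            */
    if (b >= 0xC2 && b <= 0xDF) return 2;  /* 110xxxxx                   */
    if (b >= 0xE0 && b <= 0xEF) return 3;  /* 1110xxxx                   */
    if (b >= 0xF0 && b <= 0xF4) return 4;  /* 11110xxx, up to U+10FFFF   */
    return 0;
}

/* Hypothetical helper: count characters in a NUL-terminated string by
   stepping from lead byte to lead byte.  Returns (size_t)-1 if a lead
   byte is invalid or a continuation byte is missing. */
static size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    while (*s != '\0') {
        size_t len = utf8_seq_len((unsigned char)*s);
        if (len == 0) return (size_t)-1;
        for (size_t i = 1; i < len; i++) {
            /* Every trailing byte must look like 10xxxxxx. */
            if (((unsigned char)s[i] & 0xC0) != 0x80) return (size_t)-1;
        }
        s += len;
        count++;
    }
    return count;
}
```

Everything else in a multi-byte library (compares, truncation, cursor
movement) can sit on top of a boundary routine like this one.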

Functions such as strcpy will still work with both UTF-8 and UTF-8s but you
don't need a multibyte implementation of strcpy to work with multi-byte
characters either because you are working at a string level without regard
to the string content.

I think that most people will initially think that they can use the UTF-8
services with UTF-8s.  They think that it can be treated as a UTF-8 that
sorts in UTF-16 order.  That is not true.  Once they start using surrogates
they are no longer dealing with character units but with fractional
character encoding units.  They are back to dealing with the same problems
that they had using SBCS functions on MBCS data.
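
A small sketch in C of what that looks like in practice, assuming the
UTF-8s form under discussion writes each half of a surrogate pair as its
own three-byte sequence.  The function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Write one 16-bit code unit (0x0800..0xFFFF) as a three-byte sequence. */
static void put3(uint16_t u, unsigned char *out)
{
    out[0] = (unsigned char)(0xE0 | (u >> 12));
    out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (u & 0x3F));
}

/* Hypothetical encoder: write a supplementary code point
   (0x10000..0x10FFFF) as a surrogate pair, three bytes per surrogate.
   Always produces six bytes. */
static void utf8s_encode_supplementary(uint32_t cp, unsigned char out[6])
{
    uint32_t v = cp - 0x10000;
    put3((uint16_t)(0xD800 | (v >> 10)), out);        /* high surrogate */
    put3((uint16_t)(0xDC00 | (v & 0x3FF)), out + 3);  /* low surrogate  */
}

/* Count the bytes that are not continuation bytes (10xxxxxx).  This is
   the number of "characters" a lead-byte-driven UTF-8 routine reports. */
static size_t count_lead_bytes(const unsigned char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) count++;
    }
    return count;
}
```

For U+10000 the encoder yields ED A0 80 ED B0 80, and a UTF-8 boundary
routine sees two units where there is one character.  A truncate or cursor
function built on that routine can split the pair down the middle.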

This may be enough for you.  It is difficult to say if the type of functions
that you intend to perform on the data is the same as handling DBCS data with
SBCS functions.  If so, then UTF-8s might be for you.  If not, I think that
it might be helpful to work on a real solution before you and Oracle get in
over your heads.

I think that it would be far more productive for this forum to work on
answers rather than just criticize.

I am still puzzled.  I don't understand what is so special about UTF-8s.
Why can't you use UTF-16?

If you break a standard, it would seem to me that an encoding scheme in
which you cannot determine the length of a character from the first byte has
a serious flaw.  You might be better off violating the Unicode standard and
relocating the characters after the surrogates to a plane 17.  This would not
be Unicode but it would work with the UTF-8 support functions and would be
easy to convert to and from Unicode.
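
As a rough sketch of the relocation idea in C: this assumes "the characters
after the surrogates" means U+E000..U+FFFF and that plane 17 starts at
0x110000.  The function names and constants are mine, not part of any
proposal.

```c
#include <stdint.h>

/* Assumed layout: U+E000..U+FFFF moves to 0x110000..0x111FFF, just past
   the last Unicode code point (0x10FFFF).  Everything else stays put. */
#define RELOC_FIRST 0xE000u
#define RELOC_LAST  0xFFFFu
#define RELOC_BASE  0x110000u

/* Hypothetical mapping applied before encoding with the ordinary
   UTF-8 bit layout. */
static uint32_t relocate_for_sort(uint32_t cp)
{
    if (cp >= RELOC_FIRST && cp <= RELOC_LAST)
        return RELOC_BASE + (cp - RELOC_FIRST);
    return cp;
}

/* Inverse mapping applied after decoding, to get real Unicode back. */
static uint32_t restore_from_sort(uint32_t v)
{
    if (v >= RELOC_BASE && v <= RELOC_BASE + (RELOC_LAST - RELOC_FIRST))
        return RELOC_FIRST + (v - RELOC_BASE);
    return v;
}
```

After the mapping, supplementary characters sort below the relocated
U+E000..U+FFFF range, which is the order a binary compare of UTF-16 code
units gives.  The relocated values still fit the four-byte bit pattern,
which has room for 21 bits, so the length is still known from the first
byte.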

It also might be easier to compare UTF-16 in code point order.  There are
lots of alternatives, but from what I know now it looks to me like UTF-8s is
the worst choice.  Worse still, it is a deceptive choice that looks like an
easy way out but will probably cause more grief later.  It looks like it has
already put Oracle in a difficult position.
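
For the UTF-16 alternative, here is a minimal sketch in C of a compare that
yields code point order, assuming well-formed UTF-16 input.  It uses the
common trick of rotating the surrogate range above U+E000..U+FFFF before
comparing the first differing code units.  The function names are
illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* Remap a code unit so that surrogates sort above all other BMP values:
   E000..FFFF slides down to D800..F7FF, D800..DFFF moves up to
   F800..FFFF.  Units below D800 are unchanged. */
static uint16_t fixup_unit(uint16_t u)
{
    if (u >= 0xE000) return (uint16_t)(u - 0x0800);
    if (u >= 0xD800) return (uint16_t)(u + 0x2000);
    return u;
}

/* Hypothetical compare: returns -1, 0, or 1 for a < b, a == b, a > b
   in code point order.  Lengths are in 16-bit code units. */
static int utf16_cmp_codepoint_order(const uint16_t *a, size_t na,
                                     const uint16_t *b, size_t nb)
{
    size_t n = na < nb ? na : nb;
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            uint16_t x = fixup_unit(a[i]);
            uint16_t y = fixup_unit(b[i]);
            return x < y ? -1 : 1;
        }
    }
    /* One string is a prefix of the other: the shorter sorts first. */
    return (na > nb) - (na < nb);
}
```

With this compare, U+FF5E sorts before U+10000 even though its code unit
FF5E is larger than the lead surrogate D800.  The data itself stays
standard UTF-16.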

If it turns out that UTF-8s is the best solution then we should spell out
the guidelines for the user community on how to implement and use UTF-8s.

Carl