Re: UTF-16 inside UTF-8

Doug Ewell Wed, 05 Nov 2003 17:32:01 -0800

Frank Yung-Fong Tang <YTang0648 at aol dot com> wrote:

>> I don't know about the relative market needs.  I think supplementary
>> character support is important because these characters are part of
>> Unicode just as much as BMP characters are,
>
> And don't forget, many people think some of the BMP characters are not
> important for their software. And that is probably the exact reason
> why MES-1, MES-2 and MES-3 got created. For Front-End software, it is
> quite difficult, or I should say impossible to support even the whole
> BMP. I currently see NO front-end software 100% support the whole
> Unicode BMP correctly from input to rendering. Name me one and I can
> tell you what they didn't do.


Topic-change alert!  I'm not talking about glyph support in fonts, or
bidi support, or collation, or contextual shaping, or any other aspect
of Unicode support.  I'm talking about completely denying the existence
of non-BMP characters.

There are tons of applications -- Notepad is a basic example -- that
allow the entry of any arbitrary BMP character.  They don't allow some
BMP characters and disallow others.  That's all I'm talking about.  Now,
if such an application allows BMP characters but disallows supplementary
characters, as MySQL (e.g.) does, I think that is an unnecessary
restriction.

One of these days I'm going to implement a "Unicode" front end that
supports Basic Latin and U+A068 YI SYLLABLE BBOP, but *no other
characters*, just to show how silly such a restriction would be.
(Remember, it's conformant as long as I don't lie about it.  That
doesn't mean it's not silly.)

> For back end software which do pure data process without keyboard
> input or text rendering, it is easier to implement the whole Unicode
> BMP range or even with the surrogate.

(1)  "Surrogates" are only about UTF-16, not any other aspect of
Unicode.
(2)  Supporting surrogates in UTF-16 is not tremendously difficult.
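
To illustrate point (2), here is a minimal Python sketch of the UTF-16
surrogate arithmetic.  The function names are made up for this example
and are not taken from any particular library; the constants are the
ones the UTF-16 encoding form defines.

```python
def to_surrogates(cp):
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000                       # 20 significant bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, U+1D11E MUSICAL SYMBOL G CLEF splits into the pair
D834 DD1E, and the pair combines back into 1D11E.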

>> and implementing UTF-8 support for the entire Unicode code space is
>> about 0.1% harder than artificially crippling it by restricting it to
>> the BMP.
>
> Disagree about what you said "about 0.1 % harder".
>
> For many developers, adding 4 bytes UTF-8 to surrogate support simply
> mean open a can of worm.

See point (1) above.
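
As a rough illustration of point (1), the following Python sketch
encodes a code point directly into UTF-8.  The name utf8_encode is
hypothetical, chosen for this example only.  Surrogates appear solely as
a range of values to reject; the four-byte form is just one more branch
after the three-byte form that already covers the BMP.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 bytes (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                          # 1 byte: Basic Latin
        return bytes([cp])
    if cp < 0x800:                         # 2 bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                       # 3 bytes: rest of the BMP
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),       # 4 bytes: supplementary planes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

A BMP-only version of this sketch would differ only by lacking the
final return statement.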

> After that, they need to worry about how to
> support surrogate, which is quite complex in the api design/change.

See points (1) and (2) above.

> The work to make the converter convert UTF-8 to a surrogate pair and
> back is probably as you said "0.1 harder". But work AFTER they open
> such door is much harder to manage. As the famous saying "Unicode is
> not the answer for Internationalization, Unicode is the question for
> the Internationalization". Thanks for all the job opportunity Unicode
> standard created (and keep creating) for us :)

See point (1) above.  Other than UTF-16 surrogates -- and remember, this
is not 1993; the world of Unicode no longer revolves around the 16-bit
encoding form -- what aspect of supplementary character support is so
much more complicated than BMP support?

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
