Frank Yung-Fong Tang <ytang0648 at aol dot com> wrote:

> Agree with you. Just want to make a point that the implementation is
> not "< 1%" of the work.

Oh, for heaven's sake: If you are starting with a NON-UNICODE
application -- one that has NO prior knowledge of UTF-anything or
UCS-anything -- and you are adding "Unicode support" to it, the amount
of work to support the entire 17-plane Unicode range compared to just
the BMP is relatively small. If I ever said "less than one percent," I
apologize. Such a figure can only be determined on a case-by-case
basis.
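To give a rough sense of the scale, here is an untested sketch (in C,
with names invented for this example) of a UTF-8 encoder covering the
full range U+0000..U+10FFFF. The 4-byte branch that handles Planes 1-16
is only a few lines more than what a BMP-only encoder already needs:

#include <stdio.h>

/* Sketch only: encode one Unicode scalar value (U+0000..U+10FFFF,
   excluding surrogates) as UTF-8.  Returns the number of bytes
   written to buf (at least 4 bytes long), or 0 if cp is not a valid
   scalar value. */
static int utf8_encode(unsigned long cp, unsigned char *buf)
{
    if (cp < 0x80) {                      /* 1 byte: ASCII */
        buf[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {              /* 2 bytes */
        buf[0] = 0xC0 | (unsigned char)(cp >> 6);
        buf[1] = 0x80 | (unsigned char)(cp & 0x3F);
        return 2;
    } else if (cp < 0x10000) {            /* 3 bytes: rest of the BMP */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogate code points are not characters */
        buf[0] = 0xE0 | (unsigned char)(cp >> 12);
        buf[1] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
        buf[2] = 0x80 | (unsigned char)(cp & 0x3F);
        return 3;
    } else if (cp <= 0x10FFFF) {          /* 4 bytes: Planes 1-16 */
        buf[0] = 0xF0 | (unsigned char)(cp >> 18);
        buf[1] = 0x80 | (unsigned char)((cp >> 12) & 0x3F);
        buf[2] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
        buf[3] = 0x80 | (unsigned char)(cp & 0x3F);
        return 4;
    }
    return 0;                             /* beyond U+10FFFF */
}

int main(void)
{
    unsigned char buf[4];
    int i, n = utf8_encode(0x1D11E, buf); /* U+1D11E MUSICAL SYMBOL G CLEF */
    for (i = 0; i < n; i++)
        printf("%02X ", buf[i]);          /* prints: F0 9D 84 9E */
    printf("\n");
    return 0;
}

Decoding is the mirror image, with one extra case for lead bytes F0
through F4.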
>> I'll be happy to supply UTF-8 code that handles 4-byte sequences.
>> That is not the same thing as converting an entire system from
>> 16-bit to 32-bit integers, or adding proper UTF-16 surrogate support
>> to a UCS-2-only system. Of course that is more work.
>
> Your view is based on the assumption that the internal code is UCS4
> instead of UTF-16.

Didn't you read what I wrote?

> Nothing wrong if people choose to use UTF-16 instead of UCS4 in the
> API, even in 2003. Do you agree?

Sure, no problem. Both UTF-16 and UCS-4 (= UTF-32) support the full
Unicode range. Only UCS-2 does not.

> If people do use UTF-16 in the API, it is natural for people who care
> about the BMP but do not care about Planes 1-16 to work only on the
> BMP, right? I am not saying they are doing the right thing. I am
> saying they are doing the "natural" thing. Remember, the text
> describing 'Surrogates' in the Unicode 4.0 standard is probably only
> 5-10 pages total in that 1462-page standard. For developers who are
> not going to implement the remaining 1000 pages correctly, it is
> natural to think "why do I need to get these 10 pages right?"

I don't care if they choose not to provide fonts or rendering support
for the supplementary planes. But it seems silly to deliberately
exclude them from the underlying architecture. "Using UTF-16" implies
that one supports the surrogate mechanism. UTF-16 without surrogate
support is UCS-2.

Of course the Unicode Standard doesn't spend a lot of time describing
the surrogate mechanism. It only applies to the UTF-16 character
encoding form. The description of the characters encoded in the
supplementary planes, however, is much more extensive.
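The surrogate mechanism itself amounts to very little code. A minimal,
untested sketch (again in C, function names invented here) of the round
trip between a supplementary-plane code point and its UTF-16 surrogate
pair:

#include <stdio.h>

/* Sketch only: map a supplementary-plane code point
   (U+10000..U+10FFFF) to its UTF-16 surrogate pair, and back.  This
   is essentially the machinery that separates UTF-16 from UCS-2. */
static void to_surrogates(unsigned long cp,
                          unsigned short *hi, unsigned short *lo)
{
    cp -= 0x10000;
    *hi = (unsigned short)(0xD800 + (cp >> 10));    /* high (lead) surrogate  */
    *lo = (unsigned short)(0xDC00 + (cp & 0x3FF));  /* low (trail) surrogate  */
}

static unsigned long from_surrogates(unsigned short hi, unsigned short lo)
{
    return 0x10000 + (((unsigned long)(hi - 0xD800)) << 10) + (lo - 0xDC00);
}

int main(void)
{
    unsigned short hi, lo;
    to_surrogates(0x10400UL, &hi, &lo);   /* U+10400 DESERET CAPITAL LETTER LONG I */
    printf("U+10400 -> %04X %04X -> U+%05lX\n",
           hi, lo, from_surrogates(hi, lo));
    /* prints: U+10400 -> D801 DC00 -> U+10400 */
    return 0;
}

A UCS-2-only system is precisely one that never performs this mapping,
and so cannot represent U+10000 and above at all.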
>> I can't fight this battle with people who would rather stay with
>> ASCII and 7/8 bits per character. They are not living in a Unicode
>> world.
>
> But how about the UTF-16 vs UCS4 battle?

Well, UTF-16 certainly does occupy less space than UTF-32 (henceforth I
will use this term instead of "UCS-4") in memory, on disk, wherever. By
all means, when *storing* large amounts of data, use an appropriately
compact form. That might mean UTF-16, UTF-8, or a compression format
such as SCSU or BOCU-1, or it might mean compressing the data using
gzip or bzip2.

When *processing* character data in memory, I would assume a
fixed-width encoding like UTF-32 would be more convenient than a
variable-width encoding like UTF-16. But if the extra complexity (such
as it is) of UTF-16 is not a problem, by all means go ahead and use it.

>> I would truly be surprised if full 17-plane Unicode support in a
>> single app could be demonstrated to be a matter of "multiple
>> millions of dollars."
>
> It is not the full 17-plane Unicode support that will contribute to
> it. It is
>
>   (number of ASCII-only records x sizeof(record in UCS4))
>     - (number of ASCII-only records x sizeof(record in ASCII))
>
> that contributes to it, compared to
>
>   (number of ASCII-only records x sizeof(record in UTF-8))
>     - (number of ASCII-only records x sizeof(record in ASCII))
>
> or
>
>   (number of ASCII-only records x sizeof(record in UTF-16))
>     - (number of ASCII-only records x sizeof(record in ASCII))
>
> The other comparison is
>
>   (number of BMP-only records x sizeof(record in UCS4))
>     - (number of BMP-only records x sizeof(record in UTF-8))
>
> and
>
>   (number of BMP-only records x sizeof(record in UCS4))
>     - (number of BMP-only records x sizeof(record in UTF-16))
>
> Of course, sizeof() here really means "the average size of a record
> containing that data."

I have never suggested that people with ASCII-only data should suddenly
quadruple their storage needs by storing it all in UTF-32. That's what
UTF-8 and SCSU are for. In fact, their data is already in UTF-8, isn't
it?
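To attach rough numbers to those formulas -- the record count and
average record length below are invented purely for illustration --
the per-encoding deltas work out like this:

#include <stdio.h>

int main(void)
{
    /* Illustrative figures only, assumed for the arithmetic;
       not taken from anyone's real database. */
    const double n_ascii = 1e6;    /* number of ASCII-only records  */
    const double len     = 100.0;  /* average characters per record */

    /* Average bytes per ASCII-only record in each encoding form:
       ASCII/UTF-8 = 1 byte per char, UTF-16 = 2, UTF-32 (UCS4) = 4. */
    const double ascii_bytes = len * 1.0;
    const double utf8_bytes  = len * 1.0;
    const double utf16_bytes = len * 2.0;
    const double utf32_bytes = len * 4.0;

    printf("extra storage if stored as UTF-8 : %.0f MB\n",
           n_ascii * (utf8_bytes  - ascii_bytes) / 1e6);  /* 0 MB   */
    printf("extra storage if stored as UTF-16: %.0f MB\n",
           n_ascii * (utf16_bytes - ascii_bytes) / 1e6);  /* 100 MB */
    printf("extra storage if stored as UTF-32: %.0f MB\n",
           n_ascii * (utf32_bytes - ascii_bytes) / 1e6);  /* 300 MB */
    return 0;
}

For ASCII-only data the UTF-8 delta is zero, which is exactly the
point: full 17-plane support does not require paying the UTF-32
quadrupling for storage.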
-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
