Re: UTF-16 inside UTF-8

Frank Yung-Fong Tang Tue, 02 Dec 2003 17:47:57 -0800


Doug Ewell wrote:


 > Frank Yung-Fong Tang <ytang0648 at aol dot com> wrote:
 >
 > Then, Frank, the Tcl implementation is *not valid UTF-8* and needs to be
 > fixed.  Plain and simple.  If a system like Tcl only supports the BMP,
 > that is its choice, but it *must not* accept non-shortest UTF-8 forms or
 > output CESU-8 disguised as UTF-8.

I agree with you. I just want to make the point that the implementation is not 
"< 1%" of the work.
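
To make the terms above concrete, here is a minimal sketch of what "valid UTF-8" means for the three cases under discussion: a 4-byte sequence, a non-shortest form, and CESU-8 style surrogate bytes. It relies only on Python's built-in strict "utf-8" codec; the helper name is_strict_utf8 and the sample byte strings are my own illustration, not code from Tcl or any other project.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+10000, the first code point outside the BMP, as a proper 4-byte sequence.
four_byte = "\U00010000".encode("utf-8")

# Non-shortest (overlong) 2-byte form of "/" (U+002F).
overlong = b"\xc0\xaf"

# CESU-8 style: the surrogate pair D800 DC00, each encoded as 3 bytes.
cesu8 = b"\xed\xa0\x80\xed\xb0\x80"

for name, sample in [("four_byte", four_byte),
                     ("overlong", overlong),
                     ("cesu8", cesu8)]:
    print(name, sample.hex(), is_strict_utf8(sample))
```

Only the first sample is accepted. Note that this shows the decoder side only; it says nothing about how much work is needed in the rest of a system whose internal representation is 16-bit.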

 >
 > > If you still think adding 4 bytes UTF-8 support is < 1% of the task,
 > > then please join the Tcl project and help me fix that. I appreciate
 > > your efforts there and I believe a lot of people will thank you for your
 > > contribution.
 >
 > I'll be happy to supply UTF-8 code that handles 4-byte sequences.  That
 > is not the same thing as converting an entire system from 16-bit to
 > 32-bit integers, or adding proper UTF-16 surrogate support to a
 > UCS-2-only system.  Of course that is more work.

Your view is based on the assumption that the internal encoding is UCS-4 rather 
than UTF-16.

 >
 > Remember, AGAIN, that this thread was originally about taking an
 > application like MySQL that did not support Unicode at all, and adding
 > Unicode support to it, **BUT ONLY FOR THE 16-BIT BMP.**  That is what I
 > can't imagine -- making BMP-only assumptions *today*, in 2003, knowing
 > that you'll have to go back and fix them some day.  That is certainly
 > more work than adding support for the full Unicode range at once.  I
 > think you thought I said the opposite, that such retrofitting is easy,
 > and are now trying hard to disprove it.

There is nothing wrong with people choosing UTF-16 instead of UCS-4 in the API, 
even in 2003. Do you agree?

If people do use UTF-16 in the API, it is natural for those who care about 
the BMP but not about Planes 1-16 to work only on the BMP, right? I am not 
saying they are doing the right thing. I am saying they are doing the "natural" 
thing. Remember, the text describing surrogates in the Unicode 4.0 standard 
is probably only 5-10 pages in total out of that 1462-page standard. For 
developers who are not going to implement the other 1000 pages correctly, it 
is natural to think "why do I need to get these 10 pages right?"



 >
 > > double your memory cost and size from UTF-8. x4 of the size for your
 > > ASCII data. To change implementation of a ASCII compatible / support
 > > application to UTF-16 is already hard since people only care about
 > > ASCII will upset the data size x 2 for all "their" data. It is already
 > > a hard battle most of the time for someone like me. If we tell them to
 > > change to UCS-4 that mean they need not only x2 the memory but x4 of
 > > the memory.
 >
 > I can't fight this battle with people who would rather stay with ASCII
 > and 7/8 bits per character.  They are not living in a Unicode world.

But what about the UTF-16 vs. UCS-4 battle?

 >
 > 1024 × 768 screen resolution takes 150% more display memory than 640 ×
 > 480, too.
 >
 > > For web services or application which spend multi millions on those
 > > memory and database, it mean adding millions of dollars to their cost.
 > > They may have to adding some millions of cost to support international
 > > customer by using UTF-16. They probably are willing to add multi
 > > millions dollars of cost to change it to use UCS4. In fact, there are
 > > people proposed to stored UTF-8 in a hacky way into the database
 > > instead of using UTF-16 or UCS4 to save cost. They have to add
 > > restriction of using the api and build upper level api to do
 > > conversion and hacky operation. That mean it will introduce some fixed
 > > (not depend on the size of data) development cost to the project but
 > > it will save millions of dollars of memory cost which depend on the
 > > size of the data. I don't like that approach but usually my word and
 > > what is "right" is less important than multiple million of dollars for
 > > a commercial company.
 >
 > I would truly be surprised if full 17-plane Unicode support in a single
 > app could be demonstrated to be a matter of "multiple millions of
 > dollars."

It is not the full 17-plane Unicode support that contributes to it.
It is the difference

(number of ASCII-only records x sizeof(record in UCS-4)) - (number of 
ASCII-only records x sizeof(record in ASCII))

that contributes to it, compared with

(number of ASCII-only records x sizeof(record in UTF-8)) - (number of 
ASCII-only records x sizeof(record in ASCII))

or

(number of ASCII-only records x sizeof(record in UTF-16)) - (number 
of ASCII-only records x sizeof(record in ASCII))

The other comparisons are

(number of BMP-only records x sizeof(record in UCS-4)) - (number of 
BMP-only records x sizeof(record in UTF-8))

and

(number of BMP-only records x sizeof(record in UCS-4)) - (number of 
BMP-only records x sizeof(record in UTF-16))

Of course, sizeof() here really means the "average size of a record with that 
kind of data".
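
As a rough illustration of the differences above, the following sketch computes the extra bytes for a handful of records. Everything in it is an assumption made for the example: the record contents are invented, extra_bytes is a hypothetical helper, and Python's "utf-16-le" and "utf-32-le" codecs stand in for UTF-16 and UCS-4 storage without a byte order mark. Real databases add per-record overhead that is not modeled here.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


def extra_bytes(records, wide: str, narrow: str) -> int:
    """Total size in the 'wide' encoding minus total size in the 'narrow' one."""
    return sum(encoded_size(r, wide) - encoded_size(r, narrow)
               for r in records)


# Hypothetical sample data: ASCII-only records and BMP-only (CJK) records.
ascii_records = ["hello", "unicode", "tcl"]
bmp_records = ["\u65e5\u672c\u8a9e", "\u4e2d\u6587"]

# ASCII-only data: cost of moving from ASCII to each Unicode encoding.
for label, enc in [("UTF-8", "utf-8"),
                   ("UTF-16", "utf-16-le"),
                   ("UCS-4", "utf-32-le")]:
    print("ASCII ->", label, extra_bytes(ascii_records, enc, "ascii"))

# BMP-only data: cost of UCS-4 relative to UTF-8 and to UTF-16.
for label, enc in [("UTF-8", "utf-8"), ("UTF-16", "utf-16-le")]:
    print("UCS-4 vs", label, extra_bytes(bmp_records, "utf-32-le", enc))
```

For the 15 ASCII characters in this sample, the extra cost is 0 bytes in UTF-8, 15 in UTF-16 and 45 in UCS-4. For the 5 CJK characters, UCS-4 costs 5 bytes more than UTF-8 and 10 more than UTF-16. Multiply by the real number of records to get the memory and storage figures I am talking about.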

 >
 > -Doug Ewell
 > Fullerton, California
 > http://users.adelphia.net/~dewell/
 >

-- 
--
Frank Yung-Fong Tang
System Architect, International Development, AOL Interactive Services
AIM:yungfongta   mailto:[EMAIL PROTECTED] Tel:650-937-2913
Yahoo! Msg: frankyungfongtan
