Doug Ewell wrote:
> Frank Yung-Fong Tang <ytang0648 at aol dot com> wrote:
>
> Then, Frank, the Tcl implementation is *not valid UTF-8* and needs to be
> fixed. Plain and simple. If a system like Tcl only supports the BMP,
> that is its choice, but it *must not* accept non-shortest UTF-8 forms or
> output CESU-8 disguised as UTF-8.

I agree with you. I just want to make the point that the implementation
is not "< 1%" of the work.

> > If you still think adding 4 bytes UTF-8 support is < 1% of the task,
> > then please join the Tcl project and help me fix that. I appreciate
> > your efforts there and I believe a lot of people will thank you for
> > your contribution.
>
> I'll be happy to supply UTF-8 code that handles 4-byte sequences. That
> is not the same thing as converting an entire system from 16-bit to
> 32-bit integers, or adding proper UTF-16 surrogate support to a
> UCS-2-only system.

Of course that is more work. Your view is based on the assumption that
the internal code is UCS-4 instead of UTF-16.

> Remember, AGAIN, that this thread was originally about taking an
> application like MySQL that did not support Unicode at all, and adding
> Unicode support to it, **BUT ONLY FOR THE 16-BIT BMP.** That is what I
> can't imagine -- making BMP-only assumptions *today*, in 2003, knowing
> that you'll have to go back and fix them some day. That is certainly
> more work than adding support for the full Unicode range at once. I
> think you thought I said the opposite, that such retrofitting is easy,
> and are now trying hard to disprove it.

There is nothing wrong with choosing UTF-16 instead of UCS-4 for the
API, even in 2003. Do you agree? If people do use UTF-16 in the API, it
is natural for those who care about the BMP but not about Planes 1-16 to
work only on the BMP, right? I am not saying they are doing the right
thing. I am saying they are doing the "natural" thing. Remember, the
text about surrogates in the Unicode 4.0 standard is probably only 5-10
pages in total, out of a 1462-page standard.
For developers who are not going to get the other 1000 pages right, it
is natural to think, "Why do I need to get these 10 pages right?"

> > double your memory cost and size from UTF-8. x4 of the size for your
> > ASCII data. To change implementation of an ASCII-compatible
> > application to UTF-16 is already hard, since people who only care
> > about ASCII will be upset that the data size is x2 for all "their"
> > data. It is already a hard battle most of the time for someone like
> > me. If we tell them to change to UCS-4, that means they need not
> > only x2 the memory but x4 the memory.
>
> I can't fight this battle with people who would rather stay with ASCII
> and 7/8 bits per character. They are not living in a Unicode world.

But how about the UTF-16 vs UCS-4 battle?

> 1024 × 768 screen resolution takes 150% more display memory than 640 ×
> 480, too.
>
> > For web services or applications which spend multiple millions on
> > memory and databases, it means adding millions of dollars to their
> > cost. They may have to add some millions of cost to support
> > international customers by using UTF-16. They probably are not
> > willing to add multiple millions of dollars of cost to change it to
> > use UCS-4. In fact, there are people who have proposed storing UTF-8
> > in a hacky way in the database instead of using UTF-16 or UCS-4, to
> > save cost. They have to add restrictions on using the API and build
> > an upper-level API to do conversion and hacky operations. That means
> > it will introduce some fixed (not dependent on the size of the data)
> > development cost to the project, but it will save millions of
> > dollars of memory cost, which does depend on the size of the data. I
> > don't like that approach, but usually my word and what is "right" is
> > less important than multiple millions of dollars for a commercial
> > company.
>
> I would truly be surprised if full 17-plane Unicode support in a single
> app could be demonstrated to be a matter of "multiple millions of
> dollars."
It is not the full 17-plane Unicode support that contributes to the
cost. What contributes to it is

  (number of ASCII-only records x sizeof(record in UCS-4))
  - (number of ASCII-only records x sizeof(record in ASCII))

compared to

  (number of ASCII-only records x sizeof(record in UTF-8))
  - (number of ASCII-only records x sizeof(record in ASCII))

or

  (number of ASCII-only records x sizeof(record in UTF-16))
  - (number of ASCII-only records x sizeof(record in ASCII))

The other comparison is

  (number of BMP-only records x sizeof(record in UCS-4))
  - (number of BMP-only records x sizeof(record in UTF-8))

and

  (number of BMP-only records x sizeof(record in UCS-4))
  - (number of BMP-only records x sizeof(record in UTF-16))

Of course, sizeof() here really means the "average size of a record
holding that kind of data".

> -Doug Ewell
>  Fullerton, California
>  http://users.adelphia.net/~dewell/
>

--
Frank Yung-Fong Tang
System Architect, International Development, AOL Interactive Services
AIM:yungfongta mailto:[EMAIL PROTECTED]
Tel:650-937-2913
Yahoo! Msg: frankyungfongtan
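As a postscript, the size comparison above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions:
the sample records are made up, the helper names `total_size` and
`extra_cost` are hypothetical, and Python's "utf-32-le" codec stands in
for UCS-4 (the "-le" variants are used so that no byte-order mark is
counted).

```python
# Hypothetical sample records; real figures would come from the actual data.
ASCII_RECORDS = ["hello world", "user@example.com", "2003-11-20"]
BMP_RECORDS = [
    "\u65e5\u672c\u8a9e",                                # CJK, 3 bytes each in UTF-8
    "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac",  # Greek, 2 bytes each in UTF-8
]

# Map the names used in the thread to Python codec names.
ENCODINGS = {
    "ASCII": "ascii",
    "UTF-8": "utf-8",
    "UTF-16": "utf-16-le",
    "UCS-4": "utf-32-le",
}


def total_size(records, name):
    """Total size in bytes of all records when stored in the named encoding."""
    return sum(len(r.encode(ENCODINGS[name])) for r in records)


def extra_cost(records, target, baseline):
    """Bytes added by storing the records in `target` instead of `baseline`."""
    return total_size(records, target) - total_size(records, baseline)


if __name__ == "__main__":
    for target in ("UTF-8", "UTF-16", "UCS-4"):
        cost = extra_cost(ASCII_RECORDS, target, "ASCII")
        print(f"ASCII-only data, {target} instead of ASCII: +{cost} bytes")
    for baseline in ("UTF-8", "UTF-16"):
        cost = extra_cost(BMP_RECORDS, "UCS-4", baseline)
        print(f"BMP-only data, UCS-4 instead of {baseline}: +{cost} bytes")
```

For ASCII-only data, UTF-8 adds nothing, UTF-16 adds one extra byte per
character, and UCS-4 adds three. For BMP-only data, UCS-4 always adds
two bytes per character over UTF-16, while the difference against UTF-8
depends on the mix of scripts.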

