Frank Yung-Fong Tang <ytang0648 at aol dot com> wrote:

> Agree with you. Just want to make a point that the implementation is
> not "< 1%" of the work.

Oh, for heaven's sake: If you are starting with a NON-UNICODE
application -- one that has NO prior knowledge of UTF-anything or
UCS-anything -- and you are adding "Unicode support" to it, the amount
of work to support the entire 17-plane Unicode range compared to just
the BMP is relatively small. If I ever said "less than one percent," I
apologize. Such a figure can only be determined on a case-by-case
basis.
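To give a rough sense of the scale, here is an untested sketch (in C,
with names invented for this example) of a UTF-8 encoder covering the
full range U+0000..U+10FFFF. The 4-byte branch that handles Planes 1-16
is only a few lines more than what a BMP-only encoder already needs:

#include <stdio.h>

/* Sketch only: encode one Unicode scalar value (U+0000..U+10FFFF,
   excluding surrogates) as UTF-8.  Returns the number of bytes
   written to buf (at least 4 bytes long), or 0 if cp is not a valid
   scalar value. */
static int utf8_encode(unsigned long cp, unsigned char *buf)
{
    if (cp < 0x80) {                      /* 1 byte: ASCII */
        buf[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {              /* 2 bytes */
        buf[0] = 0xC0 | (unsigned char)(cp >> 6);
        buf[1] = 0x80 | (unsigned char)(cp & 0x3F);
        return 2;
    } else if (cp < 0x10000) {            /* 3 bytes: rest of the BMP */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogate code points are not characters */
        buf[0] = 0xE0 | (unsigned char)(cp >> 12);
        buf[1] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
        buf[2] = 0x80 | (unsigned char)(cp & 0x3F);
        return 3;
    } else if (cp <= 0x10FFFF) {          /* 4 bytes: Planes 1-16 */
        buf[0] = 0xF0 | (unsigned char)(cp >> 18);
        buf[1] = 0x80 | (unsigned char)((cp >> 12) & 0x3F);
        buf[2] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
        buf[3] = 0x80 | (unsigned char)(cp & 0x3F);
        return 4;
    }
    return 0;                             /* beyond U+10FFFF */
}

int main(void)
{
    unsigned char buf[4];
    int i, n = utf8_encode(0x1D11E, buf); /* U+1D11E MUSICAL SYMBOL G CLEF */
    for (i = 0; i < n; i++)
        printf("%02X ", buf[i]);          /* prints: F0 9D 84 9E */
    printf("\n");
    return 0;
}

Decoding is the mirror image, with one extra case for lead bytes F0
through F4.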
>> I'll be happy to supply UTF-8 code that handles 4-byte sequences.
>> That is not the same thing as converting an entire system from
>> 16-bit to 32-bit integers, or adding proper UTF-16 surrogate support
>> to a UCS-2-only system. Of course that is more work.
>
> Your view is based on the assumption that the internal code is UCS4
> instead of UTF-16.

Didn't you read what I wrote?

> Nothing wrong if people choose to use UTF-16 instead of UCS4 in the
> API, even in 2003. Do you agree?

Sure, no problem. Both UTF-16 and UCS-4 (= UTF-32) support the full
Unicode range. Only UCS-2 does not.

> If people do use UTF-16 in the API, it is natural for people who care
> about the BMP but do not care about Planes 1-16 to work only on the
> BMP, right? I am not saying they are doing the right thing. I am
> saying they are doing the "natural" thing. Remember, the text
> describing 'Surrogates' in the Unicode 4.0 standard is probably only
> 5-10 pages total in that 1462-page standard. For developers who are
> not going to implement the remaining 1000 pages correctly, it is
> natural to think "why do I need to get these 10 pages right?"

I don't care if they choose not to provide fonts or rendering support
for the supplementary planes. But it seems silly to deliberately
exclude them from the underlying architecture. "Using UTF-16" implies
that one supports the surrogate mechanism. UTF-16 without surrogate
support is UCS-2.

Of course the Unicode Standard doesn't spend a lot of time describing
the surrogate mechanism. It only applies to the UTF-16 character
encoding form. The description of the characters encoded in the
supplementary planes, however, is much more extensive.
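The surrogate mechanism itself amounts to very little code. A minimal,
untested sketch (again in C, function names invented here) of the round
trip between a supplementary-plane code point and its UTF-16 surrogate
pair:

#include <stdio.h>

/* Sketch only: map a supplementary-plane code point
   (U+10000..U+10FFFF) to its UTF-16 surrogate pair, and back.  This
   is essentially the machinery that separates UTF-16 from UCS-2. */
static void to_surrogates(unsigned long cp,
                          unsigned short *hi, unsigned short *lo)
{
    cp -= 0x10000;
    *hi = (unsigned short)(0xD800 + (cp >> 10));    /* high (lead) surrogate  */
    *lo = (unsigned short)(0xDC00 + (cp & 0x3FF));  /* low (trail) surrogate  */
}

static unsigned long from_surrogates(unsigned short hi, unsigned short lo)
{
    return 0x10000 + (((unsigned long)(hi - 0xD800)) << 10) + (lo - 0xDC00);
}

int main(void)
{
    unsigned short hi, lo;
    to_surrogates(0x10400UL, &hi, &lo);   /* U+10400 DESERET CAPITAL LETTER LONG I */
    printf("U+10400 -> %04X %04X -> U+%05lX\n",
           hi, lo, from_surrogates(hi, lo));
    /* prints: U+10400 -> D801 DC00 -> U+10400 */
    return 0;
}

A UCS-2-only system is precisely one that never performs this mapping,
and so cannot represent U+10000 and above at all.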
>> I can't fight this battle with people who would rather stay with
>> ASCII and 7/8 bits per character. They are not living in a Unicode
>> world.
>
> But how about the UTF-16 vs UCS4 battle?

Well, UTF-16 certainly does occupy less space than UTF-32 (henceforth I
will use this term instead of "UCS-4") in memory, on disk, wherever. By
all means, when *storing* large amounts of data, use an appropriately
compact form. That might mean UTF-16, UTF-8, or a compression format
such as SCSU or BOCU-1, or it might mean compressing the data using
gzip or bzip2.

When *processing* character data in memory, I would assume a
fixed-width encoding like UTF-32 would be more convenient than a
variable-width encoding like UTF-16. But if the extra complexity (such
as it is) of UTF-16 is not a problem, by all means go ahead and use it.

>> I would truly be surprised if full 17-plane Unicode support in a
>> single app could be demonstrated to be a matter of "multiple
>> millions of dollars."
>
> It is not the full 17-plane Unicode support that will contribute to
> it. It is
>
>   (number of ASCII-only records x sizeof(record in UCS4))
>     - (number of ASCII-only records x sizeof(record in ASCII))
>
> that contributes to it, compared to
>
>   (number of ASCII-only records x sizeof(record in UTF-8))
>     - (number of ASCII-only records x sizeof(record in ASCII))
>
> or
>
>   (number of ASCII-only records x sizeof(record in UTF-16))
>     - (number of ASCII-only records x sizeof(record in ASCII))
>
> The other comparison is
>
>   (number of BMP-only records x sizeof(record in UCS4))
>     - (number of BMP-only records x sizeof(record in UTF-8))
>
> and
>
>   (number of BMP-only records x sizeof(record in UCS4))
>     - (number of BMP-only records x sizeof(record in UTF-16))
>
> Of course, sizeof() here really means "the average size of a record
> containing that data."

I have never suggested that people with ASCII-only data should suddenly
quadruple their storage needs by storing it all in UTF-32. That's what
UTF-8 and SCSU are for. In fact, their data is already in UTF-8, isn't
it?
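To attach rough numbers to those formulas -- the record count and
average record length below are invented purely for illustration --
the per-encoding deltas work out like this:

#include <stdio.h>

int main(void)
{
    /* Illustrative figures only, assumed for the arithmetic;
       not taken from anyone's real database. */
    const double n_ascii = 1e6;    /* number of ASCII-only records  */
    const double len     = 100.0;  /* average characters per record */

    /* Average bytes per ASCII-only record in each encoding form:
       ASCII/UTF-8 = 1 byte per char, UTF-16 = 2, UTF-32 (UCS4) = 4. */
    const double ascii_bytes = len * 1.0;
    const double utf8_bytes  = len * 1.0;
    const double utf16_bytes = len * 2.0;
    const double utf32_bytes = len * 4.0;

    printf("extra storage if stored as UTF-8 : %.0f MB\n",
           n_ascii * (utf8_bytes  - ascii_bytes) / 1e6);  /* 0 MB   */
    printf("extra storage if stored as UTF-16: %.0f MB\n",
           n_ascii * (utf16_bytes - ascii_bytes) / 1e6);  /* 100 MB */
    printf("extra storage if stored as UTF-32: %.0f MB\n",
           n_ascii * (utf32_bytes - ascii_bytes) / 1e6);  /* 300 MB */
    return 0;
}

For ASCII-only data the UTF-8 delta is zero, which is exactly the
point: full 17-plane support does not require paying the UTF-32
quadrupling for storage.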
-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
