Well, please ignore my previous email on the coding space. I somehow
read Mr. Ayers as saying that UTF-32 has a coding space of 4G, which he
absolutely didn't mean. I just blame the "super fast reading" class that
I took. ;-)

With regards,

Ienup

] Date: Thu, 08 Mar 2001 20:02:39 -0600
] From: "Ayers, Mike" <[EMAIL PROTECTED]>
] Subject: RE: UTF8 vs. Unicode (UTF16) in code
] To: 'Ienup Sung' <[EMAIL PROTECTED]>, Unicode List <[EMAIL PROTECTED]>
] MIME-version: 1.0
]
]
] If you really want to finish the job, there's always UTF-32, which
] should do rather nicely until we meet the space aliens with the
] 4,293,853,186 character alphabet!
]
]
] /|/|ike
]
] P.S. No, they're not Klingons!
]
] > From: Ienup Sung [mailto:[EMAIL PROTECTED]]
] >
] > I think we shouldn't advocate that, since there will be only 43K
] > CJK characters in the SIP, about 1.6K characters in the SMP, and
] > 97 tag characters in the SSP, we can ignore such characters and the
] > additional planes of the UTF-16/32 of Unicode 3.1. Furthermore, when
] > you're doing the first i18n on existing programs, you can do the
] > whole thing at once with minor additional cost if you choose to
] > support UTF-16 while you're at it, rather than doing it only for
] > BMP/UCS-2 now and making one more round of changes later, even
] > though in my opinion that will be decided by each team/company
] > doing the i18n.
] >
] > And, as we all know, we can no longer claim that UTF-16 is fixed
] > width, since it is now variable width like UTF-8; we will just
] > have to deal with it, in my opinion.
] >
] > With regards,
] >
] > Ienup
] >
] >
] > ] Date: Fri, 09 Mar 2001 10:48:52 -0800 (PST)
] > ] From: [EMAIL PROTECTED]
] > ] Subject: Re: UTF8 vs. Unicode (UTF16) in code
] > ] X-Sender: [EMAIL PROTECTED]
] > ] To: Ienup Sung <[EMAIL PROTECTED]>
] > ] Cc: Unicode List <[EMAIL PROTECTED]>
] > ] MIME-version: 1.0
] > ]
] > ] Well....
] > ]
] > ] Actually, there is a significant difference between being "UTF-8
] > ] ignorant" and "UTF-16 ignorant". A "UTF-16 ignorant" program
] > ] thinks that surrogate pairs are just two characters with undefined
] > ] properties. Since currently there are no characters "up there",
] > ] this isn't a really big deal. Shortly, when Unicode 3.1 is
] > ] official, there will be 40K or so characters in the supplemental
] > ] planes... but they'll be relatively rare.
] > ]
] > ] In most cases where one has a "character pointer", one is not
] > ] performing casing, line breaking, or other text interpretation
] > ] that requires significant awareness of the meaning of the text. Of
] > ] course, it depends on the instance and the application how true
] > ] that is ;-). But in many cases you *can* ignore the fact that a
] > ] high- or low-surrogate character is really part of something else.
] > ]
] > ] With UTF-8, however, it is impossible to ignore the multi-byte
] > ] sequences and they can never really be treated as separate
] > ] characters. So I guess all I'm saying is that, depending on what
] > ] you need to do and what level of awareness your application needs
] > ] to achieve, a pure "UCS-2 port" might be a better choice than
] > ] UTF-8, since the specific details overlooked are of a different
] > ] quality.
] > ]
] > ] Best Regards,
] > ]
] > ] Addison
] > ]
] > ] ===============================================================
] > ] Addison P. Phillips                 Globalization Architect
] > ] webMethods, Inc                     http://www.webmethods.com
] > ] Sunnyvale, CA, USA                  mailto:[EMAIL PROTECTED]
] > ]
] > ] +1 408.210.3569 (mobile)            +1 408.962.5487 (ofc)
] > ] ===============================================================
] > ] "Internationalization is not a feature. It is an architecture."
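
For anyone who wants to check the variable-width point from the thread
above, here is a minimal sketch in Python 3. It is only an illustration,
not anything posted in the thread: the helper names utf16_units and
is_surrogate are made up for this example, and the two sample characters
(U+4E2D in the BMP, U+20000 in the SIP) are my own choice. It shows what
a "UTF-16 ignorant" program would see for a supplementary character (two
surrogate code units) next to the multi-byte UTF-8 form.

```python
import struct


def utf16_units(text):
    """Return the 16-bit code units of text's big-endian UTF-16 encoding."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def is_surrogate(unit):
    """True if a 16-bit code unit lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF


# Sample characters (assumptions chosen for illustration only).
bmp_char = "\u4e2d"       # a CJK ideograph inside the BMP
sip_char = "\U00020000"   # a CJK ideograph in the SIP (plane 2)

for ch in (bmp_char, sip_char):
    units = utf16_units(ch)
    print(
        "U+%04X" % ord(ch),
        "utf-8 bytes:", len(ch.encode("utf-8")),
        "utf-16 units:", [hex(u) for u in units],
        "utf-32 units:", len(ch.encode("utf-32-be")) // 4,
    )
```

The BMP character takes one UTF-16 code unit, while the SIP character
takes two, both of which fall in the surrogate range. A program that
treats each 16-bit unit as a character still sees well-formed (if
meaningless) units, whereas the individual UTF-8 bytes of either
character mean nothing on their own.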

