Ken,

Thanks for the consideration. I threw my ego away years ago.

> Joel,
>
> > > Note that I am just sending a response to you, not to the list.
> >
> > I wouldn't mind this being on the list. I was making bad assumptions
> > about Sun's and others' reasons for wanting to do perverse things
> > with surrogate pairs, and this clears it up. I guess you want to
> > reduce traffic on the list?
>
> No, not necessarily. But I prefer not to say blunt, uncomplimentary
> things about other members of the Consortium on an open, public list.
> I just said this privately to you, so that you would realize that
> there are implementation issues here that were different from what
> you seemed to be driving at.
>
> > Now, I'm going to have to do the math and see what happens, but if
> > I get the results it sounds like I will get, then the Java char
> > type really was a bad choice, and similar engineering decisions
> > need to be avoided in the future, even to the extent of heavy
> > evangelizing. Internal representation probably does need to be
> > 32-bit.
>
> The choice of UTF-16 was made for a whole series of reasons.
>
> Java chose a 16-bit character because it was practical. There are
> some implementation issues with it, because they didn't fully allow
> for what UTF-16 would imply for the APIs. Many people who started
> out with 16-bit Unicode a decade ago have the same issues today in
> adapting to Unicode 3.1.
>
> But it isn't that hard to fix things while retaining 16-bit code
> units. I've been doing that just recently for the Unicode library
> that Sybase uses. Microsoft, no doubt, has similar issues, because
> they standardized on a 16-bit unichar long ago.
>
> And while UTF-32 has certain processing advantages in some places,
> UTF-16 works just fine for most things. I know, because I've
> implemented it for all kinds of functionality. All my tables for
> properties, normalization, collation, and such are implemented in
> UTF-16 -- they're more space-efficient, among other things. And all
> my string handling is UTF-16. It is only at certain unique points,
> such as in recursive functions for doing decomposition, where the
> extra overhead of dealing with UTF-16 makes UTF-32 attractive enough
> that I convert locally to UTF-32, do that processing, and then
> convert back.
>
> This stuff is not rocket science, though it may seem to be sometimes.
>
> --Ken

If you can look past my extreme opinions preferring common standards
over universal ones, I would appreciate hearing more about how you've
managed your way around the warps in the transformations. I think the
folks at Sun and Oracle might be interested, too. Have you tried
sharing some of the key elements with them, as a sort of bribe to get
them away from trying to convert surrogate pairs directly into UTF-8?

Joel
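
The surrogate math the thread alludes to is mechanical: a code point
above U+FFFF splits into a high surrogate (U+D800..U+DBFF) and a low
surrogate (U+DC00..U+DFFF), and recombining them is the reverse. Below
is a minimal Java sketch of the "convert locally to UTF-32, process,
convert back" technique Ken describes; the class and method names are
illustrative only, not taken from any library mentioned in the thread.

    import java.util.Arrays;

    public class SurrogateDemo {

        // Decode UTF-16 code units into code points (UTF-32):
        // cp = 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
        static int[] toUtf32(char[] u16) {
            int[] out = new int[u16.length];   // worst case: no pairs
            int n = 0;
            for (int i = 0; i < u16.length; i++) {
                char c = u16[i];
                if (c >= 0xD800 && c <= 0xDBFF && i + 1 < u16.length
                        && u16[i + 1] >= 0xDC00 && u16[i + 1] <= 0xDFFF) {
                    // High surrogate followed by low surrogate.
                    out[n++] = 0x10000 + ((c - 0xD800) << 10)
                                       + (u16[++i] - 0xDC00);
                } else {
                    out[n++] = c;   // BMP code unit stands for itself
                }
            }
            return Arrays.copyOf(out, n);
        }

        // Re-encode code points as UTF-16, emitting pairs above U+FFFF.
        static char[] toUtf16(int[] u32) {
            StringBuilder sb = new StringBuilder(u32.length);
            for (int cp : u32) {
                if (cp < 0x10000) {
                    sb.append((char) cp);
                } else {
                    int v = cp - 0x10000;
                    sb.append((char) (0xD800 + (v >> 10)));    // high
                    sb.append((char) (0xDC00 + (v & 0x3FF)));  // low
                }
            }
            return sb.toString().toCharArray();
        }

        public static void main(String[] args) {
            // U+10400 DESERET CAPITAL LETTER LONG I is \uD801\uDC00
            // in UTF-16.
            char[] s = { 'A', '\uD801', '\uDC00', 'B' };
            int[] cps = toUtf32(s);
            for (int cp : cps) {
                System.out.printf("U+%04X ", cp);  // U+0041 U+10400 U+0042
            }
            System.out.println(
                new String(toUtf16(cps)).equals(new String(s)));  // true
        }
    }

The round trip is lossless for well-formed input, and the payoff Ken
points to is that the fixed-width int[] is simpler to index and recurse
over than the variable-width char[].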
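
On the closing point about converting surrogate pairs directly into
UTF-8: correct UTF-8 reassembles the code point first and encodes it
as four bytes, whereas running each surrogate through the three-byte
BMP branch yields a six-byte sequence (a form later documented under
its own name, CESU-8) that is not well-formed UTF-8. A sketch of a
per-code-point encoder, continuing the illustrative Java above:

    // Encode one code point as UTF-8. A supplementary character such
    // as U+10400 becomes the four bytes F0 90 90 80; encoding its
    // surrogates U+D801 and U+DC00 separately via the three-byte
    // branch would instead yield ED A0 81 ED B0 80, which is invalid.
    static byte[] codePointToUtf8(int cp) {
        if (cp < 0x80) {                       // 1 byte: ASCII
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {               // 2 bytes
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {             // 3 bytes (BMP)
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                               // 4 bytes (supplementary)
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }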
