This thread amuses me. I feel like I know quite a bit about the various Unicode encoding forms and schemes, and my personal opinion is that UTF-16 combines the worst of UTF-8 (the need to support multi-code-unit characters, however "rare" they may be) with the worst of UTF-32 (high overhead for many scripts). Yet there is a Technical Note, UTN #12, that encourages users to use UTF-16 for internal processing, for exactly the opposite reasons.
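To put numbers on that complaint, here is a minimal Python sketch (my own illustration; the character choices are arbitrary, and none of this is from UTN #12) counting code units for a plain Latin letter and a supplementary-plane Han ideograph:

    # Code unit sizes: UTF-8 = 1 byte, UTF-16 = 2 bytes, UTF-32 = 4 bytes.
    # U+0041 is LATIN CAPITAL LETTER A; U+20000 is a CJK Extension B ideograph.
    for ch in ["A", "\U00020000"]:
        utf8 = ch.encode("utf-8")
        utf16 = ch.encode("utf-16-be")   # the -be variant avoids emitting a BOM
        utf32 = ch.encode("utf-32-be")
        print("U+%04X: UTF-8 %d, UTF-16 %d, UTF-32 %d code units"
              % (ord(ch), len(utf8), len(utf16) // 2, len(utf32) // 4))

U+0041 is one code unit in all three forms, but U+20000 is four units in UTF-8, two in UTF-16 (a surrogate pair), and one in UTF-32. So UTF-16 inherits the multi-unit complication anyway, while still spending two bytes per unit on text that UTF-8 would encode in one.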
So I think the word "nice" is actually quite appropriate for this thread. It implies a personal aesthetic judgment, which is what is really being discussed here. I use UTF-8 for most interchange (such as this message; OE doesn't allow me to send UTF-16) and UTF-32 for most internal processing that I write myself. Let people say UTF-32 is wasteful if they want; I don't tend to store huge amounts of text in memory at once, so the overhead matters far less to me than having one code unit per character.

I do wish the following statements would stop coming up every time this subject is debated:

(1) UTF-32 doesn't really guarantee one code unit per character, since you still have to worry about combining sequences (the sketch at the end of this message shows a concrete case).

(2) Write functions that deal with strings, not characters, and the difference becomes moot.

Both statements (which are really variations on the same theme) miss the point somewhat. Combining sequences and other interactions between encoded characters don't change the fact that sometimes you have to deal with strings, and sometimes you have to deal with individual characters. That's just a fact. Both kinds of processing are important.

I also think that as more and more Han characters are encoded in the supplementary space, corresponding to the ever-growing repertoires of East Asian standards, the story that UTF-16 is virtually a fixed-width encoding because "supplementary code points are very rare in most text" will gradually go away.
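Statement (1) is true as far as it goes, and worth seeing concretely. Here is a minimal Python sketch (again my own illustration, using only the standard unicodedata module) of a combining sequence that stays two code units even in UTF-32:

    import unicodedata

    s = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT, one displayed character
    print(len(s))                                # 2 code points
    print(len(s.encode("utf-32-be")) // 4)       # still 2 code units in UTF-32
    print(len(unicodedata.normalize("NFC", s)))  # 1, after composing to U+00E9

Normalization composes this particular sequence, but many combining sequences have no precomposed form, so the gap between code points and user-perceived characters exists in every encoding form. That doesn't make character-level access useless; it just means both levels of processing have their place.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/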