This thread amuses me.

I feel like I know quite a bit about the various Unicode encoding forms
and schemes, and my personal opinion is that UTF-16 combines the worst
of UTF-8 (the need to support multi-code-unit characters, no matter
how "rare" they are) with the worst of UTF-32 (high overhead for many
scripts).
Yet there is a Technical Note, UTN #12, that encourages users to use
UTF-16 for internal processing, for exactly the opposite reasons.
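
To make the multi-code-unit point concrete: anything outside the BMP
occupies two UTF-16 code units, a surrogate pair.  Here is a minimal
C sketch (the function name is mine, and it assumes a valid Unicode
scalar value, i.e. no lone surrogates):

    #include <stdio.h>
    #include <stdint.h>

    /* Encode one code point as UTF-16.  Returns the number of 16-bit
       code units written: 1 for the BMP, 2 for a surrogate pair. */
    static int utf16_encode(uint32_t cp, uint16_t out[2])
    {
        if (cp < 0x10000) {
            out[0] = (uint16_t)cp;           /* BMP: one code unit */
            return 1;
        }
        cp -= 0x10000;                       /* supplementary plane */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        return 2;
    }

    int main(void)
    {
        uint16_t u[2];
        int n = utf16_encode(0x20000, u);  /* first CJK Extension B char */
        printf("U+20000 -> %d code units: %04X %04X\n",
               n, (unsigned)u[0], (unsigned)u[1]);
        return 0;                          /* prints: 2 ... D840 DC00 */
    }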

So I think the word "nice" is actually quite appropriate for this
thread.  It implies a personal aesthetic judgment, which is what is
really being discussed here.

I use UTF-8 for most interchange (such as this message; Outlook
Express won't let me send UTF-16) and UTF-32 for most internal
processing that I write myself.  Let people say UTF-32 is wasteful if
they want; I don't tend to store huge amounts of text in memory at
once, so the overhead matters much less to me than having one code
unit per character.
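
The "one code unit per character" convenience is easy to see in code:
with UTF-32, the i-th code point of a buffer is just buf[i], while
with UTF-16 you have to walk the string, because any code point may
occupy one code unit or two.  A sketch (the function names are mine,
and both assume a well-formed buffer holding at least i+1 code
points):

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* UTF-32: random access to code points is trivial. */
    static uint32_t nth_codepoint_utf32(const uint32_t *buf, size_t i)
    {
        return buf[i];
    }

    /* UTF-16: must scan, since a code point may be a surrogate pair. */
    static uint32_t nth_codepoint_utf16(const uint16_t *buf, size_t i)
    {
        size_t pos = 0;
        while (i-- > 0)   /* skip i whole code points */
            pos += (buf[pos] >= 0xD800 && buf[pos] <= 0xDBFF) ? 2 : 1;
        if (buf[pos] >= 0xD800 && buf[pos] <= 0xDBFF)  /* high surrogate */
            return 0x10000
                 + ((uint32_t)(buf[pos] & 0x3FF) << 10)
                 + (buf[pos + 1] & 0x3FF);
        return buf[pos];
    }

    int main(void)
    {
        const uint32_t s32[] = { 0x0041, 0x20000 };          /* A, U+20000 */
        const uint16_t s16[] = { 0x0041, 0xD840, 0xDC00 };   /* same text  */
        printf("U+%04lX U+%04lX\n",
               (unsigned long)nth_codepoint_utf32(s32, 1),
               (unsigned long)nth_codepoint_utf16(s16, 1));
        return 0;   /* both print U+20000 */
    }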

I do wish the following statements would stop coming up every time this
subject is debated:

(1)  UTF-32 doesn't really guarantee one code unit per character, since
you still have to worry about combining sequences.
(2)  Write functions that deal with strings, not characters, and the
difference becomes moot.

Both statements (which are really variations on the same theme) miss the
point somewhat.  Combining sequences and other interactions between
encoded characters don't change the fact that sometimes you have to deal
with strings, and sometimes you have to deal with individual characters.
That's just a fact.  Both types of processing are important.
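
As a tiny illustration of why both levels matter: even in UTF-32, a
user-perceived character can be several code points, yet code-point-
level questions remain perfectly well-defined.  (The combining-mark
test below is deliberately crude, checking only the Combining
Diacritical Marks block; real code would consult Unicode properties.)

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* "e" + COMBINING ACUTE ACCENT: two code points that display
           as one user-perceived character. */
        uint32_t e_acute[] = { 0x0065, 0x0301 };

        int bases = 0;
        for (int i = 0; i < 2; i++)
            /* crude combining-mark test, U+0300..U+036F only */
            if (!(e_acute[i] >= 0x0300 && e_acute[i] <= 0x036F))
                bases++;
        printf("%d base character(s) in the sequence\n", bases);
        return 0;
    }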

I also think that as more and more Han characters are encoded in the
supplementary space, corresponding to the ever-growing repertoires of
East Asian standards, the story that UTF-16 is virtually a fixed-width
encoding because "supplementary code points are very rare in most
text" will gradually go away.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/


