The FAQ on compression says:

<quote>
Q: Why not use UTF-8 as compressed format?
A: UTF-8 represents only the ASCII characters in less space than needed
in UTF-16, for <i>all</i> other characters it expands.
</quote>

The end of this sentence means "... it expands compared to UTF-16," and
of course that is not true.  Code points from U+0080 through U+07FF are
represented in UTF-8 as two bytes, the same as UTF-16.  For an FAQ, this
is an unfortunate error.

Perhaps something along the lines of:

A: UTF-8 represents only the ASCII characters in less space than needed
in UTF-16; for all other characters it requires the same or more space.

would be more accurate.
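
For anyone who wants to verify, here is a quick Python sketch (the
sample code points are my own choices) that prints the encoded size of
a few characters in each form:

    # Encoded size of sample code points in UTF-8 vs. UTF-16.
    # U+0041 is ASCII; U+00E9 and U+0416 fall in U+0080..U+07FF;
    # U+3042 is a BMP character above U+07FF; U+1F600 is supplementary.
    for cp in (0x0041, 0x00E9, 0x0416, 0x3042, 0x1F600):
        ch = chr(cp)
        u8 = len(ch.encode("utf-8"))
        u16 = len(ch.encode("utf-16-le"))  # -le so no BOM is counted
        print(f"U+{cp:04X}: UTF-8 {u8} byte(s), UTF-16 {u16} byte(s)")

It shows one byte saved for ASCII, a tie for U+0080 through U+07FF (and
for supplementary characters), and one extra byte for the rest of the
BMP: the same or more space in every non-ASCII case, never less.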

Later on...

<quote>
A: SCSU bridges the gap between an 8-bit based LZW and a 16-bit encoded
Unicode text, by removing the extra redundancy that is part of the
encoding (sequences of every other byte being null) and not a redundancy
in the content. The output of SCSU should be sent to LZW for block
compression where that's desired.
</quote>

The part about "sequences of every other byte being null" bothers me.
For one thing, this case is specific to Latin-1 usage.  In Cyrillic
text, you have sequences of every other byte being 0x04; in kana, it's
0x30; and so forth.  Then there's that word "null," which has a special
meaning of "nothing" or "unassigned" in many programming languages.  The
fact that Latin-1 text encoded as UTF-16 results in every other byte
being 0x00 has nothing to do with any of the symbolic meanings of
"null."

How about:

(sequences of every other byte being the same)
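
That matches what actually appears in the byte stream.  A quick Python
sketch makes the pattern visible (the sample strings are my own):

    # High byte of each UTF-16BE code unit: 0x00 for Latin-1,
    # 0x04 for Cyrillic, 0x30 for hiragana.  The byte repeats in
    # all three cases, but it is "null" only in the Latin-1 case.
    for label, text in (("Latin-1", "café"),
                        ("Cyrillic", "мир"),
                        ("kana", "かな")):
        data = text.encode("utf-16-be")
        print(label + ":", " ".join(f"{b:02X}" for b in data))

In each case every other byte repeats; only for Latin-1 is that
repeated byte 0x00.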


-Doug Ewell
 Fullerton, California


