The FAQ on compression says: <quote> Q: Why not use UTF-8 as compressed format? A: UTF-8 represents only the ASCII characters in less space than needed in UTF-16, for <i>all</i> other characters it expands. </quote>
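As a quick sanity check of the quoted claim, here is a small sketch (my own, not from the FAQ) comparing per-character byte counts in UTF-8 and UTF-16; the sample code points are illustrative choices:

```python
# Byte counts per code point in UTF-8 vs. UTF-16 (without BOM).
# Sample characters are my own picks, one per relevant range.
samples = {
    "ASCII 'A' (U+0041)": "\u0041",
    "Latin-1 '\u00e9' (U+00E9)": "\u00e9",
    "Cyrillic '\u0434' (U+0434)": "\u0434",
    "Hiragana '\u3042' (U+3042)": "\u3042",
}
for label, ch in samples.items():
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-le"))  # LE, no BOM
    print(f"{label}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

Running this shows that U+00E9 takes two bytes in both encodings: UTF-8 expands relative to UTF-16 only from U+0800 upward, not for "all" non-ASCII characters.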
The end of this sentence means "... it expands compared to UTF-16," and of course that is not true. Code points from U+0080 through U+07FF are represented in UTF-8 as two bytes, the same as in UTF-16. For an FAQ, this is an unfortunate error. Perhaps something along the lines of:

A: UTF-8 represents only the ASCII characters in less space than needed in UTF-16; for all other characters it requires the same or more space.

would be more accurate.

Later on...

<quote> A: SCSU bridges the gap between an 8-bit based LZW and a 16-bit encoded Unicode text, by removing the extra redundancy that is part of the encoding (sequences of every other byte being null) and not a redundancy in the content. The output of SCSU should be sent to LZW for block compression where that's desired. </quote>

The part about "sequences of every other byte being null" bothers me. For one thing, this case is specific to Latin-1 usage. In Cyrillic text, you have sequences of every other byte being 0x04; in kana, it's 0x30; and so forth.

Then there's that word "null," which has a special meaning of "nothing" or "unassigned" in many programming languages. The fact that Latin-1 text encoded as UTF-16 results in every other byte being 0x00 has nothing to do with any of the symbolic meanings of "null." How about:

(sequences of every other byte being the same)

-Doug Ewell
Fullerton, California
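The high-byte pattern described above is easy to demonstrate; this sketch (my own, with sample strings of my choosing) collects the high byte of each UTF-16 code unit for Latin-1, Cyrillic, and kana text:

```python
# Collect the high (second) byte of each UTF-16LE code unit.
# Shows "every other byte being the same" -- 0x00 only for Latin-1.
# Sample strings are illustrative, not from the FAQ.
texts = {
    "Latin-1": "caf\u00e9 au lait",
    "Cyrillic": "\u043f\u0440\u0438\u0432\u0435\u0442",
    "Kana": "\u3072\u3089\u304c\u306a",
}
for label, text in texts.items():
    data = text.encode("utf-16-le")          # LE, no BOM
    high_bytes = sorted({b for b in data[1::2]})
    print(label, [f"0x{b:02X}" for b in high_bytes])
```

For the Latin-1 sample every high byte is 0x00, for the Cyrillic sample 0x04, and for the kana sample 0x30, which is exactly the redundancy SCSU removes regardless of script.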

