Someone, I forgot who, questioned whether converting Unicode text to NFC would actually improve its compressibility, and asked if any actual data was available.
Certainly there is no guarantee that normalization would *always* result in a smaller file. A compressor that took advantage of normalization would have to determine whether there would be any benefit. One extremely simple example would be text that consisted mostly of Latin-1, but contained U+212B ANGSTROM SIGN and no other characters from that block. By converting this character to its canonical equivalent U+00C5:

* UTF-8 would use 2 bytes instead of 3.
* SCSU would use 1 byte instead of 2.
* BOCU-1 would use 1 or 2 bytes instead of always using 2.

A longer and more realistic case can be seen in the sample Korean file at:

http://www.cs.fit.edu/~ryan/compress/corpora/korean/arirang/arirang.txt

This file is in EUC-KR, but can easily be converted to Unicode using recode, SC UniPad, or another converter. It consists of 3,317,215 Unicode characters, over 96% of which are Hangul syllables plus Basic Latin spaces, full stops, and CRLFs. When the syllables are broken down into jamos (i.e. converting from NFC to NFD), the character count increases to 6,468,728.

The entropy of the syllables file is 6.729, yielding a "Huffman bit count" of 22.3 million bits. That's the theoretical minimum number of bits that could be used to encode this file, character by character, assuming a Huffman or arithmetic coding scheme designed to handle 16- or 32-bit Unicode characters. (Many general-purpose compression algorithms can do better.) The entropy of the jamos file is 4.925, but because the file contains nearly twice as many characters, its Huffman bit count is 31.8 million bits, almost 43% larger.

When encoded in UTF-8, SCSU, or BOCU-1, the syllables file is smaller than the jamos file by 55%, 17%, and 32% respectively. General-purpose algorithms tend to reduce the difference, but PKZip (using deflate) compresses the syllables file to an output 9% smaller than that of the jamos file. Using bzip2, the compressed syllables file is 2% smaller.
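As a quick illustration (not part of the original measurements), both effects can be reproduced in Python with the standard unicodedata module: the U+212B byte counts, and the way NFD inflates the character count and total "Huffman bit count" of Hangul text. The short sample string here is my own stand-in; arirang.txt itself is not used, so the exact entropy figures will differ from those above.

```python
import math
import unicodedata
from collections import Counter

# U+212B ANGSTROM SIGN is canonically equivalent to U+00C5; NFC composes it.
angstrom = "\u212B"
composed = unicodedata.normalize("NFC", angstrom)
assert composed == "\u00C5"
print(len(angstrom.encode("utf-8")), len(composed.encode("utf-8")))  # 3 bytes vs. 2

def entropy_bits(text: str) -> float:
    """Shannon entropy in bits per character."""
    n = len(text)
    counts = Counter(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# A short Hangul sample (an assumption, standing in for arirang.txt).
sample = "아리랑 아리랑 아라리요 아리랑 고개로 넘어간다 " * 100
nfc = unicodedata.normalize("NFC", sample)   # precomposed syllables
nfd = unicodedata.normalize("NFD", sample)   # decomposed into jamos

print(len(nfc), len(nfd))  # NFD roughly doubles the character count

# Total bits for an idealized per-character Huffman/arithmetic coder:
# even if the per-character entropy of the jamos text is lower, the
# larger character count makes the total bit count larger.
print(entropy_bits(nfc) * len(nfc), entropy_bits(nfd) * len(nfd))
```

This only models a coder that assigns one code per Unicode character; as noted above, general-purpose compressors exploit longer-range redundancy and can do better than this bound.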
So we can at least say that Korean, which can be normalized from NFD to NFC algorithmically, without the use of long tables of equivalents or exclusions, can consistently be compressed to a smaller size after such normalization than before. Whether a "silent" normalization to NFC can be a legitimate part of Unicode compression remains in question. I notice the list is still split as to whether this process "changes" the text (because checksums will differ) or not (because conformance clause C10 says processes must consider the text to be equivalent).

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

