On Sat, 29 Nov 2003, Doug Ewell wrote:

> A longer and more realistic case can be seen in the sample Korean file
> at:
>
> http://www.cs.fit.edu/~ryan/compress/corpora/korean/arirang/arirang.txt

I finally downloaded the file and took a look at it. I was surprised to find that the text is the entire content of volume 1 of a famous Korean novel (Arirang) by a _living_ Korean writer, CHO Chongrae (published in the early 1990s). This seems problematic because it's clearly copyrighted and I don't see any mention of permission having been obtained from the author or the publisher. Using the text for writing the paper may be all right, but putting it up on the web for everyone to download is not (afaik).

> This file is in EUC-KR, but can easily be converted to Unicode using

I read the novel (almost 10 years ago) and found a lot of Hangul syllables NOT covered by KS X 1001 (one of the two CCSs that make up EUC-KR, the other being US-ASCII/ISO 646:KR). [1] The novel contains a large amount of faithful transcription of the Cholla (south-western) dialect of Korean, and it's all but impossible to do that within the character repertoire of KS X 1001. So I was curious as to what they did in arirang.txt (because iconv(3) didn't detect any invalid byte sequence when I used it to convert from EUC-KR to UTF-8). It turned out that, in the arirang.txt they put up at www.cs.fit.edu in EUC-KR, they replaced every Hangul syllable outside KS X 1001 with either an ASCII space or the first Hangul compatibility jamo of the syllable. They should have used UTF-8 from the beginning. It wouldn't have changed their results very significantly, but it would still have given them slightly different numbers.

> can do better.)  The entropy of the jamos file is 4.925, yielding a
> Huffman bit count of 31.8 million bits, almost 43% larger.

> When encoded in UTF-8, SCSU, or BOCU-1, the syllables file is smaller
> than the jamos file by 55%, 17%, and 32% respectively.

You wrote the following earlier. In terms of the number of Unicode characters, going to NFD increases the size by almost 100%.

> 3,317,215 Unicode characters, over 96% Hangul syllables and Basic
> Latin spaces, full stops, and CRLFs.  When broken down into jamos
> (i.e. converting from NFC to NFD), the character count increases to
> 6,468,728.

So I was a bit confused by your 55% for a moment or two until I realized that the reference is the other way around (because you're talking about compression via normalization, which is different from my main reason for being interested in the issue). So NFD text (in UTF-8) is about twice as long as NFC text (in UTF-8). That's not as bad as a simple back-of-the-envelope calculation suggests. NFD text in SCSU and BOCU-1 is _only_ 20% and 47% longer, respectively, than NFC text in SCSU and BOCU-1. This is even better.

> General-purpose algorithms tend to reduce the difference, but PKZip
> (using deflate) compresses the syllables file to an output 9% smaller
> than that of the jamos file.  Using bzip2, the compressed syllables file
> is 2% smaller.

bzip2 is wonderful! With bzip2 narrowing the 'gulf' to ~2% and pkzip to ~11%, 'proponents' of using Hangul letters over Hangul syllables have another good argument as to why Hangul letters should be favored in representing Korean text. Thanks for the good news :-)

Jungshik

[1] Needless to say, when I read the novel, I didn't have the KS X 1001 table by my side. However, it's easy for me to spot Hangul syllables not covered by KS X 1001. Besides, when I read the sequel to 'Arirang', Han-gang (Han River), by the same author, which appeared daily on the Hangyoreh shinmun web site (http://www.hani.co.kr) a few years ago, Hangul syllables outside the KS X 1001 character repertoire were represented by sequences of Hangul Compatibility Jamos (U+3130) because the newspaper web site used (and still uses) EUC-KR. In every daily installment, there were at least several syllables represented that way.
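
The switch between "the syllables file is X% smaller" and "the jamos file is Y% longer" in the message above is easy to get backwards, so here is a minimal Python sketch of that arithmetic, together with a small NFC/NFD size comparison. The helper names `longer_by` and `nfc_nfd_sizes` and the three-syllable sample string are illustrative assumptions, not anything from the thread, and because the quoted percentages are already rounded, the results may differ by a point or so from the figures given in the text.

```python
import unicodedata


def longer_by(smaller_by: float) -> float:
    """If file A is `smaller_by` (a fraction) smaller than file B,
    return the fraction by which B is longer than A."""
    return 1.0 / (1.0 - smaller_by) - 1.0


def nfc_nfd_sizes(text: str) -> dict:
    """Character counts and UTF-8 byte counts of `text` in NFC and NFD."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return {
        "nfc_chars": len(nfc),
        "nfd_chars": len(nfd),
        "nfc_utf8_bytes": len(nfc.encode("utf-8")),
        "nfd_utf8_bytes": len(nfd.encode("utf-8")),
    }


if __name__ == "__main__":
    # Percentages quoted in the thread: syllables file smaller than jamos file.
    quoted = {"UTF-8": 0.55, "SCSU": 0.17, "BOCU-1": 0.32}
    for name, fraction in quoted.items():
        print(f"{name}: jamos file is about {longer_by(fraction):.0%} longer")

    # Hypothetical sample: three precomposed Hangul syllables (U+C544 U+B9AC U+B791).
    print(nfc_nfd_sizes("\uc544\ub9ac\ub791"))
```

For a whole file, the same `nfc_nfd_sizes` call could be applied to its decoded contents; the ratio will depend on how many syllables carry a final consonant, since those decompose into three jamos instead of two.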

