On Thu, 20 Nov 2003, Doug Ewell wrote:

> Jungshik Shin <jshin at mailaps dot org> wrote:
>
> >> In my experience, SCSU usually does perform somewhat better than
> >> BOCU-1, but for some scripts (e.g. Korean) the opposite often seems
> >> to be true.
> >
> > Just out of curiosity, which NF did you use for your uncompressed
> > source Korean text, NFC or NFD, when you got the above result?

I guess I'll know in a week or so when your paper is out, but...
Thanks a lot for providing the details.
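(An aside, just to make the NFC/NFD distinction for Hangul concrete:
this is only a toy illustration of mine in Python, and the two
syllables are an arbitrary example.)

  import unicodedata

  text = "\uD55C\uAE00"                     # "한글", two precomposed syllables (NFC)
  nfd = unicodedata.normalize("NFD", text)  # decomposed into conjoining jamo

  print(len(text), [hex(ord(c)) for c in text])
  # -> 2 ['0xd55c', '0xae00']
  print(len(nfd), [hex(ord(c)) for c in nfd])
  # -> 6 ['0x1112', '0x1161', '0x11ab', '0x1100', '0x1173', '0x11af']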
> although I did reproduce their results.  The file they used, called
> "arirang.txt," contains over 3.3 million Unicode characters and was
> apparently once part of their "Florida Tech Corpus of Multi-Lingual
> Text" but subsequently deleted for reasons not known to me.  I can
> supply it if you're interested.

It'd be great if you could.

> The file is all in syllables, not jamos, which I guess means it's in
> NFC.

Yes, it's in NFC, then.

> The statistics on this file are as follows:
>
> UTF-16                 6,634,430 bytes
> UTF-8                  7,637,601 bytes
> SCSU                   6,414,319 bytes
> BOCU-1                 5,897,258 bytes
> Legacy encoding (*)    5,477,432 bytes
>
> (*) KS C 5601, KS X 1001, or EUC-KR

Sorry to pick on this (when I really should just be thanking you). Even
with the coded character set vs. character encoding scheme distinction
set aside (that is, thinking purely in terms of character repertoire),
KS C 5601/KS X 1001 _alone_ cannot represent ordinary Korean text unless
you're willing to live with double-width spaces, Latin letters, digits,
and punctuation. Since you wrote that the file apparently has full stops
and spaces in ASCII, it does include characters outside KS X 1001.
EUC-KR (KS X 1001 + ISO 646:KR/US-ASCII), on the other hand, can.
Actually, I suspect the legacy encoding used was Windows code page 949
(or JOHAB/Windows-1361?), because I can't imagine there isn't a single
syllable outside the character repertoire of KS X 1001 among over
2 million syllables.

> I used my own SCSU encoder to achieve these results, but it really
> wouldn't matter which was chosen -- Korean syllables can be encoded in
> SCSU *only* by using Unicode mode.  It's not possible to set a window
> to the Korean syllable range.

Now that you've told me the file is in NFC, isn't this condition
similar to that of Chinese text? How do BOCU-1 and SCSU work for
Chinese text? Japanese text might do slightly better with Kana, but
isn't likely to be much better.

> Only the large number of spaces and full
> stops in this file prevented SCSU from degenerating entirely to 2 bytes
> per character.

That's why I asked. What I'm curious about is how SCSU and BOCU-1 on
NFD text (and on what Kent [2] and I think should have been the NFD,
with a possible rearrangement of the Jamo block's code points to allow
a smaller window size for SCSU) would compare with uncompressed UTF-16
of NFC, given that SCSU/BOCU-1 on NFC isn't much better than UTF-16.
A back-of-the-envelope calculation gives me 2.5 ~ 3 bytes per syllable
(without the code point rearrangement to put the jamos within a
64-character-long window [1]), so it's still worse than UTF-16.
However, that's not as bad as ~5 bytes (or more) per syllable without
SCSU/BOCU-1. I have to confess that I have only a very cursory
understanding of SCSU/BOCU-1.

As for why I'm interested in NFD, one of the reasons is that it is much
easier to implement incremental search (among many other things) in
terms of NFD. The only __editor__ I've ever seen that supports
incremental search in Korean per letter (as opposed to per syllable) is
Hemacs (Hangul Emacs). Not having per-letter incremental search in
Korean is like not being able to match 'apples/apple/apply',
'interesting/interested', and 'learned' with 'appl', 'interest', and
'learn'.
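For what it's worth, here is a toy sketch (Python; my own illustration,
not how Hemacs actually does it) of what per-letter matching via NFD
buys you. Mapping the compatibility jamo produced by a half-typed
keystroke onto conjoining jamo is glossed over here.

  import unicodedata

  def jamo_match(haystack: str, needle: str) -> bool:
      """Match at the jamo (letter) level by comparing NFD forms."""
      h = unicodedata.normalize("NFD", haystack)
      n = unicodedata.normalize("NFD", needle)
      return n in h

  # The half-typed syllable 하 (ㅎ + ㅏ) matches 한국 at the jamo level,
  # even though no complete syllable of 한국 equals 하.
  print(jamo_match("\uD55C\uAD6D", "\uD558"))   # -> True
  print("\uD558" in "\uD55C\uAD6D")             # plain NFC search -> False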
Jungshik

[1] By encoding only basic consonants and vowels, we could have put all
Hangul letters in a 64-character block. With all modern consonants and
vowels encoded (whether complex or not) and only basic consonants/vowels
encoded for archaic letters, it'd be slightly too large to fit into a
64-character block. If we encoded each consonant only once (instead of
twice, as leading and trailing, which is similar to the Tibetan case),
we could make it even more compact. (See the P.S. below for the rough
count.)

[2] I found that KIM Kyeong-seok (the Korean representative to ISO/IEC
JTC1/SC22/WG20 and JTC1/SC2/WG2) had also proposed changing the NFD
similarly. I'm not sure why it was submitted to JTC1 instead of the UTC
(NOT that we can remedy the problem now).
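P.S. In case the counting behind [1] looks hand-wavy, this is the rough
arithmetic I had in mind. The modern jamo inventory figures are the
standard ones; how such a rearranged block would actually be laid out
is, of course, purely hypothetical.

  # Modern jamo inventory: 19 * 21 * 28 = 11,172 precomposed syllables.
  LEADING, VOWELS, TRAILING = 19, 21, 27   # trailing count excludes "no final"

  # Leading and trailing consonants encoded separately (a handful of basic
  # archaic letters would come on top of this): already slightly over 64.
  print(LEADING + VOWELS + TRAILING)       # -> 67

  # Each consonant shape encoded only once (30 distinct modern consonants):
  CONSONANTS = 30
  print(CONSONANTS + VOWELS)               # -> 51, fits within 64 slots

  # Only the basic letters, with the complex ones built by composition:
  BASIC_CONSONANTS, BASIC_VOWELS = 14, 10
  print(BASIC_CONSONANTS + BASIC_VOWELS)   # -> 24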