Jungshik Shin <jshin at mailaps dot org> wrote:

>> In my experience, SCSU usually does perform somewhat better than
>> BOCU-1, but for some scripts (e.g. Korean) the opposite often seems
>> to be true.
>
> Just out of curiosity, which NF did you use for your uncompressed
> source Korean text, NFC or NFD when you got the above result?
> I guess I'll know in a week or so when your paper is out, but...

It was actually Steven Atkin's and Ryan Stansifer's test, not mine,
although I did reproduce their results. The file they used, called
"arirang.txt," contains over 3.3 million Unicode characters and was
apparently once part of their "Florida Tech Corpus of Multi-Lingual
Text" but was subsequently deleted for reasons not known to me. I can
supply it if you're interested.

The file is all in syllables, not jamos, which I guess means it's in
NFC. The statistics on this file are as follows:

    UTF-16                 6,634,430 bytes
    UTF-8                  7,637,601 bytes
    SCSU                   6,414,319 bytes
    BOCU-1                 5,897,258 bytes
    Legacy encoding (*)    5,477,432 bytes

    (*) KS C 5601, KS X 1001, or EUC-KR

I used my own SCSU encoder to achieve these results, but it really
wouldn't matter which was chosen -- Korean syllables can be encoded in
SCSU *only* by using Unicode mode. It's not possible to set a window
to the Korean syllable range. Only the large number of spaces and full
stops in this file prevented SCSU from degenerating entirely to 2 bytes
per character.

The creators of BOCU-1 (Davis and Scherer) also reported better
performance on Korean text for BOCU-1 than for SCSU (this was actually
the only script for which this could be said). They used the Korean
"What is Unicode?" page, which is also written in syllables rather
than jamos.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
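
As a rough check on the figures above, the following minimal Python sketch does two things. It converts the reported byte counts into approximate bytes per character, and it shows how NFC versus NFD changes the size of Hangul text before any compression. Two assumptions are not from the original message: the character count is taken as the UTF-16 byte count divided by 2 (all characters in the BMP, no BOM), and the sample string is a hypothetical one, not taken from arirang.txt. SCSU and BOCU-1 are not in the Python standard library, so they are not reimplemented here.

```python
import unicodedata

# Byte counts reported above for arirang.txt.
REPORTED_BYTES = {
    "UTF-16": 6_634_430,
    "UTF-8": 7_637_601,
    "SCSU": 6_414_319,
    "BOCU-1": 5_897_258,
    "Legacy": 5_477_432,
}

# Assumption: all characters are in the BMP and the UTF-16 figure has no BOM,
# so the character count is approximately the UTF-16 byte count divided by 2.
APPROX_CHARS = REPORTED_BYTES["UTF-16"] // 2

bytes_per_char = {
    name: size / APPROX_CHARS for name, size in REPORTED_BYTES.items()
}


def utf_sizes(text):
    """Return (UTF-16 bytes, UTF-8 bytes) for text, both without a BOM."""
    return len(text.encode("utf-16-be")), len(text.encode("utf-8"))


# Hypothetical sample: six Hangul syllables, one space, one full stop.
sample = "유니코드 한글."
nfc = unicodedata.normalize("NFC", sample)  # precomposed syllables
nfd = unicodedata.normalize("NFD", sample)  # conjoining jamos

if __name__ == "__main__":
    for name, ratio in bytes_per_char.items():
        print(f"{name:8s} {ratio:.3f} bytes/char")
    print("NFC:", len(nfc), "code points, (UTF-16, UTF-8) =", utf_sizes(nfc))
    print("NFD:", len(nfd), "code points, (UTF-16, UTF-8) =", utf_sizes(nfd))
    # EUC-KR covers the precomposed syllables, not the conjoining jamos,
    # so only the NFC form is encoded here.
    print("EUC-KR (NFC):", len(nfc.encode("euc_kr")), "bytes")
```

Under these assumptions SCSU comes out a little under 2 bytes per character, which matches the remark that only the spaces and full stops keep it from degenerating to exactly 2. The NFD form of the sample is twice as long in code points as the NFC form, which is why the normalization form of the source text matters when comparing compressed sizes.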