Re: Ternary search trees for Unicode dictionaries

Doug Ewell Fri, 21 Nov 2003 01:36:50 -0800

Jungshik Shin <jshin at mailaps dot org> wrote:

>> In my experience, SCSU usually does perform somewhat better than
>> BOCU-1, but for some scripts (e.g. Korean) the opposite often seems
>> to be true.
>
> Just out of curiosity, which NF did you use for your uncompressed
> source Korean text, NFC or NFD when you got the above result?
> I guess I'll know in a week or so when your paper is out, but...


It was actually Steven Atkin's and Ryan Stansifer's test, not mine,
although I did reproduce their results.  The file they used, called
"arirang.txt," contains over 3.3 million Unicode characters and was
apparently once part of their "Florida Tech Corpus of Multi-Lingual
Text" but subsequently deleted for reasons not known to me.  I can
supply it if you're interested.

The file is all in syllables, not jamos, which I guess means it's in
NFC.
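
As a rough illustration of the syllables-versus-jamos distinction, the sketch below counts precomposed Hangul syllables (U+AC00..U+D7A3) and conjoining jamo (U+1100..U+11FF) in a string. The helper name hangul_profile and the two-syllable sample are assumptions made for this example; this is not the check that was run on arirang.txt.

```python
import unicodedata

def hangul_profile(text):
    """Return (precomposed syllable count, conjoining jamo count)."""
    syllables = sum(1 for ch in text if 0xAC00 <= ord(ch) <= 0xD7A3)
    jamo = sum(1 for ch in text if 0x1100 <= ord(ch) <= 0x11FF)
    return syllables, jamo

sample = "\uD55C\uAE00"  # hypothetical sample: two precomposed syllables
decomposed = unicodedata.normalize("NFD", sample)

print(hangul_profile(sample))      # (2, 0): syllables only, as NFC would give
print(hangul_profile(decomposed))  # (0, 6): each syllable split into 3 jamo
print(unicodedata.normalize("NFC", decomposed) == sample)  # True
```

A file that reports only syllables and no conjoining jamo is at least consistent with NFC, though a full answer would compare the text against its own NFC form.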

The statistics on this file are as follows:

UTF-16                 6,634,430 bytes
UTF-8                  7,637,601 bytes
SCSU                   6,414,319 bytes
BOCU-1                 5,897,258 bytes
Legacy encoding (*)    5,477,432 bytes
    (*) KS C 5601, KS X 1001, or EUC-KR
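
The ordering of the UTF-16, UTF-8, and legacy figures can be reproduced in miniature with Python's built-in codecs. This is only a sketch: the function name encoded_sizes and the short sample string are assumptions, and SCSU and BOCU-1 are left out because the Python standard library has no codec for them.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes for a few built-in codecs."""
    return {
        # "utf-16-le" is used so that no byte order mark is counted.
        "UTF-16": len(text.encode("utf-16-le")),
        "UTF-8": len(text.encode("utf-8")),
        "EUC-KR": len(text.encode("euc_kr")),
    }

# Hypothetical sample: six Hangul syllables, one space, one full stop.
sample = "\uD55C\uAE00 \uC720\uB2C8\uCF54\uB4DC."
print(encoded_sizes(sample))
# {'UTF-16': 16, 'UTF-8': 20, 'EUC-KR': 14}
```

Each syllable costs 2 bytes in UTF-16 and EUC-KR but 3 bytes in UTF-8, while ASCII spaces and full stops cost 1 byte in UTF-8 and EUC-KR and 2 bytes in UTF-16.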

I used my own SCSU encoder to achieve these results, but it really
wouldn't matter which was chosen -- Korean syllables can be encoded in
SCSU *only* by using Unicode mode.  It's not possible to set a window to
the Korean syllable range.  Only the large number of spaces and full
stops in this file prevented SCSU from degenerating entirely to 2 bytes
per character.
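
The window limitation can be checked with a small sketch. It assumes the dynamic window offset table as given in UTS #6 (offset bytes 0x01..0x67, 0x68..0xA7, and the seven fixed offsets 0xF9..0xFF) and considers only BMP windows; the function names are made up for this example, and this is not code from any actual SCSU encoder.

```python
def dynamic_window_starts():
    """Start positions of every BMP dynamic window SCSU can define."""
    starts = [x * 0x80 for x in range(0x01, 0x68)]             # U+0080..U+3380
    starts += [x * 0x80 + 0xAC00 for x in range(0x68, 0xA8)]   # U+E000..U+FF80
    # Fixed offsets selected by offset bytes 0xF9..0xFF.
    starts += [0x00C0, 0x0250, 0x0370, 0x0530, 0x3040, 0x30A0, 0xFF60]
    return starts

def window_covers(cp):
    """True if some 128-character dynamic window contains code point cp."""
    return any(s <= cp < s + 0x80 for s in dynamic_window_starts())

print(window_covers(0x3042))  # True: hiragana fits in a window
print(window_covers(0xAC00))  # False: first Hangul syllable does not
print(any(window_covers(cp) for cp in range(0xAC00, 0xD7A4)))  # False
```

Under these assumptions no window reaches any code point between U+3400 and U+DFFF, which is why the syllables fall back to Unicode mode at 2 bytes each.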

The creators of BOCU-1 (Davis and Scherer) also reported better
performance on Korean text for BOCU-1 than for SCSU (this was actually
the only script for which this could be said).  They used the Korean
"What is Unicode?" page, which is also written in syllables rather than
jamos.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
