Re: Compression through normalization

Doug Ewell Wed, 03 Dec 2003 10:25:13 -0800

Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> I still think that we could try to use only LV syllables but not LVT
> syllables to reduce the set of Hangul characters used if this helps
> the final compressor.


Aha, LV syllables.  Now we are talking about something that exists and
can be used in the manner you describe.  It won't help SCSU or BOCU-1
compression, but it might improve the performance of a Huffman or
arithmetic implementation that can handle more than 256 characters, as
you stated below.
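
For reference, here is a minimal Python sketch of which code points in the precomposed Hangul syllable block are LV syllables. The constants follow the Hangul arithmetic in the Unicode Standard; the names is_lv_syllable and lv_syllables are just illustrative choices of mine.

```python
# Constants from the Hangul syllable arithmetic in the Unicode Standard.
S_BASE = 0xAC00                          # first precomposed syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # T_COUNT includes "no trailing consonant"
S_COUNT = L_COUNT * V_COUNT * T_COUNT    # size of the syllable block

def is_lv_syllable(cp: int) -> bool:
    """True if cp is a precomposed syllable with no trailing consonant."""
    s_index = cp - S_BASE
    return 0 <= s_index < S_COUNT and s_index % T_COUNT == 0

# Illustrative: collect every LV syllable in the block.
lv_syllables = [cp for cp in range(S_BASE, S_BASE + S_COUNT) if is_lv_syllable(cp)]

print(len(lv_syllables), "LV syllables out of", S_COUNT)
print([hex(cp) for cp in lv_syllables[:3]])
```

Each LV syllable is followed by its 27 LVT variants, so the LV syllables sit 28 code points apart across the whole block.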

> It's true that the LV syllables are discontinuous in the large Hangul
> johab syllable block. But it could reduce the number of needed codes
> in compression lookup dictionaries and would limit the number of
> table resets by exhausting the lookup table less often, and it would
> also allow finding compressible similarities in the text stream at
> much shorter distances than within a text using a lot of LVT
> syllables. So the impact of the scattered LV syllables in the johab
> set would still be low.

True.  Don't try it with SCSU, though, because you'd be constantly
jumping between single-byte and Unicode mode (or using four bytes for
every LVT syllable).  And don't try it with BOCU-1, because every switch
between the jamos block and the syllable block will cost three bytes.

> I will retry to compress Korean by using NFC form modified by
> excluding LVT johab syllables but only keeping LV johab syllables and
> separate L or V or T jamos...

UAX #15 includes sample Java code showing, among other things, how to
compose an LV syllable plus a T jamo into a syllable.  It would be
relatively easy to reverse the logic, though of course the UAX does not
show that because the result is neither NF(K)C nor NF(K)D.
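
Reversing the logic might look something like the following Python sketch. This is not the UAX #15 sample code, only my own assumption of how one could do it: split_lvt and nfc_without_lvt are hypothetical names, and the result is a private form, not any Unicode normalization form.

```python
import unicodedata

S_BASE = 0xAC00            # first precomposed Hangul syllable
T_BASE = 0x11A7            # one before the first trailing-consonant jamo
T_COUNT = 28               # trailing positions per LV, including "none"
S_COUNT = 19 * 21 * T_COUNT

def split_lvt(ch: str) -> str:
    """Split an LVT syllable into LV syllable + T jamo; pass others through."""
    s_index = ord(ch) - S_BASE
    if 0 <= s_index < S_COUNT:
        t_index = s_index % T_COUNT
        if t_index != 0:   # t_index == 0 means the syllable is already LV
            return chr(ord(ch) - t_index) + chr(T_BASE + t_index)
    return ch

def nfc_without_lvt(text: str) -> str:
    """NFC first, then break every LVT syllable into LV + T (hypothetical form)."""
    return "".join(split_lvt(ch) for ch in unicodedata.normalize("NFC", text))

sample = "\ud55c\uae00"    # two LVT syllables (U+D55C, U+AE00)
modified = nfc_without_lvt(sample)
print([hex(ord(c)) for c in modified])
# Ordinary NFC recomposes LV + T, so the transformation is reversible.
print(unicodedata.normalize("NFC", modified) == sample)
```

Since plain NFC puts the LVT syllables back together, a decompressor would only need to normalize its output to recover the original NFC text.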

Speaking of which, I just noticed that the function in SC UniPad to
compose syllables from jamos does not handle this case (LV + T = LVT).
I'll have to report that to the UniPad team.

> I just have another question for Korean: many jamos are in fact
> composed from other jamos: this is clearly visible both in their name
> and in their composed glyph. What would be the linguistic impact of
> decomposing them (not canonically!)? Do Koreans really learn these
> jamos without breaking them into their components? I think here about
> SSANG (double) consonants, or the initial Y or final E of some
> vowels...

This would be a good question for Jungshik or another native Korean speaker.  I
have read that Korean children learn the syllables as whole units,
rather than as an arrangement of jamos as I would see them, leading some
to think of Hangul as a featural syllabary instead of an alphabet.

> Of course I won't be able to use such decomposition in Unicode, but
> would it be possible to use it in some private encoding created with a
> m:n charset mapping from/to Unicode?

You can do absolutely anything you like in a private encoding.  Bernard
Miller did:

http://www.bytext.org/

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
