Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> I still think that we could try to use only LV syllables but not LVT
> syllables to reduce the set of Hangul character used if this helps
> the final compressor.

Aha, LV syllables.  Now we are talking about something that exists and
can be used in the manner you describe.  It won't help SCSU or BOCU-1
compression, but it might improve the performance of a Huffman or
arithmetic implementation that can handle more than 256 characters, as
you stated below.

> It's true that the LV syllables are discontinuous in the large Hangul
> johab syllable block. But it could reduce the number of needed codes
> in compression lookup dictionnaries and would limit the number of
> table resets by exhausting less often the lookup table, and it would
> also allow finding compressable similarities in the text stream at
> much shorter distances than within a text using a lot of LVT
> syllables. So the impact of the spreaded LV syllables in the johab
> set would still be low.

True.  Don't try it with SCSU, though, because you'd be constantly
jumping between single-byte and Unicode mode (or using four bytes for
every LVT syllable).  And don't try it with BOCU-1, because every switch
between the jamos block and the syllable block will cost three bytes.

> I will retry to compress Korean by using NFC form modified by
> excluding LVT johab syllables but only keeping LV johab syllables and
> separate L or V or T jamos...

UAX #15 includes sample Java code showing, among other things, how to
compose an LV syllable plus a T jamo into a syllable.  It would be
relatively easy to reverse the logic, though of course the UAX does not
show that because it is neither NF(K)C nor NF(K)D.

Speaking of which, I just noticed that the function in SC UniPad to
compose syllables from jamos does not handle this case (LV + T = LVT).
I'll have to report that to the UniPad team.

> I just have another question for Korean: many jamos are in fact
> composed from other jamos: this is clearly visible both in their name
> and in their composed glyph. What would be the linguistic impact of
> decomposing them (not canonically!)?
> Do Korean really learn these
> jamos without breaking them into their components? I think here about
> SSANG (double) consonnants, or the initial Y or final E of some
> vowels...

This would be a good question for Jungshik or another native Korean.  I
have read that Korean children learn the syllables as whole units,
rather than as an arrangement of jamos as I would see them, leading some
to think of Hangul as a featural syllabary instead of an alphabet.

> Of couse I won't be able to use such decomposition in Unicode, but
> would it be possible to use it in some private encoding created with a
> m:n charset mapping from/to Unicode?

You can do absolutely anything you like in a private encoding.  Bernard
Miller did:

http://www.bytext.org/

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
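
A footnote on reversing the LV + T composition discussed above.  What
follows is only a minimal sketch in Python, not the Java sample code from
UAX #15.  The function names split_lvt and to_lv_form are made up for
illustration, and the sketch assumes the input should first be brought
to NFC so that every Hangul syllable is precomposed.  The constants are
the usual ones from the Hangul syllable arithmetic in the Unicode
Standard.

```python
import unicodedata

# Constants from the Hangul syllable arithmetic in the Unicode Standard.
S_BASE = 0xAC00          # first precomposed syllable (U+AC00)
T_BASE = 0x11A7          # one before the first trailing consonant jamo
T_COUNT = 28             # 27 trailing consonants plus "no trailing consonant"
S_COUNT = 19 * 21 * 28   # 11172 precomposed syllables in total


def split_lvt(ch):
    """Hypothetical helper: split one LVT syllable into LV syllable + T jamo.

    Any other character, including an LV syllable, is returned unchanged.
    """
    s_index = ord(ch) - S_BASE
    if 0 <= s_index < S_COUNT:
        t_index = s_index % T_COUNT
        if t_index != 0:
            lv = chr(S_BASE + s_index - t_index)   # same syllable without T
            t = chr(T_BASE + t_index)              # the conjoining T jamo
            return lv + t
    return ch


def to_lv_form(text):
    """Hypothetical helper: NFC first, then break every LVT into LV + T.

    The result is neither NFC nor NFD; it is a private intermediate form.
    """
    nfc = unicodedata.normalize("NFC", text)
    return "".join(split_lvt(ch) for ch in nfc)


if __name__ == "__main__":
    sample = "\uD55C\uAE00"   # two LVT syllables
    converted = to_lv_form(sample)
    print([f"U+{ord(c):04X}" for c in converted])
```

Because LV + T is canonically equivalent to the LVT syllable, running the
result back through NFC should restore the original text, so the
transformation loses nothing.  It leaves only the 399 (19 x 21) LV
syllables plus the conjoining jamos for a compressor to model.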

