Jungshik Shin writes:
> On Wed, 3 Dec 2003, Philippe Verdy wrote:
>
> > I just have another question for Korean: many jamos are in fact composed
> > from other jamos: this is clearly visible both in their name and in their
> > composed glyph. What would be the linguistic impact of decomposing them
> > (not canonically!)? Do Koreans really learn these jamos without breaking
> > them into their components?
>
> The Korean alphabet invented in 1443 and announced in 1446 included 17
> consonants and 11 vowels. Modern Korean uses 14 consonants and 10 vowels
> (3 consonants and 1 vowel have become obsolete.)
Very interesting, as it confirms my feeling that the Hangul script could have been encoded completely as an alphabet in only two columns, including special symbols and punctuation. This conforms to an encoding model I had seen a dozen years ago for the encoding of Chinese and Korean with very reduced code sets, using a separate, complex but implementable set of composition rules that would have allowed an easy integration within existing 8-bit or even 7-bit technologies (for example in Teletex). I think that this work was performed for a candidate ETSI standard (but my memory may fail here) to be used in TV set decoders.

When I read those research papers, they demonstrated that the Han and Hangul scripts are much less complex at the abstract level than the way they appear in their written, composed form; and that, depending on the composition capability of the renderer (or on the screen resolution), a linear decomposed representation was still possible and still readable, possibly by using visible composition symbolic glyphs. (For Han, I can remember some alternate presentation forms based on linearized radicals, which could have fitted low-resolution devices, giving results quite similar to the approach used to approximate the composition of Latin with spacing rather than non-spacing diacritics, such as "a`" instead of "à", something much better and more user-friendly than showing a null glyph for an unsupported composed character.)

Even today, this analysis of the Hangul script at a very abstract level helps in creating convenient input methods: your count of basic letters shows that it becomes very easy to map these letters onto keyboards, without needing hard-to-learn input methods, as the input editor can process the input basic letters into standard decomposed jamos or into precomposed johab syllables.
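As an aside, that last step is pure arithmetic; here is a minimal sketch in Python of the mapping from conjoining jamos to a precomposed syllable, using the constants defined by the Unicode Standard's Hangul syllable composition algorithm (the function name is mine, not a library API):

```python
# Minimal sketch of the arithmetic an input editor can use to turn basic
# letters (leading consonant L, vowel V, optional trailing consonant T)
# into one precomposed Hangul syllable. Constants are those of the
# Unicode Standard's Hangul composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose_syllable(l, v, t=None):
    """Compose conjoining jamos L, V (and optional T) into a syllable."""
    l_index = ord(l) - L_BASE
    v_index = ord(v) - V_BASE
    t_index = ord(t) - T_BASE if t else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

# HIEUH (U+1112) + A (U+1161) + final NIEUN (U+11AB) -> U+D55C (HAN)
print(hex(ord(compose_syllable('\u1112', '\u1161', '\u11AB'))))  # 0xd55c
```

The inverse mapping (decomposition) is the same arithmetic run backwards, which is what makes round-tripping between decomposed jamos and johab syllables so cheap.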
The other interest is that it effectively allows efficient search and indexing algorithms on Hangul texts, by allowing matches below the level of the jamos currently composed in Unicode.

> Korean 'ABC-song' enumerates them only (i.e. it doesn't include
> cluster/complex letters.)

That's a good proof that children can learn the Hangul script by recognizing this very small set of letters, separately from the 2D layout used to make them fit within a single syllable glyph (typically entered by pressing the spacebar between syllables to render the composed grapheme cluster).

> > I think here about SSANG (double) consonants, or the initial Y
> > or final E of some vowels...
> > Of course I won't be able to use such decomposition in Unicode,
> > but would it be possible to use it in some private encoding
> > created with a m:n charset mapping from/to Unicode?
>
> That kind of composition/decomposition is necessary for linguistic
> analysis of Korean. Search engines (e.g. Google), rendering engines
> and incremental searches also need that.

Unicode has promoted the use of decompositions for Latin, Greek and Cyrillic, but it's a shame that this was not done for Hangul, and that multiple design errors were made which are now immutable due to the stability policy. Now, if an IDNA system is to include Hangul domain names, I do think that these names would need to be reserved in bundles matching more strings than just the Unicode canonical equivalents, or even the compatibility equivalents; additional decompositions will be needed. The same will also be necessary in the orthographic correctors (spelling checkers) used in word processors, and you point out that search engines need this too. This adds to the discussion about the best encoding to use to parse Hangul texts: extended decompositions will probably allow matching equivalent text more precisely, and their recomposition into optimized Unicode jamos or johab syllables can be automated within editors.
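To make the kind of non-canonical decomposition discussed above concrete, here is a small sketch. The table is a hand-made illustration covering only a few compound jamos (the table itself is an assumption of this sketch, not a Unicode property), but the code points and their graphical components are the real ones:

```python
# Sketch of an extra, NON-canonical decomposition table: compound jamos
# broken into their graphical components, so that a search can match
# below the jamo level. Hand-made and deliberately incomplete.
EXTRA_DECOMP = {
    '\u1101': '\u1100\u1100',  # SSANGKIYEOK       -> KIYEOK + KIYEOK
    '\u1104': '\u1103\u1103',  # SSANGTIKEUT       -> TIKEUT + TIKEUT
    '\u110A': '\u1109\u1109',  # SSANGSIOS         -> SIOS + SIOS
    '\u116A': '\u1169\u1161',  # WA                -> O + A
    '\u11AA': '\u11A8\u11BA',  # final KIYEOK-SIOS -> KIYEOK + SIOS (final)
}

def fully_decompose(jamo_text):
    """Replace each compound jamo by its components. One pass suffices
    here, since the table's values are already basic letters."""
    return ''.join(EXTRA_DECOMP.get(ch, ch) for ch in jamo_text)
```

A search engine would apply such a table to both the query and the indexed text (after canonical decomposition of the syllables), so that a query containing a plain KIYEOK can also match a SSANGKIYEOK.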
Other candidate compositions could also be looked up within the Hangul compatibility syllables, so that Korean text would compress much better than it does now with just NFC compositions. I already have several applications needing "custom" decompositions to parse text, which are not solved today with NFC/NFD or even NFKC/NFKD, and this may be a place where Unicode should provide support, by defining a new set of extended decompositions (not to be used for the normalized forms, as these are now stabilized for better or for worse) for correct text parsing in the various languages using these scripts. It won't be up to ISO/IEC 10646 to define these decompositions, as defining properties is not its work; its work is only to include and unify existing repertoires. If needed for linguistic processing, we may find that some characters should be decomposed into characters that are still not encoded in the ISO/IEC 10646 repertoire, but that is something that could be integrated in future revisions (so that Unicode can later refine the extended decompositions).
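For what it's worth, the gap in the existing normalization forms is easy to demonstrate with Python's standard unicodedata module; a short sketch:

```python
# Sketch: even NFKD does not break a compound jamo (here SSANGKIYEOK,
# the doubled KIYEOK) into its components, so matching below the jamo
# level needs the extra decompositions discussed above.
import unicodedata

syllable = '\uAE4C'                       # the syllable SSANGKIYEOK + A
nfd = unicodedata.normalize('NFD', syllable)
nfkd = unicodedata.normalize('NFKD', syllable)
print([hex(ord(c)) for c in nfd])   # ['0x1101', '0x1161']
print(nfd == nfkd)                  # True: no further decomposition
# A search for plain KIYEOK (U+1100) in the normalized text still fails:
print('\u1100' in nfkd)             # False
```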

