Jungshik Shin writes:
> On Wed, 3 Dec 2003, Philippe Verdy wrote:
>
> > I just have another question for Korean: many jamos are in fact composed
> > from other jamos: this is clearly visible both in their name and in their
> > composed glyph. What would be the linguistic impact of decomposing them
> > (not canonically!)? Do Koreans really learn these jamos without breaking
> > them into their components?
>
> The Korean alphabet invented in 1443 and announced in 1446 included 17
> consonants and 11 vowels. Modern Korean uses 14 consonants and 10 vowels
> (3 consonants and 1 vowel have become obsolete.)
Very interesting, as it confirms my feeling that the Hangul script could have been encoded completely as an alphabet in only two columns, including special symbols and punctuation. This conforms to an encoding model I had seen a dozen years ago for the encoding of Chinese and Korean with very reduced code sets, using a separate, complex but implementable set of composition rules that would have allowed an easy integration within existing 8-bit or even 7-bit technologies (for example in Teletex). I think that this work was performed for a candidate ETSI standard (but my memory may fail here) to be used in TV set decoders.

When I read those research papers, they demonstrated that the Han and Hangul scripts are much less complex at the abstract level than the way they appear in their written, composed form; and that, depending on the composition capability of the renderer (or on the screen resolution), a linear decomposed representation was still possible and still readable, possibly by using visible composition symbolic glyphs. (For Han, I can remember some alternate presentation forms based on linearized radicals, which could have fitted low-resolution devices, giving results quite similar to the approach used to approximate the composition of Latin with spacing rather than non-spacing diacritics, such as "a`" instead of "à", something much better and more user-friendly than showing a null glyph for an unsupported composed character.)

Even today, this analysis of the Hangul script at a very abstract level helps in creating convenient input methods: your count of basic letters shows that it becomes very easy to map these letters onto keyboards, without needing hard-to-learn input methods, as the input editor can process the input basic letters into standard decomposed jamos or into precomposed johab syllables.
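As an aside, that last step is pure arithmetic; here is a minimal sketch in Python of the mapping from conjoining jamos to a precomposed syllable, using the constants defined by the Unicode Standard's Hangul syllable composition algorithm (the function name is mine, not a library API):

```python
# Minimal sketch of the arithmetic an input editor can use to turn basic
# letters (leading consonant L, vowel V, optional trailing consonant T)
# into one precomposed Hangul syllable. Constants are those of the
# Unicode Standard's Hangul composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose_syllable(l, v, t=None):
    """Compose conjoining jamos L, V (and optional T) into a syllable."""
    l_index = ord(l) - L_BASE
    v_index = ord(v) - V_BASE
    t_index = ord(t) - T_BASE if t else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

# HIEUH (U+1112) + A (U+1161) + final NIEUN (U+11AB) -> U+D55C (HAN)
print(hex(ord(compose_syllable('\u1112', '\u1161', '\u11AB'))))  # 0xd55c
```

The inverse mapping (decomposition) is the same arithmetic run backwards, which is what makes round-tripping between decomposed jamos and johab syllables so cheap.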
The other interest is that it effectively allows efficient search and indexing algorithms on Hangul texts, by allowing matches below the level of the jamos currently composed in Unicode.

> Korean 'ABC-song' enumerates them only (i.e. it doesn't include
> cluster/complex letters.)

That's a good proof that children can learn the Hangul script by recognizing this very small set of letters, separately from the 2D layout used to make them fit within a single syllable glyph (typically entered by pressing the spacebar between syllables to render the composed grapheme cluster).

> > I think here about SSANG (double) consonants, or the initial Y
> > or final E of some vowels...
> > Of course I won't be able to use such decomposition in Unicode,
> > but would it be possible to use it in some private encoding
> > created with a m:n charset mapping from/to Unicode?
>
> That kind of composition/decomposition is necessary for linguistic
> analysis of Korean. Search engines (e.g. Google), rendering engines
> and incremental searches also need that.

Unicode has promoted the use of decompositions for Latin, Greek and Cyrillic, but it's a shame that this was not done for Hangul, and that multiple design errors were made which are now immutable due to the stability policy. Now, if an IDNA system is to include Hangul domain names, I do think that these names would need to be reserved in bundles matching more strings than just the Unicode canonical equivalents, or even the compatibility equivalents; additional decompositions will be needed. The same will also be necessary in the orthographic correctors (spelling checkers) used in word processors, and you point out that search engines need this too. This adds to the discussion about the best encoding to use to parse Hangul texts: extended decompositions will probably allow matching equivalent text more precisely, and their recomposition into optimized Unicode jamos or johab syllables can be automated within editors.
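To make the kind of non-canonical decomposition discussed above concrete, here is a small sketch. The table is a hand-made illustration covering only a few compound jamos (the table itself is an assumption of this sketch, not a Unicode property), but the code points and their graphical components are the real ones:

```python
# Sketch of an extra, NON-canonical decomposition table: compound jamos
# broken into their graphical components, so that a search can match
# below the jamo level. Hand-made and deliberately incomplete.
EXTRA_DECOMP = {
    '\u1101': '\u1100\u1100',  # SSANGKIYEOK       -> KIYEOK + KIYEOK
    '\u1104': '\u1103\u1103',  # SSANGTIKEUT       -> TIKEUT + TIKEUT
    '\u110A': '\u1109\u1109',  # SSANGSIOS         -> SIOS + SIOS
    '\u116A': '\u1169\u1161',  # WA                -> O + A
    '\u11AA': '\u11A8\u11BA',  # final KIYEOK-SIOS -> KIYEOK + SIOS (final)
}

def fully_decompose(jamo_text):
    """Replace each compound jamo by its components. One pass suffices
    here, since the table's values are already basic letters."""
    return ''.join(EXTRA_DECOMP.get(ch, ch) for ch in jamo_text)
```

A search engine would apply such a table to both the query and the indexed text (after canonical decomposition of the syllables), so that a query containing a plain KIYEOK can also match a SSANGKIYEOK.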
Other candidate compositions could also be looked up within the Hangul compatibility syllables, so that Korean text would compress much better than it does now with just NFC compositions. I already have several applications needing "custom" decompositions to parse text, which are not solved today with NFC/NFD or even NFKC/NFKD, and this may be a place where Unicode should provide support, by defining a new set of extended decompositions (not to be used for the normalized forms, as these are now stabilized for better or for worse) for correct text parsing in the various languages using these scripts. It won't be up to ISO/IEC 10646 to define these decompositions, as defining properties is not its work; its work is only to include and unify existing repertoires. If needed for linguistic processing, we may find that some characters should be decomposed into characters that are still not encoded in the ISO/IEC 10646 repertoire, but that is something that could be integrated in future revisions (so that Unicode can later refine the extended decompositions).
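For what it's worth, the gap in the existing normalization forms is easy to demonstrate with Python's standard unicodedata module; a short sketch:

```python
# Sketch: even NFKD does not break a compound jamo (here SSANGKIYEOK,
# the doubled KIYEOK) into its components, so matching below the jamo
# level needs the extra decompositions discussed above.
import unicodedata

syllable = '\uAE4C'                       # the syllable SSANGKIYEOK + A
nfd = unicodedata.normalize('NFD', syllable)
nfkd = unicodedata.normalize('NFKD', syllable)
print([hex(ord(c)) for c in nfd])   # ['0x1101', '0x1161']
print(nfd == nfkd)                  # True: no further decomposition
# A search for plain KIYEOK (U+1100) in the normalized text still fails:
print('\u1100' in nfkd)             # False
```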

