RE: Compression through normalization

Jungshik Shin Wed, 03 Dec 2003 11:31:15 -0800

On Wed, 3 Dec 2003, Philippe Verdy wrote:

> I just have another question for Korean: many jamos are in fact composed
> from other jamos: this is clearly visible both in their name and in their
> composed glyph. What would be the linguistic impact of decomposing them (not
> canonically!)? Do Korean really learn these jamos without breaking them into
> their components? I think here about SSANG (double) consonnants, or the


  The Korean alphabet, invented in 1443 and announced in 1446, included 17
consonants and 11 vowels. Modern Korean uses 14 consonants and 10 vowels
(3 consonants and 1 vowel have become obsolete). The Korean 'ABC song'
enumerates only these (i.e. it doesn't include cluster/complex letters).
The vowel 'U+119E ARAE A' (ᆞ) was used until the early 20th century,
when it was 'officially' retired in the draft standard of Korean
orthography by the Korean Linguistic Society in 1933 [1], which became
the basis of both South and North Korean orthographic standards after
the division of the country.  See p. 6 of the PDF file (p. 2 in the
actual document) of the scanned copy of the draft standard for the list
of Korean letters along with their names (the upper left part of p. 6 in
the PDF when rotated counterclockwise by 90 degrees).  All other letters
are composed out of these basic ones. A few additional consonants were
used briefly to transcribe Chinese phonemes in phonetic textbooks in the
15th century, but have not been used otherwise.

  Kent and I have written on several occasions that complex Korean letters
(Korean letter clusters) should have been made _canonically_ equivalent
to sequences of basic Korean letters. They were compatibility equivalent
to each other in Unicode 2.0, but even that compatibility equivalence was
removed instead of being upgraded to canonical equivalence.  That's
another mistake in the Korean encoding in Unicode. In the first place,
complex Korean letters should not have been encoded, just as precomposed
syllables should not have been. With NFC/NFD frozen forever,
it is now impossible to rectify this.
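
  The missing equivalence can be checked with a few lines of Python. This
is only a sketch: it assumes the standard `unicodedata` module, and the two
helper functions are names made up for this illustration. It compares the
complex letter U+1101 (SSANGKIYEOK) with a sequence of two U+1100 (KIYEOK).

```python
import unicodedata

SSANGKIYEOK = "\u1101"         # HANGUL CHOSEONG SSANGKIYEOK (complex letter)
KIYEOK_PAIR = "\u1100\u1100"   # two HANGUL CHOSEONG KIYEOK (basic letters)


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: equal after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: equal after compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


# The complex letter carries no decomposition mapping at all,
# so neither normalization form relates it to the basic-letter sequence.
print(repr(unicodedata.decomposition(SSANGKIYEOK)))
print(canonically_equivalent(SSANGKIYEOK, KIYEOK_PAIR))
print(compatibility_equivalent(SSANGKIYEOK, KIYEOK_PAIR))

# A precomposed syllable, by contrast, does decompose canonically,
# but only down to the complex letter, not to basic letters.
print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\uae4c")])
```

Since the normalization forms are frozen, any mapping between the two
spellings has to live outside NFC/NFD.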

> initial Y or final E of some vowels...
> Of couse I won't be able to use such decomposition in Unicode, but would it
> be possible to use it in some private encoding created with a m:n charset
> mapping from/to Unicode?

  That kind of composition/decomposition is necessary for linguistic
analysis of Korean.  Search engines (e.g. Google), rendering engines
and incremental search also need it.  See

  http://i18nl10n.com/korean/jamo.html
  (you need the UnBatang font, a GPL'd OpenType font for Korean
   available at http://i18nl10n.com/fonts/UnBatang.ttf, and Mozilla
   on either Linux/Unix or Windows. Uniscribe on XP
   can take advantage of Korean OpenType fonts, but only to a limited extent.
   In particular, it doesn't support the kind of equivalence I'm talking
   about here, so even on Windows 2k/XP I had to build a custom
   composition routine for Mozilla)
  http://i18nl10n.com/korean/jamocomp.html
  http://bugzilla.mozilla.org/show_bug.cgi?id=176315
  http://bugzilla.mozilla.org/show_bug.cgi?id=177877
  http://bugzilla.mozilla.org/show_bug.cgi?id=176290
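
  The kind of non-canonical decomposition discussed above could be
sketched like this in Python. It is an illustration only, not the routine
used in Mozilla: the table `CLUSTER_TO_BASIC` is a small hand-written sample
(a real one would cover every complex letter, including the old ones), and
both function names are invented for this example.

```python
import unicodedata

# Illustrative sample only: complex jamo -> sequence of basic jamo.
CLUSTER_TO_BASIC = {
    "\u1101": "\u1100\u1100",  # choseong SSANGKIYEOK -> KIYEOK KIYEOK
    "\u1104": "\u1103\u1103",  # choseong SSANGTIKEUT -> TIKEUT TIKEUT
    "\u1108": "\u1107\u1107",  # choseong SSANGPIEUP  -> PIEUP PIEUP
    "\u110a": "\u1109\u1109",  # choseong SSANGSIOS   -> SIOS SIOS
    "\u110d": "\u110c\u110c",  # choseong SSANGCIEUC  -> CIEUC CIEUC
    "\u116a": "\u1169\u1161",  # jungseong WA         -> O A
    "\u11a9": "\u11a8\u11a8",  # jongseong SSANGKIYEOK -> KIYEOK KIYEOK
    "\u11aa": "\u11a8\u11ba",  # jongseong KIYEOK-SIOS -> KIYEOK SIOS
}


def decompose_to_basic_jamo(text: str) -> str:
    """Apply NFD (syllable -> jamo), then split the complex jamo listed above."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(CLUSTER_TO_BASIC.get(ch, ch) for ch in nfd)


def matches_prefix(typed: str, candidate: str) -> bool:
    """Incremental-search style check on the fully decomposed forms."""
    return decompose_to_basic_jamo(candidate).startswith(
        decompose_to_basic_jamo(typed)
    )


# U+AE4C is the syllable SSANGKIYEOK + A; U+ACFC is KIYEOK + WA.
print([hex(ord(c)) for c in decompose_to_basic_jamo("\uae4c")])
print([hex(ord(c)) for c in decompose_to_basic_jamo("\uacfc")])
# A single typed KIYEOK now matches the start of the SSANGKIYEOK syllable.
print(matches_prefix("\u1100", "\uae4c"))
```

With plain NFD the last check would fail, because the syllable decomposes
only to U+1101 U+1161 and U+1101 is unrelated to U+1100 as far as
normalization is concerned.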

  Jungshik

[1] http://i18nl10n.com/korean/orth1933.pdf
