> If some process using text is sensitive to the *interpretation* of
> the text, i.e. it is concerned about the content and meaning of
> the letters involved, then normalization, to forms NFC or NFD,
> which only involve canonical equivalences, will *not* make a difference.
> Or to be more subtle about it, it might make a difference, but it
> is nonconformant to claim that a process which claims it does not
> make a difference is nonconformant.
>
> If you can parse that last sentence, then you are well on the
> way to understanding the Tao of Unicode.
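To make the quoted point concrete, here is a minimal Python sketch (my own illustration, not from the quoted text) showing that NFC and NFD only exchange canonically equivalent spellings, so a process that cares about the letters themselves sees no difference:

```python
import unicodedata

# The same letter 'é', spelled two canonically equivalent ways:
nfc = "\u00E9"        # one precomposed code point, LATIN SMALL LETTER E WITH ACUTE
nfd = "e\u0301"       # 'e' followed by COMBINING ACUTE ACCENT

print(nfc == nfd)                                 # False: different code point sequences
print(unicodedata.normalize("NFD", nfc) == nfd)   # True: NFD recovers the decomposed form
print(unicodedata.normalize("NFC", nfd) == nfc)   # True: NFC recovers the composed form
```

Both spellings denote the same letter; normalization merely picks one canonical spelling, which is exactly why a conformant, interpretation-sensitive process must treat them identically.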
Still on the same subject, how do the old KS X standards for Hangul compare to each other? Suppose they are upward compatible, and specify that converting an old text not using compound letters to the new standard does not mandate composing them into compound jamos (since the two spellings are considered equivalent there). Then there is a problem when one text is converted directly from the old standard's set of jamos to Unicode, while another is first converted to the new set and then to Unicode: Unicode will not treat the two results as canonically equivalent, even though they would have been considered equivalent under the Korean standards.

So we will find various data containing precomposed jamos for the johab set, and other data not using them. The visible script itself is not altered, but the encoding is different and uses alternate decompositions which are not canonically equivalent in Unicode. Now suppose you have some data coded only with single-letter jamos: there is no way, within Unicode, to unify it with modern data; unifying such strings requires rearranging the jamos. This is an issue for converters, and it remains an issue within Unicode itself, because single-letter jamos are not deprecated and are in fact necessary for modern Hangul (they are not "compatibility characters", and they participate in the composition of johab syllables for the determination of canonical equivalence).

If your compressor or transcoder is not allowed to perform any rearrangement of jamos for modern Hangul, that restriction should be relaxed for legacy data, where jamos should preferably have been precomposed before being converted to Unicode. Such data will persist for a long time, because it is so easy for a Korean writer to insert or delete a single-letter jamo when making corrections (or because of a missing keystroke in the input method used to compose the text initially).
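The asymmetry described above can be demonstrated directly in Python (a small sketch of my own; the choice of SSANGKIYEOK is just one example of a compound jamo): NFC will compose a compound lead jamo plus vowel into a precomposed johab syllable, but it will never merge two single-letter jamos into the corresponding compound jamo, so the two spellings stay canonically distinct.

```python
import unicodedata

# Two encodings of the same visible syllable 'kka':
single = "\u1100\u1100\u1161"   # KIYEOK + KIYEOK + A (single-letter jamos)
compound = "\u1101\u1161"       # SSANGKIYEOK + A (compound jamo)

# NFC composes the compound-jamo sequence into the precomposed syllable...
print(unicodedata.normalize("NFC", compound))    # '까' (U+AE4C)

# ...but never merges the two single-letter KIYEOKs into SSANGKIYEOK,
# so the two spellings remain canonically distinct even after normalization:
print(unicodedata.normalize("NFC", single) ==
      unicodedata.normalize("NFC", compound))    # False
```

This is exactly the unification gap discussed above: no amount of conformant normalization will bring the legacy single-letter spelling together with the compound-jamo spelling.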
Even once the text seems corrected, there will remain sequences that should have been stored with compound jamos. I can imagine a compressor or converter that preserves canonical equivalence only for Hangul text that already uses compound jamos wherever they exist, keeping it in NFC form. For other texts, the jamos would be recomposed as they should have been; canonical equivalence would not be preserved, but that would make it possible to compress the jamos correctly.
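Such a relaxed converter might look like the following sketch (purely illustrative: the table below covers only the five doubled lead consonants, while a real converter would need the full compound-jamo tables for leads, vowels, and trailers, plus a policy for when rearrangement is permitted):

```python
# Map doubled single-letter lead jamos to their compound forms, as a
# legacy-data converter might do before handing the text to NFC.
# NOTE: this deliberately changes the text to a form that is NOT
# canonically equivalent to the input, which is the whole point above.
COMPOUND_LEADS = {
    "\u1100\u1100": "\u1101",  # KIYEOK + KIYEOK -> SSANGKIYEOK
    "\u1103\u1103": "\u1104",  # TIKEUT + TIKEUT -> SSANGTIKEUT
    "\u1107\u1107": "\u1108",  # PIEUP  + PIEUP  -> SSANGPIEUP
    "\u1109\u1109": "\u110A",  # SIOS   + SIOS   -> SSANGSIOS
    "\u110C\u110C": "\u110D",  # CIEUC  + CIEUC  -> SSANGCIEUC
}

def recompose_leads(text: str) -> str:
    """Replace doubled single-letter lead jamos with compound jamos."""
    for pair, compound in COMPOUND_LEADS.items():
        text = text.replace(pair, compound)
    return text
```

After this pass, NFC can fold the compound lead plus vowel into a precomposed johab syllable, which is what makes the subsequent compression work; the price, as noted, is that the conversion is not a canonical-equivalence-preserving transformation.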

