> If some process using text is sensitive to the *interpretation* of
> the text, i.e. it is concerned about the content and meaning of
> the letters involved, then normalization, to forms NFC or NFD,
> which only involve canonical equivalences, will *not* make a difference.
> Or to be more subtle about it, it might make a difference, but it
> is nonconformant to claim that a process which claims it does not
> make a difference is nonconformant.
>
> If you can parse that last sentence, then you are well on the
> way to understanding the Tao of Unicode.
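To make the quoted point concrete, here is a minimal Python sketch (my own illustration, not from the quoted text) showing that NFC and NFD only exchange canonically equivalent spellings, so a process that cares about the letters themselves sees no difference:

```python
import unicodedata

# The same letter 'é', spelled two canonically equivalent ways:
nfc = "\u00E9"        # one precomposed code point, LATIN SMALL LETTER E WITH ACUTE
nfd = "e\u0301"       # 'e' followed by COMBINING ACUTE ACCENT

print(nfc == nfd)                                 # False: different code point sequences
print(unicodedata.normalize("NFD", nfc) == nfd)   # True: NFD recovers the decomposed form
print(unicodedata.normalize("NFC", nfd) == nfc)   # True: NFC recovers the composed form
```

Both spellings denote the same letter; normalization merely picks one canonical spelling, which is exactly why a conformant, interpretation-sensitive process must treat them identically.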
Still on the same subject, how do the old KS X standards for Hangul compare to each other? Suppose they are upward compatible, and specify that converting an old text not using compound letters to the new standard does not mandate composing them into compound jamos (since the two spellings are considered equivalent there). Then there is a problem when one text is converted directly from the old standard's set of jamos to Unicode, while another is first converted to the new set and then to Unicode: Unicode will not treat the two results as canonically equivalent, even though they would have been considered equivalent under the Korean standards.

So we will find various data containing precomposed jamos for the johab set, and other data not using them. The visible script itself is not altered, but the encoding is different and uses alternate decompositions which are not canonically equivalent in Unicode. Now suppose you have some data coded only with single-letter jamos: there is no way, within Unicode, to unify it with modern data; unifying such strings requires rearranging the jamos. This is an issue for converters, and it remains an issue within Unicode itself, because single-letter jamos are not deprecated and are in fact necessary for modern Hangul (they are not "compatibility characters", and they participate in the composition of johab syllables for the determination of canonical equivalence).

If your compressor or transcoder is not allowed to perform any rearrangement of jamos for modern Hangul, that restriction should be relaxed for legacy data, where jamos should preferably have been precomposed before being converted to Unicode. Such data will persist for a long time, because it is so easy for a Korean writer to insert or delete a single-letter jamo when making corrections (or because of a missing keystroke in the input method used to compose the text initially).
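The asymmetry described above can be demonstrated directly in Python (a small sketch of my own; the choice of SSANGKIYEOK is just one example of a compound jamo): NFC will compose a compound lead jamo plus vowel into a precomposed johab syllable, but it will never merge two single-letter jamos into the corresponding compound jamo, so the two spellings stay canonically distinct.

```python
import unicodedata

# Two encodings of the same visible syllable 'kka':
single = "\u1100\u1100\u1161"   # KIYEOK + KIYEOK + A (single-letter jamos)
compound = "\u1101\u1161"       # SSANGKIYEOK + A (compound jamo)

# NFC composes the compound-jamo sequence into the precomposed syllable...
print(unicodedata.normalize("NFC", compound))    # '까' (U+AE4C)

# ...but never merges the two single-letter KIYEOKs into SSANGKIYEOK,
# so the two spellings remain canonically distinct even after normalization:
print(unicodedata.normalize("NFC", single) ==
      unicodedata.normalize("NFC", compound))    # False
```

This is exactly the unification gap discussed above: no amount of conformant normalization will bring the legacy single-letter spelling together with the compound-jamo spelling.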
Even once the text seems corrected, there will remain sequences that should have been stored with compound jamos. I can imagine a compressor or converter that preserves canonical equivalence only for Hangul text that already uses compound jamos wherever they exist, keeping it in NFC form. For other texts, the jamos would be recomposed as they should have been; canonical equivalence would not be preserved, but that would make it possible to compress the jamos correctly.
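Such a relaxed converter might look like the following sketch (purely illustrative: the table below covers only the five doubled lead consonants, while a real converter would need the full compound-jamo tables for leads, vowels, and trailers, plus a policy for when rearrangement is permitted):

```python
# Map doubled single-letter lead jamos to their compound forms, as a
# legacy-data converter might do before handing the text to NFC.
# NOTE: this deliberately changes the text to a form that is NOT
# canonically equivalent to the input, which is the whole point above.
COMPOUND_LEADS = {
    "\u1100\u1100": "\u1101",  # KIYEOK + KIYEOK -> SSANGKIYEOK
    "\u1103\u1103": "\u1104",  # TIKEUT + TIKEUT -> SSANGTIKEUT
    "\u1107\u1107": "\u1108",  # PIEUP  + PIEUP  -> SSANGPIEUP
    "\u1109\u1109": "\u110A",  # SIOS   + SIOS   -> SSANGSIOS
    "\u110C\u110C": "\u110D",  # CIEUC  + CIEUC  -> SSANGCIEUC
}

def recompose_leads(text: str) -> str:
    """Replace doubled single-letter lead jamos with compound jamos."""
    for pair, compound in COMPOUND_LEADS.items():
        text = text.replace(pair, compound)
    return text
```

After this pass, NFC can fold the compound lead plus vowel into a precomposed johab syllable, which is what makes the subsequent compression work; the price, as noted, is that the conversion is not a canonical-equivalence-preserving transformation.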

