[EMAIL PROTECTED] wrote:
> On 06/02/2002 05:40:05 AM Samphan Raruenrom wrote:
>>>My opinion is that they should have been simplified, but that setting the
>>>bulk of them to 0 was a mistake and creates some significant problems
>>>(which go a step beyond the questions you raise here).
>>Can you elaborate on this?
> Given the characters
>
> : 0E35;THAI CHARACTER SARA II;Mn;0
> : 0E39;THAI CHARACTER SARA UU;Mn;103
>
> consider the sequences
>
> < 0e35, 0e39 > vs. < 0e39, 0e35 >
>
> I'm guessing your first reaction will be to say that these cannot co-occur.

No, not at all :) I have already learned from you to be more open-minded
about this kind of Unicode issue.

> That is true for the Thai language, but may not be true for other languages
> written with Thai script.

I've read a book on the history of Thai characters and found that many
vowels changed position over time. So this issue is more understandable
to me now.

> Now, the problem with the sequences above is that they are visually
> indistinct, meaning that they could not possibly be used by users for a
> semantically-relevant distinction. From the user's perspective, they are
> identical. Moreover, it would fit a user's expectations to have string
> comparisons to equate them (e.g. a search for < 0e35, 0e39 > should find a
> match if the data contains < 0e39, 0e35 >). They are both
> canonically-ordered sequences, however, since U+0E35 has a combining class
> of 0. The result is that string comparisons that rely on normalisation into
> any one of the existing Unicode normalisation forms (NFD, NFC, NFKD, NFKC)
> will fail to consider these as equal.

Let's talk about some things that really happen in Thai.

1)
0E01;THAI CHARACTER KO KAI;Lo;0
0E38;THAI CHARACTER SARA U;Mn;103
0E4D;THAI CHARACTER NIKHAHIT;Mn;0

The sequences (which occur in Pali transcription)

(a) KO KAI + SARA U + NIKHAHIT
(b) KO KAI + NIKHAHIT + SARA U

look the same but do not compare equal. Because the combining class of
NIKHAHIT happens to be 0, neither sequence is reordered, so both are
already in normalized form.

2)
0E32;THAI CHARACTER SARA AA;Lo;0
0E48;THAI CHARACTER MAI EK;Mn;107
0E33;THAI CHARACTER SARA AM;Lo;0;L;<compat> "NIKHAHIT" "SARA AA"

There are two ways to represent the word KO KAI + MAI EK + SARA AM:

(a) KO KAI + MAI EK + SARA AM
(b) KO KAI + NIKHAHIT + MAI EK + SARA AA

(b) must be in this order to get the intended look for the word (not that
this is the valid sequence for Thai/WTT). That is, the mai-ek is on top of
the nikhahit.

The problem is with the NFKD/NFKC of (a), which is

(c) KO KAI + MAI EK + NIKHAHIT + SARA AA

This will be rendered with the nikhahit on top of the mai-ek, which is not
the same as (a) and is not the intended look. So the string changes its
shape after normalization. Is this a violation of any principle?

The problem also comes from the fact that the combining class of NIKHAHIT
is 0, which makes reordering of (c) impossible.

-- 
Samphan Raruenrom
Information Research and Development Division,
National Electronics and Computer Technology Center, Thailand.
http://www.nectec.or.th/home/index.html
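
P.S. For anyone who wants to check the two cases in this message, here is a
minimal sketch using Python's standard unicodedata module. The constant and
helper names (ccc, same_after, case1_a and so on) are my own, chosen only for
illustration. It assumes the Unicode data shipped with the Python build
carries the combining classes quoted in this thread (NIKHAHIT = 0,
SARA U = 103, MAI EK = 107); results could differ on a build with other data.

```python
import unicodedata

# Code points taken from the UnicodeData lines quoted in this thread.
KO_KAI, SARA_AA, SARA_AM = "\u0e01", "\u0e32", "\u0e33"
SARA_U, MAI_EK, NIKHAHIT = "\u0e38", "\u0e48", "\u0e4d"

FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def ccc(ch):
    """Canonical combining class of a single character."""
    return unicodedata.combining(ch)


def same_after(form, s, t):
    """True if s and t are identical after normalizing both to `form`."""
    return unicodedata.normalize(form, s) == unicodedata.normalize(form, t)


def names(s):
    """Readable rendering of a string as ' + '-joined character names."""
    return " + ".join(unicodedata.name(c).replace("THAI CHARACTER ", "") for c in s)


# Case 1: the two Pali spellings differ only in mark order.
case1_a = KO_KAI + SARA_U + NIKHAHIT
case1_b = KO_KAI + NIKHAHIT + SARA_U
case1_equal = {form: same_after(form, case1_a, case1_b) for form in FORMS}

# Case 2: compatibility decomposition of SARA AM after a tone mark.
case2_a = KO_KAI + MAI_EK + SARA_AM
case2_b = KO_KAI + NIKHAHIT + MAI_EK + SARA_AA
case2_c = unicodedata.normalize("NFKD", case2_a)

if __name__ == "__main__":
    for ch in (SARA_U, MAI_EK, NIKHAHIT):
        print(unicodedata.name(ch), "ccc =", ccc(ch))
    print("case 1 equal after normalization?", case1_equal)
    print("case 2 (b):      ", names(case2_b))
    print("case 2 NFKD(a):  ", names(case2_c))
    print("NFKD(a) == (b)?  ", case2_c == case2_b)
```

The comparison is done on code points only; it says nothing about how a given
font or rendering engine actually stacks the marks.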

