On 05/21/2002 10:07:32 AM Samphan Raruenrom wrote:

>I have something to consult with you about the properties of Thai
>characters in Unicode...
>The (below-attached) tone marks "MAI EK, THO, TRI, CHATTAWA" have combining
>class 107

That's "above-attached", of course (simply a typo).

>My first question is :-
>Why the above-attached vowel signs/marks all have combining class 0?

I'm not positive on the history, but here's my take: As you mention, there is a sequencing constraint in WTT. In an earlier version of the Unicode standard (prior to 2.1) all of the Thai characters of category Mn had fixed-position classes. I'm guessing that that was influenced by a notion of there needing to be a specific order, as in WTT. It didn't really accomplish anything to have all the different fixed-position classes, though. If anything, it created some complications, which I won't elaborate on.

At any rate, between 2.0 and 3.0, a lot of fixed-position classes, both for Thai and for other scripts, were simplified. In so doing, many were set to 0. My opinion is that they should have been simplified, but that setting the bulk of them to 0 was a mistake and creates some significant problems (which go a step beyond the questions you raise here). I think they should have been simplified in line with the final suggestion you make: have those that interact typographically share the same class. (I'd say the same of many other combining marks in a number of other scripts.)

>This inhibits them from participating in normalizations, right?

Well, it's not clear what you mean by that. Having them set to combining class 0 means that they do not re-order when performing canonical ordering, and so they are already in canonical order, hence in normal form (except that in NFKD and NFKC there is the compatibility decomposition of sara am).
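
For anyone who wants to check this behaviour, here is a minimal sketch in Python using the standard unicodedata module. It is not part of the original exchange; the helper name combining_classes is only for illustration, and the results depend on the Unicode data bundled with your Python build. It looks up the combining classes of the marks under discussion and shows that a class-0 vowel is left where it is by canonical ordering, while sara am changes only under the compatibility forms.

```python
import unicodedata

# Thai characters discussed in this thread.
KO_KAI = "\u0e01"   # THAI CHARACTER KO KAI (base consonant)
SARA_AM = "\u0e33"  # THAI CHARACTER SARA AM
SARA_II = "\u0e35"  # THAI CHARACTER SARA II (above vowel)
SARA_UU = "\u0e39"  # THAI CHARACTER SARA UU (below vowel)
MAI_EK = "\u0e48"   # THAI CHARACTER MAI EK (tone mark)


def combining_classes(text):
    """Return the canonical combining class of each code point in text."""
    return [unicodedata.combining(ch) for ch in text]


def nfd(text):
    """Canonical decomposition followed by canonical ordering."""
    return unicodedata.normalize("NFD", text)


if __name__ == "__main__":
    for name, ch in [("SARA II", SARA_II), ("SARA UU", SARA_UU), ("MAI EK", MAI_EK)]:
        print(name, unicodedata.combining(ch))

    # Tone mark before a below vowel: both classes are non-zero,
    # so canonical ordering moves the vowel in front of the tone mark.
    tone_then_below = KO_KAI + MAI_EK + SARA_UU
    print(nfd(tone_then_below) == KO_KAI + SARA_UU + MAI_EK)

    # Tone mark before an above vowel: the vowel has class 0,
    # so nothing is reordered and the string is already in NFD.
    tone_then_above = KO_KAI + MAI_EK + SARA_II
    print(nfd(tone_then_above) == tone_then_above)

    # SARA AM is untouched by NFD but decomposes under NFKD.
    print(nfd(SARA_AM) == SARA_AM)
    print([hex(ord(c)) for c in unicodedata.normalize("NFKD", SARA_AM)])
```

The same checks can be repeated with "NFC" or "NFKC" in place of the decomposed forms, since no canonical composition applies to these marks.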

>Examples :-
>The sequences (both of which should look the same on non-WTT shaping engine) :-
>(1) KO KAI + SARA UU + MAI EK -> กู่ -> combining class = 0, 103, 107
>(2) KO KAI + MAI EK + SARA UU -> กู่ -> combining class = 0, 107, 103
>
>While Unicode doesn't have the notion of invalid sequence, Thai has one,
>defined by a national standard (WTT) to be (approximately) :
>CONSONANT + (above or below) VOWEL SIGN + TONE MARK or THANTHAKHAT
>
>The same concept occurs in, for example, Devanagari...

It's important to understand two things:

i) Just because a rule applies to the encoding of Devanagari in Unicode, that does not mean the rule necessarily applies to any other script in Unicode.

ii) Just because a rule applies to the encoding of Thai in a legacy encoding standard, that does not mean the rule necessarily applies to the encoding of Thai script in Unicode.

In spite of any sequencing constraints on Devanagari in Unicode or on Thai in WTT, the two Unicode character sequences that you cited above are both valid representations of the same thing. More precisely, they are by definition canonically equivalent, and they have the same normalised representations. Either can occur in data, and they should be rendered identically, and in general processes should treat them as indistinguishable. (That's slightly strong, since there are special situations, e.g. in normalising, when a process should distinguish them. The relevant conformance requirement is that no conformant process can assume they are distinct.)

>So (correct me if I'm wrong) the notion of invalid sequence in Unicode is
>script-specific.

Yes, but be careful of misinterpreting combining classes as saying anything about what is or isn't a valid sequence -- they say absolutely nothing in that regard.

>And it is (is it?) intended that the normalized sequences should (as much as
>possible?)
>be correct for the particular scripts; otherwise, the normalized text will be
>rendered differently from the un-normalized text (do they have to?).

Your sentence has too many alternative readings for me to know how to answer. Let me respond in reference to what I commented on above: the two example sequences you gave are canonically equivalent, and should be rendered the same. The first is in canonical order (hence in normal form for any of NFC, NFD, NFKC, NFKD), while the second is not, but that is not really relevant with regard to their rendering: both should be presented the same way. It is *not* true that normalised text will necessarily be rendered differently from non-normalised text.

>This works for the above sequences, both (1) and (2) normalized to (1).
>But for the following sequences :-
>(3) KO KAI + SARA II + MAI EK -> กี่ -> combining class = 0, 0, 107
>(4) KO KAI + MAI EK + SARA II -> ก่ี -> combining class = 0, 107, 0
>
>They should both be normalized to (3) but not, because class 0 does not
>participate in reordering (they are both normalized).

I agree that no reordering occurs in canonical ordering because sara ii has a class of 0, but I disagree that they *should* have the same normalised representation. It seems to me you are making that assumption because you are applying the lens of WTT, which is biased specifically in relation to one particular language: Standard Thai. The script can be, and is, used for writing other languages, and in principle another language may have different requirements for combining mark combinations.

I personally think that mai ek and sara ii should have the *same* combining class. But that's immaterial at this point since the fact is that they do not, nor is UTC willing to change them so that they have the same combining class.

>It's possible to correct this by assigning
>above-attached vowel signs (i.e. SARA II) with combining class more than 0.

I'm assuming you mean to assign sara ii a combining class > 0 and <> 107. I think that would be the wrong thing to do. But that's also immaterial, since at this point the stability requirements prohibit the combining class of sara ii from being changed at all.

>Or, according to the Unicode (and Thai) convention that order below marks
>before above marks, the combining class of above vowels should be more than
>103 (below vowels) and less than 107 (tone marks, which always above-attached).

Neither a good idea, I think, nor possible.

>Or if it's intended that the above vowel and tone mark should be stacked
>according to the Unicode default inside-out rule, both should have the same
>combining class 107 to let them interact typographically.

That is exactly what I think *should* have been done. If I had my way, we'd change it to that. But UTC will not make such a change at this point due to a commitment not to alter normalised representations from version 3.0. We are stuck with the vowels that position above having combining classes of 0, for better or worse.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>

