Re: Combining class for Thai characters

Peter_Constable Thu, 23 May 2002 00:50:45 -0700

On 05/21/2002 10:07:32 AM Samphan Raruenrom wrote:

>I have something to consult with you about the properties of Thai
>characters in Unicode...


>The (below-attached) tone marks "MAI EK, THO, TRI, CHATTAWA" have
>combining class 107

That's "above-attached", of course (simply a typo).


>My first question is :-
>Why the above-attached vowel signs/marks all have combining class 0?

I'm not positive on the history, but here's my take: As you mention, there 
is a sequencing constraint in WTT. In an earlier version of the Unicode 
standard (prior to 2.1) all of the Thai characters of category Mn had 
fixed-position classes. I'm guessing that that was influenced by a notion 
of there needing to be a specific order, as in WTT. It didn't really 
accomplish anything to have all the different fixed position classes, 
though. If anything, it created some complications, which I won't 
elaborate on. At any rate, between 2.0 and 3.0, a lot of fixed-position 
classes, both for Thai and for other scripts, were simplified. In so 
doing, many were set to 0.

My opinion is that they should have been simplified, but that setting the 
bulk of them to 0 was a mistake and creates some significant problems 
(which go a step beyond the questions you raise here). I think they should 
have been simplified in line with the final suggestion you make: have 
those that interact typographically have the same class. (I'd say the 
same of many other combining marks in a number of other scripts.)


>This inhibits them from participating in normalizations, right?

Well, it's not clear what you mean by that. Having them set to combining 
class 0 means that they do not re-order when performing canonical 
ordering, and so they are already in canonical order, hence in normal form 
(except that in NFKD and NFKC there is the compatibility decomposition of 
sara am).
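
A minimal sketch of the point above, assuming Python's standard unicodedata module (whose tables follow whatever Unicode version the interpreter ships with). It looks up the combining classes of the marks under discussion and shows that sara am is untouched by the canonical forms but is split by the compatibility forms. The variable names are illustrative only.

```python
import unicodedata

# Combining marks discussed in this thread, by Unicode name.
marks = {
    "SARA II": "\u0e35",   # vowel sign attached above
    "SARA UU": "\u0e39",   # vowel sign attached below
    "MAI EK": "\u0e48",    # tone mark attached above
}

# Canonical combining class of each mark.
classes = {name: unicodedata.combining(ch) for name, ch in marks.items()}
print(classes)

# SARA AM has only a compatibility decomposition (NIKHAHIT + SARA AA),
# so NFD leaves it alone while NFKD splits it.
sara_am = "\u0e33"
nfd = unicodedata.normalize("NFD", sara_am)
nfkd = unicodedata.normalize("NFKD", sara_am)
print([f"U+{ord(c):04X}" for c in nfd])
print([f"U+{ord(c):04X}" for c in nfkd])
```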


>Examples :-
>The sequences (both of which should look the same on non-WTT shaping
>engine) :-
>(1) KO KAI + SARA UU + MAI EK          -> กู่ -> combining class = 0, 103, 107
>(2) KO KAI + MAI EK + SARA UU          -> กู่ -> combining class = 0, 107, 103
>
>While Unicode doesn't have the notion of invalid sequence, Thai has one, 
>defined by a
>national standard (WTT) to be (approximately) :
>CONSONANT + (above or below) VOWEL SIGN + TONE MARK or THANTHAKHAT
>
>The same concept occurs in, for example, Devanagari...

It's important to understand two things:

i) Just because a rule applies to the encoding of Devanagari in Unicode, 
that does not mean the rule therefore necessarily applies to any other 
script in Unicode.

ii) Just because a rule applies to the encoding of Thai in a legacy 
encoding standard, that does not mean the rule therefore necessarily 
applies to the encoding of Thai script in Unicode.

In spite of any sequencing constraints on Devanagari in Unicode or on Thai 
in WTT, the two Unicode character sequences that you cited above are both 
valid representations of the same thing. More precisely, they are by 
definition canonically equivalent, and they have the same normalised 
representations. Either can occur in data, and they should be rendered 
identically, and in general processes should treat them as 
indistinguishable. (That's slightly strong, since there are special 
situations, e.g. in normalising, when a process should distinguish them. 
The relevant conformance requirement is that no conformant process can 
assume they are distinct.)
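
The equivalence of sequences (1) and (2) can be checked with a short sketch, again assuming Python's unicodedata module. The two strings differ code point by code point, yet all four normalisation forms map them to the same result, namely sequence (1).

```python
import unicodedata

KO_KAI, SARA_UU, MAI_EK = "\u0e01", "\u0e39", "\u0e48"
seq1 = KO_KAI + SARA_UU + MAI_EK   # sequence (1): classes 0, 103, 107
seq2 = KO_KAI + MAI_EK + SARA_UU   # sequence (2): classes 0, 107, 103

forms = ("NFC", "NFD", "NFKC", "NFKD")

# For each form, do the two sequences normalise to the same string?
same = {
    form: unicodedata.normalize(form, seq1) == unicodedata.normalize(form, seq2)
    for form in forms
}
print(seq1 == seq2)   # raw comparison of the code point sequences
print(same)
```

A process that compares Thai text should therefore compare normalised strings rather than raw code point sequences.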



>So (correct me if I'm wrong) the notion of invalid sequence in Unicode is
>script-specific.

Yes, but be careful of misinterpreting combining classes as saying 
anything about what is or isn't a valid sequence -- they say absolutely 
nothing in that regard.


>And it is (is it?) intended that the normalized sequences should (as much
>as possible?) be correct for the particular scripts; otherwise, the
>normalized text will be rendered differently from the un-normalized text
>(do they have to?).

You've got too many alternative readings in your sentence for me to know 
how to answer. Let me respond in reference to what I commented on above: the two 
example sequences you gave are canonically equivalent, and should be 
rendered the same. The first is in canonical order (hence in normal form 
for any of NFC, NFD, NFKC, NFKD), while the second is not, but that is not 
really relevant with regard to their rendering: both should be presented 
the same way. It is *not* true that normalised text will necessarily be 
rendered differently from non-normalised text.



>This works for the above sequences, both (1) and (2) normalized to (1).
>But for the following sequences :-
>(3) KO KAI + SARA II + MAI EK          -> กี่ -> combining class = 0, 0, 107
>(4) KO KAI + MAI EK + SARA II          -> ก่ี -> combining class = 0, 107, 0
>
>They should both be normalized to (3) but not, because class 0 does not 
>participate in reordering (they are both normalized). 

I agree that no reordering occurs in canonical ordering because sara ii 
has a class of 0, but I disagree that they *should* have the same 
normalised representation. It seems to me you are making that assumption 
because you are applying the lens of WTT, which is biased specifically in 
relation to one particular language: Standard Thai. The script can be, and 
is, used for writing other languages, and in principle another language 
may have different requirements for combining mark combinations. I 
personally think that mai ek and sara ii should have the *same* combining 
class. But that's immaterial at this point since the fact is that they do 
not, nor is UTC willing to change them so that they have the same 
combining class. 
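
For comparison with sequences (1) and (2), here is the same kind of check for sequences (3) and (4), sketched with Python's unicodedata module. Because sara ii has class 0, neither sequence is reordered, so each is already in normal form and the two stay distinct.

```python
import unicodedata

KO_KAI, SARA_II, MAI_EK = "\u0e01", "\u0e35", "\u0e48"
seq3 = KO_KAI + SARA_II + MAI_EK   # sequence (3): classes 0, 0, 107
seq4 = KO_KAI + MAI_EK + SARA_II   # sequence (4): classes 0, 107, 0

results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    n3 = unicodedata.normalize(form, seq3)
    n4 = unicodedata.normalize(form, seq4)
    # (is seq3 unchanged?, is seq4 unchanged?, do they end up equal?)
    results[form] = (n3 == seq3, n4 == seq4, n3 == n4)
    print(form, results[form])
```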


>It's possible to correct this by assigning above-attached vowel signs
>(i.e. SARA II) with combining class more than 0.

I'm assuming you mean to assign sara ii with a combining class > 0 and <> 
107. I think that would be the wrong thing to do. But, that's also 
immaterial since at this point, the stability requirements prohibit the 
combining class of sara ii from being changed at all.


>Or, according to the Unicode (and Thai) convention that order below marks
>before above marks, the combining class of above vowels should be more
>than 103 (below vowels) and less than 107 (tone marks, which always
>above-attached).

Neither a good idea, I think, nor possible.


>Or if it's intended that the above vowel and tone mark should be stacked
>according to the Unicode default inside-out rule, both should have the
>same combining class 107 to let them interact typographically.

That is exactly what I think *should* have been done. If I had my way, 
we'd change it to that. But UTC will not make such a change at this point 
due to a commitment not to alter normalised representations from version 
3.0. We are stuck with the vowels that position above having combining 
classes of 0, for better or worse.
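
To make the alternatives discussed above concrete, here is a rough sketch of the reordering step of canonical ordering with a pluggable class table. It is an illustration under stated assumptions, not how any real implementation is written: canonical_order and with_override are hypothetical helpers, it performs no decomposition, and the overridden values (107 and 105 for sara ii) are hypothetical assignments that Unicode does not have and, given the stability commitment, will not get.

```python
import unicodedata

KO_KAI, SARA_II, MAI_EK = "\u0e01", "\u0e35", "\u0e48"
seq3 = KO_KAI + SARA_II + MAI_EK   # sequence (3)
seq4 = KO_KAI + MAI_EK + SARA_II   # sequence (4)

def canonical_order(text, ccc=unicodedata.combining):
    """Reorder only: stable-sort each run of marks with class > 0 by class."""
    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            out.extend(sorted(run, key=ccc))   # sorted() is stable
            run = []
            out.append(ch)                     # class 0 blocks reordering
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)

def with_override(overrides):
    """Hypothetical class table: real values except for the given overrides."""
    return lambda ch: overrides.get(ch, unicodedata.combining(ch))

tables = {
    "actual": unicodedata.combining,
    "same class as tone marks": with_override({SARA_II: 107}),  # hypothetical
    "between 103 and 107": with_override({SARA_II: 105}),       # hypothetical
}

# Do sequences (3) and (4) collapse to one string under each table?
merged = {
    label: canonical_order(seq3, ccc) == canonical_order(seq4, ccc)
    for label, ccc in tables.items()
}
print(merged)
```

Under the actual classes and under the hypothetical shared class of 107, the two sequences stay distinct, so the order in which the marks were typed is preserved. Only the hypothetical "between" assignment folds sequence (4) into sequence (3).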


- Peter


---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>
