On Mon, 06 Jan 2003 01:46:44 -0800 (PST), "Robert R. Chilton" wrote:
> Moreover, for the authors of n2558 to argue that a non-combining model
> of Tibetan is necessary for compatibility with "traditional education,
> publication and electronic desktop publishing systems" is to entirely
> discount the use of other complex scripts --such as the Indic scripts
> which employ a combining model-- in such "systems". Clearly, the
> direction of such a rationale runs entirely opposite to the basic
> principles of Unicode/ISO-10646.

Exactly. And as the underlying encoding should be opaque to the end
user, it should make no difference to someone entering Tibetan text
into an electronic desktop publishing system whether the system is
encoding the syllable "rgya" as one character or three.

> Such cases of triple (or quadruple) vowels E or O are best normalized to
> double vowel plus single (or double) vowel to aid in collation and other
> character data processing functions. Thus, Glyph 107 is best encoded as
> (or normalized to) <U+0F41, U+0FB1, U+0F7B, U+0F7A>.

My rationale for not normalising to double vowel plus single (or
double) vowel is that a double vowel sign used to indicate a shorthand
abbreviation is fundamentally different from a double vowel used to
represent a long vowel. For instance, when the phrase "ki ki swo swo"
is abbreviated to "Ka + double I" and "Swa + double O", the double I
and double O vowels represent the contraction of two I syllables and
two O syllables respectively, and not a long I and a long O vowel. As
there is no character for a double I vowel sign, the double I vowel
must needs be encoded as two consecutive I vowels. Although there is a
double O vowel sign (U+0F7D), I think that encoding the double O in the
same manner as the double I, as two consecutive O vowels, would be more
consistent than encoding it with the graphically identical but
semantically different double O vowel sign.
By encoding it as two consecutive O vowels, one makes an explicit
statement that this is a shorthand abbreviation and not simply a long O.

As to shorthand abbreviations with three or four identical vowel signs,
what is the advantage of normalising to "vowel + double vowel" or
"double vowel + double vowel" other than saving a few bytes? I don't
see how this would aid collation or other character data processing
functions. Given that KHYA + triple E could legitimately be encoded as
<U+0F41, U+0FB1, U+0F7B, U+0F7A>, <U+0F41, U+0FB1, U+0F7A, U+0F7B> or
<U+0F41, U+0FB1, U+0F7A, U+0F7A, U+0F7A>, a good Tibetan font would
have to map all three sequences to the same glyph. And from a collation
point of view, why is any one of these sequences more helpful than
another? All three sequences would be collated after <U+0F41, U+0FB1,
U+0F7A>. Admittedly only <U+0F41, U+0FB1, U+0F7B, U+0F7A> might be
collated after <U+0F41, U+0FB1, U+0F7B>, but then as KHYEEE probably
represents an abbreviation for KHYE KHYE KHYE, should it not be
collated after KHYE rather than KHYEE?

In short, I believe that it is useful to encode shorthand abbreviations
as a sequence of individual vowels so as to distinguish them from
graphically identical long vowel syllables, and to make explicit their
function as shorthand abbreviations. Nevertheless, I'm not terribly
fussed about this, and am happy to follow the consensus of opinion.

> Assuming that there have been no changes in the combining classes of
> these characters since Unicode 3.0, the 2 characters <U+0F88> and
> <U+0F89> are spacing, non-combining characters. Therefore, the only
> possible encoding that will place the "base consonant" under these signs
> (i.e., will result in these signs being "superfixed" to the letters KA,
> KHA, PA, PHA, et al.) is for these characters to appear in the data
> stream just prior to the "base consonant", such base consonant being
> encoded in subjoined position.
> [It is not really correct to say that
> "The Unicode Standard does not explicitly specify the coding sequence
> for letters that are combined with any of the transliteration characters
> U+0F88 through U+0F8B" since the combining class of the characters is
> determinative.]
>
> Thus, to encode Glyphs 029 and 100 use <U+0F88, U+0F90> and <U+0F88,
> U+0F91>, respectively. Likewise, to encode Glyphs 435 and 486 use
> <U+0F89, U+0FA4> and <U+0F89, U+0FA5>, respectively.

Thanks for the explanation. I'm afraid my understanding of combining
characters is rather hazy. I was mistakenly assuming that U+0F88 and
U+0F89 were combining characters, and was therefore encoding them after
the base consonant, in the same way that u-umlaut with a combining
diaeresis is encoded as <U+0075, U+0308>. I actually came up with the
sequence <U+0F88, U+0F90> on my first attempt to encode Glyph 29, but I
decided it must be wrong, as I thought that a stack ought to have a
base consonant to be valid. If what you are suggesting is that the
characters U+0F88 through U+0F8B can behave as base consonants, then I
guess I was right the first time. (Looking back at the Unicode
Standard, I notice it states that a stack contains "at most one base
consonant" and "any number of subjoined consonants", so a stack with no
base consonant would be valid.)

> Note that these
> latter two glyphs are *NOT* a case of superfixed TIBETAN MARK PALUTA but
> rather a case of superfixed TIBETAN SIGN MCHU CAN. The PALUTA has a
> different function (of transliterating the Sanskrit apostrophe in
> Tibetan script) and is not found in superfixed position. [Note also
> that a naive reader might mistake the TIBETAN SIGN MCHU CAN for a
> superfixed NYA, just as one might confuse the NYA and the PALUTA.]

Thanks for the correction. I'm afraid I've never seen a Paluta in
action, and naively assumed that this was what the superjoined sign
was. Nor, I'm afraid, am I familiar with the signs at U+0F88 through
U+0F8B.
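The point about combining classes can be checked mechanically. What
follows is only a minimal Python sketch, not anything taken from the
Unicode Standard or from n2558: it assumes the standard-library
unicodedata module, which reports the character database bundled with
the interpreter rather than Unicode 3.0 specifically, and the helper
name describe() is made up for illustration.

```python
import unicodedata

def describe(cp):
    """Return (code point, name, general category, combining class)."""
    ch = chr(cp)
    return (
        f"U+{cp:04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),   # e.g. "Lo" = letter, "Mn" = nonspacing mark
        unicodedata.combining(ch),  # canonical combining class, 0 = not reordered
    )

# U+0308 is the combining diaeresis used in <U+0075, U+0308>;
# U+0F88 and U+0F89 are the signs discussed above;
# U+0F90 and U+0FA4 are the subjoined KA and PA that follow them.
for cp in (0x0308, 0x0F88, 0x0F89, 0x0F90, 0x0FA4):
    print(describe(cp))
```

If U+0F88 and U+0F89 are reported with a letter category and combining
class 0, while U+0308 is reported as a mark with a non-zero class, that
matches the explanation quoted above: the signs come first in the data
stream, followed by the consonant in its subjoined form.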
> Though I confess that I am not familiar with these orthographies, the
> glyphs cited are cases of TIBETAN MARK TSA -PHRU [U+0F39] being affixed
> to letters ZHA, ZA, and -A, respectively. They would be encoded as
> <U+0F5E, U+0F39>, <U+0F5F, U+0F39> and <U+0F60, U+0F39>.

I did wonder whether the mark was a TSA -PHRU, but in the document it
looks dot-like rather than flag-like - perhaps at higher resolution it
would be clearer. However, I still wonder what the TSA -PHRU signifies
when added to these letters.

> I hope this is useful.

Very useful indeed. I'll update my web pages to reflect your comments
as soon as possible.

Andrew
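For reference, the three candidate encodings of KHYA + triple E
mentioned above can be compared in a few lines of Python. This is a
minimal sketch under stated assumptions: it uses the standard-library
unicodedata module, and expand_double_e() is a hypothetical
matching-only fold invented for this illustration, not a Unicode
normalization form and not something proposed in the thread.

```python
import unicodedata

E, EE = "\u0F7A", "\u0F7B"   # TIBETAN VOWEL SIGN E and VOWEL SIGN EE
KHYA = "\u0F41\u0FB1"        # KHA followed by subjoined YA

# The three sequences listed in the message for KHYA + triple E.
candidates = [KHYA + EE + E, KHYA + E + EE, KHYA + E + E + E]

def nfc(text):
    """Standard Unicode normalization form C."""
    return unicodedata.normalize("NFC", text)

def expand_double_e(text):
    """Hypothetical fold: rewrite each EE sign as two E signs."""
    return text.replace(EE, E + E)

def distinct_forms(strings, transform):
    """Count how many different strings remain after applying transform."""
    return len({transform(s) for s in strings})

print(distinct_forms(candidates, nfc))              # sequences left distinct by NFC
print(distinct_forms(candidates, expand_double_e))  # sequences left distinct by the fold
```

Standard normalization leaves the three sequences distinct, so any font
or search routine that wants to treat them alike has to do so itself.
Note that a fold like this one would also erase the distinction between
a long vowel and a shorthand abbreviation, which is the distinction
argued for above, so it would suit matching at most, not storage.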

