Re: PRC asking for 956 precomposed Tibetan characters

Andrew C. West Mon, 06 Jan 2003 04:37:49 -0800

On Mon, 06 Jan 2003 01:46:44 -0800 (PST), "Robert R. Chilton" wrote:


> Moreover, for the authors of n2558 to argue that a non-combining model
> of Tibetan is necessary for compatibility with "traditional education,
> publication and electronic desktop publishing systems" is to entirely
> discount the use of other complex scripts --such as the Indic scripts
> which employ a combining model-- in such "systems".  Clearly, the
> direction of such a rationale runs entirely opposite to the basic
> principles of Unicode/ISO-10646.
> 

Exactly. And as the underlying encoding should be opaque to the end user, it
should make no difference to someone entering Tibetan text into an electronic
desktop publishing system whether the system is encoding the syllable "rgya" as
one character or three.
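
To make the "one character or three" point concrete, here is a minimal Python sketch. It assumes the usual combining-model spelling of "rgya" as RA + subjoined GA + subjoined YA, and uses only the standard unicodedata module; the helper name describe is purely illustrative. The user sees a single stack, while the stored text is three code points.

```python
import unicodedata

# "rgya" in the combining model (assumed spelling):
# base RA, then subjoined GA, then subjoined YA.
RGYA = "\u0f62\u0f92\u0fb1"


def describe(text):
    """Return (code point, character name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# One visual stack, three encoded characters.
for code_point, name in describe(RGYA):
    print(code_point, name)
```

Whether the system stores this as three characters or as one precomposed character, the person typing it need not know or care.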

> Such cases of triple (or quadruple) vowels E or O are best normalized to
> double vowel plus single (or double) vowel to aid in collation and other
> character data processing functions.  Thus, Glyph 107 is best encoded as
> (or normalized to) <U+0F41, U+0FB1, U+0F7B, U+0F7A>.
> 

My rationale for not normalising to double vowel plus single (or double) vowel
is that a double vowel sign used to indicate a shorthand abbreviation is
fundamentally different from a double vowel used to represent a long vowel. For
instance, when the phrase "ki ki swo swo" is abbreviated to "Ka + double I" and
"Swa + double O" the double I and double O vowels represent the contraction of
two I syllables and O syllables respectively, and not a long I and long O vowel
respectively. As there is no character for a double I vowel sign, then the
double I vowel must needs be encoded as two consecutive I vowels. Although there
is a double O vowel sign (U+0F7D), I think that encoding it in the same manner
as the double I, as two consecutive O vowels, would be more consistent than
encoding it with the graphically identical but semantically different double O
vowel. Encoding it as two consecutive O vowels makes an explicit statement
that this is a shorthand abbreviation and not simply a long O.

As to shorthand abbreviations with three or four identical vowel signs, what is
the advantage of normalising to "vowel + double vowel" or "double vowel + double
vowel" other than saving a few bytes? I don't see how this would aid collation
or other character data processing functions. Given that KHYA + triple E could
legitimately be encoded as <U+0F41, U+0FB1, U+0F7B, U+0F7A>, <U+0F41, U+0FB1,
U+0F7A, U+0F7B> or <U+0F41, U+0FB1, U+0F7A, U+0F7A, U+0F7A>, a good Tibetan font
would have to map all three sequences to the same glyph. And from a collation
point of view, why is any one of these sequences more helpful than another? All
three sequences would be collated after <U+0F41, U+0FB1, U+0F7A>. Admittedly
only <U+0F41, U+0FB1, U+0F7B, U+0F7A> might be collated after <U+0F41, U+0FB1,
U+0F7B>, but then as KHYEEE probably represents an abbreviation for KHYE KHYE
KHYE, should it not be collated after KHYE rather than KHYEE?

In short, I believe that it is useful to encode shorthand abbreviations as a
sequence of individual vowels so as to distinguish them from graphically
identical long vowel syllables, and to make explicit their function as shorthand
abbreviations.

Nevertheless, I'm not terribly fussed about this, and am happy to follow the
consensus of opinion.
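
For anyone who wants to check the three KHYA + triple E spellings mechanically, the following Python sketch may help. It rests on two assumptions of mine: plain code point comparison stands in for collation (a real Tibetan collation would be more involved), and expand_double_vowels is a hypothetical helper, not an equivalence defined by Unicode. It uses only the standard unicodedata module.

```python
import unicodedata

KHYA = "\u0f41\u0fb1"        # KHA + subjoined YA
E, EE = "\u0f7a", "\u0f7b"   # vowel sign E, vowel sign EE (double E)
O, OO = "\u0f7c", "\u0f7d"   # vowel sign O, vowel sign OO (double O)

# The three candidate spellings of KHYA + triple E discussed above.
CANDIDATES = [KHYA + EE + E, KHYA + E + EE, KHYA + E + E + E]


def expand_double_vowels(text):
    """Hypothetical helper: spell EE and OO as two single vowel signs."""
    return text.replace(EE, E + E).replace(OO, O + O)


# Standard normalization does not unify the three spellings ...
nfc_forms = {unicodedata.normalize("NFC", s) for s in CANDIDATES}
nfd_forms = {unicodedata.normalize("NFD", s) for s in CANDIDATES}
# ... whereas spelling every double vowel as two singles does.
expanded_forms = {expand_double_vowels(s) for s in CANDIDATES}

# Naive code point ordering as a stand-in for collation (an assumption).
after_khye = [s > KHYA + E for s in CANDIDATES]
after_khyee = [s > KHYA + EE for s in CANDIDATES]

print(len(nfc_forms), len(nfd_forms), len(expanded_forms))
print(after_khye, after_khyee)
```

Under these assumptions all three spellings sort after KHYE, only the first sorts after KHYEE, and normalization leaves them as three distinct strings, so a font or application has to treat them as equivalent by its own rules.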

> Assuming that there have been no changes in the combining classes of
> these characters since Unicode 3.0, the 2 characters <U+0F88> and
> <U+0F89> are spacing, non-combining characters.  Therefore, the only
> possible encoding that will place the "base consonant" under these signs
> (i.e., will result in these signs being "superfixed" to the letters KA,
> KHA, PA, PHA, et al.) is for these characters to appear in the data
> stream just prior to the "base consonant", such base consonant being
> encoded in subjoined position.  [It is not really correct to say that
> "The Unicode Standard does not explicitly specify the coding sequence
> for letters that are combined with any of the transliteration characters
> U+0F88 through U+0F8B" since the combining class of the characters is
> determinative.]
> Thus, to encode Glyphs 029 and 100 use <U+0F88, U+0F90> and <U+0F88,
> U+0F91>, respectively.  Likewise, to encode Glyphs 435 and 486 use
> <U+0F89, U+0FA4> and <U+0F89, U+0FA5>, respectively. 

Thanks for the explanation. I'm afraid my understanding of combining characters
is rather hazy. I was mistakenly assuming that U+0F88 and U+0F89 were combining
characters, and therefore encoding them after the base consonant in the same way
that combining u-umlaut is encoded as <U+0075, U+0308>.

I actually came up with the sequence <U+0F88, U+0F90> on my first attempt to
encode Glyph 29, but I decided it must be wrong as I thought that a stack ought
to have a base consonant to be valid. If what you are suggesting is that the
characters U+0F88 through U+0F8B can behave as base consonants, then I guess I
was right the first time. (Looking back at the Unicode Standard, I notice it
states that a stack contains "at most one base consonant" and "any number of
subjoined consonants", so a stack with no base consonant would be valid).
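
The difference between the two cases can be inspected with a short Python sketch using the standard unicodedata module. Note the assumption: the values come from whatever Unicode version the Python build bundles, which is later than Unicode 3.0; the helper name properties is illustrative only.

```python
import unicodedata


def properties(ch):
    """Return (name, general category, canonical combining class) for ch."""
    return unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch)


# A true combining mark follows its base: u + COMBINING DIAERESIS.
U_UMLAUT = "\u0075\u0308"

# The suggested encoding for Glyph 029: the sign first, then subjoined KA.
GLYPH_029 = "\u0f88\u0f90"

for ch in U_UMLAUT + GLYPH_029 + "\u0f89":
    print(f"U+{ord(ch):04X}", *properties(ch))
```

In the data I have seen, U+0308 is a nonspacing mark with a non-zero combining class, whereas U+0F88 and U+0F89 are reported as letters with combining class 0, which fits the suggestion that they go before the subjoined consonant rather than after a base.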

> Note that these
> latter two glyphs are *NOT* a case of superfixed TIBETAN MARK PALUTA but
> rather a case of superfixed TIBETAN SIGN MCHU CAN.  The PALUTA has a
> different function (of transliterating the Sanskrit apostrophe in
> Tibetan script) and is not found in superfixed position.  [Note also
> that a naive reader might mistake the TIBETAN SIGN MCHU CAN for a
> superfixed NYA, just as one might confuse the NYA and the PALUTA.]
> 

Thanks for the correction. I'm afraid I've never seen a Paluta in action, and
naively assumed that this was what the superjoined sign was. Nor, I'm afraid, am I
familiar with the signs at U+0F88 through U+0F8B.

> Though I confess that I am not familiar with these orthographies, the
> glyphs cited are cases of TIBETAN MARK TSA -PHRU [U+0F39] being affixed
> to letters ZHA, ZA, and -A, respectively.  They would be encoded as
> <U+0F5E, U+0F39>, <U+0F5F, U+0F39> and <U+0F60, U+0F39>.
> 

I did wonder whether the mark was a TSA -PHRU, but in the document it looks
dot-like rather than flag-like - perhaps at higher resolution it would be
clearer. However, I still wonder what the TSA -PHRU signifies when added to
these letters.

> I hope this is useful.

Very useful indeed. I'll update my web pages to reflect your comments as soon as
possible.

Andrew
