Hi, thanks for your reply.

> There is only one character with a canonical combining class of 9 that
> is included as other_alphabetic, namely U+0E3A THAI CHARACTER PHINTHU.
> That last had any of the other properties of viramas back in Unicode
> 1.0; the characters that triggered such behaviours were permanently
> removed in Unicode 1.1.

I didn't understand the second sentence here. Could you clarify what
you mean by "any of the other properties"? And "triggered such
behaviours" seems to imply that having them in Other_Alphabetic had
negative consequences. Could you give an example of what that might
be?

> There are some notable absences from the combining marks included.
> Significant absences include ZWJ, ZWNJ and CGJ.
> However, a non-erroneous *conformant* Unicode process cannot
> always determine whether a string, given only that it is a string, is
> composed only of alphabetic characters.  The answer would be 'yes' for
> <U+00E7 LATIN SMALL LETTER C WITH CEDILLA> but 'no' for the canonically
> (U+0327 is not included as alphabetic either.)
> There is at least one combination of Latin letter and combining mark
> that occurs in the normal orthography of a natural language and does not
> have a precomposed equivalent.

Ah, it's somewhat unfortunate that such a quick and easy alphabetic
check is not possible in the general case, but I can understand how it
might be weird to give the Alphabetic property to a ZWJ or ZWNJ.

But in the case of Tamil, I'm curious why most other combining Tamil
marks have combining class 0, whereas pulli has 9. Even U+0B82
Anusvara, a character barely used in Tamil text, has combining class 0
and is included in Other_Alphabetic, but the visually similar and
similarly positioned pulli is not. In this particular case, is it a
historical accident that these were assigned this way, or is there a
rationale behind it? Would it be at all possible to get this changed in
an upcoming version of the Unicode Standard?
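
For reference, this is roughly how I have been looking at these values.
It is only a small sketch using Python's standard unicodedata module;
the helper name describe() is my own, and the values reported depend on
the Unicode version bundled with the Python build. As far as I know the
module exposes the combining class and general category, but not
Other_Alphabetic itself.

```python
import unicodedata

def describe(cp):
    """Return (name, canonical combining class, general category) for a code point."""
    ch = chr(cp)
    return unicodedata.name(ch), unicodedata.combining(ch), unicodedata.category(ch)

# U+0B82 is the Tamil anusvara, U+0BCD the pulli, U+0E3A the Thai phinthu.
for cp in (0x0B82, 0x0BCD, 0x0E3A):
    name, ccc, gc = describe(cp)
    print(f"U+{cp:04X} {name}: ccc={ccc}, gc={gc}")
```

The anusvara comes out with combining class 0 and the pulli with 9,
which is the difference I am asking about.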

(By the way, I'm happy to get a link to read through for any of my
questions here. I just find it quite hard to search for and find past
discussions and decision rationales regarding these, not knowing how
and where to search for them.)

> I fear that the correct test for what you want is to split text into
> words and check that each word begins with an alphabetic character.

Do you mean "each grapheme cluster begins with an alphabetic
character" here? It seems to me (with my very limited Unicode
knowledge) that such a test, going through grapheme clusters and
checking the first code point in each, would also ensure the text is
fully alphabetic. It also has the advantage that more languages offer a
(relatively) easy way to split text into grapheme clusters than to
check minor Unicode properties like Word_Break, so this one might be
easier to implement. Is this test anywhere in the ballpark of being
right?
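
To make the idea concrete, here is a rough sketch in Python. Both
function names are my own invention, and it rests on two loud
simplifications: it does not implement the real grapheme cluster rules
(it just attaches every mark, plus ZWJ, ZWNJ and CGJ, to the preceding
code point), and it uses "general category starts with L" as a stand-in
for the Alphabetic property, which the standard library does not expose.

```python
import unicodedata

# ZWNJ, ZWJ and CGJ: not marks by general category, but they should not start a cluster.
JOINERS = {"\u200C", "\u200D", "\u034F"}

def rough_clusters(text):
    """Very rough stand-in for grapheme cluster segmentation (an assumption,
    not the real algorithm): marks and joiners extend the previous cluster."""
    clusters = []
    for ch in text:
        extends = unicodedata.category(ch).startswith("M") or ch in JOINERS
        if clusters and extends:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def looks_alphabetic(text):
    """True if every rough cluster begins with a letter (general category L*)."""
    return all(unicodedata.category(c[0]).startswith("L")
               for c in rough_clusters(text))

print(looks_alphabetic("c\u0327a"))   # c + combining cedilla, then a
print(looks_alphabetic("\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"))  # Tamil word ending in pulli
print(looks_alphabetic("abc1"))       # a digit starts its own cluster
```

With this version spaces and punctuation also count as non-alphabetic,
so it checks a single word rather than running text.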

