On Mon, 28 May 2018 20:03:11 +0530 SundaraRaman R via Unicode <unicode@unicode.org> wrote:
> Hi, thanks for your reply.
>
> > There is only one character with a canonical combining class of 9
> > that is included as other_alphabetic, namely U+0E3A THAI CHARACTER
> > PHINTHU. That last had any of the other properties of viramas back
> > in Unicode 1.0; the characters that triggered such behaviours were
> > permanently removed in Unicode 1.1.
>
> I didn't understand the second sentence here, could you clarify?

Sorry, I messed that sentence up. It should have read, "The last time
that it had any of the other properties of viramas was back in
Unicode 1.0;"

> What
> do you mean by "any of the other properties" here?

The effects of virama that spring to mind are:

(a) Causing one or both letters on either side to change or combine to
indicate combination;

(b) Appearing as a mark only if it does not affect one of the letters
on either side;

(c) Causing a left matra to appear on the left of the sequence of
consonants joined by a sequence of non-visible viramas.

> And "triggered such
> behaviours" seems to imply having them in other_alphabetic had
> negative consequences, could you give an example of what that might
> be?

Nowadays, the Thai syllable ไตร, normatively pronounced /trai/, is only
encoded <U+0E44 THAI CHARACTER SARA AI MAIMALAI, U+0E15 THAI CHARACTER
TO TAO, U+0E23 THAI CHARACTER RO RUA>, and the character U+0E3A is
always visible when used; for most routine purposes it is little
different to U+0E38 THAI CHARACTER SARA U.

However, in Unicode 1.0, while <U+0E44, U+0E15, U+0E23> was rendered as
at present, the same visible string could also be encoded as <U+0E15,
U+0E3A, U+0E23, U+0E74 THAI PHONETIC ORDER VOWEL SIGN SARA MAI MALAI> -
no glyph would be rendered for U+0E3A. If one wanted the official
Sanskritised Pali version, one could type ไตฺร <U+0E44, U+0E15, U+0E3A,
U+0E23> as at present. One could also encode it as <U+0E15, U+0E3A,
U+200C, U+0E23, U+0E74>.
Weirdly, I couldn't have used the phonetically ordered vowel to type a
monk's name ending in มฺโม <U+0E21 THAI CHARACTER MO MA, U+0E3A, U+0E42
THAI CHARACTER SARA O, U+0E21>, as <U+0E21, U+0E3A, U+200C, U+0E21,
U+0E72 THAI PHONETIC ORDER VOWEL SIGN O> would have been rendered as
โมฺม.

As the non-phonetic virama-like behaviours of U+0E3A are only mentioned
under the heading 'Alternate Ordering', I can only presume that they
were triggered by the phonetic order vowel signs, U+0E70 to U+0E74.

It is possible that U+0E3A acquired the alphabetic property because it
ceased to behave like a virama. Alternatively, it may have acquired the
alphabetic property because of its use in the compound vowels of
minority languages.

> But in the case of Tamil, I'm curious why most other combining Tamil
> marks go in class 0, whereas pulli goes in 9. Even u0B82 Anusvara, a
> character barely used in Tamil text, has combining class 0 and is
> included in Other_Alphabetic, but the visually similar and similarly
> positioned pulli is not. In this particular case, is it a historical
> accident that these got assigned this way, or is there a rationale
> behind these? Would it at all be possible to get this changed in the
> upcoming Unicode standard?

Tamil has usually been treated as just another Indian Indic script.
U+0E3A is the only virama-like character with the property of being
'alphabetic'.

I can't see a change making it into Unicode 11.0. It requires too much
careful thought. Besides, anything that considered <pulli> as
alphabetic should also consider <pulli, ZWNJ> as alphabetic - they
should be mostly interchangeable in Tamil.

> > I fear that the correct test for what you want is to split text into
> > words and check that each word begins with an alphabetic
> > character.
>
> Do you mean "each grapheme cluster begins with an alphabetic
> character" here?
> It seems to me (in my very limited Unicode knowledge)
> that such a test, going through grapheme clusters and checking the
> first codepoint in each, would also ensure the text is full
> alphabetic.

Not directly. Is the string "mark2mark" alphabetic? It constitutes a
single word. My suggested simplification would say 'no', as it
contains '2'; perhaps my simplification is wrong.

> And it has the advantage that more languages have a
> (relatively) easy way for splitting text into grapheme clusters, than
> for checking minor Unicode properties like WordBreak, so this one
> might be easier to implement. Does this test anywhere in the ballpark
> of being right?

Yes, it's close to being right. Note that simple approximations for SE
Asian word-breaking (e.g. treating SE Asian characters as alphabetic)
should work well for your application.

Richard.
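
P.S. For what it's worth, here is a rough Python sketch of the
grapheme-cluster test discussed above. It is only an approximation
under several assumptions. Python's standard library exposes neither
the Alphabetic property nor proper grapheme cluster segmentation, so
the sketch uses general category L* as a stand-in for 'alphabetic' and
treats any mark (category M*), ZWNJ or ZWJ as extending the preceding
cluster. The function names are made up for illustration.

```python
import unicodedata

# ZWNJ and ZWJ are treated as part of the preceding cluster, so that
# <pulli> and <pulli, ZWNJ> are handled alike.
JOINERS = {"\u200c", "\u200d"}


def is_extender(ch):
    """Crude stand-in for 'extends a grapheme cluster': any mark or joiner."""
    return ch in JOINERS or unicodedata.category(ch).startswith("M")


def approx_clusters(text):
    """Split text into approximate clusters: a base plus trailing extenders."""
    clusters = []
    for ch in text:
        if clusters and is_extender(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def looks_alphabetic(text):
    """True if every approximate cluster starts with a letter (category L*).

    Category L* is an assumption standing in for the Alphabetic property.
    """
    clusters = approx_clusters(text)
    return bool(clusters) and all(
        unicodedata.category(c[0]).startswith("L") for c in clusters
    )
```

On this approximation "mark2mark" is rejected because the cluster "2"
does not start with a letter, while a Tamil word ending in pulli is
accepted, since the pulli merely extends the cluster of the preceding
consonant.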