Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
On Tue, 29 May 2018 14:03:25 -0700 Doug Ewell via Unicode wrote:

> In any case, Ken has answered the real underlying question: a process
> that checks whether each character in a sequence is "alphabetic" is
> inappropriate for determining whether the sequence constitutes a word.

Back in the second post of the thread, I made the point that a conformant Unicode process cannot always give a yes/no answer to the question of whether all characters in a string are alphabetic.

What we seem to have established is that Unicode properties are not set up to facilitate the identification of words. Given that spell-checkers work, we have taken a wrong turn.

Perhaps we should reconsider "b⃝e⃝", which consists of two letters each inside its own enclosing circle. The spell-checker I'm using considers it a misspelt word, rather than two symbols side by side.

Richard.
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
Richard Wordingham wrote:

>>> The effects of virama that spring to mind are:
>>>
>>> (a) Causing one or both letters on either side to change or combine
>>> to indicate combination;
>>>
>>> (b) Appearing as a mark only if it does not affect one of the
>>> letters on either side;
>>>
>>> (c) Causing a left matra to appear on the left of the sequence of
>>> consonants joined by a sequence of non-visible viramas.
>>
>> Most of these don't apply to Tamil, of course.
>
> They all apply to க்ஷே TAMIL
> SYLLABLE KSSEE. There are four other named syllables where they all
> apply.

And several others where they do not. TUS explains that visible puḷḷi is the general rule in Tamil, and conjunct ligatures are the exception. I should have written "These mostly don't apply to Tamil, of course."

In any case, Ken has answered the real underlying question: a process that checks whether each character in a sequence is "alphabetic" is inappropriate for determining whether the sequence constitutes a word.

--
Doug Ewell | Thornton, CO, US | ewellic.org
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
On Tue, 29 May 2018 07:27:21 -0700 Ken Whistler via Unicode wrote: > On 5/29/2018 12:49 AM, Richard Wordingham via Unicode wrote: > > How would one know that they are misapplied? And what if the > > author of the text has broken your rules? Are such texts never to > > be transcribed to pukka Unicode? > > Applying Tamil -ii (0BC0, Script=Tamil) to the Latin letter a (0061, > Script=Latin) doesn't automatically make the Tamil vowel "inherit" > the Latin script property value, nor should it. It's the sort of process that gave us U+0310 COMBINING CANDRABINDU. However, I see adding SE Asian dependent vowels to Latin letter x (U+0078, Script=Latin) as rather tending to make 'x' Script=Common. Others have disagreed quite vehemently. I see the view that the base character is U+00D7 MULTIPLICATION SIGN (InSC=Consonant_Placeholder) has prevailed. Serifed U+00D7 is quite common in manually typewritten material; I remember it from school. I'm not sure what script the sequence belongs to in OpenType layout. I ought to find out for the benefit of Tai Tham fonts. > That said, if someone decides they want that sequence, and their text > as "broken my rules", so be it. I'm just not going to assume anything > particular about that text. Note that in terms of trying to determine > whether such a string is (naively) alphabetic, such a sequence > doesn't interfere with the determination. On the other hand, a > process concerned about text runs, script assignment, validity for > domains, or other such issues *will* be sensitive to such a boundary > -- and should not be overruled by some generic determination that > combining marks inherit all the properties of their base. When it comes to script runs for rendering, such a rule feels oppressive; it is widely unenforced. For example, I have found that if my font treats U+0E4A THAI CHARACTER MAI TRI as a Tai Tham character, it will generally render satisfactorily on a Tai Tham character. 
Presumably I can now use a few examples of the same Northern Thai syllable on the same page in a published language-teaching book as evidence for adding its clone to the Tai Tham script. There should also be some examples of U+0ECA LAO TONE MAI TI on Lao Tai Tham syllables, but I haven't found any yet. See the chart at the end of "Exemple d’écriture ignorée par Unicode : l’écriture tham du Laos" http://www.laosoftware.com/download/articleTALN.pdf for an implicit claim of existence. > > Even without knowing exactly what is wanted, it looks to me as > > though it isn't. If he wants to allow as a > > substring, which he should, then that fails because there is no > > overlap between p{extender} and p{gc=Cf} or between p{diacritic} > > and p{gc=Cf}. > > Yes, so if you are working with strings for Indic scripts (or for > that matter, Arabic), you add Join_Control to the mix: > > Alphabetic ∪ Diacritic ∪ Extender ∪ Join_Control > > gets you a decent approximation of what is (naively) expected to fall > within an "alphabetic" string for most scripts. but won't work for collatable Welsh 'Llan͏gollen'! (There's a CGJ between the 'n' and the 'g'.) One also needs Join_Control for fraktur German and, to my mind, English 'Caesar'. > For those following along, Alphabetic is roughly meant to cover the > ABC, かきくけこ,... plus ideographic elements of most scripts. > Diacritic picks up most of the applied combining marks, including > nuktas, viramas, and tone marks. Extender picks up spacing elements > that indicate length, reduplication, iteration, etc. And joiners are, > well, joiners. 'Diacritic' mostly includes marks with secondary collation weight; those with primary weights, such as Indic dependent vowels, are mopped up in Alphabetic. (Removing diacritics is very much not the same as removing combining marks.) 
> If one wants finer categorization specifically for Indic scripts, > then I would suggest turning to the Indic_Syllabic_Category property > instead of a union of PropList.txt properties and/or some twiddling > with General_Category values. You'd still need to add gc=L to catch things like U+0971 DEVANAGARI SIGN HIGH SPACING DOT (which starts syllables) and U+A8F4 DEVANAGARI SIGN DOUBLE CANDRABINDU VIRAMA. And you'd still miss U+0303 COMBINING TILDE and U+0331 COMBINING MACRON BELOW from Thai script Pattani Malay - I need to make another attempt to get them appropriate Indic syllabic category values. Richard.
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
On Mon, 28 May 2018 16:13:43 -0600 Doug Ewell via Unicode wrote:

> Richard Wordingham wrote:
>
> > The effects of virama that spring to mind are:
> >
> > (a) Causing one or both letters on either side to change or combine
> > to indicate combination;
> >
> > (b) Appearing as a mark only if it does not affect one of the
> > letters on either side;
> >
> > (c) Causing a left matra to appear on the left of the sequence of
> > consonants joined by a sequence of non-visible viramas.
>
> Most of these don't apply to Tamil, of course.

They all apply to க்ஷே TAMIL SYLLABLE KSSEE. There are four other named syllables where they all apply.

Richard
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
On 5/29/2018 12:49 AM, Richard Wordingham via Unicode wrote:

> How would one know that they are misapplied? And what if the author
> of the text has broken your rules? Are such texts never to be
> transcribed to pukka Unicode?

Applying Tamil -ii (0BC0, Script=Tamil) to the Latin letter a (0061, Script=Latin) doesn't automatically make the Tamil vowel "inherit" the Latin script property value, nor should it.

That said, if someone decides they want that sequence, and their text has "broken my rules", so be it. I'm just not going to assume anything particular about that text. Note that in terms of trying to determine whether such a string is (naively) alphabetic, such a sequence doesn't interfere with the determination. On the other hand, a process concerned about text runs, script assignment, validity for domains, or other such issues *will* be sensitive to such a boundary -- and should not be overruled by some generic determination that combining marks inherit all the properties of their base.

> Even without knowing exactly what is wanted, it looks to me as
> though it isn't. If he wants to allow as a substring, which he
> should, then that fails because there is no overlap between
> p{extender} and p{gc=Cf} or between p{diacritic} and p{gc=Cf}.

Yes, so if you are working with strings for Indic scripts (or for that matter, Arabic), you add Join_Control to the mix:

Alphabetic ∪ Diacritic ∪ Extender ∪ Join_Control

gets you a decent approximation of what is (naively) expected to fall within an "alphabetic" string for most scripts.

For those following along, Alphabetic is roughly meant to cover the ABC, かきくけこ,... plus ideographic elements of most scripts. Diacritic picks up most of the applied combining marks, including nuktas, viramas, and tone marks. Extender picks up spacing elements that indicate length, reduplication, iteration, etc. And joiners are, well, joiners.

If one wants finer categorization specifically for Indic scripts, then I would suggest turning to the Indic_Syllabic_Category property instead of a union of PropList.txt properties and/or some twiddling with General_Category values.

--Ken
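
For anyone who wants to try the union described above, here is a rough Python sketch. It rests on several assumptions: Python's standard library does not expose Alphabetic, Diacritic, Extender or Join_Control, so the sets are parsed from text in the PropList.txt layout; SAMPLE_PROPLIST is a few hand-written lines in that layout rather than the real file; and is_alphabetic only follows the description given in this thread (an L category, or Nl, or Other_Alphabetic). The names parse_proplist, is_alphabetic and is_wordlike are made up for illustration.

```python
import unicodedata

# Hand-written sample lines in the PropList.txt layout (an assumption for
# illustration; a real check would read the complete file from the UCD).
SAMPLE_PROPLIST = """\
0BBE..0BBF    ; Other_Alphabetic # Mc   [2] TAMIL VOWEL SIGN AA..TAMIL VOWEL SIGN I
0BC0          ; Other_Alphabetic # Mn       TAMIL VOWEL SIGN II
094D          ; Diacritic # Mn       DEVANAGARI SIGN VIRAMA
0BCD          ; Diacritic # Mn       TAMIL SIGN VIRAMA
0E46          ; Extender # Lm       THAI CHARACTER MAIYAMOK
200C..200D    ; Join_Control # Cf   [2] ZERO WIDTH NON-JOINER..ZERO WIDTH JOINER
"""


def parse_proplist(text, wanted):
    """Return the set of code points carrying any property named in `wanted`."""
    result = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        cp_range, prop = (field.strip() for field in data.split(";"))
        if prop not in wanted:
            continue
        first, _, last = cp_range.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        result.update(range(start, end + 1))
    return result


OTHER_ALPHABETIC = parse_proplist(SAMPLE_PROPLIST, {"Other_Alphabetic"})
EXTRA = parse_proplist(SAMPLE_PROPLIST, {"Diacritic", "Extender", "Join_Control"})


def is_alphabetic(ch):
    """Approximate Alphabetic: an L category, Nl, or Other_Alphabetic."""
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nl" or ord(ch) in OTHER_ALPHABETIC


def is_wordlike(text):
    """Alphabetic, Diacritic, Extender and Join_Control taken together."""
    return bool(text) and all(is_alphabetic(ch) or ord(ch) in EXTRA for ch in text)
```

With these sample sets, a Tamil word ending in pulli fails a per-character is_alphabetic check but passes is_wordlike, while "mark2mark" fails both because of the digit.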
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
On Mon, 28 May 2018 22:02:15 -0700 Ken Whistler via Unicode wrote: > On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote: > > One of the general principles is that combining marks inherit the > > property of their base character. > > > > Normally, "inherited" should be the only property value for > > combining marks. > > > > There have been some deviations from this over the years, for > > various reasons, and there are some properties (such as general > > category) where it is necessary to recognize the character as > > combining, but the general principle still holds. > > > > Therefore, if you are trying to see whether a string is alphabetic, > > combining marks should be "transparent" to such an algorithm. > > Generally, good advice. But there are clear exceptions. For example, > the enclosing combining marks for symbols are intended (basically) to > make symbols of a sort. And many combining marks have explicit script > assigments, so they cannot simply willy-nilly inherit the script of a > base letter if they are misapplied, for example. How would one know that they are misapplied? And what if the author of the text has broken your rules? Are such texts never to be transcribed to pukka Unicode? > This is why I recommend simply adding the Diacritic property into the > mix for testing a string. That is a closer approximation to the kind > of naive "Is this string alphabetic?" question that SunaraRaman was > asking about -- it picks up the correct subset of combining marks to > union with the set of actual isAlphabetic characters, to produce more > expected results. (Including, of course, the correct classification > of all the viramas, stackers, and killers, as well as picking up all > the nuktas.). > > Folks, please examine the set of character for Diacritic and for > Extender in: > > http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt > > to see what I'm talking about. The stuff you are looking for is > already there. 
Even without knowing exactly what is wanted, it looks to me as though it isn't. If he wants to allow as a substring, which he should, then that fails because there is no overlap between p{extender} and p{gc=Cf} or between p{diacritic} and p{gc=Cf}. U+034F COMBINING GRAPHEME JOINER is also missing, apparently deliberately in the case of 'diacritic'. If one uses the definition of words in the word break algorithm, one will end up accepting combinations of letter plus enclosing circle or keycap. (A fix to the word break algorithm for that would be unpleasant.) One hopes that the requirement doesn't include accepting all single words. Every properly spelt word containing U+0E46 THAI CHARACTER MAIYAMOK will be rejected, as it will contain a space before the U+0E46. (I assume there are such words; certainly there are dictionary entries with no corresponding entries without U+0E46, such as "ตึ้ก ๆ".) At a lesser level, even English has a very few words with spaces in them, and there is no solution but to list them. Richard.
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote:

> One of the general principles is that combining marks inherit the
> property of their base character.
>
> Normally, "inherited" should be the only property value for
> combining marks.
>
> There have been some deviations from this over the years, for
> various reasons, and there are some properties (such as general
> category) where it is necessary to recognize the character as
> combining, but the general principle still holds.
>
> Therefore, if you are trying to see whether a string is alphabetic,
> combining marks should be "transparent" to such an algorithm.

Generally, good advice. But there are clear exceptions. For example, the enclosing combining marks for symbols are intended (basically) to make symbols of a sort. And many combining marks have explicit script assignments, so they cannot simply willy-nilly inherit the script of a base letter if they are misapplied, for example.

This is why I recommend simply adding the Diacritic property into the mix for testing a string. That is a closer approximation to the kind of naive "Is this string alphabetic?" question that SundaraRaman was asking about -- it picks up the correct subset of combining marks to union with the set of actual isAlphabetic characters, to produce more expected results. (Including, of course, the correct classification of all the viramas, stackers, and killers, as well as picking up all the nuktas.)

Folks, please examine the sets of characters for Diacritic and for Extender in:

http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

to see what I'm talking about. The stuff you are looking for is already there.

--Ken

P.S. And please don't start an argument about the fact that a "virama" isn't really a "diacritic". We know that, too. ;-)
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
On 5/28/2018 9:23 PM, Martin J. Dürst via Unicode wrote:

> Hello Sundar,
>
> On 2018/05/28 04:27, SundaraRaman R via Unicode wrote:
>> Hi,
>>
>> In languages like Ruby or Java
>> (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)),
>> functions to check if a character is alphabetic do that by looking for
>> the 'Alphabetic' property (defined true if it's in one of the L
>> categories, or Nl, or has 'Other_Alphabetic' property). When parsing
>> Tamil text, this works out well for independent vowels and consonants
>> (which are in Lo), and for most dependent signs (which are in Mc or Mn
>> but have the 'Other_Alphabetic' property), but the very common pulli
>> (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to
>> concluding any string containing it to be non-alphabetic.
>>
>> This doesn't make sense to me since the Virama “◌்” as much of an
>> alphabetic character as any of the "Dependent Vowel" characters which
>> have been given the 'Other_Alphabetic' property. Is there a rationale
>> behind this difference, or is it an oversight to be corrected?
>
> I suggest submitting an error report via
> https://www.unicode.org/reporting.html. I haven't studied the issue in
> detail (sorry, just no time this week), but it sounds reasonable to
> give the VIRAMA the 'Other_Alphabetic' property.

Please don't.

This is not an error in the Unicode property assignments, which have been stable in scope for Alphabetic for some time now.

The problem is in assuming that the Java or Ruby isAlphabetic() API, which simply reports the Unicode property value Alphabetic for a character, suffices for identifying a string as somehow "wordlike". It doesn't.

The approximation you are looking for is to add Diacritic to Alphabetic. That will automatically pull in all the nuktas and viramas/killers for Brahmi-derived scripts. It also will pull in the harakat for Arabic and similar abjads, which are also not Alphabetic in the property values. And it will pull in tone marks for various writing systems.

For good measure, also add Extender, which will pick up length marks and iteration marks.

Please do not assume that the Alphabetic property just automatically equates to "what I would write in a word". Or that it should be adjusted to somehow make that happen. It would be highly advisable to study *all* the UCD properties in more depth, before starting to report bugs in one or another simply because using a single property doesn't produce the string classification one assumes should be correct in a particular case.

Of course, to get a better approximation of what actually constitutes a "word" in a particular writing system, instead of using raw property APIs, one should be using a WordBreak iterator, preferably one tailored for the language in question.

--Ken

> I'd recommend to mention examples other than Tamil in your report
> (assuming they exist).
>
> BTW, what's the method you are using in Ruby? If there's a problem in
> Ruby (which I don't think; it's just using Unicode data), then please
> make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I
> should be able to follow up on that.
>
> Regards, Martin.
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
One of the general principles is that combining marks inherit the property of their base character. Normally, "inherited" should be the only property value for combining marks. There have been some deviations from this over the years, for various reasons, and there are some properties (such as general category) where it is necessary to recognize the character as combining, but the general principle still holds. Therefore, if you are trying to see whether a string is alphabetic, combining marks should be "transparent" to such an algorithm. A./ On 5/28/2018 9:23 PM, Martin J. Dürst via Unicode wrote: Hello Sundar, On 2018/05/28 04:27, SundaraRaman R via Unicode wrote: Hi, In languages like Ruby or Java (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)), functions to check if a character is alphabetic do that by looking for the 'Alphabetic' property (defined true if it's in one of the L categories, or Nl, or has 'Other_Alphabetic' property). When parsing Tamil text, this works out well for independent vowels and consonants (which are in Lo), and for most dependent signs (which are in Mc or Mn but have the 'Other_Alphabetic' property), but the very common pulli (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to concluding any string containing it to be non-alphabetic. This doesn't make sense to me since the Virama “◌்” as much of an alphabetic character as any of the "Dependent Vowel" characters which have been given the 'Other_Alphabetic' property. Is there a rationale behind this difference, or is it an oversight to be corrected? I suggest submitting an error report via https://www.unicode.org/reporting.html. I haven't studied the issue in detail (sorry, just no time this week), but it sounds reasonable to give the VIRAMA the 'Other_Alphabetic' property. I'd recommend to mention examples other than Tamil in your report (assuming they exist). BTW, what's the method you are using in Ruby? 
If there's a problem in Ruby (which I don't think; it's just using Unicode data), then please make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I should be able to follow up on that. Regards, Martin.
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
Hello Sundar, On 2018/05/28 04:27, SundaraRaman R via Unicode wrote: Hi, In languages like Ruby or Java (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)), functions to check if a character is alphabetic do that by looking for the 'Alphabetic' property (defined true if it's in one of the L categories, or Nl, or has 'Other_Alphabetic' property). When parsing Tamil text, this works out well for independent vowels and consonants (which are in Lo), and for most dependent signs (which are in Mc or Mn but have the 'Other_Alphabetic' property), but the very common pulli (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to concluding any string containing it to be non-alphabetic. This doesn't make sense to me since the Virama “◌்” as much of an alphabetic character as any of the "Dependent Vowel" characters which have been given the 'Other_Alphabetic' property. Is there a rationale behind this difference, or is it an oversight to be corrected? I suggest submitting an error report via https://www.unicode.org/reporting.html. I haven't studied the issue in detail (sorry, just no time this week), but it sounds reasonable to give the VIRAMA the 'Other_Alphabetic' property. I'd recommend to mention examples other than Tamil in your report (assuming they exist). BTW, what's the method you are using in Ruby? If there's a problem in Ruby (which I don't think; it's just using Unicode data), then please make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I should be able to follow up on that. Regards, Martin.
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
SundaraRaman R wrote:

> but the very common pulli (VIRAMA) is neither in Lo nor has
> 'Other_Alphabetic', and so leads to concluding any string containing
> it to be non-alphabetic.

Is this definition part of Unicode? I thought the use of General Category to answer questions like "this sequence is a word" or "this string is alphabetic" was much more complex than that. (I'm not even sure what the latter means, for any script with any sort of combining mark.)

Richard Wordingham wrote:

> The effects of virama that spring to mind are:
>
> (a) Causing one or both letters on either side to change or combine
> to indicate combination;
>
> (b) Appearing as a mark only if it does not affect one of the
> letters on either side;
>
> (c) Causing a left matra to appear on the left of the sequence of
> consonants joined by a sequence of non-visible viramas.

Most of these don't apply to Tamil, of course.

--
Doug Ewell | Thornton, CO, US | ewellic.org
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
On Mon, 28 May 2018 20:03:11 +0530 SundaraRaman R via Unicode wrote:

> Hi, thanks for your reply.
>
> > There is only one character with a canonical combining class of 9
> > that is included as other_alphabetic, namely U+0E3A THAI CHARACTER
> > PHINTHU. That last had any of the other properties of viramas back
> > in Unicode 1.0; the characters that triggered such behaviours were
> > permanently removed in Unicode 1.1.
>
> I didn't understand the second sentence here, could you clarify?

Sorry, I messed that system up. It should have read, "The last time that that had any of the other properties of viramas back in Unicode 1.0;"

> What
> do you mean by "any of the other properties" here?

The effects of virama that spring to mind are:

(a) Causing one or both letters on either side to change or combine to indicate combination;

(b) Appearing as a mark only if it does not affect one of the letters on either side;

(c) Causing a left matra to appear on the left of the sequence of consonants joined by a sequence of non-visible viramas.

> And "triggered such
> behaviours" seems to imply having them in other_alphabetic had
> negative consequences, could you give an example of what that might
> be?

Nowadays, the Thai syllable ไตร, normatively pronounced /trai/, is only encoded , and the character U+0E3A is always visible when used; for most routine purposes it is little different to U+0E38 THAI CHARACTER SARA U. However, in Unicode 1.0, while was rendered as at present, the same visible string could also be encoded as - no glyph would be rendered for U+0E3A. If one wanted the official Sanskritised Pali version, one could type ไตฺร as at present. One could also encode it as . Weirdly, I couldn't have used the phonetically ordered vowel to type a monk's name ending in มฺโม , as would have been rendered as โมฺม.
As the non-phonetic virama-like behaviours of U+0E3A are only mentioned under the heading 'Alternate Ordering', I can only presume that they were triggered by the phonetic order vowel signs, U+0E70 to U+0E74. It is possible that U+0E3A acquired the alphabetic property because it ceased to behave like a virama. Alternatively, it may have acquired the alphabetic property because of its use in the compound vowels of minority languages. > But in the case of Tamil, I'm curious why most other combining Tamil > marks go in class 0, whereas pulli goes in 9. Even u0B82 Anusvara, a > character barely used in Tamil text, has combining class 0 and is > included in Other_Alphabetic, but the visually similar and similarly > positioned pulli is not. In this particular case, is it a historical > accident that these got assigned this way, or is there a rationale > behind these? Would it at all be possible to get this changed in the > upcoming Unicode standard? Tamil has usually been treated as just another Indian Indic script. U+0E3A is the only virama-like character with the property of being 'alphabetic'. I can't see a change making it into Unicode 11.0. It requires too much careful thought. Besides, anything that considered as alphabetic should also considerer as alphabetic - they should be mostly interchangeable in Tamil. > > I fear that the correct test for what you want is to split text into > > words and check that each word begins with an alphabetic > > character. > > Do you mean "each grapheme cluster begins with an alphabetic > character" here? It seems to me (in my very limited Unicode knowledge) > that such a test, going through grapheme clusters and checking the > first codepoint in each, would also ensure the text is full > alphabetic. Not directly. Is the string "mark2mark" alphabetic? It constitutes a single word. My suggested simplification would say 'no', as it contains '2'; perhaps my simplification is wrong. 
> And it has the advantage that more languages have a > (relatively) easy way for splitting text into grapheme clusters, than > for checking minor Unicode properties like WordBreak, so this one > might be easier to implement. Does this test anywhere in the ballpark > of being right? Yes, it's close to being right. Note that simple approximations for SE Asian word-breaking (e.g. treating SE Asian characters as alphabetic) should work well for your application. Richard.
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
Hi, thanks for your reply. > There is only one character with a canonical combining class of 9 that > is included as other_alphabetic, namely U+0E3A THAI CHARACTER PHINTHU. > That last had any of the other properties of viramas back in Unicode > 1.0; the characters that triggered such behaviours were permanently > removed in Unicode 1.1. I didn't understand the second sentence here, could you clarify? What do you mean by "any of the other properties" here? And "triggered such behaviours" seems to imply having them in other_alphabetic had negative consequences, could you give an example of what that might be? > There are some notable absences from the combining marks included. > Significant absences include ZWJ, ZWNJ and CGJ. > > However, a non-erroneous *conformant* Unicode process cannot > always determine whether a string, given only that it is a string, is > composed only of alphabetic characters. The answer would be 'yes' for > but 'no' for the canonically > equivalent ! > (U+0327 is not included as alphabetic either.) > > There is at least one combination of Latin letter and combining mark > that occurs in the normal orthography of a natural language and does not > have a precomposed equivalent. Ah, that's somewhat unfortunate that such a quick and easy alphabetic check is not possible in the general case, but I can understand how it might be weird to give the Alphabetic property to a ZWJ or ZWNJ. But in the case of Tamil, I'm curious why most other combining Tamil marks go in class 0, whereas pulli goes in 9. Even u0B82 Anusvara, a character barely used in Tamil text, has combining class 0 and is included in Other_Alphabetic, but the visually similar and similarly positioned pulli is not. In this particular case, is it a historical accident that these got assigned this way, or is there a rationale behind these? Would it at all be possible to get this changed in the upcoming Unicode standard? 
(By the way, I'm happy to get a link to read through for any of my questions here. I just find it quite hard to search for and find past discussions and decision rationales regarding these, not knowing how and where to search for them.)

> I fear that the correct test for what you want is to split text into
> words and check that each word begins with an alphabetic character.

Do you mean "each grapheme cluster begins with an alphabetic character" here? It seems to me (in my very limited Unicode knowledge) that such a test, going through grapheme clusters and checking the first codepoint in each, would also ensure the text is fully alphabetic. And it has the advantage that more languages have a (relatively) easy way of splitting text into grapheme clusters than of checking minor Unicode properties like WordBreak, so this one might be easier to implement. Is this test anywhere in the ballpark of being right?

Regards,
Sundar
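
The combining-class difference asked about in this message can be inspected directly. This is a small sketch using only Python's unicodedata module; the helper name describe is made up for illustration, and the values printed depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def describe(ch):
    """Return (name, General_Category, Canonical_Combining_Class) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch))


# Anusvara, a dependent vowel sign, the pulli, and Thai phinthu for comparison.
for ch in "\u0B82\u0BBF\u0BCD\u0E3A":
    print("U+%04X" % ord(ch), *describe(ch))
```

The pulli and the anusvara share General_Category Mn, so the difference in combining class (9 against 0) is not visible from the category alone.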
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
On Mon, 28 May 2018 00:57:03 +0530 SundaraRaman R via Unicode wrote:

> Hi,
>
> In languages like Ruby or Java
> (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)),
> functions to check if a character is alphabetic do that by looking for
> the 'Alphabetic' property (defined true if it's in one of the L
> categories, or Nl, or has 'Other_Alphabetic' property). When parsing
> Tamil text, this works out well for independent vowels and consonants
> (which are in Lo), and for most dependent signs (which are in Mc or Mn
> but have the 'Other_Alphabetic' property), but the very common pulli
> (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to
> concluding any string containing it to be non-alphabetic.
>
> This doesn't make sense to me since the Virama “◌்” as much of an
> alphabetic character as any of the "Dependent Vowel" characters which
> have been given the 'Other_Alphabetic' property. Is there a rationale
> behind this difference, or is it an oversight to be corrected?

There is only one character with a canonical combining class of 9 that is included as other_alphabetic, namely U+0E3A THAI CHARACTER PHINTHU. That last had any of the other properties of viramas back in Unicode 1.0; the characters that triggered such behaviours were permanently removed in Unicode 1.1.

There are some notable absences from the combining marks included. Significant absences include ZWJ, ZWNJ and CGJ.

However, a non-erroneous *conformant* Unicode process cannot always determine whether a string, given only that it is a string, is composed only of alphabetic characters. The answer would be 'yes' for but 'no' for the canonically equivalent ! (U+0327 is not included as alphabetic either.)

There is at least one combination of Latin letter and combining mark that occurs in the normal orthography of a natural language and does not have a precomposed equivalent.
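
The canonical-equivalence problem can be reproduced in a few lines of Python. Two assumptions are made here: the code points in the example above were lost in the archive, so U+00E7 (c with cedilla) is chosen purely because U+0327 is mentioned; and the check uses General_Category L* as a stand-in for the Alphabetic property, which Python's standard library does not expose. The helper name all_letters is made up for illustration.

```python
import unicodedata


def all_letters(text):
    """True if every code point has a letter General_Category (L*)."""
    return all(unicodedata.category(ch).startswith("L") for ch in text)


composed = "\u00E7"                                  # LATIN SMALL LETTER C WITH CEDILLA
decomposed = unicodedata.normalize("NFD", composed)  # "c" followed by U+0327

# Canonically equivalent strings, yet the per-character answers differ.
print(all_letters(composed), all_letters(decomposed))
```

Normalizing to NFC before testing hides the problem for this pair, but not for combinations that have no precomposed form.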
I fear that the correct test for what you want is to split text into words and check that each word begins with an alphabetic character. That test can be made by a conformant process. I think, but have not checked, that the test can be simplified to:

(a) Check that the first character is alphabetic.

(b) Ignore every character with a WordBreak property of Extend or ZWJ.

(c) Check that all other characters are alphabetic.

Richard.
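
Steps (a) to (c) can be sketched in Python as follows. This is only an approximation under stated assumptions: the standard library has neither the WordBreak nor the Alphabetic property, so is_extend_like treats every combining mark (General_Category M*) plus ZWNJ and ZWJ as ignorable, and is_letter_like accepts the L categories and Nl. All three function names are made up for illustration.

```python
import unicodedata

JOIN_CONTROLS = {"\u200C", "\u200D"}  # ZWNJ and ZWJ


def is_extend_like(ch):
    """Rough stand-in for WordBreak=Extend or ZWJ: combining marks and joiners."""
    return ch in JOIN_CONTROLS or unicodedata.category(ch).startswith("M")


def is_letter_like(ch):
    """Rough stand-in for Alphabetic: an L category or Nl."""
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nl"


def passes_simplified_test(text):
    # (a) the first character must be alphabetic
    if not text or not is_letter_like(text[0]):
        return False
    # (b) skip Extend-like characters, (c) require the rest to be alphabetic
    return all(is_letter_like(ch) for ch in text[1:] if not is_extend_like(ch))
```

Under these stand-ins a Tamil word ending in pulli passes and "mark2mark" fails on the digit. A letter followed by U+20DD COMBINING ENCLOSING CIRCLE also passes, since the enclosing mark is skipped in step (b).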