Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-29 Thread Richard Wordingham via Unicode
On Tue, 29 May 2018 14:03:25 -0700
Doug Ewell via Unicode  wrote:

> In any case, Ken has answered the real underlying question: a process
> that checks whether each character in a sequence is "alphabetic" is
> inappropriate for determining whether the sequence constitutes a word.

Back in the second post of the thread, I made the point that a
conformant Unicode process cannot always give a yes/no answer to the
question of whether all characters in a string are alphabetic.

What we seem to have established is that Unicode properties are not set
up to facilitate the identification of words.  Given that
spell-checkers work, we have taken a wrong turn.  Perhaps we should
reconsider "b⃝e⃝", which consists of two letters each inside its own
enclosing circle.  The spell-checker I'm using considers it a
misspelt word, rather than two symbols side by side.

Richard.



Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-29 Thread Doug Ewell via Unicode
Richard Wordingham wrote:

>>> The effects of virama that spring to mind are:
>>>
>>> (a) Causing one or both letters on either side to change or combine
>>> to indicate combination;
>>>
>>> (b) Appearing as a mark only if it does not affect one of the
>>> letters on either side;
>>>
>>> (c) Causing a left matra to appear on the left of the sequence of
>>> consonants joined by a sequence of non-visible viramas.
>>
>> Most of these don't apply to Tamil, of course.
>
> They all apply to க்ஷே <U+0B95, U+0BCD, U+0BB7, U+0BC7> TAMIL
> SYLLABLE KSSEE. There are four other named syllables where they all
> apply.

And several others where they do not. TUS explains that visible
puḷḷi is the general rule in Tamil, and conjunct ligatures are the
exception.

I should have written "These mostly don't apply to Tamil, of course."

In any case, Ken has answered the real underlying question: a process
that checks whether each character in a sequence is "alphabetic" is
inappropriate for determining whether the sequence constitutes a word.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-29 Thread Richard Wordingham via Unicode
On Tue, 29 May 2018 07:27:21 -0700
Ken Whistler via Unicode  wrote:

> On 5/29/2018 12:49 AM, Richard Wordingham via Unicode wrote:
> > How would one know that they are misapplied?  And what if the
> > author of the text has broken your rules? Are such texts never to
> > be transcribed to pukka Unicode?  
> 
> Applying Tamil -ii (0BC0, Script=Tamil) to the Latin letter a (0061, 
> Script=Latin) doesn't automatically make the Tamil vowel "inherit"
> the Latin script property value, nor should it.

It's the sort of process that gave us U+0310 COMBINING CANDRABINDU.
However, I see adding SE Asian dependent vowels to Latin letter x
(U+0078, Script=Latin) as rather tending to make 'x' Script=Common.
Others have disagreed quite vehemently.  I see the view that the base
character is U+00D7 MULTIPLICATION SIGN (InSC=Consonant_Placeholder) has
prevailed.  Serifed U+00D7 is quite common in manually typewritten
material; I remember it from school.  I'm not sure what script the
sequence  belongs to in OpenType
layout. I ought to find out for the benefit of Tai Tham fonts.

> That said, if someone decides they want that sequence, and their text
> has "broken my rules", so be it. I'm just not going to assume anything 
> particular about that text. Note that in terms of trying to determine 
> whether such a string is (naively) alphabetic, such a sequence
> doesn't interfere with the determination. On the other hand, a
> process concerned about text runs, script assignment, validity for
> domains, or other such issues *will* be sensitive to such a boundary
> -- and should not be overruled by some generic determination that
> combining marks inherit all the properties of their base.

When it comes to script runs for rendering, such a rule feels
oppressive; it is widely unenforced.  For example, I have found that
if my font treats U+0E4A THAI CHARACTER MAI TRI as a Tai Tham
character, it will generally render satisfactorily on a Tai Tham
character.  Presumably I can now use a few examples of the same
Northern Thai syllable on the same page in a published language-teaching
book as evidence for adding its clone to the Tai Tham script.  There
should also be some examples of U+0ECA LAO TONE MAI TI on Lao Tai Tham
syllables, but I haven't found any yet.  See the chart at the end of
"Exemple d’écriture ignorée par Unicode : l’écriture tham du Laos"
http://www.laosoftware.com/download/articleTALN.pdf for an implicit
claim of existence.

> > Even without knowing exactly what is wanted, it looks to me as
> > though it isn't.  If he wants to allow  as a
> > substring, which he should, then that fails because there is no
> > overlap between p{extender} and p{gc=Cf} or between p{diacritic}
> > and p{gc=Cf}.  
> 
> Yes, so if you are working with strings for Indic scripts (or for
> that matter, Arabic), you add Join_Control to the mix:
> 
> Alphabetic  ∪ Diacritic ∪ Extender ∪ Join_Control
> 
> gets you a decent approximation of what is (naively) expected to fall 
> within an "alphabetic" string for most scripts.

but won't work for collatable Welsh 'Llan͏gollen'!  (There's a CGJ
between the 'n' and the 'g'.)


One also needs Join_Control for fraktur German and, to my mind,
English 'Ca‍esar'.
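
Since the joiners in these two examples are invisible, one way to see them is to list the code points. What follows is a minimal Python sketch using only the standard `unicodedata` module. The helper name `hidden_joiners` is made up for illustration, and the set it looks for (CGJ, ZWNJ, ZWJ) is simply the three characters mentioned in this thread, not a complete list of invisible characters.

```python
import unicodedata

# The three invisible characters discussed in this thread.
JOINERS = {"\u034f", "\u200c", "\u200d"}  # CGJ, ZWNJ, ZWJ

def hidden_joiners(text):
    """Return (index, 'U+XXXX', name) for each joiner found in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in JOINERS
    ]

print(hidden_joiners("Llan\u034fgollen"))  # CGJ between 'n' and 'g'
print(hidden_joiners("Ca\u200desar"))      # ZWJ between 'a' and 'e'
```

Only ZWNJ and ZWJ carry Join_Control, so a test built on the union quoted above would still reject the CGJ in the Welsh example.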

> For those following along, Alphabetic is roughly meant to cover the
> ABC, かきくけこ,... plus ideographic elements of most scripts.
> Diacritic picks up most of the applied combining marks, including
> nuktas, viramas, and tone marks. Extender picks up spacing elements
> that indicate length, reduplication, iteration, etc. And joiners are,
> well, joiners.

'Diacritic' mostly includes marks with secondary collation weight;
those with primary weights, such as Indic dependent vowels, are mopped
up in Alphabetic.  (Removing diacritics is very much not the same
as removing combining marks.)

> If one wants finer categorization specifically for Indic scripts,
> then I would suggest turning to the Indic_Syllabic_Category property
> instead of a union of PropList.txt properties and/or some twiddling
> with General_Category values.

You'd still need to add gc=L to catch things like U+0971 DEVANAGARI SIGN
HIGH SPACING DOT (which starts syllables) and U+A8F4 DEVANAGARI SIGN
DOUBLE CANDRABINDU VIRAMA.  And you'd still miss U+0303 COMBINING TILDE
and U+0331 COMBINING MACRON BELOW from Thai script Pattani Malay - I
need to make another attempt to get them appropriate Indic syllabic
category values.

Richard.



Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-29 Thread Richard Wordingham via Unicode
On Mon, 28 May 2018 16:13:43 -0600
Doug Ewell via Unicode  wrote:

> Richard Wordingham wrote:
> 
> > The effects of virama that spring to mind are:
> >
> > (a) Causing one or both letters on either side to change or combine
> > to indicate combination;
> >
> > (b) Appearing as a mark only if it does not affect one of the
> > letters on either side;
> >
> > (c) Causing a left matra to appear on the left of the sequence of
> > consonants joined by a sequence of non-visible viramas.  
> 
> Most of these don't apply to Tamil, of course.

They all apply to க்ஷே <U+0B95, U+0BCD, U+0BB7, U+0BC7> TAMIL
SYLLABLE KSSEE.  There are four other named syllables where they all
apply.

Richard



Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-29 Thread Ken Whistler via Unicode




On 5/29/2018 12:49 AM, Richard Wordingham via Unicode wrote:

> How would one know that they are misapplied?  And what if the author of
> the text has broken your rules? Are such texts never to be transcribed
> to pukka Unicode?


Applying Tamil -ii (0BC0, Script=Tamil) to the Latin letter a (0061, 
Script=Latin) doesn't automatically make the Tamil vowel "inherit" the 
Latin script property value, nor should it.


That said, if someone decides they want that sequence, and their text has 
"broken my rules", so be it. I'm just not going to assume anything 
particular about that text. Note that in terms of trying to determine 
whether such a string is (naively) alphabetic, such a sequence doesn't 
interfere with the determination. On the other hand, a process concerned 
about text runs, script assignment, validity for domains, or other such 
issues *will* be sensitive to such a boundary -- and should not be 
overruled by some generic determination that combining marks inherit all 
the properties of their base.





> Even without knowing exactly what is wanted, it looks to me as though
> it isn't.  If he wants to allow  as a substring, which
> he should, then that fails because there is no overlap between
> p{extender} and p{gc=Cf} or between p{diacritic} and p{gc=Cf}.


Yes, so if you are working with strings for Indic scripts (or for that 
matter, Arabic), you add Join_Control to the mix:


Alphabetic  ∪ Diacritic ∪ Extender ∪ Join_Control

gets you a decent approximation of what is (naively) expected to fall 
within an "alphabetic" string for most scripts.
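
As a rough illustration of that union, here is a Python sketch that uses only the standard library. Python's `unicodedata` does not expose Alphabetic, Diacritic, Extender or Join_Control, so the sketch parses text laid out like PropList.txt. The four-line excerpt embedded below is an assumption made for the demo (it is believed to match the real file but should be checked against it), the function names are hypothetical, and Alphabetic is approximated as "General_Category L* or Nl, or Other_Alphabetic", as described earlier in the thread. A real check should load the complete UCD files.

```python
import unicodedata

# Illustrative excerpt in the layout of PropList.txt (assumed for this demo;
# load the full file from the UCD for real use).
PROPLIST_EXCERPT = """\
0BBE..0BBF    ; Other_Alphabetic # Mc   [2] TAMIL VOWEL SIGN AA..TAMIL VOWEL SIGN I
0BC0          ; Other_Alphabetic # Mn       TAMIL VOWEL SIGN II
0BCD          ; Diacritic # Mn       TAMIL SIGN VIRAMA
200C..200D    ; Join_Control # Cf   [2] ZERO WIDTH NON-JOINER..ZERO WIDTH JOINER
"""

def parse_proplist(text):
    """Map each property name to a list of (first, last) code point ranges."""
    props = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        span, name = (field.strip() for field in data.split(";"))
        first, _, last = span.partition("..")
        props.setdefault(name, []).append((int(first, 16), int(last or first, 16)))
    return props

def has_property(props, name, ch):
    return any(lo <= ord(ch) <= hi for lo, hi in props.get(name, []))

def is_alphabetic(props, ch):
    # Approximation of Alphabetic: L*, Nl, or Other_Alphabetic.
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nl" or has_property(props, "Other_Alphabetic", ch)

def is_wordlike(props, text, extra=("Diacritic", "Extender", "Join_Control")):
    """True if every character is Alphabetic or has one of the extra properties."""
    return all(
        is_alphabetic(props, ch) or any(has_property(props, p, ch) for p in extra)
        for ch in text
    )

PROPS = parse_proplist(PROPLIST_EXCERPT)
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"    # the word "Tamil" in Tamil script, ending in pulli
print(is_wordlike(PROPS, word, extra=()))  # Alphabetic alone
print(is_wordlike(PROPS, word))            # with Diacritic, Extender, Join_Control
```

With Alphabetic alone the final pulli (U+0BCD) causes a rejection; adding the other three properties lets the word through.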


For those following along, Alphabetic is roughly meant to cover the ABC, 
かきくけこ,... plus ideographic elements of most scripts. Diacritic picks up 
most of the applied combining marks, including nuktas, viramas, and tone 
marks. Extender picks up spacing elements that indicate length, 
reduplication, iteration, etc. And joiners are, well, joiners.


If one wants finer categorization specifically for Indic scripts, then I 
would suggest turning to the Indic_Syllabic_Category property instead of 
a union of PropList.txt properties and/or some twiddling with 
General_Category values.


--Ken





Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-29 Thread Richard Wordingham via Unicode
On Mon, 28 May 2018 22:02:15 -0700
Ken Whistler via Unicode  wrote:

> On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote:
> > One of the general principles is that combining marks inherit the 
> > property of their base character.
> >
> > Normally, "inherited" should be the only property value for
> > combining marks.
> >
> > There have been some deviations from this over the years, for
> > various reasons, and there are some properties (such as general
> > category) where it is necessary to recognize the character as
> > combining, but the general principle still holds.
> >
> > Therefore, if you are trying to see whether a string is alphabetic, 
> > combining marks should be "transparent" to such an algorithm.  
> 
> Generally, good advice. But there are clear exceptions. For example,
> the enclosing combining marks for symbols are intended (basically) to
> make symbols of a sort. And many combining marks have explicit script 
> assignments, so they cannot simply willy-nilly inherit the script of a 
> base letter if they are misapplied, for example.

How would one know that they are misapplied?  And what if the author of
the text has broken your rules? Are such texts never to be transcribed
to pukka Unicode?

> This is why I recommend simply adding the Diacritic property into the 
> mix for testing a string. That is a closer approximation to the kind
> of naive "Is this string alphabetic?" question that SundaraRaman was
> asking about -- it picks up the correct subset of combining marks to
> union with the set of actual isAlphabetic characters, to produce more
> expected results. (Including, of course, the correct classification
> of all the viramas, stackers, and killers, as well as picking up all
> the nuktas.)
> 
> Folks, please examine the set of characters for Diacritic and for 
> Extender in:
> 
> http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
> 
> to see what I'm talking about. The stuff you are looking for is
> already there.

Even without knowing exactly what is wanted, it looks to me as though
it isn't.  If he wants to allow  as a substring, which
he should, then that fails because there is no overlap between
p{extender} and p{gc=Cf} or between p{diacritic} and p{gc=Cf}.  U+034F
COMBINING GRAPHEME JOINER is also missing, apparently deliberately in
the case of 'diacritic'. If one uses the definition of words in the
word break algorithm, one will end up accepting combinations of letter
plus enclosing circle or keycap.  (A fix to the word break algorithm
for that would be unpleasant.)

One hopes that the requirement doesn't include accepting all single
words.  Every properly spelt word containing U+0E46 THAI CHARACTER
MAIYAMOK will be rejected, as it will contain a space before the
U+0E46.  (I assume there are such words; certainly there are
dictionary entries with no corresponding entries without U+0E46,
such as "ตึ้ก ๆ".) At a lesser level, even English has a very few
words with spaces in them, and there is no solution but to list them.

Richard.



Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Ken Whistler via Unicode




On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote:
> One of the general principles is that combining marks inherit the 
> property of their base character.
>
> Normally, "inherited" should be the only property value for combining 
> marks.
>
> There have been some deviations from this over the years, for various 
> reasons, and there are some properties (such as general category) 
> where it is necessary to recognize the character as combining, but the 
> general principle still holds.
>
> Therefore, if you are trying to see whether a string is alphabetic, 
> combining marks should be "transparent" to such an algorithm.


Generally, good advice. But there are clear exceptions. For example, the 
enclosing combining marks for symbols are intended (basically) to make 
symbols of a sort. And many combining marks have explicit script 
assignments, so they cannot simply willy-nilly inherit the script of a 
base letter if they are misapplied, for example.


This is why I recommend simply adding the Diacritic property into the 
mix for testing a string. That is a closer approximation to the kind of 
naive "Is this string alphabetic?" question that SundaraRaman was asking 
about -- it picks up the correct subset of combining marks to union with 
the set of actual isAlphabetic characters, to produce more expected 
results. (Including, of course, the correct classification of all the 
viramas, stackers, and killers, as well as picking up all the nuktas.)


Folks, please examine the set of characters for Diacritic and for 
Extender in:


http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

to see what I'm talking about. The stuff you are looking for is already 
there.


--Ken

P.S. And please don't start an argument about the fact that a "virama" 
isn't really a "diacritic". We know that, too. ;-)





Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Ken Whistler via Unicode



On 5/28/2018 9:23 PM, Martin J. Dürst via Unicode wrote:

> Hello Sundar,
>
> On 2018/05/28 04:27, SundaraRaman R via Unicode wrote:
> > Hi,
> >
> > In languages like Ruby or Java
> > (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)),
> > functions to check if a character is alphabetic do that by looking for
> > the 'Alphabetic'  property (defined true if it's in one of the L
> > categories, or Nl, or has 'Other_Alphabetic' property). When parsing
> > Tamil text, this works out well for independent vowels and consonants
> > (which are in Lo), and for most dependent signs (which are in Mc or Mn
> > but have the 'Other_Alphabetic' property), but the very common pulli
> > (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to
> > concluding any string containing it to be non-alphabetic.
> >
> > This doesn't make sense to me since the Virama “◌்” is as much of an
> > alphabetic character as any of the "Dependent Vowel" characters which
> > have been given the 'Other_Alphabetic' property. Is there a rationale
> > behind this difference, or is it an oversight to be corrected?
>
> I suggest submitting an error report via 
> https://www.unicode.org/reporting.html. I haven't studied the issue in 
> detail (sorry, just no time this week), but it sounds reasonable to 
> give the VIRAMA the 'Other_Alphabetic' property.


Please don't. This is not an error in the Unicode property assignments, 
which have been stable in scope for Alphabetic for some time now.


The problem is in assuming that the Java or Ruby isAlphabetic() API, 
which simply reports the Unicode property value Alphabetic for a 
character, suffices for identifying a string as somehow "wordlike". It 
doesn't.


The approximation you are looking for is to add Diacritic to Alphabetic. 
That will automatically pull in all the nuktas and viramas/killers for 
Brahmi-derived scripts. It also will pull in the harakat for Arabic and 
similar abjads, which are also not Alphabetic in the property values. 
And it will pull in tone marks for various writing systems.


For good measure, also add Extender, which will pick up length marks and 
iteration marks.


Please do not assume that the Alphabetic property just automatically 
equates to "what I would write in a word". Or that it should be adjusted 
to somehow make that happen. It would be highly advisable to study *all* 
the UCD properties in more depth, before starting to report bugs in one 
or another simply because using a single property doesn't produce the 
string classification one assumes should be correct in a particular case.


Of course, to get a better approximation of what actually constitutes a 
"word" in a particular writing system, instead of using raw property 
API's, one should be using a WordBreak iterator, preferably one tailored 
for the language in question.


--Ken




> I'd recommend mentioning examples other than Tamil in your report 
> (assuming they exist).
>
> BTW, what's the method you are using in Ruby? If there's a problem in 
> Ruby (which I don't think; it's just using Unicode data), then please 
> make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I 
> should be able to follow up on that.
>
> Regards,   Martin.





Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Asmus Freytag via Unicode
One of the general principles is that combining marks inherit the 
property of their base character.


Normally, "inherited" should be the only property value for combining marks.

There have been some deviations from this over the years, for various 
reasons, and there are some properties (such as general category) where 
it is necessary to recognize the character as combining, but the general 
principle still holds.


Therefore, if you are trying to see whether a string is alphabetic, 
combining marks should be "transparent" to such an algorithm.


A./


On 5/28/2018 9:23 PM, Martin J. Dürst via Unicode wrote:

> Hello Sundar,
>
> On 2018/05/28 04:27, SundaraRaman R via Unicode wrote:
> > Hi,
> >
> > In languages like Ruby or Java
> > (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)),
> > functions to check if a character is alphabetic do that by looking for
> > the 'Alphabetic'  property (defined true if it's in one of the L
> > categories, or Nl, or has 'Other_Alphabetic' property). When parsing
> > Tamil text, this works out well for independent vowels and consonants
> > (which are in Lo), and for most dependent signs (which are in Mc or Mn
> > but have the 'Other_Alphabetic' property), but the very common pulli
> > (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to
> > concluding any string containing it to be non-alphabetic.
> >
> > This doesn't make sense to me since the Virama “◌்” is as much of an
> > alphabetic character as any of the "Dependent Vowel" characters which
> > have been given the 'Other_Alphabetic' property. Is there a rationale
> > behind this difference, or is it an oversight to be corrected?
>
> I suggest submitting an error report via 
> https://www.unicode.org/reporting.html. I haven't studied the issue in 
> detail (sorry, just no time this week), but it sounds reasonable to 
> give the VIRAMA the 'Other_Alphabetic' property.
>
> I'd recommend mentioning examples other than Tamil in your report 
> (assuming they exist).
>
> BTW, what's the method you are using in Ruby? If there's a problem in 
> Ruby (which I don't think; it's just using Unicode data), then please 
> make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I 
> should be able to follow up on that.
>
> Regards,   Martin.





Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Martin J. Dürst via Unicode

Hello Sundar,

On 2018/05/28 04:27, SundaraRaman R via Unicode wrote:

> Hi,
>
> In languages like Ruby or Java
> (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)),
> functions to check if a character is alphabetic do that by looking for
> the 'Alphabetic'  property (defined true if it's in one of the L
> categories, or Nl, or has 'Other_Alphabetic' property). When parsing
> Tamil text, this works out well for independent vowels and consonants
> (which are in Lo), and for most dependent signs (which are in Mc or Mn
> but have the 'Other_Alphabetic' property), but the very common pulli
> (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to
> concluding any string containing it to be non-alphabetic.
>
> This doesn't make sense to me since the Virama “◌்” is as much of an
> alphabetic character as any of the "Dependent Vowel" characters which
> have been given the 'Other_Alphabetic' property. Is there a rationale
> behind this difference, or is it an oversight to be corrected?


I suggest submitting an error report via 
https://www.unicode.org/reporting.html. I haven't studied the issue in 
detail (sorry, just no time this week), but it sounds reasonable to give 
the VIRAMA the 'Other_Alphabetic' property.


I'd recommend mentioning examples other than Tamil in your report 
(assuming they exist).


BTW, what's the method you are using in Ruby? If there's a problem in 
Ruby (which I don't think; it's just using Unicode data), then please 
make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I 
should be able to follow up on that.


Regards,   Martin.


Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Doug Ewell via Unicode

SundaraRaman R wrote:


> but the very common pulli (VIRAMA)
> is neither in Lo nor has 'Other_Alphabetic', and so leads to
> concluding any string containing it to be non-alphabetic.


Is this definition part of Unicode? I thought the use of General 
Category to answer questions like "this sequence is a word" or "this 
string is alphabetic" was much more complex than that. (I'm not even 
sure what the latter means, for any script with any sort of combining 
mark.)


Richard Wordingham wrote:


> The effects of virama that spring to mind are:
>
> (a) Causing one or both letters on either side to change or combine to
> indicate combination;
>
> (b) Appearing as a mark only if it does not affect one of the letters
> on either side;
>
> (c) Causing a left matra to appear on the left of the sequence of
> consonants joined by a sequence of non-visible viramas.


Most of these don't apply to Tamil, of course.

--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Richard Wordingham via Unicode
On Mon, 28 May 2018 20:03:11 +0530
SundaraRaman R via Unicode  wrote:

> Hi, thanks for your reply.
> 
> > There is only one character with a canonical combining class of 9
> > that is included as other_alphabetic, namely U+0E3A THAI CHARACTER
> > PHINTHU. That last had any of the other properties of viramas back
> > in Unicode 1.0; the characters that triggered such behaviours were
> > permanently removed in Unicode 1.1.  
> 
> I didn't understand the second sentence here, could you clarify?

Sorry, I messed that sentence up.  It should have read, "The last time
that that had any of the other properties of viramas was back
in Unicode 1.0;"

> What
> do you mean by "any of the other properties" here?

The effects of virama that spring to mind are:

(a) Causing one or both letters on either side to change or combine to
indicate combination;

(b) Appearing as a mark only if it does not affect one of the letters
on either side;

(c) Causing a left matra to appear on the left of the sequence of
consonants joined by a sequence of non-visible viramas.

> And "triggered such
> behaviours" seems to imply having them in other_alphabetic had
> negative consequences, could you give an example of what that might
> be?

Nowadays, the Thai syllable ไตร, normatively pronounced /trai/, is
only encoded , and the character
U+0E3A is always visible when used; for most routine purposes it is
little different to U+0E38 THAI CHARACTER SARA U.  However, in Unicode
1.0, while  was rendered as at present, the same
visible string could also be encoded as   - no glyph would be
rendered for U+0E3A. If one wanted the official Sanskritised Pali
version, one could type ไตฺร  as at
present.  One could also encode it as .

Weirdly, I couldn't have used the phonetically ordered vowel to type a
monk's name ending in มฺโม , as  would have been rendered as
โมฺม.

As the non-phonetic virama-like behaviours of U+0E3A are only mentioned
under the heading 'Alternate Ordering', I can only presume that they
were triggered by the phonetic order vowel signs, U+0E70 to U+0E74.

It is possible that U+0E3A acquired the alphabetic property because it
ceased to behave like a virama.  Alternatively, it may have acquired
the alphabetic property because of its use in the compound vowels of
minority languages.

> But in the case of Tamil, I'm curious why most other combining Tamil
> marks go in class 0, whereas pulli goes in 9. Even U+0B82 Anusvara, a
> character barely used in Tamil text, has combining class 0 and is
> included in Other_Alphabetic, but the visually similar and similarly
> positioned pulli is not. In this particular case, is it a historical
> accident that these got assigned this way, or is there a rationale
> behind these? Would it at all be possible to get this changed in the
> upcoming Unicode standard?

Tamil has usually been treated as just another Indian Indic script.
U+0E3A is the only virama-like character with the property of being
'alphabetic'.

I can't see a change making it into Unicode 11.0.  It requires too much
careful thought.  Besides, anything that considered  as
alphabetic should also consider  as alphabetic - they
should be mostly interchangeable in Tamil.

> > I fear that the correct test for what you want is to split text into
> > words and check that each word begins with an alphabetic
> > character.  
> 
> Do you mean "each grapheme cluster begins with an alphabetic
> character" here? It seems to me (in my very limited Unicode knowledge)
> that such a test, going through grapheme clusters and checking the
> first codepoint in each, would also ensure the text is fully
> alphabetic.

Not directly.  Is the string "mark2mark" alphabetic?  It constitutes a
single word.  My suggested simplification would say 'no', as it
contains '2'; perhaps my simplification is wrong.

> And it has the advantage that more languages have a
> (relatively) easy way for splitting text into grapheme clusters, than
> for checking minor Unicode properties like WordBreak, so this one
> might be easier to implement. Is this test anywhere in the ballpark
> of being right?

Yes, it's close to being right.  Note that simple approximations for SE
Asian word-breaking (e.g. treating SE Asian characters as
alphabetic) should work well for your application.

Richard.



Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread SundaraRaman R via Unicode
Hi, thanks for your reply.

> There is only one character with a canonical combining class of 9 that
> is included as other_alphabetic, namely U+0E3A THAI CHARACTER PHINTHU.
> That last had any of the other properties of viramas back in Unicode
> 1.0; the characters that triggered such behaviours were permanently
> removed in Unicode 1.1.

I didn't understand the second sentence here, could you clarify? What
do you mean by "any of the other properties" here? And "triggered such
behaviours" seems to imply having them in other_alphabetic had
negative consequences, could you give an example of what that might
be?

> There are some notable absences from the combining marks included.
> Significant absences include ZWJ, ZWNJ and CGJ.
>
> However, a non-erroneous *conformant* Unicode process cannot
> always determine whether a string, given only that it is a string, is
> composed only of alphabetic characters.  The answer would be 'yes' for
>  but 'no' for the canonically
> equivalent !
> (U+0327 is not included as alphabetic either.)
>
> There is at least one combination of Latin letter and combining mark
> that occurs in the normal orthography of a natural language and does not
> have a precomposed equivalent.

Ah, that's somewhat unfortunate that such a quick and easy alphabetic
check is not possible in the general case, but I can understand how it
might be weird to give the Alphabetic property to a ZWJ or ZWNJ.

But in the case of Tamil, I'm curious why most other combining Tamil
marks go in class 0, whereas pulli goes in 9. Even U+0B82 Anusvara, a
character barely used in Tamil text, has combining class 0 and is
included in Other_Alphabetic, but the visually similar and similarly
positioned pulli is not. In this particular case, is it a historical
accident that these got assigned this way, or is there a rationale
behind these? Would it at all be possible to get this changed in the
upcoming Unicode standard?
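
For readers who want to check the values being compared here, the following minimal Python sketch prints the General_Category and canonical combining class of the characters under discussion. It uses only the standard `unicodedata` module; the helper name `describe` is made up for illustration. Other_Alphabetic itself is not available from the standard library, so it is not shown.

```python
import unicodedata

def describe(ch):
    """Return (code point, name, General_Category, canonical combining class)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )

# Tamil anusvara, Tamil vowel sign I, Tamil pulli, Thai phinthu
for ch in ("\u0b82", "\u0bbf", "\u0bcd", "\u0e3a"):
    print(describe(ch))
```

The anusvara and the pulli are both nonspacing marks (Mn); the difference visible from this data is the combining class, 0 for the anusvara and 9 for the pulli, the same class as U+0E3A THAI CHARACTER PHINTHU.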

(By the way, I'm happy to get a link to read through for any of my
questions here. I just find it quite hard to search for and find past
discussions and decision rationales regarding these, not knowing how
and where to search for them.)


> I fear that the correct test for what you want is to split text into
> words and check that each word begins with an alphabetic character.

Do you mean "each grapheme cluster begins with an alphabetic
character" here? It seems to me (in my very limited Unicode knowledge)
that such a test, going through grapheme clusters and checking the
first codepoint in each, would also ensure the text is fully
alphabetic. And it has the advantage that more languages have a
(relatively) easy way for splitting text into grapheme clusters, than
for checking minor Unicode properties like WordBreak, so this one
might be easier to implement. Is this test anywhere in the ballpark
of being right?

Regards,
Sundar


Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Richard Wordingham via Unicode
On Mon, 28 May 2018 00:57:03 +0530
SundaraRaman R via Unicode  wrote:

> Hi,
> 
> In languages like Ruby or Java
> (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)),
> functions to check if a character is alphabetic do that by looking for
> the 'Alphabetic'  property (defined true if it's in one of the L
> categories, or Nl, or has 'Other_Alphabetic' property). When parsing
> Tamil text, this works out well for independent vowels and consonants
> (which are in Lo), and for most dependent signs (which are in Mc or Mn
> but have the 'Other_Alphabetic' property), but the very common pulli
> (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to
> concluding any string containing it to be non-alphabetic.
> 
> This doesn't make sense to me since the Virama “◌்” is as much of an
> alphabetic character as any of the "Dependent Vowel" characters which
> have been given the 'Other_Alphabetic' property. Is there a rationale
> behind this difference, or is it an oversight to be corrected?

There is only one character with a canonical combining class of 9 that
is included as other_alphabetic, namely U+0E3A THAI CHARACTER PHINTHU.
That last had any of the other properties of viramas back in Unicode
1.0; the characters that triggered such behaviours were permanently
removed in Unicode 1.1.

There are some notable absences from the combining marks included.
Significant absences include ZWJ, ZWNJ and CGJ.

However, a non-erroneous *conformant* Unicode process cannot
always determine whether a string, given only that it is a string, is
composed only of alphabetic characters.  The answer would be 'yes' for
 but 'no' for the canonically
equivalent !
(U+0327 is not included as alphabetic either.)
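
The effect can be reproduced with one assumed pair: ç as the single code point U+00E7 versus c followed by U+0327 COMBINING CEDILLA. The choice of letter is an assumption made for this illustration. The Python sketch below uses only the standard `unicodedata` module, and `all_letters` is a hypothetical stand-in for an Alphabetic test that looks at General_Category alone (L* or Nl); it agrees with the real property for the characters used here.

```python
import unicodedata

def all_letters(text):
    """Stand-in for a per-character Alphabetic test: category L* or Nl only."""
    return all(
        unicodedata.category(ch).startswith("L") or unicodedata.category(ch) == "Nl"
        for ch in text
    )

composed = "\u00e7"                                  # precomposed c with cedilla
decomposed = unicodedata.normalize("NFD", composed)  # c + U+0327

print([f"U+{ord(ch):04X}" for ch in decomposed])
print(all_letters(composed), all_letters(decomposed))
print(unicodedata.normalize("NFC", decomposed) == composed)
```

The two strings are canonically equivalent, yet a per-character test answers differently for each, so a process that must treat equivalent strings alike cannot rely on it.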

There is at least one combination of Latin letter and combining mark
that occurs in the normal orthography of a natural language and does not
have a precomposed equivalent.

I fear that the correct test for what you want is to split text into
words and check that each word begins with an alphabetic character.
That test can be made by a conformant process.  I think, but have not
checked, that the test can be simplified to:

(a) Check that the first character is alphabetic.

(b) Ignore every character with a WordBreak property of Extend or ZWJ

(c) Check that all other characters are alphabetic.
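
Steps (a) to (c) can be sketched in Python as follows, under loud assumptions: the standard library exposes neither the Alphabetic property nor WordBreak, so `is_letter` approximates Alphabetic by General_Category (L* or Nl), and `is_ignorable` approximates "WordBreak of Extend or ZWJ" as any combining mark (Mn, Mc, Me) plus ZWNJ and ZWJ. All three function names are hypothetical. A faithful version would read DerivedCoreProperties.txt and WordBreakProperty.txt instead.

```python
import unicodedata

ZWNJ, ZWJ = "\u200c", "\u200d"

def is_letter(ch):
    # Stand-in for Alphabetic using General_Category alone.
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nl"

def is_ignorable(ch):
    # Rough stand-in for WordBreak=Extend or ZWJ: combining marks and joiners.
    return unicodedata.category(ch).startswith("M") or ch in (ZWNJ, ZWJ)

def looks_alphabetic(word):
    """Apply steps (a)-(c) to a single word."""
    if not word or not is_letter(word[0]):                   # (a)
        return False
    rest = [ch for ch in word[1:] if not is_ignorable(ch)]   # (b)
    return all(is_letter(ch) for ch in rest)                 # (c)

print(looks_alphabetic("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"))  # Tamil word ending in pulli
print(looks_alphabetic("mark2mark"))                       # contains a digit
print(looks_alphabetic("b\u20dde\u20dd"))                  # letters in enclosing circles
```

Under these approximations the pulli no longer causes a rejection, a string containing a digit is rejected, and a letter followed by an enclosing circle is accepted because the enclosing mark is skipped in step (b).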

Richard.