On 5/29/2018 12:49 AM, Richard Wordingham via Unicode wrote:
How would one know that they are misapplied?  And what if the author of
the text has broken your rules? Are such texts never to be transcribed
to pukka Unicode?

Applying Tamil -ii (0BC0, Script=Tamil) to the Latin letter a (0061, Script=Latin) doesn't automatically make the Tamil vowel "inherit" the Latin script property value, nor should it.
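To make that concrete, here is a minimal sketch, assuming a regex engine with ECMAScript Unicode property escapes (ES2018 or later). The helper name `scriptOf` and the short list of scripts are made up for illustration. It looks up the Script property one code point at a time, which is all the property defines:

```typescript
// Hypothetical helper: report the Script property value of one code point,
// checking only a handful of scripts (Script, not Script_Extensions).
const SCRIPTS = ["Latin", "Tamil", "Common", "Inherited"];

function scriptOf(ch: string): string {
  for (const name of SCRIPTS) {
    // Pattern is built from a string, so it is compiled at run time.
    if (new RegExp("^\\p{Script=" + name + "}$", "u").test(ch)) {
      return name;
    }
  }
  return "Other";
}

const latinA = "\u0061";        // LATIN SMALL LETTER A
const tamilII = "\u0BC0";       // TAMIL VOWEL SIGN II
const combiningAcute = "\u0301"; // COMBINING ACUTE ACCENT

// Each code point keeps its own value, whatever it follows in the text.
const scripts = [latinA, tamilII, combiningAcute].map((ch) => scriptOf(ch));
console.log(scripts); // ["Latin", "Tamil", "Inherited"]
```

Only marks whose Script value is Inherited, such as U+0301, are meant to take on the script of their base. The Tamil vowel sign has an explicit Script=Tamil and keeps it after a Latin base.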

That said, if someone decides they want that sequence, and their text has "broken my rules", so be it. I'm just not going to assume anything particular about that text. Note that in terms of trying to determine whether such a string is (naively) alphabetic, such a sequence doesn't interfere with the determination. On the other hand, a process concerned about text runs, script assignment, validity for domains, or other such issues *will* be sensitive to such a boundary -- and should not be overruled by some generic determination that combining marks inherit all the properties of their base.

Even without knowing exactly what is wanted, it looks to me as though
it isn't.  If he wants to allow <pulli, ZWNJ> as a substring, which
he should, then that fails because there is no overlap between
\p{extender} and \p{gc=Cf} or between \p{diacritic} and \p{gc=Cf}.

Yes, so if you are working with strings for Indic scripts (or for that matter, Arabic), you add Join_Control to the mix:

Alphabetic ∪ Diacritic ∪ Extender ∪ Join_Control

gets you a decent approximation of what is (naively) expected to fall within an "alphabetic" string for most scripts.
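The union can be sketched as a single character class, assuming ECMAScript Unicode property escapes (ES2018 or later). The names `NAIVE_ALPHABETIC`, `WITHOUT_JOINERS` and `isNaivelyAlphabetic` are hypothetical, and this is only the naive approximation described above, not a validity check of any kind:

```typescript
// Alphabetic ∪ Diacritic ∪ Extender ∪ Join_Control, one or more code points.
const NAIVE_ALPHABETIC = new RegExp(
  "^[\\p{Alphabetic}\\p{Diacritic}\\p{Extender}\\p{Join_Control}]+$", "u");

// The same union without Join_Control, for comparison.
const WITHOUT_JOINERS = new RegExp(
  "^[\\p{Alphabetic}\\p{Diacritic}\\p{Extender}]+$", "u");

function isNaivelyAlphabetic(s: string): boolean {
  return NAIVE_ALPHABETIC.test(s);
}

// Tamil ka, pulli (virama), ZWNJ, ssa: contains the <pulli, ZWNJ> substring.
const pulliZwnj = "\u0B95\u0BCD\u200C\u0BB7";

console.log(isNaivelyAlphabetic(pulliZwnj));    // true
console.log(WITHOUT_JOINERS.test(pulliZwnj));   // false: ZWNJ is gc=Cf
console.log(isNaivelyAlphabetic("a\u0BC0"));    // true: mixed script, still "alphabetic"
console.log(isNaivelyAlphabetic("a b"));        // false: space is outside the union
```

The third case is the Latin a plus Tamil -ii sequence from earlier: it passes the naive alphabetic test, which is why script-run or domain-validity checks have to be done separately.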

For those following along, Alphabetic is roughly meant to cover the ABC, かきくけこ,... plus ideographic elements of most scripts. Diacritic picks up most of the applied combining marks, including nuktas, viramas, and tone marks. Extender picks up spacing elements that indicate length, reduplication, iteration, etc. And joiners are, well, joiners.
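For a quick look at which of the four properties a given character carries, something along these lines may help. It is a sketch under the same assumption of ECMAScript property escapes; `propertiesOf` and the sample list are invented for illustration, and the exact results depend on the Unicode version the engine ships with:

```typescript
const PROPERTIES = ["Alphabetic", "Diacritic", "Extender", "Join_Control"];

// Hypothetical helper: list which of the four properties one code point has.
function propertiesOf(ch: string): string[] {
  return PROPERTIES.filter((name) =>
    new RegExp("^\\p{" + name + "}$", "u").test(ch));
}

// A few sample characters, each paired with its Unicode name.
const samples: [string, string][] = [
  ["\u0061", "LATIN SMALL LETTER A"],
  ["\u304B", "HIRAGANA LETTER KA"],
  ["\u093C", "DEVANAGARI SIGN NUKTA"],
  ["\u0BCD", "TAMIL SIGN VIRAMA (pulli)"],
  ["\u3005", "IDEOGRAPHIC ITERATION MARK"],
  ["\u200D", "ZERO WIDTH JOINER"],
];

for (const [ch, name] of samples) {
  console.log(name + ": " + propertiesOf(ch).join(", "));
}
```

A character can land in more than one set, which is harmless when the sets are only being unioned.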

If one wants finer categorization specifically for Indic scripts, then I would suggest turning to the Indic_Syllabic_Category property instead of a union of PropList.txt properties and/or some twiddling with General_Category values.

