On Tue, 29 May 2018 07:27:21 -0700 Ken Whistler via Unicode <unicode@unicode.org> wrote:
> On 5/29/2018 12:49 AM, Richard Wordingham via Unicode wrote:
>
> > How would one know that they are misapplied? And what if the
> > author of the text has broken your rules? Are such texts never to
> > be transcribed to pukka Unicode?
>
> Applying Tamil -ii (0BC0, Script=Tamil) to the Latin letter a (0061,
> Script=Latin) doesn't automatically make the Tamil vowel "inherit"
> the Latin script property value, nor should it.

It's the sort of process that gave us U+0310 COMBINING CANDRABINDU.
However, I see adding SE Asian dependent vowels to Latin letter x
(U+0078, Script=Latin) as rather tending to make 'x' Script=Common.
Others have disagreed quite vehemently. I see that the view that the
base character is U+00D7 MULTIPLICATION SIGN
(InSC=Consonant_Placeholder) has prevailed. Serifed U+00D7 is quite
common in manually typewritten material; I remember it from school.
I'm not sure what script the sequence <U+00D7, U+0EB5 LAO VOWEL SIGN
II> belongs to in OpenType layout. I ought to find out for the benefit
of Tai Tham fonts.

> That said, if someone decides they want that sequence, and their text
> has "broken my rules", so be it. I'm just not going to assume anything
> particular about that text. Note that in terms of trying to determine
> whether such a string is (naively) alphabetic, such a sequence
> doesn't interfere with the determination. On the other hand, a
> process concerned about text runs, script assignment, validity for
> domains, or other such issues *will* be sensitive to such a boundary
> -- and should not be overruled by some generic determination that
> combining marks inherit all the properties of their base.

When it comes to script runs for rendering, such a rule feels
oppressive; it is widely unenforced. For example, I have found that if
my font treats U+0E4A THAI CHARACTER MAI TRI as a Tai Tham character,
it will generally render satisfactorily on a Tai Tham character.
Presumably I can now use a few examples of the same Northern Thai
syllable on the same page in a published language-teaching book as
evidence for adding its clone to the Tai Tham script. There should
also be some examples of U+0ECA LAO TONE MAI TI on Lao Tai Tham
syllables, but I haven't found any yet. See the chart at the end of
"Exemple d’écriture ignorée par Unicode : l’écriture tham du Laos"
http://www.laosoftware.com/download/articleTALN.pdf
for an implicit claim of existence.

> > Even without knowing exactly what is wanted, it looks to me as
> > though it isn't. If he wants to allow <pulli, ZWNJ> as a
> > substring, which he should, then that fails because there is no
> > overlap between \p{extender} and \p{gc=Cf} or between \p{diacritic}
> > and \p{gc=Cf}.
>
> Yes, so if you are working with strings for Indic scripts (or for
> that matter, Arabic), you add Join_Control to the mix:
>
> Alphabetic ∪ Diacritic ∪ Extender ∪ Join_Control
>
> gets you a decent approximation of what is (naively) expected to fall
> within an "alphabetic" string for most scripts.

but won't work for collatable Welsh 'Llan͏gollen'! (There's a CGJ
between the 'n' and the 'g'.) One also needs Join_Control for fraktur
German and, to my mind, English 'Caesar'.

> For those following along, Alphabetic is roughly meant to cover the
> ABC, かきくけこ,... plus ideographic elements of most scripts.
> Diacritic picks up most of the applied combining marks, including
> nuktas, viramas, and tone marks. Extender picks up spacing elements
> that indicate length, reduplication, iteration, etc. And joiners are,
> well, joiners.

'Diacritic' mostly includes marks with secondary collation weight;
those with primary weights, such as Indic dependent vowels, are mopped
up in Alphabetic. (Removing diacritics is very much not the same as
removing combining marks.)
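
For those who want to experiment with the union quoted above, here is a
minimal sketch in TypeScript. It assumes a regular-expression engine
with Unicode property escapes (ECMAScript 2018 or later), which exposes
Alphabetic, Diacritic, Extender and Join_Control as binary properties.
The function name isNaivelyAlphabetic and the sample strings are mine,
purely for illustration; the results depend on the Unicode version the
engine ships with.

```typescript
// Sketch only: test whether every code point of a string lies in
// Alphabetic ∪ Diacritic ∪ Extender, optionally ∪ Join_Control.
// isNaivelyAlphabetic is a hypothetical helper, not a library function.
function isNaivelyAlphabetic(text: string, allowJoinControl: boolean = true): boolean {
  const classes: string[] = ["\\p{Alphabetic}", "\\p{Diacritic}", "\\p{Extender}"];
  if (allowJoinControl) {
    classes.push("\\p{Join_Control}");
  }
  // Built with the RegExp constructor so the "u" flag and property
  // escapes are resolved by the runtime rather than the compiler.
  const pattern = new RegExp("^[" + classes.join("") + "]+$", "u");
  return pattern.test(text);
}

// <LATIN SMALL LETTER A, TAMIL VOWEL SIGN II>
const tamilIiOnLatinA = "a\u0BC0";
// <TAMIL LETTER KA, TAMIL SIGN VIRAMA (pulli), ZWNJ>
const pulliZwnj = "\u0B95\u0BCD\u200C";
// 'Llangollen' with U+034F COMBINING GRAPHEME JOINER after the first 'n'
const llangollenWithCgj = "Llan\u034Fgollen";

console.log(isNaivelyAlphabetic(tamilIiOnLatinA));    // mixed scripts do not interfere
console.log(isNaivelyAlphabetic(pulliZwnj, false));   // fails without Join_Control
console.log(isNaivelyAlphabetic(pulliZwnj));          // passes with Join_Control
console.log(isNaivelyAlphabetic(llangollenWithCgj));  // CGJ is in none of the four sets
```

The mixed Latin/Tamil sequence passes because the test looks only at
these binary properties and never at Script, whereas the CGJ in the
Welsh example has gc=Mn but is in none of the four sets.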
> If one wants finer categorization specifically for Indic scripts,
> then I would suggest turning to the Indic_Syllabic_Category property
> instead of a union of PropList.txt properties and/or some twiddling
> with General_Category values.

You'd still need to add gc=L to catch things like U+0971 DEVANAGARI
SIGN HIGH SPACING DOT (which starts syllables) and U+A8F4 DEVANAGARI
SIGN DOUBLE CANDRABINDU VIRAMA. And you'd still miss U+0303 COMBINING
TILDE and U+0331 COMBINING MACRON BELOW from Thai script Pattani
Malay - I need to make another attempt to get them appropriate Indic
syllabic category values.

Richard.