On Mon, 11 Dec 2017 21:45:23 +0000
Cibu Johny (സിബു) <c...@google.com> wrote:

> I am assuming the purpose of the grapheme cluster definition is to be
> used line spacing, vertical writing or cursor movement. Without
> defining the purpose, it is hard for me to say if a ruleset is valid
> or not. Assuming that purpose driven definition, we probably need
> language specific definitions - a pan-indic algorithm may not work.
> For instance, the proposed ruleset, may not hold good for Tamil. For
> example, see the title in the following image: துக்ளக் broken as
> [ta-u, ka-virama, lla, ka-virama]. However, as per the proposed
> algorithm it would be: [ta-u, ka-virama-lla, ka-virama]
>
> http://www.chennaispider.com/attachments/Resources/3486-7144-Thuglak-Tamil-Magazine-Chennai.jpg

I think Tamil is actually rather straightforward. For native
intuition, I would cite the Tamil letter-counting account at
https://venkatarangan.com/blog/content/binary/Counting%20Letters%20in%20an%20Unicode%20String.pdf.
What the author counts is not spacing glyphs, but vowel letters and
consonant characters, with two significant modifications. Firstly,
K.SSA counts as just one consonant, and SH.R.II is also counted as
containing a single consonant. In other words, the Tamil virama
character works as a pure killer except in those two environments.
This is also the story the TUNE protagonists tell us. It will be an
inelegant rule for UAX#29, but, unfortunately, reality is messy.

> Malayalam could be a similar story. In case of Malayalam, it can be
> font specific because of the existence of traditional and reformed
> writing styles. A conjunct might be a ligature in traditional; and it
> might get displayed with explicit virama in the reformed style. For
> example see the poster with word ഉസ്താദ് broken as [u, sa-virama,
> ta-aa, da-virama] - as it is written in the reformed style. As per
> the proposed algorithm, it would be [u, sa-virama-ta-aa, da-virama].
> These breaks would be used by the traditional style of writing.

Working round that seems to be tricky. The best I can think of is to
have two different locales, traditional and reformed, and hope that
the right font is selected. It doesn't seem at all straightforward to
work out what the font is doing even from a character to glyph map
without knowing what the glyphs are. I'm not sure how one should have
the difference designated - language variants, or two scripts?
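
To make the two tailorings concrete, here is a minimal Python sketch.
It is not the L2/17-200 ruleset and not UAX#29; it only illustrates
the idea of a per-locale decision on whether a virama joins the
following consonant. The locale tags ("ta", "ml-traditional",
"ml-reformed"), the function names and the choice of U+0BB6 for SH
are my own assumptions, and treating reformed Malayalam as breaking
after every virama is a deliberate oversimplification.

```python
import unicodedata

# Tamil and Malayalam virama characters.
VIRAMAS = {"\u0BCD", "\u0D4D"}

# The two Tamil environments where virama is not a pure killer:
# K.SSA (KA VIRAMA SSA) and SH.R.II (SHA VIRAMA RA II).
# Assumption: SH is encoded as U+0BB6 TAMIL LETTER SHA.
TAMIL_CONJUNCTS = ("\u0B95\u0BCD\u0BB7", "\u0BB6\u0BCD\u0BB0\u0BC0")


def tamil_joins(text, start, i):
    """Join at index i only inside one of the listed conjuncts."""
    return any(
        text.startswith(seq, start) and i < start + len(seq)
        for seq in TAMIL_CONJUNCTS
    )


# Hypothetical locale tags mapped to a "does virama join?" predicate.
TAILORINGS = {
    "ta": tamil_joins,
    "ml-traditional": lambda text, start, i: True,   # conjunct ligatures
    "ml-reformed": lambda text, start, i: False,     # explicit virama
}


def clusters(text, locale):
    """Split text into clusters under the given locale's tailoring."""
    joins = TAILORINGS[locale]
    out = []
    start = 0
    for i in range(1, len(text)):
        ch = text[i]
        # Combining marks (vowel signs, virama) stay with their base.
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        after_virama = text[i - 1] in VIRAMAS
        if is_mark or (after_virama and joins(text, start, i)):
            continue
        out.append(text[start:i])
        start = i
    if text:
        out.append(text[start:])
    return out
```

With this, clusters("துக்ளக்", "ta") gives the four-way split seen in
the magazine title, while ഉസ്താദ് splits as [u, sa-virama, ta-aa,
da-virama] under "ml-reformed" and as [u, sa-virama-ta-aa, da-virama]
under "ml-traditional". Choosing between the last two is exactly the
part that cannot be done from the text alone.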
>
> [image: image.png]
> https://upload.wikimedia.org/wikipedia/en/6/64/Ustad_Hotel_%282012%29_-_Poster.jpg
> BTW, there is an example with explicit virama in the proposal under
> the Sanskrit section:

The alleged grapheme cluster is the last cluster of the second word
in the Sanskrit section of L2/17-200, "Recommendations to UTC #152 on
Text segmentation in Indian languages"
(https://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf). The
rendering seems odd if there is no ZWNJ in the word. I read the word
as प्प्रप॑द्ये॒ pprapadye with two pitch accents. However, I can't
explain the visible virama under the DA - even a Hindi font should
have a conjunct for D.YA.

Richard.