On Fri, 21 Apr 2017 16:27:43 -0700 Manish Goregaokar via Unicode <unicode@unicode.org> wrote:

> > Do Hindi speakers really think of orthographic syllables as
> > characters?
>
> When rendered as a cluster, yes? I've asked around, and folks seem to
> insist on coupling it to the rendering.

That argues that it's a unit, which I don't think is in dispute. Words
are also units, and nowadays we don't normally insist that one retype a
word just to change one bit of it.

> Given most fonts render
> *normal* (common, etc) clusters, I think making them EGCs and looking
> at nonrendered clusters the same way we do family emoji is fine
> (family emojis of length 5 are a single EGC, but that's not what's
> actually perceived by the user, but it's a use case that's very rare
> in the wild, so it doesn't matter).

That depends on the language. In the Tai Tham script, even without
consonant clusters one can get 5 graphic characters in a syllable, e.g.
ᨧᩮᩢ᩶ᩣ _cao_ <HIGH CA, SIGN E, MAI SAT, TONE-2, SIGN AA> 'lord; you
(polite)', and when one adds consonant clusters one easily gets
monosyllables like ᨠᩖ᩠᩶ᩅ᩠ᨿ _kluai_ <HIGH KA, MEDIAL LA, SAKOT, WA,
TONE-2, SAKOT, LOW YA> 'banana' with 5 graphic characters and
additionally 2 coengs. (One can distinguish Pali from the Tai languages
simply by the density of the ink!) At present these are split into two
and three grapheme clusters respectively, and LibreOffice cursor
movement responds accordingly. (SIGN AA starts a grapheme cluster in
several scripts of further India.)

However, if one teaches the Emacs editor what a Tai Tham syllable is, so
that it can use the M17n rendering library, the cursor then advances
syllable by syllable, which is unpleasant for imperfect typists.
Fortunately, it's possible to add functions to Emacs to allow it to
advance character-by-character; I forget if one has to also add a few
code changes. (The downside is that text either side of the cursor is
rendered independently, which can be a nuisance when editing very long
lines.)

> The way I see it, the current
> system is wrong, and so would the proposed system of not breaking at
> viramas (or not breaking at viramas followed by a consonant if we want
> to be more precise), but the proposed system would be wrong much less
> often.
> I am only talking about Devanagari, though scripts like
> Bangla/Gujrati/Gurmukhi may have similar needs. Breaking on ZWNJ seems
> sensible.

Indeed, viramas (InSC=Virama) will have to be handled case-by-case. One
should continue to break after pulli (U+0BCD TAMIL SIGN VIRAMA) except
for the cases of the ligatures/conjuncts. I don't know if there are
obscure cases, or whether it's only _shri_ and <KA, SSA> for which one
should not break just because of the virama.

Continuation after coengs (InSC=Invisible_Stacker) should be automatic.
Malayalam will need customisation. Definitions by codepoints are only a
fallback, for when a font cannot be used to guide the process.

Formally, normalisation is a problem, as these characters can be
separated from letters by other marks. This is a problem in practice for
normalised text in Tai Tham.

Pure killers (InSC=Pure_Killer) should probably be given no special
treatment, as at present, by default, though I wonder if we should
define orthographic syllables for Pali in Thai script. The two
orthographies will need different rules, and renderers won't help.

Defining orthographic syllables for languages in the Latin script is
probably excessive.

Richard.
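
For concreteness, the quoted proposal for Devanagari (do not break after
a virama that is followed by a consonant, but do break when ZWNJ
intervenes) might be sketched roughly as below. This is only an
illustration under simplifying assumptions, not the UAX #29 algorithm:
the baseline clustering just attaches combining marks and ZWJ/ZWNJ to
the preceding character, the consonant ranges are a hand-picked subset,
and all function names are made up for this sketch. Virama followed by
ZWJ is deliberately left unhandled.

```python
import unicodedata

VIRAMA = "\u094d"   # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200c"     # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"      # ZERO WIDTH JOINER


def is_devanagari_consonant(ch):
    # Assumption: only the main consonant block and the nukta consonants.
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095f"


def is_extender(ch):
    # Simplified stand-in for "characters that never start a cluster".
    return ch in (ZWNJ, ZWJ) or unicodedata.category(ch) in ("Mn", "Mc", "Me")


def baseline_clusters(text):
    # Crude approximation of the current behaviour: base + trailing marks.
    clusters = []
    for ch in text:
        if clusters and is_extender(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def orthographic_clusters(text):
    # Proposed rule: join a cluster ending in virama to a following
    # cluster that starts with a consonant. A cluster ending in ZWNJ
    # does not end in virama, so the break is kept there.
    merged = []
    for cluster in baseline_clusters(text):
        if (merged and merged[-1].endswith(VIRAMA)
                and is_devanagari_consonant(cluster[0])):
            merged[-1] += cluster
        else:
            merged.append(cluster)
    return merged


if __name__ == "__main__":
    for word in ("\u0915\u094d\u0937", "\u0915\u094d\u200c\u0937"):
        print([c.encode("unicode_escape").decode() for c in orthographic_clusters(word)])
```

With this sketch <KA, VIRAMA, SSA> comes out as one unit, while
<KA, VIRAMA, ZWNJ, SSA> stays as two. Tamil pulli, Malayalam and
normalised Tai Tham text would each need their own rules, as noted
above.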