On Thu, 24 Apr 2014 19:38:54 +0000 "Whistler, Ken" <[email protected]> wrote:
> Yes. Grapheme_Extend characters per se do not "apply" to anything.
> They are a mixture of different General_Category types -- mostly
> combining marks, but not all. The concept of applying to a base only
> refers to combining marks proper.
>
> The proper use of the Grapheme_Extend property is in the context of
> the text segmentation algorithms defined in UAX #29,

<snip>

A watertight definition of a grapheme cluster is probably impossible.
The precise definition of the legacy grapheme cluster is crafted so
that the process of splitting a string of characters into legacy
grapheme clusters is invariant under canonical equivalence.

The various Indic AA vowels that are Other_Grapheme_Extend are there
because they are also the second parts of the canonical decompositions
of multipart Indic vowels, most typically OO. However, diametrically
opposite approaches were taken in the Myanmar and Khmer scripts. In
the Myanmar script, the two-part vowel symbol must be encoded as two
separate characters, as in the various Tai scripts. In the Khmer
script, the two parts are encoded as a single character. Most of the
scripts of India allow both approaches; Devanagari is the most notable
exception, and the multipart vowels there are primarily used for an
archaic style.

Thus U+09BE BENGALI VOWEL SIGN AA is intended to 'apply to' U+09C7
BENGALI VOWEL SIGN E, and it is only in the interests of simplicity
and consistency that <U+0995 BENGALI LETTER KA, U+09BE BENGALI VOWEL
SIGN AA> is a grapheme cluster but <U+0995, U+09C0 BENGALI VOWEL SIGN
II> is not.

Richard Ishida points out in one of his web pages that the practical
definition of a grapheme cluster may actually depend on the font.

> http://www.unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table
>
> See that document for the proper use. They are relevant to the
> determination of grapheme cluster boundaries.
>
> And by the way, it is a very bad idea to be writing a program to just
> unilaterally strip away grapheme extenders from input strings.

Thank you, Ken and Doug, for making that point.

Richard.
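
To make the invariance argument concrete, here is a rough Python sketch.
It is not the UAX #29 algorithm: it applies only the "do not break before
an extender" rule and ignores CR LF, controls, Hangul and everything else.
The set OTHER_GRAPHEME_EXTEND is a hand-picked two-character subset of the
real property, and the names is_extend and legacy_clusters are made up for
this illustration.

```python
import unicodedata

# Assumption: a hand-picked subset of Other_Grapheme_Extend, limited to
# the two Bengali characters that end canonical decompositions
# (U+09BE VOWEL SIGN AA and U+09D7 AU LENGTH MARK).
OTHER_GRAPHEME_EXTEND = {"\u09BE", "\u09D7"}


def is_extend(ch):
    """Approximate Grapheme_Extend: Mn, Me, plus the subset above."""
    return unicodedata.category(ch) in ("Mn", "Me") or ch in OTHER_GRAPHEME_EXTEND


def legacy_clusters(text):
    """Split text into clusters using only the 'no break before Extend' rule."""
    clusters = []
    for ch in text:
        if clusters and is_extend(ch):
            clusters[-1] += ch   # attach the extender to the previous cluster
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters


KA, AA, II, E, O = "\u0995", "\u09BE", "\u09C0", "\u09C7", "\u09CB"

composed = KA + O                                      # <KA, VOWEL SIGN O>
decomposed = unicodedata.normalize("NFD", composed)    # <KA, E, AA>

# Both forms break after KA and nowhere else, because AA attaches to E.
print(legacy_clusters(composed))
print(legacy_clusters(decomposed))

# <KA, AA> stays together; <KA, II> splits, since II is a spacing mark
# that is not in Grapheme_Extend.
print(len(legacy_clusters(KA + AA)), len(legacy_clusters(KA + II)))
```

If AA were not treated as an extender, the decomposed string would gain a
break between E and AA that has no counterpart in the composed string,
which is the inconsistency the property assignment avoids.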

