> On 23 Apr 2014, at 22:16, Mathias Bynens <[email protected]> wrote:
>
> > Let’s say I’m writing a program that strips combining characters and
> > grapheme extenders from an input string.
> >
> > For combining marks, I’m looking for any non-combining marks (e.g. `a`)
> > followed by one or more combining marks (e.g. `̃`), and then I remove
> > everything but the non-combining mark (e.g. leaving only `a`). Is this a
> > correct approach?
> >
> > What should the approach be for grapheme extenders? Should the
> > program only look for `Grapheme_Base` characters followed by
> > `Grapheme_Extend` characters (which includes the code points in
> > `Other_Grapheme_Extend`)?
>
> The email subject should have been “Do `Grapheme_Extend` characters only
> apply to `Grapheme_Base`?” Sorry for the confusion.
>
> Does anyone know the answer?

Yes. Grapheme_Extend characters per se do not "apply" to anything. They
are a mixture of different General_Category types -- mostly combining
marks, but not all. The concept of applying to a base only refers to
combining marks proper.

The proper use of the Grapheme_Extend property is in the context of the
text segmentation algorithms defined in UAX #29, and in particular:

http://www.unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table

See that document for the proper use. Grapheme extenders are relevant to
the determination of grapheme cluster boundaries.

And by the way, it is a very bad idea to write a program that just
unilaterally strips grapheme extenders from input strings. In particular,
many dependent vowels in Indic scripts are defined as grapheme extenders.
If you strip them away, the input string will just end up as random trash.
That is very, very different from something that is trying to strip
diacritics and accent marks off of Latin letters.

--Ken
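
To make the contrast above concrete, here is a minimal Python sketch, not a
recommended implementation. It rests on several assumptions that are not in
the original thread: Python's `unicodedata` module does not expose
Grapheme_Extend, so the sketch uses General_Category Mn (nonspacing mark) as
a rough stand-in; it treats a character as "Latin" if its Unicode name starts
with `LATIN`, which is only a heuristic; and both function names are
hypothetical. Proper grapheme cluster handling should follow UAX #29.

```python
import unicodedata


def strip_all_nonspacing_marks(text: str) -> str:
    """Naive approach: drop every nonspacing mark (Mn), whatever the script."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed if unicodedata.category(ch) != "Mn"]
    return unicodedata.normalize("NFC", "".join(kept))


def strip_latin_diacritics(text: str) -> str:
    """Drop nonspacing marks only when the preceding base is a Latin letter.

    Heuristic (an assumption of this sketch): a base counts as Latin if its
    Unicode character name starts with "LATIN".
    """
    out = []
    base_is_latin = False
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) == "Mn":
            if not base_is_latin:
                out.append(ch)  # keep marks on non-Latin bases
            # marks on Latin bases are skipped
        else:
            base_is_latin = unicodedata.name(ch, "").startswith("LATIN")
            out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))


latin = "a\u0303"            # LATIN SMALL LETTER A + COMBINING TILDE
devanagari = "\u0915\u0941"  # DEVANAGARI LETTER KA + VOWEL SIGN U (Mn)

print(strip_latin_diacritics(latin) == "a")
print(strip_latin_diacritics(devanagari) == devanagari)
print(strip_all_nonspacing_marks(devanagari) == "\u0915")
```

The naive function removes the Devanagari vowel sign and so changes the
syllable, which is the kind of damage described above. The Latin-only variant
leaves that text alone. Both functions return NFC-normalized output, which may
differ from the input's original normalization form.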

