On 24 Apr 2014, at 21:38, Whistler, Ken <[email protected]> wrote:
> Grapheme_Extend characters per se do not "apply" to anything.
> They are a mixture of different General_Category types -- mostly combining
> marks, but not all. The concept of applying to a base only refers to
> combining marks proper.
>
> The proper use of the Grapheme_Extend property is in the context of the
> text segmentation algorithms defined in UAX #29, and in particular:
>
> http://www.unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table
>
> See that document for the proper use. They are relevant to the determination
> of grapheme cluster boundaries.
>
> And by the way, it is a very bad idea to be writing a program to just
> unilaterally strip away grapheme extenders from input strings. In particular,
> many dependent vowels in Indic scripts are defined as grapheme extenders. If
> you strip them away, the input string will just end up as random trash. That
> is very, very different from something which is trying to strip diacritics
> and accent marks off of Latin letters.

I agree. Don’t worry, I am not actually writing such a program; it was just an example to simplify my question.

The real program attempts to reverse a string while accounting for combining marks and grapheme extenders. Before reversing the code points one by one, some things need to happen:

* For combining marks, I use a regular expression that looks for non-combining marks followed by any number of combining marks, and then I swap the combining marks with the preceding character.
* Now I’m trying to figure out what to do about grapheme extenders (if anything). I was thinking: look for any non-grapheme extender symbol (or should it be only `Grapheme_Base` characters? Your reply suggested it shouldn’t) followed by a single grapheme extender (or should it be several, like with combining marks?), and then swap them.

Would that be a correct approach?

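
For concreteness, the combining-mark step might be sketched as follows. This is Python rather than the real program's code, and the names `is_mark`, `swap_marks`, and `reverse_text` are made up for illustration. It treats any code point with General_Category M* as a combining mark, and it deliberately does nothing about Grapheme_Extend, which Python's standard `unicodedata` module does not expose.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    # Assumption for this sketch: "combining mark" means General_Category
    # Mn, Mc, or Me. Grapheme_Extend is NOT consulted here.
    return unicodedata.category(ch).startswith("M")


def swap_marks(text: str) -> str:
    # Move each run of marks in front of the character it follows, with the
    # run itself reversed, so that a plain code-point reversal afterwards
    # restores "base, mark1, mark2, ..." order.
    out = []
    i, n = 0, len(text)
    while i < n:
        base = text[i]
        j = i + 1
        while j < n and is_mark(text[j]):
            j += 1
        marks = text[i + 1:j]
        out.append(marks[::-1] + base)
        i = j
    return "".join(out)


def reverse_text(text: str) -> str:
    # Step 1: swap marks with the preceding character.
    # Step 2: reverse code point by code point.
    return swap_marks(text)[::-1]


word = "man\u0303ana"            # "mañana" with a decomposed tilde
naive = word[::-1]               # the tilde ends up on the wrong letter
aware = reverse_text(word)       # the tilde stays on the "n"
```

With this input, the naive reversal yields `"ana\u0303nam"` (the tilde now follows an "a"), while the mark-aware version yields `"anan\u0303am"`. The open question in this message is whether grapheme extenders need the same treatment.
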
I realize reversing a string has nothing to do with text segmentation, but ignoring grapheme extenders leads to unexpected results (since after reversing the code points, the grapheme extender might extend the wrong character):

https://github.com/mathiasbynens/esrever/issues/5

