Re: Do `Grapheme_Extend` characters only apply to `Grapheme_Base`?

Mathias Bynens Thu, 24 Apr 2014 14:12:13 -0700

On 24 Apr 2014, at 21:38, Whistler, Ken <[email protected]> wrote:


> Grapheme_Extend characters per se do not "apply" to anything.
> They are a mixture of different General_Category types -- mostly combining
> marks, but not all. The concept of applying to a base only refers to
> combining marks proper.
> 
> The proper use of the Grapheme_Extend property is in the context of the
> text segmentation algorithms defined in UAX #29, and in particular:
> 
> http://www.unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table
> 
> See that document for the proper use. They are relevant to the determination 
> of grapheme cluster boundaries.
> 
> And by the way, it is a very bad idea to be writing a program to just 
> unilaterally strip away grapheme extenders from input strings. In particular, 
> many dependent vowels in Indic scripts are defined as grapheme extenders. If 
> you strip them away, the input string will just end up as random trash. That 
> is very, very different from something which is trying to strip diacritics 
> and accent marks off of Latin letters.

I agree. Don’t worry: I am not actually writing such a program; it was just an 
example to simplify my question.

The real program attempts to reverse a string while accounting for combining 
marks and grapheme extenders. Before reversing the code points one by one, some 
things need to happen:

* For combining marks, I use a regular expression that matches a character that 
is not a combining mark followed by any number of combining marks, and then I 
swap the marks with that preceding character.
* Now I’m trying to figure out what to do about grapheme extenders (if 
anything). My idea: match any symbol that is not a grapheme extender, followed 
by a single grapheme extender, and then swap the two. I’m unsure about two 
details. Should the first symbol be restricted to `Grapheme_Base` characters 
(your reply suggested it shouldn’t)? And should the match allow several 
grapheme extenders, as with combining marks? Would that be a correct approach?
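
To make the combining-mark step concrete, here is a minimal sketch in Python. It 
is not the code from the real program. Instead of swapping with a regular 
expression and then reversing code points, it groups each character with the 
marks that follow it and reverses the groups, which should have the same 
effect. The names `is_mark` and `reverse_keeping_marks` are made up for this 
example, and "combining mark" is assumed to mean General_Category Mn, Mc, or 
Me. It does not look at `Grapheme_Extend` at all.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    # Assumption: a "combining mark" is any code point whose
    # General_Category starts with "M" (Mn, Mc, Me).
    return unicodedata.category(ch).startswith("M")


def reverse_keeping_marks(text: str) -> str:
    # Build groups of one leading character plus the marks that follow it.
    groups = []
    for ch in text:
        if groups and is_mark(ch):
            groups[-1] += ch      # attach the mark to the preceding group
        else:
            groups.append(ch)     # start a new group
    # Reverse the order of the groups, not the code points inside them.
    return "".join(reversed(groups))


# "mañana" with a decomposed ñ (n + U+0303 COMBINING TILDE)
print(reverse_keeping_marks("man\u0303ana") == "anan\u0303am")
```

A mark at the very start of the string has no preceding character, so in this 
sketch it simply forms a group of its own.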

I realize reversing a string has nothing to do with text segmentation – but 
ignoring grapheme extenders leads to unexpected results (since after reversing 
the code points, the grapheme extender might extend the wrong character): 
https://github.com/mathiasbynens/esrever/issues/5
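
A small Python illustration of that kind of failure, using an example chosen 
for this sketch rather than taken from the issue above: U+FF9E HALFWIDTH 
KATAKANA VOICED SOUND MARK has `Grapheme_Extend=Yes`, but its General_Category 
is Lm, so a check that only looks for combining marks does not catch it. The 
helper name `naive_reverse` is made up for the example.

```python
import unicodedata

VOICED_MARK = "\uff9e"               # HALFWIDTH KATAKANA VOICED SOUND MARK
text = "\uff76" + VOICED_MARK + "x"  # halfwidth KA + voiced mark, then "x"


def naive_reverse(s: str) -> str:
    # Reverse code point by code point, with no special handling.
    return s[::-1]


reversed_text = naive_reverse(text)
mark_category = unicodedata.category(VOICED_MARK)

# After reversal the voiced mark directly follows "x" instead of KA.
print(reversed_text == "x\uff9e\uff76")
# The mark is a modifier letter, not a combining mark.
print(mark_category)
```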
