2015-11-26 12:38 GMT+01:00 Asmus Freytag (t) <[email protected]>:
> On 11/26/2015 3:08 AM, Philippe Verdy wrote:
>
> The related definition for extended grapheme clusters says:
>
> ( CRLF
> | *Prepend* *( RI-sequence | Hangul-Syllable | !Control )
> ( Grapheme_Extend | *SpacingMark* )*
> | . )
>
> However I do not understand why it may include only one Hangul-Syllable
> when applying prepended concatenation marks. And if the definition excludes
> whitespaces, nothing prevents it to extend to arbitrary sequences of
> letters/digits/symbols/punctuations (this could span very long sequences of
> sinograms, or other letters from scripts that do not use whitespaces as
> word separators. Even in the Latin script it would extend to the
> punctuation signs that may follow any word, or to an entire mathematical
> formula such as "1+2*3" but not "sin x"...
>
>
> White space is clearly NOT part a grapheme cluster, so I don't see what
> your issue is?
>

No, whitespace is a grapheme cluster on its own, matching ( . ).

The issue is the overlong extended grapheme cluster that occurs after any
Prepend because of ( Grapheme_Extend | *SpacingMark* )*. But
( RI-sequence | Hangul-Syllable | !Control ) is bounded (if we ignore the
rare RI-sequences, which are still short) and will not match the whole
sequence of digits or letters intended by the prepended concatenation marks,
but only one of them.

> BTW, if after careful analysis you think there is a mistake, you should
> probably raise a bug on this.
>

For now, the proposal only speaks about enumerating the prepended characters
with a newly defined property; it still does not address which sequences of
graphemes they apply to. As these sequences are specific to each prepended
character, I don't see how the new property will help if we need to
special-case each one of these characters: we still need a custom algorithm
(possibly tailored by locale) for breaking clusters that use them.

With the definition given above, the extended grapheme clusters will break
after each letter/digit/punctuation, and <ARABIC NUMBER SIGN, ARABIC DIGIT
ONE, ARABIC DIGIT TWO> will still break into <ARABIC NUMBER SIGN, ARABIC
DIGIT ONE> separated from <ARABIC DIGIT TWO>. The proposed new property does
not change this: how can we really extend the sequence of digits so that the
number sign will span all of them? Use CGJ or explicit sequence delimiters?
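
To make the break position concrete, here is a rough Python sketch of the rule
quoted above. It is only an illustration under several assumptions, not the
reference segmentation algorithm: the PREPEND set is a hand-picked list of
prepended concatenation marks, the general categories Mn/Me/Mc stand in for
Grapheme_Extend and SpacingMark, Cc/Cf/Zl/Zp stand in for Control, the
RI-sequence and Hangul-Syllable alternatives are left out, and "ARABIC DIGIT
ONE/TWO" are taken to mean U+0661 and U+0662.

```python
import unicodedata

# Assumption: hand-picked Prepend set (prepended concatenation marks),
# not read from any Unicode property file.
PREPEND = {0x0600, 0x0601, 0x0602, 0x0603, 0x0604, 0x0605,
           0x06DD, 0x070F, 0x08E2, 0x110BD}


def is_prepend(ch):
    return ord(ch) in PREPEND


def is_extend(ch):
    # Approximation: Mn/Me for Grapheme_Extend, Mc for SpacingMark.
    return unicodedata.category(ch) in ("Mn", "Me", "Mc")


def is_control(ch):
    # Approximation of Control; the Prepend characters are Cf but excluded.
    if is_prepend(ch):
        return False
    return unicodedata.category(ch) in ("Cc", "Cf", "Zl", "Zp")


def clusters(text):
    """Split text with: CRLF | Prepend* !Control (Extend|SpacingMark)* | ."""
    out = []
    i, n = 0, len(text)
    while i < n:
        start = i
        if text.startswith("\r\n", i):
            i += 2
        else:
            j = i
            while j < n and is_prepend(text[j]):
                j += 1
            if j < n and not is_control(text[j]):
                j += 1  # exactly ONE base character is consumed here
                while j < n and is_extend(text[j]):
                    j += 1
                i = j
            else:
                i += 1  # fallback: "." matches a single code point
        out.append(text[start:i])
    return out


# ARABIC NUMBER SIGN followed by two digits.
for cluster in clusters("\u0600\u0661\u0662"):
    print([unicodedata.name(c) for c in cluster])
```

Under these assumptions the number sign is grouped with the first digit only,
and the second digit forms its own cluster, because the rule consumes a single
base character after Prepend*. Likewise "1+2*3" comes out as five separate
clusters, since none of its characters is an extending mark.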

