A question about the default grapheme cluster boundaries with U+0020 as the grapheme base

Konstantin Ritt Fri, 01 Jun 2012 21:33:29 -0700

It seems like there is an inconsistency between what the default
grapheme clusters specification says and what the test results are
expected to be:


UAX#29 says:
> Another key feature (of default Unicode grapheme clusters) is that <b>default 
> Unicode grapheme clusters are atomic units with respect to the process of 
> determining the Unicode default line, word, and sentence boundaries</b>.
This is also mentioned in UAX#14:
> Example 6. Some implementations may wish to tailor the line breaking 
> algorithm to resolve grapheme clusters according to Unicode Standard Annex 
> #29, “Unicode Text Segmentation” [UAX29], as a first stage. <b>Generally, the 
> line breaking algorithm does not create line break opportunities within 
> default grapheme clusters</b>; therefore such a tailoring would be expected 
> to produce results that are close to those defined by the default algorithm. 
> However, if such a tailoring is chosen, characters that are members of line 
> break class CM but not part of the definition of default grapheme clusters 
> must still be handled by rules LB9 and LB10, or by some additional tailoring.

However, <U+0020 (SP), U+0308 (CM)> is handled in the line breaking
algorithm by rules LB10+LB18, which produce a break opportunity, while
GB9 prohibits a break between <U+0020 (Other), U+0308 (Extend)>.
Section 9.2 "Legacy Support for Space Character as Base for Combining
Marks" in UAX#29 explains why the line break occurs, but the fact
remains that the statements quoted above are false in this case and
introduce some ambiguity.
If the space character is no longer a grapheme base, the grapheme
cluster breaking rules need to be updated.
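
To make the mismatch concrete, here is a minimal Python sketch of just the
rules named above (GB9 on one side, LB9/LB10/LB18 on the other). It is not a
conforming implementation of either annex. The helper names are hypothetical,
and it assumes that general categories Mn/Me are a good enough stand-in for
both Grapheme_Cluster_Break=Extend and Line_Break=CM, which holds for U+0308
but not for every character.

```python
import unicodedata


def is_mark(ch):
    # Assumption: approximate both GCB=Extend and LB=CM by the general
    # categories Mn/Me. True for U+0308; not exact for all characters.
    return unicodedata.category(ch) in ("Mn", "Me")


def grapheme_break_between(a, b):
    # Simplified GB9: never break before an Extend character.
    # Everything else falls through to "break" in this sketch.
    return not is_mark(b)


def line_break_opportunity_between(a, b):
    # Simplified LB9/LB10/LB18; all other line breaking rules are ignored.
    if is_mark(b):
        if a == " ":
            # LB9 does not attach a mark to SP, so LB10 treats the mark
            # as AL, and LB18 then allows a break after the space.
            return True
        # LB9: the mark is absorbed into its base, so no break before it.
        return False
    # LB18: break after spaces.
    return a == " "


if __name__ == "__main__":
    sp, diaeresis = "\u0020", "\u0308"
    print("grapheme break:", grapheme_break_between(sp, diaeresis))
    print("line break opportunity:", line_break_opportunity_between(sp, diaeresis))
```

Under these assumptions the pair <U+0020, U+0308> gets no grapheme cluster
boundary but does get a line break opportunity, which is exactly the case
where a line break falls inside a default grapheme cluster.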

Kind regards,
Konstantin
