TR29 Word Break awkwardness

Andy Heninger Mon, 13 Sep 2004 16:27:02 -0700

In looking at how the proposed changes to the TR 29 word boundary rules would be implemented in the ICU library, I came across an odd situation in the rules.

My question actually has nothing to do with the new proposed changes to TR29, but is something that has been in the word boundary rules all along.


Looking at the pertinent parts of TR 29, we have

   ALetter = Alphabetic = TRUE
               and some other stuff

    Rule 3:  Treat a grapheme cluster as if it were a single
             character, the first char of the cluster.

    Rules 5, 6, 7, 9:  Don't break between most letters
                       and a variety of other constructs.
                          ALetter x [whatever]

The issue is that there are characters that have both Alphabetic = TRUE and Grapheme Extend = TRUE. These will normally be part of a grapheme cluster, and thus will not directly participate in determining word boundaries. But if an alphabetic combining char appears by itself, either at the start of the text or following a character that does not accept combining characters, the Alphabetic property comes into play and the combining char can join with following letters in forming a "word".

If the orphaned alphabetic combining character were preceded by some sort of space, which is what you would normally do if you wanted to display it without a base character, the space would determine the word breaking and the combining char would not join with subsequent letters to form a word. I suspect that a combining character with no base would display in more or less the same way as one with a space as a base.

What I think would make sense is to modify the definition of ALetter in TR29 to look like this:

   ALetter  =  Alphabetic = TRUE
                 [plus all the other stuff already in TR29]
                  AND NOT Grapheme Extend = TRUE
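
A minimal Python sketch of the proposed definition, for illustration only. The property table is entered by hand for three sample characters and is an assumption of the sketch, not ICU code or a real Unicode Character Database lookup; the function names are hypothetical, and the "other stuff" in the TR29 definition of ALetter is left out.

```python
# Toy property table, entered by hand for illustration; real code would
# read these values from the Unicode Character Database.
PROPS = {
    "\u0061": {"Alphabetic": True, "Grapheme_Extend": False},  # LATIN SMALL LETTER A
    "\u0301": {"Alphabetic": False, "Grapheme_Extend": True},  # COMBINING ACUTE ACCENT
    "\u0345": {"Alphabetic": True, "Grapheme_Extend": True},   # COMBINING GREEK YPOGEGRAMMENI
}


def is_aletter_current(ch):
    """ALetter as summarized above: Alphabetic = TRUE (other terms omitted)."""
    return PROPS[ch]["Alphabetic"]


def is_aletter_proposed(ch):
    """Proposed ALetter: Alphabetic = TRUE and not Grapheme Extend = TRUE."""
    p = PROPS[ch]
    return p["Alphabetic"] and not p["Grapheme_Extend"]


for ch in PROPS:
    print(f"U+{ord(ch):04X}", is_aletter_current(ch), is_aletter_proposed(ch))
```

Only the character that is both Alphabetic and Grapheme Extend changes classification; ordinary letters and non-alphabetic combining marks are unaffected.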

And how did I come to be looking at this particular case? The ICU library word breaker uses a regular-expression based set of rules and a DFA style matching engine that is very clever in always finding the longest match, which is to say, the longest possible word.

Consider a sequence of four characters like this:
[ALetter] [MidLetter]  [ALetter with GraphemeExtend] [Numeric]

In the correct interpretation of the TR29 word rules, the middle two characters would combine to form a grapheme cluster. The word break rules would then be applied to the reduced sequence [ALetter] [MidLetter] [Numeric], which matches none of the word rules.

The DFA Regex implementation, however, finds a longer word by considering the [ALetter + Grapheme Extend] character as just an ALetter:

  [ALetter]  [MidLetter] [ALetter] [Numeric]

which all sticks together through application of rules 6 and 9 from the TR.
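
The two readings can be compared with a small Python sketch. This is a toy model under stated assumptions, not the ICU rule engine: characters are represented as sets of class names, only the letter/MidLetter/Numeric rules summarized above are modeled, and all names (no_break, collapse_clusters, word_lengths) are hypothetical.

```python
# Toy character classes (sets of property names), not real Unicode data.
A = frozenset({"ALetter"})
MID = frozenset({"MidLetter"})
A_EXT = frozenset({"ALetter", "Extend"})  # alphabetic combining character
NUM = frozenset({"Numeric"})


def no_break(classes, i):
    """True if the simplified rules forbid a break between items i and i+1."""
    left, right = classes[i], classes[i + 1]
    # ALetter x ALetter, ALetter x Numeric
    if "ALetter" in left and ("ALetter" in right or "Numeric" in right):
        return True
    # ALetter x MidLetter ALetter
    if ("ALetter" in left and "MidLetter" in right
            and i + 2 < len(classes) and "ALetter" in classes[i + 2]):
        return True
    # ALetter MidLetter x ALetter
    if ("MidLetter" in left and "ALetter" in right
            and i >= 1 and "ALetter" in classes[i - 1]):
        return True
    return False


def collapse_clusters(seq):
    """Rule 3: fold each Extend character into the preceding unit, which
    keeps the classes of its first character. Returns (classes, count)."""
    units = []
    for classes in seq:
        if "Extend" in classes and units:
            head, count = units[-1]
            units[-1] = (head, count + 1)
        else:
            units.append((classes, 1))
    return units


def word_lengths(units):
    """Length of each word, counted in original characters."""
    classes = [c for c, _ in units]
    lengths = [units[0][1]]
    for i in range(len(units) - 1):
        if no_break(classes, i):
            lengths[-1] += units[i + 1][1]
        else:
            lengths.append(units[i + 1][1])
    return lengths


seq = [A, MID, A_EXT, NUM]
with_rule3 = word_lengths(collapse_clusters(seq))      # clusters formed first
without_rule3 = word_lengths([(c, 1) for c in seq])    # Extend char seen as ALetter
print(with_rule3)
print(without_rule3)
```

With clusters formed first, the sketch yields three words of lengths 1, 2 and 1; treating the combining character as a plain ALetter yields a single four-character word.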

While thinking about what to do about this, it struck me that it would probably be more consistent all the way around to remove the Grapheme Extend characters from the ALetter set. The only effect of this change would be on the breaking behavior of combining characters with no base character.

Any thoughts?

--
  -- Andy Heninger
     [EMAIL PROTECTED]
