Re: Potential contradiction between the WordBreak test data and UAX #29

Tom Hacohen Wed, 23 Nov 2016 01:18:23 -0800

You said:
> So ignore it and test whether the last symbol glues with ZWJ (it should,
> so there's no break in the reference implementation).

Which makes me think you misread the example I quoted. There is a break in the reference implementation, though I argue (like you just did) that there shouldn't be. So I think you agree with me and also think it's broken.

Otherwise, I'm not sure I fully understand what you are saying, but if what you are saying is correct, then following the same logic, other rules would fail, specifically:

÷ 0061 × 2060 × 0030 ÷ # ÷ [0.2] LATIN SMALL LETTER A (ALetter) × [4.0] WORD JOINER (Format_FE) × [9.0] DIGIT ZERO (Numeric) ÷ [0.3]


After the FE here there's no BREAK because:
ALetter Format Numeric -> ALetter Numeric
Which then following rule 9.0 is a no-break.

This is exactly the rule (4) as described in my previous email, just with a different follow-up rule (9 instead of 3c). I don't see how rule precedence would matter here, as there is no case for which two rules apply.
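
To make the argument above concrete, here is a minimal Python sketch of this "collapse first" reading of rule 4. It is a toy model, not libunibreak or the reference implementation: the function name and the string class labels are assumptions chosen for illustration, and only rules 3c, 4, 9 and 999 are modelled (the sot/CR/LF exclusions of rule 4 are ignored).

```python
# Toy model (assumption): rule 4 is applied as a rewrite first, so each
# X (Extend | Format | ZWJ)* group behaves like its leading X.
IGNORED = {"Extend", "Format", "ZWJ"}


def breaks_collapse_first(classes):
    """Return '×' (no break) or '÷' (break) for each interior boundary."""
    if not classes:
        return []
    marks = []
    head = classes[0]  # class of the X that leads the current group
    for cls in classes[1:]:
        if cls in IGNORED:
            # Rule 4: absorbed into the current group, never a break.
            marks.append("×")
            continue
        if head == "ZWJ" and cls in {"Glue_After_Zwj", "EBG"}:
            marks.append("×")  # rule 3c, applied to the collapsed group
        elif head == "ALetter" and cls == "Numeric":
            marks.append("×")  # rule 9
        else:
            marks.append("÷")  # rule 999: break everywhere else
        head = cls
    return marks


print(breaks_collapse_first(["ALetter", "Format", "Numeric"]))
print(breaks_collapse_first(["ZWJ", "Extend", "Glue_After_Zwj"]))
```

Under this reading both sequences come out as no-break throughout, which agrees with the test data for the ALetter case but not for the ZWJ case.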


--
Tom.

On 23/11/16 02:49, Philippe Verdy wrote:

IMHO, the ZWJ should glue with the last symbol following your examples.
But the combining diaeresis following the ZWJ extends it (even if in my
opinion it is "defective" and would likely display on a dotted circle
in renderers, but not defective for the string definition of combining
sequences).
So ignore it and test whether the last symbol glues with ZWJ (it should,
so there's no break in the reference implementation).

WB4: X (Extend | Format | ZWJ)*→X

Extend: [Grapheme_Extend=Yes]  This includes:
  General_Category = Nonspacing_Mark (this includes the combining diaeresis)
  General_Category = Enclosing_Mark
  U+200C ZERO WIDTH NON-JOINER
  plus a few General_Category = Spacing_Mark needed for canonical
equivalence.

So yes, we have: ZWJ "COMBINING DIAERESIS" (EBG|Glue_After_Zwj) → ZWJ
(EBG|Glue_After_Zwj), where rule WB4 eliminates the combining mark from
the input queue.

But rule WB3c comes before and prohibits it:

WB3c: ZWJ × (Glue_After_Zwj | EBG)

This means that you have first:

ZWJ "COMBINING DIAERESIS" GAZ →  ZWJ × "COMBINING DIAERESIS" EBG

and this does not match rule WB4, which does not apply to:

X × (Extend | Format | ZWJ)*→X

(It cannot remove the extenders if there's a no-break before them; it is
valid only when the break opportunity is still unspecified.) As soon as a
rule has produced a "break here" or "no break here" at a given position,
you must advance past this position (the rules are based on a small
finite state machine). So after:

ZWJ "COMBINING DIAERESIS" GAZ →  ZWJ × "COMBINING DIAERESIS" EBG

it just remains in your input queue:

"COMBINING DIAERESIS" EBG  (because "ZWJ ×" is already processed, and so
ZWJ is eliminated)

Now comes WB4: X (Extend | Format | ZWJ)* → X

There is no longer any "X" to match before the combining diaeresis: your
input queue starts with the combining diaeresis matching "X", and the
following character (EBG) does not match within "(Extend | Format |
ZWJ)*" (which matches an empty string and does not contain the combining
diaeresis already matched in "X"). Rule WB4 then has no replacement
effect and preserves the initial "X" (i.e. the combining diaeresis).
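
The ordering argument above can be sketched as follows. This is a hedged toy model in Python, not the reference implementation: the function name and class labels are assumptions, and it covers only WB3c, WB4, WB9 and WB999. The key assumption is that WB3c inspects the literally adjacent character, while the rules after WB4 look back past any Extend, Format or ZWJ characters.

```python
# Toy model (assumption): rules are tried in order at every boundary, and
# only the rules that come after WB4 get to skip Extend/Format/ZWJ.
IGNORED = {"Extend", "Format", "ZWJ"}


def breaks_rules_in_order(classes):
    """Return '×' (no break) or '÷' (break) for each interior boundary."""
    marks = []
    for i in range(1, len(classes)):
        prev, cur = classes[i - 1], classes[i]
        # WB3c comes first and sees the raw neighbour, not a collapsed group.
        if prev == "ZWJ" and cur in {"Glue_After_Zwj", "EBG"}:
            marks.append("×")
            continue
        # WB4: never break before Extend, Format or ZWJ.
        if cur in IGNORED:
            marks.append("×")
            continue
        # Later rules: look back past the ignored characters to find X.
        j = i - 1
        while j > 0 and classes[j] in IGNORED:
            j -= 1
        base = classes[j]
        if base == "ALetter" and cur == "Numeric":
            marks.append("×")  # WB9
        else:
            marks.append("÷")  # WB999
    return marks


print(breaks_rules_in_order(["ZWJ", "Extend", "Glue_After_Zwj"]))
print(breaks_rules_in_order(["ALetter", "Format", "Numeric"]))
```

With that assumption the sketch gives no break after the ZWJ and a break before the heart or boy, as in the test lines quoted in this thread, while ALetter Format Numeric still yields no break anywhere.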

2016-11-22 13:07 GMT+01:00 Tom Hacohen <[email protected]>:

    Dear,

    I recently updated libunibreak[1] according to unicode 9.0.0. I
    thought I implemented it correctly, however it fails against two of
    the tests in the reference test data:

    ÷ 200D × 0308 ÷ 2764 ÷ #  ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
    COMBINING DIAERESIS (Extend_FE) ÷ [999.0] HEAVY BLACK HEART
    (Glue_After_Zwj) ÷ [0.3]

    and

    ÷ 200D × 0308 ÷ 1F466 ÷ #  ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) ×
    [4.0] COMBINING DIAERESIS (Extend_FE) ÷ [999.0] BOY (EBG) ÷ [0.3]


    More specifically, it fails in both after the "combining diaeresis".
    My implementation marks it as a break, whereas the test data does not.
    The reference implementation, as expected, agrees with the test data.


    However, looking at the test case and the UAX[2], this does not look
    correct. More specifically, because of rule 4:
    ZWJ Extend GAZ -> ZWJ GAZ
    And then according to rule 3c, there should be no break opportunity
    between them. The reference implementation, however, uses rule 999
    here, which I believe is incorrect.


    Am I missing anything, or is this an issue with the reference test
    data and reference implementation?

    Thanks,
    Tom.

    [1]: https://github.com/adah1972/libunibreak
    [2]: http://www.unicode.org/reports/tr29/#WB1
