daniel.buenzli wrote:

> I think the behaviour of → rules should be clarified
I wholeheartedly agree.

> If I understand correctly if the match [or a "treat-as" rule] spans over
> the [candidate] boundary position candidate that simply turns it into a
> non-boundary. Otherwise you apply the rule on the left of the boundary
> position candidate.

I have considered the extent of a left-side treat-as match to not continue beyond the candidate boundary position. This comes into play following a ZWJ, where it may be absorbed into a "treat as" on the left (WB4), while some other rule triggers on the right side (WB3c).

At any rate, this is what I do in ICU. It gets very confusing, and is tricky to implement.

Reconsidering how ZWJ rules work could also be a help, if we could figure out how to keep them out of the "treat as" rules, but use explicit no-break rules on both sides instead.

-- Andy

On Wed, Mar 4, 2020 at 4:01 PM Mark Davis ☕️ via Unicode <unicode@unicode.org> wrote:

> One thing we have considered for a while is whether to do a rewrite of the
> rules to simplify the processing (and avoid the "treat as" rules), but it
> would take a fair amount of design work that we haven't had time to do. If
> you (or others) are interested in getting involved, please let us know.
>
> Mark
>
> On Wed, Mar 4, 2020 at 11:30 AM Daniel Bünzli via Unicode <unicode@unicode.org> wrote:
>
>> On 4 March 2020 at 18:48:09, Daniel Bünzli (daniel.buen...@erratique.ch) wrote:
>>
>> > On 4 March 2020 at 18:01:25, Daniel Bünzli (daniel.buen...@erratique.ch) wrote:
>> >
>> > > Re-reading the text I suspect I should not restart the rules from the
>> > > first one when a WB4 rewrite occurs but only apply the subsequent
>> > > rules. Is that correct?
>> >
>> > However even if that's correct I don't understand how this test case works:
>> >
>> > ÷ 1F6D1 × 200D × 1F6D1 ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPict) × [4.0] ZERO WIDTH JOINER (ZWJ_FE)
>> > × [3.3] OCTAGONAL SIGN (ExtPict) ÷ [0.3]
>> >
>> > Here the first two chars get rewritten with WB4 to ExtPict, then if only
>> > subsequent rules are applied we end up in WB999 and a break between
>> > 200D and 1F6D1.
>>
>> That's nonsense and not the operational model of the algorithm, which IIRC
>> was once clearly stated on this list by Mark Davis (sorry, I failed to dig
>> out the message): take each boundary position candidate, apply the rules
>> in sequence, taking the first one that matches, and then start over with
>> the next candidate.
>>
>> In that case applying the rules between 1F6D1 and 200D leads to WB4, but
>> then that implicitly adds a non-boundary condition -- this is not really
>> evident from the formalism, but see the comment above WB4 -- and for that
>> boundary position that settles the non-boundary condition. Then we start
>> again applying the rules between 200D and the last 1F6D1, and WB3c matches
>> before WB4 kicks in.
>>
>> I think the behaviour of → rules should be clarified: it's not clear on
>> which data you apply it w.r.t. the boundary position candidate. If I
>> understand correctly, if the match spans over the boundary position
>> candidate, that simply turns it into a non-boundary. Otherwise you apply
>> the rule on the left of the boundary position candidate.
>>
>> Regarding the question of my original message, it seems at a certain point
>> I knew better:
>>
>> https://www.unicode.org/mail-arch/unicode-ml/y2016-m11/0151.html
>>
>> Sorry for the noise.
>>
>> Daniel
>>
>> P.S. I still think the UAX29 and UAX14 could benefit from clarifying the
>> operational model of the rules a bit (I also have the impression that the
>> formalism to express all that may not be the right one, but then I don't
>> have something better to propose at the time). Also it would be nicer for
>> implementers if they didn't have to factorize rules themselves (e.g. like
>> in the new LB30 rules of UAX14) so that correctness of implemented rules is
>> easier to assert.
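
For readers following the thread, the operational model described above (take each candidate boundary position, try the rules in order, first match wins, then move on to the next candidate) can be sketched roughly as follows. This is only an illustration under stated assumptions, not ICU's implementation and not the full UAX #29 rule set: it covers just WB1, WB2, WB3c, WB4, WB5 and WB999, the property lookup `word_break_class` is a hypothetical toy table rather than the real Unicode data, and the exception for CR/LF/Newline in WB4 is left out. The "treat as" step is confined to the left of the candidate position, as discussed in the reply.

```python
# Hypothetical toy property table; a real implementation would use the
# WordBreakProperty and emoji data files instead.
def word_break_class(cp: int) -> str:
    if cp == 0x200D:
        return "ZWJ"
    if cp == 0x1F6D1:
        return "ExtPict"
    if 0x0300 <= cp <= 0x036F:
        return "Extend"
    if chr(cp).isalpha():
        return "ALetter"
    return "Other"


# Classes absorbed by the WB4 "treat as" rule: X (Extend | Format | ZWJ)* -> X
IGNORED = {"Extend", "Format", "ZWJ"}


def is_boundary(classes: list, i: int) -> bool:
    """Decide the candidate position between classes[i-1] and classes[i].

    Rules are tried in order and the first one that matches settles it.
    """
    # WB1 / WB2: break at start and end of (non-empty) text.
    if i == 0 or i == len(classes):
        return True
    left, right = classes[i - 1], classes[i]

    # WB3c: ZWJ x ExtPict. It inspects the raw left character, so it has to
    # be tried before WB4 absorbs the ZWJ.
    if left == "ZWJ" and right == "ExtPict":
        return False

    # WB4, part one: a match that would span the candidate position simply
    # makes it a non-boundary (no break before Extend/Format/ZWJ).
    if right in IGNORED:
        return False

    # WB4, part two: otherwise apply "treat as" only to the left of the
    # candidate, skipping back over ignored characters to find X.
    j = i - 1
    while j > 0 and classes[j] in IGNORED:
        j -= 1
    left = classes[j]

    # WB5: ALetter x ALetter (evaluated on the rewritten left side).
    if left == "ALetter" and right == "ALetter":
        return False

    # WB999: otherwise break everywhere.
    return True


def boundaries(text: str) -> list:
    """Return the boundary offsets (in code points) for the toy rule set."""
    if not text:
        return []
    classes = [word_break_class(ord(c)) for c in text]
    # Each candidate is decided independently, restarting from the first rule.
    return [i for i in range(len(classes) + 1) if is_boundary(classes, i)]


print(boundaries("\U0001F6D1\u200D\U0001F6D1"))  # [0, 3]
print(boundaries("a\u200Db"))                    # [0, 3]
```

With this model the test case from the thread comes out as ÷ 1F6D1 × 200D × 1F6D1 ÷: the position before 200D is closed by WB4, and the position after it is closed by WB3c before the treat-as step is ever reached.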