Re: [pcre-dev] (*THEN) works differently in Perl
On 2019-07-09 13:53, ph10 wrote:
> On Mon, 8 Jul 2019, ND via Pcre-dev wrote:
>> And if we disregard Perl's bugs then it seems (*COMMIT) in Perl works in
>> the following manner:
>>
>> 1. Backtracking can't move to the left of COMMIT (this is PCRE behaviour
>>    too)
>> 2. If COMMIT occurs then the match cannot advance to any other position
>>    of the subject. This holds no matter whether other backtracking
>>    control verbs occur after COMMIT, or whether COMMIT occurs in an
>>    atomic group, negative lookaround etc. (this is not implemented by
>>    PCRE)
>
> There is also a difference in the way Perl handles repeated groups.
> Consider [...]
>
> In Perl, the group repeat matches "abcd", but when it then does not match
> "c", it unwinds complete repetitions of the group. In PCRE2, there is a
> backtrack onto *COMMIT, so it fails. Looks like Perl handles *COMMIT
> somehow differently to normal backtracks, because it does do ordinary
> backtracks into repeated groups: [...]

No. I don't think Perl handles (*COMMIT) differently.
Perl can match a pattern of the form A*B by several methods. The common
method is named CURLYX-WHILEM, but some optimizations apply in certain
situations. Thus, an optimized method named CURLYM is used when A is a
group of constant length without captures. CURLYM has a buggy
implementation that does not take the influence of (*COMMIT) into account.

Perl matches the patterns

/\A(?:.(*COMMIT))*c/
/\A(?:(*COMMIT).)*c/

using CURLYM. So it gets both cases wrong, as we can see in the Perl debug
output. But in the second case the result accidentally coincides with the
expected one.

The pattern

/\A(?:.{1,2}(*COMMIT))*c/

is matched with CURLYX-WHILEM, whose implementation does not have this bug.

I think the Perl developers should fix the implementation of CURLYM, or
process groups that contain (*COMMIT) with CURLYX-WHILEM.

What can PCRE do?
PCRE can do nothing, or change to process (*COMMIT) as Perl means it:

1. If COMMIT occurs then backtracking can't move to the part of the
   pattern that is to the left of it.
2. If COMMIT occurs then the start position can't be advanced.

These two principles hold no matter whether other backtracking control
verbs occur after COMMIT, or whether COMMIT occurs in an atomic group, a
negative lookaround etc.

PCRE does not currently implement them strictly. For example, consider
this pattern:

PCRE2 version 10.33 2019-04-16
/.?(?!(*COMMIT)x)a/
    abc
 0: a

The Perl way is "There can be no backtracking left of COMMIT". So the
engine can't backtrack to ".?" and the Perl result will be "no match".

--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] (*THEN) works differently in Perl
On Mon, 8 Jul 2019, ND via Pcre-dev wrote:

> And if we disregard Perl's bugs then it seems (*COMMIT) in Perl works in
> the following manner:
>
> 1. Backtracking can't move to the left of COMMIT (this is PCRE behaviour
>    too)
> 2. If COMMIT occurs then the match cannot advance to any other position
>    of the subject. This holds no matter whether other backtracking
>    control verbs occur after COMMIT, or whether COMMIT occurs in an
>    atomic group, negative lookaround etc. (this is not implemented by
>    PCRE)

There is also a difference in the way Perl handles repeated groups.
Consider

Perl 5.03 Regular Expressions
/\A(?:.(*COMMIT))*c/
    abcd
 0: abc

PCRE2 version 10.34-RC1 2019-04-22
/\A(?:.(*COMMIT))*c/
    abcd
No match

In Perl, the group repeat matches "abcd", but when it then does not match
"c", it unwinds complete repetitions of the group. In PCRE2, there is a
backtrack onto *COMMIT, so it fails. Looks like Perl handles *COMMIT
somehow differently to normal backtracks, because it does do ordinary
backtracks into repeated groups:

Perl 5.03 Regular Expressions
/\A(.{1,2})*X/
    AABBCX
 0: AABBCX
 1: C

Adding {1,2} to the first example gives this:

Perl 5.03 Regular Expressions
/\A(?:.{1,2}(*COMMIT))*c/
    abcd
No match

Having another backtrack point inside the group changes things, but then I
found this:

Perl 5.03 Regular Expressions
/\A(?:(*COMMIT).)*c/
    abcd
No match

I give up!

Philip

--
Philip Hazel
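The verb-free test in the message above (ordinary backtracking into a repeated group) can be reproduced in other backtracking engines. A minimal sketch, assuming Python's standard `re` module, which has no backtracking control verbs but treats this particular pattern the same way as the Perl run quoted above:

```python
import re

# Same pattern and subject as the verb-free Perl test quoted above.
m = re.match(r"\A(.{1,2})*X", "AABBCX")

whole = m.group(0)  # the complete match
last = m.group(1)   # the last completed repetition of the group

print(whole)  # AABBCX
print(last)   # C
```

The group first consumes "AA", "BB" and "CX"; when "X" then fails at the end of the subject, the engine backtracks into the last repetition and shortens it to "C". This is the kind of backtrack that a (*COMMIT) inside the group is meant to forbid.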
Re: [pcre-dev] (*THEN) works differently in Perl
And if we disregard Perl's bugs then it seems (*COMMIT) in Perl works in
the following manner:

1. Backtracking can't move to the left of COMMIT (this is PCRE behaviour
   too)
2. If COMMIT occurs then the match cannot advance to any other position
   of the subject. This holds no matter whether other backtracking
   control verbs occur after COMMIT, or whether COMMIT occurs in an
   atomic group, negative lookaround etc. (this is not implemented by
   PCRE)
Re: [pcre-dev] (*THEN) works differently in Perl
On 2019-07-03 17:33, ph10 wrote:
> On Tue, 2 Jul 2019, ND via Pcre-dev wrote:
>> It seems Perl is either very buggy or has a really different conception
>> of (*COMMIT) than PCRE.
>
> I am waiting for further information from the Perl developers, but I
> suspect that I won't want to change PCRE2, except perhaps to add more
> detail to the documentation. In pcre2compat.3 there are already some
> comments about differences in the way the (*VERB)s are processed. Note
> also that they interact badly with optimizations (both in PCRE2 and
> Perl).

My inner voice tells me that there will be no answer from the Perl
developers :)
Getting no answer (either none at all, or no follow-up to an awaited reply
in a thread) is a frequent thing in the Perl bug tracker.

It's obvious that the COMMIT implementation in Perl has bugs
(https://rt.perl.org/Public/Search/Simple.html?q=*commit). Users worry
about this, but the Perl developers don't.
PCRE has much more consistent and better documented behaviour of
backtracking control verbs. So it seems best not to change PCRE's
behaviour around COMMIT and THEN in unnecessary attempts to achieve full
compatibility with Perl.
Feel free to close this thread.
Re: [pcre-dev] (*THEN) works differently in Perl
On Tue, 2 Jul 2019, ND via Pcre-dev wrote:

> It seems Perl is either very buggy or has a really different conception
> of (*COMMIT) than PCRE.

I am waiting for further information from the Perl developers, but I
suspect that I won't want to change PCRE2, except perhaps to add more
detail to the documentation. In pcre2compat.3 there are already some
comments about differences in the way the (*VERB)s are processed. Note
also that they interact badly with optimizations (both in PCRE2 and Perl).

Philip

--
Philip Hazel
Re: [pcre-dev] (*THEN) works differently in Perl
> A Perl developer has admitted there is some ambiguity, but suggests that
> (*COMMIT) just means "never advance the starting point". That pattern
> can find a match without advancing the starting point.

The documentation states two rules:

1) It's a zero-width pattern similar to (*SKIP), except that when
   backtracked into on failure it causes the match to fail outright.
2) No further attempts to find a valid match by advancing the start
   pointer will occur again.

If the first rule is not honored at all, the description is clearly wrong.
However, you can put (*COMMIT) at the beginning of the pattern, which
ignores the 1st rule, so changing the verb would be a hidden feature
removal.

Anyway, it concerns me that the verbs are not clearly defined / implemented
in Perl. They are quite powerful tools, but undefined behavior breaks
them. The documentation should list exceptions, e.g. that they work
differently inside assertions. If that is unintended then fix assertions.

Regards,
Zoltan
Re: [pcre-dev] (*THEN) works differently in Perl
On 2019-07-02 14:34, ph10 wrote:
> A Perl developer has admitted there is some ambiguity, but suggests that
> (*COMMIT) just means "never advance the starting point". That pattern
> can find a match without advancing the starting point. I have pointed
> out that, in that case, /.*(*COMMIT)c/ should also match, but it
> doesn't. This is still under discussion by the Perl people. It may take
> some time for a conclusion to emerge.

Your example /.*(*COMMIT)c/ is very reasonable and contradicts those words
of the Perl authors.

And here is another example. Perl reports no match, as if it backtracks
onto (*COMMIT) inside the possessive group:

/(?>.b(*COMMIT))*c/
    abxabc
 0: abc

It seems Perl is either very buggy or has a really different conception of
(*COMMIT) than PCRE.
Re: [pcre-dev] (*THEN) works differently in Perl
> And it totally contradicts the Perl documentation, in particular, this
> sentence:
>
> Note that if this operator is used and NOT inside of an alternation
> then it acts exactly like the "(*PRUNE)" operator.

Sorry, I'm ND but writing from another mailbox.

I guess that from Perl's point of view (*THEN) IS inside a branch of the
alternation (please look at the line indentation in the Perl debug output:
it is inside BRANCH). So it should not act as (*PRUNE) in that example.
Re: [pcre-dev] (*THEN) works differently in Perl
On Tue, 2 Jul 2019, I wrote:

> > PCRE2 version 10.33 2019-04-16
> > /\A(?:.(*COMMIT))*c/
> >     abcd
> > No match
> >
> > But Perl reports that this is a successful match "abc".
>
> I think this is also a Perl bug and I will report it.

A Perl developer has admitted there is some ambiguity, but suggests that
(*COMMIT) just means "never advance the starting point". That pattern can
find a match without advancing the starting point. I have pointed out
that, in that case, /.*(*COMMIT)c/ should also match, but it doesn't. This
is still under discussion by the Perl people. It may take some time for a
conclusion to emerge.

Regards,
Philip

--
Philip Hazel
Re: [pcre-dev] (*THEN) works differently in Perl
On Tue, 2 Jul 2019, Zoltán Herczeg wrote:

> Perhaps the misunderstanding comes from the fact that we are talking
> about the pattern and they talk about the matching process. So (*THEN)
> simply starts a backtrack, and when an alternation is encountered, it
> switches to the next alternative.

That is indeed what happens in the pcre2_match() interpreter.

> But this happens normally as well, so what is the exact purpose of
> this verb then?

Not quite. (*THEN) suppresses going back to a previous backtrack inside
the branch. In the Perl example

  ( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )

if COND matches, but FOO fails to match, it does not go back to backtrack
points inside COND (which it would do without (*THEN)), but instead
abandons the entire branch and jumps to try to match COND2. It's a sort of
branch-level (*COMMIT). At a simple level I suppose it's also equivalent to

  ((?>COND) FOO | ...

but perhaps there are more complicated examples that can't be written that
way.

Philip

--
Philip Hazel
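The atomic-group equivalence suggested in the message above can be tried without any verb support. A minimal sketch in Python, under stated assumptions: `a+`, `ab` and `aa` are hypothetical stand-ins for COND, FOO and the second branch, and the `atomic()` helper is not part of any library (it uses the native `(?>...)` syntax on Python 3.11 or later and a lookahead-plus-backreference emulation on older versions).

```python
import re
import sys


def atomic(inner: str) -> str:
    """Return a pattern fragment that the engine cannot backtrack into."""
    if sys.version_info >= (3, 11):
        return f"(?>{inner})"  # native atomic group
    # Older Pythons: a lookahead is never re-entered, so capturing inside
    # it and replaying the capture gives the same effect.
    return f"(?=(?P<cond>{inner}))(?P=cond)"


SUBJECT = "aaab"

# Hypothetical stand-ins: COND = a+, FOO = ab, second branch = aa.
plain = re.match(r"\A(?:a+ab|aa)", SUBJECT)
cut = re.match(r"\A(?:" + atomic("a+") + r"ab|aa)", SUBJECT)

print(plain.group(0))  # aaab: COND gives back one "a" so that FOO can match
print(cut.group(0))    # aa: COND is locked, branch 1 is abandoned
```

With the locked COND the first branch fails as a whole once FOO fails, and the engine moves on to the next alternative, which is the branch-level behaviour described for (*THEN).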
Re: [pcre-dev] (*THEN) works differently in Perl
> Note that if this operator is used and NOT inside of an alternation
> then it acts exactly like the "(*PRUNE)" operator.
>
> But it doesn't.

Perhaps the misunderstanding comes from the fact that we are talking about
the pattern and they talk about the matching process. So (*THEN) simply
starts a backtrack, and when an alternation is encountered, it switches to
the next alternative. But this happens normally as well, so what is the
exact purpose of this verb then?

This is also very confusing (especially if you read the documentation):

/(a(a|b)c(*THEN)d|e)/

It says:

  Its name comes from the observation that this operation combined with
  the alternation operator ("|") can be used to create what is
  essentially a pattern-based if/then/else block:

  ( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )

But if your COND has an alternation, it will do something else.

I think they simply introduced some random verb which is easy for them to
implement, but totally confusing for a user. Imagine if (*THEN) backtracks
into an atomic block, or a recursion.

Btw this type of (*THEN) is impossible to implement in JIT, because static
analysis of its effect is not always possible.

Regards,
Zoltan
Re: [pcre-dev] (*THEN) works differently in Perl
On Tue, 2 Jul 2019, Zoltán Herczeg wrote:

> If you are right about the internal working of (*THEN), then this verb
> has a very unclear and inconsistent behavior, which is very hard to
> track for a user.

And it totally contradicts the Perl documentation, in particular, this
sentence:

  Note that if this operator is used and NOT inside of an alternation
  then it acts exactly like the "(*PRUNE)" operator.

But it doesn't. Let's see what the Perl maintainers' reaction to my bug
report is.

Philip

--
Philip Hazel
Re: [pcre-dev] (*THEN) works differently in Perl
On Mon, 1 Jul 2019, ND via Pcre-dev wrote:

> As you participate in Perl regex development can you take a look at
> another Perl bug please:

I do not participate in Perl regex development. I just report bugs when I
find them, using the perlbug command. You could do this yourself. (And you
seem to know more about Perl internals than I do.)

> PCRE2 version 10.33 2019-04-16
> /\A(?:.(*COMMIT))*c/
>     abcd
> No match
>
> But Perl reports that this is a successful match "abc".

I think this is also a Perl bug and I will report it.

Philip

--
Philip Hazel
Re: [pcre-dev] (*THEN) works differently in Perl
If you are right about the internal working of (*THEN), then this verb has
a very unclear and inconsistent behavior, which is very hard to track for a
user. I think it should be made obsolete and removed eventually.

Regards,
Zoltan
Re: [pcre-dev] (*THEN) works differently in Perl
On 2019-07-01 10:28, ph10 wrote:
> On Sun, 30 Jun 2019, ND via Pcre-dev wrote:
>> PCRE2 version 10.33 2019-04-16
>> /\A(?:.|..)(*THEN)c/
>>     abc
>> No match
>>
>> Perl matches "abc".
>> I suppose "next innermost alternative" is interpreted differently by
>> PCRE and Perl.
>>
>> If so, maybe PCRE should go the Perl way in this matter?
>
> I think this is a bug in Perl and I will report it as such.

After reading this post
https://rt.perl.org/Public/Bug/Display.html?id=92898#txn-1227153
I am not sure that there is a Perl bug.

I suppose that there are two branches starting from "(?:.|..)". Each of
these branches ends with a common TAIL that runs to the end of the
pattern. Here are the two branches:

1) .(*THEN)c
2) ..(*THEN)c

Let's look at the Perl debug output:

Matching REx "\A(?:.|..)(*THEN)c" against "abcd"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [1..3] gave 2
  Found floating substr "c" at offset 2 (rx_origin now 0)...
  (multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
   0 <>| 0| 1:SBOL /\A/(2)
   0 <>| 0| 2:BRANCH(4)
   0 <>| 1| 3:REG_ANY(8)
   1 | 1| 8:CUTGROUP(10)
   1 | 2| 10:EXACT (12)
   | 2| failed...
   | 1| failed...
   0 <>| 0| 4:BRANCH(7)
   0 <>| 1| 5:REG_ANY(6)
   1 | 1| 6:REG_ANY(8)
   2 | 1| 8:CUTGROUP(10)
   2 | 2| 10:EXACT (12)
   3 | 2| 12:END(0)
Match successful!

So backtracking onto (*THEN) in BRANCH(4) caused an immediate failure of
that branch and a jump to BRANCH(7).
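The two readings of /\A(?:.|..)(*THEN)c/ discussed in this message can be modelled roughly without verbs. A sketch in Python under stated assumptions: the two-branch reading is approximated by distributing the tail "c" over both alternatives, and the PCRE2 reading (the verb is outside any alternation, so nothing to its left is retried) is approximated by locking the group atomically. The `atomic()` helper is hypothetical and only wraps standard `re` syntax; neither pattern uses a real backtracking verb.

```python
import re
import sys


def atomic(inner: str) -> str:
    """Return a pattern fragment that the engine cannot backtrack into."""
    if sys.version_info >= (3, 11):
        return f"(?>{inner})"  # native atomic group
    # Older Pythons: emulate with a lookahead and a backreference.
    return f"(?=(?P<cond>{inner}))(?P=cond)"


SUBJECT = "abc"

# Two-branch reading: the tail belongs to each branch, so a failure after
# "." simply moves on to the ".." branch.
perl_like = re.match(r"\A(?:.c|..c)", SUBJECT)

# PCRE2 reading: once the group has matched ".", the ".." alternative is
# never tried, so "c" fails against "b" and the anchored match fails.
pcre2_like = re.match(r"\A" + atomic(".|..") + "c", SUBJECT)

print(perl_like.group(0))  # abc
print(pcre2_like)          # None
```

The first result corresponds to the Perl outcome reported in the thread and the second to the PCRE2 "No match"; this only models the outcomes and says nothing about how either engine implements the verb.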
Re: [pcre-dev] (*THEN) works differently in Perl
On 2019-07-01 10:28, ph10 wrote:
> I think this is a bug in Perl and I will report it as such.

That's great.
As you participate in Perl regex development, can you take a look at
another Perl bug please:

PCRE2 version 10.33 2019-04-16
/\A(?:.(*COMMIT))*c/
    abcd
No match

But Perl reports that this is a successful match "abc".

Thanks
Re: [pcre-dev] (*THEN) works differently in Perl
On Sun, 30 Jun 2019, ND via Pcre-dev wrote:

> PCRE2 version 10.33 2019-04-16
> /\A(?:.|..)(*THEN)c/
>     abc
> No match
>
> Perl matches "abc".
> I suppose "next innermost alternative" is interpreted differently by
> PCRE and Perl.
>
> If so, maybe PCRE should go the Perl way in this matter?

I think this is a bug in Perl and I will report it as such. The Perl
document says, concerning (*THEN): "when backtracked into on failure, it
causes the regex engine to try the next alternation in the innermost
enclosing group (capturing or otherwise) that has alternations."

There is no group enclosing (*THEN) in your pattern. Perl's doc also says
this:

  Note that if this operator is used and NOT inside of an alternation
  then it acts exactly like the "(*PRUNE)" operator.

... but it doesn't:

Perl 5.03 Regular Expressions
/\A(?:.|..)(*THEN)c/
    abc
 0: abc

/\A(?:.|..)(*PRUNE)c/
    abc
No match

Philip

--
Philip Hazel