Re: [pcre-dev] (*THEN) works differently in Perl

2019-07-10 Thread ND via Pcre-dev

On 2019-07-09 13:53, ph10 wrote:

> On Mon, 8 Jul 2019, ND via Pcre-dev wrote:
>
> > And if we disregard Perl's bugs then it seems (*COMMIT) in Perl works in
> > the following manner:
> >
> > 1. Backtracking can't move to the left of COMMIT (this is PCRE behaviour too)
> > 2. If COMMIT occurs then the match cannot advance to any other position in
> > the subject, no matter whether any other backtracking control verbs occur
> > after COMMIT, or whether COMMIT occurs in an atomic group, negative
> > lookaround, etc. (this is not implemented by PCRE)
>
> There is also a difference in the way Perl handles repeated groups.
> Consider
>
> Perl 5.03 Regular Expressions
> /\A(?:.(*COMMIT))*c/
> abcd
>  0: abc
>
> PCRE2 version 10.34-RC1 2019-04-22
> /\A(?:.(*COMMIT))*c/
> abcd
> No match
>
> In Perl, the group repeat matches "abcd", but when it then does not
> match "c", it unwinds complete repetitions of the group. In PCRE2,
> there is a backtrack onto *COMMIT, so it fails. Looks like Perl handles
> *COMMIT somehow differently to normal backtracks, because it does do
> ordinary backtracks into repeated groups:
>
> Perl 5.03 Regular Expressions
> /\A(.{1,2})*X/
> AABBCX
>  0: AABBCX
>  1: C




No, I don't think Perl handles (*COMMIT) differently. Perl can match a
pattern of the form A*B by several methods. The common method is named
CURLYX-WHILEM, but some optimizations are used in certain situations. Thus,
an optimized method named CURLYM is used when A is a group of constant
length without captures. CURLYM has a buggy implementation that does not
take the influence of (*COMMIT) into account.


Perl matches the patterns
/\A(?:.(*COMMIT))*c/
/\A(?:(*COMMIT).)*c/
using CURLYM, so it handles both cases wrongly, as we can see in the Perl
debug output. But in the second case the result accidentally coincides with
the expected one.


The pattern
/\A(?:.{1,2}(*COMMIT))*c/
is matched with CURLYX-WHILEM, whose implementation does not have this bug.


I think the Perl developers should fix the implementation of CURLYM, or
process groups that contain (*COMMIT) with CURLYX-WHILEM.



What can PCRE do?
PCRE can do nothing, or change to process (*COMMIT) as Perl means it:

1. If COMMIT occurs then backtracking can't move to the part of the pattern
that is to the left of it.
2. If COMMIT occurs then the start position can't be advanced.

These two principles apply no matter whether any other backtracking control
verbs occur after COMMIT, or whether COMMIT occurs in an atomic group, a
negative lookaround, etc.

PCRE does not currently enforce them strictly.
For example, consider this pattern:

PCRE2 version 10.33 2019-04-16
/.?(?!(*COMMIT)x)a/
abc
 0: a

The Perl way is "There can be no backtracking to the left of COMMIT", so the
engine can't backtrack into ".?" and the Perl result will be "no match".


--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] (*THEN) works differently in Perl

2019-07-09 Thread ph10
On Mon, 8 Jul 2019, ND via Pcre-dev wrote:

> And if we disregard Perl's bugs then it seems (*COMMIT) in Perl works in
> the following manner:
> 
> 1. Backtracking can't move to the left of COMMIT (this is PCRE behaviour too)
> 2. If COMMIT occurs then the match cannot advance to any other position in
> the subject, no matter whether any other backtracking control verbs occur
> after COMMIT, or whether COMMIT occurs in an atomic group, negative
> lookaround, etc. (this is not implemented by PCRE)

There is also a difference in the way Perl handles repeated groups. 
Consider

Perl 5.03 Regular Expressions
/\A(?:.(*COMMIT))*c/
abcd
 0: abc

PCRE2 version 10.34-RC1 2019-04-22
/\A(?:.(*COMMIT))*c/
abcd
No match

In Perl, the group repeat matches "abcd", but when it then does not
match "c", it unwinds complete repetitions of the group. In PCRE2,
there is a backtrack onto *COMMIT, so it fails. Looks like Perl handles 
*COMMIT somehow differently to normal backtracks, because it does do 
ordinary backtracks into repeated groups:

Perl 5.03 Regular Expressions
/\A(.{1,2})*X/
AABBCX
 0: AABBCX
 1: C
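
The ordinary backtracking into a repeated group shown above is not specific to Perl. As a rough cross-check, here is the same pattern and subject in Python's standard `re` module. Python is used only as another backtracking engine; its `re` module has no backtracking control verbs such as (*COMMIT), so it cannot reproduce the other examples in this thread.

```python
import re

# Same pattern and subject as the Perl test above. Python's re module is
# only a stand-in backtracking engine here; it does not support (*COMMIT).
match = re.match(r"\A(.{1,2})*X", "AABBCX")

# The greedy repeat first consumes "AA", "BB", "CX". The final "X" then
# fails at end of subject, so the engine backtracks into the last
# repetition and shrinks it to "C", which lets "X" match.
whole = match.group(0)
last_repetition = match.group(1)
print(whole, last_repetition)
```

The captured group holds only the last repetition that took part in the final match, which is why it ends up as a single character.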

Adding {1,2} to the first example gives this:

Perl 5.03 Regular Expressions
/\A(?:.{1,2}(*COMMIT))*c/
abcd
No match

Having another backtrack point inside the group changes things, but then
I found this:

Perl 5.03 Regular Expressions
/\A(?:(*COMMIT).)*c/
abcd
No match

I give up!

Philip

-- 
Philip Hazel



Re: [pcre-dev] (*THEN) works differently in Perl

2019-07-07 Thread ND via Pcre-dev
And if we disregard Perl's bugs then it seems (*COMMIT) in Perl works in
the following manner:


1. Backtracking can't move to the left of COMMIT (this is PCRE behaviour
too).
2. If COMMIT occurs then the match cannot advance to any other position in
the subject, no matter whether any other backtracking control verbs occur
after COMMIT, or whether COMMIT occurs in an atomic group, negative
lookaround, etc. (this is not implemented by PCRE).




Re: [pcre-dev] (*THEN) works differently in Perl

2019-07-07 Thread ND via Pcre-dev

On 2019-07-03 17:33, ph10 wrote:

> On Tue, 2 Jul 2019, ND via Pcre-dev wrote:
>
> > It seems Perl is either quite buggy or has a really different
> > conception of (*COMMIT) than PCRE.
>
> I am waiting for further information from the Perl developers, but I
> suspect that I won't want to change PCRE2, except perhaps to add more
> detail to the documentation. In pcre2compat.3 there are already some
> comments about differences in the way the (*VERB)s are processed. Note
> also that they interact badly with optimizations (both in PCRE2 and
> Perl).




My inner voice tells me that there will be no answer from the Perl
developers :)
No answer (either none at all, or no follow-up within a thread) is a
frequent occurrence in the Perl bug tracker.
It's obvious that the COMMIT implementation in Perl has bugs
(https://rt.perl.org/Public/Search/Simple.html?q=*commit). Users worry
about this, but the Perl developers do not.


PCRE has much more consistent and better documented behaviour for
backtracking control verbs.
So it seems best not to change PCRE's behaviour around COMMIT and THEN in
unnecessary attempts to achieve full compatibility with Perl.


Feel free to close this thread.



Re: [pcre-dev] (*THEN) works differently in Perl

2019-07-03 Thread ph10
On Tue, 2 Jul 2019, ND via Pcre-dev wrote:

> It seems Perl is either quite buggy or has a really different conception
> of (*COMMIT) than PCRE.

I am waiting for further information from the Perl developers, but I 
suspect that I won't want to change PCRE2, except perhaps to add more 
detail to the documentation. In pcre2compat.3 there are already some 
comments about differences in the way the (*VERB)s are processed. Note 
also that they interact badly with optimizations (both in PCRE2 and 
Perl).

Philip

-- 
Philip Hazel



Re: [pcre-dev] (*THEN) works differently in Perl

2019-07-03 Thread Zoltán Herczeg
> A Perl developer has admitted there is some ambiguity, but suggests that
> (*COMMIT) just means "never advance the starting point". That pattern
> can find a match without advancing the starting point.

The documentation states two rules:

1) It's a zero-width pattern similar to (*SKIP), except that when backtracked 
into on failure it causes the match to fail outright.
2) No further attempts to find a valid match by advancing the start pointer 
will occur again.
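
The "start pointer" in rule 2 is the offset in the subject at which a match attempt begins. A minimal sketch of that outer loop in Python follows. The helper `search_from_start` and its `allow_advance` flag are hypothetical names made up for this illustration; they are not part of Perl or PCRE2, and Python's `re` module has no (*COMMIT). The flag only imitates the effect of forbidding further start positions; a real (*COMMIT) has that effect only if the verb was actually reached during the failed attempt.

```python
import re

def search_from_start(pattern, subject, allow_advance=True):
    """Try the pattern at offset 0, then optionally at later offsets.

    allow_advance=False imitates only the second rule: once the first
    attempt has failed, no other starting offset is tried.
    """
    compiled = re.compile(pattern)
    for start in range(len(subject) + 1):
        match = compiled.match(subject, start)  # anchored at `start`
        if match:
            return match
        if not allow_advance:
            break
    return None

# "bc" cannot match at offset 0 of "abc", only at offset 1.
found = search_from_start(r"bc", "abc")
blocked = search_from_start(r"bc", "abc", allow_advance=False)
print(found.start() if found else None, blocked)
```

Rule 1 is a separate matter: it concerns backtracking inside a single attempt, which this loop does not model.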

If the first rule is not honored at all, the description is clearly wrong. 
However, you can put (*COMMIT) at the beginning of the pattern which ignores 
the 1st rule, so changing the verb would be a hidden feature removal.

Anyway, it concerns me that the verbs are not clearly defined / implemented in 
Perl. They are quite powerful tools, but undefined behavior breaks them. The 
documentation should list exceptions, e.g. they work differently inside 
assertions. If that is unintended then fix assertions.

Regards,
Zoltan
 


Re: [pcre-dev] (*THEN) works differently in Perl

2019-07-02 Thread ND via Pcre-dev

On 2019-07-02 14:34, ph10 wrote:

> A Perl developer has admitted there is some ambiguity, but suggests that
> (*COMMIT) just means "never advance the starting point". That pattern
> can find a match without advancing the starting point. I have pointed
> out that, in that case, /.*(*COMMIT)c/ should also match, but it
> doesn't. This is still under discussion by the Perl people. It may take
> some time for a conclusion to emerge.





Your example
/.*(*COMMIT)c/
is very reasonable and contradicts what the Perl authors say.



And here is another example. Perl reports no match, as if it backtracked
onto (*COMMIT) inside the possessive (atomic) group:


/(?>.b(*COMMIT))*c/
abxabc
 0: abc


It seems Perl is either quite buggy or has a really different conception of
(*COMMIT) than PCRE.




Re: [pcre-dev] (*THEN) works differently in Perl

2019-07-02 Thread Максим via Pcre-dev
> And it totally contradicts the Perl documentation, in particular, this
> sentence:
>
>   Note that if this operator is used and NOT inside of an alternation
>   then it acts exactly like the "(*PRUNE)" operator.

Sorry, I'm ND, but writing from another mailbox.

I guess that from Perl's point of view (*THEN) IS inside a branch of the
alternation (please look at the line indentation in the Perl debug output:
it is inside BRANCH). So it should not act as (*PRUNE) in that example.


Re: [pcre-dev] (*THEN) works differently in Perl

2019-07-02 Thread ph10
On Tue, 2 Jul 2019, I wrote:

> > PCRE2 version 10.33 2019-04-16
> > /\A(?:.(*COMMIT))*c/
> > abcd
> > No match
> > 
> > But Perl reports that this is a successful match of "abc".
> 
> I think this is also a Perl bug and I will report it.

A Perl developer has admitted there is some ambiguity, but suggests that 
(*COMMIT) just means "never advance the starting point". That pattern 
can find a match without advancing the starting point. I have pointed 
out that, in that case, /.*(*COMMIT)c/ should also match, but it 
doesn't. This is still under discussion by the Perl people. It may take 
some time for a conclusion to emerge.

Regards,
Philip

-- 
Philip Hazel



Re: [pcre-dev] (*THEN) works differently in Perl

2019-07-02 Thread ph10
On Tue, 2 Jul 2019, Zoltán Herczeg wrote:

> Perhaps the misunderstanding comes from the fact that we are talking
> about the pattern and they talk about the matching process. So (*THEN)
> simply starts a backtrack, and when an alternation is encountered, it
> switches to the next alternative. 

That is indeed what happens in the pcre2_match() interpreter.

> But this happens normally as well, so what is the exact purpose of
> this verb then?

Not quite. (*THEN) suppresses going back to a previous backtrack inside 
the branch. In the Perl example

  ( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )
  
if COND matches, but FOO fails to match, it does not go back to
backtrack points inside COND (which it would do without (*THEN)), but
instead abandons the entire branch and jumps to try to match COND2. It's
a sort of branch-level (*COMMIT).

At a simple level I suppose it's also equivalent to

  ((?>COND) FOO | ...
  
but perhaps there are more complicated examples that can't be written 
that way. 
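
The atomic-group form above can be tried in an engine that has no (*THEN). The following is a minimal sketch in Python, whose `re` module has no backtracking verbs. To avoid depending on newer Python versions, the atomic group is emulated with the lookahead-plus-backreference idiom `(?=(X))\1`. The sub-patterns standing in for COND and FOO (`\d+`, `3`, `\w+`) and the group names are made up for illustration.

```python
import re

SUBJECT = "123"

# Branch 1 is COND FOO with COND = \d+ and FOO = 3; branch 2 is a fallback.
# Plain version: when FOO fails, the engine backtracks into COND.
plain = re.compile(r"^(?:(?P<first>\d+3)|(?P<second>\w+))$")

# "Atomic COND" version: (?=(?P<cond>\d+))(?P=cond) behaves like (?>\d+),
# because a lookahead is not re-entered once it has succeeded.
atomic = re.compile(
    r"^(?:(?P<first>(?=(?P<cond>\d+))(?P=cond)3)|(?P<second>\w+))$"
)

plain_match = plain.match(SUBJECT)
atomic_match = atomic.match(SUBJECT)

# Which branch produced the match in each case?
plain_branch = "first" if plain_match.group("first") is not None else "second"
atomic_branch = "first" if atomic_match.group("first") is not None else "second"
print(plain_branch, atomic_branch)
```

In the plain pattern the first branch succeeds by giving back the final digit from COND. With the atomic COND that is not possible, so the whole branch is abandoned and the second alternative matches instead, which is the effect described above for (*THEN).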
 
Philip

-- 
Philip Hazel


Re: [pcre-dev] (*THEN) works differently in Perl

2019-07-02 Thread Zoltán Herczeg
> Note that if this operator is used and NOT inside of an alternation
> then it acts exactly like the "(*PRUNE)" operator.
> But it doesn't.

Perhaps the misunderstanding comes from the fact that we are talking about the 
pattern and they talk about the matching process. So (*THEN) simply starts a 
backtrack, and when an alternation is encountered, it switches to the next 
alternative. But this happens normally as well, so what is the exact purpose of 
this verb then?

This is also very confusing (especially if you read the documentation):
/(a(a|b)c(*THEN)d|e)/

It says:
Its name comes from the observation that this operation combined with the 
alternation operator ("|") can be used to create what is essentially a 
pattern-based if/then/else block:

( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )

But if your COND contains an alternation, it will do something else.

I think they simply introduced some random verb which is easy to implement for 
them, but totally confusing for a user. Imagine if (*THEN) backtracks into an 
atomic block, or a recursion. Btw this type of (*THEN) is impossible to 
implement in JIT, because static analysis of its effect is not always possible.

Regards,
Zoltan
 


Re: [pcre-dev] (*THEN) works differently in Perl

2019-07-02 Thread ph10
On Tue, 2 Jul 2019, Zoltán Herczeg wrote:

> If you are right about the internal working of (*THEN), then this verb
> has a very unclear and inconsistent behavior, which is very hard to
> track for a user. 

And it totally contradicts the Perl documentation, in particular, this 
sentence:

  Note that if this operator is used and NOT inside of an alternation
  then it acts exactly like the "(*PRUNE)" operator.

But it doesn't.

Let's see what the Perl maintainers' reaction to my bug report is.

Philip

-- 
Philip Hazel


Re: [pcre-dev] (*THEN) works differently in Perl

2019-07-02 Thread ph10
On Mon, 1 Jul 2019, ND via Pcre-dev wrote:

> Since you participate in Perl regex development, could you please take a
> look at another Perl bug:

I do not participate in Perl regex development. I just report bugs when 
I find them, using the perlbug command. You could do this yourself. (And 
you seem to know more about Perl internals than I do.)

> PCRE2 version 10.33 2019-04-16
> /\A(?:.(*COMMIT))*c/
> abcd
> No match
> 
> But Perl reports that this is a successful match of "abc".

I think this is also a Perl bug and I will report it.

Philip

-- 
Philip Hazel



Re: [pcre-dev] (*THEN) works differently in Perl

2019-07-02 Thread Zoltán Herczeg
If you are right about the internal working of (*THEN), then this verb has a 
very unclear and inconsistent behavior, which is very hard to track for a user. 
I think it should be made obsolete and removed eventually.

Regards,
Zoltan
 


Re: [pcre-dev] (*THEN) works differently in Perl

2019-07-01 Thread ND via Pcre-dev

On 2019-07-01 10:28, ph10 wrote:

> On Sun, 30 Jun 2019, ND via Pcre-dev wrote:
>
> > PCRE2 version 10.33 2019-04-16
> > /\A(?:.|..)(*THEN)c/
> > abc
> > No match
> >
> > Perl matches "abc".
> > I suppose "next innermost alternative" is interpreted differently by PCRE
> > and Perl.
> >
> > If so, maybe PCRE should go the Perl way in this matter?
>
> I think this is a bug in Perl and I will report it as such.



After reading this post
https://rt.perl.org/Public/Bug/Display.html?id=92898#txn-1227153

I am not sure that this is a Perl bug.
I suppose that there are two branches starting from "(?:.|..)". Each of
these branches ends with a common TAIL that runs to the end of the pattern.
Here are the two branches:

1) .(*THEN)c
2) ..(*THEN)c

Let's look at the Perl debug output:


Matching REx "\A(?:.|..)(*THEN)c" against "abcd"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [1..3] gave 2
  Found floating substr "c" at offset 2 (rx_origin now 0)...
  (multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
   0 <> <abcd>               |   0| 1:SBOL /\A/(2)
   0 <> <abcd>               |   0| 2:BRANCH(4)
   0 <> <abcd>               |   1|  3:REG_ANY(8)
   1 <a> <bcd>               |   1|  8:CUTGROUP(10)
   1 <a> <bcd>               |   2|   10:EXACT <c>(12)
                             |   2|   failed...
                             |   1|  failed...
   0 <> <abcd>               |   0| 4:BRANCH(7)
   0 <> <abcd>               |   1|  5:REG_ANY(6)
   1 <a> <bcd>               |   1|  6:REG_ANY(8)
   2 <ab> <cd>               |   1|  8:CUTGROUP(10)
   2 <ab> <cd>               |   2|   10:EXACT <c>(12)
   3 <abc> <d>               |   2|   12:END(0)
Match successful!


So backtracking onto (*THEN) in BRANCH(4) caused this branch to fail
immediately and the engine to jump to BRANCH(7).




Re: [pcre-dev] (*THEN) works differently in Perl

2019-07-01 Thread ND via Pcre-dev

On 2019-07-01 10:28, ph10 wrote:

> I think this is a bug in Perl and I will report it as such.

That's great.

Since you participate in Perl regex development, could you please take a
look at another Perl bug:



PCRE2 version 10.33 2019-04-16
/\A(?:.(*COMMIT))*c/
abcd
No match


But Perl reports that this is a successful match of "abc".


Thanks



Re: [pcre-dev] (*THEN) works differently in Perl

2019-07-01 Thread ph10
On Sun, 30 Jun 2019, ND via Pcre-dev wrote:

> PCRE2 version 10.33 2019-04-16
> /\A(?:.|..)(*THEN)c/
> abc
> No match
> 
> 
> Perl matches "abc".
> I suppose "next innermost alternative" is interpreted differently by PCRE and
> Perl.
> 
> If so, maybe PCRE should go the Perl way in this matter?

I think this is a bug in Perl and I will report it as such. The Perl 
document says, concerning (*THEN): "when backtracked into on failure, it
causes the regex engine to try the next alternation in the innermost
enclosing group (capturing or otherwise) that has alternations."

There is no group enclosing (*THEN) in your pattern. Perl's doc also 
says this:

  Note that if this operator is used and NOT inside of an alternation
  then it acts exactly like the "(*PRUNE)" operator.
  
.. but it doesn't:

Perl 5.03 Regular Expressions

/\A(?:.|..)(*THEN)c/
abc
 0: abc

/\A(?:.|..)(*PRUNE)c/
abc
No match

Philip

-- 
Philip Hazel
