Re: [pcre-dev] Quantifying backtracking verbs

2019-06-20 Thread ph10
On Wed, 19 Jun 2019, ND via Pcre-dev wrote:

> >At present, lookarounds do not take part in minimum length calculations,
> 
> I see lookarounds takes part: first and last code units are searched in
> lookarounds too.

I wasn't quite precise. Lookarounds are not scanned when computing
a minimum length, but they may provide a first/last code unit. If the
minimum length is less than 2, the existence of the first/last code
units will be used to set a minimum length.

Philip

-- 
Philip Hazel

-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Quantifying backtracking verbs

2019-06-20 Thread ND via Pcre-dev

On 2019-06-20 15:53, ph10 wrote:
I have updated the doc to use your example, but it can be done easily 
with other PCRE2 facilities:

 (?|(ab)c|(a))
 does the same thing. If "a" is complex, and you do not want to write it 
out twice, you could DEFINE it and use a subroutine call.




I don't say that it can't be done at all. I say that it can't be done with  
same effectiveness (performance).
In your example "a" trying to match twice if "abc" is not match. With  
(*ACCEPT)?? - only once.

You can add performance:


(?|((a)b)c|(\2))

But there is also less performance then with (a(*ACCEPT)??b)c. And much  
more unreadable. And have excess capture.


--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Quantifying backtracking verbs

2019-06-20 Thread ph10
On Sun, 16 Jun 2019, ND via Pcre-dev wrote:

> A following example was included in docs (pcre2pattern.html) :
> 
> A(*ACCEPT)??BC
> 
> But this example does not show what we can do with (*ACCESS)?? that can't
> doing well with another PCRE facilities.
> I suggest to show in docs another example that was posted earlier in this
> thread:
> 
> (a(*ACCEPT)??b)c
> 
> In this example we want to match "abc" or "a" AND capture all but "c".

I have updated the doc to use your example, but it can be done easily 
with other PCRE2 facilities:

  (?|(ab)c|(a))
  
does the same thing. If "a" is complex, and you do not want to write it 
out twice, you could DEFINE it and use a subroutine call.

Philip

-- 
Philip Hazel

-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Quantifying backtracking verbs

2019-06-19 Thread ND via Pcre-dev

On 2019-06-19 17:15, ph10 wrote:


At present, lookarounds do not take part in minimum length calculations,


I see lookarounds takes part: first and last code units are searched in  
lookarounds too.

So this is another reason in opposition to my poroposal.

So I suggest to close this thread.

--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Quantifying backtracking verbs

2019-06-19 Thread ph10
On Wed, 19 Jun 2019, ND via Pcre-dev wrote:

> (*ACCEPT) can't leave lookaround borders. So ACCEPT's that are inside
> lookarounds can't influence minimum length claculation, if lookaround entrails
> are not participate in this calculation (is this true?).
> 
> Thus more preferable may be to turn off minimum length scan not for all
> patterns that have ACCEPT. But only for patterns with at least one ACCEPT that
> is not inside a lookaround. Of course if it's easy to distinguish such
> ACCEPT's.

At present, lookarounds do not take part in minimum length calculations, 
though I have been wondering if there is an easy way to make use of 
positive lookaheads. I am not sure that there is an easy way, and I am
also not sure whether it is worth doing in any case. How often would it 
matter?

Unfortunately, it isn't simple to distinguish ACCEPT's in assertions 
when compiling. The current code sets a flag for all ACCEPTs (which is 
very simple - one line of code). Again, I am not sure whether doing a 
lot of work for this case is worth it - and there is also the case of 
ASSERT in a non-assertion that is called as a subroutine from inside an
assertion. Actually, I suppose that wouldn't matter - it would cause the 
scan to be disabled, which does no harm. However, my point is that this 
can get rather complicated, which makes it error-prone.

Philip

-- 
Philip Hazel

-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Quantifying backtracking verbs

2019-06-18 Thread ND via Pcre-dev



(*ACCEPT) can't leave lookaround borders. So ACCEPT's that are inside  
lookarounds can't influence minimum length claculation, if lookaround  
entrails are not participate in this calculation (is this true?).


Thus more preferable may be to turn off minimum length scan not for all  
patterns that have ACCEPT. But only for patterns with at least one ACCEPT  
that is not inside a lookaround. Of course if it's easy to distinguish  
such ACCEPT's.


--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Quantifying backtracking verbs

2019-06-18 Thread ph10
On Mon, 17 Jun 2019, ND via Pcre-dev wrote:

> >If a pattern contains (*ACCEPT) the code for finding the minimum length
> >gives up because everything gets too complicated, and so the minimum is
> >0.
> 
> Yes, it should. But it not 0, it is 2 in example from this thread:
> 
> /A(?:(*ACCEPT))?B/info
> Capture group count = 0
> First code unit = 'A'
> Subject length lower bound = 2
> A
> No match
> 
> So this pattern have (*ACCEPT) but subject_length_lower_bound is not 0.

Indeed! You have found a bug. The minimum length scan was not noticing
(*ACCEPT) because it skipped over the group with minimum zero repeat. I
have now changed the code so that it remembers (*ACCEPT) is present, and
so skips doing a length scan. However, the above example now gives 1
rather than zero, because it notices that there is a first code unit (a
recent change, suggested by you).

Thanks for the bug report.

Philip

-- 
Philip Hazel

-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Quantifying backtracking verbs

2019-06-17 Thread ND via Pcre-dev



It seems you don't understand or I don't. Sorry for my bad English.
I don't ask to calculate real subject_length_lower_bound in patterns with  
ACCEPT.

I ask to set subject_length_lower_bound to 0 in all such patterns.


On 2019-06-17 15:07, ph10 wrote:

If a pattern contains (*ACCEPT) the code for finding the minimum length
gives up because everything gets too complicated, and so the minimum is
0.


Yes, it should. But it not 0, it is 2 in example from this thread:


/A(?:(*ACCEPT))?B/info
Capture group count = 0
First code unit = 'A'
Subject length lower bound = 2
A
No match


So this pattern have (*ACCEPT) but subject_length_lower_bound is not 0.

--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Quantifying backtracking verbs

2019-06-17 Thread ph10
On Mon, 17 Jun 2019, ND via Pcre-dev wrote:

> I don't so this fix in PCRE2 source code repository.
> Is there plans to fix it as it doing with another patterns with ACCEPT?
> 
> For example studying a pattern
> 
> /a(*ACCEPT)/
> 
> produce
> Subject length lower bound = 0

If a pattern contains (*ACCEPT) the code for finding the minimum length
gives up because everything gets too complicated, and so the minimum is
0. A simple case like the above is all very well, but I chose not to try
to handle cases like /(abcd(*ACCEPT)|xyz)123/ and worse.

Thank you for all your documentation comments. I will work through them 
in due course.

Philip

-- 
Philip Hazel

-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Quantifying backtracking verbs

2019-06-17 Thread ND via Pcre-dev

On 2019-06-04 16:59, ND wrote:
1. Start optimizer brakes a result to "no match" from "match". Is there  
documented (I remember only example with (*COMMIT) where optimizer can  
make "match" from "no match")? May be there is a way to correct this  
PCRE optimization to not break a result.


I don't so this fix in PCRE2 source code repository.
Is there plans to fix it as it doing with another patterns with ACCEPT?

For example studying a pattern

/a(*ACCEPT)/

produce
Subject length lower bound = 0

--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Quantifying backtracking verbs

2019-06-15 Thread ND via Pcre-dev

On 2019-06-10 16:47, ph10 wrote:

I have done this, and committed the result. However, it seems to me that 
/a(*ACCEPT)??bc/ is the same as a(?:bc|) though if a, b, and c are 
complex it may be easier to read.




A following example was included in docs (pcre2pattern.html) :

A(*ACCEPT)??BC


But this example does not show what we can do with (*ACCESS)?? that can't  
doing well with another PCRE facilities.
I suggest to show in docs another example that was posted earlier in this  
thread:


(a(*ACCEPT)??b)c


In this example we want to match "abc" or "a" AND capture all but "c".

--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Quantifying backtracking verbs

2019-06-10 Thread ph10
On Wed, 5 Jun 2019, ND via Pcre-dev wrote:

> It seems a bug of Perl start optimizer. It say:
> "Did not find floating substr "bc"...
> Match rejected by optimizer"
> 
> Please look at PCRE start optimizer. 

PCRE2 does not record "c" as necessary after (*ACCEPT). So it does not 
have this bug. (It only records a single code unit.)

> >The simplest fix for PCRE2 is to change (*ACCEPT) into
> >(?:(*ACCEPT)) at compile time. This avoids any
> >implementation requirement at match time and in the JIT.

I have done this, and committed the result. However, it seems to me that 
/a(*ACCEPT)??bc/ is the same as a(?:bc|) though if a, b, and c are 
complex it may be easier to read.

Philip

-- 
Philip Hazel

-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Quantifying backtracking verbs

2019-06-05 Thread ND via Pcre-dev

On 2019-06-05 16:53, ph10 wrote:


Perl gets it wrong:
/(a(?:(*ACCEPT))??bc)/
axy
No match
/a(*ACCEPT)??bc/
axy
No match



It seems a bug of Perl start optimizer. It say:
"Did not find floating substr "bc"...
Match rejected by optimizer"

Please look at PCRE start optimizer. It seems correction needed to act  
with such pattern accurately (as with "(a(*ACCEPT)bc)")




The simplest fix for PCRE2 is to change (*ACCEPT) into
(?:(*ACCEPT)) at compile time. This avoids any
implementation requirement at match time and in the JIT.



It's fine for me.

--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Quantifying backtracking verbs

2019-06-05 Thread ph10
On Wed, 5 Jun 2019, I wrote:

> On Wed, 5 Jun 2019, ND via Pcre-dev wrote:
> 
> > It's not true for (*ACCEPT).
> > Construct "(*ACCEPT)??" is like a negated (*COMMIT) pattern:
> > (*COMMIT) immediately fail a whole match when backtrack to it occurs
> > (*ACCEPT)?? immediately match when backtrack to it occurs
> 
> Ah, that is an interesting feature that I had not though of. I agree it 
> could be useful and will consider how to implement it. Thank you for the 
> insight.

Perl gets it wrong:

Perl 5.028001 Regular Expressions

/(a(?:(*ACCEPT))??bc)/
abc
 0: abc
axy
No match

/a(*ACCEPT)??bc/
abc
 0: abc
axy
No match

PCRE does get it right when parentheses are used:

PCRE2 version 10.34-RC1 2019-04-22
/(a(?:(*ACCEPT))??bc)/
abc
 0: abc
axy
 0: a

The simplest fix for PCRE2 is to change (*ACCEPT) into
(?:(*ACCEPT)) at compile time. This avoids any
implementation requirement at match time and in the JIT.

Philip

-- 
Philip Hazel

-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Quantifying backtracking verbs

2019-06-05 Thread ph10
On Wed, 5 Jun 2019, ND via Pcre-dev wrote:

> It's not true for (*ACCEPT).
> Construct "(*ACCEPT)??" is like a negated (*COMMIT) pattern:
> (*COMMIT) immediately fail a whole match when backtrack to it occurs
> (*ACCEPT)?? immediately match when backtrack to it occurs

Ah, that is an interesting feature that I had not though of. I agree it 
could be useful and will consider how to implement it. Thank you for the 
insight.

Philip

-- 
Philip Hazel

-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Quantifying backtracking verbs

2019-06-05 Thread ND via Pcre-dev




Repetition is allowed for groups such as (?:...) but not for individual
backtracking verbs


It seems Perl does not rise error with "(*ACCEPT)??". And generates  
expected code.

Is there weighty reason to be not compatible with Perl in this situation?



(for which it is meaningless).


It's not true for (*ACCEPT).
Construct "(*ACCEPT)??" is like a negated (*COMMIT) pattern:
(*COMMIT) immediately fail a whole match when backtrack to it occurs
(*ACCEPT)?? immediately match when backtrack to it occurs

Consider a pattern:

(a(?:(*ACCEPT))??b)c

Lets we want to match "abc" and capture "ab". If "abc" don't match we want  
to match "a" and capture it.
What another PCRE pattern allows doing this with same effectiveness (a,b,c  
can be a large and complex subpatterns)?


So, this construct is not meaningless. Moreover, I can't find another  
PCRE-pattern that can do the same without sacrifice of effectiveness.


--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Quantifying backtracking verbs

2019-06-05 Thread ph10
On Tue, 4 Jun 2019, ND via Pcre-dev wrote:

> PCRE2 version 10.33 2019-04-16
> /A(?:(*ACCEPT))?B/
> A
> No match
> 
> /A(?:(*ACCEPT))?B/no_start_optimize
> A
> 0: A
> 
> /A(*ACCEPT)?B/
> Failed: error 109 at offset 10: quantifier does not follow a repeatable item
> A
> 
> 
> I have a two questions with it:
> 1. Start optimizer brakes a result to "no match" from "match". Is there
> documented (I remember only example with (*COMMIT) where optimizer can make
> "match" from "no match")? May be there is a way to correct this PCRE
> optimization to not break a result.

It is documented that the start-up optimizations and the backtracking
control verbs such as (*ACCEPT) do not play well together. In
particular, the result of a match may change. Here is an extract from
the pcre2pattern documentation:

  Note that (*COMMIT) at the start of a pattern is not the same as an
  anchor, unless PCRE2's start-of-match optimizations are turned off, as
  shown in this output from pcre2test:
  
  re> /(*COMMIT)abc/
data> xyzabc
 0: abc
data>
re> /(*COMMIT)abc/no_start_optimize
data> xyzabc
No match
  
  For the first pattern, PCRE2 knows that any match must start with "a",
  so the optimization skips along the subject to "a" before applying the
  pattern to the first set of data. The match attempt then succeeds. The
  second pattern disables the optimization that skips along to the first
  character. The pattern is now applied starting at "x", and so the
  (*COMMIT) causes the match to fail without trying any other starting
  points.

There is also a whole section called "Optimizations that affect 
backtracking verbs" which contains this:

  PCRE2 contains some optimizations that are used to speed up matching
  by running some checks at the start of each match attempt. For
  example, it may know the minimum length of matching subject, or that a
  particular character must be present. When one of these optimizations
  bypasses the running of a match, any included backtracking verbs will
  not, of course, be processed.

> 2. If "(?:(*ACCEPT))?" syntax is allowed, then why a more ergonomic
> "(*ACCEPT)?" is disabled?

Repetition is allowed for groups such as (?:...) but not for individual 
backtracking verbs (for which it is meaningless). PCRE2 is not clever 
enough to realize that (?:(*ACCEPT))? has nothing other than (*ACCEPT) 
inside (?:). If it did, I would expect an error. Repeating any group
that has an unconditional (*ACCEPT) does not seem meaningful but there
are cases such as (a(*ACCEPT))? that could perhaps be useful. Such a 
group can only match zero or one times. It means "If 'a' end the match; 
if not 'a' carry on with the pattern."

Philip

-- 
Philip Hazel

-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


[pcre-dev] Quantifying backtracking verbs

2019-06-04 Thread ND via Pcre-dev



Good day!

Here is pcre2test listing:


PCRE2 version 10.33 2019-04-16
/A(?:(*ACCEPT))?B/
A
No match

/A(?:(*ACCEPT))?B/no_start_optimize
A
 0: A

/A(*ACCEPT)?B/
Failed: error 109 at offset 10: quantifier does not follow a repeatable  
item

A


I have a two questions with it:
1. Start optimizer brakes a result to "no match" from "match". Is there  
documented (I remember only example with (*COMMIT) where optimizer can  
make "match" from "no match")? May be there is a way to correct this PCRE  
optimization to not break a result.
2. If "(?:(*ACCEPT))?" syntax is allowed, then why a more ergonomic  
"(*ACCEPT)?" is disabled?



Thanks in advance

--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev