Re: [pcre-dev] Quantifying backtracking verbs
On Wed, 19 Jun 2019, ND via Pcre-dev wrote: > >At present, lookarounds do not take part in minimum length calculations, > > I see lookarounds takes part: first and last code units are searched in > lookarounds too. I wasn't quite precise. Lookarounds are not scanned when computing a minimum length, but they may provide a first/last code unit. If the minimum length is less than 2, the existence of the first/last code units will be used to set a minimum length. Philip -- Philip Hazel -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Quantifying backtracking verbs
On 2019-06-20 15:53, ph10 wrote: I have updated the doc to use your example, but it can be done easily with other PCRE2 facilities: (?|(ab)c|(a)) does the same thing. If "a" is complex, and you do not want to write it out twice, you could DEFINE it and use a subroutine call. I don't say that it can't be done at all. I say that it can't be done with same effectiveness (performance). In your example "a" trying to match twice if "abc" is not match. With (*ACCEPT)?? - only once. You can add performance: (?|((a)b)c|(\2)) But there is also less performance then with (a(*ACCEPT)??b)c. And much more unreadable. And have excess capture. -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Quantifying backtracking verbs
On Sun, 16 Jun 2019, ND via Pcre-dev wrote: > A following example was included in docs (pcre2pattern.html) : > > A(*ACCEPT)??BC > > But this example does not show what we can do with (*ACCESS)?? that can't > doing well with another PCRE facilities. > I suggest to show in docs another example that was posted earlier in this > thread: > > (a(*ACCEPT)??b)c > > In this example we want to match "abc" or "a" AND capture all but "c". I have updated the doc to use your example, but it can be done easily with other PCRE2 facilities: (?|(ab)c|(a)) does the same thing. If "a" is complex, and you do not want to write it out twice, you could DEFINE it and use a subroutine call. Philip -- Philip Hazel -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Quantifying backtracking verbs
On 2019-06-19 17:15, ph10 wrote: At present, lookarounds do not take part in minimum length calculations, I see lookarounds takes part: first and last code units are searched in lookarounds too. So this is another reason in opposition to my poroposal. So I suggest to close this thread. -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Quantifying backtracking verbs
On Wed, 19 Jun 2019, ND via Pcre-dev wrote: > (*ACCEPT) can't leave lookaround borders. So ACCEPT's that are inside > lookarounds can't influence minimum length claculation, if lookaround entrails > are not participate in this calculation (is this true?). > > Thus more preferable may be to turn off minimum length scan not for all > patterns that have ACCEPT. But only for patterns with at least one ACCEPT that > is not inside a lookaround. Of course if it's easy to distinguish such > ACCEPT's. At present, lookarounds do not take part in minimum length calculations, though I have been wondering if there is an easy way to make use of positive lookaheads. I am not sure that there is an easy way, and I am also not sure whether it is worth doing in any case. How often would it matter? Unfortunately, it isn't simple to distinguish ACCEPT's in assertions when compiling. The current code sets a flag for all ACCEPTs (which is very simple - one line of code). Again, I am not sure whether doing a lot of work for this case is worth it - and there is also the case of ASSERT in a non-assertion that is called as a subroutine from inside an assertion. Actually, I suppose that wouldn't matter - it would cause the scan to be disabled, which does no harm. However, my point is that this can get rather complicated, which makes it error-prone. Philip -- Philip Hazel -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Quantifying backtracking verbs
(*ACCEPT) can't leave lookaround borders. So ACCEPT's that are inside lookarounds can't influence minimum length claculation, if lookaround entrails are not participate in this calculation (is this true?). Thus more preferable may be to turn off minimum length scan not for all patterns that have ACCEPT. But only for patterns with at least one ACCEPT that is not inside a lookaround. Of course if it's easy to distinguish such ACCEPT's. -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Quantifying backtracking verbs
On Mon, 17 Jun 2019, ND via Pcre-dev wrote: > >If a pattern contains (*ACCEPT) the code for finding the minimum length > >gives up because everything gets too complicated, and so the minimum is > >0. > > Yes, it should. But it not 0, it is 2 in example from this thread: > > /A(?:(*ACCEPT))?B/info > Capture group count = 0 > First code unit = 'A' > Subject length lower bound = 2 > A > No match > > So this pattern have (*ACCEPT) but subject_length_lower_bound is not 0. Indeed! You have found a bug. The minimum length scan was not noticing (*ACCEPT) because it skipped over the group with minimum zero repeat. I have now changed the code so that it remembers (*ACCEPT) is present, and so skips doing a length scan. However, the above example now gives 1 rather than zero, because it notices that there is a first code unit (a recent change, suggested by you). Thanks for the bug report. Philip -- Philip Hazel -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Quantifying backtracking verbs
It seems you don't understand or I don't. Sorry for my bad English. I don't ask to calculate real subject_length_lower_bound in patterns with ACCEPT. I ask to set subject_length_lower_bound to 0 in all such patterns. On 2019-06-17 15:07, ph10 wrote: If a pattern contains (*ACCEPT) the code for finding the minimum length gives up because everything gets too complicated, and so the minimum is 0. Yes, it should. But it not 0, it is 2 in example from this thread: /A(?:(*ACCEPT))?B/info Capture group count = 0 First code unit = 'A' Subject length lower bound = 2 A No match So this pattern have (*ACCEPT) but subject_length_lower_bound is not 0. -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Quantifying backtracking verbs
On Mon, 17 Jun 2019, ND via Pcre-dev wrote: > I don't so this fix in PCRE2 source code repository. > Is there plans to fix it as it doing with another patterns with ACCEPT? > > For example studying a pattern > > /a(*ACCEPT)/ > > produce > Subject length lower bound = 0 If a pattern contains (*ACCEPT) the code for finding the minimum length gives up because everything gets too complicated, and so the minimum is 0. A simple case like the above is all very well, but I chose not to try to handle cases like /(abcd(*ACCEPT)|xyz)123/ and worse. Thank you for all your documentation comments. I will work through them in due course. Philip -- Philip Hazel -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Quantifying backtracking verbs
On 2019-06-04 16:59, ND wrote: 1. Start optimizer brakes a result to "no match" from "match". Is there documented (I remember only example with (*COMMIT) where optimizer can make "match" from "no match")? May be there is a way to correct this PCRE optimization to not break a result. I don't so this fix in PCRE2 source code repository. Is there plans to fix it as it doing with another patterns with ACCEPT? For example studying a pattern /a(*ACCEPT)/ produce Subject length lower bound = 0 -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Quantifying backtracking verbs
On 2019-06-10 16:47, ph10 wrote: I have done this, and committed the result. However, it seems to me that /a(*ACCEPT)??bc/ is the same as a(?:bc|) though if a, b, and c are complex it may be easier to read. A following example was included in docs (pcre2pattern.html) : A(*ACCEPT)??BC But this example does not show what we can do with (*ACCESS)?? that can't doing well with another PCRE facilities. I suggest to show in docs another example that was posted earlier in this thread: (a(*ACCEPT)??b)c In this example we want to match "abc" or "a" AND capture all but "c". -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Quantifying backtracking verbs
On Wed, 5 Jun 2019, ND via Pcre-dev wrote: > It seems a bug of Perl start optimizer. It say: > "Did not find floating substr "bc"... > Match rejected by optimizer" > > Please look at PCRE start optimizer. PCRE2 does not record "c" as necessary after (*ACCEPT). So it does not have this bug. (It only records a single code unit.) > >The simplest fix for PCRE2 is to change (*ACCEPT) into > >(?:(*ACCEPT)) at compile time. This avoids any > >implementation requirement at match time and in the JIT. I have done this, and committed the result. However, it seems to me that /a(*ACCEPT)??bc/ is the same as a(?:bc|) though if a, b, and c are complex it may be easier to read. Philip -- Philip Hazel -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Quantifying backtracking verbs
On 2019-06-05 16:53, ph10 wrote: Perl gets it wrong: /(a(?:(*ACCEPT))??bc)/ axy No match /a(*ACCEPT)??bc/ axy No match It seems a bug of Perl start optimizer. It say: "Did not find floating substr "bc"... Match rejected by optimizer" Please look at PCRE start optimizer. It seems correction needed to act with such pattern accurately (as with "(a(*ACCEPT)bc)") The simplest fix for PCRE2 is to change (*ACCEPT) into (?:(*ACCEPT)) at compile time. This avoids any implementation requirement at match time and in the JIT. It's fine for me. -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Quantifying backtracking verbs
On Wed, 5 Jun 2019, I wrote: > On Wed, 5 Jun 2019, ND via Pcre-dev wrote: > > > It's not true for (*ACCEPT). > > Construct "(*ACCEPT)??" is like a negated (*COMMIT) pattern: > > (*COMMIT) immediately fail a whole match when backtrack to it occurs > > (*ACCEPT)?? immediately match when backtrack to it occurs > > Ah, that is an interesting feature that I had not though of. I agree it > could be useful and will consider how to implement it. Thank you for the > insight. Perl gets it wrong: Perl 5.028001 Regular Expressions /(a(?:(*ACCEPT))??bc)/ abc 0: abc axy No match /a(*ACCEPT)??bc/ abc 0: abc axy No match PCRE does get it right when parentheses are used: PCRE2 version 10.34-RC1 2019-04-22 /(a(?:(*ACCEPT))??bc)/ abc 0: abc axy 0: a The simplest fix for PCRE2 is to change (*ACCEPT) into (?:(*ACCEPT)) at compile time. This avoids any implementation requirement at match time and in the JIT. Philip -- Philip Hazel -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Quantifying backtracking verbs
On Wed, 5 Jun 2019, ND via Pcre-dev wrote: > It's not true for (*ACCEPT). > Construct "(*ACCEPT)??" is like a negated (*COMMIT) pattern: > (*COMMIT) immediately fail a whole match when backtrack to it occurs > (*ACCEPT)?? immediately match when backtrack to it occurs Ah, that is an interesting feature that I had not though of. I agree it could be useful and will consider how to implement it. Thank you for the insight. Philip -- Philip Hazel -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Quantifying backtracking verbs
Repetition is allowed for groups such as (?:...) but not for individual backtracking verbs It seems Perl does not rise error with "(*ACCEPT)??". And generates expected code. Is there weighty reason to be not compatible with Perl in this situation? (for which it is meaningless). It's not true for (*ACCEPT). Construct "(*ACCEPT)??" is like a negated (*COMMIT) pattern: (*COMMIT) immediately fail a whole match when backtrack to it occurs (*ACCEPT)?? immediately match when backtrack to it occurs Consider a pattern: (a(?:(*ACCEPT))??b)c Lets we want to match "abc" and capture "ab". If "abc" don't match we want to match "a" and capture it. What another PCRE pattern allows doing this with same effectiveness (a,b,c can be a large and complex subpatterns)? So, this construct is not meaningless. Moreover, I can't find another PCRE-pattern that can do the same without sacrifice of effectiveness. -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
Re: [pcre-dev] Quantifying backtracking verbs
On Tue, 4 Jun 2019, ND via Pcre-dev wrote: > PCRE2 version 10.33 2019-04-16 > /A(?:(*ACCEPT))?B/ > A > No match > > /A(?:(*ACCEPT))?B/no_start_optimize > A > 0: A > > /A(*ACCEPT)?B/ > Failed: error 109 at offset 10: quantifier does not follow a repeatable item > A > > > I have a two questions with it: > 1. Start optimizer brakes a result to "no match" from "match". Is there > documented (I remember only example with (*COMMIT) where optimizer can make > "match" from "no match")? May be there is a way to correct this PCRE > optimization to not break a result. It is documented that the start-up optimizations and the backtracking control verbs such as (*ACCEPT) do not play well together. In particular, the result of a match may change. Here is an extract from the pcre2pattern documentation: Note that (*COMMIT) at the start of a pattern is not the same as an anchor, unless PCRE2's start-of-match optimizations are turned off, as shown in this output from pcre2test: re> /(*COMMIT)abc/ data> xyzabc 0: abc data> re> /(*COMMIT)abc/no_start_optimize data> xyzabc No match For the first pattern, PCRE2 knows that any match must start with "a", so the optimization skips along the subject to "a" before applying the pattern to the first set of data. The match attempt then succeeds. The second pattern disables the optimization that skips along to the first character. The pattern is now applied starting at "x", and so the (*COMMIT) causes the match to fail without trying any other starting points. There is also a whole section called "Optimizations that affect backtracking verbs" which contains this: PCRE2 contains some optimizations that are used to speed up matching by running some checks at the start of each match attempt. For example, it may know the minimum length of matching subject, or that a particular character must be present. When one of these optimizations bypasses the running of a match, any included backtracking verbs will not, of course, be processed. > 2. If "(?:(*ACCEPT))?" syntax is allowed, then why a more ergonomic > "(*ACCEPT)?" is disabled? Repetition is allowed for groups such as (?:...) but not for individual backtracking verbs (for which it is meaningless). PCRE2 is not clever enough to realize that (?:(*ACCEPT))? has nothing other than (*ACCEPT) inside (?:). If it did, I would expect an error. Repeating any group that has an unconditional (*ACCEPT) does not seem meaningful but there are cases such as (a(*ACCEPT))? that could perhaps be useful. Such a group can only match zero or one times. It means "If 'a' end the match; if not 'a' carry on with the pattern." Philip -- Philip Hazel -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
[pcre-dev] Quantifying backtracking verbs
Good day! Here is pcre2test listing: PCRE2 version 10.33 2019-04-16 /A(?:(*ACCEPT))?B/ A No match /A(?:(*ACCEPT))?B/no_start_optimize A 0: A /A(*ACCEPT)?B/ Failed: error 109 at offset 10: quantifier does not follow a repeatable item A I have a two questions with it: 1. Start optimizer brakes a result to "no match" from "match". Is there documented (I remember only example with (*COMMIT) where optimizer can make "match" from "no match")? May be there is a way to correct this PCRE optimization to not break a result. 2. If "(?:(*ACCEPT))?" syntax is allowed, then why a more ergonomic "(*ACCEPT)?" is disabled? Thanks in advance -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev