[issue23692] Undocumented feature prevents re module from finding certain matches

2019-11-12 Thread Jackson Riley


Jackson Riley  added the comment:

Hi Matthew, Serhiy,

I tried to identify the right places in re to fix things but have found it a 
bit difficult.
I wrote up my attempt (at 
https://enhackathon.github.io/2019/11/04/JacksonRiley.html) which includes some 
examples that behave differently from how I would have expected this bug to 
manifest itself. 
Any further guidance would be greatly appreciated!

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23692] Undocumented feature prevents re module from finding certain matches

2019-11-04 Thread Matthew Barnett


Matthew Barnett  added the comment:

It's been many years since I looked at the code, and there have been changes 
since then, so some of the details might not be correct.

As to have it should behave:

re.match('(?:()|(?(1)()|z)){1,2}(?(2)a|z)', 'a')

Iteration 1.
Match the repeated part. Group 1 matches.
Iteration 2.
Match the repeated part. Group 1 matches.
Has group 2 matched? No.
Try to match 'z'. Fail and backtrack.
Retry the repeated part.
Iteration 2.
Has group 1 matched? Yes.
Group 2 matches.
Has group 2 matched? Yes.
Try to match 'a'. Success. Group 1 matched and group 2 matched.


re.match('(?:()|(?(1)()|z)){1,2}(?(1)a|z)', 'a')

Iteration 1.
Match the repeated part. Group 1 matches.
Iteration 2.
Match the repeated part. Group 1 matches.
Has group 1 matched? Yes.
Try to match 'a'. Success. Group 1 matched and group 2 didn't match.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23692] Undocumented feature prevents re module from finding certain matches

2019-11-04 Thread Jackson Riley


Jackson Riley  added the comment:

I've got a bit confused and am doubting myself - is the below output expected?
>>> m = re.match('(?:()|(?(1)()|z)){1,2}(?(2)a|z)', 'a')
>>> m.groups()
('', '')
>>> m = re.match('(?:()|(?(1)()|z)){1,2}(?(1)a|z)', 'a')
>>> m.groups()
('', None)

The first pattern doesn't behave as I would (probably naively expect) given 
Matthew's explanation of this bug - wouldn't the bug cause the match to fail 
with {1,2} as well?
Also, it seems odd that changing the condition in the if clause should change 
what gets captured.

Anyone have any thoughts?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23692] Undocumented feature prevents re module from finding certain matches

2019-11-04 Thread Jackson Riley


Jackson Riley  added the comment:

Ah thank you very much Serhiy, that's super helpful!

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23692] Undocumented feature prevents re module from finding certain matches

2019-11-04 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

Matthew referred to the code of the regex module (of which he is the author).

https://pypi.org/project/regex/

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23692] Undocumented feature prevents re module from finding certain matches

2019-11-04 Thread Jackson Riley


Jackson Riley  added the comment:

Hi Matthew, thank you for your suggestions of where to start.
Could you possibly give a pointer to the place in the code where the 'capture 
changed' counter is incremented? I had a bit of a hunt and couldn't find it but 
may have been looking in the wrong places!

--
nosy: +jacksonriley

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23692] Undocumented feature prevents re module from finding certain matches

2019-10-29 Thread Lewis Gaul


Lewis Gaul  added the comment:

Thanks for the explanation Matthew, I'll take a further look at some point in 
the coming weeks.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23692] Undocumented feature prevents re module from finding certain matches

2019-10-27 Thread Ma Lin


Change by Ma Lin :


--
nosy: +Ma Lin

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23692] Undocumented feature prevents re module from finding certain matches

2019-10-27 Thread Matthew Barnett


Matthew Barnett  added the comment:

Suppose you had a pattern:

.*

It would advance one character on each iteration of the * until the . failed to 
match. The text is finite, so it would stop matching eventually.

Now suppose you had a pattern:

(?:)*

On each iteration of the * it wouldn't advance, so it would keep matching 
forever.

A way to avoid that is to stop the * if it hasn't advanced.

The example pattern shows that there's still a problem. It advances if a group 
has matched, but that group doens't match until the first iteration, after the 
test, and does not, itself, advance. The * stops because it hasn't advanced, 
but, in this instance, that doesn't mean it never will.

The solution is for the * to check not only whether it has advanced, but also 
whether a group has changed. (Strictly speaking, the latter check is needed 
only if the repeated part tests whether a group also in the repeated part has 
changed, but it's probably not worth "optimising" for that possibility.)

In the regex module, it increments a "capture changed" counter whenever any 
group is changed (a group's first match or a change to a group's span). That 
makes it easier for the * to check. The code needs to save that counter for 
backtracking and restore it when backtracking.

I've mentioned only the *, but the same remarks apply to + and {...}, except 
that the {...} should keep repeating until it has reached its prescribed 
minimum.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23692] Undocumented feature prevents re module from finding certain matches

2019-10-27 Thread Lewis Gaul


Lewis Gaul  added the comment:

Hi there, if anyone's able to provide any guidance on this issue I'd be happy 
to take a look into it.

Is this a behaviour that is feasible to fix, or should this just be documented 
in some way as suggested by Evgeny?

--
nosy: +Lewis Gaul

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23692] Undocumented feature prevents re module from finding certain matches

2016-10-16 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
nosy: +serhiy.storchaka
stage:  -> needs patch
versions: +Python 2.7, Python 3.5, Python 3.6, Python 3.7 -Python 3.4

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23692] Undocumented feature prevents re module from finding certain matches

2015-03-17 Thread Evgeny Kapun

New submission from Evgeny Kapun:

This pattern matches:

re.match('(?:()|(?(1)()|z)){2}(?(2)a|z)', 'a')

But this doesn't:

re.match('(?:()|(?(1)()|z)){0,2}(?(2)a|z)', 'a')

The difference is that {2} is replaced by {0,2}. This shouldn't prevent the 
pattern from matching anywhere where it matched before.

The reason for this misbehavior is a feature which is designed to protect re 
engine from infinite loops, but in fact it sometimes prevents patterns from 
matching where they should. I think that this feature should be at least 
properly documented, by properly I mean that it should be possible to 
reconstruct the exact behavior from documentation, as the implementation is not 
particularly easy to understand.

--
components: Regular Expressions
messages: 238330
nosy: abacabadabacaba, ezio.melotti, mrabarnett
priority: normal
severity: normal
status: open
title: Undocumented feature prevents re module from finding certain matches
type: behavior
versions: Python 3.4

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23692
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com