[issue18647] re.error: nothing to repeat

2013-08-25 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
resolution:  -> fixed
stage: commit review -> committed/rejected
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-19 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
stage:  -> commit review

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-19 Thread Roundup Robot

Roundup Robot added the comment:

New changeset de049e9abdf7 by Serhiy Storchaka in branch '3.3':
Issue #18647: Correctly bound calculated min/max width of a subexpression.
http://hg.python.org/cpython/rev/de049e9abdf7

New changeset e47f2dc564bc by Serhiy Storchaka in branch 'default':
Issue #18647: Correctly bound calculated min/max width of a subexpression.
http://hg.python.org/cpython/rev/e47f2dc564bc

New changeset d10c287c200c by Serhiy Storchaka in branch '2.7':
Issue #18647: Correctly bound calculated min/max width of a subexpression.
http://hg.python.org/cpython/rev/d10c287c200c

New changeset 19ed2fbb8e6b by Serhiy Storchaka in branch '3.3':
Issue #18647: A regular expression in the doctest module rewritten so that
http://hg.python.org/cpython/rev/19ed2fbb8e6b

New changeset 8b24818c7327 by Serhiy Storchaka in branch 'default':
Issue #18647: A regular expression in the doctest module rewritten so that
http://hg.python.org/cpython/rev/8b24818c7327

New changeset c2dc99ec46bc by Serhiy Storchaka in branch '2.7':
Issue #18647: A regular expression in the doctest module rewritten so that
http://hg.python.org/cpython/rev/c2dc99ec46bc

New changeset 7ab07f15d78c by Serhiy Storchaka in branch '3.3':
Issue #2537: Remove breaked check which prevented valid regular expressions.
http://hg.python.org/cpython/rev/7ab07f15d78c

New changeset f4271cc2dfb5 by Serhiy Storchaka in branch 'default':
Issue #2537: Remove breaked check which prevented valid regular expressions.
http://hg.python.org/cpython/rev/f4271cc2dfb5

New changeset 7b867a46a8b4 by Serhiy Storchaka in branch '2.7':
Issue #2537: Remove breaked check which prevented valid regular expressions.
http://hg.python.org/cpython/rev/7b867a46a8b4

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-14 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

So what now? Just remove unneeded check?

Related issues: issue1633953, issue2537.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-11 Thread Matthew Barnett

Matthew Barnett added the comment:

I think you're probably right.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-11 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

And not to me. This check forbids some possible legal regexps and doesn't 
prevent from shooting in the leg.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-10 Thread Tim Peters

Tim Peters added the comment:

So does anyone believe this check serves a useful purpose _now_?  Doesn't seem 
so to me.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-10 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> 2. Matthew said that Python's engine is not robust against _unbounded_ 
> repeated matching of an empty sub-match, and so "That's the reason for the 
> up-front check".  I was asking for an example of _that_ behavior.  I still 
> haven't seen one.

Perhaps Matthew did not understand you or you did not understand Matthew. 
For non-greedy repeats this was fixed in issue9669 (thank to Matthew). For 
greedy repeats this was fixed some time before.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-10 Thread Tim Peters

Tim Peters added the comment:

Serhiy, yes, I know the regexp you gave takes exponential time.  But:

1. This appears to have nothing to do with repeated 0-length matches.  I gave 
you an example of a very similar regexp that also takes exponential time, but 
never makes any 0-length sub-match.

2. Matthew said that Python's engine is not robust against _unbounded_ repeated 
matching of an empty sub-match, and so "That's the reason for the up-front 
check".  I was asking for an example of _that_ behavior.  I still haven't seen 
one.

My goal here is to understand why we're doing this check at all.  If Python's 
engine cannot in fact be provoked into an infinite loop, the check has at best 
very little value, as there are many ways to provoke exponential-time behavior, 
and the possibility of a repeated 0-length sub-match doesn't appear to have 
much (if anything) to do with it.

(By the way, exponential-time regexps can't always be rewritten to take linear 
time, although it takes gimmicks like back references to create examples that 
are inherently expensive.)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-10 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> Serhiy, yup, that regexp is slow, but it does finish - so the engine is doing 
> something to avoid _unbounded_ repetitive matching of an empty string.

Yes, it finish, but it has exponential computation complexity. Increase the 
length of the string to 21, 22, 30, 100...

There were multiple bug reports about "hanged" regexps which actually had 
quadratic or exponential computation complexity (see for example issue1662581, 
issue16430, issue15077, issue15515). In all such cases the regexp can be 
rewritten to have linear computation complexity. However peoples constantly do 
such mistakes.

Armin, I totally agree with you.

Note that before b78c321ee9a5 the regexp '(?:.{0,6}.{0,5535})?' was 
forbidden while '(?:.{0,6}.{0,5534})?' and '(?:.{0,6}.{0,5536})?' were 
allowed. Existing check allowed false positives.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-10 Thread Armin Rigo

Armin Rigo added the comment:

Just a side note for 2.7: could I recommend people to be really extra, extra 
careful when changing what kind of regexps are accepted and what kind of 
regexps are outright rejected?  I believe the risk of making long-existing and 
working 2.7 programs suddenly crash on the next 2.7 micro version should *by 
far* outweight the theoretical advantage of crashing early on sufficiently 
bogus regexps.

(Fwiw, I believe the same would apply on 3.x too, where rejecting 
previously-accepted regexps should only be done in a minor version upgrade.)

--
nosy: +arigo

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-06 Thread Tim Peters

Tim Peters added the comment:

Serhiy, yup, that regexp is slow, but it does finish - so the engine is doing 
something to avoid _unbounded_ repetitive matching of an empty string.

Change it to

(?:.?.+)*y

and the group can no longer match an empty string, but it's still slow 
(although about 3x faster, it's still exponential in the length of the string 
it fails to match).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-06 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Originally the catch condition was (lo == 0). It was changed in changeset 
41c42b1bd582.

> Offhand, do you have an example that displays bad behavior in 2.7?  I'm 
> curious because I didn't find one after half an hour of trying.

re.match('(?:.?.?)*y', 'x'*20)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-04 Thread Tim Peters

Tim Peters added the comment:

> Python's current regex engine isn't so coded. That's
> the reason for the up-front check.

It's peculiar then that nobody noticed before now that the check was so badly 
broken ;-)

Offhand, do you have an example that displays bad behavior in 2.7?  I'm curious 
because I didn't find one after half an hour of trying.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-04 Thread Matthew Barnett

Matthew Barnett added the comment:

Python's current regex engine isn't so coded. That's the reason for the 
up-front check.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-04 Thread Tim Peters

Tim Peters added the comment:

Matthew, yes, I agree that a regexp engine can be coded to be robust against 
unboundedly repeated matching of an empty string.  I don't know whether 
Python's current engine is so coded.  It's easy to trick the 2.7 engine into 
accepting regexps that do try to match an empty string endlessly, but across 
all I've tried none show "infinite loop" (or even slow) behavior.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-04 Thread Matthew Barnett

Matthew Barnett added the comment:

Suppose you have a repeated pattern, such as "(?:...)*" or "(?:...){0,100}".

If, after matching the subpattern, the text position hasn't changed, and none 
of the capture groups have changed, then there has been no progress, and the 
subpattern will be matched again with no more progress, unless the maximum 
count has been reached, at which point it'll continue with the remainder of the 
pattern. If there's no minimum count, then the subpattern will be matched 
repeatedly forever.

Therefore, what a repeat should do is not to attempt any more iterations if the 
text position hasn't changed and none of the capture groups have changed.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-04 Thread Tim Peters

Tim Peters added the comment:

Matching an empty string an unbounded number of times isn't a case of 
exponential runtime, it's a case of infinite runtime, unless the regexp 
internals are smart enough to cut the search off (I don't know enough about 
re's internals to know whether it's smart enough to do so).

So it's worth catching at compile time if it can be caught.  Alas, regexps are 
so complicated I doubt it's possible to do so without false positives or false 
negatives short of major effort.

For now, at least the doctest part of the patch should be harmless ;-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-04 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Agree. Here is a partial patch which fixes getwidth() and doctest regexp. But I 
don't know how to fix _simple(). Perhaps we should permanently remove the 
"nothing to repeat" check. It guards only against '(.*)*', but there are other 
methods to make regexp matching exponential (i.e. '(.?.?)*').

--
keywords: +patch
Added file: http://bugs.python.org/file31159/issue18647.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-04 Thread Tim Peters

Tim Peters added the comment:

I'm afraid it's just too tricky for the code to deduce that a negative 
lookahead assertion can imply that a later match can't be empty.  But I don't 
know how smart the re compilation code already is ;-)

It occurs to me now that the doctest regexp could worm around this very easily, 
via replacing:

.*$\n?

with:

.+$\n?

The success of the negative lookahead assertion here doesn't _just_ imply that


.*$\n?

will match a non-empty string, it also implies that

.+$

will succeed (and so also that .+$\n? will succeed).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-04 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> Serhiy, I don't see the regexp '(?:.*$\n?)*' anywhere in doctest.py.  Are you 
> talking about the _EXAMPLE_RE regexp?  That's the closest I see.

Yes, it is. In my previous message I answered Eli.

> If that's the case, the "nothing to repeat" error is incorrect:  _EXAMPLE_RE 
> also contains a negative lookahead assertion '(?![ ]*$)' to ensure that the 
> later '.*$\n?' part never tries to match an empty string.

Thank you for explanation. Unlucky the getwidth() method is not smart enough to 
detect that minimal length of matched string is not 0. We should made either 
the getwidth() method or the _simple() function (or both) more smart.

> A compromise may be to replace
> .*$\n?
> with
> .+$\n? | .*$\n

I'm sure similar patterns are used in third-party code. We shouldn't break 
them, therefore we should fix _simple()/getwidth().

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-04 Thread Tim Peters

Tim Peters added the comment:

Serhiy, I'm asking you to be very explicit about which regexp in doctest.py 
you're talking about.  If you're talking about the _EXAMPLE_RE regexp, I 
already explained what's going on with that.  If you're talking about some 
other regexp, I have no idea which one you're talking about.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-04 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The doctest engine uses a regexp which contains subpattern which now considered 
as illegal be the regexp engine (due to unlucky coincidence MAXREPEAT == 
sys.maxsize on 32-bit platforms). We should rewrite the _simple() function in 
the re module to be more smart. But if this assumption is correct and this 
subpattern is really dangerous we should also rewrite a regular expression in 
the doctest module. In any case we should fix SubPattern.getwidth() so that the 
behavior on 32-bit and 64-bit platforms should be the same.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-04 Thread Tim Peters

Tim Peters added the comment:

Serhiy, I don't see the regexp '(?:.*$\n?)*' anywhere in doctest.py.  Are you 
talking about the _EXAMPLE_RE regexp?  That's the closest I see.

If that's the case, the "nothing to repeat" error is incorrect:  _EXAMPLE_RE 
also contains a negative lookahead assertion '(?![ ]*$)' to ensure that the 
later '.*$\n?' part never tries to match an empty string.

That said, it takes some intelligence to realize that the negative lookahead 
assertion prevents repeating an empty match in this regexp, so it may not be 
easy to fix this false positive.

A compromise may be to replace

.*$\n?

with

.+$\n? | .*$\n

Both branches then "obviously" consume at least one character.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-04 Thread Eli Bendersky

Eli Bendersky added the comment:

Wonderfully terse, as usual. Can you be so kind to elaborate just a tiny bit 
more? Is the amount of doctests this affects so large that it's better to 
change the implementation? What are the plans for this "temporary" stage - is 
there an intention to fix the code "soon" and revert the disabling of this 
error check?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-04 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

All doctests affected.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-04 Thread Eli Bendersky

Eli Bendersky added the comment:

Would it not be better to temporarily-fix the test rather than the code?

--
nosy: +eli.bendersky

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-04 Thread Roundup Robot

Roundup Robot added the comment:

New changeset e2ba4592ce3a by Serhiy Storchaka in branch '2.7':
Issue #18647: Temporary disable the "nothing to repeat" check to make buildbots 
happy.
http://hg.python.org/cpython/rev/e2ba4592ce3a

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-03 Thread Tim Peters

Changes by Tim Peters :


--
nosy: +tim_one

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-03 Thread Arfrever Frehtes Taifersar Arahesis

Changes by Arfrever Frehtes Taifersar Arahesis :


--
nosy: +Arfrever

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-03 Thread Roundup Robot

Roundup Robot added the comment:

New changeset c243896e12be by Serhiy Storchaka in branch '3.3':
Issue #18647: Temporary disable the "nothing to repeat" check to make buildbots 
happy.
http://hg.python.org/cpython/rev/c243896e12be

New changeset 4faf9b73c3df by Serhiy Storchaka in branch 'default':
Issue #18647: Temporary disable the "nothing to repeat" check to make buildbots 
happy.
http://hg.python.org/cpython/rev/4faf9b73c3df

--
nosy: +python-dev

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18647] re.error: nothing to repeat

2013-08-03 Thread Serhiy Storchaka

New submission from Serhiy Storchaka:

Now all doctests failed on 32-bit platforms due to the unlucky coincidence of 
my patch with at least two bugs which were hided before.

SubPattern.getwidth() is wrong, it truncates resulted values to sys.maxsize 
(should be MAXREPEAT). As side effect of my patch (on 32-bit MAXREPEAT == 
sys.maxsize) it now returns correct value in some cases on 32-bit platforms. On 
other hand, the _simple() function in sre_compile.py checks if getwidth() 
returns (0, MAXREPEAT) and raise an error in such case. Perhaps it should 
guards against such patterns as '(x*)*' (but it doesn't guards against 
'(x*y?)*' or '(x*y*)*' and can raise false positive). Now getwidth() returns 
(0, MAXREPEAT) for '(x*y?)*' on 32-bit platforms (this is a correct result) and 
triggers the check. The doctest module uses regular expression pattern 
'(?:.*$\n?)*' which now causes an error.

Definitely SubPattern.getwidth() is wrong and should be fixed. At least one of 
two, the _simple() function or doctest pattern should be fixed too. The 
simplest fix is disable the 'nothing to repeat' check.

--
assignee: serhiy.storchaka
components: Library (Lib), Regular Expressions
messages: 194297
nosy: ezio.melotti, larry, mrabarnett, pitrou, serhiy.storchaka
priority: normal
severity: normal
status: open
title: re.error: nothing to repeat
type: behavior
versions: Python 2.7, Python 3.3, Python 3.4

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com