[issue16563] re.match loops forever on simple regexp

2012-11-27 Thread Mark Dickinson

Mark Dickinson added the comment:

Closing as a duplicate of issue 1662581.

--
nosy: +mark.dickinson
resolution:  - duplicate
status: open - closed
superseder:  - the re module can perform poorly: O(2**n) versus O(n**2)

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue16563
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue16563] re.match loops forever on simple regexp

2012-11-27 Thread Mark Dickinson

Mark Dickinson added the comment:

@lpd:  you may want to look at the 'regex' module on PyPI [1] to see if it 
solves your problems.  The idea is that some form of this should eventually go 
into the standard library, but we've been lacking volunteers for the huge code 
review task involved.

[1] http://pypi.python.org/pypi/regex.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue16563
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue16563] re.match loops forever on simple regexp

2012-11-26 Thread L. Peter Deutsch

New submission from L. Peter Deutsch:

I've read a number of reports of exponential-time regexp matching, but this 
regexp uses no unusual features, requires no backtracking, and only loops 
forever on certain input strings.

I listed the Python version # as 2.6; I actually observed the behavior in 2.5.1 
and 2.5.2, but I'm almost certain it's still there, because I saw the same 
behavior in a very recent build of Google's V8 interpreter, which I believe 
uses the same regexp engine.

Here's the test case:

import re
re_utf8 = 
r'^([\x00-\x7f]+|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf][\x80-\xbf]|[\xf0-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf])*$'
s = 
\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x03\x00\x14\x00\x00\x00\x01\x00\x00,`\x00\x00\x004\x00\x01\x8d
print re.match(re_utf8, s)

If you pass s[0:34] or s[34:35] as the argument of re.match, it returns the 
correct answer, but the code above loops apparently forever.

--
components: Regular Expressions
messages: 176458
nosy: ezio.melotti, lpd, mrabarnett
priority: normal
severity: normal
status: open
title: re.match loops forever on simple regexp
type: crash
versions: Python 2.6

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue16563
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue16563] re.match loops forever on simple regexp

2012-11-26 Thread Tim Peters

Tim Peters added the comment:

There's actually enormous backtracking here.  Try this much shorter regexp and 
you'll see much the same behavior:

re_utf8 = r'^([\x00-\x7f]+)*$'

That's the original re_utf8 with all but the first alternative removed.

Looks like passing s[0:34] works because it eliminates the trailing \x8d that 
prevents the regexp from matching the whole string.  Because the regexp cannot 
match the whole string, it takes a very long time to try all the futile 
combinations implied by the nested quantifiers.  As the much simpler re_utf8 
above shows, it's not the alternatives in the regexp that matter here, it's the 
nested quantifiers.

--
nosy: +tim_one

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue16563
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue16563] re.match loops forever on simple regexp

2012-11-26 Thread Ezio Melotti

Ezio Melotti added the comment:

I think the problem is the first '+', but I'm not sure if this is similar to 
other problems (I remember something similar to (x+)* causing problems).

(On a side note: the regex matches non-valid utf-8, see table 3-7 on 
http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf).

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue16563
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue16563] re.match loops forever on simple regexp

2012-11-26 Thread Tim Peters

Tim Peters added the comment:

Yes, if you remove the first +, the example quickly prints None (i.e., 
reports that the regexp cannot match the string).

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue16563
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue16563] re.match loops forever on simple regexp

2012-11-26 Thread L. Peter Deutsch

L. Peter Deutsch added the comment:

It never occurred to me that the regexp package would be so poorly designed 
that a pattern that so clearly never requires backtracking could require 
exponential time. I'll change the pattern (taking out the + has no effect on 
what strings it matches) and leave it up to others to decide whether the 
performance issue is worth addressing. And thanks for the pointer to the table 
in the Unicode standard.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue16563
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com