[issue31759] re wont recover nor fail on runaway regular expression

2017-10-13 Thread Matthew Barnett

Matthew Barnett  added the comment:

@Tim: the regex module includes some extra checks to reduce the chance of 
excessive backtracking. In the case of the OP's example, they seem to be 
working. However, it's difficult to know when adding such checks will help, and 
your example is one case where they are being done but aren't helping, with the 
result that it's slower.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue31759] re wont recover nor fail on runaway regular expression

2017-10-11 Thread Tim Peters

Tim Peters  added the comment:

Sure!  The OP was obviously asking about the engine that ships with Python, so 
that's what I talked about.

Raphaël, Matthew develops an excellent replacement ("regex") for Python's re 
module, which you can install via, e.g., "pip install regex" (or, on Windows, 
"python -m pip install regex").  More info here:

https://pypi.python.org/pypi/regex/

Matthew, will that become Python's standard offering some day?  I'd be in favor 
of that!  It has many advantages, although it doesn't always avoid 
exponential-time backtracking in failing cases.  For example, `re` and `regex` 
both take exponential time to fail to match the regexp:

"((xy)+)+$"

against strings of the form:

"xy" * i + "y"

Increase `i` by 1, and both take about twice as long to fail to match (meaning 
either .match or .search), and `re` is actually quicker on my box (3.6.3 on 
64-bit Win10).

In any case, I'm closing this, since there's no concrete idea on the table for 
a change to `re` that would actually help (e.g., people ignore warnings, and 
there's really no way to _guess_ whether a regexp is "taking too long" to begin 
with - if it's taking minutes, people immediately discover the hangup already 
when they interrupt the program and see that it's trying to match a regexp).

--
resolution:  -> wont fix
stage:  -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue31759] re wont recover nor fail on runaway regular expression

2017-10-11 Thread Matthew Barnett

Matthew Barnett  added the comment:

You shouldn't assume that just because it takes a long time on one 
implementation that it'll take a long time on all of the others, because it's 
sometimes possible to include additional checks to reduce the problem. (I doubt 
you could eliminate the problem entirely, however.)

My regex module, for example, includes some additional checks, and it seems to 
be OK with these tests.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue31759] re wont recover nor fail on runaway regular expression

2017-10-11 Thread Raphaël Riel

Raphaël Riel  added the comment:

Thanks Tim! Pretty nice answer that I can learn from! Thanks for your time.

I definitely knew my Regex was broken, yet I was surprised the 
interpreter/library didn't gave up/error after some(several million) steps.

Some other language seems to just assume there will be no match (Source: some 
play on https://regex101.com/), and I don't think this is a valid approach.

Should there be a WARNing logged on a defined soft-limit?

I know this is low-level "re" library, and your point is pretty valid about the 
fact the lib should be doing what it's told to do. 
I was mostly looking for opinion about WARNs on soft limits and maybe errors on 
a hard-limit to debug/avoid this kind of false-hang situation.

Feel free to close/wont-fix/not-a-bug this issue! And thanks again for your 
kind answer to my initial issue!

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue31759] re wont recover nor fail on runaway regular expression

2017-10-11 Thread Tim Peters

Tim Peters  added the comment:

Well, the problem in the regexp is this part:  "\d+,? ?".  You're not 
_requiring_ that strings of digits be separated by a comma or blank, you're 
only _allowing_ them to be so separated.  A solid string of digits is matched 
by this, and so the enclosing + requires the engine, when backtracking, to 
consider every possible way of breaking the solid string of digits into one or 
more solid strings of digits too.  If there are n digits in the digit string, 
there are 2**(n-1) ways to do so.

Overall the regexp can never match because "NULL" never gets matched.  So all 
possibilities will be tried.  We can compute how many attempts will be made 
like so:

prod = 1
for c in re.findall("\d+|NULL", statement):
if c == "NULL":
break
prod *= 2**(len(c) - 1)
print(format(prod, ","))


Attempting test_01
256
Attempting test_02
4,096
Attempting test_03
65,536
Attempting test_04
1,048,576
Attempting test_05
2,097,152
Attempting test_06
4,194,304
Attempting test_07
8,388,608
Attempting test_08
16,777,216
Attempting test_09
33,554,432
Attempting test_10
67,108,864
Attempting test_11
16,777,216
Attempting test_12
536,870,912
Attempting test_13
17,179,869,184
Attempting test_14
549,755,813,888

Note, e.g., this "predicts" test_08 will take about the same time as test_11 
(both require 16,777,216 attempts).  Which is what you actually saw happen.

And that's why you gave up on test_14:  it will require over half a trillion 
failing attempts before it ends, which will take roughly 30 times longer than 
it failed to match test_13.

If we were building a regexp _tester_, we'd spin off a new process to run the 
regexp, and kill the process if it took "too long".  But that's not we're doing 
- library functions do what you tell them to do ;-)

In this case, it's easily repaired.  For example, replace

"\d+,? ?"

 by 

"\d+(?=[ ,)]),? ?"

This uses a lookahead assertion to _require_ that a digit string is followed by 
a blank, comma, or right parenthesis.  Which prevents exponential-time 
backtracking in failing cases.

--
nosy: +tim.peters

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue31759] re wont recover nor fail on runaway regular expression

2017-10-11 Thread Raphaël Riel

Raphaël Riel  added the comment:

Results for my local computer:

```
Attempting test_01
Done in 0.12ms
Attempting test_02
Done in 1.52ms
Attempting test_03
Done in 26.24ms
Attempting test_04
Done in 432.32ms
Attempting test_05
Done in 886.3ms
Attempting test_06
Done in 1757.07ms
Attempting test_07
Done in 3388.92ms
Attempting test_08
Done in 6669.02ms
Attempting test_09
Done in 13088.37ms
Attempting test_10
Done in 26600.23ms
Attempting test_11
Done in 6722.9ms
Attempting test_12
Done in 211192.82ms
Attempting test_13
Done in 6850465.09ms
Attempting test_14
^CTraceback (most recent call last):
  File "/Users/rriel/Desktop/re_backtracking.py", line 26, in 
bad_expression.sub("IN_GROUP", statement)
KeyboardInterrupt
```

Cancelled test_14...

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue31759] re wont recover nor fail on runaway regular expression

2017-10-11 Thread Raphaël Riel

New submission from Raphaël Riel :

re won't raise nor return when working with Runaway Regular Expression.
It will compute "almost" indefinitely. Although I'm pretty sure it *may* 
complete sometime, it's definetly looks like it's stuck.

```
> python -
Python 3.6.2 (default, Aug 23 2017, 14:57:08)
[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)]
```

Reproduce with attached file.

Should there be a (configurable?) limit on the number of steps involved in the 
process.
Or some warnings and/or hard limit that raises exception?

https://pythex.org/ will fail with a HTTP502 BadGateway (server taking too long 
to respond)
https://regex101.com/ python's tester seems to set a limit for this case. I 
can't say how they managed this.

--
components: Regular Expressions
files: re_backtracking.py
messages: 304146
nosy: Raphaël Riel, ezio.melotti, mrabarnett
priority: normal
severity: normal
status: open
title: re wont recover nor fail on runaway regular expression
type: behavior
versions: Python 3.6
Added file: https://bugs.python.org/file47212/re_backtracking.py

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com