[issue46065] re.findall takes forever and never ends

2021-12-19 Thread Gareth Rees


Gareth Rees  added the comment:

This kind of question is frequently asked (#3128, #29977, #28690, #30973, 
#1737127, etc.), and so maybe it deserves an answer somewhere in the Python 
documentation.

--
resolution:  -> wont fix
stage:  -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue46065] re.findall takes forever and never ends

2021-12-19 Thread Gareth Rees


Gareth Rees  added the comment:

The way to avoid this behaviour is to disallow the attempts at matching that 
you know are going to fail. As Serhiy described above, if the search fails 
starting at the first character of the string, it will move forward and try 
again starting at the second character. But you know that this new attempt must 
fail, so you can force the regular expression engine to discard the attempt 
immediately.

Here's an illustration in a simpler setting, where we are looking for all 
strings of 'a' followed by 'b':

>>> import re
>>> from timeit import timeit
>>> text = 'a' * 100000
>>> timeit(lambda:re.findall(r'a+b', text), number=1)
6.64353118114

We know that any successful match must be preceded by a character other than 
'a' (or the beginning of the string), so we can reject many unsuccessful 
matches like this:

>>> timeit(lambda:re.findall(r'(?:^|[^a])(a+b)', text), number=1)
0.00374348114981

In your case, a successful match must be preceded by [^a-zA-Z0-9_.+-] (or the 
beginning of the string).
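
As a rough sketch of how that guard might be applied to the pattern from the original report: the guard goes in front as a non-capturing group and the address itself is captured, so re.findall returns only the address. The names GUARDED_EMAIL and find_emails and the sample text are hypothetical, introduced here only for illustration.

```python
import re

# The reported pattern, wrapped in a capturing group and prefixed with a
# guard: a match may only start at the beginning of the string or right
# after a character that cannot itself belong to the local part.
GUARDED_EMAIL = re.compile(
    r"(?:^|[^a-zA-Z0-9_.+-])"
    r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,3})"
)


def find_emails(text):
    """Return the addresses found in text (the contents of group 1)."""
    return GUARDED_EMAIL.findall(text)


# Hypothetical sample: a long run of local-part characters with no '@',
# followed by two addresses.
sample = "x" * 100000 + " contact: alice@example.com, bob.smith@mail.org"
print(find_emails(sample))  # ['alice@example.com', 'bob.smith@mail.org']
```

Inside the long run every start position except the first is rejected by the guard after a single character test, so the run is scanned only once. A negative lookbehind, (?<![a-zA-Z0-9_.+-]), would probably serve the same purpose without consuming the preceding character.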

--
nosy: +g...@garethrees.org




[issue46065] re.findall takes forever and never ends

2021-12-13 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

Limit the number of repetitions. For example, use "{1,100}" (or whatever the 
expected maximum length of an email address is) instead of "+".
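
A minimal sketch of this suggestion, assuming the pattern from the original report and a cap of 100 characters on the part before '@'. The names BOUNDED_EMAIL and find_emails_bounded and the sample text are hypothetical.

```python
import re

# Same pattern as in the report, except that the local part is capped at
# 100 characters instead of using "+". The cap is an assumption about the
# longest local part expected in the input.
BOUNDED_EMAIL = re.compile(
    r"[a-zA-Z0-9_.+-]{1,100}@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,3}"
)


def find_emails_bounded(text):
    """Return all matches of the bounded pattern in text."""
    return BOUNDED_EMAIL.findall(text)


# Hypothetical sample: a long run without '@', followed by one address.
sample = "x" * 100000 + " alice@example.com"
print(find_emails_bounded(sample))  # ['alice@example.com']
```

With the cap, each start position costs at most about 100 steps forward and back, so the total work grows linearly with the length of the input. One side effect to keep in mind: a local part longer than the cap is still matched, but only its last 100 characters are returned.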




[issue46065] re.findall takes forever and never ends

2021-12-13 Thread Ramzi Trabelsi


Ramzi Trabelsi  added the comment:

Thanks for the answer. Is there any workaround for this?




[issue46065] re.findall takes forever and never ends

2021-12-13 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

The simplest example is:

re.search('a+@', 'a'*100000)




[issue46065] re.findall takes forever and never ends

2021-12-13 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

It does end, but it takes several minutes to complete.

It is a limitation of the regular expression implementation in Python. Your 
input contains a sequence of 588431 characters which match the pattern 
[a-zA-Z0-9_.+-] and are not followed by '@'. The engine finds the first 
character in this sequence, then scans 588431 characters matching this pattern, 
but does not find '@' after them. So it backtracks: it steps back by one 
character, tries to match '@', fails, and continues stepping back until it 
returns to the initial character. 588431 steps forward and 588431 steps back 
are needed to find that there is no match starting at this position. So it 
steps forward and tries the procedure from a new position. Now it does 588430 
steps in both directions. In total it needs to do 
588431+588430+588429+...+1 ~ 588431**2/2 ~ 173e9 steps. It takes a long time.
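
The estimate can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name backtracking_steps is hypothetical, and it counts one step per character of the run for each start position, as in the sum above.

```python
def backtracking_steps(run_length):
    """Return run_length + (run_length - 1) + ... + 1."""
    return run_length * (run_length + 1) // 2


n = 588431  # length of the run of matching characters in the reported input
steps = backtracking_steps(n)
print(f"{steps:,} steps, about {steps / 1e9:.0f}e9")
# 173,125,815,096 steps, about 173e9
```

Because the count is quadratic in the length of the run, doubling the run roughly quadruples the work.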




[issue46065] re.findall takes forever and never ends

2021-12-13 Thread Ned Deily


Change by Ned Deily :


--
components: +Regular Expressions
nosy: +ezio.melotti, mrabarnett, serhiy.storchaka -ned.deily, ronaldoussoren
type: crash -> behavior




[issue46065] re.findall takes forever and never ends

2021-12-13 Thread Ramzi Trabelsi


Change by Ramzi Trabelsi :


--
components:  -macOS




[issue46065] re.findall takes forever and never ends

2021-12-13 Thread Ramzi Trabelsi


New submission from Ramzi Trabelsi :

Parsing emails from this text takes forever and never ends. The code is below, 
and the file res.html is attached.
The behavior is the same on Windows 10, 11 and Ubuntu 18.04.

CODE:

import re
pattern_email  = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,3}"
with open("res.html","r",encoding="utf-8") as FF:
    TEXT = FF.read()
matched_email  = re.findall(pattern_email,TEXT)

--
components: macOS
files: res.zip
messages: 408453
nosy: ned.deily, ramzitra, ronaldoussoren
priority: normal
severity: normal
status: open
title: re.findall takes forever and never ends
type: crash
versions: Python 3.9
Added file: https://bugs.python.org/file50488/res.zip
