[issue42885] Regex performance problem with ^ aka AT_BEGINNING

2021-01-15 Thread Terry J. Reedy


Change by Terry J. Reedy :


--
nosy: +serhiy.storchaka

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42885] Regex performance problem with ^ aka AT_BEGINNING

2021-01-11 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

I'm getting similar results in Python 3.9.

[steve ~]$ python3.9 -m timeit -s "a = 'A'*1" -s "import re" 
"re.search('^x', a)"
5000 loops, best of 5: 67.3 usec per loop
[steve ~]$ python3.9 -m timeit -s "a = 'A'*10" -s "import re" 
"re.search('^x', a)"
500 loops, best of 5: 639 usec per loop
[steve ~]$ python3.9 -m timeit -s "a = 'A'*100" -s "import re" 
"re.search('^x', a)"
50 loops, best of 5: 6.27 msec per loop
[steve ~]$ python3.9 -m timeit -s "a = 'A'*1000" -s "import re" 
"re.search('^x', a)"
5 loops, best of 5: 62.8 msec per loop
[steve ~]$ python3.9 -m timeit -s "a = 'A'*1" -s "import re" 
"re.search('^x', a)"
1 loop, best of 5: 654 msec per loop


It looks like the time is roughly linear in the length of the string.

I get the same result as far back as Python 2.7 (I haven't tried older 
versions).

[steve ~]$ python2.7 -m timeit -s "a = 'A'*1" -s "import re" 
"re.search('^x', a)"
1 loops, best of 3: 75.7 usec per loop
[steve ~]$ python2.7 -m timeit -s "a = 'A'*1000" -s "import re" 
"re.search('^x', a)"
10 loops, best of 3: 73.4 msec per loop


I would have expected essentially constant time, as in re.match:

[steve ~]$ python3.9 -m timeit -s "a = 'A'*1" -s "import re" "re.match('x', 
a)"
50 loops, best of 5: 560 nsec per loop
[steve ~]$ python3.9 -m timeit -s "a = 'A'*1" -s "import re" 
"re.match('x', a)"
50 loops, best of 5: 561 nsec per loop

--
nosy: +steven.daprano
versions: +Python 3.9

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42885] Regex performance problem with ^ aka AT_BEGINNING

2021-01-10 Thread Arnim Rupp


New submission from Arnim Rupp :

The re lib needs 7 seconds to check if a billion As start with an x. So e.g. 
this statement takes this long:

re.search(r'^x', 'A' * 10)

It takes longer, the longer the string is. The string handling is not the 
problem, checking if it starts which an A takes just 0.00014 seconds. See 
output and code below:

3.10.0a4+ (heads/master:d16f617, Jan  9 2021, 13:24:45) 
[GCC 7.5.0]
testing string len:  10
re_test_false:  0.0008246829966083169
testing string len:  10
re_test_false:  7.317708015005337
testing string len:  10
re_test_true:   0.00014710200048284605


import re, timeit, functools, sys

def re_test_true(string):
print("testing string len: ", len(string))
re.search(r'^A', string)

def re_test_false(string):
print("testing string len: ", len(string))
re.search(r'^x', string)

print(sys.version)

huge_string = 'A' * 10
print('re_test_false: ', timeit.timeit(functools.partial(re_test_false, 
huge_string), number=1))

huge_string = 'A' * 10
print('re_test_false: ', timeit.timeit(functools.partial(re_test_false, 
huge_string), number=1))

print('re_test_true:  ', timeit.timeit(functools.partial(re_test_true, 
huge_string), number=1))

--
components: Library (Lib)
files: regex_timeit.py
messages: 384782
nosy: another_try
priority: normal
severity: normal
status: open
title: Regex performance problem with ^ aka AT_BEGINNING
type: performance
versions: Python 3.10
Added file: https://bugs.python.org/file49733/regex_timeit.py

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com