[issue37723] important performance regression on regular expression parsing

2019-08-04 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

Thank you for your contribution yannvgn.

--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37723] important performance regression on regular expression parsing

2019-07-31 Thread miss-islington


miss-islington  added the comment:


New changeset 77fcccb5321137456549b7f55b819f2c8a4c78a4 by Miss Islington (bot) 
in branch '3.8':
bpo-37723: Fix performance regression on regular expression parsing. (GH-15030)
https://github.com/python/cpython/commit/77fcccb5321137456549b7f55b819f2c8a4c78a4


--
nosy: +miss-islington, miss-islington

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37723] important performance regression on regular expression parsing

2019-07-31 Thread miss-islington


miss-islington  added the comment:


New changeset 33b700ba8cbb128519442eeed8c8747ff73f4524 by Miss Islington (bot) 
in branch '3.7':
bpo-37723: Fix performance regression on regular expression parsing. (GH-15030)
https://github.com/python/cpython/commit/33b700ba8cbb128519442eeed8c8747ff73f4524


--
nosy: +miss-islington

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37723] important performance regression on regular expression parsing

2019-07-31 Thread miss-islington


Change by miss-islington :


--
pull_requests: +14809
pull_request: https://github.com/python/cpython/pull/15060

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37723] important performance regression on regular expression parsing

2019-07-31 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:


New changeset 9f1f3df238e58315e724e50cb0d574d75b94 by Serhiy Storchaka 
(yannvgn) in branch 'master':
bpo-37723: Fix performance regression on regular expression parsing. (GH-15030)
https://github.com/python/cpython/commit/9f1f3df238e58315e724e50cb0d574d75b94


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37723] important performance regression on regular expression parsing

2019-07-31 Thread miss-islington


Change by miss-islington :


--
pull_requests: +14808
pull_request: https://github.com/python/cpython/pull/15059

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37723] important performance regression on regular expression parsing

2019-07-31 Thread yannvgn


yannvgn  added the comment:

Hey Matthew,

we decided to go for this, which is simpler and straightforward:

def _uniq(items):
return list(dict.fromkeys(items))

(see https://github.com/python/cpython/pull/15030)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37723] important performance regression on regular expression parsing

2019-07-31 Thread Matthew Barnett


Matthew Barnett  added the comment:

I've just had a look at _uniq, and the code surprises me.

The obvious way to detect duplicates is with a set, but that requires the items 
to be hashable. Are they?

Well, the first line of the function uses 'set', so they are.

Why, then, isn't it using a set to detect the duplicates?

How about this:

def _uniq(items):
newitems = []
seen = set()
for item in items:
if item not in seen:
newitems.append(item)
seen.add(item)
return newitems

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37723] important performance regression on regular expression parsing

2019-07-31 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

Oh, this is convincing.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37723] important performance regression on regular expression parsing

2019-07-31 Thread yannvgn


yannvgn  added the comment:

> Indeed, it was not expected that the character set contains hundreds of 
> thousands items. What is its size in your real code?

> Could you please show benchmarking results for different implementations and 
> different sizes?

I can't precisely answer that, but sacremoses (a tokenization package) for 
example is strongly impacted. See 
https://github.com/alvations/sacremoses/issues/61#issuecomment-516401853

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37723] important performance regression on regular expression parsing

2019-07-31 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

Indeed, it was not expected that the character set contains hundreds of 
thousands items. What is its size in your real code?

Could you please show benchmarking results for different implementations and 
different sizes?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37723] important performance regression on regular expression parsing

2019-07-31 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
nosy: +serhiy.storchaka

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37723] important performance regression on regular expression parsing

2019-07-30 Thread yannvgn


Change by yannvgn :


--
keywords: +patch
pull_requests: +14788
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/15030

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37723] important performance regression on regular expression parsing

2019-07-30 Thread yannvgn


New submission from yannvgn :

On complex cases, parsing regular expressions takes much, much longer on Python 
>= 3.7

Example (ipython):

In [1]: import re
In [2]: char_list = ''.join([chr(i) for i in range(0x)])
In [3]: long_char_list = char_list * 10
In [4]: pattern = f'[{re.escape(long_char_list)}]'
In [5]: %time compiled = re.compile(pattern)

The test was run on Amazon Linux AMI 2017.03.

On Python 3.6.1, the regexp compiled in 2.6 seconds:
CPU times: user 2.59 s, sys: 30 ms, total: 2.62 s
Wall time: 2.64 s

On Python 3.7.3, the regexp compiled in 15 minutes (~350x increase in this 
case):
CPU times: user 15min 6s, sys: 240 ms, total: 15min 7s
Wall time: 15min 9s

Doing some profiling with cProfile shows that the issue is caused by 
sre_parse._uniq function, which does not exist in Python <= 3.6

The complexity of this function is on average O(N^2) but can be easily reduced 
to O(N).

The issue might not be noticeable with simple regexps, but programs like text 
tokenizers - which use complex regexps - might really be impacted by this 
regression.

--
components: Regular Expressions
messages: 348771
nosy: ezio.melotti, mrabarnett, yannvgn
priority: normal
severity: normal
status: open
title: important performance regression on regular expression parsing
type: performance
versions: Python 3.7, Python 3.8, Python 3.9

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com