Roundup Robot added the comment:
New changeset 6f52a3d0f548 by Serhiy Storchaka in branch 'default':
Issue #17381: Fixed handling of case-insensitive ranges in regular expressions.
https://hg.python.org/cpython/rev/6f52a3d0f548
New changeset 7981cb1556cf by Serhiy Storchaka in branch '3.4':
Roundup Robot added the comment:
New changeset ebd48b4f650d by Serhiy Storchaka in branch '2.7':
Backported the optimization of compiling charsets in regular expressions
https://hg.python.org/cpython/rev/ebd48b4f650d
New changeset 6cd4b9827755 by Serhiy Storchaka in branch '2.7':
Issue #17381:
Serhiy Storchaka added the comment:
Thank you Antoine for your review.
--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue17381
Serhiy Storchaka added the comment:
Does the patch look good to you now, Antoine? If there are no objections, I'm
going to commit it soon.
In order to apply the 3.4 patch to 2.7 we need either to significantly modify the
patch or to first backport the issue19329 changes to 2.7 (which would be easier).
Serhiy Storchaka added the comment:
The updated patch for 3.5 addresses Antoine's comments.
Note that 3.4 and 3.5 use different solutions to this issue.
--
dependencies: +Get rid of SRE character tables
Added file: http://bugs.python.org/file36842/re_ignore_case_range-3.5_3.patch
Serhiy Storchaka added the comment:
Actually, the 3.5 patch can be simpler.
--
Added file: http://bugs.python.org/file36839/re_ignore_case_range-3.5_2.patch
Serhiy Storchaka added the comment:
Here is another patch for 3.4. It is more than 10 times faster than the initial
patch in the worst case.
--
Added file: http://bugs.python.org/file36712/re_ignore_case_range-3.4_2.patch
Serhiy Storchaka added the comment:
This patch has a disadvantage: it slows down case-insensitive compilation of
some very wide ranges, e.g. compile(r[\x00-\U0010]+, re.I) (this is the worst
case). In most cases this is not important, because such wide ranges are rare
enough and compiled
Serhiy Storchaka added the comment:
No, issue12728 is a more complicated case.
Here is a patch which fixes this issue and issue3511.
--
assignee:  -> serhiy.storchaka
keywords: +patch
stage:  -> patch review
versions: +Python 3.4, Python 3.5 -Python 3.3
Added file:
Ezio Melotti added the comment:
Is this the same issue described in #12728?
Ezio Melotti added the comment:
Matthew, should this be closed then?
Chris Adams added the comment:
Ezio: given the non-obvious failure, what do you think of at least documenting
this and issuing a warning any time both re.UNICODE and re.IGNORECASE are set?
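
The warning proposed here could look roughly like the sketch below. It is only an illustration of the idea, not something the re module does: `compile_with_warning` is a hypothetical helper name, and the choice of `RuntimeWarning` and the message text are assumptions.

```python
import re
import warnings


def compile_with_warning(pattern, flags=0):
    """Hypothetical wrapper: warn when IGNORECASE and UNICODE are both active."""
    compiled = re.compile(pattern, flags)
    both = re.IGNORECASE | re.UNICODE
    # For str patterns on Python 3, re.UNICODE is set implicitly, so it shows
    # up in compiled.flags even when the caller did not pass it.
    if (compiled.flags & both) == both:
        warnings.warn(
            "re.IGNORECASE combined with re.UNICODE: character ranges "
            "may not match as expected",
            RuntimeWarning,
            stacklevel=2,
        )
    return compiled
```

Checking the flags on the compiled pattern rather than on the `flags` argument also catches inline flags such as `(?i)`.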
Matthew Barnett added the comment:
In issue #3511 the range was slightly unusual, so closing it seemed a
reasonable approach, but the range in this issue is less clearly a problem. My
preference would be to fix it, if possible.
Serhiy Storchaka added the comment:
I'm working on the patch.
--
nosy: +serhiy.storchaka
Chris Adams added the comment:
Ah, that explains it - I'd been hoping based on the re.DEBUG output that the
explicit unicode ranges were preserved.
I found #3511 before opening this one but don't believe the decision should be
the same since this isn't a mixed numeric/alphabetic range.
New submission from Chris Adams:
I noticed an interesting failure while using re.match / re.sub to look for
non-Cyrillic characters in allegedly Russian text:
>>> re.sub(r'[\s\u0400-\u0527]+', ' ', 'Архангельская губерния',
...        flags=re.IGNORECASE)
'Архангельская губерния'
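
A small regression check for the report above might look like the following sketch. It assumes an interpreter that already contains the fix from the changesets listed in this thread; on an affected version the two results would differ. The names `PATTERN`, `TEXT` and `collapse` are introduced here only for illustration.

```python
import re

PATTERN = r'[\s\u0400-\u0527]+'
TEXT = 'Архангельская губерния'


def collapse(text, flags=0):
    """Replace every run of whitespace/Cyrillic characters with one space."""
    return re.sub(PATTERN, ' ', text, flags=flags)


case_sensitive = collapse(TEXT)
case_insensitive = collapse(TEXT, flags=re.IGNORECASE)

# Every character of TEXT is whitespace or lies inside the range, so the
# whole string is one match and both calls should return a single space.
print(repr(case_sensitive))
print(case_sensitive == case_insensitive)
```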
Matthew Barnett added the comment:
The way the re module handles ranges is to convert the two endpoints to lowercase
and then check whether the lowercase form of the character in the text is in that
range.
For example, [A-Z] is converted to the range [\x61-\x7A], and the lowercase
form of 'Q'
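
The range handling described in this comment can be modelled with a short sketch. This is a simplified model, not the actual SRE implementation: it uses `str.lower()` as a stand-in for the engine's own lowercase function, and the helper names are hypothetical.

```python
def _lower(ch):
    """Lowercase one character; keep it unchanged if the mapping is not 1:1."""
    low = ch.lower()
    return low if len(low) == 1 else ch


def lowered_endpoints(lo, hi):
    """Return the range endpoints after lowercasing each one."""
    return _lower(lo), _lower(hi)


def naive_range_match(lo, hi, ch):
    """Model of the old check: is lower(ch) within [lower(lo), lower(hi)]?"""
    low_lo, low_hi = lowered_endpoints(lo, hi)
    return low_lo <= _lower(ch) <= low_hi


# [A-Z] against 'Q': 'q' falls inside [a-z].
print(naive_range_match('A', 'Z', 'Q'))

# The Cyrillic range from the report: the lower endpoint moves up to U+0450.
print([hex(ord(c)) for c in lowered_endpoints('\u0400', '\u0527')])

# 'А' (U+0410) lowercases to U+0430, which is below the lowered range.
print(naive_range_match('\u0400', '\u0527', '\u0410'))
```

Under this model the basic Russian letters (U+0430 to U+044F after lowercasing) all fall below the lowered start of the range, which is consistent with the failed substitution in the original report.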