[issue25054] Capturing start of line '^'

2018-03-14 Thread Serhiy Storchaka

Change by Serhiy Storchaka:


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25054] Capturing start of line '^'

2017-12-04 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:


New changeset 70d56fb52582d9d3f7c00860d6e90570c6259371 by Serhiy Storchaka in 
branch 'master':
bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns. 
(#4471)
https://github.com/python/cpython/commit/70d56fb52582d9d3f7c00860d6e90570c6259371





[issue25054] Capturing start of line '^'

2017-12-02 Thread Matthew Barnett

Matthew Barnett added the comment:

findall() and finditer() consist of multiple uses of search(), basically, as do 
sub() and split(), so we want the same rule to apply to them all.




[issue25054] Capturing start of line '^'

2017-12-02 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Avoiding a ZWM after an NWM in re.sub() is explicitly documented (and the
documentation is correct in this case). This follows the behavior of the
ancient RE implementation. It was once broken in sre, but then fixed (see
21009b9c6fc40b25fcb30ee60d6108f235733e40, issue462270). Changing this behavior
doesn't break anything in the stdlib except the test written specifically for
it. I think it is better to keep this behavior, and perhaps discuss changing it
(to match the behavior of other RE engines) in a separate issue.

I don't know how the behavior of findall() and finditer() is related to this.




[issue25054] Capturing start of line '^'

2017-12-02 Thread Matthew Barnett

Matthew Barnett added the comment:

The pattern:

\b|:+

will match a word boundary (zero-width) before colons, so if there's a word 
followed by colons, finditer will find the boundary and then the colons. You 
_can_ get a zero-width match (ZWM) joined to the start of a nonzero-width match 
(NWM). That's not really surprising.

If you wanted to avoid a ZWM joined to either end of a NWM, you'd need to keep 
looking for another match at a position even after you'd already found a match 
if what you'd found was zero-width. That would also affect re.search and 
re.match.

For regex on Python 3.7, I'm going with avoiding a ZWM joined to the end of a 
NWM, unless re's going a different way, in which case I have more work to do to 
remain compatible! The change I did for Python 3.7+ was trivial.




[issue25054] Capturing start of line '^'

2017-12-02 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The clause "Empty matches are included in the result unless they touch the
beginning of another match" was added in
2f3e5483a3324b44fa5dbbb98859dc0ac42b6070 (issue732120) and I suppose it was
never correct. So we can ignore it in the context of this issue.




[issue25054] Capturing start of line '^'

2017-12-02 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Good point. Neither the old nor the new behavior (which matches regex) conforms
to the documentation: "Empty matches are included in the result unless they
touch the beginning of another match." It is easy to exclude empty matches that
touch the *ending* of another match. This would be consistent with the new
behavior of split() and sub().

But this would break one existing test for issue817234. Though the test for
that issue shouldn't rely on this detail; it should just test that iterating
doesn't hang.

And this would break a regular expression in pprint.

PR 4678 implements this version. I don't know which version is better.

>>> list(re.finditer(r"\b|:+", "a::bc"))
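[<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1, 1),
match=''>, <re.Match object; span=(1, 3), match='::'>, <re.Match object;
span=(5, 5), match=''>]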
>>> re.sub(r"(\b|:+)", r"[\1]", "a::bc")
'[]a[][::]bc[]'

With PR 4471 the result of re.sub() is the same, but the result of 
re.finditer() is as in msg307424.




[issue25054] Capturing start of line '^'

2017-12-02 Thread Serhiy Storchaka

Change by Serhiy Storchaka:


--
pull_requests: +4586




[issue25054] Capturing start of line '^'

2017-12-02 Thread Martin Panter

Martin Panter added the comment:

The new “finditer” behaviour seems to contradict the documentation about 
excluding empty matches if they touch the start of another match.

>>> list(re.finditer(r"\b|:+", "a::bc"))
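[<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1, 1),
match=''>, <re.Match object; span=(1, 3), match='::'>, <re.Match object;
span=(3, 3), match=''>, <re.Match object; span=(5, 5), match=''>]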

An empty match at (1, 1) is included, despite it touching the beginning of the 
match at (1, 3). My best guess is that when an empty match is found, searching 
continues at the same position for the first non-empty match.

--
nosy: +martin.panter




[issue25054] Capturing start of line '^'

2017-12-01 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Could anybody please review at least the documentation part? I want to merge
this before 3.7.0a3 is released.

Initially I was going to backport the part that relates to findall(),
finditer() and sub(). It changes the behavior only in corner cases and I didn't
expect it could break real code. But since it broke a pattern in the doctest
module, I am afraid it can break third-party code.




[issue25054] Capturing start of line '^'

2017-11-19 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

PR 4471 fixes this issue, issue1647489, and a couple of similar issues.

The most visible change is the change in re.split(). This is a
compatibility-breaking change, and it affects third-party code. But a
ValueError or FutureWarning has been raised for two Python releases, since
Python 3.5, for the patterns whose behavior changes in this PR. Developers have
had enough time to fix them. In most cases the fix is as trivial as changing
`*` to `+` in `\s*`.
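
As an illustration of that kind of fix (a sketch with made-up sample text, not
taken from the PR): a separator pattern written with `+` cannot match the empty
string, so it is not affected by how re.split() treats empty matches.

```python
import re

text = "alpha  beta\tgamma"

# r"\s+" needs at least one whitespace character, so it never produces an
# empty match and splits the same way before and after the change.
print(re.split(r"\s+", text))   # ['alpha', 'beta', 'gamma']

# r"\s*" can also match the empty string, which is the kind of pattern the
# FutureWarning was about.
print(re.fullmatch(r"\s*", "") is not None)   # True
```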

Changes in sub(), findall(), and finditer() are less visible. No existing
test needs modification for them. Was:

>>> re.split(r"\b|:+", "a::bc")
/usr/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty 
pattern match.
  return _compile(pattern, flags).split(string, maxsplit)
['a:', 'bc']
>>> re.sub(r"\b|:+", "-", "a::bc")
'-a-:-bc-'
>>> re.findall(r"\b|:+", "a::bc")
['', '', ':', '', '']
>>> list(re.finditer(r"\b|:+", "a::bc"))
[<_sre.SRE_Match object; span=(0, 0), match=''>, <_sre.SRE_Match object; 
span=(1, 1), match=''>, <_sre.SRE_Match object; span=(2, 3), match=':'>, 
<_sre.SRE_Match object; span=(3, 3), match=''>, <_sre.SRE_Match object; 
span=(5, 5), match=''>]

Fixed:

>>> re.split(r"\b|:+", "a::bc")
['', 'a', '', 'bc', '']
>>> re.sub(r"\b|:+", "-", "a::bc")
'-a--bc-'
>>> re.findall(r"\b|:+", "a::bc")
['', '', '::', '', '']
>>> list(re.finditer(r"\b|:+", "a::bc"))
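[<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1, 1),
match=''>, <re.Match object; span=(1, 3), match='::'>, <re.Match object;
span=(3, 3), match=''>, <re.Match object; span=(5, 5), match=''>]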

The behavior of re.split(), re.findall() and re.finditer() is now the same as
in the regex module with the V1 flag. But the behavior of re.sub() was left
closer to the previous behavior, otherwise this would break existing tests. It
is consistent with re.split() rather than with re.findall() and re.finditer().
In regex with the V1 flag sub() is consistent with findall() and finditer(),
but not with split().
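
Because these results differ between releases, it can help to run all four
functions side by side on the interpreter at hand. A minimal sketch; `compare`
is a hypothetical helper, and no expected output is shown because it depends
on the Python version:

```python
import re
import sys

def compare(pattern, text, repl="-"):
    """Run split(), sub(), findall() and finditer() with the same pattern."""
    regex = re.compile(pattern)
    return {
        "split": regex.split(text),
        "sub": regex.sub(repl, text),
        "findall": regex.findall(text),
        # Spans are easier to compare than match object reprs.
        "finditer": [m.span() for m in regex.finditer(text)],
    }

print(sys.version_info[:2])
for name, result in compare(r"\b|:+", "a::bc").items():
    print(f"{name:9}{result!r}")
```

For a pattern without groups, the findall() list should always be the text of
the finditer() spans, so any inconsistency shows up between those two rows and
the split() and sub() rows.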




[issue25054] Capturing start of line '^'

2017-11-19 Thread Serhiy Storchaka

Change by Serhiy Storchaka:


--
keywords: +patch
pull_requests: +4403
stage:  -> patch review




[issue25054] Capturing start of line '^'

2017-11-16 Thread Serhiy Storchaka

Change by Serhiy Storchaka:


--
assignee:  -> serhiy.storchaka
nosy: +serhiy.storchaka
versions: +Python 2.7, Python 3.7 -Python 3.5




[issue25054] Capturing start of line '^'

2016-01-01 Thread Ezio Melotti

Ezio Melotti added the comment:

AFAIU the problem is at Modules/_sre.c:852: after matching, if the ptr is still 
at the start position, the start position gets incremented to avoid an endless 
loop.
Ideally the problem could be avoided by marking and skipping the part(s) of the 
pattern that have already been tested and produced a zero-length match, however 
I don't see any easy way to do it.
Unless someone can come up with a reasonable solution, I would suggest closing
this as wontfix, and possibly adding a note to the docs about this corner case.
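
The skipping described above can be reproduced in pure Python. This is only a
sketch of the idea, not the code in Modules/_sre.c; the function name
`scan_skipping_after_empty` is made up for illustration:

```python
import re

def scan_skipping_after_empty(pattern, string):
    """Yield matches the way the old scan does: after an empty match, restart
    the search one character later so the same match is not found again."""
    regex = re.compile(pattern)
    pos = 0
    while pos <= len(string):
        m = regex.search(string, pos)
        if m is None:
            break
        yield m
        if m.end() == m.start():
            # Empty match: bump the start position to avoid an endless loop.
            # This is the step that can skip over a real match.
            pos = m.end() + 1
        else:
            pos = m.end()

print([m.group() for m in scan_skipping_after_empty("^|a", "a")])    # ['']
print([m.group() for m in scan_skipping_after_empty("^|a", " a")])   # ['', 'a']
```

In the first call the 'a' at position 0 is never tried as a match of its own,
because the restart position has already moved past it.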

--
versions: +Python 3.5, Python 3.6 -Python 3.4




[issue25054] Capturing start of line '^'

2015-09-10 Thread Matthew Barnett

Matthew Barnett added the comment:

Just to confirm, it _is_ a bug.

It tries to avoid getting stuck, but the way it does that causes it to skip a 
character, sometimes missing a match it should have found.




[issue25054] Capturing start of line '^'

2015-09-10 Thread Alcolo Alcolo

Alcolo Alcolo added the comment:

Naively, I thought that ^ would be considered a 0-length token (like $, \b,
\B), so that after capturing it we could read the next token: 'a' (for the
input string "a").

I use a simple workaround: prepending ' ' to my string (because ' ' does not
affect my regex results).




[issue25054] Capturing start of line '^'

2015-09-10 Thread Matthew Barnett

Matthew Barnett added the comment:

After matching '^', it advances so that it won't find the same match again (and 
again and again...).

Unfortunately, that means that it sometimes misses some matches.

It's a known issue.




[issue25054] Capturing start of line '^'

2015-09-10 Thread R. David Murray

R. David Murray added the comment:

^ finds an empty match at the beginning of the string, $ finds an empty match 
at the end.  I don't see the bug (but I'm not a regex expert).

--
nosy: +r.david.murray




[issue25054] Capturing start of line '^'

2015-09-10 Thread Alcolo Alcolo

New submission from Alcolo Alcolo:

Why
re.findall('^|a', 'a') != ['', 'a'] ?

We have:
re.findall('^|a', ' a') == ['', 'a']
and
re.findall('$|a', ' a') == ['a', '']

Capturing '^' takes the 1st character. It looks like a bug ...
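
The three calls above can be rerun on any interpreter to see which behavior it
has. A minimal sketch, assuming Python 3.7 or later for the output formatting;
the helper name `report_cases` is illustrative only:

```python
import re

def report_cases():
    """Return the findall() results for the three cases from the report."""
    return {
        ("^|a", "a"): re.findall("^|a", "a"),
        ("^|a", " a"): re.findall("^|a", " a"),
        ("$|a", " a"): re.findall("$|a", " a"),
    }

for (pattern, text), result in report_cases().items():
    print(f"findall({pattern!r}, {text!r}) -> {result!r}")
```

On the versions this report was filed against, the first case is expected to
come back as [''] only; on 3.7 and later it is expected to be ['', 'a'].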

--
components: Regular Expressions
messages: 250364
nosy: Alcolo Alcolo, ezio.melotti, mrabarnett
priority: normal
severity: normal
status: open
title: Capturing start of line '^'
type: behavior
versions: Python 3.4
