[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-30 Thread Georg Brandl

Georg Brandl added the comment:

I think you will, Matthew being MRAB on the mailing lists :)

--
nosy: +georg.brandl

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-29 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Fixed. Thank you for the patch, Matthew. I hope to see more of your patches.

--
resolution:  -> fixed
stage: commit review -> committed/rejected
status: open -> closed




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-29 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 44a4f9289faa by Serhiy Storchaka in branch '3.3':
Issue #16688: Fix backreferences did make case-insensitive regex fail on 
non-ASCII strings.
http://hg.python.org/cpython/rev/44a4f9289faa

New changeset c59ee1ff6f27 by Serhiy Storchaka in branch 'default':
Issue #16688: Fix backreferences did make case-insensitive regex fail on 
non-ASCII strings.
http://hg.python.org/cpython/rev/c59ee1ff6f27

--
nosy: +python-dev




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-29 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
assignee:  -> serhiy.storchaka




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-16 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

LGTM.

Matthew, can you please submit a contributor form?

http://python.org/psf/contrib/contrib-form/
http://python.org/psf/contrib/

--
stage: patch review -> commit review




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-16 Thread Matthew Barnett

Changes by Matthew Barnett :


Removed file: http://bugs.python.org/file28330/issue16688#3.patch




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-16 Thread Matthew Barnett

Matthew Barnett added the comment:

Oops! :-( Now corrected.

--
Added file: http://bugs.python.org/file28332/issue16688#3.patch




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-16 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The second test passes on unpatched Python.

--




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-16 Thread Matthew Barnett

Matthew Barnett added the comment:

Here are some tests for the issue.

--
Added file: http://bugs.python.org/file28330/issue16688#3.patch




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-16 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The patches LGTM. How about adding a test?

--




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-15 Thread Matthew Barnett

Matthew Barnett added the comment:

I haven't found any other issues, so here's the second patch.

--
Added file: http://bugs.python.org/file28325/issue16688#2.patch




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-15 Thread Matthew Barnett

Matthew Barnett added the comment:

I found another bug while looking through the source.

On line 495 in function SRE_COUNT:

    if (maxcount < end - ptr && maxcount != 65535)
        end = ptr + maxcount*state->charsize;

where 'end' and 'ptr' are of type 'char*'. That means that 'end - ptr' is the 
length in _bytes_, not characters.

If the byte after the end of the string is 0 then you get this:

>>> # Good:
>>> re.search(r"\x00{1,3}", "a\x00\x00").span()
(1, 3)
>>> # Bad:
>>> re.search(r"\x00{1,3}", "\u0100\x00\x00").span()
(1, 4)

I'll keep looking before submitting a patch.
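
The byte/character mix-up can be reproduced outside C. What follows is a minimal Python sketch, not the actual _sre.c code: `count_limit` is a hypothetical helper that models only the clamp on 'end', with every offset expressed in bytes. The non-buggy branch shows one plausible correction (dividing by the character size before comparing); the committed patch may be written differently.

```python
def count_limit(ptr, end, maxcount, charsize, buggy):
    """Return the byte offset at which repetition counting would stop.

    ptr and end are byte offsets into the string buffer, maxcount is the
    upper bound of the {m,n} repeat, charsize is bytes per character.
    """
    if buggy:
        remaining = end - ptr                # bytes, compared against a character count
    else:
        remaining = (end - ptr) // charsize  # characters (assumed fix)
    if maxcount < remaining and maxcount != 65535:
        end = ptr + maxcount * charsize
    return end


# "\u0100\x00\x00": 3 characters at 2 bytes each, so the buffer ends at byte 6.
# The search reaches character 1, which is byte offset 2.
print(count_limit(2, 6, 3, 2, buggy=True))   # limit lands past the buffer
print(count_limit(2, 6, 3, 2, buggy=False))  # limit stays at the real end

# "a\x00\x00": 1 byte per character, so bytes and characters agree.
print(count_limit(1, 3, 3, 1, buggy=True))
```

In the "Bad" case the buggy comparison sees 4 remaining bytes, decides that 3 repeats fit, and moves 'end' to byte offset 8, two bytes beyond the buffer. That is why a zero byte after the string gets counted as a third match.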

--




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-15 Thread STINNER Victor

STINNER Victor added the comment:

Can someone check whether there are any other similar regressions (introduced
by PEP 393)?

2012/12/15 Serhiy Storchaka :
>
> Changes by Serhiy Storchaka :
>
>
> --
> stage: needs patch -> patch review

--




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-15 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
stage: needs patch -> patch review




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-15 Thread Matthew Barnett

Matthew Barnett added the comment:

OK, here's a patch.

--
keywords: +patch
Added file: http://bugs.python.org/file28321/issue16688.patch




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-15 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Good analysis, Matthew. Do you want to submit a patch?

--
keywords: +easy
stage:  -> needs patch




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-14 Thread Matthew Barnett

Matthew Barnett added the comment:

In function SRE_MATCH, the code for SRE_OP_GROUPREF (line 1290) contains this:

    while (p < e) {
        if (ctx->ptr >= end ||
            SRE_CHARGET(state, ctx->ptr, 0) != SRE_CHARGET(state, p, 0))
            RETURN_FAILURE;
        p += state->charsize;
        ctx->ptr += state->charsize;
    }

However, the code for SRE_OP_GROUPREF_IGNORE (line 1316) contains this:

    while (p < e) {
        if (ctx->ptr >= end ||
            state->lower(SRE_CHARGET(state, ctx->ptr, 0)) != state->lower(*p))
            RETURN_FAILURE;
        p++;
        ctx->ptr += state->charsize;
    }

(In both cases 'p' is of type 'char*'.)

The problem appears to be that the latter is still using '*p' and 'p++' and is 
thus always working with chars (it gets and advances 1 byte at a time instead 
of 1, 2 or 4 bytes for Unicode).
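
To make the effect concrete, here is a rough Python model of the two loops. It is a sketch under assumptions, not the C implementation: `encode`, `charget`, `lower` and `groupref_ignore` are hypothetical stand-ins, and the buffer is packed little-endian so that the first byte of each character is its low byte.

```python
import struct


def encode(text, charsize):
    """Pack code points into a flat byte buffer, charsize bytes per character."""
    fmt = {1: "B", 2: "H", 4: "I"}[charsize]
    return struct.pack("<%d%s" % (len(text), fmt), *map(ord, text))


def charget(buf, pos, charsize):
    """Read one character at byte offset pos (stand-in for SRE_CHARGET)."""
    return int.from_bytes(buf[pos:pos + charsize], "little")


def lower(cp):
    """Simplified case folding (stand-in for state->lower)."""
    low = chr(cp).lower()
    return ord(low) if len(low) == 1 else cp


def groupref_ignore(buf, charsize, group, ptr, buggy):
    """Match the bytes of group (start, end) against the text at byte offset ptr."""
    p, e = group
    end = len(buf)
    while p < e:
        if ptr >= end:
            return False
        # Buggy variant reads a single byte (*p); fixed variant reads a character.
        ref = buf[p] if buggy else charget(buf, p, charsize)
        if lower(charget(buf, ptr, charsize)) != lower(ref):
            return False
        # Buggy variant advances one byte (p++); fixed variant advances charsize.
        p += 1 if buggy else charsize
        ptr += charsize
    return True


wide = encode("aa \u0100", 2)    # U+0100 forces 2 bytes per character
narrow = encode("aa bcd", 1)     # Latin-1 text, 1 byte per character

# Group 1 is the first 'a'; the backreference is tried at the second 'a'.
print(groupref_ignore(narrow, 1, (0, 1), 1, buggy=True))
print(groupref_ignore(wide, 2, (0, 2), 2, buggy=True))
print(groupref_ignore(wide, 2, (0, 2), 2, buggy=False))
```

With 1-byte characters the two variants behave the same, which is why pure Latin-1 input still matched. With 2-byte characters the buggy loop compares the high (zero) byte of 'a' against the next character of the text and reports a failure.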

--




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-14 Thread Arfrever Frehtes Taifersar Arahesis

Changes by Arfrever Frehtes Taifersar Arahesis :


--
nosy: +Arfrever




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-14 Thread Ezio Melotti

Ezio Melotti added the comment:

It works on 2.7 too, and fails on 3.3/3.x.
Maybe it's related to PEP 393?
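
For context, PEP 393 (new in Python 3.3) stores each string with 1, 2 or 4 bytes per character, depending on its widest code point. The sketch below is a rough, CPython-specific way to observe that; `bytes_per_char` is a hypothetical helper, and it assumes `sys.getsizeof` reflects the size of the internal character buffer.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing two string lengths."""
    short = ch * n
    longer = ch * (2 * n)
    # The fixed object header cancels out; what remains is n extra characters.
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // n


for ch in ("a", "\u0100", "\U0001F600"):
    print(ascii(ch), bytes_per_char(ch))
```

A string containing U+0100 therefore switches the whole buffer to 2 bytes per character, which would explain why code that worked on 3.2 could start failing on 3.3 for such input.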

--
versions: +Python 3.4




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-14 Thread STINNER Victor

Changes by STINNER Victor :


--
nosy: +serhiy.storchaka




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-14 Thread STINNER Victor

Changes by STINNER Victor :


--
nosy: +haypo




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-14 Thread pyos

New submission from pyos:

The title says it all: if a regular expression that makes use of backreferences 
is compiled with the `re.I` flag, it will always fail when matched against a 
string that contains characters outside of the U+0000-U+00FF range. I've been 
unable to narrow the bug down further.

A simple example:

>>> import re
>>> r = re.compile(r'(a)\1', re.I)  # should match "aa", "aA", "Aa", or "AA"
>>> r.findall('aa')  # works as expected
['a']
>>> r.findall('aa bcd')  # still works
['a']
>>> r.findall('aa Ā')  # ord('Ā') == 0x0100
[]

The same code works as expected in Python 3.2:

>>> r.findall('aa Ā')
['a']
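
The report can also be turned into a small self-checking script. This is only a sketch: `backref_matches` is a hypothetical helper name, and the results it prints are those of an interpreter where the bug is fixed (or that predates it).

```python
import re


def backref_matches(text):
    """Return findall() results for a case-insensitive backreference pattern."""
    pattern = re.compile(r'(a)\1', re.I)
    return pattern.findall(text)


# Samples covering 1-byte, 2-byte and 4-byte string storage.
samples = ('aa', 'aA bcd', 'aa \u0100', 'AA \U0001F600')
for sample in samples:
    print(ascii(sample), backref_matches(sample))
```

On an affected 3.3 build, the samples containing characters above U+00FF would be expected to return an empty list instead.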

--
components: Regular Expressions
messages: 177518
nosy: ezio.melotti, mrabarnett, pitrou, pyos
priority: normal
severity: normal
status: open
title: Backreferences make case-insensitive regex fail on non-ASCII strings.
type: behavior
versions: Python 3.3
