[issue22407] re.LOCALE is nonsensical for Unicode

2014-12-01 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed
type: behavior -> enhancement

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22407
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22407] re.LOCALE is nonsensical for Unicode

2014-12-01 Thread Martin Panter

Martin Panter added the comment:

Looks like revision 561d1d0de518 was meant to fix this issue, but its NEWS 
entry has the wrong issue number.

--
nosy: +vadmium




[issue22407] re.LOCALE is nonsensical for Unicode

2014-12-01 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Indeed. Thank you Martin.

--




[issue22407] re.LOCALE is nonsensical for Unicode

2014-12-01 Thread Roundup Robot

Roundup Robot added the comment:

New changeset abc7fe393016 by Serhiy Storchaka in branch 'default':
Fixed issue number in Misc/NEWS for issue #22407.
https://hg.python.org/cpython/rev/abc7fe393016

--
nosy: +python-dev




[issue22407] re.LOCALE is nonsensical for Unicode

2014-11-11 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
assignee:  -> serhiy.storchaka




[issue22407] re.LOCALE is nonsensical for Unicode

2014-11-11 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
dependencies: +Convert re tests to unittest




[issue22407] re.LOCALE is nonsensical for Unicode

2014-11-11 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

If there are no objections, I'll commit re_deprecate_unicode_locale.patch. 
But it would be good if someone reviewed the doc changes.

--




[issue22407] re.LOCALE is nonsensical for Unicode

2014-10-09 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Here is a simple patch which just deprecates use of the re.LOCALE flag with str 
patterns. It also deprecates use of the re.LOCALE flag together with the 
re.ASCII flag (with bytes patterns) and adds some re.LOCALE related tests.
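
As a rough illustration of what this deprecation means for callers, here is a small sketch. The helper name locale_flag_outcome is made up for this example and is not part of the patch. It assumes that, depending on the interpreter version, the deprecated combinations either emit a warning or are rejected with ValueError (to my knowledge 3.5 warns and 3.6 and later raise), so it reports whichever happens.

```python
import re
import warnings


def locale_flag_outcome(pattern, flags):
    """Hypothetical helper: report 'ok', 'warning' or 'error' for a compile."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        try:
            re.compile(pattern, flags)
        except ValueError:
            # Interpreters that finished the deprecation reject the flags.
            return "error"
    # Interpreters still in the deprecation period only warn.
    return "warning" if caught else "ok"


print("str + LOCALE:          ", locale_flag_outcome(r"\w", re.LOCALE))
print("bytes + LOCALE | ASCII:", locale_flag_outcome(br"\w", re.LOCALE | re.ASCII))
print("bytes + LOCALE:        ", locale_flag_outcome(br"\w", re.LOCALE))
```

Only the bytes pattern with re.LOCALE alone is expected to compile silently.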

--
versions:  -Python 2.7, Python 3.4
Added file: http://bugs.python.org/file36853/re_deprecate_unicode_locale.patch




[issue22407] re.LOCALE is nonsensical for Unicode

2014-09-21 Thread Arfrever Frehtes Taifersar Arahesis

Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com:


--
nosy: +Arfrever




[issue22407] re.LOCALE is nonsensical for Unicode

2014-09-16 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@gmail.com:


--
nosy: +haypo




[issue22407] re.LOCALE is nonsensical for Unicode

2014-09-16 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@gmail.com:


--
components: +Unicode




[issue22407] re.LOCALE is nonsensical for Unicode

2014-09-16 Thread Antoine Pitrou

Antoine Pitrou added the comment:

I don't think we should fix this in 2.x: some people may rely on the old 
behaviour, and it will be difficult for them to debug.
In 3.x, I simply propose we deprecate re.LOCALE for unicode strings and make it 
a no-op.

--




[issue22407] re.LOCALE is nonsensical for Unicode

2014-09-16 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Yes, one solution is to deprecate re.LOCALE for unicode strings and then 
make it incompatible with unicode strings. But I think it would be good to 
implement locale-aware matching.

Example.

>>> import re
>>> for a in 'Ii\u0130\u0131':
...     for b in 'Ii\u0130\u0131':
...         if a != b and re.match(a, b, re.I): print(a, '~', b)
...
I ~ i
I ~ İ
i ~ I
i ~ İ
İ ~ I
İ ~ i

This result is incorrect for Turkish: capital dotless I matches capital İ 
with dot above, and small dotless ı does not match anything.

The regex module produces more relevant output, which includes the matches for 
both Turkish and English:

I ~ i
I ~ ı
i ~ I
i ~ İ
İ ~ i
ı ~ I

With locale tr_TR.utf8 (with the patch):

>>> for a in 'Ii\u0130\u0131':
...     for b in 'Ii\u0130\u0131':
...         if a != b and re.match(a, b, re.I|re.L): print(a, '~', b)
...
I ~ ı
i ~ İ
İ ~ i
ı ~ I

This is the correct result for Turkish.

Therefore there is a use case for this feature.
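
For readers without a tr_TR locale or the patch, the Turkish pairing can be reproduced with a plain-Python sketch. This is not how the patch works (the patch relies on the C library); tr_lower and tr_pairs are hypothetical helpers that hard-code the two Turkish-specific lowercase mappings, assuming those are the only ones that matter for these four letters.

```python
# Hypothetical Turkish-style lowercasing, written for illustration only:
# capital dotless I lowers to dotless ı, capital dotted İ lowers to i.
_TR_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})


def tr_lower(s):
    # Apply the Turkish-specific mappings first, then the default rules.
    return s.translate(_TR_LOWER).lower()


def tr_pairs(chars):
    # Distinct characters that are equal after Turkish-style lowercasing.
    return [(a, b) for a in chars for b in chars
            if a != b and tr_lower(a) == tr_lower(b)]


for a, b in tr_pairs("Ii\u0130\u0131"):
    print(a, "~", b)
```

This prints the same four pairs as the tr_TR.utf8 run above.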

--




[issue22407] re.LOCALE is nonsensical for Unicode

2014-09-16 Thread Antoine Pitrou

Antoine Pitrou added the comment:

Ha, I always forget about the Turkish locale case...

--




[issue22407] re.LOCALE is nonsensical for Unicode

2014-09-14 Thread Serhiy Storchaka

New submission from Serhiy Storchaka:

The current implementation of re.LOCALE support for Unicode strings is 
nonsensical. It works correctly only in Latin1 locales (because a Unicode 
string is interpreted as a Latin1-decoded bytes string, and all characters 
outside the UCS1 range are considered non-word characters); in other locales 
it gives strange and useless results.

>>> import re, locale
>>> locale.setlocale(locale.LC_CTYPE, 'ru_RU.cp1251')
'ru_RU.cp1251'
>>> re.match(br'\w', 'µ'.encode('cp1251'), re.L)
<_sre.SRE_Match object; span=(0, 1), match=b'\xb5'>
>>> re.match(r'\w', 'µ', re.L)
<_sre.SRE_Match object; span=(0, 1), match='µ'>
>>> re.match(br'\w', 'ё'.encode('cp1251'), re.L)
<_sre.SRE_Match object; span=(0, 1), match=b'\xb8'>
>>> re.match(r'\w', 'ё', re.L)
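
The difference between the two str cases can be seen without any locale set. The sketch below uses a hypothetical helper, describe, that only inspects code points and cp1251 bytes. It assumes, as described above, that the str path hands code points up to 0xFF to the byte classification and treats everything above that as a non-word character.

```python
def describe(ch, locale_encoding="cp1251"):
    """Hypothetical helper: how one character looks to each code path."""
    return {
        "code_point": hex(ord(ch)),
        # Only code points up to 0xFF can be mistaken for a Latin1 byte.
        "fits_latin1": ord(ch) <= 0xFF,
        # The byte that the bytes-pattern path actually classifies.
        "locale_byte": ch.encode(locale_encoding),
    }


for ch in "µё":
    print(ch, describe(ch))
```

'µ' has code point 0xb5 and is also byte 0xb5 in cp1251, so the str match succeeds by coincidence. 'ё' has code point 0x451, outside the Latin1 range, so the str match fails even though its cp1251 byte 0xb8 is a word character in that locale.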

The proposed patch fixes re.LOCALE support for Unicode strings. It uses the 
wide-character equivalents of the C character functions (towlower(), 
iswalpha(), etc).

The problem is that these functions do not exist in C89; they were introduced 
only in C99. GCC understands them, but we should check other compilers. 
However, these functions are already used on FreeBSD and MacOS.

--
components: Extension Modules, Library (Lib), Regular Expressions
files: re_unicode_locale.patch
keywords: patch
messages: 226871
nosy: ezio.melotti, mrabarnett, pitrou, serhiy.storchaka
priority: normal
severity: normal
stage: patch review
status: open
title: re.LOCALE is nonsensical for Unicode
type: behavior
versions: Python 2.7, Python 3.4, Python 3.5
Added file: http://bugs.python.org/file36615/re_unicode_locale.patch
