New submission from Vlastimil Brom vlastimil.b...@gmail.com:
Hi,
I just noticed a behaviour of the re.LOCALE flag that I can't understand. I first
reported it against the new regex implementation, which, however, only mimics the
standard library re module in this case:
http://code.google.com/p/mrab-regex-hg/issues/detail?id=6
I also couldn't find anything relevant in the tracker other than some older,
already fixed issues; I'm sorry if I missed something.
I thought the search pattern (?L)\w would match any of the respective
string.letters according to the current locale (and possibly, in addition,
[0-9_]).
However, the locale doesn't seem to be reflected in the expected way.
>>> unicode_BMP = u"".join(unichr(i) for i in range(1, 0x10000))
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> import re
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz¢²³µ¸¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþ
>>> unicode_BMP = u"".join(unichr(i) for i in range(1, 0x10000))
>>> import string
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> print unicode(string.letters, "windows-1250")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzŠŚŤŽŹšśťžźŁĄŞŻłµąşĽľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ
>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print unicode(string.letters, "windows-1253")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒΆµΈΉΊΌΎΏΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώ
It seems that the letter set nearest to the result of the re/regex LOCALE flag
might be ASCII or the US locale:
>>> locale.setlocale(locale.LC_ALL, "US")
'English_United States.1252'
>>> print unicode(string.letters, "windows-1252")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒŠŒŽšœžŸªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
However, there are some differences too, namely in the range between z and À:
re (?L)\w, Czech locale:     z£¥ª¯³µ¹º¼¾¿À
re (?L)\w, Greek locale:     z¢²³µ¸¹º¼¾¿À
string.letters, US locale:   zƒŠŒŽšœžŸªµºÀ
(as displayed in the Tkinter IDLE shell)
(In either case, there are some items one wouldn't consider usual word
characters, cf. ¿.)
I am not sure whether other issues are involved (such as some encoding or display
peculiarities in Tkinter), but re matching with the LOCALE flag doesn't
reflect locale.setlocale(...) in a transparent way.
Is it supposed to work this way, and is there another way to get the
locale-aware matching one might expect according to:
http://docs.python.org/library/re.html#re.LOCALE
Make \w, \W, \b, \B, \s and \S dependent on the current locale.
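One possible workaround, given here only as a minimal sketch and not as a confirmed fix: build an explicit character class from the letters that exist in the 8-bit code page, relying on Unicode character properties instead of the C library's locale tables. The sketch is written for Python 3. The helper names codepage_letters and word_pattern are hypothetical, and using str.isalpha() as the definition of "letter" is an assumption.

```python
import re


def codepage_letters(encoding):
    """Return the letters that fit into one byte of the given 8-bit code page."""
    letters = []
    for value in range(256):
        try:
            char = bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            continue  # this byte is unassigned in the code page
        if char.isalpha():  # assumption: "letter" means Unicode alphabetic
            letters.append(char)
    return "".join(letters)


def word_pattern(encoding):
    """Compile a \\w-like pattern limited to the code page's letters, digits and _."""
    letters = codepage_letters(encoding)
    return re.compile("[" + re.escape(letters) + "0-9_]")


czech_word = word_pattern("cp1250")
greek_word = word_pattern("cp1253")
print("".join(czech_word.findall("žluťoučký kůň ¿ ω")))
print(greek_word.findall("č ω ¿ _ 7"))
```

With this approach the Czech pattern accepts č but not ω, the Greek pattern does the opposite, and symbols such as ¿ are excluded from both.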
I am using Python 2.7.1, 32-bit, on Windows 7 Home Premium 64-bit, Czech.
In Python 3.1.3 as well as 3.2, the result is the same (with appropriately
modified code): ...
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> import re
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
However, in Python 3, string.letters is no longer available for comparison.
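A rough Python 3 stand-in for string.letters could be sketched with re itself, under some assumptions. Current Python 3 versions accept re.LOCALE only for bytes patterns, so the sketch scans all 256 single-byte values. The function name locale_letters is hypothetical, and the result is only an approximation of what string.letters returned in Python 2.

```python
import locale
import re


def locale_letters(locale_name=""):
    """Return the single bytes classified as letters under the given
    LC_CTYPE locale ("" selects the user's default locale)."""
    previous = locale.setlocale(locale.LC_CTYPE)  # remember the current setting
    locale.setlocale(locale.LC_CTYPE, locale_name)
    try:
        # Word characters that are neither digits nor the underscore.
        letter = re.compile(br"[^\W\d_]", re.LOCALE)
        return b"".join(letter.findall(bytes(range(256))))
    finally:
        locale.setlocale(locale.LC_CTYPE, previous)  # restore the old locale


print(locale_letters("C"))
```

The returned bytes would then have to be decoded with the code page of the locale in question, for example windows-1250 for the Czech setup shown above.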
Regards,
Vlastimil Brom
--
components: Regular Expressions, Unicode
messages: 132826
nosy: vbr
priority: normal
severity: normal
status: open
title: re.LOCALE doesn't reflect locale.setlocale(...)
type: behavior
versions: Python 2.7, Python 3.1, Python 3.2
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11744
___