[issue11744] re.LOCALE doesn't reflect locale.setlocale(...)

2011-04-03 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

I don't know what re is doing with respect to locale, but I do know that the 
implementation of string.letters is at least somewhat broken in 2.x.  It has no 
useful meaning in unicode, which is why it doesn't exist in 3.x.

A standard that talks about regex and locale is here:

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html

I don't know enough about locale or regex to comment further, but from the 
perspective of what I know about current developer resources and focus I would 
say that if anything is going to be changed, it would be by mrabarnett in the 
new engine.  Unless mrab (or you?) does it, the old engine is unlikely to be 
touched at this point.

--
nosy: +r.david.murray

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11744
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11744] re.LOCALE doesn't reflect locale.setlocale(...)

2011-04-03 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Thanks for the comment on string.letters and for the further reference.
Given that Mr. Barnett mentioned in the tracker for his regex module ( 
http://code.google.com/p/mrab-regex-hg/issues/detail?id=6 ) that he only 
supports the LOCALE flag for compatibility with re, and given my zero 
knowledge of C, I suppose we will live with the status quo.
I guess that if there were a well-defined source of letters for the given 
locales, the implementation wouldn't necessarily have to be that complex (in 
the context of the regex code), but as there is probably no agreement in this 
respect (if even string.letters is questionable), it becomes pointless.
After all, one can define the needed regex pattern manually, and mrab's regex 
library makes this much easier thanks to its support for Unicode properties and 
other features.
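
A minimal sketch of the "define the pattern manually" approach, written for 
Python 3 and the standard re module only. The letter inventory in CZECH_EXTRA 
and the helper name czech_words are assumptions made for illustration; they 
are not taken from any locale database or from the regex library.

```python
import re

# Assumed letter inventory: the accented letters of the Czech alphabet.
# This list is illustrative, not derived from the C locale tables.
CZECH_EXTRA = "áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ"

# Explicit character class: ASCII letters plus the Czech additions.
CZECH_WORD = re.compile("[A-Za-z" + CZECH_EXTRA + "]+")


def czech_words(text):
    """Return the runs of ASCII and Czech letters found in text."""
    return CZECH_WORD.findall(text)


if __name__ == "__main__":
    # Digits and punctuation such as the inverted question mark are skipped.
    print(czech_words("Příliš žluťoučký kůň, 42 ¿?"))
```

The pattern does not depend on locale.setlocale(...) at all, which is the 
point: the set of accepted characters is spelled out in the pattern itself.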

vbr




[issue11744] re.LOCALE doesn't reflect locale.setlocale(...)

2011-04-03 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

Yeah, as far as I could tell from a brief scan of Google hits, locale support 
in regex in general is a legacy thing, and the correct thing to do is to use 
Unicode properties.  So I'll close this as won't fix.  If someone comes along 
with the motivation to fix it, it can always be reopened.
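
As a rough illustration of matching on Unicode properties instead of the 
locale, here is a sketch for Python 3 using only the standard library. The 
helper names letter_runs and is_letter are hypothetical; the regex module's 
own property syntax is not used here.

```python
import re
import unicodedata

# In Python 3, \w in a str pattern is Unicode-aware by default.
# Removing \d and _ from it leaves mostly letters; a few numeric
# characters (for example superscript digits) may still be accepted.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def letter_runs(text):
    """Return runs of Unicode word characters, excluding digits and '_'."""
    return LETTER_RUN.findall(text)


def is_letter(ch):
    """True if the character's Unicode general category is a letter (L*)."""
    return unicodedata.category(ch).startswith("L")


if __name__ == "__main__":
    print(letter_runs("žluťoučký kůň_42 αβγ"))
    # Characters from the report: the inverted question mark is filtered out.
    print([ch for ch in "zƒ¿À" if is_letter(ch)])
```

The unicodedata check is the stricter of the two: it classifies each 
character by its general category, independent of any code page.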

--
resolution:  -> wont fix
stage:  -> committed/rejected
status: open -> closed




[issue11744] re.LOCALE doesn't reflect locale.setlocale(...)

2011-04-02 Thread Vlastimil Brom

New submission from Vlastimil Brom vlastimil.b...@gmail.com:

Hi,
I just noticed a behaviour of the re.LOCALE flag that I can't understand. I first 
reported it against the new regex implementation, which, however, only mimics the 
standard library re in this case:
http://code.google.com/p/mrab-regex-hg/issues/detail?id=6
I also couldn't find anything relevant in the tracker other than some older, 
already fixed issues; I'm sorry if I missed something.
I thought the search pattern (?L)\w would match any of the respective 
string.letters according to the current locale (and possibly also 
[0-9_]).

However, the locale doesn't seem to be reflected in the expected way.

>>> unicode_BMP = "" + "".join(unichr(i) for i in range(1, 0x10000))
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> import re
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzŠŒŽšœžŸ£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzƒ¢²³µ¸¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþ
 

>>> unicode_BMP = "" + "".join(unichr(i) for i in range(1, 0x10000))
>>> import string
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> print unicode(string.letters, "windows-1250")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzŠŚŤŽŹšśťžźŁĄŞŻłµąşĽľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ
>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print unicode(string.letters, "windows-1253")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒΆµΈΉΊΌΎΏΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώ
 

It seems that the letter set nearest to the result of the re/regex LOCALE flag 
might be ASCII or the US locale:

>>> locale.setlocale(locale.LC_ALL, "US")
'English_United States.1252'
>>> print unicode(string.letters, "windows-1252")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒŠŒŽšœžŸªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
 

however, there are some differences too, namely between zƒ and À
re (?L)\w : 
Czech
zŠŒŽšœžŸ£¥ª¯³µ¹º¼¾¿À
Greek
zƒ¢²³µ¸¹º¼¾¿À
string.letters -- US locale
zƒŠŒŽšœžŸªµºÀ
(as displayed in the Tkinter IDLE shell)
(in either case, there are some items that one wouldn't consider usual word 
characters, cf. ¿)

I am not sure whether other issues are involved (like some encoding or display 
peculiarities in Tkinter), but re matching with the LOCALE flag doesn't 
reflect locale.setlocale(...) in a transparent way.

Is it supposed to work this way, and is there another way to get the 
locale-aware matching one might expect according to:
http://docs.python.org/library/re.html#re.LOCALE

"Make \w, \W, \b, \B, \s and \S dependent on the current locale."
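
For readers on later Python 3 releases, a hedged sketch of how the flag can be 
exercised. It assumes Python 3.6 or newer, where, as far as I know, re.LOCALE 
is accepted only for bytes patterns and classifies single bytes of an 8-bit 
encoding. The helper names and the locale name "cs_CZ.ISO8859-2" are 
assumptions; locale names are platform-dependent and the locale may not be 
installed.

```python
import locale
import re

# re.LOCALE is accepted for bytes patterns; each byte is classified
# by the C library according to the current LC_CTYPE locale.
WORD_BYTES = re.compile(rb"\w+", re.LOCALE)


def locale_words(text, encoding):
    """Encode text with an 8-bit codec and return the byte runs \\w accepts."""
    return WORD_BYTES.findall(text.encode(encoding, errors="replace"))


def str_pattern_rejected():
    """Report whether combining re.LOCALE with a str pattern raises ValueError."""
    try:
        re.compile(r"\w+", re.LOCALE)
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    try:
        # Assumed locale name; adjust for the platform in use.
        locale.setlocale(locale.LC_CTYPE, "cs_CZ.ISO8859-2")
    except locale.Error:
        print("locale not installed; results reflect the current locale")
    print(locale_words("žluťoučký kůň", "iso8859-2"))
    print(str_pattern_rejected())
```

The encoding passed to locale_words has to agree with the code page of the 
active locale, otherwise the byte classification is meaningless.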



Using Python 2.7.1, 32-bit; Windows 7 Home Premium 64-bit, Czech.

in Python 3.1.3 as well as 3.2 the result is the same (with the appropriately 
modified code): ...
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> import re
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzŠŒŽšœžŸ£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
 

However, in Python 3, there is no comparison with string.letters available 
anymore.

Regards,
Vlastimil Brom

--
components: Regular Expressions, Unicode
messages: 132826
nosy: vbr
priority: normal
severity: normal
status: open
title: re.LOCALE doesn't reflect locale.setlocale(...)
type: behavior
versions: Python 2.7, Python 3.1, Python 3.2




[issue11744] re.LOCALE doesn't reflect locale.setlocale(...)

2011-04-02 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +ezio.melotti, mrabarnett
