[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2020-01-31 Thread Henry S. Thompson


Henry S. Thompson  added the comment:

[One year and 2 days later... :-[

Is this fixed in 3.9?  If not, the Versions list above should be updated.

The failure of lower() to preserve 'alpha-ness' is a serious bug: it causes 
significant failures in e.g. Turkish NLP, and it's _not_ just a failure of the 
documentation!

Please can we move this to category Unicode and get at least this aspect of the 
problem fixed?  Should I raise a separate issue on isalpha() etc.?

--

___
Python tracker 
<https://bugs.python.org/issue12731>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2019-01-29 Thread Henry S. Thompson

Henry S. Thompson  added the comment:

This issue is also implicated in a failure of isalpha and friends.
An easy way to see this is to compare
>>> 'İ'.isalpha()
True
>>> 'İ'.lower().isalpha()
False

This results from the use of a combining character to encode lower-case Turkish 
dotted i:
>>> len('İ'.lower())
2
>>> import unicodedata
>>> unicodedata.category('İ'.lower()[1])
'Mn'
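
For anyone hitting this in the meantime, here is a minimal sketch of a workaround, not a proposed patch. It assumes that "alphabetic" should tolerate combining marks (any category starting with 'M') once a letter has been seen. The helper name isalpha_ignoring_marks is hypothetical and is not part of str or unicodedata.

```python
import unicodedata


def isalpha_ignoring_marks(text):
    """Hypothetical helper: like str.isalpha(), but accept combining
    marks (categories Mn, Mc, Me) that come after at least one letter."""
    seen_letter = False
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            # A mark with no preceding letter has nothing to attach to.
            if not seen_letter:
                return False
            continue
        if not ch.isalpha():
            return False
        seen_letter = True
    # Mirror str.isalpha(): the empty string is not alphabetic.
    return seen_letter


lowered = "\u0130".lower()  # U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE
print(lowered.isalpha(), isalpha_ignoring_marks(lowered))
```

Note that NFC normalization does not help here, because there is no precomposed code point for 'i' followed by U+0307, so the mark has to be handled explicitly.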

--
nosy: +HThompson
