[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a
Henry S. Thompson added the comment:

[One year and 2 days later... :-[

Is this fixed in 3.9? If not, the Versions list above should be updated.

The failure of lower() to preserve 'alpha-ness' is a serious bug: it causes significant failures in e.g. Turkish NLP, and it's _not_ just a failure of the documentation!

Please can we move this to category Unicode and get at least this aspect of the problem fixed? Should I raise a separate issue on isalpha() etc.?

--
___
Python tracker <https://bugs.python.org/issue12731>
___
___
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a
Henry S. Thompson added the comment:

This issue is also implicated in a failure of isalpha() and friends. An easy way to see this is to compare

>>> 'İ'.isalpha()
True
>>> 'İ'.lower().isalpha()
False

This results from the use of a combining character to encode the lower-case form of Turkish dotted İ:

>>> import unicodedata
>>> len('İ'.lower())
2
>>> unicodedata.category('İ'.lower()[1])
'Mn'

--
nosy: +HThompson
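
For anyone who wants to reproduce this outside the interactive prompt, here is a minimal sketch. It shows the same isalpha() behaviour, plus the fact that \w in re also refuses the combining mark, which is what the issue title is about. The helper is_alpha_ignoring_marks() is hypothetical, not part of the standard library, and is only one possible workaround: it assumes that accepting any category-M character after a letter is acceptable for the text being processed.

```python
import re
import unicodedata


def is_alpha_ignoring_marks(text: str) -> bool:
    """Hypothetical helper: like str.isalpha(), but tolerate combining
    marks (Unicode general category M*) once a letter has been seen."""
    if not text:
        return False
    seen_letter = False
    for ch in text:
        if ch.isalpha():
            seen_letter = True
        elif seen_letter and unicodedata.category(ch).startswith("M"):
            # Combining mark attached to an earlier letter: accept it.
            continue
        else:
            return False
    return True


upper = "\u0130"       # LATIN CAPITAL LETTER I WITH DOT ABOVE
lower = upper.lower()  # 'i' followed by U+0307 COMBINING DOT ABOVE

print(upper.isalpha())                        # capital form
print(lower.isalpha())                        # lower-cased form
print([unicodedata.name(c) for c in lower])   # the two code points
print(re.fullmatch(r"\w+", lower))            # \w does not cover the mark
print(is_alpha_ignoring_marks(lower))         # workaround accepts it
```

This does not fix anything in str or re; it only makes the mismatch visible and shows that a caller can work around it by treating marks as part of the preceding letter.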