[issue31193] re.IGNORECASE strips combining character from lower case of LATIN CAPITAL LETTER I WITH DOT ABOVE

2017-08-14 Thread David MacIver
David MacIver added the comment: Sure, but 'i' is a single code point. The bug is that the regex matches 'i', not that it fails to match the actual two-code-point lower case of the string.

[issue31193] re.IGNORECASE strips combining character from lower case of LATIN CAPITAL LETTER I WITH DOT ABOVE

2017-08-14 Thread Matthew Barnett
Matthew Barnett added the comment: The re module works with code points; it doesn't understand canonical equivalence. For example, it doesn't recognise that "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}" is equivalent to "\N{LATIN CAPITAL LETTER E WITH ACUTE}". This is true for Python
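A minimal sketch of the canonical-equivalence point above, using only the standard library. The variable names are illustrative, and the NFC normalization step is one possible workaround added here for illustration, not something proposed in the thread.

```python
import re
import unicodedata

# The two spellings from the comment above.
decomposed = "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}"
precomposed = "\N{LATIN CAPITAL LETTER E WITH ACUTE}"

# They are different code point sequences, so neither a plain comparison
# nor the re module treats them as the same text.
plain_equal = decomposed == precomposed
re_match = re.fullmatch(precomposed, decomposed)

# Assumption for illustration: normalizing both sides to NFC before
# comparing or matching makes canonically equivalent forms identical.
nfc_decomposed = unicodedata.normalize("NFC", decomposed)
nfc_precomposed = unicodedata.normalize("NFC", precomposed)
nfc_equal = nfc_decomposed == nfc_precomposed

print(plain_equal, re_match, nfc_equal)
```

Normalizing the pattern and the subject to the same form before matching sidesteps this particular mismatch, but it does not make re aware of canonical equivalence in general.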

[issue31193] re.IGNORECASE strips combining character from lower case of LATIN CAPITAL LETTER I WITH DOT ABOVE

2017-08-13 Thread Tom Viner
Changes by Tom Viner: nosy: +tomviner

[issue31193] re.IGNORECASE strips combining character from lower case of LATIN CAPITAL LETTER I WITH DOT ABOVE

2017-08-13 Thread David MacIver
New submission from David MacIver: chr(304).lower() is a two-character string, a lower case i followed by a combining chr(775) ('COMBINING DOT ABOVE'). The re module seems not to understand the combining character, and a regex compiled with IGNORECASE will erroneously match a single lower
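The report can be reproduced with a short script along these lines. This is a sketch with illustrative variable names; the lower() result is standard Python 3 behaviour, while the outcome of the IGNORECASE match is printed rather than assumed, since it may depend on the Python version.

```python
import re
import unicodedata

upper_i_dot = chr(304)         # U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = upper_i_dot.lower()  # full case mapping yields two code points

# Inspect what lower() actually produced.
code_points = [hex(ord(c)) for c in lowered]
names = [unicodedata.name(c) for c in lowered]
print(len(lowered), code_points, names)

# The behaviour described in the report: does the case-insensitive
# pattern match a bare 'i' with no combining dot?
pattern = re.compile(upper_i_dot, re.IGNORECASE)
print("matches plain 'i':", bool(pattern.match("i")))
```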