[issue31193] re.IGNORECASE strips combining character from lower case of LATIN CAPITAL LETTER I WITH DOT ABOVE

2017-08-14 Thread David MacIver

David MacIver added the comment:

Sure, but 'i' is a single code point. The bug is that the regex matches 'i', 
not that it doesn't match the actual two codepoint lower case of the string.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue31193] re.IGNORECASE strips combining character from lower case of LATIN CAPITAL LETTER I WITH DOT ABOVE

2017-08-14 Thread Matthew Barnett

Matthew Barnett added the comment:

The re module works with codepoints; it doesn't understand canonical 
equivalence.

For example, it doesn't recognise that "\N{LATIN CAPITAL LETTER E}\N{COMBINING 
ACUTE ACCENT}" is equivalent to "\N{LATIN CAPITAL LETTER E WITH ACUTE}".

This is true for Python in general, except for identifiers, which are 
normalised:

>>> "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}"
'É'
>>> É = 0
>>> "\N{LATIN CAPITAL LETTER E WITH ACUTE}"
'É'
>>> É
0

This also means that, say, '.' will match only 1 _codepoint_.
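
A minimal sketch of the point above, assuming only the standard re and 
unicodedata modules. The variable names are illustrative and not taken from 
the thread. It compares the two spellings of the accented E directly, then 
after NFC normalisation, and checks what '.' matches in each case:

```python
import re
import unicodedata

# Two spellings of the same visible character.
decomposed = "E\N{COMBINING ACUTE ACCENT}"           # two code points
composed = "\N{LATIN CAPITAL LETTER E WITH ACUTE}"   # one code point

# str comparison works code point by code point, so these differ.
strings_equal = decomposed == composed

# Normalising to NFC composes the pair into the single code point.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed

# '.' consumes exactly one code point, so it cannot cover the decomposed form.
dot_matches_decomposed = re.fullmatch(".", decomposed) is not None
dot_matches_composed = re.fullmatch(".", composed) is not None

print(len(decomposed), len(composed))
print(strings_equal, nfc_equal)
print(dot_matches_decomposed, dot_matches_composed)
```

Normalising both the pattern and the subject string before matching is one 
way to work around this, though it does not make re aware of combining 
sequences in general.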

--
nosy: +mrabarnett




[issue31193] re.IGNORECASE strips combining character from lower case of LATIN CAPITAL LETTER I WITH DOT ABOVE

2017-08-13 Thread Tom Viner

Changes by Tom Viner :


--
nosy: +tomviner




[issue31193] re.IGNORECASE strips combining character from lower case of LATIN CAPITAL LETTER I WITH DOT ABOVE

2017-08-13 Thread David MacIver

New submission from David MacIver:

chr(304).lower() is a two character string - a lower case i followed by a 
combining chr(775) ('COMBINING DOT ABOVE').

The re module seems not to understand the combining character and a regex 
compiled with IGNORECASE will erroneously match a single lower case i without 
the required combining character. The attached file demonstrates this. I've 
tested this on Python 3.6.1 with my locale as ('en_GB', 'UTF-8') (I don't know 
whether that matters for reproducing this, but I know it can affect how 
lower/upper work so am including it for the sake of completeness).

The problem does not reproduce on Python 2.7.13 because in that case 
chr(304).lower() is 'i' without the combining character, so it fails earlier.
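
The attached casing.py is not reproduced here. The following is a rough 
sketch of the kind of check described above, written for Python 3; the 
variable names are illustrative assumptions. The result of the single 'i' 
match is printed rather than assumed, since it is the behaviour under 
discussion:

```python
import re
import unicodedata

capital_i_dot = chr(304)  # LATIN CAPITAL LETTER I WITH DOT ABOVE

# On Python 3 the full lower-case mapping is two code points.
lowered = capital_i_dot.lower()
names = [unicodedata.name(ch) for ch in lowered]

pattern = re.compile(capital_i_dot, re.IGNORECASE)

# The report is that this matches even though the combining dot is absent.
matches_bare_i = pattern.fullmatch("i") is not None

# A one-character pattern cannot cover the two code point lower-case form.
matches_full_lower = pattern.fullmatch(lowered) is not None

print(names)
print("matches bare 'i':", matches_bare_i)
print("matches full lower case:", matches_full_lower)
```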

This is presumably related to #12728, but as that is closed as fixed while this 
still reproduces I don't believe it's a duplicate.

--
components: Library (Lib)
files: casing.py
messages: 300219
nosy: David MacIver
priority: normal
severity: normal
status: open
title: re.IGNORECASE strips combining character from lower case of LATIN 
CAPITAL LETTER I WITH DOT ABOVE
versions: Python 3.6
Added file: http://bugs.python.org/file47080/casing.py
