[issue33108] Unicode char 304 in lowercase has len = 2

2018-03-21 Thread Serhiy Storchaka

Serhiy Storchaka  added the comment:

Thank you Ma Lin.

Closed as a duplicate of issue17252.

--
nosy: +serhiy.storchaka
resolution:  -> duplicate
stage:  -> resolved
status: open -> closed
superseder:  -> Latin Capital Letter I with Dot Above

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33108] Unicode char 304 in lowercase has len = 2

2018-03-21 Thread Ma Lin

Ma Lin  added the comment:

There was an earlier discussion about "Latin Capital Letter I with Dot Above":
https://bugs.python.org/issue17252

--
nosy: +Ma Lin





[issue33108] Unicode char 304 in lowercase has len = 2

2018-03-21 Thread INADA Naoki

INADA Naoki  added the comment:

Maybe we should update UnicodeData?

--




[issue33108] Unicode char 304 in lowercase has len = 2

2018-03-20 Thread Steven D'Aprano

Steven D'Aprano  added the comment:

It has never been the case that upper() or lower() are guaranteed to preserve 
string length in Unicode. For example, some characters decompose into a base 
plus combining characters. Ligatures are another example. See here for more 
details:

https://unicode.org/faq/casemap_charprop.html
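
As a minimal illustration of the point above (a sketch, not part of the original comment), the following checks three well-known cases. It assumes Python 3.6 or later, where str.upper() and str.lower() apply the full Unicode case mappings; the helper name length_change is made up for this example.

```python
# Illustrative sketch: full case mappings can change string length.
# The sample characters are chosen for illustration only.
samples = {
    "\u00df": "upper",   # LATIN SMALL LETTER SHARP S
    "\ufb01": "upper",   # LATIN SMALL LIGATURE FI
    "\u0130": "lower",   # LATIN CAPITAL LETTER I WITH DOT ABOVE
}


def length_change(ch, method):
    """Return (length before, length after) applying the named str method."""
    converted = getattr(ch, method)()
    return len(ch), len(converted)


for ch, method in samples.items():
    before, after = length_change(ch, method)
    print(f"U+{ord(ch):04X}.{method}(): len {before} -> {after}")
```

Each of the three samples goes from length 1 to length 2, so code that indexes into a string after changing its case cannot assume the positions still line up.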


However, this example surprises me. In Python 2, I get the result I expected:

py> import unicodedata
py> c = unichr(304)
py> unicodedata.name(c)
'LATIN CAPITAL LETTER I WITH DOT ABOVE'
py> unicodedata.name(c.lower())
'LATIN SMALL LETTER I'


If I am reading the UnicodeData.txt file correctly, I think that the right 
behaviour is for LATIN CAPITAL LETTER I WITH DOT ABOVE to lowercase to LATIN 
SMALL LETTER I, as it did in Python 2.

ftp://ftp.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt

--
nosy: +steven.daprano




[issue33108] Unicode char 304 in lowercase has len = 2

2018-03-20 Thread Kiril Dimitrov

Kiril Dimitrov  added the comment:

This is roughly my use case:
list(zip("ßx", [0.5, 0.3])) is [('ß', 0.5), ('x', 0.3)], but
list(zip("ßx".upper(), [0.5, 0.3])) is [('S', 0.5), ('S', 0.3)]. In the
latter case you never get to see the value for 'x'.

My expectation was that lower() and upper() preserve text length. That
seemed to be the case in Python 2.7.
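
One possible way to keep such pairs aligned, sketched here under the assumption that each weight belongs to one original character, is to change case per character instead of on the whole string. The function name pair_uppercased is hypothetical and not from the thread.

```python
def pair_uppercased(text, weights):
    """Pair each original character's uppercase form with its weight.

    Uppercasing happens per character, so a character whose uppercase
    form is longer (such as 'ß' -> 'SS') still occupies a single slot.
    """
    if len(text) != len(weights):
        raise ValueError("text and weights must have the same length")
    return [(ch.upper(), w) for ch, w in zip(text, weights)]


print(pair_uppercased("ßx", [0.5, 0.3]))  # [('SS', 0.5), ('X', 0.3)]
```

The trade-off is that the first element of a pair may now be more than one character long, which the consuming code has to tolerate.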

--




[issue33108] Unicode char 304 in lowercase has len = 2

2018-03-20 Thread INADA Naoki

INADA Naoki  added the comment:

Another example:

>>> s = "ß"
>>> len(s)
1
>>> len(s.upper())
2
>>> s.upper()
'SS'
>>> ord(s)
223


> This breaks unicode text matching.

What are you talking about? The re module?

--
nosy: +inada.naoki




[issue33108] Unicode char 304 in lowercase has len = 2

2018-03-20 Thread Kiril Dimitrov

Change by Kiril Dimitrov :


--
title: Unicode char 304 in lowercase has len 2 -> Unicode char 304 in lowercase 
has len = 2




[issue33108] Unicode char 304 in lowercase has len 2

2018-03-20 Thread Kiril Dimitrov

New submission from Kiril Dimitrov :

>>> chr(304)
'İ'
>>> chr(304).lower()
'i̇'
>>> len(chr(304).lower())
2

This breaks unicode text matching. I found no other Unicode character whose
lowercase form behaves this way (tested in 3.6.2 and 3.6.4).
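
To make the report easier to check, here is a small sketch (not from the original submission) that names the code points in the lowercased result and scans the whole code point range for characters whose case conversion is not exactly one code point long. The helper names describe and length_changing are made up for this example, and the scan results depend on the Unicode database version bundled with the interpreter.

```python
import sys
import unicodedata


def describe(s):
    """Return the Unicode name of every code point in s."""
    return [unicodedata.name(ch, "<unnamed>") for ch in s]


def length_changing(method):
    """Code points whose conversion via the named str method
    ('lower' or 'upper') is not exactly one code point long."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if len(getattr(chr(cp), method)()) != 1
    ]


# The lowercase form is 'i' followed by a combining mark.
print(describe(chr(304).lower()))
# ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

# Which characters change length under lower() and upper()?
print([hex(cp) for cp in length_changing("lower")])
print(len(length_changing("upper")))
```

The uppercase scan includes 'ß' (U+00DF), so the length change is not unique to U+0130 once upper() is considered as well.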

--
components: Unicode
messages: 314142
nosy: Kiril Dimitrov, ezio.melotti, vstinner
priority: normal
severity: normal
status: open
title: Unicode char 304 in lowercase has len 2
type: behavior
versions: Python 3.6
