[issue31979] Simplify converting non-ASCII strings to int, float and complex

2018-07-13 Thread INADA Naoki


Change by INADA Naoki:


--
pull_requests: +7810

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue31979] Simplify converting non-ASCII strings to int, float and complex

2017-11-23 Thread STINNER Victor

STINNER Victor added the comment:


New changeset 6a54c676e63517653d3d4a1e164bdd0fd45132d8 by Victor Stinner in 
branch 'master':
bpo-31979: Remove unused align_maxchar() function (#4527)
https://github.com/python/cpython/commit/6a54c676e63517653d3d4a1e164bdd0fd45132d8


--




[issue31979] Simplify converting non-ASCII strings to int, float and complex

2017-11-23 Thread STINNER Victor

Change by STINNER Victor:


--
pull_requests: +4463




[issue31979] Simplify converting non-ASCII strings to int, float and complex

2017-11-13 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Thank you for your review Victor.

--




[issue31979] Simplify converting non-ASCII strings to int, float and complex

2017-11-13 Thread Serhiy Storchaka

Change by Serhiy Storchaka:


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed




[issue31979] Simplify converting non-ASCII strings to int, float and complex

2017-11-13 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:


New changeset 9b6c60cbce4ac45e8ccd7934babff465e9769509 by Serhiy Storchaka in 
branch 'master':
bpo-31979: Simplify transforming decimals to ASCII (#4336)
https://github.com/python/cpython/commit/9b6c60cbce4ac45e8ccd7934babff465e9769509


--




[issue31979] Simplify converting non-ASCII strings to int, float and complex

2017-11-13 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

As a side effect it slightly optimizes parsing non-ASCII numbers.

$ ./python -m perf timeit --compare-to=./python0 'int("۱۲۳۴۵۶۷۸۹")' --duplicate 100
python0: . 277 ns +- 3 ns
python: . 225 ns +- 3 ns

Mean +- std dev: [python0] 277 ns +- 3 ns -> [python] 225 ns +- 3 ns: 1.23x faster (-19%)

$ ./python -m perf timeit --compare-to=./python0 'float("۱۲۳۴۵.۶۷۸۹")' --duplicate 100
python0: . 256 ns +- 1 ns
python: . 199 ns +- 2 ns

Mean +- std dev: [python0] 256 ns +- 1 ns -> [python] 199 ns +- 2 ns: 1.29x faster (-22%)

$ ./python -m perf timeit --compare-to=./python0 'complex("۱۲۳۴۵.۶۷۸۹j")' --duplicate 100
python0: . 298 ns +- 4 ns
python: . 235 ns +- 3 ns

Mean +- std dev: [python0] 298 ns +- 4 ns -> [python] 235 ns +- 3 ns: 1.27x faster (-21%)
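
For reference, the "x faster" and percentage figures reported above follow
directly from the two means. A small sketch of that arithmetic (speedup() is
a hypothetical helper name, not part of the perf module):

```python
def speedup(old_ns: float, new_ns: float) -> tuple[float, float]:
    """Return (ratio, percent change) for a timing that went from old to new."""
    ratio = old_ns / new_ns                        # e.g. 277 / 225
    change_pct = (new_ns - old_ns) / old_ns * 100  # negative means faster
    return ratio, change_pct


# Mean timings in nanoseconds, copied from the benchmark output above.
for name, old, new in [("int", 277, 225), ("float", 256, 199), ("complex", 298, 235)]:
    ratio, pct = speedup(old, new)
    print(f"{name}: {ratio:.2f}x faster ({pct:.0f}%)")
```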

--




[issue31979] Simplify converting non-ASCII strings to int, float and complex

2017-11-13 Thread STINNER Victor

STINNER Victor added the comment:

I don't think the weird behaviour justifies backporting this non-trivial 
change. I propose applying the change only in Python 3.7.

--




[issue31979] Simplify converting non-ASCII strings to int, float and complex

2017-11-08 Thread Serhiy Storchaka

Change by Serhiy Storchaka:


--
keywords: +patch
pull_requests: +4290
stage:  -> patch review




[issue31979] Simplify converting non-ASCII strings to int, float and complex

2017-11-08 Thread Serhiy Storchaka

New submission from Serhiy Storchaka:

This issue is inspired by the tweet 
https://twitter.com/dabeaz/status/925787482515533830

>>> a = 'n'
>>> b = 'ñ'
>>> sys.getsizeof(a)
50
>>> sys.getsizeof(b)
74
>>> float(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: 'ñ'
>>> sys.getsizeof(b)
77

See also a discussion on Python-list 
(https://mail.python.org/pipermail/python-list/2017-November/728149.html) and 
Stack Overflow 
(https://stackoverflow.com/questions/47062184/why-does-the-size-of-this-python-string-change-on-a-failed-int-conversion).


When you convert a non-ASCII string that contains no non-ASCII decimal digits 
or spaces, the conversion fails, but it also changes the size of the input 
string, because an internal UTF-8 representation is created.

This can look surprising to beginners, but there is nothing wrong here. There 
are many cases in which an internal UTF-8 representation is created implicitly.
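
The three-byte growth in the session quoted from the tweet (74 -> 77) is 
consistent with a cached UTF-8 copy of the string plus a terminating NUL byte. 
A minimal sketch of that arithmetic, assuming CPython accounts for the cache 
as "encoded length + 1" (an implementation detail, not a documented 
guarantee); utf8_cache_overhead() is a hypothetical helper name:

```python
def utf8_cache_overhead(s: str) -> int:
    """Bytes a cached UTF-8 copy of s would add to the string object.

    Assumption: the cache is the UTF-8 encoding plus one NUL terminator,
    which is how CPython appears to report it through sys.getsizeof().
    """
    return len(s.encode("utf-8")) + 1


# 'ñ' (U+00F1) takes 2 bytes in UTF-8, so the cache would add 2 + 1 bytes.
print(utf8_cache_overhead("ñ"))
```

Pure ASCII strings should not grow this way, since their internal 
representation already is valid UTF-8 and no separate copy is needed.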

But looking at the code, I found that it is too complicated. The parsers for 
int, float and complex call _PyUnicode_TransformDecimalAndSpaceToASCII(), which 
transforms non-ASCII decimal digits and spaces to ASCII. This function uses 
the general function fixup(), which takes a transformation function, creates a 
new string object and applies the transformation. It then checks whether the 
transformation produced a string with a larger maximal character code than the 
original string; in that case it creates a new string object and repeats the 
transformation. Finally, if the resulting string is equal to the original 
string, it destroys the resulting string and returns the original one. In the 
past fixup() was used for implementing methods like upper(). But now 
_PyUnicode_TransformDecimalAndSpaceToASCII() is the only user of fixup(), and 
it doesn't need all the complicated logic of fixup(). For example, this 
transformation never produces wider strings.

The proposed PR simplifies the code by getting rid of fixup() and 
fix_decimal_and_space_to_ascii(). The semantics of 
_PyUnicode_TransformDecimalAndSpaceToASCII() have changed: it now always 
produces an ASCII string (this also simplifies the call sites). If a non-ASCII 
character that is neither a decimal digit nor a space is encountered, the rest 
of the string is replaced with '?', which causes an error in the parsers.
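
The new semantics can be modelled in pure Python. This is only an 
illustrative sketch, not the C implementation: to_ascii_model() is a 
hypothetical name, and reading "the rest of the string is replaced with '?'" 
as "emit a single '?' and stop" is an assumption.

```python
import unicodedata


def to_ascii_model(s: str) -> str:
    """Rough model of the patched transformation: the result is always ASCII."""
    out = []
    for ch in s:
        if ord(ch) < 128:
            out.append(ch)        # ASCII characters are copied unchanged
        elif ch.isspace():
            out.append(" ")       # any non-ASCII whitespace becomes a space
        else:
            digit = unicodedata.decimal(ch, -1)
            if digit < 0:
                # Neither a digit nor a space: assumed to end the output
                # with one '?', which the numeric parsers then reject.
                out.append("?")
                break
            out.append(chr(ord("0") + digit))
    return "".join(out)


print(to_ascii_model("۱۲۳۴۵.۶۷۸۹"))  # digits from the benchmark strings above
print(to_ascii_model("ñ"))
```

Because the output never contains a character wider than the input, a 
transformation like this has no need for the "retry with a wider string" step 
that fixup() provided.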

The only visible behavior change (apart from no longer changing the size of 
the original string) is that float() and complex() now raise ValueError 
instead of UnicodeEncodeError when the string contains lone surrogates (the 
same as for other malformed strings). int() already had a special case for 
this and raised ValueError.

Unpatched:

>>> int('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\ud800'
>>> float('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
>>> complex('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

Patched:

>>> int('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\ud800'
>>> float('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: '\ud800'
>>> complex('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: complex() arg is a malformed string
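
Since UnicodeEncodeError is a subclass of ValueError, code that already 
catches ValueError should behave the same before and after the patch. A small 
sketch (parse_float() is a hypothetical helper, not part of the PR):

```python
def parse_float(text, default=None):
    """Return float(text), or default if text is not a valid number."""
    try:
        return float(text)
    except ValueError:
        # Covers the patched ValueError as well as the unpatched
        # UnicodeEncodeError, which derives from ValueError.
        return default


print(parse_float("۱۲۳۴۵.۶۷۸۹"))  # non-ASCII decimal digits are accepted
print(parse_float("\ud800"))      # a lone surrogate falls back to the default
```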

This PR saves around 80 lines of code.

--
assignee: serhiy.storchaka
components: Unicode
messages: 305824
nosy: ezio.melotti, haypo, serhiy.storchaka
priority: normal
severity: normal
status: open
title: Simplify converting non-ASCII strings to int, float and complex
type: enhancement
versions: Python 3.7
