[issue25275] Documentation v/s behaviour mismatch wrt integer literals containing non-ASCII characters

2015-09-30 Thread Shreevatsa R

Shreevatsa R added the comment:

About the mismatch: it's probably not a good idea to change the parser (so 
that typing १२३४ in Python 3 code would be like typing 1234), but how about 
changing the behaviour of int()? I'm not sure anyone should be relying on 
int(u'१२३४') being 1234, given that it is not documented as such.
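
To make the parser/int() difference concrete, here is a small sketch, assuming Python 3. The variable names are mine and only for illustration; the same four characters are handed once to int() and once to the compiler:

```python
text = '१२३४'  # DEVANAGARI DIGIT ONE through FOUR

# int() converts the string at run time and accepts these digits.
value = int(text)

# The compiler sees the same characters as source code and rejects them.
try:
    compile(text, '<example>', 'eval')
    parsed_as_literal = True
except SyntaxError:
    parsed_as_literal = False

print(value, parsed_as_literal)
```

So any change to int() would be a change to run-time string conversion only; the parser already refuses these characters.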

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue25275>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25275] Documentation v/s behaviour mismatch wrt integer literals containing non-ASCII characters

2015-09-30 Thread Shreevatsa R

Shreevatsa R added the comment:

Minor difference, but the relevant function for int() is not quite isdigit(), 
e.g.:

>>> import unicodedata
>>> s = u'\u2460'
>>> unicodedata.name(s)
'CIRCLED DIGIT ONE'
>>> print s
①
>>> s.isdigit()
True
>>> s.isdecimal()
False
>>> int(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character u'\u2460' in 
position 0: invalid decimal Unicode string

It seems to be isdecimal(), except that when the string contains at least one 
digit, leading and trailing space-like characters are also allowed (e.g. 5760 
OGHAM SPACE MARK, 8195 EM SPACE, or 12288 IDEOGRAPHIC SPACE):

>>> 987 == int(u'\u3000\n 987\u1680\t')
True

--




[issue25275] Documentation v/s behaviour mismatch wrt integer literals containing non-ASCII characters

2015-09-29 Thread Shreevatsa R

New submission from Shreevatsa R:

Summary: This is about int(u'१२३४') == 1234.

At https://docs.python.org/2/library/functions.html and also 
https://docs.python.org/3/library/functions.html the documentation for 

 class int(x=0)
 class int(x, base=10)

says (respectively):

> If x is not a number or if base is given, then x must be a string or Unicode 
> object representing an integer literal in radix base.

> If x is not a number or if base is given, then x must be a string, bytes, or 
> bytearray instance representing an integer literal in radix base.

If you follow the definition of "integer literal" into the reference 
(https://docs.python.org/2/reference/lexical_analysis.html#integers and 
https://docs.python.org/3/reference/lexical_analysis.html#integers 
respectively), the definitions ultimately involve

 nonzerodigit ::=  "1"..."9"
 octdigit     ::=  "0"..."7"
 bindigit     ::=  "0" | "1"
 digit        ::=  "0"..."9"

So it looks like whether the behaviour of int() conforms to its documentation 
hinges on what "representing" means. Apparently it is some definition under 
which u'१२३४' represents the integer literal 1234, but it would be great to 
either clarify the documentation of int() or change its behaviour.
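
For comparison, a conversion that follows the quoted grammar literally might look like the sketch below. strict_int is a hypothetical helper, not anything in the standard library, and it assumes the Python 3 decimal-literal rule (nonzerodigit digit* | "0"+) without underscore separators:

```python
import re

# Decimal integer literal per the Python 3 lexical grammar, ASCII digits only.
_ASCII_DECIMAL = re.compile(r'[1-9][0-9]*|0+')


def strict_int(text):
    """Hypothetical helper: accept only ASCII decimal integer literals."""
    if _ASCII_DECIMAL.fullmatch(text) is None:
        raise ValueError('not an ASCII decimal integer literal: %r' % (text,))
    return int(text)


print(strict_int('1234'))
```

Under this reading, strict_int('१२३४'), strict_int(' 12') and strict_int('012') would all raise ValueError, whereas int() accepts all three, so the gap between "integer literal" and what int() accepts is wider than non-ASCII digits alone.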

--
assignee: docs@python
components: Documentation, Interpreter Core, Unicode
messages: 251915
nosy: docs@python, ezio.melotti, haypo, shreevatsa
priority: normal
severity: normal
status: open
title: Documentation v/s behaviour mismatch wrt integer literals containing 
non-ASCII characters
