[issue10557] Malformed error message from float()

2013-02-10 Thread Mark Dickinson

Mark Dickinson added the comment:

Closing.

--
status: pending -> closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10557
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue10557] Malformed error message from float()

2013-02-05 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
status: open -> pending




[issue10557] Malformed error message from float()

2013-02-05 Thread Mark Dickinson

Mark Dickinson added the comment:

Sure, this can be closed as far as I'm concerned.

--
status: pending -> open




[issue10557] Malformed error message from float()

2013-02-05 Thread Mark Dickinson

Changes by Mark Dickinson dicki...@gmail.com:


--
status: open -> pending




[issue10557] Malformed error message from float()

2013-01-04 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Can this issue be closed?

--
nosy: +serhiy.storchaka




[issue10557] Malformed error message from float()

2011-11-21 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@haypocalc.com:


--
versions: +Python 3.3 -Python 3.1




[issue10557] Malformed error message from float()

2010-12-20 Thread Alexander Belopolsky

Changes by Alexander Belopolsky belopol...@users.sourceforge.net:


--
versions: +Python 3.1




[issue10557] Malformed error message from float()

2010-12-04 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

Looks okay, I guess.

I don't much like the extra boilerplate that's introduced (and repeated) in 
longobject.c, floatobject.c and complexobject.c, though.

--




[issue10557] Malformed error message from float()

2010-12-04 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sat, Dec 4, 2010 at 6:03 AM, Mark Dickinson rep...@bugs.python.org wrote:
..
> I don't much like the extra boilerplate that's introduced (and repeated)
> in longobject.c, floatobject.c and complexobject.c, though.


Yes, that's exactly what I meant when I called that code
repetitious.  Maybe we can tighten this up during the beta period.
What do you think about adding number parsers that operate directly on
Py_UNICODE* strings?

--




[issue10557] Malformed error message from float()

2010-12-04 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

> What do you think about adding number parsers that operate directly on
> Py_UNICODE* strings?

I think that might make some sense.  It's not without difficulties, though.
One issue is that we'd still need the char* -> double operations, partly
because PyOS_string_to_double is part of the public API, and partly to continue
to support creation of a float from a bytes instance.

The other issue is that for floats, it's difficult to separate the parser from 
the base conversion;  to be useful, we'd probably end up making the whole of 
dtoa.c Py_UNICODE aware.  (One of the return values from the dtoa.c parser is a 
pointer to the significant digits in the original input string;  so the 
base-conversion calculation itself needs access to portions of the original 
string.)

Ideally, for float(string), we'd have a zero-copy setup that operated directly
on the unicode input (read-only);  but I think that achieving that right now is
going to be messy, and involve dtoa.c knowing far more about Unicode than I'd
be comfortable with.

N.B. If we didn't have to deal with alternative digits, it *really* would be 
much simpler.

Perhaps a compromise option is available, that does a preliminary pass on the 
Unicode string and only makes a copy if non-European digits are discovered.
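
A minimal Python sketch of the compromise described above, not the C code under discussion: a read-only preliminary pass decides whether any non-ASCII decimal digit is present, and a converted copy is made only in that case. The helper names has_non_ascii_decimal() and digits_to_ascii() are hypothetical and exist only for this illustration.

```python
import unicodedata


def has_non_ascii_decimal(text):
    """Read-only preliminary pass: is there a decimal digit outside ASCII?"""
    return any(
        ord(ch) > 127 and unicodedata.decimal(ch, None) is not None
        for ch in text
    )


def digits_to_ascii(text):
    """Return text itself when no copy is needed, else a converted copy."""
    if not has_non_ascii_decimal(text):
        return text  # zero-copy path: the common all-ASCII case
    return "".join(
        ch if unicodedata.decimal(ch, None) is None
        else str(unicodedata.decimal(ch))
        for ch in text
    )


plain = "3.14"
print(digits_to_ascii(plain) is plain)         # same object, nothing copied
print(digits_to_ascii("\u0663.\u0661\u0664"))  # Arabic-Indic digits converted
```

The cost of the extra scan is paid on every call, so whether this wins in practice depends on how rare non-ASCII digits are in the input.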

--




[issue10557] Malformed error message from float()

2010-12-04 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sat, Dec 4, 2010 at 3:11 PM, Mark Dickinson rep...@bugs.python.org wrote:

> Mark Dickinson dicki...@gmail.com added the comment:
> .. One issue is that we'd still need the char* -> double operations, partly because
> PyOS_string_to_double is part of the public API, and partly to continue to support
> creation of a float from a bytes instance.


I thought about it.  I see two solutions:

1. Retain PyOS_string_to_double unchanged and add PyOS_unicode_to_double.
2. Replace PyOS_string_to_double with UTF-8 decode result passed to
PyOS_unicode_to_double.

> The other issue is that for floats, it's difficult to separate the parser from the base
> conversion;  to be useful, we'd probably end up making the whole of dtoa.c
> Py_UNICODE aware.

That's what I had in mind.  Naively, it looks like we just need to
replace the char type with Py_UNICODE in several places, assuming exotic
digit conversion is still handled separately.

> (One of the return values from the dtoa.c parser is a pointer to the significant digits
> in the original input string;  so the base-conversion calculation itself needs access
> to portions of the original string.)


Maybe we should start with int().  It is simpler, but will probably reveal
some of the same difficulties as float().

> Ideally, for float(string), we'd have a zero-copy setup that operated directly on the
> unicode input (read-only);  but I think that achieving that right now is going to be
> messy, and involve dtoa.c knowing far more about Unicode than I'd be comfortable
> with.


This is clearly a 3.3-ish project.  Hopefully in time people will
realize that decimal digits are just [0-9] and numeric experts will
not be required to know about Unicode beyond 127th code point. :-)

> N.B. If we didn't have to deal with alternative digits, it *really* would be
> much simpler.


We still don't.  I've already separated this out and we can keep it
this way as long as people are willing to pay the price for
alternative digits' support.

One thing we may improve is to fail earlier on non-digits in
PyUnicode_TransformDecimalToASCII() to speed up not uncommon code like
this:

for line in f:
    try:
        n = int(line)
    except ValueError:
        pass
    ...
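
The pattern above, written out as a small runnable sketch. The function name parse_ints() and the sample input are made up for the illustration; the point is only that every non-numeric line takes the ValueError path, which is why failing early on non-digits would matter for this kind of code.

```python
import io


def parse_ints(lines):
    """Collect the integers found in lines, skipping non-numeric lines."""
    numbers = []
    for line in lines:
        try:
            numbers.append(int(line))
        except ValueError:
            continue  # the common "not a number" case
    return numbers


# A file-like object standing in for f; the last line holds an
# Arabic-Indic digit three.
f = io.StringIO("10\nnot a number\n 20 \n\u0663\n")
print(parse_ints(f))
```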

> Perhaps a compromise option is available, that does a preliminary pass on the
> Unicode string and only makes a copy if non-European digits are discovered.

Hmm.  That would require changing the signature of
PyUnicode_TransformDecimalToASCII() to take PyObject* instead of the
buffer.  I knew we shouldn't have rushed to make it public.  We can
still do it in longobject.c and friends' boilerplate.

--




[issue10557] Malformed error message from float()

2010-12-03 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@haypocalc.com:


--
nosy:  -haypo




[issue10557] Malformed error message from float()

2010-12-03 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

> Are you sure ? I'm not sure how the underlying PyOS_string_to_double()
> (IIRC) works.

I believe it accepts ASCII whitespace (i.e., chars ' ', '\t', '\f', '\n', '\r', 
'\v'), and nothing else.

--




[issue10557] Malformed error message from float()

2010-12-03 Thread Eric Smith

Eric Smith e...@trueblade.com added the comment:

According to comments in the code and verified by inspection, 
PyOS_string_to_double does not accept any whitespace.

--




[issue10557] Malformed error message from float()

2010-12-03 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

> According to comments in the code and verified by inspection,
> PyOS_string_to_double does not accept any whitespace.

Bah.  You're right, of course.  :-)

Any whitespace (post PyUnicode_EncodeDecimal) is handled in PyFloat_FromString, 
using Py_ISSPACE.
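
Seen from Python, the observable behaviour can be checked with a short sketch like the one below. It only exercises float() itself and says nothing about which C function does the stripping; the helper name rejects() is made up for the illustration.

```python
# The six ASCII whitespace characters listed earlier in the thread.
ASCII_WHITESPACE = " \t\f\n\r\v"

# Leading and trailing whitespace is accepted by float().
padded = ASCII_WHITESPACE + "1.5" + ASCII_WHITESPACE
print(float(padded))


def rejects(text):
    """Return True when float() refuses text with a ValueError."""
    try:
        float(text)
    except ValueError:
        return True
    return False


# Whitespace inside the number is not accepted.
print(rejects("1 .5"))
```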

--




[issue10557] Malformed error message from float()

2010-12-03 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Fri, Dec 3, 2010 at 4:45 AM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
..
>> On Thu, Dec 2, 2010 at 4:34 PM, Marc-Andre Lemburg
>> rep...@bugs.python.org wrote:
>> ..
>>>  * Please change the API _PyUnicode_NormalizeDecimal() to
>>>    PyUnicode_ConvertToASCIIDecimal() - that's closer to what
>>>    it does.
>>
>>
>> Are you sure it is a good idea to give it a public name?  I have no
>> problem with calling it _PyUnicode_ConvertToASCIIDecimal().
>> (Transform may be a better term, though.)
>
> Yes, I think it's useful to have as public API.


I am afraid this means it has to go in today.  I'll take the whitespace
processing out of the public method and try to get the patch
ready for commit.

--




[issue10557] Malformed error message from float()

2010-12-03 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Thu, Dec 2, 2010 at 9:53 PM, Alexander Belopolsky
rep...@bugs.python.org wrote:
..
> .. The honest reason for the exclusion is that I gave up chasing a bug that only shows
> in full regrtest runs.

I have realized where the problem was. PyUnicode_FromUnicode()
helpfully interns single-character  Unicode objects in the Latin-1
range.  So when TransformDecimal replaces whitespace with ' ', it may
garble cached strings.  I think this optimization is a left-over from
the time when unicode did not have interning, but it is never a good
idea to change immutable objects in-place after creation, so I'll fix
this problem in my code.

--




[issue10557] Malformed error message from float()

2010-12-03 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Hopefully this is the last iteration before commit.  As discussed, I took 
whitespace processing out of PyUnicode_TransformDecimalToASCII() and made it 
public.  Whitespace conversion in int()/float()/complex() is repetitious and
can be optimized, for example, by converting only leading and trailing
whitespace.  I erred on the side of correctness here and real optimization will
come from making conversion algorithms operate directly on Py_UNICODE 
characters.  This looks like a relatively easy thing to do, but is definitely 
outside of the scope of this issue.

--
Added file: http://bugs.python.org/file19931/issue10557d.diff




[issue10557] Malformed error message from float()

2010-12-03 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Committed in revision 87007.  As a bug fix, this needs to be backported to 3.1, 
but PyUnicode_TransformDecimalToASCII() should probably be renamed to 
_PyUnicode_TransformDecimalToASCII() to avoid introduction of a new feature.

--
resolution:  -> fixed




[issue10557] Malformed error message from float()

2010-12-02 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

I am submitting a patch (issue10557b.diff) for commit review.  As Marc 
suggested, decimal conversion is now performed on Py_UNICODE characters. For 
this purpose, I introduced _PyUnicode_NormalizeDecimal() function that takes 
Py_UNICODE and returns a PyUnicode object with whitespace stripped and 
non-ASCII digits converted to ASCII equivalents.  The PyUnicode_EncodeDecimal() 
function is no longer used and I added a comment recommending that 
_PyUnicode_NormalizeDecimal() be used instead. I would like to eventually 
remove PyUnicode_EncodeDecimal(), but I am not sure about the proper
deprecation procedures for undocumented C APIs.

As a result, int(), float(), etc will no longer raise UnicodeDecodeError unless 
given a string with lone surrogates.  (This error comes from UTF-8 codec that 
is applied after digit normalization.)

A few error cases such as embedded '\0' and non-digit characters with
ord(c) > 255 will now raise ValueError instead of UnicodeDecodeError.  Since
UnicodeDecodeError is a subclass of ValueError, it is unlikely that existing 
code would attempt to differentiate between the two.  It is possible to achieve 
complete compatibility, but it is hard to justify reporting different error 
types on non-digit characters below and above code point 255.
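
The compatibility argument rests on the exception hierarchy, which a short Python sketch can make concrete. The wrapper to_float() is a hypothetical example of existing caller code; because it catches ValueError, it behaves the same whichever of the two exception types the conversion raises.

```python
# Both Unicode error types derive from ValueError (via UnicodeError).
print(issubclass(UnicodeDecodeError, ValueError))
print(issubclass(UnicodeEncodeError, ValueError))


def to_float(text, default=None):
    """Typical caller code: any conversion failure yields the default."""
    try:
        return float(text)
    except ValueError:
        return default


print(to_float("1.5"))
print(to_float("1.5\u1234"))  # non-digit character above code point 255
print(to_float("1\x005"))     # embedded NUL
```

Only code that catches UnicodeDecodeError specifically, without also handling ValueError, would notice the change.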

The patch contains tests for error messages that I tried to make robust by only 
requiring that s.strip() be found somewhere in the error message from int(s).  
Note that since in this patch whitespace is stripped before the string is 
passed to the parser, the parser errors do not contain the whitespace.  This 
may actually be desirable because it helps the user to see the source of the 
error without being distracted by irrelevant white space.
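
A rough Python sketch of the kind of check described above, not the actual test code from the patch. The helper names conversion_error() and message_mentions_input() are made up; the check only requires that the stripped input appear somewhere in the error message, so it holds whether or not the message keeps the surrounding whitespace.

```python
def conversion_error(func, text):
    """Return the ValueError message produced by func(text)."""
    try:
        func(text)
    except ValueError as exc:
        return str(exc)
    raise AssertionError(f"{func.__name__}({text!r}) unexpectedly succeeded")


def message_mentions_input(func, text):
    """Robust check: only the stripped input must appear in the message."""
    return text.strip() in conversion_error(func, text)


print(message_mentions_input(int, "  12x  "))
print(message_mentions_input(float, "\t1.5x\n"))
```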

--
assignee:  -> belopolsky
stage: unit test needed -> commit review




[issue10557] Malformed error message from float()

2010-12-02 Thread Alexander Belopolsky

Changes by Alexander Belopolsky belopol...@users.sourceforge.net:


Added file: http://bugs.python.org/file19907/issue10557b.diff




[issue10557] Malformed error message from float()

2010-12-02 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

Is the stripping of whitespace necessary for this fix?

Currently, the complex constructor accepts whitespace both inside and outside 
the (optional) parentheses:

>>> complex(' ( 2+3j ) ')
(2+3j)

The classes of whitespace accepted in each position are the same.  IIUC, with 
your patch, that consistency would be lost---is that right?

If the whitespace stripping isn't necessary then I'd prefer to leave that 
change for another issue.

--




[issue10557] Malformed error message from float()

2010-12-02 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

Just to clarify:  I'm not opposed to allowing arbitrary Unicode whitespace in 
the float, int, complex constructors (indeed, it's probably a good thing).  But 
I'd like to see the change made consistently;  for the complex constructor this 
looks a bit involved, so it would probably be cleaner to have a separate patch 
to make the whitespace changes.

--




[issue10557] Malformed error message from float()

2010-12-02 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Thu, Dec 2, 2010 at 12:54 PM, Mark Dickinson rep...@bugs.python.org wrote:
..
> The classes of whitespace accepted in each position are the same.  IIUC, with your patch,
> that consistency would be lost---is that right?

Good point. I thought that PyUnicode_EncodeDecimal() was stripping the
space, but it was converting it to ASCII ' ' instead.  That's easy to
fix.   Can you suggest a test case?

--




[issue10557] Malformed error message from float()

2010-12-02 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Thu, Dec 2, 2010 at 1:28 PM, Alexander Belopolsky
belopol...@users.sourceforge.net wrote:
..
> Can you suggest a test case?

I mean for complex().

--




[issue10557] Malformed error message from float()

2010-12-02 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

Ah yes, you're right: this shouldn't be a hard fix. I withdraw my suggestion 
for a separate patch.  :-)

Checking that:

  complex('\xa0(\xa02+3j\xa0)\xa0') == complex(2.0, 3.0)

would probably be enough.
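
Spelled out as a standalone snippet, the suggested check looks like this. It is a sketch for trying the behaviour in an interpreter, not the test that went into the patch; the variable names are made up.

```python
NBSP = "\xa0"  # NO-BREAK SPACE, which str.isspace() treats as whitespace
print(NBSP.isspace())

# Non-ASCII whitespace both outside and inside the parentheses.
z = complex(NBSP + "(" + NBSP + "2+3j" + NBSP + ")" + NBSP)
print(z == complex(2.0, 3.0))

# The plain ASCII form from earlier in the thread, for comparison.
print(complex(" ( 2+3j ) ") == complex(2.0, 3.0))
```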

--




[issue10557] Malformed error message from float()

2010-12-02 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
 
> Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
>
> I am submitting a patch (issue10557b.diff) for commit review.  As Marc
> suggested, decimal conversion is now performed on Py_UNICODE characters. For
> this purpose, I introduced _PyUnicode_NormalizeDecimal() function that takes
> Py_UNICODE and returns a PyUnicode object with whitespace stripped and
> non-ASCII digits converted to ASCII equivalents.  The
> PyUnicode_EncodeDecimal() function is no longer used and I added a comment
> recommending that _PyUnicode_NormalizeDecimal() be used instead. I would like
> to eventually remove PyUnicode_EncodeDecimal(), but I am not sure about the
> proper deprecation procedures for undocumented C APIs.
>
> As a result, int(), float(), etc will no longer raise UnicodeDecodeError
> unless given a string with lone surrogates.  (This error comes from UTF-8
> codec that is applied after digit normalization.)
>
> A few error cases such as embedded '\0' and non-digit characters with
> ord(c) > 255 will now raise ValueError instead of UnicodeDecodeError.  Since
> UnicodeDecodeError is a subclass of ValueError, it is unlikely that existing
> code would attempt to differentiate between the two.  It is possible to
> achieve complete compatibility, but it is hard to justify reporting different
> error types on non-digit characters below and above code point 255.
>
> The patch contains tests for error messages that I tried to make robust by
> only requiring that s.strip() be found somewhere in the error message from
> int(s).  Note that since in this patch whitespace is stripped before the
> string is passed to the parser, the parser errors do not contain the
> whitespace.  This may actually be desirable because it helps the user to see
> the source of the error without being distracted by irrelevant white space.

Thanks for the patch. I've had a quick look...

Some comments:

 * Please change the API _PyUnicode_NormalizeDecimal() to
   PyUnicode_ConvertToASCIIDecimal() - that's closer to what
   it does.

 * Don't have the API remove any whitespace. It should just
   work on decimal digit code points (changing the length
   of the Unicode string is a bad idea).

 * Please remove the note "This function is no longer used.
   Use _PyUnicode_NormalizeDecimal instead." from the
   PyUnicode_EncodeDecimal() API description in the
   header file. The API won't go away (it does have its
   use and is being used in 3rd party extensions) and
   you cannot guide people to use a Python private API.

 * Please double check the ref counts. I think you have a leak
   in PyLong_FromUnicode() (for norm) and possibly in other
   functions as well.

--




[issue10557] Malformed error message from float()

2010-12-02 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Thu, Dec 2, 2010 at 4:34 PM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
..
>  * Please change the API _PyUnicode_NormalizeDecimal() to
>    PyUnicode_ConvertToASCIIDecimal() - that's closer to what
>    it does.


Are you sure it is a good idea to give it a public name?  I have no
problem with calling it _PyUnicode_ConvertToASCIIDecimal().
(Transform may be a better term, though.)

>  * Don't have the API remove any whitespace. It should just
>    work on decimal digit code points (changing the length
>    of the Unicode string is a bad idea).


Yes, that was a bad idea, but the old EncodeDecimal was replacing all
Unicode space with ASCII ' '.  It will be hard to replicate old
behavior without doing the same in  ConvertToASCIIDecimal().

>  * Please remove the note "This function is no longer used.
>    Use _PyUnicode_NormalizeDecimal instead." from the
>    PyUnicode_EncodeDecimal() API description in the
>    header file. The API won't go away (it does have its
>    use and is being used in 3rd party extensions) and
>    you cannot guide people to use a Python private API.


OK.  I had the same reservations about recommending private API.

>  * Please double check the ref counts. I think you have a leak
>    in PyLong_FromUnicode() (for norm) and possibly in other
>    functions as well.

Will do.  I should also add some more tests for error conditions.  I
test for leaks, but if the error branch is not covered, it is not
covered.

--




[issue10557] Malformed error message from float()

2010-12-02 Thread Stefan Krah

Stefan Krah stefan-use...@bytereef.org added the comment:

Alexander Belopolsky rep...@bugs.python.org wrote:
> On Thu, Dec 2, 2010 at 4:34 PM, Marc-Andre Lemburg
> rep...@bugs.python.org wrote:
> ..
>>  * Please change the API _PyUnicode_NormalizeDecimal() to
>>    PyUnicode_ConvertToASCIIDecimal() - that's closer to what
>>    it does.
>
>
> Are you sure it is a good idea to give it a public name?  I have no
> problem with calling it _PyUnicode_ConvertToASCIIDecimal().
> (Transform may be a better term, though.)

I like the public name. Extension authors can use it and be sure that
their programs accept exactly the same numeric strings as the rest of
Python.

Are you worried that the semantics might change? If they do, I would
actually welcome having an official transformation function that
automatically follows the current preferences of python-dev (or the
Unicode Consortium).

--




[issue10557] Malformed error message from float()

2010-12-02 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Thu, Dec 2, 2010 at 6:32 PM, Stefan Krah rep...@bugs.python.org wrote:
..
> I like the public name. Extension authors can use it and be sure that
> their programs accept exactly the same numeric strings as the rest of
> Python.
>
> Are you worried that the semantics might change?

Yes, I am already working on a change that instead of stripping
whitespace will replace it with ASCII space.  I don't think that a
public TransformDecimalToASCII() function should care about space, but
in order to support current semantics without making an extra copy, I
need to do all replacements together.  (And of course I would not want
to publicly expose a function that modifies the string in-place.)

> If they do, I would
> actually welcome to have an official transformation function that
> automatically follows the current preferences of python-dev (or the
> Unicode Consortium).


I don't think either group has clearly articulated their preferences
with respect to such algorithm.

--




[issue10557] Malformed error message from float()

2010-12-02 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

I am submitting a new patch that excludes int() changes.  The honest reason for 
the exclusion is that I gave up chasing a bug that only shows in full regrtest 
runs.  (Marc, I don't think it is related to what you thought was a missing 
norm decref: that place had result = NULL, not return NULL, so the control was 
falling through to the decref after the if statement.)  Nevertheless, I think 
it makes sense to proceed with smaller steps.  As Marc suggested, 
PyUnicode_EncodeDecimal() API will stay indefinitely, so there is no urge to 
remove its use.  There are no actual bugs associated with int(), so technically 
it does not belong to this issue.

--
Added file: http://bugs.python.org/file19918/issue10557c.diff




[issue10557] Malformed error message from float()

2010-11-29 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
 
> Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
>
> After a bit of svn archeology, it does appear that Arabic-Indic digits'
> support was deliberate at least in the sense that the feature was tested for
> when the code was first committed. See r15000.

As I mentioned on python-dev 
(http://mail.python.org/pipermail/python-dev/2010-November/106077.html)
this support was added intentionally.

> The test migrated from file to file over the last 10 years, but it is still
> present in test_float.py:
>
>     self.assertEqual(float(b"  \u0663.\u0661\u0664  ".decode('raw-unicode-escape')), 3.14)
>
> (It should probably be now rewritten using a string literal.)
>
> I am now attaching the patch (issue10557.diff) that fixes the bug without
> sacrificing non-ASCII digit support.
> If this approach is well-received, I would like to replace all calls to
> PyUnicode_EncodeDecimal() with the calls to the new
> _PyUnicode_EncodeDecimalUTF8() and deprecate Latin-1-oriented
> PyUnicode_EncodeDecimal().

It would be better to copy and iterate over the Unicode string first,
replacing any decimal code points with ASCII ones and then call the
UTF-8 encoder.
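
In Python terms, the suggested order of operations might look like the sketch below: map decimal code points to ASCII on a copy of the string first, then run the ordinary UTF-8 encoder once. The function name transform_decimal_to_ascii() is hypothetical and merely echoes the naming discussed in this thread; it is not the C function itself, and it leaves whitespace alone so the string length never changes.

```python
import unicodedata


def transform_decimal_to_ascii(text):
    """Copy text, replacing every decimal digit with its ASCII equivalent."""
    chars = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digits
        chars.append(ch if value is None else chr(ord("0") + value))
    return "".join(chars)


source = "  \u0663.\u0661\u0664  "  # Arabic-Indic digits with padding
encoded = transform_decimal_to_ascii(source).encode("utf-8")
print(encoded)
print(float(encoded))  # float() also accepts bytes
```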

The code as it stands is very inefficient, since it will most likely
run the memcpy() part for every code point after the first non-ASCII
decimal one.

> For the future, I note that starting with Unicode 6.0.0, the Unicode
> Consortium promises that
>
>     Characters with the property value Numeric_Type=de (Decimal) only occur in
>     contiguous ranges of 10 characters, with ascending numeric values from 0 to 9
>     (Numeric_Value=0..9).
>
> This makes it very easy to check a numeric string does not contain a mix of
> digits from different scripts.

I'm not sure why you'd want to check for such ranges.
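
For reference, the check being discussed could be sketched in Python as follows. It relies on the contiguity guarantee quoted above: subtracting a digit's value from its code point gives the zero of its range, so digits from one script share a single zero. The function names are hypothetical, and nothing in this thread says such a check was ever added to CPython.

```python
import unicodedata


def digit_zero_points(text):
    """Return the set of 'zero' code points of the digit ranges used in text."""
    zeros = set()
    for ch in text:
        value = unicodedata.decimal(ch, None)
        if value is not None:
            zeros.add(ord(ch) - value)  # start of the 10-character range
    return zeros


def mixes_digit_scripts(text):
    """True when text combines decimal digits from more than one range."""
    return len(digit_zero_points(text)) > 1


print(mixes_digit_scripts("3.14"))
print(mixes_digit_scripts("\u0663.\u0661\u0664"))
print(mixes_digit_scripts("3.\u0661\u0664"))
```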

> I still believe that proper API should require explicit choice of language or
> locale before allowing digits other than 0-9 just as int() would not accept
> hexadecimal digits without explicit choice of base = 16.  But this would be
> a subject of a feature request.

Since when do we require a locale or language to be specified when
using Unicode ?

The codecs, Unicode methods and other Unicode support features
happily work with all kinds of languages, mixed or not, without any
such specification.

--




[issue10557] Malformed error message from float()

2010-11-29 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Mon, Nov 29, 2010 at 4:41 AM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
..
> It would be better to copy and iterate over the Unicode string first,
> replacing any decimal code points with ASCII ones and then call the
> UTF-8 encoder.


Good idea.

 The code as it stands is very inefficient, since it will most likely
 run the memcpy() part for every code point after the first non-ASCII
 decimal one.


I doubt there are measurable gains from this optimization, but doing the
conversion on Unicode characters results in a cleaner API.  The new
patch, issue10557a.diff, implements
_PyUnicode_NormalizeDecimal(Py_UNICODE *s, Py_ssize_t length), which is
documented as follows:

/* Strip leading and trailing space and convert code points that have decimal
   digit property to the corresponding ASCII digit code point.

   Returns a new Unicode string on success, NULL on failure.
*/

Note that I used deprecated _PyUnicode_AsStringAndSize() in
floatobject.c not only because it is convenient, but also because I
believe that in the future numerical value parsers should be converted
to operate on unicode characters.  When this happens, the use of
_PyUnicode_AsStringAndSize() can be removed.
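
As a rough Python-level picture of what the C helper described above is
meant to do, the following sketch strips surrounding whitespace and maps
every code point with a decimal digit value to the matching ASCII digit.
The function name normalize_decimal() is only an illustration and does not
exist in CPython.

```python
import unicodedata


def normalize_decimal(text):
    """Strip surrounding whitespace and map decimal digits to ASCII digits.

    Characters without a decimal digit value (signs, '.', letters, fractions
    such as U+00BD) are passed through unchanged, so a later parser can still
    report them verbatim in its error message.
    """
    out = []
    for ch in text.strip():
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)


print(normalize_decimal("  \u0663.\u0661\u0664  "))  # Arabic-Indic 3.14
print(ascii(normalize_decimal("42\u00bd")))          # the fraction survives
```

The point of keeping non-digit characters intact is the one made in this
thread: the normalized string can be encoded losslessly (for example as
UTF-8) and the offending character still shows up in the ValueError text.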

--
Added file: http://bugs.python.org/file19872/issue10557a.diff

Index: Include/unicodeobject.h
===
--- Include/unicodeobject.h (revision 86843)
+++ Include/unicodeobject.h (working copy)
@@ -1173,6 +1173,17 @@
 const char *errors  /* error handling */
 );
 
+/* Strip leading and trailing space and convert code points that have decimal
+   digit property to the corresponding ASCII digit code point. 
+
+   Returns a new Unicode string on success, NULL on failure.
+*/
+
+PyAPI_FUNC(PyObject*) _PyUnicode_NormalizeDecimal(
+Py_UNICODE *s,  /* Unicode buffer */
+Py_ssize_t length   /* Number of Py_UNICODE chars to encode */
+);
+
 /* --- File system encoding -- */
 
 /* ParseTuple converter: encode str objects to bytes using
Index: Objects/unicodeobject.c
===
--- Objects/unicodeobject.c (revision 86843)
+++ Objects/unicodeobject.c (working copy)
@@ -6207,6 +6207,40 @@
 return NULL;
 }
 
+PyObject *
+_PyUnicode_NormalizeDecimal(Py_UNICODE *s,
+                            Py_ssize_t length)
+{
+    PyObject *result;
+    Py_UNICODE *p; /* write pointer into result */
+    const Py_UNICODE *end = s + length;
+    Py_ssize_t i;
+    /* Strip whitespace */
+    while (s < end) {
+        if (Py_UNICODE_ISSPACE(*s))
+            s++;
+        else if (Py_UNICODE_ISSPACE(end[-1]))
+            end--;
+        else
+            break;
+    }
+    length = end - s;
+    /* Copy to a new string */
+    result = PyUnicode_FromUnicode(s, length);
+    if (result == NULL)
+        return result;
+    p = PyUnicode_AS_UNICODE(result);
+    /* Iterate over code points */
+    for (i = 0; i < length; i++) {
+        Py_UNICODE ch = p[i];
+        if (!Py_ISDIGIT(ch)) {
+            int decimal = Py_UNICODE_TODECIMAL(ch);
+            if (decimal >= 0)
+                p[i] = '0' + decimal;
+        }
+    }
+    return result;
+}
 /* --- Decimal Encoder  */
 
 int PyUnicode_EncodeDecimal(Py_UNICODE *s,
Index: Objects/floatobject.c
===
--- Objects/floatobject.c   (revision 86843)
+++ Objects/floatobject.c   (working copy)
@@ -175,52 +175,53 @@
 {
     const char *s, *last, *end;
     double x;
-    char buffer[256]; /* for errors */
-    char *s_buffer = NULL;
+    PyObject *s_buffer = NULL;
     Py_ssize_t len;
     PyObject *result = NULL;
 
     if (PyUnicode_Check(v)) {
-        s_buffer = (char *)PyMem_MALLOC(PyUnicode_GET_SIZE(v)+1);
+        s_buffer = _PyUnicode_NormalizeDecimal(PyUnicode_AS_UNICODE(v),
+                                               PyUnicode_GET_SIZE(v));
         if (s_buffer == NULL)
-            return PyErr_NoMemory();
-        if (PyUnicode_EncodeDecimal(PyUnicode_AS_UNICODE(v),
-                                    PyUnicode_GET_SIZE(v),
-                                    s_buffer,
-                                    NULL))
-            goto error;
-        s = s_buffer;
-        len = strlen(s);
+            return NULL;
+        s = _PyUnicode_AsStringAndSize(s_buffer, &len);
+        if (s == NULL)
+            return NULL;
+        last = s + len;
     }
     else if (PyObject_AsCharBuffer(v, &s, &len)) {
         PyErr_SetString(PyExc_TypeError,
                         "float() argument must be a string or a number");
         return NULL;
     }
-

[issue10557] Malformed error message from float()

2010-11-28 Thread Eric Smith

Changes by Eric Smith e...@trueblade.com:


--
nosy: +eric.smith




[issue10557] Malformed error message from float()

2010-11-28 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

> I am not sure, whether support for non-ascii digits in float()
> constructor is worth maintaining.

I'd be very happy to drop such support.  If you allow alternative digit sets,
trying to work out exactly what should be supported and what shouldn't gets
very messy;  it's even worse for int(), where bases > 10 have to be taken into
account.

> (Anyone knows whether Arabic numbers are written right to left
> or left to right?  What is the proper decimal point character?)

Well, judging by the chocolate packaging I saw recently, they're written left 
to right (so presumably if you're reading right-to-left, you see the units 
first, then the tens, etc., which always struck me as the 'right' way to write 
things in the first place :-).  No idea about the proper decimal point 
character, though.




[issue10557] Malformed error message from float()

2010-11-28 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

About Alexander's solution:  might it make more sense to have 
PyUnicode_EncodeDecimal raise for inputs like this?  I see it as 
PyUnicode_EncodeDecimal's job to turn the unicode input into usable ASCII (or 
raise an exception);  it looks like that's not happening here.

Adding MAL to the nosy in case he wants to comment on this.

--
nosy: +lemburg




[issue10557] Malformed error message from float()

2010-11-28 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sun, Nov 28, 2010 at 11:37 AM, Mark Dickinson rep...@bugs.python.org wrote:

> Mark Dickinson dicki...@gmail.com added the comment:
>
>> I am not sure, whether support for non-ascii digits in float()
>> constructor is worth maintaining.
>
> I'd be very happy to drop such support.  If you allow alternative digit sets,
> trying to work out exactly what should be supported and what shouldn't gets
> very messy;  it's even worse for int(), where bases > 10 have to be taken
> into account.


I think there is a proposal (if not already in Unicode 6.0) to add a
hexadecimal value property to the UCD, but I don't think it is right for
int() and float() to depend on the UCD in the first place.  If people need
to process exotic decimals, we can expose the 'decimal' encoding; however,
translating from code point to decimal value is trivial: just
subtract the code-point value for zero.
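
To make the "subtract the code point for zero" remark concrete, here is a
small sketch for one digit block.  The constant and the function name
arabic_indic_value() are illustrative only; the range check is an addition
so that characters outside the block are rejected instead of yielding a
nonsense value.

```python
import unicodedata

ARABIC_INDIC_ZERO = "\N{ARABIC-INDIC DIGIT ZERO}"  # U+0660


def arabic_indic_value(ch):
    """Decimal value of an Arabic-Indic digit, computed by plain subtraction."""
    value = ord(ch) - ord(ARABIC_INDIC_ZERO)
    if not 0 <= value <= 9:
        raise ValueError("not an Arabic-Indic digit: {!r}".format(ch))
    return value


# The subtraction agrees with the Unicode database for the whole block.
for offset in range(10):
    ch = chr(ord(ARABIC_INDIC_ZERO) + offset)
    assert arabic_indic_value(ch) == unicodedata.decimal(ch)

print(arabic_indic_value("\u0663"))
```

The catch is that the caller has to know which block's zero to subtract,
which is the information the UCD lookup provides for free.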

>> (Anyone knows whether Arabic numbers are written right to left
>> or left to right?  What is the proper decimal point character?)
>
> Well, judging by the chocolate packaging I saw recently, they're written
> left to right (so presumably if you're reading right-to-left, you see the
> units first, then the tens, etc., which always struck me as the 'right' way
> to write things in the first place :-).

Did your chocolate packaging use European digits or Arabic-Indic ones?
Note that they have different bidi properties:

'EN'
>>> unicodedata.bidirectional('\N{ARABIC-INDIC DIGIT ONE}')
'AN'

(I am not sure what the 'AN' (Arabic Numeral?) bidi property actually means.)
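
For anyone who wants to reproduce the comparison, the following is a short
self-contained version.  In the Unicode bidirectional class names, EN
stands for "European Number" and AN for "Arabic Number"; the variable names
are just for this example.

```python
import unicodedata

european_one = "1"
arabic_indic_one = "\N{ARABIC-INDIC DIGIT ONE}"  # U+0661

# Both characters have decimal value 1 but different bidi classes.
for ch in (european_one, arabic_indic_one):
    print(ascii(ch), unicodedata.decimal(ch), unicodedata.bidirectional(ch))
```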




[issue10557] Malformed error message from float()

2010-11-28 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

> Did your chocolate packaging use European digits or Arabic-Indic ones?
> Note that they have different bidi properties:

Good question;  I think it was Arabic-Indic digits, but to be honest I don't 
remember.  (It wasn't *all* that recently.)




[issue10557] Malformed error message from float()

2010-11-28 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sun, Nov 28, 2010 at 12:00 PM, Mark Dickinson rep...@bugs.python.org wrote:

> Mark Dickinson dicki...@gmail.com added the comment:
>
> About Alexander's solution:  might it make more sense to have
> PyUnicode_EncodeDecimal raise for inputs like this?

No, I think PyOS_string_to_double() can generate better error messages
than PyUnicode_EncodeDecimal().  It is important to pass a losslessly
encoded string to PyOS_string_to_double() for proper error reporting.
Otherwise, we will have to catch the error in PyFloat_FromString()
just to add the string value to the message and may lose other
information such as the precise location of the offending character.
(AFAICT, we don't make use of it now, but this would be a meaningful
improvement.)

> I see it as PyUnicode_EncodeDecimal's job to turn the unicode input into
> usable ASCII (or raise an exception);  it looks like that's not happening
> here.

UTF-8 is quite usable by PyOS_string_to_double().  The UTF-8 encoder is
extremely fast and will only get faster over time.  In my opinion,
PyUnicode_EncodeDecimal() is either unnecessary or should be exposed
as a codec.




[issue10557] Malformed error message from float()

2010-11-28 Thread Stefan Krah

Stefan Krah stefan-use...@bytereef.org added the comment:

> PyUnicode_EncodeDecimal() is either unnecessary or should be exposed
> as a codec.

I'm depending on PyUnicode_EncodeDecimal in cdecimal. In fact, it saved
me quite a bit of trouble. I wouldn't be surprised if other extension
writers use it as well.

--
nosy: +skrah




[issue10557] Malformed error message from float()

2010-11-28 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Mark Dickinson wrote:
 
> Mark Dickinson dicki...@gmail.com added the comment:
>
> About Alexander's solution:  might it make more sense to have
> PyUnicode_EncodeDecimal raise for inputs like this?  I see it as
> PyUnicode_EncodeDecimal's job to turn the unicode input into usable ASCII (or
> raise an exception);  it looks like that's not happening here.
>
> Adding MAL to the nosy in case he wants to comment on this.

The purpose of the PyUnicode_EncodeDecimal() API is to convert
Unicode decimal digits to ASCII digits for interpretation by
the other usual functions to convert the ASCII representation to
numbers.

The proposed patch will not work, since it removes the support for
non-ASCII number code points, e.g. Asian number code points.

You might want to replace the error message by something more
related to floats in the float constructor. Note that UnicodeErrors
are subclasses of ValueErrors, so the errors are not unexpected
for number constructors.
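
The subclass relationship mentioned above is easy to confirm from Python.
The sketch below uses a throwaway helper, error_name_for(), which is not
part of any API: it encodes to ASCII and shows that a plain
"except ValueError" clause already catches the UnicodeEncodeError.

```python
# Built-in hierarchy: UnicodeEncodeError -> UnicodeError -> ValueError.
print(issubclass(UnicodeEncodeError, UnicodeError))
print(issubclass(UnicodeError, ValueError))


def error_name_for(text):
    """Encode text as ASCII and name the exception a ValueError handler sees.

    Returns None when the text encodes cleanly.
    """
    try:
        text.encode("ascii")
    except ValueError as exc:
        return type(exc).__name__
    return None


print(error_name_for("\u0663"))  # Arabic-Indic digit three is not ASCII
print(error_name_for("3"))
```

So callers that catch ValueError around float() are not broken by the
UnicodeEncodeError; the complaint in this issue is about the wording of the
message, not about the exception escaping such handlers.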




[issue10557] Malformed error message from float()

2010-11-28 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

> >>> float('½')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ValueError: could not convert string to float: �
>
> >>> float('42½')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ValueError

Note that fractional Unicode code points are not supported
by the encoding function.  Neither are code points which do
not evaluate to 0-9, e.g. ones that represent numbers
larger than 9.
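
The distinction can be seen in the unicodedata module, which exposes the
numeric, digit and decimal properties separately.  The two sample
characters below are picked only for illustration: a fraction, and a code
point whose numeric value is larger than 9.

```python
import unicodedata

half = "\N{VULGAR FRACTION ONE HALF}"      # U+00BD
roman_twelve = "\N{ROMAN NUMERAL TWELVE}"  # U+216B

for ch in (half, roman_twelve):
    print(
        unicodedata.name(ch),
        unicodedata.numeric(ch),        # both have a numeric value ...
        unicodedata.decimal(ch, None),  # ... but no decimal digit value
    )
```

Only characters with a decimal value take part in the digit-to-ASCII
conversion, which is why '½' ends up being rejected.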




[issue10557] Malformed error message from float()

2010-11-28 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Issue #10567 demonstrated the problem of relying on the Unicode
database in Python builtins.  Apparently, Unicode does not guarantee
stability of the character categories.  On the other hand, we are
already tied to the UCD for the language definition.  Maybe Python should
document the version of Unicode it is using in any given version, and
possibly the upgrade to Unicode 6.0 should be postponed until the language
moratorium is lifted.




[issue10557] Malformed error message from float()

2010-11-28 Thread Stefan Krah

Stefan Krah stefan-use...@bytereef.org added the comment:

> (Anyone knows whether Arabic numbers are written right to left or left
> to right?  What is the proper decimal point character?)

Thousands separator: U+066C
Decimal point: U+066B

١٢٣٬١٢٣٫١٢ should be: 123,123.12


Wikipedia says that digits are arranged in the usual way, lowest
value digit to the right.
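
If an application wanted to accept these separators itself, a sketch along
the following lines would do, under the assumption that float() keeps
accepting Arabic-Indic digits as shown elsewhere in this issue.  The table
name and parse_arabic_number() are made up for the example; float() does
nothing like this on its own.

```python
# Map the Arabic separators to what float() understands.
ARABIC_SEPARATORS = {
    0x066C: None,  # ARABIC THOUSANDS SEPARATOR: dropped
    0x066B: ".",   # ARABIC DECIMAL SEPARATOR: replaced by an ASCII point
}


def parse_arabic_number(text):
    """Parse a number written with Arabic-Indic digits and separators."""
    return float(text.translate(ARABIC_SEPARATORS))


# The example from this message, spelled with escapes to avoid bidi
# rendering surprises: 123 U+066C 123 U+066B 12
sample = "\u0661\u0662\u0663\u066c\u0661\u0662\u0663\u066b\u0661\u0662"
print(parse_arabic_number(sample))
```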




[issue10557] Malformed error message from float()

2010-11-28 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sun, Nov 28, 2010 at 3:09 PM, Stefan Krah rep...@bugs.python.org wrote:
..
> Decimal point: U+066B

Well, not so fast:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character '\u066b' in
position 1: invalid decimal Unicode string




[issue10557] Malformed error message from float()

2010-11-28 Thread Stefan Krah

Stefan Krah stefan-use...@bytereef.org added the comment:

> UnicodeEncodeError: 'decimal' codec can't encode character '\u066b'

Hmm, looks like a bug? I think U+066B is correct.




[issue10557] Malformed error message from float()

2010-11-28 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sun, Nov 28, 2010 at 3:30 PM, Stefan Krah rep...@bugs.python.org wrote:
..
>> UnicodeEncodeError: 'decimal' codec can't encode character '\u066b'
>
> Hmm, looks like a bug? I think U+066B is correct.

Really?  What about

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character '\uff0e' in
position 4: invalid decimal Unicode string

.. and where do we draw the line?  Note that I am not against
Decimal() accepting any c with c.isdigit() returning True, but
builtins should be less promiscuous IMO.




[issue10557] Malformed error message from float()

2010-11-28 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Sending this by e-mail was not a good idea ...

On Sun, Nov 28, 2010 at 3:30 PM, Stefan Krah rep...@bugs.python.org wrote:
..
>> UnicodeEncodeError: 'decimal' codec can't encode character '\u066b'
>
> Hmm, looks like a bug? I think U+066B is correct.

Really?  What about

>>> float('1234．56')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character '\uff0e' in
position 4: invalid decimal Unicode string

.. and where do we draw the line?  Note that I am not against
Decimal() accepting any c with c.isdigit() returning True, but
builtins should be less promiscuous IMO.
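
As a side note on where the line could be drawn: str.isdigit() is broader
than str.isdecimal(), and the decimal module follows the narrower notion
(Unicode decimal digits).  The sample characters below are chosen for
illustration only.

```python
from decimal import Decimal, InvalidOperation

samples = {
    "ascii seven": "7",
    "arabic-indic seven": "\u0667",
    "superscript two": "\u00b2",
}

for name, ch in samples.items():
    try:
        parsed = Decimal(ch)
    except InvalidOperation:
        parsed = None  # not accepted as a decimal digit
    print(name, ch.isdigit(), ch.isdecimal(), parsed)
```

Superscript two passes isdigit() but not isdecimal(), and Decimal() rejects
it, so "any c with c.isdigit()" is wider than what Decimal() takes today.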




[issue10557] Malformed error message from float()

2010-11-28 Thread Alexander Belopolsky

Changes by Alexander Belopolsky belopol...@users.sourceforge.net:


--
Removed message: http://bugs.python.org/msg122725




[issue10557] Malformed error message from float()

2010-11-28 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

After a bit of svn archeology, it does appear that Arabic-Indic digits' support 
was deliberate at least in the sense that the feature was tested for when the 
code was first committed. See r15000.

The test migrated from file to file over the last 10 years, but it is still 
present in test_float.py:

self.assertEqual(float(b"  \u0663.\u0661\u0664  ".decode('raw-unicode-escape')), 3.14)

(It should probably be now rewritten using a string literal.)
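
A sketch of what the string-literal version of that check could look like,
written here as a plain script and not as the actual test_float.py method:

```python
# Arabic-Indic "3.14" padded with spaces, as a str literal with escapes
# instead of bytes decoded via raw-unicode-escape.
text = "  \u0663.\u0661\u0664  "
value = float(text)
print(value)
```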

I am now attaching the patch (issue10557.diff) that fixes the bug without 
sacrificing non-ASCII digit support.

If this approach is well-received, I would like to replace all calls to 
PyUnicode_EncodeDecimal() with the calls to the new 
_PyUnicode_EncodeDecimalUTF8() and deprecate Latin-1-oriented 
PyUnicode_EncodeDecimal().

For the future, I note that starting with Unicode 6.0.0, the Unicode Consortium 
promises that


Characters with the property value Numeric_Type=de (Decimal) only occur in 
contiguous ranges of 10 characters, with ascending numeric values from 0 to 9 
(Numeric_Value=0..9).


This makes it very easy to check a numeric string does not contain a mix of 
digits from different scripts.

I still believe that proper API should require explicit choice of language or 
locale before allowing digits other than 0-9 just as int() would not accept 
hexadecimal digits without explicit choice of base = 16.  But this would be a 
subject of a feature request.
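
To illustrate the analogy with int(): hexadecimal digits need an explicit
base, whereas non-ASCII decimal digits are accepted with no opt-in at all.
The helper name try_int() is just for this example.

```python
def try_int(text, *args):
    """Return int(text, *args), or the ValueError instance if parsing fails."""
    try:
        return int(text, *args)
    except ValueError as exc:
        return exc


print(try_int("ff", 16))            # explicit base: accepted
print(repr(try_int("ff")))          # default base 10: rejected
print(try_int("\u0663\u0661"))      # Arabic-Indic "31": accepted as is
```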

--
Added file: http://bugs.python.org/file19865/issue10557.diff




[issue10557] Malformed error message from float()

2010-11-27 Thread Alexander Belopolsky

New submission from Alexander Belopolsky belopol...@users.sourceforge.net:

>>> float('½')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: �

>>> float('42½')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError

With the attached patch, float-error.diff:

>>> float('½')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for float(): ½
>>> float('42½')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for float(): 42½

Note that the proposed patch also has an effect of disallowing non-ascii digits 
in float() constructor.

Before the patch:

>>> float('١٢٣٤.٥٦')
1234.56

After the patch:

>>> float('١٢٣٤.٥٦')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: ١٢٣٤.٥٦

I am not sure, whether support for non-ascii digits in float() constructor is 
worth maintaining.  (Anyone knows whether Arabic numbers are written right to 
left or left to right?  What is the proper decimal point character?)

Also, I don't think users expect UnicodeEncodeError from float() or int().

Before the patch:

>>> float('\u')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character '\u' in position
0: invalid decimal Unicode string

After the patch:

>>> float('\u')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: �

--
components: Interpreter Core
files: float-error.diff
keywords: patch
messages: 122612
nosy: belopolsky, ezio.melotti, haypo, mark.dickinson
priority: normal
severity: normal
stage: unit test needed
status: open
title: Malformed error message from float()
type: behavior
versions: Python 3.2
Added file: http://bugs.python.org/file19848/float-error.diff




[issue10557] Malformed error message from float()

2010-11-27 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

I think float() should support non-ascii digits but I agree that it would be 
better to avoid UnicodeErrors and convert them to ValueErrors so that

>>> float('١٢٣٤.٥٦')
1234.56

and

>>> float('½')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for float(): ½

I.e. float should do the C equivalent of:
try:
    s = arg.encode('decimal')
except UnicodeEncodeError:
    raise ValueError('Invalid literal for float(): {}'.format(arg))

Note that int() and Decimal() support non-ASCII digits too.
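
Since the 'decimal' codec is internal and cannot be reached through
str.encode(), a runnable version of the idea needs a stand-in.  In the
sketch below, to_ascii_decimal() plays that role and float_from_text() does
the error translation; both names are hypothetical and this is not how
PyFloat_FromString() is implemented.

```python
import unicodedata


def to_ascii_decimal(text):
    """Stand-in for the internal 'decimal' codec (illustration only).

    Maps Unicode decimal digits to ASCII digits, then encodes as ASCII,
    which raises UnicodeEncodeError for anything left over.
    """
    chars = []
    for ch in text.strip():
        value = unicodedata.decimal(ch, None)
        chars.append(str(value) if value is not None else ch)
    return "".join(chars).encode("ascii")


def float_from_text(text):
    """Convert text to float, reporting bad input as a plain ValueError."""
    try:
        encoded = to_ascii_decimal(text)
    except UnicodeEncodeError:
        # Hide the codec detail and show the original argument instead.
        raise ValueError(
            "invalid literal for float(): {!r}".format(text)
        ) from None
    return float(encoded.decode("ascii"))


print(float_from_text("\u0661\u0662\u0663\u0664.\u0665\u0666"))
try:
    float_from_text("\u00bd")
except ValueError as exc:
    print(type(exc).__name__, exc)
```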




[issue10557] Malformed error message from float()

2010-11-27 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

FWIW the UnicodeError comes from PyUnicode_EncodeDecimal (unicodeobject.c:6212),
and the "ValueError: could not convert string to float" with the buggy � comes
from PyOS_string_to_double (pystrtod.c:316).  Maybe PyOS_string_to_double
should be fixed to display the string correctly, but I don't know what will be
affected by that (or whether it will make things worse or better).  The
UnicodeError can be fixed in PyFloat_FromString (floatobject.c:174).
