[issue23614] Opaque error message on UTF-8 decoding to surrogates

2015-04-27 Thread Chris Angelico

Chris Angelico added the comment:

Got around to tracking down where this is actually being done. It's in 
Objects/stringlib/codecs.h and it looks to be a hot area for optimization. I 
don't want to fiddle with it without knowing a lot about the performance 
implications (UTF-8 encode/decode being a pretty common operation), and since 
my knowledge of CPU operation costs is about fifteen or twenty years out of 
date, it's probably best I not do this. It would be nice if the message could 
be improved per Ezio's suggestion, but that would mean returning more 
information to the caller.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23614
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23614] Opaque error message on UTF-8 decoding to surrogates

2015-03-13 Thread Ezio Melotti

Ezio Melotti added the comment:

Table 3-7 of http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (page 93 
of the book, or 40 of the pdf) shows that if the start byte is ED, the first 
continuation byte must be in the range 80..9F.  This means that, in order to 
decode a sequence starting with ED, you need two more valid continuation bytes.  
Since the following byte (B4) is not in the allowed range 80..9F and is thus an 
invalid continuation byte, the decoder doesn't know how to decode the byte in 
position 0 (i.e. ED).
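
The lead-byte-dependent check described above can be sketched in Python. This is only an illustration of the ranges in Table 3-7, not the code in CPython's decoder; the function name second_byte_range is hypothetical.

```python
def second_byte_range(lead):
    """Return the (low, high) range allowed for the byte that follows
    `lead`, per Table 3-7, or None if `lead` does not start a
    multi-byte sequence."""
    if 0xC2 <= lead <= 0xDF:
        return (0x80, 0xBF)
    if lead == 0xE0:
        return (0xA0, 0xBF)  # 80..9F would be a non-shortest form
    if lead == 0xED:
        return (0x80, 0x9F)  # A0..BF would encode a surrogate
    if 0xE1 <= lead <= 0xEF:
        return (0x80, 0xBF)
    if lead == 0xF0:
        return (0x90, 0xBF)  # 80..8F would be a non-shortest form
    if 0xF1 <= lead <= 0xF3:
        return (0x80, 0xBF)
    if lead == 0xF4:
        return (0x80, 0x8F)  # 90..BF would exceed U+10FFFF
    return None


low, high = second_byte_range(0xED)
print(low <= 0xB4 <= high)  # False: B4 is rejected after ED
print(low <= 0x9F <= high)  # True: 9F is accepted after ED
```

Only the second byte has a range that depends on the lead byte; any third or fourth byte is always checked against 80..BF.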

It is also true that this particular sequence, if allowed, would result in a 
surrogate.  However, by looking at the first two bytes only, you don't have 
enough information to be sure about that (e.g. ED B4 00 doesn't decode to a 
surrogate, so Pike's error message is imprecise).

If handling this special case doesn't require too much extra code, it would be 
ok with me to have something like:
>>> b"\xed\xb4\x80".decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid 
continuation byte (possible start of a surrogate)

--
type:  -> enhancement




[issue23614] Opaque error message on UTF-8 decoding to surrogates

2015-03-13 Thread Chris Angelico

Chris Angelico added the comment:

Nice document. Is that actually how Python's decoder checks things? Does the 
decoder have different definitions of "valid continuation byte" based on the 
lead byte? If that's the case... well, ten out of ten for complying with the 
spec, to be sure, but unfortunately it leads to some opaque error messages!

I haven't looked into the code even a little bit, but would it be possible to 
have a specific error message attached to certain invalid continuation bytes?

* E0 followed by 80..9F: non-shortest form
* ED followed by A0..BF: surrogate
* F4 followed by 90..BF: outside defined range
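
The three cases listed above could be mapped to hints along these lines. This is a minimal sketch, not a proposed patch; the function name describe_bad_second_byte is hypothetical, and the hint strings are taken from the list.

```python
def describe_bad_second_byte(lead, second):
    """Return a hint explaining why `second` cannot follow `lead`,
    or None if none of the three special cases applies."""
    if lead == 0xE0 and 0x80 <= second <= 0x9F:
        return "non-shortest form"
    if lead == 0xED and 0xA0 <= second <= 0xBF:
        return "surrogate"
    if lead == 0xF4 and 0x90 <= second <= 0xBF:
        return "outside defined range"
    return None


print(describe_bad_second_byte(0xED, 0xB4))  # surrogate
print(describe_bad_second_byte(0xED, 0x41))  # None
```

Table 3-7 also excludes F0 followed by 80..8F, which is not among the three cases in the list and is therefore left out of the sketch.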

If that's too hard, it'd at least be helpful to point out that the "invalid 
continuation byte" is not the same as the "byte 0x?? in position ?" - the 
rejection here is actually of the B4 that follows it. How does this look?

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid 
continuation byte 0xb4 for this start byte
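
A message of this shape can be approximated from pure Python without touching the decoder. The sketch below assumes that, for this error, exc.end is the index of the rejected byte, which matches the behaviour observed in CPython 3 but is an assumption here; the wrapper name decode_utf8_verbose is hypothetical.

```python
def decode_utf8_verbose(data):
    """Decode `data` as UTF-8; on an invalid continuation byte,
    re-raise with the rejected byte named in the reason."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        if exc.reason == "invalid continuation byte" and exc.end < len(data):
            bad = data[exc.end]  # assumed: first byte after the accepted prefix
            reason = ("invalid continuation byte 0x%02x for this start byte"
                      % bad)
            raise UnicodeDecodeError(
                exc.encoding, exc.object, exc.start, exc.end, reason
            ) from None
        raise


try:
    decode_utf8_verbose(b"\xed\xb4\x80")
except UnicodeDecodeError as exc:
    print(exc)
```

Valid input is returned unchanged, and the other error types pass through with their original reason.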

(BTW, I think Pike's decoder just always emits two bytes, no matter what the 
actual errant stream is (after all, there's no way to know how many bytes ought 
to have been one character, when there's an error in it). So it's incomplete, 
yes, but when you're dealing with wrong data, completeness isn't all that 
possible anyway.)




[issue23614] Opaque error message on UTF-8 decoding to surrogates

2015-03-13 Thread Ezio Melotti

Ezio Melotti added the comment:

> Nice document. Is that actually how Python's decoder checks things?

Yes, Python follows the Unicode standard.

> * E0 followed by 80..9F: non-shortest form
> * ED followed by A0..BF: surrogate
> * F4 followed by 90..BF: outside defined range

If you get a decode error while using UTF-8, it means that you are trying to 
decode something that is not (valid) UTF-8.  I can see 3 situations where this 
might happen:
1) the input is using a different encoding;
2) the input is corrupted;
3) the input is using an encoding similar to UTF-8 (e.g. CESU-8).

In the first two cases additional information about continuation bytes is 
meaningless and misleading (there's no such thing as short form or surrogates 
in e.g. ASCII).  In the third case (which is actually a special case of 1), 
mentioning surrogates and perhaps non-shortest form might be useful if the 
developer is intimately familiar with UTF-8 and Unicode, since they might 
suspect that the input is actually CESU-8 or that the text has been encoded by 
an outdated encoder that follows the RFC 2044 specs from 1996.
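
The CESU-8 case can be illustrated with a single character. The sketch below hand-builds the CESU-8 bytes for U+1F600 (an arbitrary example code point, not taken from the report) and recovers it with the standard "surrogatepass" error handler; it is not a general CESU-8 decoder.

```python
# CESU-8 form of U+1F600: the UTF-16 surrogates D83D and DE00,
# each encoded as its own three-byte sequence.
cesu8 = b"\xed\xa0\xbd\xed\xb8\x80"

# The strict UTF-8 codec rejects the bytes.
try:
    cesu8.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)

# "surrogatepass" lets the UTF-8 codec produce the two lone surrogates;
# a round trip through UTF-16 then joins the pair into one code point.
surrogates = cesu8.decode("utf-8", "surrogatepass")
text = surrogates.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
print(len(surrogates), hex(ord(text)))
```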

> How does this look?
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0:
> invalid continuation byte 0xb4 for this start byte

Something similar would be ok with me, assuming it is easy to implement in the 
code.




[issue23614] Opaque error message on UTF-8 decoding to surrogates

2015-03-08 Thread Chris Angelico

New submission from Chris Angelico:

>>> b"\xed\xb4\x80".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position
0: invalid continuation byte

The actual problem here is that this byte sequence would decode to U+DD00, 
which, being a surrogate, is invalid for the encoding. It's correct to raise 
UnicodeDecodeError, but the text of the message is a bit obscure. I'm not sure 
whether the "invalid continuation byte" is talking about the 0xed in position 
0 or about one of the others; 0xED is not a continuation byte, it's a start 
byte - and a perfectly valid one:

>>> b"\xed\x9f\xbf".decode("utf-8")
'\ud7ff'

Pike is more explicit about what the problem is:

> utf8_to_string("\xed\xb4\x80");
UTF-8 sequence beginning with 0xed 0xb4 at index 0 would decode to a
UTF-16 surrogate character.

Is this something worth fixing?

Tested on 3.4.2 and a recent build of 3.5, probably applies to most 3.x 
versions. (2.7 actually permits this, which is a bigger bug, but one with 
backward-compatibility issues.)
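
The U+DD00 figure in this report can be checked by combining the payload bits of the three bytes by hand. The helper names below are hypothetical, and the sketch deliberately skips all validation so that it can show what the rejected sequence would have decoded to.

```python
def raw_three_byte_value(seq):
    """Combine the payload bits of a three-byte UTF-8-style sequence
    without validating it."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(code_point):
    """True for code points in the UTF-16 surrogate range."""
    return 0xD800 <= code_point <= 0xDFFF


for seq in (b"\xed\xb4\x80", b"\xed\x9f\xbf"):
    cp = raw_three_byte_value(seq)
    print("U+%04X" % cp, is_surrogate(cp))
# U+DD00 True
# U+D7FF False
```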

--
components: Interpreter Core, Unicode
messages: 237572
nosy: Rosuav, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: Opaque error message on UTF-8 decoding to surrogates
versions: Python 3.4, Python 3.5




[issue23614] Opaque error message on UTF-8 decoding to surrogates

2015-03-08 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The UTF-8 codec can't decode byte 0xed because 0xed on its own is not a valid 
UTF-8 sequence and the following byte is not the valid continuation byte it 
expects.

The UTF-8 codec can produce errors of three types:

* invalid start byte. When the byte is not the start byte of a UTF-8 sequence 
(%x00-7F, %xC2-F4).
* invalid continuation byte.  When the byte that follows an unfinished UTF-8 
sequence is not a valid continuation byte (the validity depends on the previous 
byte).
* unexpected end of data. When there are no bytes after an unfinished UTF-8 
sequence.
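
Each of the three error types above can be triggered from Python and read back through the reason attribute of the exception. The helper name utf8_error_reason and the sample inputs are only illustrative.

```python
def utf8_error_reason(data):
    """Return the decoder's reason string for `data`, or None if it
    decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


samples = [
    b"\x80",          # a continuation byte where a start byte is required
    b"\xed\xb4\x80",  # ED followed by a byte outside 80..9F
    b"\xed\x9f",      # three-byte sequence cut off after two bytes
]
for data in samples:
    print(data, utf8_error_reason(data))
```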

--
nosy: +serhiy.storchaka
