[issue27971] utf-16 decoding can't handle lone surrogates

2016-12-06 Thread Christoph Reiter
Changes by Christoph Reiter:
resolution: -> wont fix
status: open -> closed

[issue27971] utf-16 decoding can't handle lone surrogates

2016-09-10 Thread Christoph Reiter
Christoph Reiter added the comment: Closing as wontfix seems fine to me if there are concerns regarding compatibility. Thanks for looking into this. I've also found a workaround for my use case in the meantime: https://github.com/lazka/senf/commit/b7dadb05a29db5f0d74f659971b0a86d5e579028

[issue27971] utf-16 decoding can't handle lone surrogates

2016-09-09 Thread Eryk Sun
Eryk Sun added the comment: I wasn't trying to put words in your mouth, Victor. I was replying to Terry (msg275406).

[issue27971] utf-16 decoding can't handle lone surrogates

2016-09-09 Thread STINNER Victor
STINNER Victor added the comment:
> Considering the UTF-16 codec isn't self-consistent, it's a stretch to say
> it's not a bug.
I didn't say that it's not a bug. I said that it's not possible to modify a codec at this point in Python 2.7 without taking a risk of breaking applications relying on

[issue27971] utf-16 decoding can't handle lone surrogates

2016-09-09 Thread Eryk Sun
Eryk Sun added the comment: Considering the UTF-16 codec isn't self-consistent, it's a stretch to say it's not a bug. It's misbehavior, and it either will be or won't be fixed. From Victor's response it's looking like the latter.

[issue27971] utf-16 decoding can't handle lone surrogates

2016-09-09 Thread Terry J. Reedy
Terry J. Reedy added the comment: Unless the 2.7 docs specify that the utf codecs should violate the standard with respect to lone surrogates, I think this should definitely be closed (as 'not a bug').
nosy: +terry.reedy

[issue27971] utf-16 decoding can't handle lone surrogates

2016-09-06 Thread STINNER Victor
STINNER Victor added the comment: I dislike the idea of changing the behaviour in a minor release :-/

[issue27971] utf-16 decoding can't handle lone surrogates

2016-09-06 Thread Eryk Sun
Eryk Sun added the comment: Victor, it seems the only option here (other than closing this as won't fix) is to modify the UTF-16 decoder in 2.7 to allow lone surrogates, which would be consistent with the UTF-8 and UTF-32 decoders. While it's too late to enforce strict compliance in 2.7, it sh

[issue27971] utf-16 decoding can't handle lone surrogates

2016-09-06 Thread STINNER Victor
STINNER Victor added the comment: UTF codecs must not encode surrogate characters: http://unicodebook.readthedocs.io/issues.html#non-strict-utf-8-decoder-overlong-byte-sequences-and-surrogates Python 3 is right; sadly, it's too late to fix Python 2.
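The strict behaviour described in this comment can be checked with a short sketch. It assumes Python 3.4 or later, and the helper name `strict_encode_fails` is made up for illustration; it is not part of the tracker discussion.

```python
# Sketch, assuming Python 3.4+: the strict UTF encoders reject a lone surrogate.
LONE_SURROGATE = "\ud83d"  # a high surrogate with no low surrogate after it


def strict_encode_fails(codec):
    """Return True if `codec` refuses the lone surrogate in strict mode."""
    try:
        LONE_SURROGATE.encode(codec)
    except UnicodeEncodeError:
        return True
    return False


# Hypothetical summary dict, keyed by codec name.
results = {
    codec: strict_encode_fails(codec)
    for codec in ("utf-8", "utf-16-le", "utf-32-le")
}
print(results)
```

All three codecs are expected to raise `UnicodeEncodeError` here, which is the consistency that Python 2.7 lacks.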

[issue27971] utf-16 decoding can't handle lone surrogates

2016-09-06 Thread Christoph Reiter
Christoph Reiter added the comment:
On Tue, Sep 6, 2016 at 4:10 PM, Eryk Sun wrote:
> Lone surrogate codes aren't valid Unicode. In Python 3 they get used
> internally for tricks like the "surrogateescape" error handler. In Python
> 3.4+, the 'surrogatepass' error handler allows encoding and d
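As a rough illustration of the "surrogateescape" trick mentioned in the quoted text, the sketch below assumes Python 3 and uses made-up sample bytes. Each undecodable byte is mapped to a lone surrogate in the range U+DC80..U+DCFF, and encoding with the same handler restores the original bytes.

```python
# Sketch, assuming Python 3: "surrogateescape" smuggles undecodable bytes
# through str as lone low surrogates.
raw = b"abc\xff"  # sample data (an assumption); 0xFF is never valid in UTF-8

text = raw.decode("utf-8", "surrogateescape")
# The invalid byte 0xFF becomes the lone surrogate U+DCFF.
print(ascii(text))

# Encoding with the same handler turns the surrogate back into the byte.
restored = text.encode("utf-8", "surrogateescape")
print(restored == raw)
```

This is why lone surrogates can legitimately appear in Python 3 strings even though they are not valid Unicode text.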

[issue27971] utf-16 decoding can't handle lone surrogates

2016-09-06 Thread Christoph Reiter
Christoph Reiter added the comment:
On Tue, Sep 6, 2016 at 3:43 PM, Xiang Zhang wrote:
>
> Xiang Zhang added the comment:
>
> With the latest build, even encode will fail:
With Python 3 you have to use the "surrogatepass" error handler. I assumed this was the default in Python 2 since it worked
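A minimal sketch of the round trip with "surrogatepass", assuming Python 3.4 or later (the thread reports that 3.3.6 still fails). The variable names are illustrative only.

```python
# Sketch, assuming Python 3.4+: round-trip a lone surrogate through UTF-16-LE
# by passing the "surrogatepass" error handler on both sides.
lone = "\ud83d"

encoded = lone.encode("utf-16-le", "surrogatepass")
print(encoded)  # same two bytes that Python 2.7 produces for this code unit

decoded = encoded.decode("utf-16-le", "surrogatepass")
print(decoded == lone)

# Without the handler, the strict decoder rejects the same bytes.
try:
    encoded.decode("utf-16-le")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False
print(strict_decode_ok)
```

The handler has to be given explicitly on both the encode and the decode call; the default "strict" mode rejects the surrogate in either direction.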

[issue27971] utf-16 decoding can't handle lone surrogates

2016-09-06 Thread Eryk Sun
Eryk Sun added the comment:
Probably Python 2's UTF-16 decoder should be as broken as the encoder, which will match the broken behavior of the UTF-8 and UTF-32 codecs:
    >>> u'\ud83d\uda12'.encode('utf-8').decode('utf-8')
    u'\ud83d\uda12'
    >>> u'\ud83d\uda12'.encode('utf-32-le').decode(

[issue27971] utf-16 decoding can't handle lone surrogates

2016-09-06 Thread Xiang Zhang
Xiang Zhang added the comment:
With the latest build, even encode will fail:
    Python 3.6.0a4+ (default:dad4c42869f6, Sep 6 2016, 21:41:38)
    [GCC 5.2.1 20151010] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> u"\ud83d".encode("utf-16-le")
    Traceback (most recen

[issue27971] utf-16 decoding can't handle lone surrogates

2016-09-06 Thread Christoph Reiter
Christoph Reiter added the comment: Same problem on 3.3.6, but it works on 3.4.5. So I guess this was fixed but not backported.

[issue27971] utf-16 decoding can't handle lone surrogates

2016-09-06 Thread Christoph Reiter
New submission from Christoph Reiter:
Using Python 2.7.12
    >>> u"\ud83d".encode("utf-16-le")
    '=\xd8'
    >>> u"\ud83d".encode("utf-16-le").decode("utf-16-le")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
        return co