[issue9133] Invalid UTF8 Byte sequence not raising exception/being substituted
Changes by Julian Mehnle jul...@mehnle.net: -- nosy: +jmehnle ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue9133 ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue9133] Invalid UTF8 Byte sequence not raising exception/being substituted
New submission from Mike Lewis mikelikes...@gmail.com:

When I do

    codecs.encode(codecs.decode('\xed\xbc\xad', 'utf8'), 'utf8')

it's not throwing an exception. '\xed\xbc\xad' is an invalid UTF-8 byte sequence. It maps to the value U+DF2D, which seems to be a surrogate code point (one half of a UTF-16 surrogate pair).

http://tools.ietf.org/html/rfc3629#section-4 explains:

    However, pairs of UCS-2 values between D800 and DFFF (surrogate pairs in Unicode parlance), being actually UCS-4 characters transformed through UTF-16, need special treatment: the UTF-16 transformation must be undone, yielding a UCS-4 character that is then transformed as above.

which would suggest that it is invalid. However, I think Wikipedia's explanation is a bit clearer:

    UTF-8 may only legally be used to encode valid Unicode scalar values. According to the Unicode standard the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and values above U+10FFFF are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence and should be treated as described above.

Thanks,
Mike

--
components: Unicode
messages: 109010
nosy: Mike.Lewis
priority: normal
severity: normal
status: open
title: Invalid UTF8 Byte sequence not raising exception/being substituted
versions: Python 2.6
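The claim that '\xed\xbc\xad' maps to U+DF2D can be checked by hand. The following is only a minimal sketch of the arithmetic for a three-byte sequence of the form 1110xxxx 10xxxxxx 10xxxxxx; the helper names `decode_three_byte` and `is_surrogate` are made up for this illustration and are not part of the codecs module.

```python
def decode_three_byte(seq: bytes) -> int:
    """Return the code point carried by a 3-byte sequence 1110xxxx 10xxxxxx 10xxxxxx.

    Hypothetical helper: it only unpacks the payload bits and does not
    check whether the result is a legal Unicode scalar value.
    """
    if len(seq) != 3:
        raise ValueError("expected exactly three bytes")
    b0, b1, b2 = seq
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a 3-byte lead/continuation pattern")
    # 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(code_point: int) -> bool:
    """True for code points in the UTF-16 surrogate range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


cp = decode_three_byte(b"\xed\xbc\xad")
print(f"U+{cp:04X}", is_surrogate(cp))  # U+DF2D True
```

Because the result falls inside U+D800..U+DFFF, a strict UTF-8 decoder is expected to reject the sequence rather than return it.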
[issue9133] Invalid UTF8 Byte sequence not raising exception/being substituted
Mike Lewis mikelikes...@gmail.com added the comment:

Sorry, meant to add this part to the quote from the RFC:

    This leads to different results for character numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT valid UTF-8

--
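To make the CESU-8 remark concrete, here is a rough Python 3 sketch that builds the CESU-8 form of a character above 0xFFFF and compares it with real UTF-8. The function name `cesu8_encode_supplementary` is hypothetical; Python ships no CESU-8 codec, so the sketch goes through UTF-16 and the standard "surrogatepass" error handler instead.

```python
def cesu8_encode_supplementary(ch: str) -> bytes:
    """Encode one character above U+FFFF the CESU-8 way (illustrative only).

    CESU-8 first splits the character into its UTF-16 surrogate pair and
    then encodes each surrogate as its own 3-byte sequence.
    """
    if len(ch) != 1 or ord(ch) <= 0xFFFF:
        raise ValueError("expected a single character above U+FFFF")
    units = ch.encode("utf-16-be")  # 4 bytes: high surrogate, low surrogate
    high = int.from_bytes(units[:2], "big")
    low = int.from_bytes(units[2:], "big")
    # "surrogatepass" lets the UTF-8 codec emit bytes for lone surrogates.
    return (chr(high) + chr(low)).encode("utf-8", "surrogatepass")


ch = "\U00010000"
print(ch.encode("utf-8").hex())              # f0908080
print(cesu8_encode_supplementary(ch).hex())  # eda080edb080

try:
    cesu8_encode_supplementary(ch).decode("utf-8")
except UnicodeDecodeError:
    print("strict UTF-8 decoding rejects the CESU-8 bytes")
```

The two encodings agree for characters up to 0xFFFF and diverge only above it, which is the case the RFC sentence describes.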
[issue9133] Invalid UTF8 Byte sequence not raising exception/being substituted
Ezio Melotti ezio.melo...@gmail.com added the comment:

This is already fixed in Python 3. However, I think that for backward compatibility reasons it can't be fixed in Python 2, where it is possible to encode and decode every code point to/from UTF-8.

See also http://bugs.python.org/issue8271#msg102209

I think this can be closed as wontfix.

--
nosy: +ezio.melotti, haypo, lemburg
status: open -> pending
type:  -> behavior
versions: +Python 2.7
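For readers on Python 3, the behaviour described as already fixed can be observed with a short sketch like the one below. It assumes a reasonably recent Python 3 interpreter; `strict_decode_fails` is a name invented for this example, while the "replace" and "surrogatepass" error handlers are the standard ones.

```python
SUSPECT = b"\xed\xbc\xad"  # the byte sequence from the original report


def strict_decode_fails(data: bytes) -> bool:
    """Return True if strict UTF-8 decoding raises UnicodeDecodeError."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


print(strict_decode_fails(SUSPECT))  # True

# "replace" substitutes U+FFFD for the bytes that cannot be decoded.
print(ascii(SUSPECT.decode("utf-8", errors="replace")))

# "surrogatepass" opts back in to the permissive behaviour on request.
lone = SUSPECT.decode("utf-8", errors="surrogatepass")
print(ascii(lone))  # '\udf2d'

# Strict encoding refuses the lone surrogate as well.
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    print("strict UTF-8 encoding rejects the lone surrogate")
```

The round trip from the report therefore fails at the decode step in Python 3 unless an error handler is passed explicitly.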
[issue9133] Invalid UTF8 Byte sequence not raising exception/being substituted
Changes by Marc-Andre Lemburg m...@egenix.com:

--
resolution:  -> wont fix
status: pending -> closed
[issue9133] Invalid UTF8 Byte sequence not raising exception/being substituted
Marc-Andre Lemburg m...@egenix.com added the comment:

Ezio Melotti wrote:
> I think this can be closed as wontfix.

Agreed. I've already closed the ticket.

--
[issue9133] Invalid UTF8 Byte sequence not raising exception/being substituted
Changes by Ezio Melotti ezio.melo...@gmail.com:

--
stage:  -> committed/rejected