[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value
Antoine Pitrou pit...@free.fr added the comment:

Fixed in r81907 (trunk), r81908 (py3k), r81909 (2.6), r81910 (3.1).

--
resolution: -> fixed
status: open -> closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8941
___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value
Antoine Pitrou pit...@free.fr added the comment:

Also witnessed on 2.x (UCS-2 build):

>>> unicode(b'\x00\x01\x00\x00', 'utf-32be')
u'\ud800\u0773'
>>> unicode(b'\x00\x00\x01\x00', 'utf-32le')
u'\U00010000'

--
nosy: +haypo, lemburg, pitrou
priority: normal -> high
title: utf-32be codec failing on 16-bit python build for 32-bit value -> utf-32be codec failing on UCS-2 python build for 32-bit value
versions: +Python 2.6, Python 2.7, Python 3.2
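
For reference, both byte strings above encode the same code point, U+10000, which a UCS-2 build has to store as a surrogate pair. The following is a minimal Python 3 sketch of the expected result, not code from the tracker; the helper name utf16_surrogates is hypothetical and only illustrates the standard UTF-16 split.

```python
def utf16_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the supplementary range")
    cp -= 0x10000
    # The top 10 bits go into the high surrogate, the low 10 bits into the low one.
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)


# Both inputs from the report, decoded with a correct codec.
be = b"\x00\x01\x00\x00".decode("utf-32-be")
le = b"\x00\x00\x01\x00".decode("utf-32-le")

# The pair a UCS-2 build should store for that code point.
pair = utf16_surrogates(ord(be))
```

Under these assumptions the correct UCS-2 result is the pair D800 DC00, whereas the reported output u'\ud800\u0773' has the right high surrogate followed by an unrelated code unit.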
[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value
Antoine Pitrou pit...@free.fr added the comment:

The following code at the beginning of PyUnicode_DecodeUTF32Stateful is buggy when codec endianness doesn't match the native endianness (not to mention it could also crash if the underlying CPU arch doesn't support unaligned access to 4-byte integers):

#ifndef Py_UNICODE_WIDE
    for (i = pairs = 0; i < size/4; i++)
        if (((Py_UCS4 *)s)[i] >= 0x10000)
            pairs++;
#endif

As a result, the preallocated unicode object isn't long enough and Python writes into memory it shouldn't write into. It can produce hard crashes, such as:

>>> l = unicode(b'\x00\x01\x00\x00' * 1024, 'utf-32be')
Debug memory block at address p=0xf2b310:
    2050 bytes originally requested
    The 8 pad bytes at p-8 are FORBIDDENBYTE, as expected.
    The 8 pad bytes at tail=0xf2bb12 are not all FORBIDDENBYTE (0xfb):
        at tail+0: 0x00 *** OUCH
        at tail+1: 0xdc *** OUCH
        at tail+2: 0x00 *** OUCH
        at tail+3: 0xd8 *** OUCH
        at tail+4: 0x00 *** OUCH
        at tail+5: 0xdc *** OUCH
        at tail+6: 0x00 *** OUCH
        at tail+7: 0xd8 *** OUCH
    The block was made by call #61925422603698392 to debug malloc/realloc.
    Data at p: 00 d8 00 dc 00 d8 00 dc ... 00 dc 00 d8 00 dc 00 d8
Fatal Python error: bad trailing pad byte
Abandon

--
priority: high -> critical
type: behavior -> crash
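
The undercount can be modelled in a few lines of Python. This is only an illustrative sketch, not the C code or the patch: count_pairs is a hypothetical helper, and the "little" byte order stands in for the native order of a little-endian machine reading big-endian input.

```python
def count_pairs(data, byteorder):
    """Count 4-byte units whose value, read in `byteorder`, lies above the BMP."""
    usable = len(data) - len(data) % 4
    return sum(
        1
        for i in range(0, usable, 4)
        if int.from_bytes(data[i:i + 4], byteorder) >= 0x10000
    )


data = b"\x00\x01\x00\x00" * 1024  # 1024 copies of U+10000 in UTF-32-BE

# Reading big-endian data in little-endian order: every unit looks like 0x100.
wrong = count_pairs(data, "little")
# Reading the data in the codec's own byte order finds every pair.
right = count_pairs(data, "big")

# Code units allocated versus code units actually written on a UCS-2 build.
allocated = len(data) // 4 + wrong
needed = len(data) // 4 + right
```

In this model the scan finds no pairs, so 1024 code units are allocated, which matches the 2050 bytes in the dump if one assumes 2-byte units plus a terminator, while the decoder goes on to write 2048 units.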
[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value
Changes by Antoine Pitrou pit...@free.fr:

--
nosy: +doerwalter
[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value
Antoine Pitrou pit...@free.fr added the comment:

Here is a simple patch. A test should be added, though.

--
keywords: +patch
Added file: http://bugs.python.org/file17596/utf32.patch
[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value
Marc-Andre Lemburg m...@egenix.com added the comment:

Antoine Pitrou wrote:
> Antoine Pitrou pit...@free.fr added the comment:
>
> The following code at the beginning of PyUnicode_DecodeUTF32Stateful is buggy when codec endianness doesn't match the native endianness (not to mention it could also crash if the underlying CPU arch doesn't support unaligned access to 4-byte integers):
>
> #ifndef Py_UNICODE_WIDE
>     for (i = pairs = 0; i < size/4; i++)
>         if (((Py_UCS4 *)s)[i] >= 0x10000)
>             pairs++;
> #endif

Good catch!

I wonder whether it wouldn't be better to preallocate a Unicode object with size of e.g. size/4 + 16 and then resize the object as necessary in case a surrogate pair needs to be created (won't happen that often in practice).

The extra scan for pairs can take long depending on how much data you have to decode and likely doesn't go down well with CPU caches.

--
title: utf-32be codec failing on UCS-2 python build for 32-bit value -> utf-32be codec failing on UCS-2 python build for 32-bit value
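
The preallocate-and-resize idea suggested above could look roughly like the following Python model. It is a sketch under stated assumptions, not the decoder's implementation: the function name, the growth rule (remaining input units plus the slack) and the use of array("H") as a stand-in for a UCS-2 buffer are all hypothetical choices made for illustration.

```python
from array import array


def decode_utf32_to_utf16_units(data, byteorder, slack=16):
    """Decode UTF-32 bytes into UTF-16 code units without a pre-scan.

    Returns (units, resizes). The buffer starts at size/4 + slack units and
    grows only when a surrogate pair no longer fits.
    """
    n = len(data) // 4
    out = array("H", [0]) * (n + slack)  # preallocated, zero-filled
    used = 0
    resizes = 0
    for i in range(n):
        cp = int.from_bytes(data[4 * i:4 * i + 4], byteorder)
        if cp > 0x10FFFF:
            raise ValueError("code point out of range")
        need = 2 if cp >= 0x10000 else 1
        if used + need > len(out):
            # Hypothetical growth rule: room for the remaining input plus slack.
            out.extend([0] * ((n - i) + slack))
            resizes += 1
        if need == 2:
            cp -= 0x10000
            out[used] = 0xD800 + (cp >> 10)
            out[used + 1] = 0xDC00 + (cp & 0x3FF)
        else:
            out[used] = cp
        used += need
    del out[used:]  # trim the unused tail
    return out, resizes


bmp_units, bmp_resizes = decode_utf32_to_utf16_units(b"\x00\x00\x00A" * 10, "big")
astral_units, astral_resizes = decode_utf32_to_utf16_units(
    b"\x00\x01\x00\x00" * 1024, "big"
)
```

With BMP-only input the buffer is never resized in this model; the all-astral input from the crash report is the worst case and forces several resizes, which is the cost that would need measuring against the extra scan.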
[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value
Changes by Ezio Melotti ezio.melo...@gmail.com:

--
nosy: +ezio.melotti
stage: -> unit test needed
[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value
Antoine Pitrou pit...@free.fr added the comment:

Here is a new patch with tests.

> I wonder whether it wouldn't be better to preallocate a Unicode object with size of e.g. size/4 + 16 and then resize the object as necessary in case a surrogate pair needs to be created (won't happen that often in practice).
>
> The extra scan for pairs can take long depending on how much data you have to decode and likely doesn't go down well with CPU caches.

Perhaps, but I think this should be measured and be the target of a separate issue. We're in rc phase and we should probably minimize potential disruption.

--
Added file: http://bugs.python.org/file17598/utf32-2.patch
[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value
Marc-Andre Lemburg m...@egenix.com added the comment:

Antoine Pitrou wrote:
> Antoine Pitrou pit...@free.fr added the comment:
>
> Here is a new patch with tests.
>
>> I wonder whether it wouldn't be better to preallocate a Unicode object with size of e.g. size/4 + 16 and then resize the object as necessary in case a surrogate pair needs to be created (won't happen that often in practice).
>>
>> The extra scan for pairs can take long depending on how much data you have to decode and likely doesn't go down well with CPU caches.
>
> Perhaps, but I think this should be measured and be the target of a separate issue. We're in rc phase and we should probably minimize potential disruption.

Fair enough.

Here's a little optimization:

-if (qq[iorder[3]] != 0 || qq[iorder[2]] != 0)
+if (qq[iorder[2]] != 0 || qq[iorder[3]] != 0)

For non-BMP code points, it's more likely that byte 2 will be non-zero.
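
The reasoning behind the reordered test can be checked with a small Python model. It assumes, which the thread does not spell out, that iorder maps byte significance to position, so that qq[iorder[2]] holds bits 16-23 and qq[iorder[3]] holds bits 24-31; the IORDER table and is_non_bmp below are hypothetical names for illustration.

```python
# Assumed layout: IORDER[order][k] is the position of the byte with weight 256**k.
IORDER = {"little": (0, 1, 2, 3), "big": (3, 2, 1, 0)}


def is_non_bmp(q, byteorder):
    """Return True if the 4-byte unit `q` holds a value of 0x10000 or more."""
    iorder = IORDER[byteorder]
    # Byte 2 is tested first: for valid code points up to U+10FFFF, byte 3 is
    # always zero, so `or` short-circuits on byte 2 for every non-BMP value.
    return q[iorder[2]] != 0 or q[iorder[3]] != 0


be_hit = is_non_bmp(b"\x00\x01\x00\x00", "big")
le_hit = is_non_bmp(b"\x00\x00\x01\x00", "little")
bmp_hit = is_non_bmp(b"\x00\x00\x00A", "big")
```

Because byte 3 is zero for every valid code point, testing it first always costs one extra comparison on non-BMP input, while testing byte 2 first decides the result immediately.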