[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value

2010-06-11 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

Fixed in r81907 (trunk), r81908 (py3k), r81909 (2.6), r81910 (3.1).

--
resolution:  -> fixed
status: open -> closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8941
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value

2010-06-09 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

Also witnessed on 2.x (UCS-2 build):

>>> unicode(b'\x00\x01\x00\x00', 'utf-32be')
u'\ud800\u0773'
>>> unicode(b'\x00\x00\x01\x00', 'utf-32le')
u'\U00010000'
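
For reference, the expected result can be worked out by hand. The following is a minimal Python 3 sketch, not CPython code; the helper name to_surrogate_pair is made up for illustration. It computes the UTF-16 surrogate pair that a UCS-2 build would be expected to store for a code point above U+FFFF:

```python
import struct


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 | (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


# Read the four bytes from the report as one big-endian 32-bit integer.
(code_point,) = struct.unpack(">I", b"\x00\x01\x00\x00")
pair = to_surrogate_pair(code_point)
```

Here code_point is 0x10000 and pair is (0xD800, 0xDC00), so the second unit in the utf-32be session above (0x0773) is not the low surrogate one would expect.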

--
nosy: +haypo, lemburg, pitrou
priority: normal -> high
title: utf-32be codec failing on 16-bit python build for 32-bit value -> 
utf-32be codec failing on UCS-2 python build for 32-bit value
versions: +Python 2.6, Python 2.7, Python 3.2




[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value

2010-06-09 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

The following code at the beginning of PyUnicode_DecodeUTF32Stateful is buggy 
when codec endianness doesn't match the native endianness (not to mention it 
could also crash if the underlying CPU arch doesn't support unaligned access to 
4-byte integers):

#ifndef Py_UNICODE_WIDE
    for (i = pairs = 0; i < size/4; i++)
        if (((Py_UCS4 *)s)[i] >= 0x10000)
            pairs++;
#endif

As a result, the preallocated unicode object isn't long enough and Python 
writes into memory it shouldn't write into. It can produce hard crashes, such 
as:

>>> l = unicode(b'\x00\x01\x00\x00' * 1024, 'utf-32be')
Debug memory block at address p=0xf2b310:
2050 bytes originally requested
The 8 pad bytes at p-8 are FORBIDDENBYTE, as expected.
The 8 pad bytes at tail=0xf2bb12 are not all FORBIDDENBYTE (0xfb):
at tail+0: 0x00 *** OUCH
at tail+1: 0xdc *** OUCH
at tail+2: 0x00 *** OUCH
at tail+3: 0xd8 *** OUCH
at tail+4: 0x00 *** OUCH
at tail+5: 0xdc *** OUCH
at tail+6: 0x00 *** OUCH
at tail+7: 0xd8 *** OUCH
The block was made by call #61925422603698392 to debug malloc/realloc.
Data at p: 00 d8 00 dc 00 d8 00 dc ... 00 dc 00 d8 00 dc 00 d8
Fatal Python error: bad trailing pad byte
Abandon
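
The miscount can be reproduced outside the interpreter. Below is a rough Python 3 sketch of the pre-scan, not the C code itself; count_pairs is a hypothetical helper, and it assumes the crashing machine is little-endian (the dumped data "00 d8 00 dc" suggests so, but that is an inference):

```python
import struct


def count_pairs(data, byte_order):
    """Count 4-byte units >= 0x10000, reading each in the given byte order.

    byte_order is a struct prefix: "<" for little-endian, ">" for big-endian.
    """
    pairs = 0
    for i in range(len(data) // 4):
        (value,) = struct.unpack_from(byte_order + "I", data, i * 4)
        if value >= 0x10000:
            pairs += 1
    return pairs


data = b"\x00\x01\x00\x00" * 1024  # UTF-32BE encoding of U+10000, 1024 times

# What the pre-scan sees when it reads the data in little-endian order
# (assumed here to be the native order of the crashing machine).
native_count = count_pairs(data, "<")
# What it would see if it honoured the codec's byte order.
codec_count = count_pairs(data, ">")
```

Read in little-endian order, every unit looks like 0x00000100, so no pairs are counted, while the codec's byte order gives 1024. That would fit the 2050-byte request in the trace (1024 two-byte units plus a terminator, if Py_UNICODE is 2 bytes wide), whereas the decoder goes on to write 2048 units.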

--
priority: high -> critical
type: behavior -> crash




[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value

2010-06-09 Thread Antoine Pitrou

Changes by Antoine Pitrou pit...@free.fr:


--
nosy: +doerwalter




[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value

2010-06-09 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

Here is a simple patch. A test should be added, though.

--
keywords: +patch
Added file: http://bugs.python.org/file17596/utf32.patch




[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value

2010-06-09 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Antoine Pitrou wrote:
 
> Antoine Pitrou pit...@free.fr added the comment:
>
> The following code at the beginning of PyUnicode_DecodeUTF32Stateful is buggy
> when codec endianness doesn't match the native endianness (not to mention it
> could also crash if the underlying CPU arch doesn't support unaligned access
> to 4-byte integers):
>
> #ifndef Py_UNICODE_WIDE
>     for (i = pairs = 0; i < size/4; i++)
>         if (((Py_UCS4 *)s)[i] >= 0x10000)
>             pairs++;
> #endif

Good catch!

I wonder whether it wouldn't be better to preallocate
a Unicode object with size of e.g. size/4 + 16 and
then resize the object as necessary in case a surrogate
pair needs to be created (won't happen that often in
practice).

The extra scan for pairs can take long depending on
how much data you have to decode and likely doesn't
go down well with CPU caches.
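
To make the idea concrete, here is a rough Python 3 sketch of decoding without a pre-scan. It is only an illustration under assumptions: the function name, the slack parameter and the growth policy are invented here, a list stands in for the Unicode buffer, and there is no error handling for invalid input.

```python
import struct


def decode_utf32be_to_utf16_units(data, slack=16):
    """Decode UTF-32BE bytes into UTF-16 code units, growing on demand.

    Returns (units, resizes), where resizes counts how often the buffer
    had to be enlarged.
    """
    n_units = len(data) // 4
    buf = [0] * (n_units + slack)  # optimistic preallocation: size/4 + slack
    resizes = 0
    out = 0
    for i in range(n_units):
        (cp,) = struct.unpack_from(">I", data, i * 4)
        needed = 2 if cp >= 0x10000 else 1
        if out + needed > len(buf):
            # Grow once, by the worst case for the input still to decode.
            buf.extend([0] * (2 * (n_units - i)))
            resizes += 1
        if needed == 2:
            cp -= 0x10000
            buf[out] = 0xD800 | (cp >> 10)
            buf[out + 1] = 0xDC00 | (cp & 0x3FF)
        else:
            buf[out] = cp
        out += needed
    del buf[out:]  # trim the unused tail
    return buf, resizes
```

With pure BMP input the buffer is never resized; with the 1024-character input from the crash report it is resized once. Whether this beats the extra scan would still have to be measured.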

--
title: utf-32be codec failing on UCS-2 python build for 32-bit value -> 
utf-32be codec failing on UCS-2 python build for 32-bit value




[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value

2010-06-09 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +ezio.melotti
stage:  -> unit test needed




[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value

2010-06-09 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

Here is a new patch with tests.

> I wonder whether it wouldn't be better to preallocate
> a Unicode object with size of e.g. size/4 + 16 and
> then resize the object as necessary in case a surrogate
> pair needs to be created (won't happen that often in
> practice).
>
> The extra scan for pairs can take long depending on
> how much data you have to decode and likely doesn't
> go down well with CPU caches.

Perhaps, but I think this should be measured and be the target of a separate 
issue. We're in rc phase and we should probably minimize potential disruption.

--
Added file: http://bugs.python.org/file17598/utf32-2.patch




[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value

2010-06-09 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Antoine Pitrou wrote:
 
> Antoine Pitrou pit...@free.fr added the comment:
>
> Here is a new patch with tests.
>
>> I wonder whether it wouldn't be better to preallocate
>> a Unicode object with size of e.g. size/4 + 16 and
>> then resize the object as necessary in case a surrogate
>> pair needs to be created (won't happen that often in
>> practice).
>>
>> The extra scan for pairs can take long depending on
>> how much data you have to decode and likely doesn't
>> go down well with CPU caches.
>
> Perhaps, but I think this should be measured and be the target of a separate
> issue. We're in rc phase and we should probably minimize potential disruption.

Fair enough.

Here's a little optimization:

-if (qq[iorder[3]] != 0 || qq[iorder[2]] != 0)
+if (qq[iorder[2]] != 0 || qq[iorder[3]] != 0)

For non-BMP code points, it's more likely that byte 2
will be non-zero.
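
A quick way to see why, sketched in Python 3. The names below are illustrative only, and the sketch assumes that iorder maps significance to byte position, with iorder[0] indexing the least significant byte and iorder[3] the most significant:

```python
# Assumed layout of iorder for each byte order (index = significance).
BIG_ENDIAN = (3, 2, 1, 0)
LITTLE_ENDIAN = (0, 1, 2, 3)


def needs_surrogate_pair(qq, iorder):
    """Return True if the 4-byte unit qq encodes a value above 0xFFFF.

    Byte 2 is tested first: for valid non-BMP code points
    (0x10000..0x10FFFF) it is always in 0x01..0x10, while byte 3 is
    always zero, so the "or" short-circuits after one comparison.
    """
    return qq[iorder[2]] != 0 or qq[iorder[3]] != 0


astral_be = needs_surrogate_pair(b"\x00\x01\x00\x00", BIG_ENDIAN)
bmp_be = needs_surrogate_pair(b"\x00\x00\x00\x41", BIG_ENDIAN)
astral_le = needs_surrogate_pair(b"\x00\x00\x01\x00", LITTLE_ENDIAN)
```

Byte 3 only matters for values above 0xFFFFFF, which are outside the Unicode range anyway.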

--
