[issue8821] Range check on unicode repr

2013-09-02 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
resolution:  - works for me
stage:  - committed/rejected
status: pending - closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2013-01-26 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
status: open - pending

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2013-01-03 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

You can accept the patch. You can reject the patch. It doesn't matter.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2013-01-02 Thread Ezio Melotti

Ezio Melotti added the comment:

Should this be closed then?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2012-12-02 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

This patch is not needed for 3.3+. And for 2.7 and 3.2 it actually doesn't fix 
any bug in the current code.

--
nosy: +serhiy.storchaka
versions:  -Python 3.1

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2010-12-29 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

[MAL]
  * Unicode objects are NUL-terminated, but only very external APIs
rely on this (e.g. code using the Windows Unicode API). Please
don't make the code in unicodeobject.c itself rely on this
subtle detail.

I would like to note that several APIs have been introduced that require 
NUL-terminated unicode strings: Py_UNICODE_strlen(), Py_UNICODE_strcpy(), etc.  
I see them used quite extensively in unicodeobject.c and elsewhere in Python 
codebase. Perhaps this train has left the station already.

--
nosy: +belopolsky, haypo

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2010-12-29 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

 Unicode objects are NUL-terminated, but only very external APIs
 rely on this (e.g. code using the Windows Unicode API).

All Py_UNICODE_str*() functions rely on the NUL character. They are useful when 
patching a function from bytes (char*) to unicode (PyUnicodeObject): the API is 
very close. It is possible to avoid them with new functions using the strings 
length.

All functions using PyUNICODE* as wchar_t* to the Windows wide character API 
(*W functions) also rely on the NUL character. Python core uses a lot of these 
functions. Don't write a NUL character require to create a temporary new string 
ending with a NUL character. It is not efficient, especially on long strings.

And there is the problem of all third party modules (written in C) relying on 
the NUL character.

I think that we have good reasons to not remove the NUL character. So I think 
that we can continue to accept that unicode[length] character can be read. Eg. 
implement text.startswith(ab) as p=PyUnicode_AS_UNICODE(text); if (p[0] == 
'a'  p[1] == 'b') without checking the length of text.

Using the NUL character or the length as a terminator condition doesn't really 
matter. I just see one advantage for the NUL character: it is faster in some 
cases.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2010-12-29 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

 I think that we have good reasons to not remove the NUL character.

Please note: Nobody is suggesting that we remove the NUL character. I was 
merely suggesting that we don't rely on it where it is unnecessary.

Returning to my original patch: If the code was using the NUL character as a 
terminator, then it wouldn't be a bug.

What the repr code does is it uses the length, and does not explicitly search 
for a NUL character. However, there is a *bug* where it reads one too many 
characters in certain cases. As I said in the first place, it just happens to 
*not* be catastrophic due to the presence of the NUL character. But that does 
not mean this isn't a bug -- at the very least, the code is very confusing to 
read because it does not do what it is trying to do.

Anyway the important issue is what Marc-Andre raised about buffers. Since we 
are in agreement that there is a potential problem here, and I have a patch 
which seems correct and doesn't break any test cases (note my above post 
responding to test case breakages), can it be applied?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2010-08-10 Thread Terry J. Reedy

Changes by Terry J. Reedy tjre...@udel.edu:


--
versions:  -Python 2.6, Python 3.3

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2010-08-02 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Antoine Pitrou wrote:
 
 Antoine Pitrou pit...@free.fr added the comment:
 
 Well, the patch was technically useless since, as mentioned, unicode strings 
 are terminated by a NUL character by design.

There are two things to keep in mind:

 * Unicode objects are NUL-terminated, but only very external APIs
   rely on this (e.g. code using the Windows Unicode API). Please
   don't make the code in unicodeobject.c itself rely on this
   subtle detail.

 * The codecs work on Py_UNICODE* buffers which are *never* guaranteed
   to be NUL-terminated, so the problem in question is real.

 Anyway, I now get the following error on the 2.7 branch. Perhaps it's related:
 
 ==
 FAIL: test_ucs4 (test.test_unicode.UnicodeTest)
 --
 Traceback (most recent call last):
   File /home/antoine/cpython/27/Lib/test/test_unicode.py, line 941, in 
 test_ucs4
 self.assertEqual(x, y)
 AssertionError: '\\udbc0\\udc00' != '\\U0010'
 
 --
 nosy: +pitrou
 status: closed - open
 
 ___
 Python tracker rep...@bugs.python.org
 http://bugs.python.org/issue8821
 ___
 ___
 Python-bugs-list mailing list
 Unsubscribe: 
 http://mail.python.org/mailman/options/python-bugs-list/mal%40egenix.com

--
nosy: +lemburg

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2010-08-02 Thread Marc-Andre Lemburg

Changes by Marc-Andre Lemburg m...@egenix.com:


--
resolution: fixed - 
status: closed - open

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2010-08-02 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

  * Unicode objects are NUL-terminated, but only very external APIs
rely on this (e.g. code using the Windows Unicode API). Please
don't make the code in unicodeobject.c itself rely on this
subtle detail.

That's wishful thinking, don't you think? *I* am not making code in
unicodeobject.c rely on this. It has been so for years, long before I
was here. You should check who made that design decision in the first
place instead of putting the blame on me.

Besides, the fact that external APIs rely on it make it much more
unchangeable than if it were an implementation detail.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2010-08-02 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +ezio.melotti

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2010-08-02 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Antoine Pitrou wrote:
 
 Antoine Pitrou pit...@free.fr added the comment:
 
  * Unicode objects are NUL-terminated, but only very external APIs
rely on this (e.g. code using the Windows Unicode API). Please
don't make the code in unicodeobject.c itself rely on this
subtle detail.
 
 That's wishful thinking, don't you think? *I* am not making code in
 unicodeobject.c rely on this. It has been so for years, long before I
 was here. You should check who made that design decision in the first
 place instead of putting the blame on me.

I'm not blaming you for this. However, I don't want more code
to rely on this behavior.

The NUL-termination has never been documented and my decision
to use NUL-termination on the PyUnicodeObject buffers was merely
a safety measure.

 Besides, the fact that external APIs rely on it make it much more
 unchangeable than if it were an implementation detail.

It's an undocumented implementation detail. We can certainly
deprecate it's use using the standard approach we have for this.

But all that is off-topic for this ticket, since codecs
operate on Py_UNICODE* buffers together with a size parameter
and relying on those buffers being NUL-terminated is bound to
cause problems.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2010-08-02 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

 But all that is off-topic for this ticket, since codecs
 operate on Py_UNICODE* buffers together with a size parameter
 and relying on those buffers being NUL-terminated is bound to
 cause problems.

That's right. Then perhaps a fixed patch can be uploaded, if someone
cares enough.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2010-08-02 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

OK, I finally had time to review this issue again.

Firstly, granted the original fix broke a test case, shouldn't we figure out 
why it broke and fix it, rather than just revert the change and continue 
relying on this tenuous assumption? Surely it's best to have as little code 
relying on it as possible.

Secondly, please have a look at my patch again. It wasn't committed properly -- 
no offense to Georg, it's an honest mistake! My patch was:

--- Objects/unicodeobject.c (revision 81539)
+++ Objects/unicodeobject.c (working copy)
@@ -3065,7 +3065,7 @@
 }
 #else
 /* Map UTF-16 surrogate pairs to '\U00xx' */
-else if (ch = 0xD800  ch  0xDC00) {
+else if (ch = 0xD800  ch  0xDC00  size  0) {
 Py_UNICODE ch2;
 Py_UCS4 ucs;

The commit made in r83418 by Georg Brandl (and similarly r83395 in py3k):
http://svn.python.org/view/python/branches/release27-maint/Objects/unicodeobject.c?r1=82980r2=83418

--- Objects/unicodeobject.c (revision 83417)
+++ Objects/unicodeobject.c (revision 83418)
@@ -3067,7 +3067,7 @@
 
 ch2 = *s++;
 size--;
-if (ch2 = 0xDC00  ch2 = 0xDFFF) {
+if (ch2 = 0xDC00  ch2 = 0xDFFF  size) {
 ucs = (((ch  0x03FF)  10) | (ch2  0x03FF)) + 0x0001;
 *p++ = '\\';
 *p++ = 'U';
@@ -3316,7 +3316,7 @@
 
 ch2 = *s++;
 size--;
-if (ch2 = 0xDC00  ch2 = 0xDFFF) {
+if (ch2 = 0xDC00  ch2 = 0xDFFF  size) {
 ucs = (((ch  0x03FF)  10) | (ch2  0x03FF)) + 
0x0001;
 *p++ = '\\';
 *p++ = 'U';

I put the size check on the first character of the surrogate pair; in the 
committed version the size check was on the second character (after the size 
variable is decremented), causing it to break out of that branch too early in 
some cases.

Moving the size check to the outer if block fixes the test breakage.

PS. Good find on the second copy of that code in the 
PyUnicode_EncodeRawUnicodeEscape function. I've attached a new patch which 
fixes both functions instead of just the unicodeescape_string function.

Passes all test cases on UCS2 build of the 2.7 branch.

--
Added file: http://bugs.python.org/file18322/unicode-range-check2.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2010-08-01 Thread Georg Brandl

Georg Brandl ge...@python.org added the comment:

Applied in r83395. Thanks!

--
nosy: +georg.brandl
resolution:  - fixed
status: open - closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2010-08-01 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

Well, the patch was technically useless since, as mentioned, unicode strings 
are terminated by a NUL character by design.

Anyway, I now get the following error on the 2.7 branch. Perhaps it's related:

==
FAIL: test_ucs4 (test.test_unicode.UnicodeTest)
--
Traceback (most recent call last):
  File /home/antoine/cpython/27/Lib/test/test_unicode.py, line 941, in 
test_ucs4
self.assertEqual(x, y)
AssertionError: '\\udbc0\\udc00' != '\\U0010'

--
nosy: +pitrou
status: closed - open

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2010-08-01 Thread Georg Brandl

Georg Brandl ge...@python.org added the comment:

You're right. Reverted in r83444 and merging back, and I'll also remove the 
XXX is this needed from 2.7.

--
status: open - closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8821] Range check on unicode repr

2010-05-25 Thread Matt Giuca

New submission from Matt Giuca matt.gi...@gmail.com:

In unicodeobject.c's unicodeescape_string, in UCS2 builds, if the last 
character of the string is the start of a UTF-16 surrogate pair (between 
'\ud800' and '\udfff'), there is a slight overrun problem. For example:

 repr(u'abcd\ud800')

Upon reading ch = 0xd800, the test (ch = 0xD800  ch  0xDC00) succeeds, and 
it then reads ch2 = *s++. Note that preceding this line, s points at one 
character past the end of the string, so the value read will be garbage. I 
imagine that unless it falls on a segment boundary, the worst that could happen 
is the character '\ud800' is interpreted as some other wide character. 
Nevertheless, this is bad.

Note that *technically* this is never bad, because _PyUnicode_New allocates an 
extra character and sets it to '\u', and thus the above example will always 
set ch2 to 0, and it will behave correctly. But this is a tenuous thing to rely 
on, especially given the comment above _PyUnicode_New:

/* We allocate one more byte to make sure the string is
   Ux terminated -- XXX is this needed ?
*/

I thought about removing that XXX, but I'd rather fix the problem. Therefore, I 
have attached a patch which does a range check before reading ch2:

--- Objects/unicodeobject.c (revision 81539)
+++ Objects/unicodeobject.c (working copy)
@@ -3065,7 +3065,7 @@
 }
 #else
 /* Map UTF-16 surrogate pairs to '\U00xx' */
-else if (ch = 0xD800  ch  0xDC00) {
+else if (ch = 0xD800  ch  0xDC00  size  0) {
 Py_UNICODE ch2;
 Py_UCS4 ucs;

Also affects Python 3.

--
components: Unicode
files: unicode-range-check.patch
keywords: patch
messages: 106506
nosy: mgiuca
priority: normal
severity: normal
status: open
title: Range check on unicode repr
type: behavior
versions: Python 2.6, Python 2.7, Python 3.1, Python 3.2, Python 3.3
Added file: http://bugs.python.org/file17465/unicode-range-check.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8821
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com