[issue8821] Range check on unicode repr
Changes by Serhiy Storchaka storch...@gmail.com: -- resolution: - works for me stage: - committed/rejected status: pending - closed ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
Changes by Serhiy Storchaka storch...@gmail.com: -- status: open - pending ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
Serhiy Storchaka added the comment: You can accept the patch. You can reject the patch. It doesn't matter. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
Ezio Melotti added the comment: Should this be closed then? -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
Serhiy Storchaka added the comment: This patch is not needed for 3.3+. And for 2.7 and 3.2 it actually doesn't fix any bug in the current code. -- nosy: +serhiy.storchaka versions: -Python 3.1 ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: [MAL] * Unicode objects are NUL-terminated, but only very external APIs rely on this (e.g. code using the Windows Unicode API). Please don't make the code in unicodeobject.c itself rely on this subtle detail. I would like to note that several APIs have been introduced that require NUL-terminated unicode strings: Py_UNICODE_strlen(), Py_UNICODE_strcpy(), etc. I see them used quite extensively in unicodeobject.c and elsewhere in Python codebase. Perhaps this train has left the station already. -- nosy: +belopolsky, haypo ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
STINNER Victor victor.stin...@haypocalc.com added the comment: Unicode objects are NUL-terminated, but only very external APIs rely on this (e.g. code using the Windows Unicode API). All Py_UNICODE_str*() functions rely on the NUL character. They are useful when patching a function from bytes (char*) to unicode (PyUnicodeObject): the API is very close. It is possible to avoid them with new functions using the strings length. All functions using PyUNICODE* as wchar_t* to the Windows wide character API (*W functions) also rely on the NUL character. Python core uses a lot of these functions. Don't write a NUL character require to create a temporary new string ending with a NUL character. It is not efficient, especially on long strings. And there is the problem of all third party modules (written in C) relying on the NUL character. I think that we have good reasons to not remove the NUL character. So I think that we can continue to accept that unicode[length] character can be read. Eg. implement text.startswith(ab) as p=PyUnicode_AS_UNICODE(text); if (p[0] == 'a' p[1] == 'b') without checking the length of text. Using the NUL character or the length as a terminator condition doesn't really matter. I just see one advantage for the NUL character: it is faster in some cases. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
Matt Giuca matt.gi...@gmail.com added the comment: I think that we have good reasons to not remove the NUL character. Please note: Nobody is suggesting that we remove the NUL character. I was merely suggesting that we don't rely on it where it is unnecessary. Returning to my original patch: If the code was using the NUL character as a terminator, then it wouldn't be a bug. What the repr code does is it uses the length, and does not explicitly search for a NUL character. However, there is a *bug* where it reads one too many characters in certain cases. As I said in the first place, it just happens to *not* be catastrophic due to the presence of the NUL character. But that does not mean this isn't a bug -- at the very least, the code is very confusing to read because it does not do what it is trying to do. Anyway the important issue is what Marc-Andre raised about buffers. Since we are in agreement that there is a potential problem here, and I have a patch which seems correct and doesn't break any test cases (note my above post responding to test case breakages), can it be applied? -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
Changes by Terry J. Reedy tjre...@udel.edu: -- versions: -Python 2.6, Python 3.3 ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
Marc-Andre Lemburg m...@egenix.com added the comment: Antoine Pitrou wrote: Antoine Pitrou pit...@free.fr added the comment: Well, the patch was technically useless since, as mentioned, unicode strings are terminated by a NUL character by design. There are two things to keep in mind: * Unicode objects are NUL-terminated, but only very external APIs rely on this (e.g. code using the Windows Unicode API). Please don't make the code in unicodeobject.c itself rely on this subtle detail. * The codecs work on Py_UNICODE* buffers which are *never* guaranteed to be NUL-terminated, so the problem in question is real. Anyway, I now get the following error on the 2.7 branch. Perhaps it's related: == FAIL: test_ucs4 (test.test_unicode.UnicodeTest) -- Traceback (most recent call last): File /home/antoine/cpython/27/Lib/test/test_unicode.py, line 941, in test_ucs4 self.assertEqual(x, y) AssertionError: '\\udbc0\\udc00' != '\\U0010' -- nosy: +pitrou status: closed - open ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/mal%40egenix.com -- nosy: +lemburg ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
Changes by Marc-Andre Lemburg m...@egenix.com: -- resolution: fixed - status: closed - open ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
Antoine Pitrou pit...@free.fr added the comment: * Unicode objects are NUL-terminated, but only very external APIs rely on this (e.g. code using the Windows Unicode API). Please don't make the code in unicodeobject.c itself rely on this subtle detail. That's wishful thinking, don't you think? *I* am not making code in unicodeobject.c rely on this. It has been so for years, long before I was here. You should check who made that design decision in the first place instead of putting the blame on me. Besides, the fact that external APIs rely on it make it much more unchangeable than if it were an implementation detail. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
Changes by Ezio Melotti ezio.melo...@gmail.com: -- nosy: +ezio.melotti ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
Marc-Andre Lemburg m...@egenix.com added the comment: Antoine Pitrou wrote: Antoine Pitrou pit...@free.fr added the comment: * Unicode objects are NUL-terminated, but only very external APIs rely on this (e.g. code using the Windows Unicode API). Please don't make the code in unicodeobject.c itself rely on this subtle detail. That's wishful thinking, don't you think? *I* am not making code in unicodeobject.c rely on this. It has been so for years, long before I was here. You should check who made that design decision in the first place instead of putting the blame on me. I'm not blaming you for this. However, I don't want more code to rely on this behavior. The NUL-termination has never been documented and my decision to use NUL-termination on the PyUnicodeObject buffers was merely a safety measure. Besides, the fact that external APIs rely on it make it much more unchangeable than if it were an implementation detail. It's an undocumented implementation detail. We can certainly deprecate it's use using the standard approach we have for this. But all that is off-topic for this ticket, since codecs operate on Py_UNICODE* buffers together with a size parameter and relying on those buffers being NUL-terminated is bound to cause problems. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
Antoine Pitrou pit...@free.fr added the comment: But all that is off-topic for this ticket, since codecs operate on Py_UNICODE* buffers together with a size parameter and relying on those buffers being NUL-terminated is bound to cause problems. That's right. Then perhaps a fixed patch can be uploaded, if someone cares enough. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
Matt Giuca matt.gi...@gmail.com added the comment: OK, I finally had time to review this issue again. Firstly, granted the original fix broke a test case, shouldn't we figure out why it broke and fix it, rather than just revert the change and continue relying on this tenuous assumption? Surely it's best to have as little code relying on it as possible. Secondly, please have a look at my patch again. It wasn't committed properly -- no offense to Georg, it's an honest mistake! My patch was: --- Objects/unicodeobject.c (revision 81539) +++ Objects/unicodeobject.c (working copy) @@ -3065,7 +3065,7 @@ } #else /* Map UTF-16 surrogate pairs to '\U00xx' */ -else if (ch = 0xD800 ch 0xDC00) { +else if (ch = 0xD800 ch 0xDC00 size 0) { Py_UNICODE ch2; Py_UCS4 ucs; The commit made in r83418 by Georg Brandl (and similarly r83395 in py3k): http://svn.python.org/view/python/branches/release27-maint/Objects/unicodeobject.c?r1=82980r2=83418 --- Objects/unicodeobject.c (revision 83417) +++ Objects/unicodeobject.c (revision 83418) @@ -3067,7 +3067,7 @@ ch2 = *s++; size--; -if (ch2 = 0xDC00 ch2 = 0xDFFF) { +if (ch2 = 0xDC00 ch2 = 0xDFFF size) { ucs = (((ch 0x03FF) 10) | (ch2 0x03FF)) + 0x0001; *p++ = '\\'; *p++ = 'U'; @@ -3316,7 +3316,7 @@ ch2 = *s++; size--; -if (ch2 = 0xDC00 ch2 = 0xDFFF) { +if (ch2 = 0xDC00 ch2 = 0xDFFF size) { ucs = (((ch 0x03FF) 10) | (ch2 0x03FF)) + 0x0001; *p++ = '\\'; *p++ = 'U'; I put the size check on the first character of the surrogate pair; in the committed version the size check was on the second character (after the size variable is decremented), causing it to break out of that branch too early in some cases. Moving the size check to the outer if block fixes the test breakage. PS. Good find on the second copy of that code in the PyUnicode_EncodeRawUnicodeEscape function. I've attached a new patch which fixes both functions instead of just the unicodeescape_string function. Passes all test cases on UCS2 build of the 2.7 branch. -- Added file: http://bugs.python.org/file18322/unicode-range-check2.patch ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
Georg Brandl ge...@python.org added the comment: Applied in r83395. Thanks! -- nosy: +georg.brandl resolution: - fixed status: open - closed ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
Antoine Pitrou pit...@free.fr added the comment: Well, the patch was technically useless since, as mentioned, unicode strings are terminated by a NUL character by design. Anyway, I now get the following error on the 2.7 branch. Perhaps it's related: == FAIL: test_ucs4 (test.test_unicode.UnicodeTest) -- Traceback (most recent call last): File /home/antoine/cpython/27/Lib/test/test_unicode.py, line 941, in test_ucs4 self.assertEqual(x, y) AssertionError: '\\udbc0\\udc00' != '\\U0010' -- nosy: +pitrou status: closed - open ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
Georg Brandl ge...@python.org added the comment: You're right. Reverted in r83444 and merging back, and I'll also remove the XXX is this needed from 2.7. -- status: open - closed ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8821] Range check on unicode repr
New submission from Matt Giuca matt.gi...@gmail.com: In unicodeobject.c's unicodeescape_string, in UCS2 builds, if the last character of the string is the start of a UTF-16 surrogate pair (between '\ud800' and '\udfff'), there is a slight overrun problem. For example: repr(u'abcd\ud800') Upon reading ch = 0xd800, the test (ch = 0xD800 ch 0xDC00) succeeds, and it then reads ch2 = *s++. Note that preceding this line, s points at one character past the end of the string, so the value read will be garbage. I imagine that unless it falls on a segment boundary, the worst that could happen is the character '\ud800' is interpreted as some other wide character. Nevertheless, this is bad. Note that *technically* this is never bad, because _PyUnicode_New allocates an extra character and sets it to '\u', and thus the above example will always set ch2 to 0, and it will behave correctly. But this is a tenuous thing to rely on, especially given the comment above _PyUnicode_New: /* We allocate one more byte to make sure the string is Ux terminated -- XXX is this needed ? */ I thought about removing that XXX, but I'd rather fix the problem. Therefore, I have attached a patch which does a range check before reading ch2: --- Objects/unicodeobject.c (revision 81539) +++ Objects/unicodeobject.c (working copy) @@ -3065,7 +3065,7 @@ } #else /* Map UTF-16 surrogate pairs to '\U00xx' */ -else if (ch = 0xD800 ch 0xDC00) { +else if (ch = 0xD800 ch 0xDC00 size 0) { Py_UNICODE ch2; Py_UCS4 ucs; Also affects Python 3. -- components: Unicode files: unicode-range-check.patch keywords: patch messages: 106506 nosy: mgiuca priority: normal severity: normal status: open title: Range check on unicode repr type: behavior versions: Python 2.6, Python 2.7, Python 3.1, Python 3.2, Python 3.3 Added file: http://bugs.python.org/file17465/unicode-range-check.patch ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8821 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com