[issue10254] unicodedata.normalize('NFC', s) regression
Victor Ruiz vic...@ninibe.com added the comment: Hi, I think I've come across what seems to be another flavor of this issue. The following string will cause a crash in some interpreters. text = u\u062d\u064e\u064a\u0651\u064b\u0627\u060c\u0648\u064e\u064a\u064e\u062d\u0650\u0642\u0651\u064e \u0627\u0644\u0652\u0642\u064e\u0648\u0652\u0644\u064f \u0648\u064e\u0644\u0651\u064e\u064a\u0652\u062a\u064f\u0643\u064f\u0645\u064e\u0627\u060c \u0648\u064e\u0625\u0650\u0646\u0652 \u0623\u064e\u0628\u064e\u064a\u0652\u062a\u064f\u0645\u064e\u0627 \u0623\u064e\u0646\u0652 \u062a\u064f\u0642\u0650\u0631\u0651\u064e\u0627 \u0628\u0650\u0627\u0644\u0625\u0650\u0633\u0652\u0644\u0627\u064e\u0645\u0650 \u0641\u064e\u0625\u0650\u0646\u0651\u064e \u0648\u064e\u062e\u064e\u064a\u0652\u0644\u0650\u064a \u062a\u064e\u062d\u064f\u0644\u0651\u064f \u0628\u0650\u0633\u064e\u0627\u062d\u064e\u062a\u0650\u0643\u064f\u0645\u064e\u0627\u060c \u0648\u064e\u062a\u064e\u0638\u0652\u0647\u064e\u0631\u064f \u0646\u064f\u0628\u064f\u0648\u0651\u064e\u062a\u0650\u064a \u0645\u064f\u0644\u0652\u0643\u0650\u0643\u064f\u0645\u064e\u0627.\u0648\u0643\u062a\u0628 \u0623\u0628\u064a\u0651\u064f \u0628\u0646 \u0643\u0639\u0628 \u0627\u0644\u0652\u0631\u064e\u0651\u062d\u0650\u064a\u0652\u0645\u060c \u0645\u0650\u0646 \u0645\u064f\u062d\u064e\u0645\u064e\u0651\u062f \u0631\u064e\u0633\u064f\u0648\u0652\u0644 \u0627\u0644\u0652\u0644\u064e\u0651\u0647 \u0625\u0650\u0644\u064e\u0649 \u0627\u0644\u0652\u0645\u064f\u0646\u0652\u0630\u0650\u0631 \u0628\u0652\u0646 \u0633\u064e\u0627\u0648\u0650\u064a \u0633\u064e\u0644\u064e\u0627\u0645 \u0639\u064e\u0644\u064e\u064a\u0652\u0643 \u0641\u064e\u0625\u0650\u0646\u0650\u0651\u064a \u0623\u064e\u062d\u0652\u0645\u064e\u062f \u0627\u0644\u0652\u0644\u064e\u0651\u0647 \u0625\u0650\u0644\u064e\u064a\u0652\u0643 \u0627\u0644\u064e\u0651\u0630\u0650\u064a\u0644\u064e\u0627 \u0625\u0650\u0644\u064e\u0647 \u063a\u064e\u064a\u0652\u0631\u064f\u0647 
\u0648\u064e\u0623\u064e\u0634\u0652\u0647\u064e\u062f \u0623\u064e\u0646 \u0644\u064e\u0627 \u0625\u0650\u0644\u064e\u0647 \u0625\u0650\u0644\u064e\u0651\u0627 \u0627\u0644\u0652\u0644\u064e\u0651\u0647

There is a sample script attached. This issue does not seem to be related to the python version itself but rather to its compilation. Since the exact same version crashes in OSX but not Ubuntu linux for example.

ERROR - Python 2.7.1 (r271:86832, Apr 9 2011, 17:12:59) [GCC 4.2.1 (Apple Inc. build 5664)] on darwin
OK - Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53) [GCC 4.5.2] on linux2

Default version 2.6.6 on Debian squeeze should crash too for example.

This is a trace of the error in 2.7.1 OSX (this interpreter passes the test posted on msg124450):

Process: Python [78170]
Path: /opt/local/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python
Identifier: Python
Version: ??? (???)
Code Type: X86-64 (Native)
Parent Process: bash [77126]
Date/Time: 2011-09-22 23:20:48.892 +0200
OS Version: Mac OS X 10.6.8 (10K549)
Report Version: 6
Interval Since Last Report: 88509 sec
Crashes Since Last Report: 135
Per-App Crashes Since Last Report: 134
Anonymous UUID: F5DD44CE-A8F4-474C-BA10-2B21B4C92C1E
Exception Type: EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: 0x000d, 0x
Crashed Thread: 0 Dispatch queue: com.apple.main-thread

Thread 0 Crashed: Dispatch queue: com.apple.main-thread
0 org.python.python 0x000100086b33 _PyUnicode_Resize + 51
1 unicodedata.so 0x000100601bff nfc_nfkc + 335
2 unicodedata.so 0x000100601f2a unicodedata_normalize + 154
3 org.python.python 0x0001000bfccd PyEval_EvalFrameEx + 20797
4 org.python.python 0x0001000c1f16 PyEval_EvalCodeEx + 2118
5 org.python.python 0x0001000c2036 PyEval_EvalCode + 54
6 org.python.python 0x0001000e6a5e PyRun_FileExFlags + 174
7 org.python.python 0x0001000e6d19 PyRun_SimpleFileExFlags + 489
8 org.python.python 0x0001000fd6fc Py_Main + 2940
9 org.python.python 0x00010f14 0x1 + 3860

Thread 0 crashed with X86 Thread State (64-bit):
rax: 0x0644062700200627 rbx: 0x000100373d9c rcx: 0x003c rdx: 0x000a
rdi: 0x7fff5fbff078 rsi: 0x80169ba9 rbp: 0x7fff5fbfefa0 rsp: 0x7fff5fbfef80
r8: 0x004e r9: 0x000a r10: 0x000100373db8 r11: 0x000100373dac
r12: 0x7fff5fbff078 r13: 0x80169ba9 r14: 0x80169ba9 r15: 0x00a1
rip: 0x000100086b33 rfl: 0x00010206 cr2: 0x00010066a2f4

Binary Images: 0x1 -0x10fff
[issue10254] unicodedata.normalize('NFC', s) regression
Changes by Alexander Belopolsky alexander.belopol...@gmail.com:
-- status: closed -> open
___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue10254 ___
___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10254] unicodedata.normalize('NFC', s) regression
Alexander Belopolsky alexander.belopol...@gmail.com added the comment: This new data does not crash Python 2.7.2, so I assume the issue has been fixed. Re-closing.
-- status: open -> closed
[issue10254] unicodedata.normalize('NFC', s) regression
STINNER Victor victor.stin...@haypocalc.com added the comment:

> This new data does not crash Python 2.7.2, so I assume the issue has been fixed.

Yes, the bug was already fixed in branch 2.7 by the SVN commit r87541:

changeset: 67185:54f1d5651555
branch: 2.7
parent: 67159:2d09af4c137c
user: Alexander Belopolsky alexander.belopol...@gmail.com
date: Tue Dec 28 15:47:56 2010 +
files: Lib/test/test_normalization.py Lib/test/test_unicodedata.py Modules/unicodedata.c
description: Merged revisions 87442 via svnmerge from svn+ssh://python...@svn.python.org/python/branches/py3k

r87442 | alexander.belopolsky | 2010-12-22 21:27:37 -0500 (Wed, 22 Dec 2010) | 1 line

Issue #10254: Fixed a crash and a regression introduced by the implementation of PRI 29.

This fix is part of Python 2.7.2, but not of 2.7.2.
[issue10254] unicodedata.normalize('NFC', s) regression
STINNER Victor victor.stin...@haypocalc.com added the comment:

> This fix is part of Python 2.7.2, but not of 2.7.2.

... but not of 2.7.1.
[issue10254] unicodedata.normalize('NFC', s) regression
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: Committed backports: r87540 (3.1), r87541 (2.7), r87546 (2.6).
-- resolution: -> fixed
stage: commit review -> committed/rejected
status: open -> closed
versions: +Python 3.2
[issue10254] unicodedata.normalize('NFC', s) regression
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: Committed to py3k in revision 87442.
-- versions: -Python 3.2
[issue10254] unicodedata.normalize('NFC', s) regression
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: In the new patch, issue10254b.diff, I've added a test that would crash unpatched code:

>>> unicodedata.normalize('NFC', 'C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸Ç')
Segmentation fault

Martin, I still feel uneasy about the fixed size of the skipped buffer. It is not obvious that skipped combining characters always get removed from the buffer before the next starter is processed. I would really like another pair of eyes to look at this code before it goes in, especially to 2.6.

Victor, IIRC, you did some stress testing on random data. I wonder if you could test this code after tightening the assert to cskipped < 4. (The current theory is that this should be enough.)
-- Added file: http://bugs.python.org/file20131/issue10254b.diff
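The crash case quoted above can be restated as a small regression check. This is only a sketch: it assumes the test string is twenty copies of C followed by U+0338 (COMBINING LONG SOLIDUS OVERLAY) and then a C with cedilla, and the names `stressed` and `expected` are illustrative rather than taken from issue10254b.diff.

```python
import unicodedata

# Assumption: the string in the message is 20 x (C + U+0338) followed by
# a C with cedilla; here the cedilla is given in decomposed form.
stressed = 'C\u0338' * 20 + 'C\u0327'

# C + U+0338 has no precomposed form, so only the last pair composes.
expected = 'C\u0338' * 20 + '\u00c7'

result = unicodedata.normalize('NFC', stressed)
print(result == expected)
```

On an interpreter that includes the fix this should run without crashing; an unpatched build would be expected to fail inside nfc_nfkc instead.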
[issue10254] unicodedata.normalize('NFC', s) regression
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: Attached patch, issue10254a.diff, adds the OP's cases to test_unicodedata and changes the code as I suggested in msg124173 because ISTM that comb >= comb1 matches the pr-29 definition:

D2'. In any character sequence beginning with a starter S, a character C is blocked from S if and only if there is some character B between S and C, and either B is a starter or it has the same or higher combining class as C.
http://www.unicode.org/review/pr-29.html

Unfortunately, all tests pass with either comb >= comb1 or comb == comb1, so before I commit, I would like to figure out the test case that would properly exercise this code.
-- Added file: http://bugs.python.org/file20120/issue10254a.diff
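The D2' definition quoted above translates almost word for word into a predicate. The following is a minimal sketch in Python, not the C code in Modules/unicodedata.c; the function name `is_blocked` and its index-based signature are assumptions made for illustration.

```python
import unicodedata

def is_blocked(text, s_index, c_index):
    """Return True if text[c_index] is blocked from the starter at s_index.

    Follows D2': C is blocked if some character B strictly between S and C
    is a starter (class 0) or has the same or higher combining class as C.
    """
    c_class = unicodedata.combining(text[c_index])
    for b in text[s_index + 1:c_index]:
        b_class = unicodedata.combining(b)
        if b_class == 0 or b_class >= c_class:
            return True
    return False

# U+0323 (class 220) does not block U+0301 (class 230) ...
print(is_blocked('a\u0323\u0301', 0, 2))
# ... but U+0300 (class 230) does.
print(is_blocked('a\u0300\u0301', 0, 2))
```

The `b_class >= c_class` comparison is the "same or higher" wording of D2', which is the comb >= comb1 form discussed in the message.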
[issue10254] unicodedata.normalize('NFC', s) regression
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Mon, Dec 20, 2010 at 2:50 PM, Alexander Belopolsky rep...@bugs.python.org wrote:
> ..
> Unfortunately, all tests pass with either comb >= comb1 or comb == comb1, so before I commit, I would like to figure out the test case that would properly exercise this code.

After some more thought, I've realized that the comb > comb1 case is impossible if comb1 != 0 (due to the canonical reordering step) and if comb1 == 0, the comb1 to comb comparison is not reached. In other words, it does not matter whether the comparison is done as Martin suggested in msg120018 or as it is done in the latest patch.

The fact that the comb > comb1 case is impossible if comb1 != 0 is actually mentioned in PR 29 itself. See Table 1: Differences at http://www.unicode.org/review/pr-29.html.
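The argument about canonical reordering can be checked empirically. The sketch below assumes that NFD output is what the composition step sees; the helper name `nonzero_classes_are_sorted` is made up for this illustration.

```python
import unicodedata

def nonzero_classes_are_sorted(text):
    """Check that, after NFD, no combining mark follows a mark of higher class.

    In other words: for adjacent characters with classes (comb, comb1),
    the case comb > comb1 never occurs when comb1 != 0.
    """
    decomposed = unicodedata.normalize('NFD', text)
    classes = [unicodedata.combining(ch) for ch in decomposed]
    for comb, comb1 in zip(classes, classes[1:]):
        if comb1 != 0 and comb > comb1:
            return False
    return True

# U+0301 (class 230) is written before U+0323 (class 220);
# canonical reordering swaps them.
print(ascii(unicodedata.normalize('NFD', 'a\u0301\u0323')))
print(nonzero_classes_are_sorted('a\u0301\u0323'))
```

Because the marks between two starters are sorted by class, a later mark can only have the same or a higher class than an earlier one, which is why the two comparisons behave identically.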
[issue10254] unicodedata.normalize('NFC', s) regression
Martin v. Löwis mar...@v.loewis.de added the comment:

On 17.12.2010 at 01:56, STINNER Victor wrote:
> STINNER Victor victor.stin...@haypocalc.com added the comment:
>
> Ooops, sorry. I just applied the patch suggested by Marc-Andre Lemburg in msg22885 (#1054943). As the patch worked for the examples given in Unicode PRI 29 and the test suite passed, it was enough for me.
>
> I don't understand the normalization code, so I don't know how to fix it.

So lacking a new patch, I think we should revert the existing change for now.
[issue10254] unicodedata.normalize('NFC', s) regression
Martin v. Löwis mar...@v.loewis.de added the comment:

> So lacking a new patch, I think we should revert the existing change for now.

Oops, I missed that Alexander has proposed a patch.
[issue10254] unicodedata.normalize('NFC', s) regression
Martin v. Löwis mar...@v.loewis.de added the comment:

> The logic suggested by Martin in msg120018 looks right to me, but the whole code seems to be unnecessarily complex. (And comb1==comb may need to be changed to comb1<=comb.) I don't understand why linear search through skipped array is needed. At the very least instead of adding their positions to the skipped list, used combining characters can be replaced by a non-character to be later skipped.

The skipped array keeps track of which characters have been integrated into a base character, as they must not appear in the output. Assume you have a sequence B,C,N,C,N,B (B: base character, C: combined, N: not combined). You need to remember not to output C, whereas you still need to output N. I don't think replacing them with a non-character can work: which one would you choose (that cannot also appear in the input)?

The worst case (wrt. cskipped) is the maximum number of characters that can get combined into a single base character. It used to be (and I hope still is) 20 (decomposition of U+FDFA).
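A concrete instance of the B,C,N,C pattern may help here. The example string below is an assumption chosen for illustration (it does not appear in the thread): a base letter, a mark that combines, a mark that does not, and a second mark that combines.

```python
import unicodedata

# B: 'a'
# C: U+0323 COMBINING DOT BELOW   (combines with the base)
# N: U+0324 COMBINING DIAERESIS BELOW (no precomposed form, must be kept)
# C: U+0302 COMBINING CIRCUMFLEX ACCENT (combines with the new base)
source = 'a\u0323\u0324\u0302'

composed = unicodedata.normalize('NFC', source)
for ch in composed:
    print('U+%04X' % ord(ch), unicodedata.name(ch))
```

Both C marks vanish into the base character while the N mark survives in the output, which is the bookkeeping the skipped array exists for.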
[issue10254] unicodedata.normalize('NFC', s) regression
Martin v. Löwis mar...@v.loewis.de added the comment:

> Passing Part3 tests and not crashing on crash.py is probably good enough for a commit, but I don't have a proof that length 20 skipped buffer is always enough.

I would agree with that. I still didn't have time to fully review the patch, but assuming it fixes the cases in msg119995, we should proceed with it.
[issue10254] unicodedata.normalize('NFC', s) regression
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Fri, Dec 17, 2010 at 3:47 AM, Martin v. Löwis rep...@bugs.python.org wrote:
> ..
> The worst case (wrt. cskipped) is the maximum number of characters that can get combined into a single base character. It used to be (and I hope still is) 20 (decomposition of U+FDFA).

The C forms (NFC and NFKC) do canonical composition and U+FDFA is a compatibility composite. (BTW, makeunicodedata.py checks that maximum decomposed length of a character is 19, but it would be better if it would compute and define a named constant, say MAXDLENGTH, to be used instead of literal 20.) As far as I (and a two-line script) can tell the maximum length of a canonical decomposition of a character is 4.
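The two-line script mentioned above is not included in the thread. A rough equivalent, written here as an assumption about what such a check could look like, normalizes every code point on its own and records the longest result. The exact numbers depend on the Unicode database bundled with the interpreter, so none are hard-coded.

```python
import sys
import unicodedata

def max_decomposition_length(form):
    """Longest result of normalizing a single code point with `form`."""
    longest = 0
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogates, they are not characters
        longest = max(longest, len(unicodedata.normalize(form, chr(cp))))
    return longest

canonical_max = max_decomposition_length('NFD')       # canonical only
compatibility_max = max_decomposition_length('NFKD')  # includes U+FDFA
print(canonical_max, compatibility_max)
```

Counting full decompositions (NFD and NFKD) rather than single-step mappings is a design choice of this sketch; it measures how far one input character can expand before composition starts.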
[issue10254] unicodedata.normalize('NFC', s) regression
Martin v. Löwis mar...@v.loewis.de added the comment:

> The C forms (NFC and NFKC) do canonical composition and U+FDFA is a compatibility composite. (BTW, makeunicodedata.py checks that maximum decomposed length of a character is 19, but it would be better if it would compute and define a named constant, say MAXDLENGTH, to be used instead of literal 20.) As far as I (and a two-line script) can tell the maximum length of a canonical decomposition of a character is 4.

Even better - so allowing for 20 characters should be safe.
[issue10254] unicodedata.normalize('NFC', s) regression
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Fri, Dec 17, 2010 at 2:08 PM, Martin v. Löwis rep...@bugs.python.org wrote:
> ..
>> As far as I (and a two-line script) can tell the maximum length of a canonical decomposition of a character is 4.
>
> Even better - so allowing for 20 characters should be safe.

I don't disagree, but the number of break and continue statements before cskipped++ makes me nervous. That said, I am going to add the test cases from the first post to test_unicodedata (I think it is a better place than test_normalization because the latter is skipped by default) and commit. Improving the algorithm is a separate issue.
[issue10254] unicodedata.normalize('NFC', s) regression
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: Adding an assert as shown in the diff below makes it easy to reproduce the crash in the py3k branch:

$ ./python.exe crash.py
Assertion failed: (cskipped < 20), function nfc_nfkc, file Modules/unicodedata.c, line 714.
Abort trap

I am attaching jhalcrow's code as crash.py

===
--- Modules/unicodedata.c (revision 87322)
+++ Modules/unicodedata.c (working copy)
@@ -711,6 +711,7 @@
     /* Replace the original character. */
     *i = code;
     /* Mark the second character unused. */
+    assert(cskipped < 20);
     skipped[cskipped++] = i1;
     i1++;
     f = find_nfc_index(self, nfc_first, *i);

-- nosy: +belopolsky
Added file: http://bugs.python.org/file20080/crash.py
[issue10254] unicodedata.normalize('NFC', s) regression
STINNER Victor victor.stin...@haypocalc.com added the comment: Ooops, sorry. I just applied the patch suggested by Marc-Andre Lemburg in msg22885 (#1054943). As the patch worked for the examples given in Unicode PRI 29 and the test suite passed, it was enough for me. I don't understand the normalization code, so I don't know how to fix it.
[issue10254] unicodedata.normalize('NFC', s) regression
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: The logic suggested by Martin in msg120018 looks right to me, but the whole code seems to be unnecessarily complex. (And comb1==comb may need to be changed to comb1<=comb.) I don't understand why linear search through skipped array is needed. At the very least instead of adding their positions to the skipped list, used combining characters can be replaced by a non-character to be later skipped.

A better algorithm should be able to avoid the whole issue of skipping by properly computing the length of the decomposed character. See internalCompose() at http://www.unicode.org/reports/tr15/Normalizer.java. I'll try to come up with a patch.
-- assignee: -> belopolsky
[issue10254] unicodedata.normalize('NFC', s) regression
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: Attached patch, issue10254.diff, is essentially Martin's code from msg120018 and Part3 tests from NormalizationTest.txt. Since this bug exposes a buffer overflow condition, I think it qualifies as a security issue, so I am adding 2.6 to versions.

Passing Part3 tests and not crashing on crash.py is probably good enough for a commit, but I don't have a proof that length 20 skipped buffer is always enough. As the next step, I would like to consider an alternative algorithm that would not require a skipped buffer.
-- keywords: +patch
stage: -> commit review
versions: +Python 2.6
Added file: http://bugs.python.org/file20089/issue10254.diff
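For readers unfamiliar with NormalizationTest.txt: each data line carries five semicolon-separated columns of hexadecimal code points, and the file header states invariants that a conforming implementation must satisfy. The sketch below checks only the NFC invariants for one line. The helper names and the sample line are assumptions written by hand in the file's format; this is not the code in Lib/test/test_normalization.py.

```python
import unicodedata

def parse_line(line):
    """Split one data line into its five columns c1..c5 as strings."""
    data = line.split('#', 1)[0]          # drop the trailing comment
    fields = data.split(';')[:5]
    return [''.join(chr(int(cp, 16)) for cp in field.split())
            for field in fields]

def nfc_invariants_hold(line):
    """NFC invariants: c2 == NFC(c1) == NFC(c2) == NFC(c3),
    and c4 == NFC(c4) == NFC(c5)."""
    c1, c2, c3, c4, c5 = parse_line(line)

    def nfc(s):
        return unicodedata.normalize('NFC', s)

    return (c2 == nfc(c1) == nfc(c2) == nfc(c3)
            and c4 == nfc(c4) == nfc(c5))

# Hand-written sample in the file's format (D WITH DOT ABOVE).
sample = '1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE'
print(nfc_invariants_hold(sample))
```

Running the same predicate over every line of the Part3 section would be one way to exercise the PRI 29 cases, assuming the file is available locally.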
[issue10254] unicodedata.normalize('NFC', s) regression
Jonathan Halcrow jonathan.halc...@gmail.com added the comment: I think I've come across a related problem. I am experiencing a segfault when NFC-normalizing a certain string [1]. The crash occurs with 2.7.1 in OS X (built from source with homebrew). Here is the backtrace:

#0 0x0025a96e in _PyUnicode_Resize ()
#1 0x00601673 in nfc_nfkc ()
#2 0x00601bb7 in unicodedata_normalize ()
#3 0x0029834b in PyEval_EvalFrameEx ()
#4 0x00299f13 in PyEval_EvalCodeEx ()
#5 0x0029a0fe in PyEval_EvalCode ()
#6 0x002bd5f0 in PyRun_FileExFlags ()
#7 0x002be430 in PyRun_SimpleFileExFlags ()
#8 0x002d5bd6 in Py_Main ()
#9 0x1f8f in _start ()
#10 0x1ebd in start ()

[1] http://pastebin.com/cfNd2QEz
-- nosy: +jhalcrow
[issue10254] unicodedata.normalize('NFC', s) regression
Antoine Pitrou pit...@free.fr added the comment: I can reproduce the crash under 2.7, but not 2.6 or 3.x here. So it might be a separate issue.
[issue10254] unicodedata.normalize('NFC', s) regression
Antoine Pitrou pit...@free.fr added the comment: After a bit of debugging, the crash is due to the skipped array being overflowed in nfc_nfkc() in unicodedata.c. cskipped goes up to 21 while the array only has 20 entries. This happens in all branches (but only crashes in 2.7 right now for probably unimportant reasons). And the problem was indeed introduced by Victor's patch in issue1054943. Just before, cskipped would only go up to 1.
-- priority: normal -> high
type: behavior -> crash
[issue10254] unicodedata.normalize('NFC', s) regression
Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com:
-- nosy: +Arfrever
[issue10254] unicodedata.normalize('NFC', s) regression
Changes by Ezio Melotti ezio.melo...@gmail.com:
-- nosy: +ezio.melotti
[issue10254] unicodedata.normalize('NFC', s) regression
New submission from Merlijn van Deen valhall...@gmail.com:

Summary: Somewhere between 2.6.5 r79063 and 3.1 r79147 a regression in the unicode NFC normalization has been introduced. This regression leads to bot edit wars on wikipedia [1]. It is reproducible with a simple script [2]. Mediawiki/PHP [3] and C# [4] test scripts both show the old behaviour, which leads me to believe this is a python bug. A search for older bugs shows bug #1054943 [5], which has commits in the suspected region.

The regression causes certain NFC-normalized strings to become mangled. Because of the wide range of unicode strings on wikipedia, this causes several problems. Details of those can be found at [1].

Example strings include (these strings have been NFC-normalized by mediawiki):
* u'Li\u030dt-s\u1e73\u0301'
* u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917'
* u'\u0915\u093f\u0930\u094d\u0917\u093f\u091c\u093c\u0938\u094d\u0924\u093e\u0928'

The bug can be shown simply with unicodedata.normalize('NFC', s) == s where s is one of the strings above. This will return True on older python versions, False on newer versions. There is a script available that does this [2].

The bug has been tested on the following machines and python versions. OK indicates the bug is not present, FAIL indicates the bug is present.

Host: SunOS willow 5.10 Generic_142910-17 i86pc i386 i86pc Solaris
'2.3.3 (#1, Dec 16 2004, 14:38:56) [C]' OK
'2.6.5 (r265:79063, Jul 10 2010, 17:50:38) [C]' OK
'2.7 (r27:82500, Aug 5 2010, 04:28:45) [C]' FAIL
'3.1.2 (r312:79147, Sep 24 2010, 05:34:04) [C]' FAIL

Host: Linux nightshade 2.6.26-2-amd64 #1 SMP Thu Sep 16 15:56:38 UTC 2010 x86_64 GNU/Linux
'2.4.6 (#2, Jan 24 2010, 12:20:41) \n[GCC 4.3.2]' OK
'2.5.2 (r252:60911, Jan 24 2010, 17:44:40) \n[GCC 4.3.2]' OK
'2.6.4+ (r264:75706, Feb 16 2010, 05:11:28) \n[GCC 4.4.3]' OK

Host: Linux dorthonion 2.6.22.18-co-0.7.4 #1 PREEMPT Wed Apr 15 18:57:39 UTC 2009 i686 GNU/Linux
'2.5.4 (r254:67916, Jan 20 2010, 21:44:03) \n[GCC 4.3.3]' OK
'2.6.2 (release26-maint, Apr 19 2009, 01:56:41) \n[GCC 4.3.3]' OK
'3.0.1+ (r301:69556, Apr 15 2009, 15:59:22) \n[GCC 4.3.3]' OK

[1] https://sourceforge.net/tracker/index.php?func=detail&aid=3081100&group_id=93107&atid=603138# ; http://fr.wikipedia.org/w/index.php?title=Mark_Zuckerberg&action=historysubmit&diff=57753004&oldid=57751674
[2] http://pastebin.ca/1977285 (py2.x), http://pastebin.ca/1977287 (py3.x)
[3] http://pastebin.ca/1977292 (PHP, placed in http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/normal/)
[4] http://pastebin.ca/1977261 (C#)
[5] http://bugs.python.org/issue1054943#

-- components: Unicode
messages: 119995
nosy: valhallasw
priority: normal
severity: normal
status: open
title: unicodedata.normalize('NFC', s) regression
type: behavior
versions: Python 2.7, Python 3.1
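The check described in the report fits in a few lines. The pastebin scripts are not reproduced in the thread, so the following is a stand-in written for Python 3 under that assumption; the helper name `is_nfc_stable` is illustrative.

```python
import unicodedata

# The three example strings from the report, all already in NFC.
samples = [
    'Li\u030dt-s\u1e73\u0301',
    '\u092e\u093e\u0930\u094d\u0915 '
    '\u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917',
    '\u0915\u093f\u0930\u094d\u0917\u093f\u091c\u093c'
    '\u0938\u094d\u0924\u093e\u0928',
]

def is_nfc_stable(text):
    """An NFC-normalized string must come back unchanged."""
    return unicodedata.normalize('NFC', text) == text

for s in samples:
    print(ascii(s), 'OK' if is_nfc_stable(s) else 'FAIL')
```

An interpreter with the regression would be expected to report FAIL for these strings, and one with the fix should report OK.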
[issue10254] unicodedata.normalize('NFC', s) regression
Merlijn van Deen valhall...@gmail.com added the comment: Please note: The bug might very well be present in python 3.2 and 3.3. However, I do not have these versions installed, so I cannot confirm this.
[issue10254] unicodedata.normalize('NFC', s) regression
Antoine Pitrou pit...@free.fr added the comment: Confirmed on Python 3.2.
-- nosy: +haypo, loewis, pitrou
versions: +Python 3.2
[issue10254] unicodedata.normalize('NFC', s) regression
Martin v. Löwis mar...@v.loewis.de added the comment: The change from issue1054943 is indeed bogus. As written, the code will happily run over starters, even though a blocked start means that subsequent characters can't possibly be combinable. That way, the code manages to combine, in 'Li\u030dt-s\u1e73\u0301', the final U+0301 with the i - even though there are several starters in-between.

I think the code should work like this:

if comb!=0 and comb1==0:
    # starter after character with higher class:
    # not combinable, and all subsequent characters will be blocked
    # as well
    break
if comb!=0 and comb1==comb:
    # blocked combining character, continue searching
    i1++
    continue
# candidate pair, check whether *i and *i1 are combinable

It's unfortunate that the patch had been backported to 2.6.6; we can't fix it there anymore.
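The difference between an adjacent and a blocked U+0301 is easy to observe from Python. This is only an illustration of the expected behaviour on a fixed interpreter, not a rendering of the C code; the variable names are made up.

```python
import unicodedata

# Directly after the starter, U+0301 composes with i into U+00ED.
adjacent = unicodedata.normalize('NFC', 'i\u0301')

# In the reported string several starters (t, -, s, U+1E73) sit between
# the i and the final U+0301, so the mark has to stay where it is.
separated = unicodedata.normalize('NFC', 'Li\u030dt-s\u1e73\u0301')

print(ascii(adjacent))
print(ascii(separated))
```

With the regression, the second call is what reportedly produced a mangled string in which the trailing accent had migrated to the i.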
[issue10254] unicodedata.normalize('NFC', s) regression
Marc-Andre Lemburg m...@egenix.com added the comment:

Martin v. Löwis wrote:
> It's unfortunate that the patch had been backported to 2.6.6; we can't fix it there anymore.

Why not? It looks a lot like a security fix.
-- nosy: +lemburg
[issue10254] unicodedata.normalize('NFC', s) regression
Martin v. Löwis mar...@v.loewis.de added the comment:

>> It's unfortunate that the patch had been backported to 2.6.6; we can't fix it there anymore.
>
> Why not? It looks a lot like a security fix.

Indeed, you could argue that. It's up to the 2.6 release manager, I guess.
[issue10254] unicodedata.normalize('NFC', s) regression
Changes by R. David Murray rdmur...@bitdance.com:
-- nosy: +barry