[issue10254] unicodedata.normalize('NFC', s) regression

2011-09-22 Thread Victor Ruiz
Victor Ruiz vic...@ninibe.com added the comment: Hi, I think I've come across what seems to be another flavor of this issue. The following string will cause a crash in some interpreters. text = u\u062d\u064e\u064a\u0651\u064b\u0627\u060c\u0648\u064e\u064a\u064e\u062d\u0650\u0642\u0651\u064e

[issue10254] unicodedata.normalize('NFC', s) regression

2011-09-22 Thread Alexander Belopolsky
Changes by Alexander Belopolsky alexander.belopol...@gmail.com: -- status: closed -> open ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue10254 ___

[issue10254] unicodedata.normalize('NFC', s) regression

2011-09-22 Thread Alexander Belopolsky
Alexander Belopolsky alexander.belopol...@gmail.com added the comment: This new data does not crash Python 2.7.2, so I assume the issue has been fixed. Re-closing. -- status: open -> closed

[issue10254] unicodedata.normalize('NFC', s) regression

2011-09-22 Thread STINNER Victor
STINNER Victor victor.stin...@haypocalc.com added the comment: "This new data does not crash Python 2.7.2, so I assume the issue has been fixed." Yes, the bug was already fixed in branch 2.7 by the SVN commit r87541: changeset: 67185:54f1d5651555 branch: 2.7 parent:

[issue10254] unicodedata.normalize('NFC', s) regression

2011-09-22 Thread STINNER Victor
STINNER Victor victor.stin...@haypocalc.com added the comment: "This fix is part of Python 2.7.2, but not of 2.7.2." ... but not of 2.7.1.

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-28 Thread Alexander Belopolsky
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: Committed backports: r87540 (3.1) r87541 (2.7) r87546 (2.6) -- resolution: -> fixed stage: commit review -> committed/rejected status: open -> closed versions: +Python 3.2

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-22 Thread Alexander Belopolsky
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: Committed to py3k in revision 87442. -- versions: -Python 3.2

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-21 Thread Alexander Belopolsky
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: In the new patch, issue10254b.diff, I've added a test that would crash unpatched code: unicodedata.normalize('NFC', 'C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸Ç') Segmentation fault Martin, I still feel uneasy about the
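For anyone who wants to check a current interpreter against the test input above, here is a minimal sketch. It assumes the input is twenty copies of 'C' followed by U+0338 and then a decomposed C-cedilla ('C' + U+0327); the variable names are only illustrative. Do not run it on an unpatched build, since that is exactly the case reported to segfault.

```python
import unicodedata

# Assumed spelling of the test input: 20 x ('C' + COMBINING LONG SOLIDUS
# OVERLAY), then 'C' + COMBINING CEDILLA.
decomposed = 'C\u0338' * 20 + 'C\u0327'

# 'C' + U+0338 has no precomposed form, so only the final pair should
# compose (to U+00C7) under NFC.
expected = 'C\u0338' * 20 + '\u00c7'

result = unicodedata.normalize('NFC', decomposed)
print(result == expected)
```

On a fixed interpreter the comparison should hold; the point of the input is the long run of uncombinable pairs ahead of the one pair that does compose.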

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-20 Thread Alexander Belopolsky
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: Attached patch, issue10254a.diff, adds the OP's cases to test_unicodedata and changes the code as I suggested in msg124173 because ISTM that comb >= comb1 matches the pr-29 definition: D2'. In any character sequence
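As a rough illustration of the PR-29 blocking rule referred to above (as I understand it: a character C is blocked from the starter S if some character B between them is a starter or has the same or higher combining class as C), here is a small Python sketch. The helper name is_blocked is hypothetical and is not part of unicodedata or of the patch; it only restates the definition.

```python
import unicodedata

def is_blocked(seq, i):
    """Return True if seq[i] is blocked from the starter seq[0].

    Hypothetical helper restating the D2' rule: blocked if any character
    strictly between seq[0] and seq[i] is a starter (combining class 0)
    or has a combining class >= that of seq[i].
    """
    ccc_c = unicodedata.combining(seq[i])
    for b in seq[1:i]:
        ccc_b = unicodedata.combining(b)
        if ccc_b == 0 or ccc_b >= ccc_c:
            return True
    return False

# PR-29 style case: U+0300 (class 230) sits between U+0B47 and U+0B3E
# (class 0), so U+0B3E is blocked and the sequence must not compose.
pr29 = '\u0b47\u0300\u0b3e'
print(is_blocked(pr29, 2))
print(unicodedata.normalize('NFC', pr29) == pr29)
```

The comparison is deliberately "same or higher" rather than strict equality, which is the difference being discussed in this message.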

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-20 Thread Alexander Belopolsky
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: On Mon, Dec 20, 2010 at 2:50 PM, Alexander Belopolsky rep...@bugs.python.org wrote: .. Unfortunately, all tests pass with either comb >= comb1 or comb == comb1, so before I commit, I would like to figure out the test

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Martin v . Löwis
Martin v. Löwis mar...@v.loewis.de added the comment: On 17.12.2010 01:56, STINNER Victor wrote: STINNER Victor victor.stin...@haypocalc.com added the comment: Ooops, sorry. I just applied the patch suggested by Marc-Andre Lemburg in msg22885 (#1054943). As the patch worked for the

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Martin v . Löwis
Martin v. Löwis mar...@v.loewis.de added the comment: "So lacking a new patch, I think we should revert the existing change for now." Oops, I missed that Alexander has proposed a patch.

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Martin v . Löwis
Martin v. Löwis mar...@v.loewis.de added the comment: The logic suggested by Martin in msg120018 looks right to me, but the whole code seems to be unnecessarily complex. (And comb1==comb may need to be changed to comb1=comb.) I don't understand why linear search through skipped array is

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Martin v . Löwis
Martin v. Löwis mar...@v.loewis.de added the comment: "Passing Part3 tests and not crashing on crash.py is probably good enough for a commit, but I don't have a proof that length 20 skipped buffer is always enough." I would agree with that. I still didn't have time to fully review the patch,

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Alexander Belopolsky
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: On Fri, Dec 17, 2010 at 3:47 AM, Martin v. Löwis rep...@bugs.python.org wrote: .. The worst case (wrt. cskipped) is the maximum number of characters that can get combined into a single base character. It used to be (and I

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Martin v . Löwis
Martin v. Löwis mar...@v.loewis.de added the comment: The C forms (NFC and NFKC) do canonical composition and U+FDFA is a compatibility composite. (BTW, makeunicodedata.py checks that maximum decomposed length of a character is 19, but it would be better if it would compute and define a

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Alexander Belopolsky
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: On Fri, Dec 17, 2010 at 2:08 PM, Martin v. Löwis rep...@bugs.python.org wrote: ..  As far as I (and a two-line script) can tell the maximum length of a canonical decomposition of a character is 4. Even better - so
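The two-line script mentioned above is not shown in this thread, so the following is only a guess at an equivalent check. It assumes "length of a canonical decomposition" means the length of the full NFD form of a single code point, and it computes the NFKD figure alongside for comparison; the function name is hypothetical. Results depend on the Unicode database bundled with the interpreter.

```python
import sys
import unicodedata

def max_decomposition_length(form):
    """Longest normalized form (in code points) of any single code point.

    Hypothetical helper: `form` is 'NFD' for canonical decomposition or
    'NFKD' for compatibility decomposition.
    """
    return max(
        len(unicodedata.normalize(form, chr(cp)))
        for cp in range(sys.maxunicode + 1)
    )

max_canonical = max_decomposition_length('NFD')
max_compat = max_decomposition_length('NFKD')
print(max_canonical, max_compat)
```

The scan covers every code point, so it takes a moment to run, but it avoids hard-coding a limit that might change with a newer Unicode version.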

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-16 Thread Alexander Belopolsky
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: Adding an assert as shown in the diff below makes it easy to reproduce the crash in py3k branch: $ ./python.exe crash.py Assertion failed: (cskipped < 20), function nfc_nfkc, file Modules/unicodedata.c, line 714. Abort

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-16 Thread STINNER Victor
STINNER Victor victor.stin...@haypocalc.com added the comment: Ooops, sorry. I just applied the patch suggested by Marc-Andre Lemburg in msg22885 (#1054943). As the patch worked for the examples given in Unicode PRI 29 and the test suite passed, it was enough for me. I don't understand the

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-16 Thread Alexander Belopolsky
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: The logic suggested by Martin in msg120018 looks right to me, but the whole code seems to be unnecessarily complex. (And comb1==comb may need to be changed to comb1=comb.) I don't understand why linear search through

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-16 Thread Alexander Belopolsky
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: Attached patch, issue10254.diff, is essentially Martin's code from msg120018 and Part3 tests from NormalizationTest.txt. Since this bug exposes a buffer overflow condition, I think it qualifies as a security issue, so I

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-15 Thread Jonathan Halcrow
Jonathan Halcrow jonathan.halc...@gmail.com added the comment: I think I've come across a related problem. I am experiencing a segfault when NFC-normalizing a certain string [1]. The crash occurs with 2.7.1 in OS X (built from source with homebrew). Here is the backtrace: #0 0x0025a96e in

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-15 Thread Antoine Pitrou
Antoine Pitrou pit...@free.fr added the comment: I can reproduce the crash under 2.7, but not 2.6 or 3.x here. So it might be a separate issue.

[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-15 Thread Antoine Pitrou
Antoine Pitrou pit...@free.fr added the comment: After a bit of debugging, the crash is due to the skipped array being overflowed in nfc_nfkc() in unicodedata.c. cskipped goes up to 21 while the array only has 20 entries. This happens in all branches (but only crashes in 2.7 right now for

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-31 Thread Arfrever Frehtes Taifersar Arahesis
Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com: -- nosy: +Arfrever

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-31 Thread Ezio Melotti
Changes by Ezio Melotti ezio.melo...@gmail.com: -- nosy: +ezio.melotti

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Merlijn van Deen
New submission from Merlijn van Deen valhall...@gmail.com: Summary: Somewhere between 2.6.5 r79063 and 3.1 r79147 a regression in the Unicode NFC normalization has been introduced. This regression leads to bot edit wars on Wikipedia [1]. It is reproducible with a simple script [2].

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Merlijn van Deen
Merlijn van Deen valhall...@gmail.com added the comment: Please note: The bug might very well be present in Python 3.2 and 3.3. However, I do not have these versions installed, so I cannot confirm this.

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Antoine Pitrou
Antoine Pitrou pit...@free.fr added the comment: Confirmed on Python 3.2. -- nosy: +haypo, loewis, pitrou versions: +Python 3.2

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Martin v . Löwis
Martin v. Löwis mar...@v.loewis.de added the comment: The change from issue1054943 is indeed bogus. As written, the code will happily run over starters, even though a blocked start means that subsequent characters can't possibly be combinable. That way, the code manages to combine, in

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment: Martin v. Löwis wrote: "It's unfortunate that the patch had been backported to 2.6.6; we can't fix it there anymore." Why not? It looks a lot like a security fix. -- nosy: +lemburg

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Martin v . Löwis
Martin v. Löwis mar...@v.loewis.de added the comment: "It's unfortunate that the patch had been backported to 2.6.6; we can't fix it there anymore." "Why not? It looks a lot like a security fix." Indeed, you could argue that. It's up to the 2.6 release manager, I guess.

[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread R. David Murray
Changes by R. David Murray rdmur...@bitdance.com: -- nosy: +barry