[issue10254] unicodedata.normalize('NFC', s) regression

2011-09-22 Thread Victor Ruiz

Victor Ruiz vic...@ninibe.com added the comment:

Hi,

I think I've come across what seems to be another flavor of this issue. The  
following string will cause a crash in some interpreters.

text = 
u\u062d\u064e\u064a\u0651\u064b\u0627\u060c\u0648\u064e\u064a\u064e\u062d\u0650\u0642\u0651\u064e
 \u0627\u0644\u0652\u0642\u064e\u0648\u0652\u0644\u064f
\u0648\u064e\u0644\u0651\u064e\u064a\u0652\u062a\u064f\u0643\u064f\u0645\u064e\u0627\u060c
 \u0648\u064e\u0625\u0650\u0646\u0652 
\u0623\u064e\u0628\u064e\u064a\u0652\u062a\u064f\u0645\u064e\u0627 
\u0623\u064e\u0646\u0652 \u062a\u064f\u0642\u0650\u0631\u0651\u064e\u0627 
\u0628\u0650\u0627\u0644\u0625\u0650\u0633\u0652\u0644\u0627\u064e\u0645\u0650 
\u0641\u064e\u0625\u0650\u0646\u0651\u064e
\u0648\u064e\u062e\u064e\u064a\u0652\u0644\u0650\u064a 
\u062a\u064e\u062d\u064f\u0644\u0651\u064f 
\u0628\u0650\u0633\u064e\u0627\u062d\u064e\u062a\u0650\u0643\u064f\u0645\u064e\u0627\u060c
 
\u0648\u064e\u062a\u064e\u0638\u0652\u0647\u064e\u0631\u064f 
\u0646\u064f\u0628\u064f\u0648\u0651\u064e\u062a\u0650\u064a 
\u0645\u064f\u0644\u0652\u0643\u0650\u0643\u064f\u0645\u064e\u0627.\u0648\u0643\u062a\u0628
 \u0623\u0628\u064a\u0651\u064f \u0628\u0646 \u0643\u0639\u0628 
\u0627\u0644\u0652\u0631\u064e\u0651\u062d\u0650\u064a\u0652\u0645\u060c 
\u0645\u0650\u0646 \u0645\u064f\u062d\u064e\u0645\u064e\u0651\u062f 
\u0631\u064e\u0633\u064f\u0648\u0652\u0644 
\u0627\u0644\u0652\u0644\u064e\u0651\u0647 \u0625\u0650\u0644\u064e\u0649 
\u0627\u0644\u0652\u0645\u064f\u0646\u0652\u0630\u0650\u0631 \u0628\u0652\u0646 
\u0633\u064e\u0627\u0648\u0650\u064a
\u0633\u064e\u0644\u064e\u0627\u0645 \u0639\u064e\u0644\u064e\u064a\u0652\u0643 
\u0641\u064e\u0625\u0650\u0646\u0650\u0651\u064a 
\u0623\u064e\u062d\u0652\u0645\u064e\u062f 
\u0627\u0644\u0652\u0644\u064e\u0651\u0647
\u0625\u0650\u0644\u064e\u064a\u0652\u0643 
\u0627\u0644\u064e\u0651\u0630\u0650\u064a\u0644\u064e\u0627 
\u0625\u0650\u0644\u064e\u0647 \u063a\u064e\u064a\u0652\u0631\u064f\u0647 
\u0648\u064e\u0623\u064e\u0634\u0652\u0647\u064e\u062f \u0623\u064e\u0646 
\u0644\u064e\u0627 \u0625\u0650\u0644\u064e\u0647 
\u0625\u0650\u0644\u064e\u0651\u0627 \u0627\u0644\u0652\u0644\u064e\u0651\u0647


There is a sample script attached. This issue does not seem to be related to 
the Python version itself but rather to how it was compiled, since the exact 
same version crashes on OS X but not on Ubuntu Linux, for example.

ERROR - Python 2.7.1 (r271:86832, Apr 9 2011, 17:12:59) [GCC 4.2.1 (Apple Inc. build 5664)] on darwin
OK - Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53) [GCC 4.5.2] on linux2

The default version 2.6.6 on Debian squeeze should crash too, for example.
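
Because a segfault kills the interpreter, one way to probe a given build is to run the normalization in a child process and look at its exit status. The following is a minimal sketch for Python 3, not the attached script: the helper name `normalize_survives` is hypothetical, and the sample is only the first few characters of the text above, so it may not trigger the crash by itself on an affected build.

```python
import subprocess
import sys

# Code run in the child interpreter. The text is passed as ASCII escapes in
# argv so that the command line stays encoding-independent.
CHILD_CODE = (
    "import sys, unicodedata\n"
    "text = sys.argv[1].encode('ascii').decode('unicode_escape')\n"
    "unicodedata.normalize('NFC', text)\n"
)

def normalize_survives(escaped_text, python=sys.executable):
    """Hypothetical helper: True if the child interpreter exits cleanly."""
    result = subprocess.run([python, "-c", CHILD_CODE, escaped_text])
    # On POSIX a negative return code means the child was killed by a signal
    # (for example -11 for SIGSEGV).
    return result.returncode == 0

if __name__ == "__main__":
    sample = r"\u062d\u064e\u064a\u0651\u064b\u0627\u060c"
    print(sys.version)
    print("OK" if normalize_survives(sample) else "ERROR")
```

Passing a different interpreter path as `python` allows several builds to be compared from one driver process.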

This is a trace of the error in 2.7.1 OSX (this interpreter passes the test 
posted on msg124450):

Process: Python [78170]
Path:
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python
Identifier:  Python
Version: ??? (???)
Code Type:   X86-64 (Native)
Parent Process:  bash [77126]

Date/Time:   2011-09-22 23:20:48.892 +0200
OS Version:  Mac OS X 10.6.8 (10K549)
Report Version:  6

Interval Since Last Report:  88509 sec
Crashes Since Last Report:   135
Per-App Crashes Since Last Report:   134
Anonymous UUID:  F5DD44CE-A8F4-474C-BA10-2B21B4C92C1E

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: 0x000d, 0x
Crashed Thread:  0  Dispatch queue: com.apple.main-thread

Thread 0 Crashed:  Dispatch queue: com.apple.main-thread
0   org.python.python   0x000100086b33 _PyUnicode_Resize + 51
1   unicodedata.so      0x000100601bff nfc_nfkc + 335
2   unicodedata.so      0x000100601f2a unicodedata_normalize + 154
3   org.python.python   0x0001000bfccd PyEval_EvalFrameEx + 20797
4   org.python.python   0x0001000c1f16 PyEval_EvalCodeEx + 2118
5   org.python.python   0x0001000c2036 PyEval_EvalCode + 54
6   org.python.python   0x0001000e6a5e PyRun_FileExFlags + 174
7   org.python.python   0x0001000e6d19 PyRun_SimpleFileExFlags + 489
8   org.python.python   0x0001000fd6fc Py_Main + 2940
9   org.python.python   0x00010f14 0x1 + 3860

Thread 0 crashed with X86 Thread State (64-bit):
  rax: 0x0644062700200627  rbx: 0x000100373d9c  rcx: 0x003c  rdx: 0x000a
  rdi: 0x7fff5fbff078  rsi: 0x80169ba9  rbp: 0x7fff5fbfefa0  rsp: 0x7fff5fbfef80
   r8: 0x004e   r9: 0x000a  r10: 0x000100373db8  r11: 0x000100373dac
  r12: 0x7fff5fbff078  r13: 0x80169ba9  r14: 0x80169ba9  r15: 0x00a1
  rip: 0x000100086b33  rfl: 0x00010206  cr2: 0x00010066a2f4

Binary Images:
   0x1 -0x10fff 

[issue10254] unicodedata.normalize('NFC', s) regression

2011-09-22 Thread Alexander Belopolsky

Changes by Alexander Belopolsky alexander.belopol...@gmail.com:


--
status: closed -> open

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10254
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue10254] unicodedata.normalize('NFC', s) regression

2011-09-22 Thread Alexander Belopolsky

Alexander Belopolsky alexander.belopol...@gmail.com added the comment:

This new data does not crash Python 2.7.2, so I assume the issue has been 
fixed.  Re-closing.

--
status: open -> closed




[issue10254] unicodedata.normalize('NFC', s) regression

2011-09-22 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

> This new data does not crash Python 2.7.2, so I assume the issue has been 
> fixed.

Yes, the bug was already fixed in branch 2.7 by the SVN commit r87541:

changeset:   67185:54f1d5651555
branch:      2.7
parent:      67159:2d09af4c137c
user:        Alexander Belopolsky alexander.belopol...@gmail.com
date:        Tue Dec 28 15:47:56 2010 +
files:       Lib/test/test_normalization.py Lib/test/test_unicodedata.py Modules/unicodedata.c
description:
Merged revisions 87442 via svnmerge from
svn+ssh://python...@svn.python.org/python/branches/py3k

  r87442 | alexander.belopolsky | 2010-12-22 21:27:37 -0500 (Wed, 22 Dec 2010) | 1 line

  Issue #10254: Fixed a crash and a regression introduced by the implementation of PRI 29.


This fix is part of Python 2.7.2, but not of 2.7.2.

--




[issue10254] unicodedata.normalize('NFC', s) regression

2011-09-22 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

> This fix is part of Python 2.7.2, but not of 2.7.2.

... but not of 2.7.1.

--




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-28 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Committed backports:

r87540 (3.1)
r87541 (2.7)
r87546 (2.6)

--
resolution:  -> fixed
stage: commit review -> committed/rejected
status: open -> closed
versions: +Python 3.2




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-22 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Committed to py3k in revision 87442.

--
versions:  -Python 3.2




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-21 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

In the new patch, issue10254b.diff, I've added a test that would crash 
unpatched code:

 unicodedata.normalize('NFC', 'C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸C̸Ç')
Segmentation fault
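
For reference, the same input can be built programmatically. This is a sketch that assumes the sample consists of twenty `C` + U+0338 pairs followed by `C` + U+0327, which is how the string renders here. On an interpreter that has the fix, only the final pair should compose (to U+00C7) and the rest should be left alone.

```python
import unicodedata

# Assumed reconstruction of the test input: 20 x (C, COMBINING LONG SOLIDUS
# OVERLAY) followed by C + COMBINING CEDILLA.
decomposed = "C\u0338" * 20 + "C\u0327"

# Only the last pair has a precomposed form (LATIN CAPITAL LETTER C WITH
# CEDILLA); there is no composite for C + U+0338.
expected = "C\u0338" * 20 + "\u00c7"

result = unicodedata.normalize("NFC", decomposed)
print(result == expected)
```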

Martin, I still feel uneasy about the fixed size of the skipped buffer.  It is 
not obvious that skipped combining characters always get removed from the 
buffer before the next starter is processed.

I would really like another pair of eyes to look at this code before it goes in, 
especially to 2.6.

Victor,

IIRC, you did some stress testing on random data.  I wonder if you could test 
this code after tightening the assert to cskipped < 4.  (The current theory is 
that this should be enough.)

--
Added file: http://bugs.python.org/file20131/issue10254b.diff




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-20 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Attached patch, issue10254a.diff, adds the OP's cases to test_unicodedata and 
changes the code as I suggested in msg124173 because ISTM that comb >= comb1 
matches the pr-29 definition:


D2'. In any character sequence beginning with a starter S, a character C is 
blocked from S if and only if there is some character B between S and C, and 
either B is a starter or it has the same or higher combining class as C.
 http://www.unicode.org/review/pr-29.html
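
The definition can be written down directly in terms of canonical combining classes. The sketch below is only an illustration of D2' using `unicodedata.combining`; the function name `is_blocked` is made up here and this is not the code in Modules/unicodedata.c.

```python
import unicodedata

def is_blocked(seq, c_index):
    """Apply D2' to seq, which is assumed to begin with a starter S at index 0.

    Returns True if the character C at seq[c_index] is blocked from S.
    """
    ccc_c = unicodedata.combining(seq[c_index])
    for b in seq[1:c_index]:
        ccc_b = unicodedata.combining(b)
        # B blocks C if B is a starter (class 0) or if B has the same or a
        # higher combining class than C.
        if ccc_b == 0 or ccc_b >= ccc_c:
            return True
    return False

# U+0324 (class 220) sits between "u" and U+0301 (class 230).
print(is_blocked("u\u0324\u0301", 2))
# In the OP's first example, starters and U+030D sit between "i" and the
# final U+0301.
print(is_blocked("i\u030dt-s\u1e73\u0301", 6))
```

The first call reports that U+0301 is not blocked, the second that it is, which is why combining the final accent with the "i" is wrong.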

Unfortunately, all tests pass with either comb >= comb1 or comb == comb1, so 
before I commit, I would like to figure out the test case that would properly 
exercise this code.

--
Added file: http://bugs.python.org/file20120/issue10254a.diff




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-20 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Mon, Dec 20, 2010 at 2:50 PM, Alexander Belopolsky
rep...@bugs.python.org wrote:
..
> Unfortunately, all tests pass with either comb >= comb1 or comb == comb1, so
> before I commit, I would like to figure out the test case that would properly
> exercise this code.


After some more thought, I've realized that the comb > comb1 case is
impossible if comb1 != 0 (due to canonical reordering step) and if
comb1 == 0, the comb1 to comb comparison is not reached.  In other
words, it does not matter whether comparison is done as Martin
suggested in msg120018 or as it is done in the latest patch.  The fact
that the comb > comb1 case is impossible if comb1 != 0 is actually
mentioned in PR 29 itself.  See Table 1: Differences at
http://www.unicode.org/review/pr-29.html.
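
The effect of the canonical reordering step can be observed from Python. This is a sketch with a made-up helper name, `classes_are_ordered`; it checks on decomposed (NFD) text that a non-starter never follows a character with a higher combining class.

```python
import unicodedata

def classes_are_ordered(text):
    """Check that, after NFD, no non-starter follows a higher-class mark."""
    nfd = unicodedata.normalize("NFD", text)
    ccc = [unicodedata.combining(ch) for ch in nfd]
    return all(
        prev <= cur                      # comb <= comb1 ...
        for prev, cur in zip(ccc, ccc[1:])
        if cur != 0                      # ... whenever comb1 != 0
    )

# U+0301 (class 230) is given before U+0324 (class 220); NFD swaps them.
swapped = unicodedata.normalize("NFD", "a\u0301\u0324")
print(swapped == "a\u0324\u0301")
print(classes_are_ordered("a\u0301\u0324"))
```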

--




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

On 17.12.2010 01:56, STINNER Victor wrote:
> 
> STINNER Victor victor.stin...@haypocalc.com added the comment:
> 
> Ooops, sorry. I just applied the patch suggested by Marc-Andre
> Lemburg in msg22885 (#1054943). As the patch worked for the examples
> given in Unicode PRI 29 and the test suite passed, it was enough for
> me. I don't understand the normalization code, so I don't know how to
> fix it.

So lacking a new patch, I think we should revert the existing change
for now.

--




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

> So lacking a new patch, I think we should revert the existing change
> for now.

Oops, I missed that Alexander has proposed a patch.

--




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

> The logic suggested by Martin in msg120018 looks right to me, but the
> whole code seems to be unnecessarily complex.  (And comb1==comb may
> need to be changed to comb1<=comb.) I don't understand why linear
> search through skipped array is needed.  At the very least instead
> of adding their positions to the skipped list, used combining
> characters can be replaced by a non-character to be later skipped.

The skipped array keeps track of what characters have been integrated
into a base character, as they must not appear in the output.
Assume you have a sequence B,C,N,C,N,B (B: base character, C: combined,
N: not combined). You need to remember not to output C, whereas you
still need to output N. I don't think replacing them with a
non-character can work: which one would you choose (that cannot also
appear in the input)?

The worst case (wrt. cskipped) is the maximum number of characters that
can get combined into a single base character. It used to be (and I
hope still is) 20 (decomposition of U+FDFA).

--




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

> Passing Part3 tests and not crashing on crash.py is probably good
> enough for a commit, but I don't have a proof that length 20 skipped
> buffer is always enough.

I would agree with that. I still didn't have time to fully review the
patch, but assuming it fixes the cases in msg119995, we should proceed
with it.

--




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Fri, Dec 17, 2010 at 3:47 AM, Martin v. Löwis rep...@bugs.python.org wrote:
..
> The worst case (wrt. cskipped) is the maximum number of characters that
> can get combined into a single base character. It used to be (and I
> hope still is) 20 (decomposition of U+FDFA).


The C forms (NFC and NFKC) do canonical composition and U+FDFA is a
compatibility composite. (BTW, makeunicodedata.py checks that maximum
decomposed length of a character is < 19, but it would be better if it
would compute and define a named constant, say MAXDLENGTH, to be used
instead of literal 20.)  As far as I (and a two-line script) can tell
the maximum length of a canonical decomposition of a character is 4.
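
The two-line script itself is not in the thread. One possible way to make the same measurement on Python 3 is sketched below; it fully decomposes every code point and records the longest result. The function name `max_decomposition_length` is an assumption made for this example, and the numbers depend on the Unicode database version bundled with the interpreter.

```python
import sys
import unicodedata

def max_decomposition_length(form):
    """Longest result of normalizing any single code point with `form`."""
    longest = 0
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        longest = max(longest, len(unicodedata.normalize(form, chr(cp))))
    return longest

# NFD applies canonical decomposition only; NFKD also applies compatibility
# decompositions such as the one for U+FDFA.
print("canonical (NFD):     ", max_decomposition_length("NFD"))
print("compatibility (NFKD):", max_decomposition_length("NFKD"))
```

Comparing the two values shows why the compatibility figure for U+FDFA is not the relevant bound for the composition step.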

--




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

> The C forms (NFC and NFKC) do canonical composition and U+FDFA is a
> compatibility composite. (BTW, makeunicodedata.py checks that maximum
> decomposed length of a character is < 19, but it would be better if it
> would compute and define a named constant, say MAXDLENGTH, to be used
> instead of literal 20.)  As far as I (and a two-line script) can tell
> the maximum length of a canonical decomposition of a character is 4.

Even better - so allowing for 20 characters should be safe.

--




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-17 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Fri, Dec 17, 2010 at 2:08 PM, Martin v. Löwis rep...@bugs.python.org wrote:
..
>> As far as I (and a two-line script) can tell
>> the maximum length of a canonical decomposition of a character is 4.
>
> Even better - so allowing for 20 characters should be safe.

I don't disagree, but the number of break and continue statements
before cskipped++ makes me nervous.  This said, I am going to add
test cases from the first post to test_unicodedata (I think it is a
better place than test_normalization because the latter is skipped by
default) and commit.

Improving the algorithm is a separate issue.

--




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-16 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Adding an assert as shown in the diff below makes it easy to reproduce the 
crash in the py3k branch:

$ ./python.exe  crash.py
Assertion failed: (cskipped < 20), function nfc_nfkc, file 
Modules/unicodedata.c, line 714.
Abort trap

I am attaching jhalcrow's code as crash.py 

===
--- Modules/unicodedata.c   (revision 87322)
+++ Modules/unicodedata.c   (working copy)
@@ -711,6 +711,7 @@
   /* Replace the original character. */
   *i = code;
   /* Mark the second character unused. */
+  assert(cskipped < 20);
   skipped[cskipped++] = i1;
   i1++;
   f = find_nfc_index(self, nfc_first, *i);

--
nosy: +belopolsky
Added file: http://bugs.python.org/file20080/crash.py




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-16 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

Ooops, sorry. I just applied the patch suggested by Marc-Andre Lemburg in 
msg22885 (#1054943). As the patch worked for the examples given in Unicode PRI 
29 and the test suite passed, it was enough for me. I don't understand the 
normalization code, so I don't know how to fix it.

--




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-16 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

The logic suggested by Martin in msg120018 looks right to me, but the whole 
code seems to be unnecessarily complex.  (And comb1==comb may need to be 
changed to comb1<=comb.) I don't understand why a linear search through the 
skipped array is needed.  At the very least instead of adding their positions to the 
skipped list, used combining characters can be replaced by a non-character to 
be later skipped.  A better algorithm should be able to avoid the whole issue 
of skipping by properly computing the length of the decomposed character.  
See internalCompose() at http://www.unicode.org/reports/tr15/Normalizer.java.

I'll try to come up with a patch.

--
assignee:  -> belopolsky




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-16 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Attached patch, issue10254.diff, is essentially Martin's code from msg120018 
and Part3 tests from NormalizationTest.txt.

Since this bug exposes a buffer overflow condition, I think it qualifies as a 
security issue, so I am adding 2.6 to versions.

Passing Part3 tests and not crashing on crash.py is probably good enough for a 
commit, but I don't have a proof that length 20 skipped buffer is always 
enough.  As the next step, I would like to consider an alternative algorithm 
that would not require a skipped buffer.

--
keywords: +patch
stage:  -> commit review
versions: +Python 2.6
Added file: http://bugs.python.org/file20089/issue10254.diff




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-15 Thread Jonathan Halcrow

Jonathan Halcrow jonathan.halc...@gmail.com added the comment:

I think I've come across a related problem.  I am experiencing a segfault when 
NFC-normalizing a certain string [1].
The crash occurs with 2.7.1 in OS X (built from source with homebrew).   

Here is the backtrace:
#0  0x0025a96e in _PyUnicode_Resize ()
#1  0x00601673 in nfc_nfkc ()
#2  0x00601bb7 in unicodedata_normalize ()
#3  0x0029834b in PyEval_EvalFrameEx ()
#4  0x00299f13 in PyEval_EvalCodeEx ()
#5  0x0029a0fe in PyEval_EvalCode ()
#6  0x002bd5f0 in PyRun_FileExFlags ()
#7  0x002be430 in PyRun_SimpleFileExFlags ()
#8  0x002d5bd6 in Py_Main ()
#9  0x1f8f in _start ()
#10 0x1ebd in start ()


[1] http://pastebin.com/cfNd2QEz

--
nosy: +jhalcrow




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-15 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

I can reproduce the crash under 2.7, but not 2.6 or 3.x here. So it might be a 
separate issue.

--




[issue10254] unicodedata.normalize('NFC', s) regression

2010-12-15 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

After a bit of debugging, the crash is due to the skipped array being 
overflowed in nfc_nfkc() in unicodedata.c. cskipped goes up to 21 while the 
array only has 20 entries. This happens in all branches (but only crashes in 
2.7 right now for probably unimportant reasons).

And the problem was indeed introduced by Victor's patch in issue1054943. Just 
before, cskipped would only go up to 1.

--
priority: normal -> high
type: behavior -> crash




[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-31 Thread Arfrever Frehtes Taifersar Arahesis

Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com:


--
nosy: +Arfrever




[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-31 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +ezio.melotti




[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Merlijn van Deen

New submission from Merlijn van Deen valhall...@gmail.com:

Summary: Somewhere between 2.6.5 r79063 and 3.1 r79147 a regression in the 
Unicode NFC normalization has been introduced. This regression leads to bot 
edit wars on Wikipedia [1]. It is reproducible with a simple script [2]. 
MediaWiki/PHP [3] and C# [4] test scripts both show the old behaviour, which 
leads me to believe this is a Python bug.
A search for older bugs shows bug #1054943 [5], which has commits in the 
suspected region.

The regression causes certain NFC-normalized strings to become mangled. Because 
of the wide range of unicode strings on wikipedia, this causes several 
problems. Details of those can be found at [1].

Example strings include: (these strings have been NFC-normalized by mediawiki)
 * u'Li\u030dt-s\u1e73\u0301'
 * u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917'
 * u'\u0915\u093f\u0930\u094d\u0917\u093f\u091c\u093c\u0938\u094d\u0924\u093e\u0928'

The bug can be shown simply with
unicodedata.normalize('NFC', s) == s
where s is one of the strings above. This will return True on older python 
versions, False on newer versions. There is a script available that does this 
[2].
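
The check can be sketched as follows. This is not the pastebin script from [2] but a minimal Python 3 version of the same comparison; the helper name `is_nfc_stable` is made up for this example. On Python 2 the literals would need the u prefix.

```python
import unicodedata

# The three strings from the report, already NFC-normalized by MediaWiki.
samples = [
    "Li\u030dt-s\u1e73\u0301",
    "\u092e\u093e\u0930\u094d\u0915 "
    "\u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917",
    "\u0915\u093f\u0930\u094d\u0917\u093f\u091c\u093c\u0938\u094d\u0924\u093e\u0928",
]

def is_nfc_stable(s):
    """True if NFC normalization leaves the string unchanged."""
    return unicodedata.normalize("NFC", s) == s

for s in samples:
    # ascii() keeps the report readable on terminals without Unicode fonts.
    print("OK" if is_nfc_stable(s) else "FAIL", ascii(s))
```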

The bug has been tested on the following machines and python versions. OK 
indicates the bug is not present, FAIL indicates the bug is present.

Host: SunOS willow 5.10 Generic_142910-17 i86pc i386 i86pc Solaris
'2.3.3 (#1, Dec 16 2004, 14:38:56) [C]' OK
'2.6.5 (r265:79063, Jul 10 2010, 17:50:38) [C]' OK
'2.7 (r27:82500, Aug  5 2010, 04:28:45) [C]' FAIL
'3.1.2 (r312:79147, Sep 24 2010, 05:34:04) [C]' FAIL

Host: Linux nightshade 2.6.26-2-amd64 #1 SMP Thu Sep 16 15:56:38 UTC 2010 
x86_64 GNU/Linux
'2.4.6 (#2, Jan 24 2010, 12:20:41) \n[GCC 4.3.2]' OK
'2.5.2 (r252:60911, Jan 24 2010, 17:44:40) \n[GCC 4.3.2]' OK
'2.6.4+ (r264:75706, Feb 16 2010, 05:11:28) \n[GCC 4.4.3]' OK

Host: Linux dorthonion 2.6.22.18-co-0.7.4 #1 PREEMPT Wed Apr 15 18:57:39 UTC 
2009 i686 GNU/Linux
'2.5.4 (r254:67916, Jan 20 2010, 21:44:03) \n[GCC 4.3.3]' OK
'2.6.2 (release26-maint, Apr 19 2009, 01:56:41) \n[GCC 4.3.3]' OK
'3.0.1+ (r301:69556, Apr 15 2009, 15:59:22) \n[GCC 4.3.3]' OK

[1] https://sourceforge.net/tracker/index.php?func=detail&aid=3081100&group_id=93107&atid=603138# ; 
    http://fr.wikipedia.org/w/index.php?title=Mark_Zuckerberg&action=historysubmit&diff=57753004&oldid=57751674
[2] http://pastebin.ca/1977285 (py2.x), http://pastebin.ca/1977287 (py3.x)
[3] http://pastebin.ca/1977292 (PHP, placed in http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/normal/)
[4] http://pastebin.ca/1977261 (C#)
[5] http://bugs.python.org/issue1054943#

--
components: Unicode
messages: 119995
nosy: valhallasw
priority: normal
severity: normal
status: open
title: unicodedata.normalize('NFC', s) regression
type: behavior
versions: Python 2.7, Python 3.1




[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Merlijn van Deen

Merlijn van Deen valhall...@gmail.com added the comment:

Please note: The bug might very well be present in python 3.2 and 3.3. However, 
I do not have these versions installed, so I cannot confirm this.

--




[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

Confirmed on Python 3.2.

--
nosy: +haypo, loewis, pitrou
versions: +Python 3.2




[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

The change from issue1054943 is indeed bogus. As written, the code will happily 
run over starters, even though a blocked start means that subsequent characters 
can't possibly be combinable. That way, the code manages to combine, in 
'Li\u030dt-s\u1e73\u0301', the final U+0301 with the i - even though there are 
several starters in-between.

I think the code should work like this:

if comb!=0 and comb1==0:
  #starter after character with higher class:
  # not combinable, and all subsequent characters will be blocked
  # as well
  break
if comb!=0 and comb1==comb:
  # blocked combining character, continue searching
  i1++
  continue
# candidate pair, check whether *i and *i1 are combinable

It's unfortunate that the patch had been backported to 2.6.6; we can't fix it 
there anymore.

--




[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Martin v. Löwis wrote:
> It's unfortunate that the patch had been backported to 2.6.6; we can't fix it 
> there anymore.

Why not ? It looks a lot like a security fix.

--
nosy: +lemburg




[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

>> It's unfortunate that the patch had been backported to 2.6.6; we can't fix 
>> it there anymore.
> 
> Why not ? It looks a lot like a security fix.

Indeed, you could argue that. It's up to the 2.6 release manager, I guess.

--




[issue10254] unicodedata.normalize('NFC', s) regression

2010-10-30 Thread R. David Murray

Changes by R. David Murray rdmur...@bitdance.com:


--
nosy: +barry
