[issue14850] The inconsistency of codecs.charmap_decode

2013-01-15 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
resolution:  -> fixed
stage: patch review -> committed/rejected
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14850] The inconsistency of codecs.charmap_decode

2013-01-15 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Fixed. Thank you for your answers, Martin.




[issue14850] The inconsistency of codecs.charmap_decode

2013-01-15 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 33a8ef498b1e by Serhiy Storchaka in branch '2.7':
Issue #14850: Now a charmap decoder treats U+FFFE as "undefined mapping"
http://hg.python.org/cpython/rev/33a8ef498b1e

New changeset 13cd78a2a17b by Serhiy Storchaka in branch '3.2':
Issue #14850: Now a charmap decoder treats U+FFFE as "undefined mapping"
http://hg.python.org/cpython/rev/13cd78a2a17b

New changeset 6ac4f1609847 by Serhiy Storchaka in branch '3.3':
Issue #14850: Now a charmap decoder treats U+FFFE as "undefined mapping"
http://hg.python.org/cpython/rev/6ac4f1609847

New changeset 03e22cc9407a by Serhiy Storchaka in branch 'default':
Issue #14850: Now a charmap decoder treats U+FFFE as "undefined mapping"
http://hg.python.org/cpython/rev/03e22cc9407a

--
nosy: +python-dev




[issue14850] The inconsistency of codecs.charmap_decode

2012-12-27 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
assignee:  -> serhiy.storchaka




[issue14850] The inconsistency of codecs.charmap_decode

2012-12-27 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

If no one objects, I will commit this next year.




[issue14850] The inconsistency of codecs.charmap_decode

2012-10-24 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
stage:  -> patch review




[issue14850] The inconsistency of codecs.charmap_decode

2012-10-19 Thread Antoine Pitrou

Changes by Antoine Pitrou :


--
nosy: +haypo




[issue14850] The inconsistency of codecs.charmap_decode

2012-10-19 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Does anyone have objections to the idea or the implementation of the
patch?  Please review.




[issue14850] The inconsistency of codecs.charmap_decode

2012-10-02 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
components: +Unicode
keywords: +needs review
versions: +Python 3.4




[issue14850] The inconsistency of codecs.charmap_decode

2012-10-02 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Patch updated to resolve conflict with issue15379. Added tests. Added patches 
for 3.2 and 2.7.

--
Added file: http://bugs.python.org/file27387/decode_charmap_fffe-3.3.patch
Added file: http://bugs.python.org/file27388/decode_charmap_fffe-3.2.patch
Added file: http://bugs.python.org/file27389/decode_charmap_fffe-2.7.patch

diff -r 5ddc7b3f2795 Lib/test/test_codecs.py
--- a/Lib/test/test_codecs.py   Tue Oct 02 12:54:07 2012 +0200
+++ b/Lib/test/test_codecs.py   Tue Oct 02 19:07:20 2012 +0300
@@ -1701,6 +1701,10 @@
 codecs.charmap_decode, b"\x00\x01\x02", "strict", "ab"
 )
 
+self.assertRaises(UnicodeDecodeError,
+codecs.charmap_decode, b"\x00\x01\x02", "strict", "ab\ufffe"
+)
+
 self.assertEqual(
 codecs.charmap_decode(b"\x00\x01\x02", "replace", "ab"),
 ("ab\ufffd", 3)
@@ -1757,6 +1761,12 @@
{0: 'a', 1: 'b'}
 )
 
+# Issue #14850
+self.assertRaises(UnicodeDecodeError,
+codecs.charmap_decode, b"\x00\x01\x02", "strict",
+   {0: 'a', 1: 'b', 3: '\ufffe'}
+)
+
 self.assertEqual(
 codecs.charmap_decode(b"\x00\x01\x02", "replace",
   {0: 'a', 1: 'b'}),
@@ -1769,6 +1779,13 @@
 ("ab\ufffd", 3)
 )
 
+# Issue #14850
+self.assertEqual(
+codecs.charmap_decode(b"\x00\x01\x02", "replace",
+  {0: 'a', 1: 'b', 2: '\ufffe'}),
+("ab\ufffd", 3)
+)
+
 self.assertEqual(
 codecs.charmap_decode(b"\x00\x01\x02", "ignore",
   {0: 'a', 1: 'b'}),
@@ -1781,6 +1798,13 @@
 ("ab", 3)
 )
 
+# Issue #14850
+self.assertEqual(
+codecs.charmap_decode(b"\x00\x01\x02", "ignore",
+  {0: 'a', 1: 'b', 2: '\ufffe'}),
+("ab", 3)
+)
+
 allbytes = bytes(range(256))
 self.assertEqual(
 codecs.charmap_decode(allbytes, "ignore", {}),
@@ -1821,6 +1845,11 @@
{0: a, 1: b},
 )
 
+self.assertRaises(UnicodeDecodeError,
+codecs.charmap_decode, b"\x00\x01\x02", "strict",
+   {0: a, 1: b, 2: 0xFFFE},
+)
+
 self.assertEqual(
 codecs.charmap_decode(b"\x00\x01\x02", "replace",
   {0: a, 1: b}),
@@ -1828,11 +1857,23 @@
 )
 
 self.assertEqual(
+codecs.charmap_decode(b"\x00\x01\x02", "replace",
+  {0: a, 1: b, 2: 0xFFFE}),
+("ab\ufffd", 3)
+)
+
+self.assertEqual(
 codecs.charmap_decode(b"\x00\x01\x02", "ignore",
   {0: a, 1: b}),
 ("ab", 3)
 )
 
+self.assertEqual(
+codecs.charmap_decode(b"\x00\x01\x02", "ignore",
+  {0: a, 1: b, 2: 0xFFFE}),
+("ab", 3)
+)
+
 
 class WithStmtTest(unittest.TestCase):
 def test_encodedfile(self):
diff -r 5ddc7b3f2795 Objects/unicodeobject.c
--- a/Objects/unicodeobject.c   Tue Oct 02 12:54:07 2012 +0200
+++ b/Objects/unicodeobject.c   Tue Oct 02 19:07:20 2012 +0300
@@ -7516,15 +7516,18 @@
 if (PyErr_ExceptionMatches(PyExc_LookupError)) {
 /* No mapping found means: mapping is undefined. */
 PyErr_Clear();
-x = Py_None;
-Py_INCREF(x);
+goto Undefined;
 } else
 goto onError;
 }
 
 /* Apply mapping */
+if (x == Py_None)
+goto Undefined;
 if (PyLong_Check(x)) {
 long value = PyLong_AS_LONG(x);
+if (value == 0xFFFE)
+goto Undefined;
 if (value < 0 || value > MAX_UNICODE) {
 PyErr_Format(PyExc_TypeError,
  "character mapping must be in range(0x%lx)",
@@ -7535,21 +7538,6 @@
 if (unicode_putchar(&v, &outpos, value) < 0)
 goto onError;
 }
-else if (x == Py_None) {
-/* undefined mapping */
-startinpos = s-starts;
-endinpos = startinpos+1;
-if (unicode_decode_call_errorhandler(
-errors, &errorHandler,
-"charmap", "character maps to <undefined>",
-&starts, &e, &startinpos, &endinpos, &exc, &s,
-&v, &outpos)) {
-Py_DECREF(x);
-  

[issue14850] The inconsistency of codecs.charmap_decode

2012-10-02 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


Removed file: http://bugs.python.org/file25934/decode_charmap_fffe.patch




[issue14850] The inconsistency of codecs.charmap_decode

2012-06-16 Thread Antoine Pitrou

Changes by Antoine Pitrou :


--
nosy: +ezio.melotti




[issue14850] The inconsistency of codecs.charmap_decode

2012-06-10 Thread Serhiy Storchaka

Serhiy Storchaka  added the comment:

> So the answer to your last question is "yes". I hope that the answer to
> your other questions follows from that

Thank you, this is the answer to all my questions. I've prepared a patch
to treat U+FFFE in general mapping as “undefined mapping”.

>  (strictly speaking, it's only
> U+FFFE, not 0xFFFE, that is documented as indicating an undefined
> mapping; a patch should probably fix that).

As both the integer 0xFFFE and the string '\uFFFE' denote U+FFFE, I do not
think it needs fixing.

> (I also wonder where the support for LookupError comes from - that 
> appears to be undocumented)

I believe this is what is meant by the words "undefined mapping".

--
keywords: +patch
Added file: http://bugs.python.org/file25934/decode_charmap_fffe.patch

diff -r 743cf3319862 Objects/unicodeobject.c
--- a/Objects/unicodeobject.c   Sat Jun 09 22:51:39 2012 -0700
+++ b/Objects/unicodeobject.c   Sun Jun 10 21:49:34 2012 +0300
@@ -7512,15 +7512,18 @@
 if (PyErr_ExceptionMatches(PyExc_LookupError)) {
 /* No mapping found means: mapping is undefined. */
 PyErr_Clear();
-x = Py_None;
-Py_INCREF(x);
+goto Undefined;
 } else
 goto onError;
 }
 
 /* Apply mapping */
+if (x == Py_None)
+goto Undefined;
 if (PyLong_Check(x)) {
 long value = PyLong_AS_LONG(x);
+if (value == 0xFFFE)
+goto Undefined;
 if (value < 0 || value > 65535) {
 PyErr_SetString(PyExc_TypeError,
 "character mapping must be in range(65536)");
@@ -7530,21 +7533,6 @@
 if (unicode_putchar(&v, &outpos, value) < 0)
 goto onError;
 }
-else if (x == Py_None) {
-/* undefined mapping */
-startinpos = s-starts;
-endinpos = startinpos+1;
-if (unicode_decode_call_errorhandler(
-errors, &errorHandler,
-"charmap", "character maps to <undefined>",
-&starts, &e, &startinpos, &endinpos, &exc, &s,
-&v, &outpos)) {
-Py_DECREF(x);
-goto onError;
-}
-Py_DECREF(x);
-continue;
-}
 else if (PyUnicode_Check(x)) {
 Py_ssize_t targetsize;
 
@@ -7554,8 +7542,10 @@
 
 if (targetsize == 1) {
 /* 1-1 mapping */
-if (unicode_putchar(&v, &outpos,
-PyUnicode_READ_CHAR(x, 0)) < 0)
+Py_UCS4 value = PyUnicode_READ_CHAR(x, 0);
+if (value == 0xFFFE)
+goto Undefined;
+if (unicode_putchar(&v, &outpos, value) < 0)
 goto onError;
 }
 else if (targetsize > 1) {
@@ -7590,6 +7580,19 @@
 }
 Py_DECREF(x);
 ++s;
+continue;
+Undefined:
+/* undefined mapping */
+Py_XDECREF(x);
+startinpos = s-starts;
+endinpos = startinpos+1;
+if (unicode_decode_call_errorhandler(
+errors, &errorHandler,
+"charmap", "character maps to <undefined>",
+&starts, &e, &startinpos, &endinpos, &exc, &s,
+&v, &outpos)) {
+goto onError;
+}
 }
 }
 if (unicode_resize(&v, outpos) < 0)



[issue14850] The inconsistency of codecs.charmap_decode

2012-06-10 Thread Martin v . Löwis

Martin v. Löwis  added the comment:

> integers or 1-character strings? What about general mapping? Should
> any of them have 0xFFFE or '\uFFFE' represent an undefined mapping?

The documentation says that the parameter "can be a dictionary mapping 
byte or a unicode string, which is treated as a lookup table". So 
anything that supports GetItem with a small integer index can be passed.
It then says '...  U+FFFE “characters” are treated as “undefined mapping”'.

So the answer to your last question is "yes". I hope that the answer to
your other questions follows from that (strictly speaking, it's only
U+FFFE, not 0xFFFE, that is documented as indicating an undefined
mapping; a patch should probably fix that).

(I also wonder where the support for LookupError comes from - that 
appears to be undocumented)
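
To illustrate the point about "anything that supports GetItem with a small integer index", here is a minimal sketch. It assumes an interpreter that includes the fix for this issue; the class name `SparseTable` is hypothetical and exists only for this example. A lookup that raises a LookupError subclass (IndexError, KeyError) is expected to be treated as an undefined mapping.

```python
import codecs


class SparseTable:
    """Hypothetical decode table that only supports lookup by byte value."""

    def __init__(self, known):
        self._known = known

    def __getitem__(self, byte):
        # A missing key raises KeyError, which is a LookupError subclass.
        return self._known[byte]


data = b'\x00\x01\x02'

# A list is indexable by small integers; index 2 raises IndexError.
from_list = codecs.charmap_decode(data, 'replace', ['a', 'b'])

# A custom object; values may be 1-character strings or integer code points.
from_obj = codecs.charmap_decode(data, 'replace', SparseTable({0: 'a', 1: 98}))

print(ascii(from_list))
print(ascii(from_obj))
```

In both calls the third byte has no mapping, so the 'replace' handler is invoked for it.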




[issue14850] The inconsistency of codecs.charmap_decode

2012-06-10 Thread Serhiy Storchaka

Serhiy Storchaka  added the comment:

> What is the question? U+FFFE also represents an undefined mapping in 
> string subclasses.

What about classes that do not subclass str but duck-type it by
implementing all string methods? What about a list/tuple/array.array of
integers or 1-character strings? What about general mapping? Should any
of them have 0xFFFE or '\uFFFE' represent an undefined mapping?

> This is a single issue, a single bug. If the bug is fixed, it is fixed. 
> No need to go further (unless there is another bug somewhere).

My question is: where is the limit of this bug?




[issue14850] The inconsistency of codecs.charmap_decode

2012-06-10 Thread Martin v . Löwis

Martin v. Löwis  added the comment:

>> U+FFFE is documented as representing an undefined mapping,
>
> Yes, using U+FFFE to represent an undefined mapping in strings is
> normal; the question was about string subclasses.

What is the question? U+FFFE also represents an undefined mapping in 
string subclasses.

> And if we correct it for string subclasses, how much further do we
> go?

This is a single issue, a single bug. If the bug is fixed, it is fixed. 
No need to go further (unless there is another bug somewhere).




[issue14850] The inconsistency of codecs.charmap_decode

2012-06-10 Thread Serhiy Storchaka

Serhiy Storchaka  added the comment:

> U+FFFE is documented as representing an undefined mapping,

Yes, using U+FFFE to represent an undefined mapping in strings is
normal; the question was about string subclasses. And if we correct
it for string subclasses, how much further do we go? What about general
mappings?




[issue14850] The inconsistency of codecs.charmap_decode

2012-06-10 Thread Serhiy Storchaka

Serhiy Storchaka  added the comment:

> What is the use case for passing a string subclass to charmap_decode?  Or in 
> other words, how did you stumble upon the bug?

I stumbled upon it while rewriting the charmap decoder (issue14874). The
charmap decoder currently handles two cases: a faster case for a string
table and a slower general case for an arbitrary mapping. I proposed an
even more optimized case for a 256-character UCS2 string (which covers all
standard charmap encodings). If general strings and mappings were processed
consistently, these cases could be merged. A string subclass is just an
example that illustrates the inconsistency.
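
As a rough illustration of what "consistent processing" of strings and mappings could mean, the following pure-Python model uses one loop for every table type. It is only a sketch under stated assumptions: the function name `charmap_decode_model` is hypothetical, only the 'strict', 'replace' and 'ignore' handlers are modelled, multi-character mapping values are passed through unchanged, and it is not how CPython implements the decoder.

```python
import codecs


def charmap_decode_model(data, errors, table):
    """Simplified model of a single, unified charmap decode loop.

    Any value of `errors` other than 'strict' or 'replace' is treated
    like 'ignore' here; the real decoder supports more handlers.
    """
    out = []
    for pos, byte in enumerate(data):
        try:
            item = table[byte]
        except LookupError:
            item = None              # no mapping found: undefined
        if isinstance(item, int):
            item = chr(item)         # integer code point -> 1-char string
        if item is None or item == '\ufffe':
            # Undefined mapping: defer to the error handler.
            if errors == 'strict':
                raise UnicodeDecodeError('charmap', bytes(data), pos, pos + 1,
                                         'character maps to <undefined>')
            if errors == 'replace':
                out.append('\ufffd')
            continue                 # 'ignore' drops the byte
        out.append(item)
    return ''.join(out), len(data)


data = b'\x00\x01\x02'
tables = ['ab\ufffe', {0: 'a', 1: 'b', 2: 0xFFFE}, ['a', 'b']]
for table in tables:
    for errors in ('replace', 'ignore'):
        model = charmap_decode_model(data, errors, table)
        real = codecs.charmap_decode(data, errors, table)
        print(errors, ascii(model), model == real)
```

On an interpreter that includes the fix for this issue, the model and `codecs.charmap_decode` are expected to agree for these inputs.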




[issue14850] The inconsistency of codecs.charmap_decode

2012-06-10 Thread Martin v . Löwis

Martin v. Löwis  added the comment:

U+FFFE is documented as representing an undefined mapping, see

http://docs.python.org/dev/c-api/unicode.html?highlight=charmap#PyUnicode_DecodeCharmap

So the base string case is correct; the derived string implementation also 
needs to invoke the error handler.
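
A minimal sketch of what "invoke the error handler" means for a derived string table, assuming an interpreter that includes the fix for this issue. The subclass `S` mirrors the one in the original report; the variable names are chosen for this example only.

```python
import codecs


class S(str):
    pass


table = S('ab\uFFFE')   # byte 2 maps to U+FFFE, i.e. "undefined"
data = b'\x00\x01\x02'

# With 'strict', an undefined mapping should raise UnicodeDecodeError.
try:
    codecs.charmap_decode(data, 'strict', table)
    strict_raised = False
except UnicodeDecodeError as exc:
    strict_raised = True
    print('strict:', exc.reason)

# With 'ignore', the undefined byte should simply be dropped.
ignored = codecs.charmap_decode(data, 'ignore', table)
print('ignore:', ascii(ignored))
```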




[issue14850] The inconsistency of codecs.charmap_decode

2012-06-09 Thread Éric Araujo

Éric Araujo  added the comment:

What is the use case for passing a string subclass to charmap_decode?  Or in 
other words, how did you stumble upon the bug?

--
nosy: +eric.araujo




[issue14850] The inconsistency of codecs.charmap_decode

2012-05-18 Thread Terry J. Reedy

Changes by Terry J. Reedy :


--
nosy: +doerwalter, lemburg




[issue14850] The inconsistency of codecs.charmap_decode

2012-05-18 Thread Antoine Pitrou

Changes by Antoine Pitrou :


--
nosy: +loewis




[issue14850] The inconsistency of codecs.charmap_decode

2012-05-18 Thread Serhiy Storchaka

New submission from Serhiy Storchaka :

codecs.charmap_decode behaves differently depending on whether the decode
table is a native string or a user-defined string subclass.

>>> import codecs
>>> print(ascii(codecs.charmap_decode(b'\x00', 'replace', '\uFFFE')))
('\ufffd', 1)
>>> class S(str): pass
... 
>>> print(ascii(codecs.charmap_decode(b'\x00', 'replace', S('\uFFFE'))))
('\ufffe', 1)

This is because the charmap decoder (function PyUnicode_DecodeCharmap in
Objects/unicodeobject.c) uses different algorithms for exact strings and for
other objects.

Do we need to fix it? If so, what should `codecs.charmap_decode(b'\x00',
'replace', {0:'\uFFFE'})` return? And what should `codecs.charmap_decode(b'\x00',
'replace', {0:0xFFFE})` return?
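
For reference, a small sketch that runs all four table forms side by side. It assumes an interpreter that already includes the fix later committed for this issue, where the four results are expected to agree; the helper name `decode_with` is hypothetical.

```python
import codecs


class S(str):
    """A trivial str subclass, as in the report above."""


def decode_with(table):
    # Decode the single byte 0x00 through the given decode table,
    # using the 'replace' handler for undefined mappings.
    return codecs.charmap_decode(b'\x00', 'replace', table)


results = {
    'exact str': decode_with('\uFFFE'),
    'str subclass': decode_with(S('\uFFFE')),
    'dict of str': decode_with({0: '\uFFFE'}),
    'dict of int': decode_with({0: 0xFFFE}),
}
for name, result in results.items():
    print(name, ascii(result))
```

On an interpreter without the fix, the 'str subclass' row would show the inconsistency described above.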

--
components: Interpreter Core
messages: 161054
nosy: storchaka
priority: normal
severity: normal
status: open
title: The inconsistency of codecs.charmap_decode
type: behavior
versions: Python 2.7, Python 3.2, Python 3.3
