[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2016-09-08 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 2150eadb54c7 by Serhiy Storchaka in branch 'default':
Remove old typo.
https://hg.python.org/cpython/rev/2150eadb54c7

--
nosy: +python-dev

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2010-04-07 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue3672
___



[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-06-15 Thread hippietrail

Changes by hippietrail hippytr...@gmail.com:


--
nosy: +hippietrail




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-05-02 Thread Martin v. Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

Reviewers: report_bugs.python.org, Benjamin,

Message:
Issues fixed in r72188.

http://codereview.appspot.com/52081/diff/1/5
File Doc/library/codecs.rst (right):

http://codereview.appspot.com/52081/diff/1/5#newcode326
Line 326: In addition, the following error handlers are specific to only
selected
On 2009/05/01 21:25:44, Benjamin wrote:
> "In addition, the following error handlers are specific to a single
> codec." sounds better

Done.

http://codereview.appspot.com/52081/diff/1/5#newcode335
Line 335:
On 2009/05/01 21:25:44, Benjamin wrote:
> There should probably be a versionchanged directive indicating that
> surrogates was added in 3.1.

Done.

http://codereview.appspot.com/52081/diff/1/6
File Lib/test/test_codecs.py (right):

http://codereview.appspot.com/52081/diff/1/6#newcode544
Line 544: def test_surrogates(self):
On 2009/05/01 21:25:44, Benjamin wrote:
> I think this should be split into 2 tests: test_lone_surrogates and
> test_surrogate_handler.

Done.

http://codereview.appspot.com/52081/diff/1/4
File Objects/unicodeobject.c (right):

http://codereview.appspot.com/52081/diff/1/4#newcode157
Line 157: static PyObject *unicode_encode_call_errorhandler(const char
*errors,
On 2009/05/01 21:25:44, Benjamin wrote:
> These prototypes are longer than 80 chars in some places. I don't think
> the arguments need to line up with the starting parenthesis.

Done.

http://codereview.appspot.com/52081/diff/1/4#newcode2393
Line 2393: s, size, exc, i-1, i, newpos);
On 2009/05/01 21:25:44, Benjamin wrote:
> exc is never Py_DECREFed.

Done.

http://codereview.appspot.com/52081/diff/1/4#newcode4110
Line 4110: if (!PyUnicode_Check(repunicode)) {
On 2009/05/01 21:25:44, Benjamin wrote:
> Is there a test of this case somewhere?

No. This is temporary - for PEP 383, I will have to support error
handlers returning bytes here, also.

http://codereview.appspot.com/52081/diff/1/2
File Python/codecs.c (right):

http://codereview.appspot.com/52081/diff/1/2#newcode758
Line 758: if (PyObject_IsInstance(exc, PyExc_UnicodeEncodeError)) {
On 2009/05/01 21:25:44, Benjamin wrote:
> I believe PyErr_GivenExceptionMatches is more appropriate here, but
> given the rest of the file uses PyObject_IsInstance, it likely doesn't
> matter.

No. The interface is that an exception instance must be passed;
GivenExceptionMatches would also allow for tuples and types.
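
The instance-only contract can also be observed from Python. The following is a minimal sketch, assuming a current Python 3 interpreter in which a built-in handler is registered under the name `surrogatepass` (the name in released versions, which is not necessarily the `surrogates` name used in this patch). The variable names are illustrative only.

```python
import codecs

# Look up a built-in error handler. "surrogatepass" is the name in released
# Python 3 versions; treating it as the handler from this patch is an assumption.
handler = codecs.lookup_error("surrogatepass")

# An exception *instance* describing a lone surrogate at positions 0..1.
exc = UnicodeEncodeError("utf-8", "\ud801", 0, 1, "surrogates not allowed")

# Handlers return a (replacement, resume_position) tuple.
replacement, resume_at = handler(exc)

# Passing the exception *type* instead of an instance is rejected.
try:
    handler(UnicodeEncodeError)
    rejected_type = False
except TypeError:
    rejected_type = True
```

An exception-matching check that also accepted types or tuples would blur this contract, which appears to be the point of the reply above.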

http://codereview.appspot.com/52081/diff/1/2#newcode771
Line 771: return NULL;
On 2009/05/01 21:25:44, Benjamin wrote:
> This leaks an object.

Done.

Please review this at http://codereview.appspot.com/52081

Affected files:
   M Doc/library/codecs.rst
   M Lib/test/test_bytes.py
   M Lib/test/test_codecs.py
   M Lib/test/test_unicode.py
   M Lib/test/test_unicodedata.py
   M Objects/unicodeobject.c
   M Python/codecs.c
   M Python/marshal.c




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-05-02 Thread Martin v. Löwis

Changes by Martin v. Löwis mar...@v.loewis.de:


Removed file: http://bugs.python.org/file13830/surrogates.diff




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-05-02 Thread Martin v. Löwis

Changes by Martin v. Löwis mar...@v.loewis.de:


Added file: http://bugs.python.org/file13836/surrogates.diff




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-05-02 Thread Benjamin Peterson

Benjamin Peterson benja...@python.org added the comment:

I think the new patch looks fine.

--
assignee: benjamin.peterson -> loewis




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-05-02 Thread Benjamin Peterson

Benjamin Peterson benja...@python.org added the comment:

Something I overlooked is that PyCodec_SurrogateErrors isn't exposed in
any headers.




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-05-02 Thread Martin v. Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

Committed as r72208, blocked as r72209.

As for PyCodec_SurrogateErrors: I'd rather make it static than expose it.

--
resolution:  -> accepted
status: open -> closed




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-05-02 Thread Benjamin Peterson

Benjamin Peterson benja...@python.org added the comment:

2009/5/2 "Martin v. Löwis" rep...@bugs.python.org@psf.upfronthosting.co.za:
>
> Martin v. Löwis mar...@v.loewis.de added the comment:
>
> Committed as r72208, blocked as r72209.
>
> As for PyCodec_SurrogateErrors: I'd rather make it static than expose it.

Why? All the other error handlers are exposed.




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-05-02 Thread Martin v. Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

>> As for PyCodec_SurrogateErrors: I'd rather make it static than expose it.
>
> Why? All the other error handlers are exposed.

Sure - but what for? IMO, they all shouldn't be exposed.




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-05-02 Thread Benjamin Peterson

Benjamin Peterson benja...@python.org added the comment:

2009/5/2 "Martin v. Löwis" rep...@bugs.python.org@psf.upfronthosting.co.za:
>
> Martin v. Löwis mar...@v.loewis.de added the comment:
>
>>> As for PyCodec_SurrogateErrors: I'd rather make it static than expose it.
>>
>> Why? All the other error handlers are exposed.
>
> Sure - but what for? IMO, they all shouldn't be exposed.

The only reason I can think of is consistency, but I don't care that much.




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-05-01 Thread Martin v. Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

Here is a patch that implements this proposed approach. It introduces a
surrogates error handler, useful only for the utf-8 codec.

If this is accepted, the implementation of PEP 383 can be simplified
significantly, essentially removing the need for a separate utf-8b codec
(as that could be done in the error handler, as for the other codecs).
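
To illustrate the idea of handling lone surrogates in an error handler rather than in a separate codec, here is a minimal sketch in Python. It is not the patch itself: the function `pass_surrogates` and the handler name `demo_surrogates` are made up for this example, and it assumes a current Python 3, where an encoding error handler may return bytes and where the strict utf-8 codec reports lone surrogates as errors.

```python
import codecs


def pass_surrogates(exc):
    """Hypothetical handler: encode lone surrogates as 3-byte sequences."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc  # this sketch covers the encoding direction only
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        cp = ord(ch)
        if not 0xD800 <= cp <= 0xDFFF:
            raise exc  # not a surrogate: let the original error propagate
        # Generic 3-byte UTF-8 layout applied to the surrogate code point.
        out.append(0xE0 | (cp >> 12))
        out.append(0x80 | ((cp >> 6) & 0x3F))
        out.append(0x80 | (cp & 0x3F))
    # Return the replacement bytes and the position to resume encoding at.
    return bytes(out), exc.end


# "demo_surrogates" is a made-up name for this example.
codecs.register_error("demo_surrogates", pass_surrogates)

encoded = "a\ud801b".encode("utf-8", "demo_surrogates")
```

The codec itself stays strict; only callers that explicitly ask for the handler get the lenient behavior.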

--
assignee:  -> benjamin.peterson
keywords: +patch
nosy: +benjamin.peterson
priority: high -> release blocker
Added file: http://bugs.python.org/file13827/surrogates.diff




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-05-01 Thread Martin v. Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

rietveld: http://codereview.appspot.com/52081




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-05-01 Thread Martin v. Löwis

Changes by Martin v. Löwis mar...@v.loewis.de:


Removed file: http://bugs.python.org/file13827/surrogates.diff




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-05-01 Thread Martin v. Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

Fixed indexing error.

--
Added file: http://bugs.python.org/file13830/surrogates.diff




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-05-01 Thread Benjamin Peterson

Benjamin Peterson benja...@python.org added the comment:

http://codereview.appspot.com/52081/diff/1/5
File Doc/library/codecs.rst (right):

http://codereview.appspot.com/52081/diff/1/5#newcode326
Line 326: In addition, the following error handlers are specific to only
selected
"In addition, the following error handlers are specific to a single
codec." sounds better

http://codereview.appspot.com/52081/diff/1/5#newcode335
Line 335:
There should probably be a versionchanged directive indicating that
surrogates was added in 3.1.

http://codereview.appspot.com/52081/diff/1/6
File Lib/test/test_codecs.py (right):

http://codereview.appspot.com/52081/diff/1/6#newcode544
Line 544: def test_surrogates(self):
I think this should be split into 2 tests: test_lone_surrogates and
test_surrogate_handler.

http://codereview.appspot.com/52081/diff/1/4
File Objects/unicodeobject.c (right):

http://codereview.appspot.com/52081/diff/1/4#newcode157
Line 157: static PyObject *unicode_encode_call_errorhandler(const char
*errors,
These prototypes are longer than 80 chars in some places. I don't think the
arguments need to line up with the starting parenthesis.

http://codereview.appspot.com/52081/diff/1/4#newcode2393
Line 2393: s, size, exc, i-1, i, newpos);
exc is never Py_DECREFed.

http://codereview.appspot.com/52081/diff/1/4#newcode4110
Line 4110: if (!PyUnicode_Check(repunicode)) {
Is there a test of this case somewhere?

http://codereview.appspot.com/52081/diff/1/2
File Python/codecs.c (right):

http://codereview.appspot.com/52081/diff/1/2#newcode758
Line 758: if (PyObject_IsInstance(exc, PyExc_UnicodeEncodeError)) {
I believe PyErr_GivenExceptionMatches is more appropriate here, but
given the rest of the file uses PyObject_IsInstance, it likely doesn't
matter.

http://codereview.appspot.com/52081/diff/1/2#newcode771
Line 771: return NULL;
This leaks an object.

http://codereview.appspot.com/52081




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-04-30 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

On 2009-04-29 22:39, Martin v. Löwis @psf.upfronthosting.co.za wrote:
> Martin v. Löwis mar...@v.loewis.de added the comment:
>
> I think we could preserve the marshal format with yet another error
> handler - one that emits half surrogates into their intuitive form.

That's a good idea. We could have an error handler that lets the codec
accept lone surrogates for utf-8, or just add a new codec that does this
and use it for marshal.

Still, this can only go into 3.1.




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-04-29 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

While it's probably ok to fix the codecs, there's an issue which makes
this difficult at least for the utf-8 codec:

The marshal module uses utf-8 to write Unicode objects, and these must be
able to store the full range of supported UCS2/UCS4 code points,
including lone surrogates.

If the utf-8 codec were changed to raise an error for these, marshal
would no longer be able to write/read Unicode objects.

It is likely that other existing Python code (outside the std lib) also
relies on this ability.

Changing this would only be possible in 3.1.

The marshal module would then also have to be changed to use a different
encoding which does support encoding lone surrogates.

See issue 3297 for a discussion of UTF-8/16 vs. UCS2/4, the
implications, motivations, etc.
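
The constraint described above can be checked with a short sketch. It assumes a current Python 3 interpreter, where marshal round-trips arbitrary str objects while the strict utf-8 codec rejects lone surrogates. It shows only the observable behavior and says nothing about which encoding marshal uses internally; the variable names are illustrative.

```python
import marshal

lone = "\ud801"  # a lone high surrogate, not encodable by strict utf-8

# marshal has to serialize any str object, including lone surrogates.
data = marshal.dumps(lone)
restored = marshal.loads(data)

# The strict utf-8 codec, by contrast, refuses the same string.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```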




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-04-29 Thread Martin v. Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

I think we could preserve the marshal format with yet another error
handler - one that emits half surrogates into their intuitive form.




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-04-28 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

We could fix it for 3.1, and perhaps leave 2.7 unchanged if some people
rely on this (for whatever reason).

--
nosy: +pitrou
priority:  -> high
stage:  -> test needed
versions: +Python 3.1




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-04-28 Thread Antoine Pitrou

Changes by Antoine Pitrou pit...@free.fr:


--
nosy: +lemburg, loewis




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2009-04-25 Thread Jakub Wilk

Changes by Jakub Wilk uba...@users.sf.net:


--
nosy: +jwilk




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2008-09-02 Thread Ezio Melotti

Changes by Ezio Melotti [EMAIL PROTECTED]:


--
nosy: +ezio.melotti




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2008-08-24 Thread Adam Olsen

New submission from Adam Olsen [EMAIL PROTECTED]:

The Unicode FAQ makes it quite clear that any surrogates in UTF-8 or
UTF-32 should be treated as errors.  Lone surrogates in UTF-16 should
probably be treated as errors too (but only during encoding/decoding;
unicode objects on UTF-16 builds should allow them to be created through
slicing).

http://unicode.org/faq/utf_bom.html#30
http://unicode.org/faq/utf_bom.html#42
http://unicode.org/faq/utf_bom.html#40

Lone surrogate in UTF-8 (effectively CESU-8):
>>> '\xED\xA0\x81'.decode('utf-8')
u'\ud801'

Surrogate pair in UTF-8:
>>> '\xED\xA0\x81\xED\xB0\x80'.decode('utf-8')
u'\ud801\udc00'

On a UTF-32 build, encoding a surrogate pair with UTF-16, then decoding
again will produce the proper non-surrogate scalar value.  This has
security implications, though they rarely arise because characters
outside the BMP are rare:
>>> u'\ud801\udc00'.encode('utf-16').decode('utf-16')
u'\U00010400'

Also on a UTF-32 build, decoding of a lone surrogate in UTF-16 fails
(correctly), but encoding one does not:
>>> u'\ud801'.encode('utf-16')
'\xff\xfe\x01\xd8'


I have gotten a report of a user decoding bad data using
x.decode('utf-8', 'replace'), then getting an error from Gtk+ when the
ill-formed surrogates reached it.

Fixing this would cause issue 3297 to blow up loudly, rather than silently.
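
For comparison, the cases listed in this report can be probed on a current interpreter. The sketch below assumes Python 3.4 or later (the version in which, as far as I know, the utf-16 codecs also stopped encoding lone surrogates). The helper `raises` and the variable names are made up for this example.

```python
def raises(exc_type, func, *args):
    """Return True if func(*args) raises exc_type, else False."""
    try:
        func(*args)
    except exc_type:
        return True
    return False


# Lone surrogate written as three UTF-8-style bytes (first case above).
lone_bytes = b"\xed\xa0\x81"
# Surrogate pair written as two such sequences (second case above).
pair_bytes = b"\xed\xa0\x81\xed\xb0\x80"

lone_rejected = raises(UnicodeDecodeError, lone_bytes.decode, "utf-8")
pair_rejected = raises(UnicodeDecodeError, pair_bytes.decode, "utf-8")

# Encoding a lone surrogate is checked for both UTF-8 and UTF-16.
utf8_encode_rejected = raises(UnicodeEncodeError, "\ud801".encode, "utf-8")
utf16_encode_rejected = raises(UnicodeEncodeError, "\ud801".encode, "utf-16")

# With the 'replace' handler, check whether any surrogate survives decoding.
replaced = lone_bytes.decode("utf-8", "replace")
has_surrogate = any(0xD800 <= ord(ch) <= 0xDFFF for ch in replaced)
```

On such an interpreter each of the four `*_rejected` flags should come out true and `has_surrogate` false, meaning bad input decoded with 'replace' no longer hands surrogates to downstream libraries.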

--
messages: 71889
nosy: Rhamphoryncus
severity: normal
status: open
title: Ill-formed surrogates not treated as errors during encoding/decoding




[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

2008-08-24 Thread Adam Olsen

Changes by Adam Olsen [EMAIL PROTECTED]:


--
components: +Unicode
type:  -> behavior
