[issue36819] Crash during encoding using UTF-16/32 and custom error handler

2021-09-29 Thread Walter Dörwald

Walter Dörwald  added the comment:

The original specification (PEP 293) required that an error handler called for 
encoding *must* return a replacement string (not bytes). This returned string 
must then be encoded again. Only if this fails an exception must be raised.

Returning bytes from the encoding error handler is an extension specified by 
PEP 383:

> The error handler interface is extended to allow the encode error handler to 
> return byte strings immediately, in addition to returning Unicode strings 
> which then get encoded again (also see the discussion below).

So for 3. in Serhiy's problem list

> 3. Incorrect exception can be raised if the error handler returns invalid 
> string/bytes: a non-ASCII string or a bytes object consisting of not a whole 
> number of units.

I get:

 ~/ ❯ python
Python 3.9.7 (default, Sep  3 2021, 12:37:55)
[Clang 12.0.5 (clang-1205.0.22.9)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def bad(exc):
...  return ('\udbc0', exc.start)
...
>>> import codecs
>>> codecs.register_error('bad', bad)
>>> '\udbc0'.encode('utf-16', 'bad')
Traceback (most recent call last):
  File "", line 1, in 
UnicodeEncodeError

I would have expected an exception message that basically looks like the one 
I'd get, if I had used the strict error handler.

But otherwise returning a replacement that is unencodable is allowed and should 
raise an exception (which happens here, but with a missing exception message). 
(Returning something unencodable might make sense when the error handler is 
able to create replacement characters for some unencodable input, but not for 
other, but of course the error handler can always raise an exception directly).

Returning invalid bytes is not an issue, they simply get written to the output. 
That's exactly the use case of PEP 383: The bytes couldn't be decoded in the 
specified encoding, so they are "invalid", but the surrogateescape error 
handler encodes them back to the same "invalid" bytes. So the error handler is 
allowed to output bytes that can't be decoded again with the same encoding.

Returning a restart position outside the valid range of the length of the 
original string should raise an IndexError according to PEP 293:

> If the callback does not raise an exception (either the one passed in, or a 
> different one), it must return a tuple: `(replacement, newpos)`
> `replacement` is a unicode object that the encoder will encode and emit 
> instead of the unencodable `object[start:end]` part, `newpos` specifies
> a new position within object, where (after encoding the replacement) the 
> encoder will continue encoding.
> Negative values for `newpos` are treated as being relative to end of object. 
> If `newpos` is out of bounds the encoder will raise an `IndexError`.

Of course we could retroactively reinterpret "out of bounds" as outside of 
`range(exc.start + 1, len(object))`, instead of outside `range(0, 
len(object))`. An error handler that never advances is broken anyway. But we 
can't detect "never".

However it would probably be OK to reject pathological error handlers (i.e. 
those that don't advance (i.e. return at least `exc.start + 1` as the restart 
position)). But I'm not sure how that's different from an error handler that 
skips ahead much farther (i.e. returns something like 
`(exc.start+len(object))//2` or `max(exc.start+1, len(object)-10)`): The 
returned restart position leads to a certain expectation of how many bytes the 
encoder might have to output until everything is encoded and must adjust 
accordingly.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36819] Crash during encoding using UTF-16/32 and custom error handler

2021-09-29 Thread Marc-Andre Lemburg


Marc-Andre Lemburg  added the comment:

On 29.09.2021 10:41, Serhiy Storchaka wrote:
> 
> Restricting the returned position to be strictly larger than start would 
> solve the problem with infinite loop and OOM. But this is a different issue.

Yes, this would make sense, since having the codec process
the same error location over and over again will not resolve the
error, so it's clearly a bug in the error handler.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36819] Crash during encoding using UTF-16/32 and custom error handler

2021-09-29 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

Restricting the returned position to be strictly larger than start would solve 
the problem with infinite loop and OOM. But this is a different issue.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36819] Crash during encoding using UTF-16/32 and custom error handler

2021-09-29 Thread Marc-Andre Lemburg


Change by Marc-Andre Lemburg :


--
nosy: +doerwalter

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36819] Crash during encoding using UTF-16/32 and custom error handler

2021-09-29 Thread Marc-Andre Lemburg


Change by Marc-Andre Lemburg :


--
components: +Unicode
nosy: +ezio.melotti

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36819] Crash during encoding using UTF-16/32 and custom error handler

2021-09-29 Thread Marc-Andre Lemburg


Marc-Andre Lemburg  added the comment:

Looking at the specs in PEP 293 (https://www.python.org/dev/peps/pep-0293/), it 
is certainly possible for the error handler to return a newpos outside the 
range start - end, meaning in most cases: a value >= end.

There's a good reason for this: the codec may not be able to correctly 
determine the end of the sequence and so the end value presented by the codec 
is not necessarily a valid start to continue encoding/decoding. The error 
handler can e.g. choose to skip more input characters by trying to find the 
next valid sequence.

In the example script, the handler returns start, so the value is within the 
range. A limit would not solve the problem.

It seems that the reallocation logic of the codecs is the main problem here.

--
nosy: +lemburg

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36819] Crash during encoding using UTF-16/32 and custom error handler

2021-09-29 Thread STINNER Victor


STINNER Victor  added the comment:

> We could just forbid error handlers returning position not in the range 
> (start , end]

Yeah, that sounds like a reasonable solution. I don't see the point of 
returning a position outside this range. What would be the use case?

For me, the only corner case is the "ignore" error handler which returns an 
empty string, but it returns a position in this range, no?

> it can break some code, so it is better to do this only in a new release.

Implementing custom error handlers is a rare use case, so it should only affect 
a minority of users. Moreover, IMO returning a position outside the valid range 
is a bug. It's common that security fixes change the behavior, like rejecting 
values which were previously acceptd, to prevent a Python crash.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36819] Crash during encoding using UTF-16/32 and custom error handler

2021-09-28 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
pull_requests: +26973
pull_request: https://github.com/python/cpython/pull/28593

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36819] Crash during encoding using UTF-16/32 and custom error handler

2021-09-27 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

I am working on it, since it is more complex issue, and PR 13134 does not solve 
it.

1. This bug affects also other codecs implemented in C: ASCII, Latin1, UTF-8, 
etc.
2. It still crashes in UTF-16/32 encoders if the error handler returns a 
position less than the current position.
3. Incorrect exception can be raised if the error handler returns invalid 
string/bytes: a non-ASCII string or a bytes object consisting of not a whole 
number of units.
4. The code for standard error handlers and for decoders needs a revision too. 
I have some suspects.

We could just forbid error handlers returning position not in the range (start 
, end], but it can break some code, so it is better to do this only in a new 
release.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36819] Crash during encoding using UTF-16/32 and custom error handler

2021-09-19 Thread Irit Katriel


Irit Katriel  added the comment:

Reproduced on 3.11.

--
nosy: +iritkatriel
versions: +Python 3.10, Python 3.11, Python 3.9 -Python 3.7, Python 3.8

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36819] Crash during encoding using UTF-16/32 and custom error handler

2019-05-06 Thread anthony shaw


anthony shaw  added the comment:

Easily reproduced on master, thanks

(lldb) run encode_crash.py
Process 14743 launched: '/Users/anthonyshaw/repo/cpython/python.exe' (x86_64)
Objects/unicodeobject.c:448: _PyUnicode_CheckConsistency: Assertion 
"((PyObject*)(op))->ob_type))->tp_flags & ((1UL << 28))) != 0)" failed
Enable tracemalloc to get the memory block allocation traceback

object  : Process 14743 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
(code=EXC_I386_GPFLT)
frame #0: 0x0001000b5c15 python.exe`PyObject_Repr(v=0x000101376f90) 
at object.c:535:11
   532 infinitely. */
   533  if (Py_EnterRecursiveCall(" while getting the repr of an object"))
   534  return NULL;
-> 535  res = (*v->ob_type->tp_repr)(v);
   536  Py_LeaveRecursiveCall();
   537  if (res == NULL)
   538  return NULL;

--
nosy: +anthonypjshaw

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36819] Crash during encoding using UTF-16/32 and custom error handler

2019-05-06 Thread Andrei Talaba


Change by Andrei Talaba :


Added file: https://bugs.python.org/file48305/encode_crash.py

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36819] Crash during encoding using UTF-16/32 and custom error handler

2019-05-06 Thread Andrei Talaba


Change by Andrei Talaba :


Removed file: https://bugs.python.org/file48304/encode_crash.py

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36819] Crash during encoding using UTF-16/32 and custom error handler

2019-05-06 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
assignee:  -> serhiy.storchaka
versions:  -Python 3.6

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36819] Crash during encoding using UTF-16/32 and custom error handler

2019-05-06 Thread Andrei Talaba


Change by Andrei Talaba :


--
keywords: +patch
pull_requests: +13047
stage:  -> patch review

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36819] Crash during encoding using UTF-16/32 and custom error handler

2019-05-06 Thread Andrei Talaba


New submission from Andrei Talaba :

The CPython interpreter write out-of-bounds of allocated memory in certain edge 
cases in the utf-16 and utf-32 encoders. 

The attached script registers two error handlers that either write one ascii 
character, or two bytes, and tells the encoder to start again from the start of 
the encoding error. The script then tries to encode an invalid codepoint in 
either utf-16 or utf-32. Each of the calls to encode independently cause 
segfaults

Since the encoder starts over again and keeps trying to append the result of 
the error handler, the lack of proper re-allocations leads to a buffer 
overflow, and corrupts the stack.

--
components: Interpreter Core
files: encode_crash.py
messages: 341599
nosy: atalaba, serhiy.storchaka, vstinner
priority: normal
severity: normal
status: open
title: Crash during encoding using UTF-16/32 and custom error handler
type: crash
versions: Python 3.6, Python 3.7, Python 3.8
Added file: https://bugs.python.org/file48304/encode_crash.py

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com