[issue24214] Exception with utf-8, surrogatepass and incremental decoding

2016-07-27 Thread STINNER Victor

STINNER Victor added the comment:

The attached patch fixes the UTF-8 decoder so that incremental decoding works 
correctly with the surrogatepass error handler.

The bug occurs when b'\xed\xa4\x80' is decoded in two parts: the first two 
bytes (b'\xed\xa4'), and then the last byte (b'\x80').

It works as expected if we decode the first byte (b'\xed') and then the two 
last bytes (b'\xa4\x80').
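
The two splits described above can be tried directly against the standard 
incremental decoder. This is only a minimal sketch: decode_in_parts is a 
helper name invented for illustration, and it returns None instead of 
propagating the UnicodeDecodeError that affected versions raise for the 
second split.

```python
import codecs


def decode_in_parts(parts, errors="surrogatepass"):
    """Feed byte chunks to an incremental UTF-8 decoder, one at a time.

    Hypothetical helper: returns the decoded text, or None if the decoder
    raises UnicodeDecodeError (the misbehaviour reported in this issue).
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors)
    out = []
    try:
        for i, part in enumerate(parts):
            final = i == len(parts) - 1  # flush the decoder on the last chunk
            out.append(decoder.decode(part, final))
    except UnicodeDecodeError:
        return None
    return "".join(out)


# Split after the first byte: reported to work as expected.
print(repr(decode_in_parts([b"\xed", b"\xa4\x80"])))

# Split after the second byte: the case that triggers the bug.
print(repr(decode_in_parts([b"\xed\xa4", b"\x80"])))
```

On an interpreter without the bug, both calls should give the same single 
surrogate character.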

My patch tries to preserve the performance of the UTF-8/strict decoder.

@Serhiy: Would you mind reviewing my patch, since you helped design the fast 
UTF-8 decoder?

--
keywords: +patch
Added file: http://bugs.python.org/file43911/surrogatepass.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24214] Exception with utf-8, surrogatepass and incremental decoding

2016-07-26 Thread RalfM

RalfM added the comment:

I just tested Python 3.6.0a3, and that (mis)behaves exactly like 3.4.3.

--
versions: +Python 3.6




[issue24214] Exception with utf-8, surrogatepass and incremental decoding

2015-05-16 Thread RalfM

New submission from RalfM:

I have a UTF-8 encoded file containing single surrogates. Reading this file 
with errors="surrogatepass" works fine when I read the whole file with .read(), 
but raises a UnicodeDecodeError when I read the file line by line:

- start of demo -
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:44:40) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f:
...   s = f.read()
...
>>> "\ud900" in s
True
>>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f:
...   for line in f:
...     pass
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python\34x64\lib\codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 8190: invalid continuation byte
>>>
- end of demo -

I attached the file used for the demo such that you can reproduce the problem.

If I change all 0xED bytes in the file to 0xEC (i.e. effectively change all 
surrogates to non-surrogates), the problem disappears.
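
As a small illustration of why that byte change matters (using the 
three-byte sequence for U+D900 that appears in the demo; the variable names 
are only for this sketch): a lead byte of 0xED followed by 0xA4 encodes a 
code point in the surrogate range, while 0xEC with the same continuation 
bytes encodes an ordinary character.

```python
surrogate_bytes = b"\xed\xa4\x80"   # encodes U+D900, a lone surrogate
ordinary_bytes = b"\xec\xa4\x80"    # same continuation bytes, lead byte 0xEC

# The strict handler rejects the surrogate encoding ...
try:
    surrogate_bytes.decode("utf-8")
    strict_accepts_surrogate = True
except UnicodeDecodeError:
    strict_accepts_surrogate = False

# ... but surrogatepass lets it through as a single code point.
surrogate_char = surrogate_bytes.decode("utf-8", "surrogatepass")

# With 0xEC the sequence is plain, valid UTF-8 and needs no special handler.
ordinary_char = ordinary_bytes.decode("utf-8")

print(strict_accepts_surrogate)
print(hex(ord(surrogate_char)), hex(ord(ordinary_char)))
```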

The original file I noticed the problem with was 73 MB.  The demo file was 
derived from the original by removing data around the critical section, keeping 
the alignment to 16 KiB blocks, and then replacing all printable ASCII 
characters with x.

If I change the file length by adding or removing 16 bytes to / from the 
beginning of the demo file, the problem disappears, so alignment seems to be an 
issue.

All this seems to indicate that the utf-8 decoder has problems when used for 
incremental decoding and a single surrogate appears around the block boundary.
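
A rough way to test this hypothesis without the attached file is to build 
the data in memory so that the surrogate straddles a read boundary. This is 
only a sketch under assumptions: the 8192-byte chunk size is inferred from 
the error position 8190 in the traceback and is an implementation detail of 
the text layer, and read_lines is a hypothetical helper that returns None 
where an affected interpreter would raise UnicodeDecodeError.

```python
import io

CHUNK = 8192  # assumed read size of the text layer, inferred from position 8190


def read_lines(data, errors="surrogatepass"):
    """Decode data line by line through io.TextIOWrapper.

    Hypothetical helper: returns the joined lines, or None if the
    incremental decoder raises UnicodeDecodeError.
    """
    try:
        with io.TextIOWrapper(io.BytesIO(data), encoding="utf-8",
                              errors=errors) as f:
            return "".join(line for line in f)
    except UnicodeDecodeError:
        return None


# Place the 3-byte surrogate so that its first two bytes end one chunk
# and its last byte starts the next one.
straddling = b"x" * (CHUNK - 2) + b"\xed\xa4\x80\n"

# Same content with 16 bytes removed from the beginning, so the surrogate
# no longer crosses the assumed boundary.
shifted = straddling[16:]

print(read_lines(straddling) is not None)
print(read_lines(shifted) is not None)
```

If the hypothesis holds, an affected interpreter fails only on the first 
input, while decoding each input in one go with bytes.decode succeeds.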

--
components: Unicode
files: Demo.txt
messages: 243376
nosy: RalfM, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: Exception with utf-8, surrogatepass and incremental decoding
type: behavior
versions: Python 3.4
Added file: http://bugs.python.org/file39400/Demo.txt
