[issue37871] 40 * 473 grid of "é" has a single wrong character on Windows

2019-08-21 Thread STINNER Victor


Change by STINNER Victor:


--
nosy: +vstinner

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37871] 40 * 473 grid of "é" has a single wrong character on Windows

2019-08-16 Thread Steve Dower


Steve Dower added the comment:

I'd rather keep encoding incrementally, and reduce the length of each attempt 
until the chunk no longer ends partway through a multi-byte UTF-8 sequence 
(for example, until its last byte does not have its top bit set).

Otherwise the people who like to print >2GB worth of data to the console will 
complain about the memory error :)
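
A minimal sketch of that incremental idea, written in Python rather than C. 
The name `iter_utf8_chunks` and the `max_bytes` parameter are hypothetical and 
not part of CPython. The stopping test is also an assumption: it checks whether 
the byte just past the cut is a continuation byte (0b10xxxxxx) instead of 
testing the top bit of the last byte, and it assumes `max_bytes` is at least 4 
so that a whole character always fits.

```python
def iter_utf8_chunks(buf: bytes, max_bytes: int):
    """Yield consecutive slices of buf, each at most max_bytes long,
    that do not end partway through a multi-byte UTF-8 sequence."""
    start = 0
    while start < len(buf):
        end = min(start + max_bytes, len(buf))
        # Back up while the byte just past the cut is a continuation byte.
        while start < end < len(buf) and (buf[end] & 0xC0) == 0x80:
            end -= 1
        if end == start:
            # Malformed input or a tiny max_bytes: give up and cut anyway.
            end = min(start + max_bytes, len(buf))
        yield buf[start:end]
        start = end


data = (("é" * 40 + "\r\n") * 473).encode("utf-8")
chunks = list(iter_utf8_chunks(data, 19393))
print([len(c) for c in chunks])
```

Only one chunk is decoded at a time, so no second full-size copy of the data 
is needed, and every chunk decodes cleanly with a strict UTF-8 decoder.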

--




[issue37871] 40 * 473 grid of "é" has a single wrong character on Windows

2019-08-15 Thread Eryk Sun

Eryk Sun added the comment:

To be compatible with Windows 7, _io__WindowsConsoleIO_write_impl in 
Modules/_io/winconsoleio.c is forced to write to the console in chunks that do 
not exceed 32 KiB. It does so by repeatedly dividing the length to decode by 2 
until the decoded buffer size is small enough. 

    wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    while (wlen > 32766 / sizeof(wchar_t)) {
        len /= 2;
        wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    }

With `('é' * 40 + '\n') * 473`, encoded as UTF-8, we have 473 82-byte lines 
(note that "\n" has been translated to "\r\n"). This is 38,786 bytes, which is 
too much for a single write, so it splits it in two.

>>> 38786 // 2
19393
>>> 19393 // 82
236
>>> 19393 % 82
41

This means line 237 ends up with 20 'é' characters (UTF-8 b'\xc3\xa9') and one 
partial character sequence, b'\xc3'. When this buffer is passed to 
MultiByteToWideChar to decode from UTF-8 to UTF-16, the partial sequence gets 
decoded as the replacement character U+FFFD. For the next write, the remaining 
b'\xa9' byte also gets decoded as U+FFFD.
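
The split can be reproduced off Windows with a rough simulation. This is only 
a sketch under two assumptions: Python's UTF-8 codec with errors="replace" 
stands in for MultiByteToWideChar, and sizeof(wchar_t) is taken to be 2 as on 
Windows. The helper `utf16_units` is a hypothetical name used for illustration.

```python
# Simulation only: Python's codec stands in for MultiByteToWideChar.
LIMIT = 32766 // 2  # assumes sizeof(wchar_t) == 2

text = ("é" * 40 + "\r\n") * 473  # "\n" already translated to "\r\n"
data = text.encode("utf-8")


def utf16_units(chunk: bytes) -> int:
    """Count the UTF-16 code units produced by decoding chunk."""
    decoded = chunk.decode("utf-8", errors="replace")
    return len(decoded.encode("utf-16-le")) // 2


# Halve the byte length until the decoded size fits, as the C loop does.
length = len(data)
while utf16_units(data[:length]) > LIMIT:
    length //= 2

first, second = data[:length], data[length:]
first_decoded = first.decode("utf-8", errors="replace")
second_decoded = second.decode("utf-8", errors="replace")

print(length)                  # byte length of the first write
print(first[-1:], second[:1])  # the bytes on either side of the cut
print(first_decoded.count("\ufffd"), second_decoded.count("\ufffd"))
```

The cut lands between the two bytes of one 'é', so each half decodes with 
exactly one replacement character.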

To avoid this, _io__WindowsConsoleIO_write_impl could decode the whole buffer 
in one pass, and slice that up into writes that are less than 32 KiB. Or it 
could ensure that its UTF-8 slices are always at character boundaries.
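
The first option (decode once, then slice) might look roughly like the 
following Python sketch. It is an illustration, not the CPython code: 
`iter_utf16_writes` and `max_units` are hypothetical names, the slices are 
returned as UTF-16-LE bytes, and `max_units` is assumed to be at least 2 so 
that a surrogate pair always fits in one slice.

```python
def iter_utf16_writes(buf: bytes, max_units: int):
    """Decode buf once, then yield UTF-16-LE byte slices holding at most
    max_units code units each."""
    units = buf.decode("utf-8", errors="replace").encode("utf-16-le")
    total = len(units) // 2
    start = 0
    while start < total:
        end = min(start + max_units, total)
        if end < total:
            last = int.from_bytes(units[2 * (end - 1):2 * end], "little")
            # Keep a high surrogate (0xD800-0xDBFF) with its low surrogate.
            if 0xD800 <= last <= 0xDBFF and end - 1 > start:
                end -= 1
        yield units[2 * start:2 * end]
        start = end


data = (("é" * 40 + "\r\n") * 473).encode("utf-8")
writes = list(iter_utf16_writes(data, 32766 // 2))
rebuilt = "".join(w.decode("utf-16-le") for w in writes)
print([len(w) // 2 for w in writes])
```

Because the cut is made after decoding, it can never fall inside a UTF-8 
sequence. The trade-off is that the whole decoded buffer is held in memory at 
once.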

--
components: +IO
nosy: +eryksun




[issue37871] 40 * 473 grid of "é" has a single wrong character on Windows

2019-08-15 Thread ANdy

New submission from ANdy:

# To reproduce:
# Put this text in a file `a.py` and run `py a.py`.
# Or just run: py -c "print(('é' * 40 + '\n') * 473)"
# Scroll up for a while. One of the lines will be:
# ��ééé
# (You can spot this because it's slightly longer than the other lines.)
# The error is consistently on line 237, column 21 (1-indexed).

# The error reproduces on Windows but not Linux. Tested in both PowerShell and
# CMD.
# (Failed to reproduce on either a real Linux machine or on Ubuntu with WSL.)
# On Windows, the error reproduces every time consistently.

# There is no error if N = 472 or 474.
N = 473
# There is no error if W = 39 or 41.
# (I tested with console windows of varying sizes, all well over 40 characters.)
W = 40
# There is no error if ch = "e" with no accent.
# There is still an error for other unicode characters like "Ö" or "ü".
ch = "é"
# There is no error without newlines.
s = (ch * W + "\n") * N
# Assert the string itself is correct.
assert all(c in (ch, "\n") for c in s)
print(s)

# There is no error if we use N separate print statements
# instead of printing a single string with N newlines.

# Similar scripts written in Groovy, JS and Ruby have no error.
# Groovy: System.out.println(("é" * 40 + "\n") * 473)
# JS: console.log(("é".repeat(40) + "\n").repeat(473))
# Ruby: puts(("é" * 40 + "\n") * 473)

--
components: Windows
messages: 349837
nosy: anhans, paul.moore, steve.dower, tim.golden, zach.ware
priority: normal
severity: normal
status: open
title: 40 * 473 grid of "é" has a single wrong character on Windows
type: behavior
versions: Python 3.7
