[issue36311] Flaw in Windows code page decoder for large input

2019-09-09 Thread Steve Dower


Steve Dower  added the comment:

Declaring this out-of-scope for 2.7, unless someone wants to insist (and 
provide a PR).

--
resolution:  -> fixed
stage: backport needed -> resolved
status: open -> closed
versions:  -Python 2.7

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36311] Flaw in Windows code page decoder for large input

2019-08-21 Thread miss-islington


miss-islington  added the comment:


New changeset 735a960ac98cf414caf910565220ab2761fa542a by Miss Islington (bot) 
in branch '3.7':
bpo-36311: Fixes decoding multibyte characters around chunk boundaries and 
improves decoding performance (GH-15083)
https://github.com/python/cpython/commit/735a960ac98cf414caf910565220ab2761fa542a


--




[issue36311] Flaw in Windows code page decoder for large input

2019-08-21 Thread miss-islington


miss-islington  added the comment:


New changeset f93c15aedc2ea2cb8b56fc9dbb0d412918992e86 by Miss Islington (bot) 
in branch '3.8':
bpo-36311: Fixes decoding multibyte characters around chunk boundaries and 
improves decoding performance (GH-15083)
https://github.com/python/cpython/commit/f93c15aedc2ea2cb8b56fc9dbb0d412918992e86


--
nosy: +miss-islington




[issue36311] Flaw in Windows code page decoder for large input

2019-08-21 Thread Steve Dower


Steve Dower  added the comment:

I'll get the 3.7 and 3.8 backports merged - looks like they're trivial. 

Going to need some help with the 2.7 backport, but I'm happy to approve a PR.

--
stage: patch review -> backport needed




[issue36311] Flaw in Windows code page decoder for large input

2019-08-21 Thread miss-islington


Change by miss-islington :


--
pull_requests: +15086
pull_request: https://github.com/python/cpython/pull/15375




[issue36311] Flaw in Windows code page decoder for large input

2019-08-21 Thread miss-islington


Change by miss-islington :


--
pull_requests: +15085
pull_request: https://github.com/python/cpython/pull/15374




[issue36311] Flaw in Windows code page decoder for large input

2019-08-21 Thread Steve Dower


Steve Dower  added the comment:


New changeset 7ebdda0dbee7df6f0c945a7e1e623e47676e112d by Steve Dower in branch 
'master':
bpo-36311: Fixes decoding multibyte characters around chunk boundaries and 
improves decoding performance (GH-15083)
https://github.com/python/cpython/commit/7ebdda0dbee7df6f0c945a7e1e623e47676e112d


--




[issue36311] Flaw in Windows code page decoder for large input

2019-08-02 Thread Steve Dower


Change by Steve Dower :


--
keywords: +patch
pull_requests: +14828
stage: test needed -> patch review
pull_request: https://github.com/python/cpython/pull/15083




[issue36311] Flaw in Windows code page decoder for large input

2019-08-02 Thread Steve Dower


Change by Steve Dower :


--
assignee:  -> steve.dower
versions: +Python 3.9




[issue36311] Flaw in Windows code page decoder for large input

2019-08-02 Thread Steve Dower


Steve Dower  added the comment:

If we reduce our chunk size below INT_MAX, then we avoid the issue entirely. 
Our logic for hitting the middle of a multibyte character is fine (perhaps 
fixed since this issue was opened?); there's just a weird edge case at 2 GiB in 
the API call.

As a bonus, smaller chunks seem to have a performance benefit too. INT_MAX/4 
looks like the sweet spot: my 2 GiB test case took about a quarter of the time 
it took with INT_MAX (and we're measuring in tens of seconds here, so I'm 
pretty comfortable with the direction of the result). INT_MAX/2 and INT_MAX/8 
were both slower than INT_MAX/4.
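
The chunk arithmetic discussed above can be sketched in Python. This is only an
illustration of how a 2 GiB input divides up under an INT_MAX/4 limit; it is not
the C code from the patch. The names CHUNK_LIMIT and chunk_spans are
hypothetical, and the sketch deliberately ignores multibyte boundaries.

```python
INT_MAX = 2**31 - 1          # largest value of a 32-bit C int
CHUNK_LIMIT = INT_MAX // 4   # hypothetical name for the chunk size reported as fastest


def chunk_spans(total_size, limit=CHUNK_LIMIT):
    """Yield (offset, length) pairs that cover total_size bytes.

    Every length is at most `limit`, so it always fits in a C int.
    """
    offset = 0
    while offset < total_size:
        length = min(limit, total_size - offset)
        yield offset, length
        offset += length


# Spans for a 2 GiB input, the size that a single C int cannot describe.
spans = list(chunk_spans(2 * 1024**3))
```

Because INT_MAX is one less than 2 GiB, a 2 GiB input does not divide evenly:
four full chunks are followed by a short tail.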

--




[issue36311] Flaw in Windows code page decoder for large input

2019-03-22 Thread Terry J. Reedy


Terry J. Reedy  added the comment:

I have 24 GB of RAM (if it is all working) and would be willing to try to run a test case.

--
nosy: +terry.reedy
stage:  -> test needed




[issue36311] Flaw in Windows code page decoder for large input

2019-03-16 Thread Serhiy Storchaka


New submission from Serhiy Storchaka :

There is a flaw in PyUnicode_DecodeCodePageStateful() (exposed as 
_codecs.code_page_decode() at the Python level). Since MultiByteToWideChar() 
takes the size of the input as a C int, it cannot be used to decode more than 
2 GiB at once. Larger input is split into chunks of 2 GiB, which are decoded 
separately. The problem arises when a split falls in the middle of a multibyte 
character. In that case decoding will always either fail or replace the 
incomplete parts of the multibyte character at both ends with whatever the 
error handler returns.

It is hard to reproduce this bug, because you need to decode more than 2 GiB, 
and you will need at least 14 GiB of RAM for this (maybe more).
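
The boundary problem itself can be shown at small scale without large amounts
of memory. The following is a minimal sketch that uses Python's portable cp932
codec and a tiny, hypothetical chunk size of 5 bytes as a stand-in for the
2 GiB limit. It does not call the Windows code page decoder, so it only
illustrates the class of failure, not the exact behaviour of
PyUnicode_DecodeCodePageStateful().

```python
import codecs

text = "日本語" * 4
data = text.encode("cp932")  # each of these characters is 2 bytes in cp932

# Hypothetical chunk size standing in for the 2 GiB limit. An odd value
# guarantees that a split lands inside a 2-byte character.
CHUNK = 5
chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

# Naive approach: decode every chunk on its own, so the two halves of a
# split character are handled separately by the error handler or misread.
naive = "".join(c.decode("cp932", errors="replace") for c in chunks)

# Stateful approach: an incremental decoder keeps an incomplete trailing
# byte and joins it with the start of the next chunk.
decoder = codecs.getincrementaldecoder("cp932")(errors="replace")
stateful = "".join(
    decoder.decode(c, final=(i == len(chunks) - 1))
    for i, c in enumerate(chunks)
)
```

Here the stateful result round-trips to the original text, while the naive
result does not.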

--
components: Interpreter Core, Windows
messages: 338061
nosy: doerwalter, lemburg, paul.moore, serhiy.storchaka, steve.dower, 
tim.golden, zach.ware
priority: normal
severity: normal
status: open
title: Flaw in Windows code page decoder for large input
type: behavior
versions: Python 2.7, Python 3.7, Python 3.8
