[issue27716] http.client truncates UTF-8 encoded headers

2016-09-17 Thread Martin Panter
Martin Panter added the comment: Thanks to the fix for Issue 22233, now the response is parsed more sensibly, and the body can be read. The 0x85 byte now gets decoded with Latin-1: >>> print(ascii(resp.getheader("Link")[:100])) '
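
To illustrate what "decoded with Latin-1" means for a caller: because ISO-8859-1 maps every byte to exactly one code point, the original bytes can be recovered by re-encoding the header value and decoding it again as UTF-8. The following is only a sketch. The helper name `recover_utf8` and the sample value are made up for illustration and are not part of http.client or of the report.

```python
def recover_utf8(value):
    """Hypothetical helper: undo a Latin-1 decode of UTF-8 bytes.

    Falls back to the original string if the bytes are not valid UTF-8
    (or if the string was not produced by a Latin-1 decode).
    """
    try:
        return value.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value


# Made-up header value: U+0105 is encoded in UTF-8 as the bytes C4 85.
wire = "m\u0105ka".encode("utf-8")

# What a caller sees when the bytes are decoded as Latin-1.
as_seen = wire.decode("iso-8859-1")

# Round-tripping through Latin-1 restores the intended text.
recovered = recover_utf8(as_seen)

# A genuine Latin-1 value that is not valid UTF-8 is left untouched.
untouched = recover_utf8("caf\xe9")
```

Whether a given server really meant UTF-8 is a guess the caller has to make, which is why the helper falls back to the undecoded string instead of raising.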

[issue27716] http.client truncates UTF-8 encoded headers

2016-08-09 Thread Martin Panter
Martin Panter added the comment: For the test case given, the main problem is actually that a header field is being incorrectly split on a Latin-1 “next line” control code U+0085. The problem is already described under Issue 22233. It looks like I wrote a patch for that a while ago, so it woul
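
The splitting behaviour described above can be reproduced without any network access. This is a minimal sketch with a made-up header (`X-Example` and its value are assumptions, not the header from the report): UTF-8 encodes U+0105 as the bytes C4 85, and once those bytes are decoded as Latin-1 the 0x85 byte becomes U+0085, which `str.splitlines()` treats as a line boundary. `bytes.splitlines()` only recognises CR and LF, so it leaves the field intact.

```python
# Made-up header line; U+0105 becomes the bytes C4 85 in UTF-8.
header_bytes = "X-Example: m\u0105ka\r\n".encode("utf-8")

# Decode the raw bytes as Latin-1, so 0x85 turns into U+0085 (NEL).
text = header_bytes.decode("iso-8859-1")

# str.splitlines() breaks on U+0085, cutting the field in two.
str_lines = text.splitlines()

# bytes.splitlines() only splits on CR, LF and CRLF.
byte_lines = header_bytes.splitlines()

print(ascii(str_lines))
print(byte_lines)
```

Splitting on line terminators while the data is still bytes, or splitting explicitly on "\r\n", avoids the spurious break.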

[issue27716] http.client truncates UTF-8 encoded headers

2016-08-09 Thread R. David Murray
R. David Murray added the comment: Well, email will happily parse bytes and treat the non-ASCII data as opaque (though it does record errors in an internal data structure), but the Python 3 HTTP API expects the parsed headers to be strings when you access them, so you'd just hit the decoding pr

[issue27716] http.client truncates UTF-8 encoded headers

2016-08-09 Thread Cory Benfield
Cory Benfield added the comment: Honestly, David, everything's a mess on this front. The authoritative document here is RFC 7230 Section 3.2.4 (https://tools.ietf.org/html/rfc7230#section-3.2.4). The last paragraph of that section reads: Historically, HTTP has allowed field content with te

[issue27716] http.client truncates UTF-8 encoded headers

2016-08-09 Thread R. David Murray
R. David Murray added the comment: UTF-8 headers are contrary to the HTTP spec, aren't they? Or has that changed? (It's been a long time since I've looked at any HTTP RFCs.) This could be fixed by using SMTPUTF8 mode when parsing the headers, which in theory ought to be backward compatible.

[issue27716] http.client truncates UTF-8 encoded headers

2016-08-09 Thread Cory Benfield
Cory Benfield added the comment: Simple repro case:

    import http.client
    conn = http.client.HTTPConnection('pl.bab.la')
    conn.request("GET", '/slownik/angielski-polski/')
    resp = conn.getresponse()
    resp.read()  # Hangs here

[issue27716] http.client truncates UTF-8 encoded headers

2016-08-09 Thread Cory Benfield
New submission from Cory Benfield: Originally reported as Requests issue #3485: https://github.com/kennethreitz/requests/issues/3485 On Python 3, http.client uses the email module to parse its HTTP headers. The email module, for better or worse, requires that it parse headers as *text*: that