[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor
STINNER Victor added the comment: The surrogateescape error handler works with any codec. The surrogatepass only works with utf-8 if I remember correctly. The surrogateescape error handler works with any codec, especially ascii. As a side effect of this change an input from stdin will be
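
The claim that surrogateescape is not tied to UTF-8 can be checked with a short round trip through the ASCII codec. This is only a sketch; the sample bytes and variable names are made up for illustration.

```python
# Round-trip an undecodable byte through the ASCII codec (illustrative data).
raw = b"caf\xe9"  # 0xE9 is not valid ASCII

# With surrogateescape, the bad byte 0xE9 is mapped to the lone surrogate U+DCE9.
text = raw.decode("ascii", "surrogateescape")

# Encoding with the same codec and handler turns U+DCE9 back into 0xE9.
restored = text.encode("ascii", "surrogateescape")

print(ascii(text), restored)
```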

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: The surrogateescape error handler works with any codec. Ah, sorry. You are correct. Correct, but it's not something new: os.listdir(), sys.argv, os.environ and other functions using os.fsdecode(). Applications should already have to support surrogates.

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor
STINNER Victor added the comment: I'm only saying that this will increase the number of cases where an exception will be raised in an unexpected place. The print() instruction is much more common than input(). IMO changing the error handler should fix more issues than adding regressions. Python

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Wouldn't it be safer to use surrogateescape for output and strict for input? -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue18713 ___

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor
STINNER Victor added the comment: Serhiy Storchaka added the comment: Wouldn't it be safer to use surrogateescape for output and strict for input? Nick wrote: "Think sysadmins running scripts on Linux, writing to the console or a pipe." See my message msg195769: Python3 cannot be simply used as a

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Antoine Pitrou
Antoine Pitrou added the comment: See my message msg195769: Python3 cannot be simply used as a pipe because it wants to be kind by decoding binary data to Unicode, whereas not everybody cares about Unicode :-) If somebody doesn't care about unicode, they can use sys.stdin.buffer. Problem solved
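
A minimal sketch of the sys.stdin.buffer approach mentioned here: copy raw bytes between binary streams so that nothing is ever decoded. The helper name copy_bytes and the chunk size are hypothetical; in-memory streams stand in for the real pipe ends so the snippet runs on its own.

```python
import io

def copy_bytes(src, dst, chunk_size=65536):
    """Copy raw bytes from src to dst without decoding them (hypothetical helper)."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total

# In-memory stand-ins for the binary layers of the standard streams.
source = io.BytesIO(b"valid \xc3\xa9 and invalid \xff bytes\n")
sink = io.BytesIO()
copied = copy_bytes(source, sink)
```

In a real filter script the call would be copy_bytes(sys.stdin.buffer, sys.stdout.buffer), after importing sys.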

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Antoine Pitrou
Antoine Pitrou added the comment: Serhiy Storchaka also noticed (in the review of my patch) that errors is strict when PYTHONIOENCODING=utf-8 is used. We should also use surrogateescape if only the encoding is changed. I don't understand what you say. Could you rephrase?

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor
STINNER Victor added the comment: Serhiy Storchaka also noticed (in the review of my patch) that errors is strict when PYTHONIOENCODING=utf-8 is used. We should also use surrogateescape if only the encoding is changed. I don't understand what you say. Could you rephrase? With my patch,

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Antoine Pitrou
Antoine Pitrou added the comment: Is it a bug in your patch, or is it deliberate?

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor
STINNER Victor added the comment: Is it a bug in your patch, or is it deliberate? It was not deliberate, and I think that it would be more consistent to use the same error handler (surrogateescape) when only the encoding is changed by the PYTHONIOENCODING environment variable. So

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: The surrogateescape error handler is dangerous with utf-16/32. It can produce globally invalid output.

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor
STINNER Victor added the comment: The surrogateescape error handler is dangerous with utf-16/32. It can produce globally invalid output. I don't understand, can you give an example? surrogateescape generates an invalid encoded string with any encoding. Example with UTF-8:

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Serhiy Storchaka
Serhiy Storchaka added the comment:
>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape')
b'\xff\xdcq\x00w\x00e\x00r\x00t\x00y\x00'
>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape').decode('utf-16le', 'surrogateescape')
'\udcff\udcdcqwerty'
>>> ('\udcff' +

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Nick Coghlan
Nick Coghlan added the comment: Note that the specific case I'm really interested is printing on systems that are properly configured to use UTF-8, but are getting bad metadata from an OS API. I'm OK with the idea of *only* changing it for UTF-8 rather than for arbitrary encodings, as well as

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread R. David Murray
R. David Murray added the comment: If you pipe the ls (eg: ls temp) the bytes are preserved. Since setting the escape handler via PYTHONIOENCODING sets it for both stdin and stdout, it sounds like that solves the sysadmin use case. The sysadmin can just put that environment variable
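
The PYTHONIOENCODING approach described here can be sketched by running a child interpreter with the variable set and feeding it bytes that are not valid UTF-8. The helper name pipe_through_python and the sample bytes are made up for illustration; the encoding:errorhandler value format is the one documented for PYTHONIOENCODING.

```python
import os
import subprocess
import sys

def pipe_through_python(data, io_encoding="utf-8:surrogateescape"):
    """Run a child Python that copies text stdin to text stdout.

    Hypothetical helper: returns the raw bytes the child wrote.
    """
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    child_code = "import sys; sys.stdout.write(sys.stdin.read())"
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        input=data,
        stdout=subprocess.PIPE,
        env=env,
        check=True,
    )
    return result.stdout

raw = b"abc\xff\xfe"  # the last two bytes are not valid UTF-8
echoed = pipe_through_python(raw)
```

Because the handler applies to both stdin and stdout, the undecodable bytes are escaped on the way in and restored on the way out.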

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor
STINNER Victor added the comment:
>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape')
b'\xff\xdcq\x00w\x00e\x00r\x00t\x00y\x00'
Oh, this is a bug in the UTF-16 encoder: it should not encode surrogate characters => see issue #12892
I read that it's possible to set a standard stream

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Nick Coghlan
Nick Coghlan added the comment: On 23 Aug 2013 01:40, R. David Murray rep...@bugs.python.org wrote: (I double checked, and this does indeed work...doing the equivalent of ls temp via python preserves the bytes with that PYTHONIOENCODING setting. I don't quite understand, however, why I get

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Benjamin Peterson
Benjamin Peterson added the comment: I think it would be great to have a Unicode/bytes howto with information like this included.

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-21 Thread R. David Murray
R. David Murray added the comment: I think the essential use case is using a python program in a unix pipeline. I'm very sympathetic to that use case, despite my unease.

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-21 Thread STINNER Victor
STINNER Victor added the comment: Currently, Python 3 fails miserably when it gets a non-ASCII character from stdin or when it tries to write a byte encoded as a Unicode surrogate to stdout. It works fine when OS data can be decoded from and encoded to the locale encoding. Example on Linux

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-21 Thread Arfrever Frehtes Taifersar Arahesis
Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com: -- nosy: +Arfrever

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-21 Thread STINNER Victor
STINNER Victor added the comment: Attached patch changes the error handler of stdin, stdout and stderr to surrogateescape by default. It can still be changed explicitly using the PYTHONIOENCODING environment variable. -- keywords: +patch Added file:

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-21 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: The surrogateescape error handler works only with UTF-8. As a side effect of this change an input from stdin will be incompatible in general with extensions which implicitly encode a string to bytes with UTF-8 (e.g. tkinter, XML parsers, sqlite3, datetime,

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread Toshio Kuratomi
Toshio Kuratomi added the comment: Nick and I had talked about this at a recent conference and came to it from different directions. On the one hand, Nick made the point that any encoding of surrogateescape'd text to bytes via a different encoding is corrupting the data as a whole. On the
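
Nick's point about re-encoding with a different codec can be illustrated with a small sketch. The sample bytes and the choice of Latin-1 as the second encoding are assumptions made for the example.

```python
# Input bytes: UTF-8 for "é" followed by one byte that is invalid in UTF-8.
raw = "é".encode("utf-8") + b"\xff"

# Decoding as UTF-8 with surrogateescape tunnels 0xFF as U+DCFF.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with a *different* codec re-encodes the valid text as Latin-1
# but restores the escaped byte verbatim from the original UTF-8 stream.
mixed = text.encode("latin-1", "surrogateescape")

print(raw, mixed)
```

The output is neither the original byte string nor a clean Latin-1 string: it mixes text in the new encoding with a raw byte from the old one.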

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread Nick Coghlan
Nick Coghlan added the comment: Which reminds me: I'm curious what ls currently does for malformed filenames. The aim of this change would be to get 'python -c "import os; print(os.listdir())"' to do the best it can to work without losing data in such a situation.

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread STINNER Victor
STINNER Victor added the comment: On Linux, the locale encoding is usually UTF-8. If a filename cannot be decoded from UTF-8, invalid bytes are escaped to the surrogate range using PEP 383. If I create a UTF-8 text file and I try to write the filename into this text file, the Python UTF-8
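
The PEP 383 escaping described here can be sketched without touching the filesystem. The filename bytes are invented for the example, and the explicit "utf-8" codec stands in for a UTF-8 locale encoding.

```python
# A filename stored as Latin-1 bytes on a system whose locale is UTF-8.
raw_name = b"caf\xe9.txt"

# PEP 383: the undecodable byte 0xE9 is escaped as the lone surrogate U+DCE9.
name = raw_name.decode("utf-8", "surrogateescape")

# The strict UTF-8 encoder refuses lone surrogates.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The same handler on encoding restores the original bytes.
round_tripped = name.encode("utf-8", "surrogateescape")
```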

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread STINNER Victor
STINNER Victor added the comment: 2013/8/21 Nick Coghlan rep...@bugs.python.org: Which reminds me: I'm curious what ls currently does for malformed filenames. The aim of this change would be to get 'python -c "import os; print(os.listdir())"' to do the best it can to work without losing data in

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread Nick Coghlan
Nick Coghlan added the comment: Think sysadmins running scripts on Linux, writing to the console or a pipe. I agree the generalisation is a bad idea, so only consider the original proposal that was specifically limited to the standard streams. Specifically, if a system is properly configured to

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread Antoine Pitrou
Antoine Pitrou added the comment: After some thought, Nick came up with this solution. The idea is that surrogateescape was originally accepted to allow roundtripping data from the OS and back when the OS considers it to be a string but python does not consider it to be text. When that's

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-12 Thread Nick Coghlan
New submission from Nick Coghlan: One problem with Unicode in 3.x is that surrogateescape isn't normally enabled on stdin and stdout. This means the following code will fail with UnicodeEncodeError in the presence of invalid filesystem metadata: print(os.listdir()) We don't really want
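
The failure mode can be reproduced in memory with a text stream standing in for stdout. This is a sketch under assumptions: the helper name render and the sample names are hypothetical, and the names are printed one at a time because printing a list goes through repr(), which escapes lone surrogates.

```python
import io

def render(names, errors):
    """Print each name to a UTF-8 text stream using the given error handler.

    Hypothetical helper: returns the bytes that reached the underlying buffer.
    """
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding="utf-8", errors=errors, newline="\n")
    for name in names:
        print(name, file=stream)
    stream.flush()
    return buffer.getvalue()

# Second entry simulates a filename containing the undecodable byte 0xFF.
names = ["ok.txt", "caf\udcff"]

try:
    render(names, "strict")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc

escaped_output = render(names, "surrogateescape")
```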

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-12 Thread R. David Murray
R. David Murray added the comment: My gut reaction to this is that it feels dangerous. That doesn't mean my gut is right, I'm just reporting my reaction :) -- nosy: +r.david.murray

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-12 Thread Nick Coghlan
Nick Coghlan added the comment: Everything about surrogateescape is dangerous - we're trying to work around the presence of bad data by at least allowing it to be tunnelled through Python code without corrupting it further :)