STINNER Victor added the comment:
The surrogateescape error handler works with any codec.
The surrogatepass only works with utf-8 if I remember correctly. The
surrogateescape error handler works with any codec, especially ascii.
As a side effect of this change an input from stdin will be
Serhiy Storchaka added the comment:
The surrogateescape error handler works with any codec.
Ah, sorry. You are correct.
Correct, but it's not something new: os.listdir(), sys.argv, os.environ and
other functions using os.fsdecode(). Applications should already have to
support surrogates.
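For readers following the codec question above, here is a minimal sketch of the round trip, not taken from the thread. The helper name roundtrip is made up for illustration. It assumes a Python 3 interpreter; note that newer Python versions also accept surrogatepass with the utf-16 and utf-32 codecs, so only the UTF-8 case is shown for that handler.

```python
def roundtrip(raw, codec):
    """Decode bytes with surrogateescape, then encode them back unchanged."""
    # Undecodable bytes 0x80..0xFF become lone surrogates U+DC80..U+DCFF.
    text = raw.decode(codec, "surrogateescape")
    # Encoding with the same handler restores the original bytes.
    assert text.encode(codec, "surrogateescape") == raw
    return text


# 0xFF is invalid in both ASCII and UTF-8, yet it survives in both codecs.
ascii_text = roundtrip(b"abc\xff", "ascii")
utf8_text = roundtrip(b"abc\xff", "utf-8")

# surrogatepass is a different handler: it writes the surrogate code point
# itself in UTF-8 form instead of restoring the original byte.
passed = "\udcff".encode("utf-8", "surrogatepass")
```

Strings produced this way contain lone surrogates, which is why code receiving them from os.fsdecode() and friends has to be prepared for them.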
STINNER Victor added the comment:
I'm only saying that this will increase the number of cases
in which an exception is raised in an unexpected place.
The print() instruction is much more common than input(). IMO changing
the error handler should fix more issues than it causes regressions.
Python
Serhiy Storchaka added the comment:
Wouldn't it be safer to use surrogateescape for output and strict for input?
--
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18713
___
STINNER Victor added the comment:
Serhiy Storchaka added the comment:
Wouldn't it be safer to use surrogateescape for output and strict for input?
Nick wrote Think sysadmins running scripts on Linux, writing to the
console or a pipe.
See my message msg195769: Python3 cannot be simply used as a
Antoine Pitrou added the comment:
See my message msg195769: Python3 cannot be simply used as a pipe
because it wants to be kind by decoding binary data to Unicode,
whereas not everybody cares about Unicode :-)
If somebody doesn't care about unicode, they can use sys.stdin.buffer.
Problem solved
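A minimal sketch of the sys.stdin.buffer approach mentioned above, assuming the script only needs to pass bytes through untouched. The helper name copy_binary and the chunk size are illustrative choices, not something from the thread.

```python
import io


def copy_binary(src, dst, chunk_size=65536):
    """Copy bytes from src to dst without any decoding step."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# In a real pipeline the call would be (left commented out here):
#     copy_binary(sys.stdin.buffer, sys.stdout.buffer)
# In-memory streams stand in for the pipe in this demonstration.
source = io.BytesIO(b"caf\xe9\n\xff\xfe raw bytes\n")
sink = io.BytesIO()
copied = copy_binary(source, sink)
```

Because nothing is decoded, bytes that are invalid in the locale encoding cannot raise UnicodeDecodeError or UnicodeEncodeError on this path.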
Antoine Pitrou added the comment:
Serhiy Storchaka also noticed (in the review of my patch) that errors
is strict when PYTHONIOENCODING=utf-8 is used. We should also use
surrogateescape if only the encoding is changed.
I don't understand what you say. Could you rephrase?
STINNER Victor added the comment:
Serhiy Storchaka also noticed (in the review of my patch) that errors
is strict when PYTHONIOENCODING=utf-8 is used. We should also use
surrogateescape if only the encoding is changed.
I don't understand what you say. Could you rephrase?
With my patch,
Antoine Pitrou added the comment:
Is it a bug in your patch, or is it deliberate?
STINNER Victor added the comment:
Is it a bug in your patch, or is it deliberate?
It was not deliberate, and I think that it would be more consistent to
use the same error handler (surrogateescape) when only the encoding is
changed by the PYTHONIOENCODING environment variable. So
Serhiy Storchaka added the comment:
The surrogateescape error handler is dangerous with utf-16/32. It can produce
globally invalid output.
STINNER Victor added the comment:
The surrogateescape error handler is dangerous with utf-16/32. It can produce
globally invalid output.
I don't understand, can you give an example? surrogateescape generates an
invalid encoded string with any encoding. Example with UTF-8:
Serhiy Storchaka added the comment:
>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape')
b'\xff\xdcq\x00w\x00e\x00r\x00t\x00y\x00'
>>> ('\udcff' + 'qwerty').encode('utf-16le',
... 'surrogateescape').decode('utf-16le', 'surrogateescape')
'\udcff\udcdcqwerty'
>>> ('\udcff' +
Nick Coghlan added the comment:
Note that the specific case I'm really interested is printing on systems that
are properly configured to use UTF-8, but are getting bad metadata from an OS
API. I'm OK with the idea of *only* changing it for UTF-8 rather than for
arbitrary encodings, as well as
R. David Murray added the comment:
If you pipe the ls (e.g. ls temp) the bytes are preserved. Since setting the
escape handler via PYTHONIOENCODING sets it for both stdin and stdout, it
sounds like that solves the sysadmin use case. The sysadmin can just put that
environment variable
STINNER Victor added the comment:
>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape')
b'\xff\xdcq\x00w\x00e\x00r\x00t\x00y\x00'
Oh, this is a bug in the UTF-16 encoder: it should not encode surrogate
characters => see issue #12892
I read that it's possible to set a standard stream
Nick Coghlan added the comment:
On 23 Aug 2013 01:40, R. David Murray rep...@bugs.python.org wrote:
. (I double checked, and this does indeed work...doing the equivalent of
ls temp via python preserves the bytes with that PYTHONIOENCODING setting.
I don't quite understand, however, why I get
Benjamin Peterson added the comment:
I think it would be great to have a Unicode/bytes howto with information like
this included.
R. David Murray added the comment:
I think the essential use case is using a python program in a unix pipeline.
I'm very sympathetic to that use case, despite my unease.
STINNER Victor added the comment:
Currently, Python 3 fails miserably when it gets a non-ASCII
character from stdin or when it tries to write a byte encoded as a
Unicode surrogate to stdout.
It works fine when OS data can be decoded from and encoded to the
locale encoding. Example on Linux
Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com:
--
nosy: +Arfrever
STINNER Victor added the comment:
Attached patch changes the error handler of stdin, stdout and stderr to
surrogateescape by default. It can still be changed explicitly using the
PYTHONIOENCODING environment variable.
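A short sketch of the explicit override, assuming the documented encodingname:errorhandler form of PYTHONIOENCODING. It launches a child interpreter and asks it which error handler its stdout ended up with. The helper name child_stdout_errors is hypothetical.

```python
import os
import subprocess
import sys


def child_stdout_errors(ioencoding):
    """Run a child Python with PYTHONIOENCODING set; return sys.stdout.errors."""
    env = dict(os.environ, PYTHONIOENCODING=ioencoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.errors)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Request both the encoding and the error handler explicitly.
handler = child_stdout_errors("utf-8:surrogateescape")
```

Passing only an encoding, as in PYTHONIOENCODING=utf-8, leaves the choice of error handler to the interpreter, which is the case discussed elsewhere in this thread.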
--
keywords: +patch
Added file:
Serhiy Storchaka added the comment:
The surrogateescape error handler works only with UTF-8.
As a side effect of this change an input from stdin will be incompatible in
general with extensions which implicitly encode a string to bytes with UTF-8
(e.g. tkinter, XML parsers, sqlite3, datetime,
Toshio Kuratomi added the comment:
Nick and I had talked about this at a recent conference and came to it from
different directions. On the one hand, Nick made the point that any encoding
of surrogateescape'd text to bytes via a different encoding is corrupting the
data as a whole. On the
Nick Coghlan added the comment:
Which reminds me: I'm curious what ls currently does for malformed
filenames. The aim of this change would be to get 'python -c "import os;
print(os.listdir())"' to do the best it can to work without losing data in
such a situation.
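A rough illustration of the behaviour being discussed, using an in-memory stream in place of the real stdout so that it does not depend on the locale. The helper encode_via_stream and the sample file name are made up for this sketch.

```python
import io


def encode_via_stream(text, errors):
    """print() text to a UTF-8 text stream and return the bytes written."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="utf-8", errors=errors, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()


# A Latin-1 file name as os.listdir() would report it on a UTF-8 system:
# the undecodable byte 0xE9 is escaped to the lone surrogate U+DCE9.
name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")

# With the strict handler, printing the name fails.
try:
    encode_via_stream(name, "strict")
    strict_fails = False
except UnicodeEncodeError:
    strict_fails = True

# With surrogateescape, the original bytes come back out unchanged.
escaped_output = encode_via_stream(name, "surrogateescape")
```

The second call is what lets the malformed name pass through a pipeline without further corruption.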
STINNER Victor added the comment:
On Linux, the locale encoding is usually UTF-8. If a filename cannot
be decoded from UTF-8, invalid bytes are escaped to the surrogate
range using PEP 383. If I create a UTF-8 text file and I try to
write the filename into this text file, the Python UTF-8
STINNER Victor added the comment:
2013/8/21 Nick Coghlan rep...@bugs.python.org:
Which reminds me: I'm curious what ls currently does for malformed
filenames. The aim of this change would be to get 'python -c "import os;
print(os.listdir())"' to do the best it can to work without losing data in
Nick Coghlan added the comment:
Think sysadmins running scripts on Linux, writing to the console or a pipe.
I agree the generalisation is a bad idea, so only consider the original
proposal that was specifically limited to the standard streams.
Specifically, if a system is properly configured to
Antoine Pitrou added the comment:
After some thought, Nick came up with this solution. The idea is that
surrogateescape was originally accepted to allow roundtripping data
from the OS and back when the OS considers it to be a string but
python does not consider it to be text. When that's
New submission from Nick Coghlan:
One problem with Unicode in 3.x is that surrogateescape isn't normally enabled
on stdin and stdout. This means the following code will fail with
UnicodeEncodeError in the presence of invalid filesystem metadata:
print(os.listdir())
We don't really want
R. David Murray added the comment:
My gut reaction to this is that it feels dangerous. That doesn't mean my gut
is right, I'm just reporting my reaction :)
--
nosy: +r.david.murray
Nick Coghlan added the comment:
Everything about surrogateescape is dangerous - we're trying to work
around the presence of bad data by at least allowing it to be
tunnelled through Python code without corrupting it further :)