[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Benjamin Peterson

Benjamin Peterson added the comment:

I think it would be great to have a "Unicode/bytes" howto with information like 
this included.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Nick Coghlan

Nick Coghlan added the comment:

On 23 Aug 2013 01:40, "R. David Murray"  wrote:
> (I double checked, and this does indeed work...doing the equivalent of
> ls >temp via python preserves the bytes with that PYTHONIOENCODING setting.
> I don't quite understand, however, why I get the � chars if I don't
> redirect the output.)

I assume the terminal window is doing the substitution for the improperly
encoded bytes.

Regarding the issue, perhaps we should convert this to a docs bug? Attempt
to make the "PYTHONIOENCODING=utf-8:surrogateescape" easier to discover?
Heck, it may be worth creating a stable URL that we can include in
surrogate related error messages...

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor

STINNER Victor added the comment:

>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape')
b'\xff\xdcq\x00w\x00e\x00r\x00t\x00y\x00'

Oh, this is a bug in the UTF-16 encoder: it should not encode surrogate 
characters => see issue #12892

I read that it's possible to set a standard stream like stdout to UTF-16 mode 
on Windows. I don't know whether it's commonly used, nor whether it would impact 
Python. I have never seen a platform using UTF-16 or UTF-32 for standard streams.

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread R. David Murray

R. David Murray added the comment:

If you redirect the ls output (e.g. ls >temp), the bytes are preserved.  Since 
setting the error handler via PYTHONIOENCODING sets it for both stdin and 
stdout, it sounds like that solves the sysadmin use case.  The sysadmin can just put that 
environment variable setting in their default profile, and python will once 
again work like the other unix shell tools.  (I double checked, and this does 
indeed work...doing the equivalent of ls >temp via python preserves the bytes 
with that PYTHONIOENCODING setting.  I don't quite understand, however, why I 
get the � chars if I don't redirect the output.). 

I'd be inclined to consider the above as reason enough to close this issue.  As 
usual with Python, explicit is better than implicit.

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Nick Coghlan

Nick Coghlan added the comment:

Note that the specific case I'm really interested in is printing on systems that 
are properly configured to use UTF-8, but are getting bad metadata from an OS 
API. I'm OK with the idea of *only* changing it for UTF-8 rather than for 
arbitrary encodings, as well as restricting it to sys.stdout when the codec 
used matches the default filesystem encoding.

To double check the current behaviour, I created a directory to tinker with 
this. Filenames were created with the following:

>>> open("ℙƴ☂ℌøἤ".encode("utf-8"), "w")
>>> open("basic_ascii".encode("utf-8"), "w")
>>> b"\xd0\xd1\xd2\xd3".decode("latin-1")
'ÐÑÒÓ'
>>> open(b"\xd0\xd1\xd2\xd3", "w")

That last generates an invalid UTF-8 filename. "ls" actually degrades less 
gracefully than I thought, and just prints question marks for the bad file:

$ ls -l
total 0
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:04 
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 basic_ascii
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 ℙƴ☂ℌøἤ

Python 2 & 3 both work OK if you just print the directory listing directly, 
since repr() happily displays the surrogate escaped string:

$ python -c "import os; print(os.listdir('.'))"
['basic_ascii', '\xd0\xd1\xd2\xd3', 
'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4']
$ python3 -c "import os; print(os.listdir('.'))"
['basic_ascii', '\udcd0\udcd1\udcd2\udcd3', 'ℙƴ☂ℌøἤ']

Where it falls down is when you try to print the strings directly in Python 3:

$ python3 -c "import os; [print(fname) for fname in os.listdir('.')]"
basic_ascii
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <listcomp>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd0' in position 0: surrogates not allowed
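
A minimal sketch of the difference described above, assuming a strict UTF-8 text stream stands in for sys.stdout. The names `names` and `strict_stream` are illustrative and not taken from the session above:

```python
import io

# One plain name and one name holding four undecodable bytes, decoded the
# way os.listdir() would decode them on a UTF-8 system.
names = ["basic_ascii", b"\xd0\xd1\xd2\xd3".decode("utf-8", "surrogateescape")]

# Stand-in for a sys.stdout that uses UTF-8 with the default "strict" handler.
strict_stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8", errors="strict")

# repr() escapes the lone surrogates into ASCII text, so this write succeeds.
strict_stream.write(repr(names))

# Writing the string itself forces the encoder to handle the surrogates.
try:
    strict_stream.write(names[1])
    direct_write_ok = True
except UnicodeEncodeError:
    direct_write_ok = False
```

Printing the list goes through repr() for each element, which is why the first form works while the second raises.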

While setting the IO encoding produces behaviour closer to that of the native 
tools:
$ PYTHONIOENCODING=utf-8:surrogateescape python3 -c "import os; [print(fname) for fname in os.listdir('.')]"
basic_ascii

ℙƴ☂ℌøἤ

On the other hand, setting PYTHONIOENCODING as shown provides an environmental 
workaround, and http://bugs.python.org/issue15216 will provide an improved 
programmatic workaround (which tools like http://code.google.com/p/pyp/ could 
use to configure surrogateescape by default).
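
A rough sketch of the environmental workaround, assuming the child interpreter is started from Python with subprocess. The helper `run_with_io_encoding` is hypothetical and exists only for this illustration:

```python
import os
import subprocess
import sys


def run_with_io_encoding(code, io_encoding="utf-8:surrogateescape"):
    """Run a Python snippet in a child process; return (exit status, raw stdout)."""
    # Copy the current environment and override only PYTHONIOENCODING.
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run([sys.executable, "-c", code], env=env,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return result.returncode, result.stdout


# The child writes a string containing one escaped byte (0xff).
child_code = r"import sys; sys.stdout.write('a\udcffb')"

status, output = run_with_io_encoding(child_code)
strict_status, _ = run_with_io_encoding(child_code, "utf-8:strict")
```

With the surrogateescape handler the child emits the original byte; with the strict handler the same snippet exits with an error.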

So perhaps pursuing #15216 further would be a better approach than selectively 
changing the default behaviour? And better documentation for ways to handle the 
surrogate escape error when it arises?

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape')
b'\xff\xdcq\x00w\x00e\x00r\x00t\x00y\x00'
>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape').decode('utf-16le', 'surrogateescape')
'\udcff\udcdcqwerty'
>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape').decode('utf-16le', 'surrogateescape').encode('utf-16le', 'surrogateescape')
b'\xff\xdc\xdc\xdcq\x00w\x00e\x00r\x00t\x00y\x00'
>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape').decode('utf-16le', 'surrogateescape').encode('utf-16le', 'surrogateescape').decode('utf-16le', 'surrogateescape')
'\udcff\udcdc\udcdc\udcdcqwerty'

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor

STINNER Victor added the comment:

> The surrogateescape error handler is dangerous with utf-16/32. It can produce 
> globally invalid output.

I don't understand, can you give an example? surrogateescape generates an 
invalidly encoded string with any encoding. Example with UTF-8:

>>> b"a\xffb".decode("utf-8", "surrogateescape")
'a\udcffb'

>>> 'a\udcffb'.encode("utf-8", "surrogateescape")
b'a\xffb'

>>> b'a\xffb'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 1: invalid start byte

So str.encode("utf-8", "surrogateescape") produces an invalid UTF-8 sequence.

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The surrogateescape error handler is dangerous with utf-16/32. It can produce 
globally invalid output.

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor

STINNER Victor added the comment:

> Is it a bug in your patch, or is it deliberate?

It was not deliberate, and I think that it would be more consistent to
use the same error handler (surrogateescape) when only the encoding is
changed by the PYTHONIOENCODING environment variable. So
surrogateescape should be used even with PYTHONIOENCODING=utf-8.

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Antoine Pitrou

Antoine Pitrou added the comment:

Is it a bug in your patch, or is it deliberate?

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor

STINNER Victor added the comment:

>> Serhiy Storchaka also noticed (in the review of my patch) than errors
>> is "strict" when PYTHONIOENCODING=utf-8 is used. We should also use
>> surrogateescape if only the encoding is changed.
> I don't understand what you say. Could you rephrase?

With my patch, sys.stdin.errors is "surrogateescape" by default, but
it is "strict" when the PYTHONIOENCODING environment variable is set
to "utf-8".

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Antoine Pitrou

Antoine Pitrou added the comment:

> Serhiy Storchaka also noticed (in the review of my patch) than errors
> is "strict" when PYTHONIOENCODING=utf-8 is used. We should also use
> surrogateescape if only the encoding is changed.

I don't understand what you say. Could you rephrase?

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Antoine Pitrou

Antoine Pitrou added the comment:

> See my message msg195769: Python3 cannot be simply used as a pipe
> because it wants to be kind by decoding binary data to Unicode,
> whereas no everybody cares of Unicode :-)

If somebody doesn't care about unicode, they can use sys.stdin.buffer.
Problem solved :-)

Note: enabling surrogateescape on stdin enables precisely the
"exception being raised far from the source of the problem" people
are afraid of.  surrogateescape on stdin allows invalid unicode strings
to slip into your application, only for a later encoding to utf-8
to fail (since lone surrogates are not allowed).  For example if you
are sending that user data over a UTF-8 network protocol (perhaps
JSON-encoded or XML-encoded)...
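
A small sketch of the failure mode described above, together with one way an application could check for it at the input boundary. The helper `has_escaped_bytes` is a hypothetical name, not an existing API:

```python
def has_escaped_bytes(text):
    """Return True if text contains surrogateescape code points (U+DC80-U+DCFF)."""
    return any("\udc80" <= ch <= "\udcff" for ch in text)


# Input that would arrive from stdin with the surrogateescape handler enabled.
line = b"user\xffdata".decode("utf-8", "surrogateescape")

# The string looks fine until some later, strict UTF-8 encoding step runs.
try:
    line.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

Checking with such a helper right after reading moves the error back to where the bad data entered the program.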

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor

STINNER Victor added the comment:

Serhiy Storchaka added the comment:
> Wouldn't it be safer to use surrogateescape for output and strict for input?

Nick wrote "Think sysadmins running scripts on Linux, writing to the
console or a pipe."

See my message msg195769: Python3 cannot be simply used as a pipe
because it wants to be kind by decoding binary data to Unicode,
whereas not everybody cares about Unicode :-)

Hum, I realized that the subprocess module should also be patched to be
consistent: subprocess already uses surrogateescape for the command
line arguments and environment variables, so why not use the same error
handler for stdin, stdout and stderr?

Serhiy Storchaka also noticed (in the review of my patch) that errors
is "strict" when PYTHONIOENCODING=utf-8 is used. We should also use
surrogateescape if only the encoding is changed.

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Wouldn't it be safer to use surrogateescape for output and strict for input?

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor

STINNER Victor added the comment:

> I'm only saying that this will increase a number of cases
> when an exception will raised in unexpected place.

The print() function is much more common than input(). IMO changing
the error handler should fix more issues than it introduces.

Python functions decoding OS data from the filesystem encoding with
surrogateescape:

- sys.thread_info.version
- sys.argv
- os.environ, os.getenv()
- os.fsdecode()
- _ssl._SSLSocket.compression
- os.ttyname(), os.ctermid(), os.getcwd(), os.listdir(), os.uname(),
os.getlogin(), os.readlink(), os.confstr(), os.listxattr(), nis.cat()
- grp.getgrgid(), grp.getgrnam(), grp.getgrall()
- spwd.getspnam(), spwd.getspall()
- pwd.getpwuid(), pwd.getpwnam(), pwd.getpwall()
- socket.socket.accept(), socket.socket.getsockname(),
socket.socket.getpeername(), socket.socket.recvfrom(),
socket.gethostname(), socket.if_nameindex(), socket.if_indextoname()

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> "The surrogateescape error handler works with any codec."

Ah, sorry. You are correct.

> Correct, but it's not something new: os.listdir(), sys.argv, os.environ and
> other functions using os.fsdecode(). Applications should already have to
> support surrogates.

I'm only saying that this will increase the number of cases where an exception 
is raised in an unexpected place.

Perhaps it would be safer to leave "strict" as the default error handler and 
make the errors attribute of text streams modifiable.

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-21 Thread STINNER Victor

STINNER Victor added the comment:

"The surrogateescape error handler works with any codec."

The surrogatepass only works with utf-8 if I remember correctly. The
surrogateescape error handler works with any codec, especially ascii.
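
A short sketch contrasting the two handlers, assuming a single undecodable byte 0xff. The variable names are illustrative:

```python
# surrogateescape maps the undecodable byte 0xff to the lone surrogate U+DCFF.
escaped = b"\xff".decode("ascii", "surrogateescape")

# Encoding with surrogateescape restores the original byte, whatever the codec.
as_ascii = escaped.encode("ascii", "surrogateescape")
as_latin1 = escaped.encode("latin-1", "surrogateescape")

# surrogatepass instead writes the surrogate itself as a UTF-8 style sequence.
passed = escaped.encode("utf-8", "surrogatepass")
restored = passed.decode("utf-8", "surrogatepass")
```

The first handler tunnels raw bytes through str, while the second serialises the surrogate code point, so the resulting byte strings differ.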

"As a side effect of this change an input from stdin will be incompatible
in general with extensions which implicitly encode a string to bytes with
UTF-8 (e.g. tkinter, XML parsers, sqlite3, datetime, locale, curses, etc.)"

Correct, but it's not something new: the same applies to os.listdir(),
sys.argv, os.environ and other functions using os.fsdecode(). Applications
already have to support surrogates.

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-21 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The surrogateescape error handler works only with UTF-8.

As a side effect of this change an input from stdin will be incompatible in 
general with extensions which implicitly encode a string to bytes with UTF-8 
(e.g. tkinter, XML parsers, sqlite3, datetime, locale, curses, etc.)

--
nosy: +serhiy.storchaka




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-21 Thread STINNER Victor

STINNER Victor added the comment:

The attached patch changes the error handler of stdin, stdout and stderr to 
surrogateescape by default. It can still be changed explicitly using the 
PYTHONIOENCODING environment variable.

--
keywords: +patch
Added file: http://bugs.python.org/file31414/surrogateescape.patch




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-21 Thread Arfrever Frehtes Taifersar Arahesis

Changes by Arfrever Frehtes Taifersar Arahesis :


--
nosy: +Arfrever




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-21 Thread STINNER Victor

STINNER Victor added the comment:

Currently, Python 3 fails miserably when it gets a non-ASCII
character from stdin or when it tries to write a byte encoded as a
Unicode surrogate to stdout.

It works fine when OS data can be decoded from and encoded to the
locale encoding. Example on Linux with UTF-8 data and UTF-8 locale
encoding:

$ mkdir test
$ cd test
$ touch héhé.txt
$ ls
héhé.txt
$ python3 -c 'import os; print(", ".join(os.listdir()))'
héhé.txt
$ echo "héhé"|python3 -c 'import sys; sys.stdout.write(sys.stdin.read())'|cat
héhé

It fails miserably when OS data cannot be decoded from or encoded to
the locale encoding. Example on Linux with UTF-8 data and ASCII locale
encoding:

$ mkdir test
$ cd test
$ touch héhé.txt
$ export LANG=  # switch to ASCII locale encoding
$ ls
h??h??.txt
$ python3 -c 'import os; print(", ".join(os.listdir()))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)

$ echo "héhé"|LANG= python3 -c 'import sys; sys.stdout.write(sys.stdin.read())'|cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/vstinner/prog/python/default/Lib/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

The ls output is not the expected "héhé" string, but that is an issue
with the console output, not the ls program. ls just writes raw
bytes to stdout:

$ ls|hexdump -C
00000000  68 c3 a9 68 c3 a9 2e 74  78 74 0a                 |h..h...txt.|
0000000b

("héhé" encoded to UTF-8 gives b'h\xc3\xa9h\xc3\xa9')
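
A minimal sketch of why the ASCII-locale pipe fails and how surrogateescape would let the bytes pass through unchanged. It works on the same "héhé" bytes, and the variable names are illustrative:

```python
data = "héhé".encode("utf-8")  # the bytes that arrive on stdin

# Strict ASCII decoding is what fails in the traceback above.
try:
    data.decode("ascii")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogateescape, each non-ASCII byte becomes a lone surrogate ...
text = data.decode("ascii", "surrogateescape")

# ... and encoding with the same codec and handler restores the bytes.
round_trip = text.encode("ascii", "surrogateescape")
```

The intermediate string is not meaningful text, but the bytes written out are identical to the bytes read in.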

I agree that we can do something to improve the situation on standard
streams, but only on standard streams. It is already possible to
work around the issue by forcing the surrogateescape error handler on
stdout:

$ LANG= PYTHONIOENCODING=utf-8:surrogateescape python3 -c 'import os; print(", ".join(os.listdir()))'
héhé.txt

Something similar can be done in Python. For example,
test.support.regrtest reopens sys.stdout to set the error handler to
"backslashreplace". Extract of the replace_stdout() function:

sys.stdout = open(stdout.fileno(), 'w',
                  encoding=sys.stdout.encoding,
                  errors="backslashreplace",
                  closefd=False,
                  newline='\n')
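
The same idea with the surrogateescape handler can be sketched as follows. To keep the sketch self-contained, an in-memory binary buffer stands in for the real stdout, and `wrap_with_surrogateescape` is a hypothetical helper name:

```python
import io


def wrap_with_surrogateescape(binary_stream, encoding="utf-8"):
    """Return a text layer over binary_stream that turns escaped surrogates back into bytes."""
    return io.TextIOWrapper(binary_stream, encoding=encoding,
                            errors="surrogateescape", newline="\n")


buffer = io.BytesIO()            # stand-in for the binary side of stdout
stream = wrap_with_surrogateescape(buffer)
stream.write("h\udcffh\n")       # contains one escaped byte (0xff)
stream.flush()
written = buffer.getvalue()
```

For the real stream, the same errors value could be passed in an open() call like the one in replace_stdout().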

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-21 Thread R. David Murray

R. David Murray added the comment:

I think the essential use case is using a python program in a unix pipeline.  
I'm very sympathetic to that use case, despite my unease.

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread Antoine Pitrou

Antoine Pitrou added the comment:

> After some thought, Nick came up with this solution.  The idea is that
> surrogateescape was originally accepted to allow roundtripping data
> from the OS and back when the OS considers it to be a "string" but
> python does not consider it to be "text".  When that's the case, we
> know what the encoding was used to attempt to construct the text in
> python.  If that same encoding is used to re-encode the data on the
> way back to the OS, then we're successfully roundtripping the data we
> were given in the first place.  So this is just applying the original
> goal to another API.

I think that outlook is a bit naïve. The text source is not always the
same as the text destination, i.e. your surrogateescape-decoded data may
come from a database or some JSON API, so there's no reason to think
that the end of the stdout pipe will share the same convention.

I'm myself quite partial to the "round-tripping" use case, but I'm not
sure we can solve it as bluntly. If it's merely for printing out data,
maybe we can add an os.fsescape() function to allow for representation of
broken filenames.

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread Nick Coghlan

Nick Coghlan added the comment:

Think sysadmins running scripts on Linux, writing to the console or a pipe.
I agree the generalisation is a bad idea, so only consider the original
proposal that was specifically limited to the standard streams.

Specifically, if a system is properly configured to use UTF-8 for all
interfaces, I shouldn't have to live in fear of Python steps in a command
pipeline falling over because it happens to encounter a filename encoded
with latin-1 (etc).

If the bytes oriented os tools like ls don't fall over on it, then neither
should Python. This is about treating the standard streams as OS
interfaces, as long as they're using the filesystem encoding.

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread STINNER Victor

STINNER Victor added the comment:

2013/8/21 Nick Coghlan :
> Which reminds me: I'm curious what "ls" currently does for malformed
> filenames. The aim of this change would be to get 'python -c "import os;
> print(os.listdir())"' to do the best it can to work without losing data in
> such a situation.

The "ls" command works on bytes, not on characters. You can
reimplement "ls" with:

* Unicode: os.listdir(str), os.fsencode() and sys.stdout.buffer
* bytes: os.listdir(bytes) and sys.stdout.buffer

os.fsencode() does exactly the opposite of os.fsdecode(). There is a
unit test to check that :-)

I ensured that all OS functions can be used directly with bytes
filenames in Python 3. That's why I added os.environb for example.
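
The bytes variant can be sketched roughly as below. The function name `list_directory` and its `out` parameter are assumptions made for this illustration, and the output format is simplified to one name per line:

```python
import os
import sys


def list_directory(path=".", out=None):
    """Write the entries of path to a binary stream, one per line, as raw bytes."""
    if out is None:
        out = sys.stdout.buffer
    # Passing bytes to os.listdir() returns bytes names, so nothing is decoded.
    for name in sorted(os.listdir(os.fsencode(path))):
        out.write(name + b"\n")
    out.flush()
```

Calling list_directory() with no arguments writes the current directory's names to sys.stdout.buffer without any encoding step, so undecodable filenames cannot raise.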

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread STINNER Victor

STINNER Victor added the comment:

On Linux, the locale encoding is usually UTF-8. If a filename cannot
be decoded from UTF-8, invalid bytes are escaped to the surrogate
range using PEP 383. If I create a UTF-8 text file and I try to
write the filename into this text file, the Python UTF-8 encoder
raises an error.

IMO Python must raise an error here because I want to generate a valid
UTF-8 text file, not a text file only readable by Python if the locale
encoding is UTF-8.

So using surrogateescape error handler if the encoding is
sys.getfilesystemencoding() is *not* a good idea.

What is your use case where you need to display a filename? Is it
displayed in the terminal, written to a file, or shown in a graphical window?
Why not escape surrogates just to format the filename, as GNOME does? See
for example:
https://developer.gnome.org/glib/2.34/glib-Character-Set-Conversion.html#g-filename-display-name
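
One possible way to escape surrogates for display only, sketched with the standard backslashreplace handler. The helper `display_name` is hypothetical and does not reproduce what the glib function does:

```python
def display_name(name):
    """Return a printable form of a filename that may contain escaped bytes."""
    # Lone surrogates cannot be encoded to UTF-8, so backslashreplace turns
    # each one into a visible \udcXX escape; everything else is unchanged.
    return name.encode("utf-8", "backslashreplace").decode("utf-8")


broken = b"a\xffb".decode("utf-8", "surrogateescape")
shown = display_name(broken)
```

The result is valid text that can be written to a strict UTF-8 file, but it is for display only: the original bytes can no longer be recovered reliably from it.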

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread Nick Coghlan

Nick Coghlan added the comment:

Which reminds me: I'm curious what "ls" currently does for malformed
filenames. The aim of this change would be to get 'python -c "import os;
print(os.listdir())"' to do the best it can to work without losing data in
such a situation.

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread Toshio Kuratomi

Toshio Kuratomi added the comment:

Nick and I had talked about this at a recent conference and came to it from 
different directions.  On the one hand, Nick made the point that any encoding 
of surrogateescape'd text to bytes via a different encoding is corrupting the 
data as a whole.  On the other hand, I made the point that raising an exception 
when doing something as basic as printing something that's text type was 
reintroducing the issues that python2 had wrt unicode, bytes, and encodings -- 
particularly with the exception being raised far from the source of the problem 
(when the data is introduced into the program).

After some thought, Nick came up with this solution.  The idea is that 
surrogateescape was originally accepted to allow roundtripping data from the OS 
and back when the OS considers it to be a "string" but python does not consider 
it to be "text".  When that's the case, we know which encoding was used to 
attempt to construct the text in python.  If that same encoding is used to 
re-encode the data on the way back to the OS, then we're successfully 
roundtripping the data we were given in the first place.  So this is just 
applying the original goal to another API.
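
The round-trip argument can be illustrated with a small sketch, assuming data that is valid UTF-8 followed by one stray byte. The variable names are illustrative:

```python
raw = "é".encode("utf-8") + b"\xff"   # valid UTF-8 followed by a stray byte

# Decoding as the OS-facing APIs do keeps the stray byte as a lone surrogate.
text = raw.decode("utf-8", "surrogateescape")

# Re-encoding with the same codec gives back exactly the original bytes.
same = text.encode("utf-8", "surrogateescape")

# Re-encoding with a different codec changes the valid part of the data.
other = text.encode("latin-1", "surrogateescape")
```

Only the first re-encoding is a true round trip, which is the reason for tying the behaviour to a matching encoding.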

--
nosy: +a.badger




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-12 Thread Nick Coghlan

Nick Coghlan added the comment:

Everything about surrogateescape is dangerous - we're trying to work
around the presence of bad data by at least allowing it to be
tunnelled through Python code without corrupting it further :)

--




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-12 Thread R. David Murray

R. David Murray added the comment:

My gut reaction to this is that it feels dangerous.  That doesn't mean my gut 
is right, I'm just reporting my reaction :)

--
nosy: +r.david.murray




[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-12 Thread Nick Coghlan

New submission from Nick Coghlan:

One problem with Unicode in 3.x is that surrogateescape isn't normally enabled 
on stdin and stdout. This means the following code will fail with 
UnicodeEncodeError in the presence of invalid filesystem metadata:

print(os.listdir())

We don't really want to enable surrogateescape on sys.stdin or sys.stdout 
unilaterally, as it increases the chance of data corruption errors when the 
filesystem encoding and the IO encodings don't match.

Last night, Toshio and I thought of a possible solution: enable surrogateescape 
by default for sys.stdin and sys.stdout on non-Windows systems if (and only if) 
they're using the same codec as that returned by sys.getfilesystemencoding() 
(allowing for codec aliases rather than doing a simple string comparison)

This means that for full UTF-8 systems (which includes most modern Linux 
installations), roundtripping will be enabled by default between the standard 
streams and OS facing APIs, while systems where the encodings don't match will 
still fail noisily.

A more general alternative is also possible: default to errors='surrogateescape' 
for *any* text stream that uses the filesystem encoding. It's primarily the 
standard streams we're interested in fixing, though.
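
A rough sketch of the codec comparison the proposal relies on, assuming codecs.lookup() is used to normalise aliases. The helper names `same_codec` and `wants_surrogateescape` are hypothetical, and the non-Windows condition is left out:

```python
import codecs
import sys


def same_codec(name_a, name_b):
    """Return True if both names resolve to the same codec, honouring aliases."""
    try:
        return codecs.lookup(name_a).name == codecs.lookup(name_b).name
    except LookupError:
        return False


def wants_surrogateescape(stream_encoding, fs_encoding=None):
    """Hypothetical policy: round-trip only when the stream matches the filesystem codec."""
    if fs_encoding is None:
        fs_encoding = sys.getfilesystemencoding()
    return stream_encoding is not None and same_codec(stream_encoding, fs_encoding)
```

For example, wants_surrogateescape(sys.stdout.encoding) would be true on a system where both the terminal and the filesystem use UTF-8, even if one side spells it "utf8".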

--
messages: 194968
nosy: abadger1999, benjamin.peterson, ezio.melotti, haypo, lemburg, ncoghlan, 
pitrou
priority: normal
severity: normal
stage: needs patch
status: open
title: Enable surrogateescape on stdin and stdout when appropriate
type: enhancement
versions: Python 3.4
