[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-06-13 Thread STINNER Victor

STINNER Victor added the comment:

It seems like this change:

 def test_forced_io_encoding(self):
 # Checks forced configuration of embedded interpreter IO streams
-out, err = self.run_embedded_interpreter("forced_io_encoding")
-if support.verbose:
+env = {"PYTHONIOENCODING": "utf-8:surrogateescape"}
+out, err = self.run_embedded_interpreter("forced_io_encoding", env=env)
(...)

caused a failure on the "shared" buildbot (./configure --enable-shared):

http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/877/steps/test/logs/stdio

==
FAIL: test_forced_io_encoding (test.test_capi.EmbeddingTests)
--
Traceback (most recent call last):
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_capi.py", line 484, in test_forced_io_encoding
    out, err = self.run_embedded_interpreter("forced_io_encoding", env=env)
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_capi.py", line 392, in run_embedded_interpreter
    (p.returncode, err))
AssertionError: 127 != 0 : bad returncode 127, stderr is '/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Programs/_testembed: error while loading shared libraries: libpython3.7dm.so.1.0: cannot open shared object file: No such file or directory\n'

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-06-13 Thread STINNER Victor

STINNER Victor added the comment:

Ronald Oussoren:
> The macOS failures are at least partially caused by test assumptions that 
> aren't true on macOS (...)

Nick is working on a fix for macOS:
https://github.com/python/cpython/pull/2130

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-06-13 Thread STINNER Victor

STINNER Victor added the comment:

> FreeBSD 10.x: if locale coercion succeeds, we then fail on get_codeset() 
> (perhaps because that doesn't recognise LC_CTYPE=UTF-8?)

I created bpo-30647 to track this one.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-06-12 Thread Nick Coghlan

Changes by Nick Coghlan :


--
pull_requests: +2184




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-06-11 Thread Ronald Oussoren

Ronald Oussoren added the comment:

The macOS failures are at least partially caused by test assumptions that 
aren't true on macOS: in particular the filesystem encoding defaults to UTF-8 
on macOS (because HFS+ and the recent APFS filesystem store unicode data and 
not pure byte strings).
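
To see which filesystem encoding a given machine actually reports, the interpreter can be asked directly. This is only an illustrative sketch; the helper name `describe_fs_encoding` is made up for this note and is not part of the test suite under discussion.

```python
import sys


def describe_fs_encoding():
    """Return the filesystem encoding and error handler Python is using."""
    # Both functions live in the standard sys module
    # (getfilesystemencodeerrors() needs Python 3.6 or later).
    return sys.getfilesystemencoding(), sys.getfilesystemencodeerrors()


if __name__ == "__main__":
    encoding, errors = describe_fs_encoding()
    print(f"{sys.platform}: filesystem encoding={encoding!r}, errors={errors!r}")
```

Running it on macOS and on a Linux box with a C locale is a cheap way to compare the assumptions the tests make.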

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-06-11 Thread Nick Coghlan

Nick Coghlan added the comment:

Initial look at the failures on the stable buildbots:

FreeBSD 10.x: if locale coercion succeeds, we then fail on get_codeset() 
(perhaps because that doesn't recognise LC_CTYPE=UTF-8?)
FreeBSD CURRENT: if locale coercion fails (due to no suitable locale), lots of 
error handling tests fail due to the unexpected warning message on stderr

Mac OS X Tiger: looks like the test expectations aren't right on Mac OS X (at 
least for Tiger). I've added the Mac OS X folks to the nosy list.

Ubuntu shared library build: loading the shared library fails in _testembed for 
the `test_forced_io_encoding` test case, which suggests a problem with the way 
that particular test is running the binary

Windows 8.1 refleak hunting: failure doesn't appear to be due to this change 
(multiprocessing test failures)
s390x RHEL 7: failure doesn't appear to be due to this change (multiprocessing 
test failures)

--
nosy: +ned.deily, ronaldoussoren




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-06-11 Thread Nick Coghlan

Nick Coghlan added the comment:

Ah, it would have been too easy for all the other *nix variants to be close 
enough to Fedora & Ubuntu for everything to work first time :)

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-06-11 Thread STINNER Victor

STINNER Victor added the comment:

Tests fail on many buildbots.

--
resolution: fixed -> 
status: closed -> open




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-06-10 Thread Nick Coghlan

Nick Coghlan added the comment:

And merged!

Thanks to all involved in the process of getting this change through to 
implementation :)

--
resolution:  -> fixed
stage:  -> resolved
status: open -> closed




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-06-10 Thread Nick Coghlan

Nick Coghlan added the comment:


New changeset 6ea4186de32d65b1f1dc1533b6312b798d300466 by Nick Coghlan in 
branch 'master':
bpo-28180: Implementation for PEP 538 (#659)
https://github.com/python/cpython/commit/6ea4186de32d65b1f1dc1533b6312b798d300466


--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-06-04 Thread Nick Coghlan

Nick Coghlan added the comment:

The PEP 538 PR is mostly complete now, but I created 
https://bugs.python.org/issue30565 to track making a follow-up decision on 
whether or not we really want to emit a warning on *successful* implicit locale 
coercion.

The pre-release What's New entry for PEP 538 will include a link to that issue 
to allow folks to provide feedback on their preferences.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-03-13 Thread Nick Coghlan

Nick Coghlan added the comment:

OK, the PEP 538 reference implementation has reached the point where I was 
willing to create a PR for it: https://github.com/python/cpython/pull/659

That PR/branch also includes the necessary changes to always force the C.UTF-8 
locale on Android rather than defaulting to the C locale.

I believe the only thing missing at this point is the configure.ac dance to 
ensure that PY_WARN_ON_C_LOCALE and PY_COERCE_C_LOCALE never get set on Mac OS 
X.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-03-13 Thread Nick Coghlan

Changes by Nick Coghlan :


--
pull_requests: +540




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-03-04 Thread Nick Coghlan

Nick Coghlan added the comment:

An updated reference implementation has been pushed to the 
pep538-coerce-c-locale branch in my GitHub fork: 
https://github.com/python/cpython/compare/master...ncoghlan:pep538-coerce-c-locale

(That doesn't include Xavier's Android fixes yet, though)

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-22 Thread Xavier de Gaye

Xavier de Gaye added the comment:

> On Android, setlocale(CATEGORY, "") does not look for the locale environment 
> variables (LANG, ...) but sets the 'C' locale instead

FWIW the source code of setlocale() on bionic (Android libc) is at 
https://android.googlesource.com/platform/bionic/+/master/libc/bionic/locale.cpp#144

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-18 Thread Xavier de Gaye

Xavier de Gaye added the comment:

pep538_coerce_legacy_c_locale_v3.diff fixes issue 28997 on Android (API 21 and 
24). That issue arises from an inconsistency between Python on Android, which 
always considers the locale encoding to be UTF-8, and GNU Readline, which 
does not accept eight-bit characters when LANG is not set (on Android).

On Android, setlocale(CATEGORY, "") does not look for the locale environment 
variables (LANG, ...) but sets the 'C' locale instead, so the patch does not 
fully behave as expected and the 'Py_Initialize detected' warning is emitted. 
Here is the output of an interactive session on Android:

root@generic_x86:/data/data/org.bitbucket.pyona # python
Python detected LC_CTYPE=C, forcing LC_ALL & LANG to C.UTF-8 (set 
PYTHONALLOWCLOCALE to disable this locale coercion behaviour).
Py_Initialize detected LC_CTYPE=C, which limits Unicode compatibility. Some 
libraries and operating system interfaces may not work correctly. Set 
`PYTHONALLOWCLOCALE=1 LC_CTYPE=C` to configure a similar environment when 
running Python directly.
Python 3.7.0a0 (default:0503024831ad+, Jan 18 2017, 11:34:53)
[GCC 4.2.1 Compatible Android Clang 3.8.256229 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale, os
>>> os.environ['LANG']
'C.UTF-8'
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')
>>> locale.setlocale(locale.LC_CTYPE)
'C'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'C.UTF-8'
>>> locale.setlocale(locale.LC_CTYPE)
'C.UTF-8'

The attached android_setlocale.patch fixes the following problems when applied 
after pep538_coerce_legacy_c_locale_v3.diff:
* No 'Py_Initialize detected' warning is emitted.
* locale.setlocale(locale.LC_CTYPE) now returns 'C.UTF-8'.

--
nosy: +xdegaye
Added file: http://bugs.python.org/file46329/android_setlocale.patch




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-07 Thread Nick Coghlan

Nick Coghlan added the comment:

Uploaded one last version of the patch implementing the previous PEP 538 
design. This refactors the test cases so they systematically cover 4 cases that 
we expect to be reported as "the C locale":

- LC_ALL, LC_CTYPE, and LANG all empty
- one of them set to "C", others empty
- one of them set to "POSIX", others empty
- one of them set to an unknown locale, others empty

The next version of the patch will update it to match the latest draft of the 
PEP (PYTHONCOERCECLOCALE, different message wording, etc)
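
The four "C locale" cases in the list above could be enumerated along these lines. This is a sketch only, not the code from the attached patch; `c_locale_environments` and the placeholder locale name are hypothetical.

```python
LOCALE_VARS = ("LC_ALL", "LC_CTYPE", "LANG")


def c_locale_environments(unknown_locale="not_a_real_locale"):
    """Yield (description, env) pairs covering the four cases listed above."""
    # Case 1: all three variables empty.
    yield "all empty", {name: "" for name in LOCALE_VARS}
    # Cases 2-4: exactly one variable set to "C", "POSIX" or an unknown
    # locale name, with the other two left empty.
    for value in ("C", "POSIX", unknown_locale):
        for name in LOCALE_VARS:
            env = {other: "" for other in LOCALE_VARS}
            env[name] = value
            yield f"{name}={value}", env


if __name__ == "__main__":
    for description, env in c_locale_environments():
        print(description, env)
```

Each yielded mapping could be merged into the environment of a child interpreter to check what that configuration is reported as.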

--
Added file: 
http://bugs.python.org/file46205/pep538_coerce_legacy_c_locale_v3.diff




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-07 Thread Sworddragon

Sworddragon added the comment:

> $ cat badfilename.py 
> badfn = "こんにちは".encode('euc-jp').decode('utf-8', 'surrogateescape')
> print("bad filename:", badfn)
>
> $ PYTHONIOENCODING=utf-8:backslashreplace python3 badfilename.py 
> bad filename: \udca4\udcb3\udca4\udcf3\udca4ˤ\udcc1\udca4\udccf
>
> $ PYTHONIOENCODING=utf-8:surrogateescape python3 badfilename.py 
> bad filename: �ˤ���

The first example is still readable (though not very helpful for a user), 
while the second example appears not to be readable at all. But the 
second example is technically still recoverable and there is no data 
loss, is there? If so, that would probably not argue against 
surrogateescape for sys.stderr in UTF-8 non-strict mode. Otherwise 
backslashreplace might indeed be the better choice.


I have thought about this a bit more: if we go with PEP 538 and keep 
strict error handling more or less as before, there might be another solution 
that could improve the overall issue: print() could get an option to change 
the error handler on demand (with 'strict' still being the default).

Most things that I do output with print() are deterministic or optional and not 
important application data. Being able to print this information without caring 
for de-/encoding errors would mitigate this issue. In case application data is 
being printed where data loss is not desired exceptions can still be thrown.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-07 Thread Nick Coghlan

Nick Coghlan added the comment:

While the attached PEP 538 patches include their own tests, the uploaded 
pep538-check-click.sh script is the one I've been using to check that the 
changes have the desired effect of letting click "just work", even when the 
nominal locale is cleared, explicitly set to C, or explicitly set to POSIX.

--
Added file: http://bugs.python.org/file46190/pep538-check-click.sh




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-07 Thread Nick Coghlan

Nick Coghlan added the comment:

I just pushed an update to PEP 538 based on PEP 540 and the feedback in the 
linux-sig discussion: 
https://github.com/python/peps/commit/221099d8765125bbd798e869846b005bcca84b47

I'll be starting a thread for that on python-ideas shortly, but in the context 
of the discussion here:

* There are good reasons to go back to strict error handling by default on the 
standard streams when we're using UTF-8 as the default encoding rather than 
ASCII: 
https://www.python.org/dev/peps/pep-0538/#using-strict-error-handling-by-default
* The right overall answer might actually be to create a hybrid merger of the 
two PEPs, rather than seeing them as strictly competitors: 
https://www.python.org/dev/peps/pep-0538/#relationship-with-other-peps

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-06 Thread INADA Naoki

INADA Naoki added the comment:

>> stderr is used to log errors. Getting a new error when trying to log
>> an error is kind of annoying.
>
> Hm, what bad surprise/error could appear that would not appear with 
> backslashreplace?

$ cat badfilename.py 
badfn = "こんにちは".encode('euc-jp').decode('utf-8', 'surrogateescape')
print("bad filename:", badfn)

$ PYTHONIOENCODING=utf-8:backslashreplace python3 badfilename.py 
bad filename: \udca4\udcb3\udca4\udcf3\udca4ˤ\udcc1\udca4\udccf

$ PYTHONIOENCODING=utf-8:surrogateescape python3 badfilename.py 
bad filename: �ˤ���
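
The same difference can be checked without a terminal by encoding the string by hand. This is a small sketch that mirrors badfilename.py; the variable names are only illustrative.

```python
# Same string as in badfilename.py: EUC-JP bytes mis-decoded as UTF-8.
original_bytes = "こんにちは".encode("euc-jp")
badfn = original_bytes.decode("utf-8", "surrogateescape")

# backslashreplace turns each lone surrogate into a visible ASCII escape,
# so the bytes written are always valid UTF-8.
escaped = badfn.encode("utf-8", "backslashreplace")

# surrogateescape writes the original undecodable bytes back unchanged,
# so the bytes written are not valid UTF-8 and a UTF-8 terminal shows
# replacement glyphs for them.
passed_through = badfn.encode("utf-8", "surrogateescape")

print(escaped)
print(passed_through == original_bytes)
```

In other words, surrogateescape loses no data but emits bytes the terminal cannot decode, while backslashreplace stays readable at the cost of no longer round-tripping to the original bytes.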

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-06 Thread Sworddragon

Sworddragon added the comment:

> What do you mean by "make the C locale"?

I was pointing to the Platform Support Changes of PEP 538.


> I'm not sure of the name of each mode yet.
>
> After having written the "Use Cases" section and especially the
> Mojibake column of results, I consider the option of renaming the
> "UTF-8 mode" to "YOLO mode".

Presumably YOLO is meant to be negative: things are whirling in my mind. 
Eventually you want to save your joker :>


> Using surrogateescape means that you pass through undecodable bytes
> from inputs to stderr which can cause various kinds of bad surprises.
>
> stderr is used to log errors. Getting a new error when trying to log
> an error is kind of annoying.

Hm, what bad surprise/error could appear that would not appear with 
backslashreplace?

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-06 Thread STINNER Victor

STINNER Victor added the comment:

Sworddragon added the comment:
> (for me and maybe others that is explicitly preferred but maybe this depends 
> on each individual)

That's why PEP 540 has options to enable or disable its UTF-8 mode(s).

> If I'm not wrong PEP 538 improves this for the output too but input handling 
> will still suffer from the overall issue while PEP 540 does also solve this 
> case.

The PEP 538 works fine if all inputs and outputs are encoded to UTF-8.
I understand that it's a deliberate choice to fail on
decoding/encoding error (to not use surrogateescape), but I can be
wrong.

> Also PEP 540 would not make the C locale and thus eventually some systems 
> potentially unsupported (but it might be an acceptable trade-off if we should 
> really go PEP 538).

What do you mean by "make the C locale"?

> Specific for PEP 540:
>
>> The POSIX locale enables the UTF-8 mode
>
> Non-strict I assume?

Yes, non strict.

I'm not sure of the name of each mode yet.

After having written the "Use Cases" section and especially the
Mojibake column of results, I consider the option of renaming the
"UTF-8 mode" to "YOLO mode".

>> UTF-8 /backslashreplace
>
> Was/is the reason to use backslashreplace for sys.stderr to guarantee that 
> the developer/user sees the error messages?

Yes.

> Might it make sense to also use surrogateescape instead of backslashreplace 
> for sys.stderr in UTF-8 non-strict mode to be consistent here?

Using surrogateescape means that you pass through undecodable bytes
from inputs to stderr which can cause various kinds of bad surprises.

stderr is used to log errors. Getting a new error when trying to log
an error is kind of annoying.

Victor

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-06 Thread Sworddragon

Sworddragon added the comment:

On looking into PEP 538 and PEP 540 I think PEP 540 is the way to go. It 
provides an option for a stronger encapsulation for the de-/encoding logic 
between the interpreter and the developer. Instead of caring about error 
handling the developer has now to care about mojibake handling (for me and 
maybe others that is explicitly preferred but maybe this depends on each 
individual). If I'm not wrong PEP 538 improves this for the output too but 
input handling will still suffer from the overall issue while PEP 540 does also 
solve this case. Also PEP 540 would not make the C locale and thus eventually 
some systems potentially unsupported (but it might be an acceptable trade-off 
if we should really go PEP 538).


Specific for PEP 540:

> The POSIX locale enables the UTF-8 mode

Non-strict I assume?


> UTF-8 /backslashreplace

Was/is the reason to use backslashreplace for sys.stderr to guarantee that the 
developer/user sees the error messages? Might it make sense to also use 
surrogateescape instead of backslashreplace for sys.stderr in UTF-8 non-strict 
mode to be consistent here?

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-06 Thread Jan Niklas Hasse

Jan Niklas Hasse added the comment:

> Can you please tell me if these variables are set and if yes, give me their 
> value?

None of these variables are set (with `docker run -it fedora:25 /bin/bash`).

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-05 Thread Nick Coghlan

Nick Coghlan added the comment:

And by PEP 528, I actually mean PEP 538 :)

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-05 Thread Nick Coghlan

Nick Coghlan added the comment:

Docker containers don't have a locale set by default - the approach proposed in 
PEP 528 actually comes from the way I configure Docker images (which in turn 
comes from Armin Ronacher's recommendations in click for Python 3 locale 
handling).

In the Dockerfile for Fedora based containers I add:

ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8

while in CentOS 7 based containers I add:

ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8

And with those settings, Python 3 based containers just work (my laptop is 
running en_AU.UTF-8 locally)

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-05 Thread STINNER Victor

STINNER Victor added the comment:

> Working with Docker I often end up with an environment where the locale isn't 
> correctly set.

The locale encoding is controlled by 3 environment variables: LC_ALL, LC_CTYPE 
and LANG.
https://www.python.org/dev/peps/pep-0540/#the-posix-locale-and-its-encoding
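
For reference, the order in which those three variables are consulted can be sketched as below. The helper `effective_lc_ctype` is hypothetical, and the final fallback to "C" is an assumption that matches the systems discussed in this thread rather than something every platform guarantees.

```python
import os


def effective_lc_ctype(environ=None):
    """Return (variable, value) that decides LC_CTYPE, per POSIX precedence."""
    env = os.environ if environ is None else environ
    # POSIX precedence: a non-empty LC_ALL wins, then LC_CTYPE, then LANG.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:
            return name, value
    # Nothing set: the implementation default applies, assumed here to be
    # the "C" (POSIX) locale.
    return None, "C"


if __name__ == "__main__":
    print(effective_lc_ctype())
```

In a container where none of the variables is set, this returns `(None, "C")`, which is the situation the question is about.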

Can you please tell me if these variables are set and if yes, give me their 
value?

I would like to know if it would be possible to change the behaviour of Python 
when the (LC_CTYPE) locale is POSIX (aka the famous "C" locale).

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-05 Thread STINNER Victor

STINNER Victor added the comment:

> That way each PEP can argue as strongly as it can for the respective author's 
> preferred approach to tackling the default C locale problem, even if they 
> point to a common background section in one of the PEPs (similar to the way 
> PEPs 522 and 524 shared a common problem definition, even though they 
> proposed different ways of handling it).

Ok, same players play again: as PEP 522/524 with Nick and me, I just wrote the 
PEP 540 "Add a new UTF-8 mode" and Nick wrote the PEP 538 :-D

I started a thread to discuss the PEP on python-ideas:
https://mail.python.org/pipermail/python-ideas/2017-January/044089.html

IMHO the PEP 538 should discuss the usage of the surrogateescape error handler: 
see my second mail in the thread for the details.

I proposed a change in my 3rd mail which would move my PEP closer to Nick's PEP 
538: enable "automatically" the UTF-8 mode when the locale is POSIX.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-05 Thread Barry A. Warsaw

Barry A. Warsaw added the comment:

On Jan 05, 2017, at 11:11 AM, STINNER Victor wrote:

>I'm sure that many Linux, UNIX and BSD systems don't have the "C.UTF-8"
>locale. For example, HP-UX has "C.utf8" which is not exactly "C.UTF-8".
>
>I'm not sure that it's ok in 2017 to always force the UTF-8 encoding if the
>user locale uses a different encoding.

It's not just any different encoding, it's specifically C (implicitly,
C.ASCII).

>I proposed an opt-in option to force UTF-8: -X utf8 command line option and
>PYTHONUTF8=1 env var. Opt-in will obviously reduce the risk of backward
>compatibility issues. With an opt-in option, users are better prepared for
>mojibake issues.

If this is true, then I would like a configuration option to default this on.
As mentioned, Debian and Ubuntu already have C.UTF-8 and most environments
(although not all, see my sbuild/schroot comment earlier) will at least be
C.UTF-8.  Perhaps it doesn't matter then, but what I really want is that for
those few odd outliers (e.g. schroot), Python would act the same inside and
outside those environments.  I really don't want people to have to add that envar
or switch (or even export LC_ALL) to get proper build behavior.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-05 Thread Nick Coghlan

Nick Coghlan added the comment:

The trade-offs here are incredibly complex (and are mainly a matter of deciding 
whose code and configurations we want to break in 3.7+), so I think competing 
PEPs are going to be better than attempting to create a combined PEP that tries 
to cover all the options.

That way each PEP can argue as strongly as it can for the respective author's 
preferred approach to tackling the default C locale problem, even if they point 
to a common background section in one of the PEPs (similar to the way PEPs 522 
and 524 shared a common problem definition, even though they proposed different 
ways of handling it).

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-05 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

While going for the full locale setting may be a good option,
perhaps just focusing on the FS encoding for now is a better
way forward (and also more in line with the ticket title).

So essentially go for the PEP 529 approach on Unix as well
(except that we use 'ascii' as fallback in legacy mode):

https://www.python.org/dev/peps/pep-0529/

The PEP also includes a section on affected modules, which we
could double check (even though the term "FS encoding" implies
that only file system relevant APIs are touched by such a change,
the encoding is used in several other places as well):

https://www.python.org/dev/peps/pep-0529/#id14

For Windows, a couple of modules such as pwd and nis are not
used, so those may need some extra attention.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-05 Thread STINNER Victor

STINNER Victor added the comment:

Sorry, I still haven't had enough time to read PEP 538 carefully. But since 
the discussion has already started on this issue, I will add my comments:

* I'm sure that many Linux, UNIX and BSD systems don't have the "C.UTF-8" 
locale. For example, HP-UX has "C.utf8" which is not exactly "C.UTF-8".

* Setting the locale has an impact on all libraries running in the Python 
process. At this point, I'm not sure that it is what we want.

* I'm not sure that it's ok in 2017 to always force the UTF-8 encoding if the 
user locale uses a different encoding. I had the same concern with the PEP 528 
(Change Windows console encoding to UTF-8) and PEP 529 (Change Windows 
filesystem encoding to UTF-8) on Windows, but these PEPs were approved and 
merged into Python 3.6. My fear is obviously mojibake with the other 
applications using the other encoding, the locale encoding. Other applications 
are not impacted by setlocale() in the Python process.

* I proposed an opt-in option to force UTF-8: -X utf8 command line option and 
PYTHONUTF8=1 env var. Opt-in will obviously reduce the risk of backward 
compatibility issues. With an opt-in option, users are better prepared for 
mojibake issues.

* I dislike "Backporting to earlier Python 3 releases". In my experience, 
changes on how Python handles text (encodings, codecs, etc.) always have subtle 
issues, and users dislike getting backward incompatible changes in minor 
releases. *Maybe* if the option is an opt-in, the risk is lower and acceptable?

* I dislike that Fedora carries such a downstream change. I would prefer to decide 
upstream how to slowly make UTF-8 a first-class citizen in Python. 
Otherwise, Fedora would behave differently from other Linux distributions and 
it can be painful to write applications having the same behaviour on all Linux 
distributions. But I also understand that Fedora sometimes has to move faster 
than the slow CPython project :-) Fedora can also be seen as a toy to experiment 
with changes quickly, which helps to provide broad feedback upstream to make better 
decisions.

* Using strict or surrogateescape error handler is a very important choice 
which has a wide impact. If we use utf8 by default (PEP 538), people will 
probably complain less if Python magically passes undecodable bytes through thanks to the 
surrogateescape error handler. If the option is an opt-in, strict may make sense. But 
surrogateescape is maybe still more "convenient". I don't know at this point.
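
The opt-in switches mentioned in the list above (`-X utf8` and `PYTHONUTF8=1`) can be exercised from Python as sketched below. This assumes an interpreter in which the proposal was eventually implemented (Python 3.7 or later, where `sys.flags.utf8_mode` exists); the helper `utf8_mode_enabled` is hypothetical.

```python
import os
import subprocess
import sys


def utf8_mode_enabled(extra_args=(), extra_env=None):
    """Run a child interpreter and report whether its UTF-8 mode is on."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a known state
    env.update(extra_env or {})
    cmd = [sys.executable, *extra_args, "-c",
           "import sys; print(sys.flags.utf8_mode)"]
    result = subprocess.run(cmd, env=env, capture_output=True,
                            text=True, check=True)
    return result.stdout.strip() == "1"


if __name__ == "__main__":
    print("-X utf8      :", utf8_mode_enabled(("-X", "utf8")))
    print("PYTHONUTF8=1 :", utf8_mode_enabled(extra_env={"PYTHONUTF8": "1"}))
    print("default      :", utf8_mode_enabled())
```

The default line is the interesting one: it depends on the locale and the Python version, which is exactly the behaviour under discussion here.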

Nick: it seems like you have a well defined plan. But I disagree on multiple 
points. I don't know if it's better to try to convince you to change your PEP, 
or write a different PEP.

I planned to write such "UTF-8" PEP since 2015, but I never started because the 
scope is so large that I fear all tiny but annoying corner cases...

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-05 Thread Nick Coghlan

Nick Coghlan added the comment:

The PEP already explains how other runtimes achieve UTF-8 and UTF-16-LE 
everywhere: by ignoring the C/C++ locale entirely. While this breaks 
integration with other C/C++ components, the developers of those languages and 
runtimes simply don't care, as they never supported integrating with those 
components in the first place.

CPython doesn't have that luxury, since it is used extensively in locale aware 
desktop applications.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-05 Thread Nick Coghlan

Nick Coghlan added the comment:

No, requesting a locale that doesn't exist doesn't error out, because we don't 
check the return code - it just keeps working the same way it does now (i.e. 
falling back to the legacy C locale).

However, it would be entirely reasonable to put together a competing PEP 
proposing to eliminate the reliance on the problematic libc APIs, and instead 
use locale independent replacements. I'm simply not offering to implement or 
champion such a PEP myself, as I think ignoring the locale settings rather than 
coercing them to something more sensible will break integration with C/C++ GUI 
toolkits like Tcl/Tk, Gtk, and Qt, and it's reasonable for us to expect OS 
providers to offer at least one of C.UTF-8 or en_US.UTF-8 (see 
https://github.com/python/peps/issues/171 for more on that).

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-05 Thread INADA Naoki

INADA Naoki added the comment:

The reasons I want to add a configure option to ignore the locale are:


1. C.UTF-8 is not supported by RHEL7 
(https://bugzilla.redhat.com/show_bug.cgi?id=1361965)

RHEL7 will be used for a long time.
And many people use a newer Python instead of the distro's Python, via pyenv or 
pythonz.
I feel deprecating the C locale in Python 3.7 is a bit aggressive.


2. Many admins like the C locale.

Locale settings can cause unintended side effects, so many admins dislike 
xx_XX.UTF-8 locales.
For example (from 
https://fumiyas.github.io/2016/12/25/dislike.sh-advent-calendar.html ):

$ mkdir tmp
$ cd tmp
$ touch a b c x y z A B C X Y Z
$ LC_ALL=C /bin/bash --noprofile --norc -c 'echo [A-Z]'
A B C X Y Z
$ LC_ALL=en_US.UTF-8 /bin/bash --noprofile --norc -c 'echo [A-Z]'
A b B c C x X y Y z Z


3. Many other languages can use UTF-8 even in the C locale

node.js, Ruby, Rust, and Go can use UTF-8 on Linux.
People don't want to learn how to configure the locale properly only for Python.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-05 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 05.01.2017 10:26, Nick Coghlan wrote:
> 
> Anything purely on the Python side of things doesn't work in a traditional C 
> environment - CPython relies on the C lib to do conversions during startup, 
> so we need the C locale to be set correctly. We can do things differently on 
> Mac OS X and iOS because Apple ensure that *C* behaves differently on Mac OS 
> X and iOS (and apparently Google do something similar for Android, so I'll 
> update the PEP to mention that as well).

I believe INADA-san (hope that's the right way to address him)
raised a good point though: what if a system doesn't come with
the C.UTF-8 locale set up?

The C lib would then error out when trying to use setlocale()
on such an environment.

Now, Python's main() function doesn't look at any such errors
(and neither do the other places which use it such as frozenmain.c
and readline.c), so it wouldn't even notice.

The setlocale() man-page doesn't mention how such a failure would
affect the current locale settings. My guess is that the locale
remains set to what it was before, which in case of a fresh C
application start is the "C" locale.

So in the implementation of the PEP, there should be a test
to see whether "C.UTF-8" does result in a successful
call to setlocale(). If it doesn't, there would have to be
some work-around to still make Python's FS encoding happy
while leaving the C lib locale set at "C".
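
The guess about failed setlocale() calls can be checked from Python, whose locale module wraps the C function and raises locale.Error when the call fails. The following is only a sketch: the helper name try_set_ctype and the bogus locale name are made up for illustration, and some C libraries may accept unknown locale names instead of failing.

```python
import locale


def try_set_ctype(name):
    """Try to switch LC_CTYPE to `name`.

    Returns (succeeded, locale_before, locale_after). The previous
    locale is restored if the switch succeeded.
    """
    before = locale.setlocale(locale.LC_CTYPE)  # query only, no change
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        ok = True
    except locale.Error:
        ok = False
    after = locale.setlocale(locale.LC_CTYPE)
    if ok:
        locale.setlocale(locale.LC_CTYPE, before)  # undo the switch
    return ok, before, after


# "xx_XX.no-such-locale" is a deliberately bogus, hypothetical name.
ok, before, after = try_set_ctype("xx_XX.no-such-locale")
print(ok, before, after)
```

When the call fails, the locale reported afterwards should be the same as the one reported before, which is the behaviour the test proposed above would rely on.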

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-05 Thread INADA Naoki

INADA Naoki added the comment:

> Anything purely on the Python side of things doesn't work in a traditional C 
> environment - CPython relies on the C lib to do conversions during startup, 
> so we need the C locale to be set correctly. 

What I propose is to not use mbstowcs, as is already done for __ANDROID__:

wchar_t*
Py_DecodeLocale(const char* arg, size_t *size)
{
#if defined(__APPLE__) || defined(__ANDROID__)
wchar_t *wstr;
wstr = _Py_DecodeUTF8_surrogateescape(arg, strlen(arg));


On Linux, command line arguments and file paths are just byte sequences.
So using UTF-8:surrogateescape during startup should work fine.

Am I wrong?

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-05 Thread Nick Coghlan

Nick Coghlan added the comment:

Anything purely on the Python side of things doesn't work in a traditional C 
environment - CPython relies on the C lib to do conversions during startup, so 
we need the C locale to be set correctly. We can do things differently on Mac 
OS X and iOS because Apple ensure that *C* behaves differently on Mac OS X and 
iOS (and apparently Google do something similar for Android, so I'll update the 
PEP to mention that as well).

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-04 Thread INADA Naoki

INADA Naoki added the comment:

On Linux, I think most people want UTF-8:surrogateescape by default, without 
fighting against locale and environment variables.

There is already an `#if defined(__APPLE__) || defined(__ANDROID__)` path for it.
How about adding a configure option to use the same logic? (say 
`--with-encoding=(locale|utf-8)`, with the preferred encoding changed in the 
same way).

It may help many people who build Python themselves without the root 
privileges needed to generate the C.UTF-8 locale.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-04 Thread STINNER Victor

STINNER Victor added the comment:

> The default encoding in the C/POSIX locale is ASCII (which is the entire 
> source of the problem).

The reality is more complex than that :-) It depends on the OS.

Some OSes use Latin1 for the POSIX locale. Some OSes announce ASCII
for the POSIX locale, but use Latin1 in practice :-) On these
lying OSes, Python decodes bytes 0x80..0xff using mbstowcs() to check if
we get ASCII or Latin1: see the check_force_ascii() function.

/* Workaround FreeBSD and OpenIndiana locale encoding issue with the C locale.
   On these operating systems, nl_langinfo(CODESET) announces an alias of the
   ASCII encoding, whereas mbstowcs() and wcstombs() functions use the
   ISO-8859-1 encoding. The problem is that os.fsencode() and os.fsdecode() use
   locale.getpreferredencoding() codec. For example, if command line arguments
   are decoded by mbstowcs() and encoded back by os.fsencode(), we get a
   UnicodeEncodeError instead of retrieving the original byte string.

   The workaround is enabled if setlocale(LC_CTYPE, NULL) returns "C",
   nl_langinfo(CODESET) announces "ascii" (or an alias to ASCII), and at least
   one byte in range 0x80-0xff can be decoded from the locale encoding. The
   workaround is also enabled on error, for example if getting the locale
   failed.

(...) */
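
The round-trip failure described in that comment can be imitated in pure Python. This is only an illustration: the explicit codec names stand in for what mbstowcs() and os.fsencode() would do on the affected systems, and the ASCII round trip at the end is a sketch of the idea behind the workaround, not the actual C code.

```python
# A command line argument containing a non-ASCII byte.
arg = b"\xe9"

# Stand-in for mbstowcs() on the affected systems: the C library
# really decodes with ISO-8859-1.
decoded = arg.decode("latin-1")

# Stand-in for os.fsencode() when the announced encoding is ASCII.
try:
    decoded.encode("ascii", "surrogateescape")
    mismatch_error = False
except UnicodeEncodeError:
    mismatch_error = True

# Using ASCII with surrogateescape on both sides keeps the
# original bytes recoverable.
forced = arg.decode("ascii", "surrogateescape")
recovered = forced.encode("ascii", "surrogateescape")

print(mismatch_error)    # True
print(recovered == arg)  # True
```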

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-04 Thread Nick Coghlan

Nick Coghlan added the comment:

The default encoding in the C/POSIX locale is ASCII (which is the entire source 
of the problem).

The initial version of the PEP I uploaded didn't explain that background, but I 
added a section about it in the update earlier this week: 
https://www.python.org/dev/peps/pep-0538/#background

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-04 Thread INADA Naoki

INADA Naoki added the comment:

I'm sorry.
I should have searched the old discussions about why we can't simply use utf-8
for fsencoding in the C locale, instead of asking here.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-04 Thread INADA Naoki

INADA Naoki added the comment:

> That isn't the case on other *nix systems - there, we need CPython to be 
> consistent with the configured C/C++ locale, *and* we need it to be using 
> something other than ASCII as the default encoding.

Isn't using UTF-8 as filesystem encoding and stdin/stdout encoding consistent 
with C or POSIX locale?

Don't "modern" programming environments (Rust, Go, node.js) use UTF-8 even if 
locale is C or POSIX?

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-04 Thread Nick Coghlan

Nick Coghlan added the comment:

On Mac OS X, the XCode libc already ignores the locale settings and just uses 
UTF-8 as the default text encoding, so the hardcoding in CPython aligns with 
that behaviour.

That isn't the case on other *nix systems - there, we need CPython to be 
consistent with the configured C/C++ locale, *and* we need it to be using 
something other than ASCII as the default encoding.

Answer: coerce the default locale from C to C.UTF-8 (if available), or to 
en_US.UTF-8 (for older distros that don't provide C.UTF-8). (The latter aspect 
isn't in the PEP yet, it's an improvement that came up in the linux-sig 
discussions: https://github.com/python/peps/issues/171 )
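
The fallback order described above can be sketched in Python as follows. This is a rough illustration only, not the PEP's C implementation: the helper name pick_coercion_target is made up, and the sketch only probes which target the C library accepts without changing the environment.

```python
import locale

# Candidate targets, in the order given in the message above.
CANDIDATES = ("C.UTF-8", "en_US.UTF-8")


def pick_coercion_target(candidates=CANDIDATES):
    """Return the first candidate locale that setlocale() accepts, or None."""
    saved = locale.setlocale(locale.LC_CTYPE)  # remember the current locale
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # not available on this system, try the next one
            return name
        return None
    finally:
        # Leave the process locale as we found it.
        locale.setlocale(locale.LC_CTYPE, saved)


print(pick_coercion_target())
```

The result depends on which locales the system provides, so no particular output should be expected.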

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-03 Thread INADA Naoki

INADA Naoki added the comment:

I read PEP 538, but I can't understand why just using UTF-8 when the locale is 
C, as on macOS, is a bad idea.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-03 Thread Barry A. Warsaw

Changes by Barry A. Warsaw :


--
nosy: +barry




[issue28180] sys.getfilesystemencoding() should default to utf-8

2017-01-02 Thread Nick Coghlan

Nick Coghlan added the comment:

Updated patch adds some tests showing that this change should also help with 
cases where SSH environment forwarding results in an unknown locale being 
requested in the server environment.

--
Added file: 
http://bugs.python.org/file46121/pep538_coerce_legacy_c_locale_v2.diff




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-28 Thread Nick Coghlan

Nick Coghlan added the comment:

If nothing is configured (i.e. none of LC_ALL, LC_CTYPE or LANG are set in the 
environment), then C reports the locale as "C". It's probably worthwhile for me 
to add a Background section to the PEP that explains the behaviour of 
``setlocale`` at the C level, as that's the source of the majority of the 
problems, as well as the key mechanism used to implement the locale coercion.
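
As a rough model of that rule, the following sketch resolves the LC_CTYPE locale name from the three environment variables. The function name is hypothetical, and the sketch only models the lookup order (LC_ALL, then LC_CTYPE, then LANG); it does not check whether the named locale is actually installed.

```python
def effective_ctype_locale(env):
    """Return the LC_CTYPE locale name implied by the environment mapping."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:  # unset and empty values are treated alike
            return value
    return "C"  # nothing configured: the default "C" locale


print(effective_ctype_locale({}))                       # C
print(effective_ctype_locale({"LANG": "en_US.UTF-8"}))  # en_US.UTF-8
print(effective_ctype_locale({"LANG": "en_US.UTF-8",
                              "LC_ALL": "C"}))          # C
```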

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-28 Thread Jan Niklas Hasse

Jan Niklas Hasse added the comment:

The only important case for me: what happens when LANG is unset?

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-27 Thread Nick Coghlan

Nick Coghlan added the comment:

I've now written this up as a PEP: 
https://github.com/python/peps/blob/master/pep-0538.txt

The latest attached patch implements the specific design proposed in the PEP. 
Relative to the last Fedora specific patch, this tweaks the warning message 
wording slightly, and only emits the library level warning when 
PYTHONALLOWCLOCALE is set:

==
$ LANG=C ./python -c "import sys; print(sys.getfilesystemencoding())"
Python detected LC_CTYPE=C, forcing LC_ALL & LANG to C.UTF-8 (set 
PYTHONALLOWCLOCALE to disable this locale coercion behaviour).
utf-8


==
$ PYTHONALLOWCLOCALE=1 LANG=C ./python -c "import sys; 
print(sys.getfilesystemencoding())"
Py_Initialize detected LC_CTYPE=C, which limits Unicode compatibility. Some 
libraries and operating system interfaces may not work correctly. Set 
`PYTHONALLOWCLOCALE=1 LC_CTYPE=C` to configure a similar environment when 
running Python directly.
ascii
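
The same kind of check can be scripted rather than run by hand. The sketch below is an illustration only (the helper name fs_encoding_under is made up): it runs a child interpreter with a controlled set of locale variables and reports the filesystem encoding it sees. The result depends on the Python version and on any patches applied, so no particular output is assumed.

```python
import os
import subprocess
import sys


def fs_encoding_under(**overrides):
    """Run a child interpreter with the given locale variables and
    return the filesystem encoding it reports."""
    env = dict(os.environ)
    # Start from a clean slate for the locale-related variables.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        env.pop(var, None)
    env.update(overrides)
    out = subprocess.check_output(
        [sys.executable, "-c",
         "import sys; print(sys.getfilesystemencoding())"],
        env=env,
        stderr=subprocess.DEVNULL,  # ignore any locale warnings
    )
    return out.decode("ascii").strip()


print(fs_encoding_under(LANG="C"))
print(fs_encoding_under(LANG="C.UTF-8"))
```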

--
Added file: http://bugs.python.org/file46059/pep538_coerce_legacy_c_locale.diff




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-21 Thread Sworddragon

Changes by Sworddragon :


--
nosy: +Sworddragon




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-21 Thread Akira Li

Changes by Akira Li <4kir4...@gmail.com>:


--
nosy: +akira




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-21 Thread STINNER Victor

STINNER Victor added the comment:

Previous related work:

changeset:   89836:bc06f67234d0
user:Victor Stinner 
date:Tue Mar 18 01:18:21 2014 +0100
files:   Doc/whatsnew/3.5.rst Lib/test/test_sys.py Misc/NEWS Python/pythonru
description:
Issue #19977: When the ``LC_CTYPE`` locale is the POSIX locale (``C`` locale),
:py:data:`sys.stdin` and :py:data:`sys.stdout` are now using the
``surrogateescape`` error handler, instead of the ``strict`` error handler.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-17 Thread Nick Coghlan

Nick Coghlan added the comment:

For folks not following the Fedora BZ issue directly, I've also attached the 
latest draft downstream patch here, which gives the following behaviour:

==

$ ./python -c "import sys; print(sys.getfilesystemencoding())"
utf-8

$ LANG=C.UTF-8 ./python -c "import sys; print(sys.getfilesystemencoding())"
utf-8

$ LANG=C ./python -c "import sys; print(sys.getfilesystemencoding())"
Python detected LC_CTYPE=C, forcing LC_ALL & LANG to C.UTF-8 (set 
PYTHONALLOWCLOCALE to disable this behaviour).
utf-8

$ PYTHONALLOWCLOCALE=1 LANG=C ./python -c "import sys; 
print(sys.getfilesystemencoding())"
Python detected LC_CTYPE=C, but PYTHONALLOWCLOCALE is set. Some libraries, 
applications, and operating system interfaces may not work correctly.
Py_Initialize detected LC_CTYPE=C, which limits Unicode compatibility. Some 
libraries and operating system interfaces may not work correctly. Use 
`PYTHONALLOWCLOCALE=1 LC_CTYPE=C python3` to configure a similar environment 
when running Python directly.
ascii
==

(The double warning in the last example is likely to go away by skipping the 
CLI level warning in that case)

The Python tests checking for the expected behaviour are significantly longer 
than the C level changes needed to implement it :)

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-17 Thread Nick Coghlan

Changes by Nick Coghlan :


Added file: 
http://bugs.python.org/file45951/fedora-cpython-PYTHONALLOWCLOCALE.diff




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-17 Thread Jan Niklas Hasse

Jan Niklas Hasse added the comment:

> Usually, when a new option is added to Python, we add a command line option 
> (-X utf8) but also an environment variable: I propose PYTHONUTF8=1.
>
> Use your favorite method to define the env var "system wide" in your docker 
> containers.

This doesn't help me, as I already set LANG to C.utf-8.

I'm rather thinking about new people trying out Python in Docker who don't know 
about this.

Furthermore I think that UTF-8 is the future and the use of ASCII should be 
discouraged.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-17 Thread Nick Coghlan

Nick Coghlan added the comment:

On 17 December 2016 at 20:15, Marc-Andre Lemburg 
wrote:

> Another use case to consider is embedding the Python
> interpreter in another application. In such situations,
> the C locale will usually already be set by the main
> application and it may conflict with the LANG or other
> locale env var settings, since the user may have chosen
> to use a different locale in the context of the application.
>

Aye, that's the origin of the split proposal to only emit a warning in the
shared library (since CPython might only be a piece of a larger
application), but implement actual locale coercion (by overriding LANG and
LC_ALL in the process environment) in the command line app's main()
function (as in that case we know CPython *is* the application).

The hard part of writing the PEP isn't really going to be explaining the
proposal itself (I expect it to be around a 20 line patch to the C code) -
it's going to be explaining why all the other possibilities we've
considered over the years don't work, and why we (as in the Fedora Python
SIG) think this one actually stands a chance of working properly :)

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-17 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 17.12.2016 08:56, Nick Coghlan wrote:
> 
> Making an explicit note of this so I remember to mention it in the draft PEP: 
> one of the biggest problems that arises in any attempt at a Python-only 
> solution to overriding the locale is that we can end up disagreeing with 
> C/C++ extensions, and this is *especially* a problem when sharing a process 
> with GUI frameworks like Tcl/Tk, Qt, and GTK (since they tend to read the 
> process-wide settings, rather than querying anything that CPython configures 
> during normal operation).

Another use case to consider is embedding the Python
interpreter in another application. In such situations,
the C locale will usually already be set by the main
application and it may conflict with the LANG or other
locale env var settings, since the user may have chosen
to use a different locale in the context of the application.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-16 Thread Nick Coghlan

Nick Coghlan added the comment:

Making an explicit note of this so I remember to mention it in the draft PEP: 
one of the biggest problems that arises in any attempt at a Python-only 
solution to overriding the locale is that we can end up disagreeing with C/C++ 
extensions, and this is *especially* a problem when sharing a process with GUI 
frameworks like Tcl/Tk, Qt, and GTK (since they tend to read the process-wide 
settings, rather than querying anything that CPython configures during normal 
operation).

So the approach I'm proposing is to implement a C->C.UTF-8 locale override in 
the *actual python CLI executable*, and then in the dynamically linked library 
we only emit a warning if we detect the C locale, we don't actually do anything 
to change it.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-16 Thread Nick Coghlan

Nick Coghlan added the comment:

We've been discussing this further downstream in the Fedora Python SIG, and we 
have a draft approach that we're pretty sure will work for us (based in turn on 
the approach Armin Ronacher came up with for click), and we think it should 
work for other distros as well (as long as they already ship the C.UTF-8 
locale, and if they don't, they should fix that limitation anyway).

So I'm assigning this to myself as I think the next step will be to write a PEP 
that both proposes the specific idea as the default behaviour in 3.7, and also 
encourages distros to opt-in to trialling it as a downstream patch for 3.6.

--
assignee:  -> ncoghlan




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-16 Thread Chi Hsuan Yen

Changes by Chi Hsuan Yen :


--
nosy: +Chi Hsuan Yen




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-16 Thread STINNER Victor

STINNER Victor added the comment:

> I believe Victor put quite a bit of time into trying to get more selective 
> approaches to work reliably and eventually gave up.

Yeah, it just doesn't work to use more than one encoding per process. You 
should use the same encoding for the whole lifetime of a process.

If you decode early data from an encoding A and later encode it back to 
encoding B, you get mojibake. The problem is simple.

Using more than one encoding per process means starting to make assumptions 
about how data is used. For example, consider that environment variables use 
encoding A, but filenames should use encoding B. But what if an 
environment variable contains a filename? Similar issues arise for command line 
arguments, subprocess pipes, standard streams (sys.std*), etc.
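
A minimal sketch of the mojibake problem, assuming Latin-1 as "encoding A" and UTF-8 as "encoding B" (the choice of encodings is only an example):

```python
# Bytes as delivered by the OS: "é" encoded as UTF-8.
raw = "é".encode("utf-8")         # b'\xc3\xa9'

# Early in startup the data is decoded with encoding A (Latin-1 here)...
text = raw.decode("latin-1")      # two characters: 'Ã©'

# ...and later written back with encoding B (UTF-8 here).
written = text.encode("utf-8")    # b'\xc3\x83\xc2\xa9'

print(written == raw)  # False: the bytes no longer match the original
```

Nothing raises an exception along the way, which is why this class of bug is hard to notice until the garbled text shows up somewhere else.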

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-16 Thread STINNER Victor

STINNER Victor added the comment:

Victor>> I proposed to add "-X utf8" command line option for UNIX to force utf8 
encoding. Would it work for you?

Jan Niklas Hasse> Unfortunately no, as this would mean I'll have to change all 
my python invocations in my scripts and it wouldn't work for executable files 
with "#!/usr/bin/env python3" would it?

Usually, when a new option is added to Python, we add a command line option (-X 
utf8) but also an environment variable: I propose PYTHONUTF8=1.

Use your favorite method to define the env var "system wide" in your docker 
containers.

Note: Technically, I'm not sure that it's possible to support the -E option 
with PYTHONUTF8, since -E comes from the command line, and we first need to 
decode command line arguments with an encoding to parse these options. 
Chicken-and-egg issue ;-)
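
For readers who want to try the two spellings side by side, here is a small sketch. It assumes a Python 3.7 or later interpreter, where an option and an environment variable with these names are available and the result is exposed as sys.flags.utf8_mode; on older interpreters it will not work. The helper name utf8_mode_flag is made up for illustration.

```python
import os
import subprocess
import sys


def utf8_mode_flag(args=(), env_overrides=None):
    """Run a child interpreter and return its sys.flags.utf8_mode as text.

    Assumes Python 3.7+, where sys.flags.utf8_mode exists.
    """
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start without the variable set
    env.update(env_overrides or {})
    cmd = [sys.executable, *args, "-c",
           "import sys; print(sys.flags.utf8_mode)"]
    out = subprocess.check_output(cmd, env=env)
    return out.decode("ascii").strip()


# Command line spelling.
print(utf8_mode_flag(args=["-X", "utf8"]))
# Environment variable spelling.
print(utf8_mode_flag(env_overrides={"PYTHONUTF8": "1"}))
```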

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-14 Thread Nick Coghlan

Nick Coghlan added the comment:

Downstream Fedora issue proposing the above idea for F26: 
https://bugzilla.redhat.com/show_bug.cgi?id=1404918

I've also attached the patch from that issue here.

--
keywords: +patch
Added file: http://bugs.python.org/file45907/fedora-cpython-force-c-utf-8.diff




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-12 Thread Nick Coghlan

Nick Coghlan added the comment:

The challenge that arises in being selective about this is that 
"sys.getfilesystemencoding()" is actually a misnomer, and some of the things we 
use it for (like decoding command line arguments and environment variables) 
necessarily happen *really* early in the interpreter bootstrapping process. The 
bugs that arise from being internally inconsistent are then even harder to 
debug than those that arise from believing the OS when it says the right 
encoding to use is ASCII - the latter at least don't tend to be subtle, and are 
amenable to being resolved via "LC_ALL=C.UTF-8" and "LANG=C.UTF-8".

I believe Victor put quite a bit of time into trying to get more selective 
approaches to work reliably and eventually gave up.

For Fedora 26, I'm going to explore the feasibility of patching our system 3.6 
installation such that the python3 command itself (rather than the shared 
library) checks for "LC_CTYPE=C" as almost the first thing it does, and 
forcibly sets LANG and LC_ALL to C.UTF-8 if it gets an answer it doesn't like. 
If we're able to do that successfully in the more constrained environment of a 
specific recent Fedora release, then I think it will bode well for doing 
something similar by default in CPython 3.7

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-12 Thread INADA Naoki

INADA Naoki added the comment:

Sorry for the confusion.
I didn't mean defaulting to LANG=C.UTF-8.

I meant using UTF-8 as the default fsencoding and stdio encoding regardless of 
locale, and having locale.getpreferredencoding() return 'utf-8' when LC_CTYPE 
is ascii.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-12 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

If we just restrict this to the file system encoding (and not the whole LANG 
setting), how about:

 * default the file system encoding to 'utf-8' and use the surrogate escape 
handler as default error handler
 * add a PYTHONFSENCODING env var to set the file system encoding to something 
else (*)

(*) I believe we discussed this at some point already, but don't remember the 
outcome.

Regarding the questions of defaulting to LANG=C.UTF-8: I think this needs some 
more thought, since it would also affect many C locale aware functions. To make 
this work, Python would have to call setlocale() early on in the startup phase 
to adjust the C lib accordingly.

--
nosy: +lemburg

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-12 Thread Jan Niklas Hasse

Jan Niklas Hasse added the comment:

https://sourceware.org/glibc/wiki/Proposals/C.UTF-8#Defaults mentions that 
C.UTF-8 should be glibc's default.

This bug report also mentions Python: 
https://sourceware.org/bugzilla/show_bug.cgi?id=17318
It hasn't been fixed yet, though :/

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-12 Thread Nick Coghlan

Nick Coghlan added the comment:

From CPython's point of view, glibc behaves the same way (i.e. reporting 
`ascii` as the preferred encoding for operating system interfaces) regardless 
of whether the cause is the locale not being set at all, or due to it being 
explicitly set to the legacy POSIX locale via `LANG=C`.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-12 Thread Jan Niklas Hasse

Jan Niklas Hasse added the comment:

Actually in a new Docker container, the LANG variable isn't set at all. 
Defaulting to UTF-8 in that case should be easier to reason about, shouldn't it?

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-11 Thread Nick Coghlan

Nick Coghlan added the comment:

Note also that if we say we're going to do this for 3.7, *and* go ahead and 
implement it, then distros may be more inclined to incorporate the same 
behavioural changes into distro-provided releases of 3.6, providing real world 
testing of the concept before we make it the default behaviour.

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-12-11 Thread Nick Coghlan

Nick Coghlan added the comment:

I think we're genuinely getting to the point now where the majority of "LANG=C" 
cases are misconfigurations rather than intended behaviour. We're also to the 
point where:

- on Mac OS X, binary system interfaces have been handled as UTF-8 by default 
since 3.0
- on Windows, as of 3.6, the OS native binary system interfaces are now 
bypassed entirely in favour of transcoding from UTF-8 to UTF-16-LE 

So I think for Python 3.7 it makes sense to do the following on other *nix 
systems:

- very early in CPython startup (even before argument processing), if the 
detected locale is "C", force it to "C.UTF-8" if possible, and print a warning 
either way
- add a PYTHONKEEPASCIILOCALE environment variable to turn that behaviour off

I do think we actually want to *change* the C level locale in the process 
though, as otherwise we can expect to see weird interactions where CPython and 
extension modules disagree about the default text encoding.

--
nosy: +ncoghlan




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-09-23 Thread INADA Naoki

INADA Naoki added the comment:

I want a locale-free Python that behaves as if it were running under the
C.UTF-8 locale (stdio encoding, preferred encoding, weekday in
_strptime._strptime, and maybe more).

But Python 3.6 is already in feature freeze >_<;;

--
nosy: +inada.naoki




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-09-23 Thread Jan Niklas Hasse

Jan Niklas Hasse added the comment:

Why not?

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-09-16 Thread STINNER Victor

STINNER Victor added the comment:

> is this someday already?)

Not yet :-)

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-09-16 Thread R. David Murray

R. David Murray added the comment:

I thought we "fixed" this by using surrogate escape when the locale was ASCII?
We have certainly discussed changing the default on posix, and so far have
decided not to (someday that will change...is this someday already?)
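
For readers unfamiliar with the surrogate escape mechanism mentioned above, here
is a minimal sketch using only the built-in `str`/`bytes` methods. The helper
names are made up for illustration. It shows that non-ASCII bytes survive a
round trip through an ASCII decode, and that the resulting string still cannot
be encoded strictly, which is why the "fix" is only partial.

```python
def decode_ascii_lossless(raw):
    """Decode bytes as ASCII, smuggling undecodable bytes as lone surrogates."""
    return raw.decode("ascii", errors="surrogateescape")


def encode_ascii_lossless(text):
    """Reverse decode_ascii_lossless(), restoring the original bytes."""
    return text.encode("ascii", errors="surrogateescape")


def can_encode_strict_utf8(text):
    """Return True if text can be encoded as UTF-8 without an error handler."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    raw = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9', a UTF-8 file name
    text = decode_ascii_lossless(raw)      # 'caf\udcc3\udca9'
    print(ascii(text))
    print(encode_ascii_lossless(text) == raw)   # the bytes round-trip
    print(can_encode_strict_utf8(text))         # but strict encoding fails
```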

--
nosy: +r.david.murray
stage: resolved -> 
versions: +Python 3.7 -Python 3.5




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-09-16 Thread Jan Niklas Hasse

Jan Niklas Hasse added the comment:

Unfortunately no, as this would mean I'll have to change all my python 
invocations in my scripts and it wouldn't work for executable files with

#!/usr/bin/env python3

would it?

--




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-09-16 Thread STINNER Victor

STINNER Victor added the comment:

> This is a duplicate of issue27781.

issue27781 is specific to Windows. I'm not sure that it's the case in this 
issue. So I'm reopening the issue.

@Jan Niklas Hasse: What is your OS?

I proposed adding a "-X utf8" command line option for UNIX to force the utf8 
encoding. Would it work for you?
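
A rough way to try the option from a script, assuming an interpreter that
already understands "-X utf8" (the option was later added in Python 3.7; at the
time of this message it was only a proposal). The helper name is hypothetical.
It starts a child interpreter with LC_ALL=C and reports the filesystem encoding
that the child sees.

```python
import os
import subprocess
import sys


def filesystem_encoding_with_utf8_option():
    """Run a child Python with -X utf8 under LC_ALL=C and return its
    sys.getfilesystemencoding() value."""
    # Copy the current environment and force the C locale for the child.
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(
        [sys.executable, "-X", "utf8", "-c",
         "import sys; print(sys.getfilesystemencoding())"],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(filesystem_encoding_with_utf8_option())
```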

--
resolution: duplicate -> 
status: closed -> open
superseder: Change sys.getfilesystemencoding() on Windows to UTF-8 -> 




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-09-16 Thread Emanuel Barry

Emanuel Barry added the comment:

This is a duplicate of issue27781.

--
nosy: +ebarry
resolution:  -> duplicate
stage:  -> resolved
status: open -> closed
superseder:  -> Change sys.getfilesystemencoding() on Windows to UTF-8




[issue28180] sys.getfilesystemencoding() should default to utf-8

2016-09-16 Thread Jan Niklas Hasse

New submission from Jan Niklas Hasse:

Working with Docker I often end up with an environment where the locale isn't 
correctly set. In these cases it would be great if sys.getfilesystemencoding() 
could default to 'utf-8' instead of 'ascii', as it's the encoding of the future 
and ascii is a subset of it anyway.
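
To see what a given environment reports, and to check the "subset" claim, a
small sketch along these lines can be run inside the container. The function
names are illustrative only, and sys.getfilesystemencodeerrors() requires
Python 3.6 or later.

```python
import sys


def describe_filesystem_encoding():
    """Return the encoding and error handler Python uses for file names."""
    return {
        "encoding": sys.getfilesystemencoding(),
        "errors": sys.getfilesystemencodeerrors(),
    }


def ascii_is_subset_of_utf8():
    """Check that every ASCII byte decodes to the same character in UTF-8."""
    return all(
        bytes([b]).decode("ascii") == bytes([b]).decode("utf-8")
        for b in range(128)
    )


if __name__ == "__main__":
    print(describe_filesystem_encoding())
    print(ascii_is_subset_of_utf8())
```

Because every ASCII byte sequence is also valid UTF-8 with the same meaning,
file names that work under 'ascii' would keep working under 'utf-8'.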

Related: http://bugs.python.org/issue19846

--
components: Unicode
messages: 276693
nosy: Jan Niklas Hasse, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: sys.getfilesystemencoding() should default to utf-8
type: behavior
versions: Python 3.5
