Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-13 Thread Nick Coghlan

On 11 Dec. 2017 6:50 am, "INADA Naoki"  wrote:

Except one typo I commented on Github,
I accept PEP 540.

Well done, Victor and Nick for PEP 540 and 538.
Python 3.7 will be most UTF-8 friendly Python 3 than ever.


And thank you for all of your work on reviewing them! The appropriate
trade-offs between ease of use in common scenarios and an increased chance
of emitting mojibake are hard to figure out, but I like where we've ended
up :)

Cheers,
Nick.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-11 Thread Guido van Rossum

Congrats Victor! Thanks Mr. Inada for reviewing this PEP (and 538). Thanks
everyone else who participated in the lively discussion!

On Sun, Dec 10, 2017 at 4:00 PM, INADA Naoki  wrote:

> >
> > Could you explain why not? utf-8 seems like the common thread for using
> > surrogateescape so I'm not sure what would make en_US.UTF-8 different
> > than C.UTF-8.
> >
>
> Because there are many lang_COUNTRY.UTF-8 locales:
> ja_JP.UTF-8, zh_TW.UTF-8, fr_FR.UTF-8, etc...
>
> If only en_US.UTF-8 should use surrogateescape, it may make confusing
> situation like: "This script works in English Linux desktop, but doesn't
> work in Japanese Linux desktop!"
>
> I accepted PEP 540.  So even if failed to coerce locale, it is better
> than Python 3.6.
>
> Regards,
>
> INADA Naoki  
>



-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-10 Thread INADA Naoki

>
> Could you explain why not? utf-8 seems like the common thread for using
> surrogateescape so I'm not sure what would make en_US.UTF-8 different than
> C.UTF-8.
>

Because there are many lang_COUNTRY.UTF-8 locales:
ja_JP.UTF-8, zh_TW.UTF-8, fr_FR.UTF-8, etc...

If only en_US.UTF-8 used surrogateescape, it could create a confusing
situation like: "This script works on an English Linux desktop, but doesn't
work on a Japanese Linux desktop!"

I accepted PEP 540.  So even if locale coercion fails, the result is better
than Python 3.6.

Regards,

INADA Naoki  

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-10 Thread Toshio Kuratomi

On Dec 9, 2017 8:53 PM, "INADA Naoki"  wrote:

> Earlier versions of PEP 538 thus included "en_US.UTF-8" on the
> candidate target locale list, but that turned out to cause assorted
> problems due to the "C -> en_US" part of the coercion.

Hm, but PEP 538 says:

> this PEP instead proposes to extend the "surrogateescape" default for
> stdin and stderr error handling to also apply to the three potential
> coercion target locales.

https://www.python.org/dev/peps/pep-0538/#defaulting-to-surrogateescape-error-handling-on-the-standard-io-streams

I don't think en_US.UTF-8 should use surrogateescape error handler.


Could you explain why not? utf-8 seems like the common thread for using
surrogateescape so I'm not sure what would make en_US.UTF-8 different than
C.UTF-8.

-Toshio

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-10 Thread Victor Stinner

2017-12-10 18:46 GMT+01:00 INADA Naoki :
> Except one typo I commented on Github,

Fixed: 
https://github.com/python/peps/commit/08224bf6bdf16b539fb6f8136061877e5924476d

> I accept PEP 540.

Wow, thank you :-) Again, thank you for your very useful feedback
which helped to make the PEP 540 much better than its initial version.

> Well done, Victor and Nick for PEP 540 and 538.
> Python 3.7 will be most UTF-8 friendly Python 3 than ever.

Yep. Once PEP 540 is implemented, we will need to test
both PEPs as much as possible before 3.7 final!

https://bugs.python.org/issue29240
https://github.com/python/cpython/pull/855

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-10 Thread INADA Naoki

Apart from one typo that I commented on in GitHub,
I accept PEP 540.

Well done, Victor and Nick for PEP 540 and 538.
Python 3.7 will be the most UTF-8 friendly Python 3 ever.

INADA Naoki  


On Mon, Dec 11, 2017 at 2:21 AM, Victor Stinner
 wrote:
> Ok, I fixed the effects of the locale coercion (PEP 538). Does it now
> look good to you, Naoki?
>
> https://www.python.org/dev/peps/pep-0540/#relationship-with-the-locale-coercion-pep-538
>
> The commit:
>
> https://github.com/python/peps/commit/71cda51fbb622ece63f7a9d3c8fa6cd33ce06b58
>
> diff --git a/pep-0540.txt b/pep-0540.txt
> index 0a9cbc1e..c163916d 100644
> --- a/pep-0540.txt
> +++ b/pep-0540.txt
> @@ -144,9 +144,15 @@ The POSIX locale enables the locale coercion (PEP
> 538) and the UTF-8
>  mode (PEP 540). When the locale coercion is enabled, enabling the UTF-8
>  mode has no (additional) effect.
>
> -Locale coercion only impacts non-Python code like C libraries, whereas
> -the Python UTF-8 Mode only impacts Python code: the two PEPs are
> -complementary.
> +The UTF-8 has the same effect than locale coercion:
> +``sys.getfilesystemencoding()`` returns ``'UTF-8'``,
> +``locale.getpreferredencoding()`` returns ``UTF-8``, ``sys.stdin`` and
> +``sys.stdout`` error handler set to ``surrogateescape``. These changes
> +only affect Python code. But the locale coercion has addiditonal
> +effects: the ``LC_CTYPE`` environment variable and the ``LC_CTYPE``
> +locale are set to a UTF-8 locale like ``C.UTF-8``. The side effect is
> +that non-Python code is also impacted by the locale coercion. The two
> +PEPs are complementary.
>
>  On platforms where locale coercion is not supported like Centos 7, the
>  POSIX locale only enables the UTF-8 Mode. In this case, Python code uses
>
> Victor
>
>
> 2017-12-10 5:47 GMT+01:00 INADA Naoki :
>> Now I'm OK to accept the PEP, except one nitpick.
>>
>>>
>>> Locale coercion only impacts non-Python code like C libraries, whereas
>>> the Python UTF-8 Mode only impacts Python code: the two PEPs are
>>> complementary.
>>>
>>
>> This sentence seems bit misleading.
>> If UTF-8 mode is disabled explicitly, locale coercion affects Python code 
>> too.
>> locale.getpreferredencoding() is UTF-8, open()' s default encoding is UTF-8,
>> and stdio is UTF-8/surrogateescape.
>>
>> So shouldn't this sentence is: "Locale coercion impacts both of Python code
>> and non-Python code like C libraries, whereas ..."?
>>
>> INADA Naoki  

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-10 Thread Victor Stinner

Ok, I fixed the effects of the locale coercion (PEP 538). Does it now
look good to you, Naoki?

https://www.python.org/dev/peps/pep-0540/#relationship-with-the-locale-coercion-pep-538

The commit:

https://github.com/python/peps/commit/71cda51fbb622ece63f7a9d3c8fa6cd33ce06b58

diff --git a/pep-0540.txt b/pep-0540.txt
index 0a9cbc1e..c163916d 100644
--- a/pep-0540.txt
+++ b/pep-0540.txt
@@ -144,9 +144,15 @@ The POSIX locale enables the locale coercion (PEP
538) and the UTF-8
 mode (PEP 540). When the locale coercion is enabled, enabling the UTF-8
 mode has no (additional) effect.

-Locale coercion only impacts non-Python code like C libraries, whereas
-the Python UTF-8 Mode only impacts Python code: the two PEPs are
-complementary.
+The UTF-8 has the same effect than locale coercion:
+``sys.getfilesystemencoding()`` returns ``'UTF-8'``,
+``locale.getpreferredencoding()`` returns ``UTF-8``, ``sys.stdin`` and
+``sys.stdout`` error handler set to ``surrogateescape``. These changes
+only affect Python code. But the locale coercion has addiditonal
+effects: the ``LC_CTYPE`` environment variable and the ``LC_CTYPE``
+locale are set to a UTF-8 locale like ``C.UTF-8``. The side effect is
+that non-Python code is also impacted by the locale coercion. The two
+PEPs are complementary.

 On platforms where locale coercion is not supported like Centos 7, the
 POSIX locale only enables the UTF-8 Mode. In this case, Python code uses

Victor


2017-12-10 5:47 GMT+01:00 INADA Naoki :
> Now I'm OK to accept the PEP, except one nitpick.
>
>>
>> Locale coercion only impacts non-Python code like C libraries, whereas
>> the Python UTF-8 Mode only impacts Python code: the two PEPs are
>> complementary.
>>
>
> This sentence seems bit misleading.
> If UTF-8 mode is disabled explicitly, locale coercion affects Python code too.
> locale.getpreferredencoding() is UTF-8, open()' s default encoding is UTF-8,
> and stdio is UTF-8/surrogateescape.
>
> So shouldn't this sentence is: "Locale coercion impacts both of Python code
> and non-Python code like C libraries, whereas ..."?
>
> INADA Naoki  

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-10 Thread Victor Stinner

Hi,

Le 10 déc. 2017 05:48, "INADA Naoki"  a écrit :

Now I'm OK to accept the PEP, except one nitpick.


I got a private email about the same issue. I don't think that it's
nitpicking since many people were confused about the relationship between
PEP 538 and PEP 540. So it seems like I was confused as well :-) I was
also confused because my PEP evolved quickly. With the additional
locale.getpreferredencoding() change in my PEP, the two PEPs became even
more similar.

> Locale coercion only impacts non-Python code like C libraries, whereas
> the Python UTF-8 Mode only impacts Python code: the two PEPs are
> complementary.
>

This sentence seems bit misleading.
If UTF-8 mode is disabled explicitly, locale coercion affects Python code
too.
locale.getpreferredencoding() is UTF-8, open()' s default encoding is UTF-8,
and stdio is UTF-8/surrogateescape.

So shouldn't this sentence is: "Locale coercion impacts both of Python code
and non-Python code like C libraries, whereas ..."?


Right. I will rephrase it.

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-09 Thread INADA Naoki

> Earlier versions of PEP 538 thus included "en_US.UTF-8" on the
> candidate target locale list, but that turned out to cause assorted
> problems due to the "C -> en_US" part of the coercion.

Hm, but PEP 538 says:

> this PEP instead proposes to extend the "surrogateescape" default for stdin 
> and stderr error handling to also apply to the three potential coercion 
> target locales.

https://www.python.org/dev/peps/pep-0538/#defaulting-to-surrogateescape-error-handling-on-the-standard-io-streams

I don't think en_US.UTF-8 should use surrogateescape error handler.

Regards,

INADA Naoki  

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-09 Thread INADA Naoki

Now I'm OK to accept the PEP, except one nitpick.

>
> Locale coercion only impacts non-Python code like C libraries, whereas
> the Python UTF-8 Mode only impacts Python code: the two PEPs are
> complementary.
>

This sentence seems a bit misleading.
If UTF-8 mode is disabled explicitly, locale coercion affects Python code too.
locale.getpreferredencoding() is UTF-8, open()'s default encoding is UTF-8,
and stdio is UTF-8/surrogateescape.

So shouldn't this sentence be: "Locale coercion impacts both Python code
and non-Python code like C libraries, whereas ..."?

INADA Naoki  

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread Nick Coghlan

On 9 December 2017 at 01:22, Victor Stinner  wrote:
> I updated my PEP: in the 4th version, locale.getpreferredencoding()
> now returns 'UTF-8' in the UTF-8 Mode.

+1, that's a good change, since it brings the "locale coercion failed"
case even closer to the "locale coercion succeeded" behaviour.

To continue with the CentOS 7 example: that actually does use a UTF-8
based locale by default, it's just en_US.UTF-8 rather than C.UTF-8.

Earlier versions of PEP 538 thus included "en_US.UTF-8" on the
candidate target locale list, but that turned out to cause assorted
problems due to the "C -> en_US" part of the coercion.

Cheers,
Nick.

P.S. Thinking back on the history of the changes though, it may be
worth revisiting the idea of "en_US.UTF-8" as a potential coercion
locale: it was dropped as a potential coercion target back when the
PEP still set both LANG & LC_ALL, whereas it now changes only
LC_CTYPE. That means setting it won't mess with LC_COLLATE, or any of
the other locale categories. That said, I'm not sure if there are
behavioural differences between "LC_CTYPE=C.UTF-8" and
"LC_CTYPE=en_US.UTF-8", so I'm inclined to leave that alone for now.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread Victor Stinner

2017-12-08 17:29 GMT+01:00 Ethan Furman :
> For those of us trying to follow along, is this change to open() one that
> Inada-san was worried about?  Has something else changed?

I agree that my PEP is evolving quickly, that's why I added a "Version
History" at the end:
https://www.python.org/dev/peps/pep-0540/#version-history

"""
Version History
===============

* Version 4: ``locale.getpreferredencoding()`` now returns ``'UTF-8'``
  in the UTF-8 Mode.
* Version 3: The UTF-8 Mode does not change the ``open()`` default error
  handler (``strict``) anymore, and the Strict UTF-8 Mode has been
  removed.
* Version 2: Rewrite the PEP from scratch to make it much shorter and
  easier to understand.
* Version 1: First version posted to python-dev.
"""

Naoki disliked the usage of the surrogateescape error handler for
open(). I "fixed" this in the PEP version 3: open() error handler is
not modified by the PEP.
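
To make the difference between the two error handlers concrete, here is a
minimal sketch. The sample bytes and the helper name try_decode are made up
for illustration and do not come from the PEP; the sketch only shows how
"strict" and "surrogateescape" treat bytes that are not valid UTF-8.

```python
# Illustrative only: `raw` and `try_decode` are hypothetical names.
raw = b"caf\xe9"  # Latin-1 encoded bytes, not valid UTF-8


def try_decode(data, errors):
    """Decode as UTF-8 with the given error handler; return None on failure."""
    try:
        return data.decode("utf-8", errors)
    except UnicodeDecodeError:
        return None


# "strict" refuses the input outright.
strict_result = try_decode(raw, "strict")

# "surrogateescape" keeps the undecodable byte as a lone surrogate ...
escaped = try_decode(raw, "surrogateescape")

# ... so the original bytes can be restored when encoding with the same handler.
roundtrip = escaped.encode("utf-8", "surrogateescape")

print(strict_result, ascii(escaped), roundtrip)
```

The escaped string is not valid text on its own: encoding it with the
"strict" handler raises UnicodeEncodeError. That is the trade-off between
passing data through untouched and risking mojibake further down the line.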

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread Ethan Furman


There were some concerns about open() earlier:

On Wed, 6 Dec 2017 at 06:10 INADA Naoki wrote:
> I think PEP 538 and PEP 540 should behave almost identical except
> changing locale or not.  So I need very strong reason if PEP 540
> changes default error handler of open().

Brett replied:
> I don't have enough locale experience to weigh in as an expert,
> but I already was leaning towards INADA-san's logic of not wanting
> to change open() and this makes me really not want to change it.

On 12/08/2017 07:22 AM, Victor Stinner wrote:

"""
Effects of the UTF-8 Mode:

[...]

Side effects:

* ``open()`` uses the UTF-8 encoding by default.


For those of us trying to follow along, is this change to open() one that
Inada-san was worried about?  Has something else changed?


--
~Ethan~

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread Victor Stinner

2017-12-08 16:22 GMT+01:00 Victor Stinner :
> I updated my PEP: in the 4th version, locale.getpreferredencoding()
> now returns 'UTF-8' in the UTF-8 Mode.

Sorry, I forgot to mention that I already updated the implementation
to the latest version of the PEP:
https://github.com/python/cpython/pull/855

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread Victor Stinner

I updated my PEP: in the 4th version, locale.getpreferredencoding()
now returns 'UTF-8' in the UTF-8 Mode.

https://www.python.org/dev/peps/pep-0540/

I also clarified the direct effects of the UTF-8 Mode, but also listed
the most user visible changes as "Side effects".

"""
Effects of the UTF-8 Mode:

* ``sys.getfilesystemencoding()`` returns ``'UTF-8'``.
* ``locale.getpreferredencoding()`` returns ``UTF-8``, its
  *do_setlocale* argument and the locale encoding are ignored.
* ``sys.stdin`` and ``sys.stdout`` error handler is set to
  ``surrogateescape``

Side effects:

* ``open()`` uses the UTF-8 encoding by default.
* ``os.fsdecode()`` and ``os.fsencode()`` use the UTF-8 encoding.
* Command line arguments, environment variables and filenames use the
  UTF-8 encoding.
"""

Thank you Naoki INADA for your quick feedback, it was very helpful
and I really like how the PEP evolves!

IMHO the PEP 540 version 4 is just perfect and ready for
pronouncement! (... until someone finds another flaw, obviously!)
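
The effects and side effects listed above can be inspected from a running
interpreter. The following is a rough sketch, not part of the PEP: the
function name utf8_mode_report is hypothetical, and it only reads values
through calls named in this thread (sys.flags.utf8_mode,
sys.getfilesystemencoding(), locale.getpreferredencoding()) plus the
errors attribute of the standard streams.

```python
import locale
import sys


def utf8_mode_report():
    """Collect the settings that the UTF-8 Mode is described as changing."""
    return {
        # 1 when the UTF-8 Mode is on, 0 otherwise; None on interpreters
        # that predate the flag (it was added in Python 3.7).
        "utf8_mode": getattr(sys.flags, "utf8_mode", None),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
        # The standard streams may be replaced or missing in some
        # environments, hence the defensive getattr().
        "stdin_errors": getattr(sys.stdin, "errors", None),
        "stdout_errors": getattr(sys.stdout, "errors", None),
    }


if __name__ == "__main__":
    for key, value in utf8_mode_report().items():
        print(f"{key}: {value}")
```

Running the script once normally and once with PYTHONUTF8=1 set in the
environment is one way to compare the two configurations side by side.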

Victor


2017-12-08 13:58 GMT+01:00 Victor Stinner :
> 2017-12-08 6:11 GMT+01:00 INADA Naoki :
>> Or should we change locale.getpreferredencoding() to return UTF-8
>> instead of ASCII always, regardless of PEP 538 and 540?
>
> On the POSIX locale, if the locale coercion works (PEP 538),
> locale.getpreferredencoding() returns UTF-8. We are good.
>
> The question is for platforms like Centos 7 where the locale coercion
> (PEP 538) doesn't work and so Python uses UTF-8 (PEP 540), whereas the
> locale probably uses ASCII (or maybe Latin1).
>
> My current implementation of the PEP 540 is cheating for open(): if
> sys.flags.utf8_mode is non-zero, use the UTF-8 encoding rather than
> calling locale.getpreferredencoding().
>
> I checked the stdlib, and I found many places where
> locale.getpreferredencoding() is used to get the user preferred
> encoding:
>
> * builtin open(): default encoding
> * cgi.FieldStorage: encode the query string
> * encoding._alias_mbcs(): check if the requested encoding is the ANSI code 
> page
> * gettext.GNUTranslations: lgettext() and lngettext() methods
> * xml.etree.ElementTree: ElementTree.write(encoding='unicode')
>
> In the UTF-8 mode, I would expect that cgi, gettext and xml.etree all
> use the UTF-8 encoding by default. So locale.getpreferredencoding()
> should return UTF-8 if the UTF-8 mode is enabled.
>
> The private _alias_mbcs() method can be modified to call directly
> _locale._getdefaultlocale()[1] to get the ANSI code page.
>
> Question: do we need to add an option to getpreferredencoding() to
> return the locale encoding even if the UTF-8 mode is enabled. If yes,
> what should be the API? locale.getpreferredencoding(utf8_mode=False)?
>
> Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread Victor Stinner

2017-12-08 15:01 GMT+01:00 INADA Naoki :
>> In short, locale coercion and UTF-8 mode will be both enabled by the
>> POSIX locale.
>
> Hm, it is bit surprising because I thought UTF-8 mode is fallback
> of locale coercion when coercion is failed or disabled.

I rewrote the "differences between the PEP 538 and the PEP 540" as a
new section "Relationship with the locale coercion (PEP 538)".

https://www.python.org/dev/peps/pep-0540/#relationship-with-the-locale-coercion-pep-538

"""
Relationship with the locale coercion (PEP 538)
===============================================

The POSIX locale enables the locale coercion (PEP 538) and the UTF-8
mode (PEP 540). When the locale coercion is enabled, enabling the UTF-8
mode has no (additional) effect.

Locale coercion only impacts non-Python code like C libraries, whereas
the Python UTF-8 Mode only impacts Python code: the two PEPs are
complementary.

On platforms where locale coercion is not supported like Centos 7, the
POSIX locale only enables the UTF-8 Mode. In this case, Python code uses
the UTF-8 encoding and ignores the locale encoding, whereas non-Python
code uses the locale encoding which is usually ASCII for the POSIX
locale.

While the UTF-8 Mode is supported on all platforms and can be enabled
with any locale, the locale coercion is not supported by all platforms
and is restricted to the POSIX locale.

The UTF-8 Mode has only an impact on Python child processes when the
``PYTHONUTF8`` environment variable is set to ``1``, whereas the locale
coercion sets the ``LC_CTYPE`` environment variables which impacts all
child processes.

The benefit of the locale coercion approach is that it helps ensure that
encoding handling in binary extension modules and child processes is
consistent with Python's encoding handling. The upside of the UTF-8 Mode
approach is that it allows an embedding application to change the
interpreter's behaviour without having to change the process global
locale settings.
"""

I hope that it's now better explained.

In short, the two PEPs are really complementary.

> As PEP 538 [1], all coercion target locales uses surrogateescape
> for stdin and stdout.
> So, do you mean "UTF-8 mode enabled as flag level, but it has no
> real effects"?

Right and it was a deliberate choice of Nick Coghlan when he designed
the PEP 538, to make sure that the two PEPs are complementary and
"compatible".

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread INADA Naoki

On Fri, Dec 8, 2017 at 7:22 PM, Victor Stinner  wrote:
>>
>> Both of PEP 538 (locale coercion) and PEP 540 (UTF-8 mode) shares
>> same logic to detect POSIX locale.
>>
>> When POSIX locale is detected, locale coercion is tried first. And if
>> locale coercion
>> succeeds,  UTF-8 mode is not used because locale is not POSIX anymore.
>
> No, I would like to enable the UTF-8 mode as well in this case.
>
> In short, locale coercion and UTF-8 mode will be both enabled by the
> POSIX locale.
>

Hm, it is a bit surprising because I thought UTF-8 mode was a fallback
for locale coercion when coercion fails or is disabled.

As per PEP 538 [1], all coercion target locales use surrogateescape
for stdin and stdout.
So, do you mean "UTF-8 mode is enabled at the flag level, but it has no
real effects"?

[1]: 
https://www.python.org/dev/peps/pep-0538/#changes-to-the-default-error-handling-on-the-standard-streams

Since coercion target locales and UTF-8 mode do the same thing,
I think this is not a big issue.
But I want it to be clarified in the PEP.

Regards,
---
INADA Naoki  

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread Victor Stinner

2017-12-08 6:11 GMT+01:00 INADA Naoki :
> Or should we change locale.getpreferredencoding() to return UTF-8
> instead of ASCII always, regardless of PEP 538 and 540?

On the POSIX locale, if the locale coercion works (PEP 538),
locale.getpreferredencoding() returns UTF-8. We are good.

The question is for platforms like Centos 7 where the locale coercion
(PEP 538) doesn't work and so Python uses UTF-8 (PEP 540), whereas the
locale probably uses ASCII (or maybe Latin1).

My current implementation of the PEP 540 is cheating for open(): if
sys.flags.utf8_mode is non-zero, use the UTF-8 encoding rather than
calling locale.getpreferredencoding().

I checked the stdlib, and I found many places where
locale.getpreferredencoding() is used to get the user preferred
encoding:

* builtin open(): default encoding
* cgi.FieldStorage: encode the query string
* encodings._alias_mbcs(): check if the requested encoding is the ANSI code page
* gettext.GNUTranslations: lgettext() and lngettext() methods
* xml.etree.ElementTree: ElementTree.write(encoding='unicode')

In the UTF-8 mode, I would expect that cgi, gettext and xml.etree all
use the UTF-8 encoding by default. So locale.getpreferredencoding()
should return UTF-8 if the UTF-8 mode is enabled.
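
As a rough illustration of the coupling between open() and
locale.getpreferredencoding() discussed here, the sketch below opens a
temporary file without an encoding argument and compares the encoding
open() picked with the locale module's answer. The helper names
default_open_encoding and same_codec are hypothetical, and the comparison
goes through codecs.lookup() only to normalise spellings such as "UTF-8"
and "utf8".

```python
import codecs
import locale
import os
import tempfile


def default_open_encoding():
    """Return the encoding that open() picks when none is given."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w") as f:  # no encoding= argument on purpose
            return f.encoding
    finally:
        os.remove(path)


def same_codec(a, b):
    """Compare two encoding names after normalising aliases."""
    return codecs.lookup(a).name == codecs.lookup(b).name


open_enc = default_open_encoding()
locale_enc = locale.getpreferredencoding(False)
print(open_enc, locale_enc, same_codec(open_enc, locale_enc))
```

If the two values ever disagreed, code that calls
locale.getpreferredencoding() directly (cgi, gettext, xml.etree in the list
above) would behave differently from code that simply relies on open().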

The private _alias_mbcs() method can be modified to call directly
_locale._getdefaultlocale()[1] to get the ANSI code page.

Question: do we need to add an option to getpreferredencoding() to
return the locale encoding even if the UTF-8 mode is enabled. If yes,
what should be the API? locale.getpreferredencoding(utf8_mode=False)?

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-08 Thread Victor Stinner

Hi,

Oh, locale.getpreferredencoding(), that's a good question :-)

2017-12-08 6:02 GMT+01:00 INADA Naoki :
> But I want to clarify more about difference/relationship between PEP
> 538 and 540.
>
> If I understand correctly:
>
> Both of PEP 538 (locale coercion) and PEP 540 (UTF-8 mode) shares
> same logic to detect POSIX locale.
>
> When POSIX locale is detected, locale coercion is tried first. And if
> locale coercion
> succeeds,  UTF-8 mode is not used because locale is not POSIX anymore.

No, I would like to enable the UTF-8 mode as well in this case.

In short, locale coercion and UTF-8 mode will be both enabled by the
POSIX locale.


> If locale coercion is disabled or failed, UTF-8 mode is used automatically,
> unless it is disabled explicitly.

PEP 540 is always enabled if the POSIX locale is detected. Only
PYTHONUTF8=0 or -X utf8=0 disable it in this case.

Disabling locale coercion doesn't disable the PEP 540.
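
The two switches mentioned above can be tried out from a script. This is
only a sketch under a few assumptions: it needs an interpreter that
implements the PEP (3.7 or later), the helper name child_utf8_mode is
hypothetical, and it removes any inherited PYTHONUTF8 value so that each
call tests exactly one switch.

```python
import os
import subprocess
import sys


def child_utf8_mode(x_option=None, env_value=None):
    """Start a child interpreter and return its sys.flags.utf8_mode as text."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a clean slate
    if env_value is not None:
        env["PYTHONUTF8"] = env_value

    cmd = [sys.executable]
    if x_option is not None:
        cmd += ["-X", x_option]
    cmd += ["-c", "import sys; print(sys.flags.utf8_mode)"]

    result = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("ascii").strip()


if __name__ == "__main__":
    print("-X utf8      ->", child_utf8_mode(x_option="utf8"))
    print("-X utf8=0    ->", child_utf8_mode(x_option="utf8=0"))
    print("PYTHONUTF8=1 ->", child_utf8_mode(env_value="1"))
    print("PYTHONUTF8=0 ->", child_utf8_mode(env_value="0"))
```

Calling child_utf8_mode() with no arguments shows what the interpreter
decides on its own, which depends on the locale of the machine running it.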


> UTF-8 mode is similar to C.UTF-8 or other locale coercion target locales.
> But UTF-8 mode is different from C.UTF-8 locale in these ways because
> actual locale is not changed:
>
> * Libraries using locale (e.g. readline) works as in POSIX locale.  So UTF-8
>   cannot be used in such libraries.

My assumption is that very few C libraries rely on the locale encoding.
The wchar_t* type is rarely used. You may only get issues if Python
passes a UTF-8 encoded string to a C library which tries to decode it from
a locale encoding which is not UTF-8. For example, with the POSIX
locale, if the locale encoding is ASCII, you can get a decoding error
if a C library tries to decode a UTF-8 encoded string coming from
Python.

But the encoding problem is not restricted to the current process. For
the "producer | consumer" model, if the producer is a Python 3.7
application using UTF-8 mode and so encoding text to UTF-8 on stdout,
the consumer may be unable to decode the UTF-8 data. Here we enter
the grey area of encodings. Which applications rely on the locale
encoding? Which applications always use UTF-8? Do some applications
try UTF-8 first, and fall back on the locale encoding? (OpenSSL does
that on filenames for example, as does glib if I recall correctly.)
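
The consumer-side strategies in question can be sketched in a few lines.
Nothing below is taken from OpenSSL or glib: the function names are
hypothetical, and Latin-1 merely stands in for "the locale encoding" of an
imaginary consumer.

```python
def ascii_consumer(data):
    """A consumer that insists on ASCII; returns None if decoding fails."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError:
        return None


def decode_with_fallback(data, fallback_encoding):
    """Try UTF-8 first; if the bytes are not valid UTF-8, use the fallback."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback_encoding)


text = "caf\u00e9"
produced = text.encode("utf-8")   # what a UTF-8 producer writes to the pipe
legacy = text.encode("latin-1")   # what a Latin-1 producer would write

print(ascii_consumer(produced))                    # the ASCII consumer gives up
print(decode_with_fallback(produced, "latin-1"))   # UTF-8 path succeeds
print(decode_with_fallback(legacy, "latin-1"))     # fallback path is used
```

The fallback approach is a heuristic: bytes in another encoding can happen
to form valid UTF-8, in which case they are decoded as UTF-8 without any
error being reported.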

Until we know exactly how UTF-8 is used in the "wild", I chose to make
the UTF-8 Mode an opt-in option for locales other than POSIX. I expect a
few bug reports later which will help us to adjust our encodings.

> * locale.getpreferredencoding() returns 'ASCII' instead of 'UTF-8'.  So
>   libraries depending on locale.getpreferredencoding() may raise
>   UnicodeErrors.

Right.


> Or locale.getpreferredencoding() returns UTF-8 in UTF-8 mode too?

Here is where the PEP 538 plays very nicely with the PEP 540. On
platforms where the locale coercion is supported (Fedora, macOS,
FreeBSD, maybe other Linux distributions), on the POSIX locale,
locale.getpreferredencoding() will return UTF-8 and functions like
mbstowcs() will use the UTF-8 encoding internally.

Currently, in the implementation of my PEP 540, I chose to modify
open() to use UTF-8 if the UTF-8 mode is used, rather than using
locale.getpreferredencoding().

Victor

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-07 Thread INADA Naoki

> Or locale.getpreferredencoding() returns UTF-8 in UTF-8 mode too?

Or should we change locale.getpreferredencoding() to return UTF-8
instead of ASCII always, regardless of PEP 538 and 540?

INADA Naoki  

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-07 Thread INADA Naoki

Looks nice.

But I want to clarify more about the difference/relationship between PEP
538 and 540.

If I understand correctly:

Both PEP 538 (locale coercion) and PEP 540 (UTF-8 mode) share
the same logic to detect the POSIX locale.

When the POSIX locale is detected, locale coercion is tried first. And if
locale coercion succeeds, UTF-8 mode is not used because the locale is not
POSIX anymore.

If locale coercion is disabled or fails, UTF-8 mode is used automatically,
unless it is disabled explicitly.

UTF-8 mode is similar to C.UTF-8 or other locale coercion target locales.
But UTF-8 mode is different from the C.UTF-8 locale in these ways because
the actual locale is not changed:

* Libraries using the locale (e.g. readline) work as in the POSIX locale.
  So UTF-8 cannot be used in such libraries.
* locale.getpreferredencoding() returns 'ASCII' instead of 'UTF-8'.  So
  libraries depending on locale.getpreferredencoding() may raise
  UnicodeErrors.

Am I correct?
Or does locale.getpreferredencoding() return UTF-8 in UTF-8 mode too?
INADA Naoki  


On Fri, Dec 8, 2017 at 9:50 AM, Victor Stinner  wrote:
> Hi,
>
> I made the following two changes to the PEP 540:
>
> * open() error handler remains "strict"
> * remove the "Strict UTF8 mode" which doesn't make much sense anymore
>
> I wrote the Strict UTF-8 mode when open() used surrogateescape error
> handler in the UTF-8 mode. I don't think that a Strict UTF-8 mode is
> required just to change the error handler of stdin and stdout. Well,
> read the "Passthrough undecodable bytes: surrogateescape" section of
> the PEP rationale :-)
>
>
> https://www.python.org/dev/peps/pep-0540/
>
> Victor
>

Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-07 Thread Chris Barker - NOAA Federal

> I made the following two changes to PEP 540:
>
> * open() error handler remains "strict"
> * remove the "Strict UTF8 mode", which doesn't make much sense anymore

+1, ignore my previous note.

-CHB

> I wrote the Strict UTF-8 mode when open() used the surrogateescape error
> handler in the UTF-8 mode. I don't think that a Strict UTF-8 mode is
> required just to change the error handler of stdin and stdout. Well,
> read the "Passthrough undecodable bytes: surrogateescape" section of
> the PEP rationale :-)
>
> https://www.python.org/dev/peps/pep-0540/
>
> Victor



[Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

2017-12-07 Thread Victor Stinner

Hi,

I made the following two changes to PEP 540:

* open() error handler remains "strict"
* remove the "Strict UTF8 mode", which doesn't make much sense anymore

I wrote the Strict UTF-8 mode when open() used the surrogateescape error
handler in the UTF-8 mode. I don't think that a Strict UTF-8 mode is
required just to change the error handler of stdin and stdout. Well,
read the "Passthrough undecodable bytes: surrogateescape" section of
the PEP rationale :-)


https://www.python.org/dev/peps/pep-0540/

Victor


PEP: 540
Title: Add a new UTF-8 mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner 
BDFL-Delegate: INADA Naoki
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 5-January-2016
Python-Version: 3.7


Abstract
========

Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
This mode is enabled by default in the POSIX locale, but otherwise
disabled by default.

The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
variable are added to control the UTF-8 mode.


Rationale
=========

Locale encoding and UTF-8
-------------------------

Python 3.6 uses the locale encoding for filenames, environment
variables, standard streams, etc. The locale encoding is inherited from
the locale; the encoding and the locale are tightly coupled.

Many users inherit the ASCII encoding from the POSIX locale, aka the "C"
locale, but are unable to change the locale for various reasons. This
encoding is very limited in terms of Unicode support: any non-ASCII
character is likely to cause trouble.
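
To illustrate the limitation, here is a minimal sketch that encodes a
non-ASCII string by hand. It only mimics what happens when the locale
encoding is ASCII; it does not change or inspect the real locale.

```python
text = "caf\u00e9"  # "café", contains one non-ASCII character

# The POSIX locale usually implies ASCII, which cannot represent U+00E9.
error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc

# UTF-8 encodes the same text without trouble.
data = text.encode("utf-8")
```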

It is not easy to get the expected locale. Locales don't have exactly the
same name on all Linux distributions, FreeBSD, macOS, etc. Some
locales, like the recent ``C.UTF-8`` locale, are only supported by a few
platforms. For example, an SSH connection can use a different encoding
than the filesystem or terminal encoding of the local host.

On the other hand, Python 3.6 already uses UTF-8 by default on
macOS, Android and Windows (PEP 529) for most functions, except for
``open()``. UTF-8 is also the default encoding of Python scripts, XML
and JSON file formats. The Go programming language uses UTF-8 for
strings.

When all data are stored as UTF-8 but the locale is often misconfigured,
an obvious solution is to ignore the locale and use UTF-8.

PEP 538 attempts to mitigate this problem by coercing the C locale
to a UTF-8 based locale when one is available, but that isn't a
universal solution. For example, CentOS 7's container images default
to the POSIX locale, and don't include the C.UTF-8 locale, so PEP 538's
locale coercion is ineffective.


Passthrough undecodable bytes: surrogateescape
----------------------------------------------

When decoding bytes from UTF-8 using the ``strict`` error handler, which
is the default, Python 3 raises a ``UnicodeDecodeError`` on the first
undecodable byte.

Unix command line tools like ``cat`` or ``grep`` and most Python 2
applications simply do not have this class of bugs: they don't decode
data, but process data as a raw bytes sequence.

Python 3 already has a solution to behave like Unix tools and Python 2:
the ``surrogateescape`` error handler (:pep:`383`). It allows data to be
processed "as bytes" while using Unicode in practice (undecodable bytes are
stored as surrogate characters).
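
A small sketch of that round trip, using only the ``bytes.decode()`` and
``str.encode()`` methods (the sample bytes are made up for illustration):

```python
raw = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8

# The default "strict" handler refuses the undecodable byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# surrogateescape maps the byte 0xE9 to the lone surrogate U+DCE9 ...
text = raw.decode("utf-8", errors="surrogateescape")

# ... and encoding with the same handler restores the original bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")
```

The string can be sliced, searched and concatenated like any other ``str``
in between, which is what makes the "as bytes" processing possible.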

The UTF-8 mode uses the ``surrogateescape`` error handler for ``stdin``
and ``stdout`` since these streams are commonly associated with Unix
command line tools.
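
One way to observe this is to ask a child interpreter which error handlers
its standard streams use. This is a sketch that assumes Python 3.7 or later,
where the mode shipped; ``PYTHONIOENCODING`` is dropped from the environment
because it could override the handlers.

```python
import os
import subprocess
import sys

# Start from the current environment, minus variables that affect the result.
env = {k: v for k, v in os.environ.items()
       if k not in ("PYTHONIOENCODING", "PYTHONUTF8")}

code = "import sys; print(sys.stdin.errors, sys.stdout.errors)"
result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", code],
    env=env,
    stdin=subprocess.DEVNULL,
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)
handlers = result.stdout.split()  # error handler names for stdin and stdout
```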

However, users have different expectations for files. Files are expected
to be properly encoded. Python is expected to fail early when ``open()``
is called with the wrong options, like opening a JPEG picture in text
mode. The ``open()`` default error handler remains ``strict`` for these
reasons.
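
The fail-early behaviour can be sketched as follows. The temporary file
stands in for a JPEG picture and holds only a few made-up header bytes.

```python
import os
import tempfile

# Write a few bytes that start like a JPEG file; 0xFF is never valid in UTF-8.
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
    tmp.write(b"\xff\xd8\xff\xe0")
    path = tmp.name

try:
    # errors is not given, so open() uses the "strict" handler.
    with open(path, encoding="utf-8") as f:
        f.read()
    failed_early = False
except UnicodeDecodeError:
    failed_early = True
finally:
    os.remove(path)
```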


No change by default for best backward compatibility
----------------------------------------------------

While UTF-8 is perfect in most cases, sometimes the locale encoding is
actually the best encoding.

This PEP changes the behaviour for the POSIX locale since this locale
usually gives the ASCII encoding, whereas UTF-8 is a much better choice.
It does not change the behaviour for other locales to prevent any risk
of regression.

As users are responsible for explicitly enabling the new UTF-8 mode, they
are responsible for any potential mojibake issues caused by this mode.


Proposal
========

Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
This mode is enabled by default in the POSIX locale, but otherwise
disabled by default.

The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
variable are added. The UTF-8 mode is enabled by ``-X utf8`` or
``PYTHONUTF8=1``.

The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
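
The switches above can be exercised from a small script. This is a sketch
that assumes Python 3.7 or later and reads ``sys.flags.utf8_mode``, an
attribute of the released implementation that the text above does not
mention; the ``utf8_mode()`` helper is a hypothetical name.

```python
import os
import subprocess
import sys


def utf8_mode(args=(), env_value=None):
    """Return sys.flags.utf8_mode as reported by a child interpreter."""
    # Drop any inherited PYTHONUTF8 so each call controls it explicitly.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUTF8"}
    if env_value is not None:
        env["PYTHONUTF8"] = env_value
    cmd = [sys.executable, *args, "-c",
           "import sys; print(sys.flags.utf8_mode)"]
    out = subprocess.run(cmd, env=env, stdout=subprocess.PIPE,
                         universal_newlines=True, check=True).stdout
    return int(out)


enabled_by_option = utf8_mode(args=["-X", "utf8"])
disabled_by_option = utf8_mode(args=["-X", "utf8=0"])
enabled_by_env = utf8_mode(env_value="1")
disabled_by_env = utf8_mode(env_value="0")
```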

For standard streams, the ``PYT
