Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Greg Ewing

Victor Stinner wrote:

> Users don't use stdin and
> stdout as regular files, they are more used as pipes to pass data
> between programs with the Unix pipe in a shell like "producer |
> consumer". Sometimes stdout is redirected to a file, but I consider
> that it is expected to behave as a pipe and the regular TTY stdout.


It seems weird to me to make a distinction between stdin/stdout
connected to a file and accessing the file some other way.

It would be surprising, for example, if the following two
commands behaved differently with respect to encoding:

   cat foo | sort

   cat < foo | sort


> But Naoki explained that open() is commonly misused to open binary
> files and Python should somehow fail badly to notify the developer of
> their mistake.


Maybe if you *explicitly* open the file in text mode it
should default to surrogateescape, but use strict if text
mode is being used by default?

I.e.

   open("foo", "rt") --> surrogateescape
   open("foo")   --> strict

That way you can easily open a file in a way that's
compatible with the way stdin/stdout behave, but you
will get bitten if you mistakenly open a binary file
as text.
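
For illustration, here is a minimal sketch of how the two handlers differ when reading. It uses today's explicit errors= argument, not the mode-dependent default suggested above, which is only a proposal in this thread. The file name and its contents are made up.

```python
import os
import tempfile

# Hypothetical data: valid UTF-8 text followed by one stray Latin-1 byte (0xe9).
raw = b"caf\xc3\xa9 \xe9\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo")
    with open(path, "wb") as f:
        f.write(raw)

    # strict: the stray byte is reported as soon as the file is read.
    try:
        with open(path, "rt", encoding="utf-8", errors="strict") as f:
            f.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # surrogateescape: the stray byte comes through as a lone surrogate.
    with open(path, "rt", encoding="utf-8",
              errors="surrogateescape", newline="") as f:
        text = f.read()

print(strict_failed)
print(ascii(text))
```

The escaped string can be turned back into the original bytes with text.encode("utf-8", "surrogateescape"), which is what makes the handler attractive for pass-through tools.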

--
Greg

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Chris Barker - NOAA Federal
I’m a bit confused:

File names and the like are one thing, and the CONTENTS of files is quite
another.

I get that there is theoretically a “default” encoding for the contents of
text files, but that is SO likely to be wrong as to be ignorable.

open() already defaults to utf-8. Which is a fine default if you are going
to have one, but it seems a bad idea to have it default to surrogateescape
EVER, regardless of the locale or anything else.

If the file is binary, or a different encoding, or simply broken, it’s much
better to get an encoding error as soon as possible.

Why does this have anything to do with the PEP?

Perhaps it is the issue of reading a filename from the system, writing it
to a file, then reading it back in again?

I actually do that a lot — but mostly so I can pass that file to another
system, so I really don’t want broken encoding in it anyway.

-CHB



On Dec 7, 2017, at 5:53 PM, Glenn Linderman  wrote:

On 12/7/2017 5:45 PM, Jonathan Goble wrote:

On Thu, Dec 7, 2017 at 8:38 PM Glenn Linderman 
wrote:

> If it were to be changed, one could add a text-mode option in 3.7, say "t"
> in the mode string, and a PendingDeprecationWarning for open calls without
> the specification of either t or b in the mode string.
>

"t" is already supported in open()'s mode argument [1] as a way to
explicitly request text mode, though it's essentially ignored right now
since text is the default anyway. So since the option is already present,
the only thing needed at this stage for your plan would be to begin
deprecating not using it.

*goes back to lurking*

[1] https://docs.python.org/3/library/functions.html#open


Thanks for briefly de-lurking.

So then for PEP 540... use surrogateescape immediately for t mode.

Then, when the user encounters an encoding error, there would be three
solutions: switch to t mode, explicitly switch to surrogateescape, or fix
the locale.



Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Glenn Linderman

On 12/7/2017 5:45 PM, Jonathan Goble wrote:
> On Thu, Dec 7, 2017 at 8:38 PM Glenn Linderman wrote:
>
>> If it were to be changed, one could add a text-mode option in 3.7,
>> say "t" in the mode string, and a PendingDeprecationWarning for
>> open calls without the specification of either t or b in the mode
>> string.
>
> "t" is already supported in open()'s mode argument [1] as a way to
> explicitly request text mode, though it's essentially ignored right
> now since text is the default anyway. So since the option is already
> present, the only thing needed at this stage for your plan would be to
> begin deprecating not using it.
>
> *goes back to lurking*
>
> [1] https://docs.python.org/3/library/functions.html#open


Thanks for briefly de-lurking.

So then for PEP 540... use surrogateescape immediately for t mode.

Then, when the user encounters an encoding error, there would be three 
solutions: switch to t mode, explicitly switch to surrogateescape, or 
fix the locale.


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Jonathan Goble
On Thu, Dec 7, 2017 at 8:38 PM Glenn Linderman 
wrote:

> If it were to be changed, one could add a text-mode option in 3.7, say "t"
> in the mode string, and a PendingDeprecationWarning for open calls without
> the specification of either t or b in the mode string.
>

"t" is already supported in open()'s mode argument [1] as a way to
explicitly request text mode, though it's essentially ignored right now
since text is the default anyway. So since the option is already present,
the only thing needed at this stage for your plan would be to begin
deprecating not using it.

*goes back to lurking*

[1] https://docs.python.org/3/library/functions.html#open
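
A small sketch of the point about "t": with current Python, "r" and "rt" give the same kind of text-mode object, and only "b" changes what open() returns. The temporary file name is made up for the example.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("hello\n")

    # "r" (text by default) and "rt" (text requested explicitly).
    with open(path, "r", encoding="utf-8") as implicit, \
            open(path, "rt", encoding="utf-8") as explicit:
        same_type = type(implicit) is type(explicit)
        is_text = isinstance(explicit, io.TextIOWrapper)
        same_content = implicit.read() == explicit.read()

    # "rb" is the only spelling that yields a binary file object.
    with open(path, "rb") as binary:
        is_binary = isinstance(binary, io.BufferedReader)

print(same_type, is_text, same_content, is_binary)
```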


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Glenn Linderman

On 12/7/2017 4:48 PM, Victor Stinner wrote:


> Ok, now comes the real question, open().
>
> For open(), I used the example of a code snippet *writing* the content
> of a directory (os.listdir) into a text file. Another example is to
> read filenames from a text file but pass through undecodable bytes
> thanks to surrogateescape.
>
> But Naoki explained that open() is commonly misused to open binary
> files and Python should somehow fail badly to notify the developer of
> their mistake.


So the real problem here is that open has a default mode of text. 
Instead of forcing the user to specify either "text" or "binary" when 
opening, text is used as a default, binary as an option to be specified.


I understand that default has a long history in Unix-land, dating at
least as far back as 1977 when I first learned how to use the Unix open()
function.


And now it would be an incompatible change to change it.

The real question is whether or not it is a good idea to change it... at 
this point in time, with Unicode and UTF-8 so prevalent, text and binary 
modes are far different than back in 1977, when they mostly just 
documented that this was a binary file that was being opened, and that 
one could more likely expect to see read() than fgets() in the following 
code.


If it were to be changed, one could add a text-mode option in 3.7, say 
"t" in the mode string, and a PendingDeprecationWarning for open calls 
without the specification of either t or b in the mode string.


In 3.8, the warning would be changed to DeprecationWarning.

In 3.9, all open calls would need to have either t or b, or would fail.

Meanwhile, back on the PEP 540 ranch, text mode open calls could 
immediately use surrogateescape, binary mode open calls would not, and 
unspecified open calls would not.
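
To make the first step of this schedule concrete, here is a rough sketch of a wrapper that warns when the mode names neither "t" nor "b". The name checked_open is hypothetical; nothing like it exists in Python, and the builtin open() does not behave this way.

```python
import os
import warnings


def checked_open(file, mode="r", *args, **kwargs):
    """Hypothetical wrapper: warn when neither "t" nor "b" is in the mode."""
    if "t" not in mode and "b" not in mode:
        warnings.warn(
            'open() mode should include "t" or "b"',
            PendingDeprecationWarning,
            stacklevel=2,
        )
    return open(file, mode, *args, **kwargs)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with checked_open(os.devnull, "r"):    # unspecified: warns
        pass
    with checked_open(os.devnull, "rt"):   # explicit text: silent
        pass
    with checked_open(os.devnull, "rb"):   # explicit binary: silent
        pass

pending = [w for w in caught
           if issubclass(w.category, PendingDeprecationWarning)]
print(len(pending))
```

Moving to DeprecationWarning and then to an error, as in the later steps, would only change the category passed to warnings.warn() and finally replace the warning with an exception.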


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Victor Stinner
2017-12-08 0:26 GMT+01:00 Guido van Rossum :
> You will quickly get decoding errors, and that is INADA's point. (Unless you
> use encoding='Latin-1'.) His worry is that the surrogateescape error handler
> makes it so that you won't get decoding errors, and then the failure mode is
> much harder to debug.

Hum, my question was more to know if Python fails because of an
operation failing with strings whereas bytes were expected, or if
Python fails with a decoding error... But now I'm not sure anymore
that this level of detail really matters.


Let me think out loud. To explain unicode issues, I like to use
filenames, since it's something that users view commonly, handle
directly and can modify (and so enter many non-ASCII characters like
diacritics and emojis ;-)).

Filenames can be found on the command line, in environment variables
(PYTHONSTARTUP), stdin (read a list of files from stdin), stdout
(write the list of files into stdout), but also in text files (the
Mercurial "makefile problem").

I consider that the command line and environment variables should
"just work" and so use surrogateescape. It would be too annoying to
not even be able to *start* Python because of a Unicode error. For
example, it wouldn't be easy to identify which environment variable
causes the issue. Fortunately, the UTF-8 mode doesn't change anything here:
surrogateescape is already used since Python 3.3 for the command line
and environment variables.
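
As a small illustration of what "just work" means here, the sketch below decodes a made-up byte string, standing in for an undecodable command line argument or environment variable, with and without surrogateescape. It calls the codec directly, not the interpreter's startup code, so it behaves the same on every platform.

```python
# Hypothetical argument bytes: Latin-1 "café.txt", which is not valid UTF-8.
name_bytes = b"caf\xe9.txt"

# Strict decoding refuses the value outright.
try:
    name_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps the bad byte to a lone surrogate and back again.
decoded = name_bytes.decode("utf-8", "surrogateescape")
restored = decoded.encode("utf-8", "surrogateescape")

print(strict_ok)
print(ascii(decoded))
print(restored == name_bytes)
```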

For stdin/stdout, I think that the main motivation here is to write
Unix command line tools using Python 3: pass-through undecodable bytes
without bugging the user with Unicode. Users don't use stdin and
stdout as regular files, they are more used as pipes to pass data
between programs with the Unix pipe in a shell like "producer |
consumer". Sometimes stdout is redirected to a file, but I consider
that it is expected to behave as a pipe and the regular TTY stdout.
IMHO we are still in the safe surrogateescape area (for the specific
case of the UTF-8 mode).
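
The pass-through behaviour can be sketched without a real pipe by wrapping in-memory byte buffers the way sys.stdin and sys.stdout wrap their underlying streams. The names stdin_like and stdout_like and the sample data are made up for the example.

```python
import io

# Hypothetical input: one clean line and one line with an undecodable byte.
data = b"ok line\nbad \xff byte\n"

stdin_like = io.TextIOWrapper(
    io.BytesIO(data), encoding="utf-8",
    errors="surrogateescape", newline="")
out_buffer = io.BytesIO()
stdout_like = io.TextIOWrapper(
    out_buffer, encoding="utf-8",
    errors="surrogateescape", newline="")

# A trivial "filter": copy every line from input to output as text.
for line in stdin_like:
    stdout_like.write(line)
stdout_like.flush()

passed_through = out_buffer.getvalue()
print(passed_through == data)
```

With errors="strict" on either side, the second line would raise instead of being copied.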


Ok, now comes the real question, open().

For open(), I used the example of a code snippet *writing* the content
of a directory (os.listdir) into a text file. Another example is to
read filenames from a text file but pass through undecodable bytes
thanks to surrogateescape.

But Naoki explained that open() is commonly misused to open binary
files and Python should somehow fail badly to notify the developer of
their mistake.

If I have to choose between the two categories of usage of
open(), "read undecodable bytes in UTF-8 from a text file" versus
"misuse open() on binary file", I expect that the latter is more common,
so open() shouldn't use surrogateescape by default.

While stdin and stdout are usually associated to Unix pipes and Unix
tools working on bytes, files are more commonly associated to
important data that must not be lost nor corrupted. Python is expected
to "help" the developer to use the proper options to read content from
a file and to write content into a file. So I understand that open()
should use the "strict" error handler in the UTF-8 mode, rather than
"surrogateescape".

I can live with this "tiny" change to my PEP. I just posted a 3rd
version of my PEP where the open() error handler remains strict (it is
no longer changed by the PEP).

Victor


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Guido van Rossum
On Thu, Dec 7, 2017 at 3:02 PM, Victor Stinner 
wrote:

> 2017-12-06 5:07 GMT+01:00 INADA Naoki :
> > And opening binary file without "b" option is very common mistake of new
> > developers.  If default error handler is surrogateescape, they lose a
> > chance to notice their bug.
>
> To come back to your original point, I didn't know that it was a
> common mistake to open binary files in text mode.
>

It probably is because in Python 2 it makes no difference on UNIX, and on
Windows the only difference is that binary mode preserves \r.


> Honestly, I didn't try recently. How does Python behave when you do that?
>
> Is it possible to write a full binary parser using the text mode? You
> should quickly get issues pointing you to your mistake, no?
>

You will quickly get decoding errors, and that is INADA's point. (Unless
you use encoding='Latin-1'.) His worry is that the surrogateescape error
handler makes it so that you won't get decoding errors, and then the
failure mode is much harder to debug.
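
A short sketch of the three outcomes described above, using the first bytes of a JFIF/JPEG file as stand-in binary data. The helper try_decode is made up for the example.

```python
# First bytes of a JFIF file: SOI marker, APP0 marker, length, "JFIF\0".
jpeg_header = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"


def try_decode(data, encoding, errors="strict"):
    """Return the decoded text, or None if decoding fails."""
    try:
        return data.decode(encoding, errors)
    except UnicodeDecodeError:
        return None


strict_result = try_decode(jpeg_header, "utf-8")    # fails right away
latin1_result = try_decode(jpeg_header, "latin-1")  # never fails
escaped_result = try_decode(jpeg_header, "utf-8", "surrogateescape")

print(strict_result is None)
print(latin1_result is not None)
print(ascii(escaped_result))
```

With surrogateescape the mistake does not surface at read time; it only shows up later, for example when the string is encoded with the strict handler.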

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Victor Stinner
2017-12-06 5:07 GMT+01:00 INADA Naoki :
> And opening binary file without "b" option is very common mistake of new
> developers.  If default error handler is surrogateescape, they lose a chance
> to notice their bug.

To come back to your original point, I didn't know that it was a
common mistake to open binary files in text mode.

Honestly, I didn't try recently. How does Python behave when you do that?

Is it possible to write a full binary parser using the text mode? You
should quickly get issues pointing you to your mistake, no?

Victor


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-07 Thread Victor Stinner
While I'm not strongly convinced that the open() error handler must be
changed to surrogateescape, first I would like to make sure that it's
really a very bad idea before changing it :-)


2017-12-07 7:49 GMT+01:00 INADA Naoki :
> I just came up with crazy idea; changing default error handler of open()
> to "surrogateescape" only when open mode is "w" or "a".

The idea is tempting but I'm not sure that it's a good idea. Moreover,
what about "r+" and "w+" modes?

I dislike getting a different behaviour for inputs and outputs. The
motivation for surrogateescape is to "pass through" undecodable bytes:
you need to handle them on the input side and on the output side.

That's why I decided to not only change sys.stdin error handler to
surrogateescape for the POSIX locale, but also sys.stdout:
https://bugs.python.org/issue19977


> When reading, "surrogateescape" error handler is dangerous because
> it can produce arbitrary broken unicode string by mistake.

I'm fine with that. I wouldn't say that it's the purpose of the PEP,
but sadly it's an expected, known and documented side effect.

You get the same behaviour with Unix command line tools and most
Python 2 applications (processing data as bytes). Nothing new under
the sun.

The PEP 540 allows users to write applications behaving like Unix
tools/Python 2 with the power of the Python 3 language and stdlib.

Again, use the Strict UTF-8 mode if you prioritize *correctness* over
*usability*.

Honestly, I'm not even sure that the Strict UTF-8 mode is *usable* in
practice, since we are all surrounded by old documents encoded to
various "legacy" encodings (where legacy means: "not UTF-8", like
Latin1 or ShiftJIS). The first non-ASCII character which is not
encoded to UTF-8 is going to "crash" the application (big traceback
with a Unicode error).


Maybe the problem is the feature name: "UTF-8 mode". Users may think
of "strict" when they read "UTF-8", since UTF-8 is known to be a
strict encoding. For example, UTF-8 is much stricter than latin1 which
is unable to tell if a document was encoded latin1 or whatever else.
UTF-8 is able to tell if a document was actually encoded to UTF-8 or
not, thanks to the design of the encoding itself.
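
The difference can be sketched in a few lines. The helper name looks_like_utf8 is made up for the example. Latin-1 accepts every byte sequence, so it cannot be used as a validity check, while a strict UTF-8 decode can.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "café".encode("utf-8")       # b"caf\xc3\xa9"
latin1_bytes = "café".encode("latin-1")   # b"caf\xe9"

print(looks_like_utf8(utf8_bytes))
print(looks_like_utf8(latin1_bytes))

# Latin-1 happily decodes the UTF-8 bytes, producing mojibake, not an error.
print(utf8_bytes.decode("latin-1"))
```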



> And it doesn't allow following code:
>
> with open("image.jpg", "r") as f:  # Binary data, not UTF-8
>     return f.read()

Using a JPEG image, the example is obviously wrong.

But using surrogateescape on open() is written to read *text files*
which are mostly correctly encoded to UTF-8, except a few bytes.

I'm not sure how to explain the issue. The Mercurial wiki page has a
good example of this issue that they call the "Makefile problem":
https://www.mercurial-scm.org/wiki/EncodingStrategy#The_.22makefile_problem.22

While it's not exactly the issue discussed here, it gives you an idea of
the kind of issues that you have when you use open(filename,
encoding="utf-8", errors="strict") versus open(filename,
encoding="utf-8", errors="surrogateescape").
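
A rough sketch of that kind of situation: a file that is almost entirely ASCII/UTF-8 but carries one Latin-1 byte in a comment, edited as text and written back. The file contents are made up, and this is only an illustration, not how Mercurial handles it.

```python
import os
import tempfile

# Hypothetical Makefile: the comment contains a Latin-1 byte (0xe9).
original = b"# auteur: Ren\xe9\nCFLAGS = -O1\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Makefile")
    with open(path, "wb") as f:
        f.write(original)

    # Read as text; the undecodable byte becomes a lone surrogate.
    with open(path, encoding="utf-8",
              errors="surrogateescape", newline="") as f:
        text = f.read()

    # Edit an unrelated line.
    text = text.replace("-O1", "-O2")

    # Write back; the lone surrogate turns into the original byte again.
    with open(path, "w", encoding="utf-8",
              errors="surrogateescape", newline="") as f:
        f.write(text)

    with open(path, "rb") as f:
        edited = f.read()

print(edited)
```

With errors="strict" the first read would raise UnicodeDecodeError, so the unrelated edit could not be made without first deciding what to do with the comment.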


> I'm not sure whether this is a good idea.  And I don't know when the write
> error handler should change; only when PEP 538 or PEP 540 is used?
> Or always when sys.getfilesystemencoding() is UTF-8?
>
> Any thoughts?

The PEP 538 doesn't affect the error handler. The PEP 540 only changes
the error handler for the POSIX locale, it's a deliberate choice. The
PEP 538 is only enabled for the POSIX locale, and the PEP 540 will
also be enabled by default by this locale.

I dislike the idea of changing the error handler if the filesystem
encoding is UTF-8. The UTF-8 mode must be enabled explicitly, on
purpose. This reduces the risk of regression, and prepares users who
enable it for any potential issue.

Victor


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread INADA Naoki
> I care only about builtin open()'s behavior.
> PEP 538 doesn't change default error handler of open().
>
> I think PEP 538 and PEP 540 should behave almost identical except
> changing locale
> or not.  So I need very strong reason if PEP 540 changes default error
> handler of open().
>

I just came up with crazy idea; changing default error handler of open()
to "surrogateescape" only when open mode is "w" or "a".

When reading, "surrogateescape" error handler is dangerous because
it can produce arbitrary broken unicode string by mistake.

On the other hand, the "surrogateescape" error handler is not so
dangerous for writing if the encoding is UTF-8.
Writing a normal unicode string doesn't create broken data.
When writing a string containing surrogateescaped data, the data was
already (partially) broken before writing.

This idea allows the following code:

    import os

    with open("files.txt", "w") as f:
        for fn in os.listdir():  # may return surrogateescaped strings
            f.write(fn + '\n')

And it doesn't allow the following code:

    with open("image.jpg", "r") as f:  # Binary data, not UTF-8
        return f.read()
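
The writing side can be sketched without touching the filesystem. The list below stands in for what os.listdir() might return when one name is not valid UTF-8, and encode_listing is a made-up helper that does what a text file's encoder would do on write.

```python
# Hypothetical os.listdir() result: the second name holds an escaped byte.
names = ["report.txt", "caf\udce9.txt"]


def encode_listing(names, errors):
    """Encode one name per line, or return None if encoding fails."""
    text = "".join(name + "\n" for name in names)
    try:
        return text.encode("utf-8", errors)
    except UnicodeEncodeError:
        return None


strict_bytes = encode_listing(names, "strict")
escaped_bytes = encode_listing(names, "surrogateescape")

print(strict_bytes)
print(escaped_bytes)
```

Under the strict handler the whole write fails because of the one escaped name; under surrogateescape the original byte is written back unchanged.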


I'm not sure whether this is a good idea.  And I don't know when the write
error handler should change; only when PEP 538 or PEP 540 is used?
Or always when sys.getfilesystemencoding() is UTF-8?

Any thoughts?

INADA Naoki  


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Nick Coghlan
On 7 December 2017 at 08:20, Victor Stinner  wrote:
> 2017-12-06 23:07 GMT+01:00 Antoine Pitrou :
>> One question: how do you plan to test for the POSIX locale?
>
> I'm not sure. I will probably rely on Nick for that ;-) Nick already
> implemented this exact check for his PEP 538 which is already
> implemented in Python 3.7.
>
> I already implemented the PEP 540:
>
>https://bugs.python.org/issue29240
>https://github.com/python/cpython/pull/855
>
> Right now, my implementation uses:
>
>char *ctype = _PyMem_RawStrdup(setlocale(LC_CTYPE, ""));
>...
>if (strcmp(ctype, "C") == 0) ...

We have a private helper for this as a result of the PEP 538
implementation: _Py_LegacyLocaleDetected()

Details are in the source code at
https://github.com/python/cpython/blob/master/Python/pylifecycle.c#L345

As per my comment there, and Jakub Wilk's post to this thread, we're
missing a case to also check for the string "POSIX" (which will fix
several of the current locale coercion discrepancies between Linux and
*BSD systems).
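
For readers more at home in Python than C, the check under discussion amounts to something like the sketch below. The helper is_legacy_locale_name is hypothetical and is not the logic of _Py_LegacyLocaleDetected(). It only shows why comparing against "C" alone misses systems that report "POSIX".

```python
import locale


def is_legacy_locale_name(name):
    """Hypothetical check: treat both spellings of the POSIX locale alike."""
    return name in ("C", "POSIX")


# Passing None queries the current LC_CTYPE setting without changing it.
current = locale.setlocale(locale.LC_CTYPE, None)

print(current, is_legacy_locale_name(current))
```

As noted elsewhere in this thread, POSIX does not pin down the string that setlocale() returns, so a name-based check like this is a heuristic at best.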

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Nick Coghlan
On 7 December 2017 at 01:59, Jakub Wilk  wrote:
> * Nick Coghlan , 2017-12-06, 16:15:
>> The one that's relevant to default locale detection is just the string
>> that "setlocale(LC_CTYPE, NULL)" returns.
>
> POSIX doesn't require any particular return value for setlocale() calls.
> It's only guaranteed that the returned string can be used in subsequent
> setlocale() calls to restore the original locale.
>
> So in the POSIX locale, a compliant setlocale() implementation could return
> "C", or "POSIX", or even something entirely different.

Thanks. I'd been wondering if we should also handle the "POSIX" case
in the legacy locale detection logic, and you've convinced me that we
should. Issue filed for that here: https://bugs.python.org/issue32238

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Antoine Pitrou
On Thu, 7 Dec 2017 00:22:52 +0100
Victor Stinner  wrote:
> 2017-12-06 23:36 GMT+01:00 Antoine Pitrou :
> > Other than that, +1 on the PEP.  
> 
> Naoki doesn't seem to be comfortable with the usage of the
> surrogateescape error handler by default for open(). Are you ok with
> that? If yes, would you mind explaining why? :-)

Sorry, I had missed that objection.  I agree with Inada Naoki: it's
better to keep it strict.

Regards

Antoine.


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Victor Stinner
2017-12-06 23:36 GMT+01:00 Antoine Pitrou :
> Other than that, +1 on the PEP.

Naoki doesn't seem to be comfortable with the usage of the
surrogateescape error handler by default for open(). Are you ok with
that? If yes, would you mind explaining why? :-)

Victor


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Antoine Pitrou
On Wed, 6 Dec 2017 23:20:41 +0100
Victor Stinner  wrote:
> 2017-12-06 23:07 GMT+01:00 Antoine Pitrou :
> > One question: how do you plan to test for the POSIX locale?  
> 
> I'm not sure. I will probably rely on Nick for that ;-) Nick already
> implemented this exact check for his PEP 538 which is already
> implemented in Python 3.7.

Other than that, +1 on the PEP.

Regards

Antoine.


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Victor Stinner
2017-12-06 23:07 GMT+01:00 Antoine Pitrou :
> One question: how do you plan to test for the POSIX locale?

I'm not sure. I will probably rely on Nick for that ;-) Nick already
implemented this exact check for his PEP 538 which is already
implemented in Python 3.7.

I already implemented the PEP 540:

   https://bugs.python.org/issue29240
   https://github.com/python/cpython/pull/855

Right now, my implementation uses:

   char *ctype = _PyMem_RawStrdup(setlocale(LC_CTYPE, ""));
   ...
   if (strcmp(ctype, "C") == 0) ...

Victor


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Antoine Pitrou
On Wed, 6 Dec 2017 01:49:41 +0100
Victor Stinner  wrote:
> Hi,
> 
> I knew that I had to rewrite my PEP 540, but I was too lazy. Since
> Guido explicitly requested a shorter PEP, here you have!
> 
> https://www.python.org/dev/peps/pep-0540/
> 
> Trust me, it's the same PEP, but focused on the most important
> information and with a shorter rationale ;-)

Congrats on the rewriting!  The shortening is appreciated :-)

One question: how do you plan to test for the POSIX locale?  Apparently
you need to check at least for the "C" and "POSIX" strings, but perhaps
other aliases as well?

Regards

Antoine.




Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Greg Ewing

Victor Stinner wrote:

> Maybe the "UTF-8 Mode" should be renamed to "UTF-8 with
> surrogateescape, or backslashreplace for stderr, or surrogatepass for
> fsencode/fsdecode on Windows, or strict for Strict UTF-8 Mode"... But
> the PEP title would be too long, no? :-)


Relaxed UTF-8 Mode?

UTF8-Yeah-I'm-Fine-With-That mode?

--
Greg


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Brett Cannon
On Wed, 6 Dec 2017 at 06:10 INADA Naoki  wrote:

> >> And I have one worrying point.
> >> With UTF-8 mode, open()'s default encoding/error handler is
> >> UTF-8/surrogateescape.
> >
> > The Strict UTF-8 Mode is for you if you prioritize correctness over
> usability.
>
> Yes, but as I said, I care about inexperienced developers
> who don't know what UTF-8 mode is.
>
> >
> > In the very first version of my PEP/idea, I wanted to use
> > UTF-8/strict. But then I started to play with the implementation and I
> > got many "practical" issues. Using UTF-8/strict, you quickly get
> > encoding errors. For example, you become unable to read undecodable
> > bytes from stdin. stdin.read() only gives you an error, without
> > letting you decide how to handle these "invalid" data. Same issue with
> > stdout.
> >
>
> I don't care about stdio, because PEP 538 uses surrogateescape for
> stdio/error
>
> https://www.python.org/dev/peps/pep-0538/#changes-to-the-default-error-handling-on-the-standard-streams
>
> I care only about builtin open()'s behavior.
> PEP 538 doesn't change default error handler of open().
>
> I think PEP 538 and PEP 540 should behave almost identical except
> changing locale
> or not.  So I need very strong reason if PEP 540 changes default error
> handler of open().
>

I don't have enough locale experience to weigh in as an expert, but I
already was leaning towards INADA-san's logic of not wanting to change
open() and this makes me really not want to change it.

-Brett


>
>
> > In the old long version of the PEP, I tried to explain UTF-8/strict
> > issues with very concrete examples, the removed "Use Cases" section:
> >
> https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L490
> >
> > Tell me if I should rephrase the rationale of the PEP 540 to better
> > justify the usage of surrogateescape.
>
> OK, "List a directory into a text file" example demonstrates why
> surrogateescape is used for open().  If os.listdir() returns
> surrogateescaped data, file.write() will fail.
> All other examples are about stdio.
>
> But we should achieve good balance between correctness and usability of
> default behavior.
>
> >
> > Maybe the "UTF-8 Mode" should be renamed to "UTF-8 with
> > surrogateescape, or backslashreplace for stderr, or surrogatepass for
> > fsencode/fsdecode on Windows, or strict for Strict UTF-8 Mode"... But
> > the PEP title would be too long, no? :-)
> >
>
> I feel short name is enough.
>
> >
> >> And opening binary file without "b" option is very common mistake of new
> >> developers.  If default error handler is surrogateescape, they lose a
> chance
> >> to notice their bug.
> >
> > When open() in used in text mode to read "binary data", usually the
> > developer would only notify when getting the POSIX locale (ASCII
> > encoding). But the PEP 538 already changed that by using the C.UTF-8
> > locale (and so the UTF-8 encoding, instead of the ASCII encoding).
> >
>
> With PEP 538 (C.UTF-8 locale), open() uses UTF-8/strict, not
> UTF-8/surrogateescape.
>
> For example, this code raise UnicodeDecodeError with PEP 538 if the
> file is JPEG file.
>
> with open(fn) as f:
> f.read()
>
>
> > I'm not sure that locales are the best way to detect such class of
> > bytes. I suggest to use -b or -bb option to detect such bugs without
> > having to care of the locale.
> >
>
> But many new developers doesn't use/know -b or -bb option.
>
> >
> >> On the other hand, it helps some use cases when user want
> byte-transparent
> >> behavior, without modifying code to use "surrogateescape" explicitly.
> >>
> >> Which is more important scenario?  Anyone has opinion about it?
> >> Are there any rationals and use cases I missing?
> >
> > Usually users expect that Python 3 "just works" and don't bother them
> > with the locale (that nobody understands).
> >
> > The old version of the PEP contains a long list of issues:
> >
> https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L924-L986
> >
> > I already replaced the strict error handler with surrogateescape for
> > sys.stdin and sys.stdout on the POSIX locale in Python 3.5:
> > https://bugs.python.org/issue19977
> >
> > For the rationale, read for example these comments:
> >
> [snip]
>
> OK, I'll read them and think again about open()'s default behavior.
> But I still hope open()'s behavior is consistent with PEP 538 and PEP 540.
>
> Regards,


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Jakub Wilk

* Nick Coghlan , 2017-12-06, 16:15:
>>> Something I've just noticed that needs to be clarified: on Linux, "C"
>>> locale and "POSIX" locale are aliases, but this isn't true in general
>>> (e.g. it's not the case on *BSD systems, including Mac OS X).
>>
>> For those of us with little to no BSD/MacOS experience, can you give a
>> quick run-down of the differences between "C" and "POSIX"?

POSIX says that "C" and "POSIX" are equivalent[0].

> The one that's relevant to default locale detection is just the string
> that "setlocale(LC_CTYPE, NULL)" returns.


POSIX doesn't require any particular return value for setlocale() calls. 
It's only guaranteed that the returned string can be used in subsequent 
setlocale() calls to restore the original locale.


So in the POSIX locale, a compliant setlocale() implementation could 
return "C", or "POSIX", or even something entirely different.



Beyond that, I don't know what the actual functional differences are.


I don't believe there are any.


[0] http://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html

--
Jakub Wilk


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread INADA Naoki
>> And I have one worrying point.
>> With UTF-8 mode, open()'s default encoding/error handler is
>> UTF-8/surrogateescape.
>
> The Strict UTF-8 Mode is for you if you prioritize correctness over usability.

Yes, but as I said, I care about inexperienced developers
who don't know what UTF-8 mode is.

>
> In the very first version of my PEP/idea, I wanted to use
> UTF-8/strict. But then I started to play with the implementation and I
> got many "practical" issues. Using UTF-8/strict, you quickly get
> encoding errors. For example, you become unable to read undecodable
> bytes from stdin. stdin.read() only gives you an error, without
> letting you decide how to handle these "invalid" data. Same issue with
> stdout.
>

I don't care about stdio, because PEP 538 uses surrogateescape for stdio/error
https://www.python.org/dev/peps/pep-0538/#changes-to-the-default-error-handling-on-the-standard-streams

I care only about the built-in open()'s behavior.
PEP 538 doesn't change the default error handler of open().

I think PEP 538 and PEP 540 should behave almost identically, apart from
whether the locale is changed.  So I need a very strong reason if PEP 540
changes the default error handler of open().


> In the old long version of the PEP, I tried to explain UTF-8/strict
> issues with very concrete examples, the removed "Use Cases" section:
> https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L490
>
> Tell me if I should rephrase the rationale of the PEP 540 to better
> justify the usage of surrogateescape.

OK, the "List a directory into a text file" example demonstrates why
surrogateescape is used for open().  If os.listdir() returns
surrogateescaped data, file.write() will fail.
All other examples are about stdio.
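
The failure described above can be reproduced without touching the
filesystem. This is only a minimal sketch: the file name is made up, and an
in-memory stream stands in for the text file that a real script would get
from open().

```python
import io

# Hypothetical file name containing the Latin-1 byte 0xE9, which is not valid UTF-8.
raw_name = b"caf\xe9.txt"

# os.listdir() returns str and decodes undecodable bytes with surrogateescape;
# this line reproduces only that decoding step.
name = raw_name.decode("utf-8", "surrogateescape")


def write_listing(errors):
    """Write `name` to an in-memory text stream and return the bytes produced."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding="utf-8", errors=errors, newline="\n")
    stream.write(name + "\n")
    stream.flush()
    return buffer.getvalue()


# UTF-8/strict refuses to encode the lone surrogate.
try:
    write_listing("strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# UTF-8/surrogateescape writes the original byte back unchanged.
passthrough = write_listing("surrogateescape")
```

With the strict handler the listing cannot be written at all, while
surrogateescape reproduces the original bytes of the file name.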

But we should achieve a good balance between correctness and usability in
the default behavior.

>
> Maybe the "UTF-8 Mode" should be renamed to "UTF-8 with
> surrogateescape, or backslashreplace for stderr, or surrogatepass for
> fsencode/fsdecode on Windows, or strict for Strict UTF-8 Mode"... But
> the PEP title would be too long, no? :-)
>

I feel the short name is enough.

>
>> And opening binary file without "b" option is very common mistake of new
>> developers.  If default error handler is surrogateescape, they lose a chance
>> to notice their bug.
>
> When open() is used in text mode to read "binary data", usually the
> developer would only notice when getting the POSIX locale (ASCII
> encoding). But PEP 538 already changed that by using the C.UTF-8
> locale (and so the UTF-8 encoding, instead of the ASCII encoding).
>

With PEP 538 (C.UTF-8 locale), open() uses UTF-8/strict, not
UTF-8/surrogateescape.

For example, this code raises UnicodeDecodeError with PEP 538 if the
file is a JPEG file.

    with open(fn) as f:
        f.read()
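
A self-contained version of this comparison might look like the sketch
below. The temporary file and the few JPEG header bytes are stand-ins chosen
for illustration, and the encoding and error handler are passed explicitly so
that the result does not depend on the locale of the machine running it.

```python
import os
import tempfile

# First bytes of a JPEG file (SOI marker plus JFIF header); 0xFF is never valid UTF-8.
jpeg_header = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"

with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
    tmp.write(jpeg_header)
    fn = tmp.name

try:
    # Text mode with UTF-8/strict: the mistake is reported on the first read.
    try:
        with open(fn, encoding="utf-8", errors="strict") as f:
            f.read()
        strict_raised = False
    except UnicodeDecodeError:
        strict_raised = True

    # Text mode with UTF-8/surrogateescape: no error, so the bug goes unnoticed.
    with open(fn, encoding="utf-8", errors="surrogateescape") as f:
        text = f.read()
finally:
    os.remove(fn)
```

The second read succeeds and returns a str full of lone surrogates, which is
the "lost chance to notice the bug" that this message is concerned about.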


> I'm not sure that locales are the best way to detect this class of
> bugs. I suggest using the -b or -bb option to detect such bugs without
> having to care about the locale.
>

But many new developers don't use or know the -b or -bb option.

>
>> On the other hand, it helps some use cases when user want byte-transparent
>> behavior, without modifying code to use "surrogateescape" explicitly.
>>
>> Which is more important scenario?  Anyone has opinion about it?
>> Are there any rationals and use cases I missing?
>
> Usually users expect that Python 3 "just works" and doesn't bother them
> with the locale (that nobody understands).
>
> The old version of the PEP contains a long list of issues:
> https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L924-L986
>
> I already replaced the strict error handler with surrogateescape for
> sys.stdin and sys.stdout on the POSIX locale in Python 3.5:
> https://bugs.python.org/issue19977
>
> For the rationale, read for example these comments:
>
[snip]

OK, I'll read them and think again about open()'s default behavior.
But I still hope open()'s behavior is consistent with PEP 538 and PEP 540.

Regards,


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Nick Coghlan
On 6 December 2017 at 20:38, Victor Stinner  wrote:
> Nick:
>> So if PEP 540 is going to implicitly trigger switching encodings, it
>> needs to specify whether it's going to look for the C locale or the
>> POSIX locale (I'd suggest C locale, since that's the actual default
>> that causes problems).
>
> I'm thinking of the test already used by check_force_ascii() (function
> checking if the LC_CTYPE uses the ASCII encoding or something else):
>
>     loc = setlocale(LC_CTYPE, NULL);
>     if (loc == NULL)
>         goto error;
>     if (strcmp(loc, "C") != 0) {
>         /* the LC_CTYPE locale is different than C */
>         return 0;
>     }

Yeah, the locale coercion code changes the locale multiple times to
make sure we have a coercion target that will actually work (and then
checks nl_langinfo as well, since that sometimes breaks on BSD
systems, even if the original setlocale() call claimed to work). Once
we've found a locale that appears to work though, then we configure
the LC_CTYPE environment variable, and reload the locale from the
environment.

It's all annoyingly convoluted and arcane, but it works well enough
for 
https://github.com/python/cpython/blob/master/Lib/test/test_c_locale_coercion.py
to pass across the full BuildBot fleet :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Victor Stinner
Nick:
> So if PEP 540 is going to implicitly trigger switching encodings, it
> needs to specify whether it's going to look for the C locale or the
> POSIX locale (I'd suggest C locale, since that's the actual default
> that causes problems).

I'm thinking of the test already used by check_force_ascii() (function
checking if the LC_CTYPE uses the ASCII encoding or something else):

    loc = setlocale(LC_CTYPE, NULL);
    if (loc == NULL)
        goto error;
    if (strcmp(loc, "C") != 0) {
        /* the LC_CTYPE locale is different than C */
        return 0;
    }

Victor


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-06 Thread Victor Stinner
Hi Naoki,

2017-12-06 5:07 GMT+01:00 INADA Naoki :
> Oh, revised version is really short!
>
> And I have one worrying point.
> With UTF-8 mode, open()'s default encoding/error handler is
> UTF-8/surrogateescape.

The Strict UTF-8 Mode is for you if you prioritize correctness over usability.

In the very first version of my PEP/idea, I wanted to use
UTF-8/strict. But then I started to play with the implementation and I
got many "practical" issues. Using UTF-8/strict, you quickly get
encoding errors. For example, you become unable to read undecodable
bytes from stdin. stdin.read() only gives you an error, without
letting you decide how to handle these "invalid" data. Same issue with
stdout.

Compare encodings of the UTF-8 mode and the Strict UTF-8 Mode:
https://www.python.org/dev/peps/pep-0540/#encoding-and-error-handler

I tried to summarize all these kinds of issues in the second short
subsection of the rationale:
https://www.python.org/dev/peps/pep-0540/#passthough-undecodable-bytes-surrogateescape

In the old long version of the PEP, I tried to explain UTF-8/strict
issues with very concrete examples, the removed "Use Cases" section:
https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L490

Tell me if I should rephrase the rationale of the PEP 540 to better
justify the usage of surrogateescape.

Maybe the "UTF-8 Mode" should be renamed to "UTF-8 with
surrogateescape, or backslashreplace for stderr, or surrogatepass for
fsencode/fsdecode on Windows, or strict for Strict UTF-8 Mode"... But
the PEP title would be too long, no? :-)


> And opening binary file without "b" option is very common mistake of new
> developers.  If default error handler is surrogateescape, they lose a chance
> to notice their bug.

When open() is used in text mode to read "binary data", usually the
developer would only notice when getting the POSIX locale (ASCII
encoding). But PEP 538 already changed that by using the C.UTF-8
locale (and so the UTF-8 encoding, instead of the ASCII encoding).

I'm not sure that locales are the best way to detect this class of
bugs. I suggest using the -b or -bb option to detect such bugs without
having to care about the locale.


> On the other hand, it helps some use cases when user want byte-transparent
> behavior, without modifying code to use "surrogateescape" explicitly.
>
> Which is more important scenario?  Anyone has opinion about it?
> Are there any rationals and use cases I missing?

Usually users expect that Python 3 "just works" and doesn't bother them
with the locale (that nobody understands).

The old version of the PEP contains a long list of issues:
https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L924-L986

I already replaced the strict error handler with surrogateescape for
sys.stdin and sys.stdout on the POSIX locale in Python 3.5:
https://bugs.python.org/issue19977

For the rationale, read for example these comments:

* https://bugs.python.org/issue19846#msg205727 "As I would state it,
the problem is that python's boundary with the OS is not yet uniform.
(...) Note that currently, input() and sys.stdin.read() won't read
undecodable data so this is somewhat symmetrical but it seems to me
that saying "everything that interfaces with the OS except the
standard streams will use surrogateescape on undecodable bytes" is
drawing a line in an unintuitive location."

* https://bugs.python.org/issue19977#msg206141 "My impression was that
python3 was supposed to help get rid of UnicodeError tracebacks, not
mojibake.  If mojibake was the problem then we should never have gone
down the surrogateescape path for input."

* https://bugs.python.org/issue19846#msg205646 "For example I'm using
[LANG=C] for testcases to set the language uncomplicated to english."

In bug reports, to get the user expectations, just ignore all core
developers' comments :-)

Users set the locale to C to get messages in English and still expect
"Unicode" to work properly.

Only Python 3 is so strict about encodings. Most other programming
languages, like Python 2, "just work", since they process data as
bytes.

Victor


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread Nick Coghlan
On 6 December 2017 at 16:18, Glenn Linderman  wrote:
> "b" mostly matters on Windows, correct? And Windows doesn't use C or POSIX
> locale, correct? And if these are correct, then is this an issue? And if so,
> why?

In Python 3, "b" matters everywhere, since it controls whether the
stream gets wrapped in TextIOWrapper or not.

It's only in Python 2 that the distinction is Windows-specific (where
it controls how "\r\n" sequences get handled).
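
A quick way to see the Python 3 distinction on any platform is to open the
same file both ways and look at the object returned. This is a small sketch;
the temporary file and its contents are placeholders.

```python
import io
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\n")
    path = tmp.name

try:
    # Binary mode: a buffered byte stream, no decoding layer.
    with open(path, "rb") as binary_file:
        binary_is_wrapped = isinstance(binary_file, io.TextIOWrapper)
        binary_data = binary_file.read()

    # Text mode: the byte stream is wrapped in a TextIOWrapper that decodes.
    with open(path, "r", encoding="utf-8") as text_file:
        text_is_wrapped = isinstance(text_file, io.TextIOWrapper)
        text_data = text_file.read()
finally:
    os.remove(path)
```

Only the text-mode stream is a TextIOWrapper, so only it applies an encoding
and an error handler; the binary stream returns bytes untouched.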

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread Glenn Linderman

On 12/5/2017 8:07 PM, INADA Naoki wrote:

> Oh, the revised version is really short!
>
> And I have one worrying point.
> With UTF-8 mode, open()'s default encoding/error handler is
> UTF-8/surrogateescape.
>
> Containers are really growing.  PyCharm supports Docker and many new Python
> developers use Docker instead of installing Python directly on their system,
> especially on Windows.
>
> And opening a binary file without the "b" option is a very common mistake
> of new developers.  If the default error handler is surrogateescape, they
> lose a chance to notice their bug.


"b" mostly matters on Windows, correct? And Windows doesn't use C or 
POSIX locale, correct? And if these are correct, then is this an issue? 
And if so, why?



> On the other hand, it helps some use cases where the user wants
> byte-transparent behavior, without modifying code to use
> "surrogateescape" explicitly.
>
> Which is the more important scenario?  Does anyone have an opinion about it?
> Are there any rationales and use cases I'm missing?
>
> Regards,
>
> INADA Naoki
>
>
> On Wed, Dec 6, 2017 at 12:17 PM, INADA Naoki  wrote:
>> I'm sorry about my laziness.
>> I've been very busy these past months, but I'm back in the OSS world as
>> of today.
>>
>> While I should review it carefully again, I think I'm close to accepting
>> PEP 540.
>>
>> * PEP 540 really helps containers and old Linux machines where PEP 538
>>   doesn't work.  And containers are really important these days.  Many
>>   new Pythonistas who are not Linux experts are starting to use containers.
>>
>> * In recent years, UTF-8 has fixed many mojibake problems.  Now
>>   UnicodeError is the bigger usability problem for many Python users.
>>   So I agree that an opt-out UTF-8 mode is better than an opt-in one on
>>   the POSIX locale.
>>
>> I don't have enough time to read all the mails in the ML archive.
>> So if someone has an opposing opinion, please remind me by this weekend.
>>
>> Regards,



Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread Nick Coghlan
On 6 December 2017 at 15:59, Chris Angelico  wrote:
> On Wed, Dec 6, 2017 at 4:46 PM, Nick Coghlan  wrote:
>> Something I've just noticed that needs to be clarified: on Linux, "C"
>> locale and "POSIX" locale are aliases, but this isn't true in general
>> (e.g. it's not the case on *BSD systems, including Mac OS X).
>
> For those of us with little to no BSD/MacOS experience, can you give a
> quick run-down of the differences between "C" and "POSIX"?

The one that's relevant to default locale detection is just the string
that "setlocale(LC_CTYPE, NULL)" returns.

On Linux (or, more accurately, with glibc), after setting
"LC_CTYPE=POSIX", that call still returns "C" (since the "POSIX"
locale is defined as an alias for the "C" locale).

By contrast, on *BSD, it will return "POSIX" (since "POSIX" is
actually a distinct locale there).

Beyond that, I don't know what the actual functional differences are.
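
The same query can be made from Python through the locale module. This is a
rough sketch; `looks_like_default_locale` is a hypothetical helper invented
here, and the string returned for the "C" locale is platform-dependent, so no
particular value is promised.

```python
import locale


def looks_like_default_locale(name):
    """Hypothetical helper: treat both spellings as the default locale."""
    return name in ("C", "POSIX")


# Passing None as the second argument only queries the current setting,
# like setlocale(LC_CTYPE, NULL) in C.
initial = locale.setlocale(locale.LC_CTYPE, None)

# Switch to the "C" locale and ask again what the C library reports.
locale.setlocale(locale.LC_CTYPE, "C")
after_c = locale.setlocale(locale.LC_CTYPE, None)

# The string returned by a query can be passed back to restore the setting.
locale.setlocale(locale.LC_CTYPE, initial)
restored = locale.setlocale(locale.LC_CTYPE, None)
```

Checking for both spellings, as the helper does, is one way to avoid depending
on which of the two names a given C library reports.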

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread Chris Angelico
On Wed, Dec 6, 2017 at 4:46 PM, Nick Coghlan  wrote:
> Something I've just noticed that needs to be clarified: on Linux, "C"
> locale and "POSIX" locale are aliases, but this isn't true in general
> (e.g. it's not the case on *BSD systems, including Mac OS X).

For those of us with little to no BSD/MacOS experience, can you give a
quick run-down of the differences between "C" and "POSIX"?

ChrisA


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread Nick Coghlan
Something I've just noticed that needs to be clarified: on Linux, "C"
locale and "POSIX" locale are aliases, but this isn't true in general
(e.g. it's not the case on *BSD systems, including Mac OS X).

To handle that in PEP 538, I made it clear that everything is keyed
specifically off the "C" locale, since that's what you actually get by
default.

So if PEP 540 is going to implicitly trigger switching encodings, it
needs to specify whether it's going to look for the C locale or the
POSIX locale (I'd suggest C locale, since that's the actual default
that causes problems).

The precedence relationship with locale coercion also needs to be
spelled out: successful locale coercion should skip implicitly
enabling UTF-8 mode (for opt-in UTF-8 mode, we'd still try to coerce
the locale setting as appropriate, so extensions modules are more
likely to behave themselves).

On 6 December 2017 at 14:07, INADA Naoki  wrote:
> Oh, revised version is really short!
>
> And I have one worrying point.
> With UTF-8 mode, open()'s default encoding/error handler is
> UTF-8/surrogateescape.
>
> Containers are really growing.  PyCharm supports Docker and many new Python
> developers use Docker instead of installing Python directly on their system,
> especially on Windows.
>
> And opening binary file without "b" option is very common mistake of new
> developers.  If default error handler is surrogateescape, they lose a chance
> to notice their bug.
>
> On the other hand, it helps some use cases when user want byte-transparent
> behavior, without modifying code to use "surrogateescape" explicitly.
>
> Which is more important scenario?  Anyone has opinion about it?
> Are there any rationals and use cases I missing?

For platforms that offer a C.UTF-8 locale, I'd like "LC_CTYPE=C.UTF-8
python" and "PYTHONCOERCECLOCALE=0 LC_CTYPE=C PYTHONUTF8=1 python" to be
equivalent (aside from the known limitation that extension modules may
not do the right thing in the latter case).

For the locale coercion case, the default error handler for `open`
remains as "strict", which means I'd be in favour of keeping it as
"strict" by default in UTF-8 mode as well. That would flip the toggle
in the PEP: "strict UTF-8" would be the default selection for
"PYTHONUTF8=1", and you'd choose the more relaxed option via
"PYTHONUTF8=permissive".

That way, the combination of PEPs 538 and 540 would give us the
following situation in the C locale:

1. Our preferred approach is to coerce LC_CTYPE in the C locale to a
UTF-8 based equivalent
2. Only if that fails (e.g. as it will on CentOS 7) do we resort to
implicitly enabling CPython's internal UTF-8 mode (which should behave
like C.UTF-8, *except* for the fact extension modules won't respect
it)

That way, the ideal outcome is that a UTF-8 based locale exists, and
we use it automatically when needed. UTF-8 mode then lets us cope with
older platforms where neither C.UTF-8 nor an equivalent exists.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread INADA Naoki
Oh, the revised version is really short!

And I have one worrying point.
With UTF-8 mode, open()'s default encoding/error handler is
UTF-8/surrogateescape.

Containers are really growing.  PyCharm supports Docker and many new Python
developers use Docker instead of installing Python directly on their system,
especially on Windows.

And opening a binary file without the "b" option is a very common mistake
of new developers.  If the default error handler is surrogateescape, they
lose a chance to notice their bug.

On the other hand, it helps some use cases where the user wants
byte-transparent behavior, without modifying code to use
"surrogateescape" explicitly.

Which is the more important scenario?  Does anyone have an opinion about it?
Are there any rationales and use cases I'm missing?

Regards,

INADA Naoki  


On Wed, Dec 6, 2017 at 12:17 PM, INADA Naoki  wrote:
> I'm sorry about my laziness.
> I've been very busy these past months, but I'm back in the OSS world as
> of today.
>
> While I should review it carefully again, I think I'm close to accepting
> PEP 540.
>
> * PEP 540 really helps containers and old Linux machines where PEP 538
>   doesn't work.  And containers are really important these days.  Many
>   new Pythonistas who are not Linux experts are starting to use containers.
>
> * In recent years, UTF-8 has fixed many mojibake problems.  Now
>   UnicodeError is the bigger usability problem for many Python users.
>   So I agree that an opt-out UTF-8 mode is better than an opt-in one on
>   the POSIX locale.
>
> I don't have enough time to read all the mails in the ML archive.
> So if someone has an opposing opinion, please remind me by this weekend.
>
> Regards,


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread INADA Naoki
I'm sorry about my laziness.
I've been very busy these past months, but I'm back in the OSS world as
of today.

While I should review it carefully again, I think I'm close to accepting
PEP 540.

* PEP 540 really helps containers and old Linux machines where PEP 538
  doesn't work.  And containers are really important these days.  Many
  new Pythonistas who are not Linux experts are starting to use containers.

* In recent years, UTF-8 has fixed many mojibake problems.  Now
  UnicodeError is the bigger usability problem for many Python users.
  So I agree that an opt-out UTF-8 mode is better than an opt-in one on
  the POSIX locale.

I don't have enough time to read all the mails in the ML archive.
So if someone has an opposing opinion, please remind me by this weekend.

Regards,


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread Nick Coghlan
On 6 December 2017 at 11:01, Victor Stinner  wrote:
>> Annex: Differences between the PEP 538 and the PEP 540
>> ======================================================
>>
>> The PEP 538 uses the "C.UTF-8" locale which is quite new and only
>> supported by a few Linux distributions; this locale is not currently
>> supported by FreeBSD or macOS for example. This PEP 540 supports all
>> operating systems.
>>
>> The PEP 538 only changes the behaviour for the POSIX locale. While the
>> new UTF-8 mode of this PEP is only enabled by the POSIX locale, it can
>> be enabled manually for any other locale.
>>
>> The PEP 538 is implemented with ``setlocale(LC_CTYPE, "C.UTF-8")``: any
>> non-Python code running in the process is impacted by this change.  This
>> PEP is implemented in Python internals and ignores the locale:
>> non-Python code running in the same process is not aware of the
>> "Python UTF-8 mode".

I submitted a PR to reword this part: https://github.com/python/peps/pull/493

> The main advantage of PEP 538 *over* PEP 540 is that, for the
> POSIX locale, non-Python code running in the same process gets the
> UTF-8 encoding.
>
> To be honest, I'm not sure that there is a lot of code in the wild
> which uses "text" types like the C type wchar_t* and relies on the
> locale encoding. Almost all C libraries handle data as bytes using the
> char* type, like filenames and environment variables.

At the very least, GNU readline breaks if you don't change the locale
setting: 
https://www.python.org/dev/peps/pep-0538/#considering-locale-coercion-independently-of-utf-8-mode

Given that we found an example of this directly in the standard
library, I assume that there are plenty more in third party extension
modules (especially once we take C++ extensions into account, not just
C ones).

> First I understood that the PEP 538 changed the locale encoding using
> an environment variable. But no, it's implemented with
> setlocale(LC_CTYPE, "C.UTF-8") which only impacts the current process
> and is not inherited by child processes. So I'm not sure anymore that
> PEP 538 and PEP 540 are really complementary.

It sets the LC_CTYPE environment variable as well:
https://www.python.org/dev/peps/pep-0538/#explicitly-setting-lc-ctype-for-utf-8-locale-coercion

The relevant code is in _coerce_default_locale_settings (currently at
https://github.com/python/cpython/blob/master/Python/pylifecycle.c#L448)

> I'm not sure how PyGTK interacts with the PEP 538 for example. Does it
> use UTF-8 with the POSIX locale?

Desktop environments aim not to get into this situation in the first
place by ensuring they're using a more appropriate locale :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread Victor Stinner
> Annex: Differences between the PEP 538 and the PEP 540
> ======================================================
>
> The PEP 538 uses the "C.UTF-8" locale which is quite new and only
> supported by a few Linux distributions; this locale is not currently
> supported by FreeBSD or macOS for example. This PEP 540 supports all
> operating systems.
>
> The PEP 538 only changes the behaviour for the POSIX locale. While the
> new UTF-8 mode of this PEP is only enabled by the POSIX locale, it can
> be enabled manually for any other locale.
>
> The PEP 538 is implemented with ``setlocale(LC_CTYPE, "C.UTF-8")``: any
> non-Python code running in the process is impacted by this change.  This
> PEP is implemented in Python internals and ignores the locale:
> non-Python code running in the same process is not aware of the
> "Python UTF-8 mode".

The main advantage of PEP 538 *over* PEP 540 is that, for the
POSIX locale, non-Python code running in the same process gets the
UTF-8 encoding.

To be honest, I'm not sure that there is a lot of code in the wild
which uses "text" types like the C type wchar_t* and relies on the
locale encoding. Almost all C libraries handle data as bytes using the
char* type, like filenames and environment variables.

First I understood that the PEP 538 changed the locale encoding using
an environment variable. But no, it's implemented with
setlocale(LC_CTYPE, "C.UTF-8") which only impacts the current process
and is not inherited by child processes. So I'm not sure anymore that
PEP 538 and PEP 540 are really complementary.

I'm not sure how PyGTK interacts with the PEP 538 for example. Does it
use UTF-8 with the POSIX locale?

Victor


[Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

2017-12-05 Thread Victor Stinner
Hi,

I knew that I had to rewrite my PEP 540, but I was too lazy. Since
Guido explicitly requested a shorter PEP, here you have!

https://www.python.org/dev/peps/pep-0540/

Trust me, it's the same PEP, but focused on the most important
information and with a shorter rationale ;-)

Full text below.

Victor


PEP: 540
Title: Add a new UTF-8 mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner 
BDFL-Delegate: INADA Naoki
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 5-January-2016
Python-Version: 3.7


Abstract
========

Add a new UTF-8 mode to ignore the locale and use the UTF-8 encoding
with the ``surrogateescape`` error handler. This mode is enabled by
default in the POSIX locale, but otherwise disabled by default.

Add also a "strict" UTF-8 mode which uses the ``strict`` error handler,
instead of ``surrogateescape``, with the UTF-8 encoding.

The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
variable are added to control the UTF-8 mode.


Rationale
=========

Locale encoding and UTF-8
-------------------------

Python 3.6 uses the locale encoding for filenames, environment
variables, standard streams, etc. The locale encoding is inherited from
the locale; the encoding and the locale are tightly coupled.

Many users inherit the ASCII encoding from the POSIX locale, aka the "C"
locale, but are unable to change the locale for various reasons. This
encoding is very limited in terms of Unicode support: any non-ASCII
character is likely to cause trouble. For example, the Alpine Linux
distribution became popular thanks to Docker containers, but it uses the
POSIX locale by default.

It is not easy to get the expected locale. Locales don't get the exact
same name on all Linux distributions, FreeBSD, macOS, etc. Some
locales, like the recent ``C.UTF-8`` locale, are only supported by a few
platforms. For example, an SSH connection can use a different encoding
than the filesystem or terminal encoding of the local host.

On the other hand, Python 3.6 already uses UTF-8 by default on
macOS, Android and Windows (PEP 529) for most functions, except for
``open()``. UTF-8 is also the default encoding of Python scripts, XML
and JSON file formats. The Go programming language uses UTF-8 for
strings.

When all data are stored as UTF-8 but the locale is often misconfigured,
an obvious solution is to ignore the locale and use UTF-8.

Passthrough undecodable bytes: surrogateescape
----------------------------------------------

Using UTF-8 is nice, until you read the first file that uses a
different encoding. When using the ``strict`` error handler, which is
the default, Python 3 raises a ``UnicodeDecodeError`` on the first
undecodable byte.

Unix command line tools like ``cat`` or ``grep`` and most Python 2
applications simply do not have this class of bugs: they don't decode
data, but process data as a raw bytes sequence.

Python 3 already has a solution to behave like Unix tools and Python 2:
the ``surrogateescape`` error handler (:pep:`383`). It allows data to be
processed "as bytes" while using Unicode in practice (undecodable bytes
are stored as surrogate characters).

For an application written as a Unix "pipe" tool like ``grep``, taking
input on stdin and writing output to stdout, ``surrogateescape`` allows
undecodable bytes to pass through unchanged.

The UTF-8 encoding used with the ``surrogateescape`` error handler is a
compromise between correctness and usability.
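
The round trip described in this section can be sketched in a few lines. The
sample bytes are an arbitrary illustration: they mix a valid UTF-8 sequence
with a stray Latin-1 byte, the kind of input a pipe tool may receive.

```python
# "é" encoded as UTF-8 (0xC3 0xA9) followed by a stray Latin-1 "é" (0xE9).
data = b"caf\xc3\xa9 / caf\xe9\n"

# Decoding never fails: the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
text = data.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", "surrogateescape")

# The strict handler, by contrast, rejects the lone surrogate on output.
try:
    text.encode("utf-8", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

The text can be processed as an ordinary str in between, and the bytes written
out are identical to the bytes read in.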

Strict UTF-8 for correctness
----------------------------

When correctness matters more than usability, the ``strict`` error
handler is preferred over ``surrogateescape`` to raise an encoding error
at the first undecodable byte or unencodable character.

No change by default for best backward compatibility
----------------------------------------------------

While UTF-8 is perfect in most cases, sometimes the locale encoding is
actually the best encoding.

This PEP changes the behaviour for the POSIX locale since this locale
usually gives the ASCII encoding, whereas UTF-8 is a much better choice.
It does not change the behaviour for other locales to prevent any risk
of regression.

As users are responsible for explicitly enabling the new UTF-8 mode, they
are responsible for any potential mojibake issues caused by this mode.


Proposal
========

Add a new UTF-8 mode to ignore the locale and use the UTF-8 encoding
with the ``surrogateescape`` error handler. This mode is enabled by
default in the POSIX locale, but otherwise disabled by default.

Add also a "strict" UTF-8 mode which uses the ``strict`` error handler,
instead of ``surrogateescape``, with the UTF-8 encoding.

The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
variable are added to control the UTF-8 mode:

* The UTF-8 mode is enabled by ``-X utf8`` or ``PYTHONUTF8=1``
* The Strict UTF-8 mode is configured by ``-X utf8=strict`` or
  ``PYTHONUTF8=strict``
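
One way to check which mode a child interpreter ends up in is sketched below.
It relies on ``sys.flags.utf8_mode``, which is not part of this draft text but
is available in CPython 3.7 and later; the helper name is made up for
illustration. The strict variant is deliberately not exercised, since this
sketch does not assume the running interpreter accepts it.

```python
import os
import subprocess
import sys

PROBE = "import sys; print(sys.flags.utf8_mode)"


def utf8_mode_flag(extra_args=(), extra_env=None):
    """Hypothetical helper: run a child interpreter and report sys.flags.utf8_mode."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a known state
    env.update(extra_env or {})
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", PROBE],
        env=env,
        stdout=subprocess.PIPE,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


via_option = utf8_mode_flag(extra_args=["-X", "utf8"])   # command line option
via_env = utf8_mode_flag(extra_env={"PYTHONUTF8": "1"})  # environment variable
disabled = utf8_mode_flag(extra_env={"PYTHONUTF8": "0"})  # explicit opt-out
```

Both ways of enabling the mode set the flag in the child process, and
``PYTHONUTF8=0`` turns it off.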

The POSIX locale enables the UTF-8 mode. In this case,