Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
Victor Stinner wrote:
> Users don't use stdin and stdout as regular files, they are more used
> as pipes to pass data between programs with the Unix pipe in a shell
> like "producer | consumer". Sometimes stdout is redirected to a file,
> but I consider that it is expected to behave as a pipe and the regular
> TTY stdout.

It seems weird to me to make a distinction between stdin/stdout connected to a file and accessing the file some other way. It would be surprising, for example, if the following two commands behaved differently with respect to encoding:

    cat foo | sort
    cat < foo | sort

> But Naoki explained that open() is commonly misused to open binary
> files and Python should somehow fail badly to notify the developer of
> their mistake.

Maybe if you *explicitly* open the file in text mode it should default to surrogateescape, but use strict if text mode is being used by default? I.e.

    open("foo", "rt") --> surrogateescape
    open("foo") --> strict

That way you can easily open a file in a way that's compatible with the way stdin/stdout behave, but you will get bitten if you mistakenly open a binary file as text.

--
Greg
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
I’m a bit confused: File names and the like are one thing, and the CONTENTS of files is quite another. I get that there is theoretically a “default” encoding for the contents of text files, but that is SO likely to be wrong as to be ignorable. open() already defaults to utf-8. Which is a fine default if you are going to have one, but it seems a bad idea to have it default to surrogateescape EVER, regardless of the locale or anything else. If the file is binary, or a different encoding, or simply broken, it’s much better to get an encoding error as soon as possible. Why does this have anything to do with the PEP? Perhaps the issue of reading a filename from the system, writing it to a file, then reading it back in again. I actually do that a lot — but mostly so I can pass that file to another system, so I really don’t want broken encoding in it anyway. -CHB Sent from my iPhone On Dec 7, 2017, at 5:53 PM, Glenn Lindermanwrote: On 12/7/2017 5:45 PM, Jonathan Goble wrote: On Thu, Dec 7, 2017 at 8:38 PM Glenn Linderman wrote: > If it were to be changed, one could add a text-mode option in 3.7, say "t" > in the mode string, and a PendingDeprecationWarning for open calls without > the specification of either t or b in the mode string. > "t" is already supported in open()'s mode argument [1] as a way to explicitly request text mode, though it's essentially ignored right now since text is the default anyway. So since the option is already present, the only thing needed at this stage for your plan would be to begin deprecating not using it. *goes back to lurking* [1] https://docs.python.org/3/library/functions.html#open Thanks for briefly de-lurking. So then for PEP 540... use surrogateescape immediately for t mode. Then, when the user encounters an encoding error, there would be three solutions: switch to t mode, explicitly switch to surrogateescape, or fix the locale. 
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
On 12/7/2017 5:45 PM, Jonathan Goble wrote: On Thu, Dec 7, 2017 at 8:38 PM Glenn Linderman> wrote: If it were to be changed, one could add a text-mode option in 3.7, say "t" in the mode string, and a PendingDeprecationWarning for open calls without the specification of either t or b in the mode string. "t" is already supported in open()'s mode argument [1] as a way to explicitly request text mode, though it's essentially ignored right now since text is the default anyway. So since the option is already present, the only thing needed at this stage for your plan would be to begin deprecating not using it. *goes back to lurking* [1] https://docs.python.org/3/library/functions.html#open Thanks for briefly de-lurking. So then for PEP 540... use surrogateescape immediately for t mode. Then, when the user encounters an encoding error, there would be three solutions: switch to t mode, explicitly switch to surrogateescape, or fix the locale. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
On Thu, Dec 7, 2017 at 8:38 PM Glenn Lindermanwrote: > If it were to be changed, one could add a text-mode option in 3.7, say "t" > in the mode string, and a PendingDeprecationWarning for open calls without > the specification of either t or b in the mode string. > "t" is already supported in open()'s mode argument [1] as a way to explicitly request text mode, though it's essentially ignored right now since text is the default anyway. So since the option is already present, the only thing needed at this stage for your plan would be to begin deprecating not using it. *goes back to lurking* [1] https://docs.python.org/3/library/functions.html#open ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
On 12/7/2017 4:48 PM, Victor Stinner wrote:
> Ok, now comes the real question, open().
>
> For open(), I used the example of a code snippet *writing* the content
> of a directory (os.listdir) into a text file. Another example is to
> read filenames from a text files but pass-through undecodable bytes
> thanks to surrogateescape.
>
> But Naoki explained that open() is commonly misused to open binary
> files and Python should somehow fail badly to notify the developer of
> their mistake.

So the real problem here is that open has a default mode of text. Instead of forcing the user to specify either "text" or "binary" when opening, text is used as a default, and binary is an option to be specified.

I understand that default has a long history in Unix-land, dating at least as far back as 1977 when I first learned how to use the Unix open() function. And now it would be an incompatible change to change it.

The real question is whether or not it is a good idea to change it... at this point in time, with Unicode and UTF-8 so prevalent, text and binary modes are far different than back in 1977, when they mostly just documented that this was a binary file that was being opened, and that one could more likely expect to see read() than fgets() in the following code.

If it were to be changed, one could add a text-mode option in 3.7, say "t" in the mode string, and a PendingDeprecationWarning for open calls without the specification of either t or b in the mode string.

In 3.8, the warning would be changed to DeprecationWarning.

In 3.9, all open calls would need to have either t or b, or would fail.

Meanwhile, back on the PEP 540 ranch, text mode open calls could immediately use surrogateescape, binary mode open calls would not, and unspecified open calls would not.
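A minimal sketch of the first stage of the deprecation plan described above, written as a wrapper rather than a change to the builtin. The names `mode_is_explicit` and `open_explicit` are hypothetical and exist only for this illustration; nothing like them was proposed in the thread or exists in the standard library.

```python
import os
import warnings


def mode_is_explicit(mode):
    """Return True if the mode string names text ('t') or binary ('b')."""
    return "t" in mode or "b" in mode


def open_explicit(file, mode="r", **kwargs):
    """Hypothetical wrapper: warn when neither 't' nor 'b' is given."""
    if not mode_is_explicit(mode):
        warnings.warn(
            "open() called without 't' or 'b' in the mode string",
            PendingDeprecationWarning,
            stacklevel=2,
        )
    return open(file, mode, **kwargs)


# Record warnings so the effect can be inspected instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with open_explicit(os.devnull, "r", encoding="utf-8"):
        pass  # unspecified mode: should be flagged
    with open_explicit(os.devnull, "rt", encoding="utf-8"):
        pass  # explicit text mode: should not be flagged

flagged = [w for w in caught if issubclass(w.category, PendingDeprecationWarning)]
print(len(flagged))
```

Only the call with the bare "r" mode is flagged. Moving to DeprecationWarning and then to a hard failure, as the plan suggests for later releases, would only change the branch inside `open_explicit`.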
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
2017-12-08 0:26 GMT+01:00 Guido van Rossum:
> You will quickly get decoding errors, and that is INADA's point. (Unless you
> use encoding='Latin-1'.) His worry is that the surrogateescape error handler
> makes it so that you won't get decoding errors, and then the failure mode is
> much harder to debug.

Hum, my question was more to know whether Python fails because an operation is given strings where bytes were expected, or whether Python fails with a decoding error... But now I'm not sure anymore that this level of detail really matters.

Let me think out loud. To explain Unicode issues, I like to use filenames, since it's something that users view commonly, handle directly and can modify (and so enter many non-ASCII characters like diacritics and emojis ;-)). Filenames can be found on the command line, in environment variables (PYTHONSTARTUP), stdin (read a list of files from stdin), stdout (write the list of files into stdout), but also in text files (the Mercurial "makefile problem").

I consider that the command line and environment variables should "just work" and so use surrogateescape. It would be too annoying to not even be able to *start* Python because of a Unicode error. For example, it wouldn't be easy to identify which environment variable causes the issue. Fortunately, the UTF-8 mode doesn't change anything here: surrogateescape has been used for the command line and environment variables since Python 3.3.

For stdin/stdout, I think that the main motivation here is to write Unix command line tools using Python 3: pass through undecodable bytes without bugging the user with Unicode. Users don't use stdin and stdout as regular files, they are more used as pipes to pass data between programs with the Unix pipe in a shell like "producer | consumer". Sometimes stdout is redirected to a file, but I consider that it is expected to behave as a pipe and the regular TTY stdout.
IMHO we are still in the safe surrogateescape area (for the specific case of the UTF-8 mode).

Ok, now comes the real question, open().

For open(), I used the example of a code snippet *writing* the content of a directory (os.listdir) into a text file. Another example is to read filenames from a text file but pass through undecodable bytes thanks to surrogateescape.

But Naoki explained that open() is commonly misused to open binary files and Python should somehow fail badly to notify the developer of their mistake.

If I have to choose between the two categories of usage of open(), "read undecodable bytes in UTF-8 from a text file" versus "misuse open() on a binary file", I expect that the latter is more common, and so open() shouldn't use surrogateescape by default.

While stdin and stdout are usually associated with Unix pipes and Unix tools working on bytes, files are more commonly associated with important data that must be neither lost nor corrupted. Python is expected to "help" the developer to use the proper options to read content from a file and to write content into a file.

So I understand that open() should use the "strict" error handler in the UTF-8 mode, rather than "surrogateescape". I can live with this "tiny" change to my PEP. I just posted a 3rd version of my PEP where the open() error handler remains strict (it is no longer changed by the PEP).

Victor
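The "producer | consumer" pass-through behaviour discussed above can be sketched without touching the real standard streams. This is only an illustration under stated assumptions: `fake_stdin` and `fake_stdout` are in-memory stand-ins built with `io.TextIOWrapper`, configured the way the thread describes stdin/stdout (UTF-8 with surrogateescape), and the payload is made up.

```python
import io

# One valid line and one line containing a byte that is not valid UTF-8.
payload = b"ok line\nbad \xff byte\n"

# Stand-ins for sys.stdin / sys.stdout using UTF-8 + surrogateescape.
# newline="" disables newline translation so bytes are preserved exactly.
fake_stdin = io.TextIOWrapper(
    io.BytesIO(payload), encoding="utf-8", errors="surrogateescape", newline=""
)
sink = io.BytesIO()
fake_stdout = io.TextIOWrapper(
    sink, encoding="utf-8", errors="surrogateescape", newline=""
)

# A trivial "filter" tool: copy every line from input to output.
for line in fake_stdin:
    fake_stdout.write(line)
fake_stdout.flush()

result = sink.getvalue()
print(result == payload)
```

The undecodable byte travels through the text layer as a lone surrogate and is turned back into the original byte on output, so the tool behaves like a byte-oriented Unix filter.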
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
On Thu, Dec 7, 2017 at 3:02 PM, Victor Stinner wrote:
> 2017-12-06 5:07 GMT+01:00 INADA Naoki:
> > And opening binary file without "b" option is very common mistake of new
> > developers. If default error handler is surrogateescape, they lose a chance
> > to notice their bug.
>
> To come back to your original point, I didn't know that it was a
> common mistake to open binary files in text mode.

It probably is because in Python 2 it makes no difference on UNIX, and on Windows the only difference is that binary mode preserves \r.

> Honestly, I didn't try recently. How does Python behave when you do that?
>
> Is it possible to write a full binary parser using the text mode? You
> should quickly get issues pointing you to your mistake, no?

You will quickly get decoding errors, and that is INADA's point. (Unless you use encoding='Latin-1'.) His worry is that the surrogateescape error handler makes it so that you won't get decoding errors, and then the failure mode is much harder to debug.

--
--Guido van Rossum (python.org/~guido)
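The three outcomes described in the message above (strict fails fast, Latin-1 never fails, surrogateescape never fails either) can be reproduced with a small sketch. The file name and the byte string are made up for the example; the bytes merely imitate the start of a JPEG file.

```python
import os
import tempfile

# Imitation of a JPEG header: FF D8 FF E0 is not valid UTF-8.
data = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "image.jpg")
    with open(path, "wb") as f:
        f.write(data)

    # 1. UTF-8 with the strict handler: the mistake is reported at once.
    try:
        with open(path, "r", encoding="utf-8", errors="strict") as f:
            f.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # 2. UTF-8 with surrogateescape: no error, invalid bytes become
    #    lone surrogates (U+DC80..U+DCFF) inside the returned str.
    with open(path, "r", encoding="utf-8",
              errors="surrogateescape", newline="") as f:
        escaped = f.read()

    # 3. Latin-1: every byte maps to a code point, so it never fails.
    with open(path, "r", encoding="latin-1", newline="") as f:
        latin = f.read()

print(strict_failed)
print(any("\udc80" <= ch <= "\udcff" for ch in escaped))
```

In cases 2 and 3 the program keeps running with a str that looks plausible, which is the "much harder to debug" failure mode mentioned above.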
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
2017-12-06 5:07 GMT+01:00 INADA Naoki: > And opening binary file without "b" option is very common mistake of new > developers. If default error handler is surrogateescape, they lose a chance > to notice their bug. To come back to your original point, I didn't know that it was a common mistake to open binary files in text mode. Honestly, I didn't try recently. How does Python behave when you do that? Is it possible to write a full binary parser using the text mode? You should quickly get issues pointing you to your mistake, no? Victor ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
While I'm not strongly convinced that the open() error handler must be changed to surrogateescape, first I would like to make sure that it's really a very bad idea before changing it :-)

2017-12-07 7:49 GMT+01:00 INADA Naoki:
> I just came up with crazy idea; changing default error handler of open()
> to "surrogateescape" only when open mode is "w" or "a".

The idea is tempting but I'm not sure that it's a good idea. Moreover, what about "r+" and "w+" modes?

I dislike getting a different behaviour for inputs and outputs. The motivation for surrogateescape is to "pass through" undecodable bytes: you need to handle them on the input side and on the output side.

That's why I decided to not only change the sys.stdin error handler to surrogateescape for the POSIX locale, but also sys.stdout: https://bugs.python.org/issue19977

> When reading, "surrogateescape" error handler is dangerous because
> it can produce arbitrary broken unicode string by mistake.

I'm fine with that. I wouldn't say that it's the purpose of the PEP, but sadly it's an expected, known and documented side effect.

You get the same behaviour with Unix command line tools and most Python 2 applications (processing data as bytes). Nothing new under the sun.

The PEP 540 allows users to write applications behaving like Unix tools/Python 2 with the power of the Python 3 language and stdlib.

Again, use the Strict UTF-8 mode if you prioritize *correctness* over *usability*.

Honestly, I'm not even sure that the Strict UTF-8 mode is *usable* in practice, since we are all surrounded by old documents encoded to various "legacy" encodings (where legacy means: "not UTF-8", like Latin1 or ShiftJIS). The first non-ASCII character which is not encoded to UTF-8 is going to "crash" the application (big traceback with a Unicode error).

Maybe the problem is the feature name: "UTF-8 mode". Users may think "strict" when they read "UTF-8", since UTF-8 is known to be a strict encoding.
For example, UTF-8 is much stricter than latin1, which is unable to tell if a document was encoded in latin1 or whatever else. UTF-8 is able to tell if a document was actually encoded in UTF-8 or not, thanks to the design of the encoding itself.

> And it doesn't allow following code:
>
> with open("image.jpg", "r") as f: # Binary data, not UTF-8
>     return f.read()

Using a JPEG image, the example is obviously wrong.

But using surrogateescape on open() is meant for reading *text files* which are mostly correctly encoded in UTF-8, except for a few bytes.

I'm not sure how to explain the issue. The Mercurial wiki page has a good example of this issue that they call the "Makefile problem": https://www.mercurial-scm.org/wiki/EncodingStrategy#The_.22makefile_problem.22

While it's not exactly the discussed issue, it gives you an idea of the kind of issues that you have when you use open(filename, encoding="utf-8", errors="strict") versus open(filename, encoding="utf-8", errors="surrogateescape").

> I'm not sure about this is good idea. And I don't know when is good for
> changing write error handler; only when PEP 538 or PEP 540 is used?
> Or always when sys.getfilesystemencoding() is UTF-8?
>
> Any thoughts?

The PEP 538 doesn't affect the error handler. The PEP 540 only changes the error handler for the POSIX locale; it's a deliberate choice. The PEP 538 is only enabled for the POSIX locale, and the PEP 540 will also be enabled by default by this locale.

I dislike the idea of changing the error handler if the filesystem encoding is UTF-8. The UTF-8 mode must be enabled explicitly, on purpose. This reduces any risk of regression and prepares users who enable it for any potential issue.

Victor
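The remark above about UTF-8 being able to detect wrongly encoded input, while latin1 cannot, is easy to check. A minimal sketch; `is_valid_utf8` is a helper name invented for this example.

```python
text = "café"
latin1_bytes = text.encode("latin-1")  # b'caf\xe9'
utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9'


def is_valid_utf8(data):
    """Return True if data decodes as UTF-8 with the strict handler."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# UTF-8 rejects the Latin-1 bytes: a lone 0xE9 is an incomplete sequence.
print(is_valid_utf8(latin1_bytes))
print(is_valid_utf8(utf8_bytes))

# Latin-1 accepts anything, including UTF-8 bytes, and silently
# produces mojibake instead of an error.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)
```

The strict UTF-8 decoder acts as a validity check, whereas decoding with latin1 always succeeds and gives no signal that the wrong codec was used.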
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
> I care only about builtin open()'s behavior.
> PEP 538 doesn't change default error handler of open().
>
> I think PEP 538 and PEP 540 should behave almost identical except
> changing locale or not. So I need very strong reason if PEP 540
> changes default error handler of open().

I just came up with a crazy idea: changing the default error handler of open() to "surrogateescape" only when the open mode is "w" or "a".

When reading, the "surrogateescape" error handler is dangerous because it can produce arbitrary broken unicode strings by mistake.

On the other hand, the "surrogateescape" error handler for writing is not so dangerous if the encoding is UTF-8. When writing a normal unicode string, it doesn't create broken data. When writing a string containing surrogateescaped data, the data was (partially) broken before writing.

This idea allows the following code:

    with open("files.txt", "w") as f:
        for fn in os.listdir():  # may return surrogateescaped strings
            f.write(fn+'\n')

And it doesn't allow the following code:

    with open("image.jpg", "r") as f:  # Binary data, not UTF-8
        return f.read()

I'm not sure this is a good idea. And I don't know when it is good to change the write error handler; only when PEP 538 or PEP 540 is used? Or always when sys.getfilesystemencoding() is UTF-8?

Any thoughts?

INADA Naoki
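The directory-listing case above can be tried without creating oddly named files. In this sketch the surrogateescaped name is produced by decoding made-up bytes directly, which is an assumption standing in for what os.listdir() would return for such a file on a UTF-8 system; `write_listing` is a hypothetical helper.

```python
import os
import tempfile

# A file name encoded in Latin-1; 0xE9 is not valid UTF-8 here.
raw_name = b"caf\xe9.txt"
# Stand-in for the str that os.listdir() would hand back.
name = raw_name.decode("utf-8", "surrogateescape")


def write_listing(path, names, errors):
    """Write one name per line using UTF-8 and the given error handler."""
    with open(path, "w", encoding="utf-8", errors=errors, newline="\n") as f:
        for fn in names:
            f.write(fn + "\n")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "files.txt")

    # strict: the lone surrogate cannot be encoded.
    try:
        write_listing(path, [name], errors="strict")
        strict_ok = True
    except UnicodeEncodeError:
        strict_ok = False

    # surrogateescape: the original byte is written back unchanged.
    write_listing(path, [name], errors="surrogateescape")
    with open(path, "rb") as f:
        written = f.read()

print(strict_ok)
print(written == raw_name + b"\n")
```

With surrogateescape on the write side, the bytes that end up in the file are exactly the bytes of the original file name, which is the "not so dangerous" behaviour described above.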
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
On 7 December 2017 at 08:20, Victor Stinnerwrote: > 2017-12-06 23:07 GMT+01:00 Antoine Pitrou : >> One question: how do you plan to test for the POSIX locale? > > I'm not sure. I will probably rely on Nick for that ;-) Nick already > implemented this exact check for his PEP 538 which is already > implemented in Python 3.7. > > I already implemented the PEP 540: > >https://bugs.python.org/issue29240 >https://github.com/python/cpython/pull/855 > > Right now, my implementation uses: > >char *ctype = _PyMem_RawStrdup(setlocale(LC_CTYPE, "")); >... >if (strcmp(ctype, "C") == 0) ... We have a private helper for this as a result of the PEP 538 implementation: _Py_LegacyLocaleDetected() Details are in the source code at https://github.com/python/cpython/blob/master/Python/pylifecycle.c#L345 As per my comment there, and Jakub Wilk's post to this thread, we're missing a case to also check for the string "POSIX" (which will fix several of the current locale coercion discrepancies between Linux and *BSD systems). Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
On 7 December 2017 at 01:59, Jakub Wilk wrote:
> * Nick Coghlan, 2017-12-06, 16:15:
>> The one that's relevant to default locale detection is just the string
>> that "setlocale(LC_CTYPE, NULL)" returns.
>
> POSIX doesn't require any particular return value for setlocale() calls.
> It's only guaranteed that the returned string can be used in subsequent
> setlocale() calls to restore the original locale.
>
> So in the POSIX locale, a compliant setlocale() implementation could return
> "C", or "POSIX", or even something entirely different.

Thanks. I'd been wondering if we should also handle the "POSIX" case in the legacy locale detection logic, and you've convinced me that we should.

Issue filed for that here: https://bugs.python.org/issue32238

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
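A rough Python-level sketch of the check being discussed: query the current LC_CTYPE setting without changing it and compare the returned string against both spellings. This is not the CPython implementation, which lives in C; `LEGACY_LOCALE_NAMES` and `looks_like_posix_locale` are names invented here, and, as noted above, a compliant platform could return some other string entirely, so the comparison is a heuristic.

```python
import locale

# The two spellings mentioned in the thread; other aliases may exist.
LEGACY_LOCALE_NAMES = ("C", "POSIX")


def looks_like_posix_locale(name):
    """Heuristic: True if the setlocale() result names the POSIX locale."""
    return name in LEGACY_LOCALE_NAMES


# Passing None only queries the current setting, like
# setlocale(LC_CTYPE, NULL) in C.
current = locale.setlocale(locale.LC_CTYPE, None)
print(current, looks_like_posix_locale(current))
```

What this prints depends on the environment of the process, so no particular output should be expected.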
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
On Thu, 7 Dec 2017 00:22:52 +0100 Victor Stinner wrote:
> 2017-12-06 23:36 GMT+01:00 Antoine Pitrou:
> > Other than that, +1 on the PEP.
>
> Naoki doesn't seem to be comfortable with the usage of the
> surrogateescape error handler by default for open(). Are you ok with
> that? If yes, would you mind to explain why? :-)

Sorry, I had missed that objection. I agree with Inada Naoki: it's better to keep it strict.

Regards

Antoine.
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
2017-12-06 23:36 GMT+01:00 Antoine Pitrou:
> Other than that, +1 on the PEP.

Naoki doesn't seem to be comfortable with the usage of the surrogateescape error handler by default for open(). Are you ok with that? If yes, would you mind explaining why? :-)

Victor
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
On Wed, 6 Dec 2017 23:20:41 +0100 Victor Stinnerwrote: > 2017-12-06 23:07 GMT+01:00 Antoine Pitrou : > > One question: how do you plan to test for the POSIX locale? > > I'm not sure. I will probably rely on Nick for that ;-) Nick already > implemented this exact check for his PEP 538 which is already > implemented in Python 3.7. Other than that, +1 on the PEP. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
2017-12-06 23:07 GMT+01:00 Antoine Pitrou:
> One question: how do you plan to test for the POSIX locale?

I'm not sure. I will probably rely on Nick for that ;-) Nick already implemented this exact check for his PEP 538 which is already implemented in Python 3.7.

I already implemented the PEP 540:

    https://bugs.python.org/issue29240
    https://github.com/python/cpython/pull/855

Right now, my implementation uses:

    char *ctype = _PyMem_RawStrdup(setlocale(LC_CTYPE, ""));
    ...
    if (strcmp(ctype, "C") == 0) ...

Victor
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
On Wed, 6 Dec 2017 01:49:41 +0100 Victor Stinnerwrote: > Hi, > > I knew that I had to rewrite my PEP 540, but I was too lazy. Since > Guido explicitly requested a shorter PEP, here you have! > > https://www.python.org/dev/peps/pep-0540/ > > Trust me, it's the same PEP, but focused on the most important > information and with a shorter rationale ;-) Congrats on the rewriting! The shortening is appreciated :-) One question: how do you plan to test for the POSIX locale? Apparently you need to check at least for the "C" and "POSIX" strings, but perhaps other aliases as well? Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
Victor Stinner wrote: Maybe the "UTF-8 Mode" should be renamed to "UTF-8 with surrogateescape, or backslashreplace for stderr, or surrogatepass for fsencode/fsencode on Windows, or strict for Strict UTF-8 Mode"... But the PEP title would be too long, no? :-) Relaxed UTF-8 Mode? UTF8-Yeah-I'm-Fine-With-That mode? -- Greg ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
On Wed, 6 Dec 2017 at 06:10 INADA Naokiwrote: > >> And I have one worrying point. > >> With UTF-8 mode, open()'s default encoding/error handler is > >> UTF-8/surrogateescape. > > > > The Strict UTF-8 Mode is for you if you prioritize correctness over > usability. > > Yes, but as I said, I cares about not experienced developer > who doesn't know what UTF-8 mode is. > > > > > In the very first version of my PEP/idea, I wanted to use > > UTF-8/strict. But then I started to play with the implementation and I > > got many "practical" issues. Using UTF-8/strict, you quickly get > > encoding errors. For example, you become unable to read undecodable > > bytes from stdin. stdin.read() only gives you an error, without > > letting you decide how to handle these "invalid" data. Same issue with > > stdout. > > > > I don't care about stdio, because PEP 538 uses surrogateescape for > stdio/error > > https://www.python.org/dev/peps/pep-0538/#changes-to-the-default-error-handling-on-the-standard-streams > > I care only about builtin open()'s behavior. > PEP 538 doesn't change default error handler of open(). > > I think PEP 538 and PEP 540 should behave almost identical except > changing locale > or not. So I need very strong reason if PEP 540 changes default error > handler of open(). > I don't have enough locale experience to weigh in as an expert, but I already was leaning towards INADA-san's logic of not wanting to change open() and this makes me really not want to change it. -Brett > > > > In the old long version of the PEP, I tried to explain UTF-8/strict > > issues with very concrete examples, the removed "Use Cases" section: > > > https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L490 > > > > Tell me if I should rephrase the rationale of the PEP 540 to better > > justify the usage of surrogateescape. > > OK, "List a directory into a text file" example demonstrates why > surrogateescape > is used for open(). 
If os.listdir() returns surrogateescpaed data, > file.write() will be > fail. > All other examples are about stdio. > > But we should achieve good balance between correctness and usability of > default behavior. > > > > > Maybe the "UTF-8 Mode" should be renamed to "UTF-8 with > > surrogateescape, or backslashreplace for stderr, or surrogatepass for > > fsencode/fsencode on Windows, or strict for Strict UTF-8 Mode"... But > > the PEP title would be too long, no? :-) > > > > I feel short name is enough. > > > > >> And opening binary file without "b" option is very common mistake of new > >> developers. If default error handler is surrogateescape, they lose a > chance > >> to notice their bug. > > > > When open() in used in text mode to read "binary data", usually the > > developer would only notify when getting the POSIX locale (ASCII > > encoding). But the PEP 538 already changed that by using the C.UTF-8 > > locale (and so the UTF-8 encoding, instead of the ASCII encoding). > > > > With PEP 538 (C.UTF-8 locale), open() uses UTF-8/strict, not > UTF-8/surrogateescape. > > For example, this code raise UnicodeDecodeError with PEP 538 if the > file is JPEG file. > > with open(fn) as f: > f.read() > > > > I'm not sure that locales are the best way to detect such class of > > bytes. I suggest to use -b or -bb option to detect such bugs without > > having to care of the locale. > > > > But many new developers doesn't use/know -b or -bb option. > > > > >> On the other hand, it helps some use cases when user want > byte-transparent > >> behavior, without modifying code to use "surrogateescape" explicitly. > >> > >> Which is more important scenario? Anyone has opinion about it? > >> Are there any rationals and use cases I missing? > > > > Usually users expect that Python 3 "just works" and don't bother them > > with the locale (thay nobody understands). 
> > > > The old version of the PEP contains a long list of issues: > > > https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L924-L986 > > > > I already replaced the strict error handler with surrogateescape for > > sys.stdin and sys.stdout on the POSIX locale in Python 3.5: > > https://bugs.python.org/issue19977 > > > > For the rationale, read for example these comments: > > > [snip] > > OK, I'll read them and think again about open()'s default behavior. > But I still hope open()'s behavior is consistent with PEP 538 and PEP 540. > > Regards, > ___ > Python-Dev mailing list > Python-Dev@python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > https://mail.python.org/mailman/options/python-dev/brett%40python.org > ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
* Nick Coghlan, 2017-12-06, 16:15: Something I've just noticed that needs to be clarified: on Linux, "C" locale and "POSIX" locale are aliases, but this isn't true in general (e.g. it's not the case on *BSD systems, including Mac OS X). For those of us with little to no BSD/MacOS experience, can you give a quick run-down of the differences between "C" and "POSIX"? POSIX says that "C" and "POSIX" are equivalent[0]. The one that's relevant to default locale detection is just the string that "setlocale(LC_CTYPE, NULL)" returns. POSIX doesn't require any particular return value for setlocale() calls. It's only guaranteed that the returned string can be used in subsequent setlocale() calls to restore the original locale. So in the POSIX locale, a compliant setlocale() implementation could return "C", or "POSIX", or even something entirely different. Beyond that, I don't know what the actual functional differences are. I don't believe there are any. [0] http://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html -- Jakub Wilk ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
>> And I have one worrying point.
>> With UTF-8 mode, open()'s default encoding/error handler is
>> UTF-8/surrogateescape.
>
> The Strict UTF-8 Mode is for you if you prioritize correctness over usability.

Yes, but as I said, I care about inexperienced developers who don't know what UTF-8 mode is.

>
> In the very first version of my PEP/idea, I wanted to use
> UTF-8/strict. But then I started to play with the implementation and I
> got many "practical" issues. Using UTF-8/strict, you quickly get
> encoding errors. For example, you become unable to read undecodable
> bytes from stdin. stdin.read() only gives you an error, without
> letting you decide how to handle these "invalid" data. Same issue with
> stdout.
>

I don't care about stdio, because PEP 538 uses surrogateescape for stdio/error:
https://www.python.org/dev/peps/pep-0538/#changes-to-the-default-error-handling-on-the-standard-streams

I care only about the builtin open()'s behavior. PEP 538 doesn't change the default error handler of open(). I think PEP 538 and PEP 540 should behave almost identically, except for whether they change the locale. So I need a very strong reason if PEP 540 changes the default error handler of open().

> In the old long version of the PEP, I tried to explain UTF-8/strict
> issues with very concrete examples, the removed "Use Cases" section:
> https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L490
>
> Tell me if I should rephrase the rationale of the PEP 540 to better
> justify the usage of surrogateescape.

OK, the "List a directory into a text file" example demonstrates why surrogateescape is used for open(). If os.listdir() returns surrogateescaped data, file.write() will fail. All other examples are about stdio.

But we should achieve a good balance between correctness and usability in the default behavior.

>
> Maybe the "UTF-8 Mode" should be renamed to "UTF-8 with
> surrogateescape, or backslashreplace for stderr, or surrogatepass for
> fsencode/fsdecode on Windows, or strict for Strict UTF-8 Mode"... But
> the PEP title would be too long, no? :-)
>

I feel the short name is enough.

>
>> And opening binary file without "b" option is very common mistake of new
>> developers. If default error handler is surrogateescape, they lose a chance
>> to notice their bug.
>
> When open() is used in text mode to read "binary data", usually the
> developer would only notice when getting the POSIX locale (ASCII
> encoding). But the PEP 538 already changed that by using the C.UTF-8
> locale (and so the UTF-8 encoding, instead of the ASCII encoding).
>

With PEP 538 (C.UTF-8 locale), open() uses UTF-8/strict, not UTF-8/surrogateescape. For example, this code raises UnicodeDecodeError under PEP 538 if the file is a JPEG file:

    with open(fn) as f:
        f.read()

> I'm not sure that locales are the best way to detect such class of
> bugs. I suggest to use -b or -bb option to detect such bugs without
> having to care of the locale.
>

But many new developers don't use or know about the -b or -bb options.

>
>> On the other hand, it helps some use cases when user want byte-transparent
>> behavior, without modifying code to use "surrogateescape" explicitly.
>>
>> Which is more important scenario? Anyone has opinion about it?
>> Are there any rationals and use cases I missing?
>
> Usually users expect that Python 3 "just works" and don't bother them
> with the locale (that nobody understands).
>
> The old version of the PEP contains a long list of issues:
> https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L924-L986
>
> I already replaced the strict error handler with surrogateescape for
> sys.stdin and sys.stdout on the POSIX locale in Python 3.5:
> https://bugs.python.org/issue19977
>
> For the rationale, read for example these comments:
> [snip]

OK, I'll read them and think again about open()'s default behavior. But I still hope open()'s behavior is consistent between PEP 538 and PEP 540.

Regards,
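The JPEG concern raised in this message can be sketched without touching the filesystem, since `open()` in text mode ends up using the same codec and error-handler machinery as `bytes.decode()`. The byte string below is a hand-written stand-in for the start of a JPEG file, and the variable names are illustrative only.

```python
# First bytes of a typical JPEG/JFIF file (illustrative stand-in for real data).
jpeg_header = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"

# UTF-8 with the strict handler: the mistake is reported immediately.
try:
    jpeg_header.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict:", exc)

# UTF-8 with surrogateescape: decoding succeeds, and each undecodable byte
# is smuggled through as a lone surrogate in the range U+DC80..U+DCFF.
lenient = jpeg_header.decode("utf-8", errors="surrogateescape")
print("surrogateescape:", ascii(lenient))

# The original bytes are still recoverable, which is the byte-transparent
# behaviour, but nothing told the developer the file was not text.
round_tripped = lenient.encode("utf-8", errors="surrogateescape")
```

This shows both sides of the trade-off under discussion: `strict` surfaces the bug, while `surrogateescape` preserves the data silently.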
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
On 6 December 2017 at 20:38, Victor Stinner wrote:
> Nick:
>> So if PEP 540 is going to implicitly trigger switching encodings, it
>> needs to specify whether it's going to look for the C locale or the
>> POSIX locale (I'd suggest C locale, since that's the actual default
>> that causes problems).
>
> I'm thinking of the test already used by check_force_ascii() (function
> checking if the LC_CTYPE uses the ASCII encoding or something else):
>
>     loc = setlocale(LC_CTYPE, NULL);
>     if (loc == NULL)
>         goto error;
>     if (strcmp(loc, "C") != 0) {
>         /* the LC_CTYPE locale is different than C */
>         return 0;
>     }

Yeah, the locale coercion code changes the locale multiple times to make sure we have a coercion target that will actually work (and then checks nl_langinfo as well, since that sometimes breaks on BSD systems, even if the original setlocale() call claimed to work).

Once we've found a locale that appears to work though, then we configure the LC_CTYPE environment variable, and reload the locale from the environment.

It's all annoyingly convoluted and arcane, but it works well enough for https://github.com/python/cpython/blob/master/Lib/test/test_c_locale_coercion.py to pass across the full BuildBot fleet :)

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
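The probing described in this message (try a candidate locale, then double-check the reported codeset) can be approximated in Python as a rough sketch. It is not CPython's actual C implementation: the function name is hypothetical, the candidate list is the set of coercion targets named in PEP 538, and `locale.nl_langinfo` is only available on Unix-like platforms.

```python
import locale

# Coercion targets listed in PEP 538, tried in order.
CANDIDATES = ("C.UTF-8", "C.utf8", "UTF-8")

def find_utf8_ctype_locale(candidates=CANDIDATES):
    """Return the first candidate that really yields UTF-8, or None.

    Hypothetical helper: it mimics the "try it, then verify with
    nl_langinfo" idea, and restores the original locale afterwards.
    """
    saved = locale.setlocale(locale.LC_CTYPE, None)
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # this platform does not provide the locale
            # setlocale() succeeding is not enough: confirm the codeset.
            codeset = locale.nl_langinfo(locale.CODESET)
            if codeset.upper().replace("-", "") == "UTF8":
                return name
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)

target = find_utf8_ctype_locale()
print("usable coercion target:", target)
```

A result of `None` corresponds to the case where coercion cannot work and an interpreter-level UTF-8 mode would be the fallback.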
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
Nick:
> So if PEP 540 is going to implicitly trigger switching encodings, it
> needs to specify whether it's going to look for the C locale or the
> POSIX locale (I'd suggest C locale, since that's the actual default
> that causes problems).

I'm thinking of the test already used by check_force_ascii() (function checking if the LC_CTYPE uses the ASCII encoding or something else):

    loc = setlocale(LC_CTYPE, NULL);
    if (loc == NULL)
        goto error;
    if (strcmp(loc, "C") != 0) {
        /* the LC_CTYPE locale is different than C */
        return 0;
    }

Victor
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
Hi Naoki,

2017-12-06 5:07 GMT+01:00 INADA Naoki:
> Oh, revised version is really short!
>
> And I have one worrying point.
> With UTF-8 mode, open()'s default encoding/error handler is
> UTF-8/surrogateescape.

The Strict UTF-8 Mode is for you if you prioritize correctness over usability.

In the very first version of my PEP/idea, I wanted to use UTF-8/strict. But then I started to play with the implementation and I got many "practical" issues. Using UTF-8/strict, you quickly get encoding errors. For example, you become unable to read undecodable bytes from stdin. stdin.read() only gives you an error, without letting you decide how to handle these "invalid" data. Same issue with stdout.

Compare encodings of the UTF-8 mode and the Strict UTF-8 Mode:
https://www.python.org/dev/peps/pep-0540/#encoding-and-error-handler

I tried to summarize all these kinds of issues in the second short subsection of the rationale:
https://www.python.org/dev/peps/pep-0540/#passthough-undecodable-bytes-surrogateescape

In the old long version of the PEP, I tried to explain UTF-8/strict issues with very concrete examples, the removed "Use Cases" section:
https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L490

Tell me if I should rephrase the rationale of the PEP 540 to better justify the usage of surrogateescape.

Maybe the "UTF-8 Mode" should be renamed to "UTF-8 with surrogateescape, or backslashreplace for stderr, or surrogatepass for fsencode/fsdecode on Windows, or strict for Strict UTF-8 Mode"... But the PEP title would be too long, no? :-)

> And opening binary file without "b" option is very common mistake of new
> developers. If default error handler is surrogateescape, they lose a chance
> to notice their bug.

When open() is used in text mode to read "binary data", usually the developer would only notice when getting the POSIX locale (ASCII encoding). But the PEP 538 already changed that by using the C.UTF-8 locale (and so the UTF-8 encoding, instead of the ASCII encoding).

I'm not sure that locales are the best way to detect this class of bugs. I suggest using the -b or -bb option to detect such bugs without having to care about the locale.

> On the other hand, it helps some use cases when user want byte-transparent
> behavior, without modifying code to use "surrogateescape" explicitly.
>
> Which is more important scenario? Anyone has opinion about it?
> Are there any rationals and use cases I missing?

Usually users expect that Python 3 "just works" and doesn't bother them with the locale (that nobody understands).

The old version of the PEP contains a long list of issues:
https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L924-L986

I already replaced the strict error handler with surrogateescape for sys.stdin and sys.stdout on the POSIX locale in Python 3.5:
https://bugs.python.org/issue19977

For the rationale, read for example these comments:

* https://bugs.python.org/issue19846#msg205727
  "As I would state it, the problem is that python's boundary with the OS is not yet uniform. (...) Note that currently, input() and sys.stdin.read() won't read undecodable data so this is somewhat symmetrical but it seems to me that saying "everything that interfaces with the OS except the standard streams will use surrogateescape on undecodable bytes" is drawing a line in an unintuitive location."

* https://bugs.python.org/issue19977#msg206141
  "My impression was that python3 was supposed to help get rid of UnicodeError tracebacks, not mojibake. If mojibake was the problem then we should never have gone down the surrogateescape path for input."

* https://bugs.python.org/issue19846#msg205646
  "For example I'm using [LANG=C] for testcases to set the language uncomplicated to english."

In bug reports, to get the user expectations, just ignore all core developers' comments :-) Users set the locale to C to get messages in English and still expect "Unicode" to work properly.

Only Python 3 is so strict about encodings. Most other programming languages, like Python 2, "just work", since they process data as bytes.

Victor
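The stdin/stdout problem described in this message can be simulated with in-memory streams. The sketch below is an assumption-laden stand-in for a real pipe tool: `copy_text` is a hypothetical helper, `io.BytesIO` objects play the role of stdin and stdout, and `newline=""` is passed so that no newline translation interferes with the comparison.

```python
import io

# One line in Latin-1 (undecodable as UTF-8) followed by one line in UTF-8.
raw = b"caf\xe9 latin-1 line\n" + "caf\u00e9 utf-8 line\n".encode("utf-8")

def copy_text(data, errors):
    """Copy bytes through text-mode streams, like a stdin -> stdout filter."""
    source = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8",
                              errors=errors, newline="")
    sink_bytes = io.BytesIO()
    sink = io.TextIOWrapper(sink_bytes, encoding="utf-8",
                            errors=errors, newline="")
    sink.write(source.read())
    sink.flush()
    return sink_bytes.getvalue()

# surrogateescape: the undecodable byte passes through unchanged.
passthrough = copy_text(raw, "surrogateescape")

# strict: the read itself fails, before the tool can decide what to do.
try:
    copy_text(raw, "strict")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

print(passthrough == raw, type(strict_error).__name__)
```

With `surrogateescape` the filter behaves like `cat`, while `strict` stops at the first undecodable byte.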
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
On 6 December 2017 at 16:18, Glenn Linderman wrote:
> "b" mostly matters on Windows, correct? And Windows doesn't use C or POSIX
> locale, correct? And if these are correct, then is this an issue? And if so,
> why?

In Python 3, "b" matters everywhere, since it controls whether the stream gets wrapped in TextIOWrapper or not.

It's only in Python 2 that the distinction is Windows-specific (where it controls how "\r\n" sequences get handled).

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
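A small sketch of the point about "b" in Python 3: the same file opened with and without it yields different stream types and different data types on every platform. The temporary file and variable names are only for illustration.

```python
import io
import os
import tempfile

# Write a small file with Windows-style line endings.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line one\r\nline two\r\n")
    path = tmp.name

try:
    # Binary mode: a buffered byte stream, no decoding at all.
    with open(path, "rb") as f:
        binary_type = type(f)
        binary_data = f.read()

    # Text mode: the byte stream is wrapped in a TextIOWrapper, which
    # decodes the bytes and applies universal-newline translation.
    with open(path, "r", encoding="utf-8") as f:
        text_type = type(f)
        text_data = f.read()
finally:
    os.remove(path)

print(binary_type.__name__, repr(binary_data))
print(text_type.__name__, repr(text_data))
```

The encoding is passed explicitly here so that the example does not depend on the locale, which is the very thing under discussion in this thread.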
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
On 12/5/2017 8:07 PM, INADA Naoki wrote:
> Oh, revised version is really short!
>
> And I have one worrying point.
> With UTF-8 mode, open()'s default encoding/error handler is
> UTF-8/surrogateescape.
>
> Containers are really growing. PyCharm supports Docker and many new Python
> developers use Docker instead of installing Python directly on their system,
> especially on Windows.
>
> And opening binary file without "b" option is very common mistake of new
> developers. If default error handler is surrogateescape, they lose a chance
> to notice their bug.

"b" mostly matters on Windows, correct? And Windows doesn't use C or POSIX locale, correct? And if these are correct, then is this an issue? And if so, why?

> On the other hand, it helps some use cases when user want byte-transparent
> behavior, without modifying code to use "surrogateescape" explicitly.
>
> Which is more important scenario? Anyone has opinion about it?
> Are there any rationals and use cases I missing?
>
> Regards,
>
> INADA Naoki
>
> On Wed, Dec 6, 2017 at 12:17 PM, INADA Naoki wrote:
>> I'm sorry about my laziness.
>> I've very busy these months, but I'm back to OSS world from today.
>>
>> While I should review carefully again, I think I'm close to accept PEP 540.
>>
>> * PEP 540 really helps containers and old Linux machines PEP 538 doesn't work.
>>   And containers is really important for these days. Many new
>>   Pythonistas who is not Linux experts start using containers.
>>
>> * In recent years, UTF-8 fixed many mojibakes. Now UnicodeError is
>>   more usability problem for many Python users. So I agree opt-out
>>   UTF-8 mode is better than opt-in on POSIX locale.
>>
>> I don't have enough time to read all mails in ML archive.
>> So if someone have opposite opinion, please remind me by this weekend.
>>
>> Regards,
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
On 6 December 2017 at 15:59, Chris Angelico wrote:
> On Wed, Dec 6, 2017 at 4:46 PM, Nick Coghlan wrote:
>> Something I've just noticed that needs to be clarified: on Linux, "C"
>> locale and "POSIX" locale are aliases, but this isn't true in general
>> (e.g. it's not the case on *BSD systems, including Mac OS X).
>
> For those of us with little to no BSD/MacOS experience, can you give a
> quick run-down of the differences between "C" and "POSIX"?

The one that's relevant to default locale detection is just the string that "setlocale(LC_CTYPE, NULL)" returns.

On Linux (or, more accurately, with glibc), after setting "LC_CTYPE=POSIX", that call still returns "C" (since the "POSIX" locale is defined as an alias for the "C" locale). By contrast, on *BSD, it will return "POSIX" (since "POSIX" is actually a distinct locale there).

Beyond that, I don't know what the actual functional differences are.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
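Because the reported name can differ between glibc and the BSDs, code that wants to detect "the default locale" has to decide which strings it accepts. The sketch below shows one possible check; `is_c_like_locale` is a hypothetical helper, and accepting both names is a design choice for the example, not what either PEP mandates.

```python
import locale

def is_c_like_locale(name):
    """Return True if a setlocale() result names the C or POSIX locale.

    Hypothetical helper: accepts both spellings, since the thread notes
    that glibc reports "C" while *BSD may report "POSIX".
    """
    return name in ("C", "POSIX")

# Ask the C library which LC_CTYPE locale is currently active.
current = locale.setlocale(locale.LC_CTYPE, None)
print(current, "->", is_c_like_locale(current))
```

A check keyed only on "C", as suggested for PEP 538, would simply drop "POSIX" from the tuple.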
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
On Wed, Dec 6, 2017 at 4:46 PM, Nick Coghlan wrote:
> Something I've just noticed that needs to be clarified: on Linux, "C"
> locale and "POSIX" locale are aliases, but this isn't true in general
> (e.g. it's not the case on *BSD systems, including Mac OS X).

For those of us with little to no BSD/MacOS experience, can you give a quick run-down of the differences between "C" and "POSIX"?

ChrisA
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
Something I've just noticed that needs to be clarified: on Linux, "C" locale and "POSIX" locale are aliases, but this isn't true in general (e.g. it's not the case on *BSD systems, including Mac OS X).

To handle that in PEP 538, I made it clear that everything is keyed specifically off the "C" locale, since that's what you actually get by default.

So if PEP 540 is going to implicitly trigger switching encodings, it needs to specify whether it's going to look for the C locale or the POSIX locale (I'd suggest C locale, since that's the actual default that causes problems).

The precedence relationship with locale coercion also needs to be spelled out: successful locale coercion should skip implicitly enabling UTF-8 mode (for opt-in UTF-8 mode, we'd still try to coerce the locale setting as appropriate, so extension modules are more likely to behave themselves).

On 6 December 2017 at 14:07, INADA Naoki wrote:
> Oh, revised version is really short!
>
> And I have one worrying point.
> With UTF-8 mode, open()'s default encoding/error handler is
> UTF-8/surrogateescape.
>
> Containers are really growing. PyCharm supports Docker and many new Python
> developers use Docker instead of installing Python directly on their system,
> especially on Windows.
>
> And opening binary file without "b" option is very common mistake of new
> developers. If default error handler is surrogateescape, they lose a chance
> to notice their bug.
>
> On the other hand, it helps some use cases when user want byte-transparent
> behavior, without modifying code to use "surrogateescape" explicitly.
>
> Which is more important scenario? Anyone has opinion about it?
> Are there any rationals and use cases I missing?

For platforms that offer a C.UTF-8 locale, I'd like "LC_CTYPE=C.UTF-8 python" and "PYTHONCOERCECLOCALE=0 LC_CTYPE=C PYTHONUTF8=1" to be equivalent (aside from the known limitation that extension modules may not do the right thing in the latter case).

For the locale coercion case, the default error handler for `open` remains as "strict", which means I'd be in favour of keeping it as "strict" by default in UTF-8 mode as well. That would flip the toggle in the PEP: "strict UTF-8" would be the default selection for "PYTHONUTF8=1", and you'd choose the more relaxed option via "PYTHONUTF8=permissive".

That way, the combination of PEPs 538 and 540 would give us the following situation in the C locale:

1. Our preferred approach is to coerce LC_CTYPE in the C locale to a UTF-8 based equivalent
2. Only if that fails (e.g. as it will on CentOS 7) do we resort to implicitly enabling CPython's internal UTF-8 mode (which should behave like C.UTF-8, *except* for the fact extension modules won't respect it)

That way, the ideal outcome is that a UTF-8 based locale exists, and we use it automatically when needed. UTF-8 mode then lets us cope with older platforms where neither C.UTF-8 nor an equivalent exists.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
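The two-step precedence proposed in this message can be written down as a tiny decision function. This is purely a sketch of the proposal as worded here: the function name, its parameters, and the returned labels are all hypothetical and do not correspond to anything in CPython.

```python
def choose_startup_strategy(ctype_locale, coercion_target):
    """Pick how to get UTF-8 behaviour at interpreter startup.

    ctype_locale:    the name reported for LC_CTYPE (e.g. "C").
    coercion_target: a working UTF-8 locale name, or None if none exists.

    Hypothetical sketch of the precedence described in the message above.
    """
    if ctype_locale != "C":
        # The user configured a real locale: leave everything alone.
        return "unchanged"
    if coercion_target is not None:
        # Step 1 (preferred): coerce LC_CTYPE, so C extensions benefit too.
        return "coerce-locale"
    # Step 2 (fallback): no usable UTF-8 locale, use the internal UTF-8 mode.
    return "utf8-mode"

print(choose_startup_strategy("C", "C.UTF-8"))      # modern Linux
print(choose_startup_strategy("C", None))           # e.g. no C.UTF-8 available
print(choose_startup_strategy("en_US.UTF-8", None))
```

The ordering matters because only locale coercion is visible to extension modules, which is why it is tried first.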
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
Oh, the revised version is really short!

And I have one worrying point. With UTF-8 mode, open()'s default encoding/error handler is UTF-8/surrogateescape.

Containers are really growing. PyCharm supports Docker and many new Python developers use Docker instead of installing Python directly on their system, especially on Windows.

And opening a binary file without the "b" option is a very common mistake among new developers. If the default error handler is surrogateescape, they lose a chance to notice their bug.

On the other hand, it helps some use cases where the user wants byte-transparent behavior, without modifying code to use "surrogateescape" explicitly.

Which is the more important scenario? Does anyone have an opinion about it? Are there any rationales and use cases I'm missing?

Regards,

INADA Naoki

On Wed, Dec 6, 2017 at 12:17 PM, INADA Naoki wrote:
> I'm sorry about my laziness.
> I've very busy these months, but I'm back to OSS world from today.
>
> While I should review carefully again, I think I'm close to accept PEP 540.
>
> * PEP 540 really helps containers and old Linux machines PEP 538 doesn't work.
>   And containers is really important for these days. Many new
>   Pythonistas who is not Linux experts start using containers.
>
> * In recent years, UTF-8 fixed many mojibakes. Now UnicodeError is
>   more usability problem for many Python users. So I agree opt-out
>   UTF-8 mode is better than opt-in on POSIX locale.
>
> I don't have enough time to read all mails in ML archive.
> So if someone have opposite opinion, please remind me by this weekend.
>
> Regards,
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
I'm sorry about my laziness. I've been very busy these past months, but I'm back in the OSS world as of today.

While I should review carefully again, I think I'm close to accepting PEP 540.

* PEP 540 really helps containers and old Linux machines where PEP 538 doesn't work. And containers are really important these days. Many new Pythonistas who are not Linux experts are starting to use containers.

* In recent years, UTF-8 has fixed many cases of mojibake. Now UnicodeError is more of a usability problem for many Python users. So I agree an opt-out UTF-8 mode is better than opt-in on the POSIX locale.

I don't have enough time to read all the mails in the ML archive. So if anyone has an opposing opinion, please remind me by this weekend.

Regards,
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
On 6 December 2017 at 11:01, Victor Stinner wrote:
>> Annex: Differences between the PEP 538 and the PEP 540
>> ======================================================
>>
>> The PEP 538 uses the "C.UTF-8" locale which is quite new and only
>> supported by a few Linux distributions; this locale is not currently
>> supported by FreeBSD or macOS for example. This PEP 540 supports all
>> operating systems.
>>
>> The PEP 538 only changes the behaviour for the POSIX locale. While the
>> new UTF-8 mode of this PEP is only enabled by the POSIX locale, it can
>> be enabled manually for any other locale.
>>
>> The PEP 538 is implemented with ``setlocale(LC_CTYPE, "C.UTF-8")``: any
>> non-Python code running in the process is impacted by this change. This
>> PEP is implemented in Python internals and ignores the locale:
>> non-Python running in the same process is not aware of the "Python UTF-8
>> mode".

I submitted a PR to reword this part: https://github.com/python/peps/pull/493

> The main advantage of the PEP 538 *over* the PEP 540 is that, for the
> POSIX locale, non-Python code running in the same process gets the
> UTF-8 encoding.
>
> To be honest, I'm not sure that there is a lot of code in the wild
> which uses "text" types like the C type wchar_t* and rely on the
> locale encoding. Almost all C library handle data as bytes using the
> char* type, like filenames and environment variables.

At the very least, GNU readline breaks if you don't change the locale setting:
https://www.python.org/dev/peps/pep-0538/#considering-locale-coercion-independently-of-utf-8-mode

Given that we found an example of this directly in the standard library, I assume that there are plenty more in third party extension modules (especially once we take C++ extensions into account, not just C ones).

> First I understood that the PEP 538 changed the locale encoding using
> an environment variable. But no, it's implemented with
> setlocale(LC_CTYPE, "C.UTF-8") which only impacts the current process
> and is not inherited by child processes. So I'm not sure anymore that
> PEP 538 and PEP 540 are really complementary.

It sets the LC_CTYPE environment variable as well:
https://www.python.org/dev/peps/pep-0538/#explicitly-setting-lc-ctype-for-utf-8-locale-coercion

The relevant code is in _coerce_default_locale_settings (currently at https://github.com/python/cpython/blob/master/Python/pylifecycle.c#L448)

> I'm not sure how PyGTK interacts with the PEP 538 for example. Does it
> use UTF-8 with the POSIX locale?

Desktop environments aim not to get into this situation in the first place by ensuring they're using a more appropriate locale :)

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
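The distinction drawn here (an in-process setlocale() call versus an environment variable that children inherit) can be sketched with `subprocess`. The example only demonstrates environment inheritance; it assumes Python 3.7 or later for `capture_output`, and it sets `PYTHONCOERCECLOCALE=0` in the child so that the child interpreter does not rewrite `LC_CTYPE` itself on platforms where the named locale is missing.

```python
import os
import subprocess
import sys

# Environment for the child: LC_CTYPE is set explicitly, as locale
# coercion would do, and the child's own coercion is switched off so the
# value it reports is exactly the one it inherited.
child_env = dict(os.environ)
child_env["LC_CTYPE"] = "C.UTF-8"
child_env["PYTHONCOERCECLOCALE"] = "0"

result = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ.get('LC_CTYPE'))"],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)
child_value = result.stdout.strip()
print("child sees LC_CTYPE =", child_value)
```

A plain `locale.setlocale()` call in the parent would not show up in the child at all, which is why exporting the variable matters for subprocesses.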
Re: [Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
> Annex: Differences between the PEP 538 and the PEP 540
> ======================================================
>
> The PEP 538 uses the "C.UTF-8" locale which is quite new and only
> supported by a few Linux distributions; this locale is not currently
> supported by FreeBSD or macOS for example. This PEP 540 supports all
> operating systems.
>
> The PEP 538 only changes the behaviour for the POSIX locale. While the
> new UTF-8 mode of this PEP is only enabled by the POSIX locale, it can
> be enabled manually for any other locale.
>
> The PEP 538 is implemented with ``setlocale(LC_CTYPE, "C.UTF-8")``: any
> non-Python code running in the process is impacted by this change. This
> PEP is implemented in Python internals and ignores the locale:
> non-Python running in the same process is not aware of the "Python UTF-8
> mode".

The main advantage of the PEP 538 *over* the PEP 540 is that, for the POSIX locale, non-Python code running in the same process gets the UTF-8 encoding.

To be honest, I'm not sure that there is a lot of code in the wild which uses "text" types like the C type wchar_t* and relies on the locale encoding. Almost all C libraries handle data as bytes using the char* type, like filenames and environment variables.

First I understood that the PEP 538 changed the locale encoding using an environment variable. But no, it's implemented with setlocale(LC_CTYPE, "C.UTF-8") which only impacts the current process and is not inherited by child processes. So I'm not sure anymore that PEP 538 and PEP 540 are really complementary.

I'm not sure how PyGTK interacts with the PEP 538 for example. Does it use UTF-8 with the POSIX locale?

Victor
[Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
Hi,

I knew that I had to rewrite my PEP 540, but I was too lazy. Since Guido explicitly requested a shorter PEP, here you have!

https://www.python.org/dev/peps/pep-0540/

Trust me, it's the same PEP, but focused on the most important information and with a shorter rationale ;-)

Full text below.

Victor


PEP: 540
Title: Add a new UTF-8 mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner
BDFL-Delegate: INADA Naoki
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 5-January-2016
Python-Version: 3.7


Abstract
========

Add a new UTF-8 mode to ignore the locale and use the UTF-8 encoding with the ``surrogateescape`` error handler. This mode is enabled by default in the POSIX locale, but otherwise disabled by default.

Add also a "strict" UTF-8 mode which uses the ``strict`` error handler, instead of ``surrogateescape``, with the UTF-8 encoding.

The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment variable are added to control the UTF-8 mode.


Rationale
=========

Locale encoding and UTF-8
-------------------------

Python 3.6 uses the locale encoding for filenames, environment variables, standard streams, etc. The locale encoding is inherited from the locale; the encoding and the locale are tightly coupled.

Many users inherit the ASCII encoding from the POSIX locale, aka the "C" locale, but are unable to change the locale for different reasons. This encoding is very limited in terms of Unicode support: any non-ASCII character is likely to cause trouble.

For example, the Alpine Linux distribution became popular thanks to Docker containers, but it uses the POSIX locale by default.

It is not easy to get the expected locale. Locales don't get the exact same name on all Linux distributions, FreeBSD, macOS, etc. Some locales, like the recent ``C.UTF-8`` locale, are only supported by a few platforms. For example, a SSH connection can use a different encoding than the filesystem or terminal encoding of the local host.
On the other hand, Python 3.6 is already using UTF-8 by default on macOS, Android and Windows (PEP 529) for most functions, except ``open()``. UTF-8 is also the default encoding of Python scripts, XML and JSON file formats. The Go programming language uses UTF-8 for strings.

When all data are stored as UTF-8 but the locale is often misconfigured, an obvious solution is to ignore the locale and use UTF-8.

Passthrough undecodable bytes: surrogateescape
----------------------------------------------

Using UTF-8 is nice, until you read the first file encoded to a different encoding. When using the ``strict`` error handler, which is the default, Python 3 raises a ``UnicodeDecodeError`` on the first undecodable byte.

Unix command line tools like ``cat`` or ``grep`` and most Python 2 applications simply do not have this class of bugs: they don't decode data, but process data as a raw bytes sequence.

Python 3 already has a solution to behave like Unix tools and Python 2: the ``surrogateescape`` error handler (:pep:`383`). It allows processing data "as bytes" but uses Unicode in practice (undecodable bytes are stored as surrogate characters).

For an application written as a Unix "pipe" tool like ``grep``, taking input on stdin and writing output to stdout, ``surrogateescape`` allows undecodable bytes to "pass through".

The UTF-8 encoding used with the ``surrogateescape`` error handler is a compromise between correctness and usability.

Strict UTF-8 for correctness
----------------------------

When correctness matters more than usability, the ``strict`` error handler is preferred over ``surrogateescape`` to raise an encoding error at the first undecodable byte or unencodable character.

No change by default for best backward compatibility
-----------------------------------------------------

While UTF-8 is perfect in most cases, sometimes the locale encoding is actually the best encoding.

This PEP changes the behaviour for the POSIX locale since this locale usually gives the ASCII encoding, whereas UTF-8 is a much better choice.
It does not change the behaviour for other locales to prevent any risk or regression.

As users are responsible for explicitly enabling the new UTF-8 mode, they are responsible for any potential mojibake issues caused by this mode.


Proposal
========

Add a new UTF-8 mode to ignore the locale and use the UTF-8 encoding with the ``surrogateescape`` error handler. This mode is enabled by default in the POSIX locale, but otherwise disabled by default.

Add also a "strict" UTF-8 mode which uses the ``strict`` error handler, instead of ``surrogateescape``, with the UTF-8 encoding.

The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment variable are added to control the UTF-8 mode:

* The UTF-8 mode is enabled by ``-X utf8`` or ``PYTHONUTF8=1``
* The Strict UTF-8 mode is configured by ``-X utf8=strict`` or ``PYTHONUTF8=strict``

The POSIX locale enables the UTF-8 mode. In this case,