Re: [Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8
Please no. Let's not add unrelated new functionality in with this
already large change with not entirely understood consequences.

On Thu, Sep 8, 2016 at 1:05 PM, Chris Barker wrote:
> On Thu, Sep 8, 2016 at 10:35 AM, Random832 wrote:
>>
>> It means that the so-called "bash" on windows 10 is actually a full
>> Ubuntu system (running on, AIUI, a simulation of Linux kernel system
>> calls), which will presumably also have its own python installation and
>> use a UTF-8 locale, rather than one that runs "natively" on win32.
>
> yes -- it looks like one could run a "linux" build of python under the whole
> subsystem, which would presumably "look" just like Linux to Python.
>
>> If it's possible for a win32 version of python to call it as a
>> subprocess,
>
> But this is what I was referring to -- it may be way too early to know what
> the capabilities or implications are, but I'm hoping that "regular" windows
> programs can interact with the subsystem. So if we're making changes now, it
> would be nice to consider it if we can.
>
>> Incidentally, according to
>> https://github.com/Microsoft/BashOnWindows/issues/2, pipes didn't work
>> at all between WSL processes and Win32 processes until two weeks ago, so
>> it's clear that these features are still evolving.
>
> so it may indeed be way too early -- but if they DO work now -- pretty cool!
>
> Thanks,
>
> -CHB
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR              (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> chris.bar...@noaa.gov

--
--Guido van Rossum (python.org/~guido)
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8
On Thu, Sep 8, 2016 at 1:14 PM, Guido van Rossum wrote:
> Please no. Let's not add unrelated new functionality in with this
> already large change with not entirely understood consequences.

Fair enough -- this is clearly a really raw API so far.

-CHB
Re: [Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8
On Thu, Sep 8, 2016 at 10:35 AM, Random832 wrote:
> It means that the so-called "bash" on windows 10 is actually a full
> Ubuntu system (running on, AIUI, a simulation of Linux kernel system
> calls), which will presumably also have its own python installation and
> use a UTF-8 locale, rather than one that runs "natively" on win32.

yes -- it looks like one could run a "linux" build of python under the whole
subsystem, which would presumably "look" just like Linux to Python.

> If it's possible for a win32 version of python to call it as a
> subprocess,

But this is what I was referring to -- it may be way too early to know what
the capabilities or implications are, but I'm hoping that "regular" windows
programs can interact with the subsystem. So if we're making changes now, it
would be nice to consider it if we can.

> Incidentally, according to
> https://github.com/Microsoft/BashOnWindows/issues/2, pipes didn't work
> at all between WSL processes and Win32 processes until two weeks ago, so
> it's clear that these features are still evolving.

so it may indeed be way too early -- but if they DO work now -- pretty cool!

Thanks,

-CHB
Re: [Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8
On Thu, Sep 8, 2016, at 13:10, Guido van Rossum wrote:
> On Thu, Sep 8, 2016 at 9:57 AM, Brett Cannon wrote:
> > Bash on Windows is just Linux, so it isn't affected by any of this.
>
> I don't know what that sentence means.

It means that the so-called "bash" on windows 10 is actually a full
Ubuntu system (running on, AIUI, a simulation of Linux kernel system
calls), which will presumably also have its own python installation and
use a UTF-8 locale, rather than one that runs "natively" on win32.

If it's possible for a win32 version of python to call it as a
subprocess, this may be an argument in favor of using UTF-8 - subject to
finding out whether WSL does use UTF-8, whether it supports non-ASCII
arguments from a Win32 CreateProcess at all, whether there's any way to
pass non-UTF-8 arguments to it, etc.

Incidentally, according to
https://github.com/Microsoft/BashOnWindows/issues/2, pipes didn't work
at all between WSL processes and Win32 processes until two weeks ago, so
it's clear that these features are still evolving.

> But anyways, if someone wants
> to try making subprocess work with bytes arguments on Windows,
> that's just a bugfix, and you're not constrained by how it works on
> previous Python versions (since it doesn't work there at all). It
> might be wise to choose an interpretation that's consistent with other
> uses of command line arguments by Python on Windows though (rather
> than choosing to favor making just bash work the same as it works on
> Linux).
Re: [Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8
On Thu, Sep 8, 2016 at 9:57 AM, Brett Cannon wrote:
> On Thu, 8 Sep 2016 at 09:06 Chris Barker wrote:
>> On Wed, Sep 7, 2016 at 10:37 AM, Guido van Rossum wrote:
>>> And apart from Python, few shell commands that work on
>>> Unix make much sense on Windows,
>>
>> Does the (optional) addition of bash to Windows 10 have any impact on
>> this?
>>
>> It'll be something that Windows developers can't count on their users
>> having for a good while, if ever, but if you can control the deployment
>> environment, then you might. And it would be VERY tempting for
>> "posix-focused" developers that want to run their code on Windows.
>>
>> So it would be nice if the "new" approach worked well with bash on
>> Windows.
>
> Bash on Windows is just Linux, so it isn't affected by any of this.

I don't know what that sentence means. But anyways, if someone wants
to try making subprocess work with bytes arguments on Windows,
that's just a bugfix, and you're not constrained by how it works on
previous Python versions (since it doesn't work there at all). It
might be wise to choose an interpretation that's consistent with other
uses of command line arguments by Python on Windows though (rather
than choosing to favor making just bash work the same as it works on
Linux).

--
--Guido van Rossum (python.org/~guido)
Re: [Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8
On Thu, 8 Sep 2016 at 09:06 Chris Barker wrote:
> On Wed, Sep 7, 2016 at 10:37 AM, Guido van Rossum wrote:
>> And apart from Python, few shell commands that work on
>> Unix make much sense on Windows,
>
> Does the (optional) addition of bash to Windows 10 have any impact on this?
>
> It'll be something that Windows developers can't count on their users
> having for a good while, if ever, but if you can control the deployment
> environment, then you might. And it would be VERY tempting for
> "posix-focused" developers that want to run their code on Windows.
>
> So it would be nice if the "new" approach worked well with bash on Windows.

Bash on Windows is just Linux, so it isn't affected by any of this.
Re: [Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8
On Wed, Sep 7, 2016 at 10:37 AM, Guido van Rossum wrote:
> And apart from Python, few shell commands that work on
> Unix make much sense on Windows,

Does the (optional) addition of bash to Windows 10 have any impact on this?

It'll be something that Windows developers can't count on their users
having for a good while, if ever, but if you can control the deployment
environment, then you might. And it would be VERY tempting for
"posix-focused" developers that want to run their code on Windows.

So it would be nice if the "new" approach worked well with bash on Windows.

-CHB
Re: [Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8
On 8 September 2016 at 03:37, Guido van Rossum wrote:
> On Sun, Sep 4, 2016 at 11:58 PM, Nick Coghlan wrote:
>> While calling system native apps that way will still have many
>> portability challenges, there are also plenty of cases where folks use
>> sys.executable to launch new Python processes in a separate instance
>> of the currently running interpreter, and it would be good if these
>> changes brought cross-platform consistency to the handling of binary
>> arguments here as well.
>
> I checked with Steve and this is not supported anyway -- bytes
> arguments (regardless of the value of shell) fail early with a
> TypeError. That may be a bug but there's no backwards compatibility to
> preserve here. (And apart from Python, few shell commands that work on
> Unix make much sense on Windows, so I'm also not particularly worried
> about that particular example being non-portable -- it doesn't
> represent a realistic concern.)

Cool, I suspected "That already doesn't work, so you just have to use
strings for cross-platform compatibility in those cases" would be the
answer, and I think that's a sensible way to go.

Even on *nix passing bytes arguments to subprocess is unusual, since
anyone with Python 2 based habits will omit the "b" prefix from
literals, and anything coming from the command line, environment, or
other user input is supplied as text by default.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
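A minimal sketch of the "use strings" advice above, assuming Python 3.5 or later for `subprocess.run`. The helper name `as_text_args` is hypothetical and is not part of the subprocess module; it uses `os.fsdecode()` to turn any bytes arguments into str before launching a child interpreter through `sys.executable`.

```python
import os
import subprocess
import sys


def as_text_args(args):
    """Return args with every bytes item decoded to str via os.fsdecode().

    Hypothetical helper for illustration only; not a subprocess API.
    """
    return [os.fsdecode(a) if isinstance(a, bytes) else a for a in args]


# The child prints an ASCII-only repr of its first argument, so the result
# does not depend on the console or pipe encoding.
child_code = "import sys; print(ascii(sys.argv[1]))"
args = as_text_args([sys.executable, b"-c", child_code, "\u2119\u01b4"])

result = subprocess.run(
    args,
    stdout=subprocess.PIPE,
    universal_newlines=True,  # read the child's output as text
    check=True,
)
print(result.stdout.strip())
```

Decoding up front keeps the call site identical on every platform, at the cost of relying on the filesystem encoding to interpret any bytes the caller supplied.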
Re: [Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8
On 07Sep2016 1037, Guido van Rossum wrote:
> I'm hijacking this thread to provisionally accept PEP 529. (I'll also
> do this for PEP 528, in its own thread.)
>
> I've talked things over with Steve and Victor and we're going to do an
> experiment (as now written up in the PEP:
> https://www.python.org/dev/peps/pep-0529/#beta-experiment) to tease
> out any issues with this change during the beta. If serious problems
> crop up we may have to roll back the changes and reject the PEP -- we
> won't get another chance at getting this right. (That would also mean
> that using the binary filesystem APIs will remain deprecated and will
> eventually be disallowed; as long as the PEP remains accepted they are
> undeprecated.)
>
> Congrats Steve! Thanks for the massive amount of work on the
> implementation and the thinking that went into the design. Thanks
> everyone else for their feedback.
>
> --Guido

Thanks! I've updated the status.

Now the process of bartering for code reviews begins :)

Patches are at:
PEP 528: http://bugs.python.org/issue1602
PEP 529: http://bugs.python.org/issue27781

Cheers,
Steve
Re: [Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8
I'm hijacking this thread to provisionally accept PEP 529. (I'll also
do this for PEP 528, in its own thread.)

I've talked things over with Steve and Victor and we're going to do an
experiment (as now written up in the PEP:
https://www.python.org/dev/peps/pep-0529/#beta-experiment) to tease
out any issues with this change during the beta. If serious problems
crop up we may have to roll back the changes and reject the PEP -- we
won't get another chance at getting this right. (That would also mean
that using the binary filesystem APIs will remain deprecated and will
eventually be disallowed; as long as the PEP remains accepted they are
undeprecated.)

Congrats Steve! Thanks for the massive amount of work on the
implementation and the thinking that went into the design. Thanks
everyone else for their feedback.

--Guido

PS. I have one small inline response to Nick below.

On Sun, Sep 4, 2016 at 11:58 PM, Nick Coghlan wrote:
> On 5 September 2016 at 15:59, Steve Dower wrote:
>> +continue to default to ``locale.getpreferredencoding()`` (for text files) or
>> +plain bytes (for binary files). This only affects the encoding used when users
>> +pass a bytes object to Python where it is then passed to the operating system as
>> +a path name.
>
> For the three non-filesystem cases:
>
> I checked the situation for os.environb, and that's already
> unavailable on Windows (since os.supports_bytes_environ is False
> there), while sys.argv is apparently already handled correctly (i.e.
> always using the *W APIs).
>
> That means my only open question would be the handling of subprocess
> module calls (both with and without shell=True), since that currently
> works with binary arguments on *nix:
>
> >>> subprocess.call([b"python", b"-c", "print('ℙƴ☂ℌøἤ')".encode("utf-8")])
> ℙƴ☂ℌøἤ
> 0
> >>> subprocess.call(b"python -c '%s'" % 'print("ℙƴ☂ℌøἤ")'.encode("utf-8"), shell=True)
> ℙƴ☂ℌøἤ
> 0
>
> While calling system native apps that way will still have many
> portability challenges, there are also plenty of cases where folks use
> sys.executable to launch new Python processes in a separate instance
> of the currently running interpreter, and it would be good if these
> changes brought cross-platform consistency to the handling of binary
> arguments here as well.

I checked with Steve and this is not supported anyway -- bytes
arguments (regardless of the value of shell) fail early with a
TypeError. That may be a bug but there's no backwards compatibility to
preserve here. (And apart from Python, few shell commands that work on
Unix make much sense on Windows, so I'm also not particularly worried
about that particular example being non-portable -- it doesn't
represent a realistic concern.)

--
--Guido van Rossum (python.org/~guido)
Re: [Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8
On 5 September 2016 at 15:59, Steve Dower wrote:
> +continue to default to ``locale.getpreferredencoding()`` (for text files) or
> +plain bytes (for binary files). This only affects the encoding used when users
> +pass a bytes object to Python where it is then passed to the operating system as
> +a path name.

For the three non-filesystem cases:

I checked the situation for os.environb, and that's already
unavailable on Windows (since os.supports_bytes_environ is False
there), while sys.argv is apparently already handled correctly (i.e.
always using the *W APIs).

That means my only open question would be the handling of subprocess
module calls (both with and without shell=True), since that currently
works with binary arguments on *nix:

>>> subprocess.call([b"python", b"-c", "print('ℙƴ☂ℌøἤ')".encode("utf-8")])
ℙƴ☂ℌøἤ
0
>>> subprocess.call(b"python -c '%s'" % 'print("ℙƴ☂ℌøἤ")'.encode("utf-8"), shell=True)
ℙƴ☂ℌøἤ
0

While calling system native apps that way will still have many
portability challenges, there are also plenty of cases where folks use
sys.executable to launch new Python processes in a separate instance
of the currently running interpreter, and it would be good if these
changes brought cross-platform consistency to the handling of binary
arguments here as well.

Cheers,
Nick.
Re: [Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8
I posted an update to PEP 529 at
https://github.com/python/peps/blob/master/pep-0529.txt and a diff below. The
update includes more detail on the affected code within CPython - including a
number of references to broken code that would be resolved with the change -
and more details about the necessary changes.

As with PEP 528, I don't think it's possible to predict the impact better than
I already have, and the beta period will be essential to determine whether this
change is completely unworkable. I am fully prepared to back out the change if
necessary prior to RC.

Cheers,
Steve

---

@@ -16,7 +16,8 @@
 operating system, often via C Runtime functions. However, these have been long
 discouraged in favor of the UTF-16 APIs. Within the operating system, all text
 is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
-the active code page.
+the active code page. See `Naming Files, Paths, and Namespaces`_ for
+more details.
 
 This PEP proposes changing the default filesystem encoding on Windows to utf-8,
 and changing all filesystem functions to use the Unicode APIs for filesystem
@@ -27,10 +28,10 @@
 characters outside of the user's active code page.
 
 Notably, this does not impact the encoding of the contents of files. These will
-continue to default to locale.getpreferredencoding (for text files) or plain
-bytes (for binary files). This only affects the encoding used when users pass a
-bytes object to Python where it is then passed to the operating system as a path
-name.
+continue to default to ``locale.getpreferredencoding()`` (for text files) or
+plain bytes (for binary files). This only affects the encoding used when users
+pass a bytes object to Python where it is then passed to the operating system as
+a path name.
 
 Background
 ==
@@ -44,9 +45,10 @@
 
 When paths are passed between the filesystem and the application, they are
 either passed through as a bytes blob or converted to/from str using
-``os.fsencode()`` or ``sys.getfilesystemencoding()``. The result of encoding a
-string with ``sys.getfilesystemencoding()`` is a blob of bytes in the native
-format for the default file system.
+``os.fsencode()`` and ``os.fsdecode()`` or explicit encoding using
+``sys.getfilesystemencoding()``. The result of encoding a string with
+``sys.getfilesystemencoding()`` is a blob of bytes in the native format for the
+default file system.
 
 On Windows, the native format for the filesystem is utf-16-le. The recommended
 platform APIs for accessing the filesystem all accept and return text encoded in
@@ -83,11 +85,11 @@
 canonical representation. Even if the encoding is "incorrect" by some standard,
 the file system will still map the bytes back to the file. Making use of this
 avoids the cost of decoding and reencoding, such that (theoretically, and only
-on POSIX), code such as this may be faster because of the use of `b'.'` compared
-to using `'.'`::
+on POSIX), code such as this may be faster because of the use of ``b'.'``
+compared to using ``'.'``::
 
 >>> for f in os.listdir(b'.'):
-... os.stat(f)
+... os.stat(f)
 ...
 
 As a result, POSIX-focused library authors prefer to use bytes to represent
@@ -105,32 +107,31 @@
 Currently the default filesystem encoding is 'mbcs', which is a meta-encoder
 that uses the active code page. However, when bytes are passed to the filesystem
 they go through the \*A APIs and the operating system handles encoding. In this
-case, paths are always encoded using the equivalent of 'mbcs:replace' - we have
-no ability to change this (though there is a user/machine configuration option
-to change the encoding from CP_ACP to CP_OEM, so it won't necessarily always
-match mbcs...)
+case, paths are always encoded using the equivalent of 'mbcs:replace' with no
+opportunity for Python to override or change this.
 
 This proposal would remove all use of the \*A APIs and only ever call the \*W
-APIs. When Windows returns paths to Python as str, they will be decoded from
+APIs. When Windows returns paths to Python as ``str``, they will be decoded from
 utf-16-le and returned as text (in whatever the minimal representation is). When
-Windows returns paths to Python as bytes, they will be decoded from utf-16-le to
-utf-8 using surrogatepass (Windows does not validate surrogate pairs, so it is
-possible to have invalid surrogates in filenames). Equally, when paths are
-provided as bytes, they are decoded from utf-8 into utf-16-le and passed to the
-\*W APIs.
+Python code requests paths as ``bytes``, the paths will be transcoded from
+utf-16-le into utf-8 using surrogatepass (Windows does not validate surrogate
+pairs, so it is possible to have invalid surrogates in filenames). Equally, when
+paths are provided as ``bytes``, they are trasncoded from utf-8 into utf-16-le
+and passed to the \*W APIs.
 
-The use of utf-8 will not be configurable, with the possible exception of a
-"legacy mode" environment variable or X-flag.
Re: [Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8
Nick Coghlan (ncoghlan at gmail.com) on Sat Sep 3 12:27:44 EDT 2016 wrote:
> After also reading the Windows console encoding PEP, I realised
> there's a couple of missing discussions here regarding the impacts on
> sys.argv, os.environ, and os.environb.
>
> The reason that's relevant is that "sys.getfilesystemencoding" is a
> bit of a misnomer, as it's also used to determine the assumed encoding
> of command line arguments and environment variables.

Regarding sys.argv, AFAIK Unicode arguments work well on Python 3. Even
non-BMP characters are transferred correctly.

Adam Bartoš
Re: [Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8
On 4 September 2016 at 00:49, Nick Coghlan wrote:
> On 2 September 2016 at 08:31, Steve Dower wrote:
>> This proposal would remove all use of the *A APIs and only ever call the *W
>> APIs. When Windows returns paths to Python as str, they will be decoded from
>> utf-16-le and returned as text (in whatever the minimal representation is).
>> When Windows returns paths to Python as bytes, they will be decoded from
>> utf-16-le to utf-8 using surrogatepass (Windows does not validate surrogate
>> pairs, so it is possible to have invalid surrogates in filenames). Equally,
>> when paths are provided as bytes, they are decoded from utf-8 into utf-16-le
>> and passed to the *W APIs.
>
> The overall proposal looks good to me, there's just a terminology
> glitch here: utf-8 <-> utf-16-le should either be described as
> transcoding, or else as decoding and then re-encoding. As they're both
> text codecs, there's no "decoding" operation that switches between
> them.

After also reading the Windows console encoding PEP, I realised
there's a couple of missing discussions here regarding the impacts on
sys.argv, os.environ, and os.environb.

The reason that's relevant is that "sys.getfilesystemencoding" is a
bit of a misnomer, as it's also used to determine the assumed encoding
of command line arguments and environment variables.

With the PEP currently stating that all use of the "*A" Windows APIs
will be removed, I'm guessing these will just start working as
expected, but it should be covered explicitly.

In addition, if the subprocess module is going to be excluded from
these changes, that should be called out explicitly. (Keep in mind
that on *nix, the only subprocess pipe configurations that are
straightforward to set up in Python 3 are raw binary mode and
universal newlines mode, with the latter implicitly treating the pipes
as UTF-8 text.)

Cheers,
Nick.
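To make the "misnomer" point concrete, here is a small sketch that assumes only the standard library. It shows the filesystem encoding being applied by `os.fsdecode()` and `os.fsencode()`, and checks `os.supports_bytes_environ` before touching `os.environb`. The variable name `PEP529_DEMO` is made up for illustration.

```python
import os
import sys

print("filesystem encoding:", sys.getfilesystemencoding())
print("bytes environ supported:", os.supports_bytes_environ)

# bytes -> str -> bytes goes through the filesystem encoding and its
# error handler in both directions.
raw_name = b"caf\xc3\xa9.txt"  # UTF-8 bytes for 'café.txt'
text_name = os.fsdecode(raw_name)
round_tripped = os.fsencode(text_name)

# PEP529_DEMO is a hypothetical variable name used only for this demo.
os.environ["PEP529_DEMO"] = "value"
if os.supports_bytes_environ:
    # os.environb is a bytes view that stays in sync with os.environ.
    env_bytes = os.environb[b"PEP529_DEMO"]
else:
    # No bytes view on this platform; encode the text value explicitly.
    env_bytes = os.fsencode(os.environ["PEP529_DEMO"])

print(text_name, round_tripped == raw_name, env_bytes)
```

Guarding on `os.supports_bytes_environ` rather than on `sys.platform` keeps the check tied to the capability being used.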
Re: [Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8
On 2 September 2016 at 08:31, Steve Dower wrote:
> This proposal would remove all use of the *A APIs and only ever call the *W
> APIs. When Windows returns paths to Python as str, they will be decoded from
> utf-16-le and returned as text (in whatever the minimal representation is).
> When Windows returns paths to Python as bytes, they will be decoded from
> utf-16-le to utf-8 using surrogatepass (Windows does not validate surrogate
> pairs, so it is possible to have invalid surrogates in filenames). Equally,
> when paths are provided as bytes, they are decoded from utf-8 into utf-16-le
> and passed to the *W APIs.

The overall proposal looks good to me, there's just a terminology
glitch here: utf-8 <-> utf-16-le should either be described as
transcoding, or else as decoding and then re-encoding. As they're both
text codecs, there's no "decoding" operation that switches between
them.

As far as the timing of this particular change goes, I think you make
a good case that all of the cases that will see a behaviour change
with this PEP have already been receiving deprecation warnings since
3.3, which would make it acceptable to change the default behaviour in
3.6.

Cheers,
Nick.
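The terminology point can be sketched in a few lines of Python: getting from utf-16-le bytes to utf-8 bytes is a decode to str followed by an encode. This is only an illustration of the round trip described in the quoted PEP text, not CPython's actual C implementation, and the function names are hypothetical.

```python
def utf16le_to_utf8(raw: bytes) -> bytes:
    """Transcode UTF-16-LE bytes to UTF-8, letting lone surrogates through."""
    text = raw.decode("utf-16-le", errors="surrogatepass")  # decode step
    return text.encode("utf-8", errors="surrogatepass")     # re-encode step


def utf8_to_utf16le(raw: bytes) -> bytes:
    """Transcode UTF-8 bytes back to UTF-16-LE with the same error handler."""
    text = raw.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-16-le", errors="surrogatepass")


# An ordinary name survives the round trip.
wide_name = "test\uab00.txt".encode("utf-16-le")
narrow_name = utf16le_to_utf8(wide_name)

# So does an unpaired high surrogate (U+D800), which strict UTF-8 rejects.
lone_surrogate = b"\x00\xd8"
narrow_surrogate = utf16le_to_utf8(lone_surrogate)

print(utf8_to_utf16le(narrow_name) == wide_name)
print(utf8_to_utf16le(narrow_surrogate) == lone_surrogate)
```

Using `surrogatepass` in both directions is what lets a name containing an invalid surrogate be handed back to the filesystem unchanged.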
Re: [Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8
On 1 September 2016 at 23:31, Steve Dower wrote:
[...]
> As a result, POSIX-focused library authors prefer to use bytes to represent
> paths.

A minor point, but in my experience, a lot of POSIX-focused authors
are happy to move to a better text/bytes separation, so I'd soften
this to "some POSIX-focused library authors...".

Other than that minor point, this looks great - +1 from me.

Paul
[Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8
I'm about to be offline for a few days, so I wanted to get my current
draft PEPs out so people can read and review. I don't believe there is a
lot of change as a result of either PEP, but the impact of what change
there is needs to be weighed against the benefits.

We've already had some thorough discussion on this one and failed to
reach agreement on whether we can make this change in 3.6 or if it needs
a deprecation cycle that is more visible than the one we started in 3.3.
In the latter case, we need to determine how visible that should be
(i.e. warnings visible by default, visible for non-Windows platforms,
value-dependent warnings/errors, etc.).

IMHO, the argument about having the change be on-by-default or
off-by-default is irrelevant until we decide on the deprecation issue,
at which point it is obvious what the default should be.

See https://bugs.python.org/issue27781 for the current proposed patch. I
do need to update it in order to merge against default it seems (work
for my upcoming flight).

Cheers,
Steve

---

https://github.com/python/peps/blob/master/pep-0529.txt

---

PEP: 529
Title: Change Windows filesystem encoding to UTF-8
Version: $Revision$
Last-Modified: $Date$
Author: Steve Dower
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 27-Aug-2016
Post-History: 01-Sep-2016

Abstract
========

Historically, Python uses the ANSI APIs for interacting with the Windows
operating system, often via C Runtime functions. However, these have been long
discouraged in favor of the UTF-16 APIs. Within the operating system, all text
is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
the active code page.

This PEP proposes changing the default filesystem encoding on Windows to utf-8,
and changing all filesystem functions to use the Unicode APIs for filesystem
paths. This will not affect code that uses strings to represent paths, however
those that use bytes for paths will now be able to correctly round-trip all
valid paths in Windows filesystems. Currently, the conversions between Unicode
(in the OS) and bytes (in Python) were lossy and would fail to round-trip
characters outside of the user's active code page.

Notably, this does not impact the encoding of the contents of files. These will
continue to default to locale.getpreferredencoding (for text files) or plain
bytes (for binary files). This only affects the encoding used when users pass a
bytes object to Python where it is then passed to the operating system as a path
name.

Background
==========

File system paths are almost universally represented as text with an encoding
determined by the file system. In Python, we expose these paths via a number of
interfaces, such as the ``os`` and ``io`` modules. Paths may be passed either
direction across these interfaces, that is, from the filesystem to the
application (for example, ``os.listdir()``), or from the application to the
filesystem (for example, ``os.unlink()``).

When paths are passed between the filesystem and the application, they are
either passed through as a bytes blob or converted to/from str using
``os.fsencode()`` or ``sys.getfilesystemencoding()``. The result of encoding a
string with ``sys.getfilesystemencoding()`` is a blob of bytes in the native
format for the default file system.

On Windows, the native format for the filesystem is utf-16-le. The recommended
platform APIs for accessing the filesystem all accept and return text encoded in
this format. However, prior to Windows NT (and possibly further back), the
native format was a configurable machine option and a separate set of APIs
existed to accept this format. The option (the "active code page") and these
APIs (the "*A functions") still exist in recent versions of Windows for
backwards compatibility, though new functionality often only has a utf-16-le
API (the "*W functions").

In Python, str is recommended because it can correctly round-trip all characters
used in paths (on POSIX with surrogateescape handling; on Windows because str
maps to the native representation). On Windows bytes cannot round-trip all
characters used in paths, as Python internally uses the *A functions and hence
the encoding is "whatever the active code page is". Since the active code page
cannot represent all Unicode characters, the conversion of a path into bytes can
lose information without warning or any available indication.

As a demonstration of this::

    >>> open('test\uAB00.txt', 'wb').close()
    >>> import glob
    >>> glob.glob('test*')
    ['test\uab00.txt']
    >>> glob.glob(b'test*')
    [b'test?.txt']

The Unicode character in the second call to glob has been replaced by a '?',
which means passing the path back into the filesystem will result in a
``FileNotFoundError``. The same results may be observed with ``os.listdir()`` or
any function that matches the return type to the parameter type.

While one