[issue43395] os.path states that bytes can't represent all MBCS paths under Windows

2021-03-17 Thread STINNER Victor


Change by STINNER Victor :


--
nosy:  -vstinner

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43395] os.path states that bytes can't represent all MBCS paths under Windows

2021-03-17 Thread Eryk Sun


Change by Eryk Sun :


--
versions:  -Python 3.6, Python 3.7




[issue43395] os.path states that bytes can't represent all MBCS paths under Windows

2021-03-05 Thread Eryk Sun

Eryk Sun  added the comment:

>  instead of the stated 'surrogatepass'

In Python 3.6 and above, you can check this in Windows as follows:

>>> sys.getfilesystemencoding()
'utf-8'
>>> sys.getfilesystemencodeerrors()
'surrogatepass'

In Python 3.5 and earlier:

>>> sys.getfilesystemencoding()
'mbcs'

In 3.5, the error handler used by fsencode() and fsdecode() was hard-coded as 
'strict' for the 'mbcs' encoding, and as 'surrogateescape' otherwise.
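
As a rough sketch of what this means in 3.6 and later: for a str argument, 
os.fsencode() should give the same result as str.encode() called with the two 
values reported by sys. The filename below is a made-up ASCII example, and the 
printed encoding and handler depend on the platform it runs on.

```python
import os
import sys

# Values reported by the running interpreter; they differ by platform.
encoding = sys.getfilesystemencoding()
errors = sys.getfilesystemencodeerrors()

name = "example.txt"  # hypothetical filename, plain ASCII
encoded = os.fsencode(name)

# For a str argument, fsencode() behaves like str.encode() with these values.
manual = name.encode(encoding, errors)
print(encoding, errors, encoded == manual)
```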

> https://docs.python.org/3/library/os.html#os.fsencode
> https://docs.python.org/3/library/os.html#os.fsdecode

The above documentation needs to be updated to reference 
sys.getfilesystemencodeerrors(), as do the doc strings:

>>> print(textwrap.dedent(os.fsencode.__doc__))

Encode filename to the filesystem encoding with 'surrogateescape' error
handler, return bytes unchanged. On Windows, use 'strict' error handler if
the file system encoding is 'mbcs' (which is the default encoding).

>>> print(textwrap.dedent(os.fsdecode.__doc__))

Decode filename from the filesystem encoding with 'surrogateescape' error
handler, return str unchanged. On Windows, use 'strict' error handler if
the file system encoding is 'mbcs' (which is the default encoding).

> https://docs.python.org/3/library/os.html#file-names-command-line-arguments-and-environment-variables

This should be rewritten to link to sys.getfilesystemencodeerrors(). I'm fine 
with only discussing the use of "surrogateescape", which is a significant 
concern in POSIX systems, where it is easy and common for filenames to be 
created with an arbitrary encoding. 

I don't know if the use of "surrogatepass" in Windows warrants discussion. The 
error handler is rarely needed because the filesystem is Unicode. A user is 
unlikely to create a filename with an unpaired surrogate code. 

That said, before Windows 10, the legacy console allowed copying half of a 
surrogate pair to the clipboard, and a program could have a bug that nulls the 
second surrogate code in the pair (e.g. when limiting the length of a 
filename). Anyway, it's technically possible, so we support it. For example, 
"😈" (U+0001F608) is encoded in UTF-16 as the pair (U+D83D, U+DE08). A filename 
could end up with only the first of the two codes:

>>> open('devil\ud83d', 'w').close()
>>> print(ascii(os.listdir('.')[0]))
'devil\ud83d'
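
The surrogate arithmetic above can be reproduced without touching the 
filesystem. This is only an illustrative sketch: utf16_surrogates() is a 
hypothetical helper, not part of the standard library, and the explicit 
'utf-8' codec with 'surrogatepass' stands in for the Windows filesystem 
encoding so that it runs on any platform.

```python
def utf16_surrogates(ch):
    """Return the (high, low) UTF-16 surrogate pair for a non-BMP character."""
    offset = ord(ch) - 0x10000
    if not 0 <= offset <= 0xFFFFF:
        raise ValueError("character is in the BMP; no surrogate pair needed")
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


high, low = utf16_surrogates("\U0001F608")
print(f"U+{high:04X} U+{low:04X}")

# A lone high surrogate is not valid UTF-8, but 'surrogatepass' still
# round-trips it through a 3-byte sequence.
lone = chr(high)
raw = lone.encode("utf-8", "surrogatepass")
roundtrip = raw.decode("utf-8", "surrogatepass")
print(raw, roundtrip == lone)
```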

--




[issue43395] os.path states that bytes can't represent all MBCS paths under Windows

2021-03-04 Thread Eric L.


Eric L.  added the comment:

Very confusing but very interesting. I'm trying to follow because I'm the main 
maintainer of the rdiff-backup software, which is cross-platform, so these 
small differences might become important.

Now, looking into the docs, following your explanations, I noticed that 
https://docs.python.org/3/library/os.html#os.fsencode and 
https://docs.python.org/3/library/os.html#os.fsdecode state that the 'strict' 
error handler is used under Windows instead of the stated 'surrogatepass'. 
Again an issue with the documentation?

Also, the 2nd paragraph of 
https://docs.python.org/3.8/library/os.html#file-names-command-line-arguments-and-environment-variables
 speaks only of surrogateescape and doesn't distinguish between POSIX and 
Windows.

Very interesting but very confusing...

--




[issue43395] os.path states that bytes can't represent all MBCS paths under Windows

2021-03-04 Thread Eryk Sun


Eryk Sun  added the comment:

> Vice versa, using bytes objects cannot represent all file names 
> on Windows (in the standard mbcs encoding), hence Windows 
> applications should use string objects to access all files.

This is outdated advice that should be removed, or at least reworded to 
emphasize that the 'mbcs' encoding is only used in legacy mode, with a link to 
the documentation of sys._enablelegacywindowsfsencoding [1].
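
For context, a small sketch of why the old sentence held under a legacy ANSI 
code page. The 'mbcs' codec exists only in Windows, so 'cp1252' is used here 
as an assumed stand-in for one such code page, and the filename is 
hypothetical.

```python
# Hypothetical filename starting with GREEK CAPITAL LETTER OMEGA (U+03A9).
name = "\u03a9.txt"

try:
    # 'cp1252' is an assumed stand-in for a legacy ANSI code page.
    name.encode("cp1252")
    representable = True
except UnicodeEncodeError:
    # The code page has no byte sequence for this character.
    representable = False

# UTF-8 can encode the same name without loss.
utf8_bytes = name.encode("utf-8")
print(representable, utf8_bytes)
```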

Starting in Python 3.6, the default filesystem encoding in Windows is UTF-8. 
Internally, what happens is that a UTF-8 byte string gets translated to UTF-16 
(2 or 4 bytes per character), the native Unicode encoding of the Windows API. 
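
To make the "2 or 4 bytes per character" point concrete, here is a minimal 
comparison of the two encodings. The sample characters are arbitrary choices, 
and 'utf-16-le' is used so that no byte order mark is counted.

```python
sizes = {}
for ch in ("a", "\u00e9", "\u20ac", "\U0001F608"):
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")  # little-endian, no BOM
    sizes[ch] = (len(utf8), len(utf16))
    print(f"U+{ord(ch):04X}: {len(utf8)} byte(s) in UTF-8, "
          f"{len(utf16)} byte(s) in UTF-16")
```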

A caveat is that Windows filesystems use 16-bit characters that are not 
restricted to valid Unicode. In particular, ordinals U+D800-U+DFFF are not 
restricted to use in surrogate pairs, so a name can contain an unpaired 
surrogate. This is "Wobbly" Unicode, and the filesystem encoding thus needs to 
be "Wobbly Transformation Format, 8-bit" (WTF-8). This is implemented in Python 
by setting the filesystem error handler to "surrogatepass", in contrast to 
using "surrogateescape" in POSIX. For example, os.fsencode('\ud800') succeeds 
in Windows but fails in POSIX, while os.fsdecode(b'\x80') fails in Windows but 
succeeds in POSIX. The latter case is not a practical problem since filesystem 
functions will never return an invalid WTF-8 byte string.
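
The difference between the two handlers can be tried on any platform by 
naming the codec and handler explicitly instead of going through 
os.fsencode() and os.fsdecode(). This is a sketch under that assumption; 
try_codec() is a hypothetical helper used only to capture the outcome.

```python
def try_codec(func, *args):
    """Return func(*args), or the exception class name if the codec fails."""
    try:
        return func(*args)
    except UnicodeError as exc:
        return type(exc).__name__


lone_surrogate = "\ud800"
stray_byte = b"\x80"

# Handler used for the filesystem encoding in Windows.
pass_enc = try_codec(lone_surrogate.encode, "utf-8", "surrogatepass")
pass_dec = try_codec(stray_byte.decode, "utf-8", "surrogatepass")

# Handler used for the filesystem encoding in POSIX.
esc_enc = try_codec(lone_surrogate.encode, "utf-8", "surrogateescape")
esc_dec = try_codec(stray_byte.decode, "utf-8", "surrogateescape")

print("surrogatepass:  ", pass_enc, pass_dec)
print("surrogateescape:", esc_enc, ascii(esc_dec))
```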

---
[1] 
https://docs.python.org/3/library/sys.html#sys._enablelegacywindowsfsencoding

--
components: +Unicode, Windows
nosy: +eryksun, ezio.melotti, paul.moore, steve.dower, tim.golden, vstinner, 
zach.ware




[issue43395] os.path states that bytes can't represent all MBCS paths under Windows

2021-03-03 Thread Eric L.


New submission from Eric L. :

The os.path documentation at https://docs.python.org/3/library/os.path.html 
states that:

> Vice versa, using bytes objects cannot represent all file names on Windows 
> (in the standard mbcs encoding), hence Windows applications should use string 
> objects to access all files.

This doesn't sound right and is at least misleading, because anything can be 
represented as bytes: everything (in a computer) is bytes at the end of the 
day, unless mbcs is really using something like half-bytes, which I couldn't 
find any sign of (skimming through the documentation, Microsoft seems to 
interpret it as DBCS, one or two bytes).

I could imagine that the meaning is that some byte combinations can't be used 
as a path under Windows, but I just don't know, and that wouldn't be a valid 
reason not to use bytes under Windows (IMHO).

--
assignee: docs@python
components: Documentation
messages: 388077
nosy: docs@python, ericzolf
priority: normal
severity: normal
status: open
title: os.path states that bytes can't represent all MBCS paths under Windows
type: enhancement
versions: Python 3.10, Python 3.6, Python 3.7, Python 3.8, Python 3.9
