[issue43936] os.path.realpath() normalizes paths before resolving links on Windows

2021-04-25 Thread Eryk Sun


Eryk Sun added the comment:

> os.path.realpath() normalizes paths before resolving links 
> on Windows

Normalizing the input path is required in order to be consistent with the 
Windows file API. OTOH, the target path of a relative symlink gets resolved in 
a POSIX-ly correct manner in the kernel, and ntpath._readlink_deep() doesn't 
ensure this. 

I've attached a prototype that I wrote for a POSIX-like implementation that 
recursively resolves both the drive and the path. It uses the final path only 
as a shortcut to normalize volume GUID names as drives and the proper casing of 
UNC server and share names. However, it's considerably more work than the 
final-path approach, and more work always has the potential for more bugs. I'm 
providing it for the sake of discussion, or just for people to point to it as 
an example of what not to do... ;-)

Patching up the current implementation would probably involve extending 
_getfinalpathname() to support follow_symlinks=False. Aspects of the POSIX 
implementation would have to be adopted, but I think it can be kept relatively 
simple when integrated with _getfinalpathname(path, follow_symlinks=False). The 
latter also makes it easy to identify a UNC path. That's necessary because 
mountpoints should never be resolved in a UNC path, and the current 
implementation gets this wrong.

What this wouldn't support is resolving an inaccessible drive as much as 
possible. Mapped drives are object symlinks that expand to UNC paths that can 
include an arbitrary filepath on a share. Substitute drives by definition 
target an arbitrary filepath, and can even target other substitute and mapped 
drives. A final-path only approach would leave the inaccessible drive in the 
result, along with any symlinks that are internal to the drive.

A final-path approach also can't support targets with rooted paths or ".." 
components that traverse a mountpoint. The final path will be on the 
mountpoint's device, which will change how such relative symlinks resolve. That 
said, rooted symlink targets are almost never seen in Windows, and targets that 
traverse a mountpoint by way of a ".." component should be rare, in principle. 

One problem is the frequent use of bind mountpoints in place of symlinks in 
Windows. In CMD, bind mountpoints can be created by anyone via `mklink /j`. 
Here's a fabricated example with a mountpoint (i.e. junction) that's used where 
a symlink would normally be used.

    C:\
        work\
            foo\
                bar [junction -> C:\work\bar]
                remote [symlink -> \\baz\spam]
            bar\
                remote [symlink -> ..\remote]
            remote [symlink -> \\qux\eggs]

C:\work\foo\bar\remote normally resolves as follows:

C:\work\foo\bar\remote
-> C:\work\foo\bar + ..\remote
-> C:\work\foo\remote
-> \\baz\spam

Assume that \\baz\spam is down, so C:\work\foo\bar\remote can't be strictly 
resolved. If the non-strict algorithm relies on getting the final path of 
C:\work\foo\bar\remote before resolving the target of "remote", then the result 
for this case will be incorrect.

C:\work\foo\bar\remote
-> C:\work\bar\remote
-> C:\work\bar + ..\remote
-> C:\work\remote
-> \\qux\eggs
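
The two traces above can be reproduced with a toy model. This is only a sketch: the dictionaries and the helpers `final_path()` and `resolve()` are hypothetical names invented for illustration, nothing touches a real filesystem, and only a symlink in the last path component is handled. It assumes, as described above, that the relative target is joined to the link's parent directory and that ".." is then collapsed.

```python
import ntpath

# Hypothetical in-memory model of the fabricated tree; nothing touches the disk.
JUNCTIONS = {r"C:\work\foo\bar": r"C:\work\bar"}
SYMLINKS = {
    r"C:\work\foo\remote": r"\\baz\spam",
    r"C:\work\bar\remote": r"..\remote",
    r"C:\work\remote": r"\\qux\eggs",
}


def final_path(path):
    """Model a final path: a leading junction is replaced by its target."""
    for src, dst in JUNCTIONS.items():
        if path == src or path.startswith(src + "\\"):
            return dst + path[len(src):]
    return path


def resolve(path, final_path_first):
    """Follow a symlink in the last component; return every path visited."""
    trace = [path]
    while True:
        # The link itself is always located through the junction.
        target = SYMLINKS.get(final_path(path))
        if target is None:
            return trace
        if final_path_first and final_path(path) != path:
            # Junction is replaced before the relative target is applied.
            path = final_path(path)
            trace.append(path)
        if target.startswith("\\\\"):
            path = target  # an absolute UNC target replaces the whole path
        else:
            # Relative target: join with the link's parent, then collapse "..".
            path = ntpath.normpath(ntpath.join(ntpath.dirname(path), target))
        trace.append(path)


start = r"C:\work\foo\bar\remote"
print(" -> ".join(resolve(start, final_path_first=False)))
print(" -> ".join(resolve(start, final_path_first=True)))
```

With the junction retained the walk ends at \\baz\spam; taking the final path first ends at \\qux\eggs, which is the incorrect result described above.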

--
components: +Windows
nosy: +eryksun, paul.moore, steve.dower, tim.golden, zach.ware
versions:  -Python 3.6, Python 3.7
Added file: https://bugs.python.org/file49984/realpath_posixly.py




[issue43936] os.path.realpath() normalizes paths before resolving links on Windows

2021-04-24 Thread Barney Gale


New submission from Barney Gale:

Capturing a write-up by eryksun on GitHub into a new bug.

Link: https://github.com/python/cpython/pull/25264#pullrequestreview-631787754

> `nt._getfinalpathname()` opens a handle to a file/directory with 
> `CreateFileW()` and calls `GetFinalPathNameByHandleW()`. The latter makes a 
> few system calls to get the final opened path in the filesystem (e.g. 
> "\Windows\explorer.exe") and the canonical DOS name of the volume device on 
> which the filesystem is mounted (e.g. "\Device\HarddiskVolume2" -> "\\?\C:") 
> in order to return a canonical DOS path (e.g. "\\?\C:\Windows\explorer.exe").
> 
> Opening a handle with `CreateFileW()` entails first getting a fully-qualified 
> and normalized NT path, which, among other things, entails resolving ".." 
> components naively in the path string. This does not take reparse points such 
> as symlinks and mountpoints into account. The only time Windows parses ".." 
> components in an opened path the way POSIX does is in the kernel when they're 
> in the target path of a relative symlink.
> 
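
The effect of string-only ".." handling is easy to see with `ntpath.normpath()`, which is also purely lexical and so runs on any platform. This is an illustration of the general behaviour, not the routine the Windows API itself uses; the paths are made up.

```python
import ntpath

# Lexical only: "symlink" is dropped without checking what it points to.
collapsed = ntpath.normpath(r"foo\symlink\..\bar")
print(collapsed)

# Forward slashes, repeated separators and "." are cleaned up the same way.
cleaned = ntpath.normpath("C:/spam//eggs/./ham")
print(cleaned)
```

If "symlink" pointed into another directory, the collapsed result would name a different file than the one a POSIX-style walk reaches.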
> `nt.readlink()` opens a handle to the file with the flag 
> `FILE_FLAG_OPEN_REPARSE_POINT`. If the final path component is a reparse 
> point, it opens it instead of traversing it. Then the reparse point is read 
> with the filesystem control request, `FSCTL_GET_REPARSE_POINT`. System 
> symlinks and mountpoints (`IO_REPARSE_TAG_SYMLINK` and 
> `IO_REPARSE_TAG_MOUNT_POINT`) are the only supported name-surrogate 
> reparse-point types, though `os.stat()` and `os.lstat()` handle all 
> name-surrogate types as 'links'. Moreover, only symlinks get the `S_IFLNK` 
> mode flag in a stat result, because they're the only ones we can create with 
> `os.symlink()` to satisfy the usage `if os.path.islink(src): 
> os.symlink(os.readlink(src), dst)`.
>
> > What would it take to do a POSIX-style "normalize as we resolve",
> > and would we want to? I guess we'd need to call nt._getfinalpathname()
> > on each path component in turn (C:, C:\Users, C:\Users\Barney etc),
> > which from my pretty basic Windows knowledge might be rather slow if
> > that involves file handles.
> 
> You asked, so I decided to write up an outline of what implementing a 
> POSIX-style `realpath()` might look like in Windows. At its core, it's 
> similar to POSIX: lstat(), and, for a symlink, readlink() and recur. The 
> equivalent calls in Windows are the following:
> 
> * `CreateFileW()` (open a handle)
> 
> * `GetFileInformationByHandleEx()`: `FileAttributeTagInfo`
> 
> * `DeviceIoControl()`: `FSCTL_GET_REPARSE_POINT`
> 
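
For comparison, here is a minimal sketch of the POSIX side of that core loop (lstat, readlink, recurse), written for POSIX systems only. It is not the attached prototype and handles none of the Windows specifics (drives, UNC paths, mountpoints); `resolve_posix_style()` and its `max_links` limit are hypothetical names chosen for illustration.

```python
import errno
import os


def resolve_posix_style(path, max_links=40):
    """Resolve symlinks one component at a time (POSIX-only sketch)."""
    if not os.path.isabs(path):
        path = os.path.join(os.getcwd(), path)
    resolved = os.sep                      # link-free prefix built so far
    pending = [c for c in path.split(os.sep) if c]
    links_seen = 0
    while pending:
        name = pending.pop(0)
        if name == os.curdir:
            continue
        if name == os.pardir:
            # `resolved` holds no symlinks, so its parent is the real parent.
            resolved = os.path.dirname(resolved)
            continue
        candidate = os.path.join(resolved, name)
        if os.path.islink(candidate):      # lstat()
            links_seen += 1
            if links_seen > max_links:
                raise OSError(errno.ELOOP, os.strerror(errno.ELOOP), path)
            target = os.readlink(candidate)  # readlink()
            if os.path.isabs(target):
                resolved = os.sep
            # Walk the target's components before the remaining ones.
            pending[:0] = [c for c in target.split(os.sep) if c]
        else:
            # Not a link (or missing): keep the component as-is (non-strict).
            resolved = candidate
    return resolved
```

Because ".." is applied only to the already-resolved prefix, a path such as "link/../x" lands next to the link's target rather than next to the link.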
> 
> A symlink has the reparse tag `IO_REPARSE_TAG_SYMLINK`.
> 
> Filesystem mountpoints (aka junctions, which are like Unix bind mountpoints) 
> must be retained in the resolved path in order to correctly resolve relative 
> symlinks such as "\spam" (relative to the resolved device) and "..\..\spam". 
> Anyway, this is consistent with the UNC case, since mountpoints on a remote 
> server can never be resolved (i.e. a final UNC path never resolves 
> mountpoints).
> 
> Here are some of the notable differences compared to POSIX:
> 
> * If the source path is not a "\\?\" verbatim path, `GetFullPathNameW()` 
> must be called initially.  However, ".." components in the target path of a 
> relative symlink must be resolved the POSIX way, else symlinks in the target 
> path may be removed incorrectly before their target is resolved (e.g. 
> "foo\symlink\..\bar" incorrectly resolved as "foo\bar"). The opened path is 
> initially normalized as follows:
>
>   * replace forward slashes with backslashes
>   * collapse repeated backslashes (except the UNC root must have exactly 
>     two backslashes)
>   * resolve a relative path (e.g. "spam"), drive-relative path (e.g. 
>     "Z:spam"), or rooted path (e.g. "\spam") as a fully-qualified path (e.g. 
>     "Z:\eggs\spam")
>   * resolve "." and ".." components in the opened path (naive to symlinks)
>   * strip trailing spaces and dots from the final component (e.g. 
>     "C:\spam. . ." -> "C:\spam")
>   * resolve reserved device names in the final component of a non-UNC 
>     path (e.g. "C:\nul" -> "\\.\nul")
> 
> * Substitute drives (e.g. created by "subst.exe", or `DefineDosDeviceW`) 
> and mapped drives (e.g. created by "net.exe", or `WNetAddConnection2W`) must 
> be resolved, respectively via `QueryDosDeviceW()` and 
> `WNetGetUniversalNameW()`. Like all DOS 'devices', these drives are 
> implemented as object symlinks (i.e. symlinks in the object namespace, not to 
> be confused with filesystem symlinks). The target path of these drives, 
> however, is not a Device object, but rather a filesystem path on a device 
> that can include any number of path components, some of which may be 
> filesystem symlinks that need to be resolved. Normally when a path is opened, 
> the system object manager reparses all DOS 'devices' to the path of an actual 
> Device object, or a path on a Device object,