Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
Very much agreed that this isn't necessary for just readdir/FindNext errors. We've never had this level of detail before -- if listdir() fails half way through (very unlikely) it just bombs with OSError and you get no entries at all. If you really really want this (again very unlikely), you can always use call next() directly and catch OSError around that call. Agreed - I think the PEP should point this out explicitly, and show that the approach it takes offers a lot of flexibility in error handling from just let it fail, to a single try/catch around the whole loop, to try/catch just around the operations that might call lstat(), to try/catch around the individual iteration steps. Good point. It'd be good to mention this explicitly in the PEP and have another example or two of the different levels of errors handling. os.walk remains the higher level API that most code should be using, and that has to retain the current listdir based behaviour (any error = ignore all entries in that directory) for backwards compatibility reasons. Yes, definitely. -Ben ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 11 Jul 2014 12:46, Ben Hoyt benh...@gmail.com wrote: [replying to python-dev this time] The onerror approach can also deal with readdir failing, which the PEP currently glosses over. Do we want this, though? I can see an error handler for individual entries, but if one of the *dir commands fails that would seem to be fairly catastrophic. Very much agreed that this isn't necessary for just readdir/FindNext errors. We've never had this level of detail before -- if listdir() fails half way through (very unlikely) it just bombs with OSError and you get no entries at all. If you really really want this (again very unlikely), you can always use call next() directly and catch OSError around that call. Agreed - I think the PEP should point this out explicitly, and show that the approach it takes offers a lot of flexibility in error handling from just let it fail, to a single try/catch around the whole loop, to try/catch just around the operations that might call lstat(), to try/catch around the individual iteration steps. os.walk remains the higher level API that most code should be using, and that has to retain the current listdir based behaviour (any error = ignore all entries in that directory) for backwards compatibility reasons. Cheers, Nick. -Ben ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
[replying to python-dev this time] The onerror approach can also deal with readdir failing, which the PEP currently glosses over. Do we want this, though? I can see an error handler for individual entries, but if one of the *dir commands fails that would seem to be fairly catastrophic. Very much agreed that this isn't necessary for just readdir/FindNext errors. We've never had this level of detail before -- if listdir() fails half way through (very unlikely) it just bombs with OSError and you get no entries at all. If you really really want this (again very unlikely), you can always use call next() directly and catch OSError around that call. -Ben ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 10 July 2014 01:23, Victor Stinner victor.stin...@gmail.com wrote: As a Windows user with only a superficial understanding of how symlinks should behave, (...) FYI Windows also supports symbolic links since Windows Vista. The feature is unknown because it is restricted to the administrator account. Try the mklink command in a terminal (cmd.exe) ;-) http://en.wikipedia.org/wiki/NTFS_symbolic_link ... To be honest, I never created a symlink on Windows. But since it is supported, you need to know it to write correctly your Windows code. I know how symlinks *do* behave, and I know how Windows supports them. What I meant was that, because Windows typically makes little use of symlinks, I have little or no intuition of what feels natural to people using an OS where symlinks are common. As someone (Tim?) pointed out later in the thread, FindFirstFile/FindNextFile doesn't follow symlinks by default (and nor do the dirent entries on Unix). So whether or not it's natural, the free functionality provided by the OS is that of lstat, not that of stat. Presumably because it's possible to build symlink-following code on top of non-following code, but not the other way around. Paul ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 10 July 2014 17:04, Paul Moore p.f.mo...@gmail.com wrote: On 10 July 2014 01:23, Victor Stinner victor.stin...@gmail.com wrote: As a Windows user with only a superficial understanding of how symlinks should behave, (...) FYI Windows also supports symbolic links since Windows Vista. The feature is unknown because it is restricted to the administrator account. Try the mklink command in a terminal (cmd.exe) ;-) http://en.wikipedia.org/wiki/NTFS_symbolic_link ... To be honest, I never created a symlink on Windows. But since it is supported, you need to know it to write correctly your Windows code. I know how symlinks *do* behave, and I know how Windows supports them. What I meant was that, because Windows typically makes little use of symlinks, I have little or no intuition of what feels natural to people using an OS where symlinks are common. As someone (Tim?) pointed out later in the thread, FindFirstFile/FindNextFile doesn't follow symlinks by default (and nor do the dirent entries on Unix). It wasn't me (I didn't even see it - lost in the noise). So whether or not it's natural, the free functionality provided by the OS is that of lstat, not that of stat. Presumably because it's possible to build symlink-following code on top of non-following code, but not the other way around. For most uses the most natural thing is to follow symlinks (e.g. opening a symlink in a text editor should open the target). However, I think not following symlinks by default is better approach for exactly the reason Paul has noted above. Tim Delaney ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
2014-07-10 9:04 GMT+02:00 Paul Moore p.f.mo...@gmail.com: As someone (Tim?) pointed out later in the thread, FindFirstFile/FindNextFile doesn't follow symlinks by default (and nor do the dirent entries on Unix). So whether or not it's natural, the free functionality provided by the OS is that of lstat, not that of stat. Presumably because it's possible to build symlink-following code on top of non-following code, but not the other way around. DirEntry methods will remain free (no syscall) for directories and regular files. One extra syscall will be needed only for symlinks, which are more rare than other file types (for example, you wrote Windows typically makes little use of symlinks). See my pseudo-code: https://mail.python.org/pipermail/python-dev/2014-July/135439.html On Windows, _lstat and _stat attributes will be filled directly in the constructor on Windows for regular files and directories. Victor ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 10 Jul 2014 03:39, Victor Stinner victor.stin...@gmail.com wrote: 2014-07-10 9:04 GMT+02:00 Paul Moore p.f.mo...@gmail.com: As someone (Tim?) pointed out later in the thread, FindFirstFile/FindNextFile doesn't follow symlinks by default (and nor do the dirent entries on Unix). So whether or not it's natural, the free functionality provided by the OS is that of lstat, not that of stat. Presumably because it's possible to build symlink-following code on top of non-following code, but not the other way around. DirEntry methods will remain free (no syscall) for directories and regular files. One extra syscall will be needed only for symlinks, which are more rare than other file types (for example, you wrote Windows typically makes little use of symlinks). The info we want for scandir is that of the *link itself*. That makes it easy to implement things like the followlinks flag of os.walk. The *far end* of the link isn't relevant at this level. The docs just need to be clear that DirEntry objects always match lstat(), never stat(). Cheers, Nick. See my pseudo-code: https://mail.python.org/pipermail/python-dev/2014-July/135439.html On Windows, _lstat and _stat attributes will be filled directly in the constructor on Windows for regular files and directories. Victor ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
DirEntry methods will remain free (no syscall) for directories and regular files. One extra syscall will be needed only for symlinks, which are more rare than other file types (for example, you wrote Windows typically makes little use of symlinks). The info we want for scandir is that of the *link itself*. That makes it easy to implement things like the followlinks flag of os.walk. The *far end* of the link isn't relevant at this level. The docs just need to be clear that DirEntry objects always match lstat(), never stat(). Yeah, I agree with this. It makes the function (and documentation and implementation) quite a lot simpler to understand. scandir() is a lowish-level function which deals with the directory entries themselves, and mirrors both Windows FindNextFile and POSIX readdir() in that. If the user wants follow-links behaviour, they can easily call os.stat() themselves. If this is clearly documented that seems much simpler to me (and it also seems implicit to me in the fact that you're calling is_dir() on the *entry*). Otherwise we might as well go down the route of -- the objects returned are just like pathlib.Path(), but with stat() and lstat() cached on first use. -Ben ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 07/10/2014 06:58 AM, Nick Coghlan wrote: The info we want for scandir is that of the *link itself*. That makes it easy to implement things like the followlinks flag of os.walk. The *far end* of the link isn't relevant at this level. This also mirrors listdir, correct? scandir is simply* returning something smarter than a string. The docs just need to be clear that DirEntry objects always match lstat(), never stat(). Agreed. -- ~Ethan~ * As well as being a less resource-intensive generator. :) ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 07/09/2014 09:02 PM, Nick Coghlan wrote: On 9 Jul 2014 17:14, Ethan Furman wrote: I like the 'onerror' API better primarily because it gives a single point to deal with the errors. [...] The onerror approach can also deal with readdir failing, which the PEP currently glosses over. Do we want this, though? I can see an error handler for individual entries, but if one of the *dir commands fails that would seem to be fairly catastrophic. I'm somewhat inclined towards the current approach in the PEP, but I'd like to see an explanation of two aspects: 1. How a scandir variant with an 'onerror' option could be implemented given the version in the PEP Here's a stab at it: def scandir_error(path, info=None, onerror=None): for entry in scandir(path): if info == 'type': try: entry.is_dir() except OSError as exc: if onerror is None: raise if not onerror(exc, entry): continue elif info == 'lstat': try: entry.lstat() except OSError as exc: if onerror is None: raise if not onerror(exc, entry): continue yield entry Here it is again with an attempt to deal with opendir/readdir/closedir exceptions: def scandir_error(path, info=None, onerror=None): entries = scandir(path) try: entry = next(entries) except StopIteration: # pass it through raise except Exception as exc: if onerror is None: raise if not onerror(exc, 'what else here?'): # what do we do on False? # what do we do on True? else: for entry in scandir(path): if info == 'type': try: entry.is_dir() except OSError as exc: if onerror is None: raise if not onerror(exc, entry): continue elif info == 'lstat': try: entry.lstat() except OSError as exc: if onerror is None: raise if not onerror(exc, entry): continue yield entry 2. How the existing scandir module handles the 'onerror' parameter to its directory walking function Here's the first third of it from the repo: def walk(top, topdown=True, onerror=None, followlinks=False): Like os.walk(), but faster, as it uses scandir() internally. # Determine which are files and which are directories dirs = [] nondirs = [] try: for entry in scandir(top): if entry.is_dir(): dirs.append(entry) else: nondirs.append(entry) except OSError as error: if onerror is not None: onerror(error) return ... -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
In this case because the names are exactly the same as the os versions which /do/ make a system call. Fair enough. So if I'm finally understanding the root problem here: - listdir returns a list of strings, one for each filename and one for each directory, and keeps no other O/S supplied info. - os.walk, which uses listdir, then needs to go back to the O/S and refetch the thrown-away information - so it's slow. ... and the new problem: - not all O/Ses provide the same (or any) extra info about the directory entries Have I got that right? Yes, that's exactly right. If so, I still like the attribute idea better (surprise!), we just need to revisit the 'ensure_lstat' (or whatever it's called) parameter: instead of a true/false value, it could have a scale: - 0 = whatever the O/S gives us - 1 = at least the is_dir/is_file (whatever the other normal one is), and if the O/S doesn't give it to us for free than call lstat - 2 = we want it all -- call lstat if necessary on this platform After all, the programmer should know up front how much of the extra info will be needed for the work that is trying to be done. Yeah, I think this is a good idea to make option #2 a bit nicer. I don't like the magic constants, and using constants like os.SCANDIR_LSTAT is annoying, so how about using strings? I also suggest calling the parameter info (because it determines what info is returned), so you'd do scandir(path, info='type') if you need just the is_X type information. I also think it's nice to have a way for power users to just return what the OS gives us. However, I think making this the default is a bad idea, as it's just asking for cross-platform bugs (and it's easy to prevent). Paul Moore basically agrees with this in his reply yesterday, though I disagree with him it would be unfriendly to fail hard unless you asked for the info -- quite the opposite, Linux users would think it very unfriendly when your code broke because you didn't ask for the info. :-) So how about tweaking option #2 a tiny bit more to this: def scandir(path='.', info=None, onerror=None): ... * if info is None (the default), only the .name and .full_name attributes are present * if info is 'type', scandir ensures the is_dir/is_file/is_symlink attributes are present and either True or False * if info is 'lstat', scandir additionally ensures a .lstat is present and is a full stat_result object * if info is 'os', scandir returns the attributes the OS provides (everything on Windows, only is_X -- most of the time -- on POSIX) * if onerror is not None and errors occur during any internal lstat() call, onerror(exc) is called with the OSError exception object Further point -- because the is_dir/is_file/is_symlink attributes are booleans, it would be very bad for them to be present but None if you didn't ask for (or the OS didn't return) the type information. Because then if entry.is_dir: would be None and your code would think it wasn't a directory, when actually you don't know. For this reason, all attributes should fail with AttributeError if not fetched. Thank you for writing scandir, and this PEP. Excellent work. Thanks! Oh, and +1 for option 2, slightly modified. :) With the above tweaks, I'm getting closer to being 50/50. It's probably 60% #1 and 40% #2 for me now. :-) Okay folks -- please respond: option #1 as per the current PEP 471, or option #2 with Ethan's multi-level thing tweaks as per the above? -Ben ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
2014-07-08 22:09 GMT+02:00 Ben Hoyt benh...@gmail.com: I think you're misunderstanding is_dir() and is_file(), as these don't actually call os.stat(). All DirEntry methods either call nothing or os.lstat() to get the stat info on the entry itself (not the destination of the symlink). Oh. Extract of your PEP: is_dir(): like os.path.isdir(), but much cheaper. genericpath.isdir() and genericpath.isfile() use os.stat(), whereas posixpath.islink() uses os.lstat(). Is it a mistake in the PEP? Ah, you're dead right -- this is basically a bug in the PEP, as DirEntry.is_dir() is not like os.path.isdir() in that it is based on the entry itself (like lstat), not following the link. I'll improve the wording here and update the PEP. Ok, so it means that your example grouping files per type, files and directories, is also wrong. Or at least, it behaves differently than os.walk(). You should put symbolic links to directories in the dirs list too. if entry.is_dir(): # is_dir() checks os.lstat() dirs.append(entry) elif entry.is_symlink() and os.path.isdir(entry): # isdir() checks os.stat() dirs.append(entry) else: non_dirs.append(entry) Victor ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
Ok, so it means that your example grouping files per type, files and directories, is also wrong. Or at least, it behaves differently than os.walk(). You should put symbolic links to directories in the dirs list too. if entry.is_dir(): # is_dir() checks os.lstat() dirs.append(entry) elif entry.is_symlink() and os.path.isdir(entry): # isdir() checks os.stat() dirs.append(entry) else: non_dirs.append(entry) Yes, good call. I believe I'm doing this wrong in the scandir.py os.walk() implementation too -- hence this open issue: https://github.com/benhoyt/scandir/issues/4 -Ben ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 9 July 2014 13:48, Ben Hoyt benh...@gmail.com wrote: Okay folks -- please respond: option #1 as per the current PEP 471, or option #2 with Ethan's multi-level thing tweaks as per the above? I'm probably about 50/50 at the moment. What will swing it for me is likely error handling, so let's try both approaches with some error handling: Rules are that we calculate the total size of all files in a tree (as returned from lstat), with files that fail to stat being logged and their size assumed to be 0. Option 1: def get_tree_size(path): total = 0 for entry in os.scandir(path): try: isdir = entry.is_dir() except OSError: logger.warn(Cannot stat {}.format(entry.full_name)) continue if entry.is_dir(): total += get_tree_size(entry.full_name) else: try: total += entry.lstat().st_size except OSError: logger.warn(Cannot stat {}.format(entry.full_name)) return total Option 2: def log_err(exc): logger.warn(Cannot stat {}.format(exc.filename)) def get_tree_size(path): total = 0 for entry in os.scandir(path, info='lstat', onerror=log_err): if entry.is_dir: total += get_tree_size(entry.full_name) else: total += entry.lstat.st_size return total On this basis, #2 wins. However, I'm slightly uncomfortable using the filename attribute of the exception in the logging, as there is nothing in the docs saying that this will give a full pathname. I'd hate to see Unable to stat __init__.py!!! So maybe the onerror function should also receive the DirEntry object - which will only have the name and full_name attributes, but that's all that is needed. OK, looks like option #2 is now my preferred option. My gut instinct still rebels over an API that deliberately throws information away in the default case, even though there is now an option to ask it to keep that information, but I see the logic and can learn to live with it. Paul ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
Option 2: def log_err(exc): logger.warn(Cannot stat {}.format(exc.filename)) def get_tree_size(path): total = 0 for entry in os.scandir(path, info='lstat', onerror=log_err): if entry.is_dir: total += get_tree_size(entry.full_name) else: total += entry.lstat.st_size return total On this basis, #2 wins. That's a pretty nice comparison, and you're right, onerror handling is nicer here. However, I'm slightly uncomfortable using the filename attribute of the exception in the logging, as there is nothing in the docs saying that this will give a full pathname. I'd hate to see Unable to stat __init__.py!!! Huh, you're right. I think this should be documented in os.walk() too. I think it should be the full filename (is it currently?). So maybe the onerror function should also receive the DirEntry object - which will only have the name and full_name attributes, but that's all that is needed. That's an interesting idea -- though enough of a deviation from os.walk()'s onerror that I'm uncomfortable with it -- I'd rather just document that the onerror exception .filename is the full path name. One issue with option #2 that I just realized -- does scandir yield the entry at all if there's a stat error? It can't really, because the caller will except the .lstat attribute to be set (assuming he asked for type='lstat') but it won't be. Is effectively removing these entries just because the stat failed a problem? I kind of think it is. If so, is there a way to solve it with option #2? OK, looks like option #2 is now my preferred option. My gut instinct still rebels over an API that deliberately throws information away in the default case, even though there is now an option to ask it to keep that information, but I see the logic and can learn to live with it. In terms of throwing away info in the default case -- it's simply a case of getting what you ask for. :-) Worst case, you'll write your code and test it, it'll fail hard on any system, you'll fix it immediately, and then it'll work on any system. -Ben ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 9 July 2014 14:22, Ben Hoyt benh...@gmail.com wrote: So maybe the onerror function should also receive the DirEntry object - which will only have the name and full_name attributes, but that's all that is needed. That's an interesting idea -- though enough of a deviation from os.walk()'s onerror that I'm uncomfortable with it -- I'd rather just document that the onerror exception .filename is the full path name. But the onerror exception will come from the lstat call, so it'll be a raw OSError (unless scandir modifies it, which may be what you're thinking of). And if so, aren't we at the mercy of what the OS module gives us? That's why I said we can't guarantee it. I looked at the documentation of OSError (in Built In Exceptions), and all it says is the filename (unqualified). I'd expect that to be whatever got passed to the underlying OS API - which may well be an absolute pathname if we're lucky, but who knows? (I'd actually prefer it if OSError guaranteed a full pathname, but that's a bigger issue...) Paul ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 07/09/2014 05:48 AM, Ben Hoyt wrote: So how about tweaking option #2 a tiny bit more to this: def scandir(path='.', info=None, onerror=None): ... * if info is None (the default), only the .name and .full_name attributes are present * if info is 'type', scandir ensures the is_dir/is_file/is_symlink attributes are present and either True or False * if info is 'lstat', scandir additionally ensures a .lstat is present and is a full stat_result object * if info is 'os', scandir returns the attributes the OS provides (everything on Windows, only is_X -- most of the time -- on POSIX) I would rather have the default for info be 'os': cross-platform is good, but there is no reason to force it on some poor script that is meant to run on a local machine and will never leave it. * if onerror is not None and errors occur during any internal lstat() call, onerror(exc) is called with the OSError exception object As Paul mentioned, 'onerror(exc, DirEntry)' would be better. Further point -- because the is_dir/is_file/is_symlink attributes are booleans, it would be very bad for them to be present but None if you didn't ask for (or the OS didn't return) the type information. Because then if entry.is_dir: would be None and your code would think it wasn't a directory, when actually you don't know. For this reason, all attributes should fail with AttributeError if not fetched. Fair point, and agreed. -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 07/09/2014 06:22 AM, Ben Hoyt wrote: One issue with option #2 that I just realized -- does scandir yield the entry at all if there's a stat error? It can't really, because the caller will expect the .lstat attribute to be set (assuming he asked for type='lstat') but it won't be. Is effectively removing these entries just because the stat failed a problem? I kind of think it is. If so, is there a way to solve it with option #2? Leave it up to the onerror handler. If it returns None, skip yielding the entry, otherwise yield whatever it returned -- which also means the error handler should be able to set fields on the DirEntry: def log_err(exc, entry): logger.warn(Cannot stat {}.format(exc.filename)) entry.lstat.st_size = 0 return True def get_tree_size(path): total = 0 for entry in os.scandir(path, info='lstat', onerror=log_err): if entry.is_dir: total += get_tree_size(entry.full_name) else: total += entry.lstat.st_size return total This particular example doesn't benefit much from the addition, but this way we don't have to guess what the programmer wants or needs to do in the case of failure. -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 8 Jul 2014, at 15:52, Ben Hoyt wrote: Hi folks, After some very good python-dev feedback on my first version of PEP 471, I've updated the PEP to clarify a few things and added various Rejected ideas subsections. Here's a link to the new version (I've also copied the full text below): http://legacy.python.org/dev/peps/pep-0471/ -- new PEP as HTML http://hg.python.org/peps/rev/0da4736c27e8 -- changes [...] Rejected ideas == [...] Return values being pathlib.Path objects With Antoine Pitrou's new standard library ``pathlib`` module, it at first seems like a great idea for ``scandir()`` to return instances of ``pathlib.Path``. However, ``pathlib.Path``'s ``is_X()`` and ``lstat()`` functions are explicitly not cached, whereas ``scandir`` has to cache them by design, because it's (often) returning values from the original directory iteration system call. And if the ``pathlib.Path`` instances returned by ``scandir`` cached lstat values, but the ordinary ``pathlib.Path`` objects explicitly don't, that would be more than a little confusing. Guido van Rossum explicitly rejected ``pathlib.Path`` caching lstat in the context of scandir `here https://mail.python.org/pipermail/python-dev/2013-November/130583.html`_, making ``pathlib.Path`` objects a bad choice for scandir return values. Can we at least make sure that attributes of DirEntry that have the same meaning as attributes of pathlib.Path have the same name? [...] Servus, Walter ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 07/09/2014 06:41 AM, Ethan Furman wrote: Leave it up to the onerror handler. If it returns None, skip yielding the entry, otherwise yield whatever it returned -- which also means the error handler should be able to set fields on the DirEntry: def log_err(exc, entry): logger.warn(Cannot stat {}.format(exc.filename)) entry.lstat.st_size = 0 return True Blah. Okay, either return the DirEntry (possibly modified), or have the log_err return entry instead of True. (Now where is that caffeine??) -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
2014-07-09 15:12 GMT+02:00 Ben Hoyt benh...@gmail.com: Ok, so it means that your example grouping files per type, files and directories, is also wrong. Or at least, it behaves differently than os.walk(). You should put symbolic links to directories in the dirs list too. if entry.is_dir(): # is_dir() checks os.lstat() dirs.append(entry) elif entry.is_symlink() and os.path.isdir(entry): # isdir() checks os.stat() dirs.append(entry) else: non_dirs.append(entry) Yes, good call. I believe I'm doing this wrong in the scandir.py os.walk() implementation too -- hence this open issue: https://github.com/benhoyt/scandir/issues/4 The PEP says that DirEntry should mimic pathlib.Path, so I think that DirEntry.is_dir() should work as os.path.isir(): if the entry is a symbolic link, you should follow the symlink to get the status of the linked file with os.stat(). entry.is_dir() or (entry.is_symlink() and os.path.isdir(entry)) looks wrong: why would you have to check is_dir() and isdir()? Duplicating this check is error prone and not convinient. Pseudo-code: --- class DirEntry: def __init__(self, lstat=None, d_type=None): self._stat = None self._lstat = lstat self._d_type = d_type def stat(self): if self._stat is None: self._stat = os.stat(self.full_name) return self._stat def lstat(self): if self._lstat is None: self._lstat = os.lstat(self.full_name) return self._lstat def is_dir(self): if self._d_type is not None: if self._d_type == DT_DIR: return True if self._d_type != DT_LNK: return False else: lstat = self.lstat() if stat.S_ISDIR(lstat.st_mode): return True if not stat.S_ISLNK(lstat.st_mode): return False stat = self.stat() return stat.S_ISDIR(stat.st_mode) --- DirEntry would be created with lstat (Windows) or d_type (Linux) prefilled. is_dir() would only need to call os.stat() once for symbolic links. With this code, it becomes even more obvious why is_dir() is a method and not a property ;-) Victor ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 9 July 2014 16:05, Victor Stinner victor.stin...@gmail.com wrote: The PEP says that DirEntry should mimic pathlib.Path, so I think that DirEntry.is_dir() should work as os.path.isir(): if the entry is a symbolic link, you should follow the symlink to get the status of the linked file with os.stat(). Would this not break the tree size script being discussed in the other thread, as it would follow links and include linked directories in the size of the tree? As a Windows user with only a superficial understanding of how symlinks should behave, I don't have any intuition as to what the right answer should be. But I would say that the tree size code we've been debating over there (which recurses if is_dir is true and adds in st_size otherwise) should do whatever people would expect of a function with that name, as it's a perfect example of something a Windows user might write and expect it to work cross-platform. If it doesn't much of the worrying over making sure scandir's API is cross-platform by default is probably being wasted :-) (Obviously the walk_tree function could be modified if needed, but that's missing the point I'm trying to make :-)) Paul ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
The PEP says that DirEntry should mimic pathlib.Path, so I think that DirEntry.is_dir() should work as os.path.isir(): if the entry is a symbolic link, you should follow the symlink to get the status of the linked file with os.stat(). Would this not break the tree size script being discussed in the other thread, as it would follow links and include linked directories in the size of the tree? Yeah, I agree. Victor -- I don't think the DirEntry is_X() methods (or attributes) should mimic the link-following os.path.isdir() at all. You want the type of the entry, not the type of the source. Otherwise, as Paul says, you are essentially forced to follow links, and os.walk(followlinks=False), which is the default, can't do the right thing. -Ben ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
One issue with option #2 that I just realized -- does scandir yield the entry at all if there's a stat error? It can't really, because the caller will expect the .lstat attribute to be set (assuming he asked for type='lstat') but it won't be. Is effectively removing these entries just because the stat failed a problem? I kind of think it is. If so, is there a way to solve it with option #2? Leave it up to the onerror handler. If it returns None, skip yielding the entry, otherwise yield whatever it returned -- which also means the error handler should be able to set fields on the DirEntry: def log_err(exc, entry): logger.warn(Cannot stat {}.format(exc.filename)) entry.lstat.st_size = 0 return True This is an interesting idea, but it's just getting more and more complex, and I'm guessing that being able to change the attributes of DirEntry will make the C implementation more complex. Also, I'm not sure it's very workable. For log_err above, you'd actually have to do something like this, right? def log_err(exc, entry): logger.warn(Cannot stat {}.format(exc.filename)) entry.lstat = os.stat_result((0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) return entry Unless there's another simple way around this issue, I'm back to loving the simplicity of option #1, which avoids this whole question. -Ben ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 9 July 2014 14:22, Ben Hoyt benh...@gmail.com wrote: One issue with option #2 that I just realized -- does scandir yield the entry at all if there's a stat error? It can't really, because the caller will except the .lstat attribute to be set (assuming he asked for type='lstat') but it won't be. Is effectively removing these entries just because the stat failed a problem? I kind of think it is. If so, is there a way to solve it with option #2? So the issue is that you need to do a stat but it failed. You have whatever the OS gave you, but can't get anything more. This is only an issue on POSIX, where the original OS call doesn't give you everything, so it's fine, those POSIX people can just learn to cope with their broken OS, right? :-) More seriously, why not just return a DirEntry that says it's a file with a stat entry that's all zeroes? That seems pretty harmless. And the onerror function will be called, so if it is inappropriate the application can do something. Maybe it's worth letting onerror return a boolean that says whether to skip the entry, but that's as far as I'd bother going. It's a close call, but I think option #2 still wins (just) for me. Paul ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 07/09/2014 08:35 AM, Ben Hoyt wrote: One issue with option #2 that I just realized -- does scandir yield the entry at all if there's a stat error? It can't really, because the caller will expect the .lstat attribute to be set (assuming he asked for type='lstat') but it won't be. Is effectively removing these entries just because the stat failed a problem? I kind of think it is. If so, is there a way to solve it with option #2? Leave it up to the onerror handler. If it returns None, skip yielding the entry, otherwise yield whatever it returned -- which also means the error handler should be able to set fields on the DirEntry: def log_err(exc, entry): logger.warn(Cannot stat {}.format(exc.filename)) entry.lstat.st_size = 0 return True This is an interesting idea, but it's just getting more and more complex, and I'm guessing that being able to change the attributes of DirEntry will make the C implementation more complex. Also, I'm not sure it's very workable. For log_err above, you'd actually have to do something like this, right? def log_err(exc, entry): logger.warn(Cannot stat {}.format(exc.filename)) entry.lstat = os.stat_result((0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) return entry I would imagine we would provide a helper function: def stat_result(st_size=0, st_atime=0, st_mtime=0, ...): return os.stat_result((st_size, st_atime, st_mtime, ...)) and then in onerror entry.lstat = stat_result() Unless there's another simple way around this issue, I'm back to loving the simplicity of option #1, which avoids this whole question. Too simple is just as bad as too complex, and properly handling errors is rarely a simple task. Either we provide a clean way to deal with errors in the API, or we force every user everywhere to come up with their own system. Also, just because we provide it doesn't force people to use it, but if we don't provide it then people cannot use it. To summarize the choice I think we are looking at: 1) We provide a very basic tool that many will have to write wrappers around to get the desired behavior (choice 1) 2) We provide a more advanced tool that, in many cases, can be used as-is, and is also fairly easy to extend to handle odd situations (choice 2) More specifically, if we go with choice 1 (no built-in error handling, no mutable DirEntry), how would I implement choice 2? Would I have to write my own CustomDirEntry object? -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 9 July 2014 17:35, Ethan Furman et...@stoneleaf.us wrote: More specifically, if we go with choice 1 (no built-in error handling, no mutable DirEntry), how would I implement choice 2? Would I have to write my own CustomDirEntry object? Having built-in error handling is, I think, a key point. That's where #1 really falls down. But a mutable DirEntry and/or letting onerror manipulate the result is a lot more than just having a hook for being notified of errors. That seems to me to be a step too far, in the current context. Specifically, the tree size example doesn't need it. Do you have a compelling use case that needs a mutable DirEntry? It feels like YAGNI to me. Paul ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 07/09/2014 11:04 AM, Paul Moore wrote: On 9 July 2014 17:35, Ethan Furman et...@stoneleaf.us wrote: More specifically, if we go with choice 1 (no built-in error handling, no mutable DirEntry), how would I implement choice 2? Would I have to write my own CustomDirEntry object? Having built-in error handling is, I think, a key point. That's where #1 really falls down. But a mutable DirEntry and/or letting onerror manipulate the result is a lot more than just having a hook for being notified of errors. That seems to me to be a step too far, in the current context. Specifically, the tree size example doesn't need it. Do you have a compelling use case that needs a mutable DirEntry? It feels like YAGNI to me. Not at this point. As I indicated in my reply to your response, as long as we have the onerror machinery now we can tweak it later if real-world use shows it would be beneficial. -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
This is just getting way too complex ... further thoughts below. This is an interesting idea, but it's just getting more and more complex, and I'm guessing that being able to change the attributes of DirEntry will make the C implementation more complex. Also, I'm not sure it's very workable. For log_err above, you'd actually have to do something like this, right? def log_err(exc, entry): logger.warn(Cannot stat {}.format(exc.filename)) entry.lstat = os.stat_result((0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) return entry I would imagine we would provide a helper function: def stat_result(st_size=0, st_atime=0, st_mtime=0, ...): return os.stat_result((st_size, st_atime, st_mtime, ...)) and then in onerror entry.lstat = stat_result() Unless there's another simple way around this issue, I'm back to loving the simplicity of option #1, which avoids this whole question. Too simple is just as bad as too complex, and properly handling errors is rarely a simple task. Either we provide a clean way to deal with errors in the API, or we force every user everywhere to come up with their own system. Also, just because we provide it doesn't force people to use it, but if we don't provide it then people cannot use it. So here's the ways in which option #2 is now more complicated than option #1: 1) it has an additional info argument, the values of which have to be documented ('os', 'type', 'lstat', and what each one means) 2) it has an additional onerror argument, the signature of which and fairly complicated return value is non-obvious and has to be documented 3) it requires user modification of the DirEntry object, which needs documentation, and is potentially hard to implement 4) because the DirEntry object now allows modification, you need a stat_result() helper function to help you build your own stat values I'm afraid points 3 and 4 here add way too much complexity. Remind me why all this is better than the PEP 471 approach again? It handles all of these problems, is very direct, and uses built-in Python constructs (method calls and try/except error handling). And it's also simple to document -- much simpler than the above 4 things, which could be a couple of pages in the docs. Here's the doc required for the PEP 471 approach: Note about caching and error handling: The is_X() and lstat() functions may perform an lstat() on first call if the OS didn't already fetch this data when reading the directory. So if you need fine-grained error handling, catch OSError exceptions around these method calls. After the first call, the is_X() and lstat() functions cache the value on the DirEntry. -Ben ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 07/09/2014 12:03 PM, Ben Hoyt wrote: So here's the ways in which option #2 is now more complicated than option #1: 1) it has an additional info argument, the values of which have to be documented ('os', 'type', 'lstat', and what each one means) 2) it has an additional onerror argument, the signature of which and fairly complicated return value is non-obvious and has to be documented 3) it requires user modification of the DirEntry object, which needs documentation, and is potentially hard to implement 4) because the DirEntry object now allows modification, you need a stat_result() helper function to help you build your own stat values I'm afraid points 3 and 4 here add way too much complexity. I'm okay with dropping 3 and 4, and making the return from onerror being simply True to yield the entry, and False/None to skip it. That should make implementation much easier, and documentation not too strenuous either. -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
1) it has an additional info argument, the values of which have to be documented ('os', 'type', 'lstat', and what each one means) 2) it has an additional onerror argument, the signature of which and fairly complicated return value is non-obvious and has to be documented 3) it requires user modification of the DirEntry object, which needs documentation, and is potentially hard to implement 4) because the DirEntry object now allows modification, you need a stat_result() helper function to help you build your own stat values I'm afraid points 3 and 4 here add way too much complexity. I'm okay with dropping 3 and 4, and making the return from onerror being simply True to yield the entry, and False/None to skip it. That should make implementation much easier, and documentation not too strenuous either. That's definitely better in terms of complexity. Other python-devers, please chime in with your thoughts or votes. -Ben ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
2014-07-09 21:59 GMT+02:00 Ben Hoyt benh...@gmail.com: Other python-devers, please chime in with your thoughts or votes. Sorry, I didn't follow the whole discussion. IMO DirEntry must use methods and you should not expose nor document which infos are already provided by the OS or not. DirEntry should be a best-effort black-box object providing an API similar to pathlib.Path. is_dir() may be fast? fine, but don't say it in the documentation because Python must remain portable and you should not write code specific to one specific platform. is_dir(), is_file(), is_symlink() and lstat() can fail as any other Python function, no need to specialize them with custom error handler. If you care, just use a very standard try/except. I'm also against ensure_lstat=True or ideas like that fetching all datas transparently in the generator. The behaviour would be too different depending on the OS, and usually you don't need all informations. And it raises errors in the generator, which is something unusual, and difficult to handle (I don't like the onerror callback). Example where you may sometimes need is_dir(), but not always --- for entry in os.scandir(path): if ignore_entry(entry.name): # this entry is not interesting, lstat_result is useless here continue if entry.is_dir(): # fetch required data if needed continue ... --- I don't understand why you are all focused on handling os.stat() and os.lstat() errors. See for example the os.walk() function which is an old function (python 2.6!): it doesn't catch erros on isdir(), even if it has an onerror parameter... It only handles errors on listdir(). IMO errors on os.stat() and os.lstat() are very rare under very specific conditions. The most common case is that you can get the status if you can list files. Victor ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 9 July 2014 21:24, Victor Stinner victor.stin...@gmail.com wrote: Example where you may sometimes need is_dir(), but not always --- for entry in os.scandir(path): if ignore_entry(entry.name): # this entry is not interesting, lstat_result is useless here continue if entry.is_dir(): # fetch required data if needed continue ... --- That is an extremely good point, and articulates why I've always been a bit uncomfortable with the whole ensure_stat idea. I don't understand why you are all focused on handling os.stat() and os.lstat() errors. See for example the os.walk() function which is an old function (python 2.6!): it doesn't catch erros on isdir(), even if it has an onerror parameter... It only handles errors on listdir(). IMO errors on os.stat() and os.lstat() are very rare under very specific conditions. The most common case is that you can get the status if you can list files. Personally, I'm only focused on it as a response to others feeling it's important. I'm on Windows, where there are no extra stat calls, so all *I* care about is having an API that deals with the use cases others are concerned about without making it too hard for me to use it on Windows where I don't have to worry about all this. If POSIX users come to a consensus that error handling doesn't need special treatment, I'm more than happy to go back to the PEP version. (Much as previously happened with the race condition debate). Paul ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 07/09/2014 01:57 PM, Paul Moore wrote: On 9 July 2014 21:24, Victor Stinner wrote: Example where you may sometimes need is_dir(), but not always --- for entry in os.scandir(path): if ignore_entry(entry.name): # this entry is not interesting, lstat_result is useless here continue if entry.is_dir(): # fetch required data if needed continue ... That is an extremely good point, and articulates why I've always been a bit uncomfortable with the whole ensure_stat idea. On a system which did not supply is_dir automatically I would write that as: for entry in os.scandir(path): # info defaults to 'os', which is basically None in this case if ignore_entry(entry.name): continue if os.path.isdir(entry.full_name): # do something interesting Not hard to read or understand, no time wasted in unnecessary lstat calls. -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 07/09/2014 01:24 PM, Victor Stinner wrote: Sorry, I didn't follow the whole discussion. IMO DirEntry must use methods and you should not expose nor document which infos are already provided by the OS or not. DirEntry should be a best-effort black-box object providing an API similar to pathlib.Path. is_dir() may be fast? fine, but don't say it in the documentation because Python must remain portable and you should not write code specific to one specific platform. Okay, so using that logic we should head over to the os module and remove: ctermid, getenv, getegid, geteuid, getgid, getgrouplist, getgroups, getpgid, getpgrp, getpriority, PRIO_PROCESS, PRIO_PGRP, PRIO_USER, getresuid, getresgid, getuid, initgroups, putenv, setegid, seteuid, setgid, setgroups, setpriority, setregid, setrusgid, setresuid, setreuid, getsid, setsid, setuid, unsetenv, fchmod, fchown, fdatasync, fpathconf, fstatvfs, ftruncate, lockf, F_LOCK, F_TLOCK, F_ULOCK, F_TEST, O_DSYNC, O_RSYNC, O_SYNC, O_NDELAY, O_NONBLOCK, O_NOCTTY, O_SHLOCK, O_EXLOCK, O_CLOEXEC, O_BINARY, O_NOINHERIT, O_SHORT_LIVED, O_TEMPORARY, O_RANDOM, O_SEQUENTIAL, O_TEXT, ... Okay, I'm tired of typing, but that list is not even half-way through the os page, and those are all methods or attributes that are not available on either Windows or Unix or some flavors of Unix. Oh, and all those upper-case attributes? Yup, documented. And when we don't document it ourselves we often refer readers to their system documentation because Python does not, in fact, return exactly the same results on all platforms -- particularly when calling into the OS. -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On a system which did not supply is_dir automatically I would write that as: for entry in os.scandir(path): # info defaults to 'os', which is basically None in this case if ignore_entry(entry.name): continue if os.path.isdir(entry.full_name): # do something interesting Not hard to read or understand, no time wasted in unnecessary lstat calls. No, but how do you know whether you're on a system which did not supply is_dir automatically? The above is not cross-platform, or at least, not efficient cross-platform, which defeats the whole point of scandir -- the above is no better than listdir(). -Ben ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
I really don't understand why you *want* a worse, much less cross-platform API? Okay, so using that logic we should head over to the os module and remove: ctermid, getenv, getegid... Okay, I'm tired of typing, but that list is not even half-way through the os page, and those are all methods or attributes that are not available on either Windows or Unix or some flavors of Unix. True, is this really the precedent we want to *aim for*. listdir() is cross-platform, and it's relatively easy to make scandir() cross-platform, so why not? Oh, and all those upper-case attributes? Yup, documented. And when we don't document it ourselves we often refer readers to their system documentation because Python does not, in fact, return exactly the same results on all platforms -- particularly when calling into the OS. But again, why a worse, less cross-platform API when a simple, cross-platform one is a method call away? -Ben ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
2014-07-09 22:44 GMT+02:00 Ethan Furman et...@stoneleaf.us: On 07/09/2014 01:24 PM, Victor Stinner wrote: Sorry, I didn't follow the whole discussion. IMO DirEntry must use methods and you should not expose nor document which infos are already provided by the OS or not. DirEntry should be a best-effort black-box object providing an API similar to pathlib.Path. is_dir() may be fast? fine, but don't say it in the documentation because Python must remain portable and you should not write code specific to one specific platform. Okay, so using that logic we should head over to the os module and remove: (...) My comment was specific to the PEP 471, design of the DirEntry class. Victor ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 07/09/2014 02:42 PM, Ben Hoyt wrote: Okay, so using that [no platform specific] logic we should head over to the os module and remove: ctermid, getenv, getegid... Okay, I'm tired of typing, but that list is not even half-way through the os page, and those are all methods or attributes that are not available on either Windows or Unix or some flavors of Unix. True, is this really the precedent we want to *aim for*. listdir() is cross-platform, and listdir has serious performance issues, which is why you developed scandir. Oh, and all those [snipped] upper-case attributes? Yup, documented. And when we don't document it ourselves we often refer readers to their system documentation because Python does not, in fact, return exactly the same results on all platforms -- particularly when calling into the OS. But again, why a worse, less cross-platform API when a simple, cross-platform one is a method call away? For the same reason we don't use code that makes threaded behavior better, but kills the single thread application. If the programmer would rather have consistency on all platforms rather than performance on the one being used, `info='lstat'` is the option to use. I like the 'onerror' API better primarily because it gives a single point to deal with the errors. This has at least a couple advantages: - less duplication of code: in the tree_size example, the error handling is duplicated twice - readablity: with the error handling in a separate routine, one does not have to jump around the try/except blocks looking for what happens if there are no errors -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 07/09/2014 02:38 PM, Victor Stinner wrote: 2014-07-09 22:44 GMT+02:00 Ethan Furman: On 07/09/2014 01:24 PM, Victor Stinner wrote: [...] Python must remain portable and you should not write code specific to one specific platform. Okay, so using that logic we should head over to the os module and remove: (...) My comment was specific to the PEP 471, design of the DirEntry class. And my comment was to the point of there being methods/attributes/return values that /do/ vary by platform, and /are/ documented as such. Even stat itself is not the same on Windows as posix. -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 07/09/2014 02:33 PM, Ben Hoyt wrote: On a system which did not supply is_dir automatically I would write that as: for entry in os.scandir(path): if ignore_entry(entry.name): continue if os.path.isdir(entry.full_name): # do something interesting Not hard to read or understand, no time wasted in unnecessary lstat calls. No, but how do you know whether you're on a system which did not supply is_dir automatically? The above is not cross-platform, or at least, not efficient cross-platform, which defeats the whole point of scandir -- the above is no better than listdir(). Hit a directory with 100,000 entries and you'll change your mind. ;) Okay, so the issue is you /want/ to write an efficient, cross-platform routine... hrmmm. thinking Okay, marry the two ideas together: scandir(path, info=None, onerror=None) Return a generator that returns one directory entry at a time in a DirEntry object info: None -- DirEntries will have whatever attributes the O/S provides 'type' -- DirEntries will already have at least the file/dir distinction 'stat' -- DirEntries will also already have stat information DirEntry.is_dir() Return True if this is a directory-type entry; may call os.lstat if the cache is empty. DirEntry.is_file() Return True if this is a file-type entry; may call os.lstat if the cache is empty. DirEntry.is_symlink() Return True if this is a symbolic link; may call os.lstat if the cache is empty. DirEntry.stat Return the stat info for this link; may call os.lstat if the cache is empty. This way both paradigms are supported. -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 2014-07-09 23:50, Ethan Furman wrote: On 07/09/2014 02:33 PM, Ben Hoyt wrote: On a system which did not supply is_dir automatically I would write that as: for entry in os.scandir(path): if ignore_entry(entry.name): continue if os.path.isdir(entry.full_name): # do something interesting Not hard to read or understand, no time wasted in unnecessary lstat calls. No, but how do you know whether you're on a system which did not supply is_dir automatically? The above is not cross-platform, or at least, not efficient cross-platform, which defeats the whole point of scandir -- the above is no better than listdir(). Hit a directory with 100,000 entries and you'll change your mind. ;) Okay, so the issue is you /want/ to write an efficient, cross-platform routine... hrmmm. thinking Okay, marry the two ideas together: scandir(path, info=None, onerror=None) Return a generator that returns one directory entry at a time in a DirEntry object Should that be that yields one directory entry at a time? info: None -- DirEntries will have whatever attributes the O/S provides 'type' -- DirEntries will already have at least the file/dir distinction 'stat' -- DirEntries will also already have stat information DirEntry.is_dir() Return True if this is a directory-type entry; may call os.lstat if the cache is empty. DirEntry.is_file() Return True if this is a file-type entry; may call os.lstat if the cache is empty. DirEntry.is_symlink() Return True if this is a symbolic link; may call os.lstat if the cache is empty. DirEntry.stat Return the stat info for this link; may call os.lstat if the cache is empty. Why is is_dir, et al, functions, but stat not a function? This way both paradigms are supported. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 07/09/2014 04:22 PM, MRAB wrote: On 2014-07-09 23:50, Ethan Furman wrote: Okay, marry the two ideas together: scandir(path, info=None, onerror=None) Return a generator that returns one directory entry at a time in a DirEntry object Should that be that yields one directory entry at a time? Yes, thanks. info: None -- DirEntries will have whatever attributes the O/S provides 'type' -- DirEntries will already have at least the file/dir distinction 'stat' -- DirEntries will also already have stat information DirEntry.is_dir() Return True if this is a directory-type entry; may call os.lstat if the cache is empty. DirEntry.is_file() Return True if this is a file-type entry; may call os.lstat if the cache is empty. DirEntry.is_symlink() Return True if this is a symbolic link; may call os.lstat if the cache is empty. DirEntry.stat Return the stat info for this link; may call os.lstat if the cache is empty. Why is is_dir, et al, functions, but stat not a function? Good point. Make stat a function as well. -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
2014-07-09 17:29 GMT+02:00 Ben Hoyt benh...@gmail.com: Would this not break the tree size script being discussed in the other thread, as it would follow links and include linked directories in the size of the tree? The get_tree_size() function in the PEP would use: if not entry.is_symlink() and entry.is_dir():. Note: First I wrote if entry.is_dir() and not entry.is_symlink():, but this syntax is slower on Linux because is_dir() has to call lstat(). Adding an optional keyword to DirEntry.is_dir() would allow to write if entry.is_dir(follow_symlink=False), but it looks like a micro optimization and as I said, I prefer to stick to pathlib.Path API (which was already heavily discussed in its PEP). Anyway, this case is rare (I explain that below), we should not worry too much about it. Yeah, I agree. Victor -- I don't think the DirEntry is_X() methods (or attributes) should mimic the link-following os.path.isdir() at all. You want the type of the entry, not the type of the source. On UNIX, a symlink to a directory is expected to behave like a directory. For example, in a file browser, you should enter in the linked directory when you click on a symlink to a directory. There are only a few cases where you want to handle symlinks differently: archive (ex: tar), compute the size of a directory (ex: du does not follow symlinks by default, du -L follows them), remove a directory. You should do a short poll in the Python stdlib and on the Internet to check what is the most common check. Examples of the Python stdlib: - zipfile: listdir + os.path.isdir - pkgutil: listdir + os.path.isdir - unittest.loader: listdir + os.path.isdir and os.path.isfile - http.server: listdir + os.path.isdir, it also uses os.path.islink: Append / for directories or @ for symbolic links - idlelib.GrepDialog: listdir + os.path.isdir - compileall: listdir + os.path.isdir and os.path.isdir(fullname) and not os.path.islink(fullname) = don't follow symlinks to directories - shutil (copytree): listdir + os.path.isdir + os.path.islink - shutil (rmtree): listdir + os.lstat() + stat.S_ISDIR(mode) = don't follow symlinks to directories - mailbox: listdir + os.path.isdir - tabnanny: listdir + os.path.isdir - os.walk: listdir + os.path.isdir + os.path.islink = don't follow symlinks to directories by default, but the behaviour is configurable ... but symlinks to directories are added to the dirs list (not all symlinks, only symlinks to directories) - setup.py: listdir + os.path.isfile In this list of 12 examples, only compileall, shutil.rmtree and os.walk check if entries are symlinks. compileall starts by checking if not os.path.isdir(fullname): which follows symlinks. os.walk() starts by checking if os.path.isdir(name): which follows symlinks. I consider that only one case on 12 (8.3%) doesn't follow symlinks. If entry.is_dir() doesn't follow symlinks, the other 91.7% will need to be modified to use if entry.is_dir() or (entry.is_link() and os.path.is_dir(entry.full_name)): to keep the same behaviour :-( Otherwise, as Paul says, you are essentially forced to follow links, and os.walk(followlinks=False), which is the default, can't do the right thing. os.walk() and get_tree_size() are good users of scandir(), but they are recursive functions. It means that you may handle symlinks differently, os.walk() gives the choice to follow or not symlinks for example. Recursive functions are rare. The most common case is to list files of a single directory and then filter files depending on various filters (is a file? is a directory? match the file name? ...). In such use case, you don't care of symlinks (you want to follow them). Victor ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
2014-07-09 17:26 GMT+02:00 Paul Moore p.f.mo...@gmail.com: On 9 July 2014 16:05, Victor Stinner victor.stin...@gmail.com wrote: The PEP says that DirEntry should mimic pathlib.Path, so I think that DirEntry.is_dir() should work as os.path.isir(): if the entry is a symbolic link, you should follow the symlink to get the status of the linked file with os.stat(). (...) As a Windows user with only a superficial understanding of how symlinks should behave, (...) FYI Windows also supports symbolic links since Windows Vista. The feature is unknown because it is restricted to the administrator account. Try the mklink command in a terminal (cmd.exe) ;-) http://en.wikipedia.org/wiki/NTFS_symbolic_link ... To be honest, I never created a symlink on Windows. But since it is supported, you need to know it to write correctly your Windows code. (It's unrelated to LNK files.) Victor ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
Ben Hoyt benh...@gmail.com writes: So here's the ways in which option #2 is now more complicated than option #1: 1) it has an additional info argument, the values of which have to be documented ('os', 'type', 'lstat', and what each one means) 2) it has an additional onerror argument, the signature of which and fairly complicated return value is non-obvious and has to be documented 3) it requires user modification of the DirEntry object, which needs documentation, and is potentially hard to implement 4) because the DirEntry object now allows modification, you need a stat_result() helper function to help you build your own stat values I'm afraid points 3 and 4 here add way too much complexity. Points 3 and 4 are not required to go with option #2, option #2 merely allows to implement points 3 and 4 at some point in the future if it turns out to be desirable. Best, -Nikolaus -- GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F »Time flies like an arrow, fruit flies like a Banana.« ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 07/09/2014 05:15 PM, Victor Stinner wrote: 2014-07-09 17:29 GMT+02:00 Ben Hoyt benh...@gmail.com: Would this not break the tree size script being discussed in the other thread, as it would follow links and include linked directories in the size of the tree? The get_tree_size() function in the PEP would use: if not entry.is_symlink() and entry.is_dir():. Note: First I wrote if entry.is_dir() and not entry.is_symlink():, but this syntax is slower on Linux because is_dir() has to call lstat(). Wouldn't it only have to call lstat if the entry was, in fact, a link? There are only a few cases where you want to handle symlinks differently: archive (ex: tar), compute the size of a directory (ex: du does not follow symlinks by default, du -L follows them), remove a directory. I agree with Victor here. If the entry is a link I would want to know if it was a link to a directory or a link to a file. If I care about not following sym links I can check is_symlink() (or whatever it's called). -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
Oh, since I'm proposing to add a new stat() method to DirEntry, we can optimize it. stat() can reuse lstat() result if the file is not a symlink. It simplifies is_dir(). New pseudo-code: --- class DirEntry: def __init__(self, path, name, lstat=None, d_type=None): self.name = name self.full_name = os.path.join(path, name) # lstat is known on Windows self._lstat = lstat if lstat is not None and not stat.S_ISLNK(lstat.st_mode): # On Windows, stat() only calls os.stat() for symlinks self._stat = lstat else: self._stat = None # d_type is known on UNIX if d_type != DT_UNKNOWN: self._d_type = d_type else: # DT_UNKNOWN is not a very useful information :-p self._d_type = None def stat(self): if self._stat is None: self._stat = os.stat(self.full_name) return self._stat def lstat(self): if self._lstat is None: self._lstat = os.lstat(self.full_name) if self._stat is None and not stat.S_ISLNK(self._lstat.st_mode): self._stat = lstat return self._lstat def is_dir(self): if self._d_type is not None: if self._d_type == DT_DIR: return True if self._d_type != DT_LNK: return False else: lstat = self.lstat() if stat.S_ISDIR(lstat.st_mode): return True stat = self.stat() # if lstat() was already called, stat() will only call os.stat() for symlink return stat.S_ISDIR(stat.st_mode) --- The extra caching rules are complex, that's why I suggest to not document them. In short: is_dir() only needs an extra syscall for symlinks, for other file types it does not need any syscall. Victor ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 10 July 2014 10:23, Victor Stinner victor.stin...@gmail.com wrote: 2014-07-09 17:26 GMT+02:00 Paul Moore p.f.mo...@gmail.com: On 9 July 2014 16:05, Victor Stinner victor.stin...@gmail.com wrote: The PEP says that DirEntry should mimic pathlib.Path, so I think that DirEntry.is_dir() should work as os.path.isir(): if the entry is a symbolic link, you should follow the symlink to get the status of the linked file with os.stat(). (...) As a Windows user with only a superficial understanding of how symlinks should behave, (...) FYI Windows also supports symbolic links since Windows Vista. The feature is unknown because it is restricted to the administrator account. Try the mklink command in a terminal (cmd.exe) ;-) http://en.wikipedia.org/wiki/NTFS_symbolic_link ... To be honest, I never created a symlink on Windows. But since it is supported, you need to know it to write correctly your Windows code. Personally, I create them all the time on Windows - mainly via the Link Shell Extension http://schinagl.priv.at/nt/hardlinkshellext/linkshellextension.html. It's the easiest way to ensure that my directory structures are as I want them whilst not worrying about where the files really are e.g. code on SSD, GB+-sized data files on rusty metal, symlinks makes it look like it's the same directory structure. Same thing can be done with junctions if you're only dealing with directories, but symlinks work with files as well. I work cross-platform, and have a mild preference for option #2 with similar semantics on all platforms. Tim Delaney ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
Ben Hoyt benh...@gmail.com writes: ... ``scandir()`` yields a ``DirEntry`` object for each file and directory in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'`` pseudo-directories are skipped, and the entries are yielded in system-dependent order. Each ``DirEntry`` object has the following attributes and methods: * ``name``: the entry's filename, relative to the ``path`` argument (corresponds to the return values of ``os.listdir``) * ``full_name``: the entry's full path name -- the equivalent of ``os.path.join(path, entry.name)`` I suggest renaming .full_name - .path .full_name might be misleading e.g., it implies that .full_name == abspath(.full_name) that might be false. The .path name has no such associations. The semantics of the the .path attribute is defined by these assertions:: for entry in os.scandir(topdir): #NOTE: assume os.path.normpath(topdir) is not called to create .path assert entry.path == os.path.join(topdir, entry.name) assert entry.name == os.path.basename(entry.path) assert entry.name == os.path.relpath(entry.path, start=topdir) assert os.path.dirname(entry.path) == topdir assert (entry.path != os.path.abspath(entry.path) or os.path.isabs(topdir)) # it is absolute only if topdir is assert (entry.path != os.path.realpath(entry.path) or topdir == os.path.realpath(topdir)) # symlinks are not resolved assert (entry.path != os.path.normcase(entry.path) or topdir == os.path.normcase(topdir)) # no case-folding, # unlike PureWindowsPath ... * ``is_dir()``: like ``os.path.isdir()``, but much cheaper -- it never requires a system call on Windows, and usually doesn't on POSIX systems I suggest documenting the implicit follow_symlinks parameter for .is_X methods. Note: lstat == partial(stat, follow_symlinks=False). In particular, .is_dir() should probably use follow_symlinks=True by default as suggested by Victor Stinner *if .is_dir() does it on Windows* MSDN says: GetFileAttributes() does not follow symlinks. os.path.isdir docs imply follow_symlinks=True: both islink() and isdir() can be true for the same path. ... Like the other functions in the ``os`` module, ``scandir()`` accepts either a bytes or str object for the ``path`` parameter, and returns the ``DirEntry.name`` and ``DirEntry.full_name`` attributes with the same type as ``path``. However, it is *strongly recommended* to use the str type, as this ensures cross-platform support for Unicode filenames. Document when {e.name for e in os.scandir(path)} != set(os.listdir(path)) + e.g., path can be an open file descriptor in os.listdir(path) since Python 3.3 but the PEP doesn't mention it explicitly. It has been discussed already e.g., https://mail.python.org/pipermail/python-dev/2014-July/135296.html PEP 471 should explicitly reject the support for specifying a file descriptor so that a code that uses os.scandir may assume that entry.path (.full_name) attribute is always present (no exceptions due to a failure to read /proc/self/fd/NNN or an error while calling fcntl(F_GETPATH) or GetFileInformationByHandleEx() -- see http://stackoverflow.com/q/1188757 ). Reject explicitly in PEP 471 the support for dir_fd parameter + aka the support for paths relative to directory descriptors. Note: it is a *different* (but related) issue. ... Notes on exception handling --- ``DirEntry.is_X()`` and ``DirEntry.lstat()`` are explicitly methods rather than attributes or properties, to make it clear that they may not be cheap operations, and they may do a system call. As a result, these methods may raise ``OSError``. For example, ``DirEntry.lstat()`` will always make a system call on POSIX-based systems, and the ``DirEntry.is_X()`` methods will make a ``stat()`` system call on such systems if ``readdir()`` returns a ``d_type`` with a value of ``DT_UNKNOWN``, which can occur under certain conditions or on certain file systems. For this reason, when a user requires fine-grained error handling, it's good to catch ``OSError`` around these method calls and then handle as appropriate. I suggest documenting that next(os.scandir()) may raise OSError e.g., on POSIX it may happen due to an OS error in opendir/readdir/closedir Also, document whether os.scandir() itself may raise OSError (whether opendir or other OS functions may be called before the first yield). ... os.scandir() should allow the explicit cleanup ++ :: with closing(os.scandir()) as entries: for _ in entries: break entries.close() is called that frees the resources if necessary, to *avoid relying on garbage-collection for managing file
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 9 Jul 2014 17:14, Ethan Furman et...@stoneleaf.us wrote: On 07/09/2014 02:42 PM, Ben Hoyt wrote: Okay, so using that [no platform specific] logic we should head over to the os module and remove: ctermid, getenv, getegid... Okay, I'm tired of typing, but that list is not even half-way through the os page, and those are all methods or attributes that are not available on either Windows or Unix or some flavors of Unix. True, is this really the precedent we want to *aim for*. listdir() is cross-platform, and listdir has serious performance issues, which is why you developed scandir. Oh, and all those [snipped] upper-case attributes? Yup, documented. And when we don't document it ourselves we often refer readers to their system documentation because Python does not, in fact, return exactly the same results on all platforms -- particularly when calling into the OS. But again, why a worse, less cross-platform API when a simple, cross-platform one is a method call away? For the same reason we don't use code that makes threaded behavior better, but kills the single thread application. If the programmer would rather have consistency on all platforms rather than performance on the one being used, `info='lstat'` is the option to use. I like the 'onerror' API better primarily because it gives a single point to deal with the errors. This has at least a couple advantages: - less duplication of code: in the tree_size example, the error handling is duplicated twice - readablity: with the error handling in a separate routine, one does not have to jump around the try/except blocks looking for what happens if there are no errors The onerror approach can also deal with readdir failing, which the PEP currently glosses over. I'm somewhat inclined towards the current approach in the PEP, but I'd like to see an explanation of two aspects: 1. How a scandir variant with an 'onerror' option could be implemented given the version in the PEP 2. How the existing scandir module handles the 'onerror' parameter to its directory walking function Regards, Nick. -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
Hi, 2014-07-08 15:52 GMT+02:00 Ben Hoyt benh...@gmail.com: After some very good python-dev feedback on my first version of PEP 471, I've updated the PEP to clarify a few things and added various Rejected ideas subsections. Here's a link to the new version (I've also copied the full text below): Thanks, the new PEP looks better. * Removed the open issues section, as the three open issues have either been included (full_name) or rejected (windows_wildcard) I remember a pending question on python-dev: - Martin von Loewis asked if the scandir generator would have send() and close() methods as any Python generator. I didn't see a reply on the mailing (nor in the PEP). One known error in the PEP is that the Notes sections should be top-level sections, not be subheadings of Examples. If someone would like to give me (benhoyt) commit access to the peps repo, I can fix this and any other issues that come up. Or just send me your new PEP ;-) Notes on caching The ``DirEntry`` objects are relatively dumb -- the ``name`` and ``full_name`` attributes are obviously always cached, and the ``is_X`` and ``lstat`` methods cache their values (immediately on Windows via ``FindNextFile``, and on first use on POSIX systems via a ``stat`` call) and never refetch from the system. It is not clear to me which methods share the cache. On UNIX, is_dir() and is_file() call os.stat(); whereas lstat() and is_symlink() call os.lstat(). If os.stat() says that the file is not a symlink, I guess that you can use os.stat() result for lstat() and is_symlink() methods? In the worst case, if the path is a symlink, would it be possible that os.stat() and os.lstat() become inconsistent if the symlink is modified between the two calls? If yes, I don't think that it's an issue, it's just good to know it. For symlinks, readdir() returns the status of the linked file or of the symlink? Victor ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 7/8/2014 9:52 AM, Ben Hoyt wrote: DirEntry fields being static attribute-only objects - In `this July 2014 python-dev message https://mail.python.org/pipermail/python-dev/2014-July/135303.html`_, Paul Moore suggested a solution that was a thin wrapper round the OS feature, where the ``DirEntry`` object had only static attributes: ``name``, ``full_name``, and ``is_X``, with the ``st_X`` attributes only present on Windows. The idea was to use this simpler, lower-level function as a building block for higher-level functions. At first there was general agreement that simplifying in this way was a good thing. However, there were two problems with this approach. First, the assumption is the ``is_dir`` and similar attributes are always present on POSIX, which isn't the case (if ``d_type`` is not present or is ``DT_UNKNOWN``). Second, it's a much harder-to-use API in practice, as even the ``is_dir`` attributes aren't always present on POSIX, and would need to be tested with ``hasattr()`` and then ``os.stat()`` called if they weren't present. Only exposing what the OS provides for free will make the API too difficult to use in the common case. But is there a nice way to expand the API that will allow the user who is trying to avoid extra expense know what information is already available? Even if the initial version doesn't have a way to check what information is there for free, ensuring there is a clean way to add this in the future would be really nice. Janzert ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
I remember a pending question on python-dev: - Martin von Loewis asked if the scandir generator would have send() and close() methods as any Python generator. I didn't see a reply on the mailing (nor in the PEP). Good call. Looks like you're referring to this message: https://mail.python.org/pipermail/python-dev/2014-July/135324.html I'm not actually familiar with the purpose of .close() and .send()/.throw() on generators. Do you typically call these functions manually, or are they called automatically by the generator protocol? It is not clear to me which methods share the cache. On UNIX, is_dir() and is_file() call os.stat(); whereas lstat() and is_symlink() call os.lstat(). If os.stat() says that the file is not a symlink, I guess that you can use os.stat() result for lstat() and is_symlink() methods? In the worst case, if the path is a symlink, would it be possible that os.stat() and os.lstat() become inconsistent if the symlink is modified between the two calls? If yes, I don't think that it's an issue, it's just good to know it. For symlinks, readdir() returns the status of the linked file or of the symlink? I think you're misunderstanding is_dir() and is_file(), as these don't actually call os.stat(). All DirEntry methods either call nothing or os.lstat() to get the stat info on the entry itself (not the destination of the symlink). In light of this, I don't think what you're describing above is an issue. -Ben ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
Only exposing what the OS provides for free will make the API too difficult to use in the common case. But is there a nice way to expand the API that will allow the user who is trying to avoid extra expense know what information is already available? Even if the initial version doesn't have a way to check what information is there for free, ensuring there is a clean way to add this in the future would be really nice. We could easily add .had_type and .had_lstat properties (not sure on the names), that would be true if the is_X information and lstat information was fetched, respectively. Basically both would always be True on Windows, but on POSIX only had_type would be True d_type is present and != DT_UNKNOWN. I don't feel this is actually necessary, but it's not hard to add. Thoughts? -Ben ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
I think you're misunderstanding is_dir() and is_file(), as these don't actually call os.stat(). All DirEntry methods either call nothing or os.lstat() to get the stat info on the entry itself (not the destination of the symlink). Oh. Extract of your PEP: is_dir(): like os.path.isdir(), but much cheaper. genericpath.isdir() and genericpath.isfile() use os.stat(), whereas posixpath.islink() uses os.lstat(). Is it a mistake in the PEP? Ah, you're dead right -- this is basically a bug in the PEP, as DirEntry.is_dir() is not like os.path.isdir() in that it is based on the entry itself (like lstat), not following the link. I'll improve the wording here and update the PEP. -Ben ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 07/08/2014 12:34 PM, Ben Hoyt wrote: Better to just have the attributes be None if they were not fetched. None is better than hasattr anyway, at least in the respect of not having to catch exceptions to function properly. The thing is, is_dir() and lstat() are not attributes (for a good reason). Please read the relevant Rejected ideas sections and let us know what you think. :-) I did better than that -- I read the whole thing! ;) -1 on the PEP's implementation. Just like an attribute does not imply a system call, having a method named 'is_dir' /does/ imply a system call, and not having one can be just as misleading. If we have this: size = 0 for entry in scandir('/some/path'): size += entry.st_size - on Windows, this should Just Work (if I have the names correct ;) - on Posix, etc., this should fail noisily with either an AttributeError ('entry' has no 'st_size') or a TypeError (cannot add None) and the solution is equally simple: for entry in scandir('/some/path', stat=True): - if not Windows, perform a stat call at the same time Now, of course, we might get errors. I am not a big fan of wrapping everything in try/except, particularly when we already have a model to follow -- os.walk: for entry in scandir('/some/path', stat=True, onerror=record_and_skip): If we don't care if an error crashes the script, leave off onerror. If we don't need st_size and friends, leave off stat=True. If we get better performance on Windows instead of Linux, that's okay. scandir is going into os because it may not behave the same on every platform. Heck, even some non-os modules (multiprocessing comes to mind) do not behave the same on every platform. I think caching the attributes for DirEntry is fine, but let's do it as a snapshot of that moment in time, not name now, and attributes in 30 minutes when we finally get to you because we had a lot of processing/files ahead of you (you being a DirEntry ;) . -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
I did better than that -- I read the whole thing! ;) Thanks. :-) -1 on the PEP's implementation. Just like an attribute does not imply a system call, having a method named 'is_dir' /does/ imply a system call, and not having one can be just as misleading. Why does a method imply a system call? os.path.join() and str.lower() don't make system calls. Isn't it just a matter of clear documentation? Anyway -- less philosophical discussion below. If we have this: size = 0 for entry in scandir('/some/path'): size += entry.st_size - on Windows, this should Just Work (if I have the names correct ;) - on Posix, etc., this should fail noisily with either an AttributeError ('entry' has no 'st_size') or a TypeError (cannot add None) and the solution is equally simple: for entry in scandir('/some/path', stat=True): - if not Windows, perform a stat call at the same time I'm not totally opposed to this, which is basically a combination of Nick Coghlan's and Paul Moore's recent proposals mentioned in the PEP. However, as discussed on python-dev, there are some edge cases it doesn't handle very well, and it's messier to handle errors (requires onerror as you mention below). I presume you're suggesting that is_dir/is_file/is_symlink should be regular attributes, and accessing them should never do a system call. But what if the system doesn't support d_type (eg: Solaris) or the d_type value is DT_UNKNOWN (can happen on Linux, OS X, BSD)? The options are: 1) scandir() would always call lstat() in the case of missing/unknown d_type. If so, scandir() is actually more expensive than listdir(), and as a result it's no longer safe to implement listdir in terms of scandir: def listdir(path='.'): return [e.name for e in scandir(path)] 2) Or would it be better to have another flag like scandir(path, type=True) to ensure the is_X type info is fetched? This is explicit, but also getting kind of unwieldly. 3) A third option is for the is_X attributes to be absent in this case (hasattr tests required, and the user would do the lstat manually). But as I noted on python-dev recently, you basically always want is_X, so this leads to unwieldly and code that's twice as long as it needs to be. See here: https://mail.python.org/pipermail/python-dev/2014-July/135312.html 4) I gather in your proposal above, scandir will call lstat() if stat=True? Except where does it put the values? Surely it should return an existing stat_result object, rather than stuffing everything onto the DirEntry, or throwing away some values on Linux? In this case, I'd prefer Nick Coghlan's approach of ensure_lstat and a .stat_result attribute. However, this still has the what if d_type is missing or DT_UNKNOWN issue. It seems to me that making is_X() methods handles this exact scenario -- methods are so you don't have to do the dirty work. So yes, the real world is messy due to missing is_X values, but I think it's worth getting this right, and is_X() methods can do this while keeping the API simple and cross-platform. Now, of course, we might get errors. I am not a big fan of wrapping everything in try/except, particularly when we already have a model to follow -- os.walk: I don't mind the onerror too much if we went with this kind of approach. It's not quite as nice as a standard try/except around the method call, but it's definitely workable and has a precedent with os.walk(). It seems a bit like we're going around in circles here, and I think we have all the information and options available to us, so I'm going to SUMMARIZE. We have a choice before us, a fork in the road. :-) We can choose one of these options for the scandir API: 1) The current PEP 471 approach. This solves the issue with d_type being missing or DT_UNKNOWN, it doesn't require onerror, and it's a really tidy API that doesn't explode with AttributeErrors if you write code on Windows (without thinking too hard) and then move to Linux. I think all of these points are important -- the cross-platform one not the least, because we want to make it easy, even *trivial*, for people to write cross-platform code. For reference, here's what get_tree_size() looks like with this approach, not including error handling with try/except: def get_tree_size(path): total = 0 for entry in os.scandir(path): if entry.is_dir(): total += get_tree_size(entry.full_name) else: total += entry.lstat().st_size return total 2) Nick Coghlan's model of only fetching the lstat value if ensure_lstat=True, and including an onerror callback for error handling when scandir calls lstat internally. However, as described, we'd also need an ensure_type=True option, so that scandir() isn't way slower than listdir() if you actually don't want the is_X values and d_type is missing/unknown. For reference, here's what get_tree_size() looks like with this approach, not including error handling with onerror: def
Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal
On 07/08/2014 06:08 PM, Ben Hoyt wrote: Just like an attribute does not imply a system call, having a method named 'is_dir' /does/ imply a system call, and not having one can be just as misleading. Why does a method imply a system call? os.path.join() and str.lower() don't make system calls. Isn't it just a matter of clear documentation? Anyway -- less philosophical discussion below. In this case because the names are exactly the same as the os versions which /do/ make a system call. I presume you're suggesting that is_dir/is_file/is_symlink should be regular attributes, and accessing them should never do a system call. But what if the system doesn't support d_type (eg: Solaris) or the d_type value is DT_UNKNOWN (can happen on Linux, OS X, BSD)? The options are: So if I'm finally understanding the root problem here: - listdir returns a list of strings, one for each filename and one for each directory, and keeps no other O/S supplied info. - os.walk, which uses listdir, then needs to go back to the O/S and refetch the thrown-away information - so it's slow. The solution: - have scandir /not/ throw away the O/S supplied info and the new problem: - not all O/Ses provide the same (or any) extra info about the directory entries Have I got that right? If so, I still like the attribute idea better (surprise!), we just need to revisit the 'ensure_lstat' (or whatever it's called) parameter: instead of a true/false value, it could have a scale: - 0 = whatever the O/S gives us - 1 = at least the is_dir/is_file (whatever the other normal one is), and if the O/S doesn't give it to us for free than call lstat - 2 = we want it all -- call lstat if necessary on this platform After all, the programmer should know up front how much of the extra info will be needed for the work that is trying to be done. We have a choice before us, a fork in the road. :-) We can choose one of these options for the scandir API: 1) The current PEP 471 approach. This solves the issue with d_type being missing or DT_UNKNOWN, it doesn't require onerror, and it's a really tidy API that doesn't explode with AttributeErrors if you write code on Windows (without thinking too hard) and then move to Linux. I think all of these points are important -- the cross-platform one not the least, because we want to make it easy, even *trivial*, for people to write cross-platform code. Yes, but we don't want a function that sucks equally on all platforms. ;) 2) Nick Coghlan's model of only fetching the lstat value if ensure_lstat=True, and including an onerror callback for error handling when scandir calls lstat internally. However, as described, we'd also need an ensure_type=True option, so that scandir() isn't way slower than listdir() if you actually don't want the is_X values and d_type is missing/unknown. With the multi-level version of 'ensure_lstat' we do not need an extra 'ensure_type'. For reference, here's what get_tree_size() looks like with this approach, not including error handling with onerror: def get_tree_size(path): total = 0 for entry in os.scandir(path, ensure_lstat=1): if entry.is_dir: total += get_tree_size(entry.full_name) else: total += entry.lstat_result.st_size return total And if we added the onerror here it would be a line fragment, as opposed to the extra four lines (at least) for the try/except in the first example (which I cut). Finally: Thank you for writing scandir, and this PEP. Excellent work. Oh, and +1 for option 2, slightly modified. :) -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com