[issue38764] Deterministic globbing.

2019-11-12 Thread Guido van Rossum


Guido van Rossum  added the comment:

Let's not do this, for all the reasons brought up.

--
nosy: +gvanrossum
resolution:  -> wont fix
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue38764] Deterministic globbing.

2019-11-11 Thread Matthew Barnett


Matthew Barnett  added the comment:

I could also add: would sorting be case-sensitive or case-insensitive? Windows 
is case-insensitive, Linux is case-sensitive.

--
nosy: +mrabarnett

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue38764] Deterministic globbing.

2019-11-11 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

ISTM that if glob had been sorted, it would have only thinly masked the data 
pipeline issues that were encountered.  The code would still have been easily 
broken by someone renaming, adding, or removing a file.  In their case, it was 
the semantic context of the files where the application order was important, 
the actual filenames were incidental to the bug.

Also having user depend on sorting may lead to other bugs as well.  1) Sorts in 
one version of Python but not another, 2) Sorts in some functions but not 
others, and 3) The sort is text based so that file1, file2, ..., file11, file12 
don't sort as expected.

FWIW, I didn't put a strong -1.  However, the "sorting of discovered children" 
is a warning sign that we're doing too much.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue38764] Deterministic globbing.

2019-11-11 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

> This seems like an inappropriately elitist attitude.

Please drop the personal attacks and abusive tone.  The tracker is for 
technical discussions.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue38764] Deterministic globbing.

2019-11-11 Thread Brandt Bucher


Brandt Bucher  added the comment:

I'm not sure listdir and walk should be sorted, for the reasons Raymond 
mentioned. And I was actually surprised to learn today that pathlib.Path.glob 
doesn't use the glob module internally. Was there some reasoning behind this 
decision? Separator handling, perhaps?

Either way, these seem to be out of the scope of this issue... not every change 
needs to be a repo-wide omnibus! I think incremental improvements like the 
proposed one are often fine.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue38764] Deterministic globbing.

2019-11-11 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

If sort the output of glob(), we should sort also the output of Path.glob(), 
listdir(), and maybe walk().

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue38764] Deterministic globbing.

2019-11-10 Thread Karthikeyan Singaravelan


Change by Karthikeyan Singaravelan :


--
nosy: +serhiy.storchaka

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue38764] Deterministic globbing.

2019-11-10 Thread Brandt Bucher


Brandt Bucher  added the comment:

I disagree somewhat with the assessment that glob provides "thin" access to OS 
services. It is composed of a few low-level utilities, but it exposes them 
through what I consider to be a fairly high-level, abstract, friendly interface 
that (aside from ordering, and some symlink stuff) is surprisingly 
cross-platform.

I originally felt that this was "their own fault", since the behavior is 
well-documented and the data pipeline was so susceptible to error. But I 
started thinking more about the fact that the bad globbing made it into 
peer-reviewed cancer research over a hundred times. Python is a friendly 
language and glob is a friendly library, so it would be natural for someone 
without formal engineering experience to reach for both. I'm becoming more and 
more surprised that glob exposes this type of quirk, for trivial gain.

I respect your opinion though. Another option would be to leave the patch in, 
but not document the new traversal order. That way we rid ourselves of a class 
of user errors without exposing implementation details like the type of search 
used.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue38764] Deterministic globbing.

2019-11-10 Thread Nathaniel Smith

Nathaniel Smith  added the comment:

> I saw the article as well, but think auto-sorting would have just provided a 
> thin mask over their serious data pipeline bugs.

This seems like an inappropriately elitist attitude. I'm sure their code has 
bugs; so does mine and yours. But in fact they did test their code thoroughly, 
demonstrated that it produced correct results on their system, and it would 
have produced correct results for their downstream users too if Python's 'glob' 
had behaved consistently across platforms. Python isn't just for folks who can 
"git gud" to meet your arbitrary standards.

In addition, auto-sorting has a number of ergonomic benefits, beyond this one 
case: it makes output more deterministic, makes it easy to estimate progress 
given just a list of filenames being processed ("hmm, this script has been 
running for ten minutes and it's only on the Cs, I guess this will take a few 
hours"), and so on.

> Also, this isn't the only pathway to seeing file lists:  os.listdir, fnmatch, 
> etc.

os.listdir is a genuinely low-level OS-level tool, as indicated by the name, 
versus 'glob' which is designed as a convenient interface for high-level 
scripting.

fnmatch doesn't deal with either files or lists, so I'm not sure why you're 
bringing it up here.

> The reasons for the previous rejections still apply.

The reasons I see in the previous issues are:

- Sorting adds overhead
- It's hard to implement, esp. if iglob uses scandir internally
- iglob can't sort, so glob should be consistent
- it's easy to sort after the fact

But these are all false.

- The overhead added by sorting is negligible. In some informal benchmarks I 
measured it as ~4% extra CPU time in the worst case, if you're (a) running on 
Linux with it's ultra-optimized VFS caching, (b) all the directories are 
already cached in RAM, (c) you're not actually doing anything with the files 
except calling glob() and then discarding the result. Of course, it's always 
possible my benchmarks missed an important case; if you have benchmarks that 
show otherwise please post them so we can discuss.

- The implementation is trivial, as shown from the PR – literally just 
replacing two calls to 'list' with calls to 'sorted'.

- iglob being incremental doesn't actually pose any obstacles to sorting. Even 
if it's using scandir(), it still has to load each individual directory list 
into memory in one batch, to avoid the risk of leaking file descriptors.

- It's actually not trivial to sort filenames in the natural order afterwards, 
because you have to split each filename into components. It's also 
asymptotically slower than doing the sort as-you-go, O(total-files log 
total-files), versus O(total-files log maximum-individual-directory).

Given that there's at least one extremely clear case where not-sorting caused 
substantial harm, that there are lots of other minor cases where it's a nice 
win, and that the costs are negligible, sorting seems like the obviously 
correct default.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue38764] Deterministic globbing.

2019-11-10 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

I saw the article as well, but think auto-sorting would have just provided a 
thin mask over their serious data pipeline bugs.

Also, this isn't the only pathway to seeing file lists:  os.listdir, fnmatch, 
etc.

I say that we leave it alone and not unnecessarily introduce a breadth-first, 
recursive sort.  The reasons for the previous rejections still apply.  For the 
most part our tools that access os services only provide a thin pass-through 
and tend to neither promise nor do anything extra (even SQL only gives sorted 
data when the user explicitly requests ORDER By -- the could have provided a 
default sort but choose not to).

--
nosy: +rhettinger

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue38764] Deterministic globbing.

2019-11-10 Thread Brandt Bucher


Change by Brandt Bucher :


--
keywords: +patch
pull_requests: +16611
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/17105

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue38764] Deterministic globbing.

2019-11-10 Thread Brandt Bucher


New submission from Brandt Bucher :

This has been discussed before, but we now have examples in the news of glob's 
non-deterministic behavior causing some real headaches for hundreds of people 
in the scientific community. After some cursory discussion 
(https://discuss.python.org/t/a-code-glitch-may-have-caused-errors-in-more-than-100-published-studies/2583)
 it was suggested that this issue would at least be worth revisiting, given the 
damage... bad programming practices aside.

The attached patch guarantees glob/iglob traversal order across all platforms, 
and modifies all of the existing tests to check for this.

One reason this was rejected twice before (in both issue 21748, four years ago, 
and issue 30461, two years ago) was the argument that the user could always 
just call "sorted" on the returned list if they wanted ordered results. Aside 
from losing the laziness of iglob, this solution also loses the traversal 
order, since recursive globs use a breadth-first search. In the attached patch, 
iglob is still just as lazy as ever, and the traversal is unchanged for anyone 
using a platform that already returns sorted results.

--
components: Library (Lib)
messages: 356344
nosy: brandtbucher, njs
priority: normal
severity: normal
status: open
title: Deterministic globbing.
type: enhancement
versions: Python 3.9

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com