[issue37935] Improve performance of pathlib.scandir()

2019-09-12 Thread Gregory P. Smith


Gregory P. Smith  added the comment:


New changeset 98a4a713d001cf2dfb04a9e318e4aea899bc8fbd by Gregory P. Smith 
(Miss Islington (bot)) in branch '3.8':
bpo-37935: Added tests for os.walk(), glob.iglob() and Path.glob() (GH-15956) 
(GH-16043)
https://github.com/python/cpython/commit/98a4a713d001cf2dfb04a9e318e4aea899bc8fbd


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37935] Improve performance of pathlib.scandir()

2019-09-12 Thread miss-islington


Change by miss-islington :


--
pull_requests: +15665
pull_request: https://github.com/python/cpython/pull/16043

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37935] Improve performance of pathlib.scandir()

2019-09-12 Thread Gregory P. Smith


Gregory P. Smith  added the comment:


New changeset f9dc2ad89032201427ed5f08061c703794627ad9 by Gregory P. Smith 
(Serhiy Storchaka) in branch 'master':
bpo-37935: Added tests for os.walk(), glob.iglob() and Path.glob() (GH-15956)
https://github.com/python/cpython/commit/f9dc2ad89032201427ed5f08061c703794627ad9


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37935] Improve performance of pathlib.scandir()

2019-09-11 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

Yes, I am the author of the code that uses os.scandir() in os.walk(), 
os.fwalk(), glob.iglob() and Path.glob() (see issue23605, issue25996, 
issue25596, issue26032). And it was always in mind to not keep many file 
descriptors open when traverse directories recursively. It was also a reason of 
rejecting my earlier patch for speeding up os.walk() by using os.fwalk() (see 
issue15200).

It is safe to iterated over os.scandir() without turning it into a list if we 
do not do this recursively.

Unfortunately there were no special tests for this. PR 15956 adds them. You can 
easily break tests for pathlib if remove any of list(). It is harder to break 
tests for glob, because fnmatch.filter() consumes the iterator and implicitly 
closes the scandir iterator. You need to replace it with a generator and 
fnmatch.fnmatch() called in a loop. Breaking the tests for os.walk() is 
difficult. The code of os.walk() is so complex because it needs to return lists 
of files and subdirectories, and therefore it consumes the scandir iterator and 
closes it. But with some tricks it is possible to break tests for os.walk(). It 
is unlikely somebody will do this unintentionally.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37935] Improve performance of pathlib.scandir()

2019-09-11 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
keywords: +patch
pull_requests: +15591
pull_request: https://github.com/python/cpython/pull/15956

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37935] Improve performance of pathlib.scandir()

2019-09-10 Thread Gregory P. Smith


Gregory P. Smith  added the comment:

For some context here, the pathlib scandir code was written by Serhiy in 
https://bugs.python.org/issue26032 and related commit 
https://github.com/python/cpython/commit/680cb152c5d220a74321fa905d4fc91bdec40fbb.

> Any optimization can be accepted only when we have any prove that the change 
> actually has measurable effect, and that it is large enough.

While often good advice, there is no such blanket policy.  Saying that comes 
across as dismissive.  A better way to phrase such a statement is as a question:

In what practical scenarios does this change provide a demonstrable benefit?

The improvement in this case could be avoiding an explosion of memory use in 
some circumstances.  Depending on the particular filesystem.  (the original 
reason based on real world production experience that I pushed to have our 
os.scandir API created in the first place)

> Avoiding calling list() on the output of scandir() may be not harmless.

Keeping it may not be harmless either. :)

One key point of os.scandir() is to be iterated over instead of turned into a 
list to avoid using too much memory.  Code that requires a list could've just 
called listdir() (as the previous code did in both of these places) - if that 
is intentional, we should include an explicit comment stating why a preloaded 
list was intentional.

Either these list calls go away as is natural with scandir, or comments should 
be added explaining why they are important (related: unittests exercising the 
situations they are important for would be ideal)

As you're the original author of the pathlib scandir code, do you remember what 
the intent of these list(scandir) calls was?

One potential difference without them: If someone is selecting via pathlib and 
modifying directory file contents while iterating over the results, the lists 
may be important.  (what happens in this case presumably depends on the 
underlying os fs implementation)

It sounds like we don't have test cases covering such scenarios.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37935] Improve performance of pathlib.scandir()

2019-09-10 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

Any optimization can be accepted only when we have any prove that the change 
actually has measurable effect, and that it is large enough.

Avoiding calling list() on the output of scandir() may be not harmless. For 
example it will left open a file descriptor. This can cause issues if walk a 
deep tree, especially if do this in few threads simultaneously. list() 
guaranties that it is open in very limited time (although using a context 
manager is desirable).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37935] Improve performance of pathlib.scandir()

2019-09-10 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

The title is misleading. There is no pathlib.scandir().

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37935] Improve performance of pathlib.scandir()

2019-09-10 Thread Gregory P. Smith


Gregory P. Smith  added the comment:

Shai, please open the 'Your Details' link in the bugs.python.org sidebar and 
make sure you have your github username filled in.  it needs to say ShaiAvr for 
our CLA automation to understand.

Avoiding calling list() on the output of scandir() is always desirable. :)

--
assignee:  -> gregory.p.smith
nosy: +gregory.p.smith
versions: +Python 3.8

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37935] Improve performance of pathlib.scandir()

2019-08-27 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

I think using the timeit module is enough. For more precise benchmarking you 
may need to use the pyperf module, but I think this is not a case.

For example, something like:

./python -m timeit -s "from pathlib import Path; p = Patch('...')" "for x 
in p.rglob('...'): pass"

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37935] Improve performance of pathlib.scandir()

2019-08-26 Thread Shai


Shai  added the comment:

I'm new to contributing here. I've never done benchmarking before.

I'd appreciate it if you could provide a guide to benchmarking.
You could look at the changes I made in the pull request (PR 15331). They're 
easy to follow and I think that removing a useless call to list() should 
enhance the performance, but I'd like to have benchmarking to back this up, so 
if someone more experienced could do this or at least provide a link to a 
guide, I'd really appreciate it.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37935] Improve performance of pathlib.scandir()

2019-08-26 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

Could you please provide any microbenchmarks that show the performance 
improvement?

--
nosy: +serhiy.storchaka

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37935] Improve performance of pathlib.scandir()

2019-08-26 Thread Shai


Shai  added the comment:

>From the docs (https://docs.python.org/3/library/os.html#os.scandir.close):

"This is called automatically when the iterator is exhausted or garbage 
collected, or when an error happens during iterating. However it is advisable 
to call it explicitly or use the with statement.".

The iterator is indeed closed properly, but the docs state that it's still 
advisable to close it explicitly, which is why I wrapped it in a with statement.

However, the more important change is that the iterator is no longer converted 
into a list, which should reduce the iterations from 2N to N, when N is the 
number of entries in the directory (one N when converting to list and another 
one when iterating it). This should enhance the performance of the functions 
that use scandir.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37935] Improve performance of pathlib.scandir()

2019-08-26 Thread hongweipeng


hongweipeng  added the comment:

Scandir() will be close when it iteration is over.You can see 
ScandirIterator_iternext:

```
static PyObject *
ScandirIterator_iternext(ScandirIterator *iterator)
{
while (1) {
...
}
/* Error or no more files */
ScandirIterator_closedir(iterator);
return NULL;
}
```

So, `entries = list(scandir(parent_path))` resources in this code can be 
properly closed.

--
nosy: +hongweipeng

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37935] Improve performance of pathlib.scandir()

2019-08-26 Thread Brett Cannon


Change by Brett Cannon :


--
nosy: +brett.cannon

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37935] Improve performance of pathlib.scandir()

2019-08-24 Thread SilentGhost


Change by SilentGhost :


--
nosy: +pitrou
stage:  -> patch review

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37935] Improve performance of pathlib.scandir()

2019-08-23 Thread Shai


New submission from Shai :

I recently have taken a look in the source code of the pathlib module, and I 
saw something weird there:

when the module used the scandir function, it first converted its iterator into 
a list and then used it in a for loop. The list wasn't used anywhere else, so I 
think the conversion to list is just a waste of performance.

In addition, I noticed that the scandir iterator is never closed (it's not used 
in a with statement and its close method isn't called). I know that the 
iterator is closed automatically when it's garbaged collected, but according to 
the docs, it's advisable to close it explicitly.

I've created a pull request that fixes these issues:
PR 15331

In the PR, I changed the code so the scandir iterator is used directly instead 
of being converted into a list and I wrapped its usage in a with statement to 
close resources properly.

--
components: Library (Lib)
messages: 350354
nosy: Shai
priority: normal
pull_requests: 15142
severity: normal
status: open
title: Improve performance of pathlib.scandir()
type: performance
versions: Python 3.9

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com