[issue32453] shutil.rmtree can have O(n^2) performance on large dirs

2021-10-30 Thread Niklas Hambüchen

Niklas Hambüchen  added the comment:

A small update / summary so far:

>From here this developed into coreutils discussion:

#29921 O(n^2) performance of rm -r
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=29921

and finally a `linux-fsdevel` discussion:

O(n^2) deletion performance

https://lore.kernel.org/linux-fsdevel/5ca3808d-4eea-afec-75a6-2cc41f44b...@nh2.me/t/#u

Dave Chinner (xfs dev) suggests that on XFS there is no quadratic behaviour 
once  the problem is bound by seek-time of the spinning disk.

Somebody should try to confirm that it becomes linear in even larger tests, 
e.g. way larger than 21 minutes deletion time.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32453] shutil.rmtree can have O(n^2) performance on large dirs

2018-06-12 Thread Giampaolo Rodola'


Change by Giampaolo Rodola' :


--
nosy: +giampaolo.rodola

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32453] shutil.rmtree can have O(n^2) performance on large dirs

2018-01-02 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

Yes, so `rm -rf` is quadratic on my SSD too...

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32453] shutil.rmtree can have O(n^2) performance on large dirs

2018-01-02 Thread Serhiy Storchaka

Serhiy Storchaka  added the comment:

Thanks for your investigations Niklas. I ran my benchmark on a spinning disk, 
but significant nonlinearity is observed only for the size of directory around 
100 and more. And yes, sorting by inode number helps.

$ for j in 1 2 3 4 6 8 12 16; do i=$((j*10)); mkdir x; (cd x && seq 
$i|xargs touch); env time -f "%e $i" python3.7-unpatched -c 'import shutil; 
shutil.rmtree("x")'; done
1.01 10
3.80 20
3.64 30
4.89 40
8.72 60
11.86 80
56.80 120
209.82 160

$ for j in 1 2 3 4 6 8 12 16; do i=$((j*10)); mkdir x; (cd x && seq 
$i|xargs touch); env time -f "%e $i" python3.7-patched -c 'import shutil; 
shutil.rmtree("x")'; done
0.97 10
2.42 20
3.84 30
4.48 40
7.07 60
10.01 80
15.53 120
23.24 160

$ for j in 1 2 3 4 6 8 12 16; do i=$((j*10)); mkdir x; (cd x && seq 
$i|xargs touch); env time -f "%e $i" env rm -rf x; done
0.68 10
1.34 20
2.10 30
3.95 40
5.95 60
10.28 80
47.66 120
89.32 160

On an SSD:

$ for j in 1 2 3 4 6 8 12 16; do i=$((j*10)); mkdir x; (cd x && seq 
$i|xargs touch); env time -f "%e $i" python3.7-unpatched -c 'import shutil; 
shutil.rmtree("x")'; done
1.00 10
1.93 20
2.90 30
4.98 40
7.05 60
9.87 80
21.45 120
36.19 160

$ for j in 1 2 3 4 6 8 12 16; do i=$((j*10)); mkdir x; (cd x && seq 
$i|xargs touch); env time -f "%e $i" python3.7-patched -c 'import shutil; 
shutil.rmtree("x")'; done
0.96 10
1.99 20
3.09 30
4.85 40
7.55 60
9.44 80
16.03 120
21.28 160

$ for j in 1 2 3 4 6 8 12 16; do i=$((j*10)); mkdir x; (cd x && seq 
$i|xargs touch); env time -f "%e $i" env rm -rf x; done
0.67 10
1.38 20
2.41 30
2.82 40
5.24 60
7.02 80
18.60 120
30.58 160

Interesting that patched Python is faster than 'rm' (GNU coreutils 8.26) for 
large numbers.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32453] shutil.rmtree can have O(n^2) performance on large dirs

2018-01-01 Thread Niklas Hambüchen

Niklas Hambüchen  added the comment:

A better location to view the whole coreutils thread is:

https://debbugs.gnu.org/cgi/bugreport.cgi?bug=29921

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32453] shutil.rmtree can have O(n^2) performance on large dirs

2017-12-31 Thread Niklas Hambüchen

Niklas Hambüchen  added the comment:

OK, my coreutils email is at 
http://lists.gnu.org/archive/html/bug-coreutils/2017-12/msg00054.html

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32453] shutil.rmtree can have O(n^2) performance on large dirs

2017-12-31 Thread Niklas Hambüchen

Niklas Hambüchen  added the comment:

> Did you try to sync and flush caches before running `rm -r`?

Yes, it doesn't make a difference for me, I still see the same O(n²) behaviour 
in `rm -r`.

I've sent an email "O(n^2) performance of rm -r" to bug-coreut...@gnu.org just 
now, unfortunately I can't link it yet because the mailman archive doesn't show 
it yet. It should appear soon on 
http://lists.gnu.org/archive/html/bug-coreutils/2017-12/threads.html.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32453] shutil.rmtree can have O(n^2) performance on large dirs

2017-12-31 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

Did you try to sync and flush caches before running `rm -r`?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32453] shutil.rmtree can have O(n^2) performance on large dirs

2017-12-31 Thread Niklas Hambüchen

Niklas Hambüchen  added the comment:

It turns out I was wrong when saying that there's some cache we're hitting.

In fact, `rm -r` is still precisely O(n^2), even with the coreutils patch I 
linked.

Quick overview table of the benchmark:

 nfiles real   user sys

 100.51s  0.07s   0.43s
 202.46s  0.15s   0.89s
 40   10.78s  0.26s   2.21s
 80   44.72s  0.58s   6.03s
160  180.37s  1.06s  10.70s

Each 2x increase of number of files results in 4x increased deletion time.


Full benchmark output:

# time (mkdir dirtest && cd dirtest && seq 1 10 | xargs touch)

real  0m0.722s
user  0m0.032s
sys 0m0.680s

# time rm -rf dirtest/

real  0m0.519s
user  0m0.074s
sys 0m0.437s

# time (mkdir dirtest && cd dirtest && seq 1 20 | xargs touch)

real  0m1.576s
user  0m0.044s
sys 0m1.275s

# time rm -r dirtest/

real  0m2.469s
user  0m0.150s
sys 0m0.890s

# time (mkdir dirtest && cd dirtest && seq 1 40 | xargs touch)

real  0m4.249s
user  0m0.098s
sys 0m2.804s

# time rm -rf dirtest/

real  0m10.782s
user  0m0.265s
sys 0m2.213s

# time (mkdir dirtest && cd dirtest && seq 1 80 | xargs touch)

real  0m10.533s
user  0m0.204s
sys 0m5.758s

# time rm -rf dirtest/

real  0m44.725s
user  0m0.589s
sys 0m6.037s

# time (mkdir dirtest && cd dirtest && seq 1 160 | xargs touch)

real  0m34.480s
user  0m0.382s
sys 0m12.057s

# time rm -r dirtest/

real  3m0.371s
user  0m1.069s
sys 0m10.704s

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32453] shutil.rmtree can have O(n^2) performance on large dirs

2017-12-31 Thread Niklas Hambüchen

Niklas Hambüchen  added the comment:

Serhiy, did you run your benchmark on an SSD or a spinning disk?

The coreutils bug mentions that the problem is seek times.

My tests on a spinning disk with 400k files suggest that indeed rmtree() is 
~30x slower than `rm -r`:

# time (mkdir dirtest && cd dirtest && seq 1 10 | xargs touch)

real  0m0.722s
user  0m0.032s
sys 0m0.680s

# time rm -rf dirtest/

real  0m0.519s
user  0m0.074s
sys 0m0.437s

# time (mkdir dirtest && cd dirtest && seq 1 10 | xargs touch)

real  0m0.693s
user  0m0.039s
sys 0m0.659s

# time python -c 'import shutil; shutil.rmtree("dirtest")'

real  0m0.756s
user  0m0.225s
sys 0m0.499s

# time (mkdir dirtest && cd dirtest && seq 1 10 | xargs touch)

real  0m0.685s
user  0m0.032s
sys 0m0.658s

# time python3 -c 'import shutil; shutil.rmtree("dirtest")'

real  0m0.965s
user  0m0.424s
sys 0m0.528s

# time (mkdir dirtest && cd dirtest && seq 1 40 | xargs touch)

real  0m4.249s
user  0m0.098s
sys 0m2.804s

# time rm -rf dirtest/

real  0m10.782s
user  0m0.265s
sys 0m2.213s

# time (mkdir dirtest && cd dirtest && seq 1 40 | xargs touch)

real  0m5.236s
user  0m0.107s
sys 0m2.832s

# time python -c 'import shutil; shutil.rmtree("dirtest")'

real  3m8.006s
user  0m1.323s
sys 0m3.929s

# time (mkdir dirtest && cd dirtest && seq 1 40 | xargs touch)

real  0m4.671s
user  0m0.097s
sys 0m2.832s

# time python3 -c 'import shutil; shutil.rmtree("dirtest")'

real  2m49.476s
user  0m2.196s
sys 0m3.695s

The tests were done with coreutils rm 8.28, Python 2.7.14, Python 3.6.3,  on 
ext4 (rw,relatime,data=ordered), on a dmraid RAID1 across 2 WDC_WD4000FYYZ 
disks (WD 4 TB Enterprise).

Also note how deleting 100k files takes ~0.5 seconds with `rm -r` and the 
Pythons, but deleting 4x more files takes 20x longer with `rm -r` and ~300x 
longer with the Pythons.

There is clearly some boundary below which we are hitting some nice cached 
behaviour.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32453] shutil.rmtree can have O(n^2) performance on large dirs

2017-12-30 Thread Serhiy Storchaka

Change by Serhiy Storchaka :


Added file: https://bugs.python.org/file47355/bench_rmtree.py

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32453] shutil.rmtree can have O(n^2) performance on large dirs

2017-12-30 Thread Serhiy Storchaka

Serhiy Storchaka  added the comment:

I have tested shutil.rmtree() with a large number of files using modified 
benchmark from issue28564. For 40 files it takes less than 5 seconds. From 
the comment to the coreutils benchmark 
(http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=24412edeaf556a):

# Using rm -rf to remove a 400k-entry directory takes:
# - 9 seconds with the patch, on a 2-yr-old system
# - 350 seconds without the patch, on a high-end system (disk 20-30% faster) 
threshold_seconds=60

Running the coreutils benchmark gives the result 3 seconds on my computer.

It seems to me that this issue have been fixed in the kernel.

--
nosy: +serhiy.storchaka

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32453] shutil.rmtree can have O(n^2) performance on large dirs

2017-12-30 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

I don't think filesystem-specific optimizations belong in the Python stdlib.  
Python is compatible with multiple operating systems, including Windows, macOS, 
Android, many POSIX variants.  It would be much better if this were fixed in 
the kernel...

--
nosy: +pitrou

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32453] shutil.rmtree can have O(n^2) performance on large dirs

2017-12-29 Thread Niklas Hambüchen

New submission from Niklas Hambüchen :

See http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=24412edeaf556a 
for the explanation and equivalent fix in coreutils.

The gist ist that deleting entries in inode order can improve deletion 
performance dramatically.

To obtain inode numbers and sort by them, one needs to `getdents()` all the 
entries ahead of time, but rmtree() already gets all dirents ahead of the 
deletion. https://bugs.python.org/issue28564 recently improved shutil.rmtree() 
performance by using scandir(), but nevertheless the returned generator is 
list()ed, so we already have all necessary informtaion in memory and would just 
have to perform an O(n) integer sort.

I propose we check if the improvements made to `rm -r` in coreutils should be 
ported to shutil.rmtree().

--
components: Library (Lib)
messages: 309217
nosy: nh2
priority: normal
severity: normal
status: open
title: shutil.rmtree can have O(n^2) performance on large dirs
type: enhancement
versions: Python 3.7

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com