I have been playing with Python and parallelism. The result is commit
2340364e1063b5c38f5d47575c19e53fd4efc516, and I would like to share
some details about it with all of you.
To make a long story short: the download of rpm files in satellite-sync
is now faster by a factor of about 1.5.
The rpm download was in a fairly isolated for-loop, with each iteration
independent of the others. So I thought it was a nice candidate for
parallelism.
I created a new class, ThreadDownload, which takes one file to be
downloaded from a queue at a time.
I start several threads, each one an instance of ThreadDownload. When
the queue is empty, the thread finishes.
The original for-loop now just populates the queue, then picks up the
downloaded packages from out_queue and writes the info to the screen
and/or to the log.
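
The pattern described above can be sketched roughly as follows. This is
a minimal illustration, not the code from the commit: it is written for
Python 3 (where the module is called queue, not Queue), and
download_all, fetch and the (item, result, error) tuples are names made
up for the example.

```python
import queue
import threading


class ThreadDownload(threading.Thread):
    """Worker that takes items from in_queue until it is empty."""

    def __init__(self, in_queue, out_queue, fetch):
        threading.Thread.__init__(self)
        self.in_queue = in_queue
        self.out_queue = out_queue
        self.fetch = fetch  # hypothetical callable that downloads one item

    def run(self):
        while True:
            try:
                item = self.in_queue.get_nowait()
            except queue.Empty:
                return  # nothing left to download, the thread finishes
            try:
                self.out_queue.put((item, self.fetch(item), None))
            except Exception as exc:
                # Always report back, so the collecting loop never hangs.
                self.out_queue.put((item, None, exc))


def download_all(items, fetch, num_threads=4):
    """Populate the queue, start the workers, collect one result per item."""
    items = list(items)
    in_queue = queue.Queue()
    out_queue = queue.Queue()
    for item in items:
        in_queue.put(item)
    threads = [ThreadDownload(in_queue, out_queue, fetch)
               for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    # Results arrive in completion order, not in submission order.
    results = [out_queue.get() for _ in items]
    for thread in threads:
        thread.join()
    return results
```

With a stand-in such as download_all(["a.rpm", "b.rpm"], len) the
caller gets back one tuple per file and can print or log each of them
from the main thread, which is what the original for-loop does.
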
I found that one part of rhnlib is not reentrant, so I put a lock around
that part of the code. This is a place for future improvement.
I used 4 concurrent threads. The HTTP spec says that a single-user
client should not keep more than two persistent connections per server
(which, on the other hand, is somewhat modest these days; mainstream
browsers limit it to 6 or so, I think).
I used threading instead of multiprocessing, because multiprocessing is
not available in RHEL5 (python 2.4).
For the same reason I used a simple Queue, even though I know it is
suboptimal. If a large file happens to be last, then you have to finish
the download with only one thread. So the better option is to use
PriorityQueue and download the largest files first, so that all threads
are utilized all the time. But again, PriorityQueue was introduced in
python 2.6 and RHEL5 has 2.4 :(
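
A small simulation makes the effect visible. It is only a sketch under
two assumptions: download time is proportional to file size, and every
idle thread simply takes the next item from the queue. The sizes are
made up and makespan is a hypothetical helper.

```python
import heapq


def makespan(sizes, workers=4):
    """Time at which the last download ends when jobs are taken in list order."""
    free_at = [0] * workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    finish = 0
    for size in sizes:
        start = heapq.heappop(free_at)  # the earliest idle worker takes the job
        end = start + size
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


sizes = [1, 1, 1, 1, 1, 1, 1, 1, 10]  # the large file happens to be last
print(makespan(sizes))                        # 12
print(makespan(sorted(sizes, reverse=True)))  # 10
```

Since the whole list of files is known before the queue is populated,
sorting that list by size before putting it into a plain FIFO Queue
might be enough to get the largest-first order without PriorityQueue.
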
Well, in fact the best order is: the smallest file first, then the
largest file, the second largest file, and so on. This way we get a time
estimate very quickly and then we get the best utilization of threads.
If somebody wants to implement this and work around the missing
PriorityQueue on RHEL5 - be my guest.
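
As a starting point for anybody who wants to try, the ordering itself
could be computed like this. The function name and the (name, size)
tuples are assumptions made for the example, not what satellite-sync
uses internally.

```python
def download_order(packages):
    """Order (name, size) pairs: smallest first, then largest to smallest."""
    by_size = sorted(packages, key=lambda pkg: pkg[1])
    if not by_size:
        return []
    # The smallest file goes first; the rest follow in descending size.
    return [by_size[0]] + list(reversed(by_size[1:]))


packages = [("a.rpm", 5), ("b.rpm", 1), ("c.rpm", 9), ("d.rpm", 3)]
print([name for name, _ in download_order(packages)])
# ['b.rpm', 'c.rpm', 'a.rpm', 'd.rpm']
```
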
I suppose very similar code could go into repo-sync, but I do not use
it, so my motivation for that part of the code is very low...
I tested the code quite intensively, but if I broke something ... you
know my irc nick and email...
Here are some benchmarks I did while syncing the channel
redhat-rhn-proxy-5.4-server-i386-5. The times given are always the start
and end of the phase in which satellite-sync downloads the rpm files.
threading on
  run 1:  05:46:29 - 05:48:48  = 2:19
  run 2:  06:12:11 - 06:14:34  = 2:23
  avg.    2:21
threading off
  run 1:  05:52:16 - 05:55:50  = 3:34
  run 2:  08:34:33 - 08:38:23  = 3:50
  avg.    3:42
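
The speedup can be recomputed from these timestamps; the snippet below
only redoes the arithmetic on the numbers above (elapsed is a helper
written for this example).

```python
from datetime import datetime


def elapsed(start, end):
    """Seconds between two HH:MM:SS timestamps taken on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


threaded = [elapsed("05:46:29", "05:48:48"), elapsed("06:12:11", "06:14:34")]
serial = [elapsed("05:52:16", "05:55:50"), elapsed("08:34:33", "08:38:23")]

avg_threaded = sum(threaded) / len(threaded)  # 141.0 s = 2:21
avg_serial = sum(serial) / len(serial)        # 222.0 s = 3:42
speedup = avg_serial / avg_threaded           # roughly 1.57
print(round(speedup, 2))
```
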
Comments are welcome. If the feedback is positive, I think we can
apply the same pattern to errata and kickstarts.
Mirek