I have been playing with Python and parallelism. The result is commit 2340364e1063b5c38f5d47575c19e53fd4efc516, and I would like to share a few things about it with all of you.

To make a long story short: downloading rpm files in satellite-sync is now faster by a factor of about 1.5.

The rpm download happened in a fairly isolated for-loop, with each iteration independent of the others, so I thought it was a nice candidate for parallelism.

I created a new class, ThreadDownload, which takes one file at a time from a queue and downloads it. I start several threads, each one an instance of ThreadDownload. When the queue is empty, the thread finishes. The original for-loop now just populates the queue, then picks up the downloaded packages from out_queue and writes the info to the screen and/or to the log.
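To illustrate the pattern described above, here is a minimal sketch. It is not the code from the commit: only the names ThreadDownload and out_queue come from this mail, while download_all, in_queue and the fetch callable are hypothetical stand-ins for the real download logic.

```python
import sys
import threading

try:
    import Queue as queue  # module name on Python 2
except ImportError:
    import queue  # module name on Python 3


class ThreadDownload(threading.Thread):
    """Worker that keeps taking package names from a queue until it is empty."""

    def __init__(self, in_queue, out_queue, fetch):
        threading.Thread.__init__(self)
        self.in_queue = in_queue
        self.out_queue = out_queue
        self.fetch = fetch  # hypothetical callable that does the real download

    def run(self):
        while True:
            try:
                package = self.in_queue.get_nowait()
            except queue.Empty:
                return  # nothing left to hand out, so the thread finishes
            try:
                result = (package, self.fetch(package), None)
            except Exception:
                # Report the failure instead of dying silently, otherwise
                # the collecting loop would wait forever for this package.
                result = (package, None, sys.exc_info()[1])
            self.out_queue.put(result)


def download_all(packages, fetch, num_threads=4):
    """Download every item of the list `packages` using worker threads."""
    in_queue, out_queue = queue.Queue(), queue.Queue()
    # Populate the queue before any worker starts, so that an empty queue
    # really means "all work has been handed out".
    for package in packages:
        in_queue.put(package)
    workers = [ThreadDownload(in_queue, out_queue, fetch)
               for _ in range(num_threads)]
    for worker in workers:
        worker.start()
    # Pick up exactly one result per package; this is the place where the
    # original loop would write to the screen and/or to the log.
    results = [out_queue.get() for _ in packages]
    for worker in workers:
        worker.join()
    return results
```

Results arrive in completion order, not in the order the packages were queued, so the caller should not rely on ordering.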

I found that one part of rhnlib is not reentrant, so I put a lock around that part of the code. This is a place for future improvement.
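For readers unfamiliar with the technique, a lock around a non-reentrant call can be sketched roughly like this. The names rhnlib_lock and call_serialized are made up for the example and do not exist in rhnlib or in the commit. The explicit acquire/release with try/finally is used because the with statement is not available on Python 2.4.

```python
import threading

# One lock shared by all download threads (hypothetical name).
rhnlib_lock = threading.Lock()


def call_serialized(non_reentrant_call, *args):
    """Run a call that is not thread-safe while holding the shared lock."""
    rhnlib_lock.acquire()
    try:
        return non_reentrant_call(*args)
    finally:
        # Always release, even if the call raises, so that the other
        # threads are not blocked forever.
        rhnlib_lock.release()
```

The cost is that threads wait for each other in that section, which is why keeping the locked part as small as possible matters.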

I used 4 concurrent threads. The HTTP spec says that a single-user client should not keep more than two persistent connections per server (which, on the other hand, is somewhat modest these days; mainstream browsers limit it to 6 or so, I think).

I used threading instead of multiprocessing because multiprocessing is not available on RHEL5 (Python 2.4).

For the same reason I used a simple Queue, even though I know it is suboptimal. If a large file happens to be last, you have to finish the download with only one thread. A better option is to use PriorityQueue and download the largest files first, so that all threads are utilized all the time. But again, PriorityQueue was introduced in Python 2.6 and RHEL5 has 2.4 :( Well, in fact the best order is SmallestFile, 1-LargestFile, 2-LargestFile, ... This way we get a time estimate very quickly and then the best utilization of threads. If somebody wants to implement this and work around the missing PriorityQueue on RHEL5, be my guest. I suppose very similar code could go into repo-sync, but I do not use it, so my motivation is very low for that part of the code...
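One possible way to get the SmallestFile, 1-LargestFile, 2-LargestFile, ... order without PriorityQueue is to sort the list once before putting it into the plain FIFO Queue, since all files are queued before the threads start. The sketch below is untested against satellite-sync and assumes the file sizes are known up front; download_order and the (name, size) pair layout are hypothetical.

```python
def download_order(packages):
    """Return packages ordered smallest first, then largest to smallest.

    `packages` is a list of (name, size_in_bytes) pairs.
    """
    if not packages:
        return []
    by_size = sorted(packages, key=lambda item: item[1])
    smallest, rest = by_size[0], by_size[1:]
    rest.reverse()  # largest remaining file first
    return [smallest] + rest
```

For example, files of sizes 30, 10, 50 and 20 would be queued in the order 10, 50, 30, 20.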

I tested the code quite intensively, but if I broke something ... you know my irc nick and email...

Here are some benchmarks I did while syncing the channel redhat-rhn-proxy-5.4-server-i386-5. The times given are always the start and end of the phase in which satellite-sync downloads rpm files.

threading on
run 1:
05:46:29
05:48:48
=2:19

run 2:
06:12:11
06:14:34
=2:23

avg. 2:21

threading off
run 1:
05:52:16
05:55:50
=3:34

run 2:
08:34:33
08:38:23
=3:50

avg. 3:42
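The arithmetic behind the averages and the speedup factor can be checked with a few lines of Python. This only recomputes the numbers listed above; the helper name seconds is made up for the example.

```python
def seconds(start, end):
    """Elapsed seconds between two HH:MM:SS stamps on the same day."""
    def to_seconds(stamp):
        hours, minutes, secs = [int(part) for part in stamp.split(":")]
        return hours * 3600 + minutes * 60 + secs
    return to_seconds(end) - to_seconds(start)


# Start and end times copied from the runs above.
threaded = [seconds("05:46:29", "05:48:48"), seconds("06:12:11", "06:14:34")]
serial = [seconds("05:52:16", "05:55:50"), seconds("08:34:33", "08:38:23")]

avg_threaded = sum(threaded) / float(len(threaded))  # 141 s = 2:21
avg_serial = sum(serial) / float(len(serial))        # 222 s = 3:42
speedup = avg_serial / avg_threaded                  # about 1.57
```

With only two runs per setting, the factor is a rough indication, not a precise measurement.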

Comments are welcome. If the feedback is positive, I think we can apply the same pattern to errata and kickstarts.

Mirek

_______________________________________________
Spacewalk-devel mailing list
Spacewalk-devel@redhat.com
https://www.redhat.com/mailman/listinfo/spacewalk-devel
