Hi guys,

This week-end, I monitored (and helped a bit) the package importer handling the backlog following the wheezy release.
This is a bit long, but many things happen when a new debian release is imported, thanks for your patience ;) For the impatient among you, some high-level, imprecise (TM) numbers:

- the 15,000 branches were imported in less than 3 days,
- the longest import was gnumeric at more than 2 hours (wall time),
- the average time to import a package is 5 mins (wall time).

This produced a big peak in the graph at:

https://lpstats.canonical.com/graphs/BranchesPerDayNew/

Roughly, wheezy required creating ~15,000 new branches, plus a couple of pending failed attempts to clear. The package importer had been turned off while lp was prepared for the rush, and it was restarted Friday 11 Feb circa 11:00 UTC (all dates below are in UTC too).

http://webnumbr.com/outstanding-package-imports tells the complete story from the package import point of view. At 3 packages/min, the first estimate was that it would finish circa Monday 14 Feb 12:00 UTC at best, and more probably Wed 16 Feb.

I thought it was a good time to run some load experiments, since we won't have another debian release to watch until... some unknown date ;) So I went to jubany and progressively raised the number of threads (by doing 'echo 12 > max_threads' in the 'new' directory):

2011-02-12 00:17:38.185000 - Concurrency is now: 8

Seeing that things went well, I continued adding 4 threads at a time, watching https://code.launchpad.net/+recently-registered-branches to make sure the branch scanner wasn't blowing up, and finally reached 32 threads:

2011-02-12 10:53:53.084000 - Concurrency is now: 12
2011-02-12 11:04:57.763000 - Concurrency is now: 16
2011-02-12 11:09:54.381000 - Concurrency is now: 20
2011-02-12 11:23:32.988000 - Concurrency is now: 24
2011-02-12 12:25:38.213000 - Concurrency is now: 28
2011-02-12 12:58:42.076000 - Concurrency is now: 32

Branch scans are queued (thanks to wgrant for clarifying this one), so we can't DDoS the lp branch scanner, which is good. Nevertheless, I saw the number of branches waiting to be scanned rise as I raised the number of threads. So far, so good: the progression wasn't linear and the branch scanner "lag" was quite erratic anyway, so no real correlation there. Just a couple of data points: at 3 packages/min the number of branches showing 'This branch has not been scanned yet' was ~10 and the message disappeared (the so-called lag) in less than 2 minutes. At 6 packages/min, I've seen the lag go up to 10 minutes and the number of branches waiting to be scanned go as high as ~40. But the max values were reached only rarely; on average the branch scanner was ~2 minutes behind with ~15-20 branches to be processed.

No direct correlation either between the number of threads and the number of packages per minute: with 8 threads we reach ~3 packages/min, with 32 threads ~6 packages/min. jubany wasn't swapping; I made sure there was always at least 2GB free out of the 32GB. Free memory went down to ~1GB at one point before stabilizing at 8GB free (i.e. more than the initial ~6GB with 8 threads, which is weird, but I didn't investigate; the buffers reported by top stayed at ~10GB the whole time and no swap was ever used). There is probably some bottleneck here, but I didn't search for it: I mostly wanted to collect a set of data points to establish a comparison basis for the future. So, no changes except the number of threads, and given the observed behaviour I didn't overload any part of the tool chain, so the numbers should be valid.
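As an aside, for those who wondered about the 'echo 12 > max_threads' trick above: the importer re-reads that file while running, so the concurrency can be changed without a restart. Here is a minimal sketch of that file-driven pattern; it is not the actual importer code, and read_limit/import_package are made-up names:

    import threading
    import time

    MAX_THREADS_FILE = 'max_threads'  # the file poked with 'echo N > max_threads'

    def read_limit(path, default):
        """Return the integer stored in `path`, falling back to `default`."""
        try:
            with open(path) as f:
                return max(1, int(f.read().strip()))
        except (IOError, ValueError):
            return default

    def import_package(name):
        """Hypothetical stand-in for a real package import."""
        time.sleep(1)

    def drive(queue, limit=8):
        """Run imports from `queue`, re-reading the concurrency limit as we go."""
        active = []
        while queue or active:
            active = [t for t in active if t.is_alive()]  # reap finished imports
            new_limit = read_limit(MAX_THREADS_FILE, limit)
            if new_limit != limit:
                limit = new_limit
                print('Concurrency is now: %d' % limit)
            while queue and len(active) < limit:
                t = threading.Thread(target=import_package,
                                     args=(queue.pop(0),))
                t.start()
                active.append(t)
            time.sleep(1)

The nice property of using a file as the interface is exactly what I relied on this week-end: anyone with shell access can tune the load while the importer keeps running.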
A package import starts at ~45M and generally ends, after less than 1 min of CPU time, under ~100M. Some imports go up to ~150M, I saw a couple of them reach 200M, and at least one reached 500M (gnumeric). 32 * 500M = 16G, so I thought I was safe leaving the p-i running with 32 threads.

There were some transient librarian failures:

Invalid http response for https://launchpadlibrarian.net/24134878/oneisenough_0.40-2.dsc: Unable to handle http code 503

The first one occurred in isolation and wgrant looked around on the lp side of things without finding real evidence (apart from spm mentioning a +10 minutes to restart the librarian :-/). There have been ~100 failures related to the librarian returning 503 http errors; they are transient and I observed a couple of them occurring before the librarian came back to normal in less than 10 minutes (i.e. too small a window to attempt a more precise diagnosis, but we have logs so we will be able to investigate and cross ours with the librarian ones). A few of them below for those with access to the librarian logs:

2011-02-13 00:13:08,126 - __main__ - WARNING - Importing octave-communications failed: bzrlib.errors.InvalidHttpResponse: Invalid http response for https://launchpadlibrarian.net/40039776/octave-communications_1.0.10-2.dsc: Unable to handle http code 503: expected 200 or 404 for full response.:
2011-02-13 00:13:28,751 - __main__ - WARNING - Importing jester failed: bzrlib.errors.InvalidHttpResponse: Invalid http response for https://launchpadlibrarian.net/18993704/jester_1.0-8.dsc: Unable to handle http code 503: expected 200 or 404 for full response.:
2011-02-13 00:13:28,785 - __main__ - WARNING - Importing monodevelop-database failed: bzrlib.errors.InvalidHttpResponse: Invalid http response for https://launchpadlibrarian.net/55583192/monodevelop-database_2.4%2Bdfsg-2.dsc: Unable to handle http code 503: expected 200 or 404 for full response.:
2011-02-13 00:13:28,789 - __main__ - WARNING - Importing bzr-rewrite failed: bzrlib.errors.InvalidHttpResponse: Invalid http response for https://launchpadlibrarian.net/58283982/bzr-rewrite_0.6.1%2Bbzr225-1.dsc: Unable to handle http code 503: expected 200 or 404 for full response.:
2011-02-13 00:16:55,411 - __main__ - WARNING - Importing potool failed: bzrlib.errors.InvalidHttpResponse: Invalid http response for https://launchpadlibrarian.net/56057698/potool_0.11-2.dsc: Unable to handle http code 503: expected 200 or 404 for full response.:

I could provide more if needed; the log files mention ~100 of them.

Since I was becoming impatient and things were going well, I went ahead and pushed the importer some more for a short period:

2011-02-13 00:33:27.476000 - Concurrency is now: 32
2011-02-13 18:24:47.175000 - Concurrency is now: 40
2011-02-13 18:33:55.523000 - Concurrency is now: 64
2011-02-13 18:34:45.963000 - Concurrency is now: 62
2011-02-13 18:35:26.124000 - Concurrency is now: 60
2011-02-13 18:35:56.346000 - Concurrency is now: 57
2011-02-13 18:36:36.756000 - Concurrency is now: 40
2011-02-13 18:37:47.845000 - Concurrency is now: 39
2011-02-13 18:38:18.547000 - Concurrency is now: 32

No trouble. Later on, we hit an obscure failure:

2011-02-13 19:56:59,671 - __main__ - CRITICAL - Driver hit database is locked

This is probably due to a race condition when starting the imports: the current code doesn't wait for one import to start before starting the next one (https://bugs.launchpad.net/udd/+bug/719196). It's even more amazing to have encountered it *after* I pushed to 64 threads and went back to 32...
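I haven't looked at what the fix should actually be, but to illustrate the race: one way to "wait for one import to start before starting the next one" is to have each import signal the driver once its start-up step (the part that presumably touches the shared status database) is done. A minimal sketch, with record_start and do_import as hypothetical stand-ins rather than real p-i functions:

    import threading

    def start_import(package, record_start, do_import):
        """Start one import, but only return once its start-up step is done."""
        started = threading.Event()

        def worker():
            record_start(package)  # the racy part: shared state is touched here
            started.set()          # tell the driver it is safe to start another import
            do_import(package)     # the import itself still runs concurrently

        t = threading.Thread(target=worker)
        t.start()
        started.wait()             # serialize only the start-up step
        return t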
To be safe, I lowered the concurrency to 8 again to let the whole wheezy import finish (thanks to spm for promptly restarting the importer):

2011-02-13 23:06:49.215000 - Concurrency is now: 8

Overall, I'm impressed by the robustness of the *whole* system[1] (i.e. launchpad + the package importer). An important point is that we want to be able to temporarily raise the number of threads in the future to handle such huge imports. We need more monitoring and more data collection[2] to refine what resources we really need between these big imports.

In parallel, my package-import.local vm has now reached '0' outstanding jobs too. Of course it runs in read-only mode, with package_import.py being called with '--no-push', and it currently has 16403 failures due to some access misconfiguration ;) But it runs and that's a nice first step.

Congratulations for staying with me this long!

Vincent

[1] And while I'm working on making the p-i more robust, I'd like to be clear, for the record, that James left us an already robust system and I thank him for that. So, in the future, if you hear me say or see me write "the p-i sucks", I'll be talking about the remaining defects and about the bugs we will add ;-)

[2] The obvious candidates being memory peak and CPU/elapsed times (we can, and I will, extract the latter from the existing logs, see bzr+ssh://bazaar.launchpad.net/~vila/udd/analyze_log/).
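P.S. For those who want to cross our logs with the librarian ones (or just count the 503s), here is a rough sketch of the kind of log crunching [2] alludes to. It only knows about the WARNING line format quoted above and is not taken from the analyze_log branch:

    import re
    import sys
    from collections import Counter

    # Matches the failure lines quoted above, e.g.:
    # 2011-02-13 00:13:28,751 - __main__ - WARNING - Importing jester failed: ...
    FAILURE_RE = re.compile(
        r'^(?P<timestamp>\d{4}-\d{2}-\d{2} [\d:,]+) - __main__ - WARNING - '
        r'Importing (?P<package>\S+) failed: .*'
        r'Unable to handle http code (?P<code>\d+)')

    def librarian_failures(lines):
        """Yield (timestamp, package, http code) for each matching log line."""
        for line in lines:
            m = FAILURE_RE.match(line)
            if m:
                yield m.group('timestamp'), m.group('package'), m.group('code')

    if __name__ == '__main__':
        per_hour = Counter()
        for timestamp, package, code in librarian_failures(sys.stdin):
            per_hour[(timestamp[:13], code)] += 1  # group by YYYY-MM-DD HH
        for (hour, code), count in sorted(per_hour.items()):
            print('%s: %d failures with http code %s' % (hour, count, code))

Feeding it the five lines quoted above would report a single bucket: 5 failures with http code 503 during the 2011-02-13 00:xx hour.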
