On Thu, Nov 24, 2011 at 6:26 AM, Gary Poster <[email protected]> wrote:
> Hey Robert. Francis mentioned that you had updated the parallel testing LEP
> so I took a moment to look at it today.
>
> I cc'd the yellow squad to keep us all in the loop. Hi everybody! The LEP
> is https://dev.launchpad.net/LEP/ParallelTesting if you want to take a look.
>
> Could you clarify these points, ideally on the LEP?
>
> - You write that we must "[o]rganise and upgrade our CI test running to take
> advantage of this new environment." You also clarify that "[c]hanging the
> landing technology is out of scope." To make sure I understand, then, you
> want us to keep buildbot and everything else as-is as much as possible, but
> guide LOSAs to getting us machines/VMs that can quickly and robustly run
> these tests. Is this right? If so, no additional LEP clarification needed,
> I think, but otherwise, please give us more information there.
Yes. I am keen to overhaul our CI story as well, but that seems entirely
orthogonal to me. Francis and I would be very open to an argument that this
is misguided :). So: yes, that's right.

> - You write in comments that "The prototype LXC + testr based parallelisation
> seems to have the best effort-reward tradeoff today." [Yellow folks, I found
> https://dev.launchpad.net/ParallelTests to describe the prototype.] Have you
> done enough research here that you are able to recommend or even prescribe
> this approach? That would probably save time, if so; and though it violates
> my understanding of LEP goals to have an implementation prescribed, I think
> that ought to be relaxed for documents written by the TA.

I think prescribing it would make sense. Basically, we have a tonne of
same-machine shared state still lurking in our test suite that would make a
shared-machine parallel environment (e.g. splitting by threads or processes)
a high-risk endeavour: we could spend months finding and fixing such things
as they bite us, and as race conditions they would be a source of continual
pain rather than a clear "wow, that's broken, let's fix it" scenario. For
instance, one such thing is the oops directory for tests,
/var/tmp/lperr.test, which even now isn't quite gone or unique-per-test-run.

One very solid way to mitigate the risk of such race conditions is to use
separate machines, but separate machines present a coordination and setup
overhead: syncing code around, creating template dbs - none of these things
are free. We demonstrated parallel testing using subunit on multiple
machines years ago and it was very effective, and more recently Aaron has
written a canonistack helper that does parallel machines. LXC offers a way
to make very efficient use of one machine, with the benefits of having
separate machines and without [most] of the overhead of separate machines
or VMs.
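To make the shared-state point concrete: the fix for things like the fixed
oops path is to make such locations unique per test run. A minimal sketch of
the idea (the helper name and prefix are made up for illustration, not the
actual Launchpad code):

```python
import os
import tempfile


def make_oops_dir(run_id=None):
    """Create a unique per-test-run oops directory, instead of a shared
    fixed path like /var/tmp/lperr.test.

    Hypothetical helper for illustration only.
    """
    # tempfile.mkdtemp guarantees a fresh directory on every call, so two
    # concurrent test runs can never race on the same path.
    prefix = "lperr.%s." % run_id if run_id else "lperr."
    return tempfile.mkdtemp(prefix=prefix)


# Two parallel runs get distinct directories:
a = make_oops_dir("run1")
b = make_oops_dir("run2")
assert a != b
assert os.path.isdir(a) and os.path.isdir(b)
```

The same pattern applies to any on-disk state the suite currently assumes it
owns exclusively.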
> - If we use LXC, do you expect this effort to dig into the fragility that you
> note in your prototype notes, and try to improve it? If not, do you have
> requirements or thoughts on how to help developers work with the
> issues--perhaps scripts that developers are encouraged to use for the
> workflow, that handle problems like the ones you identify ("you may need to
> manually shutdown postgresql before stopping lxc, to get it to shutdown
> cleanly")?

Serge Hallyn and the Ubuntu server team are driving LXC to be a local cloud
deployment environment - it's a key feature goal for Precise. I expect that
we can benefit from this work (even without running Precise on the
server/VM that hosts the LXC instance - though we can do that if needed).
There will be things we need to automate, etc. I expect the squad to run
into some curly problems they need to escalate to the Ubuntu server team
for assistance on, but most of them have already been identified, and
workarounds sketched or implemented, in the LXC experiments wgrant and I
were doing.

> - If we use LXC, you describe a number of steps to set up a working
> environment. Do you envision a rocketfuel-XXX style script to help produce
> this environment? If so, do you have any requirements for it? If not, do
> you have something else in mind, and can we extract requirements from that?

There are two components here: the installation of dependencies/setup of
libvirt/LXC host configuration, and the creation of the LXC template.
Whether the first component is manual or automated doesn't really bother me
- I'm happy with automation, but OTOH it's a one-time cost to get the LOSAs
to set up a working environment on e.g. the buildbot slave. However, I
think it's important that we be able to rebuild an LXC template rapidly (or
even in the test run itself), as we *may* find that that is the most
reliable way to ensure everything is just so.
LXC environments can persist - certainly the template environment for the
tests can persist, and there are some latency benefits in having that, but
a totally clean environment is also very beneficial. The LXC guest itself
is only a couple of hundred MB, so it is pretty fast to bootstrap once LXC
has cached all the bits.

> If you don't intend to recommend/prescribe LXC + testr, these next two
> questions are pertinent.
>
> - You write that the solution "[m]ust parallelise more effectively than
> bin/test -j (which does per-layer splits)." Is that really a "must"? If we
> met your success metric ("down to less than 50% of the current time,
> preferably 15%-20%"), would it really matter which method got there? If it
> does matter, can you identify what the underlying "must" is for rejecting the
> -j approach, so that, for instance, other solutions can be cleanly rejected?

Our test distribution per layer is not very even - I highly doubt that we'd
be able to meet a reduction to 15% of the current time by splitting per
layer. The other issue, the shared global state that will bite us, will
also be a significant problem with -j, unless a remoting facility is
brought in (and at that point it seems to be reinventing subunit.... :P).

> - Francis had said earlier when talking with me about the project that
> running the tests on multiple machines might be an acceptable way to achieve
> the goal. You specifically disallow that, even with the LEP title ("Single
> machine parallel testing of single branches"), even though doing this with
> multiple machines would match the letter of the law (the biggest stretch I
> see is that "[p]ermit[ting] developers to reliably run parallelised as well"
> would mean that developers would need to run ec2 to meet that requirement).
> As with the previous question, is there a deeper "must" hidden in here
> somewhere? Perhaps it is cost related?

Not really - if using multiple machines/full-blown VMs is the right way
forward, we can do that.
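As an aside on the -j point above, a toy model shows why per-layer splitting
caps the achievable speedup. The layer names and timings below are invented
for illustration, not measured Launchpad figures:

```python
# Per-layer splitting (bin/test -j style) treats each layer as an
# indivisible chunk, so wall-clock time is bounded below by the slowest
# layer. Per-test splitting (testr style) streams tests to idle workers,
# so the floor is roughly total/workers.
layer_minutes = {
    "AppServerLayer": 120,   # invented numbers for illustration
    "DatabaseLayer": 40,
    "FunctionalLayer": 25,
    "ZopelessLayer": 15,
}

workers = 4
total = sum(layer_minutes.values())            # 200 minutes serially

# Per-layer split: can't finish before the biggest layer does.
per_layer_floor = max(layer_minutes.values())  # 120 minutes

# Per-test split (ideal balancing): work divides evenly across workers.
per_test_floor = total / workers               # 50 minutes

print(per_layer_floor / total)  # 0.6  -> only down to 60% of serial time
print(per_test_floor / total)   # 0.25 -> down to 25%
```

With a distribution this skewed, no number of workers gets a per-layer split
anywhere near the 15%-20% target; per-test splitting can.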
I believe that in the single-developer case performance will be
significantly worse than with LXC, due to the lack of a shared page cache
and increased disk IO. You also then have the overhead of syncing code
across to the VMs/machines...

> That's all I've got so far. :-)

Thanks, these are great questions - please keep them coming!

-Rob

-- 
Mailing list: https://launchpad.net/~yellow
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~yellow
More help   : https://help.launchpad.net/ListHelp

