On 06/27/2012 12:10 PM, Graham Binns wrote:
> Okay, so keeping the thread for reference:
>
> Today, it all Just Worked. The 32-core instance yesterday just hung
> around and then quit, but m1.small that I accidentally used today
> worked fine. Re-running things on a 32-core instance worked fine too.
Yay. mostly.
>
> Except for one small feature: we're still seeing an unknown worker.
>
> For these tests I limited the number of workers to 8 using
> --concurrency=8 in master.cfg. There are, however, 9 workers listed -
> 8 normal and one unknown. The unknown worker log appears to contain
> the output from bin/test. Not bin/test --subunit, mind; things like
> this:
>
> lp.codehosting.codeimport.tests.test_worker.ForeignBranchPluginLayer:tearDown
> lp.codehosting.codeimport.tests.test_worker.TestBzrSvnImport.test_forbidden
> lp.codehosting.codeimport.tests.test_worker.TestImportDataStore.test_fetch_with_dest_transport
> lp.codehosting.codeimport.tests.test_worker.TestGitImport.test_partial
> lp.codehosting.codeimport.tests.test_worker.RedirectTests.test_redirect_to_forbidden_url
> lp.codehosting.codeimport.tests.test_worker.ForeignBranchPluginLayer:tearDown
> Running in a subprocess.
> lp.services.messaging.tests.test_rabbit.TestRabbitUnreliableSession.test_connect_with_incomplete_configuration
> lp.services.messaging.tests.test_rabbit.TestRabbitUnreliableSession.test_connect
> lp.services.messaging.tests.test_rabbit.TestRabbitUnreliableSession.test_getConsumer
> lp.services.messaging.tests.test_rabbit.TestRabbitMessageBase.test_channel_session_closed
> lp.services.messaging.tests.test_rabbit.TestRabbitSession.test_disconnect
> lp.testing.layers.BaseLayer:tearDown

These all look like expected test names from that worker log. In
particular, the worker log is *generated* from the subunit, so it is at
least partially working; and the ":tearDown" and ":setUp" suffixes are
part of the subunit/zope.testing layer dance.

It smells very much like a subunit/testtools bug in the code that
aggregates subunit streams. It is that code, not any code in
zope.testing or in the Launchpad tree, that generates the worker tags.

A way that this might happen is if the tags get messed up.
We see global tags getting messed up by testr (we think, though Robert
thinks it is in zope.testing) regularly; we don't see worker or per-test
tags getting messed up.

> I've uploaded the entire contents of the lp_devel directory from this
> slave, gzipped, to U1 for your joy and edification:
> http://ubuntuone.com/4UL3L98uBZCcDHnYqYQy7c. If anyone can shed some
> light on what's going on, that'd be a great help to us here. I'm
> hoping it's not another "hey, let's muck with stdout" problem...

I don't see evidence of stdout issues so far, though we may get there
yet. We are currently eating stdout, stderr, and __stderr__; that still
leaves fun for __stdout__!

I suspect that what is happening is that we are getting a test failure
that is messing up the subunit stream in such a way that the
testr/testtools/subunit tower falls over. We can find out what that test
failure is, and hopefully fix both it and the fragility. Of course,
fixing the fragility may, yes, involve __stdout__ or file descriptors.
Whee!

My first choice for investigating this would be to, *while tests are
running*, get the contents of
/var/lib/buildbot/slaves/slave/lucid-devel/build/temp/ . That directory
is cleaned out after the tests are finished.

Then, once you have a non-setUp, non-tearDown and
non-Running-in-a-subprocess test name from the unknown worker log, go
and find the file from that directory that contains that test name. Run
those tests with

    xvfb-run ./bin/test -vvv --subunit --load-list FILENAME

on a precise LXC container somewhere (could be the one on the slave, or
could be elsewhere) and see if you can identify something in the stream
that looks "wrong," right before the unknown worker test.

>
> I'll try again on a --concurrency=20 32-core box tomorrow.
>

Cool

--
Mailing list: https://launchpad.net/~yellow
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~yellow
More help   : https://help.launchpad.net/ListHelp
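
In case it helps with the file hunt described in the reply above, here
is a rough sketch of the two mechanical steps: picking usable test names
out of the unknown worker log, and finding which file in a saved copy of
the temp directory mentions one of them. The function names are made up
for illustration, and it assumes the files in that directory are plain
text lists of test names, which I have not verified.

```python
import os


def candidate_test_names(log_lines):
    """Return log lines that look like real test names.

    Layer ":setUp"/":tearDown" markers and the "Running in a
    subprocess." notice are skipped.
    """
    names = []
    for line in log_lines:
        line = line.strip()
        if not line or line == "Running in a subprocess.":
            continue
        if line.endswith(":setUp") or line.endswith(":tearDown"):
            continue
        names.append(line)
    return names


def files_mentioning(directory, test_name):
    """Return sorted paths of files under directory containing test_name.

    Assumption: the files are readable as text.
    """
    matches = []
    for root, _dirs, files in os.walk(directory):
        for filename in files:
            path = os.path.join(root, filename)
            try:
                with open(path, errors="replace") as handle:
                    if test_name in handle.read():
                        matches.append(path)
            except OSError:
                # Unreadable file: skip it rather than abort the search.
                continue
    return sorted(matches)
```

Point files_mentioning() at a copy of the temp directory taken while the
tests are still running, since the original is cleaned out afterwards.
Any path it returns is a candidate FILENAME for --load-list.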

