On Wed, Oct 20, 2010 at 11:48 PM, Thomas Koch <tho...@koch.ro> wrote:
> Hi,
>
> last night I let my hudson server do 42 (sic) builds of ZooKeeper trunk. One
> of this builds failed:
>
> junit.framework.AssertionFailedError: Leader hasn't joined: 5
>     at org.apache.zookeeper.test.FLETest.testLE(FLETest.java:312)
>
> I did this many builds of trunk, because in my quest to redo the client netty
> integration step by step I made one step which resulted in 2 failed builds out
> of 8. The two failures were both:
>
> junit.framework.AssertionFailedError: Threads didn't join
>     at org.apache.zookeeper.test.FLERestartTest.testLERestart(FLERestartTest.java:198)

Hi Thomas, there's an open jira for this:
https://issues.apache.org/jira/browse/ZOOKEEPER-653
It would be great if you'd like to address it.

> I can't find any relationship between the above test and my changes. The test
> does not use the ZooKeeper client code at all. So I begin to believe that
> there are some Heisenbugs, Bohrbugs or Mandelbugs[1] in ZooKeeper that just
> happen to show up from time to time without any relationship to the current
> changes.
>
> I'll try to investigate the cause further, maybe there is some relationship
> I've not yet found. But if my assumption should apply, then these kind of bugs
> would be a strong argument in favor of refactoring. These bugs are best found
> by cleaning the code, most important implementing strict separation of
> concerns.

I believe the bug is in the test, rather than in the code. Forming a quorum is
non-deterministic; the test assumes that it's allowing enough time for everyone
to join, and this may not be the case. The opposite may be true as well: it
might be that something is really failing. However, my understanding from
Flavio is that it's the former. The unfortunate thing is that since we don't
really know which it is, we sort of ignore these failures. Really we should fix
this issue "for reals". Whatever that means...
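
To make the "is it the test or the code" question a bit easier to answer, a test could wait against one overall deadline and report exactly which threads never finished, instead of just asserting that everyone joined. This is only a minimal sketch under my own assumptions; `JoinWithDeadline` and `joinAll` are hypothetical names and are not taken from FLETest, FLERestartTest, or any ZooKeeper API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of ZooKeeper: joins a group of threads
// against a single shared deadline and reports the ones still running.
final class JoinWithDeadline {

    /**
     * Waits up to timeoutMs in total (not per thread) for all threads to end.
     * Returns the threads that are still alive once the deadline has passed.
     */
    static List<Thread> joinAll(List<Thread> threads, long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        List<Thread> stuck = new ArrayList<>();
        for (Thread t : threads) {
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            // join(0) would wait forever, so only join while time remains.
            if (remainingMs > 0) {
                try {
                    t.join(remainingMs);
                } catch (InterruptedException e) {
                    // Restore the flag; later threads are then only checked, not waited on.
                    Thread.currentThread().interrupt();
                }
            }
            if (t.isAlive()) {
                stuck.add(t);
            }
        }
        return stuck;
    }
}
```

A test could then fail with the names of the threads returned by `joinAll`, which at least tells us whether it is always the same peer that is late. Whether a longer deadline would actually cure the failures above is a separate question that this sketch does not answer.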

Flavio, perhaps you could give Thomas some insight; if you have ideas, he is
motivated to help resolve this.

Also notice that we are currently @Ignore-ing a handful of tests. These are also
"broken" tests, tests which we really need to fix and bring back online. The
"session moved" test in particular needs to be fixed (again, a non-deterministic
test that could probably benefit from some refactoring, though I think it's more
of a "design for test" issue).

Take a look at the clover output for some insight on areas that need more
testing and refactoring (coverage/complexity):
https://hudson.apache.org/hudson/view/ZooKeeper/job/ZooKeeper-trunk/clover/

> Wouldn't you like to setup Hudson to build ZooKeeper trunk every half an
> hour?

I wouldn't mind, but we'd probably get yelled at by the apache hudson admins.
:-) Hudson is a shared resource and we typically need to "play nice". Also,
there have been problems with hadoop on hudson for the past few months. Nigel is
working on that, so this might be a good thing to bring up again once that's
addressed (patch queue primarily).

Patrick