[ https://issues.apache.org/jira/browse/MAPREDUCE-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133187#comment-15133187 ]
Eric Badger commented on MAPREDUCE-6507:
----------------------------------------

Tests are failing because of a race condition between the RM startup and the NM startup. In each of their serviceStart() methods, they spawn new threads to call start(), which introduces the race. The NM is set up with a waitCount of up to 60 seconds, so that it can wait for the cluster to complete startup (even though the start method for the RM has already returned).

Removing the threads fixes the race in the test that prompted this Jira (TestRMNMInfo), but causes other tests to fail. Any test that starts up the MiniYARNCluster without an active RM will fail, because the node managers block the main thread from transitioning one of the RMs from standby to active. This is why the threads worked: they allowed the NMs to wait while the main thread zoomed by and transitioned a standby RM to active.

I propose changing the MiniYARNCluster start method so that it does not complete until the cluster is completely started, and so that it always makes one RM active in HA setups. This will require changes to the affected tests (TestRMFailover, TestMiniYARNClusterForHA, etc.), but makes the code more understandable and removes races. The tests are only passing right now because excessive timeouts mask the race that they're fighting.

[~kasha] [~jlowe] Please advise.

> MiniYARNCluster.start() returns before cluster is completely started
> --------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6507
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6507
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: test
>            Reporter: Rohith Sharma K S
>            Assignee: Eric Badger
>         Attachments: MAPREDUCE-6507.001.patch
>
>
> TestRMNMInfo fails intermittently. Below is trace for the failure
> {noformat}
> testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo)  Time elapsed: 0.28 sec  <<< FAILURE!
> java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but was:<3>
>       at org.junit.Assert.fail(Assert.java:88)
>       at org.junit.Assert.failNotEquals(Assert.java:743)
>       at org.junit.Assert.assertEquals(Assert.java:118)
>       at org.junit.Assert.assertEquals(Assert.java:555)
>       at org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
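
For illustration only, the proposal in the comment above (a start method that does not return until every node has registered) might be sketched as follows. This is not the MiniYARNCluster code or the attached patch. SketchCluster, startAsync, startAndWait and getLiveNodes are hypothetical names, and the sleeps merely simulate NM startup time. The sketch also leaves out the HA part of the proposal (making one RM active).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a mini cluster; NOT the real MiniYARNCluster API. */
class SketchCluster {
    private final int numNodeManagers;
    private final AtomicInteger liveNodes = new AtomicInteger();
    private final CountDownLatch registered;

    SketchCluster(int numNodeManagers) {
        this.numNodeManagers = numNodeManagers;
        this.registered = new CountDownLatch(numNodeManagers);
    }

    /** Racy variant: spawns one thread per simulated NM and returns immediately. */
    void startAsync() {
        for (int i = 0; i < numNodeManagers; i++) {
            final long delayMs = 50L * (i + 1); // simulated startup time (assumption)
            Thread t = new Thread(() -> {
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                liveNodes.incrementAndGet(); // the node is now "live"
                registered.countDown();      // signal that registration finished
            });
            t.setDaemon(true);
            t.start();
        }
    }

    /** Blocking variant: returns only once every node has registered, else fails. */
    void startAndWait(long timeout, TimeUnit unit) throws InterruptedException {
        startAsync();
        if (!registered.await(timeout, unit)) {
            throw new IllegalStateException("cluster did not start: "
                + liveNodes.get() + "/" + numNodeManagers + " nodes live");
        }
    }

    int getLiveNodes() {
        return liveNodes.get();
    }
}
```

A test that calls startAsync() and immediately asserts on getLiveNodes() can observe fewer nodes than expected, which is the same shape as the "expected:<4> but was:<3>" failure in the trace. With startAndWait(), the assertion runs only after all nodes have counted down the latch, so no timeout padding is needed in the test itself.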