[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133187#comment-15133187
 ] 

Eric Badger commented on MAPREDUCE-6507:
----------------------------------------

The tests fail because of a race between the RM startup and the NM startup. 
Each of their serviceStart() methods spawns a new thread to call start(), so 
serviceStart() can return before the underlying service has actually started. 
The NM is set up with a waitCount of up to 60 seconds so that it can wait for 
the cluster to complete startup (even though the start method for the RM has 
already returned). Removing the threads fixes the race in the test that 
prompted this Jira (TestRMNMInfo), but causes other tests to fail. Any test 
that starts a MiniYARNCluster without an active RM will fail, because the node 
managers block the main thread before it can transition one of the RMs from 
standby to active. That is why the threads worked: they let the NMs wait in 
the background while the main thread went ahead and transitioned a standby RM 
to active. 
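
To make the ordering concrete, here is a minimal sketch of the threaded 
behavior in plain Java. It does not use any Hadoop classes; AsyncStartSketch 
and its methods are hypothetical stand-ins for the wrappers in 
MiniYARNCluster, and a latch stands in for the NM waiting on an active RM. 

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the cluster wrapper; not a Hadoop class.
class AsyncStartSketch {
    // Released when a standby "RM" is transitioned to active.
    private final CountDownLatch rmActive = new CountDownLatch(1);
    private final AtomicInteger liveNodes = new AtomicInteger();
    private final List<Thread> starters = new ArrayList<>();

    // Returns immediately: each "NM" registers in its own thread,
    // and only once an "RM" has become active.
    void start(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            Thread t = new Thread(() -> {
                try {
                    rmActive.await();
                    liveNodes.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            starters.add(t);
        }
    }

    void transitionToActive() {
        rmActive.countDown();
    }

    int liveNodes() {
        return liveNodes.get();
    }

    // Waits for every starter thread to finish.
    void awaitStarters() {
        for (Thread t : starters) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public static void main(String[] args) {
        AsyncStartSketch cluster = new AsyncStartSketch();
        cluster.start(4);
        // start() has returned, but no node can have registered yet.
        System.out.println("live nodes after start(): " + cluster.liveNodes());
        cluster.transitionToActive();
        cluster.awaitStarters();
        System.out.println("live nodes after join: " + cluster.liveNodes());
    }
}
```

In the sketch the caller sees zero live nodes right after start() returns, 
which is the same kind of gap TestRMNMInfo hits when it counts nodes too 
early. If start() instead ran the waits on the calling thread, it would never 
reach transitionToActive(). 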

I propose changing the MiniYARNCluster start method so that it does not 
return until the cluster is completely started, and so that it always makes 
one RM active in HA setups. This will require changes to the affected tests 
(TestRMFailover, TestMiniYARNClusterForHA, etc.), but it makes the code easier 
to understand and removes the races. The tests pass right now only because 
excessive timeouts mask the race. 
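
As a rough illustration of the proposed behavior, and not the actual patch, a 
blocking start could look like the sketch below. BlockingStartSketch, its 
fields, and the timeout parameter are all hypothetical; the only real API used 
is java.util.concurrent. 

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the cluster wrapper; not a Hadoop class.
class BlockingStartSketch {
    private final AtomicInteger liveNodes = new AtomicInteger();
    private volatile boolean rmActive;

    // Returns true only once an "RM" is active and every "NM" has registered,
    // or false if that does not happen within timeoutMs.
    boolean start(int nodeCount, long timeoutMs) {
        // Assumption: one RM is made active before any NM starts.
        rmActive = true;
        CountDownLatch registered = new CountDownLatch(nodeCount);
        for (int i = 0; i < nodeCount; i++) {
            new Thread(() -> {
                liveNodes.incrementAndGet();
                registered.countDown();
            }).start();
        }
        try {
            // Block the caller until the whole cluster is up.
            return registered.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    boolean isRmActive() {
        return rmActive;
    }

    int liveNodes() {
        return liveNodes.get();
    }
}
```

With this shape, a test can assert on the node count immediately after start() 
returns true, without padding its own timeouts. 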

[~kasha] [~jlowe] Please advise. 

> MiniYARNCluster.start() returns before cluster is completely started
> --------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6507
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6507
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: test
>            Reporter: Rohith Sharma K S
>            Assignee: Eric Badger
>         Attachments: MAPREDUCE-6507.001.patch
>
>
> TestRMNMInfo fails intermittently. Below is the trace for the failure:
> {noformat}
> testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo)  Time elapsed: 0.28 sec  <<< FAILURE!
> java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but was:<3>
>       at org.junit.Assert.fail(Assert.java:88)
>       at org.junit.Assert.failNotEquals(Assert.java:743)
>       at org.junit.Assert.assertEquals(Assert.java:118)
>       at org.junit.Assert.assertEquals(Assert.java:555)
>       at org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
