[ 
https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reassigned YARN-4686:
--------------------------------

    Assignee: Eric Badger

I'd really like the minicluster's default startup to be free of the race where 
start() returns before the cluster has actually finished starting.  With 
multiple tests currently failing sporadically because of this, I'd like the 
start() method to not return until the cluster is started.  For non-HA setups 
this seems very straightforward.
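
As a rough illustration of "don't return until the cluster is started", here 
is a minimal polling sketch.  StartupWait and waitFor are hypothetical names 
made up for this example, not MiniYARNCluster code; the idea is only that 
start() (or the test) blocks on a condition such as "all NMs are live" with a 
timeout, rather than returning immediately.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration only; not part of Hadoop.
final class StartupWait {
    private StartupWait() {}

    /**
     * Polls the condition until it holds or the timeout expires.
     * Returns true if the condition was met, false on timeout or interrupt.
     */
    static boolean waitFor(BooleanSupplier condition, long pollMillis, long timeoutMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: the condition never became true in time
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }
}
```

A test would then wait on something like "live node count equals the expected 
NM count" before asserting, and fail cleanly on timeout instead of seeing 3 
nodes where it expected 4.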

However, for the HA minicluster it appears the intent is to have the RMs all 
come up in standby.  The problem is that the NM start method _will not return_ 
until it has successfully registered with an RM.  If the minicluster start 
waits for the NMs, then since all RMs are in standby the NM start never 
completes, the minicluster start never completes, and we never get to the part 
of the test where it activates an RM.  HA minicluster tests would therefore 
always time out.

I like Eric's proposal to have the minicluster activate the first RM during the 
start method of an HA cluster, so we can bring it up and return from the 
cluster start method with no pending start processing (and therefore no race 
conditions in the tests using the cluster).  However, that could break some of 
the assumptions of those using the HA minicluster in their existing tests.  For 
Hadoop tests we can simply fix up the tests accordingly, if necessary (most 
seem to activate the first RM anyway), but I don't know if there are other 
tests that use an HA minicluster and will break if the first RM is already 
active by default.
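
To make the trade-off concrete, here is a toy model of the startup ordering.  
ToyHaCluster and everything in it are hypothetical and do not use any real 
MiniYARNCluster API.  An NM that would block forever waiting for an active RM 
is modeled as start() returning false.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model for illustration only; names are hypothetical, not Hadoop classes.
final class ToyHaCluster {
    enum HaState { STANDBY, ACTIVE }

    private final List<HaState> rms = new ArrayList<>();
    private final int nodeManagers;
    private int registered = 0;

    ToyHaCluster(int numRms, int numNms) {
        for (int i = 0; i < numRms; i++) {
            rms.add(HaState.STANDBY); // every RM comes up in standby
        }
        this.nodeManagers = numNms;
    }

    /** Returns true only if every NM managed to register with an active RM. */
    boolean start(boolean activateFirstRm) {
        if (activateFirstRm && !rms.isEmpty()) {
            rms.set(0, HaState.ACTIVE); // the proposal: activate the first RM inside start()
        }
        for (int i = 0; i < nodeManagers; i++) {
            if (!rms.contains(HaState.ACTIVE)) {
                // A real NM start would block here indefinitely; the model reports failure.
                return false;
            }
            registered++;
        }
        return true;
    }

    int liveNodes() { return registered; }

    HaState stateOf(int rmIndex) { return rms.get(rmIndex); }
}
```

With start(false) no NM registers and a blocking start could never return.  
With start(true) all NMs register, but a caller that assumed every RM was 
still in standby afterwards would now find the first one active.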

[~kasha] do you have an opinion on this?

> MiniYARNCluster.start() returns before cluster is completely started
> --------------------------------------------------------------------
>
>                 Key: YARN-4686
>                 URL: https://issues.apache.org/jira/browse/YARN-4686
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: test
>            Reporter: Rohith Sharma K S
>            Assignee: Eric Badger
>         Attachments: MAPREDUCE-6507.001.patch
>
>
> TestRMNMInfo fails intermittently. Below is the trace for the failure:
> {noformat}
> testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo)  Time elapsed: 0.28 sec  <<< FAILURE!
> java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but was:<3>
>       at org.junit.Assert.fail(Assert.java:88)
>       at org.junit.Assert.failNotEquals(Assert.java:743)
>       at org.junit.Assert.assertEquals(Assert.java:118)
>       at org.junit.Assert.assertEquals(Assert.java:555)
>       at org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)