[
https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe reassigned YARN-4686:
--------------------------------
Assignee: Eric Badger
I'd really like the minicluster not to start up, by default, with a race
condition where start() returns before the cluster has actually finished
starting. Multiple tests are currently failing sporadically because of this, so
I'd like the start() method not to return until the cluster is started. For
non-HA setups this seems very straightforward.
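
For illustration, here is a minimal sketch of the kind of blocking wait I have in mind for the non-HA case. It does not use any MiniYARNCluster code: StartupWait, waitFor, and the simulated registration thread are all hypothetical names made up for this example, and the count of 4 nodes is simply borrowed from the failing assertion in the description.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; not part of MiniYARNCluster. */
final class StartupWait {
    private StartupWait() {}

    /**
     * Polls {@code check} every {@code intervalMs} until it returns true,
     * or throws TimeoutException once {@code timeoutMs} has elapsed.
     */
    static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!check.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException(
                    "Cluster did not finish starting within " + timeoutMs + " ms");
            }
            Thread.sleep(intervalMs);
        }
    }

    public static void main(String[] args) throws Exception {
        final int expectedNodes = 4;
        // Stand-in for the RM's count of registered NMs (assumption).
        AtomicInteger registered = new AtomicInteger();

        // Simulates NMs registering asynchronously after start() was called.
        Thread nms = new Thread(() -> {
            for (int i = 0; i < expectedNodes; i++) {
                try {
                    Thread.sleep(20);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                registered.incrementAndGet();
            }
        });
        nms.start();

        // A start() that ends with this wait cannot return early.
        waitFor(() -> registered.get() == expectedNodes, 5, 5_000);
        System.out.println("live nodes: " + registered.get());
    }
}
```

A test that asserts on the number of live nodes only after such a wait returns would not see 3 of 4 nodes registered.
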
However, for the HA minicluster the intent appears to be that the RMs all
come up in standby. The problem is that the NM start method _will not return_
until it has successfully registered with an RM. Since all RMs are in standby,
the NM start never completes, the minicluster start never completes, and we
never get to the part of the test where it activates an RM. Therefore, if
start() blocks until the cluster is up, HA minicluster tests will always time
out.
I like Eric's proposal to have the minicluster activate the first RM during the
start method of an HA cluster, so we can bring it up and return from the cluster
start method with no pending start processing (and therefore no race conditions
in the test using the cluster). However, that could break some of the
assumptions of those using the HA minicluster in their existing tests. For
Hadoop tests we can simply fix up the tests accordingly, if necessary (most seem
to activate the first one anyway), but I don't know if there are other tests
that use an HA minicluster and will break if the first RM is already active by
default.
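
To make the HA ordering problem concrete, here is a toy model, not Hadoop code. ToyRm, ToyNm, and ToyHaCluster are hypothetical classes invented for this sketch. One simplification to note: the toy NM start returns false when no RM is active, whereas the real NM start would block instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy RM: starts in standby until explicitly activated (hypothetical class). */
final class ToyRm {
    private boolean active; // false means standby

    void transitionToActive() { active = true; }

    boolean isActive() { return active; }
}

/** Toy NM: can only register with an active RM (hypothetical class). */
final class ToyNm {
    private boolean registered;

    /**
     * Returns true if registration with an active RM succeeded.
     * The real NM start would block here rather than return false.
     */
    boolean start(List<ToyRm> rms) {
        for (ToyRm rm : rms) {
            if (rm.isActive()) {
                registered = true;
                return true;
            }
        }
        return false;
    }

    boolean isRegistered() { return registered; }
}

/** Toy HA cluster whose start() optionally activates the first RM. */
final class ToyHaCluster {
    final List<ToyRm> rms = new ArrayList<>();
    final List<ToyNm> nms = new ArrayList<>();

    ToyHaCluster(int numRms, int numNms) {
        for (int i = 0; i < numRms; i++) rms.add(new ToyRm());
        for (int i = 0; i < numNms; i++) nms.add(new ToyNm());
    }

    /**
     * activateFirstRm == false models all RMs staying in standby;
     * activateFirstRm == true models the proposal discussed above.
     * Returns true only if every NM finished starting.
     */
    boolean start(boolean activateFirstRm) {
        if (activateFirstRm && !rms.isEmpty()) {
            rms.get(0).transitionToActive();
        }
        boolean allStarted = true;
        for (ToyNm nm : nms) {
            allStarted &= nm.start(rms);
        }
        return allStarted;
    }
}
```

With all RMs in standby, start(false) can never report a fully started cluster; with the first RM activated inside start(true), every NM registers before start returns. The cost is that a caller who expects every RM to be in standby after start would see the first one already active.
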
[~kasha] do you have an opinion on this?
> MiniYARNCluster.start() returns before cluster is completely started
> --------------------------------------------------------------------
>
> Key: YARN-4686
> URL: https://issues.apache.org/jira/browse/YARN-4686
> Project: Hadoop YARN
> Issue Type: Bug
> Components: test
> Reporter: Rohith Sharma K S
> Assignee: Eric Badger
> Attachments: MAPREDUCE-6507.001.patch
>
>
> TestRMNMInfo fails intermittently. Below is the trace for the failure:
> {noformat}
> testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo) Time elapsed: 0.28 sec <<< FAILURE!
> java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but was:<3>
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:555)
> at org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111)
> {noformat}