[ 
https://issues.apache.org/jira/browse/YARN-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388352#comment-14388352
 ] 

zhihai xu commented on YARN-2666:
---------------------------------

Hi [~ywskycn], could you assign this JIRA to me?
I think I know what causes this intermittent failure.
The problem is that ContinuousSchedulingThread calls 
continuousSchedulingAttempt periodically, 
and continuousSchedulingAttempt doesn't hold the FairScheduler lock.
As a result, the following loop in continuousSchedulingAttempt can run at any time,
{code}
    for (NodeId nodeId : nodeIdList) {
      FSSchedulerNode node = getFSSchedulerNode(nodeId);
      try {
        if (node != null && Resources.fitsIn(minimumAllocation,
            node.getAvailableResource())) {
          attemptScheduling(node);
        }
      } catch (Throwable ex) {
        LOG.error("Error while attempting scheduling for node " + node +
            ": " + ex.toString(), ex);
      }
    }
{code}
including while testContinuousScheduling is calling scheduler.allocate to make 
a container allocation request.
It is possible for application.updateResourceRequests in scheduler.allocate to 
run right after attemptScheduling is called on the first node and before it is 
called on the second node. In that case the second node, which has fewer 
resources, allocates the container for this allocation request.
That produces the failure: both containers are allocated on the same node.
The default ContinuousSchedulingSleepMs is 5ms, which is very short. If we 
increase ContinuousSchedulingSleepMs, the test will fail much less often. We 
can make the test deterministic by stopping the ContinuousSchedulingThread 
before the second allocation request and manually calling 
continuousSchedulingAttempt after the second allocation request.
I uploaded a patch that does exactly this.
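
To illustrate the interleaving, here is a minimal single-threaded sketch. It is not FairScheduler code: the class ContinuousSchedulingRaceSketch, its Node type, and every method in it are hypothetical stand-ins. It also assumes that a scheduling pass visits the node with the most available memory first. The hook passed to schedulingPass plays the role of scheduler.allocate landing between the two attemptScheduling calls.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ContinuousSchedulingRaceSketch {
    /** Hypothetical stand-in for a scheduler node. */
    static final class Node {
        final String name;
        int availableMb;
        int containers;

        Node(String name, int availableMb) {
            this.name = name;
            this.availableMb = availableMb;
        }
    }

    static final int CONTAINER_MB = 1024;

    final List<Node> nodes = new ArrayList<>();
    int pendingRequests = 0;

    /** Allocates one container on the node if a request is pending and it fits. */
    void attemptScheduling(Node node) {
        if (pendingRequests > 0 && node.availableMb >= CONTAINER_MB) {
            node.availableMb -= CONTAINER_MB;
            node.containers++;
            pendingRequests--;
        }
    }

    /**
     * One scheduling pass. Assumption: nodes are visited in order of most
     * available memory first. The hook runs after the first node has been
     * visited, which simulates a request arriving in the middle of the pass.
     */
    void schedulingPass(Runnable afterFirstNode) {
        List<Node> ordered = new ArrayList<>(nodes);
        ordered.sort(Comparator.comparingInt((Node n) -> n.availableMb).reversed());
        boolean first = true;
        for (Node node : ordered) {
            attemptScheduling(node);
            if (first) {
                afterFirstNode.run();
                first = false;
            }
        }
    }

    /** Returns how many distinct nodes hold a container after two requests. */
    static int nodesUsed(boolean requestArrivesMidPass) {
        ContinuousSchedulingRaceSketch s = new ContinuousSchedulingRaceSketch();
        s.nodes.add(new Node("node1", 8192));
        s.nodes.add(new Node("node2", 8192));

        // First request: submitted before the pass, so the first node visited takes it.
        s.pendingRequests = 1;
        s.schedulingPass(() -> { });

        if (requestArrivesMidPass) {
            // Racy case: the request shows up after the emptier node was already visited.
            s.schedulingPass(() -> s.pendingRequests++);
        } else {
            // Deterministic case: submit the request first, then run the pass by hand.
            s.pendingRequests++;
            s.schedulingPass(() -> { });
        }

        int used = 0;
        for (Node node : s.nodes) {
            if (node.containers > 0) {
                used++;
            }
        }
        return used;
    }

    public static void main(String[] args) {
        System.out.println("request arrives mid-pass: nodes used = " + nodesUsed(true));
        System.out.println("request before the pass:  nodes used = " + nodesUsed(false));
    }
}
```

In the mid-pass case both containers end up on one node, which matches the "expected:<2> but was:<1>" assertion in the report. Submitting the request before a manually triggered pass spreads them over two nodes, which is the ordering the patch enforces in the test.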

> TestFairScheduler.testContinuousScheduling fails Intermittently
> ---------------------------------------------------------------
>
>                 Key: YARN-2666
>                 URL: https://issues.apache.org/jira/browse/YARN-2666
>             Project: Hadoop YARN
>          Issue Type: Test
>          Components: scheduler
>            Reporter: Tsuyoshi Ozawa
>            Assignee: Wei Yan
>         Attachments: YARN-2666.000.patch
>
>
> The test fails on trunk.
> {code}
> Tests run: 79, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.698 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
> testContinuousScheduling(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler)  Time elapsed: 0.582 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<2> but was:<1>
>       at org.junit.Assert.fail(Assert.java:88)
>       at org.junit.Assert.failNotEquals(Assert.java:743)
>       at org.junit.Assert.assertEquals(Assert.java:118)
>       at org.junit.Assert.assertEquals(Assert.java:555)
>       at org.junit.Assert.assertEquals(Assert.java:542)
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testContinuousScheduling(TestFairScheduler.java:3372)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
