[ 
https://issues.apache.org/jira/browse/YARN-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth reassigned YARN-9430:
------------------------------------

    Assignee: Szilard Nemeth

> Recovering containers does not check available resources on node
> ----------------------------------------------------------------
>
>                 Key: YARN-9430
>                 URL: https://issues.apache.org/jira/browse/YARN-9430
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Critical
>
> I have a testcase that checks that when some GPU devices go offline and 
> recovery happens, only the containers that fit into the node's resources are 
> recovered. Unfortunately, this is not the case: the RM does not check the 
> available resources on the node during recovery.
> *Detailed explanation:*
> *Testcase:* 
>  1. There are 2 nodes running NodeManagers
>  2. nvidia-smi is replaced with a fake bash script that initially reports 2 
> GPU devices per node. This means 4 GPU devices in the cluster altogether.
>  3. RM / NM recovery is enabled (see the config sketch after this list)
>  4. The test starts a sleep job, requesting 4 containers with 1 GPU device 
> each (the AM does not request GPUs)
>  5. Before the restart, the fake bash script is adjusted so that it reports 
> 1 GPU device per node (2 in the cluster) after the restart.
>  6. The restart is initiated.
>  
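> For reference, step 3 means work-preserving recovery was enabled. A minimal 
> sketch of the relevant yarn-site.xml properties (assuming the standard RM/NM 
> recovery settings; not the exact configuration of my cluster):
> {code:xml}
> <!-- Sketch: standard YARN work-preserving recovery settings -->
> <property>
>   <name>yarn.resourcemanager.recovery.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.nodemanager.recovery.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <!-- Keep containers running while the NM itself restarts -->
>   <name>yarn.nodemanager.recovery.supervised</name>
>   <value>true</value>
> </property>
> {code}
>  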
> *Expected behavior:* 
>  After the restart, only the AM and 2 normal containers should have been 
> started, as there are only 2 GPU devices in the cluster.
>  
> *Actual behavior:* 
>  The AM + all 4 containers are allocated, i.e. every container originally 
> started in step 4.
> App id was: 1553977186701_0001
> *Logs*:
>  
> {code:java}
> 2019-03-30 13:22:30,299 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Processing event for appattempt_1553977186701_0001_000001 of type RECOVER
> 2019-03-30 13:22:30,366 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Added Application Attempt appattempt_1553977186701_0001_000001 to scheduler 
> from user: systest
>  2019-03-30 13:22:30,366 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> appattempt_1553977186701_0001_000001 is recovering. Skipping notifying 
> ATTEMPT_ADDED
>  2019-03-30 13:22:30,367 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1553977186701_0001_000001 State change from NEW to LAUNCHED on 
> event = RECOVER
> 2019-03-30 13:22:33,257 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
>  Recovering container [container_e84_1553977186701_0001_01_000001, 
> CreateTime: 1553977260732, Version: 0, State: RUNNING, Capability: 
> <memory:1024, vCores:1>, Diagnostics: , ExitStatus: -1000, 
> NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
>  Recovering container [container_e84_1553977186701_0001_01_000004, 
> CreateTime: 1553977272802, Version: 0, State: RUNNING, Capability: 
> <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000, 
> NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
> Assigned container container_e84_1553977186701_0001_01_000004 of capacity 
> <memory:1024, vCores:1, yarn.io/gpu: 1> on host 
> snemeth-gpu-2.vpc.cloudera.com:8041, which has 2 containers, <memory:2048, 
> vCores:2, yarn.io/gpu: 1> used and <memory:37252, vCores:6> available after 
> allocation
> 2019-03-30 13:22:33,276 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
>  Recovering container [container_e84_1553977186701_0001_01_000005, 
> CreateTime: 1553977272803, Version: 0, State: RUNNING, Capability: 
> <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000, 
> NodeLabelExpression: Priority: 0]
>  2019-03-30 13:22:33,276 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> Processing container_e84_1553977186701_0001_01_000005 of type RECOVER
>  2019-03-30 13:22:33,276 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e84_1553977186701_0001_01_000005 Container Transitioned from NEW to 
> RUNNING
>  2019-03-30 13:22:33,276 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
> Assigned container container_e84_1553977186701_0001_01_000005 of capacity 
> <memory:1024, vCores:1, yarn.io/gpu: 1> on host 
> snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, <memory:3072, 
> vCores:3, yarn.io/gpu: 2> used and <memory:36228, vCores:5, yarn.io/gpu: -1> 
> available after allocation
> 2019-03-30 13:22:33,279 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
>  Recovering container [container_e84_1553977186701_0001_01_000003, 
> CreateTime: 1553977272166, Version: 0, State: RUNNING, Capability: 
> <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000, 
> NodeLabelExpression: Priority: 0]
>  2019-03-30 13:22:33,280 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> Processing container_e84_1553977186701_0001_01_000003 of type RECOVER
>  2019-03-30 13:22:33,280 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e84_1553977186701_0001_01_000003 Container Transitioned from NEW to 
> RUNNING
>  2019-03-30 13:22:33,280 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing 
> event for application_1553977186701_0001 of type APP_RUNNING_ON_NODE
>  2019-03-30 13:22:33,280 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
> Assigned container container_e84_1553977186701_0001_01_000003 of capacity 
> <memory:1024, vCores:1, yarn.io/gpu: 1> on host 
> snemeth-gpu-3.vpc.cloudera.com:8041, which has 2 containers, <memory:2048, 
> vCores:2, yarn.io/gpu: 2> used and <memory:37252, vCores:6, yarn.io/gpu: -1> 
> available after allocation
>  2019-03-30 13:22:33,280 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  SchedulerAttempt appattempt_1553977186701_0001_000001 is recovering 
> container container_e84_1553977186701_0001_01_000003
> {code}
>  
> There are multiple log entries like this:
> {code:java}
> Assigned container container_e84_1553977186701_0001_01_000005 of capacity 
> <memory:1024, vCores:1, yarn.io/gpu: 1> on host 
> snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, <memory:3072, 
> vCores:3, yarn.io/gpu: 2> used and <memory:36228, vCores:5, yarn.io/gpu: -1> 
> available after allocation{code}
> *Note the -1 value for the yarn.io/gpu resource!*
> The issue lies in this method: 
> [https://github.com/apache/hadoop/blob/e40e2d6ad5cbe782c3a067229270738b501ed27e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java#L179]
> The problem is that the method deductUnallocatedResource does not check 
> whether, after the container's resource is subtracted from the unallocated 
> resource, the unallocated resource remains at or above zero.
>  Here is the ResourceManager call hierarchy for the method (from top to 
> bottom):
> {code:java}
> 1. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler#handle
> 2. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler#addNode
> 3. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler#recoverContainersOnNode
> 4. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode#recoverContainer
> 5. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode#allocateContainer
> 6. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode#allocateContainer(org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer,
>  boolean)
> deduct is called here!{code}
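> For context, the deduction at the bottom of this hierarchy looks roughly 
> like the following (paraphrased from SchedulerNode; not a verbatim copy of 
> trunk):
> {code:java}
> // SchedulerNode (paraphrased): subtracts blindly, never checks whether
> // unallocatedResource can actually cover the container's resource
> private synchronized void deductUnallocatedResource(Resource resource) {
>   if (resource == null) {
>     LOG.error("Invalid deduction of null resource for "
>         + rmNode.getNodeAddress());
>     return;
>   }
>   Resources.subtractFrom(unallocatedResource, resource);
>   Resources.addTo(allocatedResource, resource);
> }
> {code}
> Since Resources.subtractFrom performs a plain component-wise subtraction, 
> recovering a container on a node that shrank in the meantime simply drives 
> the corresponding component (here yarn.io/gpu) below zero.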
> *Testcase that reproduces the issue:* 
>  *Add this testcase to TestFSSchedulerNode:*
>  
> {code:java}
> @Test
> public void testRecovery() {
>   RMNode node = createNode();
>   FSSchedulerNode schedulerNode = new FSSchedulerNode(node, false);
> 
>   // Fill up the node completely with two regular allocations
>   RMContainer container1 = createContainer(Resource.newInstance(4096, 4),
>       null);
>   RMContainer container2 = createContainer(Resource.newInstance(4096, 4),
>       null);
>   schedulerNode.allocateContainer(container1);
>   schedulerNode.containerStarted(container1.getContainerId());
>   schedulerNode.allocateContainer(container2);
>   schedulerNode.containerStarted(container2.getContainerId());
>   assertEquals("All resources of node should have been allocated",
>       nodeResource, schedulerNode.getAllocatedResource());
> 
>   // Recover a third container that no longer fits on the full node
>   RMContainer container3 = createContainer(Resource.newInstance(1000, 1),
>       null);
>   when(container3.getState()).thenReturn(RMContainerState.NEW);
>   schedulerNode.recoverContainer(container3);
> 
>   // With the bug, the unallocated resource goes negative instead of the
>   // recovery being rejected, so the first assertion fails
>   assertEquals("No resource should have been unallocated",
>       Resources.none(), schedulerNode.getUnallocatedResource());
>   assertEquals("All resources of node should have been allocated",
>       nodeResource, schedulerNode.getAllocatedResource());
> }
> {code}
>  
>  
> *Result of testcase:*
> {code:java}
> java.lang.AssertionError: No resource should have been unallocated 
> Expected :<memory:0, vCores:0>
> Actual :<memory:-1000, vCores:-1>{code}
> *It's immediately clear that not only GPU (or other custom resource types) 
> but all resources are affected by this issue!*
>  
> *Possible fix:* 
>  1. A condition needs to be introduced that checks whether there are enough 
> resources on the node; the container's recovery should proceed only if this 
> is true (sketched below).
>  2. An error log should be added. At first glance this seems sufficient, so 
> no exception is required, but this needs more thorough investigation and a 
> manual test on a cluster!
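> A minimal sketch of the guard from point 1, assuming Resources.fitsIn is 
> used for the check (placement and messages are illustrative only, not a 
> final patch):
> {code:java}
> // Hypothetical guard before deducting the recovered container's resource
> Resource containerResource = rmContainer.getContainer().getResource();
> if (!Resources.fitsIn(containerResource, getUnallocatedResource())) {
>   LOG.error("Cannot recover container " + rmContainer.getContainerId()
>       + " on node " + getNodeName() + ": requested " + containerResource
>       + " but only " + getUnallocatedResource() + " is unallocated");
>   return; // skip recovery of this container instead of going negative
> }
> deductUnallocatedResource(containerResource);
> {code}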
>  


