[
https://issues.apache.org/jira/browse/YARN-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Szilard Nemeth reassigned YARN-9430:
------------------------------------
Assignee: Szilard Nemeth
> Recovering containers does not check available resources on node
> ----------------------------------------------------------------
>
> Key: YARN-9430
> URL: https://issues.apache.org/jira/browse/YARN-9430
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Szilard Nemeth
> Assignee: Szilard Nemeth
> Priority: Critical
>
> I have a testcase that checks that if some GPU devices go offline and
> recovery happens, only the containers that fit into the node's resources are
> recovered. Unfortunately, this is not the case: the RM does not check the
> available resources on the node during recovery.
> *Detailed explanation:*
> *Testcase:*
> 1. There are 2 nodes running NodeManagers
> 2. nvidia-smi is replaced with a fake bash script that reports 2 GPU devices
> per node, initially. This means 4 GPU devices in the cluster altogether.
> 3. RM / NM recovery is enabled
> 4. The test starts off a sleep job, requesting 4 containers, 1 GPU device
> for each (AM does not request GPUs)
> 5. Before restart, the fake bash script is adjusted to report 1 GPU device
> per node (2 in the cluster) after restart.
> 6. Restart is initiated.
>
> *Expected behavior:*
> After restart, only the AM and 2 normal containers should have been started,
> as there are only 2 GPU devices in the cluster.
>
> *Actual behavior:*
> The AM + 4 containers are allocated, i.e. all the containers that were
> originally started in step 4.
> App id was: 1553977186701_0001
> *Logs*:
>
> {code:java}
> 2019-03-30 13:22:30,299 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> Processing event for appattempt_1553977186701_0001_000001 of type RECOVER
> 2019-03-30 13:22:30,366 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
> Added Application Attempt appattempt_1553977186701_0001_000001 to scheduler
> from user: systest
> 2019-03-30 13:22:30,366 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
> appattempt_1553977186701_0001_000001 is recovering. Skipping notifying
> ATTEMPT_ADDED
> 2019-03-30 13:22:30,367 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> appattempt_1553977186701_0001_000001 State change from NEW to LAUNCHED on
> event = RECOVER
> 2019-03-30 13:22:33,257 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
> Recovering container [container_e84_1553977186701_0001_01_000001,
> CreateTime: 1553977260732, Version: 0, State: RUNNING, Capability:
> <memory:1024, vCores:1>, Diagnostics: , ExitStatus: -1000,
> NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
> Recovering container [container_e84_1553977186701_0001_01_000004,
> CreateTime: 1553977272802, Version: 0, State: RUNNING, Capability:
> <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000,
> NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode:
> Assigned container container_e84_1553977186701_0001_01_000004 of capacity
> <memory:1024, vCores:1, yarn.io/gpu: 1> on host
> snemeth-gpu-2.vpc.cloudera.com:8041, which has 2 containers, <memory:2048,
> vCores:2, yarn.io/gpu: 1> used and <memory:37252, vCores:6> available after
> allocation
> 2019-03-30 13:22:33,276 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
> Recovering container [container_e84_1553977186701_0001_01_000005,
> CreateTime: 1553977272803, Version: 0, State: RUNNING, Capability:
> <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000,
> NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,276 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> Processing container_e84_1553977186701_0001_01_000005 of type RECOVER
> 2019-03-30 13:22:33,276 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> container_e84_1553977186701_0001_01_000005 Container Transitioned from NEW to
> RUNNING
> 2019-03-30 13:22:33,276 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode:
> Assigned container container_e84_1553977186701_0001_01_000005 of capacity
> <memory:1024, vCores:1, yarn.io/gpu: 1> on host
> snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, <memory:3072,
> vCores:3, yarn.io/gpu: 2> used and <memory:36228, vCores:5, yarn.io/gpu: -1>
> available after allocation
> 2019-03-30 13:22:33,279 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
> Recovering container [container_e84_1553977186701_0001_01_000003,
> CreateTime: 1553977272166, Version: 0, State: RUNNING, Capability:
> <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000,
> NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,280 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> Processing container_e84_1553977186701_0001_01_000003 of type RECOVER
> 2019-03-30 13:22:33,280 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> container_e84_1553977186701_0001_01_000003 Container Transitioned from NEW to
> RUNNING
> 2019-03-30 13:22:33,280 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing
> event for application_1553977186701_0001 of type APP_RUNNING_ON_NODE
> 2019-03-30 13:22:33,280 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode:
> Assigned container container_e84_1553977186701_0001_01_000003 of capacity
> <memory:1024, vCores:1, yarn.io/gpu: 1> on host
> snemeth-gpu-3.vpc.cloudera.com:8041, which has 2 containers, <memory:2048,
> vCores:2, yarn.io/gpu: 2> used and <memory:37252, vCores:6, yarn.io/gpu: -1>
> available after allocation
> 2019-03-30 13:22:33,280 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
> SchedulerAttempt appattempt_1553977186701_0001_000001 is recovering
> container container_e84_1553977186701_0001_01_000003
> {code}
>
> There are multiple log lines like this:
> {code:java}
> Assigned container container_e84_1553977186701_0001_01_000005 of capacity
> <memory:1024, vCores:1, yarn.io/gpu: 1> on host
> snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, <memory:3072,
> vCores:3, yarn.io/gpu: 2> used and <memory:36228, vCores:5, yarn.io/gpu: -1>
> available after allocation{code}
> *Note the -1 value for the yarn.io/gpu resource!*
> The issue lies in this method:
> [https://github.com/apache/hadoop/blob/e40e2d6ad5cbe782c3a067229270738b501ed27e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java#L179]
> The problem is that the method deductUnallocatedResource does not check
> whether the container's resource can be subtracted from the unallocated
> resource, i.e. whether the unallocated resource stays at or above zero.
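To illustrate the arithmetic, here is a simplified, self-contained model (hypothetical stand-in classes, not the real Hadoop code): subtracting a recovered container's resource without any capacity check drives the unallocated resource negative, exactly like the yarn.io/gpu: -1 values in the logs above.

```java
// Hypothetical simplified model of the relevant part of SchedulerNode,
// NOT the actual Hadoop classes; it only demonstrates the arithmetic.
class SimpleResource {
    long memory;
    int vcores;

    SimpleResource(long memory, int vcores) {
        this.memory = memory;
        this.vcores = vcores;
    }

    // Mirrors the unconditional subtraction in deductUnallocatedResource:
    // there is no check that 'other' actually fits into this resource.
    void subtract(SimpleResource other) {
        this.memory -= other.memory;
        this.vcores -= other.vcores;
    }

    @Override
    public String toString() {
        return "<memory:" + memory + ", vCores:" + vcores + ">";
    }
}

public class RecoveryDeductSketch {
    public static void main(String[] args) {
        SimpleResource unallocated = new SimpleResource(0, 0);  // node already full
        SimpleResource recovered = new SimpleResource(1000, 1); // container being recovered

        unallocated.subtract(recovered); // no capacity check before deducting
        System.out.println(unallocated); // <memory:-1000, vCores:-1>, as in the failing test below
    }
}
```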
> Here is the ResourceManager call hierarchy for the method (from top to
> bottom):
> {code:java}
> 1. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler#handle
> 2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler#addNode
> 3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler#recoverContainersOnNode
> 4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode#recoverContainer
> 5. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode#allocateContainer
> 6. org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode#allocateContainer(org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer, boolean)
>    <- deductUnallocatedResource is called here!{code}
> *Testcase that reproduces the issue:*
> *Add this testcase to TestFSSchedulerNode:*
>
> {code:java}
> @Test
> public void testRecovery() {
>   RMNode node = createNode();
>   FSSchedulerNode schedulerNode = new FSSchedulerNode(node, false);
>   RMContainer container1 =
>       createContainer(Resource.newInstance(4096, 4), null);
>   RMContainer container2 =
>       createContainer(Resource.newInstance(4096, 4), null);
>
>   schedulerNode.allocateContainer(container1);
>   schedulerNode.containerStarted(container1.getContainerId());
>   schedulerNode.allocateContainer(container2);
>   schedulerNode.containerStarted(container2.getContainerId());
>   assertEquals("All resources of node should have been allocated",
>       nodeResource, schedulerNode.getAllocatedResource());
>
>   RMContainer container3 =
>       createContainer(Resource.newInstance(1000, 1), null);
>   when(container3.getState()).thenReturn(RMContainerState.NEW);
>   assertEquals("All resources of node should have been allocated",
>       nodeResource, schedulerNode.getAllocatedResource());
>
>   schedulerNode.recoverContainer(container3);
>   assertEquals("No resource should have been unallocated",
>       Resources.none(), schedulerNode.getUnallocatedResource());
>   assertEquals("All resources of node should have been allocated",
>       nodeResource, schedulerNode.getAllocatedResource());
> }
> {code}
>
>
> *Result of testcase:*
> {code:java}
> java.lang.AssertionError: No resource should have been unallocated
> Expected :<memory:0, vCores:0>
> Actual :<memory:-1000, vCores:-1>{code}
> *It is immediately clear that not only GPU (and other custom resource
> types) but all resource types are affected by this issue!*
>
> *Possible fix:*
> 1. A condition needs to be introduced that checks whether there are enough
> resources on the node; we should proceed with the container's recovery only
> if this is true.
> 2. An error log should be added. At first glance this seems sufficient, so
> no exception is required, but this needs more thorough investigation and a
> manual test on a cluster!
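A minimal sketch of the proposed guard, as an assumption about the shape of the fix rather than an actual patch (it uses simplified hypothetical classes instead of the real SchedulerNode; the real code would presumably compare resources via the resource calculator utilities so that extended types like yarn.io/gpu are covered):

```java
// Hypothetical sketch of fix ideas #1 and #2: check that the recovered
// container fits into the node's unallocated resource before deducting,
// and log an error instead of letting the value go negative.
class NodeResources {
    long memory;
    int vcores;

    NodeResources(long memory, int vcores) {
        this.memory = memory;
        this.vcores = vcores;
    }

    // Fix idea #1: an explicit fits-check before any deduction.
    boolean fits(NodeResources other) {
        return other.memory <= memory && other.vcores <= vcores;
    }

    void subtract(NodeResources other) {
        memory -= other.memory;
        vcores -= other.vcores;
    }
}

public class GuardedRecoverySketch {
    // Returns true if the container was recovered, false if it was skipped.
    static boolean tryRecover(NodeResources unallocated, NodeResources container,
            String containerId) {
        if (!unallocated.fits(container)) {
            // Fix idea #2: an error log instead of silently going negative.
            System.err.println("Cannot recover container " + containerId
                + ": not enough unallocated resources on node");
            return false;
        }
        unallocated.subtract(container); // deduct only when it actually fits
        return true;
    }

    public static void main(String[] args) {
        NodeResources unallocated = new NodeResources(1024, 1);
        System.out.println(tryRecover(unallocated, new NodeResources(1024, 1), "c1")); // true
        System.out.println(tryRecover(unallocated, new NodeResources(1024, 1), "c2")); // false, node full
    }
}
```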
>
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]