Szilard Nemeth created YARN-9430:
------------------------------------

             Summary: Recovering containers does not check available resources 
on node
                 Key: YARN-9430
                 URL: https://issues.apache.org/jira/browse/YARN-9430
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Szilard Nemeth
            Assignee: Szilard Nemeth


I have a testcase that verifies that when some GPU devices go offline and recovery 
happens, only the containers that still fit into the node's resources are 
recovered. Unfortunately, this is not the case: the RM does not check the 
available resources on the node during recovery.

*Detailed explanation:*

*Testcase:* 
1. There are 2 nodes running NodeManagers
2. nvidia-smi is replaced with a fake bash script that reports 2 GPU devices 
per node, initially. This means 4 GPU devices in the cluster altogether.
3. RM / NM recovery is enabled
4. The test starts a sleep job, requesting 4 containers with 1 GPU device each 
(the AM does not request GPUs)
5. Before restart, the fake bash script is adjusted so that it reports only 1 GPU 
device per node (2 in the cluster) after restart.
6. Restart is initiated.

 

*Expected behavior:* 
After restart, only the AM and 2 of the GPU-requesting containers should be 
started, as there are only 2 GPU devices left in the cluster.

 

*Actual behavior:* 
The AM + 4 containers are allocated, i.e. all of the containers originally 
started in step 4.

App id was: 1553977186701_0001

*Logs*:

{code:java}
2019-03-30 13:22:30,299 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1553977186701_0001_000001 of type RECOVER

2019-03-30 13:22:30,366 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1553977186701_0001_000001 to scheduler from user: systest
2019-03-30 13:22:30,366 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: appattempt_1553977186701_0001_000001 is recovering. Skipping notifying ATTEMPT_ADDED
2019-03-30 13:22:30,367 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1553977186701_0001_000001 State change from NEW to LAUNCHED on event = RECOVER

2019-03-30 13:22:33,257 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000001, CreateTime: 1553977260732, Version: 0, State: RUNNING, Capability: <memory:1024, vCores:1>, Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]

2019-03-30 13:22:33,275 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000004, CreateTime: 1553977272802, Version: 0, State: RUNNING, Capability: <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]

2019-03-30 13:22:33,275 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000004 of capacity <memory:1024, vCores:1, yarn.io/gpu: 1> on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 2 containers, <memory:2048, vCores:2, yarn.io/gpu: 1> used and <memory:37252, vCores:6> available after allocation

2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000005, CreateTime: 1553977272803, Version: 0, State: RUNNING, Capability: <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_000005 of type RECOVER
2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_000005 Container Transitioned from NEW to RUNNING
2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000005 of capacity <memory:1024, vCores:1, yarn.io/gpu: 1> on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, <memory:3072, vCores:3, yarn.io/gpu: 2> used and <memory:36228, vCores:5, yarn.io/gpu: -1> available after allocation

2019-03-30 13:22:33,279 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000003, CreateTime: 1553977272166, Version: 0, State: RUNNING, Capability: <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_000003 of type RECOVER
2019-03-30 13:22:33,280 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_000003 Container Transitioned from NEW to RUNNING
2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1553977186701_0001 of type APP_RUNNING_ON_NODE
2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000003 of capacity <memory:1024, vCores:1, yarn.io/gpu: 1> on host snemeth-gpu-3.vpc.cloudera.com:8041, which has 2 containers, <memory:2048, vCores:2, yarn.io/gpu: 2> used and <memory:37252, vCores:6, yarn.io/gpu: -1> available after allocation
2019-03-30 13:22:33,280 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: SchedulerAttempt appattempt_1553977186701_0001_000001 is recovering container container_e84_1553977186701_0001_01_000003
{code}


There are multiple logs like this: 
{code:java}
Assigned container container_e84_1553977186701_0001_01_000005 of capacity <memory:1024, vCores:1, yarn.io/gpu: 1> on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, <memory:3072, vCores:3, yarn.io/gpu: 2> used and <memory:36228, vCores:5, yarn.io/gpu: -1> available after allocation
{code}
*Note the -1 value for the yarn.io/gpu resource!*

The issue lies in this method: 
https://github.com/apache/hadoop/blob/e40e2d6ad5cbe782c3a067229270738b501ed27e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java#L179

The problem is that the method deductUnallocatedResource does not check whether 
the node still has enough unallocated resource before the container's resource is 
subtracted from it, so the unallocated resource can drop below zero.
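
To illustrate why the value can go negative: as far as I can tell from the linked 
source, deductUnallocatedResource boils down to a Resources.subtractFrom call on 
the node's unallocated resource, and subtractFrom is a plain component-wise 
subtraction with no lower bound. The snippet below is a standalone illustration 
written for this report (not part of any patch); the class name is made up for 
the example:
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

// Standalone illustration: deducting a container that is larger than the
// remaining unallocated resource yields negative values, because
// Resources.subtractFrom does not floor the result at zero.
public class NegativeUnallocatedDemo {
  public static void main(String[] args) {
    Resource unallocated = Resource.newInstance(1024, 1);
    Resource container = Resource.newInstance(2048, 2);
    Resources.subtractFrom(unallocated, container);
    // Prints something like <memory:-1024, vCores:-1>
    System.out.println(unallocated);
  }
}
{code}
This is the same effect as the yarn.io/gpu: -1 seen in the logs and the negative 
memory/vCores seen in the test result below.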
Here is the ResourceManager call hierarchy for the method (from top to bottom):
{code:java}
1. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler#handle
2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler#addNode
3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler#recoverContainersOnNode
4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode#recoverContainer
5. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode#allocateContainer
6. org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode#allocateContainer(org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer, boolean)
   deductUnallocatedResource is called here!
{code}

*Testcase that reproduces the issue:* 
*Add this testcase to TestFSSchedulerNode:*

 
{code:java}
@Test
public void testRecovery() {
  RMNode node = createNode();
  FSSchedulerNode schedulerNode = new FSSchedulerNode(node, false);
  RMContainer container1 = createContainer(Resource.newInstance(4096, 4),
      null);
  RMContainer container2 = createContainer(Resource.newInstance(4096, 4),
      null);

  schedulerNode.allocateContainer(container1);
  schedulerNode.containerStarted(container1.getContainerId());
  schedulerNode.allocateContainer(container2);
  schedulerNode.containerStarted(container2.getContainerId());
  assertEquals("All resources of node should have been allocated",
      nodeResource, schedulerNode.getAllocatedResource());

  RMContainer container3 = createContainer(Resource.newInstance(1000, 1),
      null);
  when(container3.getState()).thenReturn(RMContainerState.NEW);
  assertEquals("All resources of node should have been allocated",
      nodeResource, schedulerNode.getAllocatedResource());

  schedulerNode.recoverContainer(container3);
  assertEquals("No resource should have been unallocated",
      Resources.none(), schedulerNode.getUnallocatedResource());
  assertEquals("All resources of node should have been allocated",
      nodeResource, schedulerNode.getAllocatedResource());
}
{code}
 

 

*Result of testcase:*
{code:java}
java.lang.AssertionError: No resource should have been unallocated 
Expected :<memory:0, vCores:0>
Actual :<memory:-1000, vCores:-1>{code}

*It's immediately clear that not only GPUs (or other custom resource types) are 
affected: any resource, including memory and vCores, can go negative this way!*

 

*Possible fix:* 
1. A condition needs to be introduced that checks whether the node has enough 
unallocated resources, and the container's recovery should only proceed if it 
does (see the sketch after this list).
2. An error log should be added. At first glance this seems sufficient, so no 
exception is required, but this needs a more thorough investigation and a manual 
test on a cluster!
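
A minimal sketch of the guard from point 1, placed in 
SchedulerNode#recoverContainer (the entry point in the call hierarchy above). 
This is only an illustration under the assumption that Resources.fitsIn is the 
right check; the exact placement, method body and log message are not a tested 
patch:
{code:java}
// Sketch only, not a tested patch: recover the container only if its
// resource still fits into the node's unallocated resource.
public synchronized void recoverContainer(RMContainer rmContainer) {
  if (rmContainer.getState().equals(RMContainerState.COMPLETED)) {
    return;
  }
  Resource required = rmContainer.getContainer().getResource();
  if (!Resources.fitsIn(required, getUnallocatedResource())) {
    // Point 2: an error log instead of an exception, pending further
    // investigation and a manual test on a cluster.
    LOG.error("Cannot recover container " + rmContainer.getContainerId()
        + " on node " + getNodeName() + ": it requires " + required
        + " but only " + getUnallocatedResource() + " is unallocated");
    return;
  }
  allocateContainer(rmContainer, true);
}
{code}
With such a guard the testcase above would pass: container3 would not be 
recovered, so the unallocated resource stays at Resources.none() and the 
allocated resource stays at nodeResource.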

 


