Prabhu Joseph created YARN-10873:
------------------------------------
Summary: Graceful Decommission ignores launched containers and
gets deactivated before timeout
Key: YARN-10873
URL: https://issues.apache.org/jira/browse/YARN-10873
Project: Hadoop YARN
Issue Type: Bug
Components: RM
Affects Versions: 3.3.1
Reporter: Prabhu Joseph
Assignee: Srinivas S T
A node under Graceful Decommission gets deactivated before the timeout even
though there are launched containers on it.
On a status update from a node that is in DECOMMISSIONING, the RM transitions
the node to DECOMMISSIONED before the timeout if there are no running
applications. The running applications are populated from the container
statuses reported by the NodeManager, so a container that has been launched but
not yet reported in a node status is not counted. We have observed containers
being launched on the NodeManager at the same time as the ResourceManager
forcefully decommissions the node.
This affects Livy Interactive jobs, which support only one application
attempt.
We suggest checking FiCaSchedulerNode to identify whether there are any
launched containers, and using that to decide whether to forcefully
decommission the node.
{code}
public static class StatusUpdateWhenHealthyTransition implements
    MultipleArcTransition<RMNodeImpl, RMNodeEvent, NodeState> {
  @Override
  public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) {
    .....
    if (isNodeDecommissioning) {
      List<ApplicationId> keepAliveApps = statusEvent.getKeepAliveAppIds();
      if (rmNode.runningApplications.isEmpty() &&
          (keepAliveApps == null || keepAliveApps.isEmpty())) {
        RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED);
        return NodeState.DECOMMISSIONED;
      }
    }
{code}
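A minimal sketch of the suggested check, written as a standalone model rather than a patch against RMNodeImpl. The class DecommissionCheckSketch, both method names, and the launchedContainers parameter are hypothetical; the sketch assumes the scheduler's view of the node (for example FiCaSchedulerNode) can supply a count of containers it has allocated or launched there, and that application IDs can be modelled as plain strings.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the decision; not the actual RMNodeImpl code.
final class DecommissionCheckSketch {

  // Condition as it appears in the excerpt above: only the applications
  // reported through node status updates and keep-alive apps are considered.
  static boolean canDeactivateCurrent(Set<String> runningApps,
                                      List<String> keepAliveApps) {
    return runningApps.isEmpty()
        && (keepAliveApps == null || keepAliveApps.isEmpty());
  }

  // Suggested condition: additionally require that the scheduler knows of no
  // launched containers on the node. launchedContainers is an assumed input
  // that would come from the scheduler node, not from the NM status update.
  static boolean canDeactivateProposed(Set<String> runningApps,
                                       List<String> keepAliveApps,
                                       int launchedContainers) {
    return canDeactivateCurrent(runningApps, keepAliveApps)
        && launchedContainers == 0;
  }

  public static void main(String[] args) {
    Set<String> noApps = Collections.emptySet();
    // Scenario from the logs: the AM container was launched, but the node has
    // not yet reported it, so the running-applications set is still empty.
    System.out.println("current:  " + canDeactivateCurrent(noApps, null));
    System.out.println("proposed: " + canDeactivateProposed(noApps, null, 1));
  }
}
```

In the scenario from the logs, the current condition allows deactivation while the proposed one keeps the node in DECOMMISSIONING until the launched container finishes or the timeout expires.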
*ResourceManager Logs:*
{code}
2021-06-16 08:45:04,140 INFO
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching
masterappattempt_1623830067124_0382_000001
2021-06-16 08:45:04,141 INFO
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting up
container Container: [ContainerId: container_1623830067124_0382_01_000001,
AllocationRequestId: 0, Version: 0, NodeId: node1:34753, NodeHttpAddress:
927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: <memory:29696,
vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service:
10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM
appattempt_1623830067124_0382_000001
2021-06-16 08:45:04,141 INFO
org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:
Create AMRMToken for ApplicationAttempt: appattempt_1623830067124_0382_000001
2021-06-16 08:45:04,141 INFO
org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:
Creating password for appattempt_1623830067124_0382_000001
2021-06-16 08:45:04,154 INFO
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done
launching container Container: [ContainerId:
container_1623830067124_0382_01_000001, AllocationRequestId: 0, Version: 0,
NodeId: node1:34753, NodeHttpAddress:
927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: <memory:29696,
vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service:
10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM
appattempt_1623830067124_0382_000001
2021-06-16 08:45:04,776 INFO
org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully
decommission node node1:34753 with state RUNNING
2021-06-16 08:45:04,776 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node
node1:34753 in DECOMMISSIONING.
2021-06-16 08:45:04,776 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753
Node Transitioned from RUNNING to DECOMMISSIONING
2021-06-16 08:45:05,131 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating
Node node1:34753 as it is now DECOMMISSIONED
2021-06-16 08:45:05,131 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753
Node Transitioned from DECOMMISSIONING to DECOMMISSIONED
2021-06-16 08:45:05,131 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1623830067124_0382_01_000001 Container Transitioned from ACQUIRED to
KILLED
{code}