[
https://issues.apache.org/jira/browse/YARN-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Prabhu Joseph resolved YARN-10873.
----------------------------------
Resolution: Fixed
> Graceful Decommission ignores launched containers and gets deactivated before
> timeout
> -------------------------------------------------------------------------------------
>
> Key: YARN-10873
> URL: https://issues.apache.org/jira/browse/YARN-10873
> Project: Hadoop YARN
> Issue Type: Bug
> Components: RM
> Affects Versions: 3.3.1
> Reporter: Prabhu Joseph
> Assignee: Srinivas S T
> Priority: Major
> Fix For: 3.4.0
>
> Time Spent: 2h 20m
> Remaining Estimate: 0h
>
> A node under Graceful Decommission gets deactivated before the timeout even
> though there are launched containers on it.
> On a status update from a node in the DECOMMISSIONING state, the RM
> transitions the node to DECOMMISSIONED before the timeout if there are no
> running applications. The running applications are populated from the
> container statuses reported by the NodeManager. We have observed containers
> being launched at the NodeManager while, at the same time, the
> ResourceManager forcefully decommissions the node.
> This affects Livy interactive jobs, which support only one application
> attempt.
> Suggest checking FiCaSchedulerNode to identify whether there are any launched
> containers and, based on that, decide whether to forcefully decommission.
> {code}
> public static class StatusUpdateWhenHealthyTransition implements
>     MultipleArcTransition<RMNodeImpl, RMNodeEvent, NodeState> {
>   @Override
>   public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) {
>     .....
>     if (isNodeDecommissioning) {
>       List<ApplicationId> keepAliveApps = statusEvent.getKeepAliveAppIds();
>       // Only applications reported by the NodeManager are checked here;
>       // containers launched but not yet reported are not considered.
>       if (rmNode.runningApplications.isEmpty() &&
>           (keepAliveApps == null || keepAliveApps.isEmpty())) {
>         RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED);
>         return NodeState.DECOMMISSIONED;
>       }
>     }
> {code}
> *ResourceManager Logs:*
> {code}
> 2021-06-16 08:45:04,140 INFO
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher:
> Launching masterappattempt_1623830067124_0382_000001
> 2021-06-16 08:45:04,141 INFO
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting
> up container Container: [ContainerId: container_1623830067124_0382_01_000001,
> AllocationRequestId: 0, Version: 0, NodeId: node1:34753, NodeHttpAddress:
> 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: <memory:29696,
> vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service:
> 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM
> appattempt_1623830067124_0382_000001
> 2021-06-16 08:45:04,141 INFO
> org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:
> Create AMRMToken for ApplicationAttempt: appattempt_1623830067124_0382_000001
> 2021-06-16 08:45:04,141 INFO
> org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:
> Creating password for appattempt_1623830067124_0382_000001
> 2021-06-16 08:45:04,154 INFO
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done
> launching container Container: [ContainerId:
> container_1623830067124_0382_01_000001, AllocationRequestId: 0, Version: 0,
> NodeId: node1:34753, NodeHttpAddress:
> 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: <memory:29696,
> vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service:
> 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM
> appattempt_1623830067124_0382_000001
> 2021-06-16 08:45:04,776 INFO
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully
> decommission node node1:34753 with state RUNNING
> 2021-06-16 08:45:04,776 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node
> node1:34753 in DECOMMISSIONING.
> 2021-06-16 08:45:04,776 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753
> Node Transitioned from RUNNING to DECOMMISSIONING
> 2021-06-16 08:45:05,131 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating
> Node node1:34753 as it is now DECOMMISSIONED
> 2021-06-16 08:45:05,131 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753
> Node Transitioned from DECOMMISSIONING to DECOMMISSIONED
> 2021-06-16 08:45:05,131 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> container_1623830067124_0382_01_000001 Container Transitioned from ACQUIRED
> to KILLED
> {code}
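
The check suggested in the description (consult the scheduler's view of the
node for launched containers before deactivating it) can be sketched roughly
as below. This is only an illustration of the idea, not the committed patch:
DecommissionCheck, SchedulerNodeView, launchedContainerCount and
canDeactivateBeforeTimeout are hypothetical names invented for the example,
and application IDs are modelled as plain strings so the snippet compiles
without Hadoop on the classpath.

```java
import java.util.List;
import java.util.Set;

/**
 * Simplified, hypothetical model of the decision the ResourceManager makes
 * when a DECOMMISSIONING node sends a status update. None of these names are
 * real YARN classes; they only mirror the shape of the logic.
 */
final class DecommissionCheck {

    /** Hypothetical stand-in for the scheduler's per-node view. */
    interface SchedulerNodeView {
        int launchedContainerCount();
    }

    /**
     * Returns true only when nothing is left on the node: no running
     * applications reported by the NodeManager, no keep-alive applications,
     * and no containers the scheduler already knows to be launched there.
     */
    static boolean canDeactivateBeforeTimeout(Set<String> runningApplications,
                                              List<String> keepAliveApps,
                                              SchedulerNodeView schedulerNode) {
        boolean noRunningApps = runningApplications.isEmpty();
        boolean noKeepAliveApps =
            keepAliveApps == null || keepAliveApps.isEmpty();
        // Assumed extra guard: containers known to the scheduler but not yet
        // reflected in the NodeManager's container statuses.
        boolean noLaunchedContainers =
            schedulerNode == null || schedulerNode.launchedContainerCount() == 0;
        return noRunningApps && noKeepAliveApps && noLaunchedContainers;
    }

    public static void main(String[] args) {
        // AM container launched on the node but not yet reported by the NM.
        boolean racing = canDeactivateBeforeTimeout(Set.of(), null, () -> 1);
        // Node is genuinely idle.
        boolean idle = canDeactivateBeforeTimeout(Set.of(), List.of(), () -> 0);
        System.out.println("racing node can deactivate: " + racing);
        System.out.println("idle node can deactivate: " + idle);
    }
}
```

With a guard of this kind, the node in the logs above would stay in
DECOMMISSIONING until the AM container finished or the timeout expired,
instead of being deactivated less than a second after the launch.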