[ 
https://issues.apache.org/jira/browse/YARN-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-10873.
----------------------------------
    Resolution: Fixed

> Graceful Decommission ignores launched containers and gets deactivated before 
> timeout
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-10873
>                 URL: https://issues.apache.org/jira/browse/YARN-10873
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: RM
>    Affects Versions: 3.3.1
>            Reporter: Prabhu Joseph
>            Assignee: Srinivas S T
>            Priority: Major
>             Fix For: 3.4.0
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> During Graceful Decommission, a node gets deactivated before the timeout even
> though it has launched containers.
> On a status update from a node in DECOMMISSIONING state, the RM transitions the
> node to DECOMMISSIONED before the timeout if there are no running applications.
> The running applications are populated from the container statuses reported by
> the NodeManager. We have observed containers being launched at the NodeManager
> while, at the same time, the ResourceManager forcefully decommissions the node.
> This affects Livy interactive jobs, which support only one application
> attempt.
> I suggest checking FiCaSchedulerNode to identify whether there are any launched
> containers before deciding whether to forcefully decommission the node.
> {code}
>   public static class StatusUpdateWhenHealthyTransition implements
>       MultipleArcTransition<RMNodeImpl, RMNodeEvent, NodeState> {
>     @Override
>     public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) {
>       .....
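>       if (isNodeDecommissioning) {
>         List<ApplicationId> keepAliveApps = statusEvent.getKeepAliveAppIds();
>         // Only running and keep-alive applications are checked here; containers
>         // that are launched but not yet reported by the NodeManager are ignored.
>         if (rmNode.runningApplications.isEmpty() &&
>             (keepAliveApps == null || keepAliveApps.isEmpty())) {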
>           RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED);
>           return NodeState.DECOMMISSIONED;
>         }
>       }
> {code}
> *ResourceManager Logs:*
> {code}
> 2021-06-16 08:45:04,140 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: 
> Launching masterappattempt_1623830067124_0382_000001
> 2021-06-16 08:45:04,141 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting 
> up container Container: [ContainerId: container_1623830067124_0382_01_000001, 
> AllocationRequestId: 0, Version: 0, NodeId: node1:34753, NodeHttpAddress: 
> 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: <memory:29696, 
> vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: 
> 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM 
> appattempt_1623830067124_0382_000001
> 2021-06-16 08:45:04,141 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:
>  Create AMRMToken for ApplicationAttempt: appattempt_1623830067124_0382_000001
> 2021-06-16 08:45:04,141 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:
>  Creating password for appattempt_1623830067124_0382_000001
> 2021-06-16 08:45:04,154 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done 
> launching container Container: [ContainerId: 
> container_1623830067124_0382_01_000001, AllocationRequestId: 0, Version: 0, 
> NodeId: node1:34753, NodeHttpAddress: 
> 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: <memory:29696, 
> vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: 
> 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM 
> appattempt_1623830067124_0382_000001
> 2021-06-16 08:45:04,776 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node1:34753 with state RUNNING
> 2021-06-16 08:45:04,776 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node1:34753 in DECOMMISSIONING.
> 2021-06-16 08:45:04,776 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 
> Node Transitioned from RUNNING to DECOMMISSIONING
> 2021-06-16 08:45:05,131 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating 
> Node node1:34753 as it is now DECOMMISSIONED
> 2021-06-16 08:45:05,131 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 
> Node Transitioned from DECOMMISSIONING to DECOMMISSIONED
> 2021-06-16 08:45:05,131 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1623830067124_0382_01_000001 Container Transitioned from ACQUIRED 
> to KILLED
> {code}
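
The check suggested in the description can be sketched as a small standalone model. This is only an illustration under stated assumptions, not the actual RMNodeImpl code and not the committed patch: the class DecommissionCheck, the method canDeactivateEarly, and the launchedContainers parameter are hypothetical names, with launchedContainers standing in for whatever count the scheduler-side node object (for example FiCaSchedulerNode) tracks.

```java
import java.util.List;

// Hypothetical, simplified model of the early-decommission decision.
// It does not use any Hadoop classes; all names here are illustrative.
public class DecommissionCheck {

    /**
     * Returns true only when a DECOMMISSIONING node may be moved to
     * DECOMMISSIONED before the graceful timeout expires.
     *
     * launchedContainers is an assumed stand-in for the number of containers
     * the scheduler has placed on the node, including ones the NodeManager
     * has not reported yet.
     */
    public static boolean canDeactivateEarly(List<String> runningApps,
                                             List<String> keepAliveApps,
                                             int launchedContainers) {
        // The two conditions from the snippet in the description.
        boolean noRunningApps = runningApps.isEmpty();
        boolean noKeepAliveApps = keepAliveApps == null || keepAliveApps.isEmpty();
        // Extra condition proposed in the description: keep the node while
        // the scheduler still tracks containers on it.
        boolean noLaunchedContainers = launchedContainers == 0;
        return noRunningApps && noKeepAliveApps && noLaunchedContainers;
    }

    public static void main(String[] args) {
        // AM container already launched, application not yet reported by the node.
        System.out.println(canDeactivateEarly(List.of(), null, 1)); // false
        // Nothing left on the node: early deactivation is allowed.
        System.out.println(canDeactivateEarly(List.of(), null, 0)); // true
    }
}
```

With the original two-condition check, the first case would return true, which matches the logs above: the node is deactivated about 355 ms after entering DECOMMISSIONING and the AM container is killed in ACQUIRED state.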



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

