[ https://issues.apache.org/jira/browse/YARN-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Abhishek Dixit updated YARN-11421:
----------------------------------
Description:
During graceful decommission, a node can be deactivated before the timeout even
though containers have been launched on that node.
We have observed cases where the graceful decommission signal is sent to a node
at the same time as containers are launched on its NodeManager. In such cases
the ResourceManager moves the node from Decommissioning to Decommissioned state
because launched containers are not checked in DeactivateNodeTransition.
We suggest waiting for the AM liveliness timeout to complete before marking the
node ready to be decommissioned. This behavior will be gated behind the flag
decommissioning-nodes-watcher.delayed-removal.allowed.
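The proposed check can be sketched roughly as follows. This is a minimal standalone model, not the actual YARN code: the class DelayedRemovalSketch, the Status values, and the check method with its parameters are all hypothetical names chosen for illustration, and the 10-minute expiry in main is an assumed value. It only shows the intended ordering: known containers still block removal, and when the flag is enabled a node with no known containers is held until the AM liveliness timeout has elapsed since decommissioning started.

```java
/** Hypothetical model of the proposed readiness check; not YARN's actual classes. */
final class DelayedRemovalSketch {

    /** Illustrative outcomes; the names are assumptions, not YARN identifiers. */
    enum Status { WAIT_CONTAINER, WAIT_AM_EXPIRY, READY }

    /**
     * @param delayedRemovalAllowed    value of the gating flag
     * @param runningContainers        containers the RM currently knows about on the node
     * @param msSinceDecommissionStart time since the node entered Decommissioning
     * @param amExpiryMs               AM liveliness timeout
     */
    static Status check(boolean delayedRemovalAllowed, int runningContainers,
                        long msSinceDecommissionStart, long amExpiryMs) {
        // Containers the RM already knows about always keep the node alive.
        if (runningContainers > 0) {
            return Status.WAIT_CONTAINER;
        }
        // With the flag on, hold the node until the AM timeout has passed, so a
        // container launched concurrently with the signal has time to be seen.
        if (delayedRemovalAllowed && msSinceDecommissionStart < amExpiryMs) {
            return Status.WAIT_AM_EXPIRY;
        }
        return Status.READY;
    }

    public static void main(String[] args) {
        long amExpiryMs = 600_000L; // assumed 10-minute timeout, for illustration only
        System.out.println(check(false, 0, 5_000L, amExpiryMs));   // flag off
        System.out.println(check(true, 0, 5_000L, amExpiryMs));    // flag on, too early
        System.out.println(check(true, 0, 600_000L, amExpiryMs));  // flag on, timeout elapsed
    }
}
```

With the flag off the sketch reproduces the behavior reported here: a node with no known containers is considered ready immediately, even if a launch is in flight.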
was:
During graceful decommission, a node can be deactivated before the timeout even
though containers have been launched on that node.
We have observed cases where the graceful decommission signal is sent to a node
at the same time as containers are launched on its NodeManager. In such cases
the ResourceManager moves the node from Decommissioning to Decommissioned state
because launched containers are not checked in DeactivateNodeTransition.
We suggest replacing DeactivateNodeTransition with a MultiArc transition that
checks the scheduler for AM containers and then decides whether to keep the
node in Decommissioning state or move it to Decommissioned state.
{code:java}
.addTransition(NodeState.DECOMMISSIONING, NodeState.DECOMMISSIONED,
    RMNodeEventType.DECOMMISSION,
    new DeactivateNodeTransition(NodeState.DECOMMISSIONED))
{code}
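For comparison, the decision that a MultiArc transition would make can be sketched as below. This is a standalone illustration under stated assumptions, not a patch: MultiArcSketch, its local NodeState enum, and onDecommissionEvent are hypothetical names, and the count of AM containers is passed in directly rather than queried from the scheduler. In RMNodeImpl the same logic would presumably live in a MultipleArcTransition registered with both states as possible targets.

```java
/** Hypothetical model of a two-outcome transition; not YARN's actual classes. */
final class MultiArcSketch {

    /** Local stand-in for the two node states involved. */
    enum NodeState { DECOMMISSIONING, DECOMMISSIONED }

    /**
     * Decides the post-state for a DECOMMISSION event on a node that is
     * currently Decommissioning.
     *
     * @param amContainersOnNode AM containers the scheduler reports for the node
     *                           (an assumed input; the real lookup is not shown)
     */
    static NodeState onDecommissionEvent(int amContainersOnNode) {
        // Stay in Decommissioning while any AM container is still on the node.
        return amContainersOnNode > 0
            ? NodeState.DECOMMISSIONING
            : NodeState.DECOMMISSIONED;
    }

    public static void main(String[] args) {
        System.out.println(onDecommissionEvent(1)); // node is kept
        System.out.println(onDecommissionEvent(0)); // node is deactivated
    }
}
```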
> Graceful Decommission ignores launched containers and gets deactivated before
> timeout
> -------------------------------------------------------------------------------------
>
> Key: YARN-11421
> URL: https://issues.apache.org/jira/browse/YARN-11421
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.2.1, 3.3.1, 3.3.4
> Reporter: Abhishek Dixit
> Priority: Major
>
> During graceful decommission, a node can be deactivated before the timeout even
> though containers have been launched on that node.
> We have observed cases where the graceful decommission signal is sent to a node
> at the same time as containers are launched on its NodeManager. In such cases
> the ResourceManager moves the node from Decommissioning to Decommissioned state
> because launched containers are not checked in DeactivateNodeTransition.
> We suggest waiting for the AM liveliness timeout to complete before marking the
> node ready to be decommissioned. This behavior will be gated behind the flag
> decommissioning-nodes-watcher.delayed-removal.allowed.