[ https://issues.apache.org/jira/browse/YARN-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732254#comment-17732254 ]
Akshesh Doshi commented on YARN-11512:
--------------------------------------
Also described (by my colleague) in
https://stackoverflow.com/q/76443651/3061686.
> Graceful decommission doesn't work when NM restart recovery is enabled
> ----------------------------------------------------------------------
>
> Key: YARN-11512
> URL: https://issues.apache.org/jira/browse/YARN-11512
> Project: Hadoop YARN
> Issue Type: Bug
> Components: graceful, nodemanager
> Affects Versions: 3.3.1
> Reporter: Akshesh Doshi
> Priority: Major
>
> We added the following properties to the yarn-site.xml file of our Hadoop YARN cluster:
> {code:xml}
> <property>
>   <name>yarn.nodemanager.recovery.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.nodemanager.recovery.supervised</name>
>   <value>true</value>
> </property>
> {code}
> NM restart recovery has been working well: applications do not fail
> even when we restart the NodeManager processes. However, when we try to
> decommission a node by adding its hostname to the yarn_exclude_hosts file and
> refreshing nodes on the ResourceManager, the applications that had containers
> running on that node get stuck for a long time and then fail.
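For anyone trying to reproduce this, the decommission steps described above can be sketched roughly as follows. This is only an illustration, not the reporter's actual tooling: the helper names, the hostname, and the temporary exclude-file path are hypothetical, and the report does not say whether the refresh was issued with the graceful option. The command it builds uses the documented `yarn rmadmin -refreshNodes [-g [timeout in seconds] -client|server]` form.

```python
import shlex
import tempfile
from pathlib import Path
from typing import List, Optional


def add_to_exclude_file(exclude_file: Path, hostname: str) -> bool:
    """Add hostname to the exclude file unless it is already listed.

    Returns True if the file was changed. The file is rewritten as one
    hostname per line so a missing trailing newline cannot merge entries.
    """
    existing = exclude_file.read_text().split() if exclude_file.exists() else []
    if hostname in existing:
        return False
    exclude_file.write_text("\n".join(existing + [hostname]) + "\n")
    return True


def refresh_nodes_command(timeout_secs: Optional[int] = None) -> List[str]:
    """Build the ResourceManager refresh command as an argument list.

    With no timeout this is a plain refresh; with a timeout it requests a
    graceful decommission tracked on the client side (assumption: the
    client-side mode is what you want; "-server" is the other option).
    """
    cmd = ["yarn", "rmadmin", "-refreshNodes"]
    if timeout_secs is not None:
        cmd += ["-g", str(timeout_secs), "-client"]
    return cmd


if __name__ == "__main__":
    # Hypothetical path and hostname, used only for this demonstration.
    exclude = Path(tempfile.mkdtemp()) / "yarn_exclude_hosts"
    add_to_exclude_file(exclude, "worker-01.example.com")
    # The command is only printed here; run it on the ResourceManager host.
    print(shlex.join(refresh_nodes_command(timeout_secs=3600)))
```

The command is printed rather than executed so the sketch can be run safely on a machine without a YARN client installed.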