[ 
https://issues.apache.org/jira/browse/YARN-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732254#comment-17732254
 ] 

Akshesh Doshi commented on YARN-11512:
--------------------------------------

Also described (by my colleague) in 
https://stackoverflow.com/q/76443651/3061686.

> Graceful decommission doesn't work when NM restart recovery is enabled
> ----------------------------------------------------------------------
>
>                 Key: YARN-11512
>                 URL: https://issues.apache.org/jira/browse/YARN-11512
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: graceful, nodemanager
>    Affects Versions: 3.3.1
>            Reporter: Akshesh Doshi
>            Priority: Major
>
> We have added these configs to the yarn-site.xml file of our Hadoop YARN cluster.
> {code:xml}
> <property>
>     <name>yarn.nodemanager.recovery.enabled</name>
>     <value>true</value>
> </property>
> <property>
>     <name>yarn.nodemanager.recovery.supervised</name>
>     <value>true</value>
> </property>
> {code}
> The NM restart recovery feature has been working well: applications do not
> fail even if we restart NodeManager processes. However, when we try to
> decommission a node by adding its hostname to the yarn_exclude_hosts file and
> refreshing nodes on the ResourceManager, the applications that had containers
> running on that node are stuck for a long time and then fail.
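>
> For context, the decommission step above depends on ResourceManager-side
> settings roughly like the following. This is only a sketch: the exclude file
> path is a hypothetical placeholder, and the timeout shown is the stock default
> rather than a value confirmed for this cluster.
> {code:xml}
> <!-- Assumption: the directory is hypothetical; only the file name
>      yarn_exclude_hosts comes from the description above. -->
> <property>
>     <name>yarn.resourcemanager.nodes.exclude-path</name>
>     <value>/etc/hadoop/conf/yarn_exclude_hosts</value>
> </property>
> <!-- Assumption: 3600 seconds is the default graceful decommission timeout;
>      the value actually in use on this cluster is not stated. -->
> <property>
>     <name>yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs</name>
>     <value>3600</value>
> </property>
> {code}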



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
