[ https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210694#comment-15210694 ]
Daniel Zhi commented on YARN-4676:
----------------------------------

Thanks. I will update the patch afterward. Here are quick responses:

1. I will look at the unit tests (once all the tests reported as failed by Hadoop QA actually pass on my local machine, with or without my patch).

3. At least in an AWS EMR cluster, all Hadoop daemons are configured to restart automatically if stopped. So the NodeManager, when told to shut down, will exit, but is then immediately restarted and tries to register itself with the RM. Should the node be recommissioned later, it will be accepted and become a normal node. While the node remains DECOMMISSIONED, this shutdown-restart loop keeps going until the node is either terminated or recommissioned. The 5-second wait keeps the loop from becoming too tight (1~2 seconds).

4. I will remove one.

5. The particular "} else if (exclude) {" is there to avoid the "No action ..." log message for a RUNNING node that was not excluded. The "} else {" block corresponding to "if (graceful) {" covers the non-graceful case:

       if (graceful) {
         ......
       } else {
         ......
       }

6. Will do.

7. yarn.resourcemanager.decommissioning.timeout is the key for the timeout; the default value is declared as YarnConfiguration.DEFAULT_DECOMMISSIONING_TIMEOUT.

8. I will figure out what to add/update.
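To make point 5 concrete, here is a rough sketch of one plausible reading of that branching. It is not the code in the patch: the class, enum, and method names are hypothetical, the in-memory LOG list stands in for the real logger, and only the shape of the if / else if (exclude) / else chain follows the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the exclude-list handling; not the actual patch code.
class ExcludeListSketch {
    enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }
    enum Action { RECOMMISSION, GRACEFUL_DECOMMISSION, DECOMMISSION, NONE }

    // Collects messages that would be logged, so the behaviour can be inspected.
    static final List<String> LOG = new ArrayList<>();

    static Action decide(NodeState state, boolean exclude, boolean graceful) {
        Action action;
        if (!exclude && state == NodeState.DECOMMISSIONING) {
            // Node is no longer excluded while still draining: bring it back.
            action = Action.RECOMMISSION;
        } else if (exclude) {
            if (graceful) {
                // Only a RUNNING node can start a graceful decommission.
                action = (state == NodeState.RUNNING)
                        ? Action.GRACEFUL_DECOMMISSION : Action.NONE;
            } else {
                // Non-graceful case: decommission unless already DECOMMISSIONED.
                action = (state != NodeState.DECOMMISSIONED)
                        ? Action.DECOMMISSION : Action.NONE;
            }
            if (action == Action.NONE) {
                LOG.add("No action for node in state " + state);
            }
        } else {
            // Not excluded and nothing to undo (e.g. a RUNNING node):
            // stay silent, with no "No action ..." message.
            action = Action.NONE;
        }
        return action;
    }

    public static void main(String[] args) {
        System.out.println(decide(NodeState.RUNNING, false, true));
        System.out.println(decide(NodeState.RUNNING, true, true));
        System.out.println(decide(NodeState.DECOMMISSIONED, true, false));
        System.out.println(LOG);
    }
}
```

With a plain "} else {" in place of "} else if (exclude) {", every healthy RUNNING node that is not on the exclude list would fall into the branch that logs "No action ...", which is the noise the extra condition avoids.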
> Automatic and Asynchronous Decommissioning Nodes Status Tracking
> ----------------------------------------------------------------
>
>                 Key: YARN-4676
>                 URL: https://issues.apache.org/jira/browse/YARN-4676
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Daniel Zhi
>            Assignee: Daniel Zhi
>              Labels: features
>         Attachments: GracefulDecommissionYarnNode.pdf, YARN-4676.004.patch,
>                      YARN-4676.005.patch, YARN-4676.006.patch, YARN-4676.007.patch,
>                      YARN-4676.008.patch
>
>
> DecommissioningNodeWatcher inside ResourceTrackingService tracks the status of
> DECOMMISSIONING nodes automatically and asynchronously after the
> client/admin has made the graceful decommission request. It tracks the status
> of each DECOMMISSIONING node to decide when, after all running containers on
> the node have completed, the node will be transitioned into the DECOMMISSIONED
> state. NodesListManager detects and handles include and exclude list changes
> to kick off decommission or recommission as necessary.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)