[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167915#comment-14167915 ]
Karthik Kambatla commented on YARN-2641:
----------------------------------------

I poked around on a cluster with 2 NMs. I submitted a sleep job with 4 mappers, each sleeping for 10 minutes; the mappers were assigned 2 to each node. After the 4 mappers had made some progress (11%), I decommissioned a node.

When I decommissioned the node with the AM, the AM died and the job restarted from scratch. When I decommissioned the node without the AM, the tasks were immediately rescheduled onto the active node (job progress dropped to 6% before going up again).

> improve node decommission latency in RM.
> ----------------------------------------
>
>                 Key: YARN-2641
>                 URL: https://issues.apache.org/jira/browse/YARN-2641
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.5.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments: YARN-2641.000.patch, YARN-2641.001.patch
>
>
> improve node decommission latency in RM.
> Currently, node decommission only happens after the RM receives a nodeHeartbeat
> from the Node Manager. The node heartbeat interval is configurable. The
> default value is 1 second.
> It would be better to do the decommission during RM refresh (NodesListManager)
> instead of during nodeHeartbeat (ResourceTrackerService).
> This can become a much more serious issue:
> After the RM is refreshed (refreshNodes), if the NM to be decommissioned is
> killed before it sends a heartbeat to the RM, the RMNode will never be
> decommissioned in the RM. The RMNode will only expire in the RM after
> "yarn.nm.liveness-monitor.expiry-interval-ms" (default value 10 minutes).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
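
For illustration only, the latency difference described in the issue can be sketched with a small toy model. This is not the actual ResourceManager code: ToyRM, simulate, the state names, and the timestamps are all hypothetical, and the only values taken from the issue are the 1-second default heartbeat interval and the 10-minute default expiry interval.

```python
from dataclasses import dataclass, field

# Hypothetical toy model; it does not use or mirror the real YARN classes.
@dataclass
class ToyRM:
    decommission_on_refresh: bool       # True models the behaviour proposed in the issue
    expiry_interval_ms: int = 600_000   # 10 minutes, the default quoted in the issue
    excluded: set = field(default_factory=set)
    state: dict = field(default_factory=dict)           # node -> toy state name
    last_heartbeat_ms: dict = field(default_factory=dict)

    def register(self, node, now_ms):
        self.state[node] = "RUNNING"
        self.last_heartbeat_ms[node] = now_ms

    def refresh_nodes(self, excluded):
        """Reload the exclude list; optionally decommission right away."""
        self.excluded = set(excluded)
        if self.decommission_on_refresh:
            for node in self.excluded:
                if self.state.get(node) == "RUNNING":
                    self.state[node] = "DECOMMISSIONED"

    def node_heartbeat(self, node, now_ms):
        """Heartbeat path: an excluded node is decommissioned only when it reports in."""
        self.last_heartbeat_ms[node] = now_ms
        if node in self.excluded and self.state.get(node) == "RUNNING":
            self.state[node] = "DECOMMISSIONED"

    def check_liveness(self, now_ms):
        """Expire running nodes that have been silent for the expiry interval."""
        for node, st in self.state.items():
            if st == "RUNNING" and now_ms - self.last_heartbeat_ms[node] >= self.expiry_interval_ms:
                self.state[node] = "EXPIRED"


def simulate(decommission_on_refresh, nm_killed_before_heartbeat):
    """Return (time_ms, state) at which node 'nm1' stops being RUNNING."""
    rm = ToyRM(decommission_on_refresh)
    rm.register("nm1", now_ms=0)
    rm.refresh_nodes({"nm1"})                  # admin refresh at t = 500 ms (assumed)
    if rm.state["nm1"] != "RUNNING":
        return 500, rm.state["nm1"]
    if not nm_killed_before_heartbeat:
        rm.node_heartbeat("nm1", now_ms=1000)  # next heartbeat, 1 s after registration
        return 1000, rm.state["nm1"]
    rm.check_liveness(now_ms=600_000)          # NM is dead; only the liveness monitor fires
    return 600_000, rm.state["nm1"]


if __name__ == "__main__":
    for on_refresh in (False, True):
        for killed in (False, True):
            print(on_refresh, killed, simulate(on_refresh, killed))
```

In this model, the heartbeat-driven path leaves a killed NM in the running state until the liveness monitor expires it, while the refresh-driven path marks it decommissioned as soon as the exclude list is reloaded.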