[
https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167985#comment-14167985
]
zhihai xu commented on YARN-2641:
---------------------------------
Sorry, I didn't describe the second scenario clearly: we first need to kill the
NM process, then call the refreshNodes CLI to put the node on the blacklist. For
the refreshNodes CLI to work correctly, we need to create a file, for example
"exclude_host.txt", that contains the name of the node to decommission, and then
set "yarn.resourcemanager.nodes.exclude-path" to the path of that file.
See the code in TestResourceTrackerService.java for the configuration and node
list file:
{code}
// Point the RM at the exclude file.
conf.set(YarnConfiguration.RM_NODES_EXCLUDE_FILE_PATH,
    hostFile.getAbsolutePath());

// Write one host name per line to the exclude file.
private void writeToHostsFile(String... hosts) throws IOException {
  if (!hostFile.exists()) {
    TEMP_DIR.mkdirs();
    hostFile.createNewFile();
  }
  FileOutputStream fStream = null;
  try {
    fStream = new FileOutputStream(hostFile);
    for (int i = 0; i < hosts.length; i++) {
      fStream.write(hosts[i].getBytes());
      fStream.write("\n".getBytes());
    }
  } finally {
    if (fStream != null) {
      IOUtils.closeStream(fStream);
      fStream = null;
    }
  }
}
{code}
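For trying this outside the test harness, here is a minimal standalone sketch of the same idea using only the JDK. It is not the code from TestResourceTrackerService.java: the class name ExcludeFileSketch, the method writeExcludeFile, and the host name host1.example.com are made up for illustration. It writes one host name per line and prints the value that "yarn.resourcemanager.nodes.exclude-path" would need to point to.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, JDK only; it does not touch any Hadoop classes.
class ExcludeFileSketch {
    // Property key quoted in the comment above.
    static final String EXCLUDE_PATH_KEY = "yarn.resourcemanager.nodes.exclude-path";

    // Writes one host name per line, replacing any existing content.
    static Path writeExcludeFile(Path file, List<String> hosts) throws IOException {
        Path parent = file.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent); // no-op if the directory already exists
        }
        // Default options create the file or truncate an existing one.
        return Files.write(file, hosts, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("yarn-exclude");
        // "host1.example.com" is a placeholder for the node to decommission.
        Path file = writeExcludeFile(dir.resolve("exclude_host.txt"),
                Arrays.asList("host1.example.com"));
        // The value the RM configuration would need for the exclude path.
        System.out.println(EXCLUDE_PATH_KEY + "=" + file.toAbsolutePath());
    }
}
```

Once the file exists and the RM configuration points to it, the refresh itself is triggered with "yarn rmadmin -refreshNodes".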
> improve node decommission latency in RM.
> ----------------------------------------
>
> Key: YARN-2641
> URL: https://issues.apache.org/jira/browse/YARN-2641
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 2.5.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-2641.000.patch, YARN-2641.001.patch
>
>
> Improve node decommission latency in RM.
> Currently, node decommission happens only after the RM receives a nodeHeartbeat
> from the Node Manager. The node heartbeat interval is configurable; the
> default value is 1 second.
> It would be better to do the decommission during RM refresh (NodesListManager)
> instead of during nodeHeartbeat (ResourceTrackerService).
> A much more serious issue is the following:
> After the RM is refreshed (refreshNodes), if the NM to be decommissioned is
> killed before it sends a heartbeat to the RM, the RMNode will never be
> decommissioned in the RM. The RMNode will only expire in the RM after
> "yarn.nm.liveness-monitor.expiry-interval-ms" (default value 10 minutes).