[
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131632#comment-15131632
]
Jian He commented on YARN-4635:
-------------------------------
bq. Some DISKS_FAILED statuses could happen because the failed container wrote
a disk to full. But the node could still have other directories available to
use. It could still launch normal containers, but it is not suitable to risk an
AM container on it.
In the current code, the DISKS_FAILED status is set when this condition is true:
{code}
if (!dirsHandler.areDisksHealthy()) {
  ret = ContainerExitStatus.DISKS_FAILED;
  throw new IOException("Most of the disks failed. "
      + dirsHandler.getDisksHealthReport(false));
}
{code}
The same check, {{dirsHandler.areDisksHealthy}}, is used by the DiskHealth monitor:
{code}
boolean isHealthy() {
  boolean scriptHealthStatus = (nodeHealthScriptRunner == null) ? true
      : nodeHealthScriptRunner.isHealthy();
  return scriptHealthStatus && dirsHandler.areDisksHealthy();
}
{code}
Essentially, if {{dirsHandler.areDisksHealthy()}} returns false, the node is
reported as unhealthy in the first place, which makes the RM remove the node.
The global blacklist is then not useful in practice because the node has
already been removed. Maybe I missed something; a unit test can prove this.
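The reasoning above can be illustrated with a small self-contained model. This is only a sketch: {{DiskCheckSketch}}, {{launchExitStatus}} and {{nodeIsHealthy}} are hypothetical stand-ins rather than NodeManager code, the constant values are illustrative, and the model ignores any timing gap between the launch-time check and the periodic health report.

```java
import java.util.function.BooleanSupplier;

// Hypothetical model of the two checks; none of these names are NodeManager classes.
class DiskCheckSketch {
    // Stand-ins for ContainerExitStatus values; the numbers are illustrative.
    static final int SUCCESS = 0;
    static final int DISKS_FAILED = -101;

    // Launch path: DISKS_FAILED is produced only when the disk check fails.
    static int launchExitStatus(BooleanSupplier areDisksHealthy) {
        return areDisksHealthy.getAsBoolean() ? SUCCESS : DISKS_FAILED;
    }

    // Health path: the script result AND the very same disk check.
    static boolean nodeIsHealthy(boolean scriptHealthy, BooleanSupplier areDisksHealthy) {
        return scriptHealthy && areDisksHealthy.getAsBoolean();
    }

    public static void main(String[] args) {
        for (boolean disksOk : new boolean[] {true, false}) {
            BooleanSupplier check = () -> disksOk;
            int exit = launchExitStatus(check);
            boolean healthy = nodeIsHealthy(true, check);
            System.out.println("disksOk=" + disksOk + " exitStatus=" + exit
                + " nodeHealthy=" + healthy);
        }
    }
}
```

In this model, every input that yields DISKS_FAILED also yields an unhealthy node, which is the overlap a unit test against the real classes would need to confirm or refute.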
bq. If so, it means the job submitter has to understand how many nodes the
current cluster has
Sorry, I don't understand why the job submitter needs to know the number of
nodes. What I meant is that, right now, a boolean flag (false) is used to
indicate that this feature is disabled. Alternatively, a threshold of 0 can
achieve the same result (with a logic change on the RM side). I said this
because I feel the API may look simpler, and we would not need a separate nested
AMBlackListingRequest class. Having the threshold set in submissionContext would
be enough. But I don't have a strong opinion on this. The current way is OK too.
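To make the two API shapes concrete, here is a minimal sketch. All names ({{BlacklistConfigSketch}}, {{NestedRequest}}, {{enabledByFlag}}, {{enabledByThreshold}}) are hypothetical and do not reflect the actual YARN records; the rule that a threshold of 0 means "disabled" is the assumed RM-side logic change mentioned above.

```java
// Hypothetical shapes for comparison; these are not the real YARN API records.
class BlacklistConfigSketch {
    // Shape A: a nested request carrying an explicit flag plus a threshold.
    static final class NestedRequest {
        final boolean enabled;
        final float threshold;

        NestedRequest(boolean enabled, float threshold) {
            this.enabled = enabled;
            this.threshold = threshold;
        }
    }

    // Shape A: the feature is on only when the flag says so.
    static boolean enabledByFlag(NestedRequest request) {
        return request != null && request.enabled;
    }

    // Shape B: threshold only; assumed rule is that 0 (or less) means disabled.
    static boolean enabledByThreshold(float threshold) {
        return threshold > 0f;
    }

    public static void main(String[] args) {
        System.out.println("flag=false      -> " + enabledByFlag(new NestedRequest(false, 0.8f)));
        System.out.println("threshold=0     -> " + enabledByThreshold(0f));
        System.out.println("threshold=0.8   -> " + enabledByThreshold(0.8f));
    }
}
```

Under this assumption, the single threshold field carries both pieces of information, at the cost of making 0 a special value.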
> Add global blacklist tracking for AM container failure.
> -------------------------------------------------------
>
> Key: YARN-4635
> URL: https://issues.apache.org/jira/browse/YARN-4635
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Critical
> Attachments: YARN-4635-v2.patch, YARN-4635.patch
>
>
> We need a global blacklist in addition to each app's blacklist to track AM
> container failures that have a global effect. That means we need to
> differentiate whether a non-successful ContainerExitStatus is caused by the
> NM or is more related to the app.
> For more details, please refer to the document in YARN-4576.