[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

Jian He (JIRA) Wed, 03 Feb 2016 19:07:27 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131632#comment-15131632
 ]


Jian He commented on YARN-4635:
-------------------------------

bq. Some DISKS_FAILED could happen because the failed container wrote the disk 
full. But the node could still have other directories available to use. It 
could still launch normal containers, but is not suitable for risking an AM 
container.
In the current code, the DISKS_FAILED status is set when this condition is true:
{code}
      if (!dirsHandler.areDisksHealthy()) {
        ret = ContainerExitStatus.DISKS_FAILED;
        throw new IOException("Most of the disks failed. "
            + dirsHandler.getDisksHealthReport(false));
      }
{code}
The same check, {{dirsHandler.areDisksHealthy}}, is used by the DiskHealth monitor:
{code}
  boolean isHealthy() {
    boolean scriptHealthStatus = (nodeHealthScriptRunner == null) ? true
        : nodeHealthScriptRunner.isHealthy();
    return scriptHealthStatus && dirsHandler.areDisksHealthy();
  }
{code}
Essentially, if {{dirsHandler.areDisksHealthy()}} returns false, the node will 
be reported as unhealthy in the first place, which makes the RM remove the 
node. The global blacklist then is not useful in practice because the node has 
already been removed. Maybe I missed something; a unit test can prove this.
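
As a rough illustration of the argument above, here is a minimal, self-contained model. The names {{DiskCheckerSketch}} and {{NodeModelSketch}} are hypothetical stand-ins, not the real NM classes, and the model ignores any delay between the launch failure and the health report reaching the RM. Under those assumptions, whenever the launch path would set DISKS_FAILED, the health check is false as well:

```java
// Hypothetical stand-in for the disk handler; not the real YARN class.
interface DiskCheckerSketch {
    boolean areDisksHealthy();
}

// Hypothetical model of a node that shares one disk check between
// the container launch path and the node health report.
class NodeModelSketch {
    private final DiskCheckerSketch dirsHandler;
    // null models "no health script configured"
    private final Boolean scriptHealthy;

    NodeModelSketch(DiskCheckerSketch dirsHandler, Boolean scriptHealthy) {
        this.dirsHandler = dirsHandler;
        this.scriptHealthy = scriptHealthy;
    }

    // Mirrors the launch-time condition quoted above.
    boolean launchWouldFailWithDisksFailed() {
        return !dirsHandler.areDisksHealthy();
    }

    // Mirrors the isHealthy() method quoted above.
    boolean isHealthy() {
        boolean scriptHealthStatus = (scriptHealthy == null) ? true : scriptHealthy;
        return scriptHealthStatus && dirsHandler.areDisksHealthy();
    }
}
```

In this model there is no state where a launch fails with DISKS_FAILED while the node still reports healthy. A real unit test against the NM code would be needed to confirm the same holds there.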

bq. If so, it means the job submitter has to understand how many nodes the 
current cluster has 
Sorry, I don't understand why the job submitter needs to understand the number 
of nodes. What I meant is that, right now, a boolean flag (false) is used to 
indicate that this feature is disabled. Alternatively, a threshold of 0 can 
achieve the same result (with a logic change on the RM side). I said this 
because I feel the API may look simpler, and we would not need a separate 
nested AMBlackListingRequest class. Having the threshold set in the 
submissionContext would be enough. But I don't have a strong opinion on this. 
The current way is OK too.
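
To make the alternative concrete, here is a minimal sketch of what I have in mind. The class and method names are hypothetical and are not the actual API in the patch; I also assume here that the threshold is a fraction between 0 and 1, which is only an assumption for the example:

```java
// Hypothetical sketch of a submission context that carries only a threshold.
// Not the real ApplicationSubmissionContext API.
class SubmissionContextSketch {
    // 0 models "AM blacklisting disabled"; no separate boolean flag is kept.
    private float amBlacklistThreshold = 0.0f;

    void setAmBlacklistThreshold(float threshold) {
        // Assumption for this sketch: the threshold is a fraction in [0, 1].
        if (threshold < 0.0f || threshold > 1.0f) {
            throw new IllegalArgumentException("threshold must be in [0, 1]");
        }
        this.amBlacklistThreshold = threshold;
    }

    float getAmBlacklistThreshold() {
        return amBlacklistThreshold;
    }

    // The RM side would derive "enabled" from the threshold alone.
    boolean isAmBlacklistingEnabled() {
        return amBlacklistThreshold > 0.0f;
    }
}
```

The trade-off is that "disabled" is then encoded in a magic value instead of an explicit flag, which is why the current approach is also reasonable.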

> Add global blacklist tracking for AM container failure.
> -------------------------------------------------------
>
>                 Key: YARN-4635
>                 URL: https://issues.apache.org/jira/browse/YARN-4635
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-4635-v2.patch, YARN-4635.patch
>
>
> We need a global blacklist, in addition to each app’s blacklist, to track AM 
> container failures that have a global effect. That means we need to 
> differentiate whether a non-successful ContainerExitStatus is caused by the 
> NM or is more related to the app. 
> For more details, please refer to the document in YARN-4576.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
