[ https://issues.apache.org/jira/browse/YARN-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14875727#comment-14875727 ]
Jason Lowe commented on YARN-4181:
----------------------------------
Dup of YARN-2005?
> node blacklist for AM launching
> -------------------------------
>
> Key: YARN-4181
> URL: https://issues.apache.org/jira/browse/YARN-4181
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Hong Zhiguo
> Assignee: Hong Zhiguo
> Priority: Minor
>
> In some cases a node goes bad and most containers launched on it fail,
> including AM containers.
> As a result, this node has more available resources than the other nodes in
> the cluster.
> The application whose AM is failing has a minShareRatio of zero. With the
> fair scheduler, this node is always ranked first, and the unlucky application
> is likely ranked first as well. The result is that attempts of this
> application fail again and again on the same node.
> We should avoid such a deadlock situation.
> Solution 1: The NM could track the failure rate of its containers. If the
> rate is high, the NM marks itself as unhealthy for a period. But we should be
> careful not to let a single buggy application turn all nodes unhealthy. Maybe
> track the container failure rate separately for different applications.
> Solution 2: Have the AMLauncher maintain an application-level blacklist, in
> addition to the existing blacklist maintained by the AM.
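
For illustration only, the per-application variant of Solution 1 might look roughly like the sketch below. It is not taken from any YARN patch: the class name ContainerFailureTracker, its methods, and the three thresholds are all hypothetical, and a real NM would also need a time window and a recovery path. The idea is that a node reports itself unhealthy only when several distinct applications see a high launch failure rate, so one buggy application cannot take the node out on its own.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not an existing YARN class.
class ContainerFailureTracker {
    // Launch outcomes observed on this node for one application.
    private static final class Counts {
        int launched;
        int failed;
    }

    private final Map<String, Counts> perApp = new HashMap<>();
    private final double failureRateThreshold; // e.g. 0.8 (assumed value)
    private final int minFailingApps;          // distinct apps that must be failing
    private final int minSamplesPerApp;        // ignore apps with too few launches

    ContainerFailureTracker(double failureRateThreshold,
                            int minFailingApps,
                            int minSamplesPerApp) {
        this.failureRateThreshold = failureRateThreshold;
        this.minFailingApps = minFailingApps;
        this.minSamplesPerApp = minSamplesPerApp;
    }

    // Record the outcome of one container launch for the given application.
    void recordLaunch(String appId, boolean failed) {
        Counts c = perApp.computeIfAbsent(appId, k -> new Counts());
        c.launched++;
        if (failed) {
            c.failed++;
        }
    }

    // True only if enough distinct applications each exceed the failure rate.
    boolean shouldReportUnhealthy() {
        int failingApps = 0;
        for (Counts c : perApp.values()) {
            if (c.launched >= minSamplesPerApp
                    && (double) c.failed / c.launched >= failureRateThreshold) {
                failingApps++;
            }
        }
        return failingApps >= minFailingApps;
    }
}
```

With thresholds such as (0.8, 2, 3), ten failed launches from a single application leave the node healthy, while a second application failing three launches in a row tips it over.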