[ 
https://issues.apache.org/jira/browse/YARN-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14875727#comment-14875727
 ] 

Jason Lowe commented on YARN-4181:
----------------------------------

Dup of YARN-2005?

> node blacklist for AM launching
> -------------------------------
>
>                 Key: YARN-4181
>                 URL: https://issues.apache.org/jira/browse/YARN-4181
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Hong Zhiguo
>            Assignee: Hong Zhiguo
>            Priority: Minor
>
> In some cases a node becomes faulty and most containers launched on it fail, 
> including AM containers.
> As a result, this node has more available resources than other nodes in the 
> cluster. The Application whose AM keeps failing has a minShareRatio of zero. 
> With the fair scheduler, this node is always ranked first, and the unlucky 
> Application is also likely to be ranked first. The result: attempts of this 
> application fail again and again on the same node.
> We should avoid such a deadlock situation.
> Solution 1: The NM could track the failure rate of its containers. If the 
> rate is high, the NM marks itself unhealthy for a period. But we should be 
> careful not to let one buggy Application turn all nodes unhealthy. Maybe 
> track the failure rate of containers separately for each Application.
> Solution 2: Have an Application-level blacklist maintained by the 
> AMLauncher, in addition to the existing blacklist maintained by the AM.
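
The two proposed solutions in the description above could be sketched roughly
as follows. This is only an illustration, not YARN code: the class names
NodeFailureTracker and AmLaunchBlacklist, their methods, and both thresholds
are hypothetical, and plain failure counts stand in for the "failure rate"
the description mentions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Solution 1 sketch (hypothetical, NM side): failures are counted per
 * application so that a single buggy application cannot make the node
 * report itself unhealthy.
 */
class NodeFailureTracker {
    private final int minFailuresPerApp; // assumed threshold, not a YARN setting
    private final int minFailingApps;    // assumed threshold, not a YARN setting
    private final Map<String, Integer> failuresByApp = new HashMap<>();

    NodeFailureTracker(int minFailuresPerApp, int minFailingApps) {
        this.minFailuresPerApp = minFailuresPerApp;
        this.minFailingApps = minFailingApps;
    }

    void recordContainerFailure(String appId) {
        failuresByApp.merge(appId, 1, Integer::sum);
    }

    /** True only when enough distinct applications have each failed repeatedly. */
    boolean shouldReportUnhealthy() {
        int failingApps = 0;
        for (int failures : failuresByApp.values()) {
            if (failures >= minFailuresPerApp) {
                failingApps++;
            }
        }
        return failingApps >= minFailingApps;
    }

    /** Clears the counters, e.g. when the unhealthy period ends. */
    void reset() {
        failuresByApp.clear();
    }
}

/**
 * Solution 2 sketch (hypothetical, RM side): per-application set of nodes
 * on which an AM launch has already failed.
 */
class AmLaunchBlacklist {
    private final Map<String, Set<String>> badNodesByApp = new HashMap<>();

    void recordAmLaunchFailure(String appId, String nodeId) {
        badNodesByApp.computeIfAbsent(appId, k -> new HashSet<>()).add(nodeId);
    }

    /** The scheduler would skip nodeId for this application's next AM attempt. */
    boolean isBlacklisted(String appId, String nodeId) {
        return badNodesByApp
            .getOrDefault(appId, Collections.<String>emptySet())
            .contains(nodeId);
    }

    /** Drops the application's entries once it finishes. */
    void removeApplication(String appId) {
        badNodesByApp.remove(appId);
    }
}
```

In this sketch the blacklist is keyed by application, so a node that fails
one application's AM stays eligible for every other application. The tracker
requires failures from several distinct applications before the node reports
itself unhealthy.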



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)