[ https://issues.apache.org/jira/browse/YARN-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14875727#comment-14875727 ]
Jason Lowe commented on YARN-4181:
----------------------------------

Dup of YARN-2005?

> node blacklist for AM launching
> -------------------------------
>
>                 Key: YARN-4181
>                 URL: https://issues.apache.org/jira/browse/YARN-4181
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Hong Zhiguo
>            Assignee: Hong Zhiguo
>            Priority: Minor
>
> In some cases, a node becomes problematic and most container launches on
> this node fail, including AM container launches.
> As a result, this node has more available resources than other nodes in the
> cluster. The Application whose AM is failing has zero minShareRatio. With the
> fair scheduler, this node is always ranked first, and the unfortunate
> Application is also likely to be ranked first. The result: attempts of this
> application fail again and again on the same node.
> We should avoid such a deadlock situation.
> Solution 1: The NM could track the failure rate of its containers. If the
> rate is high, the NM marks itself unhealthy for a period. But we should be
> careful not to let a buggy Application turn all nodes unhealthy. Maybe use
> the failure rate of containers across different Applications.
> Solution 2: Add an Application-level blacklist in AMLauncher, in addition to
> the existing blacklist maintained by the AM.
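
For illustration only, Solution 1 from the quoted description could be sketched roughly as below. This is not YARN code: the class ContainerFailureMonitor, its methods, and the three thresholds are all hypothetical names chosen for the example. It reads "failure rate of containers for different Applications" as: report the node unhealthy only when several distinct applications each show a high launch failure rate, so that a single buggy Application cannot trip the check on its own.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a per-node container failure monitor.
 * Not part of the YARN code base; all names and thresholds are assumptions.
 */
class ContainerFailureMonitor {
    private final double failureRateThreshold; // per-app failure rate considered "high"
    private final int minLaunchesPerApp;       // ignore apps with too few samples
    private final int minFailingApps;          // distinct failing apps needed to trip

    // appId -> {total launches, failed launches}
    private final Map<String, int[]> statsByApp = new HashMap<>();

    ContainerFailureMonitor(double failureRateThreshold,
                            int minLaunchesPerApp,
                            int minFailingApps) {
        this.failureRateThreshold = failureRateThreshold;
        this.minLaunchesPerApp = minLaunchesPerApp;
        this.minFailingApps = minFailingApps;
    }

    /** Record the outcome of one container launch for the given application. */
    void recordLaunch(String appId, boolean succeeded) {
        int[] stats = statsByApp.computeIfAbsent(appId, k -> new int[2]);
        stats[0]++;
        if (!succeeded) {
            stats[1]++;
        }
    }

    /**
     * True when at least minFailingApps distinct applications each have
     * enough samples and a failure rate at or above the threshold.
     */
    boolean shouldReportUnhealthy() {
        int failingApps = 0;
        for (int[] stats : statsByApp.values()) {
            boolean enoughSamples = stats[0] >= minLaunchesPerApp;
            if (enoughSamples
                    && (double) stats[1] / stats[0] >= failureRateThreshold) {
                failingApps++;
            }
        }
        return failingApps >= minFailingApps;
    }

    /** Clear all counters, e.g. once the unhealthy period has elapsed. */
    void reset() {
        statsByApp.clear();
    }
}
```

The "for a period" part of the proposal is left to the caller here: whatever component consumes shouldReportUnhealthy() would presumably hold the node unhealthy for some interval and then call reset(). How that would hook into the NM's actual health reporting is not covered by this sketch.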