Hong Zhiguo created YARN-4181:
---------------------------------
Summary: node blacklist for AM launching
Key: YARN-4181
URL: https://issues.apache.org/jira/browse/YARN-4181
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Reporter: Hong Zhiguo
Assignee: Hong Zhiguo
Priority: Minor
In some cases, a node becomes faulty and most containers launched on it fail,
including AM containers.
Because its containers keep failing, this node has more available resources
than other nodes in the cluster. The application whose AM is failing has a
minShareRatio of zero. With the fair scheduler, this node is always ranked
first, and the unlucky application is likely ranked first as well. The result:
attempts of this application fail again and again on the same node.
Solution 1: The NM could track the failure rate of its containers. If the rate
is high, the NM marks itself unhealthy for a period. But we should be careful
not to let a single buggy application turn all nodes unhealthy. Maybe use the
failure rate of containers from different applications.
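A minimal sketch of how Solution 1 could guard against a single buggy
application, assuming failures are counted per application. The class
NodeFailureTracker, its methods, and both thresholds are hypothetical and are
not part of the YARN NodeManager API; the "for a period" timing is left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not an existing YARN class. */
class NodeFailureTracker {
    private final double failureRateThreshold; // e.g. 0.5 = half of launches fail
    private final int minFailingApps;          // distinct applications required
    // Per application: [0] = container launches seen, [1] = failed launches.
    private final Map<String, int[]> countsByApp = new HashMap<>();

    NodeFailureTracker(double failureRateThreshold, int minFailingApps) {
        this.failureRateThreshold = failureRateThreshold;
        this.minFailingApps = minFailingApps;
    }

    /** Records the outcome of one container launch on this node. */
    void recordLaunch(String appId, boolean succeeded) {
        int[] counts = countsByApp.computeIfAbsent(appId, k -> new int[2]);
        counts[0]++;
        if (!succeeded) {
            counts[1]++;
        }
    }

    /**
     * True only when enough distinct applications each see a high failure
     * rate, so one misbehaving application cannot mark the node unhealthy.
     */
    boolean shouldMarkUnhealthy() {
        int failingApps = 0;
        for (int[] counts : countsByApp.values()) {
            // counts[0] is at least 1 because entries are created on first launch.
            if ((double) counts[1] / counts[0] >= failureRateThreshold) {
                failingApps++;
            }
        }
        return failingApps >= minFailingApps;
    }
}
```

With minFailingApps set to 2, repeated failures from one application leave the
node healthy; it is flagged only once a second application also fails there.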
Solution 2: Maintain an application-level blacklist in the AMLauncher, in
addition to the existing blacklist maintained by the AM.
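Solution 2 could be sketched roughly as follows, assuming the RM records each
failed AM launch per application and node and skips a node for that
application once a failure limit is reached. AmLaunchBlacklist, its methods,
and maxFailuresPerNode are hypothetical names, not existing YARN classes or
configuration keys.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-application blacklist for AM launches; illustration only. */
class AmLaunchBlacklist {
    private final int maxFailuresPerNode;
    // appId -> (nodeId -> number of failed AM launches on that node)
    private final Map<String, Map<String, Integer>> failures = new HashMap<>();

    AmLaunchBlacklist(int maxFailuresPerNode) {
        this.maxFailuresPerNode = maxFailuresPerNode;
    }

    /** Called when an AM container of appId fails to launch on nodeId. */
    void recordAmLaunchFailure(String appId, String nodeId) {
        failures.computeIfAbsent(appId, k -> new HashMap<>())
                .merge(nodeId, 1, Integer::sum);
    }

    /** True if the next AM attempt of appId should avoid nodeId. */
    boolean isBlacklisted(String appId, String nodeId) {
        Map<String, Integer> perNode = failures.get(appId);
        return perNode != null
                && perNode.getOrDefault(nodeId, 0) >= maxFailuresPerNode;
    }

    /** Drops all state for an application once it finishes. */
    void removeApplication(String appId) {
        failures.remove(appId);
    }
}
```

Because the blacklist is keyed by application, a node that is bad for one
application remains available to others, unlike marking the whole node
unhealthy.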
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)