Junping Du created YARN-4576:
--------------------------------
Summary: Extend blacklist mechanism to protect AM failed multiple
times on failure nodes
Key: YARN-4576
URL: https://issues.apache.org/jira/browse/YARN-4576
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
The current YARN blacklist mechanism tracks bad nodes on the AM side: if the AM
fails several times to launch containers on a specific node, it blacklists that
node in its future resource requests. This mechanism works fine for normal
containers. However, from what we have observed on clusters: if an AM fails to
launch on such a problematic node, the RM can pick the same node for the next
AM attempts again and again, which causes the application to fail when the
other, functional nodes are busy. Normally, the custom health-checker script
cannot be sensitive enough to mark a node as unhealthy after only one or two
container launch failures. On the RM side, however, we can blacklist these
nodes for AM launches for a certain period of time.
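
To make the proposal concrete, here is a minimal sketch of an RM-side, time-limited
blacklist for AM launches. It is only an illustration under assumptions: the class
name AmLaunchBlacklist, its methods, the failure threshold and the blacklist
duration are all hypothetical and are not existing YARN APIs or configuration.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a time-limited blacklist for AM launches.
 * Not actual ResourceManager code; all names are assumptions.
 */
class AmLaunchBlacklist {
    private final int failureThreshold;   // AM launch failures before blacklisting
    private final long blacklistMillis;   // how long a node stays blacklisted
    private final Map<String, Integer> failureCounts = new HashMap<>();
    private final Map<String, Long> blacklistedUntil = new HashMap<>();

    AmLaunchBlacklist(int failureThreshold, long blacklistMillis) {
        this.failureThreshold = failureThreshold;
        this.blacklistMillis = blacklistMillis;
    }

    /** Record one failed AM launch on the given node at time nowMillis. */
    void recordAmLaunchFailure(String node, long nowMillis) {
        int failures = failureCounts.merge(node, 1, Integer::sum);
        if (failures >= failureThreshold) {
            // Blacklist the node for a fixed period and reset its counter.
            blacklistedUntil.put(node, nowMillis + blacklistMillis);
            failureCounts.remove(node);
        }
    }

    /** True if the node should be skipped when placing an AM at nowMillis. */
    boolean isBlacklisted(String node, long nowMillis) {
        Long until = blacklistedUntil.get(node);
        if (until == null) {
            return false;
        }
        if (nowMillis >= until) {
            // The blacklist period has expired; the node is eligible again.
            blacklistedUntil.remove(node);
            return false;
        }
        return true;
    }
}
```

In this sketch the scheduler would call isBlacklisted(node, now) before placing an
AM container and recordAmLaunchFailure(node, now) whenever an AM attempt fails on
a node. Normal containers are unaffected, and the expiry lets a node that has
recovered host AMs again without operator intervention.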
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)