[ https://issues.apache.org/jira/browse/YARN-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Junping Du updated YARN-4576:
-----------------------------
Summary: Pluggable blacklist/whitelist policies in launching AM to protect
AM failed multiple times on problematic nodes (was: Extend blacklist mechanism
to protect AM failed multiple times on failure nodes)
> Pluggable blacklist/whitelist policies in launching AM to protect AM failed
> multiple times on problematic nodes
> ---------------------------------------------------------------------------------------------------------------
>
> Key: YARN-4576
> URL: https://issues.apache.org/jira/browse/YARN-4576
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Critical
>
> The current YARN blacklist mechanism tracks bad nodes from the AM side: if
> the AM repeatedly fails to launch containers on a specific node, it
> blacklists that node in future resource requests. This mechanism works fine
> for normal containers. However, from our observations on several clusters, if
> an AM fails to launch on such a problematic node, the RM can pick the same
> node for the next AM attempts again and again, which causes the application
> to fail when the other, functional nodes are busy. Normally, the customized
> health checker script cannot be sensitive enough to mark a node as unhealthy
> when only one or two container launches fail. On the RM side, however, we
> can blacklist such nodes for AM launches for a certain time if AM launches
> have failed on them before.
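
For illustration only, the RM-side idea in the description above could be sketched roughly as follows. This is not code from the YARN-4576 patch: the class name AmLaunchBlacklist, its methods, the failure threshold, and the blacklist duration are all hypothetical, and the caller passes in the current time so the behavior is easy to check.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch of a time-limited blacklist for AM launches.
 * None of these names come from YARN; they are assumptions for illustration.
 */
public class AmLaunchBlacklist {
    // Assumed policy knobs: failures needed before blacklisting, and for how long.
    private final int failureThreshold;
    private final long blacklistMillis;

    // AM launch failures seen per node since it was last blacklisted.
    private final Map<String, Integer> failures = new HashMap<>();
    // Node -> timestamp (ms) at which its blacklist entry expires.
    private final Map<String, Long> blacklistedUntil = new HashMap<>();

    public AmLaunchBlacklist(int failureThreshold, long blacklistMillis) {
        this.failureThreshold = failureThreshold;
        this.blacklistMillis = blacklistMillis;
    }

    /** Record that an AM attempt failed to launch on the given node. */
    public void recordAmLaunchFailure(String node, long nowMillis) {
        int count = failures.merge(node, 1, Integer::sum);
        if (count >= failureThreshold) {
            // Blacklist the node for a fixed window and reset its counter.
            blacklistedUntil.put(node, nowMillis + blacklistMillis);
            failures.remove(node);
        }
    }

    /** True while the node is inside its blacklist window. */
    public boolean isBlacklisted(String node, long nowMillis) {
        Long until = blacklistedUntil.get(node);
        if (until == null) {
            return false;
        }
        if (nowMillis >= until) {
            // Window expired: the node becomes eligible for AM launches again.
            blacklistedUntil.remove(node);
            return false;
        }
        return true;
    }

    /** Nodes that may still be considered for the next AM attempt. */
    public Set<String> candidateNodes(Set<String> allNodes, long nowMillis) {
        Set<String> result = new HashSet<>();
        for (String node : allNodes) {
            if (!isBlacklisted(node, nowMillis)) {
                result.add(node);
            }
        }
        return result;
    }
}
```

Expiring entries after a fixed window reflects the "for a certain time" wording in the description, so a node that recovers is not excluded from AM launches forever. A pluggable policy, as the new summary suggests, would presumably hide logic like this behind an interface.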