[ https://issues.apache.org/jira/browse/YARN-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15101917#comment-15101917 ]
Junping Du commented on YARN-4576:
----------------------------------
NOTE: Per discussion in YARN-4389, once we have a global blacklist we also
need to check during RM start that a user-configured value for
{{blacklistDisableFailureThreshold}} is valid (not negative and <= 1.0f).
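
A minimal sketch of such a check, assuming a plain float has already been read from the configuration. The class and method names below are hypothetical and are not existing YARN APIs; where this would hook into RM start is left open.

```java
// Hypothetical helper, not part of the YARN code base.
final class BlacklistThresholdCheck {

  private BlacklistThresholdCheck() {
  }

  /**
   * Returns the threshold unchanged if it lies in [0.0, 1.0];
   * otherwise fails fast so that a bad value is caught at startup.
   */
  static float validateThreshold(float threshold) {
    // NaN fails every comparison, so it has to be rejected explicitly.
    if (Float.isNaN(threshold) || threshold < 0.0f || threshold > 1.0f) {
      throw new IllegalArgumentException(
          "blacklistDisableFailureThreshold must be in [0.0, 1.0], got "
              + threshold);
    }
    return threshold;
  }
}
```

Throwing here is only one option; logging a warning and falling back to a default value would be another, and the discussion so far does not settle which is preferred.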
> Pluggable blacklist/whitelist policies in launching AM
> ------------------------------------------------------
>
> Key: YARN-4576
> URL: https://issues.apache.org/jira/browse/YARN-4576
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Critical
>
> Before YARN-2005, the YARN blacklist mechanism relied on the AM to track bad
> nodes: if the AM repeatedly failed to launch containers on a specific node,
> it would blacklist that node in future resource requests. This mechanism
> works fine for normal containers. However, from what we have observed on
> several clusters, if the AM itself fails to launch on a problematic node, the
> RM may pick that same node for the next AM attempts again and again, which
> causes the application to fail when the other, functional nodes are busy. In
> the normal case, the custom health checker script cannot be so sensitive that
> it marks a node as unhealthy after only one or two container launch failures.
> After YARN-2005, each RMApp can have a BlacklistManager, so nodes on which AM
> attempts for a specific application have previously failed get blacklisted
> for that application. To avoid the risk of all nodes being blacklisted by the
> BlacklistManager, a disable-failure-threshold was introduced to stop adding
> more nodes to the blacklist once a certain ratio has been reached.
> There are already some enhancements to this AM blacklist mechanism:
> YARN-4284 addresses the wider case of AM container launch failures, and
> YARN-4389 tries to make the configuration settings changeable per app to
> meet app-specific requirements. However, there are still several gaps to
> close in order to cover more scenarios:
> 1. We may need a global blacklist instead of each app maintaining a separate
> one. The reason is that an AM is more likely to fail on nodes where other AMs
> have failed before. A quick example: in a busy cluster, all nodes are busy
> except two problematic nodes, node a and node b. app1 has already been
> submitted and has failed two AM attempts, on a and b. app2 and other apps
> should wait for the busy nodes rather than waste attempts on these two
> problematic nodes.
> 2. If AM container failure is treated as a global event rather than an
> app-specific issue, the blacklist should not be permanent but should apply
> only within a specific time window.
> 3. We could have user-defined blacklist policies to address more possible
> cases and scenarios, so it is reasonable to make the blacklist policy
> pluggable.
> 4. For some test scenarios, we could have a whitelist mechanism for AM
> launching.
> 5. Some minor issues: it appears that NM reconnect does not refresh the
> blacklist so far.
> Will try to address all issues here.
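
Items 1 and 2 of the description, combined with the disable-failure-threshold, could be sketched roughly as below. This is an illustration only, under stated assumptions: the class and method names are hypothetical and not YARN APIs, the threshold is read as "stop adding new nodes once the blacklisted fraction reaches the ratio", and time is passed in explicitly to keep the example deterministic.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a cluster-wide, time-windowed AM blacklist.
final class TimeWindowBlacklistSketch {
  private final int totalNodes;          // cluster size, assumed fixed here
  private final double disableThreshold; // ratio in [0.0, 1.0]
  private final long windowMillis;       // how long an entry stays valid
  // node name -> time of the most recent AM launch failure on that node
  private final Map<String, Long> lastFailure = new HashMap<>();

  TimeWindowBlacklistSketch(int totalNodes, double disableThreshold,
      long windowMillis) {
    this.totalNodes = totalNodes;
    this.disableThreshold = disableThreshold;
    this.windowMillis = windowMillis;
  }

  /** Drops entries whose failure is older than the time window. */
  private void expire(long nowMillis) {
    lastFailure.values().removeIf(t -> nowMillis - t >= windowMillis);
  }

  /**
   * Records an AM launch failure seen by any application.
   * Returns false if the node was not added because the blacklist
   * has already reached the configured ratio of the cluster.
   */
  boolean recordAmFailure(String node, long nowMillis) {
    expire(nowMillis);
    if (!lastFailure.containsKey(node)
        && lastFailure.size() >= disableThreshold * totalNodes) {
      return false;
    }
    lastFailure.put(node, nowMillis);
    return true;
  }

  /** Nodes that no application should currently use for AM launch. */
  Set<String> blacklistedNodes(long nowMillis) {
    expire(nowMillis);
    return new HashSet<>(lastFailure.keySet());
  }
}
```

With 4 nodes, a ratio of 0.5 and a 1000 ms window, failures on a and b are recorded, a third node is refused while both entries are live, and a drops off the list once its entry is 1000 ms old. A pluggable policy (item 3) would presumably sit behind an interface exposing roughly these two operations, but that design is still open.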