[jira] [Commented] (YARN-4576) Pluggable blacklist/whitelist policies in launching AM

Junping Du (JIRA) Fri, 15 Jan 2016 07:29:14 -0800

    [ https://issues.apache.org/jira/browse/YARN-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15101917#comment-15101917 ]


Junping Du commented on YARN-4576:
----------------------------------

NOTE: Per the discussion in YARN-4389, when a global blacklist is in use we 
also need to validate the user-configured value of 
{{blacklistDisableFailureThreshold}} during RM startup: it must be 
non-negative and no greater than 1.0f.
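
A minimal sketch of such a startup check, assuming the threshold has already 
been read from the configuration as a float. The class name, method name, and 
the choice of IllegalArgumentException are hypothetical and are not taken 
from the YARN code base:

```java
// Hypothetical helper; names are illustrative, not part of YARN.
public final class BlacklistThresholdValidator {
    private BlacklistThresholdValidator() {}

    /**
     * Returns the threshold unchanged if it lies in [0.0, 1.0],
     * otherwise throws so that startup fails fast.
     */
    public static float validate(float threshold) {
        // Written as a negated range check so that NaN is rejected too:
        // NaN fails both comparisons.
        if (!(threshold >= 0.0f && threshold <= 1.0f)) {
            throw new IllegalArgumentException(
                "Invalid blacklist disable-failure threshold: " + threshold
                + " (expected a value in [0.0, 1.0])");
        }
        return threshold;
    }
}
```

Failing at RM start, rather than at the first AM launch failure, would 
surface a misconfiguration before any application is affected.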

> Pluggable blacklist/whitelist policies in launching AM
> ------------------------------------------------------
>
>                 Key: YARN-4576
>                 URL: https://issues.apache.org/jira/browse/YARN-4576
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>
> Before YARN-2005, the YARN blacklist mechanism relied on the AM to track bad 
> nodes: if the AM's attempts to launch containers on a specific node failed 
> several times, the AM would blacklist that node in its future resource 
> requests. This mechanism works fine for normal containers. However, from our 
> observations on several clusters: if an AM fails to launch on a problematic 
> node, the RM can pick the same node for the next AM attempts again and 
> again, which causes the application to fail when the other, functional nodes 
> are busy. Normally, the custom health checker script cannot be sensitive 
> enough to mark a node as unhealthy when only one or two container launches 
> fail. 
> After YARN-2005, each RMapp can have a BlacklistManager, so nodes on which 
> AM attempts for a specific application have already failed get blacklisted. 
> To avoid the risk of the BlacklistManager blacklisting all nodes, a 
> disable-failure-threshold stops adding more nodes to the blacklist once a 
> certain ratio has been reached. 
> There are already some enhancements to this AM blacklist mechanism: 
> YARN-4284 addresses the wider case of AM container launch failures, and 
> YARN-4389 tries to make the configuration settings changeable per app to 
> meet app-specific requirements. However, there are still several gaps to 
> close in order to cover more scenarios:
> 1. We may need a global blacklist instead of each app maintaining a separate 
> one. The reason is that an AM is more likely to fail on a node where other 
> AMs have failed before. A quick example: in a busy cluster, all nodes are 
> busy except two problematic nodes, node a and node b. app1 has already been 
> submitted and has failed in two AM attempts, on a and b. app2 and other apps 
> should wait for the other, busy nodes rather than waste attempts on these 
> two problematic nodes.
> 2. If an AM container failure is treated as a global event instead of the 
> app's own issue, the blacklist should not be permanent but should apply 
> within a specific time window. 
> 3. We could have user-defined blacklist policies to address more cases and 
> scenarios, so it is reasonable to make the blacklist policy pluggable.
> 4. For some test scenarios, we could have a whitelist mechanism for AM 
> launching.
> 5. Some minor issues: it appears that an NM reconnect does not refresh the 
> blacklist so far.
> Will try to address all issues here.
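
Items 1 and 2 above, combined with the disable-failure-threshold, can be 
sketched as a single global blacklist whose entries expire after a time 
window. This is only an illustration under stated assumptions: the class 
TimeWindowBlacklist, its methods, and the exact threshold semantics (refuse 
new entries once the blacklisted fraction of nodes reaches the ratio) are 
hypothetical and are not the BlacklistManager API:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the YARN BlacklistManager.
public class TimeWindowBlacklist {
    private final int totalNodes;          // cluster size (assumed fixed here)
    private final double disableThreshold; // ratio in [0.0, 1.0]
    private final long windowMillis;       // how long an entry stays valid
    // node name -> time of the most recent AM launch failure on that node
    private final Map<String, Long> lastFailure = new HashMap<>();

    public TimeWindowBlacklist(int totalNodes, double disableThreshold,
                               long windowMillis) {
        this.totalNodes = totalNodes;
        this.disableThreshold = disableThreshold;
        this.windowMillis = windowMillis;
    }

    /** Drops entries whose failure is older than the time window. */
    private void expire(long nowMillis) {
        lastFailure.values().removeIf(t -> nowMillis - t >= windowMillis);
    }

    /**
     * Records an AM launch failure on a node. Returns false if the node
     * was not added because the blacklist already covers the configured
     * ratio of the cluster.
     */
    public boolean recordFailure(String node, long nowMillis) {
        expire(nowMillis);
        if (lastFailure.containsKey(node)) {
            lastFailure.put(node, nowMillis); // refresh the existing entry
            return true;
        }
        if (lastFailure.size() >= disableThreshold * totalNodes) {
            return false; // threshold hit: stop adding more nodes
        }
        lastFailure.put(node, nowMillis);
        return true;
    }

    /** Nodes to avoid for AM launches of any application at this time. */
    public Set<String> blacklistedNodes(long nowMillis) {
        expire(nowMillis);
        return new HashSet<>(lastFailure.keySet());
    }
}
```

Because the state is shared rather than kept per app, app2 in the example 
above would avoid nodes a and b as soon as app1's attempts failed there, and 
both nodes would become eligible again once the window elapsed. Timestamps 
are passed in explicitly to keep the sketch easy to test.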



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
