[ 
https://issues.apache.org/jira/browse/YARN-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-4576:
-----------------------------
    Description: The current YARN blacklist mechanism tracks bad nodes per AM:
if an AM's attempts to launch containers on a specific node fail several times,
the AM blacklists that node in its future resource requests. This mechanism
works fine for normal containers. However, from our observations on several
clusters, if a problematic node fails to launch an AM, the RM may pick the same
node for subsequent AM attempts again and again, which causes the application
to fail when the other, functional nodes are busy. Normally, the custom health
checker script cannot be so sensitive as to mark a node unhealthy after only
one or two container launch failures. On the RM side, however, we can blacklist
such nodes for AM launches for a certain time if AM launches have failed on
them before.  (was: The current YARN blacklist mechanism tracks bad nodes per
AM: if an AM's attempts to launch containers on a specific node fail several
times, the AM blacklists that node in its future resource requests. This
mechanism works fine for normal containers. However, from our observations on
several clusters, if a problematic node fails to launch an AM, the RM may pick
the same node for subsequent AM attempts again and again, which causes the
application to fail when the other, functional nodes are busy. Normally, the
custom health checker script cannot be so sensitive as to mark a node unhealthy
after only one or two container launch failures. On the RM side, however, we
can blacklist such nodes for AM launches for a certain time.)

> Extend blacklist mechanism to protect AM failed multiple times on failure 
> nodes
> -------------------------------------------------------------------------------
>
>                 Key: YARN-4576
>                 URL: https://issues.apache.org/jira/browse/YARN-4576
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>
> The current YARN blacklist mechanism tracks bad nodes per AM: if an AM's
> attempts to launch containers on a specific node fail several times, the AM
> blacklists that node in its future resource requests. This mechanism works
> fine for normal containers. However, from our observations on several
> clusters, if a problematic node fails to launch an AM, the RM may pick the
> same node for subsequent AM attempts again and again, which causes the
> application to fail when the other, functional nodes are busy. Normally, the
> custom health checker script cannot be so sensitive as to mark a node
> unhealthy after only one or two container launch failures. On the RM side,
> however, we can blacklist such nodes for AM launches for a certain time if AM
> launches have failed on them before.
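
The RM-side idea in the description (blacklist a node for AM launches for a
certain time once AM launches have failed on it) could be sketched roughly as
below. This is only an illustration under stated assumptions, not the YARN
ResourceManager code: the class name AmLaunchBlacklist, its methods, the
failure threshold, and the blacklist duration are all hypothetical and do not
come from the issue.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a time-limited, RM-side blacklist for AM launches. */
final class AmLaunchBlacklist {
    // Assumption: number of AM launch failures on a node before it is blacklisted.
    private final int failureThreshold;
    // Assumption: how long (in milliseconds) a node stays blacklisted.
    private final long blacklistMillis;

    // AM launch failures counted per node since its last blacklisting.
    private final Map<String, Integer> failures = new HashMap<>();
    // Time (in milliseconds) at which each blacklisted node becomes eligible again.
    private final Map<String, Long> blacklistedUntil = new HashMap<>();

    AmLaunchBlacklist(int failureThreshold, long blacklistMillis) {
        this.failureThreshold = failureThreshold;
        this.blacklistMillis = blacklistMillis;
    }

    /** Records a failed AM launch on a node; blacklists it once the threshold is hit. */
    void recordAmLaunchFailure(String node, long nowMillis) {
        int count = failures.merge(node, 1, Integer::sum);
        if (count >= failureThreshold) {
            blacklistedUntil.put(node, nowMillis + blacklistMillis);
            failures.remove(node); // start counting afresh after blacklisting
        }
    }

    /** Returns true while the node should be skipped for AM placement. */
    boolean isBlacklisted(String node, long nowMillis) {
        Long until = blacklistedUntil.get(node);
        if (until == null) {
            return false;
        }
        if (nowMillis >= until) {
            blacklistedUntil.remove(node); // blacklist period has expired
            return false;
        }
        return true;
    }
}
```

With a threshold of 2 and a duration of 1000 ms, a node that fails AM launches
at t=0 and t=10 would be skipped for AM placement until t=1010 and would
become eligible again afterwards, while normal container placement is left to
the existing per-AM blacklist.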



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
