[ 
https://issues.apache.org/jira/browse/YARN-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15094177#comment-15094177
 ] 

Sunil G commented on YARN-4576:
-------------------------------

Yes [~djp]. The rules are currently strict; the intent was mostly to make sure 
container failures are considered for blacklisting, as a safety measure. I 
agree that this is on the strict side, so a dampening factor or dead zone could 
be introduced to avoid the cases you mentioned.

Right now we have only {{am.blacklisting.disable-failure-threshold}}, which 
defaults to 80% and keeps us from blacklisting all nodes in the cluster. +1 for 
adding more tuning configs here.
I feel that, based on the container errors, we can decide how long a node needs 
to stay blacklisted. (A time-based blacklist may also be better.)
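
To make the time-based idea concrete, here is a minimal sketch, not YARN code. 
The class name, the error categories, and the per-error durations are all 
hypothetical; only the 80% disable threshold comes from the discussion above, 
and the exact comparison used for it (>=) is an assumption.

```python
import time
from typing import Callable, Dict, Set

# Hypothetical mapping from container error type to blacklist duration (seconds).
TTL_BY_ERROR: Dict[str, float] = {
    "launch_failed": 60.0,   # assumed: transient, expire quickly
    "disk_failed": 600.0,    # assumed: likely persistent, keep longer
}
DEFAULT_TTL_S = 120.0        # assumed fallback for unknown error types


class TimedBlacklist:
    """Illustrative time-based node blacklist with a disable threshold."""

    def __init__(self, cluster_size: int, disable_threshold: float = 0.8,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.cluster_size = cluster_size
        self.disable_threshold = disable_threshold
        self.clock = clock
        self._expiry: Dict[str, float] = {}  # node -> time the entry expires

    def record_failure(self, node: str, error: str) -> None:
        """Blacklist `node` for a duration chosen from the error type."""
        ttl = TTL_BY_ERROR.get(error, DEFAULT_TTL_S)
        # Keep the later expiry if the node is already blacklisted.
        self._expiry[node] = max(self._expiry.get(node, 0.0),
                                 self.clock() + ttl)

    def blacklisted(self) -> Set[str]:
        """Return nodes to avoid now; empty if the threshold is reached."""
        now = self.clock()
        # Drop expired entries so nodes get another chance over time.
        self._expiry = {n: t for n, t in self._expiry.items() if t > now}
        active = set(self._expiry)
        # Safety valve: never blacklist (nearly) the whole cluster.
        if len(active) >= self.disable_threshold * self.cluster_size:
            return set()
        return active
```

A caller would invoke record_failure() when a container fails on a node and 
consult blacklisted() before placing the next attempt. Passing a fake clock 
makes the expiry behaviour easy to test.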

> Extend blacklist mechanism to protect AM failed multiple times on failure 
> nodes
> -------------------------------------------------------------------------------
>
>                 Key: YARN-4576
>                 URL: https://issues.apache.org/jira/browse/YARN-4576
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>
> The current YARN blacklist mechanism tracks bad nodes per AM: if an AM's 
> attempts to launch containers on a specific node fail several times, the AM 
> blacklists that node in future resource requests. This mechanism works fine 
> for normal containers. However, from what we have observed on several 
> clusters, if an AM fails to launch on a problematic node, the RM can pick 
> the same node for the next AM attempts again and again, which causes the 
> application to fail when other functional nodes are busy. Normally, the 
> customized health checker script cannot be so sensitive that it marks a node 
> unhealthy after one or two container launch failures. However, on the RM 
> side, we can blacklist such nodes for AM launches for a certain time if AM 
> launches have failed there before.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
