[ 
https://issues.apache.org/jira/browse/YARN-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15094158#comment-15094158
 ] 

Junping Du commented on YARN-4576:
----------------------------------

Thanks for pointing it out, [~sunilg]. From a brief look at YARN-4284, I 
think its rules for picking nodes for AMs could be too strict. The side 
effects could be (I haven't gone through the implementation yet): 
1. In a small cluster, all nodes could be blacklisted for AM launching. 
2. In a larger cluster, AMs get concentrated on a small set of nodes (those 
that have had no container failures before), which causes network congestion 
on these nodes and affects running apps.
3. Some problematic apps (malicious or not) launch problematic containers that 
cause many innocent NMs to get blacklisted.
I need to go through YARN-4284 in more detail for more ideas, but I guess we 
should find a different balance for some cases/scenarios.
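
Side effects 1 and 3 above can be illustrated with a minimal sketch. It assumes a hypothetical strict rule in which any container failure on a node excludes that node from AM placement. The class and method names are invented for illustration and are not taken from YARN-4284 or any YARN API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical strict filter: one container failure bars a node from hosting AMs. */
final class StrictAmNodeFilter {
    // Nodes that have reported at least one container failure (assumed rule).
    private final Set<String> nodesWithContainerFailure = new HashSet<>();

    void recordContainerFailure(String node) {
        nodesWithContainerFailure.add(node);
    }

    /** Returns the cluster nodes that are still allowed to host an AM. */
    List<String> eligibleForAm(List<String> clusterNodes) {
        List<String> eligible = new ArrayList<>();
        for (String node : clusterNodes) {
            if (!nodesWithContainerFailure.contains(node)) {
                eligible.add(node);
            }
        }
        return eligible;
    }
}
```

In a three-node cluster, a single misbehaving app whose containers fail once on each node leaves `eligibleForAm` empty, so no AM can be placed anywhere under this assumed rule.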

> Extend blacklist mechanism to protect AM failed multiple times on failure 
> nodes
> -------------------------------------------------------------------------------
>
>                 Key: YARN-4576
>                 URL: https://issues.apache.org/jira/browse/YARN-4576
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>
> The current YARN blacklist mechanism tracks bad nodes per AM: if an AM's 
> attempts to launch containers on a specific node fail several times, the AM 
> blacklists this node in future resource requests. This mechanism works fine 
> for normal containers. However, from our observations of several clusters: 
> if an AM fails to launch on such a problematic node, the RM can pick the 
> same node for the next AM attempts again and again, which causes 
> application failure when other functional nodes are busy. Normally, the 
> customized health checker script cannot be sensitive enough to mark a node 
> as unhealthy when only one or two container launches fail. However, on the 
> RM side, we can blacklist these nodes for AM launching for a certain time 
> if AM launches have failed on them before.
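
The RM-side idea in the description (blacklist a node for AM launching for a certain time after AM launch failures) could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name, the failure threshold, and the expiry handling are hypothetical and do not reflect the actual ResourceManager code or any patch on this issue.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical time-bounded blacklist of nodes for AM launching. */
final class AmLaunchBlacklist {
    private final int failureThreshold;   // assumed: failures needed to blacklist
    private final long blacklistMillis;   // assumed: how long a node stays blacklisted
    private final Map<String, Integer> failures = new HashMap<>();
    private final Map<String, Long> blacklistedUntil = new HashMap<>();

    AmLaunchBlacklist(int failureThreshold, long blacklistMillis) {
        this.failureThreshold = failureThreshold;
        this.blacklistMillis = blacklistMillis;
    }

    /** Record that an AM failed to launch on the given node at time nowMillis. */
    void recordAmLaunchFailure(String node, long nowMillis) {
        int count = failures.merge(node, 1, Integer::sum);
        if (count >= failureThreshold) {
            blacklistedUntil.put(node, nowMillis + blacklistMillis);
            failures.remove(node); // start counting afresh once blacklisted
        }
    }

    /** True while the node should be skipped for AM placement. */
    boolean isBlacklistedForAm(String node, long nowMillis) {
        Long until = blacklistedUntil.get(node);
        if (until == null) {
            return false;
        }
        if (nowMillis >= until) {
            blacklistedUntil.remove(node); // blacklist period has expired
            return false;
        }
        return true;
    }
}
```

Because entries expire, a node that had a transient problem becomes eligible for AMs again without operator action, and normal (non-AM) container placement is not affected by this structure at all.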



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
