[ 
https://issues.apache.org/jira/browse/YARN-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201655#comment-15201655
 ] 

Sangjin Lee commented on YARN-4837:
-----------------------------------

I just wanted to add my 2 cents to the discussion, specifically about YARN-4284, 
where we broadened the conditions under which a node is blacklisted for AM 
placement.

AMs repeatedly getting assigned to the same node in spite of failures is one of 
the most frequent complaints from our users ("why did our AMs keep landing on 
that bad node, causing our jobs to fail?"). If a node is having a "soft" 
failure that doesn't quite trip itself over to an unhealthy state, that's the 
worst possible case. Since the node is still considered healthy and appears to 
have a lot of available capacity, the chance that it gets the next attempt is 
quite high; i.e. we have node affinity. And since this is an AM, the 
consequence is much more severe than when an ordinary container lands on that 
node.

Oftentimes, the causes of these soft failures vary, and coming up with a 
precise set of exit codes that meets the criteria isn't straightforward. There 
are even error codes like INVALID that we see quite 
often (see [my previous 
comment|https://issues.apache.org/jira/browse/YARN-4284?focusedCommentId=14966248&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14966248]).
I know this could blacklist a node for an app for reasons such as the app's 
own configuration error (false positives). However, the reason we could afford 
to go broad is that this blacklisting is *per-app*. The only downside is that 
the app gets assigned to another node.
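
To make the per-app point concrete, here is a minimal sketch of the idea. It is 
not the ResourceManager code; the class and method names are hypothetical, and 
it assumes every AM failure counts regardless of exit code.

```python
from collections import defaultdict


class PerAppAMBlacklist:
    """Hypothetical per-application record of nodes where an AM attempt failed."""

    def __init__(self):
        # app_id -> set of node names that app should avoid for its next AM attempt
        self._failed_nodes = defaultdict(set)

    def record_am_failure(self, app_id, node):
        # Broad policy (assumption): any AM failure counts, whatever the exit code.
        self._failed_nodes[app_id].add(node)

    def candidates(self, app_id, healthy_nodes):
        # Only this app avoids the node; other apps still see every healthy node.
        blocked = self._failed_nodes.get(app_id, set())
        return [n for n in healthy_nodes if n not in blocked]


nodes = ["node1", "node2", "node3"]
blacklist = PerAppAMBlacklist()
blacklist.record_am_failure("app_1", "node2")

app1_nodes = blacklist.candidates("app_1", nodes)  # node2 is skipped for app_1
app2_nodes = blacklist.candidates("app_2", nodes)  # app_2 is unaffected
```

A false positive here costs app_1 one candidate node and nothing else, which is 
why going broad on the failure causes is affordable.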

We have a number of large busy clusters, and we're using this with success and 
with little downside.

That said, I do recognize that this could be a problem if 
{{yarn.resourcemanager.am.max-attempts}} is larger than the size of the cluster.
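
A small sketch of that corner case, under the worst-case assumption that every 
AM attempt fails and blacklists the node it ran on. The function name is 
hypothetical, and {{None}} is only a placeholder for "no eligible node"; what 
the scheduler actually does at that point is not modeled here.

```python
def simulate_am_attempts(nodes, max_attempts):
    """Worst case (assumption): each AM attempt fails and blacklists its node."""
    blacklisted = set()
    placements = []
    for _ in range(max_attempts):
        eligible = [n for n in nodes if n not in blacklisted]
        if not eligible:
            # Every node is already blacklisted for this app.
            placements.append(None)
            continue
        node = eligible[0]
        placements.append(node)
        blacklisted.add(node)
    return placements


# Two-node cluster, four allowed attempts: the last two attempts have no node left.
placements = simulate_am_attempts(["node1", "node2"], max_attempts=4)
```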

> User facing aspects of 'AM blacklisting' feature need fixing
> ------------------------------------------------------------
>
>                 Key: YARN-4837
>                 URL: https://issues.apache.org/jira/browse/YARN-4837
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>
> Was reviewing the user-facing aspects that we are releasing as part of 2.8.0.
> Looking at the 'AM blacklisting feature', I see several things to be fixed 
> before we release it in 2.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
