[ https://issues.apache.org/jira/browse/YARN-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201655#comment-15201655 ]
Sangjin Lee commented on YARN-4837:
-----------------------------------
I just wanted to add my 2 cents to the discussion, specifically about YARN-4284,
where we broadened the set of failure causes that blacklist a node for AM
placement.
AMs repeatedly getting assigned to the same node in spite of failures is one of
the most frequent complaints from our users ("why did our AMs keep landing on
that bad node, causing our jobs to fail?"). If a node is having a "soft"
failure that doesn't quite trip itself over to an unhealthy state, that's the
worst possible case. Since the node is still reported as healthy and appears to
have a lot of available capacity, the chance that it gets the next attempt is
quite high; i.e. we have node affinity. And since this is an AM, the consequence
is much more severe than when a regular container lands on that node.
Oftentimes, the causes of this soft-failure situation are varied, and trying to
come up with a precise set of exit codes that meet these criteria isn't
straightforward. There are even error codes like INVALID which we see quite
often (see [my previous
comment|https://issues.apache.org/jira/browse/YARN-4284?focusedCommentId=14966248&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14966248]).
I know it could blacklist the node for the app for reasons such as the app's
configuration error (false positives). However, the reason we could afford to
go broad is that this blacklisting is *per-app*. The only downside is that the
app's next AM attempt gets assigned to another node.
We have a number of large busy clusters, and we're using this with success and
with little downside.
That said, I do recognize that this could be a problem if
{{yarn.resourcemanager.am.max-attempts}} is larger than the size of the cluster,
since every node could then end up blacklisted for the app before its attempts
run out.
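
To make the trade-off concrete, here is a rough toy model of a per-app AM
blacklist. It is only a sketch: the names ({{AppAmBlacklist}}, {{simulate}}) and
the "pick the first eligible node" choice are made up for illustration and are
not YARN's actual classes or scheduler logic.

```python
from dataclasses import dataclass, field


@dataclass
class AppAmBlacklist:
    """Toy per-application AM blacklist (hypothetical, not YARN code)."""
    cluster_nodes: frozenset
    failed_nodes: set = field(default_factory=set)

    def record_am_failure(self, node):
        # Broad policy: any AM failure on a node blacklists that node,
        # but only for this one application.
        self.failed_nodes.add(node)

    def candidate_nodes(self):
        # Nodes still eligible to host this app's next AM attempt.
        return set(self.cluster_nodes) - self.failed_nodes


def simulate(cluster_nodes, max_attempts, bad_nodes):
    """Return the nodes tried, in order, for one app's AM attempts.

    bad_nodes is the set of nodes on which this app's AM fails.
    """
    blacklist = AppAmBlacklist(frozenset(cluster_nodes))
    tried = []
    for _ in range(max_attempts):
        candidates = sorted(blacklist.candidate_nodes())
        if not candidates:
            break  # every node is blacklisted for this app
        node = candidates[0]  # stand-in for the scheduler's choice
        tried.append(node)
        if node not in bad_nodes:
            break  # AM launched successfully
        blacklist.record_am_failure(node)
    return tried


# One soft-failing node: the second attempt simply lands elsewhere.
one_bad = simulate(["n1", "n2", "n3"], max_attempts=2, bad_nodes={"n1"})

# App config error (fails everywhere) with max attempts > cluster size:
# the app runs out of eligible nodes before it runs out of attempts.
all_bad = simulate(["n1", "n2", "n3"], max_attempts=5,
                   bad_nodes={"n1", "n2", "n3"})
```

In the first case the false-positive cost is just a move to another node. In the
second, the blacklist covers the whole (tiny) cluster after three attempts, which
is the corner case mentioned above.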
> User facing aspects of 'AM blacklisting' feature need fixing
> ------------------------------------------------------------
>
> Key: YARN-4837
> URL: https://issues.apache.org/jira/browse/YARN-4837
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Vinod Kumar Vavilapalli
>
> Was reviewing the user-facing aspects that we are releasing as part of 2.8.0.
> Looking at the 'AM blacklisting feature', I see several things to be fixed
> before we release it in 2.8.0.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)