[ 
https://issues.apache.org/jira/browse/YARN-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967377#comment-14967377
 ] 

Sangjin Lee commented on YARN-4284:
-----------------------------------

[~steve_l] [~sunilg], if you have only one or two nodes and the AM container 
of an app fails, {{yarn.am.blacklisting.disable-failure-threshold}} ensures 
that the app cannot blacklist the entire cluster. Once you're above the 
threshold, the blacklisting is cleared, and all nodes are available again. 
Note that this is *per-app* behavior. Other apps are not affected by this 
decision at all.
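
As a rough illustration of the threshold behavior described above, here is a 
minimal sketch. It is not the actual RM code: the class and method names are 
hypothetical, and it assumes the threshold is a fraction of the cluster's 
nodes and that reaching it exactly already disables blacklisting.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; these are not the ResourceManager classes.
class AmBlacklistSketch {
    // Nodes on which this app's AM container has failed (per-app state).
    private final Set<String> blacklisted = new HashSet<>();
    private final int clusterSize;
    // Assumed to be a fraction of the cluster's nodes, e.g. 0.8.
    private final double disableFailureThreshold;

    AmBlacklistSketch(int clusterSize, double disableFailureThreshold) {
        this.clusterSize = clusterSize;
        this.disableFailureThreshold = disableFailureThreshold;
    }

    // Record an AM container failure on a node, for this app only.
    void addFailedNode(String node) {
        blacklisted.add(node);
    }

    // Nodes the next AM attempt should avoid. Once the blacklisted fraction
    // reaches the threshold, no node is excluded, so the app can never
    // blacklist the whole cluster.
    Set<String> nodesToAvoid() {
        double fraction = (double) blacklisted.size() / clusterSize;
        if (fraction >= disableFailureThreshold) {
            return new HashSet<>();
        }
        return new HashSet<>(blacklisted);
    }
}
```

With two nodes and an example threshold of 0.8, the first failed node is 
avoided (0.5 is below the threshold). After both nodes have failed, the 
fraction is 1.0 and no node is excluded.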

As for the condition for applying blacklisting, I think we can add 
{{PREEMPTED}} to the list of exit statuses that do not trigger blacklisting. 
I'm not so sure about {{KILLED_BY_RESOURCEMANAGER}}. It is possible for an AM 
container to be killed by the resource manager due to a node issue. Any 
failure to heartbeat properly will cause the AM container to be killed by the 
RM, but that heartbeat failure can have many causes. Just because the 
container was killed by the RM doesn't definitively mean that it was purely 
an app problem. What do you think?
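
The condition being discussed could be sketched roughly as follows. This is 
an assumption-laden illustration, not the patch itself: the enum is a local 
stand-in covering only the statuses mentioned here, and the class and method 
names are hypothetical.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration only; not the code in the attached patch.
class AmBlacklistConditionSketch {
    // Local stand-in for the exit statuses discussed in this thread.
    enum ExitStatus {
        SUCCESS, DISKS_FAILED, PREEMPTED, KILLED_BY_RESOURCEMANAGER, OTHER_FAILURE
    }

    // Statuses that should not cause the node to be blacklisted for the app.
    // KILLED_BY_RESOURCEMANAGER is deliberately left out of this set, since
    // a missed heartbeat may be caused by a node problem.
    private static final Set<ExitStatus> NO_BLACKLIST =
            EnumSet.of(ExitStatus.SUCCESS, ExitStatus.PREEMPTED);

    // Blacklist the node for this app on any other exit status.
    static boolean shouldBlacklistNode(ExitStatus status) {
        return !NO_BLACKLIST.contains(status);
    }
}
```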

I think we may want to approach this from the point of view of *anti-affinity*. 
Currently there is an inherent *affinity* to nodes when it comes to assigning 
the AM containers. In my view, anti-affinity is a better default behavior. 
Even in the worst case, where the AM container failure was caused purely by 
the app, running subsequent attempts on different nodes will make it clear 
that the failures were unrelated to the nodes. This helps troubleshooting a 
great deal. Today, when all AM containers land on the same node, we sometimes 
spend a fair amount of time convincing our users that the failure had nothing 
to do with the node.

Thoughts and comments are welcome. Thanks!

> condition for AM blacklisting is too narrow
> -------------------------------------------
>
>                 Key: YARN-4284
>                 URL: https://issues.apache.org/jira/browse/YARN-4284
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: YARN-4284.001.patch
>
>
> Per YARN-2005, there is now a way to blacklist nodes for AM purposes so the 
> next app attempt can be assigned to a different node.
> However, currently the condition under which the node gets blacklisted is 
> limited to {{DISKS_FAILED}}. There is a whole host of other issues that may 
> cause the failure, for which we want to locate the AM elsewhere; e.g. disks 
> full, JVM crashes, memory issues, etc.
> Since the AM blacklisting is per-app, there is little practical downside in 
> blacklisting the nodes on *any failure* (although it might lead to 
> blacklisting the node more aggressively than necessary). I would propose 
> placing the next app attempt on a different node after any failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
