[ 
https://issues.apache.org/jira/browse/YARN-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967434#comment-14967434
 ] 

Sunil G commented on YARN-4284:
-------------------------------

Yes [~sjlee0]. Once the threshold is crossed, it clears all nodes from the 
blacklist. Thank you for the correction.
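
To make the threshold behavior concrete, here is a minimal sketch of how I understand it. The class {{ThresholdBlacklist}}, its fields, and the "fraction of cluster size" rule are hypothetical names and assumptions for illustration only; this is not the actual RM code, and whether the comparison is strict or inclusive is also an assumption.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of a per-app AM blacklist that resets at a threshold. */
final class ThresholdBlacklist {
    private final Set<String> nodes = new HashSet<>();
    private final int clusterSize;   // number of nodes in the cluster (assumed known)
    private final double threshold;  // fraction of the cluster that may be blacklisted

    ThresholdBlacklist(int clusterSize, double threshold) {
        this.clusterSize = clusterSize;
        this.threshold = threshold;
    }

    /** Records a failed node; clears every entry once the limit is reached. */
    void addNode(String host) {
        nodes.add(host);
        // Assumption: reaching the limit (>=) disables blacklisting entirely.
        if (nodes.size() >= clusterSize * threshold) {
            nodes.clear();
        }
    }

    /** Returns a copy of the currently blacklisted hosts. */
    Set<String> current() {
        return new HashSet<>(nodes);
    }
}
```

With a cluster of 4 nodes and a threshold of 0.5, the first failed node is blacklisted, and the second one empties the blacklist so the AM is not starved of candidate nodes.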

bq.Just because it was killed by the RM doesn't mean definitively that it was 
purely an app problem.
I agree. It may not be app-specific. 

bq.anti-affinity is a better behavior as a default behavior. In the worst case 
scenario when the AM container failure was caused purely by the app, running 
subsequent attempts on different nodes will make it only clear the failures 
were unrelated to nodes
Yes, I agree with your point. It can help isolate the cause of the container 
failure. So we could skip only {{PREEMPTED}} for now and consider all other 
failure cases for blacklisting. Correct?
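
A rough sketch of the rule I have in mind, assuming a simplified stand-in for the container exit status. The {{AmBlacklistPolicy}} class, the {{ExitReason}} enum, and its values other than {{PREEMPTED}} and {{DISKS_FAILED}} are hypothetical and only for illustration; this is not the attached patch. Treating {{SUCCESS}} as exempt is also my assumption, since a successful exit is not a failure.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch of the proposed AM blacklisting condition. */
final class AmBlacklistPolicy {
    /** Illustrative subset of exit reasons; not the real YARN constants. */
    enum ExitReason {
        SUCCESS, DISKS_FAILED, PREEMPTED, KILLED_BY_RM, JVM_CRASH, OUT_OF_MEMORY
    }

    // Reasons that never count against the node the AM ran on.
    private static final Set<ExitReason> EXEMPT =
        EnumSet.of(ExitReason.SUCCESS, ExitReason.PREEMPTED);

    private AmBlacklistPolicy() {
    }

    /** Returns true if the next app attempt should avoid the failed node. */
    static boolean shouldBlacklistNode(ExitReason reason) {
        return !EXEMPT.contains(reason);
    }
}
```

Under this rule, {{DISKS_FAILED}} still blacklists the node as it does today, and so do the other failure cases, while a preempted AM container leaves the node eligible for the next attempt.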

> condition for AM blacklisting is too narrow
> -------------------------------------------
>
>                 Key: YARN-4284
>                 URL: https://issues.apache.org/jira/browse/YARN-4284
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: YARN-4284.001.patch
>
>
> Per YARN-2005, there is now a way to blacklist nodes for AM purposes so the 
> next app attempt can be assigned to a different node.
> However, currently the condition under which the node gets blacklisted is 
> limited to {{DISKS_FAILED}}. There are a whole host of other issues that may 
> cause the failure, for which we want to locate the AM elsewhere; e.g. disks 
> full, JVM crashes, memory issues, etc.
> Since the AM blacklisting is per-app, there is little practical downside in 
> blacklisting the nodes on *any failure* (although it might lead to 
> blacklisting the node more aggressively than necessary). I would propose 
> locating the next app attempt to a different node on any failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
