[ 
https://issues.apache.org/jira/browse/YARN-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966248#comment-14966248
 ] 

Sangjin Lee commented on YARN-4284:
-----------------------------------

Hi [~sunilg], thanks for the comment. Yes, I've been following the discussion 
on YARN-2005 as well as YARN-2293. Although it would be nice to have a reliable 
scoring mechanism as a basis for assigning AM containers, what's implemented in 
YARN-2005 is actually a pretty solid solution to this problem. By the way, this 
is one of the more common issues our users encounter.

The only problem with YARN-2005 is that the blacklisting condition is too 
narrow. In fact, we rarely encounter the DISKS_FAILED error. It's usually more 
like INVALID (-1000) or other errors. We can try to be really precise and 
blacklist nodes only if the container exit status is purely due to the node 
itself and is not caused by the app. But maintaining that precise condition may 
prove to be brittle.

IMO the key is that the blacklisting implemented in YARN-2005 is *per-app*. As 
such, we can afford to be more aggressive, instead of trying to come up with 
a 100% accurate blacklisting condition. Since it is per-app, there is no risk 
that one bad app can cause a node to be blacklisted for all other apps (correct 
me if I'm wrong). Thoughts? Do you see any other risks in taking this approach?
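
To make the trade-off concrete, here is a rough sketch of the two conditions being compared. It is not the YARN-2005 code: the class {{PerAppAmBlacklist}}, its methods, and the {{Policy}} enum are hypothetical names made up for illustration, and the exit status constants are copied in by hand as assumptions (INVALID is -1000 as noted above; DISKS_FAILED is assumed to be -101). "Any failure" is simplified here to "any non-zero exit status".

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-app AM blacklist; an illustration, not the YARN-2005 implementation. */
class PerAppAmBlacklist {
    // Assumed exit status values, hard-coded so the sketch is self-contained.
    static final int SUCCESS = 0;
    static final int INVALID = -1000;
    static final int DISKS_FAILED = -101;

    /** The narrow condition versus the proposed aggressive one. */
    enum Policy { DISKS_FAILED_ONLY, ANY_FAILURE }

    private final Policy policy;
    // Keyed by app id, so one app's failures never affect another app's placement.
    private final Map<String, Set<String>> blacklistedNodesByApp = new HashMap<>();

    PerAppAmBlacklist(Policy policy) {
        this.policy = policy;
    }

    /** Decides whether an AM container exit status should blacklist the node. */
    boolean shouldBlacklist(int exitStatus) {
        switch (policy) {
            case DISKS_FAILED_ONLY:
                return exitStatus == DISKS_FAILED;
            case ANY_FAILURE:
                return exitStatus != SUCCESS;
            default:
                return false;
        }
    }

    /** Records the node against this app only, if the policy says so. */
    void onAmContainerExit(String appId, String nodeId, int exitStatus) {
        if (shouldBlacklist(exitStatus)) {
            blacklistedNodesByApp
                .computeIfAbsent(appId, k -> new HashSet<>())
                .add(nodeId);
        }
    }

    /** True if the next attempt of this app should avoid the node. */
    boolean isBlacklisted(String appId, String nodeId) {
        return blacklistedNodesByApp
            .getOrDefault(appId, Collections.emptySet())
            .contains(nodeId);
    }
}
```

With {{DISKS_FAILED_ONLY}}, an AM that exits with INVALID leaves the node eligible for the next attempt; with {{ANY_FAILURE}}, the node is skipped for that app only and stays available to every other app.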

> condition for AM blacklisting is too narrow
> -------------------------------------------
>
>                 Key: YARN-4284
>                 URL: https://issues.apache.org/jira/browse/YARN-4284
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: YARN-4284.001.patch
>
>
> Per YARN-2005, there is now a way to blacklist nodes for AM purposes so the 
> next app attempt can be assigned to a different node.
> However, currently the condition under which the node gets blacklisted is 
> limited to {{DISKS_FAILED}}. There is a whole host of other issues that may 
> cause the failure, for which we want to locate the AM elsewhere; e.g. disks 
> full, JVM crashes, memory issues, etc.
> Since the AM blacklisting is per-app, there is little practical downside in 
> blacklisting the nodes on *any failure* (although it might lead to 
> blacklisting the node more aggressively than necessary). I would propose 
> placing the next app attempt on a different node after any failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
