[ 
https://issues.apache.org/jira/browse/YARN-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966240#comment-14966240
 ] 

Sunil G commented on YARN-4284:
-------------------------------

Hi [~sjlee0],
As part of YARN-2293, we were looking into a proposal to score NMs based on 
their performance (attempt failures, launch failures, disk crashes, etc. 
would decrement the score). For every application's AM, it is then always 
best to schedule on the highest-ranked NM (the best performer so far).
But that is a very generic proposal, so we planned to get there step by 
step, and YARN-2005 was the first step, as suggested by [~jlowe]. As an 
improvement, your proposal is very much the next step in that direction, 
and it can give a better probability of a successful AM container launch for 
all applications. Currently there is a chance that a new application's first 
AM attempt will still fail and only the second one will succeed because of 
AM blacklisting.
The difficulty in achieving this is separating general failures (disk 
crashes, JVM launch problems) from application-specific errors (some AM 
containers may not run on a node because of its memory or some other 
factor). If we cannot do this, then one specific problem with one AM on a 
node could get that node blacklisted for all other apps, which may be 
dangerous.
So, in line with your thought, I think we can collect all the other issues 
in container launches, segregate out the generic errors, and blacklist for a 
period of time. +1 for this. Thoughts?
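
A minimal sketch of that last idea, in Java, to make the discussion concrete. 
Everything here is hypothetical: the class name, the cause strings other than 
DISKS_FAILED, and the expiry window are illustrations, not existing YARN 
APIs. Only failures classified as generic put the node on a time-bounded 
blacklist shared by all apps; app-specific failures leave it untouched.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch only: none of these names exist in YARN. */
class AmBlacklistSketch {
    // Hypothetical causes treated as node-wide (generic) failures.
    // Only DISKS_FAILED is taken from the issue description.
    private static final Set<String> GENERIC_CAUSES = new HashSet<>(
        Arrays.asList("DISKS_FAILED", "DISK_FULL", "JVM_LAUNCH_FAILED"));

    // How long a node stays blacklisted after a generic failure.
    private final long blacklistMillis;

    // Node name -> time (ms) at which its blacklist entry expires.
    private final Map<String, Long> expiry = new HashMap<>();

    AmBlacklistSketch(long blacklistMillis) {
        this.blacklistMillis = blacklistMillis;
    }

    /** True if the cause points at the node rather than at one application. */
    static boolean isGeneric(String cause) {
        return GENERIC_CAUSES.contains(cause);
    }

    /** Record an AM container failure; only generic causes blacklist the node. */
    void recordFailure(String node, String cause, long nowMillis) {
        if (isGeneric(cause)) {
            expiry.put(node, nowMillis + blacklistMillis);
        }
    }

    /** True while the node's blacklist entry has not yet expired. */
    boolean isBlacklisted(String node, long nowMillis) {
        Long until = expiry.get(node);
        if (until == null) {
            return false;
        }
        if (nowMillis >= until) {
            expiry.remove(node); // entry expired, node is eligible again
            return false;
        }
        return true;
    }
}
```

The caller passes the clock value in, so the expiry behaviour can be checked 
without waiting. Where the line between generic and app-specific causes 
should sit is exactly the open question above.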

> condition for AM blacklisting is too narrow
> -------------------------------------------
>
>                 Key: YARN-4284
>                 URL: https://issues.apache.org/jira/browse/YARN-4284
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: YARN-4284.001.patch
>
>
> Per YARN-2005, there is now a way to blacklist nodes for AM purposes so the 
> next app attempt can be assigned to a different node.
> However, currently the condition under which the node gets blacklisted is 
> limited to {{DISKS_FAILED}}. There are a whole host of other issues that may 
> cause the failure, for which we want to locate the AM elsewhere; e.g. disks 
> full, JVM crashes, memory issues, etc.
> Since the AM blacklisting is per-app, there is little practical downside in 
> blacklisting the nodes on *any failure* (although it might lead to 
> blacklisting the node more aggressively than necessary). I would propose 
> locating the next app attempt to a different node on any failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
