[ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294976#comment-14294976
 ] 

Sunil G commented on YARN-2005:
-------------------------------

bq. Is that different users, app names, or .

Yes. App name was the first thing that came to mind. As you mentioned, the 
challenge here is to find the really buggy application when it comes in as part 
of a workflow. There can also be genuine cases where a workflow of jobs fails 
because of a node problem. 
To overcome this, multiple inputs can be considered, such as app name, user, 
queue, etc. 

*Point 1:*
An app from "user1" with name "job1" failed on node1. If the same app name 
"job1" fails again on the same node, the recent history or the AMs currently 
running on that node can be cross-checked. This may give a better idea of the 
behavior on that node.
In simple words, a sample of 2 or more different applications (categorized by 
name, user, etc.) always has to be considered before taking a decision on a 
node.

*Point 2:*
If an app from "user1" with name "job2" fails on node1, it is entirely 
appropriate to try its second attempt on a different node.
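
To make Points 1 and 2 concrete, here is a minimal sketch of the bookkeeping 
they imply. It is only an illustration under stated assumptions: the class 
name, method names and the threshold constant are hypothetical and are not 
part of YARN, and applications are identified simply by (user, app name).

```python
# Hypothetical threshold: Point 1 asks for 2 or more different applications
# to fail on a node before any decision is taken about the node itself.
MIN_DISTINCT_APPS = 2


class AMFailureTracker:
    """Illustrative only: records AM attempt failures per node."""

    def __init__(self, min_distinct_apps=MIN_DISTINCT_APPS):
        self.min_distinct_apps = min_distinct_apps
        # node -> set of (user, app_name) whose AM attempt failed there
        self._failed_apps_by_node = {}

    def record_failure(self, node, user, app_name):
        # Repeated failures of the same app on the same node count once,
        # so a single buggy app cannot make a node look bad on its own.
        self._failed_apps_by_node.setdefault(node, set()).add((user, app_name))

    def is_node_suspect(self, node):
        # Point 1: the node is suspect only once enough distinct apps failed.
        apps = self._failed_apps_by_node.get(node, set())
        return len(apps) >= self.min_distinct_apps

    def nodes_to_avoid(self, user, app_name):
        # Point 2: an app's next attempt avoids nodes where it already failed,
        # plus any node that is suspect for every application.
        key = (user, app_name)
        return {
            node
            for node, apps in self._failed_apps_by_node.items()
            if key in apps or len(apps) >= self.min_distinct_apps
        }
```

With this, "job1" failing twice on node1 only steers "job1" away from node1. 
Other applications are steered away only after a second, different application 
(for example "job2") has also failed there.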


bq. However we need a "don't even go there" level to avoid rapid rescheduling of 
failed AM attempts on the same node in a busy cluster scenario.
This is one of my real intentions as well. But continuous monitoring of the 
cluster, together with its historical data, will play a pivotal role here, and 
time also has to be one of the decision-making inputs. I could jot down a few 
points and share them as a doc, and we can see whether this adds value to the 
system without creating an opportunity to game it.

> Blacklisting support for scheduling AMs
> ---------------------------------------
>
>                 Key: YARN-2005
>                 URL: https://issues.apache.org/jira/browse/YARN-2005
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 0.23.10, 2.4.0
>            Reporter: Jason Lowe
>
> It would be nice if the RM supported blacklisting a node for an AM launch 
> after the same node fails a configurable number of AM attempts.  This would 
> be similar to the blacklisting support for scheduling task attempts in the 
> MapReduce AM but for scheduling AM attempts on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
