[ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578597#comment-14578597
 ] 

Sunil G commented on YARN-2005:
-------------------------------

Hi [[email protected]]
In our environments we have seen AM container launch failures on specific nodes 
due to memory issues (-Xmx configs), while the same launch worked fine on other 
nodes. So if the AM container fails on node1, its second attempt can be tried 
on a node other than node1. We would skip that node only for a certain duration 
or for a certain number of retries. This is not a clear-cut solution, but it is 
a reasonable fail-safe option.
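
To make the idea concrete, here is a rough sketch of the skip logic, written in 
Python only for brevity. It is not RM code: the class name, method names, and 
default limits are all hypothetical, and picking the first allowed candidate 
stands in for the real scheduler decision.

```python
import time


class AmNodeBlacklist:
    """Hypothetical per-application record of nodes to avoid for AM attempts."""

    def __init__(self, skip_seconds=600.0, max_skipped_attempts=2,
                 clock=time.monotonic):
        # Both limits are illustrative defaults, not real YARN settings.
        self.skip_seconds = skip_seconds
        self.max_skipped_attempts = max_skipped_attempts
        self.clock = clock
        self._entries = {}  # node -> [time of failure, attempts placed since]

    def record_am_failure(self, node):
        """Start (or restart) skipping a node after an AM attempt failed on it."""
        self._entries[node] = [self.clock(), 0]

    def should_skip(self, node):
        """True while the node is inside its skip window and retry budget."""
        entry = self._entries.get(node)
        if entry is None:
            return False
        failed_at, attempts_since = entry
        expired = (self.clock() - failed_at >= self.skip_seconds
                   or attempts_since >= self.max_skipped_attempts)
        if expired:
            # The node becomes eligible again; forget the old failure.
            del self._entries[node]
            return False
        return True

    def pick_node(self, candidates):
        """Pick a node for the next AM attempt, avoiding skipped nodes."""
        allowed = [n for n in candidates if not self.should_skip(n)]
        # Fail-safe: if every candidate is skipped, fall back to all of them
        # rather than leaving the application without an AM.
        chosen = (allowed or list(candidates))[0]
        # This attempt counts against the retry budget of every skipped node.
        for entry in self._entries.values():
            entry[1] += 1
        return chosen
```

With this sketch, a failure on node1 sends the next attempt to node2 when both 
are candidates, and node1 is considered again once the duration or the retry 
count is used up.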

bq. known node failure events (counts against node reliability)
A proposal was made earlier in YARN-2293, where a count or score was kept 
against the reliability of a node (here container failures also contribute to 
node reliability). I can see that SLIDER-856 takes a somewhat similar approach. 
Do you see any advantages of doing this in YARN?
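
As a rough illustration of what a per-node count could look like (this is my 
own simplification in Python, not the design from YARN-2293 or SLIDER-856; the 
names, the threshold, and the decay-on-success rule are assumptions):

```python
from collections import Counter


class NodeReliability:
    """Hypothetical failure count kept per node."""

    def __init__(self, failure_threshold=3):
        # Illustrative threshold; a real policy would make this configurable.
        self.failure_threshold = failure_threshold
        self._failures = Counter()

    def record_container_failure(self, node):
        """Each container failure counts against the node."""
        self._failures[node] += 1

    def record_container_success(self, node):
        """Assumption: a success pays back one failure, never below zero."""
        if self._failures[node] > 0:
            self._failures[node] -= 1

    def is_reliable(self, node):
        """A node is reliable while its count stays under the threshold."""
        return self._failures[node] < self.failure_threshold
```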



> Blacklisting support for scheduling AMs
> ---------------------------------------
>
>                 Key: YARN-2005
>                 URL: https://issues.apache.org/jira/browse/YARN-2005
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 0.23.10, 2.4.0
>            Reporter: Jason Lowe
>            Assignee: Anubhav Dhoot
>
> It would be nice if the RM supported blacklisting a node for an AM launch 
> after the same node fails a configurable number of AM attempts.  This would 
> be similar to the blacklisting support for scheduling task attempts in the 
> MapReduce AM but for scheduling AM attempts on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
