[
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578597#comment-14578597
]
Sunil G commented on YARN-2005:
-------------------------------
Hi [[email protected]]
In our environments we have seen AM container launch failures on specific nodes
due to memory issues (-Xmx configs), while the same launch worked fine on other
nodes. So if an AM container fails on node1, its second attempt can be tried on
a node other than node1. We could limit that skipping to a certain duration or
a certain number of retries. This is not a clear-cut solution, but it is
something of a fail-safe option.
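
To make the idea concrete, here is a minimal Python sketch of skipping a failed node for a limited time or a limited number of attempts. It is not YARN code: the class name AMNodeSkipList, the parameters skip_seconds and max_skips, and their default values are all made up for illustration, and the current time is passed in explicitly to keep the behaviour easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class _Entry:
    failed_at: float   # time of the AM launch failure on this node
    skips_left: int    # remaining attempts for which the node is skipped


@dataclass
class AMNodeSkipList:
    """Hypothetical per-application record of nodes to avoid for AM attempts."""
    skip_seconds: float = 600.0   # assumed expiry window, not a YARN setting
    max_skips: int = 2            # assumed number of attempts to skip the node
    _entries: Dict[str, _Entry] = field(default_factory=dict)

    def record_failure(self, node: str, now: float) -> None:
        # Remember that an AM attempt failed on this node.
        self._entries[node] = _Entry(failed_at=now, skips_left=self.max_skips)

    def is_skipped(self, node: str, now: float) -> bool:
        entry = self._entries.get(node)
        if entry is None:
            return False
        expired = now - entry.failed_at >= self.skip_seconds
        if expired or entry.skips_left <= 0:
            # Duration elapsed or retry budget used up: the node is eligible again.
            del self._entries[node]
            return False
        return True

    def pick_node(self, candidates: Iterable[str], now: float) -> Optional[str]:
        candidates = list(candidates)
        allowed = [n for n in candidates if not self.is_skipped(n, now)]
        # Each new attempt uses up one skip for every node still on the list.
        for entry in self._entries.values():
            entry.skips_left -= 1
        if allowed:
            return allowed[0]
        # Fail-safe: if every node is skipped, still place the AM somewhere.
        return candidates[0] if candidates else None


skips = AMNodeSkipList(skip_seconds=600.0, max_skips=2)
skips.record_failure("node1", now=0.0)
second_attempt_node = skips.pick_node(["node1", "node2"], now=10.0)
```

With these assumed settings the second attempt avoids node1, and node1 becomes eligible again after two skipped attempts or after 600 seconds, whichever comes first. Falling back to a skipped node when nothing else is available keeps the skip list a hint rather than a hard constraint.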
bq. known node failure events (counts against node reliability)
A proposal was made earlier in YARN-2293, where a count or score was kept
against the reliability of a node (container failures would also contribute to
that node's reliability score). I see that SLIDER-856 takes a somewhat similar
approach. Do you see any advantages in doing this in YARN?
> Blacklisting support for scheduling AMs
> ---------------------------------------
>
> Key: YARN-2005
> URL: https://issues.apache.org/jira/browse/YARN-2005
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 0.23.10, 2.4.0
> Reporter: Jason Lowe
> Assignee: Anubhav Dhoot
>
> It would be nice if the RM supported blacklisting a node for an AM launch
> after the same node fails a configurable number of AM attempts. This would
> be similar to the blacklisting support for scheduling task attempts in the
> MapReduce AM but for scheduling AM attempts on the RM side.