[ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578577#comment-14578577
 ] 

Steve Loughran commented on YARN-2005:
--------------------------------------

Don't do it yet, but plan for a future version to add liveness probes, which is 
what we're adding to Slider soon. The AM already registers its IPC and HTTP 
ports; if the AM could also register a health URL, such as the Codahale 
/healthy URL, then something near the RM could decide when the AM had failed. 
For that we need:
* URLs to be provided at AM registration, or updated later
* something to do the liveness checks. The RM is overloaded on a big cluster, 
but a little YARN service that could be launched standalone or embedded would 
be enough. I have all the code for liveness probes (basic TCP, HTTP GETs and 
status, with a launch-track policy: you are given time to start, but once a 
probe is up, it must stay up). Of course, it'd need to run on an RM node so 
that the redirect logic doesn't bounce it through the RM proxy.
* AMs to provide simple health URLs which return an HTTP error code on failure, 
200 if happy.
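
The launch-track policy above could look roughly like the following. This is 
only an illustrative sketch, not the Slider code: the names `State`, 
`http_probe` and `LaunchTrackMonitor`, the grace-period parameter, and the 
choice to treat a single failed probe after going live as a failure are all 
assumptions made for the example.

```python
import http.client
import time
import urllib.error
import urllib.request
from enum import Enum


class State(Enum):
    STARTING = "starting"  # inside the startup grace period, no success yet
    LIVE = "live"          # at least one probe has succeeded
    FAILED = "failed"      # never came up in time, or went down after being up


def http_probe(url, timeout=5.0):
    """Return True only if a GET on `url` answers 200.

    urlopen raises HTTPError for 4xx/5xx responses, so any error status,
    connection problem or malformed URL is reported as "not healthy".
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        return False


class LaunchTrackMonitor:
    """Hypothetical launch-track policy: time to start, then must stay up."""

    def __init__(self, probe, startup_grace_secs, clock=time.monotonic):
        self._probe = probe              # zero-argument callable returning bool
        self._grace = startup_grace_secs
        self._clock = clock              # injectable so the policy is testable
        self._started_at = clock()
        self.state = State.STARTING

    def check(self):
        """Run one probe and return the resulting state."""
        if self.state is State.FAILED:
            return self.state            # failure is sticky in this sketch
        if self._probe():
            self.state = State.LIVE
        elif self.state is State.LIVE:
            self.state = State.FAILED    # was up, is now down
        elif self._clock() - self._started_at > self._grace:
            self.state = State.FAILED    # never came up within the grace period
        return self.state
```

A checker service would build one monitor per registered AM, for example 
`LaunchTrackMonitor(lambda: http_probe(health_url), 60)` where `health_url` is 
whatever the AM registered, and call `check()` on a timer. A real 
implementation would probably want to tolerate a few consecutive failures 
before declaring the AM dead.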



> Blacklisting support for scheduling AMs
> ---------------------------------------
>
>                 Key: YARN-2005
>                 URL: https://issues.apache.org/jira/browse/YARN-2005
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 0.23.10, 2.4.0
>            Reporter: Jason Lowe
>            Assignee: Anubhav Dhoot
>
> It would be nice if the RM supported blacklisting a node for an AM launch 
> after the same node fails a configurable number of AM attempts.  This would 
> be similar to the blacklisting support for scheduling task attempts in the 
> MapReduce AM but for scheduling AM attempts on the RM side.



