Steve Loughran commented on YARN-2005:

Don't do it yet, but plan for a future version to add liveness probes, which is
what we're adding to Slider soon. The AM already registers its IPC and HTTP
ports; if the AM could also register a health URL, such as the Codahale
/healthy URL, then something near the RM could decide when the AM had failed.
For that we need:
* URLs to be provided at AM registration, or updated later
* something to do the liveness checks. The RM is overloaded on a big cluster, 
but a little YARN service that could be launched standalone or embedded would 
be enough. I have all the code for liveness probes (basic TCP, HTTP GETs and
status checks, with a launch-track policy: you are given time to start, but
once a probe is up, it must stay up). Of course, it'd need to run on an RM node
so that the redirect logic doesn't bounce it through the RM proxy.
* AMs to provide simple health URLs which return an HTTP error code on failure, 
200 if happy.
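
To make the launch-track policy above concrete, here is a minimal sketch. It is
not the Slider code; it is written in Python for brevity, and the names
(LaunchTrackMonitor, http_probe, startup_grace_seconds) are hypothetical. It
assumes a health URL counts as "up" only when a GET returns 200, and that a
failure is terminal once declared.

```python
import http.client
import urllib.error
import urllib.request
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    STARTING = "starting"  # no successful probe yet, still inside the grace period
    LIVE = "live"          # the most recent probe succeeded
    FAILED = "failed"      # startup timed out, or a probe failed after going live


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True only if a GET on url answers with HTTP 200.

    Any error (HTTP error status, connection refused, timeout, bad URL)
    is treated as an unhealthy result. Note that urlopen follows redirects.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return False


@dataclass
class LaunchTrackMonitor:
    """Hypothetical tracker: time is allowed to start, but once up, stay up."""

    startup_grace_seconds: float
    started_at: float
    status: Status = Status.STARTING

    def record(self, probe_ok: bool, now: float) -> Status:
        """Feed in one probe result taken at time `now` (seconds)."""
        if self.status is Status.FAILED:
            return self.status  # assumption: failure is terminal in this sketch
        if probe_ok:
            self.status = Status.LIVE
        elif self.status is Status.LIVE:
            # The probe was up before and is now down: declare failure.
            self.status = Status.FAILED
        elif now - self.started_at > self.startup_grace_seconds:
            # Never came up within the allowed startup time.
            self.status = Status.FAILED
        return self.status
```

A checker would call http_probe(url) on whatever health URL the AM registered,
at some fixed interval, and pass each result to record() together with the
current time; what happens once the status becomes FAILED is left open here.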

> Blacklisting support for scheduling AMs
> ---------------------------------------
>                 Key: YARN-2005
>                 URL: https://issues.apache.org/jira/browse/YARN-2005
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 0.23.10, 2.4.0
>            Reporter: Jason Lowe
>            Assignee: Anubhav Dhoot
> It would be nice if the RM supported blacklisting a node for an AM launch 
> after the same node fails a configurable number of AM attempts.  This would 
> be similar to the blacklisting support for scheduling task attempts in the 
> MapReduce AM but for scheduling AM attempts on the RM side.

This message was sent by Atlassian JIRA
