[ 
https://issues.apache.org/jira/browse/YARN-10955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415388#comment-17415388
 ] 

Tao Yang commented on YARN-10955:
---------------------------------

Any suggestions and comments are welcome!

cc [~cheersyang], [~leftnoteasy], [~sunil.g], I hope to hear your thoughts on 
this.

Thanks!

> Add health check mechanism to improve troubleshooting skills for RM
> -------------------------------------------------------------------
>
>                 Key: YARN-10955
>                 URL: https://issues.apache.org/jira/browse/YARN-10955
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>
> RM is the most complex component in YARN, with many basic or core services 
> including RPC servers, event dispatchers, the HTTP server, the core 
> scheduler and state managers, and some of them depend on other basic 
> components such as ZooKeeper and HDFS. 
> Currently, when we encounter an unclear issue, we may have to dig through 
> many related metrics and a huge volume of logs for suspicious traces, hoping 
> to locate the root cause. For example, some applications stay in the 
> NEW_SAVING state, which can be caused by the loss of ZooKeeper connections 
> or by a jam in the event dispatcher, and the useful traces are buried in 
> many metrics and logs. It is not easy to figure out what happened even for 
> experts, let alone common users.
> So I propose adding a common health check mechanism to improve 
> troubleshooting for RM. As a general idea, we can
>  * add a HealthReporter interface as follows:
> {code:java}
> public interface HealthReporter {
>   HealthReport getHealthReport();
> }
> {code}
> HealthReport can have some generic fields like isHealthy(boolean), 
> updateTime(long), diagnostics(string) and keyMetrics(Map<String, Object>).
>  * make some key services implement the HealthReporter interface and 
> generate health reports by evaluating their internal state.
>  * add a HealthCheckerService that manages and monitors all reportable 
> services, supports checking and fetching health reports both periodically 
> and manually (for example, triggered by a REST API), and publishes metrics 
> and logs as well.
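
The three points above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the design or patch attached to YARN-10955: the HealthReporter interface and the HealthReport field names come from the description, while the constructor and getters, EventQueueHealthReporter, its queue-size threshold, and the register/checkAll/start/stop methods of HealthCheckerService are hypothetical names chosen for the example. The REST trigger and the metrics/log publishing are left out.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

/** Interface as given in the issue description. */
interface HealthReporter {
  HealthReport getHealthReport();
}

/** Immutable snapshot with the generic fields named in the description. */
final class HealthReport {
  private final boolean healthy;
  private final long updateTime;
  private final String diagnostics;
  private final Map<String, Object> keyMetrics;

  HealthReport(boolean healthy, long updateTime, String diagnostics,
      Map<String, Object> keyMetrics) {
    this.healthy = healthy;
    this.updateTime = updateTime;
    this.diagnostics = diagnostics;
    // Defensive copy so later changes by the caller do not alter the report.
    this.keyMetrics = Collections.unmodifiableMap(new LinkedHashMap<>(keyMetrics));
  }

  boolean isHealthy() { return healthy; }
  long getUpdateTime() { return updateTime; }
  String getDiagnostics() { return diagnostics; }
  Map<String, Object> getKeyMetrics() { return keyMetrics; }
}

/** Hypothetical reporter: unhealthy when an event queue backlog exceeds a limit. */
final class EventQueueHealthReporter implements HealthReporter {
  private final IntSupplier queueSize;
  private final int maxQueueSize;

  EventQueueHealthReporter(IntSupplier queueSize, int maxQueueSize) {
    this.queueSize = queueSize;
    this.maxQueueSize = maxQueueSize;
  }

  @Override
  public HealthReport getHealthReport() {
    int size = queueSize.getAsInt();
    boolean healthy = size <= maxQueueSize;
    Map<String, Object> metrics = new LinkedHashMap<>();
    metrics.put("eventQueueSize", size);
    String diagnostics = healthy ? ""
        : "event queue size " + size + " exceeds limit " + maxQueueSize;
    return new HealthReport(healthy, System.currentTimeMillis(), diagnostics, metrics);
  }
}

/** Hypothetical checker: collects reports on demand or at a fixed period. */
final class HealthCheckerService {
  private final Map<String, HealthReporter> reporters = new ConcurrentHashMap<>();
  private volatile Map<String, HealthReport> latestReports = Collections.emptyMap();
  private ScheduledExecutorService scheduler;

  void register(String name, HealthReporter reporter) {
    reporters.put(name, reporter);
  }

  /** Manual check: polls every registered reporter once. */
  Map<String, HealthReport> checkAll() {
    Map<String, HealthReport> reports = new LinkedHashMap<>();
    for (Map.Entry<String, HealthReporter> e : reporters.entrySet()) {
      HealthReport report;
      try {
        report = e.getValue().getHealthReport();
      } catch (RuntimeException ex) {
        // A reporter that fails to answer is itself reported as unhealthy.
        report = new HealthReport(false, System.currentTimeMillis(),
            "health check failed: " + ex, Collections.emptyMap());
      }
      reports.put(e.getKey(), report);
    }
    Map<String, HealthReport> snapshot = Collections.unmodifiableMap(reports);
    latestReports = snapshot;
    return snapshot;
  }

  Map<String, HealthReport> getLatestReports() { return latestReports; }

  /** Periodic check on a daemon thread; ignored if already started. */
  synchronized void start(long period, TimeUnit unit) {
    if (scheduler != null) {
      return;
    }
    scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "health-checker");
      t.setDaemon(true);
      return t;
    });
    scheduler.scheduleAtFixedRate(this::checkAll, period, period, unit);
  }

  synchronized void stop() {
    if (scheduler != null) {
      scheduler.shutdownNow();
      scheduler = null;
    }
  }
}
```

In this sketch a manual check is just a call to checkAll(), so a REST handler could in principle call the same method and serialize the returned map. Keeping each report as an immutable snapshot means a reader never sees a half-updated state while a periodic check is running.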



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

