[ https://issues.apache.org/jira/browse/YARN-10955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415388#comment-17415388 ]
Tao Yang commented on YARN-10955:
---------------------------------

Any suggestions and comments are welcome!
cc [~cheersyang], [~leftnoteasy], [~sunil.g], hope to hear your thoughts about this. Thanks!

> Add health check mechanism to improve troubleshooting skills for RM
> -------------------------------------------------------------------
>
>                 Key: YARN-10955
>                 URL: https://issues.apache.org/jira/browse/YARN-10955
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>
> RM is the most complex component in YARN, with many basic or core services
> including RPC servers, event dispatchers, the HTTP server, the core scheduler,
> state managers, etc., and some of them depend on other basic components such
> as ZooKeeper and HDFS.
> Currently, when we encounter an unclear issue, we may have to search for
> suspicious traces across many related metrics and a huge volume of logs in
> the hope of locating the root cause. For example, some applications may stay
> in the NEW_SAVING state, which can be caused by a loss of ZooKeeper
> connections or by a jam in the event dispatcher, and the useful traces are
> buried in many metrics and logs. It is not easy to figure out what happened
> even for experts, let alone ordinary users.
> So I propose adding a common health check mechanism to improve
> troubleshooting for RM. As a general idea, we can
> * add a HealthReporter interface as follows:
> {code:java}
> public interface HealthReporter {
>   HealthReport getHealthReport();
> }
> {code}
> HealthReport can have some generic fields like isHealthy(boolean),
> updateTime(long), diagnostics(string) and keyMetrics(Map<String, Object>).
> * make some key services implement the HealthReporter interface and generate
> health reports by evaluating their internal state.
> * add a HealthCheckerService that manages and monitors all reportable
> services, supports checking and fetching health reports both periodically and
> manually (for example, triggered via a REST API), and publishes metrics and
> logs as well.
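
To make the three bullets in the proposal a bit more concrete, here is a minimal sketch of how the pieces might fit together. Only HealthReporter, HealthReport and its four fields, and the name HealthCheckerService come from the issue description. Everything else is an assumption made for illustration and is not existing YARN code: the DispatcherHealthReporter class, the register/checkNow/start/stop methods, the "eventQueueSize" metric key and the backlog threshold. Metrics and log publishing are left out.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

// Immutable report carrying the generic fields named in the proposal.
final class HealthReport {
  private final boolean healthy;
  private final long updateTime;
  private final String diagnostics;
  private final Map<String, Object> keyMetrics;

  HealthReport(boolean healthy, long updateTime, String diagnostics,
      Map<String, Object> keyMetrics) {
    this.healthy = healthy;
    this.updateTime = updateTime;
    this.diagnostics = diagnostics;
    this.keyMetrics = Collections.unmodifiableMap(new LinkedHashMap<>(keyMetrics));
  }

  boolean isHealthy() { return healthy; }
  long getUpdateTime() { return updateTime; }
  String getDiagnostics() { return diagnostics; }
  Map<String, Object> getKeyMetrics() { return keyMetrics; }
}

// Interface as given in the issue description (package-private in this sketch).
interface HealthReporter {
  HealthReport getHealthReport();
}

// Hypothetical reporter: flags an event dispatcher whose queue backlog is too large.
final class DispatcherHealthReporter implements HealthReporter {
  private final IntSupplier queueSize;
  private final int maxBacklog; // assumed threshold, not an existing YARN setting

  DispatcherHealthReporter(IntSupplier queueSize, int maxBacklog) {
    this.queueSize = queueSize;
    this.maxBacklog = maxBacklog;
  }

  @Override
  public HealthReport getHealthReport() {
    int size = queueSize.getAsInt();
    boolean healthy = size <= maxBacklog;
    String diagnostics = healthy ? ""
        : "event queue backlog " + size + " exceeds " + maxBacklog;
    Map<String, Object> metrics = new LinkedHashMap<>();
    metrics.put("eventQueueSize", size);
    return new HealthReport(healthy, System.currentTimeMillis(), diagnostics, metrics);
  }
}

// Hypothetical checker: keeps the latest report per registered service.
final class HealthCheckerService {
  private final Map<String, HealthReporter> reporters = new ConcurrentHashMap<>();
  private final Map<String, HealthReport> latest = new ConcurrentHashMap<>();
  private ScheduledExecutorService scheduler;

  void register(String name, HealthReporter reporter) {
    reporters.put(name, reporter);
  }

  // Manual check, e.g. what a REST handler could call. A reporter that throws
  // or returns null is recorded as unhealthy instead of aborting the whole pass.
  Map<String, HealthReport> checkNow() {
    reporters.forEach((name, reporter) -> {
      HealthReport report;
      try {
        report = Objects.requireNonNull(reporter.getHealthReport());
      } catch (RuntimeException e) {
        report = new HealthReport(false, System.currentTimeMillis(),
            "health check failed: " + e, Collections.emptyMap());
      }
      latest.put(name, report);
    });
    return Collections.unmodifiableMap(new LinkedHashMap<>(latest));
  }

  // Periodic check on a single background thread.
  synchronized void start(long periodMillis) {
    scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(this::checkNow, periodMillis, periodMillis,
        TimeUnit.MILLISECONDS);
  }

  synchronized void stop() {
    if (scheduler != null) {
      scheduler.shutdownNow();
    }
  }
}
```

In this sketch a failing reporter is itself treated as an unhealthy signal, so one misbehaving service cannot silently stop the periodic checks for the others. Whether the real mechanism should behave that way is open for discussion.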