[jira] [Created] (YARN-10955) Add health check mechanism to improve troubleshooting skills for RM

Tao Yang (Jira) Wed, 15 Sep 2021 01:36:54 -0700

Tao Yang created YARN-10955:
-------------------------------

             Summary: Add health check mechanism to improve troubleshooting 
skills for RM
                 Key: YARN-10955
                 URL: https://issues.apache.org/jira/browse/YARN-10955
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: resourcemanager
            Reporter: Tao Yang
            Assignee: Tao Yang



RM is the most complex component in YARN, with many basic and core services 
including RPC servers, event dispatchers, an HTTP server, the core scheduler 
and state managers. Some of them depend on other basic components such as 
ZooKeeper and HDFS. 

Currently, when we encounter an unclear issue, we may have to search through 
many related metrics and a huge volume of logs for suspicious traces, hoping 
to locate the root cause. For example, some applications may stay in the 
NEW_SAVING state, which can be caused by lost ZooKeeper connections or a jam 
in the event dispatcher, while the useful traces are buried among many metrics 
and logs. It is not easy to figure out what happened even for experts, let 
alone common users.

So I propose adding a common health check mechanism to improve troubleshooting 
for RM. My general idea is that we can:
 * add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields like isHealthy(boolean), 
updateTime(long), diagnostics(string) and keyMetrics(Map<String, Object>).

 * make some key services implement the HealthReporter interface and generate 
health reports by evaluating their internal state.
 * add a HealthCheckerService that manages and monitors all reportable 
services, supports checking and fetching health reports both periodically and 
manually (triggered via a REST API), and publishes metrics and logs as well.
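
To make the proposal above more concrete, here is a minimal sketch of how the 
pieces could fit together. It is only an illustration under assumptions, not a 
proposed patch: the field layout of HealthReport follows the description 
above, while DispatcherHealthReporter, its queue-size threshold, and the 
register/checkAll/unhealthyServices methods are hypothetical names that do not 
exist in YARN. Periodic scheduling, the REST API, and metrics publishing are 
left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntSupplier;

// Immutable report carrying the generic fields named in the proposal.
final class HealthReport {
  private final boolean healthy;
  private final long updateTime;
  private final String diagnostics;
  private final Map<String, Object> keyMetrics;

  HealthReport(boolean healthy, long updateTime, String diagnostics,
      Map<String, Object> keyMetrics) {
    this.healthy = healthy;
    this.updateTime = updateTime;
    this.diagnostics = diagnostics;
    // Defensive copy so callers cannot mutate a published report.
    this.keyMetrics = Collections.unmodifiableMap(new LinkedHashMap<>(keyMetrics));
  }

  boolean isHealthy() { return healthy; }
  long getUpdateTime() { return updateTime; }
  String getDiagnostics() { return diagnostics; }
  Map<String, Object> getKeyMetrics() { return keyMetrics; }
}

interface HealthReporter {
  HealthReport getHealthReport();
}

// Hypothetical reporter: treats an event dispatcher as unhealthy when its
// queue backlog exceeds an assumed threshold.
final class DispatcherHealthReporter implements HealthReporter {
  private final IntSupplier queueSize;
  private final int maxQueueSize;

  DispatcherHealthReporter(IntSupplier queueSize, int maxQueueSize) {
    this.queueSize = queueSize;
    this.maxQueueSize = maxQueueSize;
  }

  @Override
  public HealthReport getHealthReport() {
    int size = queueSize.getAsInt();
    boolean healthy = size <= maxQueueSize;
    Map<String, Object> metrics = new LinkedHashMap<>();
    metrics.put("queueSize", size);
    String diagnostics = healthy
        ? ""
        : "event queue backlog " + size + " exceeds " + maxQueueSize;
    return new HealthReport(healthy, System.currentTimeMillis(), diagnostics, metrics);
  }
}

// Hypothetical checker: keeps a registry of reporters and collects their
// reports on demand (the "manual" check path).
final class HealthCheckerService {
  private final Map<String, HealthReporter> reporters = new ConcurrentHashMap<>();

  void register(String serviceName, HealthReporter reporter) {
    reporters.put(serviceName, reporter);
  }

  // Collects a fresh report from every registered service, sorted by name.
  Map<String, HealthReport> checkAll() {
    Map<String, HealthReport> reports = new TreeMap<>();
    for (Map.Entry<String, HealthReporter> e : reporters.entrySet()) {
      reports.put(e.getKey(), e.getValue().getHealthReport());
    }
    return reports;
  }

  // Names of services whose latest report is unhealthy.
  List<String> unhealthyServices() {
    List<String> names = new ArrayList<>();
    for (Map.Entry<String, HealthReport> e : checkAll().entrySet()) {
      if (!e.getValue().isHealthy()) {
        names.add(e.getKey());
      }
    }
    return names;
  }
}
```

With something like this, the NEW_SAVING example could be narrowed down by 
calling checkAll() and reading the diagnostics of whichever service reports 
itself unhealthy, rather than searching logs first. Periodic checks could 
presumably be driven by a scheduled executor calling the same method.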



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
