Tao Yang created YARN-10955:
-------------------------------
Summary: Add health check mechanism to improve troubleshooting
skills for RM
Key: YARN-10955
URL: https://issues.apache.org/jira/browse/YARN-10955
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Reporter: Tao Yang
Assignee: Tao Yang
RM is the most complex component in YARN, with many basic or core services
including RPC servers, event dispatchers, an HTTP server, the core scheduler and
state managers, and some of them depend on other basic components such as
ZooKeeper and HDFS.
Currently, when we encounter an unclear issue, we may have to search many
related metrics and a tremendous volume of logs for suspicious traces, hoping to
locate the root cause. For example, some applications may keep staying in the
NEW_SAVING state, which can be caused by lost ZooKeeper connections or by a jam
in the event dispatcher, and the useful traces are buried in many metrics and
logs. It is not easy to figure out what happened, even for experts, let alone
common users.
So I propose adding a common health check mechanism to improve the
troubleshooting capability of RM. My general idea is that we can
* add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  /** Returns the current health report of this service. */
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields such as isHealthy (boolean),
updateTime (long), diagnostics (String) and keyMetrics (Map<String, Object>).
* make some key services implement the HealthReporter interface and generate
health reports by evaluating their internal state.
* add a HealthCheckerService which manages and monitors all reportable
services, supports checking and fetching health reports both periodically and
manually (can be triggered via a REST API), and publishes metrics and logs as
well.
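
To make the three points above more concrete, here is a minimal sketch of how
the pieces might fit together. It is only an illustration under assumptions,
not the proposed patch: the HealthReport constructor and getters, the
EventQueueHealthReporter class with its pendingEvents threshold, and the
register()/checkAll() methods on HealthCheckerService are all hypothetical
names chosen for this example. Only HealthReporter, getHealthReport() and the
four HealthReport fields come from the description above.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Immutable snapshot of one service's health; fields follow the proposal. */
final class HealthReport {
  private final boolean healthy;
  private final long updateTime;
  private final String diagnostics;
  private final Map<String, Object> keyMetrics;

  HealthReport(boolean healthy, long updateTime, String diagnostics,
      Map<String, Object> keyMetrics) {
    this.healthy = healthy;
    this.updateTime = updateTime;
    this.diagnostics = diagnostics;
    // Defensive copy so later changes by the caller do not alter the report.
    this.keyMetrics =
        Collections.unmodifiableMap(new LinkedHashMap<>(keyMetrics));
  }

  boolean isHealthy() { return healthy; }
  long getUpdateTime() { return updateTime; }
  String getDiagnostics() { return diagnostics; }
  Map<String, Object> getKeyMetrics() { return keyMetrics; }
}

interface HealthReporter {
  HealthReport getHealthReport();
}

/** Hypothetical reporter: unhealthy once pending events exceed a limit. */
final class EventQueueHealthReporter implements HealthReporter {
  private final AtomicInteger pendingEvents = new AtomicInteger();
  private final int maxPendingEvents;

  EventQueueHealthReporter(int maxPendingEvents) {
    this.maxPendingEvents = maxPendingEvents;
  }

  void setPendingEvents(int n) { pendingEvents.set(n); }

  @Override
  public HealthReport getHealthReport() {
    int pending = pendingEvents.get();
    boolean healthy = pending <= maxPendingEvents;
    Map<String, Object> metrics = new LinkedHashMap<>();
    metrics.put("pendingEvents", pending);
    String diagnostics = healthy ? ""
        : "event queue backlog " + pending + " exceeds " + maxPendingEvents;
    return new HealthReport(healthy, System.currentTimeMillis(),
        diagnostics, metrics);
  }
}

/** Hypothetical registry that collects reports from registered services. */
final class HealthCheckerService {
  private final Map<String, HealthReporter> reporters =
      new ConcurrentHashMap<>();

  void register(String name, HealthReporter reporter) {
    reporters.put(name, reporter);
  }

  /** Runs every check once; a reporter that throws is marked unhealthy. */
  Map<String, HealthReport> checkAll() {
    Map<String, HealthReport> reports = new LinkedHashMap<>();
    for (Map.Entry<String, HealthReporter> e : reporters.entrySet()) {
      HealthReport report;
      try {
        report = e.getValue().getHealthReport();
      } catch (RuntimeException ex) {
        report = new HealthReport(false, System.currentTimeMillis(),
            "health check failed: " + ex,
            Collections.<String, Object>emptyMap());
      }
      reports.put(e.getKey(), report);
    }
    return reports;
  }
}
```

In this sketch a manual check is just a call to checkAll(), which a REST
handler could invoke. A periodic check could call the same method from a
java.util.concurrent.ScheduledExecutorService. Treating an exception from a
reporter as an unhealthy report, rather than letting it propagate, is one
possible design choice so that a single broken check does not hide the others.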