[jira] [Updated] (YARN-10955) Add health check mechanism to improve troubleshooting skills for RM

2021-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-10955:
--
Labels: pull-request-available  (was: )

> Add health check mechanism to improve troubleshooting skills for RM
> ---
>
> Key: YARN-10955
> URL: https://issues.apache.org/jira/browse/YARN-10955
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> RM is the most complex component in YARN. It hosts many basic and core
> services, including RPC servers, event dispatchers, an HTTP server, the core
> scheduler, and state managers, and some of them depend on other basic
> components such as ZooKeeper and HDFS.
> Currently, when we encounter an unclear issue, we may have to search many
> related metrics and huge volumes of logs for suspicious traces, hoping to
> locate the root cause. For example, some applications may stay in the
> NEW_SAVING state, which can be caused by lost ZooKeeper connections or a jam
> in the event dispatcher, yet the useful traces are buried in many metrics
> and logs. It is not easy to figure out what happened, even for experts, let
> alone common users.
> So I propose adding a common health check mechanism to improve
> troubleshooting for the RM. My general idea is that we can:
>  * add a HealthReporter interface as follows:
> {code:java}
> public interface HealthReporter {
>   HealthReport getHealthReport();
> }
> {code}
> HealthReport can have some generic fields such as isHealthy (boolean),
> workState (enum: NORMAL/IDLE/BUSY), updateTime (long), diagnostics (string),
> and keyMetrics (Map).
>  * make some key services implement the HealthReporter interface and generate
> a health report by evaluating their internal state.
>  * add a HealthCheckerService that manages and monitors all reportable
> services, supports checking and fetching health reports both periodically and
> manually (triggered through a REST API), and publishes metrics and logs as
> well.
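
The three steps proposed above could be sketched roughly as follows. This is only an illustration under assumptions, not the design from the issue or its pull request: the HealthReporter interface and the HealthReport field names come from the description, while DispatcherHealthReporter, its thresholds, and the register/checkNow/start/stop methods of HealthCheckerService are hypothetical names chosen for the example.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

// Work states named in the description.
enum WorkState { NORMAL, IDLE, BUSY }

// Immutable snapshot holding the generic fields listed in the description.
final class HealthReport {
  final boolean healthy;
  final WorkState workState;
  final long updateTime;
  final String diagnostics;
  final Map<String, Object> keyMetrics;

  HealthReport(boolean healthy, WorkState workState, long updateTime,
      String diagnostics, Map<String, Object> keyMetrics) {
    this.healthy = healthy;
    this.workState = workState;
    this.updateTime = updateTime;
    this.diagnostics = diagnostics;
    this.keyMetrics = Collections.unmodifiableMap(new LinkedHashMap<>(keyMetrics));
  }

  boolean isHealthy() { return healthy; }
}

// The interface from the description.
interface HealthReporter {
  HealthReport getHealthReport();
}

// Hypothetical reporter: judges an event dispatcher by its pending-event count.
// Both thresholds are assumptions made for this sketch.
final class DispatcherHealthReporter implements HealthReporter {
  private final IntSupplier queueSize;
  private final int busyThreshold;
  private final int unhealthyThreshold;

  DispatcherHealthReporter(IntSupplier queueSize, int busyThreshold, int unhealthyThreshold) {
    this.queueSize = queueSize;
    this.busyThreshold = busyThreshold;
    this.unhealthyThreshold = unhealthyThreshold;
  }

  @Override
  public HealthReport getHealthReport() {
    int pending = queueSize.getAsInt();
    boolean healthy = pending < unhealthyThreshold;
    WorkState state = pending == 0 ? WorkState.IDLE
        : pending >= busyThreshold ? WorkState.BUSY : WorkState.NORMAL;
    String diagnostics = healthy ? "" : "event queue backlog: " + pending;
    Map<String, Object> metrics = new LinkedHashMap<>();
    metrics.put("pendingEvents", pending);
    return new HealthReport(healthy, state, System.currentTimeMillis(), diagnostics, metrics);
  }
}

// Hypothetical checker: polls registered reporters and caches the latest reports.
final class HealthCheckerService {
  private final Map<String, HealthReporter> reporters = new ConcurrentHashMap<>();
  private final Map<String, HealthReport> latest = new ConcurrentHashMap<>();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void register(String name, HealthReporter reporter) {
    reporters.put(name, reporter);
  }

  // Manual check; a REST handler could call this on demand.
  Map<String, HealthReport> checkNow() {
    for (Map.Entry<String, HealthReporter> e : reporters.entrySet()) {
      HealthReport report;
      try {
        report = e.getValue().getHealthReport();
      } catch (RuntimeException ex) {
        // A failing reporter is recorded as unhealthy instead of stopping the checks.
        report = new HealthReport(false, WorkState.NORMAL, System.currentTimeMillis(),
            "health check failed: " + ex, Collections.emptyMap());
      }
      latest.put(e.getKey(), report);
    }
    return new LinkedHashMap<>(latest);
  }

  // Periodic check at a fixed interval.
  void start(long intervalMs) {
    scheduler.scheduleAtFixedRate(this::checkNow, intervalMs, intervalMs,
        TimeUnit.MILLISECONDS);
  }

  void stop() {
    scheduler.shutdownNow();
  }
}
```

In this sketch a service would be registered once, for example checker.register("dispatcher", reporter), and then checked either periodically through start(intervalMs) or on demand through checkNow(). Publishing the cached reports as metrics and logs is left out for brevity.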






[jira] [Updated] (YARN-10955) Add health check mechanism to improve troubleshooting skills for RM

2021-09-16 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10955:

Description: 
RM is the most complex component in YARN. It hosts many basic and core
services, including RPC servers, event dispatchers, an HTTP server, the core
scheduler, and state managers, and some of them depend on other basic
components such as ZooKeeper and HDFS.

Currently, when we encounter an unclear issue, we may have to search many
related metrics and huge volumes of logs for suspicious traces, hoping to
locate the root cause. For example, some applications may stay in the
NEW_SAVING state, which can be caused by lost ZooKeeper connections or a jam
in the event dispatcher, yet the useful traces are buried in many metrics and
logs. It is not easy to figure out what happened, even for experts, let alone
common users.

So I propose adding a common health check mechanism to improve
troubleshooting for the RM. My general idea is that we can:
 * add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields such as isHealthy (boolean),
workState (enum: NORMAL/IDLE/BUSY), updateTime (long), diagnostics (string),
and keyMetrics (Map).

 * make some key services implement the HealthReporter interface and generate
a health report by evaluating their internal state.
 * add a HealthCheckerService that manages and monitors all reportable
services, supports checking and fetching health reports both periodically and
manually (triggered through a REST API), and publishes metrics and logs as
well.

  was:
RM is the most complex component in YARN. It hosts many basic and core
services, including RPC servers, event dispatchers, an HTTP server, the core
scheduler, and state managers, and some of them depend on other basic
components such as ZooKeeper and HDFS.

Currently, when we encounter an unclear issue, we may have to search many
related metrics and huge volumes of logs for suspicious traces, hoping to
locate the root cause. For example, some applications may stay in the
NEW_SAVING state, which can be caused by lost ZooKeeper connections or a jam
in the event dispatcher, yet the useful traces are buried in many metrics and
logs. It is not easy to figure out what happened, even for experts, let alone
common users.

So I propose adding a common health check mechanism to improve
troubleshooting for the RM. My general idea is that we can:
 * add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields such as isHealthy (boolean),
processState (enum: NORMAL/IDLE/BUSY), updateTime (long), diagnostics
(string), and keyMetrics (Map).

 * make some key services implement the HealthReporter interface and generate
a health report by evaluating their internal state.
 * add a HealthCheckerService that manages and monitors all reportable
services, supports checking and fetching health reports both periodically and
manually (triggered through a REST API), and publishes metrics and logs as
well.



[jira] [Updated] (YARN-10955) Add health check mechanism to improve troubleshooting skills for RM

2021-09-16 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10955:

Description: 
RM is the most complex component in YARN. It hosts many basic and core
services, including RPC servers, event dispatchers, an HTTP server, the core
scheduler, and state managers, and some of them depend on other basic
components such as ZooKeeper and HDFS.

Currently, when we encounter an unclear issue, we may have to search many
related metrics and huge volumes of logs for suspicious traces, hoping to
locate the root cause. For example, some applications may stay in the
NEW_SAVING state, which can be caused by lost ZooKeeper connections or a jam
in the event dispatcher, yet the useful traces are buried in many metrics and
logs. It is not easy to figure out what happened, even for experts, let alone
common users.

So I propose adding a common health check mechanism to improve
troubleshooting for the RM. My general idea is that we can:
 * add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields such as isHealthy (boolean),
processState (enum: NORMAL/IDLE/BUSY), updateTime (long), diagnostics
(string), and keyMetrics (Map).

 * make some key services implement the HealthReporter interface and generate
a health report by evaluating their internal state.
 * add a HealthCheckerService that manages and monitors all reportable
services, supports checking and fetching health reports both periodically and
manually (triggered through a REST API), and publishes metrics and logs as
well.

  was:
RM is the most complex component in YARN. It hosts many basic and core
services, including RPC servers, event dispatchers, an HTTP server, the core
scheduler, and state managers, and some of them depend on other basic
components such as ZooKeeper and HDFS.

Currently, when we encounter an unclear issue, we may have to search many
related metrics and huge volumes of logs for suspicious traces, hoping to
locate the root cause. For example, some applications may stay in the
NEW_SAVING state, which can be caused by lost ZooKeeper connections or a jam
in the event dispatcher, yet the useful traces are buried in many metrics and
logs. It is not easy to figure out what happened, even for experts, let alone
common users.

So I propose adding a common health check mechanism to improve
troubleshooting for the RM. My general idea is that we can:
 * add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields such as isHealthy (boolean),
updateTime (long), diagnostics (string), and keyMetrics (Map).

 * make some key services implement the HealthReporter interface and generate
a health report by evaluating their internal state.
 * add a HealthCheckerService that manages and monitors all reportable
services, supports checking and fetching health reports both periodically and
manually (triggered through a REST API), and publishes metrics and logs as
well.

