[ 
https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197870#comment-17197870
 ] 

Szilard Nemeth edited comment on YARN-10421 at 12/18/20, 11:49 PM:
-------------------------------------------------------------------

Hi [~bteke],

Thanks for working on this.

Here are my comments:

1. In DefaultRequestInterceptorREST: The targetpath parameter has the same 
value (RMWSConsts.RM_WEB_SERVICE_PATH + RMWSConsts.SCHEDULE) for both 
getCommonIssueData and getCommonIssueList. Shouldn't you use 
COMMON_ISSUE_COLLECT in method getCommonIssueData?

2. Nit: There some added newlines for some files at the end, please remove 
those.

3. Nit: In FederationInterceptorREST, I would use the string for the exception 
"This feature is not yet implemented" or something like this instead of the 
current "Code is not implemented"

4. For 
org.apache.hadoop.yarn.server.router.webapp.MockRESTRequestInterceptor#getCommonIssueData,
 it doesn't clear why a single OK response is returned. Can you please add a 
comment to explain this behaviour?

5. Just by looking at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices#getCommonIssueData,
 it's hard to guess what values the issueId and issueParams fields could take.
 Can you please add javadoc to the method?
 UPDATE: Oh, I can see this is documented in RMWebServiceProtocol.

6. Can you add API documentation of these 2 new endpoints to the file 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md?

7. The javadoc for 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServiceProtocol#getCommonIssueData
 seems wrong. These are not form params, they are query params as per the 
implementing code, instead.

8. Nit: Can you use uppercase name for 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation?

9. Nit: I would introduce an enum for argument types. You currently have it as 
a string constant here: 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#LIST_ISSUES_ARGUMENT

10. About the script location defined by: 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation.
 
 What makes the file available at location "/tmp/diagnostics_collector.sh" ?

11. In the warning log in method 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#parseIssueType,
 please add the expected number of parameters as well to the message. I would 
also add a return statement after the log message for clarity.

12. Can you please rename the diagnostics_collector.sh, so that it somehow 
reflects it is intended to be used by tests?

13. There's a typo in comment of method 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped:
 ambigious -> ambiguous

14. In testcase: 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped:
 You may simplify the testcase dramatically by using a loop to check the issue 
id, name and parameter.

15. In testcase: 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testParseIssueTypeValidCases:
 The comment says: 
 // valid case: id, name, one parameter
 But the test actually uses 2 parameters so this is a bit misleading.


was (Author: snemeth):
Hi [~bteke],

Thanks for working on this.

Here are my comments:

1. In FederationInterceptorREST: The targetpath parameter has the same value 
(RMWSConsts.RM_WEB_SERVICE_PATH + RMWSConsts.SCHEDULE) for both 
getCommonIssueData and getCommonIssueList. Shouldn't you use 
COMMON_ISSUE_COLLECT in method getCommonIssueData?

2. Nit: There some added newlines for some files at the end, please remove 
those.

3. Nit: In FederationInterceptorREST, I would use the string for the exception 
"This feature is not yet implemented" or something like this instead of the 
current "Code is not implemented"

4. For 
org.apache.hadoop.yarn.server.router.webapp.MockRESTRequestInterceptor#getCommonIssueData,
 it doesn't clear why a single OK response is returned. Can you please add a 
comment to explain this behaviour?

5. Just by looking at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices#getCommonIssueData,
 it's hard to guess what values the issueId and issueParams fields could take.
Can you please add javadoc to the method?
UPDATE: Oh, I can see this is documented in RMWebServiceProtocol.

6. Can you add API documentation of these 2 new endpoints to the file 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md?

7. The javadoc for 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServiceProtocol#getCommonIssueData
 seems wrong. These are not form params, they are query params as per the 
implementing code, instead.

8. Nit: Can you use uppercase name for 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation?

9. Nit: I would introduce an enum for argument types. You currently have it as 
a string constant here: 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#LIST_ISSUES_ARGUMENT

10. About the script location defined by: 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation.
 
What makes the file available at location "/tmp/diagnostics_collector.sh" ?

11. In the warning log in method 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#parseIssueType,
 please add the expected number of parameters as well to the message. I would 
also add a return statement after the log message for clarity.

12. Can you please rename the diagnostics_collector.sh, so that it somehow 
reflects it is intended to be used by tests?

13. There's a typo in comment of method 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped:
 ambigious -> ambiguous

14. In testcase: 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped:
 You may simplify the testcase dramatically by using a loop to check the issue 
id, name and parameter.

15. In testcase: 
org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testParseIssueTypeValidCases:
 The comment says: 
// valid case: id, name, one parameter
But the test actually uses 2 parameters so this is a bit misleading.

> Create YarnDiagnosticsService to serve diagnostic queries 
> ----------------------------------------------------------
>
>                 Key: YARN-10421
>                 URL: https://issues.apache.org/jira/browse/YARN-10421
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Benjamin Teke
>            Assignee: Benjamin Teke
>            Priority: Major
>         Attachments: YARN-10421.001.patch, YARN-10421.002.patch, 
> YARN-10421.003.patch, YARN-10421.004.patch
>
>
> YarnDiagnosticsServlet should run inside ResourceManager Daemon. The servlet 
> forks a separate process, which executes a shell/Python/etc script. Based on 
> the use-cases listed below the script collects information, bundles it and 
> sends it to UI2. The diagnostic options are the following:
>  # Application hanging: 
>  ** Application logs
>  ** Find the hanging container and get multiple Jstacks
>  ** ResourceManager logs during job lifecycle
>  ** NodeManager logs from NodeManager where the hanging containers of the 
> jobs ran
>  ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez 
> History URL
>  # Application failed: 
>  ** Application logs
>  ** ResourceManager logs during job lifecycle.
>  ** NodeManager logs from NodeManager where the hanging containers of the 
> jobs ran
>  ** Job Configuration from MapReduce HistoryServer, Spark HistoryServer, Tez 
> History URL.
>  ** Job related metrics like container, attempts.
>  # Scheduler related issue:
>  ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes.
>  ** Multiple Jstacks of ResourceManager
>  ** YARN and Scheduler Configuration
>  ** Cluster Scheduler API _/ws/v1/cluster/scheduler_ and Cluster Nodes API 
> _/ws/v1/cluster/nodes response_
>  ** Scheduler Activities _/ws/v1/cluster/scheduler/bulkactivities_ response 
> (YARN-10319)
>  # ResourceManager / NodeManager daemon fails to start:
>  ** ResourceManager and NodeManager out and log file
>  ** YARN and Scheduler Configuration
> Two new endpoints should be added to the RM web service: one for listing the 
> available diagnostic options (_/common-issue/list_), and one for calling a 
> selected option with the user provided parameters (_/common-issue/collect_).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to