[ https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197870#comment-17197870 ]
Szilard Nemeth edited comment on YARN-10421 at 12/18/20, 11:49 PM: ------------------------------------------------------------------- Hi [~bteke], Thanks for working on this. Here are my comments: 1. In DefaultRequestInterceptorREST: The targetpath parameter has the same value (RMWSConsts.RM_WEB_SERVICE_PATH + RMWSConsts.SCHEDULE) for both getCommonIssueData and getCommonIssueList. Shouldn't you use COMMON_ISSUE_COLLECT in method getCommonIssueData? 2. Nit: There some added newlines for some files at the end, please remove those. 3. Nit: In FederationInterceptorREST, I would use the string for the exception "This feature is not yet implemented" or something like this instead of the current "Code is not implemented" 4. For org.apache.hadoop.yarn.server.router.webapp.MockRESTRequestInterceptor#getCommonIssueData, it doesn't clear why a single OK response is returned. Can you please add a comment to explain this behaviour? 5. Just by looking at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices#getCommonIssueData, it's hard to guess what values the issueId and issueParams fields could take. Can you please add javadoc to the method? UPDATE: Oh, I can see this is documented in RMWebServiceProtocol. 6. Can you add API documentation of these 2 new endpoints to the file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md? 7. The javadoc for org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServiceProtocol#getCommonIssueData seems wrong. These are not form params, they are query params as per the implementing code, instead. 8. Nit: Can you use uppercase name for org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation? 9. Nit: I would introduce an enum for argument types. You currently have it as a string constant here: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#LIST_ISSUES_ARGUMENT 10. About the script location defined by: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation. What makes the file available at location "/tmp/diagnostics_collector.sh" ? 11. In the warning log in method org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#parseIssueType, please add the expected number of parameters as well to the message. I would also add a return statement after the log message for clarity. 12. Can you please rename the diagnostics_collector.sh, so that it somehow reflects it is intended to be used by tests? 13. There's a typo in comment of method org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped: ambigious -> ambiguous 14. In testcase: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped: You may simplify the testcase dramatically by using a loop to check the issue id, name and parameter. 15. In testcase: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testParseIssueTypeValidCases: The comment says: // valid case: id, name, one parameter But the test actually uses 2 parameters so this is a bit misleading. was (Author: snemeth): Hi [~bteke], Thanks for working on this. Here are my comments: 1. In FederationInterceptorREST: The targetpath parameter has the same value (RMWSConsts.RM_WEB_SERVICE_PATH + RMWSConsts.SCHEDULE) for both getCommonIssueData and getCommonIssueList. Shouldn't you use COMMON_ISSUE_COLLECT in method getCommonIssueData? 2. Nit: There some added newlines for some files at the end, please remove those. 3. Nit: In FederationInterceptorREST, I would use the string for the exception "This feature is not yet implemented" or something like this instead of the current "Code is not implemented" 4. For org.apache.hadoop.yarn.server.router.webapp.MockRESTRequestInterceptor#getCommonIssueData, it doesn't clear why a single OK response is returned. Can you please add a comment to explain this behaviour? 5. Just by looking at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices#getCommonIssueData, it's hard to guess what values the issueId and issueParams fields could take. Can you please add javadoc to the method? UPDATE: Oh, I can see this is documented in RMWebServiceProtocol. 6. Can you add API documentation of these 2 new endpoints to the file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md? 7. The javadoc for org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServiceProtocol#getCommonIssueData seems wrong. These are not form params, they are query params as per the implementing code, instead. 8. Nit: Can you use uppercase name for org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation? 9. Nit: I would introduce an enum for argument types. You currently have it as a string constant here: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#LIST_ISSUES_ARGUMENT 10. About the script location defined by: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation. What makes the file available at location "/tmp/diagnostics_collector.sh" ? 11. In the warning log in method org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#parseIssueType, please add the expected number of parameters as well to the message. I would also add a return statement after the log message for clarity. 12. Can you please rename the diagnostics_collector.sh, so that it somehow reflects it is intended to be used by tests? 13. There's a typo in comment of method org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped: ambigious -> ambiguous 14. In testcase: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped: You may simplify the testcase dramatically by using a loop to check the issue id, name and parameter. 15. In testcase: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testParseIssueTypeValidCases: The comment says: // valid case: id, name, one parameter But the test actually uses 2 parameters so this is a bit misleading. > Create YarnDiagnosticsService to serve diagnostic queries > ---------------------------------------------------------- > > Key: YARN-10421 > URL: https://issues.apache.org/jira/browse/YARN-10421 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Benjamin Teke > Assignee: Benjamin Teke > Priority: Major > Attachments: YARN-10421.001.patch, YARN-10421.002.patch, > YARN-10421.003.patch, YARN-10421.004.patch > > > YarnDiagnosticsServlet should run inside ResourceManager Daemon. The servlet > forks a separate process, which executes a shell/Python/etc script. Based on > the use-cases listed below the script collects information, bundles it and > sends it to UI2. The diagnostic options are the following: > # Application hanging: > ** Application logs > ** Find the hanging container and get multiple Jstacks > ** ResourceManager logs during job lifecycle > ** NodeManager logs from NodeManager where the hanging containers of the > jobs ran > ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez > History URL > # Application failed: > ** Application logs > ** ResourceManager logs during job lifecycle. > ** NodeManager logs from NodeManager where the hanging containers of the > jobs ran > ** Job Configuration from MapReduce HistoryServer, Spark HistoryServer, Tez > History URL. > ** Job related metrics like container, attempts. > # Scheduler related issue: > ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes. > ** Multiple Jstacks of ResourceManager > ** YARN and Scheduler Configuration > ** Cluster Scheduler API _/ws/v1/cluster/scheduler_ and Cluster Nodes API > _/ws/v1/cluster/nodes response_ > ** Scheduler Activities _/ws/v1/cluster/scheduler/bulkactivities_ response > (YARN-10319) > # ResourceManager / NodeManager daemon fails to start: > ** ResourceManager and NodeManager out and log file > ** YARN and Scheduler Configuration > Two new endpoints should be added to the RM web service: one for listing the > available diagnostic options (_/common-issue/list_), and one for calling a > selected option with the user provided parameters (_/common-issue/collect_). -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org