[ https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Teke updated YARN-10421:
---------------------------------
    Attachment: YARN-10421.001.patch

> Create YarnDiagnosticsServlet to serve diagnostic queries
> ----------------------------------------------------------
>
>                 Key: YARN-10421
>                 URL: https://issues.apache.org/jira/browse/YARN-10421
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Benjamin Teke
>            Assignee: Benjamin Teke
>            Priority: Major
>         Attachments: YARN-10421.001.patch
>
>
> YarnDiagnosticsServlet should run inside the ResourceManager daemon. The
> servlet forks a separate process, which executes a shell/Python/etc. script.
> Based on the use cases listed below, the script collects information,
> bundles it and sends it to UI2. The diagnostic cases are the following:
> # Application hanging:
> ** Application logs
> ** Find the hanging container and get multiple Jstacks
> ** ResourceManager logs during job lifecycle
> ** NodeManager logs from NodeManager where the hanging containers of the
> jobs ran
> ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez
> History URL
> # Application failed:
> ** Application logs
> ** ResourceManager logs during job lifecycle
> ** NodeManager logs from NodeManager where the hanging containers of the
> jobs ran
> ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez
> History URL
> ** Job-related metrics such as containers and attempts
> # Scheduler related issue:
> ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes
> ** Multiple Jstacks of ResourceManager
> ** YARN and Scheduler Configuration
> ** Cluster Scheduler API _/ws/v1/cluster/scheduler_ and Cluster Nodes API
> _/ws/v1/cluster/nodes_ response
> ** Scheduler Activities _/ws/v1/cluster/scheduler/bulkactivities_ response
> (YARN-10319)
> # ResourceManager / NodeManager daemon fails to start:
> ** ResourceManager and NodeManager out and log file
> ** YARN and Scheduler Configuration
>
> To ease the load on the RM, the servlet should allow only one HTTP request at
> a time. If a new request arrives while another is being served, an
> appropriate response code should be returned, with the message "Diagnostics
> Collection in Progress". The servlet should list the possible diagnostic
> cases to the UI. The cases will be implemented in the script. The servlet
> should be transparent to script changes, so that the diagnostic tool can be
> extended on the fly.
>
> The diagnostic bundle can become large, so a size threshold should be added.
> If the bundle's size exceeds the threshold, the bundle will be stored in a
> local folder on the RM host, and its path will be returned.
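
The two behavioural rules in the description (one request at a time, and a size threshold for the bundle) can be sketched roughly as follows. This is not the code in the attached patch, and the real servlet would be Java inside the ResourceManager; Python is used here only to show the control flow. Every name in it (`DiagnosticsGate`, `DiagnosticsResponse`, `collect`, `spill_dir`) is hypothetical, and the 409 status is an assumption, since the issue only asks for "an appropriate response code". The `collect` callable stands in for the forked script.

```python
import threading
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional

HTTP_OK = 200
HTTP_BUSY = 409  # assumption: the issue does not name a specific status code
BUSY_MESSAGE = "Diagnostics Collection in Progress"  # message taken from the issue


@dataclass
class DiagnosticsResponse:
    status: int
    message: str = ""
    body: Optional[bytes] = None  # bundle returned inline when small enough
    path: Optional[str] = None    # local path on the RM host when too large


class DiagnosticsGate:
    """Hypothetical stand-in for the servlet's request handling logic."""

    def __init__(self, collect: Callable[[str], bytes],
                 threshold_bytes: int, spill_dir: Path) -> None:
        self._collect = collect            # stands in for the forked script
        self._threshold = threshold_bytes
        self._spill_dir = spill_dir
        self._lock = threading.Lock()

    def handle(self, case: str) -> DiagnosticsResponse:
        # Non-blocking acquire: a second caller is rejected, not queued.
        if not self._lock.acquire(blocking=False):
            return DiagnosticsResponse(HTTP_BUSY, BUSY_MESSAGE)
        try:
            bundle = self._collect(case)
            if len(bundle) > self._threshold:
                # Too large to return: store locally and return only the path.
                target = self._spill_dir / f"diag-{case}.bundle"
                target.write_bytes(bundle)
                return DiagnosticsResponse(HTTP_OK, "Bundle stored locally",
                                           path=str(target))
            return DiagnosticsResponse(HTTP_OK, body=bundle)
        finally:
            self._lock.release()
```

Rejecting the second request instead of queuing it keeps the extra load on the RM bounded, which is the stated goal. Because the gate only passes the case name through to `collect`, new cases can be added in the script without changing this logic.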