[ 
https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183819#comment-17183819
 ] 

Siddharth Ahuja edited comment on YARN-1806 at 8/26/20, 1:24 AM:
-----------------------------------------------------------------

This JIRA adds a "*Threaddump*" button to the individual application page of 
the ResourceManager Web UI, reached via RM Web UI -> Applications -> click on 
<app_id> (so the breadcrumb is 
{{Home / Applications / App [app_id] / Threaddump}}). The button triggers thread 
dumps for the running YARN containers of the currently running application 
attempt. Each thread dump is captured in the stdout logs of the selected 
container and displayed as-is by querying the NodeManager node on which that 
container is running.

As part of this feature, two panels are implemented. The first panel holds two 
drop-downs. The first drop-down lists the currently running app attempt id and 
a "None" option (similar to the "Logs" functionality). Once an attempt is 
selected, a second drop-down appears in the same panel, listing the currently 
running containers for that application attempt id.

Once you select a container id from this second drop-down, a second panel opens 
just below (again similar to the "Logs" functionality). Its header shows the 
selected attempt id and container id, and its body shows the container's stdout 
logs, which contain the thread dump that was triggered when the container was 
selected.

The following sets of API calls are made:

+API calls made when the _Threaddump_ button is clicked:+
{code}
1. http://<rm>:8088/ws/v1/cluster/apps/<app_id> -> Get application info, e.g. 
the app state, from the RM.
2. http://<rm>:8088/ws/v1/cluster/apps/<app_id>/appattempts -> Get application 
attempt info from the RM, e.g. the app attempt state, to see whether it is 
RUNNING or not ([YARN-10381|https://issues.apache.org/jira/browse/YARN-10381]).
{code}

If the application is not RUNNING, an error is displayed, based on the info 
from 1. above. 
If the application is RUNNING, we check the application attempts info for this 
app (there can be more than one app attempt) and display the application 
attempt id of the RUNNING attempt only. This is based on the info from 2. above.
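
The check described above can be sketched in Python as follows. This is only an illustration, not the UI code from the patch: the function name {{pick_running_attempt}} is hypothetical, and the JSON keys ({{app.state}}, {{appAttempts.appAttempt}}, {{appAttemptState}}, {{appAttemptId}}) are assumptions about the shape of the responses to calls 1. and 2.

```python
from typing import Optional


def pick_running_attempt(app_info: dict, attempts_info: dict) -> Optional[str]:
    """Return the id of the RUNNING app attempt, or None if there is none.

    Raises ValueError when the application itself is not RUNNING, which is
    the case where the UI would display an error instead of a drop-down.
    The key names used here are assumptions about the REST responses.
    """
    state = app_info.get("app", {}).get("state")
    if state != "RUNNING":
        raise ValueError(f"application is {state}, not RUNNING")

    # There can be several attempts; only the RUNNING one is offered.
    attempts = attempts_info.get("appAttempts", {}).get("appAttempt", [])
    for attempt in attempts:
        if attempt.get("appAttemptState") == "RUNNING":
            return attempt.get("appAttemptId")
    return None
```

With parsed responses from calls 1. and 2., the returned id (if any) would be the single entry shown next to "None" in the first drop-down.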

+API calls made when the app attempt is selected from the drop-down:+
{code}
3. 
http://<rm>:8088/ws/v1/cluster/apps/<app_id>/appattempts/<appattempt_id>/containers
 -> This is to get the list of running containers for the currently running app 
attempt from the RM.
{code}
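
As a rough sketch of call 3., the snippet below builds the URL and turns the returned container records into drop-down entries. The helper names are hypothetical, and the record keys {{containerId}} and {{nodeHttpAddress}} are assumptions about the response, not something confirmed by this comment.

```python
from typing import List, Tuple
from urllib.parse import quote


def containers_url(rm: str, app_id: str, attempt_id: str) -> str:
    """Build the URL of call 3. for a given RM host, app id and attempt id."""
    return (f"http://{rm}:8088/ws/v1/cluster/apps/{quote(app_id)}"
            f"/appattempts/{quote(attempt_id)}/containers")


def dropdown_entries(containers: List[dict]) -> List[Tuple[str, str]]:
    """Map container records to (container id, NM http address) pairs.

    Records without a container id are skipped. The key names are
    assumptions about the shape of the RM response.
    """
    entries = [
        (c["containerId"], c.get("nodeHttpAddress", ""))
        for c in containers
        if c.get("containerId")
    ]
    return sorted(entries)
```

Keeping the NM address next to each container id is a convenience for the later log request, which is sent to the NM rather than the RM.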

+API calls made when the container is selected from the drop-down:+
{code}
4. 
http://<rm>:8088/ws/v1/cluster/containers/<container_id>/signal/OUTPUT_THREAD_DUMP?user.name=<logged_in_user>
 -> This asks the RM (which relays the request to the NM through the NM 
heartbeat) to send a SIGQUIT signal to the container process for the selected 
container ([YARN-8693|https://issues.apache.org/jira/browse/YARN-8693]). This 
is essentially a kill -3; it generates a thread dump that is captured in the 
stdout logs of the container.
5. http://<nm>:8042/ws/v1/node/containerlogs/<container_id>/stdout -> This asks 
the NM that is running the selected container for the container's stdout logs, 
which contain the thread dump triggered by the above call. 
{code}
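
Because the signal reaches the container only through the NM heartbeat, the thread dump may not be in stdout immediately after call 4. returns. The sketch below shows one way a client could handle that: build both requests, then poll stdout until a new dump shows up. All helper names are hypothetical. The use of POST for the signal call, and the {{Full thread dump}} marker (the header HotSpot-based JVMs print at the start of a dump), are assumptions and not taken from the patch.

```python
import time
from typing import Callable
from urllib.parse import quote, urlencode
from urllib.request import Request


def signal_request(rm: str, container_id: str, user: str) -> Request:
    """Build the request for call 4. (HTTP method POST is an assumption)."""
    query = urlencode({"user.name": user})
    url = (f"http://{rm}:8088/ws/v1/cluster/containers/{quote(container_id)}"
           f"/signal/OUTPUT_THREAD_DUMP?{query}")
    return Request(url, data=b"", method="POST")


def stdout_url(nm: str, container_id: str) -> str:
    """Build the URL of call 5. for a given NM host and container id."""
    return (f"http://{nm}:8042/ws/v1/node/containerlogs/"
            f"{quote(container_id)}/stdout")


def wait_for_new_dump(fetch: Callable[[], str], seen_before: int,
                      marker: str = "Full thread dump",
                      attempts: int = 10, delay: float = 1.0) -> str:
    """Poll stdout until it holds more dump markers than before the signal.

    fetch is any callable returning the current stdout text; seen_before
    is the marker count taken before the signal was sent, so that an
    older dump already in the log is not mistaken for the new one.
    """
    for attempt in range(attempts):
        text = fetch()
        if text.count(marker) > seen_before:
            return text
        if attempt + 1 < attempts:
            time.sleep(delay)
    raise TimeoutError("no new thread dump appeared in stdout")
```

A caller would count the markers in stdout first, send {{signal_request(...)}} with {{urllib.request.urlopen}}, and then pass a function that fetches {{stdout_url(...)}} to {{wait_for_new_dump}}.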



> webUI update to allow end users to request thread dump
> ------------------------------------------------------
>
>                 Key: YARN-1806
>                 URL: https://issues.apache.org/jira/browse/YARN-1806
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Ming Ma
>            Assignee: Siddharth Ahuja
>            Priority: Major
>         Attachments: YARN-1806.001.patch
>
>
> Both the individual container page and the containers page will support this. 
> After the end user clicks on the request link, they can follow it to the 
> stdout page for the thread dump content.


