[ 
https://issues.apache.org/jira/browse/YARN-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811738#comment-16811738
 ] 

qiuliang commented on YARN-9437:
--------------------------------

As I understand it, there are two cases that may prevent the 
completedContainers in RMNodeImpl from being released.
1. When RMAppAttemptImpl receives a CONTAINER_FINISHED event for a container 
other than the amContainer, it adds the container to justFinishedContainers. 
When processing an AM heartbeat, RMAppAttemptImpl first sends the containers 
in finishedContainersSentToAM to the NM, and RMNodeImpl removes those 
containers from completedContainers. It then moves the containers in 
justFinishedContainers to finishedContainersSentToAM, where they wait for the 
next AM heartbeat before being sent to the NM. If RMAppAttemptImpl receives 
the AM unregistration event while justFinishedContainers is not empty, the 
containers in justFinishedContainers may never be moved to 
finishedContainersSentToAM. They are then never sent to the NM, and 
RMNodeImpl never releases them.
2. When RMAppAttemptImpl is already in a final state and receives a 
CONTAINER_FINISHED event, it only adds the container to 
justFinishedContainers and does not send it to the NM.
For the first case, my idea is that when RMAppAttemptImpl handles the 
amContainer finished event, it also moves the containers in 
justFinishedContainers to finishedContainersSentToAM and sends them to the NM 
along with the amContainer. I am not sure whether this has any other impact. 
For the second case, when RMAppAttemptImpl is in a final state and receives a 
CONTAINER_FINISHED event, it could send the container directly to the NM, but 
I am worried that this would generate many events.
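
The two cases and the proposed handling can be modelled roughly as follows. 
This is only a toy sketch under simplifying assumptions, not the actual YARN 
code: NodeModel, AttemptModel, the proposedFixes flag and all method names are 
hypothetical stand-ins, a single node is assumed, and events are replaced by 
direct method calls. Only the field names completedContainers, 
justFinishedContainers and finishedContainersSentToAM come from the 
description above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the bookkeeping described above; not the real YARN classes. */
public class CompletedContainerLeakSketch {

    /** Hypothetical stand-in for RMNodeImpl. */
    static class NodeModel {
        final Set<String> completedContainers = new TreeSet<>();

        void containerCompleted(String id) {
            completedContainers.add(id);
        }

        // Stands in for the notification that lets the node drop containers.
        void release(List<String> ids) {
            completedContainers.removeAll(ids);
        }
    }

    /** Hypothetical stand-in for RMAppAttemptImpl. */
    static class AttemptModel {
        final NodeModel node;
        final boolean proposedFixes; // assumption: toggles both proposed changes
        boolean finalState = false;
        final List<String> justFinishedContainers = new ArrayList<>();
        final List<String> finishedContainersSentToAM = new ArrayList<>();

        AttemptModel(NodeModel node, boolean proposedFixes) {
            this.node = node;
            this.proposedFixes = proposedFixes;
        }

        void onContainerFinished(String id) {
            node.containerCompleted(id);
            if (finalState && proposedFixes) {
                // Proposal for case 2: release directly, one call per container.
                node.release(Collections.singletonList(id));
                return;
            }
            justFinishedContainers.add(id);
        }

        void onAmHeartbeat() {
            // Release the batch reported on the previous heartbeat ...
            node.release(finishedContainersSentToAM);
            finishedContainersSentToAM.clear();
            // ... then stage the new batch for the next heartbeat.
            finishedContainersSentToAM.addAll(justFinishedContainers);
            justFinishedContainers.clear();
        }

        void onAmContainerFinished(String amContainerId) {
            node.containerCompleted(amContainerId);
            if (proposedFixes) {
                // Proposal for case 1: flush pending containers with the AM's.
                finishedContainersSentToAM.addAll(justFinishedContainers);
                justFinishedContainers.clear();
            }
            finishedContainersSentToAM.add(amContainerId);
            node.release(finishedContainersSentToAM);
            finishedContainersSentToAM.clear();
            finalState = true;
        }
    }

    /** Replays both cases and returns what the node still holds afterwards. */
    static Set<String> run(boolean proposedFixes) {
        NodeModel node = new NodeModel();
        AttemptModel attempt = new AttemptModel(node, proposedFixes);
        attempt.onContainerFinished("c1");
        attempt.onAmHeartbeat();              // c1 staged in finishedContainersSentToAM
        attempt.onContainerFinished("c2");    // case 1: arrives before the AM exits
        attempt.onAmContainerFinished("am");
        attempt.onContainerFinished("c3");    // case 2: arrives in the final state
        return node.completedContainers;
    }

    public static void main(String[] args) {
        System.out.println("current behaviour leaks:  " + run(false)); // [c2, c3]
        System.out.println("proposed behaviour leaks: " + run(true));  // []
    }
}
```

In this model the current behaviour leaves c2 (case 1) and c3 (case 2) in 
completedContainers, while the proposed handling leaves the set empty. The 
per-container release call in onContainerFinished is where the concern about 
generating many events would apply.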

> RMNodeImpls occupy too much memory and causes RM GC to take a long time
> -----------------------------------------------------------------------
>
>                 Key: YARN-9437
>                 URL: https://issues.apache.org/jira/browse/YARN-9437
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.9.1
>            Reporter: qiuliang
>            Priority: Minor
>         Attachments: 1.png, 2.png, 3.png
>
>
> We use hadoop-2.9.1 in our production environment with 1600+ nodes. 95.63% of 
> RM memory is occupied by RMNodeImpl. Analysis of RM memory found that each 
> RMNodeImpl occupies approximately 14M. The reason is that each RMNodeImpl 
> holds 130,000+ completedContainers that have not been released.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
