[ https://issues.apache.org/jira/browse/YARN-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811738#comment-16811738 ]
qiuliang commented on YARN-9437:
--------------------------------

As I understand it, there are two cases in which the completedContainers in RMNodeImpl may never be released.

1. When RMAppAttemptImpl receives a CONTAINER_FINISHED event for a container other than the amContainer, it adds that container to justFinishedContainers. When processing an AM heartbeat, RMAppAttemptImpl first sends the containers in finishedContainersSentToAM to the NM, and RMNodeImpl removes those containers from completedContainers. It then moves the containers in justFinishedContainers to finishedContainersSentToAM, where they wait for the next AM heartbeat before being sent to the NM. If RMAppAttemptImpl receives the AM unregistration event while justFinishedContainers is not empty, those containers may never be moved to finishedContainersSentToAM, so they are never sent to the NM and RMNodeImpl never releases them.

2. When RMAppAttemptImpl is in a final state and receives a CONTAINER_FINISHED event, it only adds the container to justFinishedContainers and does not send it to the NM.

For the first case, my idea is that when RMAppAttemptImpl handles the amContainer finished event, it moves the containers in justFinishedContainers to finishedContainersSentToAM and sends them to the NM along with the amContainer. I am not sure whether this has any other impact.

For the second case, when RMAppAttemptImpl is in a final state and receives a CONTAINER_FINISHED event, the container could be sent directly to the NM, but I am worried that this will generate many events.
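The first case and the proposed fix can be illustrated with a small standalone model. This is only a sketch under simplifying assumptions, not the actual YARN code: NodeModel and AttemptModel are hypothetical stand-ins for RMNodeImpl and RMAppAttemptImpl, a single node is assumed, and the flushOnAmFinish flag is an invented switch that turns the proposed behaviour on or off.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified model of the bookkeeping described above; NOT the real YARN classes. */
class FinishedContainerSketch {

    /** Hypothetical stand-in for RMNodeImpl: keeps completed containers until they are acked. */
    static class NodeModel {
        final Set<String> completedContainers = new HashSet<>();

        void containerCompleted(String id) {
            completedContainers.add(id);
        }

        /** Models the "finished containers sent to NM" step that releases entries. */
        void ack(List<String> ids) {
            completedContainers.removeAll(ids);
        }
    }

    /** Hypothetical stand-in for RMAppAttemptImpl. */
    static class AttemptModel {
        final List<String> justFinishedContainers = new ArrayList<>();
        final List<String> finishedContainersSentToAM = new ArrayList<>();
        final NodeModel node;
        final boolean flushOnAmFinish; // assumption: models the fix proposed for case 1

        AttemptModel(NodeModel node, boolean flushOnAmFinish) {
            this.node = node;
            this.flushOnAmFinish = flushOnAmFinish;
        }

        void containerFinished(String id) {
            justFinishedContainers.add(id);
        }

        void amHeartbeat() {
            // Ack what the AM saw on the previous heartbeat, then hand over the new batch.
            node.ack(finishedContainersSentToAM);
            finishedContainersSentToAM.clear();
            finishedContainersSentToAM.addAll(justFinishedContainers);
            justFinishedContainers.clear();
        }

        void amContainerFinished(String amContainerId) {
            if (flushOnAmFinish) {
                // Proposed fix: do not leave anything behind in justFinishedContainers.
                finishedContainersSentToAM.addAll(justFinishedContainers);
                justFinishedContainers.clear();
            }
            finishedContainersSentToAM.add(amContainerId);
            node.ack(finishedContainersSentToAM);
            finishedContainersSentToAM.clear();
        }
    }

    /** Runs the case-1 scenario and returns how many containers the node still holds. */
    static int leakedAfterUnregister(boolean flushOnAmFinish) {
        NodeModel node = new NodeModel();
        AttemptModel attempt = new AttemptModel(node, flushOnAmFinish);

        for (String id : new String[] {"c1", "c2"}) {
            node.containerCompleted(id);
            attempt.containerFinished(id);
            attempt.amHeartbeat();
        }
        // c3 finishes after the last heartbeat, just before the AM unregisters.
        node.containerCompleted("c3");
        attempt.containerFinished("c3");

        node.containerCompleted("am");
        attempt.amContainerFinished("am");
        return node.completedContainers.size();
    }

    public static void main(String[] args) {
        System.out.println("without flush: " + leakedAfterUnregister(false));
        System.out.println("with flush:    " + leakedAfterUnregister(true));
    }
}
```

In this model, the container that finishes between the last heartbeat and the AM unregistration stays in completedContainers unless the flush is enabled. Whether the real state machine can flush at that point without side effects is exactly the open question raised above.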
> RMNodeImpls occupy too much memory and causes RM GC to take a long time
> -----------------------------------------------------------------------
>
>                 Key: YARN-9437
>                 URL: https://issues.apache.org/jira/browse/YARN-9437
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.9.1
>            Reporter: qiuliang
>            Priority: Minor
>         Attachments: 1.png, 2.png, 3.png
>
>
> We use hadoop-2.9.1 in our production environment with 1600+ nodes. 95.63% of
> RM memory is occupied by RMNodeImpl. Analysis of RM memory found that each
> RMNodeImpl has approximately 14M. The reason is that there are 130,000+
> completedContainers in each RMNodeImpl that have not been released.