[ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324955#comment-14324955 ]
Jason Lowe commented on YARN-3194:
----------------------------------

bq. Jason Lowe, I remember we discussed this case in some JIRA under YARN-1336, did you see this problem before?

I didn't see this problem originally, but I suspect that is because two things masked it. As mentioned above, this problem doesn't manifest before YARN-2997. In addition, I was testing with MapReduce applications, and the MR AM will explicitly kill containers for tasks that have completed (as reported by the umbilical connection between the AM and tasks).

I agree that we should be processing the container report sent with the NM registration, and it appears that is being dropped in the reconnected event.

Comments on the patch:

I noticed that the container status processing code is _almost_ a duplicate of the same code in StatusUpdateWhenHealthyTransition. One difference is that we don't remove containers that have completed from the launchedContainers map, which seems wrong. I don't see why we would process container status sent during a reconnect differently than a regular status update from the NM. Therefore I think we should refactor the code to reuse this logic, as it should apply here just as it does for StatusUpdateWhenHealthyTransition.

> After NM restart, completed containers are not released by RM which are sent during NM registration
> ---------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3194
>                 URL: https://issues.apache.org/jira/browse/YARN-3194
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>         Environment: NM restart is enabled
>            Reporter: Rohith
>            Assignee: Rohith
>         Attachments: 0001-yarn-3194-v1.patch
>
>
> On NM restart, the NM sends all outstanding NMContainerStatus reports to the RM, but the RM
> processes only those in ContainerState.RUNNING. If a container completed while the NM was
> down, its resources won't be released, which causes applications to hang.
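
To illustrate the refactoring suggested in the comment, here is a minimal sketch of routing both the regular status update and the reconnect through one shared routine. This is not the actual RMNodeImpl code: the class ContainerStatusSketch, the types State, Status and NodeTracker, and the methods handleContainerStatuses, onStatusUpdate and onReconnect are all hypothetical stand-ins, and only the launchedContainers name comes from the discussion. The sketch also assumes that every completed container should be reported for release, which the real transition may restrict further.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not the YARN classes: it only shows the shape of the refactor.
class ContainerStatusSketch {

    enum State { RUNNING, COMPLETE }

    static final class Status {
        final String containerId;
        final State state;

        Status(String containerId, State state) {
            this.containerId = containerId;
            this.state = state;
        }
    }

    static final class NodeTracker {
        // Containers the RM believes are still running on this node.
        final Set<String> launchedContainers = new HashSet<>();
        // Completed containers whose resources should be released.
        final List<String> completedToRelease = new ArrayList<>();

        // Single routine shared by both event paths.
        void handleContainerStatuses(List<Status> statuses) {
            for (Status s : statuses) {
                if (s.state == State.RUNNING) {
                    launchedContainers.add(s.containerId);
                } else {
                    // Completed: stop tracking it and queue it for release.
                    launchedContainers.remove(s.containerId);
                    completedToRelease.add(s.containerId);
                }
            }
        }

        // Regular status update path (assumed analogue of a heartbeat).
        void onStatusUpdate(List<Status> statuses) {
            handleContainerStatuses(statuses);
        }

        // Reconnect path: statuses sent with NM registration get the same treatment.
        void onReconnect(List<Status> statuses) {
            handleContainerStatuses(statuses);
        }
    }

    public static void main(String[] args) {
        NodeTracker node = new NodeTracker();
        node.onStatusUpdate(Arrays.asList(new Status("c1", State.RUNNING)));
        // NM restarts; c1 finished while it was down and is reported at registration.
        node.onReconnect(Arrays.asList(new Status("c1", State.COMPLETE)));
        System.out.println("launched=" + node.launchedContainers
                + " toRelease=" + node.completedToRelease);
    }
}
```

With one shared routine, a container reported as completed during registration is dropped from launchedContainers and queued for release exactly as it would be on a regular status update.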