[ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325610#comment-14325610 ]
Rohith commented on YARN-3194:
------------------------------

Thanks [~jlowe] [~djp] [~jianhe] for the detailed review :-)

bq. the container status processing code is almost a duplicate of the same code in StatusUpdateWhenHealthyTransition
Agreed, this has to be refactored. Most of the ContainerStatus processing code is the same.

bq. we don't remove containers that have completed from the launchedContainers map which seems wrong
I see, yes. Completed containers should be removed from launchedContainers.

bq. I don't see why we would process container status sent during a reconnect differently than a regular status update from the NM
IIUC it is only to deal with NMContainerStatus and ContainerStatus. But I am not sure why these two were created differently. From what I see, ContainerStatus is a subset of NMContainerStatus. I think ContainerStatus could have been placed inside NMContainerStatus.

bq. Is below condition valid for the newly added code in ReconnectNodeTransition too ?
Yes, it is applicable since we are keeping the old RMNode object.

bq. Add timeout to the test, testAppCleanupWhenNMRstarts -> testProcessingContainerStatusesOnNMRestart ? and add more detailed comments about what the test is doing too ?
Agreed.

bq. Could you add a validation that ApplicationMasterService#allocate indeed receives the completed container in this scenario?
Agreed, I will add it.

bq. Question: does the 3072 include 1024 for the AM container and 2048 for the allocated container ?
AM memory is 1024 and the additional requested container memory is 2048. In the test, the number of requested containers is 1.
So AllocatedMB should be AM + requested, i.e. 1024 + 2048 = 3072.

> After NM restart, completed containers are not released by RM which are sent during NM registration
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3194
>                 URL: https://issues.apache.org/jira/browse/YARN-3194
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.0
>         Environment: NM restart is enabled
>            Reporter: Rohith
>            Assignee: Rohith
>            Priority: Blocker
>         Attachments: 0001-yarn-3194-v1.patch
>
>
> On NM restart, the NM sends all the outstanding NMContainerStatus to the RM. But the RM
> processes only ContainerState.RUNNING. If a container completed while the NM was
> down, that container's resources won't be released, which results in
> applications hanging.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)