[ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325010#comment-14325010 ]
Jian He commented on YARN-3194:
-------------------------------

[~rohithsharma], thanks for your explanation. Could you edit the description to be clearer about the problem?

- Is it possible to have a common method for the code below, which is duplicated in ReconnectNodeTransition and StatusUpdateWhenHealthyTransition?
{code}
// Filter the map to only obtain just launched containers and finished
// containers.
List<ContainerStatus> newlyLaunchedContainers =
    new ArrayList<ContainerStatus>();
List<ContainerStatus> completedContainers =
    new ArrayList<ContainerStatus>();
for (NMContainerStatus remoteContainer : reconnectEvent
    .getNMContainerStatuses()) {
  ContainerId containerId = remoteContainer.getContainerId();
  // Process running containers
  if (remoteContainer.getContainerState() == ContainerState.RUNNING) {
    if (!rmNode.launchedContainers.contains(containerId)) {
      // Just launched container. RM knows about it the first time.
      rmNode.launchedContainers.add(containerId);
      ContainerStatus cStatus = createContainerStatus(remoteContainer);
      newlyLaunchedContainers.add(cStatus);
    }
  } else {
    ContainerStatus cStatus = createContainerStatus(remoteContainer);
    completedContainers.add(cStatus);
  }
}
if (newlyLaunchedContainers.size() != 0 || completedContainers.size() != 0) {
  rmNode.nodeUpdateQueue.add(new UpdatedContainerInfo(
      newlyLaunchedContainers, completedContainers));
}
{code}
- Is the condition below valid for the newly added code in ReconnectNodeTransition too?
{code}
// Don't bother with containers already scheduled for cleanup, or for
// applications already killed. The scheduler doesn't need to know any
// more about this container
if (rmNode.containersToClean.contains(containerId)) {
  LOG.info("Container " + containerId + " already scheduled for "
      + "cleanup, no further processing");
  continue;
}
if (rmNode.finishedApplications.contains(containerId
    .getApplicationAttemptId().getApplicationId())) {
  LOG.info("Container " + containerId
      + " belongs to an application that is already killed,"
      + " no further processing");
  continue;
}
{code}
- Add a timeout to the test, and rename testAppCleanupWhenNMRstarts -> testProcessingContainerStatusesOnNMRestart? Also add more detailed comments about what the test is doing.
{code}
@Test
public void testAppCleanupWhenNMRstarts() throws Exception
{code}
- Question: does the 3072 include 1024 for the AM container and 2048 for the allocated container?
{code}
Assert.assertEquals(3072, allocatedMB);
{code}
- Could you add a validation that ApplicationMasterService#allocate indeed receives the completed container in this scenario?

> After NM restart, completed containers are not released by RM which are sent
> during NM registration
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3194
>                 URL: https://issues.apache.org/jira/browse/YARN-3194
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>        Environment: NM restart is enabled
>            Reporter: Rohith
>            Assignee: Rohith
>        Attachments: 0001-yarn-3194-v1.patch
>
>
> On NM restart, the NM sends all the outstanding NMContainerStatus to the RM,
> but the RM processes only ContainerState.RUNNING. If a container completed
> while the NM was down, that container's resources won't be released, which
> causes applications to hang.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
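The duplicated filtering logic the first review point asks to extract could look roughly like the sketch below. This is a simplified, self-contained model of the idea only: the types here (Report, State, Split) are hypothetical stand-ins, not the real NMContainerStatus/ContainerId/RMNodeImpl classes, which carry far more state.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NodeStatusFilter {

    // Hypothetical stand-in for ContainerState; only the two states the
    // filtering logic distinguishes are modeled.
    enum State { RUNNING, COMPLETE }

    // Hypothetical stand-in for NMContainerStatus.
    static class Report {
        final String containerId;
        final State state;
        Report(String containerId, State state) {
            this.containerId = containerId;
            this.state = state;
        }
    }

    // Result of one filtering pass: containers the RM sees running for the
    // first time, and containers that have finished.
    static class Split {
        final List<String> newlyLaunched = new ArrayList<>();
        final List<String> completed = new ArrayList<>();
    }

    // The candidate common method: partition incoming reports. 'launched'
    // is mutated in place, mirroring rmNode.launchedContainers, so both
    // ReconnectNodeTransition and StatusUpdateWhenHealthyTransition could
    // call it with the node's own tracking set.
    static Split splitStatuses(List<Report> reports, Set<String> launched) {
        Split out = new Split();
        for (Report r : reports) {
            if (r.state == State.RUNNING) {
                // Set.add returns true only when the id was not tracked yet,
                // i.e. the RM learns about this running container now.
                if (launched.add(r.containerId)) {
                    out.newlyLaunched.add(r.containerId);
                }
            } else {
                out.completed.add(r.containerId);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> launched = new HashSet<>();
        launched.add("c1"); // RM already tracks c1 as running
        List<Report> reports = new ArrayList<>();
        reports.add(new Report("c1", State.RUNNING));
        reports.add(new Report("c2", State.RUNNING));
        reports.add(new Report("c3", State.COMPLETE));
        Split s = splitStatuses(reports, launched);
        System.out.println("newlyLaunched=" + s.newlyLaunched); // [c2]
        System.out.println("completed=" + s.completed);         // [c3]
    }
}
```

With a helper of this shape, the only per-transition difference left in the callers is where the report list comes from (the reconnect event vs. the node status update), which keeps the YARN-3194 fix, processing completed containers sent at registration, in a single place.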