[ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14322277#comment-14322277 ]
Rohith commented on YARN-3194:
------------------------------

bq. it's removing the old node and adding the newly connected node. RM is also not restarted.

{{RMNodeImpl#ReconnectNodeTransition#transition}} does not remove the old node if any applications are running on it. In the code below, if noRunningApps is false, the node is not removed; only its fields are updated and the running applications are handled.

{code}
  public void transition(RMNodeImpl rmNode, RMNodeEvent event) {
    RMNodeReconnectEvent reconnectEvent = (RMNodeReconnectEvent) event;
    RMNode newNode = reconnectEvent.getReconnectedNode();
    rmNode.nodeManagerVersion = newNode.getNodeManagerVersion();
    List<ApplicationId> runningApps = reconnectEvent.getRunningApplications();
    boolean noRunningApps =
        (runningApps == null) || (runningApps.size() == 0);

    // No application running on the node, so send node-removal event with
    // cleaning up old container info.
    if (noRunningApps) {
      // Remove the node from scheduler
      // Add node to the scheduler
    } else {
      rmNode.httpPort = newNode.getHttpPort();
      rmNode.httpAddress = newNode.getHttpAddress();
      rmNode.totalCapability = newNode.getTotalCapability();

      // Reset heartbeat ID since node just restarted.
      rmNode.getLastNodeHeartBeatResponse().setResponseId(0);
    }
    // Handles running app on this node
    // resource update to schedule code
  }
} // closes ReconnectNodeTransition
{code}

> After NM restart,completed containers are not released which are sent during
> NM registration
> --------------------------------------------------------------------------------------------
>
>                 Key: YARN-3194
>                 URL: https://issues.apache.org/jira/browse/YARN-3194
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>         Environment: NM restart is enabled
>            Reporter: Rohith
>            Assignee: Rohith
>
> On NM restart, the NM sends all outstanding NMContainerStatus reports to the
> RM, but the RM processes only those in ContainerState.RUNNING. If a container
> completed while the NM was down, its resources are not released, which causes
> applications to hang.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
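
The gap described in the issue description above can be illustrated with a minimal, self-contained sketch. It does not use the real YARN classes: `RegistrationSketch`, `ContainerReport`, the local `ContainerState` enum, and `stillAllocated` are hypothetical stand-ins, and the `releaseCompleted` switch is only an assumption about what a fix might do, not the actual patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RegistrationSketch {
    // Hypothetical stand-in for the container states named in the report.
    enum ContainerState { RUNNING, COMPLETE }

    // Hypothetical stand-in for one container status sent by the NM at registration.
    static final class ContainerReport {
        final String containerId;
        final ContainerState state;

        ContainerReport(String containerId, ContainerState state) {
            this.containerId = containerId;
            this.state = state;
        }
    }

    /**
     * Returns the ids the RM would still count as allocated after registration.
     * With releaseCompleted == false, COMPLETE reports are ignored, so those
     * containers stay allocated (the behaviour described in the report).
     * With releaseCompleted == true, COMPLETE containers are released
     * (an assumed fix, for illustration only).
     */
    static List<String> stillAllocated(List<ContainerReport> reports,
                                       boolean releaseCompleted) {
        List<String> allocated = new ArrayList<>();
        for (ContainerReport report : reports) {
            boolean completed = report.state == ContainerState.COMPLETE;
            if (completed && releaseCompleted) {
                continue; // resources would be handed back to the scheduler
            }
            allocated.add(report.containerId);
        }
        return allocated;
    }

    public static void main(String[] args) {
        List<ContainerReport> reports = Arrays.asList(
            new ContainerReport("container_1", ContainerState.RUNNING),
            new ContainerReport("container_2", ContainerState.COMPLETE));

        System.out.println(stillAllocated(reports, false)); // [container_1, container_2]
        System.out.println(stillAllocated(reports, true));  // [container_1]
    }
}
```

In the first call the completed container is never released, which corresponds to the hang reported in this issue; the second call shows the outcome if completed containers reported at registration were also processed.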