[ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14322277#comment-14322277
 ] 

Rohith commented on YARN-3194:
------------------------------

bq. it's removing the old node and adding the newly connected node. RM is also 
not restarted. 
{{RMNodeImpl#ReconnectNodeTransition#.transition}} does not remove old node if 
any applications are running. In the below code, if noRunningApps is false then 
Node is not removed. Instead just handling running applications.
{code}
public void transition(RMNodeImpl rmNode, RMNodeEvent event) {
      RMNodeReconnectEvent reconnectEvent = (RMNodeReconnectEvent) event;
      RMNode newNode = reconnectEvent.getReconnectedNode();
      rmNode.nodeManagerVersion = newNode.getNodeManagerVersion();
      List<ApplicationId> runningApps = reconnectEvent.getRunningApplications();
      boolean noRunningApps = 
          (runningApps == null) || (runningApps.size() == 0);
      
      // No application running on the node, so send node-removal event with 
      // cleaning up old container info.
      if (noRunningApps) {
        // Remove the node from scheduler
        // Add node to the scheduler
      } else {
        rmNode.httpPort = newNode.getHttpPort();
        rmNode.httpAddress = newNode.getHttpAddress();
        rmNode.totalCapability = newNode.getTotalCapability();
      
        // Reset heartbeat ID since node just restarted.
        rmNode.getLastNodeHeartBeatResponse().setResponseId(0);
      }

      // Handles running app on this node
    // resource update to schedule code
      }
      
    }
{code}

> After NM restart,completed containers are not released which are sent during 
> NM registration
> --------------------------------------------------------------------------------------------
>
>                 Key: YARN-3194
>                 URL: https://issues.apache.org/jira/browse/YARN-3194
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>         Environment: NM restart is enabled
>            Reporter: Rohith
>            Assignee: Rohith
>
> On NM restart ,NM sends all the outstanding NMContainerStatus to RM. But RM 
> process only ContainerState.RUNNING. If container is completed when NM was 
> down then those containers resources wont be release which result in 
> applications to hang.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to