[ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325010#comment-14325010
 ] 

Jian He commented on YARN-3194:
-------------------------------

[~rohithsharma], thanks for your explanation. Could you edit the description to 
state the problem more clearly?

- Is it possible to extract a common method for the code below, which is 
duplicated in ReconnectNodeTransition and StatusUpdateWhenHealthyTransition?
{code}
        // Filter the map to only obtain just launched containers and finished
        // containers.
        List<ContainerStatus> newlyLaunchedContainers =
            new ArrayList<ContainerStatus>();
        List<ContainerStatus> completedContainers =
            new ArrayList<ContainerStatus>();
        for (NMContainerStatus remoteContainer : reconnectEvent
            .getNMContainerStatuses()) {
          ContainerId containerId = remoteContainer.getContainerId();

          // Process running containers
          if (remoteContainer.getContainerState() == ContainerState.RUNNING) {
            if (!rmNode.launchedContainers.contains(containerId)) {
              // Just launched container. RM knows about it the first time.
              rmNode.launchedContainers.add(containerId);
              ContainerStatus cStatus = createContainerStatus(remoteContainer);
              newlyLaunchedContainers.add(cStatus);
            }
          } else {

            ContainerStatus cStatus = createContainerStatus(remoteContainer);
            completedContainers.add(cStatus);
          }
        }
        if (newlyLaunchedContainers.size() != 0
            || completedContainers.size() != 0) {
          rmNode.nodeUpdateQueue.add(new UpdatedContainerInfo(
              newlyLaunchedContainers, completedContainers));
        }
{code}
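To make the suggestion concrete, here is a rough sketch of what a shared helper could look like. The class and method names below (ContainerStatusFilter, handleContainerStatuses) are hypothetical, and the types are simplified stand-ins so the example is self-contained; the real helper would operate on YARN's NMContainerStatus/ContainerStatus types and live in RMNodeImpl, where both transitions could delegate to it.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, self-contained sketch of the de-duplicated logic. Names mirror
// the YARN code, but ContainerState/NMContainerStatus here are stand-ins.
public class ContainerStatusFilter {

  enum ContainerState { RUNNING, COMPLETE }

  static class NMContainerStatus {
    final String containerId;
    final ContainerState state;
    NMContainerStatus(String containerId, ContainerState state) {
      this.containerId = containerId;
      this.state = state;
    }
  }

  /**
   * Splits reported statuses into newly launched and completed containers,
   * adding newly seen running containers to launchedContainers as a side
   * effect. Returns [newlyLaunched, completed].
   */
  static List<List<String>> handleContainerStatuses(
      List<String> launchedContainers, List<NMContainerStatus> reported) {
    List<String> newlyLaunched = new ArrayList<>();
    List<String> completed = new ArrayList<>();
    for (NMContainerStatus status : reported) {
      if (status.state == ContainerState.RUNNING) {
        if (!launchedContainers.contains(status.containerId)) {
          // Just launched container; RM learns about it the first time.
          launchedContainers.add(status.containerId);
          newlyLaunched.add(status.containerId);
        }
      } else {
        completed.add(status.containerId);
      }
    }
    List<List<String>> result = new ArrayList<>();
    result.add(newlyLaunched);
    result.add(completed);
    return result;
  }

  public static void main(String[] args) {
    List<String> launched = new ArrayList<>();
    launched.add("c1"); // already known to the RM
    List<NMContainerStatus> reported = new ArrayList<>();
    reported.add(new NMContainerStatus("c1", ContainerState.RUNNING));
    reported.add(new NMContainerStatus("c2", ContainerState.RUNNING));
    reported.add(new NMContainerStatus("c3", ContainerState.COMPLETE));
    List<List<String>> split = handleContainerStatuses(launched, reported);
    System.out.println(split.get(0)); // newly launched containers
    System.out.println(split.get(1)); // completed containers
  }
}
```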
- Does the condition below also apply to the newly added code in 
ReconnectNodeTransition?
{code}
        // Don't bother with containers already scheduled for cleanup, or for
        // applications already killed. The scheduler doesn't need to know any
        // more about this container
        if (rmNode.containersToClean.contains(containerId)) {
          LOG.info("Container " + containerId + " already scheduled for " +
                        "cleanup, no further processing");
          continue;
        }
        if (rmNode.finishedApplications.contains(containerId
            .getApplicationAttemptId().getApplicationId())) {
          LOG.info("Container " + containerId
              + " belongs to an application that is already killed,"
              + " no further processing");
          continue;
        }
{code}
- Could you add a timeout to the test, rename testAppCleanupWhenNMRstarts to 
testProcessingContainerStatusesOnNMRestart, and add more detailed comments 
explaining what the test does?
{code}
@Test
  public void testAppCleanupWhenNMRstarts() throws Exception
{code}
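On the timeout point: JUnit 4, which these tests use, supports a per-test timeout directly on the annotation. A sketch with the suggested rename; the 60-second value here is only an illustrative choice, not something from the patch:

```java
@Test(timeout = 60000)
public void testProcessingContainerStatusesOnNMRestart() throws Exception {
  // ... test body unchanged ...
}
```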
- Question: does the 3072 comprise 1024 MB for the AM container and 2048 MB for 
the allocated container?
{code}
 Assert.assertEquals(3072, allocatedMB);
{code}
- Could you add a validation that ApplicationMasterService#allocate indeed 
receives the completed container in this scenario?
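Something along these lines could work for that validation. This is only a sketch: it assumes the MockAM/MockRM helpers already used in this test class, and the empty lists passed to allocate represent a plain heartbeat with no new requests or releases:

```java
// Hedged sketch: am is the MockAM for the running application. The completed
// container reported during NM re-registration should reach the AM through
// the allocate (heartbeat) response.
AllocateResponse response =
    am.allocate(new ArrayList<ResourceRequest>(), new ArrayList<ContainerId>());
Assert.assertEquals(1, response.getCompletedContainersStatuses().size());
```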

> After NM restart,completed containers are not released by RM which are sent 
> during NM registration
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3194
>                 URL: https://issues.apache.org/jira/browse/YARN-3194
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>         Environment: NM restart is enabled
>            Reporter: Rohith
>            Assignee: Rohith
>         Attachments: 0001-yarn-3194-v1.patch
>
>
> On NM restart, the NM sends all outstanding NMContainerStatus reports to the 
> RM, but the RM processes only containers in ContainerState.RUNNING. If a 
> container completed while the NM was down, its resources are never released, 
> which causes applications to hang.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
