Jian He commented on YARN-1372:

Thanks for your explanation.
bq. Is it possible that for some reason there is no ack from the AM, the 
application never gets removed, and these entries stay in memory? 
ApplicationImpl on the NM should be guaranteed to be cleaned up for already 
completed applications. (Otherwise it's a leak, and we should fix that too.)
bq. If we are removing it from the NM store, is there any value in keeping it 
in memory? If the NM restarts, it's not going to know about this anyway.
That's why I said in my previous comment: {{make sure 
context.getNMStateStore().removeContainer(cid); is called after receiving the 
notification from RM as well.}}

One other thing:
- In RMAppAttemptImpl#pullJustFinishedContainers, we could send the whole 
list of containers in one event instead of sending an individual event for each 
container:
      for (Map.Entry<ContainerStatus, NodeId> finishedContainerStatus: this
          .finishedContainersSentToAM.entrySet()) {
        // Implicitly acks the previous list as being received by the AM
        eventHandler.handle(new RMNodeCleanedupContainerNotifiedEvent(
            finishedContainerStatus.getValue(), finishedContainerStatus
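
A rough sketch of the batching idea, not the actual patch. It assumes the batch is grouped per node, since the notification is addressed to an RMNode. Plain Strings stand in for ContainerId and NodeId, and NodeAckEvent and buildEvents are hypothetical names used only for illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins: Strings replace the real ContainerId / NodeId types.
final class BatchedAckSketch {

    // Hypothetical batched event: one per node, carrying every acked
    // container that ran on that node.
    static final class NodeAckEvent {
        final String nodeId;
        final List<String> containerIds;

        NodeAckEvent(String nodeId, List<String> containerIds) {
            this.nodeId = nodeId;
            this.containerIds = containerIds;
        }
    }

    // Groups the container -> node map by node and returns one event per
    // node, instead of one event per container.
    static List<NodeAckEvent> buildEvents(Map<String, String> finishedContainersSentToAM) {
        Map<String, List<String>> byNode = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : finishedContainersSentToAM.entrySet()) {
            byNode.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        List<NodeAckEvent> events = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : byNode.entrySet()) {
            events.add(new NodeAckEvent(e.getKey(), e.getValue()));
        }
        return events;
    }
}
```

With this shape the number of events dispatched scales with the number of nodes involved rather than the number of finished containers.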

> Ensure all completed containers are reported to the AMs across RM restart
> -------------------------------------------------------------------------
>                 Key: YARN-1372
>                 URL: https://issues.apache.org/jira/browse/YARN-1372
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Anubhav Dhoot
>         Attachments: YARN-1372.001.patch, YARN-1372.001.patch, 
> YARN-1372.002_NMHandlesCompletedApp.patch, 
> YARN-1372.002_RMHandlesCompletedApp.patch, 
> YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, 
> YARN-1372.prelim.patch, YARN-1372.prelim2.patch
> Currently the NM informs the RM about completed containers and then removes 
> those containers from the RM notification list. The RM passes on that 
> completed container information to the AM and the AM pulls this data. If the 
> RM dies before the AM pulls this data then the AM may not be able to get this 
> information again. To fix this, NM should maintain a separate list of such 
> completed container notifications sent to the RM. After the AM has pulled the 
> containers from the RM then the RM will inform the NM about it and the NM can 
> remove the completed container from the new list. Upon re-register with the 
> RM (after RM restart) the NM should send the entire list of completed 
> containers to the RM along with any other containers that completed while the 
> RM was dead. This ensures that the RM can inform the AMs about all completed 
> containers. Some container completions may be reported more than once since 
> the AM may have pulled the container but the RM may die before notifying the 
> NM about the pull.
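
The NM-side bookkeeping described in the issue can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the NodeManager implementation: the class name CompletedContainerTracker and all of its methods are hypothetical, and container IDs are plain Strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the "separate list" the issue asks the NM to keep.
final class CompletedContainerTracker {
    // Completions not yet reported to the RM.
    private final List<String> pending = new ArrayList<>();
    // Completions reported to the RM but not yet acked as pulled by the AM.
    private final Set<String> sentAwaitingAck = new LinkedHashSet<>();

    void containerCompleted(String containerId) {
        pending.add(containerId);
    }

    // Regular heartbeat: report only new completions, but remember them
    // until the RM confirms that the AM has pulled them.
    List<String> heartbeat() {
        List<String> report = new ArrayList<>(pending);
        sentAwaitingAck.addAll(pending);
        pending.clear();
        return report;
    }

    // The RM notifies the NM that the AM pulled these containers.
    void ackFromRm(List<String> containerIds) {
        sentAwaitingAck.removeAll(containerIds);
    }

    // Re-registration after an RM restart: resend everything still unacked,
    // plus anything that completed while the RM was down.
    List<String> reRegister() {
        List<String> report = new ArrayList<>(sentAwaitingAck);
        report.addAll(pending);
        sentAwaitingAck.addAll(pending);
        pending.clear();
        return report;
    }
}
```

Because an entry is dropped only on an explicit ack, a container whose ack was lost when the RM died is reported again on re-registration, which matches the duplicate-report caveat in the description.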
