[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129459#comment-14129459 ]
Jian He commented on YARN-1372: ------------------------------- Thanks for updating the patch. Some comments and naming suggestions: - NodeStatusUpdater import changes only, we can revert - indentation format of the second line. {code} public void removeCompletedContainersFromContext(List<ContainerId> containerIds) throws public RMNodeCleanedupContainerNotifiedEvent(NodeId nodeId, ContainerId contId) { {code} - why adding {{context.getContainers().remove(cid);}} in removeVeryOldStoppedContainersFromContext method? won’t this remove the containers from context immediately when we send the container statuses across, which contradicts the rest of the changes? - In NodeStatusUpdaterImpl, previousCompletedContainers cache is not needed any more, as we make NM remove containers from context only after it gets the notification. We can remove this; Instead, in NodeStatusUpdater#getContainerStatuses, while we are looping all the containers, we can check whether the corresponding application exists, if Not, remove it from context. - make sure {{context.getNMStateStore().removeContainer(cid);}} is called after receiving the notification from RM as well. - {{RMNodeEventType#CLEANEDUP_CONTAINER_NOTIFIED}}: put in a new section where source is RMAppAttempt. how about rename to FINISHED_CONTAINERS_PULLED_BY_AM; similarly RMNodeCleanedupContainerNotifiedEvent -> RMNodeFinishedContainersPulledByAMEvent - In RMAppAttemptImpl#BaseFinalTransition, we can clear finishedContainersSentToAM, in case that AM unexpectedly crashes. - I think Map<NodeId, List<ContainerStatus >> is more space efficient than: {code} private Map<ContainerStatus, NodeId> finishedContainersSentToAM = new HashMap<ContainerStatus, NodeId>(); {code} - : format convention is to have method body in a different line from the method head. {code} public NodeId getNodeId() { return this.nodeId; } {code} - RMNodeImpl#cleanupContainersNotified, may be rename to finishedContainersPulledByAM. similarly CleanedupContainerNotifiedTransition to FinishedContainersPulledByAMTransition. - NodeHeartbeatResponse#addCleanedupContainersNotified, how about addFinishedContainersPulledByAM; similarly for the getter NodeHeartbeatResponse#getCleanedupContainersNotified and the proto file. Also add some code comments to explain why adding this new API. > Ensure all completed containers are reported to the AMs across RM restart > ------------------------------------------------------------------------- > > Key: YARN-1372 > URL: https://issues.apache.org/jira/browse/YARN-1372 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager > Reporter: Bikas Saha > Assignee: Anubhav Dhoot > Attachments: YARN-1372.001.patch, YARN-1372.001.patch, > YARN-1372.002_NMHandlesCompletedApp.patch, > YARN-1372.002_RMHandlesCompletedApp.patch, > YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, > YARN-1372.prelim.patch, YARN-1372.prelim2.patch > > > Currently the NM informs the RM about completed containers and then removes > those containers from the RM notification list. The RM passes on that > completed container information to the AM and the AM pulls this data. If the > RM dies before the AM pulls this data then the AM may not be able to get this > information again. To fix this, NM should maintain a separate list of such > completed container notifications sent to the RM. After the AM has pulled the > containers from the RM then the RM will inform the NM about it and the NM can > remove the completed container from the new list. Upon re-register with the > RM (after RM restart) the NM should send the entire list of completed > containers to the RM along with any other containers that completed while the > RM was dead. This ensures that the RM can inform the AM's about all completed > containers. Some container completions may be reported more than once since > the AM may have pulled the container but the RM may die before notifying the > NM about the pull. -- This message was sent by Atlassian JIRA (v6.3.4#6332)