[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140052#comment-14140052 ]
Jian He commented on YARN-1372: ------------------------------- Thanks for updating! looks good overall, some comments: - we may add the AM container into the justFinishedContainers as well in some transitions, so that the AM container instance could be removed on NM. - this method exceed 80 limit. the method doesn’t need to add containerStatus as a parameter, as the containerStatus can be got from containerFinishedEvent. {code} private static void addJustFinishedContainer(RMAppAttemptImpl appAttem….. {code} - MockRM, import changes only, we can revert. - transferStateFromPreviousAttempt is currently invoked by RMApp only if work-preserving restart is enabled. As per discussion, we are going to transfer them irrespectively. - For the following code, {code} if (recentlyStoppedContainers.get(cid) < currentTime || !context.getContainers().containsKey(cid)) { i.remove(); context.getContainers().remove(cid); context.getNMStateStore().removeContainer(cid); } {code} should the logic be like this ? So that the state-store can preserve the containers until it the container is removed from context. {code} if (recentlyStoppedContainers.get(cid) < currentTime) { i.remove(); if (!context.getContainers().containsKey(cid)) { context.getNMStateStore().removeContainer(cid); } } {code} - Could you merge the following two transitions up to where the previously corresponding (DECOMMISSIONED/LOST) state transitions are, maybe add the same transition for REBOOTED state as well ? (Also, there’s a typo in the first transition state, it’s supposed to be DECOMMISSIONED instead of UNHEALTHY) {code} //Transitions from DECOMMISSIONED state .addTransition(NodeState.UNHEALTHY, NodeState.UNHEALTHY, RMNodeEventType.FINISHED_CONTAINERS_PULLED_BY_AM, new FinishedContainersPulledByAMTransition()) //Transitions from LOST state .addTransition(NodeState.LOST, NodeState.LOST, RMNodeEventType.FINISHED_CONTAINERS_PULLED_BY_AM, new FinishedContainersPulledByAMTransition()) {code} > Ensure all completed containers are reported to the AMs across RM restart > ------------------------------------------------------------------------- > > Key: YARN-1372 > URL: https://issues.apache.org/jira/browse/YARN-1372 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager > Reporter: Bikas Saha > Assignee: Anubhav Dhoot > Attachments: YARN-1372.001.patch, YARN-1372.001.patch, > YARN-1372.002_NMHandlesCompletedApp.patch, > YARN-1372.002_RMHandlesCompletedApp.patch, > YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, > YARN-1372.004.patch, YARN-1372.005.patch, YARN-1372.005.patch, > YARN-1372.006.patch, YARN-1372.prelim.patch, YARN-1372.prelim2.patch > > > Currently the NM informs the RM about completed containers and then removes > those containers from the RM notification list. The RM passes on that > completed container information to the AM and the AM pulls this data. If the > RM dies before the AM pulls this data then the AM may not be able to get this > information again. To fix this, NM should maintain a separate list of such > completed container notifications sent to the RM. After the AM has pulled the > containers from the RM then the RM will inform the NM about it and the NM can > remove the completed container from the new list. Upon re-register with the > RM (after RM restart) the NM should send the entire list of completed > containers to the RM along with any other containers that completed while the > RM was dead. This ensures that the RM can inform the AM's about all completed > containers. Some container completions may be reported more than once since > the AM may have pulled the container but the RM may die before notifying the > NM about the pull. -- This message was sent by Atlassian JIRA (v6.3.4#6332)