[
https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101568#comment-14101568
]
Jian He commented on YARN-1372:
-------------------------------
bq. Not sure if there is an easier way to link the two right now as the
application cleanup lifecycle also converts into a Container Kill just like any
other container Kill.
I meant: can we remove all of the application's containers from NMContext as
soon as we receive the NodeHeartbeatResponse#getApplicationsToCleanup
notification, instead of depending on expiration? The applications have already
completed by the time applicationsToCleanup arrives, so the containers kept in
NMContext may no longer be needed.
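
To illustrate the suggestion, here is a minimal sketch, not the actual NM
code. SketchNmContext and its methods are hypothetical stand-ins for NMContext,
and plain strings stand in for ContainerId/ApplicationId. The list passed to
removeContainersOf plays the role of the value returned by
NodeHeartbeatResponse#getApplicationsToCleanup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the NM-side context; not the real NMContext API.
class SketchNmContext {
    // containerId -> applicationId for containers the NM still remembers
    private final Map<String, String> containerToApp = new HashMap<>();

    void remember(String containerId, String appId) {
        containerToApp.put(containerId, appId);
    }

    // Drop every remembered container that belongs to an application the RM
    // has asked us to clean up, without waiting for any expiration timer.
    // Returns the ids of the containers that were removed.
    List<String> removeContainersOf(List<String> appsToCleanup) {
        List<String> removed = new ArrayList<>();
        Iterator<Map.Entry<String, String>> it =
            containerToApp.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            if (appsToCleanup.contains(entry.getValue())) {
                removed.add(entry.getKey());
                it.remove();
            }
        }
        return removed;
    }

    boolean contains(String containerId) {
        return containerToApp.containsKey(containerId);
    }
}
```

In this sketch, containers of applications that are not in the cleanup list
stay in the context and would still rely on whatever expiration the patch
already has.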
bq. This is to allow a separate set of justFinishedContainers that can be used
for returning to AM and at the same time acknowledging the previously returned
set to NM.
Can the same justFinishedContainers set be used both to return to the AM and to
ack the NMs?
bq. DECOMMISSIONED/LOST state possible to receive the new event?
Sorry for being unclear. I meant: is it possible for an NM in the
DECOMMISSIONED/LOST state to receive the newly added
CLEANEDUP_CONTAINER_NOTIFIED event? If so, we need to handle the event in those
states too.
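
A rough sketch of what I mean, under the assumption that an event with no
registered transition for the current state is treated as an error. The enum
and class names below are hypothetical and much simpler than the real node
state machine; only the CLEANEDUP_CONTAINER_NOTIFIED name comes from the patch.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, reduced set of states and events for illustration only.
enum SketchNodeState { RUNNING, DECOMMISSIONED, LOST }
enum SketchNodeEvent { STATUS_UPDATE, CLEANEDUP_CONTAINER_NOTIFIED }

class SketchNodeStateMachine {
    // Events that each state accepts; anything else is an invalid transition.
    private final Map<SketchNodeState, Set<SketchNodeEvent>> accepted =
        new EnumMap<>(SketchNodeState.class);
    private final SketchNodeState state;

    SketchNodeStateMachine(SketchNodeState state) {
        this.state = state;
        accepted.put(SketchNodeState.RUNNING,
            EnumSet.of(SketchNodeEvent.STATUS_UPDATE,
                       SketchNodeEvent.CLEANEDUP_CONTAINER_NOTIFIED));
        // Terminal states also register the new event so that a late
        // notification does not hit an invalid transition.
        accepted.put(SketchNodeState.DECOMMISSIONED,
            EnumSet.of(SketchNodeEvent.CLEANEDUP_CONTAINER_NOTIFIED));
        accepted.put(SketchNodeState.LOST,
            EnumSet.of(SketchNodeEvent.CLEANEDUP_CONTAINER_NOTIFIED));
    }

    // Returns true if the event was acted on, false if it was accepted but
    // ignored because the node is in a terminal state.
    boolean handle(SketchNodeEvent event) {
        if (!accepted.get(state).contains(event)) {
            throw new IllegalStateException(
                "Invalid event " + event + " in state " + state);
        }
        return state == SketchNodeState.RUNNING;
    }
}
```

Whether the event should be ignored or processed in DECOMMISSIONED/LOST is a
separate question; the point is only that it needs an explicit transition.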
The patch no longer applies. Could you please update it? Thanks.
> Ensure all completed containers are reported to the AMs across RM restart
> -------------------------------------------------------------------------
>
> Key: YARN-1372
> URL: https://issues.apache.org/jira/browse/YARN-1372
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Bikas Saha
> Assignee: Anubhav Dhoot
> Attachments: YARN-1372.001.patch, YARN-1372.001.patch,
> YARN-1372.prelim.patch, YARN-1372.prelim2.patch
>
>
> Currently the NM informs the RM about completed containers and then removes
> those containers from the RM notification list. The RM passes on that
> completed container information to the AM and the AM pulls this data. If the
> RM dies before the AM pulls this data then the AM may not be able to get this
> information again. To fix this, NM should maintain a separate list of such
> completed container notifications sent to the RM. After the AM has pulled the
> containers from the RM then the RM will inform the NM about it and the NM can
> remove the completed container from the new list. Upon re-register with the
> RM (after RM restart) the NM should send the entire list of completed
> containers to the RM along with any other containers that completed while the
> RM was dead. This ensures that the RM can inform the AMs about all completed
> containers. Some container completions may be reported more than once since
> the AM may have pulled the container but the RM may die before notifying the
> NM about the pull.
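
The bookkeeping described above can be sketched roughly as follows. This is an
illustration under simplifying assumptions, not the patch itself: the class and
method names are hypothetical, container ids are plain strings, and the
heartbeat and re-register calls are reduced to methods returning the list that
would be sent to the RM.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical NM-side tracker for completed-container notifications.
class SketchCompletedContainerTracker {
    // Completed but not yet reported to the RM.
    private final Set<String> newlyCompleted = new LinkedHashSet<>();
    // Reported to the RM, but the RM has not yet confirmed the AM pulled them.
    private final Set<String> reportedAwaitingAck = new LinkedHashSet<>();

    void containerCompleted(String containerId) {
        newlyCompleted.add(containerId);
    }

    // Regular heartbeat: report new completions and keep them in the
    // separate list until the RM acknowledges them.
    List<String> heartbeat() {
        List<String> report = new ArrayList<>(newlyCompleted);
        reportedAwaitingAck.addAll(newlyCompleted);
        newlyCompleted.clear();
        return report;
    }

    // The RM informs the NM that the AM has pulled these containers.
    void ackedByRm(Collection<String> containerIds) {
        reportedAwaitingAck.removeAll(containerIds);
    }

    // Re-register after an RM restart: resend everything still unacknowledged
    // plus anything that completed while the RM was down.
    List<String> reRegister() {
        Set<String> all = new LinkedHashSet<>(reportedAwaitingAck);
        all.addAll(newlyCompleted);
        reportedAwaitingAck.addAll(newlyCompleted);
        newlyCompleted.clear();
        return new ArrayList<>(all);
    }
}
```

Because an entry is removed only on an ack, a container that the AM pulled just
before the RM died is resent on re-register, which is the duplicate reporting
the description mentions.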