Jian He commented on YARN-1372:

bq. I looked into an alternate approach where the RM acks the completed
containers that belong to an App that's completed.
Can you please elaborate on the changes you made for this approach? Looking at
the patch diffs, the following change seems to be what you are referring to:
// Ack all previousJustFinishedContainers and justFinishedContainers to NM
(Why is the same method called twice?) This may not guarantee that the NM is
notified to clean all containers from its context, because a) some containers
may not have finished yet at this point; b) the NM will not be notified about
containers that have finished but are not yet added to the list (in transit).

bq. Next allocate call from AM has happened after the container was sent. This
implicitly acks from AM point of view and now can be sent to NM.
Can we ack the NMs at the same time as the finishedContainers are pulled by
the AM? In the current patch, when the AM calls allocate,
justFinishedContainers is transferred to previousJustFinishedContainers. If the
AM never calls allocate again, the NM will never be notified about the
containers left in previousJustFinishedContainers.
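
To make the concern concrete, here is a minimal Java sketch of the two-list
handoff as I read it from the patch. Only justFinishedContainers and
previousJustFinishedContainers are names taken from the patch; the class, the
remaining methods, and the use of plain strings as container IDs are
hypothetical and are not the actual RM code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the two-list handoff; not the YARN-1372 patch code. */
final class FinishedContainerHandoff {
    // Completed containers that the AM has not pulled yet.
    private final List<String> justFinishedContainers = new ArrayList<>();
    // Containers returned by the last allocate call, not yet acked to the NM.
    private final List<String> previousJustFinishedContainers = new ArrayList<>();
    // Container IDs the RM has told the NM it may drop (illustrative only).
    private final List<String> ackedToNM = new ArrayList<>();

    /** The NM reported a completed container to the RM. */
    void containerFinished(String containerId) {
        justFinishedContainers.add(containerId);
    }

    /** One allocate call from the AM; returns the newly completed containers. */
    List<String> allocate() {
        // This call implicitly acks the batch returned by the previous call.
        ackedToNM.addAll(previousJustFinishedContainers);
        previousJustFinishedContainers.clear();

        // The current batch now waits for the *next* allocate call.
        previousJustFinishedContainers.addAll(justFinishedContainers);
        List<String> pulled = new ArrayList<>(justFinishedContainers);
        justFinishedContainers.clear();
        return pulled;
    }

    List<String> ackedToNM() {
        return new ArrayList<>(ackedToNM);
    }

    List<String> pendingAck() {
        return new ArrayList<>(previousJustFinishedContainers);
    }

    public static void main(String[] args) {
        FinishedContainerHandoff rm = new FinishedContainerHandoff();
        rm.containerFinished("container_1");

        System.out.println("pulled by AM: " + rm.allocate());   // [container_1]
        System.out.println("acked to NM:  " + rm.ackedToNM());  // []
        System.out.println("pending ack:  " + rm.pendingAck()); // [container_1]

        // Only a second allocate call releases container_1 to the NM.
        rm.allocate();
        System.out.println("acked to NM:  " + rm.ackedToNM());  // [container_1]
    }
}
```

In this model, container_1 stays in previousJustFinishedContainers until a
second allocate call arrives. If the AM pulls once and then exits, the ack
never reaches the NM, which is the case described above.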

> Ensure all completed containers are reported to the AMs across RM restart
> -------------------------------------------------------------------------
>                 Key: YARN-1372
>                 URL: https://issues.apache.org/jira/browse/YARN-1372
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Anubhav Dhoot
>         Attachments: YARN-1372.001.patch, YARN-1372.001.patch, 
> YARN-1372.002_NMHandlesCompletedApp.patch, 
> YARN-1372.002_RMHandlesCompletedApp.patch, 
> YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.prelim.patch, 
> YARN-1372.prelim2.patch
> Currently the NM informs the RM about completed containers and then removes 
> those containers from the RM notification list. The RM passes on that 
> completed container information to the AM and the AM pulls this data. If the 
> RM dies before the AM pulls this data then the AM may not be able to get this 
> information again. To fix this, NM should maintain a separate list of such 
> completed container notifications sent to the RM. After the AM has pulled the 
> containers from the RM then the RM will inform the NM about it and the NM can 
> remove the completed container from the new list. Upon re-register with the 
> RM (after RM restart) the NM should send the entire list of completed 
> containers to the RM along with any other containers that completed while the 
> RM was dead. This ensures that the RM can inform the AMs about all completed 
> containers. Some container completions may be reported more than once since 
> the AM may have pulled the container but the RM may die before notifying the 
> NM about the pull.
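
The NM-side bookkeeping described in the issue summary could be sketched
roughly as follows. This is only an illustration under stated assumptions: the
class and method names are hypothetical, container IDs are plain strings, and
none of this is taken from the attached patches.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical NM-side tracker for completed containers; not patch code. */
final class NodeCompletedContainerTracker {
    // Completed containers not yet reported to the RM.
    private final Set<String> pendingReport = new LinkedHashSet<>();
    // Containers reported to the RM but not yet acked as pulled by the AM.
    private final Set<String> reportedAwaitingAck = new LinkedHashSet<>();

    void containerCompleted(String containerId) {
        pendingReport.add(containerId);
    }

    /** Regular heartbeat: report new completions and keep them until acked. */
    List<String> heartbeat() {
        List<String> report = new ArrayList<>(pendingReport);
        reportedAwaitingAck.addAll(pendingReport);
        pendingReport.clear();
        return report;
    }

    /** The RM says the AM has pulled these containers, so they can be dropped. */
    void ackFromRM(List<String> containerIds) {
        reportedAwaitingAck.removeAll(containerIds);
    }

    /** Re-register after RM restart: resend unacked plus newly completed. */
    List<String> reRegister() {
        List<String> report = new ArrayList<>(reportedAwaitingAck);
        report.addAll(pendingReport);
        reportedAwaitingAck.addAll(pendingReport);
        pendingReport.clear();
        return report;
    }

    public static void main(String[] args) {
        NodeCompletedContainerTracker nm = new NodeCompletedContainerTracker();
        nm.containerCompleted("container_1");
        System.out.println(nm.heartbeat());   // [container_1]

        // RM dies before acking; container_2 completes while it is down.
        nm.containerCompleted("container_2");
        System.out.println(nm.reRegister());  // [container_1, container_2]
    }
}
```

Because an entry is dropped only on an explicit ack, a completion can be
reported more than once if the RM dies after the AM pulled it but before the
ack reached the NM, which matches the duplicate-report caveat in the summary.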

