[ 
https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111713#comment-14111713
 ] 

Anubhav Dhoot commented on YARN-1372:
-------------------------------------

bq. I meant is it possible for NM at DECOMMISSIONED/LOST state to receive the 
newly added CLEANEDUP_CONTAINER_NOTIFIED event ? If so, we need to handle them 
too.
Fixed that.

bq. the same justFinishedContainers set can be used to return to AM and ack NMs?
A completed container in this set can be in one of three states:
a) Added to justFinishedContainers but not yet sent to the AM.
b) Sent to the AM in a previous allocateResponse but not yet acked.
c) The AM has made its next allocate call after the container was sent. This 
implicitly acks the container from the AM's point of view, so the ack can now 
be sent to the NM.
Instead of adding extra state to track a) and b), I used two 
collections, justFinishedContainers and previousJustFinishedContainers, 
respectively. I have added tests to show that.
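
As a rough illustration of the two-collection idea described above, here is a 
minimal standalone sketch. It is not the code in the patch: the class 
CompletedContainerTracker, the AllocateResult type, and the use of plain 
String container ids are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the bookkeeping; not the actual RM code.
class CompletedContainerTracker {
    // State (a): finished, but not yet returned to the AM.
    private final List<String> justFinishedContainers = new ArrayList<>();
    // State (b): returned in the previous allocate response, not yet acked.
    private final List<String> previousJustFinishedContainers = new ArrayList<>();

    /** Outcome of a single allocate call (illustrative type). */
    static final class AllocateResult {
        final List<String> completedForAm; // reported to the AM in this response
        final List<String> ackedForNm;     // now safe to ack to the NMs

        AllocateResult(List<String> completedForAm, List<String> ackedForNm) {
            this.completedForAm = completedForAm;
            this.ackedForNm = ackedForNm;
        }
    }

    /** Called when the RM learns that a container has completed. */
    void containerFinished(String containerId) {
        justFinishedContainers.add(containerId);
    }

    /** Called for each allocate call from the AM. */
    AllocateResult allocate() {
        // State (c): this call implicitly acks everything sent last time.
        List<String> acked = new ArrayList<>(previousJustFinishedContainers);
        previousJustFinishedContainers.clear();

        // Move (a) -> (b): report the newly finished containers to the AM
        // and remember them until the next allocate call.
        List<String> toAm = new ArrayList<>(justFinishedContainers);
        previousJustFinishedContainers.addAll(justFinishedContainers);
        justFinishedContainers.clear();

        return new AllocateResult(toAm, acked);
    }
}
```

In this sketch a container is acked to the NM only on the allocate call after 
the one that returned it to the AM, so no separate per-container state flag is 
needed.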

bq. I meant can we remove all the containers in NMContext for the application 
once we received the NodeHeartbeatResponse#getApplicationsToCleanup 
notification, instead of depending on expiration.

I tried doing that but ran into one issue: ApplicationImpl, which holds the 
mapping of application to containers, cannot access the event dispatcher for 
ContainerManagerImpl (which is the one removing the containers from the 
context). I am going to upload a patch that removes the dispatcher local to 
ContainerManagerImpl (~/patches/YARN-1372.002_NMHandlesCompletedApp.patch).

I also looked into an alternate approach in which the RM acks the completed 
containers that belong to an app that has completed. I am uploading that patch 
as well (~/patches/YARN-1372.002_RMHandlesCompletedApp.patch).

> Ensure all completed containers are reported to the AMs across RM restart
> -------------------------------------------------------------------------
>
>                 Key: YARN-1372
>                 URL: https://issues.apache.org/jira/browse/YARN-1372
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Anubhav Dhoot
>         Attachments: YARN-1372.001.patch, YARN-1372.001.patch, 
> YARN-1372.prelim.patch, YARN-1372.prelim2.patch
>
>
> Currently the NM informs the RM about completed containers and then removes 
> those containers from the RM notification list. The RM passes on that 
> completed container information to the AM and the AM pulls this data. If the 
> RM dies before the AM pulls this data then the AM may not be able to get this 
> information again. To fix this, NM should maintain a separate list of such 
> completed container notifications sent to the RM. After the AM has pulled the 
> containers from the RM then the RM will inform the NM about it and the NM can 
> remove the completed container from the new list. Upon re-register with the 
> RM (after RM restart) the NM should send the entire list of completed 
> containers to the RM along with any other containers that completed while the 
> RM was dead. This ensures that the RM can inform the AMs about all completed 
> containers. Some container completions may be reported more than once since 
> the AM may have pulled the container but the RM may die before notifying the 
> NM about the pull.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
