[ 
https://issues.apache.org/jira/browse/YARN-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341538#comment-15341538
 ] 

Rohith Sharma K S commented on YARN-4862:
-----------------------------------------

Hi [~jianhe], apologies for the long delay!
  In the normal flow, the NM informs the RM that a container has finished. The 
RM in turn waits for the AM to pull the finished containers, and once the AM 
has pulled them, the RM tells the NM to remove them from the NMContext.

In the preemption flow:
# The RM preempts the containers, which first sends a KillContainer event to 
RMContainerImpl. 
# KillContainer#transition tells RMNodeImpl to cleanUpTheContainers and also 
tells RMAppAttemptImpl to add them to JustFinishedContainers so that the AM 
pulls the finished containers on its next heartbeat. It is assumed that 
containersToCleanUp is sent to the NM first and that 
containersToBeRemovedFromNM follows on a later NM heartbeat. 

I see a *potential container leak in the NodeManager module* in the 
preemption flow. There can be a situation where {{containersToCleanUp}} and 
{{containersToBeRemovedFromNM}} go out together in the same heartbeat. If the 
same containerId is sent to the NM in both lists together, the container is 
never removed from the NMContext.
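
The race above can be illustrated with a toy model. This is only a sketch, not 
the real NodeManager code: {{ToyNodeManager}}, {{LeakDemo}} and the 
{{onHeartbeatResponse}} method are hypothetical names, and the model assumes 
that the NM applies the removal list first and only drops containers that have 
already completed.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of NM-side heartbeat handling; NOT the real NodeManager code. */
class ToyNodeManager {
    enum State { RUNNING, DONE }

    // Stand-in for the container map held in NMContext.
    final Map<String, State> context = new HashMap<>();

    void launch(String containerId) {
        context.put(containerId, State.RUNNING);
    }

    /**
     * Applies one heartbeat response. Assumption: the removal list is applied
     * first and only drops containers that have already completed.
     */
    void onHeartbeatResponse(List<String> containersToCleanUp,
                             List<String> containersToBeRemovedFromNM) {
        for (String id : containersToBeRemovedFromNM) {
            if (context.get(id) == State.DONE) {
                context.remove(id);
            }
            // A still-running container is skipped; the request is not resent.
        }
        for (String id : containersToCleanUp) {
            if (context.containsKey(id)) {
                context.put(id, State.DONE); // container killed, entry stays
            }
        }
    }
}

class LeakDemo {
    public static void main(String[] args) {
        List<String> target = Collections.singletonList("container_1");
        List<String> none = Collections.emptyList();

        // Expected ordering: cleanup in one heartbeat, removal in a later one.
        ToyNodeManager separate = new ToyNodeManager();
        separate.launch("container_1");
        separate.onHeartbeatResponse(target, none);
        separate.onHeartbeatResponse(none, target);
        System.out.println("separate heartbeats, leaked: "
            + separate.context.containsKey("container_1")); // false

        // Problem case: both lists arrive in the same heartbeat.
        ToyNodeManager together = new ToyNodeManager();
        together.launch("container_1");
        together.onHeartbeatResponse(target, target);
        System.out.println("same heartbeat, leaked: "
            + together.context.containsKey("container_1")); // true
    }
}
```

In the second case the removal request is consumed while the container is 
still running, so nothing ever clears the entry afterwards.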

CC: [~jlowe]. Basically, I feel this is a bug in the RM: whenever finished 
containers are received from the NM and rmContainer is null, the RM should 
inform RMNode. 


And for this JIRA, I think the current patch approach should be fine if we fix 
the above-mentioned issue. Thoughts?
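
For reference, the safeguard described in the issue summary quoted below could 
look roughly like the following. This is a minimal sketch under assumptions, 
not the attached patch: {{ToyRMNode}}, {{seenCompleted}}, {{updateQueue}} and 
the method names are hypothetical, and a list of IDs stands in for 
UpdatedContainerInfo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy stand-in for RM-side node state; names are hypothetical. */
class ToyRMNode {
    // IDs of completed containers this node has already reported.
    private final Set<String> seenCompleted = new HashSet<>();
    // Stand-in for the queue of UpdatedContainerInfo objects.
    final List<List<String>> updateQueue = new ArrayList<>();

    /** Handles one node status update carrying completed container IDs. */
    void onStatusUpdate(List<String> completedContainerIds) {
        List<String> fresh = new ArrayList<>();
        for (String id : completedContainerIds) {
            if (seenCompleted.add(id)) { // add() returns false for a duplicate
                fresh.add(id);
            }
        }
        if (!fresh.isEmpty()) {
            updateQueue.add(fresh); // only enqueue when something is new
        }
    }

    /** Forget an ID once the NM has been told to remove the container. */
    void onContainerRemoved(String id) {
        seenCompleted.remove(id);
    }
}

class DedupDemo {
    public static void main(String[] args) {
        ToyRMNode node = new ToyRMNode();
        // The same completed container is reported on three heartbeats.
        for (int i = 0; i < 3; i++) {
            node.onStatusUpdate(Arrays.asList("container_1"));
        }
        System.out.println("queued updates: " + node.updateQueue.size()); // 1
    }
}
```

Without the set, each of the three reports would enqueue its own update 
object, which is the heap growth the issue summary warns about.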

> Handle duplicate completed containers in RMNodeImpl
> ---------------------------------------------------
>
>                 Key: YARN-4862
>                 URL: https://issues.apache.org/jira/browse/YARN-4862
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>         Attachments: 0001-YARN-4862.patch, 0002-YARN-4862.patch
>
>
> As per 
> [comment|https://issues.apache.org/jira/browse/YARN-4852?focusedCommentId=15209689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209689]
>  from [~sharadag], there should be a safeguard against duplicated container 
> statuses in RMNodeImpl before creating UpdatedContainerInfo. 
> Otherwise, in a heavily loaded cluster where event processing gradually slows 
> down, if any duplicated containers are sent to the RM (possibly due to a bug 
> in the NM), there is a significant impact because RMNodeImpl always creates 
> UpdatedContainerInfo for duplicated containers. This increases heap memory 
> usage and causes problems like YARN-4852.
> This is an optimization for issues like YARN-4852.



