[ 
https://issues.apache.org/jira/browse/YARN-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15343725#comment-15343725
 ] 

Rohith Sharma K S commented on YARN-4862:
-----------------------------------------

Thanks, Jason, for your suggestion :-)

I see two different scenarios in which a container leak can occur after this
JIRA's patch is applied.
# NM forgets the completed-container status. -> An approach similar to YARN-5197
can be used to handle this leak.
# (RM forgets) YarnScheduler clears the RMContainer details because of
preemption. As a result, the scheduler (RMContainer) informs RMNodeImpl to add
the container to the {{containersToCleanUp}} list. In addition, RMAppAttempt
informs RMNodeImpl to add it to {{containersToBeRemovedFromNM}} after the AM
pulls the finished containers. If the NM-RM heartbeat interval is longer than
the AM-RM heartbeat interval, both can go out together in the same
nodeHeartbeat response. In that case the YARN-5279 issue occurs, and the NM
also keeps reporting these containers to the RM as completed. The RM then
starts tracking these completed containers, but they are never purged from the
completed-container set.
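To make the timing argument in scenario 2 concrete, here is a rough Java model. It is not YARN code and the class and method names are hypothetical. It assumes that heartbeats fire at fixed multiples of their interval starting at time 0, that preemption queues the container for cleanup immediately, and that the AM pulls the finished container on its next heartbeat, which is what queues the removal from the NM.

```java
// Hypothetical timing model (not YARN code): do the cleanup entry and the
// remove-from-NM entry for a preempted container reach the same NM heartbeat?
public class HeartbeatBatchingSketch {

    // First heartbeat strictly after eventMs, assuming heartbeats at k * intervalMs.
    static long nextHeartbeat(long eventMs, long intervalMs) {
        return (eventMs / intervalMs + 1) * intervalMs;
    }

    // Assumption: preemption at preemptMs queues the container in containersToCleanUp
    // at once; the AM pulls the finished container on its next heartbeat, which queues
    // it in containersToBeRemovedFromNM. If that pull happens before the NM's next
    // heartbeat, both entries are pending for the same nodeHeartbeat response.
    static boolean sameNodeHeartbeatResponse(long preemptMs, long nmIntervalMs, long amIntervalMs) {
        long nmHeartbeat = nextHeartbeat(preemptMs, nmIntervalMs);
        long amPull = nextHeartbeat(preemptMs, amIntervalMs);
        return amPull < nmHeartbeat;
    }

    public static void main(String[] args) {
        // NM heartbeats less often than the AM: both entries share one response.
        System.out.println(sameNodeHeartbeatResponse(250, 1000, 100)); // true
        // NM heartbeats more often: the cleanup goes out before the AM pull.
        System.out.println(sameNodeHeartbeatResponse(250, 100, 1000)); // false
    }
}
```

Under this model a longer NM interval makes the overlap possible, not guaranteed: with preemption at 950 ms, a 1000 ms NM interval and a 100 ms AM interval, the AM pull and the NM heartbeat both land at 1000 ms and the model reports no overlap.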

> Handle duplicate completed containers in RMNodeImpl
> ---------------------------------------------------
>
>                 Key: YARN-4862
>                 URL: https://issues.apache.org/jira/browse/YARN-4862
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>         Attachments: 0001-YARN-4862.patch, 0002-YARN-4862.patch
>
>
> As per 
> [comment|https://issues.apache.org/jira/browse/YARN-4852?focusedCommentId=15209689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209689]
> from [~sharadag], there should be a safeguard against duplicated container
> statuses in RMNodeImpl before creating UpdatedContainerInfo.
> Otherwise, in a heavily loaded cluster where event processing gradually slows
> down, if any duplicated containers are sent to the RM (possibly due to a bug in
> the NM), the impact is significant: RMNodeImpl always creates an
> UpdatedContainerInfo for the duplicated containers. This increases heap memory
> usage and causes problems like YARN-4852.
> This is an optimization for issues of the YARN-4852 kind.
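A minimal Java sketch of the safeguard the issue asks for: remember which completed containers have already been turned into an update and drop repeats before building anything. ContainerReport, newlyCompleted and purge are hypothetical stand-ins, not the real RMNodeImpl or UpdatedContainerInfo API. The purge step also illustrates the second leak scenario in the comment above: if the signal that the NM has removed a container never arrives, its ID stays in the set forever.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the YARN-4862 safeguard; not the real RMNodeImpl API.
public class DuplicateCompletedFilterSketch {

    // Minimal stand-in for a container status reported by the NM (hypothetical).
    static final class ContainerReport {
        final long containerId;
        final boolean completed;

        ContainerReport(long containerId, boolean completed) {
            this.containerId = containerId;
            this.completed = completed;
        }
    }

    // IDs of completed containers that have already been turned into an update.
    private final Set<Long> completedContainers = new HashSet<>();

    // Returns only completed containers not reported before; duplicates are dropped.
    List<Long> newlyCompleted(List<ContainerReport> reports) {
        List<Long> fresh = new ArrayList<>();
        for (ContainerReport r : reports) {
            // Set.add returns false if the ID is already tracked, i.e. a duplicate.
            if (r.completed && completedContainers.add(r.containerId)) {
                fresh.add(r.containerId);
            }
        }
        return fresh;
    }

    // Forget containers the NM has confirmed as removed; if never called, the set only grows.
    void purge(List<Long> removedFromNm) {
        completedContainers.removeAll(removedFromNm);
    }

    int trackedCount() {
        return completedContainers.size();
    }

    public static void main(String[] args) {
        DuplicateCompletedFilterSketch node = new DuplicateCompletedFilterSketch();
        List<ContainerReport> heartbeat = List.of(
                new ContainerReport(1L, true), new ContainerReport(2L, false));
        System.out.println(node.newlyCompleted(heartbeat)); // [1]
        System.out.println(node.newlyCompleted(heartbeat)); // [] (duplicate filtered)
        node.purge(List.of(1L));
        System.out.println(node.trackedCount()); // 0
    }
}
```

The dedup check costs one hash lookup per reported container, so repeated reports no longer create new update objects; the trade-off is that the tracking set itself must be purged reliably.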



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
