[
https://issues.apache.org/jira/browse/YARN-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341538#comment-15341538
]
Rohith Sharma K S commented on YARN-4862:
-----------------------------------------
Hi [~jianhe], apologies for the long delay!
In the normal flow, the NM informs the RM that a container has finished. The RM
in turn waits for the AM to pull the finished containers, and once the AM has
pulled them, the RM tells the NM to remove them from the NMContext.
In the preemption flow:
# The RM preempts the containers, which first sends KillContainer to
RMContainerImpl.
# KillContainer#transition informs RMNodeImpl to clean up the containers and
also informs RMAppAttemptImpl to add them to JustFinishedContainers, so that the
AM pulls the finished containers on its next heartbeat. It is assumed that
containersToCleanUp is sent to the NM first and that
containersToBeRemovedFromNM is sent on a later NM heartbeat.
I see a *potential container leak in the NodeManager module* in the
preemption flow. There can be situations where {{containersToCleanUp}} and
{{containersToBeRemovedFromNM}} go out together in the same heartbeat. If the
same containerId is sent to the NM in both lists at once, the container will
never be removed from the NMContext.
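To make the suspected race concrete, here is a minimal, self-contained sketch in Java. It is not the real NodeManager code: {{NmContext}}, {{HeartbeatResponse}}, {{process}} and {{simulate}} are hypothetical stand-ins, and the model assumes that a removal request is honoured only for a container that has already completed and is not retried afterwards. Under those assumptions, sending both lists in one heartbeat leaves the container behind.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the NM side of a heartbeat response.
// None of these types are the real YARN classes.
class HeartbeatRaceSketch {
  enum State { RUNNING, DONE }

  /** Stand-in for the container map held in the NM context. */
  static final class NmContext {
    final Map<String, State> containers = new HashMap<>();
  }

  /** Stand-in for the two lists carried by one heartbeat response. */
  static final class HeartbeatResponse {
    final List<String> containersToCleanUp = new ArrayList<>();
    final List<String> containersToBeRemovedFromNM = new ArrayList<>();
  }

  // ASSUMPTION: removal is applied before the cleanup has completed, only
  // drops containers that are already DONE, and is never retried.
  static void process(NmContext ctx, HeartbeatResponse response) {
    for (String id : response.containersToBeRemovedFromNM) {
      if (ctx.containers.get(id) == State.DONE) {
        ctx.containers.remove(id);
      }
    }
    for (String id : response.containersToCleanUp) {
      if (ctx.containers.containsKey(id)) {
        ctx.containers.put(id, State.DONE); // the kill finishes after removal ran
      }
    }
  }

  /** Returns how many containers remain in the context afterwards. */
  static int simulate(boolean sameHeartbeat) {
    NmContext ctx = new NmContext();
    ctx.containers.put("container_1", State.RUNNING);
    if (sameHeartbeat) {
      HeartbeatResponse both = new HeartbeatResponse();
      both.containersToCleanUp.add("container_1");
      both.containersToBeRemovedFromNM.add("container_1");
      process(ctx, both);
    } else {
      HeartbeatResponse first = new HeartbeatResponse();
      first.containersToCleanUp.add("container_1");
      process(ctx, first);
      HeartbeatResponse second = new HeartbeatResponse();
      second.containersToBeRemovedFromNM.add("container_1");
      process(ctx, second);
    }
    return ctx.containers.size();
  }

  public static void main(String[] args) {
    System.out.println("separate heartbeats, left over: " + simulate(false)); // 0
    System.out.println("same heartbeat, left over: " + simulate(true));      // 1
  }
}
```

In this model the leaked entry stays in the DONE state forever, because the RM has already sent its only removal request for that containerId.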
CC: [~jlowe]. Basically, I feel this is a bug in the RM: it should inform
RMNode back if rmContainer is null whenever finished containers are received
from the NM.
For this JIRA, I think the current patch approach should be fine if we fix the
issue mentioned above. Thoughts?
> Handle duplicate completed containers in RMNodeImpl
> ---------------------------------------------------
>
> Key: YARN-4862
> URL: https://issues.apache.org/jira/browse/YARN-4862
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Rohith Sharma K S
> Assignee: Rohith Sharma K S
> Attachments: 0001-YARN-4862.patch, 0002-YARN-4862.patch
>
>
> As per
> [comment|https://issues.apache.org/jira/browse/YARN-4852?focusedCommentId=15209689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209689]
> from [~sharadag], there should be a safeguard against duplicated container
> statuses in RMNodeImpl before creating UpdatedContainerInfo.
> Otherwise, in a heavily loaded cluster where event processing gradually slows
> down, if any duplicated containers are sent to the RM (possibly due to a bug
> in the NM as well), there is a significant impact because RMNodeImpl always
> creates UpdatedContainerInfo for the duplicated containers. This increases
> heap memory usage and causes problems like YARN-4852.
> This is an optimization for issues of the YARN-4852 kind.
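The safeguard described in the issue can be sketched as a per-node set of already-reported completed containers that is consulted before any update object is built. This is only an illustration, not the attached patch: {{CompletedContainerDedupSketch}}, {{filterNewlyCompleted}} and {{forget}} are hypothetical names, and container ids are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a duplicate guard for completed container statuses.
// It does not reproduce RMNodeImpl or the attached patch.
class CompletedContainerDedupSketch {
  // Ids of completed containers that have already been reported once.
  private final Set<String> seenCompleted = new HashSet<>();

  /** Returns only the ids that have not been reported as completed before. */
  List<String> filterNewlyCompleted(List<String> reportedCompleted) {
    List<String> fresh = new ArrayList<>();
    for (String id : reportedCompleted) {
      // Set.add returns false for an id that is already present.
      if (seenCompleted.add(id)) {
        fresh.add(id);
      }
    }
    return fresh;
  }

  /** Drops an id once the node no longer needs to track it. */
  void forget(String id) {
    seenCompleted.remove(id);
  }
}
```

With a guard like this, a repeated status for the same container yields an empty list, so no additional update object needs to be allocated for it. The set has to be pruned (here via the assumed {{forget}} hook) when a container is finally removed, otherwise it becomes a memory cost of its own.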