[
https://issues.apache.org/jira/browse/YARN-5279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356693#comment-15356693
]
Sunil G commented on YARN-5279:
-------------------------------
Thanks [~rohithsharma] for the patch and approach.
Ideally this can help find untracked finished containers and ask the NM
to remove them from its context. Since we are trying to fix the real issue in
the preemption flow in YARN-4148, as mentioned by [~jlowe] in this
[comment|https://issues.apache.org/jira/browse/YARN-4862?focusedCommentId=15345069&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15345069],
this new tracking approach can also cover such corner cases. However, it's
better if we log such activity at INFO or WARN. We have very little chance of
hitting this, but it's still better to know when such cases happen and, if
possible, to track how they happened.
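The logging suggestion could look roughly like the sketch below. It is only an illustration: the class {{UntrackedContainerLog}}, its method names, and the use of {{java.util.logging}} are assumptions made to keep the example self-contained, and none of this is taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of the patch: records untracked finished
// containers and logs each one so the rare case stays visible in the RM log.
class UntrackedContainerLog {
    private static final Logger LOG =
        Logger.getLogger(UntrackedContainerLog.class.getName());

    // Stand-in for the ids the RM would ask the NM to drop from its context.
    private final List<String> pendingRemoval = new ArrayList<>();

    // Called when a finished container is reported that the scheduler
    // no longer tracks (its RMContainer is null).
    void onUntrackedFinishedContainer(String containerId, String nodeId) {
        LOG.log(Level.WARNING,
            "Finished container {0} from node {1} is not tracked by the "
                + "scheduler; asking NM to remove it from its context",
            new Object[] {containerId, nodeId});
        pendingRemoval.add(containerId);
    }

    // Returns a copy of the ids waiting to be sent to the NM.
    List<String> pendingRemoval() {
        return new ArrayList<>(pendingRemoval);
    }
}
```

Logging the container id together with the node id should make it easier to correlate the event with the NM log when tracking how the case happened.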
A few more comments on the patch:
1. {{RMNodeFinishedContainersPulledByAMEvent}}: I guess we can change this
name, since this event is used by schedulers to report untracked containers.
2. Since the scheduler reports such untracked containers in an event back to
RMNode, it's possible that this information reaches the NM only after a
heartbeat interval. So in the worst case the scheduler may hit this same
scenario again and fire the {{RMNodeFinishedContainersPulledByAMEvent}} event
again. If possible, we can try to avoid this.
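One possible way to avoid the repeated event in point 2 is to remember which container ids have already been reported and suppress duplicates until the NM stops reporting them. The sketch below is a minimal illustration under that assumption; {{UntrackedContainerReporter}} and its methods are hypothetical names, and the counter merely stands in for dispatching the real event.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: fire the "untracked container" event at most once per
// container until the NM is known to have dropped it.
class UntrackedContainerReporter {
    // Container ids already reported to RMNode and not yet acknowledged.
    private final Set<String> reported = new HashSet<>();
    // Stand-in for dispatching the RMNode event.
    private int eventsFired = 0;

    // Returns true if an event was fired, false if it was suppressed
    // because the same container was already reported.
    boolean reportUntracked(String containerId) {
        if (!reported.add(containerId)) {
            return false;
        }
        eventsFired++;
        return true;
    }

    // Called once the NM no longer reports the container, so that a later
    // occurrence with the same id would be reported again.
    void acknowledge(String containerId) {
        reported.remove(containerId);
    }

    int eventsFired() {
        return eventsFired;
    }
}
```

With this, a second status for the same container arriving before the next heartbeat would not fire another event. The cost is a small amount of extra state that has to be cleaned up, for example when the node is removed.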
> Potential Container leak in NM in preemption flow
> -------------------------------------------------
>
> Key: YARN-5279
> URL: https://issues.apache.org/jira/browse/YARN-5279
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager, resourcemanager
> Reporter: Rohith Sharma K S
> Assignee: Rohith Sharma K S
> Attachments: 0001-YARN-5279.patch
>
>
> In discussion YARN-4862
> [comment|https://issues.apache.org/jira/browse/YARN-4862?focusedCommentId=15341538&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15341538],
> it is observed that there could be a container leak in the NodeManager
> whenever a container is preempted by the RM.
> Basically, if the NM receives the same containerId in both
> {{containersToCleanUp}} and {{containersToBeRemovedFromNM}} in the same
> heartbeat, then the container will never be removed from NMContext. Instead,
> the NM kills the container listed in {{containersToCleanUp}} and sends its
> status back to the RM. But the RM blindly rejects the status, since the
> RMContainer has already been removed and is null.
> I think whenever the RMContainer is null, RMNode should be informed to send
> {{containersToBeRemovedFromNM}} so that the NM removes the container from
> its context.