[ 
https://issues.apache.org/jira/browse/YARN-5279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356693#comment-15356693
 ] 

Sunil G commented on YARN-5279:
-------------------------------

Thanks [~rohithsharma] for the patch and the approach. 
Ideally this helps find those untracked finished containers and asks the NM 
to remove them from its context. While we are trying to fix the real issue in 
the preemption flow in YARN-4148, as mentioned by [~jlowe] in this 
[comment|https://issues.apache.org/jira/browse/YARN-4862?focusedCommentId=15345069&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15345069],
 this new tracking can also cover such corner cases. However, it's better 
if we log such activity at INFO or WARN. The chance of hitting this is very 
low, but it's still better to know that such cases are happening and, if 
possible, to track how they happened.
A few more comments on the patch:
1. {{RMNodeFinishedContainersPulledByAMEvent}}: I guess we can change this name, 
as this event is used by schedulers to report untracked containers.
2. Since the scheduler reports such untracked containers in an event back to 
RMNode, it's possible that this information reaches the NM only after a 
heartbeat interval. So in the worst case the scheduler may hit the same 
scenario again and fire {{RMNodeFinishedContainersPulledByAMEvent}} again. If 
possible, we can try to avoid this.
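
To illustrate point 2, here is a rough sketch of one way the repeated event could be avoided: remember which untracked container IDs have already been reported, and report each one only once until the NM has dropped it. The class {{UntrackedContainerReporter}} and its methods are hypothetical names made up for this example. They are not part of the patch or of YARN, and container IDs are plain strings here only to keep the sketch self-contained.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical helper, not part of the YARN-5279 patch or of YARN itself.
 * It remembers untracked container IDs that were already reported, so the
 * scheduler does not fire the same event again on the next heartbeat.
 */
final class UntrackedContainerReporter {
    // IDs already reported and not yet known to be removed by the NM.
    private final Set<String> pendingRemoval = new HashSet<>();

    /** Returns true only the first time a given container ID is seen. */
    synchronized boolean shouldReport(String containerId) {
        // Set.add returns false when the element is already present.
        return pendingRemoval.add(containerId);
    }

    /** Assumed hook: call once the NM no longer reports this container. */
    synchronized void onRemoved(String containerId) {
        pendingRemoval.remove(containerId);
    }
}
```

With something like this, the scheduler would fire the event only when {{shouldReport}} returns true. When to call {{onRemoved}} is an open question in this sketch; one option is when a later heartbeat from the node no longer lists the container.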

> Potential Container leak in NM in preemption flow
> -------------------------------------------------
>
>                 Key: YARN-5279
>                 URL: https://issues.apache.org/jira/browse/YARN-5279
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, resourcemanager
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>         Attachments: 0001-YARN-5279.patch
>
>
> In the YARN-4862 discussion 
> [comment|https://issues.apache.org/jira/browse/YARN-4862?focusedCommentId=15341538&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15341538],
>  it was observed that there could be a container leak in the NodeManager 
> whenever a container is preempted by the RM.
> Basically, if the NM receives the same containerId in both {{containersToCleanUp}} 
> and {{containersToBeRemovedFromNM}} in the same heartbeat, then the container 
> will never be removed from the NMContext. Instead, the NM kills the container 
> listed in containersToCleanup and sends its status back to the RM again. But 
> the RM blindly rejects the status, since the RMContainer has already been 
> removed and is null.
> I think whenever the RMContainer is null, RMNode should be informed to send 
> {{containersToBeRemovedFromNM}} so that the NM removes the container from its 
> context.



