[ 
https://issues.apache.org/jira/browse/YARN-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963875#comment-14963875
 ] 

Jason Lowe commented on YARN-4277:
----------------------------------

If I'm reading the problem description properly, the issue is that RM failover
causes the RM to forget which containers exist and to rely on the NM to tell
it.  When the RM times out the node, it cannot contact the node to tell it to
kill all the containers, and after the failover it no longer remembers that
those containers exist.  Later, when the NM re-registers, it tells the RM that
it has these containers that should be dead.  The NM was never told they
should be dead, and the RM forgot about them before the NM re-registered.

If the application has exited in the interim, the containers will be killed as
part of app shutdown handling.  But as long as the app is still active, it
looks like the containers will be allowed to exist despite the RM previously
telling the AM that they were gone.
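
To make the sequence concrete, here is a rough toy model of it in Python.  It
is only a sketch: ToyRM, expire_node, register_node and simulate are
hypothetical names, not YARN APIs, and it assumes the new RM still knows which
applications are active after failover but has no record of their containers.

```python
# Toy model of the sequence described above. All class and method names are
# hypothetical; this is not the YARN ResourceManager or NodeManager API.

class ToyRM:
    def __init__(self, active_apps):
        self.active_apps = set(active_apps)
        self.containers = {}          # container id -> app id, in memory only
        self.reported_complete = []   # container ids the AM was told are gone

    def launch(self, container_id, app_id):
        self.containers[container_id] = app_id

    def expire_node(self):
        # Node timed out: the RM cannot reach the NM to kill anything, so it
        # only drops its own records and reports the containers as complete.
        self.reported_complete.extend(sorted(self.containers))
        self.containers.clear()

    def register_node(self, running):
        # NM (re-)registration: the RM relies on the NM's report. Containers
        # of finished apps are returned for killing; the rest are accepted.
        to_kill = []
        for container_id, app_id in running.items():
            if app_id in self.active_apps:
                self.containers[container_id] = app_id
            else:
                to_kill.append(container_id)
        return to_kill


def simulate(app_still_active):
    # Containers that the restarted NM recovers and reports.
    running_on_nm = {"c1": "app1", "c2": "app1"}

    rm1 = ToyRM(active_apps={"app1"})
    for container_id, app_id in running_on_nm.items():
        rm1.launch(container_id, app_id)

    rm1.expire_node()                          # NM crashed, node timed out
    told_am_gone = set(rm1.reported_complete)

    # Failover: assumed here that the new RM knows the active apps but has
    # no container records.
    rm2 = ToyRM(active_apps={"app1"} if app_still_active else set())

    to_kill = rm2.register_node(running_on_nm)  # NM restarts and registers
    leaked = sorted(told_am_gone & set(rm2.containers))
    return leaked, sorted(to_kill)
```

In this model, simulate(True) returns both containers as leaked (the AM was
told they were gone, yet the new RM accepts them as live), while
simulate(False) returns them in the kill list instead, matching the app
shutdown case.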

> containers would be leaked if nm crashed  and rm failover
> ---------------------------------------------------------
>
>                 Key: YARN-4277
>                 URL: https://issues.apache.org/jira/browse/YARN-4277
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: sandflee
>
> NM restart and RM HA are enabled.
> 1. The NM crashes; after the timeout, the RM sends a container-complete
> message to the corresponding AM.
> 2. The RM fails over.
> 3. The NM restarts and registers with the RM, recovering the containers
> running on the NM; these containers are leaked.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
