[ https://issues.apache.org/jira/browse/YARN-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963875#comment-14963875 ]
Jason Lowe commented on YARN-4277:
----------------------------------
If I'm reading the problem description properly, the issue is that the RM failover
causes the RM to forget which containers exist, so it relies on the NM to
tell it. When the RM times out the node, it cannot contact the node to tell it
to kill all the containers, and after the RM failover it forgets that those
containers are there. Later, when the NM re-registers, it tells the RM
that it has these containers that should be dead. The NM was never told they
should be dead, and the RM forgot about them before the NM re-registered.
If the application has exited in the interim then the containers will be killed
as part of app shutdown handling, but as long as the app is still active
it looks like the containers will be allowed to exist despite the RM previously
telling the AM that they were gone.
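
To make the sequence above concrete, here is a toy Java model of it. This is only a sketch of my reading of the problem, not YARN code: ToyResourceManager, expireNode, failover, registerNode and the container/app/node ids are all hypothetical names made up for illustration, and the model assumes the new active RM recovers the set of active apps but nothing about containers or about what the AM was already told.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the scenario; none of these types are real YARN classes. */
public class ContainerLeakSketch {

    /** Hypothetical stand-in for the RM's in-memory view of apps and containers. */
    static class ToyResourceManager {
        final Set<String> activeApps = new HashSet<>();
        // nodeId -> containers the RM believes are running on that node
        final Map<String, Set<String>> containersByNode = new HashMap<>();
        // containers the RM has already reported to the AM as completed
        final Set<String> reportedCompletedToAm = new TreeSet<>();

        void launch(String nodeId, String containerId) {
            containersByNode.computeIfAbsent(nodeId, k -> new TreeSet<>()).add(containerId);
        }

        /** Node timeout: the NM is unreachable, so only the AM hears about it. */
        void expireNode(String nodeId) {
            Set<String> lost = containersByNode.remove(nodeId);
            if (lost != null) {
                reportedCompletedToAm.addAll(lost);
            }
        }

        /** Failover (assumption): app state survives, container state does not. */
        ToyResourceManager failover() {
            ToyResourceManager next = new ToyResourceManager();
            next.activeApps.addAll(activeApps);
            return next;
        }

        /** NM re-registration: containers of still-active apps are accepted as running. */
        Set<String> registerNode(String nodeId, Map<String, String> containerToApp) {
            Set<String> accepted = new TreeSet<>();
            for (Map.Entry<String, String> e : containerToApp.entrySet()) {
                if (activeApps.contains(e.getValue())) {
                    accepted.add(e.getKey());
                }
            }
            containersByNode.put(nodeId, accepted);
            return accepted;
        }
    }

    /** Runs the three steps and returns containers the AM was told were gone but the RM accepts again. */
    static Set<String> leakedContainers(boolean appStillActive) {
        ToyResourceManager rm = new ToyResourceManager();
        rm.activeApps.add("app_1");
        rm.launch("node_1", "container_1");
        rm.launch("node_1", "container_2");

        rm.expireNode("node_1");                       // 1. NM crashed, RM times it out
        Set<String> amWasToldGone = rm.reportedCompletedToAm;

        if (!appStillActive) {
            rm.activeApps.remove("app_1");             // app exits in the interim
        }
        ToyResourceManager newRm = rm.failover();      // 2. RM fails over

        Map<String, String> recovered = new HashMap<>();
        recovered.put("container_1", "app_1");
        recovered.put("container_2", "app_1");
        Set<String> accepted = newRm.registerNode("node_1", recovered); // 3. NM re-registers

        Set<String> leaked = new TreeSet<>(accepted);
        leaked.retainAll(amWasToldGone);
        return leaked;
    }

    public static void main(String[] args) {
        System.out.println("app active:   " + leakedContainers(true));
        System.out.println("app finished: " + leakedContainers(false));
    }
}
```

In this model both containers come back as leaked while the app is active, and none do once the app has finished, which matches the two cases described above. Whether the real RM behaves this way depends on what it actually recovers on failover.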
> containers would be leaked if nm crashed and rm failover
> ---------------------------------------------------------
>
> Key: YARN-4277
> URL: https://issues.apache.org/jira/browse/YARN-4277
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: sandflee
>
> NM restart and RM HA are enabled.
> 1. The NM crashes; after the timeout, the RM sends a container-complete message to
> the corresponding AM.
> 2. The RM fails over.
> 3. The NM restarts and registers with the RM, recovering the containers running on
> the NM; these containers are leaked.