[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15226956#comment-15226956 ]
Nathan Roberts commented on YARN-4924:
--------------------------------------
Observed the following race with NM recovery.
1) ContainerManager handles a FINISH_APPS event (e.g. because the RM kills the
application), and storeFinishedApplication() records the application as
finished in the state store
2) Prior to cleaning up the containers associated with this application, the NM
dies
3) When the NM restarts, it attempts to recover the Application, Containers,
and FinishedApplication events associated with this application, in that order
4) This leads to a NEW to DONE transition for the containers, which will not
try to clean up the actual container, since this is supposed to be a
pre-LAUNCHED transition
If I understand correctly, this happens because when the application
transitions from NEW to INITING during Application recovery, the
containerInitEvents aren't actually dispatched yet. They are delayed until the
AppInitDoneTransition. However, the AppInitDoneTransition may not occur until
after the recovery code has handled the FinishedApplicationEvent and queued up
KILL_CONTAINER events. So, in effect, the containerKillEvents overtook the
containerInitEvents, leading to the NEW to DONE transition.
{noformat}
2016-04-04 18:20:45,513 [main] INFO application.ApplicationImpl: Application application_1458666253602_2367938 transitioned from NEW to INITING
2016-04-04 18:20:56,437 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Adding container_e08_1458666253602_2367938_01_000004 to application application_1458666253602_2367938
2016-04-04 18:20:57,062 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Application application_1458666253602_2367938 transitioned from INITING to FINISHING_CONTAINERS_WAIT
2016-04-04 18:20:57,095 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e08_1458666253602_2367938_01_000004 transitioned from NEW to DONE
2016-04-04 18:20:57,120 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Removing container_e08_1458666253602_2367938_01_000004 from application application_1458666253602_2367938
2016-04-04 18:20:57,120 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Application application_1458666253602_2367938 transitioned from FINISHING_CONTAINERS_WAIT to APPLICATION_RESOURCES_CLEANINGUP
{noformat}
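To make the ordering concrete, here is a toy Java model of the sequence described above. It is only a sketch under stated assumptions: the class RecoveryRaceSketch, its enums, and replay() are hypothetical names, a plain queue stands in for the AsyncDispatcher, and the container states are a simplified subset rather than the NM's real ApplicationImpl/ContainerImpl state machines.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of the event ordering; hypothetical, not NodeManager code. */
public class RecoveryRaceSketch {
    public enum State { NEW, LOCALIZING, KILLING, DONE }
    public enum Event { APP_INIT_DONE, INIT_CONTAINER, KILL_CONTAINER }

    public static final class Result {
        public final List<String> transitions = new ArrayList<>();
        public boolean cleanupRequested = false;
    }

    /**
     * Replays recovery for one container.
     * initDoneBeforeKill = false models the observed race: the kill is queued
     * while the container's init event is still held back by the application.
     */
    public static Result replay(boolean initDoneBeforeKill) {
        Deque<Event> queue = new ArrayDeque<>();      // stand-in for the dispatcher queue
        List<Event> deferredInits = new ArrayList<>(); // held until app init is done
        deferredInits.add(Event.INIT_CONTAINER);

        if (initDoneBeforeKill) {
            // App init completed first, so the init event is already queued
            // by the time finished-application recovery queues the kill.
            queue.addAll(deferredInits);
            deferredInits.clear();
            queue.add(Event.KILL_CONTAINER);
        } else {
            // Finished-application recovery queues the kill before app init completes.
            queue.add(Event.KILL_CONTAINER);
            queue.add(Event.APP_INIT_DONE);
        }

        Result result = new Result();
        State state = State.NEW;
        while (!queue.isEmpty()) {
            Event event = queue.poll();
            State next = state;
            if (event == Event.APP_INIT_DONE) {
                // Only now are the held-back init events released.
                queue.addAll(deferredInits);
                deferredInits.clear();
            } else if (event == Event.INIT_CONTAINER && state == State.NEW) {
                next = State.LOCALIZING;
            } else if (event == Event.KILL_CONTAINER && state == State.NEW) {
                next = State.DONE;                 // pre-launch path: no cleanup attempted
            } else if (event == Event.KILL_CONTAINER && state == State.LOCALIZING) {
                next = State.KILLING;
                result.cleanupRequested = true;    // post-init path: cleanup is requested
            }
            // Any other (event, state) pair is ignored, e.g. INIT_CONTAINER at DONE.
            if (next != state) {
                result.transitions.add(state + " -> " + next);
                state = next;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result raced = replay(false);
        System.out.println("kill first: " + raced.transitions
                + ", cleanup=" + raced.cleanupRequested);
        Result ordered = replay(true);
        System.out.println("init first: " + ordered.transitions
                + ", cleanup=" + ordered.cleanupRequested);
    }
}
```

In this model, replay(false) produces the single transition NEW -> DONE with no cleanup requested, which matches the log above; replay(true) goes through LOCALIZING first, so the kill requests cleanup.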
> NM recovery race can lead to container not cleaned up
> -----------------------------------------------------
>
> Key: YARN-4924
> URL: https://issues.apache.org/jira/browse/YARN-4924
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 3.0.0, 2.7.2
> Reporter: Nathan Roberts
>
> It's probably a small window, but we observed a case where the NM crashed and
> then a container was not properly cleaned up during recovery.
> I will add details in the first comment.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)