[ https://issues.apache.org/jira/browse/YARN-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334446#comment-14334446 ]
Xuan Gong commented on YARN-3245:
---------------------------------

More details here:

Currently, we have two directions:
* NM->RM: When the AM finishes successfully or fails, the NM informs the RM through the regular heartbeat, and the RM then changes the related RMContainer/RMAppAttempt/RMApp status.
* RM->NM: When the user kills the app, or on pre-emption, the RM changes the status first and then informs the NM through the NM heartbeat. The NM then kills the AMContainer.

In both directions, the common function CapacityScheduler#completeContainer is used. In this function, if the container is the AM container and the clean-up container is enabled, we could reserve the resource by triggering only the containerFinishedEvent, which tells the RMContainer/RMAppAttempt/RMApp to change their status, without informing the queue to release the resource. If this attempt is not the last attempt, we will release the container resource. If it is, we will use the resource to launch the clean-up container.

In either direction (NM->RM or RM->NM), we need to make sure the AMContainer really exists. The only way to confirm this is through the NodeStatusUpdate. If we can get the AMContainer from NodeStatusUpdate#completeContainerList, it means the AMContainer exists. At that point we could add a flag/trigger to indicate that it is a good time to launch the clean-up container.

So, in this ticket, we expect to fix: reserve the AMContainer resource, and release the resource afterwards. How/when to launch the clean-up container will be fixed separately.
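
The decision described above for CapacityScheduler#completeContainer can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual scheduler code or a patch: the class, enum, method signature, and event strings are all hypothetical stand-ins, and the sketch assumes that whether the attempt is the last one is already known when the container completes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the proposal; none of these types exist in YARN. */
final class CleanupReservationSketch {

    /** What happens to the finished container's resource. */
    enum ResourceAction { RELEASE, RESERVE_FOR_CLEANUP }

    /** Minimal stand-in for a finished container (assumed fields). */
    static final class FinishedContainer {
        final boolean isAmContainer;
        final boolean isLastAttempt;

        FinishedContainer(boolean isAmContainer, boolean isLastAttempt) {
            this.isAmContainer = isAmContainer;
            this.isLastAttempt = isLastAttempt;
        }
    }

    /**
     * Always fires the container-finished event so RMContainer/RMAppAttempt/RMApp
     * can change state. The queue is told to release the resource unless the
     * container is the AM container of the last attempt and clean-up is enabled.
     */
    static ResourceAction completeContainer(FinishedContainer container,
                                            boolean cleanupEnabled,
                                            List<String> firedEvents) {
        firedEvents.add("containerFinishedEvent");

        if (cleanupEnabled && container.isAmContainer && container.isLastAttempt) {
            // Keep the resource: the queue is not informed, so the allocation
            // stays available for the clean-up container.
            return ResourceAction.RESERVE_FOR_CLEANUP;
        }

        // Normal path: the queue releases the resource.
        firedEvents.add("queueReleaseResource");
        return ResourceAction.RELEASE;
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        ResourceAction action =
            completeContainer(new FinishedContainer(true, true), true, events);
        System.out.println(action + " " + events);
    }
}
```

The same function serves both the NM->RM and the RM->NM direction, which is why the check sits in one place. The release of a reserved resource after the clean-up container is done is left out here.
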

> Find a way to reserve AMContainer resource to launch clean-up container in CapacityScheduler
> --------------------------------------------------------------------------------------------
>
>                 Key: YARN-3245
>                 URL: https://issues.apache.org/jira/browse/YARN-3245
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Xuan Gong
>            Assignee: Xuan Gong
>
> The clean-up container will be launched after the application is
> finished/killed/failed. The clean-up container may not get resources if we
> negotiate the resource for it separately, because the cluster may have become
> busy after the final AM exit. The proposal is to reserve the AMContainer
> resource and use it to launch the clean-up container. In that case, we do not
> need to re-negotiate the resource, and the clean-up container can be launched
> on the same NM as the AM.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)