Xuan Gong commented on YARN-3245:

More details here:
Currently, we have two directions:
* NM->RM: When the AM finishes successfully or fails, the NM informs the RM
through the regular heartbeat, and the RM then changes the related
RMContainer/RMAppAttempt/RMApp status.
* RM->NM: When the user kills the app or it is pre-empted, the RM changes the
status first, then informs the NM through the NM heartbeat. The NM will kill
the container.

In either direction, the flow goes through the common function
CapacityScheduler#completeContainer. In this function, if the container is the
AM container and the clean-up container is enabled, we can reserve the
resource: we only trigger the containerFinishedEvent to inform the
RMContainer/RMAppAttempt/RMApp to change their status, and do not inform the
queue to release the resource.

If this attempt is not the last attempt, we will release the container 
resource. If it is, we will use the resource to launch the clean-up container.
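
The decision described above can be sketched as a small, self-contained model.
This is only an illustration under stated assumptions, not the actual
CapacityScheduler code: the class CleanupReservationSketch, the Action enum,
the FinishedContainer holder and the event strings are all hypothetical names
introduced here.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the reserve-or-release decision; not real YARN code. */
final class CleanupReservationSketch {

    /** What happens to the finished container's resource (hypothetical). */
    enum Action { RELEASE_TO_QUEUE, RESERVE_FOR_CLEANUP }

    /** Hypothetical stand-in for the facts the scheduler would look at. */
    static final class FinishedContainer {
        final boolean isAmContainer;
        final boolean isLastAttempt;

        FinishedContainer(boolean isAmContainer, boolean isLastAttempt) {
            this.isAmContainer = isAmContainer;
            this.isLastAttempt = isLastAttempt;
        }
    }

    private final boolean cleanupContainerEnabled;
    /** Records which notifications were sent, in order. */
    final List<String> notifications = new ArrayList<>();

    CleanupReservationSketch(boolean cleanupContainerEnabled) {
        this.cleanupContainerEnabled = cleanupContainerEnabled;
    }

    Action completeContainer(FinishedContainer c) {
        // The state machines (RMContainer/RMAppAttempt/RMApp) are always told.
        notifications.add("containerFinishedEvent");

        // Keep the resource only for the AM container of the last attempt,
        // and only when the clean-up container feature is enabled.
        boolean reserve =
            cleanupContainerEnabled && c.isAmContainer && c.isLastAttempt;
        if (reserve) {
            return Action.RESERVE_FOR_CLEANUP;
        }

        // Otherwise the queue is informed and the resource is released.
        notifications.add("queueRelease");
        return Action.RELEASE_TO_QUEUE;
    }
}
```

In this sketch the only difference between the two outcomes is whether the
queue is told to release the resource; the status change is triggered in both.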

Regardless of the direction (NM->RM or RM->NM), we need to make sure the
AMContainer has really exited. The only way to be sure of this is through the
NodeStatusUpdate. If we can get the AMContainer from
NodeStatusUpdate#completeContainerList, it means the AMContainer has exited.
Here, we could add a flag/trigger to indicate that it is now a good time to
launch the clean-up container.
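
One possible shape for such a flag, as a minimal sketch: remember which
AMContainers have a reserved resource, and mark one as ready only once its id
shows up in the completed-container list of a node status update. The class
CleanupLaunchTrigger and its methods are hypothetical, and container ids are
modelled as plain strings for brevity.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the "good time to launch" flag; not real YARN code. */
final class CleanupLaunchTrigger {

    /** AMContainers whose resource is being held for a clean-up container. */
    private final Set<String> reserved = new HashSet<>();
    /** Reserved AMContainers that the NM has reported as completed. */
    private final Set<String> ready = new HashSet<>();

    /** Called when the scheduler decides to keep the AMContainer resource. */
    void reserve(String amContainerId) {
        reserved.add(amContainerId);
    }

    /** Called with the completed containers reported in a node status update. */
    void onNodeStatusUpdate(List<String> completedContainerIds) {
        for (String id : completedContainerIds) {
            // Only containers we reserved for are of interest here.
            if (reserved.contains(id)) {
                ready.add(id);
            }
        }
    }

    /** True once the NM has confirmed that the reserved AMContainer is done. */
    boolean isReadyForCleanup(String amContainerId) {
        return ready.contains(amContainerId);
    }
}
```

The flag is set from the NM report rather than from the RM-side status change,
so it behaves the same way in both the NM->RM and RM->NM directions.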

So, in this ticket, we expect to cover reserving the AMContainer resource and
releasing it afterwards.
How and when to launch the clean-up container will be handled separately.

> Find a way to reserve AMContainer resource to launch clean-up container in 
> CapacityScheduler
> --------------------------------------------------------------------------------------------
>                 Key: YARN-3245
>                 URL: https://issues.apache.org/jira/browse/YARN-3245
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Xuan Gong
>            Assignee: Xuan Gong
> The clean-up container will be launched after the application is
> finished/killed/failed. The clean-up container may not get resources if we
> negotiate the resource for it separately, because the cluster may have become
> busy after the final AM exits. The proposal is to reserve the AMContainer
> resource and use it to launch the clean-up container. In that case, we do not
> need to re-negotiate the resource, and the clean-up container can be launched
> on the same NM as the AM.

This message was sent by Atlassian JIRA
