[ 
https://issues.apache.org/jira/browse/YARN-4089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiwei Guo updated YARN-4089:
-----------------------------
    Attachment: YARN-4089.001.patch

> Race condition when calling AbstractYarnScheduler.completedContainer.
> ---------------------------------------------------------------------
>
>                 Key: YARN-4089
>                 URL: https://issues.apache.org/jira/browse/YARN-4089
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0, 2.7.0, 2.5.2, 2.7.1
>            Reporter: Shiwei Guo
>              Labels: patch
>         Attachments: YARN-4089.001.patch
>
>
> There is a race condition in calls to 
> AbstractYarnScheduler.completedContainer that makes the usedResource 
> counter of an application inaccurate. In the worst case, the scheduler will 
> not allocate any resources to any application in some queue (once 
> usedResource becomes negative), even though plenty of free resources are 
> available.
> It also causes the Scheduler UI and metrics to report negative resource usage 
> values. Our cluster can run 13000+ containers, but the web 
> UI reports:
> - Containers Running: -26546
> - Memory Used: -82.38 TB
> - VCores Used: -26451
> This is how it happens in FairScheduler:
> The completedContainer method calls application.containerCompleted, which 
> subtracts the resources used by this container from the usedResource 
> counter of the application. So if completedContainer is called twice 
> with the same container, its resources are subtracted twice. The same holds 
> for the updateRootQueueMetrics call, which is why we see negative 
> allocatedMemory on the rootQueue.
> The solution is to check whether the supplied container is still live 
> inside completedContainer (as shown in the patch). There are some checks 
> before completedContainer is called, but they are not enough.
> In more detail, completedContainer may be called from two 
> places:
> 1. Triggered by the RMContainerEventType.FINISHED event:
> {code:title=FairScheduler.nodeUpdate}
> // Process completed containers
>     for (ContainerStatus completedContainer : completedContainers) {
>       ContainerId containerId = completedContainer.getContainerId();
>       LOG.debug("Container FINISHED: " + containerId);
>       completedContainer(getRMContainer(containerId),
>           completedContainer, RMContainerEventType.FINISHED);
>     }
> {code}
> 2. Triggered by the RMContainerEventType.RELEASED event:
> {code:title=AbstractYarnScheduler.releaseContainers}
> completedContainer(rmContainer,
>         SchedulerUtils.createAbnormalContainerStatus(containerId,
>           SchedulerUtils.RELEASED_CONTAINER), RMContainerEventType.RELEASED);
> {code}
> RMContainerEventType.RELEASED is not triggered by the MapReduce 
> ApplicationMaster, so we won't see this problem with MR jobs. But Tez 
> triggers it when it no longer needs a container, while the NodeManager 
> also reports a container-completed message to the RM, which in turn triggers 
> the RMContainerEventType.FINISHED event. If the RMContainerEventType.FINISHED 
> event reaches the RM earlier than the release request from the Tez AM, the 
> problem happens.
> This behavior is easier to observe if the cluster has set up a 
> TimelineServer for Tez, which makes it more likely that the Tez AM sends 
> the RMContainerEventType.RELEASED event later than the NM sends 
> RMContainerEventType.FINISHED.
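
The liveness check described above can be illustrated with a minimal sketch. This is not the code in YARN-4089.001.patch and not the real scheduler API: the class LiveContainerGuardSketch, its methods, and its fields are hypothetical stand-ins for the per-queue accounting. The idea is only that a completion for a container that is no longer live must not touch the counter, so a FINISHED followed by a RELEASED for the same container subtracts its resources once.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for scheduler accounting; not actual YARN code. */
class LiveContainerGuardSketch {
    // Container id -> memory (MB) currently charged to the usage counter.
    private final Map<String, Long> liveContainers = new HashMap<>();
    // Plays the role of the usedResource counter from the description.
    private long usedMemoryMb = 0;

    /** Charges a newly allocated container to the counter (once per id). */
    synchronized void allocate(String containerId, long memoryMb) {
        if (liveContainers.putIfAbsent(containerId, memoryMb) == null) {
            usedMemoryMb += memoryMb;
        }
    }

    /**
     * Handles a completion, whether it came from a FINISHED or a RELEASED
     * event. Returns true if resources were released, false if the container
     * was no longer live and the call was ignored.
     */
    synchronized boolean completedContainer(String containerId) {
        Long memoryMb = liveContainers.remove(containerId);
        if (memoryMb == null) {
            // Second completion for the same container: skip the subtraction.
            return false;
        }
        usedMemoryMb -= memoryMb;
        return true;
    }

    synchronized long getUsedMemoryMb() {
        return usedMemoryMb;
    }
}
```

In this sketch the liveness check and the subtraction happen under the same lock, so two threads delivering the two events cannot both pass the check. Whether the real scheduler needs additional synchronization for this is not something the description states.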



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
