[
https://issues.apache.org/jira/browse/YARN-4089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Karthik Kambatla updated YARN-4089:
-----------------------------------
Labels: (was: patch)
> Race condition when calling AbstractYarnScheduler.completedContainer.
> ---------------------------------------------------------------------
>
> Key: YARN-4089
> URL: https://issues.apache.org/jira/browse/YARN-4089
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.6.0, 2.7.0, 2.5.2, 2.7.1
> Reporter: Shiwei Guo
> Attachments: YARN-4089.001.patch
>
>
> There is a race condition in calls to
> AbstractYarnScheduler.completedContainer that makes the usedResource
> counter of an application inaccurate. In the worst case, the scheduler will
> not allocate any resources to any application in some queue (once
> usedResource has become negative), even though plenty of free resources are
> available.
> It also causes the Scheduler UI and metrics to report negative resource
> usage values. Our cluster can run 13000+ containers, but the web UI says:
> - Containers Running: -26546
> - Memory Used: -82.38 TB
> - VCores Used: -26451
> This is how it happens in FairScheduler:
> The completedContainer method calls application.containerCompleted, which
> subtracts the resources used by this container from the usedResource
> counter of the application. So if completedContainer is called twice
> with the same container, the resources are subtracted twice. The same is
> true of the updateRootQueueMetrics call, so we can see negative
> allocatedMemory on rootQueue.
> The solution is to check, inside completedContainer, whether the container
> being supplied is still live (as shown in the patch). There are some checks
> before the calls to completedContainer, but they are not enough.
> In more detail, completedContainer may be called from two places:
> 1. Triggered by the RMContainerEventType.FINISHED event:
> {code:title=FairScheduler.nodeUpdate}
> // Process completed containers
> for (ContainerStatus completedContainer : completedContainers) {
>   ContainerId containerId = completedContainer.getContainerId();
>   LOG.debug("Container FINISHED: " + containerId);
>   completedContainer(getRMContainer(containerId),
>       completedContainer, RMContainerEventType.FINISHED);
> }
> {code}
> 2. Triggered by the RMContainerEventType.RELEASED event:
> {code:title=AbstractYarnScheduler.releaseContainers}
> completedContainer(rmContainer,
>     SchedulerUtils.createAbnormalContainerStatus(containerId,
>         SchedulerUtils.RELEASED_CONTAINER), RMContainerEventType.RELEASED);
> {code}
> RMContainerEventType.RELEASED is not triggered by the MapReduce
> ApplicationMaster, so we won't see this problem with MR jobs. But TEZ
> triggers it when it no longer needs a container, while the NodeManager
> also reports a container-completed message to the RM, which in turn
> triggers the RMContainerEventType.FINISHED event. If the
> RMContainerEventType.FINISHED event arrives at the RM earlier than the
> release from the TEZ AM, the problem happens.
> This behavior is easier to observe if the cluster has set up a
> TimelineServer for TEZ, which makes it more likely that the TEZ AM sends
> the RMContainerEventType.RELEASED event later than the NM sends
> RMContainerEventType.FINISHED.
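
The double subtraction described in the report, and the kind of liveness check it proposes as the fix, can be illustrated with a minimal, self-contained Java sketch. This is not the attached patch and does not use the real YARN classes: AppUsageSketch, its methods, and the string container IDs are hypothetical stand-ins chosen only to show why a second completion event for the same container drives the counter negative, and how a "still live?" check makes the second event a no-op.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for per-application resource accounting.
// It is NOT the real scheduler application class from YARN.
class AppUsageSketch {
    private final Set<String> liveContainers = new HashSet<>();
    private long usedMemoryMb = 0;

    synchronized void allocate(String containerId, long memoryMb) {
        if (liveContainers.add(containerId)) {
            usedMemoryMb += memoryMb;
        }
    }

    // Unguarded: subtracts on every call, mirroring the behaviour the
    // report describes for a container that is completed twice.
    synchronized void completeUnguarded(String containerId, long memoryMb) {
        liveContainers.remove(containerId);
        usedMemoryMb -= memoryMb;
    }

    // Guarded: subtracts only if the container is still live, so a second
    // completion event for the same container changes nothing.
    synchronized boolean completeGuarded(String containerId, long memoryMb) {
        if (!liveContainers.remove(containerId)) {
            return false; // already completed by the other event
        }
        usedMemoryMb -= memoryMb;
        return true;
    }

    synchronized long usedMemoryMb() {
        return usedMemoryMb;
    }
}

class CompletedContainerSketch {
    public static void main(String[] args) {
        // FINISHED (from the NM) and RELEASED (from the AM) both arrive.
        AppUsageSketch unguarded = new AppUsageSketch();
        unguarded.allocate("container_01", 1024);
        unguarded.completeUnguarded("container_01", 1024);
        unguarded.completeUnguarded("container_01", 1024);
        System.out.println("unguarded usedMemoryMb = " + unguarded.usedMemoryMb());

        AppUsageSketch guarded = new AppUsageSketch();
        guarded.allocate("container_01", 1024);
        guarded.completeGuarded("container_01", 1024);
        guarded.completeGuarded("container_01", 1024);
        System.out.println("guarded usedMemoryMb = " + guarded.usedMemoryMb());
    }
}
```

In this sketch the unguarded counter ends at -1024 after one allocation and two completions, while the guarded counter ends at 0 regardless of which of the two events arrives first.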