[ https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709146#comment-14709146 ]

Shiwei Guo commented on YARN-3933:
----------------------------------

We are also seeing this problem, and it can cause the RM to never allocate 
resources to a queue whose used resources have gone negative.

I did some research and found that this is mainly caused by a race condition 
when calling AbstractYarnScheduler.completedContainer. Let's take FairScheduler 
as an example:
{code:title=FairScheduler.java}
protected synchronized void completedContainer(RMContainer rmContainer,
      ContainerStatus containerStatus, RMContainerEventType event) {
    if (rmContainer == null) {
      LOG.info("Null container completed...");
      return;
    }


    Container container = rmContainer.getContainer();

    // Get the application for the finished container
    FSAppAttempt application =
        getCurrentAttemptForContainer(container.getId());
    ApplicationId appId =
        container.getId().getApplicationAttemptId().getApplicationId();
    if (application == null) {
      LOG.info("Container " + container + " of" +
          " unknown application attempt " + appId +
          " completed with event " + event);
      return;
    }
    if (!application.getLiveContainersMap().containsKey(container.getId())) {
      LOG.info("Container " + container + " of application attempt " + appId
          + " is not alive, skip do completedContainer operation on event "
          + event);
      return;
    }

    // Get the node on which the container was allocated
    FSSchedulerNode node = getFSSchedulerNode(container.getNodeId());

    if (rmContainer.getState() == RMContainerState.RESERVED) {
      application.unreserve(rmContainer.getReservedPriority(), node);
    } else {
      application.containerCompleted(rmContainer, containerStatus, event);
      node.releaseContainer(container);
      updateRootQueueMetrics();
    }

    LOG.info("Application attempt " + application.getApplicationAttemptId()
        + " released container " + container.getId() + " on node: " + node
        + " with event: " + event);
  }
{code}

The completedContainer method calls application.containerCompleted, which 
subtracts the resources used by this container from the application's 
usedResource counter. So if completedContainer is called twice for the same 
container, the counter is decremented twice. The same applies to the 
updateRootQueueMetrics call, which is why we can see negative allocatedMemory 
on the root queue.
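
To make the accounting effect concrete, here is a minimal standalone sketch of the 
double-subtraction arithmetic. It is not the actual FSAppAttempt code; the class name 
and the local {{used}} counter are illustrative, and it only assumes the public 
Resource/Resources helpers from hadoop-yarn-api and hadoop-yarn-common on the classpath:
{code:title=DoubleReleaseIllustration.java}
// Hypothetical illustration of the double-subtraction effect; NOT the actual
// FSAppAttempt code, just the same arithmetic on the public Resource helpers.
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class DoubleReleaseIllustration {
  public static void main(String[] args) {
    // The application currently holds one 2 GB / 2 vcore container.
    Resource used = Resource.newInstance(2048, 2);
    Resource container = Resource.newInstance(2048, 2);

    // First completedContainer call (e.g. RELEASED coming from the AM).
    Resources.subtractFrom(used, container);   // used is now <memory:0, vCores:0>

    // Second completedContainer call for the very same container
    // (e.g. FINISHED reported by the NodeManager).
    Resources.subtractFrom(used, container);   // used is now <memory:-2048, vCores:-2>

    System.out.println("usedResource after double release: " + used);
  }
}
{code}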

The solution is to check whether the supplied container is still live 
*inside* completedContainer itself (as shown in the patch). There are some 
checks before completedContainer is called, but they are not enough.

For a deeper discussion: completedContainer may be called from two places:

1. Triggered by the RMContainerEventType.FINISHED event:
{code:title=FairScheduler.nodeUpdate}
    // Process completed containers
    for (ContainerStatus completedContainer : completedContainers) {
      ContainerId containerId = completedContainer.getContainerId();
      LOG.debug("Container FINISHED: " + containerId);
      completedContainer(getRMContainer(containerId),
          completedContainer, RMContainerEventType.FINISHED);
    }
{code}

2. Triggered by the RMContainerEventType.RELEASED event:
{code:title=AbstractYarnScheduler.releaseContainers}
completedContainer(rmContainer,
        SchedulerUtils.createAbnormalContainerStatus(containerId,
          SchedulerUtils.RELEASED_CONTAINER), RMContainerEventType.RELEASED);
{code}

RMContainerEventType.RELEASED is not triggered by the MapReduce 
ApplicationMaster, so we won't see this problem on MR jobs. But TEZ triggers it 
when it no longer needs a container, while the NodeManager also reports a 
container-complete message to the RM, which in turn triggers the 
RMContainerEventType.FINISHED event. If the RMContainerEventType.FINISHED event 
reaches the RM earlier than the TEZ AM's release, the problem happens.
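
Regardless of which event arrives first, the liveness check makes the second call 
a no-op. The following is a self-contained sketch of that idempotent-release 
pattern; the class name, the String container ids, and the plain memory counter 
are simplifications for illustration, not the real scheduler data structures:
{code:title=IdempotentRelease.java}
// Sketch of the idempotent-release pattern the patch applies: whichever event
// arrives first removes the container from the live map, so the duplicate call
// finds nothing and returns without touching the usage counter.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentRelease {
  private final Map<String, Integer> liveContainers = new ConcurrentHashMap<>();
  private int usedMemoryMB = 0;

  synchronized void allocate(String containerId, int memoryMB) {
    liveContainers.put(containerId, memoryMB);
    usedMemoryMB += memoryMB;
  }

  synchronized void completedContainer(String containerId, String event) {
    Integer memoryMB = liveContainers.remove(containerId);
    if (memoryMB == null) {
      System.out.println(containerId + " is not alive, skipping " + event);
      return; // already accounted for by the earlier event
    }
    usedMemoryMB -= memoryMB;
    System.out.println("Released " + containerId + " on " + event
        + ", used is now " + usedMemoryMB + " MB");
  }

  public static void main(String[] args) {
    IdempotentRelease scheduler = new IdempotentRelease();
    scheduler.allocate("container_01", 2048);
    scheduler.completedContainer("container_01", "RELEASED"); // released by the TEZ AM
    scheduler.completedContainer("container_01", "FINISHED"); // later NM report: no-op
  }
}
{code}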

> Resources(both core and memory) are being negative
> --------------------------------------------------
>
>                 Key: YARN-3933
>                 URL: https://issues.apache.org/jira/browse/YARN-3933
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.5.2
>            Reporter: Lavkesh Lahngir
>            Assignee: Lavkesh Lahngir
>
> In our cluster we are seeing available memory and cores going negative. 
> Initial inspection:
> Scenario no. 1: 
> In the capacity scheduler the method allocateContainersToNode() checks whether 
> there are excess container reservations for an application; if they are 
> no longer needed, it calls queue.completedContainer(), which causes resources 
> to go negative even though they were never assigned in the first place. 
> I am still looking through the code. Can somebody suggest how to simulate 
> excess container assignments?


