Eric Payne commented on YARN-10283:

bq. Please feel free to pull in the additional changes from YARN-9903 into this 
So, IMHO, we should make this JIRA (YARN-10283) dependent on YARN-9903 and 
complete YARN-9903 first. IIUC, YARN-9903 addresses the general case of 
reservation starvation whereas this JIRA is specific to the concerns of 
priority queues. Even with the fixes in YARN-9903, there are still 
priority-queue-specific problems that need to be addressed.

bq. If there are no node labels, the same allocation errors occur if 
reservationsContinueLooking == false AND minimum-allocation-mb == 512.

I verified that when YARN-9903 is applied, reproTestWithNodeLabels succeeds but 
reproWithoutNodeLabels still fails.

> Capacity Scheduler: starvation occurs if a higher priority queue is full and 
> node labels are used
> -------------------------------------------------------------------------------------------------
>                 Key: YARN-10283
>                 URL: https://issues.apache.org/jira/browse/YARN-10283
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>         Attachments: YARN-10283-POC01.patch, YARN-10283-ReproTest.patch, 
> YARN-10283-ReproTest2.patch
> Recently we've been investigating a scenario where applications submitted to 
> a lower priority queue could not get scheduled because a higher priority 
> queue in the same hierarchy could not satisfy the allocation request. Both 
> queues belonged to the same partition.
> If we disabled node labels, the problem disappeared.
> The problem is that {{RegularContainerAllocator}} always allocated a 
> container for the request, even if it should not have.
> *Example:*
> * Cluster total resources: 3 nodes, 15GB, 24 vcores (5GB / 8 vcore per node)
> * Partition "shared" was created with 2 nodes
> * "root.lowprio" (priority = 20) and "root.highprio" (priority = 40) were 
> added to the partition
> * Both queues have a limit of <memory:5120, vCores:8>
> * Using DominantResourceCalculator
> Setup:
> Submit distributed shell application to highprio with switches 
> "-num_containers 3 -container_vcores 4". The memory allocation is 512MB per 
> container.
> Chain of events:
> 1. The queue is filled with containers until it reaches usage <memory:2560, 
> vCores:5>
> 2. A node update event is pushed to CS from a node which is part of the 
> partition
> 3. {{AbstractCSQueue.canAssignToQueue()}} returns true because the usage is 
> smaller than the current limit resource <memory:5120, vCores:8>
> 4. Then {{LeafQueue.assignContainers()}} runs successfully and gets an 
> allocated container for <memory:512, vCores:4>
> 5. But we can't commit the resource request because we would have 9 vcores in 
> total, violating the limit.
> The problem is that we always try to assign a container to the same 
> application in each heartbeat from "highprio". Applications in "lowprio" 
> cannot make progress.
> *Problem:*
> {{RegularContainerAllocator.assignContainer()}} does not handle this case 
> well. We only reject the allocation if this condition is satisfied:
> {noformat}
>  if (rmContainer == null && reservationsContinueLooking
>           && node.getLabels().isEmpty()) {
> {noformat}
> But if we have node labels, we enter a different code path and succeed with 
> the allocation if there's room for a container.
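The chain of events above can be reproduced in isolation. The following is a minimal, self-contained sketch (not Hadoop's actual {{Resource}} or {{DominantResourceCalculator}} classes; the names here are hypothetical stand-ins) showing why the speculative allocation passes the queue check on current usage but cannot be committed: 2560 MB + 512 MB fits under the 5120 MB limit, yet 5 + 4 = 9 vcores exceeds the 8-vcore limit.

```java
public class QueueLimitSketch {

    // Simplified stand-in for org.apache.hadoop.yarn.api.records.Resource
    public record Resource(long memoryMb, int vcores) {

        public Resource plus(Resource other) {
            return new Resource(memoryMb + other.memoryMb, vcores + other.vcores);
        }

        // Both dimensions must fit under the limit for the request to be committable
        public boolean fitsIn(Resource limit) {
            return memoryMb <= limit.memoryMb() && vcores <= limit.vcores();
        }
    }

    public static void main(String[] args) {
        Resource queueLimit   = new Resource(5120, 8); // <memory:5120, vCores:8>
        Resource currentUsage = new Resource(2560, 5); // usage after step 1
        Resource request      = new Resource(512, 4);  // distributed shell container

        // A canAssignToQueue()-style check on current usage alone succeeds...
        System.out.println(currentUsage.fitsIn(queueLimit)); // true

        // ...but committing usage + request would need 9 vcores, over the limit.
        Resource proposed = currentUsage.plus(request);
        System.out.println(proposed.fitsIn(queueLimit));     // false (9 vcores > 8)
    }
}
```

The allocation is accepted at step 3 because only the current usage is compared against the limit; the violation only surfaces at commit time, after which the same application is retried on every heartbeat.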

This message was sent by Atlassian Jira
