[
https://issues.apache.org/jira/browse/YARN-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412922#comment-17412922
]
Tao Yang commented on YARN-10903:
---------------------------------
+1 for the PR, will merge it after a few days if there are no objections.
> Too many "Failed to accept allocation proposal" because of wrong Headroom
> check for DRF
> ---------------------------------------------------------------------------------------
>
> Key: YARN-10903
> URL: https://issues.apache.org/jira/browse/YARN-10903
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Reporter: jackwangcs
> Assignee: jackwangcs
> Priority: Major
> Labels: pull-request-available
> Time Spent: 40m
> Remaining Estimate: 0h
>
> The headroom checks in `ParentQueue.canAssign` and
> `RegularContainerAllocator#checkHeadroom` do not account for the DRF case.
> This causes many "Failed to accept allocation proposal" messages when a
> queue is nearly full.
> Values taken from the log:
> Headroom: memory:256, vCores:729
> Request: memory:56320, vCores:5
> clusterResource: memory:673966080, vCores:110494
> When DRF is used, the check
> {code:java}
> Resources.greaterThanOrEqual(rc, clusterResource, Resources.add(
> currentResourceLimits.getHeadroom(), resourceCouldBeUnReserved),
> required); {code}
> evaluates to true, but in fact the request cannot be allocated because of
> the queue max limit (not enough memory).
> {code:java}
> 2021-07-21 23:49:39,012 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
> showRequests: application=application_1626747977559_95859
> headRoom=<memory:256, vCores:729> currentConsumption=0
> 2021-07-21 23:49:39,012 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:
> Request={AllocationRequestId: -1, Priority: 1, Capability: <memory:56320,
> vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution
> Type Request: null, Node Label Expression: prod-best-effort-node}
> .....
> 2021-07-21 23:49:39,013 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Try to commit allocation proposal=New
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
> ALLOCATED=[(Application=appattempt_1626747977559_95859_000001;
> Node=xxxx:8041; Resource=<memory:56320, vCores:5>)]
> 2021-07-21 23:49:39,013 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager:
> userLimit is fetched. userLimit=<memory:7077376, vCores:1277>,
> userSpecificUserLimit=<memory:7077376, vCores:1277>,
> schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
> Headroom calculation for user xxxxx: userLimit=<memory:7077376, vCores:1277>
> queueMaxAvailRes=<memory:0, vCores:0> consumed=<memory:0, vCores:0>
> partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
> Used resource=<memory:7077120, vCores:548> exceeded maxResourceLimit of the
> queue =<memory:7089920, vCores:1278>
> 2021-07-21 23:49:39,013 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Failed to accept allocation proposal
> {code}
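
For illustration, here is a minimal, self-contained sketch of why the check quoted above passes under DRF even though the request does not fit. It is not the Hadoop implementation: the class HeadroomCheckSketch and its methods are hypothetical, the dominant-share comparison is a simplified stand-in for what a DRF resource calculator does, and resourceCouldBeUnReserved is assumed to be zero. The numbers are the ones from the log in the description.

```java
// Hypothetical sketch; simplified and independent of the Hadoop classes.
final class HeadroomCheckSketch {

    /** Largest fraction of the cluster that (memory, vCores) takes over both resource types. */
    static double dominantShare(long memory, long vCores,
                                long clusterMemory, long clusterVCores) {
        return Math.max((double) memory / clusterMemory,
                        (double) vCores / clusterVCores);
    }

    /** DRF-style comparison: only the dominant shares of the two resources are compared. */
    static boolean dominantShareGreaterOrEqual(long lhsMemory, long lhsVCores,
                                               long rhsMemory, long rhsVCores,
                                               long clusterMemory, long clusterVCores) {
        return dominantShare(lhsMemory, lhsVCores, clusterMemory, clusterVCores)
            >= dominantShare(rhsMemory, rhsVCores, clusterMemory, clusterVCores);
    }

    /** Component-wise check: every resource type of the request must fit. */
    static boolean fitsIn(long requestMemory, long requestVCores,
                          long availableMemory, long availableVCores) {
        return requestMemory <= availableMemory && requestVCores <= availableVCores;
    }

    public static void main(String[] args) {
        long clusterMemory = 673_966_080L, clusterVCores = 110_494L;
        long headroomMemory = 256L, headroomVCores = 729L;
        long requestMemory = 56_320L, requestVCores = 5L;

        // Headroom is dominated by vCores (~0.0066 of the cluster), the request
        // by memory (~0.00008), so the dominant-share comparison passes.
        System.out.println("dominant-share check: " + dominantShareGreaterOrEqual(
            headroomMemory, headroomVCores, requestMemory, requestVCores,
            clusterMemory, clusterVCores)); // prints true

        // Component-wise, 56320 MB of memory does not fit in 256 MB of headroom.
        System.out.println("component-wise fit:   " + fitsIn(
            requestMemory, requestVCores, headroomMemory, headroomVCores)); // prints false
    }
}
```

The two results disagree because the headroom's dominant resource (vCores) is not the one the request is short of (memory). That matches the log above, where the proposal is created and then rejected at commit time.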