jackwangcs created YARN-10903:
---------------------------------
Summary: Too many "Failed to accept allocation proposal" because
of wrong Headroom check for DRF
Key: YARN-10903
URL: https://issues.apache.org/jira/browse/YARN-10903
Project: Hadoop YARN
Issue Type: Bug
Components: capacityscheduler
Reporter: jackwangcs
The headroom checks in `ParentQueue.canAssign` and
`RegularContainerAllocator#checkHeadroom` do not account for the DRF
(Dominant Resource Fairness) case. As a result, the scheduler logs many
"Failed to accept allocation proposal" messages when a queue is nearly full.
{code:java}
2021-07-21 23:49:39,012 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
showRequests: application=application_1626747977559_95859
headRoom=<memory:256, vCores:729> currentConsumption=0
2021-07-21 23:49:39,012 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:
Request={AllocationRequestId: -1, Priority: 1, Capability: <memory:56320,
vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution Type
Request: null, Node Label Expression: prod-best-effort-node}
2021-07-21 23:49:39,012 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
Assigned to queue: root.prod-best-effort.dp-prod-be stats: dp-prod-be:
capacity=0.0043, absoluteCapacity=5.5728003E-4, usedResources=<memory:0,
vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=6,
numContainers=112 --> <memory:56320, vCores:5>, OFF_SWITCH
2021-07-21 23:49:39,012 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
assignedContainer queue=prod-best-effort usedCapacity=0.0
absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> cluster=<memory:673966080,
vCores:110494>
2021-07-21 23:49:39,013 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
ParentQ=prod-best-effort assignedSoFarInThisIteration=<memory:56320, vCores:5>
usedCapacity=0.0 absoluteUsedCapacity=0.0
2021-07-21 23:49:39,013 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
Assigned to queue: root.prod-best-effort stats: prod-best-effort:
numChildQueue= 17, capacity=0.1296, absoluteCapacity=0.1296,
usedResources=<memory:0, vCores:0>usedCapacity=0.0, numApps=157,
numContainers=1814 --> <memory:56320, vCores:5>, OFF_SWITCH
2021-07-21 23:49:39,013 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
assignedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0
used=<memory:0, vCores:0> cluster=<memory:673966080, vCores:110494>
2021-07-21 23:49:39,013 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
ParentQ=root assignedSoFarInThisIteration=<memory:56320, vCores:5>
usedCapacity=0.0 absoluteUsedCapacity=0.0
2021-07-21 23:49:39,013 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Try to commit allocation proposal=New
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
ALLOCATED=[(Application=appattempt_1626747977559_95859_000001;
Node=lashadoop-21j29.server.hulu.com:8041; Resource=<memory:56320, vCores:5>)]
2021-07-21 23:49:39,013 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager:
userLimit is fetched. userLimit=<memory:7077376, vCores:1277>,
userSpecificUserLimit=<memory:7077376, vCores:1277>,
schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
2021-07-21 23:49:39,013 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
Headroom calculation for user reco_research.prod: userLimit=<memory:7077376,
vCores:1277> queueMaxAvailRes=<memory:0, vCores:0> consumed=<memory:0,
vCores:0> partition=prod-best-effort-node
2021-07-21 23:49:39,013 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
Used resource=<memory:7077120, vCores:548> exceeded maxResourceLimit of the
queue =<memory:7089920, vCores:1278>
2021-07-21 23:49:39,013 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Failed to accept allocation proposal
{code}
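To illustrate the kind of mismatch described above, here is a minimal, self-contained sketch using the headroom, request, and cluster values from the log. It is not the CapacityScheduler code: the `Res` class and the three helper methods are hypothetical names introduced for this example, and the dominant-share comparison is an assumption about how a DRF-style check behaves, not a copy of what `canAssign` or `checkHeadroom` actually do.

```java
public class HeadroomCheckSketch {
    /** Hypothetical two-dimensional resource; NOT YARN's Resource class. */
    static final class Res {
        final long memory;
        final long vCores;

        Res(long memory, long vCores) {
            this.memory = memory;
            this.vCores = vCores;
        }
    }

    /** Dominant share: the largest per-dimension fraction of the cluster. */
    static double dominantShare(Res r, Res cluster) {
        return Math.max((double) r.memory / cluster.memory,
                        (double) r.vCores / cluster.vCores);
    }

    /** Assumed DRF-style check: compares only the dominant shares. */
    static boolean passesDominantShareCheck(Res headroom, Res request, Res cluster) {
        return dominantShare(headroom, cluster) >= dominantShare(request, cluster);
    }

    /** Component-wise check: the request must fit in every dimension. */
    static boolean fitsInEveryDimension(Res headroom, Res request) {
        return request.memory <= headroom.memory
            && request.vCores <= headroom.vCores;
    }

    public static void main(String[] args) {
        // Values taken from the log above.
        Res cluster = new Res(673966080L, 110494L);
        Res headroom = new Res(256L, 729L);
        Res request = new Res(56320L, 5L);

        System.out.println("dominant-share check passes: "
            + passesDominantShareCheck(headroom, request, cluster));
        System.out.println("fits in every dimension: "
            + fitsInEveryDimension(headroom, request));
    }
}
```

With these values the dominant-share comparison passes, because the headroom's vCores share outweighs the request's memory share. The component-wise check fails, because 256 MB of memory headroom cannot hold a 56320 MB container. A proposal that passes the first kind of check can therefore still be rejected at commit time.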