[jira] [Updated] (YARN-10903) Too many "Failed to accept allocation proposal" because of wrong Headroom check for DRF

jackwangcs (Jira) Sat, 28 Aug 2021 07:25:08 -0700


     [ 
https://issues.apache.org/jira/browse/YARN-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


jackwangcs updated YARN-10903:
------------------------------
    Description: 
The headroom checks in `ParentQueue.canAssign` and 
`RegularContainerAllocator#checkHeadroom` do not handle the DRF case (when the 
resource calculator is `DominantResourceCalculator`).

This causes many "Failed to accept allocation proposal" messages when a queue is 
nearly full. 
In the log:
Headroom: memory:256, vCores:729
Request: memory:56320, vCores:5
clusterResource: memory:673966080, vCores:110494
When DRF is used, the check 
{code:java}
Resources.greaterThanOrEqual(rc, clusterResource, Resources.add(
    currentResourceLimits.getHeadroom(), resourceCouldBeUnReserved),
    required); {code}
evaluates to true, but in fact the request cannot be allocated because of the 
queue's max limit (not enough memory).
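
To illustrate with the numbers above, here is a minimal, self-contained sketch of why a dominant-share comparison passes while a per-dimension comparison fails. It is a simplified model, not Hadoop's actual `DominantResourceCalculator`; the class and method names are hypothetical, and `resourceCouldBeUnReserved` is assumed to be zero.

```java
// Simplified model of the two comparisons; NOT the Hadoop implementation.
// All names here are hypothetical and exist only for illustration.
final class HeadroomCheckSketch {

    /** Largest per-dimension share of the cluster for a given resource. */
    static double dominantShare(long memory, long vCores,
                                long clusterMemory, long clusterVCores) {
        double memoryShare = (double) memory / clusterMemory;
        double vCoreShare = (double) vCores / clusterVCores;
        return Math.max(memoryShare, vCoreShare);
    }

    /** Dominant-share comparison: headroom "covers" the request if its dominant share is at least as large. */
    static boolean dominantGreaterThanOrEqual(long headMem, long headVCores,
                                              long reqMem, long reqVCores,
                                              long clusterMemory, long clusterVCores) {
        return dominantShare(headMem, headVCores, clusterMemory, clusterVCores)
            >= dominantShare(reqMem, reqVCores, clusterMemory, clusterVCores);
    }

    /** Per-dimension comparison: every dimension of the request must fit in the headroom. */
    static boolean fitsIn(long reqMem, long reqVCores, long headMem, long headVCores) {
        return reqMem <= headMem && reqVCores <= headVCores;
    }

    public static void main(String[] args) {
        // Values taken from the log in this issue.
        long clusterMemory = 673_966_080L, clusterVCores = 110_494L;
        long headMem = 256, headVCores = 729;   // headroom (unreserved resources assumed zero)
        long reqMem = 56_320, reqVCores = 5;    // requested container

        // Headroom's dominant dimension is vCores (729 / 110494), the request's is
        // memory (56320 / 673966080), so the dominant-share check passes.
        System.out.println(dominantGreaterThanOrEqual(
            headMem, headVCores, reqMem, reqVCores, clusterMemory, clusterVCores)); // true

        // Per dimension, 56320 MB of memory does not fit in 256 MB of headroom.
        System.out.println(fitsIn(reqMem, reqVCores, headMem, headVCores)); // false
    }
}
```

A per-dimension check of this kind rejects the request before a proposal is created, because a container needs every resource dimension to be available, not only the dominant one.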
{code:java}
2021-07-21 23:49:39,012 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1626747977559_95859 
headRoom=<memory:256, vCores:729> currentConsumption=0
2021-07-21 23:49:39,012 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:
  Request={AllocationRequestId: -1, Priority: 1, Capability: <memory:56320, 
vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution Type 
Request: null, Node Label Expression: prod-best-effort-node}
.....
2021-07-21 23:49:39,013 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Try to commit allocation proposal=New 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
         ALLOCATED=[(Application=appattempt_1626747977559_95859_000001; 
Node=xxxx:8041; Resource=<memory:56320, vCores:5>)]
2021-07-21 23:49:39,013 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager: 
userLimit is fetched. userLimit=<memory:7077376, vCores:1277>, 
userSpecificUserLimit=<memory:7077376, vCores:1277>, 
schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
2021-07-21 23:49:39,013 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
Headroom calculation for user xxxxx:  userLimit=<memory:7077376, vCores:1277> 
queueMaxAvailRes=<memory:0, vCores:0> consumed=<memory:0, vCores:0> 
partition=prod-best-effort-node
2021-07-21 23:49:39,013 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
 Used resource=<memory:7077120, vCores:548> exceeded maxResourceLimit of the 
queue =<memory:7089920, vCores:1278>
2021-07-21 23:49:39,013 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Failed to accept allocation proposal
 {code}
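
The figures in the log above are consistent with simple per-dimension arithmetic, reproduced in the sketch below. This is only a check of the numbers, under the assumption that headroom is the user limit minus the queue's used resource and that the commit step rejects a proposal when used plus requested exceeds the queue maximum in any dimension; the class and method names are hypothetical and not part of Hadoop.

```java
// Reproduces the arithmetic visible in the log; NOT the Hadoop implementation.
// Names are hypothetical, and the two formulas are assumptions stated in the text above.
final class CommitCheckSketch {

    /** True if adding the request to current usage would exceed the limit in this dimension. */
    static boolean exceeds(long used, long requested, long max) {
        return used + requested > max;
    }

    public static void main(String[] args) {
        // Values taken from the log in this issue.
        long userLimitMem = 7_077_376L, userLimitVCores = 1_277L;
        long usedMem = 7_077_120L, usedVCores = 548L;
        long maxMem = 7_089_920L, maxVCores = 1_278L;
        long reqMem = 56_320L, reqVCores = 5L;

        // Assumed headroom = user limit - used; matches headRoom=<memory:256, vCores:729>.
        System.out.println(userLimitMem - usedMem);        // 256
        System.out.println(userLimitVCores - usedVCores);  // 729

        // Memory: 7077120 + 56320 = 7133440 > 7089920, so the proposal is rejected.
        System.out.println(exceeds(usedMem, reqMem, maxMem));          // true
        // vCores: 548 + 5 = 553 <= 1278, so vCores alone would be fine.
        System.out.println(exceeds(usedVCores, reqVCores, maxVCores)); // false
    }
}
```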

  was:
The headroom checks in `ParentQueue.canAssign` and 
`RegularContainerAllocator#checkHeadroom` do not handle the DRF case.

This causes many "Failed to accept allocation proposal" messages when a queue is 
nearly full.
{code:java}
2021-07-21 23:49:39,012 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1626747977559_95859 
headRoom=<memory:256, vCores:729> currentConsumption=0
2021-07-21 23:49:39,012 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:
  Request={AllocationRequestId: -1, Priority: 1, Capability: <memory:56320, 
vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution Type 
Request: null, Node Label Expression: prod-best-effort-node}
2021-07-21 23:49:39,012 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.prod-best-effort.dp-prod-be stats: dp-prod-be: 
capacity=0.0043, absoluteCapacity=5.5728003E-4, usedResources=<memory:0, 
vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=6, 
numContainers=112 --> <memory:56320, vCores:5>, OFF_SWITCH
2021-07-21 23:49:39,012 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
assignedContainer queue=prod-best-effort usedCapacity=0.0 
absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> cluster=<memory:673966080, 
vCores:110494>
2021-07-21 23:49:39,013 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
ParentQ=prod-best-effort assignedSoFarInThisIteration=<memory:56320, vCores:5> 
usedCapacity=0.0 absoluteUsedCapacity=0.0
2021-07-21 23:49:39,013 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.prod-best-effort stats: prod-best-effort: 
numChildQueue= 17, capacity=0.1296, absoluteCapacity=0.1296, 
usedResources=<memory:0, vCores:0>usedCapacity=0.0, numApps=157, 
numContainers=1814 --> <memory:56320, vCores:5>, OFF_SWITCH
2021-07-21 23:49:39,013 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
assignedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 
used=<memory:0, vCores:0> cluster=<memory:673966080, vCores:110494>
2021-07-21 23:49:39,013 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
ParentQ=root assignedSoFarInThisIteration=<memory:56320, vCores:5> 
usedCapacity=0.0 absoluteUsedCapacity=0.0
2021-07-21 23:49:39,013 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Try to commit allocation proposal=New 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
         ALLOCATED=[(Application=appattempt_1626747977559_95859_000001; 
Node=lashadoop-21j29.server.hulu.com:8041; Resource=<memory:56320, vCores:5>)]
2021-07-21 23:49:39,013 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager: 
userLimit is fetched. userLimit=<memory:7077376, vCores:1277>, 
userSpecificUserLimit=<memory:7077376, vCores:1277>, 
schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
2021-07-21 23:49:39,013 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
Headroom calculation for user reco_research.prod:  userLimit=<memory:7077376, 
vCores:1277> queueMaxAvailRes=<memory:0, vCores:0> consumed=<memory:0, 
vCores:0> partition=prod-best-effort-node
2021-07-21 23:49:39,013 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
 Used resource=<memory:7077120, vCores:548> exceeded maxResourceLimit of the 
queue =<memory:7089920, vCores:1278>
2021-07-21 23:49:39,013 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Failed to accept allocation proposal
 {code}


> Too many "Failed to accept allocation proposal" because of wrong Headroom 
> check for DRF
> ---------------------------------------------------------------------------------------
>
>                 Key: YARN-10903
>                 URL: https://issues.apache.org/jira/browse/YARN-10903
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>            Reporter: jackwangcs
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)