[jira] [Commented] (YARN-10903) Too many "Failed to accept allocation proposal" because of wrong Headroom check for DRF

2021-09-10 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413346#comment-17413346
 ] 

Eric Payne commented on YARN-10903:
---

Thanks [~jackwangcs] for your patience and thanks for fixing this bug. The 
changes LGTM.
+1.

> Too many "Failed to accept allocation proposal" because of wrong Headroom 
> check for DRF
> ---
>
> Key: YARN-10903
> URL: https://issues.apache.org/jira/browse/YARN-10903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: jackwangcs
>Assignee: jackwangcs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The headroom check in  `ParentQueue.canAssign` and 
> `RegularContainerAllocator#checkHeadroom` does not consider the DRF cases.
> This will cause a lot of "Failed to accept allocation proposal" when a queue 
> is near-fully used. 
> In the log:
> Headroom: memory:256, vCores:729
> Request: memory:56320, vCores:5
> clusterResource: memory:673966080, vCores:110494
> If use the DRF, then 
> {code:java}
> Resources.greaterThanOrEqual(rc, clusterResource, Resources.add(
> currentResourceLimits.getHeadroom(), resourceCouldBeUnReserved),
> required); {code}
> will be true but in fact we can not allocate resources to the request due to 
> the max limit(no enough memory).
> {code:java}
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1626747977559_95859 
> headRoom= currentConsumption=0
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:
>   Request={AllocationRequestId: -1, Priority: 1, Capability:  vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution 
> Type Request: null, Node Label Expression: prod-best-effort-node}
> .
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Try to commit allocation proposal=New 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
>  ALLOCATED=[(Application=appattempt_1626747977559_95859_01; 
> Node=:8041; Resource=)]
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager:
>  userLimit is fetched. userLimit=, 
> userSpecificUserLimit=, 
> schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Headroom calculation for user x:  userLimit= 
> queueMaxAvailRes= consumed= 
> partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-07-21 23:49:39,013 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10903) Too many "Failed to accept allocation proposal" because of wrong Headroom check for DRF

2021-09-10 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413222#comment-17413222
 ] 

Eric Payne commented on YARN-10903:
---

Thanks [~jackwangcs] and [~Tao Yang] for raising the issue and for reviewing 
and commenting. Headroom calculations are very sensitive and any changes could 
have unforseen side effects. I will take some time today and review.

> Too many "Failed to accept allocation proposal" because of wrong Headroom 
> check for DRF
> ---
>
> Key: YARN-10903
> URL: https://issues.apache.org/jira/browse/YARN-10903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: jackwangcs
>Assignee: jackwangcs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The headroom check in  `ParentQueue.canAssign` and 
> `RegularContainerAllocator#checkHeadroom` does not consider the DRF cases.
> This will cause a lot of "Failed to accept allocation proposal" when a queue 
> is near-fully used. 
> In the log:
> Headroom: memory:256, vCores:729
> Request: memory:56320, vCores:5
> clusterResource: memory:673966080, vCores:110494
> If use the DRF, then 
> {code:java}
> Resources.greaterThanOrEqual(rc, clusterResource, Resources.add(
> currentResourceLimits.getHeadroom(), resourceCouldBeUnReserved),
> required); {code}
> will be true but in fact we can not allocate resources to the request due to 
> the max limit(no enough memory).
> {code:java}
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1626747977559_95859 
> headRoom= currentConsumption=0
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:
>   Request={AllocationRequestId: -1, Priority: 1, Capability:  vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution 
> Type Request: null, Node Label Expression: prod-best-effort-node}
> .
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Try to commit allocation proposal=New 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
>  ALLOCATED=[(Application=appattempt_1626747977559_95859_01; 
> Node=:8041; Resource=)]
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager:
>  userLimit is fetched. userLimit=, 
> userSpecificUserLimit=, 
> schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Headroom calculation for user x:  userLimit= 
> queueMaxAvailRes= consumed= 
> partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-07-21 23:49:39,013 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10903) Too many "Failed to accept allocation proposal" because of wrong Headroom check for DRF

2021-09-09 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412922#comment-17412922
 ] 

Tao Yang commented on YARN-10903:
-

+1 for the PR, will merge it after a few days if there are no objections.

> Too many "Failed to accept allocation proposal" because of wrong Headroom 
> check for DRF
> ---
>
> Key: YARN-10903
> URL: https://issues.apache.org/jira/browse/YARN-10903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: jackwangcs
>Assignee: jackwangcs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The headroom check in  `ParentQueue.canAssign` and 
> `RegularContainerAllocator#checkHeadroom` does not consider the DRF cases.
> This will cause a lot of "Failed to accept allocation proposal" when a queue 
> is near-fully used. 
> In the log:
> Headroom: memory:256, vCores:729
> Request: memory:56320, vCores:5
> clusterResource: memory:673966080, vCores:110494
> If use the DRF, then 
> {code:java}
> Resources.greaterThanOrEqual(rc, clusterResource, Resources.add(
> currentResourceLimits.getHeadroom(), resourceCouldBeUnReserved),
> required); {code}
> will be true but in fact we can not allocate resources to the request due to 
> the max limit(no enough memory).
> {code:java}
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1626747977559_95859 
> headRoom= currentConsumption=0
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:
>   Request={AllocationRequestId: -1, Priority: 1, Capability:  vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution 
> Type Request: null, Node Label Expression: prod-best-effort-node}
> .
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Try to commit allocation proposal=New 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
>  ALLOCATED=[(Application=appattempt_1626747977559_95859_01; 
> Node=:8041; Resource=)]
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager:
>  userLimit is fetched. userLimit=, 
> userSpecificUserLimit=, 
> schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Headroom calculation for user x:  userLimit= 
> queueMaxAvailRes= consumed= 
> partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-07-21 23:49:39,013 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10903) Too many "Failed to accept allocation proposal" because of wrong Headroom check for DRF

2021-09-09 Thread jackwangcs (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412899#comment-17412899
 ] 

jackwangcs commented on YARN-10903:
---

Hi [~Tao Yang], thanks for your review, I have fixed the check-style warnings.

> Too many "Failed to accept allocation proposal" because of wrong Headroom 
> check for DRF
> ---
>
> Key: YARN-10903
> URL: https://issues.apache.org/jira/browse/YARN-10903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: jackwangcs
>Assignee: jackwangcs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The headroom check in  `ParentQueue.canAssign` and 
> `RegularContainerAllocator#checkHeadroom` does not consider the DRF cases.
> This will cause a lot of "Failed to accept allocation proposal" when a queue 
> is near-fully used. 
> In the log:
> Headroom: memory:256, vCores:729
> Request: memory:56320, vCores:5
> clusterResource: memory:673966080, vCores:110494
> If use the DRF, then 
> {code:java}
> Resources.greaterThanOrEqual(rc, clusterResource, Resources.add(
> currentResourceLimits.getHeadroom(), resourceCouldBeUnReserved),
> required); {code}
> will be true but in fact we can not allocate resources to the request due to 
> the max limit(no enough memory).
> {code:java}
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1626747977559_95859 
> headRoom= currentConsumption=0
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:
>   Request={AllocationRequestId: -1, Priority: 1, Capability:  vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution 
> Type Request: null, Node Label Expression: prod-best-effort-node}
> .
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Try to commit allocation proposal=New 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
>  ALLOCATED=[(Application=appattempt_1626747977559_95859_01; 
> Node=:8041; Resource=)]
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager:
>  userLimit is fetched. userLimit=, 
> userSpecificUserLimit=, 
> schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Headroom calculation for user x:  userLimit= 
> queueMaxAvailRes= consumed= 
> partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-07-21 23:49:39,013 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10903) Too many "Failed to accept allocation proposal" because of wrong Headroom check for DRF

2021-09-08 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412281#comment-17412281
 ] 

Tao Yang commented on YARN-10903:
-

Thanks [~jackwangcs] for raising this issue, which may generate invalid 
proposals to slow down the normal scheduling process. Good catch! 

The PR generally LGTM, just some minor check-style warnings need to be fixed, 
please take a look.

> Too many "Failed to accept allocation proposal" because of wrong Headroom 
> check for DRF
> ---
>
> Key: YARN-10903
> URL: https://issues.apache.org/jira/browse/YARN-10903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: jackwangcs
>Assignee: jackwangcs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The headroom check in  `ParentQueue.canAssign` and 
> `RegularContainerAllocator#checkHeadroom` does not consider the DRF cases.
> This will cause a lot of "Failed to accept allocation proposal" when a queue 
> is near-fully used. 
> In the log:
> Headroom: memory:256, vCores:729
> Request: memory:56320, vCores:5
> clusterResource: memory:673966080, vCores:110494
> If use the DRF, then 
> {code:java}
> Resources.greaterThanOrEqual(rc, clusterResource, Resources.add(
> currentResourceLimits.getHeadroom(), resourceCouldBeUnReserved),
> required); {code}
> will be true but in fact we can not allocate resources to the request due to 
> the max limit(no enough memory).
> {code:java}
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1626747977559_95859 
> headRoom= currentConsumption=0
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:
>   Request={AllocationRequestId: -1, Priority: 1, Capability:  vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution 
> Type Request: null, Node Label Expression: prod-best-effort-node}
> .
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Try to commit allocation proposal=New 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
>  ALLOCATED=[(Application=appattempt_1626747977559_95859_01; 
> Node=:8041; Resource=)]
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager:
>  userLimit is fetched. userLimit=, 
> userSpecificUserLimit=, 
> schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Headroom calculation for user x:  userLimit= 
> queueMaxAvailRes= consumed= 
> partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-07-21 23:49:39,013 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10903) Too many "Failed to accept allocation proposal" because of wrong Headroom check for DRF

2021-09-08 Thread jackwangcs (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411977#comment-17411977
 ] 

jackwangcs commented on YARN-10903:
---

Hi [~gandras] [~bteke] [~snemeth], could you help to review this patch when you 
have time? Thanks!

> Too many "Failed to accept allocation proposal" because of wrong Headroom 
> check for DRF
> ---
>
> Key: YARN-10903
> URL: https://issues.apache.org/jira/browse/YARN-10903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: jackwangcs
>Assignee: jackwangcs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The headroom check in  `ParentQueue.canAssign` and 
> `RegularContainerAllocator#checkHeadroom` does not consider the DRF cases.
> This will cause a lot of "Failed to accept allocation proposal" when a queue 
> is near-fully used. 
> In the log:
> Headroom: memory:256, vCores:729
> Request: memory:56320, vCores:5
> clusterResource: memory:673966080, vCores:110494
> If use the DRF, then 
> {code:java}
> Resources.greaterThanOrEqual(rc, clusterResource, Resources.add(
> currentResourceLimits.getHeadroom(), resourceCouldBeUnReserved),
> required); {code}
> will be true but in fact we can not allocate resources to the request due to 
> the max limit(no enough memory).
> {code:java}
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1626747977559_95859 
> headRoom= currentConsumption=0
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:
>   Request={AllocationRequestId: -1, Priority: 1, Capability:  vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution 
> Type Request: null, Node Label Expression: prod-best-effort-node}
> .
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Try to commit allocation proposal=New 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
>  ALLOCATED=[(Application=appattempt_1626747977559_95859_01; 
> Node=:8041; Resource=)]
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager:
>  userLimit is fetched. userLimit=, 
> userSpecificUserLimit=, 
> schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Headroom calculation for user x:  userLimit= 
> queueMaxAvailRes= consumed= 
> partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-07-21 23:49:39,013 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10903) Too many "Failed to accept allocation proposal" because of wrong Headroom check for DRF

2021-08-28 Thread jackwangcs (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17406347#comment-17406347
 ] 

jackwangcs commented on YARN-10903:
---

Hi [~zhuqi], [~chaosju], [~leftnoteasy], could you help to review this patch?

Thanks!

> Too many "Failed to accept allocation proposal" because of wrong Headroom 
> check for DRF
> ---
>
> Key: YARN-10903
> URL: https://issues.apache.org/jira/browse/YARN-10903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: jackwangcs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The headroom check in  `ParentQueue.canAssign` and 
> `RegularContainerAllocator#checkHeadroom` does not consider the DRF cases.
> This will cause a lot of "Failed to accept allocation proposal" when a queue 
> is near-fully used. 
> In the log:
> Headroom: memory:256, vCores:729
> Request: memory:56320, vCores:5
> clusterResource: memory:673966080, vCores:110494
> If use the DRF, then 
> {code:java}
> Resources.greaterThanOrEqual(rc, clusterResource, Resources.add(
> currentResourceLimits.getHeadroom(), resourceCouldBeUnReserved),
> required); {code}
> will be true but in fact we can not allocate resources to the request due to 
> the max limit(no enough memory).
> {code:java}
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1626747977559_95859 
> headRoom= currentConsumption=0
> 2021-07-21 23:49:39,012 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:
>   Request={AllocationRequestId: -1, Priority: 1, Capability:  vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution 
> Type Request: null, Node Label Expression: prod-best-effort-node}
> .
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Try to commit allocation proposal=New 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
>  ALLOCATED=[(Application=appattempt_1626747977559_95859_01; 
> Node=:8041; Resource=)]
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager:
>  userLimit is fetched. userLimit=, 
> userSpecificUserLimit=, 
> schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Headroom calculation for user x:  userLimit= 
> queueMaxAvailRes= consumed= 
> partition=prod-best-effort-node
> 2021-07-21 23:49:39,013 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
>  Used resource= exceeded maxResourceLimit of the 
> queue =
> 2021-07-21 23:49:39,013 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org