[jira] [Commented] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407661#comment-17407661 ]

Eric Payne commented on YARN-10848:
---

bq. IMO this is breaking the existing behavior of DefaultResourceCalculator

Agreed. Just to add my 2 cents... IMO, the DefaultResourceCalculator should only consider the memory portion of the resource. This is my understanding of "correct" behavior for DefaultResourceCalculator.

> Vcore allocation problem with DefaultResourceCalculator
> ---
>
>                 Key: YARN-10848
>                 URL: https://issues.apache.org/jira/browse/YARN-10848
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler
>            Reporter: Peter Bacsko
>            Assignee: Minni Mittal
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: TestTooManyContainers.java
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating
> containers even if we run out of vcores.
> CS checks the available resources in two places. The first check is in
> {{CapacityScheduler.allocateContainerOnSingleNode()}}:
> {noformat}
>     if (calculator.computeAvailableContainers(Resources
>         .add(node.getUnallocatedResource(),
>             node.getTotalKillableResources()),
>         minimumAllocation) <= 0) {
>       LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient "
>           + "available or preemptible resource for minimum allocation");
> {noformat}
> The second, which is more important, is located in
> {{RegularContainerAllocator.assignContainer()}}:
> {noformat}
>     if (!Resources.fitsIn(rc, capability, totalResource)) {
>       LOG.warn("Node : " + node.getNodeID()
>           + " does not have sufficient resource for ask : " + pendingAsk
>           + " node total capability : " + node.getTotalResource());
>       // Skip this locality request
>       ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
>           activitiesManager, node, application, schedulerKey,
>           ActivityDiagnosticConstant.
>               NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST
>               + getResourceDiagnostics(capability, totalResource),
>           ActivityLevel.NODE);
>       return ContainerAllocation.LOCALITY_SKIPPED;
>     }
> {noformat}
> Here, {{rc}} is the resource calculator instance; the other two values are:
> {noformat}
>     Resource capability = pendingAsk.getPerAllocationResource();
>     Resource available = node.getUnallocatedResource();
> {noformat}
> There is a repro unit test attached to this case, which demonstrates the
> problem. The root cause is that we pass the resource calculator to
> {{Resources.fitsIn()}}. Instead, we should use the overload without the
> calculator, just like in {{FSAppAttempt.assignContainer()}}:
> {noformat}
>       // Can we allocate a container on this node?
>       if (Resources.fitsIn(capability, available)) {
>         // Inform the application of the new container for this request
>         RMContainer allocatedContainer =
>             allocate(type, node, schedulerKey, pendingAsk,
>                 reservedContainer);
> {noformat}
> In CS, if we switch to DominantResourceCalculator OR use
> {{Resources.fitsIn()}} without the calculator in
> {{RegularContainerAllocator.assignContainer()}}, the failing unit test passes
> (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
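The issue description contrasts two checks: {{fitsIn}} routed through DefaultResourceCalculator, which this thread describes as looking at memory only, and the calculator-less {{Resources.fitsIn()}} overload, which checks every resource dimension. Below is a minimal Java sketch of that difference. The {{Res}} record and both method names are hypothetical stand-ins, not Hadoop's {{Resource}}/{{Resources}} API, and the numbers are made up.

```java
// Hypothetical, simplified model of a YARN resource vector (memory in MB, vcores).
// Res and the methods below are NOT Hadoop classes; they only mimic the two checks.
public class FitsInSketch {

  record Res(long memoryMb, int vcores) {}

  // Approximates fitsIn(rc, smaller, bigger) with a memory-only calculator:
  // vcores are ignored entirely.
  static boolean fitsInMemoryOnly(Res smaller, Res bigger) {
    return smaller.memoryMb() <= bigger.memoryMb();
  }

  // Approximates the calculator-less fitsIn(smaller, bigger):
  // every dimension must fit.
  static boolean fitsInAllDimensions(Res smaller, Res bigger) {
    return smaller.memoryMb() <= bigger.memoryMb()
        && smaller.vcores() <= bigger.vcores();
  }

  public static void main(String[] args) {
    Res ask = new Res(2048, 1);          // hypothetical per-allocation ask
    Res unallocated = new Res(4096, 0);  // memory left on the node, but no vcores
    System.out.println("memory-only fitsIn:    " + fitsInMemoryOnly(ask, unallocated));
    System.out.println("all-dimensions fitsIn: " + fitsInAllDimensions(ask, unallocated));
  }
}
```

When the node has memory left but no vcores, the memory-only check still reports a fit (prints {{true}}) and the all-dimensions check does not (prints {{false}}). That is the over-allocation the issue reports.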
[jira] [Commented] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403793#comment-17403793 ]

Peter Bacsko commented on YARN-10848:
---

Thanks for the comment [~prabhujoseph]. So you're saying that this is by design? If this is intentional, then we should probably close this JIRA. At first, though, this behavior seemed really weird to me.
[jira] [Commented] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390099#comment-17390099 ]

Prabhu Joseph commented on YARN-10848:
---

Hi [~pbacsko], IMO this is breaking the existing behavior of DefaultResourceCalculator. DefaultResourceCalculator is useful when the workloads are not CPU intensive, like MapReduce and Tez, and users need not worry about CPU configuration here.

>> IMO whether a container "fits in" or not should depend on both values

DominantResourceCalculator provides this support. Users configure it if they want both memory and CPU to be considered in scheduling.
[jira] [Commented] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17388909#comment-17388909 ]

Minni Mittal commented on YARN-10848:
---

Got it. Whether a container fits in should depend only on the available and requested resources (the way it is done in FairScheduler), not on the resource calculator. [~pbacsko], I've added the PR. Can you please review the patch?
[jira] [Commented] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1732#comment-1732 ]

Peter Bacsko commented on YARN-10848:
---

[~minni31] the problem is that on a node with a lot of memory, CS keeps allocating containers even when no vcores are left. Imagine a 32-core server with 768GB of RAM. With a container size of 2GB, 384 containers can run in parallel, potentially overloading the node. This might be a slightly artificial scenario, but it can happen.

IMO whether a container "fits in" or not should depend on both values. It's OK to use only one of them for the fairness calculation. However, as I pointed out above, Fair Scheduler does not allow such an allocation if the "Fair" policy is used in the queue. If this was done intentionally, I'm wondering what the thought process behind it was.
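A few lines of arithmetic check Peter's 32-core / 768GB example. This sketch assumes each container asks for 2GB and 1 vcore and that the node advertises one vcore per core; both assumptions are illustrative only. {{containersByMemory}} models a memory-only view. {{containersByAllDimensions}} lets the scarcest resource cap the count, roughly in the spirit of a calculator that also considers vcores. Neither method is Hadoop code.

```java
// Hypothetical arithmetic sketch of the 32-core / 768GB node; not Hadoop code.
public class ContainerCountSketch {

  // Memory-only view: how many asks fit into the available memory.
  static long containersByMemory(long availMb, long askMb) {
    return availMb / askMb;
  }

  // All-dimensions view: the scarcest resource caps the container count.
  static long containersByAllDimensions(long availMb, long availVcores,
                                        long askMb, long askVcores) {
    return Math.min(availMb / askMb, availVcores / askVcores);
  }

  public static void main(String[] args) {
    long nodeMb = 768L * 1024; // 768GB of RAM
    long nodeVcores = 32;      // assumption: one vcore per physical core
    long askMb = 2L * 1024;    // 2GB containers
    long askVcores = 1;        // assumption: 1 vcore per container

    System.out.println(containersByMemory(nodeMb, askMb));                                // 384
    System.out.println(containersByAllDimensions(nodeMb, nodeVcores, askMb, askVcores)); // 32
  }
}
```

Under these assumptions, memory alone admits 384 containers, while vcores cap the count at 32. The 12x gap is the overload Peter describes.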
[jira] [Commented] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17388046#comment-17388046 ]

Minni Mittal commented on YARN-10848:
---

[~pbacsko], as per my understanding, DefaultResourceCalculator considers memory the limiting resource:
{code:java}
private static final Set<String> INSUFFICIENT_RESOURCE_NAME =
    ImmutableSet.of(ResourceInformation.MEMORY_URI);
{code}
As such, it will keep allocating containers as long as memory is available, irrespective of vcore availability. In the test "TestTooManyContainers" you added, if we increase numRequestedContainers to 13, it will allocate 11 containers and then log:
{noformat}
This node 127.0.0.1:1234 doesn't have sufficient available or preemptible resource for minimum allocation
{noformat}
This looks like expected behavior to me. Please help me understand the issue.
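Minni's point is that placement continues as long as memory remains, regardless of vcores. This hypothetical simulation shows the consequence: it keeps placing containers on one node while a memory-only check passes, then repeats the run with a check that also requires vcores. The node size, ask size and {{allocate}} helper are invented for illustration and do not model the real scheduler loop.

```java
import java.util.Arrays;

// Hypothetical simulation, not Hadoop code: a node keeps accepting containers
// while a memory-only check passes, so unallocated vcores can go negative.
public class MemoryOnlyLoopSketch {

  // Returns {allocatedContainers, remainingMb, remainingVcores}.
  static long[] allocate(long nodeMb, long nodeVcores, long askMb, long askVcores,
                         int requested, boolean checkVcores) {
    long mb = nodeMb;
    long vcores = nodeVcores;
    int allocated = 0;
    while (allocated < requested) {
      boolean fits = askMb <= mb && (!checkVcores || askVcores <= vcores);
      if (!fits) {
        break; // the ask no longer fits on this node
      }
      mb -= askMb;
      vcores -= askVcores;
      allocated++;
    }
    return new long[] {allocated, mb, vcores};
  }

  public static void main(String[] args) {
    // Hypothetical node: 8GB / 4 vcores; asks of 1GB / 1 vcore; 10 requested.
    System.out.println(Arrays.toString(allocate(8192, 4, 1024, 1, 10, false))); // [8, 0, -4]
    System.out.println(Arrays.toString(allocate(8192, 4, 1024, 1, 10, true)));  // [4, 4096, 0]
  }
}
```

With the memory-only gate, the node ends up four vcores over-committed. With both dimensions checked, it stops at four containers and leaves 4GB unused. These are the two behaviors the thread is weighing.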
[jira] [Commented] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377446#comment-17377446 ]

Peter Bacsko commented on YARN-10848:
---

[~minni31] sure, you can take it, and I can review the patch if you upload one.
[jira] [Commented] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377412#comment-17377412 ]

Minni Mittal commented on YARN-10848:
---

Hey [~pbacsko], can I take up this Jira if you are not working on it?