[
https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18040928#comment-18040928
]
ASF GitHub Bot commented on YARN-10848:
---------------------------------------
github-actions[bot] commented on PR #3246:
URL: https://github.com/apache/hadoop/pull/3246#issuecomment-3583628895
We're closing this stale PR because it has been open for 100 days with no
activity. This isn't a judgement on the merit of the PR in any way. It's just a
way of keeping the PR queue manageable.
If you feel like this was a mistake, or you would like to continue working
on it, please feel free to re-open it and ask for a committer to remove the
stale tag and review again.
Thanks all for your contribution.
> Vcore allocation problem with DefaultResourceCalculator
> -------------------------------------------------------
>
> Key: YARN-10848
> URL: https://issues.apache.org/jira/browse/YARN-10848
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, capacityscheduler
> Reporter: Peter Bacsko
> Assignee: Minni Mittal
> Priority: Major
> Labels: pull-request-available
> Attachments: TestTooManyContainers.java
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating
> containers even if we run out of vcores.
> CS checks the available resources in two places. The first check is in
> {{CapacityScheduler.allocateContainerOnSingleNode()}}:
> {noformat}
> if (calculator.computeAvailableContainers(Resources
>         .add(node.getUnallocatedResource(),
>             node.getTotalKillableResources()),
>     minimumAllocation) <= 0) {
>   LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient "
>       + "available or preemptible resource for minimum allocation");
> {noformat}
> The second, which is more important, is located in
> {{RegularContainerAllocator.assignContainer()}}:
> {noformat}
> if (!Resources.fitsIn(rc, capability, totalResource)) {
>   LOG.warn("Node : " + node.getNodeID()
>       + " does not have sufficient resource for ask : " + pendingAsk
>       + " node total capability : " + node.getTotalResource());
>   // Skip this locality request
>   ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
>       activitiesManager, node, application, schedulerKey,
>       ActivityDiagnosticConstant.
>           NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST
>           + getResourceDiagnostics(capability, totalResource),
>       ActivityLevel.NODE);
>   return ContainerAllocation.LOCALITY_SKIPPED;
> }
> {noformat}
> Here, {{rc}} is the resource calculator instance; the other two values are:
> {noformat}
> Resource capability = pendingAsk.getPerAllocationResource();
> Resource available = node.getUnallocatedResource();
> {noformat}
> There is a repro unit test attached to this case that demonstrates the
> problem. The root cause is that we pass the resource calculator to
> {{Resources.fitsIn()}}. Instead, we should use the overloaded version that
> takes no calculator, just like in {{FSAppAttempt.assignContainer()}}:
> {noformat}
> // Can we allocate a container on this node?
> if (Resources.fitsIn(capability, available)) {
> // Inform the application of the new container for this request
> RMContainer allocatedContainer =
> allocate(type, node, schedulerKey, pendingAsk,
> reservedContainer);
> {noformat}
> In CS, if we switch to DominantResourceCalculator OR use
> {{Resources.fitsIn()}} without the calculator in
> {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit
> test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}).
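
The difference between the two kinds of check described above can be illustrated with a small standalone sketch. It does not use any Hadoop classes: {{Res}}, {{fitsInMemoryOnly()}}, {{fitsInAllDimensions()}} and {{countAllocations()}} are hypothetical names made up for this example, and the memory-only comparison is only an assumed model of how a calculator that ignores vcores would behave, not the actual YARN code.

```java
public class FitsInSketch {

    /** Hypothetical stand-in for a YARN resource: memory in MB plus vcores. */
    public static final class Res {
        public final long memoryMb;
        public final int vcores;

        public Res(long memoryMb, int vcores) {
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }
    }

    /** Assumed model of a calculator-based check that compares memory only. */
    public static boolean fitsInMemoryOnly(Res ask, Res available) {
        return ask.memoryMb <= available.memoryMb;
    }

    /** Assumed model of a calculator-free check that compares every dimension. */
    public static boolean fitsInAllDimensions(Res ask, Res available) {
        return ask.memoryMb <= available.memoryMb && ask.vcores <= available.vcores;
    }

    /** Keeps allocating the same ask on one node until the chosen check fails. */
    public static int countAllocations(Res node, Res ask, boolean memoryOnly) {
        long freeMemory = node.memoryMb;
        int freeVcores = node.vcores;
        int allocated = 0;
        while (true) {
            Res available = new Res(freeMemory, freeVcores);
            boolean fits = memoryOnly
                ? fitsInMemoryOnly(ask, available)
                : fitsInAllDimensions(ask, available);
            if (!fits) {
                return allocated;
            }
            // Vcores may go negative here when the check ignores them.
            freeMemory -= ask.memoryMb;
            freeVcores -= ask.vcores;
            allocated++;
        }
    }

    public static void main(String[] args) {
        Res node = new Res(8192, 4); // 8 GB, 4 vcores
        Res ask = new Res(1024, 1);  // 1 GB, 1 vcore per container

        System.out.println(countAllocations(node, ask, true));  // prints 8
        System.out.println(countAllocations(node, ask, false)); // prints 4
    }
}
```

In this toy model the memory-only check hands out eight containers on a four-vcore node, which mirrors the over-allocation reported in the issue, while the all-dimension check stops at four.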