[
https://issues.apache.org/jira/browse/YARN-9432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834411#comment-16834411
]
Weiwei Yang commented on YARN-9432:
-----------------------------------
Hi [~Tao Yang]
Thanks for the patch. The overall logic makes sense to me; I have just one comment:
{code:java}
List<FiCaSchedulerNode> reservedNodes =
    candidates.getAllNodes().values().stream()
        .filter(node -> node.getReservedContainer() != null)
        .collect(Collectors.toList());
for (FiCaSchedulerNode reservedNode : reservedNodes) {
  RMContainer reservedContainer = reservedNode.getReservedContainer();
  if (reservedContainer != null) {
    allocateFromReservedContainer(reservedNode, false, reservedContainer);
  }
}
{code}
This iterates twice: once over all nodes to build the filtered list, and again over that list. Can we use a single loop instead?
{code:java}
for (FiCaSchedulerNode node : candidates.getAllNodes().values()) {
  if (node.getReservedContainer() != null) {
    // handle the reserved container here
  }
}
{code}
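To make the comparison concrete, here is a minimal, self-contained sketch of both shapes. SketchNode, ReservedLoopSketch, and the string "container" are hypothetical stand-ins for illustration only; they are not FiCaSchedulerNode, RMContainer, or any real YARN API, and the handled list simply records where the real patch would call allocateFromReservedContainer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for a scheduler node; NOT the real FiCaSchedulerNode.
class SketchNode {
    final String id;
    final String reservedContainer; // null when nothing is reserved on this node

    SketchNode(String id, String reservedContainer) {
        this.id = id;
        this.reservedContainer = reservedContainer;
    }
}

class ReservedLoopSketch {
    // Two passes: the stream walks every node to build an intermediate list,
    // then the for loop walks that list again.
    static List<String> twoPass(Map<String, SketchNode> nodes) {
        List<SketchNode> reserved = nodes.values().stream()
            .filter(n -> n.reservedContainer != null)
            .collect(Collectors.toList());
        List<String> handled = new ArrayList<>();
        for (SketchNode n : reserved) {
            handled.add(n.id + ":" + n.reservedContainer);
        }
        return handled;
    }

    // Single pass: read the reserved container once per node and act on it
    // immediately, with no intermediate list.
    static List<String> singlePass(Map<String, SketchNode> nodes) {
        List<String> handled = new ArrayList<>();
        for (SketchNode n : nodes.values()) {
            String container = n.reservedContainer;
            if (container != null) {
                handled.add(n.id + ":" + container);
            }
        }
        return handled;
    }

    // Small fixed data set: n1 and n3 hold a reservation, n2 does not.
    static Map<String, SketchNode> sampleNodes() {
        Map<String, SketchNode> nodes = new LinkedHashMap<>();
        nodes.put("n1", new SketchNode("n1", "c1"));
        nodes.put("n2", new SketchNode("n2", null));
        nodes.put("n3", new SketchNode("n3", "c3"));
        return nodes;
    }

    public static void main(String[] args) {
        Map<String, SketchNode> nodes = sampleNodes();
        System.out.println(twoPass(nodes));
        System.out.println(singlePass(nodes));
    }
}
```

Both methods handle the same nodes in the same order for this input. In the single-pass form the reserved container is read into a local once, so the null check and the use refer to the same value; whether that matters in the real scheduler depends on its locking, which this sketch does not model.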
Thanks
> Reserved containers leak after its request has been cancelled or satisfied
> when multi-nodes enabled
> ---------------------------------------------------------------------------------------------------
>
> Key: YARN-9432
> URL: https://issues.apache.org/jira/browse/YARN-9432
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major
> Attachments: YARN-9432.001.patch, YARN-9432.002.patch,
> YARN-9432.003.patch
>
>
> Reserved containers may become excess after their request has been
> cancelled or satisfied. Excess reserved containers need to be unreserved
> quickly to release resources for others.
> When multi-nodes is disabled, excess reserved containers are released
> quickly on the next node heartbeat. The call stack is
> CapacityScheduler#nodeUpdate --> CapacityScheduler#allocateContainersToNode
> --> CapacityScheduler#allocateContainerOnSingleNode.
> When multi-nodes is enabled, however, excess reserved containers can be
> released only during the allocation process. The key part of the call stack is
> LeafQueue#assignContainers --> LeafQueue#allocateFromReservedContainer.
> As a result, excess reserved containers may not be released until their
> queue has a pending request and gets a chance to allocate. In the worst
> case, excess reserved containers are never released and keep holding
> resources because the queue has no further pending requests.
> To solve this problem, I propose directly killing excess reserved
> containers when the request is satisfied (in FiCaSchedulerApp#apply) or when
> the allocation number of resource-requests/scheduling-requests is updated to 0
> (in SchedulerApplicationAttempt#updateResourceRequests /
> SchedulerApplicationAttempt#updateSchedulingRequests).
> Please feel free to give your suggestions. Thanks.