[jira] [Commented] (YARN-9432) Reserved containers leak after its request has been cancelled or satisfied when multi-nodes enabled

Weiwei Yang (JIRA) Mon, 06 May 2019 23:28:44 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-9432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834411#comment-16834411
 ]


Weiwei Yang commented on YARN-9432:
-----------------------------------

Hi [~Tao Yang]

Thanks for the patch. The overall logic makes sense to me. I have just one comment, on this part:
{code:java}
List<FiCaSchedulerNode> reservedNodes =
    candidates.getAllNodes().values().stream()
        .filter(node -> node.getReservedContainer() != null)
        .collect(Collectors.toList());
for (FiCaSchedulerNode reservedNode : reservedNodes) {
  RMContainer reservedContainer = reservedNode.getReservedContainer();
  if (reservedContainer != null) {
    allocateFromReservedContainer(reservedNode, false, reservedContainer);
  }
}
{code}
This iterates over the nodes twice: once in the stream that builds reservedNodes, and again in the for loop. Can we use a single loop instead?
{code:java}
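for (FiCaSchedulerNode node : candidates.getAllNodes().values()) {
  RMContainer reservedContainer = node.getReservedContainer();
  if (reservedContainer != null) {
    // same handling as in the loop above
    allocateFromReservedContainer(node, false, reservedContainer);
  }
}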
{code}
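To make the comparison concrete, here is a minimal, self-contained sketch of both iteration styles. The `Node` class, the string container ids, and the `ReservedNodeScan` helper are hypothetical stand-ins, not the real `FiCaSchedulerNode`/`RMContainer` types, and "handling" a node just records it instead of calling `allocateFromReservedContainer`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-ins for illustration only; not the YARN classes.
final class ReservedNodeScan {

  /** Stand-in for a scheduler node: an id plus an optional reserved container id. */
  static final class Node {
    final String id;
    final String reservedContainer; // null when nothing is reserved on this node

    Node(String id, String reservedContainer) {
      this.id = id;
      this.reservedContainer = reservedContainer;
    }

    String getReservedContainer() {
      return reservedContainer;
    }
  }

  /** Two passes: filter into a temporary list, then iterate that list. */
  static List<String> twoPass(Map<String, Node> allNodes) {
    List<Node> reservedNodes = allNodes.values().stream()
        .filter(node -> node.getReservedContainer() != null)
        .collect(Collectors.toList());
    List<String> handled = new ArrayList<>();
    for (Node reservedNode : reservedNodes) {
      String reservedContainer = reservedNode.getReservedContainer();
      if (reservedContainer != null) {
        handled.add(reservedNode.id + ":" + reservedContainer);
      }
    }
    return handled;
  }

  /** Single pass: check and handle each node in the same loop. */
  static List<String> singlePass(Map<String, Node> allNodes) {
    List<String> handled = new ArrayList<>();
    for (Node node : allNodes.values()) {
      String reservedContainer = node.getReservedContainer();
      if (reservedContainer != null) {
        handled.add(node.id + ":" + reservedContainer);
      }
    }
    return handled;
  }

  /** Three sample nodes, two of which hold a reservation. */
  static Map<String, Node> sampleNodes() {
    Map<String, Node> nodes = new LinkedHashMap<>();
    nodes.put("n1", new Node("n1", "c1"));
    nodes.put("n2", new Node("n2", null));
    nodes.put("n3", new Node("n3", "c3"));
    return nodes;
  }

  public static void main(String[] args) {
    Map<String, Node> nodes = sampleNodes();
    System.out.println(twoPass(nodes).equals(singlePass(nodes)));
  }
}
```

The two styles are interchangeable only when handling a node does not structurally modify the collection being iterated. If it did, iterating over a snapshot list would be the safer choice. The sketch makes no claim either way about the real scheduler code.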
Thanks

> Reserved containers leak after its request has been cancelled or satisfied 
> when multi-nodes enabled
> ---------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9432
>                 URL: https://issues.apache.org/jira/browse/YARN-9432
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: YARN-9432.001.patch, YARN-9432.002.patch, 
> YARN-9432.003.patch
>
>
> Reserved containers may become excess after their request has been 
> cancelled or satisfied. Excess reserved containers need to be unreserved 
> quickly to release resources for others.
> When multi-nodes is disabled, excess reserved containers are released 
> quickly on the next node heartbeat. The call stack is 
> CapacityScheduler#nodeUpdate --> CapacityScheduler#allocateContainersToNode 
> --> CapacityScheduler#allocateContainerOnSingleNode. 
> When multi-nodes is enabled, excess reserved containers can be released 
> only during the allocation process. The key part of the call stack is 
> LeafQueue#assignContainers --> LeafQueue#allocateFromReservedContainer. 
> As a result, excess reserved containers may not be released until their 
> queue has a pending request and gets a chance to allocate. In the worst 
> case, if the queue has no further pending requests, the excess reserved 
> containers are never released and keep holding resources.
> To solve this problem, I propose to directly kill excess reserved 
> containers when the request is satisfied (in FiCaSchedulerApp#apply) or when 
> the allocation number of resource-requests/scheduling-requests is updated to 
> 0 (in SchedulerApplicationAttempt#updateResourceRequests / 
> SchedulerApplicationAttempt#updateSchedulingRequests).
> Please feel free to give your suggestions. Thanks.
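
The proposal in the quoted description, releasing reservations as soon as their request no longer needs them, can be illustrated with a small standalone sketch. Everything here is hypothetical: `ExcessReservationSketch`, its string node ids and request keys, and `onPendingCountUpdated` are illustrative names and do not correspond to the actual patch or to any YARN API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration only; not the YARN scheduler classes.
final class ExcessReservationSketch {

  // node id -> key of the request that the reservation on that node serves
  private final Map<String, String> reservedByNode = new LinkedHashMap<>();

  /** Records that a container is reserved on the node for the given request. */
  void reserve(String nodeId, String requestKey) {
    reservedByNode.put(nodeId, requestKey);
  }

  /**
   * Called when the pending count of a request changes. Once the count drops
   * to 0 (request satisfied or cancelled), every reservation held for that
   * request is excess and is released immediately, rather than waiting for a
   * later allocation cycle. Returns the ids of the nodes that were released.
   */
  List<String> onPendingCountUpdated(String requestKey, int pendingCount) {
    List<String> released = new ArrayList<>();
    if (pendingCount > 0) {
      return released; // request still needs containers; keep reservations
    }
    Iterator<Map.Entry<String, String>> it = reservedByNode.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, String> entry = it.next();
      if (entry.getValue().equals(requestKey)) {
        released.add(entry.getKey());
        it.remove(); // safe removal while iterating
      }
    }
    return released;
  }

  /** Node ids that still hold a reservation. */
  List<String> reservedNodeIds() {
    return new ArrayList<>(reservedByNode.keySet());
  }

  public static void main(String[] args) {
    ExcessReservationSketch sketch = new ExcessReservationSketch();
    sketch.reserve("n1", "reqA");
    sketch.reserve("n2", "reqB");
    System.out.println(sketch.onPendingCountUpdated("reqA", 0));
    System.out.println(sketch.reservedNodeIds());
  }
}
```

The point of the sketch is the trigger: the release happens in the update path itself, so it does not depend on the queue having another pending request.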



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
