[ 
https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-11573:
----------------------------------
    Labels: pull-request-available  (was: )

> Add config option to make container allocation prefer nodes without reserved 
> containers
> ---------------------------------------------------------------------------------------
>
>                 Key: YARN-11573
>                 URL: https://issues.apache.org/jira/browse/YARN-11573
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Minor
>              Labels: pull-request-available
>
> Applications can get stuck when the container allocation logic does not consider further nodes, but only nodes that already have reserved containers.
> This behavior can even block new AMs from being allocated on nodes, so the applications never reach the RUNNING state.
> A jira that mentions the same problem is YARN-9598:
> {quote}Nodes which have been reserved should be skipped when iterating 
> candidates in RegularContainerAllocator#allocate, otherwise scheduler may 
> generate allocation or reservation proposal on these node which will always 
> be rejected in FiCaScheduler#commonCheckContainerAllocation.
> {quote}
> Since that jira covers 2 other points as well, I decided to create this one and implement the 3rd point separately.
> h2. Notes:
> 1. FiCaSchedulerApp#commonCheckContainerAllocation will log this:
> {code:java}
> Trying to allocate from reserved container in async scheduling mode
> {code}
> in case RegularContainerAllocator creates a reservation proposal for a node that already has a reserved container.
> 2. A better way is to avoid generating an AM container (or even a normal container) allocation proposal on a node if it already has a reservation on it, as long as we still have more nodes to check in the preferred node set. Completely preventing task containers from being allocated to worker nodes could limit the downscaling ability that we currently have.
> h2. 3. CALL HIERARCHY
> 1. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#nodeUpdate
> 2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.api.records.NodeId, boolean)
> 3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.CandidateNodeSet<org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode>, boolean)
> 3.1. This is the place where it is decided whether to call allocateContainerOnSingleNode or allocateContainersOnMultiNodes
> 4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersOnMultiNodes
> 5. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateOrReserveNewContainers
> 6. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue#assignContainers
> 7. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractParentQueue#assignContainersToChildQueues
> 8. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractLeafQueue#assignContainers
> 9. org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#assignContainers
> 10. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainers
> 11. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#allocate
> 12. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#tryAllocateOnNode
> 13. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainersOnNode
> 14. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignNodeLocalContainers
> 15. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainer
> As an example, RegularContainerAllocator logs the following line:
> {code:java}
> 2023-08-23 17:44:08,129 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator: assignContainers: node=<host> application=application_1692304118418_3151 priority=0 pendingAsk=<per-allocation-resource=<memory:5632, vCores:1>,repeat=1> type=OFF_SWITCH
> {code}
> h2. 4. DETAILS OF RegularContainerAllocator#allocate
> [Method definition|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L826-L896]
> 4.1. Defining the ordered list of nodes to allocate containers on: [LINK|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L851-L852]
> {code:java}
>     Iterator<FiCaSchedulerNode> iter = schedulingPS.getPreferredNodeIterator(
>         candidates);
> {code}
> 4.2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.AppPlacementAllocator#getPreferredNodeIterator
> 4.3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.MultiNodeSortingManager#getMultiNodeSortIterator ([LINK|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/placement/MultiNodeSortingManager.java#L114-L180])
> In this method, the MultiNodeLookupPolicy is resolved [here|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/placement/MultiNodeSortingManager.java#L142-L143].
> 4.4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.MultiNodeSorter#getMultiNodeLookupPolicy
> 4.5. This is where the MultiNodeLookupPolicy implementation of getPreferredNodeIterator is invoked.
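> As an aside, the issue title ("prefer nodes without reserved containers") could also be approached at this sorting level: if the preferred node iterator ordered unreserved nodes first, nodes with reservations would only be tried last. (The actual proposal below skips such nodes instead of reordering them.) The following is a minimal, self-contained sketch of that ordering; CandidateNode is a made-up stand-in for illustration, not the real FiCaSchedulerNode or MultiNodeLookupPolicy API:
> {code:java}
> import java.util.Arrays;
> import java.util.Comparator;
> import java.util.List;
> 
> // Hypothetical stand-in for a scheduler node; not the real FiCaSchedulerNode API.
> final class CandidateNode {
>   private final String nodeId;
>   private final boolean hasReservedContainer;
> 
>   CandidateNode(String nodeId, boolean hasReservedContainer) {
>     this.nodeId = nodeId;
>     this.hasReservedContainer = hasReservedContainer;
>   }
> 
>   boolean hasReservedContainer() {
>     return hasReservedContainer;
>   }
> 
>   String getNodeId() {
>     return nodeId;
>   }
> }
> 
> final class PreferUnreservedOrdering {
>   // false sorts before true, so nodes without a reservation come first.
>   static final Comparator<CandidateNode> PREFER_UNRESERVED =
>       Comparator.comparing(CandidateNode::hasReservedContainer);
> 
>   public static void main(String[] args) {
>     List<CandidateNode> candidates = Arrays.asList(
>         new CandidateNode("node-1", true),
>         new CandidateNode("node-2", false),
>         new CandidateNode("node-3", true));
>     // Prints node-2 first, then the two reserved nodes.
>     candidates.stream()
>         .sorted(PREFER_UNRESERVED)
>         .forEach(n -> System.out.println(n.getNodeId()));
>   }
> }
> {code}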
> h2. 5. GOING UP THE CALL HIERARCHY UNTIL CapacityScheduler#allocateOrReserveNewContainers
> 1. A CSAssignment is created [here|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1797-L1801] in method: CapacityScheduler#allocateOrReserveNewContainers
> 2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#submitResourceCommitRequest
> 3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#tryCommit
> 4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#accept
> 5. org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#commonCheckContainerAllocation
> --> This returns false and logs this line:
> {code:java}
> 2023-08-23 17:44:08,130 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Trying to allocate from reserved container in async scheduling mode
> {code}
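> Side note: both DEBUG lines quoted in this description only appear if DEBUG logging is enabled for the corresponding classes. A minimal sketch, assuming the ResourceManager picks up the standard log4j.properties from the Hadoop conf directory (adjust to your logging setup):
> {code}
> # Allocator-side decisions (assignContainers / skip messages)
> log4j.logger.org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator=DEBUG
> # Commit-time rejection ("Trying to allocate from reserved container in async scheduling mode")
> log4j.logger.org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp=DEBUG
> {code}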
> h2. PROPOSED FIX
> In RegularContainerAllocator#allocate, there's a loop that iterates over the candidate nodes: 
> [https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L853-L895]
> We need to skip the nodes that already have a reservation; example code:
> {code:java}
> if (reservedContainer == null) {
>   // Do not schedule if there are any reservations to fulfill on the node
>   if (node.getReservedContainer() != null) {
>     LOG.debug("Skipping scheduling on node {} since it has already been"
>             + " reserved by {}", node.getNodeID(),
>         node.getReservedContainer().getContainerId());
>     ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
>         activitiesManager, node, application, schedulerKey,
>         ActivityDiagnosticConstant.NODE_HAS_BEEN_RESERVED);
>     continue;
>   }
>   // ... rest of the existing loop body
> {code}
> NOTE: This code block is copied from [^YARN-9598.001.patch#file-5]
> h2. More notes for the implementation
> 1. This new behavior needs to be hidden behind a feature flag (CS config).
> In my understanding, [^YARN-9598.001.patch#file-5] skips all nodes with reservations, regardless of the container type, i.e. whether it is an AM container or a task container.
> 2. Only skip a node with an existing reservation if there are more nodes left to process in the iterator (see the sketch below).
> 3. Add a test case to cover this scenario.
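> To make notes 1. and 2. concrete, here is a minimal, self-contained sketch of the intended decision rule. The names are made up for illustration (the actual CS config key is to be decided in the PR); the booleans stand in for the feature flag, node.getReservedContainer() != null, and the preferred-node iterator having further elements:
> {code:java}
> // Hypothetical illustration only; none of these names come from the actual CapacityScheduler code.
> final class ReservedNodeSkipRule {
> 
>   /**
>    * @param featureEnabled            the proposed CS config flag (disabled by default)
>    * @param nodeHasReservedContainer whether the candidate node already holds a reservation
>    * @param moreCandidatesLeft       whether the preferred-node iterator has more nodes to offer
>    * @return true if allocation should not be attempted on this node
>    */
>   static boolean shouldSkipNode(boolean featureEnabled,
>                                 boolean nodeHasReservedContainer,
>                                 boolean moreCandidatesLeft) {
>     // Only skip when the feature is enabled, the node is already reserved,
>     // and there is at least one more candidate node left to try (note 2.).
>     return featureEnabled && nodeHasReservedContainer && moreCandidatesLeft;
>   }
> }
> {code}
> The test case from note 3. could then assert that with the flag off the current behavior is unchanged, and with the flag on a reserved node is only skipped while other candidates remain.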
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
