[ https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Szilard Nemeth updated YARN-11573:
----------------------------------
    Description: 
Applications can get stuck when the container allocation logic does not 
consider more nodes, but only nodes that have reserved containers.
This behavior can even block new AMs from being allocated on nodes, so they 
never reach the RUNNING state.
A jira that mentions the same thing is YARN-9598:
{quote}Nodes which have been reserved should be skipped when iterating 
candidates in RegularContainerAllocator#allocate, otherwise scheduler may 
generate allocation or reservation proposal on these node which will always be 
rejected in FiCaScheduler#commonCheckContainerAllocation.
{quote}
Since that jira implements 2 other points, I decided to create this one and 
implement the 3rd point separately.

Notes:

1. FiCaSchedulerApp#commonCheckContainerAllocation will log this:
{code:java}
Trying to allocate from reserved container in async scheduling mode
{code}
when RegularContainerAllocator creates a reservation proposal for nodes that 
already have a reserved container.

2. A better way is to prevent generating an AM container (or even normal 
container) allocation proposal on a node if it already has a reservation on it 
and we still have more nodes to check in the preferred node set. Completely 
preventing task containers from being allocated to worker nodes could limit 
the downscaling ability that we currently have.

3. CALL HIERARCHY
1. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#nodeUpdate
2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.api.records.NodeId, boolean)
3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.CandidateNodeSet<org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode>, boolean)
3.1. This is the place where it is decided whether to call allocateContainerOnSingleNode or allocateContainersOnMultiNodes
4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersOnMultiNodes
5. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateOrReserveNewContainers
6. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue#assignContainers
7. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractParentQueue#assignContainersToChildQueues
8. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractLeafQueue#assignContainers
9. org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#assignContainers
10. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainers
11. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#allocate
12. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#tryAllocateOnNode
13. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainersOnNode
14. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignNodeLocalContainers
15. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainer

As an example, this logs lines like:
{code:java}
2023-08-23 17:44:08,129 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator:
 assignContainers: node=<host> application=application_1692304118418_3151 
priority=0 pendingAsk=<per-allocation-resource=<memory:5632, 
vCores:1>,repeat=1> type=OFF_SWITCH
{code}
4. DETAILS OF RegularContainerAllocator#allocate
[Method 
definition|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L826-L896]

4.1. Defining ordered list of nodes to allocate containers on: 
[LINK|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L851-L852]
{code:java}
    Iterator<FiCaSchedulerNode> iter = schedulingPS.getPreferredNodeIterator(
        candidates);
{code}
4.2. 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.AppPlacementAllocator#getPreferredNodeIterator
4.3. 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.MultiNodeSortingManager#getMultiNodeSortIterator
 
([LINK|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/placement/MultiNodeSortingManager.java#L114-L180])
In this method, the MultiNodeLookupPolicy is resolved 
[here|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/placement/MultiNodeSortingManager.java#L142-L143]
4.4. 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.MultiNodeSorter#getMultiNodeLookupPolicy
4.5. This is where the MultiNodeLookupPolicy implementation of 
getPreferredNodeIterator is invoked (a sketch of the resulting node-ordering 
idea follows below)
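
As an illustration only (this is not the actual MultiNodeLookupPolicy implementation and not part of any patch), a minimal sketch of the node-ordering idea behind this jira: when the candidate nodes are ordered for the preferred node iterator, nodes that already hold a reserved container could be pushed to the end. The sketch only assumes that FiCaSchedulerNode#getReservedContainer() returns null when the node has no reservation, as in the snippet quoted under PROPOSED FIX below.
{code:java}
// Hypothetical sketch only: prefer nodes without a reserved container.
// Assumes org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode
// and that getReservedContainer() returns null when the node has no reservation.
Comparator<FiCaSchedulerNode> preferUnreservedNodes =
    Comparator.comparing((FiCaSchedulerNode node) ->
        node.getReservedContainer() != null);
// false (no reservation) sorts before true (has a reservation), so a lookup policy
// could apply this comparator before its usual resource-based ordering, e.g.:
// candidateNodes.sort(preferUnreservedNodes.thenComparing(existingOrdering));
{code}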

5. GOING UP THE CALL HIERARCHY UNTIL 
CapacityScheduler#allocateOrReserveNewContainers
1. CSAssignment is created 
[here|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1797-L1801]
 in method: CapacityScheduler#allocateOrReserveNewContainers
2. 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#submitResourceCommitRequest
3. 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#tryCommit
4. 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#accept
5. 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#commonCheckContainerAllocation
--> This returns false and logs this line:
{code:java}
2023-08-23 17:44:08,130 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 Trying to allocate from reserved container in async scheduling mode
{code}
h2. PROPOSED FIX

In method: RegularContainerAllocator#allocate

There's a loop that iterates over candidate nodes: 
[https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L853-L895]

We need to skip the nodes that have a reservation; example code:
{code:java}
if (reservedContainer == null) {
  // Do not schedule if there are any reservations to fulfill on the node
  if (node.getReservedContainer() != null) {
    LOG.debug("Skipping scheduling on node {} since it has already been"
            + " reserved by {}", node.getNodeID(),
        node.getReservedContainer().getContainerId());
    ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
        activitiesManager, node, application, schedulerKey,
        ActivityDiagnosticConstant.NODE_HAS_BEEN_RESERVED);
    continue;
  }
{code}
NOTE: This code block is copied from [^YARN-9598.001.patch#file-5]
h2. More notes for the implementation

1. This new behavior needs to be hidden behind a feature flag (CS config).
In my understanding, [^YARN-9598.001.patch#file-5] skips all nodes with 
reservations, regardless of whether the container is an AM container or a task 
container.
2. Only skip a node with an existing reservation if there are more nodes to 
process with the iterator (see the sketch after this list).
3. Add a test case to cover this scenario
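
A minimal sketch of how points 1 and 2 could fit together inside the candidate-node loop of RegularContainerAllocator#allocate. The boolean skipNodesWithReservations and its backing CS config property are hypothetical names introduced here for illustration, not existing configuration; iter is the preferred node iterator from the snippet in 4.1, and the rest reuses the calls from the YARN-9598 snippet above.
{code:java}
// Hypothetical sketch, gated by a new CS config flag; the field name
// "skipNodesWithReservations" and its backing property are illustrative only.
if (reservedContainer == null
    && skipNodesWithReservations          // feature flag (CS config), point 1
    && node.getReservedContainer() != null
    && iter.hasNext()) {                  // only skip if more candidates remain, point 2
  LOG.debug("Skipping scheduling on node {} since it has already been"
          + " reserved by {}", node.getNodeID(),
      node.getReservedContainer().getContainerId());
  ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
      activitiesManager, node, application, schedulerKey,
      ActivityDiagnosticConstant.NODE_HAS_BEEN_RESERVED);
  continue;
}
{code}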

 

> Add config option to make container allocation prefer nodes without reserved 
> containers
> ---------------------------------------------------------------------------------------
>
>                 Key: YARN-11573
>                 URL: https://issues.apache.org/jira/browse/YARN-11573
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Minor
>


