[ https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated YARN-11573:
----------------------------------
    Labels: pull-request-available  (was: )

Add config option to make container allocation prefer nodes without reserved containers
----------------------------------------------------------------------------------------

                 Key: YARN-11573
                 URL: https://issues.apache.org/jira/browse/YARN-11573
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacity scheduler
            Reporter: Szilard Nemeth
            Assignee: Szilard Nemeth
            Priority: Minor
              Labels: pull-request-available

Applications can get stuck when the container allocation logic does not consider additional nodes, but only nodes that already have reserved containers. This behavior can even block new AMs from being allocated on nodes, so the applications never reach the RUNNING state.

A jira that mentions the same thing is YARN-9598:
{quote}Nodes which have been reserved should be skipped when iterating candidates in RegularContainerAllocator#allocate, otherwise scheduler may generate allocation or reservation proposal on these node which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.
{quote}
Since that jira implements 2 other points, I decided to create this one and implement the 3rd point separately.

h2. Notes:

1. FiCaSchedulerApp#commonCheckContainerAllocation will log this:
{code:java}
Trying to allocate from reserved container in async scheduling mode
{code}
in case RegularContainerAllocator creates a reservation proposal for nodes that already have a reserved container.

2. A better way is to prevent generating an AM container (or even a normal container) allocation proposal on a node if it already has a reservation on it and we still have more nodes to check in the preferred node set. Completely disabling task containers from being allocated to worker nodes could limit the downscaling ability that we have currently.

h2. 3. CALL HIERARCHY

1. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#nodeUpdate
2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.api.records.NodeId, boolean)
3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.CandidateNodeSet<org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode>, boolean)
3.1. This is the place where it is decided whether to call allocateContainerOnSingleNode or allocateContainersOnMultiNodes
4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersOnMultiNodes
5. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateOrReserveNewContainers
6. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue#assignContainers
7. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractParentQueue#assignContainersToChildQueues
8. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractLeafQueue#assignContainers
9. org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#assignContainers
10. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainers
11. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#allocate
12. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#tryAllocateOnNode
13. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainersOnNode
14. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignNodeLocalContainers
15. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainer

As an example, this logs lines like the following:
{code:java}
2023-08-23 17:44:08,129 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator: assignContainers: node=<host> application=application_1692304118418_3151 priority=0 pendingAsk=<per-allocation-resource=<memory:5632, vCores:1>,repeat=1> type=OFF_SWITCH
{code}

h2. 4. DETAILS OF RegularContainerAllocator#allocate

[Method definition|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L826-L896]

4.1. Defining the ordered list of nodes to allocate containers on: [LINK|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L851-L852]
{code:java}
Iterator<FiCaSchedulerNode> iter = schedulingPS.getPreferredNodeIterator(
    candidates);
{code}
4.2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.AppPlacementAllocator#getPreferredNodeIterator
4.3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.MultiNodeSortingManager#getMultiNodeSortIterator ([LINK|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/placement/MultiNodeSortingManager.java#L114-L180]). In this method, the MultiNodeLookupPolicy is resolved [here|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/placement/MultiNodeSortingManager.java#L142-L143]
4.4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.MultiNodeSorter#getMultiNodeLookupPolicy
4.5. This is where the MultiNodeLookupPolicy implementation of getPreferredNodeIterator is invoked (a toy sketch of what this chain ultimately produces follows below).
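To make 4.1-4.5 a bit more concrete: what the resolved MultiNodeLookupPolicy ultimately hands back to allocate is simply an ordering over the candidate nodes, exposed through an Iterator that the allocator walks. Below is a minimal, self-contained toy sketch of that idea; NodeStandIn and the "most available memory first" rule are purely illustrative stand-ins, not the real YARN classes or the actual lookup policy.
{code:java}
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Toy stand-in for the getPreferredNodeIterator() result described in 4.1-4.5.
// The real allocator iterates FiCaSchedulerNode instances in the order produced
// by the configured MultiNodeLookupPolicy; here a simple record and comparator
// are used instead.
public class PreferredNodeIteratorSketch {

  record NodeStandIn(String host, long availableMb) { }

  static Iterator<NodeStandIn> preferredNodeIterator(List<NodeStandIn> candidates) {
    // Illustrative ordering only: most available memory first.
    return candidates.stream()
        .sorted(Comparator.comparingLong(NodeStandIn::availableMb).reversed())
        .iterator();
  }

  public static void main(String[] args) {
    List<NodeStandIn> candidates = List.of(
        new NodeStandIn("worker-1", 2048),
        new NodeStandIn("worker-2", 8192),
        new NodeStandIn("worker-3", 4096));
    // The allocator would try candidates in exactly this order:
    // worker-2, worker-3, worker-1.
    preferredNodeIterator(candidates).forEachRemaining(n -> System.out.println(n.host()));
  }
}
{code}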
h2. 5. GOING UP THE CALL HIERARCHY UNTIL CapacityScheduler#allocateOrReserveNewContainers

1. A CSAssignment is created [here|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1797-L1801] in method: CapacityScheduler#allocateOrReserveNewContainers
2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#submitResourceCommitRequest
3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#tryCommit
4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#accept
5. org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#commonCheckContainerAllocation --> This returns false and logs this line:
{code:java}
2023-08-23 17:44:08,130 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Trying to allocate from reserved container in async scheduling mode
{code}

h2. PROPOSED FIX

In method RegularContainerAllocator#allocate there is a loop that iterates over candidate nodes: [https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L853-L895]
We need to skip the nodes that already have a reservation. Example code:
{code:java}
if (reservedContainer == null) {
  // Do not schedule if there are any reservations to fulfill on the node
  if (node.getReservedContainer() != null) {
    LOG.debug("Skipping scheduling on node {} since it has already been"
        + " reserved by {}", node.getNodeID(),
        node.getReservedContainer().getContainerId());
    ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
        activitiesManager, node, application, schedulerKey,
        ActivityDiagnosticConstant.NODE_HAS_BEEN_RESERVED);
    continue;
  }
  // ... rest of the per-node allocation logic
}
{code}
NOTE: This code block is copied from [^YARN-9598.001.patch#file-5]

h2. More notes for the implementation

1. This new behavior needs to be hidden behind a feature flag (CS config). In my understanding, [^YARN-9598.001.patch#file-5] skips all the nodes with reservations, regardless of whether the container is an AM container or a task container (a hedged sketch of the gating is shown after this list).
2. Only skip a node with an existing reservation if there are more nodes to process with the iterator.
3. Add a test case to cover this scenario.
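As referenced from note 1, here is a minimal, self-contained sketch of the intended gating decision from notes 1 and 2. Everything in it is hypothetical: the config key name, the shouldSkipNode helper and its boolean parameters only illustrate the decision that would sit inside the loop of RegularContainerAllocator#allocate; they are not the actual implementation.
{code:java}
// Hypothetical sketch only; not the actual CapacityScheduler / RegularContainerAllocator code.
public class SkipReservedNodeSketch {

  // Hypothetical CS config key guarding the new behavior (note 1).
  static final String SKIP_NODES_WITH_RESERVED_CONTAINERS_KEY =
      "yarn.scheduler.capacity.skip-nodes-with-reserved-containers";

  /**
   * @param featureEnabled     value read from the hypothetical CS config flag
   * @param nodeHasReservation node.getReservedContainer() != null in the real allocator
   * @param moreCandidatesLeft iter.hasNext() in the real allocator (note 2)
   */
  static boolean shouldSkipNode(boolean featureEnabled,
                                boolean nodeHasReservation,
                                boolean moreCandidatesLeft) {
    // Skip only when the feature is enabled, the node already carries a reservation,
    // and the preferred-node iterator still has other candidates to offer.
    return featureEnabled && nodeHasReservation && moreCandidatesLeft;
  }

  public static void main(String[] args) {
    System.out.println(shouldSkipNode(true, true, true));   // true: try the next candidate
    System.out.println(shouldSkipNode(true, true, false));  // false: last candidate is still used
    System.out.println(shouldSkipNode(false, true, true));  // false: feature flag is off
  }
}
{code}
Keeping the moreCandidatesLeft condition means the last remaining candidate is never skipped, so allocation cannot stall even when every node in the preferred set carries a reservation.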