[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Syed Shameerur Rahman updated YARN-11702: ----------------------------------------- Description: *Replication Steps:* Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) {code:java} spark.executor.memory 1024M spark.driver.memory 2048M spark.executor.cores 1 spark.executor.instances 20 spark.dynamicAllocation.enabled false{code} Based on the setup, there should be 20 spark executors, but from the ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 of them were released in seconds. On analyzing the Spark ApplicationMaster (AM) logs, The following logs were observed. {code:java} 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with custom resources: <memory:1408, max memory:2147483647, vCores:1, max vCores:2147483647> 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, launching executors on 8 of them. 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, launching executors on 8 of them. 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, launching executors on 4 of them. 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, launching executors on 0 of them. {code} It was clear for the logs that extra allocated 12 containers are being ignored from Spark side. Inorder to debug this further, additional log lines were added to [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] class in increment and decrement of container request to expose additional information about the request. {code:java} 2024-06-24 14:10:14,075 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,077 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,077 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,111 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,112 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,112 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,113 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 15 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,113 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 14 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,113 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 13 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,327 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 12 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,328 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 11 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,362 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 10 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,363 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 9 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,363 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 8 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,363 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 7 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,363 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 6 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,364 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 5 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_0000012024-06-24 14:10:14,454 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 35 on default port 8030): Updates PendingContainers: 4 Decremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,454 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 35 on default port 8030): Updates PendingContainers: 0 Incremented by: 12 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,578 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 12 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,579 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 11 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,614 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 10 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,614 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 9 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,614 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 8 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,615 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 7 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,615 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 6 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,615 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 5 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,829 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 4 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,829 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 3 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,864 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 2 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,864 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 1 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_0000012024-06-24 14:10:14,874 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 Incremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:15,080 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 4 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:15,081 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 3 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:15,115 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 2 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:15,115 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 1 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:17,931 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 3 on default port 8030): checking for deactivate of application :application_1719234929152_0004 2024-06-24 14:10:20,743 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Application application_1719234929152_0004 requests cleared {code} *RCA* It looks like the issue is happening due to asynchronous nature of AMRMClient response communication. The newly allocated containers from Yarn is send as part of the next request from the client causing additional additional. Please check the following example. As per the above mentioned logs # At time 2024-06-24 14:10:14,075, AM sends 20 container request to Yarn, since this is the first request, yarn had not allocated any containers and send 0 containers as response. # At time 2024-06-24 14:10:14,454, AM sends 12 container request to Yarn which suggests that before this time there was an empty request ask to Yarn which had returned 8 containers as allocated in reponse. For this request was well the Yarn returned 8 container as allocated in response. # At time 2024-06-24 14:10:14,874, AM again send 4 container request to yarn for which it returns 12 containers allocated (based on previous request) as response. # Since all the container request is exhausted, AM sends empty request to yarn which returns 4 containers allocated (based on previous request) # So in total 32 containers were allocated for 20 container request. The problem here is AM (client) is being notified about the allocated container request in the next heartbeat/request causing inconsistency. *Solution* Normalize/update the AM container request to consdier the allocated containers. This can be done in the respective scheduler side i.e newContainerAsk = AMContainerAsk - AllocatedContainers. *PS: I could see similar behavior with Apache Tez and could happen with the latest hadoop-3.4.0 version as well.* was: *Replication Steps:* Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) {code:java} spark.executor.memory 1024M spark.driver.memory 2048M spark.executor.cores 1 spark.executor.instances 20 spark.dynamicAllocation.enabled false{code} Based on the setup, there should be 20 spark executors, but from the ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 of them were released in seconds. On analyzing the Spark ApplicationMaster (AM) logs, The following logs were observed. {code:java} 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with custom resources: <memory:1408, max memory:2147483647, vCores:1, max vCores:2147483647> 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, launching executors on 8 of them. 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, launching executors on 8 of them. 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, launching executors on 4 of them. 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, launching executors on 0 of them. {code} It was clear for the logs that extra allocated 12 containers are being ignored from Spark side. Inorder to debug this further, additional log lines were added to [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] class in increment and decrement of container request to expose additional information about the request. {code:java} 2024-06-24 14:10:14,075 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,077 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,077 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,111 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,112 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,112 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,113 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 15 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,113 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 14 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,113 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 13 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,327 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 12 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,328 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 11 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,362 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 10 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,363 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 9 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,363 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 8 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,363 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 7 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,363 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 6 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,364 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 5 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_0000012024-06-24 14:10:14,454 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 35 on default port 8030): Updates PendingContainers: 4 Decremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,454 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 35 on default port 8030): Updates PendingContainers: 0 Incremented by: 12 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,578 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 12 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,579 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 11 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,614 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 10 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,614 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 9 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,614 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 8 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,615 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 7 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,615 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 6 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,615 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 5 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,829 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 4 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,829 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 3 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,864 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 2 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,864 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 1 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_0000012024-06-24 14:10:14,874 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 Incremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:15,080 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 4 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:15,081 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 3 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:15,115 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 2 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:15,115 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 1 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:17,931 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 3 on default port 8030): checking for deactivate of application :application_1719234929152_0004 2024-06-24 14:10:20,743 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Application application_1719234929152_0004 requests cleared {code} *RCA* It looks like the issue is happening due to asynchronous nature of AMRMClient response communication. The newly allocated containers from Yarn is send as part of the next request from the client causing additional additional. Please check the following example. As per the above mentioned logs # At time 2024-06-24 14:10:14,075, AM sends 20 container request to Yarn, since this is the first request, yarn had not allocated any containers and send 0 containers as response. # At time 2024-06-24 14:10:14,454, AM sends 12 container request to Yarn which suggests that before this time there was an empty request ask to Yarn which had returned 8 containers as allocated in reponse. For this request was well the Yarn returned 8 container as allocated in response. # At time 2024-06-24 14:10:14,874, AM again send 4 container request to yarn for which it returns 12 containers allocated (based on previous request) as response. # Since all the container request is exhausted, AM sends empty request to yarn which returns 4 containers allocated (based on previous request) # So in total 32 containers were allocated for 20 container request. The problem here is AM (client) is being notified about the allocated container request in the next heartbeat/request causing inconsistency. *Solution* Normalize/update the AM container request to consdier the allocated containers. This can be done in the respective scheduler side i.e newContainerAsk = AMContainerAsk - AllocatedContainers. *PS: I could see similar behavior with Apache Tez.* > Fix Yarn over allocating containers > ----------------------------------- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Syed Shameerur Rahman > Assignee: Syed Shameerur Rahman > Priority: Major > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the setup, there should be 20 spark executors, but from the > ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 > of them were released in seconds. On analyzing the Spark ApplicationMaster > (AM) logs, The following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: <memory:1408, max memory:2147483647, vCores:1, max > vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, > launching executors on 4 of them. > 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, > launching executors on 0 of them. > {code} > It was clear for the logs that extra allocated 12 containers are being > ignored from Spark side. Inorder to debug this further, additional log lines > were added to > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in increment and decrement of container request to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 15 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 14 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 13 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,327 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 12 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,328 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 11 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,362 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 10 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,363 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 9 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,363 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 8 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,363 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 7 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,363 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 6 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,364 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 5 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_0000012024-06-24 14:10:14,454 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 35 on default port 8030): Updates PendingContainers: 4 > Decremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,454 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 35 on default port 8030): Updates PendingContainers: 0 > Incremented by: 12 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,578 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 12 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,579 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 11 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,614 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 10 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,614 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 9 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,614 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 8 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,615 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 7 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,615 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 6 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,615 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 5 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,829 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 4 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,829 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 3 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,864 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 2 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:14,864 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 1 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_0000012024-06-24 14:10:14,874 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:15,080 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 4 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:15,081 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 3 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:15,115 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 2 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:15,115 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 1 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000001 > 2024-06-24 14:10:17,931 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 3 on default port 8030): checking for deactivate of > application :application_1719234929152_0004 > 2024-06-24 14:10:20,743 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Application > application_1719234929152_0004 requests cleared {code} > *RCA* > It looks like the issue is happening due to asynchronous nature of AMRMClient > response communication. The newly allocated containers from Yarn is send as > part of the next request from the client causing additional additional. > Please check the following example. As per the above mentioned logs > # At time 2024-06-24 14:10:14,075, AM sends 20 container request to Yarn, > since this is the first request, yarn had not allocated any containers and > send 0 containers as response. > # At time 2024-06-24 14:10:14,454, AM sends 12 container request to Yarn > which suggests that before this time there was an empty request ask to Yarn > which had returned 8 containers as allocated in reponse. For this request was > well the Yarn returned 8 container as allocated in response. > # At time 2024-06-24 14:10:14,874, AM again send 4 container request to yarn > for which it returns 12 containers allocated (based on previous request) as > response. > # Since all the container request is exhausted, AM sends empty request to > yarn which returns 4 containers allocated (based on previous request) > # So in total 32 containers were allocated for 20 container request. The > problem here is AM (client) is being notified about the allocated container > request in the next heartbeat/request causing inconsistency. > > *Solution* > Normalize/update the AM container request to consdier the allocated > containers. This can be done in the respective scheduler side i.e > newContainerAsk = AMContainerAsk - AllocatedContainers. > > *PS: I could see similar behavior with Apache Tez and could happen with the > latest hadoop-3.4.0 version as well.* > > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org