[ 
https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Syed Shameerur Rahman updated YARN-11702:
-----------------------------------------
    Description: 
*Replication Steps:*

Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler)

 
{code:java}
spark.executor.memory            1024M
spark.driver.memory              2048M
spark.executor.cores             1
spark.executor.instances 20
spark.dynamicAllocation.enabled false{code}
 

Based on the setup, there should be 20 spark executors, but from the 
ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 of 
them were released in seconds. On analyzing the Spark ApplicationMaster (AM) 
logs, The following logs were observed.

 
{code:java}
4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) for  
ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with custom 
resources: <memory:1408, max memory:2147483647, vCores:1, max vCores:2147483647>
24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, 
launching executors on 8 of them.
24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, 
launching executors on 8 of them.
24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, 
launching executors on 4 of them.
24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, 
launching executors on 0 of them.

{code}
It was clear for the logs that extra allocated 12 containers are being ignored 
from Spark side. Inorder to debug this further, additional log lines were added 
to 
[AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427]
 class in increment and decrement of container request to expose additional 
information about the request.

 
{code:java}
2024-06-24 14:10:14,075 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC 
Server handler 42 on default port 8030): Updates PendingContainers: 0 
Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,077 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
20 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,077 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
19 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,111 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
18 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,112 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
17 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,112 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
16 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,113 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
15 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,113 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
14 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,113 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
13 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,327 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
12 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,328 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
11 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,362 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
10 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,363 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
9 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,363 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
8 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,363 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
7 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,363 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
6 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,364 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
5 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_0000012024-06-24 
14:10:14,454 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC 
Server handler 35 on default port 8030): Updates PendingContainers: 4 
Decremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,454 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC 
Server handler 35 on default port 8030): Updates PendingContainers: 0 
Incremented by: 12 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,578 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
12 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,579 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
11 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,614 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
10 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,614 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
9 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,614 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
8 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,615 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
7 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,615 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
6 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,615 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
5 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,829 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
4 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,829 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
3 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,864 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
2 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,864 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
1 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_0000012024-06-24 
14:10:14,874 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC 
Server handler 42 on default port 8030): Updates PendingContainers: 0 
Incremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:15,080 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
4 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:15,081 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
3 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:15,115 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
2 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:15,115 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
1 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:17,931 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC 
Server handler 3 on default port 8030): checking for deactivate of application 
:application_1719234929152_0004
2024-06-24 14:10:20,743 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Application 
application_1719234929152_0004 requests cleared {code}
*RCA*

It looks like the issue is happening due to asynchronous nature of AMRMClient 
response communication. The newly allocated containers from Yarn is send as 
part of the next request from the client causing additional additional. Please 
check the following example. As per the above mentioned logs
 # At time 2024-06-24 14:10:14,075, AM sends 20 container request to Yarn, 
since this is the first request, yarn had not allocated any containers and send 
0 containers as response.
 # At time 2024-06-24 14:10:14,454, AM sends 12 container request to Yarn which 
suggests that before this time there was an empty request ask to Yarn which had 
returned 8 containers as allocated in reponse. For this request was well the 
Yarn returned 8 container as allocated in response.
 # At time 2024-06-24 14:10:14,874, AM again send 4 container request to yarn 
for which it returns 12 containers allocated (based on previous request) as 
response.
 # Since all the container request is exhausted, AM sends empty request to yarn 
which returns 4 containers allocated (based on previous request)
 # So in total 32 containers were allocated for 20 container request. The 
problem here is AM (client) is being notified about the allocated container 
request in the next heartbeat/request causing inconsistency.

 

*Solutions*

*1. Modify the AM request in the Yarn/RM side*

Normalize/update the AM container request to consider the allocated containers. 
This can be done in the respective scheduler side i.e newContainerAsk = 
AMContainerAsk - AllocatedContainers.

*2. Make AM aware of allocated container*

Before client making allocate request, Check AM should check with RM about 
container allocated. This approach requires additional AM-RM client 
communication which can be expense when there are large number of allocate 
requests.

 

*PS: I could see similar behavior with Apache Tez and could happen with the 
latest hadoop-3.4.0 version as well.*

 

 

  was:
*Replication Steps:*

Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler)

 
{code:java}
spark.executor.memory            1024M
spark.driver.memory              2048M
spark.executor.cores             1
spark.executor.instances 20
spark.dynamicAllocation.enabled false{code}
 

Based on the setup, there should be 20 spark executors, but from the 
ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 of 
them were released in seconds. On analyzing the Spark ApplicationMaster (AM) 
logs, The following logs were observed.

 
{code:java}
4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) for  
ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with custom 
resources: <memory:1408, max memory:2147483647, vCores:1, max vCores:2147483647>
24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, 
launching executors on 8 of them.
24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, 
launching executors on 8 of them.
24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, 
launching executors on 4 of them.
24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, 
launching executors on 0 of them.

{code}
It was clear for the logs that extra allocated 12 containers are being ignored 
from Spark side. Inorder to debug this further, additional log lines were added 
to 
[AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427]
 class in increment and decrement of container request to expose additional 
information about the request.

 
{code:java}
2024-06-24 14:10:14,075 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC 
Server handler 42 on default port 8030): Updates PendingContainers: 0 
Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,077 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
20 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,077 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
19 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,111 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
18 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,112 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
17 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,112 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
16 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,113 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
15 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,113 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
14 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,113 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
13 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,327 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
12 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,328 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
11 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,362 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
10 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,363 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
9 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,363 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
8 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,363 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
7 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,363 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
6 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,364 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
5 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_0000012024-06-24 
14:10:14,454 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC 
Server handler 35 on default port 8030): Updates PendingContainers: 4 
Decremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,454 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC 
Server handler 35 on default port 8030): Updates PendingContainers: 0 
Incremented by: 12 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,578 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
12 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,579 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
11 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,614 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
10 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,614 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
9 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,614 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
8 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,615 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
7 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,615 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
6 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,615 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
5 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,829 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
4 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,829 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
3 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,864 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
2 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,864 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
1 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_0000012024-06-24 
14:10:14,874 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC 
Server handler 42 on default port 8030): Updates PendingContainers: 0 
Incremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:15,080 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
4 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:15,081 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
3 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:15,115 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
2 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:15,115 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
1 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:17,931 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC 
Server handler 3 on default port 8030): checking for deactivate of application 
:application_1719234929152_0004
2024-06-24 14:10:20,743 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Application 
application_1719234929152_0004 requests cleared {code}
*RCA*

It looks like the issue is happening due to asynchronous nature of AMRMClient 
response communication. The newly allocated containers from Yarn is send as 
part of the next request from the client causing additional additional. Please 
check the following example. As per the above mentioned logs
 # At time 2024-06-24 14:10:14,075, AM sends 20 container request to Yarn, 
since this is the first request, yarn had not allocated any containers and send 
0 containers as response.
 # At time 2024-06-24 14:10:14,454, AM sends 12 container request to Yarn which 
suggests that before this time there was an empty request ask to Yarn which had 
returned 8 containers as allocated in reponse. For this request was well the 
Yarn returned 8 container as allocated in response.
 # At time 2024-06-24 14:10:14,874, AM again send 4 container request to yarn 
for which it returns 12 containers allocated (based on previous request) as 
response.
 # Since all the container request is exhausted, AM sends empty request to yarn 
which returns 4 containers allocated (based on previous request)
 # So in total 32 containers were allocated for 20 container request. The 
problem here is AM (client) is being notified about the allocated container 
request in the next heartbeat/request causing inconsistency.

 

*Solutions*

*### 1 Modify the AM request in the Yarn/RM side.*

Normalize/update the AM container request to consider the allocated containers. 
This can be done in the respective scheduler side i.e newContainerAsk = 
AMContainerAsk - AllocatedContainers.

*### 2 Make AM aware of allocated container*

Before client making allocate request, Check AM should check with RM about 
container allocated. This approach requires additional AM-RM client 
communication which can be expense when there are large number of allocate 
requests.

 

*PS: I could see similar behavior with Apache Tez and could happen with the 
latest hadoop-3.4.0 version as well.*

 

 


> Fix Yarn over allocating containers
> -----------------------------------
>
>                 Key: YARN-11702
>                 URL: https://issues.apache.org/jira/browse/YARN-11702
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Syed Shameerur Rahman
>            Assignee: Syed Shameerur Rahman
>            Priority: Major
>
> *Replication Steps:*
> Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler)
>  
> {code:java}
> spark.executor.memory            1024M
> spark.driver.memory              2048M
> spark.executor.cores             1
> spark.executor.instances 20
> spark.dynamicAllocation.enabled false{code}
>  
> Based on the setup, there should be 20 spark executors, but from the 
> ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 
> of them were released in seconds. On analyzing the Spark ApplicationMaster 
> (AM) logs, The following logs were observed.
>  
> {code:java}
> 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) 
> for  ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with 
> custom resources: <memory:1408, max memory:2147483647, vCores:1, max 
> vCores:2147483647>
> 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, 
> launching executors on 8 of them.
> 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, 
> launching executors on 8 of them.
> 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, 
> launching executors on 4 of them.
> 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, 
> launching executors on 0 of them.
> {code}
> It was clear for the logs that extra allocated 12 containers are being 
> ignored from Spark side. Inorder to debug this further, additional log lines 
> were added to 
> [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427]
>  class in increment and decrement of container request to expose additional 
> information about the request.
>  
> {code:java}
> 2024-06-24 14:10:14,075 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 
> Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, 
> containerToUpdate=null} for: appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,077 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,077 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,111 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,112 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,112 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,113 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 15 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,113 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 14 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,113 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 13 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,327 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 12 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,328 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 11 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,362 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 10 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,363 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 9 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,363 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 8 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,363 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 7 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,363 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 6 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,364 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 5 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_0000012024-06-24 14:10:14,454 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (IPC Server handler 35 on default port 8030): Updates PendingContainers: 4 
> Decremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, 
> containerToUpdate=null} for: appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,454 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (IPC Server handler 35 on default port 8030): Updates PendingContainers: 0 
> Incremented by: 12 SchedulerRequestKey{priority=0, allocationRequestId=0, 
> containerToUpdate=null} for: appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,578 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 12 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,579 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 11 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,614 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 10 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,614 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 9 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,614 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 8 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,615 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 7 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,615 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 6 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,615 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 5 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,829 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 4 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,829 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 3 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,864 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 2 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,864 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 1 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_0000012024-06-24 14:10:14,874 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 
> Incremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, 
> containerToUpdate=null} for: appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:15,080 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 4 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:15,081 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 3 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:15,115 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 2 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:15,115 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Allocate Updates 
> PendingContainers: 1 Decremented by: 1 SchedulerRequestKey{priority=0, 
> allocationRequestId=0, containerToUpdate=null} for: 
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:17,931 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (IPC Server handler 3 on default port 8030): checking for deactivate of 
> application :application_1719234929152_0004
> 2024-06-24 14:10:20,743 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
> (SchedulerEventDispatcher:Event Processor): Application 
> application_1719234929152_0004 requests cleared {code}
> *RCA*
> It looks like the issue is happening due to asynchronous nature of AMRMClient 
> response communication. The newly allocated containers from Yarn is send as 
> part of the next request from the client causing additional additional. 
> Please check the following example. As per the above mentioned logs
>  # At time 2024-06-24 14:10:14,075, AM sends 20 container request to Yarn, 
> since this is the first request, yarn had not allocated any containers and 
> send 0 containers as response.
>  # At time 2024-06-24 14:10:14,454, AM sends 12 container request to Yarn 
> which suggests that before this time there was an empty request ask to Yarn 
> which had returned 8 containers as allocated in reponse. For this request was 
> well the Yarn returned 8 container as allocated in response.
>  # At time 2024-06-24 14:10:14,874, AM again send 4 container request to yarn 
> for which it returns 12 containers allocated (based on previous request) as 
> response.
>  # Since all the container request is exhausted, AM sends empty request to 
> yarn which returns 4 containers allocated (based on previous request)
>  # So in total 32 containers were allocated for 20 container request. The 
> problem here is AM (client) is being notified about the allocated container 
> request in the next heartbeat/request causing inconsistency.
>  
> *Solutions*
> *1. Modify the AM request in the Yarn/RM side*
> Normalize/update the AM container request to consider the allocated 
> containers. This can be done in the respective scheduler side i.e 
> newContainerAsk = AMContainerAsk - AllocatedContainers.
> *2. Make AM aware of allocated container*
> Before client making allocate request, Check AM should check with RM about 
> container allocated. This approach requires additional AM-RM client 
> communication which can be expense when there are large number of allocate 
> requests.
>  
> *PS: I could see similar behavior with Apache Tez and could happen with the 
> latest hadoop-3.4.0 version as well.*
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to