Syed Shameerur Rahman created YARN-11702:
--------------------------------------------

             Summary: Fix Yarn over allocating containers
                 Key: YARN-11702
                 URL: https://issues.apache.org/jira/browse/YARN-11702
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Syed Shameerur Rahman
            Assignee: Syed Shameerur Rahman


*Replication Steps:*

Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler)

 
{code:java}

spark.executor.memory            1024M
spark.driver.memory              2048M
spark.executor.cores             1
spark.executor.instances 20
spark.dynamicAllocation.enabled false{code}
 

Based on the setup, there should be 20 spark executors, but from the 
ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 of 
them were released in seconds. On analyzing the Spark ApplicationMaster (AM) 
logs, The following logs were observed.

 
{code:java}
4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) for  
ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with custom 
resources: <memory:1408, max memory:2147483647, vCores:1, max vCores:2147483647>
24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, 
launching executors on 8 of them.
24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, 
launching executors on 8 of them.
24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, 
launching executors on 4 of them.
24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, 
launching executors on 0 of them.

{code}
It was clear for the logs that extra allocated 12 containers are being ignored 
from Spark side. Inorder to debug this further, additional log lines were added 
to 
[AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427]
 class in increment and decrement of container request to expose additional 
information about the request.

 
{code:java}
2024-06-24 14:10:14,075 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC 
Server handler 42 on default port 8030): Updates PendingContainers: 0 
Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,077 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
20 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,077 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
19 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,111 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
18 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,112 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
17 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,112 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
16 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,113 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
15 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,113 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
14 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,113 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
13 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,327 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
12 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,328 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
11 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,362 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
10 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,363 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
9 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,363 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
8 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,363 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
7 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,363 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
6 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,364 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
5 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_0000012024-06-24 
14:10:14,454 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC 
Server handler 35 on default port 8030): Updates PendingContainers: 4 
Decremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,454 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC 
Server handler 35 on default port 8030): Updates PendingContainers: 0 
Incremented by: 12 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,578 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
12 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,579 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
11 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,614 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
10 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,614 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
9 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,614 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
8 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,615 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
7 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,615 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
6 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,615 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
5 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,829 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
4 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,829 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
3 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,864 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
2 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,864 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
1 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_0000012024-06-24 
14:10:14,874 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC 
Server handler 42 on default port 8030): Updates PendingContainers: 0 
Incremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:15,080 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
4 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:15,081 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
3 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:15,115 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
2 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:15,115 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
1 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:17,931 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC 
Server handler 3 on default port 8030): checking for deactivate of application 
:application_1719234929152_0004
2024-06-24 14:10:20,743 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Application 
application_1719234929152_0004 requests cleared {code}
*RCA*

It looks like the issue is happening due to asynchronous nature of AMRMClient 
response communication. The newly allocated containers from Yarn is send as 
part of the next request from the client causing additional additional. Please 
check the following example. As per the above mentioned logs
 # At time 2024-06-24 14:10:14,075, AM sends 20 container request to Yarn, 
since this is the first request, yarn had not allocated any containers and send 
0 containers as response.
 # At time 2024-06-24 14:10:14,454, AM sends 12 container request to Yarn which 
suggests that before this time there was an empty request ask to Yarn which had 
returned 8 containers as allocated in reponse. For this request was well the 
Yarn returned 8 container as allocated in response.
 # At time 2024-06-24 14:10:14,874, AM again send 4 container request to yarn 
for which it returns 12 containers allocated (based on previous request) as 
response.
 # Since all the container request is exhausted, AM sends empty request to yarn 
which returns 4 containers allocated (based on previous request)
 # So in total 32 containers were allocated for 20 container request. The 
problem here is AM (client) is being notified about the allocated container 
request in the next heartbeat/request causing inconsistency.

 

*Solution*

Normalize/update the AM container request to consdier the allocated containers. 
This can be done in the respective scheduler side i.e newContainerAsk = 
AMContainerAsk - AllocatedContainers.

 

*PS: I could see similar behavior with Apache Tez.*

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org

Reply via email to