[ 
https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Syed Shameerur Rahman updated YARN-11697:
-----------------------------------------
    Description: 
For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with the 
following exception:
{code:java}
2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher 
(SchedulerEventDispatcher:Event Processor): Error in handling event type 
APP_ATTEMPT_REMOVED to the Event Dispatcher
java.lang.IllegalStateException: Given app to remove 
appattempt_1706879498319_86660_000001 Alloc: <memory:0, vCores:0> does not 
exist in queue [root, demand=<memory:10826752, vCores:2101>, 
running=<memory:99328, vCores:17>, share=<memory:6201984, vCores:0>, w=1.0]
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
        at 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
        at java.lang.Thread.run(Thread.java:750)
{code}
The exception looks similar to the one reported in YARN-5136, but there appear 
to be some edge cases that YARN-5136 does not cover.

1. On a deeper look, and as mentioned in the comment here, if calls to 
moveApplication and removeApplicationAttempt for the same attempt are 
processed in quick succession, the application attempt still holds a queue 
reference but has already been removed from that queue's list of 
applications.

2. This can happen when 
[moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
 removes the appAttempt from the queue and 
[removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
 also tries to remove the same appAttempt from the queue.

3. On further checking, I could see that a writeLock on the appAttempt is 
taken before doing 
[moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
 , whereas for 
[removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
 I don't see any write lock being taken, which can result in a race condition 
if the same appAttempt is being processed by both calls.

4. Additionally, as mentioned in the comment here, when such a scenario occurs 
we should ideally not take down the RM.
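
To make points 1 to 4 concrete, here is a minimal, self-contained sketch of 
the idea. It is not the FairScheduler code: LeafQueue, Attempt and the two 
static methods are hypothetical stand-ins, and the per-attempt lock is an 
assumption about how the fix could look. Both paths take the same write lock 
on the attempt, and removal treats a missing attempt as a no-op instead of 
throwing IllegalStateException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class QueueRaceSketch {
  /** Hypothetical stand-in for a leaf queue; not the real FSLeafQueue. */
  static class LeafQueue {
    private final List<String> apps = new ArrayList<>();
    synchronized void addApp(String id) { apps.add(id); }
    /** Returns false instead of throwing when the attempt is absent. */
    synchronized boolean removeApp(String id) { return apps.remove(id); }
    synchronized boolean contains(String id) { return apps.contains(id); }
  }

  /** Hypothetical stand-in for an app attempt with a queue reference. */
  static class Attempt {
    final String id;
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    LeafQueue queue; // guarded by lock; null once the attempt is removed
    Attempt(String id, LeafQueue queue) { this.id = id; this.queue = queue; }
  }

  /** Moves the attempt to the target queue under the attempt's write lock. */
  static void moveApplication(Attempt a, LeafQueue target) {
    a.lock.writeLock().lock();
    try {
      if (a.queue == null) {
        return; // attempt was already removed; nothing to move
      }
      if (a.queue.removeApp(a.id)) {
        target.addApp(a.id);
        a.queue = target; // queue reference and queue contents stay in sync
      }
    } finally {
      a.lock.writeLock().unlock();
    }
  }

  /** Removes the attempt under the same lock; a missing attempt is a no-op. */
  static boolean removeApplicationAttempt(Attempt a) {
    a.lock.writeLock().lock();
    try {
      if (a.queue == null) {
        return false; // already removed; do not fail the caller
      }
      boolean removed = a.queue.removeApp(a.id);
      a.queue = null;
      return removed;
    } finally {
      a.lock.writeLock().unlock();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    LeafQueue source = new LeafQueue();
    LeafQueue target = new LeafQueue();
    source.addApp("attempt_1");
    Attempt attempt = new Attempt("attempt_1", source);

    // Run both operations concurrently, in either order.
    Thread mover = new Thread(() -> moveApplication(attempt, target));
    Thread remover = new Thread(() -> removeApplicationAttempt(attempt));
    mover.start();
    remover.start();
    mover.join();
    remover.join();

    System.out.println("in source: " + source.contains("attempt_1")
        + ", in target: " + target.contains("attempt_1"));
  }
}
```

Whichever thread wins, the attempt ends up in neither queue and no exception 
reaches the caller, which is the behaviour point 4 asks for.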

  was:
For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with the 
following exception
{code:java}
2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher 
(SchedulerEventDispatcher:Event Processor): Error in handling event type 
APP_ATTEMPT_REMOVED to the Event Dispatcher
java.lang.IllegalStateException: Given app to remove 
appattempt_1706879498319_86660_000001 Alloc: <memory:0, vCores:0> does not 
exist in queue [root.tier2.livy, demand=<memory:10826752, vCores:2101>, 
running=<memory:99328, vCores:17>, share=<memory:6201984, vCores:0>, w=1.0]
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
        at 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
        at java.lang.Thread.run(Thread.java:750)
{code}
The exception seems similar to the one mentioned in YARN-5136, but it looks 
like there is still some edge cases not covered by YARN-5136.

1. On deeper look, i could see that as mentioned in the comment here. if a call 
for a moveApplication and removeApplicationAttempt for the same attempt are 
processed in short succession the application attempt will still contain a 
queue reference but is already removed from the list of applications for the 
queue.

2. This can happen when 
[moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
 removes the appAttempt from the queue and 
[removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
 also tries to remove the same appAttempt from the queue.

3. On further checking, i could see that before doing 
[moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
 writeLock on appAttempt is taken where as for 
[removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
 , i don't see any writelock being taken which can result in race condition if 
same appAttempt is being processed.

4. Additionally as mentioned in the comment here when such scenario occurs 
ideally we should not take down RM.


> Fix fair scheduler race condition in removeApplicationAttempt and 
> moveApplication
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-11697
>                 URL: https://issues.apache.org/jira/browse/YARN-11697
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.2.1
>            Reporter: Syed Shameerur Rahman
>            Assignee: Syed Shameerur Rahman
>            Priority: Major


