[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

Syed Shameerur Rahman (Jira) Tue, 21 May 2024 00:58:03 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848085#comment-17848085
 ]


Syed Shameerur Rahman commented on YARN-11697:
----------------------------------------------

[~wilfreds] 

I had some custom code and backports from a higher version, so the line 
numbers may differ from the OSS Hadoop code base. I did see the following 
exception, though:
java.lang.IllegalStateException: Given app to remove 
appattempt_1706879498319_86660_000001 Alloc: <memory:0, vCores:0> does not 
exist in queue [root, demand=<memory:10826752, vCores:2101>, 
running=<memory:99328, vCores:17>, share=<memory:6201984, vCores:0>, w=1.0]
 

So this exception occurs only when the appAttempt has already been removed 
from the queue and we try to remove it again. Throwing IllegalStateException 
causes the RM to shut down. Can you think of any scenario in which this can 
happen?
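
To make the double removal described above concrete, here is a minimal sketch. SketchLeafQueue and both of its removal methods are hypothetical stand-ins written for illustration; they are not the real FSLeafQueue code. The strict variant fails on a second removal of the same attempt, while the tolerant variant only reports that nothing was removed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a leaf queue; this is NOT the real FSLeafQueue.
class SketchLeafQueue {
    private final List<String> apps = new ArrayList<>();

    void addApp(String attemptId) {
        apps.add(attemptId);
    }

    // Strict removal: throws when the attempt is not in the queue, which is
    // the kind of failure that surfaces as the IllegalStateException above.
    void removeAppStrict(String attemptId) {
        if (!apps.remove(attemptId)) {
            throw new IllegalStateException(
                "Given app to remove " + attemptId + " does not exist in queue");
        }
    }

    // Tolerant removal (an assumption, not the project's chosen fix):
    // returns false when there was nothing to remove, so the caller can log
    // the condition and continue.
    boolean removeAppTolerant(String attemptId) {
        return apps.remove(attemptId);
    }
}
```

With the strict variant, the second of two removals for the same attempt throws; with the tolerant variant, the caller decides how to react.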

> Fix fair scheduler race condition in removeApplicationAttempt and 
> moveApplication
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-11697
>                 URL: https://issues.apache.org/jira/browse/YARN-11697
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.2.1
>            Reporter: Syed Shameerur Rahman
>            Assignee: Syed Shameerur Rahman
>            Priority: Major
>
> For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with 
> the following exception:
> {code:java}
> 2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher 
> (SchedulerEventDispatcher:Event Processor): Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.IllegalStateException: Given app to remove 
> appattempt_1706879498319_86660_000001 Alloc: <memory:0, vCores:0> does not 
> exist in queue [root, demand=<memory:10826752, vCores:2101>, 
> running=<memory:99328, vCores:17>, share=<memory:6201984, vCores:0>, w=1.0]
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
>         at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>         at java.lang.Thread.run(Thread.java:750)
> {code}
> The exception seems similar to the one mentioned in YARN-5136, but it looks 
> like there are still some edge cases not covered by YARN-5136.
> 1. On a deeper look, I could see that, as mentioned in the comment here, if 
> calls to moveApplication and removeApplicationAttempt for the same attempt 
> are processed in quick succession, the application attempt will still hold 
> a queue reference but will already have been removed from the queue's list 
> of applications.
> 2. This can happen when 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
>  removes the appAttempt from the queue and 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
>  also tries to remove the same appAttempt from the queue.
> 3. On further checking, I could see that before 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
>  runs, the writeLock on the appAttempt is taken, whereas for 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
>  I don't see any writeLock being taken, which can result in a race condition 
> if the same appAttempt is being processed by both.
> 4. Additionally, as mentioned in the comment here, when such a scenario 
> occurs, ideally we should not take down the RM.
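
The locking concern in the points above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual FairScheduler code or the proposed patch: SketchQueue, SketchAttempt and SketchScheduler are hypothetical types, and the only idea shown is that both the move path and the remove path take the same per-attempt write lock, with removal returning false instead of throwing when the attempt is no longer in its queue.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified types; none of these are the real YARN classes.
class SketchQueue {
    final String name;
    final Set<String> apps = new HashSet<>();

    SketchQueue(String name) {
        this.name = name;
    }
}

class SketchAttempt {
    final String id;
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    SketchQueue queue; // guarded by lock

    SketchAttempt(String id, SketchQueue queue) {
        this.id = id;
        this.queue = queue;
        queue.apps.add(id);
    }
}

class SketchScheduler {
    // Move the attempt to another queue while holding its write lock, so the
    // queue reference and the queue membership change together.
    static void moveApplication(SketchAttempt attempt, SketchQueue target) {
        attempt.lock.writeLock().lock();
        try {
            attempt.queue.apps.remove(attempt.id);
            target.apps.add(attempt.id);
            attempt.queue = target;
        } finally {
            attempt.lock.writeLock().unlock();
        }
    }

    // Remove under the same write lock. Returns false (rather than throwing)
    // when the attempt is already gone, so the caller can log and continue.
    static boolean removeApplicationAttempt(SketchAttempt attempt) {
        attempt.lock.writeLock().lock();
        try {
            return attempt.queue.apps.remove(attempt.id);
        } finally {
            attempt.lock.writeLock().unlock();
        }
    }
}
```

Because both methods serialize on the attempt's lock, a removal never observes a half-finished move, and a repeated removal is reported to the caller instead of stopping the process.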



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
