[ 
https://issues.apache.org/jira/browse/YARN-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967099#comment-16967099
 ] 

Steven Rand commented on YARN-8990:
-----------------------------------

Hi all,

Unfortunately, this patch never made its way into the 3.2.1 release, which is 
affected by this race condition. I think what happened is that it was committed 
to trunk and backported to branch-3.2.0, but not to branch-3.2 (or 
branch-3.2.1).

And unless I'm misinterpreting the git history, the 3.2.1 release is also 
missing YARN-8992, despite the fix version of that ticket. 

We should at minimum make sure that the fixes for these race conditions are in 
3.2.2. Since this was a blocker and the impact is pretty serious, there may be 
more things we want to do, e.g., messaging or expediting the 3.2.2 release, but 
I'll leave that up to you to decide.

> Fix fair scheduler race condition in app submit and queue cleanup
> -----------------------------------------------------------------
>
>                 Key: YARN-8990
>                 URL: https://issues.apache.org/jira/browse/YARN-8990
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 3.2.0
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Blocker
>             Fix For: 3.2.0, 3.3.0
>
>         Attachments: YARN-8990.001.patch, YARN-8990.002.patch
>
>
> With the introduction of the dynamic queue deletion in YARN-8191 a race 
> condition was introduced that can cause a queue to be removed while an 
> application submit is in progress.
> The issue occurs in {{FairScheduler.addApplication()}} when an application is 
> submitted to a dynamic queue that is empty or does not exist yet. 
> If the {{AllocationFileLoaderService}} kicks off an update while the 
> application submit is being processed, the queue cleanup will be run first. 
> The application submit first creates the queue and gets a reference back to 
> the queue. 
> Other checks are performed and, as the last action before getting ready to 
> generate an AppAttempt, the queue is updated to show the submitted application 
> ID.
> The time between the queue creation and the queue update to show the 
> submitted application is long enough for the queue to be removed. The 
> application, however, is lost and will never get any resources assigned.
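
To make the window described above concrete, here is a minimal toy model of 
the interleaving. It is not the Hadoop code and not necessarily what the 
attached patches do: QueueRaceSketch, QueueManager, LeafQueue, and every 
method name below are hypothetical, and the cleanup is called inline at the 
point where a reload thread could run, so that the outcome is deterministic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of the race; all names are hypothetical, not Hadoop classes. */
public class QueueRaceSketch {

    /** A dynamic leaf queue that tracks the application IDs assigned to it. */
    static final class LeafQueue {
        final String name;
        final Set<String> assignedApps = ConcurrentHashMap.newKeySet();

        LeafQueue(String name) {
            this.name = name;
        }
    }

    static final class QueueManager {
        private final Map<String, LeafQueue> queues = new HashMap<>();

        /** Creates the queue if needed, but does not record any application. */
        synchronized LeafQueue getOrCreate(String name) {
            return queues.computeIfAbsent(name, LeafQueue::new);
        }

        /** Creates the queue and records the application under the same lock. */
        synchronized LeafQueue getOrCreateAndAssign(String name, String appId) {
            LeafQueue queue = queues.computeIfAbsent(name, LeafQueue::new);
            queue.assignedApps.add(appId);
            return queue;
        }

        /** Stand-in for the cleanup triggered by an allocation file reload. */
        synchronized void removeEmptyDynamicQueues() {
            queues.values().removeIf(q -> q.assignedApps.isEmpty());
        }

        synchronized boolean contains(String name) {
            return queues.containsKey(name);
        }
    }

    /** Create first, assign later: the cleanup can run in between. */
    static boolean racySubmit(QueueManager manager) {
        LeafQueue queue = manager.getOrCreate("root.dynamic");
        manager.removeEmptyDynamicQueues(); // simulated reload; queue looks empty
        queue.assignedApps.add("app_1");    // recorded on a queue already removed
        return manager.contains("root.dynamic");
    }

    /** Create and assign in one step: the cleanup never sees an empty queue. */
    static boolean atomicSubmit(QueueManager manager) {
        manager.getOrCreateAndAssign("root.dynamic", "app_1");
        manager.removeEmptyDynamicQueues();
        return manager.contains("root.dynamic");
    }

    public static void main(String[] args) {
        System.out.println("racy submit, queue still registered: "
            + racySubmit(new QueueManager()));
        System.out.println("atomic submit, queue still registered: "
            + atomicSubmit(new QueueManager()));
    }
}
```

In the racy variant the application ends up attached to a queue object the 
manager no longer knows about, which matches the "lost" application in the 
description. Recording the application under the same lock that creates the 
queue is one possible way to close the window in this toy model.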



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
