[ 
https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148302#comment-14148302
 ] 

Wei Yan commented on YARN-2608:
-------------------------------

For the first deadlock, the clock is only changed by test cases, so we can 
simply remove the synchronized modifier from getClock() and make the clock 
field volatile. For the second deadlock, we can also remove synchronized from 
the reinitialize and initScheduler functions; the reinitialize function would 
then acquire *AllocationFileLoaderService's lock* first, and then 
*FairScheduler's lock*, which is the same order the reload thread uses.

> FairScheduler may hang due to two potential deadlocks
> -----------------------------------------------------
>
>                 Key: YARN-2608
>                 URL: https://issues.apache.org/jira/browse/YARN-2608
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Wei Yan
>            Assignee: Wei Yan
>         Attachments: YARN-2608-1.patch
>
>
> Two potential deadlocks exist inside the FairScheduler.
> 1. AllocationFileLoaderService reloads the queue configuration, which calls 
> the FairScheduler.AllocationReloadListener.onReload() function. This function 
> first acquires *FairScheduler's lock*:
> {code}
>   public void onReload(AllocationConfiguration queueInfo) {
>       synchronized (FairScheduler.this) {
>           ....
>       }
>   }
> {code}
> After that, it acquires the *QueueManager's queues lock*:
> {code}
>   private FSQueue getQueue(String name, boolean create, FSQueueType queueType) {
>       name = ensureRootPrefix(name);
>       synchronized (queues) {
>           ....
>       }
>   }
> {code}
> Another thread, running FairScheduler.assignToQueue, may also need to create 
> a new queue when a new job is submitted. This thread acquires the 
> *QueueManager's queues lock* first, and then tries to acquire *FairScheduler's 
> lock*, because it calls the FairScheduler.getClock() function when creating a 
> new FSLeafQueue. The two threads take the same locks in opposite order, so a 
> deadlock may occur.
> 2. The AllocationFileLoaderService thread acquires 
> *AllocationFileLoaderService's lock* first, and then waits for *FairScheduler's 
> lock*. Another thread (such as AdminService.refreshQueues) may call 
> FairScheduler's reinitialize function, which acquires *FairScheduler's lock* 
> first, and then waits for *AllocationFileLoaderService's lock*. A deadlock may 
> occur here.
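
The lock-order inversion in item 2 goes away if both code paths take the two 
locks in the same order. The sketch below is a simplified illustration under 
that assumption; the class, the two lock objects, and the method bodies are 
hypothetical and do not reproduce the actual FairScheduler or 
AllocationFileLoaderService code.

```java
public class LockOrderSketch {
    // Hypothetical stand-ins for the two monitors named in the report.
    private final Object loaderLock = new Object();
    private final Object schedulerLock = new Object();
    private int reloads; // guarded by schedulerLock

    // Path taken by the periodic reload thread: loader lock, then scheduler lock.
    void reloadFromLoader() {
        synchronized (loaderLock) {
            synchronized (schedulerLock) {
                reloads++;
            }
        }
    }

    // Path taken by an admin refresh. The method is deliberately not
    // synchronized on the scheduler, so it can take the loader lock first
    // and then the scheduler lock, matching reloadFromLoader().
    void reinitialize() {
        synchronized (loaderLock) {
            synchronized (schedulerLock) {
                reloads++;
            }
        }
    }

    int getReloads() {
        synchronized (schedulerLock) {
            return reloads;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LockOrderSketch sketch = new LockOrderSketch();
        Thread reloader = new Thread(() -> {
            for (int i = 0; i < 10000; i++) sketch.reloadFromLoader();
        });
        Thread admin = new Thread(() -> {
            for (int i = 0; i < 10000; i++) sketch.reinitialize();
        });
        reloader.start();
        admin.start();
        reloader.join();
        admin.join();
        System.out.println(sketch.getReloads()); // prints 20000
    }
}
```

With a single global acquisition order, neither thread can hold one lock while 
waiting for a lock the other thread already holds, so the cycle needed for a 
deadlock cannot form.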



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
