[
https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Karthik Kambatla updated YARN-2608:
-----------------------------------
Summary: FairScheduler: Potential deadlocks in loading alloc files and
clock access (was: FairScheduler: Potential deadlocks in loading alloc files
and clock)
> FairScheduler: Potential deadlocks in loading alloc files and clock access
> --------------------------------------------------------------------------
>
> Key: YARN-2608
> URL: https://issues.apache.org/jira/browse/YARN-2608
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Wei Yan
> Assignee: Wei Yan
> Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch
>
>
> Two potential deadlocks exist inside the FairScheduler.
> 1. AllocationFileLoaderService reloads the queue configuration, which calls
> FairScheduler.AllocationReloadListener.onReload(). That method first acquires
> *FairScheduler's lock*:
> {code}
> public void onReload(AllocationConfiguration queueInfo) {
>   synchronized (FairScheduler.this) {
>     ....
>   }
> }
> {code}
> It then acquires the *QueueManager's queues lock*:
> {code}
> private FSQueue getQueue(String name, boolean create,
>     FSQueueType queueType) {
>   name = ensureRootPrefix(name);
>   synchronized (queues) {
>     ....
>   }
> }
> {code}
> Another thread, running FairScheduler.assignToQueue, may also need to create a
> new queue when a new job is submitted. That thread acquires the
> *QueueManager's queues lock* first and then needs the *FairScheduler's lock*,
> because creating a new FSLeafQueue calls FairScheduler.getClock(). The two
> threads take the same two locks in opposite order, so a deadlock may happen here.
> 2. The AllocationFileLoaderService holds *AllocationFileLoaderService's
> lock* first, and then waits for *FairScheduler's lock*. Another thread (like
> AdminService.refreshQueues) may call FairScheduler's reinitialize function,
> which holds *FairScheduler's lock* first, and then waits for
> *AllocationFileLoaderService's lock*. Deadlock may happen here.
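
For illustration only, here is a minimal sketch of one way the first lock cycle could be broken: if the clock can be read without the scheduler's monitor, the queue-creating thread never needs a second lock. `LockOrderSketch`, `Scheduler`, and `runBothPaths` are hypothetical stand-ins, not the Hadoop classes, and the description above does not say whether the attached patches take this approach.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for the FairScheduler/QueueManager pair; not Hadoop code.
class LockOrderSketch {

    static class Scheduler {
        // volatile: readers see the latest value without taking the scheduler monitor
        private volatile long clock = 0L;
        private final Map<String, Long> queues = new HashMap<>();

        // Assumption: deliberately NOT synchronized, so it needs no scheduler lock.
        long getClock() {
            return clock;
        }

        // Reload path: scheduler lock first, then the queues lock.
        void onReload(String name) {
            synchronized (this) {
                clock++; // only written while holding the scheduler lock
                getQueue(name);
            }
        }

        // Submit path: queues lock only; returns the clock value at creation time.
        long getQueue(String name) {
            synchronized (queues) {
                return queues.computeIfAbsent(name, n -> getClock());
            }
        }
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs both paths concurrently; true if both threads finish within the timeout.
    static boolean runBothPaths(int iterations) {
        Scheduler s = new Scheduler();
        CountDownLatch start = new CountDownLatch(1);
        Thread reload = new Thread(() -> {
            await(start);
            for (int i = 0; i < iterations; i++) s.onReload("root.r" + i);
        });
        Thread submit = new Thread(() -> {
            await(start);
            for (int i = 0; i < iterations; i++) s.getQueue("root.s" + i);
        });
        reload.setDaemon(true); // a stuck thread must not keep the JVM alive
        submit.setDaemon(true);
        reload.start();
        submit.start();
        start.countDown();
        try {
            reload.join(5000);
            submit.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !reload.isAlive() && !submit.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("both paths finished: " + runBothPaths(10000));
    }
}
```

In this sketch the only nested acquisition is scheduler lock then queues lock, so no thread ever waits for the scheduler lock while holding the queues lock. The second deadlock involves a different pair of locks and would need its own consistent ordering.
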
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)