[ https://issues.apache.org/jira/browse/YARN-325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547677#comment-13547677 ]
Arun C Murthy commented on YARN-325:
------------------------------------
Ok, the fix is to bubble the 'non-requirement' of the reservation all the way
up to the CapacityScheduler.nodeUpdate call and then call
LeafQueue.completedContainer outside the context of LeafQueue.assignContainers,
i.e. do not call LeafQueue.completedContainer while holding the lock on the
LeafQueue.

LeafQueue.completedContainer, on its own, has the right synchronization, i.e.
it doesn't call ParentQueue.completedContainer while holding a lock on the
LeafQueue.
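
To make the shape of that fix concrete, here is a minimal sketch of the "report it upward, release it later" pattern. It is not the actual patch: SketchParentQueue, SketchLeafQueue, DeferredCompletionSketch and the heldLeafLockWhileCallingParent flag are hypothetical stand-ins, and the real queue classes carry far more state. The sketch only assumes what the comment says: assignContainers runs under the leaf lock, and completedContainer must be invoked after that lock has been released.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the parent queue; NOT the real YARN class.
class SketchParentQueue {
    final List<String> completed = new ArrayList<>();

    synchronized void completedContainer(String containerId) {
        completed.add(containerId);
    }
}

// Hypothetical stand-in for the leaf queue; NOT the real YARN class.
class SketchLeafQueue {
    private final SketchParentQueue parent;
    private String reservedContainer; // null when nothing is reserved
    boolean heldLeafLockWhileCallingParent; // recorded for illustration only

    SketchLeafQueue(SketchParentQueue parent) {
        this.parent = parent;
    }

    synchronized void reserve(String containerId) {
        reservedContainer = containerId;
    }

    // Runs under the leaf lock. It does not release the reservation itself;
    // it only returns the reservation that is no longer required.
    synchronized String assignContainers() {
        String unneeded = reservedContainer;
        reservedContainer = null;
        return unneeded;
    }

    // Leaf-local bookkeeping happens under the leaf lock, but the lock is
    // dropped before calling into the parent.
    void completedContainer(String containerId) {
        synchronized (this) {
            // leaf-local bookkeeping would go here
        }
        heldLeafLockWhileCallingParent = Thread.holdsLock(this);
        parent.completedContainer(containerId);
    }
}

class DeferredCompletionSketch {
    // Plays the role of the scheduler's node update: the leaf lock is taken
    // and released inside assignContainers, and only then is the unneeded
    // reservation completed.
    static String nodeUpdate(SketchLeafQueue leaf) {
        String unneeded = leaf.assignContainers();
        if (unneeded != null) {
            leaf.completedContainer(unneeded);
        }
        return unneeded;
    }

    public static void main(String[] args) {
        SketchParentQueue parent = new SketchParentQueue();
        SketchLeafQueue leaf = new SketchLeafQueue(parent);
        leaf.reserve("container_1");
        nodeUpdate(leaf);
        System.out.println("leaf lock held during parent call: "
                + leaf.heldLeafLockWhileCallingParent);
    }
}
```

The point of returning the reservation instead of releasing it in place is that the caller decides when the parent is touched, so no thread ever holds the leaf lock while waiting for the parent lock.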
> RM CapacityScheduler can deadlock when getQueueInfo() is called and a
> container is completing
> ---------------------------------------------------------------------------------------------
>
> Key: YARN-325
> URL: https://issues.apache.org/jira/browse/YARN-325
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 2.0.2-alpha, 0.23.5
> Reporter: Jason Lowe
> Assignee: Arun C Murthy
> Priority: Blocker
>
> If a client calls getQueueInfo on a parent queue (e.g. the root queue) and
> containers are completing, then the RM can deadlock. getQueueInfo() locks the
> ParentQueue and then calls the child queues' getQueueInfo() methods in turn.
> However, when a container completes, it locks the LeafQueue and then calls
> back into the ParentQueue. When the two mix, it's a recipe for deadlock.
> Stacktrace to follow.
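
The lock-order inversion described in the report can be illustrated with the sketch below. It is a toy model, not scheduler code: the two ReentrantLock objects merely stand in for the ParentQueue and LeafQueue monitors, and LockOrderInversionSketch with its methods is a hypothetical name. To avoid actually hanging, each thread makes a single non-blocking tryLock() attempt on its second lock instead of blocking; if both attempts fail, blocking acquisition in the same order would have deadlocked.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

class LockOrderInversionSketch {

    // Takes `first`, waits until the other thread also holds its first lock,
    // then makes one non-blocking attempt at `second`.
    static boolean holdThenTry(ReentrantLock first, ReentrantLock second,
                               CountDownLatch bothHoldFirst,
                               CountDownLatch bothTried) {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            boolean gotSecond = second.tryLock();
            if (gotSecond) {
                second.unlock();
            }
            bothTried.countDown();
            bothTried.await(); // keep `first` until the other thread has tried
            return gotSecond;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            first.unlock();
        }
    }

    // Returns true when neither thread could obtain its second lock.
    static boolean inversionBlocksBothThreads() {
        ReentrantLock parentLock = new ReentrantLock(); // models ParentQueue
        ReentrantLock leafLock = new ReentrantLock();   // models LeafQueue
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] gotSecond = new boolean[2];

        // Models getQueueInfo(): parent first, then child.
        Thread queueInfo = new Thread(() -> gotSecond[0] =
                holdThenTry(parentLock, leafLock, bothHoldFirst, bothTried));
        // Models container completion: leaf first, then parent.
        Thread completion = new Thread(() -> gotSecond[1] =
                holdThenTry(leafLock, parentLock, bothHoldFirst, bothTried));

        queueInfo.start();
        completion.start();
        try {
            queueInfo.join();
            completion.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !gotSecond[0] && !gotSecond[1];
    }

    public static void main(String[] args) {
        System.out.println("both threads blocked: "
                + inversionBlocksBothThreads());
    }
}
```

The latches force the unlucky interleaving every time; in a live RM the same interleaving would only occur occasionally, which is what makes this kind of bug hard to reproduce.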