Tao Yang created YARN-6029:
------------------------------
Summary: CapacityScheduler deadlock when
ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that
Thread_B calls LeafQueue#assignContainers to release a reserved container
Key: YARN-6029
URL: https://issues.apache.org/jira/browse/YARN-6029
Project: Hadoop YARN
Issue Type: Bug
Components: capacityscheduler
Affects Versions: 2.8.0
Reporter: Tao Yang
Assignee: Tao Yang
When ParentQueue#getQueueUserAclInfo is called (e.g. a client calls
YarnClient#getQueueAclsInfo) while LeafQueue#assignContainers is running, but
before the leaf queue has notified its parent queue to release resources (in
this case, to release a reserved container), the ResourceManager can deadlock.
I found this problem in our test environment for Hadoop 2.8.
Steps to reproduce the deadlock, in chronological order:
* 1. Thread A (ResourceManager Event Processor) calls synchronized
LeafQueue#assignContainers (acquires the LeafQueue instance lock of queue
root.a)
* 2. Thread B (IPC Server handler) calls synchronized
ParentQueue#getQueueUserAclInfo (acquires the ParentQueue instance lock of
queue root), iterates over the child queues' ACLs, and blocks when calling
synchronized LeafQueue#getQueueUserAclInfo (the LeafQueue instance lock of
queue root.a is held by Thread A)
* 3. Thread A wants to inform the parent queue that a container is being
completed and blocks when invoking the synchronized
ParentQueue#internalReleaseResource method (the ParentQueue instance lock of
queue root is held by Thread B)
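The interleaving above is a lock-order inversion: Thread A takes the leaf lock
and then wants the parent lock, while Thread B takes the parent lock and then
wants the leaf lock. The following is only a minimal sketch of that pattern,
not CapacityScheduler code. The class LockOrderSketch and its methods are
hypothetical, two ReentrantLock objects stand in for the instance monitors of
root and root.a, and tryLock with a timeout is used instead of synchronized so
that the demo terminates rather than hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the two queue monitors; not CapacityScheduler code.
class LockOrderSketch {

    // Holds 'first', waits until the other thread holds its own first lock,
    // then tries 'second' with a timeout. Returns true if 'second' was acquired.
    static boolean holdThenTry(ReentrantLock first, ReentrantLock second,
                               CountDownLatch bothHoldFirst, CountDownLatch bothTried)
            throws InterruptedException {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();   // both threads now hold their first lock
            boolean acquired = second.tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                second.unlock();
            }
            bothTried.countDown();
            bothTried.await();       // keep 'first' until both attempts are over
            return acquired;
        } finally {
            first.unlock();
        }
    }

    // Runs both threads; result[0] is Thread A's outcome, result[1] is Thread B's.
    static boolean[] run() throws InterruptedException {
        ReentrantLock parentLock = new ReentrantLock(); // stands in for queue root
        ReentrantLock leafLock = new ReentrantLock();   // stands in for queue root.a
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] acquired = new boolean[2];

        // Thread A: leaf first, then parent (the assignContainers-like path).
        Thread a = new Thread(() -> {
            try {
                acquired[0] = holdThenTry(leafLock, parentLock, bothHoldFirst, bothTried);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Thread B: parent first, then leaf (the getQueueUserAclInfo-like path).
        Thread b = new Thread(() -> {
            try {
                acquired[1] = holdThenTry(parentLock, leafLock, bothHoldFirst, bothTried);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        a.start();
        b.start();
        a.join();
        b.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] r = run();
        System.out.println("A got parent lock: " + r[0] + ", B got leaf lock: " + r[1]);
    }
}
```

Neither thread obtains its second lock in this sketch. With plain synchronized
methods there is no timeout, so the same interleaving blocks both threads
indefinitely.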
I think the synchronized modifier on LeafQueue#getQueueUserAclInfo can be
removed to solve this problem, since this method does not appear to modify any
fields of the LeafQueue instance.
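To illustrate the idea, here is a minimal sketch that assumes the leaf-side
query only reads state that never changes after construction. The classes
Leaf and Parent and the method queryWhileLeafLocked are hypothetical and much
simpler than the real LeafQueue and ParentQueue. Whether the real method is
safe without synchronization depends on the fields it actually reads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical, simplified queues; not the real LeafQueue/ParentQueue classes.
class UnsyncLeafSketch {

    static class Leaf {
        private final String name; // assumed immutable after construction

        Leaf(String name) {
            this.name = name;
        }

        // No 'synchronized': this only reads a final field.
        String getAclInfo() {
            return name + ":ACL";
        }
    }

    static class Parent {
        private final List<Leaf> children = new ArrayList<>();

        synchronized void add(Leaf leaf) {
            children.add(leaf);
        }

        // Still synchronized on the parent, but no longer needs any leaf monitor.
        synchronized List<String> getAclInfo() {
            List<String> out = new ArrayList<>();
            for (Leaf leaf : children) {
                out.add(leaf.getAclInfo());
            }
            return out;
        }
    }

    // Queries the parent while another thread holds the leaf's monitor.
    static List<String> queryWhileLeafLocked() throws InterruptedException {
        Leaf leaf = new Leaf("root.a");
        Parent root = new Parent();
        root.add(leaf);

        CountDownLatch leafHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (leaf) {        // plays the role of Thread A
                leafHeld.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        holder.setDaemon(true);
        holder.start();
        leafHeld.await();

        // Completes even though the leaf monitor is held by 'holder'.
        List<String> result = root.getAclInfo();
        release.countDown();
        holder.join();
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(queryWhileLeafLocked());
    }
}
```

Because the parent-side query no longer waits on the leaf monitor, the
parent-then-leaf acquisition order from step 2 disappears on this path.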
A patch with a unit test is attached for review.