[
https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15781644#comment-15781644
]
Naganarasimha G R commented on YARN-6029:
-----------------------------------------
Thanks [~djp] & [~wangda] for correcting me; I had not realized earlier that the
write lock needs to wait until all read locks are released.
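To illustrate the read/write lock behaviour mentioned above, here is a minimal sketch. It assumes the locks in question behave like {{java.util.concurrent.locks.ReentrantReadWriteLock}}; the class and method names ({{WriteLockWaitDemo}}, {{tryWrite}}, {{run}}) are hypothetical and not part of the scheduler code. It uses a timed {{tryLock}} so that the demo cannot hang.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; not part of Hadoop.
class WriteLockWaitDemo {

  // Timed attempt to take the write lock, so the demo never blocks forever.
  static boolean tryWrite(ReentrantReadWriteLock lock) throws InterruptedException {
    boolean acquired = lock.writeLock().tryLock(200, TimeUnit.MILLISECONDS);
    if (acquired) {
      lock.writeLock().unlock();
    }
    return acquired;
  }

  // Returns { acquired while a reader holds the lock, acquired after the reader released it }.
  static boolean[] run() throws InterruptedException {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    final CountDownLatch readerHolds = new CountDownLatch(1);
    final CountDownLatch releaseReader = new CountDownLatch(1);

    Thread reader = new Thread(() -> {
      lock.readLock().lock();
      try {
        readerHolds.countDown();   // tell the main thread the read lock is held
        releaseReader.await();     // keep holding it until told to release
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        lock.readLock().unlock();
      }
    });
    reader.start();
    readerHolds.await();

    // Another thread still holds a read lock, so the writer has to wait.
    boolean whileReadHeld = tryWrite(lock);

    releaseReader.countDown();
    reader.join();

    // No readers remain, so the writer can proceed.
    boolean afterRelease = tryWrite(lock);
    return new boolean[] { whileReadHeld, afterRelease };
  }

  public static void main(String[] args) throws InterruptedException {
    boolean[] r = run();
    System.out.println("write lock while read lock held: " + r[0]);
    System.out.println("write lock after read lock released: " + r[1]);
  }
}
```

The first attempt times out and the second succeeds, which is the waiting behaviour described above.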
[~wangda], I agree that your solution solves the problem, but the current flow is
{{CapacityScheduler.allocateContainersToNode -> LeafQueue.assignContainers
(holds the lock on the leaf) -> LeafQueue.handleExcessReservedContainer ->
LeafQueue.completedContainer -> ParentQueue.completedContainer (tries to get
the parent lock here)}}
I agree that we need to fix this flow, but a simpler temporary correction in
*ParentQueue* (assuming that 2.9/trunk avoids the issue) could be:
{code}
@Override
public List<QueueUserACLInfo> getQueueUserAclInfo(
    UserGroupInformation user) {
  List<QueueUserACLInfo> userAcls = new ArrayList<QueueUserACLInfo>();
  // Hold the parent lock only while reading the parent's own ACLs
  synchronized (this) {
    // Add parent queue acls
    userAcls.add(getUserAclInfo(user));
  }
  // Add children queue acls; the parent lock is no longer held here,
  // so this thread never waits on a child lock while holding the parent lock
  for (CSQueue child : childQueues) {
    userAcls.addAll(child.getQueueUserAclInfo(user));
  }
  return userAcls;
}
{code}
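To make the effect of narrowing the lock concrete, here is a toy model of the lock ordering. It is only a sketch: {{ToyParent}}, {{ToyLeaf}} and {{NarrowedLockSketch}} are hypothetical stand-ins, not the real ParentQueue/LeafQueue, and the ACL values are plain strings. The main thread plays the scheduler thread (holds the leaf lock, then needs the parent lock), and a second thread plays the ACL reader.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model; not the real Hadoop classes.
class NarrowedLockSketch {

  static class ToyLeaf {
    synchronized String getAclInfo() {
      return "leaf-acl";
    }
  }

  static class ToyParent {
    private final List<ToyLeaf> children = new ArrayList<>();
    private int completed = 0;

    ToyParent(ToyLeaf leaf) {
      children.add(leaf);
    }

    // Mirrors the proposed change: the parent lock covers only the parent's own ACLs.
    List<String> getAclInfo() {
      List<String> acls = new ArrayList<>();
      synchronized (this) {
        acls.add("parent-acl");
      }
      for (ToyLeaf child : children) {
        acls.add(child.getAclInfo()); // may block on the leaf lock, but holds no parent lock
      }
      return acls;
    }

    synchronized void completedContainer() {
      completed++;
    }

    synchronized int completedCount() {
      return completed;
    }
  }

  static List<String> run() throws InterruptedException {
    final ToyLeaf leaf = new ToyLeaf();
    final ToyParent parent = new ToyParent(leaf);
    final List<String> result = new ArrayList<>();

    Thread aclReader = new Thread(() -> result.addAll(parent.getAclInfo()));

    synchronized (leaf) {            // "scheduler" thread holds the leaf lock
      aclReader.start();
      // Wait (bounded) until the reader is blocked on the leaf lock.
      long deadline = System.currentTimeMillis() + 2000;
      while (aclReader.getState() != Thread.State.BLOCKED
          && System.currentTimeMillis() < deadline) {
        Thread.sleep(1);
      }
      // The reader does not hold the parent lock, so this call cannot deadlock.
      parent.completedContainer();
    }

    aclReader.join();                // the reader finishes once the leaf lock is free
    result.add("completed=" + parent.completedCount());
    return result;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(run()); // prints [parent-acl, leaf-acl, completed=1]
  }
}
```

One thing the toy model does not cover: iterating over the children outside the parent lock may need care if the set of child queues can change concurrently (for example during queue reinitialization).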
Thoughts?
> CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by
> Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to
> release a reserved container
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-6029
> URL: https://issues.apache.org/jira/browse/YARN-6029
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 2.8.0
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Blocker
> Attachments: YARN-6029.001.patch, deadlock.jstack
>
>
> When ParentQueue#getQueueUserAclInfo is called (e.g. a client calls
> YarnClient#getQueueAclsInfo) just at the moment that
> LeafQueue#assignContainers is called and before notifying parent queue to
> release resource (should release a reserved container), then ResourceManager
> can deadlock. I found this problem in our test environment for Hadoop 2.8.
> Steps to reproduce the deadlock, in chronological order:
> * 1. Thread A (ResourceManager Event Processor) calls synchronized
> LeafQueue#assignContainers (got LeafQueue instance lock of queue root.a)
> * 2. Thread B (IPC Server handler) calls synchronized
> ParentQueue#getQueueUserAclInfo (got ParentQueue instance lock of queue
> root), iterates over children queue acls and is blocked when calling
> synchronized LeafQueue#getQueueUserAclInfo (the LeafQueue instance lock of
> queue root.a is held by Thread A)
> * 3. Thread A wants to inform the parent queue that a container is being
> completed and is blocked when invoking synchronized
> ParentQueue#internalReleaseResource method (the ParentQueue instance lock of
> queue root is held by Thread B)
> I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be
> removed to solve this problem, since this method appears to not affect fields
> of LeafQueue instance.
> Attached a patch with a unit test for review.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)