[
https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15784099#comment-15784099
]
Wangda Tan commented on YARN-6029:
----------------------------------
bq. I'm not clear about this. Is it worth to ensure consistency of acls through
reducing the efficiency of scheduler?
It's going to be inefficient; previously getQueueInfo held the scheduler lock,
and that caused problems.
bq. We also noticed that it doesn't hold the lock of LeafQueue instance when
updating acls (CapacityScheduler#setQueueAcls) so that current logic doesn't
guarantee the consistency of acls.
Yeah you're correct...
I think we could directly get queue ACL info from CS by invoking
authorizer#checkPermissions, and we can have a separate lock to protect
permission get/set. cc: [~jianhe]
But this should be a separate patch, since we need to fix getQueueInfo as
well.
I think we can go ahead and fix the locks inside LQ#assignContainers. Thoughts?
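To make the "separate lock to protect permission get/set" idea concrete, here is a minimal sketch. It is not the CapacityScheduler or authorizer code: the class name QueueAclStore and its methods are hypothetical, and ACLs are reduced to a set of user names per queue. The only point is that ACL reads and writes take their own read/write lock and never touch a queue or scheduler instance lock.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical class, not part of YARN: ACL state guarded by its own
// read/write lock, independent of queue and scheduler instance locks.
class QueueAclStore {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, Set<String>> usersByQueue = new HashMap<>();

  // Writer path (e.g. when queue ACLs are set or refreshed).
  void setAcls(String queue, Set<String> users) {
    lock.writeLock().lock();
    try {
      // Defensive copy so later changes by the caller are not visible here.
      usersByQueue.put(queue, new HashSet<>(users));
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Reader path: many permission checks can run concurrently.
  boolean hasAccess(String queue, String user) {
    lock.readLock().lock();
    try {
      Set<String> users = usersByQueue.get(queue);
      return users != null && users.contains(user);
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

With something along these lines, a permission check always sees either the old or the new ACL set for a queue, and it cannot take part in a lock cycle with the queue locks because it never acquires them.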
> CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by
> Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to
> release a reserved container
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-6029
> URL: https://issues.apache.org/jira/browse/YARN-6029
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 2.8.0
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Blocker
> Attachments: YARN-6029.001.patch, deadlock.jstack
>
>
> When ParentQueue#getQueueUserAclInfo is called (e.g. a client calls
> YarnClient#getQueueAclsInfo) just after LeafQueue#assignContainers has been
> entered but before it notifies the parent queue to release resources (here,
> to release a reserved container), the ResourceManager can deadlock. I found
> this problem in our Hadoop 2.8 test environment.
> Steps to reproduce the deadlock, in chronological order:
> * 1. Thread A (ResourceManager Event Processor) calls synchronized
> LeafQueue#assignContainers (acquires the LeafQueue instance lock of queue
> root.a)
> * 2. Thread B (IPC Server handler) calls synchronized
> ParentQueue#getQueueUserAclInfo (acquires the ParentQueue instance lock of
> queue root), iterates over the child queues' ACLs, and blocks when calling
> synchronized LeafQueue#getQueueUserAclInfo (the LeafQueue instance lock of
> queue root.a is held by Thread A)
> * 3. Thread A wants to inform the parent queue that a container is being
> completed and blocks when invoking the synchronized
> ParentQueue#internalReleaseResource method (the ParentQueue instance lock of
> queue root is held by Thread B)
> I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be
> removed to solve this problem, since this method does not appear to modify
> any fields of the LeafQueue instance.
> Attached a patch with a unit test for review.
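
The three steps quoted above, and the proposed fix, can be sketched with simplified stand-ins. This is not the YARN code: ParentSketch, LeafSketch and DeadlockSketch are hypothetical names, and only the locking pattern is modeled. The leaf's ACL getter is left unsynchronized, as the description proposes, so the only nested acquisition left is leaf then parent and no cycle can form.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for ParentQueue and LeafQueue; only the locking
// pattern is modeled, none of the real scheduler logic.
class ParentSketch {
  private final List<LeafSketch> children = new ArrayList<>();
  private int released = 0;

  synchronized void addChild(LeafSketch leaf) { children.add(leaf); }

  // Step 2: holds the parent lock while visiting every child.
  synchronized List<String> getQueueUserAclInfo(String user) {
    List<String> info = new ArrayList<>();
    for (LeafSketch leaf : children) {
      info.add(leaf.getQueueUserAclInfo(user));
    }
    return info;
  }

  // Step 3: called by a leaf that already holds its own lock.
  synchronized void releaseResource() { released++; }

  synchronized int releasedCount() { return released; }
}

class LeafSketch {
  private final ParentSketch parent;
  private final String name;

  LeafSketch(ParentSketch parent, String name) {
    this.parent = parent;
    this.name = name;
  }

  // Step 1: takes the leaf lock, then asks for the parent lock.
  synchronized void assignContainers() { parent.releaseResource(); }

  // Proposed change: not synchronized, so a thread holding the parent lock
  // never waits for the leaf lock here. It reads only final fields.
  String getQueueUserAclInfo(String user) { return name + ":" + user; }
}

class DeadlockSketch {
  // Runs both call paths concurrently; returns true if both threads finish.
  static boolean runConcurrently(int iterations) {
    ParentSketch root = new ParentSketch();
    LeafSketch a = new LeafSketch(root, "root.a");
    root.addChild(a);
    Thread threadA = new Thread(() -> {
      for (int i = 0; i < iterations; i++) a.assignContainers();
    });
    Thread threadB = new Thread(() -> {
      for (int i = 0; i < iterations; i++) root.getQueueUserAclInfo("user");
    });
    threadA.start();
    threadB.start();
    try {
      threadA.join(10_000);
      threadB.join(10_000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !threadA.isAlive() && !threadB.isAlive()
        && root.releasedCount() == iterations;
  }
}
```

Calling DeadlockSketch.runConcurrently(10000) should complete with both threads finished. Marking LeafSketch#getQueueUserAclInfo as synchronized again restores the parent-then-leaf acquisition from step 2, and the same run can then hang in the way the jstack attachment describes.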