[ 
https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15782324#comment-15782324
 ] 

Naganarasimha G R commented on YARN-6029:
-----------------------------------------

Thanks [~wangda] & [~Tao Yang]
bq.  I think there maybe have a problem when iterating childQueues and at the 
same time ParentQueue#setChildQueues is called
Yes, you are right. This happens during CS initialize or reinitialize; if 
{{getQueueUserAclInfo}} is called during this time, some anomalies can happen, 
as getQueueUserAclInfo does not hold the lock on CS. 

bq. But it could cause inconsistency read data, for example, queue acl could be 
updated while it being updated. So I will not in favor of this solution.
Agree, but IIUC, based on the 2.8 code, it is less dependent on locking of the 
child queue, as the ACLs of all the queues are updated in one shot during 
reinitialization. So, to ensure ACLs are returned appropriately, I presume we 
should be holding the CS lock in CS.getQueueUserAclInfo, which is not happening 
currently in 2.8. 
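
To illustrate the point about holding a scheduler-level lock while ACLs are read, here is a minimal sketch. It is not the CapacityScheduler code: the class {{SchedulerAclSketch}}, its string-based ACL map, and the use of a read/write lock (rather than a synchronized method) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a scheduler that swaps all queue ACLs in one shot.
final class SchedulerAclSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Queue path -> ACL string; replaced as a whole on reinitialize.
    private Map<String, String> queueAcls = new LinkedHashMap<>();

    // Writer: models reinitialization replacing the ACLs of all queues at once.
    void reinitialize(Map<String, String> newAcls) {
        lock.writeLock().lock();
        try {
            queueAcls = new LinkedHashMap<>(newAcls);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Reader: holds the scheduler-level read lock for the whole iteration,
    // so it sees either the old ACL set or the new one, never a mix.
    List<String> getQueueUserAclInfo() {
        lock.readLock().lock();
        try {
            List<String> result = new ArrayList<>();
            for (Map.Entry<String, String> e : queueAcls.entrySet()) {
                result.add(e.getKey() + "=" + e.getValue());
            }
            return result;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Because readers only share the read lock, concurrent ACL queries do not block each other; they wait only while a reinitialization holds the write lock.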

bq. I still prefer to fix the issue in scheduling logic, there're some other 
similar logics like GetQueueInfo, etc. 
Hmm, so are you suggesting to apply the 2.9/trunk patch, or to reorganize the 
flow in CS with synchronized blocks itself? 


> CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by 
> Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to 
> release a reserved container
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-6029
>                 URL: https://issues.apache.org/jira/browse/YARN-6029
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.8.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Blocker
>         Attachments: YARN-6029.001.patch, deadlock.jstack
>
>
> When ParentQueue#getQueueUserAclInfo is called (e.g. a client calls 
> YarnClient#getQueueAclsInfo) just at the moment that 
> LeafQueue#assignContainers is called and before notifying parent queue to 
> release resource (should release a reserved container), then ResourceManager 
> can deadlock. I found this problem on our testing environment for hadoop2.8.
> Steps to reproduce the deadlock, in chronological order:
> * 1. Thread A (ResourceManager Event Processor) calls synchronized 
> LeafQueue#assignContainers (got LeafQueue instance lock of queue root.a)
> * 2. Thread B (IPC Server handler) calls synchronized 
> ParentQueue#getQueueUserAclInfo (got ParentQueue instance lock of queue 
> root), iterates over the child queues' ACLs and is blocked when calling 
> synchronized LeafQueue#getQueueUserAclInfo (the LeafQueue instance lock of 
> queue root.a is held by Thread A)
> * 3. Thread A wants to inform the parent queue that a container is being 
> completed and is blocked when invoking the synchronized 
> ParentQueue#internalReleaseResource method (the ParentQueue instance lock of 
> queue root is held by Thread B)
> I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be 
> removed to solve this problem, since this method does not appear to modify 
> fields of the LeafQueue instance.
> Attached a patch with a UT for review.
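
The lock-order inversion in steps 1 to 3 above can be sketched in isolation. This is not the YARN code: {{QueueLockOrderSketch}}, its two {{ReentrantLock}} fields (standing in for the instance monitors of root and root.a), and the 500 ms timeout are assumptions for illustration. Timed {{tryLock}} calls are used instead of {{synchronized}} so the demonstration terminates rather than actually deadlocking.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the two queue monitors involved in the deadlock.
final class QueueLockOrderSketch {
    private static final long TIMEOUT_MS = 500; // arbitrary, for the demo only
    private final ReentrantLock rootLock = new ReentrantLock(); // ParentQueue root
    private final ReentrantLock leafLock = new ReentrantLock(); // LeafQueue root.a

    // Returns true if Thread A, while holding the leaf lock, could also take
    // the root lock (i.e. could notify the parent to release the resource).
    boolean schedulerCanReleaseResource(boolean aclReadLocksLeaf) {
        CountDownLatch firstLocksHeld = new CountDownLatch(2);
        CountDownLatch attemptsDone = new CountDownLatch(2);
        AtomicBoolean released = new AtomicBoolean(false);

        Thread a = new Thread(() -> {
            leafLock.lock(); // step 1: assignContainers on root.a
            try {
                firstLocksHeld.countDown();
                firstLocksHeld.await();
                // step 3: needs the parent lock to release the resource
                if (rootLock.tryLock(TIMEOUT_MS, TimeUnit.MILLISECONDS)) {
                    released.set(true);
                    rootLock.unlock();
                }
                attemptsDone.countDown();
                attemptsDone.await(); // keep leaf held until B has tried too
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                leafLock.unlock();
            }
        });

        Thread b = new Thread(() -> {
            rootLock.lock(); // step 2: getQueueUserAclInfo on root
            try {
                firstLocksHeld.countDown();
                firstLocksHeld.await();
                if (aclReadLocksLeaf) {
                    // Original behavior: child ACL read takes the leaf lock.
                    if (leafLock.tryLock(TIMEOUT_MS, TimeUnit.MILLISECONDS)) {
                        leafLock.unlock();
                    }
                    attemptsDone.countDown();
                    attemptsDone.await(); // keep root held until A has tried
                } else {
                    // Proposed change: child ACL read takes no leaf lock.
                    attemptsDone.countDown();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                rootLock.unlock();
            }
        });

        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return released.get();
    }
}
```

With {{aclReadLocksLeaf}} set to true, each thread holds one lock while waiting for the other, so neither timed attempt succeeds; with plain {{synchronized}} methods the same interleaving would block forever. With it set to false, Thread B never requests the leaf lock, releases root promptly, and Thread A proceeds.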



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

