[ 
https://issues.apache.org/jira/browse/YARN-325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547409#comment-13547409
 ] 

Jason Lowe commented on YARN-325:
---------------------------------

Stacktrace of an occurrence:

{noformat}
"IPC Server handler 28 on xxxx":
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.getQueueInfo(LeafQueue.java:513)
        - waiting to lock <0x00002aaaee2e1600> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getQueueInfo(ParentQueue.java:314)
        - locked <0x00002aaaee2a7548> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getQueueInfo(CapacityScheduler.java:527)
        at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueInfo(ClientRMService.java:382)
        at org.apache.hadoop.yarn.api.impl.pb.service.ClientRMProtocolPBServiceImpl.getQueueInfo(ClientRMProtocolPBServiceImpl.java:181)
        at org.apache.hadoop.yarn.proto.ClientRMProtocol$ClientRMProtocolService$2.callBlockingMethod(ClientRMProtocol.java:188)
        at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Server.call(ProtoOverHadoopRpcEngine.java:353)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1530)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1526)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1212)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1524)
"ResourceManager Event Processor":
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.completedContainer(ParentQueue.java:685)
        - waiting to lock <0x00002aaaee2a7548> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1359)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignReservedContainer(LeafQueue.java:860)
        - locked <0x00002aaaee2e1600> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:763)
        - locked <0x00002aaaee2e1600> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:586)
        - locked <0x00002aaaee28b090> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:635)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:80)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:341)
        at java.lang.Thread.run(Thread.java:619)

Found 1 deadlock.
{noformat}
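
The two stacks above take the same pair of monitors in opposite order: the IPC handler holds the ParentQueue and waits for the LeafQueue, while the event processor holds the LeafQueue and waits for the ParentQueue. The following is a minimal sketch of that pattern, not the scheduler code itself. The class, lock, and method names are hypothetical stand-ins, and it uses ReentrantLock.tryLock with a timeout instead of synchronized so that the demonstration terminates rather than hanging.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderInversionSketch {
    // Hypothetical stand-ins for the ParentQueue and LeafQueue monitors.
    static final ReentrantLock parentQueueLock = new ReentrantLock();
    static final ReentrantLock leafQueueLock = new ReentrantLock();

    /** Returns how many of the two threads could not get their second lock. */
    static int countBlockedThreads() {
        CyclicBarrier barrier = new CyclicBarrier(2);
        AtomicInteger blocked = new AtomicInteger();

        // Mirrors getQueueInfo(): parent first, then leaf.
        Thread rpcHandler = new Thread(() ->
            acquireInOrder(parentQueueLock, leafQueueLock, barrier, blocked));
        // Mirrors the container-completion path: leaf first, then parent.
        Thread eventProcessor = new Thread(() ->
            acquireInOrder(leafQueueLock, parentQueueLock, barrier, blocked));

        rpcHandler.start();
        eventProcessor.start();
        try {
            rpcHandler.join();
            eventProcessor.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return blocked.get();
    }

    static void acquireInOrder(ReentrantLock first, ReentrantLock second,
                               CyclicBarrier barrier, AtomicInteger blocked) {
        first.lock();
        try {
            barrier.await(); // both threads now hold their first lock
            if (second.tryLock(100, TimeUnit.MILLISECONDS)) {
                second.unlock();
            } else {
                blocked.incrementAndGet(); // with synchronized, this thread would wait forever
            }
            barrier.await(); // keep the first lock until both attempts have finished
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("threads blocked on second lock: " + countBlockedThreads());
    }
}
```

Because each thread keeps its first lock until both have attempted the second, neither tryLock can succeed. With plain synchronized blocks, as in the trace, the same interleaving leaves both threads waiting indefinitely.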

                
> RM CapacityScheduler can deadlock when getQueueInfo() is called and a 
> container is completing
> ---------------------------------------------------------------------------------------------
>
>                 Key: YARN-325
>                 URL: https://issues.apache.org/jira/browse/YARN-325
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Priority: Critical
>
> If a client calls getQueueInfo() on a parent queue (e.g. the root queue) while 
> containers are completing, the RM can deadlock.  getQueueInfo() locks the 
> ParentQueue and then calls each child queue's getQueueInfo() method in turn.  
> However, when a container completes, that path locks the LeafQueue and then 
> calls back into the ParentQueue.  When the two paths run concurrently, each 
> thread can end up holding the lock the other needs.
> Stacktrace to follow.
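
One generic way to avoid this class of deadlock is to have every code path take the two locks in the same order. The sketch below illustrates that idea only; it is an assumption for illustration, not necessarily how this issue was or will be resolved, and all names in it are hypothetical rather than taken from the CapacityScheduler code.

```java
public class ConsistentLockOrderSketch {
    // Hypothetical stand-ins for the ParentQueue and LeafQueue instances.
    static final Object parentQueue = new Object();
    static final Object leafQueue = new Object();
    static int completedContainers = 0; // guarded by leafQueue

    // Read path: parent first, then leaf.
    static int getQueueInfo() {
        synchronized (parentQueue) {
            synchronized (leafQueue) {
                return completedContainers;
            }
        }
    }

    // Completion path: also parent first, then leaf, so no cycle can form.
    static void completedContainer() {
        synchronized (parentQueue) {
            synchronized (leafQueue) {
                completedContainers++;
            }
        }
    }

    /** Runs both paths concurrently and returns the final completion count. */
    static int runConcurrently(int iterations) {
        completedContainers = 0;
        Thread reader = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                getQueueInfo();
            }
        });
        Thread completer = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                completedContainer();
            }
        });
        reader.start();
        completer.start();
        try {
            reader.join();
            completer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return completedContainers;
    }

    public static void main(String[] args) {
        System.out.println("completed: " + runConcurrently(1000));
    }
}
```

The trade-off is that the completion path now holds the parent lock for longer, which may reduce concurrency. Other approaches, such as not holding the parent lock while calling into children, avoid the cycle in a different way.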

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
