[ https://issues.apache.org/jira/browse/YARN-325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547409#comment-13547409 ]
Jason Lowe commented on YARN-325:
---------------------------------
Stacktrace of an occurrence:
{noformat}
"IPC Server handler 28 on xxxx":
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.getQueueInfo(LeafQueue.java:513)
- waiting to lock <0x00002aaaee2e1600> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getQueueInfo(ParentQueue.java:314)
- locked <0x00002aaaee2a7548> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getQueueInfo(CapacityScheduler.java:527)
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueInfo(ClientRMService.java:382)
at org.apache.hadoop.yarn.api.impl.pb.service.ClientRMProtocolPBServiceImpl.getQueueInfo(ClientRMProtocolPBServiceImpl.java:181)
at org.apache.hadoop.yarn.proto.ClientRMProtocol$ClientRMProtocolService$2.callBlockingMethod(ClientRMProtocol.java:188)
at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Server.call(ProtoOverHadoopRpcEngine.java:353)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1530)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1526)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1212)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1524)
"ResourceManager Event Processor":
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.completedContainer(ParentQueue.java:685)
- waiting to lock <0x00002aaaee2a7548> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1359)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignReservedContainer(LeafQueue.java:860)
- locked <0x00002aaaee2e1600> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:763)
- locked <0x00002aaaee2e1600> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:586)
- locked <0x00002aaaee28b090> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:635)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:80)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:341)
at java.lang.Thread.run(Thread.java:619)
Found 1 deadlock.
{noformat}
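For anyone who wants to see the pattern outside the RM: the trace shows the IPC handler holding the ParentQueue monitor while waiting for the LeafQueue monitor, and the event processor holding the LeafQueue monitor while waiting for the ParentQueue monitor. Below is a minimal standalone sketch of that lock-order inversion. It is not Hadoop code. The class name LockOrderInversionDemo and its methods are made up for illustration, and ReentrantLock.tryLock with a timeout stands in for the synchronized methods so that the demo reports the inversion instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo class; the two locks stand in for the ParentQueue and
// LeafQueue monitors seen in the trace.
class LockOrderInversionDemo {

    /** Holds 'first', then tries 'second' with a timeout; reports whether 'second' was acquired. */
    static boolean holdThenTry(ReentrantLock first, ReentrantLock second,
                               CountDownLatch bothHoldFirst, CountDownLatch bothTried) {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();   // both threads now hold their first lock
            boolean acquired = second.tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                second.unlock();
            }
            bothTried.countDown();
            bothTried.await();       // keep 'first' held until both attempts are over
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            first.unlock();
        }
    }

    /** True if neither thread could get its second lock, i.e. plain monitors would deadlock. */
    static boolean wouldDeadlock() {
        ReentrantLock parentLock = new ReentrantLock();
        ReentrantLock leafLock = new ReentrantLock();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] acquired = new boolean[2];

        // Same order as the IPC handler: parent first, then leaf.
        Thread rpc = new Thread(() ->
            acquired[0] = holdThenTry(parentLock, leafLock, bothHoldFirst, bothTried));
        // Same order as the event processor: leaf first, then parent.
        Thread events = new Thread(() ->
            acquired[1] = holdThenTry(leafLock, parentLock, bothHoldFirst, bothTried));
        rpc.start();
        events.start();
        try {
            rpc.join();
            events.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !acquired[0] && !acquired[1];
    }

    public static void main(String[] args) {
        System.out.println("lock-order inversion detected: " + wouldDeadlock());
    }
}
```

The latches force the unlucky interleaving every time, so both tryLock calls time out. With synchronized methods, as in the scheduler, the same interleaving blocks both threads forever.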
> RM CapacityScheduler can deadlock when getQueueInfo() is called and a container is completing
> ---------------------------------------------------------------------------------------------
>
> Key: YARN-325
> URL: https://issues.apache.org/jira/browse/YARN-325
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 2.0.2-alpha, 0.23.5
> Reporter: Jason Lowe
> Priority: Critical
>
> If a client calls getQueueInfo() on a parent queue (e.g., the root queue) while
> containers are completing, the RM can deadlock. getQueueInfo() locks the
> ParentQueue and then calls each child queue's getQueueInfo() method in turn.
> However, when a container completes, the event-processing thread locks the
> LeafQueue and then calls back into the ParentQueue. The two paths take the same
> locks in opposite order, which is a recipe for deadlock.
> Stacktrace to follow.
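One common way to break this kind of cycle is to make sure the leaf never holds its own lock while calling up into the parent, so the only nested acquisition left is parent-then-leaf. The sketch below illustrates that idea with toy classes. It is an assumption-laden illustration and not necessarily the change adopted for this issue. QueueLockOrderSketch, Parent, Leaf and run() are hypothetical names, and the counters are placeholders for real queue state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model; Parent and Leaf are not the real scheduler classes.
class QueueLockOrderSketch {

    static class Parent {
        private final List<Leaf> children = new ArrayList<>();
        private int completed;

        synchronized void addChild(Leaf leaf) {
            children.add(leaf);
        }

        // Parent lock first, then each child lock: the order the RPC path uses.
        synchronized int getQueueInfo() {
            int total = 0;
            for (Leaf leaf : children) {
                total += leaf.getQueueInfo();
            }
            return total;
        }

        synchronized void completedContainer() {
            completed++;
        }

        synchronized int completedCount() {
            return completed;
        }
    }

    static class Leaf {
        private final Parent parent;
        private int running;

        Leaf(Parent parent, int running) {
            this.parent = parent;
            this.running = running;
        }

        synchronized int getQueueInfo() {
            return running;
        }

        void completedContainer() {
            synchronized (this) {
                running--;               // leaf bookkeeping under the leaf lock only
            }
            parent.completedContainer(); // leaf lock already released here
        }
    }

    /** Runs both paths concurrently; returns {containers still running, completions seen by parent}. */
    static int[] run(int containers) {
        Parent root = new Parent();
        Leaf leaf = new Leaf(root, containers);
        root.addChild(leaf);

        Thread events = new Thread(() -> {
            for (int i = 0; i < containers; i++) {
                leaf.completedContainer();
            }
        });
        Thread rpc = new Thread(() -> {
            for (int i = 0; i < containers; i++) {
                root.getQueueInfo();
            }
        });
        events.start();
        rpc.start();
        try {
            events.join();
            rpc.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] { root.getQueueInfo(), root.completedCount() };
    }

    public static void main(String[] args) {
        int[] result = run(10000);
        System.out.println("running=" + result[0] + " completed=" + result[1]);
    }
}
```

The trade-off is that leaf and parent state are briefly out of step between the two updates, so a concurrent getQueueInfo() may observe the leaf's change before the parent's. Whether that is acceptable depends on what readers of the queue state expect.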
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira