[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334986#comment-14334986 ]

Jason Lowe commented on YARN-3251:
----------------------------------

Sample stack trace:
{noformat}
Found one Java-level deadlock:
=============================
"IPC Server handler 71 on 8032":
  waiting to lock monitor 0x00000000037f9120 (object 0x000000023b060ad8, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue),
  which is held by "ResourceManager Event Processor"
"ResourceManager Event Processor":
  waiting to lock monitor 0x0000000002c4b7d0 (object 0x000000023aecf620, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
  which is held by "IPC Server handler 71 on 8032"

Java stack information for the threads listed above:
===================================================
"IPC Server handler 71 on 8032":
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.getQueueInfo(LeafQueue.java:451)
        - waiting to lock <0x000000023b060ad8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getQueueInfo(ParentQueue.java:214)
        - locked <0x000000023aecf620> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getQueueInfo(ParentQueue.java:214)
        - locked <0x000000023af36e70> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getQueueInfo(ParentQueue.java:214)
        - locked <0x000000023b0d9478> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getQueueInfo(CapacityScheduler.java:910)
        at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueInfo(ClientRMService.java:832)
        at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueInfo(ApplicationClientProtocolPBServiceImpl.java:259)
        at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:413)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2079)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2075)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2073)
"ResourceManager Event Processor":
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.getParent(AbstractCSQueue.java:185)
        - waiting to lock <0x000000023aecf620> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.getAbsoluteMaxAvailCapacity(CSQueueUtils.java:177)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.getAbsoluteMaxAvailCapacity(CSQueueUtils.java:183)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.computeUserLimitAndSetHeadroom(LeafQueue.java:1033)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.checkLimitsToReserve(LeafQueue.java:1341)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1611)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1399)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1278)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignReservedContainer(LeafQueue.java:893)
        - locked <0x000000023b060ad8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:758)
        - locked <0x000000023ceb53e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
        - locked <0x000000023b060ad8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:992)
        - locked <0x000000023ae2fbd0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1059)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:114)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:680)
        at java.lang.Thread.run(Thread.java:722)

Found 1 deadlock.
{noformat}
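
Reading the trace, this looks like a lock-order inversion. The IPC handler takes the ParentQueue monitors top-down in getQueueInfo and then waits for the LeafQueue. The event processor already holds the LeafQueue from assignContainers and then waits for a ParentQueue in getParent. Below is a minimal sketch of that pattern, not the Hadoop code: the class name, thread names, and the two plain Object monitors standing in for the queues are all hypothetical. It uses daemon threads so the JVM can still exit, and asks the JVM for the monitor deadlock the same way a thread dump would report it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class LockOrderInversionSketch {

    /**
     * Starts two daemon threads that take two monitors in opposite order and
     * returns how many threads the JVM reports as monitor-deadlocked
     * (0 if none is detected before the timeout).
     */
    static int reproduce() {
        // Hypothetical stand-ins for the ParentQueue and LeafQueue monitors.
        final Object parentQueue = new Object();
        final Object leafQueue = new Object();
        // Ensures each thread holds its first monitor before trying the second.
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Mirrors the IPC handler: parent first, then leaf.
        Thread rpcHandler = new Thread(() -> {
            synchronized (parentQueue) {
                awaitOther(bothHoldFirstLock);
                synchronized (leafQueue) { /* never reached */ }
            }
        }, "sketch-rpc-handler");

        // Mirrors the event processor: leaf first, then parent.
        Thread eventProcessor = new Thread(() -> {
            synchronized (leafQueue) {
                awaitOther(bothHoldFirstLock);
                synchronized (parentQueue) { /* never reached */ }
            }
        }, "sketch-event-processor");

        rpcHandler.setDaemon(true);
        eventProcessor.setDaemon(true);
        rpcHandler.start();
        eventProcessor.start();

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(10);
        while (System.nanoTime() < deadline) {
            long[] ids = threads.findMonitorDeadlockedThreads();
            if (ids != null) {
                return ids.length;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 0;
            }
        }
        return 0;
    }

    private static void awaitOther(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + reproduce());
    }
}
```

In the sketch, either thread could avoid the cycle by taking the monitors in the same order as the other, or by not holding the leaf monitor while it reaches for the parent. Whether either approach suits the CapacityScheduler is a separate question that the sketch does not answer.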

> CapacityScheduler deadlock when computing absolute max avail capacity
> ---------------------------------------------------------------------
>
>                 Key: YARN-3251
>                 URL: https://issues.apache.org/jira/browse/YARN-3251
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Priority: Blocker
>
> The ResourceManager can deadlock in the CapacityScheduler when computing the 
> absolute max available capacity for user limits and headroom.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
