[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148646#comment-14148646 ]

Wangda Tan commented on YARN-2594:
----------------------------------

[~kasha],
Thanks for your comments. We should definitely reduce the synchronized locking, but this problem does not seem to be caused by that.
After a discussion with Jian He, we found 4 suspicious threads.

Thread#2/#4 try to acquire the read lock and fail, but at the same time *no one holds the write lock* (thread#3 is itself still waiting for the write lock). This looks more like a Java bug to me.
Below are a description of that bug and a report from other people claiming it is not yet fixed:
1) Java bug description: 
http://webcache.googleusercontent.com/search?q=cache:fjM5oxWzmCsJ:bugs.java.com/view_bug.do%3Fbug_id%3D6822370+&cd=1&hl=en&ct=clnk&gl=hk
2) A report that the bug still occurs:
http://cs.oswego.edu/pipermail/concurrency-interest/2010-September/007413.html
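
For reference on the state thread#2/#4 are in: below is a minimal standalone sketch (my own demo class and timings, not RM code) showing that with the default non-fair ReentrantReadWriteLock, a new reader parks as soon as a writer is queued behind an existing read-lock holder, even though nobody holds the write lock at that moment. Whether our hang is this expected behavior or the JDK bug above is exactly the open question.
{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Demo only: reproduces the "reader parked, no write lock held" state
// outside the RM. Class name and sleep timings are made up for the sketch.
public class RWLockParkDemo {
  public static void main(String[] args) throws InterruptedException {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(); // non-fair by default

    // A reader that holds the read lock and never releases it
    // (stands in for a reader that is itself blocked elsewhere).
    new Thread(() -> {
      lock.readLock().lock();
      try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException ignored) { }
    }).start();
    Thread.sleep(200);

    // A writer queues up behind the read-lock holder (cf. thread#3).
    new Thread(() -> lock.writeLock().lock()).start();
    Thread.sleep(200);

    // A new reader now parks in doAcquireShared() even though the write
    // lock is free, because a writer is already queued (cf. thread#2/#4).
    Thread reader = new Thread(() -> lock.readLock().lock());
    reader.start();
    Thread.sleep(200);

    System.out.println("write lock held:  " + lock.isWriteLocked()); // false
    System.out.println("new reader state: " + reader.getState());   // WAITING
  }
}
{code}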

Thoughts? The following are thread#1-#4:

*Thread#1*
{code}
"IPC Server handler 45 on 8032" daemon prio=10 tid=0x00007f032909b000 
nid=0x7bd7 waiting for monitor entry [0x00007f0307aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceUsageReport(SchedulerApplicationAttempt.java:541)
        - waiting to lock <0x00000000e0e7ea70> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getAppResourceUsageReport(AbstractYarnScheduler.java:196)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getApplicationResourceUsageReport(RMAppAttemptImpl.java:703)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:569)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:294)
        at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
        at 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
{code}

*Thread#2*
{code}
"ResourceManager Event Processor" prio=10 tid=0x00007f0328db9800 nid=0x7aeb 
waiting on condition [0x00007f0311a48000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000e0e72bc0> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:964)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
        at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getCurrentAppAttempt(RMAppImpl.java:476)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.updateAttemptMetrics(RMContainerImpl.java:509)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:495)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:484)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        - locked <0x00000000e0e85318> (a 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:373)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.containerCompleted(FiCaSchedulerApp.java:89)
        - locked <0x00000000e0e7ea70> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1433)
        - locked <0x00000000e01a57b8> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:1124)
        - locked <0x00000000e011aea0> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:884)
        - locked <0x00000000e011aea0> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:989)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:93)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:612)
        at java.lang.Thread.run(Thread.java:745)
{code}

*Thread#3*
{code}
"AsyncDispatcher event handler" prio=10 tid=0x00007f0328b2e800 nid=0x7c58 
waiting on condition [0x00007f0306d9d000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000e0e72bc0> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
        at 
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:945)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:698)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:94)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:716)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:700)
        at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
        at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
        at java.lang.Thread.run(Thread.java:745)
{code}

*Thread#4*
{code}
"IPC Server handler 13 on 8032" daemon prio=10 tid=0x00007f0329057800 
nid=0x7bb6 waiting on condition [0x00007f0309ac9000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000e0e72bc0> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:964)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
        at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getState(RMAppImpl.java:421)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createApplicationState(RMAppImpl.java:1232)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:724)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:627)
        at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplications(ApplicationClientProtocolPBServiceImpl.java:204)
        at 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:329)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
{code}
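
To make the dumps easier to relate, here is a paraphrased sketch of the RMAppImpl locking shape visible in the stacks (field types and method bodies are placeholders, not the actual Hadoop source): the getters park on the read lock of the app's ReentrantReadWriteLock (<0x00000000e0e72bc0>) while handle() waits for its write lock, and thread#2 does its read-lock wait while still holding the FiCaSchedulerApp monitor (<0x00000000e0e7ea70>) that thread#1 is blocked on.
{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the lock usage seen in the stacks above; names mirror
// RMAppImpl, but fields and bodies are stand-ins.
class RMAppImplSketch {
  private final ReentrantReadWriteLock readWriteLock = new ReentrantReadWriteLock();
  private Object state;          // stand-in for RMAppState
  private Object currentAttempt; // stand-in for RMAppAttempt

  // thread#4 parks here (RMAppImpl.getState, RMAppImpl.java:421).
  public Object getState() {
    readWriteLock.readLock().lock();
    try { return state; } finally { readWriteLock.readLock().unlock(); }
  }

  // thread#2 parks here (RMAppImpl.getCurrentAppAttempt, RMAppImpl.java:476),
  // called with the FiCaSchedulerApp monitor held further down its stack.
  public Object getCurrentAppAttempt() {
    readWriteLock.readLock().lock();
    try { return currentAttempt; } finally { readWriteLock.readLock().unlock(); }
  }

  // thread#3 parks here (RMAppImpl.handle, RMAppImpl.java:698) waiting
  // for the write lock before driving the state machine.
  public void handle(Object event) {
    readWriteLock.writeLock().lock();
    try { /* state-machine transition */ } finally { readWriteLock.writeLock().unlock(); }
  }
}
{code}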

> Potential deadlock in RM when querying ApplicationResourceUsageReport
> ---------------------------------------------------------------------
>
>                 Key: YARN-2594
>                 URL: https://issues.apache.org/jira/browse/YARN-2594
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Karam Singh
>            Assignee: Wangda Tan
>            Priority: Blocker
>         Attachments: YARN-2594.patch
>
>
> ResourceManager sometimes becomes unresponsive:
> There was no exception in the ResourceManager log; it contains only the following type of messages:
> {code}
> 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
> 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
> 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
> 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
> 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
> 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
> 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
