[jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148006#comment-14148006 ]

Karthik Kambatla commented on YARN-2594:

Taking a look at the issue and the patch.

ResourceManger sometimes become un-responsive

Key: YARN-2594
URL: https://issues.apache.org/jira/browse/YARN-2594
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karam Singh
Assignee: Wangda Tan
Priority: Blocker
Attachments: YARN-2594.patch

The ResourceManager sometimes becomes unresponsive. There was no exception in the ResourceManager log; it contains only messages of the following type:
{code}
2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148049#comment-14148049 ]

Karthik Kambatla commented on YARN-2594:

Thanks for working on this, Wangda.

As I see it, we could adopt the approach in the current patch. If we do so, we should avoid using readLock in other get methods that access {{RMAppImpl#currentAttempt}}. {{RMAppAttemptImpl}} should handle the thread-safety of its own fields.

Either in addition to or instead of the current approach, we really need to clean up {{SchedulerApplicationAttempt}}. Most of the methods there are synchronized, and many of them just call synchronized methods in {{AppSchedulingInfo}}. Needless to say, this is more involved and we need to be very careful.

I am open to adopting the first approach in this JIRA and filing follow-up JIRAs to address the second approach.

PS: We really need to set up jcarder or something similar to identify most of these deadlock paths.
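To illustrate the first approach discussed above (a getter that does not take the read lock), here is a minimal sketch. It is not the YARN-2594 patch: `App`, `Attempt`, `startNewAttempt`, and `currentAttemptIdWhileWriterActive` are hypothetical stand-ins, and the sketch assumes the field can safely be published through a `volatile` reference.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockFreeGetterSketch {

    /** Hypothetical stand-in for an application attempt; immutable, so safe to share. */
    static final class Attempt {
        final int id;
        Attempt(int id) { this.id = id; }
    }

    /** Hypothetical stand-in for the application object that owns the read/write lock. */
    static final class App {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        // volatile: readers see the latest reference without taking the read lock
        private volatile Attempt currentAttempt;

        void startNewAttempt(int id) {
            lock.writeLock().lock();
            try {
                currentAttempt = new Attempt(id);
            } finally {
                lock.writeLock().unlock();
            }
        }

        /** Plain volatile read: never waits behind a held or queued write lock. */
        Attempt getCurrentAttempt() {
            return currentAttempt;
        }
    }

    /** Reads the current attempt while another thread holds the write lock. */
    static int currentAttemptIdWhileWriterActive() throws InterruptedException {
        App app = new App();
        app.startNewAttempt(1);

        CountDownLatch writeHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread writer = new Thread(() -> {
            app.lock.writeLock().lock();
            try {
                writeHeld.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                app.lock.writeLock().unlock();
            }
        });
        writer.start();
        writeHeld.await();

        int id = app.getCurrentAttempt().id; // does not block on the write lock

        release.countDown();
        writer.join();
        return id;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("attempt id read without the lock: "
                + currentAttemptIdWhileWriterActive());
    }
}
```

The trade-off in this sketch is that the getter only guarantees a consistent reference; any state inside the attempt object has to be made thread-safe by that object itself, which is the point made about {{RMAppAttemptImpl}} above.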
[jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146381#comment-14146381 ]

Wangda Tan commented on YARN-2594:

Hi [~devaraj.k],

Have you already looked into this? I think I've found the root cause of the problem; could you assign this ticket to me?

This is a deadlock between the following two threads:
{code}
IPC Server handler 45 on 8032 daemon prio=10 tid=0x7f032909b000 nid=0x7bd7 waiting for monitor entry [0x7f0307aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceUsageReport(SchedulerApplicationAttempt.java:541)
    - waiting to lock 0xe0e7ea70 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getAppResourceUsageReport(AbstractYarnScheduler.java:196)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getApplicationResourceUsageReport(RMAppAttemptImpl.java:703)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:569)
    at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:294)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
    at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
{code}
And
{code}
ResourceManager Event Processor prio=10 tid=0x7f0328db9800 nid=0x7aeb waiting on condition [0x7f0311a48000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for 0xe0e72bc0 (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:964)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
    at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getCurrentAppAttempt(RMAppImpl.java:476)
    at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.updateAttemptMetrics(RMContainerImpl.java:509)
    at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:495)
    at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:484)
    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    - locked 0xe0e85318 (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
    at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:373)
    at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.containerCompleted(FiCaSchedulerApp.java:89)
    - locked 0xe0e7ea70 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
    at
[jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146890#comment-14146890 ]

zhihai xu commented on YARN-2594:

These two threads alone won't cause a deadlock, because they only access RMAppImpl.readLock. There is another thread that is waiting for RMAppImpl.writeLock:
{code}
AsyncDispatcher event handler prio=10 tid=0x7f0328b2e800 nid=0x7c58 waiting on condition [0x7f0306d9d000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for 0xe0e72bc0 (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
    at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:945)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:698)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:94)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:716)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:700)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)
{code}
I think these three threads together cause the deadlock.
[jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147083#comment-14147083 ]

Wangda Tan commented on YARN-2594:

Hi [~zxu],

You're correct. The problem is that the first two threads, which both use the readLock, deadlock because of the synchronized access. They in turn block the writeLock acquisition, so the RM dispatcher is blocked.

Working on a patch now.

Wangda
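The role of the third thread can be reproduced in isolation. The following is a minimal sketch, not RM code: it uses a plain nonfair `ReentrantReadWriteLock` (the dumps above show `ReentrantReadWriteLock$NonfairSync`), and the class and method names are made up for illustration. One thread holds the read lock, a writer queues behind it, and a second reader then tries a timed acquire. The sketch relies on the nonfair lock making a new reader wait when a writer is at the head of the wait queue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class QueuedWriterBlocksReaderDemo {

    /** Returns true if a second reader got the read lock while a writer was queued. */
    static boolean secondReaderGetsLock() throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(); // nonfair by default
        CountDownLatch readHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Plays the IPC handler: holds the read lock while stuck on something else.
        Thread firstReader = new Thread(() -> {
            lock.readLock().lock();
            try {
                readHeld.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        });
        firstReader.start();
        readHeld.await();

        // Plays the AsyncDispatcher: queues for the write lock behind the reader.
        Thread writer = new Thread(() -> {
            lock.writeLock().lock();
            lock.writeLock().unlock();
        });
        writer.start();
        while (writer.getState() != Thread.State.WAITING) {
            Thread.sleep(1); // wait until the writer is parked in the lock's queue
        }

        // Plays the scheduler event processor: a new reader arriving after the writer.
        boolean acquired = lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
        if (acquired) {
            lock.readLock().unlock();
        }

        release.countDown(); // let everything finish so the demo terminates
        firstReader.join();
        writer.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("second reader acquired the read lock: " + secondReaderGetsLock());
    }
}
```

In the RM the first reader never releases, because it is waiting for the monitor that the second reader holds, so the timed wait in this sketch corresponds to an indefinite wait there.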
[jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147333#comment-14147333 ]

Hadoop QA commented on YARN-2594:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12671126/YARN-2594.patch
against trunk revision 428a766.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5116//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5116//console

This message is automatically generated.