[jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive

2014-09-25 Thread Karthik Kambatla (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148006#comment-14148006 ]

Karthik Kambatla commented on YARN-2594:


Taking a look at the issue and the patch.. 

 ResourceManger sometimes become un-responsive
 -

 Key: YARN-2594
 URL: https://issues.apache.org/jira/browse/YARN-2594
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karam Singh
Assignee: Wangda Tan
Priority: Blocker
 Attachments: YARN-2594.patch


 ResourceManager sometimes becomes un-responsive:
 There wasn't any exception in the ResourceManager log; it contains only the following 
 type of messages:
 {code}
 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive

2014-09-25 Thread Karthik Kambatla (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148049#comment-14148049 ]

Karthik Kambatla commented on YARN-2594:


Thanks for working on this, Wangda. 

As I see it, we could adopt the approach in the current patch. If we do so, we 
should avoid using the readLock in the other get methods that access 
{{RMAppImpl#currentAttempt}}; {{RMAppAttemptImpl}} should handle the 
thread-safety of its own fields.
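
For illustration, a minimal sketch of that idea (class and field names are simplified; this is not the attached YARN-2594.patch): if the current-attempt field is volatile, the getter can return it without touching the read lock at all, so it can never queue behind a pending writer.
{code}
// Hypothetical sketch only -- not the attached patch, and names are simplified.
class AppState {
  private volatile Object currentAttempt;   // stands in for the RMAppAttempt reference

  Object getCurrentAppAttempt() {
    // Plain volatile read: no readLock().lock()/unlock() pair here.
    return currentAttempt;
  }

  void setCurrentAppAttempt(Object attempt) {
    // In RMAppImpl the writes happen inside state transitions (under the write
    // lock); the volatile write keeps this plain getter safe.
    currentAttempt = attempt;
  }
}
{code}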

Either in addition to or instead of the current approach, we really need to clean up 
{{SchedulerApplicationAttempt}}. Most of the methods there are synchronized, 
and many of them just call synchronized methods in {{AppSchedulingInfo}}. 
Needless to say, this is more involved, and we need to be very careful. 
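
For context, the pattern being described looks roughly like this (simplified, hypothetical classes, not the real ones): a coarse synchronized method that enters a second object's synchronized method while still holding its own monitor, which is exactly the kind of nesting that becomes risky once RMAppImpl's read/write lock is mixed in.
{code}
// Simplified illustration of the pattern only -- not the actual classes.
class Attempt {
  private final SchedulingInfo info = new SchedulingInfo();

  synchronized long getResourceUsage() {   // holds Attempt's monitor...
    return info.getUsed();                 // ...while also taking info's monitor
  }
}

class SchedulingInfo {
  private long used;

  synchronized long getUsed() { return used; }

  synchronized void allocate(long amount) { used += amount; }
}
{code}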

I am open to adopting the first approach in this JIRA and filing follow-up JIRAs 
to address the second approach suggested above. 

PS: We really need to set up jcarder or something to identify most of these 
deadlock paths. 



[jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive

2014-09-24 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146381#comment-14146381 ]

Wangda Tan commented on YARN-2594:
--

Hi [~devaraj.k],
Have you already looked into this? I think I've found the root cause of the 
problem; could you assign this ticket to me?

This is a deadlock between the following two threads:
{code}
IPC Server handler 45 on 8032 daemon prio=10 tid=0x7f032909b000 nid=0x7bd7 waiting for monitor entry [0x7f0307aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceUsageReport(SchedulerApplicationAttempt.java:541)
- waiting to lock 0xe0e7ea70 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getAppResourceUsageReport(AbstractYarnScheduler.java:196)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getApplicationResourceUsageReport(RMAppAttemptImpl.java:703)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:569)
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:294)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
{code}

And 

{code}
ResourceManager Event Processor prio=10 tid=0x7f0328db9800 nid=0x7aeb waiting on condition [0x7f0311a48000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0xe0e72bc0 (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:964)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getCurrentAppAttempt(RMAppImpl.java:476)
at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.updateAttemptMetrics(RMContainerImpl.java:509)
at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:495)
at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:484)
at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
- locked 0xe0e85318 (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:373)
at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.containerCompleted(FiCaSchedulerApp.java:89)
- locked 0xe0e7ea70 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
at ...
{code}

[jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive

2014-09-24 Thread zhihai xu (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146890#comment-14146890 ]

zhihai xu commented on YARN-2594:
-

These two threads alone won't cause a deadlock, because they only access the 
RMAppImpl.readLock.
There is another thread that accesses the RMAppImpl.writeLock, shown in the following stack:
{code}
AsyncDispatcher event handler prio=10 tid=0x7f0328b2e800 nid=0x7c58 waiting on condition [0x7f0306d9d000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0xe0e72bc0 (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:945)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:698)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:94)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:716)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:700)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
{code}

I think these three threads cause the deadlock.
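
For anyone who wants to reproduce the cycle outside the RM, here is a self-contained toy program (all names are illustrative stand-ins, not RM code) that recreates the same three-thread hang with a {{ReentrantReadWriteLock}} and a plain monitor; with a non-fair read/write lock, the reader in the monitor-holding thread queues behind the pending writer, which closes the cycle:
{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy reproduction of the three-thread cycle above. All names are
// illustrative; this is NOT ResourceManager code. Run it and take a jstack:
// all three threads stay stuck (kill the process with Ctrl-C).
public class RMDeadlockDemo {

  static final ReentrantReadWriteLock appLock = new ReentrantReadWriteLock(); // plays RMAppImpl's lock
  static final Object schedulerApp = new Object();                            // plays the FiCaSchedulerApp monitor
  static final CountDownLatch readLockHeld = new CountDownLatch(1);

  public static void main(String[] args) throws InterruptedException {
    // 1. "IPC handler": holds the read lock, later wants the scheduler monitor.
    Thread ipcHandler = new Thread(() -> {
      appLock.readLock().lock();
      readLockHeld.countDown();
      sleepQuietly(600);                  // let the other two threads park first
      synchronized (schedulerApp) {       // BLOCKED: monitor is held by eventProcessor
        System.out.println("never reached");
      }
    }, "ipc-handler");

    // 2. "AsyncDispatcher": queues a write-lock request behind the held read lock.
    Thread dispatcher = new Thread(() -> {
      awaitQuietly(readLockHeld);
      appLock.writeLock().lock();         // WAITING: the read lock is still held
      appLock.writeLock().unlock();
    }, "async-dispatcher");

    // 3. "Scheduler event processor": holds the monitor, then asks for the read
    //    lock, which now queues behind the pending writer -> cycle is closed.
    Thread eventProcessor = new Thread(() -> {
      awaitQuietly(readLockHeld);
      sleepQuietly(300);                  // make sure the writer is already queued
      synchronized (schedulerApp) {
        appLock.readLock().lock();        // WAITING: parked behind the queued writer
        appLock.readLock().unlock();
      }
    }, "scheduler-event-processor");

    ipcHandler.start();
    dispatcher.start();
    eventProcessor.start();

    Thread.sleep(2000);
    for (Thread t : new Thread[] {ipcHandler, dispatcher, eventProcessor}) {
      System.out.println(t.getName() + ": " + t.getState()); // BLOCKED / WAITING / WAITING
    }
  }

  static void sleepQuietly(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }

  static void awaitQuietly(CountDownLatch latch) {
    try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }
}
{code}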



[jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive

2014-09-24 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147083#comment-14147083 ]

Wangda Tan commented on YARN-2594:
--

Hi [~zxu],
You're correct. The problem is that the first two threads block each other: one holds 
the read lock and waits on the synchronized {{FiCaSchedulerApp}}, while the other holds 
that monitor and waits for the read lock, which now queues behind the pending write-lock 
request. That in turn keeps the write lock from ever being acquired, so the RM dispatcher 
is blocked.
Working on a patch now.

Wangda



[jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive

2014-09-24 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147333#comment-14147333 ]

Hadoop QA commented on YARN-2594:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12671126/YARN-2594.patch
  against trunk revision 428a766.

{color:green}+1 @author{color}.  The patch does not contain any @author tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5116//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5116//console

This message is automatically generated.
