[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143200#comment-14143200
 ] 

Rohith commented on YARN-2579:
------------------------------

This scenario can occur if two threads call 
ResourceManager#transitionToStandby() concurrently: first 
AdminService#transitionToStandby(), and then RMFatalEventDispatcher, whose 
handle() also calls transitionToStandby(). I simulated this using debug 
points.
The main problem is in resetting the dispatcher, which stops the dispatcher. 
Suppose AdminService is stopping the dispatcher while the dispatcher thread is 
blocked waiting to acquire the lock on ResourceManager. AdminService holds 
that lock and waits in Thread.join() for the dispatcher thread to exit, but 
the dispatcher thread cannot exit until it gets the lock. The ResourceManager 
therefore never transitions to standby; it waits indefinitely.

{code}
"AsyncDispatcher event handler" daemon prio=10 tid=0x00000000007ea000 nid=0x39d1 waiting for monitor entry [0x00007fe0a77f6000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:976)
        - waiting to lock <0x00000000c1f7d438> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:701)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:678)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
        at java.lang.Thread.run(Thread.java:745)

"IPC Server handler 0 on 45021" daemon prio=10 tid=0x00007fe0a9026800 nid=0x30ab in Object.wait() [0x00007fe0a7cfa000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000eb3310e8> (a java.lang.Thread)
        at java.lang.Thread.join(Thread.java:1281)
        - locked <0x00000000eb3310e8> (a java.lang.Thread)
        at java.lang.Thread.join(Thread.java:1355)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:150)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x00000000eb32fef8> (a java.lang.Object)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.resetDispatcher(ResourceManager.java:1166)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:987)
        - locked <0x00000000c1f7d438> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:308)
        - locked <0x00000000c2038d10> (a org.apache.hadoop.yarn.server.resourcemanager.AdminService)
        at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToStandby(HAServiceProtocolServerSideTranslatorPB.java:119)
        at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4462)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
{code}
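
The lock pattern in the dump above can be reproduced outside Hadoop. The 
following is only a minimal sketch, not the ResourceManager code: the class 
name, method name, and rmMonitor object are hypothetical stand-ins, and the 
join uses a timeout so the demo terminates, whereas the stack trace shows a 
join without a timeout (which is why the real RM hangs).

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for the scenario; none of these names exist in Hadoop.
class StandbyDeadlockSketch {

    // Returns true if the "dispatcher" thread is still alive after the
    // "admin" thread has waited joinTimeoutMs for it while holding the monitor.
    static boolean dispatcherStuckWhileLockHeld(long joinTimeoutMs) {
        final Object rmMonitor = new Object();            // plays the ResourceManager lock
        final CountDownLatch adminHoldsLock = new CountDownLatch(1);

        // Plays the AsyncDispatcher event handler thread: it needs the
        // monitor before it can finish handling its event and exit.
        Thread dispatcher = new Thread(() -> {
            try {
                adminHoldsLock.await();                   // start only once admin owns the lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (rmMonitor) {
                // the dispatcher's own transitionToStandby() call would run here
            }
        }, "dispatcher-sketch");
        dispatcher.setDaemon(true);
        dispatcher.start();

        boolean stuck;
        try {
            // Plays the admin/IPC handler thread: takes the monitor, then
            // waits for the dispatcher thread to exit, as a dispatcher stop would.
            synchronized (rmMonitor) {
                adminHoldsLock.countDown();
                dispatcher.join(joinTimeoutMs);           // join() does not release rmMonitor
                stuck = dispatcher.isAlive();
            }
            // Once the monitor is released, the dispatcher can finish.
            dispatcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return stuck;
    }

    public static void main(String[] args) {
        System.out.println("dispatcher stuck while lock held: "
                + dispatcherStuckWhileLockHeld(200));
    }
}
```

The sketch suggests why the hang does not resolve on its own: the joining 
thread keeps the monitor for the whole wait, so the joined thread can never 
make progress. Whether the fix should avoid joining under the lock or avoid 
taking the lock on the dispatcher thread is not decided here.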


> Both RM's state is Active , but 1 RM is not really active.
> ----------------------------------------------------------
>
>                 Key: YARN-2579
>                 URL: https://issues.apache.org/jira/browse/YARN-2579
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.5.1
>            Reporter: Rohith
>
> I encountered a situation where the web pages of both RMs were accessible 
> and each displayed its state as Active, but the ActiveServices of one of 
> the RMs had been stopped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
