Tarun Parimi created YARN-9712:
----------------------------------

             Summary: ResourceManager goes into a deadlock while transitioning to standby
                 Key: YARN-9712
                 URL: https://issues.apache.org/jira/browse/YARN-9712
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager, RM
    Affects Versions: 2.9.0
            Reporter: Tarun Parimi


We have observed the RM go into a deadlock while transitioning to standby in a 
heavily loaded production cluster that experiences random loss of its ZooKeeper 
session and also handles a large number of RMDelegationToken requests from 
Oozie jobs.

Based on the jstack and the logs, this seems to happen when the following 
sequence of events occurs.

1. The ZooKeeper session is lost, so the ActiveStandbyElector service calls 
transitionToStandby. transitionToStandby is a synchronized method, so it 
acquires a lock on the ResourceManager.
{code:java}
2019-07-25 14:31:24,497 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(621)) - Session expired. Entering neutral mode and rejoining...
2019-07-25 14:31:28,084 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1134)) - Transitioning to standby state
{code}


2. While transitioning to standby, a java.lang.InterruptedException occurs in 
RMStateStore while removing/storing an RMDelegationToken. This is because 
RMSecretManagerService is stopped during the transition to standby.
{code:java}
2019-07-25 14:31:28,576 ERROR recovery.RMStateStore (RMStateStore.java:transition(373)) - Error While Removing RMDelegationToken and SequenceNumber
java.lang.InterruptedException
2019-07-25 14:31:28,576 ERROR recovery.RMStateStore (RMStateStore.java:notifyStoreOperationFailedInternal(992)) - State store operation failed
java.lang.InterruptedException
{code}


3. When the state store error occurs, an RMFatalEvent of type 
STATE_STORE_FENCED is sent.

{code:java}
2019-07-25 14:31:28,579 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(767)) - Received RMFatalEvent of type STATE_STORE_FENCED, caused by java.lang.InterruptedException
{code}


4. The problem occurs when the RMFatalEventDispatcher calls getConfig(). This 
also needs the lock on ResourceManager since it is a synchronized method, so 
the rmDispatcher eventHandlingThread becomes blocked.

{code:java}
private class RMFatalEventDispatcher implements EventHandler<RMFatalEvent> {
    @Override
    public void handle(RMFatalEvent event) {
      LOG.error("Received " + event);

      if (HAUtil.isHAEnabled(getConfig())) {
        // If we're in an HA config, the right answer is always to go into
        // standby.
        LOG.warn("Transitioning the resource manager to standby.");
        handleTransitionToStandByInNewThread();
{code}

5. transitionToStandby waits forever, because it joins the rmDispatcher 
eventHandlingThread, which is itself blocked on the lock that 
transitionToStandby holds. This causes a deadlock, and the RM will not become 
active until restarted.
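
The sequence above can be reproduced in miniature with plain Java threads. 
The following is only a sketch under stated assumptions, not the Hadoop code: 
StandbyDeadlockSketch, FakeManager, stopHandler and newHandler are hypothetical 
names made up for illustration, and a join timeout is used so the example 
terminates, whereas the jstack below shows Thread.join waiting without a limit.

```java
import java.util.concurrent.CountDownLatch;

class StandbyDeadlockSketch {

    /** Hypothetical stand-in for ResourceManager; not the real Hadoop class. */
    static class FakeManager {
        private final CountDownLatch monitorHeld = new CountDownLatch(1);

        // Like AbstractService.getConfig(): needs this object's monitor.
        synchronized String getConfig() {
            return "ha-enabled";
        }

        // Like transitionToStandby(): holds the monitor while joining the
        // event handler thread. Returns true if the handler has terminated.
        synchronized boolean stopHandler(Thread handler, long joinTimeoutMs)
                throws InterruptedException {
            monitorHeld.countDown();      // let the handler proceed to getConfig()
            handler.join(joinTimeoutMs);  // join() does not release this monitor
            return !handler.isAlive();
        }

        // Like the dispatcher thread that handles the fatal event.
        Thread newHandler() {
            Thread t = new Thread(() -> {
                try {
                    monitorHeld.await();  // wait until the monitor is taken
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                getConfig();              // blocks while stopHandler() runs
            }, "fake-event-handler");
            t.setDaemon(true);
            return t;
        }
    }

    /** Returns true if the handler finished within the join timeout. */
    static boolean handlerStoppedInTime(long joinTimeoutMs) {
        FakeManager rm = new FakeManager();
        Thread handler = rm.newHandler();
        handler.start();
        try {
            boolean stopped = rm.stopHandler(handler, joinTimeoutMs);
            handler.join();               // monitor released, handler can finish
            return stopped;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("handler stopped in time: " + handlerStoppedInTime(200));
    }
}
```

With any positive timeout the handler is still alive when the join gives up, 
because it cannot enter getConfig() until stopHandler() returns. Passing 0 
would make join() wait indefinitely and hang the sketch in the same way as 
described in step 5.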

Below are the relevant threads in the jstack captured.

The transitionToStandby thread that waits forever.
{code:java}
"main-EventThread" #138239 daemon prio=5 os_prio=0 tid=0x00007fea473b2800 nid=0x2f411 in Object.wait() [0x00007fda5bef5000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1245)
        - locked <0x00007fdb6c5059a0> (a java.lang.Thread)
        at java.lang.Thread.join(Thread.java:1319)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:161)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x00007fdb6c538ca0> (a java.lang.Object)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.resetRMContext(ResourceManager.java:1323)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.reinitialize(ResourceManager.java:1091)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1139)
        - locked <0x00007fdb33e418f0> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:355)
        - locked <0x00007fdb33e41828> (a org.apache.hadoop.yarn.server.resourcemanager.AdminService)
        at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:147)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:970)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:480)
        - locked <0x00007fdb33e7bb88> (a org.apache.hadoop.ha.ActiveStandbyElector)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)

   Locked ownable synchronizers:
        - None
{code}

The blocked rmDispatcher EventHandler.

{code:java}
"AsyncDispatcher event handler" #135565 daemon prio=5 os_prio=0 tid=0x00007fdb2107f000 nid=0x2484a waiting for monitor entry [0x00007fda597cc000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.service.AbstractService.getConfig(AbstractService.java:403)
        - waiting to lock <0x00007fdb33e418f0> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:769)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:764)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None
{code}

This scenario can happen only with the changes introduced in YARN-3742, where 
RMFatalEventDispatcher handles ERROR scenarios such as STATE_STORE_FENCED by 
trying to transitionToStandby.


