[jira] [Created] (YARN-11292) resourcemanager no longer reconnects to zk

chenwencan (Jira) Fri, 02 Sep 2022 02:08:11 -0700

chenwencan created YARN-11292:
---------------------------------

             Summary: resourcemanager no longer reconnects to zk
                 Key: YARN-11292
                 URL: https://issues.apache.org/jira/browse/YARN-11292
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 3.3.3
            Reporter: chenwencan



This problem occurred in our environment. The sequence of events was as 
follows:
 # A network exception occurs between the ResourceManager and ZooKeeper.
 # The ResourceManager reconnects to ZooKeeper successfully.
 # The ZooKeeper session expires.
 # The ResourceManager creates a new ZooKeeper client and tries to reconnect.
 # If the reconnect to ZooKeeper fails, an RMFatalEvent is triggered.
 # A new thread is then started to keep reconnecting and to rejoin the election. However, the variable hasAlreadyRun allows the runnable to run only once, so if the reconnect fails again, there is no further chance to reconnect.
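
The effect of the run-once guard in the last step can be reproduced in isolation. The following is a minimal sketch, not Hadoop code: OneShotGuardDemo, OneShotTask, attempts and attemptsAfter are hypothetical names made up for illustration, and the counter increment merely stands in for the reconnect and rejoin work.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a run-once runnable; not part of Hadoop.
class OneShotGuardDemo {

    static class OneShotTask implements Runnable {
        // Same pattern as in the report: flips to true on the first call
        // and is never reset.
        private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);
        // Counts how often the guarded body was actually executed.
        final AtomicInteger attempts = new AtomicInteger();

        @Override
        public void run() {
            if (hasAlreadyRun.getAndSet(true)) {
                return; // every call after the first is a no-op
            }
            // Placeholder for "transition to standby and rejoin election".
            attempts.incrementAndGet();
        }
    }

    // Triggers the same task instance several times and reports how many
    // times the body really ran.
    static int attemptsAfter(int calls) {
        OneShotTask task = new OneShotTask();
        for (int i = 0; i < calls; i++) {
            task.run();
        }
        return task.attempts.get();
    }

    public static void main(String[] args) {
        System.out.println(attemptsAfter(3)); // prints 1
    }
}
```

However many times the same instance is triggered, the body runs once, so a failure in that single attempt cannot be retried through the same runnable.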

{code:java}
    private class StandByTransitionRunnable implements Runnable {
      // The atomic variable to make sure multiple threads with the same runnable
      // run only once.
      private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);

      @Override
      public void run() {
        // Run this only once, even if multiple threads end up triggering
        // this simultaneously.
        if (hasAlreadyRun.getAndSet(true)) {
          return;
        }

        if (rmContext.isHAEnabled()) {
          try {
            // Transition to standby and reinit active services
            LOG.info("Transitioning RM to Standby mode");
            transitionToStandby(true);
            EmbeddedElector elector = rmContext.getLeaderElectorService();
            if (elector != null) {
              elector.rejoinElection();
            }
          } catch (Exception e) {
            LOG.error(FATAL, "Failed to transition RM to Standby mode.", e);
            ExitUtil.terminate(1, e);
          }
        }
      }
    }
{code}
So I think using a lock here instead of the run-once flag would be better.
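
One possible shape of such a lock-based guard is sketched below. This is only an illustration under assumptions, not a proposed patch: LockGuardedTransition, body and runSequentially are hypothetical names, and the body passed in stands for the standby transition and rejoinElection() call. It uses java.util.concurrent.locks.ReentrantLock.tryLock() so that threads triggering the transition at the same time are still collapsed into one run, while a later trigger can run the body again.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical wrapper; not part of Hadoop.
class LockGuardedTransition implements Runnable {
    private final ReentrantLock lock = new ReentrantLock();
    // Placeholder for the real work (transition to standby, rejoin election).
    private final Runnable body;

    LockGuardedTransition(Runnable body) {
        this.body = body;
    }

    @Override
    public void run() {
        // If another thread is in the middle of the transition, skip this
        // trigger instead of running the body concurrently. Note that
        // ReentrantLock lets the owning thread re-enter, so the body must
        // not call run() recursively.
        if (!lock.tryLock()) {
            return;
        }
        try {
            body.run();
        } finally {
            // Releasing the lock is what allows a later trigger to retry.
            lock.unlock();
        }
    }

    // Triggers one instance several times in sequence and reports how many
    // times the body ran.
    static int runSequentially(int calls) {
        AtomicInteger attempts = new AtomicInteger();
        LockGuardedTransition t =
            new LockGuardedTransition(attempts::incrementAndGet);
        for (int i = 0; i < calls; i++) {
            t.run();
        }
        return attempts.get();
    }

    public static void main(String[] args) {
        System.out.println(runSequentially(3)); // prints 3
    }
}
```

With this shape, each sequential trigger gets its own attempt, whereas the hasAlreadyRun flag permits only the first one.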



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

