chenwencan created YARN-11292:
---------------------------------
Summary: resourcemanager no longer reconnects to zk
Key: YARN-11292
URL: https://issues.apache.org/jira/browse/YARN-11292
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 3.3.3
Reporter: chenwencan
This problem has occurred in our environment. The sequence of events is as
follows:
# A network exception occurs between the ResourceManager and ZooKeeper.
# The ResourceManager reconnects to ZooKeeper successfully.
# The ZooKeeper session expires.
# The ResourceManager creates a new ZooKeeper client and tries to reconnect.
# If the reconnect to ZooKeeper fails, an RMFatalEvent is triggered.
# A new thread is then started to keep reconnecting and rejoin the election. However, the variable hasAlreadyRun allows this to run only once, so if the reconnect fails again, there is no further chance to reconnect.
{code:java}
private class StandByTransitionRunnable implements Runnable {
  // The atomic variable to make sure multiple threads with the same runnable
  // run only once.
  private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);

  @Override
  public void run() {
    // Run this only once, even if multiple threads end up triggering
    // this simultaneously.
    if (hasAlreadyRun.getAndSet(true)) {
      return;
    }
    if (rmContext.isHAEnabled()) {
      try {
        // Transition to standby and reinit active services
        LOG.info("Transitioning RM to Standby mode");
        transitionToStandby(true);
        EmbeddedElector elector = rmContext.getLeaderElectorService();
        if (elector != null) {
          elector.rejoinElection();
        }
      } catch (Exception e) {
        LOG.error(FATAL, "Failed to transition RM to Standby mode.", e);
        ExitUtil.terminate(1, e);
      }
    }
  }
}
{code}
So I think using a lock here would be better.
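
To illustrate the difference, here is a minimal standalone sketch. It is not Hadoop code and not a proposed patch: FlakyElector, RunOnceTransition and LockedTransition are hypothetical names invented for this example, and the elector is assumed to report failure through a boolean return value rather than through the real EmbeddedElector API. The first runnable mirrors the hasAlreadyRun pattern above; the second shows one way a lock could serialize callers while still allowing a retry after a failed attempt.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the leader elector: the first `failures`
// calls to rejoinElection() fail, later calls succeed.
class FlakyElector {
  private final AtomicInteger failuresLeft;

  FlakyElector(int failures) {
    this.failuresLeft = new AtomicInteger(failures);
  }

  boolean rejoinElection() {
    return failuresLeft.getAndDecrement() <= 0;
  }
}

// Same shape as the quoted runnable: the flag is set on the first call and
// never cleared, so a failed attempt cannot be retried.
class RunOnceTransition implements Runnable {
  private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);
  private final FlakyElector elector;
  volatile boolean joined = false;

  RunOnceTransition(FlakyElector elector) {
    this.elector = elector;
  }

  @Override
  public void run() {
    if (hasAlreadyRun.getAndSet(true)) {
      return;
    }
    joined = elector.rejoinElection();
  }
}

// Lock-based variant (an assumption, not a committed fix): callers are
// serialized, and a later call retries unless an earlier one succeeded.
class LockedTransition implements Runnable {
  private final Object lock = new Object();
  private final FlakyElector elector;
  private boolean joined = false; // guarded by lock

  LockedTransition(FlakyElector elector) {
    this.elector = elector;
  }

  @Override
  public void run() {
    synchronized (lock) {
      if (joined) {
        return;
      }
      joined = elector.rejoinElection();
    }
  }

  boolean isJoined() {
    synchronized (lock) {
      return joined;
    }
  }
}
```

With an elector that fails once, calling run() twice on RunOnceTransition leaves it unjoined, because the second call returns early. The same two calls on LockedTransition succeed on the second attempt. Whether a retry should be driven by repeated events or by a loop inside the runnable is a separate design question that this sketch does not address.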