[ https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990037#comment-13990037 ]
Tsuyoshi OZAWA commented on YARN-2019:
--------------------------------------

RMStateStore handles the exceptions in ZKRMStateStore like this:

{code}
try {
  // ZK related operations
  removeRMDTMasterKeyState(delegationKey);
} catch (Exception e) {
  notifyStoreOperationFailed(e);
}
{code}

If the store is fenced, RMFatalEventDispatcher handles the exception and the RM goes into standby state. However, if STATE_STORE_OP_FAILED occurs, the active RM terminates. After failover to the standby RM, the same exception could be repeated on the new active RM. Maybe this is the case [~djp] mentioned. Please correct me if I am wrong.

{code}
@Private
public static class RMFatalEventDispatcher
    implements EventHandler<RMFatalEvent> {

  @Override
  public void handle(RMFatalEvent event) {
    LOG.fatal("Received a " + RMFatalEvent.class.getName() + " of type "
        + event.getType().name() + ". Cause:\n" + event.getCause());

    if (event.getType() == RMFatalEventType.STATE_STORE_FENCED) {
      LOG.info("RMStateStore has been fenced");
      if (rmContext.isHAEnabled()) {
        try {
          // Transition to standby and reinit active services
          LOG.info("Transitioning RM to Standby mode");
          rm.transitionToStandby(true);
          return;
        } catch (Exception e) {
          LOG.fatal("Failed to transition RM to Standby mode.");
        }
      }
    }

    ExitUtil.terminate(1, event.getCause());
  }
}
{code}

> Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2019
>                 URL: https://issues.apache.org/jira/browse/YARN-2019
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Junping Du
>            Priority: Critical
>              Labels: ha
>
> Currently, if anything abnormal happens in ZKRMStateStore, it throws a fatal
> exception that crashes the RM. As shown in YARN-1924, the cause could be an
> internal bug in RM HA itself rather than a genuinely fatal condition. We
> should revisit this decision, since the HA feature is designed to protect a
> key component, not to disrupt it.
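
For illustration, the branching in the RMFatalEventDispatcher snippet quoted in the comment above can be modelled as a small decision function. This is only a sketch, not Hadoop code: the class FatalEventSketch, its Outcome enum, and the transitionSucceeds flag are hypothetical names introduced here, and the flag stands in for whether rm.transitionToStandby(true) completes without throwing.

```java
// Hypothetical, self-contained model of the dispatcher's decision. Not Hadoop code.
final class FatalEventSketch {

    // Mirrors the two event types named in the comment.
    enum EventType { STATE_STORE_FENCED, STATE_STORE_OP_FAILED }

    // Hypothetical labels for the two possible results of handle().
    enum Outcome { TRANSITION_TO_STANDBY, TERMINATE }

    /**
     * Returns what the quoted handle() method ends up doing.
     *
     * @param type               the fatal event type
     * @param haEnabled          stands in for rmContext.isHAEnabled()
     * @param transitionSucceeds assumption: true if transitionToStandby does not throw
     */
    static Outcome decide(EventType type, boolean haEnabled, boolean transitionSucceeds) {
        // Only a fenced store on an HA-enabled RM can avoid termination,
        // and only if the transition to standby itself succeeds.
        if (type == EventType.STATE_STORE_FENCED && haEnabled && transitionSucceeds) {
            return Outcome.TRANSITION_TO_STANDBY;
        }
        // Every other path falls through to ExitUtil.terminate(1, ...).
        return Outcome.TERMINATE;
    }

    public static void main(String[] args) {
        for (EventType type : EventType.values()) {
            for (boolean ha : new boolean[] {true, false}) {
                System.out.println(type + ", HA=" + ha + " -> " + decide(type, ha, true));
            }
        }
    }
}
```

In this model STATE_STORE_OP_FAILED maps to TERMINATE whether or not HA is enabled, which is the behaviour the comment is questioning.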