[ 
https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990037#comment-13990037
 ] 

Tsuyoshi OZAWA commented on YARN-2019:
--------------------------------------

RMStateStore handles the exceptions in ZKRMStateStore like this: 
{code}
    try {
      // ZK related operations
      removeRMDTMasterKeyState(delegationKey);
    } catch (Exception e) {
      notifyStoreOperationFailed(e);
    }
{code}

If the store is fenced (STATE_STORE_FENCED) and HA is enabled, 
RMFatalEventDispatcher transitions the RM to standby. However, if 
STATE_STORE_OP_FAILED occurs, the active RM terminates. After failover to the 
standby RM, the same exception could recur on the new active RM. Maybe this is 
the case [~djp] mentioned. Please correct me if I'm wrong.

{code}
  @Private
  public static class RMFatalEventDispatcher
      implements EventHandler<RMFatalEvent> {
    @Override
    public void handle(RMFatalEvent event) {
      LOG.fatal("Received a " + RMFatalEvent.class.getName() + " of type " +
          event.getType().name() + ". Cause:\n" + event.getCause());

      if (event.getType() == RMFatalEventType.STATE_STORE_FENCED) {
        LOG.info("RMStateStore has been fenced");
        if (rmContext.isHAEnabled()) {
          try {
            // Transition to standby and reinit active services
            LOG.info("Transitioning RM to Standby mode");
            rm.transitionToStandby(true);
            return;
          } catch (Exception e) {
            LOG.fatal("Failed to transition RM to Standby mode.");
          }
        }
      }

      ExitUtil.terminate(1, event.getCause());
    }
  }
{code}
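
To make the two outcomes easier to compare, here is a minimal, self-contained 
sketch of the decision the dispatcher above makes. The names FatalEventSketch, 
Action, and decide are hypothetical and not part of YARN, and the sketch leaves 
out the path where the transition to standby itself throws (the real 
dispatcher then falls through to ExitUtil.terminate).

```java
class FatalEventSketch {
    /** Mirrors the two RMFatalEventType values discussed in this comment. */
    enum EventType { STATE_STORE_FENCED, STATE_STORE_OP_FAILED }

    /** Hypothetical labels for what the dispatcher ends up doing. */
    enum Action { TRANSITION_TO_STANDBY, TERMINATE }

    /**
     * Only a fenced store on an HA-enabled RM leads to standby;
     * every other combination ends in process termination.
     */
    static Action decide(EventType type, boolean haEnabled) {
        if (type == EventType.STATE_STORE_FENCED && haEnabled) {
            return Action.TRANSITION_TO_STANDBY;
        }
        return Action.TERMINATE;
    }

    public static void main(String[] args) {
        // Print the full decision table for both event types and HA settings.
        for (EventType type : EventType.values()) {
            for (boolean ha : new boolean[] {true, false}) {
                System.out.println(type + ", HA=" + ha + " -> " + decide(type, ha));
            }
        }
    }
}
```

Under these assumptions, STATE_STORE_OP_FAILED always terminates the RM, even 
with HA enabled, so a store failure that persists across failover would 
terminate the new active RM in the same way.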



> Retrospect on decision of making RM crashed if any exception throw in 
> ZKRMStateStore
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2019
>                 URL: https://issues.apache.org/jira/browse/YARN-2019
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Junping Du
>            Priority: Critical
>              Labels: ha
>
> Currently, if anything abnormal happens in ZKRMStateStore, it throws a fatal 
> exception that crashes the RM. As shown in YARN-1924, the cause could be an 
> internal RM HA bug rather than a truly fatal exception. We should revisit this 
> decision, as the HA feature is designed to protect key components, not to 
> disturb them.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
