[ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186176#comment-14186176
 ] 

Jian He commented on YARN-2010:
-------------------------------

bq. Any subsequent attempts to transition the RM to active fail because 
RMActiveServices is not INITED, as in the Standby case
I think YARN-2588 fixed this. Are you still running into this error with the patch?
- How about moving addApplicationSync into RMAppRecoveredTransition? We could 
catch the exception inside the transition and return the failed state directly, 
instead of handling it in the code below:
{code}
      // If security is enabled and the application is NOT in a final state,
      // parse the credentials and renew delegation token
      if (UserGroupInformation.isSecurityEnabled() &&
          !isApplicationInFinalState(appState.getState())) {
        Credentials credentials = parseCredentials(appContext);
        // synchronously renew delegation token on recovery.
        rmContext.getDelegationTokenRenewer().addApplicationSync(appId,
            credentials, appContext.getCancelTokensWhenComplete());
      }

      // Actual recovery of the application
      application.handle(new RMAppEvent(appId, RMAppEventType.RECOVER));
    } catch (Exception e) {
      LOG.error("Failed to recover application + " + appId, e);
      // Fail the application if it is a running application.
      if (!isApplicationInFinalState(appState.getState())) {
        rmContext.getDispatcher().getEventHandler().handle(
            new RMAppRejectedEvent(appId, e.getMessage()));
      }
      throw e;
{code}
- Changes in TestWorkPreservingRMRestart:
The original behavior was intentional: the RM is forced to fail if the app's 
queue is missing, which signals the admin to configure the queue properly.
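
To make the addApplicationSync suggestion above concrete, here is a minimal, self-contained sketch. It is not the actual RMAppImpl code. The names RecoverTransitionSketch, TokenRenewer and recover are hypothetical stand-ins, and the choice of ACCEPTED as the state after a successful recovery is a simplifying assumption. The only point is that the renewal failure is caught inside the transition and turned into a failed state for that one app, rather than propagating up and stopping the RM.

```java
public class RecoverTransitionSketch {

    // Simplified subset of application states (assumption, not the full RMAppState enum).
    enum AppState { NEW, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    // Hypothetical stand-in for the delegation token renewer's synchronous call.
    interface TokenRenewer {
        void addApplicationSync(String appId) throws Exception;
    }

    static boolean isFinal(AppState state) {
        return state == AppState.FINISHED
            || state == AppState.FAILED
            || state == AppState.KILLED;
    }

    // Sketch of a recover transition: returns the state the app moves to.
    static AppState recover(String appId, AppState storedState,
                            boolean securityEnabled, TokenRenewer renewer) {
        // Apps already in a final state need no token renewal.
        if (isFinal(storedState)) {
            return storedState;
        }
        if (securityEnabled) {
            try {
                // Synchronously renew the delegation token on recovery.
                renewer.addApplicationSync(appId);
            } catch (Exception e) {
                // Fail only this application; do not rethrow.
                System.err.println("Failed to recover application "
                    + appId + ": " + e.getMessage());
                return AppState.FAILED;
            }
        }
        // Assumption: a successfully recovered live app resumes as ACCEPTED.
        return AppState.ACCEPTED;
    }

    public static void main(String[] args) {
        TokenRenewer failing = id -> { throw new java.io.IOException("token expired"); };
        System.out.println(recover("app_1", AppState.RUNNING, true, failing));
    }
}
```

With this shape, a token renewal failure affects only the app being recovered, and the remaining apps can still be recovered during the transition to active.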

> If RM fails to recover an app, it can never transition to active again
> ----------------------------------------------------------------------
>
>                 Key: YARN-2010
>                 URL: https://issues.apache.org/jira/browse/YARN-2010
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: bc Wong
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: YARN-2010.1.patch, YARN-2010.patch, 
> issue-stacktrace.rtf, yarn-2010-2.patch, yarn-2010-3.patch, 
> yarn-2010-3.patch, yarn-2010-4.patch, yarn-2010-5.patch, yarn-2010-6.patch
>
>
> Sometimes, the RM fails to recover an application. It could be because of 
> turning security on, token expiry, or issues connecting to HDFS etc. The 
> causes could be classified into (1) transient, (2) specific to one 
> application, and (3) permanent and apply to multiple (all) applications. 
> Today, the RM fails to transition to Active and ends up in STOPPED state and 
> can never be transitioned to Active again.
> The initial stacktrace reported is at 
> https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf



