[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186176#comment-14186176 ]
Jian He commented on YARN-2010:
-------------------------------

bq. Any subsequent attempts to transition the RM to active fail because RMActiveServices is not INITED, as in the Standby case

I think YARN-2588 fixed this. Are you running into this error with the patch?

- How about moving "addApplicationSync" into RMAppRecoveredTransition? We can catch the exception inside the transition and return the failed state directly.
{code}
      // If security is enabled and the application is NOT in a final state,
      // parse the credentials and renew delegation token
      if (UserGroupInformation.isSecurityEnabled()
          && !isApplicationInFinalState(appState.getState())) {
        Credentials credentials = parseCredentials(appContext);
        // synchronously renew delegation token on recovery.
        rmContext.getDelegationTokenRenewer().addApplicationSync(appId,
            credentials, appContext.getCancelTokensWhenComplete());
      }
      // Actual recovery of the application
      application.handle(new RMAppEvent(appId, RMAppEventType.RECOVER));
    } catch (Exception e) {
      LOG.error("Failed to recover application + " + appId, e);
      // Fail the application if it is a running application.
      if (!isApplicationInFinalState(appState.getState())) {
        rmContext.getDispatcher().getEventHandler().handle(
            new RMAppRejectedEvent(appId, e.getMessage()));
      }
      throw e;
{code}
- Changes in TestWorkPreservingRMRestart: this was done on purpose, to force the RM to fail if the queue is missing for the app and to signal the admin to configure the queue properly.
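To illustrate the suggestion above (renew the token inside the recovery transition, catch the failure there, and return a failed state instead of rethrowing), here is a minimal, self-contained Java sketch. It does not use the real YARN classes: AppState, TokenRenewer, and RecoverySketch.recover are hypothetical stand-ins invented for this example, and the actual RMAppRecoveredTransition may look quite different.

```java
// Hypothetical, simplified stand-ins; these are NOT the real YARN types.
enum AppState { RUNNING, FAILED }

final class RecoverySketch {

  /** Stand-in for a token renewer whose synchronous renewal may throw. */
  interface TokenRenewer {
    void addApplicationSync(String appId) throws Exception;
  }

  /**
   * Mimics a recovery transition: the renewal happens inside the transition,
   * and any exception is mapped to FAILED for this one application instead
   * of propagating to the caller (which would abort the whole recovery).
   */
  static AppState recover(String appId, boolean securityEnabled,
      TokenRenewer renewer) {
    if (securityEnabled) {
      try {
        renewer.addApplicationSync(appId);
      } catch (Exception e) {
        System.err.println("Failed to recover application " + appId
            + ": " + e.getMessage());
        return AppState.FAILED;
      }
    }
    return AppState.RUNNING;
  }

  public static void main(String[] args) {
    // One renewal succeeds, one fails; only the second app ends up FAILED.
    System.out.println(recover("app_1", true, id -> { }));
    System.out.println(recover("app_2", true, id -> {
      throw new java.io.IOException("token expired");
    }));
  }
}
```

In this sketch a bad token fails only the affected application, so the caller never sees an exception from recovery. Whether that is the right behavior for every failure cause (for example, the missing-queue case mentioned above) is a separate design question.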
> If RM fails to recover an app, it can never transition to active again
> ----------------------------------------------------------------------
>
>                 Key: YARN-2010
>                 URL: https://issues.apache.org/jira/browse/YARN-2010
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: bc Wong
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: YARN-2010.1.patch, YARN-2010.patch,
> issue-stacktrace.rtf, yarn-2010-2.patch, yarn-2010-3.patch,
> yarn-2010-3.patch, yarn-2010-4.patch, yarn-2010-5.patch, yarn-2010-6.patch
>
>
> Sometimes, the RM fails to recover an application. It could be because of
> turning security on, token expiry, or issues connecting to HDFS etc. The
> causes could be classified into (1) transient, (2) specific to one
> application, and (3) permanent and apply to multiple (all) applications.
> Today, the RM fails to transition to Active and ends up in STOPPED state and
> can never be transitioned to Active again.
> The initial stacktrace reported is at
> https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)