[
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Karthik Kambatla updated YARN-2010:
-----------------------------------
Description:
Sometimes, the RM fails to recover an application. It could be because of
turning security on, token expiry, or issues connecting to HDFS etc. The causes
could be classified into (1) transient, (2) specific to one application, and
(3) permanent and apply to multiple (all) applications. Today, the RM fails to
transition to Active and ends up in STOPPED state and can never be transitioned
to Active again.
was:
If the RM fails to recover an app attempt, it won't come up. We should make it
more resilient.
Specifically, the underlying error is that the app was submitted before
Kerberos security got turned on. Makes sense for the app to fail in this case.
But YARN should still start.
{noformat}
2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
at
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
at
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
at
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when
transitioning to Active mode
at
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
at
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
... 4 more
Caused by: org.apache.hadoop.service.ServiceStateException:
org.apache.hadoop.yarn.exceptions.YarnException:
java.lang.IllegalArgumentException: Missing argument
at
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
at
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
... 5 more
Caused by: org.apache.hadoop.yarn.exceptions.YarnException:
java.lang.IllegalArgumentException: Missing argument
at
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
at
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
at
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
... 8 more
Caused by: java.lang.IllegalArgumentException: Missing argument
at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93)
at
org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188)
at
org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663)
at
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369)
... 13 more
{noformat}
> If RM fails to recover an app, it can never transition to active again
> ----------------------------------------------------------------------
>
> Key: YARN-2010
> URL: https://issues.apache.org/jira/browse/YARN-2010
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.3.0
> Reporter: bc Wong
> Assignee: Karthik Kambatla
> Priority: Critical
> Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch,
> yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch
>
>
> Sometimes, the RM fails to recover an application. It could be because of
> turning security on, token expiry, or issues connecting to HDFS etc. The
> causes could be classified into (1) transient, (2) specific to one
> application, and (3) permanent and apply to multiple (all) applications.
> Today, the RM fails to transition to Active and ends up in STOPPED state and
> can never be transitioned to Active again.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)