liuxiangwei created YARN-6107:
---------------------------------
Summary: ResourceManager recovered with NPE Exception due to zk store failed
Key: YARN-6107
URL: https://issues.apache.org/jira/browse/YARN-6107
Project: Hadoop YARN
Issue Type: Bug
Components: yarn
Affects Versions: 2.5.1
Reporter: liuxiangwei
First, the RM was stopped by the exception below:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /nmg01-khan-yarn-on-normandy-rmstore/ZKRMStateRoot/RMAppRoot/application_1484014091623_3711/appattempt_1484014091623_3711_000001
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:960)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:957)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1007)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1026)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:957)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:654)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:236)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:219)
at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:774)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:845)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:840)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:662)
Second, restarting the RM never succeeds because of the exception below:
2017-01-18 15:07:48,130 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
The stack trace points to the code below:
SchedulerApplication<FiCaSchedulerApp> application =
applications.get(appAttemptId.getApplicationId());
It seems the application does not exist in the scheduler, so applications.get(...) returns null.
We also found log entries like these:
2017-01-18 15:11:21,204 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1484014091623_3711 with 1 attempts and final state = FINISHED
2017-01-18 15:11:21,204 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Recovering attempt: appattempt_1484014091623_3711_000001 with final state: null
2017-01-18 15:11:21,204 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1484014091623_3711_000001 State change from NEW to LAUNCHED
2017-01-18 15:11:21,204 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1484014091623_3711 State change from NEW to FINISHED
The recovered final states do not match: the application is FINISHED, while its attempt has final state null and moves to LAUNCHED.
We need to check whether the application is null to avoid this problem and let the failover succeed.
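A minimal sketch of the proposed null check follows. It is not the actual scheduler code: AttemptAddedSketch, its nested SchedulerApplication class, the String ids, and the addApplication/addApplicationAttempt methods are simplified, hypothetical stand-ins used only to illustrate the guard around applications.get(...).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-in for the scheduler; not the real YARN classes.
final class AttemptAddedSketch {

    // Stand-in for SchedulerApplication<FiCaSchedulerApp>.
    static final class SchedulerApplication {
        String currentAttemptId;
    }

    // Stand-in for the scheduler's applications map, keyed by application id.
    private final Map<String, SchedulerApplication> applications =
            new ConcurrentHashMap<String, SchedulerApplication>();

    void addApplication(String applicationId) {
        applications.put(applicationId, new SchedulerApplication());
    }

    // Returns true if the attempt was registered, false if the application
    // is unknown to the scheduler (the case that triggered the NPE).
    boolean addApplicationAttempt(String applicationId, String appAttemptId) {
        SchedulerApplication application = applications.get(applicationId);
        if (application == null) {
            // Guard: a recovered attempt may refer to an application that was
            // never added to the scheduler, e.g. one already recovered as FINISHED.
            System.err.println("Application " + applicationId
                    + " not found in scheduler; ignoring attempt " + appAttemptId);
            return false;
        }
        application.currentAttemptId = appAttemptId;
        return true;
    }
}
```

With this guard, an APP_ATTEMPT_ADDED event for an unknown application is logged and skipped instead of stopping the RM. Whether skipping is the right recovery behaviour, as opposed to repairing the stored attempt state, is a separate question that this sketch does not address.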