Che Yufei created YARN-8358:
-------------------------------
Summary: ResourceManager restart fails to recover due to
TimelineServiceV1Publisher NPE
Key: YARN-8358
URL: https://issues.apache.org/jira/browse/YARN-8358
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.9.1
Environment: Ubuntu 16.04
java version "1.8.0_91"
Reporter: Che Yufei
I'm upgrading from Hadoop 2.7.3 to 2.9.1. ResourceManager restart works fine
on 2.7.3, but fails on 2.9.1.
I'm using LevelDB as the RM state store. The problem seems to be related to
TimelineServiceV1Publisher: if I set
yarn.resourcemanager.system-metrics-publisher.enabled to false, recovery
works fine, but if the option is set to true, the RM fails to start with the
following log:
{{2018-05-24 23:11:54,597 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recovery started}}
{{2018-05-24 23:11:54,673 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Loaded RM state version info 1.1}}
{{2018-05-24 23:11:54,688 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: Recovered 12 RM delegation token master keys}}
{{2018-05-24 23:11:54,688 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: Recovered 0 RM delegation tokens}}
{{2018-05-24 23:11:54,990 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: Recovered 2099 applications and 2100 application attempts}}
{{2018-05-24 23:11:54,998 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: Recovered 0 reservations}}
{{2018-05-24 23:11:54,998 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: recovering RMDelegationTokenSecretManager.}}
{{2018-05-24 23:11:55,003 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Recovering 2099 applications}}
{{2018-05-24 23:11:55,107 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Successfully recovered 0 out of 2099 applications}}
{{2018-05-24 23:11:55,108 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state}}
{{java.lang.NullPointerException}}
{{ at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV1Publisher.appCreated(TimelineServiceV1Publisher.java:90)}}
{{ at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.sendATSCreateEvent(RMAppImpl.java:1954)}}
{{ at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:931)}}
{{ at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1061)}}
{{ at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1054)}}
{{ at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)}}
{{ at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)}}
{{ at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)}}
{{ at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)}}
{{ at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:878)}}
{{ at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:339)}}
{{ at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:533)}}
{{ at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1394)}}
{{ at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:758)}}
{{ at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)}}
{{ at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1147)}}
{{ at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1187)}}
{{ at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1183)}}
{{ at java.security.AccessController.doPrivileged(Native Method)}}
{{ at javax.security.auth.Subject.doAs(Subject.java:422)}}
{{ at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)}}
{{ at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1183)}}
{{ at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1223)}}
{{ at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)}}
{{ at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1422)}}
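
For reference, the workaround described in this report (disabling the system metrics publisher so that recovery succeeds) would look roughly like the following yarn-site.xml fragment. This is only a sketch of the workaround, not a fix for the NPE: the property name is the one quoted in the report, and with it set to false the RM will presumably stop publishing application events to the timeline service.

```xml
<!-- yarn-site.xml: workaround sketch only, not a fix for the NPE. -->
<!-- Assumption: no other property in the file overrides this one. -->
<property>
  <name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
  <value>false</value>
</property>
```

The ResourceManager has to be restarted for the change to take effect.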