Che Yufei created YARN-8358:
-------------------------------

             Summary: ResourceManager restart fail to recover due to 
TimelineServiceV1Publisher NPE
                 Key: YARN-8358
                 URL: https://issues.apache.org/jira/browse/YARN-8358
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.9.1
         Environment: Ubuntu 16.04

java version "1.8.0_91"
            Reporter: Che Yufei


I'm upgrading from Hadoop 2.7.3 to 2.9.1. ResourceManager restart works fine 
for 2.7.3, but fails on 2.9.1.

I'm using LevelDB as the RM state store, the problem seems related to 
TimelineServiceV1Publisher. If I set 
yarn.resourcemanager.system-metrics-publisher.enabled to false, then recovery 
works fine. But if the option is set to true, RM fails to start with the 
following log:

 

{{2018-05-24 23:11:54,597 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recovery 
started}}
{{2018-05-24 23:11:54,673 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Loaded RM 
state version info 1.1}}
{{2018-05-24 23:11:54,688 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: 
Recovered 12 RM delegation token master keys}}
{{2018-05-24 23:11:54,688 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: 
Recovered 0 RM delegation tokens}}
{{2018-05-24 23:11:54,990 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: 
Recovered 2099 applications and 2100 application attempts}}
{{2018-05-24 23:11:54,998 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: 
Recovered 0 reservations}}
{{2018-05-24 23:11:54,998 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager:
 recovering RMDelegationTokenSecretManager.}}
{{2018-05-24 23:11:55,003 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Recovering 2099 
applications}}
{{2018-05-24 23:11:55,107 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Successfully 
recovered 0 out of 2099 applications}}
{{2018-05-24 23:11:55,108 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to 
load/recover state}}
{{java.lang.NullPointerException}}
{{ at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV1Publisher.appCreated(TimelineServiceV1Publisher.java:90)}}
{{ at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.sendATSCreateEvent(RMAppImpl.java:1954)}}
{{ at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:931)}}
{{ at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1061)}}
{{ at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1054)}}
{{ at 
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)}}
{{ at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)}}
{{ at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)}}
{{ at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)}}
{{ at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:878)}}
{{ at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:339)}}
{{ at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:533)}}
{{ at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1394)}}
{{ at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:758)}}
{{ at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)}}
{{ at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1147)}}
{{ at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1187)}}
{{ at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1183)}}
{{ at java.security.AccessController.doPrivileged(Native Method)}}
{{ at javax.security.auth.Subject.doAs(Subject.java:422)}}
{{ at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)}}
{{ at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1183)}}
{{ at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1223)}}
{{ at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)}}
{{ at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1422)}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to