[ https://issues.apache.org/jira/browse/YARN-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jon Bringhurst updated YARN-2223: --------------------------------- Description: I upgraded two clusters from tag 2.2.0 to branch-2.4.1 (latest commit is https://github.com/apache/hadoop-common/commit/c96c8e45a60651b677a1de338b7856a444dc0461). Both clusters have the same config (other than hostnames). Both are running on JDK8u5 (I'm not sure if this is a factor here). One cluster started up without any errors. The other started up with the following error on the RM: {noformat} 18:33:45,463 WARN RMAppImpl:331 - The specific max attempts: 0 for application: 1 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead. 18:33:45,465 INFO RMAppImpl:651 - Recovering app: application_1398450350082_0001 with 8 attempts and final state = KILLED 18:33:45,468 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000001 with final state: KILLED 18:33:45,478 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000002 with final state: FAILED 18:33:45,478 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000003 with final state: FAILED 18:33:45,479 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000004 with final state: FAILED 18:33:45,479 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000005 with final state: FAILED 18:33:45,480 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000006 with final state: FAILED 18:33:45,480 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000007 with final state: FAILED 18:33:45,481 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000008 with final state: FAILED 18:33:45,482 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000001 State change from NEW to KILLED 18:33:45,482 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000002 State change from NEW to FAILED 18:33:45,482 INFO 
RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000003 State change from NEW to FAILED 18:33:45,482 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000004 State change from NEW to FAILED 18:33:45,483 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000005 State change from NEW to FAILED 18:33:45,483 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000006 State change from NEW to FAILED 18:33:45,483 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000007 State change from NEW to FAILED 18:33:45,483 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000008 State change from NEW to FAILED 18:33:45,485 INFO RMAppImpl:639 - application_1398450350082_0001 State change from NEW to KILLED 18:33:45,485 WARN RMAppImpl:331 - The specific max attempts: 0 for application: 2 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead. 18:33:45,485 INFO RMAppImpl:651 - Recovering app: application_1398450350082_0002 with 8 attempts and final state = KILLED 18:33:45,486 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000001 with final state: KILLED 18:33:45,486 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000002 with final state: FAILED 18:33:45,487 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000003 with final state: FAILED 18:33:45,487 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000004 with final state: FAILED 18:33:45,488 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000005 with final state: FAILED 18:33:45,488 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000006 with final state: FAILED 18:33:45,489 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000007 with final state: FAILED 18:33:45,489 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000008 with 
final state: FAILED 18:33:45,490 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000001 State change from NEW to KILLED 18:33:45,490 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000002 State change from NEW to FAILED 18:33:45,490 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000003 State change from NEW to FAILED 18:33:45,490 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000004 State change from NEW to FAILED 18:33:45,491 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000005 State change from NEW to FAILED 18:33:45,491 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000006 State change from NEW to FAILED 18:33:45,491 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000007 State change from NEW to FAILED 18:33:45,491 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000008 State change from NEW to FAILED 18:33:45,491 INFO RMAppImpl:639 - application_1398450350082_0002 State change from NEW to KILLED 18:33:45,492 WARN RMAppImpl:331 - The specific max attempts: 0 for application: 33 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead. 18:33:45,492 INFO RMAppImpl:651 - Recovering app: application_1401811496082_0033 with 2 attempts and final state = null 18:33:45,492 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1401811496082_0033_000001 with final state: FAILED 18:33:45,492 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1401811496082_0033_000002 with final state: null 18:33:45,493 INFO RMAppAttemptImpl:659 - appattempt_1401811496082_0033_000001 State change from NEW to FAILED 18:33:45,493 INFO RMAppAttemptImpl:659 - appattempt_1401811496082_0033_000002 State change from NEW to LAUNCHED 18:33:45,494 INFO RMAppImpl:639 - application_1401811496082_0033 State change from NEW to ACCEPTED 18:33:45,494 WARN RMAppImpl:331 - The specific max attempts: 0 for application: 1 is invalid, because it is out of the range [1, 50]. 
Use the global max attempts instead. 18:33:45,494 INFO RMAppImpl:651 - Recovering app: application_1398453545406_0001 with 9 attempts and final state = null 18:33:45,495 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000001 with final state: FAILED 18:33:45,495 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000002 with final state: FAILED 18:33:45,496 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000003 with final state: FAILED 18:33:45,496 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000004 with final state: FAILED 18:33:45,496 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000005 with final state: FAILED 18:33:45,497 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000006 with final state: FAILED 18:33:45,497 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000007 with final state: FAILED 18:33:45,498 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000008 with final state: FAILED 18:33:45,499 ERROR ResourceManager:488 - Failed to load/recover state java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040) 18:33:45,500 INFO AbstractService:272 - Service RMActiveServices failed in state STARTED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874) at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040) 18:33:45,501 INFO MetricsSystemImpl:200 - Stopping ResourceManager metrics system... 18:33:45,502 INFO MetricsSystemImpl:206 - ResourceManager metrics system stopped. 18:33:45,502 INFO MetricsSystemImpl:572 - ResourceManager metrics system shutdown complete. 18:33:45,502 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,503 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,503 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,503 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 
18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,505 INFO AbstractService:272 - Service ResourceManager failed in state STARTED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040) 18:33:45,505 INFO ResourceManager:891 - Transitioning to standby state 18:33:45,505 INFO ResourceManager:901 - Transitioned to standby 
state 18:33:45,505 FATAL ResourceManager:1042 - Error starting ResourceManager java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040) 18:33:45,509 INFO ResourceManager:640 - SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down ResourceManager at xxxxxmy_server_hostname/x.x.x.x ************************************************************/ {noformat} Subsequent startups 
result in an error that appears similar. Before I try to wipe the state of this cluster, is there any debug info you'd like me to gather?

Note that the following warning also appears in the log above. I haven't gotten around to fixing it yet, and I'm not sure whether it's related to the crash.

{noformat}
18:33:45,463 WARN RMAppImpl:331 - The specific max attempts: 0 for application: 1 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead.
{noformat}

was: I upgraded two clusters from tag 2.2.0 to branch-2.4.1 (latest commit is https://github.com/apache/hadoop-common/commit/c96c8e45a60651b677a1de338b7856a444dc0461). Both clusters have the same config (other than hostnames). Both are running on JDK8u5 (I'm not sure if this is a factor here). One cluster started up without any errors. The other started up with the following error on the RM:

{noformat}
18:33:45,463 WARN RMAppImpl:331 - The specific max attempts: 0 for application: 1 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead. 
18:33:45,465 INFO RMAppImpl:651 - Recovering app: application_1398450350082_0001 with 8 attempts and final state = KILLED 18:33:45,468 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000001 with final state: KILLED 18:33:45,478 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000002 with final state: FAILED 18:33:45,478 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000003 with final state: FAILED 18:33:45,479 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000004 with final state: FAILED 18:33:45,479 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000005 with final state: FAILED 18:33:45,480 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000006 with final state: FAILED 18:33:45,480 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000007 with final state: FAILED 18:33:45,481 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000008 with final state: FAILED 18:33:45,482 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000001 State change from NEW to KILLED 18:33:45,482 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000002 State change from NEW to FAILED 18:33:45,482 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000003 State change from NEW to FAILED 18:33:45,482 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000004 State change from NEW to FAILED 18:33:45,483 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000005 State change from NEW to FAILED 18:33:45,483 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000006 State change from NEW to FAILED 18:33:45,483 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000007 State change from NEW to FAILED 18:33:45,483 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000008 State change from NEW to FAILED 18:33:45,485 
INFO RMAppImpl:639 - application_1398450350082_0001 State change from NEW to KILLED 18:33:45,485 WARN RMAppImpl:331 - The specific max attempts: 0 for application: 2 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead. 18:33:45,485 INFO RMAppImpl:651 - Recovering app: application_1398450350082_0002 with 8 attempts and final state = KILLED 18:33:45,486 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000001 with final state: KILLED 18:33:45,486 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000002 with final state: FAILED 18:33:45,487 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000003 with final state: FAILED 18:33:45,487 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000004 with final state: FAILED 18:33:45,488 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000005 with final state: FAILED 18:33:45,488 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000006 with final state: FAILED 18:33:45,489 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000007 with final state: FAILED 18:33:45,489 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000008 with final state: FAILED 18:33:45,490 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000001 State change from NEW to KILLED 18:33:45,490 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000002 State change from NEW to FAILED 18:33:45,490 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000003 State change from NEW to FAILED 18:33:45,490 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000004 State change from NEW to FAILED 18:33:45,491 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000005 State change from NEW to FAILED 18:33:45,491 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000006 State 
change from NEW to FAILED 18:33:45,491 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000007 State change from NEW to FAILED 18:33:45,491 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000008 State change from NEW to FAILED 18:33:45,491 INFO RMAppImpl:639 - application_1398450350082_0002 State change from NEW to KILLED 18:33:45,492 WARN RMAppImpl:331 - The specific max attempts: 0 for application: 33 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead. 18:33:45,492 INFO RMAppImpl:651 - Recovering app: application_1401811496082_0033 with 2 attempts and final state = null 18:33:45,492 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1401811496082_0033_000001 with final state: FAILED 18:33:45,492 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1401811496082_0033_000002 with final state: null 18:33:45,493 INFO RMAppAttemptImpl:659 - appattempt_1401811496082_0033_000001 State change from NEW to FAILED 18:33:45,493 INFO RMAppAttemptImpl:659 - appattempt_1401811496082_0033_000002 State change from NEW to LAUNCHED 18:33:45,494 INFO RMAppImpl:639 - application_1401811496082_0033 State change from NEW to ACCEPTED 18:33:45,494 WARN RMAppImpl:331 - The specific max attempts: 0 for application: 1 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead. 
18:33:45,494 INFO RMAppImpl:651 - Recovering app: application_1398453545406_0001 with 9 attempts and final state = null 18:33:45,495 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000001 with final state: FAILED 18:33:45,495 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000002 with final state: FAILED 18:33:45,496 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000003 with final state: FAILED 18:33:45,496 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000004 with final state: FAILED 18:33:45,496 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000005 with final state: FAILED 18:33:45,497 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000006 with final state: FAILED 18:33:45,497 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000007 with final state: FAILED 18:33:45,498 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000008 with final state: FAILED 18:33:45,499 ERROR ResourceManager:488 - Failed to load/recover state java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040) 18:33:45,500 INFO AbstractService:272 - Service RMActiveServices failed in state STARTED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874) at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040) 18:33:45,501 INFO MetricsSystemImpl:200 - Stopping ResourceManager metrics system... 18:33:45,502 INFO MetricsSystemImpl:206 - ResourceManager metrics system stopped. 18:33:45,502 INFO MetricsSystemImpl:572 - ResourceManager metrics system shutdown complete. 18:33:45,502 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,503 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,503 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,503 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events. 
18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
18:33:45,505 INFO AbstractService:272 - Service ResourceManager failed in state STARTED; cause: java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040)
18:33:45,505 INFO ResourceManager:891 - Transitioning to standby state
18:33:45,505 INFO ResourceManager:901 - Transitioned to standby state
18:33:45,505 FATAL ResourceManager:1042 - Error starting ResourceManager
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040)
18:33:45,509 INFO ResourceManager:640 - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down ResourceManager at xxxxxmy_server_hostname/x.x.x.x
************************************************************/
{noformat}

When attempting to start up this cluster after the failure, it crashed again with:

{noformat}
19:19:05,662 INFO AMRMTokenSecretManager:107 - Rolling master-key for amrm-tokens
19:19:05,665 INFO RMContainerTokenSecretManager:103 - Rolling master-key for container-tokens
19:19:05,665 INFO NMTokenSecretManagerInRM:95 - Rolling master-key for nm-tokens
19:19:05,665 INFO RMContainerTokenSecretManager:108 - Going to activate master-key with key-id 1885856529 in 135000ms
19:19:05,665 INFO NMTokenSecretManagerInRM:100 - Going to activate master-key with key-id 1756560776 in 135000ms
19:19:35,971 INFO RMDelegationTokenSecretManager:96 - removing master key with keyID 86
19:19:35,971 INFO FileSystemRMStateStore:484 - Removing RMDelegationKey_86
19:19:35,972 INFO AbstractDelegationTokenSecretManager:223 - Updating the current master key for generating delegation tokens
19:19:35,972 INFO RMDelegationTokenSecretManager:85 - storing master key with keyID 94
19:19:35,973 INFO FileSystemRMStateStore:473 - Storing RMDelegationKey_94
19:21:20,666 INFO RMContainerTokenSecretManager:139 - Activating next master key with id: 1885856529
19:21:20,666 INFO NMTokenSecretManagerInRM:131 - Activating next master key with id: 1756560776
16:14:06,403 ERROR ResourceManager:60 - RECEIVED SIGNAL 15: SIGTERM
16:14:06,408 INFO log:67 - Stopped SelectChannelConnector@0.0.0.0:8088
16:14:06,510 INFO Server:2399 - Stopping server on 8032
16:14:06,511 INFO Server:694 - Stopping IPC Server listener on 8032
16:14:06,511 INFO Server:820 - Stopping IPC Server Responder
16:14:06,511 INFO Server:2399 - Stopping server on 8033
16:14:06,512 INFO Server:694 - Stopping IPC Server listener on 8033
16:14:06,512 INFO Server:820 - Stopping IPC Server Responder
16:14:06,512 INFO ResourceManager:890 - Transitioning to standby state
16:14:06,513 INFO MetricsSystemImpl:200 - Stopping ResourceManager metrics system...
16:14:06,516 INFO MetricsSystemImpl:206 - ResourceManager metrics system stopped.
16:14:06,516 INFO MetricsSystemImpl:572 - ResourceManager metrics system shutdown complete.
16:14:06,516 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,517 WARN ApplicationMasterLauncher:98 - org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher$LauncherThread interrupted. Returning.
16:14:06,518 INFO Server:2399 - Stopping server on 8030
16:14:06,520 INFO Server:694 - Stopping IPC Server listener on 8030
16:14:06,520 INFO Server:820 - Stopping IPC Server Responder
16:14:06,520 INFO Server:2399 - Stopping server on 8031
16:14:06,521 INFO Server:694 - Stopping IPC Server listener on 8031
16:14:06,521 INFO Server:820 - Stopping IPC Server Responder
16:14:06,522 ERROR ResourceManager:586 - Returning, interrupted : java.lang.InterruptedException
16:14:06,522 INFO AbstractLivelinessMonitor:127 - NMLivelinessMonitor thread interrupted
16:14:06,522 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,524 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,524 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,525 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,525 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,525 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,526 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,526 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,526 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,527 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
16:14:06,527 INFO AbstractLivelinessMonitor:127 - AMLivelinessMonitor thread interrupted
16:14:06,531 ERROR AbstractDelegationTokenSecretManager:557 - InterruptedExcpetion recieved for ExpiredTokenRemover thread java.lang.InterruptedException: sleep interrupted
16:14:06,531 INFO AbstractLivelinessMonitor:127 - org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer thread interrupted
16:14:06,527 INFO AbstractLivelinessMonitor:127 - AMLivelinessMonitor thread interrupted
16:14:06,532 INFO ResourceManager:900 - Transitioned to standby state
16:14:06,532 INFO ResourceManager:640 - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down ResourceManager at xxxx_my_server_hostname/x.x.x.x
************************************************************/
{noformat}

Subsequent startups result in an error that appears similar.

Before I try to wipe the state of this cluster, is there any debug info you'd like me to gather?

Note that the following warning also appears in the log above; I haven't gotten around to fixing it yet, and I'm not sure whether it's related to the crash.

{noformat}
18:33:45,463 WARN RMAppImpl:331 - The specific max attempts: 0 for application: 1 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead.
{noformat}

> NPE on ResourceManager recover
> ------------------------------
>
>                 Key: YARN-2223
>                 URL: https://issues.apache.org/jira/browse/YARN-2223
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.4.1
>            Reporter: Jon Bringhurst
>
> I upgraded two clusters from tag 2.2.0 to branch-2.4.1 (latest commit is
> https://github.com/apache/hadoop-common/commit/c96c8e45a60651b677a1de338b7856a444dc0461).
> Both clusters have the same config (other than hostnames). Both are running
> on JDK8u5 (I'm not sure if this is a factor here).
> One cluster started up without any errors.
> The other started up with the following error on the RM:
> {noformat}
> 18:33:45,463 WARN RMAppImpl:331 - The specific max attempts: 0 for application: 1 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead.
> 18:33:45,465 INFO RMAppImpl:651 - Recovering app: application_1398450350082_0001 with 8 attempts and final state = KILLED
> 18:33:45,468 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000001 with final state: KILLED
> 18:33:45,478 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000002 with final state: FAILED
> 18:33:45,478 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000003 with final state: FAILED
> 18:33:45,479 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000004 with final state: FAILED
> 18:33:45,479 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000005 with final state: FAILED
> 18:33:45,480 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000006 with final state: FAILED
> 18:33:45,480 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000007 with final state: FAILED
> 18:33:45,481 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_000008 with final state: FAILED
> 18:33:45,482 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000001 State change from NEW to KILLED
> 18:33:45,482 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000002 State change from NEW to FAILED
> 18:33:45,482 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000003 State change from NEW to FAILED
> 18:33:45,482 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000004 State change from NEW to FAILED
> 18:33:45,483 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000005 State change from NEW to FAILED
> 18:33:45,483 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000006 State change from NEW to FAILED
> 18:33:45,483 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000007 State change from NEW to FAILED
> 18:33:45,483 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_000008 State change from NEW to FAILED
> 18:33:45,485 INFO RMAppImpl:639 - application_1398450350082_0001 State change from NEW to KILLED
> 18:33:45,485 WARN RMAppImpl:331 - The specific max attempts: 0 for application: 2 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead.
> 18:33:45,485 INFO RMAppImpl:651 - Recovering app: application_1398450350082_0002 with 8 attempts and final state = KILLED
> 18:33:45,486 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000001 with final state: KILLED
> 18:33:45,486 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000002 with final state: FAILED
> 18:33:45,487 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000003 with final state: FAILED
> 18:33:45,487 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000004 with final state: FAILED
> 18:33:45,488 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000005 with final state: FAILED
> 18:33:45,488 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000006 with final state: FAILED
> 18:33:45,489 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000007 with final state: FAILED
> 18:33:45,489 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_000008 with final state: FAILED
> 18:33:45,490 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000001 State change from NEW to KILLED
> 18:33:45,490 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000002 State change from NEW to FAILED
> 18:33:45,490 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000003 State change from NEW to FAILED
> 18:33:45,490 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000004 State change from NEW to FAILED
> 18:33:45,491 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000005 State change from NEW to FAILED
> 18:33:45,491 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000006 State change from NEW to FAILED
> 18:33:45,491 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000007 State change from NEW to FAILED
> 18:33:45,491 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_000008 State change from NEW to FAILED
> 18:33:45,491 INFO RMAppImpl:639 - application_1398450350082_0002 State change from NEW to KILLED
> 18:33:45,492 WARN RMAppImpl:331 - The specific max attempts: 0 for application: 33 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead.
> 18:33:45,492 INFO RMAppImpl:651 - Recovering app: application_1401811496082_0033 with 2 attempts and final state = null
> 18:33:45,492 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1401811496082_0033_000001 with final state: FAILED
> 18:33:45,492 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1401811496082_0033_000002 with final state: null
> 18:33:45,493 INFO RMAppAttemptImpl:659 - appattempt_1401811496082_0033_000001 State change from NEW to FAILED
> 18:33:45,493 INFO RMAppAttemptImpl:659 - appattempt_1401811496082_0033_000002 State change from NEW to LAUNCHED
> 18:33:45,494 INFO RMAppImpl:639 - application_1401811496082_0033 State change from NEW to ACCEPTED
> 18:33:45,494 WARN RMAppImpl:331 - The specific max attempts: 0 for application: 1 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead.
> 18:33:45,494 INFO RMAppImpl:651 - Recovering app: application_1398453545406_0001 with 9 attempts and final state = null
> 18:33:45,495 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000001 with final state: FAILED
> 18:33:45,495 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000002 with final state: FAILED
> 18:33:45,496 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000003 with final state: FAILED
> 18:33:45,496 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000004 with final state: FAILED
> 18:33:45,496 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000005 with final state: FAILED
> 18:33:45,497 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000006 with final state: FAILED
> 18:33:45,497 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000007 with final state: FAILED
> 18:33:45,498 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398453545406_0001_000008 with final state: FAILED
> 18:33:45,499 ERROR ResourceManager:488 - Failed to load/recover state
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040)
> 18:33:45,500 INFO AbstractService:272 - Service RMActiveServices failed in state STARTED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040)
> 18:33:45,501 INFO MetricsSystemImpl:200 - Stopping ResourceManager metrics system...
> 18:33:45,502 INFO MetricsSystemImpl:206 - ResourceManager metrics system stopped.
> 18:33:45,502 INFO MetricsSystemImpl:572 - ResourceManager metrics system shutdown complete.
> 18:33:45,502 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
> 18:33:45,503 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
> 18:33:45,503 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
> 18:33:45,503 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
> 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
> 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
> 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
> 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
> 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
> 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
> 18:33:45,504 INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, igonring any new events.
> 18:33:45,505 INFO AbstractService:272 - Service ResourceManager failed in state STARTED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040)
> 18:33:45,505 INFO ResourceManager:891 - Transitioning to standby state
> 18:33:45,505 INFO ResourceManager:901 - Transitioned to standby state
> 18:33:45,505 FATAL ResourceManager:1042 - Error starting ResourceManager
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040)
> 18:33:45,509 INFO ResourceManager:640 - SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down ResourceManager at xxxxxmy_server_hostname/x.x.x.x
> ************************************************************/
> {noformat}
> Subsequent startups result in an error that appears similar.
> Before I try to wipe the state of this cluster, is there any debug info you'd
> like me to gather?
> Note that the following warning also appears in the log above; I haven't
> gotten around to fixing it yet, and I'm not sure whether it's related to
> the crash.
> {noformat}
> 18:33:45,463 WARN RMAppImpl:331 - The specific max attempts: 0 for application: 1 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead.
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.2#6252)