[ 
https://issues.apache.org/jira/browse/YARN-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523840#comment-14523840
 ] 

Jian He commented on YARN-2223:
-------------------------------

We have recently fixed a few similar NPE problems on RM recovery. This has 
probably been fixed by one of them. 
I'm closing this for now.  [~jonbringhurst], please feel free to reopen this if 
you still see this problem in the latest build.

> NPE on ResourceManager recover
> ------------------------------
>
>                 Key: YARN-2223
>                 URL: https://issues.apache.org/jira/browse/YARN-2223
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.4.1
>         Environment: JDK 8u5
>            Reporter: Jon Bringhurst
>
> I upgraded two clusters from tag 2.2.0 to branch-2.4.1 (latest commit is 
> https://github.com/apache/hadoop-common/commit/c96c8e45a60651b677a1de338b7856a444dc0461).
> Both clusters have the same config (other than hostnames). Both are running 
> on JDK8u5 (I'm not sure if this is a factor here).
> One cluster started up without any errors. The other started up with the 
> following error on the RM:
> {noformat}
> 18:33:45,463  WARN RMAppImpl:331 - The specific max attempts: 0 for 
> application: 1 is invalid, because it is out of the range [1, 50]. Use the 
> global max attempts instead.
> 18:33:45,465  INFO RMAppImpl:651 - Recovering app: 
> application_1398450350082_0001 with 8 attempts and final state = KILLED
> 18:33:45,468  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398450350082_0001_000001 with final state: KILLED
> 18:33:45,478  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398450350082_0001_000002 with final state: FAILED
> 18:33:45,478  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398450350082_0001_000003 with final state: FAILED
> 18:33:45,479  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398450350082_0001_000004 with final state: FAILED
> 18:33:45,479  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398450350082_0001_000005 with final state: FAILED
> 18:33:45,480  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398450350082_0001_000006 with final state: FAILED
> 18:33:45,480  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398450350082_0001_000007 with final state: FAILED
> 18:33:45,481  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398450350082_0001_000008 with final state: FAILED
> 18:33:45,482  INFO RMAppAttemptImpl:659 - 
> appattempt_1398450350082_0001_000001 State change from NEW to KILLED
> 18:33:45,482  INFO RMAppAttemptImpl:659 - 
> appattempt_1398450350082_0001_000002 State change from NEW to FAILED
> 18:33:45,482  INFO RMAppAttemptImpl:659 - 
> appattempt_1398450350082_0001_000003 State change from NEW to FAILED
> 18:33:45,482  INFO RMAppAttemptImpl:659 - 
> appattempt_1398450350082_0001_000004 State change from NEW to FAILED
> 18:33:45,483  INFO RMAppAttemptImpl:659 - 
> appattempt_1398450350082_0001_000005 State change from NEW to FAILED
> 18:33:45,483  INFO RMAppAttemptImpl:659 - 
> appattempt_1398450350082_0001_000006 State change from NEW to FAILED
> 18:33:45,483  INFO RMAppAttemptImpl:659 - 
> appattempt_1398450350082_0001_000007 State change from NEW to FAILED
> 18:33:45,483  INFO RMAppAttemptImpl:659 - 
> appattempt_1398450350082_0001_000008 State change from NEW to FAILED
> 18:33:45,485  INFO RMAppImpl:639 - application_1398450350082_0001 State 
> change from NEW to KILLED
> 18:33:45,485  WARN RMAppImpl:331 - The specific max attempts: 0 for 
> application: 2 is invalid, because it is out of the range [1, 50]. Use the 
> global max attempts instead.
> 18:33:45,485  INFO RMAppImpl:651 - Recovering app: 
> application_1398450350082_0002 with 8 attempts and final state = KILLED
> 18:33:45,486  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398450350082_0002_000001 with final state: KILLED
> 18:33:45,486  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398450350082_0002_000002 with final state: FAILED
> 18:33:45,487  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398450350082_0002_000003 with final state: FAILED
> 18:33:45,487  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398450350082_0002_000004 with final state: FAILED
> 18:33:45,488  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398450350082_0002_000005 with final state: FAILED
> 18:33:45,488  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398450350082_0002_000006 with final state: FAILED
> 18:33:45,489  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398450350082_0002_000007 with final state: FAILED
> 18:33:45,489  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398450350082_0002_000008 with final state: FAILED
> 18:33:45,490  INFO RMAppAttemptImpl:659 - 
> appattempt_1398450350082_0002_000001 State change from NEW to KILLED
> 18:33:45,490  INFO RMAppAttemptImpl:659 - 
> appattempt_1398450350082_0002_000002 State change from NEW to FAILED
> 18:33:45,490  INFO RMAppAttemptImpl:659 - 
> appattempt_1398450350082_0002_000003 State change from NEW to FAILED
> 18:33:45,490  INFO RMAppAttemptImpl:659 - 
> appattempt_1398450350082_0002_000004 State change from NEW to FAILED
> 18:33:45,491  INFO RMAppAttemptImpl:659 - 
> appattempt_1398450350082_0002_000005 State change from NEW to FAILED
> 18:33:45,491  INFO RMAppAttemptImpl:659 - 
> appattempt_1398450350082_0002_000006 State change from NEW to FAILED
> 18:33:45,491  INFO RMAppAttemptImpl:659 - 
> appattempt_1398450350082_0002_000007 State change from NEW to FAILED
> 18:33:45,491  INFO RMAppAttemptImpl:659 - 
> appattempt_1398450350082_0002_000008 State change from NEW to FAILED
> 18:33:45,491  INFO RMAppImpl:639 - application_1398450350082_0002 State 
> change from NEW to KILLED
> 18:33:45,492  WARN RMAppImpl:331 - The specific max attempts: 0 for 
> application: 33 is invalid, because it is out of the range [1, 50]. Use the 
> global max attempts instead.
> 18:33:45,492  INFO RMAppImpl:651 - Recovering app: 
> application_1401811496082_0033 with 2 attempts and final state = null
> 18:33:45,492  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1401811496082_0033_000001 with final state: FAILED
> 18:33:45,492  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1401811496082_0033_000002 with final state: null
> 18:33:45,493  INFO RMAppAttemptImpl:659 - 
> appattempt_1401811496082_0033_000001 State change from NEW to FAILED
> 18:33:45,493  INFO RMAppAttemptImpl:659 - 
> appattempt_1401811496082_0033_000002 State change from NEW to LAUNCHED
> 18:33:45,494  INFO RMAppImpl:639 - application_1401811496082_0033 State 
> change from NEW to ACCEPTED
> 18:33:45,494  WARN RMAppImpl:331 - The specific max attempts: 0 for 
> application: 1 is invalid, because it is out of the range [1, 50]. Use the 
> global max attempts instead.
> 18:33:45,494  INFO RMAppImpl:651 - Recovering app: 
> application_1398453545406_0001 with 9 attempts and final state = null
> 18:33:45,495  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398453545406_0001_000001 with final state: FAILED
> 18:33:45,495  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398453545406_0001_000002 with final state: FAILED
> 18:33:45,496  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398453545406_0001_000003 with final state: FAILED
> 18:33:45,496  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398453545406_0001_000004 with final state: FAILED
> 18:33:45,496  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398453545406_0001_000005 with final state: FAILED
> 18:33:45,497  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398453545406_0001_000006 with final state: FAILED
> 18:33:45,497  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398453545406_0001_000007 with final state: FAILED
> 18:33:45,498  INFO RMAppAttemptImpl:691 - Recovering attempt: 
> appattempt_1398453545406_0001_000008 with final state: FAILED
> 18:33:45,499 ERROR ResourceManager:488 - Failed to load/recover state
> java.lang.NullPointerException
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
>       at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915)
>       at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040)
> 18:33:45,500  INFO AbstractService:272 - Service RMActiveServices failed in 
> state STARTED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
>       at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915)
>       at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040)
> 18:33:45,501  INFO MetricsSystemImpl:200 - Stopping ResourceManager metrics 
> system...
> 18:33:45,502  INFO MetricsSystemImpl:206 - ResourceManager metrics system 
> stopped.
> 18:33:45,502  INFO MetricsSystemImpl:572 - ResourceManager metrics system 
> shutdown complete.
> 18:33:45,502  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, 
> igonring any new events.
> 18:33:45,503  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, 
> igonring any new events.
> 18:33:45,503  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, 
> igonring any new events.
> 18:33:45,503  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, 
> igonring any new events.
> 18:33:45,504  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, 
> igonring any new events.
> 18:33:45,504  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, 
> igonring any new events.
> 18:33:45,504  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, 
> igonring any new events.
> 18:33:45,504  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, 
> igonring any new events.
> 18:33:45,504  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, 
> igonring any new events.
> 18:33:45,504  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, 
> igonring any new events.
> 18:33:45,504  INFO AsyncDispatcher:138 - AsyncDispatcher is draining to stop, 
> igonring any new events.
> 18:33:45,505  INFO AbstractService:272 - Service ResourceManager failed in 
> state STARTED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
>       at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915)
>       at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040)
> 18:33:45,505  INFO ResourceManager:891 - Transitioning to standby state
> 18:33:45,505  INFO ResourceManager:901 - Transitioned to standby state
> 18:33:45,505 FATAL ResourceManager:1042 - Error starting ResourceManager
> java.lang.NullPointerException
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:692)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:660)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
>       at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:874)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:871)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:871)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:915)
>       at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1040)
> 18:33:45,509  INFO ResourceManager:640 - SHUTDOWN_MSG: 
> /************************************************************
> SHUTDOWN_MSG: Shutting down ResourceManager at xxxxxmy_server_hostname/x.x.x.x
> ************************************************************/
> {noformat}
> Subsequent startups result in an error that appears similar.
> Before I try to wipe the state of this cluster, is there any debug info you'd 
> like me to gather?
> Note that the following warning also appears in the log above. I haven't 
> gotten around to fixing it yet, and I'm not sure if it's related to the crash.
> {noformat}
> 18:33:45,463  WARN RMAppImpl:331 - The specific max attempts: 0 for 
> application: 1 is invalid, because it is out of the range [1, 50]. Use the 
> global max attempts instead.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
