[jira] [Commented] (YARN-2223) NPE on ResourceManager recover
[ https://issues.apache.org/jira/browse/YARN-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538005#comment-17538005 ]

tony ke commented on YARN-2223:
---

"We have fixed a few similar NPE on RM recovery problems recently" - would you share the issue URL with me? [~jianhe]

> NPE on ResourceManager recover
> --
>
> Key: YARN-2223
> URL: https://issues.apache.org/jira/browse/YARN-2223
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.4.1
> Environment: JDK 8u5
> Reporter: Jon Bringhurst
> Priority: Major
>
> I upgraded two clusters from tag 2.2.0 to branch-2.4.1 (latest commit is https://github.com/apache/hadoop-common/commit/c96c8e45a60651b677a1de338b7856a444dc0461).
> Both clusters have the same config (other than hostnames). Both are running on JDK8u5 (I'm not sure if this is a factor here).
> One cluster started up without any errors. The other started up with the following error on the RM:
> {noformat}
> 18:33:45,463 WARN RMAppImpl:331 - The specific max attempts: 0 for application: 1 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead.
> 18:33:45,465 INFO RMAppImpl:651 - Recovering app: application_1398450350082_0001 with 8 attempts and final state = KILLED
> 18:33:45,468 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_01 with final state: KILLED
> 18:33:45,478 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_02 with final state: FAILED
> 18:33:45,478 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_03 with final state: FAILED
> 18:33:45,479 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_04 with final state: FAILED
> 18:33:45,479 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_05 with final state: FAILED
> 18:33:45,480 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_06 with final state: FAILED
> 18:33:45,480 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_07 with final state: FAILED
> 18:33:45,481 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0001_08 with final state: FAILED
> 18:33:45,482 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_01 State change from NEW to KILLED
> 18:33:45,482 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_02 State change from NEW to FAILED
> 18:33:45,482 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_03 State change from NEW to FAILED
> 18:33:45,482 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_04 State change from NEW to FAILED
> 18:33:45,483 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_05 State change from NEW to FAILED
> 18:33:45,483 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_06 State change from NEW to FAILED
> 18:33:45,483 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_07 State change from NEW to FAILED
> 18:33:45,483 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0001_08 State change from NEW to FAILED
> 18:33:45,485 INFO RMAppImpl:639 - application_1398450350082_0001 State change from NEW to KILLED
> 18:33:45,485 WARN RMAppImpl:331 - The specific max attempts: 0 for application: 2 is invalid, because it is out of the range [1, 50]. Use the global max attempts instead.
> 18:33:45,485 INFO RMAppImpl:651 - Recovering app: application_1398450350082_0002 with 8 attempts and final state = KILLED
> 18:33:45,486 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_01 with final state: KILLED
> 18:33:45,486 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_02 with final state: FAILED
> 18:33:45,487 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_03 with final state: FAILED
> 18:33:45,487 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_04 with final state: FAILED
> 18:33:45,488 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_05 with final state: FAILED
> 18:33:45,488 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_06 with final state: FAILED
> 18:33:45,489 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_07 with final state: FAILED
> 18:33:45,489 INFO RMAppAttemptImpl:691 - Recovering attempt: appattempt_1398450350082_0002_08 with final state: FAILED
> 18:33:45,490 INFO RMAppAttemptImpl:659 - appattempt_1398450350082_0002_01 State change from NEW to KILLED
> 18:33:45,490 INFO RMAppAttemptIm
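
The WARN lines in the log above describe a range check on the per-application attempt limit: a requested value outside [1, 50] is discarded and the global limit is used instead. The following is a minimal sketch of that rule as the message words it, not the actual RMAppImpl code. The function name and both parameters are made up for illustration, and treating the upper bound 50 as the global limit is an assumption read off the log message.

```python
def effective_max_attempts(requested: int, global_max: int) -> int:
    """Return the attempt limit to apply to one application.

    Hypothetical helper: mirrors the rule stated in the WARN message,
    assuming the valid range is [1, global_max].
    """
    if 1 <= requested <= global_max:
        # The per-application value is within range, so it is honoured.
        return requested
    # Out of range (for example 0, as in the log): fall back to the global limit.
    return global_max


if __name__ == "__main__":
    # The log reports a requested value of 0 with a range of [1, 50].
    print(effective_max_attempts(0, 50))
```

Under these assumptions the warning by itself is harmless: the recovered applications simply fall back to the global limit, so it is unlikely to be the source of the NPE.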
[jira] [Commented] (YARN-2223) NPE on ResourceManager recover
[ https://issues.apache.org/jira/browse/YARN-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523853#comment-14523853 ]

Jon Bringhurst commented on YARN-2223:
--

Hey [~jianhe], that sounds good to me -- I haven't seen this problem in a long time. We're running 2.6.0 now.
[jira] [Commented] (YARN-2223) NPE on ResourceManager recover
[ https://issues.apache.org/jira/browse/YARN-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523840#comment-14523840 ]

Jian He commented on YARN-2223:
---

We have fixed a few similar NPE on RM recovery problems recently, and this one was probably fixed by one of them. I'm closing this for now. [~jonbringhurst], please feel free to reopen it if you still see the problem in the latest build.
[jira] [Commented] (YARN-2223) NPE on ResourceManager recover
[ https://issues.apache.org/jira/browse/YARN-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047017#comment-14047017 ]

Jian He commented on YARN-2223:
---

It looks like some attempt data is missing. Can you list the attempt files under the state-store directory for application_1398453545406_0001?
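
For anyone wanting to answer the request above, here is a rough sketch of listing the attempt entries stored for one application. It assumes the state store is file-based and reachable as a local or mounted directory, and that attempt entries sit directly under the application's directory with names starting with `appattempt_`, as the attempt IDs in the log suggest. The function name and the directory layout are assumptions for illustration, not something confirmed in this thread.

```python
import sys
from pathlib import Path


def list_attempt_entries(app_dir: Path) -> list[str]:
    """Return the sorted names of attempt entries found directly under app_dir.

    Hypothetical helper: assumes one entry per attempt, named 'appattempt_...'.
    """
    return sorted(
        entry.name
        for entry in app_dir.iterdir()
        if entry.name.startswith("appattempt_")
    )


if __name__ == "__main__":
    # Usage: python list_attempts.py <path to the application's state-store directory>
    for name in list_attempt_entries(Path(sys.argv[1])):
        print(name)
```

Comparing the number of entries printed with the attempt count the RM reports for the application ("with 8 attempts" in the log) would show whether any attempt data is missing.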