[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021228#comment-17021228 ] Hadoop QA commented on YARN-7913:

(/) +1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 14m 15s | Docker mode activated. |
|| Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| branch-3.1 Compile Tests ||
| +1 | mvninstall | 23m 4s | branch-3.1 passed |
| +1 | compile | 0m 43s | branch-3.1 passed |
| +1 | checkstyle | 0m 38s | branch-3.1 passed |
| +1 | mvnsite | 0m 48s | branch-3.1 passed |
| +1 | shadedclient | 12m 37s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 12s | branch-3.1 passed |
| +1 | javadoc | 0m 34s | branch-3.1 passed |
|| Patch Compile Tests ||
| +1 | mvninstall | 0m 44s | the patch passed |
| +1 | compile | 0m 38s | the patch passed |
| +1 | javac | 0m 38s | the patch passed |
| +1 | checkstyle | 0m 30s | the patch passed |
| +1 | mvnsite | 0m 40s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 12m 1s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 15s | the patch passed |
| +1 | javadoc | 0m 28s | the patch passed |
|| Other Tests ||
| +1 | unit | 67m 14s | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 | asflicense | 0m 29s | The patch does not generate ASF License warnings. |
| | | 137m 44s | |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:70a0ef5d4a6 |
| JIRA Issue | YARN-7913 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12991516/YARN-7913-branch-3.1.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 53efb2016a9f 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | branch-3.1 / 96c653d |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_232 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/25418/testReport/ |
| Max. process+thread count | 763 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/25418/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021222#comment-17021222 ] Szilard Nemeth commented on YARN-7913: Thanks [~wilfreds] for the other patches, committed them to their respective branches. Closing this jira.

> Improve error handling when application recovery fails with exception
> ---------------------------------------------------------------------
>                 Key: YARN-7913
>                 URL: https://issues.apache.org/jira/browse/YARN-7913
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 3.0.0
>            Reporter: Gergo Repas
>            Assignee: Wilfred Spiegelenburg
>            Priority: Major
>             Fix For: 3.3.0, 3.2.2, 3.1.4
>
>         Attachments: YARN-7913-branch-3.1.001.patch, YARN-7913-branch-3.2.001.patch, YARN-7913.000.poc.patch, YARN-7913.001.patch, YARN-7913.002.patch, YARN-7913.003.patch
>
> There are edge cases in which application recovery fails with an exception. Example failure scenario:
> * Setup: a queue is a leaf queue in the primary RM's config, while the same queue is a parent queue in the secondary RM's config.
> * When failover happens with this setup, recovery fails for the applications on this queue and an APP_REJECTED event is dispatched to the async dispatcher. On the same thread (the one handling the recovery), a NullPointerException is thrown when the applicationAttempt is recovered (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
> I don't see a good way to avoid the NPE in this scenario, because when the NPE occurs the APP_REJECTED event has not been processed yet, so we don't yet know that the application recovery failed.
> Currently the first exception aborts the recovery, so with X applications there will be ~X passive -> active RM transition attempts; the passive -> active transition only succeeds once the last APP_REJECTED event has been processed on the async dispatcher thread.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
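The abort-on-first-failure behaviour described in the issue, and the "catch, reject, and keep recovering" alternative the fix aims at, can be sketched as follows. This is a hypothetical, simplified illustration of the pattern only; the class, field, and method names below are invented for the sketch and are not the code from the YARN-7913 patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Sketch: recover applications one by one, collecting per-application
 * failures instead of letting the first exception abort the whole RM
 * recovery (the failure mode described in this issue).
 */
public class RecoverySketch {

    static class App {
        final String id;
        final boolean queueIsParent; // simulates the leaf-vs-parent queue mismatch

        App(String id, boolean queueIsParent) {
            this.id = id;
            this.queueIsParent = queueIsParent;
        }
    }

    /** Stand-in for recovering a single application on the scheduler. */
    static void recoverOne(App app) {
        if (app.queueIsParent) {
            // In the real scenario this surfaces as a NullPointerException
            // during applicationAttempt recovery in the FairScheduler.
            throw new IllegalStateException(
                "cannot recover app " + app.id + ": queue is a parent queue");
        }
    }

    /** Returns the ids of applications whose recovery failed. */
    static List<String> recoverAll(List<App> apps) {
        List<String> failed = new ArrayList<>();
        for (App app : apps) {
            try {
                recoverOne(app);
            } catch (RuntimeException e) {
                // Log the failure and continue: one bad app no longer aborts
                // the passive -> active transition for all the others.
                failed.add(app.id);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<App> apps = Arrays.asList(
            new App("app-1", false),
            new App("app-2", true),   // sits on the mis-configured queue
            new App("app-3", false));
        List<String> failed = recoverAll(apps);
        System.out.println("recovered=" + (apps.size() - failed.size())
            + " failed=" + failed);
    }
}
```

With the original abort-on-first-exception behaviour, `app-3` would never be attempted after `app-2` fails; with the loop above, every application is tried once and the failures are reported together.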
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021104#comment-17021104 ] Hadoop QA commented on YARN-7913:

(/) +1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 19m 8s | Docker mode activated. |
|| Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| branch-3.1 Compile Tests ||
| +1 | mvninstall | 20m 29s | branch-3.1 passed |
| +1 | compile | 0m 45s | branch-3.1 passed |
| +1 | checkstyle | 0m 34s | branch-3.1 passed |
| +1 | mvnsite | 0m 47s | branch-3.1 passed |
| +1 | shadedclient | 13m 29s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 22s | branch-3.1 passed |
| +1 | javadoc | 0m 26s | branch-3.1 passed |
|| Patch Compile Tests ||
| +1 | mvninstall | 0m 46s | the patch passed |
| +1 | compile | 0m 42s | the patch passed |
| +1 | javac | 0m 42s | the patch passed |
| +1 | checkstyle | 0m 30s | the patch passed |
| +1 | mvnsite | 0m 46s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 13m 22s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 20s | the patch passed |
| +1 | javadoc | 0m 25s | the patch passed |
|| Other Tests ||
| +1 | unit | 71m 20s | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 | asflicense | 0m 32s | The patch does not generate ASF License warnings. |
| | | 146m 30s | |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:70a0ef5d4a6 |
| JIRA Issue | YARN-7913 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12991516/YARN-7913-branch-3.1.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 053237d4c20a 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | branch-3.1 / 96c653d |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_232 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/25417/testReport/ |
| Max. process+thread count | 810 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/25417/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020988#comment-17020988 ] Szilard Nemeth commented on YARN-7913: Thanks [~wilfreds], makes sense. Triggered a build for the branch-3.1 patch, and also re-uploaded the patch so Jenkins will pick it up instead of the 3.2 patch.
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020662#comment-17020662 ] Wilfred Spiegelenburg commented on YARN-7913: The test failures are not related; the branch-3.2 fix seems to be good to go. You might also need to trigger the branch-3.1 build, as the backport is slightly different because of YARN-8248.
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020444#comment-17020444 ] Hadoop QA commented on YARN-7913:

(x) -1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 42s | Docker mode activated. |
|| Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| branch-3.2 Compile Tests ||
| +1 | mvninstall | 25m 23s | branch-3.2 passed |
| +1 | compile | 0m 39s | branch-3.2 passed |
| +1 | checkstyle | 0m 38s | branch-3.2 passed |
| +1 | mvnsite | 0m 43s | branch-3.2 passed |
| +1 | shadedclient | 13m 50s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 13s | branch-3.2 passed |
| +1 | javadoc | 0m 33s | branch-3.2 passed |
|| Patch Compile Tests ||
| +1 | mvninstall | 0m 45s | the patch passed |
| +1 | compile | 0m 36s | the patch passed |
| +1 | javac | 0m 36s | the patch passed |
| +1 | checkstyle | 0m 30s | the patch passed |
| +1 | mvnsite | 0m 39s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 13m 29s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 16s | the patch passed |
| +1 | javadoc | 0m 25s | the patch passed |
|| Other Tests ||
| -1 | unit | 82m 12s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | asflicense | 0m 30s | The patch does not generate ASF License warnings. |
| | | 143m 53s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisherForV2 |
| | hadoop.yarn.server.resourcemanager.metrics.TestCombinedSystemMetricsPublisher |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:0f25cbbb251 |
| JIRA Issue | YARN-7913 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12991444/YARN-7913-branch-3.2.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux d2773ddcb8af 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | branch-3.2 / c36fbcb |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_232 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-YARN-Build/25413/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/25413/testReport/ |
| Max. process+thread count | 793 (vs. ulimit o… |
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020316#comment-17020316 ] Szilard Nemeth commented on YARN-7913: Thanks [~wilfreds]. Triggered a build, as the precommit job did not pick up the patch.
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020242#comment-17020242 ] Wilfred Spiegelenburg commented on YARN-7913: Added branch-3.1 and branch-3.2 patches.
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019444#comment-17019444 ] Hudson commented on YARN-7913: SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17878 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17878/]) YARN-7913. Improve error handling when application recovery fails with exception (snemeth: rev 581072a8f04f7568d3560f105fd1988d3acc9e54)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerTestBase.java
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019440#comment-17019440 ] Szilard Nemeth commented on YARN-7913: Hi [~wilfreds], do you want to add branch-3.2 / branch-3.1 patches as well?
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019438#comment-17019438 ] Szilard Nemeth commented on YARN-7913: -- Committed to trunk. Thanks [~wilfreds] for the contribution and [~grepas] for the initial testing and the PoC.
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019436#comment-17019436 ] Szilard Nemeth commented on YARN-7913: -- Hi [~wilfreds],
1. I see.
2. Ack'd.
3. I was most probably tired when checking the code, as I hadn't noticed the return statement in this block:
{code:java}
if (!isAppRecovering) {
  rejectApplicationWithMessage(applicationId,
      queueName + " is not a leaf queue");
  return;
}
{code}
So all good with this.
4. Thanks.
5. Okay, I see the point now, got it.
6. Sure, just wanted to ask you about this.
So overall, patch LGTM, committing to trunk.
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015884#comment-17015884 ] Peter Bacsko commented on YARN-7913: +1 (non-binding). Patch looks good to me with all the explanation comments.
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015435#comment-17015435 ] Wilfred Spiegelenburg commented on YARN-7913: - Thank you [~snemeth] for looking at the patch.
1) Yes, it is a different approach: we now do a proper restore instead of failing the app and/or crashing the RM.
2) Removed that line; it does not make sense to throw errors and fail the app but not fail the RM.
3) You are correct in your assumption about the flag. It is checked in the case you are mentioning. This is the full change in that area:
{code}
487   if (!isAppRecovering) {
488     rejectApplicationWithMessage(applicationId,
489         queueName + " is not a leaf queue");
490     return;
491   }
492   // app is recovering we do not want to fail the app now as it was there
493   // before we started the recovery. Add it to the recovery queue:
494   // dynamic queue directly under root, no ACL needed (auto clean up)
495   queueName = "root.recovery";
496   queue = queueMgr.getLeafQueue(queueName, true, applicationId);
{code}
The flag is already checked on line 487 of the change. By the time we reach the creation of the recovery queue (lines 492-496) we are recovering; if we were not recovering, we would have rejected the app and left the method via the return on line 490.
4) Will fix the comment.
5) Yes, we only want to check newly added apps that are _not_ recovering. During recovery there is a large chance that no NMs have registered yet. When the maximum size of a queue is configured as a percentage of cluster resources, the size check of the AM resource would then fail. Since the AM was/is already running, we need to perform that check only when we are not recovering the app.
6) The largest number of statements inside an if..else I can see is 6 lines. I am not sure refactoring the method would make it any clearer. I also don't see any consistency in the checks that would allow us to change the flow easily and improve clarity.
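The branching Wilfred describes in point 3) can be condensed into a small, self-contained sketch. Note that `RecoveryRouting`, `placeApplication`, and the `rejections` map below are illustrative stand-ins for the real FairScheduler, `rejectApplicationWithMessage`, and the event dispatch machinery; only the decision logic mirrors the patch.

```java
import java.util.HashMap;
import java.util.Map;

public class RecoveryRouting {

    // Dynamic leaf queue directly under root used for recovered apps.
    static final String RECOVERY_QUEUE = "root.recovery";

    // Returns the queue the application ends up in, or null if rejected.
    // 'rejections' collects the rejection messages (a stand-in for the
    // APP_REJECTED event dispatch in the real scheduler).
    static String placeApplication(String appId, String queueName,
            boolean isLeafQueue, boolean isAppRecovering,
            Map<String, String> rejections) {
        if (!isLeafQueue) {
            if (!isAppRecovering) {
                // A new submission targeting a parent queue is rejected.
                rejections.put(appId, queueName + " is not a leaf queue");
                return null;
            }
            // The app existed before recovery started, so do not fail it
            // now: divert it to the dynamic recovery queue instead. No ACL
            // check is needed because the app was already accepted before.
            queueName = RECOVERY_QUEUE;
        }
        return queueName;
    }

    public static void main(String[] args) {
        Map<String, String> rejections = new HashMap<>();
        // Fresh app on a parent queue: rejected (returns null).
        System.out.println(placeApplication("app-1", "root.parent", false, false, rejections));
        // Recovering app on the same queue: diverted to root.recovery.
        System.out.println(placeApplication("app-2", "root.parent", false, true, rejections));
        // Normal leaf-queue submission: placed as requested.
        System.out.println(placeApplication("app-3", "root.leaf", true, false, rejections));
    }
}
```

The key property of this flow is that the `return` after the rejection guarantees the recovery-queue code is only ever reached for recovering apps, which is the point Wilfred makes about lines 487 and 490.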
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015009#comment-17015009 ] Szilard Nemeth commented on YARN-7913: -- Hi [~wilfreds], I have read through all the comments and checked your patch.
1. IIUC, your approach is a new one compared to Gergo's previous PoC patch, right?
2. The description says: "The point of this ticket is to improve the error handling and reduce the number of passive -> active RM transition attempts (solving the above described failure scenario isn't in scope)." That is not in line with the patch you uploaded. I think it would be better to remove this statement, or add something to the description that describes what your patch actually does; this would give future readers of this jira a better experience. What do you think?
3. FairScheduler#addApplication is invoked by processing the SchedulerEventType#APP_ADDED event. This event type is used by AppAddedSchedulerEvent. The constructors of this event class are called from RMAppImpl when an application is added or recovered. I can see the following calls in RMAppImpl:
{code}
1. app.handler.handle(
     new AppAddedSchedulerEvent(app.user, app.submissionContext, false,
         app.applicationPriority, app.placementContext));
2. app.scheduler.handle(
     new AppAddedSchedulerEvent(app.user, app.submissionContext, false,
         app.applicationPriority, app.placementContext));
3. // Add application to scheduler synchronously to guarantee scheduler
   // knows applications before AM or NM re-registers.
   app.scheduler.handle(
     new AppAddedSchedulerEvent(app.user, app.submissionContext, true,
         app.applicationPriority, app.placementContext));
{code}
The boolean parameter sets the isAppRecovering field of the event, which is why I concluded that both newly created and recovered apps are sent to the event dispatcher.
Based on that, the code you added to FairScheduler#addApplication seems a bit incorrect:
{code}
// app is recovering we do not want to fail the app now as it was there
// before we started the recovery. Add it to the recovery queue:
// dynamic queue directly under root, no ACL needed (auto clean up)
queueName = "root.recovery";
queue = queueMgr.getLeafQueue(queueName, true, applicationId);
{code}
Shouldn't we check the isAppRecovering flag and only run this code block if the flag is true, meaning the app is recovering?
4. Nit: the comment is missing at least a comma:
{code}
// Skip ACL check for recovering applications: they have been accepted
// in the queue already recovery should not break that.
{code}
5. One thing I don't fully understand:
{code}
// when recovering the NMs might not have registered and we could have
// no resources in the queue, the app is already running and has thus
// passed all these checks, skip them now.
if (!isAppRecovering && rmApp != null &&
    rmApp.getAMResourceRequests() != null) {
{code}
Is this condition for checking the apps that are NOT recovering?
6. I can imagine an improvement that could be filed as a follow-up jira: I had a hard time understanding the conditions inside FairScheduler#addApplication. For example, the statements inside the if-else branches could be extracted into methods with meaningful names, so code readers would understand right away what happens just by reading the code; if they are interested in the details of a particular case, they can open the method anyway. Can you file a jira for this? Thanks!
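The event flow Szilard reasons about in point 3 — one APP_ADDED event type carrying a recovery flag for both fresh submissions and recovered apps — can be illustrated with a stripped-down model. `AppAddedEventModel` and its nested event class below are illustrative only, not the real AppAddedSchedulerEvent API; they keep just the `isAppRecovering` field that the discussion hinges on.

```java
public class AppAddedEventModel {

    // Minimal stand-in for an APP_ADDED-style scheduler event: the same
    // event class is used for submission and recovery, distinguished
    // only by the boolean flag set in its constructor.
    static final class AppAddedEvent {
        final String applicationId;
        final boolean isAppRecovering;

        AppAddedEvent(String applicationId, boolean isAppRecovering) {
            this.applicationId = applicationId;
            this.isAppRecovering = isAppRecovering;
        }
    }

    // The scheduler-side handler can only tell the two cases apart via
    // the flag, since both paths deliver the same event type.
    static String handle(AppAddedEvent event) {
        return event.isAppRecovering
            ? "recover " + event.applicationId  // added synchronously during RM recovery
            : "submit " + event.applicationId;  // normal submission path
    }

    public static void main(String[] args) {
        System.out.println(handle(new AppAddedEvent("app-1", false)));
        System.out.println(handle(new AppAddedEvent("app-2", true)));
    }
}
```

This is why the review question matters: any code path inside the handler that should run only for recovered apps has to consult the flag explicitly, because nothing else about the event distinguishes the two cases.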
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009551#comment-17009551 ] Wilfred Spiegelenburg commented on YARN-7913: - [~snemeth] and [~sunilg] can you have a look at the change please?
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009382#comment-17009382 ] Hadoop QA commented on YARN-7913: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 35s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 24s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 35s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 44s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 59s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 12s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 42s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 44s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 27s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 86m 56s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}143m 20s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:c44943d1fc3 | | JIRA Issue | YARN-7913 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12990058/YARN-7913.003.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 5d59e5f1e264 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 59aac00 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_232 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/25338/testReport/ | | Max. process+thread count | 827 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/25338/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated.
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009329#comment-17009329 ] Wilfred Spiegelenburg commented on YARN-7913: - The ACL test case did not fail in the expected way when reversing the fix, as the ACLs were not correctly set up in the test config. Updating the patch with the proper config: [^YARN-7913.003.patch]
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009304#comment-17009304 ] Hadoop QA commented on YARN-7913: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 42s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 28s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 45s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 15s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 17s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 42s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 46s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 21s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 85m 49s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}143m 32s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:c44943d1fc3 | | JIRA Issue | YARN-7913 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12990050/YARN-7913.002.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 7fbf987bdaec 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 819159f | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_232 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/25337/testReport/ | | Max. process+thread count | 830 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/25337/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated.
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009240#comment-17009240 ] Wilfred Spiegelenburg commented on YARN-7913: - new patch to fix javac commenting on deprecated method use and checkstyle issues (IDE was not setup correctly) > Improve error handling when application recovery fails with exception > - > > Key: YARN-7913 > URL: https://issues.apache.org/jira/browse/YARN-7913 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.0.0 >Reporter: Gergo Repas >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-7913.000.poc.patch, YARN-7913.001.patch, > YARN-7913.002.patch > > > There are edge cases when the application recovery fails with an exception. > Example failure scenario: > * setup: a queue is a leaf queue in the primary RM's config and the same > queue is a parent queue in the secondary RM's config. > * When failover happens with this setup, the recovery will fail for > applications on this queue, and an APP_REJECTED event will be dispatched to > the async dispatcher. On the same thread (that handles the recovery), a > NullPointerException is thrown when the applicationAttempt is tried to be > recovered > (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494). > I don't see a good way to avoid the NPE in this scenario, because when the > NPE occurs the APP_REJECTED has not been processed yet, and we don't know > that the application recovery failed. 
> Currently the first exception will abort the recovery, and if there are X > applications, there will be ~X passive -> active RM transition attempts - the > passive -> active RM transition will only succeed when the last APP_REJECTED > event is processed on the async dispatcher thread. > _The point of this ticket is to improve the error handling and reduce the > number of passive -> active RM transition attempts (solving the above > described failure scenario isn't in scope)._ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
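The abort-on-first-exception behaviour described above can be modeled outside of Hadoop. This is a minimal sketch, not RM code, and the class and method names are made up for illustration: each passive -> active transition attempt aborts at the first application whose APP_REJECTED event has not yet been processed, so X failing applications cost roughly X+1 transition attempts.

```java
/**
 * Toy model (not Hadoop code) of the recovery behaviour described in
 * the jira: every transition attempt aborts at the first application
 * whose rejection has not yet been processed by the async dispatcher,
 * and that abort is what lets one more APP_REJECTED event be handled.
 */
public class RecoverySketch {
    /** Returns the number of passive -> active transition attempts
     *  needed until no failing application aborts the recovery. */
    static int transitionAttempts(int failingApps) {
        int rejected = 0;      // APP_REJECTED events processed so far
        int attempts = 0;
        boolean aborted = true;
        while (aborted) {
            attempts++;
            aborted = false;
            for (int app = 0; app < failingApps; app++) {
                if (app >= rejected) {
                    // rejection not processed yet: recovery hits the
                    // NPE, aborts, and the event is handled afterwards
                    rejected++;
                    aborted = true;
                    break;
                }
            }
        }
        return attempts;
    }
}
```

With 4 failing applications this model takes 5 attempts before one succeeds, consistent with the "~X attempts" estimate in the description.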
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008838#comment-17008838 ] Hadoop QA commented on YARN-7913: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 34s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 48s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 41s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 45s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 29s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 23s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 40s{color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager generated 1 new + 27 unchanged - 0 fixed = 28 total (was 27) {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 31s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 25 new + 197 unchanged - 0 fixed = 222 total (was 197) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 41s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 19s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 87m 33s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 26s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}146m 34s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:c44943d1fc3 | | JIRA Issue | YARN-7913 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12989995/YARN-7913.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux e82176f64e53 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 4a76ab7 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_232 | | findbugs | v3.1.0-RC1 | | javac | https://builds.apache.org/job/PreCommit-YARN-Build/25331/artifact/out/diff-compile-javac-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/25331/artifact/out/diff-checkstyle-hadoop-yar
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483563#comment-16483563 ] Oleksandr Shevchenko commented on YARN-7913: Thank you [~wilfreds] for your comment. I created a separate JIRA ticket, YARN-7998, to solve the problem related to ACL configuration changes.
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483490#comment-16483490 ] Wilfred Spiegelenburg commented on YARN-7913: - Just ran into this issue and stumbled onto this jira. First point: I do not see how simply continuing on really solves the issue. We are recovering running applications when this issue occurs. If we have failed or finished applications we shortcut and process them differently; I fixed restoring failed and finished applications into the proper queues in YARN-7139. Do we really want to fail recovery of a running application when we have a scheduler configuration that is out of sync? We do not allow the removal of a queue that is not empty, and we handle other configuration changes gracefully; that is all covered in YARN-8191. To now say we just dump a running application and forget about it is not the correct thing. I think we should handle this gracefully and properly restore the running application. As described by [~oshevchenko], we should take the ACLs, queues, etc. into account. YARN-2308 handles this for the capacity scheduler by stating that you MUST have all the queues from the previous state in the system. We can do something like that, or something more dynamic. I worked with Yufei on a possible solution around this a while back that fits in with the dynamic way we work with rules; I am happy to add a poc for it to this jira.
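The "more dynamic" handling mentioned in this comment could, for example, fall back to a default queue instead of rejecting a running application whose queue is gone. This is purely a hypothetical sketch of that idea; it is not FairScheduler code, and the names are invented for illustration.

```java
import java.util.Set;

/**
 * Hypothetical recovery-time placement fallback (illustrative only,
 * not what FairScheduler currently does): keep the application's
 * original queue if it is still a leaf queue in the new configuration,
 * otherwise place the app into a default leaf queue rather than
 * rejecting a running application.
 */
public class PlacementFallbackSketch {
    static String placeOnRecovery(String requestedQueue,
                                  Set<String> leafQueues,
                                  String defaultQueue) {
        return leafQueues.contains(requestedQueue)
            ? requestedQueue   // queue survived the config change
            : defaultQueue;    // queue removed or turned into a parent
    }
}
```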
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16425543#comment-16425543 ] Gergo Repas commented on YARN-7913: --- [~oshevchenko] I see, that's a good point that ACL changes may cause similar issues. My proposal for this ticket is to improve the error handling in general: try to recover all applications during a passive-to-active RM transition attempt even in the presence of exceptions (logging all exceptions and rethrowing the last one), and let the async event processing take care of moving the application attempt to the rejected state. This way we would have only a couple of passive-to-active transition attempts (depending on how long the async event processing takes), instead of N+1, where N is the number of rejected applications. This is to improve the handling of any unforeseen exceptions. What do you think?
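The proposal above (recover everything, log each failure, rethrow only the last exception) can be sketched generically. This is a minimal illustration under assumed names, not the actual RMAppManager code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Sketch of the proposed error handling: attempt to recover every
 * stored application, collect failures instead of aborting at the
 * first one, and rethrow the last failure after the loop so a single
 * transition attempt still surfaces the error exactly once.
 */
public class RecoverAllSketch {
    static <T> void recoverAll(List<T> apps, Consumer<T> recover) {
        List<RuntimeException> failures = new ArrayList<>();
        for (T app : apps) {
            try {
                recover.accept(app);
            } catch (RuntimeException e) {
                failures.add(e);   // in the RM this would be logged
            }
        }
        if (!failures.isEmpty()) {
            // surface only the last failure after trying everything
            throw failures.get(failures.size() - 1);
        }
    }
}
```

The key difference from the current behaviour is that the exception is thrown after the loop, so every application's APP_REJECTED event has already been dispatched by the time the transition attempt fails.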
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383462#comment-16383462 ] Oleksandr Shevchenko commented on YARN-7913: I faced the same issue: the RM failed with an NPE during failover after the FairScheduler configuration was changed. The application was not finished yet, so its final state = null, and the last app attempt does not have a final state either.
{noformat}
2018-02-28 15:50:51,576 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1517497680557_565955 *with 2 attempts and final state = null*
2018-02-28 15:50:54,761 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Recovering attempt: appattempt_1517497680557_565955_01 with *final state: FAILED*
2018-02-28 15:50:54,766 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Recovering attempt: appattempt_1517497680557_565955_02 with *final state: null*
{noformat}
In my case, an *ACL configuration in fair-scheduler.xml was changed*, so we no longer have the rights to submit this application. FairScheduler#addApplication() therefore skips the application: it is not added to the scheduler application map, and an APP_REJECTED event is sent to move the application to the FAILED state.
{code}
if (!queue.hasAccess(QueueACL.SUBMIT_APPLICATIONS, userUgi) &&
    !queue.hasAccess(QueueACL.ADMINISTER_QUEUE, userUgi)) {
  String msg = "User " + userUgi.getUserName() +
      " cannot submit applications to queue " + queue.getName();
  LOG.info(msg);
  rmContext.getDispatcher().getEventHandler()
      .handle(new RMAppRejectedEvent(applicationId, msg));
  return;
}
{code}
Then we try to recover the app attempts. When we recover the last app attempt we check the final state of the attempt and of the application (see RMAppAttemptImpl#transition()). As I said before, the application's final state is null and the last app attempt has no final state either, so we check the RM app's current state in isAppInFinalState:
{code}
public static boolean isAppInFinalState(RMApp rmApp) {
  RMAppState appState = ((RMAppImpl) rmApp).getRecoveredFinalState();
  if (appState == null) {
    appState = rmApp.getState();
  }
  return appState == RMAppState.FAILED || appState == RMAppState.FINISHED
      || appState == RMAppState.KILLED;
}
{code}
At this point the *current state of the application is still NEW, because the APP_REJECTED event has not been processed yet*, as was described by Gergo Repas. *This leads to the wrong decision to recover the attempt.* We then try to get the user of the application in FairScheduler#addApplicationAttempt and get an NPE, because the application is not found in the scheduler:
{code}
SchedulerApplication application =
    applications.get(applicationAttemptId.getApplicationId());
String user = application.getUser(); // NPE
{code}
{noformat}
java.lang.NullPointerException
 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:740)
 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1327)
 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:117)
 at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1100)
 at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1046)
{noformat}
*Ideally, we should process the APP_REJECTED event before we try to recover the attempts.* For now I did not find an easy way to do that. *As a workaround, we can check whether the application is null* (see the attached patch); if it is, skip the attempt, the same way as in CapacityScheduler and as was proposed in YARN-2025. Perhaps we should open a new ticket for this.
{code}
SchedulerApplication application =
    applications.get(applicationAttemptId.getApplicationId());
if (application == null) {
  LOG.warn("Application " + applicationAttemptId.getApplicationId() +
      " cannot be found in scheduler.");
  return;
}
String user = application.getUser();
{code}
As a result the RM no longer fails, but we get an InvalidStateTransitonException because the APP_REJECTED event is processed too late:
{noformat}
2018-02-28 16:00:24,847 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: *Invalid event: APP_REJECTED at ACCEPTED.*
{noformat}
If we also add a transition from the ACCEPTED state to FAILED to the RMAppImpl StateMachineFactory
{code} .addTransition(RMAppSta
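The ordering problem in the analysis above — the recovery thread reads the application state before the async dispatcher has applied the APP_REJECTED event — can be reduced to a small sketch. The names here are illustrative, not the actual YARN classes:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Toy model of the race described in this comment: a rejection only
 * enqueues a state change on the async dispatcher, so a synchronous
 * state check on the recovery thread still sees the old state.
 */
public class DispatchOrderSketch {
    enum State { NEW, ACCEPTED, FAILED }

    static State state = State.NEW;
    static final Deque<Runnable> asyncDispatcher = new ArrayDeque<>();

    /** Models addApplication() rejecting the app: the state change is
     *  only enqueued, not applied immediately. */
    static void rejectApp() {
        asyncDispatcher.add(() -> state = State.FAILED);
    }

    /** Models isAppInFinalState(): reads the state synchronously. */
    static boolean isInFinalState() {
        return state == State.FAILED;
    }

    /** Models the async dispatcher thread eventually catching up. */
    static void drainDispatcher() {
        while (!asyncDispatcher.isEmpty()) {
            asyncDispatcher.poll().run();
        }
    }
}
```

Right after rejectApp(), isInFinalState() still returns false, which is exactly the stale read that lets the recovery proceed into the NPE; once the dispatcher drains, the final state is visible.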
[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception
[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16358263#comment-16358263 ] Gergo Repas commented on YARN-7913: --- I'm attaching a proof of concept. I did the following test on trunk and with this patch: I started 4 long-running applications with the failure-scenario setup (where the primary resourcemanager can accept the application, but the secondary resourcemanager fails to place the application on the queue, since it is a parent queue in the secondary RM). On trunk there were 5 "RM could not transition to Active" error messages; with the patch there was only 1. Any comments are appreciated.