[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16175407#comment-16175407 ] Jonathan Hung commented on YARN-5195: - Thanks [~jlowe] for verifying! (Also for the review/commit.) > RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event > when async-scheduling enabled in CapacityScheduler > -- > > Key: YARN-5195 > URL: https://issues.apache.org/jira/browse/YARN-5195 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Karam Singh >Assignee: sandflee > Fix For: 2.9.0, 3.0.0-alpha1 > > Attachments: YARN-5195.01.patch, YARN-5195.02.patch, > YARN-5195.03.patch, YARN-5195-branch-2.7.001.patch, > YARN-5195-branch-2.8.001.patch, YARN-5195-branch-2.8.001.patch > > > While running gridmix experiments one time came across incident where RM went > down with following exception > {noformat} > 2016-05-28 15:45:24,459 [ResourceManager Event Processor] FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1282) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:860) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1319) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:704) > at java.lang.Thread.run(Thread.java:745) > 2016-05-28 15:45:24,460 [ApplicationMasterLauncher #49] INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning > master appattempt_1464449118385_0006_01 > 2016-05-28 15:45:24,460 [ResourceManager Event Processor] INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16175395#comment-16175395 ] Jason Lowe commented on YARN-5195: -- The unit test failures are similar to the branch-2.7 case -- known issues with the Jenkins test environment on those branches. The unit tests pass locally for me with the patch applied. Committing. > RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event > when async-scheduling enabled in CapacityScheduler > -- > > Key: YARN-5195 > URL: https://issues.apache.org/jira/browse/YARN-5195 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Karam Singh >Assignee: sandflee > Fix For: 2.9.0, 3.0.0-alpha1 > > Attachments: YARN-5195.01.patch, YARN-5195.02.patch, > YARN-5195.03.patch, YARN-5195-branch-2.7.001.patch, > YARN-5195-branch-2.8.001.patch, YARN-5195-branch-2.8.001.patch > > > While running gridmix experiments one time came across incident where RM went > down with following exception > {noformat} > 2016-05-28 15:45:24,459 [ResourceManager Event Processor] FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1282) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:860) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1319) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:704) > at java.lang.Thread.run(Thread.java:745) > 2016-05-28 15:45:24,460 [ApplicationMasterLauncher #49] INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning > master appattempt_1464449118385_0006_01 > 2016-05-28 15:45:24,460 [ResourceManager Event Processor] INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16175147#comment-16175147 ] Hadoop QA commented on YARN-5195: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 12m 14s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} branch-2.8 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 24s{color} | {color:green} branch-2.8 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s{color} | {color:green} branch-2.8 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s{color} | {color:green} branch-2.8 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s{color} | {color:green} branch-2.8 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 10s{color} | {color:green} branch-2.8 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s{color} | {color:green} branch-2.8 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 75m 39s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}105m 33s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.TestClientRMTokens | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:c2d96dd | | JIRA Issue | YARN-5195 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12888323/YARN-5195-branch-2.8.001.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 488fc670ea27 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | branch-2.8 / c0bb242 | | Default Java | 1.7.0_151 | | findbugs | v3.0.0 | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/17570/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/17570/testReport/ | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/17570/console | | Powered by | Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event > when async-scheduling enabled in CapacityScheduler >
[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173902#comment-16173902 ] Hadoop QA commented on YARN-5195: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 10m 20s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} branch-2.7 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 47s{color} | {color:green} branch-2.7 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s{color} | {color:green} branch-2.7 passed with JDK v1.8.0_144 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s{color} | {color:green} branch-2.7 passed with JDK v1.7.0_151 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 23s{color} | {color:green} branch-2.7 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s{color} | {color:green} branch-2.7 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 4s{color} | {color:green} branch-2.7 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s{color} | {color:green} branch-2.7 passed with JDK v1.8.0_144 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s{color} | {color:green} branch-2.7 passed with JDK v1.7.0_151 {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s{color} | {color:green} the patch passed with JDK v1.8.0_144 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s{color} | {color:green} the patch passed with JDK v1.7.0_151 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 20s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 4 new + 441 unchanged - 2 fixed = 445 total (was 443) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 16s{color} | {color:green} the patch passed with JDK v1.8.0_144 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s{color} | {color:green} the patch passed with JDK v1.7.0_151 {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 50m 51s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_151. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 16s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}129m 15s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_144 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestRMRestart | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | JDK v1.7.0_151 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:67e87c9 | | JIRA Issue | YARN-5195 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12888104/YARN-5195-branch-2.7.001.patch | | Optional Tests | asflicense compile javac
[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173647#comment-16173647 ] Jonathan Hung commented on YARN-5195: - Hi [~sunilg]/[~leftnoteasy], I attached branch-2.8 and branch-2.7 patches (implementation is slightly different due to YARN-4719). Can we commit this to these two branches? Thanks! FWIW, we encountered this on our cluster with async scheduling disabled (as the test case in the original patch suggests). > RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event > when async-scheduling enabled in CapacityScheduler > -- > > Key: YARN-5195 > URL: https://issues.apache.org/jira/browse/YARN-5195 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Karam Singh >Assignee: sandflee > Fix For: 2.9.0, 3.0.0-alpha1 > > Attachments: YARN-5195.01.patch, YARN-5195.02.patch, > YARN-5195.03.patch, YARN-5195-branch-2.7.001.patch, > YARN-5195-branch-2.8.001.patch > > > While running gridmix experiments one time came across incident where RM went > down with following exception > {noformat} > 2016-05-28 15:45:24,459 [ResourceManager Event Processor] FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1282) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:860) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1319) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:704) > at java.lang.Thread.run(Thread.java:745) > 2016-05-28 15:45:24,460 [ApplicationMasterLauncher #49] INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning > master appattempt_1464449118385_0006_01 > 2016-05-28 15:45:24,460 [ResourceManager Event Processor] INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15397695#comment-15397695 ] sandflee commented on YARN-5195: Thanks [~leftnoteasy] and [~sunilg] for reviewing and committing. > RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event > when async-scheduling enabled in CapacityScheduler > -- > > Key: YARN-5195 > URL: https://issues.apache.org/jira/browse/YARN-5195 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Karam Singh >Assignee: sandflee > Fix For: 2.9.0 > > Attachments: YARN-5195.01.patch, YARN-5195.02.patch, > YARN-5195.03.patch > > > While running gridmix experiments one time came across incident where RM went > down with following exception > {noformat} > 2016-05-28 15:45:24,459 [ResourceManager Event Processor] FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1282) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:860) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1319) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:704) > at java.lang.Thread.run(Thread.java:745) > 2016-05-28 15:45:24,460 [ApplicationMasterLauncher #49] INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning > master appattempt_1464449118385_0006_01 > 2016-05-28 15:45:24,460 [ResourceManager Event Processor] INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395043#comment-15395043 ] Hudson commented on YARN-5195: -- SUCCESS: Integrated in Hadoop-trunk-Commit #10161 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/10161/]) YARN-5195. RM intermittently crashed with NPE while handling (wangda: rev d62e121ffc0239e7feccc1e23ece92c5fac685f6) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java > RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event > when async-scheduling enabled in CapacityScheduler > -- > > Key: YARN-5195 > URL: https://issues.apache.org/jira/browse/YARN-5195 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Karam Singh >Assignee: sandflee > Fix For: 2.9.0 > > Attachments: YARN-5195.01.patch, YARN-5195.02.patch, > YARN-5195.03.patch > > > While running gridmix experiments one time came across incident where RM went > down with following exception > {noformat} > 2016-05-28 15:45:24,459 [ResourceManager Event Processor] FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1282) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:860) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1319) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:704) > at java.lang.Thread.run(Thread.java:745) > 2016-05-28 15:45:24,460 [ApplicationMasterLauncher #49] INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning > master appattempt_1464449118385_0006_01 > 2016-05-28 15:45:24,460 [ResourceManager Event Processor] INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390374#comment-15390374 ] Wangda Tan commented on YARN-5195: -- +1, thanks [~sandflee], will commit soon if no objections. > RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event > when async-scheduling enabled in CapacityScheduler > -- > > Key: YARN-5195 > URL: https://issues.apache.org/jira/browse/YARN-5195 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Karam Singh >Assignee: sandflee > Attachments: YARN-5195.01.patch, YARN-5195.02.patch, > YARN-5195.03.patch > > > While running gridmix experiments one time came across incident where RM went > down with following exception > {noformat} > 2016-05-28 15:45:24,459 [ResourceManager Event Processor] FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1282) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:860) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1319) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:704) > at java.lang.Thread.run(Thread.java:745) > 2016-05-28 15:45:24,460 [ApplicationMasterLauncher #49] INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning > master appattempt_1464449118385_0006_01 > 2016-05-28 15:45:24,460 [ResourceManager Event Processor] INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389227#comment-15389227 ] Sunil G commented on YARN-5195: --- Patch looks fine for me. Thanks [~sandflee]. > RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event > when async-scheduling enabled in CapacityScheduler > -- > > Key: YARN-5195 > URL: https://issues.apache.org/jira/browse/YARN-5195 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Karam Singh >Assignee: sandflee > Attachments: YARN-5195.01.patch, YARN-5195.02.patch, > YARN-5195.03.patch > > > While running gridmix experiments one time came across incident where RM went > down with following exception > {noformat} > 2016-05-28 15:45:24,459 [ResourceManager Event Processor] FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1282) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:860) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1319) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:704) > at java.lang.Thread.run(Thread.java:745) > 2016-05-28 15:45:24,460 [ApplicationMasterLauncher #49] INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning > master appattempt_1464449118385_0006_01 > 2016-05-28 15:45:24,460 [ResourceManager Event Processor] INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389219#comment-15389219 ] Sunil G commented on YARN-5195: --- Yes [~sandflee]. That make sense. > RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event > when async-scheduling enabled in CapacityScheduler > -- > > Key: YARN-5195 > URL: https://issues.apache.org/jira/browse/YARN-5195 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Karam Singh >Assignee: sandflee > Attachments: YARN-5195.01.patch, YARN-5195.02.patch, > YARN-5195.03.patch > > > While running gridmix experiments one time came across incident where RM went > down with following exception > {noformat} > 2016-05-28 15:45:24,459 [ResourceManager Event Processor] FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1282) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:860) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1319) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:704) > at java.lang.Thread.run(Thread.java:745) > 2016-05-28 15:45:24,460 [ApplicationMasterLauncher #49] INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning > master appattempt_1464449118385_0006_01 > 2016-05-28 15:45:24,460 [ResourceManager Event Processor] INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389027#comment-15389027 ] sandflee commented on YARN-5195: Thanks [~sunilg] , nodeTracker#remove are invoked at Scheduler#removeNode, Scheduler#updateNodeResource, they are synced with scheduler#allocateContainersToNode, it's safe for now. > RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event > when async-scheduling enabled in CapacityScheduler > -- > > Key: YARN-5195 > URL: https://issues.apache.org/jira/browse/YARN-5195 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Karam Singh >Assignee: sandflee > Attachments: YARN-5195.01.patch, YARN-5195.02.patch, > YARN-5195.03.patch > > > While running gridmix experiments one time came across incident where RM went > down with following exception > {noformat} > 2016-05-28 15:45:24,459 [ResourceManager Event Processor] FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1282) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:860) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1319) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:704) > at java.lang.Thread.run(Thread.java:745) > 2016-05-28 15:45:24,460 [ApplicationMasterLauncher #49] INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning > master appattempt_1464449118385_0006_01 > 2016-05-28 15:45:24,460 [ResourceManager Event Processor] INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15388968#comment-15388968 ] Sunil G commented on YARN-5195: --- Hi [~sandflee] Thanks for the patch. I have a doubt here. bq.1 . all nodes copied from nodeTracker Since we copy all nodes from *nodeTracker*, we could loose one node any time during the allocation process. Currently the null check is added only at the start of {{allocateContainersToNode}}. So is it possible that we may loose node after this step too. Are we looking for lock here to avoid the problem, like an operating lock on *node*. Pls feel free to correct me if i understood the problem wrongly. > RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event > when async-scheduling enabled in CapacityScheduler > -- > > Key: YARN-5195 > URL: https://issues.apache.org/jira/browse/YARN-5195 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Karam Singh >Assignee: sandflee > Attachments: YARN-5195.01.patch, YARN-5195.02.patch, > YARN-5195.03.patch > > > While running gridmix experiments one time came across incident where RM went > down with following exception > {noformat} > 2016-05-28 15:45:24,459 [ResourceManager Event Processor] FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1282) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:860) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1319) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:704) > at java.lang.Thread.run(Thread.java:745) > 2016-05-28 15:45:24,460 [ApplicationMasterLauncher #49] INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning > master appattempt_1464449118385_0006_01 > 2016-05-28 15:45:24,460 [ResourceManager Event Processor] INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15388707#comment-15388707 ] Hadoop QA commented on YARN-5195: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 25s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 50s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 24s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 38s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 17s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 57s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 33s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 2s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 36m 9s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 16s {color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 52m 4s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:9560f25 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12819522/YARN-5195.03.patch | | JIRA Issue | YARN-5195 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux ea77b5fb38ab 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 132deb4 | | Default Java | 1.8.0_91 | | findbugs | v3.0.0 | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/12453/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | unit test logs | https://builds.apache.org/job/PreCommit-YARN-Build/12453/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/12453/testReport/ | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/12453/console | | Powered by | Apache Yetus 0.3.0 http://yetus.apache.org | This message was automatically generated. > RM
[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15388665#comment-15388665 ] sandflee commented on YARN-5195: update a patch to fix checkstyle warning, failed test could pass locally, seems not related. > RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event > when async-scheduling enabled in CapacityScheduler > -- > > Key: YARN-5195 > URL: https://issues.apache.org/jira/browse/YARN-5195 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Karam Singh >Assignee: sandflee > Attachments: YARN-5195.01.patch, YARN-5195.02.patch, > YARN-5195.03.patch > > > While running gridmix experiments one time came across incident where RM went > down with following exception > {noformat} > 2016-05-28 15:45:24,459 [ResourceManager Event Processor] FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1282) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:860) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1319) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:704) > at java.lang.Thread.run(Thread.java:745) > 2016-05-28 15:45:24,460 [ApplicationMasterLauncher #49] INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning > master appattempt_1464449118385_0006_01 > 2016-05-28 15:45:24,460 [ResourceManager Event Processor] INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15388320#comment-15388320 ] Hadoop QA commented on YARN-5195: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 22s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 26s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 46s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 19s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 3s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 37s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 34s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 34s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 25s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 291 unchanged - 0 fixed = 292 total (was 291) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 42s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 17s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 11s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 33m 58s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 16s {color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 51m 19s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:9560f25 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12819430/YARN-5195.02.patch | | JIRA Issue | YARN-5195 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 998c78924c17 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / ecff7d0 | | Default Java | 1.8.0_91 | | findbugs | v3.0.0 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/12446/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/12446/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | unit test logs | https://builds.apache.org/job/PreCommit-YARN-Build/12446/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/12446/testReport/ | | modules | C: hadoop-yarn-project/hadoop-y
[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387144#comment-15387144 ] Hadoop QA commented on YARN-5195: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 19s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 18s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 39s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 24s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 44s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 19s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 7s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 41s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 14s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 37m 31s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 20s {color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 55m 51s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:9560f25 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12819231/YARN-5195.01.patch | | JIRA Issue | YARN-5195 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux bced71b4d78d 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 521f343 | | Default Java | 1.8.0_91 | | findbugs | v3.0.0 | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/12431/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | unit test logs | https://builds.apache.org/job/PreCommit-YARN-Build/12431/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/12431/testReport/ | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://builds.apache.org/job/PreCommit-YAR
[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386891#comment-15386891 ] sandflee commented on YARN-5195: AsyncSchedulerThread will copy all node from nodeTracker before attemptScheduling on node. there is a race condition: 1, all nodes copied from nodeTracker 2, nodeA lost and removed from scheduler, all launched containers are cleaned 3, app attempt completed and the container allocated (or reserved) on nodeA will refer to non-exist node. this is fixed in fairscheduler in YARN-3675, add a init patch and will add a test later > RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event > when async-scheduling enabled in CapacityScheduler > -- > > Key: YARN-5195 > URL: https://issues.apache.org/jira/browse/YARN-5195 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Karam Singh >Assignee: sandflee > Attachments: YARN-5195.01.patch > > > While running gridmix experiments one time came across incident where RM went > down with following exception > {noformat} > 2016-05-28 15:45:24,459 [ResourceManager Event Processor] FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1282) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:860) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1319) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:704) > at java.lang.Thread.run(Thread.java:745) > 2016-05-28 15:45:24,460 [ApplicationMasterLauncher #49] INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning > master appattempt_1464449118385_0006_01 > 2016-05-28 15:45:24,460 [ResourceManager Event Processor] INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378233#comment-15378233 ] Wangda Tan commented on YARN-5195: -- I don't have bandwidth to do this now, please feel free to pick it up if you have time. > RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event > when async-scheduling enabled in CapacityScheduler > -- > > Key: YARN-5195 > URL: https://issues.apache.org/jira/browse/YARN-5195 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Karam Singh > > While running gridmix experiments one time came across incident where RM went > down with following exception > {noformat} > 2016-05-28 15:45:24,459 [ResourceManager Event Processor] FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1282) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:860) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1319) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:704) > at java.lang.Thread.run(Thread.java:745) > 2016-05-28 15:45:24,460 [ApplicationMasterLauncher #49] INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning > master appattempt_1464449118385_0006_01 > 2016-05-28 15:45:24,460 [ResourceManager Event Processor] INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org