[jira] [Updated] (YARN-8547) RM may crash if NM registers with too many applications
[ https://issues.apache.org/jira/browse/YARN-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-8547: --- Attachment: YARN-8547.01.patch > RM may crash if NM registers with too many applications > -- > > Key: YARN-8547 > URL: https://issues.apache.org/jira/browse/YARN-8547 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee >Priority: Major > Attachments: YARN-8547.01.patch > > > 1. our cluster had n k+ nodes with log aggregation disabled, so one single NM > may keep 10,000+ (1w+) apps > 2. when the RM fails over, a single NM registers with 10,000+ apps, causing the active RM > to GC constantly and lose its connection to ZK. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8547) RM may crash if NM registers with too many applications
[ https://issues.apache.org/jira/browse/YARN-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-8547: --- Description: 1. our cluster had n k+ nodes with log aggregation disabled, so one single NM may keep 10,000+ (1w+) apps 2. when the RM fails over, a single NM registers with 10,000+ apps, causing the active RM to GC constantly and lose its connection to ZK. was: 1. our cluster had n k+ nodes and we disabled log aggregation, so a single NM may keep 10,000+ (1w+) apps 2. when the RM fails over, NMs register with 10,000+ apps, causing the active RM to GC constantly and lose its connection to ZK. > RM may crash if NM registers with too many applications > -- > > Key: YARN-8547 > URL: https://issues.apache.org/jira/browse/YARN-8547 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee >Priority: Major > > 1. our cluster had n k+ nodes with log aggregation disabled, so one single NM > may keep 10,000+ (1w+) apps > 2. when the RM fails over, a single NM registers with 10,000+ apps, causing the active RM > to GC constantly and lose its connection to ZK. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8547) RM may crash if NM registers with too many applications
sandflee created YARN-8547: -- Summary: RM may crash if NM registers with too many applications Key: YARN-8547 URL: https://issues.apache.org/jira/browse/YARN-8547 Project: Hadoop YARN Issue Type: Bug Reporter: sandflee Assignee: sandflee 1. our cluster had n k+ nodes and we disabled log aggregation, so a single NM may keep 10,000+ (1w+) apps 2. when the RM fails over, NMs register with 10,000+ apps, causing the active RM to GC constantly and lose its connection to ZK. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-7229) Add a metric for the size of event queue in AsyncDispatcher
[ https://issues.apache.org/jira/browse/YARN-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267978#comment-16267978 ] sandflee edited comment on YARN-7229 at 11/28/17 2:30 AM: -- yes, planned to add this to our cluster, assign to myself was (Author: sandflee): yes, planned to add this to our cluster, assign this to myself > Add a metric for the size of event queue in AsyncDispatcher > --- > > Key: YARN-7229 > URL: https://issues.apache.org/jira/browse/YARN-7229 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.1.0 >Reporter: Yufei Gu >Assignee: sandflee > > The size of event queue in AsyncDispatcher is a good point to monitor daemon > performance. Let's make it a RM metric. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7229) Add a metric for the size of event queue in AsyncDispatcher
[ https://issues.apache.org/jira/browse/YARN-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267978#comment-16267978 ] sandflee commented on YARN-7229: yes, planned to add this to our cluster, assign this to myself > Add a metric for the size of event queue in AsyncDispatcher > --- > > Key: YARN-7229 > URL: https://issues.apache.org/jira/browse/YARN-7229 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.1.0 >Reporter: Yufei Gu >Assignee: sandflee > > The size of event queue in AsyncDispatcher is a good point to monitor daemon > performance. Let's make it a RM metric. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-7229) Add a metric for the size of event queue in AsyncDispatcher
[ https://issues.apache.org/jira/browse/YARN-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee reassigned YARN-7229: -- Assignee: sandflee > Add a metric for the size of event queue in AsyncDispatcher > --- > > Key: YARN-7229 > URL: https://issues.apache.org/jira/browse/YARN-7229 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.1.0 >Reporter: Yufei Gu >Assignee: sandflee > > The size of event queue in AsyncDispatcher is a good point to monitor daemon > performance. Let's make it a RM metric. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7229) Add a metric for the size of event queue in AsyncDispatcher
[ https://issues.apache.org/jira/browse/YARN-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16264890#comment-16264890 ] sandflee commented on YARN-7229: YARN-5276, seems do similar things [~asuresh] > Add a metric for the size of event queue in AsyncDispatcher > --- > > Key: YARN-7229 > URL: https://issues.apache.org/jira/browse/YARN-7229 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.1.0 >Reporter: Yufei Gu > > The size of event queue in AsyncDispatcher is a good point to monitor daemon > performance. Let's make it a RM metric. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-7498) NM failed to start if the namespace of remote log dirs differs from fs.defaultFS
[ https://issues.apache.org/jira/browse/YARN-7498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee resolved YARN-7498. Resolution: Duplicate > NM failed to start if the namespace of remote log dirs differs from > fs.defaultFS > > > Key: YARN-7498 > URL: https://issues.apache.org/jira/browse/YARN-7498 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee > > fs.defaultFS is hdfs://nameservice1 and yarn.nodemanager.remote-app-log-dir > is hdfs://nameservice2, when nm start see errors: > {quote} > java.lang.IllegalArgumentException: Wrong FS: hdfs://nameservice2/yarn-logs, > expected: hdfs://nameservice1 > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:105) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1128) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1124) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1124) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:192) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:319) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:745) > {quote} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7498) NM failed to start if the namespace of remote log dirs differs from fs.defaultFS
sandflee created YARN-7498: -- Summary: NM failed to start if the namespace of remote log dirs differs from fs.defaultFS Key: YARN-7498 URL: https://issues.apache.org/jira/browse/YARN-7498 Project: Hadoop YARN Issue Type: Bug Reporter: sandflee Assignee: sandflee fs.defaultFS is hdfs://nameservice1 and yarn.nodemanager.remote-app-log-dir is hdfs://nameservice2, when nm start see errors: {quote} java.lang.IllegalArgumentException: Wrong FS: hdfs://nameservice2/yarn-logs, expected: hdfs://nameservice1 at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645) at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193) at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:105) at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1128) at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1124) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1124) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:192) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:319) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) {quote} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
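For context on the "Wrong FS" failure above, here is a minimal illustrative sketch (not the NodeManager code): a path that lives on a different nameservice cannot be resolved against the default FileSystem, which is exactly what FileSystem.checkPath rejects in the stack trace, while resolving the FileSystem from the path itself picks the right namespace. The nameservice names follow the issue description; everything else is assumed for illustration and requires both nameservices to be defined in the client configuration.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch only (not the NodeManager code). It assumes the two
// nameservices from the issue description are defined in hdfs-site.xml.
public class WrongFsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://nameservice1");

    Path remoteLogDir = new Path("hdfs://nameservice2/yarn-logs");

    // The default filesystem belongs to nameservice1, so asking it about a
    // nameservice2 path fails in FileSystem.checkPath with
    // "Wrong FS: hdfs://nameservice2/yarn-logs, expected: hdfs://nameservice1".
    FileSystem defaultFs = FileSystem.get(conf);
    // defaultFs.getFileStatus(remoteLogDir);   // would throw the error above

    // Resolving the filesystem from the path itself selects nameservice2,
    // which avoids the mismatch for per-path lookups.
    FileSystem remoteFs = remoteLogDir.getFileSystem(conf);
    System.out.println(defaultFs.getUri() + " vs " + remoteFs.getUri());
  }
}
{code}

This matches the stack trace above, where LogAggregationService#verifyAndCreateRemoteLogDir calls getFileStatus on the default DistributedFileSystem.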
[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups
[ https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187004#comment-16187004 ] sandflee commented on YARN-4599: Hi, [~miklos.szeg...@cloudera.com], I'm busy on other work recently , feel free to assign this to yourself, will join you when not so busy. > Set OOM control for memory cgroups > -- > > Key: YARN-4599 > URL: https://issues.apache.org/jira/browse/YARN-4599 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.9.0 >Reporter: Karthik Kambatla >Assignee: sandflee > Labels: oct16-medium > Attachments: yarn-4599-not-so-useful.patch, YARN-4599.sandflee.patch > > > YARN-1856 adds memory cgroups enforcing support. We should also explicitly > set OOM control so that containers are not killed as soon as they go over > their usage. Today, one could set the swappiness to control this, but > clusters with swap turned off exist. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7035) Add health checker to ResourceManager
[ https://issues.apache.org/jira/browse/YARN-7035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16134636#comment-16134636 ] sandflee commented on YARN-7035: Thanks [~yufeigu], YARN-6061 is very useful for handling critical thread exits; for deadlocks we use ThreadMXBean to detect them. > Add health checker to ResourceManager > - > > Key: YARN-7035 > URL: https://issues.apache.org/jira/browse/YARN-7035 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee > > RM may become unhealthy but still alive, for example the scheduling thread > exits or a deadlock happens. It seems useful to add a health checker service; if > the check fails, let the RM exit -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
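As a rough illustration of the ThreadMXBean-based deadlock detection sandflee mentions, the sketch below is a minimal stand-alone probe built only on standard JDK management APIs; it is not the code used in their cluster, and the polling interval and the exit-on-deadlock behavior are assumptions.

{code}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Minimal stand-alone deadlock probe; names and behavior are illustrative only.
public class DeadlockProbe {
  public static void main(String[] args) throws InterruptedException {
    ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
    while (true) {
      // findDeadlockedThreads() returns null when no threads are deadlocked.
      long[] deadlocked = mxBean.findDeadlockedThreads();
      if (deadlocked != null) {
        for (ThreadInfo info : mxBean.getThreadInfo(deadlocked)) {
          System.err.println("Deadlocked thread: " + info.getThreadName());
        }
        // An RM health checker would exit (or trigger failover) at this point.
        System.exit(1);
      }
      Thread.sleep(10_000L); // probe every 10 seconds
    }
  }
}
{code}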
[jira] [Created] (YARN-7035) Add health checker to ResourceManager
sandflee created YARN-7035: -- Summary: Add health checker to ResourceManager Key: YARN-7035 URL: https://issues.apache.org/jira/browse/YARN-7035 Project: Hadoop YARN Issue Type: Improvement Reporter: sandflee RM may become unhealthy but still alive, for example the scheduling thread exits or a deadlock happens. It seems useful to add a health checker service; if the check fails, let the RM exit -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5349) TestWorkPreservingRMRestart#testUAMRecoveryOnRMWorkPreservingRestart fail intermittently
[ https://issues.apache.org/jira/browse/YARN-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108121#comment-16108121 ] sandflee commented on YARN-5349: not working on this, set unassigned. > TestWorkPreservingRMRestart#testUAMRecoveryOnRMWorkPreservingRestart fail > intermittently > - > > Key: YARN-5349 > URL: https://issues.apache.org/jira/browse/YARN-5349 > Project: Hadoop YARN > Issue Type: Test >Reporter: sandflee >Priority: Minor > > {noformat} > java.lang.AssertionError: null > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertTrue(Assert.java:52) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testUAMRecoveryOnRMWorkPreservingRestart(TestWorkPreservingRMRestart.java:1463) > {noformat} > https://builds.apache.org/job/PreCommit-YARN-Build/12250/testReport/org.apache.hadoop.yarn.server.resourcemanager/TestWorkPreservingRMRestart/testUAMRecoveryOnRMWorkPreservingRestart/ -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-5349) TestWorkPreservingRMRestart#testUAMRecoveryOnRMWorkPreservingRestart fail intermittently
[ https://issues.apache.org/jira/browse/YARN-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee reassigned YARN-5349: -- Assignee: (was: sandflee) > TestWorkPreservingRMRestart#testUAMRecoveryOnRMWorkPreservingRestart fail > intermittently > - > > Key: YARN-5349 > URL: https://issues.apache.org/jira/browse/YARN-5349 > Project: Hadoop YARN > Issue Type: Test >Reporter: sandflee >Priority: Minor > > {noformat} > java.lang.AssertionError: null > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertTrue(Assert.java:52) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testUAMRecoveryOnRMWorkPreservingRestart(TestWorkPreservingRMRestart.java:1463) > {noformat} > https://builds.apache.org/job/PreCommit-YARN-Build/12250/testReport/org.apache.hadoop.yarn.server.resourcemanager/TestWorkPreservingRMRestart/testUAMRecoveryOnRMWorkPreservingRestart/ -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-6854) many jobs failed if NM couldn't detect disk error
sandflee created YARN-6854: -- Summary: many jobs failed if NM couldn't detect disk error Key: YARN-6854 URL: https://issues.apache.org/jira/browse/YARN-6854 Project: Hadoop YARN Issue Type: Bug Reporter: sandflee Priority: Critical checkDiskHealthy is enabled, but it couldn't detect this error, so containers failed and new containers assigned to this node failed again. The disk error seems to be a filesystem error: all IO operations (such as ls) failed on $localdir/usercache/userFoo, with no effect on other dirs. Any suggestions? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4051) ContainerKillEvent lost when container is still recovering and application finishes
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928224#comment-15928224 ] sandflee commented on YARN-4051: Thanks [~jlowe] for your review and commit! > ContainerKillEvent lost when container is still recovering and application > finishes > --- > > Key: YARN-4051 > URL: https://issues.apache.org/jira/browse/YARN-4051 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: sandflee >Assignee: sandflee >Priority: Critical > Fix For: 2.9.0, 2.8.1, 3.0.0-alpha3 > > Attachments: YARN-4051.01.patch, YARN-4051.02.patch, > YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, > YARN-4051.06.patch, YARN-4051.07.patch, YARN-4051.08.patch, > YARN-4051.08.patch-branch-2 > > > As in YARN-4050, NM event dispatcher is blocked, and container is in New > state, when we finish application, the container still alive even after NM > event dispatcher is unblocked. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4051) ContainerKillEvent lost when container is still recovering and application finishes
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927733#comment-15927733 ] sandflee commented on YARN-4051: update a patch for branch-2 > ContainerKillEvent lost when container is still recovering and application > finishes > --- > > Key: YARN-4051 > URL: https://issues.apache.org/jira/browse/YARN-4051 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: sandflee >Assignee: sandflee >Priority: Critical > Attachments: YARN-4051.01.patch, YARN-4051.02.patch, > YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, > YARN-4051.06.patch, YARN-4051.07.patch, YARN-4051.08.patch, > YARN-4051.08.patch-branch-2 > > > As in YARN-4050, NM event dispatcher is blocked, and container is in New > state, when we finish application, the container still alive even after NM > event dispatcher is unblocked. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4051) ContainerKillEvent lost when container is still recovering and application finishes
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4051: --- Attachment: YARN-4051.08.patch-branch-2 > ContainerKillEvent lost when container is still recovering and application > finishes > --- > > Key: YARN-4051 > URL: https://issues.apache.org/jira/browse/YARN-4051 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: sandflee >Assignee: sandflee >Priority: Critical > Attachments: YARN-4051.01.patch, YARN-4051.02.patch, > YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, > YARN-4051.06.patch, YARN-4051.07.patch, YARN-4051.08.patch, > YARN-4051.08.patch-branch-2 > > > As in YARN-4050, NM event dispatcher is blocked, and container is in New > state, when we finish application, the container still alive even after NM > event dispatcher is unblocked. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4051) ContainerKillEvent lost when container is still recovering and application finishes
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925696#comment-15925696 ] sandflee commented on YARN-4051: patch updated, also fix test failure > ContainerKillEvent lost when container is still recovering and application > finishes > --- > > Key: YARN-4051 > URL: https://issues.apache.org/jira/browse/YARN-4051 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: sandflee >Assignee: sandflee >Priority: Critical > Attachments: YARN-4051.01.patch, YARN-4051.02.patch, > YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, > YARN-4051.06.patch, YARN-4051.07.patch, YARN-4051.08.patch > > > As in YARN-4050, NM event dispatcher is blocked, and container is in New > state, when we finish application, the container still alive even after NM > event dispatcher is unblocked. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-6340) TestEventFlow does not work as expected
sandflee created YARN-6340: -- Summary: TestEventFlow does not work as expected Key: YARN-6340 URL: https://issues.apache.org/jira/browse/YARN-6340 Project: Hadoop YARN Issue Type: Test Reporter: sandflee I see many exceptions in the test logs and the app/container never reaches RUNNING; surprisingly, the test still passes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4599) Set OOM control for memory cgroups
[ https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925683#comment-15925683 ] sandflee commented on YARN-4599: Hi [~miklos.szeg...@cloudera.com], I think the first question is whether the community accepts this proposal; if yes, we can push forward, and if not it will be a burden to keep the patch in sync with trunk. Thoughts? Using the Linux Container Executor seems to simplify the code but additionally needs a new process; I'm OK with either. > Set OOM control for memory cgroups > -- > > Key: YARN-4599 > URL: https://issues.apache.org/jira/browse/YARN-4599 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.9.0 >Reporter: Karthik Kambatla >Assignee: sandflee > Labels: oct16-medium > Attachments: yarn-4599-not-so-useful.patch, YARN-4599.sandflee.patch > > > YARN-1856 adds memory cgroups enforcing support. We should also explicitly > set OOM control so that containers are not killed as soon as they go over > their usage. Today, one could set the swappiness to control this, but > clusters with swap turned off exist. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4051) ContainerKillEvent lost when container is still recovering and application finishes
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4051: --- Attachment: YARN-4051.08.patch > ContainerKillEvent lost when container is still recovering and application > finishes > --- > > Key: YARN-4051 > URL: https://issues.apache.org/jira/browse/YARN-4051 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: sandflee >Assignee: sandflee >Priority: Critical > Attachments: YARN-4051.01.patch, YARN-4051.02.patch, > YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, > YARN-4051.06.patch, YARN-4051.07.patch, YARN-4051.08.patch > > > As in YARN-4050, NM event dispatcher is blocked, and container is in New > state, when we finish application, the container still alive even after NM > event dispatcher is unblocked. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15907544#comment-15907544 ] sandflee commented on YARN-4051: Thanks [~jlowe], bq. I'm also wondering about the scenario where the kill event is coming in from an AM and not the RM. We simply throw a YarnException when an AM stops a recovering container, but it seems NMClientAsyncImpl couldn't retry stopContainer; could we fix this in a new issue? {code} .addTransition(ContainerState.RUNNING, EnumSet.of(ContainerState.DONE, ContainerState.FAILED), ContainerEventType.STOP_CONTAINER, new StopContainerTransition()) {code} I made another two changes: 1, use app.handle(new ApplicationContainerInitEvent(container)) when recovering containers, because there is a race condition where the Finish event comes while ApplicationContainerInitEvent is not yet processed and containers are not added to the app; 2, use a ConcurrentHashMap to store containers in the app, because I encountered a ConcurrentModificationException when iterating app.getContainers(), and I also see the web UI and AppLogAggregator using app.getContainers() without protection. > ContainerKillEvent is lost when container is In New State and is recovering > > > Key: YARN-4051 > URL: https://issues.apache.org/jira/browse/YARN-4051 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: sandflee >Assignee: sandflee >Priority: Critical > Attachments: YARN-4051.01.patch, YARN-4051.02.patch, > YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, > YARN-4051.06.patch, YARN-4051.07.patch > > > As in YARN-4050, NM event dispatcher is blocked, and container is in New > state, when we finish application, the container still alive even after NM > event dispatcher is unblocked. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
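On change 2 in the comment above, the following is a small self-contained sketch (not the NodeManager Application code) showing why a ConcurrentHashMap avoids the ConcurrentModificationException: its iterators are weakly consistent and tolerate concurrent puts, unlike a plain HashMap.

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: iterating a ConcurrentHashMap while another thread adds
// entries never throws ConcurrentModificationException; a plain HashMap can.
public class ContainersMapSketch {
  public static void main(String[] args) throws InterruptedException {
    Map<String, String> containers = new ConcurrentHashMap<>();
    containers.put("container_1", "RUNNING");

    Thread writer = new Thread(() -> {
      for (int i = 2; i < 10_000; i++) {
        containers.put("container_" + i, "NEW");
      }
    });
    writer.start();

    // Safe to iterate while the writer keeps adding entries (weakly consistent view).
    for (Map.Entry<String, String> entry : containers.entrySet()) {
      if ("RUNNING".equals(entry.getValue())) {
        System.out.println(entry.getKey());
      }
    }
    writer.join();
  }
}
{code}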
[jira] [Updated] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4051: --- Attachment: YARN-4051.07.patch > ContainerKillEvent is lost when container is In New State and is recovering > > > Key: YARN-4051 > URL: https://issues.apache.org/jira/browse/YARN-4051 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: sandflee >Assignee: sandflee >Priority: Critical > Attachments: YARN-4051.01.patch, YARN-4051.02.patch, > YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, > YARN-4051.06.patch, YARN-4051.07.patch > > > As in YARN-4050, NM event dispatcher is blocked, and container is in New > state, when we finish application, the container still alive even after NM > event dispatcher is unblocked. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4051: --- Attachment: YARN-4051.06.patch > ContainerKillEvent is lost when container is In New State and is recovering > > > Key: YARN-4051 > URL: https://issues.apache.org/jira/browse/YARN-4051 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: sandflee >Assignee: sandflee >Priority: Critical > Attachments: YARN-4051.01.patch, YARN-4051.02.patch, > YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, YARN-4051.06.patch > > > As in YARN-4050, NM event dispatcher is blocked, and container is in New > state, when we finish application, the container still alive even after NM > event dispatcher is unblocked. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4051: --- Attachment: (was: YARN-4051.06.patch) > ContainerKillEvent is lost when container is In New State and is recovering > > > Key: YARN-4051 > URL: https://issues.apache.org/jira/browse/YARN-4051 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: sandflee >Assignee: sandflee >Priority: Critical > Attachments: YARN-4051.01.patch, YARN-4051.02.patch, > YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch > > > As in YARN-4050, NM event dispatcher is blocked, and container is in New > state, when we finish application, the container still alive even after NM > event dispatcher is unblocked. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900345#comment-15900345 ] sandflee commented on YARN-4051: since RM will resend FINISH_APPS/FINISH_CONTAINER if nm reports app/container running, seems safe to drop the event if container is recovering, [~jlowe] > ContainerKillEvent is lost when container is In New State and is recovering > > > Key: YARN-4051 > URL: https://issues.apache.org/jira/browse/YARN-4051 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: sandflee >Assignee: sandflee >Priority: Critical > Attachments: YARN-4051.01.patch, YARN-4051.02.patch, > YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, YARN-4051.06.patch > > > As in YARN-4050, NM event dispatcher is blocked, and container is in New > state, when we finish application, the container still alive even after NM > event dispatcher is unblocked. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-5301) NM mount cpu cgroups failed on some system
[ https://issues.apache.org/jira/browse/YARN-5301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee reassigned YARN-5301: -- Assignee: (was: sandflee) > NM mount cpu cgroups failed on some system > -- > > Key: YARN-5301 > URL: https://issues.apache.org/jira/browse/YARN-5301 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee > > on ubuntu with linux kernel 3.19, , NM start failed if enable auto mount > cgroup. try command: > ./bin/container-executor --mount-cgroups yarn-hadoop cpu=/cgroup/cpufail > ./bin/container-executor --mount-cgroups yarn-hadoop cpu,cpuacct=/cgroup/cpu > succ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering
[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4051: --- Attachment: YARN-4051.06.patch > ContainerKillEvent is lost when container is In New State and is recovering > > > Key: YARN-4051 > URL: https://issues.apache.org/jira/browse/YARN-4051 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: sandflee >Assignee: sandflee >Priority: Critical > Attachments: YARN-4051.01.patch, YARN-4051.02.patch, > YARN-4051.03.patch, YARN-4051.04.patch, YARN-4051.05.patch, YARN-4051.06.patch > > > As in YARN-4050, NM event dispatcher is blocked, and container is in New > state, when we finish application, the container still alive even after NM > event dispatcher is unblocked. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5911) DrainDispatcher does not drain all events on stop even if setDrainEventsOnStop is true
[ https://issues.apache.org/jira/browse/YARN-5911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15684917#comment-15684917 ] sandflee commented on YARN-5911: sorry, not noticed it was removed in the patch, patch LGTM > DrainDispatcher does not drain all events on stop even if > setDrainEventsOnStop is true > -- > > Key: YARN-5911 > URL: https://issues.apache.org/jira/browse/YARN-5911 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Saxena >Assignee: Varun Saxena > Attachments: YARN-5911.01.patch, YARN-5911.02.patch > > > DrainDispatcher#serviceStop sets the stopped flag first before draining the > event queue. > This means that the thread terminates as soon as it encounters stopped flag > as true and does not continue to process leftover events in queue, something > which it should do if setDrainEventsOnStop is set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5911) DrainDispatcher does not drain all events on stop even if setDrainEventsOnStop is true
[ https://issues.apache.org/jira/browse/YARN-5911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15683146#comment-15683146 ] sandflee commented on YARN-5911: seems no need to keep var stopped in DrainDispatcher? > DrainDispatcher does not drain all events on stop even if > setDrainEventsOnStop is true > -- > > Key: YARN-5911 > URL: https://issues.apache.org/jira/browse/YARN-5911 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Saxena >Assignee: Varun Saxena > Attachments: YARN-5911.01.patch, YARN-5911.02.patch > > > DrainDispatcher#serviceStop sets the stopped flag first before draining the > event queue. > This means that the thread terminates as soon as it encounters stopped flag > as true and does not continue to process leftover events in queue, something > which it should do if setDrainEventsOnStop is set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5911) DrainDispatcher does not drain all events on stop even if setDrainEventsOnStop is true
[ https://issues.apache.org/jira/browse/YARN-5911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15678114#comment-15678114 ] sandflee commented on YARN-5911: thanks [~varun_saxena], one minor comment: if events are not drained, we will wait at least 1s per iteration since DrainDispatcher does not invoke waitForDrained.notify; could we use a shorter time? {code} while (!isDrained() && eventHandlingThread != null && eventHandlingThread.isAlive() && System.currentTimeMillis() < endTime) { waitForDrained.wait(1000); LOG.info("Waiting for AsyncDispatcher to drain. Thread state is :" + eventHandlingThread.getState()); } } {code} > DrainDispatcher does not drain all events on stop even if > setDrainEventsOnStop is true > -- > > Key: YARN-5911 > URL: https://issues.apache.org/jira/browse/YARN-5911 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Saxena >Assignee: Varun Saxena > Attachments: YARN-5911.01.patch > > > DrainDispatcher#serviceStop sets the stopped flag first before draining the > event queue. > This means that the thread terminates as soon as it encounters stopped flag > as true and does not continue to process leftover events in queue, something > which it should do if setDrainEventsOnStop is set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
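To make the drain-on-stop behavior under discussion concrete, here is a generic sketch of the pattern; it is not the actual AsyncDispatcher/DrainDispatcher code, and all names are illustrative. The handler thread keeps consuming events until it is both stopped and the queue is empty, so events enqueued before stop() are still processed.

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Generic drain-on-stop sketch; class and field names are illustrative only.
public class DrainOnStopSketch {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private volatile boolean stopped = false;

  private final Thread handler = new Thread(() -> {
    try {
      // Keep handling events until stopped AND the queue is drained.
      while (!stopped || !queue.isEmpty()) {
        Runnable event = queue.poll(100, TimeUnit.MILLISECONDS);
        if (event != null) {
          event.run();
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  });

  public void start() { handler.start(); }

  public void dispatch(Runnable event) { queue.add(event); }

  public void stop() throws InterruptedException {
    stopped = true;   // flipping the flag alone does not skip queued events
    handler.join();
  }
}
{code}

The bug described in the issue above is the opposite behavior: the handler thread exits as soon as it sees the stopped flag, leaving queued events unprocessed even when setDrainEventsOnStop is set.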
[jira] [Commented] (YARN-5898) Container can not stop, because the call stopContainer NMClient method appears DIGEST-MD5 exception, onGetContainerStatusError NMClientAsync method is also the same
[ https://issues.apache.org/jira/browse/YARN-5898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673309#comment-15673309 ] sandflee commented on YARN-5898: had AM ever restarted? > Container can not stop, because the call stopContainer NMClient method > appears DIGEST-MD5 exception, onGetContainerStatusError NMClientAsync method > is also the same > > > Key: YARN-5898 > URL: https://issues.apache.org/jira/browse/YARN-5898 > Project: Hadoop YARN > Issue Type: Bug > Components: api >Affects Versions: 2.6.0 > Environment: cdh5.5,java 7 >Reporter: gaoyanfu > Labels: DIGEST-MD5, getContainerStatuses, > onGetContainerStatusError, stopContainer > Fix For: 2.6.0 > > Original Estimate: 96h > Remaining Estimate: 96h > > GetContainerStatusAsync call the NMClientAsync method, the callback method > corresponding onGetContainerStatusError method, DIGEST-MD5 SaslException, > ContainerStatus stopContainer can not get; call the nmClient method will be > the exception, not stop Container. > ---REST API--- > request: > http://server3.xdpp.boco:8042/ws/v1/node/containers > response: > {"containers":{"container":[ > {"id":"container_e07_1477704520017_0001_01_04","state":"RUNNING","exitCode":-1000,"diagnostics":"","user":"xdpp","totalMemoryNeededMB":8704,"totalVCoresNeeded":1,"containerLogsLink":"http://server3.xdpp.boco:8042/node/containerlogs/container_e07_1477704520017_0001_01_04/xdpp","nodeId":"server3.xdpp.boco:8041"}, > {"id":"container_e09_1477719748865_0003_01_25","state":"RUNNING","exitCode":-1000,"diagnostics":"","user":"xdpp","totalMemoryNeededMB":1536,"totalVCoresNeeded":1,"containerLogsLink":"http://server3.xdpp.boco:8042/node/containerlogs/container_e09_1477719748865_0003_01_25/xdpp","nodeId":"server3.xdpp.boco:8041"}, > {"id":"container_e09_1477719748865_0004_02_000103","state":"RUNNING","exitCode":-1000,"diagnostics":"","user":"xdpp","totalMemoryNeededMB":6656,"totalVCoresNeeded":1,"containerLogsLink":"http://server3.xdpp.boco:8042/node/containerlogs/container_e09_1477719748865_0004_02_000103/xdpp","nodeId":"server3.xdpp.boco:8041"} > ]}} > ---exception-- > 2016-11-14 11:17:12.725 ERROR containerStatusLogger > [ContainerManager.java:484] *Container onGetContainerStatusError deal > begin.containerId:container_e09_1477719748865_0003_01_25 > javax.security.sasl.SaslException: DIGEST-MD5: digest response format > violation. Mismatched response. 
> at sun.reflect.GeneratedConstructorAccessor59.newInstance(Unknown > Source) ~[na:na] > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > ~[na:1.7.0_79] > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > ~[na:1.7.0_79] > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > ~[hadoop-yarn-common-2.6.0.jar:na] > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104) > ~[hadoop-yarn-common-2.6.0.jar:na] > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.getContainerStatuses(ContainerManagementProtocolPBClientImpl.java:127) > ~[hadoop-yarn-common-2.6.0.jar:na] > at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source) ~[na:na] > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > ~[na:1.7.0_79] > at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_79] > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > ~[hadoop-common-2.6.0.jar:na] > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > ~[hadoop-common-2.6.0.jar:na] > at com.sun.proxy.$Proxy23.getContainerStatuses(Unknown Source) ~[na:na] > at > org.apache.hadoop.yarn.client.api.impl.NMClientImpl.getContainerStatus(NMClientImpl.java:267) > ~[hadoop-yarn-client-2.6.0.jar:na] > at > org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$ContainerEventProcessor.run(NMClientAsyncImpl.java:534) > ~[hadoop-yarn-client-2.6.0.jar:na] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > [na:1.7.0_79] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > [na:1.7.0_79] > at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79] > Caused by: org.apache.hadoop.ipc.RemoteException: DIGEST-MD5: digest response > format violation. Mismatched
[jira] [Assigned] (YARN-5897) using drainEvent to replace sleep-wait in MockRM#waitForState
[ https://issues.apache.org/jira/browse/YARN-5897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee reassigned YARN-5897: -- Assignee: sandflee > using drainEvent to replace sleep-wait in MockRM#waitForState > - > > Key: YARN-5897 > URL: https://issues.apache.org/jira/browse/YARN-5897 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-5897) using drainEvent to replace sleep-wait in MockRM#waitForState
sandflee created YARN-5897: -- Summary: using drainEvent to replace sleep-wait in MockRM#waitForState Key: YARN-5897 URL: https://issues.apache.org/jira/browse/YARN-5897 Project: Hadoop YARN Issue Type: Improvement Reporter: sandflee -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-5896) add drainEvents to MockAM to reduce test failure
sandflee created YARN-5896: -- Summary: add drainEvents to MockAM to reduce test failure Key: YARN-5896 URL: https://issues.apache.org/jira/browse/YARN-5896 Project: Hadoop YARN Issue Type: Improvement Reporter: sandflee Assignee: sandflee -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5895) TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey
[ https://issues.apache.org/jira/browse/YARN-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672982#comment-15672982 ] sandflee commented on YARN-5895: I'm ok, close it with dup > TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey > --- > > Key: YARN-5895 > URL: https://issues.apache.org/jira/browse/YARN-5895 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 3.0.0-alpha1 >Reporter: Wilfred Spiegelenburg >Assignee: sandflee > > Even after YARN-5362 the test is still flaky: > {code} > Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 100.652 sec > <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart > testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) > Time elapsed: 0.338 sec <<< FAILURE! > java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: > 1479326359158 } application_name: "" queue: "default" priority { priority: 0 > } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 > resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" > keep_containers_across_application_attempts: false > attempt_failures_validity_interval: 0 am_container_resource_request { > priority { priority: 0 } resource_name: "*" capability { memory: 1024 > virtual_cores: 1 } num_containers: 0 relax_locality: true > node_label_expression: "" execution_type_request { execution_type: GUARANTEED > enforce_execution_type: false } } } user: "jenkins" start_time: 1479326359188 > application_state: RMAPP_FINISHED finish_time: 1479326359214> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotNull(Assert.java:664) > at org.junit.Assert.assertNull(Assert.java:646) > at org.junit.Assert.assertNull(Assert.java:656) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1659) > {code} > The test finishes with two asserts. This is the second assert that fails, > YARN-5362 looked at a failure on the first of the two asserts -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-5895) TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey
[ https://issues.apache.org/jira/browse/YARN-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee resolved YARN-5895. Resolution: Duplicate > TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey > --- > > Key: YARN-5895 > URL: https://issues.apache.org/jira/browse/YARN-5895 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 3.0.0-alpha1 >Reporter: Wilfred Spiegelenburg >Assignee: sandflee > > Even after YARN-5362 the test is still flaky: > {code} > Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 100.652 sec > <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart > testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) > Time elapsed: 0.338 sec <<< FAILURE! > java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: > 1479326359158 } application_name: "" queue: "default" priority { priority: 0 > } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 > resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" > keep_containers_across_application_attempts: false > attempt_failures_validity_interval: 0 am_container_resource_request { > priority { priority: 0 } resource_name: "*" capability { memory: 1024 > virtual_cores: 1 } num_containers: 0 relax_locality: true > node_label_expression: "" execution_type_request { execution_type: GUARANTEED > enforce_execution_type: false } } } user: "jenkins" start_time: 1479326359188 > application_state: RMAPP_FINISHED finish_time: 1479326359214> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotNull(Assert.java:664) > at org.junit.Assert.assertNull(Assert.java:646) > at org.junit.Assert.assertNull(Assert.java:656) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1659) > {code} > The test finishes with two asserts. This is the second assert that fails, > YARN-5362 looked at a failure on the first of the two asserts -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-5548) Random test failure TestRMRestart#testFinishedAppRemovalAfterRMRestart
[ https://issues.apache.org/jira/browse/YARN-5548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672972#comment-15672972 ] sandflee edited comment on YARN-5548 at 11/17/16 6:54 AM: -- agree, and maybe we could try to replace all MemoryStateStore with MockRMMemoryStateStore, not only in this test was (Author: sandflee): agree, and maybe we could try to replace all MemoryStateStore with MockRMStateStore > Random test failure TestRMRestart#testFinishedAppRemovalAfterRMRestart > -- > > Key: YARN-5548 > URL: https://issues.apache.org/jira/browse/YARN-5548 > Project: Hadoop YARN > Issue Type: Test >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Labels: oct16-easy, test > Attachments: YARN-5548.0001.patch, YARN-5548.0002.patch, > YARN-5548.0003.patch > > > https://builds.apache.org/job/PreCommit-YARN-Build/12850/testReport/org.apache.hadoop.yarn.server.resourcemanager/TestRMRestart/testFinishedAppRemovalAfterRMRestart/ > {noformat} > Error Message > Stacktrace > java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: > 1471885197388 } application_name: "" queue: "default" priority { priority: 0 > } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 > resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" > keep_containers_across_application_attempts: false > attempt_failures_validity_interval: 0 am_container_resource_request { > priority { priority: 0 } resource_name: "*" capability { memory: 1024 > virtual_cores: 1 } num_containers: 0 relax_locality: true > node_label_expression: "" execution_type_request { execution_type: GUARANTEED > enforce_execution_type: false } } } user: "jenkins" start_time: 1471885197417 > application_state: RMAPP_FINISHED finish_time: 1471885197478> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotNull(Assert.java:664) > at org.junit.Assert.assertNull(Assert.java:646) > at org.junit.Assert.assertNull(Assert.java:656) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1656) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5548) Random test failure TestRMRestart#testFinishedAppRemovalAfterRMRestart
[ https://issues.apache.org/jira/browse/YARN-5548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672972#comment-15672972 ] sandflee commented on YARN-5548: agree, and maybe we could try to replace all MemoryStateStore with MockRMStateStore > Random test failure TestRMRestart#testFinishedAppRemovalAfterRMRestart > -- > > Key: YARN-5548 > URL: https://issues.apache.org/jira/browse/YARN-5548 > Project: Hadoop YARN > Issue Type: Test >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Labels: oct16-easy, test > Attachments: YARN-5548.0001.patch, YARN-5548.0002.patch, > YARN-5548.0003.patch > > > https://builds.apache.org/job/PreCommit-YARN-Build/12850/testReport/org.apache.hadoop.yarn.server.resourcemanager/TestRMRestart/testFinishedAppRemovalAfterRMRestart/ > {noformat} > Error Message > Stacktrace > java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: > 1471885197388 } application_name: "" queue: "default" priority { priority: 0 > } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 > resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" > keep_containers_across_application_attempts: false > attempt_failures_validity_interval: 0 am_container_resource_request { > priority { priority: 0 } resource_name: "*" capability { memory: 1024 > virtual_cores: 1 } num_containers: 0 relax_locality: true > node_label_expression: "" execution_type_request { execution_type: GUARANTEED > enforce_execution_type: false } } } user: "jenkins" start_time: 1471885197417 > application_state: RMAPP_FINISHED finish_time: 1471885197478> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotNull(Assert.java:664) > at org.junit.Assert.assertNull(Assert.java:646) > at org.junit.Assert.assertNull(Assert.java:656) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1656) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672967#comment-15672967 ] sandflee commented on YARN-5375: yes, to keep the patch simple, not replace MemoryRMStateStore, let's go to YARN-5548 > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: sandflee >Assignee: sandflee > Labels: oct16-medium > Fix For: 2.9.0, 3.0.0-alpha2 > > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, > YARN-5375.08.patch, YARN-5375.09.patch, YARN-5375.10.patch, > YARN-5375.11.patch, YARN-5375.12.new.patch, YARN-5375.12.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5895) TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey
[ https://issues.apache.org/jira/browse/YARN-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672602#comment-15672602 ] sandflee commented on YARN-5895: yes, I think there are two main places not drain Events: 1, explicitly using MemoryStateStore like most RMRestart test, we could resolve by using MockRMMemoryStateStore 2, using MockRM.waitForState in static way, we could explicitly add rm.drainEvents() before call MockRM.waitForState or change MockRM.waitForState from static to non-static. thought? bq. some of the tests in YARN-4929 need to look again. will do > TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey > --- > > Key: YARN-5895 > URL: https://issues.apache.org/jira/browse/YARN-5895 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 3.0.0-alpha1 >Reporter: Wilfred Spiegelenburg >Assignee: sandflee > > Even after YARN-5362 the test is still flaky: > {code} > Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 100.652 sec > <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart > testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) > Time elapsed: 0.338 sec <<< FAILURE! > java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: > 1479326359158 } application_name: "" queue: "default" priority { priority: 0 > } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 > resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" > keep_containers_across_application_attempts: false > attempt_failures_validity_interval: 0 am_container_resource_request { > priority { priority: 0 } resource_name: "*" capability { memory: 1024 > virtual_cores: 1 } num_containers: 0 relax_locality: true > node_label_expression: "" execution_type_request { execution_type: GUARANTEED > enforce_execution_type: false } } } user: "jenkins" start_time: 1479326359188 > application_state: RMAPP_FINISHED finish_time: 1479326359214> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotNull(Assert.java:664) > at org.junit.Assert.assertNull(Assert.java:646) > at org.junit.Assert.assertNull(Assert.java:656) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1659) > {code} > The test finishes with two asserts. This is the second assert that fails, > YARN-5362 looked at a failure on the first of the two asserts -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
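A hedged sketch of option 2 above, explicitly draining events before a state assertion instead of sleep-waiting. MockRM#drainEvents comes from YARN-5375 as discussed earlier in this thread, while the constructor, submitApp, and waitForState signatures here are recalled from the test utilities of that era and should be treated as assumptions rather than exact API.

{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.MockRM;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMApp;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppState;

// Illustrative test fragment: drain dispatcher events before asserting state,
// instead of sleep-waiting for the async dispatcher to catch up.
public class DrainBeforeWaitSketch {
  public static void main(String[] args) throws Exception {
    MockRM rm = new MockRM(new YarnConfiguration());
    rm.start();

    RMApp app = rm.submitApp(1024);   // submit a 1 GB application

    // Make sure scheduler/state-store events are processed before the check.
    rm.drainEvents();
    rm.waitForState(app.getApplicationId(), RMAppState.ACCEPTED);

    rm.stop();
  }
}
{code}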
[jira] [Commented] (YARN-5895) TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey
[ https://issues.apache.org/jira/browse/YARN-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672604#comment-15672604 ] sandflee commented on YARN-5895: will do > TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey > --- > > Key: YARN-5895 > URL: https://issues.apache.org/jira/browse/YARN-5895 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 3.0.0-alpha1 >Reporter: Wilfred Spiegelenburg >Assignee: sandflee > > Even after YARN-5362 the test is still flaky: > {code} > Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 100.652 sec > <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart > testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) > Time elapsed: 0.338 sec <<< FAILURE! > java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: > 1479326359158 } application_name: "" queue: "default" priority { priority: 0 > } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 > resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" > keep_containers_across_application_attempts: false > attempt_failures_validity_interval: 0 am_container_resource_request { > priority { priority: 0 } resource_name: "*" capability { memory: 1024 > virtual_cores: 1 } num_containers: 0 relax_locality: true > node_label_expression: "" execution_type_request { execution_type: GUARANTEED > enforce_execution_type: false } } } user: "jenkins" start_time: 1479326359188 > application_state: RMAPP_FINISHED finish_time: 1479326359214> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotNull(Assert.java:664) > at org.junit.Assert.assertNull(Assert.java:646) > at org.junit.Assert.assertNull(Assert.java:656) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1659) > {code} > The test finishes with two asserts. This is the second assert that fails, > YARN-5362 looked at a failure on the first of the two asserts -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5895) TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey
[ https://issues.apache.org/jira/browse/YARN-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672483#comment-15672483 ] sandflee commented on YARN-5895: yes. > TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey > --- > > Key: YARN-5895 > URL: https://issues.apache.org/jira/browse/YARN-5895 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 3.0.0-alpha1 >Reporter: Wilfred Spiegelenburg >Assignee: sandflee > > Even after YARN-5362 the test is still flaky: > {code} > Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 100.652 sec > <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart > testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) > Time elapsed: 0.338 sec <<< FAILURE! > java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: > 1479326359158 } application_name: "" queue: "default" priority { priority: 0 > } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 > resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" > keep_containers_across_application_attempts: false > attempt_failures_validity_interval: 0 am_container_resource_request { > priority { priority: 0 } resource_name: "*" capability { memory: 1024 > virtual_cores: 1 } num_containers: 0 relax_locality: true > node_label_expression: "" execution_type_request { execution_type: GUARANTEED > enforce_execution_type: false } } } user: "jenkins" start_time: 1479326359188 > application_state: RMAPP_FINISHED finish_time: 1479326359214> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotNull(Assert.java:664) > at org.junit.Assert.assertNull(Assert.java:646) > at org.junit.Assert.assertNull(Assert.java:656) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1659) > {code} > The test finishes with two asserts. This is the second assert that fails, > YARN-5362 looked at a failure on the first of the two asserts -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-5895) TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey
[ https://issues.apache.org/jira/browse/YARN-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672377#comment-15672377 ] sandflee edited comment on YARN-5895 at 11/17/16 1:48 AM: -- thanks [~wilfreds] for reporting this, before YARN-5375, drainEvents did't grant really drain all the events, so there is a litte change to cause test failure. especially for RMRestart test which is using MemoryStateStore not MockMemoryStateStore. at this jira. I'll try to resolve RMRestart related test failure completely. was (Author: sandflee): thanks [~wilfreds] for reporting this, before YARN-5375, drainEvents did't grant really drain all the events, no there is a litte change to cause test failure. especially for RMRestart test which is using MemoryStateStore not MockMemoryStateStore. at this jira. I'll try to resolve RMRestart related test failure completely. > TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey > --- > > Key: YARN-5895 > URL: https://issues.apache.org/jira/browse/YARN-5895 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 3.0.0-alpha1 >Reporter: Wilfred Spiegelenburg >Assignee: sandflee > > Even after YARN-5362 the test is still flaky: > {code} > Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 100.652 sec > <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart > testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) > Time elapsed: 0.338 sec <<< FAILURE! > java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: > 1479326359158 } application_name: "" queue: "default" priority { priority: 0 > } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 > resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" > keep_containers_across_application_attempts: false > attempt_failures_validity_interval: 0 am_container_resource_request { > priority { priority: 0 } resource_name: "*" capability { memory: 1024 > virtual_cores: 1 } num_containers: 0 relax_locality: true > node_label_expression: "" execution_type_request { execution_type: GUARANTEED > enforce_execution_type: false } } } user: "jenkins" start_time: 1479326359188 > application_state: RMAPP_FINISHED finish_time: 1479326359214> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotNull(Assert.java:664) > at org.junit.Assert.assertNull(Assert.java:646) > at org.junit.Assert.assertNull(Assert.java:656) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1659) > {code} > The test finishes with two asserts. This is the second assert that fails, > YARN-5362 looked at a failure on the first of the two asserts -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5895) TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey
[ https://issues.apache.org/jira/browse/YARN-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672377#comment-15672377 ] sandflee commented on YARN-5895: thanks [~wilfreds] for reporting this. Before YARN-5375, drainEvents didn't guarantee that all events were really drained, so there is still a small chance of a test failure, especially for the RMRestart tests, which use MemoryStateStore rather than MockMemoryStateStore. In this jira I'll try to resolve the RMRestart-related test failures completely. > TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey > --- > > Key: YARN-5895 > URL: https://issues.apache.org/jira/browse/YARN-5895 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 3.0.0-alpha1 >Reporter: Wilfred Spiegelenburg >Assignee: sandflee > > Even after YARN-5362 the test is still flaky: > {code} > Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 100.652 sec > <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart > testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) > Time elapsed: 0.338 sec <<< FAILURE! > java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: > 1479326359158 } application_name: "" queue: "default" priority { priority: 0 > } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 > resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" > keep_containers_across_application_attempts: false > attempt_failures_validity_interval: 0 am_container_resource_request { > priority { priority: 0 } resource_name: "*" capability { memory: 1024 > virtual_cores: 1 } num_containers: 0 relax_locality: true > node_label_expression: "" execution_type_request { execution_type: GUARANTEED > enforce_execution_type: false } } } user: "jenkins" start_time: 1479326359188 > application_state: RMAPP_FINISHED finish_time: 1479326359214> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotNull(Assert.java:664) > at org.junit.Assert.assertNull(Assert.java:646) > at org.junit.Assert.assertNull(Assert.java:656) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1659) > {code} > The test finishes with two asserts. This is the second assert that fails, > YARN-5362 looked at a failure on the first of the two asserts -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-5895) TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey
[ https://issues.apache.org/jira/browse/YARN-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee reassigned YARN-5895: -- Assignee: sandflee > TestRMRestart#testFinishedAppRemovalAfterRMRestart is still flakey > --- > > Key: YARN-5895 > URL: https://issues.apache.org/jira/browse/YARN-5895 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 3.0.0-alpha1 >Reporter: Wilfred Spiegelenburg >Assignee: sandflee > > Even after YARN-5362 the test is still flaky: > {code} > Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 100.652 sec > <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart > testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) > Time elapsed: 0.338 sec <<< FAILURE! > java.lang.AssertionError: expected null, but was: application_submission_context { application_id { id: 1 cluster_timestamp: > 1479326359158 } application_name: "" queue: "default" priority { priority: 0 > } am_container_spec { } cancel_tokens_when_complete: true maxAppAttempts: 2 > resource { memory: 1024 virtual_cores: 1 } applicationType: "YARN" > keep_containers_across_application_attempts: false > attempt_failures_validity_interval: 0 am_container_resource_request { > priority { priority: 0 } resource_name: "*" capability { memory: 1024 > virtual_cores: 1 } num_containers: 0 relax_locality: true > node_label_expression: "" execution_type_request { execution_type: GUARANTEED > enforce_execution_type: false } } } user: "jenkins" start_time: 1479326359188 > application_state: RMAPP_FINISHED finish_time: 1479326359214> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotNull(Assert.java:664) > at org.junit.Assert.assertNull(Assert.java:646) > at org.junit.Assert.assertNull(Assert.java:656) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1659) > {code} > The test finishes with two asserts. This is the second assert that fails, > YARN-5362 looked at a failure on the first of the two asserts -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672354#comment-15672354 ] sandflee commented on YARN-5375: Thanks Rohith, Sunil and Varun for the review and commit! After this patch most of the random test failures should be resolved. Will open another jira to fix RMRestart and MockAM. > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: sandflee >Assignee: sandflee > Labels: oct16-medium > Fix For: 2.9.0, 3.0.0-alpha2 > > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, > YARN-5375.08.patch, YARN-5375.09.patch, YARN-5375.10.patch, > YARN-5375.11.patch, YARN-5375.12.new.patch, YARN-5375.12.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669662#comment-15669662 ] sandflee commented on YARN-5375: update YARN-5375.12.new.patch to trigger jenkins > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: sandflee >Assignee: sandflee > Labels: oct16-medium > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, > YARN-5375.08.patch, YARN-5375.09.patch, YARN-5375.10.patch, > YARN-5375.11.patch, YARN-5375.12.new.patch, YARN-5375.12.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5375: --- Attachment: YARN-5375.12.new.patch > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: sandflee >Assignee: sandflee > Labels: oct16-medium > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, > YARN-5375.08.patch, YARN-5375.09.patch, YARN-5375.10.patch, > YARN-5375.11.patch, YARN-5375.12.new.patch, YARN-5375.12.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5375: --- Attachment: YARN-5375.12.patch > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: sandflee >Assignee: sandflee > Labels: oct16-medium > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, > YARN-5375.08.patch, YARN-5375.09.patch, YARN-5375.10.patch, > YARN-5375.11.patch, YARN-5375.12.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15666843#comment-15666843 ] sandflee commented on YARN-5375: Updated the patch to invoke drainEventsImplicitly() at the front of waitForState(), and fixed a test failure caused by this: {code:title=TestAbstractYarnScheduler.java} // AM crashes, and a new app-attempt gets created node.nodeHeartbeat(applicationAttemptOneID, 1, ContainerState.COMPLETE); rm.waitForState(node, am1ContainerID, RMContainerState.COMPLETED, 30 * 1000); RMAppAttempt rmAppAttempt2 = MockRM.waitForAttemptScheduled(rmApp, rm); {code} waitForState now drains all events first, so the completed container no longer exists in the new schedulerAppAttempt; the corresponding waitForState() call then invokes a node heartbeat, the app attempt goes to the allocated state, and the check for the scheduled state fails. > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: sandflee >Assignee: sandflee > Labels: oct16-medium > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, > YARN-5375.08.patch, YARN-5375.09.patch, YARN-5375.10.patch, YARN-5375.11.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
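For readers skimming the thread, the pattern being wired into waitForState() above is simply "flush the dispatcher, then assert". A minimal standalone sketch of that idea in plain Java follows; ToyDispatcher and the marker-event trick are invented for illustration and are not the actual MockRM code.
{code:title=DrainBeforeWaitSketch.java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

/** Toy dispatcher: a single handler thread pulls events off a queue. */
class ToyDispatcher {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final Thread handler = new Thread(() -> {
    try {
      while (true) {
        queue.take().run();
      }
    } catch (InterruptedException e) {
      // stop on interrupt
    }
  });

  ToyDispatcher() { handler.setDaemon(true); handler.start(); }

  void dispatch(Runnable event) { queue.add(event); }

  /** Block until every event enqueued so far has been handled. */
  void drainEvents() throws InterruptedException {
    // Enqueue a marker and wait for it: when it runs, everything before it ran too.
    final Object lock = new Object();
    final boolean[] done = {false};
    dispatch(() -> { synchronized (lock) { done[0] = true; lock.notifyAll(); } });
    synchronized (lock) {
      while (!done[0]) { lock.wait(); }
    }
  }
}

public class DrainBeforeWaitSketch {
  public static void main(String[] args) throws Exception {
    ToyDispatcher dispatcher = new ToyDispatcher();
    AtomicReference<String> appState = new AtomicReference<>("NEW");

    dispatcher.dispatch(() -> appState.set("RUNNING"));

    // The pattern the patch applies inside waitForState(): drain first,
    // then check the state, instead of sleep-polling and hoping the
    // handler thread got there in time.
    dispatcher.drainEvents();
    if (!"RUNNING".equals(appState.get())) {
      throw new AssertionError("state not reached: " + appState.get());
    }
    System.out.println("state reached after drain");
  }
}
{code}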
[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5375: --- Attachment: YARN-5375.11.patch > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: sandflee >Assignee: sandflee > Labels: oct16-medium > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, > YARN-5375.08.patch, YARN-5375.09.patch, YARN-5375.10.patch, YARN-5375.11.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15663934#comment-15663934 ] sandflee commented on YARN-5375: thanks [~rohithsharma], TestTokenClientRMService test failure seems not related to this issue, it couldn't run pass locally and is tracked by YARN-5875 > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: sandflee >Assignee: sandflee > Labels: oct16-medium > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, > YARN-5375.08.patch, YARN-5375.09.patch, YARN-5375.10.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5453) FairScheduler#update may skip update demand resource of child queue/app if current demand reached maxResource
[ https://issues.apache.org/jira/browse/YARN-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653433#comment-15653433 ] sandflee commented on YARN-5453: thanks [~kasha] for review and commit ! > FairScheduler#update may skip update demand resource of child queue/app if > current demand reached maxResource > - > > Key: YARN-5453 > URL: https://issues.apache.org/jira/browse/YARN-5453 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: sandflee >Assignee: sandflee > Labels: oct16-easy > Fix For: 2.9.0, 3.0.0-alpha2 > > Attachments: YARN-5453.01.patch, YARN-5453.02.patch, > YARN-5453.03.patch, YARN-5453.04.patch, YARN-5453.05.patch > > > {code} > demand = Resources.createResource(0); > for (FSQueue childQueue : childQueues) { > childQueue.updateDemand(); > Resource toAdd = childQueue.getDemand(); > demand = Resources.add(demand, toAdd); > demand = Resources.componentwiseMin(demand, maxRes); > if (Resources.equals(demand, maxRes)) { > break; > } > } > {code} > if one singe queue's demand resource exceed maxRes, the other queue's demand > resource will not update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
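As background on the snippet quoted in the issue description: because the loop breaks as soon as the accumulated demand reaches maxRes, updateDemand() is never called on the remaining children, so their cached demand goes stale. The following is a simplified, standalone illustration of the bug shape and one possible fix shape, using toy ints instead of Resource objects; it is not necessarily the committed patch.
{code:title=DemandUpdateSketch.java}
import java.util.Arrays;
import java.util.List;

/** Toy queue: children cache a "demand" that must be refreshed on every update pass. */
public class DemandUpdateSketch {

  static class ChildQueue {
    final String name;
    final int ask;      // what the queue currently wants
    int cachedDemand;   // what the parent last recorded

    ChildQueue(String name, int ask) { this.name = name; this.ask = ask; }

    void updateDemand() { cachedDemand = ask; }
  }

  /** Buggy shape: stops updating children once the parent's cap is reached. */
  static int updateBuggy(List<ChildQueue> children, int maxRes) {
    int demand = 0;
    for (ChildQueue child : children) {
      child.updateDemand();
      demand = Math.min(demand + child.cachedDemand, maxRes);
      if (demand == maxRes) {
        break;              // remaining children keep a stale cachedDemand
      }
    }
    return demand;
  }

  /** Fixed shape: every child is refreshed; only the aggregate is capped. */
  static int updateFixed(List<ChildQueue> children, int maxRes) {
    int demand = 0;
    for (ChildQueue child : children) {
      child.updateDemand();               // always refresh the child
      if (demand < maxRes) {
        demand = Math.min(demand + child.cachedDemand, maxRes);
      }
    }
    return demand;
  }

  public static void main(String[] args) {
    List<ChildQueue> children = Arrays.asList(new ChildQueue("a", 100), new ChildQueue("b", 30));
    System.out.println("buggy aggregate = " + updateBuggy(children, 50)
        + ", b.cachedDemand = " + children.get(1).cachedDemand);   // b never refreshed (stays 0)

    children = Arrays.asList(new ChildQueue("a", 100), new ChildQueue("b", 30));
    System.out.println("fixed aggregate = " + updateFixed(children, 50)
        + ", b.cachedDemand = " + children.get(1).cachedDemand);   // b refreshed to 30
  }
}
{code}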
[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15643429#comment-15643429 ] sandflee commented on YARN-5375: Updated the patch to fix TestFairScheduler by adding drainEvents after a node is registered. I did not go through rm.dispatcher.handle(), since most TestFairScheduler tests don't use that approach. > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: sandflee >Assignee: sandflee > Labels: oct16-medium > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, > YARN-5375.08.patch, YARN-5375.09.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5375: --- Attachment: YARN-5375.09.patch > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: sandflee >Assignee: sandflee > Labels: oct16-medium > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, > YARN-5375.08.patch, YARN-5375.09.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5453) FairScheduler#update may skip update demand resource of child queue/app if current demand reached maxResource
[ https://issues.apache.org/jira/browse/YARN-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15642979#comment-15642979 ] sandflee commented on YARN-5453: thanks [~kasha], patch updated. > FairScheduler#update may skip update demand resource of child queue/app if > current demand reached maxResource > - > > Key: YARN-5453 > URL: https://issues.apache.org/jira/browse/YARN-5453 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: sandflee >Assignee: sandflee > Labels: oct16-easy > Attachments: YARN-5453.01.patch, YARN-5453.02.patch, > YARN-5453.03.patch, YARN-5453.04.patch, YARN-5453.05.patch > > > {code} > demand = Resources.createResource(0); > for (FSQueue childQueue : childQueues) { > childQueue.updateDemand(); > Resource toAdd = childQueue.getDemand(); > demand = Resources.add(demand, toAdd); > demand = Resources.componentwiseMin(demand, maxRes); > if (Resources.equals(demand, maxRes)) { > break; > } > } > {code} > if one singe queue's demand resource exceed maxRes, the other queue's demand > resource will not update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5453) FairScheduler#update may skip update demand resource of child queue/app if current demand reached maxResource
[ https://issues.apache.org/jira/browse/YARN-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5453: --- Attachment: YARN-5453.05.patch > FairScheduler#update may skip update demand resource of child queue/app if > current demand reached maxResource > - > > Key: YARN-5453 > URL: https://issues.apache.org/jira/browse/YARN-5453 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: sandflee >Assignee: sandflee > Labels: oct16-easy > Attachments: YARN-5453.01.patch, YARN-5453.02.patch, > YARN-5453.03.patch, YARN-5453.04.patch, YARN-5453.05.patch > > > {code} > demand = Resources.createResource(0); > for (FSQueue childQueue : childQueues) { > childQueue.updateDemand(); > Resource toAdd = childQueue.getDemand(); > demand = Resources.add(demand, toAdd); > demand = Resources.componentwiseMin(demand, maxRes); > if (Resources.equals(demand, maxRes)) { > break; > } > } > {code} > if one singe queue's demand resource exceed maxRes, the other queue's demand > resource will not update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5276) print more info when event queue is blocked
[ https://issues.apache.org/jira/browse/YARN-5276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15628654#comment-15628654 ] sandflee commented on YARN-5276: Thanks [~miklos.szeg...@cloudera.com] for your detailed reply, it seems there is not much need to add a UT :( > print more info when event queue is blocked > --- > > Key: YARN-5276 > URL: https://issues.apache.org/jira/browse/YARN-5276 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager >Reporter: sandflee >Assignee: sandflee > Labels: oct16-easy > Attachments: YARN-5276.01.patch, YARN-5276.02.patch, > YARN-5276.03.patch, YARN-5276.04.patch > > > we now see logs like "Size of event-queue is 498000, Size of event-queue is > 499000" and difficult to know which event flood the queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
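The kind of logging this jira is after can be sketched independently of AsyncDispatcher: when the queue size crosses a threshold, log a per-event-type breakdown rather than just the size. Below is a minimal standalone sketch; the names and threshold are invented, and this is not the committed change.
{code:title=QueueBreakdownSketch.java}
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class QueueBreakdownSketch {

  static final int LOG_THRESHOLD = 1000;

  static void maybeLogQueue(Queue<Object> eventQueue) {
    int size = eventQueue.size();
    if (size < LOG_THRESHOLD || size % LOG_THRESHOLD != 0) {
      return;
    }
    // Count pending events by class so the log says *what* is flooding the queue,
    // not just how big it is.
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (Object event : eventQueue) {
      counts.merge(event.getClass().getSimpleName(), 1, Integer::sum);
    }
    System.out.println("Size of event-queue is " + size + ", breakdown: " + counts);
  }

  public static void main(String[] args) {
    Queue<Object> queue = new ConcurrentLinkedQueue<>();
    for (int i = 0; i < 700; i++) queue.add(new StringBuilder("nodeUpdate"));
    for (int i = 0; i < 300; i++) queue.add("appAttemptEvent");
    maybeLogQueue(queue);   // prints: ... breakdown: {StringBuilder=700, String=300}
  }
}
{code}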
[jira] [Updated] (YARN-5276) print more info when event queue is blocked
[ https://issues.apache.org/jira/browse/YARN-5276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5276: --- Attachment: YARN-5276.04.patch > print more info when event queue is blocked > --- > > Key: YARN-5276 > URL: https://issues.apache.org/jira/browse/YARN-5276 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager >Reporter: sandflee >Assignee: sandflee > Labels: oct16-easy > Attachments: YARN-5276.01.patch, YARN-5276.02.patch, > YARN-5276.03.patch, YARN-5276.04.patch > > > we now see logs like "Size of event-queue is 498000, Size of event-queue is > 499000" and difficult to know which event flood the queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5276) print more info when event queue is blocked
[ https://issues.apache.org/jira/browse/YARN-5276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613741#comment-15613741 ] sandflee commented on YARN-5276: Thanks for your comment. The patch just adds some log info, so I wonder how to test it? > print more info when event queue is blocked > --- > > Key: YARN-5276 > URL: https://issues.apache.org/jira/browse/YARN-5276 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager >Reporter: sandflee >Assignee: sandflee > Labels: oct16-easy > Attachments: YARN-5276.01.patch, YARN-5276.02.patch, > YARN-5276.03.patch > > > we now see logs like "Size of event-queue is 498000, Size of event-queue is > 499000" and difficult to know which event flood the queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612189#comment-15612189 ] sandflee commented on YARN-5375: Thanks [~rohithsharma], yes, MockRM#createSchedulerEventDispatcher triggers the deadlock. The NodeAddEvent that used to be left in the scheduler event queue now gets processed. Agree that using resourcemanager.handle(nodeUpdate) could solve this bug, and this test seems very special in that only the rm dispatcher is started. > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, > YARN-5375.08.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605288#comment-15605288 ] sandflee commented on YARN-5375: update YARN-5375.08.patch to address the comment of [~rohithsharma], and 1. add MockRMNullStateStore since NullStateStore is mostly used 2. simple fix TestFairScheduler deadlock > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, > YARN-5375.08.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5375: --- Attachment: YARN-5375.08.patch > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch, > YARN-5375.08.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601173#comment-15601173 ] sandflee commented on YARN-5375: sorry for the delay, will do this in these days > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5362) TestRMRestart#testFinishedAppRemovalAfterRMRestart can fail
[ https://issues.apache.org/jira/browse/YARN-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559054#comment-15559054 ] sandflee commented on YARN-5362: thanks [~Naganarasimha], I'll have a look. > TestRMRestart#testFinishedAppRemovalAfterRMRestart can fail > --- > > Key: YARN-5362 > URL: https://issues.apache.org/jira/browse/YARN-5362 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jason Lowe >Assignee: sandflee > Fix For: 2.9.0, 3.0.0-alpha1 > > Attachments: YARN-5362.01.patch > > > Saw the following in a precommit build that only changed an unrelated unit > test: > {noformat} > Tests run: 29, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 101.265 sec > <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart > testFinishedAppRemovalAfterRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) > Time elapsed: 0.411 sec <<< FAILURE! > java.lang.AssertionError: expected null, but > was:> at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotNull(Assert.java:664) > at org.junit.Assert.assertNull(Assert.java:646) > at org.junit.Assert.assertNull(Assert.java:656) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testFinishedAppRemovalAfterRMRestart(TestRMRestart.java:1653) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5453) FairScheduler#update may skip update demand resource of child queue/app if current demand reached maxResource
[ https://issues.apache.org/jira/browse/YARN-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15551422#comment-15551422 ] sandflee commented on YARN-5453: Hi, [~kasha], patch updated, failed test could run pass locally seems not related. > FairScheduler#update may skip update demand resource of child queue/app if > current demand reached maxResource > - > > Key: YARN-5453 > URL: https://issues.apache.org/jira/browse/YARN-5453 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5453.01.patch, YARN-5453.02.patch, > YARN-5453.03.patch, YARN-5453.04.patch > > > {code} > demand = Resources.createResource(0); > for (FSQueue childQueue : childQueues) { > childQueue.updateDemand(); > Resource toAdd = childQueue.getDemand(); > demand = Resources.add(demand, toAdd); > demand = Resources.componentwiseMin(demand, maxRes); > if (Resources.equals(demand, maxRes)) { > break; > } > } > {code} > if one singe queue's demand resource exceed maxRes, the other queue's demand > resource will not update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5453) FairScheduler#update may skip update demand resource of child queue/app if current demand reached maxResource
[ https://issues.apache.org/jira/browse/YARN-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5453: --- Attachment: YARN-5453.04.patch > FairScheduler#update may skip update demand resource of child queue/app if > current demand reached maxResource > - > > Key: YARN-5453 > URL: https://issues.apache.org/jira/browse/YARN-5453 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5453.01.patch, YARN-5453.02.patch, > YARN-5453.03.patch, YARN-5453.04.patch > > > {code} > demand = Resources.createResource(0); > for (FSQueue childQueue : childQueues) { > childQueue.updateDemand(); > Resource toAdd = childQueue.getDemand(); > demand = Resources.add(demand, toAdd); > demand = Resources.componentwiseMin(demand, maxRes); > if (Resources.equals(demand, maxRes)) { > break; > } > } > {code} > if one singe queue's demand resource exceed maxRes, the other queue's demand > resource will not update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15538216#comment-15538216 ] sandflee commented on YARN-5375: Thanks [~rohithsharma] for your review ! bq. private volatile boolean drained = true; default value has been changed. Would you tell why this change required? To be consistent with AsyncDispatcher#drained, and a default value of true seems more reasonable. bq. I think change in the method static to non-static not necessarily required in MockRM#waitForState. Lets keep it as it is. As a result, MockAM modifications are not at all required. The change from static to non-static is to add drainEventsImplicitly(); if we keep it as it is, the invoker (MockAM) may have to explicitly call rm#drainEvents. bq. nit: couple of changes which are not modified are appeared in patch. May be check those also, else patch looks very huge. Ex : MockRM class, line no 349, 332 Will do. bq. One doubt, if once * disableDrainEventsImplicitly* set then there is no way to enable it. Should we provide enabling method also? I couldn't figure out a scenario where we would disable and then enable it again, but I'm ok with adding an enabling method. bq. After this patch, can sleeps can be avoided ? If yes, I think we need to remove so that test execute faster. Yes, after drainEvents all events are processed, so there is no need to sleep-wait anymore. > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15522347#comment-15522347 ] sandflee commented on YARN-5375: I'm ok for both method, any thought? > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5315) Standby RM keep sending start am container request to NM
[ https://issues.apache.org/jira/browse/YARN-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15522338#comment-15522338 ] sandflee commented on YARN-5315: {quote} shutdownNow : Attempts to stop all actively executing tasks, halts the processing of waiting tasks, and returns a list of the tasks that were awaiting execution. These tasks are drained (removed) from the task queue upon return from this method. {quote} thanks [~jianhe], shutdownNow will interrupt the active workers and drain the pending tasks, so the difference is that patch 1 will not wait for the active workers to terminate while patch 2 will. It seems we don't get much benefit from awaitTermination. > Standby RM keep sending start am container request to NM > > > Key: YARN-5315 > URL: https://issues.apache.org/jira/browse/YARN-5315 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5315.01.patch, YARN-5315.02.patch > > > 1, network partitions, RM couldn't connect to NMs and start AM request pending > 2, RM becomes standby, int ApplicatioinMasterLauncher#serviceStop, > launcherPool are shutdown. the launching thread are interrupted, but start AM > request may still left in Queue > 3,network reconnect, standby RM sends start AM request to NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
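The JDK behaviour under discussion is easy to see in isolation: shutdownNow() interrupts the running worker and returns the tasks that never started, while awaitTermination() merely adds a bounded wait for the interrupted worker to exit. A small self-contained sketch using a plain ExecutorService (nothing YARN-specific) follows.
{code:title=ShutdownNowSketch.java}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownNowSketch {
  public static void main(String[] args) throws InterruptedException {
    ExecutorService launcherPool = Executors.newFixedThreadPool(1);

    // One "launch AM" task is running, one is still waiting in the queue.
    launcherPool.execute(() -> {
      try {
        Thread.sleep(60_000);                 // pretend to talk to an unreachable NM
      } catch (InterruptedException e) {
        System.out.println("active launch interrupted");
      }
    });
    launcherPool.execute(() -> System.out.println("this pending launch must never run"));

    Thread.sleep(100);                        // let the first task start

    // shutdownNow: interrupts the active worker AND drains the pending task,
    // so the second request can never be sent after failover.
    List<Runnable> neverStarted = launcherPool.shutdownNow();
    System.out.println("pending tasks drained: " + neverStarted.size());

    // awaitTermination only adds a bounded wait for the interrupted worker to exit;
    // it does not change which tasks get dropped.
    boolean done = launcherPool.awaitTermination(5, TimeUnit.SECONDS);
    System.out.println("worker exited: " + done);
  }
}
{code}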
[jira] [Comment Edited] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1565#comment-1565 ] sandflee edited comment on YARN-5375 at 8/29/16 1:24 AM: - thanks [~varun_saxena], [~sunilg], [~rohithsharma] for your comment and suggest ! , update two patches. 1, add MockRMMemoryStateStore in MockRM "drain patch" adds a DrainDispatcher and will call rm-dispatcher.await, statestore-dispatcher.await rm-dispatcher.await when drainEvents. this works for almost all of cases "sync patch" makes stateStore Event processed in a sync way. so drainEvents will drain all events, this will drain some unnessesary events, but seems a more general way. 2, accessing DrainDispatcher#drained should be protected by mutex, or there will be a race condition. was (Author: sandflee): thanks [~varun_saxena], [~sunilg], [~rohithsharma] for your comment and suggest ! 1, add MockRMMemoryStateStore, "drain patch" adds a DrainDispatcher and will call rm-dispatcher.await, statestore-dispatcher.await rm-dispatcher.await when drainEvents. this works for almost all of cases "sync patch" makes stateStore Event processed in a sync way. so drainEvents will drain all events, this will drain some unnessesary events, but seems a more general way. 2, accessing DrainDispatcher#drained should be protected by mutex, or there will be a race condition. > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1565#comment-1565 ] sandflee commented on YARN-5375: thanks [~varun_saxena], [~sunilg], [~rohithsharma] for your comments and suggestions! 1, added MockRMMemoryStateStore. The "drain patch" adds a DrainDispatcher and, in drainEvents, calls rm-dispatcher.await, statestore-dispatcher.await and rm-dispatcher.await again; this works for almost all cases. The "sync patch" makes stateStore events processed in a sync way, so drainEvents will drain all events; this drains some unnecessary events but seems a more general way. 2, accessing DrainDispatcher#drained should be protected by a mutex, or there will be a race condition. > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
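On point 2: the race is that a waiting thread can read the drained flag in the instant between "event taken off the queue" and "flag cleared/set". A common fix is to guard the flag and the queue-empty check with the same monitor. The following standalone sketch shows that shape with hypothetical names; it is not the actual DrainDispatcher code, and the handler busy-polls only for brevity.
{code:title=DrainedFlagSketch.java}
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy dispatcher whose await() must not return while an event is still in flight. */
public class DrainedFlagSketch {
  private final Deque<Runnable> queue = new ArrayDeque<>();
  private final Object mutex = new Object();
  private boolean drained = true;   // nothing enqueued yet (default true, as discussed above)

  public void dispatch(Runnable event) {
    synchronized (mutex) {
      queue.add(event);
      drained = false;
    }
  }

  /** Handler step: runs one event; flips drained back to true only under the lock. */
  public void handleOne() {
    Runnable event;
    synchronized (mutex) {
      event = queue.poll();
    }
    if (event != null) {
      event.run();
    }
    synchronized (mutex) {
      if (queue.isEmpty()) {
        drained = true;
        mutex.notifyAll();
      }
    }
  }

  /** Blocks until every dispatched event has been processed. */
  public void await() throws InterruptedException {
    synchronized (mutex) {
      while (!drained) {
        mutex.wait();
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    DrainedFlagSketch d = new DrainedFlagSketch();
    Thread handler = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        d.handleOne();   // busy-polls for brevity; a real dispatcher would block on take()
      }
    });
    handler.setDaemon(true);
    handler.start();

    final int[] handled = {0};
    for (int i = 0; i < 1000; i++) {
      d.dispatch(() -> handled[0]++);
    }
    d.await();
    System.out.println("handled = " + handled[0]);   // always 1000 once await() returns
  }
}
{code}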
[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5375: --- Attachment: (was: YARN-5375.07-sync-statestore.patch) > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5375: --- Attachment: YARN-5375.07-sync-statestore.patch > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5375: --- Attachment: YARN-5375.07-drain-statestore.patch YARN-5375.07-sync-statestore.patch > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch, > YARN-5375.07-drain-statestore.patch, YARN-5375.07-sync-statestore.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435099#comment-15435099 ] sandflee commented on YARN-5375: thanks [~rohithsharma], even if we use the order rmDispatcher.await --> stateStoreDispatcher --> rmDispatcher, there is still a very small chance that some events are not processed in stateStoreDispatcher or rmDispatcher, agree? I would prefer an accurate way that really drains all events before MockRM#drainEvents returns. Based on that, we could use drainEvents to replace the sleep-wait approach in MockRM#waitForState, which may also help reduce test time. Thoughts? > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
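A rough sketch of the drain-instead-of-sleep idea from the comment above. It is illustrative only and assumes a drainEvents() helper that blocks until the RM, scheduler and state-store queues are all empty; it is not the actual MockRM code.
{code}
// Illustrative sketch only (assumed helper behavior, not the real MockRM):
// once drainEvents() truly drains every dispatcher, waitForState no longer
// needs to poll with Thread.sleep().
public void waitForState(ApplicationId appId, RMAppState expected)
    throws Exception {
  drainEvents();   // assumed to block until rm, scheduler and state-store
                   // event queues are all empty
  RMApp app = getRMContext().getRMApps().get(appId);
  Assert.assertEquals("App did not reach " + expected, expected, app.getState());
}
{code}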
[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434730#comment-15434730 ] sandflee commented on YARN-5375: bq. One comment from earlier patch in MockRM is need to take care whenever drain for Stastore dispatcher, again need to wait for draining rm-dispatcher. This is required because state store trigger another event to rm-dispatcher. yes, we should take care of this. But double-draining the rm-dispatcher may not help, because the rm state store may produce new events again. One approach is to add a timestamp to DrainDispatcher recording when the latest event was added; after draining the StateStore events, we could check the timestamp of the rm-dispatcher, and if it has not been updated we can be sure that all events are drained. > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
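A hedged sketch of the timestamp idea described in the comment above. The getLastEventAddTime helper is an assumption for illustration; only the drain order (rm dispatcher, then state-store dispatcher) comes from the discussion, and this is not the actual patch.
{code}
// Sketch only (assumed helper names): detect whether draining the state store
// pushed new work back into the rm dispatcher by comparing a "last event
// added" timestamp before and after the state-store drain.
long lastSeen;
do {
  lastSeen = rmDispatcher.getLastEventAddTime();   // assumed helper on DrainDispatcher
  rmDispatcher.await();                            // drain pending rm events
  stateStoreDispatcher.await();                    // drain state-store events, which
                                                   // may enqueue new rm events
} while (rmDispatcher.getLastEventAddTime() != lastSeen);
// if the timestamp did not move, no event was added back to the rm dispatcher
// while the state store drained, so both queues are really drained
{code}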
[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434024#comment-15434024 ] sandflee commented on YARN-5375: also see YARN-5043, if StateStore Event is not processed, more likely it will produce another RM Event. > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431785#comment-15431785 ] sandflee commented on YARN-5375: Thanks [~varun_saxena], yes this will reduce the changes to the main classes and is much cleaner. Thoughts? [~sunilg][~rohithsharma] > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5526) DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked
[ https://issues.apache.org/jira/browse/YARN-5526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426170#comment-15426170 ] sandflee commented on YARN-5526: thanks [~varun_saxena] for suggestion and review ! > DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked > -- > > Key: YARN-5526 > URL: https://issues.apache.org/jira/browse/YARN-5526 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee > Fix For: 2.9.0 > > Attachments: YARN-5526.01.patch, YARN-5526.02.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5526) DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked
[ https://issues.apache.org/jira/browse/YARN-5526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15423746#comment-15423746 ] sandflee commented on YARN-5526: Thanks [~varun_saxena], that's more reasonable! update the patch. > DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked > -- > > Key: YARN-5526 > URL: https://issues.apache.org/jira/browse/YARN-5526 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5526.01.patch, YARN-5526.02.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5526) DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked
[ https://issues.apache.org/jira/browse/YARN-5526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5526: --- Attachment: YARN-5526.02.patch > DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked > -- > > Key: YARN-5526 > URL: https://issues.apache.org/jira/browse/YARN-5526 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5526.01.patch, YARN-5526.02.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5526) DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked
[ https://issues.apache.org/jira/browse/YARN-5526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5526: --- Attachment: YARN-5526.01.patch > DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked > -- > > Key: YARN-5526 > URL: https://issues.apache.org/jira/browse/YARN-5526 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5526.01.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422698#comment-15422698 ] sandflee commented on YARN-5375: 1, replace rmStateStore#AsyncDispatcher with DrainDispatcher; yes, the code is not clean, suggestions are welcome! 2, set the DrainDispatcher#isDrain default value to true, and sleep a while if not drained, to reduce cpu usage. 3, do not invoke setDrainEventsOnStop when creating the RMStateStore#DrainDispatcher, because DrainDispatcher will take 300s to stop if setDrainEventsOnStop is enabled; filed YARN-5526 to track this. > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
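A minimal illustration of point 2 in the comment above; the isDrained() accessor and the 10 ms back-off are assumptions for the sketch, not the actual patch.
{code}
// Sketch only: instead of busy-spinning on the drain flag, back off briefly
// while events are still in flight, which keeps cpu usage low during drains.
while (!dispatcher.isDrained()) {   // isDrained() is an assumed accessor
  Thread.sleep(10);                 // short sleep; the value is illustrative
}
{code}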
[jira] [Commented] (YARN-5526) DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked
[ https://issues.apache.org/jira/browse/YARN-5526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422691#comment-15422691 ] sandflee commented on YARN-5526: DrainDispatcher overrides AsyncDispatcher#createThread, so AsyncDispatcher#drained will never be refreshed to true. > DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked > -- > > Key: YARN-5526 > URL: https://issues.apache.org/jira/browse/YARN-5526 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
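A simplified, self-contained illustration of the shape of this bug. It does not use the real AsyncDispatcher/DrainDispatcher classes; it only mirrors the override relationship described in the comment above.
{code}
// Simplified illustration (hypothetical classes): the parent's stop path waits
// on its own 'drained' flag, but the subclass replaces the event loop and
// never sets that flag, so the stop path blocks until its timeout.
class Parent {
  protected volatile boolean drained = false;

  Runnable createEventLoop() {
    return () -> drained = true;      // the parent's loop would flip the flag
  }

  void stopAndDrain() throws InterruptedException {
    while (!drained) {                // never becomes true when the subclass
      Thread.sleep(100);              // loop ignores the parent's field
    }
  }
}

class Child extends Parent {
  @Override
  Runnable createEventLoop() {
    return () -> { /* drains its own queue, never sets Parent.drained */ };
  }
}
{code}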
[jira] [Created] (YARN-5526) DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked
sandflee created YARN-5526: -- Summary: DrainDispacher#ServiceStop blocked if setDrainEventsOnStop invoked Key: YARN-5526 URL: https://issues.apache.org/jira/browse/YARN-5526 Project: Hadoop YARN Issue Type: Bug Reporter: sandflee Assignee: sandflee -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5375) invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures
[ https://issues.apache.org/jira/browse/YARN-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5375: --- Attachment: YARN-5375.06.patch > invoke MockRM#drainEvents implicitly in MockRM methods to reduce test failures > -- > > Key: YARN-5375 > URL: https://issues.apache.org/jira/browse/YARN-5375 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5375.01.patch, YARN-5375.03.patch, > YARN-5375.04.patch, YARN-5375.05.patch, YARN-5375.06.patch > > > seen many test failures related to RMApp/RMAppattempt comes to some state but > some event are not processed in rm event queue or scheduler event queue, > cause test failure, seems we could implicitly invokes drainEvents(should also > drain sheduler event) in some mockRM method like waitForState -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5521) TestCapacityScheduler#testKillAllAppsInQueue fails randomly
[ https://issues.apache.org/jira/browse/YARN-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421781#comment-15421781 ] sandflee commented on YARN-5521: Thanks [~varun_saxena], [~sunilg], [~bibinchundatt] > TestCapacityScheduler#testKillAllAppsInQueue fails randomly > --- > > Key: YARN-5521 > URL: https://issues.apache.org/jira/browse/YARN-5521 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Saxena >Assignee: sandflee > Fix For: 2.9.0 > > Attachments: Failure.txt, YARN-5521.01.patch > > > {noformat} > Running > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler > Tests run: 49, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 35.922 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler > testKillAllAppsInQueue(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler) > Time elapsed: 0.146 sec <<< FAILURE! > java.lang.AssertionError: null > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertTrue(Assert.java:52) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler.testKillAllAppsInQueue(TestCapacityScheduler.java:2188) > Results : > Failed tests: > TestCapacityScheduler.testKillAllAppsInQueue:2188 null > Tests run: 49, Failures: 1, Errors: 0, Skipped: 0 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5521) TestCapacityScheduler#testKillAllAppsInQueue fails randomly
[ https://issues.apache.org/jira/browse/YARN-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5521: --- Attachment: YARN-5521.01.patch > TestCapacityScheduler#testKillAllAppsInQueue fails randomly > --- > > Key: YARN-5521 > URL: https://issues.apache.org/jira/browse/YARN-5521 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Saxena >Assignee: sandflee > Attachments: Failure.txt, YARN-5521.01.patch > > > {noformat} > Running > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler > Tests run: 49, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 35.922 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler > testKillAllAppsInQueue(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler) > Time elapsed: 0.146 sec <<< FAILURE! > java.lang.AssertionError: null > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertTrue(Assert.java:52) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler.testKillAllAppsInQueue(TestCapacityScheduler.java:2188) > Results : > Failed tests: > TestCapacityScheduler.testKillAllAppsInQueue:2188 null > Tests run: 49, Failures: 1, Errors: 0, Skipped: 0 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5521) TestCapacityScheduler#testKillAllAppsInQueue fails randomly
[ https://issues.apache.org/jira/browse/YARN-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421202#comment-15421202 ] sandflee commented on YARN-5521: thanks [~bibinchundatt], this seems to be caused by the app reaching the KILLED state before the APP_ATTEMPT_REMOVED event is processed by the scheduler dispatcher. This could be simply fixed by {code} rm.waitForState(app.getApplicationId(), RMAppState.KILLED); rm.waitForAppRemovedFromScheduler(app.getApplicationId()); appsInRoot = scheduler.getAppsInQueue("root"); {code} and YARN-5375 introduces a more general way, cc [~sunilg] [~rohithsharma] > TestCapacityScheduler#testKillAllAppsInQueue fails randomly > --- > > Key: YARN-5521 > URL: https://issues.apache.org/jira/browse/YARN-5521 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Saxena >Assignee: sandflee > Attachments: Failure.txt > > > {noformat} > Running > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler > Tests run: 49, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 35.922 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler > testKillAllAppsInQueue(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler) > Time elapsed: 0.146 sec <<< FAILURE! > java.lang.AssertionError: null > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertTrue(Assert.java:52) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler.testKillAllAppsInQueue(TestCapacityScheduler.java:2188) > Results : > Failed tests: > TestCapacityScheduler.testKillAllAppsInQueue:2188 null > Tests run: 49, Failures: 1, Errors: 0, Skipped: 0 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-5521) TestCapacityScheduler#testKillAllAppsInQueue fails randomly
[ https://issues.apache.org/jira/browse/YARN-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee reassigned YARN-5521: -- Assignee: sandflee > TestCapacityScheduler#testKillAllAppsInQueue fails randomly > --- > > Key: YARN-5521 > URL: https://issues.apache.org/jira/browse/YARN-5521 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Saxena >Assignee: sandflee > Attachments: Failure.txt > > > {noformat} > Running > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler > Tests run: 49, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 35.922 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler > testKillAllAppsInQueue(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler) > Time elapsed: 0.146 sec <<< FAILURE! > java.lang.AssertionError: null > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertTrue(Assert.java:52) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler.testKillAllAppsInQueue(TestCapacityScheduler.java:2188) > Results : > Failed tests: > TestCapacityScheduler.testKillAllAppsInQueue:2188 null > Tests run: 49, Failures: 1, Errors: 0, Skipped: 0 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5479) FairScheduler: Scheduling performance improvement
[ https://issues.apache.org/jira/browse/YARN-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15420796#comment-15420796 ] sandflee commented on YARN-5479: will do, thx > FairScheduler: Scheduling performance improvement > - > > Key: YARN-5479 > URL: https://issues.apache.org/jira/browse/YARN-5479 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: He Tianyi >Assignee: He Tianyi > > Currently ResourceManager uses a single thread to handle async events for > scheduling. As number of nodes grows, more events need to be processed in > time in FairScheduler. Also, increased number of applications & queues slows > down processing of each single event. > There are two cases that slow processing of nodeUpdate events is problematic: > A. global throughput is lower than number of nodes through heartbeat rounds. > This keeps resource from being allocated since the inefficiency. > B. global throughput meets the need, but for some of these rounds, events of > some nodes cannot get processed before next heartbeat. This brings > inefficiency handling burst requests (i.e. newly submitted MapReduce > application cannot get its all task launched soon given enough resource). > Pretty sure some people will encounter the problem eventually after a single > cluster is scaled to several K of nodes (even with {{assignmultiple}} > enabled). > This issue proposes to perform several optimization towards performance in > FairScheduler {{nodeUpdate}} method. To be specific: > A. trading off fairness with efficiency, queue & app sorting can be skipped > (or should this be called 'delayed sorting'?). we can either start another > dedicated thread to do the sorting & updating, or actually perform sorting > after current result have been used several times (say sort once in every 100 > calls.) > B. performing calculation on {{Resource}} instances is expensive, since at > least 2 objects ({{ResourceImpl}} and its proto builder) is created each time > (using 'immutable' apis). the overhead can be eliminated with a > light-weighted implementation of Resource, which do not instantiate a builder > until necessary, because most instances are used as intermediate result in > scheduler instead of being exchanged via IPC. Also, {{createResource}} is > using reflection, which can be replaced by a plain {{new}} (for scheduler > usage only). furthermore, perhaps we could 'intern' resource to avoid > allocation. > C. other minor changes: such as move {{updateRootMetrics}} call to > {{update}}, making root queue metrics eventual consistent (which may > satisfies most of the needs). or introduce counters to {{getResourceUsage}} > and make changing of resource incrementally instead of recalculate each time. > With A and B, I was looking at 4 times improvement in a cluster with 2K nodes. > Suggestions? Opinions? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5479) FairScheduler: Scheduling performance improvement
[ https://issues.apache.org/jira/browse/YARN-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419963#comment-15419963 ] sandflee commented on YARN-5479: it seems there is no need to compute minShare/isNeed/MinShareRatio/UseToWeightRatio in every comparator#compare call; we could snapshot these values before doing the real sort. > FairScheduler: Scheduling performance improvement > - > > Key: YARN-5479 > URL: https://issues.apache.org/jira/browse/YARN-5479 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: He Tianyi >Assignee: He Tianyi > > Currently ResourceManager uses a single thread to handle async events for > scheduling. As number of nodes grows, more events need to be processed in > time in FairScheduler. Also, increased number of applications & queues slows > down processing of each single event. > There are two cases that slow processing of nodeUpdate events is problematic: > A. global throughput is lower than number of nodes through heartbeat rounds. > This keeps resource from being allocated since the inefficiency. > B. global throughput meets the need, but for some of these rounds, events of > some nodes cannot get processed before next heartbeat. This brings > inefficiency handling burst requests (i.e. newly submitted MapReduce > application cannot get its all task launched soon given enough resource). > Pretty sure some people will encounter the problem eventually after a single > cluster is scaled to several K of nodes (even with {{assignmultiple}} > enabled). > This issue proposes to perform several optimization towards performance in > FairScheduler {{nodeUpdate}} method. To be specific: > A. trading off fairness with efficiency, queue & app sorting can be skipped > (or should this be called 'delayed sorting'?). we can either start another > dedicated thread to do the sorting & updating, or actually perform sorting > after current result have been used several times (say sort once in every 100 > calls.) > B. performing calculation on {{Resource}} instances is expensive, since at > least 2 objects ({{ResourceImpl}} and its proto builder) is created each time > (using 'immutable' apis). the overhead can be eliminated with a > light-weighted implementation of Resource, which do not instantiate a builder > until necessary, because most instances are used as intermediate result in > scheduler instead of being exchanged via IPC. Also, {{createResource}} is > using reflection, which can be replaced by a plain {{new}} (for scheduler > usage only). furthermore, perhaps we could 'intern' resource to avoid > allocation. > C. other minor changes: such as move {{updateRootMetrics}} call to > {{update}}, making root queue metrics eventual consistent (which may > satisfies most of the needs). or introduce counters to {{getResourceUsage}} > and make changing of resource incrementally instead of recalculate each time. > With A and B, I was looking at 4 times improvement in a cluster with 2K nodes. > Suggestions? Opinions? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
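A hedged sketch of the snapshot idea from the comment above, not the actual FairScheduler code: the per-schedulable value is computed once into a frozen map, and the sort runs only against the frozen copies, so nothing is recomputed (or mutated) inside the comparator. The ordering key is deliberately simplified to memory usage alone.
{code}
// Sketch only: freeze each app's usage once, then sort on the frozen values.
void sortBySnapshot(List<Schedulable> runnableApps) {
  final Map<Schedulable, Long> frozenUsage = new IdentityHashMap<>();
  for (Schedulable s : runnableApps) {
    // snapshot the expensive-to-compute value a single time per sort
    frozenUsage.put(s, (long) s.getResourceUsage().getMemory());
  }
  // simplified ordering key; the real policy combines several snapshotted values
  runnableApps.sort(Comparator.comparingLong(frozenUsage::get));
}
{code}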
[jira] [Commented] (YARN-4743) ResourceManager crash because TimSort
[ https://issues.apache.org/jira/browse/YARN-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419674#comment-15419674 ] sandflee commented on YARN-4743: bq. I think the root cause is NodeAvailableResourceComparator is not transitive. agree, but do we really need the comparator to be transitive? Enforcing that may reduce performance greatly. > ResourceManager crash because TimSort > - > > Key: YARN-4743 > URL: https://issues.apache.org/jira/browse/YARN-4743 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.4 >Reporter: Zephyr Guo >Assignee: Yufei Gu > Attachments: YARN-4743-cdh5.4.7.patch > > > {code} > 2016-02-26 14:08:50,821 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type NODE_UPDATE to the scheduler > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at java.util.TimSort.mergeHi(TimSort.java:868) > at java.util.TimSort.mergeAt(TimSort.java:485) > at java.util.TimSort.mergeCollapse(TimSort.java:410) > at java.util.TimSort.sort(TimSort.java:214) > at java.util.TimSort.sort(TimSort.java:173) > at java.util.Arrays.sort(Arrays.java:659) > at java.util.Collections.sort(Collections.java:217) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:316) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:240) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:989) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1185) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684) > at java.lang.Thread.run(Thread.java:745) > 2016-02-26 14:08:50,822 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {code} > Actually, this issue found in 2.6.0-cdh5.4.7. > I think the cause is that we modify {{Resouce}} while we are sorting > {{runnableApps}}. > {code:title=FSLeafQueue.java} > Comparator comparator = policy.getComparator(); > writeLock.lock(); > try { > Collections.sort(runnableApps, comparator); > } finally { > writeLock.unlock(); > } > readLock.lock(); > {code} > {code:title=FairShareComparator} > public int compare(Schedulable s1, Schedulable s2) { > .. > s1.getResourceUsage(), minShare1); > boolean s2Needy = Resources.lessThan(RESOURCE_CALCULATOR, null, > s2.getResourceUsage(), minShare2); > minShareRatio1 = (double) s1.getResourceUsage().getMemory() > / Resources.max(RESOURCE_CALCULATOR, null, minShare1, > ONE).getMemory(); > minShareRatio2 = (double) s2.getResourceUsage().getMemory() > / Resources.max(RESOURCE_CALCULATOR, null, minShare2, > ONE).getMemory(); > .. > {code} > {{getResourceUsage}} will return current Resource. The current Resource is > unstable.
> {code:title=FSAppAttempt.java} > @Override > public Resource getResourceUsage() { > // Here the getPreemptedResources() always return zero, except in > // a preemption round > return Resources.subtract(getCurrentConsumption(), > getPreemptedResources()); > } > {code} > {code:title=SchedulerApplicationAttempt} > public Resource getCurrentConsumption() { > return currentConsumption; > } > // This method may modify current Resource. > public synchronized void recoverContainer(RMContainer rmContainer) { > .. > Resources.addTo(currentConsumption, rmContainer.getContainer() > .getResource()); > .. > } > {code} > I suggest that use stable Resource in comparator. > Is there something i think wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5393) [Umbrella] Optimize YARN tests runtime
[ https://issues.apache.org/jira/browse/YARN-5393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419643#comment-15419643 ] sandflee commented on YARN-5393: YARN-5375 aims to add drainAllEvents to waitForState to reduce test failure, also, if all events are really drained, there is no need to sleep any more. > [Umbrella] Optimize YARN tests runtime > --- > > Key: YARN-5393 > URL: https://issues.apache.org/jira/browse/YARN-5393 > Project: Hadoop YARN > Issue Type: Test >Reporter: Vinod Kumar Vavilapalli > > When I originally merged MAPREDUCE-279 into Hadoop, *all* of YARN tests used > to take 10 mins with pretty good coverage. > Now only TestRMRestart takes that much time - we'ven't been that great > writing pointed - short tests. > Time for an initiative to optimize YARN tests. And even after that, if it > takes too long, we go the MAPREDUCE-670 route. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5483) Optimize RMAppAttempt#pullJustFinishedContainers
[ https://issues.apache.org/jira/browse/YARN-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15416202#comment-15416202 ] sandflee commented on YARN-5483: thanks [~jlowe] [~templedf] [~jianhe] and [~rohithsharma] for the review and commit! > Optimize RMAppAttempt#pullJustFinishedContainers > > > Key: YARN-5483 > URL: https://issues.apache.org/jira/browse/YARN-5483 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.6.0 >Reporter: sandflee >Assignee: sandflee > Fix For: 2.8.0, 2.6.5, 2.7.4 > > Attachments: YARN-5483-branch-2.6.patch, > YARN-5483-branch-2.6.patch.02, YARN-5483-branch-2.7.patch, > YARN-5483-branch-2.7.patch.02, YARN-5483.01.patch, YARN-5483.02.patch, > YARN-5483.03.patch, YARN-5483.04.patch, jprofiler-cpu.png > > > about 1000 app running on cluster, jprofiler found pullJustFinishedContainers > cost too much cpu. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5453) FairScheduler#update may skip update demand resource of child queue/app if current demand reached maxResource
[ https://issues.apache.org/jira/browse/YARN-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-5453: --- Attachment: YARN-5453.03.patch > FairScheduler#update may skip update demand resource of child queue/app if > current demand reached maxResource > - > > Key: YARN-5453 > URL: https://issues.apache.org/jira/browse/YARN-5453 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5453.01.patch, YARN-5453.02.patch, > YARN-5453.03.patch > > > {code} > demand = Resources.createResource(0); > for (FSQueue childQueue : childQueues) { > childQueue.updateDemand(); > Resource toAdd = childQueue.getDemand(); > demand = Resources.add(demand, toAdd); > demand = Resources.componentwiseMin(demand, maxRes); > if (Resources.equals(demand, maxRes)) { > break; > } > } > {code} > if one singe queue's demand resource exceed maxRes, the other queue's demand > resource will not update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5453) FairScheduler#update may skip update demand resource of child queue/app if current demand reached maxResource
[ https://issues.apache.org/jira/browse/YARN-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15414452#comment-15414452 ] sandflee commented on YARN-5453: agree, update the patch as you suggested. > FairScheduler#update may skip update demand resource of child queue/app if > current demand reached maxResource > - > > Key: YARN-5453 > URL: https://issues.apache.org/jira/browse/YARN-5453 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-5453.01.patch, YARN-5453.02.patch > > > {code} > demand = Resources.createResource(0); > for (FSQueue childQueue : childQueues) { > childQueue.updateDemand(); > Resource toAdd = childQueue.getDemand(); > demand = Resources.add(demand, toAdd); > demand = Resources.componentwiseMin(demand, maxRes); > if (Resources.equals(demand, maxRes)) { > break; > } > } > {code} > if one singe queue's demand resource exceed maxRes, the other queue's demand > resource will not update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
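A hedged sketch of one way to address the loop quoted in the issue description above: keep calling updateDemand() on every child so each child's demand stays fresh, and only cap the parent's accumulated total. This is an illustration under that assumption, not necessarily what the committed patch does.
{code}
// Sketch only, not necessarily the committed patch: always refresh every
// child's demand, and only stop accumulating once the parent's cap is hit.
demand = Resources.createResource(0);
for (FSQueue childQueue : childQueues) {
  childQueue.updateDemand();                       // always update the child
  if (!Resources.equals(demand, maxRes)) {         // keep accumulating only
    Resource toAdd = childQueue.getDemand();       // while below the cap
    demand = Resources.add(demand, toAdd);
    demand = Resources.componentwiseMin(demand, maxRes);
  }
}
{code}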