[ https://issues.apache.org/jira/browse/YARN-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Bacsko resolved YARN-9436. -------------------------------- Resolution: Duplicate > Flaky test testApplicationLifetimeMonitor > ----------------------------------------- > > Key: YARN-9436 > URL: https://issues.apache.org/jira/browse/YARN-9436 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler, test > Reporter: Peter Bacsko > Assignee: Peter Bacsko > Priority: Major > > In our test environment, we occasionally encounter this failure: > {noformat} > 2019-04-03 12:49:32 [INFO] Running > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor > 2019-04-03 12:53:08 [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, > Time elapsed: 215.535 s <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor > 2019-04-03 12:53:08 [ERROR] > testApplicationLifetimeMonitor[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor) > Time elapsed: 34.244 s <<< FAILURE! > 2019-04-03 12:53:08 java.lang.AssertionError: Application killed before > lifetime value > 2019-04-03 12:53:08 at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor.testApplicationLifetimeMonitor(TestApplicationLifetimeMonitor.java:218) > 2019-04-03 12:53:08 > {noformat} > The root cause is the condition here: > {noformat} > Assert.assertTrue("Application killed before lifetime value", > totalTimeRun > maxLifetime); > {noformat} > However, there are two problems with this condition: > 1. Logically it's not correct. In fact, since the app should be killed after > 30 seconds, one would expect to see {{totalTimeRun = maxLifetime}}. Due to > some asynchronicity and rounding, most of the time {{totalTimeRun}} ends up > being 31. > 2. Sometimes the application is killed fast enough and {{totalTimeRun}} is > 30, but this is correct, because in {{setUpCSQueue}} we set the queue > lifetime: > {noformat} > csConf.setMaximumLifetimePerQueue( > CapacitySchedulerConfiguration.ROOT + ".default", maxLifetime); > csConf.setDefaultLifetimePerQueue( > CapacitySchedulerConfiguration.ROOT + ".default", defaultLifetime); > {noformat} > A more proper condition is: > {noformat} > Assert.assertTrue("Application killed before lifetime value", > totalTimeRun >= maxLifetime); > {noformat} > The assertion message in the next line is also misleading: > {noformat} > Assert.assertTrue( > "Application killed before lifetime value " + totalTimeRun, > totalTimeRun < maxLifetime + 10L); > {noformat} > If it false, it means that the application is killed _after_ 40 seconds, > which exceeds both the app's lifetime (40s) and that of the queue (30s). > {noformat} > Assert.assertTrue( > "Application killed after queue/app lifetime value: " + > totalTimeRun, > totalTimeRun < maxLifetime + 10L); > {noformat} > We can be even be stricter, since we expect a kill almost immediately after > 30 seconds: > {noformat} > Assert.assertTrue( > "Application killed too late: " + totalTimeRun, > totalTimeRun < maxLifetime + 2L); > {noformat} > where we allow a 2 second tolerance. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org