[ https://issues.apache.org/jira/browse/YARN-5700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546852#comment-15546852 ]
Eric Badger commented on YARN-5700:
-----------------------------------

Looks like there are two bugs here.

1) TestAMRestart uses {{waitForState()}} to wait for the completed container. However, {{waitForState()}} only checks the liveContainers list, and a container is removed from that list shortly after it completes. I think we can instead use {{waitForContainerToComplete()}}, which checks the last set of finished containers. Something like this:
{noformat}
- rm1.waitForState(nm1, containerId2, RMContainerState.RUNNING);
+ NMContainerStatus completedContainer =
+     TestRMRestart.createNMContainerStatus(am1.getApplicationAttemptId(), 2,
+         ContainerState.COMPLETE);
+ rm1.waitForContainerToComplete(app1.getCurrentAppAttempt(), completedContainer);
{noformat}

2) YARN-4807 changed {{waitForState()}} in MockRM.java so that it quietly returns false on failure instead of throwing an exception. In 2.8 and below, the code would call {{assertNotNull()}} on the container and throw an exception if it was null. Since 2.9+ quietly returns false instead, the test waits for the timeout and then carries on once {{waitForState()}} returns, even though it returned false. We could fix the test to check for a false return value, but there are most likely other tests that also depend on {{waitForState()}} throwing an exception on failure rather than on checking the return value. So I think it would be better to put the {{assertNotNull()}} back in.
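To illustrate the concern in point 2, here is a minimal, self-contained sketch of the two failure styles. It is not the MockRM code: the class {{WaitForStateSketch}}, the methods {{waitQuietly()}} and {{waitOrFail()}}, and the use of a plain String for the state are all hypothetical names chosen for illustration.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a polling test helper. This is NOT MockRM;
// it only contrasts "return false quietly" with "fail loudly".
public class WaitForStateSketch {

    // Quiet style: poll up to maxAttempts times and report the outcome
    // only through the return value. A caller that ignores the result
    // simply keeps going after the wait is exhausted.
    public static boolean waitQuietly(Supplier<String> stateLookup, String expected,
                                      int maxAttempts, long sleepMillis) {
        for (int i = 0; i < maxAttempts; i++) {
            // A null lookup result models an object that is no longer tracked.
            if (expected.equals(stateLookup.get())) {
                return true;
            }
            if (!sleep(sleepMillis)) {
                return false; // interrupted while waiting
            }
        }
        return false;
    }

    // Loud style: same polling, but a failed wait raises an AssertionError,
    // so callers that never inspect a return value still fail immediately.
    public static void waitOrFail(Supplier<String> stateLookup, String expected,
                                  int maxAttempts, long sleepMillis) {
        if (!waitQuietly(stateLookup, expected, maxAttempts, sleepMillis)) {
            throw new AssertionError("expected state " + expected
                + " was not reached; last seen: " + stateLookup.get());
        }
    }

    // Sleeps and returns false if the thread was interrupted.
    private static boolean sleep(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Models a lookup for something that has already been removed.
        Supplier<String> gone = () -> null;

        boolean reached = waitQuietly(gone, "RUNNING", 3, 10L);
        System.out.println("quiet helper returned " + reached);

        try {
            waitOrFail(gone, "RUNNING", 3, 10L);
        } catch (AssertionError e) {
            System.out.println("asserting helper failed: " + e.getMessage());
        }
    }
}
```

With the quiet style, every call site has to check the boolean itself. With the loud style, existing call sites that ignore the return value still fail at the point of the wait, which is the behavior the older tests appear to rely on.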

[~kasha], [~yufeigu] (reporter/assignee from YARN-4807), what do you think about adding the {{assertNotNull()}} back into {{waitForState()}}?

> testAMRestartNotLostContainerCompleteMsg times out intermittently in 2.8
> ------------------------------------------------------------------------
>
>                 Key: YARN-5700
>                 URL: https://issues.apache.org/jira/browse/YARN-5700
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>
> {noformat}
> java.lang.Exception: test timed out after 30000 milliseconds
> 	at java.lang.Thread.sleep(Native Method)
> 	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:301)
> 	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:286)
> 	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:281)
> 	at org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart.testAMRestartNotLostContainerCompleteMsg(TestAMRestart.java:774)
> {noformat}