[ 
https://issues.apache.org/jira/browse/YARN-5700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546852#comment-15546852
 ] 

Eric Badger commented on YARN-5700:
-----------------------------------

Looks like there are 2 bugs here. 

1) TestAMRestart uses {{waitForState()}} to wait for the completed container. 
However, {{waitForState()}} only checks the liveContainers list, and once the 
container is completed, it is quickly removed from that list. I think we can 
instead use {{waitForContainerToComplete()}} to check the last set of finished 
containers. Something like this:

{noformat}
-    rm1.waitForState(nm1, containerId2, RMContainerState.RUNNING);
+    NMContainerStatus completedContainer =
+        TestRMRestart.createNMContainerStatus(am1.getApplicationAttemptId(), 2,
+        ContainerState.COMPLETE);
+    rm1.waitForContainerToComplete(app1.getCurrentAppAttempt(),
+        completedContainer);
{noformat}
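
To illustrate why waiting on the live list misses a finished container, here is a minimal self-contained sketch. It is a toy model and not MockRM: the class {{LiveListWaitSketch}}, its two collections, the string states, and the poll-count parameter are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LiveListWaitSketch {
    // Hypothetical stand-ins for the live-container view and the list of
    // recently finished containers; not the real YARN data structures.
    private final Map<String, String> liveContainers = new HashMap<>();
    private final List<String> finishedContainers = new ArrayList<>();

    void launch(String id) {
        liveContainers.put(id, "RUNNING");
    }

    // On completion the container leaves the live view and is only
    // visible among the finished containers.
    void complete(String id) {
        liveContainers.remove(id);
        finishedContainers.add(id);
    }

    // Polls the live view only, so a finished container is never seen.
    boolean waitForLiveState(String id, String state, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            if (state.equals(liveContainers.get(id))) {
                return true;
            }
        }
        return false;
    }

    // Polls the finished list, which is where a completed container ends up.
    boolean waitForContainerToComplete(String id, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            if (finishedContainers.contains(id)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        LiveListWaitSketch rm = new LiveListWaitSketch();
        rm.launch("container_2");
        rm.complete("container_2");
        // The live-view wait gives up; the finished-list wait succeeds.
        System.out.println(rm.waitForLiveState("container_2", "COMPLETED", 5));
        System.out.println(rm.waitForContainerToComplete("container_2", 5));
    }
}
```

In this model the first wait can only succeed if it happens to poll before the container is removed from the live view, which would match the intermittent nature of the timeout.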

2) YARN-4807 changed {{waitForState()}} in MockRM.java so that it quietly 
returns false on failure instead of throwing an exception. In 2.8 and below, 
the code would call an {{assertNotNull()}} on the container to make sure that 
it wasn't null and throw an exception if it was. Since 2.9+ quietly returns 
false instead of throwing an exception, the test waits for the timeout and then 
continues with the test once {{waitForState()}} returns (even though it 
returned false). We could fix the test to check for a false return value, but 
there are most likely other tests that also depend on {{waitForState()}} 
throwing an exception on failure instead of checking the return value. So I 
would think that it'd be better to put the {{assertNotNull()}} back in. 
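
The difference between the two behaviors can be sketched with another toy model. This is not the MockRM code from either branch: {{WaitForStateSketch}}, both method names, and the poll-count parameter are hypothetical, and the null check is placed after polling only to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

class WaitForStateSketch {
    // Hypothetical container table; a missing key models a container that
    // is no longer known (for example, already removed from the live list).
    private final Map<String, String> containers = new HashMap<>();

    void put(String id, String state) {
        containers.put(id, state);
    }

    // Quiet variant: a missing container or wrong state yields false, which
    // a caller that ignores the return value never notices.
    boolean waitForStateQuiet(String id, String expected, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            if (expected.equals(containers.get(id))) {
                return true;
            }
        }
        return false;
    }

    // Fail-fast variant: same polling, but a container that is still
    // missing afterwards raises an error instead of returning false.
    boolean waitForStateStrict(String id, String expected, int maxPolls) {
        boolean reached = waitForStateQuiet(id, expected, maxPolls);
        if (containers.get(id) == null) {
            throw new AssertionError("Container " + id + " shouldn't be null");
        }
        return reached;
    }

    public static void main(String[] args) {
        WaitForStateSketch rm = new WaitForStateSketch();
        // Result ignored, as in a test written against the throwing version:
        // execution simply continues past the failed wait.
        rm.waitForStateQuiet("container_2", "RUNNING", 5);
        try {
            rm.waitForStateStrict("container_2", "RUNNING", 5);
        } catch (AssertionError e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```

With the quiet variant, every existing caller would have to be audited for an unchecked return value; with the strict variant, callers written against the old behavior keep failing loudly.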

[~kasha], [~yufeigu] (reporter/assignee from YARN-4807), what do you think 
about adding the {{assertNotNull()}} back into {{waitForState()}}?

> testAMRestartNotLostContainerCompleteMsg times out intermittently in 2.8
> ------------------------------------------------------------------------
>
>                 Key: YARN-5700
>                 URL: https://issues.apache.org/jira/browse/YARN-5700
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>
> {noformat}
> java.lang.Exception: test timed out after 30000 milliseconds
>       at java.lang.Thread.sleep(Native Method)
>       at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:301)
>       at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:286)
>       at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:281)
>       at org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart.testAMRestartNotLostContainerCompleteMsg(TestAMRestart.java:774)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)