[ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10672:
--------------------------------
    Fix Version/s: 3.2.3
                   3.3.1

> All testcases in TestReservations are flaky
> -------------------------------------------
>
>                 Key: YARN-10672
>                 URL: https://issues.apache.org/jira/browse/YARN-10672
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>             Fix For: 3.4.0, 3.3.1, 3.2.3
>
>         Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch, YARN-10672.001.patch, 
> YARN-10672.branch-3.2.001.patch, YARN-10672.branch-3.3.001.patch
>
>
> All testcases in TestReservations are flaky
> Running a particular test in TestReservations 100 times never passes all the 
> time.
>  For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
>  Sometimes just 1 out of 100 runs is failed.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> <Click to see difference>
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
>     a.assignContainers(clusterResource, node_0,
>         new ResourceLimits(clusterResource),
>         SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated and test will fail:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:canAssign(312)) - ******Can't assign 
> container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
> (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
> getNumClusterNodes
> {code}
> h2. Let's break this down:
>  1. The mocking happens in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
>  boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances this could be allowed to return any other value than 3.
>  However, as mentioned above, sometimes the real method of 
> 'getNumClusterNodes' is called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>    Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>    - with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
>       at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>       at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>       at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>       at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>       at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>       at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>       at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
>       at 
> com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:40)
>       at 
> com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:220)
>       at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:53)
> {code}
> TestReservations.java:166 is exactly our suspicious stubbing call:
> {code:java}
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> But what the heck is the relation between this and the 
> isMultiNodePlacementEnabled method?
>  Let's dive into this more.
> 3. The only caller of isMultiNodePlacementEnabled is 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager#dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded,
>  that gets called from the anonymous Thread instance (assigned to field 
> 'cleanupThread') in class 
> [ActivitiesManager|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/ActivitiesManager.java#L301-L357]
>  The ActivitiesManager.serviceStart is invoked when the CapacityScheduler is 
> started.
> *Theory*: When CS is started 
> [here|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java#L156-L160],
>  the ActivitiesManager starts running and eventually 
> cs.isMultiNodePlacementEnabled() will be called by the thread 
> [here|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/ActivitiesManager.java#L266].
>  As the main thread (JUnit) started to stub method 'getNumClusterNodes', 
> there's an ongoing stubbing so when isMultiNodePlacementEnabled is called, 
> the Answer will hold the value of 3, which is improper for a boolean method, 
> so the stubbing will fail.
>  This is a race-condition situation, it is presented with 2 screenshots.
>  As the stubbing failed, subsequent calls to getNumClusterNodes will return 0 
> as the original implementation returns 0 for the tests. This is the root 
> cause of the test failures.
> 4. Some references to this mockito "issue":
>  - [https://github.com/mockito/mockito/issues/1151]
>  - [https://github.com/mockito/mockito/issues/1151#issuecomment-336718714]
>  - [https://github.com/mockito/mockito/pull/1200#issuecomment-336717814]
>  The last link points to the documentation that describes the thread-safety 
> of mockito: 
> [https://github.com/mockito/mockito/wiki/FAQ#is-mockito-thread-safe]
> *CONCLUSION & FIX*
>  As all tests in this class are affected by the setup method and this 
> thread-safety issue, we need to fix it consistently.
>  Actually, the fix is quite simple: All the stub calls of the CS instance 
> should happen before the CS.init / CS.start calls, that triggers the 
> ActivitiesManager thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to