[ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10672:
----------------------------------
    Description: 
All testcases in TestReservations are flaky

Running a particular test in TestReservations 100 times never passes all the 
time.
 For example, let's run testReservationNoContinueLook 100 times. For me, it 
produced 39 failed and 61 passed results.
 Sometimes just 1 out of 100 runs is failed.
 Screenshot is attached.

Stacktrace:
{code:java}
java.lang.AssertionError: 
Expected :2048
Actual   :0
<Click to see difference>

at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
{code}
The test fails here:
{code:java}
 // Start testing...
// Only AM
TestUtils.applyResourceCommitRequest(clusterResource,
    a.assignContainers(clusterResource, node_0,
        new ResourceLimits(clusterResource),
        SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
assertEquals(2 * GB, a.getUsedResources().getMemorySize());
{code}
With some debugging (patch attached), I realized that sometimes there are no 
registered nodes so the AM can't be allocated and test will fail:
{code:java}
2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
(RegularContainerAllocator.java:canAssign(312)) - ******Can't assign container, 
no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
{code}
In these cases, this is also printed from 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
{code:java}
2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
(CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
getNumClusterNodes
{code}

h2. Let's break this down:
 1. The mocking happens in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
 boolean):
{code:java}
cs.setRMContext(spyRMContext);
cs.init(csConf);
cs.start();

when(cs.getNumClusterNodes()).thenReturn(3);
{code}
Under no circumstances this could be allowed to return any other value than 3.
 However, as mentioned above, sometimes the real method of 'getNumClusterNodes' 
is called on CapacityScheduler.

2. Sometimes, this gets printed to the console:
{code:java}
org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
Integer cannot be returned by isMultiNodePlacementEnabled()
isMultiNodePlacementEnabled() should return boolean
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
spies - 
   - with doReturn|Throw() family of methods. More in javadocs for 
Mockito.spy() method.


        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
        at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
        at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
        at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
        at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
        at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
        at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
        at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
        at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
        at 
com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:40)
        at 
com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:220)
        at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:53)

{code}
TestReservations.java:166 is exactly our suspicious stubbing call:
{code:java}
when(cs.getNumClusterNodes()).thenReturn(3);
{code}
But what the heck is the relation between this and the 
isMultiNodePlacementEnabled method?
 Let's dive into this more.

3. The only caller of isMultiNodePlacementEnabled is 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager#dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded,
 that gets called from the anonymous Thread instance (assigned to field 
'cleanupThread') in class 
[ActivitiesManager|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/ActivitiesManager.java#L301-L357]
 The ActivitiesManager.serviceStart is invoked when the CapacityScheduler is 
started.

*Theory*: When CS is started 
[here|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java#L156-L160],
 the ActivitiesManager starts running and eventually 
cs.isMultiNodePlacementEnabled() will be called by the thread 
[here|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/ActivitiesManager.java#L266].
 As the main thread (JUnit) started to stub method 'getNumClusterNodes', 
there's an ongoing stubbing so when isMultiNodePlacementEnabled is called, the 
Answer will hold the value of 3, which is improper for a boolean method, so the 
stubbing will fail.
 This is a race-condition situation, it is presented with 2 screenshots.
 As the stubbing failed, subsequent calls to getNumClusterNodes will return 0 
as the original implementation returns 0 for the tests. This is the root cause 
of the test failures.

4. Some references to this mockito "issue":
 - [https://github.com/mockito/mockito/issues/1151]
 - [https://github.com/mockito/mockito/issues/1151#issuecomment-336718714]
 - [https://github.com/mockito/mockito/pull/1200#issuecomment-336717814]
 The last link points to the documentation that describes the thread-safety of 
mockito: [https://github.com/mockito/mockito/wiki/FAQ#is-mockito-thread-safe]

*CONCLUSION & FIX*
 As all tests in this class are affected by the setup method and this 
thread-safety issue, we need to fix it consistently.
 Actually, the fix is quite simple: All the stub calls of the CS instance 
should happend before the CS.init / CS.start calls, that triggers the 
ActivitiesManager thread.

  was:
All testcases in TestReservations are flaky

Running a particular test in TestReservations 100 times never passes all the 
time.
 For example, let's run testReservationNoContinueLook 100 times. For me, it 
produced 39 failed and 61 passed results.
 Sometimes just 1 out of 100 runs is failed.
 Screenshot is attached.

Stacktrace:
{code:java}
java.lang.AssertionError: 
Expected :2048
Actual   :0
<Click to see difference>

at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
{code}
The test fails here:
{code:java}
 // Start testing...
// Only AM
TestUtils.applyResourceCommitRequest(clusterResource,
    a.assignContainers(clusterResource, node_0,
        new ResourceLimits(clusterResource),
        SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
assertEquals(2 * GB, a.getUsedResources().getMemorySize());
{code}
With some debugging (patch attached), I realized that sometimes there are no 
registered nodes so the AM can't be allocated and test will fail:
{code:java}
2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
(RegularContainerAllocator.java:canAssign(312)) - ******Can't assign container, 
no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
{code}
In these cases, this is also printed from 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
{code:java}
2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
(CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
getNumClusterNodes
{code}
Let's break this down:
 1. The mocking happens in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
 boolean):
{code:java}
cs.setRMContext(spyRMContext);
cs.init(csConf);
cs.start();

when(cs.getNumClusterNodes()).thenReturn(3);
{code}
Under no circumstances this could be allowed to return any other value than 3.
 However, as mentioned above, sometimes the real method of 'getNumClusterNodes' 
is called on CapacityScheduler.

2. Sometimes, this gets printed to the console:
{code:java}
org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
Integer cannot be returned by isMultiNodePlacementEnabled()
isMultiNodePlacementEnabled() should return boolean
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
spies - 
   - with doReturn|Throw() family of methods. More in javadocs for 
Mockito.spy() method.


        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
        at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
        at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
        at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
        at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
        at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
        at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
        at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
        at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
        at 
com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:40)
        at 
com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:220)
        at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:53)

{code}
TestReservations.java:166 is exactly our suspicious stubbing call:
{code:java}
when(cs.getNumClusterNodes()).thenReturn(3);
{code}
But what the heck is the relation between this and the 
isMultiNodePlacementEnabled method?
 Let's dive into this more.

3. The only caller of isMultiNodePlacementEnabled is 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager#dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded,
 that gets called from the anonymous Thread instance (assigned to field 
'cleanupThread') in class 
[ActivitiesManager|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/ActivitiesManager.java#L301-L357]
 The ActivitiesManager.serviceStart is invoked when the CapacityScheduler is 
started.

*Theory*: When CS is started 
[here|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java#L156-L160],
 the ActivitiesManager starts running and eventually 
cs.isMultiNodePlacementEnabled() will be called by the thread 
[here|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/ActivitiesManager.java#L266].
 As the main thread (JUnit) started to stub method 'getNumClusterNodes', 
there's an ongoing stubbing so when isMultiNodePlacementEnabled is called, the 
Answer will hold the value of 3, which is improper for a boolean method, so the 
stubbing will fail.
 This is a race-condition situation, it is presented with 2 screenshots.
 As the stubbing failed, subsequent calls to getNumClusterNodes will return 0 
as the original implementation returns 0 for the tests. This is the root cause 
of the test failures.

4. Some references to this mockito "issue":
 - [https://github.com/mockito/mockito/issues/1151]
 - [https://github.com/mockito/mockito/issues/1151#issuecomment-336718714]
 - [https://github.com/mockito/mockito/pull/1200#issuecomment-336717814]
 The last link points to the documentation that describes the thread-safety of 
mockito: [https://github.com/mockito/mockito/wiki/FAQ#is-mockito-thread-safe]

*CONCLUSION & FIX*
 As all tests in this class are affected by the setup method and this 
thread-safety issue, we need to fix it consistently.
 Actually, the fix is quite simple: All the stub calls of the CS instance 
should happend before the CS.init / CS.start calls, that triggers the 
ActivitiesManager thread.


> All testcases in TestReservations are flaky
> -------------------------------------------
>
>                 Key: YARN-10672
>                 URL: https://issues.apache.org/jira/browse/YARN-10672
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch, YARN-10672.001.patch
>
>
> All testcases in TestReservations are flaky
> Running a particular test in TestReservations 100 times never passes all the 
> time.
>  For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
>  Sometimes just 1 out of 100 runs is failed.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> <Click to see difference>
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
>     a.assignContainers(clusterResource, node_0,
>         new ResourceLimits(clusterResource),
>         SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated and test will fail:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:canAssign(312)) - ******Can't assign 
> container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
> (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
> getNumClusterNodes
> {code}
> h2. Let's break this down:
>  1. The mocking happens in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
>  boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances this could be allowed to return any other value than 3.
>  However, as mentioned above, sometimes the real method of 
> 'getNumClusterNodes' is called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>    Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>    - with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
>       at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>       at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>       at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>       at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>       at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>       at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>       at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
>       at 
> com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:40)
>       at 
> com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:220)
>       at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:53)
> {code}
> TestReservations.java:166 is exactly our suspicious stubbing call:
> {code:java}
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> But what the heck is the relation between this and the 
> isMultiNodePlacementEnabled method?
>  Let's dive into this more.
> 3. The only caller of isMultiNodePlacementEnabled is 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager#dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded,
>  that gets called from the anonymous Thread instance (assigned to field 
> 'cleanupThread') in class 
> [ActivitiesManager|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/ActivitiesManager.java#L301-L357]
>  The ActivitiesManager.serviceStart is invoked when the CapacityScheduler is 
> started.
> *Theory*: When CS is started 
> [here|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java#L156-L160],
>  the ActivitiesManager starts running and eventually 
> cs.isMultiNodePlacementEnabled() will be called by the thread 
> [here|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/ActivitiesManager.java#L266].
>  As the main thread (JUnit) started to stub method 'getNumClusterNodes', 
> there's an ongoing stubbing so when isMultiNodePlacementEnabled is called, 
> the Answer will hold the value of 3, which is improper for a boolean method, 
> so the stubbing will fail.
>  This is a race-condition situation, it is presented with 2 screenshots.
>  As the stubbing failed, subsequent calls to getNumClusterNodes will return 0 
> as the original implementation returns 0 for the tests. This is the root 
> cause of the test failures.
> 4. Some references to this mockito "issue":
>  - [https://github.com/mockito/mockito/issues/1151]
>  - [https://github.com/mockito/mockito/issues/1151#issuecomment-336718714]
>  - [https://github.com/mockito/mockito/pull/1200#issuecomment-336717814]
>  The last link points to the documentation that describes the thread-safety 
> of mockito: 
> [https://github.com/mockito/mockito/wiki/FAQ#is-mockito-thread-safe]
> *CONCLUSION & FIX*
>  As all tests in this class are affected by the setup method and this 
> thread-safety issue, we need to fix it consistently.
>  Actually, the fix is quite simple: All the stub calls of the CS instance 
> should happend before the CS.init / CS.start calls, that triggers the 
> ActivitiesManager thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to