[ https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Szilard Nemeth updated YARN-10672: ---------------------------------- Attachment: YARN-10672.branch-3.3.001.patch > All testcases in TestReservations are flaky > ------------------------------------------- > > Key: YARN-10672 > URL: https://issues.apache.org/jira/browse/YARN-10672 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Szilard Nemeth > Assignee: Szilard Nemeth > Priority: Major > Fix For: 3.4.0 > > Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot > 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at > 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, > YARN-10672-debuglogs.patch, YARN-10672.001.patch, > YARN-10672.branch-3.3.001.patch > > > All testcases in TestReservations are flaky > Running a particular test in TestReservations 100 times never passes all the > time. > For example, let's run testReservationNoContinueLook 100 times. For me, it > produced 39 failed and 61 passed results. > Sometimes just 1 out of 100 runs is failed. > Screenshot is attached. > Stacktrace: > {code:java} > java.lang.AssertionError: > Expected :2048 > Actual :0 > <Click to see difference> > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotEquals(Assert.java:835) > at org.junit.Assert.assertEquals(Assert.java:647) > at org.junit.Assert.assertEquals(Assert.java:633) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642) > {code} > The test fails here: > {code:java} > // Start testing... > // Only AM > TestUtils.applyResourceCommitRequest(clusterResource, > a.assignContainers(clusterResource, node_0, > new ResourceLimits(clusterResource), > SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps); > assertEquals(2 * GB, a.getUsedResources().getMemorySize()); > {code} > With some debugging (patch attached), I realized that sometimes there are no > registered nodes so the AM can't be allocated and test will fail: > {code:java} > 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator > (RegularContainerAllocator.java:canAssign(312)) - ******Can't assign > container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f > {code} > In these cases, this is also printed from > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes: > {code:java} > 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler > (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real > getNumClusterNodes > {code} > h2. Let's break this down: > 1. The mocking happens in > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration, > boolean): > {code:java} > cs.setRMContext(spyRMContext); > cs.init(csConf); > cs.start(); > when(cs.getNumClusterNodes()).thenReturn(3); > {code} > Under no circumstances this could be allowed to return any other value than 3. > However, as mentioned above, sometimes the real method of > 'getNumClusterNodes' is called on CapacityScheduler. > 2. Sometimes, this gets printed to the console: > {code:java} > org.mockito.exceptions.misusing.WrongTypeOfReturnValue: > Integer cannot be returned by isMultiNodePlacementEnabled() > isMultiNodePlacementEnabled() should return boolean > *** > If you're unsure why you're getting above error read on. > Due to the nature of the syntax above problem might occur because: > 1. This exception *might* occur in wrongly written multi-threaded tests. > Please refer to Mockito FAQ on limitations of concurrency testing. > 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub > spies - > - with doReturn|Throw() family of methods. More in javadocs for > Mockito.spy() method. > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566) > at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at org.junit.runner.JUnitCore.run(JUnitCore.java:137) > at > com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69) > at > com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:40) > at > com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:220) > at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:53) > {code} > TestReservations.java:166 is exactly our suspicious stubbing call: > {code:java} > when(cs.getNumClusterNodes()).thenReturn(3); > {code} > But what the heck is the relation between this and the > isMultiNodePlacementEnabled method? > Let's dive into this more. > 3. The only caller of isMultiNodePlacementEnabled is > org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager#dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded, > that gets called from the anonymous Thread instance (assigned to field > 'cleanupThread') in class > [ActivitiesManager|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/ActivitiesManager.java#L301-L357] > The ActivitiesManager.serviceStart is invoked when the CapacityScheduler is > started. > *Theory*: When CS is started > [here|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java#L156-L160], > the ActivitiesManager starts running and eventually > cs.isMultiNodePlacementEnabled() will be called by the thread > [here|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/ActivitiesManager.java#L266]. > As the main thread (JUnit) started to stub method 'getNumClusterNodes', > there's an ongoing stubbing so when isMultiNodePlacementEnabled is called, > the Answer will hold the value of 3, which is improper for a boolean method, > so the stubbing will fail. > This is a race-condition situation, it is presented with 2 screenshots. > As the stubbing failed, subsequent calls to getNumClusterNodes will return 0 > as the original implementation returns 0 for the tests. This is the root > cause of the test failures. > 4. Some references to this mockito "issue": > - [https://github.com/mockito/mockito/issues/1151] > - [https://github.com/mockito/mockito/issues/1151#issuecomment-336718714] > - [https://github.com/mockito/mockito/pull/1200#issuecomment-336717814] > The last link points to the documentation that describes the thread-safety > of mockito: > [https://github.com/mockito/mockito/wiki/FAQ#is-mockito-thread-safe] > *CONCLUSION & FIX* > As all tests in this class are affected by the setup method and this > thread-safety issue, we need to fix it consistently. > Actually, the fix is quite simple: All the stub calls of the CS instance > should happen before the CS.init / CS.start calls, that triggers the > ActivitiesManager thread. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org