[ 
https://issues.apache.org/jira/browse/IGNITE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladislav Pyatkov reassigned IGNITE-21381:
------------------------------------------

    Assignee: Vladislav Pyatkov

> ActiveActorTest#testChangeLeaderForce has problems with resource cleanup
> ------------------------------------------------------------------------
>
>                 Key: IGNITE-21381
>                 URL: https://issues.apache.org/jira/browse/IGNITE-21381
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mirza Aliev
>            Assignee: Vladislav Pyatkov
>            Priority: Major
>              Labels: ignite-3
>         Attachments: screenshot-1.png, screenshot-2.png
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> {{ActiveActorTest#testChangeLeaderForce}} has become flaky on TC, failing with 
> {noformat}
> [05:19:12]F: [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)] org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
>       at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>       at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>       at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
>       at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
>       at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
>       at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180)
>       at app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370)
> {noformat}
> From the log we can see that the leadership transfer, which was supposed to 
> succeed, does not happen. The behaviour is the following:
> 1) Current leader is {{Leader: ClusterNodeImpl 
> [id=e99210fb-f872-4e08-a99c-53f9512da20e, name=aat_tclf_1235}}
> 2) We want to transfer leadership to {{Peer to transfer leader: Peer 
> [consistentId=aat_tclf_1234, idx=0]}}
> 3) The transfer process is started
> 4) We receive a warning about an error during {{GetLeaderRequestImpl}}:
> {noformat}
> [2024-01-29T05:19:08,855][WARN ][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable error during the request occurred (will be retried on the randomly selected node) [request=GetLeaderRequestImpl [groupId=TestReplicationGroup, peerId=aat_tclf_1235], peer=Peer [consistentId=aat_tclf_1235, idx=0], newPeer=Peer [consistentId=aat_tclf_1234, idx=0]].
> java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
>       at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367) ~[?:?]
>       at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376) ~[?:?]
>       at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019) ~[?:?]
>       at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?]
>       at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) [?:?]
>       at java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792) [?:?]
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>       at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>       at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: java.util.concurrent.TimeoutException
>       ... 7 more
> {noformat}
> 5) After that, node {{aat_tclf_1234}} logs that it received an invalid 
> {{RequestVoteResponse}} from {{aat_tclf_1236}}; the state reported in the 
> message is {{STATE_LEADER}} rather than {{STATE_CANDIDATE}}:
> {noformat}
> [2024-01-29T05:19:11,370][WARN ][%aat_tclf_1234%JRaft-Response-Processor-15][NodeImpl] Node <TestReplicationGroup/aat_tclf_1234> received invalid RequestVoteResponse from aat_tclf_1236, state not in STATE_CANDIDATE but STATE_LEADER.
> {noformat}
>  
> Tests {{ActiveActorTest#testChangeLeaderForce}} and 
> {{TopologyAwareRaftGroupServiceTest#testChangeLeaderForce}} were muted.
> There are also other problems with these tests: they clean up resources 
> incorrectly when a test fails. The cluster is stopped in the test body 
> itself, so if an assertion fails, the rest of the test is not executed and 
> the cluster is never stopped.
> The next problem is that if we run this test several times, even if every 
> run passes, at some point a new test cannot be started because of 
> {noformat}
>  java.lang.OutOfMemoryError: unable to create native thread: possibly out of 
> memory or process/resource limits reached
> {noformat}
> From VisualVM we can see that {{Raft-Group-Client}} threads are leaked:
>  !screenshot-1.png! 
>  !screenshot-2.png! 
> h4. Definition of done
> 1) Investigate and fix the problem with the failed transferLeadership
> 2) Correctly clean up resources if a test fails. Move all cleanup logic to 
> the {{AfterEach}} section of the tests for both {{ActiveActorTest}} and 
> {{TopologyAwareRaftGroupServiceTest}}
> 3) Refactor {{ActiveActorTest}} and {{TopologyAwareRaftGroupServiceTest}}, 
> since the code is simply copy-pasted between them
> 4) Investigate the problem with leaked {{Raft-Group-Client}} threads 
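
For item 2 of the definition of done above, a minimal sketch of cleanup that survives a failed assertion. It uses only the JDK; ResourceTracker and TeardownSketch are hypothetical names and are not part of Ignite. In a JUnit 5 test, the call to close() would presumably live in a method annotated with @AfterEach instead of the finally block shown here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper: collects resources during a test and closes all of them at teardown. */
final class ResourceTracker implements AutoCloseable {
    private final Deque<AutoCloseable> resources = new ArrayDeque<>();

    /** Registers a resource so that teardown will close it. */
    void track(AutoCloseable resource) {
        resources.push(resource);
    }

    /** Closes resources in reverse registration order; one failure does not skip the rest. */
    @Override
    public void close() throws Exception {
        Exception first = null;
        while (!resources.isEmpty()) {
            try {
                resources.pop().close();
            } catch (Exception e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }
}

final class TeardownSketch {
    public static void main(String[] args) throws Exception {
        List<String> closed = new ArrayList<>();
        ResourceTracker tracker = new ResourceTracker();
        try {
            // Stand-ins for cluster nodes or Raft clients started by the test.
            tracker.track(() -> closed.add("node-1"));
            tracker.track(() -> closed.add("node-2"));
            throw new AssertionError("simulated failed assertion");
        } catch (AssertionError expected) {
            // The test body stopped here, before any in-test cleanup could run.
        } finally {
            tracker.close(); // teardown runs regardless of the failure above
        }
        System.out.println(closed); // prints [node-2, node-1]
    }
}
```

Closing in reverse order and collecting failures as suppressed exceptions means a single resource that fails to stop does not leave the others running.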

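For item 4, a rough sketch of how a test could detect leaked threads by name prefix after teardown. It relies only on the JDK; ThreadLeakCheck and its methods are hypothetical, and the fixed pool merely stands in for whatever executor actually creates the Raft-Group-Client threads.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: counts live threads by name prefix to spot leaks. */
final class ThreadLeakCheck {
    /** Returns the number of live threads whose name starts with the prefix. */
    static long liveThreads(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.isAlive() && t.getName().startsWith(prefix))
                .count();
    }

    /** Polls until no thread with the prefix is alive, or the timeout expires. */
    static boolean awaitNoThreads(String prefix, long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (liveThreads(prefix) > 0) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(20);
        }
        return true;
    }

    /** Stand-in for the real client executor: a fixed pool with named threads. */
    static ExecutorService newNamedPool(String prefix, int size) {
        AtomicInteger seq = new AtomicInteger();
        return Executors.newFixedThreadPool(size,
                r -> new Thread(r, prefix + "-" + seq.getAndIncrement()));
    }

    public static void main(String[] args) throws InterruptedException {
        String prefix = "Raft-Group-Client";
        ExecutorService pool = newNamedPool(prefix, 4);
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> { });
        }
        System.out.println("threads while running: " + liveThreads(prefix));

        // Without this shutdown the pool threads stay alive, which is the kind of leak described above.
        pool.shutdown();
        System.out.println("all stopped: " + awaitNoThreads(prefix, 5_000));
    }
}
```

The check polls instead of asserting immediately because a worker thread can remain alive for a short moment after its pool has been shut down.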


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
