[ https://issues.apache.org/jira/browse/IGNITE-20303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Uttsel reassigned IGNITE-20303:
--------------------------------------

    Assignee: Sergey Uttsel

> "Raft group on the node is already started" exception when pending and planned assignment changed faster then rebalance
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-20303
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20303
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Sergey Uttsel
>            Assignee: Sergey Uttsel
>            Priority: Major
>              Labels: ignite-3
>
> If many assignment changes happen in quick succession, rebalance does not have time to complete for each change. In this case, the following exception is thrown:
> {code:java}
> 2023-08-24T16:58:51,328][ERROR][%irdt_ttqr_20000%tableManager-io-10][WatchProcessor] Error occurred when processing a watch event
> org.apache.ignite.lang.IgniteInternalException: Raft group on the node is already started [nodeId=RaftNodeId [groupId=1_part_0, peer=Peer [consistentId=irdt_ttqr_20000, idx=0]]]
>     at org.apache.ignite.internal.raft.Loza.startRaftGroupNodeInternal(Loza.java:342) ~[main/:?]
>     at org.apache.ignite.internal.raft.Loza.startRaftGroupNode(Loza.java:230) ~[main/:?]
>     at org.apache.ignite.internal.raft.Loza.startRaftGroupNode(Loza.java:203) ~[main/:?]
>     at org.apache.ignite.internal.table.distributed.TableManager.startPartitionRaftGroupNode(TableManager.java:2361) ~[main/:?]
>     at org.apache.ignite.internal.table.distributed.TableManager.lambda$handleChangePendingAssignmentEvent$98(TableManager.java:2261) ~[main/:?]
>     at org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:922) ~[main/:?]
>     at org.apache.ignite.internal.table.distributed.TableManager.lambda$handleChangePendingAssignmentEvent$99(TableManager.java:2259) ~[main/:?]
>     at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
>     at java.lang.Thread.run(Thread.java:834) ~[?:?]
> {code}
> The reproducer below is based on ItRebalanceDistributedTest#testThreeQueuedRebalances. The exception appears in the test log:
> {code:java}
> @Test
> void testThreeQueuedRebalances() throws Exception {
>     Node node = getNode(0);
>     createZone(node, ZONE_NAME, 1, 1);
>     createTable(node, ZONE_NAME, TABLE_NAME);
>     assertTrue(waitForCondition(() -> getPartitionClusterNodes(node, 0).size() == 1, AWAIT_TIMEOUT_MILLIS));
>     // Alter the zone repeatedly without waiting for rebalance to complete in between.
>     alterZone(node, ZONE_NAME, 2);
>     alterZone(node, ZONE_NAME, 3);
>     alterZone(node, ZONE_NAME, 2);
>     alterZone(node, ZONE_NAME, 3);
>     alterZone(node, ZONE_NAME, 2);
>     alterZone(node, ZONE_NAME, 3);
>     alterZone(node, ZONE_NAME, 2);
>     alterZone(node, ZONE_NAME, 3);
>     alterZone(node, ZONE_NAME, 2);
>     alterZone(node, ZONE_NAME, 3);
>     alterZone(node, ZONE_NAME, 2);
>     waitPartitionAssignmentsSyncedToExpected(0, 2);
>     checkPartitionNodes(0, 2);
> }
> {code}
> We can fix it by checking whether the Raft node and the Replica have already been created before calling startPartitionRaftGroupNode and startReplicaWithNewListener in TableManager#handleChangePendingAssignmentEvent.
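
The proposed fix amounts to making the start operation idempotent. A minimal, self-contained sketch of that idea follows. It is not Ignite code: RaftGroupStarter, start and startIfAbsent are hypothetical names introduced only for illustration, and the real check would live in TableManager#handleChangePendingAssignmentEvent and cover the Replica as well.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the component that starts Raft nodes; this is not Ignite's Loza API. */
final class RaftGroupStarter {
    /** Ids of the groups already started on this node. */
    private final Set<String> started = ConcurrentHashMap.newKeySet();

    /** Unguarded start: a second call for the same group fails, as in the reported stack trace. */
    void start(String groupId) {
        if (!started.add(groupId)) {
            throw new IllegalStateException(
                    "Raft group on the node is already started [groupId=" + groupId + "]");
        }
    }

    /** Guarded start: returns true if this call started the group, false if it was already running. */
    boolean startIfAbsent(String groupId) {
        // Set.add is atomic here, so the check and the start cannot be interleaved by another thread.
        return started.add(groupId);
    }
}
```

With such a guard, a pending-assignment event that arrives for an already started group would be treated as a no-op instead of raising an exception. The sketch folds the check and the start into one atomic step; this is an assumption about what is needed, made because a separate "is it started?" check followed by a start could race if events are handled concurrently.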