Pavel Kovalenko created IGNITE-8391: ---------------------------------------
Summary: Removing some WAL history segments leads to WAL rebalance hanging Key: IGNITE-8391 URL: https://issues.apache.org/jira/browse/IGNITE-8391 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.4 Reporter: Pavel Kovalenko Fix For: 2.6 Problem: 1) Start 2 nodes, load some data to it. 2) Stop node 2, load some data to cache. 3) Remove WAL archived segment which doesn't contain Checkpoint record needed to find start point for WAL rebalance, but contains necessary data for rebalancing. 4) Start node 2, this node will start rebalance data from node 1 using WAL. Rebalance will be hanged with following assertion: {noformat} java.lang.AssertionError: Partitions after rebalance should be either done or missing: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:417) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364) at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379) at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364) at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054) at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579) at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99) at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603) at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556) at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125) at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752) at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516) at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125) at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} This happened because we never reached necessary data and updateCounters contained in removed WAL segment. To resolve such problems we should introduce some fallback strategy if rebalance by WAL has been failed. Example of fallback strategy is - re-run full rebalance for partitions that were not able properly rebalanced using WAL. -- This message was sent by Atlassian JIRA (v7.6.3#76005)