Pavel Kovalenko created IGNITE-8391:
---------------------------------------

             Summary: Removing some WAL history segments leads to WAL rebalance hanging
                 Key: IGNITE-8391
                 URL: https://issues.apache.org/jira/browse/IGNITE-8391
             Project: Ignite
          Issue Type: Bug
          Components: cache
    Affects Versions: 2.4
            Reporter: Pavel Kovalenko
             Fix For: 2.6


Problem:
1) Start 2 nodes and load some data into them.
2) Stop node 2 and load more data into the cache.
3) Remove an archived WAL segment that does not contain the checkpoint record 
needed to find the start point for WAL rebalance, but does contain data 
necessary for rebalancing.
4) Start node 2. It will start rebalancing data from node 1 using WAL.

Rebalance hangs with the following assertion error:

{noformat}
java.lang.AssertionError: Partitions after rebalance should be either done or missing: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:417)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
        at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
        at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
        at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
        at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
        at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
        at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{noformat}
 
This happens because we never reach the necessary data and updateCounters, 
which were contained in the removed WAL segment.
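
As a rough illustration of the cause, here is a toy model in plain Java. It 
does not use any Ignite classes: the class WalGapSketch, the method 
reachesTarget and the representation of a WAL segment as an array of update 
counters are all assumptions made for this sketch. It only shows that once a 
segment in the middle of the history is gone, replaying the remaining segments 
can never advance to the target counter.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only; none of these names exist in Ignite.
final class WalGapSketch {
    /**
     * Replays the segments in order and returns true only if every update
     * counter in the range (from, to] was seen without a gap.
     */
    static boolean reachesTarget(List<long[]> segments, long from, long to) {
        long expected = from + 1;

        for (long[] segment : segments) {
            for (long counter : segment) {
                // Only the next expected counter advances the replay position.
                if (counter == expected)
                    expected++;
            }
        }

        return expected > to;
    }

    public static void main(String[] args) {
        List<long[]> fullHistory = Arrays.asList(
            new long[] {1, 2}, new long[] {3, 4}, new long[] {5, 6});

        // Same history with the middle segment removed.
        List<long[]> withGap = Arrays.asList(
            new long[] {1, 2}, new long[] {5, 6});

        System.out.println(reachesTarget(fullHistory, 0, 6));
        System.out.println(reachesTarget(withGap, 0, 6));
    }
}
```

In this model the first call reports that the target is reached and the second 
reports that it is not, which corresponds to partitions that are neither done 
nor missing after the WAL has been fully iterated.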

To resolve such problems, we should introduce a fallback strategy for the case 
when rebalance by WAL fails. One example of a fallback strategy is to re-run 
full rebalance for the partitions that could not be properly rebalanced using WAL.
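
A minimal sketch of that fallback idea, again in plain Java with no Ignite 
classes. RebalanceFallbackSketch, Mode, plan and the walRebalanceSucceeded 
predicate are hypothetical names used only for illustration; a real fix would 
have to live in the supplier/demander code and is not shown here.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.IntPredicate;

// Illustration only; none of these names exist in Ignite.
final class RebalanceFallbackSketch {
    enum Mode { WAL, FULL }

    /**
     * Decides, per partition, whether the result of WAL rebalance can be kept
     * or whether the partition has to be re-run with full rebalance.
     */
    static Map<Integer, Mode> plan(Collection<Integer> partitions,
                                   IntPredicate walRebalanceSucceeded) {
        Map<Integer, Mode> result = new TreeMap<>();

        for (int part : partitions) {
            // Fall back to full rebalance only for partitions where WAL failed.
            result.put(part, walRebalanceSucceeded.test(part) ? Mode.WAL : Mode.FULL);
        }

        return result;
    }

    public static void main(String[] args) {
        // Assume, for the example, that WAL rebalance failed for partition 1 only.
        Map<Integer, Mode> plan = plan(Arrays.asList(0, 1, 2), part -> part != 1);

        System.out.println(plan);
    }
}
```

The point of deciding per partition is that partitions successfully rebalanced 
by WAL do not need to be transferred again.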



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
