Dmitry Sherstobitov created IGNITE-8893:
-------------------------------------------

             Summary: Blinking node in baseline may corrupt own WAL records
                 Key: IGNITE-8893
                 URL: https://issues.apache.org/jira/browse/IGNITE-8893
             Project: Ignite
          Issue Type: Bug
    Affects Versions: 2.5
            Reporter: Dmitry Sherstobitov


# Start cluster, load data
 # Start an additional node that is not in the baseline topology (BLT)
 # Repeat 10 times: kill one node that is in the baseline and one node that is not, then start the baseline node and the non-BLT node again

At some point a node in the baseline may be unable to start because of a corrupted WAL.
Note that there is no load on the cluster at all, so there is no reason for the WAL to become corrupted; rebalancing should be safely interruptible.
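A minimal sketch of the first scenario against Ignite's public API (instance names and single-JVM restarts are illustrative; in the original runs the nodes are separate processes killed externally, which {{Ignition.stop(name, true)}} stands in for here):
{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class BlinkBaselineNodes {
    public static void main(String[] args) {
        // node-0 and node-1 are persistent servers; activation fixes them as the BLT.
        Ignite node0 = Ignition.start(persistentCfg("node-0"));
        Ignition.start(persistentCfg("node-1"));

        node0.cluster().active(true);

        // node-2 joins after activation, so it is not part of the BLT.
        Ignition.start(persistentCfg("node-2"));

        for (int i = 0; i < 10; i++) {
            // Kill one baseline node and one non-baseline node...
            Ignition.stop("node-1", true);
            Ignition.stop("node-2", true);

            // ...then bring both back without changing the baseline.
            Ignition.start(persistentCfg("node-1"));
            Ignition.start(persistentCfg("node-2"));
        }
    }

    private static IgniteConfiguration persistentCfg(String name) {
        DataStorageConfiguration ds = new DataStorageConfiguration();
        ds.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

        return new IgniteConfiguration()
            .setIgniteInstanceName(name)
            .setConsistentId(name)
            .setDataStorageConfiguration(ds);
    }
}
{code}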

There is also another scenario that may cause the same error (and may also cause a JVM crash):
 # Start cluster, load data, start nodes
 # Repeat 10 times: kill one node in the baseline, clean its LFS (a sketch of the cleanup follows below), start the node again, and while rebalancing is in progress, blink (kill and quickly restart) the node that should rebalance data to the previously killed node

The node that should rebalance data to the cleaned node may corrupt its own WAL. Note that this second scenario has a configuration "error": the number of backups in each case is 1, so obviously blinking two nodes may cause data loss.
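"Clean LFS" above means wiping the node's local persistence files so that it rejoins empty and has to receive all its data via rebalance. A minimal sketch, assuming the default work-directory layout ({{db/<consistentId>}} plus the WAL and WAL-archive directories; exact paths depend on configuration):
{code:java}
import java.nio.file.*;
import java.util.Comparator;

public class CleanLfs {
    /** Wipes a node's local persistence files (assumes the default layout). */
    public static void cleanLfs(String workDir, String consistentId) throws Exception {
        String[] subDirs = {
            "db/" + consistentId,               // page store
            "db/wal/" + consistentId,           // WAL segments
            "db/wal/archive/" + consistentId    // archived WAL segments
        };

        for (String sub : subDirs) {
            Path dir = Paths.get(workDir, sub);

            if (Files.exists(dir)) {
                // Delete children before parents.
                Files.walk(dir)
                    .sorted(Comparator.reverseOrder())
                    .forEach(p -> p.toFile().delete());
            }
        }
    }
}
{code}
The affected node then fails with an assertion in the WAL archiver: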
{code:java}
[2018-06-28 17:33:39,583][ERROR][wal-file-archiver%null-#63][root] Critical 
system error detected. Will be handled accordingly to configured handler 
[hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, 
failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, 
err=java.lang.AssertionError: lastArchived=757, current=42]]
java.lang.AssertionError: lastArchived=757, current=42
        at 
org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileArchiver.body(FileWriteAheadLogManager.java:1629)
        at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110){code}
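The assertion comes from the {{FileArchiver}} worker inside {{FileWriteAheadLogManager}}. Its invariant, roughly, is that the segment the node is currently writing must be ahead of the last segment already moved to the WAL archive; here the restarted node believes it is on segment 42 while its archive already holds segment 757. An illustrative restatement of the violated check (not the actual Ignite source):
{code:java}
// Illustrative only, not the actual Ignite source.
long lastArchivedIdx = 757; // highest WAL segment index present in the archive
long currentIdx = 42;       // segment index the node is writing after restart

// The archiver only ever moves forward; a current index behind the archive
// means the node's WAL state is inconsistent with its own history.
assert currentIdx > lastArchivedIdx :
    "lastArchived=" + lastArchivedIdx + ", current=" + currentIdx;
{code}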



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
