Hi,

Here is some more information:

 * the data regions are as defined in the provided configuration file; I
   guess there is only one.
 * I have 3 caches and I update all 3 using a data streamer with a
   custom receiver for each (roughly as in the sketch below this list).
 * none of the caches use any backups, and atomicity mode is set to ATOMIC
 * the caches are indeed partitioned; keys use affinityKeyMapped and
   implement Comparable<K> and Externalizable
 * values use indexing
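
For reference, the streaming part looks roughly like the sketch below. It is a
simplified, hypothetical version (the cache name, key/value types and config
path are placeholders, and the affinity-mapped/Externalizable key class is
omitted); it only shows the "data streamer with a custom receiver" pattern:

import java.util.Collection;
import java.util.Map;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;
import org.apache.ignite.stream.StreamReceiver;

public class StreamingSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start("ignite-config.xml")) { // placeholder config path
            ignite.cluster().active(true); // activation is needed when persistence is enabled
            ignite.getOrCreateCache("cacheA"); // placeholder cache name

            try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("cacheA")) {
                streamer.receiver(new SampleReceiver()); // the custom receiver

                for (long i = 0; i < 1_000_000; i++)
                    streamer.addData(i, "value-" + i);
            } // close() flushes the remaining buffered entries
        }
    }

    /** Placeholder receiver: applies each streamed entry on the node that owns the key. */
    private static class SampleReceiver implements StreamReceiver<Long, String> {
        @Override public void receive(IgniteCache<Long, String> cache,
            Collection<Map.Entry<Long, String>> entries) {
            for (Map.Entry<Long, String> e : entries)
                cache.put(e.getKey(), e.getValue());
        }
    }
}

The two other caches are loaded the same way, each with its own custom receiver.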

About your test:

 * do you add enough data to trigger a checkpoint? Maybe try reducing
   setMaxSize? (a rough sketch follows this list)
 * I don't set checkpointPageBufferSize
 * are you sure the checkpoint folder is not empty before restart?
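
To illustrate what I mean by reducing setMaxSize, here is a rough sketch (the
size and the implicit default region are just assumptions, not your actual
test config). With a small persistent region, the dirty-pages threshold is
reached quickly, so even a short streaming run should log a checkpoint with
reason='too many dirty pages':

import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;

public class SmallRegionSketch {
    public static DataStorageConfiguration storage() {
        // A deliberately small persistent region: its dirty-pages threshold is
        // reached with little data, forcing a 'too many dirty pages' checkpoint.
        // checkpointPageBufferSize is intentionally left unset, as in my own setup.
        return new DataStorageConfiguration()
            .setDefaultDataRegionConfiguration(
                new DataRegionConfiguration()
                    .setPersistenceEnabled(true)
                    .setMaxSize(256L * 1024 * 1024)); // assumed 256 MiB
    }
}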

Thanks for your help.

Regards,
---
Emmanuel.

On 05/06/18 17:26, Andrey Mashenkov wrote:
Hi,

I've made a simple test and still can't reproduce this.

Would you please take a look and see if I missed something?
Is it possible you have more than one region and/or cache?



On Tue, Jun 5, 2018 at 5:41 PM, Emmanuel Marchand <[email protected]> wrote:

    Hi,

    I'm not sure I can provide a reproducer, but here is some
    information:

      * configuration file attached
      * 2 server nodes, 1 client (+ visor)
      * I'm streaming enough data to trigger a checkpoint with
        reason 'too many dirty pages'
          o [INFO][db-checkpoint-thread-#69][GridCacheDatabaseSharedManager]
            Checkpoint started
            [checkpointId=225ef67d-2850-499f-860d-f7868f1f73ec,
            startPtr=FileWALPointer [idx=0, fileOff=0, len=0],
            checkpointLockWait=151ms, checkpointLockHoldTime=17ms,
            walCpRecordFsyncDuration=0ms, pages=1508362, reason='too
            many dirty pages']
      * no error occurs
      * deactivate cluster then stop nodes using visor
      * restart nodes
      * activate cluster using visor -> crash

    Workaround: if I delete (or rename) the checkpoint folder (which
    is *not* empty), the activation completes successfully.
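
    For clarity, here is roughly the same cycle expressed programmatically
    instead of through visor (only an equivalent sketch with a placeholder
    config path; I actually drive the cluster from visor):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    public class RestartCycleSketch {
        public static void main(String[] args) {
            // 1. Start a node and activate the cluster, then stream data until the
            //    log shows a checkpoint with reason='too many dirty pages'.
            Ignite ignite = Ignition.start("ignite-config.xml"); // placeholder path
            ignite.cluster().active(true);

            // ... stream data here ...

            // 2. Deactivate the cluster and stop the node (clean shutdown).
            ignite.cluster().active(false);
            Ignition.stop(false);

            // 3. Restart the node and activate again; this is where the crash occurs.
            ignite = Ignition.start("ignite-config.xml");
            ignite.cluster().active(true);
        }
    }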

    Regards,
    ---
    Emmanuel.



    On 05/06/18 15:40, Andrey Mashenkov wrote:
    Hi,

    I can't reproduce the issue.
    Is it possible the grid configuration was changed between runs?
    Is it possible to share a reproducer?


    On Tue, Jun 5, 2018 at 12:05 PM, Emmanuel Marchand <[email protected]> wrote:

        Hi,

        I'm testing v2.5 vs v2.4 for a persisted data region with
        *walMode = NONE*, and while performance seems better, I failed
        to restart the cluster after what I think is a proper
        shutdown (using top -deactivate then kill -k from visor).
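
        For what it's worth, the relevant part of the storage setup looks
        roughly like this (a simplified sketch with assumed names and sizes,
        not my exact configuration):

        import org.apache.ignite.configuration.DataRegionConfiguration;
        import org.apache.ignite.configuration.DataStorageConfiguration;
        import org.apache.ignite.configuration.IgniteConfiguration;
        import org.apache.ignite.configuration.WALMode;

        public class WalNoneSketch {
            public static IgniteConfiguration config() {
                DataStorageConfiguration storage = new DataStorageConfiguration()
                    .setWalMode(WALMode.NONE) // the WAL mode under test here
                    .setDefaultDataRegionConfiguration(
                        new DataRegionConfiguration()
                            .setName("persisted") // assumed region name
                            .setPersistenceEnabled(true)
                            .setMaxSize(8L * 1024 * 1024 * 1024)); // assumed 8 GiB

                return new IgniteConfiguration().setDataStorageConfiguration(storage);
            }
        }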

        When I try to reactivate the cluster (using top -activate
        from visor) I get the following exception on each node:

        
[09:21:37,592][INFO][grid-nio-worker-tcp-comm-0-#33][TcpCommunicationSpi]
        Accepted incoming communication connection
        [locAddr=/192.168.1.1:47100, rmtAddr=/192.168.1.102:44646]
        [09:21:37,656][INFO][pub-#92][GridClusterStateProcessor]
        Sending activate request with BaselineTopology null
        [09:21:37,659][INFO][tcp-disco-msg-worker-#3][GridClusterStateProcessor]
        Received activate request with BaselineTopology: null
        [09:21:37,661][INFO][tcp-disco-msg-worker-#3][GridClusterStateProcessor]
        Started state transition: true
        [09:21:37,687][INFO][exchange-worker-#52][time] Started
        exchange init [topVer=AffinityTopologyVersion [topVer=69,
        minorTopVer=1], crd=true, evt=DISCOVERY_CUSTOM_EVT,
        evtNode=0f5d38b7-b748-4861-91ef-204ed9343e60,
        customEvt=ChangeGlobalStateMessage
        [id=c0eeccec361-85ace6cb-d27e-4a0e-9106-ca39e6fcbfdd,
        reqId=5a1cf16e-f610-4b4b-b1eb-76078be38d6c,
        initiatingNodeId=0f5d38b7-b748-4861-91ef-204ed9343e60,
        activate=true, baselineTopology=null,
        forceChangeBaselineTopology=false, timestamp=1528183297656],
        allowMerge=false]
        
[09:21:37,688][INFO][exchange-worker-#52][GridDhtPartitionsExchangeFuture]
        Start activation process
        [nodeId=0f5d38b7-b748-4861-91ef-204ed9343e60, client=false,
        topVer=AffinityTopologyVersion [topVer=69, minorTopVer=1]]
        [09:21:37,688][INFO][exchange-worker-#52][FilePageStoreManager]
        Resolved page store work directory:
        
/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
        [09:21:37,689][INFO][exchange-worker-#52][FileWriteAheadLogManager]
        Resolved write ahead log work directory:
        
/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/wal/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
        [09:21:37,689][INFO][exchange-worker-#52][FileWriteAheadLogManager]
        Resolved write ahead log archive directory:
        
/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/wal/archive/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
        [09:21:37,690][WARNING][exchange-worker-#52][FileWriteAheadLogManager]
        Started write-ahead log manager in NONE mode, persisted data
        may be lost in a case of unexpected node failure. Make sure
        to deactivate the cluster before shutdown.
        [09:21:37,701][INFO][exchange-worker-#52][PageMemoryImpl]
        Started page memory [memoryAllocated=100.0 MiB, pages=24804,
        tableSize=1.9 MiB, checkpointBuffer=100.0 MiB]
        [09:21:37,798][INFO][exchange-worker-#52][PageMemoryImpl]
        Started page memory [memoryAllocated=8.0 GiB, pages=2032836,
        tableSize=158.1 MiB, checkpointBuffer=2.0 GiB]
        [09:21:37,800][INFO][exchange-worker-#52][PageMemoryImpl]
        Started page memory [memoryAllocated=100.0 MiB, pages=24804,
        tableSize=1.9 MiB, checkpointBuffer=100.0 MiB]
        
[09:21:38,168][INFO][exchange-worker-#52][GridCacheDatabaseSharedManager]
        Read checkpoint status
        
[startMarker=/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19/cp/1528182048551-ea54267c-22c4-4b64-b328-87cc09d3d460-START.bin,
        
endMarker=/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19/cp/1528182048551-ea54267c-22c4-4b64-b328-87cc09d3d460-END.bin]
        
[09:21:38,169][INFO][exchange-worker-#52][GridCacheDatabaseSharedManager]
        Checking memory state [lastValidPos=FileWALPointer [idx=0,
        fileOff=0, len=0], lastMarked=FileWALPointer [idx=0,
        fileOff=0, len=0],
        lastCheckpointId=ea54267c-22c4-4b64-b328-87cc09d3d460]
        [09:21:38,228][SEVERE][exchange-worker-#52][] Critical
        system error detected. Will be handled accordingly to
        configured handler [hnd=class
        o.a.i.failure.StopNodeOrHaltFailureHandler,
        failureCtx=FailureContext [type=CRITICAL_ERROR, err=class
        o.a.i.i.pagemem.wal.StorageException: Restore wal pointer =
        null, while status.endPtr = FileWALPointer [idx=0, fileOff=0,
        len=0]. Can't restore memory - critical part of WAL archive
        is missing.]]
        class org.apache.ignite.internal.pagemem.wal.StorageException:
        Restore wal pointer = null, while status.endPtr =
        FileWALPointer [idx=0, fileOff=0, len=0]. Can't restore
        memory - critical part of WAL archive is missing.
            at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readCheckpointAndRestoreMemory(GridCacheDatabaseSharedManager.java:759)
            at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onClusterStateChangeRequest(GridDhtPartitionsExchangeFuture.java:894)
            at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:641)
            at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2419)
            at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2299)
            at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
            at java.lang.Thread.run(Thread.java:748)
        [09:21:38,229][SEVERE][exchange-worker-#52][] JVM will be
        halted immediately due to the failure:
        [failureCtx=FailureContext [type=CRITICAL_ERROR, err=class
        o.a.i.i.pagemem.wal.StorageException: Restore wal pointer =
        null, while status.endPtr = FileWALPointer [idx=0, fileOff=0,
        len=0]. Can't restore memory - critical part of WAL archive
        is missing.]]

        This operation ended successfully with version 2.4.

        The exception was introduced by IGNITE-8393
        <https://github.com/apache/ignite/commit/5ec1796408d45296cbb26bb93353e23e18b039c8>,
        but the problem may lie deeper, perhaps in the checkpoint
        process, which stores a status.endPtr = FileWALPointer even
        when walMode = NONE.

        Any idea?


        Regards,
        ---
        Emmanuel.




    --
    Best regards,
    Andrey V. Mashenkov




--
Best regards,
Andrey V. Mashenkov
