Re: Ignite 2.5 | Can't restore memory - critical part of WAL archive is missing with walMode=NONE

Emmanuel Marchand Tue, 05 Jun 2018 07:42:02 -0700

Hi,

I'm not sure I can provide a reproducer, but here is some information:


 * configuration file attached
 * 2 server nodes, 1 client (+ visor)
 * I'm streaming enough data to trigger a checkpoint with reason 'too
   many dirty pages'
     o [INFO][db-checkpoint-thread-#69][GridCacheDatabaseSharedManager]
       Checkpoint started
       [checkpointId=225ef67d-2850-499f-860d-f7868f1f73ec,
       startPtr=FileWALPointer [idx=0, fileOff=0, len=0],
       checkpointLockWait=151ms, checkpointLockHoldTime=17ms,
       walCpRecordFsyncDuration=0ms, pages=1508362, reason='too many
       dirty pages']
 * no error occurs
 * deactivate cluster then stop nodes using visor
 * restart nodes
 * activate cluster using visor -> crash

Workaround: if I delete (or rename) the checkpoint folder (which is *not* empty), the activation completes successfully.
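
For anyone who wants to script the rename, here is a minimal sketch in plain Java (no Ignite API involved). It assumes the checkpoint folder is the `cp` directory under the node's storage directory, as the checkpoint marker paths in the quoted log suggest. The class name, method name, and the `cp.bak` backup name are made up for illustration. Run it only while the node is stopped. It reproduces my workaround and is not an official recovery procedure.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: moves a node's checkpoint folder aside instead of deleting it.
class CheckpointDirWorkaround {

    /**
     * Renames <nodeStorageDir>/cp to <nodeStorageDir>/cp.bak.
     * Returns the backup path, or null if no cp directory exists.
     * The "cp" name is an assumption taken from the marker paths in the log.
     */
    static Path moveCheckpointDirAside(Path nodeStorageDir) throws IOException {
        Path cp = nodeStorageDir.resolve("cp");
        if (!Files.isDirectory(cp)) {
            return null; // nothing to move
        }
        Path backup = nodeStorageDir.resolve("cp.bak");
        if (Files.exists(backup)) {
            // Refuse to overwrite an earlier backup.
            throw new IOException("Backup already exists: " + backup);
        }
        // A rename within the same directory; the folder content is kept as-is.
        return Files.move(cp, backup);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: CheckpointDirWorkaround <node storage dir>");
            return;
        }
        Path backup = moveCheckpointDirAside(Paths.get(args[0]));
        System.out.println(backup == null
            ? "No cp directory found, nothing done"
            : "Checkpoint folder moved to " + backup);
    }
}
```

The argument would be the node directory under work/db (the node00-... folder in my case). Renaming rather than deleting keeps the markers around in case someone wants to inspect them.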


Regards,
---
Emmanuel.


On 05/06/18 15:40, Andrey Mashenkov wrote:

Hi,

I can't reproduce the issue.
Is it possible grid configuration was changed between runs?
Is it possible to share a reproducer?

On Tue, Jun 5, 2018 at 12:05 PM, Emmanuel Marchand wrote:


    Hi,

    I'm testing v2.5 vs v2.4 for a persisted data region with *walMode =
    NONE*, and while performance seems better, I failed to restart the
    cluster after what I think is a proper shutdown (using top
    -deactivate then kill -k from visor).

    When I try to reactivate the cluster (using top -activate from
    visor) I get the following exception on each node:

    [09:21:37,592][INFO][grid-nio-worker-tcp-comm-0-#33][TcpCommunicationSpi]
    Accepted incoming communication connection
    [locAddr=/192.168.1.1:47100,
    rmtAddr=/192.168.1.102:44646]
    [09:21:37,656][INFO][pub-#92][GridClusterStateProcessor] Sending
    activate request with BaselineTopology null
    [09:21:37,659][INFO][tcp-disco-msg-worker-#3][GridClusterStateProcessor]
    Received activate request with BaselineTopology: null
    [09:21:37,661][INFO][tcp-disco-msg-worker-#3][GridClusterStateProcessor]
    Started state transition: true
    [09:21:37,687][INFO][exchange-worker-#52][time] Started exchange
    init [topVer=AffinityTopologyVersion [topVer=69, minorTopVer=1],
    crd=true, evt=DISCOVERY_CUSTOM_EVT,
    evtNode=0f5d38b7-b748-4861-91ef-204ed9343e60,
    customEvt=ChangeGlobalStateMessage
    [id=c0eeccec361-85ace6cb-d27e-4a0e-9106-ca39e6fcbfdd,
    reqId=5a1cf16e-f610-4b4b-b1eb-76078be38d6c,
    initiatingNodeId=0f5d38b7-b748-4861-91ef-204ed9343e60,
    activate=true, baselineTopology=null,
    forceChangeBaselineTopology=false, timestamp=1528183297656],
    allowMerge=false]
    [09:21:37,688][INFO][exchange-worker-#52][GridDhtPartitionsExchangeFuture]
    Start activation process
    [nodeId=0f5d38b7-b748-4861-91ef-204ed9343e60, client=false,
    topVer=AffinityTopologyVersion [topVer=69, minorTopVer=1]]
    [09:21:37,688][INFO][exchange-worker-#52][FilePageStoreManager]
    Resolved page store work directory:
    /usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
    [09:21:37,689][INFO][exchange-worker-#52][FileWriteAheadLogManager]
    Resolved write ahead log work directory:
    /usr/share/apache-ignite-fabric-2.5.0-bin/work/db/wal/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
    [09:21:37,689][INFO][exchange-worker-#52][FileWriteAheadLogManager]
    Resolved write ahead log archive directory:
    /usr/share/apache-ignite-fabric-2.5.0-bin/work/db/wal/archive/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
    [09:21:37,690][WARNING][exchange-worker-#52][FileWriteAheadLogManager]
    Started write-ahead log manager in NONE mode, persisted data may
    be lost in a case of unexpected node failure. Make sure to
    deactivate the cluster before shutdown.
    [09:21:37,701][INFO][exchange-worker-#52][PageMemoryImpl] Started
    page memory [memoryAllocated=100.0 MiB, pages=24804, tableSize=1.9
    MiB, checkpointBuffer=100.0 MiB]
    [09:21:37,798][INFO][exchange-worker-#52][PageMemoryImpl] Started
    page memory [memoryAllocated=8.0 GiB, pages=2032836,
    tableSize=158.1 MiB, checkpointBuffer=2.0 GiB]
    [09:21:37,800][INFO][exchange-worker-#52][PageMemoryImpl] Started
    page memory [memoryAllocated=100.0 MiB, pages=24804, tableSize=1.9
    MiB, checkpointBuffer=100.0 MiB]
    [09:21:38,168][INFO][exchange-worker-#52][GridCacheDatabaseSharedManager]
    Read checkpoint status
    [startMarker=/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19/cp/1528182048551-ea54267c-22c4-4b64-b328-87cc09d3d460-START.bin,
    endMarker=/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19/cp/1528182048551-ea54267c-22c4-4b64-b328-87cc09d3d460-END.bin]
    [09:21:38,169][INFO][exchange-worker-#52][GridCacheDatabaseSharedManager]
    Checking memory state [lastValidPos=FileWALPointer [idx=0,
    fileOff=0, len=0], lastMarked=FileWALPointer [idx=0, fileOff=0,
    len=0], lastCheckpointId=ea54267c-22c4-4b64-b328-87cc09d3d460]
    [09:21:38,228][SEVERE][exchange-worker-#52][] Critical system
    error detected. Will be handled accordingly to configured handler
    [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler,
    failureCtx=FailureContext [type=CRITICAL_ERROR, err=class
    o.a.i.i.pagemem.wal.StorageException: Restore wal pointer = null,
    while status.endPtr = FileWALPointer [idx=0, fileOff=0, len=0].
    Can't restore memory - critical part of WAL archive is missing.]]
    class org.apache.ignite.internal.pagemem.wal.StorageException:
    Restore wal pointer = null, while status.endPtr = FileWALPointer
    [idx=0, fileOff=0, len=0]. Can't restore memory - critical part of
    WAL archive is missing.
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readCheckpointAndRestoreMemory(GridCacheDatabaseSharedManager.java:759)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onClusterStateChangeRequest(GridDhtPartitionsExchangeFuture.java:894)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:641)
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2419)
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2299)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
        at java.lang.Thread.run(Thread.java:748)
    [09:21:38,229][SEVERE][exchange-worker-#52][] JVM will be halted
    immediately due to the failure: [failureCtx=FailureContext
    [type=CRITICAL_ERROR, err=class
    o.a.i.i.pagemem.wal.StorageException: Restore wal pointer = null,
    while status.endPtr = FileWALPointer [idx=0, fileOff=0, len=0].
    Can't restore memory - critical part of WAL archive is missing.]]

    This operation ended successfully with version 2.4.

    The exception was introduced by IGNITE-8393
    <https://github.com/apache/ignite/commit/5ec1796408d45296cbb26bb93353e23e18b039c8>,
    but the problem may lie deeper, for example in the checkpoint process,
    which stores a status.endPtr = FileWALPointer even when walMode = NONE.

    Any ideas?


    Regards,
    ---
    Emmanuel.




--
Best regards,
Andrey V. Mashenkov

<?xml version="1.0" encoding="UTF-8"?>

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
       http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans.xsd">

  <!-- Alter configuration below as needed. -->

  <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">

    <!-- Explicitly enable peer class loading. -->
    <property name="peerClassLoadingEnabled" value="true"/>

    <!-- Durable memory configuration. -->
    <property name="dataStorageConfiguration">
      <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <property name="dataRegionConfigurations">
          <list>
            <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
              <property name="name" value="persisted-cache"/>
              <property name="initialSize" value="#{256L * 1024 * 1024}"/>              
              <property name="maxSize" value="#{8L * 1024 * 1024 * 1024}"/>
              <property name="persistenceEnabled" value="true"/>
            </bean> <!-- org.apache.ignite.configuration.DataRegionConfiguration -->
          </list>
        </property> <!-- dataRegionConfigurations -->
        <property name="walMode" value="NONE"/>
      </bean> <!-- org.apache.ignite.configuration.DataStorageConfiguration -->
    </property> <!-- dataStorageConfiguration -->

  </bean> <!-- org.apache.ignite.configuration.IgniteConfiguration -->

  <!-- Other configurations. -->

</beans>
