Re: Ignite 2.5 | Can't restore memory - critical part of WAL archive is missing with walMode=NONE

Andrey Mashenkov Tue, 05 Jun 2018 08:26:51 -0700

Hi,

I've make a simple test and still can't reproduce this.


Would you please take a look if I missed smth?
Is it possible you have more than one region and\or cache?



On Tue, Jun 5, 2018 at 5:41 PM, Emmanuel Marchand <
[email protected]> wrote:

> Hi,
>
> I'm not sure I can provide a reproducer, but here is some informations :
>
>    - configuration file attached
>    - 2 server nodes, 1 client (+ visor)
>    - I'm streaming enough data to trigger a checkpoint with a reason 'too
>    many dirty pages'
>       - [INFO][db-checkpoint-thread-#69][GridCacheDatabaseSharedManager]
>       Checkpoint started [checkpointId=225ef67d-2850-499f-860d-f7868f1f73ec,
>       startPtr=FileWALPointer [idx=0, fileOff=0, len=0],
>       checkpointLockWait=151ms, checkpointLockHoldTime=17ms,
>       walCpRecordFsyncDuration=0ms, pages=1508362, reason='too many dirty 
> pages']
>       - no error occurs
>    - deactivate cluster then stop nodes using visor
>    - restart nodes
>    - activate cluster using visor -> crash
>
> Workaround : if I delete (or rename) the checkpoint folder (which is *not*
> empty), the activation completes successfully.
> Regards,
> ---
> Emmanuel.
>
>
>
> On 05/06/18 15:40, Andrey Mashenkov wrote:
>
> Hi,
>
> I can't reproduce the issue.
> Is it possible grid configuration was changed between runs?
> Is it possible to share a reproducer?
>
>
> On Tue, Jun 5, 2018 at 12:05 PM, Emmanuel Marchand <
> [email protected]> wrote:
>
>> Hi,
>>
>> I'm testing v2.5 vs v2.4 for persisted dataregion with *walModel = NONE*
>> and while performance seems better I failed to restart the cluster after
>> what I think is a proper shutdown (using top -deactivate then kill -k from
>> visor).
>>
>> When I try to reactivate the cluster (using top -activate from visor) I
>> get the following exception on each nodes :
>> [09:21:37,592][INFO][grid-nio-worker-tcp-comm-0-#33][TcpCommunicationSpi]
>> Accepted incoming communication connection [locAddr=/192.168.1.1:47100,
>> rmtAddr=/192.168.1.102:44646]
>> [09:21:37,656][INFO][pub-#92][GridClusterStateProcessor] Sending
>> activate request with BaselineTopology null
>> [09:21:37,659][INFO][tcp-disco-msg-worker-#3][GridClusterStateProcessor]
>> Received activate request with BaselineTopology: null
>> [09:21:37,661][INFO][tcp-disco-msg-worker-#3][GridClusterStateProcessor]
>> Started state transition: true
>> [09:21:37,687][INFO][exchange-worker-#52][time] Started exchange init
>> [topVer=AffinityTopologyVersion [topVer=69, minorTopVer=1], crd=true,
>> evt=DISCOVERY_CUSTOM_EVT, evtNode=0f5d38b7-b748-4861-91ef-204ed9343e60,
>> customEvt=ChangeGlobalStateMessage 
>> [id=c0eeccec361-85ace6cb-d27e-4a0e-9106-ca39e6fcbfdd,
>> reqId=5a1cf16e-f610-4b4b-b1eb-76078be38d6c,
>> initiatingNodeId=0f5d38b7-b748-4861-91ef-204ed9343e60, activate=true,
>> baselineTopology=null, forceChangeBaselineTopology=false,
>> timestamp=1528183297656], allowMerge=false]
>> [09:21:37,688][INFO][exchange-worker-#52][GridDhtPartitionsExchangeFuture]
>> Start activation process [nodeId=0f5d38b7-b748-4861-91ef-204ed9343e60,
>> client=false, topVer=AffinityTopologyVersion [topVer=69, minorTopVer=1]]
>> [09:21:37,688][INFO][exchange-worker-#52][FilePageStoreManager] Resolved
>> page store work directory: /usr/share/apache-ignite-fabri
>> c-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
>> [09:21:37,689][INFO][exchange-worker-#52][FileWriteAheadLogManager]
>> Resolved write ahead log work directory: /usr/share/apache-ignite-fabri
>> c-2.5.0-bin/work/db/wal/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
>> [09:21:37,689][INFO][exchange-worker-#52][FileWriteAheadLogManager]
>> Resolved write ahead log archive directory: /usr/share/apache-ignite-fabri
>> c-2.5.0-bin/work/db/wal/archive/node00-bcfb4de5-5fc6-41e9-
>> 9ebd-90b873711c19
>> [09:21:37,690][WARNING][exchange-worker-#52][FileWriteAheadLogManager]
>> Started write-ahead log manager in NONE mode, persisted data may be lost in
>> a case of unexpected node failure. Make sure to deactivate the cluster
>> before shutdown.
>> [09:21:37,701][INFO][exchange-worker-#52][PageMemoryImpl] Started page
>> memory [memoryAllocated=100.0 MiB, pages=24804, tableSize=1.9 MiB,
>> checkpointBuffer=100.0 MiB]
>> [09:21:37,798][INFO][exchange-worker-#52][PageMemoryImpl] Started page
>> memory [memoryAllocated=8.0 GiB, pages=2032836, tableSize=158.1 MiB,
>> checkpointBuffer=2.0 GiB]
>> [09:21:37,800][INFO][exchange-worker-#52][PageMemoryImpl] Started page
>> memory [memoryAllocated=100.0 MiB, pages=24804, tableSize=1.9 MiB,
>> checkpointBuffer=100.0 MiB]
>> [09:21:38,168][INFO][exchange-worker-#52][GridCacheDatabaseSharedManager]
>> Read checkpoint status [startMarker=/usr/share/apache
>> -ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-
>> 9ebd-90b873711c19/cp/1528182048551-ea54267c-22c4-
>> 4b64-b328-87cc09d3d460-START.bin, endMarker=/usr/share/apache-ig
>> nite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd
>> -90b873711c19/cp/1528182048551-ea54267c-22c4-4b64-b328-
>> 87cc09d3d460-END.bin]
>> [09:21:38,169][INFO][exchange-worker-#52][GridCacheDatabaseSharedManager]
>> Checking memory state [lastValidPos=FileWALPointer [idx=0, fileOff=0,
>> len=0], lastMarked=FileWALPointer [idx=0, fileOff=0, len=0],
>> lastCheckpointId=ea54267c-22c4-4b64-b328-87cc09d3d460]
>> *[09:21:38,228][SEVERE][exchange-worker-#52][] Critical system error
>> detected. Will be handled accordingly to configured handler [hnd=class
>> o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext
>> [type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException:
>> Restore wal pointer = null, while status.endPtr = FileWALPointer [idx=0,
>> fileOff=0, len=0]. Can't restore memory - critical part of WAL archive is
>> missing.]]*
>> *class org.apache.ignite.internal.pagemem.wal.StorageException: Restore
>> wal pointer = null, while status.endPtr = FileWALPointer [idx=0, fileOff=0,
>> len=0]. Can't restore memory - critical part of WAL archive is missing.*
>> *    at org.apache.ignite.internal.pro
>> <http://org.apache.ignite.internal.pro>cessors.cache.persistence.GridCacheDatabaseSharedManager.readCheckpointAndRestoreMemory(GridCacheDatabaseSharedManager.java:759)*
>> *    at org.apache.ignite.internal.pro
>> <http://org.apache.ignite.internal.pro>cessors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onClusterStateChangeRequest(GridDhtPartitionsExchangeFuture.java:894)*
>> *    at org.apache.ignite.internal.pro
>> <http://org.apache.ignite.internal.pro>cessors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:641)*
>> *    at org.apache.ignite.internal.pro
>> <http://org.apache.ignite.internal.pro>cessors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2419)*
>> *    at org.apache.ignite.internal.pro
>> <http://org.apache.ignite.internal.pro>cessors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2299)*
>> *    at
>> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)*
>> *    at java.lang.Thread.run(Thread.java:748)*
>> *[09:21:38,229][SEVERE][exchange-worker-#52][] JVM will be halted
>> immediately due to the failure: [failureCtx=FailureContext
>> [type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException:
>> Restore wal pointer = null, while status.endPtr = FileWALPointer [idx=0,
>> fileOff=0, len=0]. Can't restore memory - critical part of WAL archive is
>> missing.]]*
>>
>> This operation ended successfully with version 2.4.
>>
>> The exception was introduced by IGNITE-8393
>> <https://github.com/apache/ignite/commit/5ec1796408d45296cbb26bb93353e23e18b039c8>
>> but the problem may lie deeper, like in the checkpoint process which store
>> a status.endPtr = FileWALPointer even when walMode = NONE.
>>
>> Any idea ?
>>
>>
>> Regards,
>> ---
>> Emmanuel.
>>
>
>
>
> --
> Best regards,
> Andrey V. Mashenkov
>
>
>


-- 
Best regards,
Andrey V. Mashenkov

package userlist;

import java.util.Random;
import java.util.UUID;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.WALMode;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;
import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;

/** */
public class GridRestartNoWalTest extends GridCommonAbstractTest {
    /** */
    public static final String CACHE_NAME = "MYCACHE";

    /** */
    public static final TcpDiscoveryVmIpFinder IP_FINDER = new TcpDiscoveryVmIpFinder(true);

    /** */
    public static final int ENTRIES = 100_000;

    @Override protected void beforeTest() throws Exception {
        super.beforeTest();

        cleanPersistenceDir();
    }

    /** {@inheritDoc} */
    protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
        return super.getConfiguration(igniteInstanceName).setDataStorageConfiguration(new DataStorageConfiguration()
            .setDataRegionConfigurations(
                new DataRegionConfiguration()
                    .setMaxSize(100L * 1024 * 1024)
                    .setCheckpointPageBufferSize(15L * 1024 * 1024)
                    .setPersistenceEnabled(true)
                    .setName("persisted-cache"))
            .setWalMode(WALMode.NONE))
            .setCacheConfiguration(cacheConfig())
            .setDiscoverySpi(new TcpDiscoverySpi().setIpFinder(IP_FINDER));
    }

    /** */
    protected CacheConfiguration cacheConfig() {
        return new CacheConfiguration<UUID, String>()
            .setName(CACHE_NAME)
            .setCacheMode(CacheMode.PARTITIONED)
            .setDataRegionName("persisted-cache")
            .setBackups(1);
    }

    /**
     * @throws Exception if fails.
     */
    public void testName() throws Exception {
        Ignite srv = startGridsMultiThreaded(2, 1);
        srv.cluster().active(true);

        fillCache(srv);

        srv.cluster().active(false);
        stopAllGrids();

        srv = startGridsMultiThreaded(2, 1);
        srv.cluster().active(true);

        final IgniteCache<Integer, byte[]> cache = srv.cache(CACHE_NAME);

        waitAndCheckExpired(cache);

        stopAllGrids();
    }

    /** */
    protected void fillCache(Ignite node) {
        byte[] value = new byte[1024];

        Random rnd = new Random(0);

        try (IgniteDataStreamer<Object, Object> strmr = node.dataStreamer(CACHE_NAME)) {
            for (int i = 0; i < ENTRIES; i++)
                strmr.addData(rnd.nextInt(ENTRIES), value);
        }
    }

    /** */
    protected void waitAndCheckExpired(IgniteCache<Integer, byte[]> cache) {
        Random rnd = new Random(0);

        for (int i = 0; i < ENTRIES; i++)
            assertNotNull(cache.get(rnd.nextInt(ENTRIES)));
    }
}

Re: Ignite 2.5 | Can't restore memory - critical part of WAL archive is missing with walMode=NONE

Reply via email to