I am using the default WAL mode. I think it's LOG_ONLY.
The log from the crashed Ignite server is below:
{"type":"log","host":"ignite-cluster-ap-ignite-10","level":"INFO","systemid":"296b639f","system":"ignite-service","time":"2019-01-31
16:01:29,093","logger":"GridCacheDatabaseSharedManager","timezone":"UTC","marker":"","log":"Read
checkpoint status
[startMarker=/opt/ignite/apache-ignite-fabric-2.6.0-bin/persistence/node00-1ed7d92a-a181-4ffb-ad90-df30e3e1fa12/cp/1548909757044-63969238-f350-4b12-bdf5-f7a540021e58-START.bin,
endMarker=/opt/ignite/apache-ignite-fabric-2.6.0-bin/persistence/node00-1ed7d92a-a181-4ffb-ad90-df30e3e1fa12/cp/1548909575263-435715b4-71a9-4c2b-90ef-d831ed575ffc-END.bin]"}
{"type":"log","host":"ignite-cluster-ap-ignite-10","level":"INFO","systemid":"296b639f","system":"ignite-service","time":"2019-01-31
16:01:29,093","logger":"GridCacheDatabaseSharedManager","timezone":"UTC","marker":"","log":"Checking
memory state [lastValidPos=FileWALPointer [idx=412, fileOff=50500521,
len=57801], lastMarked=FileWALPointer [idx=426, fileOff=38038736,
len=57801], lastCheckpointId=63969238-f350-4b12-bdf5-f7a540021e58]"}
{"type":"log","host":"ignite-cluster-ap-ignite-10","level":"WARN","systemid":"296b639f","system":"ignite-service","time":"2019-01-31
16:01:29,094","logger":"GridCacheDatabaseSharedManager","timezone":"UTC","marker":"","log":"Ignite
node stopped in the middle of checkpoint. Will restore memory state and
finish checkpoint on node start."}
{"type":"log","host":"ignite-cluster-ap-ignite-10","level":"ERROR","systemid":"296b639f","system":"ignite-service","time":"2019-01-31
16:01:29,105","logger":"","timezone":"UTC","marker":"","log":"Critical
system error detected. Will be handled accordingly to configured handler
[hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler,
failureCtx=FailureContext [type=CRITICAL_ERROR, err=class
o.a.i.i.pagemem.wal.StorageException: Failed to restore memory state
(checkpoint marker is present on disk, but checkpoint record is missed in
WAL) [cpStatus=CheckpointStatus [cpStartTs=1548909757044,
cpStartId=63969238-f350-4b12-bdf5-f7a540021e58, startPtr=FileWALPointer
[idx=426, fileOff=38038736, len=57801],
cpEndId=435715b4-71a9-4c2b-90ef-d831ed575ffc, endPtr=FileWALPointer
[idx=412, fileOff=50500521, len=57801]], lastRead=null]]] class
org.apache.ignite.internal.pagemem.wal.StorageException: Failed to restore
memory state (checkpoint marker is present on disk, but checkpoint record
is missed in WAL) [cpStatus=CheckpointStatus [cpStartTs=1548909757044,
cpStartId=63969238-f350-4b12-bdf5-f7a540021e58, startPtr=FileWALPointer
[idx=426, fileOff=38038736, len=57801],
cpEndId=435715b4-71a9-4c2b-90ef-d831ed575ffc, endPtr=FileWALPointer
[idx=412, fileOff=50500521, len=57801]], lastRead=null]
at
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreMemory(GridCacheDatabaseSharedManager.java:2120)
at
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreMemory(GridCacheDatabaseSharedManager.java:1929)
at
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readCheckpointAndRestoreMemory(GridCacheDatabaseSharedManager.java:755)
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.initCachesOnLocalJoin(GridDhtPartitionsExchangeFuture.java:789)
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:674)
at
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2419)
at
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2299)
at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.lang.Thread.run(Thread.java:748)
"}
{"type":"log","host":"ignite-cluster-ap-ignite-10","level":"ERROR","systemid":"296b639f","system":"ignite-service","time":"2019-01-31
16:01:29,106","logger":"","timezone":"UTC","marker":"","log":"JVM will be
halted immediately due to the failure: [failureCtx=FailureContext
[type=CRITICAL_ERROR,
err=class o.a.i.i.pagemem.wal.StorageException: Failed to restore memory
state (checkpoint marker is present on disk, but checkpoint record is
missed in WAL) [cpStatus=CheckpointStatus [cpStartTs=1548909757044,
cpStartId=63969238-f350-4b12-bdf5-f7a540021e58, startPtr=FileWALPointer
[idx=426, fileOff=38038736, len=57801],
cpEndId=435715b4-71a9-4c2b-90ef-d831ed575ffc, endPtr=FileWALPointer
[idx=412, fileOff=50500521, len=57801]], lastRead=null]]]"}
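To make the checkpoint status above easier to read, here is a minimal sketch that pulls the timestamp and checkpoint ID out of the two marker file names and compares them. It assumes the names follow `<epoch millis>-<checkpoint UUID>-<START|END>.bin`, which is only inferred from the two paths in the log; `parse_marker` and `summarize` are hypothetical helper names, not part of Ignite.

```python
import posixpath
import re

# Values copied verbatim from the "Read checkpoint status" and
# "Failed to restore memory state" log entries above.
CP_DIR = ("/opt/ignite/apache-ignite-fabric-2.6.0-bin/persistence/"
          "node00-1ed7d92a-a181-4ffb-ad90-df30e3e1fa12/cp/")
START_MARKER = CP_DIR + "1548909757044-63969238-f350-4b12-bdf5-f7a540021e58-START.bin"
END_MARKER = CP_DIR + "1548909575263-435715b4-71a9-4c2b-90ef-d831ed575ffc-END.bin"
START_PTR_IDX = 426  # startPtr=FileWALPointer [idx=426, ...]
END_PTR_IDX = 412    # endPtr=FileWALPointer [idx=412, ...]

# Assumption: <epoch millis>-<checkpoint UUID>-<START|END>.bin
MARKER_RE = re.compile(
    r"(\d+)-([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})-(START|END)\.bin"
)


def parse_marker(path):
    """Split a marker file name into timestamp, checkpoint ID and kind."""
    name = posixpath.basename(path)
    match = MARKER_RE.fullmatch(name)
    if match is None:
        raise ValueError("unexpected marker name: " + name)
    ts_ms, cp_id, kind = match.groups()
    return {"ts_ms": int(ts_ms), "cp_id": cp_id, "kind": kind}


def summarize(start_path, end_path, start_idx, end_idx):
    """Compare the newest START marker with the newest END marker."""
    start = parse_marker(start_path)
    end = parse_marker(end_path)
    return {
        # Different IDs: the newest started checkpoint has no END marker.
        "unfinished": start["cp_id"] != end["cp_id"],
        # Time between the last finished checkpoint and the unfinished one.
        "gap_ms": start["ts_ms"] - end["ts_ms"],
        # WAL segment index distance between the two checkpoint pointers.
        "segment_gap": start_idx - end_idx,
    }


summary = summarize(START_MARKER, END_MARKER, START_PTR_IDX, END_PTR_IDX)
print(summary)
```

If that naming assumption holds, the START marker belongs to a newer checkpoint than the END marker, which is consistent with the "node stopped in the middle of checkpoint" warning in the log.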
Regards
Krupa
On Fri, 1 Feb 2019 at 19:50, Ilya Kasnacheev <[email protected]>
wrote:
> Hello!
>
> It's hard to say outright. Can you provide full log before node crash? Is
> there a chance that you ran out of disk space? What's your WALMode?
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> On Fri, 1 Feb 2019 at 08:16, radha jai <[email protected]> wrote:
>
>> Hi,
>> Ignite has been deployed on k8s with 12 ignite-servers, spread out one
>> per worker node. The resource limits are 1 CPU and 32GB RAM, with a
>> maximum of 8 CPUs and 64GB. Each ignite-server has a WAL and persistent
>> storage volume of 30GB.
>> After inserting 60GB of data into the Ignite cluster, one of the nodes
>> crashes and never recovers. The error on startup indicates that the WAL
>> fails to restore memory state:
>> type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException:
>> Failed to restore memory state (checkpoint marker is present on disk, but
>> checkpoint record is missed in WAL)
>>
>> The following warning messages are seen in some of the server logs:
>>
>> [03:53:53,375][WARNING][jvm-pause-detector-worker][] Possible too long
>> JVM pause: 1022 milliseconds.
>>
>>
>> A snippet of the Ignite configuration is below:
>>
>>
>> <property name="peerClassLoadingEnabled" value="true"/>
>> <property name="dataStorageConfiguration">
>>   <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>>     <!-- Enable metrics for Ignite persistence -->
>>     <property name="metricsEnabled" value="true"/>
>>     <property name="defaultDataRegionConfiguration">
>>       <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>>         <property name="name" value="Default_Region"/>
>>         <property name="initialSize" value="#{32L * 1024 * 1024 * 1024}"/>
>>         <property name="maxSize" value="#{64L * 1024 * 1024 * 1024}"/>
>>         <!-- Enabling Apache Ignite Persistent Store. -->
>>         <property name="persistenceEnabled" value="true"/>
>>         <!-- Enable metrics for this data region -->
>>         <property name="metricsEnabled" value="true"/>
>>       </bean>
>>     </property>
>>     <property name="storagePath" value="/opt/ignite/persistence/"/>
>>     <property name="walPath" value="/opt/ignite/wal/"/>
>>   </bean>
>> </property>
>>
>>
>> Ignite JVM configuration: -server -Xms1g -Xmx1g -XX:+AlwaysPreTouch
>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC
>>
>>
>> Thanks
>>
>> radha
>>
>>
>>
>>