RE: Failed to get page IO instance (page content is corrupted) after one node failed when trying to reboot.

Ray Wed, 28 Nov 2018 21:13:55 -0800

This issue happened again.

Here's the summary.
I'm running a three-node Ignite 2.6 cluster with the following configuration:


<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
       http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans.xsd">

    <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="segmentationPolicy" value="RESTART_JVM"/>
        <property name="peerClassLoadingEnabled" value="true"/>
        <property name="failureDetectionTimeout" value="60000"/>
        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                <property name="storagePath" value="/data/ignite/persistence"/>
                <property name="walPath" value="/wal"/>
                <property name="walArchivePath" value="/wal/archive"/>
                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="name" value="default_Region"/>
                        <property name="initialSize" value="#{100L * 1024 * 1024 * 1024}"/>
                        <property name="maxSize" value="#{400L * 1024 * 1024 * 1024}"/>
                        <property name="persistenceEnabled" value="true"/>
                        <property name="checkpointPageBufferSize" value="#{8L * 1024 * 1024 * 1024}"/>
                    </bean>
                </property>
                <property name="walMode" value="BACKGROUND"/>
                <property name="walFlushFrequency" value="5000"/>
                <property name="checkpointFrequency" value="600000"/>
            </bean>
        </property>
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="localPort" value="49500"/>
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <value>node1:49500</value>
                                <value>node2:49500</value>
                                <value>node3:49500</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
        <property name="gridLogger">
            <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
                <constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
            </bean>
        </property>
    </bean>
</beans>

I have a few caches set up with TTL and with persistence enabled.
I mention this because I checked this thread
http://apache-ignite-users.70518.x6.nabble.com/And-again-Failed-to-get-page-IO-instance-page-content-is-corrupted-td20095.html#a22037
and a few of the tickets mentioned there:
https://issues.apache.org/jira/browse/IGNITE-8659
https://issues.apache.org/jira/browse/IGNITE-5874
I ignored the other issues because they are already fixed in 2.6.
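
For illustration, a cache with TTL is typically declared along the following lines inside the IgniteConfiguration bean. This is only a sketch, not the actual cache definitions from this cluster: the cache name exampleTtlCache and the one-hour CreatedExpiryPolicy are placeholder assumptions. Persistence is not set on the cache itself; a cache declared this way lands in the default data region, which has persistenceEnabled=true in the config above.

```xml
<!-- Hypothetical example: cache name and TTL duration are assumptions -->
<property name="cacheConfiguration">
    <list>
        <bean class="org.apache.ignite.configuration.CacheConfiguration">
            <property name="name" value="exampleTtlCache"/>
            <!-- Entries expire one hour after creation -->
            <property name="expiryPolicyFactory">
                <bean class="javax.cache.expiry.CreatedExpiryPolicy" factory-method="factoryOf">
                    <constructor-arg>
                        <bean class="javax.cache.expiry.Duration">
                            <constructor-arg value="HOURS"/>
                            <constructor-arg value="1"/>
                        </bean>
                    </constructor-arg>
                </bean>
            </property>
        </bean>
    </list>
</property>
```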


Node1 went down because of a long GC pause.
When I tried to restart the Ignite service on Node1, it kept logging the
"Still waiting for initial partition map exchange" warning for more than 2 hours.
[WARN ][main][GridCachePartitionExchangeManager] Still waiting for initial
partition map exchange [fut=GridDhtPartitionsExchangeFuture
[firstDiscoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode
[id=9d66b750-09a3-4f0e-afa9-7cf24847ee6a, addrs=[10.252.4.60, 127.0.0.1],
sockAddrs=[rpsj1ign001.webex.com/10.252.4.60:49500, /127.0.0.1:49500],
discPort=49500, order=11813, intOrder=5909, lastExchangeTime=1543451981558,
loc=true, ver=2.6.0#20180709-sha1:5faffcee, isClient=false], topVer=11813,
nodeId8=9d66b750, msg=null, type=NODE_JOINED, tstamp=1543451943071],
crd=TcpDiscoveryNode [id=f14c8e36-9a20-4668-b52e-0de64c743700,
addrs=[10.252.10.20, 127.0.0.1],
sockAddrs=[rpsj1ign003.webex.com/10.252.10.20:49500, /127.0.0.1:49500],
discPort=49500, order=2310, intOrder=1158, lastExchangeTime=1543451942304,
loc=false, ver=2.6.0#20180709-sha1:5faffcee, isClient=false],
exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion
[topVer=11813, minorTopVer=0], discoEvt=DiscoveryEvent
[evtNode=TcpDiscoveryNode [id=9d66b750-09a3-4f0e-afa9-7cf24847ee6a,
addrs=[10.252.4.60, 127.0.0.1],
sockAddrs=[rpsj1ign001.webex.com/10.252.4.60:49500, /127.0.0.1:49500],
discPort=49500, order=11813, intOrder=5909, lastExchangeTime=1543451981558,
loc=true, ver=2.6.0#20180709-sha1:5faffcee, isClient=false], topVer=11813,
nodeId8=9d66b750, msg=null, type=NODE_JOINED, tstamp=1543451943071],
nodeId=9d66b750, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter
[ignoreInterrupts=false, state=INIT, res=null, hash=830022440], init=false,
lastVer=null, partReleaseFut=PartitionReleaseFuture
[topVer=AffinityTopologyVersion [topVer=11813, minorTopVer=0],
futures=[ExplicitLockReleaseFuture [topVer=AffinityTopologyVersion
[topVer=11813, minorTopVer=0], futures=[]], AtomicUpdateReleaseFuture
[topVer=AffinityTopologyVersion [topVer=11813, minorTopVer=0], futures=[]],
DataStreamerReleaseFuture [topVer=AffinityTopologyVersion [topVer=11813,
minorTopVer=0], futures=[]], LocalTxReleaseFuture
[topVer=AffinityTopologyVersion [topVer=11813, minorTopVer=0], futures=[]],
AllTxReleaseFuture [topVer=AffinityTopologyVersion [topVer=11813,
minorTopVer=0], futures=[RemoteTxReleaseFuture
[topVer=AffinityTopologyVersion [topVer=11813, minorTopVer=0],
futures=[]]]]]], exchActions=ExchangeActions [startCaches=null,
stopCaches=null, startGrps=[], stopGrps=[], resetParts=null,
stateChangeRequest=null], affChangeMsg=null, initTs=1543451943112,
centralizedAff=false, forceAffReassignment=false, changeGlobalStateE=null,
done=false, state=SRV, evtLatch=0,
remaining=[0126e998-0c18-452f-8f3b-b6dd4b2ae84c,
f14c8e36-9a20-4668-b52e-0de64c743700], super=GridFutureAdapter
[ignoreInterrupts=false, state=INIT, res=null, hash=773110813]]]

So I tried restarting the Ignite service on node2 and node3.
But only node2 managed to join the cluster; node3 printed "Still waiting for
initial partition map exchange" for more than 30 minutes.

So I stopped all three nodes and restarted the Ignite service on each of them.
Then I got "Failed to get page IO instance (page content is corrupted)" on
Node1.

[ERROR][exchange-worker-#162][] Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=CRITICAL_ERROR, err=java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)]]
java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)
        at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forVersion(IOVersions.java:83) ~[ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forPage(IOVersions.java:95) ~[ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.init(PagesList.java:175) ~[ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.<init>(AbstractFreeList.java:370) ~[ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.internal.processors.cache.persistence.freelist.CacheFreeListImpl.<init>(CacheFreeListImpl.java:47) ~[ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore$1.<init>(GridCacheOffheapManager.java:1203) ~[ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.init0(GridCacheOffheapManager.java:1203) [ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.updateCounter(GridCacheOffheapManager.java:1420) [ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.updateCounter(GridDhtLocalPartition.java:942) [ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.<init>(GridDhtLocalPartition.java:222) [ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.getOrCreatePartition(GridDhtPartitionTopologyImpl.java:812) [ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.initPartitions(GridDhtPartitionTopologyImpl.java:368) [ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.beforeExchange(GridDhtPartitionTopologyImpl.java:543) [ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1141) [ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:712) [ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2419) [ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2299) [ignite-core-2.6.0.jar:2.6.0]
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) [ignite-core-2.6.0.jar:2.6.0]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2018-11-29T03:53:25,629][ERROR][exchange-worker-#162][] JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=CRITICAL_ERROR, err=java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)]]

Here are the full log files from all three nodes:
node1.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/node1.zip>
node2.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/node2.zip>
node3.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/node3.zip>
