This issue happened again. Here's the summary. I'm running a three-node Ignite 2.6 cluster with this config:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
        http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans.xsd">
    <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="segmentationPolicy" value="RESTART_JVM"/>
        <property name="peerClassLoadingEnabled" value="true"/>
        <property name="failureDetectionTimeout" value="60000"/>
        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                <property name="storagePath" value="/data/ignite/persistence"/>
                <property name="walPath" value="/wal"/>
                <property name="walArchivePath" value="/wal/archive"/>
                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="name" value="default_Region"/>
                        <property name="initialSize" value="#{100L * 1024 * 1024 * 1024}"/>
                        <property name="maxSize" value="#{400L * 1024 * 1024 * 1024}"/>
                        <property name="persistenceEnabled" value="true"/>
                        <property name="checkpointPageBufferSize" value="#{8L * 1024 * 1024 * 1024}"/>
                    </bean>
                </property>
                <property name="walMode" value="BACKGROUND"/>
                <property name="walFlushFrequency" value="5000"/>
                <property name="checkpointFrequency" value="600000"/>
            </bean>
        </property>
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="localPort" value="49500"/>
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <value>node1:49500</value>
                                <value>node2:49500</value>
                                <value>node3:49500</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
        <property name="gridLogger">
            <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
                <constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
            </bean>
        </property>
    </bean>
</beans>

I have a few caches set up with TTL and with persistence enabled. I mention this because I checked this thread
http://apache-ignite-users.70518.x6.nabble.com/And-again-Failed-to-get-page-IO-instance-page-content-is-corrupted-td20095.html#a22037
and a few tickets mentioned in it:
https://issues.apache.org/jira/browse/IGNITE-8659
https://issues.apache.org/jira/browse/IGNITE-5874
I ignored the other issues because they are already fixed in 2.6.

Node1 went down because of a long GC pause. When I tried to restart the Ignite service on node1, I got the "Still waiting for initial partition map exchange" warning, which kept repeating for more than 2 hours.

[WARN ][main][GridCachePartitionExchangeManager] Still waiting for initial partition map exchange [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=9d66b750-09a3-4f0e-afa9-7cf24847ee6a, addrs=[10.252.4.60, 127.0.0.1], sockAddrs=[rpsj1ign001.webex.com/10.252.4.60:49500, /127.0.0.1:49500], discPort=49500, order=11813, intOrder=5909, lastExchangeTime=1543451981558, loc=true, ver=2.6.0#20180709-sha1:5faffcee, isClient=false], topVer=11813, nodeId8=9d66b750, msg=null, type=NODE_JOINED, tstamp=1543451943071], crd=TcpDiscoveryNode [id=f14c8e36-9a20-4668-b52e-0de64c743700, addrs=[10.252.10.20, 127.0.0.1], sockAddrs=[rpsj1ign003.webex.com/10.252.10.20:49500, /127.0.0.1:49500], discPort=49500, order=2310, intOrder=1158, lastExchangeTime=1543451942304, loc=false, ver=2.6.0#20180709-sha1:5faffcee, isClient=false], exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=11813, minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=9d66b750-09a3-4f0e-afa9-7cf24847ee6a, addrs=[10.252.4.60, 127.0.0.1], sockAddrs=[rpsj1ign001.webex.com/10.252.4.60:49500, /127.0.0.1:49500], discPort=49500, order=11813, intOrder=5909, lastExchangeTime=1543451981558, loc=true, ver=2.6.0#20180709-sha1:5faffcee, isClient=false], topVer=11813, nodeId8=9d66b750, msg=null, type=NODE_JOINED, tstamp=1543451943071], nodeId=9d66b750, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=830022440], init=false, lastVer=null, partReleaseFut=PartitionReleaseFuture [topVer=AffinityTopologyVersion [topVer=11813, minorTopVer=0], futures=[ExplicitLockReleaseFuture [topVer=AffinityTopologyVersion [topVer=11813, minorTopVer=0], futures=[]], AtomicUpdateReleaseFuture [topVer=AffinityTopologyVersion [topVer=11813, minorTopVer=0], futures=[]], DataStreamerReleaseFuture [topVer=AffinityTopologyVersion [topVer=11813, minorTopVer=0], futures=[]], LocalTxReleaseFuture [topVer=AffinityTopologyVersion [topVer=11813, minorTopVer=0], futures=[]], AllTxReleaseFuture [topVer=AffinityTopologyVersion [topVer=11813, minorTopVer=0], futures=[RemoteTxReleaseFuture [topVer=AffinityTopologyVersion [topVer=11813, minorTopVer=0], futures=[]]]]]], exchActions=ExchangeActions [startCaches=null, stopCaches=null, startGrps=[], stopGrps=[], resetParts=null, stateChangeRequest=null], affChangeMsg=null, initTs=1543451943112, centralizedAff=false, forceAffReassignment=false, changeGlobalStateE=null, done=false, state=SRV, evtLatch=0, remaining=[0126e998-0c18-452f-8f3b-b6dd4b2ae84c, f14c8e36-9a20-4668-b52e-0de64c743700], super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=773110813]]]

So I tried to restart the Ignite service on node2 and node3. Only node2 managed to join the cluster; node3 printed "Still waiting for initial partition map exchange" for more than 30 minutes.

So I stopped all three nodes and restarted the Ignite service on them. Then I got "Failed to get page IO instance (page content is corrupted)" on node1.

[ERROR][exchange-worker-#162][] Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=CRITICAL_ERROR, err=java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)]]
java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)
    at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forVersion(IOVersions.java:83) ~[ignite-core-2.6.0.jar:2.6.0]
    at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forPage(IOVersions.java:95) ~[ignite-core-2.6.0.jar:2.6.0]
    at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.init(PagesList.java:175) ~[ignite-core-2.6.0.jar:2.6.0]
    at org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.<init>(AbstractFreeList.java:370) ~[ignite-core-2.6.0.jar:2.6.0]
    at org.apache.ignite.internal.processors.cache.persistence.freelist.CacheFreeListImpl.<init>(CacheFreeListImpl.java:47) ~[ignite-core-2.6.0.jar:2.6.0]
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore$1.<init>(GridCacheOffheapManager.java:1203) ~[ignite-core-2.6.0.jar:2.6.0]
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.init0(GridCacheOffheapManager.java:1203) [ignite-core-2.6.0.jar:2.6.0]
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.updateCounter(GridCacheOffheapManager.java:1420) [ignite-core-2.6.0.jar:2.6.0]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.updateCounter(GridDhtLocalPartition.java:942) [ignite-core-2.6.0.jar:2.6.0]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.<init>(GridDhtLocalPartition.java:222) [ignite-core-2.6.0.jar:2.6.0]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.getOrCreatePartition(GridDhtPartitionTopologyImpl.java:812) [ignite-core-2.6.0.jar:2.6.0]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.initPartitions(GridDhtPartitionTopologyImpl.java:368) [ignite-core-2.6.0.jar:2.6.0]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.beforeExchange(GridDhtPartitionTopologyImpl.java:543) [ignite-core-2.6.0.jar:2.6.0]
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1141) [ignite-core-2.6.0.jar:2.6.0]
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:712) [ignite-core-2.6.0.jar:2.6.0]
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2419) [ignite-core-2.6.0.jar:2.6.0]
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2299) [ignite-core-2.6.0.jar:2.6.0]
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) [ignite-core-2.6.0.jar:2.6.0]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2018-11-29T03:53:25,629][ERROR][exchange-worker-#162][] JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=CRITICAL_ERROR, err=java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)]]

Here are the full log files:
node1.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/node1.zip>
node2.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/node2.zip>
node3.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/node3.zip>

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
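
P.S. A rough sketch for reading the long "Still waiting for initial partition map exchange" warning quoted in this post. It pulls out the coordinator ID (the `crd=` field) and the node IDs in the `remaining=[...]` field, which, as far as I understand, are the nodes the exchange future has not yet heard back from. The function names `pending_nodes` and `coordinator_id` are made up for illustration, and the regular expressions assume only the log format shown in this post; other Ignite versions may print the future differently.

```python
import re

# Assumption: the exchange future is dumped with a "remaining=[id1, id2]" field
# and a "crd=TcpDiscoveryNode [id=...]" field, as in the warning quoted in this post.
REMAINING_RE = re.compile(r"remaining=\[([^\]]*)\]")
CRD_RE = re.compile(r"crd=TcpDiscoveryNode \[id=([0-9a-f-]+)")


def pending_nodes(log_line):
    """Return the node IDs listed in remaining=[...], or [] if the field is absent."""
    match = REMAINING_RE.search(log_line)
    if match is None:
        return []
    return [part.strip() for part in match.group(1).split(",") if part.strip()]


def coordinator_id(log_line):
    """Return the coordinator node ID from the crd= field, or None if absent."""
    match = CRD_RE.search(log_line)
    return match.group(1) if match else None


# Shortened copy of the warning from this post (only the relevant fields kept).
sample = (
    "Still waiting for initial partition map exchange [fut=GridDhtPartitionsExchangeFuture ["
    "crd=TcpDiscoveryNode [id=f14c8e36-9a20-4668-b52e-0de64c743700, "
    "addrs=[10.252.10.20, 127.0.0.1]], done=false, state=SRV, evtLatch=0, "
    "remaining=[0126e998-0c18-452f-8f3b-b6dd4b2ae84c, "
    "f14c8e36-9a20-4668-b52e-0de64c743700], super=GridFutureAdapter []]]"
)

if __name__ == "__main__":
    print("coordinator:", coordinator_id(sample))
    for node_id in pending_nodes(sample):
        print("still waiting on:", node_id)
```

For the warning in this post, this shows that the joining node was still waiting on both other nodes, one of which is the coordinator.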
