I see a node in the topology flapping up and down every minute in restart1.log:
TcpDiscoveryNode [id=d6e52510-3380-4258-8a8e-798640b1786c, addrs=[10.29.42.49, 127.0.0.1], sockAddrs=[/10.29.42.49:47500, /127.0.0.1:47500], discPort=47500, order=596, intOrder=302, lastExchangeTime=1537154393454, loc=false, ver=2.6.0#20180710-sha1:669feacc, isClient=false]
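One way to narrow this down is to scan the logs for TcpDiscoveryNode entries and group the node IDs by address. The following is only a rough sketch: it assumes the entries look like the one shown above, and the names `NODE_RE` and `nodes_by_address` are made up for illustration, not part of Ignite.

```python
import re
import sys
from collections import defaultdict

# Matches the TcpDiscoveryNode dump format seen in the log excerpt
# (assumption: other entries in the logs follow the same layout).
NODE_RE = re.compile(
    r"TcpDiscoveryNode \[id=(?P<id>[0-9a-f-]+), "
    r"addrs=\[(?P<addrs>[^\]]*)\].*?"
    r"discPort=(?P<port>\d+).*?"
    r"isClient=(?P<client>true|false)"
)


def nodes_by_address(lines):
    """Map each non-loopback address to the set of
    (node id, discovery port, is client) tuples seen for it."""
    seen = defaultdict(set)
    for line in lines:
        for m in NODE_RE.finditer(line):
            entry = (m.group("id"), int(m.group("port")), m.group("client") == "true")
            for addr in m.group("addrs").split(", "):
                # Skip IPv4 and IPv6 loopback addresses.
                if addr.startswith("127.") or addr.startswith("0:0:0:0:0:0:0:1"):
                    continue
                seen[addr].add(entry)
    return dict(seen)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_nodes.py restart1.log
    with open(sys.argv[1], errors="replace") as log:
        for addr, nodes in sorted(nodes_by_address(log).items()):
            print(addr)
            for node_id, port, is_client in sorted(nodes):
                print(f"  id={node_id} discPort={port} isClient={is_client}")
```

An address that shows up with many different node IDs is a process that keeps restarting and rejoining. The entry above also reports discPort=47500, which is the Ignite default, while the quoted configuration sets localPort to 49500, so it may be worth checking whether that node was started with a different configuration.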
In other logs I also see different node IDs. Try to find out which node at 10.29.42.49 is connecting to your cluster.

In the last attempt you are getting page corruption, which is not supposed to happen under any circumstances, but perhaps this is a bug that has already been fixed in the latest versions.

Stan

From: Ray
Sent: September 17, 2018, 13:10
To: user@ignite.apache.org
Subject: Failed to get page IO instance (page content is corrupted) after onenode failed when trying to reboot.

I have a three-node Ignite 2.6.0 cluster with native persistence enabled. Here's the config:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
       http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans.xsd">
    <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="segmentationPolicy" value="RESTART_JVM"/>
        <property name="peerClassLoadingEnabled" value="true"/>
        <property name="failureDetectionTimeout" value="60000"/>
        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                <property name="storagePath" value="/data/ignite/persistence"/>
                <property name="walPath" value="/wal"/>
                <property name="walArchivePath" value="/wal/archive"/>
                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="name" value="default_Region"/>
                        <property name="initialSize" value="#{100L * 1024 * 1024 * 1024}"/>
                        <property name="maxSize" value="#{460L * 1024 * 1024 * 1024}"/>
                        <property name="persistenceEnabled" value="true"/>
                        <property name="checkpointPageBufferSize" value="#{8L * 1024 * 1024 * 1024}"/>
                    </bean>
                </property>
                <property name="walMode" value="BACKGROUND"/>
                <property name="walFlushFrequency" value="5000"/>
                <property name="checkpointFrequency" value="600000"/>
            </bean>
        </property>
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="localPort" value="49500"/>
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <value>node1:49500</value>
                                <value>node2:49500</value>
                                <value>node3:49500</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
        <property name="gridLogger">
            <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
                <constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
            </bean>
        </property>
    </bean>
</beans>

One node failed while processing a long-running SQL query; the detailed log is in the attachment nodefail.log. The other two nodes may have received new data while that node was down.

When I tried to restart this server several hours later, it first got stuck for an hour with a "Failed to wait for partition map exchange" exception. The detailed log is in the attachment restart1.log.

So I tried to restart this server again, but got an "org.apache.ignite.spi.IgniteSpiException: Node with set up BaselineTopology is not allowed to join cluster without one:" exception. The detailed log is in the attachment restart2.log.

So I tried to restart the whole cluster, starting the failed node first, but I got an "Unable to await partitions release latch within timeout:" exception once the two other servers were started. The detailed log is in the attachment restart3.log.

So I tried to restart this server again, but I got a "Failed to get page IO instance (page content is corrupted)" exception. The detailed log is in the attachment restart4.log.

From this point on, the cluster is in a non-recoverable state. Please advise me how to avoid this situation and how to recover the data.

The log of the failed node is in log.zip. The other two files are the logs of the two good nodes.
log.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/log.zip>
log-goodNode.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/log-goodNode.zip>
log-GoodNode2.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/log-GoodNode2.zip>

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/