Failed to get page IO instance (page content is corrupted) after one node failed when trying to reboot.

Ray Mon, 17 Sep 2018 03:11:27 -0700

I have a three-node Ignite 2.6.0 cluster with native persistence enabled.
Here's the config:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
       http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans.xsd">

    <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="segmentationPolicy" value="RESTART_JVM"/>
        <property name="peerClassLoadingEnabled" value="true"/>
        <property name="failureDetectionTimeout" value="60000"/>
        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                <property name="storagePath" value="/data/ignite/persistence"/>
                <property name="walPath" value="/wal"/>
                <property name="walArchivePath" value="/wal/archive"/>
                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="name" value="default_Region"/>
                        <property name="initialSize" value="#{100L * 1024 * 1024 * 1024}"/>
                        <property name="maxSize" value="#{460L * 1024 * 1024 * 1024}"/>
                        <property name="persistenceEnabled" value="true"/>
                        <property name="checkpointPageBufferSize" value="#{8L * 1024 * 1024 * 1024}"/>
                    </bean>
                </property>
                <property name="walMode" value="BACKGROUND"/>
                <property name="walFlushFrequency" value="5000"/>
                <property name="checkpointFrequency" value="600000"/>
            </bean>
        </property>
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="localPort" value="49500"/>
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <value>node1:49500</value>
                                <value>node2:49500</value>
                                <value>node3:49500</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
        <property name="gridLogger">
            <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
                <constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
            </bean>
        </property>
    </bean>
</beans>
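
To make the sizes in the config easier to read, here is a minimal sketch that evaluates the same arithmetic as the three SpEL expressions (initialSize, maxSize, checkpointPageBufferSize). The class RegionSizes and the method gib are hypothetical names used only for illustration; they are not part of Ignite or Spring.

```java
// Hypothetical helper, not part of Ignite: reproduces the arithmetic of the
// SpEL size expressions used in the data region configuration above.
class RegionSizes {
    // One GiB in bytes; the L suffix keeps the multiplication in 64-bit long.
    static final long GIB = 1024L * 1024 * 1024;

    // Returns the number of bytes in n GiB.
    static long gib(long n) {
        return n * GIB;
    }

    public static void main(String[] args) {
        System.out.println("initialSize              = " + gib(100) + " bytes (100 GiB)");
        System.out.println("maxSize                  = " + gib(460) + " bytes (460 GiB)");
        System.out.println("checkpointPageBufferSize = " + gib(8) + " bytes (8 GiB)");
    }
}
```

The remaining numeric values in the config are plain milliseconds: failureDetectionTimeout is 60 seconds, walFlushFrequency is 5 seconds, and checkpointFrequency is 10 minutes.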

One node failed while processing a long-running SQL query; the detailed log
can be found in the attachment as nodefail.log.

The other two nodes may have received new data while that node was in the
failed state.
When I tried to restart this server several hours later, it first got stuck
for one hour with a "Failed to wait for partition map exchange" exception.
The detailed log can be found in the attachment as restart1.log.

So I restarted this server again, but got an
"org.apache.ignite.spi.IgniteSpiException: Node with set up BaselineTopology
is not allowed to join cluster without one:" exception.
The detailed log can be found in the attachment as restart2.log.

Next I restarted the whole cluster, starting the failed node first, but I
got an "Unable to await partitions release latch within timeout:" exception
once the other two servers were started.
The detailed log can be found in the attachment as restart3.log.

I then restarted this server once more, but got a "Failed to get page IO
instance (page content is corrupted)" exception.
The detailed log can be found in the attachment as restart4.log.

From this point on, the cluster has been in a non-recoverable state.
Please advise me on how to avoid this situation and how to recover the data.
The logs of the failed node are in log.zip.
The other two files contain the logs of the two good nodes.

log.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/log.zip>  
log-goodNode.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/log-goodNode.zip>  
log-GoodNode2.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/log-GoodNode2.zip>  






