I see a node in the topology flapping up and down every minute in
restart1.log:
 TcpDiscoveryNode [id=d6e52510-3380-4258-8a8e-798640b1786c, addrs=[10.29.42.49, 
127.0.0.1], sockAddrs=[/10.29.42.49:47500, /127.0.0.1:47500], discPort=47500, 
order=596, intOrder=302, lastExchangeTime=1537154393454, loc=false, 
ver=2.6.0#20180710-sha1:669feacc, isClient=false]

In other logs I also see different node IDs.

Try to find out which node on 10.29.42.49 is connecting to your cluster.
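
One way to do that is to collect every discovery node entry from the logs and group the node IDs by address. A minimal sketch in Python, assuming the entries in your logs look like the TcpDiscoveryNode line quoted above; `nodes_by_address` is a hypothetical helper written for this thread, not part of Ignite:

```python
import re
import sys
from collections import defaultdict

# Matches the id, addrs and discPort fields of a TcpDiscoveryNode entry.
# The field layout is assumed from the log line quoted above.
NODE_RE = re.compile(
    r"TcpDiscoveryNode \[id=(?P<id>[0-9a-f-]+), "
    r"addrs=\[(?P<addrs>[^\]]*)\].*?discPort=(?P<port>\d+)",
    re.DOTALL,
)


def nodes_by_address(log_text):
    """Map each (address, discovery port) pair to the set of node IDs seen with it."""
    seen = defaultdict(set)
    for match in NODE_RE.finditer(log_text):
        port = int(match.group("port"))
        for addr in match.group("addrs").split(","):
            seen[(addr.strip(), port)].add(match.group("id"))
    return dict(seen)


if __name__ == "__main__":
    # Usage: python find_nodes.py restart1.log restart2.log ...
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for (addr, port), ids in sorted(nodes_by_address(handle.read()).items()):
                print(path, addr, port, sorted(ids))
```

Several different IDs under the same address would suggest a process that keeps restarting and rejoining. Note also that the entry above reports discPort=47500, while the configuration quoted below sets localPort to 49500, so that node may not have been started from this config.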

In the last attempt you’re getting page corruption, which is not supposed to
happen under any circumstances,
but perhaps this is a bug that’s already fixed in the latest versions.

Stan

From: Ray
Sent: September 17, 2018, 13:10
To: user@ignite.apache.org
Subject: Failed to get page IO instance (page content is corrupted) after
one node failed when trying to reboot.

I have a three-node Ignite 2.6.0 cluster with native persistence enabled.
Here's the config:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
       http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans.xsd">

    <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="segmentationPolicy" value="RESTART_JVM"/>
        <property name="peerClassLoadingEnabled" value="true"/>
        <property name="failureDetectionTimeout" value="60000"/>
        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                <property name="storagePath" value="/data/ignite/persistence"/>
                <property name="walPath" value="/wal"/>
                <property name="walArchivePath" value="/wal/archive"/>
                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="name" value="default_Region"/>
                        <property name="initialSize" value="#{100L * 1024 * 1024 * 1024}"/>
                        <property name="maxSize" value="#{460L * 1024 * 1024 * 1024}"/>
                        <property name="persistenceEnabled" value="true"/>
                        <property name="checkpointPageBufferSize" value="#{8L * 1024 * 1024 * 1024}"/>
                    </bean>
                </property>
                <property name="walMode" value="BACKGROUND"/>
                <property name="walFlushFrequency" value="5000"/>
                <property name="checkpointFrequency" value="600000"/>
            </bean>
        </property>
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="localPort" value="49500"/>
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <value>node1:49500</value>
                                <value>node2:49500</value>
                                <value>node3:49500</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
        <property name="gridLogger">
            <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
                <constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
            </bean>
        </property>
    </bean>
</beans>

One node failed while processing a long-running SQL query; the detailed log
can be found in the attachment nodefail.log.

The other two nodes may have received new data while that node was in the
failed state.
When I tried to reboot this server after several hours, it first got stuck
for one hour with a "Failed to wait for partition map exchange" exception.
The detailed log can be found in the attachment restart1.log.

So I tried to reboot this server again, but got an
"org.apache.ignite.spi.IgniteSpiException: Node with set up BaselineTopology
is not allowed to join cluster without one:" exception.
The detailed log can be found in the attachment restart2.log.

So I tried to reboot the whole cluster, starting the failed node first, but I
got an "Unable to await partitions release latch within timeout:" exception
when the two other servers were started.
The detailed log can be found in the attachment restart3.log.

So I tried to reboot this server again, but I got a "Failed to get page IO
instance (page content is corrupted)" exception.
The detailed log can be found in the attachment restart4.log.

From this point on, the cluster is in a non-recoverable state.
Please advise me on how to avoid this situation and how to recover the data.
The log of the failed node is in log.zip.
The other two files are the logs of the two good nodes.

log.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/log.zip>  
log-goodNode.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/log-goodNode.zip>  
log-GoodNode2.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/log-GoodNode2.zip>  






--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
