I have a three-node Ignite 2.6 cluster set up with the following config.

    <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="segmentationPolicy" value="RESTART_JVM"/>
        <property name="peerClassLoadingEnabled" value="true"/>
        <property name="failureDetectionTimeout" value="60000"/>
        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                <property name="storagePath" value="/data/ignite/persistence"/>
                <property name="walPath" value="/wal"/>
                <property name="walArchivePath" value="/wal/archive"/>
                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="name" value="default_Region"/>
                        <property name="initialSize" value="#{100L * 1024 * 1024 * 1024}"/>
                        <property name="maxSize" value="#{460L * 1024 * 1024 * 1024}"/>
                        <property name="persistenceEnabled" value="true"/>
                        <property name="checkpointPageBufferSize" value="#{8L * 1024 * 1024 * 1024}"/>
                    </bean>
                </property>
                <property name="walMode" value="BACKGROUND"/>
                <property name="walFlushFrequency" value="5000"/>
                <property name="checkpointFrequency" value="600000"/>
            </bean>
        </property>
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="localPort" value="49500"/>
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <value>node1:49500</value>
                                <value>node2:49500</value>
                                <value>node3:49500</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
        <property name="gridLogger">
            <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
                <constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
            </bean>
        </property>
    </bean>
    </beans>

And I used this command to start the Ignite service on all three nodes.

    ./ignite.sh -J-Xmx32000m -J-Xms32000m -J-XX:+UseG1GC \
        -J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC -J-XX:+AlwaysPreTouch \
        -J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDateStamps \
        -J-XX:+PrintAdaptiveSizePolicy -XX:+PrintGCApplicationStoppedTime \
        -XX:+PrintGCApplicationConcurrentTime \
        -J-Xloggc:/spare/ignite/log/ignitegc-$(date +%Y_%m_%d-%H_%M).log \
        config/persistent-config.xml

While I was using the Spark DataFrame API to ingest data into this cluster, node2 failed.
The logic of the Spark job is quite simple: it extracts data from HDFS and ingests it into Ignite day by day.
Each Spark job represents one day's data and usually takes less than 10 minutes to complete.
From the following Spark job web management page, we can see that Job Id 282 took more than 8 hours.
spark.png
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/spark.png>  

So I suspect the Ignite cluster started malfunctioning around 2018/07/23 16:25:52.

From line 8619 in ignite-f7266ac7-1-2018-07-28.log, at 2018-07-23T16:26:04, it looks like there was some kind of network issue between node2 and the Spark client, because there are many log entries saying "No route to host", etc.
node2-log.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/node2-log.zip>  
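To rule out basic connectivity problems between node2 and the Spark client hosts, a quick reachability probe like the sketch below can help (the host names and the discovery port 49500 from my config are just the values for my setup; substitute your own):

```python
import socket

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    "No route to host", DNS failures, timeouts, and refused connections
    all surface as OSError subclasses and are reported as unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Probe the discovery port on each server node (host names are placeholders).
for host in ("node1", "node2", "node3"):
    print(host, is_reachable(host, 49500))
```

Running this from the Spark client side and from each server node in both directions would show whether the "No route to host" errors are transient or a persistent routing problem.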

Node2 was still connected to the cluster at that time, but some time later, at 2018-07-24T00:50:14,941, node2 printed its last metrics log entry and failed.
I can't see any obvious reason why node2 failed, because node2 only prints metrics entries in ignite-f7266ac7.log and the GC times look normal.

So I also checked the logs on node1 and node3 around the time node2 failed. The log snippet from node3 is:
  14213 [2018-07-24T00:51:16,622][INFO ][tcp-disco-sock-reader-#4][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.252.10.4:59144, rmtPort=59144
  14214 [2018-07-24T00:51:16,623][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/10.252.4.60, rmtPort=23056]
  14215 [2018-07-24T00:51:16,623][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/10.252.4.60, rmtPort=23056]
  14216 [2018-07-24T00:51:16,624][INFO ][tcp-disco-sock-reader-#44][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/10.252.4.60:23056, rmtPort=23056]
  14217 [2018-07-24T00:51:16,626][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/10.29.42.45, rmtPort=45605]
  14218 [2018-07-24T00:51:16,626][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/10.29.42.45, rmtPort=45605]
  14219 [2018-07-24T00:51:16,626][INFO ][tcp-disco-sock-reader-#45][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/10.29.42.45:45605, rmtPort=45605]
  14220 [2018-07-24T00:51:16,630][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/10.29.42.48, rmtPort=57089]
  14221 [2018-07-24T00:51:16,630][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/10.29.42.48, rmtPort=57089]
  14222 [2018-07-24T00:51:16,630][INFO ][tcp-disco-sock-reader-#47][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/10.29.42.48:57089, rmtPort=57089]
  14223 [2018-07-24T00:51:16,666][WARN ][disco-event-worker-#161][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=f7266ac7-409a-454f-b601-b077b15594b3, addrs=[10.252.10.4, 127.0.0.1], sockAddrs=[rpsj1ign002.webex.com/10.252.10.4:49500, /127.0.0.1:49500], discPort=49500, order=2, intOrder=2, lastExchangeTime=1532323119618, loc=false, ver=2.6.0#20180710-sha1:669feacc, isClient=false]
  14224 [2018-07-24T00:51:16,667][INFO ][disco-event-worker-#161][GridDiscoveryManager] Topology snapshot [ver=90, servers=2, clients=8, CPUs=392, offheap=920.0GB, heap=170.0GB]
  14225 [2018-07-24T00:51:16,667][INFO ][disco-event-worker-#161][GridDiscoveryManager]   ^-- Node [id=BF7076FB-7C87-4AEE-AC8B-D3DB0AC8A9E2, clusterState=ACTIVE]
  14226 [2018-07-24T00:51:16,667][INFO ][disco-event-worker-#161][GridDiscoveryManager]   ^-- Baseline [id=0, size=3, online=2, offline=1]

and the log snippet from node1 is:
[2018-07-24T00:51:16,617][WARN ][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Failed to send message to next node [msg=TcpDiscoveryMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=bf4e1a5c461-0712d9d6-49a3-42df-a8ee-5512764cb1a9, verifierNodeId=0712d9d6-49a3-42df-a8ee-5512764cb1a9, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=TcpDiscoveryNode [id=f7266ac7-409a-454f-b601-b077b15594b3, addrs=[10.252.10.4, 127.0.0.1], sockAddrs=[rpsj1ign002.webex.com/10.252.10.4:49500, /127.0.0.1:49500], discPort=49500, order=2, intOrder=2, lastExchangeTime=1532323112326, loc=false, ver=2.6.0#20180710-sha1:669feacc, isClient=false], errMsg=Failed to send message to next node [msg=TcpDiscoveryMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=bf4e1a5c461-0712d9d6-49a3-42df-a8ee-5512764cb1a9, verifierNodeId=0712d9d6-49a3-42df-a8ee-5512764cb1a9, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=ClusterNode [id=f7266ac7-409a-454f-b601-b077b15594b3, order=2, addr=[10.252.10.4, 127.0.0.1], daemon=false]]]
[2018-07-24T00:51:16,627][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/10.29.42.46, rmtPort=58968]
[2018-07-24T00:51:16,627][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/10.29.42.46, rmtPort=58968]
[2018-07-24T00:51:16,627][INFO ][tcp-disco-sock-reader-#52][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/10.29.42.46:58968, rmtPort=58968]
[2018-07-24T00:51:16,647][WARN ][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi'
[2018-07-24T00:51:16,648][WARN ][disco-event-worker-#161][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=f7266ac7-409a-454f-b601-b077b15594b3, addrs=[10.252.10.4, 127.0.0.1], sockAddrs=[rpsj1ign002.webex.com/10.252.10.4:49500, /127.0.0.1:49500], discPort=49500, order=2, intOrder=2, lastExchangeTime=1532323112326, loc=false, ver=2.6.0#20180710-sha1:669feacc, isClient=false]
[2018-07-24T00:51:16,649][INFO ][disco-event-worker-#161][GridDiscoveryManager] Topology snapshot [ver=90, servers=2, clients=8, CPUs=392, offheap=920.0GB, heap=170.0GB]
[2018-07-24T00:51:16,649][INFO ][disco-event-worker-#161][GridDiscoveryManager]   ^-- Node [id=0712D9D6-49A3-42DF-A8EE-5512764CB1A9, clusterState=ACTIVE]
[2018-07-24T00:51:16,649][INFO ][disco-event-worker-#161][GridDiscoveryManager]   ^-- Baseline [id=0, size=3, online=2, offline=1]

I've attached the full logs for all three nodes and the GC log for node2.
ignite-node1.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/ignite-node1.log>  
ignite-node3.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/ignite-node3.zip>  

I also noticed there's no GC activity on node2 between 2018-07-23T19:28:03.401 and 2018-07-24T00:51:51.
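I found that gap by scanning the GC log for long stretches with no entries; a rough sketch of that check (assuming the ISO-8601 timestamp prefix that `-XX:+PrintGCDateStamps` produces, e.g. `2018-07-23T19:28:03.401+0000: ...`) looks like this:

```python
import re
from datetime import datetime

# -XX:+PrintGCDateStamps prefixes each GC event with an ISO-8601 timestamp,
# e.g. "2018-07-23T19:28:03.401+0000: 12345.678: [GC pause ...]".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})")

def gc_gaps(lines, threshold_s=300):
    """Yield (start, end, gap_seconds) for consecutive GC log entries
    that are further apart than threshold_s seconds."""
    prev = None
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # skip continuation lines with no timestamp prefix
        ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > threshold_s:
                yield prev, ts, gap
        prev = ts

# Usage (path is an example):
# gaps = list(gc_gaps(open("/spare/ignite/log/ignitegc-2018_07_23.log")))
```

On node2's GC log this reports the single ~5.4-hour silent window noted above, which is what made me suspicious that the JVM was doing no work at all rather than stuck in GC.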
Can anybody help me understand why node2 failed?

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/