Hi,

Are you sure that you have full connectivity between all nodes, server and client nodes alike? There are a lot of different exceptions like "err=No route to host" in the log, which definitely points to network problems.
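To rule that out, you can probe the ports directly from every server and client host. Below is a minimal sketch (plain JDK, no Ignite dependency); the hostnames and the discovery port 49500 are taken from your config, while 47100 is an assumption on my side: it is Ignite's default TcpCommunicationSpi port and your config does not override it.

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class ConnectivityProbe {
        public static void main(String[] args) {
            // Hostnames and discovery port from the posted config;
            // 47100 is the assumed default communication port.
            String[] hosts = {"node1", "node2", "node3"};
            int[] ports = {49500, 47100};
            for (String host : hosts) {
                for (int port : ports) {
                    try (Socket sock = new Socket()) {
                        // Fail fast instead of hanging on an unreachable host.
                        sock.connect(new InetSocketAddress(host, port), 3000);
                        System.out.println(host + ":" + port + " is reachable");
                    } catch (Exception e) {
                        // "No route to host" here matches the exceptions in your log.
                        System.out.println(host + ":" + port + " FAILED: " + e);
                    }
                }
            }
        }
    }

If this reports "No route to host" for any host/port pair, the problem is on the network side rather than in Ignite.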
At the same time, if you think it was just a short outage and connectivity was restored later, then it's possible that this fix could help you: https://issues.apache.org/jira/browse/IGNITE-8739

Regards,
Evgenii

2018-07-24 10:30 GMT+03:00 Ray <[email protected]>:

> I have a three-node Ignite 2.6 cluster set up with the following config.
>
> <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
>     <property name="segmentationPolicy" value="RESTART_JVM"/>
>     <property name="peerClassLoadingEnabled" value="true"/>
>     <property name="failureDetectionTimeout" value="60000"/>
>     <property name="dataStorageConfiguration">
>         <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>             <property name="storagePath" value="/data/ignite/persistence"/>
>             <property name="walPath" value="/wal"/>
>             <property name="walArchivePath" value="/wal/archive"/>
>             <property name="defaultDataRegionConfiguration">
>                 <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>                     <property name="name" value="default_Region"/>
>                     <property name="initialSize" value="#{100L * 1024 * 1024 * 1024}"/>
>                     <property name="maxSize" value="#{460L * 1024 * 1024 * 1024}"/>
>                     <property name="persistenceEnabled" value="true"/>
>                     <property name="checkpointPageBufferSize" value="#{8L * 1024 * 1024 * 1024}"/>
>                 </bean>
>             </property>
>             <property name="walMode" value="BACKGROUND"/>
>             <property name="walFlushFrequency" value="5000"/>
>             <property name="checkpointFrequency" value="600000"/>
>         </bean>
>     </property>
>     <property name="discoverySpi">
>         <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
>             <property name="localPort" value="49500"/>
>             <property name="ipFinder">
>                 <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
>                     <property name="addresses">
>                         <list>
>                             <value>node1:49500</value>
>                             <value>node2:49500</value>
>                             <value>node3:49500</value>
>                         </list>
>                     </property>
>                 </bean>
>             </property>
>         </bean>
>     </property>
>     <property name="gridLogger">
>         <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
>             <constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
>         </bean>
>     </property>
> </bean>
> </beans>
>
> And I used this command to start the Ignite service on the three nodes.
>
> ./ignite.sh -J-Xmx32000m -J-Xms32000m -J-XX:+UseG1GC
> -J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC -J-XX:+AlwaysPreTouch
> -J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDateStamps
> -J-XX:+PrintAdaptiveSizePolicy -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintGCApplicationConcurrentTime
> -J-Xloggc:/spare/ignite/log/ignitegc-$(date +%Y_%m_%d-%H_%M).log
> config/persistent-config.xml
>
> While I was using the Spark DataFrame API to ingest data into this cluster, node2 failed.
> The logic of this Spark job is rather simple: it just extracts data from
> HDFS and ingests it into Ignite day by day.
> Each Spark job represents one day's data and usually takes less than 10
> minutes to complete.
> From the following Spark job web management page, we can see that Job Id
> 282 took more than 8 hours.
> spark.png
> <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/spark.png>
>
> So I suspect the Ignite cluster started malfunctioning around 2018/07/23 16:25:52.
>
> From line 8619 in ignite-f7266ac7-1-2018-07-28.log at 2018-07-23T16:26:04,
> it looks like there is some kind of network issue between node2 and the Spark
> client, because there are lots of log lines saying "No route to host" etc.
> node2-log.zip
> <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/node2-log.zip>
>
> Node2 was still connected to the cluster at that time, but some time later, at
> 2018-07-24T00:50:14,941, node2 printed its last metrics log entry and failed.
> I can't see any obvious reason why node2 failed, because node2 only prints
> metrics log entries in ignite-f7266ac7.log and the GC times are normal.
>
> So I also checked the logs on node1 and node3 around the time node2 failed. The
> log snippet for node3 is:
>
> 14213 [2018-07-24T00:51:16,622][INFO ][tcp-disco-sock-reader-#4][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.252.10.4:59144, rmtPort=59144]
> 14214 [2018-07-24T00:51:16,623][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/10.252.4.60, rmtPort=23056]
> 14215 [2018-07-24T00:51:16,623][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/10.252.4.60, rmtPort=23056]
> 14216 [2018-07-24T00:51:16,624][INFO ][tcp-disco-sock-reader-#44][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/10.252.4.60:23056, rmtPort=23056]
> 14217 [2018-07-24T00:51:16,626][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/10.29.42.45, rmtPort=45605]
> 14218 [2018-07-24T00:51:16,626][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/10.29.42.45, rmtPort=45605]
> 14219 [2018-07-24T00:51:16,626][INFO ][tcp-disco-sock-reader-#45][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/10.29.42.45:45605, rmtPort=45605]
> 14220 [2018-07-24T00:51:16,630][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/10.29.42.48, rmtPort=57089]
> 14221 [2018-07-24T00:51:16,630][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/10.29.42.48, rmtPort=57089]
> 14222 [2018-07-24T00:51:16,630][INFO ][tcp-disco-sock-reader-#47][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/10.29.42.48:57089, rmtPort=57089]
> 14223 [2018-07-24T00:51:16,666][WARN ][disco-event-worker-#161][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=f7266ac7-409a-454f-b601-b077b15594b3, addrs=[10.252.10.4, 127.0.0.1], sockAddrs=[rpsj1ign002.webex.com/10.252.10.4:49500, /127.0.0.1:49500], discPort=49500, order=2, intOrder=2, lastExchangeTime=1532323119618, loc=false, ver=2.6.0#20180710-sha1:669feacc, isClient=false]
> 14224 [2018-07-24T00:51:16,667][INFO ][disco-event-worker-#161][GridDiscoveryManager] Topology snapshot [ver=90, servers=2, clients=8, CPUs=392, offheap=920.0GB, heap=170.0GB]
> 14225 [2018-07-24T00:51:16,667][INFO ][disco-event-worker-#161][GridDiscoveryManager] ^-- Node [id=BF7076FB-7C87-4AEE-AC8B-D3DB0AC8A9E2, clusterState=ACTIVE]
> 14226 [2018-07-24T00:51:16,667][INFO ][disco-event-worker-#161][GridDiscoveryManager] ^-- Baseline [id=0, size=3, online=2, offline=1]
>
> and the log snippet for node1 is:
>
> [2018-07-24T00:51:16,617][WARN ][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Failed to send message to next node [msg=TcpDiscoveryMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=bf4e1a5c461-0712d9d6-49a3-42df-a8ee-5512764cb1a9, verifierNodeId=0712d9d6-49a3-42df-a8ee-5512764cb1a9, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=TcpDiscoveryNode [id=f7266ac7-409a-454f-b601-b077b15594b3, addrs=[10.252.10.4, 127.0.0.1], sockAddrs=[rpsj1ign002.webex.com/10.252.10.4:49500, /127.0.0.1:49500], discPort=49500, order=2, intOrder=2, lastExchangeTime=1532323112326, loc=false, ver=2.6.0#20180710-sha1:669feacc, isClient=false], errMsg=Failed to send message to next node [msg=TcpDiscoveryMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=bf4e1a5c461-0712d9d6-49a3-42df-a8ee-5512764cb1a9, verifierNodeId=0712d9d6-49a3-42df-a8ee-5512764cb1a9, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=ClusterNode [id=f7266ac7-409a-454f-b601-b077b15594b3, order=2, addr=[10.252.10.4, 127.0.0.1], daemon=false]]]
> [2018-07-24T00:51:16,627][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/10.29.42.46, rmtPort=58968]
> [2018-07-24T00:51:16,627][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/10.29.42.46, rmtPort=58968]
> [2018-07-24T00:51:16,627][INFO ][tcp-disco-sock-reader-#52][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/10.29.42.46:58968, rmtPort=58968]
> [2018-07-24T00:51:16,647][WARN ][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi'
> [2018-07-24T00:51:16,648][WARN ][disco-event-worker-#161][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=f7266ac7-409a-454f-b601-b077b15594b3, addrs=[10.252.10.4, 127.0.0.1], sockAddrs=[rpsj1ign002.webex.com/10.252.10.4:49500, /127.0.0.1:49500], discPort=49500, order=2, intOrder=2, lastExchangeTime=1532323112326, loc=false, ver=2.6.0#20180710-sha1:669feacc, isClient=false]
> [2018-07-24T00:51:16,649][INFO ][disco-event-worker-#161][GridDiscoveryManager] Topology snapshot [ver=90, servers=2, clients=8, CPUs=392, offheap=920.0GB, heap=170.0GB]
> [2018-07-24T00:51:16,649][INFO ][disco-event-worker-#161][GridDiscoveryManager] ^-- Node [id=0712D9D6-49A3-42DF-A8EE-5512764CB1A9, clusterState=ACTIVE]
> [2018-07-24T00:51:16,649][INFO ][disco-event-worker-#161][GridDiscoveryManager] ^-- Baseline [id=0, size=3, online=2, offline=1]
>
> I've attached the full logs for all three nodes and the GC log for node2.
> ignite-node1.log
> <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/ignite-node1.log>
> ignite-node3.zip
> <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/ignite-node3.zip>
>
> I also noticed there's no GC activity on node2 between
> 2018-07-23T19:28:03.401 and 2018-07-24T00:51:51.
> Can anybody help me understand why node2 failed?
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
