Hi,

Are you sure that you have full connectivity between all nodes, server and client nodes alike? There are a lot of different exceptions like "err=No route to host" in the log, which definitely points to network problems.
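To rule that out, you can probe the ports directly from every server and client host. Below is a minimal sketch (plain JDK, no Ignite dependency); the hostnames and the discovery port 49500 are taken from your config, while 47100 is an assumption on my side: it is Ignite's default TcpCommunicationSpi port and your config does not override it.

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class ConnectivityProbe {
        public static void main(String[] args) {
            // Hostnames and discovery port from the posted config;
            // 47100 is the assumed default communication port.
            String[] hosts = {"node1", "node2", "node3"};
            int[] ports = {49500, 47100};
            for (String host : hosts) {
                for (int port : ports) {
                    try (Socket sock = new Socket()) {
                        // Fail fast instead of hanging on an unreachable host.
                        sock.connect(new InetSocketAddress(host, port), 3000);
                        System.out.println(host + ":" + port + " is reachable");
                    } catch (Exception e) {
                        // "No route to host" here matches the exceptions in your log.
                        System.out.println(host + ":" + port + " FAILED: " + e);
                    }
                }
            }
        }
    }

If this reports "No route to host" for any host/port pair, the problem is on the network side rather than in Ignite.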
At the same time, if you think it was just a short outage and connectivity was restored later, then it's possible that this fix could help you: https://issues.apache.org/jira/browse/IGNITE-8739

Regards,
Evgenii

2018-07-24 10:30 GMT+03:00 Ray <[email protected]>:

> I have a three-node Ignite 2.6 cluster set up with the following config.
>
> <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
>     <property name="segmentationPolicy" value="RESTART_JVM"/>
>     <property name="peerClassLoadingEnabled" value="true"/>
>     <property name="failureDetectionTimeout" value="60000"/>
>     <property name="dataStorageConfiguration">
>         <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>             <property name="storagePath" value="/data/ignite/persistence"/>
>             <property name="walPath" value="/wal"/>
>             <property name="walArchivePath" value="/wal/archive"/>
>             <property name="defaultDataRegionConfiguration">
>                 <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>                     <property name="name" value="default_Region"/>
>                     <property name="initialSize" value="#{100L * 1024 * 1024 * 1024}"/>
>                     <property name="maxSize" value="#{460L * 1024 * 1024 * 1024}"/>
>                     <property name="persistenceEnabled" value="true"/>
>                     <property name="checkpointPageBufferSize" value="#{8L * 1024 * 1024 * 1024}"/>
>                 </bean>
>             </property>
>             <property name="walMode" value="BACKGROUND"/>
>             <property name="walFlushFrequency" value="5000"/>
>             <property name="checkpointFrequency" value="600000"/>
>         </bean>
>     </property>
>     <property name="discoverySpi">
>         <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
>             <property name="localPort" value="49500"/>
>             <property name="ipFinder">
>                 <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
>                     <property name="addresses">
>                         <list>
>                             <value>node1:49500</value>
>                             <value>node2:49500</value>
>                             <value>node3:49500</value>
>                         </list>
>                     </property>
>                 </bean>
>             </property>
>         </bean>
>     </property>
>     <property name="gridLogger">
>         <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
>             <constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
>         </bean>
>     </property>
> </bean>
> </beans>
>
> And I used this command to start the Ignite service on the three nodes.
>
> ./ignite.sh -J-Xmx32000m -J-Xms32000m -J-XX:+UseG1GC
> -J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC -J-XX:+AlwaysPreTouch
> -J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDateStamps
> -J-XX:+PrintAdaptiveSizePolicy -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintGCApplicationConcurrentTime
> -J-Xloggc:/spare/ignite/log/ignitegc-$(date +%Y_%m_%d-%H_%M).log
> config/persistent-config.xml
>
> While I was using the Spark DataFrame API to ingest data into this cluster, node2 failed.
> The logic of this Spark job is rather simple: it just extracts data from
> HDFS and ingests it into Ignite day by day.
> Each Spark job represents one day's data and usually takes less than 10
> minutes to complete.
> From the following Spark job web management page, we can see that Job Id
> 282 took more than 8 hours.
> spark.png
> <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/spark.png>
>
> So I suspect the Ignite cluster started malfunctioning around 2018/07/23 16:25:52.
>
> From line 8619 in ignite-f7266ac7-1-2018-07-28.log at 2018-07-23T16:26:04,
> it looks like there is some kind of network issue between node2 and the Spark
> client, because there are lots of log lines saying "No route to host" etc.
> node2-log.zip
> <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/node2-log.zip>
>
> Node2 was still connected to the cluster at that time, but some time later, at
> 2018-07-24T00:50:14,941, node2 printed its last metrics log entry and failed.
> I can't see any obvious reason why node2 failed, because node2 only prints
> metrics log entries in ignite-f7266ac7.log and the GC times are normal.
>
> So I also checked the logs on node1 and node3 around the time node2 failed. The
> log snippet for node3 is:
>
> 14213 [2018-07-24T00:51:16,622][INFO ][tcp-disco-sock-reader-#4][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.252.10.4:59144, rmtPort=59144]
> 14214 [2018-07-24T00:51:16,623][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/10.252.4.60, rmtPort=23056]
> 14215 [2018-07-24T00:51:16,623][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/10.252.4.60, rmtPort=23056]
> 14216 [2018-07-24T00:51:16,624][INFO ][tcp-disco-sock-reader-#44][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/10.252.4.60:23056, rmtPort=23056]
> 14217 [2018-07-24T00:51:16,626][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/10.29.42.45, rmtPort=45605]
> 14218 [2018-07-24T00:51:16,626][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/10.29.42.45, rmtPort=45605]
> 14219 [2018-07-24T00:51:16,626][INFO ][tcp-disco-sock-reader-#45][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/10.29.42.45:45605, rmtPort=45605]
> 14220 [2018-07-24T00:51:16,630][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/10.29.42.48, rmtPort=57089]
> 14221 [2018-07-24T00:51:16,630][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/10.29.42.48, rmtPort=57089]
> 14222 [2018-07-24T00:51:16,630][INFO ][tcp-disco-sock-reader-#47][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/10.29.42.48:57089, rmtPort=57089]
> 14223 [2018-07-24T00:51:16,666][WARN ][disco-event-worker-#161][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=f7266ac7-409a-454f-b601-b077b15594b3, addrs=[10.252.10.4, 127.0.0.1], sockAddrs=[rpsj1ign002.webex.com/10.252.10.4:49500, /127.0.0.1:49500], discPort=49500, order=2, intOrder=2, lastExchangeTime=1532323119618, loc=false, ver=2.6.0#20180710-sha1:669feacc, isClient=false]
> 14224 [2018-07-24T00:51:16,667][INFO ][disco-event-worker-#161][GridDiscoveryManager] Topology snapshot [ver=90, servers=2, clients=8, CPUs=392, offheap=920.0GB, heap=170.0GB]
> 14225 [2018-07-24T00:51:16,667][INFO ][disco-event-worker-#161][GridDiscoveryManager] ^-- Node [id=BF7076FB-7C87-4AEE-AC8B-D3DB0AC8A9E2, clusterState=ACTIVE]
> 14226 [2018-07-24T00:51:16,667][INFO ][disco-event-worker-#161][GridDiscoveryManager] ^-- Baseline [id=0, size=3, online=2, offline=1]
>
> and the log snippet for node1 is:
>
> [2018-07-24T00:51:16,617][WARN ][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Failed to send message to next node [msg=TcpDiscoveryMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=bf4e1a5c461-0712d9d6-49a3-42df-a8ee-5512764cb1a9, verifierNodeId=0712d9d6-49a3-42df-a8ee-5512764cb1a9, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=TcpDiscoveryNode [id=f7266ac7-409a-454f-b601-b077b15594b3, addrs=[10.252.10.4, 127.0.0.1], sockAddrs=[rpsj1ign002.webex.com/10.252.10.4:49500, /127.0.0.1:49500], discPort=49500, order=2, intOrder=2, lastExchangeTime=1532323112326, loc=false, ver=2.6.0#20180710-sha1:669feacc, isClient=false], errMsg=Failed to send message to next node [msg=TcpDiscoveryMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=bf4e1a5c461-0712d9d6-49a3-42df-a8ee-5512764cb1a9, verifierNodeId=0712d9d6-49a3-42df-a8ee-5512764cb1a9, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=ClusterNode [id=f7266ac7-409a-454f-b601-b077b15594b3, order=2, addr=[10.252.10.4, 127.0.0.1], daemon=false]]]
> [2018-07-24T00:51:16,627][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/10.29.42.46, rmtPort=58968]
> [2018-07-24T00:51:16,627][INFO ][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/10.29.42.46, rmtPort=58968]
> [2018-07-24T00:51:16,627][INFO ][tcp-disco-sock-reader-#52][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/10.29.42.46:58968, rmtPort=58968]
> [2018-07-24T00:51:16,647][WARN ][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi'
> [2018-07-24T00:51:16,648][WARN ][disco-event-worker-#161][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=f7266ac7-409a-454f-b601-b077b15594b3, addrs=[10.252.10.4, 127.0.0.1], sockAddrs=[rpsj1ign002.webex.com/10.252.10.4:49500, /127.0.0.1:49500], discPort=49500, order=2, intOrder=2, lastExchangeTime=1532323112326, loc=false, ver=2.6.0#20180710-sha1:669feacc, isClient=false]
> [2018-07-24T00:51:16,649][INFO ][disco-event-worker-#161][GridDiscoveryManager] Topology snapshot [ver=90, servers=2, clients=8, CPUs=392, offheap=920.0GB, heap=170.0GB]
> [2018-07-24T00:51:16,649][INFO ][disco-event-worker-#161][GridDiscoveryManager] ^-- Node [id=0712D9D6-49A3-42DF-A8EE-5512764CB1A9, clusterState=ACTIVE]
> [2018-07-24T00:51:16,649][INFO ][disco-event-worker-#161][GridDiscoveryManager] ^-- Baseline [id=0, size=3, online=2, offline=1]
>
> I've attached the full logs for all three nodes and the GC log for node2.
> ignite-node1.log
> <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/ignite-node1.log>
> ignite-node3.zip
> <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/ignite-node3.zip>
>
> I also noticed there's no GC activity on node2 between
> 2018-07-23T19:28:03.401 and 2018-07-24T00:51:51.
> Can anybody help me understand why node2 failed?
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
