Hello! I have made Java application with embedded Ignite and deployed it to Kubernetes by using StatefulSet with LivenessProbe. Application has three pods replicated per three physical nodes.
As failover test, I reboot one Kubernetes node (it has pod with slave Ignite node). *While it is offline *(about two minutes), *another slave Ignite replica goes down with Segmentation*[1]. This does not happen on killing pod by Kubernetes. All timeouts in Ignite are default. Kubernetes liveness probe has initialDelaySeconds: 10, periodSeconds: 2, timeoutSeconds: 1. Pods are: replica-0: 172.30.53.18, Ignite master (it is master because was started first by Kubernetes) replica-1: 172.30.73.15, Ignite slave, has segmented (and stopped itself) — this is my problem replica-2: 172.30.55.10, Ignite slave, was killed for failover test Failover reboot was made at 11:08:23. *Some logs from master Ignite node* (with my IMHO comments, marked italic): ========================================================= /Connect timed out from killed node, got soon after default failureDetectionTimeout value — looks OK/ sep 13, 2018 11:08:36 AM org.apache.ignite.logger.java.JavaLogger warning WARNING: Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=/172.30.55.10:47100, failureDetectionTimeout=10000] /Timed out from another node (self-stopped in future) — why? And what is the difference between "Connect timed out" vs "Timed out waiting for message delivery receipt"?/ sep 13, 2018 11:08:40 AM org.apache.ignite.logger.java.JavaLogger warning WARNING: Timed out waiting for message delivery receipt (most probably, the reason is in long GC pauses on remote node; consider tuning GC and increasing 'ackTimeout' configuration property). Will retry to send message with increased timeout [currentTimeout=10000, rmtAddr=/172.30.73.15:47500, rmtPort=47500] /Next problems, same for both slaves with 10 seconds interval./ sep 13, 2018 11:08:40 AM org.apache.ignite.logger.java.JavaLogger warning WARNING: Failed to send message to next node [msg=TcpDiscoveryStatusCheckMessage [creatorNode=TcpDiscoveryNode [id=ebf41e67-e115-4185-9478-92ffa0463b5e, addrs=[127.0.0.1, 172.30.53.18], sockAddrs=[replica-0.ignite-deployment.ignite-namespace.svc.cluster.local/172.30.53.18:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1536826110655, loc=true, ver=2.6.1#20180906-sha1:8cf681db, isClient=false], failedNodeId=null, status=0, super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=523d4c2d561-ebf41e67-e115-4185-9478-92ffa0463b5e, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=TcpDiscoveryNode [id=20e118e5-cb8c-4139-902e-0e44f4a586c3, addrs=[127.0.0.1, 172.30.73.15], sockAddrs=[/172.30.73.15:47500, /127.0.0.1:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1536825520251, loc=false, ver=2.6.1#20180906-sha1:8cf681db, isClient=false], errMsg=Failed to send message to next node [msg=TcpDiscoveryStatusCheckMessage [creatorNode=TcpDiscoveryNode [id=ebf41e67-e115-4185-9478-92ffa0463b5e, addrs=[127.0.0.1, 172.30.53.18], sockAddrs=[replica-0.ignite-deployment.ignite-namespace.svc.cluster.local/172.30.53.18:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1536826110655, loc=true, ver=2.6.1#20180906-sha1:8cf681db, isClient=false], failedNodeId=null, status=0, super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=523d4c2d561-ebf41e67-e115-4185-9478-92ffa0463b5e, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=ClusterNode [id=20e118e5-cb8c-4139-902e-0e44f4a586c3, order=2, addr=[127.0.0.1, 172.30.73.15], daemon=false]]] sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger warning WARNING: Failed to send message to next node [msg=TcpDiscoveryStatusCheckMessage [creatorNode=TcpDiscoveryNode [id=ebf41e67-e115-4185-9478-92ffa0463b5e, addrs=[127.0.0.1, 172.30.53.18], sockAddrs=[replica-0.ignite-deployment.ignite-namespace.svc.cluster.local/172.30.53.18:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1536826110655, loc=true, ver=2.6.1#20180906-sha1:8cf681db, isClient=false], failedNodeId=null, status=0, super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=523d4c2d561-ebf41e67-e115-4185-9478-92ffa0463b5e, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=TcpDiscoveryNode [id=46f906ab-175f-4055-817c-fab88c5abfb0, addrs=[127.0.0.1, 172.30.55.10], sockAddrs=[/172.30.55.10:47500, /127.0.0.1:47500], discPort=47500, order=3, intOrder=3, lastExchangeTime=1536825531571, loc=false, ver=2.6.1#20180906-sha1:8cf681db, isClient=false], errMsg=Failed to send message to next node [msg=TcpDiscoveryStatusCheckMessage [creatorNode=TcpDiscoveryNode [id=ebf41e67-e115-4185-9478-92ffa0463b5e, addrs=[127.0.0.1, 172.30.53.18], sockAddrs=[replica-0.ignite-deployment.ignite-namespace.svc.cluster.local/172.30.53.18:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1536826110655, loc=true, ver=2.6.1#20180906-sha1:8cf681db, isClient=false], failedNodeId=null, status=0, super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=523d4c2d561-ebf41e67-e115-4185-9478-92ffa0463b5e, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=ClusterNode [id=46f906ab-175f-4055-817c-fab88c5abfb0, order=3, addr=[127.0.0.1, 172.30.55.10], daemon=false]]] /order=1, intOrder=1, lastExchangeTime=1536826110655 — time is equivalent to 11:08:30/ /order=2, intOrder=2, lastExchangeTime=1536825520251 — time is equivalent to 10:58:40, that is second node join moment/ /order=3, intOrder=3, lastExchangeTime=1536825531571 — time is equivalent to 10:58:51, that is third node join moment/ /Both nodes are dropped from topology at the same time/ sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger warning WARNING: Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi' sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger warning WARNING: Node FAILED: TcpDiscoveryNode [id=20e118e5-cb8c-4139-902e-0e44f4a586c3, addrs=[127.0.0.1, 172.30.73.15], sockAddrs=[/172.30.73.15:47500, /127.0.0.1:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1536825520251, loc=false, ver=2.6.1#20180906-sha1:8cf681db, isClient=false] sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger info INFO: Topology snapshot [ver=4, servers=2, clients=0, CPUs=32, offheap=48.0GB, heap=64.0GB] sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger info INFO: ^-- Node [id=EBF41E67-E115-4185-9478-92FFA0463B5E, clusterState=ACTIVE] sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger info INFO: Data Regions Configured: sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger info INFO: ^-- Default_Region [initSize=1,0 GiB, maxSize=24,0 GiB, persistenceEnabled=false] sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger info INFO: Started exchange init [topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], crd=true, evt=NODE_FAILED, evtNode=20e118e5-cb8c-4139-902e-0e44f4a586c3, customEvt=null, allowMerge=true] sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger warning WARNING: Node FAILED: TcpDiscoveryNode [id=46f906ab-175f-4055-817c-fab88c5abfb0, addrs=[127.0.0.1, 172.30.55.10], sockAddrs=[/172.30.55.10:47500, /127.0.0.1:47500], discPort=47500, order=3, intOrder=3, lastExchangeTime=1536825531571, loc=false, ver=2.6.1#20180906-sha1:8cf681db, isClient=false] sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger info INFO: Topology snapshot [ver=5, servers=1, clients=0, CPUs=16, offheap=24.0GB, heap=32.0GB] ========================================================= *Some logs from self-stopped slave node:* ========================================================= /Timed out waiting for message delivery receipt from killed node — looks OK, master gave another message at the same time/ sep 13, 2018 11:08:36 AM org.apache.ignite.logger.java.JavaLogger warning WARNING: Timed out waiting for message delivery receipt (most probably, the reason is in long GC pauses on remote node; consider tuning GC and increasing 'ackTimeout' configuration property). Will retry to send message with increased timeout [currentTimeout=10000, rmtAddr=/172.30.55.10:47500, rmtPort=47500] sep 13, 2018 11:08:36 AM org.apache.ignite.logger.java.JavaLogger warning WARNING: Failed to send message to next node [msg=TcpDiscoveryMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=ebf41e67-e115-4185-9478-92ffa0463b5e, id=ab0a4c2d561-ebf41e67-e115-4185-9478-92ffa0463b5e, verifierNodeId=ebf41e67-e115-4185-9478-92ffa0463b5e, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=TcpDiscoveryNode [id=46f906ab-175f-4055-817c-fab88c5abfb0, addrs=[127.0.0.1, 172.30.55.10], sockAddrs=[/172.30.55.10:47500, /127.0.0.1:47500], discPort=47500, order=3, intOrder=3, lastExchangeTime=1536825531578, loc=false, ver=2.6.1#20180906-sha1:8cf681db, isClient=false], errMsg=Failed to send message to next node [msg=TcpDiscoveryMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=ebf41e67-e115-4185-9478-92ffa0463b5e, id=ab0a4c2d561-ebf41e67-e115-4185-9478-92ffa0463b5e, verifierNodeId=ebf41e67-e115-4185-9478-92ffa0463b5e, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=ClusterNode [id=46f906ab-175f-4055-817c-fab88c5abfb0, order=3, addr=[127.0.0.1, 172.30.55.10], daemon=false]]] /The same cluster-wide procedure will be at 11:08:50 on master. Was this missed/ignored by it?/ sep 13, 2018 11:08:36 AM org.apache.ignite.logger.java.JavaLogger warning WARNING: Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi' /Segmentation detected at the same time as master changed topology to one node/ sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger warning WARNING: Local node SEGMENTED: TcpDiscoveryNode [id=20e118e5-cb8c-4139-902e-0e44f4a586c3, addrs=[127.0.0.1, 172.30.73.15], sockAddrs=[replica-1.ignite-deployment.ignite-namespace.svc.cluster.local/172.30.73.15:47500, /127.0.0.1:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1536826130700, loc=true, ver=2.6.1#20180906-sha1:8cf681db, isClient=false] sep 13, 2018 11:08:50 AM java.util.logging.LogManager$RootLogger log SEVERE: Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext [type=SEGMENTATION, err=null]] sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger error SEVERE: Ignite node is in invalid state due to a critical failure. sep 13, 2018 11:08:50 AM java.util.logging.LogManager$RootLogger log SEVERE: Stopping local node on Ignite failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]] sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger warning WARNING: Node FAILED: TcpDiscoveryNode [id=ebf41e67-e115-4185-9478-92ffa0463b5e, addrs=[127.0.0.1, 172.30.53.18], sockAddrs=[/172.30.53.18:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1536825520301, loc=false, ver=2.6.1#20180906-sha1:8cf681db, isClient=false] sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger info INFO: Topology snapshot [ver=4, servers=2, clients=0, CPUs=32, offheap=48.0GB, heap=64.0GB] sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger info INFO: ^-- Node [id=20E118E5-CB8C-4139-902E-0E44F4A586C3, clusterState=ACTIVE] sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger info INFO: Data Regions Configured: sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger info INFO: ^-- Default_Region [initSize=1,0 GiB, maxSize=24,0 GiB, persistenceEnabled=false] sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger info INFO: Command protocol successfully stopped: TCP binary sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger info INFO: Become new coordinator 20e118e5-cb8c-4139-902e-0e44f4a586c3 sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger warning WARNING: Node FAILED: TcpDiscoveryNode [id=46f906ab-175f-4055-817c-fab88c5abfb0, addrs=[127.0.0.1, 172.30.55.10], sockAddrs=[/172.30.55.10:47500, /127.0.0.1:47500], discPort=47500, order=3, intOrder=3, lastExchangeTime=1536825531578, loc=false, ver=2.6.1#20180906-sha1:8cf681db, isClient=false] sep 13, 2018 11:08:50 AM org.apache.ignite.logger.java.JavaLogger info INFO: Topology snapshot [ver=5, servers=1, clients=0, CPUs=16, offheap=24.0GB, heap=32.0GB] /And little later/ >>> +------------------------------------------------------------------------------------------+ >>> Ignite ver. >>> 2.6.1-SNAPSHOT#20180906-sha1:8cf681db120b41dd7935cda9fabbb501b6b05402 >>> stopped OK >>> +------------------------------------------------------------------------------------------+ >>> Grid uptime: 00:10:18.337 ========================================================= Logs messages looks like network problems between replica-0 (master) and replica-1 (segmented and self-stopped slave) but I had executed connection test (by using watch nc -l on replica-1 and watch nc on replica-0) at the same time on same pods and it had not shown any large pauses. At the same time, replica-2 (together with affinited kube-proxy and kube-dns) really was offline. And I do not know if packets for it's IP were refused or dropped. *Generally*, had anybody met the same problem and does anybody know how to avoid it? What can be wrong with my Kubernetes or Ignite configuration? I can restart segmented node, but that is the last variant. Segmentation avoiding would be much better. (I already had tried to make the same test with six replicas, four of them were deployed on non-failovered nodes and only one not stopped itself). *[1]* By default (Ignite-2.6.0) it crashed with useless message java.lang.IllegalStateException: Thread tcp-disco-srvr-#3 is terminated unexpectedly I had backported patch from https://issues.apache.org/jira/browse/IGNITE-8751 issue manually and got Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext [type=SEGMENTATION, err=null]] then Ignite node stopped OK Building Ignite from source with that patch is why there is "2.6.1-SNAPSHOT" version in the log. -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/
