A follow-up question about recovery. Node ves-hx-40 was frozen for about a minute by a VM backup, and the cluster considered it failed.
When ves-hx-40 woke up after the VM backup, it found itself disconnected from the topology (see the logs below) and then stopped itself.

[23:34:03,752][INFO ][tcp-disco-msg-worker-#2%null%][TcpDiscoverySpi] Local node seems to be disconnected from topology (failure detection timeout is reached) [failureDetectionTimeout=10000, connCheckFreq=3333]
[23:34:03,783][WARN ][tcp-disco-msg-worker-#2%null%][TcpDiscoverySpi] Node is out of topology (probably, due to short-time network problems).
[23:34:03,786][WARN ][disco-event-worker-#44%null%][GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=9a069f70-d49d-472e-9771-7ac2353e751f, addrs=[10.3.0.64, 127.0.0.1], sockAddrs=[ves-hx-40.ebi.ac.uk/10.3.0.64:47500, /10.3.0.64:47500, /127.0.0.1:47500], discPort=47500, order=56, intOrder=29, lastExchangeTime=1470350043783, loc=true, ver=1.6.0#20160518-sha1:0b22c45b, isClient=false]
[23:34:03,819][WARN ][disco-event-worker-#44%null%][GridDiscoveryManager] Stopping local node according to configured segmentation policy.

I understand that in such situations Apache Ignite stops the local node according to the segmentation policy. My question is: why does Apache Ignite not offer an option to try to reconnect to the cluster, instead of only stopping the local node, doing nothing, or restarting the JVM? I think this would be a reasonable policy option: treat the disconnected local node as a new potential member of the cluster, clear all of its local caches and state, and then rejoin the cluster.

Thanks.
Yuci
