Thanks, Vladislav. I have already tuned the GC parameters according to that link.
This error seems to happen very frequently when the cache is very big, for example over 15 GB of off-heap data per node. BTW, I found that on the remote nodes the JoinEvent and FailedEvent for the same node were received at almost the same time (about 630 ms apart in the log below). Any idea about this?

[2016.08.11 00:36:34,994 PDT][INFO ][disco-event-worker-#174%null%][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=eeb076c7-2b63-431b-b53d-4acef66e99f2, addrs=[10.183.142.50, 10.65.84.249, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, /10.183.142.50:47500, CO3SCH050520537/10.65.84.249:47500], discPort=47500, order=1158, intOrder=662, lastExchangeTime=1470900973709, loc=false, ver=1.7.0#20160801-sha1:383273e3, isClient=false]

[2016.08.11 00:36:34,996 PDT][INFO ][disco-event-worker-#174%null%][GridDiscoveryManager] Topology snapshot [ver=1158, servers=20, clients=146, CPUs=1336, heap=360.0GB]

[2016.08.11 00:36:35,439 PDT][INFO ][exchange-worker-#176%null%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=1157, minorTopVer=0], evt=NODE_FAILED, node=eeb076c7-2b63-431b-b53d-4acef66e99f2]

[2016.08.11 00:36:35,625 PDT][WARN ][disco-event-worker-#174%null%][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=eeb076c7-2b63-431b-b53d-4acef66e99f2, addrs=[10.183.142.50, 10.65.84.249, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, /10.183.142.50:47500, CO3SCH050520537/10.65.84.249:47500], discPort=47500, order=1158, intOrder=662, lastExchangeTime=1470900973709, loc=false, ver=1.7.0#20160801-sha1:383273e3, isClient=false]
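To put a number on "almost the same time", here is a minimal Python sketch that reads the leading timestamp from the "Added new node" and "Node FAILED" log lines and prints the gap in milliseconds. The two sample lines are abbreviated copies of the entries from this post; the regex and the helper names (`parse_ts`, `gap_ms`) are my own and not part of Ignite, and the sketch assumes both lines come from the same host and time zone.

```python
import re
from datetime import datetime, timedelta

# Abbreviated copies of the two log entries from this post
# (everything after the node id is cut off for brevity).
JOINED_LINE = (
    "[2016.08.11 00:36:34,994 PDT][INFO ][disco-event-worker-#174%null%]"
    "[GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode "
    "[id=eeb076c7-2b63-431b-b53d-4acef66e99f2"
)
FAILED_LINE = (
    "[2016.08.11 00:36:35,625 PDT][WARN ][disco-event-worker-#174%null%]"
    "[GridDiscoveryManager] Node FAILED: TcpDiscoveryNode "
    "[id=eeb076c7-2b63-431b-b53d-4acef66e99f2"
)

# Matches the leading "[yyyy.MM.dd HH:mm:ss,SSS " part of a log line.
# The time zone abbreviation is ignored (assumption: same zone on both lines).
TS_RE = re.compile(r"^\[(\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")


def parse_ts(line: str) -> datetime:
    """Return the timestamp at the start of a log line (hypothetical helper)."""
    match = TS_RE.match(line)
    if match is None:
        raise ValueError("no leading timestamp found in log line")
    return datetime.strptime(match.group(1), "%Y.%m.%d %H:%M:%S,%f")


def gap_ms(first_line: str, second_line: str) -> int:
    """Whole milliseconds between the timestamps of two log lines."""
    delta = parse_ts(second_line) - parse_ts(first_line)
    return delta // timedelta(milliseconds=1)


if __name__ == "__main__":
    print(gap_ms(JOINED_LINE, FAILED_LINE), "ms between join and failure")
```

Running the same helper over the full log of each remote node should show whether the gap is consistently this short or only on some nodes.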
