Hi, I'm running a compute job that will take many days. After it had been running for a few hours, the client left the cluster.
I see a lot of errors like:

[06:25:46,172][SEVERE][exchange-worker-#49%null%][GridCachePartitionExchangeManager] Failed to send local partition map to node [node=TcpDiscoveryNode [id=6eff7f64-f0a0-455d-8e50-5d2e18ac56f8, addrs=[0:0:0:0:0:0:0:1, 10.30.9.52, 127.0.0.1], sockAddrs=[D0065-gtp-corp/10.30.9.52:47500, /0:0:0:0:0:0:0:1:47500, /10.30.9.52:47500, /127.0.0.1:47500], discPort=47500, order=5, intOrder=5, lastExchangeTime=1445859635058, loc=false, ver=1.4.0#20150924-sha1:c2def5f6, isClient=false], exchId=null]

After a while the client seems to leave the cluster and I see:

[07:07:09] Topology snapshot [ver=98, servers=1, clients=0, CPUs=4, heap=4.4GB]

The cluster is still up, however, and if I start other nodes they join the cluster. It is only the client that seems to have left in this instance.

On the 10.30.9.52 machine I see "GC overhead limit exceeded" errors. Could this be causing the issue? Why is the node not removed from the cluster if it cannot be reached?

Thanks,
Sam
