Hi,

I'm running a compute job that will take many days. After a few hours of
running, the client left the cluster.

I see a lot of errors like:

[06:25:46,172][SEVERE][exchange-worker-#49%null%][GridCachePartitionExchangeManager]
Failed to send local partition map to node [node=TcpDiscoveryNode
[id=6eff7f64-f0a0-455d-8e50-5d2e18ac56f8, addrs=[0:0:0:0:0:0:0:1,
10.30.9.52, 127.0.0.1], sockAddrs=[D0065-gtp-corp/10.30.9.52:47500,
/0:0:0:0:0:0:0:1:47500, /10.30.9.52:47500, /127.0.0.1:47500],
discPort=47500, order=5, intOrder=5, lastExchangeTime=1445859635058,
loc=false, ver=1.4.0#20150924-sha1:c2def5f6, isClient=false], exchId=null]

After a while the client seems to leave the cluster, and I see:

[07:07:09] Topology snapshot [ver=98, servers=1, clients=0, CPUs=4,
heap=4.4GB]

The cluster is still up, however, and if I start other nodes they join the
cluster. It's just the client that seems to have left in this instance.

On the 10.30.9.52 machine I see "GC overhead limit exceeded" errors. Could
this be causing the issue? Why is the node not removed from the cluster if
it cannot be reached?

Thanks,

Sam
