Sam,

Garbage collection may well be the main issue here. Here is a link you can use as a starting point for tuning GC:
http://apacheignite.gridgain.org/docs/performance-tips#tune-garbage-collection

D.

On Tue, Oct 27, 2015 at 3:26 AM, Sam Adams <[email protected]> wrote:

> Hi,
>
> I'm running a compute job that will take many days. After a few hours of
> running it the client left the cluster.
>
> I see a lot of errors like:
>
> [06:25:46,172][SEVERE][exchange-worker-#49%null%][GridCachePartitionExchangeManager]
> Failed to send local partition map to node [node=TcpDiscoveryNode
> [id=6eff7f64-f0a0-455d-8e50-5d2e18ac56f8, addrs=[0:0:0:0:0:0:0:1,
> 10.30.9.52, 127.0.0.1], sockAddrs=[D0065-gtp-corp/10.30.9.52:47500,
> /0:0:0:0:0:0:0:1:47500, /10.30.9.52:47500, /127.0.0.1:47500],
> discPort=47500, order=5, intOrder=5, lastExchangeTime=1445859635058,
> loc=false, ver=1.4.0#20150924-sha1:c2def5f6, isClient=false], exchId=null]
>
> After a while the client seems to leave the cluster and I see:
>
> [07:07:09] Topology snapshot [ver=98, servers=1, clients=0, CPUs=4,
> heap=4.4GB]
>
> The cluster is still up however and if I start up other nodes they join
> the cluster, it's just the client that seems to have left in this instance.
>
> On the 10.30.9.52 machine I see GC overhead limit exceeded errors. Could
> this be causing the issue? Why is the node not removed from the cluster if
> it cannot be reached?
>
> Thanks,
>
> Sam
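
P.S. One way to check whether GC is really what is hurting the 10.30.9.52 node is to look at how much of the JVM's uptime has gone to collections. The sketch below is only an illustration, not something from the Ignite docs: the class name GcOverheadProbe and its two helper methods are made up for this example, and it relies solely on the standard java.lang.management MXBeans. It has to run inside the same JVM as the node (for example, called from your job) to report on that node.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Ignite: reports the share of JVM uptime spent in GC.
public class GcOverheadProbe {
    /** Sums the accumulated collection time (ms) of all collectors in this JVM. */
    public static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Returns gcMillis / elapsedMillis, or 0.0 when elapsedMillis is not positive. */
    public static double gcFraction(long gcMillis, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        return (double) gcMillis / elapsedMillis;
    }

    public static void main(String[] args) {
        long gcMs = totalGcTimeMillis();
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC time: %d ms of %d ms uptime (%.1f%%)%n",
                gcMs, uptimeMs, 100.0 * gcFraction(gcMs, uptimeMs));
    }
}
```

If the reported share stays high for long stretches, that would be consistent with the "GC overhead limit exceeded" errors you are seeing, and the heap size and GC settings from the link above are the place to start.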
