Sam,

Garbage collection may well be the main issue here. Here is a link
you can use as a starting point for tuning GC:

http://apacheignite.gridgain.org/docs/performance-tips#tune-garbage-collection
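
Before changing any settings, it may help to confirm how much time the
affected node actually spends in GC. Below is a minimal sketch that uses
only the standard JMX beans from java.lang.management. The class name
GcOverheadCheck and its helper methods are illustrative and not part of
Ignite, and the reported fraction is only approximate.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverheadCheck {
    /** Sums the accumulated collection time (ms) reported by all collectors since JVM start. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Approximate share of JVM uptime spent in GC. */
    static double gcFraction() {
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMillis <= 0 ? 0.0 : (double) totalGcMillis() / uptimeMillis;
    }

    public static void main(String[] args) {
        System.out.printf("GC time: %d ms (%.1f%% of uptime)%n",
            totalGcMillis(), 100.0 * gcFraction());
    }
}
```

If you call these helpers periodically from inside the job and the
fraction keeps climbing, the heap is probably too small for the workload
or something is holding on to memory, and the tuning tips at the link
above are the place to start.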

D.

On Tue, Oct 27, 2015 at 3:26 AM, Sam Adams <[email protected]> wrote:

> Hi,
>
> I'm running a compute job that will take many days. After a few hours of
> running it the client left the cluster.
>
> I see a lot of errors like:
>
> [06:25:46,172][SEVERE][exchange-worker-#49%null%][GridCachePartitionExchangeManager]
> Failed to send local partition map to node [node=TcpDiscoveryNode
> [id=6eff7f64-f0a0-455d-8e50-5d2e18ac56f8, addrs=[0:0:0:0:0:0:0:1,
> 10.30.9.52, 127.0.0.1], sockAddrs=[D0065-gtp-corp/10.30.9.52:47500,
> /0:0:0:0:0:0:0:1:47500, /10.30.9.52:47500, /127.0.0.1:47500],
> discPort=47500, order=5, intOrder=5, lastExchangeTime=1445859635058,
> loc=false, ver=1.4.0#20150924-sha1:c2def5f6, isClient=false], exchId=null]
>
> After a while the client seems to leave the cluster and I see:
>
> [07:07:09] Topology snapshot [ver=98, servers=1, clients=0, CPUs=4,
> heap=4.4GB]
>
> The cluster is still up, however, and if I start up other nodes they join
> the cluster. It's just the client that seems to have left in this instance.
>
> On the 10.30.9.52 machine I see GC overhead limit exceeded errors. Could
> this be causing the issue? Why is the node not removed from the cluster if
> it cannot be reached?
>
> Thanks,
>
> Sam
>
