Re: Lost node again.

2020-08-20 Thread John Smith
It's the default. And as per Ilya I had a suspected GC pause of 45000 ms, so I figure 60 seconds would be OK. As for the GC pauses, we (as in the Ignite team and I) have already looked at the GC logs previously and it wasn't the issue. For the monitoring we are using Elasticsearch, with Metricbeat and Kibana

Re: Lost node again.

2020-08-20 Thread Denis Magda
> Denis, wouldn't a 15-second failureDetectionTimeout cause even more shutdowns?

What's your current value? For sure, it doesn't make sense to decrease the value until all the mysterious pauses are figured out. The downside of a high failureDetectionTimeout is that the cluster won't remove a

Re: Lost node again.

2020-08-19 Thread Denis Magda
John, I would try to get to the bottom of the issue, especially if the case is reproducible. If it's not GC, then check whether it's the I/O (your logs show that the checkpointing rate is high):
- You can monitor checkpointing duration with a JMX tool

Re: Lost node again.

2020-08-19 Thread Ilya Kasnacheev
Hello! Most of those questions are rhetorical, but I would say that 60s of failure detection timeout is not unheard of. For clients you can set a smaller value (clientFailureDetectionTimeout), since losing a client is not as impactful. Regards, -- Ilya Kasnacheev Tue, Aug 18, 2020 at 20:37,

Re: Lost node again.

2020-08-18 Thread John Smith
I don't see why we would get such a huge pause; in fact, I have provided GC logs before and we found nothing... All operations on the "big" partitioned cache of 3 million entries are put or get, plus a query on another cache which has 450 entries. There are no other caches. The nodes all have 6G off heap and

Re: Lost node again.

2020-08-18 Thread Ilya Kasnacheev
Hello!

[13:39:53,242][WARNING][jvm-pause-detector-worker][IgniteKernal%company] Possible too long JVM pause: 41779 milliseconds.

It seems that you have a too-long full GC. Either make sure it does not happen, or increase failureDetectionTimeout to be longer than any expected GC pause. Regards, -- Ilya
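As a rough illustration of the advice above, the following Python sketch scans log lines for the pause detector warning and checks whether a given failureDetectionTimeout would have covered the longest reported pause. The function names (`longest_pause_ms`, `headroom_ms`) are hypothetical helpers, not part of Ignite, and the regular expression assumes the warning text always has the same shape as the single line quoted in this thread.

```python
import re

# Assumption: the warning always looks like the one quoted in this thread.
PAUSE_RE = re.compile(r"Possible too long JVM pause: (\d+) milliseconds")


def longest_pause_ms(log_lines):
    """Return the longest reported JVM pause in ms, or 0 if none is found."""
    longest = 0
    for line in log_lines:
        match = PAUSE_RE.search(line)
        if match:
            longest = max(longest, int(match.group(1)))
    return longest


def headroom_ms(timeout_ms, pause_ms):
    """Positive result: the timeout is longer than the pause by that margin."""
    return timeout_ms - pause_ms


log = [
    "[13:39:53,242][WARNING][jvm-pause-detector-worker][IgniteKernal%company] "
    "Possible too long JVM pause: 41779 milliseconds.",
]

pause = longest_pause_ms(log)
# The two timeouts discussed in the thread: 60 s and 15 s.
for timeout in (60_000, 15_000):
    margin = headroom_ms(timeout, pause)
    verdict = "covers" if margin > 0 else "does NOT cover"
    print(f"timeout {timeout} ms {verdict} a {pause} ms pause (margin {margin} ms)")
```

With the pause from the quoted warning, a 60-second timeout leaves some margin while a 15-second timeout would not, which is the trade-off discussed in the replies. A check like this only looks at pauses that were already logged; it says nothing about their cause.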

Lost node again.

2020-08-17 Thread John Smith
Hi guys, it seems every couple of weeks we lose a node... Here are the logs: https://www.dropbox.com/sh/8cv2v8q5lcsju53/AAAU6ZSFkfiZPaMwHgIh5GAfa?dl=0 And some extra details. Maybe I need to do more tuning than what is already mentioned below, or maybe set a higher timeout? 3 server nodes and 9