It's the default. And as per Ilya I had a suspected GC pause of 45000 ms, so
I figure 60 seconds would be ok. As for the GC pauses, we (as in the Ignite
team and I) have already looked at the GC logs previously and they weren't the issue.
For monitoring we are using Elasticsearch, with Metricbeat and Kibana.
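In case it helps for the next occurrence, a minimal sketch of flags one could
use to capture GC and safepoint pauses (assuming Java 8 and a node started via
ignite.sh, which picks up JVM_OPTS; the log path is just an example):

  export JVM_OPTS="$JVM_OPTS \
    -Xloggc:/var/log/ignite-gc.log \
    -XX:+PrintGCDetails \
    -XX:+PrintGCDateStamps \
    -XX:+PrintGCApplicationStoppedTime"

-XX:+PrintGCApplicationStoppedTime reports total safepoint time, so it also
catches long pauses that are not caused by GC itself.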
>
> Dennis, wouldn't a 15 second failureDetectionTimeout cause even more
> shutdowns?
What's your current value? For sure, it doesn't make sense to decrease the
value until all the mysterious pauses are figured out. The downside of a high
failureDetectionTimeout is that the cluster won't remove a failed node quickly.
John,
I would try to get to the bottom of the issue, especially if the case is
reproducible.
If that's not GC, then check whether it's the I/O (your logs show that the
checkpointing rate is high):
- You can monitor checkpointing duration with a JMX tool (see the sketch
below)
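For reference, a rough sketch of reading the checkpoint timings
programmatically instead of over JMX (assumes Ignite 2.x, persistence metrics
enabled via DataStorageConfiguration.setMetricsEnabled(true), and that the code
runs inside each server node, since these metrics are per-node; the same
figures are exposed as the DataStorageMetrics MBean):

  import org.apache.ignite.DataStorageMetrics;
  import org.apache.ignite.Ignite;
  import org.apache.ignite.Ignition;

  public class CheckpointStats {
      /** Prints the last checkpoint timings of the local node
       *  (assumes an Ignite node is already running in this JVM). */
      public static void print() {
          Ignite ignite = Ignition.ignite();
          DataStorageMetrics m = ignite.dataStorageMetrics();

          System.out.println("Last checkpoint duration, ms:  " + m.getLastCheckpointDuration());
          System.out.println("Last checkpoint fsync, ms:     " + m.getLastCheckpointFsyncDuration());
          System.out.println("Pages written last checkpoint: " + m.getLastCheckpointTotalPagesNumber());
      }
  }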
Hello!
Most of those questions are rhetorical, but I would say that 60s of failure
detection timeout is not unheard of. For clients you can set a smaller value
(clientFailureDetectionTimeout) since losing a client is not as impactful.
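For what it's worth, a minimal sketch of how those two settings look in code
(the values are only examples; assuming Ignite 2.x, and both settings have XML
equivalents if you configure nodes via Spring):

  import org.apache.ignite.Ignite;
  import org.apache.ignite.Ignition;
  import org.apache.ignite.configuration.IgniteConfiguration;

  public class StartNode {
      public static void main(String[] args) {
          IgniteConfiguration cfg = new IgniteConfiguration();

          // How long a server node may stay unresponsive before the cluster drops it.
          // 60_000 ms here only as an example, to cover the ~45 s pauses seen in the logs.
          cfg.setFailureDetectionTimeout(60_000);

          // Clients can use a shorter timeout, since losing a client is less impactful.
          cfg.setClientFailureDetectionTimeout(30_000);

          Ignite ignite = Ignition.start(cfg);
      }
  }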
Regards,
--
Ilya Kasnacheev
Tue, Aug 18, 2020 at 20:37,
I don't see why we would get such a huge pause; in fact, I have provided GC
logs before and we found nothing...
All operations on the "big" partitioned cache with 3 million entries are puts
or gets, plus a query on another cache which has 450 entries. There are no
other caches. The nodes all have 6G off heap and
Hello!
[13:39:53,242][WARNING][jvm-pause-detector-worker][IgniteKernal%company]
Possible too long JVM pause: 41779 milliseconds.
It seems that you have a too-long full GC. Either make sure it does not
happen, or increase failureDetectionTimeout to be longer than any expected
GC pause.
Regards,
--
Ilya
Hi guys, it seems every couple of weeks we lose a node... Here are the logs:
https://www.dropbox.com/sh/8cv2v8q5lcsju53/AAAU6ZSFkfiZPaMwHgIh5GAfa?dl=0
And some extra details. Maybe I need to do more tuning than what is already
mentioned below, maybe set a higher timeout?
3 server nodes and 9