Hello! Most of those questions are rhetorical, but I would say that a 60s failure detection timeout is not unheard of. For clients you can set a smaller value (clientFailureDetectionTimeout), since losing a client is not as impactful.
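As a minimal sketch, the two timeouts could be set on IgniteConfiguration in the Spring XML like this (the 60000/30000 millisecond values are only illustrative, not a recommendation):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Server nodes: tolerate pauses up to ~60s before the node is dropped. -->
    <property name="failureDetectionTimeout" value="60000"/>
    <!-- Client nodes: a shorter timeout is usually fine, since losing a
         client is less disruptive than losing a server. -->
    <property name="clientFailureDetectionTimeout" value="30000"/>
</bean>
```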
Regards,
--
Ilya Kasnacheev

Tue, 18 Aug 2020 at 20:37, John Smith <[email protected]>:

> I don't see why we would get such a huge pause; in fact I have provided GC
> logs before and we found nothing...
>
> All operations are in the "big" partitioned 3-million-entry cache (puts or
> gets), plus a query on another cache which has 450 entries. There are no
> other caches.
>
> The nodes all have 6G of heap and 26G off heap.
>
> I think it could be IO related, but I can't seem to correlate it to IO. I
> saw some heavy IO usage, but the node failed well after that.
>
> Now my question is: should I set the failure detection to 60s just for the
> sake of trying it? Isn't that too high? If I put the servers at 60s, how
> high should I put the clients?
>
> On Tue., Aug. 18, 2020, 7:32 a.m. Ilya Kasnacheev <
> [email protected]> wrote:
>
>> Hello!
>>
>> [13:39:53,242][WARNING][jvm-pause-detector-worker][IgniteKernal%company]
>> Possible too long JVM pause: 41779 milliseconds.
>>
>> It seems that you have a too-long full GC. Either make sure it does not
>> happen, or increase failureDetectionTimeout to be longer than any expected
>> GC pause.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>> Mon, 17 Aug 2020 at 17:51, John Smith <[email protected]>:
>>
>>> Hi guys, it seems every couple of weeks we lose a node... Here are the
>>> logs:
>>> https://www.dropbox.com/sh/8cv2v8q5lcsju53/AAAU6ZSFkfiZPaMwHgIh5GAfa?dl=0
>>>
>>> And some extra details. Maybe I need to do more tuning than what is
>>> already mentioned below, maybe set a higher timeout?
>>>
>>> 3 server nodes and 9 clients (client = true)
>>>
>>> Performance-wise the cluster is not doing any kind of high volume; on
>>> average it does about 15-20 puts/gets/queries (any combination of) per
>>> 30-60 seconds.
>>>
>>> The biggest cache we have is 3 million records, distributed with 1
>>> backup, using the following template.
>>>
>>> <bean id="cache-template-bean" abstract="true"
>>>       class="org.apache.ignite.configuration.CacheConfiguration">
>>>     <!-- when you create a template via XML configuration,
>>>          you must add an asterisk to the name of the template -->
>>>     <property name="name" value="partitionedTpl*"/>
>>>     <property name="cacheMode" value="PARTITIONED"/>
>>>     <property name="backups" value="1"/>
>>>     <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
>>> </bean>
>>>
>>> Persistence is configured:
>>>
>>> <property name="dataStorageConfiguration">
>>>     <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>>>         <!-- Redefining the default region's settings -->
>>>         <property name="defaultDataRegionConfiguration">
>>>             <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>>>                 <property name="persistenceEnabled" value="true"/>
>>>                 <property name="name" value="Default_Region"/>
>>>                 <property name="maxSize" value="#{10L * 1024 * 1024 * 1024}"/>
>>>             </bean>
>>>         </property>
>>>     </bean>
>>> </property>
>>>
>>> We also followed the tuning instructions for GC and I/O:
>>>
>>> if [ -z "$JVM_OPTS" ] ; then
>>>     JVM_OPTS="-Xms6g -Xmx6g -server -XX:MaxMetaspaceSize=256m"
>>> fi
>>>
>>> #
>>> # Uncomment the following GC settings if you see spikes in your
>>> # throughput due to Garbage Collection.
>>> #
>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -XX:+AlwaysPreTouch
>>> -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>
>>> sysctl -w vm.dirty_writeback_centisecs=500
>>> sysctl -w vm.dirty_expire_centisecs=500
>>>
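One way to confirm whether the 41-second pause reported by the pause detector is actually GC (rather than, say, the OS stalling the JVM on I/O) is to turn on GC logging with safepoint timing. A sketch, assuming a Java 8 HotSpot JVM and a hypothetical log path (on Java 9+ these flags are replaced by `-Xlog:gc*`):

```shell
# Assumption: Java 8 flags; /var/log/ignite-gc.log is an example path.
# -XX:+PrintGCApplicationStoppedTime also records non-GC safepoint stalls,
# which helps distinguish GC pauses from other stop-the-world causes.
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/ignite-gc.log \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime"
```

If the GC log shows no pause near the time the pause detector fires, the stall is coming from outside the collector (swap, dirty-page writeback, etc.), which would fit the heavy-I/O suspicion.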
