It's the default. And as per Ilya, I had a suspected GC pause of 45000 ms, so
I figure 60 seconds would be ok. As for the GC pauses, we (as in the Ignite
team and I) have already looked at the GC logs previously and GC wasn't the issue.
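
Just for reference, a minimal sketch of how those timeouts could be set
programmatically (the 60-second values are only the numbers being discussed
here, not a recommendation, and the class name is made up):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class FailureTimeoutSketch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Time the cluster waits for a non-responsive server node before
        // dropping it from the topology (milliseconds).
        cfg.setFailureDetectionTimeout(60_000);

        // The equivalent timeout used for client nodes (milliseconds).
        cfg.setClientFailureDetectionTimeout(60_000);

        try (Ignite ignite = Ignition.start(cfg)) {
            // The node would normally keep running; it stops here when the
            // try-with-resources block exits.
        }
    }
}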

For the monitoring we are using Elasticsearch, with Metricbeat and Kibana as
the dashboard. We're not on the latest version, otherwise I would be able to
use JMX as well :p
I will try to look into a JMX Kafka log exporter or something and see if
I can get those metrics into Elastic when and if I have time lol
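
Something along the lines of the sketch below is what I have in mind: walk
the local JMX MBeans and print their attributes, with a real exporter
shipping the values to Elastic instead of stdout. The "org.apache*" domain
filter is an assumption; the actual MBean names depend on the Ignite version.

import java.lang.management.ManagementFactory;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class JmxMetricsDump {
    public static void main(String[] args) throws Exception {
        // In-process MBean server; a standalone exporter would connect
        // remotely via JMXConnectorFactory instead.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();

        // Assumption: Ignite registers its MBeans under an "org.apache..." domain.
        for (ObjectName name : server.queryNames(new ObjectName("org.apache*:*"), null)) {
            for (MBeanAttributeInfo attr : server.getMBeanInfo(name).getAttributes()) {
                if (!attr.isReadable())
                    continue;
                try {
                    // Print name=value pairs; an exporter would index these instead.
                    System.out.println(name + " " + attr.getName() + "="
                        + server.getAttribute(name, attr.getName()));
                } catch (Exception ignored) {
                    // Some attributes throw on access; skip them.
                }
            }
        }
    }
}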



On Thu, 20 Aug 2020 at 12:28, Denis Magda <[email protected]> wrote:

> Denis, wouldn't a 15-second failureDetectionTimeout cause even more
>> shutdowns?
>
>
> What's your current value? For sure, it doesn't make sense to decrease the
> value until all the mysterious pauses are figured out. The downside of a high
> failureDetectionTimeout is that the cluster won't remove a node that has
> failed, for whatever reason, until the timeout expires. So, if there is a
> failed node that has to process some operations, then the rest of the cluster
> will keep trying to reach it until the failureDetectionTimeout expires. That
> affects the performance of operations in which the failed node has to be involved.
>
> Btw, what's the tool you are using for the monitoring? Looks nice.
>
> -
> Denis
>
>
> On Thu, Aug 20, 2020 at 6:44 AM John Smith <[email protected]> wrote:
>
>> Hi, here is an example of our cluster during our normal "high" usage. The
>> node shutting down seems to happen during "off" hours.
>>
>> Denis, wouldn't a 15-second failureDetectionTimeout cause even more
>> shutdowns?
>> We also considered more tuning stuff from the docs, we'll see I guess...
>> For now we don't have separate disks.
>>
>>
>>
>> On Wed, 19 Aug 2020 at 23:35, Denis Magda <[email protected]> wrote:
>>
>>> John,
>>>
>>> I would try to get to the bottom of the issue, especially if the case
>>> is reproducible.
>>>
>>> If that's not GC, then check whether it's the I/O (your logs show that the
>>> checkpointing rate is high):
>>>
>>>    - You can monitor checkpointing duration with a JMX tool
>>>      <https://www.gridgain.com/docs/latest/administrators-guide/monitoring-metrics/metrics#monitoring-checkpointing-operations>
>>>      or Control Center
>>>      <https://www.gridgain.com/docs/control-center/latest/monitoring/metrics#checkpoint-duration>.
>>>    - Configure write-throttling
>>>      <https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/persistence-tuning#pages-writes-throttling>
>>>      if the checkpointing buffer fills up quickly.
>>>    - Ideally, storage files and WALs should be stored on different SSD media
>>>      <https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/persistence-tuning#keep-wals-separately>.
>>>      SSDs also do their own garbage collection and you might be hitting it
>>>      frequently. (A sketch of these last two points follows below.)
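>>>
>>> A rough Java sketch of those last two points (write throttling plus keeping
>>> storage and WAL on separate devices); the paths and the class name are
>>> placeholders, not tuned recommendations:
>>>
>>> import org.apache.ignite.configuration.DataStorageConfiguration;
>>> import org.apache.ignite.configuration.IgniteConfiguration;
>>>
>>> public class StorageTuningSketch {
>>>     public static IgniteConfiguration configure() {
>>>         DataStorageConfiguration storageCfg = new DataStorageConfiguration();
>>>
>>>         // Throttle page writes when the checkpoint buffer fills up too fast.
>>>         storageCfg.setWriteThrottlingEnabled(true);
>>>
>>>         // Keep data files and WAL on different devices (paths are placeholders).
>>>         storageCfg.setStoragePath("/mnt/ssd1/ignite/storage");
>>>         storageCfg.setWalPath("/mnt/ssd2/ignite/wal");
>>>         storageCfg.setWalArchivePath("/mnt/ssd2/ignite/wal-archive");
>>>
>>>         return new IgniteConfiguration().setDataStorageConfiguration(storageCfg);
>>>     }
>>> }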
>>>
>>> As for the failureDetectionTimeout, I would set it to 15 secs until your
>>> cluster is battle-tested and well-tuned for your use case.
>>>
>>> -
>>> Denis
>>>
>>>
>>> On Tue, Aug 18, 2020 at 10:37 AM John Smith <[email protected]>
>>> wrote:
>>>
>>>> I don't see why we would get such a huge pause; in fact, I have provided
>>>> GC logs before and we found nothing...
>>>>
>>>> All operations on the "big" partitioned cache (3 million entries) are puts
>>>> or gets, plus a query on another cache which has 450 entries. There are no
>>>> other caches.
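>>>>
>>>> Roughly, the access pattern looks like the sketch below; the cache, type
>>>> and field names are made up, and the query assumes SQL is configured on
>>>> the small cache:
>>>>
>>>> import java.util.List;
>>>> import org.apache.ignite.Ignite;
>>>> import org.apache.ignite.IgniteCache;
>>>> import org.apache.ignite.cache.query.SqlFieldsQuery;
>>>>
>>>> public class WorkloadSketch {
>>>>     static void doWork(Ignite ignite) {
>>>>         // Put/get against the large partitioned cache (name is hypothetical).
>>>>         IgniteCache<Long, String> bigCache = ignite.cache("bigCache");
>>>>         bigCache.put(42L, "some value");
>>>>         String value = bigCache.get(42L);
>>>>
>>>>         // Query against the small lookup cache (~450 entries); type and
>>>>         // field names are placeholders.
>>>>         IgniteCache<Integer, Object> smallCache = ignite.cache("smallCache");
>>>>         List<List<?>> rows = smallCache.query(
>>>>             new SqlFieldsQuery("select * from SomeType where someField = ?")
>>>>                 .setArgs("x")
>>>>         ).getAll();
>>>>     }
>>>> }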
>>>>
>>>> The nodes all have 6G of heap and 26G off heap.
>>>>
>>>> I think it could be I/O related, but I can't seem to correlate it to I/O.
>>>> I saw some heavy I/O usage, but the node failed well after that.
>>>>
>>>> Now my question is: should I set the failure detection timeout to 60s just
>>>> for the sake of trying it? Isn't that too high? If I set the servers to
>>>> 60s, how high should I set the clients?
>>>>
>>>> On Tue., Aug. 18, 2020, 7:32 a.m. Ilya Kasnacheev, <
>>>> [email protected]> wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> [13:39:53,242][WARNING][jvm-pause-detector-worker][IgniteKernal%company]
>>>>> Possible too long JVM pause: 41779 milliseconds.
>>>>>
>>>>> It seems that you are hitting an overly long full GC. Either make sure it
>>>>> does not happen, or increase failureDetectionTimeout to be longer than any
>>>>> expected GC pause.
>>>>>
>>>>> Regards,
>>>>> --
>>>>> Ilya Kasnacheev
>>>>>
>>>>>
>>>>> пн, 17 авг. 2020 г. в 17:51, John Smith <[email protected]>:
>>>>>
>>>>>> Hi guys, it seems every couple of weeks we lose a node... Here are the
>>>>>> logs:
>>>>>> https://www.dropbox.com/sh/8cv2v8q5lcsju53/AAAU6ZSFkfiZPaMwHgIh5GAfa?dl=0
>>>>>>
>>>>>> And some extra details. Maybe I need to do more tuning than what is
>>>>>> already mentioned below, maybe set a higher timeout?
>>>>>>
>>>>>> 3 server nodes and 9 clients (client = true)
>>>>>>
>>>>>> Performance-wise the cluster is not doing any kind of high volume; on
>>>>>> average it does about 15-20 puts/gets/queries (any combination of those)
>>>>>> per 30-60 seconds.
>>>>>>
>>>>>> The biggest cache we have has 3 million records, distributed with 1
>>>>>> backup, using the following template.
>>>>>>
>>>>>>           <bean id="cache-template-bean" abstract="true"
>>>>>>                 class="org.apache.ignite.configuration.CacheConfiguration">
>>>>>>             <!-- When you create a template via XML configuration,
>>>>>>                  you must add an asterisk to the name of the template. -->
>>>>>>             <property name="name" value="partitionedTpl*"/>
>>>>>>             <property name="cacheMode" value="PARTITIONED"/>
>>>>>>             <property name="backups" value="1"/>
>>>>>>             <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
>>>>>>           </bean>
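>>>>>>
>>>>>> For reference, a roughly equivalent way to register that template from
>>>>>> Java instead of XML; the class and method here are the programmatic
>>>>>> Ignite configuration API, and the sketch is only illustrative:
>>>>>>
>>>>>> import org.apache.ignite.Ignite;
>>>>>> import org.apache.ignite.cache.CacheMode;
>>>>>> import org.apache.ignite.cache.PartitionLossPolicy;
>>>>>> import org.apache.ignite.configuration.CacheConfiguration;
>>>>>>
>>>>>> public class TemplateSketch {
>>>>>>     static void registerTemplate(Ignite ignite) {
>>>>>>         CacheConfiguration<Object, Object> tpl =
>>>>>>             new CacheConfiguration<Object, Object>("partitionedTpl*")
>>>>>>                 .setCacheMode(CacheMode.PARTITIONED)
>>>>>>                 .setBackups(1)
>>>>>>                 .setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);
>>>>>>
>>>>>>         // Registers the configuration as a template for caches created later.
>>>>>>         ignite.addCacheConfiguration(tpl);
>>>>>>     }
>>>>>> }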
>>>>>>
>>>>>> Persistence is configured:
>>>>>>
>>>>>>       <property name="dataStorageConfiguration">
>>>>>>         <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>>>>>>           <!-- Redefining the default region's settings -->
>>>>>>           <property name="defaultDataRegionConfiguration">
>>>>>>             <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>>>>>>               <property name="persistenceEnabled" value="true"/>
>>>>>>               <property name="name" value="Default_Region"/>
>>>>>>               <property name="maxSize" value="#{10L * 1024 * 1024 * 1024}"/>
>>>>>>             </bean>
>>>>>>           </property>
>>>>>>         </bean>
>>>>>>       </property>
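>>>>>>
>>>>>> Since persistence is enabled, we also activate the cluster once the
>>>>>> server nodes are up. A minimal sketch (the config path is a placeholder):
>>>>>>
>>>>>> import org.apache.ignite.Ignite;
>>>>>> import org.apache.ignite.Ignition;
>>>>>>
>>>>>> public class ActivateSketch {
>>>>>>     public static void main(String[] args) {
>>>>>>         // With native persistence a new cluster starts inactive, so it
>>>>>>         // has to be activated before caches can be used.
>>>>>>         Ignite ignite = Ignition.start("path/to/ignite-config.xml");
>>>>>>         ignite.cluster().active(true);
>>>>>>     }
>>>>>> }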
>>>>>>
>>>>>> We also followed the tuning instructions for GC and I/O:
>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>     JVM_OPTS="-Xms6g -Xmx6g -server -XX:MaxMetaspaceSize=256m"
>>>>>> fi
>>>>>>
>>>>>> #
>>>>>> # Uncomment the following GC settings if you see spikes in your
>>>>>> throughput due to Garbage Collection.
>>>>>> #
>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -XX:+AlwaysPreTouch
>>>>>> -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>>>> sysctl -w vm.dirty_writeback_centisecs=500
>>>>>> sysctl -w vm.dirty_expire_centisecs=500
>>>>>>
>>>>>>
