Hello! Most of those questions are rhetorical, but I would say that a 60s failure detection timeout is not unheard of. For clients you can set a smaller value (clientFailureDetectionTimeout), since losing a client is not as impactful.
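As a minimal sketch, the two timeouts could be set on IgniteConfiguration in the Spring XML like this (the 60000/30000 millisecond values are only illustrative, not a recommendation):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Server nodes: tolerate pauses up to ~60s before the node is dropped. -->
    <property name="failureDetectionTimeout" value="60000"/>
    <!-- Client nodes: a shorter timeout is usually fine, since losing a
         client is less disruptive than losing a server. -->
    <property name="clientFailureDetectionTimeout" value="30000"/>
</bean>
```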
Regards,
--
Ilya Kasnacheev

Tue, 18 Aug 2020 at 20:37, John Smith <[email protected]>:

> I don't see why we would get such a huge pause; in fact I have provided GC
> logs before and we found nothing...
>
> All operations are in the "big" partitioned 3-million-entry cache (puts or
> gets), plus a query on another cache which has 450 entries. There are no
> other caches.
>
> The nodes all have 6G of heap and 26G off heap.
>
> I think it could be IO related, but I can't seem to correlate it to IO. I
> saw some heavy IO usage, but the node failed well after that.
>
> Now my question is: should I set the failure detection to 60s just for the
> sake of trying it? Isn't that too high? If I put the servers at 60s, how
> high should I put the clients?
>
> On Tue., Aug. 18, 2020, 7:32 a.m. Ilya Kasnacheev <
> [email protected]> wrote:
>
>> Hello!
>>
>> [13:39:53,242][WARNING][jvm-pause-detector-worker][IgniteKernal%company]
>> Possible too long JVM pause: 41779 milliseconds.
>>
>> It seems that you have a too-long full GC. Either make sure it does not
>> happen, or increase failureDetectionTimeout to be longer than any expected
>> GC pause.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>> Mon, 17 Aug 2020 at 17:51, John Smith <[email protected]>:
>>
>>> Hi guys, it seems every couple of weeks we lose a node... Here are the
>>> logs:
>>> https://www.dropbox.com/sh/8cv2v8q5lcsju53/AAAU6ZSFkfiZPaMwHgIh5GAfa?dl=0
>>>
>>> And some extra details. Maybe I need to do more tuning than what is
>>> already mentioned below, maybe set a higher timeout?
>>>
>>> 3 server nodes and 9 clients (client = true)
>>>
>>> Performance-wise the cluster is not doing any kind of high volume; on
>>> average it does about 15-20 puts/gets/queries (any combination of) per
>>> 30-60 seconds.
>>>
>>> The biggest cache we have is 3 million records, distributed with 1
>>> backup, using the following template.
>>>
>>> <bean id="cache-template-bean" abstract="true"
>>>       class="org.apache.ignite.configuration.CacheConfiguration">
>>>     <!-- when you create a template via XML configuration,
>>>          you must add an asterisk to the name of the template -->
>>>     <property name="name" value="partitionedTpl*"/>
>>>     <property name="cacheMode" value="PARTITIONED"/>
>>>     <property name="backups" value="1"/>
>>>     <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
>>> </bean>
>>>
>>> Persistence is configured:
>>>
>>> <property name="dataStorageConfiguration">
>>>     <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>>>         <!-- Redefining the default region's settings -->
>>>         <property name="defaultDataRegionConfiguration">
>>>             <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>>>                 <property name="persistenceEnabled" value="true"/>
>>>                 <property name="name" value="Default_Region"/>
>>>                 <property name="maxSize" value="#{10L * 1024 * 1024 * 1024}"/>
>>>             </bean>
>>>         </property>
>>>     </bean>
>>> </property>
>>>
>>> We also followed the tuning instructions for GC and I/O:
>>>
>>> if [ -z "$JVM_OPTS" ] ; then
>>>     JVM_OPTS="-Xms6g -Xmx6g -server -XX:MaxMetaspaceSize=256m"
>>> fi
>>>
>>> #
>>> # Uncomment the following GC settings if you see spikes in your
>>> # throughput due to Garbage Collection.
>>> #
>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -XX:+AlwaysPreTouch
>>> -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>
>>> sysctl -w vm.dirty_writeback_centisecs=500
>>> sysctl -w vm.dirty_expire_centisecs=500
>>>
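One way to confirm whether the 41-second pause reported by the pause detector is actually GC (rather than, say, the OS stalling the JVM on I/O) is to turn on GC logging with safepoint timing. A sketch, assuming a Java 8 HotSpot JVM and a hypothetical log path (on Java 9+ these flags are replaced by `-Xlog:gc*`):

```shell
# Assumption: Java 8 flags; /var/log/ignite-gc.log is an example path.
# -XX:+PrintGCApplicationStoppedTime also records non-GC safepoint stalls,
# which helps distinguish GC pauses from other stop-the-world causes.
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/ignite-gc.log \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime"
```

If the GC log shows no pause near the time the pause detector fires, the stall is coming from outside the collector (swap, dirty-page writeback, etc.), which would fit the heavy-I/O suspicion.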
