Hello! Sure, if nodes can't reach each other, they may eventually become segmented and stop.
Regards,
--
Ilya Kasnacheev

Fri, 25 Oct 2019 at 00:08, John Smith <java.dev....@gmail.com>:

> Is it possible this is somehow causing the issue of the node stopping?
>
> On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
>
>> Hello!
>>
>> This likely means that you have reachability problems in your cluster, for example xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on port range 47100-47200) but not the other way around.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>> Mon, 21 Oct 2019 at 19:36, John Smith <java.dev....@gmail.com>:
>>
>>> I also see this printing every few seconds on my client application...
>>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101, rmtAddr=/xxx.xxx.xxx.82:49816
>>>
>>> On Mon, 21 Oct 2019 at 12:04, John Smith <java.dev....@gmail.com> wrote:
>>>
>>>> Hi, thanks. I already made sure that each Ignite VM runs on a separate host within our cloud. I'm not doing any of that migration stuff.
>>>>
>>>> I also recently disabled metrics logging with igniteConfig.setMetricsLogFrequency(0); just to make sure it doesn't get too chatty. But I doubt this would affect it...
>>>>
>>>> Should I maybe set the node timeout a bit higher than 30 seconds, maybe put it to 60 seconds? I remember people suggesting this but I'm not sure...
>>>>
>>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>
>>>>> The following documentation page has some useful points on deployment in a virtualised environment:
>>>>> https://apacheignite.readme.io/docs/vmware-deployment
>>>>>
>>>>> Denis
>>>>>
>>>>> On 17 Oct 2019, 17:41 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>
>>>>> Ok, I have Metricbeat running on the VM; hopefully I will see something...
>>>>>
>>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>
>>>>>> There are no long pauses in the GC logs, so it must be a pause of the whole VM.
>>>>>>
>>>>>> Denis
>>>>>>
>>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>
>>>>>> Sorry, here are the GC logs for all 3 machines:
>>>>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>>
>>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <java.dev....@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, so it happened again. Here are my latest gc.log stats:
>>>>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>>
>>>>>>> Everything seems OK to me. I also have Elasticsearch Metricbeat running, and the CPU usage looked normal at the time.
>>>>>>>
>>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Unfortunately, I don't.
>>>>>>>> You can ask the VM vendor or the cloud provider (if you use any) for proper tooling or logs.
>>>>>>>> Make sure that there is no step in the VM's lifecycle that makes it freeze for a minute.
>>>>>>>> Also make sure that the physical CPU is not overutilized and no VMs that run on it are starving.
>>>>>>>>
>>>>>>>> Denis
>>>>>>>>
>>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>
>>>>>>>> Do you know of any good tools I can use to check the VM?
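The settings discussed above — the communication port range 47100-47200 and the node failure detection timeout — can be pinned explicitly. A minimal sketch, assuming programmatic Java configuration rather than the Spring XML actually used in this thread; the values are illustrative only, not advice from the participants:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class ServerNodeStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Pin the communication port range explicitly (defaults are local port 47100
        // with a range of 100, i.e. 47100-47200). Reachability must work in both
        // directions between every pair of nodes.
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setLocalPort(47100);
        commSpi.setLocalPortRange(100);
        cfg.setCommunicationSpi(commSpi);

        // Illustrative value only: raise the general failure detection timeout
        // (default is 10 000 ms) if short network hiccups or VM pauses are expected.
        cfg.setFailureDetectionTimeout(30_000);

        Ignition.start(cfg);
    }
}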
>>>>>>>>
>>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> > Hi Denis, so are you saying I should enable GC logs + the safepoint logs as well?
>>>>>>>>>
>>>>>>>>> Having safepoint statistics in your GC logs may be useful, so I recommend enabling them for troubleshooting purposes.
>>>>>>>>> Check the lifecycle of your virtual machines. There is a high chance that the whole machine is frozen, not just the Ignite node.
>>>>>>>>>
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>
>>>>>>>>> Hi Denis, so are you saying I should enable GC logs + the safepoint logs as well?
>>>>>>>>>
>>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>>>
>>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi!
>>>>>>>>>>>
>>>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>>>
>>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>>>> ...
>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=*57s*]
>>>>>>>>>>>
>>>>>>>>>>> Looks like the JVM was paused for almost a minute. It doesn't seem to be caused by garbage collection, since there is no evidence of GC pressure in the GC log. Usually such big pauses happen in virtualised environments when backups are captured from machines or they just don't have enough CPU time.
>>>>>>>>>>>
>>>>>>>>>>> Looking at safepoint statistics may also reveal some interesting details. You can learn about safepoints here:
>>>>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>>>>
>>>>>>>>>>> Denis
>>>>>>>>>>>
>>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>>>
>>>>>>>>>>> So the error says to set clientFailureDetectionTimeout=30000
>>>>>>>>>>>
>>>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>>>>>>>>> 3- Also, if a client is misbehaving, why shut off the server node?
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> But if it's the client node that's failing, why is the server node stopping? I'm pretty sure we do very simple put and get operations. All the client nodes are started as client=true.
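The clientFailureDetectionTimeout referenced in the error above, together with the related systemWorkerBlockedTimeout (which governs how long a blocked system worker such as the partition-exchanger is tolerated), can both be set on the IgniteConfiguration. A minimal sketch with illustrative values; these are examples, not advice from the thread participants:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class TimeoutConfig {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Illustrative: allow a client node to stay silent for up to 60 s before it is
        // dropped from the cluster (the default is 30 000 ms, matching the warning above).
        cfg.setClientFailureDetectionTimeout(60_000);

        // Illustrative: how long a system-critical worker may be blocked before the
        // configured failure handler reacts (available since Ignite 2.7).
        cfg.setSystemWorkerBlockedTimeout(60_000);

        Ignition.start(cfg);
    }
}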
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda <dma...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not GC, then it might have been caused by a network glitch or some long-running operation started by the app. These log statements:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>
>>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>
>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>>>>>>>>>>>>
>>>>>>>>>>>>> [22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]
>>>>>>>>>>>>>
>>>>>>>>>>>>> -
>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So I have been monitoring my node, and the same one seems to stop once in a while.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have attached the GC logs and the Ignite logs. From what I see in the GC logs, I don't see big pauses. I could be wrong.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
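Regarding the earlier suggestion to add safepoint statistics to the GC logs: the JVM_OPTS block above could be extended along these lines. This is a sketch assuming JDK 8, since the existing options use the pre-unified GC logging flags; on JDK 9+ the -Xlog:gc*,safepoint form is the equivalent.

# Sketch only: record application stopped time in the GC log and print safepoint
# statistics (JDK 8 flags; the safepoint statistics go to stdout, not the GC log file).
JVM_OPTS="${JVM_OPTS} -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime"
JVM_OPTS="${JVM_OPTS} -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"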