I also see this printing every few seconds on my client application...

org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101, rmtAddr=/xxx.xxx.xxx.82:49816]
On Mon, 21 Oct 2019 at 12:04, John Smith <java.dev....@gmail.com> wrote:

> Hi, thanks. I already made sure that each Ignite VM runs on a separate
> host within our cloud. I'm not doing any of that migration stuff.
>
> I also recently disabled metrics with igniteConfig.setMetricsLogFrequency(0);
> just to make sure it doesn't get too chatty. But I doubt this would affect
> it...
>
> Should I maybe set the node timeout a bit higher than 30 seconds, maybe put
> it to 60 seconds? I remember people suggesting this, but I'm not sure...
>
> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>
>> The following documentation page has some useful points on deployment in
>> a virtualised environment:
>> https://apacheignite.readme.io/docs/vmware-deployment
>>
>> Denis
>> On 17 Oct 2019, 17:41 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>
>> OK, I have Metricbeat running on the VM, so hopefully I will see something...
>>
>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>
>>> There are no long pauses in the GC logs, so it must be a pause of the
>>> whole VM.
>>>
>>> Denis
>>> On 16 Oct 2019, 23:07 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>
>>> Sorry, here are the GC logs for all 3 machines:
>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>
>>> On Wed, 16 Oct 2019 at 15:49, John Smith <java.dev....@gmail.com> wrote:
>>>
>>>> Hi, so it happened again. Here are my latest gc.log stats:
>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>
>>>> Everything seems OK to me. I also have Elasticsearch Metricbeat
>>>> running, and the CPU usage looked normal at the time.
>>>>
>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>
>>>>> Unfortunately, I don't.
>>>>> You can ask the VM vendor or the cloud provider (if you use any) for
>>>>> proper tooling or logs.
>>>>> Make sure that there is no step in the VM's lifecycle that makes it
>>>>> freeze for a minute.
>>>>> Also make sure that the physical CPU is not overutilized and that no
>>>>> VMs running on it are starving.
>>>>>
>>>>> Denis
>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>
>>>>> Do you know of any good tools I can use to check the VM?
>>>>>
>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>
>>>>>> > Hi Denis, so are you saying I should enable GC logs + the safepoint
>>>>>> > logs as well?
>>>>>>
>>>>>> Having safepoint statistics in your GC logs may be useful, so I
>>>>>> recommend enabling them for troubleshooting purposes.
>>>>>> Check the lifecycle of your virtual machines. There is a high chance
>>>>>> that the whole machine is frozen, not just the Ignite node.
>>>>>>
>>>>>> Denis
>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>
>>>>>> Hi Denis, so are you saying I should enable GC logs + the safepoint
>>>>>> logs as well?
>>>>>>
>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <java.dev....@gmail.com> wrote:
>>>>>>
>>>>>>> You are correct, it is running in a VM.
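A note for readers following along: on the JDK 8 HotSpot JVM implied by the ignite.sh settings at the bottom of this thread, "enabling safepoint logs" usually comes down to a few extra flags next to the existing GC flags. A minimal sketch, assuming the same JVM_OPTS convention used below; on JDK 9+ these flags are replaced by unified logging (-Xlog:safepoint):

    # Sketch only: safepoint visibility flags for JDK 8 HotSpot.
    # PrintGCApplicationStoppedTime adds total stop-the-world durations to
    # the -Xloggc GC log; the safepoint statistics are printed to stdout.
    JVM_OPTS="${JVM_OPTS} -XX:+PrintGCApplicationStoppedTime \
      -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"

This makes it possible to tell time spent reaching and holding safepoints apart from GC time, which is the distinction the whole-VM-pause theory in this thread rests on.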
>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> There are the following messages in the logs:
>>>>>>>>
>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>> ...
>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>>>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>>>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=*57s*]
>>>>>>>>
>>>>>>>> Looks like the JVM was paused for almost a minute. It doesn't seem
>>>>>>>> to be caused by garbage collection, since there is no evidence of GC
>>>>>>>> pressure in the GC log. Usually such big pauses happen in virtualised
>>>>>>>> environments when backups are captured from machines, or when the
>>>>>>>> machines just don't get enough CPU time.
>>>>>>>>
>>>>>>>> Looking at safepoint statistics may also reveal some interesting
>>>>>>>> details. You can learn about safepoints here:
>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>
>>>>>>>> Denis
>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>
>>>>>>>> So the error says to set clientFailureDetectionTimeout=30000.
>>>>>>>>
>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>> 2- Do I do it on the client or the server nodes, or all nodes?
>>>>>>>> 3- Also, if a client is misbehaving, why shut down the server node?
>>>>>>>>
>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> But if it's the client node that's failing, why is the server node
>>>>>>>>> stopping? I'm pretty sure we do very simple put and get operations.
>>>>>>>>> All the client nodes are started as client=true.
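A note on questions 1 and 2 above: clientFailureDetectionTimeout lives on IgniteConfiguration, and the TcpDiscoverySpi warnings quoted below are emitted by the server side when a client stops sending metrics updates, so it is the server nodes' configuration that matters here (verify against the docs for your Ignite version). A minimal sketch of raising it, in the same Java configuration style as the igniteConfig call earlier in the thread; the 60000 ms value is illustrative, not a recommendation:

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class ServerNodeStartup {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // How long a server waits for metrics updates from a client
            // node before dropping it from the topology (default 30000 ms).
            cfg.setClientFailureDetectionTimeout(60_000);

            Ignition.start(cfg);
        }
    }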
>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda <dma...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi John,
>>>>>>>>>>
>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not GC, then
>>>>>>>>>> it might have been caused by a network glitch or some long-running
>>>>>>>>>> operation started by the app. These log statements:
>>>>>>>>>>
>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>
>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>
>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>>>>>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>>>>>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>>>>>>>>>
>>>>>>>>>> [22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical
>>>>>>>>>> system error detected. Will be handled accordingly to configured handler
>>>>>>>>>> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>>>>> super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]],
>>>>>>>>>> failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED,
>>>>>>>>>> err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger,
>>>>>>>>>> igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]
>>>>>>>>>>
>>>>>>>>>> -
>>>>>>>>>> Denis
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> So I have been monitoring my node, and the same one seems to stop
>>>>>>>>>>> once in a while.
>>>>>>>>>>>
>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>
>>>>>>>>>>> I have attached the GC logs and the Ignite logs. From what I see
>>>>>>>>>>> in the GC logs, I don't see big pauses. I could be wrong.
>>>>>>>>>>>
>>>>>>>>>>> The machine has 16GB, and I have the configs here:
>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>
>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>
>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>> fi
>>>>>>>>>>>
>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
>>>>>>>>>>> -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>
>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
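One further note on the failure-handler output quoted in Denis Magda's message above: both the blocked-worker timeout that raises SYSTEM_WORKER_BLOCKED and the handler itself are configurable on IgniteConfiguration in Ignite 2.7+. A minimal sketch with illustrative values, not the configuration actually used in this thread:

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

    public class FailureHandlingSetup {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // How long a system-critical worker such as partition-exchanger
            // may stall before SYSTEM_WORKER_BLOCKED is reported; 60000 ms
            // here is purely illustrative.
            cfg.setSystemWorkerBlockedTimeout(60_000);

            // The handler seen in the log: tryStop=false, timeout=0 means
            // halt the JVM rather than attempt a graceful node stop.
            cfg.setFailureHandler(new StopNodeOrHaltFailureHandler(false, 0));

            Ignition.start(cfg);
        }
    }

Note that in the log above SYSTEM_WORKER_BLOCKED is listed in ignoredFailureTypes, so that particular handler logs the blocked worker without halting for it, which is consistent with the whole-VM-pause theory pursued earlier in the thread.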