Hello! I think the node has determined that it needs to shut down:

[22:26:31,800][SEVERE][disco-event-worker-#42%xxxxxx%][FailureProcessor] Ignite node is in invalid state due to a critical failure.

You can tune the failure handler to avoid this behavior: https://apacheignite.readme.io/docs/critical-failures-handling

As for clientFailureDetectionTimeout, it should be set on the server nodes.
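For example, a minimal sketch of a server node configuration (the class name and timeout values are illustrative, pick values that fit your environment):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class ServerNodeStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // How long servers wait for metrics updates from clients before
        // dropping them (the default 30000 ms shows up in the warnings below).
        cfg.setClientFailureDetectionTimeout(60_000);

        // Server-to-server failure detection, in case the pauses also hit servers.
        cfg.setFailureDetectionTimeout(60_000);

        // Explicit failure handler; tryStop=true asks for a graceful stop
        // before the JVM is halted.
        cfg.setFailureHandler(new StopNodeOrHaltFailureHandler(true, 5_000));

        Ignition.start(cfg);
    }
}

Keep in mind that raising the timeouts only masks the pause; the long JVM or VM pause itself still needs to be explained.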
Regards,
--
Ilya Kasnacheev

On Tue, 29 Oct 2019 at 22:21, John Smith <java.dev....@gmail.com> wrote:

> Hi, the GC logs were also provided and we determined that there was no GC pressure, at least that is what I understood from the thread above. We also enabled some extra GC thread info and that showed nothing either.
>
> On the client side I also see this: Possible too long JVM pause: 25468 milliseconds. My SERVER heap size is at -Xmx3g and I even tried -Xmx2g. My client is at 1G.
> 1- So the SERVER node is stopped because the client was misbehaving?
> 2- clientFailureDetectionTimeout should be set on all nodes including the client?
>
> On Tue, 29 Oct 2019 at 11:14, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
>
>> Hello!
>>
>> Well, this is the relevant part:
>>
>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] *Possible too long JVM pause: 55705 milliseconds*.
>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]
>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]
>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>> [22:26:21,850][INFO][tcp-disco-sock-reader-#4%xxxxxx%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/xxx.xxx.xxx.63:51827, rmtPort=51827
>> [22:26:21,933][WARNING][ttl-cleanup-worker-#48%xxxxxx%][G] Thread [name="exchange-worker-#43%xxxxxx%", id=89, state=TIMED_WAITING, blockCnt=6, waitCnt=486312]
>> Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@491d897d, ownerName=null, ownerId=-1]
>>
>> A long GC pause will cause the node to segment.
>>
>> Try either decreasing the heap size / making sure full GC does not happen, or increasing failureDetectionTimeout (clientFailureDetectionTimeout if it is set) on all nodes.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>> On Tue, 29 Oct 2019 at 18:02, John Smith <java.dev....@gmail.com> wrote:
>>
>>> Hi, in my original post I had included full logs of the stopped node...
>>>
>>> And when you say set to localhost, which property do you mean? Is it IgniteConfiguration.setLocalHost()?
>>>
>>> On Tue, 29 Oct 2019 at 03:48, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
>>>
>>>> Hello!
>>>>
>>>> I think it is possible (for example, the node tries to connect to the wrong address or to itself), but a more detailed analysis of the logs is needed.
>>>>
>>>> You can try specifying the localHost property in IgniteConfiguration to make sure the correct address is used.
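>>>>
>>>> For example, a sketch using the Java config API (the address below is a placeholder, use the IP the client node should actually bind to and be reached at):
>>>>
>>>> IgniteConfiguration cfg = new IgniteConfiguration();
>>>> cfg.setClientMode(true);
>>>>
>>>> // Bind discovery and communication to one known address instead of
>>>> // letting Ignite advertise every address the host exposes.
>>>> cfg.setLocalHost("10.0.0.5");
>>>>
>>>> Ignition.start(cfg);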
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>> On Mon, 28 Oct 2019 at 23:16, John Smith <java.dev....@gmail.com> wrote:
>>>>
>>>>> Ok cool. Thanks. I'm just trying to figure out the scenarios where it may happen.
>>>>>
>>>>> When I connect with Ignite Visor and run the "top" topology command, the client node is bound to multiple addresses, about 12...
>>>>>
>>>>> Ignite is running on plain regular VMs while the client node is running in a containerized environment but using the host network (which is a regular VM).
>>>>>
>>>>> So I'm wondering if that could also screw something up, by trying to connect to one of those multiple addresses?
>>>>>
>>>>> On Mon, 28 Oct 2019 at 12:08, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> Sure, if nodes can't reach each other, eventually they may segment and stop.
>>>>>>
>>>>>> Regards,
>>>>>> --
>>>>>> Ilya Kasnacheev
>>>>>>
>>>>>> On Fri, 25 Oct 2019 at 00:08, John Smith <java.dev....@gmail.com> wrote:
>>>>>>
>>>>>>> Is it possible this is somehow causing the issue of the node stopping?
>>>>>>>
>>>>>>> On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello!
>>>>>>>>
>>>>>>>> This likely means that you have reachability problems in your cluster, such as xxx.xxx.xxx.68 being able to connect to xxx.xxx.xxx.82 (on the 47100-47200 port range) but not the other way around.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> --
>>>>>>>> Ilya Kasnacheev
>>>>>>>>
>>>>>>>> On Mon, 21 Oct 2019 at 19:36, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I also see this printing every few seconds on my client application...
>>>>>>>>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101, rmtAddr=/xxx.xxx.xxx.82:49816
>>>>>>>>>
>>>>>>>>> On Mon, 21 Oct 2019 at 12:04, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, thanks. I already made sure that each Ignite VM runs on a separate host within our cloud. I'm not doing any of that migration stuff.
>>>>>>>>>>
>>>>>>>>>> I also recently disabled metrics logging with igniteConfig.setMetricsLogFrequency(0); just to make sure it doesn't get too chatty. But I doubt this would affect it...
>>>>>>>>>>
>>>>>>>>>> Should I maybe set the node timeout a bit higher than 30 seconds, maybe put it at 60 seconds? I remember people suggesting this but I'm not sure...
>>>>>>>>>>
>>>>>>>>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The following documentation page has some useful points on deployment in a virtualised environment: https://apacheignite.readme.io/docs/vmware-deployment
>>>>>>>>>>>
>>>>>>>>>>> Denis
>>>>>>>>>>> On 17 Oct 2019, 17:41 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>>>
>>>>>>>>>>> Ok, I have Metricbeat running on the VM, hopefully I will see something...
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> There are no long pauses in the GC logs, so it must be the whole VM pause.
>>>>>>>>>>>>
>>>>>>>>>>>> Denis
>>>>>>>>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry, here are the GC logs for all 3 machines: https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, so it happened again. Here are my latest gc.log stats: https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>>>>>>>>
>>>>>>>>>>>>> Everything seems ok to me. I also have Elasticsearch Metricbeat running, and the CPU usage looked normal at the time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Unfortunately, I don't.
>>>>>>>>>>>>>> You can ask the VM vendor or the cloud provider (if you use any) for proper tooling or logs.
>>>>>>>>>>>>>> Make sure that there is no step in the VM's lifecycle that makes it freeze for a minute.
>>>>>>>>>>>>>> Also make sure that the physical CPU is not overutilized and no VMs that run on it are starving.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you know of any good tools I can use to check the VM?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Having safepoint statistics in your GC logs may be useful, so I recommend enabling them for troubleshooting purposes.
>>>>>>>>>>>>>>> Check the lifecycle of your virtual machines. There is a high chance that the whole machine is frozen, not just the Ignite node.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=*57s*]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Looks like the JVM was paused for almost a minute. It doesn't seem to be caused by a garbage collection, since there is no evidence of GC pressure in the GC log. Usually such big pauses happen in virtualised environments when backups are captured from machines or they just don't have enough CPU time.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Looking at safepoint statistics may also reveal some interesting details. You can learn about safepoints here: https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
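>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For example, with JDK 8 style flags you could add something like this to the JVM options (just a sketch, keep your existing GC flags and log path):
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> PrintGCApplicationStoppedTime records how long application threads were actually stopped, and the safepoint statistics show what triggered each pause, so you can see whether a pause was spent in GC, in some other safepoint operation, or did not show up inside the JVM at all (which would point at the VM itself).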
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So the error says to set clientFailureDetectionTimeout=30000
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>>>>>>>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>>>>>>>>>>>>>>> 3- Also, if a client is misbehaving, why shut off the server node?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> But if it's the client node that's failing, why is the server node stopping? I'm pretty sure we do very simple put and get operations. All the client nodes are started with client=true.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dma...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not GC, then it might have been caused by a network glitch or some long-running operation started by the app. These log statements:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> So I have been monitoring my node and the same one seems to stop once in a while.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I have attached the GC logs and the Ignite logs. From what I see in the gc.logs I don't see big pauses, but I could be wrong.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The machine is 16GB and I have the configs here: https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"