Hi, in my original post I had included full logs of the stopped node...

And when you say set to localhost, which property do you mean? Is it
IgniteConfiguration.setLocalHost()?
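
Something like this, just so I know I'm looking at the right property? (A
rough sketch on my side; the address is only a placeholder for the VM's real
one.)

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class StartNode {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Bind discovery/communication to one explicit local address
        // instead of advertising every interface the host has.
        cfg.setLocalHost("10.0.0.12"); // placeholder address

        Ignite ignite = Ignition.start(cfg);
    }
}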

On Tue, 29 Oct 2019 at 03:48, Ilya Kasnacheev <ilya.kasnach...@gmail.com>
wrote:

> Hello!
>
> I think it is possible (e.g., a node tries to connect to the wrong address or
> to itself), but a more detailed analysis of the logs is needed.
>
> You can try specifying localHost property in IgniteConfiguration to make
> sure correct address is used.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> пн, 28 окт. 2019 г. в 23:16, John Smith <java.dev....@gmail.com>:
>
>> Ok cool. Thanks. I'm just trying to figure out scenarios where it may happen.
>>
>> When I connect with Ignite Visor and run the "top" topology command, the
>> client node shows as bound to multiple addresses, about 12...
>>
>> Ignite is running on plain regular VMs, while the client node is running
>> in a containerized env but using the host network (which is a regular VM).
>>
>> So I'm wondering if that could also screw things up, by trying to
>> connect to one of those multiple addresses?
>>
>> On Mon, 28 Oct 2019 at 12:08, Ilya Kasnacheev <ilya.kasnach...@gmail.com>
>> wrote:
>>
>>> Hello!
>>>
>>> Sure, if nodes can't reach each other, eventually they may segment and
>>> stop.
>>>
>>> Regards,
>>> --
>>> Ilya Kasnacheev
>>>
>>>
>>> пт, 25 окт. 2019 г. в 00:08, John Smith <java.dev....@gmail.com>:
>>>
>>>> Is it possible this is somehow causing the issue of the node stopping?
>>>>
>>>> On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <
>>>> ilya.kasnach...@gmail.com> wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> This likely means that you have reachability problems in your cluster:
>>>>> for example, xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on port range
>>>>> 47100-47200) but not the other way around.
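>>>>>
>>>>> As a quick sanity check, something like this (just a rough sketch; the
>>>>> address is a placeholder and 47100 is the default first communication
>>>>> port) can be run from each host toward the other to confirm the port is
>>>>> reachable in both directions:
>>>>>
>>>>> import java.io.IOException;
>>>>> import java.net.InetSocketAddress;
>>>>> import java.net.Socket;
>>>>>
>>>>> public class ReachabilityCheck {
>>>>>     public static void main(String[] args) {
>>>>>         String host = "xxx.xxx.xxx.82"; // placeholder: the peer's address
>>>>>         int port = 47100;               // default communication port
>>>>>
>>>>>         try (Socket s = new Socket()) {
>>>>>             // Use a short timeout so a filtered port fails fast.
>>>>>             s.connect(new InetSocketAddress(host, port), 3_000);
>>>>>             System.out.println("OK: " + host + ":" + port + " is reachable");
>>>>>         } catch (IOException e) {
>>>>>             System.out.println("FAILED: " + host + ":" + port + " - " + e.getMessage());
>>>>>         }
>>>>>     }
>>>>> }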
>>>>>
>>>>> Regards,
>>>>> --
>>>>> Ilya Kasnacheev
>>>>>
>>>>>
>>>>> пн, 21 окт. 2019 г. в 19:36, John Smith <java.dev....@gmail.com>:
>>>>>
>>>>>> I also see this printing every few seconds on my client application...
>>>>>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted
>>>>>> incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101,
>>>>>> rmtAddr=/xxx.xxx.xxx.82:49816
>>>>>>
>>>>>> On Mon, 21 Oct 2019 at 12:04, John Smith <java.dev....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, thanks. I already made sure that each Ignite VM runs on a
>>>>>>> separate host within our cloud. I'm not doing any of that migration
>>>>>>> stuff.
>>>>>>>
>>>>>>> I also recently disabled the metrics log with
>>>>>>> igniteConfig.setMetricsLogFrequency(0); just to make sure it doesn't
>>>>>>> get too chatty, but I doubt this would affect it...
>>>>>>>
>>>>>>> Should I maybe set the node timeout a bit higher than 30 seconds,
>>>>>>> maybe put it at 60 seconds? I remember people suggesting this but I'm
>>>>>>> not sure...
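>>>>>>>
>>>>>>> If so, I guess it would look something like this where we build the
>>>>>>> IgniteConfiguration (just a sketch on my end; 60 seconds is only a
>>>>>>> guess at a value):
>>>>>>>
>>>>>>> // igniteConfig is our existing IgniteConfiguration instance.
>>>>>>> // Default clientFailureDetectionTimeout is 30000 ms; raising it gives
>>>>>>> // slow clients more time before they are dropped.
>>>>>>> igniteConfig.setClientFailureDetectionTimeout(60_000);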
>>>>>>>
>>>>>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <
>>>>>>> dmekhani...@gmail.com> wrote:
>>>>>>>
>>>>>>>> The following documentation page has some useful points on
>>>>>>>> deployment in a virtualised environment:
>>>>>>>> https://apacheignite.readme.io/docs/vmware-deployment
>>>>>>>>
>>>>>>>> Denis
>>>>>>>> On 17 Oct 2019, 17:41 +0300, John Smith <java.dev....@gmail.com>,
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Ok, I have Metricbeat running on the VM, hopefully I will see
>>>>>>>> something...
>>>>>>>>
>>>>>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <
>>>>>>>> dmekhani...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> There are no long pauses in the GC logs, so it must be the whole
>>>>>>>>> VM pause.
>>>>>>>>>
>>>>>>>>> Denis
>>>>>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <java.dev....@gmail.com>,
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Sorry, here are the GC logs for all 3 machines:
>>>>>>>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>>>>>
>>>>>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <java.dev....@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, so it happened again. Here are my latest gc.log stats:
>>>>>>>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>>>>>
>>>>>>>>>> Everything seems ok to me. I also have Elasticsearch Metricbeat
>>>>>>>>>> running, and the CPU usage looked normal at the time.
>>>>>>>>>>
>>>>>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <
>>>>>>>>>> dmekhani...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Unfortunately, I don't.
>>>>>>>>>>> You can ask the VM vendor or the cloud provider (if you use any)
>>>>>>>>>>> for proper tooling or logs.
>>>>>>>>>>> Make sure that there is no step in the VM's lifecycle that makes
>>>>>>>>>>> it freeze for a minute.
>>>>>>>>>>> Also make sure that the physical CPU is not overutilized and that
>>>>>>>>>>> no VMs running on it are starving.
>>>>>>>>>>>
>>>>>>>>>>> Denis
>>>>>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <java.dev....@gmail.com>,
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Do you know of any good tools I can use to check the VM?
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <
>>>>>>>>>>> dmekhani...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> > Hi Denis, so are you saying I should enable GC logs + the
>>>>>>>>>>>> safepoint logs as well?
>>>>>>>>>>>>
>>>>>>>>>>>> Having safepoint statistics in your GC logs may be useful, so I
>>>>>>>>>>>> recommend enabling them for troubleshooting purposes.
>>>>>>>>>>>> Check the lifecycle of your virtual machines. There is a high
>>>>>>>>>>>> chance that the whole machine is frozen, not just the Ignite node.
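>>>>>>>>>>>>
>>>>>>>>>>>> For the safepoint statistics, on JDK 8 it would be flags roughly
>>>>>>>>>>>> like these added next to your existing GC options (a sketch; note
>>>>>>>>>>>> that the safepoint statistics are printed to stdout rather than
>>>>>>>>>>>> the GC log file):
>>>>>>>>>>>>
>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -XX:+PrintGCApplicationStoppedTime"
>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"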
>>>>>>>>>>>>
>>>>>>>>>>>> Denis
>>>>>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <java.dev....@gmail.com>,
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Denis, so are you saying I should enable GC logs + the
>>>>>>>>>>>> safepoint logs as well?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <
>>>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <
>>>>>>>>>>>>> dmekhani...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead 
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>> blockedFor=*57s*]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looks like the JVM was paused for almost a minute. It doesn’t
>>>>>>>>>>>>>> seem to be caused by a garbage collection, since there is no 
>>>>>>>>>>>>>> evidence of GC
>>>>>>>>>>>>>> pressure in the GC log. Usually such big pauses happen in 
>>>>>>>>>>>>>> virtualised
>>>>>>>>>>>>>> environments when backups are captured from machines or they 
>>>>>>>>>>>>>> just don’t
>>>>>>>>>>>>>> have enough CPU time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looking at safepoint statistics may also reveal some
>>>>>>>>>>>>>> interesting details. You can learn about safepoints here:
>>>>>>>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <
>>>>>>>>>>>>>> java.dev....@gmail.com>, wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So the error says to set clientFailureDetectionTimeout=30000
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>>>>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>>>>>>>>>>>> 3- Also, if a client is misbehaving, why shut off the server
>>>>>>>>>>>>>> node?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <
>>>>>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But if it's the client node that's failing, why is the server
>>>>>>>>>>>>>>> node stopping? I'm pretty sure we do very simple put and get
>>>>>>>>>>>>>>> operations. All the client nodes are started with client=true.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <
>>>>>>>>>>>>>>> dma...@apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not GC,
>>>>>>>>>>>>>>>> then it might have been caused by a network glitch or some
>>>>>>>>>>>>>>>> long-running operation started by the app. These log statements:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from 
>>>>>>>>>>>>>>>> cluster,
>>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be 
>>>>>>>>>>>>>>>> caused by
>>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to 
>>>>>>>>>>>>>>>> increase this
>>>>>>>>>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from 
>>>>>>>>>>>>>>>> cluster,
>>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be 
>>>>>>>>>>>>>>>> caused by
>>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to 
>>>>>>>>>>>>>>>> increase this
>>>>>>>>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can 
>>>>>>>>>>>>>>>> lead to
>>>>>>>>>>>>>>>> cluster-wide undefined behaviour 
>>>>>>>>>>>>>>>> [threadName=partition-exchanger,
>>>>>>>>>>>>>>>> blockedFor=57s]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][]
>>>>>>>>>>>>>>>> Critical system error detected. Will be handled accordingly to 
>>>>>>>>>>>>>>>> configured
>>>>>>>>>>>>>>>> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, 
>>>>>>>>>>>>>>>> timeout=0,
>>>>>>>>>>>>>>>> super=AbstractFailureHandler
>>>>>>>>>>>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], 
>>>>>>>>>>>>>>>> failureCtx=FailureContext
>>>>>>>>>>>>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: 
>>>>>>>>>>>>>>>> GridWorker
>>>>>>>>>>>>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, 
>>>>>>>>>>>>>>>> finished=false,
>>>>>>>>>>>>>>>> heartbeatTs=1568931981805]]]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <
>>>>>>>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So I have been monitoring my node and the same one seems
>>>>>>>>>>>>>>>>> to stop once in a while.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have attached the GC logs and the Ignite logs. From what
>>>>>>>>>>>>>>>>> I see in the GC logs, I don't see big pauses. I could be wrong.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The machine has 16GB of RAM, and I have the configs here:
>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server
>>>>>>>>>>>>>>>>> -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc
>>>>>>>>>>>>>>>>> -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
