Hello! Sure, if nodes can't reach each other, they may eventually become segmented and stop.
Regards,
--
Ilya Kasnacheev

Fri, 25 Oct 2019 at 00:08, John Smith <java.dev....@gmail.com>:

> Is it possible this is somehow causing the issue of the node stopping?
>
> On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
>
>> Hello!
>>
>> This likely means that you have reachability problems in your cluster, for example xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on port range 47100-47200) but not the other way around.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>> Mon, 21 Oct 2019 at 19:36, John Smith <java.dev....@gmail.com>:
>>
>>> I also see this printing every few seconds on my client application...
>>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101, rmtAddr=/xxx.xxx.xxx.82:49816
>>>
>>> On Mon, 21 Oct 2019 at 12:04, John Smith <java.dev....@gmail.com> wrote:
>>>
>>>> Hi, thanks. I already made sure that each Ignite VM runs on a separate host within our cloud. I'm not doing any of that migration stuff.
>>>>
>>>> I also recently disabled metrics logging with igniteConfig.setMetricsLogFrequency(0); just to make sure it doesn't get too chatty. But I doubt this would affect it...
>>>>
>>>> Should I maybe set the node timeout a bit higher than 30 seconds, maybe put it to 60 seconds? I remember people suggesting this but I'm not sure...
>>>>
>>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>
>>>>> The following documentation page has some useful points on deployment in a virtualised environment:
>>>>> https://apacheignite.readme.io/docs/vmware-deployment
>>>>>
>>>>> Denis
>>>>>
>>>>> On 17 Oct 2019, 17:41 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>
>>>>> Ok, I have Metricbeat running on the VM; hopefully I will see something...
>>>>>
>>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>
>>>>>> There are no long pauses in the GC logs, so it must be a pause of the whole VM.
>>>>>>
>>>>>> Denis
>>>>>>
>>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>
>>>>>> Sorry, here are the GC logs for all 3 machines:
>>>>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>>
>>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <java.dev....@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, so it happened again. Here are my latest gc.log stats:
>>>>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>>
>>>>>>> Everything seems OK to me. I also have Elasticsearch Metricbeat running, and the CPU usage looked normal at the time.
>>>>>>>
>>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Unfortunately, I don't.
>>>>>>>> You can ask the VM vendor or the cloud provider (if you use any) for proper tooling or logs.
>>>>>>>> Make sure that there is no step in the VM's lifecycle that makes it freeze for a minute.
>>>>>>>> Also make sure that the physical CPU is not overutilized and no VMs that run on it are starving.
>>>>>>>>
>>>>>>>> Denis
>>>>>>>>
>>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>
>>>>>>>> Do you know of any good tools I can use to check the VM?
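The settings discussed above — the communication port range 47100-47200 and the node failure detection timeout — can be pinned explicitly. A minimal sketch, assuming programmatic Java configuration rather than the Spring XML actually used in this thread; the values are illustrative only, not advice from the participants:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class ServerNodeStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Pin the communication port range explicitly (defaults are local port 47100
        // with a range of 100, i.e. 47100-47200). Reachability must work in both
        // directions between every pair of nodes.
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setLocalPort(47100);
        commSpi.setLocalPortRange(100);
        cfg.setCommunicationSpi(commSpi);

        // Illustrative value only: raise the general failure detection timeout
        // (default is 10 000 ms) if short network hiccups or VM pauses are expected.
        cfg.setFailureDetectionTimeout(30_000);

        Ignition.start(cfg);
    }
}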
>>>>>>>>
>>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> > Hi Denis, so are you saying I should enable GC logs + the safepoint logs as well?
>>>>>>>>>
>>>>>>>>> Having safepoint statistics in your GC logs may be useful, so I recommend enabling them for troubleshooting purposes.
>>>>>>>>> Check the lifecycle of your virtual machines. There is a high chance that the whole machine is frozen, not just the Ignite node.
>>>>>>>>>
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>
>>>>>>>>> Hi Denis, so are you saying I should enable GC logs + the safepoint logs as well?
>>>>>>>>>
>>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>>>
>>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi!
>>>>>>>>>>>
>>>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>>>
>>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>>>> ...
>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=*57s*]
>>>>>>>>>>>
>>>>>>>>>>> Looks like the JVM was paused for almost a minute. It doesn't seem to be caused by garbage collection, since there is no evidence of GC pressure in the GC log. Usually such big pauses happen in virtualised environments when backups are captured from machines or they just don't have enough CPU time.
>>>>>>>>>>>
>>>>>>>>>>> Looking at safepoint statistics may also reveal some interesting details. You can learn about safepoints here:
>>>>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>>>>
>>>>>>>>>>> Denis
>>>>>>>>>>>
>>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>>>
>>>>>>>>>>> So the error says to set clientFailureDetectionTimeout=30000
>>>>>>>>>>>
>>>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>>>>>>>>> 3- Also, if a client is misbehaving, why shut off the server node?
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> But if it's the client node that's failing, why is the server node stopping? I'm pretty sure we do very simple put and get operations. All the client nodes are started as client=true.
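The clientFailureDetectionTimeout referenced in the error above, together with the related systemWorkerBlockedTimeout (which governs how long a blocked system worker such as the partition-exchanger is tolerated), can both be set on the IgniteConfiguration. A minimal sketch with illustrative values; these are examples, not advice from the thread participants:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class TimeoutConfig {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Illustrative: allow a client node to stay silent for up to 60 s before it is
        // dropped from the cluster (the default is 30 000 ms, matching the warning above).
        cfg.setClientFailureDetectionTimeout(60_000);

        // Illustrative: how long a system-critical worker may be blocked before the
        // configured failure handler reacts (available since Ignite 2.7).
        cfg.setSystemWorkerBlockedTimeout(60_000);

        Ignition.start(cfg);
    }
}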
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda <dma...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not GC, then it might have been caused by a network glitch or some long-running operation started by the app. These log statements:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>
>>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>
>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>>>>>>>>>>>>
>>>>>>>>>>>>> [22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]
>>>>>>>>>>>>>
>>>>>>>>>>>>> -
>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So I have been monitoring my node, and the same one seems to stop once in a while.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have attached the GC logs and the Ignite logs. From what I see in the GC logs, I don't see big pauses. I could be wrong.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
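Regarding the earlier suggestion to add safepoint statistics to the GC logs: the JVM_OPTS block above could be extended along these lines. This is a sketch assuming JDK 8, since the existing options use the pre-unified GC logging flags; on JDK 9+ the -Xlog:gc*,safepoint form is the equivalent.

# Sketch only: record application stopped time in the GC log and print safepoint
# statistics (JDK 8 flags; the safepoint statistics go to stdout, not the GC log file).
JVM_OPTS="${JVM_OPTS} -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime"
JVM_OPTS="${JVM_OPTS} -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"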