Hello! Well, this is the relevant part:

[22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] *Possible too long JVM pause: 55705 milliseconds*.
[22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]
[22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]
[22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
[22:26:21,850][INFO][tcp-disco-sock-reader-#4%xxxxxx%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/xxx.xxx.xxx.63:51827, rmtPort=51827
[22:26:21,933][WARNING][ttl-cleanup-worker-#48%xxxxxx%][G] Thread [name="exchange-worker-#43%xxxxxx%", id=89, state=TIMED_WAITING, blockCnt=6, waitCnt=486312] Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@491d897d, ownerName=null, ownerId=-1]
A long GC pause will cause the node to segment. Try either decreasing the heap size / making sure full GC does not happen, or increasing failureDetectionTimeout (clientFailureDetectionTimeout if set) on all nodes.

Regards,
--
Ilya Kasnacheev

On Tue, 29 Oct 2019 at 18:02, John Smith <java.dev....@gmail.com> wrote:

> Hi, in my original post I had included full logs of the stopped node...
>
> And when you say set to localhost, which property do you mean? Is it the
> one in IgniteConfiguration, setLocalHost()?
>
> On Tue, 29 Oct 2019 at 03:48, Ilya Kasnacheev <ilya.kasnach...@gmail.com>
> wrote:
>
>> Hello!
>>
>> I think it is possible (such as, the node tries to connect to a wrong
>> address / itself) but more detailed analysis of the logs is needed.
>>
>> You can try specifying the localHost property in IgniteConfiguration to
>> make sure the correct address is used.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>> On Mon, 28 Oct 2019 at 23:16, John Smith <java.dev....@gmail.com> wrote:
>>
>>> Ok cool. Thanks. I'm just trying to figure out scenarios where it may
>>> happen.
>>>
>>> When I connect with Ignite Visor and run the "top" topology command, the
>>> client node is bound to multiple addresses, about 12...
>>>
>>> Ignite is running on plain regular VMs while the client node is running
>>> in a containerized env but using host network (which is a regular VM).
>>>
>>> So wondering if that would also screw something up, by trying to
>>> connect to one of those multiple addresses?
>>>
>>> On Mon, 28 Oct 2019 at 12:08, Ilya Kasnacheev <ilya.kasnach...@gmail.com>
>>> wrote:
>>>
>>>> Hello!
>>>>
>>>> Sure, if nodes can't reach each other, eventually they may segment and
>>>> stop.
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>> On Fri, 25 Oct 2019 at 00:08, John Smith <java.dev....@gmail.com> wrote:
>>>>
>>>>> Is it possible this is somehow causing the issue of the node stopping?
>>>>>
>>>>> On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev
>>>>> <ilya.kasnach...@gmail.com> wrote:
>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> This likely means that you have reachability problems in your
>>>>>> cluster, such as, xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on
>>>>>> range 47100-47200) but not the other way around.
>>>>>>
>>>>>> Regards,
>>>>>> --
>>>>>> Ilya Kasnacheev
>>>>>>
>>>>>> On Mon, 21 Oct 2019 at 19:36, John Smith <java.dev....@gmail.com> wrote:
>>>>>>
>>>>>>> I also see this printing every few seconds on my client
>>>>>>> application...
>>>>>>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted
>>>>>>> incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101,
>>>>>>> rmtAddr=/xxx.xxx.xxx.82:49816
>>>>>>>
>>>>>>> On Mon, 21 Oct 2019 at 12:04, John Smith <java.dev....@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi, thanks. I already made sure that each Ignite VM runs on a
>>>>>>>> separate host within our cloud. I'm not doing any of that migration
>>>>>>>> stuff.
>>>>>>>>
>>>>>>>> Also recently disabled metrics with igniteConfig.setMetricsLogFrequency(0);
>>>>>>>> just to make sure it doesn't get too chatty. But I doubt this
>>>>>>>> would affect it...
>>>>>>>>
>>>>>>>> Should I maybe set the node timeout a bit higher than 30 seconds,
>>>>>>>> maybe put it to 60 seconds? I remember people suggesting this but not
>>>>>>>> sure...
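For reference, a minimal sketch of where the timeouts discussed here live, assuming the node is started with programmatic Java configuration; the 60-second values, the class name and the bind address are illustrative placeholders, not settings taken from this thread:

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class ServerNodeStartup {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Timeout server nodes use to drop client nodes that stop sending
            // metrics updates; 30000 ms is the default seen in the warnings above.
            cfg.setClientFailureDetectionTimeout(60_000);

            // General failure detection timeout used between server nodes.
            cfg.setFailureDetectionTimeout(60_000);

            // Pin discovery/communication to one known-good address when the host
            // reports many interfaces (placeholder address).
            cfg.setLocalHost("10.0.0.5");

            Ignition.start(cfg);
        }
    }

The same properties (failureDetectionTimeout, clientFailureDetectionTimeout, localHost) can equally be set on the IgniteConfiguration bean in a Spring XML config such as the one referenced later in the thread.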
>>>>>>>>
>>>>>>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <dmekhani...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The following documentation page has some useful points on
>>>>>>>>> deployment in a virtualised environment:
>>>>>>>>> https://apacheignite.readme.io/docs/vmware-deployment
>>>>>>>>>
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>> On 17 Oct 2019, 17:41 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>
>>>>>>>>> Ok, I have Metricbeat running on the VM, hopefully I will see
>>>>>>>>> something...
>>>>>>>>>
>>>>>>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <dmekhani...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> There are no long pauses in the GC logs, so it must be the whole
>>>>>>>>>> VM pausing.
>>>>>>>>>>
>>>>>>>>>> Denis
>>>>>>>>>>
>>>>>>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>>
>>>>>>>>>> Sorry, here are the GC logs for all 3 machines:
>>>>>>>>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>>>>>>
>>>>>>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <java.dev....@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, so it happened again. Here are my latest gc.log stats:
>>>>>>>>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>>>>>>
>>>>>>>>>>> Everything seems ok to me. I also have Elasticsearch Metricbeat
>>>>>>>>>>> running, and the CPU usage looked normal at the time.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <dmekhani...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately, I don’t.
>>>>>>>>>>>> You can ask the VM vendor or the cloud provider (if you use
>>>>>>>>>>>> any) for proper tooling or logs.
>>>>>>>>>>>> Make sure that there is no step in the VM’s lifecycle
>>>>>>>>>>>> that makes it freeze for a minute.
>>>>>>>>>>>> Also make sure that the physical CPU is not overutilized and no
>>>>>>>>>>>> VMs that run on it are starving.
>>>>>>>>>>>>
>>>>>>>>>>>> Denis
>>>>>>>>>>>>
>>>>>>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Do you know of any good tools I can use to check the VM?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dmekhani...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> > Hi Denis, so are you saying I should enable GC logs + the
>>>>>>>>>>>>> safepoint logs as well?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Having safepoint statistics in your GC logs may be useful, so
>>>>>>>>>>>>> I recommend enabling them for troubleshooting purposes.
>>>>>>>>>>>>> Check the lifecycle of your virtual machines. There is a high
>>>>>>>>>>>>> chance that the whole machine is frozen, not just the Ignite node.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Denis, so are you saying I should enable GC logs + the
>>>>>>>>>>>>> safepoint logs as well?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <java.dev....@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dmekhani...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>>>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>>> blockedFor=*57s*]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looks like the JVM was paused for almost a minute. It
>>>>>>>>>>>>>>> doesn’t seem to be caused by a garbage collection, since there is no
>>>>>>>>>>>>>>> evidence of GC pressure in the GC log. Usually such big pauses happen in
>>>>>>>>>>>>>>> virtualised environments when backups are captured from machines or they
>>>>>>>>>>>>>>> just don’t have enough CPU time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking at safepoint statistics may also reveal some
>>>>>>>>>>>>>>> interesting details. You can learn about safepoints here:
>>>>>>>>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So the error says to set clientFailureDetectionTimeout=30000
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>>>>>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>>>>>>>>>>>>> 3- Also, if a client is misbehaving, why shut off the server node?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But if it's the client node that's failing, why is the
>>>>>>>>>>>>>>>> server node stopping? I'm pretty sure we do very simple put and get
>>>>>>>>>>>>>>>> operations. All the client nodes are started as client=true
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dma...@apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not
>>>>>>>>>>>>>>>>> GC, then it might have been caused by a network glitch or some long-running
>>>>>>>>>>>>>>>>> operation started by the app. These log statements:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>>>>> parameter.
>>>>>>>>>>>>>>>>> [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>>>>> blockedFor=57s]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][]
>>>>>>>>>>>>>>>>> Critical system error detected. Will be handled accordingly to configured
>>>>>>>>>>>>>>>>> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>>>>>>>>>>>> super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]],
>>>>>>>>>>>>>>>>> failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class
>>>>>>>>>>>>>>>>> o.a.i.IgniteException: GridWorker [name=partition-exchanger,
>>>>>>>>>>>>>>>>> igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So I have been monitoring my node and the same one seems
>>>>>>>>>>>>>>>>>> to stop once in a while.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have attached the GC logs and the ignite logs. From
>>>>>>>>>>>>>>>>>> what I see from gc.logs I don't see big pauses. I could be wrong.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
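Regarding Denis's earlier suggestion to capture safepoint statistics alongside the GC logs: with the JDK 8 style flags shown above, that would roughly mean extending JVM_OPTS with the standard HotSpot options, for example (a sketch, not a line from the original thread):

    JVM_OPTS="${JVM_OPTS} -XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"

Note that -XX:+PrintGCApplicationStoppedTime output goes into the gc.log next to the collections, while the safepoint statistics are printed to stdout, so they typically land in the node's console log rather than in gc.log.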