I also see this printing every few seconds on my client application...

org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101, rmtAddr=/xxx.xxx.xxx.82:49816]
On Mon, 21 Oct 2019 at 12:04, John Smith <java.dev....@gmail.com> wrote:

> Hi, thanks. I already made sure that each Ignite VM runs on a separate
> host within our cloud. I'm not doing any of that migration stuff.
>
> I also recently disabled metrics with igniteConfig.setMetricsLogFrequency(0);
> just to make sure it doesn't get too chatty. But I doubt this would affect
> it...
>
> Should I maybe set the node timeout a bit higher than 30 seconds, maybe put
> it to 60 seconds? I remember people suggesting this, but I'm not sure...
>
> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>
>> The following documentation page has some useful points on deployment in
>> a virtualised environment:
>> https://apacheignite.readme.io/docs/vmware-deployment
>>
>> Denis
>> On 17 Oct 2019, 17:41 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>
>> OK, I have Metricbeat running on the VM, so hopefully I will see something...
>>
>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>
>>> There are no long pauses in the GC logs, so it must be a pause of the
>>> whole VM.
>>>
>>> Denis
>>> On 16 Oct 2019, 23:07 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>
>>> Sorry, here are the GC logs for all 3 machines:
>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>
>>> On Wed, 16 Oct 2019 at 15:49, John Smith <java.dev....@gmail.com> wrote:
>>>
>>>> Hi, so it happened again. Here are my latest gc.log stats:
>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>
>>>> Everything seems OK to me. I also have Elasticsearch Metricbeat
>>>> running, and the CPU usage looked normal at the time.
>>>>
>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>
>>>>> Unfortunately, I don't.
>>>>> You can ask the VM vendor or the cloud provider (if you use any) for
>>>>> proper tooling or logs.
>>>>> Make sure that there is no step in the VM's lifecycle that makes it
>>>>> freeze for a minute.
>>>>> Also make sure that the physical CPU is not overutilized and that no
>>>>> VMs running on it are starving.
>>>>>
>>>>> Denis
>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>
>>>>> Do you know of any good tools I can use to check the VM?
>>>>>
>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>
>>>>>> > Hi Denis, so are you saying I should enable GC logs + the safepoint
>>>>>> > logs as well?
>>>>>>
>>>>>> Having safepoint statistics in your GC logs may be useful, so I
>>>>>> recommend enabling them for troubleshooting purposes.
>>>>>> Check the lifecycle of your virtual machines. There is a high chance
>>>>>> that the whole machine is frozen, not just the Ignite node.
>>>>>>
>>>>>> Denis
>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>
>>>>>> Hi Denis, so are you saying I should enable GC logs + the safepoint
>>>>>> logs as well?
>>>>>>
>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <java.dev....@gmail.com> wrote:
>>>>>>
>>>>>>> You are correct, it is running in a VM.
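A note for readers following along: on the JDK 8 HotSpot JVM implied by the ignite.sh settings at the bottom of this thread, "enabling safepoint logs" usually comes down to a few extra flags next to the existing GC flags. A minimal sketch, assuming the same JVM_OPTS convention used below; on JDK 9+ these flags are replaced by unified logging (-Xlog:safepoint):

    # Sketch only: safepoint visibility flags for JDK 8 HotSpot.
    # PrintGCApplicationStoppedTime adds total stop-the-world durations to
    # the -Xloggc GC log; the safepoint statistics are printed to stdout.
    JVM_OPTS="${JVM_OPTS} -XX:+PrintGCApplicationStoppedTime \
      -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"

This makes it possible to tell time spent reaching and holding safepoints apart from GC time, which is the distinction the whole-VM-pause theory in this thread rests on.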
>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> There are the following messages in the logs:
>>>>>>>>
>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>> ...
>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>>>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>>>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=*57s*]
>>>>>>>>
>>>>>>>> Looks like the JVM was paused for almost a minute. It doesn't seem
>>>>>>>> to be caused by garbage collection, since there is no evidence of GC
>>>>>>>> pressure in the GC log. Usually such big pauses happen in virtualised
>>>>>>>> environments when backups are captured from machines, or when the
>>>>>>>> machines just don't get enough CPU time.
>>>>>>>>
>>>>>>>> Looking at safepoint statistics may also reveal some interesting
>>>>>>>> details. You can learn about safepoints here:
>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>
>>>>>>>> Denis
>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>
>>>>>>>> So the error says to set clientFailureDetectionTimeout=30000.
>>>>>>>>
>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>> 2- Do I do it on the client or the server nodes, or all nodes?
>>>>>>>> 3- Also, if a client is misbehaving, why shut down the server node?
>>>>>>>>
>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> But if it's the client node that's failing, why is the server node
>>>>>>>>> stopping? I'm pretty sure we do very simple put and get operations.
>>>>>>>>> All the client nodes are started as client=true.
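A note on questions 1 and 2 above: clientFailureDetectionTimeout lives on IgniteConfiguration, and the TcpDiscoverySpi warnings quoted below are emitted by the server side when a client stops sending metrics updates, so it is the server nodes' configuration that matters here (verify against the docs for your Ignite version). A minimal sketch of raising it, in the same Java configuration style as the igniteConfig call earlier in the thread; the 60000 ms value is illustrative, not a recommendation:

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class ServerNodeStartup {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // How long a server waits for metrics updates from a client
            // node before dropping it from the topology (default 30000 ms).
            cfg.setClientFailureDetectionTimeout(60_000);

            Ignition.start(cfg);
        }
    }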
>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda <dma...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi John,
>>>>>>>>>>
>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not GC, then
>>>>>>>>>> it might have been caused by a network glitch or some long-running
>>>>>>>>>> operation started by the app. These log statements:
>>>>>>>>>>
>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>
>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>
>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>>>>>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>>>>>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>>>>>>>>>
>>>>>>>>>> [22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical
>>>>>>>>>> system error detected. Will be handled accordingly to configured handler
>>>>>>>>>> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>>>>> super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]],
>>>>>>>>>> failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED,
>>>>>>>>>> err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger,
>>>>>>>>>> igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]
>>>>>>>>>>
>>>>>>>>>> -
>>>>>>>>>> Denis
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> So I have been monitoring my node, and the same one seems to stop
>>>>>>>>>>> once in a while.
>>>>>>>>>>>
>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>
>>>>>>>>>>> I have attached the GC logs and the Ignite logs. From what I see
>>>>>>>>>>> in the GC logs, I don't see big pauses. I could be wrong.
>>>>>>>>>>>
>>>>>>>>>>> The machine has 16GB, and I have the configs here:
>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>
>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>
>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>> fi
>>>>>>>>>>>
>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
>>>>>>>>>>> -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>
>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
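One further note on the failure-handler output quoted in Denis Magda's message above: both the blocked-worker timeout that raises SYSTEM_WORKER_BLOCKED and the handler itself are configurable on IgniteConfiguration in Ignite 2.7+. A minimal sketch with illustrative values, not the configuration actually used in this thread:

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

    public class FailureHandlingSetup {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // How long a system-critical worker such as partition-exchanger
            // may stall before SYSTEM_WORKER_BLOCKED is reported; 60000 ms
            // here is purely illustrative.
            cfg.setSystemWorkerBlockedTimeout(60_000);

            // The handler seen in the log: tryStop=false, timeout=0 means
            // halt the JVM rather than attempt a graceful node stop.
            cfg.setFailureHandler(new StopNodeOrHaltFailureHandler(false, 0));

            Ignition.start(cfg);
        }
    }

Note that in the log above SYSTEM_WORKER_BLOCKED is listed in ignoredFailureTypes, so that particular handler logs the blocked worker without halting for it, which is consistent with the whole-VM-pause theory pursued earlier in the thread.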