Hello! I think the node has determined that it needs to shut down:

[22:26:31,800][SEVERE][disco-event-worker-#42%xxxxxx%][FailureProcessor] Ignite node is in invalid state due to a critical failure.

You can tune the failure handler to avoid this behavior: https://apacheignite.readme.io/docs/critical-failures-handling

As for clientFailureDetectionTimeout, it should be set on the server nodes.
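For example, a minimal sketch of a server node configuration (the class name and timeout values are illustrative, pick values that fit your environment):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class ServerNodeStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // How long servers wait for metrics updates from clients before
        // dropping them (the default 30000 ms shows up in the warnings below).
        cfg.setClientFailureDetectionTimeout(60_000);

        // Server-to-server failure detection, in case the pauses also hit servers.
        cfg.setFailureDetectionTimeout(60_000);

        // Explicit failure handler; tryStop=true asks for a graceful stop
        // before the JVM is halted.
        cfg.setFailureHandler(new StopNodeOrHaltFailureHandler(true, 5_000));

        Ignition.start(cfg);
    }
}

Keep in mind that raising the timeouts only masks the pause; the long JVM or VM pause itself still needs to be explained.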
Regards,
--
Ilya Kasnacheev

On Tue, 29 Oct 2019 at 22:21, John Smith <java.dev....@gmail.com> wrote:

> Hi, the GC logs were also provided and we determined that there was no GC pressure, at least that is what I understood from the thread above. We also enabled some extra GC thread info and that showed nothing either.
>
> On the client side I also see this: Possible too long JVM pause: 25468 milliseconds. My SERVER heap size is at -Xmx3g and I even tried -Xmx2g. My client is at 1G.
> 1- So the SERVER node is stopped because the client was misbehaving?
> 2- clientFailureDetectionTimeout should be set on all nodes including the client?
>
> On Tue, 29 Oct 2019 at 11:14, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
>
>> Hello!
>>
>> Well, this is the relevant part:
>>
>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] *Possible too long JVM pause: 55705 milliseconds*.
>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]
>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]
>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>> [22:26:21,850][INFO][tcp-disco-sock-reader-#4%xxxxxx%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/xxx.xxx.xxx.63:51827, rmtPort=51827
>> [22:26:21,933][WARNING][ttl-cleanup-worker-#48%xxxxxx%][G] Thread [name="exchange-worker-#43%xxxxxx%", id=89, state=TIMED_WAITING, blockCnt=6, waitCnt=486312]
>> Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@491d897d, ownerName=null, ownerId=-1]
>>
>> A long GC pause will cause the node to segment.
>>
>> Try either decreasing the heap size / making sure full GC does not happen, or increasing failureDetectionTimeout (clientFailureDetectionTimeout if it is set) on all nodes.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>> On Tue, 29 Oct 2019 at 18:02, John Smith <java.dev....@gmail.com> wrote:
>>
>>> Hi, in my original post I had included full logs of the stopped node...
>>>
>>> And when you say set to localhost, which property do you mean? Is it IgniteConfiguration.setLocalHost()?
>>>
>>> On Tue, 29 Oct 2019 at 03:48, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
>>>
>>>> Hello!
>>>>
>>>> I think it is possible (for example, the node tries to connect to the wrong address or to itself), but a more detailed analysis of the logs is needed.
>>>>
>>>> You can try specifying the localHost property in IgniteConfiguration to make sure the correct address is used.
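>>>>
>>>> For example, a sketch using the Java config API (the address below is a placeholder, use the IP the client node should actually bind to and be reached at):
>>>>
>>>> IgniteConfiguration cfg = new IgniteConfiguration();
>>>> cfg.setClientMode(true);
>>>>
>>>> // Bind discovery and communication to one known address instead of
>>>> // letting Ignite advertise every address the host exposes.
>>>> cfg.setLocalHost("10.0.0.5");
>>>>
>>>> Ignition.start(cfg);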
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>> On Mon, 28 Oct 2019 at 23:16, John Smith <java.dev....@gmail.com> wrote:
>>>>
>>>>> Ok cool. Thanks. I'm just trying to figure out the scenarios where it may happen.
>>>>>
>>>>> When I connect with Ignite Visor and run the "top" topology command, the client node is bound to multiple addresses, about 12...
>>>>>
>>>>> Ignite is running on plain regular VMs while the client node is running in a containerized environment but using the host network (which is a regular VM).
>>>>>
>>>>> So I'm wondering if that could also screw something up, by trying to connect to one of those multiple addresses?
>>>>>
>>>>> On Mon, 28 Oct 2019 at 12:08, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> Sure, if nodes can't reach each other, eventually they may segment and stop.
>>>>>>
>>>>>> Regards,
>>>>>> --
>>>>>> Ilya Kasnacheev
>>>>>>
>>>>>> On Fri, 25 Oct 2019 at 00:08, John Smith <java.dev....@gmail.com> wrote:
>>>>>>
>>>>>>> Is it possible this is somehow causing the issue of the node stopping?
>>>>>>>
>>>>>>> On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello!
>>>>>>>>
>>>>>>>> This likely means that you have reachability problems in your cluster, such as xxx.xxx.xxx.68 being able to connect to xxx.xxx.xxx.82 (on the 47100-47200 port range) but not the other way around.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> --
>>>>>>>> Ilya Kasnacheev
>>>>>>>>
>>>>>>>> On Mon, 21 Oct 2019 at 19:36, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I also see this printing every few seconds on my client application...
>>>>>>>>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101, rmtAddr=/xxx.xxx.xxx.82:49816
>>>>>>>>>
>>>>>>>>> On Mon, 21 Oct 2019 at 12:04, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, thanks. I already made sure that each Ignite VM runs on a separate host within our cloud. I'm not doing any of that migration stuff.
>>>>>>>>>>
>>>>>>>>>> I also recently disabled metrics logging with igniteConfig.setMetricsLogFrequency(0); just to make sure it doesn't get too chatty. But I doubt this would affect it...
>>>>>>>>>>
>>>>>>>>>> Should I maybe set the node timeout a bit higher than 30 seconds, maybe put it at 60 seconds? I remember people suggesting this but I'm not sure...
>>>>>>>>>>
>>>>>>>>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The following documentation page has some useful points on deployment in a virtualised environment: https://apacheignite.readme.io/docs/vmware-deployment
>>>>>>>>>>>
>>>>>>>>>>> Denis
>>>>>>>>>>> On 17 Oct 2019, 17:41 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>>>
>>>>>>>>>>> Ok, I have Metricbeat running on the VM, hopefully I will see something...
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> There are no long pauses in the GC logs, so it must be the whole VM pause.
>>>>>>>>>>>>
>>>>>>>>>>>> Denis
>>>>>>>>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry, here are the GC logs for all 3 machines: https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, so it happened again. Here are my latest gc.log stats: https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>>>>>>>>
>>>>>>>>>>>>> Everything seems ok to me. I also have Elasticsearch Metricbeat running, and the CPU usage looked normal at the time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Unfortunately, I don't.
>>>>>>>>>>>>>> You can ask the VM vendor or the cloud provider (if you use any) for proper tooling or logs.
>>>>>>>>>>>>>> Make sure that there is no step in the VM's lifecycle that makes it freeze for a minute.
>>>>>>>>>>>>>> Also make sure that the physical CPU is not overutilized and no VMs that run on it are starving.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you know of any good tools I can use to check the VM?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Having safepoint statistics in your GC logs may be useful, so I recommend enabling them for troubleshooting purposes.
>>>>>>>>>>>>>>> Check the lifecycle of your virtual machines. There is a high chance that the whole machine is frozen, not just the Ignite node.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=*57s*]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Looks like the JVM was paused for almost a minute. It doesn't seem to be caused by a garbage collection, since there is no evidence of GC pressure in the GC log. Usually such big pauses happen in virtualised environments when backups are captured from machines or they just don't have enough CPU time.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Looking at safepoint statistics may also reveal some interesting details. You can learn about safepoints here: https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
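>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For example, with JDK 8 style flags you could add something like this to the JVM options (just a sketch, keep your existing GC flags and log path):
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> PrintGCApplicationStoppedTime records how long application threads were actually stopped, and the safepoint statistics show what triggered each pause, so you can see whether a pause was spent in GC, in some other safepoint operation, or did not show up inside the JVM at all (which would point at the VM itself).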
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <java.dev....@gmail.com>, wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So the error says to set clientFailureDetectionTimeout=30000
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>>>>>>>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>>>>>>>>>>>>>>> 3- Also, if a client is misbehaving, why shut off the server node?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> But if it's the client node that's failing, why is the server node stopping? I'm pretty sure we do very simple put and get operations. All the client nodes are started with client=true.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dma...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not GC, then it might have been caused by a network glitch or some long-running operation started by the app. These log statements:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> So I have been monitoring my node and the same one seems to stop once in a while.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I have attached the GC logs and the Ignite logs. From what I see in the gc.logs I don't see big pauses, but I could be wrong.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The machine is 16GB and I have the configs here: https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"