Hi!

The following messages appear in the logs:

[22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] 
Possible too long JVM pause: 55705 milliseconds.
...
[22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked 
system-critical thread has been detected. This can lead to cluster-wide 
undefined behaviour [threadName=partition-exchanger, blockedFor=57s]

Looks like the JVM was paused for almost a minute. It doesn’t seem to be caused 
by garbage collection, since there is no evidence of GC pressure in the GC 
log. Pauses this long usually happen in virtualised environments, when backups 
are captured from the machines or when the VMs simply don’t get enough CPU time.
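
If the nodes run on VMs, CPU steal time is a quick way to check for that kind of 
starvation. A minimal sketch (the sampling interval and count are arbitrary); the 
"st" column reported by vmstat is the share of time the hypervisor withheld CPU 
from the guest:

# Sample CPU usage every 5 seconds, 12 times; the last column ("st")
# is steal time - CPU taken away from this VM by the hypervisor.
vmstat 5 12

# mpstat (from the sysstat package) shows the same per CPU as "%steal".
mpstat -P ALL 5 12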

Looking at safepoint statistics may also reveal some interesting details. You 
can learn about safepoints here: 
https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
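
For reference, assuming the nodes run on JDK 8 (as the existing 
-XX:+PrintGCDetails option suggests), flags like the following, added next to the 
current GC options, print the total time threads were stopped and per-safepoint 
statistics; on JDK 9+ the equivalent is -Xlog:safepoint. This is only a sketch, 
not a tuning recommendation:

# Log how long application threads were stopped at each safepoint (JDK 8 flags).
JVM_OPTS="${JVM_OPTS} -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="${JVM_OPTS} -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"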

Denis
On 9 Oct 2019, 23:14 +0300, John Smith <java.dev....@gmail.com>, wrote:
> So the error says to set clientFailureDetectionTimeout=30000
>
> 1- Do I put a higher value than 30000?
> 2- Do I do it on the client or the server nodes or all nodes?
> 3- Also if a client is misbehaving why shutoff the server node?
>
> > On Thu, 3 Oct 2019 at 21:02, John Smith <java.dev....@gmail.com> wrote:
> > > But if it's the client node that's failing, why is the server node 
> > > stopping? I'm pretty sure we do very simple put and get operations. All 
> > > the client nodes are started as client=true
> > >
> > > > On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dma...@apache.org> wrote:
> > > > > Hi John,
> > > > >
> > > > > I don't see any GC pressure or STW pauses either. If not GC, then it 
> > > > > might have been caused by a network glitch or some long-running 
> > > > > operation started by the app. These log statements
> > > > >
> > > > >
> > > > > [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
> > > > >  Client node considered as unreachable and will be dropped from 
> > > > > cluster, because no metrics update messages received in interval: 
> > > > > TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused 
> > > > > by network problems or long GC pause on client node, try to increase 
> > > > > this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, 
> > > > > clientFailureDetectionTimeout=30000]
> > > > >
> > > > > [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
> > > > >  Client node considered as unreachable and will be dropped from 
> > > > > cluster, because no metrics update messages received in interval: 
> > > > > TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused 
> > > > > by network problems or long GC pause on client node, try to increase 
> > > > > this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, 
> > > > > clientFailureDetectionTimeout=30000]
> > > > >
> > > > > [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked 
> > > > > system-critical thread has been detected. This can lead to 
> > > > > cluster-wide undefined behaviour [threadName=partition-exchanger, 
> > > > > blockedFor=57s]
> > > > >
> > > > > [22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical 
> > > > > system error detected. Will be handled accordingly to configured 
> > > > > handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
> > > > > super=AbstractFailureHandler 
> > > > > [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], 
> > > > > failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class 
> > > > > o.a.i.IgniteException: GridWorker [name=partition-exchanger, 
> > > > > igniteInstanceName=xxxxxx, finished=false, 
> > > > > heartbeatTs=1568931981805]]]
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > -
> > > > > Denis
> > > > >
> > > > >
> > > > > > On Thu, Oct 3, 2019 at 11:50 AM John Smith <java.dev....@gmail.com> 
> > > > > > wrote:
> > > > > > > So I have been monitoring my node and the same one seems to stop 
> > > > > > > once in a while.
> > > > > > >
> > > > > > > https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
> > > > > > >
> > > > > > > I have attached the GC logs and the ignite logs. From what I see 
> > > > > > > from gc.logs I don't see big pauses. I could be wrong.
> > > > > > >
> > > > > > > The machine is 16GB and I have the configs here: 
> > > > > > > https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
> > > > > > >
> > > > > > > Here are the JVM settings...
> > > > > > >
> > > > > > > if [ -z "$JVM_OPTS" ] ; then
> > > > > > >     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
> > > > > > > fi
> > > > > > >
> > > > > > > JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails 
> > > > > > > -Xloggc:/var/log/apache-ignite/gc.log"
> > > > > > >
> > > > > > > JVM_OPTS="${JVM_OPTS} -Xss16m"
