Hello Ignite Experts,
We recently upgraded from Ignite 2.6 to 2.8.0 and have been seeing some
unexpected behavior since then.
With the below configuration:
<property name="metricsUpdateFrequency" value="1000"/>
<property name="failureDetectionTimeout" value="1100"/>
<property name="clientFailureDetectionTimeout" value="1000"/>
we see the log entries below (with different thread names) multiple times
every second as soon as the Ignite server is started:
06-11 08:40:50,309978 [61] ERROR G(Ignite) - Blocked system-critical thread
has been detected. This can lead to cluster-wide undefined behaviour
[workerName=grid-nio-worker-tcp-comm-3,
threadName=grid-nio-worker-tcp-comm-3-#27%IgniteCluster1%, blockedFor=1s]
06-11 08:40:50,310267 [61] WARN G(Ignite) - Thread
[name="grid-nio-worker-tcp-comm-3-#27%IgniteCluster1%", id=40,
state=RUNNABLE, blockCnt=0, waitCnt=0]
06-11 08:40:50,310609 [61] WARN (Ignite) - Possible failure suppressed
accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler
[tryStop=false, timeout=0, super=AbstractFailureHandler
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=grid-nio-worker-tcp-comm-3, igniteInstanceName=IgniteCluster1,
finished=false, heartbeatTs=1591864848869]]]
06-11 08:40:50,311975 [61] WARN CacheDiagnosticManager(Ignite) - Page locks
dump:
Thread=[name=NgServiceProvider_EventsExecutor_0, id=77], state=WAITING
Locked pages = []
Locked pages log: name=NgServiceProvider_EventsExecutor_0
time=(1591864850311, 2020-06-11 14:10:50.311)
Thread=[name=exchange-worker-#43%IgniteCluster1%, id=63],
state=TIMED_WAITING
Locked pages = []
Locked pages log: name=exchange-worker-#43%IgniteCluster1%
time=(1591864850311, 2020-06-11 14:10:50.311)
Thread=[name=sys-#45%IgniteCluster1%, id=65], state=TIMED_WAITING
Locked pages = []
Locked pages log: name=sys-#45%IgniteCluster1% time=(1591864850311,
2020-06-11 14:10:50.311)
Thread=[name=sys-#48%IgniteCluster1%, id=68], state=TIMED_WAITING
Locked pages = []
Locked pages log: name=sys-#48%IgniteCluster1% time=(1591864850311,
2020-06-11 14:10:50.311)
Thread=[name=sys-#49%IgniteCluster1%, id=69], state=TIMED_WAITING
Locked pages = []
Locked pages log: name=sys-#49%IgniteCluster1% time=(1591864850311,
2020-06-11 14:10:50.311)
Thread=[name=sys-#51%IgniteCluster1%, id=71], state=TIMED_WAITING
Locked pages = []
Locked pages log: name=sys-#51%IgniteCluster1% time=(1591864850311,
2020-06-11 14:10:50.311)
The full logs are attached here:
WebIgniteService_WEB.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t2754/WebIgniteService_WEB.log>
However, if I change my timeouts like this:
<property name="metricsUpdateFrequency" value="2000"/>
<property name="failureDetectionTimeout" value="10000"/>
<property name="clientFailureDetectionTimeout" value="30000"/>
the message still occurs, but far less frequently (I observed it only after
I had added 1 server and 5 clients and communication had started between
them).
I did some research and found that this is related to the Critical Workers
Health Check feature, which I think is a great addition to Ignite, but I
have a few questions about it.
Question-1: Does the log above mean that the thread is really blocked, or is
it just busy doing something else? (The thread dump shows state=RUNNABLE.)
Question-2: How can I decide on suitable values for these timeouts in my
case? The former values worked for me on 2.6, but now I get this error.
Question-3: I can see that this failure type (SYSTEM_WORKER_BLOCKED) is
ignored by default, so why do we still see it logged at ERROR level?
Question-4: As a remedy to this, I have thought of adding another timeout to
my configuration:
<property name="systemWorkerBlockedTimeout" value="30000"/>
I read that failureDetectionTimeout is ignored in case any other timeout is
set. Would that mean my failure detection timeout would now also become
30000? Or would failureDetectionTimeout keep its configured value, with only
systemWorkerBlockedTimeout (now 30000) being used in its place for the
worker health check?
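To make Question-4 concrete, this is roughly the configuration I have in
mind. It is only a sketch: the enclosing bean declaration is an assumption
about how these properties sit in the IgniteConfiguration (the surrounding
<beans> element is omitted), and combining the new timeout with the second
set of values above is just for illustration, not a recommendation.

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Values taken from the second set of timeouts in this post -->
    <property name="metricsUpdateFrequency" value="2000"/>
    <property name="failureDetectionTimeout" value="10000"/>
    <property name="clientFailureDetectionTimeout" value="30000"/>

    <!-- Proposed addition: a separate timeout for the critical worker check -->
    <property name="systemWorkerBlockedTimeout" value="30000"/>
</bean>
```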
Question-5: How should I choose the value for systemWorkerBlockedTimeout?
Are there any guidelines for this?
Question-6: As I understand from
https://issues.apache.org/jira/browse/IGNITE-10154, this
SYSTEM_WORKER_BLOCKED failure is ignored by default, but setting a positive
value for systemWorkerBlockedTimeout would actually enable it. However, I'm
not sure I want that right now. How else can I handle this scenario so that
I don't get these unnecessary and very frequent error messages, without
enabling this failure?
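For reference, this is how I read the handler printed in the log above if it
were written out explicitly in the Spring XML. It is an untested sketch
under my own assumptions: that the no-argument StopNodeOrHaltFailureHandler
corresponds to the tryStop=false, timeout=0 shown in the log, and that
Spring converts the string values to FailureType constants.

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="failureHandler">
        <!-- Same handler class as the one reported in the log -->
        <bean class="org.apache.ignite.failure.StopNodeOrHaltFailureHandler">
            <!-- Failure types the handler reports but does not act on -->
            <property name="ignoredFailureTypes">
                <set>
                    <value>SYSTEM_WORKER_BLOCKED</value>
                    <value>SYSTEM_CRITICAL_OPERATION_TIMEOUT</value>
                </set>
            </property>
        </bean>
    </property>
</bean>
```

As far as I can tell this only mirrors what is already in effect, so by
itself it would not be expected to change the logging.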
Please correct me if I'm wrong anywhere.
Thanks in advance.
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/