Hi,

Question-1: Does the above log mean that the thread is really blocked? Or is it just busy doing something else?

The thread in question might be busy doing something else and so did not update its heartbeat timestamp. See: https://apacheignite.readme.io/docs/critical-failures-handling#critical-workers-health-check
When that happens, a thread dump is generated and the pre-configured failure handler is called.

Question-2: How can I decide the suitable values of these timeouts for my case? The former values were working for me earlier but now I face this exception.

Usually the default values work best; otherwise, determine the optimal settings for your use case experimentally.

Question-3: I can see that this failure type (WORKER_THREAD_BLOCKED) is actually ignored by default, so why do we still see it as an ERROR in logs?

When a thread is blocked, the event in question is WORKER_THREAD_BLOCKED. The error message reflects the event type. It is up to the failure handler to handle that event. See: https://apacheignite.readme.io/docs/critical-failures-handling#failure-handling (see implementation below)

Question-4: As a remedy to this, I have thought of adding another timeout to my configuration: <property name="systemWorkerBlockedTimeout" value="30000"/> I read that failureDetectionTimeout is ignored in case any other timeout is set. Would that mean now my failure detection timeout would also become 30000? Or would it mean that failureDetectionTimeout would still be the configured value and just that its value will be ignored for systemWorkerBlockedTimeout (which would now be 30000)?

Setting systemWorkerBlockedTimeout affects the timeout related to that property, and nothing else.

Question-5: How to decide the value for systemWorkerBlockedTimeout, do we have some guidelines here?

Leave the default, or determine a value experimentally.

Question-6: As I can see in https://issues.apache.org/jira/browse/IGNITE-10154, this WORKER_THREAD_BLOCKED failure is ignored by default, but on setting some positive value for systemWorkerBlockedTimeout, it would actually start working. However, I'm not sure if I want that right now. How else can I handle this scenario so that I don't get these unnecessary and very frequent exceptions without enabling this failure?

This was an issue with an old version (2.7) and is now resolved. Setting systemWorkerBlockedTimeout affects only that property. The liveness check is enabled either way, and failure handling is the sole domain of the configured failure handler.

Here are some links to the implementation to make it clearer:

All relevant threads get put into a workers registry:
https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/worker/WorkersRegistry.java

The registry gets started by the kernel:
https://github.com/apache/ignite/blob/c3a2deb8f464e4547f65164d2ad62b10854cb199/modules/core/src/main/java/org/apache/ignite/internal/IgnitionEx.java#L1805

The lambda uses the FailureProcessor to handle failures using the configured FailureHandler (default StopNodeOrHaltFailureHandler):
https://github.com/apache/ignite/blob/c3a2deb8f464e4547f65164d2ad62b10854cb199/modules/core/src/main/java/org/apache/ignite/internal/processors/failure/FailureProcessor.java#L156

The workers registry continuously monitors all threads here. Take a look at the error message and how the thread dump is generated:
https://github.com/apache/ignite/blob/c3a2deb8f464e4547f65164d2ad62b10854cb199/modules/core/src/main/java/org/apache/ignite/internal/worker/WorkersRegistry.java#L175

Thanks,
Alex
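
For readers following the thread, the heartbeat idea behind the answer to Question-1 can be modeled roughly as follows. This is only a simplified sketch and not Ignite's actual code: the class HeartbeatSketch, the Worker type, the findStale method, and the 10-second timeout are all hypothetical names and values chosen for illustration. It shows why a worker that is merely busy, and therefore has not refreshed its timestamp, gets reported the same way as one that is truly stuck.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified model of a heartbeat-based liveness check; NOT Ignite's implementation. */
public class HeartbeatSketch {
    /** Hypothetical monitored worker that is expected to refresh its heartbeat regularly. */
    static final class Worker {
        final String name;
        volatile long heartbeatTs;

        Worker(String name, long startTs) {
            this.name = name;
            this.heartbeatTs = startTs;
        }

        /** Called by the worker itself whenever it makes progress. */
        void updateHeartbeat(long now) {
            heartbeatTs = now;
        }
    }

    /** Returns the names of workers whose last heartbeat is older than timeoutMs at time now. */
    static List<String> findStale(List<Worker> workers, long now, long timeoutMs) {
        List<String> stale = new ArrayList<>();
        for (Worker w : workers) {
            if (now - w.heartbeatTs > timeoutMs)
                stale.add(w.name);
        }
        return stale;
    }

    public static void main(String[] args) {
        Worker busy = new Worker("busy-worker", 0L);
        Worker healthy = new Worker("healthy-worker", 0L);

        // Only the healthy worker refreshes its heartbeat, at t = 9000 ms.
        healthy.updateHeartbeat(9_000L);

        // At t = 10001 ms with an (arbitrary) 10 s timeout, the busy worker is reported,
        // even though it may simply be inside a long operation rather than deadlocked.
        System.out.println(findStale(Arrays.asList(busy, healthy), 10_001L, 10_000L));
    }
}
```

The check only sees timestamps, so it cannot tell "blocked" from "busy"; what happens after a worker is reported is left to whichever failure handler is configured.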
