Re: Questions Regarding Critical Workers Health Check

akorensh Fri, 12 Jun 2020 15:49:32 -0700

Hi,
   Question-1: Does the above log mean that the thread is really blocked, or
   is it just busy doing something else?


   The thread in question might be busy doing something else and so did not
update its heartbeat timestamp in time.
   See:
https://apacheignite.readme.io/docs/critical-failures-handling#critical-workers-health-check
   When that happens, a thread dump is generated and the configured
failure handler is called.
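
   To illustrate the idea, here is a minimal sketch of a heartbeat-based
liveness check. It is a simplified model and not Ignite's WorkersRegistry;
the class HeartbeatRegistry and its methods are hypothetical names chosen for
this example, and time is passed in explicitly to keep it deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified, hypothetical model of a heartbeat-based liveness check.
// This is NOT Ignite's WorkersRegistry; all names here are illustrative.
final class HeartbeatRegistry {
    private final long blockedTimeoutMs;

    // Worker name -> time (ms) of its last heartbeat update.
    private final Map<String, Long> heartbeats = new ConcurrentHashMap<>();

    HeartbeatRegistry(long blockedTimeoutMs) {
        this.blockedTimeoutMs = blockedTimeoutMs;
    }

    // A worker calls this between units of work to signal progress.
    void updateHeartbeat(String worker, long nowMs) {
        heartbeats.put(worker, nowMs);
    }

    // Returns workers whose last heartbeat is older than the timeout.
    // Such a worker may be truly blocked, or just busy in a long operation
    // that has not reached its next heartbeat update yet.
    List<String> findSuspects(long nowMs) {
        List<String> suspects = new ArrayList<>();
        for (Map.Entry<String, Long> e : heartbeats.entrySet()) {
            if (nowMs - e.getValue() > blockedTimeoutMs)
                suspects.add(e.getKey());
        }
        return suspects;
    }

    public static void main(String[] args) {
        HeartbeatRegistry reg = new HeartbeatRegistry(10_000);
        reg.updateHeartbeat("worker-a", 0);
        reg.updateHeartbeat("worker-b", 0);
        reg.updateHeartbeat("worker-a", 9_000); // worker-a reports progress
        // At t=12s, worker-b has been silent for longer than the 10s timeout.
        System.out.println(reg.findSuspects(12_000));
    }
}
```

   The check itself cannot tell "blocked" from "busy": it only sees that the
timestamp is stale, which is why the log entry alone does not prove a deadlock.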
    

   Question-2: How can I decide the suitable values of these timeouts for my
   case? The former values were working for me earlier but now I face this
   exception.

     Usually the default values work best; otherwise, determine suitable
    settings for your use case experimentally.

  Question-3: I can see that this failure type (WORKER_THREAD_BLOCKED) is
  actually ignored by default so why do we still see it as an ERROR in logs?
   When a thread is blocked, the event in question is WORKER_THREAD_BLOCKED.
   The error message reflects the event type. It is up to the failure handler
   to handle that event. See:
https://apacheignite.readme.io/docs/critical-failures-handling#failure-handling
   (see also the implementation links further down)
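
   The separation between reporting an event and acting on it can be sketched
as follows. This is a hypothetical model, not Ignite's FailureProcessor or
FailureHandler; FailureFlowSketch, IgnoringHandler and FailureKind are names
invented for the example.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of "log the failure first, then let the handler decide".
// Not Ignite's FailureProcessor; the enum and class names are illustrative.
final class FailureFlowSketch {
    enum FailureKind { WORKER_BLOCKED, WORKER_TERMINATED }

    // Handler that acts on a failure unless its kind is in the ignored set.
    static final class IgnoringHandler {
        private final Set<FailureKind> ignored;

        IgnoringHandler(Set<FailureKind> ignored) {
            this.ignored = ignored;
        }

        // Returns true if the node should be stopped for this failure.
        boolean onFailure(FailureKind kind) {
            return !ignored.contains(kind);
        }
    }

    // Collected log lines, kept in memory so the behavior is easy to inspect.
    final List<String> log = new ArrayList<>();
    private final IgnoringHandler handler;

    FailureFlowSketch(IgnoringHandler handler) {
        this.handler = handler;
    }

    // The failure is always reported; only the resulting action
    // depends on the handler's configuration.
    boolean process(FailureKind kind) {
        log.add("ERROR: critical failure detected, type=" + kind);
        return handler.onFailure(kind);
    }

    public static void main(String[] args) {
        FailureFlowSketch p = new FailureFlowSketch(
            new IgnoringHandler(EnumSet.of(FailureKind.WORKER_BLOCKED)));
        boolean stop = p.process(FailureKind.WORKER_BLOCKED);
        System.out.println(p.log + " stopNode=" + stop);
    }
}
```

   In this model an ignored failure type still produces an ERROR line; being
ignored only means the handler takes no action on it.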


   Question-4: As a remedy to this, I have thought of adding another timeout
   to my configuration:
   <property name="systemWorkerBlockedTimeout" value="30000"/>
   I read that failureDetectionTimeout is ignored in case any other timeout
   is set. Would that mean now my failure detection timeout would also become
   30000? Or would it mean that failureDetectionTimeout would still be the
   configured value and just that its value will be ignored for
   systemWorkerBlockedTimeout (which would now be 30000)?

     Setting systemWorkerBlockedTimeout affects the timeout related to that
property, and nothing else.
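
     A small model of how the two settings relate may help. It is a sketch
under a stated assumption, not Ignite's code: it assumes that an unset
systemWorkerBlockedTimeout falls back to failureDetectionTimeout, while
setting it never changes failureDetectionTimeout. The class
TimeoutSettingsSketch and its fields are hypothetical.

```java
// Hypothetical model of how the two settings relate; not Ignite's code.
// Assumption: an unset worker-blocked timeout falls back to the failure
// detection timeout, while setting it never changes the latter.
final class TimeoutSettingsSketch {
    long failureDetectionTimeoutMs = 10_000; // illustrative value
    Long systemWorkerBlockedTimeoutMs;       // null means "not configured"

    // Timeout that the liveness check would use in this model.
    long effectiveWorkerBlockedTimeoutMs() {
        return systemWorkerBlockedTimeoutMs != null
            ? systemWorkerBlockedTimeoutMs
            : failureDetectionTimeoutMs;
    }

    public static void main(String[] args) {
        TimeoutSettingsSketch cfg = new TimeoutSettingsSketch();
        System.out.println(cfg.effectiveWorkerBlockedTimeoutMs());

        cfg.systemWorkerBlockedTimeoutMs = 30_000L;
        // Only the worker-blocked timeout changes; the other stays as set.
        System.out.println(cfg.effectiveWorkerBlockedTimeoutMs());
        System.out.println(cfg.failureDetectionTimeoutMs);
    }
}
```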



  Question-5: How to decide the value for systemWorkerBlockedTimeout, do we
  have some guidelines here?
   Leave the default, or determine a value experimentally.


Question-6: As I can see in
https://issues.apache.org/jira/browse/IGNITE-10154, this
WORKER_THREAD_BLOCKED failure is ignored by default, but on setting some
positive value for systemWorkerBlockedTimeout, it would actually start
working. However, I'm not sure if I want that right now. How else can I
handle this scenario so that I don't get these unnecessary and very frequent
exceptions without enabling this failure?

This was an issue with an old version (2.7) and is now resolved.
Setting systemWorkerBlockedTimeout affects only that property.
The liveness check is enabled either way, and failure handling is the sole
domain of the configured failure handler.

Here are some links to the implementation to make it clearer:
All relevant threads get put into a workers registry:
https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/worker/WorkersRegistry.java

The registry gets started by the kernel:
https://github.com/apache/ignite/blob/c3a2deb8f464e4547f65164d2ad62b10854cb199/modules/core/src/main/java/org/apache/ignite/internal/IgnitionEx.java#L1805

The lambda uses the FailureProcessor to handle failures using the configured
FailureHandler (default: StopNodeOrHaltFailureHandler):
https://github.com/apache/ignite/blob/c3a2deb8f464e4547f65164d2ad62b10854cb199/modules/core/src/main/java/org/apache/ignite/internal/processors/failure/FailureProcessor.java#L156


The workers registry continuously monitors all threads here. Take a look at
the error message and how the thread dump is generated:
https://github.com/apache/ignite/blob/c3a2deb8f464e4547f65164d2ad62b10854cb199/modules/core/src/main/java/org/apache/ignite/internal/worker/WorkersRegistry.java#L175

Thanks, Alex

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
