Yu Li created HBASE-20158:
-----------------------------

             Summary: Enhance regionserver self health check to avoid stale server
                 Key: HBASE-20158
                 URL: https://issues.apache.org/jira/browse/HBASE-20158
             Project: HBase
          Issue Type: New Feature
            Reporter: Yu Li
            Assignee: Yu Li


Currently we have many good metrics to monitor cluster status, such as 
totalCallTime/processCallTime/queueCallTime. But these metrics do not help 
when a server becomes stale and client calls time out, for example during a 
full GC on the RS, or when bad disks on HDFS cause read IO to get stuck.

We also have a periodic health check chore, introduced by HBASE-7351, which 
allows us to launch an external script periodically to perform self-detection. 
However, this does not work if the server's system resources have run out, for 
example when no new native thread can be created or no new network connection 
can be set up. Note that although no new thread can be launched, running 
threads are not affected, so the zookeeper session stays alive and the RS is 
still regarded as alive, but clients cannot access it since no new connection 
can be set up.

Here we propose a new HealthChecker called DirectHealthChecker. This checker 
does not launch any external script. Instead it picks some regions on the RS 
and sends rpc requests to itself, regards the server as unhealthy if the call 
failure ratio exceeds some limit, and sends the metrics out to our monitoring 
system. For more details, please refer to the coming patch.
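
To make the failure-ratio idea above concrete, here is a minimal sketch in 
plain Java. It is not the coming patch: the class name DirectHealthCheckSketch, 
the methods failureRatio and isHealthy, and the 0.3 limit are all hypothetical, 
and each Callable stands in for one rpc request that the RS would send to one 
of its own regions. Probes run on the calling thread, on the assumption that a 
resource-starved server may be unable to create new ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class DirectHealthCheckSketch {

    /**
     * Runs every probe on the calling thread and returns the fraction that
     * failed. A probe counts as failed if it throws or does not return true.
     * Each probe is a hypothetical stand-in for an rpc to a local region.
     */
    static double failureRatio(List<Callable<Boolean>> probes) {
        if (probes.isEmpty()) {
            return 0.0; // nothing to check, so nothing failed
        }
        int failures = 0;
        for (Callable<Boolean> probe : probes) {
            try {
                if (!Boolean.TRUE.equals(probe.call())) {
                    failures++;
                }
            } catch (Exception e) {
                failures++;
            }
        }
        return (double) failures / probes.size();
    }

    /** The server is unhealthy only when the ratio exceeds the limit. */
    static boolean isHealthy(List<Callable<Boolean>> probes, double maxFailureRatio) {
        return failureRatio(probes) <= maxFailureRatio;
    }

    public static void main(String[] args) {
        List<Callable<Boolean>> probes = new ArrayList<>();
        probes.add(() -> true);                                   // call succeeded
        probes.add(() -> false);                                  // call reported failure
        probes.add(() -> { throw new IllegalStateException(); }); // call threw
        probes.add(() -> true);

        double limit = 0.3; // assumed limit, for illustration only
        System.out.println("failure ratio = " + failureRatio(probes)
            + ", healthy = " + isHealthy(probes, limit));
    }
}
```

A real checker would also need a per-call timeout, since a call stuck on IO 
would otherwise block the check itself, and it would publish the ratio as a 
metric rather than print it.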



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
