Recently we have been experiencing short downtimes (~2-5 minutes) in our HBase cluster and are trying to understand why. We often see HLog write spikes around the downtimes, but not always, so we are not sure whether that is a red herring.
Looking a bit further back in time, we have noticed several metrics deteriorating over the past few months:

- The compaction queue size seems to be growing.
- flushQueueSize and flushSizeAvgTime are growing.
- Some MapReduce tasks run extremely slowly. Roughly 90% complete within a couple of minutes, but a small number take 20 minutes or more. The slow mappers show a high value for the MILLIS_BETWEEN_NEXTS counter (these mappers did not run data-local).
- Application performance is worsening. During slowdowns, threads are usually blocked on HBase connection operations (HConnectionManager$HConnectionImplementation.processBatch).

The last point is a bit puzzling, because our data nodes' OS load values are really low; in the past, we had performance issues when load got too high. The region server log doesn't have anything interesting; the only messages we get are a handful of responseTooSlow warnings.

Do these symptoms point to anything, or is there something else we should look at? We are (still) running 0.94.20. We are going to upgrade soon, but we want to diagnose this issue first.
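In case it helps correlate the responseTooSlow warnings with the downtime windows, here is a rough sketch of how we could bucket them by minute and average their processing time. The sample lines below are made up to illustrate the shape of the 0.94-era log payload (a JSON-ish blob containing "processingtimems"); the sed pattern would need adjusting if the actual format differs.

```shell
# Fabricated sample lines in roughly the shape 0.94 emits (assumption;
# your real region server log format may differ slightly):
cat > /tmp/rs-sample.log <<'EOF'
2014-07-01 12:03:11,201 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":12030,"call":"multi(...)","client":"10.0.0.5:50342","queuetimems":1,"class":"HRegionServer"}
2014-07-01 12:03:45,990 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":15500,"call":"multi(...)","client":"10.0.0.6:50111","queuetimems":0,"class":"HRegionServer"}
EOF

# Bucket slow calls by minute and average their processing time, so the
# result can be lined up against the downtime timestamps:
grep 'responseTooSlow' /tmp/rs-sample.log \
  | sed -E 's/^([0-9-]+ [0-9]{2}:[0-9]{2}).*"processingtimems":([0-9]+).*/\1 \2/' \
  | awk '{n[$1" "$2]++; s[$1" "$2]+=$3}
         END {for (m in n) printf "%s calls=%d avg_ms=%.0f\n", m, n[m], s[m]/n[m]}'
```

If the per-minute counts spike just before each outage window, that would suggest the slow calls are a leading symptom rather than noise.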
