Recently we have been experiencing short downtimes (~2-5 minutes) in our HBase cluster and are trying to understand why. We often see HLog write spikes around the downtimes, but not always, so we are not sure whether that is a red herring.
Looking a bit further back in time, we have noticed several metrics deteriorating over the past few months:

- The compaction queue size seems to be growing.
- flushQueueSize and flushSizeAvgTime are growing.
- Some MapReduce tasks run extremely slowly. Roughly 90% complete within a couple of minutes, but a small number take 20 minutes or more. The slow mappers show a high value for the MILLIS_BETWEEN_NEXTS counter (these mappers did not run data-local).
- Application performance is worsening. During slowdowns, threads are usually blocked on HBase connection operations (HConnectionManager$HConnectionImplementation.processBatch).

The last point is a bit puzzling, because our data nodes' OS load values are really low; in the past, we had performance issues when load got too high. The region server log doesn't have anything interesting; the only messages we get are a handful of responseTooSlow warnings.

Do these symptoms point to anything, or is there something else we should look at? We are (still) running 0.94.20. We are going to upgrade soon, but we want to diagnose this issue first.
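In case it helps correlate the responseTooSlow warnings with the downtime windows, here is a rough sketch of how we could bucket them by minute and average their processing time. The sample lines below are made up to illustrate the shape of the 0.94-era log payload (a JSON-ish blob containing "processingtimems"); the sed pattern would need adjusting if the actual format differs.

```shell
# Fabricated sample lines in roughly the shape 0.94 emits (assumption;
# your real region server log format may differ slightly):
cat > /tmp/rs-sample.log <<'EOF'
2014-07-01 12:03:11,201 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":12030,"call":"multi(...)","client":"10.0.0.5:50342","queuetimems":1,"class":"HRegionServer"}
2014-07-01 12:03:45,990 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":15500,"call":"multi(...)","client":"10.0.0.6:50111","queuetimems":0,"class":"HRegionServer"}
EOF

# Bucket slow calls by minute and average their processing time, so the
# result can be lined up against the downtime timestamps:
grep 'responseTooSlow' /tmp/rs-sample.log \
  | sed -E 's/^([0-9-]+ [0-9]{2}:[0-9]{2}).*"processingtimems":([0-9]+).*/\1 \2/' \
  | awk '{n[$1" "$2]++; s[$1" "$2]+=$3}
         END {for (m in n) printf "%s calls=%d avg_ms=%.0f\n", m, n[m], s[m]/n[m]}'
```

If the per-minute counts spike just before each outage window, that would suggest the slow calls are a leading symptom rather than noise.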
