bq. a small number will take 20 minutes or more

Were these mappers performing selective scans on big regions?
Can you pastebin the stack trace of the region server(s) that served such regions during the slow mapper runs? A pastebin of the region server log would also give us more clues.

On Tue, Mar 22, 2016 at 10:57 AM, feedly team <[email protected]> wrote:

> Recently we have been experiencing short downtimes (~2-5 minutes) in our
> hbase cluster and are trying to understand why. Many times we have HLog
> write spikes around the downtimes, but not always. Not sure if this is a
> red herring.
>
> Looking a bit farther back in time, we have noticed many metrics
> deteriorating over the past few months:
>
> The compaction queue size seems to be growing.
>
> The flushQueueSize and flushSizeAvgTime are growing.
>
> Some map reduce tasks run extremely slowly. Maybe 90% will complete within
> a couple of minutes, but a small number will take 20 minutes or more. If I
> look at the slow mappers, there is a high value for the
> MILLIS_BETWEEN_NEXTS counter (these mappers didn't run data-local).
>
> We have seen application performance worsening; during slowdowns, threads
> are usually blocked on hbase connection operations
> (HConnectionManager$HConnectionImplementation.processBatch).
>
> This is a bit puzzling, as our data nodes' OS load values are really low.
> In the past, we had performance issues when load got too high. The region
> server log doesn't have anything interesting; the only messages we get are
> a handful of responseTooSlow messages.
>
> Do these symptoms point to anything, or is there something else we should
> look at? We are (still) running 0.94.20. We are going to upgrade soon, but
> we want to diagnose this issue first.
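For grabbing the stack trace, something like the following should work when run on the region server host (a minimal sketch; it assumes the JDK's jps and jstack tools are on the PATH and that the region server JVM appears as HRegionServer in jps output):

```shell
# Find the region server JVM: jps prints "<pid> <main-class>" per line,
# so match the HRegionServer line and take the pid column
RS_PID=$(jps | awk '/HRegionServer/ { print $1 }')

# Dump the thread stacks to a timestamped file suitable for pastebin
jstack "$RS_PID" > "/tmp/rs-stack-$(hostname)-$(date +%s).txt"
```

Taking two or three dumps a few seconds apart during a slow mapper run would make it easier to spot which threads are consistently blocked.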
