bq. a small number will take 20 minutes or more

Were these mappers performing selective scans on big regions?

Can you pastebin the stack trace(s) of the region server(s) that served such
regions during the slow mapper operations?

A pastebin of the region server log would also give us more clues.
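If it helps, here is a minimal sketch of gathering that information. The pid lookup, log path, and sample log contents below are hypothetical stand-ins and will differ per install:

```shell
# Capture a few stack dumps from the region server JVM (the pid lookup
# is a hypothetical example; use the actual HRegionServer pid on your node):
#   RS_PID=$(pgrep -f HRegionServer | head -n1)
#   for i in 1 2 3; do jstack "$RS_PID" > "rs-stack-$i.txt"; sleep 10; done

# Then pull the slow-response warnings out of the region server log.
# A fabricated log file stands in here so the grep is demonstrable:
LOG=rs-sample.log
printf '%s\n' \
  'WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":12000}' \
  'INFO regionserver.HRegionServer: periodic flush' > "$LOG"

# Count the slow-response warnings:
grep -c responseTooSlow "$LOG"
```

Taking several dumps a few seconds apart makes it easier to tell a transient stall from a thread that is persistently stuck.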

On Tue, Mar 22, 2016 at 10:57 AM, feedly team <[email protected]> wrote:

> Recently we have been experiencing short downtimes (~2-5 minutes) in our
> hbase cluster and are trying to understand why. Many times we have HLog
> write spikes around the down times, but not always. Not sure if this is a
> red herring.
>
> We have looked a bit farther back in time and have noticed many metrics
> deteriorating over the past few months:
>
> The compaction queue size seems to be growing.
>
> The flushQueueSize and flushSizeAvgTime are growing.
>
> Some map reduce tasks run extremely slowly. Maybe 90% will complete within
> a couple minutes, but a small number will take 20 minutes or more. If I
> look at the slow mappers, there is a high value for the
> MILLIS_BETWEEN_NEXTS counter (these mappers didn't run data local).
>
> We have seen application performance worsening; during slowdowns,
> threads are usually blocked on hbase connection operations
> (HConnectionManager$HConnectionImplementation.processBatch).
>
>
> This is a bit puzzling as our data nodes' os load values are really low. In
> the past, we had performance issues when load got too high. The region
> server log doesn't have anything interesting; the only messages we get
> are a handful of responseTooSlow messages.
>
> Do these symptoms point to anything, or is there something else we should
> look at? We are (still) running 0.94.20. We are going to upgrade soon, but
> we want to diagnose this issue first.
>
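One way to confirm the processBatch blocking described in the quoted message is to grep a jstack dump of the application JVM for threads parked in that method. The dump file and its contents below are fabricated stand-ins for a real thread dump:

```shell
# A fabricated fragment of a jstack dump stands in for a real one
# (take a real dump with: jstack <app-pid> > app-stack.txt):
DUMP=app-stack.txt
printf '%s\n' \
  '"htable-pool-1-thread-1" prio=5 tid=0x1 BLOCKED' \
  '   at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch' \
  '"handler-2" prio=5 tid=0x2 RUNNABLE' > "$DUMP"

# Count how many frames are parked in processBatch:
grep -c 'processBatch' "$DUMP"
```

A large and stable count across several dumps would point at contention on the client connection rather than at the mappers themselves.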
