I'll paste a thread dump later, writing this from my phone  :-)

So the same issue has happened at different times for different regions,
but I couldn't see that the region in question was the one being
compacted, either this time or earlier, although I might have missed an
earlier correlation in the logs where the issue started just after the
compaction completed.

Usually a compaction of one of this table's regions takes around 5-10
minutes in total: around a minute or less for its smaller column family,
which has the block cache enabled, and 5-10 minutes for the much larger
one, for which we have the block cache disabled in the schema because we
never read it in the primary cluster. So the only impact on reads should
come from the smaller column family, which takes less than a minute to
compact.
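
For reference, the block cache for that larger family is turned off via
the normal per-column-family flag in the schema. A minimal sketch of
doing that with the HBase 1.x client API (table and family names below
are made up, not our real ones):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DisableBlockCacheForFamily {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(
                     HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                TableName table = TableName.valueOf("my_table"); // made-up name
                HTableDescriptor desc = admin.getTableDescriptor(table);
                // "large_cf" stands in for the big family we never read here
                HColumnDescriptor cf = desc.getFamily(Bytes.toBytes("large_cf"));
                cf.setBlockCacheEnabled(false); // keep its blocks out of the cache
                admin.modifyColumn(table, cf);  // push the updated family schema
            }
        }
    }

(The shell equivalent is just altering the family with BLOCKCACHE =>
'false'.)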

But once the issue starts, it doesn't seem to recover for a long time,
long past the point where any compaction on the region itself could
impact anything. The compaction tool, which is our own code, has long
since moved on to other regions.
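
To clarify what the tool does: it just asks for a region-level major
compaction through the Admin API and polls until the region reports no
compaction in progress, one region per region server at a time. Not our
actual code, but a rough sketch against the HBase 1.x client API (table
name is made up) looks like:

    import java.util.List;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionInfo;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class MajorCompactOneRegion {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(
                     HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                List<HRegionInfo> regions =
                    admin.getTableRegions(TableName.valueOf("my_table"));
                byte[] regionName = regions.get(0).getRegionName();
                admin.majorCompactRegion(regionName); // async request
                // wait until the region server reports no compaction running
                while (!"NONE".equals(
                        admin.getCompactionStateForRegion(regionName).toString())) {
                    Thread.sleep(5000L);
                }
            }
        }
    }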

Cheers.

----
Saad


On Wed, Feb 28, 2018 at 9:39 PM Ted Yu <yuzhih...@gmail.com> wrote:

> bq. timing out trying to obtain write locks on rows in that region.
>
> Can you confirm that the region under contention was the one being major
> compacted ?
>
> Can you pastebin thread dump so that we can have better idea of the
> scenario ?
>
> For the region being compacted, how long would the compaction take (just
> want to see if there was correlation between this duration and timeout) ?
>
> Cheers
>
> On Wed, Feb 28, 2018 at 6:31 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
> > Hi,
> >
> > We are running HBase 1.4.0 on Amazon EMR. We are currently seeing a
> > situation where sometimes a particular region gets into a state where a
> > lot of write requests to any row in that region time out saying they
> > failed to obtain a lock on a row in the region, and eventually they hit
> > an IPC timeout. This causes the IPC queue to blow up in size as requests
> > get backed up, and that region server experiences a much higher than
> > normal timeout rate for all requests, not just those timing out for
> > failing to obtain the row lock.
> >
> > The strange thing is the rows are always different but the region is
> > always the same. So the question is, is there a region component to how
> > long a row write lock is held? I looked at the debug dump and the
> > RowLocks section shows a long list of write row locks held, all of them
> > from the same region but for different rows.
> >
> > Will trying to obtain a write row lock experience delays if no one else
> > holds a lock on the same row but the region itself is experiencing read
> > delays? We do have an incremental compaction tool running that major
> > compacts one region per region server at a time, so that will drive
> > pages out of the bucket cache. But for most regions the impact is
> > transient until the bucket cache gets repopulated with pages from the
> > new HFile. For this one region, though, we start timing out trying to
> > obtain write locks on rows in that region.
> >
> > Any insight anyone can provide would be most welcome.
> >
> > Cheers.
> >
> > ----
> > Saad
> >
>
