So after much investigation I can confirm:

a) it was indeed one of the regions that was being compacted, a major
compaction in one case and a minor compaction in another; the issue started
just after the compaction completed, blowing away the bucket-cached blocks
for the older HFiles
b) in another case there was no compaction, just a newly opened region on a
region server that hadn't finished prefetching its pages from S3

We have prefetch on open set to true. Our load is heavy on checkAndPut. The
theory is that since prefetch is an async operation, a lot of the reads in
the checkAndPut for the region in question start reading from S3, which is
slow. So the write lock obtained for the checkAndPut is held for a longer
duration than normal, and this has cascading upstream effects. Does that
sound plausible?
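
For context, the prefetch setting is just the per-family flag on the table
schema. A rough sketch of how it would be set through the HBase 1.x Java
admin API follows; the table and family names are placeholders, not our real
schema:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class EnablePrefetchOnOpen {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName table = TableName.valueOf("my_table");           // placeholder table name
            HTableDescriptor desc = admin.getTableDescriptor(table);
            HColumnDescriptor cf = desc.getFamily(Bytes.toBytes("cf")); // placeholder family name
            cf.setBlockCacheEnabled(true);
            // Asynchronously load the family's HFile blocks into the (bucket) cache
            // when a region opens, instead of waiting for reads to pull them in.
            cf.setPrefetchBlocksOnOpen(true);
            admin.modifyColumn(table, cf);
        }
    }
}

I believe the shell equivalent is setting PREFETCH_BLOCKS_ON_OPEN => 'true'
on the column family.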

The part I still don't understand is that all the locks held are for the
same region but are all for different rows. So once the prefetch is
completed, shouldn't the problem clear up quickly? Or does the slow region
slow down anyone trying to do a checkAndPut on any row in the same region,
even after the prefetch has completed? That is, do the long-held row locks
prevent others from getting a row lock on a different row in the same
region?

In any case, we are trying to use the
https://issues.apache.org/jira/browse/HBASE-16388 support in HBase 1.4.0 to
both insulate the app a bit from this situation and, hopefully, reduce
pressure on the region server in question, allowing it to recover faster. I
haven't tested that yet; any advice in the meantime would be appreciated.
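
For anyone who hasn't looked at that JIRA, my understanding is that the
change is driven by a client-side setting that caps the number of concurrent
requests to any single region server. A minimal sketch of how we plan to
wire it up; the threshold value is a placeholder we haven't tuned yet:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class PerServerRequestLimit {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // HBASE-16388 (in 1.4.0): limit the number of concurrent requests this
        // client keeps outstanding to any single region server. Once the limit
        // is hit, further requests to that server fail fast instead of tying up
        // all of the client's threads on one slow server.
        conf.setInt("hbase.client.perserver.requests.threshold", 100); // placeholder, not a tuned value
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            // normal table operations go here; calls against an overloaded server
            // should now fail fast so the app can back off and retry
        }
    }
}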

Cheers.

----
Saad



On Thu, Mar 1, 2018 at 9:21 AM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Actually it happened again while some minor compactions were running, so I
> don't think it's related to our major compaction tool, which isn't even
> running right now. I will try to capture a debug dump of threads and
> everything while the event is ongoing. It seems to last at least half an
> hour or so, and sometimes longer.
>
> ----
> Saad
>
>
> On Thu, Mar 1, 2018 at 7:54 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
>> Unfortunately I lost the stack trace overnight. But it does seem related
>> to compaction, because now that the compaction tool is done, I don't see
>> the issue anymore. I will run our incremental major compaction tool again
>> and see if I can reproduce the issue.
>>
>> On the plus side the system stayed stable and eventually recovered,
>> although it did suffer all those timeouts.
>>
>> ----
>> Saad
>>
>>
>> On Wed, Feb 28, 2018 at 10:18 PM, Saad Mufti <saad.mu...@gmail.com>
>> wrote:
>>
>>> I'll paste a thread dump later, writing this from my phone  :-)
>>>
>>> So the same issue has happened at different times for different regions,
>>> but I couldn't see that the region in question was the one being compacted,
>>> either this time or earlier, although I might have missed an earlier
>>> correlation in the logs where the issue started just after the compaction
>>> completed.
>>>
>>> Usually a compaction for this table's regions takes around 5-10 minutes:
>>> around a minute or less for its smaller column family, which has the block
>>> cache enabled, and 5-10 minutes for the much larger one, for which we have
>>> the block cache disabled in the schema because we never read it in the
>>> primary cluster. So the only impact on reads would be from that smaller
>>> column family, which takes less than a minute to compact.
>>>
>>> But once the issue starts, it doesn't seem to recover for a long time,
>>> long past when any compaction on the region itself could impact anything.
>>> The compaction tool, which is our own code, has long since moved on to
>>> other regions.
>>>
>>> Cheers.
>>>
>>> ----
>>> Saad
>>>
>>>
>>> On Wed, Feb 28, 2018 at 9:39 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> bq. timing out trying to obtain write locks on rows in that region.
>>>>
>>>> Can you confirm that the region under contention was the one being major
>>>> compacted?
>>>>
>>>> Can you pastebin a thread dump so that we can have a better idea of the
>>>> scenario?
>>>>
>>>> For the region being compacted, how long would the compaction take? (Just
>>>> want to see if there was a correlation between this duration and the
>>>> timeout.)
>>>>
>>>> Cheers
>>>>
>>>> On Wed, Feb 28, 2018 at 6:31 PM, Saad Mufti <saad.mu...@gmail.com>
>>>> wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > We are running on Amazon EMR-based HBase 1.4.0. We are currently seeing
>>>> > a situation where sometimes a particular region gets into a state where
>>>> > a lot of write requests to any row in that region time out saying they
>>>> > failed to obtain a lock on a row, and eventually they experience an IPC
>>>> > timeout. This causes the IPC queue to blow up in size as requests get
>>>> > backed up, and that region server experiences a much higher than normal
>>>> > timeout rate for all requests, not just those timing out for failing to
>>>> > obtain the row lock.
>>>> >
>>>> > The strange thing is the rows are always different but the region is
>>>> > always the same. So the question is, is there a region component to how
>>>> > long a row write lock would be held? I looked at the debug dump and the
>>>> > RowLocks section shows a long list of write row locks held, all of them
>>>> > from the same region but for different rows.
>>>> >
>>>> > Will trying to obtain a write row lock experience delays if no one else
>>>> > holds a lock on the same row but the region itself is experiencing read
>>>> > delays? We do have an incremental compaction tool running that major
>>>> > compacts one region per region server at a time, so that will drive out
>>>> > pages from the bucket cache. For most regions the impact is transient
>>>> > until the bucket cache gets repopulated with pages from the new HFile,
>>>> > but for this one region we start timing out trying to obtain write
>>>> > locks on rows in that region.
>>>> >
>>>> > Any insight anyone can provide would be most welcome.
>>>> >
>>>> > Cheers.
>>>> >
>>>> > ----
>>>> > Saad
>>>> >
>>>>
>>>
>>
>
