Although now that I think about this a bit more, all the failures we saw
were failures to obtain a row lock, and the thread stack traces always
showed the blocked threads somewhere inside getRowLockInternal or similar. I
never saw any contention on the bucket cache lock, as far as I could tell.
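
To illustrate what I mean, here is a rough sketch (not the actual HBase
code, just a simplified illustration) of why a checkAndPut that is slow
under the row lock shows up in other handler threads as a failure to obtain
that row lock; the 30 second default comes from hbase.rowlock.wait.duration:

import java.io.IOException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Simplified sketch: each row has its own lock; checkAndPut holds it across
// the read (possibly from S3) plus the write. Any other handler trying to
// lock the same row waits in a timed tryLock and eventually gives up, which
// is the failure mode we saw in the stack traces.
public class RowLockSketch {
    private final ReentrantLock rowLock = new ReentrantLock();
    private static final long WAIT_MS = 30000; // hbase.rowlock.wait.duration default

    public void checkAndPut(Runnable readPossiblyFromS3, Runnable applyMutation)
            throws IOException, InterruptedException {
        if (!rowLock.tryLock(WAIT_MS, TimeUnit.MILLISECONDS)) {
            // Surfaces as "failed to obtain a row lock" to the client.
            throw new IOException("Timed out waiting to acquire row lock");
        }
        try {
            readPossiblyFromS3.run(); // slow when the block is not yet cached
            applyMutation.run();
        } finally {
            rowLock.unlock();
        }
    }
}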

Cheers.

----
Saad


On Sat, Mar 10, 2018 at 8:04 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Also, for now we have mitigated this problem by using the new setting in
> HBase 1.4.0 that prevents one slow region server from blocking all client
> requests. Of course it causes some timeouts but our overall ecosystem
> contains Kafka queues for retries, so we can live with that. From what I
> can see, it looks like this setting also has the good effect of preventing
> clients from hammering a region server that is slow because its IPC queues
> are backed up, allowing it to recover faster.
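>
> In case the exact knob is useful to anyone else: I believe this is the
> HBASE-16388 limit on concurrent requests from one client to a single
> region server, which fails fast (a ServerTooBusyException, if I recall
> correctly) instead of letting every client thread pile up behind the slow
> server. A minimal client-side sketch, assuming I have the property name
> right and with an arbitrary example threshold:
>
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
>
> public class PerServerLimitSketch {
>   public static void main(String[] args) throws IOException {
>     Configuration conf = HBaseConfiguration.create();
>     // Cap outstanding requests from this client to any one region server.
>     conf.setInt("hbase.client.perserver.requests.threshold", 100); // example value
>     try (Connection connection = ConnectionFactory.createConnection(conf)) {
>       // use the connection as usual; requests beyond the cap fail fast
>     }
>   }
> }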
>
> Does that make sense?
>
> Cheers.
>
> ----
> Saad
>
>
> On Sat, Mar 10, 2018 at 7:04 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
>> So if I understand correctly, we would mitigate the problem by not
>> evicting blocks for archived files immediately? Wouldn't this potentially
>> lead to problems later if the LRU algo chooses to evict blocks for active
>> files and leave blocks for archived files in there?
>>
>> I would definitely love to test this!!! Unfortunately we are running on
>> EMR and the details of how to patch HBase under EMR are not clear to me :-(
>>
>> What we would really love would be a setting to immediately cache the
>> blocks of a newly compacted file. I have seen in the code that even if we
>> have the cache-on-write setting set to true, it will refuse to cache
>> blocks for a newly compacted file. In our case we have sized the bucket
>> cache to be big enough to hold all our data, and we really want to avoid
>> going to S3 until the last possible moment. A config setting to test this
>> would be great.
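>>
>> For context, the cache-related settings we are working with look roughly
>> like the sketch below (illustrative only; the table and family names are
>> made up, and as noted above, in 1.4 the cache-on-write path still skips
>> files produced by compaction, which is exactly the gap we would like a
>> config setting for):
>>
>> import org.apache.hadoop.hbase.HColumnDescriptor;
>> import org.apache.hadoop.hbase.HTableDescriptor;
>> import org.apache.hadoop.hbase.TableName;
>>
>> public class CacheSettingsSketch {
>>   public static HTableDescriptor describe() {
>>     HColumnDescriptor smallCf = new HColumnDescriptor("cf");
>>     smallCf.setBlockCacheEnabled(true);
>>     smallCf.setCacheDataOnWrite(true);     // cache-on-write; compaction output is still skipped
>>     smallCf.setPrefetchBlocksOnOpen(true); // async prefetch when the HFiles are opened
>>
>>     HTableDescriptor table = new HTableDescriptor(TableName.valueOf("my_table"));
>>     table.addFamily(smallCf);
>>     return table;
>>   }
>> }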
>>
>> But thanks, everyone, for your feedback. Any more feedback on the idea of
>> letting a user cache all newly compacted files would also be welcome.
>>
>> ----
>> Saad
>>
>>
>> On Wed, Mar 7, 2018 at 12:00 AM, Anoop John <anoop.hb...@gmail.com>
>> wrote:
>>
>>> >> a) it was indeed one of the regions that was being compacted, major
>>> >> compaction in one case, minor compaction in another, the issue started
>>> >> just after compaction completed blowing away bucket cached blocks for
>>> >> the older HFiles
>>>
>>> About this part: yes, after the compaction there is a step where the
>>> compacted-away HFile's blocks get removed from the cache. This op takes a
>>> write lock for this region (in the bucket cache layer). Every read op
>>> that is part of a checkAndPut will try to read from the BC, and that in
>>> turn needs a read lock for this region. So there is a chance that the
>>> read locks starve because of so many frequent write locks, since each
>>> block eviction acquires the write lock one after the other. Would it be
>>> possible for you to patch this eviction and test once? We can avoid the
>>> immediate evict from the BC after compaction. I can help you with a patch
>>> if you wish.
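>>>
>>> Roughly the pattern I mean, as a simplified sketch (not the actual
>>> BucketCache code): eviction of the compacted-away file takes the write
>>> lock once per block, while the read half of every checkAndPut needs the
>>> read lock, so a steady stream of short write-lock acquisitions can keep
>>> readers waiting.
>>>
>>> import java.util.concurrent.locks.ReentrantReadWriteLock;
>>>
>>> public class EvictionContentionSketch {
>>>     private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
>>>
>>>     // Called once per block of the compacted-away HFile.
>>>     public void evictBlock(String blockKey) {
>>>         lock.writeLock().lock();
>>>         try {
>>>             // remove blockKey from the cache structures (elided)
>>>         } finally {
>>>             lock.writeLock().unlock();
>>>         }
>>>     }
>>>
>>>     // Called for the read half of every checkAndPut.
>>>     public byte[] readBlock(String blockKey) {
>>>         lock.readLock().lock();
>>>         try {
>>>             // look up blockKey in the cache (elided)
>>>             return null;
>>>         } finally {
>>>             lock.readLock().unlock();
>>>         }
>>>     }
>>> }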
>>>
>>> Anoop
>>>
>>>
>>>
>>> On Mon, Mar 5, 2018 at 11:07 AM, ramkrishna vasudevan <
>>> ramkrishna.s.vasude...@gmail.com> wrote:
>>> > Hi Saad
>>> >
>>> > Your argument here
>>> >>> The theory is that since prefetch is an async operation, a lot of the
>>> >>> reads in the checkAndPut for the region in question start reading
>>> >>> from S3 which is slow. So the write lock obtained for the checkAndPut
>>> >>> is held for a longer duration than normal. This has cascading
>>> >>> upstream effects. Does that sound plausible?
>>> >
>>> > Seems very much plausible. So before the prefetch even happens, say for
>>> > 'block 1' - and you have already issued N checkAndPut calls for the rows
>>> > in that 'block 1' - all those checkAndPut calls will have to read that
>>> > block from S3 to perform the get() and then apply the mutation.
>>> >
>>> > This may happen for multiple threads at the same time because we are not
>>> > sure when the prefetch would have actually been completed. I don't know
>>> > what the general read characteristics are when a read happens from S3,
>>> > but you could try to see how things behave when a read happens from S3,
>>> > and then, after the prefetch completes, ensure the same checkAndPut() is
>>> > done (from cache this time) to really see the difference S3 makes there.
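>>> >
>>> > If it helps, a rough way to do that comparison from the client side
>>> > could look like the sketch below (the table, family, qualifier, row and
>>> > values are just placeholders; the first call is expected to be slow if
>>> > the block still has to come from S3, and the repeat should be much
>>> > faster once the data is cached locally):
>>> >
>>> > import org.apache.hadoop.conf.Configuration;
>>> > import org.apache.hadoop.hbase.HBaseConfiguration;
>>> > import org.apache.hadoop.hbase.TableName;
>>> > import org.apache.hadoop.hbase.client.Connection;
>>> > import org.apache.hadoop.hbase.client.ConnectionFactory;
>>> > import org.apache.hadoop.hbase.client.Put;
>>> > import org.apache.hadoop.hbase.client.Table;
>>> > import org.apache.hadoop.hbase.util.Bytes;
>>> >
>>> > public class CheckAndPutTiming {
>>> >   public static void main(String[] args) throws Exception {
>>> >     Configuration conf = HBaseConfiguration.create();
>>> >     try (Connection conn = ConnectionFactory.createConnection(conf);
>>> >          Table table = conn.getTable(TableName.valueOf("my_table"))) {
>>> >       byte[] row = Bytes.toBytes("some-row");
>>> >       byte[] cf = Bytes.toBytes("cf");
>>> >       byte[] q = Bytes.toBytes("q");
>>> >       Put put = new Put(row).addColumn(cf, q, Bytes.toBytes("new-value"));
>>> >
>>> >       long t1 = System.nanoTime();
>>> >       // first call: the block may still have to come from S3
>>> >       table.checkAndPut(row, cf, q, Bytes.toBytes("expected"), put);
>>> >       long t2 = System.nanoTime();
>>> >       // repeated call: should be fast once the block is cached locally
>>> >       table.checkAndPut(row, cf, q, Bytes.toBytes("new-value"), put);
>>> >       long t3 = System.nanoTime();
>>> >
>>> >       System.out.printf("first: %d ms, second: %d ms%n",
>>> >           (t2 - t1) / 1_000_000, (t3 - t2) / 1_000_000);
>>> >     }
>>> >   }
>>> > }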
>>> >
>>> > Regards
>>> > Ram
>>> >
>>> > On Fri, Mar 2, 2018 at 2:57 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
>>> >
>>> >> So after much investigation I can confirm:
>>> >>
>>> >> a) it was indeed one of the regions that was being compacted, major
>>> >> compaction in one case, minor compaction in another, the issue started
>>> >> just after compaction completed blowing away bucket cached blocks for
>>> >> the older HFiles
>>> >> b) in another case there was no compaction, just a newly opened region
>>> >> on a region server that hadn't finished prefetching its pages from S3
>>> >>
>>> >> We have prefetch on open set to true. Our load is heavy on checkAndPut.
>>> >> The theory is that since prefetch is an async operation, a lot of the
>>> >> reads in the checkAndPut for the region in question start reading from
>>> >> S3 which is slow. So the write lock obtained for the checkAndPut is held
>>> >> for a longer duration than normal. This has cascading upstream effects.
>>> >> Does that sound plausible?
>>> >>
>>> >> The part I still don't understand is that all the locks held are for
>>> >> the same region but are all for different rows. So once the prefetch is
>>> >> completed, shouldn't the problem clear up quickly? Or does the slow
>>> >> region slow down anyone trying to do a checkAndPut on any row in the
>>> >> same region even after the prefetch has completed? That is, do the
>>> >> long-held row locks prevent others from getting a row lock on a
>>> >> different row in the same region?
>>> >>
>>> >> In any case, we are trying to use
>>> >> https://issues.apache.org/jira/browse/HBASE-16388 support in HBase
>>> >> 1.4.0 to both insulate the app a bit from this situation and, we hope,
>>> >> reduce pressure on the region server in question, allowing it to
>>> >> recover faster. I haven't quite tested that yet; any advice in the
>>> >> meantime would be appreciated.
>>> >>
>>> >> Cheers.
>>> >>
>>> >> ----
>>> >> Saad
>>> >>
>>> >>
>>> >>
>>> >> On Thu, Mar 1, 2018 at 9:21 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
>>> >>
>>> >> > Actually it happened again while some minor compactions were running,
>>> >> > so I don't think it's related to our major compaction tool, which
>>> >> > isn't even running right now. I will try to capture a debug dump of
>>> >> > threads and everything while the event is ongoing. It seems to last
>>> >> > at least half an hour or so, and sometimes longer.
>>> >> >
>>> >> > ----
>>> >> > Saad
>>> >> >
>>> >> >
>>> >> > On Thu, Mar 1, 2018 at 7:54 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
>>> >> >
>>> >> >> Unfortunately I lost the stack trace overnight. But it does seem
>>> >> >> related to compaction, because now that the compaction tool is done,
>>> >> >> I don't see the issue anymore. I will run our incremental major
>>> >> >> compaction tool again and see if I can reproduce the issue.
>>> >> >>
>>> >> >> On the plus side the system stayed stable and eventually recovered,
>>> >> >> although it did suffer all those timeouts.
>>> >> >>
>>> >> >> ----
>>> >> >> Saad
>>> >> >>
>>> >> >>
>>> >> >> On Wed, Feb 28, 2018 at 10:18 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>>> >> >>
>>> >> >>> I'll paste a thread dump later, writing this from my phone  :-)
>>> >> >>>
>>> >> >>> So the same issue has happened at different times for different
>>> >> >>> regions, but I couldn't see that the region in question was the one
>>> >> >>> being compacted, either this time or earlier, although I might have
>>> >> >>> missed an earlier correlation in the logs where the issue started
>>> >> >>> just after the compaction completed.
>>> >> >>>
>>> >> >>> Usually a compaction for this table's regions takes around 5-10
>>> >> >>> minutes: much less for its smaller column family, which is block
>>> >> >>> cache enabled (around a minute or less), and 5-10 minutes for the
>>> >> >>> much larger one, for which we have the block cache disabled in the
>>> >> >>> schema because we don't ever read it in the primary cluster. So the
>>> >> >>> only impact on reads would be from that smaller column family,
>>> >> >>> which takes less than a minute to compact.
>>> >> >>>
>>> >> >>> But the issue, once started, doesn't seem to clear up for a long
>>> >> >>> time, long past when any compaction on the region itself could
>>> >> >>> impact anything. The compaction tool, which is our own code, has
>>> >> >>> long since moved on to other regions.
>>> >> >>>
>>> >> >>> Cheers.
>>> >> >>>
>>> >> >>> ----
>>> >> >>> Saad
>>> >> >>>
>>> >> >>>
>>> >> >>> On Wed, Feb 28, 2018 at 9:39 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>> >> >>>
>>> >> >>>> bq. timing out trying to obtain write locks on rows in that region.
>>> >> >>>>
>>> >> >>>> Can you confirm that the region under contention was the one being
>>> >> >>>> major compacted?
>>> >> >>>>
>>> >> >>>> Can you pastebin a thread dump so that we can have a better idea
>>> >> >>>> of the scenario?
>>> >> >>>>
>>> >> >>>> For the region being compacted, how long would the compaction take
>>> >> >>>> (just want to see if there was a correlation between this duration
>>> >> >>>> and the timeout)?
>>> >> >>>>
>>> >> >>>> Cheers
>>> >> >>>>
>>> >> >>>> On Wed, Feb 28, 2018 at 6:31 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>>> >> >>>>
>>> >> >>>> > Hi,
>>> >> >>>> >
>>> >> >>>> > We are running on Amazon EMR based HBase 1.4.0. We are currently
>>> >> >>>> > seeing a situation where sometimes a particular region gets into
>>> >> >>>> > a state where a lot of write requests to any row in that region
>>> >> >>>> > time out saying they failed to obtain a lock on a row in the
>>> >> >>>> > region, and eventually they experience an IPC timeout. This
>>> >> >>>> > causes the IPC queue to blow up in size as requests get backed
>>> >> >>>> > up, and that region server experiences a much higher than normal
>>> >> >>>> > timeout rate for all requests, not just those timing out for
>>> >> >>>> > failing to obtain the row lock.
>>> >> >>>> >
>>> >> >>>> > The strange thing is that the rows are always different but the
>>> >> >>>> > region is always the same. So the question is, is there a region
>>> >> >>>> > component to how long a row write lock would be held? I looked
>>> >> >>>> > at the debug dump and the RowLocks section shows a long list of
>>> >> >>>> > write row locks held, all of them from the same region but for
>>> >> >>>> > different rows.
>>> >> >>>> >
>>> >> >>>> > Will trying to obtain a row write lock experience delays if no
>>> >> >>>> > one else holds a lock on the same row but the region itself is
>>> >> >>>> > experiencing read delays? We do have an incremental compaction
>>> >> >>>> > tool running that major compacts one region per region server at
>>> >> >>>> > a time, so that will drive out pages from the bucket cache. But
>>> >> >>>> > for most regions the impact is transient until the bucket cache
>>> >> >>>> > gets repopulated by pages from the new HFile. For this one
>>> >> >>>> > region, though, we start timing out trying to obtain write locks
>>> >> >>>> > on rows in that region.
>>> >> >>>> >
>>> >> >>>> > Any insight anyone can provide would be most welcome.
>>> >> >>>> >
>>> >> >>>> > Cheers.
>>> >> >>>> >
>>> >> >>>> > ----
>>> >> >>>> > Saad
>>> >> >>>> >
>>> >> >>>>
>>> >> >>>
>>> >> >>
>>> >> >
>>> >>
>>>
>>
>>
>
