Thanks. I left a comment on that ticket.

----
Saad


On Sat, Mar 10, 2018 at 11:57 PM, Anoop John <anoop.hb...@gmail.com> wrote:

> Hi Saad
>            In your initial mail you mentioned that there are lots of
> checkAndPut ops but on different rows. The failure in obtaining locks
> (write lock, as it is checkAndPut) means there is contention on the same
> row key. If that is the case, then yes, that is the first step before the
> BC reads, and it makes sense.
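>
> For reference, a minimal sketch (HBase 1.x client API) of the kind of
> checkAndPut call we are discussing; the table, family, qualifier and
> values below are placeholders, not anything from your setup:
>
>     import org.apache.hadoop.hbase.HBaseConfiguration;
>     import org.apache.hadoop.hbase.TableName;
>     import org.apache.hadoop.hbase.client.Connection;
>     import org.apache.hadoop.hbase.client.ConnectionFactory;
>     import org.apache.hadoop.hbase.client.Put;
>     import org.apache.hadoop.hbase.client.Table;
>     import org.apache.hadoop.hbase.util.Bytes;
>
>     public class CheckAndPutSketch {
>       public static void main(String[] args) throws Exception {
>         try (Connection conn =
>                  ConnectionFactory.createConnection(HBaseConfiguration.create());
>              Table table = conn.getTable(TableName.valueOf("my_table"))) {
>           byte[] row = Bytes.toBytes("row-1");
>           byte[] cf = Bytes.toBytes("d");
>           byte[] q = Bytes.toBytes("state");
>           Put put = new Put(row);
>           put.addColumn(cf, q, Bytes.toBytes("new-value"));
>           // The region server takes the row's write lock, reads the current
>           // cell to evaluate the check (from the BC if cached, from S3 if
>           // not), and only then applies the Put.
>           boolean applied =
>               table.checkAndPut(row, cf, q, Bytes.toBytes("expected-value"), put);
>           System.out.println("applied = " + applied);
>         }
>       }
>     }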
>
> On the question of why the compacted file content is not cached: yes, it
> works that way even if cache-on-write is true. This is because sometimes
> the compacted result file can be very large (when it is a major
> compaction), and writing it would exhaust the BC. Also it might contain
> some data which is very old. There is a recently raised JIRA which
> discusses this. Please see HBASE-20045.
>
>
> -Anoop-
>
> On Sun, Mar 11, 2018 at 7:57 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> > Although now that I think about this a bit more, all the failures we saw
> > were failures to obtain a row lock, and in the thread stack traces we
> > always saw it somewhere inside getRowLockInternal and similar. I never
> > saw any contention on the bucket cache lock.
> >
> > Cheers.
> >
> > ----
> > Saad
> >
> >
> > On Sat, Mar 10, 2018 at 8:04 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >
> >> Also, for now we have mitigated this problem by using the new setting in
> >> HBase 1.4.0 that prevents one slow region server from blocking all
> >> client requests. Of course it causes some timeouts, but our overall
> >> ecosystem contains Kafka queues for retries, so we can live with that.
> >> From what I can see, it looks like this setting also has the good effect
> >> of preventing clients from hammering a region server that is slow
> >> because its IPC queues are backed up, allowing it to recover faster.
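> >>
> >> As far as I can tell, the setting in question is the HBASE-16388
> >> client-side cap, hbase.client.perserver.requests.threshold. A rough
> >> sketch of wiring it up on the client (the threshold value here is
> >> illustrative, not a recommendation):
> >>
> >>     import org.apache.hadoop.conf.Configuration;
> >>     import org.apache.hadoop.hbase.HBaseConfiguration;
> >>     import org.apache.hadoop.hbase.client.Connection;
> >>     import org.apache.hadoop.hbase.client.ConnectionFactory;
> >>
> >>     public class ClientThresholdSketch {
> >>       public static Connection createConnection() throws Exception {
> >>         Configuration conf = HBaseConfiguration.create();
> >>         // Cap concurrent outstanding requests per region server so that
> >>         // one slow server fails fast instead of tying up every client
> >>         // thread; the default is effectively unlimited.
> >>         conf.setInt("hbase.client.perserver.requests.threshold", 2048);
> >>         return ConnectionFactory.createConnection(conf);
> >>       }
> >>     }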
> >>
> >> Does that make sense?
> >>
> >> Cheers.
> >>
> >> ----
> >> Saad
> >>
> >>
> >> On Sat, Mar 10, 2018 at 7:04 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >>
> >>> So if I understand correctly, we would mitigate the problem by not
> >>> evicting blocks for archived files immediately? Wouldn't this
> >>> potentially lead to problems later if the LRU algo chooses to evict
> >>> blocks for active files and leave blocks for archived files in there?
> >>>
> >>> I would definitely love to test this!!! Unfortunately we are running on
> >>> EMR, and the details of how to patch HBase under EMR are not clear to
> >>> me :-(
> >>>
> >>> What we would really love would be a setting for actually immediately
> >>> caching blocks for a newly compacted file. I have seen in the code that
> >>> even if we have the cache-on-write setting set to true, it will refuse
> >>> to cache blocks for a file that is a newly compacted one. In our case
> >>> we have sized the bucket cache to be big enough to hold all our data,
> >>> and we really want to avoid having to go to S3 until the last possible
> >>> moment. A config setting to test this would be great.
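> >>>
> >>> For context, a sketch of the column-family-level caching knobs we are
> >>> talking about (table and family names are placeholders). Note that, per
> >>> the above, the 1.4 write path still skips caching blocks of newly
> >>> compacted files even with cache-on-write enabled, which is exactly the
> >>> gap a new config setting would fill:
> >>>
> >>>     import org.apache.hadoop.hbase.HColumnDescriptor;
> >>>     import org.apache.hadoop.hbase.HTableDescriptor;
> >>>     import org.apache.hadoop.hbase.TableName;
> >>>     import org.apache.hadoop.hbase.client.Admin;
> >>>     import org.apache.hadoop.hbase.util.Bytes;
> >>>
> >>>     public class CacheSettingsSketch {
> >>>       // Assumes an existing Admin handle and an existing table/family.
> >>>       public static void tuneFamily(Admin admin) throws Exception {
> >>>         TableName table = TableName.valueOf("my_table");
> >>>         HTableDescriptor desc = admin.getTableDescriptor(table);
> >>>         HColumnDescriptor cf = desc.getFamily(Bytes.toBytes("d"));
> >>>         cf.setCacheDataOnWrite(true);      // cache data blocks on flush
> >>>         cf.setCacheIndexesOnWrite(true);   // cache index blocks on write
> >>>         cf.setCacheBloomsOnWrite(true);    // cache bloom blocks on write
> >>>         cf.setPrefetchBlocksOnOpen(true);  // async prefetch on region open
> >>>         // Newly compacted files are still not cached on write in 1.4;
> >>>         // the prefetch/read path has to repopulate the cache.
> >>>         admin.modifyColumn(table, cf);
> >>>       }
> >>>     }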
> >>>
> >>> But thanks everyone for your feedback. Any more would also be welcome
> >>> on the idea to let a user cache all newly compacted files.
> >>>
> >>> ----
> >>> Saad
> >>>
> >>>
> >>> On Wed, Mar 7, 2018 at 12:00 AM, Anoop John <anoop.hb...@gmail.com>
> >>> wrote:
> >>>
> >>>> >>a) it was indeed one of the regions that was being compacted, major
> >>>> >>compaction in one case, minor compaction in another, the issue started
> >>>> >>just after compaction completed blowing away bucket cached blocks for
> >>>> >>the older HFile's
> >>>>
> >>>> About this part. Ya, after the compaction there is a step where the
> >>>> compacted-away HFile's blocks get removed from the cache. This op takes
> >>>> a write lock for this region (in the BucketCache layer). Every read op
> >>>> which is part of a checkAndPut will try to read from the BC, and that
> >>>> in turn needs a read lock for this region. So there is a chance that
> >>>> the read locks starve because of so many frequent write locks. Each
> >>>> block evict will acquire the write lock one after the other. Will it be
> >>>> possible for you to patch this evict and test once? We can avoid the
> >>>> immediate evict from the BC after compaction. I can help you with a
> >>>> patch if you wish.
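> >>>>
> >>>> To illustrate the pattern only (a toy model using a plain
> >>>> ReentrantReadWriteLock, not the actual BucketCache code): an "evictor"
> >>>> grabs the write lock once per evicted block, while a reader measures
> >>>> how long it waits for the read lock.
> >>>>
> >>>>     import java.util.concurrent.TimeUnit;
> >>>>     import java.util.concurrent.locks.ReentrantReadWriteLock;
> >>>>
> >>>>     public class EvictStarvationSketch {
> >>>>       public static void main(String[] args) throws Exception {
> >>>>         ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
> >>>>
> >>>>         // "Evictor": one short write-lock critical section per block.
> >>>>         Thread evictor = new Thread(() -> {
> >>>>           for (int block = 0; block < 100_000; block++) {
> >>>>             lock.writeLock().lock();
> >>>>             try {
> >>>>               // pretend to drop one compacted-away block from the cache
> >>>>             } finally {
> >>>>               lock.writeLock().unlock();
> >>>>             }
> >>>>           }
> >>>>         });
> >>>>
> >>>>         // "Reader": a checkAndPut-style read needing the read lock.
> >>>>         Thread reader = new Thread(() -> {
> >>>>           long start = System.nanoTime();
> >>>>           lock.readLock().lock();
> >>>>           try {
> >>>>             long waitedMs =
> >>>>                 TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
> >>>>             System.out.println("read lock acquired after " + waitedMs + " ms");
> >>>>           } finally {
> >>>>             lock.readLock().unlock();
> >>>>           }
> >>>>         });
> >>>>
> >>>>         evictor.start();
> >>>>         reader.start();
> >>>>         evictor.join();
> >>>>         reader.join();
> >>>>       }
> >>>>     }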
> >>>>
> >>>> Anoop
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Mar 5, 2018 at 11:07 AM, ramkrishna vasudevan <
> >>>> ramkrishna.s.vasude...@gmail.com> wrote:
> >>>> > Hi Saad
> >>>> >
> >>>> > Your argument here
> >>>> >>> The theory is that since prefetch is an async operation, a lot of
> >>>> >>> the reads in the checkAndPut for the region in question start
> >>>> >>> reading from S3 which is slow. So the write lock obtained for the
> >>>> >>> checkAndPut is held for a longer duration than normal. This has
> >>>> >>> cascading upstream effects. Does that sound plausible?
> >>>> >
> >>>> > Seems very much plausible. So before the prefetch has even happened
> >>>> > for, say, 'block 1', and you have already issued N checkAndPut calls
> >>>> > for the rows in that 'block 1', all those checkAndPuts will have to
> >>>> > read that block from S3 to perform the get() and then apply the
> >>>> > mutation.
> >>>> >
> >>>> > This may happen for multiple threads at the same time, because we
> >>>> > are not sure when the prefetch would have actually been completed. I
> >>>> > don't know what the general read characteristics are when a read
> >>>> > happens from S3, but you could try to see how things work when a
> >>>> > read happens from S3, and then, after the prefetch completes, ensure
> >>>> > the same checkAndPut() is done (from cache this time) to really know
> >>>> > the difference that S3 makes there.
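> >>>> >
> >>>> > Something like the following helper (row/family/qualifier names are
> >>>> > placeholders) could be used to compare the checkAndPut latency right
> >>>> > after the region opens (block likely read from S3) with the latency
> >>>> > after the prefetch has completed (block served from the bucket cache):
> >>>> >
> >>>> >     import java.util.concurrent.TimeUnit;
> >>>> >     import org.apache.hadoop.hbase.client.Put;
> >>>> >     import org.apache.hadoop.hbase.client.Table;
> >>>> >     import org.apache.hadoop.hbase.util.Bytes;
> >>>> >
> >>>> >     public class CheckAndPutTimer {
> >>>> >       // Assumes a Table handle obtained elsewhere.
> >>>> >       public static long timedCheckAndPutMs(Table table, String rowKey)
> >>>> >           throws Exception {
> >>>> >         byte[] row = Bytes.toBytes(rowKey);
> >>>> >         byte[] cf = Bytes.toBytes("d");
> >>>> >         byte[] q = Bytes.toBytes("state");
> >>>> >         Put put = new Put(row);
> >>>> >         put.addColumn(cf, q, Bytes.toBytes("new-value"));
> >>>> >         long start = System.nanoTime();
> >>>> >         // Run once right after region open, again after prefetch
> >>>> >         // finishes, and compare the two latencies.
> >>>> >         table.checkAndPut(row, cf, q, Bytes.toBytes("expected-value"), put);
> >>>> >         return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
> >>>> >       }
> >>>> >     }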
> >>>> >
> >>>> > Regards
> >>>> > Ram
> >>>> >
> >>>> > On Fri, Mar 2, 2018 at 2:57 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >>>> >
> >>>> >> So after much investigation I can confirm:
> >>>> >>
> >>>> >> a) it was indeed one of the regions that was being compacted, major
> >>>> >> compaction in one case, minor compaction in another; the issue
> >>>> >> started just after compaction completed, blowing away bucket cached
> >>>> >> blocks for the older HFiles
> >>>> >> b) in another case there was no compaction, just a newly opened
> >>>> >> region in a region server that hadn't finished prefetching its
> >>>> >> pages from S3
> >>>> >>
> >>>> >> We have prefetch on open set to true. Our load is heavy on
> >>>> >> checkAndPut. The theory is that since prefetch is an async
> >>>> >> operation, a lot of the reads in the checkAndPut for the region in
> >>>> >> question start reading from S3, which is slow. So the write lock
> >>>> >> obtained for the checkAndPut is held for a longer duration than
> >>>> >> normal. This has cascading upstream effects. Does that sound
> >>>> >> plausible?
> >>>> >>
> >>>> >> The part I still don't understand is that all the locks held are
> >>>> >> for the same region but are all for different rows. So once the
> >>>> >> prefetch is completed, shouldn't the problem clear up quickly? Or
> >>>> >> does the slow region slow down anyone trying to do a checkAndPut on
> >>>> >> any row in the same region even after the prefetch has completed?
> >>>> >> That is, do the long-held row locks prevent others from getting a
> >>>> >> row lock on a different row in the same region?
> >>>> >>
> >>>> >> In any case, we are trying to use
> >>>> >> https://issues.apache.org/jira/browse/HBASE-16388 support in HBase
> >>>> >> 1.4.0 both to insulate the app a bit from this situation and in the
> >>>> >> hope that it will reduce pressure on the region server in question,
> >>>> >> allowing it to recover faster. I haven't quite tested that yet; any
> >>>> >> advice in the meantime would be appreciated.
> >>>> >>
> >>>> >> Cheers.
> >>>> >>
> >>>> >> ----
> >>>> >> Saad
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >> On Thu, Mar 1, 2018 at 9:21 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >>>> >>
> >>>> >> > Actually it happened again while some minor compactions were
> >>>> >> > running, so I don't think it is related to our major compaction
> >>>> >> > tool, which isn't even running right now. I will try to capture a
> >>>> >> > debug dump of threads and everything while the event is ongoing.
> >>>> >> > It seems to last at least half an hour or so and sometimes longer.
> >>>> >> >
> >>>> >> > ----
> >>>> >> > Saad
> >>>> >> >
> >>>> >> >
> >>>> >> > On Thu, Mar 1, 2018 at 7:54 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >>>> >> >
> >>>> >> >> Unfortunately I lost the stack trace overnight. But it does seem
> >>>> >> >> related to compaction, because now that the compaction tool is
> >>>> >> >> done, I don't see the issue anymore. I will run our incremental
> >>>> >> >> major compaction tool again and see if I can reproduce the issue.
> >>>> >> >>
> >>>> >> >> On the plus side, the system stayed stable and eventually
> >>>> >> >> recovered, although it did suffer all those timeouts.
> >>>> >> >>
> >>>> >> >> ----
> >>>> >> >> Saad
> >>>> >> >>
> >>>> >> >>
> >>>> >> >> On Wed, Feb 28, 2018 at 10:18 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >>>> >> >>
> >>>> >> >>> I'll paste a thread dump later, writing this from my phone  :-)
> >>>> >> >>>
> >>>> >> >>> So the same issue has happened at different times for different
> >>>> >> >>> regions, but I couldn't see that the region in question was the
> >>>> >> >>> one being compacted, either this time or earlier. Although I
> >>>> >> >>> might have missed an earlier correlation in the logs where the
> >>>> >> >>> issue started just after the compaction completed.
> >>>> >> >>>
> >>>> >> >>> Usually a compaction for this table's regions takes around 5-10
> >>>> >> >>> minutes: much less for its smaller column family, which is block
> >>>> >> >>> cache enabled, around a minute or less, and 5-10 minutes for the
> >>>> >> >>> much larger one, for which we have block cache disabled in the
> >>>> >> >>> schema because we don't ever read it in the primary cluster. So
> >>>> >> >>> the only impact on reads would be from that smaller column
> >>>> >> >>> family, which takes less than a minute to compact.
> >>>> >> >>>
> >>>> >> >>> But the issue, once started, doesn't seem to recover for a long
> >>>> >> >>> time, long past when any compaction on the region itself could
> >>>> >> >>> impact anything. The compaction tool, which is our own code, has
> >>>> >> >>> long since moved on to other regions.
> >>>> >> >>>
> >>>> >> >>> Cheers.
> >>>> >> >>>
> >>>> >> >>> ----
> >>>> >> >>> Saad
> >>>> >> >>>
> >>>> >> >>>
> >>>> >> >>> On Wed, Feb 28, 2018 at 9:39 PM Ted Yu <yuzhih...@gmail.com> wrote:
> >>>> >> >>>
> >>>> >> >>>> bq. timing out trying to obtain write locks on rows in that region.
> >>>> >> >>>>
> >>>> >> >>>> Can you confirm that the region under contention was the one
> >>>> >> >>>> being major compacted?
> >>>> >> >>>>
> >>>> >> >>>> Can you pastebin a thread dump so that we can have a better
> >>>> >> >>>> idea of the scenario?
> >>>> >> >>>>
> >>>> >> >>>> For the region being compacted, how long would the compaction
> >>>> >> >>>> take (just want to see if there was correlation between this
> >>>> >> >>>> duration and the timeout)?
> >>>> >> >>>>
> >>>> >> >>>> Cheers
> >>>> >> >>>>
> >>>> >> >>>> On Wed, Feb 28, 2018 at 6:31 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >>>> >> >>>>
> >>>> >> >>>> > Hi,
> >>>> >> >>>> >
> >>>> >> >>>> > We are running on Amazon EMR based HBase 1.4.0. We are
> >>>> >> >>>> > currently seeing a situation where sometimes a particular
> >>>> >> >>>> > region gets into a state where a lot of write requests to any
> >>>> >> >>>> > row in that region time out saying they failed to obtain a
> >>>> >> >>>> > lock on a row in the region, and eventually they experience
> >>>> >> >>>> > an IPC timeout. This causes the IPC queue to blow up in size
> >>>> >> >>>> > as requests get backed up, and that region server experiences
> >>>> >> >>>> > a much higher than normal timeout rate for all requests, not
> >>>> >> >>>> > just those timing out for failing to obtain the row lock.
> >>>> >> >>>> >
> >>>> >> >>>> > The strange thing is the rows are always different but the
> >>>> >> >>>> > region is always the same. So the question is, is there a
> >>>> >> >>>> > region component to how long a row write lock would be held?
> >>>> >> >>>> > I looked at the debug dump and the RowLocks section shows a
> >>>> >> >>>> > long list of write row locks held; all of them are from the
> >>>> >> >>>> > same region but different rows.
> >>>> >> >>>> >
> >>>> >> >>>> > Will trying to obtain a write row lock experience delays if
> >>>> >> >>>> > no one else holds a lock on the same row but the region
> >>>> >> >>>> > itself is experiencing read delays? We do have an incremental
> >>>> >> >>>> > compaction tool running that major compacts one region per
> >>>> >> >>>> > region server at a time, so that will drive pages out of the
> >>>> >> >>>> > bucket cache. But for most regions the impact is transient,
> >>>> >> >>>> > until the bucket cache gets populated by pages from the new
> >>>> >> >>>> > HFile. But for this one region we start timing out trying to
> >>>> >> >>>> > obtain write locks on rows in that region.
> >>>> >> >>>> >
> >>>> >> >>>> > Any insight anyone can provide would be most welcome.
> >>>> >> >>>> >
> >>>> >> >>>> > Cheers.
> >>>> >> >>>> >
> >>>> >> >>>> > ----
> >>>> >> >>>> > Saad
> >>>> >> >>>> >
> >>>> >> >>>>
> >>>> >> >>>
> >>>> >> >>
> >>>> >> >
> >>>> >>
> >>>>
> >>>
> >>>
> >>
>
