Although now that I think about this a bit more: all the failures we saw were failures to obtain a row lock, and in the thread stack traces we always saw it somewhere inside getRowLockInternal and similar. I never saw any contention on the bucket cache lock.

Cheers.

----
Saad
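For illustration, here is a minimal sketch of the row-lock pattern behind those stack traces. This is hypothetical code, not HBase's actual getRowLockInternal: the point is only that a checkAndPut-style operation holds a per-row lock across its internal read, so when the holder is stuck on a slow read, waiters time out acquiring the lock.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class RowLockSketch {
    // One lock per row key; grows unboundedly here, which is fine for a sketch.
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    boolean checkAndPut(String row, Runnable readCheckWrite, long timeoutMs)
            throws InterruptedException {
        ReentrantLock lock = locks.computeIfAbsent(row, r -> new ReentrantLock());
        // Waiters time out here when the current holder is stuck on a slow
        // read, e.g. an S3 fetch for a block evicted from the bucket cache.
        if (!lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            return false; // surfaces as "failed to obtain row lock"
        }
        try {
            readCheckWrite.run(); // read, compare, then apply the mutation
            return true;
        } finally {
            lock.unlock();
        }
    }
}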
On Sat, Mar 10, 2018 at 8:04 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

Also, for now we have mitigated this problem by using the new setting in HBase 1.4.0 that prevents one slow region server from blocking all client requests. It does cause some timeouts, but our overall ecosystem includes Kafka queues for retries, so we can live with that. From what I can see, this setting also has the good effect of preventing clients from hammering a region server that is slow because its IPC queues are backed up, allowing it to recover faster.

Does that make sense?

Cheers.

----
Saad

On Sat, Mar 10, 2018 at 7:04 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

So if I understand correctly, we would mitigate the problem by not evicting blocks for archived files immediately? Wouldn't this potentially lead to problems later if the LRU algorithm chooses to evict blocks for active files and leaves blocks for archived files in the cache?

I would definitely love to test this!!! Unfortunately we are running on EMR, and the details of how to patch HBase under EMR are not clear to me :-(

What we would really love is a setting to immediately cache blocks for a newly compacted file. I have seen in the code that even if we have cache-on-write set to true, it will refuse to cache blocks for a file that is a newly compacted one. In our case we have sized the bucket cache to be big enough to hold all our data, and we really want to avoid going to S3 until the last possible moment. A config setting to test this would be great.

But thanks everyone for your feedback. Any more would also be welcome on the idea of letting a user cache all newly compacted files.

----
Saad

On Wed, Mar 7, 2018 at 12:00 AM, Anoop John <anoop.hb...@gmail.com> wrote:

>> a) it was indeed one of the regions that was being compacted, major compaction in one case, minor compaction in another; the issue started just after compaction completed, blowing away bucket cached blocks for the older HFiles

About this part. Yes, after the compaction there is a step where the compacted-away HFile's blocks get removed from the cache. This op takes a write lock for this region (in the bucket cache layer). Every read op that is part of a checkAndPut will try to read from the bucket cache, and that in turn needs a read lock for this region. So there is a chance that the read locks starve because of so many frequent write locks: each block evict acquires the write lock one after the other. Would it be possible for you to patch this evict and test once? We can avoid the immediate evict from the bucket cache after compaction. I can help you with a patch if you wish.

Anoop
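For illustration, a minimal sketch of the contention pattern Anoop describes. This is hypothetical code, not HBase's BucketCache implementation: a burst of per-block evictions, each taking the region's write lock in turn, competes with every read (each checkAndPut's internal get), which needs the read lock.

import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class BucketCacheSketch {
    private final ReentrantReadWriteLock regionLock = new ReentrantReadWriteLock();
    private final Map<String, byte[]> blocks;

    BucketCacheSketch(Map<String, byte[]> blocks) { this.blocks = blocks; }

    byte[] getBlock(String key) {
        regionLock.readLock().lock();      // every read competes here
        try {
            return blocks.get(key);
        } finally {
            regionLock.readLock().unlock();
        }
    }

    void evictBlocksForCompactedFile(List<String> keys) {
        for (String key : keys) {
            regionLock.writeLock().lock(); // one write-lock cycle per evicted block
            try {
                blocks.remove(key);
            } finally {
                regionLock.writeLock().unlock();
            }
        }
    }
}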
On Mon, Mar 5, 2018 at 11:07 AM, ramkrishna vasudevan <ramkrishna.s.vasude...@gmail.com> wrote:

Hi Saad

Your argument here:

>> The theory is that since prefetch is an async operation, a lot of the reads in the checkAndPut for the region in question start reading from S3, which is slow. So the write lock obtained for the checkAndPut is held for a longer duration than normal. This has cascading upstream effects. Does that sound plausible?

Seems very much plausible. So before the prefetch even happens, say for 'block 1', and you have already issued N checkAndPut calls for the rows in that 'block 1', all those checkAndPut calls will have to read that block from S3 to perform the get() and then apply the mutation.

This may happen for multiple threads at the same time, because we are not sure when the prefetch would actually have completed. I don't know what the general read characteristics are when a read happens from S3, but you could try to see how things work when a read happens from S3, and then, after the prefetch completes, ensure the same checkAndPut() is done (from cache this time) to really know the difference that S3 makes there.

Regards
Ram

On Fri, Mar 2, 2018 at 2:57 AM, Saad Mufti <saad.mu...@gmail.com> wrote:

So after much investigation I can confirm:

a) it was indeed one of the regions that was being compacted, major compaction in one case, minor compaction in another; the issue started just after compaction completed, blowing away bucket cached blocks for the older HFiles
b) in another case there was no compaction, just a newly opened region in a region server that hadn't finished prefetching its pages from S3

We have prefetch on open set to true. Our load is heavy on checkAndPut. The theory is that since prefetch is an async operation, a lot of the reads in the checkAndPut for the region in question start reading from S3, which is slow. So the write lock obtained for the checkAndPut is held for a longer duration than normal. This has cascading upstream effects. Does that sound plausible?

The part I still don't understand is that all the locks held are for the same region but are all for different rows. So once the prefetch is completed, shouldn't the problem clear up quickly? Or does the slow region slow down anyone trying to do a checkAndPut on any row in the same region even after the prefetch has completed? That is, do the long-held row locks prevent others from getting a row lock on a different row in the same region?

In any case, we are trying to use https://issues.apache.org/jira/browse/HBASE-16388 support in HBase 1.4.0 to both insulate the app a bit from this situation and, hopefully, reduce pressure on the region server in question, allowing it to recover faster. I haven't quite tested that yet; any advice in the meantime would be appreciated.

Cheers.

----
Saad

On Thu, Mar 1, 2018 at 9:21 AM, Saad Mufti <saad.mu...@gmail.com> wrote:

Actually it happened again while some minor compactions were running, so I don't think it is related to our major compaction tool, which isn't even running right now. I will try to capture a debug dump of threads and everything while the event is ongoing. It seems to last at least half an hour or so, and sometimes longer.

----
Saad
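For reference, the HBASE-16388 setting discussed above is applied client-side. A hedged sketch, assuming the hbase.client.perserver.requests.threshold property from that JIRA; the threshold value here is purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class PerServerLimitExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Cap concurrent client requests to any single region server; once
        // the cap is hit, further requests to that server fail fast instead
        // of piling onto its backed-up IPC queues.
        conf.setInt("hbase.client.perserver.requests.threshold", 2048);
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            // ... normal table operations; failed requests can be retried
            // from an external queue (e.g. Kafka), as described above.
        }
    }
}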
On Thu, Mar 1, 2018 at 7:54 AM, Saad Mufti <saad.mu...@gmail.com> wrote:

Unfortunately I lost the stack trace overnight. But it does seem related to compaction, because now that the compaction tool is done, I don't see the issue anymore. I will run our incremental major compaction tool again and see if I can reproduce the issue.

On the plus side, the system stayed stable and eventually recovered, although it did suffer all those timeouts.

----
Saad

On Wed, Feb 28, 2018 at 10:18 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

I'll paste a thread dump later, writing this from my phone :-)

So the same issue has happened at different times for different regions, but I couldn't see that the region in question was the one being compacted, either this time or earlier. Although I might have missed an earlier correlation in the logs where the issue started just after the compaction completed.

Usually a compaction for this table's regions takes around 5-10 minutes: much less for its smaller column family, which has the block cache enabled, around a minute or less, and 5-10 minutes for the much larger one, for which we have the block cache disabled in the schema because we don't ever read it in the primary cluster. So the only impact on reads would be from that smaller column family, which takes less than a minute to compact.

But the issue, once started, doesn't seem to recover for a long time, long past when any compaction on the region itself could impact anything. The compaction tool, which is our own code, has long since moved on to other regions.

Cheers.

----
Saad

On Wed, Feb 28, 2018 at 9:39 PM, Ted Yu <yuzhih...@gmail.com> wrote:

bq. timing out trying to obtain write locks on rows in that region.

Can you confirm that the region under contention was the one being major compacted?

Can you pastebin a thread dump so that we can have a better idea of the scenario?

For the region being compacted, how long would the compaction take (just want to see if there is a correlation between this duration and the timeout)?

Cheers
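For reference, the prefetch-on-open behavior discussed throughout this thread is a per-column-family attribute in the HBase 1.x API. A minimal sketch; the table and family names are made up for illustration:

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;

public class PrefetchOnOpenExample {
    public static HTableDescriptor describe() {
        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("my_table"));
        HColumnDescriptor smallCf = new HColumnDescriptor("s");
        smallCf.setBlockCacheEnabled(true);
        // Asynchronously load this family's blocks into the cache when a
        // region opens; until the prefetch finishes, reads fall through to
        // the underlying store (S3 here), which is the slow window this
        // thread is about.
        smallCf.setPrefetchBlocksOnOpen(true);
        table.addFamily(smallCf);
        return table;
    }
}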
On Wed, Feb 28, 2018 at 6:31 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

Hi,

We are running on Amazon EMR based HBase 1.4.0. We are currently seeing a situation where sometimes a particular region gets into a state where a lot of write requests to any row in that region time out, saying they failed to obtain a lock on a row in the region, and eventually they experience an IPC timeout. This causes the IPC queue to blow up in size as requests get backed up, and that region server experiences a much higher than normal timeout rate for all requests, not just those timing out for failing to obtain the row lock.

The strange thing is the rows are always different, but the region is always the same. So the question is: is there a region component to how long a row write lock is held? I looked at the debug dump, and the RowLocks section shows a long list of write row locks held, all of them from the same region but for different rows.

Will trying to obtain a write row lock experience delays if no one else holds a lock on the same row but the region itself is experiencing read delays? We do have an incremental compaction tool running that major compacts one region per region server at a time, so that will drive pages out of the bucket cache. For most regions the impact is transitional until the bucket cache gets repopulated with pages from the new HFile, but for this one region we start timing out trying to obtain write locks on rows in that region.

Any insight anyone can provide would be most welcome.

Cheers.

----
Saad
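For reference, a minimal sketch of the checkAndPut-heavy workload described in this thread, using the HBase 1.x client API; the table, family, qualifier, and values are illustrative only:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckAndPutExample {
    static boolean update(Connection conn, byte[] row) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("my_table"))) {
            Put put = new Put(row);
            put.addColumn(Bytes.toBytes("s"), Bytes.toBytes("state"), Bytes.toBytes("new"));
            // Atomically: read the cell, compare it to the expected value,
            // and apply the Put if it matches. The row lock is held across
            // the internal read plus the write, so if that read misses the
            // cache and goes to S3, the lock is held that much longer and
            // other writers to the same row queue up behind it.
            return table.checkAndPut(row, Bytes.toBytes("s"), Bytes.toBytes("state"),
                Bytes.toBytes("old"), put);
        }
    }
}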