Thanks. I left a comment on that ticket.
Saad
On Sat, Mar 10, 2018 at 11:57 PM, Anoop John wrote:
Hi Saad
In your initial mail you mentioned that there are lots
of checkAndPut ops but on different rows. The failure in obtaining
locks (write lock as it is checkAndPut) means there is contention on
the same row key. If that is the case, ya that is the 1st step
before BC reads and
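For readers skimming the thread: checkAndPut is atomic per row, so the write lock it takes is only contended when two operations target the same row key. A minimal sketch of the call with the 1.x client API (the table, family, and qualifier names below are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckAndPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("my_table"))) {
            byte[] row  = Bytes.toBytes("row-key-1");
            byte[] cf   = Bytes.toBytes("d");
            byte[] qual = Bytes.toBytes("state");

            Put put = new Put(row);
            put.addColumn(cf, qual, Bytes.toBytes("NEW"));

            // checkAndPut is atomic per row: the region server takes the row's
            // write lock, re-reads the current value, compares it, and then
            // applies the Put. Two checkAndPut calls only contend on that lock
            // when they target the same row key.
            boolean applied = table.checkAndPut(row, cf, qual, Bytes.toBytes("OLD"), put);
            System.out.println("applied = " + applied);
        }
    }
}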
Although now that I think about this a bit more, all the failures we saw
were failures to obtain a row lock, and in the thread stack traces we always
saw it somewhere inside getRowLockInternal and similar. I never saw any
contention on the bucket cache lock.
Cheers.
Saad
Also, for now we have mitigated this problem by using the new setting in
HBase 1.4.0 that prevents one slow region server from blocking all client
requests. Of course it causes some timeouts, but our overall ecosystem
contains Kafka queues for retries, so we can live with that. From what I
can see,
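For anyone wanting to try the same mitigation: the 1.4.0 setting being referred to appears to be hbase.client.perserver.requests.threshold (HBASE-16388), which makes requests to an overloaded region server fail fast instead of tying up every client thread. A client-side sketch, assuming that is the right property:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class FailFastClientConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Assumption: this is the HBASE-16388 client-side knob. Once this many
        // requests are outstanding against a single region server, further
        // requests to that server fail fast instead of queueing behind it.
        // It is off (effectively unlimited) unless set explicitly.
        conf.setInt("hbase.client.perserver.requests.threshold", 2048);
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            // Use the connection as usual; only requests to the slow server
            // are rejected early.
        }
    }
}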
So if I understand correctly, we would mitigate the problem by not evicting
blocks for archived files immediately? Wouldn't this potentially lead to
problems later if the LRU algo chooses to evict blocks for active files and
leave blocks for archived files in there?
I would definitely love to
>>a) it was indeed one of the regions that was being compacted, major
compaction in one case, minor compaction in another; the issue started just
after the compaction completed, blowing away the bucket-cached blocks for the
older HFiles
About this part. Ya, after the compaction, there is a step where
Hi Saad
Your argument here
>> The
>>theory is that since prefetch is an async operation, a lot of the reads in
>>the checkAndPut for the region in question start reading from S3 which is
>>slow. So the write lock obtained for the checkAndPut is held for a longer
>>duration than normal. This has
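A side note for anyone trying to reproduce this: the theory above assumes prefetch-on-open is enabled for the column family in question. A sketch of how that is typically turned on with the 1.x API (the table and family names are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class EnablePrefetch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // PREFETCH_BLOCKS_ON_OPEN tells the region server to warm the block
            // cache for each newly opened HFile from a background thread -- the
            // async prefetch referred to above. Table and family names here are
            // placeholders.
            HColumnDescriptor family = new HColumnDescriptor("d");
            family.setPrefetchBlocksOnOpen(true);
            admin.modifyColumn(TableName.valueOf("my_table"), family);
        }
    }
}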
So after much investigation I can confirm:
a) it was indeed one of the regions that was being compacted, major
compaction in one case, minor compaction in another; the issue started just
after the compaction completed, blowing away the bucket-cached blocks for the
older HFiles
b) in another case there
Actually it happened again while some minor compactions were running, so I
don't think it's related to our major compaction tool, which isn't even
running right now. I will try to capture a debug dump of threads and
everything while the event is ongoing. Seems to last at least half an hour
or so and
Unfortunately I lost the stack trace overnight. But it does seem related to
compaction, because now that the compaction tool is done, I don't see the
issue anymore. I will run our incremental major compaction tool again and
see if I can reproduce the issue.
On the plus side the system stayed
I'll paste a thread dump later, writing this from my phone :-)
So the same issue has happened at different times for different regions,
but I couldn't see that the region in question was the one being compacted,
either this time or earlier. Although I might have missed an earlier
correlation in
One additional data point: I tried to manually re-assign the region in
question from the shell, which for some reason caused the region server to
restart, and the region did get assigned to another region server. But then
the problem moved to that region server almost immediately.
Does that just
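For reference, the programmatic equivalent of that shell re-assignment is Admin.move(); a sketch with placeholder region and server names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class MoveRegion {
    public static void main(String[] args) throws Exception {
        // Both values are placeholders: substitute the real encoded region name
        // (the hash suffix of the full region name) and a destination server
        // taken from the master UI, in "host,port,startcode" form.
        String encodedRegionName = "abcdef0123456789abcdef0123456789";
        String destServer = "regionserver-1.example.com,16020,1520000000000";

        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Equivalent to the shell's `move` command: closes the region on its
            // current server and opens it on the destination.
            admin.move(Bytes.toBytes(encodedRegionName), Bytes.toBytes(destServer));
        }
    }
}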
bq. timing out trying to obtain write locks on rows in that region.
Can you confirm that the region under contention was the one being major
compacted?
Can you pastebin a thread dump so that we can get a better idea of the
scenario?
For the region being compacted, how long would the compaction
Hi,
We are running HBase 1.4.0 on Amazon EMR. We are currently seeing a
situation where, from time to time, a particular region gets into a state
where a lot of write requests to any row in that region time out saying they
failed to obtain a lock on a row in the region, and eventually they
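For context on the timeout itself: as far as I can tell, the server-side wait before a row-lock acquisition gives up is governed by hbase.rowlock.wait.duration (milliseconds, 30000 by default). A sketch that only names the knob; on a real cluster it belongs in the region servers' hbase-site.xml rather than in client code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RowLockTimeoutConfig {
    public static void main(String[] args) {
        // Assumption: hbase.rowlock.wait.duration is how long a region server
        // waits to acquire a row lock (e.g. for checkAndPut) before failing the
        // operation. 30000 ms is the 1.x default.
        Configuration conf = HBaseConfiguration.create();
        conf.setInt("hbase.rowlock.wait.duration", 30000);
        System.out.println("row lock wait = "
                + conf.getInt("hbase.rowlock.wait.duration", 30000) + " ms");
    }
}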