Hi HBase Community!

I have some questions about block caching, specifically how the prefetch and 
cacheblocksonwrite settings work.

In our production environments we've been having some performance issues with 
our HBase deployment (HBase 1.4.9 as part of AWS EMR 5.22, with data backed by 
S3).

Looking into the issue, we've discovered that when regions of a particular 
table that are under heavy simultaneous write and read load go through a big 
compaction, the rpc handler threads will all block while servicing read 
requests to the region that was compacted. Here are a few relevant lines from a 
log where you can see the compaction happen. I've included a couple 
responseTooSlow warnings, but there are many more in the log after this:

2019-09-16 15:31:10,204 INFO  
[regionserver/ip-172-20-113-118.us-west-2.compute.internal/172.20.113.118:16020-shortCompactions-1568478085425]
 regionserver.HRegion: Starting compaction on a in region 
block_v2,\x07\x84\x8B>b\x00\x00\x14mU0p6,1567528560602.98be887c6f4938e0b492b17c669f3ac7.
2019-09-16 15:31:10,204 INFO  
[regionserver/ip-172-20-113-118.us-west-2.compute.internal/172.20.113.118:16020-shortCompactions-1568478085425]
 regionserver.HStore: Starting compaction of 8 file(s) in a of 
block_v2,\x07\x84\x8B>b\x00\x00\x14mU0p6,1567528560602.98be887c6f4938e0b492b17c669f3ac7.
 into 
tmpdir=s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/block_v2/98be887c6f4938e0b492b17c669f3ac7/.tmp,
 totalSize=3.0 G
2019-09-16 15:33:55,572 INFO  
[regionserver/ip-172-20-113-118.us-west-2.compute.internal/172.20.113.118:16020-shortCompactions-1568478085425]
 dispatch.DefaultMultipartUploadDispatcher: Completed multipart upload of 24 
parts 3144722724 bytes
2019-09-16 15:33:56,017 INFO  
[regionserver/ip-172-20-113-118.us-west-2.compute.internal/172.20.113.118:16020-shortCompactions-1568478085425]
 s3n2.S3NativeFileSystem2: rename 
s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/block_v2/98be887c6f4938e0b492b17c669f3ac7/.tmp/eede47d55e06454ca72482ce33529669
 
s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/block_v2/98be887c6f4938e0b492b17c669f3ac7/a/eede47d55e06454ca72482ce33529669
2019-09-16 15:34:03,328 WARN  
[RpcServer.default.FPBQ.Fifo.handler=3,queue=3,port=16020] ipc.RpcServer: 
(responseTooSlow): 
{"call":"Get(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest)","starttimems":1568648032777,"responsesize":562,"method":"Get","param":"region=
 
block_v2,\\x07\\x84\\x8B>b\\x00\\x00\\x14mU0p6,1567528560602.98be887c6f4938e0b492b17c669f3ac7.,
 row=\\x07\\x84\\x8B>b\\x00\\x00\\x14newAcPMKh/dkK2vGxPO1XI 
<TRUNCATED>","processingtimems":10551,"client":"172.20.132.45:51168","queuetimems":0,"class":"HRegionServer"}
2019-09-16 15:34:03,750 WARN  
[RpcServer.default.FPBQ.Fifo.handler=35,queue=5,port=16020] ipc.RpcServer: 
(responseTooSlow): 
{"call":"Get(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest)","starttimems":1568648032787,"responsesize":565,"method":"Get","param":"region=
 
block_v2,\\x07\\x84\\x8B>b\\x00\\x00\\x14mU0p6,1567528560602.98be887c6f4938e0b492b17c669f3ac7.,
 row=\\x07\\x84\\x8B>b\\x00\\x00\\x14nfet675AvHhY4nnKAV2iqu 
<TRUNCATED>","processingtimems":10963,"client":"172.20.112.226:52222","queuetimems":0,"class":"HRegionServer"}

Note that those log lines are from a "shortCompactions" thread. This also happens 
with major compactions, but I understand we can better control major 
compactions by running them manually during off-hours if we choose.

When this occurs we see the numCallsInGeneralQueue metric spike, and some 
threads in our application that service REST API requests get tied up, which 
causes 504 gateway timeouts for end users.

Thread dumps from the region server show that the rpc handler threads are 
blocking on an FSInputStream object (the read method is synchronized). Here is 
a pastebin of one such dump: https://pastebin.com/Mh0JWx3T
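
To illustrate what I think is happening, here is a minimal sketch (my own toy 
Java, not Hadoop or HBase code) of the pattern: many handler threads sharing one 
stream whose read method is synchronized, so a single slow S3 read serializes 
everyone else behind the same monitor, which matches the blocked handler threads 
in the dump:

import java.io.IOException;

// Stand-in for a shared FSInputStream: the positional read holds the object lock.
class SharedStream {
  public synchronized int read(long position, byte[] buf, int off, int len)
      throws IOException {
    // Simulate a slow S3 round trip while holding the lock.
    try { Thread.sleep(100); } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return len;
  }
}

public class ContentionSketch {
  public static void main(String[] args) {
    SharedStream stream = new SharedStream();
    // 30 "handler" threads all reading through the same stream: they queue up on
    // the monitor, just like the rpc handlers queue up behind the S3 input stream.
    for (int i = 0; i < 30; i++) {
      new Thread(() -> {
        try {
          stream.read(0L, new byte[64 * 1024], 0, 64 * 1024);
        } catch (IOException ignored) {
        }
      }, "handler-" + i).start();
    }
  }
}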

Because we are running in AWS with data backed by S3, where we expect higher 
read latencies, we host large bucket caches on the local disks of the region 
servers. Our understanding is that after the compaction, the relevant portions 
of the bucket cache are invalidated, so read requests have to go to S3; those 
requests all try to use the same input stream and block each other, and this 
continues until the cache is repopulated enough that performance returns to 
normal.
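
For reference, here is roughly how the bucket cache is configured (a sketch in 
Java purely for illustration; in the real cluster these are hbase-site.xml 
properties on each region server, and the path and size below are placeholders 
rather than our actual values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class BucketCacheConfSketch {
  // Approximation of our file-backed (L2) bucket cache settings.
  public static Configuration bucketCacheConf() {
    Configuration conf = HBaseConfiguration.create();
    // File-based IOEngine so the cache lives on local disk rather than heap.
    conf.set("hbase.bucketcache.ioengine", "file:/mnt/hbase/bucketcache.data");
    // Cache capacity; values greater than 1.0 are read as megabytes.
    conf.setFloat("hbase.bucketcache.size", 100f * 1024f); // placeholder, ~100 GB
    return conf;
  }
}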

In an effort to mitigate the effects of compaction on the cache, we enabled the 
hbase.rs.cacheblocksonwrite setting on our region servers. My understanding was 
that this would place blocks into the bucket cache while the new HFile was 
being written. However, after enabling this setting we are still seeing the 
same issue. Furthermore, we enabled the PREFETCH_BLOCKS_ON_OPEN setting on the 
column family, and when the issue occurs, one of the threads being blocked from 
reading is the prefetch thread.
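
For completeness, here is a sketch of how the two settings were enabled (again, 
Java only for illustration; hbase.rs.cacheblocksonwrite is actually set in 
hbase-site.xml and picked up on region server restart, and PREFETCH_BLOCKS_ON_OPEN 
was set on the column family, shown here via what I believe is the equivalent 
1.x client API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class CacheSettingsSketch {
  public static void main(String[] args) throws Exception {
    // Server-side property: cache blocks of newly written HFiles as they are
    // written. Setting it on a client Configuration has no effect; it is shown
    // here only to document the key and value we used.
    Configuration conf = HBaseConfiguration.create();
    conf.setBoolean("hbase.rs.cacheblocksonwrite", true);

    // Column family attribute, equivalent to PREFETCH_BLOCKS_ON_OPEN => 'true' in
    // the shell: prefetch blocks into the cache when a store file is opened.
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      TableName table = TableName.valueOf("block_v2");
      HTableDescriptor td = admin.getTableDescriptor(table);
      HColumnDescriptor cf = td.getFamily(Bytes.toBytes("a")); // family "a" from the logs above
      cf.setPrefetchBlocksOnOpen(true);
      admin.modifyColumn(table, cf);
    }
  }
}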

Here are my questions:
- Why does hbase.rs.cacheblocksonwrite not seem to be working? Does it only 
apply to flushes and not to compactions? I can see from the logs that the file 
is renamed after being written; does that have something to do with why 
cacheblocksonwrite isn't working?
- Why are the prefetch threads trying to read the same data? I thought 
prefetching only happened when a region is opened, and I confirmed from the 
master and region server logs that no region open was happening at the time. 
Maybe I have a misunderstanding of how/when prefetching comes into play?
- More generally, any other thoughts on how we can avoid this issue? It seems a 
shame that we have this nicely populated bucket cache, which is all but 
necessary on a slower file system like S3, only to have it suddenly invalidated 
by compactions. I'm wary of turning off compactions altogether during our peak 
load hours because I don't want updates to be blocked by too many store files.

Thanks for any help on this!

--Jacob LeBlanc
