Hi HBase Community!
I have some questions about block caches, specifically how the prefetch and
cacheblocksonwrite settings work.
In our production environments we've been having some performance issues with
our HBase deployment (HBase 1.4.9 as part of AWS EMR 5.22, with data backed by
S3).
Looking into the issue, we've discovered that when regions of a particular
table under heavy simultaneous write and read load go through a big
compaction, the RPC handler threads all block while servicing read
requests to the region that was compacted. Here are a few relevant lines from a
log where you can see the compaction happen. I've included a couple of
responseTooSlow warnings, but there are many more in the log after this:
2019-09-16 15:31:10,204 INFO
[regionserver/ip-172-20-113-118.us-west-2.compute.internal/172.20.113.118:16020-shortCompactions-1568478085425]
regionserver.HRegion: Starting compaction on a in region
block_v2,\x07\x84\x8B>b\x00\x00\x14mU0p6,1567528560602.98be887c6f4938e0b492b17c669f3ac7.
2019-09-16 15:31:10,204 INFO
[regionserver/ip-172-20-113-118.us-west-2.compute.internal/172.20.113.118:16020-shortCompactions-1568478085425]
regionserver.HStore: Starting compaction of 8 file(s) in a of
block_v2,\x07\x84\x8B>b\x00\x00\x14mU0p6,1567528560602.98be887c6f4938e0b492b17c669f3ac7.
into
tmpdir=s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/block_v2/98be887c6f4938e0b492b17c669f3ac7/.tmp,
totalSize=3.0 G
2019-09-16 15:33:55,572 INFO
[regionserver/ip-172-20-113-118.us-west-2.compute.internal/172.20.113.118:16020-shortCompactions-1568478085425]
dispatch.DefaultMultipartUploadDispatcher: Completed multipart upload of 24
parts 3144722724 bytes
2019-09-16 15:33:56,017 INFO
[regionserver/ip-172-20-113-118.us-west-2.compute.internal/172.20.113.118:16020-shortCompactions-1568478085425]
s3n2.S3NativeFileSystem2: rename
s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/block_v2/98be887c6f4938e0b492b17c669f3ac7/.tmp/eede47d55e06454ca72482ce33529669
s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/block_v2/98be887c6f4938e0b492b17c669f3ac7/a/eede47d55e06454ca72482ce33529669
2019-09-16 15:34:03,328 WARN
[RpcServer.default.FPBQ.Fifo.handler=3,queue=3,port=16020] ipc.RpcServer:
(responseTooSlow):
{"call":"Get(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest)","starttimems":1568648032777,"responsesize":562,"method":"Get","param":"region=
block_v2,\\x07\\x84\\x8B>b\\x00\\x00\\x14mU0p6,1567528560602.98be887c6f4938e0b492b17c669f3ac7.,
row=\\x07\\x84\\x8B>b\\x00\\x00\\x14newAcPMKh/dkK2vGxPO1XI
<TRUNCATED>","processingtimems":10551,"client":"172.20.132.45:51168","queuetimems":0,"class":"HRegionServer"}
2019-09-16 15:34:03,750 WARN
[RpcServer.default.FPBQ.Fifo.handler=35,queue=5,port=16020] ipc.RpcServer:
(responseTooSlow):
{"call":"Get(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest)","starttimems":1568648032787,"responsesize":565,"method":"Get","param":"region=
block_v2,\\x07\\x84\\x8B>b\\x00\\x00\\x14mU0p6,1567528560602.98be887c6f4938e0b492b17c669f3ac7.,
row=\\x07\\x84\\x8B>b\\x00\\x00\\x14nfet675AvHhY4nnKAV2iqu
<TRUNCATED>","processingtimems":10963,"client":"172.20.112.226:52222","queuetimems":0,"class":"HRegionServer"}
Note that those log lines are from a "shortCompactions" thread. This also happens
with major compactions, but I understand we can better control major
compactions by running them manually during off hours if we choose.
When this occurs we see the numCallsInGeneralQueue metric spike, and some of the
threads in our application that service REST API requests get tied up, which
causes 504 gateway timeouts for end users.
Thread dumps from the region server show that the RPC handler threads are
blocking on an FSInputStream object (its read method is synchronized). Here is
a pastebin of one such dump: https://pastebin.com/Mh0JWx3T
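To illustrate what I think the thread dump is showing, here is a minimal Java
sketch (not HBase code, just the pattern): many handler threads sharing one
stream whose read method is synchronized, so each slow read has to wait for the
previous one to finish. The class names and timings below are made up purely
for illustration.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SharedStreamContention {

    // Stand-in for a single FSInputStream shared by all readers of a store file.
    static class SlowSynchronizedStream {
        synchronized void read(long position, byte[] buf) throws InterruptedException {
            Thread.sleep(100); // pretend this is one S3 round trip
        }
    }

    public static void main(String[] args) {
        SlowSynchronizedStream stream = new SlowSynchronizedStream();
        ExecutorService handlers = Executors.newFixedThreadPool(30); // like RPC handler threads
        for (int i = 0; i < 30; i++) {
            handlers.submit(() -> {
                try {
                    long start = System.nanoTime();
                    stream.read(0L, new byte[64 * 1024]);
                    // Later submissions report roughly 100 ms times their queue
                    // position, not 100 ms: the reads execute one at a time.
                    System.out.printf("read took %d ms%n", (System.nanoTime() - start) / 1_000_000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        handlers.shutdown();
    }
}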
Because we are running in AWS with data backed by S3, where we expect read
latencies to be higher, we host large bucket caches on the local disk of the
region servers. Our understanding is that after the compaction the relevant
portions of the bucket cache are invalidated, which forces read requests to go
to S3; those requests all try to use the same input stream and block each
other, and this continues until the cache is populated enough that performance
returns to normal.
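For context, the bucket cache is set up roughly like this, using the standard
hbase.bucketcache.* properties from hbase-site.xml. I'm showing them as Java
against a Configuration object only to keep the examples in one form, and the
path and size are placeholders rather than our exact values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class BucketCacheConfigSketch {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // File-backed bucket cache on the region server's local disk.
        conf.set("hbase.bucketcache.ioengine", "file:/mnt/hbase/bucketcache"); // placeholder path
        // Bucket cache capacity in MB (placeholder value).
        conf.set("hbase.bucketcache.size", "65536");
        System.out.println(conf.get("hbase.bucketcache.ioengine"));
    }
}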
In an effort to mitigate the effects of compaction on the cache, we enabled the
hbase.rs.cacheblocksonwrite setting on our region servers. My understanding was
that this would place blocks into the bucket cache while the new HFile was
being written. However, after enabling this setting we are still seeing the
same issue. Furthermore, we enabled the PREFETCH_BLOCKS_ON_OPEN setting on the
column family, and when the issue occurs, one of the threads blocked from
reading is the prefetching thread.
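In case it matters, here is roughly how the two settings are enabled. The
server-side property lives in hbase-site.xml on the region servers, and the
column family attribute can be set through the 1.x Java client (or the shell).
The table name "block_v2" and family "a" match the logs above; the rest is a
sketch, not our exact code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class EnableCacheSettingsSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // Server-side setting (set in hbase-site.xml on every region server):
        //   hbase.rs.cacheblocksonwrite = true
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            TableName table = TableName.valueOf("block_v2");
            HTableDescriptor desc = admin.getTableDescriptor(table);
            HColumnDescriptor family = desc.getFamily(Bytes.toBytes("a"));
            family.setPrefetchBlocksOnOpen(true);  // PREFETCH_BLOCKS_ON_OPEN => 'true'
            // family.setCacheDataOnWrite(true);   // per-CF cache-on-write, if we wanted that instead
            admin.modifyColumn(table, family);     // like: alter 'block_v2', {NAME => 'a', PREFETCH_BLOCKS_ON_OPEN => 'true'}
        }
    }
}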
Here are my questions:
- Why does hbase.rs.cacheblocksonwrite not seem to be working? Does it only
apply to flushes and not to compactions? I can see from the logs that the file
is renamed after being written; does that have something to do with why
cacheblocksonwrite isn't taking effect?
- Why are the prefetching threads trying to read the same data? I thought
prefetching only happened when a region is opened, and I confirmed from the
master and region server logs that no region open was happening. Maybe I have a
misunderstanding of how/when prefetching comes into play?
- Speaking more generally, any other thoughts on how we can avoid this issue?
It seems a shame that we have this nicely populated bucket cache, which is
somewhat necessary with a slower file system (S3), only for the cache to be
suddenly invalidated by compactions. I'm wary of turning off compactions
altogether during our peak load hours because I don't want updates to be
blocked by too many store files (a sketch of the settings I mean follows after
this list).
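To make that trade-off concrete, these are the settings I believe govern the
blocking behavior and off-peak compaction; the values below are placeholders,
not a recommendation, and they would normally go in hbase-site.xml. Corrections
welcome if I have any of this wrong:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CompactionTuningSketch {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Declare an off-peak window so the compaction policy can be more
        // aggressive at night and less aggressive during peak hours.
        conf.setInt("hbase.offpeak.start.hour", 2);                    // placeholder
        conf.setInt("hbase.offpeak.end.hour", 6);                      // placeholder
        conf.setFloat("hbase.hstore.compaction.ratio.offpeak", 5.0f);  // placeholder
        // The limit behind my worry: updates to a region block once any of its
        // stores reaches this many HFiles, until compaction catches up.
        conf.setInt("hbase.hstore.blockingStoreFiles", 16);            // placeholder
        System.out.println(conf.get("hbase.hstore.blockingStoreFiles"));
    }
}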
Thanks for any help on this!
--Jacob LeBlanc