Hey again,
I'm running a bulk load on S3 and I'm seeing region servers being
instructed to load the same hfile multiple times. I've seen this
behavior manifest in two different ways. The first time was after I
decided to brute-force my way around the problem in HBASE-20774 [1] by
just letting the cluster copy the data down from S3 and back up. After
letting this run overnight, the 15 TB of input data had grown to over
60 TB in the HBase catalog. The logs showed files being copied multiple
times (around 10 times for some files on a single region server).
After writing a patch for HBASE-20774 and deploying it to the cluster,
the next attempt showed the same duplicate-load behavior, but with a
different failure mode: when you bulk load from the same FS as the
HBase root, HBase renames files rather than copying them, so the next
time the region server is instructed to load the file it fails with a
FileNotFoundException. I've cherry-picked logs from one occurrence of
this on a region server:
2018-06-27 19:34:18,170 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HStore: Validating hfile at s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe for inclusion in store d region z3_v2,\x00\x09\xD9 \x00\x00\x00\x00\x00\x00\x00,1530122295326.f6e3567551881fbd48baed14bfe2252b.
2018-06-27 19:34:18,236 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.S3NativeFileSystem: Opening 's3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe' for reading
2018-06-27 19:34:45,915 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.S3NativeFileSystem: rename s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe s3://bucket_name/data/default/z3_v2/f6e3567551881fbd48baed14bfe2252b/d/47aef50a986340908a8242b256adbedf_SeqId_4_
2018-06-27 19:34:59,005 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HStore: Loaded HFile s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe into store 'd' as s3://bucket_name/data/default/z3_v2/f6e3567551881fbd48baed14bfe2252b/d/47aef50a986340908a8242b256adbedf_SeqId_4_ - updating store file list.
2018-06-27 19:34:59,101 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HStore: Successfully loaded store file s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe into store d (new location: s3://bucket_name/data/default/z3_v2/f6e3567551881fbd48baed14bfe2252b/d/47aef50a986340908a8242b256adbedf_SeqId_4_)
2018-06-27 19:40:03,463 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HStore: Validating hfile at s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe for inclusion in store d region z3_v2,\x00\x09\xD9 \x00\x00\x00\x00\x00\x00\x00,1530122295326.f6e3567551881fbd48baed14bfe2252b.
org.apache.hadoop.io.MultipleIOException: 18 exceptions [java.io.FileNotFoundException: No such file or directory 's3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe', ....]
        at org.apache.hadoop.io.MultipleIOException.createIOException(MultipleIOException.java:53)
        at org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:5782)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.bulkLoadHFile(RSRpcServices.java:2170)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36615)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2354)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
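In case it helps frame the question, here's my rough mental model of
the copy-vs-rename decision, based on the behavior above. This is just
a simplified sketch, not the actual HBase source; the class and method
names are made up:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    // Hypothetical sketch of how a region server commits a bulk-loaded
    // hfile into a store directory (names are mine, not HBase's).
    public final class BulkLoadCommitSketch {
      static void commitHFile(Configuration conf, Path src, Path dst)
          throws IOException {
        FileSystem srcFs = src.getFileSystem(conf);
        FileSystem dstFs = dst.getFileSystem(conf);
        if (srcFs.getUri().equals(dstFs.getUri())) {
          // Source is on the same filesystem as the HBase root: move it.
          // The source path is gone afterwards, so a *second* load
          // request for the same hfile fails with FileNotFoundException.
          if (!srcFs.rename(src, dst)) {
            throw new IOException("rename failed: " + src + " -> " + dst);
          }
        } else {
          // Different filesystem: copy, leaving the source in place. A
          // repeated load request then silently duplicates data rather
          // than failing, which matches the 15 TB -> 60 TB blowup.
          FileUtil.copy(srcFs, src, dstFs, dst, false /* deleteSource */, conf);
        }
      }
    }

Either way, the root issue looks the same: the region server receives
the same load request more than once, and the rename path just turns
that duplication into a hard failure.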
I'm not sure where the problem is coming from here: whether the master
is adding files to the queue multiple times or just not removing them
once they've been loaded. If anyone has knowledge of this area or an
idea of where things are going wrong, it would be very helpful, since
I'm not super familiar with the code base.
Thanks,
Austin
[1] https://issues.apache.org/jira/browse/HBASE-20774