I'm trying to load data into HBase using HFileOutputFormat and incremental bulk load, but I'm getting rather lackluster performance: 10h for ~0.5TB of data, ~50000 blocks. This is being loaded into a table that has 2 families, 9 columns, 2500 regions and is ~10TB in size. Keys are MD5 hashes and regions are pretty evenly spread. The majority of the time appears to be spent in the reduce phase, with the map phase completing very quickly. The network doesn't appear to be saturated, but the load is consistently at 6, which is the number of reduce tasks per node.
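For reference, the driver is set up roughly along these lines (a sketch with placeholder class, path, and table names, not the actual code; this is the standard configureIncrementalLoad wiring):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hfile-bulk-load");
    job.setJarByClass(BulkLoadDriver.class);
    // MyMapper is a placeholder: it emits the MD5 row key plus a KeyValue.
    job.setMapperClass(MyMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    FileInputFormat.addInputPath(job, new Path("/input"));
    FileOutputFormat.setOutputPath(job, new Path("/hfiles"));

    // Wires up TotalOrderPartitioner and the sorting reducer so the
    // output HFiles line up with the table's region boundaries.
    HTable table = new HTable(conf, "mytable");
    HFileOutputFormat.configureIncrementalLoad(job, table);
    job.waitForCompletion(true);
  }
}
```

(This won't run outside a live HBase/Hadoop cluster, so treat it purely as a sketch of the setup being described.)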
Cluster: 12 hosts (6 cores, 2 disks as RAID0, 1Gb Ethernet, no one else on the rack). MR conf: 6 mappers, 6 reducers per node.

I spoke to someone on IRC and they recommended reducing job output replication to 1 and reducing the number of mappers, which I reduced to 2. Reducing replication appeared to make no difference; reducing reducers appeared just to slow the job down. I'm going to have a look at running the benchmarks mentioned on Michael Noll's blog and see what that turns up.

Some questions I have:

- How does the global number/size of blocks affect performance? (I have a lot of 10MB files, which are the input files.)
- How does the job-local number/size of input blocks affect performance?
- What is actually happening in the reduce phase that requires so much CPU? I assume the actual construction of HFiles isn't intensive.
- Ultimately, how can I improve performance?

Thanks
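For the record, here is roughly how the knobs mentioned above were set (Hadoop 0.20-era property names; jar, class, path, and table names are placeholders, so treat this as a config fragment rather than exact commands):

```shell
# Per-tasktracker slot limits live in mapred-site.xml and need a TT restart:
#   mapred.tasktracker.map.tasks.maximum    = 2
#   mapred.tasktracker.reduce.tasks.maximum = 6

# Job output replication can be dropped per job at submit time
# (assuming the OutputFormat honours dfs.replication from the job conf):
hadoop jar myjob.jar BulkLoadDriver -D dfs.replication=1 /input /hfiles

# Once the job has written the HFiles, load them into the table:
hadoop jar $HBASE_HOME/hbase-*.jar \
  org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /hfiles mytable
```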
