I'm trying to load data into HBase using HFileOutputFormat and incremental bulk load, but I'm getting rather lackluster performance: 10h for ~0.5TB of data, ~50000 blocks. This is being loaded into a table that has 2 families, 9 columns, 2500 regions and is ~10TB in size. Keys are MD5 hashes and regions are pretty evenly spread. The majority of the time appears to be spent in the reduce phase, with the map phase completing very quickly. The network doesn't appear to be saturated, but the load is consistently at 6, which is the number of reduce tasks per node.
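For reference, the driver is set up roughly along these lines (a sketch with placeholder class, path, and table names, not the actual code; this is the standard configureIncrementalLoad wiring):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hfile-bulk-load");
    job.setJarByClass(BulkLoadDriver.class);
    // MyMapper is a placeholder: it emits the MD5 row key plus a KeyValue.
    job.setMapperClass(MyMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    FileInputFormat.addInputPath(job, new Path("/input"));
    FileOutputFormat.setOutputPath(job, new Path("/hfiles"));

    // Wires up TotalOrderPartitioner and the sorting reducer so the
    // output HFiles line up with the table's region boundaries.
    HTable table = new HTable(conf, "mytable");
    HFileOutputFormat.configureIncrementalLoad(job, table);
    job.waitForCompletion(true);
  }
}
```

(This won't run outside a live HBase/Hadoop cluster, so treat it purely as a sketch of the setup being described.)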
Cluster: 12 hosts (6 cores, 2 disks as RAID0, 1Gb Ethernet, no one else on the rack). MR conf: 6 mappers, 6 reducers per node.

I spoke to someone on IRC and they recommended reducing job output replication to 1 and reducing the number of mappers, which I reduced to 2. Reducing replication appeared to make no difference; reducing reducers appeared just to slow the job down. I'm going to have a look at running the benchmarks mentioned on Michael Noll's blog and see what that turns up.

Some questions I have:

- How does the global number/size of blocks affect performance? (I have a lot of 10MB files, which are the input files.)
- How does the job-local number/size of input blocks affect performance?
- What is actually happening in the reduce phase that requires so much CPU? I assume the actual construction of HFiles isn't intensive.
- Ultimately, how can I improve performance?

Thanks
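For the record, here is roughly how the knobs mentioned above were set (Hadoop 0.20-era property names; jar, class, path, and table names are placeholders, so treat this as a config fragment rather than exact commands):

```shell
# Per-tasktracker slot limits live in mapred-site.xml and need a TT restart:
#   mapred.tasktracker.map.tasks.maximum    = 2
#   mapred.tasktracker.reduce.tasks.maximum = 6

# Job output replication can be dropped per job at submit time
# (assuming the OutputFormat honours dfs.replication from the job conf):
hadoop jar myjob.jar BulkLoadDriver -D dfs.replication=1 /input /hfiles

# Once the job has written the HFiles, load them into the table:
hadoop jar $HBASE_HOME/hbase-*.jar \
  org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /hfiles mytable
```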
