Hi Harry,

Do you have more details on the exact load? Can you run vmstat and see what kind of load it is? Is it user CPU, system, or I/O wait?
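For reference, here is one way to read that from vmstat's cpu columns (us = user, sy = system, id = idle, wa = I/O wait). The sample output below is illustrative only, not from Harry's cluster, and column positions assume a typical Linux procps vmstat:

```shell
# On a loaded node you would sample once a second for a few seconds:
#   vmstat 1 5
# A consistently high "wa" with low "us" points at the disks, not the CPU.

# Illustrative (made-up) sample, parsed with awk to pull the "us" and "wa"
# columns (fields 13 and 16 of the per-interval data lines):
sample='procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 6  4      0 123456   7890 456789    0    0  8000 12000 3000 5000 10  5 15 70'

echo "$sample" | awk 'NR>2 {print "user="$13"% iowait="$16"%"}'
# Here iowait dominates, which would suggest a disk bottleneck.
```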
I suspect your disks are the issue. There are two things here. First, we don't recommend RAID for the HDFS/HBase disks. The best is to simply mount the disks on two mount points and give them both to HDFS. Second, two disks per node is very low; even on a dev cluster that's not recommended. In production, you should go with 12 or more. So with only 2 disks in RAID, I suspect your WIO is high, which is what might be slowing your process. Can you take a look in that direction? If it's not that, we will continue to investigate ;)

Thanks,

JM

2013/10/23 Harry Waye <[email protected]>

> I'm trying to load data into HBase using HFileOutputFormat and incremental
> bulk load but am getting rather lackluster performance: 10h for ~0.5TB of
> data, ~50000 blocks. This is being loaded into a table that has 2
> families, 9 columns, 2500 regions and is ~10TB in size. Keys are md5
> hashes and regions are pretty evenly spread. The majority of the time
> appears to be spent in the reduce phase, with the map phase completing
> very quickly. The network doesn't appear to be saturated, but the load is
> consistently at 6, which is the number of reduce tasks per node.
>
> 12 hosts (6 cores, 2 disks as RAID0, 1GB eth, no one else on the rack).
>
> MR conf: 6 mappers, 6 reducers per node.
>
> I spoke to someone on IRC and they recommended reducing job output
> replication to 1, and reducing the number of mappers, which I reduced
> to 2. Reducing replication appeared not to make any difference; reducing
> reducers appeared just to slow the job down. I'm going to have a look at
> running the benchmarks mentioned on Michael Noll's blog and see what that
> turns up. I guess some questions I have are:
>
> How does the global number/size of blocks affect perf.? (I have a lot of
> 10MB files, which are the input files.)
>
> How does the job-local number/size of input blocks affect perf.?
>
> What is actually happening in the reduce phase that requires so much CPU?
> I assume the actual construction of HFiles isn't intensive.
>
> Ultimately, how can I improve performance?
>
> Thanks
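A quick back-of-envelope on the numbers reported above (~0.5 TB written in 10 h across 12 nodes; a sketch only, ignoring HDFS replication traffic and shuffle/sort spills):

```shell
# Rough aggregate and per-node write rates implied by the job:
# ~0.5 TB in 10 hours over 12 hosts.
awk 'BEGIN {
  bytes   = 0.5 * 1024^4            # ~0.5 TB (binary units)
  secs    = 10 * 3600               # 10 hours
  agg     = bytes / secs / 1024^2   # aggregate MB/s
  pernode = agg / 12                # 12 hosts
  printf "aggregate=%.1f MB/s, per-node=%.1f MB/s\n", agg, pernode
}'
```

The per-node figure is far below what even two spinning disks can stream sequentially, which suggests the time is going into seeks and sort/spill activity from 6 concurrent reducers on 2 spindles rather than raw write bandwidth, consistent with the WIO suspicion above.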

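For the JBOD layout suggested above (separate mount points instead of RAID0), the datanode config would look something like the fragment below. The mount paths /data/1 and /data/2 are hypothetical, one per physical disk; note the property is dfs.datanode.data.dir in Hadoop 2.x but dfs.data.dir in 1.x:

```xml
<!-- hdfs-site.xml: give each disk to HDFS as its own directory.
     /data/1 and /data/2 are hypothetical mount points, one per disk. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
</property>
```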