Hi Harry,

Do you have more details on the exact load? Can you run vmstat and see what kind of load it is? Is it user CPU, system, or I/O wait?
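For reference, here is one way to read that from vmstat's cpu columns (us = user, sy = system, id = idle, wa = I/O wait). The sample output below is illustrative only, not from Harry's cluster, and column positions assume a typical Linux procps vmstat:

```shell
# On a loaded node you would sample once a second for a few seconds:
#   vmstat 1 5
# A consistently high "wa" with low "us" points at the disks, not the CPU.

# Illustrative (made-up) sample, parsed with awk to pull the "us" and "wa"
# columns (fields 13 and 16 of the per-interval data lines):
sample='procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 6  4      0 123456   7890 456789    0    0  8000 12000 3000 5000 10  5 15 70'

echo "$sample" | awk 'NR>2 {print "user="$13"% iowait="$16"%"}'
# Here iowait dominates, which would suggest a disk bottleneck.
```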
I suspect your disks are the issue. There are two things here. First, we don't recommend RAID for the HDFS/HBase disks. The best is to simply mount the disks on two mount points and give them both to HDFS. Second, two disks per node is very low; even on a dev cluster that's not recommended. In production, you should go with 12 or more. So with only 2 disks in RAID, I suspect your WIO is high, which is what might be slowing your process. Can you take a look in that direction? If it's not that, we will continue to investigate ;)

Thanks,

JM

2013/10/23 Harry Waye <[email protected]>

> I'm trying to load data into HBase using HFileOutputFormat and incremental
> bulk load but am getting rather lackluster performance: 10h for ~0.5TB of
> data, ~50000 blocks. This is being loaded into a table that has 2
> families, 9 columns, 2500 regions and is ~10TB in size. Keys are md5
> hashes and regions are pretty evenly spread. The majority of the time
> appears to be spent in the reduce phase, with the map phase completing
> very quickly. The network doesn't appear to be saturated, but the load is
> consistently at 6, which is the number of reduce tasks per node.
>
> 12 hosts (6 cores, 2 disks as RAID0, 1GB eth, no one else on the rack).
>
> MR conf: 6 mappers, 6 reducers per node.
>
> I spoke to someone on IRC and they recommended reducing job output
> replication to 1, and reducing the number of mappers, which I reduced
> to 2. Reducing replication appeared not to make any difference; reducing
> reducers appeared just to slow the job down. I'm going to have a look at
> running the benchmarks mentioned on Michael Noll's blog and see what that
> turns up. I guess some questions I have are:
>
> How does the global number/size of blocks affect perf.? (I have a lot of
> 10MB files, which are the input files.)
>
> How does the job-local number/size of input blocks affect perf.?
>
> What is actually happening in the reduce phase that requires so much CPU?
> I assume the actual construction of HFiles isn't intensive.
>
> Ultimately, how can I improve performance?
>
> Thanks
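A quick back-of-envelope on the numbers reported above (~0.5 TB written in 10 h across 12 nodes; a sketch only, ignoring HDFS replication traffic and shuffle/sort spills):

```shell
# Rough aggregate and per-node write rates implied by the job:
# ~0.5 TB in 10 hours over 12 hosts.
awk 'BEGIN {
  bytes   = 0.5 * 1024^4            # ~0.5 TB (binary units)
  secs    = 10 * 3600               # 10 hours
  agg     = bytes / secs / 1024^2   # aggregate MB/s
  pernode = agg / 12                # 12 hosts
  printf "aggregate=%.1f MB/s, per-node=%.1f MB/s\n", agg, pernode
}'
```

The per-node figure is far below what even two spinning disks can stream sequentially, which suggests the time is going into seeks and sort/spill activity from 6 concurrent reducers on 2 spindles rather than raw write bandwidth, consistent with the WIO suspicion above.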

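For the JBOD layout suggested above (separate mount points instead of RAID0), the datanode config would look something like the fragment below. The mount paths /data/1 and /data/2 are hypothetical, one per physical disk; note the property is dfs.datanode.data.dir in Hadoop 2.x but dfs.data.dir in 1.x:

```xml
<!-- hdfs-site.xml: give each disk to HDFS as its own directory.
     /data/1 and /data/2 are hypothetical mount points, one per disk. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
</property>
```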