Enhancing hbase bulk import performance

sakin cali Fri, 25 May 2012 07:08:34 -0700

Hi all,

I have a few questions regarding bulk load.
Some of them may be novice questions, sorry about that.


I am trying to improve the performance of my bulk loading into HBase.

Setup:
 - I have one table with one column family and 10 columns.
 - 4-PC cluster (each: i5-2400 CPU, 1 TB hard disk, 4 GB RAM)
 - Ubuntu 12.04 64-bit
 - CDH3 installation
 - HDFS: 1 namenode, 4 datanodes
             1 jobtracker, 4 tasktrackers
             replication = 1
 - HBase:
             1 master, 4 slaves

My software architecture:

1- I have a server application listening on ports for incoming rows.

2- I am creating the table with pre-splits (say split1, split2, split3,
split4).

3- I have a worker for each split. When a row arrives, I decide which split
the arriving key belongs to.
Then I pass the incoming row to the responsible worker.
Each worker writes its own HFile periodically (every 2-3 minutes).

Writing HFiles requires disk I/O, and I need to increase HFile writing
performance.
Is it possible to write an HFile in memory (like a memory-mapped file) and
flush it to disk when writing is finished?
I am also looking for HDFS tuning options to improve disk I/O performance. Do
you have any advice?
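
For illustration, the routing decision in step 3 could look roughly like the
sketch below. It is plain Java with no HBase dependency. The class name
SplitRouter, the example split points and the range numbering are assumptions
made for the example, not details of the actual application. Keys are compared
as unsigned bytes, which is the order HBase uses for row keys, and a key equal
to a split point is sent to the range that starts at that split point.

```java
// Hypothetical helper, not part of HBase: maps a row key to a worker index.
final class SplitRouter {
    // Split points in increasing order; N split points give N + 1 key ranges.
    private final byte[][] splitKeys;

    SplitRouter(byte[][] sortedSplitKeys) {
        this.splitKeys = sortedSplitKeys;
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Returns the index of the range that contains rowKey:
    // 0 for keys below the first split point, otherwise i for keys in
    // [splitKeys[i - 1], splitKeys[i]), and N for keys at or above the last one.
    int route(byte[] rowKey) {
        int lo = 0;
        int hi = splitKeys.length;
        // Binary search for the number of split points <= rowKey.
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compare(splitKeys[mid], rowKey) <= 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

With split points "g", "n" and "t", a key such as "melon" is routed to range 1
and a key such as "zebra" to range 3, so each incoming row can be handed to the
worker with the matching index.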


4- I have another worker which takes the written HFiles and loads them into
HBase.
I have a question at this point. The doBulkLoad method takes a directory as
input.
Do I have to clean this directory after each doBulkLoad invocation?
If I don't clean it, I think it will try to load the same
files again. Am I wrong?
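
Whatever the answer is, one way to avoid the situation would be to give every
batch its own directory, so that a directory is passed to doBulkLoad only once.
The sketch below only builds path names and is plain Java. The class name
BatchDirs, the base path "/bulk" and the family name "cf" are placeholders.
It assumes the usual bulk-load layout of one subdirectory per column family
inside the directory handed to doBulkLoad.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical naming helper; it creates no directories, it only builds names.
final class BatchDirs {
    private final String baseDir;
    private final AtomicLong sequence = new AtomicLong();

    BatchDirs(String baseDir) {
        this.baseDir = baseDir;
    }

    // Name of a fresh directory for the next batch of HFiles.
    // Thread-safe, so several writer workers can share one instance.
    String nextBatchDir() {
        long n = sequence.incrementAndGet();
        return String.format(Locale.ROOT, "%s/batch-%06d", baseDir, n);
    }

    // HFiles of one column family go into a subdirectory named after it.
    static String familyDir(String batchDir, String family) {
        return batchDir + "/" + family;
    }
}
```

A writer worker would ask for nextBatchDir() at the start of each 2-3 minute
cycle, write its HFile under familyDir(...), and hand the batch directory to
the loading worker once the file is closed.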

5- My application currently runs on the master machine.
I am planning to run it on each PC in my cluster.
I mean, can bulk loads be done in parallel?

6- I am writing each row to the HFile in increasing key order. I remember
reading something about this key ordering.
Do I have to write rows to an HFile in key order?
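
To make "increasing key order" concrete, here is a minimal sketch of how a
worker could buffer rows between flushes and hand them out sorted. It is plain
Java, not HBase code. The class name SortedRowBuffer is made up, and the
sketch simplifies a row to a single value per row key, whereas a real row with
10 columns would hold one cell per column. Keys are ordered as unsigned bytes,
the same ordering that HBase's Bytes.compareTo applies.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical per-worker buffer that keeps rows sorted by row key.
final class SortedRowBuffer {
    // Unsigned lexicographic order; a shorter key sorts before its extensions.
    static final Comparator<byte[]> UNSIGNED_LEXICOGRAPHIC = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    };

    private final TreeMap<byte[], byte[]> rows =
            new TreeMap<>(UNSIGNED_LEXICOGRAPHIC);

    // Rows may arrive in any order; a repeated key overwrites the older value.
    void add(byte[] rowKey, byte[] value) {
        rows.put(rowKey, value);
    }

    // Returns the buffered keys in increasing order and empties the buffer.
    List<byte[]> drainKeys() {
        List<byte[]> keys = new ArrayList<>(rows.keySet());
        rows.clear();
        return keys;
    }
}
```

At flush time the worker would iterate the drained keys in the returned order
and append them to the file one by one, so the file never sees a key smaller
than the previous one.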
