Hi all,
I have a few questions regarding bulk load.
Some of them may be "novice" questions, sorry about that...
I am trying to improve my bulk loading performance into HBase.
Setup:
- I have one table with one column family and 10 columns.
- 4 PC cluster (each: i5-2400 CPU, 1 TB hard disk, 4 GB RAM)
- Ubuntu 12.04 64 bit
- CDH3 installation
- HDFS: 1 namenode, 4 datanodes
  1 jobtracker, 4 tasktrackers
  replication = 1
- HBase: 1 master, 4 slaves
My software architecture:
1- I have a server application listening on ports for incoming rows.
2- I create the table with pre-splits (say split1, split2, split3,
split4).
3- I have a worker for each split. When a row arrives, I decide which split
the arriving key belongs to,
then I pass the incoming row to the responsible worker.
Each worker writes its own HFile periodically (every 2-3 minutes).
Writing HFiles requires disk I/O, and I need to increase HFile writing
performance.
Is it possible to write an HFile in memory (like a memory-mapped file) and flush
it to disk when writing is finished?
I am also looking for HDFS tuning to improve disk I/O performance; do
you have any advice?
4- I have another worker which takes the written HFiles and loads them into
HBase.
I have a question at this point. The doBulkLoad method takes a directory as
input.
Do I have to clean this directory after each doBulkLoad invocation?
If I don't clean it, I think it will try to load the same
files again. Am I wrong?
5- My application currently runs on the master machine.
I am planning to run it on each PC in my cluster.
Can bulk loads be done in parallel?
6- I am writing rows into each HFile in increasing key order. I remember
reading something about this key order.
Do I have to write rows to an HFile in key order?
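
To make steps 2, 3 and 6 more concrete, here is a minimal sketch of what the
routing and key ordering could look like. It is plain Java with no HBase
classes, so it is not my actual code and not an HBase API. The names
SplitRouter, workerFor, add and drainKeys are made up for this mail. It
assumes the split keys are the ones passed to createTable (so N split keys
give N+1 regions) and that row keys compare as unsigned bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch: routes row keys to per-split workers and buffers them in key order. */
final class SplitRouter {

    // Unsigned lexicographic byte order (assumed to match the row key ordering of the table).
    static final Comparator<byte[]> ROW_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    };

    private final byte[][] splitKeys;                       // sorted split points
    private final List<TreeMap<byte[], byte[]>> buffers = new ArrayList<>();

    SplitRouter(byte[][] splitKeys) {
        this.splitKeys = splitKeys.clone();
        Arrays.sort(this.splitKeys, ROW_ORDER);
        // N split keys define N + 1 key ranges, so one sorted buffer per range.
        for (int i = 0; i <= this.splitKeys.length; i++) {
            buffers.add(new TreeMap<>(ROW_ORDER));
        }
    }

    /** Index (0..N) of the worker whose key range contains rowKey. */
    int workerFor(byte[] rowKey) {
        int pos = Arrays.binarySearch(splitKeys, rowKey, ROW_ORDER);
        // Exact hit on split key i: that key starts range i + 1.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the
        // insertion point is the range index.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    /** Buffers a row with the responsible worker; the TreeMap keeps keys sorted. */
    void add(byte[] rowKey, byte[] value) {
        buffers.get(workerFor(rowKey)).put(rowKey, value);
    }

    /** Returns the worker's buffered keys in increasing order and clears its buffer. */
    List<byte[]> drainKeys(int worker) {
        TreeMap<byte[], byte[]> buf = buffers.get(worker);
        List<byte[]> keys = new ArrayList<>(buf.keySet());
        buf.clear();
        return keys;
    }
}
```

With split keys g, m and t, a row key h is routed to worker 1, and draining
that worker returns its keys in sorted order regardless of arrival order.
The sketch is not thread-safe, so each worker thread would need its own
buffer or a lock.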