Hi Guys,

I've been doing some data insertion from HDFS into HBase and the performance seems to be really bad: it took about 3 hours to insert 15 GB of data. The MapReduce job is launched from one machine, grabs data from HDFS, and inserts it into an HTable hosted on 3 other machines (1 master and 2 regionservers). There are 17 map tasks in total (no reduce tasks), one per input file, each file being about 1 GB in size. The mapper simply extracts the useful information from each of these files and inserts it into HBase. In the end about 22 million rows are added to the table, and with my implementation (pretty inefficient, I think) 'table.put(Put p)' is called once for each of these rows, so there are roughly 22 million 'table.put()' calls in total.
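For reference, the write path in my mapper is roughly shaped like the snippet below; the column family, qualifier and row-key handling are simplified placeholders rather than my actual schema:

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Called once per extracted row, i.e. about 22 million times in total.
    void writeRow(HTable table, String rowKey, String value) throws java.io.IOException {
        Put p = new Put(Bytes.toBytes(rowKey));
        p.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value)); // placeholder family/qualifier
        table.put(p);
    }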

Does it make sense that this many 'table.put()' calls take 3 hours? I have played with my code and determined that these 'table.put()' calls are the bottleneck: if I remove them, the rest of the code (everything except committing the updates via 'table.put()') only takes about 2 minutes to run. I am really inexperienced with HBase, so how do you usually do data insertion? What tricks could improve performance?
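For what it's worth, one thing I noticed in the HTable javadoc is setAutoFlush() and setWriteBufferSize(), which sound like they would let the client buffer Puts instead of flushing each one individually. Is something along these lines the usual first step? (The table name and buffer size below are just placeholders on my side.)

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    public class BufferedTableHelper {

        // Open the table once per mapper (e.g. in setup()) with client-side buffering on.
        public static HTable openBufferedTable(Configuration hadoopConf, String tableName)
                throws IOException {
            Configuration conf = HBaseConfiguration.create(hadoopConf);
            HTable table = new HTable(conf, tableName);
            table.setAutoFlush(false);                  // don't flush on every put()
            table.setWriteBufferSize(8 * 1024 * 1024);  // e.g. an 8 MB client-side buffer
            return table;
        }

        // Flush whatever is still buffered (e.g. in cleanup()) and close the table.
        public static void closeBufferedTable(HTable table) throws IOException {
            table.flushCommits();
            table.close();
        }
    }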

I am also thinking about using the bulk load feature to batch-insert the data into HBase. Is this a popular approach in the HBase community?
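In case it helps the discussion, this is roughly the driver I had in mind after reading the HFileOutputFormat docs; the mapper, table name and paths are hypothetical placeholders, so please correct me if I have the wrong idea:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadDriver {

        // Hypothetical mapper: parses one tab-separated line into a row key and a value.
        public static class MyPutMapper
                extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split("\t");
                Put p = new Put(Bytes.toBytes(fields[0]));
                p.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(fields[1]));
                context.write(new ImmutableBytesWritable(Bytes.toBytes(fields[0])), p);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "hdfs-to-hbase bulk load");
            job.setJarByClass(BulkLoadDriver.class);
            job.setMapperClass(MyPutMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // input files on HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // staging dir for HFiles

            HTable table = new HTable(conf, "mytable");               // placeholder table name
            // Configures the partitioner/reducer so the HFiles line up with the table's regions.
            HFileOutputFormat.configureIncrementalLoad(job, table);

            if (job.waitForCompletion(true)) {
                // Hand the generated HFiles over to the regionservers.
                new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
            }
        }
    }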

Really sorry for asking for so much help with my problems while not yet helping other people with theirs. I would really like to return the favor once I get more experience with HBase.

Thanks a lot in advance :)


----
Han Liu
SCS & HCI Institute
Undergrad. Class of 2012 
Carnegie Mellon University


