Hi Guys,

I've been inserting data from HDFS into HBase and the performance seems really bad: it took about 3 hours to insert 15 GB of data. The MapReduce job is launched from one machine; it reads the data from HDFS and inserts it into an HTable hosted on 3 other machines (1 master and 2 regionservers). There are 17 map tasks in total (no reduce tasks), one per input file, each file being about 1 GB. The mapper simply extracts the useful fields from each file and inserts them into HBase. In the end about 22 million rows are added to the table, and with my implementation (pretty inefficient, I think) 'table.put(Put p)' is called once per row, so there are roughly 22 million 'table.put()' calls in total.
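Here is roughly what each map task does. It's heavily simplified: the table name, column family/qualifier and the tab-separated parsing below are placeholders, and the real field extraction is more involved, but the write path has exactly this shape, one Put and one table.put() per row:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InsertMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
        // One HTable per map task, created once in setup().
        Configuration conf = HBaseConfiguration.create(context.getConfiguration());
        table = new HTable(conf, "mytable");   // table name changed
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Placeholder parsing: first field is the row key, second is the value.
        String[] fields = value.toString().split("\t");
        Put put = new Put(Bytes.toBytes(fields[0]));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(fields[1]));
        table.put(put);   // called once per row, ~22 million times across the job
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        table.flushCommits();
        table.close();
    }
}

The parsing in the real code is messier, but the HBase part is exactly this: build one Put per row and commit it with table.put().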
Does it make sense that this many 'table.put()' calls take 3 hours? I've played with my code and determined that the bottleneck really is these calls: if I remove them, the rest of the code (everything except committing the updates via 'table.put()') takes only about 2 minutes to run. I'm really inexperienced with HBase, so how do you usually do data insertion? What tricks could improve the performance? I'm thinking about using the bulk load feature to batch-insert the data into HBase. Is that a popular approach in the HBase community?

Really sorry for asking for so much help with my problems without helping other people with theirs. I'd like to start offering help once I get more experience with HBase. Thanks a lot in advance :)

----
Han Liu
SCS & HCI Institute Undergrad. Class of 2012
Carnegie Mellon University
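P.S. For the bulk load route, this is roughly the job setup I have in mind, pieced together from the HBase docs, so the details may well be off. I'm assuming HFileOutputFormat.configureIncrementalLoad() plus the completebulkload tool is the right combination; the table name, column family, paths and the HFileMapper class below are just placeholders. Please tell me if I've misunderstood how this is supposed to work:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {

    // Same parsing as before, but the mapper emits (rowkey, Put) pairs
    // instead of calling table.put() itself.
    public static class HFileMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            byte[] row = Bytes.toBytes(fields[0]);
            Put put = new Put(row);
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(fields[1]));
            context.write(new ImmutableBytesWritable(row), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hfile-bulk-load");
        job.setJarByClass(BulkLoadDriver.class);

        job.setMapperClass(HFileMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Sets up the partitioner, reducer and HFileOutputFormat so the
        // generated HFiles line up with the table's existing regions.
        HTable table = new HTable(conf, "mytable");   // table name changed
        HFileOutputFormat.configureIncrementalLoad(job, table);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

And then, if I understand the docs correctly, the HFiles get moved into the table afterwards with something like 'hadoop jar hbase-*.jar completebulkload <output dir> mytable'.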
