Hi, I am new to HBase and to Hadoop as well, so forgive me if the following is naive.
I am trying to bulk load a large amount of data (billions of rows, 15-20 columns each) into an empty HBase table with two column families. The approach I tried was a MapReduce job, with code copied and modified from ImportTsv.java. I did not get good performance because that code uses TotalOrderPartitioner, which, as far as I can tell, looks at the table's current number of regions and therefore runs a single reducer against an empty table. I then tried SimpleTotalOrderPartitioner with conservatively wide start and end keys, but the work ended up divided very unevenly across our 10-node cluster.

Questions:

1. Can a bulk load use TotalOrderPartitioner with multiple reducers?

2. I don't have a handle on the min and max row keys in the data unless I collect them during the map phase. Is it possible to reconfigure the partitioner after the map phase is over?

3. I will need to frequently load datasets with billions of rows (450-800 GB) into HBase, as this is part of a data processing pipeline. My (optimistic) estimate on a 10-node cluster is 7 hours per load. Is that reasonable? Would HBase scale to, say, hundreds of such datasets, given that I can add disk space and nodes to the cluster?

Thanks,
- Ashish
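P.S. In case it helps, below is a simplified sketch of my driver, not my exact code. MyTsvMapper is a stand-in for my mapper adapted from ImportTsv, and the table name, start/end keys, and paths are made-up placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer;
import org.apache.hadoop.hbase.mapreduce.SimpleTotalOrderPartitioner;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Second attempt: guessed, conservatively wide start/end keys for
    // SimpleTotalOrderPartitioner. It splits the key *space* evenly,
    // but since the real key distribution is unknown, some reducers
    // get far more rows than others.
    conf.set("hbase.simpletotalorder.start", "0000000000");
    conf.set("hbase.simpletotalorder.end", "zzzzzzzzzz");

    Job job = new Job(conf, "hbase-bulk-load");
    job.setJarByClass(BulkLoadDriver.class);

    // MyTsvMapper is a placeholder for my mapper adapted from ImportTsv;
    // it emits (row key, KeyValue) pairs.
    job.setMapperClass(MyTsvMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);

    job.setPartitionerClass(SimpleTotalOrderPartitioner.class);
    job.setReducerClass(KeyValueSortReducer.class); // ships with HBase
    job.setNumReduceTasks(30); // e.g. 3 reducers per node on 10 nodes

    job.setOutputFormatClass(HFileOutputFormat.class);

    // First attempt, for comparison: this installs TotalOrderPartitioner
    // using the table's current region boundaries -- one region on an
    // empty table, hence the single reducer I saw.
    // HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "mytable"));

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}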