I have around 20 GB of data to load into an HBase table. Initially, I had a simple Java program that put the values in batches of 5,000-10,000 records. I tried concurrent inserts, but each insert took about 15 seconds to write, which is far too slow for this volume of data.
My next approach was to use importtsv. It started off with a set of map tasks, but after a few minutes I started getting RetriesException errors, and the job failed a while later. From these experiments, I noticed that the master node was handling all the traffic. I understand that HBase initially writes data to one node and then splits it across multiple nodes as more data comes in. Is there a way to split the table across regions at the beginning? Or are there any other thoughts on how to handle inserts of large amounts of data?

Viv
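
To make the pre-splitting idea concrete, here is a minimal sketch of how evenly spaced split keys could be computed up front. It rests on an assumption that is not stated above: that row keys begin with a two-character lowercase hex prefix (for example, from hashing the natural key), so the key space "00" to "ff" is roughly uniform. The class and method names (`PreSplitKeys`, `hexSplitKeys`) are hypothetical, and the code deliberately uses only the Java standard library.

```java
import java.nio.charset.StandardCharsets;

class PreSplitKeys {
    // Returns numRegions - 1 split points that divide the two-hex-digit
    // prefix space "00".."ff" into roughly equal ranges.
    // Assumption: row keys start with a uniformly distributed hex prefix.
    static byte[][] hexSplitKeys(int numRegions) {
        if (numRegions < 2 || numRegions > 256) {
            throw new IllegalArgumentException("numRegions must be in [2, 256]");
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            // Boundary of the i-th region within the 256 possible prefixes.
            int boundary = i * 256 / numRegions;
            splits[i - 1] = String.format("%02x", boundary)
                    .getBytes(StandardCharsets.US_ASCII);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Print the split points for a table pre-split into 4 regions.
        for (byte[] key : hexSplitKeys(4)) {
            System.out.println(new String(key, StandardCharsets.US_ASCII));
        }
    }
}
```

The resulting `byte[][]` is the shape accepted by the HBase Java admin API's `createTable` overload that takes split keys, so the table would start with several regions instead of one. Whether that overload is available in exactly this form depends on the HBase version in use, so treat that part as an assumption to check. If the real row keys are not uniformly distributed, the split points would need to come from a sample of the actual data instead.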
