Hi,
In my experience the fastest way to load data is to write HFiles directly. I have measured a performance gain of 10x. Also, the HBase bulk loader does not escape characters, which is a problem if you have binary data or values that need escaping. For my use case, I create the HFiles, load them, and then create a Phoenix view on the HBase table (rough sketches of those steps follow the quoted message below).

Behdad

From: Riesland, Zack [mailto:zack.riesl...@sensus.com]
Sent: Monday, August 31, 2015 6:20 AM
To: user@phoenix.apache.org
Subject: Help Tuning CsvBulkImport MapReduce

I'm looking for some pointers on speeding up CsvBulkImport. Here's an example:

I took about 2 billion rows from Hive and exported them to CSV. HDFS translated this into 257 files, each about 1 GB.

Running the CsvBulkImport tool against this folder results in 1,835 mappers and then 1 reducer per region on the HBase table. The whole process takes something like 2 hours, the bulk of which is spent on the mappers.

Any suggestions on how to possibly make this faster?

When I create the CSV files, I'm doing a pretty simple select statement from Hive. The results tend to be mostly sorted. I honestly don't know this space well enough to know whether that's good, bad, or neutral.

Thanks for any feedback!
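
For readers unfamiliar with the direct-HFile approach described above, here is a minimal sketch of the "load the HFile" step only, assuming the HFiles were already written by a MapReduce job configured with HFileOutputFormat2, and assuming the HBase 1.x client API. The table name SENSOR_DATA and the path /tmp/hfiles are hypothetical placeholders, not anything from the original thread.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class CompleteBulkLoad {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical directory where an earlier job wrote HFiles via HFileOutputFormat2.
            Path hfileDir = new Path("/tmp/hfiles");
            // Hypothetical target table name.
            TableName tableName = TableName.valueOf("SENSOR_DATA");

            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(tableName);
                 Admin admin = conn.getAdmin();
                 RegionLocator locator = conn.getRegionLocator(tableName)) {
                // Moves the pre-built HFiles into the table's regions, bypassing the normal write path.
                new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, admin, table, locator);
            }
        }
    }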
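
And a sketch of the last step, creating a Phoenix view over the already-populated HBase table so it can be queried with SQL. The ZooKeeper quorum, table name, column family, and column qualifier below are hypothetical and must match the actual HBase schema.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreatePhoenixView {
        public static void main(String[] args) throws Exception {
            // Hypothetical ZooKeeper quorum in the Phoenix JDBC URL.
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
                 Statement stmt = conn.createStatement()) {
                // Map the existing HBase table into Phoenix as a view; quoted identifiers
                // must match the HBase table name, column family, and qualifier exactly.
                stmt.execute(
                    "CREATE VIEW \"SENSOR_DATA\" ( " +
                    "  pk VARCHAR PRIMARY KEY, " +
                    "  \"cf\".\"reading\" VARCHAR )");
            }
        }
    }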
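
For context on the question quoted above, this is roughly how the Phoenix CSV bulk load tool is driven; the same class that is normally launched with hadoop jar can also be invoked through ToolRunner. The table name, input directory, and ZooKeeper quorum here are hypothetical placeholders.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.phoenix.mapreduce.CsvBulkLoadTool;

    public class RunCsvBulkLoad {
        public static void main(String[] args) throws Exception {
            // Equivalent to running org.apache.phoenix.mapreduce.CsvBulkLoadTool via hadoop jar.
            int exitCode = ToolRunner.run(HBaseConfiguration.create(), new CsvBulkLoadTool(),
                new String[] {
                    "--table", "SENSOR_DATA",         // hypothetical Phoenix table
                    "--input", "/user/hive/export",   // hypothetical HDFS folder of CSV files
                    "--zookeeper", "zk-host:2181"     // hypothetical ZooKeeper quorum
                });
            System.exit(exitCode);
        }
    }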