I'm looking for some pointers on speeding up CsvBulkImport.

Here's an example:

I took about 2 billion rows from Hive and exported them to CSV.

The export landed in HDFS as 257 files, each about 1 GB.

Running the CsvBulkImport tool against this folder produces 1,835 mappers and 
then one reducer per region on the target HBase table.
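For context, the mapper count lines up roughly with one map task per HDFS block of the input. Here's a quick sanity check of that arithmetic, assuming ~1 GB files and the common 128 MB default block size (both assumptions on my part, not numbers from the job config):

```python
import math

# Assumed inputs: 257 files of ~1 GB each, 128 MB HDFS block size.
# MapReduce typically creates one map task per input split, and for
# splittable text files each HDFS block becomes one split.
num_files = 257
file_size_mb = 1024      # ~1 GB per file (assumption)
split_size_mb = 128      # common default dfs.blocksize (assumption)

splits_per_file = math.ceil(file_size_mb / split_size_mb)
total_mappers = num_files * splits_per_file
print(total_mappers)
```

With these round numbers that comes out to 2,056 mappers; the observed 1,835 suggests the files average a bit under 1 GB, but the shape is the same: mapper count is driven by input size divided by split size, not by the file count.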

The whole process takes something like 2 hours, most of which is spent in the 
map phase.

Any suggestions on how to possibly make this faster?

When I create the CSV files, I'm doing a pretty simple select statement from 
Hive. The results tend to be mostly sorted.

I honestly don't know this space well enough to know whether that's good, bad, 
or neutral.

Thanks for any feedback!
