BTW, since Hive represents NULL values as \N in the text file, how do you handle those NULL values when using the CsvBulkImport tool?
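In case it helps, one option is to scrub the \N markers in a pre-processing pass before running the bulk load. Below is a minimal sketch of that idea (the class name is made up, and it assumes a plain comma-delimited file with no quoted fields containing embedded commas):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical pre-processing step: rewrite Hive's \N NULL markers as empty
// fields so they load as empty/NULL values. Reads CSV on stdin, writes CSV to stdout.
public class NullMarkerCleaner {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            // -1 keeps trailing empty fields instead of dropping them
            String[] fields = line.split(",", -1);
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < fields.length; i++) {
                if (i > 0) out.append(',');
                // "\\N" in Java source is the two-character sequence backslash + N
                if (!"\\N".equals(fields[i])) out.append(fields[i]);
            }
            System.out.println(out);
        }
    }
}

You could also push the same idea back into the Hive export itself, for example by wrapping nullable columns in an expression that emits an empty string in the select that produces the CSV.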
On Tue, Sep 1, 2015 at 9:04 AM, Behdad Forghani <beh...@exapackets.com> wrote:

> Hi,
>
> In my experience, the fastest way to load data is to write HFiles directly.
> I have measured a performance gain of 10x. Also, if you have binary data or
> characters that need escaping, be aware that the HBase bulk loader does not
> escape characters. For my use case, I create the HFiles and load them, then
> I create a view on the HBase table.
>
> Behdad
>
> *From:* Riesland, Zack [mailto:zack.riesl...@sensus.com]
> *Sent:* Monday, August 31, 2015 6:20 AM
> *To:* user@phoenix.apache.org
> *Subject:* Help Tuning CsvBulkImport MapReduce
>
> I’m looking for some pointers on speeding up CsvBulkImport.
>
> Here’s an example:
>
> I took about 2 billion rows from Hive and exported them to CSV.
>
> This ended up as 257 files in HDFS, each about 1 GB.
>
> Running the CsvBulkImport tool against this folder results in 1,835 mappers
> and then 1 reducer per region on the HBase table.
>
> The whole process takes something like 2 hours, the bulk of which is spent
> in the mappers.
>
> Any suggestions on how to make this faster?
>
> When I create the CSV files, I’m doing a pretty simple select statement
> from Hive. The results tend to be mostly sorted.
>
> I honestly don’t know this space well enough to know whether that’s good,
> bad, or neutral.
>
> Thanks for any feedback!
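For anyone following the thread, a rough sketch of the direct-HFile approach Behdad describes is below. It assumes an HBase 1.x-era API (HFileOutputFormat2 plus LoadIncrementalHFiles); the table name MY_TABLE, the "cf"/"col1" family and qualifier, and the two-column CSV layout are all made up, and it writes plain string bytes rather than Phoenix-serialized values, so a Phoenix view over the result would need the data encoded the way the view expects.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CsvToHFileDriver {

    // Mapper: turns one CSV line into a KeyValue keyed by the row key.
    public static class CsvMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",", -1);
            byte[] rowKey = Bytes.toBytes(fields[0]);            // assume column 0 is the key
            KeyValue kv = new KeyValue(rowKey,
                    Bytes.toBytes("cf"), Bytes.toBytes("col1"),  // hypothetical family/qualifier
                    Bytes.toBytes(fields[1]));
            ctx.write(new ImmutableBytesWritable(rowKey), kv);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "csv-to-hfile");
        job.setJarByClass(CsvToHFileDriver.class);
        job.setMapperClass(CsvMapper.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // CSV input directory
        Path hfileDir = new Path(args[1]);                       // staging dir for HFiles
        FileOutputFormat.setOutputPath(job, hfileDir);

        TableName tableName = TableName.valueOf("MY_TABLE");     // hypothetical table
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(tableName);
             RegionLocator locator = conn.getRegionLocator(tableName);
             Admin admin = conn.getAdmin()) {
            // Sets up total-order partitioning so each reducer writes HFiles
            // that fall entirely inside one region of the target table.
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
            if (!job.waitForCompletion(true)) {
                throw new IllegalStateException("HFile-generating job failed");
            }
            // Move the finished HFiles into the regions (bypasses the normal write path).
            new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, admin, table, locator);
        }
    }
}

This is only a sketch of the technique, not the Phoenix CsvBulkLoadTool itself; the pre-splitting of the target table still matters, since the reducer count follows the number of regions.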