Thanks, Gabriel. That is extremely helpful.
One clarification: you say I can find information about spills in the job counters. Are you talking about "failed" map tasks, or is there something else that will help me identify spill scenarios?

From: Gabriel Reid [mailto:gabriel.r...@gmail.com]
Sent: Monday, August 31, 2015 4:39 PM
To: user@phoenix.apache.org
Subject: Re: Help Tuning CsvBulkImport MapReduce

If the bulk of the time is being spent in the map phase, then there probably isn't all that much that can be done in terms of tuning that will make a huge difference. However, there may be a few things to look at.

You mentioned that HDFS decided to translate the Hive export to 257 files -- do you mean blocks, or are there actually 257 files on HDFS? If the latter, it's Hive (and/or MapReduce) that's responsible for the 257 files, but that's probably just a detail and not all that important.

How long is each of the map tasks taking? If they're only taking something like 30 seconds or so, then it would be worth trying to have each task process more data. This is most easily accomplished by using a bigger block size on HDFS, as each HDFS block typically results in a single map task. However, you'll want to first check how long each map task is taking -- if they're each taking 3-5 minutes (or more), then you won't gain much by increasing the block size.

A second thing to look at is the number of spills compared to the number of map output records -- you can find this information in the job counters. If the number of spills in the map phase is two (or more) times the number of map output records, you'll likely get a performance improvement by upping the mapreduce.task.io.sort.mb setting (or some other sort settings). However, before getting into this you'll want to confirm that the spills are actually an issue.

As I said above, though, if the map phase is taking up most of the time, the job is most likely CPU-bound on the conversion of CSV data to HBase KeyValues. This is especially likely if you're dealing with really wide rows. How many columns are you importing into your table?

- Gabriel

On Mon, Aug 31, 2015 at 3:20 PM Riesland, Zack <zack.riesl...@sensus.com> wrote:

I'm looking for some pointers on speeding up CsvBulkImport. Here's an example:

I took about 2 billion rows from Hive and exported them to CSV. HDFS decided to translate this to 257 files, each about 1 GB.

Running the CsvBulkImport tool against this folder results in 1,835 mappers and then 1 reducer per region on the HBase table. The whole process takes something like 2 hours, the bulk of which is spent on the mappers.

Any suggestions on how to possibly make this faster?

When I create the CSV files, I'm doing a pretty simple select statement from Hive. The results tend to be mostly sorted. I honestly don't know this space well enough to know whether that's good, bad, or neutral.

Thanks for any feedback!
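A note on pulling the spill numbers Gabriel refers to: they are ordinary job counters, visible on the job page in the ResourceManager / JobHistory UI as "Spilled Records" and "Map output records". One way to read them from the command line is a rough sketch like the following, assuming a Hadoop 2.x CLI; the job id here is a placeholder:

    # total records spilled to disk during the job
    mapred job -counter job_1440000000000_0001 \
        org.apache.hadoop.mapreduce.TaskCounter SPILLED_RECORDS

    # map output records, for comparison
    mapred job -counter job_1440000000000_0001 \
        org.apache.hadoop.mapreduce.TaskCounter MAP_OUTPUT_RECORDS

If the first number is roughly two times the second (or more), the map-side sort buffer is likely too small, which is the situation Gabriel describes.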
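On the block size suggestion: 257 files of roughly 1 GB each producing 1,835 map tasks works out to about 128-140 MB per split, i.e. the common default block size, so each file is being split several ways. Two possible ways to give each mapper more data, sketched under the assumption that the export lives at /data/hive-export (the path and sizes are made up):

    # (a) rewrite the export with a 1 GB block size so each file is one block
    hadoop distcp -D dfs.blocksize=1073741824 /data/hive-export /data/hive-export-1g

    # (b) or leave the files alone and request ~1 GB input splits when the
    #     bulk load job is submitted (see the invocation sketch below)
    -D mapreduce.input.fileinputformat.split.minsize=1073741824

As Gabriel notes, this only helps if the individual map tasks are short; if each one already runs for several minutes, fewer and bigger splits will not change the total work.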
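If the counters do point at excessive spilling, the sort buffer can be raised per-job rather than cluster-wide. A rough sketch of what the bulk load invocation might look like with that override; the jar name, table, input path, and ZooKeeper quorum are placeholders for whatever the real job uses, and the generic -D options must come before the tool-specific arguments:

    hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        -D mapreduce.task.io.sort.mb=512 \
        -D mapreduce.input.fileinputformat.split.minsize=1073741824 \
        --table MY_TABLE \
        --input /data/hive-export \
        --zookeeper zk1,zk2,zk3

Neither setting changes the CPU cost of converting CSV rows into HBase KeyValues, so if the mappers are CPU-bound on wide rows, as Gabriel suspects, the gains from these tweaks will be modest.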