Thanks Gabriel,

That is extremely helpful.

One clarification:

You say I can find information about spills in the job counters. Are you 
talking about “failed” map tasks, or is there something else that will help me 
identify spill scenarios?

From: Gabriel Reid [mailto:gabriel.r...@gmail.com]
Sent: Monday, August 31, 2015 4:39 PM
To: user@phoenix.apache.org
Subject: Re: Help Tuning CsvBulkImport MapReduce

If the bulk of the time is being spent in the map phase, then there probably 
isn't all that much that can be done in terms of tuning that will make a huge 
difference. However, there may be a few things to look at.

You mentioned that HDFS decided to translate the Hive export into 257 files -- do 
you mean blocks, or are there actually 257 files on HDFS? If it's the latter, it's 
Hive (and/or MapReduce) that's responsible for the 257 files, but that's probably 
just a detail and not all that important.

How long is each of the map tasks taking? If they're only taking around 30 
seconds each, then it would be worth trying to have each task process more data. 
This is most easily accomplished by using a bigger block size on HDFS, as each 
HDFS block typically results in a single map task. However, you'll want to first 
check how long each map task is taking -- if they're each taking 3-5 minutes (or 
more), then you won't gain much by increasing the block size.
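For illustration, something along these lines should work (an untested sketch -- 
the jar name, table name, paths, and sizes below are just placeholders): either 
re-upload the CSV export with a bigger block size, or ask for larger input splits 
when launching the bulk load:

    # Re-upload the export with a 512 MB block size (paths/sizes are examples)
    hdfs dfs -D dfs.blocksize=536870912 -put /local/export/*.csv /data/export-512m/

    # Or leave the files as-is and request larger input splits for the job
    hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        -D mapreduce.input.fileinputformat.split.minsize=536870912 \
        --table MY_TABLE \
        --input /data/export/

Fewer, bigger splits means fewer mappers that each do more work, at the cost of 
some data locality for splits that span multiple blocks.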

A second thing to look at is the number of spilled records compared to the number 
of map output records -- you can find both in the job counters ("Spilled Records" 
and "Map output records"). If the number of spilled records in the map phase is 
two (or more) times the number of map output records, you'll likely get an 
increase in performance by upping the mapreduce.task.io.sort.mb setting (or some 
of the other sort settings). However, before getting into this you'll want to 
confirm that the spills are actually an issue.
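To check, something like the following should show the relevant counters for a 
completed job and then pass a bigger sort buffer to the bulk load (the job id, 
jar name, table, and path are placeholders; the counter group and names are the 
standard MapReduce ones):

    # Compare spilled records to map output records for a finished job
    mapred job -counter job_1440000000000_0001 org.apache.hadoop.mapreduce.TaskCounter SPILLED_RECORDS
    mapred job -counter job_1440000000000_0001 org.apache.hadoop.mapreduce.TaskCounter MAP_OUTPUT_RECORDS

    # If spilled records far exceed map output records, try a larger map-side sort buffer (in MB)
    hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        -D mapreduce.task.io.sort.mb=512 \
        --table MY_TABLE \
        --input /data/export/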

As I said above, though, if the map phase is taking up most of the time, the job 
is probably CPU-bound on the conversion of CSV data to HBase KeyValues. This is 
especially likely if you're dealing with really wide rows. How many columns are 
you importing into your table?

- Gabriel
On Mon, Aug 31, 2015 at 3:20 PM Riesland, Zack 
<zack.riesl...@sensus.com> wrote:
I’m looking for some pointers on speeding up CsvBulkImport.

Here’s an example:

I took about 2 billion rows from Hive and exported them to CSV.

HDFS decided to translate this to 257 files, each about 1 GB.

Running the CsvBulkImport tool against this folder results in 1,835 mappers and 
then 1 reducer per region on the HBase table.

The whole process takes something like 2 hours, the bulk of which is spent on 
mappers.

Any suggestions on how to possibly make this faster?

When I create the CSV files, I'm doing a pretty simple SELECT statement from 
Hive. The results tend to be mostly sorted.

I honestly don’t know this space well enough to know whether that’s good, bad, 
or neutral.

Thanks for any feedback!
