how to tune phoenix CsvBulkLoadTool job

2016-03-19 Thread Vamsi Krishna
Hi,

I'm using CsvBulkLoadTool to load a CSV data file into a Phoenix/HBase table.

HDP Version : 2.3.2 (Phoenix Version : 4.4.0, HBase Version: 1.1.2)
CSV file size: 97.6 GB
No. of records: 1,439,000,238
Cluster: 13 node
Phoenix table salt-buckets: 13
Phoenix table compression: snappy
HBase table size after loading: 26.6 GB
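
For reference, the table DDL and the load command look roughly like this
(table name, columns, file path and ZooKeeper quorum below are simplified
placeholders for the real ones, and the client jar path is the HDP default):

  # create_table.sql contains something along the lines of:
  #   CREATE TABLE MY_TABLE (ID VARCHAR NOT NULL PRIMARY KEY, COL1 VARCHAR)
  #       SALT_BUCKETS = 13, COMPRESSION = 'SNAPPY';
  psql.py zk1,zk2,zk3 create_table.sql

  # run the bulk load (adjust the jar path for your install)
  hadoop jar /usr/hdp/current/phoenix-client/phoenix-client.jar \
      org.apache.phoenix.mapreduce.CsvBulkLoadTool \
      --table MY_TABLE \
      --input /data/my_table.csv \
      --zookeeper zk1,zk2,zk3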

The job completed in *1hrs, 39mins, 43sec*.
Average Map Time 5mins, 25sec
Average Shuffle Time *47mins, 46sec*
Average Merge Time 12mins, 22sec
Average Reduce Time *32mins, 9sec*

I'm looking for an opportunity to tune this job.
Could someone please help me with some pointers on how to tune this job?
Please let me know if you need to know any cluster configuration parameters
that I'm using.

*This is only a performance test. My PRODUCTION data file is 7x bigger.*

Thanks,
Vamsi Attluri

-- 
Vamsi Attluri


Re: how to tune phoenix CsvBulkLoadTool job

2016-03-19 Thread Gabriel Reid
Hi Vamsi,

I see from your counters that the number of map spill records is
double the number of map output records, which suggests that map
output is being written to disk more than once (extra spill and merge
passes) before it is shipped to the reducers. Raising the
mapreduce.task.io.sort.mb setting on the job should therefore improve
the shuffle throughput.
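
For example, since the tool runs through ToolRunner, you should be able to
pass the property directly when launching the job (the value below is only
a starting point, and the jar path, table name and input path are
placeholders for whatever you're actually using):

  hadoop jar /usr/hdp/current/phoenix-client/phoenix-client.jar \
      org.apache.phoenix.mapreduce.CsvBulkLoadTool \
      -Dmapreduce.task.io.sort.mb=512 \
      --table MY_TABLE \
      --input /data/my_table.csv \
      --zookeeper zk1,zk2,zk3

  # note: a bigger sort buffer has to fit in the map task heap, so
  # mapreduce.map.java.opts / mapreduce.map.memory.mb may need to grow with it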

However, like I said before, I think that the first thing to try is
increasing the number of regions.

As for reads: increasing the number of regions can potentially
increase read parallelism in Phoenix, but Phoenix already does
sub-region parallel reads internally, so there probably won't be a
huge effect either way in terms of read performance.

Aggregate queries shouldn't be impacted much either way. The
sub-region parallelism that Phoenix applies to reads remains in place
regardless. In addition, aggregation is performed per region (or per
sub-region split), and the partial results are then combined to
produce the overall aggregate. Having, for example, five times as many
regions would increase the number of partial results that need to be
combined, but that work is very minor compared to the total cost of
the aggregation itself, so it also shouldn't have a major effect
either way.

- Gabriel

On Wed, Mar 16, 2016 at 7:15 PM, Vamsi Krishna  wrote:
> Thanks Gabriel,
> Please find the job counters attached.
>
> Would increasing the splitting affect the reads?
> I assume a simple read would benefit from increased splitting, since it
> increases the parallelism.
> But, how would it impact the aggregate queries?
>
> Vamsi Attluri
>
> On Wed, Mar 16, 2016 at 9:06 AM Gabriel Reid  wrote:
>>
>> Hi Vamsi,
>>
>> The first thing that I notice looking at the info that you've posted
>> is that you have 13 nodes and 13 salt buckets (which I assume also
>> means that you have 13 regions).
>>
>> A single region is the unit of parallelism that is used for reducers
>> in the CsvBulkLoadTool (or any HFile-writing MapReduce job in general), so
>> currently you're only getting an average of a single reduce process
>> per node on your cluster. Assuming that you have multiple cores in
>> each of those nodes, you will probably get a decent improvement in
>> performance by further splitting your destination table so that it has
>> multiple regions per node (thereby triggering multiple reduce tasks
>> per node).
>>
>> Would you also be able to post the full set of job counters that are
>> shown after the job is completed? This would also be helpful in
>> pinpointing things that can be (possibly) tuned.
>>
>> - Gabriel
>>
>>
>> On Wed, Mar 16, 2016 at 1:28 PM, Vamsi Krishna 
>> wrote:
>> > Hi,
>> >
>> > I'm using CsvBulkLoadTool to load a csv data file into Phoenix/HBase
>> > table.
>> >
>> > HDP Version : 2.3.2 (Phoenix Version : 4.4.0, HBase Version: 1.1.2)
>> > CSV file size: 97.6 GB
>> > No. of records: 1,439,000,238
>> > Cluster: 13 node
>> > Phoenix table salt-buckets: 13
>> > Phoenix table compression: snappy
>> > HBase table size after loading: 26.6 GB
>> >
>> > The job completed in 1hrs, 39mins, 43sec.
>> > Average Map Time 5mins, 25sec
>> > Average Shuffle Time 47mins, 46sec
>> > Average Merge Time 12mins, 22sec
>> > Average Reduce Time 32mins, 9sec
>> >
>> > I'm looking for an opportunity to tune this job.
>> > Could someone please help me with some pointers on how to tune this job?
>> > Please let me know if you need to know any cluster configuration
>> > parameters
>> > that I'm using.
>> >
>> > This is only a performance test. My PRODUCTION data file is 7x bigger.
>> >
>> > Thanks,
>> > Vamsi Attluri
>> >
>> > --
>> > Vamsi Attluri
>
> --
> Vamsi Attluri


Re: how to tune phoenix CsvBulkLoadTool job

2016-03-19 Thread Gabriel Reid
Hi Vamsi,

The first thing that I notice looking at the info that you've posted
is that you have 13 nodes and 13 salt buckets (which I assume also
means that you have 13 regions).

A single region is the unit of parallelism that is used for reducers
in the CsvBulkLoadTool (or any HFile-writing MapReduce job in general), so
currently you're only getting an average of a single reduce process
per node on your cluster. Assuming that you have multiple cores in
each of those nodes, you will probably get a decent improvement in
performance by further splitting your destination table so that it has
multiple regions per node (thereby triggering multiple reduce tasks
per node).
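
As a rough sketch (the bucket count, table name and schema below are made
up purely for illustration), recreating the table with three to four salt
buckets per node would give the bulk load that many reduce tasks per node
on a 13-node cluster:

  # recreate_table.sql contains something along the lines of:
  #   CREATE TABLE MY_TABLE (ID VARCHAR NOT NULL PRIMARY KEY, COL1 VARCHAR)
  #       SALT_BUCKETS = 52, COMPRESSION = 'SNAPPY';
  # (52 buckets = 52 regions = 52 reducers, i.e. 4 per node on 13 nodes;
  #  alternatively the table can be pre-split on row key ranges with SPLIT ON)
  psql.py zk1,zk2,zk3 recreate_table.sql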

Would you also be able to post the full set of job counters that are
shown after the job is completed? This would also be helpful in
pinpointing things that can be (possibly) tuned.

- Gabriel


On Wed, Mar 16, 2016 at 1:28 PM, Vamsi Krishna  wrote:
> Hi,
>
> I'm using CsvBulkLoadTool to load a csv data file into Phoenix/HBase table.
>
> HDP Version : 2.3.2 (Phoenix Version : 4.4.0, HBase Version: 1.1.2)
> CSV file size: 97.6 GB
> No. of records: 1,439,000,238
> Cluster: 13 node
> Phoenix table salt-buckets: 13
> Phoenix table compression: snappy
> HBase table size after loading: 26.6 GB
>
> The job completed in 1hrs, 39mins, 43sec.
> Average Map Time 5mins, 25sec
> Average Shuffle Time 47mins, 46sec
> Average Merge Time 12mins, 22sec
> Average Reduce Time 32mins, 9sec
>
> I'm looking for an opportunity to tune this job.
> Could someone please help me with some pointers on how to tune this job?
> Please let me know if you need to know any cluster configuration parameters
> that I'm using.
>
> This is only a performance test. My PRODUCTION data file is 7x bigger.
>
> Thanks,
> Vamsi Attluri
>
> --
> Vamsi Attluri