You may also want to take a look at tuning phoenix.stats.guidepost.width or phoenix.stats.guidepost.per.region, as these may affect how many mapper tasks you end up with. See http://phoenix.apache.org/tuning.html and http://phoenix.apache.org/update_statistics.html for more information. A quick test would be to set phoenix.stats.guidepost.per.region to 1 on all your region servers and run UPDATE STATISTICS on the table again. Or, even easier as a quick test, delete the existing stats prior to running your MR job: DELETE FROM SYSTEM.STATS
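The quick test above can be scripted end to end. A sketch using Phoenix's psql.py, where zk-host, MY_TABLE, and the .sql file names are placeholders rather than values from this thread:

```shell
# Sketch of the quick test; zk-host and MY_TABLE are placeholders.
# Step 1: drop all existing guideposts so the next bulk load starts clean.
cat > reset_stats.sql <<'EOF'
DELETE FROM SYSTEM.STATS;
EOF
psql.py zk-host reset_stats.sql

# Step 2 (optional): after setting phoenix.stats.guidepost.per.region=1 in
# hbase-site.xml on every region server and restarting them, recollect stats.
cat > update_stats.sql <<'EOF'
UPDATE STATISTICS MY_TABLE;
EOF
psql.py zk-host update_stats.sql
```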
Please let us know how it goes, Ralph.

Thanks,
James

On Thu, Apr 2, 2015 at 2:40 PM, Perko, Ralph J <[email protected]> wrote:

> Thanks - I will try your suggestion. Do you know why there are so many
> more output than input records on the main table (39x more)?
>
> From: Ravi Kiran [mailto:[email protected]]
> Sent: Thursday, April 02, 2015 2:35 PM
> To: [email protected]
> Subject: Re: bulk loader MR counters
>
> Hi Ralph,
>
> I assume that when you run the MR job for the main table, you have a
> larger number of columns to load than the MR job for the index table,
> which is why you see more spilled records.
>
> To tune the MR job for the main table, I would do the following first and
> then measure the counters again to see whether they improve:
>
> a) To avoid spilled records during the MR job for the main table, I would
> recommend increasing mapreduce.task.io.sort.mb to a value such as 500 MB
> rather than the default 100 MB.
>
> b) Increase mapreduce.task.io.sort.factor to merge a higher number of
> streams at once while sorting map output.
>
> Regards,
> Ravi
>
> From: Perko, Ralph J
> Sent: Thursday, April 02, 2015 2:36 PM
> To: [email protected]
> Subject: RE: bulk loader MR counters
>
> My apologies, the formatting did not come out as planned. Here is another
> go:
>
> Hi, we recently upgraded our cluster (Phoenix 4.3 – HDP 2.2) and I’m seeing
> a significant degradation in performance. I am going through the MR
> counters for a Phoenix CsvBulkLoad job and I am hoping you can help me
> understand some things.
>
> There is a base table with 4 index tables, so a total of 5 MR jobs run – one
> for each table.
>
> Here are the counters for an index table MR job.
>
> Note two things:
> - the input and output record counts are the same, as expected
> - there seem to be a lot of spilled records.
> ===========================================================
> Category,Map,Reduce,Total
> Combine input records,0,0,0
> Combine output records,0,0,0
> CPU time spent (ms),1800380,156630,1957010
> Failed Shuffles,0,0,0
> GC time elapsed (ms),39738,1923,41661
> Input split bytes,690,0,690
> Map input records,13637198,0,13637198
> Map output bytes,2144112474,0,2144112474
> Map output materialized bytes,2171387170,0,2171387170
> Map output records,13637198,0,13637198
> Merged Map outputs,0,50,50
> Physical memory (bytes) snapshot,8493744128,10708692992,19202437120
> Reduce input groups,0,13637198,13637198
> Reduce input records,0,13637198,13637198
> Reduce output records,0,13637198,13637198
> Reduce shuffle bytes,0,2171387170,2171387170
> Shuffled Maps,0,50,50
> Spilled Records,13637198,13637198,27274396
> Total committed heap usage (bytes),11780751360,26862419968,38643171328
> Virtual memory (bytes) snapshot,25903271936,96590065664,122493337600
>
> Here are the counters for the main table MR job.
>
> Please note:
> - the input record count is correct - the same as above
> - the output record count is many times the input
> - the output bytes are many times the output bytes above
> - the number of spilled records is many times the number of input records
>   and twice the number of output records.
>
> ===========================================================
> Category,Map,Reduce,Total
> Combine input records,0,0,0
> Combine output records,0,0,0
> CPU time spent (ms),5059340,2035910,7095250
> Failed Shuffles,0,0,0
> GC time elapsed (ms),38937,13748,52685
> Input split bytes,690,0,690
> Map input records,13637198,0,13637198
> Map output bytes,59638106406,0,59638106406
> Map output materialized bytes,60702718624,0,60702718624
> Map output records,531850722,0,531850722
> Merged Map outputs,0,50,50
> Physical memory (bytes) snapshot,8398745600,2756530176,11155275776
> Reduce input groups,0,13637198,13637198
> Reduce input records,0,531850722,531850722
> Reduce output records,0,531850722,531850722
> Reduce shuffle bytes,0,60702718624,60702718624
> Shuffled Maps,0,50,50
> Spilled Records,1063701444,531850722,1595552166
> Total committed heap usage (bytes),10136059904,19488309248,29624369152
> Virtual memory (bytes) snapshot,25926946816,96562970624,122489917440
>
> Is the large number of output records as opposed to input records normal?
> Is the large number of spilled records normal?
>
> Thanks for your help,
> Ralph
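The io.sort settings Ravi suggests in the quoted thread need not be changed cluster-wide; they can be passed per job on the CsvBulkLoadTool command line. A sketch, where the client jar path, ZooKeeper quorum, table name, and input path are all placeholders:

```shell
# Sketch only: jar path, quorum, table, and input path are placeholders.
# The -D options must precede the tool-specific flags so Hadoop's
# GenericOptionsParser picks them up.
hadoop jar phoenix-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    -Dmapreduce.task.io.sort.mb=500 \
    -Dmapreduce.task.io.sort.factor=64 \
    --table MY_TABLE \
    --input /data/my_table.csv \
    --zookeeper zk-host
```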
