You may also want to take a look at tuning phoenix.stats.guidepost.width or phoenix.stats.guidepost.per.region, as these may affect how many mapper tasks you end up with. See http://phoenix.apache.org/tuning.html and http://phoenix.apache.org/update_statistics.html for more information. A quick test would be to set phoenix.stats.guidepost.per.region to 1 on all your region servers and run UPDATE STATISTICS on the table again. Or, even easier as a quick test, delete the existing stats prior to running your MR job: DELETE FROM SYSTEM.STATS
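The quick test above can be scripted end to end. A sketch using Phoenix's psql.py, where zk-host, MY_TABLE, and the .sql file names are placeholders rather than values from this thread:

```shell
# Sketch of the quick test; zk-host and MY_TABLE are placeholders.
# Step 1: drop all existing guideposts so the next bulk load starts clean.
cat > reset_stats.sql <<'EOF'
DELETE FROM SYSTEM.STATS;
EOF
psql.py zk-host reset_stats.sql

# Step 2 (optional): after setting phoenix.stats.guidepost.per.region=1 in
# hbase-site.xml on every region server and restarting them, recollect stats.
cat > update_stats.sql <<'EOF'
UPDATE STATISTICS MY_TABLE;
EOF
psql.py zk-host update_stats.sql
```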
Please let us know how it goes, Ralph.

Thanks,
James

On Thu, Apr 2, 2015 at 2:40 PM, Perko, Ralph J <[email protected]> wrote:

> Thanks - I will try your suggestion. Do you know why there are so many
> more output than input records on the main table (39x more)?
>
> From: Ravi Kiran [mailto:[email protected]]
> Sent: Thursday, April 02, 2015 2:35 PM
> To: [email protected]
> Subject: Re: bulk loader MR counters
>
> Hi Ralph,
>
> I assume that when you run the MR job for the main table, you have a
> larger number of columns to load than the MR job for the index table,
> which is why you see more spilled records.
>
> To tune the MR job for the main table, I would do the following first and
> then measure the counters again to see whether they improve:
>
> a) To avoid spilled records during the MR job for the main table, I would
> recommend increasing mapreduce.task.io.sort.mb to a value such as 500 MB
> rather than the default 100 MB.
>
> b) Increase mapreduce.task.io.sort.factor to merge a higher number of
> streams at once while sorting map output.
>
> Regards,
> Ravi
>
> From: Perko, Ralph J
> Sent: Thursday, April 02, 2015 2:36 PM
> To: [email protected]
> Subject: RE: bulk loader MR counters
>
> My apologies, the formatting did not come out as planned. Here is another
> go:
>
> Hi, we recently upgraded our cluster (Phoenix 4.3 – HDP 2.2) and I’m seeing
> a significant degradation in performance. I am going through the MR
> counters for a Phoenix CsvBulkLoad job and I am hoping you can help me
> understand some things.
>
> There is a base table with 4 index tables, so a total of 5 MR jobs run – one
> for each table.
>
> Here are the counters for an index table MR job.
>
> Note two things:
> - the input and output record counts are the same, as expected
> - there seem to be a lot of spilled records.
> ===========================================================
> Category,Map,Reduce,Total
> Combine input records,0,0,0
> Combine output records,0,0,0
> CPU time spent (ms),1800380,156630,1957010
> Failed Shuffles,0,0,0
> GC time elapsed (ms),39738,1923,41661
> Input split bytes,690,0,690
> Map input records,13637198,0,13637198
> Map output bytes,2144112474,0,2144112474
> Map output materialized bytes,2171387170,0,2171387170
> Map output records,13637198,0,13637198
> Merged Map outputs,0,50,50
> Physical memory (bytes) snapshot,8493744128,10708692992,19202437120
> Reduce input groups,0,13637198,13637198
> Reduce input records,0,13637198,13637198
> Reduce output records,0,13637198,13637198
> Reduce shuffle bytes,0,2171387170,2171387170
> Shuffled Maps,0,50,50
> Spilled Records,13637198,13637198,27274396
> Total committed heap usage (bytes),11780751360,26862419968,38643171328
> Virtual memory (bytes) snapshot,25903271936,96590065664,122493337600
>
> Here are the counters for the main table MR job.
>
> Please note:
> - the input record count is correct - the same as above
> - the output record count is many times the input
> - the output bytes are many times the output bytes above
> - the number of spilled records is many times the number of input records
>   and twice the number of output records.
>
> ===========================================================
> Category,Map,Reduce,Total
> Combine input records,0,0,0
> Combine output records,0,0,0
> CPU time spent (ms),5059340,2035910,7095250
> Failed Shuffles,0,0,0
> GC time elapsed (ms),38937,13748,52685
> Input split bytes,690,0,690
> Map input records,13637198,0,13637198
> Map output bytes,59638106406,0,59638106406
> Map output materialized bytes,60702718624,0,60702718624
> Map output records,531850722,0,531850722
> Merged Map outputs,0,50,50
> Physical memory (bytes) snapshot,8398745600,2756530176,11155275776
> Reduce input groups,0,13637198,13637198
> Reduce input records,0,531850722,531850722
> Reduce output records,0,531850722,531850722
> Reduce shuffle bytes,0,60702718624,60702718624
> Shuffled Maps,0,50,50
> Spilled Records,1063701444,531850722,1595552166
> Total committed heap usage (bytes),10136059904,19488309248,29624369152
> Virtual memory (bytes) snapshot,25926946816,96562970624,122489917440
>
> Is the large number of output records as opposed to input records normal?
> Is the large number of spilled records normal?
>
> Thanks for your help,
> Ralph
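The io.sort settings Ravi suggests in the quoted thread need not be changed cluster-wide; they can be passed per job on the CsvBulkLoadTool command line. A sketch, where the client jar path, ZooKeeper quorum, table name, and input path are all placeholders:

```shell
# Sketch only: jar path, quorum, table, and input path are placeholders.
# The -D options must precede the tool-specific flags so Hadoop's
# GenericOptionsParser picks them up.
hadoop jar phoenix-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    -Dmapreduce.task.io.sort.mb=500 \
    -Dmapreduce.task.io.sort.factor=64 \
    --table MY_TABLE \
    --input /data/my_table.csv \
    --zookeeper zk-host
```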
