Hi Sonny,

If the mappers are all similarly slow, it likely indicates that the cube has too many cuboids (dimension combinations). Could you please let me know how many dimensions you have, and how you distribute them across the aggregation groups? Try to optimize the design with mandatory/joint/hierarchy dimensions as much as possible, according to your query pattern and data characteristics.
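To illustrate why the aggregation group rules matter, here is a rough sketch of how the cuboid count of a single aggregation group shrinks as rules are added. It assumes the usual rule of thumb: a mandatory dimension contributes a factor of 1, a hierarchy of k dimensions contributes k + 1 combinations, and a joint group contributes 2. The function name `estimate_cuboids` and the 20-dimension example are hypothetical and are not taken from your cube; the estimate also ignores the base cuboid and any overlap between groups.

```python
def estimate_cuboids(total_dims, mandatory=0, hierarchies=(), joints=()):
    """Rough cuboid estimate for ONE aggregation group (illustrative only).

    total_dims  -- number of dimensions in the group
    mandatory   -- number of mandatory dimensions (always present: factor 1)
    hierarchies -- sizes of each hierarchy (k dims allow k + 1 combinations)
    joints      -- sizes of each joint group (all-or-nothing: factor 2)
    """
    # Dimensions not covered by any rule can each be present or absent.
    normal = total_dims - mandatory - sum(hierarchies) - sum(joints)
    if normal < 0:
        raise ValueError("rules cover more dimensions than the group has")

    count = 2 ** normal
    for size in hierarchies:
        count *= size + 1
    count *= 2 ** len(joints)
    return count


# Hypothetical 20-dimension group with no rules at all.
full = estimate_cuboids(20)

# Same 20 dimensions: 2 mandatory, one 3-level hierarchy, joints of 3 and 4.
tuned = estimate_cuboids(20, mandatory=2, hierarchies=(3,), joints=(3, 4))

print(full, tuned)
```

In this made-up case the estimate drops from 2^20 to 2^8 * 4 * 2 * 2 = 4096 combinations, which is the kind of reduction the mandatory/joint/hierarchy settings are meant to give.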
2017-12-20 14:25 GMT+08:00 Sonny Heer <[email protected]>:

> Hi ShaoFeng, thanks for the quick response. Kylin version is 1.6.
>
> The step is #3, and it takes the longest time in the map phase.
> Sort/shuffle and reduce seem to be OK. Yes, we went through that document.
> The input mappers are set to about 1.1 million, giving us 225 mappers for
> an input of 234 million records. All mappers run at the same time since that
> is the number of mapper slots we have. The mappers all seem to take the
> same amount of time (we didn't notice any long runners at the end).
>
> The M/R stats output for that step is below. Troubling is the 4.6 billion
> output records from the map phase. So is there a general place we can look
> for the "Extract Fact Table Distinct Columns" step? Thanks
>
> Map-Reduce Framework
>   Map input records=234707850
>   Map output records=4687531086
>   Map output bytes=49568802916
>   Map output materialized bytes=9852827353
>   Input split bytes=965025
>   Combine input records=4687531086
>   Combine output records=33878243
>   Reduce input groups=281301
>   Reduce shuffle bytes=9852827353
>   Reduce input records=33878243
>   Reduce output records=0
>   Spilled Records=67756486
>   Shuffled Maps =5850
>   Failed Shuffles=0
>   Merged Map outputs=5850
>   GC time elapsed (ms)=49602314
>   CPU time spent (ms)=759218400
>   Physical memory (bytes) snapshot=418766036992
>   Virtual memory (bytes) snapshot=898566012928
>   Total committed heap usage (bytes)=391907901440
>
> On Tue, Dec 19, 2017 at 10:13 PM, ShaoFeng Shi <[email protected]>
> wrote:
>
>> Hi Sonny,
>>
>> Did you check this document, which has the description of each step:
>> https://kylin.apache.org/docs21/howto/howto_optimize_build.html
>>
>> Besides, what's your Kylin version? And did you check the MR job progress
>> to see which stage is the most expensive, map or reduce, and what the
>> numbers of mappers and reducers are? Do all mappers/reducers take a similar
>> time, or did some specific ones take much longer than others?
>>
>> Furthermore, for a deep dive, please provide the cube definition. We need to
>> know the number of dimensions, the aggregation groups, the encoding methods,
>> as well as other possible factors.
>>
>> 2017-12-20 13:00 GMT+08:00 Sonny Heer <[email protected]>:
>>
>>> Can someone explain what step 3 does?
>>>
>>> Specifically, how it relates to dimensions, measures, and row keys. Our
>>> input fact table is about 234 million records and this step is taking
>>> forever.
>>>
>>> We have 450 GB of memory with 25 slots per node, which is about 225
>>> concurrently running slots, and it's still taking a while.
>>>
>>> The doc just talks about looking at "optimize cube", but that page talks
>>> about hierarchy columns and derived columns. We don't have any lookup
>>> tables, so no derived columns, and there is no natural hierarchy.
>>>
>>> Just trying to find what item controls why this step takes longer vs.
>>> shorter, time-wise.
>>>
>>> Thanks
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋

--
Best regards,

Shaofeng Shi 史少锋
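For reference, the counters quoted in this thread can be turned into a few ratios that make the map-side amplification easier to see. This is only arithmetic on the numbers reported above; the variable names are made up for the sketch, and it says nothing by itself about which setting causes the amplification.

```python
# Counters copied from the "Map-Reduce Framework" block quoted in this thread.
map_input_records = 234_707_850
map_output_records = 4_687_531_086
combine_output_records = 33_878_243
mappers = 225  # number of mappers reported for the step

# Average number of records each input row turns into on the map side.
amplification = map_output_records / map_input_records

# How strongly the combiner collapses the map output before the shuffle.
combine_reduction = map_output_records / combine_output_records

# Average input rows handled by one mapper.
rows_per_mapper = map_input_records / mappers

print(f"map output per input row: {amplification:.2f}")
print(f"combiner reduction factor: {combine_reduction:.1f}")
print(f"input rows per mapper:     {rows_per_mapper:,.0f}")
```

Each input row produces close to 20 map output records on average, and the combiner then shrinks that output by more than a factor of 100, so most of the cost of this step sits in the map phase rather than in the shuffle or reduce.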
