Re: Extract Fact Table Distinct Columns Step

ShaoFeng Shi Wed, 20 Dec 2017 00:09:07 -0800

Hi Sonny,

If all the mappers are similarly slow, it likely indicates that the cube has
too many cuboids (dimension combinations). Could you please let me know how
many dimensions you have, and how you distribute them across the aggregation
groups? Try to optimize the design with mandatory/joint/hierarchy dimensions
as much as possible, according to your query pattern and data characteristics.
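
To illustrate why the number of cuboids matters: with N dimensions and no
optimization, a cube can have up to 2^N cuboids. The sketch below is a rough
estimate for a single aggregation group, assuming that mandatory dimensions
are always present, that a hierarchy of k levels allows k + 1 combinations,
and that a joint group is either fully present or fully absent. The function
name estimate_cuboids and the dimension counts in the usage lines are
hypothetical and not taken from this thread; Kylin's actual pruning may
differ.

```python
def estimate_cuboids(normal=0, mandatory=0, hierarchies=(), joints=()):
    """Rough cuboid count for one aggregation group (illustrative only).

    normal      -- number of unconstrained dimensions (each present or absent)
    mandatory   -- number of mandatory dimensions (always present, so they
                   contribute a factor of 1 and do not change the count)
    hierarchies -- sizes of hierarchy groups (k levels -> k + 1 choices)
    joints      -- sizes of joint groups (all or none -> 2 choices each)
    """
    if normal < 0 or mandatory < 0:
        raise ValueError("dimension counts must be non-negative")
    count = 2 ** normal
    for levels in hierarchies:
        count *= levels + 1
    for _size in joints:
        count *= 2
    return count


# Hypothetical cube with 20 dimensions, none of them constrained.
unoptimized = estimate_cuboids(normal=20)

# The same 20 dimensions split as 2 mandatory, one 3-level hierarchy,
# two joint groups of 4, and 7 unconstrained dimensions.
optimized = estimate_cuboids(normal=7, mandatory=2,
                             hierarchies=(3,), joints=(4, 4))

print(unoptimized, optimized)
```

In this made-up example the same 20 dimensions go from about a million
combinations to a few thousand, which is the kind of reduction the
mandatory/joint/hierarchy settings are meant to give.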

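A quick check of the counters quoted below shows how much the map phase
amplifies the input. This sketch only does arithmetic on the numbers reported
in the thread; the variable names are illustrative, and what the ratio implies
about the cube design is not established here.

```python
# Counters copied from the Map-Reduce Framework output quoted in this thread.
map_input_records = 234_707_850
map_output_records = 4_687_531_086
combine_output_records = 33_878_243

# Average number of records each mapper emits per input row.
records_per_row = map_output_records / map_input_records

# How much the combiner shrinks the map output before the shuffle.
combiner_reduction = map_output_records / combine_output_records

print(f"map output records per input row: {records_per_row:.2f}")
print(f"combiner reduction factor: {combiner_reduction:.1f}")
```

With these numbers each input row yields close to 20 map output records, and
the combiner shrinks the map output by more than a factor of 100 before the
shuffle.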

2017-12-20 14:25 GMT+08:00 Sonny Heer <[email protected]>:

> Hi ShaoFeng, thanks for the quick response.  Kylin version is 1.6.
>
> The step is #3, and it spends the longest time in the map phase; the
> sort/shuffle and reduce seem to be OK.  Yes, we went through that document.
> The mapper input is set to about 1.1 million, giving us 225 mappers for an
> input of 234 million records.  All mappers run at the same time, since that
> is the number of mapper slots we have.  The mappers all seem to take the
> same amount of time (we didn't notice any long runners at the end).
>
> The M/R stats output for that step is below.  What is troubling is the
> roughly 4.7 billion output records from the map phase.  Is there a general
> place we can look for information on the "Extract Fact Table Distinct
> Columns" step?  Thanks
>
>
> Map-Reduce Framework
>               Map input records=234707850
>               Map output records=4687531086
>               Map output bytes=49568802916
>               Map output materialized bytes=9852827353
>               Input split bytes=965025
>               Combine input records=4687531086
>               Combine output records=33878243
>               Reduce input groups=281301
>               Reduce shuffle bytes=9852827353
>               Reduce input records=33878243
>               Reduce output records=0
>               Spilled Records=67756486
>               Shuffled Maps =5850
>               Failed Shuffles=0
>               Merged Map outputs=5850
>               GC time elapsed (ms)=49602314
>               CPU time spent (ms)=759218400
>               Physical memory (bytes) snapshot=418766036992
>               Virtual memory (bytes) snapshot=898566012928
>               Total committed heap usage (bytes)=391907901440
>
>
> On Tue, Dec 19, 2017 at 10:13 PM, ShaoFeng Shi <[email protected]>
> wrote:
>
>> Hi Sonny,
>>
>> Did you check this document, which has the description of each step:
>> https://kylin.apache.org/docs21/howto/howto_optimize_build.html
>>
>> Besides, what's your Kylin version? Did you check the MR job progress
>> to see which stage is the most expensive, map or reduce, and how many
>> mappers and reducers there are? Do all mappers/reducers take a similar
>> time, or did some specific ones take much longer than others?
>>
>> Furthermore, for a deep dive, please provide the cube definition. We need
>> to know the number of dimensions, the aggregation groups, and the encoding
>> methods, as well as other possible factors.
>>
>> 2017-12-20 13:00 GMT+08:00 Sonny Heer <[email protected]>:
>>
>>> Can someone explain what step 3 does?
>>>
>>> Specifically, how does it relate to dimensions, measures, and row keys?
>>> Our input fact table is about 234 million records and this step is taking
>>> forever.
>>>
>>> We have 450 GB of memory with 25 slots per node, which is about 225
>>> concurrently running slots, and it's still taking a while.
>>>
>>> The doc just talks about looking at optimize cube, but that page talks
>>> about hierarchy columns and derived columns.  We don't have any lookup
>>> tables, so no derived columns, and there is no natural hierarchy.
>>>
>>> Just trying to find what controls whether this step takes a longer or
>>> shorter time.
>>>
>>> Thanks
>>>
>>
>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>>
>>
>


-- 
Best regards,

Shaofeng Shi 史少锋
