Re: Cube with TopN metrics takes a very long time to build

Ostapenko, Vsevolod Tue, 23 Jul 2019 09:43:58 -0700

Hi Shaofeng,
Thanks for your reply.

I’m running Kylin 2.6.3 with a few minor tweaks.
To produce the log output mentioned below, I added an extra logging statement to
report the total number of processed values for each invocation of the
doReduce() method.


    @Override
    public void doReduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        aggs.reset();

        for (Text value : values) {
            if (vcounter++ % BatchConstants.NORMAL_RECORD_LOG_THRESHOLD == 0) {
                logger.info("Handling value with ordinal (This is not KV number!): {}", vcounter);
            }
            codec.decode(ByteBuffer.wrap(value.getBytes(), 0, value.getLength()), input);
            aggs.aggregate(input, needAggrMeasures);
        }
        logger.info("Total number of values processed: {}", vcounter);
        aggs.collectStates(result);

        ByteBuffer valueBuf = codec.encode(result);

        outputValue.set(valueBuf.array(), 0, valueBuf.position());
        context.write(key, outputValue);
    }
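
One caveat about the snippet above: vcounter is not reset inside doReduce(), so the logged "total" appears to be a running count across all keys handled by a reducer task rather than a per-key count. A minimal standalone sketch of counting and timing per invocation follows. It has no Hadoop or Kylin dependencies, and the class and method names (PerKeyStats, reduceOnce) are hypothetical:

```java
import java.util.Arrays;

public class PerKeyStats {
    // Running total across all keys, comparable to vcounter in the snippet above.
    private long totalValues = 0;

    /** Processes one key's values and returns how many values this call saw. */
    public long reduceOnce(Iterable<String> values) {
        long perKey = 0;                      // starts from zero on every invocation
        long startNanos = System.nanoTime();
        for (String value : values) {
            perKey++;
            totalValues++;
            // decode + aggregate would happen here
        }
        long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
        System.out.printf("values for this key: %d, running total: %d, elapsed: %d ms%n",
                perKey, totalValues, elapsedMs);
        return perKey;
    }

    public long getTotalValues() {
        return totalValues;
    }

    public static void main(String[] args) {
        PerKeyStats stats = new PerKeyStats();
        stats.reduceOnce(Arrays.asList("a", "b", "c"));
        stats.reduceOnce(Arrays.asList("d", "e"));
    }
}
```

Logging the per-key count together with the elapsed time would show directly which keys carry the skew and how long each one takes to aggregate.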

I read the blog post about TopN implementation details (more than once, in
fact). While it’s quite useful in explaining general concepts, it doesn’t seem
to address the question of how to deal with skewed data or how to estimate the
“breaking point” for the TopN algorithm.
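
For illustration, the reply quoted below describes each TopN merge as re-grouping followed by sorting. A deliberately simplified sketch of that pattern is shown here. It is not Kylin's TopNCounter.merge(), which also handles error compensation, and the class name SimpleTopNMerge and its merge method are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SimpleTopNMerge {
    /** Merges two count maps, then keeps only the 'capacity' largest entries. */
    static Map<String, Double> merge(Map<String, Double> a, Map<String, Double> b, int capacity) {
        // Re-group: sum the counts of keys that appear on either side.
        Map<String, Double> combined = new HashMap<>(a);
        b.forEach((k, v) -> combined.merge(k, v, Double::sum));

        // Sort descending by count; this sort is paid again on every merge.
        List<Map.Entry<String, Double>> entries = new ArrayList<>(combined.entrySet());
        entries.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));

        // Trim to capacity, preserving the sorted order.
        Map<String, Double> result = new LinkedHashMap<>();
        int keep = Math.min(capacity, entries.size());
        for (Map.Entry<String, Double> e : entries.subList(0, keep)) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> a = new HashMap<>();
        a.put("x", 5.0);
        a.put("y", 3.0);
        Map<String, Double> b = new HashMap<>();
        b.put("y", 4.0);
        b.put("z", 1.0);
        System.out.println(merge(a, b, 2));
    }
}
```

If every incoming value for a skewed key triggers a regroup-and-sort over a large counter set, the total cost grows with both the number of values and the counter capacity, which is one plausible reading of the slowdown in the logs.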

I do understand that reducing the target number of TopN records to 100 would
result in lower resource usage, yet topn(500,4) is the default for a TopN metric
when created via the Web UI.
The data set we are using is not even considered medium-sized, so it would
be highly desirable to understand if and how TopN computation can be scaled and
better balanced when the data are known to be skewed.
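
The sizes quoted in the reply below can be reproduced with simple arithmetic. This sketch uses only the two multipliers stated there (50x retained per cell, 10x more during a merge). The class, constant, and method names are hypothetical and are not Kylin identifiers, and the multipliers should be treated as assumptions for other Kylin versions:

```java
public class TopNCapacity {
    // Multipliers as quoted in the reply below; assumptions, not read from Kylin source.
    static final int RETAINED_FACTOR = 50;
    static final int MERGE_FACTOR = 10;

    /** Elements kept per cell for a topn(n, ...) measure. */
    static int retainedPerCell(int n) {
        return n * RETAINED_FACTOR;
    }

    /** Elements held temporarily while merging. */
    static int mergeWorkingSet(int n) {
        return retainedPerCell(n) * MERGE_FACTOR;
    }

    public static void main(String[] args) {
        for (int n : new int[] {500, 100}) {
            System.out.printf("topn(%d): retained=%d, merge working set=%d%n",
                    n, retainedPerCell(n), mergeWorkingSet(n));
        }
    }
}
```

For topn(500) this gives 25,000 retained elements and a 250,000-element working set; for topn(100) it gives 5,000 and 50,000.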

Thank you,
Vsevolod.

On 2019/07/23 14:36:14, ShaoFeng Shi <[email protected]> wrote:
> Hi Vsevolod,
>
> What's the Kylin version you're running? I didn't find the log of "Total
> number of values processed" in CuboidReducer; is it the size of the
> "values" iterator? If so, I think it is very likely related to the
> "TopNCounter.merge()" method.
>
> I see you are using topn(500,4), which will actually keep the top 500*50 =
> 25,000 elements at most in each cell (measure). During the merge, Kylin will
> use 10X more space to hold the elements (which will be 250,000) to reduce
> the error. Each time a TopN is merged, there is re-grouping and
> sorting, which is somewhat CPU intensive. The aggregation is also done
> sequentially. You also mentioned that there is prominent data skew, so for
> some values the TopN cell might be far less than full, while for others it
> can be full, so the processing time differs.
>
> Usually we use a smaller N in the TopN measure. You can try a smaller
> number like 100 (internally Kylin will keep the top 5,000) and then try again.
>
> If you haven't read this blog post, I recommend it for some background:
> https://kylin.apache.org/blog/2016/03/19/approximate-topn-measure/
>
> Best regards,
>
> Shaofeng Shi 史少锋
> Apache Kylin PMC
> Email: [email protected]
>
> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> Join Kylin user mail group: [email protected]
> Join Kylin dev mail group: [email protected]
>
> Ostapenko, Vsevolod <[email protected]> wrote on Tue, Jul 23, 2019 at 2:13 AM:
>
> > Greetings,
> >
> > We are computing a cube with a bunch of TopN metrics on a moderate-size
> > data set that has prominent data skew on two dimensions.
> >
> > TopN metrics are defined using the default settings, topn(500,4).
> >
> > When the count of records processed by CuboidReducer.doReduce() reaches a
> > certain threshold, the time to compute aggregate metrics seems to grow
> > exponentially.
> >
> > Here are some excerpts from cube build MR job logs, with some additional
> > logging added to see the count of values processed.
> >
> > Note that 8K input records are processed just fine, but 65K input records
> > resulted in a major jump in processing time.
> >
> > …
> >
> > 2019-07-19 12:04:19,478 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do setup, available memory: 7700m
> > 2019-07-19 12:04:19,479 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Accepting Reducer Key with ordinal: 1
> > 2019-07-19 12:04:19,479 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do reduce, available memory: 7700m
> > 2019-07-19 12:04:19,479 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Handling value with ordinal (This is not KV number!): 1
> > 2019-07-19 12:04:24,457 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 2377
> > 2019-07-19 12:04:24,511 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 2425
> > 2019-07-19 12:04:24,511 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 2432
> > 2019-07-19 12:04:24,512 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 2462
> > 2019-07-19 12:04:24,536 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 2485
> > 2019-07-19 12:04:24,537 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 2486
> > 2019-07-19 12:04:24,537 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 2491
> > 2019-07-19 12:04:24,537 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 2518
> > 2019-07-19 12:04:24,537 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 2520
> > 2019-07-19 12:04:59,430 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 8287
> > 2019-07-19 12:04:59,468 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do cleanup, available memory: 7382m
> > 2019-07-19 12:04:59,468 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Total rows: 10
> >
> > …
> >
> > 2019-07-19 12:05:01,639 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do setup, available memory: 7254m
> > 2019-07-19 12:05:01,639 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Accepting Reducer Key with ordinal: 1
> > 2019-07-19 12:05:01,640 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do reduce, available memory: 7254m
> > 2019-07-19 12:05:01,640 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Handling value with ordinal (This is not KV number!): 1
> > 2019-07-19 12:05:01,640 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 1
> > 2019-07-19 12:05:01,682 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 385
> > 2019-07-19 12:05:01,684 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 386
> > 2019-07-19 12:05:01,684 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 391
> > 2019-07-19 12:05:01,684 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 426
> > 2019-07-19 12:05:01,685 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 427
> > 2019-07-19 12:05:01,685 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 429
> > 2019-07-19 12:33:14,119 INFO [main] org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values processed: 65997
> > 2019-07-19 12:33:14,348 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do cleanup, available memory: 7184m
> > 2019-07-19 12:33:14,348 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Total rows: 8
> >
> > …
> >
> > There is a breaking point after which cuboid aggregations never
> > complete and time out even after waiting for 3+ hours.
> >
> > Adding more memory to the mapper (via mapreduce.map.memory.mb/
> > mapreduce.map.java.opts in kylin_job_conf.xml) doesn’t seem to make much
> > difference.
> >
> > There doesn’t seem to be any documented way to combat data skew during
> > the cuboid build step.
> >
> > Any suggestions on how to deal with the above-mentioned issue would be
> > greatly appreciated.
> >
> > Thanks,
> >
> > Seva.
> >
> > ------------------------------
> > Notice: This e-mail together with any attachments may contain information
> > of Ribbon Communications Inc. that is confidential and/or proprietary for
> > the sole use of the intended recipient. Any review, disclosure, reliance or
> > distribution by others or forwarding without express permission is strictly
> > prohibited. If you are not the intended recipient, please notify the sender
> > immediately and then delete all copies, including any attachments.
> > ------------------------------
> >
>

Vsevolod Ostapenko
Principal Data Architect, Software Engineering
4 Technology Park Drive | Westford, MA 01886 USA
office: +1.978.577.6829 | mobile: +1.978.394.3363
<https://ribboncommunications.com/>


