Re: How to estimate resource cost according to data scale?

ShaoFeng Shi Wed, 15 Nov 2017 21:55:44 -0800

Hi Chase,

You raised a very good point about a forecast model for cube building.
So far I don't know whether anyone has made a study of it. Maybe you're
the first :-)
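
As a possible starting point for such a study, here is a minimal sketch of a
purely empirical approach: record (data size, build time) pairs from your own
builds, fit a power law minutes = a * size^b in log-log space, and
extrapolate. This is not Kylin's algorithm and says nothing about its
internals. The function names are hypothetical, the sample numbers are
placeholders rather than real measurements, and the model assumes a fixed
cluster size and an unchanged data distribution.

```python
import math


def fit_power_law(sizes, minutes):
    """Least-squares fit of minutes = a * size**b, done in log-log space.

    sizes and minutes are equal-length sequences of positive numbers
    taken from past builds on the same cluster.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(m) for m in minutes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # scaling exponent
    a = math.exp(mean_y - b * mean_x)   # time at size == 1
    return a, b


def predict_minutes(a, b, size):
    """Extrapolate the fitted curve to a new data size."""
    return a * size ** b


if __name__ == "__main__":
    # Placeholder observations (relative data size, build minutes).
    # Replace them with measurements from your own cluster.
    observed_sizes = [1, 2, 4, 8]
    observed_minutes = [2.0, 4.0, 8.0, 16.0]
    a, b = fit_power_law(observed_sizes, observed_minutes)
    print("exponent b =", round(b, 3))
    print("predicted minutes at 10x:", round(predict_minutes(a, b, 10), 1))
```

An exponent b close to 1 would suggest roughly linear growth of build time
with data size; repeating the measurements with different cluster sizes
would be needed before drawing any conclusion about adding machines.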


About the cubing algorithm, there are some publications on it, including
papers and blog posts; besides, Kylin's source code is open, so you can
read it.

Here are some articles you can take as references:

https://kylin.apache.org/blog/2015/08/15/fast-cubing/
https://blog.bcmeng.com/post/kylin-cube.html
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.9109&rep=rep1&type=pdf

You are welcome to share your findings and experience with the Kylin
community!


2017-11-16 11:08 GMT+08:00 Chase Zhang <chase.zh...@striking.ly>:

> Thanks for your answer.
>
> Auto scaling sounds good, but it does not quite fit our needs, for the
> following reasons:
>
> 1. EMR's auto-scaling does not seem to have any rule based on task running
> time, as the cluster may not be aware of it from Kylin. But the build
> time has a hard limit in our use case.
> 2. We want to save money by replacing some instances with the so-called
> Reserved Instances provided by AWS, which cost less but have to be planned
> for the long run, as their billing model is not as flexible as that of
> normal instances.
>
> We're wondering if the Kylin dev team or the community has done any related
> study. For example, we already have a running task which finishes in 10
> minutes. What will the build time be if the data scale grows 10 times
> without changes in the data's distribution? And if we add more machines,
> how much will the build time be reduced?
>
> We think this estimation is very valuable because it can guide our
> resource allocation and help us save money. :p
>
> Another alternative might be an article providing a more detailed
> description of the underlying algorithm used by Kylin, as currently it's
> more like a black box to us, so we cannot foresee how Kylin will react
> if one variable changes.
>
> On 16 Nov 2017, 10:46 AM +0800, ShaoFeng Shi <shaofeng...@apache.org>,
> wrote:
>
> Hi Chase,
>
> I see your Hadoop is AWS EMR; did you try EMR's auto-scaling rules? Kylin
> builds the cube on Hadoop in parallel; if a big data set comes, Hadoop
> will start more tasks than normal; if there are many pending tasks, EMR can
> detect this and then add new task nodes. This should help to improve the
> overall building performance. But this may not be as efficient as you
> expect (within 20 minutes).
>
> Is it possible to forecast that a big data set will come and then call the
> AWS API to scale out the cluster? Besides, what's your build engine, MR or
> Spark? You can switch to Spark to further reduce the building time.
>
>
> 2017-11-14 16:29 GMT+08:00 Chase Zhang <chase.zh...@striking.ly>:
>
>> Hi all,
>>
>> This is Chase from Strikingly. Recently we ran into a problem while
>> using Apache Kylin. Here is the description. We hope someone here can
>> give some suggestions :)
>>
>> The problem is about estimating the resource and time cost of one cube
>> build in proportion to data scale.
>>
>> Currently we have a task which is triggered once per hour, and the cube
>> build takes 7-10 minutes on average. Given our business growth, we need
>> to plan to scale up our data platform in case the build time becomes
>> too long.
>>
>> Thus, we're wondering if there is a good way to forecast the resources
>> required to keep the same task's build time under 20 minutes if the data
>> scale grows by, for example, 100 times. As we are not familiar with the
>> underlying algorithm of Kylin, we're not sure how Kylin will actually
>> perform on our dataset.
>>
>> Do the development team and other users in the community have any
>> experience or suggestions for this? Are there any articles on this
>> specific problem?
>>
>>
>>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>


-- 
Best regards,

Shaofeng Shi 史少锋
