Hi Chase,

You raise a very good point about a forecast model for cube building. So far I am not aware of anyone who has studied it; maybe you're the first :-)

About the cubing algorithm: there are several publications on it, including papers and blog posts. Besides, Kylin's source code is open, so you can read it. Here are some articles you can use as references:

https://kylin.apache.org/blog/2015/08/15/fast-cubing/
https://blog.bcmeng.com/post/kylin-cube.html
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.9109&rep=rep1&type=pdf

You are welcome to share your findings and experience with the Kylin community!

2017-11-16 11:08 GMT+08:00 Chase Zhang <chase.zh...@striking.ly>:

> Thanks for your answer.
>
> Auto scaling sounds good, but it does not quite fit our needs, for the
> following reasons:
>
> 1. EMR's auto scaling seems to have no rule based on task running time,
> since the cluster may not be aware of it from Kylin. But the build time
> has a hard limit in our use case.
> 2. We want to save money by replacing some instances with the Reserved
> Instances provided by AWS, which cost less but have to be planned for the
> long run, as their billing model is not as flexible as that of normal
> instances.
>
> We're wondering whether the Kylin dev team or the community has done any
> related study. For example, suppose we already have a running task that
> takes 10 minutes. What will the build time be if the data volume grows 10
> times with no change in the data's distribution? And if we add more
> machines, how much will the build time be reduced?
>
> We think this estimate is very valuable because it can guide our
> resource allocation and save money. :p
>
> An alternative might be an article giving a more detailed description of
> Kylin's underlying algorithm, as currently it is more like a black box to
> us, so we cannot foresee how Kylin will react if one variable changes.
>
> On 16 Nov 2017, 10:46 AM +0800, ShaoFeng Shi <shaofeng...@apache.org>,
> wrote:
>
> Hi Chase,
>
> I see your Hadoop is AWS EMR; did you try EMR's auto-scaling rules? Kylin
> builds the cube on Hadoop in parallel. If a big data set comes in, Hadoop
> will start more tasks than normal; if there are many pending tasks, EMR can
> detect this and add new task nodes. This should help improve the overall
> build performance, but it may not be as efficient as you expect (within 20
> minutes).
>
> Is it possible to forecast that a big data set will come and then call the
> AWS API to scale out the cluster? Besides, what's your build engine, MR or
> Spark? You can switch to Spark to further reduce the build time.
>
>
> 2017-11-14 16:29 GMT+08:00 Chase Zhang <chase.zh...@striking.ly>:
>
>> Hi all,
>>
>> This is Chase from Strikingly. We have recently run into a problem
>> while using Apache Kylin. Here is the description; we hope someone here
>> can give some suggestions :)
>>
>> The problem is how to estimate the resource and time cost of one cube
>> build in proportion to the data scale.
>>
>> Currently we have a task that is triggered once per hour, and the cube
>> build takes 7-10 minutes on average. As our business grows, we need to
>> plan a scale-up of our data platform in case the build time becomes too
>> long.
>>
>> Thus, we're wondering if there is a good way to forecast the resources
>> required to keep the same task's build time under 20 minutes if the data
>> scale grows by, for example, 100 times. As we are not familiar with the
>> underlying algorithm of Kylin, we're not sure how Kylin will actually
>> perform on our dataset.
>>
>> Do the development team and other users in the community have any
>> experience or suggestions for this? Are there any articles on this
>> specific problem?
>>
>>
>>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>

--
Best regards,

Shaofeng Shi 史少锋
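
The estimation question raised in this thread (how the build time changes when the data grows 10 or 100 times, and how many machines are needed to stay under 20 minutes) can be illustrated with a back-of-envelope calculation. The sketch below is not Kylin's cost model and is not based on any measurement. It assumes that total work grows linearly with data volume, and that a fixed fraction of the build (`serial_fraction`) does not speed up when nodes are added, in the style of Amdahl's law. The function names, `base_nodes` and `serial_fraction` are hypothetical and would have to be calibrated from real build logs.

```python
import math


def estimate_build_minutes(base_minutes, data_scale, base_nodes, new_nodes,
                           serial_fraction=0.1):
    """Rough build-time estimate under the assumptions stated above.

    base_minutes    -- measured build time today
    data_scale      -- factor by which the data volume grows (10 = ten times)
    base_nodes      -- worker nodes used for the measured build (hypothetical)
    new_nodes       -- worker nodes after scaling out
    serial_fraction -- assumed share of the build that does not parallelize
    """
    if min(base_minutes, data_scale, base_nodes, new_nodes) <= 0:
        raise ValueError("times, scale and node counts must be positive")
    if not 0 <= serial_fraction <= 1:
        raise ValueError("serial_fraction must be between 0 and 1")
    serial = base_minutes * serial_fraction
    parallel = base_minutes * (1 - serial_fraction)
    # Assumption: all work grows linearly with data volume, and only the
    # parallel part shrinks in proportion to the number of nodes.
    return data_scale * (serial + parallel * base_nodes / new_nodes)


def nodes_needed(base_minutes, data_scale, base_nodes, target_minutes,
                 serial_fraction=0.1):
    """Smallest node count that meets target_minutes, or None if unreachable."""
    serial = base_minutes * serial_fraction
    parallel = base_minutes * (1 - serial_fraction)
    # Time budget left for the parallel part, per unit of data scale.
    budget = target_minutes / data_scale - serial
    if budget <= 0:
        # The non-parallel part alone already exceeds the target.
        return None
    return max(1, math.ceil(parallel * base_nodes / budget))


if __name__ == "__main__":
    # Hypothetical inputs: a 10-minute build on 4 nodes, 20-minute target.
    for scale in (10, 100):
        print(scale, nodes_needed(10, scale, 4, 20))
```

Under these assumptions, any non-parallel share of the build puts a ceiling on how much data growth extra machines can absorb, which is why `nodes_needed` can return `None`. Whether a real Kylin build behaves this way depends on the cube design and the build engine, so the model should be checked against measured builds at two or three data sizes before it is used for planning.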