Re: New document: "How to optimize cube build"

ShaoFeng Shi Mon, 06 Feb 2017 05:44:43 -0800

Ajay, thanks for your feedback.
For question 1, the code has been merged into the master branch; the next 
release will be 2.0, and a beta release will be published soon.
For question 2, yes, your understanding is correct: an N-dimension FULL cube 
will have 2^N - 1 cuboids. But if you use techniques such as hierarchy, joint, 
or separating the dimensions into multiple aggregation groups, it becomes a 
"partial" cube, which means some cuboids are pruned.
If a query uses dimensions across aggregation groups, then only the base cuboid 
can fulfill it; Kylin has to do the post-aggregation from the base cuboid, and 
performance will be degraded. Please check whether this is the case on your 
side.
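
To make the pruning and the cross-group fallback concrete, here is a minimal 
sketch in Python. It is a simplified model, not Kylin's actual planner: it 
assumes every combination inside an aggregation group is kept plus the base 
cuboid, and it ignores hierarchy, joint and mandatory rules. The dimension 
names A..F and the helper functions are hypothetical.

```python
from itertools import combinations


def non_empty_subsets(dims):
    """All non-empty dimension combinations, each as a frozenset."""
    return {
        frozenset(combo)
        for size in range(1, len(dims) + 1)
        for combo in combinations(dims, size)
    }


def partial_cube(all_dims, groups):
    """Cuboids kept under the simplified model: per-group combinations + base."""
    cuboids = {frozenset(all_dims)}  # the base cuboid is always kept
    for group in groups:
        cuboids |= non_empty_subsets(group)
    return cuboids


def cuboid_for_query(query_dims, cuboids):
    """Smallest kept cuboid that contains every dimension the query uses."""
    wanted = frozenset(query_dims)
    candidates = [c for c in cuboids if wanted <= c]
    return min(candidates, key=len)


all_dims = ["A", "B", "C", "D", "E", "F"]   # hypothetical dimensions
groups = [["A", "B", "C"], ["D", "E", "F"]]  # two hypothetical aggregation groups

full = non_empty_subsets(all_dims)
partial = partial_cube(all_dims, groups)
print(len(full))      # 63 cuboids in the full cube (2^6 - 1)
print(len(partial))   # 15 cuboids in the partial cube (7 + 7 + base)

# A query inside one group is served by a small cuboid.
print(sorted(cuboid_for_query({"A", "B"}, partial)))  # ['A', 'B']
# A query across groups can only be served by the base cuboid.
print(sorted(cuboid_for_query({"A", "D"}, partial)))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

In this model the cross-group query has to aggregate from the largest cuboid, 
which is the slowdown described above.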

On Mon, Feb 6, 2017 at 2:05 PM +0900, "Ajay Chitre" <[email protected]> 
wrote:

Thanks for writing this document. It's very helpful. I have the following 
questions:

1) Doc says... "Kylin will build dictionaries in memory (in next version this 
will be moved to MR)".

Which version can we expect this in? For large cubes this process takes a long 
time on the local machine. We really need to move this to the Hadoop cluster. 
In fact, it would be great if we could have an option to run this under 
Spark :-)

2) About the "Build N-Dimension Cuboid" step.

Does Kylin build ALL Cuboids? My understanding is:

Total no. of Cuboids = (2 to the power of # of dimensions) - 1

Correct?

So if there are 7 dimensions, there will be 127 Cuboids, right? Does Kylin 
create ALL of them?

I was under the impression that, after some point, Kylin will just get measures 
from the Base Cuboid; instead of building all of them. Please explain.
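
As a quick check of the arithmetic above, a minimal Python sketch that 
enumerates every non-empty combination of 7 dimensions. The dimension names 
are placeholders, and this only counts combinations; it says nothing about 
which cuboids Kylin actually builds.

```python
from itertools import combinations


def full_cube_cuboids(dimensions):
    """Return every non-empty combination of dimensions, one per cuboid."""
    cuboids = []
    for size in range(1, len(dimensions) + 1):
        cuboids.extend(combinations(dimensions, size))
    return cuboids


dims = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"]  # placeholder names
cuboids = full_cube_cuboids(dims)

print(len(cuboids))                         # 127
print(len(cuboids) == 2 ** len(dims) - 1)   # True
```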

Thanks.



On Sat, Feb 4, 2017 at 2:19 AM, Li Yang <[email protected]> wrote:
Feel free to update the document with different opinions. :-)

On Thu, Jan 26, 2017 at 11:34 AM, ShaoFeng Shi <[email protected]> wrote:
Hi Alberto,
Thanks for your comments! In many cases the data is imported into Hadoop in 
T+1 mode. Especially when each day's data is tens of GB, it is reasonable to 
partition the Hive table by date. The question is whether it is worth keeping 
a long history of data in Hive. Usually users keep only a couple of months' 
data in Hive; if the number of partitions exceeds Hive's threshold, they can 
easily remove the oldest partitions or move them to another table. I think 
that is common practice with Hive, and it is very good to know that Hive 2.0 
will solve this. 
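
The "remove the oldest partitions" practice can be sketched as follows. This 
is only an illustration in Python: the table name `flights`, the 60-day 
retention window and both helper functions are assumptions, and the partition 
column `flightdate` is borrowed from the FLIGHTDATE column discussed in this 
thread. The script only generates HiveQL `ALTER TABLE ... DROP PARTITION` 
statements as text; it does not connect to Hive.

```python
from datetime import date, timedelta


def partitions_to_drop(partition_dates, keep_days, today):
    """Return, oldest first, the partition dates older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    return sorted(d for d in partition_dates if d < cutoff)


def drop_statements(table, column, dates):
    """Build one HiveQL DROP PARTITION statement per date."""
    return [
        f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({column}='{d.isoformat()}');"
        for d in dates
    ]


today = date(2017, 1, 26)
# Hypothetical table with one partition per day for the last 100 days.
existing = [today - timedelta(days=i) for i in range(100)]

old = partitions_to_drop(existing, keep_days=60, today=today)
for statement in drop_statements("flights", "flightdate", old):
    print(statement)
```

Partitions inside the retention window are left untouched; moving old data to 
another table instead of dropping it would need a different statement.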
2017-01-25 17:10 GMT+08:00 Alberto Ramón <[email protected]>:
Be careful about partitioning by "FLIGHTDATE".

From https://github.com/albertoRamon/Kylin/tree/master/KylinPerformance

"Option 1: Use id_date as partition column on Hive table. This have a big
 problem: the Hive metastore is meant for few hundred of partitions not 
thousand (Hive 9452 there is an idea to solve this isn’t in progress)"

Hive 2.0 will include a preview (for testing only) of a solution to this.

2017-01-25 9:46 GMT+01:00 ShaoFeng Shi <[email protected]>:
Hello,
A new document has been added on cube build practices. Any suggestions or 
comments are welcome. We can update the doc later based on feedback.
Here is the link: https://kylin.apache.org/docs16/howto/howto_optimize_build.html

-- 
Best regards,
Shaofeng Shi 史少锋






