Re: New document: "How to optimize cube build"

Ajay Chitre Sun, 12 Feb 2017 23:53:30 -0800

In this case, if a user runs a query with a WHERE clause that has 2
dimensions from the "aggregation group" & 2 dimensions from the "other 5
dimensions", Kylin will compute the results from the base cuboid, correct?
Or would it error out?
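
For reference, the counts discussed in this thread can be checked with a
small Python sketch. It assumes a deliberately simplified model: a single
aggregation group with no mandatory, hierarchy, or joint dimensions, and
cuboid selection by "smallest cuboid that covers the query dimensions". The
dimension names (d1..d15) and the helper functions are hypothetical and are
not part of Kylin's API; the fallback to the base cuboid follows the behavior
described in the quoted replies below for queries that span aggregation
groups.

```python
from itertools import combinations

# Hypothetical dimension names: 15 dimensions, 10 of them in one aggregation group.
ALL_DIMS = [f"d{i}" for i in range(1, 16)]
AGG_GROUP = ALL_DIMS[:10]


def cuboids_for_group(group_dims):
    """Return every non-empty combination of one group's dimensions."""
    dims = sorted(group_dims)
    return {
        frozenset(combo)
        for size in range(1, len(dims) + 1)
        for combo in combinations(dims, size)
    }


def build_cuboids(all_dims, agg_groups):
    """Base cuboid (all dimensions) plus the combinations of each group."""
    cuboids = {frozenset(all_dims)}
    for group in agg_groups:
        cuboids |= cuboids_for_group(group)
    return cuboids


def pick_cuboid(query_dims, cuboids):
    """Pick the smallest cuboid that contains all queried dimensions."""
    wanted = frozenset(query_dims)
    candidates = [c for c in cuboids if wanted <= c]
    if not candidates:
        raise ValueError("query uses a dimension that is not in the cube")
    return min(candidates, key=len)


cuboids = build_cuboids(ALL_DIMS, [AGG_GROUP])
print(len(cuboids))  # 1 + (2**10 - 1) = 1024

# 2 dimensions from the group + 2 from the other 5: only the base cuboid fits.
mixed = pick_cuboid(["d1", "d2", "d11", "d12"], cuboids)
print(len(mixed))  # 15, i.e. the base cuboid

# A query that stays inside the group hits a small pre-built cuboid.
inside = pick_cuboid(["d1", "d2"], cuboids)
print(len(inside))  # 2
```

Under these assumptions the mixed query does not fail; it is answered from
the 15-dimension base cuboid, with the remaining aggregation done at query
time.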


I can test it myself but I am being lazy -:) Looking for a quick answer
from the experts. Thanks for your help.

On Sun, Feb 12, 2017 at 3:04 AM, ShaoFeng Shi <[email protected]>
wrote:

> Ajay,
>
> There is no such setting, but the "aggregation group" offers something
> similar. Say the cube has 15 dimensions in total, but the aggregation group
> includes only 10 of them; Kylin will then build 1 (base cuboid) +
> 2^10 - 1 (combinations of the 10 dimensions) = 1,024 cuboids. This way the
> other 5 dimensions appear only in the base cuboid.
>
> 2017-02-09 9:20 GMT+08:00 Ajay Chitre <[email protected]>:
>
>> My question was a general question. Not any specific issue that I am
>> encountering -:)
>>
>> I understand that we can prune by using hierarchical dimensions,
>> aggregation groups, etc. But what if these types of aggregations are not
>> possible?
>>
>> Let's say I have 15 dimensions (& I can't prune any). Would Kylin build
>> 32,767 cuboids, or is there a property to say... "If the no. of dimensions
>> is over X, stop building more cuboids. Get from the base"? (Knowing this
>> will slow down the queries.)
>>
>> Please let me know. Thanks.
>>
>>
>> On Mon, Feb 6, 2017 at 5:43 AM, ShaoFeng Shi <[email protected]>
>> wrote:
>>
>>> Ajay, thanks for your feedback;
>>>
>>> For question 1, the code has been merged into the master branch; the next
>>> release will be 2.0, and a beta release will be published soon.
>>>
>>> For question 2, yes, your understanding is correct: an N-dimension full cube
>>> has 2^N - 1 cuboids. But if you use techniques such as hierarchy or joint
>>> dimensions, or split the dimensions into multiple groups, it becomes a
>>> "partial" cube, which means some cuboids are pruned.
>>>
>>> If a query uses dimensions across aggregation groups, then only the base
>>> cuboid can fulfill it; Kylin has to do the post-aggregation from the base
>>> cuboid, and performance will be degraded. Please check whether this is the
>>> case on your side.
>>>
>>>
>>> On Mon, Feb 6, 2017 at 2:05 PM +0900, "Ajay Chitre" <[email protected]> wrote:
>>>
>>>> Thanks for writing this document. It's very helpful. I have the following
>>>> questions:
>>>>
>>>> 1) Doc says... "Kylin will build dictionaries in memory (in next
>>>> version this will be moved to MR)".
>>>>
>>>> Which version can we expect this in? For large cubes this process takes
>>>> a long time on a local machine. We really need to move this to the Hadoop
>>>> cluster. In fact, it would be great if we could have an option to run this
>>>> under Spark -:)
>>>>
>>>> 2) About the "Build N-Dimension Cuboid" step.
>>>>
>>>> Does Kylin build ALL Cuboids? My understanding is:
>>>>
>>>> Total no. of Cuboids = (2 to the power of # of dimensions) - 1
>>>>
>>>> Correct?
>>>>
>>>> So if there are 7 dimensions, there will be 127 Cuboids, right? Does
>>>> Kylin create ALL of them?
>>>>
>>>> I was under the impression that, after some point, Kylin will just get
>>>> measures from the Base Cuboid; instead of building all of them. Please
>>>> explain.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>> On Sat, Feb 4, 2017 at 2:19 AM, Li Yang <[email protected]> wrote:
>>>>
>>>>> Feel free to update the document with different opinions. :-)
>>>>>
>>>>> On Thu, Jan 26, 2017 at 11:34 AM, ShaoFeng Shi <[email protected]> wrote:
>>>>>
>>>>>> Hi Alberto,
>>>>>>
>>>>>> Thanks for your comments! In many cases the data is imported to
>>>>>> Hadoop in T+1 mode. Especially when each day's data is tens of GB, it is
>>>>>> reasonable to partition the Hive table by date. The question is whether it
>>>>>> is worth keeping a long history of data in Hive. Usually users keep only a
>>>>>> couple of months' data in Hive. If the number of partitions exceeds Hive's
>>>>>> threshold, they can easily remove the oldest partitions or move them to
>>>>>> another table. That is a common Hive practice, I think, and it is very good
>>>>>> to know that Hive 2.0 will solve this.
>>>>>>
>>>>>> 2017-01-25 17:10 GMT+08:00 Alberto Ramón <[email protected]>:
>>>>>>
>>>>>>> Be careful about partitioning by "FLIGHTDATE"
>>>>>>>
>>>>>>> From https://github.com/albertoRamon/Kylin/tree/master/KylinPerformance
>>>>>>>
>>>>>>> *"Option 1: Use id_date as partition column on Hive table. This have
>>>>>>> a big problem: the Hive metastore is meant for few hundred of partitions
>>>>>>> not thousand (Hive 9452 there is an idea to solve this isn’t in progress)"*
>>>>>>>
>>>>>>> Hive 2.0 will include a preview (for testing only) that addresses this
>>>>>>>
>>>>>>> 2017-01-25 9:46 GMT+01:00 ShaoFeng Shi <[email protected]>:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> A new document on cube build practices has been added. Any
>>>>>>>> suggestions or comments are welcome. We can update the doc later
>>>>>>>> based on the feedback.
>>>>>>>>
>>>>>>>> Here is the link:
>>>>>>>> https://kylin.apache.org/docs16/howto/howto_optimize_build.html
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Shaofeng Shi 史少锋
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>>
>>>>>> Shaofeng Shi 史少锋
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>
