Re: New document: "How to optimize cube build"

ShaoFeng Shi Sun, 12 Feb 2017 03:06:08 -0800

Ajay,

There is no such setting, but the "aggregation group" offers something
similar. Say the cube has 15 dimensions in total, but in the agg group you
pick only 10 of them; Kylin will then build 1 (base cuboid) + 2^10 - 1
(combinations of the 10 dimensions) = 1,024 cuboids. This way the other
5 dimensions appear only in the base cuboid.
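
For illustration, the arithmetic above can be sketched in a few lines of
Python. This is only a counting sketch, not Kylin code: the function names
are made up for this example, and it assumes a single aggregation group with
no hierarchy, joint or mandatory dimensions.

```python
def full_cube_cuboids(n_dims: int) -> int:
    """Cuboids in a full cube: every non-empty dimension combination (2^N - 1)."""
    if n_dims < 1:
        raise ValueError("a cube needs at least one dimension")
    return 2 ** n_dims - 1


def single_agg_group_cuboids(n_dims: int, group_dims: int) -> int:
    """Cuboids when one aggregation group picks `group_dims` of `n_dims` dimensions.

    Hypothetical helper: assumes exactly one group and no hierarchy,
    joint or mandatory rules inside it.
    """
    if not 0 < group_dims <= n_dims:
        raise ValueError("group_dims must be between 1 and n_dims")
    group_combinations = 2 ** group_dims - 1
    if group_dims == n_dims:
        # The group already contains the base cuboid (all dimensions).
        return group_combinations
    # Otherwise the base cuboid is built in addition to the group's combinations.
    return group_combinations + 1


if __name__ == "__main__":
    print(full_cube_cuboids(15))             # full cube, 15 dimensions
    print(single_agg_group_cuboids(15, 10))  # 10 of 15 dimensions in the group
```

With 15 dimensions and a 10-dimension group this gives 1,024 cuboids,
against 32,767 for the full cube.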


2017-02-09 9:20 GMT+08:00 Ajay Chitre <[email protected]>:

> My question was a general one, not a specific issue that I am
> encountering -:)
>
> I understand that we can prune by using hierarchical dimensions,
> aggregation groups, etc. But what if these techniques are not
> possible?
>
> Let's say I have 15 dimensions (and I can't prune any). Would Kylin build
> 32,767 cuboids, or is there a property to say "If the number of dimensions
> is over X, stop building more cuboids and read from the base cuboid"?
> (Knowing this will slow down the queries.)
>
> Please let me know. Thanks.
>
>
> On Mon, Feb 6, 2017 at 5:43 AM, ShaoFeng Shi <[email protected]>
> wrote:
>
>> Ajay, thanks for your feedback;
>>
>> For question 1, the code has been merged into the master branch. The next
>> release will be 2.0, and a beta release will be published soon.
>>
>> For question 2, yes, your understanding is correct: an N-dimension FULL
>> cube will have 2^N - 1 cuboids. But if you adopt techniques such as
>> hierarchy, joint, or separating dimensions into multiple groups, it will be
>> a "partial" cube, which means some cuboids will be pruned.
>>
>> If a query uses dimensions across aggregation groups, then only the base
>> cuboid can fulfill it. Kylin has to do the post-aggregation from the base
>> cuboid, and the performance will be degraded. Please check whether this is
>> the case on your side.
>>
>> On Mon, Feb 6, 2017 at 2:05 PM +0900, "Ajay Chitre" <
>> [email protected]> wrote:
>>
>>> Thanks for writing this document. It's very helpful. I have the following
>>> questions:
>>>
>>> 1) Doc says... "Kylin will build dictionaries in memory (in next version
>>> this will be moved to MR)".
>>>
>>> Which version can we expect this in? For large cubes this process takes
>>> a long time on the local machine. We really need to move this to the
>>> Hadoop cluster. In fact, it would be great if we could have an option to
>>> run this under Spark -:)
>>>
>>> 2) About the "Build N-Dimension Cuboid" step.
>>>
>>> Does Kylin build ALL Cuboids? My understanding is:
>>>
>>> Total no. of Cuboids = (2 to the power of # of dimensions) - 1
>>>
>>> Correct?
>>>
>>> So if there are 7 dimensions, there will be 127 Cuboids, right? Does
>>> Kylin create ALL of them?
>>>
>>> I was under the impression that, after some point, Kylin will just get
>>> measures from the Base Cuboid; instead of building all of them. Please
>>> explain.
>>>
>>> Thanks.
>>>
>>>
>>>
>>> On Sat, Feb 4, 2017 at 2:19 AM, Li Yang <[email protected]> wrote:
>>>
>>>> Feel free to update the document with different opinions. :-)
>>>>
>>>> On Thu, Jan 26, 2017 at 11:34 AM, ShaoFeng Shi <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Alberto,
>>>>>
>>>>> Thanks for your comments! In many cases the data is imported to Hadoop
>>>>> in T+1 mode. Especially when each day's data is tens of GB, it is
>>>>> reasonable to partition the Hive table by date. The question is whether
>>>>> it is worth keeping a long history of data in Hive. Usually users keep
>>>>> only a couple of months' data in Hive; if the number of partitions
>>>>> exceeds the threshold in Hive, they can easily remove the oldest
>>>>> partitions or move them to another table. That is a common practice with
>>>>> Hive, I think, and it is very good to know that Hive 2.0 will solve this.
>>>>>
>>>>> 2017-01-25 17:10 GMT+08:00 Alberto Ramón <[email protected]>:
>>>>>
>>>>>> Be careful about partitioning by "FLIGHTDATE".
>>>>>>
>>>>>> From https://github.com/albertoRamon/Kylin/tree/master/KylinPerformance
>>>>>>
>>>>>> *"Option 1: Use id_date as partition column on Hive table. This has
>>>>>> a big problem: the Hive metastore is meant for a few hundred
>>>>>> partitions, not thousands (Hive 9452: there is an idea to solve this,
>>>>>> but it isn't in progress)"*
>>>>>>
>>>>>> Hive 2.0 will include a preview (for testing only) to solve this.
>>>>>>
>>>>>> 2017-01-25 9:46 GMT+01:00 ShaoFeng Shi <[email protected]>:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> A new document has been added on practices for cube build. Any
>>>>>>> suggestions or comments are welcome. We can update the doc later
>>>>>>> based on the feedback.
>>>>>>>
>>>>>>> Here is the link:
>>>>>>> https://kylin.apache.org/docs16/howto/howto_optimize_build.html
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Shaofeng Shi 史少锋
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>>
>>>>> Shaofeng Shi 史少锋
>>>>>
>>>>>
>>>>
>>>
>


-- 
Best regards,

Shaofeng Shi 史少锋
