Re: Queries For Building Cube

ShaoFeng Shi Tue, 28 Aug 2018 17:10:04 -0700

Welcome!

2018-08-28 22:55 GMT+08:00 Shrikant Bang <[email protected]>:


> Thank you, ShaoFeng, for the response!
> I have learned a lot about Kylin's internals from this mail thread.
> I appreciate your help.
>
> Regards,
> Shrikant Bang.
>
> On Tue, Aug 28, 2018 at 7:14 PM ShaoFeng Shi <[email protected]>
> wrote:
>
>> Hi Shrikant,
>>
>> Do you mean reusing the lookup table snapshot across cubes? As you know,
>> Kylin takes snapshots of lookup tables and loads them into memory at
>> query time.
>>
>> When Kylin takes a snapshot, it checks whether the lookup table has
>> changed since the last snapshot. If it has not changed, the existing
>> snapshot is reused. The detailed logic is in SnapshotManager.buildSnapshot().
>>
>> So, from this point of view, if your calendar lookup table is stable, its
>> snapshot will be reused by multiple cubes.
>>
>> Hope this helps.
>>
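The reuse described above can be sketched as follows. This is a minimal illustration only: the class SnapshotStore, its methods, and the content-hash comparison are hypothetical and are not Kylin's actual SnapshotManager code, which may detect changes differently.

```python
import hashlib


class SnapshotStore:
    """Hypothetical stand-in for a snapshot store (not Kylin's SnapshotManager)."""

    def __init__(self):
        self._by_signature = {}  # signature -> snapshot
        self.builds = 0          # how many snapshots were actually built

    @staticmethod
    def signature(table_name, rows):
        # Assumption: a content hash decides whether the table has changed.
        digest = hashlib.sha256(table_name.encode("utf-8"))
        for row in rows:
            digest.update(repr(tuple(row)).encode("utf-8"))
        return digest.hexdigest()

    def build_snapshot(self, table_name, rows):
        sig = self.signature(table_name, rows)
        existing = self._by_signature.get(sig)
        if existing is not None:
            return existing  # unchanged table: reuse the earlier snapshot
        snapshot = {"table": table_name, "signature": sig,
                    "rows": [tuple(r) for r in rows]}
        self._by_signature[sig] = snapshot
        self.builds += 1
        return snapshot


calendar = [("2018-08-27", "Mon", 35), ("2018-08-28", "Tue", 35)]
store = SnapshotStore()
snap_for_cube_a = store.build_snapshot("calendar_dimension", calendar)
snap_for_cube_b = store.build_snapshot("calendar_dimension", calendar)
```

Both cubes end up holding the same snapshot object because the table content did not change between the two builds; a changed table would produce a new snapshot.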
>> 2018-08-28 17:04 GMT+08:00 Shrikant Bang <[email protected]>:
>>
>>> Thank you, ShaoFeng, for the response!
>>>
>>> Apart from the UHC dimension, we have other dimensions that will be used
>>> by multiple cubes,
>>>
>>> e.g. calendar_dimension (date, day, week, month, quarter, etc.),
>>> which is immutable.
>>>
>>> Some of the calendar dimension's columns become part of the cube and some
>>> become derived columns.
>>>
>>> Is there any way I can cache it on the Kylin node and reuse it for every
>>> other cube? It would be a kind of global cache for all cubes under a project.
>>>
>>> Thank You,
>>> Shrikant Bang.
>>>
>>>
>>>
>>> On Tue, Aug 28, 2018 at 9:05 AM ShaoFeng Shi <[email protected]>
>>> wrote:
>>>
>>>>
>>>> Would you recommend using the "integer" type for a UHC (3+ million) dimension
>>>> and then having derived columns for the related dimensions (lookups) whose type
>>>> is not "integer"?
>>>> >> This depends on the relative cardinality of the two columns. For example,
>>>> "user_id" and "email" are close to 1:1, so this derivation is good.
>>>> But "user_id" and "sex" is not a good fit because the cardinality of "sex" is much
>>>> smaller than that of "user_id", which means a lot of post-aggregation will happen
>>>> after the derivation. Usually, we suggest a ratio of around 10:1 or less,
>>>> but this is not a fixed rule; you can choose depending on your
>>>> performance requirements.
>>>>
>>>> Does a derived column's aggregation happen on the HBase coprocessor side? Is
>>>> there any JIRA/doc I can learn from?
>>>> >> No, the derivation is calculated only on the Kylin node; it is not
>>>> pushed down, because the lookup table's snapshot is loaded only on the Kylin node.
>>>>
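To illustrate why the cardinality ratio matters, here is a rough sketch of derivation followed by post-aggregation. The function derive_and_aggregate, the sample rows, and the lookup dictionaries are all made up for illustration; this is not Kylin's query engine code.

```python
from collections import defaultdict


def derive_and_aggregate(cuboid_rows, snapshot):
    """Map each host-column key to its derived value through the lookup
    snapshot, then re-aggregate the measure per derived value.

    cuboid_rows: iterable of (host_key, measure) pairs
    snapshot:    dict mapping host_key -> derived column value
    """
    totals = defaultdict(int)
    for host_key, measure in cuboid_rows:
        totals[snapshot[host_key]] += measure  # post-aggregation step
    return dict(totals)


# Pre-aggregated rows keyed by user_id (hypothetical data).
cuboid_rows = [(1, 10), (2, 5), (3, 7), (4, 1)]

# Close to 1:1, so derivation barely collapses any rows.
user_to_email = {1: "u1@example.com", 2: "u2@example.com",
                 3: "u3@example.com", 4: "u4@example.com"}

# Many user_ids per value, so every row must be merged after derivation.
user_to_sex = {1: "F", 2: "M", 3: "F", 4: "M"}

by_email = derive_and_aggregate(cuboid_rows, user_to_email)
by_sex = derive_and_aggregate(cuboid_rows, user_to_sex)
```

With the 1:1 lookup the result has as many rows as the input, so almost no extra work is done. With the low-cardinality lookup, all input rows have to be scanned and merged into a handful of groups at query time, and that cost grows with the cardinality of the host column.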
>>>> 2018-08-27 19:00 GMT+08:00 Shrikant Bang <[email protected]>:
>>>>
>>>>> Thanks, ShaoFeng, for the response!
>>>>>
>>>>> I started with 2G of memory (the default cluster setting), and the OOM
>>>>> was resolved when the memory was increased to 4G.
>>>>>
>>>>> Would you recommend using the "integer" type for a UHC (3+ million)
>>>>> dimension and then having derived columns for the related dimensions (lookups)
>>>>> whose type is not "integer"?
>>>>>
>>>>> Does a derived column's aggregation happen on the HBase coprocessor side?
>>>>> Is there any JIRA/doc I can learn from?
>>>>>
>>>>> Please suggest.
>>>>>
>>>>> Thank You,
>>>>> Shrikant Bang
>>>>>
>>>>> On Tue, Aug 21, 2018 at 6:36 PM ShaoFeng Shi <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Shrikant,
>>>>>>
>>>>>> How much memory are you allocating to the reducer? Please consider
>>>>>> allocating more memory to the reducer, as Kylin builds the dictionary in
>>>>>> the reducers.
>>>>>>
>>>>>> You can also disable this; Kylin will then build the dictionary in its own
>>>>>> JVM. This may cause the Kylin process to run out of memory if there is an
>>>>>> ultra high cardinality (UHC) column.
>>>>>>
>>>>>> kylin.engine.mr.build-dict-in-reducer=false
>>>>>>
>>>>>>
>>>>>> Do you know the cardinality of that dimension? For a UHC column whose
>>>>>> cardinality is greater than 3 million, we don't recommend using dictionary
>>>>>> encoding. You may need to use "fixed_length" or "integer" (if the column
>>>>>> is of integer type).
>>>>>>
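The difference between the two kinds of encoding can be sketched roughly as below. The helper names build_dictionary and encode_integer are hypothetical, and the byte layout shown is only an assumption for illustration, not Kylin's actual on-disk format.

```python
def build_dictionary(values):
    """Assign dense, order-preserving integer IDs to the distinct values.

    Every distinct value has to be held in memory before IDs can be
    assigned, so the memory needed grows with the column's cardinality.
    """
    return {value: idx for idx, value in enumerate(sorted(set(values)))}


def encode_integer(value, width_bytes):
    """Encode an integer in a fixed number of bytes.

    No per-value state is kept, so memory use does not depend on how many
    distinct values the column has. (Assumed layout: big-endian, signed.)
    """
    return value.to_bytes(width_bytes, byteorder="big", signed=True)


dictionary = build_dictionary(["b", "a", "b", "c"])
encoded = encode_integer(3_000_000, 4)
decoded = int.from_bytes(encoded, byteorder="big", signed=True)
```

In this sketch the dictionary maps "a", "b", "c" to 0, 1, 2, while the integer value is written directly into 4 bytes and can be read back without any lookup structure.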
>>>>>>
>>>>>> 2018-08-16 16:50 GMT+08:00 Ashish Singhi <[email protected]>:
>>>>>>
>>>>>>> Hi Shrikant,
>>>>>>>
>>>>>>> Refer to http://kylin.apache.org/blog/2015/08/13/kylin-dictionary/
>>>>>>> You might find it useful.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Ashish
>>>>>>>
>>>>>>> On Thu, Aug 16, 2018 at 10:33 AM, Shrikant Bang <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Thank you, ShaoFeng & Billy, for the responses.
>>>>>>>>
>>>>>>>> I was able to set hierarchies in the dimensions.
>>>>>>>>
>>>>>>>> While building the cube, the "fact distinct column" step is failing in
>>>>>>>> a reducer with an out-of-memory exception.
>>>>>>>>
>>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>> at java.util.IdentityHashMap.resize(IdentityHashMap.java:471)
>>>>>>>> at java.util.IdentityHashMap.put(IdentityHashMap.java:440)
>>>>>>>> at org.apache.kylin.dict.TrieDictionaryBuilder.buildTrieBytes(TrieDictionaryBuilder.java:476)
>>>>>>>> at org.apache.kylin.dict.TrieDictionaryBuilder.build(TrieDictionaryBuilder.java:418)
>>>>>>>> at org.apache.kylin.dict.TrieDictionaryForestBuilder.build(TrieDictionaryForestBuilder.java:109)
>>>>>>>> at org.apache.kylin.dict.DictionaryGenerator$StringTrieDictForestBuilder.build(DictionaryGenerator.java:220)
>>>>>>>> at org.apache.kylin.engine.mr.steps.FactDistinctColumnsReducer.doCleanup(FactDistinctColumnsReducer.java:216)
>>>>>>>> at org.apache.kylin.engine.mr.KylinReducer.cleanup(KylinReducer.java:103)
>>>>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
>>>>>>>> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>>>>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>>>>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>>>>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>>>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>>>>>>>>
>>>>>>>>
>>>>>>>> I tried debugging and found that the dictionary is built
>>>>>>>> in the reducer's cleanup method.
>>>>>>>>
>>>>>>>> I am curious to learn the internals. Could you please help me with the following:
>>>>>>>>
>>>>>>>>   1.  Any pointer/reference/JIRA for understanding how the trie
>>>>>>>> (dictionary) of a dimension's values is used in the next steps?
>>>>>>>>
>>>>>>>>   2.  Any best practices/references for tuning the "fact distinct column"
>>>>>>>> job for reducers that handle high-cardinality columns? I am trying
>>>>>>>> increasing memory for now, as the partitioning and the number of reducers
>>>>>>>> depend on the number of cuboids.
>>>>>>>>
>>>>>>>>
>>>>>>>> P.S. I am using v2.4 of Kylin with HBase 1.x
>>>>>>>>
>>>>>>>> Thank You,
>>>>>>>> Shrikant Bang
>>>>>>>>
>>>>>>>> On Tue, Aug 14, 2018 at 8:33 PM ShaoFeng Shi <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> For question 1), in Cube's "advanced setting" step, you can
>>>>>>>>> specify the cuboid whitelist to build.
>>>>>>>>>
>>>>>>>>> 2018-08-13 22:26 GMT+08:00 Billy Liu <[email protected]>:
>>>>>>>>>
>>>>>>>>>> Hello Shrikant,
>>>>>>>>>>
>>>>>>>>>> For 1, it seems the 4 dimensions form a hierarchy. You could
>>>>>>>>>> define them as hierarchy dimensions in the cube and set A as a
>>>>>>>>>> mandatory dimension.
>>>>>>>>>>
>>>>>>>>>> For 2, select the date column that 'user_activity' is partitioned by
>>>>>>>>>> as the partition column in model design. There are a few built-in
>>>>>>>>>> formats; most date types are supported.
>>>>>>>>>>
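As a rough sketch of why the hierarchy plus mandatory dimension gives exactly the requested cuboids: a hierarchy A > B > C > D only allows prefixes of the chain, and making A mandatory drops every combination without A. The functions below are hypothetical helpers for illustration, not Kylin's cuboid scheduler.

```python
from itertools import combinations


def all_cuboids(dims):
    """Every non-empty combination of dimensions (a full cube)."""
    return [combo
            for size in range(len(dims), 0, -1)
            for combo in combinations(dims, size)]


def hierarchy_cuboids(dims):
    """Cuboids kept when dims form a hierarchy dims[0] > dims[1] > ...
    and dims[0] is mandatory: only prefixes of the chain remain."""
    return [tuple(dims[:size]) for size in range(len(dims), 0, -1)]


dims = ["A", "B", "C", "D"]
full = all_cuboids(dims)
selected = hierarchy_cuboids(dims)
```

For four dimensions the full cube has 15 non-empty cuboids, while the hierarchy keeps only (A,B,C,D), (A,B,C), (A,B) and (A), which matches the list in the original question.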
>>>>>>>>>> With Warm regards
>>>>>>>>>>
>>>>>>>>>> Billy Liu
>>>>>>>>>> Shrikant Bang <[email protected]> wrote on Mon, Aug 13, 2018 at 5:39 PM:
>>>>>>>>>> >
>>>>>>>>>> > Hi Team,
>>>>>>>>>> >
>>>>>>>>>> >      We are doing a PoC on building OLAP cubes. Could you
>>>>>>>>>> please help me with the queries below?
>>>>>>>>>> >
>>>>>>>>>> > Selective Cuboids:
>>>>>>>>>> > We need to have selective cuboids as part of the OLAP cubes.
>>>>>>>>>> > Let's say we have 4 dimensions: A, B, C, D; then we need just
>>>>>>>>>> (A,B,C,D), (A,B,C), (A,B) and (A).
>>>>>>>>>> >
>>>>>>>>>> > Refresh Settings:
>>>>>>>>>> > How do we specify the partition column and format while building a cube
>>>>>>>>>> for the fact table?
>>>>>>>>>> > e.g. user_activity is partitioned by date 'yyyy-MM-dd' and the cube
>>>>>>>>>> should be refreshed every day with the previous day's data.
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > Thank You,
>>>>>>>>>> > Shrikant Bang
>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> Shaofeng Shi 史少锋
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>>
>>>>>> Shaofeng Shi 史少锋
>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Best regards,
>>>>
>>>> Shaofeng Shi 史少锋
>>>>
>>>>
>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>>
>>


-- 
Best regards,

Shaofeng Shi 史少锋
