Re: Queries For Building Cube

Shrikant Bang Tue, 28 Aug 2018 07:56:11 -0700

Thank you, ShaoFeng, for the response!
I have learned a lot about Kylin's internals from this mail thread.
I appreciate your help.


Regards,
Shrikant Bang.

On Tue, Aug 28, 2018 at 7:14 PM ShaoFeng Shi <[email protected]> wrote:

> Hi Shrikant,
>
> Do you mean reusing the lookup table snapshot across cubes? As you know,
> Kylin takes snapshots of lookup tables and loads them into memory at
> query time.
>
> When Kylin takes a snapshot, it checks whether the lookup table has
> changed since the last snapshot. If there is no change, it reuses the
> existing snapshot. The detailed logic is in SnapshotManager.buildSnapshot().
>
> So, from this point of view, if your calendar lookup table is stable, it
> will be reused by multiple cubes.
>
> Hope this helps;
>
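A minimal sketch of the reuse check described above, assuming a table's state can be fingerprinted by its last-modified time and size. All class and method names here are hypothetical illustrations; this is not the code of SnapshotManager.buildSnapshot(), only the general idea of "rebuild only when the table changed".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; these classes are NOT Kylin's SnapshotManager API.
class SnapshotCacheSketch {

    /** Assumed fingerprint of a lookup table: last-modified time plus size. */
    static final class TableSignature {
        final long lastModified;
        final long sizeBytes;

        TableSignature(long lastModified, long sizeBytes) {
            this.lastModified = lastModified;
            this.sizeBytes = sizeBytes;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof TableSignature)) return false;
            TableSignature other = (TableSignature) o;
            return lastModified == other.lastModified && sizeBytes == other.sizeBytes;
        }

        @Override
        public int hashCode() {
            return Long.hashCode(lastModified) * 31 + Long.hashCode(sizeBytes);
        }
    }

    /** Stand-in for snapshot content; a real one would hold the table rows. */
    static final class Snapshot {
        final String table;
        final TableSignature signature;

        Snapshot(String table, TableSignature signature) {
            this.table = table;
            this.signature = signature;
        }
    }

    private final Map<String, Snapshot> latestByTable = new HashMap<>();
    private int buildCount = 0;

    /** Returns the existing snapshot if the table is unchanged, else builds a new one. */
    Snapshot getOrBuild(String table, TableSignature current) {
        Snapshot existing = latestByTable.get(table);
        if (existing != null && existing.signature.equals(current)) {
            return existing; // unchanged table: every cube shares this snapshot
        }
        Snapshot fresh = new Snapshot(table, current);
        latestByTable.put(table, fresh);
        buildCount++;
        return fresh;
    }

    int buildCount() {
        return buildCount;
    }
}
```

With a stable table such as calendar_dimension, repeated calls from different cube builds would return the same snapshot object and the build counter would stay at one.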
> 2018-08-28 17:04 GMT+08:00 Shrikant Bang <[email protected]>:
>
>> Thank you, ShaoFeng, for the response!
>>
>> Apart from the UHC dimension, we have other dimensions that will be used
>> by multiple cubes.
>>
>> e.g. calendar_dimension (date, day, week, week, month, quarter, etc.),
>> which is immutable.
>>
>> A few of the calendar's columns become part of the cube and a few become
>> derived columns.
>>
>> Is there any way I can cache it on Kylin's node and keep reusing it for
>> every other cube? It would be a kind of global cache for all cubes under
>> a project.
>>
>> Thank You,
>> Shrikant Bang.
>>
>>
>>
>> On Tue, Aug 28, 2018 at 9:05 AM ShaoFeng Shi <[email protected]>
>> wrote:
>>
>>>
>>> Would you recommend using the "integer" type for a UHC (3+ million)
>>> dimension and then having derived columns for related dimensions
>>> (look-ups) whose type is not "integer"?
>>> >> This depends on the cardinality of the two columns. For example,
>>> "user_id" and "email" are close to 1:1, so this derivation is good.
>>> But "user_id" and "sex" is not a good pair because the cardinality of
>>> "sex" is much smaller than that of "user_id", which means a lot of
>>> post-aggregation will happen after the derivation. Usually, we suggest a
>>> ratio of around 10:1 or less, but this is not fixed; you can choose
>>> depending on your performance requirements.
>>>
>>> Does the derived column's aggregation happen on the HBase Co-Processor
>>> side? Any JIRA/doc for my learning?
>>> >> No, the derivation calculation only happens in the Kylin node; it won't
>>> be pushed down, because the lookup table's snapshot is only loaded in the
>>> Kylin node.
>>>
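A small sketch of the ratio rule of thumb above, assuming the two cardinalities are already known (for example from table statistics). The class name, method names, and the hard cut-off at 10.0 are illustrative assumptions; the advice in the thread says "around 10:1" and explicitly notes the number is not fixed.

```java
// Hypothetical helper; the 10:1 cut-off mirrors the rule of thumb in the
// thread and is not a Kylin constant.
class DerivedColumnRatioSketch {

    static final double SUGGESTED_MAX_RATIO = 10.0;

    /** Ratio of the host (key) column's cardinality to the derived column's. */
    static double ratio(long hostCardinality, long derivedCardinality) {
        if (hostCardinality <= 0 || derivedCardinality <= 0) {
            throw new IllegalArgumentException("cardinalities must be positive");
        }
        return (double) hostCardinality / derivedCardinality;
    }

    /** True when deriving is unlikely to cause heavy post-aggregation. */
    static boolean looksReasonable(long hostCardinality, long derivedCardinality) {
        return ratio(hostCardinality, derivedCardinality) <= SUGGESTED_MAX_RATIO;
    }
}
```

For instance, a user_id column with 3,000,000 values and an email column with 2,900,000 values passes the check, while the same user_id column paired with a two-valued column does not.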
>>> 2018-08-27 19:00 GMT+08:00 Shrikant Bang <[email protected]>:
>>>
>>>> Thanks, ShaoFeng, for the response!
>>>>
>>>> I started with 2G of memory (the default cluster setting), and the OOM
>>>> was solved when the memory was increased to 4G.
>>>>
>>>> Would you recommend using the "integer" type for a UHC (3+ million)
>>>> dimension and then having derived columns for related dimensions
>>>> (look-ups) whose type is not "integer"?
>>>>
>>>> Does the derived column's aggregation happen on the HBase Co-Processor
>>>> side? Any JIRA/doc for my learning?
>>>>
>>>> Please suggest.
>>>>
>>>> Thank You,
>>>> Shrikant Bang
>>>>
>>>> On Tue, Aug 21, 2018 at 6:36 PM ShaoFeng Shi <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Shrikant,
>>>>>
>>>>> How much memory are you allocating to the reducer? Please consider
>>>>> allocating more memory to the reducer, as Kylin builds the dictionary
>>>>> in the reducers.
>>>>>
>>>>> You can also disable this; Kylin will then build the dictionary in its
>>>>> own JVM. This may cause your Kylin process to run out of memory if
>>>>> there is an ultra high cardinality (UHC) column.
>>>>>
>>>>> kylin.engine.mr.build-dict-in-reducer=false
>>>>>
>>>>>
>>>>> Do you know how high the cardinality of that dimension is? For a UHC
>>>>> column whose cardinality is > 3 million, we don't recommend using the
>>>>> dictionary as the encoding. You may need to use "fixed_length" or
>>>>> "integer" (if the column is of integer type).
>>>>>
>>>>>
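The encoding advice above can be written as a tiny decision function. This is only a sketch under the assumptions stated in the thread (a 3 million threshold, "integer" only for integer-typed columns); the class and method names are hypothetical, and the returned labels follow the wording of the email rather than any particular Kylin configuration key.

```java
// Hypothetical helper that restates the advice in the thread; not part of Kylin.
class EncodingChoiceSketch {

    // Threshold quoted in the thread for "ultra high cardinality" (UHC).
    static final long UHC_THRESHOLD = 3_000_000L;

    /** Suggests an encoding label for a dimension column. */
    static String suggestEncoding(long cardinality, boolean integerTyped) {
        if (cardinality <= UHC_THRESHOLD) {
            return "dictionary"; // dictionary stays manageable below the threshold
        }
        // Above the threshold, avoid building a dictionary at all.
        return integerTyped ? "integer" : "fixed_length";
    }
}
```

The threshold is a guideline from the thread, so treat the boundary as adjustable rather than exact.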
>>>>> 2018-08-16 16:50 GMT+08:00 Ashish Singhi <[email protected]>:
>>>>>
>>>>>> Hi Shrikant,
>>>>>>
>>>>>> Refer to http://kylin.apache.org/blog/2015/08/13/kylin-dictionary/
>>>>>> You might find it useful.
>>>>>>
>>>>>> Regards,
>>>>>> Ashish
>>>>>>
>>>>>> On Thu, Aug 16, 2018 at 10:33 AM, Shrikant Bang <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Thank you, ShaoFeng & Billy, for the responses.
>>>>>>>
>>>>>>> I was able to set hierarchies in the dimensions.
>>>>>>>
>>>>>>> While building the cube, the "fact distinct column" job is failing in
>>>>>>> a reducer with an out-of-memory exception.
>>>>>>>
>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>> at java.util.IdentityHashMap.resize(IdentityHashMap.java:471)
>>>>>>> at java.util.IdentityHashMap.put(IdentityHashMap.java:440)
>>>>>>> at org.apache.kylin.dict.TrieDictionaryBuilder.buildTrieBytes(TrieDictionaryBuilder.java:476)
>>>>>>> at org.apache.kylin.dict.TrieDictionaryBuilder.build(TrieDictionaryBuilder.java:418)
>>>>>>> at org.apache.kylin.dict.TrieDictionaryForestBuilder.build(TrieDictionaryForestBuilder.java:109)
>>>>>>> at org.apache.kylin.dict.DictionaryGenerator$StringTrieDictForestBuilder.build(DictionaryGenerator.java:220)
>>>>>>> at org.apache.kylin.engine.mr.steps.FactDistinctColumnsReducer.doCleanup(FactDistinctColumnsReducer.java:216)
>>>>>>> at org.apache.kylin.engine.mr.KylinReducer.cleanup(KylinReducer.java:103)
>>>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
>>>>>>> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>>>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>>>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>>>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>>>>>>>
>>>>>>>
>>>>>>> I tried debugging and understood that the dictionary is built in the
>>>>>>> reducer's cleanup method.
>>>>>>>
>>>>>>> I am curious to learn the internals. Can you please help me with the
>>>>>>> questions below:
>>>>>>>
>>>>>>>   1.  Any pointer/reference/JIRA for understanding how the trie
>>>>>>> (dictionary) of a dimension's values is used in the next steps?
>>>>>>>
>>>>>>>   2.  Any best practices/references for tuning the "fact distinct
>>>>>>> column" job for reducers that handle high-cardinality columns? I am
>>>>>>> trying increased memory for now, as the partitioning and the number
>>>>>>> of reducers depend on the number of cuboids.
>>>>>>>
>>>>>>>
>>>>>>> P.S. I am using v2.4 of Kylin with HBase 1.x
>>>>>>>
>>>>>>> Thank You,
>>>>>>> Shrikant Bang
>>>>>>>
>>>>>>> On Tue, Aug 14, 2018 at 8:33 PM ShaoFeng Shi <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> For question 1), in Cube's "advanced setting" step, you can specify
>>>>>>>> the cuboid whitelist to build.
>>>>>>>>
>>>>>>>> 2018-08-13 22:26 GMT+08:00 Billy Liu <[email protected]>:
>>>>>>>>
>>>>>>>>> Hello Shrikant,
>>>>>>>>>
>>>>>>>>> For 1, it seems the 4 dimensions form a hierarchy. You could
>>>>>>>>> define them as hierarchy dimensions in the cube, and set A as a
>>>>>>>>> mandatory dimension.
>>>>>>>>>
>>>>>>>>> For 2, select 'user_activity' as the partition column in the model
>>>>>>>>> design. There are a few built-in formats; most date types are
>>>>>>>>> supported.
>>>>>>>>>
>>>>>>>>> With Warm regards
>>>>>>>>>
>>>>>>>>> Billy Liu
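To illustrate why a hierarchy plus a mandatory dimension yields exactly the requested cuboids, here is a minimal sketch that enumerates every combination of four dimensions and keeps only those that satisfy both rules. It models the pruning rules in plain Java under the assumption that the hierarchy is A > B > C > D; the class and method names are hypothetical and this is not how Kylin's cuboid scheduler is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the pruning rules; not Kylin's cuboid scheduler.
class CuboidEnumerationSketch {

    /** Bit i of the mask means "dimension i (in hierarchy order) is present". */
    static List<String> validCuboids(String[] hierarchy) {
        List<String> result = new ArrayList<>();
        int total = 1 << hierarchy.length;
        for (int mask = 1; mask < total; mask++) {
            if (containsMandatory(mask) && isHierarchyPrefix(mask)) {
                result.add(describe(mask, hierarchy));
            }
        }
        return result;
    }

    /** The first dimension (A) is treated as mandatory. */
    static boolean containsMandatory(int mask) {
        return (mask & 1) != 0;
    }

    /**
     * Hierarchy rule: a lower level may appear only with all higher levels,
     * so the set bits must be 0..k-1, which makes mask + 1 a power of two.
     */
    static boolean isHierarchyPrefix(int mask) {
        return (mask & (mask + 1)) == 0;
    }

    static String describe(int mask, String[] hierarchy) {
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < hierarchy.length; i++) {
            if ((mask & (1 << i)) != 0) {
                if (sb.length() > 1) sb.append(',');
                sb.append(hierarchy[i]);
            }
        }
        return sb.append(')').toString();
    }
}
```

Calling validCuboids(new String[] {"A", "B", "C", "D"}) keeps 4 of the 15 non-empty combinations: (A), (A,B), (A,B,C) and (A,B,C,D).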
>>>>>>>>> Shrikant Bang <[email protected]> wrote on Mon, Aug 13, 2018 at 5:39 PM:
>>>>>>>>> >
>>>>>>>>> > Hi Team,
>>>>>>>>> >
>>>>>>>>> >      We are doing a PoC on building OLAP cubes. Could you please
>>>>>>>>> > help me get answers to the queries below?
>>>>>>>>> >
>>>>>>>>> > Selective Cuboids:
>>>>>>>>> > We need to have selective cuboids as part of OLAP cubes.
>>>>>>>>> > Let's say we have 4 dimensions: A, B, C, D; then we need just
>>>>>>>>> > (A,B,C,D), (A,B,C), (A,B) and (A).
>>>>>>>>> >
>>>>>>>>> > Refresh Settings:
>>>>>>>>> > How do we specify the partition column and format while building
>>>>>>>>> > a cube for the fact table?
>>>>>>>>> > e.g. user_activity is partitioned by date 'yyyy-MM-dd' and the cube
>>>>>>>>> > should be refreshed every day with the previous day's computation.
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > Thank You,
>>>>>>>>> > Shrikant Bang
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Shaofeng Shi 史少锋
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>>
>>>>> Shaofeng Shi 史少锋
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Best regards,
>>>
>>> Shaofeng Shi 史少锋
>>>
>>>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>
