Re: Queries For Building Cube

ShaoFeng Shi Mon, 27 Aug 2018 20:35:52 -0700

Would you recommend using the "integer" type for a UHC (3+ million) dimension
and then having derived columns for the related dimensions (look-ups) whose
type is not "integer"?
>> This depends on the relative cardinality of the two columns. For example,
"user_id" and "email" are close to 1:1, so deriving "email" from "user_id"
works well. But deriving "sex" from "user_id" is not a good fit, because the
cardinality of "sex" is much smaller than that of "user_id", which means a
lot of post-aggregation will happen after the derivation. Usually we suggest
a ratio of about 10:1 or less, but this is not a fixed rule; you can choose
based on your performance requirements.
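
As a rough illustration of the ratio check described above, here is a minimal
Python sketch. It is not part of Kylin; the function names and the `max_ratio`
parameter (defaulting to the 10:1 guideline) are assumptions made only for
this example, and the inputs are plain in-memory column samples.

```python
def cardinality_ratio(host_values, derived_values):
    """Ratio of distinct host-key values to distinct derived-column values.

    Both arguments are equal-length samples of the two columns, row by row.
    """
    if len(host_values) != len(derived_values):
        raise ValueError("column samples must have the same number of rows")
    host_card = len(set(host_values))
    derived_card = len(set(derived_values))
    if derived_card == 0:
        raise ValueError("column samples are empty")
    return host_card / derived_card


def is_reasonable_derivation(host_values, derived_values, max_ratio=10.0):
    """Apply the 'about 10:1 or less' rule of thumb (hypothetical helper)."""
    return cardinality_ratio(host_values, derived_values) <= max_ratio
```

With a sample where "user_id" and "email" are 1:1 the ratio is 1.0, while 100
users mapped to two "sex" values give a ratio of 50, well above the guideline.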


Does the derived column's aggregation happen on the HBase coprocessor side?
Is there any JIRA/doc I can learn from?
>> No, the derivation calculation only happens on the Kylin node; it won't be
pushed down, because the lookup table's snapshot is only loaded on the Kylin
node.
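
To make the idea of post-aggregation concrete, here is a small conceptual
sketch in Python. It is not Kylin's implementation: `post_aggregate`, the row
layout, and the dict standing in for the lookup snapshot are all hypothetical.
It only shows why a low-cardinality derived column forces many rows, keyed by
the host column, to be merged after the scan.

```python
from collections import defaultdict


def post_aggregate(scanned_rows, lookup_snapshot):
    """Re-aggregate rows keyed by the host column onto the derived column.

    scanned_rows:    iterable of (host_key, measure) pairs, e.g. per user_id
    lookup_snapshot: dict mapping host_key -> derived value, e.g. user_id -> sex
    """
    totals = defaultdict(int)
    for host_key, measure in scanned_rows:
        derived_value = lookup_snapshot[host_key]  # derive from the snapshot
        totals[derived_value] += measure           # merge rows sharing the value
    return dict(totals)
```

In this sketch every scanned row is visited once, so the work grows with the
cardinality of the host column rather than with that of the derived column.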

2018-08-27 19:00 GMT+08:00 Shrikant Bang <[email protected]>:

> Thanks for the response, ShaoFeng!
>
> I started with 2G of memory (the default cluster setting), and the OOM was
> resolved when I increased the memory to 4G.
>
> Would you recommend using the "integer" type for a UHC (3+ million)
> dimension and then having derived columns for the related dimensions
> (look-ups) whose type is not "integer"?
>
> Does the derived column's aggregation happen on the HBase coprocessor side?
> Is there any JIRA/doc I can learn from?
>
> Please suggest.
>
> Thank You,
> Shrikant Bang
>
> On Tue, Aug 21, 2018 at 6:36 PM ShaoFeng Shi <[email protected]>
> wrote:
>
>> Hi Shrikant,
>>
>> How much memory are you allocating to the reducers? Please consider
>> allocating more memory to them, as Kylin builds the dictionary in the
>> reducers.
>>
>> You can also disable this; Kylin will then build the dictionary in its own
>> JVM. This may cause your Kylin process to run out of memory if there is an
>> ultra high cardinality (UHC) column.
>>
>> kylin.engine.mr.build-dict-in-reducer=false
>>
>>
>> Do you know how high the cardinality of that dimension is? For a UHC
>> column whose cardinality is above 3 million, we don't recommend using
>> dictionary encoding. You may need to use "fixed_length" or "integer" (if
>> the column is of integer type).
>>
>>
>> 2018-08-16 16:50 GMT+08:00 Ashish Singhi <[email protected]>:
>>
>>> Hi Shrikant,
>>>
>>> Refer to http://kylin.apache.org/blog/2015/08/13/kylin-dictionary/
>>> You might find it useful.
>>>
>>> Regards,
>>> Ashish
>>>
>>> On Thu, Aug 16, 2018 at 10:33 AM, Shrikant Bang <[email protected]>
>>> wrote:
>>>
>>>> Thank you, ShaoFeng & Billy, for the responses.
>>>>
>>>> I was able to set hierarchies in the dimensions.
>>>>
>>>> While building the cube, the "fact distinct column" step is failing in a
>>>> reducer with an out-of-memory exception.
>>>>
>>>> java.lang.OutOfMemoryError: Java heap space
>>>> at java.util.IdentityHashMap.resize(IdentityHashMap.java:471)
>>>> at java.util.IdentityHashMap.put(IdentityHashMap.java:440)
>>>> at org.apache.kylin.dict.TrieDictionaryBuilder.buildTrieBytes(TrieDictionaryBuilder.java:476)
>>>> at org.apache.kylin.dict.TrieDictionaryBuilder.build(TrieDictionaryBuilder.java:418)
>>>> at org.apache.kylin.dict.TrieDictionaryForestBuilder.build(TrieDictionaryForestBuilder.java:109)
>>>> at org.apache.kylin.dict.DictionaryGenerator$StringTrieDictForestBuilder.build(DictionaryGenerator.java:220)
>>>> at org.apache.kylin.engine.mr.steps.FactDistinctColumnsReducer.doCleanup(FactDistinctColumnsReducer.java:216)
>>>> at org.apache.kylin.engine.mr.KylinReducer.cleanup(KylinReducer.java:103)
>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
>>>> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>>>>
>>>>
>>>> I tried debugging and found that the dictionary is built in the
>>>> reducer's cleanup method.
>>>>
>>>> I am curious to learn the internals. Can you please help me with the
>>>> following:
>>>>
>>>>   1.  Any pointer/reference/JIRA for understanding how the trie
>>>> (dictionary) of a dimension's values is used in the next steps?
>>>>
>>>>   2.  Any best practices/references for tuning the "fact distinct column"
>>>> job for reducers that handle high-cardinality columns? For now I am
>>>> trying to increase memory, since the partitioning and the number of
>>>> reducers depend on the number of cuboids.
>>>>
>>>>
>>>> P.S. I am using v2.4 of Kylin with HBase 1.x
>>>>
>>>> Thank You,
>>>> Shrikant Bang
>>>>
>>>> On Tue, Aug 14, 2018 at 8:33 PM ShaoFeng Shi <[email protected]>
>>>> wrote:
>>>>
>>>>> For question 1), in Cube's "advanced setting" step, you can specify
>>>>> the cuboid whitelist to build.
>>>>>
>>>>> 2018-08-13 22:26 GMT+08:00 Billy Liu <[email protected]>:
>>>>>
>>>>>> Hello Shrikant,
>>>>>>
>>>>>> For 1, it seems the 4 dimensions form a hierarchy. You could
>>>>>> define them as hierarchy dimensions in the cube, and leave A as a
>>>>>> mandatory dimension.
>>>>>>
>>>>>> For 2, select 'user_activity' as the partition column in the model
>>>>>> design. There are a few built-in formats; most date types are supported.
>>>>>>
>>>>>> With Warm regards
>>>>>>
>>>>>> Billy Liu
>>>>>> Shrikant Bang <[email protected]> wrote on Mon, Aug 13, 2018 at 5:39 PM:
>>>>>> >
>>>>>> > Hi Team,
>>>>>> >
>>>>>> >      We are doing a PoC on building OLAP cubes. Could you please
>>>>>> > help me get answers to the queries below?
>>>>>> >
>>>>>> > Selective Cuboids:
>>>>>> > We need to have selective cuboids as part of OLAP cubes.
>>>>>> > Let's say we have 4 dimensions: A, B, C, D; then we need just
>>>>>> > (A,B,C,D), (A,B,C), (A,B) and (A).
>>>>>> >
>>>>>> > Refresh Settings:
>>>>>> > How do we specify the partition column and format for the fact table
>>>>>> > when building a cube?
>>>>>> > e.g. user_activity is partitioned by date 'yyyy-MM-dd' and the cube
>>>>>> > should be refreshed every day with the previous day's computation.
>>>>>> >
>>>>>> >
>>>>>> > Thank You,
>>>>>> > Shrikant Bang
>>>>>> >
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>>
>>>>> Shaofeng Shi 史少锋
>>>>>
>>>>>
>>>
>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>>
>>


-- 
Best regards,

Shaofeng Shi 史少锋
