Re: Queries For Building Cube

Shrikant Bang Tue, 28 Aug 2018 02:05:35 -0700

Thank you, ShaoFeng, for the response!

Apart from the UHC column, we have other dimensions which will be used by
multiple cubes.


e.g. calendar_dimension (date, day, week, month, quarter, etc.), which is
immutable.

A few of the calendar dimension's columns become part of the cube and a few
become derived columns.

Is there any way I can cache it on the Kylin node and keep reusing it for every
other cube? It would be a kind of global cache for all cubes under a project.

Thank You,
Shrikant Bang.



On Tue, Aug 28, 2018 at 9:05 AM ShaoFeng Shi <[email protected]> wrote:

>
> Would you recommend using the "integer" type for a UHC (3+ million) dimension
> and then having derived columns for related dimensions (look-ups) whose type
> is not "integer"?
> >> This depends on the cardinality of the two columns. For example,
> "user_id" and "email" are close to 1:1, so this derivation is good.
> But "user_id" and "sex" is not a good pair, because the cardinality of "sex"
> is much smaller than that of "user_id", which means a lot of post-aggregation
> will happen after the derivation. Usually, we suggest a ratio of around 10:1
> or less, but this is not fixed; you can choose depending on your
> performance requirements.
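The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample cardinalities are made up, and nothing here is a Kylin API. The 10:1 limit is the soft guideline from the reply, not a hard rule.

```python
def derivation_ratio(host_cardinality: int, derived_cardinality: int) -> float:
    """Average number of host values that collapse into one derived value."""
    if host_cardinality <= 0 or derived_cardinality <= 0:
        raise ValueError("cardinalities must be positive")
    return host_cardinality / derived_cardinality


def is_good_derived_candidate(host_cardinality: int,
                              derived_cardinality: int,
                              max_ratio: float = 10.0) -> bool:
    """Apply the 'around 10:1 or less' guideline (max_ratio is tunable)."""
    return derivation_ratio(host_cardinality, derived_cardinality) <= max_ratio


# Hypothetical cardinalities, for illustration only.
USER_ID_CARDINALITY = 3_000_000
EMAIL_CARDINALITY = 2_900_000   # close to 1:1 with user_id
SEX_CARDINALITY = 3             # far smaller than user_id

email_ok = is_good_derived_candidate(USER_ID_CARDINALITY, EMAIL_CARDINALITY)
sex_ok = is_good_derived_candidate(USER_ID_CARDINALITY, SEX_CARDINALITY)
```

With these sample numbers, "email" passes the check and "sex" does not, which matches the two examples in the reply.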
>
> Does a derived column's aggregation happen on the HBase coprocessor side? Any
> JIRA/doc I can learn from?
> >> No, the derivation calculation only happens in the Kylin node; it won't be
> pushed down, because the lookup table's snapshot is only loaded in the Kylin node.
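As a conceptual sketch only (this is not Kylin's code, and all names and data are invented), the post-aggregation step could look like this in Python: rows come back from storage keyed by the host column, the Kylin node maps each key through the in-memory lookup snapshot, and then re-aggregates by the derived value.

```python
from collections import defaultdict


def query_by_derived(cuboid_rows, lookup_snapshot, derived_column):
    """Re-aggregate rows keyed by the host column onto a derived column.

    cuboid_rows:     iterable of (host_key, measure) pairs from storage
    lookup_snapshot: dict host_key -> dict of lookup-table columns
    """
    totals = defaultdict(int)
    for host_key, measure in cuboid_rows:
        derived_value = lookup_snapshot[host_key][derived_column]
        totals[derived_value] += measure  # the post-aggregation step
    return dict(totals)


# Hypothetical data: four user_id rows collapse into two "sex" groups.
cuboid_rows = [(1, 10), (2, 5), (3, 7), (4, 1)]
lookup_snapshot = {
    1: {"sex": "F"},
    2: {"sex": "M"},
    3: {"sex": "F"},
    4: {"sex": "M"},
}

result = query_by_derived(cuboid_rows, lookup_snapshot, "sex")
```

Every host-column row has to be fetched and folded in this loop, which is why a large host-to-derived cardinality ratio makes such queries expensive.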
>
> 2018-08-27 19:00 GMT+08:00 Shrikant Bang <[email protected]>:
>
>> Thanks, ShaoFeng, for the response!
>>
>> I started with 2G of memory (the default cluster setting), and the OOM was
>> resolved when I increased memory to 4G.
>>
>> Would you recommend using the "integer" type for a UHC (3+ million) dimension
>> and then having derived columns for related dimensions (look-ups) whose type
>> is not "integer"?
>>
>> Does a derived column's aggregation happen on the HBase coprocessor side? Any
>> JIRA/doc I can learn from?
>>
>> Please suggest.
>>
>> Thank You,
>> Shrikant Bang
>>
>> On Tue, Aug 21, 2018 at 6:36 PM ShaoFeng Shi <[email protected]>
>> wrote:
>>
>>> Hi Shrikant,
>>>
>>> How much memory are you allocating to the reducer? Please consider
>>> allocating more memory to the reducer, as Kylin builds the dictionary in
>>> the reducers.
>>>
>>> You can also disable this; Kylin will then build the dictionary in its own
>>> JVM. This may cause your Kylin process to run out of memory if there is an
>>> ultra high cardinality (UHC) column.
>>>
>>> kylin.engine.mr.build-dict-in-reducer=false
>>>
>>>
>>> Do you know how high the cardinality of that dimension is? For a UHC column
>>> whose cardinality is > 3 million, we don't recommend using dictionary
>>> encoding. You may need to use "fixed_length" or "integer" (if the column is
>>> of integer type).
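The encoding advice above can be written down as a small decision helper. This is a minimal Python sketch, not part of Kylin; the function name is hypothetical, and "dict" is used here as an assumed label for dictionary encoding.

```python
UHC_THRESHOLD = 3_000_000  # "> 3 million" from the advice above


def suggest_encoding(cardinality: int, is_integer_type: bool) -> str:
    """Suggest a rowkey encoding from a column's cardinality and type."""
    if cardinality <= UHC_THRESHOLD:
        return "dict"          # dictionary encoding is fine below the threshold
    if is_integer_type:
        return "integer"       # UHC column holding integer values
    return "fixed_length"      # UHC column of any other type


low_card = suggest_encoding(1_000, is_integer_type=False)
uhc_int = suggest_encoding(5_000_000, is_integer_type=True)
uhc_str = suggest_encoding(5_000_000, is_integer_type=False)
```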
>>>
>>>
>>> 2018-08-16 16:50 GMT+08:00 Ashish Singhi <[email protected]>:
>>>
>>>> Hi Shrikant,
>>>>
>>>> Refer to http://kylin.apache.org/blog/2015/08/13/kylin-dictionary/
>>>> You might find it useful.
>>>>
>>>> Regards,
>>>> Ashish
>>>>
>>>> On Thu, Aug 16, 2018 at 10:33 AM, Shrikant Bang <[email protected]
>>>> > wrote:
>>>>
>>>>> Thank you, ShaoFeng & Billy, for the responses.
>>>>>
>>>>> I was able to set hierarchies in the dimensions.
>>>>>
>>>>> While building the cube, the "fact distinct column" step is failing in a
>>>>> reducer with an out-of-memory exception.
>>>>>
>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>> at java.util.IdentityHashMap.resize(IdentityHashMap.java:471)
>>>>> at java.util.IdentityHashMap.put(IdentityHashMap.java:440)
>>>>> at org.apache.kylin.dict.TrieDictionaryBuilder.buildTrieBytes(TrieDictionaryBuilder.java:476)
>>>>> at org.apache.kylin.dict.TrieDictionaryBuilder.build(TrieDictionaryBuilder.java:418)
>>>>> at org.apache.kylin.dict.TrieDictionaryForestBuilder.build(TrieDictionaryForestBuilder.java:109)
>>>>> at org.apache.kylin.dict.DictionaryGenerator$StringTrieDictForestBuilder.build(DictionaryGenerator.java:220)
>>>>> at org.apache.kylin.engine.mr.steps.FactDistinctColumnsReducer.doCleanup(FactDistinctColumnsReducer.java:216)
>>>>> at org.apache.kylin.engine.mr.KylinReducer.cleanup(KylinReducer.java:103)
>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
>>>>> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>>>>>
>>>>>
>>>>> I tried debugging and understood that the dictionary is built in the
>>>>> reducer's cleanup method.
>>>>>
>>>>> I am curious to learn the internals. Can you please help me with the
>>>>> following:
>>>>>
>>>>>   1.  Any pointer/reference/JIRA for understanding how the TRIE
>>>>> (dictionary) of a dimension's values is used in the next steps?
>>>>>
>>>>>   2.  Any best practices/references for tuning the "fact distinct column"
>>>>> job for reducers that handle high-cardinality columns? I am trying
>>>>> increasing memory for now, as the partitioning and the number of reducers
>>>>> depend on the number of cuboids.
>>>>>
>>>>>
>>>>> P.S. I am using Kylin v2.4 with HBase 1.x
>>>>>
>>>>> Thank You,
>>>>> Shrikant Bang
>>>>>
>>>>> On Tue, Aug 14, 2018 at 8:33 PM ShaoFeng Shi <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> For question 1), in Cube's "advanced setting" step, you can specify
>>>>>> the cuboid whitelist to build.
>>>>>>
>>>>>> 2018-08-13 22:26 GMT+08:00 Billy Liu <[email protected]>:
>>>>>>
>>>>>>> Hello Shrikant,
>>>>>>>
>>>>>>> For 1, it seems the 4 dimensions form a hierarchy. You could
>>>>>>> define them as hierarchy dimensions in the cube, and leave A as a
>>>>>>> mandatory dimension.
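To see why a hierarchy prunes so much, here is a small Python sketch that compares every non-empty dimension combination with the prefix-only combinations of a hierarchy A > B > C > D. It only illustrates the counting argument; the function names are hypothetical, and Kylin's actual cuboid pruning may differ in detail.

```python
from itertools import combinations


def all_cuboids(dims):
    """Every non-empty combination of dimensions (no pruning)."""
    return [combo
            for size in range(1, len(dims) + 1)
            for combo in combinations(dims, size)]


def hierarchy_cuboids(dims):
    """Only the prefixes of the hierarchy, e.g. (A), (A,B), (A,B,C), ..."""
    return [tuple(dims[:depth]) for depth in range(1, len(dims) + 1)]


dims = ["A", "B", "C", "D"]
unpruned = all_cuboids(dims)
pruned = hierarchy_cuboids(dims)
```

For four dimensions this leaves 4 combinations out of 15, and those 4 are exactly (A), (A,B), (A,B,C) and (A,B,C,D).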
>>>>>>>
>>>>>>> For 2, select 'user_activity' as the partition column in the model
>>>>>>> design. There are a few built-in formats; most date types are supported.
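For the daily refresh itself, the time range of "the previous day" has to be computed by whatever schedules the build. The Python sketch below shows one way to do that; the helper names are hypothetical, the range is assumed to be half-open [start, end), and the dates are treated as UTC, which is an assumption to check against your own setup. How the range is passed to Kylin (UI or API) is not shown here.

```python
from datetime import date, datetime, timedelta, timezone


def previous_day_range(today: date):
    """Return (start, end) dates covering the previous day as [start, end)."""
    start = today - timedelta(days=1)
    return start, today


def to_epoch_millis(day: date) -> int:
    """Midnight UTC of the given day, in milliseconds since the epoch."""
    midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    return int(midnight.timestamp() * 1000)


# Example run date; a scheduler would normally pass date.today().
start, end = previous_day_range(date(2018, 8, 14))

# Python's "%Y-%m-%d" corresponds to the 'yyyy-MM-dd' partition format.
partition_value = start.strftime("%Y-%m-%d")
```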
>>>>>>>
>>>>>>> With Warm regards
>>>>>>>
>>>>>>> Billy Liu
>>>>>>> Shrikant Bang <[email protected]> wrote on Mon, Aug 13, 2018 at 5:39 PM:
>>>>>>> >
>>>>>>> > Hi Team,
>>>>>>> >
>>>>>>> >      We are doing a PoC on building OLAP cubes. Could you please
>>>>>>> help me get answers to the queries below?
>>>>>>> >
>>>>>>> > Selective Cuboids:
>>>>>>> > We need to have selective cuboids as part of OLAP cubes.
>>>>>>> > Let's say we have 4 dimensions: A, B, C, D; then we need just
>>>>>>> (A,B,C,D), (A,B,C), (A,B) and (A).
>>>>>>> >
>>>>>>> > Refresh Settings:
>>>>>>> > How do we specify the partition column and format while building a
>>>>>>> cube for the fact table?
>>>>>>> > e.g. user_activity is partitioned by date 'yyyy-MM-dd' and the cube
>>>>>>> should be refreshed every day with the previous day's computation.
>>>>>>> >
>>>>>>> >
>>>>>>> > Thank You,
>>>>>>> > Shrikant Bang
>>>>>>> >
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>>
>>>>>> Shaofeng Shi 史少锋
>>>>>>
>>>>>>
>>>>
>>>
>>>
>>> --
>>> Best regards,
>>>
>>> Shaofeng Shi 史少锋
>>>
>>>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>
