Thanks for the response, ShaoFeng!

I started with 2G of memory (the default cluster setting), and the OOM was
resolved once the memory was increased to 4G.
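
A sketch of what such an increase might look like, assuming it is applied to
the reducers and set in kylin.properties through the
kylin.engine.mr.config-override. prefix. The values are illustrative, not the
exact ones from this cluster:

```properties
# Assumption: container size for each reduce task, in MB
kylin.engine.mr.config-override.mapreduce.reduce.memory.mb=4096
# Assumption: JVM heap kept somewhat below the container size
kylin.engine.mr.config-override.mapreduce.reduce.java.opts=-Xmx3500m
```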

Would you recommend using the "integer" encoding for a UHC (3+ million)
dimension, and then having derived columns for the related (look-up)
dimensions whose type is not "integer"?

Does the aggregation of derived columns happen on the HBase coprocessor
side? Is there any JIRA/doc I could learn from?

Please suggest.

Thank You,
Shrikant Bang

On Tue, Aug 21, 2018 at 6:36 PM ShaoFeng Shi <[email protected]> wrote:

> Hi Shrikant,
>
> How much memory are you allocating to the reducers? Please consider
> allocating more memory to them, as Kylin builds the dictionary in the
> reducers.
>
> You can also disable this; Kylin will then build the dictionary in its own
> JVM. This may cause the Kylin process to run out of memory if there is an
> ultra high cardinality (UHC) column.
>
> kylin.engine.mr.build-dict-in-reducer=false
>
>
> Do you know how high the cardinality of that dimension is? For a UHC column
> whose cardinality is above 3 million, we don't recommend dictionary
> encoding. You may need to use "fixed_length" or "integer" (if the column is
> of an integer type).
>
>
> 2018-08-16 16:50 GMT+08:00 Ashish Singhi <[email protected]>:
>
>> Hi Shrikant,
>>
>> Please refer to http://kylin.apache.org/blog/2015/08/13/kylin-dictionary/
>> You might find it useful.
>>
>> Regards,
>> Ashish
>>
>> On Thu, Aug 16, 2018 at 10:33 AM, Shrikant Bang <[email protected]>
>> wrote:
>>
>>> Thank you, ShaoFeng & Billy, for the responses.
>>>
>>> I was able to set up the hierarchies in the dimensions.
>>>
>>> While building the cube, the "fact distinct column" step fails in a
>>> reducer with an out-of-memory error.
>>>
>>> java.lang.OutOfMemoryError: Java heap space
>>> at java.util.IdentityHashMap.resize(IdentityHashMap.java:471)
>>> at java.util.IdentityHashMap.put(IdentityHashMap.java:440)
>>> at org.apache.kylin.dict.TrieDictionaryBuilder.buildTrieBytes(TrieDictionaryBuilder.java:476)
>>> at org.apache.kylin.dict.TrieDictionaryBuilder.build(TrieDictionaryBuilder.java:418)
>>> at org.apache.kylin.dict.TrieDictionaryForestBuilder.build(TrieDictionaryForestBuilder.java:109)
>>> at org.apache.kylin.dict.DictionaryGenerator$StringTrieDictForestBuilder.build(DictionaryGenerator.java:220)
>>> at org.apache.kylin.engine.mr.steps.FactDistinctColumnsReducer.doCleanup(FactDistinctColumnsReducer.java:216)
>>> at org.apache.kylin.engine.mr.KylinReducer.cleanup(KylinReducer.java:103)
>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
>>> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>>>
>>>
>>> I tried debugging and found that the dictionary is built in the
>>> reducer's cleanup method.
>>>
>>> I am curious to learn the internals. Could you please help me with the
>>> following:
>>>
>>>   1.  Any pointer/reference/JIRA for understanding how the trie
>>> (dictionary) of a dimension's values is used in the subsequent steps?
>>>
>>>   2.  Any best practices/references for tuning the "fact distinct column"
>>> job for reducers that handle high-cardinality columns? For now I am
>>> trying increased memory, since the partitioning and the number of
>>> reducers depend on the number of cuboids.
>>>
>>>
>>> P.S. I am using Kylin v2.4 with HBase 1.x.
>>>
>>> Thank You,
>>> Shrikant Bang
>>>
>>> On Tue, Aug 14, 2018 at 8:33 PM ShaoFeng Shi <[email protected]>
>>> wrote:
>>>
>>>> For question 1), in the cube's "advanced setting" step, you can specify
>>>> the cuboid whitelist to build.
>>>>
>>>> 2018-08-13 22:26 GMT+08:00 Billy Liu <[email protected]>:
>>>>
>>>>> Hello Shrikant,
>>>>>
>>>>> For 1, it seems the 4 dimensions form a hierarchy. You could define
>>>>> them as hierarchy dimensions in the cube, and set A as a mandatory
>>>>> dimension.
>>>>>
>>>>> For 2, select 'user_activity' as the partition column in the model
>>>>> design. There are a few built-in formats; most date types are supported.
>>>>>
>>>>> With Warm regards
>>>>>
>>>>> Billy Liu
>>>>> Shrikant Bang <[email protected]> wrote on Mon, Aug 13, 2018 at 5:39 PM:
>>>>> >
>>>>> > Hi Team,
>>>>> >
>>>>> >      We are doing a PoC on building OLAP cubes. Could you please
>>>>> > help me with the questions below?
>>>>> >
>>>>> > Selective Cuboids:
>>>>> > We need to have only selected cuboids as part of the OLAP cubes.
>>>>> > Let's say we have 4 dimensions: A, B, C, D. Then we need just
>>>>> > (A,B,C,D), (A,B,C), (A,B) and (A).
>>>>> >
>>>>> > Refresh Settings:
>>>>> > How do we specify the partition column and format for the fact table
>>>>> > when building the cube?
>>>>> > e.g. user_activity is partitioned by date 'yyyy-MM-dd' and the cube
>>>>> > should be refreshed every day with the previous day's data.
>>>>> >
>>>>> >
>>>>> > Thank You,
>>>>> > Shrikant Bang
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best regards,
>>>>
>>>> Shaofeng Shi 史少锋
>>>>
>>>>
>>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>
