Thanks for the response, ShaoFeng! I started with 2 GB of memory (the default cluster setting), and the OOM was resolved when I increased it to 4 GB.
Would you recommend using the "integer" type for a UHC dimension (3+ million values) and then using derived columns for the related (look-up) dimensions whose type is not "integer"? Does the aggregation of derived columns happen on the HBase coprocessor side? Is there any JIRA or doc I could learn from? Please suggest.

Thank You,
Shrikant Bang

On Tue, Aug 21, 2018 at 6:36 PM ShaoFeng Shi <[email protected]> wrote:

> Hi Shrikant,
>
> How much memory are you allocating to the reducer? Please consider allocating
> more memory to the reducer, as Kylin builds the dictionary in the reducers.
>
> You can also disable this; Kylin will then build the dictionary in its own JVM.
> This may cause your Kylin process to OOM if there is an ultra high cardinality
> (UHC) column.
>
> kylin.engine.mr.build-dict-in-reducer=false
>
>
> Do you know how high the cardinality of that dimension is? For a UHC column
> whose cardinality exceeds 3 million, we don't recommend using dictionary as the
> encoding. You may need to use "fixed_length" or "integer" (if the column is of
> integer type).
>
>
> 2018-08-16 16:50 GMT+08:00 Ashish Singhi <[email protected]>:
>
>> Hi Shrikant,
>>
>> Refer to http://kylin.apache.org/blog/2015/08/13/kylin-dictionary/
>> You might find it useful.
>>
>> Regards,
>> Ashish
>>
>> On Thu, Aug 16, 2018 at 10:33 AM, Shrikant Bang <[email protected]>
>> wrote:
>>
>>> Thank you, ShaoFeng & Billy, for the responses.
>>>
>>> I was able to set hierarchies in the dimensions.
>>>
>>> While building the cube, the "fact distinct column" step is failing in a
>>> reducer with an Out Of Memory exception.
>>>
>>> java.lang.OutOfMemoryError: Java heap space
>>>     at java.util.IdentityHashMap.resize(IdentityHashMap.java:471)
>>>     at java.util.IdentityHashMap.put(IdentityHashMap.java:440)
>>>     at org.apache.kylin.dict.TrieDictionaryBuilder.buildTrieBytes(TrieDictionaryBuilder.java:476)
>>>     at org.apache.kylin.dict.TrieDictionaryBuilder.build(TrieDictionaryBuilder.java:418)
>>>     at org.apache.kylin.dict.TrieDictionaryForestBuilder.build(TrieDictionaryForestBuilder.java:109)
>>>     at org.apache.kylin.dict.DictionaryGenerator$StringTrieDictForestBuilder.build(DictionaryGenerator.java:220)
>>>     at org.apache.kylin.engine.mr.steps.FactDistinctColumnsReducer.doCleanup(FactDistinctColumnsReducer.java:216)
>>>     at org.apache.kylin.engine.mr.KylinReducer.cleanup(KylinReducer.java:103)
>>>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
>>>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>>>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at javax.security.auth.Subject.doAs(Subject.java:422)
>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>>>
>>>
>>> I tried debugging and understood that the dictionary is built in the
>>> reducer's cleanup method.
>>>
>>> I am curious to learn the internals. Could you please help me with the
>>> following:
>>>
>>> 1. Any pointer/reference/JIRA for understanding how the trie (dictionary)
>>> of a dimension's values is used in the next steps?
>>>
>>> 2. Any best practices/references for tuning the "fact distinct column" job
>>> for reducers that handle high-cardinality columns? For now I am trying
>>> increased memory, as partitioning and the number of reducers depend on
>>> the number of cuboids.
>>>
>>>
>>> P.S. I am using v2.4 of Kylin with HBase 1.x
>>>
>>> Thank You,
>>> Shrikant Bang
>>>
>>> On Tue, Aug 14, 2018 at 8:33 PM ShaoFeng Shi <[email protected]>
>>> wrote:
>>>
>>>> For question 1), in the Cube's "advanced setting" step, you can specify the
>>>> cuboid whitelist to build.
>>>>
>>>> 2018-08-13 22:26 GMT+08:00 Billy Liu <[email protected]>:
>>>>
>>>>> Hello Shrikant,
>>>>>
>>>>> For 1, it seems the 4 dimensions form a hierarchy. You could
>>>>> define them as hierarchy dimensions in the Cube, and leave A as a
>>>>> mandatory dimension.
>>>>>
>>>>> For 2, select 'user_activity' as the partition column in model design.
>>>>> There are a few built-in formats; most date types are supported.
>>>>>
>>>>> With Warm regards
>>>>>
>>>>> Billy Liu
>>>>> Shrikant Bang <[email protected]> wrote on Mon, Aug 13, 2018 at 5:39 PM:
>>>>> >
>>>>> > Hi Team,
>>>>> >
>>>>> > We are doing a PoC on building OLAP cubes. Could you please help me
>>>>> > get answers to the queries below?
>>>>> >
>>>>> > Selective Cuboids:
>>>>> > We need to have selective cuboids as part of OLAP cubes.
>>>>> > Let's say we have 4 dimensions: A, B, C, D. Then we need just
>>>>> > (A,B,C,D), (A,B,C), (A,B) and (A).
>>>>> >
>>>>> > Refresh Settings:
>>>>> > How do we specify the partition column and format while building a cube
>>>>> > for a fact table?
>>>>> > e.g. user_activity is partitioned by date 'yyyy-MM-dd' and the cube
>>>>> > should be refreshed every day with the previous day's computation.
>>>>> >
>>>>> >
>>>>> > Thank You,
>>>>> > Shrikant Bang
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best regards,
>>>>
>>>> Shaofeng Shi 史少锋
>>>>
>>>>
>>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>
