Thank you, ShaoFeng, for the response! Apart from the UHC dimension, we have other dimensions that will be used by multiple cubes.
For example, calendar_dimension (date, day, week, month, quarter, etc.) is immutable. A few of the calendar dimension's columns become part of the cube and a few become derived columns.

Is there any way I can cache it on the Kylin node and keep reusing it for every other cube? It would be a kind of global cache for all cubes under a project.

Thank You,
Shrikant Bang

On Tue, Aug 28, 2018 at 9:05 AM ShaoFeng Shi <[email protected]> wrote:

> > Will you recommend using "integer" type for UHC (3+ millions) dimension
> > and then have derived columns for relative dimensions (look-ups) where type
> > is not "integer"?
>
> >> This depends on the cardinality of the two columns. For example,
> >> "user_id" and "email", they are close to 1:1, so this derivation is good.
> >> But "user_id" and "sex" is not good because "sex"'s cardinality is much
> >> smaller than "user_id", which means lots of post-aggregation will happen
> >> after the derivation. Usually, we suggest the relationship is less or
> >> around 10:1, but this is not fixed, you can select depends on the
> >> performance requirement.
>
> > Is derived column's aggregation happens at HBase Co-Processor side? Any
> > JIRA/doc for my learnings?
>
> >> No, derivation calculation only happens in Kylin node, won't be pushed
> >> down. Because Lookup table's snapshot is only loaded in Kylin node.
>
> 2018-08-27 19:00 GMT+08:00 Shrikant Bang <[email protected]>:
>
>> Thanks, ShaoFeng for response!
>>
>> I have started using memory 2G (default cluster setting) and OOM got
>> solved when memory increased to 4G.
>>
>> Will you recommend using "integer" type for UHC (3+ millions) dimension
>> and then have derived columns for relative dimensions (look-ups) where type
>> is not "integer"?
>>
>> Is derived column's aggregation happens at HBase Co-Processor side? Any
>> JIRA/doc for my learnings?
>>
>> please suggest.
>>
>> Thank You,
>> Shrikant Bang
>>
>> On Tue, Aug 21, 2018 at 6:36 PM ShaoFeng Shi <[email protected]> wrote:
>>
>>> Hi Shrikant,
>>>
>>> How much memory are you allocating to Reducer? Please consider to
>>> allocate more mem to reducer, as Kylin builds the dictionary in the
>>> reducers.
>>>
>>> You can also disable this, then Kylin will build dict in its own JVM.
>>> This may cause your Kylin process OOM if there is an ultra high cardinality
>>> (UHC) column.
>>>
>>> kylin.engine.mr.build-dict-in-reducer=false
>>>
>>> Do you know how high the cardinality of that dimension? For UHC which
>>> cardinality > 3 millions, we don't recommend to use dictionary as the
>>> encoding. You may need to use "fixed_length" or "integer"(if it is in type
>>> of integer).
>>>
>>> 2018-08-16 16:50 GMT+08:00 Ashish Singhi <[email protected]>:
>>>
>>>> Hi Shrikant,
>>>>
>>>> Refer http://kylin.apache.org/blog/2015/08/13/kylin-dictionary/
>>>> You might find it useful.
>>>>
>>>> Regards,
>>>> Ashish
>>>>
>>>> On Thu, Aug 16, 2018 at 10:33 AM, Shrikant Bang <[email protected]> wrote:
>>>>
>>>>> Thank you, ShaoFeng & Billy for responses.
>>>>>
>>>>> I could able to set hierarchies in dimension.
>>>>>
>>>>> While building cube, step "fact distinct column" job is failing in a
>>>>> reducer with Out Of Memory exception.
>>>>>
>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>     at java.util.IdentityHashMap.resize(IdentityHashMap.java:471)
>>>>>     at java.util.IdentityHashMap.put(IdentityHashMap.java:440)
>>>>>     at org.apache.kylin.dict.TrieDictionaryBuilder.buildTrieBytes(TrieDictionaryBuilder.java:476)
>>>>>     at org.apache.kylin.dict.TrieDictionaryBuilder.build(TrieDictionaryBuilder.java:418)
>>>>>     at org.apache.kylin.dict.TrieDictionaryForestBuilder.build(TrieDictionaryForestBuilder.java:109)
>>>>>     at org.apache.kylin.dict.DictionaryGenerator$StringTrieDictForestBuilder.build(DictionaryGenerator.java:220)
>>>>>     at org.apache.kylin.engine.mr.steps.FactDistinctColumnsReducer.doCleanup(FactDistinctColumnsReducer.java:216)
>>>>>     at org.apache.kylin.engine.mr.KylinReducer.cleanup(KylinReducer.java:103)
>>>>>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
>>>>>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>>>>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>>>>>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>>     at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>>>>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>>>>>
>>>>> I tried debugging and understood that dictionary is getting built in
>>>>> reducer's clean up method.
>>>>>
>>>>> I am curious to learn internals. Can you please help me in below :
>>>>>
>>>>>    1. Any pointer/reference/JIRA for understanding how TRIE
>>>>>    (dictionary) of dimension's value getting used in next steps?
>>>>>
>>>>>    2. Any best practice/references in tuning "fact distinct column"
>>>>>    job for those reducer which have high cardinality.
>>>>>    I am trying with increasing memory as of now as partitioning and
>>>>>    number of reducers are depends on cuboids number.
>>>>>
>>>>> P.S. I am using v2.4 of Kylin with HBase 1.x
>>>>>
>>>>> Thank You,
>>>>> Shrikant Bang
>>>>>
>>>>> On Tue, Aug 14, 2018 at 8:33 PM ShaoFeng Shi <[email protected]> wrote:
>>>>>
>>>>>> For question 1), in Cube's "advanced setting" step, you can specify
>>>>>> the cuboid whitelist to build.
>>>>>>
>>>>>> 2018-08-13 22:26 GMT+08:00 Billy Liu <[email protected]>:
>>>>>>
>>>>>>> Hello Shrikant,
>>>>>>>
>>>>>>> For 1, seems the 4 dimensions are hierarchy structure. You could
>>>>>>> define them as hierarchy dimensions in Cube, and leave A as mandatory
>>>>>>> dimension.
>>>>>>>
>>>>>>> For 2, select 'user_activity' as partition column in model design.
>>>>>>> There are a few built-in formats, most date types are supported.
>>>>>>>
>>>>>>> With Warm regards
>>>>>>>
>>>>>>> Billy Liu
>>>>>>>
>>>>>>> On Mon, Aug 13, 2018 at 5:39 PM, Shrikant Bang <[email protected]> wrote:
>>>>>>> >
>>>>>>> > Hi Team,
>>>>>>> >
>>>>>>> > We are doing a PoC on building OLAP cubes. Could you please
>>>>>>> > help me to get answer of below queries?
>>>>>>> >
>>>>>>> > Selective Cuboids:
>>>>>>> > We need to have selective cuboids as part of OLAP cubes.
>>>>>>> > Let say if we have 4 dimensions : A, B, C, D then we need just
>>>>>>> > (A,B,C,D) , (A,B,C), (A,B) and (A)
>>>>>>> >
>>>>>>> > Refresh Settings:
>>>>>>> > How to specify partition column and format while building cube for
>>>>>>> > fact table.
>>>>>>> > e.g. user_activity is partitioned by date 'yyyy-MM-dd' and cube
>>>>>>> > should be refreshed everyday with previous day's computation.
>>>>>>> >
>>>>>>> > Thank You,
>>>>>>> > Shrikant Bang
>>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>>
>>>>>> Shaofeng Shi 史少锋
>>>
>>> --
>>> Best regards,
>>>
>>> Shaofeng Shi 史少锋
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
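
For reference, the rule of thumb quoted above (a derived column works well when the host column's cardinality is within roughly 10:1 of the derived column's cardinality) can be checked on a lookup table before designing the cube. The following is only a minimal sketch in Python under stated assumptions: the function names, the `max_ratio` parameter, and the sample rows are hypothetical and are not part of Kylin, and the 10:1 threshold is the loose guideline from the thread, not a fixed limit.

```python
def derivation_ratio(rows, host_col, derived_col):
    """Return distinct(host_col) / distinct(derived_col) for the given rows.

    `rows` is a list of dicts standing in for a lookup table snapshot
    (a hypothetical representation, not a Kylin data structure).
    """
    host_values = {row[host_col] for row in rows}
    derived_values = {row[derived_col] for row in rows}
    if not derived_values:
        raise ValueError("lookup table is empty")
    return len(host_values) / len(derived_values)


def is_good_derivation(rows, host_col, derived_col, max_ratio=10.0):
    """Apply the rough 10:1 guideline; max_ratio is an adjustable assumption."""
    return derivation_ratio(rows, host_col, derived_col) <= max_ratio


# Hypothetical sample data: 1000 users, one email each, two values for "sex".
lookup = [
    {"user_id": i, "email": f"user{i}@example.com", "sex": "F" if i % 2 else "M"}
    for i in range(1000)
]

print(is_good_derivation(lookup, "user_id", "email"))  # ratio 1:1
print(is_good_derivation(lookup, "user_id", "sex"))    # ratio 500:1
```

With this sample data, "email" passes the check and "sex" does not, which matches the "user_id"/"email" versus "user_id"/"sex" example in the thread. The threshold should be tuned to the query performance requirement.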
