Hi Shrikant,

Do you mean reusing the lookup table snapshot across cubes? As you know,
Kylin takes snapshots of lookup tables and loads them into memory at query time.
When Kylin takes a snapshot, it checks whether the lookup table has changed
since the last snapshot. If nothing has changed, it reuses the existing
snapshot. The detailed logic is in SnapshotManager.buildSnapshot().

So, from this point of view, if your calendar lookup table is stable, its
snapshot will be reused by multiple cubes.

Hope this helps.

2018-08-28 17:04 GMT+08:00 Shrikant Bang <[email protected]>:

> Thank you, ShaoFeng, for the response!
>
> Apart from UHC, we have other dimensions that will be used by multiple
> cubes.
>
> e.g. calendar_dimension (date, day, week, month, quarter, etc.), which is
> immutable.
>
> A few of the calendar dimension's columns become part of the cube and a
> few become derived columns.
>
> Is there any way I can cache it in Kylin's node and keep using it for every
> other cube? It would be a kind of global cache for all cubes under a project.
>
> Thank You,
> Shrikant Bang.
>
> On Tue, Aug 28, 2018 at 9:05 AM ShaoFeng Shi <[email protected]>
> wrote:
>
>>
>> Will you recommend using "integer" type for UHC (3+ millions) dimension
>> and then have derived columns for relative dimensions (look-ups) where type
>> is not "integer"?
>>
>> This depends on the cardinality of the two columns. For example,
>> "user_id" and "email" are close to 1:1, so this derivation is good.
>> But "user_id" and "sex" is not good, because the cardinality of "sex" is
>> much smaller than that of "user_id", which means lots of post-aggregation
>> will happen after the derivation. Usually, we suggest a ratio of around
>> 10:1 or less, but this is not fixed; you can choose depending on your
>> performance requirements.
>>
>> Is derived column's aggregation happens at HBase Co-Processor side? Any
>> JIRA/doc for my learnings?
>>
>> No, the derivation calculation only happens in the Kylin node and won't be
>> pushed down, because the lookup table's snapshot is only loaded in the
>> Kylin node.
>>
>> 2018-08-27 19:00 GMT+08:00 Shrikant Bang <[email protected]>:
>>
>>> Thanks, ShaoFeng, for the response!
>>>
>>> I started with 2G of memory (the default cluster setting), and the OOM
>>> was solved when memory was increased to 4G.
>>>
>>> Would you recommend using the "integer" type for a UHC (3+ million)
>>> dimension and then having derived columns for related dimensions
>>> (look-ups) where the type is not "integer"?
>>>
>>> Does a derived column's aggregation happen on the HBase Co-Processor
>>> side? Any JIRA/doc for my learning?
>>>
>>> Please suggest.
>>>
>>> Thank You,
>>> Shrikant Bang
>>>
>>> On Tue, Aug 21, 2018 at 6:36 PM ShaoFeng Shi <[email protected]>
>>> wrote:
>>>
>>>> Hi Shrikant,
>>>>
>>>> How much memory are you allocating to the reducer? Please consider
>>>> allocating more memory to the reducer, as Kylin builds the dictionary in
>>>> the reducers.
>>>>
>>>> You can also disable this; Kylin will then build the dictionary in its
>>>> own JVM. This may cause your Kylin process to OOM if there is an ultra
>>>> high cardinality (UHC) column.
>>>>
>>>> kylin.engine.mr.build-dict-in-reducer=false
>>>>
>>>> Do you know how high the cardinality of that dimension is? For a UHC
>>>> column whose cardinality is > 3 million, we don't recommend using
>>>> dictionary as the encoding. You may need to use "fixed_length" or
>>>> "integer" (if the column is of integer type).
>>>>
>>>> 2018-08-16 16:50 GMT+08:00 Ashish Singhi <[email protected]>:
>>>>
>>>>> Hi Shrikant,
>>>>>
>>>>> Refer to http://kylin.apache.org/blog/2015/08/13/kylin-dictionary/
>>>>> You might find it useful.
>>>>>
>>>>> Regards,
>>>>> Ashish
>>>>>
>>>>> On Thu, Aug 16, 2018 at 10:33 AM, Shrikant Bang <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thank you, ShaoFeng & Billy, for the responses.
>>>>>>
>>>>>> I was able to set hierarchies in the dimensions.
>>>>>>
>>>>>> While building the cube, the "fact distinct column" step is failing in
>>>>>> a reducer with an Out Of Memory exception.
>>>>>>
>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>     at java.util.IdentityHashMap.resize(IdentityHashMap.java:471)
>>>>>>     at java.util.IdentityHashMap.put(IdentityHashMap.java:440)
>>>>>>     at org.apache.kylin.dict.TrieDictionaryBuilder.buildTrieBytes(TrieDictionaryBuilder.java:476)
>>>>>>     at org.apache.kylin.dict.TrieDictionaryBuilder.build(TrieDictionaryBuilder.java:418)
>>>>>>     at org.apache.kylin.dict.TrieDictionaryForestBuilder.build(TrieDictionaryForestBuilder.java:109)
>>>>>>     at org.apache.kylin.dict.DictionaryGenerator$StringTrieDictForestBuilder.build(DictionaryGenerator.java:220)
>>>>>>     at org.apache.kylin.engine.mr.steps.FactDistinctColumnsReducer.doCleanup(FactDistinctColumnsReducer.java:216)
>>>>>>     at org.apache.kylin.engine.mr.KylinReducer.cleanup(KylinReducer.java:103)
>>>>>>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
>>>>>>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>>>>>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>>>>>>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>>>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>>>     at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>>>>>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>>>>>>
>>>>>> I tried debugging and understood that the dictionary is built in the
>>>>>> reducer's cleanup method.
>>>>>>
>>>>>> I am curious to learn the internals. Can you please help me with the
>>>>>> questions below:
>>>>>>
>>>>>> 1. Any pointer/reference/JIRA for understanding how the TRIE
>>>>>> (dictionary) of a dimension's values is used in the next steps?
>>>>>>
>>>>>> 2. Any best practices/references for tuning the "fact distinct column"
>>>>>> job for those reducers that have high cardinality.
>>>>>> I am trying with increased memory as of now, as partitioning and the
>>>>>> number of reducers depend on the number of cuboids.
>>>>>>
>>>>>> P.S. I am using v2.4 of Kylin with HBase 1.x
>>>>>>
>>>>>> Thank You,
>>>>>> Shrikant Bang
>>>>>>
>>>>>> On Tue, Aug 14, 2018 at 8:33 PM ShaoFeng Shi <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> For question 1), in the Cube's "advanced setting" step, you can
>>>>>>> specify the cuboid whitelist to build.
>>>>>>>
>>>>>>> 2018-08-13 22:26 GMT+08:00 Billy Liu <[email protected]>:
>>>>>>>
>>>>>>>> Hello Shrikant,
>>>>>>>>
>>>>>>>> For 1, it seems the 4 dimensions form a hierarchy. You could
>>>>>>>> define them as hierarchy dimensions in the Cube, and leave A as a
>>>>>>>> mandatory dimension.
>>>>>>>>
>>>>>>>> For 2, select 'user_activity' as the partition column in model design.
>>>>>>>> There are a few built-in formats; most date types are supported.
>>>>>>>>
>>>>>>>> With Warm regards
>>>>>>>>
>>>>>>>> Billy Liu
>>>>>>>>
>>>>>>>> On Mon, Aug 13, 2018 at 5:39 PM, Shrikant Bang <[email protected]> wrote:
>>>>>>>> >
>>>>>>>> > Hi Team,
>>>>>>>> >
>>>>>>>> > We are doing a PoC on building OLAP cubes. Could you please
>>>>>>>> > help me get answers to the queries below?
>>>>>>>> >
>>>>>>>> > Selective Cuboids:
>>>>>>>> > We need to have selective cuboids as part of OLAP cubes.
>>>>>>>> > Let's say we have 4 dimensions: A, B, C, D. Then we need just
>>>>>>>> > (A,B,C,D), (A,B,C), (A,B) and (A).
>>>>>>>> >
>>>>>>>> > Refresh Settings:
>>>>>>>> > How do we specify the partition column and format while building
>>>>>>>> > a cube for the fact table?
>>>>>>>> > e.g. user_activity is partitioned by date 'yyyy-MM-dd' and the cube
>>>>>>>> > should be refreshed every day with the previous day's computation.
>>>>>>>> >
>>>>>>>> > Thank You,
>>>>>>>> > Shrikant Bang
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Shaofeng Shi 史少锋
>>>>>>>
>>>>>
>>>>
>>>> --
>>>> Best regards,
>>>>
>>>> Shaofeng Shi 史少锋
>>>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>>

--
Best regards,

Shaofeng Shi 史少锋
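
For readers who want a picture of the snapshot reuse described at the top of this thread, here is a minimal Java sketch of the general idea: fingerprint the lookup table, and only build a new snapshot when the fingerprint differs from the one stored. This is not Kylin's SnapshotManager code. The class SnapshotReuseSketch, the TableSignature fields (path, size, last-modified time) and all method names are hypothetical and chosen only for illustration; the real check lives in SnapshotManager.buildSnapshot() and may work differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical illustration of "reuse the snapshot if the table is unchanged". */
class SnapshotReuseSketch {

    /** Assumed fingerprint of a lookup table's source data. */
    static final class TableSignature {
        final String path;
        final long sizeBytes;
        final long lastModified;

        TableSignature(String path, long sizeBytes, long lastModified) {
            this.path = path;
            this.sizeBytes = sizeBytes;
            this.lastModified = lastModified;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof TableSignature)) return false;
            TableSignature other = (TableSignature) o;
            return path.equals(other.path)
                    && sizeBytes == other.sizeBytes
                    && lastModified == other.lastModified;
        }

        @Override
        public int hashCode() {
            return Objects.hash(path, sizeBytes, lastModified);
        }
    }

    /** Stand-in for a materialized snapshot; real snapshots hold table rows. */
    static final class Snapshot {
        final TableSignature signature;
        final int buildNumber;

        Snapshot(TableSignature signature, int buildNumber) {
            this.signature = signature;
            this.buildNumber = buildNumber;
        }
    }

    private final Map<String, Snapshot> latestByTable = new HashMap<>();
    private int buildCount = 0;

    /** Returns the stored snapshot when the signature matches, else builds a new one. */
    Snapshot buildSnapshot(String tableName, TableSignature current) {
        Snapshot existing = latestByTable.get(tableName);
        if (existing != null && existing.signature.equals(current)) {
            return existing; // table unchanged: every cube shares this snapshot
        }
        buildCount++;
        Snapshot fresh = new Snapshot(current, buildCount);
        latestByTable.put(tableName, fresh);
        return fresh;
    }

    int buildCount() {
        return buildCount;
    }

    public static void main(String[] args) {
        SnapshotReuseSketch manager = new SnapshotReuseSketch();
        TableSignature stable = new TableSignature("/warehouse/calendar_dimension", 4096L, 1000L);

        // Two cubes request a snapshot of the same, unchanged calendar table.
        Snapshot forCubeA = manager.buildSnapshot("calendar_dimension", stable);
        Snapshot forCubeB = manager.buildSnapshot("calendar_dimension", stable);

        System.out.println("same snapshot reused: " + (forCubeA == forCubeB));
        System.out.println("snapshots built: " + manager.buildCount());
    }
}
```

In this sketch an immutable table such as calendar_dimension is materialized once, and later requests from other cubes get the same object back. A changed size or modification time triggers a fresh build.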
