Welcome!

2018-08-28 22:55 GMT+08:00 Shrikant Bang <[email protected]>:

> Thank you, ShaoFeng, for the response!
> I have learned many internals of Kylin from this mail thread.
> I appreciate your help.
>
> Regards,
> Shrikant Bang.
>
> On Tue, Aug 28, 2018 at 7:14 PM ShaoFeng Shi <[email protected]>
> wrote:
>
>> Hi Shrikant,
>>
>> Do you mean reusing the lookup table snapshot across cubes? As you know,
>> Kylin takes snapshots of lookup tables and loads them into memory at
>> query time.
>>
>> When Kylin takes a snapshot, it checks whether the lookup table has
>> changed since the last time. If there is no change, it reuses the
>> existing snapshot. The detailed logic is in
>> SnapshotManager.buildSnapshot().
>>
>> So, from this point of view, if your calendar lookup table is stable, it
>> will be reused by multiple cubes.
>>
>> Hope this helps.
>>
>> 2018-08-28 17:04 GMT+08:00 Shrikant Bang <[email protected]>:
>>
>>> Thank you, ShaoFeng, for the response!
>>>
>>> Apart from the UHC column, we have other dimensions which will be used
>>> by multiple cubes,
>>>
>>> e.g. calendar_dimension (date, day, week, month, quarter, etc.), which
>>> is immutable.
>>>
>>> A few of the calendar's dimensions become part of the cube and a few
>>> become derived columns.
>>>
>>> Is there any way I can cache it on Kylin's node and keep using it for
>>> every other cube? It would be a kind of global cache for all cubes
>>> under a project.
>>>
>>> Thank You,
>>> Shrikant Bang.
>>>
>>> On Tue, Aug 28, 2018 at 9:05 AM ShaoFeng Shi <[email protected]>
>>> wrote:
>>>
>>>> Would you recommend using the "integer" type for a UHC (3+ million)
>>>> dimension and then having derived columns for related dimensions
>>>> (look-ups) whose type is not "integer"?
>>>>
>>>> >> This depends on the cardinality of the two columns. For example,
>>>> "user_id" and "email" are close to 1:1, so this derivation is good.
>>>> But "user_id" and "sex" is not good, because the cardinality of "sex"
>>>> is much smaller than that of "user_id", which means lots of
>>>> post-aggregation will happen after the derivation.
>>>> Usually, we suggest a ratio of at most around 10:1, but this is not
>>>> fixed; you can choose depending on your performance requirements.
>>>>
>>>> Does the derived column's aggregation happen on the HBase coprocessor
>>>> side? Any JIRA/doc for my learning?
>>>>
>>>> >> No, the derivation calculation only happens in the Kylin node; it
>>>> won't be pushed down, because the lookup table's snapshot is only
>>>> loaded in the Kylin node.
>>>>
>>>> 2018-08-27 19:00 GMT+08:00 Shrikant Bang <[email protected]>:
>>>>
>>>>> Thanks, ShaoFeng, for the response!
>>>>>
>>>>> I started with 2G of memory (the default cluster setting), and the
>>>>> OOM was resolved when memory was increased to 4G.
>>>>>
>>>>> Would you recommend using the "integer" type for a UHC (3+ million)
>>>>> dimension and then having derived columns for related dimensions
>>>>> (look-ups) whose type is not "integer"?
>>>>>
>>>>> Does the derived column's aggregation happen on the HBase
>>>>> coprocessor side? Any JIRA/doc for my learning?
>>>>>
>>>>> Please suggest.
>>>>>
>>>>> Thank You,
>>>>> Shrikant Bang
>>>>>
>>>>> On Tue, Aug 21, 2018 at 6:36 PM ShaoFeng Shi <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Shrikant,
>>>>>>
>>>>>> How much memory are you allocating to the reducer? Please consider
>>>>>> allocating more memory to the reducer, as Kylin builds the
>>>>>> dictionary in the reducers.
>>>>>>
>>>>>> You can also disable this; then Kylin will build the dictionary in
>>>>>> its own JVM. This may cause your Kylin process to run out of memory
>>>>>> if there is an ultra high cardinality (UHC) column.
>>>>>>
>>>>>> kylin.engine.mr.build-dict-in-reducer=false
>>>>>>
>>>>>> Do you know how high the cardinality of that dimension is? For a UHC
>>>>>> column whose cardinality is > 3 million, we don't recommend using
>>>>>> dictionary as the encoding. You may need to use "fixed_length" or
>>>>>> "integer" (if the column is of integer type).
>>>>>>
>>>>>> 2018-08-16 16:50 GMT+08:00 Ashish Singhi <[email protected]>:
>>>>>>
>>>>>>> Hi Shrikant,
>>>>>>>
>>>>>>> Refer to http://kylin.apache.org/blog/2015/08/13/kylin-dictionary/
>>>>>>> You might find it useful.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Ashish
>>>>>>>
>>>>>>> On Thu, Aug 16, 2018 at 10:33 AM, Shrikant Bang <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thank you, ShaoFeng & Billy, for the responses.
>>>>>>>>
>>>>>>>> I was able to set hierarchies in the dimensions.
>>>>>>>>
>>>>>>>> While building the cube, the "fact distinct column" step is
>>>>>>>> failing in a reducer with an Out Of Memory exception.
>>>>>>>>
>>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>>   at java.util.IdentityHashMap.resize(IdentityHashMap.java:471)
>>>>>>>>   at java.util.IdentityHashMap.put(IdentityHashMap.java:440)
>>>>>>>>   at org.apache.kylin.dict.TrieDictionaryBuilder.buildTrieBytes(TrieDictionaryBuilder.java:476)
>>>>>>>>   at org.apache.kylin.dict.TrieDictionaryBuilder.build(TrieDictionaryBuilder.java:418)
>>>>>>>>   at org.apache.kylin.dict.TrieDictionaryForestBuilder.build(TrieDictionaryForestBuilder.java:109)
>>>>>>>>   at org.apache.kylin.dict.DictionaryGenerator$StringTrieDictForestBuilder.build(DictionaryGenerator.java:220)
>>>>>>>>   at org.apache.kylin.engine.mr.steps.FactDistinctColumnsReducer.doCleanup(FactDistinctColumnsReducer.java:216)
>>>>>>>>   at org.apache.kylin.engine.mr.KylinReducer.cleanup(KylinReducer.java:103)
>>>>>>>>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
>>>>>>>>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>>>>>>>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>>>>>>>>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>>>>>>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>   at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>>>>>>>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>>>>>>>>
>>>>>>>> I tried debugging and understood that the dictionary is built in
>>>>>>>> the reducer's cleanup method.
>>>>>>>>
>>>>>>>> I am curious to learn the internals. Can you please help me with
>>>>>>>> the following:
>>>>>>>>
>>>>>>>> 1. Any pointer/reference/JIRA for understanding how the trie
>>>>>>>> (dictionary) of a dimension's values is used in the next steps?
>>>>>>>>
>>>>>>>> 2. Any best practices/references for tuning the "fact distinct
>>>>>>>> column" job for reducers that handle high-cardinality columns? I
>>>>>>>> am trying increased memory for now, as the partitioning and the
>>>>>>>> number of reducers depend on the number of cuboids.
>>>>>>>>
>>>>>>>> P.S. I am using Kylin v2.4 with HBase 1.x.
>>>>>>>>
>>>>>>>> Thank You,
>>>>>>>> Shrikant Bang
>>>>>>>>
>>>>>>>> On Tue, Aug 14, 2018 at 8:33 PM ShaoFeng Shi <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> For question 1), in the Cube's "advanced setting" step, you can
>>>>>>>>> specify the cuboid whitelist to build.
>>>>>>>>>
>>>>>>>>> 2018-08-13 22:26 GMT+08:00 Billy Liu <[email protected]>:
>>>>>>>>>
>>>>>>>>>> Hello Shrikant,
>>>>>>>>>>
>>>>>>>>>> For 1, it seems the 4 dimensions form a hierarchy.
>>>>>>>>>> You could define them as hierarchy dimensions in the Cube, and
>>>>>>>>>> leave A as a mandatory dimension.
>>>>>>>>>>
>>>>>>>>>> For 2, select 'user_activity' as the partition column in model
>>>>>>>>>> design. There are a few built-in formats; most date types are
>>>>>>>>>> supported.
>>>>>>>>>>
>>>>>>>>>> With Warm regards
>>>>>>>>>>
>>>>>>>>>> Billy Liu
>>>>>>>>>>
>>>>>>>>>> Shrikant Bang <[email protected]> wrote on Mon, Aug 13, 2018 at 5:39 PM:
>>>>>>>>>> >
>>>>>>>>>> > Hi Team,
>>>>>>>>>> >
>>>>>>>>>> > We are doing a PoC on building OLAP cubes. Could you please
>>>>>>>>>> > help me get answers to the queries below?
>>>>>>>>>> >
>>>>>>>>>> > Selective Cuboids:
>>>>>>>>>> > We need to have selective cuboids as part of OLAP cubes.
>>>>>>>>>> > Let's say we have 4 dimensions: A, B, C, D; then we need just
>>>>>>>>>> > (A,B,C,D), (A,B,C), (A,B) and (A).
>>>>>>>>>> >
>>>>>>>>>> > Refresh Settings:
>>>>>>>>>> > How do we specify the partition column and format while
>>>>>>>>>> > building a cube for the fact table?
>>>>>>>>>> > e.g. user_activity is partitioned by date 'yyyy-MM-dd' and the
>>>>>>>>>> > cube should be refreshed every day with the previous day's
>>>>>>>>>> > computation.
>>>>>>>>>> >
>>>>>>>>>> > Thank You,
>>>>>>>>>> > Shrikant Bang
>>>>>>>>>> >
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> Shaofeng Shi 史少锋
>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>>
>>>>>> Shaofeng Shi 史少锋
>>>>>>
>>>>
>>>> --
>>>> Best regards,
>>>>
>>>> Shaofeng Shi 史少锋
>>>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>>

--
Best regards,

Shaofeng Shi 史少锋
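
For readers following the selective-cuboid question at the start of this thread, here is a minimal sketch of why declaring A > B > C > D as a hierarchy with A mandatory leaves only (A), (A,B), (A,B,C) and (A,B,C,D). It is a toy illustration of the two pruning rules only, not Kylin's actual cuboid scheduler; the function name valid_cuboids and its parameters are hypothetical.

```python
from itertools import combinations


def valid_cuboids(hierarchy, mandatory):
    """Return the dimension combinations that survive two pruning rules.

    hierarchy: dimensions ordered from the top level down, e.g. ["A", "B", "C", "D"].
    mandatory: dimensions that must appear in every kept combination.
    (Both names are illustrative and are not part of any Kylin API.)
    """
    kept = []
    for size in range(1, len(hierarchy) + 1):
        for combo in combinations(hierarchy, size):
            # Hierarchy rule: a level is only allowed together with every
            # level above it, so the combination must be a prefix.
            is_prefix = list(combo) == hierarchy[:size]
            # Mandatory rule: every mandatory dimension must be present.
            has_mandatory = all(dim in combo for dim in mandatory)
            if is_prefix and has_mandatory:
                kept.append(combo)
    return kept


if __name__ == "__main__":
    for cuboid in valid_cuboids(["A", "B", "C", "D"], mandatory=["A"]):
        print(cuboid)
```

Of the 15 non-empty combinations of four dimensions, only the four prefixes of the hierarchy pass both checks, which matches the list requested in the original question.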
