Re: Queries For Building Cube

ShaoFeng Shi Tue, 28 Aug 2018 06:44:57 -0700

Hi Shrikant,

Do you mean reusing the lookup table snapshot across cubes? As you know,
Kylin takes snapshots of lookup tables and loads them into memory at
query time.


When Kylin takes a snapshot, it checks whether the lookup table has
changed since the last snapshot. If it has not, the existing snapshot is
reused. The detailed logic is in SnapshotManager.buildSnapshot().

So, from this point of view, if your calendar lookup table is stable, its
snapshot will be reused by multiple cubes.
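
As an illustration only (this is not Kylin's actual code; the class and
function names below are made up, and the sample rows are invented), the
reuse check can be sketched in Python, assuming a table's "signature" is
its path, size and last-modified time:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class TableSignature:
    """Hypothetical fingerprint of a lookup table's source data."""
    path: str
    size_bytes: int
    last_modified: int


@dataclass
class SnapshotStore:
    """Hypothetical in-memory store: one snapshot per distinct signature."""
    snapshots: Dict[TableSignature, List[Tuple]] = field(default_factory=dict)
    builds: int = 0  # how many times the table was actually read

    def build_snapshot(
        self,
        signature: TableSignature,
        read_rows: Callable[[], Iterable[Tuple]],
    ) -> List[Tuple]:
        # Reuse the existing snapshot when the table has not changed.
        existing = self.snapshots.get(signature)
        if existing is not None:
            return existing
        # Otherwise read the table and remember the new snapshot.
        rows = list(read_rows())
        self.snapshots[signature] = rows
        self.builds += 1
        return rows


def read_calendar() -> List[Tuple]:
    # Invented sample rows standing in for the calendar lookup table.
    return [("2018-08-27", "Mon", 35), ("2018-08-28", "Tue", 35)]


store = SnapshotStore()
sig = TableSignature("/warehouse/calendar_dimension", 2048, 1535400000)

# Two cubes ask for a snapshot of the same, unchanged table.
snap_cube_a = store.build_snapshot(sig, read_calendar)
snap_cube_b = store.build_snapshot(sig, read_calendar)

# The table changes (new size and timestamp), so a new snapshot is built.
changed = TableSignature("/warehouse/calendar_dimension", 4096, 1535500000)
snap_after_change = store.build_snapshot(changed, read_calendar)
```

In this sketch the second cube gets the very same snapshot object as the
first, and the table is read again only after its signature changes.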

Hope this helps.

2018-08-28 17:04 GMT+08:00 Shrikant Bang <[email protected]>:

> Thank you, ShaoFeng, for the response!
>
> Apart from UHC, we have other dimensions which will be used by multiple
> cubes.
>
> e.g. calendar_dimension ( date, day, week, week, month, quarter ... etc. ),
> which is immutable.
>
> A few of the calendar's dimensions become part of a cube, and a few become
> derived columns.
>
> Is there any way I can cache it on Kylin's node and keep using it for every
> other cube? It would be a kind of global cache for all cubes under a project.
>
> Thank You,
> Shrikant Bang.
>
>
>
> On Tue, Aug 28, 2018 at 9:05 AM ShaoFeng Shi <[email protected]>
> wrote:
>
>>
>> Would you recommend using the "integer" type for a UHC (3+ million)
>> dimension and then having derived columns for the related dimensions
>> (look-ups) whose type is not "integer"?
>> >> This depends on the cardinality of the two columns. For example,
>> "user_id" and "email" are close to 1:1, so this derivation is good.
>> But deriving "sex" from "user_id" is not good, because the cardinality of
>> "sex" is much smaller than that of "user_id", which means lots of
>> post-aggregation will happen after the derivation. Usually, we suggest a
>> ratio of around 10:1 or less, but this is not fixed; you can choose
>> depending on your performance requirements.
>>
>> Does the derived column's aggregation happen on the HBase coprocessor
>> side? Is there any JIRA/doc I can learn from?
>> >> No, the derivation calculation only happens in the Kylin node; it won't
>> be pushed down, because the lookup table's snapshot is only loaded in the
>> Kylin node.
>>
>> 2018-08-27 19:00 GMT+08:00 Shrikant Bang <[email protected]>:
>>
>>> Thanks, ShaoFeng, for the response!
>>>
>>> I started with 2G of memory (the default cluster setting), and the OOM
>>> was resolved when the memory was increased to 4G.
>>>
>>> Would you recommend using the "integer" type for a UHC (3+ million)
>>> dimension and then having derived columns for the related dimensions
>>> (look-ups) whose type is not "integer"?
>>>
>>> Does the derived column's aggregation happen on the HBase coprocessor
>>> side? Is there any JIRA/doc I can learn from?
>>>
>>> Please suggest.
>>>
>>> Thank You,
>>> Shrikant Bang
>>>
>>> On Tue, Aug 21, 2018 at 6:36 PM ShaoFeng Shi <[email protected]>
>>> wrote:
>>>
>>>> Hi Shrikant,
>>>>
>>>> How much memory are you allocating to the reducer? Please consider
>>>> allocating more memory to the reducer, as Kylin builds the dictionary
>>>> in the reducers.
>>>>
>>>> You can also disable this; Kylin will then build the dictionary in its
>>>> own JVM. This may cause your Kylin process to run out of memory if there
>>>> is an ultra high cardinality (UHC) column.
>>>>
>>>> kylin.engine.mr.build-dict-in-reducer=false
>>>>
>>>>
>>>> Do you know how high the cardinality of that dimension is? For a UHC
>>>> column whose cardinality is over 3 million, we don't recommend using
>>>> dictionary as the encoding. You may need to use "fixed_length" or
>>>> "integer" (if the column is of integer type).
>>>>
>>>>
>>>> 2018-08-16 16:50 GMT+08:00 Ashish Singhi <[email protected]>:
>>>>
>>>>> Hi Shrikant,
>>>>>
>>>>> Refer to http://kylin.apache.org/blog/2015/08/13/kylin-dictionary/
>>>>> You might find it useful.
>>>>>
>>>>> Regards,
>>>>> Ashish
>>>>>
>>>>> On Thu, Aug 16, 2018 at 10:33 AM, Shrikant Bang <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thank you, ShaoFeng & Billy, for the responses.
>>>>>>
>>>>>> I was able to set hierarchies in the dimensions.
>>>>>>
>>>>>> While building the cube, the "fact distinct column" step is failing in
>>>>>> a reducer with an out-of-memory exception.
>>>>>>
>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>> at java.util.IdentityHashMap.resize(IdentityHashMap.java:471)
>>>>>> at java.util.IdentityHashMap.put(IdentityHashMap.java:440)
>>>>>> at org.apache.kylin.dict.TrieDictionaryBuilder.buildTrieBytes(TrieDictionaryBuilder.java:476)
>>>>>> at org.apache.kylin.dict.TrieDictionaryBuilder.build(TrieDictionaryBuilder.java:418)
>>>>>> at org.apache.kylin.dict.TrieDictionaryForestBuilder.build(TrieDictionaryForestBuilder.java:109)
>>>>>> at org.apache.kylin.dict.DictionaryGenerator$StringTrieDictForestBuilder.build(DictionaryGenerator.java:220)
>>>>>> at org.apache.kylin.engine.mr.steps.FactDistinctColumnsReducer.doCleanup(FactDistinctColumnsReducer.java:216)
>>>>>> at org.apache.kylin.engine.mr.KylinReducer.cleanup(KylinReducer.java:103)
>>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
>>>>>> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>>>>>>
>>>>>>
>>>>>> I tried debugging and understood that the dictionary is built in the
>>>>>> reducer's cleanup method.
>>>>>>
>>>>>> I am curious to learn the internals. Can you please help me with the
>>>>>> questions below:
>>>>>>
>>>>>>   1.  Any pointer/reference/JIRA for understanding how the trie
>>>>>> (dictionary) of a dimension's values is used in the next steps?
>>>>>>
>>>>>>   2.  Any best practices/references for tuning the "fact distinct
>>>>>> column" job for those reducers which handle high cardinality? I am
>>>>>> trying to increase memory for now, as the partitioning and the number
>>>>>> of reducers depend on the number of cuboids.
>>>>>>
>>>>>>
>>>>>> P.S. I am using v2.4 of Kylin with HBase 1.x
>>>>>>
>>>>>> Thank You,
>>>>>> Shrikant Bang
>>>>>>
>>>>>> On Tue, Aug 14, 2018 at 8:33 PM ShaoFeng Shi <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> For question 1), in Cube's "advanced setting" step, you can specify
>>>>>>> the cuboid whitelist to build.
>>>>>>>
>>>>>>> 2018-08-13 22:26 GMT+08:00 Billy Liu <[email protected]>:
>>>>>>>
>>>>>>>> Hello Shrikant,
>>>>>>>>
>>>>>>>> For 1, it seems the 4 dimensions form a hierarchy. You could
>>>>>>>> define them as hierarchy dimensions in the cube, and leave A as a
>>>>>>>> mandatory dimension.
>>>>>>>>
>>>>>>>> For 2, select 'user_activity' as partition column in model design.
>>>>>>>> There are a few built-in formats, most date types are supported.
>>>>>>>>
>>>>>>>> With Warm regards
>>>>>>>>
>>>>>>>> Billy Liu
>>>>>>>> On Mon, Aug 13, 2018 at 5:39 PM, Shrikant Bang <[email protected]> wrote:
>>>>>>>> >
>>>>>>>> > Hi Team,
>>>>>>>> >
>>>>>>>> > We are doing a PoC on building OLAP cubes. Could you please help
>>>>>>>> > me get answers to the queries below?
>>>>>>>> >
>>>>>>>> > Selective Cuboids:
>>>>>>>> > We need to have selective cuboids as part of OLAP cubes.
>>>>>>>> > Let's say we have 4 dimensions: A, B, C, D; then we need just
>>>>>>>> > (A,B,C,D), (A,B,C), (A,B) and (A).
>>>>>>>> >
>>>>>>>> > Refresh Settings:
>>>>>>>> > How do we specify the partition column and format while building a
>>>>>>>> > cube for a fact table?
>>>>>>>> > e.g. user_activity is partitioned by date 'yyyy-MM-dd' and the cube
>>>>>>>> > should be refreshed every day with the previous day's computation.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Thank You,
>>>>>>>> > Shrikant Bang
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Shaofeng Shi 史少锋
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best regards,
>>>>
>>>> Shaofeng Shi 史少锋
>>>>
>>>>
>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>>
>>


-- 
Best regards,

Shaofeng Shi 史少锋
