Hi Li Yang,

 

Can we get a better example of how to configure the JSON to define the “extended” 
measure? Also, a short description of what exactly it does and what the impact is 
on cube build and query would help.
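
To make the question concrete, here is my guess at what such a measure definition 
could look like in the cube desc JSON. This is only a sketch pieced together from 
what I could find; FACT.HOST_COL and FACT.EXT_COL are placeholder column names and 
I am not sure the syntax is exact, so please correct it:

    {
      "name": "EXT_COL_MEASURE",
      "function": {
        "expression": "EXTENDED_COLUMN",
        "parameter": {
          "type": "column",
          "value": "FACT.HOST_COL",
          "next_parameter": {
            "type": "column",
            "value": "FACT.EXT_COL"
          }
        },
        "returntype": "extendedcolumn(100)"
      }
    }

If this is roughly right, it would also help to know whether the host column has 
to be a dimension and how the returntype length should be chosen.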

 

The joint dimensions might be a good option as well, and they are not limited to 
lookup tables / the fact table, right? Let’s use this scenario:

 

    - Dimensions: date, customer, cluster1, cluster2, ..., cluster10, plus 
measures

 

   Question 1: if I define the joint dimension set {cluster1, cluster2, ..., 
cluster10},

can I still run the SQL query SELECT date, cluster1, COUNT(*) FROM fact 
GROUP BY date, cluster1 and get correct results?

Meaning I am not specifying a WHERE filter on cluster2..cluster10, nor am I 
selecting their values anywhere. But I might have a 2nd query that does the same 
grouping and COUNT(*) logic, just for cluster2, or cluster3, ..., or cluster10.

 

  Question 2: since the clusterX values depend on the date dimension, date will 
always be in the query. Should I then define the joint dimension set as 
{date, cluster1, cluster2, ..., cluster10}?

 

   Question 3: if the answer to Question 1 is that this is not a correct joint 
dimension definition, but the answer to Question 2 is that date should be part of 
the joint dimension definition when the value of clusterX depends on the date 
dimension, then I conclude that I can optimize the number of cuboids by specifying 
10 joint dimension sets:

 

                {date, cluster1}

                {date, cluster2}

                ….

                {date, cluster10}

 

                                Right ???
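
To make this concrete, here is my attempt at the aggregation_groups part of the 
cube desc JSON for the Question 2 variant (all cluster columns in one joint, date 
mandatory because it is always queried). Again only a sketch built from my reading 
of the blog post, with the column names from the example above, so please correct 
the structure if it is wrong:

    "aggregation_groups": [
      {
        "includes": ["DATE", "CUSTOMER", "CLUSTER1", "CLUSTER2", "CLUSTER3",
                     "CLUSTER4", "CLUSTER5", "CLUSTER6", "CLUSTER7", "CLUSTER8",
                     "CLUSTER9", "CLUSTER10"],
        "select_rule": {
          "hierarchy_dims": [],
          "mandatory_dims": ["DATE"],
          "joint_dims": [
            ["CLUSTER1", "CLUSTER2", "CLUSTER3", "CLUSTER4", "CLUSTER5",
             "CLUSTER6", "CLUSTER7", "CLUSTER8", "CLUSTER9", "CLUSTER10"]
          ]
        }
      }
    ]

For the Question 3 variant I would presumably need ten such joint definitions, 
which is exactly what I am not sure is allowed.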

 

Please help us understand these advanced topics. I have read 
https://kylin.apache.org/blog/2016/02/18/new-aggregation-group/, but it states 
only the following:

*  Joint rules. This is a newly introduced rule. If two or more dimensions are 
“joint”, then any valid cuboid will either contain none of these dimensions, or 
contain them all. In other words, these dimensions will always be “together”. This 
is useful when the cube designer is sure some of the dimensions will always be 
queried together. It is also a nuclear weapon for combination pruning on 
less-likely-to-use dimensions. Suppose you have 20 dimensions, the first 10 
dimensions are frequently used and the latter 10 are less likely to be used. By 
joining the latter 10 dimensions as “joint”, you’re effectively reducing the 
cuboid number from 2^20 to 2^11. Actually this is pretty much what the old 
“aggregation group” mechanism was for. If you were using it prior to Kylin v1.5, 
our metadata upgrade tool will automatically translate it to joint semantics.
By flexibly using the new aggregation group you can in theory control whatever 
cuboids to compute/skip. This can significantly reduce the computation and storage 
overhead, especially when the cube is serving a fixed dashboard, which will 
reproduce SQL queries that only require some specific cuboids. In extreme cases 
you can configure each AGG to contain only one cuboid, and a handful of AGGs will 
constitute the cuboid whitelist that you need.
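
Applying that arithmetic to the scenario above (12 dimensions: date, customer, 
cluster1..cluster10): with no pruning that is 2^12 = 4096 cuboids; joining 
cluster1..cluster10 into one “unit” leaves 3 units (date, customer, the joint), 
i.e. 2^3 = 8 cuboids. What I cannot work out myself is how the counting goes for 
the ten separate {date, clusterX} joints from Question 3 - hence the questions.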

Thank you, Richard.

 

From: Li Yang [mailto:[email protected]] 
Sent: Tuesday, June 28, 2016 7:16 AM
To: [email protected]
Subject: Re: Dimension table 300MB Limit

 

There are options to treat columns on the fact table without triggering the 
dimension explosion (just like derived). One is the "joint" dimensions introduced 
in v1.5. Another is the "extended" measure. The related documentation needs to 
catch up, however.

Yang

 

On Tue, Jun 28, 2016 at 10:09 AM, Arun Khetarpal <[email protected]> wrote:

I agree with Ric - forcing the dimension values back into the fact table may be a 
step back. 

I propose to open a Jira to track this issue (and possibly work on this) - 
Thoughts/Suggestions? 

 

Regards,

Arun

 

 

 

On 28 June 2016 at 06:51, Richard Calaba (Fishbowl) <[email protected]> wrote:

Did a little search in bin/*.sh and found setenv.sh, so I tried setting the 
KYLIN_JVM_SETTINGS environment variable to -Xms1024M -Xmx16g - this resolved my 
‘sudden’ death of the Kylin server after increasing kylin.table.snapshot.max_mb.
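
For anyone else hitting this, the change in bin/setenv.sh looks roughly like the 
following (the -Xmx16g value is just what worked on my node; your setenv.sh may 
set other flags around it):

    # bin/setenv.sh - enlarge the Kylin server JVM heap so that large
    # table snapshots (here ~700MB) fit in memory
    export KYLIN_JVM_SETTINGS="-Xms1024M -Xmx16g"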

 

So far all looks good, fingers crossed :)

 

Ric.

 

From: Richard Calaba (Fishbowl) [mailto:[email protected]] 
Sent: Monday, June 27, 2016 5:48 PM
To: [email protected]
Cc: 'Richard Calaba (Fishbowl)' <[email protected]>


Subject: RE: Dimension table 300MB Limit

 

I am facing errors in kylin.log complaining about less than 100MB available - 
then the Kylin server dies silently. The issue is caused by a high-cardinality 
dimension which requires an approx. 700MB data snapshot. I have increased the 
parameter kylin.table.snapshot.max_mb to 750 (750MB) - with this setting, Build 
Step 4 no longer complains about the snapshot being more than 300MB (the 
exception java.lang.IllegalStateException: Table snapshot should be no greater 
than 300 MB is gone), but the server dies after a while. There is plenty of free 
memory on the node where Kylin runs (more than 20GB free), so it seems to be a 
problem with Kylin's total memory limit. I didn't find a way to increase the 
Kylin memory limit so that the big snapshot won't kill the Kylin server. How to 
do that? 
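
For reference, the only config change I made for this so far is in 
conf/kylin.properties:

    # allow lookup table snapshots up to ~750MB instead of the 300MB default
    kylin.table.snapshot.max_mb=750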

 

It is urgent ! :)

 

Thanx, ric

 

From: Richard Calaba (Fishbowl) [mailto:[email protected]] 
Sent: Monday, June 27, 2016 5:23 PM
To: [email protected]
Subject: RE: Dimension table 300MB Limit

 

I have 2 scenarios:

 

1)      time-dependent attributes of customer - here it might be an option to put 
those on the fact table, as the values are derived from date and ID -> but I need 
those dimensions to be “derived” from the fact table (2 fields - date and id - 
define the value). I have 10 fields like that in the lookup table, so bringing 
those in as independent (normal) dimensions would increase the build time by 2^10 
times, right? 

 

2)      the 2nd scenario is similar - lots of attributes of the customer (which is 
the high-cardinality dimension - approx. 10 million customers) to be used as 
derived dimensions 

 

Forcing the high-cardinality dimensions into the fact table is in my opinion a 
step back - we are denormalizing the star schema. 

 

Ric.

 

From: Li Yang [mailto:[email protected]] 
Sent: Monday, June 27, 2016 3:45 PM
To: [email protected]
Subject: Re: Dimension table 300MB Limit

 

Such big dimensions are better placed on the fact table (rather than on a lookup 
table). The simplest way is to create a Hive view joining the old fact table and 
the customer table, then assign the view to be the new fact table.
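
Something like the following, just as a sketch - FACT_SALES, DIM_CUSTOMER and the 
column names are made-up placeholders, use your own schema:

    -- join the wide, high-cardinality customer attributes onto the fact rows
    CREATE VIEW V_FACT_SALES AS
    SELECT f.*,
           c.customer_name,
           c.customer_segment
    FROM   FACT_SALES f
    JOIN   DIM_CUSTOMER c ON f.customer_id = c.customer_id;
    -- then pick V_FACT_SALES as the fact table in the Kylin model, so the
    -- customer attributes become plain columns on the fact table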

 

On Tue, Jun 28, 2016 at 5:26 AM, Richard Calaba (Fishbowl) 
<[email protected]> wrote:

We have the same issue, though our size is just 700MB. So we are interested in the 
background info and in workarounds other than setting a higher snapshot limit, if 
any.

 

Ric.

 

From: Arun Khetarpal [mailto:[email protected]] 
Sent: Monday, June 27, 2016 11:55 AM
To: [email protected]
Subject: Dimension table 300MB Limit

 

Hi, 

 

We are evaluating Kylin as an Analytical Engine for OLAP. We are facing OOM issues 
when dealing with large dimensions, ~70GB (customer data) 
[kylin.table.snapshot.max_mb set to a high limit]. 

 

I guess having a dictionary this big in memory will not be a solution. Is there 
any suggested workaround? 

 

Is there any work done to get around this by the community? 

 

Regards,

Arun
