Re: Number of cube dimensions is limited to 62?

2016-06-22 Thread Victoria Tskhay

Hello ShaoFeng,

Thanks a lot for providing detailed feedback. These are crucial insights 
for us.


The solution B also seems to be the most "scalable" one to us, as our 
data will grow in time.




Best regards
Victoria

On 22.06.2016 04:18, ShaoFeng Shi wrote:

Hi Victoria,

Thanks for the case sharing! Below are just some my cents:

Solution A is okay in this case, as the cardinality of "ab..." for each
"H"/"M"/"L" is acceptable for a dictionary; But as we know dictionary isn't
suitable for ultra high cardinality (default cap is 5million); this
solution may have bottleneck when data increases;

Solution B looks better in my opinion, it worth a try. You can even combine
more dimensions like 4 or 5 into 1;

Solution C is similar as B; as bitwise operation isn't supported, you need
convert that to a value set and then use IN clause in app level, then it
will be almost the same as B, because Kylin will convert "LIKE" to "IN"
with the values in dictionary;





2016-06-20 19:45 GMT+08:00 Victoria Tskhay 
:


Hello,

Thank you for the quick reply! Good to know.

As for the suggestions:

1) use derived columns -> unfortunately the logic of derived dimensions
doesn't match our usecase
2) use multiple cubes -> as well, unfortunately we need to have all
dimensions interconnected in one cube

We are currently evaluating the following approaches to deal with our
situation:

*A.* Use *LIKE* in a naive way:

E.g. we have 60 columns named 'a', 'b', 'c', etc. Each one can have one of
these three values: 'L', 'M', 'H' (+ NULL). So what we did is "rotate" the
whole thing in such a way that the values become column names and the
column names turn into values, all concatenated into one string(and sorted
before concat). Thus, the column names are 'L', 'M' and 'H' and values
would be like: 'acde', 'a', 'h', 'ace', 'dg', etc.

Now, when we want to get a count of *WHERE a = 'M' OR b = 'L'**OR c =
'L'*, we do *WHERE M LIKE '%a%' OR L LIKE '%b%'**OR L LIKE '%c%'*.

It looks very expensive but surprisingly for our amount of data (2 billion
rows 6 million out of which have any values in those 60 columns) currently
it works not so bad (response time p99 < 1s).

*B.* *Group several *dimensions (e.g. 3)*into one* and use something like
a *bitmap*.

Again, e.g. we have 60 columns named 'a', 'b', 'c', etc. Each one can have
one of these three values: 'L', 'M', 'H' (+ NULL). So we wanna group them
into, say, columns named 'abc', 'def', 'ghi', etc. Then the values would be
like 'L__', 'LM_', '_HL', 'HLM', etc. (The underscore means NULL.) Thus, we
can imitate *WHERE a = 'M' OR b = 'L'**OR c = 'L'* with *WHERE abc LIKE
'M__' OR abc LIKE '_L_'**OR c LIKE '__L'*. Presumably such LIKE expressions
with single-character wildcards will be easier for Kylin to handle than
those with '%'. We haven't tried that yet though.

*C.* Similar to the previous one but *replacing values with integers* and
applying *bitwise operations with bitmasks*. E.g. L would be "encoded" with
01, M - 10, H - 11. Then a string 'HL_' would look like 110100 -> 52. Then
when we want to get an 'L' in the middle ('_L_'), we could do something
like *WHERE abc & 001100 = 01*. BUT it looks like bitwise operations are
not supported neither by ANSI SQL nor by Kylin. And if they were, we are
not sure it would be faster than a dictionary.

Would you please provide us with some feedback on what you think about the
approaches described above and which one you think would actually work/be
the best?

Thank you!


Best regards
Victoria


On 18.06.2016 04:07, ShaoFeng Shi wrote:


Almost true; You can think Kylin is 64 bit, in theory it supports up to 63
dimension in one cube;

There is no plan to extend to 128 or more in near term I believe; Since in
most of the cases the dimension number wouldn't exceed 20, 64 is already
"redundant" and causing extra space;

With so many dimensions, there must be room for optimization; You can try
some ways like:
1) extract some columns to lookup tables, and create them as "derived"
dimension in the cube;
2) or create multiple cubes, each serving a part of these columns;

If you have other way, please also share with the community; Thanks;


2016-06-18 0:01 GMT+08:00 Victoria Tskhay <
victoria.tsk...@glispamedia.com>:

Hello,

It looks like the max number of dimensions in one cube is 62, is that
correct?

We would like to add more than that. That may sound crazy, I know, but we
have a special case where all the dimensions have low cardinality (3) and
the data is very sparse. We already tried with 62 dimensions and it works
great.

Is there any way to work around that limit? What would you suggest? Thank
you!



Best regards
--
Victoria Tskhay

*Java Backend Developer*I glispa GmbH

Sonnenburger Straße 73, 10437 Berlin, Germany
E victoria.tsk...@glispamedia.com 
Skype: vikatskhay I www.glispa.com 

Sitz Berlin, AG Charlottenburg HRB 114678B





--
Victoria Tskhay

*Java Backend Developer*I 

Re: Number of cube dimensions is limited to 62?

2016-06-20 Thread Victoria Tskhay

Hello,

Thank you for the quick reply! Good to know.

As for the suggestions:

1) use derived columns -> unfortunately the logic of derived dimensions 
doesn't match our usecase
2) use multiple cubes -> as well, unfortunately we need to have all 
dimensions interconnected in one cube


We are currently evaluating the following approaches to deal with our 
situation:


*A.* Use *LIKE* in a naive way:

E.g. we have 60 columns named 'a', 'b', 'c', etc. Each one can have one 
of these three values: 'L', 'M', 'H' (+ NULL). So what we did is 
"rotate" the whole thing in such a way that the values become column 
names and the column names turn into values, all concatenated into one 
string(and sorted before concat). Thus, the column names are 'L', 'M' 
and 'H' and values would be like: 'acde', 'a', 'h', 'ace', 'dg', etc.


Now, when we want to get a count of *WHERE a = 'M' OR b = 'L'**OR c = 
'L'*, we do *WHERE M LIKE '%a%' OR L LIKE '%b%'**OR L LIKE '%c%'*.


It looks very expensive but surprisingly for our amount of data (2 
billion rows 6 million out of which have any values in those 60 columns) 
currently it works not so bad (response time p99 < 1s).


*B.* *Group several *dimensions (e.g. 3)*into one* and use something 
like a *bitmap*.


Again, e.g. we have 60 columns named 'a', 'b', 'c', etc. Each one can 
have one of these three values: 'L', 'M', 'H' (+ NULL). So we wanna 
group them into, say, columns named 'abc', 'def', 'ghi', etc. Then the 
values would be like 'L__', 'LM_', '_HL', 'HLM', etc. (The underscore 
means NULL.) Thus, we can imitate *WHERE a = 'M' OR b = 'L'**OR c = 'L'* 
with *WHERE abc LIKE 'M__' OR abc LIKE '_L_'**OR c LIKE '__L'*. 
Presumably such LIKE expressions with single-character wildcards will be 
easier for Kylin to handle than those with '%'. We haven't tried that 
yet though.


*C.* Similar to the previous one but *replacing values with integers* 
and applying *bitwise operations with bitmasks*. E.g. L would be 
"encoded" with 01, M - 10, H - 11. Then a string 'HL_' would look like 
110100 -> 52. Then when we want to get an 'L' in the middle ('_L_'), we 
could do something like *WHERE abc & 001100 = 01*. BUT it looks like 
bitwise operations are not supported neither by ANSI SQL nor by Kylin. 
And if they were, we are not sure it would be faster than a dictionary.


Would you please provide us with some feedback on what you think about 
the approaches described above and which one you think would actually 
work/be the best?


Thank you!


Best regards
Victoria

On 18.06.2016 04:07, ShaoFeng Shi wrote:

Almost true; You can think Kylin is 64 bit, in theory it supports up to 63
dimension in one cube;

There is no plan to extend to 128 or more in near term I believe; Since in
most of the cases the dimension number wouldn't exceed 20, 64 is already
"redundant" and causing extra space;

With so many dimensions, there must be room for optimization; You can try
some ways like:
1) extract some columns to lookup tables, and create them as "derived"
dimension in the cube;
2) or create multiple cubes, each serving a part of these columns;

If you have other way, please also share with the community; Thanks;


2016-06-18 0:01 GMT+08:00 Victoria Tskhay :


Hello,

It looks like the max number of dimensions in one cube is 62, is that
correct?

We would like to add more than that. That may sound crazy, I know, but we
have a special case where all the dimensions have low cardinality (3) and
the data is very sparse. We already tried with 62 dimensions and it works
great.

Is there any way to work around that limit? What would you suggest? Thank
you!



Best regards
--
Victoria Tskhay

*Java Backend Developer*I glispa GmbH

Sonnenburger Straße 73, 10437 Berlin, Germany
E victoria.tsk...@glispamedia.com 
Skype: vikatskhay I www.glispa.com 

Sitz Berlin, AG Charlottenburg HRB 114678B






--
Victoria Tskhay

*Java Backend Developer*I glispa GmbH

Sonnenburger Straße 73, 10437 Berlin, Germany
E victoria.tsk...@glispamedia.com 


Skype: vikatskhay I www.glispa.com 

Sitz Berlin, AG Charlottenburg HRB 114678B


Re: Number of cube dimensions is limited to 62?

2016-06-17 Thread ShaoFeng Shi
Almost true; You can think Kylin is 64 bit, in theory it supports up to 63
dimension in one cube;

There is no plan to extend to 128 or more in near term I believe; Since in
most of the cases the dimension number wouldn't exceed 20, 64 is already
"redundant" and causing extra space;

With so many dimensions, there must be room for optimization; You can try
some ways like:
1) extract some columns to lookup tables, and create them as "derived"
dimension in the cube;
2) or create multiple cubes, each serving a part of these columns;

If you have other way, please also share with the community; Thanks;


2016-06-18 0:01 GMT+08:00 Victoria Tskhay :

> Hello,
>
> It looks like the max number of dimensions in one cube is 62, is that
> correct?
>
> We would like to add more than that. That may sound crazy, I know, but we
> have a special case where all the dimensions have low cardinality (3) and
> the data is very sparse. We already tried with 62 dimensions and it works
> great.
>
> Is there any way to work around that limit? What would you suggest? Thank
> you!
>
>
>
> Best regards
> --
> Victoria Tskhay
>
> *Java Backend Developer*I glispa GmbH
>
> Sonnenburger Straße 73, 10437 Berlin, Germany
> E victoria.tsk...@glispamedia.com  e.mail.ru/compose/?mailto=mailto%3avictoria.tsk...@glispamedia.com>
> Skype: vikatskhay I www.glispa.com 
>
> Sitz Berlin, AG Charlottenburg HRB 114678B
>



-- 
Best regards,

Shaofeng Shi


Number of cube dimensions is limited to 62?

2016-06-17 Thread Victoria Tskhay

Hello,

It looks like the max number of dimensions in one cube is 62, is that 
correct?


We would like to add more than that. That may sound crazy, I know, but 
we have a special case where all the dimensions have low cardinality (3) 
and the data is very sparse. We already tried with 62 dimensions and it 
works great.


Is there any way to work around that limit? What would you suggest? 
Thank you!




Best regards
--
Victoria Tskhay

*Java Backend Developer*I glispa GmbH

Sonnenburger Straße 73, 10437 Berlin, Germany
E victoria.tsk...@glispamedia.com 


Skype: vikatskhay I www.glispa.com 

Sitz Berlin, AG Charlottenburg HRB 114678B