Re: joins and low selectivity optimization

2019-02-24 Thread Stamatis Zampetakis
We did this in the project that I am currently working on but this part is not open source so I don't have a link to give you. Στις Σάβ, 23 Φεβ 2019 στις 9:35 μ.μ., ο/η Andrei Sereda έγραψε: > > There are two tricky (but doable) parts: > > 1) pass bound variables from one convention to the

Re: joins and low selectivity optimization

2019-02-23 Thread Andrei Sereda
> There are two tricky (but doable) parts: > 1) pass bound variables from one convention to the other; > 2) push the Filter (associated with the Correlate) holding the condition on the correlated variable to elastic; Do you know if any projects (libraries) already did this ? I would like to look

Re: joins and low selectivity optimization

2019-02-22 Thread Stamatis Zampetakis
No need to mention it, I also appreciate your contributions/discussions. Block-based nested loops is not implemented but we already have tuple-based nested loop [1] and it is a good starting point. Regarding the particular query that you cited I think that even tuple-based nested loop would work

Re: joins and low selectivity optimization

2019-02-21 Thread Andrei Sereda
Hi Stamatis, As usual, appreciate your time to answer my questions. I will try to give more context below. I am not I understand why bloom-filters and block based nested loop are not possible for your use case Our destination data-source (ElasticSearch / Mongo) doesn’t support filtering based

Re: joins and low selectivity optimization

2019-02-20 Thread Stamatis Zampetakis
Hi Andrei, I am not I understand why bloom-filters and block based nested loop are not possible for your use case but I will try to provide some answers to the new questions you raised. By adding streams in the discussion I guess you add some additional limitations on one side of the join

Re: joins and low selectivity optimization

2019-02-19 Thread Andrei Sereda
Hello, I would like to resurrect this thread in the context of calcite streams. Unfortunately bloom-filters is not an option for the data-sources being used. Say one has stream to table join . >From docs example: SELECT

Re: joins and low selectivity optimization

2018-08-30 Thread Vineet Garg
Hive actually does this optimization (it is called semi-join reduction) by generating bloom-filters on one side and passing it on to the other side. This is not a rewrite but instead a physical implementation. Vineet On Aug 29, 2018, at 10:34 AM, Vladimir Sitnikov

Re: joins and low selectivity optimization

2018-08-29 Thread Vladimir Sitnikov
>Nested loops are never likely to happe What's wrong with that? Apparently Andrei asks for that, and "subquery precomputation" is quite close to nested loops in my opinion. Vladimir

Re: joins and low selectivity optimization

2018-08-29 Thread Julian Hyde
Regarding Vladimir’s ideas of Bloom filters and nested loop joins. Both are excellent if you can do them. They are fairly easy in single-node architectures (especially single-threaded) but get harder in distributed architectures. Bloom filters (also magic sets) require data to be pushed “up

Re: joins and low selectivity optimization

2018-08-29 Thread Andrei Sereda
Hi Vladimir, Thanks for follow-up and explanation. I wanted to make sure I'm not missing (mis-understanding) anything. Andrei. On Wed, Aug 29, 2018 at 11:01 AM Vladimir Sitnikov < sitnikov.vladi...@gmail.com> wrote: > One of the approaches to such queries is to throw Bloom filters all over >

Re: joins and low selectivity optimization

2018-08-29 Thread Vladimir Sitnikov
One of the approaches to such queries is to throw Bloom filters all over the place. That is it could execute "small side" of the join, collect the ids (or a lossy version of it in a form of Bloom filters), and it could propagate that Bloom filter to the second source to reduce the set of rows

Re: joins and low selectivity optimization

2018-08-29 Thread Andrei Sereda
A follow-up question. I presume this is applicable equally to joins without predicates ? select *from t1 join t2 on (t1.id = t2.id) Even if t1 has a much lower cardinality than t2 (say many orders of magnitude), Calcite would not know to convert t2 query to: select * from t2 where id in (1,

Re: joins and low selectivity optimization

2018-08-28 Thread Julian Hyde
Sure, Calcite makes use of stats in its cost formulas. And you are correct that “metadata” is what Calcite calls statistics. But you have to be careful to only treat statistics as approximate. If the statistics were gathered using an “ANALYZE TABLE” command a month ago they may be out of date,

Re: joins and low selectivity optimization

2018-08-28 Thread Andrei Sereda
Thank you, Michael and Julian, for your answers. Even if optimizers don't have access to data can they have access to table statistics ? If I remember correctly Oracle CBO is estimating selectivity based on column distribution (histograms) and some formulas for density

Re: joins and low selectivity optimization

2018-08-28 Thread Julian Hyde
If I recall correctly, Hive does this kind of optimization. It’s pretty important you have a date dimension table and your fact table is partitioned on date. Example: select * from sales join date_dim on sales.date_id = date_dim.id where sales.product_name = ‘foo' and

Re: joins and low selectivity optimization

2018-08-28 Thread Michael Mior
As far as I am aware, the optimizer has no access to data, only metadata. The traditional way to solve such problems would be to select among different join algorithms which perform better for varying cardinalities of each side of the join. Unfortunately, I think you're likely to have a tough time

joins and low selectivity optimization

2018-08-28 Thread Andrei Sereda
Hello, I’m looking for a way to improve performance of a join query. Suppose one joins two heterogeneous sources t1 and t2 with some predicates. Further assume that cardinality of one of the predicates is very low (compared cardinality of the second one). (How) Is it possible to convert second