Re: Employ bloom filters in joins

2023-05-04 Thread Chunwei Lei
It sounds like a Runtime Filter[1], which is commonly used by many systems.

As Stamatis mentioned, integrating it into the cost model is much more
challenging than implementing the rule. Fortunately, we can refer to the
practices of other systems.



[1]
https://www.google.com.hk/search?q=Runtime+Filter=1C5GCEA_enCN948CN948=Runtime+Filter=chrome..69i57j0i512j0i30l6j0i15i30j0i5i30.367j0j7=chrome=UTF-8

Best,
Chunwei


On Sat, Apr 29, 2023 at 9:32 PM Stamatis Zampetakis 
wrote:

> The topic is really interesting, thanks for sharing your ideas Zoltan!
>
> I see no drawbacks adding the new transformation rule; definitely worth
> having! However, adding them to the default rule set or using them in a
> cost based decision may require much more work/thinking.
>
> Calcite's built-in cost model is pretty basic and does not account for
> parallelism / concurrency etc. Any rule that adds more operations to the
> plan is gonna make the cost worse so in the end the new plan may never get
> selected.
>
> The proposed rewrite rule is really close to the semi-join reduction
> technique. I would say that introducing rules and optimizing queries with
> semi-join reducers [1] is a good starting point before moving to more
> complicated plans with aggregations and specialized UDFs (bloom/cuckoo
> etc). Simpler primitives are also more likely to be adopted by
> downstream projects.
>
> Regarding the part of constructing and passing the bloom filter around we
> may be able to come up with an alternative design without the use of
> additional scans/joins by exploiting correlation variables. I haven't
> thought this all the way through but binding things on the left side and
> passing them to the right side seems something that could be relevant to
> this use-case.
>
> Best,
> Stamatis
>
> [1] Integrating semi-join-reducers into state-of-the-art query processors (
> https://db.in.tum.de/research/publications/conferences/semijoin.pdf)
>
> On Sat, Apr 29, 2023, 3:54 AM Julian Hyde  wrote:
>
> > It would be great to have such a rule. People who don’t want it can
> > disable it; and people who enable it can use a cost function.
> >
> > Some systems that use Bloom filters (and other probabilistic filters)
> > don’t execute the query twice but use a side-channel to send the Bloom
> > filter from one scan to the other. For example, suppose that the “dept"
> > table is smaller and its scan finishes faster. When the scan has
> finished,
> > it sends the Bloom filter to the “emp" scan, which is still under way.
> From
> > that point, the “emp” scan can eliminate a fraction of its rows because
> it
> > knows that their “deptno” values do not pass the filter.
> >
> > Julian
> >
> >
> > > On Apr 28, 2023, at 9:01 AM, Zoltan Haindrich  wrote:
> > >
> > > Hi,
> > >
> > > I was wondering about the pros and cons of having a Calcite rule which
> > could rewrite a join to utilize bloom filters; something like:
> > >
> > > select e.*
> > >   from emp e
> > >   join dept d on(e.deptno=d.deptno);
> > >   where d.dname='Sales';
> > >
> > > into something like:
> > >
> > > select e.*
> > >   from (
> > >   select e.* from emp e join (
> > >   select bloom_sketch(deptno) as sketch
> from
> > dept dname='Sales'
> > >   ) dept_agg on (bloom_contains(sketch,e.deptno)
> > >   ) e
> > >   join dept d on(e.deptno=d.deptno)
> > >   where d.dname='Sales';
> > >
> > > Generally for the original query:
> > > * if "dept" is very small a mapjoin is used which is great
> > > * or possibly some nested loops with index usages on the big table
> > > * but if the execution engine decides to use a non-specialized approach
> > like merge-join or hash-join; it may move around a lot of data - and in
> > those cases this might be usefull
> > >
> > > There are systems which handle this by introducing a bloom filter
> (Hive;
> > Spark) and transfer that in the background for the big-table readers -
> but
> > that's outside the scope of the planner. I was wondering if it would be
> > beneficial or not to introduce such a rule - so that using this can be a
> > cost-based decision during planning.
> > >
> > > pro:
> > > * to enable an engine to support this optimization - it would only need
> > to implement a few UDFs
> > > * the rule could put the use of this optimization under cost-based
> > decision
> > >
> > > con:
> > > * an extra scan of the small table
> > > * it adds an extra join + aggregate computation
> > >  * exec engine will most likely exploit that its just a single row
> > > * I guess without proper stats this could even worsen things
> > > * it could put more stress on (join) planning - as it could introduce
> > more joins
> > >
> > > What do you guys think?
> > >
> > > cheers,
> > > Zoltan
> > >
> > >
> >
> >
>


Re: Employ bloom filters in joins

2023-04-29 Thread Stamatis Zampetakis
The topic is really interesting, thanks for sharing your ideas Zoltan!

I see no drawbacks adding the new transformation rule; definitely worth
having! However, adding them to the default rule set or using them in a
cost based decision may require much more work/thinking.

Calcite's built-in cost model is pretty basic and does not account for
parallelism / concurrency etc. Any rule that adds more operations to the
plan is gonna make the cost worse so in the end the new plan may never get
selected.

The proposed rewrite rule is really close to the semi-join reduction
technique. I would say that introducing rules and optimizing queries with
semi-join reducers [1] is a good starting point before moving to more
complicated plans with aggregations and specialized UDFs (bloom/cuckoo
etc). Simpler primitives are also more likely to be adopted by
downstream projects.

Regarding the part of constructing and passing the bloom filter around we
may be able to come up with an alternative design without the use of
additional scans/joins by exploiting correlation variables. I haven't
thought this all the way through but binding things on the left side and
passing them to the right side seems something that could be relevant to
this use-case.

Best,
Stamatis

[1] Integrating semi-join-reducers into state-of-the-art query processors (
https://db.in.tum.de/research/publications/conferences/semijoin.pdf)

On Sat, Apr 29, 2023, 3:54 AM Julian Hyde  wrote:

> It would be great to have such a rule. People who don’t want it can
> disable it; and people who enable it can use a cost function.
>
> Some systems that use Bloom filters (and other probabilistic filters)
> don’t execute the query twice but use a side-channel to send the Bloom
> filter from one scan to the other. For example, suppose that the “dept"
> table is smaller and its scan finishes faster. When the scan has finished,
> it sends the Bloom filter to the “emp" scan, which is still under way. From
> that point, the “emp” scan can eliminate a fraction of its rows because it
> knows that their “deptno” values do not pass the filter.
>
> Julian
>
>
> > On Apr 28, 2023, at 9:01 AM, Zoltan Haindrich  wrote:
> >
> > Hi,
> >
> > I was wondering about the pros and cons of having a Calcite rule which
> could rewrite a join to utilize bloom filters; something like:
> >
> > select e.*
> >   from emp e
> >   join dept d on(e.deptno=d.deptno);
> >   where d.dname='Sales';
> >
> > into something like:
> >
> > select e.*
> >   from (
> >   select e.* from emp e join (
> >   select bloom_sketch(deptno) as sketch from
> dept dname='Sales'
> >   ) dept_agg on (bloom_contains(sketch,e.deptno)
> >   ) e
> >   join dept d on(e.deptno=d.deptno)
> >   where d.dname='Sales';
> >
> > Generally for the original query:
> > * if "dept" is very small a mapjoin is used which is great
> > * or possibly some nested loops with index usages on the big table
> > * but if the execution engine decides to use a non-specialized approach
> like merge-join or hash-join; it may move around a lot of data - and in
> those cases this might be usefull
> >
> > There are systems which handle this by introducing a bloom filter (Hive;
> Spark) and transfer that in the background for the big-table readers - but
> that's outside the scope of the planner. I was wondering if it would be
> beneficial or not to introduce such a rule - so that using this can be a
> cost-based decision during planning.
> >
> > pro:
> > * to enable an engine to support this optimization - it would only need
> to implement a few UDFs
> > * the rule could put the use of this optimization under cost-based
> decision
> >
> > con:
> > * an extra scan of the small table
> > * it adds an extra join + aggregate computation
> >  * exec engine will most likely exploit that its just a single row
> > * I guess without proper stats this could even worsen things
> > * it could put more stress on (join) planning - as it could introduce
> more joins
> >
> > What do you guys think?
> >
> > cheers,
> > Zoltan
> >
> >
>
>


Re: Employ bloom filters in joins

2023-04-28 Thread Julian Hyde
It would be great to have such a rule. People who don’t want it can disable it; 
and people who enable it can use a cost function.

Some systems that use Bloom filters (and other probabilistic filters) don’t 
execute the query twice but use a side-channel to send the Bloom filter from 
one scan to the other. For example, suppose that the “dept" table is smaller 
and its scan finishes faster. When the scan has finished, it sends the Bloom 
filter to the “emp" scan, which is still under way. From that point, the “emp” 
scan can eliminate a fraction of its rows because it knows that their “deptno” 
values do not pass the filter.

Julian


> On Apr 28, 2023, at 9:01 AM, Zoltan Haindrich  wrote:
> 
> Hi,
> 
> I was wondering about the pros and cons of having a Calcite rule which could 
> rewrite a join to utilize bloom filters; something like:
> 
> select e.*
>   from emp e
>   join dept d on(e.deptno=d.deptno);
>   where d.dname='Sales';
> 
> into something like:
> 
> select e.*
>   from (
>   select e.* from emp e join (
>   select bloom_sketch(deptno) as sketch from dept 
> dname='Sales'
>   ) dept_agg on (bloom_contains(sketch,e.deptno)
>   ) e
>   join dept d on(e.deptno=d.deptno)
>   where d.dname='Sales';
> 
> Generally for the original query:
> * if "dept" is very small a mapjoin is used which is great
> * or possibly some nested loops with index usages on the big table
> * but if the execution engine decides to use a non-specialized approach like 
> merge-join or hash-join; it may move around a lot of data - and in those 
> cases this might be usefull
> 
> There are systems which handle this by introducing a bloom filter (Hive; 
> Spark) and transfer that in the background for the big-table readers - but 
> that's outside the scope of the planner. I was wondering if it would be 
> beneficial or not to introduce such a rule - so that using this can be a 
> cost-based decision during planning.
> 
> pro:
> * to enable an engine to support this optimization - it would only need to 
> implement a few UDFs
> * the rule could put the use of this optimization under cost-based decision
> 
> con:
> * an extra scan of the small table
> * it adds an extra join + aggregate computation
>  * exec engine will most likely exploit that its just a single row
> * I guess without proper stats this could even worsen things
> * it could put more stress on (join) planning - as it could introduce more 
> joins
> 
> What do you guys think?
> 
> cheers,
> Zoltan
> 
> 



Employ bloom filters in joins

2023-04-28 Thread Zoltan Haindrich

Hi,

I was wondering about the pros and cons of having a Calcite rule which could 
rewrite a join to utilize bloom filters; something like:

select e.*
from emp e
join dept d on(e.deptno=d.deptno);
where d.dname='Sales';

into something like:

select e.*
from (
select e.* from emp e join (
select bloom_sketch(deptno) as sketch from dept 
dname='Sales'
) dept_agg on (bloom_contains(sketch,e.deptno)
) e
join dept d on(e.deptno=d.deptno)
where d.dname='Sales';

Generally for the original query:
* if "dept" is very small a mapjoin is used which is great
* or possibly some nested loops with index usages on the big table
* but if the execution engine decides to use a non-specialized approach like merge-join or hash-join; it may move around a lot of data - and in those cases this might be 
usefull


There are systems which handle this by introducing a bloom filter (Hive; Spark) and transfer that in the background for the big-table readers - but that's outside the scope 
of the planner. I was wondering if it would be beneficial or not to introduce such a rule - so that using this can be a cost-based decision during planning.


pro:
* to enable an engine to support this optimization - it would only need to 
implement a few UDFs
* the rule could put the use of this optimization under cost-based decision

con:
* an extra scan of the small table
* it adds an extra join + aggregate computation
  * exec engine will most likely exploit that its just a single row
* I guess without proper stats this could even worsen things
* it could put more stress on (join) planning - as it could introduce more joins

What do you guys think?

cheers,
Zoltan




OpenPGP_signature
Description: OpenPGP digital signature