Re: Employ bloom filters in joins
It sounds like a Runtime Filter[1], which is commonly used by many systems. As Stamatis mentioned, integrating it into the cost model is much more challenging than implementing the rule. Fortunately, we can refer to the practices of other systems. [1] https://www.google.com.hk/search?q=Runtime+Filter=1C5GCEA_enCN948CN948=Runtime+Filter=chrome..69i57j0i512j0i30l6j0i15i30j0i5i30.367j0j7=chrome=UTF-8 Best, Chunwei On Sat, Apr 29, 2023 at 9:32 PM Stamatis Zampetakis wrote: > The topic is really interesting, thanks for sharing your ideas Zoltan! > > I see no drawbacks adding the new transformation rule; definitely worth > having! However, adding them to the default rule set or using them in a > cost based decision may require much more work/thinking. > > Calcite's built-in cost model is pretty basic and does not account for > parallelism / concurrency etc. Any rule that adds more operations to the > plan is gonna make the cost worse so in the end the new plan may never get > selected. > > The proposed rewrite rule is really close to the semi-join reduction > technique. I would say that introducing rules and optimizing queries with > semi-join reducers [1] is a good starting point before moving to more > complicated plans with aggregations and specialized UDFs (bloom/cuckoo > etc). Simpler primitives are also more likely to be adopted by > downstream projects. > > Regarding the part of constructing and passing the bloom filter around we > may be able to come up with an alternative design without the use of > additional scans/joins by exploiting correlation variables. I haven't > thought this all the way through but binding things on the left side and > passing them to the right side seems something that could be relevant to > this use-case. > > Best, > Stamatis > > [1] Integrating semi-join-reducers into state-of-the-art query processors ( > https://db.in.tum.de/research/publications/conferences/semijoin.pdf) > > On Sat, Apr 29, 2023, 3:54 AM Julian Hyde wrote: > > > It would be great to have such a rule. People who don’t want it can > > disable it; and people who enable it can use a cost function. > > > > Some systems that use Bloom filters (and other probabilistic filters) > > don’t execute the query twice but use a side-channel to send the Bloom > > filter from one scan to the other. For example, suppose that the “dept" > > table is smaller and its scan finishes faster. When the scan has > finished, > > it sends the Bloom filter to the “emp" scan, which is still under way. > From > > that point, the “emp” scan can eliminate a fraction of its rows because > it > > knows that their “deptno” values do not pass the filter. > > > > Julian > > > > > > > On Apr 28, 2023, at 9:01 AM, Zoltan Haindrich wrote: > > > > > > Hi, > > > > > > I was wondering about the pros and cons of having a Calcite rule which > > could rewrite a join to utilize bloom filters; something like: > > > > > > select e.* > > > from emp e > > > join dept d on(e.deptno=d.deptno); > > > where d.dname='Sales'; > > > > > > into something like: > > > > > > select e.* > > > from ( > > > select e.* from emp e join ( > > > select bloom_sketch(deptno) as sketch > from > > dept dname='Sales' > > > ) dept_agg on (bloom_contains(sketch,e.deptno) > > > ) e > > > join dept d on(e.deptno=d.deptno) > > > where d.dname='Sales'; > > > > > > Generally for the original query: > > > * if "dept" is very small a mapjoin is used which is great > > > * or possibly some nested loops with index usages on the big table > > > * but if the execution engine decides to use a non-specialized approach > > like merge-join or hash-join; it may move around a lot of data - and in > > those cases this might be usefull > > > > > > There are systems which handle this by introducing a bloom filter > (Hive; > > Spark) and transfer that in the background for the big-table readers - > but > > that's outside the scope of the planner. I was wondering if it would be > > beneficial or not to introduce such a rule - so that using this can be a > > cost-based decision during planning. > > > > > > pro: > > > * to enable an engine to support this optimization - it would only need > > to implement a few UDFs > > > * the rule could put the use of this optimization under cost-based > > decision > > > > > > con: > > > * an extra scan of the small table > > > * it adds an extra join + aggregate computation > > > * exec engine will most likely exploit that its just a single row > > > * I guess without proper stats this could even worsen things > > > * it could put more stress on (join) planning - as it could introduce > > more joins > > > > > > What do you guys think? > > > > > > cheers, > > > Zoltan > > > > > > > > > > >
Re: Employ bloom filters in joins
The topic is really interesting, thanks for sharing your ideas Zoltan! I see no drawbacks adding the new transformation rule; definitely worth having! However, adding them to the default rule set or using them in a cost based decision may require much more work/thinking. Calcite's built-in cost model is pretty basic and does not account for parallelism / concurrency etc. Any rule that adds more operations to the plan is gonna make the cost worse so in the end the new plan may never get selected. The proposed rewrite rule is really close to the semi-join reduction technique. I would say that introducing rules and optimizing queries with semi-join reducers [1] is a good starting point before moving to more complicated plans with aggregations and specialized UDFs (bloom/cuckoo etc). Simpler primitives are also more likely to be adopted by downstream projects. Regarding the part of constructing and passing the bloom filter around we may be able to come up with an alternative design without the use of additional scans/joins by exploiting correlation variables. I haven't thought this all the way through but binding things on the left side and passing them to the right side seems something that could be relevant to this use-case. Best, Stamatis [1] Integrating semi-join-reducers into state-of-the-art query processors ( https://db.in.tum.de/research/publications/conferences/semijoin.pdf) On Sat, Apr 29, 2023, 3:54 AM Julian Hyde wrote: > It would be great to have such a rule. People who don’t want it can > disable it; and people who enable it can use a cost function. > > Some systems that use Bloom filters (and other probabilistic filters) > don’t execute the query twice but use a side-channel to send the Bloom > filter from one scan to the other. For example, suppose that the “dept" > table is smaller and its scan finishes faster. When the scan has finished, > it sends the Bloom filter to the “emp" scan, which is still under way. From > that point, the “emp” scan can eliminate a fraction of its rows because it > knows that their “deptno” values do not pass the filter. > > Julian > > > > On Apr 28, 2023, at 9:01 AM, Zoltan Haindrich wrote: > > > > Hi, > > > > I was wondering about the pros and cons of having a Calcite rule which > could rewrite a join to utilize bloom filters; something like: > > > > select e.* > > from emp e > > join dept d on(e.deptno=d.deptno); > > where d.dname='Sales'; > > > > into something like: > > > > select e.* > > from ( > > select e.* from emp e join ( > > select bloom_sketch(deptno) as sketch from > dept dname='Sales' > > ) dept_agg on (bloom_contains(sketch,e.deptno) > > ) e > > join dept d on(e.deptno=d.deptno) > > where d.dname='Sales'; > > > > Generally for the original query: > > * if "dept" is very small a mapjoin is used which is great > > * or possibly some nested loops with index usages on the big table > > * but if the execution engine decides to use a non-specialized approach > like merge-join or hash-join; it may move around a lot of data - and in > those cases this might be usefull > > > > There are systems which handle this by introducing a bloom filter (Hive; > Spark) and transfer that in the background for the big-table readers - but > that's outside the scope of the planner. I was wondering if it would be > beneficial or not to introduce such a rule - so that using this can be a > cost-based decision during planning. > > > > pro: > > * to enable an engine to support this optimization - it would only need > to implement a few UDFs > > * the rule could put the use of this optimization under cost-based > decision > > > > con: > > * an extra scan of the small table > > * it adds an extra join + aggregate computation > > * exec engine will most likely exploit that its just a single row > > * I guess without proper stats this could even worsen things > > * it could put more stress on (join) planning - as it could introduce > more joins > > > > What do you guys think? > > > > cheers, > > Zoltan > > > > > >
Re: Employ bloom filters in joins
It would be great to have such a rule. People who don’t want it can disable it; and people who enable it can use a cost function. Some systems that use Bloom filters (and other probabilistic filters) don’t execute the query twice but use a side-channel to send the Bloom filter from one scan to the other. For example, suppose that the “dept" table is smaller and its scan finishes faster. When the scan has finished, it sends the Bloom filter to the “emp" scan, which is still under way. From that point, the “emp” scan can eliminate a fraction of its rows because it knows that their “deptno” values do not pass the filter. Julian > On Apr 28, 2023, at 9:01 AM, Zoltan Haindrich wrote: > > Hi, > > I was wondering about the pros and cons of having a Calcite rule which could > rewrite a join to utilize bloom filters; something like: > > select e.* > from emp e > join dept d on(e.deptno=d.deptno); > where d.dname='Sales'; > > into something like: > > select e.* > from ( > select e.* from emp e join ( > select bloom_sketch(deptno) as sketch from dept > dname='Sales' > ) dept_agg on (bloom_contains(sketch,e.deptno) > ) e > join dept d on(e.deptno=d.deptno) > where d.dname='Sales'; > > Generally for the original query: > * if "dept" is very small a mapjoin is used which is great > * or possibly some nested loops with index usages on the big table > * but if the execution engine decides to use a non-specialized approach like > merge-join or hash-join; it may move around a lot of data - and in those > cases this might be usefull > > There are systems which handle this by introducing a bloom filter (Hive; > Spark) and transfer that in the background for the big-table readers - but > that's outside the scope of the planner. I was wondering if it would be > beneficial or not to introduce such a rule - so that using this can be a > cost-based decision during planning. > > pro: > * to enable an engine to support this optimization - it would only need to > implement a few UDFs > * the rule could put the use of this optimization under cost-based decision > > con: > * an extra scan of the small table > * it adds an extra join + aggregate computation > * exec engine will most likely exploit that its just a single row > * I guess without proper stats this could even worsen things > * it could put more stress on (join) planning - as it could introduce more > joins > > What do you guys think? > > cheers, > Zoltan > >
Employ bloom filters in joins
Hi, I was wondering about the pros and cons of having a Calcite rule which could rewrite a join to utilize bloom filters; something like: select e.* from emp e join dept d on(e.deptno=d.deptno); where d.dname='Sales'; into something like: select e.* from ( select e.* from emp e join ( select bloom_sketch(deptno) as sketch from dept dname='Sales' ) dept_agg on (bloom_contains(sketch,e.deptno) ) e join dept d on(e.deptno=d.deptno) where d.dname='Sales'; Generally for the original query: * if "dept" is very small a mapjoin is used which is great * or possibly some nested loops with index usages on the big table * but if the execution engine decides to use a non-specialized approach like merge-join or hash-join; it may move around a lot of data - and in those cases this might be usefull There are systems which handle this by introducing a bloom filter (Hive; Spark) and transfer that in the background for the big-table readers - but that's outside the scope of the planner. I was wondering if it would be beneficial or not to introduce such a rule - so that using this can be a cost-based decision during planning. pro: * to enable an engine to support this optimization - it would only need to implement a few UDFs * the rule could put the use of this optimization under cost-based decision con: * an extra scan of the small table * it adds an extra join + aggregate computation * exec engine will most likely exploit that its just a single row * I guess without proper stats this could even worsen things * it could put more stress on (join) planning - as it could introduce more joins What do you guys think? cheers, Zoltan OpenPGP_signature Description: OpenPGP digital signature