Re: [HACKERS] Implied BETWEEN from join quals
On 28 April 2014 15:01, Robert Haas wrote:
> Interesting. This can pretty obviously produce a big win if things
> break our way. But it'll take some effort to determine whether the
> range of possible values for the join column is narrow enough to make
> the inferred BETWEEN clause selective enough to be useful, and that
> effort will be wasted if it turns out that the answer is no. I can
> certainly think of times when this would have helped me, so I kind of
> like the idea in theory ... but I won't be surprised if it turns out
> to be possible to construct realistic test cases where trying to
> figure this out causes too much increase in planning time.
>
> Actually, it seems possible to me that this could end up winning even
> if it doesn't enable the use of an index scan rather than a sequential
> scan, because eliminating rows during the scan is typically much
> cheaper than eliminating them during the join step. But it also seems
> possible that it could lose heavily in that case, if the extra steps
> during the scan phase don't eliminate enough rows to reduce the join
> cost to a sufficient degree. I'm not really sure how it will work out
> on balance.

I agree it's not always a win and that the planning overhead is a little high. That's why I suggest we only attempt to calculate such a plan if the cost of the normal plan for the scan is already high; that way the extra planning is worth it. I'm sure there are other optimizations we might wish to consider only for plans that are otherwise high cost. This need hardly touch the main planning path for short queries at all.

The extra run-time cost of the test adds 2*cpu_operator_cost to the scan. However, it will save at least (OldSelectivity - NewSelectivity)*(cpu_tuple_cost + cpu_operator_cost) in the join node directly above it (and possibly more above that, though without being able to inspect the join tree we won't know that until later, and the output plan might actually change the later join plan anyway).
So we can calculate the threshold at which we should choose the modified plan, though it will likely be a conservative value. The new selectivity can be calculated just as other selectivities are; I see no problem there.

So we have a proposed way to limit planning time in cases where the benefit would be low, plus a way to estimate the reduction in cost once we add the new tests. The real value is whether we can use the additional tests to pick up an index we hadn't been able to use before. Given that MinMax index use will be a win even at fairly high selectivities (because of the low cost of index access), this optimization looks likely to play very nicely with MinMax.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
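[Editor's note: a sketch of the MinMax interplay described above, using the Fact/Dim schema from the original post. MinMax indexes were later committed as BRIN; index and table names here are illustrative, not from any patch.]

```sql
-- Assumed schema from the thread's example; sizes are illustrative.
CREATE TABLE dim  (col int, other int);
CREATE TABLE fact (col int);

-- A MinMax (BRIN) index on the large Fact table is cheap to build and scan.
CREATE INDEX fact_col_minmax ON fact USING brin (col);

-- The rewritten query: the implied BETWEEN gives the planner a range qual
-- on Fact that a BRIN index can use, even though the original query had
-- no direct restriction on Fact at all.
SELECT count(*)
FROM fact JOIN dim ON fact.col = dim.col
WHERE dim.other = 1
  AND fact.col BETWEEN (SELECT min(col) FROM dim WHERE dim.other = 1)
                   AND (SELECT max(col) FROM dim WHERE dim.other = 1);
```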
Re: [HACKERS] Implied BETWEEN from join quals
On Tue, Apr 22, 2014 at 11:44 AM, Simon Riggs wrote:
> I was recently nudged to describe an optimisation of merge
> joins/sorts. Rather than describe that, I've looked at the general
> case:
>
> There are some additional implications we may make when joining
> tables... a particularly interesting one is that
>
> SELECT *
> FROM Fact JOIN Dim on Fact.col = Dim.col
>
> can be rewritten as
>
> SELECT *
> FROM Fact JOIN Dim on Fact.col = Dim.col
> WHERE
> (
>   Fact.col BETWEEN
>     (SELECT min(col) FROM Dim)
>   AND
>     (SELECT max(col) FROM Dim)
> )
> AND
> (
>   Dim.col BETWEEN
>     (SELECT min(col) FROM Fact)
>   AND
>     (SELECT max(col) FROM Fact)
> )
>
> which also works similarly for anti/semi-joins.
>
> If we have indexes on A.col and B.col then these additional quals can
> be derived cheaply at run-time and could have an important effect on
> optimisation.

Interesting. This can pretty obviously produce a big win if things break our way. But it'll take some effort to determine whether the range of possible values for the join column is narrow enough to make the inferred BETWEEN clause selective enough to be useful, and that effort will be wasted if it turns out that the answer is no. I can certainly think of times when this would have helped me, so I kind of like the idea in theory ... but I won't be surprised if it turns out to be possible to construct realistic test cases where trying to figure this out causes too much increase in planning time.

Actually, it seems possible to me that this could end up winning even if it doesn't enable the use of an index scan rather than a sequential scan, because eliminating rows during the scan is typically much cheaper than eliminating them during the join step. But it also seems possible that it could lose heavily in that case, if the extra steps during the scan phase don't eliminate enough rows to reduce the join cost to a sufficient degree. I'm not really sure how it will work out on balance.
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Implied BETWEEN from join quals
On 22 April 2014 17:00, Stephen Frost wrote:
> Simon,
>
> This all looks good, and at the risk of being slightly off-topic for
> this thread, I just wanted to mention..
>
> * Simon Riggs (si...@2ndquadrant.com) wrote:
>> Current proposal ends there, but there is a further optimization that
>> allows us to remove the join altogether if
>> * There is a FK between Fact and Dim
>
> It'd be great if we could start by looking for the above requirement and
> then doing join-removal when it exists and no columns from Dim are
> referenced (outside of the FK reference, which must result in 'true' or
> we've screwed something up).
>
> I had been looking at this about a month ago and just ran out of time to
> play with it, but it doesn't seem like it'd be terribly difficult to
> do..

Yeah, it was on my list. I have a few optimizer changes I've been looking to make for a while. Hopefully 9.5 is the year I get a running start on them, since they're never as easy as you'd hope.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Implied BETWEEN from join quals
Simon,

This all looks good, and at the risk of being slightly off-topic for this thread, I just wanted to mention..

* Simon Riggs (si...@2ndquadrant.com) wrote:
> Current proposal ends there, but there is a further optimization that
> allows us to remove the join altogether if
> * There is a FK between Fact and Dim

It'd be great if we could start by looking for the above requirement and then doing join-removal when it exists and no columns from Dim are referenced (outside of the FK reference, which must result in 'true' or we've screwed something up).

I had been looking at this about a month ago and just ran out of time to play with it, but it doesn't seem like it'd be terribly difficult to do..

Thanks,

Stephen
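[Editor's note: the FK-driven join removal Stephen describes can be sketched as follows. Schema and names are illustrative only, not taken from any patch.]

```sql
-- Illustrative schema: every fact.col must reference an existing dim row.
CREATE TABLE dim  (col int PRIMARY KEY, other int);
CREATE TABLE fact (col int NOT NULL REFERENCES dim (col));

-- No column of Dim is referenced in the output, and the FK plus NOT NULL
-- guarantee that every Fact row matches exactly one Dim row, so the inner
-- join changes neither the row count nor any output value...
SELECT count(*)
FROM fact JOIN dim ON fact.col = dim.col;

-- ...and could in principle be removed by the planner, leaving:
SELECT count(*) FROM fact;
```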
[HACKERS] Implied BETWEEN from join quals
I was recently nudged to describe an optimisation of merge joins/sorts. Rather than describe that, I've looked at the general case:

There are some additional implications we may make when joining tables... a particularly interesting one is that

SELECT *
FROM Fact JOIN Dim on Fact.col = Dim.col

can be rewritten as

SELECT *
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE
(
  Fact.col BETWEEN
    (SELECT min(col) FROM Dim)
  AND
    (SELECT max(col) FROM Dim)
)
AND
(
  Dim.col BETWEEN
    (SELECT min(col) FROM Fact)
  AND
    (SELECT max(col) FROM Fact)
)

which also works similarly for anti/semi-joins.

If we have indexes on A.col and B.col then these additional quals can be derived cheaply at run-time and could have an important effect on optimisation.

1) With various kinds of index, we would be able to use these implied quals to further restrict the scan. Perhaps that doesn't sound very interesting, but it is very important when solving an "outside-in" join on a star schema, such as...

SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1

since there is no qual that can be applied directly to the Fact table, causing us to scan the entire table. We can rewrite this query as

EXPLAIN (ANALYZE on, TIMING off, COSTS on)
SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1
AND
(
  Fact.col BETWEEN
    (SELECT min(col) FROM Dim WHERE Dim.other = 1)
  AND
    (SELECT max(col) FROM Dim WHERE Dim.other = 1)
)

Note that the implied qual on the Dim table has been dropped as uninteresting. This is because we can calculate the cost and potential benefit of applying the rewrite, allowing us to discard one or both implied clauses. Note also that this has nothing to do with join order; this is solely about making inferences using the join quals between any two tables.

2) We calculate the join selectivity by comparing the MFVs of the join columns on the tables being joined.
ISTM that we could use the min() and max() values to refine that selectivity estimate, which can often be wrong as things stand.

- - -

The current planner doesn't add these predicates automatically, but if it did, it would come up with the following slightly sub-optimal plan...

EXPLAIN (ANALYZE on, TIMING off, COSTS on)
SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1
AND
(
  Fact.col BETWEEN
    (SELECT min(col) FROM Dim WHERE Dim.other = 1)
  AND
    (SELECT max(col) FROM Dim WHERE Dim.other = 1)
)

 Aggregate  (cost=31.79..31.80 rows=1 width=0) (actual rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Aggregate  (cost=1.03..1.04 rows=1 width=4) (actual rows=1 loops=1)
           ->  Seq Scan on dim dim_1  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                 Filter: (other = 1)
                 Rows Removed by Filter: 1
   InitPlan 2 (returns $1)
     ->  Aggregate  (cost=1.03..1.04 rows=1 width=4) (actual rows=1 loops=1)
           ->  Seq Scan on dim dim_2  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                 Filter: (other = 1)
                 Rows Removed by Filter: 1
   ->  Merge Join  (cost=1.33..29.09 rows=250 width=0) (actual rows=10 loops=1)
         Merge Cond: (dim.col = fact.col)
         ->  Sort  (cost=1.03..1.04 rows=1 width=4) (actual rows=1 loops=1)
               Sort Key: dim.col
               Sort Method: quicksort  Memory: 25kB
               ->  Seq Scan on dim  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                     Filter: (other = 1)
                     Rows Removed by Filter: 1
         ->  Index Only Scan using fact_col_idx on fact  (cost=0.29..24.29 rows=500 width=4) (actual rows=10 loops=1)
               Index Cond: ((col >= $0) AND (col <= $1))
               Heap Fetches: 10

which is sub-optimal only because of the mis-estimation of the effect of the min() and max(), so we will still benefit from resolving the parameters to constants before proceeding with the main query.
A better plan would be

EXPLAIN (ANALYZE on, TIMING off, COSTS on)
SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1
AND Fact.col BETWEEN 1 AND 1

 Aggregate  (cost=2944.04..2944.05 rows=1 width=0) (actual rows=1 loops=1)
   ->  Hash Join  (cost=1.04..2819.04 rows=5 width=0) (actual rows=10 loops=1)
         Hash Cond: (fact.col = dim.col)
         ->  Seq Scan on fact  (cost=0.00..1943.00 rows=10 width=4) (actual rows=10 loops=1)
               Filter: ((col >= 1) AND (col <= 1))
         ->  Hash  (cost=1.02..1.02 rows=1 width=4) (actual rows=1 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 1kB
               ->  Seq Scan on dim  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                     Filter: (other = 1)
                     Rows Removed by Filter: 1

but we can probably live with that, as we do with other dynamic index plans.

So when can we use this? The additional cost of adding this to the query is

* additional qual: 2 * cpu_operator_cost * rows * ge
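[Editor's note: the thread never shows the setup behind these EXPLAIN plans. The sketch below is a guessed schema consistent with the plan shapes (tiny Dim, large Fact, the fact_col_idx index named in the plans); the row counts and data distribution are assumptions, not from the original mails.]

```sql
-- Tiny dimension table: exactly one row passes other = 1,
-- matching "Rows Removed by Filter: 1" in the plans above.
CREATE TABLE dim (col int, other int);
INSERT INTO dim VALUES (1, 1), (2, 2);

-- Large fact table; each col value repeats 10 times, matching
-- "actual rows=10" for col = 1 in the plans above.
CREATE TABLE fact (col int);
INSERT INTO fact
SELECT (i % 10000) + 1 FROM generate_series(1, 100000) AS s(i);

CREATE INDEX fact_col_idx ON fact (col);

ANALYZE dim;
ANALYZE fact;
```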