Re: [HACKERS] Implied BETWEEN from join quals

2014-04-28 Thread Simon Riggs
On 28 April 2014 15:01, Robert Haas  wrote:

> Interesting.  This can pretty obviously produce a big win if things
> break our way.  But it'll take some effort to determine whether the
> range of possible values for the join column is narrow enough to make
> the inferred BETWEEN clause selective enough to be useful, and that
> effort will be wasted if it turns out that the answer is no.  I can
> certainly think of times when this would have helped me, so I kind of
> like the idea in theory ... but I won't be surprised if it turns out
> to be possible to construct realistic test cases where trying to
> figure this out causes too much increase in planning time.
>
> Actually, it seems possible to me that this could end up winning even
> if it doesn't enable the use of an index scan rather than a sequential
> scan, because eliminating rows during the scan is typically much
> cheaper than eliminating them during the join step.  But it also seems
> possible that it could lose heavily in that case, if the extra steps
> during the scan phase don't eliminate enough rows to reduce the join
> cost to a sufficient degree.  I'm not really sure how it will work out
> on balance.

I agree it's not always a win and that the planning overhead is a
little high. That's why I suggest we only attempt to calculate such a
plan if the cost of the normal plan for the scan is already high. That
way the extra planning is worth it. I'm sure there are other
optimizations that we might only wish to consider for plans that are
otherwise high cost. This need hardly touch the main planning path for
short queries at all.

The extra run-time cost of the test adds 2*cpu_operator_cost to the
scan. However, it will save at least (OldSelectivity -
NewSelectivity)*(cpu_tuple_cost + cpu_operator_cost) in the join node
directly above it (and possibly more above that - though without being
able to inspect the join tree we won't know that until later - and the
output plan might actually change the later join plan anyway). So we
can calculate the threshold at which we should choose the modified
plan, though it will likely be a conservative value.
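
As a worked illustration of that threshold (all numbers are
hypothetical, reading the savings formula as per input row, and using
the default cpu_operator_cost = 0.0025 and cpu_tuple_cost = 0.01):

-- 100000 scanned rows; join selectivity drops from 0.50 to 0.05
SELECT 2 * 0.0025 * 100000 AS extra_scan_cost,   -- = 500
       (0.50 - 0.05) * 100000 * (0.01 + 0.0025)
           AS min_join_savings;                  -- = 562.5
-- the savings exceed the extra cost, so the modified plan wins here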

The new selectivity can be calculated just as other selectivities, I
see no problem there.

So we have a proposed way to limit planning time in cases where the
benefit would be low. Plus we have a way to estimate the reduction in
cost once we add the new tests.

The real win comes when the additional tests allow us to pick up an
index we hadn't been able to use before. Given that MinMax index use
will be a win even for fairly high selectivities (because of the low
cost of index access), it looks like this optimization will play
very nicely with MinMax.
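
For example, a sketch of how the two might combine (table and index
names are hypothetical, and the USING brin syntax is an assumption
based on the form in which the MinMax work was later committed):

-- A minmax index stores only per-block-range min/max values, so even
-- a moderately selective implied BETWEEN lets the scan skip most of
-- the heap:
CREATE INDEX fact_col_minmax ON fact USING brin (col);

SELECT count(*)
FROM fact JOIN dim ON fact.col = dim.col
WHERE dim.other = 1
AND fact.col BETWEEN (SELECT min(col) FROM dim WHERE other = 1)
                 AND (SELECT max(col) FROM dim WHERE other = 1);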

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Implied BETWEEN from join quals

2014-04-28 Thread Robert Haas
On Tue, Apr 22, 2014 at 11:44 AM, Simon Riggs  wrote:
> I was recently nudged to describe an optimisation of merge
> joins/sorts. Rather than describe that, I've looked at the general
> case:
>
> There are some additional implications we may make when joining
> tables... a particularly interesting one is that
>
> SELECT *
> FROM Fact JOIN Dim on Fact.col = Dim.col
>
> can be rewritten as
>
> SELECT *
> FROM Fact JOIN Dim on Fact.col = Dim.col
> WHERE
> (
> Fact.col BETWEEN
>  (SELECT min(col) FROM Dim)
> AND
>  (SELECT max(col) FROM Dim)
> )
> AND
> (
> Dim.col BETWEEN
>  (SELECT min(col) FROM Fact)
> AND
>  (SELECT max(col) FROM Fact)
> )
>
> which also works similarly for anti/semi-joins.
>
> If we have indexes on Fact.col and Dim.col then these additional quals can
> be derived cheaply at run-time and could have an important effect on
> optimisation.

Interesting.  This can pretty obviously produce a big win if things
break our way.  But it'll take some effort to determine whether the
range of possible values for the join column is narrow enough to make
the inferred BETWEEN clause selective enough to be useful, and that
effort will be wasted if it turns out that the answer is no.  I can
certainly think of times when this would have helped me, so I kind of
like the idea in theory ... but I won't be surprised if it turns out
to be possible to construct realistic test cases where trying to
figure this out causes too much increase in planning time.

Actually, it seems possible to me that this could end up winning even
if it doesn't enable the use of an index scan rather than a sequential
scan, because eliminating rows during the scan is typically much
cheaper than eliminating them during the join step.  But it also seems
possible that it could lose heavily in that case, if the extra steps
during the scan phase don't eliminate enough rows to reduce the join
cost to a sufficient degree.  I'm not really sure how it will work out
on balance.
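
One way to see which way it goes on a given data set is to compare the
two forms directly and check in EXPLAIN ANALYZE where the rows get
eliminated (a sketch, with hypothetical table names):

EXPLAIN ANALYZE
SELECT count(*)
FROM fact JOIN dim ON fact.col = dim.col;   -- rows eliminated at the join

EXPLAIN ANALYZE
SELECT count(*)
FROM fact JOIN dim ON fact.col = dim.col
WHERE fact.col BETWEEN (SELECT min(col) FROM dim)
                   AND (SELECT max(col) FROM dim);  -- rows eliminated at the scan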

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Implied BETWEEN from join quals

2014-04-22 Thread Simon Riggs
On 22 April 2014 17:00, Stephen Frost  wrote:
> Simon,
>
> This all looks good, and at the risk of being slightly off-topic for
> this thread, I just wanted to mention..
>
> * Simon Riggs (si...@2ndquadrant.com) wrote:
>> Current proposal ends there, but there is a further optimization that
>> allows us to remove the join altogether if
>> * There is a FK between Fact and Dim
>
> It'd be great if we could start by looking for the above requirement and
> then doing join-removal when it exists and no columns from Dim are
> referenced (outside of the FK reference, which must result in 'true' or
> we've screwed something up).
>
> I had been looking at this about a month ago and just ran out of time to
> play with it, but it doesn't seem like it'd be terribly difficult to
> do..

Yeah, it was on my list. I had a few optimizer changes I've been
looking to make for a while.

Hopefully 9.5 is the year I get the chance at a running start on that
since they're never as easy as you'd hope.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Implied BETWEEN from join quals

2014-04-22 Thread Stephen Frost
Simon,

This all looks good, and at the risk of being slightly off-topic for
this thread, I just wanted to mention..

* Simon Riggs (si...@2ndquadrant.com) wrote:
> Current proposal ends there, but there is a further optimization that
> allows us to remove the join altogether if
> * There is a FK between Fact and Dim

It'd be great if we could start by looking for the above requirement and
then doing join-removal when it exists and no columns from Dim are
referenced (outside of the FK reference, which must result in 'true' or
we've screwed something up).

I had been looking at this about a month ago and just ran out of time to
play with it, but it doesn't seem like it'd be terribly difficult to
do..

Thanks,

Stephen




[HACKERS] Implied BETWEEN from join quals

2014-04-22 Thread Simon Riggs
I was recently nudged to describe an optimisation of merge
joins/sorts. Rather than describe that, I've looked at the general
case:

There are some additional implications we may make when joining
tables... a particularly interesting one is that

SELECT *
FROM Fact JOIN Dim on Fact.col = Dim.col

can be rewritten as

SELECT *
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE
(
Fact.col BETWEEN
 (SELECT min(col) FROM Dim)
AND
 (SELECT max(col) FROM Dim)
)
AND
(
Dim.col BETWEEN
 (SELECT min(col) FROM Fact)
AND
 (SELECT max(col) FROM Fact)
)

which also works similarly for anti/semi-joins.
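
For instance, a semi-join would pick up the same implied qual on its
outer side (a sketch, not part of the proposal text above; NOT EXISTS
behaves analogously for anti-joins):

SELECT *
FROM Fact
WHERE EXISTS (SELECT 1 FROM Dim WHERE Dim.col = Fact.col)
AND
(
Fact.col BETWEEN
 (SELECT min(col) FROM Dim)
AND
 (SELECT max(col) FROM Dim)
)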

If we have indexes on Fact.col and Dim.col then these additional quals can
be derived cheaply at run-time and could have an important effect on
optimisation.
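
That cheapness comes from the way min()/max() on an indexed column are
planned: each aggregate is satisfied by a single-row probe at one end
of the index, independent of table size. A sketch, assuming an
ordinary btree index:

CREATE INDEX dim_col_idx ON Dim (col);

-- planned as two one-row index-only scans (effectively LIMIT 1 from
-- each end of the index), so the bounds are cheap to derive at run-time:
SELECT min(col), max(col) FROM Dim;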

1) With various kinds of index, we would be able to use these implied
quals to further restrict the scan. Perhaps that doesn't sound very
interesting, but it is very important when solving an "outside-in"
join on a star schema, such as...

SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1

since there is no qual that can be applied directly to the Fact table,
causing us to scan the entire table.

We can rewrite this query as

EXPLAIN (ANALYZE on, TIMING off, COSTS on)
SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1
AND
(
Fact.col BETWEEN
 (SELECT min(col) FROM Dim WHERE Dim.other = 1)
AND
 (SELECT max(col) FROM Dim WHERE Dim.other = 1)
)

Note that the implied qual on the Dim table has been dropped as
uninteresting. This is because we can calculate the cost and potential
benefit of applying the rewrite, allowing us to discard one or both
implied clauses.

Note also that this has nothing to do with join order. This is solely
about making inferences using the join quals between any two tables.

2) We currently calculate the join selectivity by comparing the MCVs
of the join columns on the tables being joined, and that estimate can
often be wrong. ISTM that we could use the min() and max() values to
refine it: for example, if Fact.col spans a much wider range than
Dim.col, only the overlapping part of Fact can possibly join.

- - -

The current planner doesn't add these predicates automatically, but if
it did, it would come up with the following slightly sub-optimal
plan...

EXPLAIN (ANALYZE on, TIMING off, COSTS on)
SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1
AND
(
Fact.col BETWEEN
 (SELECT min(col) FROM Dim WHERE Dim.other = 1)
AND
 (SELECT max(col) FROM Dim WHERE Dim.other = 1)
)

 Aggregate  (cost=31.79..31.80 rows=1 width=0) (actual rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Aggregate  (cost=1.03..1.04 rows=1 width=4) (actual rows=1 loops=1)
           ->  Seq Scan on dim dim_1  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                 Filter: (other = 1)
                 Rows Removed by Filter: 1
   InitPlan 2 (returns $1)
     ->  Aggregate  (cost=1.03..1.04 rows=1 width=4) (actual rows=1 loops=1)
           ->  Seq Scan on dim dim_2  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                 Filter: (other = 1)
                 Rows Removed by Filter: 1
   ->  Merge Join  (cost=1.33..29.09 rows=250 width=0) (actual rows=10 loops=1)
         Merge Cond: (dim.col = fact.col)
         ->  Sort  (cost=1.03..1.04 rows=1 width=4) (actual rows=1 loops=1)
               Sort Key: dim.col
               Sort Method: quicksort  Memory: 25kB
               ->  Seq Scan on dim  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                     Filter: (other = 1)
                     Rows Removed by Filter: 1
         ->  Index Only Scan using fact_col_idx on fact  (cost=0.29..24.29 rows=500 width=4) (actual rows=10 loops=1)
               Index Cond: ((col >= $0) AND (col <= $1))
               Heap Fetches: 10

which is sub-optimal only because of the mis-estimation of the effect
of the min() and max(), so we would still benefit from resolving the
parameters to constants before proceeding with the main query. A
better plan would be

EXPLAIN (ANALYZE on, TIMING off, COSTS on)
SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1
AND Fact.col BETWEEN 1 AND 1

 Aggregate  (cost=2944.04..2944.05 rows=1 width=0) (actual rows=1 loops=1)
   ->  Hash Join  (cost=1.04..2819.04 rows=5 width=0) (actual rows=10 loops=1)
         Hash Cond: (fact.col = dim.col)
         ->  Seq Scan on fact  (cost=0.00..1943.00 rows=10 width=4) (actual rows=10 loops=1)
               Filter: ((col >= 1) AND (col <= 1))
         ->  Hash  (cost=1.02..1.02 rows=1 width=4) (actual rows=1 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 1kB
               ->  Seq Scan on dim  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                     Filter: (other = 1)
                     Rows Removed by Filter: 1

but we can probably live with that, as we do with other dynamic index plans.

So when can we use this?

The additional cost of adding this to the query is
* additional qual:  2 * cpu_operator_cost * rows
* ge