[HACKERS] Join consolidation / Removing duplicate joins

2014-09-17 Thread David Rowley
I've been looking into improving the performance of queries that have
correlated subqueries, such as:

SELECT id,(SELECT value FROM t2 WHERE t2.id = t1.id) FROM t1;

where currently we produce a plan that executes the subquery as a SubPlan,
like:

                                QUERY PLAN
--------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..8456925.00 rows=1000000 width=4) (actual
time=0.139..4071.598 rows=1000000 loops=1)
   SubPlan 1
     ->  Index Scan using t2_pkey on t2  (cost=0.42..8.44 rows=1 width=4)
(actual time=0.003..0.003 rows=1 loops=1000000)
           Index Cond: (id = t1.id)
 Planning time: 0.169 ms
 Execution time: 4107.809 ms

Though, if the subquery could be proved to only ever return 1 record, then
this could be rewritten to become:

explain analyze SELECT t1.id,t2.value FROM t1 LEFT OUTER JOIN t2 ON t1.id =
t2.id;
                                QUERY PLAN
--------------------------------------------------------------------------
 Hash Left Join  (cost=30832.00..70728.00 rows=1000000 width=8) (actual
time=384.337..1452.387 rows=1000000 loops=1)
   Hash Cond: (t1.id = t2.id)
   ->  Seq Scan on t1  (cost=0.00..14425.00 rows=1000000 width=4) (actual
time=0.015..199.716 rows=1000000 loops=1)
   ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8) (actual
time=382.387..382.387 rows=1000000 loops=1)
         Buckets: 131072  Batches: 16  Memory Usage: 2463kB
         ->  Seq Scan on t2  (cost=0.00..14425.00 rows=1000000 width=8)
(actual time=0.010..179.911 rows=1000000 loops=1)
 Planning time: 0.396 ms
 Execution time: 1473.392 ms
(8 rows)

(notice the performance increase)

Of course, you might ask: why not just write the 2nd query in the first
place? Good idea, but it's not always that simple; take the following
UPDATE statement:

explain update t1
set value = (select value from t2 where t1.id=t2.id),
  value2 = (select value2 from t2 where t1.id = t2.id)
where exists(select 1 from t2 where t1.id=t2.id);

We end up with quite a few extra subqueries where 1 join would probably
have done the trick. We could probably have written this using the
UPDATE ... FROM syntax (sketched below), but if you like to write standard
SQL then that might not fly.
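
For reference, a sketch of that non-standard form (PostgreSQL's
UPDATE ... FROM; this assumes a unique index on t2.id, so the join cannot
duplicate t1 rows):

-- one join replaces all three correlated subqueries; t1 rows with no
-- match in t2 are simply not updated, mirroring the EXISTS condition
update t1
set value = t2.value,
    value2 = t2.value2
from t2
where t1.id = t2.id;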

Anyway... I've been thinking of writing some code that converts these
subplans into left joins where it can be proved that the subquery would
produce at most 1 row. But, as in the UPDATE statement above, I didn't
really want the code to create a new join for each subplan that it pulls
into the outer query; it would be nice if the code could detect when a
suitable join is already there, and make use of it.

I started to think about what an efficient way to do this might be, and
thought maybe we could add a bool flag to RelOptInfo, named something like
uniquejoin, which would be set to true if we could detect that a unique
index existed on the relation whose columns are a subset of the join
condition (similar to the requirement for LEFT JOIN removals). We could
then have a function that tries to pull each of these subplans up into the
outer query whenever it can prove that at most 1 subplan row could exist
for each outer plan row, converting each one into a LEFT OUTER JOIN on the
outer query, as sketched below.
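
To illustrate, after that pullup step the UPDATE from above would
conceptually carry one left join per former subplan (shown as a SELECT for
readability; the aliases t2_a and t2_b are hypothetical):

-- one LEFT JOIN per former scalar SubPlan, plus the semi-join from the
-- EXISTS; all three joins have identical conditions against t2
select t1.id, t2_a.value, t2_b.value2
from t1
left join t2 as t2_a on t1.id = t2_a.id
left join t2 as t2_b on t1.id = t2_b.id
where exists (select 1 from t2 where t1.id = t2.id);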

A bit later in planning, once all the subqueries are planned or pulled up,
a join consolidation function could be run that merges all of these
duplicate joins into 1 join per relation. It could do this by just looking
at relations that have this uniquejoin flag set to true, and then digging
deeper to ensure that the join conditions also match. If we also went to
the trouble of setting this flag for outer rels, and not just the ones we
add from this pullup operation, then we could also merge duplicate INNER
JOINs too.
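
Continuing the sketch, consolidation would then collapse the three
duplicate joins above into one:

-- the unique index on t2(id) proves each join matches the same single
-- row per t1 row; merged with the semi-join, one plain join remains
select t1.id, t2.value, t2.value2
from t1
join t2 on t1.id = t2.id;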

Another example of where removing these duplicated joins could be useful
is when 2 views that share a non-empty intersection of their relations get
joined, and the join conditions are the same, for example:
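
A sketch of that case (the tables and views here are hypothetical):

-- both views join customers on the same condition
create view order_names as
  select o.id, c.name
  from orders o join customers c on c.id = o.customer_id;

create view order_regions as
  select o.id, c.region
  from orders o join customers c on c.id = o.customer_id;

-- joining the views repeats the customers join; if customers.id is
-- unique, the two identical joins could be consolidated into one
select n.id, n.name, r.region
from order_names n join order_regions r on n.id = r.id;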

I kind of thought that perhaps all of this extra processing to look for
unique indexes might just be more processing than it's worth, as the
success cases for join consolidation might be few, but then I started
looking at the executor code around merge and hash join. It seems that
there might be some performance gains in here too. I see that with SEMI
joins we move to the next outer row once we've found 1 matching inner row.
I believe that with these uniquejoin relations we could also skip to the
next outer row, the same as with semi joins. I did some quick hacks in the
merge join code to test whether there was much performance to be gained
here, and found that on a join condition involving 2 varchar fields of
about 16 chars, I could get the query to run in 97% of the time
(saving 1 comparison on the join 
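
A sketch of the kind of setup that exercises this (the table names and the
two-column key are hypothetical, chosen to match the test described above):

-- the primary key covers both join columns, so each outer row matches
-- at most one inner row; after the first match, a merge or hash join
-- could skip to the next outer row, just as a SEMI join does
create table t3 (a varchar(16), b varchar(16), primary key (a, b));
create table t4 (a varchar(16), b varchar(16));

select * from t4 join t3 on t4.a = t3.a and t4.b = t3.b;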

Re: [HACKERS] Join consolidation / Removing duplicate joins

2014-09-17 Thread Marti Raudsepp
On Wed, Sep 17, 2014 at 2:00 PM, David Rowley <dgrowle...@gmail.com> wrote:
> Anyway... I've been thinking of writing some code that converts these sub
> plans into left joins where it can be proved that the subquery would only at
> most produce 1 row
>
> Does anyone have any thoughts on this?

+1, I've thought about this part before. There is already precedent for
inlining FROM clause subqueries into the main query; it would be nice to
do that for correlated subqueries as well. It seems we've been adding
features to the planner without fully exploiting opportunities for
normalization and consolidation of optimization techniques.

I think it's not even necessary to prove uniqueness of the subquery as
you describe. Now that 9.3 added LATERAL, a correlated subquery can be
seen as a special case of LATERAL LEFT JOIN with an additional check to
raise an error if more than 1 row is returned from the inner side. And
you could optionally elide the error check if you can prove uniqueness.
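
To sketch the rewrite (the alias x is arbitrary):

-- the correlated scalar subquery from upthread...
SELECT id, (SELECT value FROM t2 WHERE t2.id = t1.id) FROM t1;

-- ...viewed as a LATERAL LEFT JOIN; a runtime check would still have to
-- raise an error on more than one inner row, unless uniqueness of t2.id
-- can be proven
SELECT t1.id, x.value
FROM t1
LEFT JOIN LATERAL (SELECT value FROM t2 WHERE t2.id = t1.id) x ON true;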

Advantages:
1. Sufficiently simple lateral subqueries are already normalized into
ordinary JOINs with hash/merge support, so you would get that for free
(probably requires eliding the 1-row check).
2. We get rid of silliness like the explosion of SubPlan nodes for
each reference (see examples below).
3. Based on some naive testing, it seems that 9.5devel performs
slightly better with NestLoop LATERAL subqueries than SubPlan
correlated ones.
4. EXPLAIN output is easier to read, I find.

I suppose EXISTS/IN with correlated subqueries needs some different
treatment, as it can currently take advantage of the hashed SubPlan
optimization.
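
For instance, a sketch of a query that relies on that optimization today
(NOT IN is my example here, not from the thread; whether the hashed form
is chosen depends on work_mem and the planner):

-- NOT IN can't be turned into an anti-join because of its NULL
-- semantics, so it typically runs as a hashed SubPlan: t2 is scanned
-- once, its ids hashed, and each t1 row probes the hash table
EXPLAIN SELECT * FROM t1 WHERE id NOT IN (SELECT id FROM t2);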

Can anyone see any downsides? Perhaps one day we can get rid of
SubPlan entirely, would anyone miss it?


Example of SubPlan explosion:

regression=# create view foo1 as select *, (select ten as ten2 from
tenk2 where tenk1.unique1=tenk2.unique1) from tenk1;

regression=# explain analyze select * from foo1 where ten2 between 1 and 3;
 Seq Scan on tenk1  (cost=0.00..175782.08 rows= width=244) (actual
time=0.052..49.288 rows=3000 loops=1)
   Filter: (((SubPlan 2) >= 1) AND ((SubPlan 3) <= 3))
   Rows Removed by Filter: 7000
   SubPlan 1
     ->  Index Scan using tenk2_unique1 on tenk2  (cost=0.29..8.30
rows=1 width=4) (actual time=0.001..0.002 rows=1 loops=3000)
           Index Cond: (tenk1.unique1 = unique1)
   SubPlan 2
     ->  Index Scan using tenk2_unique1 on tenk2 tenk2_1
(cost=0.29..8.30 rows=1 width=4) (actual time=0.002..0.002 rows=1
loops=10000)
           Index Cond: (tenk1.unique1 = unique1)
   SubPlan 3
     ->  Index Scan using tenk2_unique1 on tenk2 tenk2_2
(cost=0.29..8.30 rows=1 width=4) (actual time=0.001..0.002 rows=1
loops=9000)
           Index Cond: (tenk1.unique1 = unique1)
 Execution time: 49.508 ms


LATERAL is a win even when using OFFSET 0 to prevent inlining:

regression=# create view foo3 as select * from tenk1 left join lateral
(select ten as ten2 from tenk2 where tenk1.unique1=tenk2.unique1
offset 0) x on true;

regression=# explain analyze select * from foo3 where ten2 between 1 and 3;
 Nested Loop  (cost=0.29..83733.00 rows=1 width=248) (actual
time=0.043..28.963 rows=3000 loops=1)
   -  Seq Scan on tenk1  (cost=0.00..458.00 rows=1 width=244)
(actual time=0.008..1.177 rows=1 loops=1)
   -  Subquery Scan on x  (cost=0.29..8.32 rows=1 width=4) (actual
time=0.002..0.002 rows=0 loops=1)
 Filter: ((x.ten2 = 1) AND (x.ten2 = 3))
 Rows Removed by Filter: 1
 -  Index Scan using tenk2_unique1 on tenk2  (cost=0.29..8.30
rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=1)
   Index Cond: (tenk1.unique1 = unique1)
 Execution time: 29.186 ms


And if you could prove uniqueness of the inner side and inline it,
WHERE clauses can also be pushed down trivially:

regression=# create view foo2 as select * from tenk1 left join lateral
(select ten as ten2 from tenk2 where tenk1.unique1=tenk2.unique1) x on
true;

regression=# explain analyze select * from foo2 where ten2 between 1 and 3;
 Hash Join  (cost=532.50..1083.00 rows=3000 width=248) (actual
time=1.848..4.480 rows=3000 loops=1)
   Hash Cond: (tenk1.unique1 = tenk2.unique1)
   ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)
(actual time=0.002..0.617 rows=10000 loops=1)
   ->  Hash  (cost=495.00..495.00 rows=3000 width=8) (actual
time=1.837..1.837 rows=3000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 118kB
         ->  Seq Scan on tenk2  (cost=0.00..495.00 rows=3000 width=8)
(actual time=0.004..1.562 rows=3000 loops=1)
               Filter: ((ten >= 1) AND (ten <= 3))
               Rows Removed by Filter: 7000
 Execution time: 4.591 ms


Regards,
Marti

