Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2016-04-07 Thread Simon Riggs
On 7 April 2016 at 12:23, Simon Riggs  wrote:


> For 0002
>

For find_best_foreign_key_quals() how can this ever match 2 FKs with
different keys?
The fkrel references the foreignrel, which has a single PK. How can the FK
have a different number of columns to the PK?
Assuming that is possible/makes sense, I don't see why it matters whether
we take the FK with the most keys or the least keys, and that choice isn't
documented.
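
For what it's worth, two FKs with different key counts can both match the
same join when the referenced table has more than one unique constraint,
since an FK may reference any unique constraint, not just the PK. A
hypothetical illustration:

CREATE TABLE parent (a INT, b INT, PRIMARY KEY (a, b), UNIQUE (a));

CREATE TABLE child (a INT, b INT,
    FOREIGN KEY (a)    REFERENCES parent (a),
    FOREIGN KEY (a, b) REFERENCES parent (a, b));

-- a join on both columns matches the 1-key FK and the 2-key FK:
-- SELECT * FROM child JOIN parent USING (a, b);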

This also affects your comments in clauselist_join_selectivity() about how
we handle overlapping matches from FKs in different directions. If we're
going to prove that there is a 1:1 match, why does that matter? That
section of code looks like too much. I would be happy, for now, with
dealing correctly and simply with the common case where just one FK
matches in either direction.

Also, I see that the effects of this on outer joins are not documented.
Earlier you mentioned you hadn't thought of outer joins, so I need to check
whether outer joins are not handled at all, or whether we are assuming that
we can still use an estimate even if there is an outer join present.
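
For illustration, the kind of query in question (a sketch, reusing the
f/d1 schema from Tomas's examples elsewhere in the thread):

-- each d1 row matches exactly one f row via the FK, but f rows with no
-- match in d1 are NULL-extended, so the estimate needs extra care here:
EXPLAIN SELECT * FROM f LEFT JOIN d1
    ON (f.id1 = d1.id1 AND f.id2 = d1.id2);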

Thoughts?

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2016-04-07 Thread Tomas Vondra

Hi,

On 04/07/2016 01:23 PM, Simon Riggs wrote:

On 7 April 2016 at 00:15, Tomas Vondra wrote:


Right. Fixed.


0001 committed. I added comments and a fastpath when no triggers are
present.


For 0002, I would be more comfortable adding
   enable_fk_plans = on (true) | off

even if we decided to remove that parameter before release.


Attached is 0002 with the GUC added. I certainly believe we should think 
twice before keeping the GUC in the final release - it's definitely 
useful for testing etc., but I'm afraid it'd be just another tuning knob 
for regular users.
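
For testing it behaves like any other planner GUC (a sketch, reusing the
f/d1 schema from my earlier examples):

SET enable_fk_plans = off;
EXPLAIN SELECT * FROM f JOIN d1 USING (id1, id2);  -- estimate as before
SET enable_fk_plans = on;
EXPLAIN SELECT * FROM f JOIN d1 USING (id1, id2);  -- FK-based estimate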


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0002-use-fkeys-in-join-estimation.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2016-04-07 Thread Simon Riggs
On 7 April 2016 at 00:15, Tomas Vondra  wrote:


> Right. Fixed.


0001 committed. I added comments and a fastpath when no triggers are
present.


For 0002, I would be more comfortable adding
   enable_fk_plans = on (true) | off

even if we decided to remove that parameter before release.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2016-04-06 Thread Tomas Vondra

Hi,

attached is the patch split into two parts, as proposed by Simon. 0001 
just adds the stuff to relcache, 0002 actually uses it for estimation.


On 04/04/2016 12:03 PM, Amit Langote wrote:

On 2016/04/04 17:25, Simon Riggs wrote:

The rel cache code you're adding uses a flag called "rd_fkeyvalid"
which indicates that the relcache is correctly filled. That is
confusing, since it has nothing to do with the concept of
constraint validity. We should rename that to rd_fkeycachefilled or
similar.


Maybe I'm missing something, but is a separate bool required at all
in this case? Wouldn't simply doing the following suffice?

/* Quick exit if we already computed the list. */
if (relation->rd_fkeylist)
return list_copy(relation->rd_fkeylist);

ISTM, rd_fkeyvalid is modeled on rd_indexvalid, where the latter
serves to convey more info than simply whether the index list is
valid or not, so the extra field is justified.


I think you're right. I've copied the logic for indexes, but clearly for 
foreign keys we don't need this flag. I've removed it.




Also, it seems the patch forgot to update RelationDestroyRelation().


Right. Fixed.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-add-foreign-key-info-into-relcache.patch
Description: binary/octet-stream


0002-use-fkeys-in-join-estimation.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2016-04-04 Thread Amit Langote
On 2016/04/04 17:25, Simon Riggs wrote:
> The rel cache code you're adding uses a flag called "rd_fkeyvalid" which
> indicates that the relcache is correctly filled. That is confusing, since
> it has nothing to do with the concept of constraint validity. We should
> rename that to rd_fkeycachefilled or similar.

Maybe I'm missing something, but is a separate bool required at all in
this case?  Wouldn't simply doing the following suffice?

/* Quick exit if we already computed the list. */
if (relation->rd_fkeylist)
return list_copy(relation->rd_fkeylist);

ISTM, rd_fkeyvalid is modeled on rd_indexvalid, where the latter serves to
convey more info than simply whether the index list is valid or not, so
the extra field is justified.

Also, it seems the patch forgot to update RelationDestroyRelation().

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2016-04-04 Thread Simon Riggs
On 3 April 2016 at 22:44, Simon Riggs  wrote:


> Detailed comments in the planner part of the patch. The discussion around
> this patch isn't reflected enough in the patch.
>

I think we should record that the planner uses the constraint, even if the
constraint is not yet valid, per DDL.

The rel cache code you're adding uses a flag called "rd_fkeyvalid" which
indicates that the relcache is correctly filled. That is confusing, since
it has nothing to do with the concept of constraint validity. We should
rename that to rd_fkeycachefilled or similar.

ISTM that the FKey info added to the rel cache would be useful for many
optimizations, hence why I think we should commit that separately, whether
or not the specific optimization for the other half of the patch is
accepted or later modified or revoked. Objections?

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2016-04-03 Thread Simon Riggs
On 3 April 2016 at 22:09, Tomas Vondra  wrote:

> On 04/03/2016 10:06 PM, Simon Riggs wrote:
>
>> On 14 March 2016 at 19:42, Tomas Vondra wrote:
>>
>> ...
>
>>
>>
>> I'd like to split this into 2 patches
>> 1) Add FK info to relcache
>> 2) use FK info in planner
>>
>> Would the credit for this be 1) Tomas, 2) Tomas + David ?
>>
>
> I could split the patch like that, but I don't quite see the point. The
> two parts will not overlap at all, so it won't make reviews any simpler.
> The only reason for such a split seems to be the different credits, but
> would we actually commit the pieces independently? Not sure about that.


Oh sorry. The reason for the split was that adding the FK info to the
relcache is a very solid addition, whereas we might imagine some churn
around the planner aspects.

>> I'd be inclined to see a little more explanatory docs on this.
>
> That's probably a good idea. Do you have any particular place for the docs
> in mind?


Detailed comments in the planner part of the patch. The discussion around
this patch isn't reflected enough in the patch.

>> Have we done any tests on planning overhead for cases where multiple
>> FKs exist?
>
> I have done some benchmarks initially, and haven't measured any noticeable
> impact. But the code changed since then, so I'll redo that and post some
> results.


Thanks

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2016-04-03 Thread Tomas Vondra

On 04/03/2016 10:06 PM, Simon Riggs wrote:

On 14 March 2016 at 19:42, Tomas Vondra wrote:


...



I'd like to split this into 2 patches
1) Add FK info to relcache
2) use FK info in planner

Would the credit for this be 1) Tomas, 2) Tomas + David ?


I could split the patch like that, but I don't quite see the point. The
two parts will not overlap at all, so it won't make reviews any simpler.
The only reason for such a split seems to be the different credits, but
would we actually commit the pieces independently? Not sure about that.




I'd be inclined to see a little more explanatory docs on this.


That's probably a good idea. Do you have any particular place for the 
docs in mind?




Have we done any tests on planning overhead for cases where multiple
FKs exist?



I have done some benchmarks initially, and haven't measured any
noticeable impact. But the code changed since then, so I'll redo that
and post some results.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2016-04-03 Thread Simon Riggs
On 14 March 2016 at 19:42, Tomas Vondra 
wrote:

> Hi,
>
> On 03/14/2016 02:12 PM, David Steele wrote:
>
>> Hi Tomas,
>>
> ...
>
>> I don't think it would be clear to any reviewer which patch to apply
>> even if they were working.  I'm marking this "waiting for author".
>>
>
> Yeah. Rebasing the patches to current master was simple enough (there was
> just a simple #include conflict), but figuring out which of the patches is
> review-worthy was definitely difficult.
>
> I do believe David's last patch is the best step forward, so I've rebased
> it, and made some basic aesthetic fixes (adding or rewording comments on a
> few places, etc.)
>

I'd like to split this into 2 patches
1) Add FK info to relcache
2) use FK info in planner

Would the credit for this be 1) Tomas, 2) Tomas + David ?

I'd be inclined to see a little more explanatory docs on this.

Have we done any tests on planning overhead for cases where multiple FKs
exist?
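
Something like this would do as a quick check (a sketch; the "Planning
time" line in the EXPLAIN ANALYZE output is the interesting part):

CREATE TABLE dim1 (a INT, b INT, PRIMARY KEY (a, b));
CREATE TABLE dim2 (a INT, b INT, PRIMARY KEY (a, b));
CREATE TABLE dim3 (a INT, b INT, PRIMARY KEY (a, b));

CREATE TABLE fact (a1 INT, b1 INT, a2 INT, b2 INT, a3 INT, b3 INT,
    FOREIGN KEY (a1, b1) REFERENCES dim1 (a, b),
    FOREIGN KEY (a2, b2) REFERENCES dim2 (a, b),
    FOREIGN KEY (a3, b3) REFERENCES dim3 (a, b));

EXPLAIN ANALYZE SELECT * FROM fact
    JOIN dim1 ON (fact.a1 = dim1.a AND fact.b1 = dim1.b)
    JOIN dim2 ON (fact.a2 = dim2.a AND fact.b2 = dim2.b)
    JOIN dim3 ON (fact.a3 = dim3.a AND fact.b3 = dim3.b);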

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2016-03-25 Thread David Steele

Hi Simon,

On 3/14/16 3:42 PM, Tomas Vondra wrote:


Attached is v3 of the patch, and also three SQL scripts demonstrating
the impact of the patch on simple examples.


Do you know when you'll have a chance to review Tomas' latest patch?

Thanks,
--
-David
da...@pgmasters.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2016-03-14 Thread Tomas Vondra

Hi,

On 03/14/2016 02:12 PM, David Steele wrote:

Hi Tomas,

...

I don't think it would be clear to any reviewer which patch to apply
even if they were working.  I'm marking this "waiting for author".


Yeah. Rebasing the patches to current master was simple enough (there 
was just a simple #include conflict), but figuring out which of the 
patches is review-worthy was definitely difficult.


I do believe David's last patch is the best step forward, so I've 
rebased it, and made some basic aesthetic fixes (adding or rewording 
comments on a few places, etc.)


The one important code change is that I've removed the piece of code 
from find_best_foreign_key_quals that tried to be a bit too smart about 
equivalence classes.


My understanding is that it tried to handle cases like this example:

CREATE TABLE f (id1 INT, id2 INT, PRIMARY KEY (id1, id2));

CREATE TABLE d1 (id1 INT, id2 INT, FOREIGN KEY (id1, id2)
   REFERENCES f(id1, id2));

CREATE TABLE d2 (id1 INT, id2 INT, FOREIGN KEY (id1, id2)
   REFERENCES f(id1, id2));

SELECT * FROM f JOIN d1 ON (f.id1 = d1.id1 AND f.id2 = d1.id2)
JOIN d2 ON (f.id1 = d2.id1 AND f.id2 = d2.id2);

But it did so by also deriving foreign keys between d1 and d2, which I 
believe is wrong as there really is no foreign key, and thus no 
guarantee of existence of a matching row.
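
A tiny illustration of the missing guarantee (hypothetical data, not from
the patch):

-- d1 and d2 each satisfy their FK to f, yet share no keys at all:
INSERT INTO f  SELECT i, i FROM generate_series(1,100) s(i);
INSERT INTO d1 SELECT i, i FROM generate_series(1,50) s(i);
INSERT INTO d2 SELECT i, i FROM generate_series(51,100) s(i);

-- a "derived" d1->d2 foreign key would promise matching rows that
-- (d1 JOIN d2 ON d1.id1 = d2.id1 AND d1.id2 = d2.id2) cannot produce.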


FWIW, as I explained in a message from 24/2/2016, while this is 
definitely an issue worth fixing, I believe it needs to be done in some 
other way, not by foreign keys.


Attached is v3 of the patch, and also three SQL scripts demonstrating 
the impact of the patch on simple examples.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index eb0fc1e..3d38384 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2153,6 +2153,16 @@ _outIndexOptInfo(StringInfo str, const IndexOptInfo *node)
 }
 
 static void
+_outForeignKeyOptInfo(StringInfo str, const ForeignKeyOptInfo *node)
+{
+	WRITE_NODE_TYPE("FOREIGNKEYOPTINFO");
+
+	WRITE_OID_FIELD(conrelid);
+	WRITE_OID_FIELD(confrelid);
+	WRITE_INT_FIELD(nkeys);
+}
+
+static void
 _outEquivalenceClass(StringInfo str, const EquivalenceClass *node)
 {
 	/*
@@ -3603,6 +3613,9 @@ _outNode(StringInfo str, const void *obj)
 			case T_IndexOptInfo:
 				_outIndexOptInfo(str, obj);
 				break;
+			case T_ForeignKeyOptInfo:
+				_outForeignKeyOptInfo(str, obj);
+				break;
 			case T_EquivalenceClass:
 				_outEquivalenceClass(str, obj);
 				break;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 5350329..4399a9f 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -3870,6 +3870,347 @@ get_parameterized_joinrel_size(PlannerInfo *root, RelOptInfo *rel,
 }
 
 /*
+ * quals_match_foreign_key
+ *		Determines if the foreign key is matched by joinquals.
+ *
+ * Checks that there are conditions on all columns of the foreign key, matching
+ * the operator used by the foreign key etc. If such complete match is found,
+ * the function returns bitmap identifying the matching quals (0-based).
+ *
+ * Otherwise (no match at all or incomplete match), NULL is returned.
+ */
+static Bitmapset *
+quals_match_foreign_key(PlannerInfo *root, ForeignKeyOptInfo *fkinfo,
+		RelOptInfo *fkrel, RelOptInfo *foreignrel,
+		List *joinquals)
+{
+	int i;
+	int nkeys = fkinfo->nkeys;
+	Bitmapset *qualmatches = NULL;
+	Bitmapset *fkmatches = NULL;
+
+	/*
+	 * Loop over each column of the foreign key and build a bitmap index
+	 * of each joinqual which matches. Note that we don't stop when we find
+	 * the first match, as the expression could be duplicated in the
+	 * joinquals, and we want to generate a bitmap index which has bits set for
+	 * every matching join qual.
+	 */
+	for (i = 0; i < nkeys; i++)
+	{
+		ListCell *lc;
+		int quallstidx = -1;
+
+		foreach(lc, joinquals)
+		{
+			RestrictInfo   *rinfo;
+			OpExpr		   *clause;
+			Var			   *leftvar;
+			Var			   *rightvar;
+
+			quallstidx++;
+
+			/*
+			 * Technically we don't need to, but here we skip this qual if
+			 * we've matched it to part of the foreign key already. This
+			 * should prove to be a useful optimization when the quals appear
+			 * in the same order as the foreign key's keys. We need only bother
+			 * doing this when the foreign key is made up of more than 1 set
+			 * of columns, and we're not testing the first column.
+			 */
+			if (i > 0 && bms_is_member(quallstidx, qualmatches))
+				continue;
+
+			/*
+			 * Here since 'usefulquals' only contains bitmap indexes for quals
+			 * of type "var op var" we can safely skip checking this.
+			 */
+			rinfo = (RestrictInfo *) lfirst(lc);
+			clause = (OpExpr *) rinfo->clause;
+

Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2016-03-14 Thread David Steele

Hi Tomas,

On 2/24/16 11:21 AM, Tomas Vondra wrote:


Overall, I still believe the FK patch is a clear improvement over the
current status - while it's true it does not improve all possible cases
and there's room for additional improvements (like handling multiple
candidate FK constraints), it should not make existing estimates worse.


The latest version of this patch does not apply:

$ git apply ../other/0001-estimation-with-fkeys-v2.patch
../other/0001-estimation-with-fkeys-v2.patch:748: trailing whitespace.

error: patch failed: src/backend/optimizer/util/plancat.c:27
error: src/backend/optimizer/util/plancat.c: patch does not apply
error: patch failed: src/include/nodes/relation.h:468
error: src/include/nodes/relation.h: patch does not apply

David's most recent version also does not apply:

$ git apply ../other/estimation-with-fkeys-v2_davidv4.patch
../other/estimation-with-fkeys-v2_davidv4.patch:517: trailing whitespace.

error: patch failed: src/backend/optimizer/util/plancat.c:27
error: src/backend/optimizer/util/plancat.c: patch does not apply
error: patch failed: src/include/nodes/relation.h:472
error: src/include/nodes/relation.h: patch does not apply

I don't think it would be clear to any reviewer which patch to apply 
even if they were working.  I'm marking this "waiting for author".


Thanks,
--
-David
da...@pgmasters.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2016-02-24 Thread Tomas Vondra

Hi,

On 09/30/2015 03:12 AM, David Rowley wrote:
...

CREATE TABLE f (id1 INT, id2 INT, PRIMARY KEY (id1, id2));

CREATE TABLE d1 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCES
f(id1, id2));
CREATE TABLE d2 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCES
f(id1, id2));

INSERT INTO f SELECT i, i FROM generate_series(1,100) s(i);

INSERT INTO d1 SELECT i, i FROM generate_series(1,10) s(i);
INSERT INTO d2 SELECT i, i FROM generate_series(1,30) s(i);

now, both pair-wise joins (f JOIN d1) and (f JOIN d2) are estimated
perfectly accurately, but as soon as the query involves both of
them, this happens:

SELECT * FROM f JOIN d1 ON (f.id1 = d1.id1 AND f.id2 = d1.id2)
 JOIN d2 ON (f.id1 = d2.id1 AND f.id2 = d2.id2);

   QUERY PLAN
-
  Nested Loop  (cost=3334.43..12647.57 rows=3 width=24)
   (actual time=221.086..1767.206 rows=10 loops=1)
Join Filter: ((d1.id1 = f.id1) AND (d1.id2 = f.id2))
->  Hash Join  (cost=3334.00..12647.01 rows=1 width=16)
   (actual time=221.058..939.482 rows=10 loops=1)
  Hash Cond: ((d2.id1 = d1.id1) AND (d2.id2 = d1.id2))
  ->  Seq Scan on d2  (cost=0.00..4328.00 rows=30 width=8)
  (actual time=0.038..263.356 rows=30 loops=1)
  ->  Hash  (cost=1443.00..1443.00 rows=10 width=8)
(actual time=220.721..220.721 rows=10 loops=1)
Buckets: 131072  Batches: 2  Memory Usage: 2982kB
->  Seq Scan on d1  (cost=0.00..1443.00 rows=10 ...)
(actual time=0.033..101.547 rows=10 loops=1)
->  Index Only Scan using f_pkey on f  (cost=0.42..0.54 rows=1 ...)
 (actual time=0.004..0.004 rows=1 loops=10)
  Index Cond: ((id1 = d2.id1) AND (id2 = d2.id2))
  Heap Fetches: 10

Clearly, the inner join (d1 JOIN d2) is poorly estimated (1 vs.
10). I assume that's only because we find the FK only on the second
join with f.

So it seems like a clear improvement, both compared to master and
the previous versions of the patch.


I've been experimenting with this example. Of course, the reason why we
get the 1 row estimate on the join between d1 and d2 is that there's no
foreign key between those two relations.

The attached patch changes things so that the foreign key matching code
is better able to see foreign keys "hidden" behind eclasses. So it does
now in fact detect a foreign key on d2 referencing d1, by looking for
foreign keys which have Vars in the same eclasses as the joinquals
are built from. This has improved the result:

postgres=# EXPLAIN ANALYZE SELECT * FROM f JOIN d1 ON (f.id1 = d1.id1
AND f.id2 = d1.id2) JOIN d2 ON (f.id1 = d2.id1 AND f.id2 = d2.id2);

  QUERY PLAN
-
  Hash Join  (cost=16655.94..26066.95 rows=3 width=24) (actual
time=267.322..468.383 rows=10 loops=1)
Hash Cond: ((d2.id1 = f.id1) AND (d2.id2 = f.id2))
->  Seq Scan on d2  (cost=0.00..4328.00 rows=30 width=8) (actual
time=0.019..31.396 rows=30 loops=1)
->  Hash  (cost=14666.94..14666.94 rows=10 width=16) (actual
time=266.263..266.263 rows=10 loops=1)
  Buckets: 131072  Batches: 2  Memory Usage: 3373kB
  ->  Merge Join  (cost=9748.32..14666.94 rows=10 width=16)
(actual time=104.494..224.908 rows=10 loops=1)
Merge Cond: ((f.id1 = d1.id1) AND (f.id2 = d1.id2))
->  Index Only Scan using f_pkey on f
  (cost=0.42..36214.93 rows=100 width=8) (actual time=0.045..35.758
rows=11 loops=1)
  Heap Fetches: 11
->  Sort  (cost=9747.82..9997.82 rows=10 width=8)
(actual time=104.440..122.401 rows=10 loops=1)
  Sort Key: d1.id1, d1.id2
  Sort Method: external sort  Disk: 2152kB
  ->  Seq Scan on d1  (cost=0.00..1443.00
rows=10 width=8) (actual time=0.019..9.443 rows=10 loops=1)

The problem is that the code I added is sometimes a bit too optimistic
at finding a suitable foreign key. When performing estimates for the
join between (f,d1) <-> (d2), since the code loops over each relation
making up the set of relations at either side of the join, we find a
foreign key on 'f' which references d2 - this one actually exists. It
then goes on and also finds a foreign key for (d1) referencing (d2); of
course this one does not exist and is only found due to the eclasses.
The problem here is: which one do we use? If we multiply the selectivity
for each of these foreign keys then we'd end up with a 

Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2016-01-31 Thread Alvaro Herrera
David Rowley wrote:
 
> I'm not sure that I agree with this being set to "Needs review". The last
> progress that I see made on this was me hacking at the patch to remove some
> equivalence class limitations. I think the logical next step would be for
> you to look at these changes and either accept or reject them. So does that
> not suit "Waiting on author" better than "Needs review"?

Tomas, I think you need to submit a new version of this patch,
please.  Closing it for now.

-- 
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-12-23 Thread Tomas Vondra



On 12/24/2015 03:45 AM, Michael Paquier wrote:

On Wed, Sep 30, 2015 at 10:12 AM, David Rowley

...

In the attached I've coded it to take the Min() selectivity for when the
same quals are matched more than once. I know this is not correct, but since
it seems impossible to obtain an exact estimate in this case, we'd need to
decide on some logic which does well in the average case.


Is there still an interest for this patch? The last message of this
thread has added a new version of the patch but the patch was still in
"Waiting on author" state for a couple of months. Just guessing that
the status was incorrect, I have moved it to the next CF.


Yes, I still think the patch is useful. Thanks for moving it to the next 
commitfest and sorry for not updating the status.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-12-23 Thread David Rowley
On 24 December 2015 at 16:32, Tomas Vondra 
wrote:

>
>
> On 12/24/2015 03:45 AM, Michael Paquier wrote:
>
>> On Wed, Sep 30, 2015 at 10:12 AM, David Rowley
>>
> ...
>
>> In the attached I've coded it to take the Min() selectivity for when the
>>> same quals are matched more than once. I know this is not correct, but
>>> since
>>> it seems impossible to obtain an exact estimate in this case, we'd need
>>> to
>>> decide on some logic which does well in the average case.
>>>
>>
>> Is there still an interest for this patch? The last message of this
>> thread has added a new version of the patch but the patch was still in
>> "Waiting on author" state for a couple of months. Just guessing that
>> the status was incorrect, I have moved it to the next CF.
>>
>
> Yes, I still think the patch is useful. Thanks for moving it to the next
> commitfest and sorry for not updating the status.


Hi,

I'm not sure that I agree with this being set to "Needs review". The last
progress that I see made on this was me hacking at the patch to remove some
equivalence class limitations. I think the logical next step would be for
you to look at these changes and either accept or reject them. So does that
not suit "Waiting on author" better than "Needs review" ?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-12-23 Thread Michael Paquier
On Wed, Sep 30, 2015 at 10:12 AM, David Rowley
 wrote:
> On 29 September 2015 at 01:59, Tomas Vondra 
> wrote:
>>
>> Hi,
>>
>> On 09/27/2015 02:00 PM, David Rowley wrote:
>>>
>>> I've been working on this again. I've put back the code that you wrote
>>> for the looping over each combination of relations from either side of
>>> the join.
>>>
>>> I've also added some code to get around the problem with eclass joins
>>> and the RestrictInfo having some alternative Vars that don't belong to
>>> the foreign key. Basically I'm just checking if the RestrictInfo has a
>>> parent_ec, and if it does just loop over the members to try and find the
>>> Vars that belong to the foreign key. I've tested it with the following,
>>> and it seems to work:
>>
>>
>> I didn't have time to look into the code yet, but this seems like an
>> interesting idea.
>>
>>
>>>
>>> create table a as select i as a_id1, i as a_id2, i as dummy1 from
>>> generate_series(0,999) s(i);
>>> alter table a add unique (a_id1, a_id2);
>>> create table b as select i as b_id1, i as b_id2 from
>>> generate_series(0,332) s(i);
>>>
>>> analyze a;
>>> analyze b;
>>>
>>> alter table b add foreign key (b_id1, b_id2) references a (a_id1, a_id2);
>>>
>>> explain analyze select * from a inner join b on a.dummy1 = b.b_id1 and
>>> a.a_id2 = b.b_id2 where a.a_id1 = a.dummy1;
>>>
>>>   QUERY PLAN
>>>
>>> ---
>>>   Hash Join  (cost=18.57..26.41 rows=2 width=20) (actual
>>> time=0.775..1.046 rows=333 loops=1)
>>> Hash Cond: ((b.b_id1 = a.dummy1) AND (b.b_id2 = a.a_id2))
>>> ->  Seq Scan on b  (cost=0.00..5.33 rows=333 width=8) (actual
>>> time=0.013..0.046 rows=333 loops=1)
>>> ->  Hash  (cost=18.50..18.50 rows=5 width=12) (actual
>>> time=0.737..0.737 rows=1000 loops=1)
>>>   Buckets: 1024  Batches: 1  Memory Usage: 51kB
>>>   ->  Seq Scan on a  (cost=0.00..18.50 rows=5 width=12) (actual
>>> time=0.014..0.389 rows=1000 loops=1)
>>> Filter: (dummy1 = a_id1)
>>>
>>> The non-patched version estimates 1 row. The patched estimates 2 rows,
>>> but that's due to the bad estimate on dummy1 = a_id1.
>>>
>>> The 2 comes from ceil(5 * 0.333).
>>>
>>> Perhaps you have a better test case to for this?
>>
>>
>> I think the additional WHERE clause is needlessly confusing. I've been
>> able to come up with an example - pretty much a normalized schema with a "main"
>> table and auxiliary tables (referencing the main one using FK) with
>> additional info. So not unlikely to happen in practice (except maybe for the
>> multi-column foreign key bit).
>>
>>
>> CREATE TABLE f (id1 INT, id2 INT, PRIMARY KEY (id1, id2));
>>
>> CREATE TABLE d1 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCES
>> f(id1, id2));
>> CREATE TABLE d2 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCES
>> f(id1, id2));
>>
>> INSERT INTO f SELECT i, i FROM generate_series(1,100) s(i);
>>
>> INSERT INTO d1 SELECT i, i FROM generate_series(1,10) s(i);
>> INSERT INTO d2 SELECT i, i FROM generate_series(1,30) s(i);
>>
>> now, both pair-wise joins (f JOIN d1) and (f JOIN d2) are estimated
>> perfectly accurately, but as soon as the query involves both of them, this
>> happens:
>>
>> SELECT * FROM f JOIN d1 ON (f.id1 = d1.id1 AND f.id2 = d1.id2)
>> JOIN d2 ON (f.id1 = d2.id1 AND f.id2 = d2.id2);
>>
>>   QUERY PLAN
>> -
>>  Nested Loop  (cost=3334.43..12647.57 rows=3 width=24)
>>   (actual time=221.086..1767.206 rows=10 loops=1)
>>Join Filter: ((d1.id1 = f.id1) AND (d1.id2 = f.id2))
>>->  Hash Join  (cost=3334.00..12647.01 rows=1 width=16)
>>   (actual time=221.058..939.482 rows=10 loops=1)
>>  Hash Cond: ((d2.id1 = d1.id1) AND (d2.id2 = d1.id2))
>>  ->  Seq Scan on d2  (cost=0.00..4328.00 rows=30 width=8)
>>  (actual time=0.038..263.356 rows=30 loops=1)
>>  ->  Hash  (cost=1443.00..1443.00 rows=10 width=8)
>>(actual time=220.721..220.721 rows=10 loops=1)
>>Buckets: 131072  Batches: 2  Memory Usage: 2982kB
>>->  Seq Scan on d1  (cost=0.00..1443.00 rows=10 ...)
>>(actual time=0.033..101.547 rows=10 loops=1)
>>->  Index Only Scan using f_pkey on f  (cost=0.42..0.54 rows=1 ...)
>> (actual time=0.004..0.004 rows=1 loops=10)
>>  Index Cond: ((id1 = d2.id1) AND (id2 = d2.id2))
>>  Heap Fetches: 10
>>
>> Clearly, the inner join (d1 JOIN d2) is poorly estimated (1 vs. 10). I
>> assume that's only because we find the FK only on the second join with f.
>>
>> So it seems like a clear improvement, both compared to master and 

Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-09-29 Thread David Rowley
On 29 September 2015 at 01:59, Tomas Vondra 
wrote:

>
> CREATE TABLE f (id1 INT, id2 INT, PRIMARY KEY (id1, id2));
>
> CREATE TABLE d1 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCES
> f(id1, id2));
> CREATE TABLE d2 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCES
> f(id1, id2));
>
> INSERT INTO f SELECT i, i FROM generate_series(1,100) s(i);
>
> INSERT INTO d1 SELECT i, i FROM generate_series(1,10) s(i);
> INSERT INTO d2 SELECT i, i FROM generate_series(1,30) s(i);
>
> now, both pair-wise joins (f JOIN d1) and (f JOIN d2) are estimated
> perfectly accurately, but as soon as the query involves both of them, this
> happens:
>
> SELECT * FROM f JOIN d1 ON (f.id1 = d1.id1 AND f.id2 = d1.id2)
> JOIN d2 ON (f.id1 = d2.id1 AND f.id2 = d2.id2);
>
>
>
This is a near perfect example of what I was trying to explain about being
unable to have guarantees about there being 1.0 matching tuples in the
referenced relation.

If we run the above with join_collapse_limit = 1, then we'll first join f
to d1, which will give us 10 tuples. (With IDs ranging from 1 to 10)

If we now perform estimates to join d2 to (f, d1), we don't have all of the
referenced tuples in (f, d1), so how do we estimate that? Remember that d2
has IDs 11 to 30 that won't be found in the "referenced" relation.
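
(For anyone wanting to reproduce that ordering, a sketch:

SET join_collapse_limit = 1;
EXPLAIN SELECT * FROM (f JOIN d1 ON f.id1 = d1.id1 AND f.id2 = d1.id2)
    JOIN d2 ON f.id1 = d2.id1 AND f.id2 = d2.id2;

with join_collapse_limit = 1 the planner keeps the explicit JOIN order.)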

What if we had populated d1 with:

INSERT INTO d1 SELECT i, i FROM generate_series(91,100) s(i);

The join will find exactly 0 tuples between the join of (f, d1) -> d2.

Is my logic here wrong somehow?

-- 
 David Rowley   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-09-29 Thread David Rowley
On 29 September 2015 at 01:59, Tomas Vondra 
wrote:

> Hi,
>
> On 09/27/2015 02:00 PM, David Rowley wrote:
>
>> I've been working on this again. I've put back the code that you wrote
>> for the looping over each combination of relations from either side of
>> the join.
>>
>> I've also added some code to get around the problem with eclass joins
>> and the RestrictInfo having some alternative Vars that don't belong to
>> the foreign key. Basically I'm just checking if the RestrictInfo has a
>> parent_ec, and if it does just loop over the members to try and find the
>> Vars that belong to the foreign key. I've tested it with the following,
>> and it seems to work:
>>
>
> I didn't have time to look into the code yet, but this seems like an
> interesting idea.
>
>
>
>> create table a as select i as a_id1, i as a_id2, i as dummy1 from
>> generate_series(0,999) s(i);
>> alter table a add unique (a_id1, a_id2);
>> create table b as select i as b_id1, i as b_id2 from
>> generate_series(0,332) s(i);
>>
>> analyze a;
>> analyze b;
>>
>> alter table b add foreign key (b_id1, b_id2) references a (a_id1, a_id2);
>>
>> explain analyze select * from a inner join b on a.dummy1 = b.b_id1 and
>> a.a_id2 = b.b_id2 where a.a_id1 = a.dummy1;
>>
>>   QUERY PLAN
>>
>> ---
>>   Hash Join  (cost=18.57..26.41 rows=2 width=20) (actual
>> time=0.775..1.046 rows=333 loops=1)
>> Hash Cond: ((b.b_id1 = a.dummy1) AND (b.b_id2 = a.a_id2))
>> ->  Seq Scan on b  (cost=0.00..5.33 rows=333 width=8) (actual
>> time=0.013..0.046 rows=333 loops=1)
>> ->  Hash  (cost=18.50..18.50 rows=5 width=12) (actual
>> time=0.737..0.737 rows=1000 loops=1)
>>   Buckets: 1024  Batches: 1  Memory Usage: 51kB
>>   ->  Seq Scan on a  (cost=0.00..18.50 rows=5 width=12) (actual
>> time=0.014..0.389 rows=1000 loops=1)
>> Filter: (dummy1 = a_id1)
>>
>> The non-patched version estimates 1 row. The patched estimates 2 rows,
>> but that's due to the bad estimate on dummy1 = a_id1.
>>
>> The 2 comes from ceil(5 * 0.333).
>>
>> Perhaps you have a better test case to for this?
>>
>
> I think the additional WHERE clause is needlessly confusing. I've been
> able to come up with an example - pretty much a normalized schema with a "main"
> table and auxiliary tables (referencing the main one using FK) with
> additional info. So not unlikely to happen in practice (except maybe for
> the multi-column foreign key bit).
>
>
> CREATE TABLE f (id1 INT, id2 INT, PRIMARY KEY (id1, id2));
>
> CREATE TABLE d1 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCES
> f(id1, id2));
> CREATE TABLE d2 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCES
> f(id1, id2));
>
> INSERT INTO f SELECT i, i FROM generate_series(1,100) s(i);
>
> INSERT INTO d1 SELECT i, i FROM generate_series(1,10) s(i);
> INSERT INTO d2 SELECT i, i FROM generate_series(1,30) s(i);
>
> now, both pair-wise joins (f JOIN d1) and (f JOIN d2) are estimated
> perfectly accurately, but as soon as the query involves both of them, this
> happens:
>
> SELECT * FROM f JOIN d1 ON (f.id1 = d1.id1 AND f.id2 = d1.id2)
> JOIN d2 ON (f.id1 = d2.id1 AND f.id2 = d2.id2);
>
>   QUERY PLAN
> -
>  Nested Loop  (cost=3334.43..12647.57 rows=3 width=24)
>   (actual time=221.086..1767.206 rows=10 loops=1)
>Join Filter: ((d1.id1 = f.id1) AND (d1.id2 = f.id2))
>->  Hash Join  (cost=3334.00..12647.01 rows=1 width=16)
>   (actual time=221.058..939.482 rows=10 loops=1)
>  Hash Cond: ((d2.id1 = d1.id1) AND (d2.id2 = d1.id2))
>  ->  Seq Scan on d2  (cost=0.00..4328.00 rows=30 width=8)
>  (actual time=0.038..263.356 rows=30 loops=1)
>  ->  Hash  (cost=1443.00..1443.00 rows=10 width=8)
>(actual time=220.721..220.721 rows=10 loops=1)
>Buckets: 131072  Batches: 2  Memory Usage: 2982kB
>->  Seq Scan on d1  (cost=0.00..1443.00 rows=10 ...)
>(actual time=0.033..101.547 rows=10 loops=1)
>->  Index Only Scan using f_pkey on f  (cost=0.42..0.54 rows=1 ...)
> (actual time=0.004..0.004 rows=1 loops=10)
>  Index Cond: ((id1 = d2.id1) AND (id2 = d2.id2))
>  Heap Fetches: 10
>
> Clearly, the inner join (d1 JOIN d2) is poorly estimated (1 vs. 10). I
> assume that's only because we find the FK only on the second join with f.
>
> So it seems like a clear improvement, both compared to master and the
> previous versions of the patch.


I've been experimenting with this example. Of course, the reason why we get
the 1 row estimate on the join between d1 and d2 is that there's no foreign
key 

Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-09-28 Thread Tomas Vondra

Hi,

On 09/27/2015 02:00 PM, David Rowley wrote:

I've been working on this again. I've put back the code that you wrote
for the looping over each combination of relations from either side of
the join.

I've also added some code to get around the problem with eclass joins
and the RestrictInfo having some alternative Vars that don't belong to
the foreign key. Basically I'm just checking if the RestrictInfo has a
parent_ec, and if it does just loop over the members to try and find the
Vars that belong to the foreign key. I've tested it with the following,
and it seems to work:


I didn't have time to look into the code yet, but this seems like an 
interesting idea.




create table a as select i as a_id1, i as a_id2, i as dummy1 from
generate_series(0,999) s(i);
alter table a add unique (a_id1, a_id2);
create table b as select i as b_id1, i as b_id2 from
generate_series(0,332) s(i);

analyze a;
analyze b;

alter table b add foreign key (b_id1, b_id2) references a (a_id1, a_id2);

explain analyze select * from a inner join b on a.dummy1 = b.b_id1 and
a.a_id2 = b.b_id2 where a.a_id1 = a.dummy1;

  QUERY PLAN
---
  Hash Join  (cost=18.57..26.41 rows=2 width=20) (actual
time=0.775..1.046 rows=333 loops=1)
Hash Cond: ((b.b_id1 = a.dummy1) AND (b.b_id2 = a.a_id2))
->  Seq Scan on b  (cost=0.00..5.33 rows=333 width=8) (actual
time=0.013..0.046 rows=333 loops=1)
->  Hash  (cost=18.50..18.50 rows=5 width=12) (actual
time=0.737..0.737 rows=1000 loops=1)
  Buckets: 1024  Batches: 1  Memory Usage: 51kB
  ->  Seq Scan on a  (cost=0.00..18.50 rows=5 width=12) (actual
time=0.014..0.389 rows=1000 loops=1)
Filter: (dummy1 = a_id1)

The non-patched version estimates 1 row. The patched estimates 2 rows,
but that's due to the bad estimate on dummy1 = a_id1.

The 2 comes from ceil(5 * 0.333).

Perhaps you have a better test case to for this?


I think the additional WHERE clause is needlessly confusing. I've been 
able to come up with an example - pretty much a normalized schema with a "main" 
table and auxiliary tables (referencing the main one using FK) with 
additional info. So not unlikely to happen in practice (except maybe for 
the multi-column foreign key bit).



CREATE TABLE f (id1 INT, id2 INT, PRIMARY KEY (id1, id2));

CREATE TABLE d1 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCES 
f(id1, id2));
CREATE TABLE d2 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCES 
f(id1, id2));


INSERT INTO f SELECT i, i FROM generate_series(1,100) s(i);

INSERT INTO d1 SELECT i, i FROM generate_series(1,10) s(i);
INSERT INTO d2 SELECT i, i FROM generate_series(1,30) s(i);

now, both pair-wise joins (f JOIN d1) and (f JOIN d2) are estimated 
perfectly accurately, but as soon as the query involves both of them, 
this happens:


SELECT * FROM f JOIN d1 ON (f.id1 = d1.id1 AND f.id2 = d1.id2)
JOIN d2 ON (f.id1 = d2.id1 AND f.id2 = d2.id2);

  QUERY PLAN
-
 Nested Loop  (cost=3334.43..12647.57 rows=3 width=24)
  (actual time=221.086..1767.206 rows=10 loops=1)
   Join Filter: ((d1.id1 = f.id1) AND (d1.id2 = f.id2))
   ->  Hash Join  (cost=3334.00..12647.01 rows=1 width=16)
  (actual time=221.058..939.482 rows=10 loops=1)
 Hash Cond: ((d2.id1 = d1.id1) AND (d2.id2 = d1.id2))
 ->  Seq Scan on d2  (cost=0.00..4328.00 rows=30 width=8)
 (actual time=0.038..263.356 rows=30 loops=1)
 ->  Hash  (cost=1443.00..1443.00 rows=10 width=8)
   (actual time=220.721..220.721 rows=10 loops=1)
   Buckets: 131072  Batches: 2  Memory Usage: 2982kB
   ->  Seq Scan on d1  (cost=0.00..1443.00 rows=10 ...)
   (actual time=0.033..101.547 rows=10 loops=1)
   ->  Index Only Scan using f_pkey on f  (cost=0.42..0.54 rows=1 ...)
(actual time=0.004..0.004 rows=1 loops=10)
 Index Cond: ((id1 = d2.id1) AND (id2 = d2.id2))
 Heap Fetches: 10

Clearly, the inner join (d1 JOIN d2) is poorly estimated (1 vs. 10).
I assume that's only because we find the FK only on the second join with f.


So it seems like a clear improvement, both compared to master and the 
previous versions of the patch.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-09-27 Thread David Rowley
On 26 September 2015 at 01:57, Tomas Vondra 
wrote:

> Hi,
>
> On 09/25/2015 03:39 AM, David Rowley wrote:
>
>> I looked at this again, and I'm not all that sure it's possible to
>> assume that 1.0 / ntuples is valid when there's more than one
>> relation at either side of the join.
>>
>> My reasoning for this is that the whole basis for the patch is that
>> if we find a foreign key match, then we can be sure enough, as far as
>> row estimations go, that exactly 1 tuple will exist matching that
>> condition. This assumption of the 1 tuple no longer holds when 2
>> relations have already been joined, as this can either cause tuple
>> duplication, or elimination.
>>
>
> I don't see why that would be the case. Of course, you need to take the
> right ntuples, i.e. that of the "target" of the foreign key (the table with
> UNIQUE constraint) so that the selectivity matches the fact that exactly 1
> tuple (on the PK side) matches.
>

hmm, ok. You're right. It appears I was a bit confused, but thanks for
explaining it again. I get it now.

I've been working on this again. I've put back the code that you wrote for
the looping over each combination of relations from either side of the join.

I've also added some code to get around the problem with eclass joins and
the RestrictInfo having some alternative Vars that don't belong to the
foreign key. Basically I'm just checking if the RestrictInfo has a
parent_ec, and if it does just loop over the members to try and find the
Vars that belong to the foreign key. I've tested it with the following, and
it seems to work:

create table a as select i as a_id1, i as a_id2, i as dummy1 from
generate_series(0,999) s(i);
alter table a add unique (a_id1, a_id2);
create table b as select i as b_id1, i as b_id2 from generate_series(0,332)
s(i);

analyze a;
analyze b;

alter table b add foreign key (b_id1, b_id2) references a (a_id1, a_id2);

explain analyze select * from a inner join b on a.dummy1 = b.b_id1 and
a.a_id2 = b.b_id2 where a.a_id1 = a.dummy1;

 QUERY PLAN
---
 Hash Join  (cost=18.57..26.41 rows=2 width=20) (actual time=0.775..1.046
rows=333 loops=1)
   Hash Cond: ((b.b_id1 = a.dummy1) AND (b.b_id2 = a.a_id2))
   ->  Seq Scan on b  (cost=0.00..5.33 rows=333 width=8) (actual
time=0.013..0.046 rows=333 loops=1)
   ->  Hash  (cost=18.50..18.50 rows=5 width=12) (actual time=0.737..0.737
rows=1000 loops=1)
 Buckets: 1024  Batches: 1  Memory Usage: 51kB
 ->  Seq Scan on a  (cost=0.00..18.50 rows=5 width=12) (actual
time=0.014..0.389 rows=1000 loops=1)
   Filter: (dummy1 = a_id1)

The non-patched version estimates 1 row. The patched estimates 2 rows, but
that's due to the bad estimate on dummy1 = a_id1.

The 2 comes from ceil(5 * 0.333).

Perhaps you have a better test case to for this?

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services


estimation-with-fkeys-v2_davidv3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-09-25 Thread Tomas Vondra

Hi,

On 09/25/2015 03:39 AM, David Rowley wrote:

On 24 September 2015 at 23:57, Tomas Vondra wrote:


2) find_best_match_foreign_key
--

I think the comment before the function needs rephrasing (seems a
bit broken to me). I do like the approach in general, although it
changes the semantics a bit. The original code only considered
"fully matching" fkeys, while the new code simply takes the longest
match.



Oops, I did not code this at all the way I had originally pictured it. I
guess the evidence of that is in the function comment, which I wrote
before coding the function.
I of course intended to only allow full matches.
A full patch which fixes this is attached. This also should fix the
clause duplication trick that I talked about.


OK, thanks. Let's leave partial FK matches for the future.




The comment before the function mentions it's possible to confuse
the code with duplicate quals. Can you give an example?


Something like: SELECT * FROM a LEFT JOIN b ON a.id=b.id
and a.id=b.id AND a.id=b.id AND a.id2 = b.id2 AND
a.id3 = b.id3;

Note a.id = b.id repeated 3 times.

Where a foreign key exists on (a.id) ref (b.id), and also
(a.id2, a.id3) ref (b.id2, b.id3). In this case we match 3 quals for
the FK with 1 key, and 2 quals with the FK with 2 keys. But this is
now fixed in the attached.

I used an outer join here as they won't be transformed into eclass
members and back to quals again. INNER JOIN wouldn't allow the
duplicates to materialise again.


Ah, OK. Didn't think about outer joins.



...

I looked at this again, and I'm not all that sure it's possible to
assume that 1.0 / ntuples is valid when there's more than one
relation at either side of the join.

My reasoning for this is that the whole basis for the patch is that
if we find a foreign key match, then we can be sure enough, as far as
row estimations go, that exactly 1 tuple will exist matching that
condition. This assumption of the 1 tuple no longer holds when 2
relations have already been joined, as this can either cause tuple
duplication, or elimination.


I don't see why that would be the case. Of course, you need to take the 
right ntuples, i.e. that of the "target" of the foreign key (the table 
with UNIQUE constraint) so that the selectivity matches the fact that 
exactly 1 tuple (on the PK side) matches.


Let's say we have 3 tables

A (1000 rows)
B (333 rows)
C (222 rows)

and let's assume A references the other two tables

   A (b_id) REFERENCES B(id)
   A (c_id) REFERENCES C(id)

Now, let's estimate the join

  SELECT * FROM A JOIN B ON (a.b_id = b.id)
  JOIN C ON (a.c_id = c.id)

Which will do this

  card(join) = card(A) * card(B) * card(C) * sel(b_id=id) * sel(c_id=id)

where card(T) is a cardinality of a table, and sel(C) is selectivity of 
the conditions. We do know that


  card(A) = 1000
  card(B) = 333
  card(C) = 222

and we do know that selectivity of each condition is 1/card(T) of the 
table with the UNIQUE constraint, referenced by the FK. So


  sel(b_id=id) = 1/card(B) = 1/333
  sel(c_id=id) = 1/card(C) = 1/222

The fact is that these selectivities perfectly eliminate the impact of 
cardinality of the table. So


  card(join) = 1000 * (333 * (1/333)) * (222 * (1/222))

so the expected cardinality is 1000.

Of course, this estimation effectively happens in steps, i.e. we first 
join 2 tables - say A and B, then (A,B) and C. So in the first step we 
do this:


  card(A,B) = card(A) * card(B) * sel(b_id=id) = 1000

  card((A,B),C) = card(A,B) * card(C) * sel(c_id=id) = 1000

The join order does not matter - we could easily join B,C first, and 
then join A.


  card(B,C) = card(B) * card(C) * sel(NULL) = 73926

and then

  card((B,C),A) = card(B,C) * card(A) * sel(b_id=id) * sel(c_id=id)
= 1000

Notice that in this case, the singleton table (A) actually references 
primary keys within the join relation - obviously the whole join 
relation does not have unique keys or anything, but that does not matter.


The crucial part here is that selectivity of the condition needs to use 
the number of tuples of the base relation, because that's the right 
selectivity for the join clause.
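
For anyone who wants to check the arithmetic, a self-contained sketch
mirroring the A/B/C example above; with FK-aware estimation the final
join estimate should come out at 1000:

CREATE TABLE b (id INT PRIMARY KEY);
CREATE TABLE c (id INT PRIMARY KEY);
CREATE TABLE a (b_id INT REFERENCES b (id),
                c_id INT REFERENCES c (id));

INSERT INTO b SELECT i FROM generate_series(1,333) s(i);
INSERT INTO c SELECT i FROM generate_series(1,222) s(i);
INSERT INTO a SELECT (i % 333) + 1, (i % 222) + 1
  FROM generate_series(1,1000) s(i);

ANALYZE a; ANALYZE b; ANALYZE c;

-- card = 1000 * 333 * 222 * (1/333) * (1/222) = 1000
EXPLAIN SELECT * FROM a JOIN b ON (a.b_id = b.id)
                        JOIN c ON (a.c_id = c.id);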




It's a little hard to force the join order so that it happens this way, but
say the join order in the following example was, from left to right:

a CROSS JOIN b JOIN c on a.id=c.id

Of course, the planner would perform the cross join last in reality, but
it's possible to trick it not to.

Let's say a foreign key exists on c (id) references a(id). If we
assume that a.id = c.id produces 1 tuple in the (a,b)
rel, then we'd be very wrong, as it's not 1, it's the number of
rows in b! Which could be millions.


I think this is the part where you're wrong. What needs to happen in the 
estimation is this:


  card(A,B) = card(A) * card(B)  /* 

Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-09-24 Thread Tomas Vondra

Hi,

Thanks for the review and time spent on reworking the patch!

On 09/24/2015 07:41 AM, David Rowley wrote:

On 23 September 2015 at 17:11, David Rowley wrote:

find_foreign_key_clauses() should look for the longest match and
return a Bitmap set of the list indexes to the caller.
It might be possible to fool the longest match logic by duplicating
clauses, e.g. a1 = b1 AND a1 = b1 and a1 = b1 AND a2 = b2 AND a3 =
b3, but I can't imagine that matters, but if it did, we could code
it to be smart enough to see through that.


I took a bash at implementing what I described, and I've ended up with
the attached.

git diff --stat gives me:

  src/backend/optimizer/path/costsize.c | 717
--
  src/backend/optimizer/plan/analyzejoins.c |   1 +
  src/backend/optimizer/util/plancat.c  | 112 +++--
  3 files changed, 228 insertions(+), 602 deletions(-)

So it's removed quite a bit of code. I also discovered that: 1.0 /
Max(rel->tuples,1.0) is no good for SEMI and ANTI joins. I've coded
around this in the attached, but I'm not certain it's the best way of
doing things.

I thought that it might be possible to add some regression test around
this, if we simply just find a plan that uses a nested loop due to
underestimation of matching rows, and then make sure that it no longer
uses a nested loop when the foreign key is added. I've not yet done this
in the attached.

Patch attached in delta form and complete form.

I still need to perform more analysis on the plancat.c changes.

Have I made any changes that you disagree with?


I think the changes in general are a step in the right direction, but
I've noticed some differences in behavior - some of them may actually be
desirable, others are bugs, I believe.



1) (rel)hasindex


I'm perfectly fine with getting rid of hasindex, it was mostly 
cargo-cult programming anyway.



2) find_best_match_foreign_key
--

I think the comment before the function needs rephrasing (seems a bit 
broken to me). I do like the approach in general, although it changes 
the semantics a bit. The original code only considered "fully matching" 
fkeys, while the new code simply takes the longest match.


Let me illustrate on a somewhat artificial example (fkjoin1.sql):

create table t1 (a int, b int, c int, d int,
 unique(a,b), unique(a,b,c,d));

create table t2 (a int, b int, c int, d int,
 foreign key (a,b) references t1(a,b),
 foreign key (a,b,c,d) references t1(a,b,c,d));

insert into t1 select i, i, i, i from generate_series(1,100) s(i);
insert into t2 select i, i, i, i from generate_series(1,100) s(i);

Now, let's say we have a query joining the tables on (a,b,c). The
original code did this:


explain select * from t1 join t2 using (a,b,c);

   QUERY PLAN
-
 Hash Join  (cost=37789.00..79094.01 rows=1 width=20)
   Hash Cond: ((t1.a = t2.a) AND (t1.b = t2.b) AND (t1.c = t2.c))
   ->  Seq Scan on t1  (cost=0.00..15406.00 rows=100 width=16)
   ->  Hash  (cost=15406.00..15406.00 rows=100 width=16)
         ->  Seq Scan on t2  (cost=0.00..15406.00 rows=100 width=16)

while the new code does this:

   QUERY PLAN
-
 Hash Join  (cost=37789.00..79094.01 rows=100 width=20)
   Hash Cond: ((t1.a = t2.a) AND (t1.b = t2.b) AND (t1.c = t2.c))
   ->  Seq Scan on t1  (cost=0.00..15406.00 rows=100 width=16)
   ->  Hash  (cost=15406.00..15406.00 rows=100 width=16)
 ->  Seq Scan on t2  (cost=0.00..15406.00 rows=100 width=16)

This happens (I believe) because the new code only looks for the longest 
match, without the "full match" restriction. So while the old code finds 
(a,b)+c, the new code finds (a,b,c,d).


I'm not saying the new code is wrong - the "full match" restriction was 
there just to simplify v1 of the patch, because dealing with partial 
matches is a bit more complicated. So it's desirable to relax this 
condition.


It's also true that the new code works better in some cases, e.g. this

explain select * from t1 join t2 using (a,b,c,d);

 QUERY PLAN
---
 Hash Join  (cost=40289.00..85344.01 rows=100 width=16)
   Hash Cond: ((t1.a = t2.a) AND (t1.b = t2.b) AND (t1.c = t2.c)
 AND (t1.d = t2.d))
   ->  Seq Scan on t1  (cost=0.00..15406.00 rows=100 width=16)
   ->  Hash  (cost=15406.00..15406.00 rows=100 width=16)
 ->  Seq Scan on t2  (cost=0.00..15406.00 rows=100 width=16)

was originally estimated like this

 QUERY PLAN

Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-09-24 Thread David Rowley
On 24 September 2015 at 23:57, Tomas Vondra 
wrote:

>
> 2) find_best_match_foreign_key
> --
>
> I think the comment before the function needs rephrasing (seems a bit
> broken to me). I do like the approach in general, although it changes the
> semantics a bit. The original code only considered "fully matching" fkeys,
> while the new code simply takes the longest match.
>
>
>
Oops, I did not code this at all the way I had originally pictured it. I
guess the evidence of that is in the function comment, which I wrote before
coding the function.
I of course intended to only allow full matches.
A full patch which fixes this is attached. This also should fix the clause
duplication trick that I talked about.


> The comment before the function mentions it's possible to confuse the code
> with duplicate quals. Can you give an example?
>
>
Something like: SELECT * FROM a LEFT JOIN b ON a.id=b.id and a.id=b.id AND
a.id=b.id AND a.id2 = b.id2 AND a.id3 = b.id3;

Note a.id = b.id repeated 3 times.

Where a foreign key exists on (a.id) ref (b.id), and also (a.id2, a.id3)
ref (b.id2, b.id3). In this case we match 3 quals for the FK with 1 key,
and 2 quals with the FK with 2 keys. But this is now fixed in the attached.

I used an outer join here as they won't be transformed into eclass members
and back to quals again. INNER JOIN wouldn't allow the duplicates to
materialise again.



>
> 3) clauselist_join_selectivity
> --
>
> I think this is actually broken, i.e. it does not handle the cases it was
> handling before - namely the cases where three or more tables are joined
> (fkjoin2.sql).
>
> Imagine a "fact" table referencing two dimensions, using 2-column fkeys:
>
> create table a (a1 int, a2 int, unique (a1,a2));
> create table b (b1 int, b2 int, unique (b1,b2));
> create table f (a1 int, a2 int, b1 int, b2 int);
>
> insert into a select i,i from generate_series(0,999999) s(i);
> insert into b select i,i from generate_series(0,999999) s(i);
> insert into f select i/10,i/10,i/10,i/10
> from generate_series(0,9999999) s(i);
>
> alter table f add foreign key (a1,a2) references a(a1,a2);
> alter table f add foreign key (b1,b2) references b(b1,b2);
>
> Each dimension has 1M rows, fact has 10M rows (and references each row in
> dimensions 10x).
>
> Now, let's see this simple star join:
>
> explain select * from f join a using (a1,a2)
> join b using (b1,b2);
>
> This should return all 10M rows, and originally this was estimated like
> this:
>
>  QUERY PLAN
> ---
> Hash Join  (cost=4.00..573853.57 rows=1175 width=16)
>   Hash Cond: ((f.b1 = b.b1) AND (f.b2 = b.b2))
>   ->  Hash Join  (cost=2.00..363955.16 rows=1175 width=16)
>         Hash Cond: ((f.a1 = a.a1) AND (f.a2 = a.a2))
>         ->  Seq Scan on f  (cost=0.00..154056.75 rows=1175 width=16)
>         ->  Hash  (cost=14425.00..14425.00 rows=100 width=8)
>               ->  Seq Scan on a  (cost=0.00..14425.00 rows=100 width=8)
>   ->  Hash  (cost=14425.00..14425.00 rows=100 width=8)
>         ->  Seq Scan on b  (cost=0.00..14425.00 rows=100 width=8)
>
> so pretty much perfectly accurate, while the new code does this:
>
>  QUERY PLAN
> --
>  Hash Join  (cost=4.00..573853.57 rows=10 width=16)
>    Hash Cond: ((f.b1 = b.b1) AND (f.b2 = b.b2))
>    ->  Hash Join  (cost=2.00..363955.16 rows=1175 width=16)
>          Hash Cond: ((f.a1 = a.a1) AND (f.a2 = a.a2))
>          ->  Seq Scan on f  (cost=0.00..154056.75 rows=1175 width=16)
>          ->  Hash  (cost=14425.00..14425.00 rows=100 width=8)
>                ->  Seq Scan on a  (cost=0.00..14425.00 rows=100 width=8)
>    ->  Hash  (cost=14425.00..14425.00 rows=100 width=8)
>          ->  Seq Scan on b  (cost=0.00..14425.00 rows=100 width=8)
>
> I think this is due to this check in clauselist_join_selectivity:
>
>    if (bms_get_singleton_member(sjinfo->min_righthand, &relid1) &&
>        bms_get_singleton_member(sjinfo->min_lefthand, &relid2))
>
> which pretty much says "only work with joins on two base relations". So as
> long as you have a joinrel (e.g. a fact + one of the dimensions), it's
> game over.
>
> I think the restriction is unnecessary, because when estimating joins, we
> effectively take cardinality of a cartesian product of all the base
> relations, and then apply selectivities for all the join quals (in a
> pairwise manner).
>
> So for the three tables we take
>
>card(join) = card(f) * card(a) * card(b) * sel(f,a) * sel(f,b)
>
> and we can consider the foreign keys in the pairwise selectivities.
>
>
I looked at this again, and I'm not all that sure it's possible to assume
that 1.0 / ntuples is valid when there's more than one relation at either
side 

Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-09-23 Thread David Rowley
On 23 September 2015 at 17:11, David Rowley wrote:

> find_foreign_key_clauses() should look for the longest match and return a
> Bitmapset of the list indexes to the caller.
> It might be possible to fool the longest match logic by duplicating
> clauses, e.g. a1 = b1 AND a1 = b1 AND a1 = b1 AND a2 = b2 AND a3 = b3, but
> I can't imagine that matters; if it did, we could code it to be smart
> enough to see through that.
>

I took a bash at implementing what I described, and I've ended up with the
attached.

git diff --stat gives me:

 src/backend/optimizer/path/costsize.c      | 717 --
 src/backend/optimizer/plan/analyzejoins.c |   1 +
 src/backend/optimizer/util/plancat.c  | 112 +++--
 3 files changed, 228 insertions(+), 602 deletions(-)

So it's removed quite a bit of code. I also discovered that
1.0 / Max(rel->tuples,1.0) is no good for SEMI and ANTI joins. I've coded
around this in the attached, but I'm not certain it's the best way of
doing things.
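To illustrate the SEMI/ANTI join problem with a minimal sketch (my own
tables, not from the patch): the foreign key guarantees that every child
row has a match, so a semi join keeps all child rows (selectivity 1.0,
not 1.0 / tuples) and an anti join keeps none:

create table p (id int primary key);
create table c (id int references p (id));
insert into p select i from generate_series(1, 10000) s(i);
insert into c select (i % 10000) + 1 from generate_series(1, 100000) s(i);
analyze;

-- semi join: all 100000 rows of c survive
explain select * from c where exists
  (select 1 from p where p.id = c.id);
-- anti join: no rows survive
explain select * from c where not exists
  (select 1 from p where p.id = c.id);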

I thought that it might be possible to add some regression test around
this: simply find a plan that uses a nested loop due to underestimation of
matching rows, and then make sure it no longer uses a nested loop once the
foreign key is added. I've not yet done this in the attached, but something
like the sketch below is what I have in mind.
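Roughly this shape (a sketch only - the table names and sizes here are
made up, and the plan change isn't guaranteed on every setup):

create table dim (d1 int, d2 int, primary key (d1, d2));
create table fct (d1 int, d2 int);
insert into dim select i, i from generate_series(1, 10000) s(i);
insert into fct select (i % 10000) + 1, (i % 10000) + 1
  from generate_series(1, 100000) s(i);
analyze;

-- without the FK the two-column join is underestimated, which can tip
-- the planner into a nested loop
explain (costs off) select * from fct join dim using (d1, d2);

alter table fct add foreign key (d1, d2) references dim (d1, d2);

-- with the FK the estimate should come out near 100000 and the plan
-- should switch
explain (costs off) select * from fct join dim using (d1, d2);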

Patch attached in delta form and complete form.

I still need to perform more analysis on the plancat.c changes.

Have I made any changes that you disagree with?

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services


estimation-with-fkeys-v2_david.patch
Description: Binary data


estimation-with-fkeys-v2_david_delta.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-09-22 Thread David Rowley
On 20 August 2015 at 13:49, Tomas Vondra wrote:

> attached is a significantly reworked patch for using the foreign keys in
> selectivity estimation.
>

Thanks for working up a new patch; I'm starting to look at it again now.

Here are my thoughts so far:

+ /*
+ * TODO Can we do something like (hasindex) here? Is it necessary? The
+ *  trouble with that is that we don't have a good place to reset that
+ *  flag (relhasindex is reset by vacuum, but is has nothing to do with
+ *  foreign keys at this point).
+ */
+ if (true)

You don't seem to have addressed this part yet.

I mentioned previously that I had attempted this already. Here was the
response
http://www.postgresql.org/message-id/18703.1410450...@sss.pgh.pa.us

Please just remove the comment and if (true).  By today's
standards relhasindex would never be added.
Does it not just serve as an optimization only when the rel is not cached
anyway?

--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -562,7 +562,6 @@ remove_rel_from_joinlist(List *joinlist, int relid, int
*nremoved)
  return result;
 }

-
 /*

Accidental change, please remove.


+static Selectivity
+clauselist_join_selectivity(PlannerInfo *root, List *joinquals, int varno,
+ JoinType jointype, SpecialJoinInfo *sjinfo)

varno is not used.

+ /*
+ * First we'll identify foreign keys that are fully matched by the join
+ * clauses, and we'll update the selectivity accordingly while doing so.
+ */
+ fkeys = find_satisfied_fkeys(root, sjinfo, joinquals, );
+
+ /*
+ * Now that we have the foreign keys, we can get rid of the clauses
+ * matching any of them, and only keep the remaining clauses, so that
+ * we can estimate them using the regular selectivity estimation.
+ */
+ unmatched = filter_fk_join_clauses(root, sjinfo, fkeys, joinquals);

This seems a bit convoluted and inefficient.
Why not just return a bitmask of the matched items indexed by the list
position?
Doing it this way you're having to match the expressions twice. Once seems
fine.
Did you read my review about looking for the longest matches by counting
the bits in the mask?

+ * Another slightly strange case is FK constraints in both directions (these
+ * statements don't work - the foreign keys need to be established using
+ * ALTER, but for illustration it's sufficient).
+ *
+ * CREATE TABLE a (
+ * a1 INT,
+ * a2 INT,
+ * UNIQUE (a1, a2),
+ * FOREIGN KEY (a1, a2) REFERENCES a (b1, b2)
+ * );

Did you perhaps mean b? i.e. REFERENCES b (b1, b2) ?

Same is duplicated further down.

+ *sel *= (1.0 / rel_outer->tuples);

I think this needs to be :

+ *sel *= (1.0 / Max(rel_outer->tuples, 1.0));

As the following causes a divide by zero.

See attached divide_by_zero.sql

Basically causes this:  Hash Join  (cost=8.29..-1.#J rows=1 width=40)
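For the record, my guess at a minimal reproduction (I haven't verified
this against the attachment; the -1.#J cost is how the Windows C runtime
prints a non-finite number): both tables empty and vacuumed, so the
referenced rel's tuples count is 0 when the FK selectivity is applied:

create table p (a int, b int, primary key (a, b));
create table c (a int, b int, foreign key (a, b) references p (a, b));
vacuum analyze p;
vacuum analyze c;
explain select * from c join p using (a, b);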


+ * XXX This does exactly the same thing as the previous loop, so no
+ * comments.

It would be better if we could do something more how make_join_rel() does
things like:

add_paths_to_joinrel(root, joinrel, rel1, rel2,
JOIN_INNER, sjinfo,
restrictlist);
add_paths_to_joinrel(root, joinrel, rel2, rel1,
JOIN_INNER, sjinfo,
restrictlist);

Where the inputs to the function are just swapped.


+ outer = -1;
+ while ((outer = bms_next_member(sjinfo->min_lefthand, outer)) >= 0)

Why did you change this from ensuring we get a singleton member?
I'd imagine if more than 2 relations are required to make the join, then we
can't use foreign keys, as the other join may duplicate or eliminate tuples.
Perhaps you've thought of this more than I have. Would you be able to
explain?

+ * XXX Maybe we should estimate even the single-column keys here,
+ * as it's really cheap. But it can't do any cross-table check
+ * of MCV lists or whatever clauselist_selectivity() does.
+ */
+ if (fkinfo->nkeys < 2)
+ continue;

This should be removed. I see no reason at all to pass up likely perfect
estimates in favour of histogram lookup estimates.


Overall, I really feel that you should just take the longest matching
foreign key, i.e. the one that matches the most clauses in the joinquals,
and completely ignore any shorter or partial matches.  Just that alone is
probably 99.9999% of the use cases that will benefit from this. No use
adding weird code that's only right half the time for the other 0.0001%
of use cases.

I imagine clauselist_join_selectivity should be more like:

int outerrelid, innerrelid;

if (bms_get_singleton_member(sjinfo->min_righthand, &innerrelid) &&
   bms_get_singleton_member(sjinfo->min_lefthand, &outerrelid))
{
   Bitmapset *fkclauses1, *fkclauses2;
   List *unmatched;
   Selectivity sel;
   RelOptInfo *innerrel = find_base_rel(root, innerrelid);
   RelOptInfo *outerrel = find_base_rel(root, outerrelid);

   fkclauses1 = find_foreign_key_clauses(root, outerrel, innerrel,
joinquals);
   fkclauses2 = find_foreign_key_clauses(root, innerrel, outerrel,
joinquals);

  if (fkclauses1 || fkclauses2)
  {
 if 

Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-08-26 Thread Michael Paquier
On Thu, Aug 20, 2015 at 11:25 AM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:

 On 08/20/2015 03:49 AM, Tomas Vondra wrote:

 Then on current master, I get these estimates (showing just rows,
 because that's what matters):

 [...]

Moved to next CF 2015-09.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-08-19 Thread Tomas Vondra

Hi,

attached is a significantly reworked patch for using the foreign keys in 
selectivity estimation. The previous patch only really worked if the 
clauses matched the foreign key perfectly (i.e. no additional join 
clauses) - this patch attempts to relax those restrictions a bit.


This patch also significantly improves the comments - the best place to 
start reading is clauselist_join_selectivity().


In general, the patch works by looking for foreign keys between the 
inner and outer side of the join, but only for keys that:


  (a) have at least 2 keys (as this only really helps with multi-
  column foreign keys)

  (b) are 'fully matched' by the join clauses, i.e. there's a clause
  exactly matching each part of the foreign key

Once we have matched foreign keys (for each join), we can estimate each 
of them using 1/cardinality of the referenced table and estimate the 
remaining clauses (not used to match any foreign key) the old way.


example with 3 tables
---------------------
create table a (a1 int, a2 int, primary key (a1, a2));

create table b (b1 int, b2 int, primary key (b1, b2));

create table f (f1 int, f2 int, f3 int, f4 int,
   foreign key (f1,f2) references a(a1,a2),
   foreign key (f3,f4) references b(b1,b2));

insert into a select i, i from generate_series(1,10000) s(i);
insert into b select i, i from generate_series(1,10000) s(i);

-- 10x
insert into f select i, i, i, i from generate_series(1,10000) s(i);

analyze;

Then on current master, I get these estimates (showing just rows, 
because that's what matters):



while with the patch I get this:

select * from f join a on (f1 = a1 and f2 = a2);

                         QUERY PLAN
------------------------------------------------------------
 Hash Join  (rows=100000) (actual rows=100000)
   Hash Cond: ((f.f1 = a.a1) AND (f.f2 = a.a2))
   ->  Seq Scan on f  (rows=100000) (actual rows=100000)
   ->  Hash  (rows=10000) (actual rows=10000)
         Buckets: 16384  Batches: 1  Memory Usage: 519kB
         ->  Seq Scan on a  (rows=10000) (actual rows=10000)

select * from f join a on (f1 = a1 and f2 = a2)
join b on (f3 = b1 and f4 = b2);

                            QUERY PLAN
------------------------------------------------------------------
 Hash Join  (rows=100000) (actual rows=100000)
   Hash Cond: ((f.f3 = b.b1) AND (f.f4 = b.b2))
   ->  Hash Join  (rows=100000) (actual rows=100000)
         Hash Cond: ((f.f1 = a.a1) AND (f.f2 = a.a2))
         ->  Seq Scan on f  (rows=100000) (actual rows=100000)
         ->  Hash  (rows=10000) (actual rows=10000)
               Buckets: 16384  Batches: 1  Memory Usage: 519kB
               ->  Seq Scan on a  (rows=10000) (actual rows=10000)
   ->  Hash  (rows=10000) (actual rows=10000)
         Buckets: 16384  Batches: 1  Memory Usage: 519kB
         ->  Seq Scan on b  (rows=10000) (actual rows=10000)

So, that's pretty good.
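To spell out the arithmetic behind those estimates: each fully matched
foreign key contributes a selectivity of 1/cardinality of the referenced
table, so

   card(f join a) = card(f) * card(a) * (1/card(a))
                  = 100000 * 10000 * (1/10000) = 100000

and the second join multiplies in card(b) * (1/card(b)) = 1, leaving the
estimate at 100000 - exactly what the plans show.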

I believe it might be possible to use even foreign keys matched only 
partially (e.g. foreign key on 3 columns, but only 2 of those matched by 
clauses), but I think that's a bit too much for now.
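For illustration, a partially matched case might look like this (a sketch
of the idea only - the patch does not handle it):

create table d (d1 int, d2 int, d3 int, primary key (d1, d2, d3));
create table f2 (d1 int, d2 int, d3 int,
                 foreign key (d1, d2, d3) references d (d1, d2, d3));

-- only two of the three FK columns are joined, so each f2 row can match
-- as many d rows as there are d3 values for its (d1, d2) pair
explain select * from f2 join d using (d1, d2);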


The examples above are rather trivial, and sadly it's not difficult to 
break them. For example, adding a single additional join clause to the 
first query does this:


select * from f join a on (f1 = a1 and f2 = a2 and f1 = a2);

                        QUERY PLAN
------------------------------------------------------
 Hash Join  (rows=2) (actual rows=100000)
   Hash Cond: (f.f1 = a.a1)
   ->  Seq Scan on f  (rows=500) (actual rows=100000)
         Filter: (f1 = f2)
   ->  Hash  (rows=50) (actual rows=10000)
         Buckets: 16384 (originally 1024)  Batches: 1 (originally 1) ...
         ->  Seq Scan on a  (rows=50) (actual rows=10000)
               Filter: (a1 = a2)

This of course happens thanks to equivalence classes - the planner 
is smart enough to realize that (f1=f2=a1=a2) thanks to the new clause. 
I'm not sure how to fix this, and maybe this particular case would be 
easier to fix using the multivariate stats (once they can estimate clauses 
like a1=a2).


Similarly, the equivalence classes may break other examples by deriving 
completely new clauses - for example, assume the f table is defined 
like this (again, with 100k rows):


create table f (f1 int, f2 int,
   foreign key (f1,f2) references a(a1,a2),
   foreign key (f1,f2) references b(b1,b2));

then this query

select * from f join a on (f1 = a1 and f2 = a2)
join b on (f1 = b1 and f2 = b2);

may get planned like this:

                       QUERY PLAN
----------------------------------------------------
 Hash Join  (rows=100000) (actual rows=100000)
   Hash Cond: ((f.f1 = a.a1) AND (f.f2 = a.a2))
   ->  Seq Scan on f  (rows=100000) (actual rows=100000)
   ->  Hash  (rows=10000) (actual rows=10000)
         ->  Hash Join  (rows=10000) (actual rows=10000)
               Hash Cond: 

Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-08-19 Thread Tomas Vondra


On 08/20/2015 03:49 AM, Tomas Vondra wrote:

Then on current master, I get these estimates (showing just rows,
because that's what matters):


while with the patch I get this:


And of course I forgot to include the plans from master, so here we go:

select * from f join a on (f1 = a1 and f2 = a2);

                        QUERY PLAN
------------------------------------------------------
 Hash Join  (rows=10) (actual rows=100000)
   Hash Cond: ((f.f1 = a.a1) AND (f.f2 = a.a2))
   ->  Seq Scan on f  (rows=100000) (actual rows=100000)
   ->  Hash  (rows=10000) (actual rows=10000)
         Buckets: 16384  Batches: 1  Memory Usage: 519kB
         ->  Seq Scan on a  (rows=10000) (actual rows=10000)


select * from f join a on (f1 = a1 and f2 = a2)
join b on (f3 = b1 and f4 = b2);

                          QUERY PLAN
---------------------------------------------------------------
 Nested Loop  (rows=1) (actual rows=100000)
   ->  Hash Join  (rows=10) (actual rows=100000)
         Hash Cond: ((f.f1 = a.a1) AND (f.f2 = a.a2))
         ->  Seq Scan on f  (rows=100000) (actual rows=100000)
         ->  Hash  (rows=10000) (actual rows=10000)
               Buckets: 16384  Batches: 1  Memory Usage: 519kB
               ->  Seq Scan on a  (rows=10000) (actual rows=10000)
   ->  Index Only Scan using b_pkey on b  (rows=1) (actual rows=1)
         Index Cond: ((b1 = f.f3) AND (b2 = f.f4))

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-06-08 Thread David Rowley
On 19 May 2015 at 11:08, Tomas Vondra tomas.von...@2ndquadrant.com wrote:


 On 05/17/15 14:31, David Rowley wrote:


 I think if you find any quals that are a part of *any* foreign key
 between the 2 join tables, then you should never assume these quals
 to reduce the number of rows. I believe this should be the case even if
 you don't fully match all columns of the foreign key.

 If we modify your example a little, let's say your foreign key between
 fact and dim is made from 3 columns (a,b,c)

 If we do:

 EXPLAIN SELECT * FROM fact f JOIN dim d USING (a,b);

 Then we should always (under normal circumstances) find at least one
 matching row, although in this case since the join qual for c is
 missing, we could find more than 1 matching row.

 Without digging too deep here, I'd say that the best way to do this
 would be to either have calc_joinrel_size_estimate() build a list of
 restrictinfo items of all quals that are NOT part of any foreign key and
 pass that trimmed list down to clauselist_selectivity() for selectivity
 estimates. Or perhaps a better way would be just to
 teach clauselist_selectivity() about foreign keys. Likely
 clauselist_selectivity() would just have to skip over RestrictInfo items
 that are part of a foreign key.


 I'm not particularly happy about the way the current patch tweaks the
 selectivity, but I think simply removing the clauses is not going to work,
 because that implies the (removed) conditions have selectivity 1.0 (so the
 estimate would match a cartesian product). So we really need to estimate
 the selectivity, we can't just throw them away.


Of course I should have said 1.0 / estimated rows


 And that's essentially what the current patch does - it realizes the
 clauses match a FK, and then computes the estimate as 1/N where N is the
 cardinality of the table with PK.

 Another thing is that there may be multiple candidate foreign keys, and
 we can't just remove clauses matching at least one of them. Imagine e.g. a
 (somewhat artificial) example:

 CREATE TABLE parent (
a INT,
b INT,
c INT,
d INT,
PRIMARY KEY (a,b),
UNIQUE (c,d)
 );

 CREATE TABLE child (
a INT,
b INT,
c INT,
d INT,
FOREIGN KEY (a,b) REFERENCES parent (a,b),
FOREIGN KEY (c,d) REFERENCES parent (c,d)
 );

 and a join on (a,b,c,d). We can't just discard all the conditions, because
 (a,b) and (c,d) may 'mismatch'. We know (a,b) and (c,d) each match about 1/N
 of the 'parent' table, but we don't know selectivity for the whole join
 condition.



OK, how about we ignore partial foreign key matches and just get things
working when additional quals exist that are not part of the foreign key.

I think it would then just be a matter of deciding which foreign key to
use the quals for, and which ones to estimate the old-fashioned way.

How about we have a function:

List *
get_non_foreign_key_quals(PlannerInfo *root, List *restrictlist)
/* XXX think of a better name */

which will be called from within calc_joinrel_size_estimate()

This function would simply take the list that's currently being sent off
to clauselist_selectivity() and search it for the biggest possible
subset of foreign key columns. The returned list would be what we pass
to clauselist_selectivity(), so if nothing matched we'd just do the
normal thing.

Let's say we have a restrictlist like in the join you describe above...
I'll use p for parent, and c for the child table:

{p.a = c.a}, {p.b = c.b}, {p.c = c.c}, {p.d = c.d}

get_non_foreign_key_quals would parse that list looking for foreign keys
going in either direction between these relations. It would look for
the longest chain of quals that match a foreign key and return the quals
which couldn't be matched. This "couldn't match" list is what would be
passed to clauselist_selectivity(), and the return value from that
function would have to be multiplied by, I think, 1.0 / min(est inner
rows, est outer rows). I've not quite gotten my head around how outer
joins change that, but I think that's ok for inner... Perhaps the divisor
depends on which relation the fk is on... I need to think a bit more to
get my head around that...

I think you could implement this by generating a List of Bitmapsets,
pseudo-code along the lines of:

List *
get_non_foreign_key_quals(PlannerInfo *root, List *restrictlist)
/* XXX think of a better name */
{
List *fkey_matches;
foreach (foreign key in list)
{
Bitmapset *matches = NULL;
if (foreign key is not for this rel)
   continue;

foreach (qual in fkey)
{
 int which_qual = 1;
 foreach (qual in restrict list)
 {
if (fkey qual matches ri qual)
   matches = bms_add_member(matches, which_qual);
which_qual++;
 }
   }
   fkey_matches = lappend(fkey_matches, matches);
}

int longest_match = -1;
int longest_match_idx;
int 

Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-05-18 Thread Tomas Vondra

Hi David,

On 05/17/15 14:31, David Rowley wrote:

Hi Tomas,

I did glance at this patch a while back, but just thinking on it again.

I think if you find any quals that are a part of *any* foreign key
between the 2 join tables, then you should never assume these quals
to reduce the number of rows. I believe this should be the case even if
you don't fully match all columns of the foreign key.

If we modify your example a little, let's say your foreign key between
fact and dim is made from 3 columns (a,b,c)

If we do:

EXPLAIN SELECT * FROM fact f JOIN dim d USING (a,b);

Then we should always (under normal circumstances) find at least one
matching row, although in this case since the join qual for c is
missing, we could find more than 1 matching row.

Without digging too deep here, I'd say that the best way to do this
would be to either have calc_joinrel_size_estimate() build a list of
restrictinfo items of all quals that are NOT part of any foreign key and
pass that trimmed list down to clauselist_selectivity() for selectivity
estimates. Or perhaps a better way would be just to
teach clauselist_selectivity() about foreign keys. Likely
clauselist_selectivity() would just have to skip over RestrictInfo items
that are part of a foreign key.


I'm not particularly happy about the way the current patch tweaks the 
selectivity, but I think simply removing the clauses is not going to 
work, because that implies the (removed) conditions have selectivity 1.0 
(so the estimate would match a cartesian product). So we really need to 
estimate the selectivity, we can't just throw them away.


And that's essentially what the current patch does - it realizes the 
clauses match a FK, and then computes the estimate as 1/N where N is the 
cardinality of the table with PK.


Another thing is that there may be multiple candidate foreign keys, 
and we can't just remove clauses matching at least one of them. Imagine 
e.g. a (somewhat artificial) example:


CREATE TABLE parent (
   a INT,
   b INT,
   c INT,
   d INT,
   PRIMARY KEY (a,b),
   UNIQUE (c,d)
);

CREATE TABLE child (
   a INT,
   b INT,
   c INT,
   d INT,
   FOREIGN KEY (a,b) REFERENCES parent (a,b),
   FOREIGN KEY (c,d) REFERENCES parent (c,d)
);

and a join on (a,b,c,d). We can't just discard all the conditions, 
because (a,b) and (c,d) may 'mismatch'. We know (a,b) and (c,d) each match 
about 1/N of the 'parent' table, but we don't know selectivity for the 
whole join condition.


And it may be more complex, e.g. there may be two overlapping FKeys, 
say (a,b) and (b,c) - how do you split that?
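Just to make the overlapping case concrete (illustration only - as said, 
it's not clear how the selectivity should be split between the two keys):

CREATE TABLE parent2 (
   a INT,
   b INT,
   c INT,
   UNIQUE (a,b),
   UNIQUE (b,c)
);

CREATE TABLE child2 (
   a INT,
   b INT,
   c INT,
   FOREIGN KEY (a,b) REFERENCES parent2 (a,b),
   FOREIGN KEY (b,c) REFERENCES parent2 (b,c)
);

A join on (a,b,c) matches both foreign keys, and they share column b.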


But this may be overkill (multiple multi-column FK joins), although if 
we could handle that reasonably, then why not ... someone out there 
certainly has a schema like that ;-)


What I think is somewhat more realistic are conditions matching only a 
subset of FK columns - for example, with a FK on (a,b) the join may only 
use (a). Then again - we can't just discard that, we need to estimate it 
(and use it to compute the actual selectivity).


I agree that if we want to do anything fancy with this, we will have to 
teach clauselist_selectivity() about foreign keys.


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-05-17 Thread David Rowley
On 7 April 2015 at 13:41, Tomas Vondra tomas.von...@2ndquadrant.com wrote:


 (1) The current patch only does the trick when the FK matches the
 conditions perfectly - when there are no missing columns (present
 in the FK, not covered by a condition).


Hi Tomas,

I did glance at this patch a while back, but just thinking on it again.

I think if you find any quals that are a part of *any* foreign key between
the 2 join tables, then you should never assume these quals to reduce
the number of rows. I believe this should be the case even if you don't
fully match all columns of the foreign key.

If we modify your example a little, let's say your foreign key between fact
and dim is made from 3 columns (a,b,c)

If we do:

EXPLAIN SELECT * FROM fact f JOIN dim d USING (a,b);

Then we should always (under normal circumstances) find at least one
matching row, although in this case since the join qual for c is missing,
we could find more than 1 matching row.

Without digging too deep here, I'd say that the best way to do this would
be to either have calc_joinrel_size_estimate() build a list of restrictinfo
items of all quals that are NOT part of any foreign key and pass that
trimmed list down to clauselist_selectivity() for selectivity estimates. Or
perhaps a better way would be just to teach clauselist_selectivity() about
foreign keys. Likely clauselist_selectivity() would just have to skip over
RestrictInfo items that are part of a foreign key.

Regards

David Rowley