Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-18 Thread Tom Lane
David Geier  writes:
> On 11/18/22 14:00, Tomas Vondra wrote:
>> Seems fine. I wonder if we could/could introduce a new constant for 0,
>> similar to ATTSTATSSLOT_NUMBERS/ATTSTATSSLOT_VALUES, instead of using a
>> magic constant. Say, ATTSTATSSLOT_NONE or ATTSTATSSLOT_CHECK.

> Good idea. I called it ATTSTATSSLOT_EXISTS. New patch attached.

No, I don't think it's a good idea.  The flags argument is documented as,
and used as, a bitmask of multiple options.  Passing zero fits fine with
that and is consistent with what we do elsewhere.  Turning it into
sort-of-an-enum-but-not-really isn't an improvement.

I didn't like your draft comment too much, because it didn't cover
what I think is the most important point: after a call with flags=0
we do not need a matching free_attstatsslot call to avoid leaking
anything.  (If we did, this patch would be a lot hairier.)

I rewrote the comment the way I wanted it and pushed.

regards, tom lane




Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-18 Thread David Geier

On 11/18/22 14:00, Tomas Vondra wrote:

Seems fine. I wonder if we could/could introduce a new constant for 0,
similar to ATTSTATSSLOT_NUMBERS/ATTSTATSSLOT_VALUES, instead of using a
magic constant. Say, ATTSTATSSLOT_NONE or ATTSTATSSLOT_CHECK.

Good idea. I called it ATTSTATSSLOT_EXISTS. New patch attached.

I don't think you can write a test for this, because there is no change
to behavior that can be observed by the user. If one side has no MCV,
the only difference is whether we try to load the other MCV or not.


Yeah. I thought along the lines of checking the number of pages read 
when the pg_stats entry is not in syscache yet. But that seems awfully 
implementation specific. So no test provided.


--
David Geier
(ServiceNow)
From 5c5c0fb9dd99e79daaa015984c9dda22e4ccda17 Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Fri, 18 Nov 2022 09:35:08 +0100
Subject: [PATCH] Don't read MCV stats needlessly in join selectivity
 estimation

Join selectivity estimation only uses MCV stats if they're present
on both join attributes. Therefore, if MCV stats are not available
on one of the two columns, skip reading MCV stats altogether.

Frequently encountered situations in which MCV stats not available
is unique columns or columns for which all values have roughly
the same frequency and therefore ANALYZE decided to not store them.
---
 src/backend/utils/adt/selfuncs.c| 17 +++--
 src/backend/utils/cache/lsyscache.c |  4 
 src/include/utils/lsyscache.h   |  3 ++-
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d597b7e81f..85a29eaeb9 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2263,6 +2263,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	bool		have_mcvs2 = false;
 	bool		join_is_reversed;
 	RelOptInfo *inner_rel;
+	bool		get_mcv_stats;
 
 	get_join_variables(root, args, sjinfo,
 	   , , _is_reversed);
@@ -2275,11 +2276,23 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	memset(, 0, sizeof(sslot1));
 	memset(, 0, sizeof(sslot2));
 
+	/*
+	 * There is no use in fetching one side's MCVs if we lack MCVs for the
+	 * other side, so do a quick check to verify that both stats exist.
+	 */
+	get_mcv_stats = (HeapTupleIsValid(vardata1.statsTuple) &&
+			 HeapTupleIsValid(vardata2.statsTuple) &&
+			 get_attstatsslot(, vardata1.statsTuple,
+	  STATISTIC_KIND_MCV, InvalidOid, ATTSTATSSLOT_EXISTS) &&
+			 get_attstatsslot(, vardata2.statsTuple,
+	  STATISTIC_KIND_MCV, InvalidOid, ATTSTATSSLOT_EXISTS));
+
+
 	if (HeapTupleIsValid(vardata1.statsTuple))
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats1 = (Form_pg_statistic) GETSTRUCT(vardata1.statsTuple);
-		if (statistic_proc_security_check(, opfuncoid))
+		if (get_mcv_stats && statistic_proc_security_check(, opfuncoid))
 			have_mcvs1 = get_attstatsslot(, vardata1.statsTuple,
 		  STATISTIC_KIND_MCV, InvalidOid,
 		  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
@@ -2289,7 +2302,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats2 = (Form_pg_statistic) GETSTRUCT(vardata2.statsTuple);
-		if (statistic_proc_security_check(, opfuncoid))
+		if (get_mcv_stats && statistic_proc_security_check(, opfuncoid))
 			have_mcvs2 = get_attstatsslot(, vardata2.statsTuple,
 		  STATISTIC_KIND_MCV, InvalidOid,
 		  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index a16a63f495..191f68f7b4 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -3157,6 +3157,10 @@ get_attavgwidth(Oid relid, AttrNumber attnum)
  * reqkind: STAKIND code for desired statistics slot kind.
  * reqop: STAOP value wanted, or InvalidOid if don't care.
  * flags: bitmask of ATTSTATSSLOT_VALUES and/or ATTSTATSSLOT_NUMBERS.
+ *Passing 0 will only check if the requested slot type exists but not
+ *extract any content. This is useful in cases where MCV stats are only
+ *useful if they exists on all involved columns (e.g. during join
+ *selectivity computation).
  *
  * If a matching slot is found, true is returned, and *sslot is filled thus:
  * staop: receives the actual STAOP value.
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index 50f0288305..d9a3916997 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -39,7 +39,8 @@ typedef enum IOFuncSelector
 } IOFuncSelector;
 
 /* Flag bits for get_attstatsslot */
-#define ATTSTATSSLOT_VALUES		0x01
+#define ATTSTATSSLOT_EXISTS	0x00
+#define ATTSTATSSLOT_VALUES	0x01
 #define ATTSTATSSLOT_NUMBERS	0x02
 
 /* Result struct for get_attstatsslot */
-- 
2.34.1



Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-18 Thread Tom Lane
Tomas Vondra  writes:
> On 11/18/22 09:54, David Geier wrote:
>> I couldn't come up with any reasonable way of writing an automated test
>> for that.

> I don't think you can write a test for this, because there is no change
> to behavior that can be observed by the user.

Yeah, and the delta in performance is surely too small to be
measured reliably in the buildfarm.  I think coverage will have
to be sufficient.

regards, tom lane




Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-18 Thread Tomas Vondra
On 11/18/22 09:54, David Geier wrote:
> Thanks everyone for the great feedback and suggestions.
> 
>>
>>> Yes, it is.  Using zero flag would short-cut get_attstatsslot() to just
>>> return whether the slot type exists without loading it.  Do you think we
>>> need to emphasize this use case in the comments for 'flags'?
>> Perhaps, it's not really obvious now.
> 
> Comment added.
> 
> 
>> I wonder whether we need to also check statistic_proc_security_check()
>>> when determining if MCVs exists in both sides.
>> Yeah, I thought about hoisting the statistic_proc_security_check
>> tests up into get_mcv_stats.  I don't think it's a great idea
>> though.  Again, it'd complicate untangling this if we ever
>> generalize the use of MCVs in this function.  Also, I don't
>> think we should be micro-optimizing the case where the security
>> check doesn't pass --- if it doesn't, you're going to be hurting
>> from bad plans a lot more than you are from some wasted cycles
>> here.
> 
> Sounds reasonable.
> 
> Attached is v2 of the patch.
> This is basically Tom's version plus a comment for the flags of
> get_attstatslot() as suggested by Richard.
> 

Seems fine. I wonder if we could/could introduce a new constant for 0,
similar to ATTSTATSSLOT_NUMBERS/ATTSTATSSLOT_VALUES, instead of using a
magic constant. Say, ATTSTATSSLOT_NONE or ATTSTATSSLOT_CHECK.

> I couldn't come up with any reasonable way of writing an automated test
> for that.
> Any ideas?
> 

I don't think you can write a test for this, because there is no change
to behavior that can be observed by the user. If one side has no MCV,
the only difference is whether we try to load the other MCV or not.
There's no impact on estimates, because we won't use it.

IMO the best thing we can do is check coverage, that the new code is
exercised in regression tests. And I think that's fine.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-18 Thread David Geier

Thanks everyone for the great feedback and suggestions.




Yes, it is.  Using zero flag would short-cut get_attstatsslot() to just
return whether the slot type exists without loading it.  Do you think we
need to emphasize this use case in the comments for 'flags'?

Perhaps, it's not really obvious now.


Comment added.



I wonder whether we need to also check statistic_proc_security_check()

when determining if MCVs exists in both sides.

Yeah, I thought about hoisting the statistic_proc_security_check
tests up into get_mcv_stats.  I don't think it's a great idea
though.  Again, it'd complicate untangling this if we ever
generalize the use of MCVs in this function.  Also, I don't
think we should be micro-optimizing the case where the security
check doesn't pass --- if it doesn't, you're going to be hurting
from bad plans a lot more than you are from some wasted cycles
here.


Sounds reasonable.

Attached is v2 of the patch.
This is basically Tom's version plus a comment for the flags of 
get_attstatslot() as suggested by Richard.


I couldn't come up with any reasonable way of writing an automated test 
for that.

Any ideas?

--
David Geier
(ServiceNow)
From 66b16792e8e16dad33adb5614a0936cc41002f4c Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Fri, 18 Nov 2022 09:35:08 +0100
Subject: [PATCH] Don't read MCV stats needlessly in join selectivity
 estimation

Join selectivity estimation only uses MCV stats if they're present
on both join attributes. Therefore, if MCV stats are not available
on one of the two columns, skip reading MCV stats altogether.

Frequently encountered situations in which MCV stats not available
is unique columns or columns for which all values have roughly
the same frequency and therefore ANALYZE decided to not store them.
---
 src/backend/utils/adt/selfuncs.c| 15 +--
 src/backend/utils/cache/lsyscache.c |  4 
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d597b7e81f..5baab4e3ac 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2263,6 +2263,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	bool		have_mcvs2 = false;
 	bool		join_is_reversed;
 	RelOptInfo *inner_rel;
+	bool		get_mcv_stats;
 
 	get_join_variables(root, args, sjinfo,
 	   , , _is_reversed);
@@ -2275,11 +2276,21 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	memset(, 0, sizeof(sslot1));
 	memset(, 0, sizeof(sslot2));
 
+	/*
+	 * There is no use in fetching one side's MCVs if we lack MCVs for the
+	 * other side, so do a quick check to verify that both stats exist.
+	 */
+	get_mcv_stats = (HeapTupleIsValid(vardata1.statsTuple) &&
+			 HeapTupleIsValid(vardata2.statsTuple) &&
+			 get_attstatsslot(, vardata1.statsTuple, STATISTIC_KIND_MCV, InvalidOid, 0) &&
+			 get_attstatsslot(, vardata2.statsTuple, STATISTIC_KIND_MCV, InvalidOid, 0));
+
+
 	if (HeapTupleIsValid(vardata1.statsTuple))
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats1 = (Form_pg_statistic) GETSTRUCT(vardata1.statsTuple);
-		if (statistic_proc_security_check(, opfuncoid))
+		if (get_mcv_stats && statistic_proc_security_check(, opfuncoid))
 			have_mcvs1 = get_attstatsslot(, vardata1.statsTuple,
 		  STATISTIC_KIND_MCV, InvalidOid,
 		  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
@@ -2289,7 +2300,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats2 = (Form_pg_statistic) GETSTRUCT(vardata2.statsTuple);
-		if (statistic_proc_security_check(, opfuncoid))
+		if (get_mcv_stats && statistic_proc_security_check(, opfuncoid))
 			have_mcvs2 = get_attstatsslot(, vardata2.statsTuple,
 		  STATISTIC_KIND_MCV, InvalidOid,
 		  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index a16a63f495..191f68f7b4 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -3157,6 +3157,10 @@ get_attavgwidth(Oid relid, AttrNumber attnum)
  * reqkind: STAKIND code for desired statistics slot kind.
  * reqop: STAOP value wanted, or InvalidOid if don't care.
  * flags: bitmask of ATTSTATSSLOT_VALUES and/or ATTSTATSSLOT_NUMBERS.
+ *Passing 0 will only check if the requested slot type exists but not
+ *extract any content. This is useful in cases where MCV stats are only
+ *useful if they exists on all involved columns (e.g. during join
+ *selectivity computation).
  *
  * If a matching slot is found, true is returned, and *sslot is filled thus:
  * staop: receives the actual STAOP value.
-- 
2.34.1



Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-17 Thread Tom Lane
Richard Guo  writes:
> Yes, it is.  Using zero flag would short-cut get_attstatsslot() to just
> return whether the slot type exists without loading it.  Do you think we
> need to emphasize this use case in the comments for 'flags'?

Perhaps, it's not really obvious now.

> I wonder whether we need to also check statistic_proc_security_check()
> when determining if MCVs exists in both sides.

Yeah, I thought about hoisting the statistic_proc_security_check
tests up into get_mcv_stats.  I don't think it's a great idea
though.  Again, it'd complicate untangling this if we ever
generalize the use of MCVs in this function.  Also, I don't
think we should be micro-optimizing the case where the security
check doesn't pass --- if it doesn't, you're going to be hurting
from bad plans a lot more than you are from some wasted cycles
here.

regards, tom lane




Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-17 Thread Richard Guo
On Fri, Nov 18, 2022 at 9:36 AM Tom Lane  wrote:

> Actually, looking at get_attstatslot, I realize it was already designed
> to do that -- just pass zero for flags.  So we could do it as attached.


Yes, it is.  Using zero flag would short-cut get_attstatsslot() to just
return whether the slot type exists without loading it.  Do you think we
need to emphasize this use case in the comments for 'flags'?  It seems
currently there is no such use case in the codes on HEAD.

I wonder whether we need to also check statistic_proc_security_check()
when determining if MCVs exists in both sides.

Thanks
Richard


Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-17 Thread Tom Lane
I wrote:
> Tomas Vondra  writes:
>> Or perhaps what if we have a function that quickly determines if the
>> attribute has MCV, without loading it? I'd bet the expensive part of
>> get_attstatslot() is the deconstruct_array().
>> We could have a function that only does the first small loop over slots,
>> and returns true/false if we have a slot of the requested stakind.

> Yeah, I like this idea.

Actually, looking at get_attstatslot, I realize it was already designed
to do that -- just pass zero for flags.  So we could do it as attached.

We could make some consequent simplifications by only retaining one
"have_mcvs" flag, but I'm inclined to leave the rest of the code as-is.
We would not get much gain from that, and it would make this harder
to undo if there ever is a reason to consider just one set of MCVs.

regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d597b7e81f..e0aeaa6909 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2261,6 +2261,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	Form_pg_statistic stats2 = NULL;
 	bool		have_mcvs1 = false;
 	bool		have_mcvs2 = false;
+	bool		get_mcv_stats;
 	bool		join_is_reversed;
 	RelOptInfo *inner_rel;
 
@@ -2275,11 +2276,25 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	memset(, 0, sizeof(sslot1));
 	memset(, 0, sizeof(sslot2));
 
+	/*
+	 * There is no use in fetching one side's MCVs if we lack MCVs for the
+	 * other side, so do a quick check to verify that both stats exist.
+	 */
+	get_mcv_stats = (HeapTupleIsValid(vardata1.statsTuple) &&
+	 HeapTupleIsValid(vardata2.statsTuple) &&
+	 get_attstatsslot(, vardata1.statsTuple,
+	  STATISTIC_KIND_MCV, InvalidOid,
+	  0) &&
+	 get_attstatsslot(, vardata2.statsTuple,
+	  STATISTIC_KIND_MCV, InvalidOid,
+	  0));
+
 	if (HeapTupleIsValid(vardata1.statsTuple))
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats1 = (Form_pg_statistic) GETSTRUCT(vardata1.statsTuple);
-		if (statistic_proc_security_check(, opfuncoid))
+		if (get_mcv_stats &&
+			statistic_proc_security_check(, opfuncoid))
 			have_mcvs1 = get_attstatsslot(, vardata1.statsTuple,
 		  STATISTIC_KIND_MCV, InvalidOid,
 		  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
@@ -2289,7 +2304,8 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats2 = (Form_pg_statistic) GETSTRUCT(vardata2.statsTuple);
-		if (statistic_proc_security_check(, opfuncoid))
+		if (get_mcv_stats &&
+			statistic_proc_security_check(, opfuncoid))
 			have_mcvs2 = get_attstatsslot(, vardata2.statsTuple,
 		  STATISTIC_KIND_MCV, InvalidOid,
 		  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);


Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-17 Thread Tom Lane
Tomas Vondra  writes:
> Or perhaps what if we have a function that quickly determines if the
> attribute has MCV, without loading it? I'd bet the expensive part of
> get_attstatslot() is the deconstruct_array().
> We could have a function that only does the first small loop over slots,
> and returns true/false if we have a slot of the requested stakind.

Yeah, I like this idea.

> It might even check the isunique flag first, to make it more convenient.

That would tie it to this one use-case, which doesn't seem advisable.
I think we should forget the known-unique angle and just do quick
checks to see if both sides have MCVs.

regards, tom lane




Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-17 Thread Tomas Vondra



On 11/14/22 10:19, David Geier wrote:
> Hi Tom,
>> There won't *be* any MCV stats for a column that ANALYZE perceives to
>> be unique, so I'm not quite sure where the claimed savings comes from.
>
> We save if one join attribute is unique while the other isn't. In that
> case stored MCV stats are read for the non-unique attribute but then
> never used. This is because MCV stats in join selectivity estimation are
> only used if they're present on both columns
>

Right - if we only have MCV on one side of the join, we currently end up
loading the MCV we have only to not use it anyway. The uniqueness is a
simple way to detect some of those cases. I'd bet the savings can be
quite significant for small joins and/or cases with large MCV.

I wonder if we might be yet a bit smarter, though.

For example, assume the first attribute is not defined as "unique" but
we still don't have a MCV (it may be unique - or close to unique - in
practice, or maybe it's just uniform distribution). We end up with

  have_mcvs1 = false

Can't we just skip trying to load the second MCV? So we could do

if (have_mcvs1 && HeapTupleIsValid(vardata2.statsTuple))
{ ... try loading mcv2 ... }

Or perhaps what if we have a function that quickly determines if the
attribute has MCV, without loading it? I'd bet the expensive part of
get_attstatslot() is the deconstruct_array().

We could have a function that only does the first small loop over slots,
and returns true/false if we have a slot of the requested stakind. It
might even check the isunique flag first, to make it more convenient.

And only if both sides return "true" we'd load the MCV, deconstruct the
array and all that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-14 Thread David Geier

Hi Tom,

There won't *be* any MCV stats for a column that ANALYZE perceives to
be unique, so I'm not quite sure where the claimed savings comes from.
We save if one join attribute is unique while the other isn't. In that 
case stored MCV stats are read for the non-unique attribute but then 
never used. This is because MCV stats in join selectivity estimation are 
only used if they're present on both columns

Please provide a concrete example.


A super simple case already showing a significant speedup is the 
following. The more ways to join two tables and the more joins overall, 
the higher the expected gain.


CREATE TABLE bar(col INT UNIQUE);
CREATE TABLE foo (col INT);
INSERT INTO foo SELECT generate_series(1, 100, 0.5);
SET default_statistics_target = 1;
ANALYZE foo, bar;
\timing on
EXPLAIN SELECT * FROM foo, bar WHERE foo.col = bar.col;

Running the above query five times gave me average runtimes of:

- 0.62 ms without the patch and
- 0.48 ms with the patch.

--
David Geier
(ServiceNow)





Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-11 Thread Tom Lane
David Geier  writes:
> eqjoinsel() can be optimized by not reading MCV stats if at least one of 
> the two join attributes is unique.

There won't *be* any MCV stats for a column that ANALYZE perceives to
be unique, so I'm not quite sure where the claimed savings comes from.

> With this change we saw some queries improve planning time by more than 
> 2x, especially with larger values for default_statistics_target.

Please provide a concrete example.

regards, tom lane