Re: using extended statistics to improve join estimates

2024-05-22 Thread Andrei Lepikhov

On 5/23/24 09:04, Andy Fan wrote:

Andrei Lepikhov  writes:

* c) No extended stats with MCV. If there are multiple join clauses,
* we can try using ndistinct coefficients and do what eqjoinsel does.


OK, I didn't pay enough attention to this comment before, and yes, I reached
the same conclusion as you - there is no implementation of this.

If so, I think we should remove the comment and do the
implementation in the next patch.

I have the opposite opinion about it:
1. ndistinct estimation is a more universal thing - you can apply it 
precisely to any subset of columns.
2. ndistinct estimation is faster - it is just a number, so you don't need 
to detoast a huge array of values and compare them one by one.


So, IMO, it is an essential part of join estimation and it should be 
implemented like in eqjoinsel.
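
For illustration, here is a minimal sketch of the kind of case strategy (c) 
would cover (the table and statistics names are invented for this example, 
and the exact formula is an assumption). With no MCV on either side, the 
multi-column ndistinct values could be combined the way eqjoinsel combines 
per-column ndistincts, i.e. roughly selectivity = 1 / max(nd1, nd2):

CREATE TABLE t1 (a integer, b integer);
CREATE TABLE t2 (a integer, b integer);
CREATE STATISTICS t1_nd (ndistinct) ON a, b FROM t1;
CREATE STATISTICS t2_nd (ndistinct) ON a, b FROM t2;
ANALYZE t1, t2;
-- today the two join clauses are estimated independently and their
-- selectivities are multiplied; strategy (c) would instead use the
-- multi-column ndistinct values of (a, b) on both sides
EXPLAIN SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b;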

Do you think the hook proposal is closely connected with the current
topic? IIUC it seems not. So a dedicated thread to explain the problem
to solve, the proposal, and the following discussion should be helpful
for both topics. I'm just worried that mixing the two in one thread
would make things unnecessarily complex.

Sure.

--
regards,
Andrei Lepikhov
Postgres Professional





Re: using extended statistics to improve join estimates

2024-05-22 Thread Andy Fan


Andrei Lepikhov  writes:

> On 20/5/2024 15:52, Andy Fan wrote:
>> Hi Andrei,
>> 
>>> On 4/3/24 01:22, Tomas Vondra wrote:
>>>> Cool! There's obviously no chance to get this into v18, and I have stuff
>>>> to do in this CF. But I'll take a look after that.
>>> I'm looking at your patch now - an excellent start to an eagerly awaited
>>> feature!
>>> A couple of questions:
>>> 1. I didn't find the implementation of strategy 'c' - estimation by the
>>> number of distinct values. Did you forget it?
>> What do you mean by "strategy 'c'"?
> As described in 0001-* patch:
> * c) No extended stats with MCV. If there are multiple join clauses,
> * we can try using ndistinct coefficients and do what eqjoinsel does.

OK, I didn't pay enough attention to this comment before, and yes, I reached
the same conclusion as you - there is no implementation of this.

If so, I think we should remove the comment and do the
implementation in the next patch. 

>>> 2. Can we add a clauselist selectivity hook into the core (something
>>> similar to the code in the attachment)? It would allow the development and
>>> testing of multicolumn join estimations without patching the core.
>> The idea LGTM. But do you want
>> +if (clauselist_selectivity_hook)
>> +s1 = clauselist_selectivity_hook(root, clauses, varRelid, 
>> jointype,
>> +
>> rather than
>> +if (clauselist_selectivity_hook)
>> +*return* clauselist_selectivity_hook(root, clauses, ..)
> Of course - the library may not estimate all the clauses - that is the
> reason why I added the input/output parameter 'estimatedclauses', by
> analogy with statext_clauselist_selectivity.

OK.

Do you think the hook proposal is closely connected with the current
topic? IIUC it seems not. So a dedicated thread to explain the problem
to solve, the proposal, and the following discussion should be helpful
for both topics. I'm just worried that mixing the two in one thread
would make things unnecessarily complex.

-- 
Best Regards
Andy Fan





Re: using extended statistics to improve join estimates

2024-05-21 Thread Andrei Lepikhov

On 5/20/24 16:40, Andrei Lepikhov wrote:

On 20/5/2024 15:52, Andy Fan wrote:

+    if (clauselist_selectivity_hook)
+    *return* clauselist_selectivity_hook(root, clauses, ..)
Of course - the library may not estimate all the clauses - that is the 
reason why I added the input/output parameter 'estimatedclauses', by 
analogy with statext_clauselist_selectivity.

Here is a polished and slightly modified version of the proposed hook.
Additionally, I propose exporting the statext_mcv_clauselist_selectivity 
routine, just like dependencies_clauselist_selectivity. This could 
potentially enhance the functionality of the PostgreSQL estimation code.


To clarify the purpose: I want an optional, more conservative estimation 
based on distinct statistics, loaded as a library. Here is a (somewhat 
degenerate) example:


CREATE TABLE is_test(x1 integer, x2 integer, x3 integer, x4 integer);
INSERT INTO is_test (x1,x2,x3,x4)
   SELECT x%5,x%7,x%11,x%13 FROM generate_series(1,1E3) AS x;
INSERT INTO is_test (x1,x2,x3,x4)
   SELECT 14,14,14,14 FROM generate_series(1,100) AS x;
CREATE STATISTICS ist_stat (dependencies,ndistinct)
  ON x1,x2,x3,x4 FROM is_test;
ANALYZE is_test;
EXPLAIN (ANALYZE, COSTS ON, SUMMARY OFF, TIMING OFF)
SELECT * FROM is_test WHERE x1=14 AND x2=14 AND x3=14 AND x4=14;
DROP TABLE is_test CASCADE;

I see:
(cost=0.00..15.17 rows=3 width=16) (actual rows=100 loops=1)

Functional dependencies work great if the same dependency holds for all the 
data in the columns. But we get underestimations if different subsets of 
rows follow different laws. So, if we don't have MCV statistics, sometimes 
we need to skip the dependency statistics and use ndistinct instead.
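
For contrast, a hedged sketch re-using the is_test example above (run before 
the final DROP TABLE; the statistics name is invented and the numbers are 
expectations, not captured output): adding the mcv kind lets the planner see 
the (14,14,14,14) group directly, which is why the problematic case is 
specifically the one where no MCV is available.

CREATE STATISTICS ist_stat_mcv (mcv) ON x1,x2,x3,x4 FROM is_test;
ANALYZE is_test;
-- with a multi-column MCV the estimate for this clause list should come out
-- close to the actual 100 rows; without it, ndistinct is the remaining,
-- more conservative fallback argued for above
EXPLAIN (ANALYZE, COSTS ON, SUMMARY OFF, TIMING OFF)
SELECT * FROM is_test WHERE x1=14 AND x2=14 AND x3=14 AND x4=14;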


--
regards,
Andrei Lepikhov
Postgres Professional
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 0ab021c1e8..1508a1beea 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -128,6 +128,11 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	ListCell   *l;
 	int			listidx;
 
+	if (clauselist_selectivity_hook)
+		s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
+		 sjinfo, &estimatedclauses,
+		 use_extended_stats);
+
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 99fdf208db..b1722f5a60 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1712,7 +1712,7 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
  * 0-based 'clauses' indexes we estimate for and also skip clause items that
  * already have a bit set.
  */
-static Selectivity
+Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
    JoinType jointype, SpecialJoinInfo *sjinfo,
    RelOptInfo *rel, Bitmapset **estimatedclauses,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 5f5d7959d8..ff98fda08c 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -146,6 +146,7 @@
 /* Hooks for plugins to get control when we ask for stats */
 get_relation_stats_hook_type get_relation_stats_hook = NULL;
 get_index_stats_hook_type get_index_stats_hook = NULL;
+clauselist_selectivity_hook_type clauselist_selectivity_hook = NULL;
 
 static double eqsel_internal(PG_FUNCTION_ARGS, bool negate);
 static double eqjoinsel_inner(Oid opfuncoid, Oid collation,
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 7f2bf18716..436f30bdde 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -104,6 +104,14 @@ extern void BuildRelationExtStatistics(Relation onerel, bool inh, double totalro
 extern int	ComputeExtStatisticsRows(Relation onerel,
 	 int natts, VacAttrStats **vacattrstats);
 extern bool statext_is_kind_built(HeapTuple htup, char type);
+extern Selectivity statext_mcv_clauselist_selectivity(PlannerInfo *root,
+	  List *clauses,
+	  int varRelid,
+	  JoinType jointype,
+	  SpecialJoinInfo *sjinfo,
+	   RelOptInfo *rel,
+	   Bitmapset **estimatedclauses,
+	   bool is_or);
 extern Selectivity dependencies_clauselist_selectivity(PlannerInfo *root,
 	   List *clauses,
 	   int varRelid,
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index f2563ad1cb..253f584c65 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -148,6 +148,17 @@ typedef bool (*get_index_stats_hook_type) (PlannerInfo *root,
 		   VariableStatData *vardata);
 extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;
 
+/* Hooks for plugins to get control when we ask for selectivity estimation */
+typedef Selectivity 

Re: using extended statistics to improve join estimates

2024-05-20 Thread Andrei Lepikhov

On 20/5/2024 15:52, Andy Fan wrote:


Hi Andrei,


On 4/3/24 01:22, Tomas Vondra wrote:

Cool! There's obviously no chance to get this into v18, and I have stuff
to do in this CF. But I'll take a look after that.

I'm looking at your patch now - an excellent start to an eagerly awaited
feature!
A couple of questions:
1. I didn't find the implementation of strategy 'c' - estimation by the
number of distinct values. Did you forget it?


What do you mean by "strategy 'c'"?

As described in 0001-* patch:
* c) No extended stats with MCV. If there are multiple join clauses,
* we can try using ndistinct coefficients and do what eqjoinsel does.




2. Can we add a clauselist selectivity hook into the core (something
similar to the code in the attachment)? It would allow the development and
testing of multicolumn join estimations without patching the core.


The idea LGTM. But do you want

+   if (clauselist_selectivity_hook)
+   s1 = clauselist_selectivity_hook(root, clauses, varRelid, 
jointype,
+

rather than

+   if (clauselist_selectivity_hook)
+   *return* clauselist_selectivity_hook(root, clauses, ..)
Of course - the library may not estimate all the clauses - that is the 
reason why I added the input/output parameter 'estimatedclauses', by 
analogy with statext_clauselist_selectivity.


--
regards,
Andrei Lepikhov
Postgres Professional





Re: using extended statistics to improve join estimates

2024-05-20 Thread Andy Fan


Hi Andrei,

> On 4/3/24 01:22, Tomas Vondra wrote:
>> Cool! There's obviously no chance to get this into v18, and I have stuff
>> to do in this CF. But I'll take a look after that.
> I'm looking at your patch now - an excellent start to an eagerly awaited
> feature!
> A couple of questions:
> 1. I didn't find the implementation of strategy 'c' - estimation by the
> number of distinct values. Did you forget it?

What do you mean by "strategy 'c'"?

> 2. Can we add a clauselist selectivity hook into the core (something
> similar to the code in the attachment)? It would allow the development and
> testing of multicolumn join estimations without patching the core.

The idea LGTM. But do you want 

+   if (clauselist_selectivity_hook)
+   s1 = clauselist_selectivity_hook(root, clauses, varRelid, 
jointype,
+

rather than

+   if (clauselist_selectivity_hook)
+   *return* clauselist_selectivity_hook(root, clauses, ..)


?

-- 
Best Regards
Andy Fan





Re: using extended statistics to improve join estimates

2024-05-20 Thread Andrei Lepikhov

On 4/3/24 01:22, Tomas Vondra wrote:

Cool! There's obviously no chance to get this into v18, and I have stuff
to do in this CF. But I'll take a look after that.
I'm looking at your patch now - an excellent start to an eagerly awaited 
feature!

A couple of questions:
1. I didn't find the implementation of strategy 'c' - estimation by the 
number of distinct values. Did you forget it?
2. Can we add a clauselist selectivity hook into the core (something 
similar to the code in the attachment)? It would allow the development and 
testing of multicolumn join estimations without patching the core.


--
regards,
Andrei Lepikhov
Postgres Professional
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 0ab021c1e8..271d36a522 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -128,6 +128,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	ListCell   *l;
 	int			listidx;
 
+	if (clauselist_selectivity_hook)
+		s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
+		 sjinfo, &estimatedclauses);
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 5f5d7959d8..ff98fda08c 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -146,6 +146,7 @@
 /* Hooks for plugins to get control when we ask for stats */
 get_relation_stats_hook_type get_relation_stats_hook = NULL;
 get_index_stats_hook_type get_index_stats_hook = NULL;
+clauselist_selectivity_hook_type clauselist_selectivity_hook = NULL;
 
 static double eqsel_internal(PG_FUNCTION_ARGS, bool negate);
 static double eqjoinsel_inner(Oid opfuncoid, Oid collation,
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index f2563ad1cb..ee28d3ba9b 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -148,6 +148,15 @@ typedef bool (*get_index_stats_hook_type) (PlannerInfo *root,
 		   VariableStatData *vardata);
 extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;
 
+/* Hooks for plugins to get control when we ask for selectivity estimation */
+typedef bool (*clauselist_selectivity_hook_type) (PlannerInfo *root,
+  List *clauses,
+  int varRelid,
+  JoinType jointype,
+  SpecialJoinInfo *sjinfo,
+  Bitmapset **estimatedclauses);
+extern PGDLLIMPORT clauselist_selectivity_hook_type clauselist_selectivity_hook;
+
 /* Functions in selfuncs.c */
 
 extern void examine_variable(PlannerInfo *root, Node *node, int varRelid,


Re: using extended statistics to improve join estimates

2024-04-30 Thread Andy Fan


Hello Justin,

Thanks for showing interest in this!

> On Sun, Apr 28, 2024 at 10:07:01AM +0800, Andy Fan wrote:
>> 's/estimiatedcluases/estimatedclauses/' typo error in the
>> commit message is not fixed since I have to regenerate all the commits
>
> Maybe you know this, but some of these patches need to be squashed.
> Regenerating the patches to address feedback is the usual process.
> When they're not squashed, it makes it hard to review the content of the
> patches.

You might have overlooked the fact that each individual commit is just to
make the communication effective (easy to review); all of them will be
merged into one commit at the end of, or during, the review process. 

Even so, if something makes it hard to review, I am pretty happy to
regenerate the patches, but does 's/estimiatedcluases/estimatedclauses/'
belong to this category? I'm pretty sure that is not the only typo or
inappropriate word; if we had to regenerate the 22 patches because of that,
we would have to regenerate them pretty often.

Would you mind providing more feedback in one pass so I can merge it all in
one modification, or do you think the typo has blocked the review
process?

>
> For example:
> [PATCH v1 18/22] Fix error "unexpected system attribute" when join with 
> system attr
> ..adds .sql regression tests, but the expected .out isn't updated until
> [PATCH v1 19/22] Fix the incorrect comment on extended stats.
>
> That fixes an elog() in Tomas' original commit, so it should probably be
> 002 or 003.

Which elog are you talking about?

> It might make sense to keep the first commit separate for
> now, since it's nice to keep Tomas' original patch "pristine" to make
> more apparent the changes you're proposing.

This is my goal as well. Did you find anything I did that breaks this
rule? That is absolutely not my intention.

> Another:
> [PATCH v1 20/22] Add fastpath when combine the 2 MCV like eqjoinsel_inner.
> ..doesn't compile without
> [PATCH v1 21/22] When mcv->ndimensions == list_length(clauses), handle it 
> same as
>
> Your 022 patch fixes a typo in your 002 patch, which means that one first
> reads a patch with a typo, and then later a 10-line patch reflowing the
> comment with the typo fixed.

I would like to regenerate the 22 patches if you think the typo makes the
review process hard. I can do such things, but I am not willing to
do that often.

>
> A good guideline is that each patch should be self-contained, compiling
> and passing tests.  Which is more difficult with a long stack of
> patches.

I agree.

-- 
Best Regards
Andy Fan





Re: using extended statistics to improve join estimates

2024-04-29 Thread Justin Pryzby
On Sun, Apr 28, 2024 at 10:07:01AM +0800, Andy Fan wrote:
> 's/estimiatedcluases/estimatedclauses/' typo error in the
> commit message is not fixed since I have to regenerate all the commits

Maybe you know this, but some of these patches need to be squashed.
Regenerating the patches to address feedback is the usual process.
When they're not squashed, it makes it hard to review the content of the
patches.

For example:
[PATCH v1 18/22] Fix error "unexpected system attribute" when join with system 
attr
..adds .sql regression tests, but the expected .out isn't updated until
[PATCH v1 19/22] Fix the incorrect comment on extended stats.

That fixes an elog() in Tomas' original commit, so it should probably be
002 or 003.  It might make sense to keep the first commit separate for
now, since it's nice to keep Tomas' original patch "pristine" to make
more apparent the changes you're proposing.

Another:
[PATCH v1 20/22] Add fastpath when combine the 2 MCV like eqjoinsel_inner.
..doesn't compile without
[PATCH v1 21/22] When mcv->ndimensions == list_length(clauses), handle it same 
as

Your 022 patch fixes a typo in your 002 patch, which means that one first
reads a patch with a typo, and then later a 10-line patch reflowing the
comment with the typo fixed.

A good guideline is that each patch should be self-contained, compiling
and passing tests.  Which is more difficult with a long stack of
patches.

-- 
Justin




Re: using extended statistics to improve join estimates

2024-04-27 Thread Andy Fan


Hello Justin!

Justin Pryzby  writes:


> |../src/backend/statistics/extended_stats.c:3151:36: warning: ‘relid’ may be 
> used uninitialized [-Wmaybe-uninitialized]
> | 3151 | if (var->varno != relid)
> |  |^
> |../src/backend/statistics/extended_stats.c:3104:33: note: ‘relid’ was 
> declared here
> | 3104 | int relid;
> |  | ^
> |[1016/1893] Compiling C object 
> src/backend/postgres_lib.a.p/statistics_mcv.c.o
> |../src/backend/statistics/mcv.c: In function ‘mcv_combine_extended’:
> |../src/backend/statistics/mcv.c:2431:49: warning: declaration of ‘idx’ 
> shadows a previous local [-Wshadow=compatible-local]

Thanks for the feedback. The warnings should be fixed in the latest
revision; the 's/estimiatedcluases/estimatedclauses/' typo in the commit
message is not fixed, since I would have to regenerate all the commits to
fix that. We are still in the discussion stage, and I think its impact on
the discussion is pretty limited.

> FYI, I also ran the patch with a $large number of reports without
> observing any errors or crashes.

Good to know that.

> I'll try to look harder at the next patch revision.

Thank you!

-- 
Best Regards
Andy Fan





Re: using extended statistics to improve join estimates

2024-04-12 Thread Justin Pryzby
On Tue, Apr 02, 2024 at 04:23:45PM +0800, Andy Fan wrote:
> 
> 0001 is your patch; I just rebased it against the current master. 0006
> is not very relevant to the current patch, and I think it can be committed
> individually if you are OK with that.

Your 002 should also remove listidx to avoid a warning:
../src/backend/statistics/extended_stats.c:2879:8: error: variable 'listidx' 
set but not used [-Werror,-Wunused-but-set-variable]

> Subject: [PATCH v1 2/8] Remove estimiatedcluases and varRelid arguments

> @@ -2939,15 +2939,11 @@ statext_try_join_estimates(PlannerInfo *root, List 
> *clauses, int varRelid,
>   /* needs to happen before skipping any clauses */
>   listidx++;
>  
> - /* Skip clauses that were already estimated. */
> - if (bms_is_member(listidx, estimatedclauses))
> - continue;
> -

Your 007 could instead test if relids == NULL:

> Subject: [PATCH v1 7/8] bms_is_empty is more effective than bms_num_members(b)
>-   if (bms_num_members(relids) == 0)
>+   if (bms_is_empty(relids))

typos:
001: s/heuristict/heuristics/
002: s/grantee/guarantee/
002: s/estimiatedcluases/estimatedclauses/

It'd be nice to fix/silence these warnings from 001:

|../src/backend/statistics/extended_stats.c:3151:36: warning: ‘relid’ may be 
used uninitialized [-Wmaybe-uninitialized]
| 3151 | if (var->varno != relid)
|  |^
|../src/backend/statistics/extended_stats.c:3104:33: note: ‘relid’ was declared 
here
| 3104 | int relid;
|  | ^
|[1016/1893] Compiling C object src/backend/postgres_lib.a.p/statistics_mcv.c.o
|../src/backend/statistics/mcv.c: In function ‘mcv_combine_extended’:
|../src/backend/statistics/mcv.c:2431:49: warning: declaration of ‘idx’ shadows 
a previous local [-Wshadow=compatible-local]

FYI, I also ran the patch with a $large number of reports without
observing any errors or crashes.

I'll try to look harder at the next patch revision.

-- 
Justin




Re: using extended statistics to improve join estimates

2024-04-02 Thread Andy Fan


Tomas Vondra  writes:

> On 4/2/24 10:23, Andy Fan wrote:
>> 
>>> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
>>>> Rebased over 269b532ae and muted compiler warnings.
>> 
>> Thank you Justin for the rebase!
>> 
>> Hello Tomas,
>> 
>> Thanks for the patch! Before I review the patch at the code level, I want
>> to explain my understanding of this patch first.
>> 
>
> If you want to work on this patch, that'd be cool. A review would be
> great, but if you want to maybe take over and try moving it forward,
> that'd be even better. I don't know when I'll have time to work on it
> again, but I'd promise to help you with working on it.

OK, I'll try to move it forward.

>
>> Before this patch, we already use MCV information for eqjoinsel; it
>> works by combining the MCVs on both sides to figure out the mcv_freq
>> and then treating the rest equally, but this doesn't work for MCVs in
>> extended statistics, and this patch fills that gap. Besides that, since
>> extended statistics means more than one column is involved, if one or
>> more columns are Const based on the RestrictInfo, we can use such
>> information to filter the MCVs we are interested in; that's really cool.
>> 
>
> Yes, I think that's an accurate description of what the patch does.

Great to know that:)

>
>> I did some more testing, all inner joins so far; all of them
>> work amazingly well, and I am surprised this patch didn't draw enough
>> attention.

> I think it didn't go forward for a bunch of reasons:
>
..
>
> 3) Uncertainty about how applicable the patch is in practice.
>
> I suppose it was some combination of these reasons, not sure.
>
> As for the "practicality" mentioned in (3), it's been a while since I
> worked on the patch so I don't recall the details, but I think I've been
> thinking mostly about "start join" queries, where a big "fact" table
> joins to small dimensions. And in that case the fact table may have a
> MCV, but the dimensions certainly don't have any (because the join
> happens on a PK).
>
> But maybe that's a wrong way to think about it - it was clearly useful
> to consider the case with (per-attribute) MCVs on both sides as worth
> special handling. So why not do that for multi-column MCVs, right?

Yes, that's what my current understanding is.

There are some cases where there are 2+ clauses between two tables AND
the row estimation is bad AND the plan is not the best one. In such
situations, I'd expect this patch to be helpful. The case currently at hand
is on PG11, where there is no MCV information for extended statistics, so I
can't even verify manually whether the patch is useful here. When I see such
cases next time on a newer version of PG, I can verify it manually to see
if the row estimation gets better.

>> As for the code level, I reviewed it in a top-down manner and am almost
>> 40% complete. Here are some findings, just FYI. For efficiency, I provide
>> each piece of feedback as an individual commit; after all, I want to make
>> sure my comments are practical, and coding and testing is a good way to
>> achieve that. I tried to make each of them as small as possible so that
>> you can reject or accept them conveniently.
>> 
>> 0001 is your patch; I just rebased it against the current master. 0006
>> is not very relevant to the current patch, and I think it can be
>> committed individually if you are OK with that.
>> 
>> Hope this kind of review is helpful.
>> 
>
> Cool! There's obviously no chance to get this into v18, and I have stuff
> to do in this CF. But I'll take a look after that.

Good to know that. I will continue my work before that. 

-- 
Best Regards
Andy Fan





Re: using extended statistics to improve join estimates

2024-04-02 Thread Tomas Vondra
On 4/2/24 10:23, Andy Fan wrote:
> 
>> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
>>> Rebased over 269b532ae and muted compiler warnings.
> 
> Thank you Justin for the rebase!
> 
> Hello Tomas,
> 
> Thanks for the patch! Before I review the patch at the code level, I want
> to explain my understanding of this patch first.
> 

If you want to work on this patch, that'd be cool. A review would be
great, but if you want to maybe take over and try moving it forward,
that'd be even better. I don't know when I'll have time to work on it
again, but I'd promise to help you with working on it.

> Before this patch, we already use MCV information for eqjoinsel; it
> works by combining the MCVs on both sides to figure out the mcv_freq
> and then treating the rest equally, but this doesn't work for MCVs in
> extended statistics, and this patch fills that gap. Besides that, since
> extended statistics means more than one column is involved, if one or
> more columns are Const based on the RestrictInfo, we can use such
> information to filter the MCVs we are interested in; that's really cool.
> 

Yes, I think that's an accurate description of what the patch does.

> I did some more testing, all inner joins so far; all of them
> work amazingly well, and I am surprised this patch didn't draw enough
> attention. I will test more after I go through the code.
> 

I think it didn't go forward for a bunch of reasons:

1) I got distracted by something else requiring immediate attention, and
forgot about this patch.

2) I got stuck on some detail of the patch, unsure which of the possible
solutions to try first.

3) Uncertainty about how applicable the patch is in practice.

I suppose it was some combination of these reasons, not sure.


As for the "practicality" mentioned in (3), it's been a while since I
worked on the patch so I don't recall the details, but I think I've been
thinking mostly about "start join" queries, where a big "fact" table
joins to small dimensions. And in that case the fact table may have a
MCV, but the dimensions certainly don't have any (because the join
happens on a PK).

But maybe that's a wrong way to think about it - it was clearly useful
to consider the case with (per-attribute) MCVs on both sides as worth
special handling. So why not do that for multi-column MCVs, right?

> As for the code level, I reviewed it in a top-down manner and am almost
> 40% complete. Here are some findings, just FYI. For efficiency, I provide
> each piece of feedback as an individual commit; after all, I want to make
> sure my comments are practical, and coding and testing is a good way to
> achieve that. I tried to make each of them as small as possible so that
> you can reject or accept them conveniently.
> 
> 0001 is your patch; I just rebased it against the current master. 0006
> is not very relevant to the current patch, and I think it can be
> committed individually if you are OK with that.
> 
> Hope this kind of review is helpful.
> 

Cool! There's obviously no chance to get this into v18, and I have stuff
to do in this CF. But I'll take a look after that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: using extended statistics to improve join estimates

2024-04-02 Thread Andy Fan

> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
>> Rebased over 269b532ae and muted compiler warnings.

Thank you Justin for the rebase!

Hello Tomas,

Thanks for the patch! Before I review the patch at the code level, I want
to explain my understanding of this patch first.

Before this patch, we already use MCV information for eqjoinsel; it
works by combining the MCVs on both sides to figure out the mcv_freq
and then treating the rest equally, but this doesn't work for MCVs in
extended statistics, and this patch fills that gap. Besides that, since
extended statistics means more than one column is involved, if one or
more columns are Const based on the RestrictInfo, we can use such
information to filter the MCVs we are interested in; that's really cool.
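
As a hedged illustration of the two-sided MCV case described above (the
table, statistics, and data are invented here; the numbers are expectations
rather than output from an actual run): two tables with perfectly correlated
columns, joined on both columns, with MCV-bearing extended statistics on each
side for the patch to combine.

CREATE TABLE t1 (a integer, b integer);
CREATE TABLE t2 (a integer, b integer);
INSERT INTO t1 SELECT x % 50, x % 50 FROM generate_series(1, 10000) AS x;
INSERT INTO t2 SELECT x % 50, x % 50 FROM generate_series(1, 10000) AS x;
CREATE STATISTICS t1_mcv (mcv) ON a, b FROM t1;
CREATE STATISTICS t2_mcv (mcv) ON a, b FROM t2;
ANALYZE t1, t2;
-- the true result is 50 groups * 200 * 200 = 2,000,000 rows; treating the
-- two clauses as independent underestimates by roughly the correlation
-- factor, while combining the two multi-column MCVs should get close
EXPLAIN SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b;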

I did some more testing, all inner joins so far; all of them
work amazingly well, and I am surprised this patch didn't draw enough
attention. I will test more after I go through the code.

As for the code level, I reviewed it in a top-down manner and am almost
40% complete. Here are some findings, just FYI. For efficiency, I provide
each piece of feedback as an individual commit; after all, I want to make
sure my comments are practical, and coding and testing is a good way to
achieve that. I tried to make each of them as small as possible so that
you can reject or accept them conveniently.

0001 is your patch; I just rebased it against the current master. 0006
is not very relevant to the current patch, and I think it can be committed
individually if you are OK with that.

Hope this kind of review is helpful.

-- 
Best Regards
Andy Fan

>From daa6c27bc7dd0631607f0f254cc15491633a9ccc Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Mon, 13 Dec 2021 14:05:17 +0100
Subject: [PATCH v1 1/8] Estimate joins using extended statistics

Use extended statistics (MCV) to improve join estimates. In general this
is similar to how we use regular statistics - we search for extended
statistics (with MCV) covering all join clauses, and if we find such MCV
on both sides of the join, we combine those two MCVs.

Extended statistics allow a couple additional improvements - e.g. if
there are baserel conditions, we can use them to restrict the part of
the MCVs combined. This means we're building conditional probability
distribution and calculating conditional probability

P(join clauses | baserel conditions)

instead of just P(join clauses).

The patch also allows combining regular and extended MCV - we don't need
extended MCVs on both sides. This helps when one of the tables does not
have extended statistics (e.g. because there are no correlations).
---
 src/backend/optimizer/path/clausesel.c|  63 +-
 src/backend/statistics/extended_stats.c   | 805 ++
 src/backend/statistics/mcv.c  | 758 +
 .../statistics/extended_stats_internal.h  |  20 +
 src/include/statistics/statistics.h   |  12 +
 src/test/regress/expected/stats_ext.out   | 167 
 src/test/regress/sql/stats_ext.sql|  66 ++
 7 files changed, 1890 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 0ab021c1e8..bedf76edae 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -48,6 +48,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
 			 JoinType jointype,
 			 SpecialJoinInfo *sjinfo,
 			 bool use_extended_stats);
+static inline bool treat_as_join_clause(PlannerInfo *root,
+		Node *clause, RestrictInfo *rinfo,
+		int varRelid, SpecialJoinInfo *sjinfo);
 
 /
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -127,12 +130,53 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	bool		single_clause_optimization = true;
+
+	/*
+	 * The optimization of skipping to clause_selectivity_ext for single
+	 * clauses means we can't improve join estimates with a single join
+	 * clause but additional baserel restrictions. So we disable it when
+	 * estimating joins.
+	 *
+	 * XXX Not sure if this is the right way to do it, but more elaborate
+	 * checks would mostly negate the whole point of the optimization.
+	 * The (Var op Var) patch has the same issue.
+	 *
+	 * XXX An alternative might be making clause_selectivity_ext smarter
+	 * and make it use the join extended stats there. But that seems kinda
+	 * against the whole point of the optimization (skipping expensive
+	 * stuff) and it's making other parts more complex.
+	 *
+	 * XXX Maybe this should check if there are at least some restrictions
+	 * on some base relations, which seems important. But then again, that
+	 * seems to go against the idea of this check to be cheap. Moreover, it
+	 * won't work for OR clauses, which may have multiple parts but we still
+	 * see them 

Re: using extended statistics to improve join estimates

2022-03-02 Thread Justin Pryzby
On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
> Rebased over 269b532ae and muted compiler warnings.

And attached.
>From 587a5e9fe87c26cdcd9602fc349f092da95cc580 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Mon, 13 Dec 2021 14:05:17 +0100
Subject: [PATCH] Estimate joins using extended statistics

Use extended statistics (MCV) to improve join estimates. In general this
is similar to how we use regular statistics - we search for extended
statistics (with MCV) covering all join clauses, and if we find such MCV
on both sides of the join, we combine those two MCVs.

Extended statistics allow a couple additional improvements - e.g. if
there are baserel conditions, we can use them to restrict the part of
the MCVs combined. This means we're building conditional probability
distribution and calculating conditional probability

P(join clauses | baserel conditions)

instead of just P(join clauses).

The patch also allows combining regular and extended MCV - we don't need
extended MCVs on both sides. This helps when one of the tables does not
have extended statistics (e.g. because there are no correlations).
---
 src/backend/optimizer/path/clausesel.c|  63 +-
 src/backend/statistics/extended_stats.c   | 805 ++
 src/backend/statistics/mcv.c  | 757 
 .../statistics/extended_stats_internal.h  |  20 +
 src/include/statistics/statistics.h   |  12 +
 src/test/regress/expected/stats_ext.out   | 167 
 src/test/regress/sql/stats_ext.sql|  66 ++
 7 files changed, 1889 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 06f836308d0..1b2227321a2 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -50,6 +50,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
 			 JoinType jointype,
 			 SpecialJoinInfo *sjinfo,
 			 bool use_extended_stats);
+static inline bool treat_as_join_clause(PlannerInfo *root,
+		Node *clause, RestrictInfo *rinfo,
+		int varRelid, SpecialJoinInfo *sjinfo);
 
 /
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -129,12 +132,53 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	bool		single_clause_optimization = true;
+
+	/*
+	 * The optimization of skipping to clause_selectivity_ext for single
+	 * clauses means we can't improve join estimates with a single join
+	 * clause but additional baserel restrictions. So we disable it when
+	 * estimating joins.
+	 *
+	 * XXX Not sure if this is the right way to do it, but more elaborate
+	 * checks would mostly negate the whole point of the optimization.
+	 * The (Var op Var) patch has the same issue.
+	 *
+	 * XXX An alternative might be making clause_selectivity_ext smarter
+	 * and make it use the join extended stats there. But that seems kinda
+	 * against the whole point of the optimization (skipping expensive
+	 * stuff) and it's making other parts more complex.
+	 *
+	 * XXX Maybe this should check if there are at least some restrictions
+	 * on some base relations, which seems important. But then again, that
+	 * seems to go against the idea of this check to be cheap. Moreover, it
+	 * won't work for OR clauses, which may have multiple parts but we still
+	 * see them as a single BoolExpr clause (it doesn't work later, though).
+	 */
+	if (list_length(clauses) == 1)
+	{
+		Node *clause = linitial(clauses);
+		RestrictInfo *rinfo = NULL;
+
+		if (IsA(clause, RestrictInfo))
+		{
+			rinfo = (RestrictInfo *) clause;
+			clause = (Node *) rinfo->clause;
+		}
+
+		single_clause_optimization
+			= !treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo);
+	}
 
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
+	 *
+	 * XXX This means we won't try using extended stats on OR-clauses (which
+	 * are a single BoolExpr clause at this point), although we'll do that
+	 * later (once we look at the arguments).
 	 */
-	if (list_length(clauses) == 1)
+	if ((list_length(clauses) == 1) && single_clause_optimization)
 		return clause_selectivity_ext(root, (Node *) linitial(clauses),
 	  varRelid, jointype, sjinfo,
 	  use_extended_stats);
@@ -157,6 +201,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
 											 &estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutually exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, 

Re: using extended statistics to improve join estimates

2022-03-02 Thread Justin Pryzby
On Wed, Jan 19, 2022 at 06:18:09PM +0800, Julien Rouhaud wrote:
> On Tue, Jan 04, 2022 at 03:55:50PM -0800, Andres Freund wrote:
> > On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote:
> > > Here's an updated patch, rebased and fixing a couple typos reported by
> > > Justin Pryzby directly.
> > 
> > FWIW, cfbot reports a few compiler warnings:
> 
> Also the patch doesn't apply anymore:
> 
> http://cfbot.cputube.org/patch_36_3055.log
> === Applying patches on top of PostgreSQL commit ID 
> 74527c3e022d3ace648340b79a6ddec3419f6732 ===
> === applying patch 
> ./0001-Estimate-joins-using-extended-statistics-20220101.patch
> patching file src/backend/optimizer/path/clausesel.c
> patching file src/backend/statistics/extended_stats.c
> Hunk #1 FAILED at 30.
> Hunk #2 succeeded at 102 (offset 1 line).
> Hunk #3 succeeded at 2619 (offset 9 lines).
> 1 out of 3 hunks FAILED -- saving rejects to file 
> src/backend/statistics/extended_stats.c.rej

Rebased over 269b532ae and muted compiler warnings.

Tomas - is this patch viable for pg15, or should it move to the next CF?

In case it's useful, I ran this on cirrus with my branch for code coverage.
https://cirrus-ci.com/task/5816731397521408
https://api.cirrus-ci.com/v1/artifact/task/5816731397521408/coverage/coverage/00-index.html

statext_find_matching_mcv() has poor coverage.
statext_clauselist_join_selectivity() has poor coverage for the "stats2" case.

In mcv.c: mcv_combine_extended() and mcv_combine_simple() have poor coverage
for the "else if" cases (does it matter?)

Not related to this patch:
build_attnums_array() isn't being hit.

Same at statext_is_compatible_clause_internal()
1538   0 : *exprs = lappend(*exprs, clause);

statext_mcv_[de]serialize() aren't being hit for cstrings.

-- 
Justin




Re: using extended statistics to improve join estimates

2022-01-19 Thread Julien Rouhaud
Hi,

On Tue, Jan 04, 2022 at 03:55:50PM -0800, Andres Freund wrote:
> On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote:
> > Here's an updated patch, rebased and fixing a couple typos reported by
> > Justin Pryzby directly.
> 
> FWIW, cfbot reports a few compiler warnings:

Also the patch doesn't apply anymore:

http://cfbot.cputube.org/patch_36_3055.log
=== Applying patches on top of PostgreSQL commit ID 
74527c3e022d3ace648340b79a6ddec3419f6732 ===
=== applying patch 
./0001-Estimate-joins-using-extended-statistics-20220101.patch
patching file src/backend/optimizer/path/clausesel.c
patching file src/backend/statistics/extended_stats.c
Hunk #1 FAILED at 30.
Hunk #2 succeeded at 102 (offset 1 line).
Hunk #3 succeeded at 2619 (offset 9 lines).
1 out of 3 hunks FAILED -- saving rejects to file 
src/backend/statistics/extended_stats.c.rej




Re: using extended statistics to improve join estimates

2022-01-04 Thread Andres Freund
On 2022-01-01 18:21:06 +0100, Tomas Vondra wrote:
> Here's an updated patch, rebased and fixing a couple typos reported by
> Justin Pryzby directly.

FWIW, cfbot reports a few compiler warnings:

https://cirrus-ci.com/task/6067262669979648?logs=gcc_warning#L505
[18:52:15.132] time make -s -j${BUILD_JOBS} world-bin
[18:52:22.697] mcv.c: In function ‘mcv_combine_simple’:
[18:52:22.697] mcv.c:2787:7: error: ‘reverse’ may be used uninitialized in this 
function [-Werror=maybe-uninitialized]
[18:52:22.697]  2787 |if (reverse)
[18:52:22.697]   |   ^
[18:52:22.697] mcv.c:2766:27: error: ‘index’ may be used uninitialized in this 
function [-Werror=maybe-uninitialized]
[18:52:22.697]  2766 |   if (mcv->items[i].isnull[index])
[18:52:22.697]   |   ^

Greetings,

Andres Freund




Re: using extended statistics to improve join estimates

2022-01-01 Thread Tomas Vondra

Hi,

Here's an updated patch, rebased and fixing a couple typos reported by 
Justin Pryzby directly.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From 15d0fa5b565d9ae8b4f333c1d54745397964110d Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Mon, 13 Dec 2021 14:05:17 +0100
Subject: [PATCH] Estimate joins using extended statistics

Use extended statistics (MCV) to improve join estimates. In general this
is similar to how we use regular statistics - we search for extended
statistics (with MCV) covering all join clauses, and if we find such MCV
on both sides of the join, we combine those two MCVs.

Extended statistics allow a couple additional improvements - e.g. if
there are baserel conditions, we can use them to restrict the part of
the MCVs combined. This means we're building conditional probability
distribution and calculating conditional probability

P(join clauses | baserel conditions)

instead of just P(join clauses).

The patch also allows combining regular and extended MCV - we don't need
extended MCVs on both sides. This helps when one of the tables does not
have extended statistics (e.g. because there are no correlations).
---
 src/backend/optimizer/path/clausesel.c|  63 +-
 src/backend/statistics/extended_stats.c   | 805 ++
 src/backend/statistics/mcv.c  | 754 
 .../statistics/extended_stats_internal.h  |  20 +
 src/include/statistics/statistics.h   |  12 +
 src/test/regress/expected/stats_ext.out   | 167 
 src/test/regress/sql/stats_ext.sql|  66 ++
 7 files changed, 1886 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf0827..09f3d246c9d 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -50,6 +50,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
 			 JoinType jointype,
 			 SpecialJoinInfo *sjinfo,
 			 bool use_extended_stats);
+static inline bool treat_as_join_clause(PlannerInfo *root,
+		Node *clause, RestrictInfo *rinfo,
+		int varRelid, SpecialJoinInfo *sjinfo);
 
 /
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -129,12 +132,53 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	bool		single_clause_optimization = true;
+
+	/*
+	 * The optimization of skipping to clause_selectivity_ext for single
+	 * clauses means we can't improve join estimates with a single join
+	 * clause but additional baserel restrictions. So we disable it when
+	 * estimating joins.
+	 *
+	 * XXX Not sure if this is the right way to do it, but more elaborate
+	 * checks would mostly negate the whole point of the optimization.
+	 * The (Var op Var) patch has the same issue.
+	 *
+	 * XXX An alternative might be making clause_selectivity_ext smarter
+	 * and make it use the join extended stats there. But that seems kinda
+	 * against the whole point of the optimization (skipping expensive
+	 * stuff) and it's making other parts more complex.
+	 *
+	 * XXX Maybe this should check if there are at least some restrictions
+	 * on some base relations, which seems important. But then again, that
+	 * seems to go against the idea of this check to be cheap. Moreover, it
+	 * won't work for OR clauses, which may have multiple parts but we still
+	 * see them as a single BoolExpr clause (it doesn't work later, though).
+	 */
+	if (list_length(clauses) == 1)
+	{
+		Node *clause = linitial(clauses);
+		RestrictInfo *rinfo = NULL;
+
+		if (IsA(clause, RestrictInfo))
+		{
+			rinfo = (RestrictInfo *) clause;
+			clause = (Node *) rinfo->clause;
+		}
+
+		single_clause_optimization
+			= !treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo);
+	}
 
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
+	 *
+	 * XXX This means we won't try using extended stats on OR-clauses (which
+	 * are a single BoolExpr clause at this point), although we'll do that
+	 * later (once we look at the arguments).
 	 */
-	if (list_length(clauses) == 1)
+	if ((list_length(clauses) == 1) && single_clause_optimization)
 		return clause_selectivity_ext(root, (Node *) linitial(clauses),
 	  varRelid, jointype, sjinfo,
 	  use_extended_stats);
@@ -157,6 +201,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
 											 &estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutually exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */

Re: using extended statistics to improve join estimates

2021-12-13 Thread Tomas Vondra

On 11/6/21 11:03, Andy Fan wrote:

Hi Tomas:

This is the exact patch I want, thanks for the patch!



Good to hear.


On Thu, Oct 7, 2021 at 3:33 AM Tomas Vondra
 wrote:



3) estimation by join pairs

At the moment, the estimates are calculated for pairs of relations, so
for example given a query

   explain analyze
   select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
join t3 on (t1.b = t3.b and t2.c = t3.c);

we'll estimate the first join (t1,t2) just fine, but then the second
join actually combines (t1,t2,t3). What the patch currently does is it
splits it into (t1,t2) and (t2,t3) and estimates those.


Actually I can't understand how this works even for a simpler example.
Let's say we query like this (ONLY using t2's columns to join t3):

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
  join t3 on (t2.c = t3.c and t2.d = t3.d);

Then it works well on JoinRel(t1, t2) AND JoinRel(t2, t3). But when it comes
to JoinRel(t1, t2, t3), we don't maintain the MCV on the join rel, so it is
hard for this to work. Here I see your solution is splitting it into (t1, t2)
AND (t2, t3) and estimating those. But how does this help to estimate the
size of JoinRel(t1, t2, t3)?



Yeah, this is really confusing. The crucial thing to keep in mind is 
this works with clauses before running setrefs.c, so the clauses 
reference the original relations - *not* the join relation. Otherwise 
even the regular estimation would not work, because where would it get 
the per-column MCV lists etc.


Let's use a simple case with join clauses referencing just a single 
attribute for each pair of relations, and let's talk about how many join 
pairs it'll extract:


  t1 JOIN t2 ON (t1.a = t2.a) JOIN t3 ON (t1.b = t3.b)

=> first we join t1/t2, which is 1 join pair (t1,t2)
=> then we join t1/t2/t3, but the join clause references just 2 rels, so 
1 join pair (t1,t3)


Now a more complicated case, with more complex join clause

  t1 JOIN t2 ON (t1.a = t2.a) JOIN t3 ON (t1.b = t3.b AND t2.c = t3.c)

=> first we join t1/t2, which is 1 join pair (t1,t2)
=> then we join t1/t2/t3, but this time the join clause references all 
three rels, so we have two join pairs (t1,t3) and (t2,t3) and we can use 
all the statistics.





I wonder if this
should actually combine all three MCVs at once - we're pretty much
combining the MCVs into one large MCV representing the join result.



I guess we can keep the MCVs on the joinrel for these matches. Take the
query I provided above as an example, and suppose the MCV data is as below:

t1(a, b)
(1, 2) -> 0.1
(1, 3) -> 0.2
(2, 3) -> 0.5
(2, 8) -> 0.1

t2(a, b)
(1, 2) -> 0.2
(1, 3) -> 0.1
(2, 4) -> 0.2
(2, 10) -> 0.1

After t1.a = t2.a AND t1.b = t2.b,  we can build the MCV as below

(1, 2, 1, 2)  -> 0.1 * 0.2
(1, 3, 1, 3)  -> 0.2 * 0.1

And record the total MCV frequency as (0.1 + 0.2 + 0.5 + 0.1) *
(0.2 + 0.1 + 0.2 + 0.1).



Right, that's about the joint distribution of the whole join.


With this design, the number of MCV items on the joinrel would be less than
on either baserel.



Actually, I think the number of items can grow, because the matches may 
duplicate some items. For example, in your example with (t1.a = t2.a), the 
first (1,2) item in t1 matches (1,2) and (1,3) in t2, and the same is true 
for (1,3) in t1. So that's 4 combinations. Of course, we could aggregate the 
MCV by ignoring columns not used in the query.



And since we handle the eqjoin as well, we can even record the items as

(1, 2) -> 0.1 * 0.2
(1, 3) -> 0.2 * 0.1;

As for when we should maintain the JoinRel's MCV data: rather than
maintaining it just after the JoinRel size is estimated, we could compute it
only when it is needed, for example:

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
  join t3 on (t2.c = t3.c and t2.d = t3.d);

we don't need to maintain the MCV on (t1, t2, t3) since nothing else needs
it at all. However, I haven't checked the code closely enough to see whether
it (lazily computing the MCV on the joinrel) is convenient to do.



I'm not sure I understand what you're proposing here.

However, I think that estimating it for pairs has two advantages:

1) Combining MCVs for k relations requires k nested loops. Processing 2 
relations at a time limits the amount of CPU we need. Of course, this 
assumes the joins are independent, which may or may not be true.


2) It seems fairly easy to combine different types of statistics 
(regular, extended, ...), and also consider the part not represented by 
MCV. It seems much harder when joining more than 2 relations.


I'm also worried about amplification of errors - I suspect attempting to 
build the joint MCV for the whole join relation may produce significant 
estimation errors.


Furthermore, I think joins with clauses referencing more than just two 
relations are fairly uncommon. And we can always improve the feature in 
this direction in the future.





But I haven't done that yet, as it requires the MCVs to be combined
using the join clauses 

Re: using extended statistics to improve join estimates

2021-12-13 Thread Tomas Vondra

On 11/22/21 02:23, Justin Pryzby wrote:

Your regression tests include two errors, which appear to be accidental, and
fixing the errors shows that this case is being estimated poorly.

+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_test_2;
+ERROR:  statistics object "join_test_2" does not exist
...
+ERROR:  statistics object "join_stats_2" already exists



D'oh, what a silly mistake ...

You're right that fixing the DROP STATISTICS results in a worse estimate, but 
that's actually expected for a fairly simple reason. The join condition 
has expressions on both sides, and dropping the statistics means we 
don't have any MCV for the join_test_2 side. So the optimizer ends up 
not using the regular estimates, as if there were no extended stats.


A couple of lines later the script creates extended statistics on that 
expression alone, which fixes this. An expression index would do the 
trick too.
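
For illustration, a sketch of the two alternatives mentioned (the expression 
(a + 1) and the object names are invented here, since the actual test 
expression isn't shown in this message):

-- extended statistics on the expression alone gives ANALYZE stats (incl. an
-- MCV list) for that expression
CREATE STATISTICS join_test_2_expr ON (a + 1) FROM join_test_2;
-- alternatively, an expression index makes ANALYZE collect stats for it
CREATE INDEX join_test_2_expr_idx ON join_test_2 ((a + 1));
ANALYZE join_test_2;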


Attached is a patch fixing the test and also the issue reported by 
Zhihong Yu some time ago.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From 616f7f3faa818ea89c4c1cecb9aa50dd9e4fe8e7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Mon, 13 Dec 2021 14:05:17 +0100
Subject: [PATCH] Estimate joins using extended statistics

Use extended statistics (MCV) to improve join estimates. In general this
is similar to how we use regular statistics - we search for extended
statistics (with MCV) covering all join clauses, and if we find such MCV
on both sides of the join, we combine those two MCVs.

Extended statistics allow a couple additional improvements - e.g. if
there are baserel conditions, we can use them to restrict the part of
the MCVs combined. This means we're building conditional probability
distribution and calculating conditional probability

P(join clauses | baserel conditions)

instead of just P(join clauses).

The patch also allows combining regular and extended MCV - we don't need
extended MCVs on both sides. This helps when one of the tables does not
have extended statistics (e.g. because there are no correlations).
---
 src/backend/optimizer/path/clausesel.c|  63 +-
 src/backend/statistics/extended_stats.c   | 805 ++
 src/backend/statistics/mcv.c  | 754 
 .../statistics/extended_stats_internal.h  |  20 +
 src/include/statistics/statistics.h   |  12 +
 src/test/regress/expected/stats_ext.out   | 167 
 src/test/regress/sql/stats_ext.sql|  66 ++
 7 files changed, 1886 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf082..709e92446b 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -50,6 +50,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
 			 JoinType jointype,
 			 SpecialJoinInfo *sjinfo,
 			 bool use_extended_stats);
+static inline bool treat_as_join_clause(PlannerInfo *root,
+		Node *clause, RestrictInfo *rinfo,
+		int varRelid, SpecialJoinInfo *sjinfo);
 
 /
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -129,12 +132,53 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	bool		single_clause_optimization = true;
+
+	/*
+	 * The optimization of skipping to clause_selectivity_ext for single
+	 * clauses means we can't improve join estimates with a single join
+	 * clause but additional baserel restrictions. So we disable it when
+	 * estimating joins.
+	 *
+	 * XXX Not sure if this is the right way to do it, but more elaborate
+	 * checks would mostly negate the whole point of the optimization.
+	 * The (Var op Var) patch has the same issue.
+	 *
+	 * XXX An alternative might be making clause_selectivity_ext smarter
+	 * and make it use the join extended stats there. But that seems kinda
+	 * against the whole point of the optimization (skipping expensive
+	 * stuff) and it's making other parts more complex.
+	 *
+	 * XXX Maybe this should check if there are at least some restrictions
+	 * on some base relations, which seems important. But then again, that
+	 * seems to go against the idea of this check to be cheap. Moreover, it
+	 * won't work for OR clauses, which may have multiple parts but we still
+	 * see them as a single BoolExpr clause (it doesn't work later, though).
+	 */
+	if (list_length(clauses) == 1)
+	{
+		Node *clause = linitial(clauses);
+		RestrictInfo *rinfo = NULL;
+
+		if (IsA(clause, RestrictInfo))
+		{
+			rinfo = (RestrictInfo *) clause;
+			clause = (Node *) rinfo->clause;
+		}
+
+		single_clause_optimization
+			= !treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo);
+	}
 
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * 

Re: using extended statistics to improve join estimates

2021-11-21 Thread Justin Pryzby
Your regression tests include two errors, which appear to be accidental, and
fixing the errors shows that this case is being estimated poorly.

+-- try combining with single-column (and single-expression) statistics
+DROP STATISTICS join_test_2;
+ERROR:  statistics object "join_test_2" does not exist
...
+ERROR:  statistics object "join_stats_2" already exists

-- 
Justin




Re: using extended statistics to improve join estimates

2021-11-06 Thread Andy Fan
Hi Tomas:

This is the exact patch I want, thanks for the patch!

On Thu, Oct 7, 2021 at 3:33 AM Tomas Vondra
 wrote:


> 3) estimation by join pairs
>
> At the moment, the estimates are calculated for pairs of relations, so
> for example given a query
>
>   explain analyze
>   select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
>join t3 on (t1.b = t3.b and t2.c = t3.c);
>
> we'll estimate the first join (t1,t2) just fine, but then the second
> join actually combines (t1,t2,t3). What the patch currently does is it
> splits it into (t1,t2) and (t2,t3) and estimates those.

Actually, I can't understand how this works even for a simpler example.
Let's say we query like this (using ONLY t2's columns to join t3):

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
 join t3 on (t2.c = t3.c and t2.d = t3.d);

Then it works well for JoinRel(t1, t2) and JoinRel(t2, t3). But when it comes
to JoinRel(t1, t2, t3), we don't maintain an MCV on the join rel, so it is hard
to make it work. I see your solution is to split it into (t1, t2) and (t2, t3)
and estimate those, but how does that help estimate the size of
JoinRel(t1, t2, t3)?

> I wonder if this
> should actually combine all three MCVs at once - we're pretty much
> combining the MCVs into one large MCV representing the join result.
>

I guess we could keep the MCVs on the joinrel for these matched items. Take the
query I provided above as an example, and suppose the MCV data is as below:

t1(a, b)
(1, 2) -> 0.1
(1, 3) -> 0.2
(2, 3) -> 0.5
(2, 8) -> 0.1

t2(a, b)
(1, 2) -> 0.2
(1, 3) -> 0.1
(2, 4) -> 0.2
(2, 10) -> 0.1

After t1.a = t2.a AND t1.b = t2.b,  we can build the MCV as below

(1, 2, 1, 2)  -> 0.1 * 0.2
(1, 3, 1, 3)  -> 0.2 * 0.1

And record the total MCV frequency as (0.1 + 0.2 + 0.5 + 0.1) *
(0.2 + 0.1 + 0.2 + 0.1).

With this design, the number of MCV items on the joinrel would be no larger
than on either base rel.

And since this is an equijoin, we could even record the items more compactly as

(1, 2) -> 0.1 * 0.2
(1, 3) -> 0.2 * 0.1;
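
To make that arithmetic concrete, here is a small SQL sketch (just an
illustration of the idea, not code from the patch) that joins the two
hypothetical MCV lists above as plain VALUES lists and derives the same
joint items and frequencies:

  WITH mcv_t1(a, b, freq) AS (
    VALUES (1, 2, 0.1), (1, 3, 0.2), (2, 3, 0.5), (2, 8, 0.1)
  ), mcv_t2(a, b, freq) AS (
    VALUES (1, 2, 0.2), (1, 3, 0.1), (2, 4, 0.2), (2, 10, 0.1)
  )
  SELECT m1.a, m1.b, m1.freq * m2.freq AS joint_freq
  FROM mcv_t1 m1
  JOIN mcv_t2 m2 ON (m1.a = m2.a AND m1.b = m2.b);
  -- returns (1,2) -> 0.02 and (1,3) -> 0.02, the same items as above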

As for when to maintain the JoinRel's MCV data: rather than building it right
after the JoinRel size is estimated, we could compute it lazily, only when it
is actually needed. For example:

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
 join t3 on (t2.c = t3.c and t2.d = t3.d);

Here we don't need to maintain the MCV on (t1, t2, t3) since nothing else
needs it at all. However, I haven't checked the code closely enough to see
whether lazily computing the MCV on a joinrel is convenient to do.


> But I haven't done that yet, as it requires the MCVs to be combined
> using the join clauses (overlap in a way), but I'm not sure how likely
> that is in practice. In the example it could help, but that's a bit
> artificial example.
>
>
> 4) still just inner equi-joins
>
> I haven't done any work on extending this to outer joins etc. Adding
> outer and semi joins should not be complicated, mostly copying and
> tweaking what eqjoinsel() does.
>

Overall, thanks for the feature; I expect more cases to handle will come up
during the discussion. To make the review process more efficient, I suggest
that we split the patch into smaller ones and review/commit them separately
once the design is roughly finalized. For example:

Patch 1 -- require both sides to have extended statistics.
Patch 2 -- require one side to have extended statistics and the other side
per-column MCV.
Patch 3 -- handle cases like WHERE t1.a = t2.a AND t1.b = Const;
Patch 4 -- handle 3+ table joins.
Patch 5 -- support outer joins.

I think we can do this if we are sure that each individual patch would work in
some cases and would not make any other case worse.  If you agree with this,
I can do that splitting work during my review process.


-- 
Best Regards
Andy Fan (https://www.aliyun.com/)




Re: using extended statistics to improve join estimates

2021-10-06 Thread Tomas Vondra

On 10/6/21 23:03, Zhihong Yu wrote:

Hi,

+       conditions2 = statext_determine_join_restrictions(root, rel, mcv);
+
+       /* if the new statistics covers more conditions, use it */
+       if (list_length(conditions2) > list_length(conditions1))
+       {
+           mcv = stat;

It seems conditions2 is calculated using mcv, I wonder why mcv is 
replaced by stat (for conditions1 whose length is shorter) ?




Yeah, that's wrong - it should be the other way around, i.e.

if (list_length(conditions1) > list_length(conditions2))

There's no test with multiple candidate statistics yet, so this went 
unnoticed :-/



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company





Re: using extended statistics to improve join estimates

2021-10-06 Thread Tomas Vondra
Hi,

attached is an improved version of this patch, addressing some of the
points mentioned in my last message:

1) Adds a couple regression tests, testing various join cases with
expressions, additional conditions, etc.

2) Adds support for expressions, so the join clauses don't need to
reference just simple columns. So e.g. this can benefit from extended
statistics, when defined on the expressions:

 -- CREATE STATISTICS s1 ON (a+1), b FROM t1;
 -- CREATE STATISTICS s2 ON (a+1), b FROM t2;

 SELECT * FROM t1 JOIN t2 ON ((t1.a + 1) = (t2.a + 1) AND t1.b = t2.b);

3) Can combine extended statistics and regular (per-column) statistics.
The previous version required extended statistics MCV on both sides of
the join, but adding extended statistics on both sides may be impractical
(e.g. if one side does not have correlated columns it's silly to have to
add it just to make this patch happy).

For example you may have extended stats on the dimension table but not
the fact table, and the patch still can combine those two. Of course, if
there's no MCV on either side, we can't do much.

So this patch works when both sides have extended statistics MCV, or
when one side has extended MCV and the other side regular MCV. It might
seem silly, but the extended MCV allows considering additional baserel
conditions (if there are any).
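
For instance (a hypothetical variation on the examples below, not output from
the patch), one could create the extended MCV on t1 only and let t2 contribute
its regular per-column MCV, with the extended side also accounting for an
extra baserel restriction:

  CREATE STATISTICS s1_only (mcv) ON a, b FROM t1;
  ANALYZE t1, t2;
  -- single join clause; t2 uses its per-column MCV on a, while the extended
  -- MCV on t1 can also consider the additional t1.b restriction
  EXPLAIN SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a) WHERE t1.b < 25;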


examples


The table / data is very simple, but hopefully good enough for some
simple examples.

  create table t1 (a int, b int, c int);
  create table t2 (a int, b int, c int);

  insert into t1 select mod(i,50), mod(i,50), mod(i,50)
from generate_series(1,1000) s(i);

  insert into t2 select mod(i,50), mod(i,50), mod(i,50)
from generate_series(1,1000) s(i);

  analyze t1, t2;

First, without extended stats (just the first line of explain analyze,
to keep the message short):

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

QUERY PLAN

 Hash Join  (cost=31.00..106.00 rows=400 width=24)
(actual time=5.426..22.678 rows=2 loops=1)


explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

QUERY PLAN

 Hash Join  (cost=28.50..160.75 rows=1 width=24)
(actual time=5.325..19.760 rows=1 loops=1)


explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
25 and t2.c > 25;

QUERY PLAN

 Hash Join  (cost=24.50..104.75 rows=4800 width=24)
(actual time=5.618..5.639 rows=0 loops=1)


Now, let's create statistics:

  create statistics s1 on a, b, c from t1 ;
  create statistics s2 on a, b, c from t2 ;
  analyze t1, t2;

and now the same queries again:

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

QUERY PLAN

 Hash Join  (cost=31.00..106.00 rows=2 width=24)
(actual time=5.448..22.713 rows=2 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

QUERY PLAN

 Hash Join  (cost=28.50..160.75 rows=1 width=24)
(actual time=5.317..16.680 rows=1 loops=1)


explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
25 and t2.c > 25;

QUERY PLAN

 Hash Join  (cost=24.50..104.75 rows=1 width=24)
(actual time=5.647..5.667 rows=0 loops=1)


Those examples are a bit simplistic, but the improvements are fairly
clear I think.


limitations & open issues
=

Let's talk about the main general restrictions and open issues in the
current patch that I can think of at the moment.

1) statistics covering all join clauses

The patch requires the statistics to cover all the join clauses, mostly
because it simplifies the implementation. This means that to use the
per-column statistics, there has to be just a single join clause.

AFAICS this could be relaxed and we could use multiple statistics to
estimate the clauses. But it'd make selection of statistics much more
complicated, because we have to pick "matching" statistics on both sides
of the join. So it seems like an overkill, and most joins have very few
conditions anyway.
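
As a hypothetical illustration (the statistics names and queries here are
mine, not from the patch), using the t1/t2 tables from the examples above:

  CREATE STATISTICS s1x (mcv) ON a, b FROM t1;
  CREATE STATISTICS s2x (mcv) ON a, b FROM t2;
  ANALYZE t1, t2;

  -- covered: every join clause references columns of both statistics objects
  EXPLAIN SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b);

  -- not covered: t1.c = t2.c is outside s1x/s2x, so the join estimate would
  -- fall back to the regular per-clause logic
  EXPLAIN SELECT * FROM t1 JOIN t2
    ON (t1.a = t2.a AND t1.b = t2.b AND t1.c = t2.c);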


2) only equality join clauses

The other restriction is that at the moment this only supports simple
equality clauses, combined with AND. So for example this is supported

   t1 JOIN t2 ON ((t1.a = t2.a) AND (t1.b + 2 = t2.b + 1))

while these are not:

   t1 JOIN t2 ON ((t1.a = t2.a) OR (t1.b + 2 = t2.b + 1))

   t1 JOIN t2 ON ((t1.a - t2.a = 0) AND 

Re: using extended statistics to improve join estimates

2021-06-14 Thread Tomas Vondra

Hi,

Here's a slightly improved / cleaned up version of the PoC patch, 
removing a bunch of XXX and FIXMEs, adding comments, etc.


The approach is sound in principle, I think, although there's still a 
bunch of things to address:


1) statext_compare_mcvs only really deals with equijoins / inner joins 
at the moment, as it's based on eqjoinsel_inner. It's probably desirable 
to add support for additional join types (inequality and outer joins).


2) Some of the steps are performed multiple times - e.g. matching base 
restrictions to statistics, etc. Those probably can be cached somehow, 
to reduce the overhead.


3) The logic of picking the statistics to apply is somewhat simplistic, 
and maybe could be improved in some way. OTOH the number of candidate 
statistics is likely low, so this is not a big issue.


4) statext_compare_mcvs is based on eqjoinsel_inner and makes a bunch of 
assumptions similar to the original, but some of those assumptions may 
be wrong in multi-column case, particularly when working with a subset 
of columns. For example (ndistinct - size(MCV)) may not be the number of 
distinct combinations outside the MCV, when ignoring some columns. Same 
for nullfrac, and so on. I'm not sure we can do much more than pick 
some reasonable approximation.
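
To illustrate the subset-of-columns issue with a toy example (mine, not from
the patch): with statistics on (a,b,c), the number of distinct (a,b)
combinations can be far smaller than ndistinct(a,b,c), so "ndistinct minus
the number of MCV items" says little about how many (a,b) combinations fall
outside the MCV:

  CREATE TABLE demo (a int, b int, c int);
  INSERT INTO demo SELECT mod(i,10), mod(i,10), mod(i,100)
    FROM generate_series(1,1000) s(i);

  SELECT count(DISTINCT (a, b, c)) AS nd_abc,   -- 100
         count(DISTINCT (a, b))    AS nd_ab     -- only 10
  FROM demo;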


5) There are no regression tests at the moment. Clearly a gap.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf082..dca1e7d34e 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -157,6 +157,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
											 &estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutually exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
+		 estimatedclauses))
+	{
+		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+  jointype, sjinfo,
+  &estimatedclauses);
+	}
+
 	/*
 	 * Apply normal selectivity estimates for remaining clauses. We'll be
 	 * careful to skip any clauses which were already estimated above.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index b05e818ba9..d4cbbee785 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -30,6 +30,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/optimizer.h"
+#include "optimizer/pathnode.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "statistics/extended_stats_internal.h"
@@ -101,6 +102,16 @@ static StatsBuildData *make_build_data(Relation onerel, StatExtEntry *stat,
 	   int numrows, HeapTuple *rows,
 	   VacAttrStats **stats, int stattarget);
 
+static bool stat_covers_expressions(StatisticExtInfo *stat, List *exprs,
+	Bitmapset **expr_idxs);
+
+static List *statext_mcv_get_conditions(PlannerInfo *root,
+		RelOptInfo *rel,
+		StatisticExtInfo *info);
+
+static bool *statext_mcv_eval_conditions(PlannerInfo *root, RelOptInfo *rel,
+		 StatisticExtInfo *stat, MCVList *mcv,
+		 Selectivity *sel);
 
 /*
  * Compute requested extended stats, using the rows sampled for the plain
@@ -1154,6 +1165,89 @@ has_stats_of_kind(List *stats, char requiredkind)
 	return false;
 }
 
+/*
+ * find_matching_mcv
+ *		Search for a MCV covering all the attributes.
+ *
+ * XXX Should consider both attnums and expressions. Also should consider
+ * additional restrictinfos as conditions (but only as optional).
+ *
+ * XXX The requirement that all the attributes need to be covered might be
+ * too strong. Maybe we could relax it a bit, and search for MCVs (on both
+ * sides of the join) with the largest overlap. But we don't really expect
+ * many candidate MCVs, so this simple approach seems sufficient.
+ */
+StatisticExtInfo *
+find_matching_mcv(PlannerInfo *root, RelOptInfo *rel, Bitmapset *attnums, List *exprs)
+{
+	ListCell   *l;
+	StatisticExtInfo *mcv = NULL;
+	List *stats = rel->statlist;
+
+	foreach(l, stats)
+	{
+		StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(l);
+		List *conditions1 = NIL,
+			 *conditions2 = NIL;
+
+		/* We only care about MCV statistics here. */
+		if (stat->kind != STATS_EXT_MCV)
+			continue;
+
+		/*
+		 * Ignore MCVs not covering all the attributes/expressions.
+		 *
+		 * XXX Maybe we shouldn't be so strict and consider only partial
+		 * matches for join clauses too?
+		 */
+		if (!bms_is_subset(attnums, stat->keys) ||
+			!stat_covers_expressions(stat, exprs, 

Re: using extended statistics to improve join estimates

2021-03-31 Thread Zhihong Yu
Hi,

+ * has_matching_mcv
+ * Check whether the list contains statistic of a given kind

The method name is find_matching_mcv(). It seems the method initially
returned bool but later the return type was changed.

+   StatisticExtInfo *found = NULL;

A variable named "found" is normally associated with a bool value. Maybe call
the variable matching_mcv or something similar.

+   /* skip items eliminated by restrictions on rel2 */
+   if (conditions2 && !conditions2[j])
+   continue;

Maybe you can add a counter recording the number of non-skipped items in
the inner loop. If this counter is 0 after one full iteration, we can break
out of the outer loop directly.

Cheers


using extended statistics to improve join estimates

2021-03-31 Thread Tomas Vondra
Hi,

So far the extended statistics are applied only at scan level, i.e. when
estimating selectivity for individual tables. Which is great, but joins
are a known challenge, so let's try doing something about it ...

Konstantin Knizhnik posted a patch [1] using functional dependencies to
improve join estimates in January. It's an interesting approach, but as
I explained in that thread I think we should try a different approach,
similar to how we use MCV lists without extended stats. We'll probably
end up considering functional dependencies too, but probably only as a
fallback (similar to what we do for single-relation estimates).

This is a PoC demonstrating the approach I envisioned. It's incomplete
and has various limitations:

- no expression support, just plain attribute references
- only equality conditions
- requires MCV lists on both sides
- inner joins only

All of this can / should be relaxed later, of course. But for a PoC this
seems sufficient.

The basic principle is fairly simple, and mimics what eqjoinsel_inner
does. Assume we have a query:

  SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b)

If we have MCV lists on (t1.a,t1.b) and (t2.a,t2.b) then we can use the
same logic as eqjoinsel_inner and "match" them together. If the MCV list
is "larger" - e.g. on (a,b,c) - then it's a bit more complicated, but
the general idea is the same.

To demonstrate this, consider a very simple example with a table that
has a lot of dependency between the columns:

==

CREATE TABLE t (a INT, b INT, c INT, d INT);
INSERT INTO t SELECT mod(i,100), mod(i,100), mod(i,100), mod(i,100)
  FROM generate_series(1,10) s(i);
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

CREATE STATISTICS s (mcv, ndistinct) ON a,b,c,d FROM t;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

ALTER STATISTICS s SET STATISTICS 1;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

==

The results look like this:

- actual rows: 1
- estimated (no stats):  1003638
- estimated (stats, 100):  100247844
- estimated (stats, 10k):  1

So, in this case the extended stats help quite a bit, even with the
default statistics target.

However, there are other things we can do. For example, we can use
restrictions (at relation level) as "conditions" to filter the MCV lists,
and calculate conditional probability. This is useful even if there's
just a single join condition (on one column), but there are dependencies
between that column and the other filters. Or maybe when there are
filters between conditions on the two sides.

Consider for example these two queries:

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
 WHERE t1.c < 25 AND t2.d < 25;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
 WHERE t1.c < 25 AND t2.d > 75;

In this particular case we know that (a = b = c = d) so the two filters
are somewhat redundant. The regular estimates will ignore that, but with
MCV we can actually detect that - when we combine the two MCV lists, we
essentially calculate MCV (a,b,t1.c,t2.d), and use that.

  Q1  Q2
- actual rows:  2500   0
- estimated (no stats):62158   60241
- estimated (stats, 100):   25047900   1
- estimated (stats, 10k):   2500   1
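
To sketch that intuition for Q2 (the MCV items below are hypothetical, with
a = b = c = d as in the table above): if we first filter each side's MCV
items by that relation's own restriction and only then match on the join
columns, nothing survives, so the MCV part of the estimate is ~0:

  WITH mcv_t1(a, b, c, freq) AS (
    VALUES (10, 10, 10, 0.01), (30, 30, 30, 0.01), (80, 80, 80, 0.01)
  ), mcv_t2(a, b, d, freq) AS (
    VALUES (10, 10, 10, 0.01), (30, 30, 30, 0.01), (80, 80, 80, 0.01)
  )
  SELECT coalesce(sum(m1.freq * m2.freq), 0) AS mcv_match_freq
  FROM mcv_t1 m1
  JOIN mcv_t2 m2 ON (m1.a = m2.a AND m1.b = m2.b)
  WHERE m1.c < 25     -- restriction on t1
    AND m2.d > 75;    -- restriction on t2: no overlap with t1's survivors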

Obviously, the accuracy depends on how representative the MCV list is
(what fraction of the data it represents), and in this case it works
fairly nicely. A lot of the future work will be about handling cases
when it represents only part of the data.

The attached PoC patch has a number of FIXME and XXX, describing stuff I
ignored to keep it simple, possible future improvement. And so on.


regards


[1]
https://www.postgresql.org/message-id/flat/71d67391-16a9-3e5e-b5e4-8f7fd32cc...@postgrespro.ru

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf082..dca1e7d34e 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -157,6 +157,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
											 &estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutually exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
+		 estimatedclauses))
+	{
+		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+