Re: Using multiple extended statistics for estimates

2020-01-12 Thread Tomas Vondra

Hi,

I've pushed these two improvements after some minor tweaks, mostly
to comments. I ended up not using the extra tests, as it wasn't clear to
me they were worth the extra duration.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Using multiple extended statistics for estimates

2019-12-09 Thread Mark Dilger



On 12/9/19 2:00 PM, Tomas Vondra wrote:


These look good to me.  I added extra tests (not included in this email)
to verify the code on more interesting test cases, such as partitioned
tables and joins.  Your test cases are pretty trivial, just simple
selects from a single table.



Adding more complex tests like these seems like a good idea - maybe
you'd like to share them?


You can find them attached.  I did not include them in my earlier email
because they seem a bit unrefined, taking too many lines of code for the
amount of coverage they provide.  But you can prune them down and add
them to the patch if you like.

These only test the functional dependencies.  If you want to include
something like them in your commit, you might create similar tests for
the mcv statistics, too.
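
For reference, a minimal sketch of what an analogous MCV test might look
like, reusing the check_estimated_rows() helper and the table from the
attached tests (the statistics object names here are made up for
illustration):

  CREATE STATISTICS mcv_multi_1 (mcv) ON a, b FROM functional_dependencies_multi;
  CREATE STATISTICS mcv_multi_2 (mcv) ON c, d FROM functional_dependencies_multi;
  ANALYZE functional_dependencies_multi;
  -- with both MCV lists built, the combined clauses should be estimated
  -- using both statistics objects rather than just one of them
  SELECT * FROM check_estimated_rows(
    'SELECT * FROM functional_dependencies_multi
      WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');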

--
Mark Dilger
diff --git a/src/test/regress/expected/stats_ext.out 
b/src/test/regress/expected/stats_ext.out
index d42a372197..6c5106a6b9 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -458,6 +458,19 @@ CREATE TABLE functional_dependencies_multi (
c INTEGER,
d INTEGER
 );
+CREATE TABLE partitioned_dependencies_multi (
+   a INTEGER,
+   b INTEGER,
+   c INTEGER,
+   d INTEGER,
+   e INTEGER,
+   f INTEGER
+) PARTITION BY list (a);
+CREATE TABLE partitioned_dependencies_multi_0 PARTITION OF 
partitioned_dependencies_multi FOR VALUES IN (0, 5, 10, 15);
+CREATE TABLE partitioned_dependencies_multi_1 PARTITION OF 
partitioned_dependencies_multi FOR VALUES IN (1, 6, 11, 16);
+CREATE TABLE partitioned_dependencies_multi_2 PARTITION OF 
partitioned_dependencies_multi FOR VALUES IN (2, 7, 12, 17);
+CREATE TABLE partitioned_dependencies_multi_3 PARTITION OF 
partitioned_dependencies_multi FOR VALUES IN (3, 8, 13, 18);
+CREATE TABLE partitioned_dependencies_multi_4 PARTITION OF 
partitioned_dependencies_multi FOR VALUES IN (4, 9, 14, 19);
 -- 
 INSERT INTO functional_dependencies_multi (a, b, c, d)
 SELECT
@@ -466,7 +479,17 @@ INSERT INTO functional_dependencies_multi (a, b, c, d)
  mod(i,11),
  mod(i,11)
 FROM generate_series(1,5000) s(i);
+INSERT INTO partitioned_dependencies_multi (a, b, c, d, e, f)
+SELECT
+ mod(i,13),
+ mod(i,13),
+ mod(i,17),
+ mod(i,17),
+ mod(i,19),
+ mod(i,19)
+FROM generate_series(1,5000) s(i);
 ANALYZE functional_dependencies_multi;
+ANALYZE partitioned_dependencies_multi;
 -- estimates without any functional dependencies
 SELECT * FROM check_estimated_rows('SELECT * FROM 
functional_dependencies_multi WHERE a = 0 AND b = 0');
  estimated | actual 
@@ -486,10 +509,63 @@ SELECT * FROM check_estimated_rows('SELECT * FROM 
functional_dependencies_multi
  1 | 64
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM 
partitioned_dependencies_multi WHERE a = 0 AND b = 0');
+ estimated | actual 
+---+
+   128 |384
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM 
partitioned_dependencies_multi WHERE c = 0 AND d = 0');
+ estimated | actual 
+---+
+18 |294
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM 
partitioned_dependencies_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+ estimated | actual 
+---+
+ 1 | 22
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM 
functional_dependencies_multi NATURAL JOIN partitioned_dependencies_multi WHERE 
a = 0 AND b = 0');
+ estimated | actual 
+---+
+45 |  16098
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM 
functional_dependencies_multi NATURAL JOIN partitioned_dependencies_multi WHERE 
c = 0 AND d = 0');
+ estimated | actual 
+---+
+ 4 |  10248
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM 
functional_dependencies_multi NATURAL JOIN partitioned_dependencies_multi WHERE 
a = 0 AND b = 0 AND c = 0 AND d = 0');
+ estimated | actual 
+---+
+ 1 |   1408
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM 
functional_dependencies_multi NATURAL JOIN partitioned_dependencies_multi WHERE 
a = 0 AND b = 0 AND c = 0 AND d = 0 AND e = 0 AND f = 0');
+ estimated | actual 
+---+
+ 1 | 64
+(1 row)
+
 -- create separate functional dependencies
 CREATE STATISTICS functional_dependencies_multi_1 (dependencies) ON a, b FROM 
functional_dependencies_multi;
 CREATE STATISTICS functional_dependencies_multi_2 (dependencies) ON c, d FROM 
functional_dependencies_multi;
+CREATE STATISTICS partitioned_dependencies_multi_0 (dependencies) ON a, b FROM 
partitioned_dependencies_multi_0;
+CREATE STATISTICS partitioned_dependencies_multi_1 (dependencies) ON a, b FROM 
partitioned_dependencies_multi_1;
+CREATE STATISTICS partitioned_dependencies_multi_2 (dependencies) ON c, d FROM 
partitioned_dependencies_multi_2;
+CREATE STATISTICS 

Re: Using multiple extended statistics for estimates

2019-12-09 Thread Tomas Vondra

On Mon, Dec 09, 2019 at 11:56:39AM -0800, Mark Dilger wrote:



On 12/5/19 9:51 AM, Tomas Vondra wrote:

On Thu, Dec 05, 2019 at 06:15:54PM +0100, Tomas Vondra wrote:

On Sun, Dec 01, 2019 at 08:08:58PM +0100, Tomas Vondra wrote:

On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote:


Are you planning to submit a revised patch for this?



Yes, I'll submit a rebased version of this patch shortly. It got broken
because of the recent fix in choose_best_statistics; it shouldn't take
long to update the patch. I do have a couple more related patches in the
queue, so I want to submit them all at once.



OK, here we go - these two patches allow applying multiple extended
statistics, both for MCV lists and functional dependencies. Functional
dependencies are simply merged and then applied at once (so without
choose_best_statistics), while MCV statistics are considered in a greedy
manner by calling choose_best_statistics in a loop.



OK, this time with the patches actually attached ;-)


These look good to me.  I added extra tests (not included in this email)
to verify the code on more interesting test cases, such as partitioned
tables and joins.  Your test cases are pretty trivial, just simple
selects from a single table.



Adding more complex tests like these seems like a good idea - maybe
you'd like to share them?


I'll go mark this "ready for committer".



Thanks for the review. I'll hold off on the commit until the next CF,
though, just to give others a proper opportunity to look at it.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Using multiple extended statistics for estimates

2019-12-09 Thread Mark Dilger




On 12/5/19 9:51 AM, Tomas Vondra wrote:

On Thu, Dec 05, 2019 at 06:15:54PM +0100, Tomas Vondra wrote:

On Sun, Dec 01, 2019 at 08:08:58PM +0100, Tomas Vondra wrote:

On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote:


Are you planning to submit a revised patch for this?



Yes, I'll submit a rebased version of this patch shortly. It got broken
because of the recent fix in choose_best_statistics; it shouldn't take
long to update the patch. I do have a couple more related patches in the
queue, so I want to submit them all at once.



OK, here we go - these two patches allow applying multiple extended
statistics, both for MCV lists and functional dependencies. Functional
dependencies are simply merged and then applied at once (so without
choose_best_statistics), while MCV statistics are considered in a greedy
manner by calling choose_best_statistics in a loop.



OK, this time with the patches actually attached ;-)


These look good to me.  I added extra tests (not included in this email)
to verify the code on more interesting test cases, such as partitioned
tables and joins.  Your test cases are pretty trivial, just simple
selects from a single table.

I'll go mark this "ready for committer".

--
Mark Dilger




Re: Using multiple extended statistics for estimates

2019-12-05 Thread Tomas Vondra

On Thu, Dec 05, 2019 at 06:15:54PM +0100, Tomas Vondra wrote:

On Sun, Dec 01, 2019 at 08:08:58PM +0100, Tomas Vondra wrote:

On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote:


Are you planning to submit a revised patch for this?



Yes, I'll submit a rebased version of this patch shortly. It got broken
because of the recent fix in choose_best_statistics; it shouldn't take
long to update the patch. I do have a couple more related patches in the
queue, so I want to submit them all at once.



OK, here we go - these two patches allow applying multiple extended
statistics, both for MCV lists and functional dependencies. Functional
dependencies are simply merged and then applied at once (so without
choose_best_statistics), while MCV statistics are considered in a greedy
manner by calling choose_best_statistics in a loop.



OK, this time with the patches actually attached ;-)


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 8683f00a54aec7bd3873b63e6844642988b67573 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Tue, 12 Nov 2019 22:57:06 +0100
Subject: [PATCH 1/9] Apply multiple multivariate MCV lists when possible

Until now we've only used a single multivariate MCV list per relation,
covering the largest number of clauses. So for example given a query

SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1

and extended statistics on (a,b) and (c,d), we'd only pick and use one
of them. This commit relaxes this by repeating the process, using the
best statistics (matching the largest number of remaining clauses) in
each step.

This greedy algorithm is very simple, but may not be optimal. There may
be a different choice of stats leaving fewer clauses unestimated and/or
giving better estimates for some other reason.

This can however happen only when there are overlapping statistics, and
selecting one makes it impossible to use the other. E.g. with statistics
on (a,b), (c,d), (b,c,d), we may pick either (a,b) and (c,d) or (b,c,d).
But it's not clear which option is better, as each one ignores
information about possible correlation between different columns.

We however assume cases like this are rare, and the easiest solution is
to define statistics covering the whole group of correlated columns for
a given query. In the future we might support overlapping stats, using
some of the clauses as conditions (in conditional probability sense).

Author: Tomas Vondra
Reviewed-by: Mark Dilger
Discussion: https://postgr.es/m/20191028152048.jc6pqv5hb7j77ocp@development
---
 src/backend/statistics/extended_stats.c | 129 +---
 src/test/regress/expected/stats_ext.out |  58 +++
 src/test/regress/sql/stats_ext.sql  |  36 +++
 3 files changed, 162 insertions(+), 61 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c 
b/src/backend/statistics/extended_stats.c
index 9d339433f6..4ec0148fcc 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1194,11 +1194,6 @@ statext_is_compatible_clause(PlannerInfo *root, Node 
*clause, Index relid,
  * 'estimatedclauses' is an input/output parameter.  We set bits for the
  * 0-based 'clauses' indexes we estimate for and also skip clause items that
  * already have a bit set.
- *
- * XXX If we were to use multiple statistics, this is where it would happen.
- * We would simply repeat this on a loop on the "remaining" clauses, possibly
- * using the already estimated clauses as conditions (and combining the values
- * using conditional probability formula).
  */
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int 
varRelid,
@@ -1208,14 +1203,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, 
List *clauses, int varReli
ListCell   *l;
Bitmapset **list_attnums;
int listidx;
-   StatisticExtInfo *stat;
-   List   *stat_clauses;
-   Selectivity simple_sel,
-   mcv_sel,
-   mcv_basesel,
-   mcv_totalsel,
-   other_sel,
-   sel;
+   Selectivity sel = 1.0;
 
/* check if there's any stats that might be useful for us. */
if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV))
@@ -1250,65 +1238,84 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, 
List *clauses, int varReli
listidx++;
}
 
-   /* find the best suited statistics object for these attnums */
-   stat = choose_best_statistics(rel->statlist, STATS_EXT_MCV,
- list_attnums, 
list_length(clauses));
-
-   /* if no matching stats could be found then we've nothing to do */
-   if (!stat)
-   return 1.0;
+   /* apply as many extended statistics as 

Re: Using multiple extended statistics for estimates

2019-12-05 Thread Tomas Vondra

On Sun, Dec 01, 2019 at 08:08:58PM +0100, Tomas Vondra wrote:

On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote:


Are you planning to submit a revised patch for this?



Yes, I'll submit a rebased version of this patch shortly. It got broken
because of the recent fix in choose_best_statistics; it shouldn't take
long to update the patch. I do have a couple more related patches in the
queue, so I want to submit them all at once.



OK, here we go - these two patches allow applying multiple extended
statistics, both for MCV lists and functional dependencies. Functional
dependencies are simply merged and then applied at once (so without
choose_best_statistics), while MCV statistics are considered in a greedy
manner by calling choose_best_statistics in a loop.
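
For illustration, a minimal SQL sketch of the kind of case these patches
target, mirroring the example from the original proposal (the table,
data and statistics names are made up; exact estimates depend on the
data and are not shown):

  CREATE TABLE t (a int, b int, c int, d int);
  INSERT INTO t SELECT mod(i,50), mod(i,50), mod(i,25), mod(i,25)
    FROM generate_series(1,10000) s(i);
  CREATE STATISTICS s1 (mcv) ON a, b FROM t;
  CREATE STATISTICS s2 (mcv) ON c, d FROM t;
  ANALYZE t;
  -- with the patches, both s1 and s2 can contribute to this estimate;
  -- previously only a single statistics object per relation was used
  EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;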

I do have some additional enhancements in the queue, but those are not
fully baked yet, so I'll post them later in separate patches.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Using multiple extended statistics for estimates

2019-12-01 Thread Tomas Vondra

On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote:


Are you planning to submit a revised patch for this?



Yes, I'll submit a rebased version of this patch shortly. It got broken
because of the recent fix in choose_best_statistics; it shouldn't take
long to update the patch. I do have a couple more related patches in the
queue, so I want to submit them all at once.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Using multiple extended statistics for estimates

2019-11-30 Thread Mark Dilger




On 11/14/19 12:04 PM, Tomas Vondra wrote:

On Thu, Nov 14, 2019 at 10:23:44AM -0800, Mark Dilger wrote:



On 11/14/19 7:55 AM, Tomas Vondra wrote:

On Wed, Nov 13, 2019 at 10:04:36AM -0800, Mark Dilger wrote:



On 11/13/19 7:28 AM, Tomas Vondra wrote:

Hi,

here's an updated patch, with some minor tweaks based on the review and
added tests (I ended up reworking those a bit, to make them more like
the existing ones).


Thanks, Tomas, for the new patch set!

Attached are my review comments so far, in the form of a patch 
applied on top of yours.




Thanks.

1) It's not clear to me why adding 'const' to the List parameters would
  be useful? Can you explain?


When I first started reviewing the functions, I didn't know if those 
lists were intended to be modified by the function.  Adding 'const' 
helps document that the function does not intend to change them.




Hmmm, ok. I'll think about it, but we're not really using const* in this
way very much I think - at least not in the surrounding code.


2) I think you're right we can change find_strongest_dependency to do

   /* also skip weaker dependencies when attribute count matches */
   if (strongest->nattributes == dependency->nattributes &&
   strongest->degree >= dependency->degree)
   continue;

  That'll skip some additional dependencies, which seems OK.

3) It's not clear to me what you mean by

    * TODO: Improve this code comment.  Specifically, why would we
    * ignore that no rows will match?  It seems that such a discovery
    * would allow us to return an estimate of 0 rows, and that would
    * be useful.

  added to dependencies_clauselist_selectivity. Are you saying we
  should also compute selectivity estimates for individual clauses and
  use Min() as a limit? Maybe, but that seems unrelated to the patch.


I mean that the comment right above that TODO is hard to understand. 
You seem to be saying that it is good and proper to only take the 
selectivity estimate from the final clause in the list, but then go on 
to say that other clauses might prove that no rows will match.  So 
that implies that by ignoring all but the last clause, we're ignoring 
such other clauses that prove no rows can match.  But why would we be 
ignoring those?


I am not arguing that your code is wrong.  I'm just critiquing the 
hard-to-understand phrasing of that code comment.




Aha, I think I understand now - thanks for the explanation. You're right
the comment is trying to explain why just taking the last clause for a
given attnum is fine. I'll try to make the comment clearer.


Are you planning to submit a revised patch for this?



--
Mark Dilger




Re: Using multiple extended statistics for estimates

2019-11-14 Thread Tomas Vondra

On Thu, Nov 14, 2019 at 01:17:02PM -0800, Mark Dilger wrote:



On 11/14/19 12:04 PM, Tomas Vondra wrote:

Aha, I think I understand now - thanks for the explanation. You're right
the comment is trying to explain why just taking the last clause for a
given attnum is fine. I'll try to make the comment clearer.

For the case with equal Const values that should be mostly obvious, i.e.
"a=1 AND a=1 AND a=1" has the same selectivity as "a=1".

The case with different Const values is harder, unfortunately. It might
seem obvious that "a=1 AND a=2" means there are no matching rows, but
that heavily relies on the semantics of the equality operator. And we
can't simply compare the Const values either, I'm afraid, because there
are cases with cross-type operators like

 a = 1::int AND a = 1.0::numeric

where the Consts are of different type, yet both conditions can be true.

So it would be pretty tricky to do this, and the current code does not
even try to do that.

Instead, it just assumes that it's mostly fine to overestimate, because
then at runtime we'll simply end up with 0 rows here.


I'm unsure whether that could be a performance problem at runtime.

I could imagine the planner short-circuiting additional planning when
it finds a plan with zero rows, and so we'd save planner time if we
avoid overestimating.  I don't recall if the planner does anything like
that, or if there are plans to implement such logic, but it might be
good not to rule it out.  Tom's suggestion elsewhere in this thread to
use code in predtest.c sounds good to me.



No, AFAIK the planner does not do anything like that - it might probably
do that if it could prove there are no such rows, but that's hardly the
case for estimates based on approximate information (i.e. statistics).

It could do that based on the predicate analysis in predtest.c mentioned
by Tom, although I don't think it does anything beyond tweaking the row
estimate to ~1 row.


I don't know if you want to expand the scope of this particular patch to
include that, though.



Certainly not. It's an interesting but surprisingly complicated problem,
and this patch simply aims to add a different improvement.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Using multiple extended statistics for estimates

2019-11-14 Thread Tomas Vondra

On Thu, Nov 14, 2019 at 03:16:04PM -0500, Tom Lane wrote:

Tomas Vondra  writes:

For the case with equal Const values that should be mostly obvious, i.e.
"a=1 AND a=1 AND a=1" has the same selectivity as "a=1".



The case with different Const values is harder, unfortunately. It might
seem obvious that "a=1 AND a=2" means there are no matching rows, but
that heavily relies on the semantics of the equality operator. And we
can't simply compare the Const values either, I'm afraid, because there
are cases with cross-type operators like
  a = 1::int AND a = 1.0::numeric
where the Consts are of different type, yet both conditions can be true.


FWIW, there's code in predtest.c to handle exactly that, at least for
types sharing a btree opfamily.  Whether it's worth applying that logic
here is unclear, but note that we've had the ability to recognize
redundant and contradictory clauses for a long time:

regression=# explain select * from tenk1 where two = 1;
QUERY PLAN

Seq Scan on tenk1  (cost=0.00..470.00 rows=5000 width=244)
  Filter: (two = 1)
(2 rows)

regression=# explain select * from tenk1 where two = 1 and two = 1::bigint;
QUERY PLAN

Seq Scan on tenk1  (cost=0.00..470.00 rows=5000 width=244)
  Filter: (two = 1)
(2 rows)

regression=# explain select * from tenk1 where two = 1 and two = 2::bigint;
 QUERY PLAN
---
Result  (cost=0.00..470.00 rows=1 width=244)
  One-Time Filter: false
  ->  Seq Scan on tenk1  (cost=0.00..470.00 rows=1 width=244)
Filter: (two = 1)
(4 rows)

It falls down on

regression=# explain select * from tenk1 where two = 1 and two = 2::numeric;
   QUERY PLAN
---
Seq Scan on tenk1  (cost=0.00..520.00 rows=25 width=244)
  Filter: ((two = 1) AND ((two)::numeric = '2'::numeric))
(2 rows)

because numeric isn't in the same opfamily, so these clauses can't be
compared easily.

regards, tom lane


Yeah, and this logic still works - the redundant clauses won't even get
to the selectivity estimation, I think. So maybe the comment is not
quite necessary, because the problem does not even exist ...

Maybe we could do something about the cases that predtest.c can't solve,
but it's not clear if we can be much smarter for types with different
opfamilies.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Using multiple extended statistics for estimates

2019-11-14 Thread Mark Dilger




On 11/14/19 12:04 PM, Tomas Vondra wrote:

Aha, I think I understand now - thanks for the explanation. You're right
the comment is trying to explain why just taking the last clause for a
given attnum is fine. I'll try to make the comment clearer.

For the case with equal Const values that should be mostly obvious, i.e.
"a=1 AND a=1 AND a=1" has the same selectivity as "a=1".

The case with different Const values is harder, unfortunately. It might
seem obvious that "a=1 AND a=2" means there are no matching rows, but
that heavily relies on the semantics of the equality operator. And we
can't simply compare the Const values either, I'm afraid, because there
are cases with cross-type operators like

  a = 1::int AND a = 1.0::numeric

where the Consts are of different type, yet both conditions can be true.

So it would be pretty tricky to do this, and the current code does not
even try to do that.

Instead, it just assumes that it's mostly fine to overestimate, because
then at runtime we'll simply end up with 0 rows here.


I'm unsure whether that could be a performance problem at runtime.

I could imagine the planner short-circuiting additional planning when
it finds a plan with zero rows, and so we'd save planner time if we
avoid overestimating.  I don't recall if the planner does anything like
that, or if there are plans to implement such logic, but it might be
good not to rule it out.  Tom's suggestion elsewhere in this thread to
use code in predtest.c sounds good to me.

I don't know if you want to expand the scope of this particular patch to
include that, though.

--
Mark Dilger




Re: Using multiple extended statistics for estimates

2019-11-14 Thread Tom Lane
Tomas Vondra  writes:
> For the case with equal Const values that should be mostly obvious, i.e.
> "a=1 AND a=1 AND a=1" has the same selectivity as "a=1".

> The case with different Const values is harder, unfortunately. It might
> seem obvious that "a=1 AND a=2" means there are no matching rows, but
> that heavily relies on the semantics of the equality operator. And we
> can't simply compare the Const values either, I'm afraid, because there
> are cases with cross-type operators like
>   a = 1::int AND a = 1.0::numeric
> where the Consts are of different type, yet both conditions can be true.

FWIW, there's code in predtest.c to handle exactly that, at least for
types sharing a btree opfamily.  Whether it's worth applying that logic
here is unclear, but note that we've had the ability to recognize
redundant and contradictory clauses for a long time:

regression=# explain select * from tenk1 where two = 1;  
 QUERY PLAN 

 Seq Scan on tenk1  (cost=0.00..470.00 rows=5000 width=244)
   Filter: (two = 1)
(2 rows)

regression=# explain select * from tenk1 where two = 1 and two = 1::bigint; 
 QUERY PLAN 

 Seq Scan on tenk1  (cost=0.00..470.00 rows=5000 width=244)
   Filter: (two = 1)
(2 rows)

regression=# explain select * from tenk1 where two = 1 and two = 2::bigint;
  QUERY PLAN   
---
 Result  (cost=0.00..470.00 rows=1 width=244)
   One-Time Filter: false
   ->  Seq Scan on tenk1  (cost=0.00..470.00 rows=1 width=244)
 Filter: (two = 1)
(4 rows)

It falls down on

regression=# explain select * from tenk1 where two = 1 and two = 2::numeric;
QUERY PLAN 
---
 Seq Scan on tenk1  (cost=0.00..520.00 rows=25 width=244)
   Filter: ((two = 1) AND ((two)::numeric = '2'::numeric))
(2 rows)

because numeric isn't in the same opfamily, so these clauses can't be
compared easily.

regards, tom lane




Re: Using multiple extended statistics for estimates

2019-11-14 Thread Tomas Vondra

On Thu, Nov 14, 2019 at 10:23:44AM -0800, Mark Dilger wrote:



On 11/14/19 7:55 AM, Tomas Vondra wrote:

On Wed, Nov 13, 2019 at 10:04:36AM -0800, Mark Dilger wrote:



On 11/13/19 7:28 AM, Tomas Vondra wrote:

Hi,

here's an updated patch, with some minor tweaks based on the review and
added tests (I ended up reworking those a bit, to make them more like
the existing ones).


Thanks, Tomas, for the new patch set!

Attached are my review comments so far, in the form of a patch 
applied on top of yours.




Thanks.

1) It's not clear to me why adding 'const' to the List parameters would
  be useful? Can you explain?


When I first started reviewing the functions, I didn't know if those 
lists were intended to be modified by the function.  Adding 'const' 
helps document that the function does not intend to change them.




Hmmm, ok. I'll think about it, but we're not really using const* in this
way very much I think - at least not in the surrounding code.


2) I think you're right we can change find_strongest_dependency to do

   /* also skip weaker dependencies when attribute count matches */
   if (strongest->nattributes == dependency->nattributes &&
   strongest->degree >= dependency->degree)
   continue;

  That'll skip some additional dependencies, which seems OK.

3) It's not clear to me what you mean by

    * TODO: Improve this code comment.  Specifically, why would we
    * ignore that no rows will match?  It seems that such a discovery
    * would allow us to return an estimate of 0 rows, and that would
    * be useful.

  added to dependencies_clauselist_selectivity. Are you saying we
  should also compute selectivity estimates for individual clauses and
  use Min() as a limit? Maybe, but that seems unrelated to the patch.


I mean that the comment right above that TODO is hard to understand. 
You seem to be saying that it is good and proper to only take the 
selectivity estimate from the final clause in the list, but then go on 
to say that other clauses might prove that no rows will match.  So 
that implies that by ignoring all but the last clause, we're ignoring 
such other clauses that prove no rows can match.  But why would we be 
ignoring those?


I am not arguing that your code is wrong.  I'm just critiquing the 
hard-to-understand phrasing of that code comment.




Aha, I think I understand now - thanks for the explanation. You're right
the comment is trying to explain why just taking the last clause for a
given attnum is fine. I'll try to make the comment clearer.

For the case with equal Const values that should be mostly obvious, i.e.
"a=1 AND a=1 AND a=1" has the same selectivity as "a=1".

The case with different Const values is harder, unfortunately. It might
seem obvious that "a=1 AND a=2" means there are no matching rows, but
that heavily relies on the semantics of the equality operator. And we
can't simply compare the Const values either, I'm afraid, because there
are cases with cross-type operators like

 a = 1::int AND a = 1.0::numeric

where the Consts are of different type, yet both conditions can be true.

So it would be pretty tricky to do this, and the current code does not
even try to do that.

Instead, it just assumes that it's mostly fine to overestimate, because
then at runtime we'll simply end up with 0 rows here.
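
As a concrete illustration of the case being discussed (hypothetical
table t with an integer column a; note that for constants sharing a
btree opfamily, predtest.c may already simplify the contradiction, as
covered elsewhere in this thread):

  -- logically contradictory, but proving that requires relying on the
  -- semantics of the equality operators; the extended-stats code just
  -- accepts a possible overestimate, and at runtime the scan returns 0 rows
  SELECT * FROM t WHERE a = 1::int AND a = 2::numeric;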


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Using multiple extended statistics for estimates

2019-11-14 Thread Mark Dilger




On 11/14/19 7:55 AM, Tomas Vondra wrote:

On Wed, Nov 13, 2019 at 10:04:36AM -0800, Mark Dilger wrote:



On 11/13/19 7:28 AM, Tomas Vondra wrote:

Hi,

here's an updated patch, with some minor tweaks based on the review and
added tests (I ended up reworking those a bit, to make them more like
the existing ones).


Thanks, Tomas, for the new patch set!

Attached are my review comments so far, in the form of a patch applied 
on top of yours.




Thanks.

1) It's not clear to me why adding 'const' to the List parameters would
   be useful? Can you explain?


When I first started reviewing the functions, I didn't know if those 
lists were intended to be modified by the function.  Adding 'const' 
helps document that the function does not intend to change them.



2) I think you're right we can change find_strongest_dependency to do

    /* also skip weaker dependencies when attribute count matches */
    if (strongest->nattributes == dependency->nattributes &&
    strongest->degree >= dependency->degree)
    continue;

   That'll skip some additional dependencies, which seems OK.

3) It's not clear to me what you mean by

     * TODO: Improve this code comment.  Specifically, why would we
     * ignore that no rows will match?  It seems that such a discovery
     * would allow us to return an estimate of 0 rows, and that would
     * be useful.

   added to dependencies_clauselist_selectivity. Are you saying we
   should also compute selectivity estimates for individual clauses and
   use Min() as a limit? Maybe, but that seems unrelated to the patch.


I mean that the comment right above that TODO is hard to understand. You 
seem to be saying that it is good and proper to only take the 
selectivity estimate from the final clause in the list, but then go on 
to say that other clauses might prove that no rows will match.  So that 
implies that by ignoring all but the last clause, we're ignoring such 
other clauses that prove no rows can match.  But why would we be 
ignoring those?


I am not arguing that your code is wrong.  I'm just critiquing the 
hard-to-understand phrasing of that code comment.


--
Mark Dilger




Re: Using multiple extended statistics for estimates

2019-11-14 Thread Tomas Vondra

On Wed, Nov 13, 2019 at 10:04:36AM -0800, Mark Dilger wrote:



On 11/13/19 7:28 AM, Tomas Vondra wrote:

Hi,

here's an updated patch, with some minor tweaks based on the review and
added tests (I ended up reworking those a bit, to make them more like
the existing ones).


Thanks, Tomas, for the new patch set!

Attached are my review comments so far, in the form of a patch applied 
on top of yours.




Thanks.

1) It's not clear to me why adding 'const' to the List parameters would
  be useful? Can you explain?

2) I think you're right we can change find_strongest_dependency to do

   /* also skip weaker dependencies when attribute count matches */
   if (strongest->nattributes == dependency->nattributes &&
   strongest->degree >= dependency->degree)
   continue;

  That'll skip some additional dependencies, which seems OK.

3) It's not clear to me what you mean by

* TODO: Improve this code comment.  Specifically, why would we
* ignore that no rows will match?  It seems that such a discovery
* would allow us to return an estimate of 0 rows, and that would
* be useful.

  added to dependencies_clauselist_selectivity. Are you saying we
  should also compute selectivity estimates for individual clauses and
  use Min() as a limit? Maybe, but that seems unrelated to the patch.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Using multiple extended statistics for estimates

2019-11-13 Thread Mark Dilger



On 11/13/19 7:28 AM, Tomas Vondra wrote:

Hi,

here's an updated patch, with some minor tweaks based on the review and
added tests (I ended up reworking those a bit, to make them more like
the existing ones).


Thanks, Tomas, for the new patch set!

Attached are my review comments so far, in the form of a patch applied 
on top of yours.


--
Mark Dilger
diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index a6ca11c675..ccf6e41b9c 100644
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -850,7 +850,8 @@ dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
  *		find the strongest dependency on the attributes
  *
  * When applying functional dependencies, we start with the strongest
- * dependencies. That is, we select the dependency that:
+ * dependencies. That is, we select the best dependency in the following
+ * descending order of preference:
  *
  * (a) has all attributes covered by equality clauses
  *
@@ -860,6 +861,9 @@ dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
  *
  * This guarantees that we eliminate the most redundant conditions first
  * (see the comment in dependencies_clauselist_selectivity).
+ *
+ * TODO: Rename 'dependencies', as that name is what you'd expect for an
+ *   argument of type (MVDependencies *), not (MVDependencies **)
  */
 static MVDependency *
 find_strongest_dependency(MVDependencies **dependencies, int ndependencies,
@@ -897,11 +901,14 @@ find_strongest_dependency(MVDependencies **dependencies, int ndependencies,
 
 /* also skip weaker dependencies when attribute count matches */
 if (strongest->nattributes == dependency->nattributes &&
-	strongest->degree > dependency->degree)
+	strongest->degree > dependency->degree)		/* TODO: Why not ">=" here? */
 	continue;
 			}
 
 			/*
+			 * TODO: As coded, this dependency is "at least as strong as", not
+			 * necessarily "stronger".
+			 *
 			 * this dependency is stronger, but we must still check that it's
 			 * fully matched to these attnums. We perform this check last as it's
 			 * slightly more expensive than the previous checks.
@@ -942,7 +949,7 @@ find_strongest_dependency(MVDependencies **dependencies, int ndependencies,
  */
 Selectivity
 dependencies_clauselist_selectivity(PlannerInfo *root,
-	List *clauses,
+	const List *clauses,
 	int varRelid,
 	JoinType jointype,
 	SpecialJoinInfo *sjinfo,
@@ -1084,6 +1091,11 @@ dependencies_clauselist_selectivity(PlannerInfo *root,
 			 * to a different Const in another clause then no rows will match
 			 * anyway. If it happens to be compared to the same Const, then
 			 * ignoring the additional clause is just the thing to do.
+			 *
+			 * TODO: Improve this code comment.  Specifically, why would we
+			 * ignore that no rows will match?  It seems that such a discovery
+			 * would allow us to return an estimate of 0 rows, and that would
+			 * be useful.
 			 */
 			if (dependency_implies_attribute(dependency,
 			 list_attnums[listidx]))
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index b55799b8b1..a1798a6b91 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1202,7 +1202,7 @@ statext_mcv_remaining_attnums(int nclauses, Bitmapset **estimatedclauses,
  * already have a bit set.
  */
 static Selectivity
-statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
+statext_mcv_clauselist_selectivity(PlannerInfo *root, const List *clauses, int varRelid,
    JoinType jointype, SpecialJoinInfo *sjinfo,
    RelOptInfo *rel, Bitmapset **estimatedclauses)
 {
@@ -1340,7 +1340,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
  *		Estimate clauses using the best multi-column statistics.
  */
 Selectivity
-statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
+statext_clauselist_selectivity(PlannerInfo *root, const List *clauses, int varRelid,
 			   JoinType jointype, SpecialJoinInfo *sjinfo,
 			   RelOptInfo *rel, Bitmapset **estimatedclauses)
 {
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 588b6738b2..ebac910d72 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -104,14 +104,14 @@ extern int ComputeExtStatisticsRows(Relation onerel,
 	int natts, VacAttrStats **stats);
 extern bool statext_is_kind_built(HeapTuple htup, char kind);
 extern Selectivity dependencies_clauselist_selectivity(PlannerInfo *root,
-	   List *clauses,
+	   const List *clauses,
 	   int varRelid,
 	   JoinType jointype,
 	   SpecialJoinInfo *sjinfo,
 	   RelOptInfo *rel,
 	   Bitmapset **estimatedclauses);
 

Re: Using multiple extended statistics for estimates

2019-11-13 Thread Tomas Vondra

Hi,

here's an updated patch, with some minor tweaks based on the review and
added tests (I ended up reworking those a bit, to make them more like
the existing ones).

There's also a new piece, dealing with functional dependencies. Until
now we did the same thing as for MCV lists - we picked the "best"
extended statistics (with functional dependencies built) and just used
that. At first I thought we might simply do the same loop as for MCV
lists, but that does not really make sense because we might end up
applying a "weaker" dependency first.

Say for example we have a table with columns (a,b,c,d,e) and functional
dependencies on (a,b,c,d) and (c,d,e) where all the dependencies on
(a,b,c,d) are weaker than (c,d => e). In a query with clauses on all
attributes this is guaranteed to apply all dependencies from the first
statistic first, which is clearly wrong.

So what this does instead is simply merging all the dependencies from
all the relevant stats, and treating them as a single collection.
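
For illustration, a hedged SQL sketch of the scenario described above
(the table, column and statistics object names are made up here):

  CREATE TABLE t (a int, b int, c int, d int, e int);
  CREATE STATISTICS deps_abcd (dependencies) ON a, b, c, d FROM t;
  CREATE STATISTICS deps_cde  (dependencies) ON c, d, e FROM t;
  ANALYZE t;
  -- clauses on all five attributes; once the dependencies from both
  -- statistics objects are merged, the strongest one (e.g. c,d => e)
  -- can be applied first even though it comes from the second object
  SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1 AND e = 1;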


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 0e09c4749f3712da1983374a2838e4b7e14b7c62 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Tue, 12 Nov 2019 22:57:06 +0100
Subject: [PATCH 01/10] Apply multiple multivariate MCV lists when possible

Until now we've only used a single multivariate MCV list per relation,
covering the largest number of clauses. So for example given a query

SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1

and extended statistics on (a,b) and (c,d), we'd only pick and use one
of them. This commit relaxes this by repeating the process, using the
best statistics (matching the largest number of remaining clauses) in
each step.

This greedy algorithm is very simple, but may not be optimal. There may
be a different choice of stats leaving fewer clauses unestimated and/or
giving better estimates for some other reason.

This can however happen only when there are overlapping statistics, and
selecting one makes it impossible to use the other. E.g. with statistics
on (a,b), (c,d), (b,c,d), we may pick either (a,b) and (c,d) or (b,c,d).
But it's not clear which option is better, as each one ignores
information about possible correlation between different columns.

We however assume cases like this are rare, and the easiest solution is
to define statistics covering the whole group of correlated columns for
a given query. In the future we might support overlapping stats, using
some of the clauses as conditions (in conditional probability sense).

Author: Tomas Vondra
Reviewed-by: Mark Dilger
Discussion: https://postgr.es/m/20191028152048.jc6pqv5hb7j77ocp@development
---
 src/backend/statistics/extended_stats.c | 166 ++--
 src/test/regress/expected/stats_ext.out |  58 +
 src/test/regress/sql/stats_ext.sql  |  36 +
 3 files changed, 195 insertions(+), 65 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c 
b/src/backend/statistics/extended_stats.c
index 207ee3160e..b55799b8b1 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1123,6 +1123,33 @@ statext_is_compatible_clause(PlannerInfo *root, Node 
*clause, Index relid,
return true;
 }
 
+/*
+ * statext_mcv_clause_attnums
+ * Recalculate attnums from compatible but not-yet-estimated 
clauses.
+ */
+static Bitmapset *
+statext_mcv_remaining_attnums(int nclauses, Bitmapset **estimatedclauses,
+ Bitmapset 
**list_attnums)
+{
+   int listidx;
+   Bitmapset  *attnums = NULL;
+
+   for (listidx = 0; listidx < nclauses; listidx++)
+   {
+   /*
+* Skip clauses that have no precalculated attnums, which means 
it is
+* either incompatible or was already used by some other 
statistic.
+*/
+   if (!list_attnums[listidx])
+   continue;
+
+   if (!bms_is_member(listidx, *estimatedclauses))
+   attnums = bms_add_members(attnums, 
list_attnums[listidx]);
+   }
+
+   return attnums;
+}
+
 /*
  * statext_mcv_clauselist_selectivity
  * Estimate clauses using the best multi-column statistics.
@@ -1173,11 +1200,6 @@ statext_is_compatible_clause(PlannerInfo *root, Node 
*clause, Index relid,
  * 'estimatedclauses' is an input/output parameter.  We set bits for the
  * 0-based 'clauses' indexes we estimate for and also skip clause items that
  * already have a bit set.
- *
- * XXX If we were to use multiple statistics, this is where it would happen.
- * We would simply repeat this on a loop on the "remaining" clauses, possibly
- * using the already estimated clauses as conditions (and combining the values
- * using conditional probability formula).
  */
 static Selectivity
 

Re: Using multiple extended statistics for estimates

2019-11-10 Thread Tomas Vondra

On Sat, Nov 09, 2019 at 02:32:27PM -0800, Mark Dilger wrote:



On 11/9/19 12:33 PM, Mark Dilger wrote:



On 11/6/19 11:58 AM, Tomas Vondra wrote:

On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote:

On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:

Hi,

PostgreSQL 10 introduced extended statistics, allowing us to consider
correlation between columns to improve estimates, and PostgreSQL 12
added support for MCV statistics. But we still had the limitation that
we only allowed using a single extended statistics per relation, i.e.
given a table with two extended stats

 CREATE TABLE t (a int, b int, c int, d int);
 CREATE STATISTICS s1 (mcv) ON a, b FROM t;
 CREATE STATISTICS s2 (mcv) ON c, d FROM t;

and a query

 SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;

we only ever used one of the statistics (and we considered them in a not
particularly well determined order).

This patch addresses this by using as many extended stats as possible,
by adding a loop to statext_mcv_clauselist_selectivity(). In each step
we pick the "best" applicable statistics (in the sense of covering the
most attributes) and factor it into the overall estimate.


Tomas,

Your patch compiles and passes the regression tests for me on debian 
linux under master.


Since your patch does not include modified regression tests, I wrote 
a test that I expected to improve under this new code, but running 
it both before and after applying your patch, there is no change.


Ok, the attached test passes before applying your patch and fails 
afterward owing to the estimates improving and no longer matching the 
expected output.  To be clear, this confirms your patch working as 
expected.


I haven't seen any crashes in several hours of running different 
tests, so I think it looks good.




Yep, thanks for adding the tests. I'll include them in the patch.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Using multiple extended statistics for estimates

2019-11-10 Thread Tomas Vondra

On Sat, Nov 09, 2019 at 12:33:05PM -0800, Mark Dilger wrote:



On 11/6/19 11:58 AM, Tomas Vondra wrote:

On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote:

On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:

Hi,

PostgreSQL 10 introduced extended statistics, allowing us to consider
correlation between columns to improve estimates, and PostgreSQL 12
added support for MCV statistics. But we still had the limitation that
we only allowed using a single extended statistics per relation, i.e.
given a table with two extended stats

 CREATE TABLE t (a int, b int, c int, d int);
 CREATE STATISTICS s1 (mcv) ON a, b FROM t;
 CREATE STATISTICS s2 (mcv) ON c, d FROM t;

and a query

 SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;

we only ever used one of the statistics (and we considered them in a not
particularly well determined order).

This patch addresses this by using as many extended stats as possible,
by adding a loop to statext_mcv_clauselist_selectivity(). In each step
we pick the "best" applicable statistics (in the sense of covering the
most attributes) and factor it into the overall estimate.


Tomas,

Your patch compiles and passes the regression tests for me on debian 
linux under master.




Thanks.

Since your patch does not include modified regression tests, I wrote a 
test that I expected to improve under this new code, but running it 
both before and after applying your patch, there is no change.  Please 
find the modified test attached.  Am I wrong to expect some change in 
this test's output?  If so, can you provide a test example that works 
differently under your patch?




Those queries are not improved by the patch, because we only support
clauses "Var op Const" for now - your tests are using "Var op Var" so
that doesn't work.
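
To illustrate the distinction (hypothetical table t with integer columns
a and b):

  -- "Var op Const" clauses - these can benefit from extended statistics
  SELECT * FROM t WHERE a = 1 AND b = 1;
  -- "Var op Var" clauses - currently not handled by the extended-stats code
  SELECT * FROM t WHERE a = b;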

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Using multiple extended statistics for estimates

2019-11-09 Thread Mark Dilger



On 11/9/19 12:33 PM, Mark Dilger wrote:



On 11/6/19 11:58 AM, Tomas Vondra wrote:

On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote:

On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:

Hi,

PostgreSQL 10 introduced extended statistics, allowing us to consider
correlation between columns to improve estimates, and PostgreSQL 12
added support for MCV statistics. But we still had the limitation that
we only allowed using a single extended statistics per relation, i.e.
given a table with two extended stats

 CREATE TABLE t (a int, b int, c int, d int);
 CREATE STATISTICS s1 (mcv) ON a, b FROM t;
 CREATE STATISTICS s2 (mcv) ON c, d FROM t;

and a query

 SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;

we only ever used one of the statistics (and we considered them in a not
particularly well determined order).

This patch addresses this by using as many extended stats as possible,
by adding a loop to statext_mcv_clauselist_selectivity(). In each step
we pick the "best" applicable statistics (in the sense of covering the
most attributes) and factor it into the overall estimate.


Tomas,

Your patch compiles and passes the regression tests for me on debian 
linux under master.


Since your patch does not include modified regression tests, I wrote a 
test that I expected to improve under this new code, but running it both 
before and after applying your patch, there is no change.


Ok, the attached test passes before applying your patch and fails 
afterward owing to the estimates improving and no longer matching the 
expected output.  To be clear, this confirms your patch working as expected.


I haven't seen any crashes in several hours of running different tests, 
so I think it looks good.



--
Mark Dilger
diff --git a/src/test/regress/expected/stats_ext.out 
b/src/test/regress/expected/stats_ext.out
index b65228fa07..e89865e3ee 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -353,6 +353,48 @@ SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM 
ndistinct GROUP BY a, d
500 | 50
 (1 row)
 
+-- muliple extended statistics
+CREATE TABLE multistats (
+   a INTEGER,
+   b INTEGER,
+   c INTEGER,
+   d INTEGER
+);
+-- Insert unique values
+INSERT INTO multistats (a, b, c, d)
+   SELECT i, i*2, i*3, i*4
+   FROM generate_series(1,1) s(i);
+-- Duplidate one set of values 1 times
+INSERT INTO multistats (a, b, c, d)
+   SELECT 0, 0, 0, 0
+   FROM generate_series(1,1) s(i);
+-- estimates without mcv statistics
+ANALYZE multistats;
+SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE a = 0 AND b 
= 0 AND c = 0 AND d = 0');
+ estimated | actual 
+---+
+  1250 |  1
+(1 row)
+
+-- create some mcv statistics
+CREATE STATISTICS ms_ab (mcv) ON a, b FROM multistats;
+ANALYZE multistats;
+SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE a = 0 AND b 
= 0 AND c = 0 AND d = 0');
+ estimated | actual 
+---+
+  2500 |  1
+(1 row)
+
+-- create some mcv statistics
+CREATE STATISTICS ms_cd (mcv) ON c, d FROM multistats;
+ANALYZE multistats;
+SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE a = 0 AND b 
= 0 AND c = 0 AND d = 0');
+ estimated | actual 
+---+
+  2500 |  1
+(1 row)
+
+DROP TABLE multistats;
 -- functional dependencies tests
 CREATE TABLE functional_dependencies (
 filler1 TEXT,
diff --git a/src/test/regress/sql/stats_ext.sql 
b/src/test/regress/sql/stats_ext.sql
index 040ee97a1e..ba802d7490 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -224,6 +224,41 @@ SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM 
ndistinct GROUP BY b, c
 
 SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, 
d');
 
+-- muliple extended statistics
+CREATE TABLE multistats (
+   a INTEGER,
+   b INTEGER,
+   c INTEGER,
+   d INTEGER
+);
+
+-- Insert unique values
+INSERT INTO multistats (a, b, c, d)
+   SELECT i, i*2, i*3, i*4
+   FROM generate_series(1,1) s(i);
+-- Duplidate one set of values 1 times
+INSERT INTO multistats (a, b, c, d)
+   SELECT 0, 0, 0, 0
+   FROM generate_series(1,1) s(i);
+
+-- estimates without mcv statistics
+ANALYZE multistats;
+SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE a = 0 AND b 
= 0 AND c = 0 AND d = 0');
+
+-- create some mcv statistics
+CREATE STATISTICS ms_ab (mcv) ON a, b FROM multistats;
+
+ANALYZE multistats;
+SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE a = 0 AND b 
= 0 AND c = 0 AND d = 0');
+
+-- create some mcv statistics
+CREATE STATISTICS ms_cd (mcv) ON c, d FROM multistats;
+
+ANALYZE multistats;
+SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE a = 0 AND b 
= 0 AND c = 0 AND d = 0');
+
+DROP TABLE multistats;
+
 -- 

Re: Using multiple extended statistics for estimates

2019-11-09 Thread Mark Dilger



On 11/6/19 11:58 AM, Tomas Vondra wrote:

On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote:

On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:

Hi,

PostgreSQL 10 introduced extended statistics, allowing us to consider
correlation between columns to improve estimates, and PostgreSQL 12
added support for MCV statistics. But we still had the limitation that
we only allowed using a single extended statistics per relation, i.e.
given a table with two extended stats

 CREATE TABLE t (a int, b int, c int, d int);
 CREATE STATISTICS s1 (mcv) ON a, b FROM t;
 CREATE STATISTICS s2 (mcv) ON c, d FROM t;

and a query

 SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;

we only ever used one of the statistics (and we considered them in a not
particularly well determined order).

This patch addresses this by using as many extended stats as possible,
by adding a loop to statext_mcv_clauselist_selectivity(). In each step
we pick the "best" applicable statistics (in the sense of covering the
most attributes) and factor it into the overall estimate.


Tomas,

Your patch compiles and passes the regression tests for me on debian 
linux under master.


Since your patch does not include modified regression tests, I wrote a 
test that I expected to improve under this new code, but running it both 
before and after applying your patch, there is no change.  Please find 
the modified test attached.  Am I wrong to expect some change in this 
test's output?  If so, can you provide a test example that works 
differently under your patch?


Thanks!


--
Mark Dilger
diff --git a/src/test/regress/expected/stats_ext.out 
b/src/test/regress/expected/stats_ext.out
index b65228fa07..8e20df25fc 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -353,6 +353,298 @@ SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM 
ndistinct GROUP BY a, d
500 | 50
 (1 row)
 
+-- muliple extended statistics
+CREATE TABLE multistats (
+   i2 INTEGER,
+   i3 INTEGER,
+   i17 INTEGER,
+   i19 INTEGER,
+   i34 INTEGER,
+   i38 INTEGER,
+   i51 INTEGER,
+   i57 INTEGER
+);
+INSERT INTO multistats (i2, i3, i17, i19, i34, i38, i51, i57)
+   SELECT i % 2, i % 3, i % 17, i % 19, i % 34, i % 38, i % 51, i % 57
+   FROM generate_series(1, 50) s(i);
+-- estimates without mcv statistics
+ANALYZE multistats;
+SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY 
i2, i34');
+ estimated | actual 
+---+
+68 | 34
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY 
i2, i38');
+ estimated | actual 
+---+
+76 | 38
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY 
i17, i34');
+ estimated | actual 
+---+
+   578 | 34
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY 
i19, i38');
+ estimated | actual 
+---+
+   722 | 38
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY 
i17, i19, i34, i38');
+ estimated | actual 
+---+
+ 5 |646
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY 
i3, i51');
+ estimated | actual 
+---+
+   153 | 51
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY 
i3, i57');
+ estimated | actual 
+---+
+   171 | 57
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY 
i17, i51');
+ estimated | actual 
+---+
+   867 | 51
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY 
i19, i57');
+ estimated | actual 
+---+
+  1083 | 57
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY 
i17, i19, i51, i57');
+ estimated | actual 
+---+
+ 5 |969
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE i2 = i34');
+ estimated | actual 
+---+
+  2500 |  29411
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE i2 = i38');
+ estimated | actual 
+---+
+  2500 |  26315
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE i17 = i34');
+ estimated | actual 
+---+
+  2500 | 250001
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE i19 = i38');
+ estimated | actual 
+---+
+  2500 | 250001
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE i17 = i34 
AND i19 = i38');
+ estimated | actual 
+---+
+12 | 125387
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE i17 = i34 

Re: Using multiple extended statistics for estimates

2019-11-07 Thread Tomas Vondra

On Thu, Nov 07, 2019 at 01:38:20PM +0900, Kyotaro Horiguchi wrote:

Hello.

At Wed, 6 Nov 2019 20:58:49 +0100, Tomas Vondra  
wrote in

>Here is a slightly polished v2 of the patch, the main difference being
>that computing clause_attnums was moved to a separate function.
>

This time with the attachment ;-)


This patch is a kind of straight-forward, which repeats what the
previous statext_mcv_clauselist_selectivity did as long as remaining
clauses matches any of MV-MCVs. Almost no regression in the cases
where zero or just one MV-MCV applies to the given clause list.

It applies cleanly on the current master and seems to work as
expected.


I have some comments.

Could we have a description in the documentation of how multiple
MV-MCVs are used in a query? And don't we need some regression tests?



Yes, regression tests are certainly needed - I thought I'd added them,
but it seems I failed to include them in the patch. Will fix.

I agree it's probably worth mentioning that we can consider multiple
stats, but I'm a bit hesitant to put the exact rules for how we pick the
"best" statistic into the docs. It's not 100% deterministic, and it's
likely we'll need to tweak it a bit in the future.

I'd prefer showing the stats in EXPLAIN, but that's a separate patch.



+/*
+ * statext_mcv_clause_attnums
+ * Recalculate attnums from compatible but not-yet-estimated 
clauses.

It returns attnums collected from multiple clause*s*. Is the name
"clause_attnums" (singular) still OK for that?

The comment reads as if the function checks the compatibility of each
clause, but that work is done on the caller side. I'm not sure such
strictness is required, but it might be better for the comment to
describe what the function actually does.



But the incompatible clauses have the pre-computed attnums set to NULL,
so technically the comment is correct. But I'll clarify.



+ */
+static Bitmapset *
+statext_mcv_clause_attnums(int nclauses, Bitmapset **estimatedclauses,
+  Bitmapset **list_attnums)

The last two parameters have the same type in the notation but
different actual types: one is a pointer to a Bitmapset*, and the
other is an array of Bitmapset*. The code in the function itself
suggests that, but it would be helpful to have a brief explanation of
the parameters in the function comment.



OK, will explain in a comment.
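For illustration, one possible wording for that comment (using the
parameter names from the v2 patch, not the final text):

    /*
     * nclauses         - number of clauses in the 'clauses' list
     * estimatedclauses - bitmapset of 0-based indexes of clauses that
     *                    were already estimated (input/output parameter
     *                    of the caller)
     * list_attnums     - array of nclauses bitmapsets with precomputed
     *                    attnums; NULL means the clause is incompatible
     *                    or was already used by another statistics object
     */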


+   /*
+* Recompute attnums in the remaining clauses (we simply use 
the bitmaps
+* computed earlier, so that we don't have to inspect the 
clauses again).
+*/
+   clauses_attnums = 
statext_mcv_clause_attnums(list_length(clauses),

Couldn't we avoid calling this function twice with the same parameters
on the first iteration of the loop?



Hmmm, yeah. That's a good point.
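One way to avoid the redundant call, as a rough sketch (not necessarily
how it will end up being committed): compute the attnums once before
entering the loop, and refresh them only after a statistics object has
actually been applied.

    clauses_attnums = statext_mcv_clause_attnums(list_length(clauses),
                                                 estimatedclauses,
                                                 list_attnums);

    while (true)
    {
        StatisticExtInfo *stat;

        /* pick the best statistics object for the remaining attnums */
        stat = choose_best_statistics(rel->statlist, clauses_attnums,
                                      STATS_EXT_MCV);
        if (!stat)
            break;

        /* ... estimate the clauses covered by 'stat', mark them ... */

        /* refresh attnums from the clauses still left unestimated */
        clauses_attnums = statext_mcv_clause_attnums(list_length(clauses),
                                                     estimatedclauses,
                                                     list_attnums);
    }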


+   foreach(l, clauses)
{
-   stat_clauses = lappend(stat_clauses, (Node *) 
lfirst(l));
-   *estimatedclauses = bms_add_member(*estimatedclauses, 
listidx);
+   /*
+* If the clause is compatible with the selected 
statistics, mark it
+* as estimated and add it to the list to estimate.
+*/
+   if (list_attnums[listidx] != NULL &&
+   bms_is_subset(list_attnums[listidx], 
stat->keys))
+   {
+   stat_clauses = lappend(stat_clauses, (Node *) 
lfirst(l));
+   *estimatedclauses = 
bms_add_member(*estimatedclauses, listidx);
+   }

The loop runs through all clauses every time. I agree that this is
better than using a copy of the clauses to avoid stepping on already
estimated clauses, but maybe we need an assertion that listidx is
not a member of estimatedclauses, to make sure no clause is
estimated twice.



Well, we can't really operate on a smaller "copy" of the list anyway,
because that would break the precalculation logic (the listidx value
would be incorrect for the new list), and tweaking it would be more
expensive than just iterating over all clauses. The assumption is that
we won't see an extremely large number of clauses here.

Adding an assert seems reasonable, along with a comment explaining why
we should not see any already-estimated clauses here.
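For the record, a minimal sketch of what that could look like in the
loop collecting stat_clauses (not the final wording):

    if (list_attnums[listidx] != NULL &&
        bms_is_subset(list_attnums[listidx], stat->keys))
    {
        /* a clause must never be estimated twice */
        Assert(!bms_is_member(listidx, *estimatedclauses));

        stat_clauses = lappend(stat_clauses, (Node *) lfirst(l));
        *estimatedclauses = bms_add_member(*estimatedclauses, listidx);
    }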

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Using multiple extended statistics for estimates

2019-11-06 Thread Kyotaro Horiguchi
Hello.

At Wed, 6 Nov 2019 20:58:49 +0100, Tomas Vondra  
wrote in 
> >Here is a slightly polished v2 of the patch, the main difference being
> >that computing clause_attnums was moved to a separate function.
> >
> 
> This time with the attachment ;-)

This patch is kind of straightforward: it repeats what the previous
statext_mcv_clauselist_selectivity did as long as the remaining
clauses match any of the MV-MCVs. There is almost no regression in the
cases where zero or just one MV-MCV applies to the given clause list.

It applies cleanly on the current master and seems to work as
expected.


I have some comments.

Could we have a description in the documentation of how multiple
MV-MCVs are used in a query? And don't we need some regression tests?


+/*
+ * statext_mcv_clause_attnums
+ * Recalculate attnums from compatible but not-yet-estimated 
clauses.

It returns attnums collected from multiple clause*s*. Is the name
"clause_attnums" (singular) still OK for that?

The comment reads as if the function checks the compatibility of each
clause, but that work is done on the caller side. I'm not sure such
strictness is required, but it might be better for the comment to
describe what the function actually does.


+ */
+static Bitmapset *
+statext_mcv_clause_attnums(int nclauses, Bitmapset **estimatedclauses,
+  Bitmapset **list_attnums)

The last two parameters have the same type in the notation but
different actual types: one is a pointer to a Bitmapset*, and the
other is an array of Bitmapset*. The code in the function itself
suggests that, but it would be helpful to have a brief explanation of
the parameters in the function comment.

+   /*
+* Recompute attnums in the remaining clauses (we simply use 
the bitmaps
+* computed earlier, so that we don't have to inspect the 
clauses again).
+*/
+   clauses_attnums = 
statext_mcv_clause_attnums(list_length(clauses),

Couldn't we avoid calling this function twice with the same parameters
on the first iteration of the loop?

+   foreach(l, clauses)
{
-   stat_clauses = lappend(stat_clauses, (Node *) 
lfirst(l));
-   *estimatedclauses = bms_add_member(*estimatedclauses, 
listidx);
+   /*
+* If the clause is compatible with the selected 
statistics, mark it
+* as estimated and add it to the list to estimate.
+*/
+   if (list_attnums[listidx] != NULL &&
+   bms_is_subset(list_attnums[listidx], 
stat->keys))
+   {
+   stat_clauses = lappend(stat_clauses, (Node *) 
lfirst(l));
+   *estimatedclauses = 
bms_add_member(*estimatedclauses, listidx);
+   }

The loop runs through all clauses every time. I agree that this is
better than using a copy of the clauses to avoid stepping on already
estimated clauses, but maybe we need an assertion that listidx is
not a member of estimatedclauses, to make sure no clause is
estimated twice.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Using multiple extended statistics for estimates

2019-11-06 Thread Tomas Vondra

On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote:

On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:

Hi,

PostgreSQL 10 introduced extended statistics, allowing us to consider
correlation between columns to improve estimates, and PostgreSQL 12
added support for MCV statistics. But we still had the limitation that
we only allowed using a single extended statistics per relation, i.e.
given a table with two extended stats

 CREATE TABLE t (a int, b int, c int, d int);
 CREATE STATISTICS s1 (mcv) ON a, b FROM t;
 CREATE STATISTICS s2 (mcv) ON c, d FROM t;

and a query

 SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;

we only ever used one of the statistics (and we considered them in a not
particularly well determined order).

This patch addresses this by using as many extended stats as possible,
by adding a loop to statext_mcv_clauselist_selectivity(). In each step
we pick the "best" applicable statistics (in the sense of covering the
most attributes) and factor it into the overall estimate.

All this happens where we'd originally consider applying a single MCV
list, i.e. before even considering the functional dependencies, so
roughly like this:

 while ()
 {
 ... apply another MCV list ...
 }

 ... apply functional dependencies ...


I've considered doing both in the loop, but I think that'd be wrong - the
MCV list is expected to contain more information about individual values
(compared to functional deps, which are column-level).



Here is a slightly polished v2 of the patch, the main difference being
that computing clause_attnums was moved to a separate function.



This time with the attachment ;-)


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
diff --git a/src/backend/statistics/extended_stats.c 
b/src/backend/statistics/extended_stats.c
index 207ee3160e..12093151f6 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1123,6 +1123,33 @@ statext_is_compatible_clause(PlannerInfo *root, Node 
*clause, Index relid,
return true;
 }
 
+/*
+ * statext_mcv_clause_attnums
+ * Recalculate attnums from compatible but not-yet-estimated 
clauses.
+ */
+static Bitmapset *
+statext_mcv_clause_attnums(int nclauses, Bitmapset **estimatedclauses,
+  Bitmapset **list_attnums)
+{
+   int listidx;
+   Bitmapset  *clauses_attnums = NULL;
+
+   for (listidx = 0; listidx < nclauses; listidx++)
+   {
+   /*
+* Skip clauses that have no precalculated attnums, which means 
it is
+* either incompatible or was already used by some other 
statistic.
+*/
+   if (!list_attnums[listidx])
+   continue;
+
+   if (!bms_is_member(listidx, *estimatedclauses))
+   clauses_attnums = bms_add_members(clauses_attnums, 
list_attnums[listidx]);
+   }
+
+   return clauses_attnums;
+}
+
 /*
  * statext_mcv_clauselist_selectivity
  * Estimate clauses using the best multi-column statistics.
@@ -1173,11 +1200,6 @@ statext_is_compatible_clause(PlannerInfo *root, Node 
*clause, Index relid,
  * 'estimatedclauses' is an input/output parameter.  We set bits for the
  * 0-based 'clauses' indexes we estimate for and also skip clause items that
  * already have a bit set.
- *
- * XXX If we were to use multiple statistics, this is where it would happen.
- * We would simply repeat this on a loop on the "remaining" clauses, possibly
- * using the already estimated clauses as conditions (and combining the values
- * using conditional probability formula).
  */
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int 
varRelid,
@@ -1188,14 +1210,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, 
List *clauses, int varReli
Bitmapset  *clauses_attnums = NULL;
Bitmapset **list_attnums;
int listidx;
-   StatisticExtInfo *stat;
-   List   *stat_clauses;
-   Selectivity simple_sel,
-   mcv_sel,
-   mcv_basesel,
-   mcv_totalsel,
-   other_sel,
-   sel;
+   Selectivity sel = 1.0;
 
/* check if there's any stats that might be useful for us. */
if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV))
@@ -1221,80 +1236,113 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, 
List *clauses, int varReli
Node   *clause = (Node *) lfirst(l);
Bitmapset  *attnums = NULL;
 
+   list_attnums[listidx] = NULL;
+
if (!bms_is_member(listidx, *estimatedclauses) &&
statext_is_compatible_clause(root, clause, rel->relid, &attnums))
-  

Re: Using multiple extended statistics for estimates

2019-11-06 Thread Tomas Vondra

On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:

Hi,

PostgreSQL 10 introduced extended statistics, allowing us to consider
correlation between columns to improve estimates, and PostgreSQL 12
added support for MCV statistics. But we still had the limitation that
we only allowed using a single extended statistics per relation, i.e.
given a table with two extended stats

  CREATE TABLE t (a int, b int, c int, d int);
  CREATE STATISTICS s1 (mcv) ON a, b FROM t;
  CREATE STATISTICS s2 (mcv) ON c, d FROM t;

and a query

  SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;

we only ever used one of the statistics (and we considered them in a not
particularly well determined order).

This patch addresses this by using as many extended stats as possible,
by adding a loop to statext_mcv_clauselist_selectivity(). In each step
we pick the "best" applicable statistics (in the sense of covering the
most attributes) and factor it into the overall estimate.

All this happens where we'd originally consider applying a single MCV
list, i.e. before even considering the functional dependencies, so
roughly like this:

  while ()
  {
  ... apply another MCV list ...
  }

  ... apply functional dependencies ...


I've considered doing both in the loop, but I think that'd be wrong - the
MCV list is expected to contain more information about individual values
(compared to functional deps, which are column-level).



Here is a slightly polished v2 of the patch, the main difference being
that computing clause_attnums was moved to a separate function.

This is a fairly simple patch, and it's not entirely new functionality
(applying multiple statistics was part of the very first patch series,
although of course in a very different form). So unless there are
objections, I'd like to get this committed sometime next week.

There's room for improvement, of course, for example when handling
overlapping statistics. Consider a table with columns (a,b,c) and two
extended statistics on (a,b) and (b,c), and a query with one clause per
column

  SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1

In this case the patch does not help, because we apply (a,b) and then we
have just a single clause remaining. What we could do is still apply the
(b,c) statistic, using the already-estimated clause on b as a condition.
So essentially we'd compute

   P(a=1 && b=1) * P(c=1 | b=1)

But that'll require larger changes, and I see it as an evolution of the
current patch.
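To spell out the conditional step (just the arithmetic, not how the code
would necessarily compute it): the conditional probability expands as

   P(a=1 && b=1) * P(c=1 | b=1)
     = P(a=1 && b=1) * P(b=1 && c=1) / P(b=1)

so the (a,b) statistic would supply the first factor, the (b,c) statistic
the numerator of the second, and P(b=1) would come from the per-column
statistics (or from either MCV's marginal).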

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Using multiple extended statistics for estimates

2019-10-28 Thread Tomas Vondra

Hi,

PostgreSQL 10 introduced extended statistics, allowing us to consider
correlation between columns to improve estimates, and PostgreSQL 12
added support for MCV statistics. But we still had the limitation that
we only allowed using a single extended statistics per relation, i.e.
given a table with two extended stats

   CREATE TABLE t (a int, b int, c int, d int);
   CREATE STATISTICS s1 (mcv) ON a, b FROM t;
   CREATE STATISTICS s2 (mcv) ON c, d FROM t;

and a query

   SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;

we only ever used one of the statistics (and we considered them in a not
particularly well determined order).

This patch addresses this by using as many extended stats as possible,
by adding a loop to statext_mcv_clauselist_selectivity(). In each step
we pick the "best" applicable statistics (in the sense of covering the
most attributes) and factor it into the overall estimate.

All this happens where we'd originally consider applying a single MCV
list, i.e. before even considering the functional dependencies, so
roughly like this:

   while ()
   {
   ... apply another MCV list ...
   }

   ... apply functional dependencies ...


I've considered doing both in the loop, but I think that'd be wrong - the
MCV list is expected to contain more information about individual values
(compared to functional deps, which are column-level).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
diff --git a/src/backend/statistics/extended_stats.c 
b/src/backend/statistics/extended_stats.c
index 207ee3160e..f817bc6189 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1173,11 +1173,6 @@ statext_is_compatible_clause(PlannerInfo *root, Node 
*clause, Index relid,
  * 'estimatedclauses' is an input/output parameter.  We set bits for the
  * 0-based 'clauses' indexes we estimate for and also skip clause items that
  * already have a bit set.
- *
- * XXX If we were to use multiple statistics, this is where it would happen.
- * We would simply repeat this on a loop on the "remaining" clauses, possibly
- * using the already estimated clauses as conditions (and combining the values
- * using conditional probability formula).
  */
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int 
varRelid,
@@ -1188,14 +1183,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, 
List *clauses, int varReli
Bitmapset  *clauses_attnums = NULL;
Bitmapset **list_attnums;
int listidx;
-   StatisticExtInfo *stat;
-   List   *stat_clauses;
-   Selectivity simple_sel,
-   mcv_sel,
-   mcv_basesel,
-   mcv_totalsel,
-   other_sel,
-   sel;
+   Selectivity sel = 1.0;
 
/* check if there's any stats that might be useful for us. */
if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV))
@@ -1237,64 +1225,99 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, 
List *clauses, int varReli
if (bms_membership(clauses_attnums) != BMS_MULTIPLE)
return 1.0;
 
-   /* find the best suited statistics object for these attnums */
-   stat = choose_best_statistics(rel->statlist, clauses_attnums, 
STATS_EXT_MCV);
+   /* apply as many extended statistics as possible */
+   while (true)
+   {
+   StatisticExtInfo *stat;
+   List   *stat_clauses;
+   Selectivity simple_sel,
+   mcv_sel,
+   mcv_basesel,
+   mcv_totalsel,
+   other_sel,
+   stat_sel;
 
-   /* if no matching stats could be found then we've nothing to do */
-   if (!stat)
-   return 1.0;
+   /*
+* Recompute attnums in the remaining clauses (we simply use 
the bitmaps
+* computed earlier, so that we don't have to inspect the 
clauses again).
+*/
+   clauses_attnums = NULL;
 
-   /* Ensure choose_best_statistics produced an expected stats type. */
-   Assert(stat->kind == STATS_EXT_MCV);
+   listidx = 0;
+   foreach(l, clauses)
+   {
+   if (!bms_is_member(listidx, *estimatedclauses))
+   clauses_attnums = 
bms_add_members(clauses_attnums, list_attnums[listidx]);
 
-   /* now filter the clauses to be estimated using the selected MCV */
-   stat_clauses = NIL;
+   listidx++;
+   }
 
-   listidx = 0;
-   foreach(l, clauses)
-   {
-   /*
-* If the clause is compatible with