Re: Using multiple extended statistics for estimates
Hi, I've pushed these two patches after some minor improvements, mostly to comments. I ended up not using the extra tests, as it wasn't clear to me they're worth the extra duration. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Using multiple extended statistics for estimates
On 12/9/19 2:00 PM, Tomas Vondra wrote: These look good to me. I added extra tests (not included in this email) to verify the code on more interesting test cases, such as partitioned tables and within joins. Your test cases are pretty trivial, just being selects from a single table. Adding such more complex tests seem like a good idea, maybe you'd like to share them? You can find them attached. I did not include them in my earlier email because they seem a bit unrefined, taking too many lines of code for the amount of coverage they provide. But you can prune them down and add them to the patch if you like. These only test the functional dependencies. If you want to include something like them in your commit, you might create similar tests for the mcv statistics, too. -- Mark Dilger diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out index d42a372197..6c5106a6b9 100644 --- a/src/test/regress/expected/stats_ext.out +++ b/src/test/regress/expected/stats_ext.out @@ -458,6 +458,19 @@ CREATE TABLE functional_dependencies_multi ( c INTEGER, d INTEGER ); +CREATE TABLE partitioned_dependencies_multi ( + a INTEGER, + b INTEGER, + c INTEGER, + d INTEGER, + e INTEGER, + f INTEGER +) PARTITION BY list (a); +CREATE TABLE partitioned_dependencies_multi_0 PARTITION OF partitioned_dependencies_multi FOR VALUES IN (0, 5, 10, 15); +CREATE TABLE partitioned_dependencies_multi_1 PARTITION OF partitioned_dependencies_multi FOR VALUES IN (1, 6, 11, 16); +CREATE TABLE partitioned_dependencies_multi_2 PARTITION OF partitioned_dependencies_multi FOR VALUES IN (2, 7, 12, 17); +CREATE TABLE partitioned_dependencies_multi_3 PARTITION OF partitioned_dependencies_multi FOR VALUES IN (3, 8, 13, 18); +CREATE TABLE partitioned_dependencies_multi_4 PARTITION OF partitioned_dependencies_multi FOR VALUES IN (4, 9, 14, 19); -- INSERT INTO functional_dependencies_multi (a, b, c, d) SELECT @@ -466,7 +479,17 @@ INSERT INTO functional_dependencies_multi (a, b, c, 
d) mod(i,11), mod(i,11) FROM generate_series(1,5000) s(i); +INSERT INTO partitioned_dependencies_multi (a, b, c, d, e, f) +SELECT + mod(i,13), + mod(i,13), + mod(i,17), + mod(i,17), + mod(i,19), + mod(i,19) +FROM generate_series(1,5000) s(i); ANALYZE functional_dependencies_multi; +ANALYZE partitioned_dependencies_multi; -- estimates without any functional dependencies SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi WHERE a = 0 AND b = 0'); estimated | actual @@ -486,10 +509,63 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi 1 | 64 (1 row) +SELECT * FROM check_estimated_rows('SELECT * FROM partitioned_dependencies_multi WHERE a = 0 AND b = 0'); + estimated | actual +---+ + 128 |384 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM partitioned_dependencies_multi WHERE c = 0 AND d = 0'); + estimated | actual +---+ +18 |294 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM partitioned_dependencies_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0'); + estimated | actual +---+ + 1 | 22 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi NATURAL JOIN partitioned_dependencies_multi WHERE a = 0 AND b = 0'); + estimated | actual +---+ +45 | 16098 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi NATURAL JOIN partitioned_dependencies_multi WHERE c = 0 AND d = 0'); + estimated | actual +---+ + 4 | 10248 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi NATURAL JOIN partitioned_dependencies_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0'); + estimated | actual +---+ + 1 | 1408 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies_multi NATURAL JOIN partitioned_dependencies_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0 AND e = 0 AND f = 0'); + estimated | actual +---+ + 1 | 64 +(1 row) + -- create separate functional dependencies 
CREATE STATISTICS functional_dependencies_multi_1 (dependencies) ON a, b FROM functional_dependencies_multi; CREATE STATISTICS functional_dependencies_multi_2 (dependencies) ON c, d FROM functional_dependencies_multi; +CREATE STATISTICS partitioned_dependencies_multi_0 (dependencies) ON a, b FROM partitioned_dependencies_multi_0; +CREATE STATISTICS partitioned_dependencies_multi_1 (dependencies) ON a, b FROM partitioned_dependencies_multi_1; +CREATE STATISTICS partitioned_dependencies_multi_2 (dependencies) ON c, d FROM partitioned_dependencies_multi_2; +CREATE STATISTICS
Re: Using multiple extended statistics for estimates
On Mon, Dec 09, 2019 at 11:56:39AM -0800, Mark Dilger wrote: On 12/5/19 9:51 AM, Tomas Vondra wrote: On Thu, Dec 05, 2019 at 06:15:54PM +0100, Tomas Vondra wrote: On Sun, Dec 01, 2019 at 08:08:58PM +0100, Tomas Vondra wrote: On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote: Are you planning to submit a revised patch for this? Yes, I'll submit a rebased version of this patch shortly. It got broken because of the recent fix in choose_best_statistics, shouldn't take long to update the patch. I do have a couple more related patches in the queue, so I want to submit them all at once. OK, here we go - these two patches allow applying multiple extended statistics, both for MCV and functional dependencies. Functional dependencies are simply merged and then applied at once (so without choose_best_statistics), while MCV statistics are considered in a greedy manner by calling choose_best_statistics in a loop. OK, this time with the patches actually attached ;-) These look good to me. I added extra tests (not included in this email) to verify the code on more interesting test cases, such as partitioned tables and with joins. Your test cases are pretty trivial, just being selects from a single table. Adding such more complex tests seems like a good idea, maybe you'd like to share them? I'll go mark this "ready for committer". Thanks for the review. I'll hold off on the commit until the next CF, though, just to give others a proper opportunity to look at it. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Using multiple extended statistics for estimates
On 12/5/19 9:51 AM, Tomas Vondra wrote: On Thu, Dec 05, 2019 at 06:15:54PM +0100, Tomas Vondra wrote: On Sun, Dec 01, 2019 at 08:08:58PM +0100, Tomas Vondra wrote: On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote: Are you planning to submit a revised patch for this? Yes, I'll submit a rebased version of this patch shortly. It got broken because of the recent fix in choose_best_statistics, shouldn't take long to update the patch. I do have a couple more related patches in the queue, so I want to submit them all at once. OK, here we go - these two patches allow applying multiple extended statistics, both for MCV and functional dependencies. Functional dependencies are simply merged and then applied at once (so without choose_best_statistics), while MCV statistics are considered in a greedy manner by calling choose_best_statistics in a loop. OK, this time with the patches actually attached ;-) These look good to me. I added extra tests (not included in this email) to verify the code on more interesting test cases, such as partitioned tables and with joins. Your test cases are pretty trivial, just being selects from a single table. I'll go mark this "ready for committer". -- Mark Dilger
Re: Using multiple extended statistics for estimates
On Thu, Dec 05, 2019 at 06:15:54PM +0100, Tomas Vondra wrote: On Sun, Dec 01, 2019 at 08:08:58PM +0100, Tomas Vondra wrote: On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote: Are you planning to submit a revised patch for this? Yes, I'll submit a rebased version of this patch shortly. It got broken because of the recent fix in choose_best_statistics, shouldn't take long to update the patch. I do have a couple more related patches in the queue, so I want to submit them all at once. OK, here we go - these two patches allow applying multiple extended statistics, both for MCV and functional dependencies. Functional dependencies are simply merged and then applied at once (so without choose_best_statistics), while MCV statistics are considered in a greedy manner by calling choose_best_statistics in a loop. OK, this time with the patches actually attached ;-) regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 8683f00a54aec7bd3873b63e6844642988b67573 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Tue, 12 Nov 2019 22:57:06 +0100 Subject: [PATCH 1/9] Apply multiple multivariate MCV lists when possible Until now we've only used a single multivariate MCV list per relation, covering the largest number of clauses. So for example given a query SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1 and extended statistics on (a,b) and (c,d), we'd pick and use only one of them. This commit relaxes this by repeating the process, using the best statistics (matching the largest number of remaining clauses) in each step. This greedy algorithm is very simple, but may not be optimal. There may be a different choice of stats leaving fewer clauses unestimated and/or giving better estimates for some other reason. This can however happen only when there are overlapping statistics, and selecting one makes it impossible to use the other. E.g. 
with statistics on (a,b), (c,d), (b,c,d), we may pick either (a,b) and (c,d) or (b,c,d). But it's not clear which option is better, though, as each one ignores information about possible correlation between different columns. We however assume cases like this are rare, and the easiest solution is to define statistics covering the whole group of correlated columns for a given query. In the future we might support overlapping stats, using some of the clauses as conditions (in conditional probability sense). Author: Tomas Vondra Reviewed-by: Mark Dilger Discussion: https://postgr.es/m/20191028152048.jc6pqv5hb7j77ocp@development --- src/backend/statistics/extended_stats.c | 129 +--- src/test/regress/expected/stats_ext.out | 58 +++ src/test/regress/sql/stats_ext.sql | 36 +++ 3 files changed, 162 insertions(+), 61 deletions(-) diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c index 9d339433f6..4ec0148fcc 100644 --- a/src/backend/statistics/extended_stats.c +++ b/src/backend/statistics/extended_stats.c @@ -1194,11 +1194,6 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid, * 'estimatedclauses' is an input/output parameter. We set bits for the * 0-based 'clauses' indexes we estimate for and also skip clause items that * already have a bit set. - * - * XXX If we were to use multiple statistics, this is where it would happen. - * We would simply repeat this on a loop on the "remaining" clauses, possibly - * using the already estimated clauses as conditions (and combining the values - * using conditional probability formula). 
*/ static Selectivity statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid, @@ -1208,14 +1203,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli ListCell *l; Bitmapset **list_attnums; int listidx; - StatisticExtInfo *stat; - List *stat_clauses; - Selectivity simple_sel, - mcv_sel, - mcv_basesel, - mcv_totalsel, - other_sel, - sel; + Selectivity sel = 1.0; /* check if there's any stats that might be useful for us. */ if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV)) @@ -1250,65 +1238,84 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli listidx++; } - /* find the best suited statistics object for these attnums */ - stat = choose_best_statistics(rel->statlist, STATS_EXT_MCV, - list_attnums, list_length(clauses)); - - /* if no matching stats could be found then we've nothing to do */ - if (!stat) - return 1.0; + /* apply as many extended statistics as
Re: Using multiple extended statistics for estimates
On Sun, Dec 01, 2019 at 08:08:58PM +0100, Tomas Vondra wrote: On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote: Are you planning to submit a revised patch for this? Yes, I'll submit a rebased version of this patch shortly. It got broken because of the recent fix in choose_best_statistics, shouldn't take long to update the patch. I do have a couple more related patches in the queue, so I want to submit them all at once. OK, here we go - these two patches allow applying multiple extended statistics, both for MCV and functional dependencies. Functional dependencies are simply merged and then applied at once (so without choose_best_statistics), while MCV statistics are considered in a greedy manner by calling choose_best_statistics in a loop. I do have some additional enhancements in the queue, but those are not fully baked yet, so I'll post them later in separate patches. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Using multiple extended statistics for estimates
On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote: Are you planning to submit a revised patch for this? Yes, I'll submit a rebased version of this patch shortly. It got broken because of the recent fix in choose_best_statistics, shouldn't take long to update the patch. I do have a couple more related patches in the queue, so I want to submit them all at once. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Using multiple extended statistics for estimates
On 11/14/19 12:04 PM, Tomas Vondra wrote: On Thu, Nov 14, 2019 at 10:23:44AM -0800, Mark Dilger wrote: On 11/14/19 7:55 AM, Tomas Vondra wrote: On Wed, Nov 13, 2019 at 10:04:36AM -0800, Mark Dilger wrote: On 11/13/19 7:28 AM, Tomas Vondra wrote: Hi, here's an updated patch, with some minor tweaks based on the review and added tests (I ended up reworking those a bit, to make them more like the existing ones). Thanks, Tomas, for the new patch set! Attached are my review comments so far, in the form of a patch applied on top of yours. Thanks. 1) It's not clear to me why adding 'const' to the List parameters would be useful? Can you explain? When I first started reviewing the functions, I didn't know if those lists were intended to be modified by the function. Adding 'const' helps document that the function does not intend to change them. Hmmm, ok. I'll think about it, but we're not really using const* in this way very much I think - at least not in the surrounding code. 2) I think you're right we can change find_strongest_dependency to do /* also skip weaker dependencies when attribute count matches */ if (strongest->nattributes == dependency->nattributes && strongest->degree >= dependency->degree) continue; That'll skip some additional dependencies, which seems OK. 3) It's not clear to me what you mean by * TODO: Improve this code comment. Specifically, why would we * ignore that no rows will match? It seems that such a discovery * would allow us to return an estimate of 0 rows, and that would * be useful. added to dependencies_clauselist_selectivity. Are you saying we should also compute selectivity estimates for individual clauses and use Min() as a limit? Maybe, but that seems unrelated to the patch. I mean that the comment right above that TODO is hard to understand. You seem to be saying that it is good and proper to only take the selectivity estimate from the final clause in the list, but then go on to say that other clauses might prove that no rows will match. 
So that implies that by ignoring all but the last clause, we're ignoring such other clauses that prove no rows can match. But why would we be ignoring those? I am not arguing that your code is wrong. I'm just critiquing the hard-to-understand phrasing of that code comment. Aha, I think I understand now - thanks for the explanation. You're right the comment is trying to explain why just taking the last clause for a given attnum is fine. I'll try to make the comment clearer. Are you planning to submit a revised patch for this? -- Mark Dilger
Re: Using multiple extended statistics for estimates
On Thu, Nov 14, 2019 at 01:17:02PM -0800, Mark Dilger wrote: On 11/14/19 12:04 PM, Tomas Vondra wrote: Aha, I think I understand now - thanks for the explanation. You're right the comment is trying to explain why just taking the last clause for a given attnum is fine. I'll try to make the comment clearer. For the case with equal Const values that should be mostly obvious, i.e. "a=1 AND a=1 AND a=1" has the same selectivity as "a=1". The case with different Const values is harder, unfortunately. It might seem obvious that "a=1 AND a=2" means there are no matching rows, but that heavily relies on the semantics of the equality operator. And we can't simply compare the Const values either, I'm afraid, because there are cases with cross-type operators like a = 1::int AND a = 1.0::numeric where the Consts are of different type, yet both conditions can be true. So it would be pretty tricky to do this, and the current code does not even try to do that. Instead, it just assumes that it's mostly fine to overestimate, because then at runtime we'll simply end up with 0 rows here. I'm unsure whether that could be a performance problem at runtime. I could imagine the planner short-circuiting additional planning when it finds a plan with zero rows, and so we'd save planner time if we avoid overestimating. I don't recall if the planner does anything like that, or if there are plans to implement such logic, but it might be good not to rule it out. Tom's suggestion elsewhere in this thread to use code in predtest.c sounds good to me. No, AFAIK the planner does not do anything like that - it might probably do that if it could prove there are no such rows, but that's hardly the case for estimates based on approximate information (i.e. statistics). It could do that based on the predicate analysis in predtest.c mentioned by Tom, although I don't think it does anything beyond tweaking the row estimate to ~1 row. 
I don't know if you want to expand the scope of this particular patch to include that, though. Certainly not. It's an interesting but surprisingly complicated problem, and this patch simply aims to add different improvement. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Using multiple extended statistics for estimates
On Thu, Nov 14, 2019 at 03:16:04PM -0500, Tom Lane wrote: Tomas Vondra writes: For the case with equal Const values that should be mostly obvious, i.e. "a=1 AND a=1 AND a=1" has the same selectivity as "a=1". The case with different Const values is harder, unfortunately. It might seem obvious that "a=1 AND a=2" means there are no matching rows, but that heavily relies on the semantics of the equality operator. And we can't simply compare the Const values either, I'm afraid, because there are cases with cross-type operators like a = 1::int AND a = 1.0::numeric where the Consts are of different type, yet both conditions can be true. FWIW, there's code in predtest.c to handle exactly that, at least for types sharing a btree opfamily. Whether it's worth applying that logic here is unclear, but note that we've had the ability to recognize redundant and contradictory clauses for a long time: regression=# explain select * from tenk1 where two = 1; QUERY PLAN Seq Scan on tenk1 (cost=0.00..470.00 rows=5000 width=244) Filter: (two = 1) (2 rows) regression=# explain select * from tenk1 where two = 1 and two = 1::bigint; QUERY PLAN Seq Scan on tenk1 (cost=0.00..470.00 rows=5000 width=244) Filter: (two = 1) (2 rows) regression=# explain select * from tenk1 where two = 1 and two = 2::bigint; QUERY PLAN --- Result (cost=0.00..470.00 rows=1 width=244) One-Time Filter: false -> Seq Scan on tenk1 (cost=0.00..470.00 rows=1 width=244) Filter: (two = 1) (4 rows) It falls down on regression=# explain select * from tenk1 where two = 1 and two = 2::numeric; QUERY PLAN --- Seq Scan on tenk1 (cost=0.00..520.00 rows=25 width=244) Filter: ((two = 1) AND ((two)::numeric = '2'::numeric)) (2 rows) because numeric isn't in the same opfamily, so these clauses can't be compared easily. regards, tom lane Yeah, and this logic still works - the redundant clauses won't even get to the selectivity estimation, I think. 
So maybe the comment is not quite necessary, because the problem does not even exist ... Maybe we could do something about the cases that predtest.c can't solve, but it's not clear if we can be much smarter for types with different opfamilies. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Using multiple extended statistics for estimates
On 11/14/19 12:04 PM, Tomas Vondra wrote: Aha, I think I understand now - thanks for the explanation. You're right the comment is trying to explain why just taking the last clause for a given attnum is fine. I'll try to make the comment clearer. For the case with equal Const values that should be mostly obvious, i.e. "a=1 AND a=1 AND a=1" has the same selectivity as "a=1". The case with different Const values is harder, unfortunately. It might seem obvious that "a=1 AND a=2" means there are no matching rows, but that heavily relies on the semantics of the equality operator. And we can't simply compare the Const values either, I'm afraid, because there are cases with cross-type operators like a = 1::int AND a = 1.0::numeric where the Consts are of different type, yet both conditions can be true. So it would be pretty tricky to do this, and the current code does not even try to do that. Instead, it just assumes that it's mostly fine to overestimate, because then at runtime we'll simply end up with 0 rows here. I'm unsure whether that could be a performance problem at runtime. I could imagine the planner short-circuiting additional planning when it finds a plan with zero rows, and so we'd save planner time if we avoid overestimating. I don't recall if the planner does anything like that, or if there are plans to implement such logic, but it might be good not to rule it out. Tom's suggestion elsewhere in this thread to use code in predtest.c sounds good to me. I don't know if you want to expand the scope of this particular patch to include that, though. -- Mark Dilger
Re: Using multiple extended statistics for estimates
Tomas Vondra writes: > For the case with equal Const values that should be mostly obvious, i.e. > "a=1 AND a=1 AND a=1" has the same selectivity as "a=1". > The case with different Const values is harder, unfortunately. It might > seem obvious that "a=1 AND a=2" means there are no matching rows, but > that heavily relies on the semantics of the equality operator. And we > can't simply compare the Const values either, I'm afraid, because there > are cases with cross-type operators like > a = 1::int AND a = 1.0::numeric > where the Consts are of different type, yet both conditions can be true. FWIW, there's code in predtest.c to handle exactly that, at least for types sharing a btree opfamily. Whether it's worth applying that logic here is unclear, but note that we've had the ability to recognize redundant and contradictory clauses for a long time: regression=# explain select * from tenk1 where two = 1; QUERY PLAN Seq Scan on tenk1 (cost=0.00..470.00 rows=5000 width=244) Filter: (two = 1) (2 rows) regression=# explain select * from tenk1 where two = 1 and two = 1::bigint; QUERY PLAN Seq Scan on tenk1 (cost=0.00..470.00 rows=5000 width=244) Filter: (two = 1) (2 rows) regression=# explain select * from tenk1 where two = 1 and two = 2::bigint; QUERY PLAN --- Result (cost=0.00..470.00 rows=1 width=244) One-Time Filter: false -> Seq Scan on tenk1 (cost=0.00..470.00 rows=1 width=244) Filter: (two = 1) (4 rows) It falls down on regression=# explain select * from tenk1 where two = 1 and two = 2::numeric; QUERY PLAN --- Seq Scan on tenk1 (cost=0.00..520.00 rows=25 width=244) Filter: ((two = 1) AND ((two)::numeric = '2'::numeric)) (2 rows) because numeric isn't in the same opfamily, so these clauses can't be compared easily. regards, tom lane
Re: Using multiple extended statistics for estimates
On Thu, Nov 14, 2019 at 10:23:44AM -0800, Mark Dilger wrote: On 11/14/19 7:55 AM, Tomas Vondra wrote: On Wed, Nov 13, 2019 at 10:04:36AM -0800, Mark Dilger wrote: On 11/13/19 7:28 AM, Tomas Vondra wrote: Hi, here's an updated patch, with some minor tweaks based on the review and added tests (I ended up reworking those a bit, to make them more like the existing ones). Thanks, Tomas, for the new patch set! Attached are my review comments so far, in the form of a patch applied on top of yours. Thanks. 1) It's not clear to me why adding 'const' to the List parameters would be useful? Can you explain? When I first started reviewing the functions, I didn't know if those lists were intended to be modified by the function. Adding 'const' helps document that the function does not intend to change them. Hmmm, ok. I'll think about it, but we're not really using const* in this way very much I think - at least not in the surrounding code. 2) I think you're right we can change find_strongest_dependency to do /* also skip weaker dependencies when attribute count matches */ if (strongest->nattributes == dependency->nattributes && strongest->degree >= dependency->degree) continue; That'll skip some additional dependencies, which seems OK. 3) It's not clear to me what you mean by * TODO: Improve this code comment. Specifically, why would we * ignore that no rows will match? It seems that such a discovery * would allow us to return an estimate of 0 rows, and that would * be useful. added to dependencies_clauselist_selectivity. Are you saying we should also compute selectivity estimates for individual clauses and use Min() as a limit? Maybe, but that seems unrelated to the patch. I mean that the comment right above that TODO is hard to understand. You seem to be saying that it is good and proper to only take the selectivity estimate from the final clause in the list, but then go on to say that other clauses might prove that no rows will match. 
So that implies that by ignoring all but the last clause, we're ignoring such other clauses that prove no rows can match. But why would we be ignoring those? I am not arguing that your code is wrong. I'm just critiquing the hard-to-understand phrasing of that code comment. Aha, I think I understand now - thanks for the explanation. You're right the comment is trying to explain why just taking the last clause for a given attnum is fine. I'll try to make the comment clearer. For the case with equal Const values that should be mostly obvious, i.e. "a=1 AND a=1 AND a=1" has the same selectivity as "a=1". The case with different Const values is harder, unfortunately. It might seem obvious that "a=1 AND a=2" means there are no matching rows, but that heavily relies on the semantics of the equality operator. And we can't simply compare the Const values either, I'm afraid, because there are cases with cross-type operators like a = 1::int AND a = 1.0::numeric where the Consts are of different type, yet both conditions can be true. So it would be pretty tricky to do this, and the current code does not even try to do that. Instead, it just assumes that it's mostly fine to overestimate, because then at runtime we'll simply end up with 0 rows here. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Using multiple extended statistics for estimates
On 11/14/19 7:55 AM, Tomas Vondra wrote: On Wed, Nov 13, 2019 at 10:04:36AM -0800, Mark Dilger wrote: On 11/13/19 7:28 AM, Tomas Vondra wrote: Hi, here's an updated patch, with some minor tweaks based on the review and added tests (I ended up reworking those a bit, to make them more like the existing ones). Thanks, Tomas, for the new patch set! Attached are my review comments so far, in the form of a patch applied on top of yours. Thanks. 1) It's not clear to me why adding 'const' to the List parameters would be useful? Can you explain? When I first started reviewing the functions, I didn't know if those lists were intended to be modified by the function. Adding 'const' helps document that the function does not intend to change them. 2) I think you're right we can change find_strongest_dependency to do /* also skip weaker dependencies when attribute count matches */ if (strongest->nattributes == dependency->nattributes && strongest->degree >= dependency->degree) continue; That'll skip some additional dependencies, which seems OK. 3) It's not clear to me what you mean by * TODO: Improve this code comment. Specifically, why would we * ignore that no rows will match? It seems that such a discovery * would allow us to return an estimate of 0 rows, and that would * be useful. added to dependencies_clauselist_selectivity. Are you saying we should also compute selectivity estimates for individual clauses and use Min() as a limit? Maybe, but that seems unrelated to the patch. I mean that the comment right above that TODO is hard to understand. You seem to be saying that it is good and proper to only take the selectivity estimate from the final clause in the list, but then go on to say that other clauses might prove that no rows will match. So that implies that by ignoring all but the last clause, we're ignoring such other clauses that prove no rows can match. But why would we be ignoring those? I am not arguing that your code is wrong. 
I'm just critiquing the hard-to-understand phrasing of that code comment. -- Mark Dilger
Re: Using multiple extended statistics for estimates
On Wed, Nov 13, 2019 at 10:04:36AM -0800, Mark Dilger wrote: On 11/13/19 7:28 AM, Tomas Vondra wrote: Hi, here's an updated patch, with some minor tweaks based on the review and added tests (I ended up reworking those a bit, to make them more like the existing ones). Thanks, Tomas, for the new patch set! Attached are my review comments so far, in the form of a patch applied on top of yours. Thanks. 1) It's not clear to me why adding 'const' to the List parameters would be useful? Can you explain? 2) I think you're right we can change find_strongest_dependency to do /* also skip weaker dependencies when attribute count matches */ if (strongest->nattributes == dependency->nattributes && strongest->degree >= dependency->degree) continue; That'll skip some additional dependencies, which seems OK. 3) It's not clear to me what you mean by * TODO: Improve this code comment. Specifically, why would we * ignore that no rows will match? It seems that such a discovery * would allow us to return an estimate of 0 rows, and that would * be useful. added to dependencies_clauselist_selectivity. Are you saying we should also compute selectivity estimates for individual clauses and use Min() as a limit? Maybe, but that seems unrelated to the patch. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Using multiple extended statistics for estimates
On 11/13/19 7:28 AM, Tomas Vondra wrote: Hi, here's an updated patch, with some minor tweaks based on the review and added tests (I ended up reworking those a bit, to make them more like the existing ones). Thanks, Tomas, for the new patch set! Attached are my review comments so far, in the form of a patch applied on top of yours. -- Mark Dilger diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c index a6ca11c675..ccf6e41b9c 100644 --- a/src/backend/statistics/dependencies.c +++ b/src/backend/statistics/dependencies.c @@ -850,7 +850,8 @@ dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum) * find the strongest dependency on the attributes * * When applying functional dependencies, we start with the strongest - * dependencies. That is, we select the dependency that: + * dependencies. That is, we select the best dependency in the following + * descending order of preference: * * (a) has all attributes covered by equality clauses * @@ -860,6 +861,9 @@ dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum) * * This guarantees that we eliminate the most redundant conditions first * (see the comment in dependencies_clauselist_selectivity). + * + * TODO: Rename 'dependencies', as that name is what you'd expect for an + * argument of type (MVDependencies *), not (MVDependencies **) */ static MVDependency * find_strongest_dependency(MVDependencies **dependencies, int ndependencies, @@ -897,11 +901,14 @@ find_strongest_dependency(MVDependencies **dependencies, int ndependencies, /* also skip weaker dependencies when attribute count matches */ if (strongest->nattributes == dependency->nattributes && - strongest->degree > dependency->degree) + strongest->degree > dependency->degree) /* TODO: Why not ">=" here? */ continue; } /* + * TODO: As coded, this dependency is "at least as strong as", not + * necessarily "stronger". 
+ * * this dependency is stronger, but we must still check that it's * fully matched to these attnums. We perform this check last as it's * slightly more expensive than the previous checks. @@ -942,7 +949,7 @@ find_strongest_dependency(MVDependencies **dependencies, int ndependencies, */ Selectivity dependencies_clauselist_selectivity(PlannerInfo *root, - List *clauses, + const List *clauses, int varRelid, JoinType jointype, SpecialJoinInfo *sjinfo, @@ -1084,6 +1091,11 @@ dependencies_clauselist_selectivity(PlannerInfo *root, * to a different Const in another clause then no rows will match * anyway. If it happens to be compared to the same Const, then * ignoring the additional clause is just the thing to do. + * + * TODO: Improve this code comment. Specifically, why would we + * ignore that no rows will match? It seems that such a discovery + * would allow us to return an estimate of 0 rows, and that would + * be useful. */ if (dependency_implies_attribute(dependency, list_attnums[listidx])) diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c index b55799b8b1..a1798a6b91 100644 --- a/src/backend/statistics/extended_stats.c +++ b/src/backend/statistics/extended_stats.c @@ -1202,7 +1202,7 @@ statext_mcv_remaining_attnums(int nclauses, Bitmapset **estimatedclauses, * already have a bit set. */ static Selectivity -statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid, +statext_mcv_clauselist_selectivity(PlannerInfo *root, const List *clauses, int varRelid, JoinType jointype, SpecialJoinInfo *sjinfo, RelOptInfo *rel, Bitmapset **estimatedclauses) { @@ -1340,7 +1340,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli * Estimate clauses using the best multi-column statistics. 
*/ Selectivity -statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid, +statext_clauselist_selectivity(PlannerInfo *root, const List *clauses, int varRelid, JoinType jointype, SpecialJoinInfo *sjinfo, RelOptInfo *rel, Bitmapset **estimatedclauses) { diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h index 588b6738b2..ebac910d72 100644 --- a/src/include/statistics/statistics.h +++ b/src/include/statistics/statistics.h @@ -104,14 +104,14 @@ extern int ComputeExtStatisticsRows(Relation onerel, int natts, VacAttrStats **stats); extern bool statext_is_kind_built(HeapTuple htup, char kind); extern Selectivity dependencies_clauselist_selectivity(PlannerInfo *root, - List *clauses, + const List *clauses, int varRelid, JoinType jointype, SpecialJoinInfo *sjinfo, RelOptInfo *rel, Bitmapset **estimatedclauses);
Re: Using multiple extended statistics for estimates
Hi, here's an updated patch, with some minor tweaks based on the review and added tests (I ended up reworking those a bit, to make them more like the existing ones). There's also a new piece, dealing with functional dependencies. Until now we did the same thing as for MCV lists - we picked the "best" extended statistics (with functional dependencies built) and just used that. At first I thought we might simply do the same loop as for MCV lists, but that does not really make sense because we might end up applying a "weaker" dependency first. Say for example we have a table with columns (a,b,c,d,e) and functional dependencies on (a,b,c,d) and (c,d,e) where all the dependencies on (a,b,c,d) are weaker than (c,d => e). In a query with clauses on all attributes this is guaranteed to apply all dependencies from the first statistic first, which is clearly wrong. So what this does instead is simply merge all the dependencies from all the relevant stats and treat them as a single collection. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From 0e09c4749f3712da1983374a2838e4b7e14b7c62 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Tue, 12 Nov 2019 22:57:06 +0100 Subject: [PATCH 01/10] Apply multiple multivariate MCV lists when possible Until now we've only used a single multivariate MCV list per relation, covering the largest number of clauses. So for example given a query SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1 and extended statistics on (a,b) and (c,d), we'd pick and use only one of them. This commit relaxes this by repeating the process, using the best statistics (matching the largest number of remaining clauses) in each step. This greedy algorithm is very simple, but may not be optimal. There may be a different choice of stats leaving fewer clauses unestimated and/or giving better estimates for some other reason.
This can however happen only when there are overlapping statistics, and selecting one makes it impossible to use the other. E.g. with statistics on (a,b), (c,d), (b,c,d), we may pick either (a,b) and (c,d) or (b,c,d). It's not clear which option is better, though, as each one ignores information about possible correlation between different columns. We however assume cases like this are rare, and the easiest solution is to define statistics covering the whole group of correlated columns for a given query. In the future we might support overlapping stats, using some of the clauses as conditions (in the conditional probability sense). Author: Tomas Vondra Reviewed-by: Mark Dilger Discussion: https://postgr.es/m/20191028152048.jc6pqv5hb7j77ocp@development --- src/backend/statistics/extended_stats.c | 166 ++-- src/test/regress/expected/stats_ext.out | 58 + src/test/regress/sql/stats_ext.sql | 36 + 3 files changed, 195 insertions(+), 65 deletions(-) diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c index 207ee3160e..b55799b8b1 100644 --- a/src/backend/statistics/extended_stats.c +++ b/src/backend/statistics/extended_stats.c @@ -1123,6 +1123,33 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid, return true; } +/* + * statext_mcv_remaining_attnums + * Recalculate attnums from compatible but not-yet-estimated clauses. + */ +static Bitmapset * +statext_mcv_remaining_attnums(int nclauses, Bitmapset **estimatedclauses, + Bitmapset **list_attnums) +{ + int listidx; + Bitmapset *attnums = NULL; + + for (listidx = 0; listidx < nclauses; listidx++) + { + /* +* Skip clauses that have no precalculated attnums, which means they are +* either incompatible or were already used by some other statistic.
+*/ + if (!list_attnums[listidx]) + continue; + + if (!bms_is_member(listidx, *estimatedclauses)) + attnums = bms_add_members(attnums, list_attnums[listidx]); + } + + return attnums; +} + /* * statext_mcv_clauselist_selectivity * Estimate clauses using the best multi-column statistics. @@ -1173,11 +1200,6 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid, * 'estimatedclauses' is an input/output parameter. We set bits for the * 0-based 'clauses' indexes we estimate for and also skip clause items that * already have a bit set. - * - * XXX If we were to use multiple statistics, this is where it would happen. - * We would simply repeat this on a loop on the "remaining" clauses, possibly - * using the already estimated clauses as conditions (and combining the values - * using conditional probability formula). */ static Selectivity
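The per-statistic vs. merged behaviour described in the message above can be sketched in a few lines of Python. This is an illustration only, not the patch's C code: the attribute sets, degrees, and helper name `find_strongest` below are all hypothetical, and the preference order mirrors the one documented for find_strongest_dependency (more attributes first, then higher degree).

```python
# Sketch: pick the strongest functional dependency from a merged
# collection, preferring more attributes, then higher degree.
def find_strongest(dependencies):
    """Prefer more attributes; break ties by degree."""
    return max(dependencies, key=lambda d: (d["nattributes"], d["degree"]))

# Dependencies from a statistics object on (a,b,c,d), all with low degree:
stat_abcd = [
    {"name": "a,b => c",   "nattributes": 3, "degree": 0.3},
    {"name": "a,b,c => d", "nattributes": 4, "degree": 0.2},
]
# A much stronger dependency from a statistics object on (c,d,e):
stat_cde = [
    {"name": "c,d => e", "nattributes": 3, "degree": 0.9},
]

# Looping per statistics object would apply everything from stat_abcd
# before c,d => e. Merging lets the strongest remaining dependency win
# regardless of which statistics object it came from:
merged = stat_abcd + stat_cde
applied = []
while merged:
    best = find_strongest(merged)
    applied.append(best["name"])
    merged.remove(best)

assert applied == ["a,b,c => d", "c,d => e", "a,b => c"]
```

Note how (c,d => e) is applied before the weaker same-size dependency (a,b => c), which is exactly what a per-statistic loop could not guarantee.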
Re: Using multiple extended statistics for estimates
On Sat, Nov 09, 2019 at 02:32:27PM -0800, Mark Dilger wrote: On 11/9/19 12:33 PM, Mark Dilger wrote: On 11/6/19 11:58 AM, Tomas Vondra wrote: On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote: On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote: Hi, PostgreSQL 10 introduced extended statistics, allowing us to consider correlation between columns to improve estimates, and PostgreSQL 12 added support for MCV statistics. But we still had the limitation that we only allowed using a single extended statistics object per relation, i.e. given a table with two extended stats CREATE TABLE t (a int, b int, c int, d int); CREATE STATISTICS s1 (mcv) ON a, b FROM t; CREATE STATISTICS s2 (mcv) ON c, d FROM t; and a query SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1; we only ever used one of the statistics (and we considered them in a not particularly well determined order). This patch addresses this by using as many extended stats as possible, by adding a loop to statext_mcv_clauselist_selectivity(). In each step we pick the "best" applicable statistics (in the sense of covering the most attributes) and factor it into the overall estimate. Tomas, Your patch compiles and passes the regression tests for me on Debian Linux under master. Since your patch does not include modified regression tests, I wrote a test that I expected to improve under this new code, but running it both before and after applying your patch, there is no change. Ok, the attached test passes before applying your patch and fails afterward owing to the estimates improving and no longer matching the expected output. To be clear, this confirms your patch is working as expected. I haven't seen any crashes in several hours of running different tests, so I think it looks good. Yep, thanks for adding the tests. I'll include them in the patch. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Using multiple extended statistics for estimates
On Sat, Nov 09, 2019 at 12:33:05PM -0800, Mark Dilger wrote: On 11/6/19 11:58 AM, Tomas Vondra wrote: On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote: On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote: Hi, PostgreSQL 10 introduced extended statistics, allowing us to consider correlation between columns to improve estimates, and PostgreSQL 12 added support for MCV statistics. But we still had the limitation that we only allowed using a single extended statistics object per relation, i.e. given a table with two extended stats CREATE TABLE t (a int, b int, c int, d int); CREATE STATISTICS s1 (mcv) ON a, b FROM t; CREATE STATISTICS s2 (mcv) ON c, d FROM t; and a query SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1; we only ever used one of the statistics (and we considered them in a not particularly well determined order). This patch addresses this by using as many extended stats as possible, by adding a loop to statext_mcv_clauselist_selectivity(). In each step we pick the "best" applicable statistics (in the sense of covering the most attributes) and factor it into the overall estimate. Tomas, Your patch compiles and passes the regression tests for me on Debian Linux under master. Thanks. Since your patch does not include modified regression tests, I wrote a test that I expected to improve under this new code, but running it both before and after applying your patch, there is no change. Please find the modified test attached. Am I wrong to expect some change in this test's output? If so, can you provide a test example that works differently under your patch? Those queries are not improved by the patch, because we only support clauses of the form "Var op Const" for now - your tests are using "Var op Var", so that doesn't work. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Using multiple extended statistics for estimates
On 11/9/19 12:33 PM, Mark Dilger wrote: On 11/6/19 11:58 AM, Tomas Vondra wrote: On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote: On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote: Hi, PostgreSQL 10 introduced extended statistics, allowing us to consider correlation between columns to improve estimates, and PostgreSQL 12 added support for MCV statistics. But we still had the limitation that we only allowed using a single extended statistics object per relation, i.e. given a table with two extended stats CREATE TABLE t (a int, b int, c int, d int); CREATE STATISTICS s1 (mcv) ON a, b FROM t; CREATE STATISTICS s2 (mcv) ON c, d FROM t; and a query SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1; we only ever used one of the statistics (and we considered them in a not particularly well determined order). This patch addresses this by using as many extended stats as possible, by adding a loop to statext_mcv_clauselist_selectivity(). In each step we pick the "best" applicable statistics (in the sense of covering the most attributes) and factor it into the overall estimate. Tomas, Your patch compiles and passes the regression tests for me on Debian Linux under master. Since your patch does not include modified regression tests, I wrote a test that I expected to improve under this new code, but running it both before and after applying your patch, there is no change. Ok, the attached test passes before applying your patch and fails afterward owing to the estimates improving and no longer matching the expected output. To be clear, this confirms your patch is working as expected. I haven't seen any crashes in several hours of running different tests, so I think it looks good. 
-- Mark Dilger diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out index b65228fa07..e89865e3ee 100644 --- a/src/test/regress/expected/stats_ext.out +++ b/src/test/regress/expected/stats_ext.out @@ -353,6 +353,48 @@ SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, d 500 | 50 (1 row) +-- multiple extended statistics +CREATE TABLE multistats ( + a INTEGER, + b INTEGER, + c INTEGER, + d INTEGER +); +-- Insert unique values +INSERT INTO multistats (a, b, c, d) + SELECT i, i*2, i*3, i*4 + FROM generate_series(1,1) s(i); +-- Duplicate one set of values 1 times +INSERT INTO multistats (a, b, c, d) + SELECT 0, 0, 0, 0 + FROM generate_series(1,1) s(i); +-- estimates without mcv statistics +ANALYZE multistats; +SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE a = 0 AND b = 0 AND c = 0 AND d = 0'); + estimated | actual +---+ + 1250 | 1 +(1 row) + +-- create some mcv statistics +CREATE STATISTICS ms_ab (mcv) ON a, b FROM multistats; +ANALYZE multistats; +SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE a = 0 AND b = 0 AND c = 0 AND d = 0'); + estimated | actual +---+ + 2500 | 1 +(1 row) + +-- create some mcv statistics +CREATE STATISTICS ms_cd (mcv) ON c, d FROM multistats; +ANALYZE multistats; +SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE a = 0 AND b = 0 AND c = 0 AND d = 0'); + estimated | actual +---+ + 2500 | 1 +(1 row) + +DROP TABLE multistats; -- functional dependencies tests CREATE TABLE functional_dependencies ( filler1 TEXT, diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql index 040ee97a1e..ba802d7490 100644 --- a/src/test/regress/sql/stats_ext.sql +++ b/src/test/regress/sql/stats_ext.sql @@ -224,6 +224,41 @@ SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY b, c SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, d'); +-- multiple extended statistics 
+CREATE TABLE multistats ( + a INTEGER, + b INTEGER, + c INTEGER, + d INTEGER +); + +-- Insert unique values +INSERT INTO multistats (a, b, c, d) + SELECT i, i*2, i*3, i*4 + FROM generate_series(1,1) s(i); +-- Duplicate one set of values 1 times +INSERT INTO multistats (a, b, c, d) + SELECT 0, 0, 0, 0 + FROM generate_series(1,1) s(i); + +-- estimates without mcv statistics +ANALYZE multistats; +SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE a = 0 AND b = 0 AND c = 0 AND d = 0'); + +-- create some mcv statistics +CREATE STATISTICS ms_ab (mcv) ON a, b FROM multistats; + +ANALYZE multistats; +SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE a = 0 AND b = 0 AND c = 0 AND d = 0'); + +-- create some mcv statistics +CREATE STATISTICS ms_cd (mcv) ON c, d FROM multistats; + +ANALYZE multistats; +SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE a = 0 AND b = 0 AND c = 0 AND d = 0'); + +DROP TABLE multistats; + --
Re: Using multiple extended statistics for estimates
On 11/6/19 11:58 AM, Tomas Vondra wrote: On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote: On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote: Hi, PostgreSQL 10 introduced extended statistics, allowing us to consider correlation between columns to improve estimates, and PostgreSQL 12 added support for MCV statistics. But we still had the limitation that we only allowed using a single extended statistics object per relation, i.e. given a table with two extended stats CREATE TABLE t (a int, b int, c int, d int); CREATE STATISTICS s1 (mcv) ON a, b FROM t; CREATE STATISTICS s2 (mcv) ON c, d FROM t; and a query SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1; we only ever used one of the statistics (and we considered them in a not particularly well determined order). This patch addresses this by using as many extended stats as possible, by adding a loop to statext_mcv_clauselist_selectivity(). In each step we pick the "best" applicable statistics (in the sense of covering the most attributes) and factor it into the overall estimate. Tomas, Your patch compiles and passes the regression tests for me on Debian Linux under master. Since your patch does not include modified regression tests, I wrote a test that I expected to improve under this new code, but running it both before and after applying your patch, there is no change. Please find the modified test attached. Am I wrong to expect some change in this test's output? If so, can you provide a test example that works differently under your patch? Thanks! 
-- Mark Dilger diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out index b65228fa07..8e20df25fc 100644 --- a/src/test/regress/expected/stats_ext.out +++ b/src/test/regress/expected/stats_ext.out @@ -353,6 +353,298 @@ SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, d 500 | 50 (1 row) +-- muliple extended statistics +CREATE TABLE multistats ( + i2 INTEGER, + i3 INTEGER, + i17 INTEGER, + i19 INTEGER, + i34 INTEGER, + i38 INTEGER, + i51 INTEGER, + i57 INTEGER +); +INSERT INTO multistats (i2, i3, i17, i19, i34, i38, i51, i57) + SELECT i % 2, i % 3, i % 17, i % 19, i % 34, i % 38, i % 51, i % 57 + FROM generate_series(1, 50) s(i); +-- estimates without mcv statistics +ANALYZE multistats; +SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY i2, i34'); + estimated | actual +---+ +68 | 34 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY i2, i38'); + estimated | actual +---+ +76 | 38 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY i17, i34'); + estimated | actual +---+ + 578 | 34 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY i19, i38'); + estimated | actual +---+ + 722 | 38 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY i17, i19, i34, i38'); + estimated | actual +---+ + 5 |646 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY i3, i51'); + estimated | actual +---+ + 153 | 51 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY i3, i57'); + estimated | actual +---+ + 171 | 57 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY i17, i51'); + estimated | actual +---+ + 867 | 51 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY i19, i57'); + estimated | actual +---+ + 
1083 | 57 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM multistats GROUP BY i17, i19, i51, i57'); + estimated | actual +---+ + 5 |969 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE i2 = i34'); + estimated | actual +---+ + 2500 | 29411 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE i2 = i38'); + estimated | actual +---+ + 2500 | 26315 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE i17 = i34'); + estimated | actual +---+ + 2500 | 250001 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE i19 = i38'); + estimated | actual +---+ + 2500 | 250001 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE i17 = i34 AND i19 = i38'); + estimated | actual +---+ +12 | 125387 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM multistats WHERE i17 = i34
Re: Using multiple extended statistics for estimates
On Thu, Nov 07, 2019 at 01:38:20PM +0900, Kyotaro Horiguchi wrote: Hello. At Wed, 6 Nov 2019 20:58:49 +0100, Tomas Vondra wrote in >Here is a slightly polished v2 of the patch, the main difference being >that computing clause_attnums was moved to a separate function. > This time with the attachment ;-) This patch is kind of straightforward: it repeats what the previous statext_mcv_clauselist_selectivity did as long as the remaining clauses match any of the MV-MCVs. There is almost no regression in the cases where zero or just one MV-MCV applies to the given clause list. It applies cleanly on the current master and seems to work as expected. I have some comments. Could we have a description in the documentation of how multiple MV-MCVs are used in a query? And don't we need some regression tests? Yes, regression tests are certainly needed - I thought I'd added them, but it seems I failed to include them in the patch. Will fix. I agree it's probably worth mentioning we can consider multiple stats, but I'm a bit hesitant to put the exact rules for how we pick the "best" statistic into the docs. It's not 100% deterministic and it's likely we'll need to tweak it a bit in the future. I'd prefer showing the stats in EXPLAIN, but that's a separate patch. +/* + * statext_mcv_clause_attnums + * Recalculate attnums from compatible but not-yet-estimated clauses. It returns attnums collected from multiple clause*s*. Is the name "clause_attnums" OK? The comment reads as if the function checks the compatibility of each clause, but that work is done on the caller side. I'm not sure such strictness is required, but it might be better if the comment describes exactly what the function does. But the incompatible clauses have the pre-computed attnums set to NULL, so technically the comment is correct. But I'll clarify. 
+ */ +static Bitmapset * +statext_mcv_clause_attnums(int nclauses, Bitmapset **estimatedclauses, + Bitmapset **list_attnums) The last two parameters have the same type in the notation but different actual types; that is, one is a pointer to Bitmapset*, and the other is an array of Bitmapset*. The code in the function itself suggests that, but it would be helpful if a brief explanation of the parameters appeared in the function comment. OK, will explain in a comment. + /* +* Recompute attnums in the remaining clauses (we simply use the bitmaps +* computed earlier, so that we don't have to inspect the clauses again). +*/ + clauses_attnums = statext_mcv_clause_attnums(list_length(clauses), Couldn't we avoid calling this function twice with the same parameters in the first round of the loop? Hmmm, yeah. That's a good point. + foreach(l, clauses) { - stat_clauses = lappend(stat_clauses, (Node *) lfirst(l)); - *estimatedclauses = bms_add_member(*estimatedclauses, listidx); + /* +* If the clause is compatible with the selected statistics, mark it +* as estimated and add it to the list to estimate. +*/ + if (list_attnums[listidx] != NULL && + bms_is_subset(list_attnums[listidx], stat->keys)) + { + stat_clauses = lappend(stat_clauses, (Node *) lfirst(l)); + *estimatedclauses = bms_add_member(*estimatedclauses, listidx); + } The loop runs through all clauses every time. I agree that that is better than using a copy of the clauses to avoid stepping on already-estimated clauses, but maybe we need an assertion that listidx is not part of estimatedclauses, to make sure no clause is estimated twice. Well, we can't really operate on a smaller "copy" of the list anyway, because that would break the precalculation logic (the listidx value would be incorrect for the new list), and tweaking it would be more expensive than just iterating over all clauses. The assumption is that we won't see an extremely large number of clauses here. Adding an assert seems reasonable. 
And maybe a comment why we should not see any already-estimated clauses here. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
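The recomputation being discussed (statext_mcv_clause_attnums in the v2 patch) can be sketched as a simplified Python stand-in, using sets in place of Bitmapsets. This is an illustration of the behaviour described in the review exchange, not the C implementation itself:

```python
def remaining_attnums(clause_attnums, estimated):
    """Collect attnums from compatible, not-yet-estimated clauses.

    clause_attnums[i] is None when clause i is incompatible (its
    precalculated attnums were never set), mirroring the list_attnums[]
    array discussed above; 'estimated' holds indexes of clauses that
    some statistics object already covered.
    """
    attnums = set()
    for listidx, nums in enumerate(clause_attnums):
        if nums is None:          # incompatible clause, nothing precalculated
            continue
        if listidx not in estimated:
            attnums |= nums
    return attnums

# Clause 0 covers attnums {1,2}, clause 1 is incompatible, clause 2 covers {3};
# clause 0 was already estimated, so only {3} remains.
assert remaining_attnums([{1, 2}, None, {3}], {0}) == {3}
```

This also shows why the incompatible-clause handling noted in the thread works: a None (NULL) entry is skipped without re-inspecting the clause.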
Re: Using multiple extended statistics for estimates
Hello. At Wed, 6 Nov 2019 20:58:49 +0100, Tomas Vondra wrote in > >Here is a slightly polished v2 of the patch, the main difference being > >that computing clause_attnums was moved to a separate function. > > > > This time with the attachment ;-) This patch is kind of straightforward: it repeats what the previous statext_mcv_clauselist_selectivity did as long as the remaining clauses match any of the MV-MCVs. There is almost no regression in the cases where zero or just one MV-MCV applies to the given clause list. It applies cleanly on the current master and seems to work as expected. I have some comments. Could we have a description in the documentation of how multiple MV-MCVs are used in a query? And don't we need some regression tests? +/* + * statext_mcv_clause_attnums + * Recalculate attnums from compatible but not-yet-estimated clauses. It returns attnums collected from multiple clause*s*. Is the name "clause_attnums" OK? The comment reads as if the function checks the compatibility of each clause, but that work is done on the caller side. I'm not sure such strictness is required, but it might be better if the comment describes exactly what the function does. + */ +static Bitmapset * +statext_mcv_clause_attnums(int nclauses, Bitmapset **estimatedclauses, + Bitmapset **list_attnums) The last two parameters have the same type in the notation but different actual types; that is, one is a pointer to Bitmapset*, and the other is an array of Bitmapset*. The code in the function itself suggests that, but it would be helpful if a brief explanation of the parameters appeared in the function comment. + /* +* Recompute attnums in the remaining clauses (we simply use the bitmaps +* computed earlier, so that we don't have to inspect the clauses again). +*/ + clauses_attnums = statext_mcv_clause_attnums(list_length(clauses), Couldn't we avoid calling this function twice with the same parameters in the first round of the loop? 
+ foreach(l, clauses) { - stat_clauses = lappend(stat_clauses, (Node *) lfirst(l)); - *estimatedclauses = bms_add_member(*estimatedclauses, listidx); + /* +* If the clause is compatible with the selected statistics, mark it +* as estimated and add it to the list to estimate. +*/ + if (list_attnums[listidx] != NULL && + bms_is_subset(list_attnums[listidx], stat->keys)) + { + stat_clauses = lappend(stat_clauses, (Node *) lfirst(l)); + *estimatedclauses = bms_add_member(*estimatedclauses, listidx); + } The loop runs through all clauses every time. I agree that that is better than using a copy of the clauses to avoid stepping on already-estimated clauses, but maybe we need an assertion that listidx is not part of estimatedclauses, to make sure no clause is estimated twice. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Using multiple extended statistics for estimates
On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote: On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote: Hi, PostgreSQL 10 introduced extended statistics, allowing us to consider correlation between columns to improve estimates, and PostgreSQL 12 added support for MCV statistics. But we still had the limitation that we only allowed using a single extended statistics object per relation, i.e. given a table with two extended stats CREATE TABLE t (a int, b int, c int, d int); CREATE STATISTICS s1 (mcv) ON a, b FROM t; CREATE STATISTICS s2 (mcv) ON c, d FROM t; and a query SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1; we only ever used one of the statistics (and we considered them in a not particularly well determined order). This patch addresses this by using as many extended stats as possible, by adding a loop to statext_mcv_clauselist_selectivity(). In each step we pick the "best" applicable statistics (in the sense of covering the most attributes) and factor it into the overall estimate. All this happens where we'd originally consider applying a single MCV list, i.e. before even considering the functional dependencies, so roughly like this: while () { ... apply another MCV list ... } ... apply functional dependencies ... I considered applying both in the loop, but I think that'd be wrong - the MCV list is expected to contain more information about individual values (compared to functional deps, which are column-level). Here is a slightly polished v2 of the patch, the main difference being that computing clause_attnums was moved to a separate function. 
This time with the attachment ;-) -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c index 207ee3160e..12093151f6 100644 --- a/src/backend/statistics/extended_stats.c +++ b/src/backend/statistics/extended_stats.c @@ -1123,6 +1123,33 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid, return true; } +/* + * statext_mcv_clause_attnums + * Recalculate attnums from compatible but not-yet-estimated clauses. + */ +static Bitmapset * +statext_mcv_clause_attnums(int nclauses, Bitmapset **estimatedclauses, + Bitmapset **list_attnums) +{ + int listidx; + Bitmapset *clauses_attnums = NULL; + + for (listidx = 0; listidx < nclauses; listidx++) + { + /* +* Skip clauses that have no precalculated attnums, which means it is +* either incompatible or was already used by some other statistic. +*/ + if (!list_attnums[listidx]) + continue; + + if (!bms_is_member(listidx, *estimatedclauses)) + clauses_attnums = bms_add_members(clauses_attnums, list_attnums[listidx]); + } + + return clauses_attnums; +} + /* * statext_mcv_clauselist_selectivity * Estimate clauses using the best multi-column statistics. @@ -1173,11 +1200,6 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid, * 'estimatedclauses' is an input/output parameter. We set bits for the * 0-based 'clauses' indexes we estimate for and also skip clause items that * already have a bit set. - * - * XXX If we were to use multiple statistics, this is where it would happen. - * We would simply repeat this on a loop on the "remaining" clauses, possibly - * using the already estimated clauses as conditions (and combining the values - * using conditional probability formula). 
*/ static Selectivity statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid, @@ -1188,14 +1210,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli Bitmapset *clauses_attnums = NULL; Bitmapset **list_attnums; int listidx; - StatisticExtInfo *stat; - List *stat_clauses; - Selectivity simple_sel, - mcv_sel, - mcv_basesel, - mcv_totalsel, - other_sel, - sel; + Selectivity sel = 1.0; /* check if there's any stats that might be useful for us. */ if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV)) @@ -1221,80 +1236,113 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli Node *clause = (Node *) lfirst(l); Bitmapset *attnums = NULL; + list_attnums[listidx] = NULL; + if (!bms_is_member(listidx, *estimatedclauses) && statext_is_compatible_clause(root, clause, rel->relid, )) -
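The greedy loop sketched in the message above ("while () { ... apply another MCV list ... }") behaves roughly like this toy Python model. The statistics definitions and the helper name `greedy_stat_order` are hypothetical; the real code multiplies selectivities rather than just recording which statistics were applied:

```python
def greedy_stat_order(stats, clause_attnums):
    """Repeatedly pick the statistics object covering the most attnums
    from not-yet-estimated clauses; stop when nothing covers >= 2.

    stats: dict mapping stat name -> set of attnums it covers
    clause_attnums: per-clause sets of attnums
    """
    stats = dict(stats)        # don't mutate the caller's dict
    estimated = set()          # indexes of already-estimated clauses
    order = []
    while stats:
        # attnums from compatible, not-yet-estimated clauses
        remaining = set()
        for i, attnums in enumerate(clause_attnums):
            if i not in estimated:
                remaining |= attnums
        # pick the statistics object covering the most remaining attnums
        best = max(stats, key=lambda s: len(stats[s] & remaining))
        if len(stats[best] & remaining) < 2:
            break              # no stat covers at least two attributes
        order.append(best)
        # mark clauses fully covered by the chosen stat as estimated
        for i, attnums in enumerate(clause_attnums):
            if attnums <= stats[best]:
                estimated.add(i)
        del stats[best]
    return order

# WHERE a = 1 AND b = 1 AND c = 1 AND d = 1 with stats on (a,b) and (c,d):
assert greedy_stat_order({"s1": {1, 2}, "s2": {3, 4}},
                         [{1}, {2}, {3}, {4}]) == ["s1", "s2"]
```

With overlapping statistics, e.g. on (a,b) and (b,c,d), this model applies the larger cover and then stops, since only one clause remains for the other statistic, which matches the limitation discussed later in the thread.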
Re: Using multiple extended statistics for estimates
On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote: Hi, PostgreSQL 10 introduced extended statistics, allowing us to consider correlation between columns to improve estimates, and PostgreSQL 12 added support for MCV statistics. But we still had the limitation that we only allowed using a single extended statistics object per relation, i.e. given a table with two extended stats CREATE TABLE t (a int, b int, c int, d int); CREATE STATISTICS s1 (mcv) ON a, b FROM t; CREATE STATISTICS s2 (mcv) ON c, d FROM t; and a query SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1; we only ever used one of the statistics (and we considered them in a not particularly well determined order). This patch addresses this by using as many extended stats as possible, by adding a loop to statext_mcv_clauselist_selectivity(). In each step we pick the "best" applicable statistics (in the sense of covering the most attributes) and factor it into the overall estimate. All this happens where we'd originally consider applying a single MCV list, i.e. before even considering the functional dependencies, so roughly like this: while () { ... apply another MCV list ... } ... apply functional dependencies ... I considered applying both in the loop, but I think that'd be wrong - the MCV list is expected to contain more information about individual values (compared to functional deps, which are column-level). Here is a slightly polished v2 of the patch, the main difference being that computing clause_attnums was moved to a separate function. This is a fairly simple patch, and it's not entirely new functionality (applying multiple statistics was part of the very first patch series, although of course in a very different form). So unless there are objections, I'd like to get this committed sometime next week. There's room for improvement, of course, for example when handling overlapping statistics. 
Consider a table with columns (a,b,c) and two extended statistics on (a,b)
and (b,c), and a query with one clause per column

    SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1

In this case the patch does not help, because we apply (a,b) and then we
have just a single clause remaining. What we could do is still apply the
(b,c) statistic, using the already-estimated clause on b as a condition. So
essentially we'd compute

    P(a=1 && b=1) * P(c=1 | b=1)

But that'll require larger changes, and I see it as an evolution of the
current patch.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
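The conditional-probability combination described above can be written out numerically. This is a minimal sketch with made-up selectivities standing in for estimates from the two MCV lists and the per-column statistics on b; the helper name is invented here and is not part of the patch:

```c
#include <assert.h>

/*
 * Combine overlapping estimates via conditional probability:
 *
 *     P(a && b && c) = P(a && b) * P(c | b) = P(a && b) * P(b && c) / P(b)
 *
 * p_ab and p_bc stand in for selectivities from the (a,b) and (b,c) MCV
 * statistics, p_b for the selectivity of the clause on b alone.
 */
static double
combine_with_condition(double p_ab, double p_bc, double p_b)
{
    if (p_b <= 0.0)
        return p_ab;            /* nothing known about b alone; fall back */

    return p_ab * (p_bc / p_b);
}
```

For example, with P(a=1 && b=1) = 0.02, P(b=1 && c=1) = 0.05 and P(b=1) = 0.10, the combined estimate is 0.02 * 0.5 = 0.01, rather than the 0.02 we would get from the (a,b) statistic alone.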
Using multiple extended statistics for estimates
Hi,

PostgreSQL 10 introduced extended statistics, allowing us to consider
correlation between columns to improve estimates, and PostgreSQL 12 added
support for MCV statistics. But we still had the limitation that we only
allowed using a single extended statistics object per relation, i.e. given
a table with two extended stats

    CREATE TABLE t (a int, b int, c int, d int);
    CREATE STATISTICS s1 (mcv) ON a, b FROM t;
    CREATE STATISTICS s2 (mcv) ON c, d FROM t;

and a query

    SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;

we only ever used one of the statistics (and we considered them in a not
particularly well determined order).

This patch addresses this by using as many extended stats as possible, by
adding a loop to statext_mcv_clauselist_selectivity(). In each step we pick
the "best" applicable statistics object (in the sense of covering the most
attributes) and factor it into the overall estimate. All this happens where
we'd originally consider applying a single MCV list, i.e. before even
considering the functional dependencies, so roughly like this:

    while ()
    {
        ... apply another MCV list ...
    }

    ... apply functional dependencies ...

I've considered doing both in the loop, but I think that'd be wrong - the
MCV list is expected to contain more information about individual values
(compared to functional deps, which are column-level).

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 207ee3160e..f817bc6189 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1173,11 +1173,6 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
  * 'estimatedclauses' is an input/output parameter. We set bits for the
  * 0-based 'clauses' indexes we estimate for and also skip clause items that
  * already have a bit set.
- *
- * XXX If we were to use multiple statistics, this is where it would happen.
- * We would simply repeat this on a loop on the "remaining" clauses, possibly
- * using the already estimated clauses as conditions (and combining the values
- * using conditional probability formula).
  */
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
@@ -1188,14 +1183,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
     Bitmapset  *clauses_attnums = NULL;
     Bitmapset **list_attnums;
     int         listidx;
-    StatisticExtInfo *stat;
-    List       *stat_clauses;
-    Selectivity simple_sel,
-                mcv_sel,
-                mcv_basesel,
-                mcv_totalsel,
-                other_sel,
-                sel;
+    Selectivity sel = 1.0;

     /* check if there's any stats that might be useful for us. */
     if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV))
@@ -1237,64 +1225,99 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
     if (bms_membership(clauses_attnums) != BMS_MULTIPLE)
         return 1.0;

-    /* find the best suited statistics object for these attnums */
-    stat = choose_best_statistics(rel->statlist, clauses_attnums, STATS_EXT_MCV);
+    /* apply as many extended statistics as possible */
+    while (true)
+    {
+        StatisticExtInfo *stat;
+        List       *stat_clauses;
+        Selectivity simple_sel,
+                    mcv_sel,
+                    mcv_basesel,
+                    mcv_totalsel,
+                    other_sel,
+                    stat_sel;

-    /* if no matching stats could be found then we've nothing to do */
-    if (!stat)
-        return 1.0;
+        /*
+         * Recompute attnums in the remaining clauses (we simply use the
+         * bitmaps computed earlier, so that we don't have to inspect the
+         * clauses again).
+         */
+        clauses_attnums = NULL;

-    /* Ensure choose_best_statistics produced an expected stats type. */
-    Assert(stat->kind == STATS_EXT_MCV);
+        listidx = 0;
+        foreach(l, clauses)
+        {
+            if (!bms_is_member(listidx, *estimatedclauses))
+                clauses_attnums = bms_add_members(clauses_attnums,
+                                                  list_attnums[listidx]);

-    /* now filter the clauses to be estimated using the selected MCV */
-    stat_clauses = NIL;
+            listidx++;
+        }

-    listidx = 0;
-    foreach(l, clauses)
-    {
-        /*
-         * If the clause is compatible with