Re: using extended statistics to improve join estimates
On 5/23/24 09:04, Andy Fan wrote:
> Andrei Lepikhov writes:
>> * c) No extended stats with MCV. If there are multiple join clauses,
>> * we can try using ndistinct coefficients and do what eqjoinsel does.
> OK, I didn't pay enough attention to this comment before. And yes, I
> came to the same conclusion as you - there is no implementation of this.
> If so, I think we should remove the comment and do the implementation in
> the next patch.

I have the opposite opinion about it:
1. Distinct estimation is a more universal thing - you can use it precisely on any subset of columns.
2. Distinct estimation is faster - it is just a number; you don't need to detoast a huge array of values and compare them one by one.
So, IMO, it is an essential part of join estimation, and it should be implemented as in eqjoinsel.

> Do you think the hook proposal is closely connected with the current
> topic? IIUC, it seems not. So a dedicated thread to explain the problem
> to solve, the proposal, and the following discussion should be helpful
> for both topics. I'm just worried that mixing the two in one thread
> would make things more complex than necessary.

Sure.

--
regards, Andrei Lepikhov
Postgres Professional
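The ndistinct-based fallback discussed above can be sketched as a simplified model of what eqjoinsel does when no MCV lists are available (a hedged sketch in Python, not the actual PostgreSQL code; the row counts and distinct counts below are made-up illustration values):

```python
def join_selectivity_by_ndistinct(nd_outer, nd_inner):
    # Simplified eqjoinsel-style estimate without MCV lists: assume the
    # side with fewer distinct values is a subset of the other, so the
    # equality-clause selectivity is 1 / max(nd_outer, nd_inner).
    return 1.0 / max(nd_outer, nd_inner)

# With several join clauses, an extended ndistinct statistic supplies one
# distinct count for the whole column set, instead of multiplying
# per-column selectivities under an independence assumption.
rows_outer, rows_inner = 1_000_000, 500_000
sel = join_selectivity_by_ndistinct(5000, 2000)
est_rows = rows_outer * rows_inner * sel
```

This is why the estimate is cheap: it is arithmetic on two numbers from the statistics, with no per-value comparison of MCV arrays.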
Re: POC: GROUP BY optimization
On 20/5/2024 15:54, jian he wrote:
> As mentioned previously, both A and B can use Incremental Sort;
> set_cheapest will choose A instead of B because path A was generated
> **first** and satisfies incremental sort.

Yeah, I agree with your analysis. Looking into cost_incremental_sort, I see that we have ten group_tuples. This value doesn't depend on how many presorted keys (1 or 2) we use; it comes from the default estimation. Given the current circumstances, it seems unlikely that we can make any significant changes without introducing a new sort cost model that accounts for the number of sorted columns.

--
regards, Andrei Lepikhov
Postgres Professional
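A toy model of the tie (a hedged sketch, not the actual cost_incremental_sort code): if the estimated group size is the same default value whether one or two keys are presorted, both paths get identical costs and the path generated first simply wins:

```python
import math

def incremental_sort_cost(input_tuples, group_tuples, cmp_cost=1.0):
    # Sort each presorted group independently: roughly
    # n_groups * g * log2(g) comparisons for groups of size g.
    n_groups = input_tuples / group_tuples
    per_group = group_tuples * math.log2(max(group_tuples, 2))
    return n_groups * per_group * cmp_cost

# Default estimation yields the same group_tuples (ten, in the case
# discussed above) regardless of whether 1 or 2 keys are presorted:
cost_one_key = incremental_sort_cost(1000, 10)
cost_two_keys = incremental_sort_cost(1000, 10)
```

With equal costs, add_path keeps whichever path arrived first, which is exactly the behaviour jian he observed; a cost model sensitive to the number of sorted columns would break the tie on merit instead.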
Re: using extended statistics to improve join estimates
On 5/20/24 16:40, Andrei Lepikhov wrote:
> On 20/5/2024 15:52, Andy Fan wrote:
>> + if (clauselist_selectivity_hook)
>> + *return* clauselist_selectivity_hook(root, clauses, ..)
> Of course - the library may not estimate all the clauses - that is the
> reason why I added the input/output parameter 'estimatedclauses', by
> analogy with statext_clauselist_selectivity.

Here is a polished and slightly modified version of the proposed hook. Additionally, I propose exporting the statext_mcv_clauselist_selectivity routine, just as dependencies_clauselist_selectivity already is. This could potentially enhance the functionality of the PostgreSQL estimation code.

To clarify the purpose: I want an optional, more conservative estimation based on distinct statistics, loaded as a library. Let me provide a (somewhat degenerate) example:

CREATE TABLE is_test(x1 integer, x2 integer, x3 integer, x4 integer);
INSERT INTO is_test (x1,x2,x3,x4)
  SELECT x%5,x%7,x%11,x%13 FROM generate_series(1,1E3) AS x;
INSERT INTO is_test (x1,x2,x3,x4)
  SELECT 14,14,14,14 FROM generate_series(1,100) AS x;
CREATE STATISTICS ist_stat (dependencies,ndistinct)
  ON x1,x2,x3,x4 FROM is_test;
ANALYZE is_test;
EXPLAIN (ANALYZE, COSTS ON, SUMMARY OFF, TIMING OFF)
SELECT * FROM is_test WHERE x1=14 AND x2=14 AND x3=14 AND x4=14;
DROP TABLE is_test CASCADE;

I see:
(cost=0.00..15.17 rows=3 width=16) (actual rows=100 loops=1)

A dependency works great if it holds for all the data in the columns, but we get underestimations when different subsets of rows follow different laws. So, if we don't have MCV statistics, sometimes we need to skip dependency statistics and use ndistinct instead.
--
regards, Andrei Lepikhov
Postgres Professional

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 0ab021c1e8..1508a1beea 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -128,6 +128,11 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	ListCell   *l;
 	int			listidx;
 
+	if (clauselist_selectivity_hook)
+		s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
+										 sjinfo, ,
+										 use_extended_stats);
+
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 99fdf208db..b1722f5a60 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1712,7 +1712,7 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
  * 0-based 'clauses' indexes we estimate for and also skip clause items that
  * already have a bit set.
  */
-static Selectivity
+Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
 								   JoinType jointype, SpecialJoinInfo *sjinfo,
 								   RelOptInfo *rel, Bitmapset **estimatedclauses,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 5f5d7959d8..ff98fda08c 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -146,6 +146,7 @@
 /* Hooks for plugins to get control when we ask for stats */
 get_relation_stats_hook_type get_relation_stats_hook = NULL;
 get_index_stats_hook_type get_index_stats_hook = NULL;
+clauselist_selectivity_hook_type clauselist_selectivity_hook = NULL;
 
 static double eqsel_internal(PG_FUNCTION_ARGS, bool negate);
 static double eqjoinsel_inner(Oid opfuncoid, Oid collation,
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 7f2bf18716..436f30bdde 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -104,6 +104,14 @@ extern void BuildRelationExtStatistics(Relation onerel, bool inh, double totalro
 extern int	ComputeExtStatisticsRows(Relation onerel, int natts, VacAttrStats **vacattrstats);
 extern bool statext_is_kind_built(HeapTuple htup, char type);
+extern Selectivity statext_mcv_clauselist_selectivity(PlannerInfo *root,
+													  List *clauses,
+													  int varRelid,
+													  JoinType jointype,
+													  SpecialJoinInfo *sjinfo,
+													  RelOptInfo *rel,
+													  Bitmapset **estimatedclauses,
+													  bool is_or);
 extern Selectivity dependencies_clauselist_selectivity(PlannerInfo *root,
 													   List *clauses,
 													   int varRelid,
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index f2563ad1cb..253f584c65 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -148,6 +148,17 @@ typedef bool (*get_index_stats_hook_type) (PlannerInfo *root,
 										   VariableStatData *vardata);
 extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;
 
+/* Hooks for plugins to get control when we ask for selectivity estimation */
+typedef Selectivity
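The tradeoff behind the is_test example above can be sketched with the two combination rules in play (a hedged sketch: the formulas are simplified from the extended-statistics machinery, restricted to two clauses, and the numeric inputs are illustrative, not taken from the code):

```python
def dependency_conj_sel(s_a, s_b, degree):
    # Functional-dependency rule for two clauses where a -> b holds with
    # degree f: P(a and b) = P(a) * (f + (1 - f) * P(b)).
    return s_a * (degree + (1.0 - degree) * s_b)

def ndistinct_conj_sel(nd_combined):
    # The more conservative fallback: assume rows are spread evenly over
    # the distinct combinations of the clause columns.
    return 1.0 / nd_combined

# Two limiting cases of the dependency rule:
perfect = dependency_conj_sel(0.1, 0.1, degree=1.0)      # collapses to P(a)
independent = dependency_conj_sel(0.1, 0.1, degree=0.0)  # P(a) * P(b)
```

When the dependency degree measured over the whole table doesn't hold for a subset of rows (the 100 rows of (14,14,14,14) here), the dependency rule multiplies the estimate down toward the independent case, while the ndistinct fallback stays at rows/ndistinct and avoids the worst underestimation.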
Re: using extended statistics to improve join estimates
On 20/5/2024 15:52, Andy Fan wrote:
> Hi Andrei,
>> On 4/3/24 01:22, Tomas Vondra wrote:
>>> Cool! There's obviously no chance to get this into v18, and I have
>>> stuff to do in this CF. But I'll take a look after that.
>> I'm looking at your patch now - an excellent start to an eagerly
>> awaited feature! A couple of questions:
>> 1. I didn't find the implementation of strategy 'c' - estimation by the
>> number of distinct values. Did you forget it?
> What do you mean by "strategy 'c'"?

As described in the 0001-* patch:
* c) No extended stats with MCV. If there are multiple join clauses,
* we can try using ndistinct coefficients and do what eqjoinsel does.

>> 2. Can we add a clauselist selectivity hook into the core (something
>> similar to the code in the attachment)? It would allow the development
>> and testing of multicolumn join estimations without patching the core.
> The idea LGTM. But do you want
> + if (clauselist_selectivity_hook)
> + s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
> rather than
> + if (clauselist_selectivity_hook)
> + *return* clauselist_selectivity_hook(root, clauses, ..)

Of course - the library may not estimate all the clauses - that is the reason why I added the input/output parameter 'estimatedclauses', by analogy with statext_clauselist_selectivity.

--
regards, Andrei Lepikhov
Postgres Professional
Re: using extended statistics to improve join estimates
On 4/3/24 01:22, Tomas Vondra wrote:
> Cool! There's obviously no chance to get this into v18, and I have stuff
> to do in this CF. But I'll take a look after that.

I'm looking at your patch now - an excellent start to an eagerly awaited feature! A couple of questions:
1. I didn't find the implementation of strategy 'c' - estimation by the number of distinct values. Did you forget it?
2. Can we add a clauselist selectivity hook into the core (something similar to the code in the attachment)? It would allow the development and testing of multicolumn join estimations without patching the core.

--
regards, Andrei Lepikhov
Postgres Professional

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 0ab021c1e8..271d36a522 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -128,6 +128,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	ListCell   *l;
 	int			listidx;
 
+	if (clauselist_selectivity_hook)
+		s1 = clauselist_selectivity_hook(root, clauses, varRelid, jointype,
+										 sjinfo, );
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 5f5d7959d8..ff98fda08c 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -146,6 +146,7 @@
 /* Hooks for plugins to get control when we ask for stats */
 get_relation_stats_hook_type get_relation_stats_hook = NULL;
 get_index_stats_hook_type get_index_stats_hook = NULL;
+clauselist_selectivity_hook_type clauselist_selectivity_hook = NULL;
 
 static double eqsel_internal(PG_FUNCTION_ARGS, bool negate);
 static double eqjoinsel_inner(Oid opfuncoid, Oid collation,
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index f2563ad1cb..ee28d3ba9b 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -148,6 +148,15 @@ typedef bool (*get_index_stats_hook_type) (PlannerInfo *root,
 										   VariableStatData *vardata);
 extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;
 
+/* Hooks for plugins to get control when we ask for selectivity estimation */
+typedef bool (*clauselist_selectivity_hook_type) (PlannerInfo *root,
+												  List *clauses,
+												  int varRelid,
+												  JoinType jointype,
+												  SpecialJoinInfo *sjinfo,
+												  Bitmapset **estimatedclauses);
+extern PGDLLIMPORT clauselist_selectivity_hook_type clauselist_selectivity_hook;
+
 /* Functions in selfuncs.c */
 extern void examine_variable(PlannerInfo *root, Node *node, int varRelid,
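To make the intended contract of the hook concrete, here is a language-neutral sketch in Python with hypothetical names: the extension estimates whichever subset of clauses it can handle jointly, reports which ones it covered through the estimatedclauses set (standing in for the Bitmapset), and the core falls back to per-clause estimation for the rest. The real hook of course operates on clause nodes, not pre-computed selectivities:

```python
def clauselist_selectivity(clause_sels, hook=None):
    # clause_sels: per-clause fallback selectivities, stand-ins for what
    # clause_selectivity_ext() would compute for each clause.
    estimated = set()  # plays the role of Bitmapset **estimatedclauses
    s1 = 1.0
    if hook is not None:
        s1, estimated = hook(clause_sels)
    for idx, sel in enumerate(clause_sels):
        if idx in estimated:
            continue   # already covered by the extension's joint estimate
        s1 *= sel      # independence assumption for the remaining clauses
    return s1

def joint_hook(sels):
    # Hypothetical extension: clauses 0 and 1 estimated jointly (e.g. via
    # an extended ndistinct statistic); 0.02 is a made-up joint value.
    return 0.02, {0, 1}

combined = clauselist_selectivity([0.1, 0.1, 0.5], hook=joint_hook)
```

Here the joint estimate (0.02) is multiplied only by the selectivity of the uncovered third clause (0.5), which is exactly why returning early from the hook would be wrong: the core still owns the clauses the extension declined to estimate.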
Re: POC: GROUP BY optimization
On 24.04.2024 13:25, jian he wrote:
> hi. I found an interesting case.
>
> CREATE TABLE t1 AS
>   SELECT (i % 10)::numeric AS x, (i % 10)::int8 AS y,
>          'abc' || i % 10 AS z, i::int4 AS w
>   FROM generate_series(1, 100) AS i;
> CREATE INDEX t1_x_y_idx ON t1 (x, y);
> ANALYZE t1;
> SET enable_hashagg = off;
> SET enable_seqscan = off;
>
> EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,z,y,w;
> EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,w,y,z;
> EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,z,w,y;
> EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,w,z,y;
>
> the above part will use:
> ->  Incremental Sort
>       Sort Key: x, $, $, $
>       Presorted Key: x
>       ->  Index Scan using t1_x_y_idx on t1
>
> EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY z,y,w,x;
> EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY w,y,z,x;
> EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY y,z,x,w;
> EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY y,w,x,z;
> EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY y,x,z,w;
> EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY y,x,w,z;
>
> these will use:
> ->  Incremental Sort
>       Sort Key: x, y, $, $
>       Presorted Key: x, y
>       ->  Index Scan using t1_x_y_idx on t1
>
> I guess this is fine, but not optimal?

It looks like a bug right now - in the current implementation, we don't differentiate between different orders. So:
1. Apply all the patches from this thread that I proposed in answer to Tom Lane's last rebuke - is the behavior still the same?
2. Could you try to find the reason?

--
regards, Andrei Lepikhov
Postgres Professional
Re: query_id, pg_stat_activity, extended query protocol
On 15.05.2024 10:24, Imseih (AWS), Sami wrote:
>> I discovered the current state of queryId reporting and found that it
>> may be illogical: Postgres resets queryId right before query execution
>> in the simple protocol and doesn't reset it at all in the extended
>> protocol and other ways to execute queries.
> In exec_parse_message, exec_bind_message and exec_execute_message, the
> queryId is reset via pgstat_report_activity.
>> I think we should generally report it when the backend executes a job
>> related to the query with that queryId. This means it would reset the
>> queryId at the end of the query execution.
> When the query completes execution and the session goes into a state
> other than "active", both the query text and the queryId should be of
> the last executed statement. This is the documented behavior, and I
> believe it's the correct behavior.

I explored this case a bit more. As I can see, the origin of the problem is that exec_execute_message reports STATE_RUNNING, although ExecutorStart was called in the exec_bind_message routine beforehand. I'm unsure whether ExecutorStart needs to be called in the bind code. But if we don't change the current logic, would it make more sense to move pgstat_report_query_id to the ExecutorRun routine?

--
regards, Andrei Lepikhov
Re: query_id, pg_stat_activity, extended query protocol
On 15/5/2024 12:09, Michael Paquier wrote:
> On Wed, May 15, 2024 at 03:24:05AM +, Imseih (AWS), Sami wrote:
>> I think we should generally report it when the backend executes a job
>> related to the query with that queryId. This means it would reset the
>> queryId at the end of the query execution. When the query completes
>> execution and the session goes into a state other than "active", both
>> the query text and the queryId should be of the last executed
>> statement. This is the documented behavior, and I believe it's the
>> correct behavior. If we reset queryId at the end of execution, this
>> behavior breaks. Right?
> Idle sessions keep track of the last query string run, hence being
> consistent in pg_stat_activity and reporting its query ID is user
> friendly. Resetting it while keeping the string is less consistent.
> It's been this way for years, so I'd rather let it be this way.

Okay, that's precisely what I wanted to understand: queryId doesn't have semantics to show the job that consumes resources right now - it is mostly about the convenience of knowing that the backend processes nothing except (probably) this query.

--
regards, Andrei Lepikhov
Re: query_id, pg_stat_activity, extended query protocol
On 5/1/24 10:07, Imseih (AWS), Sami wrote:
> Here is a new rev of the patch which deals with the scenario mentioned
> by Andrei [1] in which the queryId may change due to a cached query
> invalidation.
>
> [1] https://www.postgresql.org/message-id/724348C9-8023-41BC-895E-80634E79A538%40amazon.com

I discovered the current state of queryId reporting and found that it may be illogical: Postgres resets queryId right before query execution in the simple protocol and doesn't reset it at all in the extended protocol and other ways to execute queries. I think we should generally report it when the backend executes a job related to the query with that queryId. This means it would reset the queryId at the end of the query execution.

However, the process of setting up the queryId is more complex. Should we set it at the beginning of query execution? This seems logical, but what about the planning process? If an extension plans a query without the intention to execute it, for speculative reasons, should we still show the queryId? Perhaps we should reset the state right after planning to accurately reflect the current queryId. See the sketch for that in the attachment - it still needs a queryId reset on abort.

--
regards, Andrei Lepikhov

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4d7c92d63c..a4d38a157f 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -470,6 +470,12 @@ ExecutorEnd(QueryDesc *queryDesc)
 		(*ExecutorEnd_hook) (queryDesc);
 	else
 		standard_ExecutorEnd(queryDesc);
+
+	/*
+	 * Report at the end of query execution. Don't change it for nested
+	 * queries.
+	 */
+	pgstat_report_query_id(0, false);
 }
 
 void
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 032818423f..ba29dc5bc7 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -58,6 +58,7 @@
 #include "parser/parse_relation.h"
 #include "parser/parsetree.h"
 #include "partitioning/partdesc.h"
+#include "pgstat.h"
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
@@ -280,6 +281,13 @@ planner(Query *parse, const char *query_string, int cursorOptions,
 		result = (*planner_hook) (parse, query_string, cursorOptions, boundParams);
 	else
 		result = standard_planner(parse, query_string, cursorOptions, boundParams);
+
+	/*
+	 * Reset queryId at the end of the planning job. Executor code will set
+	 * it up again on the ExecutorStart call.
+	 */
+	pgstat_report_query_id(0, false);
+
 	return result;
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 2dff28afce..10e2529cf6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -1108,8 +1108,6 @@ exec_simple_query(const char *query_string)
 	const char *cmdtagname;
 	size_t		cmdtaglen;
 
-	pgstat_report_query_id(0, true);
-
 	/*
 	 * Get the command name for use in status display (it also becomes the
 	 * default completion tag, down inside PortalRun).  Set ps_status and
Re: Removing unneeded self joins
On 6/5/2024 21:44, Robert Haas wrote:
> On Sat, May 4, 2024 at 10:46 PM Andrei Lepikhov wrote:
>> Having no objective negative feedback, we have no reason to change
>> anything in the design or any part of the code. It looks regrettable
>> and unusual.
> To me, this sounds like you think it's someone else's job to tell you
> what is wrong with the patch, or how to fix it, and if they don't, then
> you should get to have the patch as part of PostgreSQL. But that is not
> how we do things, nor should we. I agree that it sucks when you need
> feedback and don't get it, and I've written about that elsewhere and
> recently. But if you don't get feedback and as a result you can't get
> the patch to an acceptable level,

I'm really sorry that the level of my language caused a misunderstanding. The main purpose of this work is to form a more or less certain view of the direction of the planner's development. Right now, it evolves extensively - new structures, new variables, alternative copies of the same node trees with slightly changed properties ... This approach allows us to introduce planning features quickly (the many changes in planner logic since PG16 are evidence of that), and with still-growing computing resources, it lets Postgres fit into RAM and reasonable planning time. But maybe we want to be more modest? The work Ashutosh has been doing this year shows how expensive the planner sometimes is. Perhaps we want machinery that will check the integrity of planning data beyond setrefs, which occasionally fails to detect such issues. If an extensive approach is the only viable option, then it's clear that this and many other features are simply not suitable for the Postgres planner. It's disheartening that this patch didn't elicit such high-level feedback.

--
regards, Andrei Lepikhov
Re: Removing unneeded self joins
On 3/5/2024 20:55, Robert Haas wrote:
> One of my most embarrassing gaffes in this area personally was
> a448e49bcbe40fb72e1ed85af910dd216d45bad8. I don't know how I managed to
> commit the original patch without realizing it was going to cause an
> increase in the WAL size, but I can tell you that when I realized it, my
> heart sank through the floor.

I examined that case and agree that it looks like a severe problem. Unfortunately, in the case of the SJE patch, the committer and reviewers didn't provide negative feedback. We have seen only 'negative feelings' from people who haven't reviewed or analysed it at all (at least, they didn't mention doing so). Considering the situation, I suggest setting the default value of enable_self_join_removal to false in PG17 for added safety and then changing it to true in early PG18.

Having no objective negative feedback, we have no reason to change anything in the design or any part of the code. It looks regrettable and unusual. After designing the feature, fixing its bugs, and reviewing related patches on the commitfest, the question more likely lies in the planner design. For example, I wonder if anyone here knows why exactly the optimiser makes a copy of the whole query subtree in some places. Another example is PlannerInfo: can we really control all the consequences of introducing, let's say, a new JoinDomain entity?

You also mentioned 2024.pgconf.dev. Considering the current migration policy in some countries, it would be better to treat online participation as equivalent to offline. Without an online part of the conference, the only way to communicate and discuss is through this mailing list.

--
regards, Andrei Lepikhov
Re: Removing unneeded self joins
On 5/3/24 06:19, Tom Lane wrote:
> Alexander Korotkov writes:
>> I would appreciate your review of this patchset, and review from Andrei
>> as well.
> I hate to say this ... but if we're still finding bugs this basic in
> SJE, isn't it time to give up on it for v17? I might feel better about
> it if there were any reason to think these were the last major bugs. But
> you have already committed around twenty separate fixes for the original
> SJE patch, and now here you come with several more; so it doesn't seem
> like the defect rate has slowed materially. There can be no doubt
> whatever that the original patch was far from commit-ready. I think we
> should revert SJE for v17 and do a thorough design review before trying
> again in v18.

I have to say I don't see any evidence of bad design. I think this feature follows the example of 2489d76 [1], 1349d27 [2], and the partitionwise join features - we get some issues from time to time, but their severity and frequency are significantly reduced. First and foremost, this feature is highly isolated: like the PWJ feature, you can just disable (not enable?) SJE, which guarantees you will avoid the problems. Secondly, this feature reflects the design decisions the optimiser has made before. It raises some questions: Do we really control the consistency of our paths and the plan tree? Maybe we hide our misunderstanding of its logic by extensively copying expression trees, sometimes without fundamental necessity. Perhaps the optimiser needs some abstraction layers or reconstruction to reduce the quickly growing complexity. A good example here is [1]. IMO, the new promising feature it introduced isn't worth the complexity it added to the planner. SJE, much like the OR <-> ANY transformation, introduces a fresh perspective into the planner: if we encounter a complex, redundant query, it may be more beneficial to invest in simplifying the internal query representation rather than adding new optimisations that will grapple with this complexity.

Also, SJE raised questions I've never seen before, like: Could we control the consistency of the PlannerInfo by changing something in the logic? Considering the current state, I don't see any concrete outcomes or evidence that a redesign of the feature will lead us to a new path. However, I believe the presence of SJE in the core could potentially trigger improvements in the planner. As a result, I vote to stay with the feature. But remember, as an author, I'm not entirely objective, so let's wait for alternative opinions.

[1] Make Vars be outer-join-aware
[2] Improve performance of ORDER BY / DISTINCT aggregates

--
regards, Andrei Lepikhov
Postgres Professional
Re: Removing unneeded self joins
On 5/1/24 18:59, Alexander Korotkov wrote:
> I think we probably could forbid SJE for the tables with TABLESAMPLE
> altogether. Please, check the attached patch.

Your patch looks good to me. I added some comments and a test case to join.sql. One question remains: do we anticipate other lateral self-references besides the TABLESAMPLE case? Looking into the extract_lateral_references implementation, I see that the only case to be afraid of is RTE_SUBQUERY. But we pull up subqueries before extracting lateral references. So, if we have a reference to a subquery, it means we will not flatten this subquery and SJE will not be applied. Do we need more code, as you have written in the first patch?

--
regards, Andrei Lepikhov
Postgres Professional

From dac8afd2095586921dfcf5564e7f2979e89b77de Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov"
Date: Thu, 2 May 2024 16:17:52 +0700
Subject: [PATCH] Forbid self-join elimination on table sampling scans.

With custom table sampling methods, we can run into unpredictable issues
if we replace a join with a scan operation. It might make sense to
analyse the possible situations and enable SJE for them, but the real
profit from that looks too low.
---
 src/backend/optimizer/plan/analyzejoins.c |  8 ++++++++
 src/backend/optimizer/plan/initsplan.c    |  5 ++++-
 src/test/regress/expected/join.out        | 13 +++++++++++++
 src/test/regress/sql/join.sql             |  5 +++++
 4 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 506fccd20c..bb89558dcd 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -2148,6 +2148,14 @@ remove_self_joins_one_group(PlannerInfo *root, Relids relids)
 			Assert(root->simple_rte_array[k]->relid ==
 				   root->simple_rte_array[r]->relid);
 
+			/*
+			 * To avoid corner cases with table sampling methods just forbid
+			 * SJE for table sampling entries.
+			 */
+			if (root->simple_rte_array[k]->tablesample ||
+				root->simple_rte_array[r]->tablesample)
+				continue;
+
 			/*
 			 * It is impossible to eliminate join of two relations if they
 			 * belong to different rules of order. Otherwise planner can't be
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index e2c68fe6f9..bf839bcaf6 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -415,7 +415,10 @@ extract_lateral_references(PlannerInfo *root, RelOptInfo *brel, Index rtindex)
 	if (!rte->lateral)
 		return;
 
-	/* Fetch the appropriate variables */
+	/* Fetch the appropriate variables.
+	 * Changes in this place may need changes in SJE logic, see
+	 * the remove_self_joins_one_group routine.
+	 */
 	if (rte->rtekind == RTE_RELATION)
 		vars = pull_vars_of_level((Node *) rte->tablesample, 0);
 	else if (rte->rtekind == RTE_SUBQUERY)
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 8b640c2fc2..63143fe55f 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -6900,6 +6900,19 @@ where s1.x = 1;
    Filter: (1 = 1)
 (9 rows)
 
+-- Check that SJE doesn't touch TABLESAMPLE joins
+explain (costs off)
+SELECT * FROM emp1 t1 NATURAL JOIN LATERAL
+  (SELECT * FROM emp1 t2 TABLESAMPLE SYSTEM (t1.code));
+                       QUERY PLAN
+-----------------------------------------------------
+ Nested Loop
+   ->  Seq Scan on emp1 t1
+   ->  Sample Scan on emp1 t2
+         Sampling: system (t1.code)
+         Filter: ((t1.id = id) AND (t1.code = code))
+(5 rows)
+
 -- Check that PHVs do not impose any constraints on removing self joins
 explain (verbose, costs off)
 select * from emp1 t1 join emp1 t2 on t1.id = t2.id left join
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index c4c6c7b8ba..184fd0876b 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -2652,6 +2652,11 @@ select 1 from emp1 t1 left join
     on true
 where s1.x = 1;
 
+-- Check that SJE doesn't touch TABLESAMPLE joins
+explain (costs off)
+SELECT * FROM emp1 t1 NATURAL JOIN LATERAL
+  (SELECT * FROM emp1 t2 TABLESAMPLE SYSTEM (t1.code));
+
 -- Check that PHVs do not impose any constraints on removing self joins
 explain (verbose, costs off)
 select * from emp1 t1 join emp1 t2 on t1.id = t2.id left join
-- 
2.39.2
Re: query_id, pg_stat_activity, extended query protocol
On 27/4/2024 20:54, Imseih (AWS), Sami wrote:
>> But a simplistic case with a prepared statement shows how the value of
>> queryId can change if you don't acquire all the objects needed for the
>> execution:
>>
>> CREATE TABLE test();
>> PREPARE name AS SELECT * FROM test;
>> EXPLAIN (ANALYSE, VERBOSE, COSTS OFF) EXECUTE name;
>> DROP TABLE test;
>> CREATE TABLE test();
>> EXPLAIN (ANALYSE, VERBOSE, COSTS OFF) EXECUTE name;
> Hmm, you raise a good point. Isn't this a fundamental problem with
> prepared statements? If there is DDL on the relations of the prepared
> statement query, shouldn't the prepared statement be considered invalid
> at that point and raise an error to the user?

I don't think so. It may be any object, even a stored procedure, that is changed. IMO, the right option here is to report zero (i.e., the undefined value of queryId) until the end of the parsing stage.

--
regards, Andrei Lepikhov
Re: query_id, pg_stat_activity, extended query protocol
On 4/23/24 12:49, Michael Paquier wrote:
> On Tue, Apr 23, 2024 at 11:42:41AM +0700, Andrei Lepikhov wrote:
>> On 23/4/2024 11:16, Imseih (AWS), Sami wrote:
>>> +	pgstat_report_query_id(linitial_node(Query, psrc->query_list)->queryId, true);
>>>  	set_ps_display("BIND");
>>> @@ -2146,6 +2147,7 @@ exec_execute_message(const char *portal_name, long max_rows)
>>>  	debug_query_string = sourceText;
>>>  	pgstat_report_activity(STATE_RUNNING, sourceText);
>>> +	pgstat_report_query_id(portal->queryDesc->plannedstmt->queryId, true);
>>>  	cmdtagname = GetCommandTagNameAndLen(portal->commandTag, );
>> In exec_bind_message, how can you be sure that queryId exists in
>> query_list before the call of GetCachedPlan(), which will validate and
>> lock the plan? What if some OIDs were altered in the middle?
> I am also a bit surprised with the choice of using the first Query
> available in the list for the ID, FWIW. Did you consider using \bind to
> show how this behaves in a regression test?

I'm not sure how to invent a test based on the \bind command - we need some pause in the middle. But a simplistic case with a prepared statement shows how the value of queryId can change if you don't acquire all the objects needed for the execution:

CREATE TABLE test();
PREPARE name AS SELECT * FROM test;
EXPLAIN (ANALYSE, VERBOSE, COSTS OFF) EXECUTE name;
DROP TABLE test;
CREATE TABLE test();
EXPLAIN (ANALYSE, VERBOSE, COSTS OFF) EXECUTE name;

/*
                            QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on public.test (actual time=0.002..0.004 rows=0 loops=1)
 Query Identifier: 6750745711909650694

                            QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on public.test (actual time=0.004..0.004 rows=0 loops=1)
 Query Identifier: -2597546769858730762
*/

We have many different objects that can be changed - I just invented the most trivial example for discussion.

--
regards, Andrei Lepikhov
Re: query_id, pg_stat_activity, extended query protocol
On 23/4/2024 11:16, Imseih (AWS), Sami wrote:
>> FWIW, I'd like to think that we could improve the situation, requiring
>> a mix of calling pgstat_report_query_id() while feeding on some query
>> IDs retrieved from CachedPlanSource->query_list. I have not in detail
>> looked at how much could be achieved, TBH.
> I was dealing with this today and found this thread. I spent some time
> looking at possible solutions. In the flow of the extended query
> protocol, exec_parse_message reports the queryId, but subsequent calls
> to exec_bind_message and exec_execute_message reset the queryId when
> calling pgstat_report_activity(STATE_RUNNING,..), as you can see below.
>
> /*
>  * If a new query is started, we reset the query identifier as it'll only
>  * be known after parse analysis, to avoid reporting last query's
>  * identifier.
>  */
> if (state == STATE_RUNNING)
> 	beentry->st_query_id = UINT64CONST(0);
>
> So, I think the simple answer is something like the below. Inside
> exec_bind_message and exec_execute_message, the query_id should be
> reported after pgstat_report_activity.
>
> diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
> index 76f48b13d2..7ec2df91d5 100644
> --- a/src/backend/tcop/postgres.c
> +++ b/src/backend/tcop/postgres.c
> @@ -1678,6 +1678,7 @@ exec_bind_message(StringInfo input_message)
>  	debug_query_string = psrc->query_string;
>  	pgstat_report_activity(STATE_RUNNING, psrc->query_string);
> +	pgstat_report_query_id(linitial_node(Query, psrc->query_list)->queryId, true);
>  	set_ps_display("BIND");
> @@ -2146,6 +2147,7 @@ exec_execute_message(const char *portal_name, long max_rows)
>  	debug_query_string = sourceText;
>  	pgstat_report_activity(STATE_RUNNING, sourceText);
> +	pgstat_report_query_id(portal->queryDesc->plannedstmt->queryId, true);
>  	cmdtagname = GetCommandTagNameAndLen(portal->commandTag, );
>
> thoughts?

In exec_bind_message, how can you be sure that queryId exists in query_list before the call of GetCachedPlan(), which will validate and lock the plan? What if some OIDs were altered in the middle?
-- regards, Andrei Lepikhov
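As a side note, the symptom discussed in this thread can be observed from a second session by sampling pg_stat_activity while the first session runs statements over the extended query protocol (e.g. via pgbench -M prepared). This is a hedged sketch, assuming compute_query_id is enabled so that query IDs are computed at all:

```sql
-- Session 2: sample the backend's reported query ID while session 1
-- executes a cached plan via Parse/Bind/Execute messages. With the
-- behaviour described above, query_id may read 0 during Bind/Execute,
-- because pgstat_report_activity() resets it on STATE_RUNNING.
SELECT pid, state, query_id, query
FROM pg_stat_activity
WHERE backend_type = 'client backend';
```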
Re: POC: GROUP BY optimization
On 4/12/24 06:44, Tom Lane wrote:

If this patch were producing better results I'd be more excited about putting more work into it. But on the basis of what I'm seeing right now, I think maybe we ought to give up on it.

Let me show the cases where users currently profit from this tiny improvement (see queries and execution results in query.sql):

1. 'Not optimised query text' - we didn't consider group-by ordering during database development.
2. 'Accidental pathkeys' - we didn't ask for any explicit ordering, but the planner happened to choose a merge join that produced an ordering, and we can utilise it.
3. 'Uncertain scan path' - we have options regarding which index to use, and because of that, we can't predict the optimal group-by ordering before the start of query planning.
4. 'HashAgg vs. GroupAgg' - sometimes, the GroupAgg strategy outperforms HashAgg just because we don't need any ordering procedure at all.

And the last thing here: this code introduces the basics needed to add more sophisticated strategies, like ordering according to uniqueness or preferring to put constant-width columns first in grouping.

-- regards, Andrei Lepikhov Postgres Professional

query.sql Description: application/sql
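To illustrate the 'uncertain scan path' / 'accidental pathkeys' cases with a self-contained sketch (table and index names are made up, and whether the reordered grouping actually wins depends on costs, data, and the patch version):

```sql
-- An index supplies the order (y, x), while the query groups by (x, y).
-- With group-by key reordering, the planner may rearrange the grouping
-- keys to (y, x) and use GroupAggregate over the index order, avoiding
-- an explicit Sort node.
CREATE TABLE gtest (x integer, y integer);
INSERT INTO gtest SELECT i % 1000, i % 10 FROM generate_series(1, 100000) AS i;
CREATE INDEX gtest_y_x ON gtest (y, x);
VACUUM ANALYZE gtest;

SET enable_hashagg = off;  -- compare only the sorted grouping strategies
EXPLAIN (COSTS OFF)
SELECT x, y, count(*) FROM gtest GROUP BY x, y;
```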
Re: POC: GROUP BY optimization
On 4/12/24 06:44, Tom Lane wrote:

* Speaking of pathnodes.h, PathKeyInfo is a pretty uninformative node type name, and the comments provided for it are not going to educate anybody. What is the "association" between the pathkeys and clauses? I guess the clauses are supposed to be SortGroupClauses, but not all pathkeys match up to SortGroupClauses, so what then? This is very underspecified, and fuzzy thinking about data structures tends to lead to bugs.

I'm not the best at English documentation and naming style, so feel free to check my version.

So I'm quite afraid that there are still bugs lurking here. What's more, given that the plans this patch makes apparently seldom win when you don't put a thumb on the scales, it might take a very long time to isolate those bugs. If the patch produces a faulty plan (e.g. not sorted properly) we'd never notice if that plan isn't chosen, and even if it is chosen it probably wouldn't produce anything as obviously wrong as a crash.

I added checks on the proper order of pathkeys and clauses. If you really care about that, we should spend additional time and rewrite the code to generate the order of clauses somewhere in the plan creation phase. For example, during the create_group_plan call, we could use the processed_groupClause list, cycle through subpath->pathkeys, set the order, and detect whether the pathkeys list corresponds to the group-by or is enough to build a grouping plan. Anyway, making this part of the code more resistant to blunders is another story.

-- regards, Andrei Lepikhov Postgres Professional

diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index aa80f6486c..9c079270ec 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -461,7 +461,7 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys,
 /*
  * pathkeys_are_duplicate
  *		Check if give pathkeys are already contained the list of
- *		PathKeyInfo's.
+ * GroupByOrdering's. */ static bool pathkeys_are_duplicate(List *infos, List *pathkeys) @@ -470,7 +470,7 @@ pathkeys_are_duplicate(List *infos, List *pathkeys) foreach(lc, infos) { - PathKeyInfo *info = lfirst_node(PathKeyInfo, lc); + GroupByOrdering *info = lfirst_node(GroupByOrdering, lc); if (compare_pathkeys(pathkeys, info->pathkeys) == PATHKEYS_EQUAL) return true; @@ -482,7 +482,7 @@ pathkeys_are_duplicate(List *infos, List *pathkeys) * get_useful_group_keys_orderings * Determine which orderings of GROUP BY keys are potentially interesting. * - * Returns a list of PathKeyInfo items, each representing an interesting + * Returns a list of GroupByOrdering items, each representing an interesting * ordering of GROUP BY keys. Each item stores pathkeys and clauses in the * matching order. * @@ -495,15 +495,15 @@ pathkeys_are_duplicate(List *infos, List *pathkeys) List * get_useful_group_keys_orderings(PlannerInfo *root, Path *path) { - Query *parse = root->parse; - List *infos = NIL; - PathKeyInfo *info; + Query *parse = root->parse; + List *infos = NIL; + GroupByOrdering *info; List *pathkeys = root->group_pathkeys; List *clauses = root->processed_groupClause; /* always return at least the original pathkeys/clauses */ - info = makeNode(PathKeyInfo); + info = makeNode(GroupByOrdering); info->pathkeys = pathkeys; info->clauses = clauses; infos = lappend(infos, info); @@ -539,7 +539,7 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path) (enable_incremental_sort || n == root->num_groupby_pathkeys) && !pathkeys_are_duplicate(infos, pathkeys)) { - info = makeNode(PathKeyInfo); + info = makeNode(GroupByOrdering); info->pathkeys = pathkeys; info->clauses = clauses; @@ -564,7 +564,7 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path) (enable_incremental_sort || n == list_length(root->sort_pathkeys)) && !pathkeys_are_duplicate(infos, pathkeys)) { - info = makeNode(PathKeyInfo); + info = makeNode(GroupByOrdering); info->pathkeys = pathkeys; 
info->clauses = clauses; @@ -574,18 +574,29 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path) #ifdef USE_ASSERT_CHECKING { - PathKeyInfo *pinfo = linitial_node(PathKeyInfo, infos); + GroupByOrdering *pinfo = linitial_node(GroupByOrdering, infos); ListCell *lc; /* Test consistency of info structures */ for_each_from(lc, infos, 1) { - info = lfirst_node(PathKeyInfo, lc); + ListCell *lc1, *lc2; + + info = lfirst_node(GroupByOrdering, lc); Assert(list_length(info->clauses) == list_length(pinfo->clauses)); Assert(list_length(info->pathkeys) == list_length(pinfo->pathkeys)); Assert(list_difference(info->clauses, pinfo->clauses) == NIL); Assert(list_difference_ptr(info->pathkeys, pinfo->pathkeys) ==
Re: POC: GROUP BY optimization
On 4/12/24 06:44, Tom Lane wrote:

* It seems like root->processed_groupClause no longer has much to do with the grouping behavior, which is scary given how much code still believes that it does. I suspect there are bugs lurking there, and

1. Analysing the code, processed_groupClause is used in:
- adjust_foreign_grouping_path_cost - to decide whether we need to add the cost of a sort to the foreign grouping;
- just for replacing relids during the SJE process;
- create_groupingsets_plan(), preprocess_grouping_sets, remove_useless_groupby_columns - we don't apply this feature in the case of grouping sets;
- get_number_of_groups - estimation shouldn't depend on the order of clauses;
- create_grouping_paths - to analyse hashability of the grouping clause;
- make_group_input_target, make_partial_grouping_target - doesn't depend on the order of clauses;
- planner.c: add_paths_to_grouping_rel, create_partial_grouping_paths - in the case of hash grouping.
So, the only place where we can impact current behavior lies in postgres_fdw. But there we don't really know which plan will be chosen during planning on the foreign node, and staying with the same behavior looks legal to me.

am not comforted by the fact that the patch changed exactly nothing in the pathnodes.h documentation of that field. This comment looks pretty silly now too:

/* Preprocess regular GROUP BY clause, if any */
root->processed_groupClause = list_copy(parse->groupClause);

What "preprocessing" is going on there? This comment was adequate before, when the code line invoked preprocess_groupclause which had a bunch of commentary; but now it has to stand on its own and it's not doing a great job of that.

Thanks for picking this place; I fixed it. More interesting here is that we move decisions on the order of group-by clauses to the stage where we already have the underlying subtree and incoming pathkeys. In principle, we could return the preprocessing routine and arrange GROUP-BY with the ORDER-BY clause as it was before.
But looking into the code, I found only one reason to do this: adjust_group_pathkeys_for_groupagg. I am curious how much profit we get here; I think we can investigate it later, with no hurry. A good outcome here will be a test case that can show the importance of arranging GROUP-BY and ORDER-BY at an early stage.

-- regards, Andrei Lepikhov Postgres Professional

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5ea3705796..861656a192 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1424,7 +1424,10 @@ grouping_planner(PlannerInfo *root, double tuple_fraction,
 	}
 	else if (parse->groupClause)
 	{
-		/* Preprocess regular GROUP BY clause, if any */
+		/*
+		 * Make a copy of origin groupClause because we are going to
+		 * remove redundant clauses.
+		 */
 		root->processed_groupClause = list_copy(parse->groupClause);
 		/* Remove any redundant GROUP BY columns */
 		remove_useless_groupby_columns(root);
Re: POC: GROUP BY optimization
On 4/12/24 06:44, Tom Lane wrote:

* I'm pretty unconvinced by group_keys_reorder_by_pathkeys (which I notice has already had one band-aid added to it since commit). In particular, it seems to believe that the pathkeys and clauses lists match one-for-one, but I seriously doubt that that invariant remains guaranteed after the cleanup steps

/* append the remaining group pathkeys (will be treated as not sorted) */
*group_pathkeys = list_concat_unique_ptr(new_group_pathkeys, *group_pathkeys);
*group_clauses = list_concat_unique_ptr(new_group_clauses, *group_clauses);

For that to be reliable, the SortGroupClauses added to new_group_clauses in the main loop have to be exactly those that are associated with the same pathkeys in the old lists. I doubt that that's necessarily true in the presence of redundant grouping clauses. (Maybe we can't get here with any redundant grouping clauses, but still, we don't really guarantee uniqueness of SortGroupClauses, much less that they are never copied which is what you need if you want to believe that pointer equality is sufficient for de-duping here. PathKeys are explicitly made to be safe to compare pointer-wise, but I know of no such guarantee for SortGroupClauses.)

I spent a lot of time inventing situations with SortGroupClause duplicates; unfortunately, so far it looks impossible. But because we really don't guarantee uniqueness, I changed the code to survive in this case. Also, I added assertion checking to be sure we don't have logical mistakes here - see the attachment.

About the band-aid mentioned above: as I see it, 4169850 introduces the same trick in planner.c, so it looks like a result of the design of the current code.
-- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index 5d9597adcd..aa80f6486c 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -380,15 +380,18 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, /* * We're going to search within just the first num_groupby_pathkeys of - * *group_pathkeys. The thing is that root->group_pathkeys is passed as + * *group_pathkeys. The thing is that root->group_pathkeys is passed as * *group_pathkeys containing grouping pathkeys altogether with aggregate - * pathkeys. If we process aggregate pathkeys we could get an invalid + * pathkeys. If we process aggregate pathkeys we could get an invalid * result of get_sortgroupref_clause_noerr(), because their - * pathkey->pk_eclass->ec_sortref doesn't referece query targetlist. So, + * pathkey->pk_eclass->ec_sortref doesn't reference query targetlist. So, * we allocate a separate list of pathkeys for lookups. */ grouping_pathkeys = list_copy_head(*group_pathkeys, num_groupby_pathkeys); + /* Make a new copy before reordering clauses */ + *group_clauses = list_copy(*group_clauses); + /* * Walk the pathkeys (determining ordering of the input path) and see if * there's a matching GROUP BY key. If we find one, we append it to the @@ -400,8 +403,8 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, */ foreach(lc, pathkeys) { - PathKey*pathkey = (PathKey *) lfirst(lc); - SortGroupClause *sgc; + PathKey *pathkey = (PathKey *) lfirst(lc); + SortGroupClause *sgc; /* * Pathkeys are built in a way that allows simply comparing pointers. 
@@ -431,17 +434,25 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, Assert(OidIsValid(sgc->sortop)); new_group_pathkeys = lappend(new_group_pathkeys, pathkey); + + /* + * Keeping in mind that the SortGroupClause list doesn't guarantee + * unique pointers we must explicitly transfer elements one-by-one. + */ new_group_clauses = lappend(new_group_clauses, sgc); + *group_clauses = list_delete_ptr(*group_clauses, sgc); } /* remember the number of pathkeys with a matching GROUP BY key */ n = list_length(new_group_pathkeys); - /* append the remaining group pathkeys (will be treated as not sorted) */ + /* + * Append the remaining group pathkeys (will be treated as not sorted) and + * grouping clauses. + */ *group_pathkeys = list_concat_unique_ptr(new_group_pathkeys, *group_pathkeys); - *group_clauses = list_concat_unique_ptr(new_group_clauses, - *group_clauses); + *group_clauses = list_concat(new_group_clauses, *group_clauses); list_free(grouping_pathkeys); return n; @@ -484,9 +495,9 @@ pathkeys_are_duplicate(List *infos, List *pathkeys) List * get_useful_group_keys_orderings(PlannerInfo *root, Path *path) { - Query *parse = root->parse; - List *infos = NIL; - PathKeyInfo *info; + Query *parse = root->parse; + List *infos = NIL; + PathKeyInfo *info; List *pathkeys = r
Re: POC: GROUP BY optimization
On 4/12/24 06:44, Tom Lane wrote: * The very first hunk causes get_eclass_for_sort_expr to have side-effects on existing EquivalenceClass data structures. What is the argument that that's not going to cause problems? At the very least it introduces asymmetry, in that the first caller wins the chance to change the EC's ec_sortref and later callers won't be able to. Let me resolve issues bit by bit. Addressing your first note, 'asymmetry,' I discovered this part of the code. Before the 8d83a5d, it could cause some artefacts, but since then, a kind of asymmetry has been introduced into the GROUP-BY clause list. As you can see in the attached patch, GROUP-BY reordering works even when the asymmetry of the 8d83a5d chooses different keys. At the same time, I agree that implicitly setting anything in an exporting function, which should just look up for EC, is a questionable coding style. To avoid possible usage issues (in extensions, for example), I propose setting it up afterwards, explicitly forcing this action by input parameter - see my attempt in the attachment. -- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index a619ff9177..bc08dfadaf 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -652,18 +652,7 @@ get_eclass_for_sort_expr(PlannerInfo *root, if (opcintype == cur_em->em_datatype && equal(expr, cur_em->em_expr)) - { -/* - * Match! - * - * Copy the sortref if it wasn't set yet. That may happen if - * the ec was constructed from a WHERE clause, i.e. it doesn't - * have a target reference at all. - */ -if (cur_ec->ec_sortref == 0 && sortref > 0) - cur_ec->ec_sortref = sortref; -return cur_ec; - } +return cur_ec; /* Match! 
*/ } } diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index 1d61881a6b..5d9597adcd 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -1355,7 +1355,8 @@ make_pathkeys_for_sortclauses(PlannerInfo *root, , tlist, false, - ); + , + false); /* It's caller error if not all clauses were sortable */ Assert(sortable); return result; @@ -1379,13 +1380,16 @@ make_pathkeys_for_sortclauses(PlannerInfo *root, * to remove any clauses that can be proven redundant via the eclass logic. * Even though we'll have to hash in that case, we might as well not hash * redundant columns.) + * *set_ec_sortref is true if caller wants to set up ec_sortref field in + * the pathkey Equivalence Class, if it have not initialized beforehand. */ List * make_pathkeys_for_sortclauses_extended(PlannerInfo *root, List **sortclauses, List *tlist, bool remove_redundant, - bool *sortable) + bool *sortable, + bool set_ec_sortref) { List *pathkeys = NIL; ListCell *l; @@ -1409,6 +1413,13 @@ make_pathkeys_for_sortclauses_extended(PlannerInfo *root, sortcl->nulls_first, sortcl->tleSortGroupRef, true); + if (pathkey->pk_eclass->ec_sortref == 0 && set_ec_sortref) + /* + * Copy the sortref if it wasn't set yet. That may happen if + * the ec was constructed from a WHERE clause, i.e. it doesn't + * have a target reference at all. + */ + pathkey->pk_eclass->ec_sortref = sortcl->tleSortGroupRef; /* Canonical form eliminates redundant ordering keys */ if (!pathkey_is_redundant(pathkey, pathkeys)) diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 5320da51a0..dee98c9a90 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -3391,7 +3391,7 @@ standard_qp_callback(PlannerInfo *root, void *extra) /* * With a plain GROUP BY list, we can remove any grouping items that * are proven redundant by EquivalenceClass processing. 
For example, - * we can remove y given "WHERE x = y GROUP BY x, y". These aren't + * especially common cases, but they're nearly free to detect. Note * that we remove redundant items from processed_groupClause but not * the original parse->groupClause. @@ -3403,7 +3408,8 @@ standard_qp_callback(PlannerInfo *root, void *extra) &root->processed_groupClause, tlist, true, - &sortable); + &sortable, + true); if (!sortable) { /* Can't sort; no point in considering aggregate ordering either */ @@ -3453,7 +3454,8 @@ standard_qp_callback(PlannerInfo *root, void *extra)
Re: POC: GROUP BY optimization
On 4/12/24 06:44, Tom Lane wrote:

If this patch were producing better results I'd be more excited about putting more work into it. But on the basis of what I'm seeing right now, I think maybe we ought to give up on it.

First, thanks for the deep review - sometimes, only a commit gives us a chance to get such an observation :))).

On a broader note, introducing automatic group-by-order choosing is a step towards training the optimiser to handle poorly tuned incoming queries. While it's true that this may initially impact performance, it's crucial to weigh the potential benefits. So, beforehand, we should agree: is it worth it? I would say yes: I see how often hashing doesn't work in grouping - sometimes because of estimation errors, sometimes because grouping already has sorted input, sometimes in analytical queries when the planner doesn't have enough memory for hashing. In analytical cases, the only way to speed up such queries is sometimes to be smart with features like IncrementalSort and this one.

About low efficiency: remember the previous version of the GROUP-BY optimisation - we disagreed on operator costs and the cost model in general. In the current version, we went the opposite way, adding small features step by step. The current commit contains an integral part of the feature and is designed for safely testing the approach and adding more profitable parts, like choosing group-by order according to distinct values or unique indexes on grouping columns.

I have passed through the code, steered by the issues you explained in detail. I see seven issues. Two of them definitely should be scrutinised right now, and I'm ready to do that.

-- regards, Andrei Lepikhov Postgres Professional
Re: post-freeze damage control
On 9/4/2024 12:55, Tom Lane wrote:

Andrei Lepikhov writes: * I really, really dislike jamming this logic into prepqual.c, where it has no business being. I note that it was shoved into process_duplicate_ors without even the courtesy of expanding the header comment: Yeah, I preferred to do it in parse_expr.c with the assumption of some 'minimal' or 'canonical' tree form.

That seems quite the wrong direction to me. AFAICS, the argument for making this transformation depends on being able to convert to an indexscan condition, so I would try to apply it even later, when we have a set of restriction conditions to apply to a particular baserel. (This would weaken the argument that we need hashing rather than naive equal() tests even further, I think.) Applying the transform to join quals seems unlikely to be a win.

Our first prototype did this job right at the stage of index path creation. Unfortunately, this approach was too narrow and expensive. The most problematic cases we encountered came from BitmapOr paths: if an incoming query has a significant number of OR clauses, the optimizer spends a lot of time generating these paths, senseless in most cases (remember also the memory allocated for that purpose). Imagine how much worse the situation becomes when we scale it with partitions.

Another issue we resolved with this transformation: a shorter list of clauses speeds up planning and sometimes makes cardinality estimation more accurate. Moreover, it helps even SeqScan: attempting to find a value in a hashed array is much faster than cycling through a long expression for each incoming tuple.

One more idea that I have set aside here is that the planner could utilise quick clause hashing: from time to time, in the mailing list, I see disputes on different approaches to expression transformation/simplification/grouping, and most of the time it ends up with the problem of search complexity. A clause hash could be a way to solve this, couldn't it?
-- regards, Andrei Lepikhov Postgres Professional
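For readers following along, the transformation under discussion rewrites an OR chain of equalities over the same expression into a scalar-array operation. Schematically, using the regression-test table tenk1 (plans are not shown, since they depend on the patch version and settings):

```sql
-- Original qual: a chain of ORs over the same column and operator.
SELECT count(*) FROM tenk1
WHERE unique1 = 1 OR unique1 = 2 OR unique1 = 42;

-- After the OR -> ANY (SAOP) transformation the planner effectively sees:
SELECT count(*) FROM tenk1
WHERE unique1 = ANY (ARRAY[1, 2, 42]);
```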
Re: post-freeze damage control
On 9/4/2024 09:12, Tom Lane wrote:

I have another one that I'm not terribly happy about: Author: Alexander Korotkov Branch: master [72bd38cc9] 2024-04-08 01:27:52 +0300 Transform OR clauses to ANY expression

Because I'm the primary author of the idea, let me answer.

I don't know that I'd call it scary exactly, but I do think it was premature. A week ago there was no consensus that it was ready to commit, but Alexander pushed it (or half of it, anyway) despite that. A few concrete concerns: * Yet another planner GUC. Do we really need or want that?

It is the most interesting question here. Looking at planner features that were designed but not applied for the same reason - they can produce suboptimal plans in corner cases - I think about inventing flag-type parameters and hiding under such flags the features that work better for specific load types.

* What the medical community would call off-label usage of query jumbling. I'm not sure this is even correct as-used, and for sure it's using that code for something never intended. Nor is the added code adequately (as in, at all) documented.

I agree about the documentation and disagree with the critique of expression jumbling. It was introduced in the core; why don't we allow it to be used to speed up machinery with some hashing?

* Patch refuses to group anything but Consts into the SAOP transformation. I realize that if you want to produce an array Const you need Const inputs, but I wonder why it wasn't considered to produce an ARRAY[] construct if there are available clauses with pseudo-constant (eg Param) comparison values.

Good point. I think we can consider that in the future.

* I really, really dislike jamming this logic into prepqual.c, where it has no business being. I note that it was shoved into process_duplicate_ors without even the courtesy of expanding the header comment:

Yeah, I preferred to do it in parse_expr.c with the assumption of some 'minimal' or 'canonical' tree form.
You can see this code in the previous version. I think we don't have any bugs here, but we have different opinions on how it should work.

* process_duplicate_ors * Given a list of exprs which are ORed together, try to apply * the inverse OR distributive law. Another reason to think this wasn't a very well chosen place is that the file's list of #include's went from 4 entries to 11. Somebody should have twigged to the idea that this was off-topic for prepqual.c.

* OrClauseGroupKey is not a Node type, so why does it have a NodeTag? I wonder what value will appear in that field, and what will happen if the struct is passed to any code that expects real Nodes.

It's a hack authored by Alexander. I guess he can provide additional reasons in support of that.

I could probably find some other nits if I spent more time on it, but I think these are sufficient to show that this was not commit-ready.

It's up to you. On the one hand, I don't see any bugs or serious performance issues, and all the issues can be resolved further; on the other hand, I've got your good review and some ideas to work out.

-- regards, Andrei Lepikhov Postgres Professional
Re: type cache cleanup improvements
On 3/15/24 17:57, Teodor Sigaev wrote:

Okay, I've applied this piece for now. Not sure I'll have much room to look at the rest.

Thank you very much! I have spent some time reviewing this feature. I think we can discuss and apply it step by step, so the 0001-* patch is what I consider at this moment.

The feature addresses the issue of the TypCache being bloated by intensive usage of non-standard types and domains; that bloat adds significant overhead during relcache invalidation because of thorough scans of this hash table. IMO, this feature will be handy soon, as we already see some patches where the TypCache is intensively used for storing composite types - for example, look into the solutions proposed in [1].

One of my main concerns with this feature is the possibility of lost entries, which could be mistakenly used by relations with the same oid in the future. This seems particularly possible in cases with multiple temporary tables. The author has attempted to address this by replacing the typrelid and type_id fields in the mapRelType on each call of lookup_type_cache. However, I believe we could further improve this by removing the entry from mapRelType on invalidation, thus avoiding this potential issue.

While reviewing the patch, I made some minor changes (see attachment) that you're free to adopt or reject. However, it's crucial that the patch includes a detailed explanation, not just a single sentence, to ensure everyone understands the changes.

Upon closer inspection, I noticed that the current implementation only invalidates the cache entry. While this is acceptable for standard types, it may not be sufficient when numerous custom types are maintained (as in the example in the initial letter) or in cases where whole-row vars are heavily used. In such scenarios, removing the entry and reducing the hash table's size might be more efficient.

In toto, the 0001-* patch looks good, and I would be glad to see it in the core.
[1] https://www.postgresql.org/message-id/flat/CAKcux6ktu-8tefLWtQuuZBYFaZA83vUzuRd7c1YHC-yEWyYFpg%40mail.gmail.com -- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/utils/cache/typcache.c b/src/backend/utils/cache/typcache.c index e3c32c7848..ed321603d5 100644 --- a/src/backend/utils/cache/typcache.c +++ b/src/backend/utils/cache/typcache.c @@ -74,16 +74,17 @@ #include "utils/typcache.h" -/* The main type cache hashtable searched by lookup_type_cache */ -static HTAB *TypeCacheHash = NULL; - /* The map from relation's oid to its type oid */ -typedef struct mapRelTypeEntry +typedef struct RelTypeMapEntry { Oid typrelid; Oid type_id; -} mapRelTypeEntry; +} RelTypeMapEntry; + +/* The main type cache hashtable searched by lookup_type_cache */ +static HTAB *TypeCacheHash = NULL; +/* Utility hash table to speed up processing of invalidation relcache events. */ static HTAB *mapRelType = NULL; /* List of type cache entries for domain types */ @@ -368,7 +369,7 @@ lookup_type_cache(Oid type_id, int flags) , HASH_ELEM | HASH_BLOBS); ctl.keysize = sizeof(Oid); - ctl.entrysize = sizeof(mapRelTypeEntry); + ctl.entrysize = sizeof(RelTypeMapEntry); mapRelType = hash_create("Map reloid to typeoid", 64, , HASH_ELEM | HASH_BLOBS); @@ -492,11 +493,11 @@ lookup_type_cache(Oid type_id, int flags) */ if (OidIsValid(typentry->typrelid) && typentry->typtype == TYPTYPE_COMPOSITE) { - mapRelTypeEntry *relentry; + RelTypeMapEntry *relentry; - relentry = (mapRelTypeEntry*) hash_search(mapRelType, - >typrelid, - HASH_ENTER, NULL); + relentry = (RelTypeMapEntry *) hash_search(mapRelType, + >typrelid, + HASH_ENTER, NULL); relentry->typrelid = typentry->typrelid; relentry->type_id = typentry->type_id; @@ -2297,7 +2298,7 @@ SharedRecordTypmodRegistryAttach(SharedRecordTypmodRegistry *registry) } static void -invalidateCompositeTypeCacheEntry(TypeCacheEntry *typentry) +invalidateTypeCacheEntry(TypeCacheEntry *typentry) { /* Delete tupdesc if we have it */ if 
(typentry->tupDesc != NULL) @@ -2348,11 +2349,11 @@ TypeCacheRelCallback(Datum arg, Oid relid) if (OidIsValid(relid)) { - mapRelTypeEntry *relentry; + RelTypeMapEntry *relentry; - relentry = (mapRelTypeEntry *) hash_search(mapRelType, - , - HASH_FIND, NULL); + relentry = (RelTypeMapEntry *) hash_search(mapRelType, + , + HASH_FIND, NULL); if (relentry != NULL) { @@ -2365,7 +2366,7 @@ TypeCacheRelCallback(Datum arg, Oid relid) Assert(typentry->typtype == TYPTYPE_COMPOSITE); Assert(relid == typentry->typrelid); -invalidateCompositeTypeCacheEntry(typentry); +invalidateTypeCacheEntry(typentry); } } @@ -2397,7 +2398,7 @@ TypeCacheRelCallback(Datum arg, Oid relid) { if (typentry->typtype == TYPTYPE_COMPOSITE) { -invalidateCompositeTypeCacheEntry(ty
Re: Asymmetric partition-wise JOIN
On 15/10/2023 13:25, Alexander Korotkov wrote: Great! I'm looking forward to the revised patch. Revising the code and opinions before restarting this work, I found two different possible strategies mentioned in the thread: 1. 'Common Resources' shares the materialised result of the inner table scan (a hash table in the case of HashJoin) to join each partition one by one. It gives us a profit in the case of parallel append and possibly other cases, like the one shown in the initial message. 2. 'Individual strategies' - By limiting the AJ feature to cases when the JOIN clause contains a partitioning expression, we can push an additional scan clause into each copy of the inner table scan, reduce the number of tuples scanned, and even prune something because of proven zero input. I see the pros and cons of both approaches. The first option is more straightforward, and its outcome is obvious in the case of parallel append. But how can we guarantee the same join type for each join? Why should we ignore the positive effect of different strategies for different partitions? The second strategy is more expensive for the optimiser, especially in the multipartition case. But as I can predict, it is easier to implement and looks more natural for the architecture. What do you think about that? -- regards, Andrei Lepikhov Postgres Professional
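For reference, here is a minimal sketch of the query shape this feature targets (object names are illustrative): a plain, non-partitioned relation joined to a partitioned one, where the join could be applied per-partition instead of once against the whole Append:

```sql
CREATE TABLE parted (id integer, payload text) PARTITION BY HASH (id);
CREATE TABLE parted_0 PARTITION OF parted FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE parted_1 PARTITION OF parted FOR VALUES WITH (MODULUS 2, REMAINDER 1);
CREATE TABLE plain (id integer, info text);

-- With asymmetric partition-wise join, the planner would consider joining
-- "plain" to each partition of "parted" separately. Strategy 1 above would
-- share one materialised hash of "plain" across the partitions; strategy 2
-- would push a clause derived from the partitioning expression into each
-- scan of "plain".
EXPLAIN (COSTS OFF)
SELECT * FROM parted p JOIN plain q USING (id);
```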
Re: Table AM Interface Enhancements
On 31/3/2024 00:33, Alexander Korotkov wrote:

I think the way forward might be to introduce the new API, which would isolate executor details from table AM. We may introduce a new data structure InsertWithArbiterContext which would contain EState and a set of callbacks which would avoid table AM from calling the executor directly. That should be our goal for pg18. Now, this is too close to FF to design a new API.

I'm a bit late, but have you ever considered adding some sort of index probing routine to the AM interface for estimation purposes? I am working on the problem of dubious estimations: for example, when we don't have an MCV list, when the equality over multiple clauses doesn't fit the MCV statistics, or when we detect that a constant value is out of the histogram range. In such cases (especially for [parameterized] JOINs), the optimizer could have a chance to probe the index and avoid huge underestimations. This makes sense especially for multicolumn filters/clauses. Having a probing AM method, we may invent something for this challenge.

-- regards, Andrei Lepikhov Postgres Professional
Re: POC, WIP: OR-clause support for indexes
On 28/3/2024 16:54, Alexander Korotkov wrote: The current patch has a boolean GUC enable_or_transformation. However, when we have just a few ORs to be transformed, we should get less performance gain from the transformation and higher chances to lose a good bitmap scan plan. When there is a huge list of ORs to be transformed, the performance gain is greater and it is less likely we could lose a good bitmap scan plan. What do you think about introducing a GUC threshold value: the minimum size of the list to do the OR-to-ANY transformation? min_list_or_transformation or something. I labelled it or_transformation_limit (see the attachment). Feel free to rename it. It's important to note that the limiting GUC doesn't operate symmetrically for the forward, OR -> SAOP, and backward, SAOP -> OR, operations. In the forward case, it functions as you've proposed. However, in the backward case, we only check whether the feature is enabled or not. This is due to our existing limitation, MAX_SAOP_ARRAY_SIZE, and the fact that we can't match the length of the original OR list with the sizes of the resulting SAOPs. For instance, a lengthy OR list with 100 elements can be transformed into 3 SAOPs, each with a size of around 30 elements. One aspect that requires attention is the potential inefficiency of our OR -> ANY transformation when we have fewer elements than MAX_SAOP_ARRAY_SIZE. This is because we perform a reverse transformation, ANY -> OR, at the stage of generating bitmap scans. If the BitmapScan path dominates, we may have done unnecessary work. Is this something we should address? But the concern above may just be a point of improvement for later: we can add one more strategy to the optimizer: testing each array element as an OR clause; we can also provide a BitmapOr path, where the SAOP is covered with a minimal number of partial indexes (likewise the previous version).
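A minimal sketch of how such a threshold could act (illustrative Python, not the patch's C code; the function and parameter names are invented): OR arms of the shape "expr op Const" are grouped by (expr, op), and only groups reaching the limit are collapsed into an ANY() array, so short OR lists keep their chance at a good bitmap scan plan.

```python
from collections import defaultdict

# Hypothetical model of a threshold-limited OR -> ANY transformation.

def or_to_any(or_arms, threshold):
    # or_arms: list of (expr, op, const) triples representing "expr op const".
    groups = defaultdict(list)
    for expr, op, const in or_arms:
        groups[(expr, op)].append(const)

    transformed, kept = [], []
    for (expr, op), consts in groups.items():
        if len(consts) >= threshold:
            # Enough arms over the same (expr, op): collapse into a SAOP.
            transformed.append((expr, op, "ANY", consts))
        else:
            # Below the threshold: keep the original OR arms untouched.
            kept.extend((expr, op, c) for c in consts)
    return transformed, kept
```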
-- regards, Andrei Lepikhov Postgres Professional From e42a7111a12ef82eecdb2e692d65e7ba6e43ad79 Mon Sep 17 00:00:00 2001 From: Alena Rybakina Date: Fri, 2 Feb 2024 22:01:09 +0300 Subject: [PATCH 1/2] Transform OR clauses to ANY expression. Replace (expr op C1) OR (expr op C2) ... with expr op ANY(ARRAY[C1, C2, ...]) on the preliminary stage of optimization when we are still working with the expression tree. Here C is a constant expression, 'expr' is non-constant expression, 'op' is an operator which returns boolean result and has a commuter (for the case of reverse order of constant and non-constant parts of the expression, like 'CX op expr'). Sometimes it can lead to not optimal plan. But we think it is better to have array of elements instead of a lot of OR clauses. Here is a room for further optimizations on decomposing that array into more optimal parts. Authors: Alena Rybakina , Andrey Lepikhov Reviewed-by: Peter Geoghegan , Ranier Vilela Reviewed-by: Alexander Korotkov , Robert Haas Reviewed-by: jian he --- .../postgres_fdw/expected/postgres_fdw.out| 8 +- doc/src/sgml/config.sgml | 18 + src/backend/nodes/queryjumblefuncs.c | 27 ++ src/backend/optimizer/prep/prepqual.c | 379 +- src/backend/utils/misc/guc_tables.c | 13 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/queryjumble.h | 1 + src/include/optimizer/optimizer.h | 2 + src/test/regress/expected/create_index.out| 172 +++- src/test/regress/expected/join.out| 60 ++- src/test/regress/expected/partition_prune.out | 211 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/tidscan.out | 21 +- src/test/regress/sql/create_index.sql | 44 ++ src/test/regress/sql/join.sql | 8 + src/test/regress/sql/partition_prune.sql | 18 + src/test/regress/sql/tidscan.sql | 4 + src/tools/pgindent/typedefs.list | 2 + 18 files changed, 950 insertions(+), 51 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 
b7af86d351..277ef3f385 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -8853,18 +8853,18 @@ insert into utrtest values (2, 'qux'); -- Check case where the foreign partition is a subplan target rel explain (verbose, costs off) update utrtest set a = 1 where a = 1 or a = 2 returning *; - QUERY PLAN - + QUERY PLAN +-- Update on public.utrtes
Re: POC, WIP: OR-clause support for indexes
On 14/3/2024 16:31, Alexander Korotkov wrote: On Wed, Mar 13, 2024 at 2:16 PM Andrei Lepikhov As you can see, this case is not related to partial indexes. Just no index is selective for the whole query. However, splitting the scan by the OR qual lets us use a combination of two selective indexes. I have rewritten the 0002-* patch according to your concern. A candidate and some thoughts are attached. As I see it, we have a problem here: expanding each array and trying to apply an element to each index can result in a lengthy planning stage. Also, an index scan with the SAOP may potentially be more effective than one with the list of OR clauses. Originally, the transformation's purpose was to reduce a query's complexity and the number of optimization paths to speed up planning and (sometimes) execution. Here, we reduce planning complexity only in the case of an array size larger than MAX_SAOP_ARRAY_SIZE. Maybe we can fall back to the previous version of the second patch, keeping in mind that someone who wants to get maximum profit from the BitmapOr scan of multiple indexes can just disable this optimization, enabling a deep search for the optimal scan? As a compromise solution, I propose adding one more option to the previous version: if an element doesn't fit any partial index, try to cover it with a plain index. In this case, we still do not guarantee the most optimal fit of elements to the set of indexes, but we speed up planning. Does that make sense? -- regards, Andrei Lepikhov Postgres Professional From d2d8944fc83ccd090653c1b15703a2c3ba096fa9 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 13 Mar 2024 12:26:02 +0700 Subject: [PATCH 2/2] Teach generate_bitmap_or_paths to build BitmapOr paths over SAOP clauses. Likewise OR clauses, discover SAOP array and try to split its elements between smaller sized arrays to fit a set of partial indexes. 
--- doc/src/sgml/config.sgml | 3 + src/backend/optimizer/path/indxpath.c | 74 +- src/backend/optimizer/util/predtest.c | 37 +++ src/backend/optimizer/util/restrictinfo.c | 13 + src/include/optimizer/optimizer.h | 3 + src/include/optimizer/restrictinfo.h | 1 + src/test/regress/expected/create_index.out | 24 +- src/test/regress/expected/select.out | 280 + src/test/regress/sql/select.sql| 82 ++ 9 files changed, 500 insertions(+), 17 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 2de6ae301a..0df56f44e3 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -5485,6 +5485,9 @@ ANY num_sync ( orclause)->args; + + if (!enable_or_transformation) + return orlist; + + if (restriction_is_saop_clause(rinfo)) + { + result = transform_saop_to_ors(root, rinfo); + return (result == NIL) ? list_make1(rinfo) : result; + } + + foreach(lc, orlist) + { + Expr *expr = (Expr *) lfirst(lc); + + if (IsA(expr, RestrictInfo) && restriction_is_saop_clause((RestrictInfo *) expr)) + { + List *sublist; + + sublist = extract_saop_ors(root, (RestrictInfo *) lfirst(lc)); + if (sublist != NIL) + { + result = list_concat(result, sublist); + continue; + } + + /* Need to return expr to the result list */ + } + + result = lappend(result, expr); + } + + return result; +} + /* * generate_bitmap_or_paths - * Look through the list of clauses to find OR clauses, and generate - * a BitmapOrPath for each one we can handle that way. Return a list - * of the generated BitmapOrPaths. + * Look through the list of clauses to find OR and SAOP clauses, and + * Each saop clause are splitted to be covered by partial indexes. + * generate a BitmapOrPath for each one we can handle that way. + * Return a list of the generated BitmapOrPaths. 
* * other_clauses is a list of additional clauses that can be assumed true * for the purpose of generating indexquals, but are not to be searched for @@ -1247,20 +1303,24 @@ generate_bitmap_or_paths(PlannerInfo *root, RelOptInfo *rel, foreach(lc, clauses) { RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc); - List *pathlist; + List *pathlist = NIL; Path *bitmapqual; ListCell *j; + List *orlist = NIL; - /* Ignore RestrictInfos that aren't ORs */ - if (!restriction_is_or_clause(rinfo)) + orlist = extract_saop_ors(root, rinfo); +
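The compromise described in the message above — try partial indexes first, then cover the leftovers with a plain index — can be sketched like this (hypothetical Python with invented names; the real patch works on index predicates and restriction clauses, not Python callables):

```python
# Illustrative greedy split of SAOP array elements across partial indexes,
# with a fallback group for a plain (non-partial) index.

def split_saop_elements(elements, partial_index_preds):
    # partial_index_preds: list of (index_name, predicate_fn) pairs.
    buckets = {name: [] for name, _ in partial_index_preds}
    plain = []
    for elem in elements:
        for name, pred in partial_index_preds:
            if pred(elem):
                # First matching partial index wins -- cheap, but not
                # guaranteed to be the globally optimal assignment.
                buckets[name].append(elem)
                break
        else:
            # No partial index covers the element: leave it for a plain index.
            plain.append(elem)
    # Drop partial indexes that received no elements.
    return {n: b for n, b in buckets.items() if b}, plain
```

The "first match wins" rule is exactly the trade-off described in the email: planning stays fast, at the cost of possibly missing the best element-to-index fit.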
Re: POC, WIP: OR-clause support for indexes
On 14/3/2024 17:39, Alexander Korotkov wrote: Thank you, Andrei. Looks like a very undesirable side effect. Do you have any idea why it happens? Partition pruning should work correctly for both transformed and non-transformed quals, so why does the transformation hurt it? Now we have the v23-0001-* patch with all issues resolved. The last one, which caused pruning at the execution stage, was about the necessity of evaluating the SAOP expression right after the transformation; in the previous version, the core executed it on the transformed expressions. > As you can see, this case is not related to partial indexes. Just no > index is selective for the whole query. However, splitting the scan by the > OR qual lets us use a combination of two selective indexes. Thanks for the case. I will try to resolve it. -- regards, Andrei Lepikhov Postgres Professional From 156c00c820a38e5e1856f07363af87b3109b5d77 Mon Sep 17 00:00:00 2001 From: Alena Rybakina Date: Fri, 2 Feb 2024 22:01:09 +0300 Subject: [PATCH 1/2] Transform OR clauses to ANY expression. Replace (expr op C1) OR (expr op C2) ... with expr op ANY(ARRAY[C1, C2, ...]) on the preliminary stage of optimization when we are still working with the expression tree. Here C is a constant expression, 'expr' is non-constant expression, 'op' is an operator which returns boolean result and has a commuter (for the case of reverse order of constant and non-constant parts of the expression, like 'CX op expr'). Sometimes it can lead to not optimal plan. But we think it is better to have array of elements instead of a lot of OR clauses. Here is a room for further optimizations on decomposing that array into more optimal parts. 
Authors: Alena Rybakina , Andrey Lepikhov Reviewed-by: Peter Geoghegan , Ranier Vilela Reviewed-by: Alexander Korotkov , Robert Haas Reviewed-by: jian he --- .../postgres_fdw/expected/postgres_fdw.out| 8 +- doc/src/sgml/config.sgml | 17 + src/backend/nodes/queryjumblefuncs.c | 27 ++ src/backend/optimizer/prep/prepqual.c | 374 +- src/backend/utils/misc/guc_tables.c | 11 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/queryjumble.h | 1 + src/include/optimizer/optimizer.h | 2 + src/test/regress/expected/create_index.out| 156 +++- src/test/regress/expected/join.out| 62 ++- src/test/regress/expected/partition_prune.out | 215 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 35 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 19 files changed, 929 insertions(+), 58 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 58a603ac56..a965b43cc6 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -8838,18 +8838,18 @@ insert into utrtest values (2, 'qux'); -- Check case where the foreign partition is a subplan target rel explain (verbose, costs off) update utrtest set a = 1 where a = 1 or a = 2 returning *; - QUERY PLAN - + QUERY PLAN +-- Update on public.utrtest Output: utrtest_1.a, utrtest_1.b Foreign Update on public.remp utrtest_1 Update on public.locp utrtest_2 -> Append -> Foreign Update on public.remp utrtest_1 - Remote SQL: UPDATE public.loct SET a = 1 WHERE (((a = 1) OR (a = 2))) RETURNING a, b + Remote SQL: UPDATE public.loct SET a = 1 WHERE ((a = ANY ('{1,2}'::integer[]))) RETURNING a, b -> Seq Scan on public.locp utrtest_2 Output: 1, utrtest_2.tableoid, 
utrtest_2.ctid, NULL::record - Filter: ((utrtest_2.a = 1) OR (utrtest_2.a = 2)) + Filter: (utrtest_2.a = ANY ('{1,2}'::integer[])) (10 rows) -- The new values are concatenated with ' triggered !' diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 65a6e6c408..2de6ae301a 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -5472,6 +5472,23 @@ ANY num_sync ( + enable_or_transformation (boolean) + +enable_or_transformation configuration parameter + + + + +Enables or
Re: POC, WIP: OR-clause support for indexes
On 14/3/2024 16:31, Alexander Korotkov wrote: On Wed, Mar 13, 2024 at 2:16 PM Andrei Lepikhov mailto:a.lepik...@postgrespro.ru>> wrote: > On 13/3/2024 18:05, Alexander Korotkov wrote: > > On Wed, Mar 13, 2024 at 7:52 AM Andrei Lepikhov > > Given all of the above, I think moving transformation to the > > canonicalize_qual() would be the right way to go. > Ok, I will try to move the code. > I have no idea about the timings so far. I recall the last time I got > bogged down in tons of duplicated code. I hope with an almost-ready > sketch, it will be easier. Thank you! I'll be looking forward to the updated patch. Okay, I moved the 0001-* patch to the prepqual.c module. See it in the attachment. I treat it as a transient patch. It has positive outcomes as well as negative ones. The most damaging result you can see in the partition_prune test: partition pruning, in some cases, moved to the executor initialization stage. I guess, we should avoid it somehow in the next version. -- regards, Andrei Lepikhov Postgres Professional From 170f6871540025d0d1683750442e7af902b11a40 Mon Sep 17 00:00:00 2001 From: Alena Rybakina Date: Fri, 2 Feb 2024 22:01:09 +0300 Subject: [PATCH 1/2] Transform OR clauses to ANY expression. Replace (expr op C1) OR (expr op C2) ... with expr op ANY(ARRAY[C1, C2, ...]) on the preliminary stage of optimization when we are still working with the expression tree. Here C is a constant expression, 'expr' is non-constant expression, 'op' is an operator which returns boolean result and has a commuter (for the case of reverse order of constant and non-constant parts of the expression, like 'CX op expr'). Sometimes it can lead to not optimal plan. But we think it is better to have array of elements instead of a lot of OR clauses. Here is a room for further optimizations on decomposing that array into more optimal parts. 
Authors: Alena Rybakina , Andrey Lepikhov Reviewed-by: Peter Geoghegan , Ranier Vilela Reviewed-by: Alexander Korotkov , Robert Haas Reviewed-by: jian he --- .../postgres_fdw/expected/postgres_fdw.out| 8 +- doc/src/sgml/config.sgml | 17 + src/backend/nodes/queryjumblefuncs.c | 27 ++ src/backend/optimizer/prep/prepqual.c | 371 +- src/backend/utils/misc/guc_tables.c | 11 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/queryjumble.h | 1 + src/include/optimizer/optimizer.h | 2 + src/test/regress/expected/create_index.out| 156 +++- src/test/regress/expected/join.out| 62 ++- src/test/regress/expected/partition_prune.out | 235 +-- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 19 +- src/test/regress/sql/create_index.sql | 35 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 19 files changed, 939 insertions(+), 61 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 58a603ac56..a965b43cc6 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -8838,18 +8838,18 @@ insert into utrtest values (2, 'qux'); -- Check case where the foreign partition is a subplan target rel explain (verbose, costs off) update utrtest set a = 1 where a = 1 or a = 2 returning *; - QUERY PLAN - + QUERY PLAN +-- Update on public.utrtest Output: utrtest_1.a, utrtest_1.b Foreign Update on public.remp utrtest_1 Update on public.locp utrtest_2 -> Append -> Foreign Update on public.remp utrtest_1 - Remote SQL: UPDATE public.loct SET a = 1 WHERE (((a = 1) OR (a = 2))) RETURNING a, b + Remote SQL: UPDATE public.loct SET a = 1 WHERE ((a = ANY ('{1,2}'::integer[]))) RETURNING a, b -> Seq Scan on public.locp utrtest_2 Output: 1, utrtest_2.tableoid, 
utrtest_2.ctid, NULL::record - Filter: ((utrtest_2.a = 1) OR (utrtest_2.a = 2)) + Filter: (utrtest_2.a = ANY ('{1,2}'::integer[])) (10 rows) -- The new values are concatenated with ' triggered !' diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 65a6e6c408..2de6ae301a 100644 -
Re: POC, WIP: OR-clause support for indexes
On 13/3/2024 18:05, Alexander Korotkov wrote: On Wed, Mar 13, 2024 at 7:52 AM Andrei Lepikhov Given all of the above, I think moving transformation to the canonicalize_qual() would be the right way to go. Ok, I will try to move the code. I have no idea about the timings so far. I recall the last time I got bogged down in tons of duplicated code. I hope with an almost-ready sketch, it will be easier. -- regards, Andrei Lepikhov Postgres Professional
Re: POC, WIP: OR-clause support for indexes
On 12/3/2024 22:20, Alexander Korotkov wrote: On Mon, Mar 11, 2024 at 2:43 PM Andrei Lepikhov I think you are right. It is probably a better place than any other to remove duplicates in an array. I just think we should sort and remove duplicates from entry->consts in one pass. Thus, this optimisation should be applied to sortable constants. Ok. The new version of the patch set implements everything we have agreed on so far. We can return the MAX_SAOP_ARRAY_SIZE constraint and Alena's approach to duplicate deletion for non-sortable cases at the end. Hmm, we already tried to do it at that point. I vaguely recall some issues caused by this approach. Anyway, it should be done as early as possible to increase the effect of the optimization. I think quite strong reasons were provided why this shouldn't be implemented at the parse analysis stage [1], [2], [3]. The canonicalize_qual() looks like quite an appropriate place for that since it does similar transformations. Ok. Let's discuss these reasons. In Robert's opinion [1,3], we should do the transformation based on the cost model. But in the canonicalize_qual routine, we still make the transformation blindly. Moreover, the second patch reduces the weight of this reason, doesn't it? Maybe we shouldn't think about that as an optimisation but as some 'general form of expression'? Peter [2] worries about the possible transformation outcomes at this stage. But remember, we already transform clauses like ROW() IN (...) to a series of ORs here, so it is not an issue. Am I wrong? Why did we discard the attempt with canonicalize_qual() on the previous iteration? The parsing stage is much more natural for building SAOP quals: we can reuse make_scalar_array_op and other stuff, for example. During the optimisation stage, only the list-partitioning machinery creates a SAOP based on a list of constants. So, in theory, it is possible to implement. But do we really need to make the code more complex? Links. 1. 
https://www.postgresql.org/message-id/CA%2BTgmoZCgP6FrBQEusn4yaWm02XU8OPeoEMk91q7PRBgwaAkFw%40mail.gmail.com 2. https://www.postgresql.org/message-id/CAH2-Wzm2%3Dnf_JhiM3A2yetxRs8Nd2NuN3JqH%3Dfm_YWYd1oYoPg%40mail.gmail.com 3. https://www.postgresql.org/message-id/CA%2BTgmoaOiwMXBBTYknczepoZzKTp-Zgk5ss1%2BCuVQE-eFTqBmA%40mail.gmail.com -- regards, Andrei Lepikhov Postgres Professional From 86d969f2598a03b1eba84c0c064707de34ee60ac Mon Sep 17 00:00:00 2001 From: Alena Rybakina Date: Fri, 2 Feb 2024 22:01:09 +0300 Subject: [PATCH 1/2] Transform OR clauses to ANY expression. Replace (expr op C1) OR (expr op C2) ... with expr op ANY(ARRAY[C1, C2, ...]) on the preliminary stage of optimization when we are still working with the expression tree. Here C is a constant expression, 'expr' is non-constant expression, 'op' is an operator which returns boolean result and has a commuter (for the case of reverse order of constant and non-constant parts of the expression, like 'CX op expr'). Sometimes it can lead to not optimal plan. But we think it is better to have array of elements instead of a lot of OR clauses. Here is a room for further optimizations on decomposing that array into more optimal parts. 
Authors: Alena Rybakina , Andrey Lepikhov Reviewed-by: Peter Geoghegan , Ranier Vilela Reviewed-by: Alexander Korotkov , Robert Haas Reviewed-by: jian he --- .../postgres_fdw/expected/postgres_fdw.out| 8 +- doc/src/sgml/config.sgml | 17 + src/backend/nodes/queryjumblefuncs.c | 27 ++ src/backend/parser/parse_expr.c | 416 ++ src/backend/utils/misc/guc_tables.c | 11 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/bin/pg_dump/t/002_pg_dump.pl | 6 +- src/include/nodes/queryjumble.h | 1 + src/include/optimizer/optimizer.h | 1 + src/test/regress/expected/create_index.out| 156 ++- src/test/regress/expected/join.out| 62 ++- src/test/regress/expected/partition_prune.out | 215 - src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 35 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 + src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 20 files changed, 974 insertions(+), 60 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 58a603ac56..a965b43cc6 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -8838,18 +8838,18 @@ insert into utrtest values (2, 'qux'); -- Check case where the foreign partition is a subplan target rel explain (verbose, costs off) update utrt
Re: a wrong index choose when statistics is out of date
On 8/3/2024 18:53, Andy Fan wrote: I just reviewed the bad query plans of the past half year internally, and I found many queries used a Nested Loop, which is the direct cause. Now I think I have found a new reason for this: the stale optimizer statistics cause the rows in the outer relation to be estimated as 1, which makes the Nested Loop be chosen. I'm not sure whether your idea could help with this, or could help more than mine in this respect. Having had the same problem for a long time, I've made an attempt and invented a patch that probes an index to determine whether the estimated constant is within the statistics' scope. I remember David's remark on the overhead problem, but I don't argue it here. This patch is on the table to have one more solution sketch for further discussion. Also, Andy, if you have a specific problem with index choosing, you can try a tiny option that makes the index-picking technique less dependent on the ordering of index lists [1]. [1] https://www.postgresql.org/message-id/9b3dbf6d-776a-4953-a5a4-609929393...@postgrespro.ru -- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c index 2635050861..bc8e59bbf3 100644 --- a/src/backend/utils/adt/like_support.c +++ b/src/backend/utils/adt/like_support.c @@ -643,7 +643,7 @@ patternsel_common(PlannerInfo *root, /* * Pattern specifies an exact match, so estimate as for '=' */ - result = var_eq_const(&vardata, eqopr, collation, prefix->constvalue, + result = var_eq_const(root, &vardata, eqopr, collation, prefix->constvalue, false, true, false); } else @@ -1294,7 +1294,7 @@ prefix_selectivity(PlannerInfo *root, VariableStatData *vardata, * probably off the end of the histogram, and thus we probably got a very * small estimate from the >= condition; so we still need to clamp. 
*/ - eq_sel = var_eq_const(vardata, eqopr, collation, prefixcon->constvalue, + eq_sel = var_eq_const(root, vardata, eqopr, collation, prefixcon->constvalue, false, true, false); prefixsel = Max(prefixsel, eq_sel); diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index cea777e9d4..be6213046e 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -273,7 +273,7 @@ eqsel_internal(PG_FUNCTION_ARGS, bool negate) * in the query.) */ if (IsA(other, Const)) - selec = var_eq_const(&vardata, operator, collation, + selec = var_eq_const(root, &vardata, operator, collation, ((Const *) other)->constvalue, ((Const *) other)->constisnull, varonleft, negate); @@ -286,13 +286,133 @@ eqsel_internal(PG_FUNCTION_ARGS, bool negate) return selec; } +/* + * Lookup the value in the index and try to estimate number of the tuples with + * the same value. + */ +static Selectivity +estimate_ntuples_by_index(PlannerInfo *root, VariableStatData *vardata, + Datum constval, + Oid collation, Oid regprocoid, int32 stawidth) +{ + RelOptInfo *rel = vardata->rel; + RangeTblEntry *rte = root->simple_rte_array[rel->relid]; + ListCell *lc; + int ntuples = 0; + Selectivity selec = -1.; + + foreach(lc, rel->indexlist) + { + IndexOptInfo *index = (IndexOptInfo *) lfirst(lc); + ScanKeyData scankeys[1]; + SnapshotData SnapshotNonVacuumable; + IndexScanDesc index_scan; + Relation heapRel; + Relation indexRel; + + if (index->relam != BTREE_AM_OID) + continue; + if (index->indpred != NIL) + continue; + if (index->hypothetical) + continue; + if (!match_index_to_operand(vardata->var, 0, index)) + continue; + + heapRel = table_open(rte->relid, NoLock); + indexRel = index_open(index->indexoid, NoLock); + + ScanKeyEntryInitialize(&scankeys[0], + 0, + 1, + BTEqualStrategyNumber, + vardata->atttype, +
Re: POC, WIP: OR-clause support for indexes
On 11/3/2024 18:31, Alexander Korotkov wrote: I'm not convinced about this limit. The initial reason was to combine long lists of ORs into the array because such a transformation made at an early stage increases efficiency. I understand the necessity of this limit in the array decomposition routine but not in the creation one. The comment near MAX_SAOP_ARRAY_SIZE says that this limit is because N^2 algorithms could be applied to arrays. Are you sure that's not true for our case? When you operate an array, indeed. But when we transform ORs to an array, not. Just check all the places in the optimizer and even the executor where we would pass along the list of ORs. This is why I think we should use this optimization even more intensively for huge numbers of ORs in an attempt to speed up the overall query. 3) Better save the original order of clauses by putting hash entries and untransformable clauses to the same list. A lot of differences in regression tests output have gone. I agree that reducing the number of changes in regression tests looks better. But to achieve this, you introduced a hack that increases the complexity of the code. Is it worth it? Maybe it would be better to make one-time changes in tests instead of getting this burden on board. Or have you meant something more introducing the node type? For me the reason is not just a regression test. The current code keeps the original order of quals as much as possible. The OR transformation code reorders quals even in cases when it doesn't eventually apply any optimization. I don't think that's acceptable. However, less hackery ways for this is welcome for sure. Why is it unacceptable? Can the user implement some order-dependent logic with clauses, and will it be correct? Otherwise, it is a matter of taste, and generally, this decision is up to you. We don't make array values unique. That might make query execution performance somewhat worse, and also makes selectivity estimation worse. 
I suggest Andrei and/or Alena should implement making array values unique. The fix Alena has made looks correct. But I urge you to think twice: The optimizer doesn't care about duplicates, so why do we do it? What's more, this optimization is intended to speed up queries with long OR lists. Using the list_append_unique() comparator on such lists could impact performance. I suggest sticking to the common rule and leaving the responsibility on the user's shoulders. I don't see why the optimizer doesn't care about duplicates for OR lists. As I showed before in an example, it successfully removes the duplicate. So, currently OR transformation clearly introduces a regression in terms of selectivity estimation. I think we should evade that. I think you are right. It is probably a better place than any other to remove duplicates in an array. I just think we should sort and remove duplicates from entry->consts in one pass. Thus, this optimisation should be applied to sortable constants. At least, we should do this optimization later, in one pass, with sorting elements before building the array. But what if we don't have a sort operator for the type? It was probably discussed before, but can we do our work later? There is a canonicalize_qual() which calls find_duplicate_ors(). This is the place where currently duplicate OR clauses are removed. Could our OR-to-ANY transformation be just another call from canonicalize_qual()? Hmm, we already tried to do it at that point. I vaguely recall some issues caused by this approach. Anyway, it should be done as quickly as possible to increase the effect of the optimization. -- regards, Andrei Lepikhov Postgres Professional
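The "sort and remove duplicates from entry->consts in one pass" idea mentioned above can be sketched in a few lines (illustrative Python with an invented name; the approach is applicable only to types that have a sort operator):

```python
# One-pass duplicate removal over sorted constants: sort once, then drop
# adjacent equal values while building the final array.

def sorted_unique_consts(consts):
    out = []
    for c in sorted(consts):
        if not out or out[-1] != c:
            out.append(c)
    return out
```

Compared to appending with a uniqueness check (quadratic over long OR lists), this is one O(n log n) sort plus a single linear scan.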
Re: POC, WIP: OR-clause support for indexes
On 7/3/2024 21:51, Alexander Korotkov wrote: Hi! On Tue, Mar 5, 2024 at 9:59 AM Andrei Lepikhov mailto:a.lepik...@postgrespro.ru>> wrote: > On 5/3/2024 12:30, Andrei Lepikhov wrote: > > On 4/3/2024 09:26, jian he wrote: > ... and the new version of the patchset is attached. I made some revisions for the patchset. Great! 1) Use hash_combine() to combine hash values. Looks better 2) Upper limit the number of array elements by MAX_SAOP_ARRAY_SIZE. I'm not convinced about this limit. The initial reason was to combine long lists of ORs into the array because such a transformation made at an early stage increases efficiency. I understand the necessity of this limit in the array decomposition routine but not in the creation one. 3) Better save the original order of clauses by putting hash entries and untransformable clauses to the same list. A lot of differences in regression tests output have gone. I agree that reducing the number of changes in regression tests looks better. But to achieve this, you introduced a hack that increases the complexity of the code. Is it worth it? Maybe it would be better to make one-time changes in tests instead of getting this burden on board. Or have you meant something more introducing the node type? We don't make array values unique. That might make query execution performance somewhat worse, and also makes selectivity estimation worse. I suggest Andrei and/or Alena should implement making array values unique. The fix Alena has made looks correct. But I urge you to think twice: The optimizer doesn't care about duplicates, so why do we do it? What's more, this optimization is intended to speed up queries with long OR lists. Using the list_append_unique() comparator on such lists could impact performance. I suggest sticking to the common rule and leaving the responsibility on the user's shoulders. At least, we should do this optimization later, in one pass, with sorting elements before building the array. 
But what if we don't have a sort operator for the type? -- regards, Andrei Lepikhov Postgres Professional
Re: a wrong index choose when statistics is out of date
On 7/3/2024 17:32, David Rowley wrote: On Thu, 7 Mar 2024 at 21:17, Andrei Lepikhov wrote: I would like to ask David why the var_eq_const estimator doesn't have an option for estimation with a histogram. Having that would relieve a problem with skewed data. Detecting the situation when an incoming const is out of the covered area would allow us to fall back to the ndistinct estimation or something else. At least, histogram usage can be restricted by the reltuples value and the ratio between the total number of MCV values and the total number of distinct values in the table. If you can think of a way to calculate it, you should propose a patch. IIRC, we try to make the histogram buckets evenly sized based on the number of occurrences. I've not followed the code in detail, but I'd guess that doing that allows us to just subtract off the MCV frequencies and assume the remainder is evenly split over each histogram bucket, so unless we had an n_distinct per histogram bucket, or at the very least n_distinct_for_histogram_values, how would the calculation look with what we currently record? Yeah, it is my mistake; I see nothing special here with such a kind of histogram: in the case of a coarse histogram net, the level of uncertainty in one bin is too high to make a better estimation. I am just pondering how to detect situations when the estimated constant is out of the statistics' scope, in order to apply alternative, more expensive logic involving the number of index pages outside the boundary, the index tuple width, and the distinct value. The left and right boundaries of the histogram are suitable detectors for such a situation. -- regards, Andrei Lepikhov Postgres Professional
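The boundary-based detector discussed above could look roughly like this (a hedged sketch; hist_bounds stands in for the sorted histogram boundary values, as exposed e.g. by pg_stats.histogram_bounds, and the function name is invented):

```python
# Use the histogram's outer boundaries to detect a constant lying outside
# the statistics' scope; inside the range, the ordinary estimate is kept,
# outside it, more expensive logic (index probing, ndistinct, ...) could
# be applied instead.

def const_outside_histogram(hist_bounds, const):
    if not hist_bounds:
        return True  # no histogram at all: nothing to detect with
    return const < hist_bounds[0] or const > hist_bounds[-1]
```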
Re: a wrong index choose when statistics is out of date
On 5/3/2024 19:56, Andy Fan wrote: I think it is OK for a design review; on the implementation side, the known issues include: 1. Supporting grabbing such information from its parent for a partitioned table if the child doesn't have such information. 2. Built-in documentation and testing. Any feedback is welcome. Thanks for your efforts. I was confused when you showed the problem connected to clauses like "Var op Const" and "Var op Param". As far as I know, the estimation logic of such clauses uses MCV and number-distinct statistics. So, being out of MCV values, it becomes totally insensitive to any internal skew in data and any data outside the statistics boundaries. Having studied the example you provided with the patch, I think it is not a correct example: the difference between var_eq_const and var_eq_non_const is quite obvious: in the second routine, you don't have information about the const value and can't use MCV for estimation. Also, you can't exclude MCV values from the estimation. And it is just luck that you've got the right answer. I think if you increased the weight of the unknown part, you would get a bad result, too. I would like to ask David why the var_eq_const estimator doesn't have an option for estimation with a histogram. Having that would relieve a problem with skewed data. Detecting the situation with an incoming const that is out of the covered area would allow us to fall back to ndistinct estimation or something else. At least, histogram usage can be restricted by the reltuples value and the ratio between the total number of MCV values and the total number of distinct values in the table. 
Just a demo of the data skew issue:

CREATE EXTENSION tablefunc;
CREATE TABLE norm_test AS
  SELECT abs(r::integer) AS val
  FROM normal_rand(1E7::integer, 5.::float8, 300.::float8) AS r;
ANALYZE norm_test;

-- The first query is estimated with MCV quite precisely:
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 100;
-- result: planned rows=25669, actual rows=25139

-- Here we have the ndistinct estimation, mostly arbitrary:
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 200;
-- result: planned rows=8604, actual rows=21239
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 500;
-- result: planned rows=8604, actual rows=6748
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 600;
-- result: planned rows=8604, actual rows=3501
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 700;
-- result: planned rows=8604, actual rows=1705
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 1000;
-- result: planned rows=8604, actual rows=91

-- regards, Andrei Lepikhov Postgres Professional
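The constant planned rows=8604 for every non-MCV constant in the demo above follows directly from the fallback arithmetic: the selectivity left over after the MCV list is spread evenly across the remaining distinct values, so every out-of-MCV constant gets the same estimate no matter how far out in the tail it sits. A sketch of that arithmetic (the statistics numbers below are invented to roughly reproduce the 8604 figure; they are not read from the actual pg_stats of this table):

```python
def non_mcv_rows(reltuples, mcv_freqs, ndistinct):
    # var_eq_const-style fallback for a constant not found in the MCV
    # list: the non-MCV fraction of rows is assumed to be split evenly
    # across the non-MCV distinct values.
    other_sel = (1.0 - sum(mcv_freqs)) / (ndistinct - len(mcv_freqs))
    return other_sel * reltuples

# Hypothetical stats: 100 MCV entries covering ~57% of the rows,
# ~600 distinct values in total, 10M rows.
est = non_mcv_rows(1e7, [0.0057] * 100, 600)
# est is ~8600 regardless of which non-MCV constant we probe --
# which is why 200, 500, 600, 700 and 1000 all get the same estimate.
```

The sketch makes the complaint concrete: the estimate depends only on the statistics totals, never on the constant itself, so the skew in the tail of the distribution is invisible to it.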
Re: "type with xxxx does not exist" when doing ExecMemoize()
On 6/3/2024 10:10, Tender Wang wrote: Andrei Lepikhov <a.lepik...@postgrespro.ru> wrote on Tue, Mar 5, 2024 at 17:36: On 1/3/2024 14:18, Tender Wang wrote: > During debugging, I learned that numeric_add doesn't have a type check like > rangetype does, so the above query will not report "type with xxx does not exist". > > And I realize that the test case added by Andrei Lepikhov in v3 is > right. So in the v6 patch I added Andrei Lepikhov's test case. Thanks a lot. > > Now I think the v6 version of the patch seems to be complete. I've looked through the patch, and it looks okay. Although I am afraid of the same problems that future changes can cause and how to detect them, it works correctly. Thanks for reviewing it; I added it to commitfest 2024-07. I think it is a bug. Should it be fixed (and back-patched) earlier? -- regards, Andrei Lepikhov Postgres Professional
Re: Hooking into ExplainOneQuery() complicated by missing standard_ExplainOneQuery
On 6/3/2024 06:25, Michael Paquier wrote: Just to elaborate: the intention was to allow a section to be added to every node in the plan containing information from further down, and also to allow this information to propagate upwards. We happen to have buffer information right now, but something similar could be added dynamically by extending ExplainNode and passing down a callback to standard_ExplainOneQuery. Or an extra hook at the end of ExplainNode() to be able to append more information at the node level? Not sure if others would agree with that, though. We already discussed EXPLAIN hooks, at least in [1]. IMO, extensions should have a chance to add something to the node explain output and the summary, if only because they can significantly influence the planner's and executor's behaviour. [1] https://www.postgresql.org/message-id/flat/6cd5caa7-06e1-4460-bf35-00a59da3f677%40garret.ru -- regards, Andrei Lepikhov Postgres Professional
Re: "type with xxxx does not exist" when doing ExecMemoize()
On 1/3/2024 14:18, Tender Wang wrote: During debugging, I learned that numeric_add doesn't have a type check like rangetype does, so the above query will not report "type with xxx does not exist". And I realize that the test case added by Andrei Lepikhov in v3 is right. So in the v6 patch I added Andrei Lepikhov's test case. Thanks a lot. Now I think the v6 version of the patch seems to be complete. I've looked through the patch, and it looks okay. Although I am afraid of the same problems that future changes can cause and how to detect them, it works correctly. -- regards, Andrei Lepikhov Postgres Professional
Re: POC, WIP: OR-clause support for indexes
On 5/3/2024 12:30, Andrei Lepikhov wrote: On 4/3/2024 09:26, jian he wrote: ... and the new version of the patchset is attached. -- regards, Andrei Lepikhov Postgres Professional From 1c3ac3e006cd66ff40f1ddaaa09e3fc0f3a75ca5 Mon Sep 17 00:00:00 2001 From: Alena Rybakina Date: Fri, 2 Feb 2024 22:01:09 +0300 Subject: [PATCH 1/2] Transform OR clauses to ANY expression. Replace (expr op C1) OR (expr op C2) ... with expr op ANY(ARRAY[C1, C2, ...]) on the preliminary stage of optimization when we are still working with the expression tree. Here C is a constant expression, 'expr' is non-constant expression, 'op' is an operator which returns boolean result and has a commuter (for the case of reverse order of constant and non-constant parts of the expression, like 'CX op expr'). Sometimes it can lead to not optimal plan. But we think it is better to have array of elements instead of a lot of OR clauses. Here is a room for further optimizations on decomposing that array into more optimal parts. Authors: Alena Rybakina , Andrey Lepikhov Reviewed-by: Peter Geoghegan , Ranier Vilela Reviewed-by: Alexander Korotkov , Robert Haas Reviewed-by: jian he --- .../postgres_fdw/expected/postgres_fdw.out| 16 +- doc/src/sgml/config.sgml | 17 + src/backend/nodes/queryjumblefuncs.c | 27 ++ src/backend/parser/parse_expr.c | 339 ++ src/backend/utils/misc/guc_tables.c | 11 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/bin/pg_dump/t/002_pg_dump.pl | 6 +- src/include/nodes/queryjumble.h | 1 + src/include/optimizer/optimizer.h | 1 + src/test/regress/expected/create_index.out| 156 +++- src/test/regress/expected/inherit.out | 2 +- src/test/regress/expected/join.out| 62 +++- src/test/regress/expected/partition_prune.out | 219 +-- src/test/regress/expected/rules.out | 4 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 35 ++ src/test/regress/sql/join.sql | 10 
+ src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 22 files changed, 906 insertions(+), 69 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index c355e8f3f7..0523bbd8f7 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -1349,7 +1349,7 @@ SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE Foreign Scan Output: t1.c1, t1.c2, ft5.c1, ft5.c2 Relations: (public.ft4 t1) LEFT JOIN (public.ft5) - Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 < 10) OR (r4.c1 IS NULL))) AND ((r1.c1 < 10)) + Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 IS NULL) OR (r4.c1 < 10))) AND ((r1.c1 < 10)) (4 rows) SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE c1 < 10) t2 ON (t1.c1 = t2.c1) @@ -3105,7 +3105,7 @@ select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = t2 Foreign Scan Output: (array_agg(DISTINCT (t1.c1 % 5))), ((t2.c1 % 3)) Relations: Aggregate on ((public.ft4 t1) FULL JOIN (public.ft5 t2)) - Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE (((r1.c1 < 20) OR ((r1.c1 IS NULL) AND (r2.c1 < 5 GROUP BY 2 ORDER BY array_agg(DISTINCT (r1.c1 % 5)) ASC NULLS LAST + Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE r1.c1 IS NULL) AND (r2.c1 < 5)) OR (r1.c1 < 20))) GROUP BY 2 ORDER BY array_agg(DISTINCT (r1.c1 % 5)) ASC NULLS LAST (4 rows) select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = t2.c1) where 
t1.c1 < 20 or (t1.c1 is null and t2.c1 < 5) group by (t2.c1)%3 order by 1; @@ -3123,7 +3123,7 @@ select array_agg(distinct (t1.c1)%5 order by (t1.c1)%5) from ft4 t1 full join ft Foreign Scan Output: (array_agg(DISTINCT (t1.c1 % 5) ORDER BY (t1.c1 % 5))), ((t2.c1 % 3)) Relations: Aggregate on ((public.ft4 t1) FULL JOIN (public.ft5 t2)) - Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5) ORDER BY ((r1.c
Re: POC, WIP: OR-clause support for indexes
On 4/3/2024 09:26, jian he wrote: On Thu, Feb 29, 2024 at 4:59 PM Andrei Lepikhov wrote: Feel free to add, change or totally rewrite these changes. On the replacement of the static ScalarArrayOpExpr dest with a dynamically allocated one: after discussion [1] I agree with that replacement. Some style (and language) changes in comments I haven't applied because they look debatable to me. I think it should be something like: + gettext_noop("Transform a sequence of OR expressions to an array expression."), + gettext_noop("The planner will replace expression like 'x=c1 OR x=c2 " + "to the expression 'x = ANY( ARRAY[c1,c2])''" Fixed. queryId may not be a good variable name here? Not sure. QueryId is a concept, part of the queryjumble technique, and can be used by other tools. It just tells the developer that it is the same thing as Query Jumbling but for a separate expression. At least you don't insist on removing the JumbleState return pointer, which looks strange for a simple hash ... The comment `/* Compute query ID */` seems not correct; here we are just hashing the expression? The same as above. +/* + * Dynahash match function to use in guc_hashtab -- the above comment seems not correct? Yes, fixed. `It applies to equality expressions only.` seems not correct? `select * from tenk1 where unique1 < 1 or unique1 < 2;` can also do the transformation. Yes, I forgot it. `similarity of variable sides` seems not correct; should it be `sameness of the variable sides`? The term 'equivalence' looks better *). In [2], we can read: SOME is a synonym for ANY. IN is equivalent to = ANY. But still, transforming OR to ANY is not intuitive. A normal user may not know what "transforming OR to ANY" is, so maybe adding a simple example would be great, which I did in the previous thread. Not sure. Examples in that section are unusual things. What's more, why should a user who doesn't know what it means change this setting? Let's wait for other opinions. 
[1] https://www.postgresql.org/message-id/2157387.1709068...@sss.pgh.pa.us -- regards, Andrei Lepikhov Postgres Professional
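On the queryId/jumbling terminology above: expression jumbling produces a structural fingerprint that deliberately ignores constant values, so expressions differing only in their constants collide, while structurally different ones do not. A toy sketch of the idea (plain Python hashing over tuples, not the real queryjumble code):

```python
import hashlib

def jumble(node):
    # Fingerprint an expression tree, replacing every constant with a
    # placeholder so that 'x = 1' and 'x = 99' hash the same while
    # 'y = 1' does not.
    kind = node[0]
    if kind == 'const':
        parts = ['const']                       # the value is deliberately ignored
    elif kind == 'var':
        parts = ['var', node[1]]
    else:                                       # ('op', name, left, right)
        parts = ['op', node[1], jumble(node[2]), jumble(node[3])]
    return hashlib.sha256('|'.join(parts).encode()).hexdigest()

a = ('op', '=', ('var', 'x'), ('const', 1))
b = ('op', '=', ('var', 'x'), ('const', 99))
c = ('op', '=', ('var', 'y'), ('const', 1))
# jumble(a) == jumble(b) (same structure), jumble(a) != jumble(c)
```

This is why reusing the queryId name is defensible: it is the same jumbling concept applied at expression granularity, and other tools could consume the fingerprint the same way.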
Re: a wrong index choose when statistics is out of date
On 4/3/2024 12:33, David Rowley wrote: [1] https://www.postgresql.org/message-id/CAApHDvo2sMPF9m%3Di%2BYPPUssfTV1GB%3DZ8nMVa%2B9Uq4RZJ8sULeQ%40mail.gmail.com Thanks for the link! Could we use the trick with get_actual_variable_range() to extrapolate histogram data beyond its boundaries when an index shows us that the actual min/max lie outside the known statistics? Because it would be used only for values outside the histogram, it should add overhead only when there is a reason for it. -- regards, Andrei Lepikhov Postgres Professional
Re: a wrong index choose when statistics is out of date
On 3/3/2024 14:01, Andy Fan wrote: 1. We can let the user declare that the column's value increases day by day. The syntax may be: ALTER TABLE x_events ALTER COLUMN created_at ALWAYS_INCREASED. Then, when a query like 'created_at op const' arrives, the statistics module can treat it as 'created_at = $1', so the missing statistics don't make a difference. Then I think the above issue can be avoided. Let me write some words to support your efforts in that direction. I also have some user cases where they periodically insert data in large chunks. These chunks contain 'always increased' values, and it causes trouble each time they start an analytic query over this new data before the ANALYZE command. I have thought about that issue before but invented nothing special except more aggressive analysis of such tables. Your trick can work, but it needs a new parameter in pg_type and a lot of additional code for such a rare case. I'm looking forward to the demo patch. -- regards, Andrei Lepikhov Postgres Professional
Re: POC, WIP: OR-clause support for indexes
On 1/3/2024 22:33, Alexander Korotkov wrote: I'm going to review and revise the patch. Nice! One question I still have. > /* > * Transformation only works with both side type is not > * { array | composite | domain | record }. Why do we limit transformation for these types? Also, it doesn't seem the current code restricts anything except composite/record. The answer can be a bit long. Let's look at commit a4424c5 first. We forbid record types although they can have typarray. It is because of the specifics of RowExpr comparison: although we have the record_eq() routine, all base types in the compared records must be strictly the same. Let me show: explain analyze SELECT * FROM (SELECT ROW(relpages,relnatts) AS x FROM pg_class LIMIT 10) AS q1, (SELECT ROW(relpages,relallvisible) AS x FROM pg_class LIMIT 10) AS q2 WHERE q1.x=q2.x; ERROR: cannot compare dissimilar column types smallint and integer at record column 2 As you can see, we have smallint and integer in the second position of the RowExpr, and it causes the ERROR. That is the reason why PostgreSQL transforms ROW expressions into a series of ORs. Look: explain (costs off) SELECT oid,relname FROM pg_class WHERE (oid,relname) IN ((1, 'a'), (2,'b')); Bitmap Heap Scan on pg_class Recheck Cond: ((relname = 'a'::name) OR (relname = 'b'::name)) Filter: (((oid = '1'::oid) AND (relname = 'a'::name)) OR ((oid = '2'::oid) AND (relname = 'b'::name))) -> BitmapOr ... So, transforming composite types to a ScalarArrayOpExpr expression doesn't make sense. Am I wrong? The same goes for domains: if one has a composite base type, we reject the transformation according to the logic above. What about arrays? As I see it, arrays don't have typarray, and we can avoid spending more cycles after detecting TYPCATEGORY_ARRAY. I haven't done it yet because I have a second thought: what if we combined the arrays into a larger one? I'm unsure about that, so we can forbid it too. -- regards, Andrei Lepikhov Postgres Professional
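For the scalar types the transformation does accept, the core of the OR-to-ANY rewrite amounts to grouping OR arms by their (expression, operator) pair and folding each multi-constant group into one array. A toy model of that grouping step (simplified tuples instead of parse nodes; not the parse_expr.c implementation):

```python
from collections import OrderedDict

def transform_or_to_any(or_args):
    # Group clauses of the form (var, op, const) by (var, op) and fold
    # each group with more than one constant into a single
    # (var, op, 'ANY', [consts]) entry; single-constant groups stay as-is.
    # A sketch of the TransformOrExprToANY idea, not the real code.
    groups = OrderedDict()
    for var, op, const in or_args:
        groups.setdefault((var, op), []).append(const)
    result = []
    for (var, op), consts in groups.items():
        if len(consts) > 1:
            result.append((var, op, 'ANY', consts))
        else:
            result.append((var, op, consts[0]))
    return result

clauses = [('x', '=', 1), ('x', '=', 2), ('y', '<', 10), ('x', '=', 3)]
folded = transform_or_to_any(clauses)
# x=1 OR x=2 OR y<10 OR x=3  ->  x = ANY(ARRAY[1,2,3]) OR y < 10
```

Note the sketch deliberately does not uniquify the collected constants, matching the behaviour debated earlier in the thread.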
Re: POC, WIP: OR-clause support for indexes
On 28/2/2024 17:27, Alena Rybakina wrote: Maybe like that: It also considers the way to generate a path using BitmapScan indexes, converting the transformed expression into expressions separated by "OR" operations, and if it turns out to be the best and finally selects the best one. Thanks, I spent some time describing the feature with documentation. A condensed description of the GUC is in the runtime-config file. Feature description has spread between TransformOrExprToANY and generate_saop_pathlist routines. Also, I've made tiny changes in the code to look more smoothly. All modifications are integrated into the two new patches. Feel free to add, change or totally rewrite these changes. -- regards, Andrei Lepikhov Postgres Professional From 015a564cc784139c806a7004f25bf5f7a4b4a29d Mon Sep 17 00:00:00 2001 From: Alena Rybakina Date: Fri, 2 Feb 2024 22:01:09 +0300 Subject: [PATCH 1/2] Transform OR clause to ANY expressions. Replace (expr op C1) OR (expr op C2) ... with expr op ANY(C1, C2, ...) on the preliminary stage of optimization when we are still working with the tree expression. Here C is a constant expression, 'expr' is non-constant expression, 'op' is an operator which returns boolean result and has a commuter (for the case of reverse order of constant and non-constant parts of the expression, like 'CX op expr'). Sometimes it can lead to not optimal plan. But we think it is better to have array of elements instead of a lot of OR clauses. Here is a room for further optimizations on decomposing that array into more optimal parts. 
Authors: Alena Rybakina , Andrey Lepikhov Reviewed-by: Peter Geoghegan , Ranier Vilela Reviewed-by: Alexander Korotkov , Robert Haas Reviewed-by: jian he --- .../postgres_fdw/expected/postgres_fdw.out| 16 +- doc/src/sgml/config.sgml | 18 + src/backend/nodes/queryjumblefuncs.c | 27 ++ src/backend/parser/parse_expr.c | 340 ++ src/backend/utils/misc/guc_tables.c | 11 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/bin/pg_dump/t/002_pg_dump.pl | 6 +- src/include/nodes/queryjumble.h | 1 + src/include/optimizer/optimizer.h | 1 + src/test/regress/expected/create_index.out| 156 +++- src/test/regress/expected/inherit.out | 2 +- src/test/regress/expected/join.out| 62 +++- src/test/regress/expected/partition_prune.out | 219 +-- src/test/regress/expected/rules.out | 4 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 35 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 22 files changed, 908 insertions(+), 69 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index c355e8f3f7..0523bbd8f7 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -1349,7 +1349,7 @@ SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE Foreign Scan Output: t1.c1, t1.c2, ft5.c1, ft5.c2 Relations: (public.ft4 t1) LEFT JOIN (public.ft5) - Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 < 10) OR (r4.c1 IS NULL))) AND ((r1.c1 < 10)) + Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 IS NULL) OR (r4.c1 < 
10))) AND ((r1.c1 < 10)) (4 rows) SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE c1 < 10) t2 ON (t1.c1 = t2.c1) @@ -3105,7 +3105,7 @@ select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = t2 Foreign Scan Output: (array_agg(DISTINCT (t1.c1 % 5))), ((t2.c1 % 3)) Relations: Aggregate on ((public.ft4 t1) FULL JOIN (public.ft5 t2)) - Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE (((r1.c1 < 20) OR ((r1.c1 IS NULL) AND (r2.c1 < 5 GROUP BY 2 ORDER BY array_agg(DISTINCT (r1.c1 % 5)) ASC NULLS LAST + Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE r1.c1 IS NULL) AND (r2.c1 < 5)) OR (r1.c1 < 20))) GROUP BY 2 ORDER BY array_
Re: POC, WIP: OR-clause support for indexes
On 28/2/2024 17:07, jian he wrote: doc/src/sgml/array.sgml corresponds to https://www.postgresql.org/docs/current/arrays.html. This GUC is related to the parser|optimizer; adding a GUC to array.sgml seems strange (I think). Maybe. In that case, I suggest adding extended comments to the functions transformBoolExprOr and generate_saop_pathlist (including cross-referencing each other). These are the starting points to understand the transformation and, therefore, a good place for a detailed explanation. -- regards, Andrei Lepikhov Postgres Professional
Re: "type with xxxx does not exist" when doing ExecMemoize()
On 28/2/2024 13:53, Tender Wang wrote: The attached patch is a new version based on v3 (not including Andrei's test case). There is no need to call datumCopy when isnull is true. I have not added a new runtime memory context so far and continue to use mstate->tableContext; I'm not sure whether the memory used by probeslot will affect mstate->mem_limit. Maybe adding a new memory context is better. I think I should spend a little time learning nodeMemoize.c more deeply. I am curious about your reasons to stay with tableContext. In terms of memory allocation, Richard's approach looks better. Also, you don't need to initialize tts_values[i] at all if tts_isnull[i] is set to true. -- regards, Andrei Lepikhov Postgres Professional
Re: POC, WIP: OR-clause support for indexes
On 26/2/2024 11:10, Alena Rybakina wrote: On 24.02.2024 14:28, jian he wrote: Hi. I wrote the first draft patch of the documentation. It's under the section: Planner Method Configuration (runtime-config-query.html), but this feature's main meat is in src/backend/parser/parse_expr.c, so it may be slightly inconsistent, as mentioned by others. You can furnish it further. Thank you for your work. I found a few spelling mistakes - I fixed them and added some information about generating a partial index plan. I attached it. Thanks, Jian and Alena. It is a good start for the documentation. But I think the runtime-config section needs only a condensed description of the feature underlying the GUC; the explanations in this section look a bit awkward. Having looked through the documentation for a better place for a detailed explanation, I found array.sgml as a candidate. Also, we have the parser's short overview section. I'm unsure about the best place, but either is better than the server config section. What's more, there are some weak points in the documentation: 1. We choose the constant and variable parts of an expression and don't require the constant to be on the right side. 2. We should describe the second part of the feature, where the optimiser can split an array to fit the optimal BitmapOr scan path. -- regards, Andrei Lepikhov Postgres Professional
Re: The const expression evaluation routine should always return a copy
On 28/2/2024 04:19, Tom Lane wrote: Andrei Lepikhov writes: IMO, the routine eval_const_expressions_mutator contains some stale code: I'd like to push back against the idea that eval_const_expressions is expected to return a freshly-copied tree. Its API specification contains no such claim. It's true that it appears to do that everywhere but here, but I think that's an implementation detail that callers had better not depend on. It's not hard to imagine someone trying to improve its performance by avoiding unnecessary copying. Thanks for the explanation. I was just such an extension developer who had looked into the internals of eval_const_expressions, found a call to a '_mutator' function, and made such a mistake :). I agree that freeing the returned node value doesn't provide a sensible benefit because the underlying bushy tree was copied during mutation. What's more, it makes even less sense with the selectivity context coming shortly (I hope). -- regards, Andrei Lepikhov Postgres Professional
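The mutator convention being debated - build and return fresh nodes, leaving the input tree untouched - is easy to demonstrate with a toy constant-folding mutator (a Python sketch of the pattern only; it has nothing to do with the actual clauses.c code paths):

```python
class Const:
    def __init__(self, val):
        self.val = val

class OpExpr:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

def const_fold(node):
    # Toy eval_const_expressions-style mutator: recurse, and collapse
    # an OpExpr over two Consts into a new Const node. Note it builds
    # new nodes rather than mutating or returning the input in place.
    if isinstance(node, Const):
        return Const(node.val)                  # fresh copy, per the convention
    l, r = const_fold(node.left), const_fold(node.right)
    if isinstance(l, Const) and isinstance(r, Const) and node.op == '+':
        return Const(l.val + r.val)             # constant-folded result
    return OpExpr(node.op, l, r)

tree = OpExpr('+', Const(2), OpExpr('+', Const(3), Const(4)))
folded = const_fold(tree)
# folded is a new Const(9); the original tree is untouched
```

A node type that returned the input unchanged (as the SubPlan case does) would break the "safe to free the result" assumption this sketch illustrates, which is exactly the trap described above.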
The const expression evaluation routine should always return a copy
IMO, the routine eval_const_expressions_mutator contains some stale code: case T_SubPlan: case T_AlternativeSubPlan: /* * Return a SubPlan unchanged --- too late to do anything with it. * * XXX should we ereport() here instead? Probably this routine * should never be invoked after SubPlan creation. */ return node; At least, this code can be reached via estimate_expression_value(). So, we should fix the comment. But maybe we need to do a bit more. According to the mutator idea, only the Query node is returned unchanged. If a SubPlan node is on top of the expression, the call above returns the same node, which may be unconventional. I'm not totally sure about the impossibility of constantifying a SubPlan: we already have InitPlans for uncorrelated subplans. Maybe there is something about parameters that can be estimated as constants at this level and, as a result, would allow returning a Const instead of a SubPlan? But at least we can return a flat copy of the SubPlan node just for the convention - the same thing for the AlternativeSubPlan. See the patch in the attachment. -- regards, Andrei Lepikhov Postgres Professional
From a227e7e72d917726085001027c350d2fda2ad3c4 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Tue, 27 Feb 2024 11:00:23 +0700 Subject: [PATCH] Force eval_const_expressions_mutator to return a copy of the node. In some situations, when SubPlan or AlternativeSubPlan nodes are on the top of the expression tree, this routine returns the same node, which could baffle users who, remembering the mutator convention, wanted to free the evaluation result immediately. Author: a.rybakina. 
--- src/backend/optimizer/util/clauses.c | 21 + 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c index edc25d712e..f09c6c6420 100644 --- a/src/backend/optimizer/util/clauses.c +++ b/src/backend/optimizer/util/clauses.c @@ -2907,15 +2907,28 @@ eval_const_expressions_mutator(Node *node, } case T_SubPlan: + { + SubPlan *subplan = makeNode(SubPlan); + + memcpy(subplan, node, sizeof(SubPlan)); + + /* +* Return a SubPlan unchanged. If the subplan had been uncorrelated +* it already have been converted to an InitPlan. +*/ + return (Node *) subplan; + } case T_AlternativeSubPlan: + { + AlternativeSubPlan *subplan = makeNode(AlternativeSubPlan); + + memcpy(subplan, node, sizeof(AlternativeSubPlan)); /* * Return a SubPlan unchanged --- too late to do anything with it. -* -* XXX should we ereport() here instead? Probably this routine -* should never be invoked after SubPlan creation. */ - return node; + return (Node *) subplan; + } case T_RelabelType: { RelabelType *relabel = (RelabelType *) node; -- 2.43.2
Re: "type with xxxx does not exist" when doing ExecMemoize()
On 26/2/2024 18:34, Richard Guo wrote: On Mon, Feb 26, 2024 at 3:54 PM Andrei Lepikhov <a.lepik...@postgrespro.ru> wrote: On 26/2/2024 12:44, Tender Wang wrote: > Make sense. I found MemoizeState already has a MemoryContext, so I used it. > I updated the patch. This approach is better for me. In the next version of this patch, I included a test case. I am still unsure about the context chosen and the stability of the test case. Richard, you recently fixed some Memoize issues; could you look at this problem and patch? I looked at this issue a bit. It seems to me that what happens is that at first the memory areas referenced by probeslot->tts_values[] are allocated in the per-tuple context (see prepare_probe_slot). And then in MemoizeHash_hash, after we've calculated the hashkey, we reset the per-tuple context. However, later in MemoizeHash_equal, we still need to reference the values in probeslot->tts_values[], which have been cleared. Agreed. Actually, the context would always be reset in MemoizeHash_equal, for both binary and logical mode. So I kind of wonder if it's necessary to reset the context in MemoizeHash_hash. I can only offer one thought against this solution: what if we have a lot of unique hash values, maybe all of them? In that case, we still have the kind of 'leak' David fixed in commit 0b053e78b5. Also, I have a segfault report from one client. As I see it, it was caused by a too-long text column in the table slot: the key value stored in the Memoize hash table was corrupted, and the most plausible reason is this bug. Should we add a test for this bug, and what do you think about the one proposed in v3? -- regards, Andrei Lepikhov Postgres Professional
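The lifecycle bug is easiest to see abstracted away from the executor: the probe values live in a scratch (per-tuple) arena that the hash function resets, so the later equality comparison reads freed memory unless the values were first copied into longer-lived storage. A Python sketch of that pattern (a toy arena standing in for MemoryContext; none of this is the real executor API):

```python
class Arena:
    # Toy bump arena: objects "allocated" here become invalid on reset.
    def __init__(self):
        self.objs = []
    def alloc(self, value):
        box = {'value': value, 'valid': True}
        self.objs.append(box)
        return box
    def reset(self):
        for box in self.objs:
            box['valid'] = False    # simulate freeing the memory
        self.objs.clear()

per_tuple = Arena()
table_ctx = Arena()

def prepare_probe(value, copy_to_table_ctx):
    box = per_tuple.alloc(value)    # evaluated in the per-tuple context
    if copy_to_table_ctx:           # the datumCopy()-style fix
        box = table_ctx.alloc(box['value'])
    return box

probe = prepare_probe('huge text key', copy_to_table_ctx=False)
per_tuple.reset()                   # what the hash step does
# probe['valid'] is now False: the equality check would read freed memory.
fixed = prepare_probe('huge text key', copy_to_table_ctx=True)
per_tuple.reset()
# fixed['valid'] stays True: the copy survives the per-tuple reset.
```

By-value keys (plain integers) never hit the problem because nothing is arena-allocated for them, which matches the observation that the bug only shows up with pass-by-reference keys such as numeric or text.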
Re: "type with xxxx does not exist" when doing ExecMemoize()
On 26/2/2024 12:44, Tender Wang wrote: Andrei Lepikhov <a.lepik...@postgrespro.ru> wrote on Mon, Feb 26, 2024 at 10:57: On 25/2/2024 20:32, Tender Wang wrote: > I think prepare_probe_slot() should call datumCopy, as the attached > patch does. > > Any thoughts? Thanks. Thanks for the report. I think it is better to invent a runtime memory context, like the one already designed in IndexScan and derivatives. Here, you just allocate the value in some upper memory context. Also, I'm curious why such a trivial error hasn't been found for a long time. Makes sense. I found MemoizeState already has a MemoryContext, so I used it. I updated the patch. This approach is better for me. In the next version of this patch, I included a test case. I am still unsure about the context chosen and the stability of the test case. Richard, you recently fixed some Memoize issues; could you look at this problem and patch? -- regards, Andrei Lepikhov Postgres Professional
From e32900e50730bccfde26355609d6b0b3e970f0a8 Mon Sep 17 00:00:00 2001 From: "tender.wang" Date: Mon, 26 Feb 2024 13:38:30 +0800 Subject: [PATCH] Store Memoize probeslot values in the hash table memory context. The probeslot values are evaluated in the expression memory context, which can be reset during the hash computation while we still need the values for the equality comparison. So we copy the result of ExecEvalExpr into probeslot. 
--- src/backend/executor/nodeMemoize.c| 23 --- src/test/regress/expected/memoize.out | 26 ++ src/test/regress/sql/memoize.sql | 14 ++ 3 files changed, 56 insertions(+), 7 deletions(-) diff --git a/src/backend/executor/nodeMemoize.c b/src/backend/executor/nodeMemoize.c index 18870f10e1..6402943772 100644 --- a/src/backend/executor/nodeMemoize.c +++ b/src/backend/executor/nodeMemoize.c @@ -312,17 +312,26 @@ prepare_probe_slot(MemoizeState *mstate, MemoizeKey *key) if (key == NULL) { ExprContext *econtext = mstate->ss.ps.ps_ExprContext; + Datum value; + bool isnull; + TupleDesc tup = pslot->tts_tupleDescriptor; + Form_pg_attribute att; MemoryContext oldcontext; - oldcontext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory); - /* Set the probeslot's values based on the current parameter values */ for (int i = 0; i < numKeys; i++) - pslot->tts_values[i] = ExecEvalExpr(mstate->param_exprs[i], - econtext, - >tts_isnull[i]); - - MemoryContextSwitchTo(oldcontext); + { + att = TupleDescAttr(tup, i); + value = ExecEvalExprSwitchContext(mstate->param_exprs[i], + econtext, + ); + /* Copy the value to avoid freed after resetting ExprContext */ + oldcontext = MemoryContextSwitchTo(mstate->tableContext); + pslot->tts_values[i] = datumCopy(value, att->attbyval, att->attlen); + MemoryContextSwitchTo(oldcontext); + + pslot->tts_isnull[i] = isnull; + } } else { diff --git a/src/test/regress/expected/memoize.out b/src/test/regress/expected/memoize.out index cf6886a288..9842b238c6 100644 --- a/src/test/regress/expected/memoize.out +++ b/src/test/regress/expected/memoize.out @@ -196,6 +196,32 @@ SELECT * FROM flt f1 INNER JOIN flt f2 ON f1.f >= f2.f;', false); (10 rows) DROP TABLE flt; +CREATE TABLE expr_key (x numeric, t text); +INSERT INTO expr_key (x,t) SELECT d1::numeric, d1::text FROM ( + SELECT round((d/pi())::numeric,7) AS d1 FROM generate_series(1,20) AS d); +-- Having duplicates we must see hits in the Memoize node +INSERT INTO expr_key SELECT * FROM expr_key; +CREATE 
INDEX expr_key_idx_x_t ON expr_key (x,t); +VACUUM ANALYZE expr_key; +SELECT explain_memoize(' +SELECT * FROM expr_key t1 INNER JOIN expr_key t2 +ON t1.x=t2.t::numeric AND t1.t::numeric=t2.x;', true); + explain_memoize +--- + Nested Loop (actual rows=80 loops=N) + -> Index Only Scan using expr_key_idx_x_t on expr_key t1 (actual rows=40 loops=N) + Heap Fetches: N + -> Memoize (actual rows=2 loops=N) + Cache Key: t1.x, (t1.t)::numeric +
Re: "type with xxxx does not exist" when doing ExecMemoize()
On 26/2/2024 09:52, Andrei Lepikhov wrote: On 25/2/2024 20:32, Tender Wang wrote: I think prepare_probe_slot() should call datumCopy, as the attached patch does. Any thoughts? Thanks. Thanks for the report. I think it is better to invent a Runtime Memory Context; likewise, it is already designed in IndexScan and derivatives. Here, you just allocate the value in some upper memory context. Also, I'm curious why such a trivial error hasn't been found for a long time. Hmmm. I see the problem (test.sql in attachment for reproduction and results). We only detect it by the number of Hits: Cache Key: t1.x, (t1.t)::numeric Cache Mode: logical Hits: 0 Misses: 30 Evictions: 0 Overflows: 0 Memory Usage: 8kB We see no hits in logical mode and 100 hits in binary mode. We see 15 hits for both logical and binary modes if the parameters are integer numbers - no problems with resetting the expression context. Your patch resolves the issue for logical mode - I see 15 hits for integer and complex keys. But I still see 100 hits in binary mode. Maybe we still have a problem? What's more, why doesn't the Memoize node see the problem at all? -- regards, Andrei Lepikhov Postgres Professional test.sql Description: application/sql
Re: "type with xxxx does not exist" when doing ExecMemoize()
On 25/2/2024 20:32, Tender Wang wrote: I think prepare_probe_slot() should call datumCopy, as the attached patch does. Any thoughts? Thanks. Thanks for the report. I think it is better to invent a Runtime Memory Context; likewise, it is already designed in IndexScan and derivatives. Here, you just allocate the value in some upper memory context. Also, I'm curious why such a trivial error hasn't been found for a long time. -- regards, Andrei Lepikhov Postgres Professional
Re: Removing unneeded self joins
On 21/2/2024 14:26, Richard Guo wrote: I think the right fix for these issues is to introduce a new element 'sublevels_up' in ReplaceVarnoContext, and enhance replace_varno_walker to 1) recurse into subselects with sublevels_up increased, and 2) perform the replacement only when varlevelsup is equal to sublevels_up. This code looks good. No idea how we lost it before. While writing the fix, I noticed some outdated comments. Such as in remove_rel_from_query, the first for loop updates otherrel's attr_needed as well as lateral_vars, but the comment only mentions attr_needed. So this patch also fixes some outdated comments. Thanks, looks good. While writing the test cases, I found that the test cases for SJE are quite messy. Below are what I have noticed: * There are several test cases using catalog tables like pg_class, pg_stats, pg_index, etc. for testing join removal. I don't see a reason why we need to use catalog tables, and I think this just raises the risk of instability. I see only one unusual query with pg_class involved. * In many test cases, a mix of uppercase and lowercase keywords is used in one query. I think it'd be better to maintain consistency by using either all uppercase or all lowercase keywords in one query. I see an uppercase -> lowercase change: select t1.*, t2.a as ax from sj t1 join sj t2 and lowercase -> uppercase in many other cases: explain (costs off) I guess it is a matter of taste, so I leave it to the committer's decision. Technically, it's OK. * In most situations, we verify the plan and the output of a query like: explain (costs off) select ...; select ...; The two select queries are supposed to be the same. But in the SJE test cases, I have noticed instances where the two select queries differ from each other. This patch also includes some cosmetic tweaks for SJE test cases. It does not change the test cases using catalog tables though. I think that should be a separate patch. 
I can't assess the necessity of changing these dozens of lines of code because I follow another commenting style, but technically, it's still OK. -- regards, Andrei Lepikhov Postgres Professional
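The sublevels_up rule Richard describes can be sketched with a toy expression tree (illustrative types, not the parser's Node/Var machinery): a Var references the removed relation only if its varno matches and its varlevelsup equals the current query nesting depth, so the walker bumps a counter when it descends into a subquery and replaces only on an exact match.

```c
#include <stddef.h>

typedef enum SketchTag { S_VAR, S_SUBQUERY } SketchTag;

typedef struct SketchNode
{
	SketchTag			tag;
	int					varno;			/* for S_VAR */
	int					varlevelsup;	/* for S_VAR */
	struct SketchNode  *child;			/* for S_SUBQUERY */
} SketchNode;

/* Replace references to relation 'from' with 'to', tracking how many
 * subquery levels we have descended - the essence of the fix to
 * replace_varno_walker described above. */
void
replace_varno_sketch(SketchNode *node, int from, int to, int sublevels_up)
{
	if (node == NULL)
		return;
	if (node->tag == S_VAR)
	{
		if (node->varno == from && node->varlevelsup == sublevels_up)
			node->varno = to;
	}
	else
		/* inside the subquery, outer references sit one level further up */
		replace_varno_sketch(node->child, from, to, sublevels_up + 1);
}
```

Without the depth check, a subquery's own Var with the same varno would be rewritten by mistake, which is exactly the class of bug the 'sublevels_up' element prevents.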
Re: POC: GROUP BY optimization
On 22/2/2024 13:35, Richard Guo wrote: The avg() function on integer argument is commonly used in aggregates.sql. I don't think this is an issue. See the first test query in aggregates.sql. Makes sense. > it should be parallel to the test cases for utilize the ordering of > index scan and subquery scan. Also, I'm unsure about removing the disabling of the max_parallel_workers_per_gather parameter. Have you discovered the domination of the current plan over the partial one? Do the cost fluctuations across platforms not trigger a parallel plan? The table used for testing contains only 100 tuples, which is the size of only one page. I don't believe it would trigger any parallel plans, unless we manually change min_parallel_table_scan_size. I don't intend to argue it, but just for the information, I frequently reduce it to zero, allowing PostgreSQL to make a decision based on costs. It sometimes works much better, because one small table in a multi-join can disallow an effective parallel plan. What's more, I suggest addressing here the complaint from [1]. As I see it, the cost difference between the Sort and IncrementalSort strategies in that case is around 0.5. To make the test more stable I propose to change it a bit and add a limit: SELECT count(*) FROM btg GROUP BY z, y, w, x LIMIT 10; It makes the efficacy of IncrementalSort more obvious - the difference is around 10 cost points. I don't think that's necessary. With Incremental Sort the final cost is: GroupAggregate (cost=1.66..19.00 rows=100 width=25) while with full Sort it is: GroupAggregate (cost=16.96..19.46 rows=100 width=25) With the STD_FUZZ_FACTOR (1.01), there is no doubt that the first path is cheaper on total cost. Not to say that even if somehow we decide the two paths are fuzzily the same on total cost, the first path still dominates because its startup cost is much cheaper. As before, I won't protest here - it needs some computations about how much cost can be added by bulk extension of the relation blocks. 
If Maxim answers that it's enough to resolve his issue, why not? -- regards, Andrei Lepikhov Postgres Professional
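For readers unfamiliar with how add_path breaks such ties, here is a simplified, standalone sketch of the fuzzy comparison rule. The real logic lives in compare_path_costs_fuzzily() and also weighs pathkeys, parameterization and row counts; this toy version only captures the total-then-startup cost rule with STD_FUZZ_FACTOR.

```c
#define STD_FUZZ_FACTOR 1.01

typedef struct PathCostSketch
{
	double		startup;
	double		total;
} PathCostSketch;

/* Returns -1 if a is cheaper, +1 if b is cheaper, 0 if fuzzily equal.
 * Totals are compared first with a 1% fuzz; only a fuzzy tie on total
 * cost lets the startup cost decide. */
int
fuzzy_cost_cmp(PathCostSketch a, PathCostSketch b)
{
	if (a.total > b.total * STD_FUZZ_FACTOR)
		return 1;
	if (b.total > a.total * STD_FUZZ_FACTOR)
		return -1;
	if (a.startup > b.startup * STD_FUZZ_FACTOR)
		return 1;
	if (b.startup > a.startup * STD_FUZZ_FACTOR)
		return -1;
	return 0;
}
```

With the costs quoted above, the incremental-sort path (1.66..19.00) wins on total cost alone, since 19.46 > 19.00 * 1.01; and even on a fuzzy total-cost tie its much cheaper startup cost would decide.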
Re: Unlinking Parallel Hash Join inner batch files sooner
On 22/2/2024 06:42, Thomas Munro wrote: On Wed, Feb 21, 2024 at 7:34 PM Andrei Lepikhov wrote: I see in [1] that the reporter mentioned a delay between the error message in parallel HashJoin and the return of control back from PSQL. Your patch might reduce this delay. Also, I have the same complaint from users who processed gigabytes of data in parallel HashJoin. Presumably, they also got stuck in unlinking tons of temporary files. So, are you going to do something with this code? Yeah, right. I will aim to get this into the tree next week. First, there are a couple of minor issues to resolve around freeing that Heikki mentioned. Then there is the question of whether we think this might be a candidate for back-patching, given the complaints you mention. Opinions? The code is related to performance, not a bug. Also, it adds one external function into 'sharedtuplestore.h'. IMO, it isn't worth back-patching. I would add that the problems you reach when you get to a very large number of partitions are hard (see several very long threads about extreme skew for one version of the problem, but even with zero/normal skewness and perfect estimation of the number of partitions, if you ask a computer to partition 42TB of data into partitions that fit in a work_mem suitable for a Commodore 64, it's gonna hurt on several levels) and this would only slightly improve one symptom. One idea that might improve just the directory entry and file descriptor aspect, would be to scatter the partitions into (say) 1MB chunks within the file, and hope that the file system supports holes (a bit like logtape.c's multiplexing but I wouldn't do it quite like that). Thanks, I found in [1] a good entry point to dive into this issue. [1] https://www.postgresql.org/message-id/CA+hUKGKDbv+5uiJZDdB1wttkMPFs9CDb6=02qxitq4am-kb...@mail.gmail.com -- regards, Andrei Lepikhov Postgres Professional
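To give a feel for the scale of the problem Thomas describes, here is a back-of-envelope sketch. This is not the actual ExecChooseHashTableSize() logic, which also considers buckets, skew and various limits; it only models the rough rule that the batch count is the smallest power of two whose product with work_mem covers the inner data, and that each batch typically needs its own temporary file per participating process.

```c
/* Smallest power-of-two batch count so that one batch fits in work_mem.
 * A crude model of hash-join batching, for illustration only. */
unsigned long
choose_nbatch(double inner_bytes, double work_mem_bytes)
{
	unsigned long nbatch = 1;

	while ((double) nbatch * work_mem_bytes < inner_bytes)
		nbatch *= 2;
	return nbatch;
}

/* One inner temp file per batch and per participant: the directory-entry
 * count that the unlink pass (and the proposed 1MB chunking) has to face. */
unsigned long
temp_file_count(unsigned long nbatch, unsigned long participants)
{
	return nbatch * participants;
}
```

Even with a generous 4MB work_mem rather than a Commodore-64-sized one, 42TB of inner data lands in the tens of millions of batches, which is why unlinking the files sooner only treats a symptom.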
Re: POC: GROUP BY optimization
On 22/2/2024 09:09, Richard Guo wrote: On Wed, Feb 21, 2024 at 6:20 PM Alexander Korotkov <mailto:aekorot...@gmail.com>> wrote: Hi, Richard! > What do you think about the revisions for the test cases? I've rebased your patch upthread. Did some minor beautifications. > * The table 'btg' is inserted with 1 tuples, which seems a bit > expensive for a test. I don't think we need such a big table to test > what we want. Your patch reduces the number of rows to 1000 tuples. I found it possible to further reduce it to 100 tuples. That also allowed me to save the plan in the test case introduced by e1b7fde418. Please check if you're OK with the patch attached. I looked through the v2 patch and have two comments. * The test case under "Check we don't pick aggregate path key instead of grouping path key" does not have EXPLAIN to show the plan. So how can we ensure we do not mistakenly select the aggregate pathkeys instead of the grouping pathkeys? I confirm it works correctly. I am not sure about the stability of the number of zeros in the output on different platforms here: avg 4. 5. It was why I'd used the format() function before. So, may we elaborate on the test and restrict the output? * I don't think the test case introduced by e1b7fde418 is still needed, because we already have one under "Utilize the ordering of merge join to avoid a full Sort operation". This kind of test case is just to ensure that we are able to utilize the ordering of the subplans underneath. So it should be parallel to the test cases for utilizing the ordering of index scan and subquery scan. I confirm this version also checks ec_sortref initialization in the case when ECs are constructed from a WHERE clause. Generally, I prefer one test per issue instead of one test for everything at once. But it works and I don't have any reason to dispute it. Also, I'm unsure about removing the disabling of the max_parallel_workers_per_gather parameter. 
Have you discovered the domination of the current plan over the partial one? Do the cost fluctuations across platforms not trigger a parallel plan? What's more, I suggest addressing here the complaint from [1]. As I see it, the cost difference between the Sort and IncrementalSort strategies in that case is around 0.5. To make the test more stable I propose to change it a bit and add a limit: SELECT count(*) FROM btg GROUP BY z, y, w, x LIMIT 10; It makes the efficacy of IncrementalSort more obvious - the difference is around 10 cost points. [1] https://www.postgresql.org/message-id/CACG=ezaYM1tr6Lmp8PRH1aeZq=rbkxeotwgzmclad5mphfw...@mail.gmail.com -- regards, Andrei Lepikhov Postgres Professional
Re: Removing unneeded self joins
On 21/2/2024 14:26, Richard Guo wrote: This patch also includes some cosmetic tweaks for SJE test cases. It does not change the test cases using catalog tables though. I think that should be a separate patch. Thanks for this catch; it is a really messy thing, leaving aside the question of why we need two different subtrees for the same query. I will look into your fix. -- regards, Andrei Lepikhov Postgres Professional
Re: Unlinking Parallel Hash Join inner batch files sooner
Hi, I see in [1] that the reporter mentioned a delay between the error message in parallel HashJoin and the return of control back from PSQL. Your patch might reduce this delay. Also, I have the same complaint from users who processed gigabytes of data in parallel HashJoin. Presumably, they also got stuck in unlinking tons of temporary files. So, are you going to do something with this code? [1] https://www.postgresql.org/message-id/18349-83d33dd3d0c855c3%40postgresql.org -- regards, Andrei Lepikhov Postgres Professional
Re: [POC] Allow flattening of subquery with a link to upper query
On 20/2/2024 17:43, David Rowley wrote: On Tue, 20 Feb 2024 at 22:57, Andrei Lepikhov wrote: I agree that it would be nice to teach the planner how to do this, but I think it just has to be a cost-based decision. Imagine how the transformed query would perform if pg_am had a billion rows and pg_class had 1 row. That's quite a costly hash table build to be probing it just once. True, the origins of this work lie in foreign tables where such a query generates an even worse situation. I didn't follow the patch, but there was a patch to push aggregate function evaluation down [1]. I imagine this has the same problem: if you just blindly pushed an aggregate function evaluation as deep as you could evaluate all the aggregate's parameters and group by vars, then you may end up aggregating far more than you need to, as some join could eliminate the majority of the groups. I think we'd need to come up with some way to have the planner consider these types of optimisations as alternatives to what happens today and only apply them when we estimate that they're cheaper. Right now a Path has no ability to describe that it's performed GROUP BY. Thanks for the link. We also ended up with the idea of an alternative subtree (inspired by the approach of AlternativeSubplan). Here, we just explain the current state of the pull-up sublink technique. -- regards, Andrei Lepikhov Postgres Professional
Re: [POC] Allow flattening of subquery with a link to upper query
On 1/9/2022 19:24, Richard Guo wrote: Even if we ignore these assertion checks, in the final plan we would have to access the RHS of the B/C semi join, i.e. C, to evaluate qual 'c.j = a.j' at the join level of A/BC join, which is wrong. Having committed 9f13376396 recently, we did a lot of work in this area. By applying regression tests from my last patch [1] to the master, I compared these two implementations. As I see it, using the LATERAL trick allowed us to simplify the code drastically. But because we know just the fact of the lateral link, not its place, in the master we do less than in the patch proposed in that thread. For example, having the query: explain (costs off) SELECT relname FROM pg_class c1 WHERE relname = ANY ( SELECT a.amname from pg_am a WHERE a.oid=c1.oid GROUP BY a.amname ); We see on master: Nested Loop -> Seq Scan on pg_class c1 -> Subquery Scan on "ANY_subquery" Filter: (c1.relname = "ANY_subquery".amname) -> Group Group Key: a.amname -> Sort Sort Key: a.amname -> Seq Scan on pg_am a Filter: (oid = c1.oid) And with this patch: Hash Join Hash Cond: ((c1.relname = a.amname) AND (c1.oid = a.oid)) -> Seq Scan on pg_class c1 -> Hash -> HashAggregate Group Key: a.amname -> Seq Scan on pg_am a Also, we attempted to fix links from a non-parent query block. So, in my opinion, the reason for this patch still exists, and we can continue this work further, maybe elaborating on flattening LATERAL references - this needs some research. [1] https://www.postgresql.org/message-id/35c8a3e8-d080-dfa8-2be3-cf5fe702010a%40postgrespro.ru -- regards, Andrei Lepikhov Postgres Professional
Re: Optimize planner memory consumption for huge arrays
On 20/2/2024 04:51, Tom Lane wrote: Tomas Vondra writes: On 2/19/24 16:45, Tom Lane wrote: Tomas Vondra writes: For example, I don't think we expect selectivity functions to allocate long-lived objects, right? So maybe we could run them in a dedicated memory context, and reset it aggressively (after each call). Here's a quick and probably-incomplete implementation of that idea. I've not tried to study its effects on memory consumption, just made sure it passes check-world. Thanks for the sketch. The trick with the planner_tmp_cxt_depth especially looks interesting. I think we should design small memory contexts - one per scalable direction of memory utilization, like selectivity or partitioning (appending?). My coding experience shows that the short-lived GEQO memory context forces people to learn Postgres internals more precisely :). -- regards, Andrei Lepikhov Postgres Professional
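The effect of such a dedicated, aggressively-reset context is easy to demonstrate with a toy bump allocator (names invented for illustration; PostgreSQL's MemoryContextReset is far more sophisticated): peak memory stays bounded by a single call's footprint instead of growing with the number of calls.

```c
#include <stddef.h>

typedef struct Arena
{
	char	buf[65536];
	size_t	used;
	size_t	peak;		/* high-water mark: our stand-in for peak consumption */
} Arena;

void *
arena_alloc(Arena *a, size_t n)
{
	void   *p = a->buf + a->used;

	a->used += n;
	if (a->used > a->peak)
		a->peak = a->used;
	return p;
}

/* The analogue of MemoryContextReset(): all allocations vanish at once. */
void
arena_reset(Arena *a)
{
	a->used = 0;
}

/* Simulate ncalls "selectivity function" calls, each allocating nbytes,
 * optionally resetting the context between calls. */
size_t
peak_after_calls(int ncalls, size_t nbytes, int reset_each_call)
{
	Arena		a = {{0}, 0, 0};

	for (int i = 0; i < ncalls; i++)
	{
		arena_alloc(&a, nbytes);
		if (reset_each_call)
			arena_reset(&a);
	}
	return a.peak;
}
```

This is the same principle behind both Tom's per-call reset of a planner temp context and the per-direction contexts suggested above: the reset turns O(calls) peak memory into O(one call).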
Re: POC, WIP: OR-clause support for indexes
On 20/2/2024 11:03, jian he wrote: Neither the code comments nor the commit message really explain the design idea here. That's unfortunate, principally because it makes review difficult. I'm very skeptical about the idea of using JumbleExpr for any part of this. It seems fairly expensive, and it might produce false matches. If expensive is OK, then why not just use equal()? If it's not, then this probably isn't really OK either. But in any case there should be comments explaining why this strategy was chosen. The above message (https://postgr.es/m/CA%2BTgmoZCgP6FrBQEusn4yaWm02XU8OPeoEMk91q7PRBgwaAkFw%40mail.gmail.com) seems still not answered. How can we evaluate whether JumbleExpr is expensive or not? I used this naive script to test, but didn't find a big difference when enable_or_transformation is ON or OFF. First, I am open to discussion here. But IMO, the equal() operation is quite expensive by itself. We should use the hash table approach to avoid quadratic behaviour when looking for similar clauses in the 'OR' list. Moreover, we use equal() in many places: selectivity estimations, proving of fitting the index, predtest, etc. So, by reducing the clause list, we eliminate many calls of the equal() routine, too. `leftop operator rightop` the operator can also be volatile. Do we need to check (op_volatile(opno) == PROVOLATILE_VOLATILE) within transformBoolExprOr? As usual, could you provide a test case to discuss it more objectively? -- regards, Andrei Lepikhov Postgres Professional
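To illustrate the claimed complexity difference: grouping n clauses of the form "col op const" by pairwise equal() costs O(n^2) comparisons, while hashing a lightweight key groups them in roughly O(n). The sketch below is standalone with invented names, and it hashes just a column/operator pair, whereas the patch jumbles the whole non-constant expression.

```c
#include <stdlib.h>

#define NBUCKETS 64

typedef struct OrGroup
{
	int				col;		/* stand-in for the jumbled expression */
	int				op;
	int				nconsts;	/* how many constants collapsed into ANY() */
	struct OrGroup *next;
} OrGroup;

typedef struct OrGroupTable
{
	OrGroup    *buckets[NBUCKETS];
} OrGroupTable;

/* Add one "col op const" clause; the matching group is found by hash
 * lookup, not by scanning all previously seen clauses with equal(). */
void
or_add_clause(OrGroupTable *t, int col, int op)
{
	unsigned	h = ((unsigned) col * 31u + (unsigned) op) % NBUCKETS;
	OrGroup    *g;

	for (g = t->buckets[h]; g; g = g->next)
	{
		if (g->col == col && g->op == op)
		{
			g->nconsts++;
			return;
		}
	}
	g = calloc(1, sizeof(OrGroup));
	g->col = col;
	g->op = op;
	g->nconsts = 1;
	g->next = t->buckets[h];
	t->buckets[h] = g;
}

int
or_group_size(OrGroupTable *t, int col, int op)
{
	unsigned	h = ((unsigned) col * 31u + (unsigned) op) % NBUCKETS;

	for (OrGroup *g = t->buckets[h]; g; g = g->next)
		if (g->col == col && g->op == op)
			return g->nconsts;
	return 0;
}
```

A group with nconsts > 1 is a candidate for the OR -> ANY(...) rewrite; false hash matches still need one real comparison per bucket entry, which is the trade-off Robert's question is probing.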
Re: Optimize planner memory consumption for huge arrays
On 19/2/2024 20:47, Tomas Vondra wrote: On 9/8/23 07:11, Lepikhov Andrei wrote: Just for comparison, without partitioning: elems 1 1E1 1E2 1E3 1E4 master: 12kB14kB37kB266kB 2.5MB patched:12kB11.5kB 13kB24kB141kB These improvements look pretty nice, considering how simple the patch seems to be. I can't even imagine how much memory we'd need with even more partitions (say, 1000) if 100 partitions means 274MB. BTW when releasing memory in scalararraysel, wouldn't it be good to also free the elem_values/elem_nulls? I haven't tried and maybe it's not that significant amount. Agree. Added into the next version of the patch. Moreover, I see a slight planning speedup. Looking into the reason for that, I discovered that it is because sometimes the planner utilizes the same memory piece for the next array element. It finds this piece more quickly than before that optimization. -- regards, Andrei Lepikhov Postgres Professional From 2e89dc8b743953068174c777d7a014e1ea71f659 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Tue, 20 Feb 2024 11:05:51 +0700 Subject: [PATCH] Utilize memory in scalararraysel in more optimal manner. When estimating selectivity of an array operation, free the memory right after processing each element. This reduces peak memory consumption on massive arrays. 
--- src/backend/utils/adt/selfuncs.c | 26 -- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index cea777e9d4..e5bad75ec1 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -281,6 +281,7 @@ eqsel_internal(PG_FUNCTION_ARGS, bool negate) selec = var_eq_non_const(&vardata, operator, collation, other, varonleft, negate); + pfree(other); ReleaseVariableStats(vardata); return selec; @@ -1961,15 +1962,15 @@ scalararraysel(PlannerInfo *root, { List *args; Selectivity s2; - - args = list_make2(leftop, - makeConst(nominal_element_type, - -1, - nominal_element_collation, - elmlen, - elem_values[i], - elem_nulls[i], - elmbyval)); + Const *c = makeConst(nominal_element_type, + -1, + nominal_element_collation, + elmlen, + elem_values[i], + elem_nulls[i], + elmbyval); + + args = list_make2(leftop, c); if (is_join_clause) s2 = DatumGetFloat8(FunctionCall5Coll(&oprselproc, clause->inputcollid, @@ -1985,7 +1986,8 @@ scalararraysel(PlannerInfo *root, ObjectIdGetDatum(operator), PointerGetDatum(args), Int32GetDatum(varRelid))); - + list_free(args); + pfree(c); if (useOr) { s1 = s1 + s2 - s1 * s2; @@ -2004,6 +2006,9 @@ scalararraysel(PlannerInfo *root, if ((useOr ? isEquality : isInequality) && s1disjoint >= 0.0 && s1disjoint <= 1.0) s1 = s1disjoint; + + pfree(elem_values); + pfree(elem_nulls); } else if (rightop && IsA(rightop, ArrayExpr) && !((ArrayExpr *) rightop)->multidims) @@ -2052,6 +2057,7 @@ scalararraysel(PlannerInfo *root,
Re: POC, WIP: OR-clause support for indexes
On 19/2/2024 19:53, Ranier Vilela wrote: v17-0002 1) move the vars *arrayconst and *dest, to after if, to avoid makeNode (palloc). + Const *arrayconst; + ScalarArrayOpExpr *dest; + + pd = (PredicatesData *) lfirst(lc); + if (pd->elems == NIL) + /* The index doesn't participate in this operation */ + continue; + arrayconst = lsecond_node(Const, saop->args); + dest = makeNode(ScalarArrayOpExpr); Thanks for the review! I'm not sure I understand you clearly. Does the attached patch fix the issue you raised? -- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 56b04541db..1545876e2f 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -1284,7 +1284,7 @@ build_paths_for_SAOP(PlannerInfo *root, RelOptInfo *rel, RestrictInfo *rinfo, ArrayType *arrayval = NULL; ArrayExpr *arr = NULL; Const *arrayconst = lsecond_node(Const, saop->args); - ScalarArrayOpExpr *dest = makeNode(ScalarArrayOpExpr); + ScalarArrayOpExpr dest; pd = (PredicatesData *) lfirst(lc); if (pd->elems == NIL) @@ -1302,10 +1302,10 @@ build_paths_for_SAOP(PlannerInfo *root, RelOptInfo *rel, RestrictInfo *rinfo, arr->multidims = false; /* Compose new SAOP, partially covering the source one */ - memcpy(dest, saop, sizeof(ScalarArrayOpExpr)); - dest->args = list_make2(linitial(saop->args), arr); + memcpy(&dest, saop, sizeof(ScalarArrayOpExpr)); + dest.args = list_make2(linitial(saop->args), arr); - clause = (Expr *) estimate_expression_value(root, (Node *) dest); + clause = (Expr *) estimate_expression_value(root, (Node *) &dest); /* * Create new RestrictInfo. It maybe more heavy than just copy node,
Re: Memory consumed by paths during partitionwise join planning
On 19/2/2024 19:25, Ashutosh Bapat wrote: On Fri, Feb 16, 2024 at 8:42 AM Andrei Lepikhov wrote: Live example: right now, I am working on code like MSSQL has - a combination of NestLoop and HashJoin paths and switching between them in real-time. It requires both paths in the pathlist at the moment when extensions are coming. Even if one of them isn't referenced from the upper pathlist, it may still be helpful for the extension. There is no guarantee that every path presented to add_path will be preserved. Suboptimal paths are freed as and when add_path discovers that they are suboptimal. So I don't think an extension can rely on the existence of a path. But having a refcount makes it easy to preserve the required paths by referencing them. I don't insist, just provide my use case. It would be ideal if you would provide some external routines for extensions that allow for keeping a path in the pathlist even when it has a terrible cost estimation. About partitioning. As I discovered planning issues connected to partitions, the painful problem is a rule, according to which we are trying to use the whole nomenclature of possible paths for each partition. With indexes, it quickly increases optimization work. IMO, this can be helped by a 'symmetrical' approach, which could restrict the scope of possible paths for upcoming partitions if we filter some paths in a set of previously planned partitions. filter or free? Filter. I meant that Postgres tries to apply IndexScan, BitmapScan, IndexOnlyScan, and other strategies, passing through the partition indexes. The optimizer spends a lot of time doing that. So, why not introduce a symmetrical strategy and exclude from the search some types of scan, based on the experience of planning previous partitions of the same table: if you have dozens of partitions, is it beneficial for the system to find a slightly more optimal IndexScan on one partition while having SeqScans on 999 others? 
IIUC, you are suggesting that instead of planning each partition/partitionwise join, we only create paths with the strategies which were found to be optimal with previous partitions. That's a good heuristic but it won't work if partition properties - statistics, indexes etc. - differ between groups of partitions. Sure, but the "Symmetry" strategy assumes that at the scope of a thousand partitions, especially with parallel append involved, it doesn't cause noticeable performance degradation if we find a slightly suboptimal path in a small subset of partitions. Does it make sense? As I see it, when people use 10-100 partitions for a table, they usually strive to keep indexes symmetrical for all partitions. -- regards, Andrei Lepikhov Postgres Professional
Re: POC, WIP: OR-clause support for indexes
On 16/2/2024 19:54, jian he wrote: After setting these parameters, overall enable_or_transformation ON is performance better. sorry for the noise. Don't worry, at least we know a weak point of partial paths estimation. so now I didn't find any corner case where enable_or_transformation is ON performs worse than when it's OFF. +typedef struct OrClauseGroupEntry +{ + OrClauseGroupKey key; + + Node *node; + List *consts; + Oid scalar_type; + List *exprs; +} OrClauseGroupEntry; I found that the field `scalar_type` was never used. Thanks, fixed. In the attachment - v17 for both patches. As I see it, the only general explanation of the idea is not addressed. I'm not sure how deeply we should explain it. -- regards, Andrei Lepikhov Postgres Professional From 3a3b6aa36320a06b64f2f608e3526255e53ed655 Mon Sep 17 00:00:00 2001 From: Alena Rybakina Date: Fri, 2 Feb 2024 22:01:09 +0300 Subject: [PATCH 1/2] Transform OR clause to ANY expressions. Replace (expr op C1) OR (expr op C2) ... with expr op ANY(C1, C2, ...) on the preliminary stage of optimization when we are still working with the tree expression. Here C is a constant expression, 'expr' is a non-constant expression, 'op' is an operator which returns a boolean result and has a commuter (for the case of reverse order of constant and non-constant parts of the expression, like 'CX op expr'). Sometimes it can lead to a non-optimal plan. But we think it is better to have an array of elements instead of a lot of OR clauses. There is room for further optimization in decomposing that array into more optimal parts. 
Authors: Alena Rybakina , Andrey Lepikhov Reviewed-by: Peter Geoghegan , Ranier Vilela Reviewed-by: Alexander Korotkov , Robert Haas Reviewed-by: jian he --- .../postgres_fdw/expected/postgres_fdw.out| 16 +- src/backend/nodes/queryjumblefuncs.c | 27 ++ src/backend/parser/parse_expr.c | 326 +- src/backend/utils/misc/guc_tables.c | 11 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/bin/pg_dump/t/002_pg_dump.pl | 6 +- src/include/nodes/queryjumble.h | 1 + src/include/optimizer/optimizer.h | 1 + src/test/regress/expected/create_index.out| 156 - src/test/regress/expected/inherit.out | 2 +- src/test/regress/expected/join.out| 62 +++- src/test/regress/expected/partition_prune.out | 219 ++-- src/test/regress/expected/rules.out | 4 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 35 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 21 files changed, 875 insertions(+), 70 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index c355e8f3f7..0523bbd8f7 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -1349,7 +1349,7 @@ SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE Foreign Scan Output: t1.c1, t1.c2, ft5.c1, ft5.c2 Relations: (public.ft4 t1) LEFT JOIN (public.ft5) - Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 < 10) OR (r4.c1 IS NULL))) AND ((r1.c1 < 10)) + Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 IS NULL) OR (r4.c1 < 10))) AND ((r1.c1 < 10)) (4 rows) 
SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE c1 < 10) t2 ON (t1.c1 = t2.c1) @@ -3105,7 +3105,7 @@ select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = t2 Foreign Scan Output: (array_agg(DISTINCT (t1.c1 % 5))), ((t2.c1 % 3)) Relations: Aggregate on ((public.ft4 t1) FULL JOIN (public.ft5 t2)) - Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE (((r1.c1 < 20) OR ((r1.c1 IS NULL) AND (r2.c1 < 5 GROUP BY 2 ORDER BY array_agg(DISTINCT (r1.c1 % 5)) ASC NULLS LAST + Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE r1.c1 IS NULL) AND (r2.c1 < 5)) OR (r1.c1 < 20))) GROUP BY 2 ORDER BY array_agg(DISTINCT (r1.c1 % 5)) ASC NULLS LAST (4 ro
Re: Removing unneeded self joins
On 18/2/2024 23:18, Alexander Korotkov wrote: On Sun, Feb 18, 2024 at 5:04 PM Alexander Korotkov wrote: On Sun, Feb 18, 2024 at 3:00 PM Alexander Lakhin wrote: 09.01.2024 01:09, Alexander Korotkov wrote: Fixed in 30b4955a46. Please look at the following query which fails with an error since d3d55ce57: create table t (i int primary key); select t3.i from t t1 join t t2 on t1.i = t2.i, lateral (select t1.i limit 1) t3; ERROR: non-LATERAL parameter required by subquery Thank you for spotting. I'm looking at this. Attached is a draft patch fixing this query. Could you, please, recheck? I reviewed this patch. Why do you check only the target list? I guess these links can be everywhere. See the patch in the attachment with the elaborated test and slightly changed code. -- regards, Andrei Lepikhov Postgres Professional From 7f94a3c96fd410522b87e570240cdb96b300dd31 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Mon, 19 Feb 2024 12:17:55 +0700 Subject: [PATCH] Replace relids in lateral subquery target list during SJE --- src/backend/optimizer/plan/analyzejoins.c | 29 ++- src/test/regress/expected/join.out| 44 +++ src/test/regress/sql/join.sql | 12 +++ 3 files changed, 84 insertions(+), 1 deletion(-) diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c index e494acd51a..072298f66c 100644 --- a/src/backend/optimizer/plan/analyzejoins.c +++ b/src/backend/optimizer/plan/analyzejoins.c @@ -395,7 +395,34 @@ remove_rel_from_query(PlannerInfo *root, RelOptInfo *rel, } /* Update lateral references. 
*/ - replace_varno((Node *) otherrel->lateral_vars, relid, subst); + if (root->hasLateralRTEs) + { + RangeTblEntry *rte = root->simple_rte_array[rti]; + ReplaceVarnoContext ctx = {.from = relid,.to = subst}; + + if (rte->lateral) + { + replace_varno((Node *) otherrel->lateral_vars, relid, subst); + + /* +* Although we pass root->parse through the cleanup procedure, +* parse->rtable and rte contain refs to different copies +* of the subquery. +*/ + if (otherrel->rtekind == RTE_SUBQUERY) + query_tree_walker(rte->subquery, replace_varno_walker, &ctx, + QTW_EXAMINE_SORTGROUP); +#ifdef USE_ASSERT_CHECKING + /* Just check possibly hidden non-replaced relids */ + Assert(!bms_is_member(relid, pull_varnos(root, (Node *) rte->tablesample))); + Assert(!bms_is_member(relid, pull_varnos(root, (Node *) rte->functions))); + Assert(!bms_is_member(relid, pull_varnos(root, (Node *) rte->tablefunc))); + Assert(!bms_is_member(relid, pull_varnos(root, (Node *) rte->values_lists))); +#endif + } + } + + } /* diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out index 0c2cba8921..d560a4a6b9 100644 --- a/src/test/regress/expected/join.out +++ b/src/test/regress/expected/join.out @@ -6349,6 +6349,50 @@ on true; -> Seq Scan on int8_tbl y (7 rows) +-- Test processing target lists in lateral subqueries +explain (verbose, costs off) +SELECT t3.a FROM sj t1, sj t2, +LATERAL (SELECT t1.a WHERE t1.a <> 1 +GROUP BY (t1.a) HAVING t1.a > 0 ORDER BY t1.a LIMIT 1) t3, +LATERAL (SELECT t1.a,t3.a WHERE t1.a <> t3.a+t2.a +GROUP BY (t3.a) HAVING t1.a > t3.a*t3.a+t2.a/t1.a LIMIT 2) t4, +LATERAL (SELECT * FROM sj TABLESAMPLE bernoulli(t1.a/t2.a) +REPEATABLE (t1.a+t2.a)) t5, +LATERAL generate_series(1, t1.a + t2.a) AS t6 +WHERE t1.a = t2.a; QUERY PLAN --- + Nested Loop + Output: (t2.a) + -> Nested Loop + Output: t2.a, (t2.a) + -> Nested Loop + Output: t2.a, (t2.a) + -> Nested Loop + Output: t2.a, (t2.a) + -> Seq Scan on public.sj t2 + Output: t2.a, t2.b, t2.c + Filter: (t2.a IS 
NOT NULL) + -> Limit + Output: (t2.
Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning
On 19/2/2024 06:05, Tomas Vondra wrote: However, this is a somewhat extreme example - it's joining 5 tables, each with 1000 partitions, using a partition-wise join. It reduces the amount of memory, but the planning time is still quite high (and essentially the same as without the patch). So it's not like it'd make them significantly more practical ... do we have any other ideas/plans how to improve that? The planner principle of cleaning up all allocated structures after the optimization stage simplifies development and code. But, if we want to achieve horizontal scalability on many partitions, we should introduce a per-partition memory context and reset it between partitions. GEQO already has a short-lived memory context, making designing extensions a bit more challenging but nothing too painful. -- regards, Andrei Lepikhov Postgres Professional
Re: POC, WIP: OR-clause support for indexes
On 16/2/2024 07:00, jian he wrote: On Wed, Feb 14, 2024 at 11:21 AM Andrei Lepikhov wrote: My OS: Ubuntu 22.04.3 LTS I already set the max_parallel_workers_per_gather to 10. So for all cases, it should use parallelism first? a better question would be: how to make the number of OR less than 29 still faster when enable_or_transformation is ON by only set parameters? In my test environment this example gives some subtle supremacy to ORs over ANY with only 3 ors and less. Please, provide next EXPLAIN ANALYZE results for the case you want to discuss here: 1. with enable_or_transformation enabled 2. with enable_or_transformation disabled 3. with enable_or_transformation disabled but with manual transformation OR -> ANY done, to check the overhead of this optimization. -- regards, Andrei Lepikhov Postgres Professional
Re: Memory consumed by paths during partitionwise join planning
On 15/2/2024 19:06, Ashutosh Bapat wrote: On Thu, Feb 15, 2024 at 9:41 AM Andrei Lepikhov But I'm not sure about freeing unreferenced paths. I would have to see alternatives in the pathlist. I didn't understand this. Can you please elaborate? A path in any pathlist is referenced. An unreferenced path should not be in any pathlist. I mean that at some point, an extension can reconsider the path tree after building the top node of this path. I vaguely recall that we already have (or had) this kind of optimization in the core, where part of the plan changes after it has been built. Live example: right now, I am working on code like MSSQL has - a combination of NestLoop and HashJoin paths with switching between them in real time. It requires both paths in the pathlist at the moment the extensions come in. Even if one of them isn't referenced from the upper pathlist, it may still be helpful for the extension. About partitioning. As I discovered planning issues connected to partitions, the painful problem is the rule according to which we try to use the full nomenclature of possible paths for each partition. With indexes, it quickly increases optimization work. IMO, a 'symmetrical' approach can help here: it could restrict the scope of possible paths for upcoming partitions if we filter some paths in a set of previously planned partitions. filter or free? Filter. I meant that Postgres tries to apply IndexScan, BitmapScan, IndexOnlyScan, and other strategies, passing through the partition indexes. The optimizer spends a lot of time doing that. So, why not introduce a symmetrical strategy and exclude from the search some scan types based on the path-building experience of previous partitions of the same table: if you have dozens of partitions, is it beneficial for the system to find a slightly more optimal IndexScan on one partition while having SeqScans on 999 others? -- regards, Andrei Lepikhov Postgres Professional
Re: planner chooses incremental but not the best one
On 15/2/2024 18:10, Tomas Vondra wrote: On 2/15/24 07:50, Andrei Lepikhov wrote: On 18/12/2023 19:53, Tomas Vondra wrote: On 12/18/23 11:40, Richard Guo wrote: The challenge is where to get usable information about correlation between columns. I only have a couple very rough ideas of what we might try. For example, if we have multi-column ndistinct statistics, we might look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from ndistinct(b,c,d) / ndistinct(b,c) If we know how many distinct values we have for the predicate column, we could then estimate the number of groups. I mean, we know that for the restriction "WHERE b = 3" we only have 1 distinct value, so we could estimate the number of groups as 1 * ndistinct(b,c) Did you mean here ndistinct(c,d) and the formula: ndistinct(b,c,d) / ndistinct(c,d) ? Yes, I think that's probably a more correct ... Essentially, the idea is to estimate the change in number of distinct groups after adding a column (or restricting it in some way). Thanks, I got it. I'm just thinking about how to implement such techniques in extensions, just to test the idea in action. In the case of GROUP-BY we can use a path hook, of course. But what about inventing a hook on clauselist estimation? Do you implicitly bear in mind here the necessity of tracking clauses that were applied to the data up to the moment of grouping? I don't recall what exactly I considered two months ago when writing the message, but I don't see why we would need to track that beyond what we already have. Shouldn't it be enough for the grouping to simply inspect the conditions on the lower levels? Yes, exactly. I've thought about looking into baserestrictinfos and, if group-by references a subquery targetlist, into subqueries too. -- regards, Andrei Lepikhov Postgres Professional
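The ndistinct-ratio idea discussed above can be sketched numerically. The statistics values below are invented for illustration only; they are not taken from any real table:

```python
# Hypothetical multi-column ndistinct statistics (assumed numbers).
nd = {
    ("c", "d"): 100,       # ndistinct(c,d)
    ("b", "c", "d"): 500,  # ndistinct(b,c,d)
    ("b", "c"): 120,       # ndistinct(b,c)
}

# Estimated growth in the number of distinct groups when column b is
# added on top of a grouping by (c,d): ndistinct(b,c,d) / ndistinct(c,d).
growth = nd[("b", "c", "d")] / nd[("c", "d")]

# With a restriction like "WHERE b = 3", b contributes a single distinct
# value, so the group estimate collapses to 1 * ndistinct(b,c), as in the
# example from the quoted message.
groups_after_restriction = 1 * nd[("b", "c")]
```

Nothing here reflects how the planner stores extended statistics internally; it only illustrates the arithmetic of the proposed heuristic.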
Re: planner chooses incremental but not the best one
On 15/12/2023 15:58, Richard Guo wrote: With the patch the estimate for the number of distinct 'b' values is more accurate. +1 to commit this patch. It looks good and resolves kind of a bug in the code. BTW, this patch does not change any existing regression test results. I attempted to devise a regression test that shows how this change can improve query plans, but failed. Should I try harder to find such a test case? The test that was changed refers to different features. Its behaviour can be changed in the future and mask testing of this code. IMO, you should add a test directly checking the appendrel->tuples correction. -- regards, Andrei Lepikhov Postgres Professional
Re: planner chooses incremental but not the best one
On 18/12/2023 19:53, Tomas Vondra wrote: On 12/18/23 11:40, Richard Guo wrote: The challenge is where to get usable information about correlation between columns. I only have a couple very rough ideas of what we might try. For example, if we have multi-column ndistinct statistics, we might look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from ndistinct(b,c,d) / ndistinct(b,c) If we know how many distinct values we have for the predicate column, we could then estimate the number of groups. I mean, we know that for the restriction "WHERE b = 3" we only have 1 distinct value, so we could estimate the number of groups as 1 * ndistinct(b,c) Did you mean here ndistinct(c,d) and the formula: ndistinct(b,c,d) / ndistinct(c,d) ? Do you implicitly bear in mind here the necessity of tracking clauses that were applied to the data up to the moment of grouping? -- regards, Andrei Lepikhov Postgres Professional
Re: Memory consumed by paths during partitionwise join planning
On 6/2/2024 19:51, Ashutosh Bapat wrote: On Fri, Dec 15, 2023 at 5:22 AM Ashutosh Bapat The patches are raw. make check has some crashes that I need to fix. I am waiting to hear whether this is useful and whether the design is on the right track. Let me write words of opinion on that feature. I generally like the idea of a refcount field. We had problems generating alternative paths in extensions and designing the Asymmetric Join feature [1] when we proposed an alternative path one level below and called the add_path() routine. We had lost the path in the path list referenced from the upper RelOptInfo. This approach allows us to make more handy solutions instead of hacking with a copy/restore pathlist. But I'm not sure about freeing unreferenced paths. I would have to see alternatives in the pathlist. About partitioning. As I discovered planning issues connected to partitions, the painful problem is a rule, according to which we are trying to use all nomenclature of possible paths for each partition. With indexes, it quickly increases optimization work. IMO, this can help a 'symmetrical' approach, which could restrict the scope of possible pathways for upcoming partitions if we filter some paths in a set of previously planned partitions. Also, I am glad to see a positive opinion about the path_walker() routine. Somewhere else, for example, in [2], it seems we need it too. [1] https://www.postgresql.org/message-id/flat/CAOP8fzaVL_2SCJayLL9kj5pCA46PJOXXjuei6-3aFUV45j4LJQ%40mail.gmail.com [2] https://www.postgresql.org/message-id/flat/CAMbWs496%2BN%3DUAjOc%3DrcD3P7B6oJe4rZw08e_TZRUsWbPxZW3Tw%40mail.gmail.com -- regards, Andrei Lepikhov Postgres Professional
Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning
On 14/2/2024 13:32, Ashutosh Bapat wrote: On Wed, Feb 14, 2024 at 9:50 AM Andrei Lepikhov wrote: On 30/1/2024 12:44, Ashutosh Bapat wrote: Thanks Vignesh. PFA patches rebased on the latest HEAD. The patch addressing Amit's comments is still a separate patch for him to review. Thanks for this improvement. Working with partitions, I frequently see peaks of memory consumption during planning. So, maybe one more case can be resolved here. Patch 0001 looks good. I'm not sure about free_child_sjinfo_members. Do we really need it as a separate routine? It might be better to inline this code. try_partitionwise_join() is already 200 lines long. A separate function is better than adding more lines to try_partitionwise_join(). Also if someone wants to call build_child_join_sjinfo() outside try_partitionwise_join() may find free_child_sjinfo_members() handy. Make sense, thanks. Patch 0002 adds valuable comments, and I'm OK with that. Also, as I remember, some extensions, such as pg_hint_plan, call build_child_join_sjinfo. It is OK to break the interface with a major version. But what if they need child_sjinfo a bit longer and collect links to this structure? I don't think it is a real stopper, but it is worth additional analysis. If these extensions call build_child_join_sjinfo() and do not call free_child_sjinfo_members, they can keep their child sjinfo as long as they want. I didn't understand your concern. Nothing special. This patch makes external code responsible for allocation of the structure and it adds more paths to do something wrong. Current code looks good. -- regards, Andrei Lepikhov Postgres Professional
Re: Propagate pathkeys from CTEs up to the outer query
On 29/1/2024 10:18, Richard Guo wrote: In [1] we've reached a conclusion that for a MATERIALIZED CTE it's okay to 'allow our statistics or guesses for the sub-query to subsequently influence what the upper planner does'. Commit f7816aec23 exposes column statistics to the upper planner. In the light of that, here is a patch that exposes info about the ordering of the CTE result to the upper planner. This patch was initially posted in that same thread and has received some comments from Tom in [2]. Due to the presence of multiple patches in that thread, it has led to confusion. So fork a new thread here specifically dedicated to discussing the patch about exposing pathkeys from CTEs to the upper planner. I like this approach. It looks good initially, but such features need more opinions/views/time to analyse corner cases. It goes alongside my current backburner - pull parameterisation through the GROUP-BY and the query block fence up to the JOIN searching code of the parent query. -- regards, Andrei Lepikhov Postgres Professional
Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning
On 30/1/2024 12:44, Ashutosh Bapat wrote: Thanks Vignesh. PFA patches rebased on the latest HEAD. The patch addressing Amit's comments is still a separate patch for him to review. Thanks for this improvement. Working with partitions, I frequently see peaks of memory consumption during planning. So, maybe one more case can be resolved here. Patch 0001 looks good. I'm not sure about free_child_sjinfo_members. Do we really need it as a separate routine? It might be better to inline this code. Patch 0002 adds valuable comments, and I'm OK with that. Also, as I remember, some extensions, such as pg_hint_plan, call build_child_join_sjinfo. It is OK to break the interface with a major version. But what if they need child_sjinfo a bit longer and collect links to this structure? I don't think it is a real stopper, but it is worth additional analysis. -- regards, Andrei Lepikhov Postgres Professional
Re: POC, WIP: OR-clause support for indexes
On 13/2/2024 17:03, Andrei Lepikhov wrote: On 13/2/2024 07:00, jian he wrote: The time is the last result of the 10 iterations. I'm not sure about the origins of such behavior, but it seems to be an issue of parallel workers, not this specific optimization. Having written that, it stayed on my backburner. And to close that issue, I explored get_restriction_qual_cost(). A close look shows us that "x IN (..)" is cheaper than its equivalent "x=N1 OR ...". Just the numbers: ANY: startup_cost = 0.0225, total_cost = 0.005; OR: startup_cost = 0, total_cost = 0.0225. Expression total_cost is calculated per tuple. In your example, we have many tuples, so the low per-tuple cost of the expression dominates over the significant startup cost. According to the above, SAOP adds 6250 to the cost of SeqScan; OR - 13541. So, the total cost of the query with SAOP is less than with OR, and the optimizer doesn't choose heavy parallel workers. And that is the answer. So, this example is more about the subtle balance between parallel/sequential execution, which can vary from one platform to another. -- regards, Andrei Lepikhov Postgres Professional
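The startup-vs-per-tuple trade-off described above is simple arithmetic. This sketch reuses the cost numbers quoted in the message; the tuple count is an assumption chosen only to show that the per-tuple term dominates on large scans:

```python
def expr_total_cost(startup, per_tuple, ntuples):
    """Total cost of evaluating a clause over a scan: the startup part
    is paid once, the per-tuple part once for every scanned row."""
    return startup + per_tuple * ntuples

# Numbers quoted above: ANY has a higher startup cost (setting up the
# array) but a lower per-tuple cost than the chain of OR comparisons.
ntuples = 1_000_000  # assumed scan size, for illustration
any_cost = expr_total_cost(0.0225, 0.005, ntuples)
or_cost = expr_total_cost(0.0, 0.0225, ntuples)

# With many tuples, ANY's cheaper per-tuple evaluation wins despite
# its startup cost.
assert any_cost < or_cost
```

This is not the planner's actual code path (get_restriction_qual_cost() does more), only the shape of the comparison it ends up making.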
Re: POC, WIP: OR-clause support for indexes
On 12/2/2024 17:51, Alena Rybakina wrote: On 12.02.2024 12:01, Andrei Lepikhov wrote: On 12/2/2024 15:55, Alena Rybakina wrote: As I understand it, match_clauses_to_index is necessary if you have a RestrictInfo (rinfo1) variable, so maybe we should run it after the make_restrictonfo procedure, otherwise calling it, I think, is useless. I think you must explain your note in more detail. Before this call, we already called make_restrictinfo() and built rinfo1, haven't we? I got it, I think, I was confused by the “else“ block when we can process the index, otherwise we move on to the next element. I think maybe “else“ block of creating restrictinfo with the indexpaths list creation should be moved to a separate function or just remove "else"? IMO, it is a matter of taste. But if you are really confused, maybe it will make understanding for someone else simpler. So, changed. I think we need to check that rinfo->clause is not empty, because if it is we can miss calling build_paths_for_OR function. We should add it there: restriction_is_saop_clause(RestrictInfo *restrictinfo) { if (IsA(restrictinfo->clause, ScalarArrayOpExpr)) I wonder if we should add here assertion, not NULL check. In what case we could get NULL clause here? But added for surety. By the way, I think we need to add a check that the clauseset is not empty (if (!clauseset.nonempty)) otherwise we could get an error. The same check I noticed in build_paths_for_OR function. I don't. Feel free to provide counterexample. -- regards, Andrei Lepikhov Postgres Professional From 2b088a1bd662c614ee491a797d2fcb67e2fb2391 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 24 Jan 2024 14:07:17 +0700 Subject: [PATCH 2/2] Teach generate_bitmap_or_paths to build BitmapOr paths over SAOP clauses. Likewise OR clauses, discover SAOP array and try to split its elements between smaller sized arrays to fit a set of partial indexes. 
--- src/backend/optimizer/path/indxpath.c | 301 ++ src/backend/optimizer/util/predtest.c | 46 src/backend/optimizer/util/restrictinfo.c | 13 + src/include/optimizer/optimizer.h | 16 ++ src/include/optimizer/restrictinfo.h | 1 + src/test/regress/expected/select.out | 282 src/test/regress/sql/select.sql | 82 ++ src/tools/pgindent/typedefs.list | 1 + 8 files changed, 689 insertions(+), 53 deletions(-) diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 32c6a8bbdc..56b04541db 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -32,6 +32,7 @@ #include "optimizer/paths.h" #include "optimizer/prep.h" #include "optimizer/restrictinfo.h" +#include "utils/array.h" #include "utils/lsyscache.h" #include "utils/selfuncs.h" @@ -1220,11 +1221,174 @@ build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel, return result; } +/* + * Building index paths over SAOP clause differs from the logic of OR clauses. + * Here we iterate across all the array elements and split them to SAOPs, + * corresponding to different indexes. We must match each element to an index. + */ +static List * +build_paths_for_SAOP(PlannerInfo *root, RelOptInfo *rel, RestrictInfo *rinfo, +List *other_clauses) +{ + List *result = NIL; + List *predicate_lists = NIL; + ListCell *lc; + PredicatesData *pd; + ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause; + + Assert(IsA(saop, ScalarArrayOpExpr) && saop->useOr); + + if (!IsA(lsecond(saop->args), Const)) + /* +* Has it practical outcome to merge arrays which couldn't constantified +* before that step? 
+*/ + return NIL; + + /* Collect predicates */ + foreach(lc, rel->indexlist) + { + IndexOptInfo *index = (IndexOptInfo *) lfirst(lc); + + /* Take into consideration partial indexes supporting bitmap scans */ + if (!index->amhasgetbitmap || index->indpred == NIL || index->predOK) + continue; + + pd = palloc0(sizeof(PredicatesData)); + pd->id = foreach_current_index(lc); + /* The trick with reducing recursion is stolen from predicate_implied_by */ + pd->predicate = list_length(index->indpred) == 1 ? + (Node *) linitial(index->indpred) : + (Node *) index->indpred; + predicate_lists = lappend(predicate_lis
Re: POC, WIP: OR-clause support for indexes
On 13/2/2024 07:00, jian he wrote: + newa = makeNode(ArrayExpr); + /* array_collid will be set by parse_collate.c */ + newa->element_typeid = scalar_type; + newa->array_typeid = array_type; + newa->multidims = false; + newa->elements = aexprs; + newa->location = -1; I am confused by the comments `array_collid will be set by parse_collate.c`, can you further explain it? I wonder if the second paragraph of comments on commit b310b6e will be enough to dive into details. if OR expression right arm is not plain Const, but with collation specification, eg. `where a = 'a' collate "C" or a = 'b' collate "C";` then the rightop is not Const, it will be CollateExpr, it will not be used in transformation. Yes, it is done for simplicity right now. I'm not sure about corner cases of merging such expressions. set enable_or_transformation to on; explain(timing off, analyze, costs off) select count(*) from test where (x = 1 or x = 2 or x = 3 or x = 4 or x = 5 or x = 6 or x = 7 or x = 8 or x = 9 ) \watch i=0.1 c=10 35.376 ms The time is the last result of the 10 iterations. The reason here - parallel workers. If you see into the plan you will find parallel workers without optimization and absence of them in the case of optimization: Gather (cost=1000.00..28685.37 rows=87037 width=12) (actual rows=90363 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Seq Scan on test Filter: ((x = 1) OR (x = 2) OR (x = 3) OR (x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x = 9)) Seq Scan on test (cost=0.02..20440.02 rows=90600 width=12) (actual rows=90363 loops=1) Filter: (x = ANY ('{1,2,3,4,5,6,7,8,9}'::integer[])) Having 90600 tuples returned we estimate it into 87000 (less precisely) without transformation and 90363 (more precisely) with the transformation. 
But if you play with parallel_tuple_cost and parallel_setup_cost, you will end up having these parallel workers: Gather (cost=0.12..11691.03 rows=90600 width=12) (actual rows=90363 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Seq Scan on test Filter: (x = ANY ('{1,2,3,4,5,6,7,8,9}'::integer[])) Rows Removed by Filter: 303212 And some profit about 25%, on my laptop. I'm not sure about the origins of such behavior, but it seems to be an issue of parallel workers, not this specific optimization. -- regards, Andrei Lepikhov Postgres Professional
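The role of the two GUCs mentioned above can be sketched with simplified Gather-node arithmetic. The defaults (parallel_setup_cost = 1000.0, parallel_tuple_cost = 0.1) are the stock PostgreSQL values; the child-path costs below are invented for illustration:

```python
def gather_cost(child_total_cost, returned_rows,
                parallel_setup_cost=1000.0, parallel_tuple_cost=0.1):
    """Simplified Gather cost: worker startup plus shipping every
    returned row from the workers to the leader, on top of the child."""
    return (parallel_setup_cost + child_total_cost
            + parallel_tuple_cost * returned_rows)

rows = 90_363            # rows returned in the example plans above
serial_cost = 20_440.0   # assumed total cost of the serial SeqScan
parallel_child = 12_000.0  # assumed cost of the parallel child path

# With default GUCs, shipping ~90k rows through Gather costs ~9k on its
# own, so the serial plan can still come out cheaper:
default_gather = gather_cost(parallel_child, rows)

# Lowering parallel_setup_cost/parallel_tuple_cost flips the decision,
# matching the behaviour described in the message:
cheap_gather = gather_cost(parallel_child, rows,
                           parallel_setup_cost=0.1,
                           parallel_tuple_cost=0.001)
```

The real cost model also divides child rows among workers; this sketch only shows why tuple-transfer cost can suppress parallel plans for queries returning many rows.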
Re: POC, WIP: OR-clause support for indexes
On 12/2/2024 15:55, Alena Rybakina wrote: Hi! I can't understand this part of the code: /* Time to generate index paths */ MemSet(&clauseset, 0, sizeof(clauseset)); match_clauses_to_index(root, list_make1(rinfo1), index, &clauseset); As I understand it, match_clauses_to_index is necessary if you have a RestrictInfo (rinfo1) variable, so maybe we should run it after the make_restrictinfo procedure, otherwise calling it, I think, is useless. I think you must explain your note in more detail. Before this call, we already called make_restrictinfo() and built rinfo1, haven't we? -- regards, Andrei Lepikhov Postgres Professional
Re: POC, WIP: OR-clause support for indexes
Thanks for the review! It was the first version for discussion. Of course, refactoring and polishing cycles will be needed, even so we can discuss the general idea earlier. On 10/2/2024 12:00, jian he wrote: On Thu, Feb 8, 2024 at 1:34 PM Andrei Lepikhov 1235 | PredicatesData *pd; Thanks + if (!predicate_implied_by(index->indpred, list_make1(rinfo1), true)) + elog(ERROR, "Logical mistake in OR <-> ANY transformation code"); the error message seems not clear? Yeah, have rewritten static List * build_paths_for_SAOP(PlannerInfo *root, RelOptInfo *rel, RestrictInfo *rinfo, List *other_clauses) I am not sure what's `other_clauses`, and `rinfo` refers to? adding some comments would be great. struct PredicatesData needs some comments, I think. Added, not so much though +bool +saop_covered_by_predicates(ScalarArrayOpExpr *saop, List *predicate_lists) +{ + ListCell *lc; + PredIterInfoData clause_info; + bool result = false; + bool isConstArray; + + Assert(IsA(saop, ScalarArrayOpExpr)); is this Assert necessary? Not sure. Moved it into another routine. For the function build_paths_for_SAOP, I think I understand the first part of the code. But I am not 100% sure of the second part of the `foreach(lc, predicate_lists)` code. more comments in `foreach(lc, predicate_lists)` would be helpful. Done do you need to add `PredicatesData` to src/tools/pgindent/typedefs.list? Done I also did some minor refactoring of generate_saop_pathlist. Partially agree instead of let it go to `foreach (lc, entries)`, we can reject the Const array at `foreach(lc, expr->args)` Yeah, I just think we can go further and transform two const arrays into a new one if we have the same clause and operator. In that case, we should allow it to pass through this cycle down to the classification stage. also `foreach(lc, expr->args)` do we need to reject cases like `contain_subplans((Node *) nconst_expr)`? maybe let the nconst_expr be a Var node would be far more easier. It's contradictory. 
On the one hand, we simplify the comparison procedure and can avoid expr jumbling at all. On the other hand - we restrict the feature. IMO, it would be better to unite such clauses complex_clause1 IN (..) OR complex_clause1 IN (..) into complex_clause1 IN (.., ..) and don't do duplicated work computing complex clauses. In the attachment - the next version of the second patch. -- regards, Andrei Lepikhov Postgres Professional From c1e5fc35778acd01cca67e63fdf80c5605af03f2 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 24 Jan 2024 14:07:17 +0700 Subject: [PATCH 2/2] Teach generate_bitmap_or_paths to build BitmapOr paths over SAOP clauses. Likewise OR clauses, discover SAOP array and try to split its elements between smaller sized arrays to fit a set of partial indexes. --- src/backend/optimizer/path/indxpath.c | 304 ++ src/backend/optimizer/util/predtest.c | 46 src/backend/optimizer/util/restrictinfo.c | 13 + src/include/optimizer/optimizer.h | 16 ++ src/include/optimizer/restrictinfo.h | 1 + src/test/regress/expected/select.out | 282 src/test/regress/sql/select.sql | 82 ++ src/tools/pgindent/typedefs.list | 1 + 8 files changed, 692 insertions(+), 53 deletions(-) diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 32c6a8bbdc..5383cb76dc 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -32,6 +32,7 @@ #include "optimizer/paths.h" #include "optimizer/prep.h" #include "optimizer/restrictinfo.h" +#include "utils/array.h" #include "utils/lsyscache.h" #include "utils/selfuncs.h" @@ -1220,11 +1221,177 @@ build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel, return result; } +/* + * Building index paths over SAOP clause differs from the logic of OR clauses. + * Here we iterate across all the array elements and split them to SAOPs, + * corresponding to different indexes. We must match each element to an index. 
+ */ +static List * +build_paths_for_SAOP(PlannerInfo *root, RelOptInfo *rel, RestrictInfo *rinfo, +List *other_clauses) +{ + List *result = NIL; + List *predicate_lists = NIL; + ListCell *lc; + PredicatesData *pd; + ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause; + + Assert(IsA(saop, ScalarArrayOpExpr) && saop->useOr); + + if (!IsA(lsecond(saop->args), Const)) + /* +* Has it practical outcome to merge arrays which couldn't constantified +* before that step? +*/ + return NIL; + + /* Collect predicates */ + foreach(lc, rel-&g
Re: pg_stat_advisor extension
On 6/2/2024 22:27, Ilia Evdokimov wrote: I welcome your insights, feedback, and evaluations regarding the necessity of integrating this new extension into PostgreSQL. Besides the other issues that were immediately raised during discovery of the extension, let me emphasize two: 1. In the case of parallel workers, the plan_rows value has different semantics from the number of rows predicted. Just explore get_parallel_divisor(). 2. The extension recommends new statistics immediately upon finding an estimation error. But what if the reason for the error is stale statistics? Or the error may be raised for only one specific set of constants, while estimation will be done well in another 99.% of cases for the same expression. According to No.2, it might make sense to collect and track clause combinations and the cardinality errors found, and let the DBA make decisions on their own. -- regards, Andrei Lepikhov Postgres Professional
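On point No.1 above: for parallel paths, plan_rows is the per-worker estimate, scaled down by get_parallel_divisor(). The sketch below mirrors my reading of that logic in costsize.c (the 0.3 leader-contribution factor is recalled from the PostgreSQL source and should be treated as an approximation, not authoritative):

```python
def parallel_divisor(parallel_workers, leader_participates=True):
    # Each worker contributes one full share of the work; the leader
    # contributes a partial share that shrinks as it spends more time
    # servicing the workers.
    divisor = float(parallel_workers)
    if leader_participates:
        leader_contribution = 1.0 - 0.3 * parallel_workers
        if leader_contribution > 0:
            divisor += leader_contribution
    return divisor

# So a tool comparing plan_rows against actual row counts must undo
# this scaling first:
total_rows_estimate = 90_000
per_path_rows = total_rows_estimate / parallel_divisor(2)  # divisor ~2.4
```

This is exactly the kind of semantic mismatch that can make a naive plan_rows-vs-actual comparison report false estimation errors.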
Re: POC, WIP: OR-clause support for indexes
On 3/2/2024 02:06, Alena Rybakina wrote: On 01.02.2024 08:00, jian he wrote: I added your code to the patch. Thanks Alena and Jian for the detailed scrutiny! A couple of questions: 1. As I see, transformAExprIn uses the same logic as we invented but allows composite and domain types. Could you add a comment explaining why we forbid row types in general, in contrast to the transformAExprIn routine? 2. Could you provide the tests to check issues covered by the recent (in v.15) changes? Patch 0001-* in the attachment incorporates changes induced by Jian's notes from [1]. Patch 0002-* contains a transformation of the SAOP clause, which allows the optimizer to utilize partial indexes if they cover all values in this array. Also, it is an answer to Alexander's note [2] on performance degradation. This first version may be a bit raw, but I need your opinion: Does it resolve the issue? Skimming through the thread, I see that, in general, all issues have been covered for now. Only Robert's note on a lack of documentation still needs to be resolved. [1] https://www.postgresql.org/message-id/CACJufxGXhJ823cdAdp2Ho7qC-HZ3_-dtdj-myaAi_u9RQLn45g%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAPpHfduJtO0s9E%3DSHUTzrCD88BH0eik0UNog1_q3XBF2wLmH6g%40mail.gmail.com -- regards, Andrei Lepikhov Postgres Professional From 0ac511ab94959e87d1525d5ea8c4855643a6ccbe Mon Sep 17 00:00:00 2001 From: Alena Rybakina Date: Fri, 2 Feb 2024 22:01:09 +0300 Subject: [PATCH 1/2] Transform OR clause to ANY expressions. Replace (expr op C1) OR (expr op C2) ... with expr op ANY(C1, C2, ...) on the preliminary stage of optimization when we are still working with the tree expression. Here C is a constant expression, 'expr' is non-constant expression, 'op' is an operator which returns boolean result and has a commuter (for the case of reverse order of constant and non-constant parts of the expression, like 'CX op expr'). Sometimes it can lead to not optimal plan. 
But we think it is better to have array of elements instead of a lot of OR clauses. Here is a room for further optimizations on decomposing that array into more optimal parts. Authors: Alena Rybakina , Andrey Lepikhov Reviewed-by: Peter Geoghegan , Ranier Vilela Reviewed-by: Alexander Korotkov , Robert Haas Reviewed-by: jian he --- .../postgres_fdw/expected/postgres_fdw.out| 16 +- src/backend/nodes/queryjumblefuncs.c | 27 ++ src/backend/parser/parse_expr.c | 327 +- src/backend/utils/misc/guc_tables.c | 11 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/bin/pg_dump/t/002_pg_dump.pl | 6 +- src/include/nodes/queryjumble.h | 1 + src/include/optimizer/optimizer.h | 1 + src/test/regress/expected/create_index.out| 156 - src/test/regress/expected/inherit.out | 2 +- src/test/regress/expected/join.out| 62 +++- src/test/regress/expected/partition_prune.out | 219 ++-- src/test/regress/expected/rules.out | 4 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 35 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 21 files changed, 876 insertions(+), 70 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index b5a38aeb21..a07aefc9e5 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -1349,7 +1349,7 @@ SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE Foreign Scan Output: t1.c1, t1.c2, ft5.c1, ft5.c2 Relations: (public.ft4 t1) LEFT JOIN (public.ft5) - Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 < 10) OR (r4.c1 IS NULL))) AND ((r1.c1 < 10)) + Remote SQL: SELECT r1.c1, 
r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 IS NULL) OR (r4.c1 < 10))) AND ((r1.c1 < 10)) (4 rows) SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE c1 < 10) t2 ON (t1.c1 = t2.c1) @@ -3105,7 +3105,7 @@ select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = t2 Foreign Scan Output: (array_agg(DISTINCT (t1.c1 % 5))), ((t2.c1 % 3)) Relations: Aggregate on ((public.ft4 t1) FULL JOIN (public.ft5 t2)) - Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5
Re: POC: GROUP BY optimization
On 2/2/2024 11:06, Richard Guo wrote: On Fri, Feb 2, 2024 at 11:32 AM Richard Guo <mailto:guofengli...@gmail.com>> wrote: On Fri, Feb 2, 2024 at 10:02 AM Tom Lane mailto:t...@sss.pgh.pa.us>> wrote: One of the test cases added by this commit has not been very stable in the buildfarm. Latest example is here: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2024-02-01%2021%3A28%3A04 <https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2024-02-01%2021%3A28%3A04> and I've seen similar failures intermittently on other machines. I'd suggest building this test atop a table that is more stable than pg_class. You're just waving a red flag in front of a bull if you expect stable statistics from that during a regression run. Nor do I see any particular reason for pg_class to be especially suited to the test. Yeah, it's not a good practice to use pg_class in this place. While looking through the test cases added by this commit, I noticed some other minor issues that are not great. Such as * The table 'btg' is inserted with 1 tuples, which seems a bit expensive for a test. I don't think we need such a big table to test what we want. * I don't see why we need to manipulate GUC max_parallel_workers and max_parallel_workers_per_gather. * I think we'd better write the tests with the keywords being all upper or all lower. A mixed use of upper and lower is not great. Such as in explain (COSTS OFF) SELECT x,y FROM btg GROUP BY x,y,z,w; * Some comments for the test queries are not easy to read. * For this statement CREATE INDEX idx_y_x_z ON btg(y,x,w); I think the index name would cause confusion. It creates an index on columns y, x and w, but the name indicates an index on y, x and z. I'd like to write a draft patch for the fixes. Here is the draft patch that fixes the issues I complained about in upthread. I passed through the patch. Looks like it doesn't break anything. 
Why do you prefer to use count(*) in EXPLAIN instead of a plain targetlist, like "SELECT x,y,..."? Also, according to the test mentioned by Tom: 1. I see PG uses an IndexScan on (x,y), so column x will already be sorted before the MergeJoin. Why not use Incremental Sort on (x,z,w) instead of a full sort? 2. As a side note, IMO this test shows the near-future potential of this feature: it is cheaper to use grouping order (w,z) instead of (z,w). -- regards, Andrei Lepikhov Postgres Professional
Re: POC: GROUP BY optimization
On 2/2/2024 09:02, Tom Lane wrote: Alexander Korotkov writes: I'm going to push this if there are no objections. One of the test cases added by this commit has not been very stable in the buildfarm. Latest example is here: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2024-02-01%2021%3A28%3A04 and I've seen similar failures intermittently on other machines. I'd suggest building this test atop a table that is more stable than pg_class. You're just waving a red flag in front of a bull if you expect stable statistics from that during a regression run. Nor do I see any particular reason for pg_class to be especially suited to the test. Yeah, it is my fault. Please see the attached patch fixing that. -- regards, Andrei Lepikhov Postgres Professional From 11a049d95ee48e38ad569aab7663d8de91f946ad Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Fri, 2 Feb 2024 10:39:55 +0700 Subject: [PATCH] Replace the GROUP-BY optimization test with one based on something less volatile than the pg_class relation. 
--- src/test/regress/expected/aggregates.out | 32 +++- src/test/regress/sql/aggregates.sql | 9 +++ 2 files changed, 18 insertions(+), 23 deletions(-) diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index 7a73c19314..c2e1b8c9ed 100644 --- a/src/test/regress/expected/aggregates.out +++ b/src/test/regress/expected/aggregates.out @@ -2873,7 +2873,6 @@ SELECT y,x,array_agg(distinct w) FROM btg WHERE y < 0 GROUP BY x,y; (6 rows) RESET enable_incremental_sort; -DROP TABLE btg; -- The case, when scanning sort order correspond to aggregate sort order but -- can not be found in the group-by list CREATE TABLE agg_sort_order (c1 int PRIMARY KEY, c2 int); @@ -2901,32 +2900,31 @@ DROP TABLE agg_sort_order CASCADE; SET enable_hashjoin = off; SET enable_nestloop = off; explain (COSTS OFF) -SELECT c1.relname,c1.relpages -FROM pg_class c1 JOIN pg_class c2 ON (c1.relname=c2.relname AND c1.relpages=c2.relpages) -GROUP BY c1.reltuples,c1.relpages,c1.relname -ORDER BY c1.relpages, c1.relname, c1.relpages*c1.relpages; - QUERY PLAN -- +SELECT b1.x,b1.w FROM btg b1 JOIN btg b2 ON (b1.z=b2.z AND b1.w=b2.w) +GROUP BY b1.x,b1.z,b1.w ORDER BY b1.z, b1.w, b1.x*b1.x; +QUERY PLAN +--- Incremental Sort - Sort Key: c1.relpages, c1.relname, ((c1.relpages * c1.relpages)) - Presorted Key: c1.relpages, c1.relname + Sort Key: b1.z, b1.w, ((b1.x * b1.x)) + Presorted Key: b1.z, b1.w -> Group - Group Key: c1.relpages, c1.relname, c1.reltuples + Group Key: b1.z, b1.w, b1.x -> Incremental Sort - Sort Key: c1.relpages, c1.relname, c1.reltuples - Presorted Key: c1.relpages, c1.relname + Sort Key: b1.z, b1.w, b1.x + Presorted Key: b1.z, b1.w -> Merge Join - Merge Cond: ((c1.relpages = c2.relpages) AND (c1.relname = c2.relname)) + Merge Cond: ((b1.z = b2.z) AND (b1.w = b2.w)) -> Sort - Sort Key: c1.relpages, c1.relname - -> Seq Scan on pg_class c1 + Sort Key: b1.z, b1.w + -> Seq Scan on btg b1 -> Sort - Sort Key: c2.relpages, c2.relname - -> Seq Scan on pg_class 
c2 + Sort Key: b2.z, b2.w + -> Seq Scan on btg b2 (16 rows) RESET enable_hashjoin; RESET enable_nestloop; +DROP TABLE btg; RESET enable_hashagg; RESET max_parallel_workers; RESET max_parallel_workers_per_gather; diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql index 916dbf908f..3548fbb8db 100644 --- a/src/test/regress/sql/aggregates.sql +++ b/src/test/regress/sql/aggregates.sql @@ -1229,8 +1229,6 @@ EXPLAIN (VERBOSE, COSTS OFF) SELECT y,x,array_agg(distinct w) FROM btg WHERE y < 0 GROUP BY x,y; RESET enable_incremental_sort; -DROP TABLE btg; - -- The case, when scanning sort order correspond to aggregate sort order but -- can not be found in the group-by list CREATE TABLE agg_sort_order (c1 int PRIMARY KEY, c2 int); @@ -1245,13 +1243,12 @@ DROP TABLE agg_sort_order CASCADE; SET enable_hashjoin = off; SET enable_nestloop = off; explain (COSTS OFF) -SELECT c1.relname,c1.relpages -FROM pg_class c1 JOIN pg_class c2 ON (c1.relname=c2.relname AND c1.relpages=c2.relpages) -GROUP BY c1.reltuples,c1.relpages,c1.relname -ORDER BY c1.relpages, c1.relname, c1.relpages*c1.r
Re: POC: GROUP BY optimization
Just forgotten second attachment. -- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index 1095b73dac..b612420547 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -432,6 +432,21 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, return n; } +static bool +duplicated_pathkey_combination(List *infos, List *pathkeys) +{ + ListCell *lc; + + foreach (lc, infos) + { + PathKeyInfo *info = lfirst_node(PathKeyInfo, lc); + + if (compare_pathkeys(pathkeys, info->pathkeys) == PATHKEYS_EQUAL) + return true; + } + return false; +} + /* * get_useful_group_keys_orderings * Determine which orderings of GROUP BY keys are potentially interesting. @@ -491,7 +506,7 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path) if (n > 0 && (enable_incremental_sort || n == root->num_groupby_pathkeys) && - compare_pathkeys(pathkeys, root->group_pathkeys) != PATHKEYS_EQUAL) + !duplicated_pathkey_combination(infos, pathkeys)) { info = makeNode(PathKeyInfo); info->pathkeys = pathkeys; @@ -514,8 +529,9 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path) , root->num_groupby_pathkeys); - if (n > 0 && compare_pathkeys(pathkeys, root->group_pathkeys) != PATHKEYS_EQUAL && - (enable_incremental_sort || n == list_length(root->sort_pathkeys))) + if (n > 0 && + (enable_incremental_sort || n == list_length(root->sort_pathkeys)) && + !duplicated_pathkey_combination(infos, pathkeys)) { info = makeNode(PathKeyInfo); info->pathkeys = pathkeys;
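The guard added by this patch compares each candidate ordering against the ones already collected and skips exact matches, which prevents identical paths from being built and added twice. A hypothetical Python sketch of the same idea (plain lists of key names stand in for PathKey lists; these are not the actual planner data structures):

```python
def add_ordering(infos, pathkeys):
    """Mimic duplicated_pathkey_combination(): append the candidate
    ordering only if no equal ordering is already in the list."""
    if any(existing == pathkeys for existing in infos):
        return False  # duplicate combination: skip it
    infos.append(pathkeys)
    return True

infos = []
add_ordering(infos, ["a", "b"])  # original GROUP BY ordering
add_ordering(infos, ["a", "b"])  # same ordering arrives again via sort_pathkeys
add_ordering(infos, ["b", "a"])  # a genuinely different ordering
print(infos)  # [['a', 'b'], ['b', 'a']]
```

In the C code the comparison is compare_pathkeys(...) == PATHKEYS_EQUAL over canonical PathKey pointers; the sketch uses plain list equality in its place.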
Re: POC: GROUP BY optimization
On 16/1/2024 22:05, Alexander Korotkov wrote: On Tue, Jan 16, 2024 at 4:48 AM Richard Guo wrote: * When trying to match the ordering of GROUP BY to that of ORDER BY in get_useful_group_keys_orderings, this patch checks against the length of path->pathkeys. This does not make sense. I guess this is a typo and the real intention is to check against root->sort_pathkeys. Thanks! It is really my blunder - a fresh look helps. --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -504,7 +504,7 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path) root->num_groupby_pathkeys); if (n > 0 && - (enable_incremental_sort || n == list_length(path->pathkeys))) + (enable_incremental_sort || n == list_length(root->sort_pathkeys))) However, I think this is also incorrect. When incremental sort is disabled, it is reasonable to consider reordering the GROUP BY keys only if the number of matching pathkeys is equal to the length of root->group_pathkeys, i.e. if 'n == list_length(root->group_pathkeys)'. Hmm... Why should this be list_length(root->group_pathkeys) while we're targeting root->sort_pathkeys? For now, I changed that to list_length(root->sort_pathkeys). I think that in the first case, when we are trying to arrange the group-by keys according to the underlying pathkeys with incremental sort = off, reordering only makes sense if all the group-by keys are matched, regardless of whether the incoming path contains more pathkeys than that or exactly as many. The code and test case are in step1.txt. * When the original ordering of GROUP BY keys matches the path's pathkeys or ORDER BY keys, get_useful_group_keys_orderings would generate duplicate PathKeyInfos. Consequently, this duplication would lead to the creation and addition of identical paths multiple times. This is not great. 
Consider the query below: create table t (a int, b int); create index on t (a, b); set enable_hashagg to off; explain select count(*) from t group by a, b; QUERY PLAN -- GroupAggregate (cost=0.15..97.27 rows=226 width=16) Group Key: a, b -> Index Only Scan using t_a_b_idx on t (cost=0.15..78.06 rows=2260 width=8) (3 rows) The same path with group keys {a, b} is created and added twice. I tried to address that. * Part of the work performed in this patch overlaps with that of preprocess_groupclause. They are both trying to adjust the ordering of the GROUP BY keys to match ORDER BY. I wonder if it would be better to perform this work only once. Andrei, could you take a look. As I see, the PathKeyInfo list often contains duplicated pathkeys, coming from either sort_pathkeys or path->pathkeys orderings. So, I can propose to check duplicates each time (see step2.txt in attachment). * When reordering the GROUP BY keys, I think the following checks need some improvements. + /* +* Since 1349d27 pathkey coming from underlying node can be in the +* root->group_pathkeys but not in the processed_groupClause. So, we +* should be careful here. +*/ + sgc = get_sortgroupref_clause_noerr(pathkey->pk_eclass->ec_sortref, + *group_clauses); + if (!sgc) + /* The grouping clause does not cover this pathkey */ + break; I think we need to check or assert 'pathkey->pk_eclass->ec_sortref' is valid before calling get_sortgroupref_clause_noerr with it. Also, how can we guarantee that the returned GROUP BY item is sortable? Should we check that with OidIsValid(sgc->sortop)? Added. Reviewed it, looks good. 
-- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index 919d54dd79..1095b73dac 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -489,8 +489,9 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path) n = group_keys_reorder_by_pathkeys(path->pathkeys, , , root->num_groupby_pathkeys); - if (n > 0 && compare_pathkeys(pathkeys, root->group_pathkeys) != PATHKEYS_EQUAL && - (enable_incremental_sort || n == list_length(path->pathkeys))) + if (n > 0 && + (enable_incremental_sort || n == root->num_groupby_pathkeys) && + compare_pathkeys(pathkeys, root->group_pathkeys) != PATHKEYS_EQUAL) { info = makeNode(PathKeyInfo); info->pathkeys = pathkeys; diff --git a/src/test/regress/expected/ag
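The behaviour being tuned in these patches — group_keys_reorder_by_pathkeys() pulling the group-by keys that match a prefix of the input path's pathkeys to the front and reporting how many matched — can be sketched roughly in Python. This is a deliberately simplified illustration: plain strings stand in for PathKey/SortGroupClause nodes, and the sortability and ec_sortref checks from the real code are omitted:

```python
def reorder_group_keys(path_keys, group_keys):
    """Return (n_matched, reordered): group keys matching a prefix of
    the input path's sort order come first, the rest keep query order."""
    prefix = []
    for pk in path_keys:
        if pk not in group_keys:
            break  # the usable presorted prefix ends here
        prefix.append(pk)
    rest = [k for k in group_keys if k not in prefix]
    return len(prefix), prefix + rest

# A MergeJoin output sorted by (z, w); the query says GROUP BY x, z, w.
n, keys = reorder_group_keys(["z", "w"], ["x", "z", "w"])
print(n, keys)  # 2 ['z', 'w', 'x']
```

With two matched keys, an Incremental Sort over just the third key suffices; per the discussion above, when enable_incremental_sort is off the caller should accept the reordered list only on a full match, i.e. when n equals the number of group-by keys.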
Re: POC: GROUP BY optimization
On 15/1/2024 13:42, Richard Guo wrote: On Mon, Jan 15, 2024 at 8:20 AM Alexander Korotkov <mailto:aekorot...@gmail.com>> wrote: Thank you for providing the test case relevant for this code change. The revised patch incorporating this change is attached. Now the patchset looks good to me. I'm going to push it if there are no objections. Seems I'm late for the party. Can we hold for several more days? I'd like to have a review on this patch. Get on board! It looks like this feature needs as much review as possible (likewise SJE). -- regards, Andrei Lepikhov Postgres Professional
Re: POC: GROUP BY optimization
On 13/1/2024 22:00, Alexander Korotkov wrote: On Sat, Jan 13, 2024 at 11:09 AM Andrei Lepikhov wrote: On 11/1/2024 18:30, Alexander Korotkov wrote: On Tue, Jan 9, 2024 at 1:14 PM Pavel Borisov wrote: Hmm, I don't see this old code in these patches. Resend 0002-* because of trailing spaces. AFAIK, cfbot does not seek old versions of patchset parts in previous messages. So for it to run correctly, a new version of the whole patchset should be sent even if most patches are unchanged. Please, find the revised patchset with some refactoring and comments improvement from me. I'll continue looking into this. The patch looks better, thanks to your refactoring. I propose additional comments and tests to make the code more understandable (see attachment). I intended to look into this part of the code more, but the tests show a difference between PostgreSQL v.15 and v.16, which causes changes in the code of this feature. Makes sense. I've incorporated your changes into the patchset. One more improvement. To underpin the code change: - return cur_ec; /* Match! */ + { + /* +* Match! +* +* Copy the sortref if it wasn't set yet. That may happen if +* the ec was constructed from WHERE clause, i.e. it doesn't +* have a target reference at all. +*/ + if (cur_ec->ec_sortref == 0 && sortref > 0) + cur_ec->ec_sortref = sortref; + return cur_ec; + } I propose the test (see attachment). It shows why we introduce this change: GROUP-BY should juggle not only pathkeys generated by explicit sort operators but also planner-made ones, as in this example, generated by a MergeJoin. 
-- regards, Andrei Lepikhov Postgres Professional diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index 0d46e096e5..ca38e78f21 100644 --- a/src/test/regress/expected/aggregates.out +++ b/src/test/regress/expected/aggregates.out @@ -2879,6 +2879,37 @@ FROM t1 WHERE c2 < 100 GROUP BY c1 ORDER BY 2; (10 rows) DROP TABLE t1 CASCADE; +-- Check, that GROUP-BY reordering optimization can operate with pathkeys, built +-- by planner itself. For example, by MergeJoin. +SET enable_hashjoin = off; +SET enable_nestloop = off; +explain (COSTS OFF) +SELECT c1.relname,c1.relpages +FROM pg_class c1 JOIN pg_class c2 ON (c1.relname=c2.relname AND c1.relpages=c2.relpages) +GROUP BY c1.reltuples,c1.relpages,c1.relname +ORDER BY c1.relpages, c1.relname, c1.relpages*c1.relpages; + QUERY PLAN +- + Incremental Sort + Sort Key: c1.relpages, c1.relname, ((c1.relpages * c1.relpages)) + Presorted Key: c1.relpages, c1.relname + -> Group + Group Key: c1.relpages, c1.relname, c1.reltuples + -> Incremental Sort + Sort Key: c1.relpages, c1.relname, c1.reltuples + Presorted Key: c1.relpages, c1.relname + -> Merge Join + Merge Cond: ((c1.relpages = c2.relpages) AND (c1.relname = c2.relname)) + -> Sort + Sort Key: c1.relpages, c1.relname + -> Seq Scan on pg_class c1 + -> Sort + Sort Key: c2.relpages, c2.relname + -> Seq Scan on pg_class c2 +(16 rows) + +RESET enable_hashjoin; +RESET enable_nestloop; RESET enable_hashagg; RESET max_parallel_workers; RESET max_parallel_workers_per_gather; diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql index f99167ac9e..cf87b5d5dd 100644 --- a/src/test/regress/sql/aggregates.sql +++ b/src/test/regress/sql/aggregates.sql @@ -1233,6 +1233,18 @@ SELECT array_agg(c1 ORDER BY c2),c2 FROM t1 WHERE c2 < 100 GROUP BY c1 ORDER BY 2; DROP TABLE t1 CASCADE; +-- Check, that GROUP-BY reordering optimization can operate with pathkeys, built +-- by planner itself. For example, by MergeJoin. 
+SET enable_hashjoin = off; +SET enable_nestloop = off; +explain (COSTS OFF) +SELECT c1.relname,c1.relpages +FROM pg_class c1 JOIN pg_class c2 ON (c1.relname=c2.relname AND c1.relpages=c2.relpages) +GROUP BY c1.reltuples,c1.relpages,c1.relname +ORDER BY c1.relpages, c1.relname, c1.relpages*c1.relpages; +RESET enable_hashjoin; +RESET enable_nestloop; + RESET enable_hashagg; RESET max_parallel_workers; RESET max_parallel_workers_per_gather;
Re: POC: GROUP BY optimization
On 11/1/2024 18:30, Alexander Korotkov wrote: Hi! On Tue, Jan 9, 2024 at 1:14 PM Pavel Borisov wrote: Hmm, I don't see this old code in these patches. Resend 0002-* because of trailing spaces. AFAIK, cfbot does not seek old versions of patchset parts in previous messages. So for it to run correctly, a new version of the whole patchset should be sent even if most patches are unchanged. Please, find the revised patchset with some refactoring and comments improvement from me. I'll continue looking into this. The patch looks better, thanks to your refactoring. I propose additional comments and tests to make the code more understandable (see attachment). I intended to look into this part of the code more, but the tests show a difference between PostgreSQL v.15 and v.16, which causes changes in the code of this feature. -- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index f4b7dcac21..5aac6d6677 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -397,6 +397,11 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, !list_member_ptr(*group_pathkeys, pathkey)) break; + /* +* Since 1349d27 pathkey coming from underlying node can be in the +* root->group_pathkeys but not in the processed_groupClause. So, we +* should be careful here. 
+*/ sgc = get_sortgroupref_clause_noerr(pathkey->pk_eclass->ec_sortref, *group_clauses); if (!sgc) diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index 423c8ec3b6..0d46e096e5 100644 --- a/src/test/regress/expected/aggregates.out +++ b/src/test/regress/expected/aggregates.out @@ -2857,6 +2857,28 @@ explain (COSTS OFF) SELECT x,w FROM btg GROUP BY w,x,y,z ORDER BY x*x,z; (8 rows) DROP TABLE btg; +-- The case, when scanning sort order correspond to aggregate sort order but +-- can not be found in the group-by list +CREATE TABLE t1 (c1 int PRIMARY KEY, c2 int); +CREATE UNIQUE INDEX ON t1(c2); +explain (costs off) +SELECT array_agg(c1 ORDER BY c2),c2 +FROM t1 WHERE c2 < 100 GROUP BY c1 ORDER BY 2; + QUERY PLAN + + Sort + Sort Key: c2 + -> GroupAggregate + Group Key: c1 + -> Sort + Sort Key: c1, c2 + -> Bitmap Heap Scan on t1 + Recheck Cond: (c2 < 100) + -> Bitmap Index Scan on t1_c2_idx + Index Cond: (c2 < 100) +(10 rows) + +DROP TABLE t1 CASCADE; RESET enable_hashagg; RESET max_parallel_workers; RESET max_parallel_workers_per_gather; diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql index b9fcceedd7..f99167ac9e 100644 --- a/src/test/regress/sql/aggregates.sql +++ b/src/test/regress/sql/aggregates.sql @@ -1224,6 +1224,15 @@ explain (COSTS OFF) SELECT x,w FROM btg GROUP BY w,x,y,z ORDER BY x*x,z; DROP TABLE btg; +-- The case, when scanning sort order correspond to aggregate sort order but +-- can not be found in the group-by list +CREATE TABLE t1 (c1 int PRIMARY KEY, c2 int); +CREATE UNIQUE INDEX ON t1(c2); +explain (costs off) +SELECT array_agg(c1 ORDER BY c2),c2 +FROM t1 WHERE c2 < 100 GROUP BY c1 ORDER BY 2; +DROP TABLE t1 CASCADE; + RESET enable_hashagg; RESET max_parallel_workers; RESET max_parallel_workers_per_gather;
Re: introduce dynamic shared memory registry
On 9/1/2024 00:16, Nathan Bossart wrote: On Mon, Jan 08, 2024 at 10:53:17AM +0530, Bharath Rupireddy wrote: 1. I think we need to add some notes about this new way of getting shared memory for external modules in the Shared Memory and LWLocks section in xfunc.sgml? This will at least tell there's another way for external modules to get shared memory, not just with the shmem_request_hook and shmem_startup_hook. What do you think? +1. Maybe even more - in the section related to extensions, this approach to using shared data can be mentioned, too. 2. FWIW, I'd like to call this whole feature "Support for named DSM segments in Postgres". Do you see anything wrong with this? Why do you feel it should be renamed? I don't see anything wrong with it, but I also don't see any particular advantage with that name compared to "dynamic shared memory registry." It is not a big issue, I suppose. But for me personally (as not a native English speaker), the label "Named DSM segments" seems more straightforward to understand. 3. IIUC, this feature eventually makes both shmem_request_hook and shmem_startup_hook pointless, no? Or put another way, what's the significance of shmem request and startup hooks in lieu of this new feature? I think it's quite possible to get rid of the shmem request and startup hooks (of course, not now but at some point in future to not break the external modules), because all the external modules can allocate and initialize the same shared memory via dsm_registry_init_or_attach and its init_callback. All the external modules will then need to call dsm_registry_init_or_attach in their _PG_init callbacks and/or in their bg worker's main functions in case the modules intend to start up bg workers. Am I right? Well, modules might need to do a number of other things (e.g., adding hooks) that can presently only be done when preloaded, in which case I doubt there's much benefit from switching to the DSM registry. 
I don't really intend for it to replace the existing request/startup hooks, but you're probably right that most, if not all, could use the registry instead. IMHO this is well beyond the scope of this thread, though. +1, there may be many reasons to use these hooks. >> 3. Use NAMEDATALEN instead of 64? >> + char key[64]; > I kept this the same, as I didn't see any need to tie the key size to > NAMEDATALEN. IMO, we should avoid magic numbers whenever possible. The current logic, according to which the first users of this feature will be extensions, naturally ties this size to the size of the 'name' type. And one more point: I think the commit already deserves a more detailed commit message. -- regards, Andrei Lepikhov Postgres Professional
Re: Custom explain options
On 10/1/2024 20:27, Konstantin Knizhnik wrote: On 10/01/2024 8:46 am, Michael Paquier wrote: On Wed, Jan 10, 2024 at 01:29:30PM +0700, Andrei Lepikhov wrote: What do you think about this really useful feature? Do you wish to develop it further? I am biased here. This seems like a lot of code for something we've been delegating to the explain hook for ages. Even if I can see the appeal of pushing that more into explain.c to get more data on a per-node basis depending on the custom options given by the caller of an EXPLAIN entry point, I cannot get really excited about the extra maintenance this facility would involve compared to the potential gains, knowing that there's a hook. -- Michael Well, I am not sure that the proposed patch is flexible enough to handle all possible scenarios. I just wanted to make it as simple as possible to leave some chance for it to be merged. But it is easy to answer the question of why the existing explain hook is not enough: 1. It doesn't allow adding extra options to EXPLAIN. My intention was to be able to do something like "explain (analyze,buffers,prefetch) ...". That is simply not possible with the explain hook. I agree. Designing mostly planner-related extensions, I also wanted to add some information to the EXPLAIN output of nodes. For example, pg_query_state could add the status of a node at the time execution was interrupted: started, stopped, or loop closed. Maybe we should gather some statistics on how extension developers deal with that issue ... -- regards, Andrei Lepikhov Postgres Professional
Re: Multidimensional Histograms
On 8/1/2024 16:21, Alexander Cheshev wrote: Hi Andrei, Maybe my wording needed to be more precise. I haven't implemented multidimensional histograms before, so I don't know how expensive they are. I meant that for dependency statistics over about six columns, we have a lot of combinations to compute. An Equi-Depth Histogram in a six-dimensional case requires 6 times more iterations. Postgres already uses the Equi-Depth Histogram. Even if you increase the number of buckets from 100 to 1000, there will be no overhead, as the time complexity of the Equi-Depth Histogram does not depend on the number of buckets. So, no overhead at all! Maybe. For three columns, we have 9 combinations (passes) for building dependency statistics and 4 combinations for ndistincts; for six columns, we have 186 and 57 combinations, respectively. Even remembering that a dependency is just one number per combination, building the dependency statistics is still massive work. So, in the multicolumn case, having something like a histogram may be more effective. -- regards, Andrei Lepikhov Postgres Professional
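The combination counts quoted above follow from simple combinatorics: functional-dependency statistics need one pass per (subset of at least two columns, choice of dependent column), while ndistinct needs one estimate per subset of at least two columns. A short sketch checking the figures (the formulas are inferred from the numbers in the message, so treat them as an illustration rather than a description of the actual ANALYZE code):

```python
from math import comb

def dependency_passes(n):
    # subsets of size k >= 2, times k choices of the dependent column
    return sum(comb(n, k) * k for k in range(2, n + 1))

def ndistinct_combinations(n):
    # all column subsets of size >= 2: 2^n minus the singletons and the empty set
    return 2 ** n - n - 1

print(dependency_passes(3), ndistinct_combinations(3))  # 9 4
print(dependency_passes(6), ndistinct_combinations(6))  # 186 57
```

The 9-vs-186 jump from three to six columns is the point of the message: the work grows combinatorially in the number of columns even though each dependency is only a single number.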
Re: Custom explain options
On 30/11/2023 22:40, Konstantin Knizhnik wrote: In all this cases we are using array of `Instrumentation` and if it contains varying part, then it is not clear where to place it. Yes, there is also code which serialize and sends instrumentations between worker processes and I have updated this code in my PR to send actual amount of custom instrumentation data. But it can not help with the cases above. What do you think about this really useful feature? Do you wish to develop it further? -- regards, Andrei Lepikhov Postgres Professional
Re: POC: GROUP BY optimization
On 9/1/2024 16:45, vignesh C wrote: On Tue, 9 Jan 2024 at 14:31, Andrei Lepikhov wrote: Here is a new version of the GROUP-BY optimization without the sort model. On 21/12/2023 17:53, Alexander Korotkov wrote: I'd like to make some notes. 1) As already mentioned, there is clearly a repetitive pattern for the code following after get_useful_group_keys_orderings() calls. I think it would be good to extract it into a separate function. Please, do this as a separate patch coming before the group-by patch. That would simplify the review. Done. See patch 0001-*. Unfortunately, extraction of the whole cycle isn't practical because it bloats the interface of the routine. 2) I wonder what planning overhead this patch could introduce? Could you try to measure the worst case? What if we have a table with a lot of indexes and a long list of group-by clauses partially matching every index? This should give us an understanding of whether we need a separate GUC to control this feature. In the current implementation, I don't anticipate any significant overhead. The GUC is needed here to let users enforce their own ordering and to disable the feature in case of problems. 4) I think we can do some optimizations when enable_incremental_sort == off. Then in get_useful_group_keys_orderings() we should only deal with input_path fully matching the group-by clause, and try only a full match of group-by output to the required order. Hmm, does it really make sense in the current implementation? 
CFBot shows the following errors at [1] with: [08:33:28.813] ../src/backend/utils/adt/selfuncs.c: In function ‘estimate_num_groups’: [08:33:28.813] ../src/backend/utils/adt/selfuncs.c:3389:9: warning: implicit declaration of function ‘estimate_num_groups_incremental’ [-Wimplicit-function-declaration] [08:33:28.813] 3389 | return estimate_num_groups_incremental(root, groupExprs, [08:33:28.813] | ^~~ [08:33:28.813] ../src/backend/utils/adt/selfuncs.c: At top level: [08:33:28.813] ../src/backend/utils/adt/selfuncs.c:3400:1: warning: no previous prototype for ‘estimate_num_groups_incremental’ [-Wmissing-prototypes] [08:33:28.813] 3400 | estimate_num_groups_incremental(PlannerInfo *root, List *groupExprs, [08:33:28.813] | ^~~ [08:33:28.813] ../src/backend/utils/adt/selfuncs.c:3400:1: error: conflicting types for ‘estimate_num_groups_incremental’ [08:33:28.813] ../src/backend/utils/adt/selfuncs.c:3389:9: note: previous implicit declaration of ‘estimate_num_groups_incremental’ was here [08:33:28.813] 3389 | return estimate_num_groups_incremental(root, groupExprs, Hmm, I don't see this old code in these patches. Resend 0002-* because of trailing spaces. -- regards, Andrei Lepikhov Postgres Professional From 45cfff5731b81c0df2af00f5e4212fc598f6a231 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Tue, 9 Jan 2024 12:32:15 +0700 Subject: [PATCH 2/2] Explore alternative orderings of group-by pathkeys during optimization. When evaluating a query with a multi-column GROUP BY clause, we can minimize sort operations or avoid them if we synchronize the order of GROUP BY clauses with the ORDER BY sort clause or sort order, which comes from the underlying query tree. Grouping does not imply any ordering, so we can compare the keys in arbitrary order, and a Hash Agg leverages this. But for Group Agg, we simply compared keys in the order specified in the query. This commit explores alternative ordering of the keys, trying to find a cheaper one. 
The ordering of group keys may interact with other parts of the query, some of which may not be known while planning the grouping. For example, there may be an explicit ORDER BY clause or some other ordering-dependent operation higher up in the query, and using the same ordering may allow using either incremental sort or even eliminating the sort entirely. The patch always keeps the ordering specified in the query, assuming the user might have additional insights. This introduces a new GUC enable_group_by_reordering so that the optimization may be disabled if needed. --- src/backend/optimizer/path/equivclass.c | 13 +- src/backend/optimizer/path/pathkeys.c | 222 ++ src/backend/optimizer/plan/planner.c | 214 +++-- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/pathnodes.h | 10 + src/include/optimizer/paths.h | 3 + src/test/regress/expected/aggregates.out | 132 +++ src/test/regress/expected/sysviews.out| 3 +- src/test/regress/sql/aggregates.sql | 47 10 files changed, 586 insertions(+), 69 deletions(-) diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index e86dfeaecd..7dd14d0a43 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -652