Re: speeding up planning with partitions
On Wed, May 1, 2019 at 4:08 AM Tom Lane wrote:
> OK, I tweaked that a bit and pushed both versions.

Thank you.

Regards,
Amit
Re: speeding up planning with partitions
Amit Langote writes:
> On Tue, Apr 30, 2019 at 1:26 PM Amit Langote wrote:
>> It would be nice if at least we fix the bug that directly accessed
>> partitions are not excluded with constraint_exclusion = on, without
>> removing PG 11's contortions in relation_excluded_by_constraints to
>> work around the odd requirements imposed by inheritance_planner, which
>> is causing the additional diffs in the regression expected output.

> FWIW, attached is a delta patch that applies on top of your patch for
> v11 branch that shows what may be one way to go about this.

OK, I tweaked that a bit and pushed both versions.

regards, tom lane
Re: speeding up planning with partitions
Amit Langote writes:
> Here is the patch. I've also included the patch to update the text in
> ddl.sgml regarding constraint exclusion and partition pruning.

I thought this was a bit messy. In particular, IMV the reason to have a
split between get_relation_constraints and its only caller
relation_excluded_by_constraints is to create a policy vs mechanism
separation: relation_excluded_by_constraints figures out what kinds of
constraints we need to look at, while get_relation_constraints does the
gruntwork of digging them out of the catalog data. Somebody had already
ignored this principle to the extent of putting this very-much-policy
test into get_relation_constraints:

    if (enable_partition_pruning && root->parse->commandType != CMD_SELECT)

but the way to improve that is to add another flag parameter to convey
the policy choice, not to move the whole chunk of mechanism out to the
caller.

It also struck me while looking at the code that we were being
unnecessarily stupid about non-inheritable constraints: rather than just
throwing up our hands for traditional inheritance situations, we can
still apply constraint exclusion, as long as we consider only
constraints that aren't marked ccnoinherit. (attnotnull constraints
have to be considered as if they were ccnoinherit, for ordinary tables
but not partitioned ones.)

So, I propose the attached revised patch.

I'm not sure how much of this, if anything, we should back-patch to v11.
It definitely doesn't seem like we should back-patch the improvement
just explained. I tried diking out that change, as in the v11 variant
attached, and found that this still causes quite a few other changes in
v11's expected results, most of them not for the better. So I'm thinking
that we'd better conclude that v11's ship has sailed. Its behavior is in
some ways weird, but I am not sure that anyone will appreciate our
changing it on the fourth minor release.

It's somewhat interesting that we get these other changes in v11 but
not HEAD.
I think the reason is that we reimplemented so much of
inheritance_planner in 428b260f8; that is, it seems the weird decisions
we find in relation_excluded_by_constraints are mostly there to band-aid
over the old weird behavior of inheritance_planner.

Anyway, my current thought is to apply this to HEAD and do nothing in
v11. I include the v11 patch just for amusement. (I did not check v11's
behavior outside the core regression tests; it might possibly have
additional test diffs in contrib.)

regards, tom lane

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8ddab75..84341a3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5084,10 +5084,11 @@ ANY num_sync ( .)
         Refer to  for
-        more information on using constraint exclusion and partitioning.
+        more information on using constraint exclusion to implement
+        partitioning.

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index cba2ea9..a0a7435 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -4535,24 +4535,11 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate = DATE '2008-01-01';
-     Currently, pruning of partitions during the planning of an
-     UPDATE or DELETE command is
-     implemented using the constraint exclusion method (however, it is
-     controlled by the enable_partition_pruning rather than
-     constraint_exclusion) see the following section
-     for details and caveats that apply.
-
      Execution-time partition pruning currently only occurs for the
      Append and MergeAppend node types. It is not yet implemented
      for the ModifyTable node
-     type.
-
-     Both of these behaviors are likely to be changed in a future release
-     of PostgreSQL.
+     type, but that is likely to be changed in a future release of
+     PostgreSQL.

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 0a6710c..eb6f5a3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1513,8 +1513,9 @@ inheritance_planner(PlannerInfo *root)
 		parent_rte->securityQuals = NIL;

 		/*
-		 * Mark whether we're planning a query to a partitioned table or an
-		 * inheritance parent.
+		 * HACK: setting this to a value other than INHKIND_NONE signals to
+		 * relation_excluded_by_constraints() to treat the result relation as
+		 * being an appendrel member.
 		 */
 		subroot->inhTargetKind = (rootRelation != 0) ?
 			INHKIND_PARTITIONED : INHKIND_INHERITED;

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 3301331..3215c29 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -67,7 +67,9 @@ static bool infer_collation_opclass_match(InferenceElem *elem, Relation idxRel,
 					List *idxExprs);
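The policy-vs-mechanism split argued for above (relation_excluded_by_constraints deciding *which* constraints matter, get_relation_constraints merely fetching them, with the policy passed down as flag parameters) could be sketched roughly as follows. This is a simplified standalone model, not the actual PostgreSQL code; the "Toy" struct and function names are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a table constraint; fields are invented, not the
 * real catalog representation. */
typedef struct ToyConstraint
{
	bool		ccnoinherit;	/* constraint is non-inheritable */
	bool		from_partbound; /* implied by the partition bound */
} ToyConstraint;

/*
 * Mechanism: dig out the constraints the caller asked for.  All policy
 * arrives via explicit flag parameters, in the spirit of adding another
 * flag rather than hoisting the mechanism into the caller.
 */
static int
toy_get_relation_constraints(const ToyConstraint *cons, int ncons,
							 bool include_noinherit,
							 bool include_partition)
{
	int			nkept = 0;

	for (int i = 0; i < ncons; i++)
	{
		if (cons[i].ccnoinherit && !include_noinherit)
			continue;
		if (cons[i].from_partbound && !include_partition)
			continue;
		nkept++;				/* would append to a List in reality */
	}
	return nkept;
}

/*
 * Policy: a traditional-inheritance parent must skip non-inheritable
 * constraints but can still attempt exclusion with the rest; a relation
 * whose partition constraint was already used for pruning need not
 * re-check it.
 */
static int
toy_relation_excluded_check(const ToyConstraint *cons, int ncons,
							bool is_inheritance_parent,
							bool partition_pruned_already)
{
	return toy_get_relation_constraints(cons, ncons,
										!is_inheritance_parent,
										!partition_pruned_already);
}
```

The point of the sketch is only the call shape: the caller owns the decision, the callee owns the catalog-digging loop.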
Re: speeding up planning with partitions
On Sun, Apr 28, 2019 at 8:10 AM Tom Lane wrote:
> Amit Langote writes:
> > Not sure if you'll like it but maybe we could ignore even regular
> > inheritance child targets that are proven to be empty (is_dummy_rel()) for
> > a given query during the initial SELECT planning. That way, we can avoid
> > re-running relation_excluded_by_constraints() a second time for *all*
> > child target relations.
>
> My thought was to keep traditional inheritance working more or less
> as it has. To do what you're suggesting, we'd have to move generic
> constraint-exclusion logic up into the RTE expansion phase, and I don't
> think that's a particularly great idea. I think what we should be
> doing is applying partition pruning (which is a very specialized form
> of constraint exclusion) during RTE expansion, then applying generic
> constraint exclusion in relation_excluded_by_constraints, and not
> examining partition constraints again there if we already used them.

Just to clarify, I wasn't suggesting that we change query_planner(), but
the blocks in inheritance_planner() that perform initial planning as if
the query were a SELECT and gather child target relations from that
planner run; the following consecutive blocks:

    /*
     * Before generating the real per-child-relation plans, do a cycle of
     * planning as though the query were a SELECT. ...
     */
    {
        PlannerInfo *subroot;

and:

    /*--
     * Since only one rangetable can exist in the final plan, we need to make
     * sure that it contains all the RTEs needed for any child plan. ...

    child_appinfos = NIL;
    old_child_rtis = NIL;
    new_child_rtis = NIL;
    parent_relids = bms_make_singleton(top_parentRTindex);

    foreach(lc, select_appinfos)
    {
        AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
        RangeTblEntry *child_rte;

        /* append_rel_list contains all append rels; ignore others */
        if (!bms_is_member(appinfo->parent_relid, parent_relids))
            continue;

        /* remember relevant AppendRelInfos for use below */
        child_appinfos = lappend(child_appinfos, appinfo);

I'm suggesting that we don't add the child relations that are dummy due
to constraint exclusion to child_appinfos. Maybe we'll need to combine
the two blocks so that the latter can use the PlannerInfo defined in the
former to look up the child relation to check whether it is dummy.

> > Do you want me to update my patch considering the above summary?
>
> Yes please.

I will try to get that done, hopefully by tomorrow. (Those of us who are
in Japan are on extended holidays around this time of year.)

> However, I wonder whether you're thinking differently in
> light of what you wrote in [1]:

Thanks for checking that thread.

> >>> Pruning in 10.2 works using internally generated partition constraints
> >>> (which for this purpose are same as CHECK constraints). With the new
> >>> pruning logic introduced in 11, planner no longer considers partition
> >>> constraint because it's redundant to check them in most cases, because
> >>> pruning would've selected the right set of partitions. Given that the new
> >>> pruning logic is still unable to handle the cases like above, maybe we
> >>> could change the planner to consider them, at least until we fix the
> >>> pruning logic to handle such cases.
>
> If we take that seriously then it would suggest not ignoring partition
> constraints in relation_excluded_by_constraints. However, I'm of the
> opinion that we shouldn't let temporary deficiencies in the
> partition-pruning logic drive what we do here. I don't think the set
> of cases where we could get a win by reconsidering the partition
> constraints is large enough to justify the cycles expended in doing so;
> and it'll get even smaller as pruning gets smarter.

Yeah, maybe we could get away with telling users to define equivalent
CHECK constraints for corner cases like that, although that's not really
great.

Thanks,
Amit
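The filtering suggested above could be sketched as follows — a standalone toy model with invented names, not the actual inheritance_planner code. After the initial SELECT-like planning pass has marked some child target relations dummy, those children would simply be skipped when building the list used for the real per-child planning:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-in for an appendrel child found by the initial
 * SELECT-like planning pass. */
typedef struct ToyChild
{
	int			relid;
	bool		is_dummy;		/* proven empty by constraint exclusion */
} ToyChild;

/*
 * Collect the child relids worth planning for real, dropping the dummy
 * ones so that relation_excluded_by_constraints need not run a second
 * time for them.
 */
static int
toy_gather_child_targets(const ToyChild *children, int nchildren,
						 int *out_relids)
{
	int			nkept = 0;

	for (int i = 0; i < nchildren; i++)
	{
		if (children[i].is_dummy)
			continue;			/* excluded: no per-child plan needed */
		out_relids[nkept++] = children[i].relid;
	}
	return nkept;
}
```

The real code would also need the two blocks combined, as noted above, so that the first pass's PlannerInfo is in scope when the dummy check runs.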
Re: speeding up planning with partitions
Amit Langote writes:
> On 2019/04/23 7:08, Tom Lane wrote:
>> [ a bunch of stuff ]

> Not sure if you'll like it but maybe we could ignore even regular
> inheritance child targets that are proven to be empty (is_dummy_rel()) for
> a given query during the initial SELECT planning. That way, we can avoid
> re-running relation_excluded_by_constraints() a second time for *all*
> child target relations.

My thought was to keep traditional inheritance working more or less
as it has. To do what you're suggesting, we'd have to move generic
constraint-exclusion logic up into the RTE expansion phase, and I don't
think that's a particularly great idea. I think what we should be
doing is applying partition pruning (which is a very specialized form
of constraint exclusion) during RTE expansion, then applying generic
constraint exclusion in relation_excluded_by_constraints, and not
examining partition constraints again there if we already used them.

> Do you want me to update my patch considering the above summary?

Yes please. However, I wonder whether you're thinking differently in
light of what you wrote in [1]:

>>> Pruning in 10.2 works using internally generated partition constraints
>>> (which for this purpose are same as CHECK constraints). With the new
>>> pruning logic introduced in 11, planner no longer considers partition
>>> constraint because it's redundant to check them in most cases, because
>>> pruning would've selected the right set of partitions. Given that the new
>>> pruning logic is still unable to handle the cases like above, maybe we
>>> could change the planner to consider them, at least until we fix the
>>> pruning logic to handle such cases.

If we take that seriously then it would suggest not ignoring partition
constraints in relation_excluded_by_constraints. However, I'm of the
opinion that we shouldn't let temporary deficiencies in the
partition-pruning logic drive what we do here. I don't think the set
of cases where we could get a win by reconsidering the partition
constraints is large enough to justify the cycles expended in doing so;
and it'll get even smaller as pruning gets smarter.

regards, tom lane

[1] https://www.postgresql.org/message-id/358cd54d-c018-60f8-7d76-55780eef6...@lab.ntt.co.jp
Re: speeding up planning with partitions
On 2019/04/23 7:08, Tom Lane wrote:
> Amit Langote writes:
>> On 2019/04/02 2:34, Tom Lane wrote:
>>> I'm not at all clear on what we think the interaction between
>>> enable_partition_pruning and constraint_exclusion ought to be,
>>> so I'm not sure what the appropriate resolution is here. Thoughts?
>
>> Prior to 428b260f87 (that is, in PG 11), partition pruning for UPDATE and
>> DELETE queries is realized by applying constraint exclusion to the
>> partition constraint of the target partition. The conclusion of the
>> discussion when adding the enable_partition_pruning GUC [1] was that
>> whether or not constraint exclusion is carried out (to facilitate
>> partition pruning) should be governed by the new GUC, not
>> constraint_exclusion, if only to avoid confusing users.
>
> I got back to thinking about how this ought to work.

Thanks a lot for taking the time to look at this.

> It appears to me
> that we've got half a dozen different behaviors that depend on one or both
> of these settings:
>
> 1. Use of ordinary table constraints (CHECK, NOT NULL) in baserel pruning,
> that is relation_excluded_by_constraints for baserels.
> This is enabled by constraint_exclusion = on.
>
> 2. Use of partition constraints in baserel pruning (applicable only
> when a partition is accessed directly).
> This is currently partly broken, and it's what your patch wants to
> change.

Yes. Any fix we come up with for this will need to be back-patched to
11, because it's a regression introduced in 11 when the then-new
partition pruning feature was committed (9fdb675fc).

> 3. Use of ordinary table constraints in appendrel pruning,
> that is relation_excluded_by_constraints for appendrel members.
> This is enabled by constraint_exclusion >= partition.
>
> 4. Use of partition constraints in appendrel pruning.
> This is enabled by the combination of enable_partition_pruning AND
> constraint_exclusion >= partition. However, it looks to me like this
> is now nearly if not completely useless because of #5.
>
> 5. Use of partition constraints in expand_partitioned_rtentry.
> Enabled by enable_partition_pruning (see prune_append_rel_partitions).

Right, #5 obviates #4.

> 6. Use of partition constraints in run-time partition pruning.
> This is also enabled by enable_partition_pruning, cf
> create_append_plan, create_merge_append_plan.
>
> Now in addition to what I mention above, there are assorted random
> differences in behavior depending on whether we are in an inherited
> UPDATE/DELETE or not. I consider these differences to be so bogus
> that I'm not even going to include them in this taxonomy; they should
> not exist. The UPDATE/DELETE target ought to act the same as a baserel.

The *partition* constraint of UPDATE/DELETE targets would never be
refuted by the query, because we process only those partition targets
that remain after applying partition pruning during the initial planning
of the query as if it were a SELECT. I'm saying we should distinguish
such targets as such when addressing #2.

Not sure if you'll like it but maybe we could ignore even regular
inheritance child targets that are proven to be empty (is_dummy_rel())
for a given query during the initial SELECT planning. That way, we can
avoid re-running relation_excluded_by_constraints() a second time for
*all* child target relations.

> I think this is ridiculously overcomplicated even without said random
> differences. I propose that we do the following:
>
> * Get rid of point 4 by not considering partition constraints for
> appendrel members in relation_excluded_by_constraints. It's just
> useless cycles in view of point 5, or nearly so. (Possibly there
> are corner cases where we could prove contradictions between a
> relation's partition constraints and regular constraints ... but is
> it really worth spending planner cycles to look for that?)

I guess not. If a partition's constraint contradicted its regular
constraints, there wouldn't be any data in such a partition to begin
with, no?

> * Make point 2 like point 1 by treating partition constraints for
> baserels like ordinary table constraints, ie, they are considered
> only when constraint_exclusion = on (independently of whether
> enable_partition_pruning is on).

Right, enable_partition_pruning should only apply to appendrel pruning.
If a partition is accessed directly, and is hence a baserel to the
planner, we only consider constraint_exclusion and perform exclusion
only if the setting is on. Another opinion on this is that we treat
partition constraints differently from regular constraints and ignore
the setting of constraint_exclusion, that is, always perform constraint
exclusion using partition constraints.

> * Treat an inherited UPDATE/DELETE target table as if it were an
> appendrel member for the purposes of relation_excluded_by_constraints,
> thus removing the behavioral differences between SELECT and UPDATE/DELETE.

As I mentioned above, planner encounters any given
Re: speeding up planning with partitions
Amit Langote writes:
> On 2019/04/02 2:34, Tom Lane wrote:
>> I'm not at all clear on what we think the interaction between
>> enable_partition_pruning and constraint_exclusion ought to be,
>> so I'm not sure what the appropriate resolution is here. Thoughts?

> Prior to 428b260f87 (that is, in PG 11), partition pruning for UPDATE and
> DELETE queries is realized by applying constraint exclusion to the
> partition constraint of the target partition. The conclusion of the
> discussion when adding the enable_partition_pruning GUC [1] was that
> whether or not constraint exclusion is carried out (to facilitate
> partition pruning) should be governed by the new GUC, not
> constraint_exclusion, if only to avoid confusing users.

I got back to thinking about how this ought to work. It appears to me
that we've got half a dozen different behaviors that depend on one or
both of these settings:

1. Use of ordinary table constraints (CHECK, NOT NULL) in baserel pruning,
that is relation_excluded_by_constraints for baserels.
This is enabled by constraint_exclusion = on.

2. Use of partition constraints in baserel pruning (applicable only
when a partition is accessed directly).
This is currently partly broken, and it's what your patch wants to
change.

3. Use of ordinary table constraints in appendrel pruning,
that is relation_excluded_by_constraints for appendrel members.
This is enabled by constraint_exclusion >= partition.

4. Use of partition constraints in appendrel pruning.
This is enabled by the combination of enable_partition_pruning AND
constraint_exclusion >= partition. However, it looks to me like this
is now nearly if not completely useless because of #5.

5. Use of partition constraints in expand_partitioned_rtentry.
Enabled by enable_partition_pruning (see prune_append_rel_partitions).

6. Use of partition constraints in run-time partition pruning.
This is also enabled by enable_partition_pruning, cf
create_append_plan, create_merge_append_plan.

Now in addition to what I mention above, there are assorted random
differences in behavior depending on whether we are in an inherited
UPDATE/DELETE or not. I consider these differences to be so bogus
that I'm not even going to include them in this taxonomy; they should
not exist. The UPDATE/DELETE target ought to act the same as a baserel.

I think this is ridiculously overcomplicated even without said random
differences. I propose that we do the following:

* Get rid of point 4 by not considering partition constraints for
appendrel members in relation_excluded_by_constraints. It's just
useless cycles in view of point 5, or nearly so. (Possibly there
are corner cases where we could prove contradictions between a
relation's partition constraints and regular constraints ... but is
it really worth spending planner cycles to look for that?)

* Make point 2 like point 1 by treating partition constraints for
baserels like ordinary table constraints, ie, they are considered
only when constraint_exclusion = on (independently of whether
enable_partition_pruning is on).

* Treat an inherited UPDATE/DELETE target table as if it were an
appendrel member for the purposes of relation_excluded_by_constraints,
thus removing the behavioral differences between SELECT and UPDATE/DELETE.

With this, constraint_exclusion would act pretty much as it
traditionally has, and in most cases would not have any special impact
on partitions compared to old-style inheritance. The behaviors that
enable_partition_pruning would control are expand_partitioned_rtentry
pruning and run-time pruning, neither of which have any applicability
to old-style inheritance.

Thoughts?

regards, tom lane
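Under the three-point proposal above, the decision of which constraints relation_excluded_by_constraints would examine reduces to a small table. The sketch below models just that policy as standalone C (the enum, struct, and function names are invented for illustration; this is not planner code):

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the three constraint_exclusion settings. */
typedef enum
{
	CE_OFF, CE_PARTITION, CE_ON
} ConstraintExclusionMode;

typedef struct ExclusionPolicy
{
	bool		check_table_constraints;	/* CHECK, NOT NULL */
	bool		check_partition_constraint;
} ExclusionPolicy;

/*
 * Proposed behavior: appendrel members (now including inherited
 * UPDATE/DELETE targets) use ordinary constraints at >= partition and
 * never re-check the partition constraint, since point 5 already used
 * it during expansion; directly-accessed partitions are plain baserels
 * whose partition constraint is treated like any table constraint,
 * gated by constraint_exclusion = on alone.
 */
static ExclusionPolicy
toy_exclusion_policy(ConstraintExclusionMode ce, bool appendrel_member)
{
	ExclusionPolicy p = {false, false};

	if (appendrel_member)
		p.check_table_constraints = (ce >= CE_PARTITION);
	else
	{
		p.check_table_constraints = (ce == CE_ON);
		p.check_partition_constraint = (ce == CE_ON);
	}
	return p;
}
```

Note that enable_partition_pruning does not appear at all: under the proposal it governs only expansion-time and run-time pruning, not this function.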
Re: speeding up planning with partitions
On 2019/04/11 14:03, David Rowley wrote:
> On Fri, 5 Apr 2019 at 19:50, Amit Langote wrote:
>> While we're on the topic of the relation between constraint exclusion and
>> partition pruning, I'd like to (re-) propose this documentation update
>> patch. The partitioning chapter in ddl.sgml says update/delete of
>> partitioned tables uses constraint exclusion internally to emulate
>> partition pruning, which is no longer true as of 428b260f8.
>
> Update-docs-that-update-delete-no-longer-use-cons.patch looks good to
> me. It should be changed as what the docs say is no longer true.

Thanks for the quick review. :)

Regards,
Amit
Re: speeding up planning with partitions
On Fri, 5 Apr 2019 at 19:50, Amit Langote wrote:
> While we're on the topic of the relation between constraint exclusion and
> partition pruning, I'd like to (re-) propose this documentation update
> patch. The partitioning chapter in ddl.sgml says update/delete of
> partitioned tables uses constraint exclusion internally to emulate
> partition pruning, which is no longer true as of 428b260f8.

Update-docs-that-update-delete-no-longer-use-cons.patch looks good to
me. It should be changed as what the docs say is no longer true.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: speeding up planning with partitions
On Fri, 5 Apr 2019 at 23:07, Amit Langote wrote:
>
> Hi,
>
> On 2019/04/05 18:13, Floris Van Nee wrote:
> > One unrelated thing I noticed (but I'm not sure if it's worth a separate
> > email thread) is that the changed default of jit=on in v12 doesn't work
> > very well with a large number of partitions at run-time, for which a large
> > number get excluded at run-time. A query that has an estimated cost above
> > jit_optimize_above_cost takes about 30 seconds to run (for a table with
> > 1000 partitions), because JIT is optimizing the full plan. Without JIT it's
> > barely 20ms (+400ms planning). I can give more details in a separate thread
> > if it's deemed interesting.
> >
> > Planning Time: 411.321 ms
> > JIT:
> >   Functions: 5005
> >   Options: Inlining false, Optimization true, Expressions true, Deforming true
> >   Timing: Generation 721.472 ms, Inlining 0.000 ms, Optimization 16312.195 ms, Emission 12533.611 ms, Total 29567.278 ms
>
> I've noticed a similar problem but in the context of interaction with
> parallel query mechanism. The planner, seeing that all partitions will be
> scanned (after failing to prune with clauses containing CURRENT_TIMESTAMP
> etc.), prepares a parallel plan (containing Parallel Append in this case).
> As you can imagine, parallel query initialization (Gather+workers) will
> take large amount of time relative to the time it will take to scan the
> partitions that remain after pruning (often just one).
>
> The problem in this case is that the planner is oblivious to the
> possibility of partition pruning occurring during execution, which may be
> common to both parallel query and JIT. If it wasn't oblivious, it
> would've set the cost of pruning-capable Append such that parallel query
> and/or JIT won't be invoked. We are going to have to fix that sooner or
> later.

Robert and I had a go at discussing this in [1]. Some ideas were thrown
around in the nature of contorting the Append/MergeAppend's total_cost
in a similar way to how clauselist_selectivity does its estimates for
unknown values. Perhaps it is possible to actually multiply the
total_cost by the clauselist_selectivity for the run-time pruning quals.
That's pretty crude and highly unusual, but it's probably going to give
something more sane than what's there today. The run-time prune quals
would likely need to be determined earlier than createplan.c for that to
work, though. IIRC, the reason it was done there is that, at the time,
there wasn't a need to do it per path.

I don't really have any better ideas right now, so if someone does then
maybe we should take it up on a new thread. It would be good to leave
this thread alone for unrelated things. It's long enough already.

[1] https://www.postgresql.org/message-id/CA%2BTgmobhXJGMuHxKjbaKcEJXacxVZHG4%3DhEGFfPF_FrGt37T_Q%40mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
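The crude costing idea mentioned above could look something like this back-of-envelope sketch. It is purely illustrative (invented function name, and real Append/MergeAppend costing involves far more than one multiplication):

```c
#include <assert.h>

/*
 * Scale an Append/MergeAppend total cost by an estimated selectivity of
 * its run-time pruning quals, analogous in spirit to how
 * clauselist_selectivity guesses for unknown comparison values.
 */
static double
toy_runtime_pruned_cost(double total_cost, double prune_selectivity)
{
	/* clamp, as selectivity estimates conventionally are */
	if (prune_selectivity < 0.0)
		prune_selectivity = 0.0;
	if (prune_selectivity > 1.0)
		prune_selectivity = 1.0;
	return total_cost * prune_selectivity;
}
```

With 1000 similar subplans and a selectivity estimate suggesting roughly one partition survives, the scaled cost lands near 1/1000 of the naive total — crude, but closer to the truth than charging for every subplan the executor will never touch.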
Re: speeding up planning with partitions
Hi,

On 2019/04/05 18:13, Floris Van Nee wrote:
> One unrelated thing I noticed (but I'm not sure if it's worth a separate
> email thread) is that the changed default of jit=on in v12 doesn't work very
> well with a large number of partitions at run-time, for which a large number
> get excluded at run-time. A query that has an estimated cost above
> jit_optimize_above_cost takes about 30 seconds to run (for a table with 1000
> partitions), because JIT is optimizing the full plan. Without JIT it's barely
> 20ms (+400ms planning). I can give more details in a separate thread if it's
> deemed interesting.
>
> Planning Time: 411.321 ms
> JIT:
>   Functions: 5005
>   Options: Inlining false, Optimization true, Expressions true, Deforming true
>   Timing: Generation 721.472 ms, Inlining 0.000 ms, Optimization 16312.195 ms, Emission 12533.611 ms, Total 29567.278 ms

I've noticed a similar problem, but in the context of interaction with
the parallel query mechanism. The planner, seeing that all partitions
will be scanned (after failing to prune with clauses containing
CURRENT_TIMESTAMP etc.), prepares a parallel plan (containing Parallel
Append in this case). As you can imagine, parallel query initialization
(Gather + workers) will take a large amount of time relative to the time
it will take to scan the partitions that remain after pruning (often
just one).

The problem in this case is that the planner is oblivious to the
possibility of partition pruning occurring during execution, which may
be common to both parallel query and JIT. If it weren't oblivious, it
would've set the cost of a pruning-capable Append such that parallel
query and/or JIT won't be invoked. We are going to have to fix that
sooner or later.

Thanks,
Amit
Re: speeding up planning with partitions
Thanks for the details! Indeed, the versions with now()/current_date use
run-time pruning rather than planning-time pruning. I wasn't aware of
the use of 'today' though - that could be useful in cases where we're
sure statements won't be prepared.

Previously (v10 / partly v11) it was necessary to make sure that
statements on partitioned tables were never prepared, because run-time
pruning wasn't available - using a generic plan was almost always a bad
option. Now in v12 it seems to be a tradeoff based on whether or not
run-time pruning can occur. If pruning is possible at planning time,
it's probably still better not to prepare statements, whereas if
run-time pruning has to occur, it's better to prepare them.

One unrelated thing I noticed (but I'm not sure if it's worth a separate
email thread) is that the changed default of jit=on in v12 doesn't work
very well with a large number of partitions of which a large number get
excluded at run-time. A query that has an estimated cost above
jit_optimize_above_cost takes about 30 seconds to run (for a table with
1000 partitions), because JIT is optimizing the full plan. Without JIT
it's barely 20ms (+400ms planning). I can give more details in a
separate thread if it's deemed interesting.

Planning Time: 411.321 ms
JIT:
  Functions: 5005
  Options: Inlining false, Optimization true, Expressions true, Deforming true
  Timing: Generation 721.472 ms, Inlining 0.000 ms, Optimization 16312.195 ms, Emission 12533.611 ms, Total 29567.278 ms

-Floris
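The interaction described above can be modeled very simply: the planner's total cost for the Append counts every subplan, and that whole-plan total is what gets compared against jit_optimize_above_cost, even though run-time pruning will discard most subplans before execution. A toy illustration (invented names; the 500000 threshold below is an assumed example value, not necessarily the shipped default):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sum per-partition subplan costs; the planner charges for all of
 * them, oblivious to any run-time pruning that may happen later.
 */
static double
toy_append_total_cost(const double *subplan_costs, int nsubplans)
{
	double		total = 0.0;

	for (int i = 0; i < nsubplans; i++)
		total += subplan_costs[i];
	return total;
}

/* JIT optimization is keyed off the whole plan's estimated cost. */
static bool
toy_jit_optimize_triggered(double plan_cost, double optimize_above_cost)
{
	return plan_cost > optimize_above_cost;
}
```

So 1000 modest subplans can push the estimate over the threshold and trigger expensive optimization of thousands of functions, even when the executor ends up running only one subplan.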
Re: speeding up planning with partitions
On 2019/04/02 14:50, Amit Langote wrote:
> Attached patch is only for HEAD this time. I'll post one for PG 11 (if
> you'd like) once we reach consensus on what the best thing to do here is.

While we're on the topic of the relation between constraint exclusion
and partition pruning, I'd like to (re-) propose this documentation
update patch. The partitioning chapter in ddl.sgml says update/delete of
partitioned tables uses constraint exclusion internally to emulate
partition pruning, which is no longer true as of 428b260f8.

The v2-0001 patch hasn't changed.

Thanks,
Amit

>From 336963d5f08a937c0890e794553dc23aced1fca1 Mon Sep 17 00:00:00 2001
From: amit
Date: Fri, 5 Apr 2019 15:41:11 +0900
Subject: [PATCH v3 2/2] Update docs that update/delete no longer use
 constraint exclusion

---
 doc/src/sgml/ddl.sgml | 18 ++
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 1fe27c5da9..33012939b8 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -4535,26 +4535,12 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate = DATE '2008-01-01';
      setting.
-
-     Currently, pruning of partitions during the planning of an
-     UPDATE or DELETE command is
-     implemented using the constraint exclusion method (however, it is
-     controlled by the enable_partition_pruning rather than
-     constraint_exclusion) see the following section
-     for details and caveats that apply.
-
      Execution-time partition pruning currently only occurs for the
      Append and MergeAppend node types. It is not yet implemented
      for the ModifyTable node
-     type.
-
-     Both of these behaviors are likely to be changed in a future release
-     of PostgreSQL.
+     type, but that is likely to be changed in a future release of
+     PostgreSQL.

--
2.11.0

>From 5549e6caae79259032e844812804529ffbf0d321 Mon Sep 17 00:00:00 2001
From: amit
Date: Tue, 2 Apr 2019 14:54:08 +0900
Subject: [PATCH v3 1/2] Fix partition constraint loading in planner

---
 src/backend/optimizer/plan/planner.c          |   5 +-
 src/backend/optimizer/util/plancat.c          | 106 +-
 src/test/regress/expected/partition_prune.out |  48
 src/test/regress/sql/partition_prune.sql      |  19 +
 4 files changed, 121 insertions(+), 57 deletions(-)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e2cdc83613..1f6bd142b7 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1513,8 +1513,9 @@ inheritance_planner(PlannerInfo *root)
 	parent_rte->securityQuals = NIL;

 	/*
-	 * Mark whether we're planning a query to a partitioned table or an
-	 * inheritance parent.
+	 * HACK: setting this to a value other than INHKIND_NONE signals to
+	 * relation_excluded_by_constraints() to process the result relation as
+	 * a partition; see that function for more details.
 	 */
 	subroot->inhTargetKind = (rootRelation != 0) ?
 		INHKIND_PARTITIONED : INHKIND_INHERITED;

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 3301331304..28923f805e 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -66,7 +66,7 @@ static void get_relation_foreign_keys(PlannerInfo *root, RelOptInfo *rel,
 static bool infer_collation_opclass_match(InferenceElem *elem, Relation idxRel,
 					List *idxExprs);
 static List *get_relation_constraints(PlannerInfo *root,
-					Oid relationObjectId, RelOptInfo *rel,
+					Relation relation, RelOptInfo *rel,
 					bool include_notnull);
 static List *build_index_tlist(PlannerInfo *root, IndexOptInfo *index,
 					Relation heapRelation);
@@ -1150,19 +1150,13 @@ get_relation_data_width(Oid relid, int32 *attr_widths)
  */
 static List *
 get_relation_constraints(PlannerInfo *root,
-			 Oid relationObjectId, RelOptInfo *rel,
+			 Relation relation, RelOptInfo *rel,
 			 bool include_notnull)
 {
 	List	   *result = NIL;
 	Index		varno = rel->relid;
-	Relation	relation;
 	TupleConstr *constr;

-	/*
-	 * We assume the relation has already been safely locked.
-	 */
-	relation = table_open(relationObjectId, NoLock);
-
 	constr = relation->rd_att->constr;
 	if (constr != NULL)
 	{
@@ -1242,38 +1236,6 @@
Re: speeding up planning with partitions
On 2019/04/05 12:18, David Rowley wrote: > On Fri, 5 Apr 2019 at 16:09, Amit Langote > wrote: >> Although, we still have ways >> to go in terms of scaling generic plan execution to larger partition >> counts, solution(s) for which have been proposed by David but haven't made >> it into master yet. > > Is that a reference to the last paragraph in [1]? That idea has not > gone beyond me writing that text yet! :-( It was more of a passing > comment on the only way I could think of to solve the problem. > > [1] > https://www.postgresql.org/message-id/CAKJS1f-y1HQK+VjG7=C==vGcLnzxjN8ysD5NmaN8Wh4=vsy...@mail.gmail.com Actually, I meant to refer to the following: https://commitfest.postgresql.org/22/1897/ Of course, we should pursue all available options. :) Thanks, Amit
Re: speeding up planning with partitions
On Fri, 5 Apr 2019 at 16:09, Amit Langote wrote: > Although, we still have ways > to go in terms of scaling generic plan execution to larger partition > counts, solution(s) for which have been proposed by David but haven't made > it into master yet. Is that a reference to the last paragraph in [1]? That idea has not gone beyond me writing that text yet! :-( It was more of a passing comment on the only way I could think of to solve the problem. [1] https://www.postgresql.org/message-id/CAKJS1f-y1HQK+VjG7=C==vGcLnzxjN8ysD5NmaN8Wh4=vsy...@mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: speeding up planning with partitions
On 2019/04/05 6:59, David Rowley wrote:
> On Fri, 5 Apr 2019 at 07:33, Floris Van Nee wrote:
>> I had a question about the performance of pruning of functions like now()
>> and current_date. I know these are handled differently, as they cannot be
>> excluded during the first phases of planning. However, currently, this new
>> patch makes the performance difference between the static timestamp variant
>> and now() very obvious (even more than before). Consider
>> select * from partitioned_table where ts >= now()
>> or
>> select * from partitioned_table where ts >= '2019-04-04'
>>
>> The second plans in less than a millisecond, whereas the first takes +-
>> 180ms for a table with 1000 partitions. Both end up with the same plan.
>
> The patch here only aims to improve the performance of queries to
> partitioned tables when partitions can be pruned during planning. The
> now() version of the query is unable to do that since we don't know
> what that value will be during the execution of the query. In that
> version, you're most likely seeing "Subplans Removed: ". This means
> run-time pruning did some pruning and the planner generated subplans
> for what you see plus others. Since planning for all partitions is
> still slow, you're getting a larger performance difference than
> before, but only due to the fact that the other plan is now faster to
> generate.

Yeah, the time for generating a plan for a query that *can* use pruning but
not during planning is still very much dependent on the number of
partitions, because access plans must be created for all partitions, even
if only one of those plans will actually be used and the rest pruned away
during execution.

> If you're never using prepared statements,

Or if using prepared statements is an option, the huge planning cost
mentioned above need not be paid repeatedly.
Although, we still have ways to go in terms of scaling generic plan execution to larger partition counts, solution(s) for which have been proposed by David but haven't made it into master yet. Thanks, Amit
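The prepared-statement route mentioned above can be sketched as follows (the table and column names are purely illustrative, not taken from the thread):

```sql
-- Plan once, then reuse: the parameter value is only supplied at
-- execution time, so run-time pruning selects the partitions, and the
-- per-execution planning cost is avoided once a generic plan is reused.
prepare q (timestamptz) as
  select * from partitioned_table where ts >= $1;

execute q (now());
```

Note, though, the caveat in the same message: generic-plan *execution* itself still scales poorly with large partition counts at this point, so this only helps up to a point.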
Re: speeding up planning with partitions
On Fri, 5 Apr 2019 at 07:33, Floris Van Nee wrote:
> I had a question about the performance of pruning of functions like now() and
> current_date. I know these are handled differently, as they cannot be
> excluded during the first phases of planning. However, currently, this new
> patch makes the performance difference between the static timestamp variant
> and now() very obvious (even more than before). Consider
> select * from partitioned_table where ts >= now()
> or
> select * from partitioned_table where ts >= '2019-04-04'
>
> The second plans in less than a millisecond, whereas the first takes +- 180ms
> for a table with 1000 partitions. Both end up with the same plan.

The patch here only aims to improve the performance of queries to partitioned tables when partitions can be pruned during planning. The now() version of the query is unable to do that since we don't know what that value will be during the execution of the query. In that version, you're most likely seeing "Subplans Removed: ". This means run-time pruning did some pruning and the planner generated subplans for what you see plus others. Since planning for all partitions is still slow, you're getting a larger performance difference than before, but only due to the fact that the other plan is now faster to generate.

If you're never using prepared statements, i.e., always planning right before execution, then you might want to consider using "where ts >= 'today'::timestamp". This will evaluate to the current date during parse and make the value available to the planner. You'll need to be pretty careful with that though, as if you do prepare queries or change to do that in the future then the bugs in your application could be very subtle and only do the wrong thing just after midnight on some day when the current time progresses over your partition boundary.

-- 
David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
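The three predicate variants being compared in this exchange, side by side (table and column names are only illustrative):

```sql
select * from partitioned_table where ts >= now();               -- stable function: prunable at run time only
select * from partitioned_table where ts >= '2019-04-04';        -- constant: prunable at plan time
select * from partitioned_table where ts >= 'today'::timestamp;  -- evaluated at parse time, so also prunable at
                                                                 -- plan time (but see the caveat above about
                                                                 -- prepared statements and midnight rollover)
```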
Re: speeding up planning with partitions
Hi all,

First of all I would like to thank everyone involved in this patch for their hard work on this. This is a big step forward. I've done some performance and functionality testing with the patch that was committed to master and it looks very good.

I had a question about the performance of pruning of functions like now() and current_date. I know these are handled differently, as they cannot be excluded during the first phases of planning. However, currently, this new patch makes the performance difference between the static timestamp variant and now() very obvious (even more than before). Consider

select * from partitioned_table where ts >= now()
or
select * from partitioned_table where ts >= '2019-04-04'

The second plans in less than a millisecond, whereas the first takes +- 180ms for a table with 1000 partitions. Both end up with the same plan.

I'm not too familiar with the code that handles this, but is there a possibility for improvement in this area? Or is the stage at which exclusion for now()/current_date occurs already too far in the process to make any good improvements to this? My apologies if this is considered off-topic for this patch, but I ran into this issue specifically when I was testing this patch, so I thought I'd ask here about it. I do think a large number of use-cases for tables with a large number of partitions involve a timestamp for partition key, and naturally people will start writing queries for this that use functions such as now() and current_date.

Thanks again for your work on this patch!

-Floris
Re: speeding up planning with partitions
Thanks for taking a look. On 2019/04/02 2:34, Tom Lane wrote: > Amit Langote writes: >> On 2019/03/30 0:29, Tom Lane wrote: >>> That seems like probably an independent patch --- do you want to write it? > >> Here is that patch. >> It revises get_relation_constraints() such that the partition constraint >> is loaded in only the intended cases. > > So I see the problem you're trying to solve here, but I don't like this > patch a bit, because it depends on root->inhTargetKind which IMO is a > broken bit of junk that we need to get rid of. Here is an example of > why, with this patch applied: > > regression=# create table p (a int) partition by list (a); > CREATE TABLE > regression=# create table p1 partition of p for values in (1); > CREATE TABLE > regression=# set constraint_exclusion to on; > SET > regression=# explain select * from p1 where a = 2; > QUERY PLAN > -- > Result (cost=0.00..0.00 rows=0 width=0) >One-Time Filter: false > (2 rows) > > So far so good, but watch what happens when we include the same case > in an UPDATE on some other partitioned table: > > regression=# create table prtab (a int, b int) partition by list (a); > CREATE TABLE > regression=# create table prtab2 partition of prtab for values in (2); > CREATE TABLE > regression=# explain update prtab set b=b+1 from p1 where prtab.a=p1.a and > p1.a=2; > QUERY PLAN > --- > Update on prtab (cost=0.00..82.30 rows=143 width=20) >Update on prtab2 >-> Nested Loop (cost=0.00..82.30 rows=143 width=20) > -> Seq Scan on p1 (cost=0.00..41.88 rows=13 width=10) >Filter: (a = 2) > -> Materialize (cost=0.00..38.30 rows=11 width=14) >-> Seq Scan on prtab2 (cost=0.00..38.25 rows=11 width=14) > Filter: (a = 2) > (8 rows) > > No constraint exclusion, while in v10 you get > > Update on prtab (cost=0.00..0.00 rows=0 width=0) >-> Result (cost=0.00..0.00 rows=0 width=0) > One-Time Filter: false > > The reason is that this logic supposes that root->inhTargetKind describes > *all* partitioned tables in the query, which is 
obviously wrong. > > Now maybe we could make it work by doing something like > > if (rel->reloptkind == RELOPT_BASEREL && > (root->inhTargetKind == INHKIND_NONE || > rel->relid != root->parse->resultRelation)) Ah, you're right. inhTargetKind has to be checked in conjunction with checking whether the relation is the target relation. > but I find that pretty messy, plus it's violating the concept that we > shouldn't be allowing messiness from inheritance_planner to leak into > other places. I'm afraid that we'll have to live with this particular hack as long as we have inheritance_planner(), but we maybe could somewhat reduce the extent to which the hack is spread into other planner files. How about we move the part of get_relation_constraints() that loads the partition constraint to its only caller relation_excluded_by_constraints()? If we do that, all uses of root->inhTargetKind will be confined to one place. Attached updated patch does that. > What I'd rather do is have this test just read > > if (rel->reloptkind == RELOPT_BASEREL) > > Making it be that way causes some changes in the partition_prune results, > as attached, which suggest that removing the enable_partition_pruning > check as you did wasn't such a great idea either. However, if I add > that back in, then it breaks the proposed new regression test case. > > I'm not at all clear on what we think the interaction between > enable_partition_pruning and constraint_exclusion ought to be, > so I'm not sure what the appropriate resolution is here. Thoughts? Prior to 428b260f87 (that is, in PG 11), partition pruning for UPDATE and DELETE queries is realized by applying constraint exclusion to the partition constraint of the target partition. The conclusion of the discussion when adding the enable_partition_pruning GUC [1] was that whether or not constraint exclusion is carried out (to facilitate partition pruning) should be governed by the new GUC, not constraint_exclusion, if only to avoid confusing users. 
428b260f87 has obviated the need to check enable_partition_pruning in relation_excluded_by_constraints(), because inheritance_planner() now runs the query as if it were SELECT, which does partition pruning using partprune.c, governed by the setting of enable_partition_pruning. So, there's no need to check it again in relation_excluded_by_constraints(), because we won't be consulting the partition constraint again; well, at least after applying the proposed patch. > BTW, just about all the other uses of root->inhTargetKind seem equally > broken from here; none of them are accounting for whether the rel in
Re: speeding up planning with partitions
Amit Langote writes: > On 2019/03/30 0:29, Tom Lane wrote: >> That seems like probably an independent patch --- do you want to write it? > Here is that patch. > It revises get_relation_constraints() such that the partition constraint > is loaded in only the intended cases. So I see the problem you're trying to solve here, but I don't like this patch a bit, because it depends on root->inhTargetKind which IMO is a broken bit of junk that we need to get rid of. Here is an example of why, with this patch applied: regression=# create table p (a int) partition by list (a); CREATE TABLE regression=# create table p1 partition of p for values in (1); CREATE TABLE regression=# set constraint_exclusion to on; SET regression=# explain select * from p1 where a = 2; QUERY PLAN -- Result (cost=0.00..0.00 rows=0 width=0) One-Time Filter: false (2 rows) So far so good, but watch what happens when we include the same case in an UPDATE on some other partitioned table: regression=# create table prtab (a int, b int) partition by list (a); CREATE TABLE regression=# create table prtab2 partition of prtab for values in (2); CREATE TABLE regression=# explain update prtab set b=b+1 from p1 where prtab.a=p1.a and p1.a=2; QUERY PLAN --- Update on prtab (cost=0.00..82.30 rows=143 width=20) Update on prtab2 -> Nested Loop (cost=0.00..82.30 rows=143 width=20) -> Seq Scan on p1 (cost=0.00..41.88 rows=13 width=10) Filter: (a = 2) -> Materialize (cost=0.00..38.30 rows=11 width=14) -> Seq Scan on prtab2 (cost=0.00..38.25 rows=11 width=14) Filter: (a = 2) (8 rows) No constraint exclusion, while in v10 you get Update on prtab (cost=0.00..0.00 rows=0 width=0) -> Result (cost=0.00..0.00 rows=0 width=0) One-Time Filter: false The reason is that this logic supposes that root->inhTargetKind describes *all* partitioned tables in the query, which is obviously wrong. 
Now maybe we could make it work by doing something like

if (rel->reloptkind == RELOPT_BASEREL &&
    (root->inhTargetKind == INHKIND_NONE ||
     rel->relid != root->parse->resultRelation))

but I find that pretty messy, plus it's violating the concept that we shouldn't be allowing messiness from inheritance_planner to leak into other places. What I'd rather do is have this test just read

if (rel->reloptkind == RELOPT_BASEREL)

Making it be that way causes some changes in the partition_prune results, as attached, which suggest that removing the enable_partition_pruning check as you did wasn't such a great idea either. However, if I add that back in, then it breaks the proposed new regression test case.

I'm not at all clear on what we think the interaction between enable_partition_pruning and constraint_exclusion ought to be, so I'm not sure what the appropriate resolution is here. Thoughts?

BTW, just about all the other uses of root->inhTargetKind seem equally broken from here; none of them are accounting for whether the rel in question is the query target.

regards, tom lane

diff -U3 /home/postgres/pgsql/src/test/regress/expected/partition_prune.out /home/postgres/pgsql/src/test/regress/results/partition_prune.out
--- /home/postgres/pgsql/src/test/regress/expected/partition_prune.out	2019-04-01 12:39:52.613109088 -0400
+++ /home/postgres/pgsql/src/test/regress/results/partition_prune.out	2019-04-01 13:18:02.852615395 -0400
@@ -3409,24 +3409,18 @@
 --------------------------
  Update on pp_lp
    Update on pp_lp1
-   Update on pp_lp2
    ->  Seq Scan on pp_lp1
          Filter: (a = 1)
-   ->  Seq Scan on pp_lp2
-         Filter: (a = 1)
-(7 rows)
+(4 rows)
 
 explain (costs off) delete from pp_lp where a = 1;
        QUERY PLAN
 --------------------------
  Delete on pp_lp
    Delete on pp_lp1
-   Delete on pp_lp2
    ->  Seq Scan on pp_lp1
          Filter: (a = 1)
-   ->  Seq Scan on pp_lp2
-         Filter: (a = 1)
-(7 rows)
+(4 rows)
 
 set constraint_exclusion = 'off'; -- this should not affect the result.
 explain (costs off) select * from pp_lp where a = 1;
@@ -3444,24 +3438,18 @@
 --------------------------
  Update on pp_lp
    Update on pp_lp1
-   Update on pp_lp2
    ->  Seq Scan on pp_lp1
          Filter: (a = 1)
-   ->  Seq Scan on pp_lp2
-         Filter: (a = 1)
-(7 rows)
+(4 rows)
 
 explain (costs off) delete from pp_lp where a = 1;
        QUERY PLAN
 --------------------------
  Delete on pp_lp
    Delete on pp_lp1
-   Delete on pp_lp2
    ->  Seq Scan on pp_lp1
          Filter: (a = 1)
-   ->  Seq Scan on pp_lp2
-         Filter: (a = 1)
-(7 rows)
+(4 rows)
 
 drop table pp_lp;
 -- Ensure enable_partition_prune does not affect non-partitioned tables.
Re: speeding up planning with partitions
Amit Langote writes: > On 2019/04/01 3:46, Tom Lane wrote: >> One thing that I intentionally left out of the committed patch was changes >> to stop short of scanning the whole simple_rel_array when looking only for >> baserels. I thought that had been done in a rather piecemeal fashion >> and it'd be better to address it holistically, which I've now done in the >> attached proposed patch. >> I have not done any performance testing to see if this is actually >> worth the trouble, though. Anybody want to do that? > Thanks for creating the patch. > I spent some time looking for cases where this patch would provide > recognizable benefit, but couldn't find one. Yeah, I was afraid of that. In cases where we do have a ton of otherrels, the processing that's actually needed on them would probably swamp any savings from this patch. The only place where that might possibly not be true is create_lateral_join_info, since that has nested loops that could potentially impose an O(N^2) cost. However, since your patch went in, that runs before inheritance expansion anyway. So this probably isn't worth even the minuscule cost it imposes. regards, tom lane
Re: speeding up planning with partitions
On 2019/03/30 0:29, Tom Lane wrote:
> Amit Langote writes:
>> Finally, it's not in the patch, but how about visiting
>> get_relation_constraints() for revising this block of code:
>
> That seems like probably an independent patch --- do you want to write it?

Here is that patch.

It revises get_relation_constraints() such that the partition constraint is loaded in only the intended cases. To summarize:

* PG 11 currently misses one such intended case (select * from partition), causing a *bug* that constraint exclusion fails to exclude the partition with constraint_exclusion = on

* HEAD loads the partition constraint even in some cases where 428b260f87 rendered doing that unnecessary

Thanks,
Amit

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 31a3784536..b7ae063585 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1250,11 +1250,15 @@ get_relation_constraints(PlannerInfo *root,
 	/*
 	 * Append partition predicates, if any.
 	 *
-	 * For selects, partition pruning uses the parent table's partition bound
-	 * descriptor, instead of constraint exclusion which is driven by the
-	 * individual partition's partition constraint.
+	 * If the partition is accessed indirectly via its parent table, partition
+	 * pruning is performed with the parent table's partition bound, so there
+	 * is no need to include the partition constraint in that case.  However,
+	 * if the partition is referenced directly in the query and we're not
+	 * being called from inheritance_planner(), then no partition pruning
+	 * would have occurred, so we'll include it in that case.
 	 */
-	if (enable_partition_pruning && root->parse->commandType != CMD_SELECT)
+	if (rel->reloptkind == RELOPT_BASEREL &&
+		root->inhTargetKind == INHKIND_NONE)
 	{
 		List	   *pcqual = RelationGetPartitionQual(relation);
 
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 7806ba1d47..0bc0ed8042 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3643,4 +3643,44 @@ select * from listp where a = (select 2) and b <> 10;
    ->  Result (never executed)
 (4 rows)
 
+--
+-- check that a partition directly accessed in a query is excluded with
+-- constraint_exclusion = on
+--
+-- turn off partition pruning, so that it doesn't interfere
+set enable_partition_pruning to off;
+-- constraint exclusion doesn't apply
+set constraint_exclusion to 'partition';
+explain (costs off) select * from listp1 where a = 2;
+     QUERY PLAN
+--------------------
+ Seq Scan on listp1
+   Filter: (a = 2)
+(2 rows)
+
+explain (costs off) select * from listp2 where a = 1;
+      QUERY PLAN
+-----------------------
+ Seq Scan on listp2_10
+   Filter: (a = 1)
+(2 rows)
+
+-- constraint exclusion applies
+set constraint_exclusion to 'on';
+explain (costs off) select * from listp1 where a = 2;
+        QUERY PLAN
+--------------------------
+ Result
+   One-Time Filter: false
+(2 rows)
+
+explain (costs off) select * from listp2 where a = 1;
+        QUERY PLAN
+--------------------------
+ Result
+   One-Time Filter: false
+(2 rows)
+
+reset constraint_exclusion;
+reset enable_partition_pruning;
 drop table listp;
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index 2e4d2b483d..cc3c497238 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -990,4 +990,22 @@ create table listp2_10 partition of listp2 for values in (10);
 explain (analyze, costs off, summary off, timing off)
 select * from listp where a = (select 2) and b <> 10;
 
+--
+-- check that a partition directly accessed in a query is excluded with
+-- constraint_exclusion = on
+--
+
+-- turn off partition pruning, so that it doesn't interfere
+set enable_partition_pruning to off;
+
+-- constraint exclusion doesn't apply
+set constraint_exclusion to 'partition';
+explain (costs off) select * from listp1 where a = 2;
+explain (costs off) select * from listp2 where a = 1;
+-- constraint exclusion applies
+set constraint_exclusion to 'on';
+explain (costs off) select * from listp1 where a = 2;
+explain (costs off) select * from listp2 where a = 1;
+reset constraint_exclusion;
+reset enable_partition_pruning;
 drop table listp;
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 8369e3ad62..8428fe37bb 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1269,10 +1269,14 @@ get_relation_constraints(PlannerInfo *root,
 	/*
 	 * Append partition predicates, if any.
 	 *
-	 * For selects, partition pruning uses the parent table's partition bound
-	 *
Re: speeding up planning with partitions
On 2019/04/01 3:46, Tom Lane wrote:
> One thing that I intentionally left out of the committed patch was changes
> to stop short of scanning the whole simple_rel_array when looking only for
> baserels. I thought that had been done in a rather piecemeal fashion
> and it'd be better to address it holistically, which I've now done in the
> attached proposed patch.
>
> This probably makes little if any difference in the test cases we've
> mostly focused on in this thread, since there wouldn't be very many
> otherrels anyway now that we don't create them for pruned partitions.
> However, in a case where there's a lot of partitions that we can't prune,
> this could be useful.
>
> I have not done any performance testing to see if this is actually
> worth the trouble, though. Anybody want to do that?

Thanks for creating the patch.

I spent some time looking for cases where this patch would provide recognizable benefit, but couldn't find one. As one would suspect, it's hard to notice it if only looking at the overall latency of the query, because time spent doing other things with such plans tends to be pretty huge anyway (both in the planner itself and other parts of the backend). I even devised a query on a partitioned table such that the planner has to process all partitions, but ExecInitAppend can prune all but one, thus reducing the time spent in the executor, but still wasn't able to see an improvement in the overall latency of the query due to the planner no longer looping over the long simple_rel_array.

Thanks,
Amit
Re: speeding up planning with partitions
(I've closed the CF entry: https://commitfest.postgresql.org/22/1778/)

On 2019/04/01 2:04, Tom Lane wrote:
> Amit Langote writes:
>> On Sun, Mar 31, 2019 at 11:45 AM Imai Yoshikazu wrote:
>>> Certainly, using bitmapset contributes to the performance when scanning
>>> one partition(few partitions) from large partitions.
>
>> Thanks Imai-san for testing.
>
> I tried to replicate these numbers with the code as-committed, and
> could not.

Thanks for that.

> What I get, using the same table-creation code as you
> posted and a pgbench script file like
>
> \set param random(1, :N)
> select * from rt where a = :param;
>
> is scaling like this:
>
> N       tps, range      tps, hash
>
> 2       10520.519932    10415.230400
> 8       10443.361457    10480.987665
> 32      10341.196768    10462.551167
> 128     10370.953849    10383.885128
> 512     10207.578413    10214.049394
> 1024    10042.794340    10121.683993
> 4096    8937.561825     9214.993778
> 8192    8247.614040     8486.728918
>
> If I use "-M prepared" the numbers go up a bit for lower N, but
> drop at high N:
>
> N       tps, range      tps, hash
>
> 2       11449.920527    11462.253871
> 8       11530.513146    11470.812476
> 32      11372.412999    11450.213753
> 128     11289.351596    11322.698856
> 512     11095.428451    11200.683771
> 1024    10757.646108    10805.052480
> 4096    8689.165875     8930.690887
> 8192    7301.609147     7502.806455
>
> Digging into that, it seems like the degradation with -M prepared is
> mostly in LockReleaseAll's hash_seq_search over the locallock hash table.
> What I think must be happening is that with -M prepared, at some point the
> plancache decides to try a generic plan, which causes opening/locking all
> the partitions, resulting in permanent bloat in the locallock hash table.
> We immediately go back to using custom plans, but hash_seq_search has
> more buckets to look through for the remainder of the process' lifetime.
Ah, we did find this to be a problem upthread [1] and Tsunakawa-san then even posted a patch which is being discussed at:

https://commitfest.postgresql.org/22/1993/

> I do see some cycles getting spent in apply_scanjoin_target_to_paths
> that look to be due to scanning over the long part_rels array,
> which your proposal would ameliorate. But (a) that's pretty small
> compared to other effects, and (b) IMO, apply_scanjoin_target_to_paths
> is a remarkable display of brute force inefficiency to begin with.
> I think we should see if we can't nuke that function altogether in
> favor of generating the paths with the right target the first time.

That's an option if we can make it work. Shouldn't we look at *all* of the places that have code that now looks like this:

    for (i = 0; i < rel->nparts; i++)
    {
        RelOptInfo *partrel = rel->part_rels[i];

        if (partrel == NULL)
            continue;

        ...
    }

Besides apply_scanjoin_target_to_paths(), there are:

create_partitionwise_grouping_paths()
make_partitionedrel_pruneinfo()

> BTW, the real elephant in the room is the O(N^2) cost of creating
> these tables in the first place. The runtime for the table-creation
> scripts looks like
>
> N       range       hash
>
> 2       0m0.011s    0m0.011s
> 8       0m0.015s    0m0.014s
> 32      0m0.032s    0m0.030s
> 128     0m0.132s    0m0.099s
> 512     0m0.969s    0m0.524s
> 1024    0m3.306s    0m1.442s
> 4096    0m46.058s   0m15.522s
> 8192    3m11.995s   0m58.720s
>
> This seems to be down to the expense of doing RelationBuildPartitionDesc
> to rebuild the parent's relcache entry for each child CREATE TABLE.
> Not sure we can avoid that, but maybe we should consider adopting a
> cheaper-to-read representation of partition descriptors. The fact that
> range-style entries seem to be 3X more expensive to load than hash-style
> entries is strange.

I've noticed this many times too, but never prioritized doing something about it. I'll try sometime.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/CAKJS1f-dn1hDZqObwdMrYdV7-cELJwWCPRWet6EQX_WaV8JLgw%40mail.gmail.com
Re: speeding up planning with partitions
On Sun, 31 Mar 2019 at 05:50, Robert Haas wrote:
>
> On Sat, Mar 30, 2019 at 12:16 PM Amit Langote wrote:
> > Fwiw, I had complained when reviewing the run-time pruning patch that
> > creating those maps in the planner and putting them in
> > PartitionPruneInfo might not be a good idea, but David insisted that
> > it'd be good for performance (in the context of using cached plans) to
> > compute this information during planning.
>
> Well, he's not wrong about that, I expect.

I'm aware that there have been objections to either having these arrays and/or to editing them during execution. I don't recall Amit's complaint, but I do recall Tom's. He suggested we not resequence the arrays in the executor and just maintain NULL elements in the Append/MergeAppend subplans. I did consider this when writing run-time pruning but found that performance suffers. I demonstrated this on a thread somewhere.

IIRC, I wrote this code because there was no way to translate the result of the pruning code into Append/MergeAppend subplan indexes. Robert has since added a map of Oids to allow the executor to have those details, so it perhaps would be possible to take the result of the pruning code, then look up the Oids of the partitions that survived pruning, then map those to the subplans using the array Robert added. Using the array for that wouldn't be very efficient due to a lookup being O(n) per surviving partition. Maybe it could be thrown into a hashtable to make that faster. This solution would need to take into account mixed-hierarchy Appends, e.g.

SELECT * FROM partitioned_table WHERE partkey = $1 UNION ALL SELECT * FROM something_else;

so it would likely need to be a hashtable per partitioned table. If the pruning code returned a partition whose Oid we didn't know about, then it must be from a partition that was added concurrently since the plan was built... However, that shouldn't happen today since Robert went to great lengths for it not to.
Further discussions are likely best put in their own thread. As far as I know, nothing is broken with the code today. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: speeding up planning with partitions
One thing that I intentionally left out of the committed patch was changes to stop short of scanning the whole simple_rel_array when looking only for baserels. I thought that had been done in a rather piecemeal fashion and it'd be better to address it holistically, which I've now done in the attached proposed patch. This probably makes little if any difference in the test cases we've mostly focused on in this thread, since there wouldn't be very many otherrels anyway now that we don't create them for pruned partitions. However, in a case where there's a lot of partitions that we can't prune, this could be useful. I have not done any performance testing to see if this is actually worth the trouble, though. Anybody want to do that?

regards, tom lane

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 727da33..7a9aa12 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -157,7 +157,7 @@ make_one_rel(PlannerInfo *root, List *joinlist)
 	 * Construct the all_baserels Relids set.
 	 */
 	root->all_baserels = NULL;
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
@@ -290,7 +290,7 @@ set_base_rel_sizes(PlannerInfo *root)
 {
 	Index		rti;
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *rel = root->simple_rel_array[rti];
 		RangeTblEntry *rte;
@@ -333,7 +333,7 @@ set_base_rel_pathlists(PlannerInfo *root)
 {
 	Index		rti;
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *rel = root->simple_rel_array[rti];
@@ -1994,7 +1994,7 @@ has_multiple_baserels(PlannerInfo *root)
 	int			num_base_rels = 0;
 	Index		rti;
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 61b5b11..723643c 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -828,11 +828,11 @@ generate_base_implied_equalities(PlannerInfo *root)
 	 * This is also a handy place to mark base rels (which should all exist by
 	 * now) with flags showing whether they have pending eclass joins.
 	 */
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
-		if (brel == NULL)
+		if (brel == NULL || brel->reloptkind != RELOPT_BASEREL)
 			continue;
 		brel->has_eclass_joins = has_relevant_eclass_joinclause(root, brel);
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index 9798dca..c5459b6 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -145,7 +145,7 @@ add_other_rels_to_query(PlannerInfo *root)
 {
 	int			rti;
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *rel = root->simple_rel_array[rti];
 		RangeTblEntry *rte = root->simple_rte_array[rti];
@@ -312,7 +312,7 @@ find_lateral_references(PlannerInfo *root)
 	/*
 	 * Examine all baserels (the rel array has been set up by now).
 	 */
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
@@ -460,7 +460,7 @@ create_lateral_join_info(PlannerInfo *root)
 	/*
 	 * Examine all baserels (the rel array has been set up by now).
 	 */
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
 		Relids		lateral_relids;
@@ -580,7 +580,7 @@ create_lateral_join_info(PlannerInfo *root)
 	 * The outer loop considers each baserel, and propagates its lateral
 	 * dependencies to those baserels that have a lateral dependency on it.
 	 */
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
 		Relids		outer_lateral_relids;
@@ -595,7 +595,7 @@ create_lateral_join_info(PlannerInfo *root)
 			continue;

 		/* else scan all baserels */
-		for (rti2 = 1; rti2 < root->simple_rel_array_size; rti2++)
+		for (rti2 = 1; rti2 <= root->last_base_relid; rti2++)
 		{
 			RelOptInfo *brel2 = root->simple_rel_array[rti2];
@@ -614,7 +614,7 @@ create_lateral_join_info(PlannerInfo *root)
 	 * with the set of relids of rels that reference it laterally (possibly
 	 * indirectly) --- that is, the inverse mapping of lateral_relids.
 	 */
-	for (rti = 1; rti <
Re: speeding up planning with partitions
Amit Langote writes:
> On Sun, Mar 31, 2019 at 11:45 AM Imai Yoshikazu wrote:
>> Certainly, using bitmapset contributes to the performance when scanning
>> one partition (few partitions) from large partitions.
> Thanks Imai-san for testing.

I tried to replicate these numbers with the code as-committed, and could not. What I get, using the same table-creation code as you posted and a pgbench script file like

\set param random(1, :N)
select * from rt where a = :param;

is scaling like this:

N       tps, range      tps, hash
2       10520.519932    10415.230400
8       10443.361457    10480.987665
32      10341.196768    10462.551167
128     10370.953849    10383.885128
512     10207.578413    10214.049394
1024    10042.794340    10121.683993
4096    8937.561825     9214.993778
8192    8247.614040     8486.728918

If I use "-M prepared" the numbers go up a bit for lower N, but drop at high N:

N       tps, range      tps, hash
2       11449.920527    11462.253871
8       11530.513146    11470.812476
32      11372.412999    11450.213753
128     11289.351596    11322.698856
512     11095.428451    11200.683771
1024    10757.646108    10805.052480
4096    8689.165875     8930.690887
8192    7301.609147     7502.806455

Digging into that, it seems like the degradation with -M prepared is mostly in LockReleaseAll's hash_seq_search over the locallock hash table. What I think must be happening is that with -M prepared, at some point the plancache decides to try a generic plan, which causes opening/locking all the partitions, resulting in permanent bloat in the locallock hash table. We immediately go back to using custom plans, but hash_seq_search has more buckets to look through for the remainder of the process' lifetime.

I do see some cycles getting spent in apply_scanjoin_target_to_paths that look to be due to scanning over the long part_rels array, which your proposal would ameliorate. But (a) that's pretty small compared to other effects, and (b) IMO, apply_scanjoin_target_to_paths is a remarkable display of brute force inefficiency to begin with.
I think we should see if we can't nuke that function altogether in favor of generating the paths with the right target the first time.

BTW, the real elephant in the room is the O(N^2) cost of creating these tables in the first place. The runtime for the table-creation scripts looks like

N       range       hash
2       0m0.011s    0m0.011s
8       0m0.015s    0m0.014s
32      0m0.032s    0m0.030s
128     0m0.132s    0m0.099s
512     0m0.969s    0m0.524s
1024    0m3.306s    0m1.442s
4096    0m46.058s   0m15.522s
8192    3m11.995s   0m58.720s

This seems to be down to the expense of doing RelationBuildPartitionDesc to rebuild the parent's relcache entry for each child CREATE TABLE. Not sure we can avoid that, but maybe we should consider adopting a cheaper-to-read representation of partition descriptors. The fact that range-style entries seem to be 3X more expensive to load than hash-style entries is strange.

regards, tom lane
Re: speeding up planning with partitions
On Sun, Mar 31, 2019 at 11:45 AM Imai Yoshikazu wrote:
> On 2019/03/31 1:06, Amit Langote wrote:
> > On Sun, Mar 31, 2019 at 12:11 AM Tom Lane wrote:
> >> I am curious as to why there seems to be more degradation
> >> for hash cases, as per Yoshikazu-san's results in
> >> <0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24>,
> >> but whatever's accounting for the difference probably
> >> is not that.
> >
> > I suspected it may have been the lack of bitmapsets, but maybe only
> > Imai-san could've confirmed that by applying the live_parts patch too.
>
> Yeah, I forgot to apply the live_parts patch. I did the same test again which I did for hash before.
> (BTW, thanks for committing speeding up patches!)

Thanks a lot for committing, Tom. I wish you had listed yourself as an author though. I will send the patch for get_relation_constraints() mentioned upthread tomorrow.

> [HEAD(428b260)]
> nparts   TPS
> ======   ====
> 2:       13134 (13240, 13290, 13071, 13172, 12896)
> 1024:    12627 (12489, 12635, 12716, 12732, 12562)
> 8192:    10289 (10216, 10265, 10171, 10278, 10514)
>
> [HEAD(428b260) + live_parts.diff]
> nparts   TPS
> ======   ====
> 2:       13277 (13112, 13290, 13241, 13360, 13382)
> 1024:    12821 (12930, 12849, 12909, 12700, 12716)
> 8192:    11102 (11134, 11158, 4, 10997, 11109)
>
> Degradations of performance are below.
>
> My test results from above (with live_parts, HEAD(428b260) + live_parts.diff)
> nparts   live_parts   HEAD
> ======   ==========   ====
> 2:       13277        13134
> 1024:    12821        12627
> 8192:    11102        10289
>
> 11102/13277 = 83.6 %
>
> Amit-san's test results (with live_parts)
> > nparts   v38    HEAD
> > ======   ====   ====
> > 2        2971   2969
> > 8        2980   1949
> > 32       2955   733
> > 128      2946   145
> > 512      2924   11
> > 1024     2986   3
> > 4096     2702   0
> > 8192     2531   OOM
>
> 2531/2971 = 85.2 %
>
> My test results I posted before (without live_parts)
> > nparts   v38     HEAD
> > ======   =====   =====
> > 0:       10538   10487
> > 2:       6942    7028
> > 4:       7043    5645
> > 8:       6981    3954
> > 16:      6932    2440
> > 32:      6897    1243
> > 64:      6897    309
> > 128:     6753    120
> > 256:     6727    46
> > 512:     6708    12
> > 1024:    6063    3
> > 2048:    5894    1
> > 4096:    5374    OOM
> > 8192:    4572    OOM
>
> 4572/6942 = 65.9 %
>
> Certainly, using bitmapset contributes to the performance when scanning one partition (few partitions) from large partitions.

Thanks Imai-san for testing.

Regards,
Amit
Re: speeding up planning with partitions
On 2019/03/31 1:06, Amit Langote wrote:
> On Sun, Mar 31, 2019 at 12:11 AM Tom Lane wrote:
>> Amit Langote writes:
>>> I think the performance results did prove that degradation due to
>>> those loops over part_rels becomes significant for very large
>>> partition counts. Is there a better solution than the bitmapset that
>>> you have in mind?
>>
>> Hm, I didn't see much degradation in what you posted in
>> <5c83dbca-12b5-1acf-0e85-58299e464...@lab.ntt.co.jp>.
>
> Sorry that I didn't mention the link to begin with, but I meant to point to numbers that I reported on Monday this week.
>
> https://www.postgresql.org/message-id/19f54c17-1619-b228-10e5-ca343be6a4e8%40lab.ntt.co.jp
>
> You were complaining of the bitmapset being useless overhead for small partition counts, but the numbers I get tend to suggest that any degradation in performance is within noise range, whereas the performance benefit from having them looks pretty significant for very large partition counts.
>
>> I am curious as to why there seems to be more degradation
>> for hash cases, as per Yoshikazu-san's results in
>> <0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24>,
>> but whatever's accounting for the difference probably
>> is not that.
>
> I suspected it may have been the lack of bitmapsets, but maybe only Imai-san could've confirmed that by applying the live_parts patch too.

Yeah, I forgot to apply the live_parts patch. I did the same test again which I did for hash before.
(BTW, thanks for committing speeding up patches!)

[HEAD(428b260)]
nparts   TPS
======   ====
2:       13134 (13240, 13290, 13071, 13172, 12896)
1024:    12627 (12489, 12635, 12716, 12732, 12562)
8192:    10289 (10216, 10265, 10171, 10278, 10514)

[HEAD(428b260) + live_parts.diff]
nparts   TPS
======   ====
2:       13277 (13112, 13290, 13241, 13360, 13382)
1024:    12821 (12930, 12849, 12909, 12700, 12716)
8192:    11102 (11134, 11158, 4, 10997, 11109)

Degradations of performance are below.

My test results from above (with live_parts, HEAD(428b260) + live_parts.diff)
nparts   live_parts   HEAD
======   ==========   ====
2:       13277        13134
1024:    12821        12627
8192:    11102        10289

11102/13277 = 83.6 %

Amit-san's test results (with live_parts)
> nparts   v38    HEAD
> ======   ====   ====
> 2        2971   2969
> 8        2980   1949
> 32       2955   733
> 128      2946   145
> 512      2924   11
> 1024     2986   3
> 4096     2702   0
> 8192     2531   OOM

2531/2971 = 85.2 %

My test results I posted before (without live_parts)
> nparts   v38     HEAD
> ======   =====   =====
> 0:       10538   10487
> 2:       6942    7028
> 4:       7043    5645
> 8:       6981    3954
> 16:      6932    2440
> 32:      6897    1243
> 64:      6897    309
> 128:     6753    120
> 256:     6727    46
> 512:     6708    12
> 1024:    6063    3
> 2048:    5894    1
> 4096:    5374    OOM
> 8192:    4572    OOM

4572/6942 = 65.9 %

Certainly, using bitmapset contributes to the performance when scanning one partition (few partitions) from large partitions.

Thanks
--
Imai Yoshikazu

diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 34cc7da..e847655 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1516,6 +1516,9 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		populate_joinrel_with_paths(root, child_rel1, child_rel2,
 									child_joinrel, child_sjinfo,
 									child_restrictlist);
+		if (!IS_DUMMY_REL(child_joinrel))
+			joinrel->live_parts = bms_add_member(joinrel->live_parts,
+												 cnt_parts);
 	}
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f08f1cd..9ddf42a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -7107,7 +7107,9 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	int			partition_idx;

 	/* Adjust each partition. */
-	for (partition_idx = 0; partition_idx < rel->nparts; partition_idx++)
+	partition_idx = -1;
+	while ((partition_idx = bms_next_member(rel->live_parts,
+											partition_idx)) >= 0)
 	{
 		RelOptInfo *child_rel = rel->part_rels[partition_idx];
 		AppendRelInfo **appinfos;
@@ -7115,9 +7117,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		List
Re: speeding up planning with partitions
On Sat, Mar 30, 2019 at 12:16 PM Amit Langote wrote: > Fwiw, I had complained when reviewing the run-time pruning patch that > creating those maps in the planner and putting them in > PartitionPruneInfo might not be a good idea, but David insisted that > it'd be good for performance (in the context of using cached plans) to > compute this information during planning. Well, he's not wrong about that, I expect. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: speeding up planning with partitions
On Sun, Mar 31, 2019 at 12:59 AM Robert Haas wrote: > > On Sat, Mar 30, 2019 at 11:46 AM Tom Lane wrote: > > > The only problem with PartitionPruneInfo structures of which I am > > > aware is that they rely on PartitionDesc offsets not changing. But I > > > added code in that commit in ExecCreatePartitionPruneState to handle > > > that exact problem. See also paragraph 5 of the commit message, which > > > begins with "Although in general..." > > > > Ah. Grotty, but I guess it will cover the issue. > > I suppose it is. I am a little suspicious of the decision to make > PartitionPruneInfo structures depend on PartitionDesc indexes. Fwiw, I had complained when reviewing the run-time pruning patch that creating those maps in the planner and putting them in PartitionPruneInfo might not be a good idea, but David insisted that it'd be good for performance (in the context of using cached plans) to compute this information during planning. Thanks, Amit
Re: speeding up planning with partitions
On Sun, Mar 31, 2019 at 12:11 AM Tom Lane wrote: > Amit Langote writes: > > I think the performance results did prove that degradation due to > > those loops over part_rels becomes significant for very large > > partition counts. Is there a better solution than the bitmapset that > > you have in mind? > > Hm, I didn't see much degradation in what you posted in > <5c83dbca-12b5-1acf-0e85-58299e464...@lab.ntt.co.jp>. Sorry that I didn't mention the link to begin with, but I meant to point to numbers that I reported on Monday this week. https://www.postgresql.org/message-id/19f54c17-1619-b228-10e5-ca343be6a4e8%40lab.ntt.co.jp You were complaining of the bitmapset being useless overhead for small partition counts, but the numbers I get tend to suggest that any degradation in performance is within noise range, whereas the performance benefit from having them looks pretty significant for very large partition counts. > I am curious as to why there seems to be more degradation > for hash cases, as per Yoshikazu-san's results in > <0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24>, > but whatever's accounting for the difference probably > is not that. I suspected it may have been the lack of bitmapsets, but maybe only Imai-san could've confirmed that by applying the live_parts patch too. Thanks, Amit
Re: speeding up planning with partitions
On Sat, Mar 30, 2019 at 11:46 AM Tom Lane wrote: > > The only problem with PartitionPruneInfo structures of which I am > > aware is that they rely on PartitionDesc offsets not changing. But I > > added code in that commit in ExecCreatePartitionPruneState to handle > > that exact problem. See also paragraph 5 of the commit message, which > > begins with "Although in general..." > > Ah. Grotty, but I guess it will cover the issue. I suppose it is. I am a little suspicious of the decision to make PartitionPruneInfo structures depend on PartitionDesc indexes. First, it's really non-obvious that the dependency exists, and I do not think I would have spotted it had not Alvaro pointed the problem out. Second, I wonder whether it is really a good idea in general to make a plan depend on array indexes when the array is not stored in the plan. In one sense, we do that all the time, because attnums are arguably just indexes into what is conceptually an array of attributes. However, I feel that's not quite the same, because the attnum is explicitly stored in the catalogs, and PartitionDesc array indexes are not stored anywhere, but rather are the result of a fairly complex calculation. Now I guess it's probably OK because we will probably have lots of other problems if we don't get the same answer every time we do that calculation, but it still makes me a little nervous. I would try to propose something better but I don't have a good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: speeding up planning with partitions
Robert Haas writes: > On Sat, Mar 30, 2019 at 11:11 AM Tom Lane wrote: >> Before that, though, I remain concerned that the PartitionPruneInfo >> data structure the planner is transmitting to the executor is unsafe >> against concurrent ATTACH PARTITION operations. The comment for >> PartitionedRelPruneInfo says in so many words that it's relying on >> indexes in the table's PartitionDesc; how is that not broken by >> 898e5e329? > The only problem with PartitionPruneInfo structures of which I am > aware is that they rely on PartitionDesc offsets not changing. But I > added code in that commit in ExecCreatePartitionPruneState to handle > that exact problem. See also paragraph 5 of the commit message, which > begins with "Although in general..." Ah. Grotty, but I guess it will cover the issue. regards, tom lane
Re: speeding up planning with partitions
On Sat, Mar 30, 2019 at 11:11 AM Tom Lane wrote: > Before that, though, I remain concerned that the PartitionPruneInfo > data structure the planner is transmitting to the executor is unsafe > against concurrent ATTACH PARTITION operations. The comment for > PartitionedRelPruneInfo says in so many words that it's relying on > indexes in the table's PartitionDesc; how is that not broken by > 898e5e329? The only problem with PartitionPruneInfo structures of which I am aware is that they rely on PartitionDesc offsets not changing. But I added code in that commit in ExecCreatePartitionPruneState to handle that exact problem. See also paragraph 5 of the commit message, which begins with "Although in general..." -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: speeding up planning with partitions
Amit Langote writes:
> On Sat, Mar 30, 2019 at 9:17 AM Tom Lane wrote:
>> What I propose we do about the GEQO problem is shown in 0001 attached
>> (which would need to be back-patched into v11).
>> ...
>> That's just dumb. What we *ought* to be doing in such degenerate
>> outer-join cases is just emitting the non-dummy side, ie

> Fwiw, I agree that we should fix join planning so that we get the
> ProjectionPath atop scan path of non-nullable relation instead of a
> full-fledged join path with dummy path on the nullable side. It seems
> to me that the "fix" would mostly be localized to
> try_partitionwise_join() at least insofar as detecting whether we
> should generate a join or the other plan shape is concerned, right?

Well, if we're going to do something about that, I would like to see it work for non-partition cases too, ie we're not smart about this either:

regression=# explain select * from tenk1 left join (select 1 where false) ss(x) on unique1=x;
                          QUERY PLAN
--------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..570.00 rows=1 width=248)
   Join Filter: (tenk1.unique1 = 1)
   ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=1 width=244)
   ->  Result  (cost=0.00..0.00 rows=0 width=0)
         One-Time Filter: false
(5 rows)

A general solution would presumably involve new logic in populate_joinrel_with_paths for the case where the nullable side is dummy. I'm not sure whether that leaves anything special to do in try_partitionwise_join or not. Maybe it would, since that would have a requirement to build the joinrel without any RHS input RelOptInfo, but I don't think that's the place to begin working on this.

> By the way, does it make sense to remove the tests whose output
> changes altogether and reintroduce them when we fix join planning?
> Especially, in partitionwise_aggregate.out, there are comments near
> changed plans which are no longer true.

Good point about the comments, but we shouldn't just remove those test cases; they're useful to exercise the give-up-on-partitionwise-join code paths.
I'll tweak the comments. >> 0002 attached is then the rest of the partition-planning patch; >> it doesn't need to mess with joinrels.c at all. I've addressed >> the other points discussed today in that, except for the business >> about whether we want your 0003 bitmap-of-live-partitions patch. >> I'm still inclined to think that that's not really worth it, >> especially in view of your performance results. > I think the performance results did prove that degradation due to > those loops over part_rels becomes significant for very large > partition counts. Is there a better solution than the bitmapset that > you have in mind? Hm, I didn't see much degradation in what you posted in <5c83dbca-12b5-1acf-0e85-58299e464...@lab.ntt.co.jp>. I am curious as to why there seems to be more degradation for hash cases, as per Yoshikazu-san's results in <0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24>, but whatever's accounting for the difference probably is not that. Anyway I still believe that getting rid of these sparse arrays would be a better answer. Before that, though, I remain concerned that the PartitionPruneInfo data structure the planner is transmitting to the executor is unsafe against concurrent ATTACH PARTITION operations. The comment for PartitionedRelPruneInfo says in so many words that it's relying on indexes in the table's PartitionDesc; how is that not broken by 898e5e329? regards, tom lane
Re: speeding up planning with partitions
Thanks for the new patches. On Sat, Mar 30, 2019 at 9:17 AM Tom Lane wrote: > > Amit Langote writes: > > On 2019/03/29 7:38, Tom Lane wrote: > >> 2. I seriously dislike what's been done in joinrels.c, too. That > >> really seems like a kluge (and I haven't had time to study it > >> closely). > > > Those hunks account for the fact that pruned partitions, for which we no > > longer create RangeTblEntry and RelOptInfo, may appear on the nullable > > side of an outer join. We'll need a RelOptInfo holding a dummy path, so > > that outer join paths can be created with one side of join being dummy > > result path, which are built in the patch by build_dummy_partition_rel(). > > Now that I've had a chance to look closer, there's no way I'm committing > that change in joinrels.c. If it works at all, it's accidental, because > it's breaking all sorts of data structure invariants. The business with > an AppendRelInfo that maps from the parentrel to itself is particularly > ugly; and I doubt that you can get away with assuming that > root->append_rel_array[parent->relid] is available for use for that. > (What if the parent is an intermediate partitioned table?) > > There's also the small problem of the GEQO crash. It's possible that > that could be gotten around by switching into the long-term planner > context in update_child_rel_info and build_dummy_partition_rel, but > then you're creating a memory leak across GEQO cycles. It'd be much > better to avoid touching base-relation data structures during join > planning. > > What I propose we do about the GEQO problem is shown in 0001 attached > (which would need to be back-patched into v11). This is based on the > observation that, if we know an input relation is empty, we can often > prove the join is empty and then skip building it at all. (In the > existing partitionwise-join code, the same cases are detected by > populate_joinrel_with_paths, but we do a fair amount of work before > discovering that.) 
> The cases where that's not true are where we have a pruned partition on the inside of a left join, or either side of a full join ... but frankly, what the existing code produces for those cases is not short of embarrassing:
>
>    ->  Hash Left Join
>          Hash Cond: (pagg_tab1_p1.x = y)
>          Filter: ((pagg_tab1_p1.x > 5) OR (y < 20))
>          ->  Seq Scan on pagg_tab1_p1
>                Filter: (x < 20)
>          ->  Hash
>                ->  Result
>                      One-Time Filter: false
>
> That's just dumb. What we *ought* to be doing in such degenerate outer-join cases is just emitting the non-dummy side, ie
>
>    ->  Seq Scan on pagg_tab1_p1
>          Filter: (x < 20) AND ((pagg_tab1_p1.x > 5) OR (y < 20))
>
> in this example. I would envision handling this by teaching the code to generate a path for the joinrel that's basically just a ProjectionPath atop a path for the non-dummy input rel, with the projection set up to emit nulls for the columns of the dummy side. (Note that this would be useful for outer joins against dummy rels in regular planning contexts, not only partitionwise joins.)
>
> Pending somebody doing the work for that, though, I do not have a problem with just being unable to generate partitionwise joins in such cases, so 0001 attached just changes the expected outputs for the affected regression test cases.

Fwiw, I agree that we should fix join planning so that we get the ProjectionPath atop scan path of non-nullable relation instead of a full-fledged join path with dummy path on the nullable side. It seems to me that the "fix" would mostly be localized to try_partitionwise_join() at least insofar as detecting whether we should generate a join or the other plan shape is concerned, right?

By the way, does it make sense to remove the tests whose output changes altogether and reintroduce them when we fix join planning? Especially, in partitionwise_aggregate.out, there are comments near changed plans which are no longer true.
> 0002 attached is then the rest of the partition-planning patch; > it doesn't need to mess with joinrels.c at all. I've addressed > the other points discussed today in that, except for the business > about whether we want your 0003 bitmap-of-live-partitions patch. > I'm still inclined to think that that's not really worth it, > especially in view of your performance results. I think the performance results did prove that degradation due to those loops over part_rels becomes significant for very large partition counts. Is there a better solution than the bitmapset that you have in mind? > If people are OK with this approach to solving the GEQO problem, > I think these are committable. Thanks again. Really appreciate that you are putting so much of your time into this. Regards, Amit
Re: speeding up planning with partitions
Amit Langote writes: > On 2019/03/29 7:38, Tom Lane wrote: >> 2. I seriously dislike what's been done in joinrels.c, too. That >> really seems like a kluge (and I haven't had time to study it >> closely). > Those hunks account for the fact that pruned partitions, for which we no > longer create RangeTblEntry and RelOptInfo, may appear on the nullable > side of an outer join. We'll need a RelOptInfo holding a dummy path, so > that outer join paths can be created with one side of join being dummy > result path, which are built in the patch by build_dummy_partition_rel(). Now that I've had a chance to look closer, there's no way I'm committing that change in joinrels.c. If it works at all, it's accidental, because it's breaking all sorts of data structure invariants. The business with an AppendRelInfo that maps from the parentrel to itself is particularly ugly; and I doubt that you can get away with assuming that root->append_rel_array[parent->relid] is available for use for that. (What if the parent is an intermediate partitioned table?) There's also the small problem of the GEQO crash. It's possible that that could be gotten around by switching into the long-term planner context in update_child_rel_info and build_dummy_partition_rel, but then you're creating a memory leak across GEQO cycles. It'd be much better to avoid touching base-relation data structures during join planning. What I propose we do about the GEQO problem is shown in 0001 attached (which would need to be back-patched into v11). This is based on the observation that, if we know an input relation is empty, we can often prove the join is empty and then skip building it at all. (In the existing partitionwise-join code, the same cases are detected by populate_joinrel_with_paths, but we do a fair amount of work before discovering that.) The cases where that's not true are where we have a pruned partition on the inside of a left join, or either side of a full join ... 
but frankly, what the existing code produces for those cases is nothing short of embarrassing:

         ->  Hash Left Join
               Hash Cond: (pagg_tab1_p1.x = y)
               Filter: ((pagg_tab1_p1.x > 5) OR (y < 20))
               ->  Seq Scan on pagg_tab1_p1
                     Filter: (x < 20)
               ->  Hash
                     ->  Result
                           One-Time Filter: false

That's just dumb.  What we *ought* to be doing in such degenerate outer-join cases is just emitting the non-dummy side, ie

         ->  Seq Scan on pagg_tab1_p1
               Filter: (x < 20) AND ((pagg_tab1_p1.x > 5) OR (y < 20))

in this example.  I would envision handling this by teaching the code to generate a path for the joinrel that's basically just a ProjectionPath atop a path for the non-dummy input rel, with the projection set up to emit nulls for the columns of the dummy side.  (Note that this would be useful for outer joins against dummy rels in regular planning contexts, not only partitionwise joins.)  Pending somebody doing the work for that, though, I do not have a problem with just being unable to generate partitionwise joins in such cases, so 0001 attached just changes the expected outputs for the affected regression test cases.

0002 attached is then the rest of the partition-planning patch; it doesn't need to mess with joinrels.c at all.  I've addressed the other points discussed today in that, except for the business about whether we want your 0003 bitmap-of-live-partitions patch.  I'm still inclined to think that that's not really worth it, especially in view of your performance results.

If people are OK with this approach to solving the GEQO problem, I think these are committable.

			regards, tom lane

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 56a5084..3c9d84f 100644
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*** set_append_rel_size(PlannerInfo *root, R
*** 1112,1122 ****
  		 * for partitioned child rels.
  		 *
  		 * Note: here we abuse the consider_partitionwise_join flag by setting
! 		 * it *even* for child rels that are not partitioned.  In that case,
! 		 * we set it to tell try_partitionwise_join() that it doesn't need to
! 		 * generate their targetlists and EC entries as they have already been
! 		 * generated here, as opposed to the dummy child rels for which the
! 		 * flag is left set to false so that it will generate them.
  		 */
  		if (rel->consider_partitionwise_join)
  			childrel->consider_partitionwise_join = true;
--- 1112,1122 ----
  		 * for partitioned child rels.
  		 *
  		 * Note: here we abuse the consider_partitionwise_join flag by setting
! 		 * it for child rels that are not themselves partitioned.  We do so to
! 		 * tell try_partitionwise_join() that the child rel is sufficiently
! 		 * valid to be used as a per-partition input, even if it later gets
! 		 * proven to be dummy.  (It's not
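The rewrite Tom sketches can be illustrated at the SQL level (a sketch with made-up tables t1/t2; this shows the semantics of the simplification, not the planner mechanism):

```sql
-- If the planner can prove the inner side t2 is empty (a dummy rel), then
SELECT t1.a, t2.b
FROM t1 LEFT JOIN t2 ON t1.a = t2.b;
-- is equivalent to scanning the non-dummy side alone and projecting a
-- NULL for each column of the dummy side:
SELECT t1.a, NULL::int AS b FROM t1;
```

The ProjectionPath idea is essentially the second query's shape generated directly as a path for the joinrel.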
Re: speeding up planning with partitions
Amit Langote writes: > On 2019/03/29 7:38, Tom Lane wrote: >> 2. I seriously dislike what's been done in joinrels.c, too. That >> really seems like a kluge (and I haven't had time to study it >> closely). > Those hunks account for the fact that pruned partitions, for which we no > longer create RangeTblEntry and RelOptInfo, may appear on the nullable > side of an outer join. We'll need a RelOptInfo holding a dummy path, so > that outer join paths can be created with one side of join being dummy > result path, which are built in the patch by build_dummy_partition_rel(). Just for the record, that code is completely broken: it falls over badly under GEQO. (Try running the regression tests with "alter system set geqo_threshold to 2".) However, the partitionwise join code was completely broken for GEQO before this patch, too, so I'm just going to log that as an open issue for the moment. regards, tom lane
Re: speeding up planning with partitions
I wrote: > Amit Langote writes: >> About the XXX: I think resetting inh flag is unnecessary, so we should >> just remove the line. > Possibly. I hadn't had time to follow up the XXX annotation. Now I have ... Yeah, it seems we can just drop that and leave the flag alone. We'll end up running through set_append_rel_size and finding no relevant AppendRelInfos, but that's not going to take long enough to be a problem. It seems better to have the principle that rte->inh doesn't change after subquery_planner's initial scan of the rtable, so I'll make it so. >> If we do that, we can also get rid of the following >> code in set_rel_size(): No, we can't --- that's still reachable if somebody says "SELECT FROM ONLY partitioned_table". regards, tom lane
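The "SELECT FROM ONLY partitioned_table" case mentioned above is easy to see at the SQL level (a minimal sketch):

```sql
-- A partitioned parent stores no rows itself, and ONLY excludes its
-- children, so the scan below is provably empty and can be marked dummy:
CREATE TABLE pt (a int) PARTITION BY LIST (a);
CREATE TABLE pt1 PARTITION OF pt FOR VALUES IN (1);
INSERT INTO pt VALUES (1);
SELECT * FROM ONLY pt;   -- always returns zero rows
```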
Re: speeding up planning with partitions
Amit Langote writes:
> Here are some comments on v38.

Thanks for looking it over!  I'll just reply to points worth discussing:

> -        Assert(rte->rtekind == RTE_RELATION ||
> -               rte->rtekind == RTE_SUBQUERY);
> -        add_appendrel_other_rels(root, rel, rti);
> +        if (rte->rtekind == RTE_RELATION)
> +            expand_inherited_rtentry(root, rel, rte, rti);
> +        else
> +            expand_appendrel_subquery(root, rel, rte, rti);

> Wouldn't it be a good idea to keep the Assert?

There's an Assert in expand_appendrel_subquery that what it got is an RTE_SUBQUERY, so I thought the one at the call site was redundant.  I suppose another way to do this would be like

    if (rte->rtekind == RTE_RELATION)
        expand_inherited_rtentry(root, rel, rte, rti);
    else if (rte->rtekind == RTE_SUBQUERY)
        expand_appendrel_subquery(root, rel, rte, rti);
    else
        Assert(false);

Not sure if that's better or not.  Or we could go back to the design of just having one function and letting it dispatch the case it doesn't want to the other function --- though I think I'd make expand_inherited_rtentry be the primary function, rather than the other way around as you had it in v37.

> +        forboth(lc, old_child_rtis, lc2, new_child_rtis)
> +        {
> +            int  old_child_rti = lfirst_int(lc);
> +            int  new_child_rti = lfirst_int(lc2);
> +
> +            if (old_child_rti == new_child_rti)
> +                continue;       /* nothing to do */
> +
> +            Assert(old_child_rti > new_child_rti);
> +
> +            ChangeVarNodes((Node *) child_appinfos,
> +                           old_child_rti, new_child_rti, 0);
> +        }

> This seems unnecessary?  RTEs of children of the target table should be in
> the same position in the final_rtable as they are in the select_rtable.

Well, that's what I'm not very convinced of.  I have observed that the regression tests don't reach this ChangeVarNodes call, but I think that might just be lack of test cases rather than a proof that it's impossible.  The question is whether it'd ever be possible for the update/delete target to not be the first "inh" table that gets expanded.
Since that expansion is done in RTE order, it reduces to "is the target always before any other RTE entries that could need inheritance expansion?"  Certainly that would typically be true, but I don't feel very comfortable about assuming that it must be true, when you start thinking about things like updatable views, rules, WITH queries, and so on.

It might be worth trying to devise a test case that does reach this code.  If we could convince ourselves that it's really impossible, I'd be willing to drop it in favor of putting a test-and-elog check in the earlier loop that the RTI pairs are all equal.  But I'm not willing to do it without more investigation.

> +        /* XXX wrong? */
> +        parentrte->inh = false;

> About the XXX: I think resetting inh flag is unnecessary, so we should
> just remove the line.

Possibly.  I hadn't had time to follow up the XXX annotation.

> If we do that, we can also get rid of the following
> code in set_rel_size():

> else if (rte->relkind == RELKIND_PARTITIONED_TABLE)
> {
>     /*
>      * A partitioned table without any partitions is marked as
>      * a dummy rel.
>      */
>     set_dummy_rel_pathlist(rel);
> }

Not following?  Surely we need to mark the childless parent as dummy at some point, and that seems like as good a place as any.

> Finally, it's not in the patch, but how about visiting
> get_relation_constraints() for revising this block of code:

That seems like probably an independent patch --- do you want to write it?

			regards, tom lane
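One shape that might be worth probing in the test-case hunt (purely hypothetical --- none of these names come from the patch, and it is not established that this actually reaches the remapping code) is an UPDATE whose partitioned target is preceded by another inheritance-expandable RTE in the range table:

```sql
-- Made-up tables: the idea is to get another "inh" RTE ahead of the
-- UPDATE target and then check whether the child RTIs assigned during
-- the SELECT planning pass still line up with the final rtable.
CREATE TABLE t_other (a int) PARTITION BY LIST (a);
CREATE TABLE t_other1 PARTITION OF t_other FOR VALUES IN (1);
CREATE TABLE t_target (a int, b int) PARTITION BY LIST (a);
CREATE TABLE t_target1 PARTITION OF t_target FOR VALUES IN (1);

UPDATE t_target SET b = 0
FROM (SELECT a FROM t_other) s
WHERE t_target.a = s.a;
```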
Re: speeding up planning with partitions
Here are some comments on v38.

On 2019/03/29 12:44, Amit Langote wrote:
> Thanks again for the new patch.  I'm reading it now and will send
> comments later today if I find something.

-        Assert(rte->rtekind == RTE_RELATION ||
-               rte->rtekind == RTE_SUBQUERY);
-        add_appendrel_other_rels(root, rel, rti);
+        if (rte->rtekind == RTE_RELATION)
+            expand_inherited_rtentry(root, rel, rte, rti);
+        else
+            expand_appendrel_subquery(root, rel, rte, rti);

Wouldn't it be a good idea to keep the Assert?

+ * It's possible that the RTIs we just assigned for the child rels in the
+ * final rtable are different from where they were in the SELECT query.

In the 2nd sentence, maybe you meant "...from what they were"

+        forboth(lc, old_child_rtis, lc2, new_child_rtis)
+        {
+            int  old_child_rti = lfirst_int(lc);
+            int  new_child_rti = lfirst_int(lc2);
+
+            if (old_child_rti == new_child_rti)
+                continue;       /* nothing to do */
+
+            Assert(old_child_rti > new_child_rti);
+
+            ChangeVarNodes((Node *) child_appinfos,
+                           old_child_rti, new_child_rti, 0);
+        }

This seems unnecessary?  RTEs of children of the target table should be in the same position in the final_rtable as they are in the select_rtable.  It seems that they can be added to parse->rtable simply as:

    orig_rtable_len = list_length(parse->rtable);
    parse->rtable = list_concat(parse->rtable,
                                list_copy_tail(select_rtable,
                                               orig_rtable_len));

That is, after the block of code that plans the query as SELECT.

+ * about the childrens' reltargets, they'll be made later

Should it be children's?

+        /*
+         * If the partitioned table has no partitions, treat this as the
+         * non-inheritance case.
+         */
+        if (partdesc->nparts == 0)
+        {
+            /* XXX wrong? */
+            parentrte->inh = false;
+            return;
+        }

About the XXX: I think resetting inh flag is unnecessary, so we should just remove the line.  If we do that, we can also get rid of the following code in set_rel_size():

    else if (rte->relkind == RELKIND_PARTITIONED_TABLE)
    {
        /*
         * A partitioned table without any partitions is marked as
         * a dummy rel.
         */
        set_dummy_rel_pathlist(rel);
    }

Finally, it's not in the patch, but how about visiting get_relation_constraints() for revising this block of code:

    /*
     * Append partition predicates, if any.
     *
     * For selects, partition pruning uses the parent table's partition bound
     * descriptor, instead of constraint exclusion which is driven by the
     * individual partition's partition constraint.
     */
    if (enable_partition_pruning && root->parse->commandType != CMD_SELECT)
    {
        List *pcqual = RelationGetPartitionQual(relation);

        if (pcqual)
        {
            /*
             * Run the partition quals through const-simplification similar to
             * check constraints.  We skip canonicalize_qual, though, because
             * partition quals should be in canonical form already; also,
             * since the qual is in implicit-AND format, we'd have to
             * explicitly convert it to explicit-AND format and back again.
             */
            pcqual = (List *) eval_const_expressions(root, (Node *) pcqual);

            /* Fix Vars to have the desired varno */
            if (varno != 1)
                ChangeVarNodes((Node *) pcqual, 1, varno, 0);

            result = list_concat(result, pcqual);
        }
    }

We will no longer need to load the partition constraints for "other rel" partitions, not even for UPDATE and DELETE queries.  Now, we won't load them with the patch applied, because we're cheating by first planning the query as SELECT, so that's not an issue.  But we should change the condition here to check if the input relation is a "baserel", in which case, this should still load the partition constraint so that constraint exclusion can use it when running with constraint_exclusion = on.  In fact, I recently reported [1] on -hackers that we don't load the partition constraint even if the partition is being accessed directly, which is a bug introduced in PG 11.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/9813f079-f16b-61c8-9ab7-4363cab28d80%40lab.ntt.co.jp
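The PG 11 bug Amit cites --- the partition constraint not being loaded when a partition is accessed directly --- concerns queries like the following sketch (hypothetical table names, assuming constraint_exclusion = on):

```sql
SET constraint_exclusion = on;
CREATE TABLE r (a int) PARTITION BY RANGE (a);
CREATE TABLE r1 PARTITION OF r FOR VALUES FROM (1) TO (10);
-- "a = 100" contradicts r1's implicit partition constraint
-- (a >= 1 AND a < 10), so constraint exclusion should be able to prove
-- the scan empty --- but only if the constraint is actually loaded in
-- get_relation_constraints():
EXPLAIN SELECT * FROM r1 WHERE a = 100;
```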
RE: speeding up planning with partitions
On Fri, Mar 29, 2019 at 3:45 PM, Amit Langote wrote:
> Thanks a lot for hacking on the patch.  I'm really happy with the
> direction you took for inheritance_planner, as it allows UPDATE/DELETE
> to use partition pruning.

I was astonished by Tom's awesome work and really thank him.

> Certainly.  Note that previously we'd always scan *all* hash partitions
> for UPDATE and DELETE queries, because constraint exclusion can't exclude
> hash partitions due to the shape of their partition constraint.
>
> I ran my usual benchmark with up to 8192 partitions.
>
> N: 2..8192
>
> create table rt (a int, b int, c int) partition by range (a);
> select 'create table rt' || x::text || ' partition of rt for values from ('
>   || (x)::text || ') to (' || (x+1)::text || ');'
> from generate_series(1, N) x;
> \gexec
>
> update.sql:
>
> \set param random(1, N)
> update rt set a = 0 where a = :param;
>
> pgbench -n -T 120 -f update.sql
>
> nparts    v38    HEAD
> =====================
>      2   2971    2969
>      8   2980    1949
>     32   2955     733
>    128   2946     145
>    512   2924      11
>   1024   2986       3
>   4096   2702       0
>   8192   2531     OOM
>
> Obviously, you'll get similar numbers with hash or list partitioning.

I also ran the test for hash partitioning, just to make sure.

N: 2..8192

create table ht (a int, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values with (MODULUS N, REMAINDER ' || (x)::text || ');'
from generate_series(0, N-1) x;
\gexec

update.sql:

\set param random(1, N * 100)
update ht set b = b + 1 where a = :param;

pgbench -n -T 60 -f update.sql

[updating one partition]
 nparts     v38     HEAD
 =======================
      0   10538    10487
      2    6942     7028
      4    7043     5645
      8    6981     3954
     16    6932     2440
     32    6897     1243
     64    6897      309
    128    6753      120
    256    6727       46
    512    6708       12
   1024    6063        3
   2048    5894        1
   4096    5374      OOM
   8192    4572      OOM

The performance for hash is also improved, though the rate at which performance drops off with many partitions seems higher than for range partitioning.

Thanks
--
Imai Yoshikazu
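For context on the hash numbers: the reason constraint exclusion (the mechanism HEAD-before-this-patch used for UPDATE/DELETE) can never eliminate hash partitions is the shape of their implicit partition constraint --- a sketch:

```sql
CREATE TABLE ht (a int) PARTITION BY HASH (a);
CREATE TABLE ht0 PARTITION OF ht FOR VALUES WITH (MODULUS 2, REMAINDER 0);
-- psql's \d+ ht0 shows the partition constraint, which is roughly
--   satisfies_hash_partition('<parent oid>'::oid, 2, 0, a)
-- an opaque function call that predicate_refuted_by() cannot refute
-- from a qual such as "a = 5", unlike a range partition's
-- "a >= lo AND a < hi".
```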
Re: speeding up planning with partitions
Thanks a lot for hacking on the patch.  I'm really happy with the direction you took for inheritance_planner, as it allows UPDATE/DELETE to use partition pruning.

On 2019/03/29 7:38, Tom Lane wrote:
> I've been hacking on this pretty hard over the last couple days,
> because I really didn't like the contortions you'd made to allow
> inheritance_planner to call expand_inherited_rtentry in a completely
> different context than the regular code path did.  I eventually got
> rid of that

Good riddance.

> by having inheritance_planner run one cycle of planning
> the query as if it were a SELECT, and extracting the list of unpruned
> children from that.

This is somewhat like my earlier patch that we decided not to pursue, minus all the hackery within query_planner() that was in that patch, which is great.  (I can't find the link, but David Rowley had posted a patch for allowing UPDATE/DELETE to use partition pruning in the late stages of PG 11 development, which had taken a similar approach.)

> I had to rearrange the generation of the final
> rtable a bit to make that work, but I think that inheritance_planner
> winds up somewhat cleaner and safer this way; it's making (slightly)
> fewer assumptions about how much the results of planning the child
> queries can vary.
>
> Perhaps somebody will object that that's one more planning pass than
> we had before, but I'm not very concerned, because
> (a) at least for partitioned tables that we can prune successfully,
> this should still be better than v11, since we avoid the planning
> passes for pruned children.

Certainly.  Note that previously we'd always scan *all* hash partitions for UPDATE and DELETE queries, because constraint exclusion can't exclude hash partitions due to the shape of their partition constraint.

I ran my usual benchmark with up to 8192 partitions.
N: 2..8192

create table rt (a int, b int, c int) partition by range (a);
select 'create table rt' || x::text || ' partition of rt for values from ('
  || (x)::text || ') to (' || (x+1)::text || ');'
from generate_series(1, N) x;
\gexec

update.sql:

\set param random(1, N)
update rt set a = 0 where a = :param;

pgbench -n -T 120 -f update.sql

 nparts    v38    HEAD
 =====================
      2   2971    2969
      8   2980    1949
     32   2955     733
    128   2946     145
    512   2924      11
   1024   2986       3
   4096   2702       0
   8192   2531     OOM

Obviously, you'll get similar numbers with hash or list partitioning.

> (b) inheritance_planner is horridly inefficient anyway, in that it
> has to run a near-duplicate planning pass for each child table.
> If we're concerned about its cost, we should be working to get rid of
> the function altogether, as per [1].  In the meantime, I do not want
> to contort other code to make life easier for inheritance_planner.

Agreed.

> There's still some loose ends:
>
> 1. I don't like 0003 much, and omitted it from the attached.
> I think that what we ought to be doing instead is not having holes
> in the rel_parts[] arrays to begin with, ie they should only include
> the unpruned partitions.  If we are actually encoding any important
> information in those array positions, I suspect that is broken anyway
> in view of 898e5e329: we can't assume that the association of child
> rels with particular PartitionDesc slots will hold still from planning
> to execution.

It's useful for the part_rels array to be indexed the same way as the PartitionDesc.  Firstly, because the partition pruning code returns the PartitionDesc-defined indexes of unpruned partitions.  Secondly, if the partitionwise join code decides that two partitioned tables are compatible for partitionwise joining, then it must join partitions that have identical *PartitionDesc* indexes, which it does by looping over the part_rels arrays of both sides in one loop.
Regarding the impact of 898e5e329 on this, I think it invented PartitionDirectory exactly to keep a PartitionDesc changing under us from affecting the planning or execution of a given query.  As for PartitionDesc indexes being different between planning and execution, that only affects PartitionPruneInfo, and the commit did make changes to ExecCreatePartitionPruningState to remap the old indexes of unpruned partitions in PartitionPruneInfo (as they were during planning) to the new ones.

> 2. I seriously dislike what's been done in joinrels.c, too.  That
> really seems like a kluge (and I haven't had time to study it
> closely).

Those hunks account for the fact that pruned partitions, for which we no longer create RangeTblEntry and RelOptInfo, may appear on the nullable side of an outer join.  We'll need a RelOptInfo holding a dummy path, so that outer join paths can be created with one side of the join being a dummy result path; these are built in the patch by build_dummy_partition_rel().

> 3. It's not entirely clear to me why the patch has to touch
> execPartition.c.  That implies that the planner-to-executor API
> changed, but how so, and why is there no comment update
Re: speeding up planning with partitions
I've been hacking on this pretty hard over the last couple days, because I really didn't like the contortions you'd made to allow inheritance_planner to call expand_inherited_rtentry in a completely different context than the regular code path did. I eventually got rid of that by having inheritance_planner run one cycle of planning the query as if it were a SELECT, and extracting the list of unpruned children from that. I had to rearrange the generation of the final rtable a bit to make that work, but I think that inheritance_planner winds up somewhat cleaner and safer this way; it's making (slightly) fewer assumptions about how much the results of planning the child queries can vary. Perhaps somebody will object that that's one more planning pass than we had before, but I'm not very concerned, because (a) at least for partitioned tables that we can prune successfully, this should still be better than v11, since we avoid the planning passes for pruned children. (b) inheritance_planner is horridly inefficient anyway, in that it has to run a near-duplicate planning pass for each child table. If we're concerned about its cost, we should be working to get rid of the function altogether, as per [1]. In the meantime, I do not want to contort other code to make life easier for inheritance_planner. There's still some loose ends: 1. I don't like 0003 much, and omitted it from the attached. I think that what we ought to be doing instead is not having holes in the rel_parts[] arrays to begin with, ie they should only include the unpruned partitions. If we are actually encoding any important information in those array positions, I suspect that is broken anyway in view of 898e5e329: we can't assume that the association of child rels with particular PartitionDesc slots will hold still from planning to execution. 2. I seriously dislike what's been done in joinrels.c, too. That really seems like a kluge (and I haven't had time to study it closely). 3. 
It's not entirely clear to me why the patch has to touch execPartition.c.  That implies that the planner-to-executor API changed, but how so, and why is there no comment update clarifying it?

Given the short amount of time left in this CF, there may not be time to address the first two points, and I won't necessarily insist that those be changed before committing.  I'd like at least a comment about point 3 though.

Attached is the updated patch as a single patch --- I didn't think the division into multiple patches was terribly helpful, due to the flapping in expected regression results.

			regards, tom lane

[1] https://www.postgresql.org/message-id/357.1550612...@sss.pgh.pa.us

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 9ad9035..310c715 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -7116,9 +7116,9 @@ select * from bar where f1 in (select f1 from foo) for update;
                                        QUERY PLAN
 ------------------------------------------------------------------------------------------
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+   Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
   ->  Hash Join
-         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+         Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
         Inner Unique: true
         Hash Cond: (bar.f1 = foo.f1)
         ->  Append
@@ -7128,15 +7128,15 @@ select * from bar where f1 in (select f1 from foo) for update;
               Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
               Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
         ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo.ctid, foo.f1, foo.*, foo.tableoid
               ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                     Group Key: foo.f1
                     ->  Append
                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                                 Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                           ->  Foreign Scan on public.foo2
-                                 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
+                                 Output: foo2.ctid, foo2.f1, foo2.*, foo2.tableoid
                                 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
 (23 rows)

@@ -7154,9 +7154,9 @@ select * from bar where f1 in (select f1 from foo) for share;
Re: speeding up planning with partitions
On 2019/03/28 14:03, Tom Lane wrote:
> Amit Langote writes:
>> On 2019/03/27 23:57, Tom Lane wrote:
>>> Yeah, there's something to be said for having plancat.c open each table
>>> *and store the Relation pointer in the RelOptInfo*, and then close that
>>> again at the end of planning rather than immediately.  If we can't avoid
>>> these retail table_opens without a great deal of pain, that's the
>>> direction I'd tend to go in.  However it would add some overhead, in
>>> the form of a need to traverse the RelOptInfo array an additional time.
>
>> Just to be sure, do you mean we should do that now or later (David said
>> "in the long term")?
>
> It's probably not high priority, though I wonder how much time is being
> eaten by the repeated table_opens.

It took me a couple of hours to confirm this, but it seems significantly less than the time it takes to go over the simple_rel_array one more time to close the relations.  The scan of simple_rel_array to close the relations needs to be done once per query_planner() invocation, whereas I'd hoped this could be done, say, in add_rtes_to_flat_rtable(), which has to iterate over the range table anyway.  That's because we call query_planner multiple times on the same query in a few cases, viz. build_minmax_path(), inheritance_planner().

Within query_planner(), a table is opened in expand_inherited_rtentry(), get_relation_constraints(), get_relation_data_width(), and build_physical_tlist(), of which the first two are more common.  So, we end up doing table_open() 3 times on average for any given relation, just inside query_planner().

Thanks,
Amit
Re: speeding up planning with partitions
Amit Langote writes: > On 2019/03/27 23:57, Tom Lane wrote: >> Yeah, there's something to be said for having plancat.c open each table >> *and store the Relation pointer in the RelOptInfo*, and then close that >> again at the end of planning rather than immediately. If we can't avoid >> these retail table_opens without a great deal of pain, that's the >> direction I'd tend to go in. However it would add some overhead, in >> the form of a need to traverse the RelOptInfo array an additional time. > Just to be sure, do you mean we should do that now or later (David said > "in the long term")? It's probably not high priority, though I wonder how much time is being eaten by the repeated table_opens. regards, tom lane
Re: speeding up planning with partitions
On 2019/03/27 23:57, Tom Lane wrote: > David Rowley writes: >> On Wed, 27 Mar 2019 at 18:39, Amit Langote >> wrote: >>> On 2019/03/27 14:26, David Rowley wrote: Perhaps the way to make this work, at least in the long term is to do in the planner what we did in the executor in d73f4c74dd34. > >>> Maybe you meant 9ddef36278a9? > >> Probably. > > Yeah, there's something to be said for having plancat.c open each table > *and store the Relation pointer in the RelOptInfo*, and then close that > again at the end of planning rather than immediately. If we can't avoid > these retail table_opens without a great deal of pain, that's the > direction I'd tend to go in. However it would add some overhead, in > the form of a need to traverse the RelOptInfo array an additional time. Just to be sure, do you mean we should do that now or later (David said "in the long term")? Thanks, Amit
Re: speeding up planning with partitions
David Rowley writes: > On Wed, 27 Mar 2019 at 18:39, Amit Langote > wrote: >> On 2019/03/27 14:26, David Rowley wrote: >>> Perhaps the way to make this work, at least in the long term is to do >>> in the planner what we did in the executor in d73f4c74dd34. >> Maybe you meant 9ddef36278a9? > Probably. Yeah, there's something to be said for having plancat.c open each table *and store the Relation pointer in the RelOptInfo*, and then close that again at the end of planning rather than immediately. If we can't avoid these retail table_opens without a great deal of pain, that's the direction I'd tend to go in. However it would add some overhead, in the form of a need to traverse the RelOptInfo array an additional time. regards, tom lane
Re: speeding up planning with partitions
On Wed, 27 Mar 2019 at 18:39, Amit Langote wrote: > > On 2019/03/27 14:26, David Rowley wrote: > > Perhaps the way to make this work, at least in the long term is to do > > in the planner what we did in the executor in d73f4c74dd34. > > Maybe you meant 9ddef36278a9? Probably. > What would be nice is being able to readily access Relation pointers of > all tables accessed in a query from anywhere in the planner, whereby, a > given table is opened only once. Well, yeah, that's what the commit did for the executor, so it is what I was trying to get at. > Note that Tom complained upthread that the patch is introducing > table_open()'s at random points within the planner, which is something to > avoid. Yip. :) hence my suggestion. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: speeding up planning with partitions
On 2019/03/27 14:26, David Rowley wrote: > On Wed, 27 Mar 2019 at 18:13, Amit Langote > wrote: >> >> On 2019/03/27 13:50, Amit Langote wrote: >>> On 2019/03/27 12:06, Amit Langote wrote: I wonder if I should rework inherit.c so that its internal interfaces don't pass around parent Relation, but make do with the RelOptInfo? I'll need to add tupdesc and reltype fields to RelOptInfo to go ahead with that though. >>> >>> To give more context on the last sentence, the only place we need the >>> TupleDesc for is make_inh_translation_list(). Maybe, instead of adding >>> tupdesc to RelOptInfo, we could add a List of Vars of all attributes of a >>> table. To avoid the overhead of always setting it, maybe we could set it >>> only for inheritance parent tables. Adding tupdesc to RelOptInfo might be >>> a hard sell to begin with, because it would need to be pinned and then >>> released at the end of planning. >>> >>> I'll see if I can make this work. >> >> Hmm, scratch that. We'd need to keep around the attribute names too for >> comparing with the attribute names of the children during translation. >> Maybe we could store TargetEntry's in the list instead of just Vars, but >> maybe that's too much. >> >> Furthermore, if we make make_inh_translation_list too dependent on this >> stuff, that will make it hard to use from inheritance_planner(), where we >> expand the target inheritance without having created any RelOptInfos. > > Perhaps the way to make this work, at least in the long term is to do > in the planner what we did in the executor in d73f4c74dd34. Maybe you meant 9ddef36278a9? What would be nice is being able to readily access Relation pointers of all tables accessed in a query from anywhere in the planner, whereby, a given table is opened only once. Note that Tom complained upthread that the patch is introducing table_open()'s at random points within the planner, which is something to avoid. 
> I'm not > quite sure how exactly you'd know what size to make the array, but if > it's not something that's happening for PG12, then maybe we can just > use a List, once we get array based versions of them, at least. We're going to have to expand the PlannerInfo arrays (simple_rel_*, append_rel_array, etc.) with the patches here anyway. Maybe, it will be just one more array to consider? Thanks, Amit
Re: speeding up planning with partitions
On Wed, 27 Mar 2019 at 18:13, Amit Langote wrote: > > On 2019/03/27 13:50, Amit Langote wrote: > > On 2019/03/27 12:06, Amit Langote wrote: > >> I wonder if I should rework inherit.c so that its internal interfaces > >> don't pass around parent Relation, but make do with the RelOptInfo? I'll > >> need to add tupdesc and reltype fields to RelOptInfo to go ahead with that > >> though. > > > > To give more context on the last sentence, the only place we need the > > TupleDesc for is make_inh_translation_list(). Maybe, instead of adding > > tupdesc to RelOptInfo, we could add a List of Vars of all attributes of a > > table. To avoid the overhead of always setting it, maybe we could set it > > only for inheritance parent tables. Adding tupdesc to RelOptInfo might be > > a hard sell to begin with, because it would need to be pinned and then > > released at the end of planning. > > > > I'll see if I can make this work. > > Hmm, scratch that. We'd need to keep around the attribute names too for > comparing with the attribute names of the children during translation. > Maybe we could store TargetEntry's in the list instead of just Vars, but > maybe that's too much. > > Furthermore, if we make make_inh_translation_list too dependent on this > stuff, that will make it hard to use from inheritance_planner(), where we > expand the target inheritance without having created any RelOptInfos. Perhaps the way to make this work, at least in the long term is to do in the planner what we did in the executor in d73f4c74dd34. I'm not quite sure how exactly you'd know what size to make the array, but if it's not something that's happening for PG12, then maybe we can just use a List, once we get array based versions of them, at least. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: speeding up planning with partitions
On 2019/03/27 13:50, Amit Langote wrote: > On 2019/03/27 12:06, Amit Langote wrote: >> I wonder if I should rework inherit.c so that its internal interfaces >> don't pass around parent Relation, but make do with the RelOptInfo? I'll >> need to add tupdesc and reltype fields to RelOptInfo to go ahead with that >> though. > > To give more context on the last sentence, the only place we need the > TupleDesc for is make_inh_translation_list(). Maybe, instead of adding > tupdesc to RelOptInfo, we could add a List of Vars of all attributes of a > table. To avoid the overhead of always setting it, maybe we could set it > only for inheritance parent tables. Adding tupdesc to RelOptInfo might be > a hard sell to begin with, because it would need to be pinned and then > released at the end of planning. > > I'll see if I can make this work. Hmm, scratch that. We'd need to keep around the attribute names too for comparing with the attribute names of the children during translation. Maybe we could store TargetEntry's in the list instead of just Vars, but maybe that's too much. Furthermore, if we make make_inh_translation_list too dependent on this stuff, that will make it hard to use from inheritance_planner(), where we expand the target inheritance without having created any RelOptInfos. Thanks, Amit
Re: speeding up planning with partitions
On 2019/03/27 12:06, Amit Langote wrote: > I wonder if I should rework inherit.c so that its internal interfaces > don't pass around parent Relation, but make do with the RelOptInfo? I'll > need to add tupdesc and reltype fields to RelOptInfo to go ahead with that > though. To give more context on the last sentence, the only place we need the TupleDesc for is make_inh_translation_list(). Maybe, instead of adding tupdesc to RelOptInfo, we could add a List of Vars of all attributes of a table. To avoid the overhead of always setting it, maybe we could set it only for inheritance parent tables. Adding tupdesc to RelOptInfo might be a hard sell to begin with, because it would need to be pinned and then released at the end of planning. I'll see if I can make this work. Thanks, Amit
Re: speeding up planning with partitions
On 2019/03/27 1:06, Tom Lane wrote: > Amit Langote writes: >> 0002 is a new patch to get rid of the duplicate RTE and PlanRowMark that's >> created for partitioned parent table, as it's pointless. I was planning >> to propose it later, but decided to post it now if we're modifying the >> nearby code anyway. > > Good idea, but it needs a bit more work on nearby comments, and the > logic in expand_single_inheritance_child could be simplified, and > you evidently didn't try check-world because the RTE-removal causes > cosmetic changes in postgres_fdw.out. I fixed that up and pushed > this part. Thanks. Sorry that I left a lot to be done. Regards, Amit
Re: speeding up planning with partitions
Amit Langote writes: > On 2019/03/23 6:07, Tom Lane wrote: >> I also feel like you used a dartboard while deciding where to insert the >> call in query_planner(); dropping it into the middle of a sequence of >> equivalence-class-related operations seems quite random. Maybe we could >> delay that step all the way to just before make_one_rel, since the other >> stuff in between seems to only care about baserels? For example, >> it'd certainly be better if we could run remove_useless_joins before >> otherrel expansion, so that we needn't do otherrel expansion at all >> on a table that gets thrown away as being a useless join. > create_lateral_join_info() expects otherrels to be present to propagate > the information it creates. Sure, but the actually-useful work in that function is just concerned with baserels. The only thing it does with otherrels is to propagate parents' lateral-ref info to children, which is a lot easier, cheaper, and more reliable if we let build_simple_rel do it. See the version of 0001 I just pushed. I'll look at 0003 and up tomorrow. regards, tom lane
Re: speeding up planning with partitions
Amit Langote writes: > 0002 is a new patch to get rid of the duplicate RTE and PlanRowMark that's > created for partitioned parent table, as it's pointless. I was planning > to propose it later, but decided to post it now if we're modifying the > nearby code anyway. Good idea, but it needs a bit more work on nearby comments, and the logic in expand_single_inheritance_child could be simplified, and you evidently didn't try check-world because the RTE-removal causes cosmetic changes in postgres_fdw.out. I fixed that up and pushed this part. regards, tom lane
RE: speeding up planning with partitions
Amit-san, On Tue, Mar 26, 2019 at 7:17 AM, Amit Langote wrote: > Rebased patches attached. I could only do the code review of v36-0001. On Mon, Mar 25, 2019 at 11:35 AM, Amit Langote wrote: > On 2019/03/23 6:07, Tom Lane wrote: > > Amit Langote writes: > >> [ v34 patch set ] > > > > I had a bit of a look through this. I went ahead and pushed 0008 and > > 0009, as they seem straightforward and independent of the rest. (BTW, > > 0009 makes 0003's dubious optimization in set_relation_partition_info > > quite unnecessary.) As for the rest: > > > > 0001: OK in principle, but I wonder why you implemented it by adding > > another recursive scan of the jointree rather than just iterating > > over the baserel array, as in make_one_rel() for instance. Seems > > like this way is more work, and more code that we'll have to touch > > if we ever change the jointree representation. > > I've changed this to work by iterating over baserel array. I was mostly > worried about looping over potentially many elements of baserel array that > we simply end up skipping, but considering the other patch that delays > adding inheritance children, we don't need to worry about that. > > > I also feel like you used a dartboard while deciding where to insert the > > call in query_planner(); dropping it into the middle of a sequence of > > equivalence-class-related operations seems quite random. Maybe we could > > delay that step all the way to just before make_one_rel, since the other > > stuff in between seems to only care about baserels? For example, > > it'd certainly be better if we could run remove_useless_joins before > > otherrel expansion, so that we needn't do otherrel expansion at all > > on a table that gets thrown away as being a useless join. > > create_lateral_join_info() expects otherrels to be present to propagate > the information it creates. > > I have moved add_other_rels_to_query() followed by > create_lateral_join_info() to just before make_one_rel(). 
I checked that the 0001 patch correctly modifies the things discussed above. The only problem I found is a little typo: s/propgated/propagated/ I'm sorry I'm short on time to review further for now... -- Yoshikazu Imai
RE: speeding up planning with partitions
On Fri, Mar 22, 2019 at 9:07 PM, Tom Lane wrote: > BTW, it strikes me that we could take advantage of the fact that baserels > must all appear before otherrels in the rel array, by having loops over > that array stop early if they're only interested in baserels. We could > either save the index of the last baserel, or just extend the loop logic > to fall out upon hitting an otherrel. > Seems like this'd save some cycles when there are lots of partitions, > although perhaps loops like that would be fast enough to not matter. Actually, this speeds up planning time a little when scanning a lot of otherrels, e.g. when selecting from a table with thousands of partitions. I tested as below.

* rt with 8192 partitions
* execute "select * from rt;" for 60 seconds.

[results]
HEAD: 4.19 TPS (4.06, 4.34, 4.17)
(v34 patches) + (looping over only baserels): 4.26 TPS (4.31, 4.28, 4.19)

Attached is the diff I used for this test. -- Yoshikazu Imai looping-over-last-baserel-idx.diff Description: looping-over-last-baserel-idx.diff
Re: speeding up planning with partitions
I wrote: > ... I also > don't like the undocumented way that you've got build_base_rel_tlists > working on something that's not the final tlist, and then going back > and doing more work of the same sort later. I wonder whether we can > postpone build_base_rel_tlists until after the other stuff happens, > so that it can continue to do all of the tlist-distribution work. On further reflection: we don't really need the full build_base_rel_tlists pushups for these extra variables. The existence of the original rowmark variable will be sufficient, for example, to make sure that we don't decide the parent partitioned table can be thrown away as a useless join. All we need to do is add the new Var to the parent's reltarget and adjust its attr_needed, and we can do that very late in query_planner. So now I'm thinking leave build_base_rel_tlists() where it is, and then fix up the tlist/reltarget data on the fly when we discover that new child rels need more rowmark types than we had before. (This'd be another reason to keep the tlist in root->processed_tlist throughout, likely.) regards, tom lane
Re: speeding up planning with partitions
Amit Langote writes: > [ v34 patch set ] I had a bit of a look through this. I went ahead and pushed 0008 and 0009, as they seem straightforward and independent of the rest. (BTW, 0009 makes 0003's dubious optimization in set_relation_partition_info quite unnecessary.) As for the rest: 0001: OK in principle, but I wonder why you implemented it by adding another recursive scan of the jointree rather than just iterating over the baserel array, as in make_one_rel() for instance. Seems like this way is more work, and more code that we'll have to touch if we ever change the jointree representation. I also feel like you used a dartboard while deciding where to insert the call in query_planner(); dropping it into the middle of a sequence of equivalence-class-related operations seems quite random. Maybe we could delay that step all the way to just before make_one_rel, since the other stuff in between seems to only care about baserels? For example, it'd certainly be better if we could run remove_useless_joins before otherrel expansion, so that we needn't do otherrel expansion at all on a table that gets thrown away as being a useless join. BTW, it strikes me that we could take advantage of the fact that baserels must all appear before otherrels in the rel array, by having loops over that array stop early if they're only interested in baserels. We could either save the index of the last baserel, or just extend the loop logic to fall out upon hitting an otherrel. Seems like this'd save some cycles when there are lots of partitions, although perhaps loops like that would be fast enough to not matter. 0002: I seriously dislike this from a system-structure viewpoint. For one thing it makes root->processed_tlist rather moot, since it doesn't get set till after most of the work is done. (I don't know if there are any FDWs out there that look at processed_tlist, but they'd be broken by this if so. 
It'd be better to get rid of the field if we can, so that at least such breakage is obvious. Or, maybe, use root->processed_tlist as the communication field rather than having the tlist be a param to query_planner?) I also don't like the undocumented way that you've got build_base_rel_tlists working on something that's not the final tlist, and then going back and doing more work of the same sort later. I wonder whether we can postpone build_base_rel_tlists until after the other stuff happens, so that it can continue to do all of the tlist-distribution work. 0003: I haven't studied this in great detail, but it seems like there's some pretty random things in it, like this addition in inheritance_planner: + /* grouping_planner doesn't need to see the target children. */ + subroot->append_rel_list = list_copy(orig_append_rel_list); That sure seems like a hack unrelated to the purpose of the patch ... and since list_copy is a shallow copy, I'm not even sure it's safe. Won't we be mutating the contained (and shared) AppendRelInfos as we go along? 0004 and 0005: I'm not exactly thrilled about loading more layers of overcomplication into inheritance_planner, especially not when the complication then extends out into new global planner flags with questionable semantics. We should be striving to get rid of that function, as previously discussed, and I fear this is just throwing more roadblocks in the way of doing so. Can we skip these and come back to the question after we replace inheritance_planner with expand-at-the-bottom? 0006: Seems mostly OK, but I'm not very happy about the additional table_open calls it's throwing into random places in the planner. Those can be rather expensive. Why aren't we collecting the appropriate info during the initial plancat.c consultation of the relcache? 0007: Not really sold on this; it seems like it's going to be a net negative for cases where there aren't a lot of partitions. regards, tom lane
Re: speeding up planning with partitions
On Fri, 22 Mar 2019 at 20:39, Amit Langote wrote: > The problem is that make_partitionedrel_pruneinfo() does some work which > involves allocating arrays containing nparts elements and then looping > over the part_rels array to fill values in those arrays that will be used > by run-time pruning. It does all that even *before* it has checked that > run-time pruning will be needed at all, which if it is not, it will simply > discard the result of the aforementioned work. > > Patch 0008 of 0009 rearranges the code such that we check whether we will > need run-time pruning at all before doing any of the work that's necessary > for run-time to work. I had a quick look at the make_partitionedrel_pruneinfo() code earlier before this patch appeared and I agree that something like this could be done. I've not gone over the patch in detail though. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: speeding up planning with partitions
Hi Amit, On 3/22/19 3:39 AM, Amit Langote wrote: I took a stab at this. I think rearranging the code in make_partitionedrel_pruneinfo() slightly will take care of this. The problem is that make_partitionedrel_pruneinfo() does some work which involves allocating arrays containing nparts elements and then looping over the part_rels array to fill values in those arrays that will be used by run-time pruning. It does all that even *before* it has checked that run-time pruning will be needed at all, which if it is not, it will simply discard the result of the aforementioned work. Patch 0008 of 0009 rearranges the code such that we check whether we will need run-time pruning at all before doing any of the work that's necessary for run-time to work. Yeah, this is better :) Increasing the number of partitions now leads to a TPS difference that is within the noise level. Passes check-world. Best regards, Jesper
Re: speeding up planning with partitions
Jesper, Imai-san, Thanks for testing and reporting your findings. On 2019/03/21 23:10, Imai Yoshikazu wrote: > On 2019/03/20 23:25, Jesper Pedersen wrote:> Hi, > > My tests - using hash partitions - shows that the extra time is spent in > > make_partition_pruneinfo() for the relid_subplan_map variable. > > > >64 partitions: make_partition_pruneinfo() 2.52% > > 8192 partitions: make_partition_pruneinfo() 5.43% > > > > TPS goes down ~8% between the two. The test setup is similar to the > above. > > > > Given that Tom is planning to change the List implementation [1] in 13 I > > think using the palloc0 structure is ok for 12 before trying other > > implementation options. Hmm, relid_subplan_map's size should be constant (number of partitions scanned) even as the actual partition count grows. However, looking into make_partitionedrel_pruneinfo(), it seems that it's unconditionally allocating 3 arrays that all have nparts elements: subplan_map = (int *) palloc0(nparts * sizeof(int)); subpart_map = (int *) palloc0(nparts * sizeof(int)); relid_map = (Oid *) palloc0(nparts * sizeof(int)); So, that part has got to cost more as the partition count grows. This is the code for runtime pruning, which is not exercised in our tests, so it might seem odd that we're spending any time here at all. I've been told that we have to perform at least *some* work here if only to conclude that runtime pruning is not needed and it seems that above allocations occur before making that conclusion. Maybe, that's salvageable by rearranging this code a bit. David may be in a better position to help with that. > Thanks for testing. > Yeah, after code looking, I think bottleneck seems be lurking in another > place where this patch doesn't change. I also think the patch is ok as > it is for 12, and this problem will be fixed in 13. 
If this drop in performance can be attributed to the fact that having too many partitions starts stressing other parts of the backend like stats collector, lock manager, etc. as time passes, then I agree that we'll have to tackle them later. Thanks, Amit
Re: speeding up planning with partitions
Hi Jesper, On 2019/03/20 23:25, Jesper Pedersen wrote:> Hi, > > On 3/19/19 11:15 PM, Imai, Yoshikazu wrote: >> Here the details. >> >> [creating partitioned tables (with 1024 partitions)] >> drop table if exists rt; >> create table rt (a int, b int, c int) partition by range (a); >> \o /dev/null >> select 'create table rt' || x::text || ' partition of rt for values >> from (' || >> (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, >> 1024) x; >> \gexec >> \o >> >> [select1024.sql] >> \set a random (1, 1024) >> select * from rt where a = :a; >> >> [pgbench] >> pgbench -n -f select1024.sql -T 60 >> >> > > My tests - using hash partitions - shows that the extra time is spent in > make_partition_pruneinfo() for the relid_subplan_map variable. > >64 partitions: make_partition_pruneinfo() 2.52% > 8192 partitions: make_partition_pruneinfo() 5.43% > > TPS goes down ~8% between the two. The test setup is similar to the above. > > Given that Tom is planning to change the List implementation [1] in 13 I > think using the palloc0 structure is ok for 12 before trying other > implementation options. > > perf sent off-list. > > [1] https://www.postgresql.org/message-id/24783.1551568303%40sss.pgh.pa.us > > Best regards, > Jesper > > Thanks for testing. Yeah, after looking at the code, I think the bottleneck seems to be lurking in another place that this patch doesn't change. I also think the patch is OK as it is for 12, and this problem can be fixed in 13. -- Yoshikazu Imai
Re: speeding up planning with partitions
Hi Amit, On 3/19/19 8:41 PM, Amit Langote wrote: I have fixed all. Attached updated patches. The comments in the thread have been addressed, so I have put the CF entry into Ready for Committer. Best regards, Jesper
Re: speeding up planning with partitions
Hi, On 3/19/19 11:15 PM, Imai, Yoshikazu wrote: Here the details. [creating partitioned tables (with 1024 partitions)] drop table if exists rt; create table rt (a int, b int, c int) partition by range (a); \o /dev/null select 'create table rt' || x::text || ' partition of rt for values from (' || (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 1024) x; \gexec \o [select1024.sql] \set a random (1, 1024) select * from rt where a = :a; [pgbench] pgbench -n -f select1024.sql -T 60 My tests - using hash partitions - show that the extra time is spent in make_partition_pruneinfo() for the relid_subplan_map variable. 64 partitions: make_partition_pruneinfo() 2.52% 8192 partitions: make_partition_pruneinfo() 5.43% TPS goes down ~8% between the two. The test setup is similar to the above. Given that Tom is planning to change the List implementation [1] in 13 I think using the palloc0 structure is ok for 12 before trying other implementation options. perf sent off-list. [1] https://www.postgresql.org/message-id/24783.1551568303%40sss.pgh.pa.us Best regards, Jesper
RE: speeding up planning with partitions
Amit-san, On Wed, Mar 20, 2019 at 9:07 AM, Amit Langote wrote: > On 2019/03/20 17:36, Imai, Yoshikazu wrote: > > On Wed, Mar 20, 2019 at 8:21 AM, Amit Langote wrote: > >> On 2019/03/20 12:15, Imai, Yoshikazu wrote: > >>> [select1024.sql] > >>> \set a random (1, 1024) > >>> select * from rt where a = :a; > >>> > >>> [pgbench] > >>> pgbench -n -f select1024.sql -T 60 > >> > >> Thank you. > >> > >> Could you please try with running pgbench for a bit longer than 60 > seconds? > > > > I run pgbench for 180 seconds but there are still difference. > > Thank you very much. > > > 1024: 7,004 TPS > > 8192: 5,859 TPS > > > > > > I also tested for another number of partitions by running pgbench for > 60 seconds. > > > > num of partTPS > > --- - > > 128 7,579 > > 256 7,528 > > 512 7,512 > > 1024 7,257 (7274, 7246, 7252) > > 2048 6,718 (6627, 6780, 6747) > > 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s results) > > 8192 6,008 (6018, 5999, 6007) > > > > > > I checked whether there are the process which go through the number > of partitions, but I couldn't find. I'm really wondering why this > degradation happens. > > Indeed, it's quite puzzling why. Will look into this. I don't know whether it is useful, but I noticed the usage of get_tabstat_entry increased when many partitions are scanned. -- Yoshikazu Imai
Re: speeding up planning with partitions
Imai-san, On 2019/03/20 17:36, Imai, Yoshikazu wrote: > On Wed, Mar 20, 2019 at 8:21 AM, Amit Langote wrote: >> On 2019/03/20 12:15, Imai, Yoshikazu wrote: >>> [select1024.sql] >>> \set a random (1, 1024) >>> select * from rt where a = :a; >>> >>> [pgbench] >>> pgbench -n -f select1024.sql -T 60 >> >> Thank you. >> >> Could you please try with running pgbench for a bit longer than 60 seconds? > > I run pgbench for 180 seconds but there are still difference. Thank you very much. > 1024: 7,004 TPS > 8192: 5,859 TPS > > > I also tested for another number of partitions by running pgbench for 60 > seconds. > > num of partTPS > --- - > 128 7,579 > 256 7,528 > 512 7,512 > 1024 7,257 (7274, 7246, 7252) > 2048 6,718 (6627, 6780, 6747) > 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s results) > 8192 6,008 (6018, 5999, 6007) > > > I checked whether there are the process which go through the number of > partitions, but I couldn't find. I'm really wondering why this degradation > happens. Indeed, it's quite puzzling why. Will look into this. Thanks, Amit
RE: speeding up planning with partitions
Amit-san, On Wed, Mar 20, 2019 at 8:21 AM, Amit Langote wrote: > On 2019/03/20 12:15, Imai, Yoshikazu wrote: > > Here the details. > > > > [creating partitioned tables (with 1024 partitions)] drop table if > > exists rt; create table rt (a int, b int, c int) partition by range > > (a); \o /dev/null select 'create table rt' || x::text || ' partition > > of rt for values from (' || (x)::text || ') to (' || (x+1)::text || > > ');' from generate_series(1, 1024) x; \gexec \o > > > > [select1024.sql] > > \set a random (1, 1024) > > select * from rt where a = :a; > > > > [pgbench] > > pgbench -n -f select1024.sql -T 60 > > Thank you. > > Could you please try with running pgbench for a bit longer than 60 seconds? I ran pgbench for 180 seconds but there is still a difference.

1024: 7,004 TPS
8192: 5,859 TPS

I also tested other numbers of partitions by running pgbench for 60 seconds.

num of part   TPS
-----------   -----
128    7,579
256    7,528
512    7,512
1024   7,257 (7274, 7246, 7252)
2048   6,718 (6627, 6780, 6747)
4096   6,472 (6434, 6565, 6416) (quoted from above (3)'s results)
8192   6,008 (6018, 5999, 6007)

I checked whether there is any processing that iterates over all the partitions, but I couldn't find any. I'm really wondering why this degradation happens. Yoshikazu Imai
Re: speeding up planning with partitions
Imai-san, On 2019/03/20 12:15, Imai, Yoshikazu wrote: > Here the details. > > [creating partitioned tables (with 1024 partitions)] > drop table if exists rt; > create table rt (a int, b int, c int) partition by range (a); > \o /dev/null > select 'create table rt' || x::text || ' partition of rt for values from (' || > (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 1024) x; > \gexec > \o > > [select1024.sql] > \set a random (1, 1024) > select * from rt where a = :a; > > [pgbench] > pgbench -n -f select1024.sql -T 60 Thank you. Could you please try running pgbench for a bit longer than 60 seconds? Thanks, Amit
RE: speeding up planning with partitions
Amit-san, On Wed, Mar 20, 2019 at 3:01 PM, Amit Langote wrote: > > On Wed, Mar 20, 2019 at 2:34 AM, Amit Langote wrote: > >> On 2019/03/20 11:21, Imai, Yoshikazu wrote: > >>> (4) > >>> We expect the performance does not depend on the number of > >>> partitions > >> after applying all patches, if possible. > >>> > >>> num of partTPS > >>> --- - > >>> 1024 7,257 (7274, 7246, 7252) > >>> 2048 6,718 (6627, 6780, 6747) > >>> 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s > results) > >>> 8192 6,008 (6018, 5999, 6007) > >>> > >>> It seems the performance still depend on the number of partitions. > >>> At > >> the moment, I don't have any idea what cause this problem but can we > >> improve this more? > >> > >> I've noticed [1] this kind of degradation when the server is built > >> with Asserts enabled. Did you? > >> ... > >> [1] > >> > https://www.postgresql.org/message-id/a49372b6-c044-4ac8-84ea-90ad18 > >> b1770d%40lab.ntt.co.jp > > > > No. I did test again from configuring without --enable-cassert but > problem I mentioned still happens. > > Hmm, OK. Can you describe your test setup with more details? Here are the details.

[creating partitioned tables (with 1024 partitions)]
drop table if exists rt;
create table rt (a int, b int, c int) partition by range (a);
\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' || (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 1024) x;
\gexec
\o

[select1024.sql]
\set a random (1, 1024)
select * from rt where a = :a;

[pgbench]
pgbench -n -f select1024.sql -T 60

What I noticed so far is that it also might depend on the query. I created a table with 8192 partitions and ran select statements like "select * from a = :a (which ranges from 1 to 1024)" and "select * from a = :a (which ranges from 1 to 8192)", and the results of those were different. I'll send the perf data off-list. -- Yoshikazu Imai
Re: speeding up planning with partitions
On 2019/03/20 11:51, Imai, Yoshikazu wrote: > Amit-san, > > On Wed, Mar 20, 2019 at 2:34 AM, Amit Langote wrote: >> On 2019/03/20 11:21, Imai, Yoshikazu wrote: >>> (4) >>> We expect the performance does not depend on the number of partitions >> after applying all patches, if possible. >>> >>> num of partTPS >>> --- - >>> 1024 7,257 (7274, 7246, 7252) >>> 2048 6,718 (6627, 6780, 6747) >>> 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s results) >>> 8192 6,008 (6018, 5999, 6007) >>> >>> It seems the performance still depend on the number of partitions. At >> the moment, I don't have any idea what cause this problem but can we improve >> this more? >> >> I've noticed [1] this kind of degradation when the server is built with >> Asserts enabled. Did you? >> ... >> [1] >> https://www.postgresql.org/message-id/a49372b6-c044-4ac8-84ea-90ad18 >> b1770d%40lab.ntt.co.jp > > No. I did test again from configuring without --enable-cassert but problem I > mentioned still happens. Hmm, OK. Can you describe your test setup with more details? Thanks, Amit
RE: speeding up planning with partitions
Amit-san, On Wed, Mar 20, 2019 at 2:34 AM, Amit Langote wrote: > On 2019/03/20 11:21, Imai, Yoshikazu wrote: > > (4) > > We expect the performance does not depend on the number of partitions > after applying all patches, if possible. > > > > num of partTPS > > --- - > > 1024 7,257 (7274, 7246, 7252) > > 2048 6,718 (6627, 6780, 6747) > > 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s results) > > 8192 6,008 (6018, 5999, 6007) > > > > It seems the performance still depend on the number of partitions. At > the moment, I don't have any idea what cause this problem but can we improve > this more? > > I've noticed [1] this kind of degradation when the server is built with > Asserts enabled. Did you? > ... > [1] > https://www.postgresql.org/message-id/a49372b6-c044-4ac8-84ea-90ad18 > b1770d%40lab.ntt.co.jp No. I tested again after configuring without --enable-cassert, but the problem I mentioned still happens. -- Yoshikazu Imai
Re: speeding up planning with partitions
Imai-san, On 2019/03/20 11:21, Imai, Yoshikazu wrote: > (4) > We expect the performance does not depend on the number of partitions after > applying all patches, if possible. > > num of partTPS > --- - > 1024 7,257 (7274, 7246, 7252) > 2048 6,718 (6627, 6780, 6747) > 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s results) > 8192 6,008 (6018, 5999, 6007) > > It seems the performance still depend on the number of partitions. At the > moment, I don't have any idea what cause this problem but can we improve this > more? I've noticed [1] this kind of degradation when the server is built with Asserts enabled. Did you? Thanks, Amit [1] https://www.postgresql.org/message-id/a49372b6-c044-4ac8-84ea-90ad18b1770d%40lab.ntt.co.jp
RE: speeding up planning with partitions
Amit-san, On Wed, Mar 20, 2019 at 0:42 AM, Amit Langote wrote: > On 2019/03/19 20:13, Imai, Yoshikazu wrote: > > Thanks for new patches. > > I looked over them and there are little comments. > > > > ... > > > > I have no more comments about codes other than above :) > > I have fixed all. Attached updated patches. Thanks for the quick modification. I ran the performance tests again; I did the four tests below. Judging from the results of the last test, there might be some room to improve performance further.

(1) v33-0003 slows down queries when there are inherited tables in UPDATE/DELETE's FROM clause. v33-0004 fixes this problem.
* rt with 32 partitions.
* jrt with 32 partitions.
* update rt set c = jrt.c from jrt where rt.b = jrt.b;

patch    TPS
-----    -----
master   66.8 (67.2, 66.8, 66.4)
0003     47.5 (47.2, 47.6, 47.7)
0004     68.8 (69.2, 68.9, 68.4)

It seems to fix the performance problem correctly.

(2) v33-0005 speeds up UPDATE/DELETE by removing the useless copy of child target RTEs in adjust_appendrel_attrs().
* rt with 32 partitions.
* update rt set b = b + 1;

patch    TPS
-----    -----
master   986 (999, 984, 974)
0005     1,589 (1608, 1577, 1582)

It seems to speed up performance as expected.

(3) v33-0006, 0007, 0008 speed up the case when few partitions are scanned among many partitions.
* rt with 4096 partitions, partitioned by column 'a'.
* select rt where rt.a = :a (:a ranges from 1 to 4096)

patch    TPS
-----    -----
master   22.4 (22.4, 22.5, 22.2)
0005     20.9 (20.9, 21.2, 20.6)
0006     2,951 (2958, 2958, 2937)
0007     3,141 (3146, 3141, 3136)
0008     6,472 (6434, 6565, 6416)

Certainly, these patches seem to speed up this case.

(4) We expect the performance not to depend on the number of partitions after applying all patches, if possible.

num of part   TPS
-----------   -----
1024   7,257 (7274, 7246, 7252)
2048   6,718 (6627, 6780, 6747)
4096   6,472 (6434, 6565, 6416) (quoted from above (3)'s results)
8192   6,008 (6018, 5999, 6007)

It seems the performance still depends on the number of partitions.
At the moment, I don't have any idea what causes this problem, but can we improve this more? -- Yoshikazu Imai
Re: speeding up planning with partitions
On 2019/03/20 9:49, Robert Haas wrote: > On Fri, Mar 8, 2019 at 4:18 AM Amit Langote > wrote: >> Maybe you know that range_table_mutator() spends quite a long time if >> there are many target children, but I realized there's no need for >> range_table_mutator() to copy/mutate child target RTEs. First, there's >> nothing to translate in their case. Second, copying them is not necessary >> too, because they're not modified across different planning cycles. If we >> pass to adjust_appendrel_attrs only the RTEs in the original range table >> (that is, before any child target RTEs were added), then >> range_table_mutator() has to do significantly less work and allocates lot >> less memory than before. I've broken this change into its own patch; see >> patch 0004. > > Was just glancing over 0001: > > - * every non-join RTE that is used in the query. Therefore, this routine > - * is the only place that should call build_simple_rel with reloptkind > - * RELOPT_BASEREL. (Note: build_simple_rel recurses internally to build > - * "other rel" RelOptInfos for the members of any appendrels we find here.) > + * every non-join RTE that is specified in the query. Therefore, this > + * routine is the only place that should call build_simple_rel with > + * reloptkind RELOPT_BASEREL. (Note: "other rel" RelOptInfos for the > + * members of any appendrels we find here are built later when query_planner > + * calls add_other_rels_to_query().) > > Used -> specified doesn't seem to change the meaning, so I'm not sure > what the motivation is there. Well, I thought it would clarify that now add_base_rels_to_query() only adds "baserel" RelOptInfos, that is, those for the relations that are directly mentioned in the query. Maybe: ...that is mentioned in the query. or ...that is directly mentioned in the query. ? Thanks, Amit
Re: speeding up planning with partitions
On Fri, Mar 8, 2019 at 4:18 AM Amit Langote wrote: > Maybe you know that range_table_mutator() spends quite a long time if > there are many target children, but I realized there's no need for > range_table_mutator() to copy/mutate child target RTEs. First, there's > nothing to translate in their case. Second, copying them is not necessary > too, because they're not modified across different planning cycles. If we > pass to adjust_appendrel_attrs only the RTEs in the original range table > (that is, before any child target RTEs were added), then > range_table_mutator() has to do significantly less work and allocates lot > less memory than before. I've broken this change into its own patch; see > patch 0004. Was just glancing over 0001: - * every non-join RTE that is used in the query. Therefore, this routine - * is the only place that should call build_simple_rel with reloptkind - * RELOPT_BASEREL. (Note: build_simple_rel recurses internally to build - * "other rel" RelOptInfos for the members of any appendrels we find here.) + * every non-join RTE that is specified in the query. Therefore, this + * routine is the only place that should call build_simple_rel with + * reloptkind RELOPT_BASEREL. (Note: "other rel" RelOptInfos for the + * members of any appendrels we find here are built later when query_planner + * calls add_other_rels_to_query().) Used -> specified doesn't seem to change the meaning, so I'm not sure what the motivation is there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
RE: speeding up planning with partitions
Amit-san, On Tue, Mar 19, 2019 at 6:53 AM, Amit Langote wrote: > On 2019/03/15 9:33, Imai, Yoshikazu wrote: > > On Thu, Mar 14, 2019 at 9:04 AM, Amit Langote wrote: > >>> * In inheritance_planner(), we do ChangeVarNodes() only for > >>> orig_rtable, > >> so the codes concatenating lists of append_rel_list may be able to > be > >> moved before doing ChangeVarNodes() and then the codes concatenating > >> lists of rowmarks, rtable and append_rel_list can be in one block (or > >> one bunch). > >> > >> Yeah, perhaps. I'll check. > > I'm inclined to add source_appinfos to subroot->append_rel_list after > finishing the ChangeVarNodes(subroot->append_rel_list) step, because if > there are many entries in source_appinfos that would unnecessarily make > ChangeVarNodes take longer. OK, thanks. > I've attached updated patches. In the new version, I've moved some code > from 0004 to 0005 patch, so as to avoid mixing unrelated modifications > in one patch. Especially, orig_rtable now only appears after applying > 0005. > > I appreciate your continued interest in these patches. Thanks for the new patches. I looked over them and have a few small comments. [0002] * s/regresion/regression/g (in commit message.) [0003] * I thought "inh flag is it" should be "inh flag is set"...? +* For RTE_RELATION rangetable entries whose inh flag is it, adjust the * The comments below are correct only after applying 0004. +* the query's target relation and no other. Especially, children of any +* source relations are added when the loop below calls grouping_planner +* on the *1st* child target relation. [0004] * s/contain contain/contain/ +* will contain contain references to the subquery RTEs that we've * s/find them children/find their children/ +* AppendRelInfos needed to find them children, so the next [0006] * s/recursivly/recursively/ (in commit message) I have no further comments on the code beyond the above :) -- Yoshikazu Imai
RE: speeding up planning with partitions
Amit-san, On Mon, Mar 18, 2019 at 9:56 AM, Amit Langote wrote: > On 2019/03/15 14:40, Imai, Yoshikazu wrote: > > Amit-san, > > > > I have another little comments about v31-patches. > > > > * We don't need is_first_child in inheritance_planner(). > > > > On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote: > >> On 2019/03/08 16:16, Imai, Yoshikazu wrote: > >>> I attached the diff of modification for v26-0003 patch which also > >> contains some refactoring. > >>> Please see if it is ok. > >> > >> I like the is_first_child variable which somewhat improves > >> readability, so updated the patch to use it. > > > > I noticed that detecting first child query in inheritance_planner() > is already done by "final_rtable == NIL" > > > > /* > > * If this is the first non-excluded child, its post-planning > rtable > > * becomes the initial contents of final_rtable; otherwise, > append > > * just its modified subquery RTEs to final_rtable. > > */ > > if (final_rtable == NIL) > > > > So I think we can use that instead of using is_first_child. > > That's a good suggestion. One problem I see with that is that > final_rtable is set only once we've found the first *non-dummy* child. Ah... I overlooked that. > So, if all the children except the last one are dummy, we'd end up never > reusing the source child objects. Now, if final_rtable was set for the > first child irrespective of whether it turned out to be dummy or not, > which it seems OK to do, then we can implement your suggestion. I thought you meant we could remove, or move, the code below to after the setting of final_rtable. /* * If this child rel was excluded by constraint exclusion, exclude it * from the result plan. */ if (IS_DUMMY_REL(sub_final_rel)) continue; It seems logically OK, but I also think there are cases where useless processing happens. If we execute the update statement below, final_rtable would be unnecessarily expanded by adding placeholders. * partition table rt with 1024 partitions. * partition table pt with 2 partitions.
* update rt set c = ss.c from ( select b from pt union all select b + 3 from pt) ss where a = 1024 and rt.b = ss.b; After all, I think it might be better to use is_first_child, because the meanings of is_first_child and final_rtable are different: is_first_child == true represents that we are currently processing the first child query, while final_rtable == NIL represents that we have not yet found the first non-excluded child query. -- Yoshikazu Imai
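The distinction discussed above can be sketched in a few lines of standalone C. This is not inheritance_planner()'s real state; dummy[] is a hypothetical stand-in for children excluded by constraint exclusion, and the two functions model the two ways of detecting the "first child":

```c
/*
 * Toy model of the "first child" detection discussed above.
 * NOT PostgreSQL code: dummy[i] stands in for "child i was excluded
 * by constraint exclusion" (IS_DUMMY_REL in the real planner).
 */
#include <stdbool.h>

/* An is_first_child flag fires on the very first iteration, period. */
static int
first_by_flag(const bool *dummy, int nchildren)
{
    (void) dummy;               /* excluded-or-not is irrelevant here */
    return nchildren > 0 ? 0 : -1;
}

/*
 * A "final_rtable == NIL" style sentinel only fires at the first
 * NON-dummy child, because dummy children are skipped before the
 * sentinel is ever set.
 */
static int
first_by_sentinel(const bool *dummy, int nchildren)
{
    for (int i = 0; i < nchildren; i++)
        if (!dummy[i])
            return i;
    return -1;                  /* all children dummy: never fires */
}
```

The two agree only when the first child is non-dummy; when early children are dummy (or all of them are), the sentinel-based test fires later or never, which is the problem Amit points out with reusing it for source-object setup.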
Re: speeding up planning with partitions
Imai-san, On 2019/03/15 14:40, Imai, Yoshikazu wrote: > Amit-san, > > I have another little comments about v31-patches. > > * We don't need is_first_child in inheritance_planner(). > > On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote: >> On 2019/03/08 16:16, Imai, Yoshikazu wrote: >>> I attached the diff of modification for v26-0003 patch which also >> contains some refactoring. >>> Please see if it is ok. >> >> I like the is_first_child variable which somewhat improves readability, >> so updated the patch to use it. > > I noticed that detecting first child query in inheritance_planner() is > already done by "final_rtable == NIL" > > /* > * If this is the first non-excluded child, its post-planning rtable > * becomes the initial contents of final_rtable; otherwise, append > * just its modified subquery RTEs to final_rtable. > */ > if (final_rtable == NIL) > > So I think we can use that instead of using is_first_child. That's a good suggestion. One problem I see with that is that final_rtable is set only once we've found the first *non-dummy* child. So, if all the children except the last one are dummy, we'd end up never reusing the source child objects. Now, if final_rtable was set for the first child irrespective of whether it turned out to be dummy or not, which it seems OK to do, then we can implement your suggestion. Thanks, Amit
RE: speeding up planning with partitions
Amit-san, I have another small comment about the v31 patches. * We don't need is_first_child in inheritance_planner(). On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote: > On 2019/03/08 16:16, Imai, Yoshikazu wrote: > > I attached the diff of modification for v26-0003 patch which also > contains some refactoring. > > Please see if it is ok. > > I like the is_first_child variable which somewhat improves readability, > so updated the patch to use it. I noticed that detecting the first child query in inheritance_planner() is already done by "final_rtable == NIL": /* * If this is the first non-excluded child, its post-planning rtable * becomes the initial contents of final_rtable; otherwise, append * just its modified subquery RTEs to final_rtable. */ if (final_rtable == NIL) So I think we can use that instead of using is_first_child. -- Yoshikazu Imai
RE: speeding up planning with partitions
Hi, David On Thu, Mar 14, 2019 at 9:04 AM, David Rowley wrote: > On Thu, 14 Mar 2019 at 21:35, Imai, Yoshikazu > wrote: > > 0007: > > * This changes some processes using "for loop" to using > "while(bms_next_member())" which speeds up processing when we scan few > partitions in one statement, but when we scan a lot of partitions in one > statement, its performance will likely degraded. I measured the > performance of both cases. > > I executed select statement to the table which has 4096 partitions. > > > > [scanning 1 partition] > > Without 0007 : 3,450 TPS > > With 0007: 3,723 TPS > > > > [scanning 4096 partitions] > > Without 0007 : 10.8 TPS > > With 0007: 10.5 TPS > > > > In the above result, performance degrades 3% in case of scanning 4096 > partitions compared before and after applying 0007 patch. I think when > scanning a lot of tables, executor time would be also longer, so the > increasement of planner time would be relatively smaller than it. So we > might not have to care this performance degradation. > > I think it's better to focus on the fewer partitions case due to the fact > that execution initialisation time and actual execution are likely to > take much longer when more partitions are scanned. I did some work on > run-time pruning to tune it for this case. Tom did make a similar argument > in [1] and I explained my reasoning in [2]. Thanks for quoting these threads. Actually, I recalled this argument, so I tested this just to make sure. > bms_next_member has gotten a good performance boost since then and the > cases are not exactly the same since the old version the loop in run-time > pruning checked bms_is_member(), but the fact is, we did end up tuning > for the few partitions case in the end. Wow, I didn't know that. > However, it would be good to see the performance results for > plan+execution time of say a table with 4k parts looking up a single > indexed value. 
> You could have two columns, one that's the partition key > which allows the pruning to take place, and one that's not and results > in scanning all partitions. I'll be surprised if you even notice the > difference between with and without 0007 with the latter case. So I tested the performance of plan+execution time. [set up table and indexes] create table rt (a int, b int) partition by range (a); \o /dev/null select 'create table rt' || x::text || ' partition of rt for values from (' || (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 4096) x; \gexec \o create index b_idx on rt (b); insert into rt select a, b from generate_series(1, 4096) a, generate_series(1, 1000) b; [select_indexed_values.sql] \set b random(1, 1000) select count(*) from rt where b = :b; [pgbench] pgbench -n -f select_indexed_values.sql -T 60 postgres [results] Without 0007: 3.18 TPS (3.25, 3.13, 3.15) With 0007: 3.21 TPS (3.21, 3.23, 3.18) From the results, we didn't see any performance degradation in this case. Actually, performance increased by 1% after applying 0007, but that is probably just measurement error. So, generally, we can assume that the performance difference between bms_next_member and the for loop is absorbed by other processing (execution initialisation and actual execution) when scanning many partitions. -- Yoshikazu Imai
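As a rough standalone illustration of the access pattern 0007 changes, the sketch below contrasts scanning every slot of a part_rels-style array with iterating only the set bits of a live_parts-style bitmap. This is not PostgreSQL's Bitmapset: a single 64-bit word stands in for the set, a nonzero int stands in for a non-NULL part_rels entry, and __builtin_ctzll assumes GCC/Clang:

```c
/*
 * Toy contrast of the two iteration styles discussed in the thread.
 * NOT PostgreSQL code; a uint64_t word models a small live_parts set.
 */
#include <stdint.h>

/* Old style: visit every slot, testing each (like looping part_rels). */
static int
count_dense(const int *parts, int nparts)
{
    int n = 0;
    for (int i = 0; i < nparts; i++)
        if (parts[i] != 0)      /* stands in for part_rels[i] != NULL */
            n++;
    return n;
}

/* New style: visit only set bits (like bms_next_member over live_parts). */
static int
count_sparse(uint64_t live, const int *parts)
{
    int n = 0;
    while (live != 0)
    {
        int i = __builtin_ctzll(live);  /* index of lowest set bit */
        live &= live - 1;               /* clear that bit */
        if (parts[i] != 0)
            n++;
    }
    return n;
}
```

With few live partitions, count_sparse does work proportional to the number of set bits rather than to the total partition count, which is why the sparse style wins in the 1-of-4096 case while costing only slightly more per visited member when everything is live.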
RE: speeding up planning with partitions
Amit-san, On Thu, Mar 14, 2019 at 9:04 AM, Amit Langote wrote: > > 0002: > > * I don't really know that delaying adding resjunk output columns to > the plan doesn't affect any process in the planner. From the comments > of PlanRowMark, those columns are used in only the executor so I think > delaying adding junk Vars wouldn't be harm, is that right? > > I think so. The only visible change in behavior is that "rowmark" junk > columns are now placed later in the final plan's targetlist. ok, thanks. > > 0003: > > * Is there need to do CreatePartitionDirectory() if rte->inh becomes > false? > > > > + else if (rte->rtekind == RTE_RELATION && rte->inh) > > + { > > + rte->inh = has_subclass(rte->relid); > > + > > + /* > > +* While at it, initialize the PartitionDesc > infrastructure for > > +* this query, if not done yet. > > +*/ > > + if (root->glob->partition_directory == NULL) > > + root->glob->partition_directory = > > + > CreatePartitionDirectory(CurrentMemoryContext); > > + } > > We need to have create "partition directory" before we access a > partitioned table's PartitionDesc for planning. These days, we rely > solely on PartitionDesc to determine a partitioned table's children. So, > we need to see the PartitionDesc before we can tell there are zero children > and set inh to false. In other words, we need the "partition directory" > to have been set up in advance. Ah, I see. > > 0004: > > * In addition to commit message, this patch also changes the method > of adjusting AppendRelInfos which reference to the subquery RTEs, and > new one seems logically correct. > > Do you mean I should mention it in the patch's commit message? Actually I firstly thought that it's better to mention it in the patch's commit message, but I didn't mean that here. 
I only wanted to state that I confirmed new method of adjusting AppendRelInfos seems logically correct :) > > * In inheritance_planner(), we do ChangeVarNodes() only for orig_rtable, > so the codes concatenating lists of append_rel_list may be able to be > moved before doing ChangeVarNodes() and then the codes concatenating > lists of rowmarks, rtable and append_rel_list can be in one block (or > one bunch). > > Yeah, perhaps. I'll check. > > On 2019/03/14 17:35, Imai, Yoshikazu wrote:> Amit-san, > > I have done code review of v31 patches from 0004 to 0008. > > > > 0004: > > * s/childern/children > > Will fix. > > > 0005: > > * This seems reasonable for not using a lot of memory in specific > > case, although it needs special looking of planner experts. > > Certainly. Thanks for these. > > 0006: > > * The codes initializing/setting RelOptInfo's part_rels looks like a > > bit complicated, but I didn't come up with any good design so far. > > I guess you mean in add_appendrel_other_rels(). The other option is not > have that code there and expand partitioned tables freshly for every > target child, which we got rid of in patch 0004. But we don't want to > do that. Yeah, that's right. > > 0007: > > * This changes some processes using "for loop" to using > > "while(bms_next_member())" which speeds up processing when we scan few > > partitions in one statement, but when we scan a lot of partitions in > > one statement, its performance will likely degraded. I measured the > > performance of both cases. > > I executed select statement to the table which has 4096 partitions. > > > > [scanning 1 partition] > > Without 0007 : 3,450 TPS > > With 0007: 3,723 TPS > > > > [scanning 4096 partitions] > > Without 0007 : 10.8 TPS > > With 0007: 10.5 TPS > > > > In the above result, performance degrades 3% in case of scanning 4096 > > partitions compared before and after applying 0007 patch. 
I think when > > scanning a lot of tables, executor time would be also longer, so the > > increasement of planner time would be relatively smaller than it. So > > we might not have to care this performance degradation. > > Well, live_parts bitmapset is optimized for queries scanning only few > partitions. It's true that it's slightly more expensive than a simple > for loop over part_rels, but not too much as you're also saying. Thanks for the comments. -- Yoshikazu Imai
Re: speeding up planning with partitions
On Thu, 14 Mar 2019 at 21:35, Imai, Yoshikazu wrote: > 0007: > * This changes some processes using "for loop" to using > "while(bms_next_member())" which speeds up processing when we scan few > partitions in one statement, but when we scan a lot of partitions in one > statement, its performance will likely degraded. I measured the performance > of both cases. > I executed select statement to the table which has 4096 partitions. > > [scanning 1 partition] > Without 0007 : 3,450 TPS > With 0007: 3,723 TPS > > [scanning 4096 partitions] > Without 0007 : 10.8 TPS > With 0007: 10.5 TPS > > In the above result, performance degrades 3% in case of scanning 4096 > partitions compared before and after applying 0007 patch. I think when > scanning a lot of tables, executor time would be also longer, so the > increasement of planner time would be relatively smaller than it. So we might > not have to care this performance degradation. I think it's better to focus on the fewer partitions case due to the fact that execution initialisation time and actual execution are likely to take much longer when more partitions are scanned. I did some work on run-time pruning to tune it for this case. Tom did make a similar argument in [1] and I explained my reasoning in [2]. bms_next_member has gotten a good performance boost since then and the cases are not exactly the same since the old version the loop in run-time pruning checked bms_is_member(), but the fact is, we did end up tuning for the few partitions case in the end. However, it would be good to see the performance results for plan+execution time of say a table with 4k parts looking up a single indexed value. You could have two columns, one that's the partition key which allows the pruning to take place, and one that's not and results in scanning all partitions. I'll be surprised if you even notice the difference between with and without 0007 with the latter case. 
[1] https://www.postgresql.org/message-id/16107.1542307838%40sss.pgh.pa.us [2] https://www.postgresql.org/message-id/CAKJS1f8ZnAW9VJNpJW16t5CtXSq3eAseyJXdumLaYb8DiTbhXA%40mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: speeding up planning with partitions
Imai-san, On 2019/03/13 19:34, Imai, Yoshikazu wrote: > I have done code review of v31 patches from 0001 to 0004. > I described below what I confirmed or thoughts. Thanks for checking. > 0001: This seems ok. > > 0002: > * I don't really know that delaying adding resjunk output columns to the plan > doesn't affect any process in the planner. From the comments of PlanRowMark, > those columns are used in only the executor so I think delaying adding junk > Vars wouldn't be harm, is that right? I think so. The only visible change in behavior is that "rowmark" junk columns are now placed later in the final plan's targetlist. > 0003: > * Is there need to do CreatePartitionDirectory() if rte->inh becomes false? > > + else if (rte->rtekind == RTE_RELATION && rte->inh) > + { > + rte->inh = has_subclass(rte->relid); > + > + /* > + * While at it, initialize the PartitionDesc > infrastructure for > + * this query, if not done yet. > + */ > + if (root->glob->partition_directory == NULL) > + root->glob->partition_directory = > + > CreatePartitionDirectory(CurrentMemoryContext); > + } We need to have created the "partition directory" before we access a partitioned table's PartitionDesc for planning. These days, we rely solely on PartitionDesc to determine a partitioned table's children. So, we need to see the PartitionDesc before we can tell there are zero children and set inh to false. In other words, we need the "partition directory" to have been set up in advance. > 0004: > * In addition to commit message, this patch also changes the method of > adjusting AppendRelInfos which reference to the subquery RTEs, and new one > seems logically correct. Do you mean I should mention it in the patch's commit message?
> * In inheritance_planner(), we do ChangeVarNodes() only for orig_rtable, so > the codes concatenating lists of append_rel_list may be able to be moved > before doing ChangeVarNodes() and then the codes concatenating lists of > rowmarks, rtable and append_rel_list can be in one block (or one bunch). Yeah, perhaps. I'll check. On 2019/03/14 17:35, Imai, Yoshikazu wrote:> Amit-san, > I have done code review of v31 patches from 0004 to 0008. > > 0004: > * s/childern/children Will fix. > 0005: > * This seems reasonable for not using a lot of memory in specific case, > although it needs special looking of planner experts. Certainly. > 0006: > * The codes initializing/setting RelOptInfo's part_rels looks like a bit > complicated, but I didn't come up with any good design so far. I guess you mean in add_appendrel_other_rels(). The other option is not have that code there and expand partitioned tables freshly for every target child, which we got rid of in patch 0004. But we don't want to do that. > 0007: > * This changes some processes using "for loop" to using > "while(bms_next_member())" which speeds up processing when we scan few > partitions in one statement, but when we scan a lot of partitions in one > statement, its performance will likely degraded. I measured the > performance of both cases. > I executed select statement to the table which has 4096 partitions. > > [scanning 1 partition] > Without 0007 : 3,450 TPS > With 0007: 3,723 TPS > > [scanning 4096 partitions] > Without 0007 : 10.8 TPS > With 0007: 10.5 TPS > > In the above result, performance degrades 3% in case of scanning 4096 > partitions compared before and after applying 0007 patch. I think when > scanning a lot of tables, executor time would be also longer, so the > increasement of planner time would be relatively smaller than it. So we > might not have to care this performance degradation. Well, live_parts bitmapset is optimized for queries scanning only few partitions. 
It's true that it's slightly more expensive than a simple for loop over part_rels, but not too much as you're also saying. Thanks, Amit
RE: speeding up planning with partitions
Amit-san, I have done a code review of the v31 patches from 0004 to 0008. 0004: * s/childern/children 0005: * This seems reasonable for not using a lot of memory in a specific case, although it needs a close look from planner experts. 0006: * The code initializing/setting RelOptInfo's part_rels looks a bit complicated, but I haven't come up with any good design so far. 0007: * This changes some processes using "for loop" to using "while(bms_next_member())", which speeds up processing when we scan few partitions in one statement, but when we scan a lot of partitions in one statement, its performance will likely be degraded. I measured the performance of both cases. I executed a select statement against a table which has 4096 partitions. [scanning 1 partition] Without 0007 : 3,450 TPS With 0007: 3,723 TPS [scanning 4096 partitions] Without 0007 : 10.8 TPS With 0007: 10.5 TPS In the above result, performance degrades by 3% when scanning 4096 partitions, comparing before and after applying the 0007 patch. I think when scanning a lot of tables, executor time would also be longer, so the increase in planner time would be relatively small in comparison. So we might not have to care about this performance degradation. 0008: This seems ok. -- Yoshikazu Imai
RE: speeding up planning with partitions
Amit-san, On Tue, Mar 12, 2019 at 2:34 PM, Amit Langote wrote: > Thanks for the heads up. > > On Tue, Mar 12, 2019 at 10:23 PM Jesper Pedersen > wrote: > > After applying 0004 check-world fails with the attached. CFBot agrees > [1]. > > Fixed. I had forgotten to re-run postgres_fdw tests after making a change > just before submitting. I have done a code review of the v31 patches from 0001 to 0004. I describe below what I confirmed, along with my thoughts. 0001: This seems ok. 0002: * I'm not entirely sure that delaying adding resjunk output columns to the plan doesn't affect any process in the planner. From the comments of PlanRowMark, those columns are used only in the executor, so I think delaying adding junk Vars wouldn't do harm, is that right? 0003: * Is there a need to do CreatePartitionDirectory() if rte->inh becomes false? + else if (rte->rtekind == RTE_RELATION && rte->inh) + { + rte->inh = has_subclass(rte->relid); + + /* +* While at it, initialize the PartitionDesc infrastructure for +* this query, if not done yet. +*/ + if (root->glob->partition_directory == NULL) + root->glob->partition_directory = + CreatePartitionDirectory(CurrentMemoryContext); + } 0004: * In addition to what the commit message says, this patch also changes the method of adjusting AppendRelInfos which reference the subquery RTEs, and the new one seems logically correct. * In inheritance_planner(), we do ChangeVarNodes() only for orig_rtable, so the code concatenating lists of append_rel_list may be able to be moved before doing ChangeVarNodes(), and then the code concatenating the lists of rowmarks, rtable and append_rel_list can be in one block (or one bunch). -- Yoshikazu Imai
Re: speeding up planning with partitions
Hi Amit, On 3/12/19 4:22 AM, Amit Langote wrote: I wrestled with this idea a bit and concluded that we don't have to postpone *all* of preprocess_targetlist() processing to query_planner, only the part that adds row mark "junk" Vars, because only those matter for the problem being solved. To recap, the problem is that delaying adding inheritance children (and hence their row marks if any) means we can't add "junk" columns needed to implement row marks, because which ones to add is not clear until we've seen all the children. I propose that we delay the addition of "junk" Vars to query_planner() so that it doesn't stand in the way of deferring inheritance expansion to query_planner() too. That means the order of reltarget expressions for row-marked rels will change, but I assume that's OK. At least it will be consistent for both non-inherited baserels and inherited ones. Attached updated version of the patches with the above proposal implemented by patch 0002. To summarize, the patches are as follows: 0001: make building of "other rels" a separate step that runs after deconstruct_jointree(), implemented by a new subroutine of query_planner named add_other_rels_to_query() 0002: delay adding "junk" Vars to after add_other_rels_to_query() 0003: delay inheritance expansion to add_other_rels_to_query() 0004, 0005: adjust inheritance_planner() to account for the fact that inheritance is now expanded by query_planner(), not subquery_planner() 0006: perform partition pruning while inheritance is being expanded, instead of during set_append_append_rel_size() 0007: add a 'live_parts' field to RelOptInfo to store partition indexes (not RT indexes) of unpruned partitions, which speeds up looping over part_rels array of the partitioned parent 0008: avoid copying PartitionBoundInfo struct for planning After applying 0004 check-world fails with the attached. CFBot agrees [1]. 
[1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/505107884 Best regards, Jesper diff -U3 /home/jpedersen/PostgreSQL/postgresql/contrib/postgres_fdw/expected/postgres_fdw.out /home/jpedersen/PostgreSQL/postgresql/contrib/postgres_fdw/results/postgres_fdw.out --- /home/jpedersen/PostgreSQL/postgresql/contrib/postgres_fdw/expected/postgres_fdw.out 2019-03-12 07:46:08.430690229 -0400 +++ /home/jpedersen/PostgreSQL/postgresql/contrib/postgres_fdw/results/postgres_fdw.out 2019-03-12 07:59:01.134285159 -0400 @@ -7190,8 +7190,8 @@ -- Check UPDATE with inherited target and an inherited source table explain (verbose, costs off) update bar set f2 = f2 + 100 where f1 in (select f1 from foo); - QUERY PLAN -- + QUERY PLAN +--- Update on public.bar Update on public.bar Foreign Update on public.bar2 @@ -7214,22 +7214,22 @@ Output: foo2.f1, foo2.ctid, foo2.*, foo2.tableoid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -> Hash Join - Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid + Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.* Inner Unique: true Hash Cond: (bar2.f1 = foo.f1) -> Foreign Scan on public.bar2 Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE -> Hash - Output: foo.f1, foo.ctid, foo.*, foo.tableoid + Output: foo.f1, foo.ctid, foo.* -> HashAggregate - Output: foo.f1, foo.ctid, foo.*, foo.tableoid + Output: foo.f1, foo.ctid, foo.* Group Key: foo.f1 -> Append -> Seq Scan on public.foo - Output: foo.f1, foo.ctid, foo.*, foo.tableoid + Output: foo.f1, foo.ctid, foo.* -> Foreign Scan on public.foo2 - Output: foo2.f1, foo2.ctid, foo2.*, foo2.tableoid + Output: foo2.f1, foo2.ctid, foo2.* Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 (39 rows)
RE: speeding up planning with partitions
Amit-san, On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote: > On 2019/03/08 16:16, Imai, Yoshikazu wrote: > > So I modified the code and did test to confirm memory increasement don't > happen. The test and results are below. > > > > [test] > > * Create partitioned table with 1536 partitions. > > * Execute "update rt set a = random();" > > > > [results] > > A backend uses below amount of memory in update transaction: > > > > HEAD: 807MB > > With v26-0001, 0002: 790MB > > With v26-0001, 0002, 0003: 860MB > > With v26-0003 modified: 790MB > > Can you measure with v28, or better attached v29 (about which more below)? > > > I attached the diff of modification for v26-0003 patch which also > contains some refactoring. > > Please see if it is ok. > > I like the is_first_child variable which somewhat improves readability, > so updated the patch to use it. > > Maybe you know that range_table_mutator() spends quite a long time if > there are many target children, but I realized there's no need for > range_table_mutator() to copy/mutate child target RTEs. First, there's > nothing to translate in their case. Second, copying them is not necessary > too, because they're not modified across different planning cycles. If > we pass to adjust_appendrel_attrs only the RTEs in the original range > table (that is, before any child target RTEs were added), then > range_table_mutator() has to do significantly less work and allocates > lot less memory than before. I've broken this change into its own patch; > see patch 0004. Cool! I tested with v29 patches and checked it saved a lot of memory.. HEAD: 807MB With v29-0001, 0002, 0003, 0004: 167MB Maybe planning time in this case is also improved, but I didn't check it. > but I realized there's no need for > range_table_mutator() to copy/mutate child target RTEs. First, there's > nothing to translate in their case. Second, copying them is not necessary > too, because they're not modified across different planning cycles. 
Yeah. Although I couldn't check the code in detail, from the comments below in inheritance_planner(), ISTM we need copies of the subquery RTEs but not copies of the other RTEs in each planning cycle. /* * We generate a modified instance of the original Query for each target * relation, plan that, and put all the plans into a list that will be * controlled by a single ModifyTable node. All the instances share the * same rangetable, but each instance must have its own set of subquery * RTEs within the finished rangetable because (1) they are likely to get * scribbled on during planning, and (2) it's not inconceivable that * subqueries could get planned differently in different cases. We need * not create duplicate copies of other RTE kinds, in particular not the * target relations, because they don't have either of those issues. Not * having to duplicate the target relations is important because doing so * (1) would result in a rangetable of length O(N^2) for N targets, with * at least O(N^3) work expended here; and (2) would greatly complicate * management of the rowMarks list. Thanks -- Yoshikazu Imai
RE: speeding up planning with partitions
Amit-san, On Wed, Mar 6, 2019 at 5:38 AM, Amit Langote wrote: ... > > I didn't investigate that problem, but there is another memory > > increase > issue, which is because of 0003 patch I think. I'll try to solve the latter > issue. > > Interested in details as it seems to be a separate problem. I solved this problem. I think we don't need the list_copy in the code below. + /* +* No need to copy of the RTEs themselves, but do copy the List +* structure. +*/ + subroot->parse->rtable = list_copy(rtable_with_target); This is because subroot->parse->rtable will be copied again by: subroot->parse = (Query *) adjust_appendrel_attrs(parent_root, - (Node *) parent_parse, + (Node *) subroot->parse, 1, ); So I modified the code and ran a test to confirm that the memory increase doesn't happen. The test and results are below. [test] * Create a partitioned table with 1536 partitions. * Execute "update rt set a = random();" [results] A backend uses the following amount of memory in the update transaction: HEAD: 807MB With v26-0001, 0002: 790MB With v26-0001, 0002, 0003: 860MB With v26-0003 modified: 790MB I attached the diff of the modification for the v26-0003 patch, which also contains some refactoring. Please see if it is ok. (Sorry, it is a modification for the v26 patch though the latest ones are v28.) -- Yoshikazu Imai Attachment: v26-0003-solving-memory-increasement-problem.diff
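For illustration, a shallow list copy of the kind discussed above duplicates only the list structure while sharing the element pointers, which is why it adds no protection when a later pass (here, adjust_appendrel_attrs) rebuilds the list anyway. The sketch below uses hypothetical types, not PostgreSQL's pg_list.h API:

```c
/*
 * Toy shallow list copy: new cons cells, shared payload pointers.
 * NOT PostgreSQL's List implementation; just a sketch of the semantics.
 */
#include <stdlib.h>

typedef struct Cell
{
    void       *ptr;            /* shared payload (think: an RTE) */
    struct Cell *next;
} Cell;

static Cell *
cons(void *ptr, Cell *next)
{
    Cell *c = malloc(sizeof(Cell));  /* no OOM handling in this sketch */
    c->ptr = ptr;
    c->next = next;
    return c;
}

/* Copy the list structure only; the elements themselves are not copied. */
static Cell *
list_copy_shallow(const Cell *src)
{
    Cell  *head = NULL;
    Cell **tail = &head;

    for (; src != NULL; src = src->next)
    {
        *tail = cons(src->ptr, NULL);
        tail = &(*tail)->next;
    }
    return head;
}
```

The cells of the copy are fresh allocations, but every payload pointer is shared with the source list; so per the analysis above, if the list is about to be replaced wholesale by a mutating pass, making the shallow copy first is pure overhead.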
Re: speeding up planning with partitions
Hi, On 3/5/19 5:24 AM, Amit Langote wrote: Attached an updated version. Need a rebase due to 898e5e3290. Best regards, Jesper
RE: speeding up planning with partitions
Amit-san,

On Wed, Mar 6, 2019 at 5:38 AM, Amit Langote wrote:
> On 2019/03/06 11:09, Imai, Yoshikazu wrote:
> > [0004 or 0005]
> > There is redundant processing in add_appendrel_other_rels (or
> > expand_xxx_rtentry()?).
> > I modified add_appendrel_other_rels like below and it actually worked.
> >
> > add_appendrel_other_rels(PlannerInfo *root, RangeTblEntry *rte, Index rti)
> > {
> >     ListCell   *l;
> >     RelOptInfo *rel;
> >
> >     /*
> >      * Add inheritance children to the query if not already done.  For child
> >      * tables that are themselves partitioned, their children will be added
> >      * recursively.
> >      */
> >     if (rte->rtekind == RTE_RELATION && !root->contains_inherit_children)
> >     {
> >         expand_inherited_rtentry(root, rte, rti);
> >         return;
> >     }
> >
> >     rel = find_base_rel(root, rti);
> >
> >     foreach(l, root->append_rel_list)
> >     {
> >         AppendRelInfo *appinfo = lfirst(l);
> >         Index          childRTindex = appinfo->child_relid;
> >         RangeTblEntry *childrte;
> >         RelOptInfo    *childrel;
> >
> >         if (appinfo->parent_relid != rti)
> >             continue;
> >
> >         Assert(childRTindex < root->simple_rel_array_size);
> >         childrte = root->simple_rte_array[childRTindex];
> >         Assert(childrte != NULL);
> >         build_simple_rel(root, childRTindex, rel);
> >
> >         /* Child may itself be an inherited relation. */
> >         if (childrte->inh)
> >             add_appendrel_other_rels(root, childrte, childRTindex);
> >     }
> > }
>
> If you don't initialize part_rels here, then it will break any code in
> the planner that expects a partitioned rel with a non-zero number of
> partitions to have part_rels set to non-NULL.  At the moment, that
> includes the code that implements partitioning-specific optimizations
> such as partitionwise aggregate and join, run-time pruning, etc.  No
> existing regression tests cover the cases where source inherited rels
> participate in those optimizations, so you didn't find anything that
> broke.
>
> For example, consider the following update query:
>
> update p set a = p1.a + 1 from p p1 where p1.a = (select 1);
>
> Planner will create a run-time-pruning-aware Append node for p (aliased
> p1), for which it will need to use the part_rels array.  Note that p in
> this case is a source relation which the above code initializes.
>
> Maybe we should add such regression tests.

Ah, now I understand that the code below in expand_inherited_rtentry()
initializes part_rels of child queries after the first child target, and
the part_rels of those are used in partitioning-specific optimizations.
Thanks for the explanation.

--
Yoshikazu Imai
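For readers following along, the invariant Amit describes here — any partitioned rel with partitions must have a non-NULL part_rels array, because later optimizations index into it — can be sketched with a toy Python model (illustrative only, not PostgreSQL code; all names are made up for the sketch):

```python
# Toy model (not PostgreSQL code): why skipping part_rels setup breaks
# later partitioning-specific code such as run-time pruning.

class RelOptInfo:
    def __init__(self, relid, nparts=0):
        self.relid = relid
        self.nparts = nparts      # > 0 means a partitioned rel
        self.part_rels = None     # must be filled when children are added

def expand_partitioned(rel, child_ids):
    """Add child rels AND record them in part_rels, as the planner expects."""
    rel.part_rels = [RelOptInfo(cid) for cid in child_ids]
    return rel.part_rels

def make_runtime_pruning_append(rel, live_part_indexes):
    """Model of code that indexes into part_rels (e.g. run-time pruning)."""
    assert rel.part_rels is not None, "partitioned rel lacks part_rels"
    return [rel.part_rels[i].relid for i in live_part_indexes]

p = RelOptInfo(relid=1, nparts=3)
expand_partitioned(p, [2, 3, 4])
# The Append node for "p p1" in:  update p set a = p1.a + 1 from p p1 ...
# needs part_rels to map surviving partition indexes to child rels.
print(make_runtime_pruning_append(p, [0, 2]))   # [2, 4]
```

The simplified add_appendrel_other_rels in the review corresponds to calling build_simple_rel for each child without ever filling part_rels, so the assertion in the pruning step would trip.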
Re: speeding up planning with partitions
Imai-san,

Thanks for the review.

On 2019/03/06 11:09, Imai, Yoshikazu wrote:
> Here is the code review for the previous v26 patches.
>
> [0002]
> In expand_inherited_rtentry():
>
> expand_inherited_rtentry()
> {
> ...
> +	RelOptInfo *rel = NULL;
>
> can be declared later:
>
> if (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> ...
> else
> {
>     List       *appinfos = NIL;
>     RangeTblEntry *childrte;
>     Index       childRTindex;
> +   RelOptInfo *rel = NULL;

This is fixed in v27.

> [0003]
> In inheritance_planner:
>
> +	rtable_with_target = list_copy(root->parse->rtable);
>
> can be:
>
> +	rtable_with_target = list_copy(parse->rtable);

Sure.

> [0004 or 0005]
> There is redundant processing in add_appendrel_other_rels (or
> expand_xxx_rtentry()?).
> I modified add_appendrel_other_rels like below and it actually worked.
>
> add_appendrel_other_rels(PlannerInfo *root, RangeTblEntry *rte, Index rti)
> {
>     ListCell   *l;
>     RelOptInfo *rel;
>
>     /*
>      * Add inheritance children to the query if not already done.  For child
>      * tables that are themselves partitioned, their children will be added
>      * recursively.
>      */
>     if (rte->rtekind == RTE_RELATION && !root->contains_inherit_children)
>     {
>         expand_inherited_rtentry(root, rte, rti);
>         return;
>     }
>
>     rel = find_base_rel(root, rti);
>
>     foreach(l, root->append_rel_list)
>     {
>         AppendRelInfo *appinfo = lfirst(l);
>         Index          childRTindex = appinfo->child_relid;
>         RangeTblEntry *childrte;
>         RelOptInfo    *childrel;
>
>         if (appinfo->parent_relid != rti)
>             continue;
>
>         Assert(childRTindex < root->simple_rel_array_size);
>         childrte = root->simple_rte_array[childRTindex];
>         Assert(childrte != NULL);
>         build_simple_rel(root, childRTindex, rel);
>
>         /* Child may itself be an inherited relation. */
>         if (childrte->inh)
>             add_appendrel_other_rels(root, childrte, childRTindex);
>     }
> }

If you don't initialize part_rels here, then it will break any code in
the planner that expects a partitioned rel with a non-zero number of
partitions to have part_rels set to non-NULL.  At the moment, that
includes the code that implements partitioning-specific optimizations
such as partitionwise aggregate and join, run-time pruning, etc.  No
existing regression tests cover the cases where source inherited rels
participate in those optimizations, so you didn't find anything that
broke.

For example, consider the following update query:

update p set a = p1.a + 1 from p p1 where p1.a = (select 1);

Planner will create a run-time-pruning-aware Append node for p (aliased
p1), for which it will need to use the part_rels array.  Note that p in
this case is a source relation which the above code initializes.

Maybe we should add such regression tests.

> I didn't investigate that problem, but there is another memory increase
> issue, which I think is caused by the 0003 patch.  I'll try to solve the
> latter issue.

Interested in details, as it seems to be a separate problem.

Thanks,
Amit
RE: speeding up planning with partitions
Amit-san,

On Tue, Mar 5, 2019 at 10:24 AM, Amit Langote wrote:
> On 2019/03/05 9:50, Amit Langote wrote:
> > I'll post the updated patches after diagnosing what I'm suspecting a
> > memory over-allocation bug in one of the patches.  If you configure
> > the build with --enable-cassert, you'll see that throughput decreases
> > as the number of partitions runs into many thousands, but it doesn't
> > when asserts are turned off.
>
> Attached an updated version.  This incorporates fixes for both Jesper's
> and Imai-san's review.

Thanks for updating the patches!

Here is the code review for the previous v26 patches.

[0002]
In expand_inherited_rtentry():

expand_inherited_rtentry()
{
...
+	RelOptInfo *rel = NULL;

can be declared later:

if (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
...
else
{
    List       *appinfos = NIL;
    RangeTblEntry *childrte;
    Index       childRTindex;
+   RelOptInfo *rel = NULL;

[0003]
In inheritance_planner:

+	rtable_with_target = list_copy(root->parse->rtable);

can be:

+	rtable_with_target = list_copy(parse->rtable);

[0004 or 0005]
There is redundant processing in add_appendrel_other_rels (or
expand_xxx_rtentry()?).
I modified add_appendrel_other_rels like below and it actually worked.

add_appendrel_other_rels(PlannerInfo *root, RangeTblEntry *rte, Index rti)
{
    ListCell   *l;
    RelOptInfo *rel;

    /*
     * Add inheritance children to the query if not already done.  For child
     * tables that are themselves partitioned, their children will be added
     * recursively.
     */
    if (rte->rtekind == RTE_RELATION && !root->contains_inherit_children)
    {
        expand_inherited_rtentry(root, rte, rti);
        return;
    }

    rel = find_base_rel(root, rti);

    foreach(l, root->append_rel_list)
    {
        AppendRelInfo *appinfo = lfirst(l);
        Index          childRTindex = appinfo->child_relid;
        RangeTblEntry *childrte;
        RelOptInfo    *childrel;

        if (appinfo->parent_relid != rti)
            continue;

        Assert(childRTindex < root->simple_rel_array_size);
        childrte = root->simple_rte_array[childRTindex];
        Assert(childrte != NULL);
        build_simple_rel(root, childRTindex, rel);

        /* Child may itself be an inherited relation. */
        if (childrte->inh)
            add_appendrel_other_rels(root, childrte, childRTindex);
    }
}

> and Imai-san's review.  I haven't been able to pin down the bug (or
> whatever) that makes throughput go down as the partition count
> increases, when tested with a --enable-cassert build.

I didn't investigate that problem, but there is another memory increase
issue, which I think is caused by the 0003 patch.  I'll try to solve the
latter issue.

--
Yoshikazu Imai
Re: speeding up planning with partitions
On 2019/03/06 0:57, Jesper Pedersen wrote:
> On 3/5/19 5:24 AM, Amit Langote wrote:
>> Attached an updated version.  This incorporates fixes for both Jesper's
>> and Imai-san's review.  I haven't been able to pin down the bug (or
>> whatever) that makes throughput go down as the partition count
>> increases, when tested with a --enable-cassert build.
>
> Thanks !
>
> I'm seeing the throughput going down as well, but are you sure it isn't
> just the extra calls of MemoryContextCheck you are seeing ?  A flamegraph
> diff highlights that area -- sent offline.
>
> A non-cassert build shows the same profile for 64 and 1024 partitions.

Thanks for testing.  I now see what's happening here.  I was aware that
it's MemoryContextCheck overhead, but didn't quite understand why the time
it takes should increase as the number of partitions increases, especially
given that the patch makes it so that only one partition is accessed if
that's all the query needs to do.

What I had forgotten, however, is that MemoryContextCheck checks *all*
contexts in every transaction, including the "partition descriptor"
context which stores a partitioned table's PartitionDesc.  PartitionDesc
gets bigger as the number of partitions increases, and so does the time to
check the context it's allocated in.

So, the decrease in throughput in a cassert build as the number of
partitions increases is not due to any fault of this patch series, as I
had suspected.  Phew!

Thanks,
Amit
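Amit's diagnosis — that the cassert slowdown scales with the size of the partition-descriptor context, not with how many partitions the query actually touches — can be modeled with a short sketch (illustrative Python, not PostgreSQL code; the context names and chunk counts are invented for the model):

```python
# Toy model of a MemoryContextCheck-style walk over *all* allocated chunks
# once per transaction.  Even when pruning leaves only one partition to
# plan, the per-transaction check cost grows with the PartitionDesc size.

def memory_context_check(contexts):
    """Walk every chunk in every context, as cassert builds do."""
    checked = 0
    for chunks in contexts.values():
        checked += len(chunks)       # one unit of work per chunk
    return checked

def run_transaction(nparts):
    contexts = {
        "PlannerContext": ["chunk"] * 10,           # small, roughly fixed
        "PartitionDescContext": ["bound"] * nparts, # grows with partitions
    }
    # ... plan the query; with pruning only 1 partition is accessed ...
    return memory_context_check(contexts)

# Check work per transaction grows with nparts, independent of the plan:
print(run_transaction(64), run_transaction(8192))   # 74 8202
```

This is why a non-cassert build shows the same profile for 64 and 1024 partitions: without the check, the per-transaction cost really does depend only on the partitions the query touches.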
Re: speeding up planning with partitions
On 3/5/19 5:24 AM, Amit Langote wrote:
> Attached an updated version.  This incorporates fixes for both Jesper's
> and Imai-san's review.  I haven't been able to pin down the bug (or
> whatever) that makes throughput go down as the partition count
> increases, when tested with a --enable-cassert build.

Thanks !

I'm seeing the throughput going down as well, but are you sure it isn't
just the extra calls of MemoryContextCheck you are seeing ?  A flamegraph
diff highlights that area -- sent offline.

A non-cassert build shows the same profile for 64 and 1024 partitions.

Best regards,
Jesper
Re: speeding up planning with partitions
On 2019/03/04 19:38, Amit Langote wrote:
> 2. Defer inheritance expansion to add_other_rels_to_query().  Although
> the purpose of doing this is to perform partition pruning before adding
> the children, this patch doesn't change when the pruning occurs.  It
> deals with other issues that must be taken care of due to adding
> children during query_planner instead of during subquery_planner.
> Especially, inheritance_planner now has to add the child target
> relations on its own.  Also, delaying adding children affects adding
> junk columns to the query's targetlist based on PlanRowMarks, because
> preprocess_targetlist can no longer finalize which junk columns to add
> for a "parent" PlanRowMark; that must be delayed until all child
> PlanRowMarks are added and their allMarkTypes propagated to the parent
> PlanRowMark.

I thought more on this and started wondering why we can't call
preprocess_targetlist() from query_planner() instead of from
grouping_planner().  We wouldn't have to treat parent row marks specially
if preprocess_targetlist() were called after adding other rels (and hence
all child row marks).

This will change the order in which expressions are added to baserels'
targetlists, and hence the order of expressions in their Paths'
targetlists, because the expressions contained in the targetlist
(including RETURNING) and other junk expressions will be added after the
expressions referenced in WHERE clauses, whereas the order is the reverse
today.  But if we do what's proposed above, the order will be uniform for
all cases; that is, not one order for regular table baserels and another
for inherited table baserels.

Thoughts?

Thanks,
Amit
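The ordering change Amit describes could be pictured with a tiny sketch (a toy Python model; the expression lists are illustrative stand-ins, not the planner's actual data structures):

```python
# Toy model of the order in which expressions land in a baserel's
# targetlist, before and after moving preprocess_targetlist() into
# query_planner().

where_vars = ["a"]              # expressions referenced in WHERE clauses
tlist_and_junk = ["b", "ctid"]  # targetlist (incl. RETURNING) + junk cols

# Today preprocess_targetlist() runs first, so targetlist/junk expressions
# are added before WHERE-referenced ones:
order_today = tlist_and_junk + where_vars

# With preprocess_targetlist() called after other rels are added, the
# WHERE-referenced expressions come first -- and uniformly so for plain
# and inherited baserels alike:
order_proposed = where_vars + tlist_and_junk

print(order_today)      # ['b', 'ctid', 'a']
print(order_proposed)   # ['a', 'b', 'ctid']
```

The point of the proposal is not either particular order but that one order applies to all baserels, instead of differing between the inherited and non-inherited cases.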
Re: speeding up planning with partitions
On 2019/03/05 9:50, Amit Langote wrote:
> I'll post the updated patches after diagnosing what I'm suspecting a
> memory over-allocation bug in one of the patches.  If you configure the
> build with --enable-cassert, you'll see that throughput decreases as the
> number of partitions runs into many thousands, but it doesn't when
> asserts are turned off.

Attached an updated version.  This incorporates fixes for both Jesper's
and Imai-san's review.  I haven't been able to pin down the bug (or
whatever) that makes throughput go down as the partition count increases,
when tested with a --enable-cassert build.

Thanks,
Amit

From 1e793976d1affae8aa6fa1c835752bad530132b6 Mon Sep 17 00:00:00 2001
From: Amit
Date: Sat, 2 Mar 2019 14:13:13 +0900
Subject: [PATCH v27 1/6] Build "other rels" of appendrel baserels in a
 separate step

Currently they're built in a stanza in build_simple_rel() which builds the
child rels in the same invocation of build_simple_rel that's used to build
the parent relation.  Since the query hasn't been processed to distribute
restriction clauses to individual baserels at this point, we cannot
perform partition pruning before adding the children.

The newly added add_other_rels_to_query() runs *after* query_planner has
distributed restriction clauses to base relations.  This will allow us to
use the clauses applied to a given partitioned baserel to perform
partition pruning, and to add other rels for only the unpruned partitions.
Later patches will do that.
---
 src/backend/optimizer/path/allpaths.c  |   2 +-
 src/backend/optimizer/plan/initsplan.c |  60 ++--
 src/backend/optimizer/plan/planmain.c  |  15 +++--
 src/backend/optimizer/util/relnode.c   | 101 +
 src/include/optimizer/pathnode.h       |   2 +
 src/include/optimizer/planmain.h       |   1 +
 6 files changed, 135 insertions(+), 46 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 0debac75c6..8d8a8f17d5 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1028,7 +1028,7 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 		/*
 		 * The child rel's RelOptInfo was already created during
-		 * add_base_rels_to_query.
+		 * add_other_rels_to_query.
 		 */
 		childrel = find_base_rel(root, childRTindex);
 		Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL);
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index 2afc3f1dfe..077d3203ba 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -20,6 +20,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/cost.h"
+#include "optimizer/inherit.h"
 #include "optimizer/joininfo.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
@@ -30,6 +31,7 @@
 #include "optimizer/prep.h"
 #include "optimizer/restrictinfo.h"
 #include "parser/analyze.h"
+#include "parser/parsetree.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
@@ -97,10 +99,11 @@ static void check_hashjoinable(RestrictInfo *restrictinfo);
  * jtnode.  Internally, the function recurses through the jointree.
  *
  * At the end of this process, there should be one baserel RelOptInfo for
- * every non-join RTE that is used in the query.  Therefore, this routine
- * is the only place that should call build_simple_rel with reloptkind
- * RELOPT_BASEREL.  (Note: build_simple_rel recurses internally to build
- * "other rel" RelOptInfos for the members of any appendrels we find here.)
+ * every non-join RTE that is specified in the query.  Therefore, this
+ * routine is the only place that should call build_simple_rel with
+ * reloptkind RELOPT_BASEREL.  (Note: "other rel" RelOptInfos for the
+ * members of any appendrels we find here are built later when
+ * query_planner calls add_other_rels_to_query().)
  */
 void
 add_base_rels_to_query(PlannerInfo *root, Node *jtnode)
@@ -133,6 +136,55 @@ add_base_rels_to_query(PlannerInfo *root, Node *jtnode)
 			 (int) nodeTag(jtnode));
 }
 
+/*
+ * add_other_rels_to_query
+ *
+ * Scan the query's jointree and, for each base rel that is an appendrel,
+ * create otherrel RelOptInfos of its children.
+ *
+ * At the end of this process, there should be RelOptInfos for all relations
+ * that will be scanned by the query.
+ */
+void
+add_other_rels_to_query(PlannerInfo *root, Node *jtnode)
+{
+	if (jtnode == NULL)
+		return;
+	if (IsA(jtnode, RangeTblRef))
+	{
+		int			varno = ((RangeTblRef *) jtnode)->rtindex;
+		RangeTblEntry *rte = rt_fetch(varno, root->parse->rtable);
+
+		/*
+		 * Only the parent subquery of a flattened UNION ALL and an inherited
+		 * table can have
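The two-phase structure this patch introduces — build baserels and distribute restriction clauses first, then expand appendrels so pruning can see the clauses — might be sketched like this (toy Python, not the actual planner code; the clause-matching scheme is invented for the sketch):

```python
# Toy model of the patch's reordering: restriction clauses are distributed
# to baserels *before* children are added, so the expansion step can prune
# partitions and build RelOptInfos only for survivors.

def add_base_rels(query):
    # Phase 1: one entry per baserel; children NOT added yet.
    return {rel: {"clauses": []} for rel in query["rels"]}

def distribute_clauses(baserels, clauses):
    for rel, clause in clauses:
        baserels[rel]["clauses"].append(clause)

def add_other_rels(baserels, partitions):
    # Phase 2: each baserel's clauses are known now, so prune first.
    for rel, info in list(baserels.items()):
        for part, admits in partitions.get(rel, []):
            if all(admits(c) for c in info["clauses"]):
                baserels[part] = {"clauses": []}   # build only survivors

query = {"rels": ["p"]}
baserels = add_base_rels(query)
distribute_clauses(baserels, [("p", ("a", 1))])    # WHERE a = 1
partitions = {"p": [
    ("p1", lambda c: c == ("a", 1)),   # p1's bounds admit a = 1
    ("p2", lambda c: False),           # p2 is pruned by a = 1
]}
add_other_rels(baserels, partitions)
print(sorted(baserels))   # ['p', 'p1']  -- p2 was never built
```

In the pre-patch order the children would be built during phase 1, before any clause is attached, so nothing could be pruned at that point.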
Re: speeding up planning with partitions
Imai-san,

Thanks for checking.

On 2019/03/05 15:03, Imai, Yoshikazu wrote:
> I've also done code review of the 0001 and 0002 patches so far.
>
> [0001]
> 1. Do we need to open/close a relation in add_appendrel_other_rels()?
>
> +	if (rel->part_scheme)
> 	{
> -		ListCell   *l;
> ...
> -		Assert(cnt_parts == nparts);
> +		rel->part_rels = (RelOptInfo **)
> +			palloc0(sizeof(RelOptInfo *) * rel->nparts);
> +		relation = table_open(rte->relid, NoLock);
> 	}
>
> +	if (relation)
> +		table_close(relation, NoLock);

Ah, it should be moved to another patch.  Actually, to patch 0003, which
makes it necessary to inspect the PartitionDesc.

> 2. We can refine the use of Assert in add_appendrel_other_rels(),
> changing this:
>
> +	if (rel->part_scheme)
> 	{
> ...
> +		rel->part_rels = (RelOptInfo **)
> +			palloc0(sizeof(RelOptInfo *) * rel->nparts);
> +		relation = table_open(rte->relid, NoLock);
> 	}
> ...
> +	i = 0;
> +	foreach(l, root->append_rel_list)
> +	{
> ...
> +		if (rel->part_scheme != NULL)
> +		{
> +			Assert(rel->nparts > 0 && i < rel->nparts);
> +			rel->part_rels[i] = childrel;
> +		}
> +
> +		i++;
>
> to this:
>
> +	if (rel->part_scheme)
> 	{
> ...
> 		Assert(rel->nparts > 0);
> +		rel->part_rels = (RelOptInfo **)
> +			palloc0(sizeof(RelOptInfo *) * rel->nparts);
> +		relation = table_open(rte->relid, NoLock);
> 	}
> ...
> +	i = 0;
> +	foreach(l, root->append_rel_list)
> +	{
> ...
> +		if (rel->part_scheme != NULL)
> +		{
> +			Assert(i < rel->nparts);
> +			rel->part_rels[i] = childrel;
> +		}
> +
> +		i++;

You're right.  That's what makes sense in this context.

> [0002]
> 3. If a variable like is_source_inh_expansion were used, the code would
> be easier to read.  (I might be mistaken that
> root->simple_rel_array_size > 0 means source inheritance expansion,
> though.)

root->simple_rel_array_size > 0 *does* mean source inheritance expansion,
so you're not mistaken.  As you can see, expand_inherited_rtentry is
called by inheritance_planner() to expand target inheritance and by
add_appendrel_other_rels() to expand source inheritance.  Since the
latter is called by query_planner, simple_rel_array would have been
initialized, and hence root->simple_rel_array_size > 0 is true.  Maybe
it'd be a good idea to introduce an is_source_inh_expansion variable for
clarity, as you suggest, and set it to (root->simple_rel_array_size > 0).

> 4. I didn't look very carefully, but should the introduction of
> contains_inherit_children be in the 0003 patch...?

You're right.

Thanks again for the comments.  I've made changes to my local repository.

Thanks,
Amit
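Imai-san's Assert suggestion above — check the loop-invariant condition once where it's established, and keep only the per-iteration condition inside the loop — can be illustrated with a small sketch (a Python stand-in for the C Assert placement; function and variable names are made up):

```python
# Sketch of the suggested Assert refactoring: nparts > 0 is invariant once
# part_scheme is set, so assert it once up front; only i < nparts varies
# per iteration and belongs inside the loop.

def build_part_rels(nparts, child_relids):
    part_rels = None
    if nparts:                      # models: rel->part_scheme is set
        assert nparts > 0           # invariant: checked once, not per child
        part_rels = [None] * nparts
    for i, child in enumerate(child_relids):
        if part_rels is not None:
            assert i < nparts       # per-iteration check stays in the loop
            part_rels[i] = child
    return part_rels

print(build_part_rels(3, ["c1", "c2", "c3"]))  # ['c1', 'c2', 'c3']
```

Besides avoiding a redundant check per child, this places the assertion next to the allocation whose size it guards, which is where a failure would be most informative.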