Re: Asymmetric partition-wise JOIN
On 15/10/2023 13:25, Alexander Korotkov wrote: Great! I'm looking forward to the revised patch. Revising the code and opinions before restarting this work, I found two different possible strategies mentioned in the thread: 1. 'Common Resources' shares the materialised result of the inner table scan (a hash table in the case of HashJoin) to join each partition one by one. It gives us a profit in the case of parallel append and possibly other cases, like the one shown in the initial message. 2. 'Individual strategies' - By limiting the AJ feature to cases when the JOIN clause contains a partitioning expression, we can push an additional scan clause into each copy of the inner table scan, reduce the number of tuples scanned, and even prune something because of proven zero input. I see the pros and cons of both approaches. The first option is more straightforward, and its outcome is obvious in the case of parallel append. But how can we guarantee the same join type for each join? Why should we ignore the positive effect of different strategies for different partitions? The second strategy is more expensive for the optimiser, especially in the multipartition case. But as I can predict, it is easier to implement and looks more natural for the architecture. What do you think about that? -- regards, Andrei Lepikhov Postgres Professional
Re: Asymmetric partition-wise JOIN
On 18/10/2023 16:59, Ashutosh Bapat wrote: On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov wrote: But the clauses of A parameterized by P will produce different translations for each of the partitions. I think we will need different RelOptInfos (for A) to store these translations. Does the answer above resolved this issue? May be. There are other problematic areas like EvalPlanQual, Rescans, reparameterised paths which can blow up if we use the same RelOptInfo for different scans of the same relation. It will be good to test Yeah, now I got it. It is already the second place where I see some reference to a kind of hidden rule that the rte entry (or RelOptInfo) must correspond to only one plan node. I don't have a quick answer for now - maybe it is a kind of architectural agreement - and I will consider this issue during the development. those. And also A need not be a simple relation; it could be join as well. For a join RelOptInfo, as well as for any subtree, we have the same logic: the pathlist of this subtree is already formed during the previous level of the search and will not be changed. The relid is also used to track the scans at executor level. Since we have so many scans on A, each may be using different plan, we will need different ids for those. I don't understand this sentence. Which way executor uses this index of RelOptInfo ? See Scan::scanrelid -- regards, Andrey Lepikhov Postgres Professional
Re: Asymmetric partition-wise JOIN
On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov wrote: > > > But the clauses of A parameterized by P will produce different > > translations for each of the partitions. I think we will need > > different RelOptInfos (for A) to store these translations. > > Does the answer above resolved this issue? May be. There are other problematic areas like EvalPlanQual, Rescans, reparameterised paths which can blow up if we use the same RelOptInfo for different scans of the same relation. It will be good to test those. And also A need not be a simple relation; it could be join as well. > > > The relid is also used to track the scans at executor level. Since we > > have so many scans on A, each may be using different plan, we will > > need different ids for those. > > I don't understand this sentence. Which way executor uses this index of > RelOptInfo ? See Scan::scanrelid -- Best Wishes, Ashutosh Bapat
Re: Asymmetric partition-wise JOIN
On 17/10/2023 17:09, Ashutosh Bapat wrote: On Tue, Oct 17, 2023 at 2:05 PM Andrei Lepikhov wrote: On 16/10/2023 23:21, Ashutosh Bapat wrote: On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov Whenever I visited this idea, I hit one issue prominently - how would we differentiate different scans of the non-partitioned relation. Normally we do that using different Relids but in this case we wouldn't be able to know the number of such relations involved in the query unless we start planning such a join. It's late to add new base relations and assign them new Relids. Of course I haven't thought hard about it. I haven't looked at the patch to see whether this problem is solved and how. I'm curious, which type of problems do you afraid here? Why we need a range table entry for each scan of non-partitioned relation? Not RTE but RelOptInfo. Using the same example as Alexander Korotkov, let's say A is the nonpartitioned table and P is partitioned table with partitions P1, P2, ... Pn. The partitionwise join would need to compute AP1, AP2, ... APn. Each of these joins may have different properties and thus will require creating paths. In order to save these paths, we need RelOptInfos which are indentified by relids. Let's assume that the relids of these join RelOptInfos are created by union of relid of A and relid of Px (the partition being joined). This is notionally misleading but doable. Ok, now I see your disquiet. In current patch we have built RelOptInfo for each JOIN(A, Pi) by the build_child_join_rel() routine. And of course, they all have different sets of cheapest paths (it is one more point of optimality). At this point the RelOptInfo of relation A is fully formed and upper joins use the pathlist "as is", without changes. But the clauses of A parameterized by P will produce different translations for each of the partitions. I think we will need different RelOptInfos (for A) to store these translations. Does the answer above resolved this issue? The relid is also used to track the scans at executor level. Since we have so many scans on A, each may be using different plan, we will need different ids for those. I don't understand this sentence. Which way executor uses this index of RelOptInfo ? -- regards, Andrey Lepikhov Postgres Professional
Re: Asymmetric partition-wise JOIN
On Tue, Oct 17, 2023 at 2:05 PM Andrei Lepikhov wrote: > > On 16/10/2023 23:21, Ashutosh Bapat wrote: > > On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov > > Whenever I visited this idea, I hit one issue prominently - how would > > we differentiate different scans of the non-partitioned relation. > > Normally we do that using different Relids but in this case we > > wouldn't be able to know the number of such relations involved in the > > query unless we start planning such a join. It's late to add new base > > relations and assign them new Relids. Of course I haven't thought hard > > about it. I haven't looked at the patch to see whether this problem is > > solved and how. > > > I'm curious, which type of problems do you afraid here? Why we need a > range table entry for each scan of non-partitioned relation? > Not RTE but RelOptInfo. Using the same example as Alexander Korotkov, let's say A is the nonpartitioned table and P is partitioned table with partitions P1, P2, ... Pn. The partitionwise join would need to compute AP1, AP2, ... APn. Each of these joins may have different properties and thus will require creating paths. In order to save these paths, we need RelOptInfos which are indentified by relids. Let's assume that the relids of these join RelOptInfos are created by union of relid of A and relid of Px (the partition being joined). This is notionally misleading but doable. But the clauses of A parameterized by P will produce different translations for each of the partitions. I think we will need different RelOptInfos (for A) to store these translations. The relid is also used to track the scans at executor level. Since we have so many scans on A, each may be using different plan, we will need different ids for those. But if you have developed a way to use a single RelOptInfo of A to do all this, may be we don't need all this. Will take a look at your next version of patch. -- Best Wishes, Ashutosh Bapat
Re: Asymmetric partition-wise JOIN
On 16/10/2023 23:21, Ashutosh Bapat wrote: On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov Whenever I visited this idea, I hit one issue prominently - how would we differentiate different scans of the non-partitioned relation. Normally we do that using different Relids but in this case we wouldn't be able to know the number of such relations involved in the query unless we start planning such a join. It's late to add new base relations and assign them new Relids. Of course I haven't thought hard about it. I haven't looked at the patch to see whether this problem is solved and how. I'm curious, which type of problems do you afraid here? Why we need a range table entry for each scan of non-partitioned relation? -- regards, Andrey Lepikhov Postgres Professional
Re: Asymmetric partition-wise JOIN
On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov wrote: > > > > > Great! I'm looking forward to the revised patch > Before preparing a new patch, it would be better to find the common > ground in the next issue: > So far, this optimization stays aside, proposing an alternative path for > a join RelOptInfo if we have an underlying append path in the outer. > My back burner is redesigning the approach: asymmetric join doesn't > change the partitioning scheme and bounds of the partitioned side. So, > it looks consecutive to make it a part of partitionwise_join machinery > and implement it as a part of the try_partitionwise_join / > generate_partitionwise_join_paths routines. > I think we need an example where such a join will be faster than non-partitioned join when both the sides are local. It might be possible to come up with such an example without writing any code. The idea would be to rewrite SQL as union of joins. Whenever I visited this idea, I hit one issue prominently - how would we differentiate different scans of the non-partitioned relation. Normally we do that using different Relids but in this case we wouldn't be able to know the number of such relations involved in the query unless we start planning such a join. It's late to add new base relations and assign them new Relids. Of course I haven't thought hard about it. I haven't looked at the patch to see whether this problem is solved and how. -- Best Wishes, Ashutosh Bapat
Re: Asymmetric partition-wise JOIN
On 15/10/2023 17:25, Alexander Korotkov wrote: On Sun, Oct 15, 2023 at 8:40 AM Andrei Lepikhov wrote: Thanks for such detailed feedback! The rationale for this patch was to give the optimizer additional ways to push down more joins into foreign servers. And, because of asynchronous append, the benefit of that optimization was obvious. Unfortunately, we hadn't found other applications for this feature, which was why this patch was postponed in the core. You have brought new ideas about applying this idea locally. Moreover, the main issue of the patch was massive memory consumption in the case of many joins and partitions - because of reparameterization. But now, postponing the reparameterization proposed in the thread [1] resolves that problem and gives some insights into the reparameterization technique of some fields, like lateral references. Hence, I think we can restart this work. The first thing here (after rebase, of course) is to figure out and implement in the cost model cases of effectiveness when asymmetric join would give significant performance. Great! I'm looking forward to the revised patch Before preparing a new patch, it would be better to find the common ground in the next issue: So far, this optimization stays aside, proposing an alternative path for a join RelOptInfo if we have an underlying append path in the outer. My back burner is redesigning the approach: asymmetric join doesn't change the partitioning scheme and bounds of the partitioned side. So, it looks consecutive to make it a part of partitionwise_join machinery and implement it as a part of the try_partitionwise_join / generate_partitionwise_join_paths routines. -- regards, Andrey Lepikhov Postgres Professional
Re: Asymmetric partition-wise JOIN
On Sun, Oct 15, 2023 at 8:40 AM Andrei Lepikhov wrote: > Thanks for such detailed feedback! > The rationale for this patch was to give the optimizer additional ways > to push down more joins into foreign servers. And, because of > asynchronous append, the benefit of that optimization was obvious. > Unfortunately, we hadn't found other applications for this feature, > which was why this patch was postponed in the core. > You have brought new ideas about applying this idea locally. Moreover, > the main issue of the patch was massive memory consumption in the case > of many joins and partitions - because of reparameterization. But now, > postponing the reparameterization proposed in the thread [1] resolves > that problem and gives some insights into the reparameterization > technique of some fields, like lateral references. > Hence, I think we can restart this work. > The first thing here (after rebase, of course) is to figure out and > implement in the cost model cases of effectiveness when asymmetric join > would give significant performance. Great! I'm looking forward to the revised patch. -- Regards, Alexander Korotkov
Re: Asymmetric partition-wise JOIN
On 15/10/2023 07:18, Alexander Korotkov wrote: Hi Alexander, Hi Andrey, Thank you for your work on this subject. On Mon, Jan 17, 2022 at 1:42 PM Alexander Pyhalov wrote: The patch does not longer apply cleanly, so I rebased it. Attaching rebased version. Not surprising that the patch doesn't apply after 1.5 years since the last message. Could you please rebase it? I read the thread and the patch. The patch improves the joining of partitioned tables with non-partitioned relations. Let's denote non-partitioned relation as A, partitions as P1 ... PN. The patch allows to Append(Join(A, P1), ... Join(A, PN) instead of Join(A, Append(P1, ... PN). That could be cheaper because it's generally cheaper to join small pieces rather than do one big join. The drawback is the need to scan A multiple times. But is this really necessary and acceptable? Let's consider multiple options. 1) A is non-table. For instance, A is a function scan. In this case, doing multiple scans of A is not just expensive, but could lead to unexpected side effects. When the user includes a function once in the FROM clause, she expects this function to be evaluated once. I propose that we should materialize a scan of non-table relations. So, materialized representation will be scanned multiple times, but the source only scanned once. That would be similar to CTE. 2) A is the table to be scanned with the parametrized path in the inner part of the nested loop join. In this case, there is no big scan of A and nothing to materialize. 3) A is the table to be used in merge join or outer part of nested loop join. In this case, it would be nice to consider materialize. It's not always good to materialize, because materialization has its additional costs. I think that could be a cost-based decision. 4) A is used in the hash join. Could we re-use the hashed representation of A between multiple joins? I read upthread it was proposed to share a hashed table between multiple background workers via shared memory. But the first step would be to just share it between multiple join nodes within the same process. As we consider joining with each partition individually, there could be chosen different join methods. As I get, the current patch considers joining with each of the partitions as a separate isolated optimization task. However, if we share resources between the multiple joins, then rises a need for some global optimization. For instance, a join type could be expensive when applied to an individual partition, but cheap when applied to all the partitions thanks to saving the common work. My idea is to consider generated common resources (such as materialized scans) as a property of the path. For instance, if the nested loop join is cheaper than the hash join, but the hash join generates a common hash map of table A, we don't drop hash join immediately from the consideration and leave it to see how it could help join other partitions. What do you think? Thanks for such detailed feedback! The rationale for this patch was to give the optimizer additional ways to push down more joins into foreign servers. And, because of asynchronous append, the benefit of that optimization was obvious. Unfortunately, we hadn't found other applications for this feature, which was why this patch was postponed in the core. You have brought new ideas about applying this idea locally. Moreover, the main issue of the patch was massive memory consumption in the case of many joins and partitions - because of reparameterization. But now, postponing the reparameterization proposed in the thread [1] resolves that problem and gives some insights into the reparameterization technique of some fields, like lateral references. Hence, I think we can restart this work. The first thing here (after rebase, of course) is to figure out and implement in the cost model cases of effectiveness when asymmetric join would give significant performance. [1] Oversight in reparameterize_path_by_child leading to executor crash https://www.postgresql.org/message-id/flat/CAMbWs496%2BN%3DUAjOc%3DrcD3P7B6oJe4rZw08e_TZRUsWbPxZW3Tw%40mail.gmail.com -- regards, Andrey Lepikhov Postgres Professional
Re: Asymmetric partition-wise JOIN
Hi Alexander, Hi Andrey, Thank you for your work on this subject. On Mon, Jan 17, 2022 at 1:42 PM Alexander Pyhalov wrote: > The patch does not longer apply cleanly, so I rebased it. Attaching > rebased version. Not surprising that the patch doesn't apply after 1.5 years since the last message. Could you please rebase it? I read the thread and the patch. The patch improves the joining of partitioned tables with non-partitioned relations. Let's denote non-partitioned relation as A, partitions as P1 ... PN. The patch allows to Append(Join(A, P1), ... Join(A, PN) instead of Join(A, Append(P1, ... PN). That could be cheaper because it's generally cheaper to join small pieces rather than do one big join. The drawback is the need to scan A multiple times. But is this really necessary and acceptable? Let's consider multiple options. 1) A is non-table. For instance, A is a function scan. In this case, doing multiple scans of A is not just expensive, but could lead to unexpected side effects. When the user includes a function once in the FROM clause, she expects this function to be evaluated once. I propose that we should materialize a scan of non-table relations. So, materialized representation will be scanned multiple times, but the source only scanned once. That would be similar to CTE. 2) A is the table to be scanned with the parametrized path in the inner part of the nested loop join. In this case, there is no big scan of A and nothing to materialize. 3) A is the table to be used in merge join or outer part of nested loop join. In this case, it would be nice to consider materialize. It's not always good to materialize, because materialization has its additional costs. I think that could be a cost-based decision. 4) A is used in the hash join. Could we re-use the hashed representation of A between multiple joins? I read upthread it was proposed to share a hashed table between multiple background workers via shared memory. But the first step would be to just share it between multiple join nodes within the same process. As we consider joining with each partition individually, there could be chosen different join methods. As I get, the current patch considers joining with each of the partitions as a separate isolated optimization task. However, if we share resources between the multiple joins, then rises a need for some global optimization. For instance, a join type could be expensive when applied to an individual partition, but cheap when applied to all the partitions thanks to saving the common work. My idea is to consider generated common resources (such as materialized scans) as a property of the path. For instance, if the nested loop join is cheaper than the hash join, but the hash join generates a common hash map of table A, we don't drop hash join immediately from the consideration and leave it to see how it could help join other partitions. What do you think? -- Regards, Alexander Korotkov
Re: Asymmetric partition-wise JOIN
*root, if (set_join_pathlist_hook) set_join_pathlist_hook(root, joinrel, outerrel, innerrel, jointype, ); + + /* + * 7. If outer relation is delivered from partition-tables, consider + * distributing inner relation to every partition-leaf prior to + * append these leafs. + */ + try_asymmetric_partitionwise_join(root, joinrel, + outerrel, innerrel, + jointype, ); } /* diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c index 9da3ff2f9ab..dadb08ddb06 100644 --- a/src/backend/optimizer/path/joinrels.c +++ b/src/backend/optimizer/path/joinrels.c @@ -16,6 +16,7 @@ #include "miscadmin.h" #include "optimizer/appendinfo.h" +#include "optimizer/cost.h" #include "optimizer/joininfo.h" #include "optimizer/pathnode.h" #include "optimizer/paths.h" @@ -1552,6 +1553,192 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2, } } +/* + * Build RelOptInfo on JOIN of each partition of the outer relation and the inner + * relation. Return List of such RelOptInfo's. Return NIL, if at least one of + * these JOINs is impossible to build. + */ +static List * +extract_asymmetric_partitionwise_subjoin(PlannerInfo *root, + RelOptInfo *joinrel, + AppendPath *append_path, + RelOptInfo *inner_rel, + JoinType jointype, + JoinPathExtraData *extra) +{ + List *result = NIL; + ListCell *lc; + + foreach (lc, append_path->subpaths) + { + Path *child_path = lfirst(lc); + RelOptInfo *child_rel = child_path->parent; + Relids child_joinrelids; + Relids parent_relids; + RelOptInfo *child_joinrel; + SpecialJoinInfo *child_sjinfo; + List *child_restrictlist; + + child_joinrelids = bms_union(child_rel->relids, inner_rel->relids); + parent_relids = bms_union(append_path->path.parent->relids, + inner_rel->relids); + + child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo, + child_rel->relids, + inner_rel->relids); + child_restrictlist = (List *) + adjust_appendrel_attrs_multilevel(root, (Node *)extra->restrictlist, + child_joinrelids, parent_relids); + + child_joinrel = find_join_rel(root, child_joinrelids); + if (!child_joinrel) + child_joinrel = build_child_join_rel(root, + child_rel, + inner_rel, + joinrel, + child_restrictlist, + child_sjinfo, + jointype); + else + { + /* + * The join relation already exists. For example, it could happen if + * we join two plane tables with partitioned table(s). + * Populating this join with additional paths could push out some + * previously added paths which could be pointed in a subplans list + * of an higher level append. + * Of course, we could save such paths before generating new. But it + * can increase too much the number of paths in complex queries. It + * can be a task for future work. + */ + return NIL; + } + + populate_joinrel_with_paths(root, + child_rel, + inner_rel, + child_joinrel, + child_sjinfo, + child_restrictlist); + + /* Give up if asymmetric partition-wise join is not available */ + if (child_joinrel->pathlist == NIL) + return NIL; + + set_cheapest(child_joinrel); + result = lappend(result, child_joinrel); + } + return result; +} + +static bool +is_asymmetric_join_feasible(PlannerInfo *root, + RelOptInfo *outer_rel, + RelOptInfo *inner_rel, + JoinType jointype) +{ + ListCell *lc; + + if (jointype != JOIN_INNER && jointype != JOIN_LEFT) + return false; + + if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel)) + return false; + + /* Disallow recursive usage of asymmertic join machinery */ + if (root->join_rel_level == NULL) + return false; + + /* + * Don't allow asymmetric JOIN of two append subplans. + * In the case of a parameterized NL join, a reparameterization procedure + * will lead to large memory allocations and a CPU consumption: + * each reparameterization will induce subpath duplication, creating new + * ParamPathInfo instance and increasing of ppilist up to number of + * partitions in the inner. Also, if we have many partitions, each bitmapset + * variable will be large and many leaks of such variable (caused by relid + * replacement) will highly increase memory consumption. + * So, we deny such paths for now. + */ + foreach(lc, inner_rel->pathlist) + { + if (IsA(lfirst(lc), AppendPath)) + return false; + } + + return true; +} + +void +try_asymmetric_partitionwise_join(PlannerInfo *root, + RelOptInfo *joinrel, + RelOptInfo *outer_rel, + RelOptInfo *inner_rel, + JoinType jointype, + JoinPathExtraData *extra) +{ + ListCell *lc; + + /* + * Try this kind of paths if we allow complex partitionwi
Re: Asymmetric partition-wise JOIN
t;relids, inner_rel->relids); + parent_relids = bms_union(append_path->path.parent->relids, + inner_rel->relids); + + child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo, + child_rel->relids, + inner_rel->relids); + child_restrictlist = (List *) + adjust_appendrel_attrs_multilevel(root, (Node *)extra->restrictlist, + child_joinrelids, parent_relids); + + child_joinrel = find_join_rel(root, child_joinrelids); + if (!child_joinrel) + child_joinrel = build_child_join_rel(root, + child_rel, + inner_rel, + joinrel, + child_restrictlist, + child_sjinfo, + jointype); + else + { + /* +* The join relation already exists. For example, it could happen if +* we join two plane tables with partitioned table(s). +* Populating this join with additional paths could push out some +* previously added paths which could be pointed in a subplans list +* of an higher level append. +* Of course, we could save such paths before generating new. But it +* can increase too much the number of paths in complex queries. It +* can be a task for future work. +*/ + return NIL; + } + + populate_joinrel_with_paths(root, + child_rel, + inner_rel, + child_joinrel, + child_sjinfo, + child_restrictlist); + + /* Give up if asymmetric partition-wise join is not available */ + if (child_joinrel->pathlist == NIL) + return NIL; + + set_cheapest(child_joinrel); + result = lappend(result, child_joinrel); + } + return result; +} + +static bool +is_asymmetric_join_feasible(PlannerInfo *root, + RelOptInfo *outer_rel, + RelOptInfo *inner_rel, + JoinType jointype) +{ + ListCell *lc; + + if (jointype != JOIN_INNER && jointype != JOIN_LEFT) + return false; + + if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel)) + return false; + + /* Disallow recursive usage of asymmertic join machinery */ + if (root->join_rel_level == NULL) + return false; + + /* +* Don't allow asymmetric JOIN of two append subplans. +* In the case of a parameterized NL join, a reparameterization procedure +* will lead to large memory allocations and a CPU consumption: +* each reparameterization will induce subpath duplication, creating new +* ParamPathInfo instance and increasing of ppilist up to number of +* partitions in the inner. Also, if we have many partitions, each bitmapset +* variable will be large and many leaks of such variable (caused by relid +* replacement) will highly increase memory consumption. +* So, we deny such paths for now. +*/ + foreach(lc, inner_rel->pathlist) + { + if (IsA(lfirst(lc), AppendPath)) + return false; + } + + return true; +} + +void +try_asymmetric_partitionwise_join(PlannerInfo *root, + RelOptInfo *joinrel, +
Re: Asymmetric partition-wise JOIN
d_restrictlist = (List *) + adjust_appendrel_attrs_multilevel(root, (Node *)extra->restrictlist, + child_joinrelids, parent_relids); + + child_joinrel = find_join_rel(root, child_joinrelids); + if (!child_joinrel) + child_joinrel = build_child_join_rel(root, + child_rel, + inner_rel, + joinrel, + child_restrictlist, + child_sjinfo, + jointype); + else + { + /* + * The join relation already exists. For example, it could happen if + * we join two plane tables with partitioned table(s). + * Populating this join with additional paths could push out some + * previously added paths which could be pointed in a subplans list + * of an higher level append. + * Of course, we could save such paths before generating new. But it + * can increase too much the number of paths in complex queries. It + * can be a task for future work. + */ + return NIL; + } + + populate_joinrel_with_paths(root, + child_rel, + inner_rel, + child_joinrel, + child_sjinfo, + child_restrictlist); + + /* Give up if asymmetric partition-wise join is not available */ + if (child_joinrel->pathlist == NIL) + return NIL; + + set_cheapest(child_joinrel); + result = lappend(result, child_joinrel); + } + return result; +} + +static bool +is_asymmetric_join_feasible(PlannerInfo *root, + RelOptInfo *outer_rel, + RelOptInfo *inner_rel, + JoinType jointype) +{ + ListCell *lc; + + if (jointype != JOIN_INNER && jointype != JOIN_LEFT) + return false; + + if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel)) + return false; + + /* Disallow recursive usage of asymmertic join machinery */ + if (root->join_rel_level == NULL) + return false; + + /* + * Don't allow asymmetric JOIN of two append subplans. + * In the case of a parameterized NL join, a reparameterization procedure + * will lead to large memory allocations and a CPU consumption: + * each reparameterization will induce subpath duplication, creating new + * ParamPathInfo instance and increasing of ppilist up to number of + * partitions in the inner. Also, if we have many partitions, each bitmapset + * variable will be large and many leaks of such variable (caused by relid + * replacement) will highly increase memory consumption. + * So, we deny such paths for now. + */ + foreach(lc, inner_rel->pathlist) + { + if (IsA(lfirst(lc), AppendPath)) + return false; + } + + return true; +} + +void +try_asymmetric_partitionwise_join(PlannerInfo *root, + RelOptInfo *joinrel, + RelOptInfo *outer_rel, + RelOptInfo *inner_rel, + JoinType jointype, + JoinPathExtraData *extra) +{ + ListCell *lc; + + /* + * Try this kind of paths if we allow complex partitionwise joins and we know + * we can build this join safely. + */ + if (!enable_partitionwise_join || + !is_asymmetric_join_feasible(root, outer_rel, inner_rel, jointype)) + return; + + foreach (lc, outer_rel->pathlist) + { + AppendPath *append_path = lfirst(lc); + + /* + * We assume this pathlist keeps at least one AppendPath that + * represents partitioned table-scan, symmetric or asymmetric + * partition-wise join. Asymmetric join isn't needed if the append node + * has only one child. + */ + if (IsA(append_path, AppendPath) && + list_length(append_path->subpaths) > 1) + { + List **join_rel_level_saved; + List *live_childrels = NIL; + + join_rel_level_saved = root->join_rel_level; + PG_TRY(); + { +/* temporary disables "dynamic programming" algorithm */ +root->join_rel_level = NULL; + +live_childrels = + extract_asymmetric_partitionwise_subjoin(root, + joinrel, + append_path, + inner_rel, + jointype, + extra); + } + PG_FINALLY(); + { +root->join_rel_level = join_rel_level_saved; + } + PG_END_TRY(); + + if (live_childrels != NIL) + { +/* + * Add new append relation. We must choose cheapest paths after + * this operation because the pathlist possibly contains + * joinrels and appendrels that can be suboptimal. + */ +add_paths_to_append_rel(root, joinrel, live_childrels); +set_cheapest(joinrel); + } + + break; + } + } +} + /* * Construct the SpecialJoinInfo for a child-join by translating * SpecialJoinInfo for the join between parents. left_relids and right_relids diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c index e50624c465..fccc0685d7 100644 --- a/src/backend/optimizer/plan/setrefs.c +++ b/src/backend/optimizer/plan/setrefs.c @@ -281,24 +281,29 @@ set_plan_references(PlannerInfo *root, Plan *plan) /* * Adjust RT indexes of AppendRelInfos and add to final appendrels list. - * We assume the AppendRelInfos were built during planning and don't
Re: Asymmetric partition-wise JOIN
On Thu, Sep 09, 2021 at 09:50:46AM +, Aleksander Alekseev wrote: > It looks like this patch needs to be updated. According to > http://cfbot.cputube.org/ it applies but doesn't pass any tests. Changing the > status to save time for reviewers. > > The new status of this patch is: Waiting on Author Just to give some more info to work on I found this patch made postgres crash with a segmentation fault. """ Program terminated with signal SIGSEGV, Segmentation fault. #0 0x556e37ef1b55 in bms_equal (a=0x7f6e37a9c5b0, b=0x7f6e37a9c5b0) at bitmapset.c:126 126 if (shorter->words[i] != longer->words[i]) """ attached are the query that triggers the crash and the backtrace. -- Jaime Casanova Director de Servicios Profesionales SystemGuards - Consultores de PostgreSQL bug.sql Description: application/sql #0 0x556e37ef1b55 in bms_equal (a=0x7f6e37a9c5b0, b=0x7f6e37a9c5b0) at bitmapset.c:126 shorter = 0x7f6e37a9c5b0 longer = 0x7f6e37a9c5b0 shortlen = 2139062143 longlen = 32622 i = 138057 #1 0x556e37fc9f5d in find_param_path_info (rel=0x556e3944c490, required_outer=0x7f6e37a9c5b0) at relnode.c:1634 ppi = 0x7f6e37a9cc08 lc__state = {l = 0x7f6e37a9cc40, i = 0} lc = 0x7f6e37a5f898 #2 0x556e37fbeaeb in reparameterize_path_by_child (root=0x556e394340e0, path=0x7f6e37a9c6c8, child_rel=0x7f6e37a77ac0) at pathnode.c:4205 new_path = 0x7f6e37a89408 new_ppi = 0x7f6e37a77d30 old_ppi = 0x7f6e37a9cc08 required_outer = 0x7f6e37a9c5b0 #3 0x556e37f66545 in try_nestloop_path (root=0x556e394340e0, joinrel=0x7f6e37a88a80, outer_path=0x7f6e37a788e8, inner_path=0x7f6e37a9c6c8, pathkeys=0x0, jointype=JOIN_LEFT, extra=0x7ffde36f3e50) at joinpath.c:662 required_outer = 0x0 workspace = {startup_cost = 46733.4674, total_cost = 1195882.0625, run_cost = 1149148.595, inner_run_cost = 0.020375816993464052, inner_rescan_run_cost = 0.020375816993464052, outer_rows = 6.9224208015025731e-310, inner_rows = 6.9224208087218603e-310, outer_skip_rows = 6.9224208013235237e-310, inner_skip_rows = 4.640852265067508e-310, numbuckets = 960708832, numbatches = 21870, inner_rows_total = 6.9224208047606396e-310} innerrel = 0x556e3944c490 outerrel = 0x7f6e37a77ac0 innerrelids = 0x556e3944c2c8 outerrelids = 0x7f6e37a77d30 inner_paramrels = 0x7f6e37a9c5b0 outer_paramrels = 0x0 #4 0x556e37f67bf9 in match_unsorted_outer (root=0x556e394340e0, joinrel=0x7f6e37a88a80, outerrel=0x7f6e37a77ac0, innerrel=0x556e3944c490, jointype=JOIN_LEFT, extra=0x7ffde36f3e50) at joinpath.c:1702 innerpath = 0x7f6e37a9c6c8 mpath = 0x0 lc2__state = {l = 0x7f6e37a9d0a0, i = 1} lc2 = 0x7f6e37a9d0c0 outerpath = 0x7f6e37a788e8 merge_pathkeys = 0x0 lc1__state = {l = 0x7f6e37a786c8, i = 0} save_jointype = JOIN_LEFT nestjoinOK = true useallclauses = false inner_cheapest_total = 0x7f6e37a9c3b0 matpath = 0x7f6e37a89370 lc1 = 0x7f6e37a786e0 __func__ = "match_unsorted_outer" #5 0x556e37f65c14 in add_paths_to_joinrel (root=0x556e394340e0, joinrel=0x7f6e37a88a80, outerrel=0x7f6e37a77ac0, innerrel=0x556e3944c490, jointype=JOIN_LEFT, sjinfo=0x7f6e37a88630, restrictlist=0x7f6e37a88a28) at joinpath.c:291 extra = {restrictlist = 0x7f6e37a88a28, mergeclause_list = 0x7f6e37a88e28, inner_unique = true, sjinfo = 0x7f6e37a88630, semifactors = {outer_match_frac = 0.00039215686274509802, match_count = 2550}, param_source_rels = 0x0} mergejoin_allowed = true lc = 0x0 joinrelids = 0x7f6e37a886e8 #6 0x556e37f6a3d9 in populate_joinrel_with_paths (root=0x556e394340e0, rel1=0x7f6e37a77ac0, rel2=0x556e3944c490, joinrel=0x7f6e37a88a80, sjinfo=0x7f6e37a88630, restrictlist=0x7f6e37a88a28) at joinrels.c:825 __func__ = "populate_joinrel_with_paths" #7 0x556e37f6bca5 in extract_asymmetric_partitionwise_subjoin (root=0x556e394340e0, joinrel=0x7f6e37a86bd8, append_path=0x7f6e37a7b6c8, inner_rel=0x556e3944c490, jointype=JOIN_LEFT, extra=0x7ffde36f4150) at joinrels.c:1617 child_path = 0x7f6e37a788e8 child_joinrelids = 0x7f6e37a885e0 child_sjinfo = 0x7f6e37a88630 child_rel = 0x7f6e37a77ac0 parent_relids = 0x7f6e37a88608 child_joinrel = 0x7f6e37a88a80 child_restrictlist = 0x7f6e37a88a28 lc__state = {l = 0x7f6e37a7b8a8, i = 0} result = 0x0 lc = 0x7f6e37a7b8c0 #8 0x556e37f6bfb8 in try_asymmetric_partitionwise_join (root=0x556e394340e0, joinrel=0x7f6e37a86bd8, outer_rel=0x7f6e37a65f68, inner_rel=0x556e3944c490, jointype=JOIN_LEFT, extra=0x7ffde36f4150) at joinrels.c:1713 _save_exception_stack = 0x7ffde36f4b80 _save_context_stack = 0x0 _local_sigjmp_buf = {{__jmpbuf =
Re: Asymmetric partition-wise JOIN
It looks like this patch needs to be updated. According to http://cfbot.cputube.org/ it applies but doesn't pass any tests. Changing the status to save time for reviewers. The new status of this patch is: Waiting on Author
Re: Asymmetric partition-wise JOIN
On Thu, Jul 15, 2021 at 11:32 AM Andrey Lepikhov wrote: > On 5/7/21 23:15, Zhihong Yu wrote: > > On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov > > mailto:a.lepik...@postgrespro.ru>> wrote: > > +* Can't imagine situation when join relation already > > exists. But in > > +* the 'partition_join' regression test it happens. > > +* It may be an indicator of possible problems. > > > > Should a log be added in the above case ? > I worked more on this case and found more serious mistake. During > population of additional paths on the existed RelOptInfo we can remove > some previously generated paths that pointed from a higher-level list of > subplans and it could cause to lost of subplan links. I prohibit such > situation (you can read comments in the new version of the patch). > Also, choosing of a cheapest path after appendrel creation was added. > Unstable tests were fixed. > > -- > regards, > Andrey Lepikhov > Postgres Professional > Patch is failing the regression, can you please take a look at that. partition_join ... FAILED 6328 ms -- Ibrar Ahmed
Re: Asymmetric partition-wise JOIN
+ inner_rel->relids); + + child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo, + child_rel->relids, + inner_rel->relids); + child_restrictlist = (List *) + adjust_appendrel_attrs_multilevel(root, (Node *)extra->restrictlist, + child_joinrelids, parent_relids); + + child_joinrel = find_join_rel(root, child_joinrelids); + if (!child_joinrel) + child_joinrel = build_child_join_rel(root, + child_rel, + inner_rel, + joinrel, + child_restrictlist, + child_sjinfo, + jointype); + else + { + /* +* The join relation already exists. For example, it could happen if +* we join two plane tables with partitioned table(s). +* Populating this join with additional paths could push out some +* previously added paths which could be pointed in a subplans list +* of an higher level append. +* Of course, we could save such paths before generating new. But it +* can increase too much the number of paths in complex queries. It +* can be a task for future work. +*/ + return NIL; + } + + populate_joinrel_with_paths(root, + child_rel, + inner_rel, + child_joinrel, + child_sjinfo, + child_restrictlist); + + /* Give up if asymmetric partition-wise join is not available */ + if (child_joinrel->pathlist == NIL) + return NIL; + + set_cheapest(child_joinrel); + result = lappend(result, child_joinrel); + } + return result; +} + +static bool +is_asymmetric_join_feasible(PlannerInfo *root, + RelOptInfo *outer_rel, + RelOptInfo *inner_rel, + JoinType jointype) +{ + ListCell *lc; + + if (jointype != JOIN_INNER && jointype != JOIN_LEFT) + return false; + + if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel)) + return false; + + /* Disallow recursive usage of asymmertic join machinery */ + if (root->join_rel_level == NULL) + return false; + + /* +* Don't allow asymmetric JOIN of two append subplans. +* In the case of a parameterized NL join, a reparameterization procedure +* will lead to large memory allocations and a CPU consumption: +* each reparameterization will induce subpath duplication, creating new +* ParamPathInfo instance and increasing of ppilist up to number of +* partitions in the inner. Also, if we have many partitions, each bitmapset +* variable will be large and many leaks of such variable (caused by relid +* replacement) will highly increase memory consumption. +* So, we deny such paths for now. +*/ + foreach(lc, inner_rel->pathlist) + { + if (IsA(lfirst(lc), AppendPath)) + return false; + } + + return true; +} + +void +try_asymmetric_partitionwise_join(PlannerInfo *root, + RelOptInfo *joinrel, + RelOptInfo *outer_rel, + RelOptInfo *inner_rel, +
Re: Asymmetric partition-wise JOIN
Andrey Lepikhov писал 2021-07-06 12:28: On 5/7/21 23:15, Zhihong Yu wrote: On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov mailto:a.lepik...@postgrespro.ru>> wrote: + * Can't imagine situation when join relation already exists. But in + * the 'partition_join' regression test it happens. + * It may be an indicator of possible problems. Should a log be added in the above case ? I made additional analysis of this branch of code. This situation can happen in the case of one child or if we join two plane tables with partitioned. Both situations are legal and I think we don't needed to add any log message here. Other mistakes were fixed. Hi. Small typo in comment in src/backend/optimizer/plan/setrefs.c: 281 282 /* 283 * Adjust RT indexes of AppendRelInfos and add to final appendrels list. 284 * The AppendRelInfos are copied, because as a part of a subplan its could 285 * be visited many times in the case of asymmetric join. 286 */ 287 foreach(lc, root->append_rel_list) 288 { its -> it (or they) ? -- Best regards, Alexander Pyhalov, Postgres Professional
Re: Asymmetric partition-wise JOIN
child_rel->relids, + inner_rel->relids); + child_restrictlist = (List *) + adjust_appendrel_attrs_multilevel(root, (Node *)extra->restrictlist, + child_joinrelids, parent_relids); + + child_joinrel = find_join_rel(root, child_joinrelids); + if (!child_joinrel) + child_joinrel = build_child_join_rel(root, + child_rel, + inner_rel, + joinrel, + child_restrictlist, + child_sjinfo, + jointype); + else + { + /* +* The join relation already exists. For example, it could happen if +* we join two plane tables with partitioned table(s). Do nothing. +*/ + } + + populate_joinrel_with_paths(root, + child_rel, + inner_rel, + child_joinrel, + child_sjinfo, + child_restrictlist); + + /* Give up if asymmetric partition-wise join is not available */ + if (child_joinrel->pathlist == NIL) + return NIL; + + set_cheapest(child_joinrel); + result = lappend(result, child_joinrel); + } + return result; +} + +static bool +is_asymmetric_join_feasible(PlannerInfo *root, + RelOptInfo *outer_rel, + RelOptInfo *inner_rel, + JoinType jointype) +{ + ListCell *lc; + + if (jointype != JOIN_INNER && jointype != JOIN_LEFT) + return false; + + if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel)) + return false; + + /* Disallow recursive usage of asymmertic join machinery */ + if (root->join_rel_level == NULL) + return false; + + /* +* Don't allow asymmetric JOIN of two append subplans. +* In the case of a parameterized NL join, a reparameterization procedure +* will lead to large memory allocations and a CPU consumption: +* each reparameterization will induce subpath duplication, creating new +* ParamPathInfo instance and increasing of ppilist up to number of +* partitions in the inner. Also, if we have many partitions, each bitmapset +* variable will be large and many leaks of such variable (caused by relid +* replacement) will highly increase memory consumption. +* So, we deny such paths for now. +*/ + foreach(lc, inner_rel->pathlist) + { + if (IsA(lfirst(lc), AppendPath)) + return false; + } + + return true; +} + +void +try_asymmetric_partitionwise_join(PlannerInfo *root, + RelOptInfo *joinrel, + RelOptInfo *outer_rel, + RelOptInfo *inner_rel, + JoinType jointype, + JoinPathExtraData *extra) +{ + ListCell *lc; + + /* +* Try this kind of paths if we allow complex partitionwise joins and we know +* we can build this join safely. +*/ + if (!enable_partitionwise_join || + !is_asymmetric_join_feasible(root, outer_rel, inner_rel, jointype)) + return; + + foreach (lc, outer_rel->pathlist) + { + AppendPath *append_path = lfirst(lc); + + /* +* We assume this pathlist keeps at least one AppendPath that +* r
Re: Asymmetric partition-wise JOIN
On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov wrote: > On 18/6/21 15:02, Alexander Pyhalov wrote: > > Andrey Lepikhov писал 2021-05-27 07:27: > >> Next version of the patch. > >> For searching any problems I forced this patch during 'make check' > >> tests. Some bugs were found and fixed. > > > > Hi. > > I've tested this patch and haven't found issues, but I have some > comments. > Thank you for review! > > Variable names - child_join_rel and child_join_relids seem to be > > inconsistent with rest of the file > fixed > > > When build_child_join_rel() can return NULL? > Fixed > > Missing word: > > each bitmapset variable will large => each bitmapset variable will be > large > Fixed > > What hook do you refer to? > Removed> You've changed function to copy appinfo, so now comment is > incorrect. > Thanks, fixed> Do I understand correctly that now parent_relids also > contains relids of > > relations from 'global' inner relation, which we join to childs? > Yes > > -- > regards, > Andrey Lepikhov > Postgres Professional > Hi, relations because it could cause CPU and memory huge consumption during reparameterization of NestLoop path. CPU and memory huge consumption -> huge consumption of CPU and memory + * relation. Return List of such RelOptInfo's. Return NIL, if at least one of + * these JOINs are impossible to build. at least one of these JOINs are impossible to build. -> at least one of these JOINs is impossible to build. +* Can't imagine situation when join relation already exists. But in +* the 'partition_join' regression test it happens. +* It may be an indicator of possible problems. Should a log be added in the above case ? +is_asymmetric_join_capable(PlannerInfo *root, is_asymmetric_join_capable -> is_asymmetric_join_feasible Cheers
Re: Asymmetric partition-wise JOIN
lids); + parent_relids = bms_union(append_path->path.parent->relids, + inner_rel->relids); + + child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo, + child_rel->relids, + inner_rel->relids); + child_restrictlist = (List *) + adjust_appendrel_attrs_multilevel(root, (Node *)extra->restrictlist, + child_joinrelids, parent_relids); + + child_joinrel = find_join_rel(root, child_joinrelids); + if (!child_joinrel) + child_joinrel = build_child_join_rel(root, + child_rel, + inner_rel, + joinrel, + child_restrictlist, + child_sjinfo, + jointype); + else + { + /* +* TODO: +* Can't imagine situation when join relation already exists. But in +* the 'partition_join' regression test it happens. +* It may be an indicator of possible problems. +*/ + } + + populate_joinrel_with_paths(root, + child_rel, + inner_rel, + child_joinrel, + child_sjinfo, + child_restrictlist); + + /* Give up if asymmetric partition-wise join is not available */ + if (child_joinrel->pathlist == NIL) + return NIL; + + set_cheapest(child_joinrel); + result = lappend(result, child_joinrel); + } + return result; +} + +static bool +is_asymmetric_join_capable(PlannerInfo *root, + RelOptInfo *outer_rel, + RelOptInfo *inner_rel, + JoinType jointype) +{ + ListCell *lc; + + if (jointype != JOIN_INNER && jointype != JOIN_LEFT) + return false; + + if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel)) + return false; + + /* Disallow recursive usage of asymmertic join machinery */ + if (root->join_rel_level == NULL) + return false; + + /* +* Don't allow asymmetric JOIN of two append subplans. +* In the case of a parameterized NL join, a reparameterization procedure +* will lead to large memory allocations and a CPU consumption: +* each reparameterization will induce subpath duplication, creating new +* ParamPathInfo instance and increasing of ppilist up to number of +* partitions in the inner. Also, if we have many partitions, each bitmapset +* variable will be large and many leaks of such variable (caused by relid +* replacement) will highly increase memory consumption. +* So, we deny such paths for now. +*/ + foreach(lc, inner_rel->pathlist) + { + if (IsA(lfirst(lc), AppendPath)) + return false; + } + + return true; +} + +void +try_asymmetric_partitionwise_join(PlannerInfo *root, + RelOptInfo *joinrel, + RelOptInfo *outer_rel, + RelOptInfo *inner_rel, + JoinType jointype, + JoinPathExtraData *extra) +{ + ListCell *lc; + + /* +* Try this kind of paths if we allow complex partitionwise joins and we know +* we can build this join safely. +*/ + if (!enable_partitionwise_join || + !is_asymmetric_
Re: Asymmetric partition-wise JOIN
Andrey Lepikhov писал 2021-05-27 07:27: Next version of the patch. For searching any problems I forced this patch during 'make check' tests. Some bugs were found and fixed. Hi. I've tested this patch and haven't found issues, but I have some comments. src/backend/optimizer/path/joinrels.c: 1554 1555 /* 1556 * Build RelOptInfo on JOIN of each partition of the outer relation and the inner 1557 * relation. Return List of such RelOptInfo's. Return NIL, if at least one of 1558 * these JOINs are impossible to build. 1559 */ 1560 static List * 1561 extract_asymmetric_partitionwise_subjoin(PlannerInfo *root, 1562 RelOptInfo *joinrel, 1563 AppendPath *append_path, 1564 RelOptInfo *inner_rel, 1565 JoinType jointype, 1566 JoinPathExtraData *extra) 1567 { 1568 List*result = NIL; 1569 ListCell*lc; 1570 1571 foreach (lc, append_path->subpaths) 1572 { 1573 Path*child_path = lfirst(lc); 1574 RelOptInfo *child_rel = child_path->parent; 1575 Relids child_join_relids; 1576 Relids parent_relids; 1577 RelOptInfo *child_join_rel; 1578 SpecialJoinInfo *child_sjinfo; 1579 List*child_restrictlist; Variable names - child_join_rel and child_join_relids seem to be inconsistent with rest of the file (I see child_joinrelids in try_partitionwise_join() and child_joinrel in try_partitionwise_join() and get_matching_part_pairs()). 1595 child_join_rel = build_child_join_rel(root, 1596 child_rel, 1597 inner_rel, 1598 joinrel, 1599 child_restrictlist, 1600 child_sjinfo, 1601 jointype); 1602 if (!child_join_rel) 1603 { 1604 /* 1605 * If can't build JOIN between inner relation and one of the outer 1606 * partitions - return immediately. 1607 */ 1608 return NIL; 1609 } When build_child_join_rel() can return NULL? If I read code correctly, joinrel is created in the begining of build_child_join_rel() with makeNode(), makeNode() wraps newNode() and newNode() uses MemoryContextAllocZero()/MemoryContextAllocZeroAligned(), which would error() on alloc() failure. 1637 1638 static bool 1639 is_asymmetric_join_capable(PlannerInfo *root, 1640RelOptInfo *outer_rel, 1641RelOptInfo *inner_rel, 1642JoinType jointype) 1643 { Function misses a comment. 1656 /* 1657 * Don't allow asymmetric JOIN of two append subplans. 1658 * In the case of a parameterized NL join, a reparameterization procedure will 1659 * lead to large memory allocations and a CPU consumption: 1660 * each reparameterize will induce subpath duplication, creating new 1661 * ParamPathInfo instance and increasing of ppilist up to number of partitions 1662 * in the inner. Also, if we have many partitions, each bitmapset 1663 * variable will large and many leaks of such variable (caused by relid 1664 * replacement) will highly increase memory consumption. 1665 * So, we deny such paths for now. 1666 */ Missing word: each bitmapset variable will large => each bitmapset variable will be large 1694 foreach (lc, outer_rel->pathlist) 1695 { 1696 AppendPath *append_path = lfirst(lc); 1697 1698 /* 1699 * MEMO: We assume this pathlist keeps at least one AppendPath that 1700 * represents partitioned table-scan, symmetric or
Re: Asymmetric partition-wise JOIN
child_join_rel = find_join_rel(root, child_join_relids); + if (!child_join_rel) + { + child_join_rel = build_child_join_rel(root, + child_rel, + inner_rel, + joinrel, + child_restrictlist, + child_sjinfo, + jointype); + if (!child_join_rel) + { + /* +* If can't build JOIN between inner relation and one of the outer +* partitions - return immediately. +*/ + return NIL; + } + } + else + { + /* +* TODO: +* Can't imagine situation when join relation already exists. But in +* the 'partition_join' regression test it happens. +* It may be an indicator of possible problems. +*/ + } + + populate_joinrel_with_paths(root, + child_rel, + inner_rel, + child_join_rel, + child_sjinfo, + child_restrictlist); + + /* Give up if asymmetric partition-wise join is not available */ + if (child_join_rel->pathlist == NIL) + return NIL; + + set_cheapest(child_join_rel); + result = lappend(result, child_join_rel); + } + return result; +} + +static bool +is_asymmetric_join_capable(PlannerInfo *root, + RelOptInfo *outer_rel, + RelOptInfo *inner_rel, + JoinType jointype) +{ + ListCell *lc; + + if (jointype != JOIN_INNER && jointype != JOIN_LEFT) + return false; + + if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel)) + return false; + + /* Disallow recursive usage of asymmertic join machinery */ + if (root->join_rel_level == NULL) + return false; + + /* +* Don't allow asymmetric JOIN of two append subplans. +* In the case of a parameterized NL join, a reparameterization procedure will +* lead to large memory allocations and a CPU consumption: +* each reparameterize will induce subpath duplication, creating new +* ParamPathInfo instance and increasing of ppilist up to number of partitions +* in the inner. Also, if we have many partitions, each bitmapset +* variable will large and many leaks of such variable (caused by relid +* replacement) will highly increase memory consumption. +* So, we deny such paths for now. +*/ + foreach(lc, inner_rel->pathlist) + { + if (IsA(lfirst(lc), AppendPath)) + return false; + } + + return true; +} + +void +try_asymmetric_partitionwise_join(PlannerInfo *root, + RelOptInfo *joinrel, + RelOptInfo *outer_rel, + RelOptInfo *inner_rel, + JoinType jointype, + JoinPathExtraData *extra) +{ + ListCell *lc; + + /* +* Try this kind of paths if we allow complex partitionwise joins and we know +* we can build this join safely. +*/ + if (!enable_partitionwise_join || + !is_asymmetric_join_capable(root, outer_rel, inner_rel, jointype)) + return; + + foreach (lc, outer_rel->pathlist) + { + AppendPath *append_path = lfirst(lc); + + /* +* MEMO: We assume this pathlist keeps at least one AppendPath that +
Re: Asymmetric partition-wise JOIN
On 11/30/20 7:43 PM, Anastasia Lubennikova wrote: This entry was inactive during this CF, so I've marked it as returned with feedback. Feel free to resubmit an updated version to a future commitfest. I return the patch to commitfest. My current reason differs from reason of origin author. This patch can open a door for more complex optimizations in the partitionwise join push-down technique. I mean, we can push-down join not only of two partitioned tables with the same partition schema, but a partitioned (sharded) table with an arbitrary subplan that is provable independent of local resources. Example: CREATE TABLE p(a int) PARTITION BY HASH (a); CREATE TABLE p1 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 0); CREATE TABLE p2 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 1); CREATE TABLE p3 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 2); SELECT * FROM p, (SELECT * FROM generate_series(1,2) AS a) AS s WHERE p.a=s.a; Hash Join Hash Cond: (p.a = a.a) -> Append -> Seq Scan on p1 p_1 -> Seq Scan on p2 p_2 -> Seq Scan on p3 p_3 -> Hash -> Function Scan on generate_series a But with asymmetric join feature we have the plan: Append -> Hash Join Hash Cond: (p_1.a = a.a) -> Seq Scan on p1 p_1 -> Hash -> Function Scan on generate_series a -> Hash Join Hash Cond: (p_2.a = a.a) -> Seq Scan on p2 p_2 -> Hash -> Function Scan on generate_series a -> Hash Join Hash Cond: (p_3.a = a.a) -> Seq Scan on p3 p_3 -> Hash -> Function Scan on generate_series a In the case of FDW-sharding it means that if we can prove that the inner relation is independent from the execution server, we can push-down these joins and execute it in parallel. -- regards, Andrey Lepikhov Postgres Professional
Re: Asymmetric partition-wise JOIN
On 11/30/20 7:43 PM, Anastasia Lubennikova wrote: This entry was inactive during this CF, so I've marked it as returned with feedback. Feel free to resubmit an updated version to a future commitfest. Attached version is rebased on current master and fixes problems with complex parameterized plans - 'reparameterize by child' feature. Problems with reparameterization machinery can be demonstrated by TPC-H benchmark. -- regards, Andrey Lepikhov Postgres Professional >From 6a15a52bfb90659c51b3a918d48037c474ffe9dd Mon Sep 17 00:00:00 2001 From: Andrey Lepikhov Date: Fri, 2 Apr 2021 11:02:20 +0500 Subject: [PATCH] Asymmetric partitionwise join. Teach optimizer to consider partitionwise join of non-partitioned table with each partition of partitioned table. This technique cause changes of 'reparameterize by child' machinery. --- src/backend/optimizer/path/joinpath.c| 9 + src/backend/optimizer/path/joinrels.c| 151 ++ src/backend/optimizer/util/appendinfo.c | 28 ++- src/backend/optimizer/util/pathnode.c| 9 +- src/backend/optimizer/util/relnode.c | 14 +- src/include/optimizer/paths.h| 7 +- src/test/regress/expected/partition_join.out | 209 +++ src/test/regress/sql/partition_join.sql | 99 + 8 files changed, 509 insertions(+), 17 deletions(-) diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c index e9b6968b1d..6ba6d32ae4 100644 --- a/src/backend/optimizer/path/joinpath.c +++ b/src/backend/optimizer/path/joinpath.c @@ -335,6 +335,15 @@ add_paths_to_joinrel(PlannerInfo *root, if (set_join_pathlist_hook) set_join_pathlist_hook(root, joinrel, outerrel, innerrel, jointype, ); + + /* + * 7. If outer relation is delivered from partition-tables, consider + * distributing inner relation to every partition-leaf prior to + * append these leafs. + */ + try_asymmetric_partitionwise_join(root, joinrel, + outerrel, innerrel, + jointype, ); } /* diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c index 0dbe2ac726..6f900475bb 100644 --- a/src/backend/optimizer/path/joinrels.c +++ b/src/backend/optimizer/path/joinrels.c @@ -16,6 +16,7 @@ #include "miscadmin.h" #include "optimizer/appendinfo.h" +#include "optimizer/cost.h" #include "optimizer/joininfo.h" #include "optimizer/pathnode.h" #include "optimizer/paths.h" @@ -1551,6 +1552,156 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2, } } +/* + * Build RelOptInfo on JOIN of each partition of the outer relation and the inner + * relation. Return List of such RelOptInfo's. Return NIL, if at least one of + * these JOINs are impossible to build. + */ +static List * +extract_asymmetric_partitionwise_subjoin(PlannerInfo *root, + RelOptInfo *joinrel, + AppendPath *append_path, + RelOptInfo *inner_rel, + JoinType jointype, + JoinPathExtraData *extra) +{ + List *result = NIL; + ListCell *lc; + + foreach (lc, append_path->subpaths) + { + Path *child_path = lfirst(lc); + RelOptInfo *child_rel = child_path->parent; + Relids child_join_relids; + RelOptInfo *child_join_rel; + SpecialJoinInfo *child_sjinfo; + List *child_restrictlist; + AppendRelInfo **appinfos; + intnappinfos; + + child_join_relids = bms_union(child_rel->relids, + inner_rel->relids); + appinfos = find_appinfos_by_relids(root, child_join_relids, + ); + child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo, + child_rel->relids, + inner_rel->relids); + child_restrictlist = (List *) + adjust_appendrel_attrs(root, (Node *)extra->restrictlist, + nappinfos, appinfos); + pfree(appinfos); + + child_join_rel = find_join_rel(root, child_join_relids); + if (!child_join_rel) + { + child_join_rel = build_child_join_rel(root, + child_rel, + inner_rel, + joinrel, + child_restrictlist, + child_sjinfo, + jointype); + if (!child_join_rel) + { +/* + * If can't build JOIN between inner relation and one of the outer + * partitions - return immediately. + */ +return NIL; + } + } + else + { + /* + * TODO: + * Can't imagine situation when join relation already exists. But in + * the 'partition_join' regression test it happens. + * It may be an indicator of possible problems. + */ + } + + populate_joinrel_with_paths(root, + child_rel, + inner_rel, + child_join_rel, + child_sjinfo, + child_restrictlist); + + /* Give up if asymmetric partition-wise join is not available */ + if (child_join_rel->pathlist == NIL) + return NIL; + + set_cheapest(chi
Re: Asymmetric partition-wise JOIN
On 09.11.2020 13:53, Anastasia Lubennikova wrote: On 21.08.2020 09:02, Andrey V. Lepikhov wrote: On 7/1/20 2:10 PM, Daniel Gustafsson wrote: On 27 Dec 2019, at 08:34, Kohei KaiGai wrote: The attached v2 fixed the problem, and regression test finished correctly. This patch no longer applies to HEAD, please submit an rebased version. Marking the entry Waiting on Author in the meantime. Rebased version of the patch on current master (d259afa736). I rebased it because it is a base of my experimental feature than we don't break partitionwise join of a relation with foreign partition and a local relation if we have info that remote server has foreign table link to the local relation (by analogy with shippable extensions). Maybe mark as 'Needs review'? Status update for a commitfest entry. According to cfbot, the patch fails to apply. Could you please send a rebased version? This thread was inactive for quite some time. Is anyone going to continue working on it? I see some interest in the idea of sharable hash, but I don't see even a prototype in this thread. So, probably, it is a matter of a separate discussion. Also, I took a look at the code. It looks like it needs some extra work. I am not a big expert in this area, so I'm sorry if questions are obvious. 1. What would happen if this assumption is not met? + * MEMO: We assume this pathlist keeps at least one AppendPath that + * represents partitioned table-scan, symmetric or asymmetric + * partition-wise join. It is not correct right now, however, a hook + * on add_path() to give additional decision for path removel allows + * to retain this kind of AppendPath, regardless of its cost. 2. Why do we wrap extract_asymmetric_partitionwise_subjoin() call into PG_TRY/PG_CATCH? What errors do we expect? 3. It looks like a crutch. If it isn't, I'd like to see a better comment about why "dynamic programming" is not applicable here. And shouldn't we also handle a root->join_cur_level? + /* temporary disables "dynamic programming" algorithm */ + root->join_rel_level = NULL; 4. This change looks like it can lead to a memory leak for old code. Maybe it is never the case, but again I think it worth a comment. - /* If there's nothing to adjust, don't call this function. */ - Assert(nappinfos >= 1 && appinfos != NULL); + /* If there's nothing to adjust, just return a duplication */ + if (nappinfos == 0) + return copyObject(node); 5. extract_asymmetric_partitionwise_subjoin() lacks a comment The new status of this patch is: Waiting on Author Status update for a commitfest entry. This entry was inactive during this CF, so I've marked it as returned with feedback. Feel free to resubmit an updated version to a future commitfest. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Asymmetric partition-wise JOIN
On 21.08.2020 09:02, Andrey V. Lepikhov wrote: On 7/1/20 2:10 PM, Daniel Gustafsson wrote: On 27 Dec 2019, at 08:34, Kohei KaiGai wrote: The attached v2 fixed the problem, and regression test finished correctly. This patch no longer applies to HEAD, please submit an rebased version. Marking the entry Waiting on Author in the meantime. Rebased version of the patch on current master (d259afa736). I rebased it because it is a base of my experimental feature than we don't break partitionwise join of a relation with foreign partition and a local relation if we have info that remote server has foreign table link to the local relation (by analogy with shippable extensions). Maybe mark as 'Needs review'? Status update for a commitfest entry. According to cfbot, the patch fails to apply. Could you please send a rebased version? This thread was inactive for quite some time. Is anyone going to continue working on it? I see some interest in the idea of sharable hash, but I don't see even a prototype in this thread. So, probably, it is a matter of a separate discussion. Also, I took a look at the code. It looks like it needs some extra work. I am not a big expert in this area, so I'm sorry if questions are obvious. 1. What would happen if this assumption is not met? + * MEMO: We assume this pathlist keeps at least one AppendPath that + * represents partitioned table-scan, symmetric or asymmetric + * partition-wise join. It is not correct right now, however, a hook + * on add_path() to give additional decision for path removel allows + * to retain this kind of AppendPath, regardless of its cost. 2. Why do we wrap extract_asymmetric_partitionwise_subjoin() call into PG_TRY/PG_CATCH? What errors do we expect? 3. It looks like a crutch. If it isn't, I'd like to see a better comment about why "dynamic programming" is not applicable here. And shouldn't we also handle a root->join_cur_level? + /* temporary disables "dynamic programming" algorithm */ + root->join_rel_level = NULL; 4. This change looks like it can lead to a memory leak for old code. Maybe it is never the case, but again I think it worth a comment. - /* If there's nothing to adjust, don't call this function. */ - Assert(nappinfos >= 1 && appinfos != NULL); + /* If there's nothing to adjust, just return a duplication */ + if (nappinfos == 0) + return copyObject(node); 5. extract_asymmetric_partitionwise_subjoin() lacks a comment The new status of this patch is: Waiting on Author -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Asymmetric partition-wise JOIN
On Sat, Aug 24, 2019 at 2:03 PM Kohei KaiGai wrote: > > 2019年8月24日(土) 7:02 Thomas Munro : > > > > On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai wrote: > > > We can consider the table join ptable X t1 above is equivalent to: > > > (ptable_p0 + ptable_p1 + ptable_p2) X t1 > > > = (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1) > > > It returns an equivalent result, however, rows are already reduced by > > > HashJoin > > > in the individual leaf of Append, so CPU-cycles consumed by Append node > > > can > > > be cheaper. > > > > > > On the other hands, it has a downside because t1 must be read 3 times and > > > hash table also must be built 3 times. It increases the expected cost, > > > so planner > > > may not choose the asymmetric partition-wise join plan. > > > > What if you include the partition constraint as a filter on t1? So you get: > > > > ptable X t1 = > > (ptable_p0 X (σ hash(dist)%4=0 (t1))) + > > (ptable_p1 X (σ hash(dist)%4=1 (t1))) + > > (ptable_p2 X (σ hash(dist)%4=2 (t1))) + > > (ptable_p3 X (σ hash(dist)%4=3 (t1))) > > > > Pros: > > 1. The hash tables will not contain unnecessary junk. > > 2. You'll get the right answer if t1 is on the outer side of an outer join. > > 3. If this runs underneath a Parallel Append and t1 is big enough > > then workers will hopefully cooperate and do a synchronised scan of > > t1. > > 4. The filter might enable a selective and efficient plan like an index > > scan. > > > > Cons: > > 1. The filter might not enable a selective and efficient plan, and > > therefore cause extra work. > > > > (It's a little weird in this example because don't usually see hash > > functions in WHERE clauses, but that could just as easily be dist > > BETWEEN 1 AND 99 or any other partition constraint.) > > > It requires the join-key must include the partition key and also must be > equality-join, doesn't it? > If ptable and t1 are joined using ptable.dist = t1.foo, we can distribute > t1 for each leaf table with "WHERE hash(foo)%4 = xxx" according to > the partition bounds, indeed. > > In case when some of partition leafs are pruned, it is exactly beneficial > because relevant rows to be referenced by the pruned child relations > are waste of memory. > > On the other hands, it eventually consumes almost equivalent amount > of memory to load the inner relations, if no leafs are pruned, and if we > could extend the Hash-node to share the hash-table with sibling join-nodess. > > > > One idea I have is, sibling HashJoin shares a hash table that was built > > > once > > > by any of the sibling Hash plan. Right now, it is not implemented yet. > > > > Yeah, I've thought a little bit about that in the context of Parallel > > Repartition. I'm interested in combining intra-node partitioning > > (where a single plan node repartitions data among workers on the fly) > > with inter-node partitioning (like PWJ, where partitions are handled > > by different parts of the plan, considered at planning time); you > > finish up needing to have nodes in the plan that 'receive' tuples for > > each partition, to match up with the PWJ plan structure. That's not > > entirely unlike CTE references, and not entirely unlike your idea of > > somehow sharing the same hash table. I ran into a number of problems > > while thinking about that, which I should write about in another > > thread. > > > Hmm. Do you intend the inner-path may have different behavior according > to the partition bounds definition where the outer-path to be joined? > Let me investigate its pros & cons. > > The reasons why I think the idea of sharing the same hash table is reasonable > in this scenario are: > 1. We can easily extend the idea for parallel optimization. A hash table on > DSM > segment, once built, can be shared by all the siblings in all the > parallel workers. > 2. We can save the memory consumption regardless of the join-keys and > partition-keys, even if these are not involved in the query. > > On the other hands, below are the downside. Potentially, combined use of > your idea may help these cases: > 3. Distributed inner-relation cannot be outer side of XXX OUTER JOIN. > 4. Hash table contains rows to be referenced by only pruned partition leafs. > + many, for the sharable hash of the inner table of the join. IMHO, this could be the most interesting and captivating thing about this feature. But might be a complicated piece, is that still on the plan? Regards, Amul
Re: Asymmetric partition-wise JOIN
> On 21 Aug 2020, at 08:02, Andrey V. Lepikhov > wrote: > > On 7/1/20 2:10 PM, Daniel Gustafsson wrote: >>> On 27 Dec 2019, at 08:34, Kohei KaiGai wrote: >>> The attached v2 fixed the problem, and regression test finished correctly. >> This patch no longer applies to HEAD, please submit an rebased version. >> Marking the entry Waiting on Author in the meantime. > Rebased version of the patch on current master (d259afa736). > > I rebased it because it is a base of my experimental feature than we don't > break partitionwise join of a relation with foreign partition and a local > relation if we have info that remote server has foreign table link to the > local relation (by analogy with shippable extensions). > > Maybe mark as 'Needs review'? Thanks for the rebase, I've updated the commitfest entry to reflect that it needs a round of review. cheers ./daniel
Re: Asymmetric partition-wise JOIN
inInfo *child_sjinfo; + List *child_restrictlist; + AppendRelInfo **appinfos; + intnappinfos; + + child_join_relids = bms_union(child_rel->relids, + inner_rel->relids); + appinfos = find_appinfos_by_relids(root, child_join_relids, + ); + child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo, + child_rel->relids, + inner_rel->relids); + child_restrictlist = (List *) + adjust_appendrel_attrs(root, (Node *)extra->restrictlist, + nappinfos, appinfos); + pfree(appinfos); + + child_join_rel = find_join_rel(root, child_join_relids); + if (!child_join_rel) + { + child_join_rel = build_child_join_rel(root, + child_rel, + inner_rel, + joinrel, + child_restrictlist, + child_sjinfo, + jointype); + if (!child_join_rel) +return NIL; + } + + populate_joinrel_with_paths(root, + child_rel, + inner_rel, + child_join_rel, + child_sjinfo, + child_restrictlist); + + /* Give up if asymmetric partition-wise join is not available */ + if (child_join_rel->pathlist == NIL) + return NIL; + + set_cheapest(child_join_rel); + result = lappend(result, child_join_rel); + } + return result; +} + +void +try_asymmetric_partitionwise_join(PlannerInfo *root, + RelOptInfo *joinrel, + RelOptInfo *outer_rel, + RelOptInfo *inner_rel, + JoinType jointype, + JoinPathExtraData *extra) +{ + ListCell *lc; + + if (!enable_partitionwise_join) + return; + + if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel)) + return; + + if (jointype != JOIN_INNER && jointype != JOIN_LEFT) + return; + + foreach (lc, outer_rel->pathlist) + { + AppendPath *append_path = lfirst(lc); + + /* + * MEMO: We assume this pathlist keeps at least one AppendPath that + * represents partitioned table-scan, symmetric or asymmetric + * partition-wise join. It is not correct right now, however, a hook + * on add_path() to give additional decision for path removel allows + * to retain this kind of AppendPath, regardless of its cost. + */ + if (IsA(append_path, AppendPath) && + append_path->partitioned_rels != NIL) + { + List **join_rel_level_saved; + List *live_childrels = NIL; + + join_rel_level_saved = root->join_rel_level; + PG_TRY(); + { +/* temporary disables "dynamic programming" algorithm */ +root->join_rel_level = NULL; + +live_childrels = + extract_asymmetric_partitionwise_subjoin(root, + joinrel, + append_path, + inner_rel, + jointype, + extra); + } + PG_CATCH(); + { +root->join_rel_level = join_rel_level_saved; +PG_RE_THROW(); + } + PG_END_TRY(); + root->join_rel_level = join_rel_level_saved; + + if (live_childrels != NIL) +add_paths_to_append_rel(root, joinrel, live_childrels, + append_path->partitioned_rels); + break; + } + } +} + /* * Construct the SpecialJoinInfo for a child-join by translating * SpecialJoinInfo for the join between parents. left_relids and right_relids diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index b40a112c25..863fb79f03 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -7536,7 +7536,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root, } /* Build new paths for this relation by appending child paths. */ - add_paths_to_append_rel(root, rel, live_children); + add_paths_to_append_rel(root, rel, live_children, NIL); } /* @@ -7689,7 +7689,7 @@ create_partitionwise_grouping_paths(PlannerInfo *root, Assert(partially_grouped_live_children != NIL); add_paths_to_append_rel(root, partially_grouped_rel, -partially_grouped_live_children); +partially_grouped_live_children, NIL); /* * We need call set_cheapest, since the finalization step will use the @@ -7704,7 +7704,7 @@ create_partitionwise_grouping_paths(PlannerInfo *root, { Assert(grouped_live_children != NIL); - add_paths_to_append_rel(root, grouped_rel, grouped_live_children); + add_paths_to_append_rel(root, grouped_rel, grouped_live_children, NIL); } } diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c index d722063cf3..32230cf877 100644 --- a/src/backend/optimizer/util/appendinfo.c +++ b/src/backend/optimizer/util/appendinfo.c @@ -201,8 +201,9 @@ adjust_appendrel_attrs(PlannerInfo *root, Node *node, int nappinfos, context.nappinfos = nappinfos; context.appinfos = appinfos; - /* If there's nothing to adjust, don't call this function. */ - Assert(nappinfos >= 1 && appinfos != NULL); + /* If there's nothing to adjust, just return a duplication */ + if (nappinfos == 0) + return copyObject(node); /* * M
Re: Asymmetric partition-wise JOIN
On 12/27/19 12:34 PM, Kohei KaiGai wrote: The attached v2 fixed the problem, and regression test finished correctly. Using your patch I saw incorrect value of predicted rows at the top node of the plan: "Append (cost=270.02..35165.37 rows=40004 width=16)" Full explain of the query plan see in attachment - explain_with_asymmetric.sql if I disable enable_partitionwise_join then: "Hash Join (cost=270.02..38855.25 rows=10001 width=16)" Full explain - explain_no_asymmetric.sql I thought that is the case of incorrect usage of cached values of norm_selec, but it is a corner-case problem of the eqjoinsel() routine : selectivity = 1/size_of_larger_relation; (selfuncs.c:2567) tuples = selectivity * outer_tuples * inner_tuples; (costsize.c:4607) i.e. number of tuples depends only on size of smaller relation. It is not a bug of your patch but I think you need to know because it may affect on planner decision. === P.S. Test case: CREATE TABLE t0 (a serial, b int); INSERT INTO t0 (b) (SELECT * FROM generate_series(1e4, 2e4) as g); CREATE TABLE parts (a serial, b int) PARTITION BY HASH(a) INSERT INTO parts (b) (SELECT * FROM generate_series(1, 1e6) as g); -- regards, Andrey Lepikhov Postgres Professional explain_with_asymmetric.sql Description: application/sql explain_no_asymmetric.sql Description: application/sql
Re: Asymmetric partition-wise JOIN
> On 27 Dec 2019, at 08:34, Kohei KaiGai wrote: > The attached v2 fixed the problem, and regression test finished correctly. This patch no longer applies to HEAD, please submit an rebased version. Marking the entry Waiting on Author in the meantime. cheers ./daniel
Re: Asymmetric partition-wise JOIN
Hi Thomas, On 12/27/19 2:34 AM, Kohei KaiGai wrote: > This crash was reproduced on our environment also. It looks to me adjust_child_relids_multilevel() didn't expect a case when supplied 'relids' (partially) indicate normal and non-partitioned relation. It tries to build a new 'parent_relids' that is a set of appinfo->parent_relid related to the supplied 'child_relids'. However, bits in child_relids that indicate normal relations are unintentionally dropped here. Then, adjust_child_relids_multilevel() goes to an infinite recursion until stack limitation. The attached v2 fixed the problem, and regression test finished correctly. Any thoughts on the new version of this patch? Regards, -- -David da...@pgmasters.net
Re: Asymmetric partition-wise JOIN
Hello, This crash was reproduced on our environment also. It looks to me adjust_child_relids_multilevel() didn't expect a case when supplied 'relids' (partially) indicate normal and non-partitioned relation. It tries to build a new 'parent_relids' that is a set of appinfo->parent_relid related to the supplied 'child_relids'. However, bits in child_relids that indicate normal relations are unintentionally dropped here. Then, adjust_child_relids_multilevel() goes to an infinite recursion until stack limitation. The attached v2 fixed the problem, and regression test finished correctly. Best regards, 2019年12月1日(日) 12:24 Michael Paquier : > > On Sat, Aug 24, 2019 at 05:33:01PM +0900, Kohei KaiGai wrote: > > On the other hands, it eventually consumes almost equivalent amount > > of memory to load the inner relations, if no leafs are pruned, and if we > > could extend the Hash-node to share the hash-table with sibling > > join-nodess. > > The patch crashes when running the regression tests, per the report of > the automatic patch tester. Could you look at that? I have moved the > patch to nexf CF, waiting on author. > -- > Michael -- HeteroDB, Inc / The PG-Strom Project KaiGai Kohei #12912 0x006f9471 in adjust_child_relids_multilevel ( root=root@entry=0x2115240, relids=relids@entry=0x2106df8, child_relids=child_relids@entry=0x0, top_parent_relids=top_parent_relids@entry=0x20afd88) at appendinfo.c:602 #12913 0x006f9471 in adjust_child_relids_multilevel ( root=root@entry=0x2115240, relids=relids@entry=0x2106df8, child_relids=child_relids@entry=0x0, top_parent_relids=top_parent_relids@entry=0x20afd88) at appendinfo.c:602 #12914 0x006f9471 in adjust_child_relids_multilevel ( root=root@entry=0x2115240, relids=relids@entry=0x2106df8, child_relids=child_relids@entry=0x0, top_parent_relids=top_parent_relids@entry=0x20afd88) at appendinfo.c:602 #12915 0x006f9471 in adjust_child_relids_multilevel ( root=root@entry=0x2115240, relids=relids@entry=0x2106df8, child_relids=child_relids@entry=0x20b0bb8, top_parent_relids=top_parent_relids@entry=0x20afd88) at appendinfo.c:602 #12916 0x006f9471 in adjust_child_relids_multilevel ( root=root@entry=0x2115240, relids=relids@entry=0x2106df8, child_relids=child_relids@entry=0x20af9d8, top_parent_relids=top_parent_relids@entry=0x20afd88) at appendinfo.c:602 #12917 0x006c9805 in add_child_join_rel_equivalences ( root=root@entry=0x2115240, nappinfos=1, appinfos=appinfos@entry=0x20afdb0, parent_joinrel=parent_joinrel@entry=0x20ae420, child_joinrel=child_joinrel@entry=0x20aef10) at equivclass.c:2458 #12918 0x0070bc07 in build_child_join_rel (root=root@entry=0x2115240, outer_rel=outer_rel@entry=0x210a318, inner_rel=inner_rel@entry=0x21051a0, parent_joinrel=parent_joinrel@entry=0x20ae420, restrictlist=restrictlist@entry=0x20afcd8, sjinfo=sjinfo@entry=0x20afa00, jointype=JOIN_INNER) at relnode.c:896 #12919 0x006d2dcf in extract_asymmetric_partitionwise_subjoin ( append_path=0x20a4000, extra=0x7ffea461c710, extra=0x7ffea461c710, jointype=JOIN_INNER, inner_rel=0x21051a0, joinrel=0x20ae420, root=0x2115240) at joinrels.c:1573 #12920 try_asymmetric_partitionwise_join (root=root@entry=0x2115240, joinrel=joinrel@entry=0x20ae420, outer_rel=outer_rel@entry=0x207cce0, inner_rel=inner_rel@entry=0x21051a0, jointype=jointype@entry=JOIN_INNER, extra=extra@entry=0x7ffea461c710) at joinrels.c:1641 #12921 0x006cf296 in add_paths_to_joinrel (root=root@entry=0x2115240, joinrel=joinrel@entry=0x20ae420, outerrel=outerrel@entry=0x207cce0, innerrel=innerrel@entry=0x21051a0, jointype=jointype@entry=JOIN_INNER, sjinfo=sjinfo@entry=0x7ffea461c8c0, restrictlist=0x20ae898) at joinpath.c:333 #12922 0x006d1a3a in populate_joinrel_with_paths (root=0x2115240, rel1=0x207cce0, rel2=0x21051a0, joinrel=0x20ae420, sjinfo=0x7ffea461c8c0, restrictlist=0x20ae898) at joinrels.c:804 #12923 0x006d2505 in make_join_rel (root=root@entry=0x2115240, rel1=rel1@entry=0x207cce0, rel2=rel2@entry=0x21051a0) at joinrels.c:757 #12924 0x006d278d in make_rels_by_clause_joins ( other_rels=, other_rels_list=0x20ae390, old_rel=0x207cce0, root=0x2115240) at joinrels.c:309 #12925 join_search_one_level (root=root@entry=0x2115240, level=level@entry=2) at joinrels.c:120 #12926 0x006bdd4b in standard_join_search (root=0x2115240, levels_needed=3, initial_rels=) at allpaths.c:2879 #12927 0x006be1c4 in make_one_rel (root=root@entry=0x2115240, joinlist=joinlist@entry=0x2106b80) at allpaths.c:227 #12928 0x006e1cdb in query_planner (root=root@entry=0x2115240, qp_callback=qp_callback@entry=0x6e22b0 , qp_extra=qp_extra@entry=0x7ffea461cb60) at planmain.c:269 #12929 0x006e6962 in grouping_planner (root=, inheritance_update=false, tuple_fraction=) at
Re: Asymmetric partition-wise JOIN
On Sat, Aug 24, 2019 at 05:33:01PM +0900, Kohei KaiGai wrote: > On the other hands, it eventually consumes almost equivalent amount > of memory to load the inner relations, if no leafs are pruned, and if we > could extend the Hash-node to share the hash-table with sibling > join-nodess. The patch crashes when running the regression tests, per the report of the automatic patch tester. Could you look at that? I have moved the patch to nexf CF, waiting on author. -- Michael signature.asc Description: PGP signature
Re: Asymmetric partition-wise JOIN
2019年8月24日(土) 7:02 Thomas Munro : > > On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai wrote: > > We can consider the table join ptable X t1 above is equivalent to: > > (ptable_p0 + ptable_p1 + ptable_p2) X t1 > > = (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1) > > It returns an equivalent result, however, rows are already reduced by > > HashJoin > > in the individual leaf of Append, so CPU-cycles consumed by Append node can > > be cheaper. > > > > On the other hands, it has a downside because t1 must be read 3 times and > > hash table also must be built 3 times. It increases the expected cost, > > so planner > > may not choose the asymmetric partition-wise join plan. > > What if you include the partition constraint as a filter on t1? So you get: > > ptable X t1 = > (ptable_p0 X (σ hash(dist)%4=0 (t1))) + > (ptable_p1 X (σ hash(dist)%4=1 (t1))) + > (ptable_p2 X (σ hash(dist)%4=2 (t1))) + > (ptable_p3 X (σ hash(dist)%4=3 (t1))) > > Pros: > 1. The hash tables will not contain unnecessary junk. > 2. You'll get the right answer if t1 is on the outer side of an outer join. > 3. If this runs underneath a Parallel Append and t1 is big enough > then workers will hopefully cooperate and do a synchronised scan of > t1. > 4. The filter might enable a selective and efficient plan like an index scan. > > Cons: > 1. The filter might not enable a selective and efficient plan, and > therefore cause extra work. > > (It's a little weird in this example because don't usually see hash > functions in WHERE clauses, but that could just as easily be dist > BETWEEN 1 AND 99 or any other partition constraint.) > It requires the join-key must include the partition key and also must be equality-join, doesn't it? If ptable and t1 are joined using ptable.dist = t1.foo, we can distribute t1 for each leaf table with "WHERE hash(foo)%4 = xxx" according to the partition bounds, indeed. In case when some of partition leafs are pruned, it is exactly beneficial because relevant rows to be referenced by the pruned child relations are waste of memory. On the other hands, it eventually consumes almost equivalent amount of memory to load the inner relations, if no leafs are pruned, and if we could extend the Hash-node to share the hash-table with sibling join-nodess. > > One idea I have is, sibling HashJoin shares a hash table that was built once > > by any of the sibling Hash plan. Right now, it is not implemented yet. > > Yeah, I've thought a little bit about that in the context of Parallel > Repartition. I'm interested in combining intra-node partitioning > (where a single plan node repartitions data among workers on the fly) > with inter-node partitioning (like PWJ, where partitions are handled > by different parts of the plan, considered at planning time); you > finish up needing to have nodes in the plan that 'receive' tuples for > each partition, to match up with the PWJ plan structure. That's not > entirely unlike CTE references, and not entirely unlike your idea of > somehow sharing the same hash table. I ran into a number of problems > while thinking about that, which I should write about in another > thread. > Hmm. Do you intend the inner-path may have different behavior according to the partition bounds definition where the outer-path to be joined? Let me investigate its pros & cons. The reasons why I think the idea of sharing the same hash table is reasonable in this scenario are: 1. We can easily extend the idea for parallel optimization. A hash table on DSM segment, once built, can be shared by all the siblings in all the parallel workers. 2. We can save the memory consumption regardless of the join-keys and partition-keys, even if these are not involved in the query. On the other hands, below are the downside. Potentially, combined use of your idea may help these cases: 3. Distributed inner-relation cannot be outer side of XXX OUTER JOIN. 4. Hash table contains rows to be referenced by only pruned partition leafs. Best regards, -- HeteroDB, Inc / The PG-Strom Project KaiGai Kohei
Re: Asymmetric partition-wise JOIN
On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai wrote: > We can consider the table join ptable X t1 above is equivalent to: > (ptable_p0 + ptable_p1 + ptable_p2) X t1 > = (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1) > It returns an equivalent result, however, rows are already reduced by HashJoin > in the individual leaf of Append, so CPU-cycles consumed by Append node can > be cheaper. > > On the other hands, it has a downside because t1 must be read 3 times and > hash table also must be built 3 times. It increases the expected cost, > so planner > may not choose the asymmetric partition-wise join plan. What if you include the partition constraint as a filter on t1? So you get: ptable X t1 = (ptable_p0 X (σ hash(dist)%4=0 (t1))) + (ptable_p1 X (σ hash(dist)%4=1 (t1))) + (ptable_p2 X (σ hash(dist)%4=2 (t1))) + (ptable_p3 X (σ hash(dist)%4=3 (t1))) Pros: 1. The hash tables will not contain unnecessary junk. 2. You'll get the right answer if t1 is on the outer side of an outer join. 3. If this runs underneath a Parallel Append and t1 is big enough then workers will hopefully cooperate and do a synchronised scan of t1. 4. The filter might enable a selective and efficient plan like an index scan. Cons: 1. The filter might not enable a selective and efficient plan, and therefore cause extra work. (It's a little weird in this example because don't usually see hash functions in WHERE clauses, but that could just as easily be dist BETWEEN 1 AND 99 or any other partition constraint.) > One idea I have is, sibling HashJoin shares a hash table that was built once > by any of the sibling Hash plan. Right now, it is not implemented yet. Yeah, I've thought a little bit about that in the context of Parallel Repartition. I'm interested in combining intra-node partitioning (where a single plan node repartitions data among workers on the fly) with inter-node partitioning (like PWJ, where partitions are handled by different parts of the plan, considered at planning time); you finish up needing to have nodes in the plan that 'receive' tuples for each partition, to match up with the PWJ plan structure. That's not entirely unlike CTE references, and not entirely unlike your idea of somehow sharing the same hash table. I ran into a number of problems while thinking about that, which I should write about in another thread. -- Thomas Munro https://enterprisedb.com
Re: Asymmetric partition-wise JOIN
Hello, Even though nobody has respond the thread, I tried to make a prototype of the asymmetric partition-wise join support. This feature tries to join non-partitioned and partitioned relation before append. See the example below: create table ptable (dist int, a int, b int) partition by hash (dist); create table ptable_p0 partition of ptable for values with (modulus 3, remainder 0); create table ptable_p1 partition of ptable for values with (modulus 3, remainder 1); create table ptable_p2 partition of ptable for values with (modulus 3, remainder 2); create table t1 (aid int, label text); create table t2 (bid int, label text); insert into ptable (select x, (1000*random())::int, (1000*random())::int from generate_series(1,100) x); insert into t1 (select x, md5(x::text) from generate_series(1,50) x); insert into t2 (select x, md5(x::text) from generate_series(1,50) x); vacuum analyze ptable; vacuum analyze t1; vacuum analyze t2; ptable.a has values between 0 and 1000, and t1.aid has values between 1 and 50. Therefore, tables join on ptable and t1 by a=aid can reduce almost 95% rows. On the other hands, t1 is not partitioned and join-keys are not partition keys. So, Append must process million rows first, then HashJoin processes the rows read from the partitioned table, and 95% of them are eventually dropped. On the other words, 95% of jobs by Append are waste of time and CPU cycles. postgres=# explain select * from ptable, t1 where a = aid; QUERY PLAN -- Hash Join (cost=2.12..24658.62 rows=49950 width=49) Hash Cond: (ptable_p0.a = t1.aid) -> Append (cost=0.00..20407.00 rows=100 width=12) -> Seq Scan on ptable_p0 (cost=0.00..5134.63 rows=333263 width=12) -> Seq Scan on ptable_p1 (cost=0.00..5137.97 rows=333497 width=12) -> Seq Scan on ptable_p2 (cost=0.00..5134.40 rows=333240 width=12) -> Hash (cost=1.50..1.50 rows=50 width=37) -> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37) (8 rows) The asymmetric partitionwise join allows to join non-partitioned tables and partitioned tables prior to Append. postgres=# set enable_partitionwise_join = on; SET postgres=# explain select * from ptable, t1 where a = aid; QUERY PLAN -- Append (cost=2.12..19912.62 rows=49950 width=49) -> Hash Join (cost=2.12..6552.96 rows=16647 width=49) Hash Cond: (ptable_p0.a = t1.aid) -> Seq Scan on ptable_p0 (cost=0.00..5134.63 rows=333263 width=12) -> Hash (cost=1.50..1.50 rows=50 width=37) -> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37) -> Hash Join (cost=2.12..6557.29 rows=16658 width=49) Hash Cond: (ptable_p1.a = t1.aid) -> Seq Scan on ptable_p1 (cost=0.00..5137.97 rows=333497 width=12) -> Hash (cost=1.50..1.50 rows=50 width=37) -> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37) -> Hash Join (cost=2.12..6552.62 rows=16645 width=49) Hash Cond: (ptable_p2.a = t1.aid) -> Seq Scan on ptable_p2 (cost=0.00..5134.40 rows=333240 width=12) -> Hash (cost=1.50..1.50 rows=50 width=37) -> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37) (16 rows) We can consider the table join ptable X t1 above is equivalent to: (ptable_p0 + ptable_p1 + ptable_p2) X t1 = (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1) It returns an equivalent result, however, rows are already reduced by HashJoin in the individual leaf of Append, so CPU-cycles consumed by Append node can be cheaper. On the other hands, it has a downside because t1 must be read 3 times and hash table also must be built 3 times. It increases the expected cost, so planner may not choose the asymmetric partition-wise join plan. One idea I have is, sibling HashJoin shares a hash table that was built once by any of the sibling Hash plan. Right now, it is not implemented yet. How about your thought for this feature? Best regards, 2019年8月12日(月) 15:03 Kohei KaiGai : > > Hello, > > PostgreSQL optimizer right now considers join pairs on only > non-partition - non-partition or > partition-leaf - partition-leaf relations. On the other hands, it is > harmless and makes sense to > consider a join pair on non-partition - partition-leaf. > > See the example below. ptable is partitioned by hash, and contains 10M > rows. ftable is not > partitioned and contains 50 rows. Most of ptable::fkey shall not have > matched rows in this > join. > > create table ptable (fkey int, dist text) partition by hash (dist); > create table ptable_p0 partition of ptable for values with (modulus 3, > remainder 0); > create table ptable_p1 partitio
Asymmetric partition-wise JOIN
cost=1.50..1.50 rows=50 width=4) (actual time=0.020..0.020 rows=50 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 10kB -> Seq Scan on ftable f_1 (cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.011 rows=50 loops=1) -> Subquery Scan on "*SELECT* 3" (cost=2.12..70051.84 rows=16699 width=0) (actual time=0.025..407.103 rows=16633 loops=1) -> Hash Join (cost=2.12..69884.85 rows=16699 width=72) (actual time=0.025..406.048 rows=16633 loops=1) Hash Cond: (p_2.fkey = f_2.pkey) -> Seq Scan on ptable_p2 p_2 (cost=0.00..61126.79 rows=3334179 width=4) (actual time=0.004..181.015 rows=3334179 loops=1) -> Hash (cost=1.50..1.50 rows=50 width=4) (actual time=0.014..0.014 rows=50 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 10kB -> Seq Scan on ftable f_2 (cost=0.00..1.50 rows=50 width=4) (actual time=0.003..0.008 rows=50 loops=1) Planning Time: 0.614 ms Execution Time: 1396.131 ms (25 rows) How about your opinions for this kind of asymmetric partition-wise JOIN support by the optimizer? I think we can harmlessly push-down inner-join and left-join if partition-leaf is left side. Probably, we need to implement two key functionalities. 1. Construction of RelOpInfo for join on non-partition table and partition-leafs for each pairs. Instead of JoinPaths, this logic adds AppendPath that takes asymmetric partition-wise join paths as sub-paths. Other optimization logic is equivalent as we are currently doing. 2. Allow to share the hash-table built from table scan distributed to individual partition leafs. In the above example, SeqScan on ftable and relevant Hash path will make identical hash- table for the upcoming hash-join. If sibling paths have equivalent results, it is reasonable to reuse it. Best regards, -- HeteroDB, Inc / The PG-Strom Project KaiGai Kohei