subject:"Asymmetric partition\-wise JOIN"

Re: Asymmetric partition-wise JOIN

2024-04-01 Thread Andrei Lepikhov


On 15/10/2023 13:25, Alexander Korotkov wrote:

Great!  I'm looking forward to the revised patch.
Revising the code and opinions before restarting this work, I found two 
different possible strategies mentioned in the thread:
1. 'Common Resources' shares the materialised result of the inner table 
scan (a hash table in the case of HashJoin) to join each partition one 
by one. It gives us a profit in the case of parallel append and possibly 
other cases, like the one shown in the initial message.
2. 'Individual strategies' - By limiting the AJ feature to cases when 
the JOIN clause contains a partitioning expression, we can push an 
additional scan clause into each copy of the inner table scan, reduce 
the number of tuples scanned, and even prune something because of proven 
zero input.


I see the pros and cons of both approaches. The first option is more 
straightforward, and its outcome is obvious in the case of parallel 
append. But how can we guarantee the same join type for each join? Why 
should we ignore the positive effect of different strategies for 
different partitions?
The second strategy is more expensive for the optimiser, especially in 
the multipartition case. But as I can predict, it is easier to implement 
and looks more natural for the architecture. What do you think about that?


--
regards,
Andrei Lepikhov
Postgres Professional

Re: Asymmetric partition-wise JOIN

2023-10-18 Thread Andrei Lepikhov


On 18/10/2023 16:59, Ashutosh Bapat wrote:

On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov
 wrote:



But the clauses of A parameterized by P will produce different
translations for each of the partitions. I think we will need
different RelOptInfos (for A) to store these translations.


Does the answer above resolved this issue?


May be. There are other problematic areas like EvalPlanQual, Rescans,
reparameterised paths which can blow up if we use the same RelOptInfo
for different scans of the same relation. It will be good to test


Yeah, now I got it. It is already the second place where I see some 
reference to a kind of hidden rule that the rte entry (or RelOptInfo) 
must correspond to only one plan node. I don't have a quick answer for 
now - maybe it is a kind of architectural agreement - and I will 
consider this issue during the development.



those. And also A need not be a simple relation; it could be join as
well.


For a join RelOptInfo, as well as for any subtree, we have the same 
logic: the pathlist of this subtree is already formed during the 
previous level of the search and will not be changed.





The relid is also used to track the scans at executor level. Since we
have so many scans on A, each may be using different plan, we will
need different ids for those.


I don't understand this sentence. Which way executor uses this index of
RelOptInfo ?


See Scan::scanrelid



--
regards,
Andrey Lepikhov
Postgres Professional

Re: Asymmetric partition-wise JOIN

2023-10-18 Thread Ashutosh Bapat

On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov
 wrote:
>
> > But the clauses of A parameterized by P will produce different
> > translations for each of the partitions. I think we will need
> > different RelOptInfos (for A) to store these translations.
>
> Does the answer above resolved this issue?

May be. There are other problematic areas like EvalPlanQual, Rescans,
reparameterised paths which can blow up if we use the same RelOptInfo
for different scans of the same relation. It will be good to test
those. And also A need not be a simple relation; it could be join as
well.

>
> > The relid is also used to track the scans at executor level. Since we
> > have so many scans on A, each may be using different plan, we will
> > need different ids for those.
>
> I don't understand this sentence. Which way executor uses this index of
> RelOptInfo ?

See Scan::scanrelid

-- 
Best Wishes,
Ashutosh Bapat

Re: Asymmetric partition-wise JOIN

2023-10-17 Thread Andrei Lepikhov


On 17/10/2023 17:09, Ashutosh Bapat wrote:

On Tue, Oct 17, 2023 at 2:05 PM Andrei Lepikhov
 wrote:


On 16/10/2023 23:21, Ashutosh Bapat wrote:

On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov
Whenever I visited this idea, I hit one issue prominently - how would
we differentiate different scans of the non-partitioned relation.
Normally we do that using different Relids but in this case we
wouldn't be able to know the number of such relations involved in the
query unless we start planning such a join. It's late to add new base
relations and assign them new Relids. Of course I haven't thought hard
about it. I haven't looked at the patch to see whether this problem is
solved and how.


I'm curious, which type of problems do you afraid here? Why we need a
range table entry for each scan of non-partitioned relation?



Not RTE but RelOptInfo.

Using the same example as Alexander Korotkov, let's say A is the
nonpartitioned table and P is partitioned table with partitions P1,
P2, ... Pn. The partitionwise join would need to compute AP1, AP2, ...
APn. Each of these joins may have different properties and thus will
require creating paths. In order to save these paths, we need
RelOptInfos which are indentified by relids. Let's assume that the
relids of these join RelOptInfos are created by union of relid of A
and relid of Px (the partition being joined). This is notionally
misleading but doable.


Ok, now I see your disquiet. In current patch we have built RelOptInfo 
for each JOIN(A, Pi) by the build_child_join_rel() routine. And of 
course, they all have different sets of cheapest paths (it is one more 
point of optimality). At this point the RelOptInfo of relation A is 
fully formed and upper joins use the pathlist "as is", without changes.



But the clauses of A parameterized by P will produce different
translations for each of the partitions. I think we will need
different RelOptInfos (for A) to store these translations.


Does the answer above resolved this issue?


The relid is also used to track the scans at executor level. Since we
have so many scans on A, each may be using different plan, we will
need different ids for those.


I don't understand this sentence. Which way executor uses this index of 
RelOptInfo ?


--
regards,
Andrey Lepikhov
Postgres Professional

Re: Asymmetric partition-wise JOIN

2023-10-17 Thread Ashutosh Bapat

On Tue, Oct 17, 2023 at 2:05 PM Andrei Lepikhov
 wrote:
>
> On 16/10/2023 23:21, Ashutosh Bapat wrote:
> > On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov
> > Whenever I visited this idea, I hit one issue prominently - how would
> > we differentiate different scans of the non-partitioned relation.
> > Normally we do that using different Relids but in this case we
> > wouldn't be able to know the number of such relations involved in the
> > query unless we start planning such a join. It's late to add new base
> > relations and assign them new Relids. Of course I haven't thought hard
> > about it. I haven't looked at the patch to see whether this problem is
> > solved and how.
> >
> I'm curious, which type of problems do you afraid here? Why we need a
> range table entry for each scan of non-partitioned relation?
>

Not RTE but RelOptInfo.

Using the same example as Alexander Korotkov, let's say A is the
nonpartitioned table and P is partitioned table with partitions P1,
P2, ... Pn. The partitionwise join would need to compute AP1, AP2, ...
APn. Each of these joins may have different properties and thus will
require creating paths. In order to save these paths, we need
RelOptInfos which are indentified by relids. Let's assume that the
relids of these join RelOptInfos are created by union of relid of A
and relid of Px (the partition being joined). This is notionally
misleading but doable.

But the clauses of A parameterized by P will produce different
translations for each of the partitions. I think we will need
different RelOptInfos (for A) to store these translations.

The relid is also used to track the scans at executor level. Since we
have so many scans on A, each may be using different plan, we will
need different ids for those.

But if you have developed a way to use a single RelOptInfo of A to do
all this, may be we don't need all this. Will take a look at your next
version of patch.

-- 
Best Wishes,
Ashutosh Bapat

Re: Asymmetric partition-wise JOIN

2023-10-17 Thread Andrei Lepikhov


On 16/10/2023 23:21, Ashutosh Bapat wrote:

On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov
Whenever I visited this idea, I hit one issue prominently - how would
we differentiate different scans of the non-partitioned relation.
Normally we do that using different Relids but in this case we
wouldn't be able to know the number of such relations involved in the
query unless we start planning such a join. It's late to add new base
relations and assign them new Relids. Of course I haven't thought hard
about it. I haven't looked at the patch to see whether this problem is
solved and how.

I'm curious, which type of problems do you afraid here? Why we need a 
range table entry for each scan of non-partitioned relation?


--
regards,
Andrey Lepikhov
Postgres Professional

Re: Asymmetric partition-wise JOIN

2023-10-16 Thread Ashutosh Bapat

On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov
 wrote:
>
> >
> > Great!  I'm looking forward to the revised patch
> Before preparing a new patch, it would be better to find the common
> ground in the next issue:
> So far, this optimization stays aside, proposing an alternative path for
> a join RelOptInfo if we have an underlying append path in the outer.
> My back burner is redesigning the approach: asymmetric join doesn't
> change the partitioning scheme and bounds of the partitioned side. So,
> it looks consecutive to make it a part of partitionwise_join machinery
> and implement it as a part of the try_partitionwise_join /
> generate_partitionwise_join_paths routines.
>

I think we need an example where such a join will be faster than
non-partitioned join when both the sides are local. It might be
possible to come up with such an example without writing any code. The
idea would be to rewrite SQL as union of joins.

Whenever I visited this idea, I hit one issue prominently - how would
we differentiate different scans of the non-partitioned relation.
Normally we do that using different Relids but in this case we
wouldn't be able to know the number of such relations involved in the
query unless we start planning such a join. It's late to add new base
relations and assign them new Relids. Of course I haven't thought hard
about it. I haven't looked at the patch to see whether this problem is
solved and how.

-- 
Best Wishes,
Ashutosh Bapat

Re: Asymmetric partition-wise JOIN

2023-10-15 Thread Andrei Lepikhov


On 15/10/2023 17:25, Alexander Korotkov wrote:

On Sun, Oct 15, 2023 at 8:40 AM Andrei Lepikhov
 wrote:

Thanks for such detailed feedback!
The rationale for this patch was to give the optimizer additional ways
to push down more joins into foreign servers. And, because of
asynchronous append, the benefit of that optimization was obvious.
Unfortunately, we hadn't found other applications for this feature,
which was why this patch was postponed in the core.
You have brought new ideas about applying this idea locally. Moreover,
the main issue of the patch was massive memory consumption in the case
of many joins and partitions - because of reparameterization. But now,
postponing the reparameterization proposed in the thread [1] resolves
that problem and gives some insights into the reparameterization
technique of some fields, like lateral references.
Hence, I think we can restart this work.
The first thing here (after rebase, of course) is to figure out and
implement in the cost model cases of effectiveness when asymmetric join
would give significant performance.


Great!  I'm looking forward to the revised patch
Before preparing a new patch, it would be better to find the common 
ground in the next issue:
So far, this optimization stays aside, proposing an alternative path for 
a join RelOptInfo if we have an underlying append path in the outer.
My back burner is redesigning the approach: asymmetric join doesn't 
change the partitioning scheme and bounds of the partitioned side. So, 
it looks consecutive to make it a part of partitionwise_join machinery 
and implement it as a part of the try_partitionwise_join / 
generate_partitionwise_join_paths routines.


--
regards,
Andrey Lepikhov
Postgres Professional

Re: Asymmetric partition-wise JOIN

2023-10-15 Thread Alexander Korotkov

On Sun, Oct 15, 2023 at 8:40 AM Andrei Lepikhov
 wrote:
> Thanks for such detailed feedback!
> The rationale for this patch was to give the optimizer additional ways
> to push down more joins into foreign servers. And, because of
> asynchronous append, the benefit of that optimization was obvious.
> Unfortunately, we hadn't found other applications for this feature,
> which was why this patch was postponed in the core.
> You have brought new ideas about applying this idea locally. Moreover,
> the main issue of the patch was massive memory consumption in the case
> of many joins and partitions - because of reparameterization. But now,
> postponing the reparameterization proposed in the thread [1] resolves
> that problem and gives some insights into the reparameterization
> technique of some fields, like lateral references.
> Hence, I think we can restart this work.
> The first thing here (after rebase, of course) is to figure out and
> implement in the cost model cases of effectiveness when asymmetric join
> would give significant performance.

Great!  I'm looking forward to the revised patch.

--
Regards,
Alexander Korotkov

Re: Asymmetric partition-wise JOIN

2023-10-14 Thread Andrei Lepikhov

On 15/10/2023 07:18, Alexander Korotkov wrote:

Hi Alexander,
Hi Andrey,

Thank you for your work on this subject.

On Mon, Jan 17, 2022 at 1:42 PM Alexander Pyhalov
wrote:

The patch does not longer apply cleanly, so I rebased it. Attaching
rebased version.

Not surprising that the patch doesn't apply after 1.5 years since the
last message. Could you please rebase it?

I read the thread and the patch. The patch improves the joining of
partitioned tables with non-partitioned relations. Let's denote
non-partitioned relation as A, partitions as P1 ... PN. The patch
allows to Append(Join(A, P1), ... Join(A, PN) instead of Join(A,
Append(P1, ... PN). That could be cheaper because it's generally
cheaper to join small pieces rather than do one big join. The
drawback is the need to scan A multiple times. But is this really
necessary and acceptable? Let's consider multiple options.

1) A is non-table. For instance, A is a function scan. In this case,
doing multiple scans of A is not just expensive, but could lead to
unexpected side effects. When the user includes a function once in
the FROM clause, she expects this function to be evaluated once. I
propose that we should materialize a scan of non-table relations. So,
materialized representation will be scanned multiple times, but the
source only scanned once. That would be similar to CTE.
2) A is the table to be scanned with the parametrized path in the
inner part of the nested loop join. In this case, there is no big
scan of A and nothing to materialize.
3) A is the table to be used in merge join or outer part of nested
loop join. In this case, it would be nice to consider materialize.
It's not always good to materialize, because materialization has its
additional costs. I think that could be a cost-based decision.
4) A is used in the hash join. Could we re-use the hashed
representation of A between multiple joins? I read upthread it was
proposed to share a hashed table between multiple background workers
via shared memory. But the first step would be to just share it
between multiple join nodes within the same process.

As we consider joining with each partition individually, there could
be chosen different join methods. As I get, the current patch
considers joining with each of the partitions as a separate isolated
optimization task. However, if we share resources between the
multiple joins, then rises a need for some global optimization. For
instance, a join type could be expensive when applied to an individual
partition, but cheap when applied to all the partitions thanks to
saving the common work.

My idea is to consider generated common resources (such as
materialized scans) as a property of the path. For instance, if the
nested loop join is cheaper than the hash join, but the hash join
generates a common hash map of table A, we don't drop hash join
immediately from the consideration and leave it to see how it could
help join other partitions. What do you think?

Thanks for such detailed feedback!
The rationale for this patch was to give the optimizer additional ways
to push down more joins into foreign servers. And, because of
asynchronous append, the benefit of that optimization was obvious.
Unfortunately, we hadn't found other applications for this feature,
which was why this patch was postponed in the core.
You have brought new ideas about applying this idea locally. Moreover,
the main issue of the patch was massive memory consumption in the case
of many joins and partitions - because of reparameterization. But now,
postponing the reparameterization proposed in the thread [1] resolves
that problem and gives some insights into the reparameterization
technique of some fields, like lateral references.

Hence, I think we can restart this work.
The first thing here (after rebase, of course) is to figure out and
implement in the cost model cases of effectiveness when asymmetric join
would give significant performance.

[1] Oversight in reparameterize_path_by_child leading to executor crash
https://www.postgresql.org/message-id/flat/CAMbWs496%2BN%3DUAjOc%3DrcD3P7B6oJe4rZw08e_TZRUsWbPxZW3Tw%40mail.gmail.com

--
regards,
Andrey Lepikhov
Postgres Professional

Re: Asymmetric partition-wise JOIN

2023-10-14 Thread Alexander Korotkov

Hi Alexander,
Hi Andrey,

Thank you for your work on this subject.

On Mon, Jan 17, 2022 at 1:42 PM Alexander Pyhalov
 wrote:
> The patch does not longer apply cleanly, so I rebased it. Attaching
> rebased version.

Not surprising that the patch doesn't apply after 1.5 years since the
last message.  Could you please rebase it?

I read the thread and the patch.  The patch improves the joining of
partitioned tables with non-partitioned relations.  Let's denote
non-partitioned relation as A, partitions as P1 ... PN.  The patch
allows to Append(Join(A, P1), ... Join(A, PN) instead of Join(A,
Append(P1, ... PN).  That could be cheaper because it's generally
cheaper to join small pieces rather than do one big join.  The
drawback is the need to scan A multiple times.  But is this really
necessary and acceptable?  Let's consider multiple options.

1) A is non-table.  For instance, A is a function scan.  In this case,
doing multiple scans of A is not just expensive, but could lead to
unexpected side effects.  When the user includes a function once in
the FROM clause, she expects this function to be evaluated once.  I
propose that we should materialize a scan of non-table relations.  So,
materialized representation will be scanned multiple times, but the
source only scanned once.  That would be similar to CTE.
2) A is the table to be scanned with the parametrized path in the
inner part of the nested loop join.  In this case, there is no big
scan of A and nothing to materialize.
3) A is the table to be used in merge join or outer part of nested
loop join.  In this case, it would be nice to consider materialize.
It's not always good to materialize, because materialization has its
additional costs.  I think that could be a cost-based decision.
4) A is used in the hash join.  Could we re-use the hashed
representation of A between multiple joins?  I read upthread it was
proposed to share a hashed table between multiple background workers
via shared memory.  But the first step would be to just share it
between multiple join nodes within the same process.

As we consider joining with each partition individually, there could
be chosen different join methods.  As I get, the current patch
considers joining with each of the partitions as a separate isolated
optimization task.  However, if we share resources between the
multiple joins, then rises a need for some global optimization.  For
instance, a join type could be expensive when applied to an individual
partition, but cheap when applied to all the partitions thanks to
saving the common work.

My idea is to consider generated common resources (such as
materialized scans) as a property of the path.  For instance, if the
nested loop join is cheaper than the hash join, but the hash join
generates a common hash map of table A, we don't drop hash join
immediately from the consideration and leave it to see how it could
help join other partitions.  What do you think?

--
Regards,
Alexander Korotkov

Re: Asymmetric partition-wise JOIN

2022-01-17 Thread Alexander Pyhalov

 *root,
 	if (set_join_pathlist_hook)
 		set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
 			   jointype, );
+
+	/*
+	 * 7. If outer relation is delivered from partition-tables, consider
+	 * distributing inner relation to every partition-leaf prior to
+	 * append these leafs.
+	 */
+	try_asymmetric_partitionwise_join(root, joinrel,
+	  outerrel, innerrel,
+	  jointype, );
 }
 
 /*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 9da3ff2f9ab..dadb08ddb06 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -16,6 +16,7 @@
 
 #include "miscadmin.h"
 #include "optimizer/appendinfo.h"
+#include "optimizer/cost.h"
 #include "optimizer/joininfo.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
@@ -1552,6 +1553,192 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	}
 }
 
+/*
+ * Build RelOptInfo on JOIN of each partition of the outer relation and the inner
+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one of
+ * these JOINs is impossible to build.
+ */
+static List *
+extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
+		 RelOptInfo *joinrel,
+		 AppendPath *append_path,
+		 RelOptInfo *inner_rel,
+		 JoinType jointype,
+		 JoinPathExtraData *extra)
+{
+	List		*result = NIL;
+	ListCell	*lc;
+
+	foreach (lc, append_path->subpaths)
+	{
+		Path			*child_path = lfirst(lc);
+		RelOptInfo		*child_rel = child_path->parent;
+		Relids			child_joinrelids;
+		Relids			parent_relids;
+		RelOptInfo		*child_joinrel;
+		SpecialJoinInfo	*child_sjinfo;
+		List			*child_restrictlist;
+
+		child_joinrelids = bms_union(child_rel->relids, inner_rel->relids);
+		parent_relids = bms_union(append_path->path.parent->relids,
+  inner_rel->relids);
+
+		child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+			   child_rel->relids,
+			   inner_rel->relids);
+		child_restrictlist = (List *)
+			adjust_appendrel_attrs_multilevel(root, (Node *)extra->restrictlist,
+			  child_joinrelids, parent_relids);
+
+		child_joinrel = find_join_rel(root, child_joinrelids);
+		if (!child_joinrel)
+			child_joinrel = build_child_join_rel(root,
+ child_rel,
+ inner_rel,
+ joinrel,
+ child_restrictlist,
+ child_sjinfo,
+ jointype);
+		else
+		{
+			/*
+			 * The join relation already exists. For example, it could happen if
+			 * we join two plane tables with partitioned table(s).
+			 * Populating this join with additional paths could push out some
+			 * previously added paths which could be pointed in a subplans list
+			 * of an higher level append.
+			 * Of course, we could save such paths before generating new. But it
+			 * can increase too much the number of paths in complex queries. It
+			 * can be a task for future work.
+			 */
+			return NIL;
+		}
+
+		populate_joinrel_with_paths(root,
+	child_rel,
+	inner_rel,
+	child_joinrel,
+	child_sjinfo,
+	child_restrictlist);
+
+		/* Give up if asymmetric partition-wise join is not available */
+		if (child_joinrel->pathlist == NIL)
+			return NIL;
+
+		set_cheapest(child_joinrel);
+		result = lappend(result, child_joinrel);
+	}
+	return result;
+}
+
+static bool
+is_asymmetric_join_feasible(PlannerInfo *root,
+		   RelOptInfo *outer_rel,
+		   RelOptInfo *inner_rel,
+		   JoinType jointype)
+{
+	ListCell *lc;
+
+	if (jointype != JOIN_INNER && jointype != JOIN_LEFT)
+		return false;
+
+	if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+		return false;
+
+	/* Disallow recursive usage of asymmertic join machinery */
+	if (root->join_rel_level == NULL)
+		return false;
+
+	/*
+	 * Don't allow asymmetric JOIN of two append subplans.
+	 * In the case of a parameterized NL join, a reparameterization procedure
+	 * will lead to large memory allocations and a CPU consumption:
+	 * each reparameterization will induce subpath duplication, creating new
+	 * ParamPathInfo instance and increasing of ppilist up to number of
+	 * partitions in the inner. Also, if we have many partitions, each bitmapset
+	 * variable will be large and many leaks of such variable (caused by relid
+	 * replacement) will highly increase memory consumption.
+	 * So, we deny such paths for now.
+	 */
+	foreach(lc, inner_rel->pathlist)
+	{
+		if (IsA(lfirst(lc), AppendPath))
+			return false;
+	}
+
+	return true;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+  RelOptInfo *joinrel,
+  RelOptInfo *outer_rel,
+  RelOptInfo *inner_rel,
+  JoinType jointype,
+  JoinPathExtraData *extra)
+{
+	ListCell *lc;
+
+	/*
+	 * Try this kind of paths if we allow complex partitionwi

Re: Asymmetric partition-wise JOIN

2021-09-15 Thread Andrey Lepikhov

t;relids, 
inner_rel->relids);
+   parent_relids = bms_union(append_path->path.parent->relids,
+ 
inner_rel->relids);
+
+   child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+   
   child_rel->relids,
+   
   inner_rel->relids);
+   child_restrictlist = (List *)
+   adjust_appendrel_attrs_multilevel(root, (Node 
*)extra->restrictlist,
+   
  child_joinrelids, parent_relids);
+
+   child_joinrel = find_join_rel(root, child_joinrelids);
+   if (!child_joinrel)
+   child_joinrel = build_child_join_rel(root,
+   
 child_rel,
+   
 inner_rel,
+   
 joinrel,
+   
 child_restrictlist,
+   
 child_sjinfo,
+   
 jointype);
+   else
+   {
+   /*
+* The join relation already exists. For example, it 
could happen if
+* we join two plane tables with partitioned table(s).
+* Populating this join with additional paths could 
push out some
+* previously added paths which could be pointed in a 
subplans list
+* of an higher level append.
+* Of course, we could save such paths before 
generating new. But it
+* can increase too much the number of paths in complex 
queries. It
+* can be a task for future work.
+*/
+   return NIL;
+   }
+
+   populate_joinrel_with_paths(root,
+   
child_rel,
+   
inner_rel,
+   
child_joinrel,
+   
child_sjinfo,
+       
child_restrictlist);
+
+   /* Give up if asymmetric partition-wise join is not available */
+   if (child_joinrel->pathlist == NIL)
+   return NIL;
+
+   set_cheapest(child_joinrel);
+   result = lappend(result, child_joinrel);
+   }
+   return result;
+}
+
+static bool
+is_asymmetric_join_feasible(PlannerInfo *root,
+  RelOptInfo *outer_rel,
+  RelOptInfo *inner_rel,
+  JoinType jointype)
+{
+   ListCell *lc;
+
+   if (jointype != JOIN_INNER && jointype != JOIN_LEFT)
+   return false;
+
+   if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+   return false;
+
+   /* Disallow recursive usage of asymmertic join machinery */
+   if (root->join_rel_level == NULL)
+   return false;
+
+   /*
+* Don't allow asymmetric JOIN of two append subplans.
+* In the case of a parameterized NL join, a reparameterization 
procedure
+* will lead to large memory allocations and a CPU consumption:
+* each reparameterization will induce subpath duplication, creating new
+* ParamPathInfo instance and increasing of ppilist up to number of
+* partitions in the inner. Also, if we have many partitions, each 
bitmapset
+* variable will be large and many leaks of such variable (caused by 
relid
+* replacement) will highly increase memory consumption.
+* So, we deny such paths for now.
+*/
+   foreach(lc, inner_rel->pathlist)
+   {
+   if (IsA(lfirst(lc), AppendPath))
+   return false;
+   }
+
+   return true;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo 
*joinrel,
+

Re: Asymmetric partition-wise JOIN

2021-09-14 Thread Andrey V. Lepikhov

d_restrictlist = (List *)
+			adjust_appendrel_attrs_multilevel(root, (Node *)extra->restrictlist,
+			  child_joinrelids, parent_relids);
+
+		child_joinrel = find_join_rel(root, child_joinrelids);
+		if (!child_joinrel)
+			child_joinrel = build_child_join_rel(root,
+ child_rel,
+ inner_rel,
+ joinrel,
+ child_restrictlist,
+ child_sjinfo,
+ jointype);
+		else
+		{
+			/*
+			 * The join relation already exists. For example, it could happen if
+			 * we join two plane tables with partitioned table(s).
+			 * Populating this join with additional paths could push out some
+			 * previously added paths which could be pointed in a subplans list
+			 * of an higher level append.
+			 * Of course, we could save such paths before generating new. But it
+			 * can increase too much the number of paths in complex queries. It
+			 * can be a task for future work.
+			 */
+			return NIL;
+		}
+
+		populate_joinrel_with_paths(root,
+	child_rel,
+					inner_rel,
+	child_joinrel,
+	child_sjinfo,
+	child_restrictlist);
+
+		/* Give up if asymmetric partition-wise join is not available */
+		if (child_joinrel->pathlist == NIL)
+			return NIL;
+
+		set_cheapest(child_joinrel);
+		result = lappend(result, child_joinrel);
+	}
+	return result;
+}
+
+static bool
+is_asymmetric_join_feasible(PlannerInfo *root,
+		   RelOptInfo *outer_rel,
+		   RelOptInfo *inner_rel,
+		   JoinType jointype)
+{
+	ListCell *lc;
+
+	if (jointype != JOIN_INNER && jointype != JOIN_LEFT)
+		return false;
+
+	if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+		return false;
+
+	/* Disallow recursive usage of asymmertic join machinery */
+	if (root->join_rel_level == NULL)
+		return false;
+
+	/*
+	 * Don't allow asymmetric JOIN of two append subplans.
+	 * In the case of a parameterized NL join, a reparameterization procedure
+	 * will lead to large memory allocations and a CPU consumption:
+	 * each reparameterization will induce subpath duplication, creating new
+	 * ParamPathInfo instance and increasing of ppilist up to number of
+	 * partitions in the inner. Also, if we have many partitions, each bitmapset
+	 * variable will be large and many leaks of such variable (caused by relid
+	 * replacement) will highly increase memory consumption.
+	 * So, we deny such paths for now.
+	 */
+	foreach(lc, inner_rel->pathlist)
+	{
+		if (IsA(lfirst(lc), AppendPath))
+			return false;
+	}
+
+	return true;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+  RelOptInfo *joinrel,
+  RelOptInfo *outer_rel,
+  RelOptInfo *inner_rel,
+  JoinType jointype,
+  JoinPathExtraData *extra)
+{
+	ListCell *lc;
+
+	/*
+	 * Try this kind of paths if we allow complex partitionwise joins and we know
+	 * we can build this join safely.
+	 */
+	if (!enable_partitionwise_join ||
+		!is_asymmetric_join_feasible(root, outer_rel, inner_rel, jointype))
+		return;
+
+	foreach (lc, outer_rel->pathlist)
+	{
+		AppendPath *append_path = lfirst(lc);
+
+		/*
+		 * We assume this pathlist keeps at least one AppendPath that
+		 * represents partitioned table-scan, symmetric or asymmetric
+		 * partition-wise join. Asymmetric join isn't needed if the append node
+		 * has only one child.
+		 */
+		if (IsA(append_path, AppendPath) &&
+			list_length(append_path->subpaths) > 1)
+		{
+			List **join_rel_level_saved;
+			List *live_childrels = NIL;
+
+			join_rel_level_saved = root->join_rel_level;
+			PG_TRY();
+			{
+/* temporary disables "dynamic programming" algorithm */
+root->join_rel_level = NULL;
+
+live_childrels =
+	extract_asymmetric_partitionwise_subjoin(root,
+			 joinrel,
+			 append_path,
+			 inner_rel,
+			 jointype,
+			 extra);
+			}
+			PG_FINALLY();
+			{
+root->join_rel_level = join_rel_level_saved;
+			}
+			PG_END_TRY();
+
+			if (live_childrels != NIL)
+			{
+/*
+ * Add new append relation. We must choose cheapest paths after
+ * this operation because the pathlist possibly contains
+ * joinrels and appendrels that can be suboptimal.
+ */
+add_paths_to_append_rel(root, joinrel, live_childrels);
+set_cheapest(joinrel);
+			}
+
+			break;
+		}
+	}
+}
+
 /*
  * Construct the SpecialJoinInfo for a child-join by translating
  * SpecialJoinInfo for the join between parents. left_relids and right_relids
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index e50624c465..fccc0685d7 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -281,24 +281,29 @@ set_plan_references(PlannerInfo *root, Plan *plan)
 
 	/*
 	 * Adjust RT indexes of AppendRelInfos and add to final appendrels list.
-	 * We assume the AppendRelInfos were built during planning and don't

Re: Asymmetric partition-wise JOIN

2021-09-09 Thread Jaime Casanova

On Thu, Sep 09, 2021 at 09:50:46AM +, Aleksander Alekseev wrote:
> It looks like this patch needs to be updated. According to 
> http://cfbot.cputube.org/ it applies but doesn't pass any tests. Changing the 
> status to save time for reviewers.
> 
> The new status of this patch is: Waiting on Author

Just to give some more info to work on I found this patch made postgres
crash with a segmentation fault.

"""
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x556e37ef1b55 in bms_equal (a=0x7f6e37a9c5b0, b=0x7f6e37a9c5b0) at 
bitmapset.c:126
126 if (shorter->words[i] != longer->words[i])
"""

attached are the query that triggers the crash and the backtrace.

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL


bug.sql
Description: application/sql
#0  0x556e37ef1b55 in bms_equal (a=0x7f6e37a9c5b0, b=0x7f6e37a9c5b0) at 
bitmapset.c:126
shorter = 0x7f6e37a9c5b0
longer = 0x7f6e37a9c5b0
shortlen = 2139062143
longlen = 32622
i = 138057
#1  0x556e37fc9f5d in find_param_path_info (rel=0x556e3944c490, 
required_outer=0x7f6e37a9c5b0) at relnode.c:1634
ppi = 0x7f6e37a9cc08
lc__state = {l = 0x7f6e37a9cc40, i = 0}
lc = 0x7f6e37a5f898
#2  0x556e37fbeaeb in reparameterize_path_by_child (root=0x556e394340e0, 
path=0x7f6e37a9c6c8, child_rel=0x7f6e37a77ac0) at pathnode.c:4205
new_path = 0x7f6e37a89408
new_ppi = 0x7f6e37a77d30
old_ppi = 0x7f6e37a9cc08
required_outer = 0x7f6e37a9c5b0
#3  0x556e37f66545 in try_nestloop_path (root=0x556e394340e0, 
joinrel=0x7f6e37a88a80, outer_path=0x7f6e37a788e8, inner_path=0x7f6e37a9c6c8, 
pathkeys=0x0, jointype=JOIN_LEFT, extra=0x7ffde36f3e50)
at joinpath.c:662
required_outer = 0x0
workspace = {startup_cost = 46733.4674, total_cost = 
1195882.0625, run_cost = 1149148.595, inner_run_cost = 0.020375816993464052, 
inner_rescan_run_cost = 0.020375816993464052, 
  outer_rows = 6.9224208015025731e-310, inner_rows = 
6.9224208087218603e-310, outer_skip_rows = 6.9224208013235237e-310, 
inner_skip_rows = 4.640852265067508e-310, numbuckets = 960708832, 
  numbatches = 21870, inner_rows_total = 6.9224208047606396e-310}
innerrel = 0x556e3944c490
outerrel = 0x7f6e37a77ac0
innerrelids = 0x556e3944c2c8
outerrelids = 0x7f6e37a77d30
inner_paramrels = 0x7f6e37a9c5b0
outer_paramrels = 0x0
#4  0x556e37f67bf9 in match_unsorted_outer (root=0x556e394340e0, 
joinrel=0x7f6e37a88a80, outerrel=0x7f6e37a77ac0, innerrel=0x556e3944c490, 
jointype=JOIN_LEFT, extra=0x7ffde36f3e50) at joinpath.c:1702
innerpath = 0x7f6e37a9c6c8
mpath = 0x0
lc2__state = {l = 0x7f6e37a9d0a0, i = 1}
lc2 = 0x7f6e37a9d0c0
outerpath = 0x7f6e37a788e8
merge_pathkeys = 0x0
lc1__state = {l = 0x7f6e37a786c8, i = 0}
save_jointype = JOIN_LEFT
nestjoinOK = true
useallclauses = false
inner_cheapest_total = 0x7f6e37a9c3b0
matpath = 0x7f6e37a89370
lc1 = 0x7f6e37a786e0
__func__ = "match_unsorted_outer"
#5  0x556e37f65c14 in add_paths_to_joinrel (root=0x556e394340e0, 
joinrel=0x7f6e37a88a80, outerrel=0x7f6e37a77ac0, innerrel=0x556e3944c490, 
jointype=JOIN_LEFT, sjinfo=0x7f6e37a88630, 
restrictlist=0x7f6e37a88a28) at joinpath.c:291
extra = {restrictlist = 0x7f6e37a88a28, mergeclause_list = 
0x7f6e37a88e28, inner_unique = true, sjinfo = 0x7f6e37a88630, semifactors = 
{outer_match_frac = 0.00039215686274509802, match_count = 2550}, 
  param_source_rels = 0x0}
mergejoin_allowed = true
lc = 0x0
joinrelids = 0x7f6e37a886e8
#6  0x556e37f6a3d9 in populate_joinrel_with_paths (root=0x556e394340e0, 
rel1=0x7f6e37a77ac0, rel2=0x556e3944c490, joinrel=0x7f6e37a88a80, 
sjinfo=0x7f6e37a88630, restrictlist=0x7f6e37a88a28)
at joinrels.c:825
__func__ = "populate_joinrel_with_paths"
#7  0x556e37f6bca5 in extract_asymmetric_partitionwise_subjoin 
(root=0x556e394340e0, joinrel=0x7f6e37a86bd8, append_path=0x7f6e37a7b6c8, 
inner_rel=0x556e3944c490, jointype=JOIN_LEFT, extra=0x7ffde36f4150)
at joinrels.c:1617
child_path = 0x7f6e37a788e8
child_joinrelids = 0x7f6e37a885e0
child_sjinfo = 0x7f6e37a88630
child_rel = 0x7f6e37a77ac0
parent_relids = 0x7f6e37a88608
child_joinrel = 0x7f6e37a88a80
child_restrictlist = 0x7f6e37a88a28
lc__state = {l = 0x7f6e37a7b8a8, i = 0}
result = 0x0
lc = 0x7f6e37a7b8c0
#8  0x556e37f6bfb8 in try_asymmetric_partitionwise_join 
(root=0x556e394340e0, joinrel=0x7f6e37a86bd8, outer_rel=0x7f6e37a65f68, 
inner_rel=0x556e3944c490, jointype=JOIN_LEFT, extra=0x7ffde36f4150)
at joinrels.c:1713
_save_exception_stack = 0x7ffde36f4b80
_save_context_stack = 0x0
_local_sigjmp_buf = {{__jmpbuf =

Re: Asymmetric partition-wise JOIN

2021-09-09 Thread Aleksander Alekseev

It looks like this patch needs to be updated. According to 
http://cfbot.cputube.org/ it applies but doesn't pass any tests. Changing the 
status to save time for reviewers.

The new status of this patch is: Waiting on Author

Re: Asymmetric partition-wise JOIN

2021-07-15 Thread Ibrar Ahmed

On Thu, Jul 15, 2021 at 11:32 AM Andrey Lepikhov 
wrote:

> On 5/7/21 23:15, Zhihong Yu wrote:
> > On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov
> > mailto:a.lepik...@postgrespro.ru>> wrote:
> > +* Can't imagine situation when join relation already
> > exists. But in
> > +* the 'partition_join' regression test it happens.
> > +* It may be an indicator of possible problems.
> >
> > Should a log be added in the above case ?
> I worked more on this case and found more serious mistake. During
> population of additional paths on the existed RelOptInfo we can remove
> some previously generated paths that pointed from a higher-level list of
> subplans and it could cause to lost of subplan links. I prohibit such
> situation (you can read comments in the new version of the patch).
> Also, choosing of a cheapest path after appendrel creation was added.
> Unstable tests were fixed.
>
> --
> regards,
> Andrey Lepikhov
> Postgres Professional
>

Patch is failing the regression, can you please take a look at that.

partition_join ... FAILED 6328 ms

--
Ibrar Ahmed

Re: Asymmetric partition-wise JOIN

2021-07-15 Thread Andrey Lepikhov

+ 
inner_rel->relids);
+
+   child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+   
   child_rel->relids,
+   
   inner_rel->relids);
+   child_restrictlist = (List *)
+   adjust_appendrel_attrs_multilevel(root, (Node 
*)extra->restrictlist,
+   
  child_joinrelids, parent_relids);
+
+   child_joinrel = find_join_rel(root, child_joinrelids);
+   if (!child_joinrel)
+   child_joinrel = build_child_join_rel(root,
+   
 child_rel,
+   
 inner_rel,
+   
 joinrel,
+   
 child_restrictlist,
+   
 child_sjinfo,
+   
 jointype);
+   else
+   {
+   /*
+* The join relation already exists. For example, it 
could happen if
+* we join two plane tables with partitioned table(s).
+* Populating this join with additional paths could 
push out some
+* previously added paths which could be pointed in a 
subplans list
+* of an higher level append.
+* Of course, we could save such paths before 
generating new. But it
+* can increase too much the number of paths in complex 
queries. It
+* can be a task for future work.
+*/
+   return NIL;
+   }
+
+   populate_joinrel_with_paths(root,
+   
child_rel,
+   
inner_rel,
+   
child_joinrel,
+   
child_sjinfo,
+           
child_restrictlist);
+
+   /* Give up if asymmetric partition-wise join is not available */
+   if (child_joinrel->pathlist == NIL)
+   return NIL;
+
+   set_cheapest(child_joinrel);
+   result = lappend(result, child_joinrel);
+   }
+   return result;
+}
+
+static bool
+is_asymmetric_join_feasible(PlannerInfo *root,
+  RelOptInfo *outer_rel,
+  RelOptInfo *inner_rel,
+  JoinType jointype)
+{
+   ListCell *lc;
+
+   if (jointype != JOIN_INNER && jointype != JOIN_LEFT)
+   return false;
+
+   if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+   return false;
+
+   /* Disallow recursive usage of asymmertic join machinery */
+   if (root->join_rel_level == NULL)
+   return false;
+
+   /*
+* Don't allow asymmetric JOIN of two append subplans.
+* In the case of a parameterized NL join, a reparameterization 
procedure
+* will lead to large memory allocations and a CPU consumption:
+* each reparameterization will induce subpath duplication, creating new
+* ParamPathInfo instance and increasing of ppilist up to number of
+* partitions in the inner. Also, if we have many partitions, each 
bitmapset
+* variable will be large and many leaks of such variable (caused by 
relid
+* replacement) will highly increase memory consumption.
+* So, we deny such paths for now.
+*/
+   foreach(lc, inner_rel->pathlist)
+   {
+   if (IsA(lfirst(lc), AppendPath))
+   return false;
+   }
+
+   return true;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo 
*joinrel,
+ RelOptInfo 
*outer_rel,
+ RelOptInfo 
*inner_rel,
+

Re: Asymmetric partition-wise JOIN

2021-07-06 Thread Alexander Pyhalov


Andrey Lepikhov писал 2021-07-06 12:28:

On 5/7/21 23:15, Zhihong Yu wrote:
On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov 
mailto:a.lepik...@postgrespro.ru>> wrote:
+            * Can't imagine situation when join relation already 
exists. But in

+            * the 'partition_join' regression test it happens.
+            * It may be an indicator of possible problems.
Should a log be added in the above case ?

I made additional analysis of this branch of code. This situation can
happen in the case of one child or if we join two plane tables with
partitioned. Both situations are legal and I think we don't needed to
add any log message here.
Other mistakes were fixed.


Hi.

Small  typo in comment in src/backend/optimizer/plan/setrefs.c:

 281
 282 /*
 283  * Adjust RT indexes of AppendRelInfos and add to final 
appendrels list.
 284  * The AppendRelInfos are copied, because as a part of a 
subplan its could

 285  * be visited many times in the case of asymmetric join.
 286  */
 287 foreach(lc, root->append_rel_list)
 288 {

its  -> it (or they) ?
--
Best regards,
Alexander Pyhalov,
Postgres Professional

Re: Asymmetric partition-wise JOIN

2021-07-06 Thread Andrey Lepikhov

 
   child_rel->relids,
+   
   inner_rel->relids);
+   child_restrictlist = (List *)
+   adjust_appendrel_attrs_multilevel(root, (Node 
*)extra->restrictlist,
+   
  child_joinrelids, parent_relids);
+
+   child_joinrel = find_join_rel(root, child_joinrelids);
+   if (!child_joinrel)
+   child_joinrel = build_child_join_rel(root,
+   
 child_rel,
+   
 inner_rel,
+   
 joinrel,
+   
 child_restrictlist,
+   
 child_sjinfo,
+   
 jointype);
+   else
+   {
+   /*
+* The join relation already exists. For example, it 
could happen if
+* we join two plane tables with partitioned table(s). 
Do nothing.
+*/
+   }
+
+   populate_joinrel_with_paths(root,
+   
child_rel,
+   
inner_rel,
+   
child_joinrel,
+   
child_sjinfo,
+           
child_restrictlist);
+
+   /* Give up if asymmetric partition-wise join is not available */
+   if (child_joinrel->pathlist == NIL)
+   return NIL;
+
+   set_cheapest(child_joinrel);
+   result = lappend(result, child_joinrel);
+   }
+   return result;
+}
+
+static bool
+is_asymmetric_join_feasible(PlannerInfo *root,
+  RelOptInfo *outer_rel,
+  RelOptInfo *inner_rel,
+  JoinType jointype)
+{
+   ListCell *lc;
+
+   if (jointype != JOIN_INNER && jointype != JOIN_LEFT)
+   return false;
+
+   if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+   return false;
+
+   /* Disallow recursive usage of asymmertic join machinery */
+   if (root->join_rel_level == NULL)
+   return false;
+
+   /*
+* Don't allow asymmetric JOIN of two append subplans.
+* In the case of a parameterized NL join, a reparameterization 
procedure
+* will lead to large memory allocations and a CPU consumption:
+* each reparameterization will induce subpath duplication, creating new
+* ParamPathInfo instance and increasing of ppilist up to number of
+* partitions in the inner. Also, if we have many partitions, each 
bitmapset
+* variable will be large and many leaks of such variable (caused by 
relid
+* replacement) will highly increase memory consumption.
+* So, we deny such paths for now.
+*/
+   foreach(lc, inner_rel->pathlist)
+   {
+   if (IsA(lfirst(lc), AppendPath))
+   return false;
+   }
+
+   return true;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo 
*joinrel,
+ RelOptInfo 
*outer_rel,
+ RelOptInfo 
*inner_rel,
+ JoinType 
jointype,
+ 
JoinPathExtraData *extra)
+{
+   ListCell *lc;
+
+   /*
+* Try this kind of paths if we allow complex partitionwise joins and 
we know
+* we can build this join safely.
+*/
+   if (!enable_partitionwise_join ||
+   !is_asymmetric_join_feasible(root, outer_rel, inner_rel, 
jointype))
+   return;
+
+   foreach (lc, outer_rel->pathlist)
+   {
+   AppendPath *append_path = lfirst(lc);
+
+   /*
+* We assume this pathlist keeps at least one AppendPath that
+* r

Re: Asymmetric partition-wise JOIN

2021-07-05 Thread Zhihong Yu

On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov 
wrote:

> On 18/6/21 15:02, Alexander Pyhalov wrote:
> > Andrey Lepikhov писал 2021-05-27 07:27:
> >> Next version of the patch.
> >> For searching any problems I forced this patch during 'make check'
> >> tests. Some bugs were found and fixed.
> >
> > Hi.
> > I've tested this patch and haven't found issues, but I have some
> comments.
> Thank you for review!
> > Variable names - child_join_rel and child_join_relids seem to be
> > inconsistent with rest of the file
> fixed
>
> > When build_child_join_rel() can return NULL?
> Fixed
> > Missing word:
> > each bitmapset variable will large => each bitmapset variable will be
> large
> Fixed
> > What hook do you refer to?
> Removed> You've changed function to copy appinfo, so now comment is
> incorrect.
> Thanks, fixed> Do I understand correctly that now parent_relids also
> contains relids of
> > relations from 'global' inner relation, which we join to childs?
> Yes
>
> --
> regards,
> Andrey Lepikhov
> Postgres Professional
>
Hi,

relations because it could cause CPU and memory huge consumption
during reparameterization of NestLoop path.

CPU and memory huge consumption -> huge consumption of CPU and memory

+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one
of
+ * these JOINs are impossible to build.

at least one of these JOINs are impossible to build. -> at least one
of these JOINs is impossible to build.

+* Can't imagine situation when join relation already exists.
But in
+* the 'partition_join' regression test it happens.
+* It may be an indicator of possible problems.

Should a log be added in the above case ?

+is_asymmetric_join_capable(PlannerInfo *root,

is_asymmetric_join_capable -> is_asymmetric_join_feasible

Cheers

Re: Asymmetric partition-wise JOIN

2021-07-05 Thread Andrey Lepikhov

lids);
+   parent_relids = bms_union(append_path->path.parent->relids,
+ 
inner_rel->relids);
+
+   child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+   
   child_rel->relids,
+   
   inner_rel->relids);
+   child_restrictlist = (List *)
+   adjust_appendrel_attrs_multilevel(root, (Node 
*)extra->restrictlist,
+   
  child_joinrelids, parent_relids);
+
+   child_joinrel = find_join_rel(root, child_joinrelids);
+   if (!child_joinrel)
+   child_joinrel = build_child_join_rel(root,
+   
 child_rel,
+   
 inner_rel,
+   
 joinrel,
+   
 child_restrictlist,
+   
 child_sjinfo,
+   
 jointype);
+   else
+   {
+   /*
+* TODO:
+* Can't imagine situation when join relation already 
exists. But in
+* the 'partition_join' regression test it happens.
+* It may be an indicator of possible problems.
+*/
+   }
+
+   populate_joinrel_with_paths(root,
+   
child_rel,
+   
inner_rel,
+   
child_joinrel,
+   
child_sjinfo,
+           
child_restrictlist);
+
+   /* Give up if asymmetric partition-wise join is not available */
+   if (child_joinrel->pathlist == NIL)
+   return NIL;
+
+   set_cheapest(child_joinrel);
+   result = lappend(result, child_joinrel);
+   }
+   return result;
+}
+
+static bool
+is_asymmetric_join_capable(PlannerInfo *root,
+  RelOptInfo *outer_rel,
+  RelOptInfo *inner_rel,
+  JoinType jointype)
+{
+   ListCell *lc;
+
+   if (jointype != JOIN_INNER && jointype != JOIN_LEFT)
+   return false;
+
+   if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+   return false;
+
+   /* Disallow recursive usage of asymmertic join machinery */
+   if (root->join_rel_level == NULL)
+   return false;
+
+   /*
+* Don't allow asymmetric JOIN of two append subplans.
+* In the case of a parameterized NL join, a reparameterization 
procedure
+* will lead to large memory allocations and a CPU consumption:
+* each reparameterization will induce subpath duplication, creating new
+* ParamPathInfo instance and increasing of ppilist up to number of
+* partitions in the inner. Also, if we have many partitions, each 
bitmapset
+* variable will be large and many leaks of such variable (caused by 
relid
+* replacement) will highly increase memory consumption.
+* So, we deny such paths for now.
+*/
+   foreach(lc, inner_rel->pathlist)
+   {
+   if (IsA(lfirst(lc), AppendPath))
+   return false;
+   }
+
+   return true;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo 
*joinrel,
+ RelOptInfo 
*outer_rel,
+ RelOptInfo 
*inner_rel,
+ JoinType 
jointype,
+ 
JoinPathExtraData *extra)
+{
+   ListCell *lc;
+
+   /*
+* Try this kind of paths if we allow complex partitionwise joins and 
we know
+* we can build this join safely.
+*/
+   if (!enable_partitionwise_join ||
+   !is_asymmetric_

Re: Asymmetric partition-wise JOIN

2021-06-18 Thread Alexander Pyhalov


Andrey Lepikhov писал 2021-05-27 07:27:

Next version of the patch.
For searching any problems I forced this patch during 'make check'
tests. Some bugs were found and fixed.


Hi.
I've tested this patch and haven't found issues, but I have some 
comments.


src/backend/optimizer/path/joinrels.c:

1554
1555 /*
1556  * Build RelOptInfo on JOIN of each partition of the outer relation 
and the inner
1557  * relation. Return List of such RelOptInfo's. Return NIL, if at 
least one of

1558  * these JOINs are impossible to build.
1559  */
1560 static List *
1561 extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
1562 
 RelOptInfo *joinrel,
1563 
 AppendPath *append_path,
1564 
 RelOptInfo *inner_rel,
1565 
 JoinType jointype,
1566 
 JoinPathExtraData *extra)

1567 {
1568 List*result = NIL;
1569 ListCell*lc;
1570
1571 foreach (lc, append_path->subpaths)
1572 {
1573 Path*child_path = lfirst(lc);
1574 RelOptInfo  *child_rel = 
child_path->parent;

1575 Relids  child_join_relids;
1576 Relids  parent_relids;
1577 RelOptInfo  *child_join_rel;
1578 SpecialJoinInfo *child_sjinfo;
1579 List*child_restrictlist;


Variable names - child_join_rel and child_join_relids seem to be 
inconsistent with rest of the file (I see child_joinrelids in 
try_partitionwise_join() and child_joinrel in try_partitionwise_join() 
and get_matching_part_pairs()).



1595 child_join_rel = build_child_join_rel(root,
1596 
  child_rel,
1597 
  inner_rel,
1598 
  joinrel,
1599 
  child_restrictlist,
1600 
  child_sjinfo,
1601 
  jointype);

1602 if (!child_join_rel)
1603 {
1604 /*
1605  * If can't build JOIN between 
inner relation and one of the outer

1606  * partitions - return immediately.
1607  */
1608 return NIL;
1609 }

When build_child_join_rel() can return NULL?
If I read code correctly, joinrel is created in the begining of 
build_child_join_rel() with makeNode(), makeNode() wraps newNode() and 
newNode() uses MemoryContextAllocZero()/MemoryContextAllocZeroAligned(), 
which would error() on alloc() failure.


1637
1638 static bool
1639 is_asymmetric_join_capable(PlannerInfo *root,
1640RelOptInfo 
*outer_rel,
1641RelOptInfo 
*inner_rel,
1642JoinType 
jointype)

1643 {

Function misses a comment.

1656 /*
1657  * Don't allow asymmetric JOIN of two append subplans.
1658  * In the case of a parameterized NL join, a 
reparameterization procedure will

1659  * lead to large memory allocations and a CPU consumption:
1660  * each reparameterize will induce subpath duplication, 
creating new
1661  * ParamPathInfo instance and increasing of ppilist up to 
number of partitions
1662  * in the inner. Also, if we have many partitions, each 
bitmapset
1663  * variable will large and many leaks of such variable 
(caused by relid

1664  * replacement) will highly increase memory consumption.
1665  * So, we deny such paths for now.
1666  */


Missing word:
each bitmapset variable will large => each bitmapset variable will be 
large



1694 foreach (lc, outer_rel->pathlist)
1695 {
1696 AppendPath *append_path = lfirst(lc);
1697
1698 /*
1699  * MEMO: We assume this pathlist keeps at least one 
AppendPath that
1700  * represents partitioned table-scan, symmetric or

Re: Asymmetric partition-wise JOIN

2021-05-26 Thread Andrey Lepikhov

 child_join_rel = find_join_rel(root, child_join_relids);
+   if (!child_join_rel)
+   {
+   child_join_rel = build_child_join_rel(root,
+   
  child_rel,
+   
  inner_rel,
+   
  joinrel,
+   
  child_restrictlist,
+   
  child_sjinfo,
+   
  jointype);
+   if (!child_join_rel)
+   {
+   /*
+* If can't build JOIN between inner relation 
and one of the outer
+* partitions - return immediately.
+*/
+   return NIL;
+   }
+   }
+   else
+   {
+   /*
+* TODO:
+* Can't imagine situation when join relation already 
exists. But in
+* the 'partition_join' regression test it happens.
+* It may be an indicator of possible problems.
+*/
+   }
+
+   populate_joinrel_with_paths(root,
+   
child_rel,
+   
inner_rel,
+   
child_join_rel,
+   
child_sjinfo,
+       
child_restrictlist);
+
+   /* Give up if asymmetric partition-wise join is not available */
+   if (child_join_rel->pathlist == NIL)
+   return NIL;
+
+   set_cheapest(child_join_rel);
+   result = lappend(result, child_join_rel);
+   }
+   return result;
+}
+
+static bool
+is_asymmetric_join_capable(PlannerInfo *root,
+  RelOptInfo *outer_rel,
+  RelOptInfo *inner_rel,
+  JoinType jointype)
+{
+   ListCell *lc;
+
+   if (jointype != JOIN_INNER && jointype != JOIN_LEFT)
+   return false;
+
+   if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+   return false;
+
+   /* Disallow recursive usage of asymmertic join machinery */
+   if (root->join_rel_level == NULL)
+   return false;
+
+   /*
+* Don't allow asymmetric JOIN of two append subplans.
+* In the case of a parameterized NL join, a reparameterization 
procedure will
+* lead to large memory allocations and a CPU consumption:
+* each reparameterize will induce subpath duplication, creating new
+* ParamPathInfo instance and increasing of ppilist up to number of 
partitions
+* in the inner. Also, if we have many partitions, each bitmapset
+* variable will large and many leaks of such variable (caused by relid
+* replacement) will highly increase memory consumption.
+* So, we deny such paths for now.
+*/
+   foreach(lc, inner_rel->pathlist)
+   {
+   if (IsA(lfirst(lc), AppendPath))
+   return false;
+   }
+
+   return true;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo 
*joinrel,
+ RelOptInfo 
*outer_rel,
+ RelOptInfo 
*inner_rel,
+ JoinType 
jointype,
+ 
JoinPathExtraData *extra)
+{
+   ListCell *lc;
+
+   /*
+* Try this kind of paths if we allow complex partitionwise joins and 
we know
+* we can build this join safely.
+*/
+   if (!enable_partitionwise_join ||
+   !is_asymmetric_join_capable(root, outer_rel, inner_rel, 
jointype))
+   return;
+
+   foreach (lc, outer_rel->pathlist)
+   {
+   AppendPath *append_path = lfirst(lc);
+
+   /*
+* MEMO: We assume this pathlist keeps at least one AppendPath 
that
+

Re: Asymmetric partition-wise JOIN

2021-04-29 Thread Andrey V. Lepikhov


On 11/30/20 7:43 PM, Anastasia Lubennikova wrote:
This entry was inactive during this CF, so I've marked it as returned 
with feedback. Feel free to resubmit an updated version to a future 
commitfest. 
I return the patch to commitfest. My current reason differs from reason 
of origin author.
This patch can open a door for more complex optimizations in the 
partitionwise join push-down technique.
I mean, we can push-down join not only of two partitioned tables with 
the same partition schema, but a partitioned (sharded) table with an 
arbitrary subplan that is provable independent of local resources.


Example:

CREATE TABLE p(a int) PARTITION BY HASH (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 0);
CREATE TABLE p2 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 1);
CREATE TABLE p3 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 2);

SELECT * FROM p, (SELECT * FROM generate_series(1,2) AS a) AS s
WHERE p.a=s.a;

 Hash Join
   Hash Cond: (p.a = a.a)
   ->  Append
 ->  Seq Scan on p1 p_1
 ->  Seq Scan on p2 p_2
 ->  Seq Scan on p3 p_3
   ->  Hash
 ->  Function Scan on generate_series a

But with asymmetric join feature we have the plan:

 Append
   ->  Hash Join
 Hash Cond: (p_1.a = a.a)
 ->  Seq Scan on p1 p_1
 ->  Hash
   ->  Function Scan on generate_series a
   ->  Hash Join
 Hash Cond: (p_2.a = a.a)
 ->  Seq Scan on p2 p_2
 ->  Hash
   ->  Function Scan on generate_series a
   ->  Hash Join
 Hash Cond: (p_3.a = a.a)
 ->  Seq Scan on p3 p_3
 ->  Hash
   ->  Function Scan on generate_series a

In the case of FDW-sharding it means that if we can prove that the inner 
relation is independent from the execution server, we can push-down 
these joins and execute it in parallel.


--
regards,
Andrey Lepikhov
Postgres Professional

Re: Asymmetric partition-wise JOIN

2021-04-09 Thread Andrey V. Lepikhov


On 11/30/20 7:43 PM, Anastasia Lubennikova wrote:
This entry was inactive during this CF, so I've marked it as returned 
with feedback. Feel free to resubmit an updated version to a future 
commitfest.


Attached version is rebased on current master and fixes problems with 
complex parameterized plans - 'reparameterize by child' feature.
Problems with reparameterization machinery can be demonstrated by TPC-H 
benchmark.


--
regards,
Andrey Lepikhov
Postgres Professional
>From 6a15a52bfb90659c51b3a918d48037c474ffe9dd Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Fri, 2 Apr 2021 11:02:20 +0500
Subject: [PATCH] Asymmetric partitionwise join.

Teach optimizer to consider partitionwise join of non-partitioned
table with each partition of partitioned table.
This technique cause changes of 'reparameterize by child' machinery.
---
 src/backend/optimizer/path/joinpath.c|   9 +
 src/backend/optimizer/path/joinrels.c| 151 ++
 src/backend/optimizer/util/appendinfo.c  |  28 ++-
 src/backend/optimizer/util/pathnode.c|   9 +-
 src/backend/optimizer/util/relnode.c |  14 +-
 src/include/optimizer/paths.h|   7 +-
 src/test/regress/expected/partition_join.out | 209 +++
 src/test/regress/sql/partition_join.sql  |  99 +
 8 files changed, 509 insertions(+), 17 deletions(-)

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index e9b6968b1d..6ba6d32ae4 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -335,6 +335,15 @@ add_paths_to_joinrel(PlannerInfo *root,
 	if (set_join_pathlist_hook)
 		set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
 			   jointype, );
+
+	/*
+	 * 7. If outer relation is delivered from partition-tables, consider
+	 * distributing inner relation to every partition-leaf prior to
+	 * append these leafs.
+	 */
+	try_asymmetric_partitionwise_join(root, joinrel,
+	  outerrel, innerrel,
+	  jointype, );
 }
 
 /*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 0dbe2ac726..6f900475bb 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -16,6 +16,7 @@
 
 #include "miscadmin.h"
 #include "optimizer/appendinfo.h"
+#include "optimizer/cost.h"
 #include "optimizer/joininfo.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
@@ -1551,6 +1552,156 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	}
 }
 
+/*
+ * Build RelOptInfo on JOIN of each partition of the outer relation and the inner
+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one of
+ * these JOINs are impossible to build.
+ */
+static List *
+extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
+		 RelOptInfo *joinrel,
+		 AppendPath *append_path,
+		 RelOptInfo *inner_rel,
+		 JoinType jointype,
+		 JoinPathExtraData *extra)
+{
+	List		*result = NIL;
+	ListCell	*lc;
+
+	foreach (lc, append_path->subpaths)
+	{
+		Path			*child_path = lfirst(lc);
+		RelOptInfo		*child_rel = child_path->parent;
+		Relids			child_join_relids;
+		RelOptInfo		*child_join_rel;
+		SpecialJoinInfo	*child_sjinfo;
+		List			*child_restrictlist;
+		AppendRelInfo	**appinfos;
+		intnappinfos;
+
+		child_join_relids = bms_union(child_rel->relids,
+	  inner_rel->relids);
+		appinfos = find_appinfos_by_relids(root, child_join_relids,
+		   );
+		child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+			   child_rel->relids,
+			   inner_rel->relids);
+		child_restrictlist = (List *)
+			adjust_appendrel_attrs(root, (Node *)extra->restrictlist,
+   nappinfos, appinfos);
+		pfree(appinfos);
+
+		child_join_rel = find_join_rel(root, child_join_relids);
+		if (!child_join_rel)
+		{
+			child_join_rel = build_child_join_rel(root,
+  child_rel,
+  inner_rel,
+  joinrel,
+  child_restrictlist,
+  child_sjinfo,
+  jointype);
+			if (!child_join_rel)
+			{
+/*
+ * If can't build JOIN between inner relation and one of the outer
+ * partitions - return immediately.
+ */
+return NIL;
+			}
+		}
+		else
+		{
+			/*
+			 * TODO:
+			 * Can't imagine situation when join relation already exists. But in
+			 * the 'partition_join' regression test it happens.
+			 * It may be an indicator of possible problems.
+			 */
+		}
+
+		populate_joinrel_with_paths(root,
+	child_rel,
+	inner_rel,
+	child_join_rel,
+					child_sjinfo,
+	child_restrictlist);
+
+		/* Give up if asymmetric partition-wise join is not available */
+		if (child_join_rel->pathlist == NIL)
+			return NIL;
+
+		set_cheapest(chi

Re: Asymmetric partition-wise JOIN

2020-11-30 Thread Anastasia Lubennikova


On 09.11.2020 13:53, Anastasia Lubennikova wrote:

On 21.08.2020 09:02, Andrey V. Lepikhov wrote:

On 7/1/20 2:10 PM, Daniel Gustafsson wrote:

On 27 Dec 2019, at 08:34, Kohei KaiGai  wrote:


The attached v2 fixed the problem, and regression test finished 
correctly.


This patch no longer applies to HEAD, please submit an rebased version.
Marking the entry Waiting on Author in the meantime.

Rebased version of the patch on current master (d259afa736).

I rebased it because it is a base of my experimental feature than we 
don't break partitionwise join of a relation with foreign partition 
and a local relation if we have info that remote server has foreign 
table link to the local relation (by analogy with shippable extensions).


Maybe mark as 'Needs review'?


Status update for a commitfest entry.

According to cfbot, the patch fails to apply. Could you please send a 
rebased version?


This thread was inactive for quite some time. Is anyone going to 
continue working on it?


I see some interest in the idea of sharable hash, but I don't see even 
a prototype in this thread. So, probably, it is a matter of a separate 
discussion.


Also, I took a look at the code. It looks like it needs some extra 
work. I am not a big expert in this area, so I'm sorry if questions 
are obvious.


1. What would happen if this assumption is not met?

+         * MEMO: We assume this pathlist keeps at least one 
AppendPath that

+         * represents partitioned table-scan, symmetric or asymmetric
+         * partition-wise join. It is not correct right now, however, 
a hook
+         * on add_path() to give additional decision for path removel 
allows

+         * to retain this kind of AppendPath, regardless of its cost.

2. Why do we wrap extract_asymmetric_partitionwise_subjoin() call into 
PG_TRY/PG_CATCH? What errors do we expect?


3. It looks like a crutch. If it isn't, I'd like to see a better 
comment about why "dynamic programming" is not applicable here.

And shouldn't we also handle a root->join_cur_level?

+                /* temporary disables "dynamic programming" algorithm */
+                root->join_rel_level = NULL;

4. This change looks like it can lead to a memory leak for old code. 
Maybe it is never the case, but again I think it worth a comment.


-    /* If there's nothing to adjust, don't call this function. */
-    Assert(nappinfos >= 1 && appinfos != NULL);
+    /* If there's nothing to adjust, just return a duplication */
+    if (nappinfos == 0)
+        return copyObject(node);

5. extract_asymmetric_partitionwise_subjoin() lacks a comment

The new status of this patch is: Waiting on Author


Status update for a commitfest entry.

This entry was inactive during this CF, so I've marked it as returned 
with feedback. Feel free to resubmit an updated version to a future 
commitfest.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Asymmetric partition-wise JOIN

2020-11-09 Thread Anastasia Lubennikova


On 21.08.2020 09:02, Andrey V. Lepikhov wrote:

On 7/1/20 2:10 PM, Daniel Gustafsson wrote:

On 27 Dec 2019, at 08:34, Kohei KaiGai  wrote:


The attached v2 fixed the problem, and regression test finished 
correctly.


This patch no longer applies to HEAD, please submit an rebased version.
Marking the entry Waiting on Author in the meantime.

Rebased version of the patch on current master (d259afa736).

I rebased it because it is a base of my experimental feature than we 
don't break partitionwise join of a relation with foreign partition 
and a local relation if we have info that remote server has foreign 
table link to the local relation (by analogy with shippable extensions).


Maybe mark as 'Needs review'?


Status update for a commitfest entry.

According to cfbot, the patch fails to apply. Could you please send a 
rebased version?


This thread was inactive for quite some time. Is anyone going to 
continue working on it?


I see some interest in the idea of sharable hash, but I don't see even a 
prototype in this thread. So, probably, it is a matter of a separate 
discussion.


Also, I took a look at the code. It looks like it needs some extra work. 
I am not a big expert in this area, so I'm sorry if questions are obvious.


1. What would happen if this assumption is not met?

+         * MEMO: We assume this pathlist keeps at least one AppendPath that
+         * represents partitioned table-scan, symmetric or asymmetric
+         * partition-wise join. It is not correct right now, however, a 
hook
+         * on add_path() to give additional decision for path removel 
allows

+         * to retain this kind of AppendPath, regardless of its cost.

2. Why do we wrap extract_asymmetric_partitionwise_subjoin() call into 
PG_TRY/PG_CATCH? What errors do we expect?


3. It looks like a crutch. If it isn't, I'd like to see a better comment 
about why "dynamic programming" is not applicable here.

And shouldn't we also handle a root->join_cur_level?

+                /* temporary disables "dynamic programming" algorithm */
+                root->join_rel_level = NULL;

4. This change looks like it can lead to a memory leak for old code. 
Maybe it is never the case, but again I think it worth a comment.


-    /* If there's nothing to adjust, don't call this function. */
-    Assert(nappinfos >= 1 && appinfos != NULL);
+    /* If there's nothing to adjust, just return a duplication */
+    if (nappinfos == 0)
+        return copyObject(node);

5. extract_asymmetric_partitionwise_subjoin() lacks a comment

The new status of this patch is: Waiting on Author

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Asymmetric partition-wise JOIN

2020-08-26 Thread Amul Sul

On Sat, Aug 24, 2019 at 2:03 PM Kohei KaiGai  wrote:
>
> 2019年8月24日(土) 7:02 Thomas Munro :
> >
> > On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai  wrote:
> > > We can consider the table join ptable X t1 above is equivalent to:
> > >   (ptable_p0 + ptable_p1 + ptable_p2) X t1
> > > = (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
> > > It returns an equivalent result, however, rows are already reduced by 
> > > HashJoin
> > > in the individual leaf of Append, so CPU-cycles consumed by Append node 
> > > can
> > > be cheaper.
> > >
> > > On the other hands, it has a downside because t1 must be read 3 times and
> > > hash table also must be built 3 times. It increases the expected cost,
> > > so planner
> > > may not choose the asymmetric partition-wise join plan.
> >
> > What if you include the partition constraint as a filter on t1?  So you get:
> >
> > ptable X t1 =
> >   (ptable_p0 X (σ hash(dist)%4=0 (t1))) +
> >   (ptable_p1 X (σ hash(dist)%4=1 (t1))) +
> >   (ptable_p2 X (σ hash(dist)%4=2 (t1))) +
> >   (ptable_p3 X (σ hash(dist)%4=3 (t1)))
> >
> > Pros:
> > 1.  The hash tables will not contain unnecessary junk.
> > 2.  You'll get the right answer if t1 is on the outer side of an outer join.
> > 3.  If this runs underneath a Parallel Append and t1 is big enough
> > then workers will hopefully cooperate and do a synchronised scan of
> > t1.
> > 4.  The filter might enable a selective and efficient plan like an index 
> > scan.
> >
> > Cons:
> > 1.  The filter might not enable a selective and efficient plan, and
> > therefore cause extra work.
> >
> > (It's a little weird in this example because don't usually see hash
> > functions in WHERE clauses, but that could just as easily be dist
> > BETWEEN 1 AND 99 or any other partition constraint.)
> >
> It requires the join-key must include the partition key and also must be
> equality-join, doesn't it?
> If ptable and t1 are joined using ptable.dist = t1.foo, we can distribute
> t1 for each leaf table with "WHERE hash(foo)%4 = xxx" according to
> the partition bounds, indeed.
>
> In case when some of partition leafs are pruned, it is exactly beneficial
> because relevant rows to be referenced by the pruned child relations
> are waste of memory.
>
> On the other hands, it eventually consumes almost equivalent amount
> of memory to load the inner relations, if no leafs are pruned, and if we
> could extend the Hash-node to share the hash-table with sibling join-nodess.
>
> > > One idea I have is, sibling HashJoin shares a hash table that was built 
> > > once
> > > by any of the sibling Hash plan. Right now, it is not implemented yet.
> >
> > Yeah, I've thought a little bit about that in the context of Parallel
> > Repartition.  I'm interested in combining intra-node partitioning
> > (where a single plan node repartitions data among workers on the fly)
> > with inter-node partitioning (like PWJ, where partitions are handled
> > by different parts of the plan, considered at planning time); you
> > finish up needing to have nodes in the plan that 'receive' tuples for
> > each partition, to match up with the PWJ plan structure.  That's not
> > entirely unlike CTE references, and not entirely unlike your idea of
> > somehow sharing the same hash table.  I ran into a number of problems
> > while thinking about that, which I should write about in another
> > thread.
> >
> Hmm. Do you intend the inner-path may have different behavior according
> to the partition bounds definition where the outer-path to be joined?
> Let me investigate its pros & cons.
>
> The reasons why I think the idea of sharing the same hash table is reasonable
> in this scenario are:
> 1. We can easily extend the idea for parallel optimization. A hash table on 
> DSM
> segment, once built, can be shared by all the siblings in all the
> parallel workers.
> 2. We can save the memory consumption regardless of the join-keys and
> partition-keys, even if these are not involved in the query.
>
> On the other hands, below are the downside. Potentially, combined use of
> your idea may help these cases:
> 3. Distributed inner-relation cannot be outer side of XXX OUTER JOIN.
> 4. Hash table contains rows to be referenced by only pruned partition leafs.
>

+ many, for the sharable hash of the inner table of the join. IMHO,
this could be the most interesting and captivating thing about this feature.
But might be a complicated piece, is that still on the plan?

Regards,
Amul

Re: Asymmetric partition-wise JOIN

2020-08-25 Thread Daniel Gustafsson

> On 21 Aug 2020, at 08:02, Andrey V. Lepikhov  
> wrote:
> 
> On 7/1/20 2:10 PM, Daniel Gustafsson wrote:
>>> On 27 Dec 2019, at 08:34, Kohei KaiGai  wrote:
>>> The attached v2 fixed the problem, and regression test finished correctly.
>> This patch no longer applies to HEAD, please submit an rebased version.
>> Marking the entry Waiting on Author in the meantime.
> Rebased version of the patch on current master (d259afa736).
> 
> I rebased it because it is a base of my experimental feature than we don't 
> break partitionwise join of a relation with foreign partition and a local 
> relation if we have info that remote server has foreign table link to the 
> local relation (by analogy with shippable extensions).
> 
> Maybe mark as 'Needs review'?

Thanks for the rebase, I've updated the commitfest entry to reflect that it
needs a round of review.

cheers ./daniel

Re: Asymmetric partition-wise JOIN

2020-08-21 Thread Andrey V. Lepikhov

inInfo	*child_sjinfo;
+		List			*child_restrictlist;
+		AppendRelInfo	**appinfos;
+		intnappinfos;
+
+		child_join_relids = bms_union(child_rel->relids,
+	  inner_rel->relids);
+		appinfos = find_appinfos_by_relids(root, child_join_relids,
+		   );
+		child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+			   child_rel->relids,
+			   inner_rel->relids);
+		child_restrictlist = (List *)
+			adjust_appendrel_attrs(root, (Node *)extra->restrictlist,
+   nappinfos, appinfos);
+		pfree(appinfos);
+
+		child_join_rel = find_join_rel(root, child_join_relids);
+		if (!child_join_rel)
+		{
+			child_join_rel = build_child_join_rel(root,
+  child_rel,
+  inner_rel,
+  joinrel,
+  child_restrictlist,
+  child_sjinfo,
+  jointype);
+			if (!child_join_rel)
+return NIL;
+		}
+
+		populate_joinrel_with_paths(root,
+	child_rel,
+	inner_rel,
+	child_join_rel,
+					child_sjinfo,
+	child_restrictlist);
+
+		/* Give up if asymmetric partition-wise join is not available */
+		if (child_join_rel->pathlist == NIL)
+			return NIL;
+
+		set_cheapest(child_join_rel);
+		result = lappend(result, child_join_rel);
+	}
+	return result;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+  RelOptInfo *joinrel,
+  RelOptInfo *outer_rel,
+  RelOptInfo *inner_rel,
+  JoinType jointype,
+  JoinPathExtraData *extra)
+{
+	ListCell *lc;
+
+	if (!enable_partitionwise_join)
+		return;
+
+	if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+		return;
+
+	if (jointype != JOIN_INNER && jointype != JOIN_LEFT)
+		return;
+
+	foreach (lc, outer_rel->pathlist)
+	{
+		AppendPath *append_path = lfirst(lc);
+
+		/*
+		 * MEMO: We assume this pathlist keeps at least one AppendPath that
+		 * represents partitioned table-scan, symmetric or asymmetric
+		 * partition-wise join. It is not correct right now, however, a hook
+		 * on add_path() to give additional decision for path removel allows
+		 * to retain this kind of AppendPath, regardless of its cost.
+		 */
+		if (IsA(append_path, AppendPath) &&
+			append_path->partitioned_rels != NIL)
+		{
+			List **join_rel_level_saved;
+			List *live_childrels = NIL;
+
+			join_rel_level_saved = root->join_rel_level;
+			PG_TRY();
+			{
+/* temporary disables "dynamic programming" algorithm */
+root->join_rel_level = NULL;
+
+live_childrels =
+	extract_asymmetric_partitionwise_subjoin(root,
+			 joinrel,
+			 append_path,
+			 inner_rel,
+			 jointype,
+			 extra);
+			}
+			PG_CATCH();
+			{
+root->join_rel_level = join_rel_level_saved;
+PG_RE_THROW();
+			}
+			PG_END_TRY();
+			root->join_rel_level = join_rel_level_saved;
+
+			if (live_childrels != NIL)
+add_paths_to_append_rel(root, joinrel, live_childrels,
+		append_path->partitioned_rels);
+			break;
+		}
+	}
+}
+
 /*
  * Construct the SpecialJoinInfo for a child-join by translating
  * SpecialJoinInfo for the join between parents. left_relids and right_relids
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b40a112c25..863fb79f03 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -7536,7 +7536,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		}
 
 		/* Build new paths for this relation by appending child paths. */
-		add_paths_to_append_rel(root, rel, live_children);
+		add_paths_to_append_rel(root, rel, live_children, NIL);
 	}
 
 	/*
@@ -7689,7 +7689,7 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 		Assert(partially_grouped_live_children != NIL);
 
 		add_paths_to_append_rel(root, partially_grouped_rel,
-partially_grouped_live_children);
+partially_grouped_live_children, NIL);
 
 		/*
 		 * We need call set_cheapest, since the finalization step will use the
@@ -7704,7 +7704,7 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 	{
 		Assert(grouped_live_children != NIL);
 
-		add_paths_to_append_rel(root, grouped_rel, grouped_live_children);
+		add_paths_to_append_rel(root, grouped_rel, grouped_live_children, NIL);
 	}
 }
 
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index d722063cf3..32230cf877 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -201,8 +201,9 @@ adjust_appendrel_attrs(PlannerInfo *root, Node *node, int nappinfos,
 	context.nappinfos = nappinfos;
 	context.appinfos = appinfos;
 
-	/* If there's nothing to adjust, don't call this function. */
-	Assert(nappinfos >= 1 && appinfos != NULL);
+	/* If there's nothing to adjust, just return a duplication */
+	if (nappinfos == 0)
+		return copyObject(node);
 
 	/*
 	 * M

Re: Asymmetric partition-wise JOIN

2020-07-06 Thread Andrey V. Lepikhov


On 12/27/19 12:34 PM, Kohei KaiGai wrote:

The attached v2 fixed the problem, and regression test finished correctly.
Using your patch I saw incorrect value of predicted rows at the top node 
of the plan: "Append  (cost=270.02..35165.37 rows=40004 width=16)"
Full explain of the query plan see in attachment - 
explain_with_asymmetric.sql


if  I disable enable_partitionwise_join then:
"Hash Join  (cost=270.02..38855.25 rows=10001 width=16)"
Full explain - explain_no_asymmetric.sql

I thought that is the case of incorrect usage of cached values of 
norm_selec, but it is a corner-case problem of the eqjoinsel() routine :


selectivity = 1/size_of_larger_relation; (selfuncs.c:2567)
tuples = selectivity * outer_tuples * inner_tuples; (costsize.c:4607)

i.e. number of tuples depends only on size of smaller relation.
It is not a bug of your patch but I think you need to know because it 
may affect on planner decision.


===
P.S. Test case:
CREATE TABLE t0 (a serial, b int);
INSERT INTO t0 (b) (SELECT * FROM generate_series(1e4, 2e4) as g);
CREATE TABLE parts (a serial, b int) PARTITION BY HASH(a)
INSERT INTO parts (b) (SELECT * FROM generate_series(1, 1e6) as g);

--
regards,
Andrey Lepikhov
Postgres Professional


explain_with_asymmetric.sql
Description: application/sql


explain_no_asymmetric.sql
Description: application/sql

Re: Asymmetric partition-wise JOIN

2020-07-01 Thread Daniel Gustafsson

> On 27 Dec 2019, at 08:34, Kohei KaiGai  wrote:

> The attached v2 fixed the problem, and regression test finished correctly.

This patch no longer applies to HEAD, please submit an rebased version.
Marking the entry Waiting on Author in the meantime.

cheers ./daniel

Re: Asymmetric partition-wise JOIN

2020-03-27 Thread David Steele


Hi Thomas,

On 12/27/19 2:34 AM, Kohei KaiGai wrote:
>

This crash was reproduced on our environment also.
It looks to me adjust_child_relids_multilevel() didn't expect a case
when supplied 'relids'
(partially) indicate normal and non-partitioned relation.
It tries to build a new 'parent_relids' that is a set of
appinfo->parent_relid related to the
supplied 'child_relids'. However, bits in child_relids that indicate
normal relations are
unintentionally dropped here. Then, adjust_child_relids_multilevel()
goes to an infinite
recursion until stack limitation.

The attached v2 fixed the problem, and regression test finished correctly.


Any thoughts on the new version of this patch?

Regards,
--
-David
da...@pgmasters.net

Re: Asymmetric partition-wise JOIN

2019-12-26 Thread Kohei KaiGai

Hello,

This crash was reproduced on our environment also.
It looks to me adjust_child_relids_multilevel() didn't expect a case
when supplied 'relids'
(partially) indicate normal and non-partitioned relation.
It tries to build a new 'parent_relids' that is a set of
appinfo->parent_relid related to the
supplied 'child_relids'. However, bits in child_relids that indicate
normal relations are
unintentionally dropped here. Then, adjust_child_relids_multilevel()
goes to an infinite
recursion until stack limitation.

The attached v2 fixed the problem, and regression test finished correctly.

Best regards,

2019年12月1日(日) 12:24 Michael Paquier :
>
> On Sat, Aug 24, 2019 at 05:33:01PM +0900, Kohei KaiGai wrote:
> > On the other hands, it eventually consumes almost equivalent amount
> > of memory to load the inner relations, if no leafs are pruned, and if we
> > could extend the Hash-node to share the hash-table with sibling
> > join-nodess.
>
> The patch crashes when running the regression tests, per the report of
> the automatic patch tester.  Could you look at that?  I have moved the
> patch to nexf CF, waiting on author.
> --
> Michael



-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 
#12912 0x006f9471 in adjust_child_relids_multilevel (
root=root@entry=0x2115240, relids=relids@entry=0x2106df8,
child_relids=child_relids@entry=0x0,
top_parent_relids=top_parent_relids@entry=0x20afd88) at appendinfo.c:602
#12913 0x006f9471 in adjust_child_relids_multilevel (
root=root@entry=0x2115240, relids=relids@entry=0x2106df8,
child_relids=child_relids@entry=0x0,
top_parent_relids=top_parent_relids@entry=0x20afd88) at appendinfo.c:602
#12914 0x006f9471 in adjust_child_relids_multilevel (
root=root@entry=0x2115240, relids=relids@entry=0x2106df8,
child_relids=child_relids@entry=0x0,
top_parent_relids=top_parent_relids@entry=0x20afd88) at appendinfo.c:602
#12915 0x006f9471 in adjust_child_relids_multilevel (
root=root@entry=0x2115240, relids=relids@entry=0x2106df8,
child_relids=child_relids@entry=0x20b0bb8,
top_parent_relids=top_parent_relids@entry=0x20afd88) at appendinfo.c:602
#12916 0x006f9471 in adjust_child_relids_multilevel (
root=root@entry=0x2115240, relids=relids@entry=0x2106df8,
child_relids=child_relids@entry=0x20af9d8,
top_parent_relids=top_parent_relids@entry=0x20afd88) at appendinfo.c:602
#12917 0x006c9805 in add_child_join_rel_equivalences (
root=root@entry=0x2115240, nappinfos=1, appinfos=appinfos@entry=0x20afdb0,
parent_joinrel=parent_joinrel@entry=0x20ae420,
child_joinrel=child_joinrel@entry=0x20aef10) at equivclass.c:2458
#12918 0x0070bc07 in build_child_join_rel (root=root@entry=0x2115240,
outer_rel=outer_rel@entry=0x210a318, inner_rel=inner_rel@entry=0x21051a0,
parent_joinrel=parent_joinrel@entry=0x20ae420,
restrictlist=restrictlist@entry=0x20afcd8, sjinfo=sjinfo@entry=0x20afa00,
jointype=JOIN_INNER) at relnode.c:896
#12919 0x006d2dcf in extract_asymmetric_partitionwise_subjoin (
append_path=0x20a4000, extra=0x7ffea461c710, extra=0x7ffea461c710,
jointype=JOIN_INNER, inner_rel=0x21051a0, joinrel=0x20ae420,
root=0x2115240) at joinrels.c:1573
#12920 try_asymmetric_partitionwise_join (root=root@entry=0x2115240,
joinrel=joinrel@entry=0x20ae420, outer_rel=outer_rel@entry=0x207cce0,
inner_rel=inner_rel@entry=0x21051a0, jointype=jointype@entry=JOIN_INNER,
extra=extra@entry=0x7ffea461c710) at joinrels.c:1641
#12921 0x006cf296 in add_paths_to_joinrel (root=root@entry=0x2115240,
joinrel=joinrel@entry=0x20ae420, outerrel=outerrel@entry=0x207cce0,
innerrel=innerrel@entry=0x21051a0, jointype=jointype@entry=JOIN_INNER,
sjinfo=sjinfo@entry=0x7ffea461c8c0, restrictlist=0x20ae898)
at joinpath.c:333
#12922 0x006d1a3a in populate_joinrel_with_paths (root=0x2115240,
rel1=0x207cce0, rel2=0x21051a0, joinrel=0x20ae420, sjinfo=0x7ffea461c8c0,
restrictlist=0x20ae898) at joinrels.c:804
#12923 0x006d2505 in make_join_rel (root=root@entry=0x2115240,
rel1=rel1@entry=0x207cce0, rel2=rel2@entry=0x21051a0) at joinrels.c:757
#12924 0x006d278d in make_rels_by_clause_joins (
other_rels=, other_rels_list=0x20ae390, old_rel=0x207cce0,
root=0x2115240) at joinrels.c:309
#12925 join_search_one_level (root=root@entry=0x2115240, level=level@entry=2)
at joinrels.c:120
#12926 0x006bdd4b in standard_join_search (root=0x2115240,
levels_needed=3, initial_rels=) at allpaths.c:2879
#12927 0x006be1c4 in make_one_rel (root=root@entry=0x2115240,
joinlist=joinlist@entry=0x2106b80) at allpaths.c:227
#12928 0x006e1cdb in query_planner (root=root@entry=0x2115240,
qp_callback=qp_callback@entry=0x6e22b0 ,
qp_extra=qp_extra@entry=0x7ffea461cb60) at planmain.c:269
#12929 0x006e6962 in grouping_planner (root=,
inheritance_update=false, tuple_fraction=) at

Re: Asymmetric partition-wise JOIN

2019-11-30 Thread Michael Paquier

On Sat, Aug 24, 2019 at 05:33:01PM +0900, Kohei KaiGai wrote:
> On the other hands, it eventually consumes almost equivalent amount
> of memory to load the inner relations, if no leafs are pruned, and if we
> could extend the Hash-node to share the hash-table with sibling
> join-nodess.

The patch crashes when running the regression tests, per the report of
the automatic patch tester.  Could you look at that?  I have moved the
patch to nexf CF, waiting on author.
--
Michael


signature.asc
Description: PGP signature

Re: Asymmetric partition-wise JOIN

2019-08-24 Thread Kohei KaiGai

2019年8月24日(土) 7:02 Thomas Munro :
>
> On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai  wrote:
> > We can consider the table join ptable X t1 above is equivalent to:
> >   (ptable_p0 + ptable_p1 + ptable_p2) X t1
> > = (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
> > It returns an equivalent result, however, rows are already reduced by 
> > HashJoin
> > in the individual leaf of Append, so CPU-cycles consumed by Append node can
> > be cheaper.
> >
> > On the other hands, it has a downside because t1 must be read 3 times and
> > hash table also must be built 3 times. It increases the expected cost,
> > so planner
> > may not choose the asymmetric partition-wise join plan.
>
> What if you include the partition constraint as a filter on t1?  So you get:
>
> ptable X t1 =
>   (ptable_p0 X (σ hash(dist)%4=0 (t1))) +
>   (ptable_p1 X (σ hash(dist)%4=1 (t1))) +
>   (ptable_p2 X (σ hash(dist)%4=2 (t1))) +
>   (ptable_p3 X (σ hash(dist)%4=3 (t1)))
>
> Pros:
> 1.  The hash tables will not contain unnecessary junk.
> 2.  You'll get the right answer if t1 is on the outer side of an outer join.
> 3.  If this runs underneath a Parallel Append and t1 is big enough
> then workers will hopefully cooperate and do a synchronised scan of
> t1.
> 4.  The filter might enable a selective and efficient plan like an index scan.
>
> Cons:
> 1.  The filter might not enable a selective and efficient plan, and
> therefore cause extra work.
>
> (It's a little weird in this example because don't usually see hash
> functions in WHERE clauses, but that could just as easily be dist
> BETWEEN 1 AND 99 or any other partition constraint.)
>
It requires the join-key must include the partition key and also must be
equality-join, doesn't it?
If ptable and t1 are joined using ptable.dist = t1.foo, we can distribute
t1 for each leaf table with "WHERE hash(foo)%4 = xxx" according to
the partition bounds, indeed.

In case when some of partition leafs are pruned, it is exactly beneficial
because relevant rows to be referenced by the pruned child relations
are waste of memory.

On the other hands, it eventually consumes almost equivalent amount
of memory to load the inner relations, if no leafs are pruned, and if we
could extend the Hash-node to share the hash-table with sibling join-nodess.

> > One idea I have is, sibling HashJoin shares a hash table that was built once
> > by any of the sibling Hash plan. Right now, it is not implemented yet.
>
> Yeah, I've thought a little bit about that in the context of Parallel
> Repartition.  I'm interested in combining intra-node partitioning
> (where a single plan node repartitions data among workers on the fly)
> with inter-node partitioning (like PWJ, where partitions are handled
> by different parts of the plan, considered at planning time); you
> finish up needing to have nodes in the plan that 'receive' tuples for
> each partition, to match up with the PWJ plan structure.  That's not
> entirely unlike CTE references, and not entirely unlike your idea of
> somehow sharing the same hash table.  I ran into a number of problems
> while thinking about that, which I should write about in another
> thread.
>
Hmm. Do you intend the inner-path may have different behavior according
to the partition bounds definition where the outer-path to be joined?
Let me investigate its pros & cons.

The reasons why I think the idea of sharing the same hash table is reasonable
in this scenario are:
1. We can easily extend the idea for parallel optimization. A hash table on DSM
segment, once built, can be shared by all the siblings in all the
parallel workers.
2. We can save the memory consumption regardless of the join-keys and
partition-keys, even if these are not involved in the query.

On the other hands, below are the downside. Potentially, combined use of
your idea may help these cases:
3. Distributed inner-relation cannot be outer side of XXX OUTER JOIN.
4. Hash table contains rows to be referenced by only pruned partition leafs.

Best regards,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei

Re: Asymmetric partition-wise JOIN

2019-08-23 Thread Thomas Munro

On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai  wrote:
> We can consider the table join ptable X t1 above is equivalent to:
>   (ptable_p0 + ptable_p1 + ptable_p2) X t1
> = (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
> It returns an equivalent result, however, rows are already reduced by HashJoin
> in the individual leaf of Append, so CPU-cycles consumed by Append node can
> be cheaper.
>
> On the other hands, it has a downside because t1 must be read 3 times and
> hash table also must be built 3 times. It increases the expected cost,
> so planner
> may not choose the asymmetric partition-wise join plan.

What if you include the partition constraint as a filter on t1?  So you get:

ptable X t1 =
  (ptable_p0 X (σ hash(dist)%4=0 (t1))) +
  (ptable_p1 X (σ hash(dist)%4=1 (t1))) +
  (ptable_p2 X (σ hash(dist)%4=2 (t1))) +
  (ptable_p3 X (σ hash(dist)%4=3 (t1)))

Pros:
1.  The hash tables will not contain unnecessary junk.
2.  You'll get the right answer if t1 is on the outer side of an outer join.
3.  If this runs underneath a Parallel Append and t1 is big enough
then workers will hopefully cooperate and do a synchronised scan of
t1.
4.  The filter might enable a selective and efficient plan like an index scan.

Cons:
1.  The filter might not enable a selective and efficient plan, and
therefore cause extra work.

(It's a little weird in this example because don't usually see hash
functions in WHERE clauses, but that could just as easily be dist
BETWEEN 1 AND 99 or any other partition constraint.)

> One idea I have is, sibling HashJoin shares a hash table that was built once
> by any of the sibling Hash plan. Right now, it is not implemented yet.

Yeah, I've thought a little bit about that in the context of Parallel
Repartition.  I'm interested in combining intra-node partitioning
(where a single plan node repartitions data among workers on the fly)
with inter-node partitioning (like PWJ, where partitions are handled
by different parts of the plan, considered at planning time); you
finish up needing to have nodes in the plan that 'receive' tuples for
each partition, to match up with the PWJ plan structure.  That's not
entirely unlike CTE references, and not entirely unlike your idea of
somehow sharing the same hash table.  I ran into a number of problems
while thinking about that, which I should write about in another
thread.

-- 
Thomas Munro
https://enterprisedb.com

Re: Asymmetric partition-wise JOIN

2019-08-22 Thread Kohei KaiGai

Hello,

Even though nobody has respond the thread, I tried to make a prototype of
the asymmetric partition-wise join support.
This feature tries to join non-partitioned and partitioned relation
before append.

See the example below:

create table ptable (dist int, a int, b int) partition by hash (dist);
create table ptable_p0 partition of ptable for values with (modulus 3,
remainder 0);
create table ptable_p1 partition of ptable for values with (modulus 3,
remainder 1);
create table ptable_p2 partition of ptable for values with (modulus 3,
remainder 2);
create table t1 (aid int, label text);
create table t2 (bid int, label text);
insert into ptable (select x, (1000*random())::int,
(1000*random())::int from generate_series(1,100) x);
insert into t1 (select x, md5(x::text) from generate_series(1,50) x);
insert into t2 (select x, md5(x::text) from generate_series(1,50) x);
vacuum analyze ptable;
vacuum analyze t1;
vacuum analyze t2;

ptable.a has values between 0 and 1000, and t1.aid has values between 1 and 50.
Therefore, tables join on ptable and t1 by a=aid can reduce almost 95% rows.
On the other hands, t1 is not partitioned and join-keys are not partition keys.
So, Append must process million rows first, then HashJoin processes
the rows read
from the partitioned table, and 95% of them are eventually dropped.
On the other words, 95% of jobs by Append are waste of time and CPU cycles.

postgres=# explain select * from ptable, t1 where a = aid;
  QUERY PLAN
--
 Hash Join  (cost=2.12..24658.62 rows=49950 width=49)
   Hash Cond: (ptable_p0.a = t1.aid)
   ->  Append  (cost=0.00..20407.00 rows=100 width=12)
 ->  Seq Scan on ptable_p0  (cost=0.00..5134.63 rows=333263 width=12)
 ->  Seq Scan on ptable_p1  (cost=0.00..5137.97 rows=333497 width=12)
 ->  Seq Scan on ptable_p2  (cost=0.00..5134.40 rows=333240 width=12)
   ->  Hash  (cost=1.50..1.50 rows=50 width=37)
 ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
(8 rows)

The asymmetric partitionwise join allows to join non-partitioned tables and
partitioned tables prior to Append.

postgres=# set enable_partitionwise_join = on;
SET
postgres=# explain select * from ptable, t1 where a = aid;
  QUERY PLAN
--
 Append  (cost=2.12..19912.62 rows=49950 width=49)
   ->  Hash Join  (cost=2.12..6552.96 rows=16647 width=49)
 Hash Cond: (ptable_p0.a = t1.aid)
 ->  Seq Scan on ptable_p0  (cost=0.00..5134.63 rows=333263 width=12)
 ->  Hash  (cost=1.50..1.50 rows=50 width=37)
   ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
   ->  Hash Join  (cost=2.12..6557.29 rows=16658 width=49)
 Hash Cond: (ptable_p1.a = t1.aid)
 ->  Seq Scan on ptable_p1  (cost=0.00..5137.97 rows=333497 width=12)
 ->  Hash  (cost=1.50..1.50 rows=50 width=37)
   ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
   ->  Hash Join  (cost=2.12..6552.62 rows=16645 width=49)
 Hash Cond: (ptable_p2.a = t1.aid)
 ->  Seq Scan on ptable_p2  (cost=0.00..5134.40 rows=333240 width=12)
 ->  Hash  (cost=1.50..1.50 rows=50 width=37)
   ->  Seq Scan on t1  (cost=0.00..1.50 rows=50 width=37)
(16 rows)

We can consider the table join ptable X t1 above is equivalent to:
  (ptable_p0 + ptable_p1 + ptable_p2) X t1
= (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
It returns an equivalent result, however, rows are already reduced by HashJoin
in the individual leaf of Append, so CPU-cycles consumed by Append node can
be cheaper.

On the other hands, it has a downside because t1 must be read 3 times and
hash table also must be built 3 times. It increases the expected cost,
so planner
may not choose the asymmetric partition-wise join plan.

One idea I have is, sibling HashJoin shares a hash table that was built once
by any of the sibling Hash plan. Right now, it is not implemented yet.

How about your thought for this feature?

Best regards,

2019年8月12日(月) 15:03 Kohei KaiGai :
>
> Hello,
>
> PostgreSQL optimizer right now considers join pairs on only
> non-partition - non-partition or
> partition-leaf - partition-leaf relations. On the other hands, it is
> harmless and makes sense to
> consider a join pair on non-partition - partition-leaf.
>
> See the example below. ptable is partitioned by hash, and contains 10M
> rows. ftable is not
> partitioned and contains 50 rows. Most of ptable::fkey shall not have
> matched rows in this
> join.
>
> create table ptable (fkey int, dist text) partition by hash (dist);
> create table ptable_p0 partition of ptable for values with (modulus 3,
> remainder 0);
> create table ptable_p1 partitio

Asymmetric partition-wise JOIN

2019-08-12 Thread Kohei KaiGai

cost=1.50..1.50 rows=50 width=4)
(actual time=0.020..0.020 rows=50 loops=1)
   Buckets: 1024  Batches: 1  Memory Usage: 10kB
   ->  Seq Scan on ftable f_1
(cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.011 rows=50
loops=1)
 ->  Subquery Scan on "*SELECT* 3"  (cost=2.12..70051.84
rows=16699 width=0) (actual time=0.025..407.103 rows=16633 loops=1)
   ->  Hash Join  (cost=2.12..69884.85 rows=16699
width=72) (actual time=0.025..406.048 rows=16633 loops=1)
 Hash Cond: (p_2.fkey = f_2.pkey)
 ->  Seq Scan on ptable_p2 p_2
(cost=0.00..61126.79 rows=3334179 width=4) (actual time=0.004..181.015
rows=3334179 loops=1)
 ->  Hash  (cost=1.50..1.50 rows=50 width=4)
(actual time=0.014..0.014 rows=50 loops=1)
   Buckets: 1024  Batches: 1  Memory Usage: 10kB
   ->  Seq Scan on ftable f_2
(cost=0.00..1.50 rows=50 width=4) (actual time=0.003..0.008 rows=50
loops=1)
 Planning Time: 0.614 ms
 Execution Time: 1396.131 ms
(25 rows)

How about your opinions for this kind of asymmetric partition-wise
JOIN support by the optimizer?
I think we can harmlessly push-down inner-join and left-join if
partition-leaf is left side.

Probably, we need to implement two key functionalities.
1. Construction of RelOpInfo for join on non-partition table and
partition-leafs for each pairs.
Instead of JoinPaths, this logic adds AppendPath that takes
asymmetric partition-wise join
paths as sub-paths. Other optimization logic is equivalent as we
are currently doing.
2. Allow to share the hash-table built from table scan distributed to
individual partition leafs.
In the above example, SeqScan on ftable and relevant Hash path
will make identical hash-
table for the upcoming hash-join. If sibling paths have equivalent
results, it is reasonable to
reuse it.

Best regards,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei

40 matches

Mail list logo