Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-31 Thread Robert Haas
On Mon, Oct 16, 2017 at 5:03 PM, Ashutosh Bapat
 wrote:
> set_append_rel_size() crashes when it encounters a partitioned table
> with a dropped column. Dropped columns do not have any translations
> saved in AppendInfo::translated_vars; the corresponding entry is NULL
> per make_inh_translation_list().
> 1802 att = TupleDescAttr(old_tupdesc, old_attno);
> 1803 if (att->attisdropped)
> 1804 {
> 1805 /* Just put NULL into this list entry */
> 1806 vars = lappend(vars, NULL);
> 1807 continue;
> 1808 }
>
> In set_append_rel_size() we try to attr_needed for child tables. While
> doing so we try to translate a user attribute number of parent to that
> of a child and crash since the translated Var is NULL. Here's patch to
> fix the crash. The patch also contains a testcase to test dropped
> columns in partitioned table.
>
> Sorry for not noticing this problem earlier.

OK, committed.  This is a good example of how having good code
coverage doesn't necessarily mean you've found all the bugs.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-16 Thread Ashutosh Bapat
On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas  wrote:

>>
>> The fix is to copy the relevant partitioning information from relcache
>> into PartitionSchemeData and RelOptInfo. Here's a quick patch with
>> that fix.
>
> Committed.  I hope that makes things less red rather than more,
> because I'm going to be AFK for a few hours anyway.
>

set_append_rel_size() crashes when it encounters a partitioned table
with a dropped column. Dropped columns do not have any translations
saved in AppendInfo::translated_vars; the corresponding entry is NULL
per make_inh_translation_list().
1802 att = TupleDescAttr(old_tupdesc, old_attno);
1803 if (att->attisdropped)
1804 {
1805 /* Just put NULL into this list entry */
1806 vars = lappend(vars, NULL);
1807 continue;
1808 }

In set_append_rel_size() we try to attr_needed for child tables. While
doing so we try to translate a user attribute number of parent to that
of a child and crash since the translated Var is NULL. Here's patch to
fix the crash. The patch also contains a testcase to test dropped
columns in partitioned table.

Sorry for not noticing this problem earlier.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
From eca9e03410b049bf79d767657ad4d90fb974d19c Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Mon, 16 Oct 2017 13:23:49 +0530
Subject: [PATCH] Ignore dropped columns in set_append_rel_size().

A parent Var corresponding to column dropped from a parent table will
not have any translation for a child.  It won't have any attr_needed
information since it can not be referenced from SQL. Hence ignore
dropped columns while constructing attr_needed information for a child
table.

Ashutosh Bapat.
---
 src/backend/optimizer/path/allpaths.c |   12 
 src/test/regress/expected/alter_table.out |7 +++
 src/test/regress/sql/alter_table.sql  |4 
 3 files changed, 23 insertions(+)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5535b63..1bc5e01 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -950,6 +950,18 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 	attno - 1);
 	int			child_index;
 
+	/*
+	 * Ignore any column dropped from the parent.
+	 * Corresponding Var won't have any translation. It won't
+	 * have attr_needed information, since it can not be
+	 * referenced in the query.
+	 */
+	if (var == NULL)
+	{
+		Assert(attr_needed == NULL);
+		continue;
+	}
+
 	child_index = var->varattno - childrel->min_attr;
 	childrel->attr_needed[child_index] = attr_needed;
 }
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index dbe438d..cc56651 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -3604,6 +3604,13 @@ ALTER TABLE list_parted2 DROP COLUMN b;
 ERROR:  cannot drop column named in partition key
 ALTER TABLE list_parted2 ALTER COLUMN b TYPE text;
 ERROR:  cannot alter type of column named in partition key
+-- dropping non-partition key columns should be allowed on the parent table.
+ALTER TABLE list_parted DROP COLUMN b;
+SELECT * FROM list_parted;
+ a 
+---
+(0 rows)
+
 -- cleanup
 DROP TABLE list_parted, list_parted2, range_parted;
 DROP TABLE fail_def_part;
diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql
index 0c8ae2a..fc7bd66 100644
--- a/src/test/regress/sql/alter_table.sql
+++ b/src/test/regress/sql/alter_table.sql
@@ -2390,6 +2390,10 @@ ALTER TABLE part_2 INHERIT inh_test;
 ALTER TABLE list_parted2 DROP COLUMN b;
 ALTER TABLE list_parted2 ALTER COLUMN b TYPE text;
 
+-- dropping non-partition key columns should be allowed on the parent table.
+ALTER TABLE list_parted DROP COLUMN b;
+SELECT * FROM list_parted;
+
 -- cleanup
 DROP TABLE list_parted, list_parted2, range_parted;
 DROP TABLE fail_def_part;
-- 
1.7.9.5


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-12 Thread Ashutosh Bapat
On Thu, Oct 12, 2017 at 10:49 PM, Robert Haas  wrote:
> On Wed, Oct 11, 2017 at 10:43 PM, Ashutosh Bapat
>  wrote:
>> You are suggesting that a dummy partitioned table be treated as an
>> un-partitioned table and apply above suggested optimization. A join
>> between a partitioned and unpartitioned table is partitioned by the
>> keys of only partitioned table. An unpartitioned table doesn't have
>> any keys, so this is fine. But a dummy partitioned table does have
>> keys. Recording them as keys of the join relation helps when it joins
>> to other relations. Furthermore a join between partitioned and
>> unpartitioned table doesn't require any equi-join condition on
>> partition keys of partitioned table but a join between partitioned
>> tables is considered to be partitioned by keys on both sides only when
>> there is an equi-join. So, when implementing a partitioned join
>> between a partitioned and an unpartitioned table, we will have to make
>> a special case to record partition keys when the unpartitioned side is
>> actually a dummy partitioned table. That might be awkward.
>
> It seems to me that what we really need here is to move all of this
> stuff into a separate struct:
>
> /* used for partitioned relations */
> PartitionScheme part_scheme;/* Partitioning scheme. */
> int nparts; /* number of
> partitions */
> struct PartitionBoundInfoData *boundinfo;   /* Partition bounds */
> struct RelOptInfo **part_rels;  /* Array of RelOptInfos of partitions,
>
>   * stored in the same order of bounds */
> List  **partexprs;  /* Non-nullable partition key
> expressions. */
> List  **nullable_partexprs; /* Nullable partition key
> expressions. */
>

In a very early patch I had PartitionOptInfo to hold all of this.
RelOptInfo then had a pointer of PartitionOptInfo, if it was
partitioned. When a relation can be partitioned in multiple ways like
what you describe or because join by re-partitioning is efficient,
RelOptInfo would have a list of those. But the representation needs to
be thought through. I am wondering whether this should be modelled
like IndexOptInfo. I am not sure. This is a topic of much larger
discussion.

I think we are digressing. We were discussing my patch to handle dummy
partitioned relation, whose children are not marked dummy and do not
have pathlists set. Do you still think that we should leave that
aside?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-12 Thread Robert Haas
On Wed, Oct 11, 2017 at 10:43 PM, Ashutosh Bapat
 wrote:
> You are suggesting that a dummy partitioned table be treated as an
> un-partitioned table and apply above suggested optimization. A join
> between a partitioned and unpartitioned table is partitioned by the
> keys of only partitioned table. An unpartitioned table doesn't have
> any keys, so this is fine. But a dummy partitioned table does have
> keys. Recording them as keys of the join relation helps when it joins
> to other relations. Furthermore a join between partitioned and
> unpartitioned table doesn't require any equi-join condition on
> partition keys of partitioned table but a join between partitioned
> tables is considered to be partitioned by keys on both sides only when
> there is an equi-join. So, when implementing a partitioned join
> between a partitioned and an unpartitioned table, we will have to make
> a special case to record partition keys when the unpartitioned side is
> actually a dummy partitioned table. That might be awkward.

It seems to me that what we really need here is to move all of this
stuff into a separate struct:

/* used for partitioned relations */
PartitionScheme part_scheme;/* Partitioning scheme. */
int nparts; /* number of
partitions */
struct PartitionBoundInfoData *boundinfo;   /* Partition bounds */
struct RelOptInfo **part_rels;  /* Array of RelOptInfos of partitions,

  * stored in the same order of bounds */
List  **partexprs;  /* Non-nullable partition key
expressions. */
List  **nullable_partexprs; /* Nullable partition key
expressions. */

...and then have a RelOptInfo carry a pointer to a list of those
structures.  That lets us consider multiple possible partition schemes
for the same relation.  For instance, suppose that a user joins four
relations, P1, P2, Q1, and Q2.  P1 and P2 are compatibly partitioned.
Q1 and Q2 are compatibly partitioned (but not compatible with P1 and
P2).

Furthermore, let's suppose that the optimal join order begins with a
join between P1 and Q1.  When we construct the paths for that joinrel,
we can either join all of P1 to all of Q1 (giving up on partition-wise
join), or we can join each partition of P1 to all of Q1 (producing a
result partitioned compatibly with P1 and allowing for a future
partition-wise join to P2), or we can join each partition of Q1 to all
of P1 (producing a result partitioned compatibly with Q1 and allowing
for a future partition-wise join to Q2).  Any of those could win
depending on the details.  With the data structure as it is today,
we'd have to choose whether to mark the joinrel as partitioned like P1
or like Q1, but that's not really what we need here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-11 Thread Ashutosh Bapat
On Wed, Oct 11, 2017 at 7:47 PM, Robert Haas  wrote:
> On Mon, Oct 9, 2017 at 2:05 AM, Ashutosh Bapat
>  wrote:
>> On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas  wrote:
>>> Committed.  I hope that makes things less red rather than more,
>>> because I'm going to be AFK for a few hours anyway.
>>
>> Here's the last patch, dealing with the dummy relations, rebased. With
>> this fix every join order of a partitioned join can be considered
>> partitioned. (This wasn't the case earlier when dummy relation was
>> involved.). So, we can allocate the child-join RelOptInfo array in
>> build_joinrel_partition_info(), instead of waiting for an appropriate
>> pair to arrive in try_partition_wise_join().
>
> Wouldn't a far more general approach be to allow a partition-wise join
> between a partitioned table and an unpartitioned table, considering
> the result as partitioned?  That seems like it would very often yield
> much better query plans than what we have right now, and also make the
> need for this particular thing go away.
>

You are suggesting that a dummy partitioned table be treated as an
un-partitioned table and apply above suggested optimization. A join
between a partitioned and unpartitioned table is partitioned by the
keys of only partitioned table. An unpartitioned table doesn't have
any keys, so this is fine. But a dummy partitioned table does have
keys. Recording them as keys of the join relation helps when it joins
to other relations. Furthermore a join between partitioned and
unpartitioned table doesn't require any equi-join condition on
partition keys of partitioned table but a join between partitioned
tables is considered to be partitioned by keys on both sides only when
there is an equi-join. So, when implementing a partitioned join
between a partitioned and an unpartitioned table, we will have to make
a special case to record partition keys when the unpartitioned side is
actually a dummy partitioned table. That might be awkward.

Because we don't have dummy children relation in all cases, we already
have some awkwardness like allocating part_rels array only when we
encounter a join order which has all the children. This patch removes
that.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-11 Thread Robert Haas
On Mon, Oct 9, 2017 at 2:05 AM, Ashutosh Bapat
 wrote:
> On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas  wrote:
>> Committed.  I hope that makes things less red rather than more,
>> because I'm going to be AFK for a few hours anyway.
>
> Here's the last patch, dealing with the dummy relations, rebased. With
> this fix every join order of a partitioned join can be considered
> partitioned. (This wasn't the case earlier when dummy relation was
> involved.). So, we can allocate the child-join RelOptInfo array in
> build_joinrel_partition_info(), instead of waiting for an appropriate
> pair to arrive in try_partition_wise_join().

Wouldn't a far more general approach be to allow a partition-wise join
between a partitioned table and an unpartitioned table, considering
the result as partitioned?  That seems like it would very often yield
much better query plans than what we have right now, and also make the
need for this particular thing go away.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-08 Thread Ashutosh Bapat
On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas  wrote:
>
> Committed.  I hope that makes things less red rather than more,
> because I'm going to be AFK for a few hours anyway.
>

Here's the last patch, dealing with the dummy relations, rebased. With
this fix every join order of a partitioned join can be considered
partitioned. (This wasn't the case earlier when dummy relation was
involved.). So, we can allocate the child-join RelOptInfo array in
build_joinrel_partition_info(), instead of waiting for an appropriate
pair to arrive in try_partition_wise_join().
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
From 4bf8ca38719aee730ed57a7f14522384b1ced7b0 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Tue, 29 Aug 2017 12:37:14 +0530
Subject: [PATCH] Support partition-wise join for dummy partitioned relation.

Current partition-wise join implementation treats dummy relations as
unpartitioned since the child relations may not be marked dummy and may not
have their pathlists set (See set_rel_size() and set_rel_pathlist()). Since
children do not have paths set, they can not be used to form a child-join. This
patch marks the children of dummy partitioned relations as dummy, thus allowing
partition-wise join when one of the joining relations is dummy.

If the dummy partitioned relation is a base relation, its children are base
relations as well and we will be marking base relation dummy during join
planning. But this shouldn't be a problem since populate_joinrel_with_paths()
already does that to an inner relation of left join.

Ashutosh Bapat.
---
 src/backend/optimizer/path/allpaths.c |7 +--
 src/backend/optimizer/path/joinrels.c |   24 
 src/backend/optimizer/util/relnode.c  |5 +
 3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5535b63..b46fb5b 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3261,12 +3261,7 @@ generate_partition_wise_join_paths(PlannerInfo *root, RelOptInfo *rel)
 	if (IS_DUMMY_REL(rel))
 		return;
 
-	/*
-	 * Nothing to do if the relation is not partitioned. An outer join
-	 * relation which had empty inner relation in every pair will have rest of
-	 * the partitioning properties set except the child-join RelOptInfos. See
-	 * try_partition_wise_join() for more explanation.
-	 */
+	/* Nothing to do if the relation is not partitioned. */
 	if (rel->nparts <= 0 || rel->part_rels == NULL)
 		return;
 
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 2b868c5..1578dea 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1321,14 +1321,19 @@ try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 	/*
 	 * set_rel_pathlist() may not create paths in children of an empty
-	 * partitioned table and so we can not add paths to child-joins. So, deem
-	 * such a join as unpartitioned. When a partitioned relation is deemed
-	 * empty because all its children are empty, dummy path will be set in
-	 * each of the children.  In such a case we could still consider the join
-	 * as partitioned, but it might not help much.
+	 * partitioned table. Mark such children as dummy so that we can add paths
+	 * to child-joins.
 	 */
-	if (IS_DUMMY_REL(rel1) || IS_DUMMY_REL(rel2))
-		return;
+	if (IS_DUMMY_REL(rel1))
+	{
+		for (cnt_parts = 0; cnt_parts < rel1->nparts; cnt_parts++)
+			mark_dummy_rel(rel1->part_rels[cnt_parts]);
+	}
+	if (IS_DUMMY_REL(rel2))
+	{
+		for (cnt_parts = 0; cnt_parts < rel2->nparts; cnt_parts++)
+			mark_dummy_rel(rel2->part_rels[cnt_parts]);
+	}
 
 	/*
 	 * Since this join relation is partitioned, all the base relations
@@ -1361,11 +1366,6 @@ try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 	nparts = joinrel->nparts;
 
-	/* Allocate space to hold child-joins RelOptInfos, if not already done. */
-	if (!joinrel->part_rels)
-		joinrel->part_rels =
-			(RelOptInfo **) palloc0(sizeof(RelOptInfo *) * nparts);
-
 	/*
 	 * Create child-join relations for this partitioned join, if those don't
 	 * exist. Add paths to child-joins for a pair of child relations
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 3bd1063..21fd29f 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -1746,4 +1746,9 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
 		joinrel->partexprs[cnt] = partexpr;
 		joinrel->nullable_partexprs[cnt] = nullable_partexpr;
 	}
+
+	/* Allocate space to hold child-joins RelOptInfos. */
+	joinrel->part_rels =
+		(RelOptInfo **) palloc0(sizeof(RelOptInfo *) * joinrel->nparts);
+
 }
-- 
1.7.9.5


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http:

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-06 Thread Robert Haas
On Fri, Oct 6, 2017 at 3:07 PM, Ashutosh Bapat
 wrote:
> On Fri, Oct 6, 2017 at 8:45 PM, Robert Haas  wrote:
>> I think this is very good work and I'm excited about the feature.  Now
>> I'll wait to see whether the buildfarm, or Tom, yell at me for
>> whatever problems this may still have...
>
> Buildfarm animal prion turned red. Before going into that failure,
> good news is that the other animals are green. So the plans are
> stable.
>
> prion runs the regression with -DRELCACHE_FORCE_RELEASE, which
> destroys a relcache entry as soon as its reference count drops down to
> 0. This destroys everything that's there in corresponding relcache
> entry including partition key information and partition descriptor
> information. find_partition_scheme() and set_relation_partition_info()
> both assume that the relcache information will survive as long as the
> relation lock is held. They do not copy the relevant partitioning
> information but just copy the pointers. That assumption is wrong.
> Because of -DRELCACHE_FORCE_RELEASE, as soon as refcount drops to
> zero, the data in partition scheme and partition bounds goes invalid
> and various checks to see if partition wise join is possible fail.
> That causes partition_join test to fail on prion. But I think, the bug
> could cause crash as well.
>
> The fix is to copy the relevant partitioning information from relcache
> into PartitionSchemeData and RelOptInfo. Here's a quick patch with
> that fix.

Committed.  I hope that makes things less red rather than more,
because I'm going to be AFK for a few hours anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-06 Thread Ashutosh Bapat
On Fri, Oct 6, 2017 at 8:45 PM, Robert Haas  wrote:
>
> I think this is very good work and I'm excited about the feature.  Now
> I'll wait to see whether the buildfarm, or Tom, yell at me for
> whatever problems this may still have...
>

Buildfarm animal prion turned red. Before going into that failure,
good news is that the other animals are green. So the plans are
stable.

prion runs the regression with -DRELCACHE_FORCE_RELEASE, which
destroys a relcache entry as soon as its reference count drops down to
0. This destroys everything that's there in corresponding relcache
entry including partition key information and partition descriptor
information. find_partition_scheme() and set_relation_partition_info()
both assume that the relcache information will survive as long as the
relation lock is held. They do not copy the relevant partitioning
information but just copy the pointers. That assumption is wrong.
Because of -DRELCACHE_FORCE_RELEASE, as soon as refcount drops to
zero, the data in partition scheme and partition bounds goes invalid
and various checks to see if partition wise join is possible fail.
That causes partition_join test to fail on prion. But I think, the bug
could cause crash as well.

The fix is to copy the relevant partitioning information from relcache
into PartitionSchemeData and RelOptInfo. Here's a quick patch with
that fix.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 3a8306a..ebda85e 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -702,6 +702,74 @@ partition_bounds_equal(int partnatts, int16 *parttyplen, bool *parttypbyval,
 }
 
 /*
+ * Return a copy of given PartitionBoundInfo structure. The data types of bounds
+ * are described by given partition key specificiation.
+ */
+extern PartitionBoundInfo
+partition_bounds_copy(PartitionBoundInfo src,
+	  PartitionKey key)
+{
+	PartitionBoundInfo	dest;
+	int		i;
+	int		ndatums;
+	int		partnatts;
+	int		num_indexes;
+
+	dest = (PartitionBoundInfo) palloc(sizeof(PartitionBoundInfoData));
+
+	dest->strategy = src->strategy;
+	ndatums = dest->ndatums = src->ndatums;
+	partnatts = key->partnatts;
+
+	/* Range partitioned table has an extra index. */
+	num_indexes = key->strategy == PARTITION_STRATEGY_RANGE ? ndatums + 1 : ndatums;
+
+	/* List partitioned tables have only a single partition key. */
+	Assert(key->strategy != PARTITION_STRATEGY_LIST || partnatts == 1);
+
+	dest->datums = (Datum **) palloc(sizeof(Datum *) * ndatums);
+
+	if (src->kind != NULL)
+	{
+		dest->kind = (PartitionRangeDatumKind **) palloc(ndatums *
+sizeof(PartitionRangeDatumKind *));
+		for (i = 0; i < ndatums; i++)
+		{
+			dest->kind[i] = (PartitionRangeDatumKind *) palloc(partnatts *
+sizeof(PartitionRangeDatumKind));
+
+			memcpy(dest->kind[i], src->kind[i],
+   sizeof(PartitionRangeDatumKind) * key->partnatts);
+		}
+	}
+	else
+		dest->kind = NULL;
+
+	for (i = 0; i < ndatums; i++)
+	{
+		int		j;
+		dest->datums[i] = (Datum *) palloc(sizeof(Datum) * partnatts);
+
+		for (j = 0; j < partnatts; j++)
+		{
+			if (dest->kind == NULL ||
+dest->kind[i][j] == PARTITION_RANGE_DATUM_VALUE)
+dest->datums[i][j] = datumCopy(src->datums[i][j],
+			   key->parttypbyval[j],
+			   key->parttyplen[j]);
+		}
+	}
+
+	dest->indexes = (int *) palloc(sizeof(int) * num_indexes);
+	memcpy(dest->indexes, src->indexes, sizeof(int) * num_indexes);
+
+	dest->null_index = src->null_index;
+	dest->default_index = src->default_index;
+
+	return dest;
+}
+
+/*
  * check_new_partition_bound
  *
  * Checks if the new partition's bound overlaps any of the existing partitions
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 93cc757..9d35a41 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1825,13 +1825,15 @@ set_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
 			Relation relation)
 {
 	PartitionDesc partdesc;
+	PartitionKey  partkey;
 
 	Assert(relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
 
 	partdesc = RelationGetPartitionDesc(relation);
+	partkey = RelationGetPartitionKey(relation);
 	rel->part_scheme = find_partition_scheme(root, relation);
 	Assert(partdesc != NULL && rel->part_scheme != NULL);
-	rel->boundinfo = partdesc->boundinfo;
+	rel->boundinfo = partition_bounds_copy(partdesc->boundinfo, partkey);
 	rel->nparts = partdesc->nparts;
 	set_baserel_partition_key_exprs(relation, rel);
 }
@@ -1888,18 +1890,33 @@ find_partition_scheme(PlannerInfo *root, Relation relation)
 
 	/*
 	 * Did not find matching partition scheme. Create one copying relevant
-	 * information from the relcache. Instead of copying whole arrays, copy
-	 * the pointers in relcache. It's safe to do so since
-	 * RelationClearRelation() wouldn't change it while planner is using it.
+	 * 

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-06 Thread Ashutosh Bapat
On Fri, Oct 6, 2017 at 8:45 PM, Robert Haas  wrote:
> On Fri, Oct 6, 2017 at 8:40 AM, Ashutosh Bapat
>  wrote:
>> Sorry. I sent a wrong file. Here's the real v37.
>
> Committed 0001-0006.  I made some assorted comment and formatting
> changes and two small substantive changes:
>
> - In try_nestloop_path, bms_free(outerrelids) before returning if we
> can't reparameterize.

Hmm. I missed that.

>
> - Moved the call to try_partition_wise_join inside
> populate_joinrel_with_paths, instead of always calling it just after
> that function is called.

This looks good too.

>
> I think this is very good work and I'm excited about the feature.

Thanks a lot Robert for detailed review and guidance. Thanks a lot
Rafia for benchmarking the feature with TPCH and esp. very large scale
database and also for testing and reported some real issues. Thanks
Rajkumar for testing it with an exhaustive testset. Thanks Amit
Langote, Thomas Munro, Dilip Kumar, Antonin Houska, Alvaro Herrera and
Amit Khandekar for their review comments and suggestions. Thanks
Jeevan Chalke, who used the patchset to implement partition-wise
aggregates and provided some insights offlist. Sorry if I have missed
anybody.

As Robert says in the commit message, there's more to do but now that
we have basic feature, improving it incrementally becomes a lot
easier.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-06 Thread Robert Haas
On Fri, Oct 6, 2017 at 8:40 AM, Ashutosh Bapat
 wrote:
> Sorry. I sent a wrong file. Here's the real v37.

Committed 0001-0006.  I made some assorted comment and formatting
changes and two small substantive changes:

- In try_nestloop_path, bms_free(outerrelids) before returning if we
can't reparameterize.

- Moved the call to try_partition_wise_join inside
populate_joinrel_with_paths, instead of always calling it just after
that function is called.

I think this is very good work and I'm excited about the feature.  Now
I'll wait to see whether the buildfarm, or Tom, yell at me for
whatever problems this may still have...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-05 Thread Ashutosh Bapat
On Wed, Oct 4, 2017 at 9:04 PM, Robert Haas  wrote:
> On Tue, Oct 3, 2017 at 3:27 PM, Robert Haas  wrote:
>> I decided to skip over 0001 for today and spend some time looking at
>> 0002-0006.
>
> Back to 0001.
>
> +Enables or disables the query planner's use of partition-wise join
> +plans. When enabled, it spends time in creating paths for joins 
> between
> +partitions and consumes memory to construct expression nodes to be 
> used
> +for those joins, even if partition-wise join does not result in the
> +cheapest path. The time and memory increase exponentially with the
> +number of partitioned tables being joined and they increase linearly
> +with the number of partitions. The default is off.
>
> I think this is too scary and too much technical detail.  I think you
> could just say something like: Enables or disables use of
> partition-wise join, which allows a join between partitioned tables to
> be performed by joining the matching partitions.  Partition-wise join
> currently applies only when the join conditions include all the
> columns of the partition keys, which must be of the same data type and
> have exactly matching sets of child partitions.  Because
> partition-wise join planning can use significantly increase CPU time
> and memory usage during planning, the default is off.

Done. With slight change. "include all the columns of the partition
keys" has a different meaning when partition key is an expression, so
I have used "include all the partition keys". Also changed the last
sentence as "... can use significantly more CPU time and memory during
planning ...". Please feel free to revert those changes, if you don't
like them.

>
> +partitioned table. The join partners can not be found in other partitions. 
> This
> +condition allows the join between partitioned tables to be broken into joins
> +between the matching partitions. The resultant join is partitioned in the 
> same
>
> "The join partners can not be found in other partitions." is redundant
> with the previous sentence.  I suggest deleting it.  I also suggest
> "This condition allows the join between partitioned tables to be
> broken" -> "Because of this, the join between partitioned tables can
> be broken".

Done.

>
> +relation" for both partitioned table as well as join between partitioned 
> tables
> +which can use partition-wise join technique.
>
> for either a partitioned table or a join between compatibly partitioned tables

Done.

>
> +Partitioning properties of a partitioned relation are stored in
> +PartitionSchemeData structure. Planner maintains a list of canonical 
> partition
> +schemes (distinct PartitionSchemeData objects) so that any two partitioned
> +relations with same partitioning scheme share the same PartitionSchemeData
> +object. This reduces memory consumed by PartitionSchemeData objects and makes
> +it easy to compare the partition schemes of joining relations.
>
> Not all of the partitioning properties are stored in the
> PartitionSchemeData structure any more.  I think this needs some
> rethinking and maybe some expansion.  As written, each of the first
> two sentences needs a "the" at the beginning.

Changed to

The partitioning properties of a partitioned relation are stored in its
RelOptInfo.  The information about data types of partition keys are stored in
PartitionSchemeData structure. The planner maintains a list of canonical
partition schemes (distinct PartitionSchemeData objects) so that RelOptInfo of
any two partitioned relations with same partitioning scheme point to the same
PartitionSchemeData object.  This reduces memory consumed by
PartitionSchemeData objects and makes it easy to compare the partition schemes
of joining relations.

Let me know if this looks good.

>
> +   /*
> +* Create "append" paths for
> partitioned joins. Do this before
> +* creating GatherPaths so that
> partial "append" paths in
> +* partitioned joins will be considered.
> +*/
>
> I think you could shorten this to a single-line comment and just keep
> the first sentence.  Similarly in the other location where you have
> the same sort of thing.

Done.

>
> + * child-joins. Otherwise, add_path might delete a path that some "append"
> + * path has reference to.
>
> to which some path generated here has a reference.

Done.

>
> Here and elsewhere, you use "append" rather than Append to refer to
> the paths added.  I suppose that's weasel-wording to work around the
> fact that they might be either Append or MergeAppend paths, but I'm
> not sure it's really going to convey that to anyone.  I suggest
> rephrasing those comments more generically, e.g.:
>
> +   /* Add "append" paths containing paths from child-joins. */
>
> You could say: Build additional paths for this rel from child-join paths.
>
> Or something.

Do

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-05 Thread Ashutosh Bapat
On Thu, Oct 5, 2017 at 7:18 PM, Alvaro Herrera  wrote:
> Robert Haas wrote:
>
>> Regarding nomenclature and my previous griping about wisdom, I was
>> wondering about just calling this a "partition join" like you have in
>> the regression test.  So the GUC would be enable_partition_join, you'd
>> have generate_partition_join_paths(), etc.  Basically just delete
>> "wise" throughout.
>
> If I understand correctly, what's being used here is the "-wise" suffix,
> unrelated to wisdom, which Merriam Webster lists as "adverb combining
> form" here https://www.merriam-webster.com/dictionary/wise (though you
> have to scroll down a lot), which is defined as
>
> 1 a :in the manner of  * crabwise * fanwise
>   b :in the position or direction of  * slantwise * clockwise
> 2 :with regard to :in respect of * dollarwise
>

That's right.

> According to that, the right way to write this is "partitionwise join"
> (no dash), which means "join in respect of partitions", "join with
> regard to partitions".

Google lists mostly  "partition wise" or "partition-wise" and very
rarely "partitionwise". The first being used in other DBMS literature.
The paper (there aren't many on this subject) I referred [1] uses
"partition-wise". It made more sense to replace " " or "-" with "_"
when syntax doesn't allow the first two. I am not against
"partitionwise" but I don't see any real reason why we should move
away from popular usage of this term.

[1] https://users.cs.duke.edu/~shivnath/papers/sigmod295-herodotou.pdf

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-05 Thread Robert Haas
On Thu, Oct 5, 2017 at 9:48 AM, Alvaro Herrera  wrote:
> Robert Haas wrote:
>> Regarding nomenclature and my previous griping about wisdom, I was
>> wondering about just calling this a "partition join" like you have in
>> the regression test.  So the GUC would be enable_partition_join, you'd
>> have generate_partition_join_paths(), etc.  Basically just delete
>> "wise" throughout.
>
> If I understand correctly, what's being used here is the "-wise" suffix,
> unrelated to wisdom, which Merriam Webster lists as "adverb combining
> form" here https://www.merriam-webster.com/dictionary/wise (though you
> have to scroll down a lot), which is defined as
>
> 1 a :in the manner of  * crabwise * fanwise
>   b :in the position or direction of  * slantwise * clockwise
> 2 :with regard to :in respect of * dollarwise
>
> According to that, the right way to write this is "partitionwise join"
> (no dash), which means "join in respect of partitions", "join with
> regard to partitions".

I'm fine with that, if others like it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-05 Thread Alvaro Herrera
Robert Haas wrote:

> Regarding nomenclature and my previous griping about wisdom, I was
> wondering about just calling this a "partition join" like you have in
> the regression test.  So the GUC would be enable_partition_join, you'd
> have generate_partition_join_paths(), etc.  Basically just delete
> "wise" throughout.

If I understand correctly, what's being used here is the "-wise" suffix,
unrelated to wisdom, which Merriam Webster lists as "adverb combining
form" here https://www.merriam-webster.com/dictionary/wise (though you
have to scroll down a lot), which is defined as

1 a :in the manner of  * crabwise * fanwise
  b :in the position or direction of  * slantwise * clockwise
2 :with regard to :in respect of * dollarwise

According to that, the right way to write this is "partitionwise join"
(no dash), which means "join in respect of partitions", "join with
regard to partitions".

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-05 Thread Ashutosh Bapat
On Wed, Oct 4, 2017 at 9:01 PM, Robert Haas  wrote:
> On Thu, Sep 21, 2017 at 8:52 AM, Robert Haas  wrote:
>> On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat
>>  wrote:
>>> About your earlier comment of making build_joinrel_partition_info()
>>> simpler. Right now, the code assumes that partexprs or
>>> nullable_partexpr can be NULL when either of them is not populated.
>>> That may be saves a sizeof(pointer) * (number of keys) byes of memory.
>>> Saving that much memory may not be worth the complexity of code. So,
>>> we may always allocate memory for those arrays and fill it with NIL
>>> values when there are no key expressions to populate those. That will
>>> simplify the code. I haven't done that change in this patchset. I was
>>> busy debugging the Q7 regression. Let me know your comments about
>>> that.
>>
>> Hmm, I'm not sure that's the best approach, but let me look at it more
>> carefully before I express a firm opinion.
>
> Having studied this a bit more, I now think your proposed approach is
> a good idea.

Thanks. Done.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-04 Thread Robert Haas
On Wed, Oct 4, 2017 at 8:23 AM, Ashutosh Bapat
 wrote:
> I am not sure whether your assumption that expression with no Vars
> would have em_relids empty is correct. I wonder whether we will add
> any em_is_child members with empty em_relids; looking at
> process_equivalence() those come from RestrictInfo::left/right_relids
> which just indicates the relids at which that particular expression
> can be evaluated. Place holder vars is an example when that can
> happen, but there may be others. To verify this, I tried attached
> patch on master and ran make check. The assertion didn't trip. If
> em_relids is not NULL, bms_is_subset() is fine.

I spent some more time experimenting with this.  I found that cases
where an em_is_child equivalence class contains multiple relids are
quite easy to generate, e.g. select * from foo, bar where foo.a +
bar.a = 0, where foo and bar are partitioned.  However, I wasn't able
to generate a case where an em_is_child equivalence class has no
relids at all, and I'm out of ideas about how such a thing could
occur.  I suspect it can't.  I wondered whether there was some problem
with the multiple-relids case, but I can't find an example where that
misbehaves either.  So maybe it's fine (or maybe I'm just not smart
enough to find the case where it breaks).

>> I don't think I believe that comment, either.  In the case from which
>> that comment was copied (mark_dummy_rel), it was talking about a
>> RelOptInfo, and geqo_eval() takes care to remove any leftover pointers
>> to joinrels creating during a GEQO cycle.  But there's no similar
>> logic for ppilist, so I think what will happen here is that you'll end
>> up with a freed node in the middle of the list.
>
> In mark_dummy_rel() it's not about RelOptInfo, it's about the pathlist
> with dummy path being created in the same context as the RelOptInfo.
> Same applies here.

Oops.  I was thinking that the ppilist was attached to some
planner-global structure, but it's not; it's hanging off the
RelOptInfo.  So you're entirely right, and I'm just being dumb.

> We need to reparameterize any path which contains further paths and/or
> contains expressions that point to the parent relation. For a given
> path we need to reparameterize any paths that it contains and
> translate any expressions that are specific to that path. Expressions
> common across the paths are translated after the switch case. I have
> added this rule to the comment just above the switch case
> /*
>  * Copy of the given path. Reparameterize any paths referenced by the 
> given
>  * path. Replace parent Vars in path specific expressions by corresponding
>  * child Vars.
>  */
> Does that look fine or we want to add explanation for every node handled here.

No, I don't think we want something for every node, just a general
explanation at the top of the function.  Maybe something like this:

Most fields from the original path can simply be flat-copied, but any
expressions must be adjusted to refer to the correct varnos, and any
paths must be recursively reparameterized.  Other fields that refer to
specific relids also need adjustment.

>> I don't see much point in the T_SubqueryScanPath and T_ResultPath
>> cases in reparameterize_path_by_child().  It's just falling through to
>> the default case.
>
> I added those cases separately to explain why we should not see those
> cases in that switch case. I think that explanation is important
> (esp. considering your comment above) and associating those comment
> with "case" statement looks better. Are you suggesting that we should
> add that explanation in default case?

Or leave the explanation out altogether.

>> I wonder if reparameterize_path_by_child() ought to default to
>> returning NULL rather than throwing an error; the caller would then
>> have to be prepared for that and skip building the path.  But that
>> would be more like what reparameterize_path() does, and it would make
>> failure to include some relevant path type here a corner-case
>> performance bug rather than a correctness issue.  It seems like
>> someone adding a new path type could quite easily fail to realize that
>> it might need to be added here, or might be unsure whether it's
>> necessary to add it here.
>
> I am OK with that. However reparameterize_path_by_child() and
> reparameterize_paths_by_child() are callers of
> reparameterize_path_by_child() so they will need to deal with NULL
> return. I am fine with that too, but making sure that we are on the
> same page. If we do that, we could simply assert that the switch case
> doesn't see T_SubqueryScanPath and T_ResultPath.

Or do nothing at all about those cases.

I noticed today that the version of the patchset I have here says in
the header comments for reparameterize_path_by_child() that it returns
NULL if it can't reparameterize, but that's not what it actually does.
If you make this change, the existing comment will become correct.

The problem with the NULL return conventi

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-04 Thread Robert Haas
On Wed, Oct 4, 2017 at 11:34 AM, Robert Haas  wrote:
> +Enables or disables the query planner's use of partition-wise join
> +plans. When enabled, it spends time in creating paths for joins 
> between
> +partitions and consumes memory to construct expression nodes to be 
> used
> +for those joins, even if partition-wise join does not result in the
> +cheapest path. The time and memory increase exponentially with the
> +number of partitioned tables being joined and they increase linearly
> +with the number of partitions. The default is off.
>
> I think this is too scary and too much technical detail.  I think you
> could just say something like: Enables or disables use of
> partition-wise join, which allows a join between partitioned tables to
> be performed by joining the matching partitions.  Partition-wise join
> currently applies only when the join conditions include all the
> columns of the partition keys, which must be of the same data type and
> have exactly matching sets of child partitions.  Because
> partition-wise join planning can use significantly increase CPU time
> and memory usage during planning, the default is off.

Not enough caffeine, obviously: should have been something like --
Because partition-wise join can significantly increase the CPU and
memory costs of planning...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-04 Thread Robert Haas
On Tue, Oct 3, 2017 at 3:27 PM, Robert Haas  wrote:
> I decided to skip over 0001 for today and spend some time looking at
> 0002-0006.

Back to 0001.

+Enables or disables the query planner's use of partition-wise join
+plans. When enabled, it spends time in creating paths for joins between
+partitions and consumes memory to construct expression nodes to be used
+for those joins, even if partition-wise join does not result in the
+cheapest path. The time and memory increase exponentially with the
+number of partitioned tables being joined and they increase linearly
+with the number of partitions. The default is off.

I think this is too scary and too much technical detail.  I think you
could just say something like: Enables or disables use of
partition-wise join, which allows a join between partitioned tables to
be performed by joining the matching partitions.  Partition-wise join
currently applies only when the join conditions include all the
columns of the partition keys, which must be of the same data type and
have exactly matching sets of child partitions.  Because
partition-wise join planning can use significantly increase CPU time
and memory usage during planning, the default is off.

+partitioned table. The join partners can not be found in other partitions. This
+condition allows the join between partitioned tables to be broken into joins
+between the matching partitions. The resultant join is partitioned in the same

"The join partners can not be found in other partitions." is redundant
with the previous sentence.  I suggest deleting it.  I also suggest
"This condition allows the join between partitioned tables to be
broken" -> "Because of this, the join between partitioned tables can
be broken".

+relation" for both partitioned table as well as join between partitioned tables
+which can use partition-wise join technique.

for either a partitioned table or a join between compatibly partitioned tables

+Partitioning properties of a partitioned relation are stored in
+PartitionSchemeData structure. Planner maintains a list of canonical partition
+schemes (distinct PartitionSchemeData objects) so that any two partitioned
+relations with same partitioning scheme share the same PartitionSchemeData
+object. This reduces memory consumed by PartitionSchemeData objects and makes
+it easy to compare the partition schemes of joining relations.

Not all of the partitioning properties are stored in the
PartitionSchemeData structure any more.  I think this needs some
rethinking and maybe some expansion.  As written, each of the first
two sentences needs a "the" at the beginning.

+   /*
+* Create "append" paths for
partitioned joins. Do this before
+* creating GatherPaths so that
partial "append" paths in
+* partitioned joins will be considered.
+*/

I think you could shorten this to a single-line comment and just keep
the first sentence.  Similarly in the other location where you have
the same sort of thing.

+ * child-joins. Otherwise, add_path might delete a path that some "append"
+ * path has reference to.

to which some path generated here has a reference.

Here and elsewhere, you use "append" rather than Append to refer to
the paths added.  I suppose that's weasel-wording to work around the
fact that they might be either Append or MergeAppend paths, but I'm
not sure it's really going to convey that to anyone.  I suggest
rephrasing those comments more generically, e.g.:

+   /* Add "append" paths containing paths from child-joins. */

You could say: Build additional paths for this rel from child-join paths.

Or something.

+   if (!REL_HAS_ALL_PART_PROPS(rel))
+   return;

Isn't this an unnecessarily expensive test?  I mean, it shouldn't be
possible for it to have some arbitrary subset.

+   /*
+* Every pair of joining relations we see here should have an equi-join
+* between partition keys if this join has been deemed as a partitioned
+* join. See build_joinrel_partition_info() for reasons.
+*/
+   Assert(have_partkey_equi_join(rel1, rel2, parent_sjinfo->jointype,
+
parent_restrictlist));

I suggest removing this assertion.  Seems like overkill to me.

+   child_sjinfo = build_child_join_sjinfo(root, parent_sjinfo,
+
child_rel1->relids,
+
child_rel2->relids);

It seems like we might end up doing this multiple times for the same
child join, if there are more than 2 tables involved.  Not sure if
there's a good way to avoid that.  Similarly for child_restrictlist.

+   pk_has_clause = (bool *) palloc0(sizeof(bool) * num_pks);

Just  do bool pk_has_clause[PARTITION_MAX_KEYS] instead.  Stack
allocation is a lot faster, and then you don't need to pfree it.

+   /* Remove the relabel 

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-04 Thread Robert Haas
On Thu, Sep 21, 2017 at 8:52 AM, Robert Haas  wrote:
> On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat
>  wrote:
>> About your earlier comment of making build_joinrel_partition_info()
>> simpler. Right now, the code assumes that partexprs or
>> nullable_partexpr can be NULL when either of them is not populated.
>> That may be saves a sizeof(pointer) * (number of keys) byes of memory.
>> Saving that much memory may not be worth the complexity of code. So,
>> we may always allocate memory for those arrays and fill it with NIL
>> values when there are no key expressions to populate those. That will
>> simplify the code. I haven't done that change in this patchset. I was
>> busy debugging the Q7 regression. Let me know your comments about
>> that.
>
> Hmm, I'm not sure that's the best approach, but let me look at it more
> carefully before I express a firm opinion.

Having studied this a bit more, I now think your proposed approach is
a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-04 Thread Ashutosh Bapat
On Wed, Oct 4, 2017 at 12:57 AM, Robert Haas  wrote:
>
> 0003:
>
> The commit message mentions estimate_num_groups but the patch doesn't touch 
> it.

This was fixed when we converted many rel->reloptkind ==
RELOPT_BASEREL to IS_SIMPLE_REL(). I have removed this section from
the commit message.

>
> I am concerned that this patch might introduce some problem fixed by
> commit dd4134ea56cb8855aad3988febc45eca28851cd8.  The comment in that
> patch say, at one place, that "This protects against possible
> incorrect matches to child expressions that contain no Vars."
> However, if a child expression has no Vars, then I think em->em_relids
> will be empty, so the bms_is_equal() test that is there now will fail
> but your proposed bms_is_subset() test will pass.

bms_is_equal() was enough when there was only a single member in
relids but it doesn't work now that there can be multiple of them.
bms_is_equal() was replaced with bms_is_subset() to accomodate for
ec_members with only a subset of relids when we are searching for a
join relation.

I am not sure whether your assumption that expression with no Vars
would have em_relids empty is correct. I wonder whether we will add
any em_is_child members with empty em_relids; looking at
process_equivalence() those come from RestrictInfo::left/right_relids
which just indicates the relids at which that particular expression
can be evaluated. Place holder vars is an example when that can
happen, but there may be others. To verify this, I tried attached
patch on master and ran make check. The assertion didn't trip. If
em_relids is not NULL, bms_is_subset() is fine.

If em_relids could indeed go NULL when em_is_child is true, passing
NULL relids (for parent rels) to that function can cause unwanted
behaviour. bms_is_equal(em->em_relids, relids) will return true
turning the if (em->em_is_child && !bms_is_equal(em->em_relids,
relids)) to false. This means that we will consider a child member
with em_relids NULL even while matching a parent relation. What
surprises me is, that commit added a bunch of testcases and none of
them failed with this change.

Nonetheless, I have changed "matches" with "belongs to" in the
prologue of those functions since an exact match won't be possible
with child-joins.

>
> 0004:
>
> I suggest renaming get_wholerow_ref_from_convert_row_type to
> is_converted_whole_row_reference and making it return a bool.

Done.

>
> The coding of that function is a little strange; why not move Var to
> an inner scope?  Like this: if (IsA(convexpr->arg, var)) { Var *var =
> castNode(Var, convexpr->arg; if (var->varattno == 0) return var; }

I probably went too far to avoid indented code :). Fixed now.

>
> Will the statement that "In case of multi-level partitioning, we will
> have as many nested ConvertRowtypeExpr as there are levels in
> partition hierarchy" be falsified by Amit Khandekar's pending patch to
> avoid sticking a ConvertRowTypeExpr on top of another
> ConvertRowTypeExpr?  Even if the answer is "no", I think it might be
> better to drop this part of the comment; it would be easy for it to
> become false in the future, because we might want to optimize that
> case in the future and we'll probably forget to update this comment
> when we do.

That might keep someone wondering where the nested
ConvertRowtypeExpr's came from. But may be in future those can arise
from something other than multi-level partition hierarchy and in that
case too the comment would be rendered inaccurate. So done.

>
> In fix_upper_expr_mutator(), you have an if statement whose entire
> contents are another if statement.  I think you should use && instead,
> and maybe reverse the order of the tests, since
> context->subplan_itlist->has_conv_whole_rows is probably cheaper to
> test than a function call.  It's also a little strange that this code
> isn't adjacent too, or merged with, the existing has_non_vars case.
> Maybe:
>
> converted_whole_row = is_converted_whole_row_reference(node);
> if (context->outer_itlist && (context->outer_itlist->has_non_vars ||
> (context->outer_itlist->has_conv_whole_rows && converted_whole_row))
> ...
> if (context->inner_itlist && (context->inner_itlist->has_non_vars ||
> (context->inner_itlist->has_conv_whole_rows && converted_whole_row))

I placed it with the other node types since it's for a specific node
type, but I guess your suggestion avoids duplicates and looks better.
Done.

> ...
>
> 0005:
>
> The comment explaining why the ParamPathInfo is allocated in the same
> context as the RelOptInfo is a modified copy of an existing comment
> that still reads like the original, a manner of commenting I find a
> bit undesirable as it leads to filling up the source base with
> duplicate comments.

I have pointed to mark_dummy_rel() in that comment instead of
duplicating the whole paragraph.

>
> I don't think I believe that comment, either.  In the case from which
> that comment was copied (mark_dummy_rel), it was talking about a
> RelOptInfo, and geq

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-03 Thread Amit Langote
On 2017/10/04 4:27, Robert Haas wrote:
> On Tue, Oct 3, 2017 at 8:57 AM, Ashutosh Bapat
>  wrote:
>>> Regarding nomenclature and my previous griping about wisdom, I was
>>> wondering about just calling this a "partition join" like you have in
>>> the regression test.  So the GUC would be enable_partition_join, you'd
>>> have generate_partition_join_paths(), etc.  Basically just delete
>>> "wise" throughout.
>>
>> Partition-wise join is standard term used in literature and in
>> documentation of other popular DBMSes, so partition_wise makes more
>> sense. But I am fine with partition_join as well. Do you want it
>> partition_join or partitionjoin like enable_mergejoin/enable_hashjoin
>> etc.?
> 
> Well, you're making me have second thoughts.  It's really just that
> partition_wise looks a little awkward to me, and maybe that's not
> enough reason to change anything.  I suppose if I commit it this way
> and somebody really hates it, it can always be changed later.  We're
> not getting a lot of input from anyone else at the moment.

FWIW, the name enable_partition_join seems enough to convey the core
feature, that is, I see "_wise" as redundant, even though I'm now quite
used to seeing "_wise" in the emails here and saying it out loud every now
and then.  Ashutosh may have a point though that users coming from other
databases might miss the "_wise". :)

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-03 Thread Robert Haas
On Tue, Oct 3, 2017 at 8:57 AM, Ashutosh Bapat
 wrote:
>> Regarding nomenclature and my previous griping about wisdom, I was
>> wondering about just calling this a "partition join" like you have in
>> the regression test.  So the GUC would be enable_partition_join, you'd
>> have generate_partition_join_paths(), etc.  Basically just delete
>> "wise" throughout.
>
> Partition-wise join is standard term used in literature and in
> documentation of other popular DBMSes, so partition_wise makes more
> sense. But I am fine with partition_join as well. Do you want it
> partition_join or partitionjoin like enable_mergejoin/enable_hashjoin
> etc.?

Well, you're making me have second thoughts.  It's really just that
partition_wise looks a little awkward to me, and maybe that's not
enough reason to change anything.  I suppose if I commit it this way
and somebody really hates it, it can always be changed later.  We're
not getting a lot of input from anyone else at the moment.

> Attached the updated patch-set.

I decided to skip over 0001 for today and spend some time looking at
0002-0006.  Comments below.

0002:

Looks fine.

0003:

The commit message mentions estimate_num_groups but the patch doesn't touch it.

I am concerned that this patch might introduce some problem fixed by
commit dd4134ea56cb8855aad3988febc45eca28851cd8.  The comment in that
patch say, at one place, that "This protects against possible
incorrect matches to child expressions that contain no Vars."
However, if a child expression has no Vars, then I think em->em_relids
will be empty, so the bms_is_equal() test that is there now will fail
but your proposed bms_is_subset() test will pass.

0004:

I suggest renaming get_wholerow_ref_from_convert_row_type to
is_converted_whole_row_reference and making it return a bool.

The coding of that function is a little strange; why not move Var to
an inner scope?  Like this: if (IsA(convexpr->arg, var)) { Var *var =
castNode(Var, convexpr->arg; if (var->varattno == 0) return var; }

Will the statement that "In case of multi-level partitioning, we will
have as many nested ConvertRowtypeExpr as there are levels in
partition hierarchy" be falsified by Amit Khandekar's pending patch to
avoid sticking a ConvertRowTypeExpr on top of another
ConvertRowTypeExpr?  Even if the answer is "no", I think it might be
better to drop this part of the comment; it would be easy for it to
become false in the future, because we might want to optimize that
case in the future and we'll probably forget to update this comment
when we do.

In fix_upper_expr_mutator(), you have an if statement whose entire
contents are another if statement.  I think you should use && instead,
and maybe reverse the order of the tests, since
context->subplan_itlist->has_conv_whole_rows is probably cheaper to
test than a function call.  It's also a little strange that this code
isn't adjacent too, or merged with, the existing has_non_vars case.
Maybe:

converted_whole_row = is_converted_whole_row_reference(node);
if (context->outer_itlist && (context->outer_itlist->has_non_vars ||
(context->outer_itlist->has_conv_whole_rows && converted_whole_row))
...
if (context->inner_itlist && (context->inner_itlist->has_non_vars ||
(context->inner_itlist->has_conv_whole_rows && converted_whole_row))
...

0005:

The comment explaining why the ParamPathInfo is allocated in the same
context as the RelOptInfo is a modified copy of an existing comment
that still reads like the original, a manner of commenting I find a
bit undesirable as it leads to filling up the source base with
duplicate comments.

I don't think I believe that comment, either.  In the case from which
that comment was copied (mark_dummy_rel), it was talking about a
RelOptInfo, and geqo_eval() takes care to remove any leftover pointers
to joinrels creating during a GEQO cycle.  But there's no similar
logic for ppilist, so I think what will happen here is that you'll end
up with a freed node in the middle of the list.

I think reparameterize_path_by_chid() could use a helper function
reparameterize_pathlist_by_child() that iterates over a list of paths
and returns a list of paths.  That would remove some of the loops.

I think the comments for reparameterize_path_by_child() need to be
expanded.  They don't explain how you decided which nodes need to be
handled here or which fields within those nodes need some kind of
handling other than a flat-copy.  I think these kinds of explanations
will be important for future maintenance of this code.  You know why
you did it this way, I can mostly guess what you did it this way, but
what about the next person who comes along who hasn't made a detailed
study of partition-wise join?

I don't see much point in the T_SubqueryScanPath and T_ResultPath
cases in reparameterize_path_by_child().  It's just falling through to
the default case.

I wonder if reparameterize_path_by_child() ought to default to
returning NULL rather than throwing an error; the caller would then
have

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-02 Thread Robert Haas
On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat
 wrote:
> Here's set of rebased patches. The patch with extra tests is not for
> committing. All other patches, except the last one, will need to be
> committed together. The last patch may be committed along with other
> patches or as a separate patch.

In set_append_rel_size, is it necessary to set attr_needed =
bms_copy(rel->attr_needed[index]) rather than just pointing to the
existing value?  If so, perhaps the comments should explain the
reasons.  I would have thought that the values wouldn't change after
this point, in which case it might not be necessary to copy them.

Regarding nomenclature and my previous griping about wisdom, I was
wondering about just calling this a "partition join" like you have in
the regression test.  So the GUC would be enable_partition_join, you'd
have generate_partition_join_paths(), etc.  Basically just delete
"wise" throughout.

The elog(DEBUG3) in try_partition_wise_join() doesn't follow message
style guidelines and I think should just be removed.  It was useful
for development, I'm sure, but it's time for it to go.

+elog(ERROR, "unrecognized path node type %d", (int) nodeTag(path));

I think we should use the same formulation as elsewhere, namely
"unrecognized node type: %d".  And likewise probably "unexpected join
type: %d".

partition_join_extras.sql has a bunch of whitespace damage, although
it doesn't really matter since, as you say, that's not for commit.

(This is not a full review, just a few thoughts.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-22 Thread Ashutosh Bapat
On Fri, Sep 15, 2017 at 5:29 PM, Ashutosh Bapat
 wrote:
>
>>
>> Apart from these there is a regression case on a custom table, on head
>> query completes in 20s and with this patch it takes 27s. Please find
>> the attached .out and .sql file for the output and schema for the test
>> case respectively. I have reported this case before (sometime around
>> March this year) as well, but I am not sure if it was overlooked or is
>> an unimportant and expected behaviour for some reason.
>>
>
> Are you talking about [1]? I have explained about the regression in
> [2] and [3]. This looks like an issue with the existing costing model.
>

I debugged this case further. There are two partitioned tables being
joined prt (with partitions prt_p1, prt_p2 and so on) and prt2 (with
partitions prt2_p1, prt2_p2, and so on). When join is executed without
partition-wise join, prt2 is used to build hash table and prt is used
to probe that hash table. prt2 has lesser number of rows than prt. But
when partition-wise join is used, individual partitions are joined in
reverse join order i.e. partitions of prt are used to build the hash
table and partitions of prt2 are used to probe. This happens because
the path for the other join order (partition of prt2 used to build the
hash table and partition of prt used to probe) has huge cost compared
to the first one (74459 and 313109) and a portion worth 259094 comes
from lines 3226/7 of final_cost_hashjoin()
3215 /*
3216  * The number of tuple comparisons needed is the number of outer
3217  * tuples times the typical number of tuples in a hash
bucket, which
3218  * is the inner relation size times its bucketsize
fraction.  At each
3219  * one, we need to evaluate the hashjoin quals.  But actually,
3220  * charging the full qual eval cost at each tuple is pessimistic,
3221  * since we don't evaluate the quals unless the hash values match
3222  * exactly.  For lack of a better idea, halve the cost estimate to
3223  * allow for that.
3224  */
3225 startup_cost += hash_qual_cost.startup;
3226 run_cost += hash_qual_cost.per_tuple * outer_path_rows *
3227 clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

That's because for some reason innerbucketsize for partition of prt is
22 times more than that for partition of prt2. Looks like we have some
estimation error in estimating bucket sizes.

If I force partitions to be joined with the same order as partitioned
tables (without partition-wise join), child-joins execute faster and
in turn partition-wise join performs better than the
non-partition-wise join. So, this is clearly some estimation and
costing problem with regular joins.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-22 Thread Ashutosh Bapat
On Fri, Sep 22, 2017 at 10:45 AM, Rafia Sabih
 wrote:
>>
>> On completing the benchmark for all queries for the above mentioned
>> setup, following performance improvement can be seen,
>> Query | Patch | Head
>> 3  | 1455  |  1631
>> 4  |  499  |  4344
>> 5  |  1464  |  1606
>> 10  |  1475  |  1599
>> 12  |  1465  |  1790
>>
>> Note that all values of execution time are in seconds.
>
> I compared this experiment with non-partitioned database and following
> is the result,
> Query | Non-partitioned head
> 3  |  1752
> 4  |  315
> 5  |  2319
> 10 | 1535
> 12 | 1739
>
> In summary, the query that appears slowest in partitioned database is
> not so otherwise. It is good to see that in Q4 partition-wise join
> helps in achieving performance closer to it's non-partitioned case,
> otherwise partitioning alone causes it to suffer greatly. Apart from
> Q4 it does not looks like partitioning hurts anywhere else, though the
> maximum improvement is ~35% for Q5.
> Another point to note here is that the performance on partitioned and
> unpartitioned heads are quite close (except Q4) which is something
> atleast I wasn't expecting. It looks like we need not to partition the
> tables anyway, or atleast this set of queries doesn't benefit from
> partitioning. Please let me know if somebody has better ideas on how
> partitioning schemes should be applied to make it more beneficial for
> these queries.

Just partitioning is not expected to improve query performance (but we
still see some performance improvement). Partitioning + partition-wise
operations, pruning is expected to show performance gains. IIUC the
results you reported, Q3 takes 1752 seconds with non-partitioned head,
with partitioning it completes in 1631 seconds and with partition-wise
join it completes in 1455, so net improvement because of partitioning
is 300 seconds is almost 16% improvement, which is a lot for very
large data. So, except Q4, every query improves when the tables are
partitioned. Am I interpreting the results correctly?

There may be some other way of partitioning, which may give better
results, but I think what we have now shows the importance of
partitioning in case of very large data e.g. scale 300 TPCH.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-21 Thread Ashutosh Bapat
On Mon, Sep 18, 2017 at 10:18 AM, Rafia Sabih
 wrote:
>>
>
>  Limit  (cost=83341943.28..83341943.35 rows=1 width=92) (actual
> time=1556989.996..1556989.997 rows=1 loops=1)
>->  Finalize GroupAggregate  (cost=83341943.28..83342723.24
> rows=10064 width=92) (actual time=1556989.994..1556989.994 rows=1
> loops=1)
>  Group Key: n1.n_name, n2.n_name, (date_part('year'::text,
> (lineitem_001.l_shipdate)::timestamp without time zone))
>  ->  Sort  (cost=83341943.28..83342043.92 rows=40256 width=92)
> (actual time=1556989.910..1556989.911 rows=6 loops=1)
>Sort Key: n1.n_name, n2.n_name,
> (date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without
> time zone))
>Sort Method: quicksort  Memory: 27kB
>->  Gather  (cost=83326804.81..83338864.31 rows=40256
> width=92) (actual time=1550598.855..1556989.760 rows=20 loops=1)
>  Workers Planned: 4
>  Workers Launched: 4
>
> AFAICU the node above sort is group-aggregate and then there is limit,
> and the number of rows for sort node in explain analyse is returned
> number of rows. So, what is happening here is once one group is
> completed it is aggregated and fetched by limit, now there is no need
> for sort to return any more rows and hence the result.

Thanks for your explanation. That makes sense. I forgot about LIMIT node on top.

I debugged the plans today and performed some experiments. Here are my
observations

The join order with and without partition-wise join changes. Without
partition-wise join it is
(lineitem, (suppliers, nation1)), (orders, (customer, nation2)). The
join (lineitem, (suppliers, nation1)) is executed by one gather node
and (orders, (customer, nation2)) is executed by other. Thus the plan
has two gather nodes, which feed to the topmost join.
With partition-wise join the join order is ((lineitem, orders),
(supplier, nation1)), (customer, nation2). The join (lineitem, orders)
uses partition-wise join. This plan executes the whole join tree along
with partial group aggregation under a gather merge.

The rows estimated for various nodes under Gather/GatherMerge are
different from the actual rows e.g.
->  Hash Join  (cost=113164.47..61031454.40 rows=10789501 width=46)
(actual time=3379.931..731987.943 rows=8744357 loops=5) (in
non-partition-wise join plan) OR
->  Append  (cost=179532.36..80681785.95 rows=134868761 width=24)
(actual time=9437.573..1360219.567 rows=109372134 loops=5) (in
partition-wise join plan).
I first thought that this is a real estimation error and spent some
time investigating the estimation error. But eventually realised that
this is how a parallel query plan reports, when I saw that Gather node
estimated correct number of rows even though the nodes under it showed
this difference. Here's the explanation of this report. There are 4
parallel workers, so, the leaders contribution would be estimated to
be 0 by get_parallel_divisor(). So these estimates are per worker and
so the total estimated rows produced by any of the nodes is 4 times
the reported. But when the query actually runs, the leader also
participates, so number of loops = 5 and the actual rows reported are
(total actual rows) / (number of loops i.e. number of backends that
executed the query). The total estimates rows and total actual rows
are roughly equal. So there's no real estimation error, as I thought
earlier. May be we want to make EXPLAIN (ANALYZE) output easier to
understand.

When I tried the same query on laptop with scale 20, I found that the
leader is really contributing as much as other workers. So, the
partial paths were really created based on an estimate which was 20%
off. The cost difference between partition-wise join plan and
non-partition-wise join plan is hardly 1.5%. So, it's possible that if
we correct this estimation error, partition-wise join plan won't be
chosen because of it will have a higher cost. Remember there are two
gather nodes in non-partition-wise join plan and partition-wise join
plan has one gather. So, non-partition-wise join path gets the 20%
decreased estimates twice and partition-wise join gets it only once.

The explain (analyze, verbose) of a parallel node looks like
->  Parallel Seq Scan on public.lineitem_002  (cost=0.00..168752.99
rows=573464 width=24) (actual time=1.395..3075.485 rows=454464
loops=5)
 Filter:
((lineitem_002.l_shipdate >= '1995-01-01'::date) AND
(lineitem_002.l_shipdate <= '1996-12-31'::date))
 Rows Removed by Filter: 1045065
 Worker 0: actual
time=3.358..3131.426 rows=458267 loops=1
 Worker 1: actual
time=0.860..3146.282 rows=447231 loops=1
 Worker 2: actual
time=1.317..3123.646 rows=489960 loops=1
 Worker 3: actual
time=0.927..

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-21 Thread Robert Haas
On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat
 wrote:
> About your earlier comment of making build_joinrel_partition_info()
> simpler. Right now, the code assumes that partexprs or
> nullable_partexpr can be NULL when either of them is not populated.
> That may be saves a sizeof(pointer) * (number of keys) byes of memory.
> Saving that much memory may not be worth the complexity of code. So,
> we may always allocate memory for those arrays and fill it with NIL
> values when there are no key expressions to populate those. That will
> simplify the code. I haven't done that change in this patchset. I was
> busy debugging the Q7 regression. Let me know your comments about
> that.

Hmm, I'm not sure that's the best approach, but let me look at it more
carefully before I express a firm opinion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-20 Thread Robert Haas
On Tue, Sep 19, 2017 at 5:47 AM, Ashutosh Bapat
 wrote:
> Done.

Committed 0001 with extensive editorialization.  I did not think it
was a good idea to include a partition.h a file in src/include/nodes,
so I worked around that.  The include of pg_inherits_fn.h was
unneeded.  I rewrote a lot of the comments and made some other style
tweaks.

Don't look now, but I think it might be about time for the main act.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-20 Thread Rajkumar Raghuwanshi
On Wed, Sep 20, 2017 at 3:13 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Wed, Sep 20, 2017 at 9:44 AM, Thomas Munro
>  wrote:
> > 2.  What queries in the 0008 patch are hitting lines that 0007 doesn't
> hit?
> >
> > I thought about how to answer questions like this and came up with a
> > shell script that (1) makes computers run really hot for quite a long
> > time and (2) tells you which blocks of SQL hit which lines of C.
> > Please find attached the shell script and its output.  The .sql files
> > have been annotated with "block" numbers (blocks being chunks of SQL
> > stuff separated by blank lines), and the C files annotated with
> > references to those block numbers where A = block 
> > partition_join.sql and B = block  in partition_join_extras.sql.
> >
> > Then to find lines that B queries hit but A queries don't and know
> > which particular queries hit them, you might use something like:
> >
> >   grep -v 'SQL blocks: .*A[0-9]' < joinpath.c.aggregated_coverage | \
> >   grep 'SQL blocks: .*B[0-9]'
> >
>
> Thanks for this. It generates a lot of output (970 lines over all the
> coverage files). It will take some time for getting anything
> meaningful out of this. May be there's some faster way by looking at
> the lines that are covered by B but not A. BTW, I checked those lines
> to see if there could be any bug there. But I don't see what could go
> wrong with those lines.
>
> I have also tried to find test cases in B which hits some extra line which
is not
hitting by A, with the help of results attached by Thomas in
coverage.tarball_FILES.
It took lot of time but I am able to find some test cases. which if adding
in partition_join.sql
increasing no of lines hit by 14. but for hitting these 14 extra line
attached patch is doing
900+ line inserts in partition_join.sql and partition_join.out file.

I have used gcov-lcov to find coverage for files changed by
partition-wise-join patches
with and without attached patch which is below.


*with existing partition_join.sql* *partition_join.sql + some test cases of
partition_join_extra.sql*
*Modifed Files* *Line Coverage* *Functions* *Line Coverage* *Functions*
src/backend/optimizer/geqo 79.4 % 269/339 96.6 % 28/29 79.4 % 269/339 96.6 %
28/29
src/backend/optimizer/path/allpaths.c 92.3 % 787 / 853 95.5 % 42 / 44
92.6 % 790
/ 853 95.5 % 42 / 44
src/backend/optimizer/path/costsize.c 96.8 % 1415 / 1462 98.4 % 61 / 62
96.9 % 1416 / 1462 98.4 % 61 / 62
src/backend/optimizer/path/joinpath.c 95.5 % 404 / 423 100.0 % 16 / 16
95.5 % 404 / 423 100.0 % 16 / 16
src/backend/optimizer/path/joinrels.c 92.5 % 422 / 456 100.0 % 16 / 16
93.0 % 424 / 456 100.0 % 16 / 16
src/backend/optimizer/plan/createplan.c 90.9 % 1928 / 2122 96.3 % 103 / 107
91.0 % 1930 / 2122 96.3 % 103 / 107
src/backend/optimizer/plan/planner.c 94.9 % 1609 / 1696 97.6 % 41 / 42
94.9 % 1609 / 1696 97.6 % 41 / 42
src/backend/optimizer/plan/setrefs.c 91.3 % 806 / 883 94.3 % 33 / 35 91.3 % 806
/ 883 94.3 % 33 / 35
src/backend/optimizer/prep/prepunion.c 95.5 % 661 / 692 100.0 % 25 / 25
95.5 % 661 / 692 100.0 % 25 / 25
src/backend/optimizer/util/pathnode.c 88.7 % 1144 / 1290 98.1 % 52 / 53
88.8 % 1146 / 1290 98.1 % 52 / 53
src/backend/optimizer/util/placeholder.c 96.5 % 139 / 144 100.0 % 10 / 10
96.5 % 139 / 144 100.0 % 10 / 10
src/backend/optimizer/util/plancat.c 89.0 % 540 / 607 94.7 % 18 / 19 89.6 % 544
/ 607 94.7 % 18 / 19
src/backend/optimizer/util/relnode.c 95.3 % 548 / 575 100.0 % 24 / 24
95.3 % 548
/ 575 100.0 % 24 / 24
src/backend/utils/misc/guc.c 67.4 % 1536 / 2278 89.7 % 113 / 126 67.4 % 1536
/ 2278 89.7 % 113 / 126


Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 9fec170..ab411b6 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -584,6 +584,215 @@ SELECT t1.a, ss.t2a, ss.t2c FROM prt1 t1 LEFT JOIN LATERAL
  550 | | 
 (12 rows)
 
+-- join with aggregate
+EXPLAIN (VERBOSE, COSTS OFF)
+select t1.a, count(t2.*) from prt1 t1 left join prt1 t2 on (t1.a = t2.a) where t1.a % 25 = 0 group by t1.a;
+  QUERY PLAN   
+---
+ GroupAggregate
+   Output: t1.a, count(((t2.*)::prt1))
+   Group Key: t1.a
+   ->  Sort
+ Output: t1.a, ((t2.*)::prt1)
+ Sort Key: t1.a
+ ->  Append
+   ->  Hash Right Join
+ Output: t1.a, ((t2.*)::prt1)
+ Hash Cond: (t2.a = t1.a)
+ ->  Seq Scan on public.prt1_p1 t2
+   Output: t2.*, t2.a
+ ->  Hash
+   Output: t1.a
+   ->  Seq Scan on public.prt1_p1 t1
+ Output: t1.a
+ Filter: ((t1.a % 25) = 0)
+   ->  Hash 

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-20 Thread Jeevan Chalke
On Tue, Sep 19, 2017 at 3:17 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Tue, Sep 19, 2017 at 2:35 AM, Robert Haas 
> wrote:
> > On Mon, Sep 18, 2017 at 8:02 AM, Ashutosh Bapat
> >  wrote:
> >> partition pruning might need partexprs look up relevant quals, but
> >> nullable_partexprs doesn't have any use there. So may be we should add
> >> nullable_partexpr to RelOptInfo as part of 0002 (partition-wise join
> >> implementation) instead of 0001. What do you think?
> >
> > +1.
>
> Done.
>
> >
> >>> - I'm not entirely sure whether maintaining partexprs and
> >>> nullable_partexprs is the right design.  If I understand correctly,
> >>> whether or not a partexpr is nullable is really a per-RTI property,
> >>> not a per-expression property.  You could consider something like
> >>> "Relids nullable_rels".
> >>
> >> That's true. However in order to decide whether an expression falls on
> >> nullable side of a join, we will need to call pull_varnos() on it and
> >> check the output against nullable_rels. Separating the expressions
> >> themselves avoids that step.
> >
> > Good point.  Also, I'm not sure about cases like this:
> >
> > SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE
> > a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
> >
> > Suppose the relations are all partitioned by (x, y) but that the =
> > operator is not strict.  A partition-wise join is valid between a and
> > b, but we can't regard w as partitioned any more, because w.x might
> > contain nulls in partitions where the partitioning scheme wouldn't
> > allow them.  On the other hand, if the subquery were to select a.x,
> > a.y then clearly it would be fine: there would be no possibility of a
> > NULL having been substituted for a proper value.
> >
> > What if the subquery selected a.x, b.y?  Initially, I thought that
> > would be OK too, because of the fact that the a.y = b.y clause is in
> > the WHERE clause rather than the join condition.  But on further
> > thought I think that probably doesn't work, because with = being a
> > non-strict operator there's no guarantee that it would remove any
> > nulls introduced by the left join.  Of course, if the subselect had a
> > WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT
> > list mention those columns would be fine.
> >
>
> I am actually not sure whether we can use partition-wise join for a
> LEFT JOIN b when the partition key equalities are spread across ON and
> WHERE clauses. I am not able to find any example against it, but I am
> not able to prove it as well. The reference I used for partition-wise
> join [1], mentions JOIN conditions i.e. ON clause conditions. But all
> the examples used in that paper are that of INNER join. So, I am not
> sure what exactly the authors meant by JOIN conditions. Right now I am
> restricting the patch to work with only conditions in the ON clause.
>
> Practically most of the operators are strict. OUTER join's WHERE
> clause has any partition key equality with strict operator, optimizer
> will turn
> that OUTER join into an INNER one, turning all clauses into join
> clauses. That will enable partition-wise join. So, the current
> restriction doesn't restrict any practical cases.
>
> OTOH, I have seen that treating ON and WHERE clauses as same for an
> OUTER join leads to surprising results. So, I am leaning to treat them
> separate for partition-wise join as well and only use ON clause
> conditions for partition-wise join. If we get complaints about
> partition-wise join not being picked we will fix them after proving
> that it's not harmful. Lifting that restriction is not so difficult.
> have_partition_key_equijoin() ignores "pushed down" quals. We have to
> just change that condition.
>
> Your last sentence about a clause b.x IS NOT NULL or b.y IS NOT NULL
> is interesting. If those conditions are in ON clause, we may still
> have a result where b.x and b.y as NULL when no row in "a" matches a
> row in "b". If those conditions are in WHERE clause, I think optimizer
> will turn the join into an INNER join irrespective of whether the
> equality operator is strict.
>
> >
> >> If partition-wise join is disabled, partition-wise aggregates,
> >> strength reduction of MergeAppend won't be possible on a join tree,
> >> but those will be possible on a base relation. Even if partition-wise
> >> join enabled, one may want to disable other partition-wise
> >> optimizations individually. So, they are somewhat independent
> >> switches. I don't think we should bundle all of those into one.
> >> Whatever names we choose for those GUCs, I think they should have same
> >> naming convention e.g. "partition_wise_xyz". I am open to suggestions
> >> about the names.
> >
> > I think the chances of you getting multiple GUCs for different
> > partition-wise optimizations past Tom are pretty low.
>
> We do have enable_hashjoin and enable_hashagg to control use of
> hashing for aggregate and join. On similar lines we

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-20 Thread Ashutosh Bapat
On Tue, Sep 19, 2017 at 3:17 PM, Ashutosh Bapat
 wrote:
>>
 - I'm not entirely sure whether maintaining partexprs and
 nullable_partexprs is the right design.  If I understand correctly,
 whether or not a partexpr is nullable is really a per-RTI property,
 not a per-expression property.  You could consider something like
 "Relids nullable_rels".
>>>
>>> That's true. However in order to decide whether an expression falls on
>>> nullable side of a join, we will need to call pull_varnos() on it and
>>> check the output against nullable_rels. Separating the expressions
>>> themselves avoids that step.
>>
>> Good point.  Also, I'm not sure about cases like this:
>>
>> SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE
>> a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
>>
>> Suppose the relations are all partitioned by (x, y) but that the =
>> operator is not strict.  A partition-wise join is valid between a and
>> b, but we can't regard w as partitioned any more, because w.x might
>> contain nulls in partitions where the partitioning scheme wouldn't
>> allow them.  On the other hand, if the subquery were to select a.x,
>> a.y then clearly it would be fine: there would be no possibility of a
>> NULL having been substituted for a proper value.
>>
>> What if the subquery selected a.x, b.y?  Initially, I thought that
>> would be OK too, because of the fact that the a.y = b.y clause is in
>> the WHERE clause rather than the join condition.  But on further
>> thought I think that probably doesn't work, because with = being a
>> non-strict operator there's no guarantee that it would remove any
>> nulls introduced by the left join.  Of course, if the subselect had a
>> WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT
>> list mention those columns would be fine.
>>
>

In my previous reply to this, I probably didn't answer your question
while I explained the restriction on where equality conditions on
partition keys can appear. Here's answer to your questions assuming
those restrictions don't exist. Actually in the example you have
given, optimizer flattens w as a LJ b which kind of makes the
explanations below a bit complicated.

1. SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x
WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
partition-wise join will be possible between a and b but not between w
and c for the reasons you have explained above.
2. SELECT * FROM (SELECT a.x, a.y FROM a LEFT JOIN b ON a.x = b.x
WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
partition-wise join will be possible between a and b and also between
w and c for the reasons you have explained above.
3. SELECT * FROM (SELECT a.x, b.y FROM a LEFT JOIN b ON a.x = b.x
WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
partition-wise join will be possible between a and b but not w and c
as you have explained.

In this case b.x and b.y will appear as nullable_partexprs in w
(represented as a LJ b in optimizer) and a.x and a.y will appear in
partexprs. Depending upon what gets projected out of w, the join
between w and c will use corresponding keys for equality conditions.
Since the operator is non-strict, any expression which is part of
nullable_partexprs will be discarded in
match_expr_to_partition_keys().

Hope that helps.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-20 Thread Ashutosh Bapat
On Wed, Sep 20, 2017 at 9:44 AM, Thomas Munro
 wrote:
>
> The main areas of uncovered lines are: code in
> get_wholerow_ref_from_convert_row_type() and code that calls it, and
> per node type cases in reparameterize_path_by_child().  It seems like
> the former could use a test case, and I wonder if there is some way we
> could write "flat-copy and then apply recursively to all subpaths"
> code like this without having to handle these cases explicitly.  There
> are a couple of other tiny return cases other than just sanity check
> errors which it might be interesting to hit too.

Under the debugger I checked that the test in partition_join.sql
-- left outer join, with whole-row reference
EXPLAIN (COSTS OFF)
SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b
= 0 ORDER BY t1.a, t2.b;
SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b
= 0 ORDER BY t1.a, t2.b;
covers get_wholerow_ref_from_convert_row_type(). But it doesn't cover
a couple of lines in the case of nested ConvertRowTypeExpr in that
function. We can add/modify a testcase in multi-level partitioned
table section to cover that.

reparameterize_path_by_child() coverage is hard. It would require that
many different kinds of paths survive in lower joins in the join tree.
It's hard to come up with queries that would do that with limited
amount of data and a reasonable number of tests. Me and Thomas
discussed about his suggestion about "flat-copy and then apply
recursively to all subpaths" which he sees as a path tree mutator. It
won't improve the test coverage. Like expression_tree_mutator() path
mutation is not that widely used phenomenon, so we do not yet know
what should be the characteristics of a path mutator could be. In case
we see more of path mutation code in future, it's an idea worth
considering.

>
> 2.  What queries in the 0008 patch are hitting lines that 0007 doesn't hit?
>
> I thought about how to answer questions like this and came up with a
> shell script that (1) makes computers run really hot for quite a long
> time and (2) tells you which blocks of SQL hit which lines of C.
> Please find attached the shell script and its output.  The .sql files
> have been annotated with "block" numbers (blocks being chunks of SQL
> stuff separated by blank lines), and the C files annotated with
> references to those block numbers where A = block 
> partition_join.sql and B = block  in partition_join_extras.sql.
>
> Then to find lines that B queries hit but A queries don't and know
> which particular queries hit them, you might use something like:
>
>   grep -v 'SQL blocks: .*A[0-9]' < joinpath.c.aggregated_coverage | \
>   grep 'SQL blocks: .*B[0-9]'
>

Thanks for this. It generates a lot of output (970 lines over all the
coverage files). It will take some time for getting anything
meaningful out of this. May be there's some faster way by looking at
the lines that are covered by B but not A. BTW, I checked those lines
to see if there could be any bug there. But I don't see what could go
wrong with those lines.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-20 Thread Rafia Sabih
On Tue, Sep 19, 2017 at 3:50 PM, Alvaro Herrera  wrote:
> Rafia Sabih wrote:
>
>> On completing the benchmark for all queries for the above mentioned
>> setup, following performance improvement can be seen,
>> Query | Patch | Head
>> 3  | 1455  |  1631
>> 4  |  499  |  4344
>> 5  |  1464  |  1606
>> 10  |  1475  |  1599
>> 12  |  1465  |  1790
>>
>> Note that all values of execution time are in seconds.
>> To summarise, apart from Q4, all other queries are showing somewhat
>> 10-20% improvement.
>
> Saving 90% of time on the slowest query looks like a worthy improvement
> on its own right.  However, you're reporting execution time only, right?
> What happens to planning time?  In a quick look,

Definitely. The planning time issue has been discussed upthread,

On Mon, Mar 20, 2017 at 12:07 PM, Rafia Sabih
 wrote:

> Another minor thing to note that is planning time is almost twice with
> this patch, though I understand that this is for scenarios with really
> big 'big data' so this may not be a serious issue in such cases, but
> it'd be good if we can keep an eye on this that it doesn't exceed the
> computational bounds for a really large number of tables..

To which Robert replied as,

Yes, this is definitely going to use significant additional planning
time and memory.  There are several possible strategies for improving
that situation, but I think we need to get the basics in place first.
That's why the proposal is now to have this turned off by default.
People joining really big tables that happen to be equipartitioned are
likely to want to turn it on, though, even before those optimizations
are done.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-19 Thread Alvaro Herrera
Rafia Sabih wrote:

> On completing the benchmark for all queries for the above mentioned
> setup, following performance improvement can be seen,
> Query | Patch | Head
> 3  | 1455  |  1631
> 4  |  499  |  4344
> 5  |  1464  |  1606
> 10  |  1475  |  1599
> 12  |  1465  |  1790
> 
> Note that all values of execution time are in seconds.
> To summarise, apart from Q4, all other queries are showing somewhat
> 10-20% improvement.

Saving 90% of time on the slowest query looks like a worthy improvement
on its own right.  However, you're reporting execution time only, right?
What happens to planning time?  In a quick look,

$ grep 'Planning time' pg_part_*/4*
pg_part_head/4_1.out: Planning time: 3390.699 ms
pg_part_head/4_2.out: Planning time: 194.211 ms
pg_part_head/4_3.out: Planning time: 210.964 ms
pg_part_head/4_4.out: Planning time: 4150.647 ms
pg_part_patch/4_1.out: Planning time: 7577.247 ms
pg_part_patch/4_2.out: Planning time: 312.421 ms
pg_part_patch/4_3.out: Planning time: 304.697 ms
pg_part_patch/4_4.out: Planning time: 269.778 ms

I think the noise in these few results is too large to draw any
conclusions.  Maybe a few dozen runs of EXPLAIN (w/o ANALYZE) would tell
something significant?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-18 Thread Robert Haas
On Mon, Sep 18, 2017 at 8:02 AM, Ashutosh Bapat
 wrote:
> partition pruning might need partexprs look up relevant quals, but
> nullable_partexprs doesn't have any use there. So may be we should add
> nullable_partexpr to RelOptInfo as part of 0002 (partition-wise join
> implementation) instead of 0001. What do you think?

+1.

>> - I'm not entirely sure whether maintaining partexprs and
>> nullable_partexprs is the right design.  If I understand correctly,
>> whether or not a partexpr is nullable is really a per-RTI property,
>> not a per-expression property.  You could consider something like
>> "Relids nullable_rels".
>
> That's true. However in order to decide whether an expression falls on
> nullable side of a join, we will need to call pull_varnos() on it and
> check the output against nullable_rels. Separating the expressions
> themselves avoids that step.

Good point.  Also, I'm not sure about cases like this:

SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE
a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;

Suppose the relations are all partitioned by (x, y) but that the =
operator is not strict.  A partition-wise join is valid between a and
b, but we can't regard w as partitioned any more, because w.x might
contain nulls in partitions where the partitioning scheme wouldn't
allow them.  On the other hand, if the subquery were to select a.x,
a.y then clearly it would be fine: there would be no possibility of a
NULL having been substituted for a proper value.

What if the subquery selected a.x, b.y?  Initially, I thought that
would be OK too, because of the fact that the a.y = b.y clause is in
the WHERE clause rather than the join condition.  But on further
thought I think that probably doesn't work, because with = being a
non-strict operator there's no guarantee that it would remove any
nulls introduced by the left join.  Of course, if the subselect had a
WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT
list mention those columns would be fine.

>> - The naming of enable_partition_wise_join might also need some
>> thought.  What happens when we also have partition-wise aggregate?
>> What about the proposal to strength-reduce MergeAppend to Append --
>> would that use this infrastructure?  I wonder if we out to call this
>> enable_partition_wise or enable_partition_wise_planning to make it a
>> bit more general.  Then, too, I've never really liked having
>> partition_wise in the GUC name because it might make someone think
>> that it makes you partitions have a lot of wisdom.  Removing the
>> underscore might help: partitionwise.  Or maybe there is some whole
>> different name that would be better.  If anyone wants to bikeshed,
>> now's the time.
>
> partitions having a lot of wisdom would be wise_partitions rather than
> partition_wise ;).

Well, maybe it's the joins that have a lot of wisdom, then.
enable_partition_wise_join could be read to mean that we should allow
partitioning of joins, but only if those joins know the secret of true
happiness.

> If partition-wise join is disabled, partition-wise aggregates,
> strength reduction of MergeAppend won't be possible on a join tree,
> but those will be possible on a base relation. Even if partition-wise
> join enabled, one may want to disable other partition-wise
> optimizations individually. So, they are somewhat independent
> switches. I don't think we should bundle all of those into one.
> Whatever names we choose for those GUCs, I think they should have same
> naming convention e.g. "partition_wise_xyz". I am open to suggestions
> about the names.

I think the chances of you getting multiple GUCs for different
partition-wise optimizations past Tom are pretty low.

>> - Instead of reorganizing add_paths_to_append_rel as you did, could
>> you just add an RTE_JOIN case to the switch?  Not sure if there's some
>> problem with that idea, but it seems like it might come out nicer.
>
> RTE_JOIN is created only for joins specified using JOIN clause i.e
> syntactic joins. The joins created during query planner like rel1,
> rel2, rel3 do not have RTE_JOIN. So, we can't use RTE_JOIN there.

OK, never mind that then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-18 Thread Ashutosh Bapat
On Sat, Sep 16, 2017 at 2:53 AM, Robert Haas  wrote:
> On Fri, Sep 15, 2017 at 6:11 AM, Ashutosh Bapat
>  wrote:
>> Thanks a lot Robert.
>>
>> Here are rebased patches.
>
> I didn't get quite as much time to look at these today as I would have
> liked, but here's what I've got so far.
>
> Comments on 0001:
>
> - In the RelOptInfo, part_oids is defined in a completely different
> part of the structure than nparts, but you can't use it without nparts
> because you don't know how long it is.  I suggest moving the
> definition to just after nparts.
>
> - On the other hand, maybe we should just remove it completely.  I
> don't see it in any of the subsequent patches.  If it's used by the
> advanced matching code, let's leave it out of 0001 for now and add it
> back after the basic feature is committed.

No, it's not used by advanced partition matching code. It was used by
to match OIDs with the child rels to order those in the array. But now
that we are expanding in EIBO fashion, it is not useful. Should have
removed it earlier. Removed now.

>
> - Similarly, partsupfunc isn't used by the later patches either.  It
> seems it could also be removed, at least for now.

It's used by advanced partition matching code to compare bounds. It
will be required by partition pruning patch. But removed for now.

>
> - The comment for partexprs isn't very clear about how the lists
> inside the array work.  My understanding is that the lists have as
> many members as the partition key has columns/expressions.

Actually we are doing some preparation for partition-wise join here.
partexprs and nullable_partexprs are used in partition-wise join
implementation patch. I have updated prologue of RelOptInfo structure
with the comments like below

 * Note: A base relation will always have only one set of partition keys. But a
 * join relation has as many sets of partition keys as the number of joining
 * relations. The number of partition keys is given by
 * "part_scheme->partnatts". "partexprs" and "nullable_partexprs" are arrays
 * containing part_scheme->partnatts elements. Each element of the array
 * contains a list of partition key expressions. For a base relation each list
 * contains only one expression.  For a join relation each list contains at
 * most as many expressions as the joining relations. The expressions in a list
 * at a given position in the array correspond to the partition key at that
 * position. "partexprs" contains partition keys of non-nullable joining
 * relations and "nullable_partexprs" contains partition keys of nullable
 * joining relations. For a base relation only "partexprs" is populated.

Let me know this looks fine. The logic to match the partition keys of
joining relations in have_partkey_equi_join() and
match_expr_to_partition_keys() becomes simpler if we arrange the
partition key expressions as array indexed by position of partition
key and each array element as list of partition key expressions at
that position.

partition pruning might need partexprs look up relevant quals, but
nullable_partexprs doesn't have any use there. So may be we should add
nullable_partexpr to RelOptInfo as part of 0002 (partition-wise join
implementation) instead of 0001. What do you think?

>
> - I'm not entirely sure whether maintaining partexprs and
> nullable_partexprs is the right design.  If I understand correctly,
> whether or not a partexpr is nullable is really a per-RTI property,
> not a per-expression property.  You could consider something like
> "Relids nullable_rels".

That's true. However in order to decide whether an expression falls on
nullable side of a join, we will need to call pull_varnos() on it and
check the output against nullable_rels. Separating the expressions
themselves avoids that step.

>
> Comments on 0002:
>
> - The relationship between deciding to set partition scheme and
> related details and the configured value of enable_partition_wise_join
> needs some consideration.  If the only use of the partition scheme is
> partition-wise join, there's no point in setting it even for a baserel
> unless enable_partition_wise_join is set -- but if there are other
> important uses for that data, such as Amit's partition pruning work,
> then we might want to always set it.  And similarly for a join: if the
> details are only needed in the partition-wise join case, then we only
> need to set them in that case, but if there are other uses, then it's
> different.  If it turns out that setting these details for a baserel
> is useful in other cases but that it's only for a joinrel in the
> partition-wise join case, then the patch gets it exactly right.  But
> is that correct?  I'm not sure.

Partition scheme contains the information about data types of
partition keys, which is required to compare partition bounds.
Partition pruning will need to compare constants with partition bounds
and hence will need information contained in partition scheme. So, we
will need to set it for base relations whether or 

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-15 Thread Thomas Munro
On Sat, Sep 16, 2017 at 9:23 AM, Robert Haas  wrote:
> On the overall patch set:
>
> - I am curious to know how this has been tested.  How much of the new
> code is covered by the tests in 0007-Partition-wise-join-tests.patch?
> How much does coverage improve with
> 0008-Extra-testcases-for-partition-wise-join-NOT-FOR-COMM.patch?  What
> code, if any, is not covered by either of those test suites?  Could we
> do meaningful testing of this with something like Andreas
> Seltenreich's sqlsmith?

FWIW I'm working on an answer to both of those question, but keep
getting distracted by other things catching on fire...

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-15 Thread Robert Haas
On Fri, Sep 15, 2017 at 6:11 AM, Ashutosh Bapat
 wrote:
> Thanks a lot Robert.
>
> Here are rebased patches.

I didn't get quite as much time to look at these today as I would have
liked, but here's what I've got so far.

Comments on 0001:

- In the RelOptInfo, part_oids is defined in a completely different
part of the structure than nparts, but you can't use it without nparts
because you don't know how long it is.  I suggest moving the
definition to just after nparts.

- On the other hand, maybe we should just remove it completely.  I
don't see it in any of the subsequent patches.  If it's used by the
advanced matching code, let's leave it out of 0001 for now and add it
back after the basic feature is committed.

- Similarly, partsupfunc isn't used by the later patches either.  It
seems it could also be removed, at least for now.

- The comment for partexprs isn't very clear about how the lists
inside the array work.  My understanding is that the lists have as
many members as the partition key has columns/expressions.

- I'm not entirely sure whether maintaining partexprs and
nullable_partexprs is the right design.  If I understand correctly,
whether or not a partexpr is nullable is really a per-RTI property,
not a per-expression property.  You could consider something like
"Relids nullable_rels".

Comments on 0002:

- The relationship between deciding to set partition scheme and
related details and the configured value of enable_partition_wise_join
needs some consideration.  If the only use of the partition scheme is
partition-wise join, there's no point in setting it even for a baserel
unless enable_partition_wise_join is set -- but if there are other
important uses for that data, such as Amit's partition pruning work,
then we might want to always set it.  And similarly for a join: if the
details are only needed in the partition-wise join case, then we only
need to set them in that case, but if there are other uses, then it's
different.  If it turns out that setting these details for a baserel
is useful in other cases but that it's only for a joinrel in the
partition-wise join case, then the patch gets it exactly right.  But
is that correct?  I'm not sure.

- The naming of enable_partition_wise_join might also need some
thought.  What happens when we also have partition-wise aggregate?
What about the proposal to strength-reduce MergeAppend to Append --
would that use this infrastructure?  I wonder if we out to call this
enable_partition_wise or enable_partition_wise_planning to make it a
bit more general.  Then, too, I've never really liked having
partition_wise in the GUC name because it might make someone think
that it makes you partitions have a lot of wisdom.  Removing the
underscore might help: partitionwise.  Or maybe there is some whole
different name that would be better.  If anyone wants to bikeshed,
now's the time.

- It seems to me that build_joinrel_partition_info() could be
simplified a bit.  One thing is that list_copy() is perfectly capable
of handling a NIL input, so there's no need to test for that before
calling it.

Comments on 0003:

- Instead of reorganizing add_paths_to_append_rel as you did, could
you just add an RTE_JOIN case to the switch?  Not sure if there's some
problem with that idea, but it seems like it might come out nicer.

On the overall patch set:

- I am curious to know how this has been tested.  How much of the new
code is covered by the tests in 0007-Partition-wise-join-tests.patch?
How much does coverage improve with
0008-Extra-testcases-for-partition-wise-join-NOT-FOR-COMM.patch?  What
code, if any, is not covered by either of those test suites?  Could we
do meaningful testing of this with something like Andreas
Seltenreich's sqlsmith?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-15 Thread Ashutosh Bapat
On Fri, Sep 15, 2017 at 2:09 PM, Rafia Sabih
 wrote:
> On TPC-H benchmarking of this patch, I found a regression in Q7. It
> was taking some 1500s with the patch and some 900s without the patch.
> Please find the attached pwd_reg.zip for the output of explain analyse
> on head and with patch.
>
> The experimental settings used were,
> commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629
> patch-version = v26
>
> Server settings:
> work_mem = 1GB
> shared_buffers = 10GB
> effective_cache_size = 10GB
> max_parallel_workers_per_gather = 4
>
> Partitioning information:
> Partitioning scheme = by range
> Number of partitions in lineitem and orders table = 106
> partition key for lineitem = l_orderkey
> partition key for orders = o_orderkey

I observe that with partition-wise join patch the planner is using
GatherMerge along-with partition-wise join and on head its not using
GatherMerge. Just to make sure that its partition-wise join which is
causing regression and not GatherMerge, can you please run the query
with enable_gathermerge = false?

I see following lines explain analyze output 7_1.out without the patch
 ->  Sort  (cost=84634030.40..84638520.55 rows=1796063
width=72) (actual time=1061001.435..1061106.608 rows=437209 loops=1)
   Sort Key: n1.n_name, n2.n_name,
(date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without
time zone))
   Sort Method: quicksort  Memory: 308912kB
   ->  Hash Join  (cost=16080591.94..84447451.72
rows=1796063 width=72) (actual time=252745.701..1057447.219
rows=1749956 loops=1)
Since Sort doesn't filter any rows, we would expect it to output the
same number of rows as hash join underneath it. But the number of rows
differ in this case. I am wondering whether there's some problem with
the explain analyze output itself.

>
> Apart from these there is a regression case on a custom table, on head
> query completes in 20s and with this patch it takes 27s. Please find
> the attached .out and .sql file for the output and schema for the test
> case respectively. I have reported this case before (sometime around
> March this year) as well, but I am not sure if it was overlooked or is
> an unimportant and expected behaviour for some reason.
>

Are you talking about [1]? I have explained about the regression in
[2] and [3]. This looks like an issue with the existing costing model.

[1] 
https://www.postgresql.org/message-id/caogqiimwcjnrunj_fcdbscrtlej-clp7exfzzipe2ut71n4...@mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAFjFpRedUZPa7tKbCLEGK3u5UWdDNQoN=eyfb7ieg5d0d1p...@mail.gmail.com
[3] 
https://www.postgresql.org/message-id/cafjfprejksdcfaeuzjgd79hoetzpz5bkdxljgxr7qznrxx+...@mail.gmail.com
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-14 Thread Amit Langote
On 2017/09/15 4:43, Robert Haas wrote:
> On Thu, Sep 14, 2017 at 8:06 AM, Ashutosh Bapat
>  wrote:
>> I have few changes to multi-level expansion patch as per discussion in
>> earlier mails
> 
> OK, I have committed
> 0002-Multi-level-partitioned-table-expansion.patch with a few cosmetic
> changes.
> 
> Phew, getting that sorted out has been an astonishing amount of work.

Yeah, thanks to both of you.  Now on to other complicated stuff. :)

Regards,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-14 Thread Robert Haas
On Thu, Sep 14, 2017 at 8:06 AM, Ashutosh Bapat
 wrote:
> I have few changes to multi-level expansion patch as per discussion in
> earlier mails

OK, I have committed
0002-Multi-level-partitioned-table-expansion.patch with a few cosmetic
changes.

Phew, getting that sorted out has been an astonishing amount of work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-14 Thread Robert Haas
On Wed, Sep 13, 2017 at 10:57 PM, Amit Langote
 wrote:
> I very much like pcinfo-for-subquery.patch, although I'm not sure if we
> need to create PartitionedChildRelInfo for the sub-query parent RTE as the
> patch teaches add_paths_to_append_rel() to do.  ISTM, nested UNION ALL
> subqueries are flattened way before we get to add_paths_to_append_rel();
> if it could not be flattened, there wouldn't be a call to
> add_paths_to_append_rel() in the first place, because no AppendRelInfos
> would be generated.  See what happens when is_simple_union_all_recurse()
> returns false to flatten_simple_union_all() -- no AppendRelInfos will be
> generated and added to root->append_rel_list in that case.
>
> IOW, there won't be nested AppendRelInfos for nested UNION ALL sub-queries
> like we're setting out to build for multi-level partitioned tables.
>
> So, as things stand today, there can at most be one recursive call of
> add_path_to_append_rel() for a sub-query parent RTE, that is, if its child
> sub-queries contain partitioned tables, but not more.  The other patch
> (multi-level expansion of partitioned tables) will change that, but even
> then we won't need sub-query's own PartitioendChildRelInfo.

OK, let's assume you're correct unless some contrary evidence emerges.
Committed without that part; thanks for the review.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Amit Langote
On 2017/09/14 7:43, Robert Haas wrote:
> On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat
>  wrote:
>> I debugged what happens in case of query "select 1 from t1 union all
>> select 2 from t1;" with the current HEAD (without multi-level
>> expansion patch attached). It doesn't set partitioned_rels in Append
>> path that gets converted into Append plan. Remember t1 is a
>> multi-level partitioned table here with t1p1 as its immediate
>> partition and t1p1p1 as partition of t1p1. So, the
>> set_append_rel_pathlist() recurses once as shown in the following
>> stack trace.
> 
> Nice debugging.

+1.

> I spent some time today looking at this and I think
> it's a bug in v10, and specifically in add_paths_to_append_rel(),
> which only sets partitioned_rels correctly when the appendrel is a
> partitioned rel, and not when it's a subquery RTE with one or more
> partitioned queries beneath it.
> 
> Attached are two patches either one of which will fix it.  First, I
> wrote mechanical-partrels-fix.patch, which just mechanically
> propagates partitioned_rels lists from accumulated subpaths into the
> list used to construct the parent (Merge)AppendPath.  I wasn't entire
> happy with that, because it ends up building multiple partitioned_rels
> lists for the same RelOptInfo.  That seems silly, but there's no
> principled way to avoid it; avoiding it amounts to hoping that all the
> paths for the same relation carry the same partitioned_rels list,
> which is uncomfortable.
> 
> So then I wrote pcinfo-for-subquery.patch.  That patch notices when an
> RTE_SUBQUERY appendrel is processed and accumulates the
> partitioned_rels of its immediate children; in case there can be
> multiple nested levels of subqueries before we get down to the actual
> partitioned rel, it also adds a PartitionedChildRelInfo for the
> subquery RTE, so that there's no need to walk the whole tree to build
> the partitioned_rels list at higher levels, just the immediate
> children.  I find this fix a lot more satisfying.  It adds less code
> and does no extra work in the common case.

I very much like pcinfo-for-subquery.patch, although I'm not sure if we
need to create PartitionedChildRelInfo for the sub-query parent RTE as the
patch teaches add_paths_to_append_rel() to do.  ISTM, nested UNION ALL
subqueries are flattened way before we get to add_paths_to_append_rel();
if it could not be flattened, there wouldn't be a call to
add_paths_to_append_rel() in the first place, because no AppendRelInfos
would be generated.  See what happens when is_simple_union_all_recurse()
returns false to flatten_simple_union_all() -- no AppendRelInfos will be
generated and added to root->append_rel_list in that case.

IOW, there won't be nested AppendRelInfos for nested UNION ALL sub-queries
like we're setting out to build for multi-level partitioned tables.

So, as things stand today, there can at most be one recursive call of
add_path_to_append_rel() for a sub-query parent RTE, that is, if its child
sub-queries contain partitioned tables, but not more.  The other patch
(multi-level expansion of partitioned tables) will change that, but even
then we won't need sub-query's own PartitioendChildRelInfo.

> Notice that the choice of fix we adopt has consequences for your
> 0001-Multi-level-partitioned-table-expansion.patch -- with
> mechanical-partrels-fix.patch, that patch could either associated all
> partitioned_rels with the top-parent or it could work level by level
> and everything would get properly assembled later.  But with
> pcinfo-for-subquery.patch, we need everything associated with the
> top-parent.  That doesn't seem like a problem to me, but it's
> something to note.

I think it's fine.

With 0001-Multi-level-partitioned-table-expansion.patch,
get_partitioned_child_rels() will get called even for non-root partitioned
tables, for which it won't find a valid pcinfo.  I think that patch must
also change its callers to stop Asserting that a valid pcinfo is returned.

Spotted a typo in pcinfo-for-subquery.patch:

+ * A plain relation will alread have

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Robert Haas
On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat
 wrote:
> I debugged what happens in case of query "select 1 from t1 union all
> select 2 from t1;" with the current HEAD (without multi-level
> expansion patch attached). It doesn't set partitioned_rels in Append
> path that gets converted into Append plan. Remember t1 is a
> multi-level partitioned table here with t1p1 as its immediate
> partition and t1p1p1 as partition of t1p1. So, the
> set_append_rel_pathlist() recurses once as shown in the following
> stack trace.

Nice debugging.  I spent some time today looking at this and I think
it's a bug in v10, and specifically in add_paths_to_append_rel(),
which only sets partitioned_rels correctly when the appendrel is a
partitioned rel, and not when it's a subquery RTE with one or more
partitioned queries beneath it.

Attached are two patches either one of which will fix it.  First, I
wrote mechanical-partrels-fix.patch, which just mechanically
propagates partitioned_rels lists from accumulated subpaths into the
list used to construct the parent (Merge)AppendPath.  I wasn't entire
happy with that, because it ends up building multiple partitioned_rels
lists for the same RelOptInfo.  That seems silly, but there's no
principled way to avoid it; avoiding it amounts to hoping that all the
paths for the same relation carry the same partitioned_rels list,
which is uncomfortable.

So then I wrote pcinfo-for-subquery.patch.  That patch notices when an
RTE_SUBQUERY appendrel is processed and accumulates the
partitioned_rels of its immediate children; in case there can be
multiple nested levels of subqueries before we get down to the actual
partitioned rel, it also adds a PartitionedChildRelInfo for the
subquery RTE, so that there's no need to walk the whole tree to build
the partitioned_rels list at higher levels, just the immediate
children.  I find this fix a lot more satisfying.  It adds less code
and does no extra work in the common case.

Notice that the choice of fix we adopt has consequences for your
0001-Multi-level-partitioned-table-expansion.patch -- with
mechanical-partrels-fix.patch, that patch could either associated all
partitioned_rels with the top-parent or it could work level by level
and everything would get properly assembled later.  But with
pcinfo-for-subquery.patch, we need everything associated with the
top-parent.  That doesn't seem like a problem to me, but it's
something to note.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


mechanical-partrels-fix.patch
Description: Binary data


pcinfo-for-subquery.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Ashutosh Bapat
On Wed, Sep 13, 2017 at 12:51 PM, Ashutosh Bapat
 wrote:
> On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas  wrote:
>> On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote
>>  wrote:
>>> In this case, AcquireExecutorLocks will lock all the relations in
>>> PlannedStmt.rtable, which must include all partitioned tables of all
>>> partition trees involved in the query.  Of those, it will lock the tables
>>> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
>>> RowExclusiveLock mode.  PlannedStmt.nonleafResultRelations is a global
>>> list of all partitioned table RT indexes obtained by concatenating
>>> partitioned_rels lists of all ModifyTable nodes involved in the query
>>> (set_plan_refs does that).  We need to distinguish nonleafResultRelations,
>>> because we need to take the stronger lock on a given table before any
>>> weaker one if it happens to appear in the query as a non-result relation
>>> too, to avoid lock strength upgrade deadlock hazard.
>>
>> Hmm.  The problem with this theory in my view is that it doesn't
>> explain why InitPlan() and ExecOpenScanRelation() lock the relations
>> instead of just assuming that they are already locked either by
>> AcquireExecutorLocks or by planning.  If ExecLockNonLeafAppendTables()
>> doesn't really need to take locks, then ExecOpenScanRelation() must
>> not need to do it either.  We invented ExecLockNonLeafAppendTables()
>> on the occasion of removing the scans of those tables which would
>> previously have caused ExecOpenScanRelation() to be invoked, so as to
>> keep the locking behavior unchanged.
>>
>> AcquireExecutorLocks() looks like an odd bit of code to me.  The
>> executor itself locks result tables in InitPlan() and then everything
>> else during InitPlan() and all of the others later on while walking
>> the plan tree -- comments in InitPlan() say that this is to avoid a
>> lock upgrade hazard if a result rel is also a source rel.  But
>> AcquireExecutorLocks() has no such provision; it just locks everything
>> in RTE order.  In theory, that's a deadlock hazard of another kind, as
>> we just talked about in the context of EIBO.  In fact, expanding in
>> bound order has made the situation worse: before, expansion order and
>> locking order were the same, so maybe having AcquireExecutorLocks()
>> work in RTE order coincidentally happened to give the same result as
>> the executor code itself as long as there are no result relations.
>> But this is certainly not true any more.  I'm not sure it's worth
>> expending a lot of time on this -- it's evidently not a problem in
>> practice, or somebody probably would've complained before now.
>>
>> But that having been said, I don't think we should assume that all the
>> locks taken from the executor are worthless because plancache.c will
>> always do the job for us.  I don't know of a case where we execute a
>> saved plan without going through the plan cache, but that doesn't mean
>> that there isn't one or that there couldn't be one in the future.
>> It's not the job of these partitioning patches to whack around the way
>> we do locking in general -- they should preserve the existing behavior
>> as much as possible.  If we want to get rid of the locking in the
>> executor altogether, that's a separate discussion where, I have a
>> feeling, there will prove to be better reasons for the way things are
>> than we are right now supposing.
>>
>
> I agree that it's not the job of these patches to change the locking
> or even get rid of partitioned_rels. In order to continue returning
> partitioned_rels in Append paths esp. in the case of queries involving
> set operations and partitioned table e.g "select 1 from t1 union all
> select 2 from t1;" in which t1 is multi-level partitioned table, we
> need a fix in add_paths_to_append_rels(). The fix provided in [1] is
> correct but we will need a longer explanation of why we have to
> involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation
> is complicated. If we get rid of partitioned_rels, we don't need to
> fix that code in add_paths_to_append_rel().
>
> I suggested that [2]
> -- (excerpt from [2])
>
> Actually, the original problem that caused this discussion started
> with an assertion failure in get_partitioned_child_rels() as
> Assert(list_length(result) >= 1);
>
> This assertion fails if result is NIL when an intermediate partitioned
> table is passed. May be we should assert (result == NIL ||
> list_length(result) == 1) and allow that function to be called even
> for intermediate partitioned partitions for which the function will
> return NIL. That will leave the code in add_paths_to_append_rel()
> simple. Thoughts?
> --
>
> Amit Langote agrees with this. It kind of makes the assertion lame but
> keeps the code sane. What do you think?

I debugged what happens in case of query "select 1 from t1 union all
select 2 from t1;" with the current HEAD (without multi-level
expansion patch attached). It doesn't set partitioned_rels in Append
path th

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Amit Khandekar
On 13 September 2017 at 13:05, Ashutosh Bapat
 wrote:
> On Wed, Sep 13, 2017 at 12:32 PM, Amit Khandekar  
> wrote:
>> Hi,
>>
>> Rafia had done some testing on TPCH queries using Partition-wise join
>> patch along with Parallel Append patch.
>>
>> There, we had observed that for query 4, even though the partition
>> wise joins are under a Parallel Append, the join are all non-partial.
>>
>> Specifically, the partition-wise join has non-partial nested loop
>> joins when actually it was expected to have partial nested loop joins.
>> (The difference can be seen by the observation that the outer relation
>> of that join is scanned by non-parallel Bitmap Heap scan when it
>> should have used Parallel Bitmap Heap Scan).
>>
>> Here is the detailed analysis , including where I think is the issue :
>>
>> https://www.postgresql.org/message-id/CAJ3gD9cZms1ND3p%3DNN%3DhDYDFt_SeKq1htMBhbj85bOmvJwY5fg%40mail.gmail.com
>>
>> All the TPCH results are posted in the same above mail thread.
>
> Can you please check if the attached patch fixes the issue.

Thanks Ashutosh. Yes, it does fix the issue. Partial Nested Loop joins
are generated now. If I see any unexpected differences in the
estimated or actual costs, I will report that in the Parallel Append
thread. As far as Partition-wise join is concerned, this issue is
solved, because Partial nested loop join does get created.

>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Amit Langote
On 2017/09/13 16:21, Ashutosh Bapat wrote:
> On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas  wrote:
>> locks taken from the executor are worthless because plancache.c will
>> always do the job for us.  I don't know of a case where we execute a
>> saved plan without going through the plan cache, but that doesn't mean
>> that there isn't one or that there couldn't be one in the future.
>> It's not the job of these partitioning patches to whack around the way
>> we do locking in general -- they should preserve the existing behavior
>> as much as possible.  If we want to get rid of the locking in the
>> executor altogether, that's a separate discussion where, I have a
>> feeling, there will prove to be better reasons for the way things are
>> than we are right now supposing.
>>
> 
> I agree that it's not the job of these patches to change the locking
> or even get rid of partitioned_rels. In order to continue returning
> partitioned_rels in Append paths esp. in the case of queries involving
> set operations and partitioned table e.g "select 1 from t1 union all
> select 2 from t1;" in which t1 is multi-level partitioned table, we
> need a fix in add_paths_to_append_rels(). The fix provided in [1] is
> correct but we will need a longer explanation of why we have to
> involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation
> is complicated. If we get rid of partitioned_rels, we don't need to
> fix that code in add_paths_to_append_rel().

Yeah, let's get on with setting partitioned_rels in AppendPath correctly
in this patch.  Ashutosh's suggested approach seems fine, although it
needlessly requires to scan root->pcinfo_list.  But it shouldn't be longer
than the number of partitioned tables in the query, so maybe that's fine
too.  At least, it doesn't require us to add code to
add_paths_to_append_rel() that can be pretty hard to wrap one's head around.

That said, we might someday need to look carefully at some things that
Robert mentioned carefully, especially around the order of locks taken by
AcquireExecutorLocks() in light of the EIBO patch getting committed.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Ashutosh Bapat
On Wed, Sep 13, 2017 at 12:32 PM, Amit Khandekar  wrote:
> Hi,
>
> Rafia had done some testing on TPCH queries using Partition-wise join
> patch along with Parallel Append patch.
>
> There, we had observed that for query 4, even though the partition
> wise joins are under a Parallel Append, the join are all non-partial.
>
> Specifically, the partition-wise join has non-partial nested loop
> joins when actually it was expected to have partial nested loop joins.
> (The difference can be seen by the observation that the outer relation
> of that join is scanned by non-parallel Bitmap Heap scan when it
> should have used Parallel Bitmap Heap Scan).
>
> Here is the detailed analysis , including where I think is the issue :
>
> https://www.postgresql.org/message-id/CAJ3gD9cZms1ND3p%3DNN%3DhDYDFt_SeKq1htMBhbj85bOmvJwY5fg%40mail.gmail.com
>
> All the TPCH results are posted in the same above mail thread.

Can you please check if the attached patch fixes the issue.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
commit 203b3083318e9da41ad614a2ccec532025877c3b
Author: Ashutosh Bapat 
Date:   Tue Sep 12 17:41:54 2017 +0530

Reparamterize partial nestloop paths.

We do not create partial nested looop paths if the inner path's
parameterization is not fully covered by the outer relation. For partition-wise
join, the test fails since the inner path is parameterized by the parent of the
outer relation. Fix the test to check the parent relids instead of the child
relids and also reparameterize the inner path to be parameterized by the outer
child similar to try_nestloop_path().

TODO: squash this patch with the reparameterization patch.

Ashutosh Bapat, per report from Rafia and analysis by Amit Khandekar

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index 91f0b1c..c8da19c 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -496,8 +496,20 @@ try_partial_nestloop_path(PlannerInfo *root,
 	if (inner_path->param_info != NULL)
 	{
 		Relids		inner_paramrels = inner_path->param_info->ppi_req_outer;
+		RelOptInfo *outerrel = outer_path->parent;
+		Relids		outerrelids;
 
-		if (!bms_is_subset(inner_paramrels, outer_path->parent->relids))
+		/*
+		 * Paths are parameterized by top-level parent(s). Any paths parameterized
+		 * by the child relations, are not added to the pathlist. Hence run
+		 * parameterization tests on the parent relids.
+		 */
+		if (outerrel->top_parent_relids)
+			outerrelids = outerrel->top_parent_relids;
+		else
+			outerrelids = outerrel->relids;
+
+		if (!bms_is_subset(inner_paramrels, outerrelids))
 			return;
 	}
 
@@ -510,6 +522,32 @@ try_partial_nestloop_path(PlannerInfo *root,
 	if (!add_partial_path_precheck(joinrel, workspace.total_cost, pathkeys))
 		return;
 
+	/*
+	 * Since result produced by a child is part of the result produced by
+	 * its topmost parent and has same properties, the parameters
+	 * representing that parent may be substituted by values from a child.
+	 * Hence expressions and hence paths using those expressions,
+	 * parameterized by a parent can be said to be parameterized by any of
+	 * its child.  For a join between child relations, if the inner path
+	 * is parameterized by the parent of the outer relation,  translate
+	 * the inner path to be parameterized by the outer child relation and
+	 * create a nestloop join path.  The translated path should have the
+	 * same costs as the original path, so cost check above should still
+	 * hold.
+	 */
+	if (PATH_PARAM_BY_PARENT(inner_path, outer_path->parent))
+	{
+		inner_path = reparameterize_path_by_child(root, inner_path,
+  outer_path->parent);
+
+		/*
+		 * If we could not translate the path, we can't create nest loop
+		 * path.
+		 */
+		if (!inner_path)
+			return;
+	}
+
 	/* Might be good enough to be worth trying, so let's try it. */
 	add_partial_path(joinrel, (Path *)
 	 create_nestloop_path(root,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Ashutosh Bapat
On Wed, Sep 13, 2017 at 11:29 AM, Amit Langote
 wrote:
> On 2017/09/12 19:56, Ashutosh Bapat wrote:
>> I think the code here expects the original parent_rte and not the one
>> we set around line 1169.
>>
>> This isn't a bug right now, since both the parent_rte s have same
>> content. But I am not sure if that will remain to be so. Here's patch
>> to fix the thinko.
>
> Instead of the new bool is_parent_partitioned, why not move the code to
> set partitioned_rels to the block where you're now setting
> is_parent_partitioned.
>
> Also, since we know this isn't a bug at the moment but will turn into one
> once we have step-wise expansion, why not include this fix in that patch
> itself?

It won't turn into a bug with step-wise expansion since every
parent_rte will have RELKIND_PARTITIONED_TABLE for a partitioned top
parent, which is used to extract the partitioned_rels. But I guess,
it's better to fix the thinko in step-wise expansion since parent_rte
itself changes.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Ashutosh Bapat
On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas  wrote:
> On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote
>  wrote:
>> In this case, AcquireExecutorLocks will lock all the relations in
>> PlannedStmt.rtable, which must include all partitioned tables of all
>> partition trees involved in the query.  Of those, it will lock the tables
>> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
>> RowExclusiveLock mode.  PlannedStmt.nonleafResultRelations is a global
>> list of all partitioned table RT indexes obtained by concatenating
>> partitioned_rels lists of all ModifyTable nodes involved in the query
>> (set_plan_refs does that).  We need to distinguish nonleafResultRelations,
>> because we need to take the stronger lock on a given table before any
>> weaker one if it happens to appear in the query as a non-result relation
>> too, to avoid lock strength upgrade deadlock hazard.
>
> Hmm.  The problem with this theory in my view is that it doesn't
> explain why InitPlan() and ExecOpenScanRelation() lock the relations
> instead of just assuming that they are already locked either by
> AcquireExecutorLocks or by planning.  If ExecLockNonLeafAppendTables()
> doesn't really need to take locks, then ExecOpenScanRelation() must
> not need to do it either.  We invented ExecLockNonLeafAppendTables()
> on the occasion of removing the scans of those tables which would
> previously have caused ExecOpenScanRelation() to be invoked, so as to
> keep the locking behavior unchanged.
>
> AcquireExecutorLocks() looks like an odd bit of code to me.  The
> executor itself locks result tables in InitPlan() and then everything
> else during InitPlan() and all of the others later on while walking
> the plan tree -- comments in InitPlan() say that this is to avoid a
> lock upgrade hazard if a result rel is also a source rel.  But
> AcquireExecutorLocks() has no such provision; it just locks everything
> in RTE order.  In theory, that's a deadlock hazard of another kind, as
> we just talked about in the context of EIBO.  In fact, expanding in
> bound order has made the situation worse: before, expansion order and
> locking order were the same, so maybe having AcquireExecutorLocks()
> work in RTE order coincidentally happened to give the same result as
> the executor code itself as long as there are no result relations.
> But this is certainly not true any more.  I'm not sure it's worth
> expending a lot of time on this -- it's evidently not a problem in
> practice, or somebody probably would've complained before now.
>
> But that having been said, I don't think we should assume that all the
> locks taken from the executor are worthless because plancache.c will
> always do the job for us.  I don't know of a case where we execute a
> saved plan without going through the plan cache, but that doesn't mean
> that there isn't one or that there couldn't be one in the future.
> It's not the job of these partitioning patches to whack around the way
> we do locking in general -- they should preserve the existing behavior
> as much as possible.  If we want to get rid of the locking in the
> executor altogether, that's a separate discussion where, I have a
> feeling, there will prove to be better reasons for the way things are
> than we are right now supposing.
>

I agree that it's not the job of these patches to change the locking
or even get rid of partitioned_rels. In order to continue returning
partitioned_rels in Append paths esp. in the case of queries involving
set operations and partitioned table e.g "select 1 from t1 union all
select 2 from t1;" in which t1 is multi-level partitioned table, we
need a fix in add_paths_to_append_rels(). The fix provided in [1] is
correct but we will need a longer explanation of why we have to
involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation
is complicated. If we get rid of partitioned_rels, we don't need to
fix that code in add_paths_to_append_rel().

I suggested that [2]
-- (excerpt from [2])

Actually, the original problem that caused this discussion started
with an assertion failure in get_partitioned_child_rels() as
Assert(list_length(result) >= 1);

This assertion fails if result is NIL when an intermediate partitioned
table is passed. May be we should assert (result == NIL ||
list_length(result) == 1) and allow that function to be called even
for intermediate partitioned partitions for which the function will
return NIL. That will leave the code in add_paths_to_append_rel()
simple. Thoughts?
--

Amit Langote agrees with this. It kind of makes the assertion lame but
keeps the code sane. What do you think?

[1] 
https://www.postgresql.org/message-id/d2f1cdcb-ebb4-76c5-e471-79348ca5d...@lab.ntt.co.jp
[2] 
https://www.postgresql.org/message-id/CAFjFpRfJ3GRRmmOugaMA-q4i=se5p6yjz_c6a6hdrdqqtgx...@mail.gmail.com
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Amit Khandekar
Hi,

Rafia had done some testing on TPCH queries using Partition-wise join
patch along with Parallel Append patch.

There, we had observed that for query 4, even though the partition
wise joins are under a Parallel Append, the join are all non-partial.

Specifically, the partition-wise join has non-partial nested loop
joins when actually it was expected to have partial nested loop joins.
(The difference can be seen by the observation that the outer relation
of that join is scanned by non-parallel Bitmap Heap scan when it
should have used Parallel Bitmap Heap Scan).

Here is the detailed analysis , including where I think is the issue :

https://www.postgresql.org/message-id/CAJ3gD9cZms1ND3p%3DNN%3DhDYDFt_SeKq1htMBhbj85bOmvJwY5fg%40mail.gmail.com

All the TPCH results are posted in the same above mail thread.

Thanks
-Amit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Amit Langote
On 2017/09/12 19:56, Ashutosh Bapat wrote:
> I think the code here expects the original parent_rte and not the one
> we set around line 1169.
> 
> This isn't a bug right now, since both the parent_rte s have same
> content. But I am not sure if that will remain to be so. Here's patch
> to fix the thinko.

Instead of the new bool is_parent_partitioned, why not move the code to
set partitioned_rels to the block where you're now setting
is_parent_partitioned.

Also, since we know this isn't a bug at the moment but will turn into one
once we have step-wise expansion, why not include this fix in that patch
itself?

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Robert Haas
On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote
 wrote:
> In this case, AcquireExecutorLocks will lock all the relations in
> PlannedStmt.rtable, which must include all partitioned tables of all
> partition trees involved in the query.  Of those, it will lock the tables
> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
> RowExclusiveLock mode.  PlannedStmt.nonleafResultRelations is a global
> list of all partitioned table RT indexes obtained by concatenating
> partitioned_rels lists of all ModifyTable nodes involved in the query
> (set_plan_refs does that).  We need to distinguish nonleafResultRelations,
> because we need to take the stronger lock on a given table before any
> weaker one if it happens to appear in the query as a non-result relation
> too, to avoid lock strength upgrade deadlock hazard.

Hmm.  The problem with this theory in my view is that it doesn't
explain why InitPlan() and ExecOpenScanRelation() lock the relations
instead of just assuming that they are already locked either by
AcquireExecutorLocks or by planning.  If ExecLockNonLeafAppendTables()
doesn't really need to take locks, then ExecOpenScanRelation() must
not need to do it either.  We invented ExecLockNonLeafAppendTables()
on the occasion of removing the scans of those tables which would
previously have caused ExecOpenScanRelation() to be invoked, so as to
keep the locking behavior unchanged.

AcquireExecutorLocks() looks like an odd bit of code to me.  The
executor itself locks result tables in InitPlan() and then everything
else during InitPlan() and all of the others later on while walking
the plan tree -- comments in InitPlan() say that this is to avoid a
lock upgrade hazard if a result rel is also a source rel.  But
AcquireExecutorLocks() has no such provision; it just locks everything
in RTE order.  In theory, that's a deadlock hazard of another kind, as
we just talked about in the context of EIBO.  In fact, expanding in
bound order has made the situation worse: before, expansion order and
locking order were the same, so maybe having AcquireExecutorLocks()
work in RTE order coincidentally happened to give the same result as
the executor code itself as long as there are no result relations.
But this is certainly not true any more.  I'm not sure it's worth
expending a lot of time on this -- it's evidently not a problem in
practice, or somebody probably would've complained before now.

But that having been said, I don't think we should assume that all the
locks taken from the executor are worthless because plancache.c will
always do the job for us.  I don't know of a case where we execute a
saved plan without going through the plan cache, but that doesn't mean
that there isn't one or that there couldn't be one in the future.
It's not the job of these partitioning patches to whack around the way
we do locking in general -- they should preserve the existing behavior
as much as possible.  If we want to get rid of the locking in the
executor altogether, that's a separate discussion where, I have a
feeling, there will prove to be better reasons for the way things are
than we are right now supposing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Ashutosh Bapat
On Tue, Sep 12, 2017 at 2:35 PM, Amit Langote
 wrote:
> On 2017/09/12 17:53, Ashutosh Bapat wrote:
>> On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote wrote:
>>> So, we can remove partitioned_rels from (Merge)AppendPath and
>>> (Merge)Append nodes and remove ExecLockNonLeafAppendTables().
>>
>> Don't we need partitioned_rels from Append paths to be transferred to
>> ModifyTable node or we have a different way of calculating
>> nonleafResultRelations?
>
> No, we don't transfer partitioned_rels from Append path to ModifyTable
> node.  inheritance_planner(), that builds the ModifyTable path for
> UPDATE/DELETE on a partitioned table, fetches partitioned_rels from
> root->pcinfo_list itself and passes it to create_modifytable_path.  No
> Append path is involved in that case.  PlannedStmt.nonleafResultRelations
> is built by concatenating the partitioned_rels lists of all ModifyTable
> nodes appearing in the query.  It does not depend on Append's or
> AppendPath's partitioned_rels.

Ok. Thanks for the explanation.

This make me examine inheritance_planner() closely and I think I have
spotted a thinko there. In inheritance_planner() parent_rte is set to
the RTE of parent to start with and then in the loop
1132 /*
1133  * And now we can get on with generating a plan for each child table.
1134  */
1135 foreach(lc, root->append_rel_list)
1136 {
... code clipped
1165 /*
1166  * If there are securityQuals attached to the parent,
move them to the
1167  * child rel (they've already been transformed properly for that).
1168  */
1169 parent_rte = rt_fetch(parentRTindex, subroot->parse->rtable);
1170 child_rte = rt_fetch(appinfo->child_relid, subroot->parse->rtable);
1171 child_rte->securityQuals = parent_rte->securityQuals;
1172 parent_rte->securityQuals = NIL;

we set parent_rte to the one obtained from subroot->parse, which
happens to be the same (at least in contents) as original parent_rte.
Later we use this parent_rte to pull partitioned_rels outside that
loop

1371 if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
1372 {
1373 partitioned_rels = get_partitioned_child_rels(root, parentRTindex);
1374 /* The root partitioned table is included as a child rel */
1375 Assert(list_length(partitioned_rels) >= 1);
1376 }

I think the code here expects the original parent_rte and not the one
we set around line 1169.

This isn't a bug right now, since both the parent_rte s have same
content. But I am not sure if that will remain to be so. Here's patch
to fix the thinko.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


inh_planner_prte.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Amit Langote
On 2017/09/12 18:49, Ashutosh Bapat wrote:
> On Tue, Sep 12, 2017 at 2:17 PM, Amit Langote
>  wrote:
>>
>> That said, I noticed that we might need to be careful about what the value
>> of the root parent's PlanRowMark's allMarkType field gets set to.  We need
>> to make sure that it reflects markType of all partitions in the tree,
>> including those that are not root parent's direct children.  Is that true
>> with the proposed implementation?
> 
> Yes. We include child's allMarkTypes into parent's allMarkTypes. So,
> top parent's PlanRowMarks should have all descendant's allMarkTypes,
> which is not happening in the patch right now. There are two ways to
> fix that.
> 
> 1. Pass top parent's PlanRowMark all the way down to the leaf
> partitions, so that current expand_single_inheritance_child() collects
> allMarkTypes of all children correctly. But this way, PlanRowMarks of
> intermediate parent does not reflect allMarkTypes of its children,
> only top root records that.
> 2. Pass immediate parent's PlanRowMark to
> expand_single_inheritance_child(), so that it records allMarkTypes of
> its children. In expand_partitioned_rtentry() have following sequence
> 
> expand_single_inheritance_child(root, parentrte, parentRTindex,
> parentrel, parentrc, childrel,
> appinfos, &childrte, &childRTindex,
> &childrc);
> 
> /* If this child is itself partitioned, recurse */
> if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
>{
> expand_partitioned_rtentry(root, childrte, childRTindex,
>childrel, childrc, lockmode, appinfos,
>partitioned_child_rels);
> 
> /* Include child's rowmark type in parent's allMarkTypes */
> parentrc->allMarkTypes |= childrc->allMarkTypes;
>}
> so that we push allMarkTypes up the hierarchy.
> 
> I like the second way, since every intermediate parent records
> allMarkTypes of its descendants.

I like the second way, too.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Ashutosh Bapat
On Tue, Sep 12, 2017 at 2:17 PM, Amit Langote
 wrote:
>
> That said, I noticed that we might need to be careful about what the value
> of the root parent's PlanRowMark's allMarkType field gets set to.  We need
> to make sure that it reflects markType of all partitions in the tree,
> including those that are not root parent's direct children.  Is that true
> with the proposed implementation?

Yes. We include child's allMarkTypes into parent's allMarkTypes. So,
top parent's PlanRowMarks should have all descendant's allMarkTypes,
which is not happening in the patch right now. There are two ways to
fix that.

1. Pass top parent's PlanRowMark all the way down to the leaf
partitions, so that current expand_single_inheritance_child() collects
allMarkTypes of all children correctly. But this way, PlanRowMarks of
intermediate parent does not reflect allMarkTypes of its children,
only top root records that.
2. Pass immediate parent's PlanRowMark to
expand_single_inheritance_child(), so that it records allMarkTypes of
its children. In expand_partitioned_rtentry() have following sequence

expand_single_inheritance_child(root, parentrte, parentRTindex,
parentrel, parentrc, childrel,
appinfos, &childrte, &childRTindex,
&childrc);

/* If this child is itself partitioned, recurse */
if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
   {
expand_partitioned_rtentry(root, childrte, childRTindex,
   childrel, childrc, lockmode, appinfos,
   partitioned_child_rels);

/* Include child's rowmark type in parent's allMarkTypes */
parentrc->allMarkTypes |= childrc->allMarkTypes;
   }
so that we push allMarkTypes up the hierarchy.

I like the second way, since every intermediate parent records
allMarkTypes of its descendants.

Thoughts?
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Amit Langote
On 2017/09/12 17:53, Ashutosh Bapat wrote:
> On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote wrote:
>> So, we can remove partitioned_rels from (Merge)AppendPath and
>> (Merge)Append nodes and remove ExecLockNonLeafAppendTables().
> 
> Don't we need partitioned_rels from Append paths to be transferred to
> ModifyTable node or we have a different way of calculating
> nonleafResultRelations?

No, we don't transfer partitioned_rels from Append path to ModifyTable
node.  inheritance_planner(), that builds the ModifyTable path for
UPDATE/DELETE on a partitioned table, fetches partitioned_rels from
root->pcinfo_list itself and passes it to create_modifytable_path.  No
Append path is involved in that case.  PlannedStmt.nonleafResultRelations
is built by concatenating the partitioned_rels lists of all ModifyTable
nodes appearing in the query.  It does not depend on Append's or
AppendPath's partitioned_rels.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Ashutosh Bapat
On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote
 wrote:
> On 2017/09/12 16:55, Ashutosh Bapat wrote:
>> On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote wrote:
>>> So I looked at this a bit closely and came to the conclusion that we may
>>> not need to keep partitioned table RT indexes in the
>>> (Merge)Append.partitioned_rels after all, as far as execution-time locking
>>> is concerned.
>>>
>>> Consider two cases:
>>>
>>> 1. Plan is created and executed in the same transaction
>>>
>>> In this case, locks taken on the partitioned tables by the planner will
>>> suffice.
>>>
>>> 2. Plan is executed in a different transaction from the one in which it
>>>was created (a cached plan)
>>>
>>> In this case, AcquireExecutorLocks will lock all the relations in
>>> PlannedStmt.rtable, which must include all partitioned tables of all
>>> partition trees involved in the query.  Of those, it will lock the tables
>>> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
>>> RowExclusiveLock mode.  PlannedStmt.nonleafResultRelations is a global
>>> list of all partitioned table RT indexes obtained by concatenating
>>> partitioned_rels lists of all ModifyTable nodes involved in the query
>>> (set_plan_refs does that).  We need to distinguish nonleafResultRelations,
>>> because we need to take the stronger lock on a given table before any
>>> weaker one if it happens to appear in the query as a non-result relation
>>> too, to avoid lock strength upgrade deadlock hazard.
>>>
>>> Moreover, because all the tables from plannedstmt->rtable, including the
>>> partitioned tables, will be added to PlannedStmt.relationsOids, any
>>> invalidation events affecting the partitioned tables (for example,
>>> add/remove a partition) will cause the plan involving partitioned tables
>>> to be recreated.
>>>
>>> In none of this do we rely on the partitioned table RT indexes appearing
>>> in the (Merge)Append node itself.  Maybe, we should just remove
>>> partitioned_rels from (Merge)AppendPath and (Merge)Append node in a
>>> separate patch and move on.
>>>
>>> Thoughts?
>>
>> Yes, I did the same analysis (to which I refer in my earlier reply to
>> you). I too think we should just remove partitioned_rels from Append
>> paths. But then the question is those are then transferred to
>> ModifyTable node in create_modifytable_plan() and use it for something
>> else. What should we do about that code? I don't think we are really
>> using that list from ModifyTable node as well, so may be we could
>> remove it from there as well. What do you think? Does that mean
>> partitioned_rels isn't used at all in the code?
>
> No, we cannot simply get rid of partitioned_rels altogether.  We'll need
> to keep it in the ModifyTable node, because we *do* need the
> nonleafResultRelations list in PlannedStmt to distinguish partitioned
> table result relations, which set_plan_refs builds by concatenating
> partitioned_rels lists of various ModifyTable nodes of the query.  The
> PlannedStmt.nonleafResultRelations list actually has some use (which
> parallels PlannedStmt.resultRelations), but partitioned_rels list in the
> individual (Merge)Append, as it turns out, doesn't.
>
> So, we can remove partitioned_rels from (Merge)AppendPath and
> (Merge)Append nodes and remove ExecLockNonLeafAppendTables().

Don't we need partitioned_rels from Append paths to be transferred to
ModifyTable node or we have a different way of calculating
nonleafResultRelations?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Amit Langote
On 2017/09/12 16:39, Ashutosh Bapat wrote:
> On Tue, Sep 12, 2017 at 7:31 AM, Amit Langote
>  wrote:
>> On 2017/09/11 19:45, Ashutosh Bapat wrote:
>>> On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote:
 IMHO, we should make it the responsibility of the future patch to set a
 child PlanRowMark's prti to the direct parent's RT index, when we actually
 know that it's needed for something.  We clearly know today why we need to
 pass the other objects like child RT entry, RT index, and Relation, so we
 should limit this patch to pass only those objects to the recursive call.
 That makes this patch a relatively easy to understand change.
>>>
>>> I think you are mixing two issues here 1. setting parent RTI in child
>>> PlanRowMark and 2. passing immediate parent's PlanRowMark to
>>> expand_single_inheritance_child().
>>>
>>> I have discussed 1 in my reply to Robert.
>>>
>>> About 2 you haven't given any particular comments to my reply. To me
>>> it looks like it's this patch that introduces the notion of
>>> multi-level expansion, so it's natural for this patch to pass
>>> PlanRowMark in cascaded fashion similar to other structures.
>>
>> You patch does 2 to be able to do 1, doesn't it?  That is, to be able to
>> set the child PlanRowMark's prti to the direct parent's RT index, you pass
>> the immediate parent's PlanRowMark to the recursive call of
>> expand_single_inheritance_child().
> 
> No. child PlanRowMark's prti is set to parentRTIndex, which is a
> separate argument and is used to also set parent_relid in
> AppendRelInfo.

OK.  So, to keep the old behavior (if at all), we'd actually need a new
argument rootParentRTindex.  Old behavior being that all child
PlanRowMarks has the rootParentRTindex as their prti.

It seems though that the new behavior where prti will now be set to the
direct parent's RT index is more or less harmless, because whatever we set
prti to, as long as it's different from rti, we can consider it a child
PlanRowMark.  So it might be fine to set prti to direct parent's RT index.

That said, I noticed that we might need to be careful about what the value
of the root parent's PlanRowMark's allMarkType field gets set to.  We need
to make sure that it reflects markType of all partitions in the tree,
including those that are not root parent's direct children.  Is that true
with the proposed implementation?

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Amit Langote
On 2017/09/12 16:55, Ashutosh Bapat wrote:
> On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote wrote:
>> So I looked at this a bit closely and came to the conclusion that we may
>> not need to keep partitioned table RT indexes in the
>> (Merge)Append.partitioned_rels after all, as far as execution-time locking
>> is concerned.
>>
>> Consider two cases:
>>
>> 1. Plan is created and executed in the same transaction
>>
>> In this case, locks taken on the partitioned tables by the planner will
>> suffice.
>>
>> 2. Plan is executed in a different transaction from the one in which it
>>was created (a cached plan)
>>
>> In this case, AcquireExecutorLocks will lock all the relations in
>> PlannedStmt.rtable, which must include all partitioned tables of all
>> partition trees involved in the query.  Of those, it will lock the tables
>> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
>> RowExclusiveLock mode.  PlannedStmt.nonleafResultRelations is a global
>> list of all partitioned table RT indexes obtained by concatenating
>> partitioned_rels lists of all ModifyTable nodes involved in the query
>> (set_plan_refs does that).  We need to distinguish nonleafResultRelations,
>> because we need to take the stronger lock on a given table before any
>> weaker one if it happens to appear in the query as a non-result relation
>> too, to avoid lock strength upgrade deadlock hazard.
>>
>> Moreover, because all the tables from plannedstmt->rtable, including the
>> partitioned tables, will be added to PlannedStmt.relationsOids, any
>> invalidation events affecting the partitioned tables (for example,
>> add/remove a partition) will cause the plan involving partitioned tables
>> to be recreated.
>>
>> In none of this do we rely on the partitioned table RT indexes appearing
>> in the (Merge)Append node itself.  Maybe, we should just remove
>> partitioned_rels from (Merge)AppendPath and (Merge)Append node in a
>> separate patch and move on.
>>
>> Thoughts?
> 
> Yes, I did the same analysis (to which I refer in my earlier reply to
> you). I too think we should just remove partitioned_rels from Append
> paths. But then the question is those are then transferred to
> ModifyTable node in create_modifytable_plan() and use it for something
> else. What should we do about that code? I don't think we are really
> using that list from ModifyTable node as well, so may be we could
> remove it from there as well. What do you think? Does that mean
> partitioned_rels isn't used at all in the code?

No, we cannot simply get rid of partitioned_rels altogether.  We'll need
to keep it in the ModifyTable node, because we *do* need the
nonleafResultRelations list in PlannedStmt to distinguish partitioned
table result relations, which set_plan_refs builds by concatenating
partitioned_rels lists of various ModifyTable nodes of the query.  The
PlannedStmt.nonleafResultRelations list actually has some use (which
parallels PlannedStmt.resultRelations), but partitioned_rels list in the
individual (Merge)Append, as it turns out, doesn't.

So, we can remove partitioned_rels from (Merge)AppendPath and
(Merge)Append nodes and remove ExecLockNonLeafAppendTables().

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Ashutosh Bapat
On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote
 wrote:
> On 2017/09/11 21:07, Ashutosh Bapat wrote:
>> On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas  wrote:
>>> On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat
>>>  wrote:
 So, all partitioned partitions are getting locked correctly. Am I
 missing something?
>>>
>>> That's not a valid test.  In that scenario, you're going to hold all
>>> the locks acquired by the planner, all the locks acquired by the
>>> rewriter, and all the locks acquired by the executor, but when using
>>> prepared queries, it's possible to execute the plan after the planner
>>> and rewriter locks are no longer held.
>>
>> I see the same thing when I use prepare and execute
>
> So I looked at this a bit closely and came to the conclusion that we may
> not need to keep partitioned table RT indexes in the
> (Merge)Append.partitioned_rels after all, as far as execution-time locking
> is concerned.
>
> Consider two cases:
>
> 1. Plan is created and executed in the same transaction
>
> In this case, locks taken on the partitioned tables by the planner will
> suffice.
>
> 2. Plan is executed in a different transaction from the one in which it
>was created (a cached plan)
>
> In this case, AcquireExecutorLocks will lock all the relations in
> PlannedStmt.rtable, which must include all partitioned tables of all
> partition trees involved in the query.  Of those, it will lock the tables
> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
> RowExclusiveLock mode.  PlannedStmt.nonleafResultRelations is a global
> list of all partitioned table RT indexes obtained by concatenating
> partitioned_rels lists of all ModifyTable nodes involved in the query
> (set_plan_refs does that).  We need to distinguish nonleafResultRelations,
> because we need to take the stronger lock on a given table before any
> weaker one if it happens to appear in the query as a non-result relation
> too, to avoid lock strength upgrade deadlock hazard.
>
> Moreover, because all the tables from plannedstmt->rtable, including the
> partitioned tables, will be added to PlannedStmt.relationsOids, any
> invalidation events affecting the partitioned tables (for example,
> add/remove a partition) will cause the plan involving partitioned tables
> to be recreated.
>
> In none of this do we rely on the partitioned table RT indexes appearing
> in the (Merge)Append node itself.  Maybe, we should just remove
> partitioned_rels from (Merge)AppendPath and (Merge)Append node in a
> separate patch and move on.
>
> Thoughts?

Yes, I did the same analysis (to which I refer in my earlier reply to
you). I too think we should just remove partitioned_rels from Append
paths. But then the question is those are then transferred to
ModifyTable node in create_modifytable_plan() and use it for something
else. What should we do about that code? I don't think we are really
using that list from ModifyTable node as well, so may be we could
remove it from there as well. What do you think? Does that mean
partitioned_rels isn't used at all in the code?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Amit Langote
On 2017/09/11 21:07, Ashutosh Bapat wrote:
> On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas  wrote:
>> On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat
>>  wrote:
>>> So, all partitioned partitions are getting locked correctly. Am I
>>> missing something?
>>
>> That's not a valid test.  In that scenario, you're going to hold all
>> the locks acquired by the planner, all the locks acquired by the
>> rewriter, and all the locks acquired by the executor, but when using
>> prepared queries, it's possible to execute the plan after the planner
>> and rewriter locks are no longer held.
> 
> I see the same thing when I use prepare and execute

So I looked at this a bit closely and came to the conclusion that we may
not need to keep partitioned table RT indexes in the
(Merge)Append.partitioned_rels after all, as far as execution-time locking
is concerned.

Consider two cases:

1. Plan is created and executed in the same transaction

In this case, locks taken on the partitioned tables by the planner will
suffice.

2. Plan is executed in a different transaction from the one in which it
   was created (a cached plan)

In this case, AcquireExecutorLocks will lock all the relations in
PlannedStmt.rtable, which must include all partitioned tables of all
partition trees involved in the query.  Of those, it will lock the tables
whose RT indexes appear in PlannedStmt.nonleafResultRelations with
RowExclusiveLock mode.  PlannedStmt.nonleafResultRelations is a global
list of all partitioned table RT indexes obtained by concatenating
partitioned_rels lists of all ModifyTable nodes involved in the query
(set_plan_refs does that).  We need to distinguish nonleafResultRelations,
because we need to take the stronger lock on a given table before any
weaker one if it happens to appear in the query as a non-result relation
too, to avoid lock strength upgrade deadlock hazard.

Moreover, because all the tables from plannedstmt->rtable, including the
partitioned tables, will be added to PlannedStmt.relationsOids, any
invalidation events affecting the partitioned tables (for example,
add/remove a partition) will cause the plan involving partitioned tables
to be recreated.

In none of this do we rely on the partitioned table RT indexes appearing
in the (Merge)Append node itself.  Maybe, we should just remove
partitioned_rels from (Merge)AppendPath and (Merge)Append node in a
separate patch and move on.

Thoughts?

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Ashutosh Bapat
On Tue, Sep 12, 2017 at 7:31 AM, Amit Langote
 wrote:
> On 2017/09/11 19:45, Ashutosh Bapat wrote:
>> On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote:
>>> IMHO, we should make it the responsibility of the future patch to set a
>>> child PlanRowMark's prti to the direct parent's RT index, when we actually
>>> know that it's needed for something.  We clearly know today why we need to
>>> pass the other objects like child RT entry, RT index, and Relation, so we
>>> should limit this patch to pass only those objects to the recursive call.
>>> That makes this patch a relatively easy to understand change.
>>
>> I think you are mixing two issues here 1. setting parent RTI in child
>> PlanRowMark and 2. passing immediate parent's PlanRowMark to
>> expand_single_inheritance_child().
>>
>> I have discussed 1 in my reply to Robert.
>>
>> About 2 you haven't given any particular comments to my reply. To me
>> it looks like it's this patch that introduces the notion of
>> multi-level expansion, so it's natural for this patch to pass
>> PlanRowMark in cascaded fashion similar to other structures.
>
> You patch does 2 to be able to do 1, doesn't it?  That is, to be able to
> set the child PlanRowMark's prti to the direct parent's RT index, you pass
> the immediate parent's PlanRowMark to the recursive call of
> expand_single_inheritance_child().

No. child PlanRowMark's prti is set to parentRTIndex, which is a
separate argument and is used to also set parent_relid in
AppendRelInfo.

>
>> Actually, the original problem that caused this discussion started
>> with an assertion failure in get_partitioned_child_rels() as
>> Assert(list_length(result) >= 1);
>>
>> This assertion fails if result is NIL when an intermediate partitioned
>> table is passed. May be we should assert (result == NIL ||
>> list_length(result) == 1) and allow that function to be called even
>> for intermediate partitioned partitions for which the function will
>> return NIL. That will leave the code in add_paths_to_append_rel()
>> simple. Thoughts?
>
> Yeah, I guess that could work.  We'll just have to write comments to
> describe why the Assert is written that way.
>
>>> By the way, when we call expand_single_inheritance_child() in the
>>> non-partitioned inheritance case, we should pass NULL for childrte_p,
>>> childRTindex_p, childrc_p, instead of declaring variables that won't be
>>> used.  Hence, expand_single_inheritance_child() should make those
>>> arguments optional.
>>
>> That introduces an extra "if" condition, which is costlier than an
>> assignment. We have used both the styles in the code. Previously, I
>> have got comments otherwise. So, I am not sure.
>
> OK.  expand_single_inheritance_child's header comment does not mention the
> new result fields.  Maybe add a comment describing what their role is and
> that they're not optional arguments.
>
>> I will update the patches once we have some resolution about 1. prti
>> in PlanRowMarks and 2. detection of root partitioned table in
>> add_paths_to_append_rel().
>
> OK.
>
> About 2, I somewhat agree with your proposed solution above, which might
> be simpler to explain in comments than the code I proposed.

After testing a few queries I am getting a feeling that
ExecLockNonLeafAppendTables isn't really locking anything. I will
write more about that in my reply to Robert's mail.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-11 Thread Amit Langote
On 2017/09/11 19:45, Ashutosh Bapat wrote:
> On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote:
>> IMHO, we should make it the responsibility of the future patch to set a
>> child PlanRowMark's prti to the direct parent's RT index, when we actually
>> know that it's needed for something.  We clearly know today why we need to
>> pass the other objects like child RT entry, RT index, and Relation, so we
>> should limit this patch to pass only those objects to the recursive call.
>> That makes this patch a relatively easy to understand change.
> 
> I think you are mixing two issues here 1. setting parent RTI in child
> PlanRowMark and 2. passing immediate parent's PlanRowMark to
> expand_single_inheritance_child().
> 
> I have discussed 1 in my reply to Robert.
> 
> About 2 you haven't given any particular comments to my reply. To me
> it looks like it's this patch that introduces the notion of
> multi-level expansion, so it's natural for this patch to pass
> PlanRowMark in cascaded fashion similar to other structures.

You patch does 2 to be able to do 1, doesn't it?  That is, to be able to
set the child PlanRowMark's prti to the direct parent's RT index, you pass
the immediate parent's PlanRowMark to the recursive call of
expand_single_inheritance_child().

All I am trying to say is that this patch's mission is to expand
inheritance step-wise to be able to do certain things in the *planner*
that weren't possible before.  The patch accomplishes that by creating
child AppendRelInfos such that its parent_relid field is set to the
immediate parent's RT index.  It's quite clear why we're doing so.  It's
not clear why we should do so for PlanRowMarks too.  Maybe it's fine as
long as nothing breaks.

>> If we go with your patch, partitioned tables won't get locked, for
>> example, in case of the following query (p is a partitioned table):
>>
>> select 1 from p union all select 2 from p;
>>
>> That's because the RelOptInfos for the two instances of p in the above
>> query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL.  They are children
>> of the Append corresponding to the UNION ALL subquery RTE.  So,
>> partitioned_rels does not get set per your proposed code.
> 

[...]

> So, all partitioned partitions are getting locked correctly. Am I
> missing something?

Will reply to this separately to your other email.

> Actually, the original problem that caused this discussion started
> with an assertion failure in get_partitioned_child_rels() as
> Assert(list_length(result) >= 1);
> 
> This assertion fails if result is NIL when an intermediate partitioned
> table is passed. May be we should assert (result == NIL ||
> list_length(result) == 1) and allow that function to be called even
> for intermediate partitioned partitions for which the function will
> return NIL. That will leave the code in add_paths_to_append_rel()
> simple. Thoughts?

Yeah, I guess that could work.  We'll just have to write comments to
describe why the Assert is written that way.

>> In create_lateral_join_info():
>>
>> +Assert(IS_SIMPLE_REL(brel));
>> +Assert(brte);
>>
>> The second Assert is either unnecessary or should be placed first.
> 
> simple_rte_array[] may have some NULL entries. Second assert makes
> sure that we aren't dealing with a NULL entry. Any particular reason
> to reorder the asserts?

Sorry, I missed that the 2nd Assert has b"rte".  I thought it's b"rel".

>> The following comment could be made a bit clearer.
>>
>> + * In the case of table inheritance, the parent RTE is directly
>> linked
>> + * to every child table via an AppendRelInfo.  In the case of table
>> + * partitioning, the inheritance hierarchy is expanded one level at 
>> a
>> + * time rather than flattened.  Therefore, an other member rel
>> that is
>> + * a partitioned table may have children of its own, and must
>> + * therefore be marked with the appropriate lateral info so that
>> those
>> + * children eventually get marked also.
>>
>> How about: In the case of partitioned table inheritance, the original
>> parent RTE is linked, via AppendRelInfo, only to its immediate partitions.
>>  Partitions below the first level are accessible only via their immediate
>> parent's RelOptInfo, which would be of kind RELOPT_OTHER_MEMBER_REL, so
>> consider those as well.
> 
> I don't see much difference between those two. We usually do not use
> macros in comments, so usually comments mention "other member" rel.
> Let's leave this for the committer to judge.

Sure.

>> In expand_inherited_rtentry(), the following comment fragment is obsolete,
>> because we *do* now create AppendRelInfo's for partitioned children:
>>
>> +/*
>> + * We keep a list of objects in root, each of which maps a
>> partitioned
>> + * parent RT index to the list of RT indexes of its partitioned 
>> child
>> + * tables which do not have AppendRelInfos associated with those.
> 
> Good catch. I have 

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-11 Thread Robert Haas
On Mon, Sep 11, 2017 at 8:07 AM, Ashutosh Bapat
 wrote:
> I see the same thing when I use prepare and execute

Hmm.  Well, that's good, but it doesn't prove there's no bug.  We have
to understand where and why it's getting locked to know whether the
behavior will be correct in all cases.  I haven't had time to look at
Amit's comments in detail yet so I don't know whether I agree with his
analysis or not, but we have to look at what's going on under the hood
to know whether the engine is working -- not just listen to the noise
it makes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-11 Thread Ashutosh Bapat
On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas  wrote:
> On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat
>  wrote:
>> So, all partitioned partitions are getting locked correctly. Am I
>> missing something?
>
> That's not a valid test.  In that scenario, you're going to hold all
> the locks acquired by the planner, all the locks acquired by the
> rewriter, and all the locks acquired by the executor, but when using
> prepared queries, it's possible to execute the plan after the planner
> and rewriter locks are no longer held.
>

I see the same thing when I use prepare and execute

Session 1
postgres=# prepare stmt as select 1 from t1 union all select 2 from t1;
PREPARE
postgres=# select pg_backend_pid();
 pg_backend_pid

  50912
(1 row)

postgres=# begin;
BEGIN
postgres=# execute stmt;
 ?column?
--
(0 rows)

Session 2
postgres=# select locktype, relation::regclass, virtualxid,
virtualtransaction, pid, mode, granted, fastpath from pg_locks;
  locktype  | relation | virtualxid | virtualtransaction |  pid  |
 mode   | granted | fastpath
+--+++---+-+-+--
 relation   | pg_locks || 4/4| 50914 |
AccessShareLock | t   | t
 virtualxid |  | 4/4| 4/4| 50914 |
ExclusiveLock   | t   | t
 relation   | t1p1p1   || 3/12   | 50912 |
AccessShareLock | t   | t
 relation   | t1p1 || 3/12   | 50912 |
AccessShareLock | t   | t
 relation   | t1   || 3/12   | 50912 |
AccessShareLock | t   | t
 virtualxid |  | 3/12   | 3/12   | 50912 |
ExclusiveLock   | t   | t
(6 rows)

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-11 Thread Robert Haas
On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat
 wrote:
> So, all partitioned partitions are getting locked correctly. Am I
> missing something?

That's not a valid test.  In that scenario, you're going to hold all
the locks acquired by the planner, all the locks acquired by the
rewriter, and all the locks acquired by the executor, but when using
prepared queries, it's possible to execute the plan after the planner
and rewriter locks are no longer held.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-11 Thread Ashutosh Bapat
On Mon, Sep 11, 2017 at 2:16 PM, Amit Langote
 wrote:
> On 2017/09/11 16:23, Ashutosh Bapat wrote:
>> On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas  wrote:
>>> I'm a bit suspicious about the fact that there are now executor
>>> changes related to the PlanRowMarks.  If the rowmark's prti is now the
>>> intermediate parent's RT index rather than the top-parent's RT index,
>>> it'd seem like that'd matter somehow.  Maybe it doesn't, because the
>>> code that cares about prti seems to only care about whether it's
>>> different from rti.  But if that's true everywhere, then why even
>>> change this?  I think we might be well off not to tinker with things
>>> that don't need to be changed.
>>
>> In the definition of ExecRowMark, I see
>> Index   prti;   /* parent range table index, if child */
>> It just says parent, by which I take as direct parent. For
>> inheritance, which earlier flattened inheritance hierarchy, direct
>> parent was top parent. So, probably nobody thought whether a parent is
>> direct parent or top parent. But now that we have introduced that
>> concept we need to interpret this comment anew. And I think
>> interpreting it as direct parent is non-lossy.
>
> But then we also don't have anything to say about why we're making that
> change.  If you could describe what non-lossy is in this context, then
> fine.

By setting prti to the topmost parent RTI we are loosing information
that this child may be an intermediate child similar to what we did
earlier to AppendRelInfo. That's the lossy-ness in this context.

> But that we'd like to match with what we're going to do for
> AppendRelInfos does not seem to be a sufficient explanation for this change.

The purpose of this patch is to change the parent-child linkages for
partitioned table and prti is one of them. So, in fact, I am wondering
why not to change that along with AppendRelInfo.

>
>> If we set top parent's
>> index, parent RTI in AppendRelInfo and PlanRowMark would not agree.
>> So, it looks quite natural that we set the direct parent's index in
>> PlanRowMark.
>
> They would not agree, yes, but aren't they unrelated?  If we have a reason
> for them to agree, (for example, row-locking breaks in the inherited table
> case if we didn't), then we should definitely make them agree.
>
> Updating the comment for prti definition might be something that this
> patch could (should?) do, but I'm not quite sure about that too.
>

To me that looks backwards again for the reasons described above.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-11 Thread Ashutosh Bapat
On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote
 wrote:
> On 2017/09/09 2:38, Ashutosh Bapat wrote:
>> On Fri, Sep 8, 2017 at 11:17 AM, Amit Langote wrote:
>>> I updated the patch to include just those changes.  I'm not sure about
>>> one of the Ashutosh's changes whereby the child PlanRowMark is also passed
>>> to expand_partitioned_rtentry() to use as the parent PlanRowMark.  I think
>>> the child RTE, child RT index and child Relation are fine, because they
>>> are necessary for creating AppendRelInfos in a desired way for later
>>> planning steps.  But PlanRowMarks are not processed within the planner
>>> afterwards and do not need to be marked with the immediate parent-child
>>> association in the same way that AppendRelInfos need to be.
>>
>> Passing top parent's row mark works today, since there is no
>> parent-child specific translation happening there. But if in future,
>> we introduce such a translation, row marks for indirect children in a
>> multi-level partitioned hierarchy won't be accurate. So, I think it's
>> better to pass row marks of the direct parent.
>
> IMHO, we should make it the responsibility of the future patch to set a
> child PlanRowMark's prti to the direct parent's RT index, when we actually
> know that it's needed for something.  We clearly know today why we need to
> pass the other objects like child RT entry, RT index, and Relation, so we
> should limit this patch to pass only those objects to the recursive call.
> That makes this patch a relatively easy to understand change.

I think you are mixing two issues here 1. setting parent RTI in child
PlanRowMark and 2. passing immediate parent's PlanRowMark to
expand_single_inheritance_child().

I have discussed 1 in my reply to Robert.

About 2 you haven't given any particular comments to my reply. To me
it looks like it's this patch that introduces the notion of
multi-level expansion, so it's natural for this patch to pass
PlanRowMark in cascaded fashion similar to other structures.

>
>>> I also included the changes to add_paths_to_append_rel() from my patch on
>>> the "path toward faster partition pruning" thread.  We'd need that change,
>>> because while add_paths_to_append_rel() is called for all partitioned
>>> table RTEs in a given partition tree, expand_inherited_rtentry() would
>>> have set up a PartitionedChildRelInfo only for the root parent, so
>>> get_partitioned_child_rels() would not find the same for non-root
>>> partitioned table rels and crash failing the Assert.  The changes I made
>>> are such that we call get_partitioned_child_rels() only for the parent
>>> rels that are known to correspond root partitioned tables (or as you
>>> pointed out on the thread, "the table named in the query" as opposed those
>>> added to the query as result of inheritance expansion).  In addition to
>>> the relkind check on the input RTE, it would seem that checking that the
>>> reloptkind is RELOPT_BASEREL would be enough.  But actually not, because
>>> if a partitioned table is accessed in a UNION ALL query, reloptkind even
>>> for the root partitioned table (the table named in the query) would be
>>> RELOPT_OTHER_MEMBER_REL.  The only way to confirm that the input rel is
>>> actually the root partitioned table is to check whether its parent rel is
>>> not RTE_RELATION, because the parent in case of UNION ALL Append is a
>>> RTE_SUBQUERY RT entry.
>>>
>>
>> There was a change in my 0003 patch, which fixed the crash. It checked
>> for RELOPT_BASEREL and RELKIND_PARTITIONED_TABLE. I have pulled it in
>> my 0001 patch. It no more crashes. I tried various queries involving
>> set operations and bare multi-level partitioned table scan with my
>> patch, but none of them showed any anomaly. Do you have a testcase
>> which shows problem with my patch? May be your suggestion is correct,
>> but corresponding code implementation is slightly longer than I would
>> expect. So, we should go with it, if there is corresponding testcase
>> which shows why it's needed.
>
> If we go with your patch, partitioned tables won't get locked, for
> example, in case of the following query (p is a partitioned table):
>
> select 1 from p union all select 2 from p;
>
> That's because the RelOptInfos for the two instances of p in the above
> query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL.  They are children
> of the Append corresponding to the UNION ALL subquery RTE.  So,
> partitioned_rels does not get set per your proposed code.

Session 1:
postgres=# begin;
BEGIN
postgres=# select 1 from t1 union all select 2 from t1;
 ?column?
--
(0 rows)

postgres=# select pg_backend_pid();
 pg_backend_pid

  28843
(1 row)


Session 2
postgres=# select locktype, relation::regclass, virtualxid,
virtualtransaction, pid, mode, granted, fastpath from pg_locks;
  locktype  | relation | virtualxid | virtualtransaction |  pid  |
 mode   | granted | fastpath
+--+++---+-

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-11 Thread Amit Langote
On 2017/09/11 16:23, Ashutosh Bapat wrote:
> On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas  wrote:
>> I'm a bit suspicious about the fact that there are now executor
>> changes related to the PlanRowMarks.  If the rowmark's prti is now the
>> intermediate parent's RT index rather than the top-parent's RT index,
>> it'd seem like that'd matter somehow.  Maybe it doesn't, because the
>> code that cares about prti seems to only care about whether it's
>> different from rti.  But if that's true everywhere, then why even
>> change this?  I think we might be well off not to tinker with things
>> that don't need to be changed.
> 
> In the definition of ExecRowMark, I see
> Index   prti;   /* parent range table index, if child */
> It just says parent, by which I take as direct parent. For
> inheritance, which earlier flattened inheritance hierarchy, direct
> parent was top parent. So, probably nobody thought whether a parent is
> direct parent or top parent. But now that we have introduced that
> concept we need to interpret this comment anew. And I think
> interpreting it as direct parent is non-lossy.

But then we also don't have anything to say about why we're making that
change.  If you could describe what non-lossy is in this context, then
fine.  But that we'd like to match with what we're going to do for
AppendRelInfos does not seem to be a sufficient explanation for this change.

> If we set top parent's
> index, parent RTI in AppendRelInfo and PlanRowMark would not agree.
> So, it looks quite natural that we set the direct parent's index in
> PlanRowMark.

They would not agree, yes, but aren't they unrelated?  If we have a reason
for them to agree, (for example, row-locking breaks in the inherited table
case if we didn't), then we should definitely make them agree.

Updating the comment for prti definition might be something that this
patch could (should?) do, but I'm not quite sure about that too.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-11 Thread Ashutosh Bapat
On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas  wrote:
> On Fri, Sep 8, 2017 at 1:38 PM, Ashutosh Bapat
>  wrote:
>>> I also confirmed that the partition-pruning patch set works fine with this
>>> patch instead of the patch on that thread with the same functionality,
>>> which I will now drop from that patch set.  Sorry about the wasted time.
>>
>> Thanks a lot. Please review the patch in the updated patchset.
>
> In set_append_rel_size(), I don't find the comment too clear (and this
> part was taken from Amit's patch, right?).  I suggest:

No, I didn't take it from Amit's patch. Both of us have different
wordings. But yours is better than both of us. Included it in the
attached patches.

>
> +/*
> + * Associate the partitioned tables which are descendents of the table
> + * named in the query with the topmost append path (i.e. the one where
> + * rel->reloptkind is RELOPT_BASEREL).  This ensures that they get 
> properly
> + * locked at execution time.
> + */
>
> I'm a bit suspicious about the fact that there are now executor
> changes related to the PlanRowMarks.  If the rowmark's prti is now the
> intermediate parent's RT index rather than the top-parent's RT index,
> it'd seem like that'd matter somehow.  Maybe it doesn't, because the
> code that cares about prti seems to only care about whether it's
> different from rti.  But if that's true everywhere, then why even
> change this?  I think we might be well off not to tinker with things
> that don't need to be changed.

In the definition of ExecRowMark, I see
Index   prti;   /* parent range table index, if child */
It just says parent, by which I take as direct parent. For
inheritance, which earlier flattened inheritance hierarchy, direct
parent was top parent. So, probably nobody thought whether a parent is
direct parent or top parent. But now that we have introduced that
concept we need to interpret this comment anew. And I think
interpreting it as direct parent is non-lossy. If we set top parent's
index, parent RTI in AppendRelInfo and PlanRowMark would not agree.
So, it looks quite natural that we set the direct parent's index in
PlanRowMark. From that POV, we aren't changing anything, we are
setting the same parent RTI in AppendRelInfo and PlanRowMark. Chaning
different parent RTIs in those two structure would be a real change.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-11 Thread Amit Langote
On 2017/09/09 9:58, Robert Haas wrote:
> I'm a bit suspicious about the fact that there are now executor
> changes related to the PlanRowMarks.  If the rowmark's prti is now the
> intermediate parent's RT index rather than the top-parent's RT index,
> it'd seem like that'd matter somehow.  Maybe it doesn't, because the
> code that cares about prti seems to only care about whether it's
> different from rti.

Yes, it doesn't matter.  The important point though is that nothing we
want to do in the short term requires us to set a child PlanRowMark's prti
to its immediate parent's RT index, as I also mentioned in reply to Ashutosh.

>  But if that's true everywhere, then why even
> change this?  I think we might be well off not to tinker with things
> that don't need to be changed.

+1.

> Apart from that concern, now that I understand (from my own failed
> attempt and some off-list discussion) why this patch works the way it
> does, I think this is in fairly good shape.

I too think so, except we still need to incorporate changes to
add_paths_to_append_rel() necessary to correctly set partitioned_rels, as
I explained in reply Ashutosh.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-10 Thread Amit Langote
On 2017/09/09 2:38, Ashutosh Bapat wrote:
> On Fri, Sep 8, 2017 at 11:17 AM, Amit Langote wrote:
>> I updated the patch to include just those changes.  I'm not sure about
>> one of the Ashutosh's changes whereby the child PlanRowMark is also passed
>> to expand_partitioned_rtentry() to use as the parent PlanRowMark.  I think
>> the child RTE, child RT index and child Relation are fine, because they
>> are necessary for creating AppendRelInfos in a desired way for later
>> planning steps.  But PlanRowMarks are not processed within the planner
>> afterwards and do not need to be marked with the immediate parent-child
>> association in the same way that AppendRelInfos need to be.
> 
> Passing top parent's row mark works today, since there is no
> parent-child specific translation happening there. But if in future,
> we introduce such a translation, row marks for indirect children in a
> multi-level partitioned hierarchy won't be accurate. So, I think it's
> better to pass row marks of the direct parent.

IMHO, we should make it the responsibility of the future patch to set a
child PlanRowMark's prti to the direct parent's RT index, when we actually
know that it's needed for something.  We clearly know today why we need to
pass the other objects like child RT entry, RT index, and Relation, so we
should limit this patch to pass only those objects to the recursive call.
That makes this patch a relatively easy to understand change.

>> I also included the changes to add_paths_to_append_rel() from my patch on
>> the "path toward faster partition pruning" thread.  We'd need that change,
>> because while add_paths_to_append_rel() is called for all partitioned
>> table RTEs in a given partition tree, expand_inherited_rtentry() would
>> have set up a PartitionedChildRelInfo only for the root parent, so
>> get_partitioned_child_rels() would not find the same for non-root
>> partitioned table rels and crash failing the Assert.  The changes I made
>> are such that we call get_partitioned_child_rels() only for the parent
>> rels that are known to correspond root partitioned tables (or as you
>> pointed out on the thread, "the table named in the query" as opposed those
>> added to the query as result of inheritance expansion).  In addition to
>> the relkind check on the input RTE, it would seem that checking that the
>> reloptkind is RELOPT_BASEREL would be enough.  But actually not, because
>> if a partitioned table is accessed in a UNION ALL query, reloptkind even
>> for the root partitioned table (the table named in the query) would be
>> RELOPT_OTHER_MEMBER_REL.  The only way to confirm that the input rel is
>> actually the root partitioned table is to check whether its parent rel is
>> not RTE_RELATION, because the parent in case of UNION ALL Append is a
>> RTE_SUBQUERY RT entry.
>>
> 
> There was a change in my 0003 patch, which fixed the crash. It checked
> for RELOPT_BASEREL and RELKIND_PARTITIONED_TABLE. I have pulled it in
> my 0001 patch. It no more crashes. I tried various queries involving
> set operations and bare multi-level partitioned table scan with my
> patch, but none of them showed any anomaly. Do you have a testcase
> which shows problem with my patch? May be your suggestion is correct,
> but corresponding code implementation is slightly longer than I would
> expect. So, we should go with it, if there is corresponding testcase
> which shows why it's needed.

If we go with your patch, partitioned tables won't get locked, for
example, in case of the following query (p is a partitioned table):

select 1 from p union all select 2 from p;

That's because the RelOptInfos for the two instances of p in the above
query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL.  They are children
of the Append corresponding to the UNION ALL subquery RTE.  So,
partitioned_rels does not get set per your proposed code.

> 
> In your patch
>
> +parent_rel = root->simple_rel_array[parent_relid];
> +get_pcinfo = (parent_rel->rtekind == RTE_SUBQUERY);
>
> Do you mean RTE_RELATION as you explained above?

No, I mean RTE_SUBQUERY.

If the partitioned table RTE in question corresponds to one named in the
query, we should be able to find its pcinfo in root->pcinfo_list.  If the
partitioned table RTE is one added as result of inheritance expansion, it
won't have an associated pcinfo.  So, we should find a way to distinguish
them from one another.  The first idea that had occurred to me was the
same as yours -- RelOptInfo of the partitioned table RTE named in the
query would be of reloptkind RELOPT_BASEREL and those of the partitioned
table RTE added as result of inheritance expansion will be of reloptkind
RELOPT_OTHER_MEMBER_REL.  Although the latter is always true, the former
is not.  If the partitioned table named in the query appears under UNION
ALL query, then its reloptkind will be RELOPT_OTHER_MEMBER_REL.  That
means we have to use some other means to distinguish partitioned table
RTEs that have an asso

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-08 Thread Robert Haas
On Fri, Sep 8, 2017 at 1:38 PM, Ashutosh Bapat
 wrote:
>> I also confirmed that the partition-pruning patch set works fine with this
>> patch instead of the patch on that thread with the same functionality,
>> which I will now drop from that patch set.  Sorry about the wasted time.
>
> Thanks a lot. Please review the patch in the updated patchset.

In set_append_rel_size(), I don't find the comment too clear (and this
part was taken from Amit's patch, right?).  I suggest:

+/*
+ * Associate the partitioned tables which are descendents of the table
+ * named in the query with the topmost append path (i.e. the one where
+ * rel->reloptkind is RELOPT_BASEREL).  This ensures that they get properly
+ * locked at execution time.
+ */

I'm a bit suspicious about the fact that there are now executor
changes related to the PlanRowMarks.  If the rowmark's prti is now the
intermediate parent's RT index rather than the top-parent's RT index,
it'd seem like that'd matter somehow.  Maybe it doesn't, because the
code that cares about prti seems to only care about whether it's
different from rti.  But if that's true everywhere, then why even
change this?  I think we might be well off not to tinker with things
that don't need to be changed.

Apart from that concern, now that I understand (from my own failed
attempt and some off-list discussion) why this patch works the way it
does, I think this is in fairly good shape.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-08 Thread Robert Haas
On Fri, Sep 8, 2017 at 1:47 AM, Amit Langote
 wrote:
> When I tried the attached patch, it doesn't seem to expand partitioning
> inheritance in step-wise manner as the patch's title says.  I think the
> rewritten patch forgot to include Ashutosh's changes to
> expand_single_inheritance_child() whereby the AppendRelInfo of the child
> will be marked with the direct parent instead of always the root parent.

Woops.

> I updated the patch to include just those changes.  I'm not sure about
> one of the Ashutosh's changes whereby the child PlanRowMark is also passed
> to expand_partitioned_rtentry() to use as the parent PlanRowMark.  I think
> the child RTE, child RT index and child Relation are fine, because they
> are necessary for creating AppendRelInfos in a desired way for later
> planning steps.  But PlanRowMarks are not processed within the planner
> afterwards and do not need to be marked with the immediate parent-child
> association in the same way that AppendRelInfos need to be.

We probably need some better comments to explain which things need to
be marked using the immediate parent and which need to be marked using
the baserel, and why.

> I also included the changes to add_paths_to_append_rel() from my patch on
> the "path toward faster partition pruning" thread.  We'd need that change,
> because while add_paths_to_append_rel() is called for all partitioned
> table RTEs in a given partition tree, expand_inherited_rtentry() would
> have set up a PartitionedChildRelInfo only for the root parent, so
> get_partitioned_child_rels() would not find the same for non-root
> partitioned table rels and crash failing the Assert.  The changes I made
> are such that we call get_partitioned_child_rels() only for the parent
> rels that are known to correspond root partitioned tables (or as you
> pointed out on the thread, "the table named in the query" as opposed those
> added to the query as result of inheritance expansion).  In addition to
> the relkind check on the input RTE, it would seem that checking that the
> reloptkind is RELOPT_BASEREL would be enough.  But actually not, because
> if a partitioned table is accessed in a UNION ALL query, reloptkind even
> for the root partitioned table (the table named in the query) would be
> RELOPT_OTHER_MEMBER_REL.  The only way to confirm that the input rel is
> actually the root partitioned table is to check whether its parent rel is
> not RTE_RELATION, because the parent in case of UNION ALL Append is a
> RTE_SUBQUERY RT entry.

OK, so this needs some good comments, too...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-08 Thread Ashutosh Bapat
On Fri, Sep 8, 2017 at 12:34 AM, Robert Haas  wrote:
> On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat
>  wrote:
>> accumulate_append_subpath() is executed for every path instead of
>> every relation, so changing it would collect the same list multiple
>> times. Instead, I found the old way of associating all intermediate
>> partitions with the root partitioned relation work better. Here's the
>> updated patch set.
>
> When I tried out patch 0001, it crashed repeatedly during 'make check'
> because of an assertion failure in get_partitioned_child_rels.

Running "make check" on the whole patchset doesn't give that failure.
So I didn't notice the crash since I was running regression on the
whole patchset. Sorry for that. Fortunately git rebase -i allows us to
execute shell commands while applying patches, so I have set it up to
compile each patch and run regression. Hopefully that will catch such
errors in future. That process showed me that patch
0003-In-add_paths_to_append_rel-get-partitioned_rels-for-.patch fixes
that crash by calling get_partitioned_child_rels() only on the root
partitioned table for which we have set up child_rels list. Amit
Langote has a similar fix reported in his reply. So, we will discuss
it there.

> It
> seemed to me that the way the patch was refactoring
> expand_inherited_rtentry involved more code rearrangement than
> necessary, so I reverted all the code rearrangement and just kept the
> functional changes, and all the crashes went away.  (That refactoring
> also failed to initialize has_child properly.)  In so doing, I
> reintroduced the problem that the PartitionedChildRelInfo lists
> weren't getting set up correctly, but after some thought I realized
> that was just because expand_single_inheritance_child() was choosing
> between adding an RTE and adding the OID to partitioned_child_rels,
> whereas for an intermediate partitioned table it needs to do both.  So
> I inserted a trivial fix for that problem (replacing "else" with a new
> "if"-test), basically:
>
> -else
> +
> +if (childrte->relkind == RELKIND_PARTITIONED_TABLE)
>
> Please check out the attached version of the patch.  In addition to
> the above simplifications, I did some adjustments to the comments in
> various places - some just grammar and others a bit more substantive.
> And I think I broke a long line in one place, too.
>
> One thing I notice is that if I rip out the changes to initsplan.c,
> the new regression test still passes.  If it's possible to write a
> test that fails without those changes, I think it would be a good idea
> to include one in the patch.  That's certainly one of the subtler
> parts of this patch, IMHO.

Amit Langote has replied on these points as well. So, I will comment
in a reply to his reply.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-07 Thread Amit Langote
On 2017/09/08 14:47, Amit Langote wrote:
> When I tried the attached patch, it doesn't seem to expand partitioning
> inheritance in step-wise manner as the patch's title says.

Oops.  By "attached patch", I had meant to say the Robert's patch, that
is, expand-stepwise-rmh.patch.  Not expand-stepwise-rmh-2.patch, which is
the updated version of the patch attached with the quoted message.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-07 Thread Amit Langote
On 2017/09/08 4:04, Robert Haas wrote:
> On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat
>  wrote:
>> accumulate_append_subpath() is executed for every path instead of
>> every relation, so changing it would collect the same list multiple
>> times. Instead, I found the old way of associating all intermediate
>> partitions with the root partitioned relation work better. Here's the
>> updated patch set.
> 
> When I tried out patch 0001, it crashed repeatedly during 'make check'
> because of an assertion failure in get_partitioned_child_rels.  It
> seemed to me that the way the patch was refactoring
> expand_inherited_rtentry involved more code rearrangement than
> necessary, so I reverted all the code rearrangement and just kept the
> functional changes, and all the crashes went away.  (That refactoring
> also failed to initialize has_child properly.)

When I tried the attached patch, it doesn't seem to expand partitioning
inheritance in step-wise manner as the patch's title says.  I think the
rewritten patch forgot to include Ashutosh's changes to
expand_single_inheritance_child() whereby the AppendRelInfo of the child
will be marked with the direct parent instead of always the root parent.

I updated the patch to include just those changes.  I'm not sure about
one of the Ashutosh's changes whereby the child PlanRowMark is also passed
to expand_partitioned_rtentry() to use as the parent PlanRowMark.  I think
the child RTE, child RT index and child Relation are fine, because they
are necessary for creating AppendRelInfos in a desired way for later
planning steps.  But PlanRowMarks are not processed within the planner
afterwards and do not need to be marked with the immediate parent-child
association in the same way that AppendRelInfos need to be.

I also included the changes to add_paths_to_append_rel() from my patch on
the "path toward faster partition pruning" thread.  We'd need that change,
because while add_paths_to_append_rel() is called for all partitioned
table RTEs in a given partition tree, expand_inherited_rtentry() would
have set up a PartitionedChildRelInfo only for the root parent, so
get_partitioned_child_rels() would not find the same for non-root
partitioned table rels and crash failing the Assert.  The changes I made
are such that we call get_partitioned_child_rels() only for the parent
rels that are known to correspond root partitioned tables (or as you
pointed out on the thread, "the table named in the query" as opposed those
added to the query as result of inheritance expansion).  In addition to
the relkind check on the input RTE, it would seem that checking that the
reloptkind is RELOPT_BASEREL would be enough.  But actually not, because
if a partitioned table is accessed in a UNION ALL query, reloptkind even
for the root partitioned table (the table named in the query) would be
RELOPT_OTHER_MEMBER_REL.  The only way to confirm that the input rel is
actually the root partitioned table is to check whether its parent rel is
not RTE_RELATION, because the parent in case of UNION ALL Append is a
RTE_SUBQUERY RT entry.

> One thing I notice is that if I rip out the changes to initsplan.c,
> the new regression test still passes.  If it's possible to write a
> test that fails without those changes, I think it would be a good idea
> to include one in the patch.  That's certainly one of the subtler
> parts of this patch, IMHO.

Back when this (step-wise expansion of partition inheritance) used to be a
patch in the original declarative partitioning patch series, Ashutosh had
reported a test query [1] that would fail getting a plan, for which we
came up with the initsplan.c changes in this patch as the solution:

ERROR:  could not devise a query plan for the given query

I tried that query again without the initsplan.c changes and somehow the
same error does not occur anymore.  It's strange because without the
initsplan.c changes, there is no way for partitions lower in the tree than
the first level to get the direct_lateral_relids and lateral_relids from
the root parent rel.   Maybe, Ashutosh has a way to devise the failing
query again.


I also confirmed that the partition-pruning patch set works fine with this
patch instead of the patch on that thread with the same functionality,
which I will now drop from that patch set.  Sorry about the wasted time.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFjFpReZF34MDbY95xoATi0xVj2mAry4-LHBWVBayOc8gj%3Diqg%40mail.gmail.com

diff --git a/src/backend/optimizer/path/allpaths.c 
b/src/backend/optimizer/path/allpaths.c
index 2d7e1d84d0..71b5bdf95e 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -24,6 +24,7 @@
 #include "catalog/pg_operator.h"
 #include "catalog/pg_proc.h"
 #include "foreign/fdwapi.h"
+#include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #ifdef OPTIMIZER_DEBUG
@@ -867,6 +868,9 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
int   

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-07 Thread Robert Haas
On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat
 wrote:
> accumulate_append_subpath() is executed for every path instead of
> every relation, so changing it would collect the same list multiple
> times. Instead, I found the old way of associating all intermediate
> partitions with the root partitioned relation work better. Here's the
> updated patch set.

When I tried out patch 0001, it crashed repeatedly during 'make check'
because of an assertion failure in get_partitioned_child_rels.  It
seemed to me that the way the patch was refactoring
expand_inherited_rtentry involved more code rearrangement than
necessary, so I reverted all the code rearrangement and just kept the
functional changes, and all the crashes went away.  (That refactoring
also failed to initialize has_child properly.)  In so doing, I
reintroduced the problem that the PartitionedChildRelInfo lists
weren't getting set up correctly, but after some thought I realized
that was just because expand_single_inheritance_child() was choosing
between adding an RTE and adding the OID to partitioned_child_rels,
whereas for an intermediate partitioned table it needs to do both.  So
I inserted a trivial fix for that problem (replacing "else" with a new
"if"-test), basically:

-else
+
+if (childrte->relkind == RELKIND_PARTITIONED_TABLE)

Please check out the attached version of the patch.  In addition to
the above simplifications, I did some adjustments to the comments in
various places - some just grammar and others a bit more substantive.
And I think I broke a long line in one place, too.

One thing I notice is that if I rip out the changes to initsplan.c,
the new regression test still passes.  If it's possible to write a
test that fails without those changes, I think it would be a good idea
to include one in the patch.  That's certainly one of the subtler
parts of this patch, IMHO.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


expand-stepwise-rmh.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-07 Thread Ashutosh Bapat
On Thu, Sep 7, 2017 at 4:32 PM, Antonin Houska  wrote:
> Ashutosh Bapat  wrote:
>
>> On Fri, Sep 1, 2017 at 6:05 PM, Antonin Houska  wrote:
>> >
>> >
>> >
>> > * get_partitioned_child_rels_for_join()
>> >
>> > I think the Assert() statement is easier to understand inside the loop, see
>> > the assert.diff attachment.
>
>> The assert at the end of function also checks that we have got
>> child_rels lists for all the parents passed in.
>
> Really? I can imagine that some instances of PartitionedChildRelInfo have the
> child_rels list empty, while other ones have these lists long enough to
> compensate for the empty lists.
>

That isn't true. Each child_rels list will at least have one entry.
Please see get_partitioned_child_rels().

>> >
>> >
>> > * have_partkey_equi_join()
>> >
>> > As the function handles generic join, this comment doesn't seem to me
>> > relevant:
>> >
>> > /*
>> >  * The equi-join between partition keys is strict if equi-join between
>> >  * at least one partition key is using a strict operator. See
>> >  * explanation about outer join reordering identity 3 in
>> >  * optimizer/README
>> >  */
>> > strict_op = op_strict(opexpr->opno);
>>
>> What in that comment is not exactly relevant?
>
> Basically I don't understand why you mention join reordering here. The join
> ordering questions must all have been resolved by the time
> have_partkey_equi_join() is called.

I am referring to a particular section in README which talks about the
relation between strict operator and legal join order.

>
>> >
>> > And I think the function can return true even if strict_op is false for all
>> > the operators evaluated in the loop.
>>
>> I think it does that. Do you have a case where it doesn't?
>
> Here I refer to this part of the comment above:
>
> "... if equi-join between at least one partition key is using a strict
> operator."
>
> My understanding of the code (especially match_expr_to_partition_keys) is that
> no operator actually needs to be strict as long as each operator involved in
> the join matches at least one non-nullable expression on both sides of the
> join.

I don't think so. A strict operator returns NULL when either of the
inputs is NULL. We can not say so for non-strict operators, which may
deem NULL and non-NULL arguments as equal, even though that looks
insane.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-07 Thread Antonin Houska
Ashutosh Bapat  wrote:

> On Fri, Sep 1, 2017 at 6:05 PM, Antonin Houska  wrote:
> >
> >
> >
> > * get_partitioned_child_rels_for_join()
> >
> > I think the Assert() statement is easier to understand inside the loop, see
> > the assert.diff attachment.

> The assert at the end of function also checks that we have got
> child_rels lists for all the parents passed in.

Really? I can imagine that some instances of PartitionedChildRelInfo have the
child_rels list empty, while other ones have these lists long enough to
compensate for the empty lists.

> >
> >
> > * have_partkey_equi_join()
> >
> > As the function handles generic join, this comment doesn't seem to me
> > relevant:
> >
> > /*
> >  * The equi-join between partition keys is strict if equi-join between
> >  * at least one partition key is using a strict operator. See
> >  * explanation about outer join reordering identity 3 in
> >  * optimizer/README
> >  */
> > strict_op = op_strict(opexpr->opno);
> 
> What in that comment is not exactly relevant?

Basically I don't understand why you mention join reordering here. The join
ordering questions must all have been resolved by the time
have_partkey_equi_join() is called.

> >
> > And I think the function can return true even if strict_op is false for all
> > the operators evaluated in the loop.
> 
> I think it does that. Do you have a case where it doesn't?

Here I refer to this part of the comment above:

"... if equi-join between at least one partition key is using a strict
operator."

My understanding of the code (especially match_expr_to_partition_keys) is that
no operator actually needs to be strict as long as each operator involved in
the join matches at least one non-nullable expression on both sides of the
join.

> > * match_expr_to_partition_keys()
> >
> > I'm not sure this comment is clear enough:
> >
> > /*
> >  * If it's a strict equi-join a NULL partition key on one side will
> >  * not join a NULL partition key on the other side. So, rows with NULL
> >  * partition key from a partition on one side can not join with those
> >  * from a non-matching partition on the other side. So, search the
> >  * nullable partition keys as well.
> >  */
> > if (!strict_op)
> > continue;
> >
> > My understanding of the problem of NULL values generated by outer join is:
> > these NULL values --- if evaluated by non-strict expression --- can make row
> > of N-th partition on one side of the join match row(s) of *other than* N-th
> > partition(s) on the other side. Thus the nullable input expressions may only
> > be evaluated by strict operators. I think it'd be clearer if you stressed 
> > that
> > (undesired) *match* of partition keys can be a problem, rather than mismatch
> 
> Sorry, I am not able to understand this. To me it looks like my
> wording conveys what you are saying.

I just tried to expreess the idea in a way that is clearer to me. I think we
both mean the same. Not sure I should spend more effort on another version of
the comment.

> > If you insist on your wording, then I think you should at least move the
> > comment below to the part that only deals with strict operators.
> 
> Done.

o.k.

> >
> > * map_and_merge_partitions()
> >
> > Besides a few changes proposed in map_and_merge_partitions.diff (a few of 
> > them
> > to suppress compiler warnings) I think that this part needs more thought:
> >
> > {
> > Assert(mergemap1[index1] != mergemap2[index2] &&
> >mergemap1[index1] >= 0 && mergemap2[index2] >= 0);
> >
> > /*
> >  * Both the partitions map to different merged partitions. This
> >  * means that multiple partitions from one relation matches to 
> > one
> >  * partition from the other relation. Partition-wise join does 
> > not
> >  * handle this case right now, since it requires ganging 
> > multiple
> >  * partitions together (into one RelOptInfo).
> >  */
> > merged_index = -1;
> > }
> >
> > I could hit this path with the following test:
> >
> > CREATE TABLE a(i int) PARTITION BY LIST(i);
> > CREATE TABLE a_0 PARTITION OF a FOR VALUES IN (0, 2);
> > CREATE TABLE b(j int) PARTITION BY LIST(j);
> > CREATE TABLE b_0 PARTITION OF b FOR VALUES IN (1, 2);
> >
> > SET enable_partition_wise_join TO on;
> >
> > SELECT  *
> > FROMa
> > FULL JOIN
> > b ON i = j;
> >
> > I don't think there's a reason not to join a_0 partition to b_0, is there?
> 
> With the latest patchset I am seeing that partition-wise join is used
> in this case. I have started a new thread [1] for advanced partition
> matching patches.

What plan do you get, with the patches from

https://www.postgresql.org/message-id/cafjfprfdxpusu0pxon3dkcr8wndjkaxlzhuvax_laod0tgc...@mail.gmail.com

I still see the join above Append, not below:

   QUERY PLAN   

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-05 Thread Ashutosh Bapat
On Tue, Sep 5, 2017 at 1:16 PM, Amit Langote
 wrote:
> On 2017/09/05 15:43, Ashutosh Bapat wrote:
>> Ok. Can you please answer my previous questions?
>>
>> AFAIU, the list contained RTIs of the relations, which didnt' have
>> corresponding AppendRelInfos to lock those relations. Now that we
>> create AppendRelInfos even for partitioned partitions with my 0001
>> patch, I don't think
>> we need the list to take care of the locks. Is there any other reason
>> why we maintain that list (apart from the trigger case I have raised
>> and Fujita-san says that the list is not required in that case as
>> well.)?
>
> AppendRelInfos exist within the planner (they come to be and go away
> within the planner).  Once we leave the planner, that information is gone.
>
> Executor will receive a plan node that won't contain that information:
>
> 1. Append has an appendplans field, which contains one plan tree for every
>leaf partition.  None of its fields, other than partitined_rels,
>contains the RT indexes of the partitioned tables.  Similarly in the
>case of MergeAppend.
>
> 2. ModifyTable has a resultRelations fields which contains a list of leaf
>partition RT indexes and a plans field which contains one plan tree for
>every RT index in the resultRelations list (that is a plan tree that
>will scan the particular leaf partition).  None of its fields, other
>than partitined_rels, contains the RT indexes of the partitioned
>tables.
>
> I learned over the course of developing the patch that added this
> partitioned_rels field [1] that the executor needs to identify all the
> affected tables by a given plan tree so that it could lock them.  Executor
> needs to lock them separately even if the plan itself was built after
> locking all the relevant tables [2].  For example, see
> ExecLockNonLeafAppendTables(), which will lock the tables in the
> (Merge)Append.partitioned_rels list.
>
> While I've been thinking all along that the same thing must be happening
> for RT indexes in ModifyTable.partitioned_rels list (I said so a couple of
> times on this thread), it's actually not.  Instead,
> ModifyTable.partitioned_rels of all ModifyTable nodes in a give query are
> merged into PlannedStmt.nonleafResultRelations (which happens in
> set_plan_refs) and that's where the executor finds them to lock them
> (which happens in InitPlan).
>
> So, it appears that ModifyTable.partitioned_rels is indeed unused in the
> executor.  But we still can't get rid of it from the ModifyTable node
> itself without figuring out a way (a channel) to transfer that information
> into PlannedStmt.nonleafResultRelations.

Thanks a lot for this detailed analysis. IIUC, in my 0001 patch, I am
not adding any partitioned partition other than the parent itself. But
since every partitioned partition in turn acts as parent, it appears
its own list. The list obtained by concatenating all such lists
together contains all the partitioned partition RTIs. In my patch, I
need to teach accumulate_append_subpath() to accumulate
partitioned_rels as well.

>
>> Having asked that, I think my patch shouldn't deal with removing
>> partitioned_rels lists and related structures and code.  It should be> done 
>> as a separate patch.
>
> Going back to your original email which started this discussion, it seems
> that we agree on that the PartitionedChildRelInfo node can be removed, and
> I agree that it shouldn't be done in the partitionwise-join patch series
> but as a separate patch.  As described above, we shouldn't try yet to get
> rid of the partitioned_rels list that appears in some plan nodes.

Thanks.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-05 Thread Amit Langote
On 2017/09/05 15:43, Ashutosh Bapat wrote:
> Ok. Can you please answer my previous questions?
> 
> AFAIU, the list contained RTIs of the relations, which didnt' have
> corresponding AppendRelInfos to lock those relations. Now that we
> create AppendRelInfos even for partitioned partitions with my 0001
> patch, I don't think
> we need the list to take care of the locks. Is there any other reason
> why we maintain that list (apart from the trigger case I have raised
> and Fujita-san says that the list is not required in that case as
> well.)?

AppendRelInfos exist within the planner (they come to be and go away
within the planner).  Once we leave the planner, that information is gone.

Executor will receive a plan node that won't contain that information:

1. Append has an appendplans field, which contains one plan tree for every
   leaf partition.  None of its fields, other than partitined_rels,
   contains the RT indexes of the partitioned tables.  Similarly in the
   case of MergeAppend.

2. ModifyTable has a resultRelations fields which contains a list of leaf
   partition RT indexes and a plans field which contains one plan tree for
   every RT index in the resultRelations list (that is a plan tree that
   will scan the particular leaf partition).  None of its fields, other
   than partitined_rels, contains the RT indexes of the partitioned
   tables.

I learned over the course of developing the patch that added this
partitioned_rels field [1] that the executor needs to identify all the
affected tables by a given plan tree so that it could lock them.  Executor
needs to lock them separately even if the plan itself was built after
locking all the relevant tables [2].  For example, see
ExecLockNonLeafAppendTables(), which will lock the tables in the
(Merge)Append.partitioned_rels list.

While I've been thinking all along that the same thing must be happening
for RT indexes in ModifyTable.partitioned_rels list (I said so a couple of
times on this thread), it's actually not.  Instead,
ModifyTable.partitioned_rels of all ModifyTable nodes in a give query are
merged into PlannedStmt.nonleafResultRelations (which happens in
set_plan_refs) and that's where the executor finds them to lock them
(which happens in InitPlan).

So, it appears that ModifyTable.partitioned_rels is indeed unused in the
executor.  But we still can't get rid of it from the ModifyTable node
itself without figuring out a way (a channel) to transfer that information
into PlannedStmt.nonleafResultRelations.

> Having asked that, I think my patch shouldn't deal with removing
> partitioned_rels lists and related structures and code.  It should be> done 
> as a separate patch.

Going back to your original email which started this discussion, it seems
that we agree on that the PartitionedChildRelInfo node can be removed, and
I agree that it shouldn't be done in the partitionwise-join patch series
but as a separate patch.  As described above, we shouldn't try yet to get
rid of the partitioned_rels list that appears in some plan nodes.

Thanks,
Amit

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d3cc37f1d8

[2]
https://www.postgresql.org/message-id/CA%2BTgmoYiwviCDRi3Zk%2BQuXj1r7uMu9T_kDNq%2B17PCWgzrbzw8A%40mail.gmail.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-04 Thread Ashutosh Bapat
On Tue, Sep 5, 2017 at 12:06 PM, Amit Langote
 wrote:
> On 2017/09/05 15:30, Ashutosh Bapat wrote:
>> Those changes are already part of my updated 0001 patch. Aren't they?
>> May be you should just review those and see if those are suitable for
>> you?
>
> Yeah, I think it's going to be the same patch, functionality-wise.
>
> And sorry, I didn't realize you were talking about the case after applying
> your patch on HEAD.
>

Ok. Can you please answer my previous questions?

AFAIU, the list contained RTIs of the relations, which didnt' have
corresponding AppendRelInfos to lock those relations. Now that we
create AppendRelInfos even for partitioned partitions with my 0001
patch, I don't think
we need the list to take care of the locks. Is there any other reason
why we maintain that list (apart from the trigger case I have raised
and Fujita-san says that the list is not required in that case as
well.)?

Having asked that, I think my patch shouldn't deal with removing
partitioned_rels lists and related structures and code. It should be
done as a separate patch.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-04 Thread Amit Langote
On 2017/09/05 15:30, Ashutosh Bapat wrote:
> Those changes are already part of my updated 0001 patch. Aren't they?
> May be you should just review those and see if those are suitable for
> you?

Yeah, I think it's going to be the same patch, functionality-wise.

And sorry, I didn't realize you were talking about the case after applying
your patch on HEAD.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-04 Thread Etsuro Fujita

On 2017/09/05 13:20, Amit Langote wrote:

On 2017/09/04 21:32, Ashutosh Bapat wrote:



+1. Will Fujita-san's patch also handle getting rid of partitioned_rels list?


As Fujita-san mentioned, his patch won't.  Actually, I suppose he didn't
say that partitioned_rels itself is useless, just that its particular
usage in ExecInitModifyTable is.


That's right.  (I thought there would probably be no need to create that 
list if we created AppendRelInfos even for partitioned partitions.)



We still need that list for planner to
tell the executor that there are some RT entries the latter would need to
lock before executing a given plan.  Without that dedicated list, the
executor cannot know at all that certain tables in the partition tree
(viz. the partitioned ones) need to be locked.  I mentioned the reason -
(Merge)Append.subplans, ModifyTable.resultRelations does not contain
respective entries corresponding to the partitioned tables, and
traditionally, the executor looks at those lists to figure out the tables
to lock.


I think so too.

Best regards,
Etsuro Fujita



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-04 Thread Ashutosh Bapat
On Tue, Sep 5, 2017 at 11:54 AM, Amit Langote
 wrote:
> On 2017/09/05 13:20, Amit Langote wrote:
>> The later phase can
>> build that list from the AppendRelInfos that you mention we now [1] build.
>>
>> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30833ba154
>
> Looking at that commit again, AppendRelInfos are still not created for
> partitioned child tables.  Looking at the code in
> expand_single_inheritance_child(), which exists as of 30833ba154:
>
>
> /*
>  * Build an AppendRelInfo for this parent and child, unless the child is a
>  * partitioned table.
>  */
> if (childrte->relkind != RELKIND_PARTITIONED_TABLE && !childrte->inh)
> {
>  ...code that builds AppendRelInfo...
> }
> else
> *partitioned_child_rels = lappend_int(*partitioned_child_rels,
>   childRTindex);
>
> you can see that an AppendRelInfo won't get built for partitioned child
> tables.
>
> Also, even if the commit changed things so that the child RT entries (and
> AppendRelInfos) now get built in an order determined by depth-first
> traversal of the partition tree, the same original parent RT index is used
> to mark all AppendRelInfos, so the expansion essentially flattens the
> hierarchy.  In the updated patch I will post on the "path toward faster
> partition pruning" thread [1], I am planning to rejigger things so that
> two things start to happen:
>
> 1. For partitioned child tables, build the child RT entry with inh = true
>and also build an AppendRelInfos
>
> 2. When recursively expanding a partitioned child table in
>expand_partitioned_rtentry(), pass its new RT index as the
>parentRTindex to the recursive call of expand_partitioned_rtentry(), so
>that the resulting AppendRelInfos reflect immediate parent-child
>relationship
>
> With 1 in place, build_simple_rel() will build RelOptInfos even for
> partitioned child tables, so that for each one, we can recursively build
> an Append path.  So, instead of just one Append path for the root
> partitioned table, there is one for each partitioned table in the tree.
>
> I will be including the above described change in the partition-pruning
> patch, because the other code in that patch relies on the same and I know
> Ashuotsh has wanted that for a long time. :)

Those changes are already part of my updated 0001 patch. Aren't they?
May be you should just review those and see if those are suitable for
you?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-04 Thread Amit Langote
On 2017/09/05 13:20, Amit Langote wrote:
> The later phase can
> build that list from the AppendRelInfos that you mention we now [1] build.
> 
> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30833ba154

Looking at that commit again, AppendRelInfos are still not created for
partitioned child tables.  Looking at the code in
expand_single_inheritance_child(), which exists as of 30833ba154:


/*
 * Build an AppendRelInfo for this parent and child, unless the child is a
 * partitioned table.
 */
if (childrte->relkind != RELKIND_PARTITIONED_TABLE && !childrte->inh)
{
 ...code that builds AppendRelInfo...
}
else
*partitioned_child_rels = lappend_int(*partitioned_child_rels,
  childRTindex);

you can see that an AppendRelInfo won't get built for partitioned child
tables.

Also, even if the commit changed things so that the child RT entries (and
AppendRelInfos) now get built in an order determined by depth-first
traversal of the partition tree, the same original parent RT index is used
to mark all AppendRelInfos, so the expansion essentially flattens the
hierarchy.  In the updated patch I will post on the "path toward faster
partition pruning" thread [1], I am planning to rejigger things so that
two things start to happen:

1. For partitioned child tables, build the child RT entry with inh = true
   and also build an AppendRelInfos

2. When recursively expanding a partitioned child table in
   expand_partitioned_rtentry(), pass its new RT index as the
   parentRTindex to the recursive call of expand_partitioned_rtentry(), so
   that the resulting AppendRelInfos reflect immediate parent-child
   relationship

With 1 in place, build_simple_rel() will build RelOptInfos even for
partitioned child tables, so that for each one, we can recursively build
an Append path.  So, instead of just one Append path for the root
partitioned table, there is one for each partitioned table in the tree.

I will be including the above described change in the partition-pruning
patch, because the other code in that patch relies on the same and I know
Ashuotsh has wanted that for a long time. :)

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/044e2e09-9690-7aff-1749-2d318da38a11%40lab.ntt.co.jp



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-04 Thread Amit Langote
On 2017/09/04 21:32, Ashutosh Bapat wrote:
> On Mon, Sep 4, 2017 at 10:04 AM, Amit Langote wrote:
>> By the way, if you want to get rid of PartitionedChildRelInfo, you can do
>> that as long as you find some other way of putting together the
>> partitioned_rels list to add into the ModifyTable (Append/MergeAppend)
>> node created for the root partitioned table.  Currently,
>> PartitionedChildRelInfo (and the root->pcinfo_list) is the way for
>> expand_inherited_rtentry() to pass that information to the planner's
>> path-generating code.  We may be able to generate that list when actually
>> creating the path using set_append_rel_pathlist() or
>> inheritance_planner(), without having created a PartitionedChildRelInfo
>> node beforehand.
> 
> AFAIU, the list contained RTIs of the relations, which didnt' have
> corresponding AppendRelInfos to lock those relations. Now that we
> create AppendRelInfos even for partitioned partitions, I don't think
> we need the list to take care of the locks. Is there any other reason
> why we maintain that list (apart from the trigger case I have raised
> and Fujita-san says that the list is not required in that case as
> well.)

We do *need* the list in ModifyTable (Append/MergeAppend) node itself.  We
can, however, get rid of the PartitionedChildRelInfo node that carries the
partitioned child RT indexes from an earlier planning phase
(expand_inherited_rtentry) to a later phase
(create_{modifytable|append|merge_append}_path).  The later phase can
build that list from the AppendRelInfos that you mention we now [1] build.

>>> Though I haven't read the patch yet, I think the above code is useless.
>>> And I proposed a patch to clean it up before [1].  I'll add that patch to
>>> the next commitfest.
>>
>> +1.
> +1. Will Fujita-san's patch also handle getting rid of partitioned_rels list?

As Fujita-san mentioned, his patch won't.  Actually, I suppose he didn't
say that partitioned_rels itself is useless, just that its particular
usage in ExecInitModifyTable is.  We still need that list for planner to
tell the executor that there are some RT entries the latter would need to
lock before executing a given plan.  Without that dedicated list, the
executor cannot know at all that certain tables in the partition tree
(viz. the partitioned ones) need to be locked.  I mentioned the reason -
(Merge)Append.subplans, ModifyTable.resultRelations does not contain
respective entries corresponding to the partitioned tables, and
traditionally, the executor looks at those lists to figure out the tables
to lock.

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30833ba154



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-04 Thread Etsuro Fujita

On 2017/09/04 21:32, Ashutosh Bapat wrote:

On Mon, Sep 4, 2017 at 10:04 AM, Amit Langote

By the way, if you want to get rid of PartitionedChildRelInfo, you can do
that as long as you find some other way of putting together the
partitioned_rels list to add into the ModifyTable (Append/MergeAppend)
node created for the root partitioned table.  Currently,
PartitionedChildRelInfo (and the root->pcinfo_list) is the way for
expand_inherited_rtentry() to pass that information to the planner's
path-generating code.  We may be able to generate that list when actually
creating the path using set_append_rel_pathlist() or
inheritance_planner(), without having created a PartitionedChildRelInfo
node beforehand.


AFAIU, the list contained RTIs of the relations, which didnt' have
corresponding AppendRelInfos to lock those relations. Now that we
create AppendRelInfos even for partitioned partitions, I don't think
we need the list to take care of the locks.
I don't think so either.  (Since I haven't followed discussions on this 
thread in detail yet, I don't understand the idea/need of creating 
AppendRelInfos for partitioned partitions, though.)



Though I haven't read the patch yet, I think the above code is useless.
And I proposed a patch to clean it up before [1].  I'll add that patch to
the next commitfest.


+1.

+1. Will Fujita-san's patch also handle getting rid of partitioned_rels list?


No.  The patch just removes the partitioned_rels list from 
nodeModifyTable.c.


Best regards,
Etsuro Fujita



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-04 Thread Ashutosh Bapat
On Mon, Sep 4, 2017 at 10:04 AM, Amit Langote
 wrote:
> On 2017/09/04 12:38, Etsuro Fujita wrote:
>> On 2017/09/02 4:10, Ashutosh Bapat wrote:
>>> This rebase mainly changes patch 0001, which translates partition
>>> hierarchy into inheritance hierarchy creating AppendRelInfos and
>>> RelOptInfos for partitioned partitions. Because of that, it's not
>>> necessary to record the partitioned partitions in a
>>> PartitionedChildRelInfos::child_rels. The only RTI that goes in there
>>> is the RTI of child RTE which is same as the parent RTE except inh
>>> flag. I tried removing that with a series of changes but it seems that
>>> following code in ExecInitModifyTable() requires it.
>>> 1897 /* The root table RT index is at the head of the
>>> partitioned_rels list */
>>> 1898 if (node->partitioned_rels)
>>> 1899 {
>>> 1900 Index   root_rti;
>>> 1901 Oid root_oid;
>>> 1902
>>> 1903 root_rti = linitial_int(node->partitioned_rels);
>>> 1904 root_oid = getrelid(root_rti, estate->es_range_table);
>>> 1905 rel = heap_open(root_oid, NoLock);  /* locked by InitPlan */
>>> 1906 }
>>> 1907 else
>>> 1908 rel = mtstate->resultRelInfo->ri_RelationDesc;
>>>
>>> I don't know whether we could change this code not to use
>>> PartitionedChildRelInfos::child_rels.
>
> For a root partitioned tables, ModifyTable.partitioned_rels comes from
> PartitionedChildRelInfo.child_rels recorded for the table by
> expand_inherited_rtnentry().  In fact, the latter is copied verbatim to
> ModifyTablePath (or AppendPath/MergeAppendPath) when creating the same.
> The only point of keeping those RT indexes around in the ModifyTable node
> is for the executor to be able to locate partitioned table RT entries and
> lock them.  Without them, the executor wouldn't know about those tables at
> all, because there won't be subplans corresponding to partitioned tables
> in the tree and hence their RT indexes won't appear in the
> ModifyTable.resultRelations list.  If your patch adds partitioned child
> rel AppendRelInfos back for whatever reason, you should also make sure in
> inheritance_planner() that their RT indexes don't end up the
> resultRelations list.  See this piece of code in inheritance_planner():
>
> 1351 /* Build list of sub-paths */
> 1352 subpaths = lappend(subpaths, subpath);
> 1353
> 1354 /* Build list of modified subroots, too */
> 1355 subroots = lappend(subroots, subroot);
> 1356
> 1357 /* Build list of target-relation RT indexes */
> 1358 resultRelations = lappend_int(resultRelations,
> appinfo->child_relid);
>
> Maybe it won't happen, because if this appinfo corresponds to a
> partitioned child table, recursion would occur and we'll get to this piece
> of code for only the leaf children.

You are right. We don't execute above lines for partitioned partitions.

>
> By the way, if you want to get rid of PartitionedChildRelInfo, you can do
> that as long as you find some other way of putting together the
> partitioned_rels list to add into the ModifyTable (Append/MergeAppend)
> node created for the root partitioned table.  Currently,
> PartitionedChildRelInfo (and the root->pcinfo_list) is the way for
> expand_inherited_rtentry() to pass that information to the planner's
> path-generating code.  We may be able to generate that list when actually
> creating the path using set_append_rel_pathlist() or
> inheritance_planner(), without having created a PartitionedChildRelInfo
> node beforehand.

AFAIU, the list contained RTIs of the relations, which didnt' have
corresponding AppendRelInfos to lock those relations. Now that we
create AppendRelInfos even for partitioned partitions, I don't think
we need the list to take care of the locks. Is there any other reason
why we maintain that list (apart from the trigger case I have raised
and Fujita-san says that the list is not required in that case as
well.)

>
>> Though I haven't read the patch yet, I think the above code is useless.
>> And I proposed a patch to clean it up before [1].  I'll add that patch to
>> the next commitfest.
>
> +1.
+1. Will Fujita-san's patch also handle getting rid of partitioned_rels list?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-03 Thread Amit Langote
On 2017/09/04 12:38, Etsuro Fujita wrote:
> On 2017/09/02 4:10, Ashutosh Bapat wrote:
>> This rebase mainly changes patch 0001, which translates partition
>> hierarchy into inheritance hierarchy creating AppendRelInfos and
>> RelOptInfos for partitioned partitions. Because of that, it's not
>> necessary to record the partitioned partitions in a
>> PartitionedChildRelInfos::child_rels. The only RTI that goes in there
>> is the RTI of child RTE which is same as the parent RTE except inh
>> flag. I tried removing that with a series of changes but it seems that
>> following code in ExecInitModifyTable() requires it.
>> 1897 /* The root table RT index is at the head of the
>> partitioned_rels list */
>> 1898 if (node->partitioned_rels)
>> 1899 {
>> 1900 Index   root_rti;
>> 1901 Oid root_oid;
>> 1902
>> 1903 root_rti = linitial_int(node->partitioned_rels);
>> 1904 root_oid = getrelid(root_rti, estate->es_range_table);
>> 1905 rel = heap_open(root_oid, NoLock);  /* locked by InitPlan */
>> 1906 }
>> 1907 else
>> 1908 rel = mtstate->resultRelInfo->ri_RelationDesc;
>>
>> I don't know whether we could change this code not to use
>> PartitionedChildRelInfos::child_rels.

For a root partitioned tables, ModifyTable.partitioned_rels comes from
PartitionedChildRelInfo.child_rels recorded for the table by
expand_inherited_rtnentry().  In fact, the latter is copied verbatim to
ModifyTablePath (or AppendPath/MergeAppendPath) when creating the same.
The only point of keeping those RT indexes around in the ModifyTable node
is for the executor to be able to locate partitioned table RT entries and
lock them.  Without them, the executor wouldn't know about those tables at
all, because there won't be subplans corresponding to partitioned tables
in the tree and hence their RT indexes won't appear in the
ModifyTable.resultRelations list.  If your patch adds partitioned child
rel AppendRelInfos back for whatever reason, you should also make sure in
inheritance_planner() that their RT indexes don't end up the
resultRelations list.  See this piece of code in inheritance_planner():

1351 /* Build list of sub-paths */
1352 subpaths = lappend(subpaths, subpath);
1353
1354 /* Build list of modified subroots, too */
1355 subroots = lappend(subroots, subroot);
1356
1357 /* Build list of target-relation RT indexes */
1358 resultRelations = lappend_int(resultRelations,
appinfo->child_relid);

Maybe it won't happen, because if this appinfo corresponds to a
partitioned child table, recursion would occur and we'll get to this piece
of code for only the leaf children.

By the way, if you want to get rid of PartitionedChildRelInfo, you can do
that as long as you find some other way of putting together the
partitioned_rels list to add into the ModifyTable (Append/MergeAppend)
node created for the root partitioned table.  Currently,
PartitionedChildRelInfo (and the root->pcinfo_list) is the way for
expand_inherited_rtentry() to pass that information to the planner's
path-generating code.  We may be able to generate that list when actually
creating the path using set_append_rel_pathlist() or
inheritance_planner(), without having created a PartitionedChildRelInfo
node beforehand.

> Though I haven't read the patch yet, I think the above code is useless.
> And I proposed a patch to clean it up before [1].  I'll add that patch to
> the next commitfest.

+1.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-03 Thread Etsuro Fujita

On 2017/09/02 4:10, Ashutosh Bapat wrote:

This rebase mainly changes patch 0001, which translates partition
hierarchy into inheritance hierarchy creating AppendRelInfos and
RelOptInfos for partitioned partitions. Because of that, it's not
necessary to record the partitioned partitions in a
PartitionedChildRelInfos::child_rels. The only RTI that goes in there
is the RTI of child RTE which is same as the parent RTE except inh
flag. I tried removing that with a series of changes but it seems that
following code in ExecInitModifyTable() requires it.
1897 /* The root table RT index is at the head of the
partitioned_rels list */
1898 if (node->partitioned_rels)
1899 {
1900 Index   root_rti;
1901 Oid root_oid;
1902
1903 root_rti = linitial_int(node->partitioned_rels);
1904 root_oid = getrelid(root_rti, estate->es_range_table);
1905 rel = heap_open(root_oid, NoLock);  /* locked by InitPlan */
1906 }
1907 else
1908 rel = mtstate->resultRelInfo->ri_RelationDesc;

I don't know whether we could change this code not to use
PartitionedChildRelInfos::child_rels.
Though I haven't read the patch yet, I think the above code is useless. 
And I proposed a patch to clean it up before [1].  I'll add that patch 
to the next commitfest.


Best regards,
Etsuro Fujita

[1] 
https://www.postgresql.org/message-id/93cf9816-2f7d-0f67-8ed2-4a4e497a6ab8%40lab.ntt.co.jp




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-01 Thread Antonin Houska
Ashutosh Bapat  wrote:

> I originally thought to provide it along-with the changes to
> expand_inherited_rtentry(), but that thread is taking longer. Jeevan
> Chalke needs rebased patches for his work on aggregate pushdown and
> Thomas might need them for further review. So, here they are.

Since I have related patch in the current commitfest
(https://commitfest.postgresql.org/14/1247/), I spent some time reviewing your
patch:

* generate_partition_wise_join_paths()

Right parenthesis is missing in the prologue.


* get_partitioned_child_rels_for_join()

I think the Assert() statement is easier to understand inside the loop, see
the assert.diff attachment.


* have_partkey_equi_join()

As the function handles generic join, this comment doesn't seem to me
relevant:

/*
 * The equi-join between partition keys is strict if equi-join between
 * at least one partition key is using a strict operator. See
 * explanation about outer join reordering identity 3 in
 * optimizer/README
 */
strict_op = op_strict(opexpr->opno);

And I think the function can return true even if strict_op is false for all
the operators evaluated in the loop.


* match_expr_to_partition_keys()

I'm not sure this comment is clear enough:

/*
 * If it's a strict equi-join a NULL partition key on one side will
 * not join a NULL partition key on the other side. So, rows with NULL
 * partition key from a partition on one side can not join with those
 * from a non-matching partition on the other side. So, search the
 * nullable partition keys as well.
 */
if (!strict_op)
continue;

My understanding of the problem of NULL values generated by outer join is:
these NULL values --- if evaluated by non-strict expression --- can make row
of N-th partition on one side of the join match row(s) of *other than* N-th
partition(s) on the other side. Thus the nullable input expressions may only
be evaluated by strict operators. I think it'd be clearer if you stressed that
(undesired) *match* of partition keys can be a problem, rather than mismatch.

If you insist on your wording, then I think you should at least move the
comment below to the part that only deals with strict operators.


* There are several places where lfirst_node() macro should be used. For
  example

rel = lfirst_node(RelOptInfo, lc);

instead of

rel = (RelOptInfo *) lfirst(lc);


* map_and_merge_partitions()

Besides a few changes proposed in map_and_merge_partitions.diff (a few of them
to suppress compiler warnings) I think that this part needs more thought:

{
Assert(mergemap1[index1] != mergemap2[index2] &&
   mergemap1[index1] >= 0 && mergemap2[index2] >= 0);

/*
 * Both the partitions map to different merged partitions. This
 * means that multiple partitions from one relation matches to one
 * partition from the other relation. Partition-wise join does not
 * handle this case right now, since it requires ganging multiple
 * partitions together (into one RelOptInfo).
 */
merged_index = -1;
}

I could hit this path with the following test:

CREATE TABLE a(i int) PARTITION BY LIST(i);
CREATE TABLE a_0 PARTITION OF a FOR VALUES IN (0, 2);
CREATE TABLE b(j int) PARTITION BY LIST(j);
CREATE TABLE b_0 PARTITION OF b FOR VALUES IN (1, 2);

SET enable_partition_wise_join TO on;

SELECT  *
FROMa
FULL JOIN
b ON i = j;

I don't think there's a reason not to join a_0 partition to b_0, is there?

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index a75b1a3..3094b56
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*** get_partitioned_child_rels_for_join(Plan
*** 6160,6170 
  		PartitionedChildRelInfo *pc = lfirst(l);
  
  		if (bms_is_member(pc->parent_relid, join_relids))
  			result = list_concat(result, list_copy(pc->child_rels));
  	}
  
- 	/* The root partitioned table is included as a child rel */
- 	Assert(list_length(result) >= bms_num_members(join_relids));
- 
  	return result;
  }
--- 6160,6172 
  		PartitionedChildRelInfo *pc = lfirst(l);
  
  		if (bms_is_member(pc->parent_relid, join_relids))
+ 		{
+ 			/* The root partitioned table is included as a child rel */
+ 			Assert(list_length(pc->child_rels) >= 1);
+ 
  			result = list_concat(result, list_copy(pc->child_rels));
+ 		}
  	}
  
  	return result;
  }
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
new file mode 100644
index eb35fab..aa9c70d
*** a/src/backend/catalog/partition.c
--- b/src/backend/catalog/partition.c
*** partition_list_bounds_merge(int partnatt
*** 3110,3116 
--- 3110,3119

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-08-16 Thread Robert Haas
On Wed, Aug 16, 2017 at 3:31 AM, Ashutosh Bapat
 wrote:
> There are two ways we can do this
> 1. In expand_inherited_rtentry(), remember (childRTE and childRTIndex)
> or just childRTIndex (using this we can fetch childRTE calling
> rtfetch()) of intermediate partitioned tables. Once we are done
> expanding immediate children, call expand_inherited_rtentry()
> recursively on this list.
>
> 2. expand_inherited_tables() scans root->parse->rtable only upto the
> end of original range table list. Make it go beyond that end,
> expanding any new entries added for intermediate partitions.
>
> FWIW, the first option allows us to keep all AppendRelInfos
> corresponding to one partitioned relation together and also expands
> the whole partition hierarchy in one go. Second will require minimal
> changes to expand_inherited_rtentry(). Both approaches will spend time
> scanning same number of RTE; the first will have them in different
> lists, and second will have them in root->parse->rtable. I don't see
> one being more attractive than the other. Do you have any opinion?

I don't like option (2).  I'm not sure about option (1).  I think
maybe we should have two nested loops in expanded_inherited_rtentry(),
the outer one iterating over partitioned tables (or just the original
parent RTE if partitioning is not involved) and then inner one looping
over individual leaf partitions for each partitioned table.  Probably
we'd end up wanting to move at least some of the logic inside the
existing loop into a subroutine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-08-16 Thread Ashutosh Bapat
On Tue, Aug 15, 2017 at 10:15 PM, Robert Haas  wrote:
> On Thu, Aug 10, 2017 at 8:00 AM, Ashutosh Bapat
>  wrote:
>> Attached patches with the comments addressed.
>
> I have committed 0001-0003 as 480f1f4329f1bf8bfbbcda8ed233851e1b110ad4
> and e139f1953f29db245f60a7acb72fccb1e05d2442.

Thanks a lot Robert. Some less patches to maintain :).

>
> 0004 doesn't apply any more, probably due to commit
> d57929afc7063431f80a0ac4c510fc39aacd22e6.  I think something along
> these lines could be separately committed prior to the main patch, and
> I think that would be a good idea just to flush out any bugs in this
> part independently of the rest.  However, I also think that we
> probably ought to try to get Amit Langote's changes to this function
> to repair the locking order and expand in bound order committed before
> proceeding with these changes.

I am reviewing those changes and contribute to that thread if necessary.

>
> In fact, I think there's a certain amount of conflict between what's
> being discussed over there and what you're trying to do here.  In that
> thread, we propose to move partitioned tables at any level to the
> front of the inheritance expansion.  Here, however, you want to expand
> level by level.  I think partitioned-tables-first is the right
> approach for the reasons discussed on the other thread; namely, we
> want to be able to prune leaf partitions before expanding them, but
> that requires us to expand all the non-leaf tables first to maintain a
> consistent locking order in all scenarios.  So the approach you've
> taken in this patch may need to be re-thought somewhat.
>

There are two ways we can do this
1. In expand_inherited_rtentry(), remember (childRTE and childRTIndex)
or just childRTIndex (using this we can fetch childRTE calling
rtfetch()) of intermediate partitioned tables. Once we are done
expanding immediate children, call expand_inherited_rtentry()
recursively on this list.

2. expand_inherited_tables() scans root->parse->rtable only upto the
end of original range table list. Make it go beyond that end,
expanding any new entries added for intermediate partitions.

FWIW, the first option allows us to keep all AppendRelInfos
corresponding to one partitioned relation together and also expands
the whole partition hierarchy in one go. Second will require minimal
changes to expand_inherited_rtentry(). Both approaches will spend time
scanning same number of RTE; the first will have them in different
lists, and second will have them in root->parse->rtable. I don't see
one being more attractive than the other. Do you have any opinion?

I will submit the rebased patches after reviewing/adjusting Amit's
changes and also the changes in expand_inherited_rtentry() after we
have concluded the approach to be taken.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-08-15 Thread Robert Haas
On Thu, Aug 10, 2017 at 8:00 AM, Ashutosh Bapat
 wrote:
> Attached patches with the comments addressed.

I have committed 0001-0003 as 480f1f4329f1bf8bfbbcda8ed233851e1b110ad4
and e139f1953f29db245f60a7acb72fccb1e05d2442.

0004 doesn't apply any more, probably due to commit
d57929afc7063431f80a0ac4c510fc39aacd22e6.  I think something along
these lines could be separately committed prior to the main patch, and
I think that would be a good idea just to flush out any bugs in this
part independently of the rest.  However, I also think that we
probably ought to try to get Amit Langote's changes to this function
to repair the locking order and expand in bound order committed before
proceeding with these changes.

In fact, I think there's a certain amount of conflict between what's
being discussed over there and what you're trying to do here.  In that
thread, we propose to move partitioned tables at any level to the
front of the inheritance expansion.  Here, however, you want to expand
level by level.  I think partitioned-tables-first is the right
approach for the reasons discussed on the other thread; namely, we
want to be able to prune leaf partitions before expanding them, but
that requires us to expand all the non-leaf tables first to maintain a
consistent locking order in all scenarios.  So the approach you've
taken in this patch may need to be re-thought somewhat.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-08-10 Thread Robert Haas
On Thu, Aug 10, 2017 at 5:43 AM, Thomas Munro
 wrote:
>> Do you think we solving this problem is a prerequisite for
>> partition-wise join? Or should we propose that patch as a separate
>> enhancement?
>
> No, I'm not proposing anything yet.  For now I just wanted to share
> this observation about where hot CPU time goes in simple tests, and
> since it turned out to be a loop in a loop that I could see an easy to
> way to fix for singleton sets and sets with a small range, I couldn't
> help trying it...  But I'm still trying to understand the bigger
> picture.  I'll be interested to compare profiles with the ordered
> append_rel_list version you have mentioned, to see how that moves the
> hot spots.

Perhaps this is stating the obvious, but it's often better to optimize
things like this at a higher level, rather than by tinkering with
stuff like Bitmapset.  On the other hand, sometimes
micro-optimizations are the way to go, because optimizing
find_ec_member_for_tle(), for example, might involve a much broader
rethink of the planner code than we want to undertake right now.

> I guess one very practical question to ask is: can we plan queries
> with realistic numbers of partitioned tables and partitions in
> reasonable times?  Well, it certainly looks very good for hundreds of
> partitions so far...  My own experience of partitioning with other
> RDBMSs has been on that order, 'monthly partitions covering the past
> 10 years' and similar, but on the other hand it wouldn't be surprising
> to learn that people want to go to many thousands, especially for
> schemas which just keep adding partitions over time and don't want to
> drop them.

I've been thinking that it would be good if this feature - and other
new partitioning features - could scale to about 1000 partitions
without too much trouble.  Eventually, it might be nice to scale
higher, but there's not much point in making partition-wise join scale
to 100,000 partitions if we've got some other part of the system that
runs into trouble beyond 250.

> Curious: would you consider joins between partitioned tables and
> non-partitioned tables where the join is pushed down to be a kind of
> "partition-wise join", or something else?  If so, would that be a
> special case, or just the logical extreme case for
> 0014-WIP-Partition-wise-join-for-1-1-1-0-0-1-partition-ma.patch, where
> one single "partition" on the non-partitioned side maps to all the
> partitions on the partitioned size?

I think this is actually a really important case which we've just
excluded from the initial scope because the problem is hard enough
already.  But it's quite possible that if you are joining partitioned
tables A and B with unpartitioned table X, the right join order could
be A-X-B; the A-X join might knock out a lot of rows.  It's not great
to have to pick between doing the A-B join partitionwise and doing the
A-X join first; you want to get both things.  But we can't do
everything at once.

Further down the road, there's more than one way of doing the A-X
join.  You could join each partition of A to all of X, which is likely
optimal if for example you are doing a nested loop with an inner index
scan on X.  But you could also partition X on the fly using A's
partitioning scheme and then join partitions of A against the
on-the-fly-partitioned version of X.  That's likely to be a lot better
for a merge join with an underlying sort on X.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-08-10 Thread Ashutosh Bapat
On Thu, Aug 10, 2017 at 3:13 PM, Thomas Munro
 wrote:
> On Thu, Aug 10, 2017 at 6:23 PM, Ashutosh Bapat
>  wrote:
>> Your patch didn't improve planning time without partition-wise join,
>> so it's something good to have along-with partition-wise join. Given
>> that Bitmapsets are used in other parts of code as well, the
>> optimization may affect those parts as well, esp. the overhead of
>> maintaining first_non_empty_wordnum.
>
> Maybe, but if you consider that this container already deals with the
> upper bound moving up by reallocating and copying the whole thing,
> adjusting an int when the lower bound moves down doesn't seem like
> anything to worry about...

Yeah. May be we should check whether that makes any difference to
planning times of TPC-H queries. If it shows any difference.

>
>> Do you think we solving this problem is a prerequisite for
>> partition-wise join? Or should we propose that patch as a separate
>> enhancement?
>
> No, I'm not proposing anything yet.  For now I just wanted to share
> this observation about where hot CPU time goes in simple tests, and
> since it turned out to be a loop in a loop that I could see an easy to
> way to fix for singleton sets and sets with a small range, I couldn't
> help trying it...  But I'm still trying to understand the bigger
> picture.  I'll be interested to compare profiles with the ordered
> append_rel_list version you have mentioned, to see how that moves the
> hot spots.

build_simple_rel() which contains that loop takes only .23% of
planning time. So, I doubt if that's going to change much.
+   0.23%  postgres  postgres[.] build_simple_rel

▒


>
> I guess one very practical question to ask is: can we plan queries
> with realistic numbers of partitioned tables and partitions in
> reasonable times?  Well, it certainly looks very good for hundreds of
> partitions so far...  My own experience of partitioning with other
> RDBMSs has been on that order, 'monthly partitions covering the past
> 10 years' and similar, but on the other hand it wouldn't be surprising
> to learn that people want to go to many thousands, especially for
> schemas which just keep adding partitions over time and don't want to
> drop them.  As for hash partitioning, that seems to be typically done
> with numbers like 16, 32 or 64 in other products from what I can
> glean.  Speculation: perhaps hash partitioning is more motivated by
> parallelism than data maintenance and thus somehow anchored to the
> ground by core counts; if so no planning performance worries there I
> guess (until core counts double quite a few more times).

Agreed.

>
> One nice thing about the planning time is that restrictions on the
> partition key cut down planning time; so where I measure ~7 seconds to
> plan SELECT * FROM foofoo JOIN barbar USING (a, b) with 2k partitions,
> if I add WHERE a > 50 it's ~4 seconds and WHERE a > 99 it's ~0.8s, so
> if someone has a keep-adding-more-partitions-over-time model then at
> least their prunable current day/week/whatever queries will not suffer
> quite so badly.  (Yeah my computer seems to be a lot slower than yours
> for these tests; clang -O2 no asserts on a mid 2014 MBP with i7 @
> 2.2Ghz).

That's interesting observation. Thanks for sharing it.

>
> Curious: would you consider joins between partitioned tables and
> non-partitioned tables where the join is pushed down to be a kind of
> "partition-wise join", or something else?  If so, would that be a
> special case, or just the logical extreme case for
> 0014-WIP-Partition-wise-join-for-1-1-1-0-0-1-partition-ma.patch, where
> one single "partition" on the non-partitioned side maps to all the
> partitions on the partitioned size?
>

Parameterized nest loop joins with partition key as parameters
simulate something like that. Apart from that case, I don't see any
case where such a join would be more efficient compared to the current
method of ganging all partitions and joining them to the unpartitioned
table. But oh wait, that could be useful in sharding, when the
unpartitioned table is replicated and partitioned table is distributed
across shards. So, yes, that's a useful case. I am not sure whether
it's some kind of partition-wise join; it doesn't matter, it looks
useful. Said that, I am not planning to handle it in the near future.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   3   >