from:"Ashutosh Bapat"

Re: [HACKERS] Aggregates push-down to partitions

2017-11-10 Thread Ashutosh Bapat

On Fri, Nov 10, 2017 at 12:20 AM, Maksim Milyutin <milyuti...@gmail.com> wrote:
> Hi Konstantin!

>> I wonder if somebody already investigate this problem or working in this
>> direction.
>> May be there are already some patches proposed?
>> I have searched hackers archive, but didn't find something relevant...
>> Are there any suggestions about the best approach to implement this
>> feature?
>>
>
> Maybe in this thread[1] your described problem are solved through
> introducing Parallel Append node?
>
> 1.
> https://www.postgresql.org/message-id/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1%2BS%2BvRuUQ%40mail.gmail.com

You may want to review [2] and [3] as well.

[2] https://www.postgresql.org/message-id/9666.1491295317@localhost
[3] 
https://www.postgresql.org/message-id/CAM2+6=V64_xhstVHie0Rz=kpeqnljmzt_e314p0jat_oj9m...@mail.gmail.com
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation

2017-11-08 Thread Ashutosh Bapat

On Wed, Nov 8, 2017 at 3:18 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> I think it would be a good idea, as Thomas says, to order the qual
>> clauses at an earlier stage and then remember our decision.  However,
>> we have to think about whether that's going to increase planning time
>> in a noticeable way.  I wonder why we currently postpone this until
>> such a late phase of planning.
>
> Because (1) up to now there's been no need to consider the qual ordering
> till later, and (2) re-doing that sort for each path seemed unduly
> expensive.  If we're to try to estimate whether later quals will be
> reached, then sure the ordering becomes important.  I'm still concerned
> about (2) though.  If there were a way to pre-sort the quals once for
> all paths, the problem goes away, but I don't think that works ---
> indexscans may segregate some quals, and in join cases different paths
> will actually have different sets of quals they need to check depending
> on the join order.
>

Looking at order_qual_clauses(), we can say that a set of quals q1
 qn are ordered the same irrespective of the set of clauses they
are subset of. E.g. if {q1 .. qn} is subset of Q (ordered as Qo) and
also Q' (ordered as Q'o) the order in which they appear in Qo and Q'o
is same. So, even if different paths segregate the clauses in
different set, within each set the order is same. FWIW, we can order
all clauses in largest set once and use that order every time. Albeit
we will have to remember the order somewhere OR make the separator
routine retain the order in the larger set, which I guess is true
about all separator functions.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation

2017-11-06 Thread Ashutosh Bapat

On Mon, Nov 6, 2017 at 10:01 AM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
>
> This idea seems to makes intuitive sense.  I see that you use
> order_qual_clauses() to know what order they'll run in, so I'm
> wondering if there is any reason we shouldn't do it up front and keep
> it during path building, instead of running it again at plan creation
> time.  Is there some way it could ever produce a different result at
> the two times?

IIRC, only thing that changes between plan time quals and execution
time quals is constaint folding of constant parameters. But I don't
think we change the selectivity estimates when that's done. At the
same time, I don't think we should make a lot of effort to make sure
that the order used during the estimation is same as the order at the
execution; we are anyway estimating. There can always be some
difference between what's estimated and what actually happens.

> Why not also apply this logic to qpquals of joins,
> foreign scans, subplans?  That is, instead of replacing cost_qual_eval
> with this code for baserels, I wonder if we should teach
> cost_qual_eval how to do this so those other users could also benefit
> (having perhaps already ordered the qual clauses earlier).

+1.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] dropping partitioned tables without CASCADE

2017-11-05 Thread Ashutosh Bapat

Somehow the earlier patches missed qualifying pg_get_expr() by
pg_catalog. Fixed it along with annotating the partitioned partition
as ", PARTITIONED".

On Fri, Nov 3, 2017 at 6:09 PM, Alvaro Herrera <alvhe...@alvh.no-ip.org> wrote:
>>
>> Right now, we could do that if we order the list by bound expression;
>> lexically DEFAULT would come before FOR VALUES ... . But that's not
>> future-safe; we may have a bound expression starting with A, B or C.
>> Beyond that it really gets tricky to order the partitions by bounds.
>
> I was just thinking in changing the query to be "order by
> is_the_default_partition, partition_name" instead of just "order by
> partition_name".  Sorting by bounds rather than name (a feature whose
> worth should definitely be discussed separately IMV) sounds a lot more
> complicated.

Right now we don't have a catalog column or a SQL function which can
tell whether a given partition is default partition based on the
partition bounds or otherwise. That's what it seemed when you
suggested ordering by "is_the_default_partition". Instead I have
ordered the partitions by pg_catalog.pg_get_expr(...) = 'DEFAULT'. We
can introduce a SQL function which takes child and parent oids and
return true if it's default partition and use that here, but that
seems more than what you are suggesting here. I have added that as a
separate patch.

If we tackle the problem of listing partitions by their bounds
somehow, DEFAULT partition listing would be tackled anyway. So, may be
we should leave it outside the scope of this patch.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
From 398003b2d5f6b54e6cdd8542f653786987ef3bfe Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.ba...@enterprisedb.com>
Date: Mon, 4 Sep 2017 09:56:41 +0530
Subject: [PATCH 1/2] Improve \d+ output of a partitioned table

While displaying partitions in \d+ output of a partitioned table
annotate the partitioned partitions as "PARTITIONED".

For a partitioned table show the number of partitions even if it's 0.

Ashutosh Bapat and Amit Langote.
---
 src/bin/psql/describe.c|   34 +++-
 src/test/regress/expected/create_table.out |   13 +++
 src/test/regress/expected/foreign_data.out |3 +++
 src/test/regress/expected/insert.out   |   17 ++
 src/test/regress/sql/create_table.sql  |2 +-
 src/test/regress/sql/insert.sql|4 
 6 files changed, 61 insertions(+), 12 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index b7b978a..44c5089 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2870,7 +2870,9 @@ describeOneTableDetails(const char *schemaname,
 		/* print child tables (with additional info if partitions) */
 		if (pset.sversion >= 10)
 			printfPQExpBuffer(,
-			  "SELECT c.oid::pg_catalog.regclass, pg_catalog.pg_get_expr(c.relpartbound, c.oid)"
+			  "SELECT c.oid::pg_catalog.regclass,"
+			  "   pg_catalog.pg_get_expr(c.relpartbound, c.oid),"
+			  "   c.relkind"
 			  " FROM pg_catalog.pg_class c, pg_catalog.pg_inherits i"
 			  " WHERE c.oid=i.inhrelid AND i.inhparent = '%s'"
 			  " ORDER BY c.oid::pg_catalog.regclass::pg_catalog.text;", oid);
@@ -2893,7 +2895,18 @@ describeOneTableDetails(const char *schemaname,
 		else
 			tuples = PQntuples(result);
 
-		if (!verbose)
+		/*
+		 * For a partitioned table with no partitions, always print the number
+		 * of partitions as zero, even when verbose output is expected.
+		 * Otherwise, we will not print "Partitions" section for a partitioned
+		 * table without any partitions.
+		 */
+		if (tableinfo.relkind == RELKIND_PARTITIONED_TABLE && tuples == 0)
+		{
+			printfPQExpBuffer(, _("Number of partitions: %d"), tuples);
+			printTableAddFooter(, buf.data);
+		}
+		else if (!verbose)
 		{
 			/* print the number of child tables, if any */
 			if (tuples > 0)
@@ -2925,12 +2938,21 @@ describeOneTableDetails(const char *schemaname,
 }
 else
 {
+	char *partitioned_note;
+
+	if (*PQgetvalue(result, i, 2) == RELKIND_PARTITIONED_TABLE)
+		partitioned_note = ", PARTITIONED";
+	else
+		partitioned_note = "";
+
 	if (i == 0)
-		printfPQExpBuffer(, "%s: %s %s",
-		  ct, PQgetvalue(result, i, 0), PQgetvalue(result, i, 1));
+		printfPQExpBuffer(, "%s: %s %s%s",
+		  ct, PQgetvalue(result, i, 0), PQgetvalue(result, i, 1),
+		  partitioned_note);
 	else
-		printfPQExpBuffer(, "%*s  %s %s",
-		  ctw, "", PQgetvalue(result, i, 0), PQgetvalue(result, i, 1));
+		printfPQExpBuffer(, "%*s  %s %s%s",
+

Re: [HACKERS] dropping partitioned tables without CASCADE

2017-11-03 Thread Ashutosh Bapat

On Fri, Nov 3, 2017 at 1:42 PM, Alvaro Herrera <alvhe...@alvh.no-ip.org> wrote:
> Amit Langote wrote:
>> On 2017/09/06 19:14, Amit Langote wrote:
>> > On 2017/09/06 18:46, Rushabh Lathia wrote:
>> >> Okay, I have marked this as ready for committer.
>> >
>> > Thanks Ashutosh and Rushabh for rebasing and improving the patch.  Looks
>> > good to me too.
>>
>> Patch needed to be rebased after the default partitions patch went in, so
>> done.  Per build status on http://commitfest.cputube.org :)
>
> I think adding "is partitioned" at end of line isn't good; looks like a
> phrase but isn't translatable.  Maybe add keyword PARTITIONED instead?

In that case may be we should separate bounds and "PARTITIONED" with a
",". "part_default DEFAULT, PARTITIONED" would read better than
"part_default DEFAULT PARTITIONED"?

>
> Having the DEFAULT partition show up in the middle of the list is weird.

Agreed. But that's true even without this patch.

> Is it possible to put it at either start or end of the list?
>

Right now, we could do that if we order the list by bound expression;
lexically DEFAULT would come before FOR VALUES ... . But that's not
future-safe; we may have a bound expression starting with A, B or C.
Beyond that it really gets tricky to order the partitions by bounds.

The goal of this patch is to mark the partitioned partitions as such
and show the number of partitions. While your suggestion is a valid
request, it's kind of beyond the scope of this patch. Someone might
want to extend this request and say that partitions should be listed
in the order of their bounds (I do feel that we should do some effort
in that direction). But I am not sure whether it should be done in
this patch.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [POC] hash partitioning

2017-11-02 Thread Ashutosh Bapat

On Thu, Nov 2, 2017 at 1:35 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Nov 1, 2017 at 3:46 PM, amul sul <sula...@gmail.com> wrote:
>> Although partition constraints become more simple, there isn't any 
>> performance
>> gain with 0005 patch. Also I am little skeptic about logic in 0005 where we
>> copied extended hash function info from the partition key, what if parent is
>> changed while we are using it? Do we need to keep lock on parent until 
>> commit in
>> satisfies_hash_partition?
>
> I don't think it should be possible for the parent to be changed.  I
> mean, the partition key is altogether immutable -- it can't be changed
> after creation time.  The partition bounds can be changed for
> individual partitions but that would require a lock on the partition.
>
> Can you give an example of the kind of scenario about which you are concerned?

Right now partition keys are immutable but we don't have much code
written with that assumption. All the code usually keeps a lock on the
parent till the time they use the information in the partition key. In
a distant future, which may not exist, we may support ALTER TABLE ...
PARTITION BY to change partition keys (albeit at huge cost of data
movement). If we do that, we will have to remember this one-off
instance of code which assumes that the partition keys are immutable.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Account for cost and selectivity of HAVING quals

2017-11-01 Thread Ashutosh Bapat

On Wed, Nov 1, 2017 at 8:41 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> writes:
>> On Wed, Nov 1, 2017 at 3:15 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> here's a patch to fix the planner so that eval costs and selectivity of
>>> HAVING quals are factored into the appropriate plan node numbers.
>>> ...
>>> + /* Add cost of qual, if any --- but we ignore its selectivity */
>
>> And may be we should try to explain why can we ignore selectivity.
>> Similarly for the changes in create_minmaxagg_path().
>
> I'm sure you realize that's because the estimate is already just one
> row ... but sure, we can spell that out.
>

+1. That would be helpful.



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Account for cost and selectivity of HAVING quals

2017-11-01 Thread Ashutosh Bapat

On Wed, Nov 1, 2017 at 3:15 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Pursuant to the discussion at
> https://www.postgresql.org/message-id/20171029112420.8920b5f...@mx.zeyos.com
> here's a patch to fix the planner so that eval costs and selectivity of
> HAVING quals are factored into the appropriate plan node numbers.
> Perhaps unsurprisingly, this doesn't change the results of any
> existing regression tests.
>
> It's slightly annoying that this approach will result in calculating
> the eval costs and selectivity several times, once for each aggregation
> plan type we consider.  I thought about inserting RestrictInfo nodes
> into the havingQual so that those numbers could be cached, but that turned
> out to break various code that doesn't expect to see such nodes there.
> I'm not sure it's worth going to the trouble of fixing that; in the big
> scheme of things, the redundant calculations likely don't cost much, since
> we aren't going to have relevant statistics.
>
> Comments?  If anyone wants to do a real review of this, I'm happy to stick
> it into the upcoming CF; but without an expression of interest, I'll just
> push it.  I don't think there's anything terribly controversial here.
>

I am not able to see how is the following hunk related to $subject
*** create_result_path(PlannerInfo *root, Re
*** 1374,1379 
--- 1374,1380 
  pathnode->path.startup_cost = target->cost.startup;
  pathnode->path.total_cost = target->cost.startup +
  cpu_tuple_cost + target->cost.per_tuple;
+ /* Add cost of qual, if any --- but we ignore its selectivity */
  if (resconstantqual)
  {
  QualCostqual_cost;

And may be we should try to explain why can we ignore selectivity.
Similarly for the changes in create_minmaxagg_path().

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Transactions involving multiple postgres foreign servers

2017-10-30 Thread Ashutosh Bapat

On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> Because I don't want to break the current user semantics. that is,
> currently it's guaranteed that the subsequent reads can see the
> committed result of previous writes even if the previous transactions
> were distributed transactions. And it's ensured by writer side. If we
> can make the reader side ensure it, the backend process don't need to
> wait for the resolver process.
>
> The waiting backend process are released by resolver process after the
> resolver process tried to resolve foreign transactions. Even if
> resolver process failed to either connect to foreign server or to
> resolve foreign transaction the backend process will be released and
> the foreign transactions are leaved as dangling transaction in that
> case, which are processed later. Also if resolver process takes a long
> time to resolve foreign transactions for whatever reason the user can
> cancel it by Ctl-c anytime.
>

So, there's no guarantee that the next command issued from the
connection *will* see the committed data, since the foreign
transaction might not have committed because of a network glitch
(say). If we go this route of making backends wait for resolver to
resolve the foreign transaction, we will have add complexity to make
sure that the waiting backends are woken up in problematic events like
crash of the resolver process OR if the resolver process hangs in a
connection to a foreign server etc. I am not sure that the complexity
is worth the half-guarantee.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath

2017-10-30 Thread Ashutosh Bapat

On Fri, Oct 27, 2017 at 1:36 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> David Rowley <david.row...@2ndquadrant.com> writes:
>> On 27 October 2017 at 01:44, Ashutosh Bapat
>> <ashutosh.ba...@enterprisedb.com> wrote:
>>> I think Antonin has a point. The join processing may deem some base
>>> relations dummy (See populate_joinrel_with_paths()). So an Append
>>> relation which had multiple child alive at the end of
>>> set_append_rel_size() might ultimately have only one child after
>>> partition-wise join has worked on it.
>
> TBH, that means partition-wise join planning is broken and needs to be
> redesigned.  It's impossible that we're going to make sane planning
> choices if the sizes of relations change after we've already done some
> planning work.

The mark_dummy_rel() call that marks a joining relation dummy was
added by 719012e013f10f72938520c46889c52df40fa329. The joining
relations are planned before the joins they participate in are
planned. Further some of the joins in which it participates might have
been planned before the joining pair, which causes it to be marked
dummy, is processed by populate_joinrel_with_paths(). So we already
have places where the sizes of relation changes during planning.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath

2017-10-29 Thread Ashutosh Bapat

On Fri, Oct 27, 2017 at 12:49 AM, David Rowley
<david.row...@2ndquadrant.com> wrote:
> On 27 October 2017 at 01:44, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>> I think Antonin has a point. The join processing may deem some base
>> relations dummy (See populate_joinrel_with_paths()). So an Append
>> relation which had multiple child alive at the end of
>> set_append_rel_size() might ultimately have only one child after
>> partition-wise join has worked on it.
>
> hmm, I see. partition-wise join is able to remove eliminate partitions
> on the other side of the join that can't be matched to:
>
> # set enable_partition_wise_join=on;
> SET
> # explain select count(*) from ab ab1 inner join ab ab2 on ab1.a=ab2.a
> where ab1.a between 1 and 2 and ab2.a between 1 and 10001;
>QUERY PLAN
> 
>  Aggregate  (cost=0.00..0.01 rows=1 width=8)
>->  Result  (cost=0.00..0.00 rows=0 width=0)
>  One-Time Filter: false
> (3 rows)
>
> # \d+ ab
> Table "public.ab"
>  Column |  Type   | Collation | Nullable | Default | Storage | Stats
> target | Description
> +-+---+--+-+-+--+-
>  a  | integer |   |  | | plain   |  |
>  b  | integer |   |  | | plain   |  |
> Partition key: RANGE (a)
> Partitions: ab_p1 FOR VALUES FROM (1) TO (1),
> ab_p2 FOR VALUES FROM (1) TO (2)
>

IIUC, ab1_p2 and ab2_p1 are marked dummy by constraint exclusion.
Because of partition-wise join when the corresponding partitions are
joined, their inner joins are marked dummy resulting in an empty join.
This isn't the case of a join processing marking one of the joining
relations dummy, that I was referring to. But I think it's relevant
here since partition-wise join might create single child joins this
way.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath

2017-10-26 Thread Ashutosh Bapat

On Thu, Oct 26, 2017 at 5:07 PM, David Rowley
<david.row...@2ndquadrant.com> wrote:
> On 26 October 2017 at 23:42, Antonin Houska <a...@cybertec.at> wrote:
>> David Rowley <david.row...@2ndquadrant.com> wrote:
>>
>>> Method 1:
>>>
>>> In set_append_rel_size() detect when just a single subpath would be
>>> added to the Append path.
>>
>> I spent some time reviewing the partition-wise join patch during the last CF
>> and have an impression that set_append_rel_size() is called too early to find
>> out how many child paths will eventually exist if Append represents a join of
>> a partitioned table. I think the partition matching takes place at later 
>> stage
>> and that determines the actual number of partitions, and therefore the number
>> of Append subpaths.
>
> I understand that we must wait until set_append_rel_size() is being
> called on the RELOPT_BASEREL since partitions may themselves be
> partitioned tables and we'll only know what's left after we've
> recursively checked all partitions of the baserel.
>
> I've not studied the partition-wise join code yet, but if we've
> eliminated all but one partition in set_append_rel_size() then I
> imagine we'd just need to ensure partition-wise join is not attempted
> since we'd be joining directly to the only partition anyway.
>

I think Antonin has a point. The join processing may deem some base
relations dummy (See populate_joinrel_with_paths()). So an Append
relation which had multiple child alive at the end of
set_append_rel_size() might ultimately have only one child after
partition-wise join has worked on it.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Transactions involving multiple postgres foreign servers

2017-10-26 Thread Ashutosh Bapat

On Wed, Oct 25, 2017 at 3:15 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> Foreign Transaction Resolver
> ==
> I introduced a new background worker called "foreign transaction
> resolver" which is responsible for resolving the transaction prepared
> on foreign servers. The foreign transaction resolver process is
> launched by backend processes when commit/rollback transaction. And it
> periodically resolves the queued transactions on a database as long as
> the queue is not empty. If the queue has been empty for the certain
> time specified by foreign_transaction_resolver_time GUC parameter, it
> exits. It means that the backend doesn't launch a new resolver process
> if the resolver process is already working. In this case, the backend
> process just adds the entry to the queue on shared memory and wake it
> up. The maximum number of resolver process we can launch is controlled
> by max_foreign_transaction_resolvers. So we recommends to set larger
> max_foreign_transaction_resolvers value than the number of databases.
> The resolver process also tries to resolve dangling transaction as
> well in a cycle.
>
> Processing Sequence
> =
> I've changed the processing sequence of resolving foreign transaction
> so that the second phase of two-phase commit protocol (COMMIT/ROLLBACK
> prepared) is executed by a resolver process, not by backend process.
> The basic processing sequence is following;
>
> * Backend process
> 1. In pre-commit phase, the backend process saves fdwxact entries, and
> then prepares transaction on all foreign servers that can execute
> two-phase commit protocol.
> 2. Local commit.
> 3. Enqueue itself to the shmem queue and change its status to WAITING
> 4. launch or wakeup a resolver process and wait
>
> * Resolver process
> 1. Dequeue the waiting process from shmem qeue
> 2. Collect the fdwxact entries that are associated with the waiting 
> process.
> 3. Resolve foreign transactoins
> 4. Release the waiting process

Why do we want the the backend to linger behind, once it has added its
foreign transaction entries in the shared memory and informed resolver
about it? The foreign connections may take their own time and even
after that there is no guarantee that the foreign transactions will be
resolved in case the foreign server is not available. So, why to make
the backend wait?

>
> 5. Wake up and restart
>
> This is still under the design phase and I'm sure that there is room
> for improvement and consider more sensitive behaviour but I'd like to
> share the current status of the patch. The patch includes regression
> tests but not includes fully documentation.

Any background worker, backend should be child of the postmaster, so
we should not let a backend start a resolver process. It should be the
job of the postmaster.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath

2017-10-26 Thread Ashutosh Bapat

On Thu, Oct 26, 2017 at 3:29 AM, David Rowley
<david.row...@2ndquadrant.com> wrote:
> However, method 2 appears to also require some Var
> translation in Path targetlists which contain this Proxy path, either
> that or some global Var replacement would need to be done during
> setrefs.
>

For inheritance and partitioning, we translate parent expressions to
child expression by applying Var translations. The translated
expressions differ only in the Var nodes (at least for partitioning
this true, and I believe it to be true for inheritance). I have been
toying with an idea of having some kind of a (polymorphic?) Var node
which can represent parent and children. That will greatly reduce the
need to translate the expression trees and will also help in this
implementation. I am wondering, if ProxyPath could be helpful in
avoiding the cost of reparameterize_path_by_child() introduced for
partition-wise join.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: Is anything preventing us from allowing write to foreign tables from standby?

2017-10-26 Thread Ashutosh Bapat

On Wed, Oct 18, 2017 at 5:44 AM, Craig Ringer <cr...@2ndquadrant.com> wrote:
> On 18 October 2017 at 02:01, Alexander Korotkov
> <a.korot...@postgrespro.ru> wrote:
>> On Wed, Sep 6, 2017 at 4:42 PM, Alexander Korotkov
>> <a.korot...@postgrespro.ru> wrote:
>>>
>>> We're currently blocking writing queries on standby if even they are
>>> modifying contents of foreign tables.  But do we have serious reasons for
>>> that?
>>> Keeping in the mind FDW-sharding, making FDW-tables writable from standby
>>> would be good to prevent single-master bottleneck.
>>> I wrote simple patch enabling writing to foreign tables from standbys.  It
>>> works from the first glance for me.
>>
>>
>> No interest yet, but no objections too :-)
>> I'm going to add this to next commitfest.
>
> Superficially at least, it sounds like a good idea.
>
> We should only need a virtual xid when we're working with foreign
> tables since we don't do any local heap changes.
>
> How's it work with savepoints?
>

In a nearby thread, we are discussing about atomic commit of
transactions involving foreign transactions. For maintaining
consistency, atomicity of transactions writing to foreign server, we
will need to create local transactions. Will that be possible on
standby? Obviously, we could add a restriction that no consistency and
atomic commit is guaranteed when foreign servers are written from a
standby. I am not sure how easy would be to impose such a restriction
and whether such a restriction would be practical.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-16 Thread Ashutosh Bapat

On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas <robertmh...@gmail.com> wrote:

>>
>> The fix is to copy the relevant partitioning information from relcache
>> into PartitionSchemeData and RelOptInfo. Here's a quick patch with
>> that fix.
>
> Committed.  I hope that makes things less red rather than more,
> because I'm going to be AFK for a few hours anyway.
>

set_append_rel_size() crashes when it encounters a partitioned table
with a dropped column. Dropped columns do not have any translations
saved in AppendInfo::translated_vars; the corresponding entry is NULL
per make_inh_translation_list().
1802 att = TupleDescAttr(old_tupdesc, old_attno);
1803 if (att->attisdropped)
1804 {
1805 /* Just put NULL into this list entry */
1806 vars = lappend(vars, NULL);
1807 continue;
1808 }

In set_append_rel_size() we try to attr_needed for child tables. While
doing so we try to translate a user attribute number of parent to that
of a child and crash since the translated Var is NULL. Here's patch to
fix the crash. The patch also contains a testcase to test dropped
columns in partitioned table.

Sorry for not noticing this problem earlier.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
From eca9e03410b049bf79d767657ad4d90fb974d19c Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.ba...@enterprisedb.com>
Date: Mon, 16 Oct 2017 13:23:49 +0530
Subject: [PATCH] Ignore dropped columns in set_append_rel_size().

A parent Var corresponding to column dropped from a parent table will
not have any translation for a child.  It won't have any attr_needed
information since it can not be referenced from SQL. Hence ignore
dropped columns while constructing attr_needed information for a child
table.

Ashutosh Bapat.
---
 src/backend/optimizer/path/allpaths.c |   12 
 src/test/regress/expected/alter_table.out |7 +++
 src/test/regress/sql/alter_table.sql  |4 
 3 files changed, 23 insertions(+)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5535b63..1bc5e01 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -950,6 +950,18 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 	attno - 1);
 	int			child_index;
 
+	/*
+	 * Ignore any column dropped from the parent.
+	 * Corresponding Var won't have any translation. It won't
+	 * have attr_needed information, since it can not be
+	 * referenced in the query.
+	 */
+	if (var == NULL)
+	{
+		Assert(attr_needed == NULL);
+		continue;
+	}
+
 	child_index = var->varattno - childrel->min_attr;
 	childrel->attr_needed[child_index] = attr_needed;
 }
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index dbe438d..cc56651 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -3604,6 +3604,13 @@ ALTER TABLE list_parted2 DROP COLUMN b;
 ERROR:  cannot drop column named in partition key
 ALTER TABLE list_parted2 ALTER COLUMN b TYPE text;
 ERROR:  cannot alter type of column named in partition key
+-- dropping non-partition key columns should be allowed on the parent table.
+ALTER TABLE list_parted DROP COLUMN b;
+SELECT * FROM list_parted;
+ a 
+---
+(0 rows)
+
 -- cleanup
 DROP TABLE list_parted, list_parted2, range_parted;
 DROP TABLE fail_def_part;
diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql
index 0c8ae2a..fc7bd66 100644
--- a/src/test/regress/sql/alter_table.sql
+++ b/src/test/regress/sql/alter_table.sql
@@ -2390,6 +2390,10 @@ ALTER TABLE part_2 INHERIT inh_test;
 ALTER TABLE list_parted2 DROP COLUMN b;
 ALTER TABLE list_parted2 ALTER COLUMN b TYPE text;
 
+-- dropping non-partition key columns should be allowed on the parent table.
+ALTER TABLE list_parted DROP COLUMN b;
+SELECT * FROM list_parted;
+
 -- cleanup
 DROP TABLE list_parted, list_parted2, range_parted;
 DROP TABLE fail_def_part;
-- 
1.7.9.5


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [POC] hash partitioning

2017-10-16 Thread Ashutosh Bapat

On Mon, Oct 16, 2017 at 2:36 PM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:

>
> Probably we should move changes to partition_bounds_copy() in 0003 to
> 0001, whose name suggests so.
>

We can't do this, hash partition strategy is introduced by 0002. Sorry
for the noise.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [POC] hash partitioning

2017-10-16 Thread Ashutosh Bapat

On Tue, Oct 10, 2017 at 4:37 PM, amul sul <sula...@gmail.com> wrote:
> On Tue, Oct 10, 2017 at 3:42 PM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>> On Tue, Oct 10, 2017 at 3:32 PM, amul sul <sula...@gmail.com> wrote:
>>
>>>> +hash_part? true : 
>>>> key->parttypbyval[j],
>>>> +key->parttyplen[j]);
>>>> parttyplen is the length of partition key attribute, whereas what you want 
>>>> here
>>>> is the length of type of modulus and remainder. Is that correct? Probably 
>>>> we
>>>> need some special handling wherever parttyplen and parttypbyval is used 
>>>> e.g. in
>>>> call to partition_bounds_equal() from build_joinrel_partition_info().
>>>>
>>>
>>> Unless I am missing something, I don't think we should worry about 
>>> parttyplen
>>> because in the datumCopy() when the datatype is pass-by-value then typelen
>>> is ignored.
>>
>> That's true, but it's ugly, passing typbyvalue of one type and len of other.
>>
>
> How about the attached patch(0003)?
> Also, the dim variable is renamed to natts.

Probably we should move changes to partition_bounds_copy() in 0003 to
0001, whose name suggests so.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] How does postgres store the join predicate for a relation in a given query

2017-10-15 Thread Ashutosh Bapat

On Sat, Oct 14, 2017 at 3:15 AM, Gourav Kumar <gourav1...@gmail.com> wrote:
> Why does have_relevant_joinclause() and have_relevant_eclass_joinclause()
> return true for all possible joins for the query given below.
> Even when they have no join predicate between them.
> e.g. join between ss1 & ws3, ss2 & ws3 etc.
>

The prologues of those functions and comments within those explain that.

/*
 * have_relevant_joinclause
 *  Detect whether there is a joinclause that involves
 *  the two given relations.
 *
 * Note: the joinclause does not have to be evaluable with only these two
 * relations.  This is intentional.  For example consider
 *  SELECT * FROM a, b, c WHERE a.x = (b.y + c.z)
 * If a is much larger than the other tables, it may be worthwhile to
 * cross-join b and c and then use an inner indexscan on a.x.  Therefore
 * we should consider this joinclause as reason to join b to c, even though
 * it can't be applied at that join step.
 */

/*
 * have_relevant_eclass_joinclause
 *  Detect whether there is an EquivalenceClass that could produce
 *  a joinclause involving the two given relations.
 *
 * This is essentially a very cut-down version of
 * generate_join_implied_equalities().  Note it's OK to occasionally say "yes"
 * incorrectly.  Hence we don't bother with details like whether the lack of a
 * cross-type operator might prevent the clause from actually being generated.
 */
May be you want to see whether those comments are applicable in your
case and also see how the callers handle the return values.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2017-10-12 Thread Ashutosh Bapat

On Thu, Oct 12, 2017 at 9:46 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Oct 11, 2017 at 7:08 AM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>> Here's updated patch set based on the basic partition-wise join
>> committed. The patchset applies on top of the patch to optimize the
>> case of dummy partitioned tables [1].
>>
>> Right now, the advanced partition matching algorithm bails out when
>> either of the joining relations has a default partition.
>
> So is that something you are going to fix?
>

Yes, if time permits. I had left the patch unattended while basic
partition-wise join was getting committed. Now that it's committed, I
rebased it. It still has TODOs and some work is required to improve
it. But for the patch to be really complete, we have to deal with the
problem of missing partitions described before. I am fine
collaborating if someone else wants to pick it up.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-12 Thread Ashutosh Bapat

On Thu, Oct 12, 2017 at 10:49 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Oct 11, 2017 at 10:43 PM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>> You are suggesting that a dummy partitioned table be treated as an
>> un-partitioned table and apply above suggested optimization. A join
>> between a partitioned and unpartitioned table is partitioned by the
>> keys of only partitioned table. An unpartitioned table doesn't have
>> any keys, so this is fine. But a dummy partitioned table does have
>> keys. Recording them as keys of the join relation helps when it joins
>> to other relations. Furthermore a join between partitioned and
>> unpartitioned table doesn't require any equi-join condition on
>> partition keys of partitioned table but a join between partitioned
>> tables is considered to be partitioned by keys on both sides only when
>> there is an equi-join. So, when implementing a partitioned join
>> between a partitioned and an unpartitioned table, we will have to make
>> a special case to record partition keys when the unpartitioned side is
>> actually a dummy partitioned table. That might be awkward.
>
> It seems to me that what we really need here is to move all of this
> stuff into a separate struct:
>
> /* used for partitioned relations */
> PartitionScheme part_scheme;/* Partitioning scheme. */
> int nparts; /* number of
> partitions */
> struct PartitionBoundInfoData *boundinfo;   /* Partition bounds */
> struct RelOptInfo **part_rels;  /* Array of RelOptInfos of partitions,
>
>   * stored in the same order of bounds */
> List  **partexprs;  /* Non-nullable partition key
> expressions. */
> List  **nullable_partexprs; /* Nullable partition key
> expressions. */
>

In a very early patch I had PartitionOptInfo to hold all of this.
RelOptInfo then had a pointer of PartitionOptInfo, if it was
partitioned. When a relation can be partitioned in multiple ways like
what you describe or because join by re-partitioning is efficient,
RelOptInfo would have a list of those. But the representation needs to
be thought through. I am wondering whether this should be modelled
like IndexOptInfo. I am not sure. This is a topic of much larger
discussion.

I think we are digressing. We were discussing my patch to handle dummy
partitioned relation, whose children are not marked dummy and do not
have pathlists set. Do you still think that we should leave that
aside?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-11 Thread Ashutosh Bapat

On Wed, Oct 11, 2017 at 7:47 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, Oct 9, 2017 at 2:05 AM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>> On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas <robertmh...@gmail.com> wrote:
>>> Committed.  I hope that makes things less red rather than more,
>>> because I'm going to be AFK for a few hours anyway.
>>
>> Here's the last patch, dealing with the dummy relations, rebased. With
>> this fix every join order of a partitioned join can be considered
>> partitioned. (This wasn't the case earlier when dummy relation was
>> involved.). So, we can allocate the child-join RelOptInfo array in
>> build_joinrel_partition_info(), instead of waiting for an appropriate
>> pair to arrive in try_partition_wise_join().
>
> Wouldn't a far more general approach be to allow a partition-wise join
> between a partitioned table and an unpartitioned table, considering
> the result as partitioned?  That seems like it would very often yield
> much better query plans than what we have right now, and also make the
> need for this particular thing go away.
>

You are suggesting that a dummy partitioned table be treated as an
un-partitioned table and apply above suggested optimization. A join
between a partitioned and unpartitioned table is partitioned by the
keys of only partitioned table. An unpartitioned table doesn't have
any keys, so this is fine. But a dummy partitioned table does have
keys. Recording them as keys of the join relation helps when it joins
to other relations. Furthermore a join between partitioned and
unpartitioned table doesn't require any equi-join condition on
partition keys of partitioned table but a join between partitioned
tables is considered to be partitioned by keys on both sides only when
there is an equi-join. So, when implementing a partitioned join
between a partitioned and an unpartitioned table, we will have to make
a special case to record partition keys when the unpartitioned side is
actually a dummy partitioned table. That might be awkward.

Because we don't have dummy children relation in all cases, we already
have some awkwardness like allocating part_rels array only when we
encounter a join order which has all the children. This patch removes
that.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] How does postgres store the join predicate for a relation in a given query

2017-10-11 Thread Ashutosh Bapat

On Tue, Oct 10, 2017 at 7:29 PM, Gourav Kumar <gourav1...@gmail.com> wrote:
> Hi all,
>
> When you fire a query in postgresql, it will first parse the query and
> create the data structures for storing various aspects of the query and
> executing the query. (Like RangeTblEntry, PlannerInfo, RangeOptInfo etc.).
>
> I want to know how does postgresql stores the join predicates of a query.
> Like which data structure is used to store the join predicates.
>
> How can we find the join predicates applied on a relation from relid, Oid or
> RangeTblEntry ?
>

Every relation has a RelOptInfo associated with it. Predicates
applicable to it are stored in this RelOptInfo as a list. For base
relations (simple tables) it's in baserestrictinfo. The join
predicates applicable are in joininfo. You can get RelOptInfo of a
given simple table using find_base_rel().

HTH.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [POC] hash partitioning

2017-10-10 Thread Ashutosh Bapat

On Tue, Oct 10, 2017 at 3:40 PM, amul sul <sula...@gmail.com> wrote:
>>
>> natts represents the number of attributes, but for the hash partition bound 
>> we
>> are not dealing with the attribute so that I have used short-form of 
>> dimension,
>> thoughts?
>
> Okay, I think the dimension(dim) is also unfit here.  Any suggestions?
>


I think natts is ok, since we are dealing with the number of
attributes in the pack of datums; esp. when ndatums is already taken.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [POC] hash partitioning

2017-10-10 Thread Ashutosh Bapat

On Tue, Oct 10, 2017 at 3:32 PM, amul sul <sula...@gmail.com> wrote:

>> +hash_part? true : key->parttypbyval[j],
>> +key->parttyplen[j]);
>> parttyplen is the length of partition key attribute, whereas what you want 
>> here
>> is the length of type of modulus and remainder. Is that correct? Probably we
>> need some special handling wherever parttyplen and parttypbyval is used e.g. 
>> in
>> call to partition_bounds_equal() from build_joinrel_partition_info().
>>
>
> Unless I am missing something, I don't think we should worry about parttyplen
> because in the datumCopy() when the datatype is pass-by-value then typelen
> is ignored.

That's true, but it's ugly, passing typbyvalue of one type and len of other.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise aggregation/grouping

2017-10-10 Thread Ashutosh Bapat

On Tue, Oct 10, 2017 at 1:31 PM, David Rowley
<david.row...@2ndquadrant.com> wrote:
>
> I don't think there's any need to invent any new GUC. You could just
> divide cpu_tuple_cost by something.
>
> I did a quick benchmark on my laptop to see how much Append really
> costs, and with the standard costs the actual cost seems to be about
> cpu_tuple_cost / 2.4. So probably cpu_tuple_cost / 2 might be
> realistic. create_set_projection_path() does something similar and
> brincostestimate() does some similar magic and applies 0.1 *
> cpu_operator_cost to the total cost.
>
>
> # -- How does that compare to the cpu_tuple_cost?
> # select current_Setting('cpu_tuple_cost')::float8 / 0.00416630302337493743;
> ?column?
> 
>  2.400209476818
> (1 row)
>
> Maybe it's worth trying with different row counts to see if the
> additional cost is consistent, but it's probably not worth being too
> critical here.
>

This looks good to me. I think it should be a separate, yet very small patch.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise aggregation/grouping

2017-10-09 Thread Ashutosh Bapat

On Tue, Oct 10, 2017 at 3:15 AM, David Rowley
<david.row...@2ndquadrant.com> wrote:
> On 10 October 2017 at 01:10, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
>> Attached new patch set having HEAD at 84ad4b0 with all these review points
>> fixed. Let me know if I missed any thanks.
>
> I've only really skimmed over this thread and only opened the code
> enough to extract the following:
>
> + /* Multiply the costs by partition_wise_agg_cost_factor. */
> + apath->startup_cost *= partition_wise_agg_cost_factor;
> + apath->total_cost *= partition_wise_agg_cost_factor;
>
> I've not studied how all the path plumbing is done, but I think
> instead of doing this costing magic we should really stop pretending
> that Append/MergeAppend nodes are cost-free. I think something like
> charging cpu_tuple_cost per row expected through Append/MergeAppend
> would be a better approach to this.
>
> If you perform grouping or partial grouping before the Append, then in
> most cases the Append will receive less rows, so come out cheaper than
> if you perform the grouping after it. I've not learned the
> partition-wise join code enough to know if this is going to affect
> that too, but for everything else, there should be no plan change,
> since there's normally no alternative paths. I see there's even a
> comment in create_append_path() which claims the zero cost is a bit
> optimistic.
>

+1. Partition-wise join will also benefit from costing Append
processing. Number of rows * width of join result compared with the
sum of that measure for joining relations decides whether Append node
processes more data in Append->Join case than Join->Append case.

Append node just returns the result of ExecProcNode(). Charging
cpu_tuple_cost may make it too expensive. In other places where we
charge cpu_tuple_cost there's some processing done to the tuple like
ExecStoreTuple() in SeqNext(). May be we need some other measure for
Append's processing of the tuple.

May be we should try to measure the actual time spent in Append node
as a fraction of say time spent in child seq scans. That might give us
a clue as to how Append processing can be charged in terms of costing.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [POC] hash partitioning

2017-10-09 Thread Ashutosh Bapat

On Mon, Oct 9, 2017 at 4:44 PM, amul sul <sula...@gmail.com> wrote:


> 0002 few changes in partition-wise join code to support
> hash-partitioned table as well & regression tests.

+switch (key->strategy)
+{
+case PARTITION_STRATEGY_HASH:
+/*
+ * Indexes array is same as the greatest modulus.
+ * See partition_bounds_equal() for more explanation.
+ */
+num_indexes = DatumGetInt32(src->datums[ndatums - 1][0]);
+break;
This logic is duplicated at multiple places.  I think it's time we consolidate
these changes in a function/macro and call it from the places where we have to
calculate number of indexes based on the information in partition descriptor.
Refactoring existing code might be a separate patch and then add hash
partitioning case in hash partitioning patch.

+intdim = hash_part? 2 : partnatts;
Call the variable as natts_per_datum or just natts?

+hash_part? true : key->parttypbyval[j],
+key->parttyplen[j]);
parttyplen is the length of partition key attribute, whereas what you want here
is the length of type of modulus and remainder. Is that correct? Probably we
need some special handling wherever parttyplen and parttypbyval is used e.g. in
call to partition_bounds_equal() from build_joinrel_partition_info().

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PoC: full merge join on comparison clause

2017-10-09 Thread Ashutosh Bapat

Hi Alexander,
Commit c7a9fa399 has added another test on mergeopfamilies. I think
your patch will need to take care of that test.

On Wed, Oct 4, 2017 at 6:38 PM, Alexander Kuzmenkov
<a.kuzmen...@postgrespro.ru> wrote:
> As discussed earlier, I changed the way we work with mergeopfamilies. I use
> the "is_equality" flag to indicate whether the clause is an equality one,
> and fill mergeopfamilies for both equality and inequality operators.
> The updated patch is attached (rebased to 20b6552242).
>
>
> --
> Alexander Kuzmenkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] separate serial_schedule useful?

2017-10-09 Thread Ashutosh Bapat

On Sat, Oct 7, 2017 at 10:19 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I wrote:
>> Robert Haas <robertmh...@gmail.com> writes:

Sorry, my bad. I wasn't aware of this rule. I should have looked at
the beginning of the file for any rules.

>>> There's no reason why pg_regress couldn't have a
>>> --bail-if-group-size-exceeds=N argument, or why we couldn't have a
>>> separate Perl script to validate the schedule file as part of the
>>> build process.
>
>> I'd go for the former approach; seems like less new code and fewer cycles
>> used to enforce the rule.
>
> Concretely, how about the attached?  (Obviously we'd have to fix
> parallel_schedule before committing this.)
>

Thanks, this will help. May be we should set default to 20 instead of unlimited.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-10-09 Thread Ashutosh Bapat

On Sat, Oct 7, 2017 at 7:04 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Oct 6, 2017 at 12:37 PM, Alvaro Herrera <alvhe...@alvh.no-ip.org> 
> wrote:
>> 2. create one index for each existing partition.  These would be
>>identical to what would happen if you created the index directly on
>>each partition, except that there is an additional dependency to the
>>parent's abstract index.
>
> One thing I'm a bit worried about is how to name these subordinate
> indexes.  They have to have names because that's how pg_class works,
> and those names can't all be the same, again because that's how
> pg_class works.  There's no problem right away when you first create
> the partitioned index, because you can just pick names out of a hat
> using whatever name-generating algorithm seems best.  However, when
> you dump-and-restore (including but not limited to the pg_upgrade
> case) you've got to preserve those names.  If you just generate a new
> name that may or may not be the same as the old one, then it may
> collide with a user-specified name that only occurs later in the dump.
> Also, you'll have trouble if the user has applied a COMMENT or a
> SECURITY LABEL to the index because that command works by name, or if
> the user has a reference to the index name inside a function or
> whatever.
>
> These are pretty annoying corner-case bugs because they're not likely
> to come up very often.  Most people won't notice or care if the index
> name changes.  But I don't think it's acceptable to just ignore the
> problem.  An idea I had was to treat the abstract index - to use your
> term - sort of the way we treat an extension.  Normally, when you
> create an index on a partitioned table, it cascades, but for dump and
> restore purpose, we tag on some syntax that says "well, don't actually
> create the subordinate indexes, i'll tell you about those later".
> Then for each subordinate index we issue a separate CREATE INDEX
> command followed by ALTER INDEX abstract_index ATTACH PARTITION
> concrete_index or something of that sort.  That means you can't
> absolutely count on the parent index to have all of the children it's
> supposed to have but maybe that's OK.

+1.

How about CREATE INDEX ... PARTITION OF ... FOR TABLE ...? to create
the index and attach it?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-09 Thread Ashutosh Bapat

On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas <robertmh...@gmail.com> wrote:
>
> Committed.  I hope that makes things less red rather than more,
> because I'm going to be AFK for a few hours anyway.
>

Here's the last patch, dealing with the dummy relations, rebased. With
this fix every join order of a partitioned join can be considered
partitioned. (This wasn't the case earlier when dummy relation was
involved.). So, we can allocate the child-join RelOptInfo array in
build_joinrel_partition_info(), instead of waiting for an appropriate
pair to arrive in try_partition_wise_join().
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
From 4bf8ca38719aee730ed57a7f14522384b1ced7b0 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.ba...@enterprisedb.com>
Date: Tue, 29 Aug 2017 12:37:14 +0530
Subject: [PATCH] Support partition-wise join for dummy partitioned relation.

Current partition-wise join implementation treats dummy relations as
unpartitioned since the child relations may not be marked dummy and may not
have their pathlists set (See set_rel_size() and set_rel_pathlist()). Since
children do not have paths set, they can not be used to form a child-join. This
patch marks the children of dummy partitioned relations as dummy, thus allowing
partition-wise join when one of the joining relations is dummy.

If the dummy partitioned relation is a base relation, its children are base
relations as well and we will be marking base relation dummy during join
planning. But this shouldn't be a problem since populate_joinrel_with_paths()
already does that to an inner relation of left join.

Ashutosh Bapat.
---
 src/backend/optimizer/path/allpaths.c |7 +--
 src/backend/optimizer/path/joinrels.c |   24 
 src/backend/optimizer/util/relnode.c  |5 +
 3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5535b63..b46fb5b 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3261,12 +3261,7 @@ generate_partition_wise_join_paths(PlannerInfo *root, RelOptInfo *rel)
 	if (IS_DUMMY_REL(rel))
 		return;
 
-	/*
-	 * Nothing to do if the relation is not partitioned. An outer join
-	 * relation which had empty inner relation in every pair will have rest of
-	 * the partitioning properties set except the child-join RelOptInfos. See
-	 * try_partition_wise_join() for more explanation.
-	 */
+	/* Nothing to do if the relation is not partitioned. */
 	if (rel->nparts <= 0 || rel->part_rels == NULL)
 		return;
 
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 2b868c5..1578dea 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1321,14 +1321,19 @@ try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 	/*
 	 * set_rel_pathlist() may not create paths in children of an empty
-	 * partitioned table and so we can not add paths to child-joins. So, deem
-	 * such a join as unpartitioned. When a partitioned relation is deemed
-	 * empty because all its children are empty, dummy path will be set in
-	 * each of the children.  In such a case we could still consider the join
-	 * as partitioned, but it might not help much.
+	 * partitioned table. Mark such children as dummy so that we can add paths
+	 * to child-joins.
 	 */
-	if (IS_DUMMY_REL(rel1) || IS_DUMMY_REL(rel2))
-		return;
+	if (IS_DUMMY_REL(rel1))
+	{
+		for (cnt_parts = 0; cnt_parts < rel1->nparts; cnt_parts++)
+			mark_dummy_rel(rel1->part_rels[cnt_parts]);
+	}
+	if (IS_DUMMY_REL(rel2))
+	{
+		for (cnt_parts = 0; cnt_parts < rel2->nparts; cnt_parts++)
+			mark_dummy_rel(rel2->part_rels[cnt_parts]);
+	}
 
 	/*
 	 * Since this join relation is partitioned, all the base relations
@@ -1361,11 +1366,6 @@ try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 	nparts = joinrel->nparts;
 
-	/* Allocate space to hold child-joins RelOptInfos, if not already done. */
-	if (!joinrel->part_rels)
-		joinrel->part_rels =
-			(RelOptInfo **) palloc0(sizeof(RelOptInfo *) * nparts);
-
 	/*
 	 * Create child-join relations for this partitioned join, if those don't
 	 * exist. Add paths to child-joins for a pair of child relations
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 3bd1063..21fd29f 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -1746,4 +1746,9 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
 		joinrel->partexprs[cnt] = partexpr;
 		joinrel->nullable_partexprs[cnt] = nullable_partexpr;
 	}
+
+	/* Allocate space to hold child-joins RelOptInfos. */
+	joinrel->part_rels =
+		(RelOptInfo **) palloc0(sizeof(RelOptInfo *) * joinrel->nparts);
+

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-06 Thread Ashutosh Bapat

On Fri, Oct 6, 2017 at 8:45 PM, Robert Haas <robertmh...@gmail.com> wrote:
>
> I think this is very good work and I'm excited about the feature.  Now
> I'll wait to see whether the buildfarm, or Tom, yell at me for
> whatever problems this may still have...
>

Buildfarm animal prion turned red. Before going into that failure,
good news is that the other animals are green. So the plans are
stable.

prion runs the regression with -DRELCACHE_FORCE_RELEASE, which
destroys a relcache entry as soon as its reference count drops down to
0. This destroys everything that's there in corresponding relcache
entry including partition key information and partition descriptor
information. find_partition_scheme() and set_relation_partition_info()
both assume that the relcache information will survive as long as the
relation lock is held. They do not copy the relevant partitioning
information but just copy the pointers. That assumption is wrong.
Because of -DRELCACHE_FORCE_RELEASE, as soon as refcount drops to
zero, the data in partition scheme and partition bounds goes invalid
and various checks to see if partition wise join is possible fail.
That causes partition_join test to fail on prion. But I think, the bug
could cause crash as well.

The fix is to copy the relevant partitioning information from relcache
into PartitionSchemeData and RelOptInfo. Here's a quick patch with
that fix.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 3a8306a..ebda85e 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -702,6 +702,74 @@ partition_bounds_equal(int partnatts, int16 *parttyplen, bool *parttypbyval,
 }
 
 /*
+ * Return a copy of given PartitionBoundInfo structure. The data types of bounds
+ * are described by given partition key specificiation.
+ */
+extern PartitionBoundInfo
+partition_bounds_copy(PartitionBoundInfo src,
+	  PartitionKey key)
+{
+	PartitionBoundInfo	dest;
+	int		i;
+	int		ndatums;
+	int		partnatts;
+	int		num_indexes;
+
+	dest = (PartitionBoundInfo) palloc(sizeof(PartitionBoundInfoData));
+
+	dest->strategy = src->strategy;
+	ndatums = dest->ndatums = src->ndatums;
+	partnatts = key->partnatts;
+
+	/* Range partitioned table has an extra index. */
+	num_indexes = key->strategy == PARTITION_STRATEGY_RANGE ? ndatums + 1 : ndatums;
+
+	/* List partitioned tables have only a single partition key. */
+	Assert(key->strategy != PARTITION_STRATEGY_LIST || partnatts == 1);
+
+	dest->datums = (Datum **) palloc(sizeof(Datum *) * ndatums);
+
+	if (src->kind != NULL)
+	{
+		dest->kind = (PartitionRangeDatumKind **) palloc(ndatums *
+sizeof(PartitionRangeDatumKind *));
+		for (i = 0; i < ndatums; i++)
+		{
+			dest->kind[i] = (PartitionRangeDatumKind *) palloc(partnatts *
+sizeof(PartitionRangeDatumKind));
+
+			memcpy(dest->kind[i], src->kind[i],
+   sizeof(PartitionRangeDatumKind) * key->partnatts);
+		}
+	}
+	else
+		dest->kind = NULL;
+
+	for (i = 0; i < ndatums; i++)
+	{
+		int		j;
+		dest->datums[i] = (Datum *) palloc(sizeof(Datum) * partnatts);
+
+		for (j = 0; j < partnatts; j++)
+		{
+			if (dest->kind == NULL ||
+dest->kind[i][j] == PARTITION_RANGE_DATUM_VALUE)
+dest->datums[i][j] = datumCopy(src->datums[i][j],
+			   key->parttypbyval[j],
+			   key->parttyplen[j]);
+		}
+	}
+
+	dest->indexes = (int *) palloc(sizeof(int) * num_indexes);
+	memcpy(dest->indexes, src->indexes, sizeof(int) * num_indexes);
+
+	dest->null_index = src->null_index;
+	dest->default_index = src->default_index;
+
+	return dest;
+}
+
+/*
  * check_new_partition_bound
  *
  * Checks if the new partition's bound overlaps any of the existing partitions
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 93cc757..9d35a41 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1825,13 +1825,15 @@ set_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
 			Relation relation)
 {
 	PartitionDesc partdesc;
+	PartitionKey  partkey;
 
 	Assert(relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
 
 	partdesc = RelationGetPartitionDesc(relation);
+	partkey = RelationGetPartitionKey(relation);
 	rel->part_scheme = find_partition_scheme(root, relation);
 	Assert(partdesc != NULL && rel->part_scheme != NULL);
-	rel->boundinfo = partdesc->boundinfo;
+	rel->boundinfo = partition_bounds_copy(partdesc->boundinfo, partkey);
 	rel->nparts = partdesc->nparts;
 	set_baserel_partition_key_exprs(relation, rel);
 }
@@ -1888,18 +1890,33 @@ find_partition_scheme(PlannerInfo *root, Relation relation)
 
 	/*
 	 * Did not find matching partition scheme. Create one copying relevant
-	 * information from the re

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-06 Thread Ashutosh Bapat

On Fri, Oct 6, 2017 at 8:45 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Oct 6, 2017 at 8:40 AM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>> Sorry. I sent a wrong file. Here's the real v37.
>
> Committed 0001-0006.  I made some assorted comment and formatting
> changes and two small substantive changes:
>
> - In try_nestloop_path, bms_free(outerrelids) before returning if we
> can't reparameterize.

Hmm. I missed that.

>
> - Moved the call to try_partition_wise_join inside
> populate_joinrel_with_paths, instead of always calling it just after
> that function is called.

This looks good too.

>
> I think this is very good work and I'm excited about the feature.

Thanks a lot Robert for detailed review and guidance. Thanks a lot
Rafia for benchmarking the feature with TPCH and esp. very large scale
database and also for testing and reported some real issues. Thanks
Rajkumar for testing it with an exhaustive testset. Thanks Amit
Langote, Thomas Munro, Dilip Kumar, Antonin Houska, Alvaro Herrera and
Amit Khandekar for their review comments and suggestions. Thanks
Jeevan Chalke, who used the patchset to implement partition-wise
aggregates and provided some insights offlist. Sorry if I have missed
anybody.

As Robert says in the commit message, there's more to do but now that
we have basic feature, improving it incrementally becomes a lot
easier.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-06 Thread Ashutosh Bapat

;
> +* An INNER join between two partitioned relations is partitioned by 
> key
> +* expressions from both the relations. For tables A and B
> partitioned by
> +* a and b respectively, (A INNER JOIN B ON A.a = B.b) is partitioned 
> by
> +* both A.a and B.b.
> +*
> +* A SEMI/ANTI join only retains data from the outer side and is
> +* partitioned by the partition keys of the outer side.
>
> I would write: An INNER join between two partitioned relations can be
> regarded as partitioned by either key expression.  For example, A
> INNER JOIN B ON A.a = B.b can be regarded as partitioned on A.a or on
> B.b; they are equivalent.  For a SEMI or ANTI join, the result can
> only be regarded as being partitioned in the same manner as the outer
> side, since the inner columns are not retained.

Done.

>
> +* An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows 
> with
> +* B.b NULL. These rows may not fit the partitioning
> conditions imposed on
> +* B.b. Hence, strictly speaking, the join is not partitioned by B.b.
>
> Good.
>
> +* Strictly speaking, partition keys of an OUTER join should include
> +* partition key expressions from the OUTER side only. Consider a join
>
> I would join this with the previous sentence instead of repeating
> strictly speaking: ...and thus the partition keys should include
> partition key expressions from the OUTER side only.  After that
> sentence, I'd skip a lot of the intermediate words here and continue
> this way: However, because all commonly-used comparison operators are
> strict, the presence of nulls on the outer side doesn't cause any
> problem; they can't match anything at future join levels anyway.
> Therefore, we track two sets of expressions: those that authentically
> partition the relation (partexprs) and those that partition the
> relation with the exception that extra nulls may be present
> (nullable_partexprs).  When the comparison operator is strict, the
> latter is just as good as the former.
>
> Then, I think you can omit the rest of what you have; it should be
> clear enough what's going on for the full and right cases given that
> explanation.

I liked this version. Changed the comments as per your suggestions.

>
> + * being joined. partexprs and nullable_partexprs are arrays
> containing part_scheme->partnatts
>
> Long line, needs reflowing.

Done. Also fixed a grammatical mistake: contains -> contain in the
last line of that paragraph.

>
> I don't think this is too far from being committable.  You've done
> some nice work here!
>

Thanks a lot for your detailed reviews and guidance. I will post the
updated patchset with my next reply.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-05 Thread Ashutosh Bapat

On Thu, Oct 5, 2017 at 7:18 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> Robert Haas wrote:
>
>> Regarding nomenclature and my previous griping about wisdom, I was
>> wondering about just calling this a "partition join" like you have in
>> the regression test.  So the GUC would be enable_partition_join, you'd
>> have generate_partition_join_paths(), etc.  Basically just delete
>> "wise" throughout.
>
> If I understand correctly, what's being used here is the "-wise" suffix,
> unrelated to wisdom, which Merriam Webster lists as "adverb combining
> form" here https://www.merriam-webster.com/dictionary/wise (though you
> have to scroll down a lot), which is defined as
>
> 1 a :in the manner of  * crabwise * fanwise
>   b :in the position or direction of  * slantwise * clockwise
> 2 :with regard to :in respect of * dollarwise
>

That's right.

> According to that, the right way to write this is "partitionwise join"
> (no dash), which means "join in respect of partitions", "join with
> regard to partitions".

Google lists mostly  "partition wise" or "partition-wise" and very
rarely "partitionwise". The first being used in other DBMS literature.
The paper (there aren't many on this subject) I referred [1] uses
"partition-wise". It made more sense to replace " " or "-" with "_"
when syntax doesn't allow the first two. I am not against
"partitionwise" but I don't see any real reason why we should move
away from popular usage of this term.

[1] https://users.cs.duke.edu/~shivnath/papers/sigmod295-herodotou.pdf

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-05 Thread Ashutosh Bapat

On Wed, Oct 4, 2017 at 9:01 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Sep 21, 2017 at 8:52 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat
>> <ashutosh.ba...@enterprisedb.com> wrote:
>>> About your earlier comment of making build_joinrel_partition_info()
>>> simpler. Right now, the code assumes that partexprs or
>>> nullable_partexpr can be NULL when either of them is not populated.
>>> That may be saves a sizeof(pointer) * (number of keys) byes of memory.
>>> Saving that much memory may not be worth the complexity of code. So,
>>> we may always allocate memory for those arrays and fill it with NIL
>>> values when there are no key expressions to populate those. That will
>>> simplify the code. I haven't done that change in this patchset. I was
>>> busy debugging the Q7 regression. Let me know your comments about
>>> that.
>>
>> Hmm, I'm not sure that's the best approach, but let me look at it more
>> carefully before I express a firm opinion.
>
> Having studied this a bit more, I now think your proposed approach is
> a good idea.

Thanks. Done.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Another oddity in handling of WCO constraints in postgres_fdw

2017-10-04 Thread Ashutosh Bapat

On Wed, Oct 4, 2017 at 5:32 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Oct 4, 2017 at 6:40 AM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>> Just like the local constraints on a foreign table are not ensured on
>> remote table (unless user takes steps to make that sure), WCO defined
>> locally need not be (and probably can not be) ensured remotely. We can
>> check whether a row being sent from the local server to the foreign
>> server obeys WCO, but what foreign server does to that row is beyond
>> local server's scope.
>
> But I think right now we're not checking the row being sent from the
> local server, either.

Didn't 7086be6e3627c1ad797e32ebbdd232905b5f577f fix that?

> The WCO that is being ignored isn't a
> constraint on the foreign table; it's a constraint on a view which
> happens to reference the foreign table.  It seems quite odd for the
> "assume constraints are valid" property of the foreign table to
> propagate back up into the view that references it.
>

The view with WCO is local but the modification which violates WCO is
being made on remote server by a trigger on remote table. Trying to
control that doesn't seem to be a good idea, just like we can't
control what rows get inserted on the foreign server when they violate
local constraints. I am using local constraints as an example of
precedence where we ignore what's happening on remote side and enforce
whatever we could enforce locally. Local server should make sure that
any rows sent from local server to the remote server do not violate
any local WCO. But once it's handed over to the foreign server, we
shouldn't worry about what happens there. That behaviour is ensured by
the above commit, isn't it?  I am not suggesting that we use local
constraints to enforce WCO or something like that.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-04 Thread Ashutosh Bapat

or
not. The testcase is not interested in the actual shape. It doesn't
make sense to just test the output if partition-wise join is not used.
May be a function examining the plan tree would help. The function
will have to handle Result/Sort nodes on top and make sure that Append
has join children. Do you have any other idea to check the shape of
the plan tree without the details? Any EXPLAIN switch, existing
functions etc.?

Removed the extra full outer join testcase.
>
> I think it would be good to have a test case that shows multi-level
> partition-wise join working across multiple levels.  I wrote the
> attached test, which you're welcome to use if you like it, adapt if
> you sorta like it, or replace if you dislike it. The table names at
> least should be changed to something less likely to duplicate other
> tests.
>

There are tests for multi-level partitioned table in the file. They
test whole partition hierarchy join, part of it being joined based on
the quals. Search for
--
-- multi-leveled partitions
--

Have you looked at those? They test two-level partitioned tables and
your test tests three-level partitioned table. I can modify the tests
to have three levels of partitions and different partition schemes on
different levels. Is that what you expect?

[1] 
https://www.postgresql.org/message-id/CAFjFpRcPutbr4nVAsrY-5q%3DwCFrNK25_3MNhHgyYYM0yeOoj%3DQ%40mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 2821662..78eec0a 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -5680,6 +5680,7 @@ prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
 if (em->em_is_const)
 	continue;
 
+Assert(!em->em_is_child || !bms_is_empty(em->em_relids));
 /*
  * Ignore child members unless they match the rel being
  * sorted.
@@ -5796,6 +5797,8 @@ find_ec_member_for_tle(EquivalenceClass *ec,
 		if (em->em_is_const)
 			continue;
 
+		Assert(!em->em_is_child || !bms_is_empty(em->em_relids));
+
 		/*
 		 * Ignore child members unless they match the rel being sorted.
 		 */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Another oddity in handling of WCO constraints in postgres_fdw

2017-10-04 Thread Ashutosh Bapat

On Wed, Oct 4, 2017 at 3:45 PM, Etsuro Fujita
<fujita.ets...@lab.ntt.co.jp> wrote:
> On 2017/10/03 18:16, Ashutosh Bapat wrote:
>>
>> Enforcing WCO constraints imposed by the local server on the row/DML
>> being passed to the foreign server is fine, but trying to impose them
>> on the row being inserted/updated at the foreign server looks odd. May
>> be we should just leave this case as it is. I am comparing this case
>> with the way we handle constraints on a foreign table.
>
>
> Hmm, I think that would be okay in the case where WCO constraints match
> constraints on the foreign table, but I'm not sure that would be okay even
> in the case where WCO constraints don't match?  Consider:
>
> create table bt (a int check (a % 2 = 0));
> create foreign table ft (a int check (a % 2 = 0)) server loopback options
> (table_name 'bt');
> create view rw_view_2 as select * from ft where a % 2 = 0 with check option;
>
> In that case the WCO constraint matches the constraint on the foreign table,
> so there would be no need to ensure the WCO constraint locally (to make the
> explanation simple, we assume here that we don't have triggers on the remote
> end).  BUT: for another auto-updatable view defined using the same foreign
> table like this:
>
> create view rw_view_4 as select * from ft where a % 4 = 0 with check option;
>
> how is the WCO constraint (ie, a % 4 = 0) ensured remotely, which is
> different from the constraint on the foreign table (ie, a % 2 = 0)? Maybe
> I'm missing something, though.

Just like the local constraints on a foreign table are not ensured on
remote table (unless user takes steps to make that sure), WCO defined
locally need not be (and probably can not be) ensured remotely. We can
check whether a row being sent from the local server to the foreign
server obeys WCO, but what foreign server does to that row is beyond
local server's scope.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Another oddity in handling of WCO constraints in postgres_fdw

2017-10-03 Thread Ashutosh Bapat

On Thu, Sep 28, 2017 at 5:45 PM, Etsuro Fujita
<fujita.ets...@lab.ntt.co.jp> wrote:
> Hi,
>
> Commit 7086be6e3627c1ad797e32ebbdd232905b5f577f addressed mishandling of WCO
> in direct foreign table modification by disabling it when we have WCO, but I
> noticed another oddity in postgres_fdw:
>
> postgres=# create table base_tbl (a int, b int);
> postgres=# create function row_before_insupd_trigfunc() returns trigger as
> $$begin new.a := new.a + 10; return new; end$$ language plpgsql;
> postgres=# create trigger row_before_insupd_trigger before insert or update
> on base_tbl for each row execute procedure row_before_insupd_trigfunc();
> postgres=# create server loopback foreign data wrapper postgres_fdw options
> (dbname 'postgres');
> postgres=# create user mapping for CURRENT_USER server loopback;
> postgres=# create foreign table foreign_tbl (a int, b int) server loopback
> options (table_name 'base_tbl');
> postgres=# create view rw_view as select * from foreign_tbl where a < b with
> check option;
>
> So, this should fail, but
>
> postgres=# insert into rw_view values (0, 5);
> INSERT 0 1
>
> The reason for that is: this is processed using postgres_fdw's non-direct
> foreign table modification (ie. ForeignModify), but unlike the RETURNING or
> local after trigger case, the ForeignModify doesn't take care that remote
> triggers might change the data in that case, so the WCO is evaluated using
> the data supplied, not the data actually inserted, which I think is wrong.
> (I should have noticed that as well while working on the fix, though.)  So,
> I'd propose to fix that by modifying postgresPlanForeignModify so that it
> handles WCO the same way as for the RETURNING case.  Attached is a patch for
> that.  I'll add the patch to the next commitfest.
>

Enforcing WCO constraints imposed by the local server on the row/DML
being passed to the foreign server is fine, but trying to impose them
on the row being inserted/updated at the foreign server looks odd. May
be we should just leave this case as it is. I am comparing this case
with the way we handle constraints on a foreign table.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Transactions involving multiple postgres foreign servers

2017-10-02 Thread Ashutosh Bapat

On Fri, Sep 29, 2017 at 9:12 PM, Robert Haas <robertmh...@gmail.com> wrote:
>
> It's possible that we might find that neither of the above approaches
> are practical and that the performance benefits of resolving the
> transaction from the original connection are large enough that we want
> to try to make it work anyhow.  However, I think we can postpone that
> work to a future time.  Any general solution to this problem at least
> needs to be ABLE to resolve transactions at a later time from a
> different session, so let's get that working first, and then see what
> else we want to do.
>

 +1.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PoC: full merge join on comparison clause

2017-10-02 Thread Ashutosh Bapat

On Thu, Sep 28, 2017 at 8:57 PM, Alexander Kuzmenkov
<a.kuzmen...@postgrespro.ru> wrote:
> Hi Ashutosh,
>
> Thanks for the review.
>
> Jeff, I'm copying you because this is relevant to our discussion about what
> to do with mergeopfamilies when adding new merge join types.
>
> You have renamed RestrictInfo member mergeopfamilies as
> equivopfamilies. I don't think that's a good name; it doesn't convey
> that these are opfamilies containing merge operators. The changes in
> check_mergejoinable() suggest that an operator may act as equality
> operator in few operator families and comparison operator in others.
> That looks odd. Actually an operator family contains operators other
> than equality operators, so you may want to retain this member and add
> a new member to specify whether the clause is an equality clause or
> not.
>
>
> For mergeopfamilies, I'm not sure what is the best thing to do. I'll try to
> explain my understanding of the situation, please correct me if I'm wrong.
>
> Before the patch, mergeopfamilies was used for two things: creating
> equivalence classes and performing merge joins.
>
> For equivalence classes: we look at the restriction clauses, and if they
> have mergeopfamilies set, it means that these clause are based on an
> equality operator, and the left and right variables must be equal. To record
> this fact, we create an equivalence class. The variables might be equal for
> one equality operator and not equal for another, so we record the particular
> operator families to which our equality operator belongs.
>
> For merge join: we look at the join clauses, and if they have
> mergeopfamilies set, it means that these clauses are based on an equality
> operator, and we can try performing this particular join as merge join.
> These opfamilies are also used beforehand to create the equivalence classes
> for left and right variables. The equivalence classes are used to match the
> join clauses to pathkeys describing the ordering of join inputs.
>
> So, if we want to start doing merge joins for operators other than equality,
> we still need to record their opfamilies, but only use them for the second
> case and not the first. I chose to put these opfamilies to different
> variables, and
> name the one used for equivalence classes 'equivopfamilies' and the one used
> for merge joins 'mergeopfamilies'. The equality operators are used for both
> cases, so we put their opfamilies into both of these variables.
>
> I agree this might look confusing. Indeed, we could keep a single variable
> for opfamilies, and add separate flags that show how they can be used, be
> that for equivalence classes, merge joins, range joins or some combination
> of them. This is similar to what Jeff did in his range merge join patch [1].
> I will think more about this and try to produce an updated patch.
>

I think we have (ab?)used mergeopfamilies to indicate equality
condition, which needs some changes. May be these two patches are
where we can do those changes.

>
> In mergejoinscansel() you have just removed Assert(op_strategy ==
> BTEqualStrategyNumber); Probably this function is written considering
> on equality operators. But now that we are using this for all other
> operators, we will need more changes to this function. That may be the
> reason why INNER join in your earlier example doesn't choose right
> costing.
>
>
> I changed mergejoinscansel() slightly to reflect the fact that the inner
> relation is scanned from the beginning if we have an inequality merge
> clause.
>
>
> The comment change in final_cost_mergejoin() needs more work. n1, n2,
> n3 are number of rows on inner side with values 1, 2, 3 resp. So n1 +
> n2 + n3 + ... = size of inner relation is correct. In that context I
> am not able to understand your change
> +* If the merge clauses contain inequality, (n1 + n2 + ...) ~=
> +* (size of inner relation)^2.
>
>
> I extended the comment in final_cost_mergejoin(). Not sure if that
> approximation makes any sense, but this is the best I could think of.
>
> Style problems are fixed.
>
> Attached please find the new version of the patch that addresses all the
> review comments except mergeopfamilies.
>
> The current commitfest is ending, but I'd like to continue working on this
> patch, so I am moving it to the next one.
>
>

Thanks for working on the comments. I am interested to continue
reviewing it in the next commitfest.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise aggregation/grouping

2017-09-28 Thread Ashutosh Bapat

On Wed, Sep 27, 2017 at 3:42 PM, Jeevan Chalke
 wrote:
> Thanks Ashutosh for reviewing.
>
> Attached new patch-set with following changes:
>
> 1. Removed earlier 0007 and 0008 patches which were PoC for supporting
> partial aggregation over fdw. I removed them as it will be a different
> issue altogether and hence I will tackle them separately once this is
> done.
>
> This patch-set now includes support for parallel plans within partitions.
>
> Notes:
> HEAD: 59597e6
> Partition-wise Join Version: 34
>
> (First six patches 0001 - 0006, remains the same functionality-wise)
> 0007 - Refactors partial grouping paths creation into the separate function.
> 0008 - Enables parallelism within the partition-wise aggregation.
>
> This patch also includes a fix for the crash reported by Rajkumar.
> While forcibly applying scan/join target to all the Paths for the scan/join
> rel, earlier I was using apply_projection_to_path() which modifies the path
> in-place which causing this crash as the path finally chosen has been
> updated by partition-wise agg path creation. Now I have used
> create_projection_path() like we do in partial aggregation paths.
>
> Also, fixed issues reported by Ashutosh.

Thanks.

Here are comments on 0004 from last patch set. But most the comments
still apply.

Patch 0001 adds functions create_hash_agg_path() and create_sort_agg_path().
Patch 0004 adds a new argument to those functions for conditions in HAVING
clause. We should move those changes to 0001 and pass parse->havingQual to
these functions in 0001 itself. That will keep all changes to those functions
together and also make 0003 small.

The prologue of try_partition_wise_grouping() mentions a restriction of
partition keys being leading group by clauses. This restriction is not
specified in the prologue of have_grouping_by_partkey(), which actually checks
for this restriction. The requirement per prologue of that function is just to
have partition keys in group clause. I think have_grouping_by_partkey() is
correct, and we should change prologue of try_partition_wise_grouping() to be
in sync with have_grouping_by_partkey(). The prologue explains why
partition-wise aggregation/grouping would be efficient with this restriction,
but it doesn't explain why partial aggregation/grouping per partition would be
efficient. May be it should do that as well. Similar is the case with partial
aggregation/grouping discussion in README.

+/* Do not handle grouping sets for now. */
Is this a real restriction or just restriction for first cut of this feature?
Can you please add more explanation? May be update README as well?

+grouped_rel->part_scheme = input_rel->part_scheme;
Is this true even for partial aggregates? I think not. Since group by clause
does not contain partition keys, the rows from multiple partitions participate
in one group and thus the partition keys of input relation do not apply to the
grouped relation. In this case, it seems that the grouped rel will have
part_rels but will not be partitioned.

+/*
+ * If there is no path for the child relation, then we cannot add
+ * aggregate path too.
+ */
+if (input_child_rel->pathlist == NIL)
+return;
When can this happen? I think, similar to partition-wise join it happens when
there is a dummy parent relation. See [1]. If that's the case, you may want to
do things similar to what partition-wise join is doing. If there's some other
reason for doing this, returing from here half-way is actually waste of memory
and planning time. Instead, we may want to loop over the part_rels to find if
any of those have empty pathlist and return from there before doing any actual
work.

+extra.pathTarget = child_target;
+extra.inputRows = input_child_rel->cheapest_startup_path->rows;
+extra.havingQual = (Node *) adjust_appendrel_attrs(root,
+   (Node *)
query->havingQual,
+   nappinfos,
+   appinfos);
These lines are updating some fields of "extra" structure in every loop. The
structure is passed to create_child_grouping_paths() in the loop and to
add_paths_to_append_rel() outside the loop. Thus add_paths_to_append_rel() only
gets some member values for the last child. Is that right? Should we split
extra into two structures one to be used within the loop and one outside? Or
may be send the members being updated within the loop separately?

+/*
+ * Forcibly apply scan/join target to all the Paths for the scan/join
+ * rel.
+ *
[ lines clipped ]
+if (subpath == input_child_rel->cheapest_total_path)
+input_child_rel->cheapest_total_path = path;
+}
+}
This code seems to be copied from grouping_planner() almost verbatim. Is there
a way we can

Re: [HACKERS] PartitionSchemaData & partcollation (Re: [COMMITTERS] pgsql: Associate partitioning information with each RelOptInfo.)

2017-09-28 Thread Ashutosh Bapat

On Thu, Sep 28, 2017 at 11:47 AM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/21 12:42, Robert Haas wrote:
>> Associate partitioning information with each RelOptInfo.
>>
>> This is not used for anything yet, but it is necessary infrastructure
>> for partition-wise join and for partition pruning without constraint
>> exclusion.
>>
>> Ashutosh Bapat, reviewed by Amit Langote and with quite a few changes,
>> mostly cosmetic, by me.  Additional review and testing of this patch
>> series by Antonin Houska, Amit Khandekar, Rafia Sabih, Rajkumar
>> Raghuwanshi, Thomas Munro, and Dilip Kumar.
>
> I noticed that this commit does not add partcollation field to
> PartitionSchemeData, while it adds parttypcoll.  I think it'd be necessary
> to have partcollation too, because partitioning would have used the same,
> not parttypcoll.  For, example see the following code in
> partition_rbound_datum_cmp:
>
> cmpval = DatumGetInt32(FunctionCall2Coll(>partsupfunc[i],
>  key->partcollation[i],
>  rb_datums[i],
>  tuple_datums[i]));
>
> So, it would be wrong to use parttypcoll, if we are to use the collation
> to match a clause with the partition bounds when doing partition-pruning.
> Concretely, a clause's inputcollid should match partcollation for the
> corresponding column, not the column's parttypcoll.
>
> Attached is a patch that adds the same.  I first thought of including it
> in the partition-pruning patch set [1], but thought we could independently
> fix this.
>

I think PartitionSchemeData structure will grow as we need more
information about partition key for various things. E.g. partsupfunc
is not part of this structure right now, but it would be required to
compare partition bound datums. Similarly partcollation. Please add
this to partition pruning patchset. May be parttypcoll won't be used
at all and we may want to remove it altogether.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Transactions involving multiple postgres foreign servers

2017-09-27 Thread Ashutosh Bapat

 no such
>> remote transaction exists, what should we do?  I'm inclined to think
>> that we should regard that as if we had succeeded in resolving the
>> transaction.  Certainly, if we've retried the server repeatedly, it
>> might be that we previously succeeded in resolving the transaction but
>> then the network connection was broken before we got the success
>> message back from the remote server.  But even if that's not the
>> scenario, I think we should assume that the DBA or some other system
>> resolved it and therefore we don't need to do anything further.  If we
>> assume anything else, then we just go into an infinite error loop,
>> which isn't useful behavior.  We could log a message, though (for
>> example, LOG: unable to resolve foreign transaction ... because it
>> does not exist).
>
> Agreed.
>

Yes. I think the current patch takes care of this, except probably the
error message.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise aggregation/grouping

2017-09-26 Thread Ashutosh Bapat

Hi Jeevan,
I have started reviewing these patches.

0001 looks fine. There might be some changes that will be needed, but
those will be clear when I review the patch that uses this
refactoring.

0002
+ *
+ * If targetlist is provided, we use it else use targetlist from the root.
  */
 static double
 get_number_of_groups(PlannerInfo *root,
 double path_rows,
-grouping_sets_data *gd)
+grouping_sets_data *gd,
+List *tlist)
 {
Query  *parse = root->parse;
double  dNumGroups;
+   List   *targetList = (tlist == NIL) ? parse->targetList : tlist;

May be we should just pass targetlist always. Instead of passing NIL,
pass parse->targetList directly. That would save us one conditional
assignment. May be passing NIL is required for the patches that use
this refactoring, but that's not clear as is in this patch.

0003
In the documenation of enable_partition_wise_aggregate, we should
probably explain why the default is off or like partition_wise_join
GUC, explain the consequences of turning it off. I doubt if we could
accept something like partition_wise_agg_cost_factor looks. But we can
discuss this at a later stage. Mean time it may be worthwhile to fix
the reason why we would require this GUC. If the regular aggregation
has cost lesser than partition-wise aggregation in most of the cases,
then probably we need to fix the cost model.

I will continue reviewing rest of the patches.

On Mon, Sep 18, 2017 at 12:37 PM, Jeevan Chalke
<jeevan.cha...@enterprisedb.com> wrote:
>
>
> On Tue, Sep 12, 2017 at 6:21 PM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
>>
>>
>>
>> On Tue, Sep 12, 2017 at 3:24 PM, Rajkumar Raghuwanshi
>> <rajkumar.raghuwan...@enterprisedb.com> wrote:
>>>
>>>
>>> Hi Jeevan,
>>>
>>> I have started testing partition-wise-aggregate and got one observation,
>>> please take a look.
>>> with the v2 patch, here if I change target list order, query is not
>>> picking full partition-wise-aggregate.
>>
>>
>> Thanks Rajkumar for reporting this.
>>
>> I am looking into this issue and will post updated patch with the fix.
>
>
> Logic for checking whether partition keys lead group by keys needs to be
> updated here. The group by expressions can appear in any order without
> affecting the final result. And thus, the need for partition keys should
> be leading the group by keys to have full aggregation is not mandatory.
> Instead we must ensure that the partition keys are part of the group by
> keys to compute full aggregation on a partition.
>
> Attached, revised patch-set with above fix.
>
> Also, in test-cases, I have removed DROP/ANALYZE commands on child
> relations and also removed VERBOSE from the EXPLAIN.
>
> Notes:
> HEAD: 8edacab209957520423770851351ab4013cb0167
> Partition-wise Join patch-set version: v32
>
> Thanks
>
> --
> Jeevan Chalke
> Principal Software Engineer, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-22 Thread Ashutosh Bapat

On Fri, Sep 15, 2017 at 5:29 PM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
>
>>
>> Apart from these there is a regression case on a custom table, on head
>> query completes in 20s and with this patch it takes 27s. Please find
>> the attached .out and .sql file for the output and schema for the test
>> case respectively. I have reported this case before (sometime around
>> March this year) as well, but I am not sure if it was overlooked or is
>> an unimportant and expected behaviour for some reason.
>>
>
> Are you talking about [1]? I have explained about the regression in
> [2] and [3]. This looks like an issue with the existing costing model.
>

I debugged this case further. There are two partitioned tables being
joined prt (with partitions prt_p1, prt_p2 and so on) and prt2 (with
partitions prt2_p1, prt2_p2, and so on). When join is executed without
partition-wise join, prt2 is used to build hash table and prt is used
to probe that hash table. prt2 has lesser number of rows than prt. But
when partition-wise join is used, individual partitions are joined in
reverse join order i.e. partitions of prt are used to build the hash
table and partitions of prt2 are used to probe. This happens because
the path for the other join order (partition of prt2 used to build the
hash table and partition of prt used to probe) has huge cost compared
to the first one (74459 and 313109) and a portion worth 259094 comes
from lines 3226/7 of final_cost_hashjoin()
3215 /*
3216  * The number of tuple comparisons needed is the number of outer
3217  * tuples times the typical number of tuples in a hash
bucket, which
3218  * is the inner relation size times its bucketsize
fraction.  At each
3219  * one, we need to evaluate the hashjoin quals.  But actually,
3220  * charging the full qual eval cost at each tuple is pessimistic,
3221  * since we don't evaluate the quals unless the hash values match
3222  * exactly.  For lack of a better idea, halve the cost estimate to
3223  * allow for that.
3224  */
3225 startup_cost += hash_qual_cost.startup;
3226 run_cost += hash_qual_cost.per_tuple * outer_path_rows *
3227 clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

That's because for some reason innerbucketsize for partition of prt is
22 times more than that for partition of prt2. Looks like we have some
estimation error in estimating bucket sizes.

If I force partitions to be joined with the same order as partitioned
tables (without partition-wise join), child-joins execute faster and
in turn partition-wise join performs better than the
non-partition-wise join. So, this is clearly some estimation and
costing problem with regular joins.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-22 Thread Ashutosh Bapat

On Fri, Sep 22, 2017 at 10:45 AM, Rafia Sabih
<rafia.sa...@enterprisedb.com> wrote:
>>
>> On completing the benchmark for all queries for the above mentioned
>> setup, following performance improvement can be seen,
>> Query | Patch | Head
>> 3  | 1455  |  1631
>> 4  |  499  |  4344
>> 5  |  1464  |  1606
>> 10  |  1475  |  1599
>> 12  |  1465  |  1790
>>
>> Note that all values of execution time are in seconds.
>
> I compared this experiment with non-partitioned database and following
> is the result,
> Query | Non-partitioned head
> 3  |  1752
> 4  |  315
> 5  |  2319
> 10 | 1535
> 12 | 1739
>
> In summary, the query that appears slowest in partitioned database is
> not so otherwise. It is good to see that in Q4 partition-wise join
> helps in achieving performance closer to it's non-partitioned case,
> otherwise partitioning alone causes it to suffer greatly. Apart from
> Q4 it does not looks like partitioning hurts anywhere else, though the
> maximum improvement is ~35% for Q5.
> Another point to note here is that the performance on partitioned and
> unpartitioned heads are quite close (except Q4) which is something
> atleast I wasn't expecting. It looks like we need not to partition the
> tables anyway, or atleast this set of queries doesn't benefit from
> partitioning. Please let me know if somebody has better ideas on how
> partitioning schemes should be applied to make it more beneficial for
> these queries.

Just partitioning is not expected to improve query performance (but we
still see some performance improvement). Partitioning + partition-wise
operations, pruning is expected to show performance gains. IIUC the
results you reported, Q3 takes 1752 seconds with non-partitioned head,
with partitioning it completes in 1631 seconds and with partition-wise
join it completes in 1455, so net improvement because of partitioning
is 300 seconds is almost 16% improvement, which is a lot for very
large data. So, except Q4, every query improves when the tables are
partitioned. Am I interpreting the results correctly?

There may be some other way of partitioning, which may give better
results, but I think what we have now shows the importance of
partitioning in case of very large data e.g. scale 300 TPCH.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-21 Thread Ashutosh Bapat

ops=1
 Worker 2: actual
time=1.317..3123.646 rows=489960 loops=1
 Worker 3: actual
time=0.927..3130.497 rows=475545 loops=1
If we sum the rows returned by each worker they don't add up to
(actual rows) * (actual loops). So I assumed that the unreported
number of rows were processed by the leader. Is that right?

I might be misunderstanding how parallel query works, but here's my
analysis so far. I will continue investigating further.

Any clues would be helpful.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [Proposal] Make the optimiser aware of partitions ordering

2017-09-21 Thread Ashutosh Bapat

On Thu, Sep 21, 2017 at 3:30 AM, Julien Rouhaud <rjuju...@gmail.com> wrote:
> On Thu, Aug 31, 2017 at 5:30 AM, Peter Eisentraut
> <peter.eisentr...@2ndquadrant.com> wrote:
>> On 3/20/17 11:03, Ronan Dunklau wrote:
>>>> Great idea.  This is too late for v10 at this point, but please add it
>>>> to the next CommitFest so we don't forget about it.
>>> I know it is too late, and thought that it was too early to add it to the
>>> commitfest properly since so many design decisions should be discussed. 
>>> Thanks
>>> for the feedback, I added it.
>>
>> This patch no longer applies.  Please send an update.
>
>
> Thanks for the reminder, and sorry for very very late answer.  PFA
> rebased patch on current head.
>
> I unfortunately didn't have time yet to read carefully the newly added
> infrastructure (30833ba154e0c1106d61e3270242dc5999a3e4f3 and
> following), so this patch is still using its own copy of partitions
> list.  I hope I can work on it shortly, but I prefer to send the
> rebased version now, since it's still a POC with knowns problems and
> questions, and I'll more than welcome any guidance with it.
>
>

With 9140cf8269b0c4ae002b2748d93979d535891311, we store the
RelOptInfos of partitions in the RelOptInfo of partitioned table. For
range partitioned table without default partition, they are arranged
in the ascending order of partition bounds. This patch may avoid
MergeAppend if the sort keys match partition keys by creating an
Append path by picking sorted paths from partition RelOptInfos in
order. You may use slightly modified version of
add_paths_to_append_rel().

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-20 Thread Ashutosh Bapat

On Tue, Sep 19, 2017 at 3:17 PM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
>>
>>>> - I'm not entirely sure whether maintaining partexprs and
>>>> nullable_partexprs is the right design.  If I understand correctly,
>>>> whether or not a partexpr is nullable is really a per-RTI property,
>>>> not a per-expression property.  You could consider something like
>>>> "Relids nullable_rels".
>>>
>>> That's true. However in order to decide whether an expression falls on
>>> nullable side of a join, we will need to call pull_varnos() on it and
>>> check the output against nullable_rels. Separating the expressions
>>> themselves avoids that step.
>>
>> Good point.  Also, I'm not sure about cases like this:
>>
>> SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE
>> a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
>>
>> Suppose the relations are all partitioned by (x, y) but that the =
>> operator is not strict.  A partition-wise join is valid between a and
>> b, but we can't regard w as partitioned any more, because w.x might
>> contain nulls in partitions where the partitioning scheme wouldn't
>> allow them.  On the other hand, if the subquery were to select a.x,
>> a.y then clearly it would be fine: there would be no possibility of a
>> NULL having been substituted for a proper value.
>>
>> What if the subquery selected a.x, b.y?  Initially, I thought that
>> would be OK too, because of the fact that the a.y = b.y clause is in
>> the WHERE clause rather than the join condition.  But on further
>> thought I think that probably doesn't work, because with = being a
>> non-strict operator there's no guarantee that it would remove any
>> nulls introduced by the left join.  Of course, if the subselect had a
>> WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT
>> list mention those columns would be fine.
>>
>

In my previous reply to this, I probably didn't answer your question
while I explained the restriction on where equality conditions on
partition keys can appear. Here's answer to your questions assuming
those restrictions don't exist. Actually in the example you have
given, optimizer flattens w as a LJ b which kind of makes the
explanations below a bit complicated.

1. SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x
WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
partition-wise join will be possible between a and b but not between w
and c for the reasons you have explained above.
2. SELECT * FROM (SELECT a.x, a.y FROM a LEFT JOIN b ON a.x = b.x
WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
partition-wise join will be possible between a and b and also between
w and c for the reasons you have explained above.
3. SELECT * FROM (SELECT a.x, b.y FROM a LEFT JOIN b ON a.x = b.x
WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
partition-wise join will be possible between a and b but not w and c
as you have explained.

In this case b.x and b.y will appear as nullable_partexprs in w
(represented as a LJ b in optimizer) and a.x and a.y will appear in
partexprs. Depending upon what gets projected out of w, the join
between w and c will use corresponding keys for equality conditions.
Since the operator is non-strict, any expression which is part of
nullable_partexprs will be discarded in
match_expr_to_partition_keys().

Hope that helps.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-20 Thread Ashutosh Bapat

On Wed, Sep 20, 2017 at 9:44 AM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
>
> The main areas of uncovered lines are: code in
> get_wholerow_ref_from_convert_row_type() and code that calls it, and
> per node type cases in reparameterize_path_by_child().  It seems like
> the former could use a test case, and I wonder if there is some way we
> could write "flat-copy and then apply recursively to all subpaths"
> code like this without having to handle these cases explicitly.  There
> are a couple of other tiny return cases other than just sanity check
> errors which it might be interesting to hit too.

Under the debugger I checked that the test in partition_join.sql
-- left outer join, with whole-row reference
EXPLAIN (COSTS OFF)
SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b
= 0 ORDER BY t1.a, t2.b;
SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b
= 0 ORDER BY t1.a, t2.b;
covers get_wholerow_ref_from_convert_row_type(). But it doesn't cover
a couple of lines in the case of nested ConvertRowTypeExpr in that
function. We can add/modify a testcase in multi-level partitioned
table section to cover that.

reparameterize_path_by_child() coverage is hard. It would require that
many different kinds of paths survive in lower joins in the join tree.
It's hard to come up with queries that would do that with limited
amount of data and a reasonable number of tests. Me and Thomas
discussed about his suggestion about "flat-copy and then apply
recursively to all subpaths" which he sees as a path tree mutator. It
won't improve the test coverage. Like expression_tree_mutator() path
mutation is not that widely used phenomenon, so we do not yet know
what should be the characteristics of a path mutator could be. In case
we see more of path mutation code in future, it's an idea worth
considering.

>
> 2.  What queries in the 0008 patch are hitting lines that 0007 doesn't hit?
>
> I thought about how to answer questions like this and came up with a
> shell script that (1) makes computers run really hot for quite a long
> time and (2) tells you which blocks of SQL hit which lines of C.
> Please find attached the shell script and its output.  The .sql files
> have been annotated with "block" numbers (blocks being chunks of SQL
> stuff separated by blank lines), and the C files annotated with
> references to those block numbers where A = block 
> partition_join.sql and B = block  in partition_join_extras.sql.
>
> Then to find lines that B queries hit but A queries don't and know
> which particular queries hit them, you might use something like:
>
>   grep -v 'SQL blocks: .*A[0-9]' < joinpath.c.aggregated_coverage | \
>   grep 'SQL blocks: .*B[0-9]'
>

Thanks for this. It generates a lot of output (970 lines over all the
coverage files). It will take some time for getting anything
meaningful out of this. May be there's some faster way by looking at
the lines that are covered by B but not A. BTW, I checked those lines
to see if there could be any bug there. But I don't see what could go
wrong with those lines.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Varying results when using merge joins over postgres_fdw vs hash joins

2017-09-20 Thread Ashutosh Bapat

ps=1)
>Output: c.str2
>  Planning time: 4.285 ms
>  Execution time: 104.458 ms
> (14 rows)
>

Since the results returned by the foreign server are according to the
collation of the foreign server, the order doesn't match with order
expected by the local server and so the merge join reports different
rows.

>
> --
> -- query with hash join, returns rows
>
> --
>
> -- the default for the foreign server is to use remote estimates, so we turn
> that off...
>
> # alter foreign table table_with_en_us_utf8_encoding OPTIONS (ADD
> use_remote_estimate 'false');
> ALTER FOREIGN TABLE
>
> -- and then run the same query again
>
> # explain (analyze, verbose) select e.str1, e.str2, e.str3 from
> tmp_on_c_collated_foreign_server c left join table_with_en_us_utf8_encoding
> e  on c.str2 = e.str2 where e.str4='2' ;
>
> QUERY PLAN
> --
>  Hash Join  (cost=110.68..139.45 rows=7 width=1548) (actual
> time=154.280..154.286 rows=7 loops=1)
>Output: e.str1, e.str2, e.str3
>Hash Cond: (c.str2 = (e.str2)::text)
>->  Seq Scan on pg_temp_3.tmp_on_c_collated_foreign_server c
> (cost=0.00..23.60 rows=1360 width=32) (actual time=0.006..0.008 rows=7
> loops=1)
>  Output: c.str2
>->  Hash  (cost=110.67..110.67 rows=1 width=1548) (actual
> time=154.264..154.264 rows=33418 loops=1)
>  Output: e.str1, e.str2, e.str3
>  Buckets: 65536 (originally 1024)  Batches: 1 (originally 1)  Memory
> Usage: 4003kB
>  ->  Foreign Scan on public.table_with_en_us_utf8_encoding e
> (cost=100.00..110.67 rows=1 width=1548) (actual time=8.289..144.210
> rows=33418 loops=1)
>Output: e.str1, e.str2, e.str3
>Remote SQL: SELECT str1, str2, str3 FROM
> public.table_with_en_us_utf8_encoding WHERE ((str4 = '2'::text))
>  Planning time: 0.153 ms
>  Execution time: 156.557 ms
> (13 rows)
>
>

In this case, both tables use same collation while comparing the rows,
so result is different from the merge join result. Hash join executed
on local server and the same executed on foreign server (by importing
local table to the foreign server) would also differ.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PoC: full merge join on comparison clause

2017-09-19 Thread Ashutosh Bapat

Hi Alexander,

On Fri, Aug 25, 2017 at 10:11 PM, Alexander Kuzmenkov
<a.kuzmen...@postgrespro.ru> wrote:
> Here is a new version of the patch, rebased to 749c7c41 and with some
> cosmetic changes.
>

I looked at this patch briefly. This is a useful feature. This isn't a
design level review of the patch. I may get back to that later. But
here are some assorted comments

The patch applies cleanly, but there are some whitespace errors.
src/backend/executor/nodeMergejoin.c:231: trailing whitespace.
+   /*
src/backend/executor/nodeMergejoin.c:232: trailing whitespace.
+* Determine whether we accept lesser and/or equal tuples
src/backend/optimizer/path/joinpath.c:499: trailing whitespace.
+ * a merge join. A merge join executor can only choose inner values that are
src/backend/optimizer/path/joinpath.c:501: trailing whitespace.
+ * have at most one non-equality clause.

The implementation may change, so fixing the white space errors may
not be priority now. The patch compiles cleanly.

You have renamed RestrictInfo member mergeopfamilies as
equivopfamilies. I don't think that's a good name; it doesn't convey
that these are opfamilies containing merge operators. The changes in
check_mergejoinable() suggest that an operator may act as equality
operator in few operator families and comparison operator in others.
That looks odd. Actually an operator family contains operators other
than equality operators, so you may want to retain this member and add
a new member to specify whether the clause is an equality clause or
not.

In mergejoinscansel() you have just removed Assert(op_strategy ==
BTEqualStrategyNumber); Probably this function is written considering
on equality operators. But now that we are using this for all other
operators, we will need more changes to this function. That may be the
reason why INNER join in your earlier example doesn't choose right
costing.

The comment change in final_cost_mergejoin() needs more work. n1, n2,
n3 are number of rows on inner side with values 1, 2, 3 resp. So n1 +
n2 + n3 + ... = size of inner relation is correct. In that context I
am not able to understand your change
+* If the merge clauses contain inequality, (n1 + n2 + ...) ~=
+* (size of inner relation)^2.

Some stylistic comments
+   switch (join_op_strategy)
+   {
+   case BTEqualStrategyNumber:
+   parent->mj_UseEqual[iClause] = true;
+   break;
+   case BTLessEqualStrategyNumber:
+   parent->mj_UseEqual[iClause] = true;
+   /* fall through */
+   case BTLessStrategyNumber:
+   parent->mj_UseLesser[iClause] = true;
+   break;
+   case BTGreaterEqualStrategyNumber:
+   parent->mj_UseEqual[iClause] = true;
+   /* fall through */
+   case BTGreaterStrategyNumber:
+   parent->mj_UseLesser[iClause] = true;
+   break;
+   default:
+   Assert(false);

Add blank lines between different cases and you may want to replace
Assert in default case into an elog(). See for example exprType() or
get_jointype_name().

+   if (sort_result < 0)
+   {
+   result = MJCR_NextOuter;
+   }

We usually do not add {} around a single statement block.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] postgres_fdw bug in 9.6

2017-09-19 Thread Ashutosh Bapat

On Tue, Sep 5, 2017 at 1:10 PM, Etsuro Fujita
<fujita.ets...@lab.ntt.co.jp> wrote:
>
>
> I think Tom is reviewing this patch [1].
>

I am marking this as ready for committer as I don't have any new
comments and possibly other reviewers have also done reviewing it.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-18 Thread Ashutosh Bapat

On Sat, Sep 16, 2017 at 2:53 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Sep 15, 2017 at 6:11 AM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>> Thanks a lot Robert.
>>
>> Here are rebased patches.
>
> I didn't get quite as much time to look at these today as I would have
> liked, but here's what I've got so far.
>
> Comments on 0001:
>
> - In the RelOptInfo, part_oids is defined in a completely different
> part of the structure than nparts, but you can't use it without nparts
> because you don't know how long it is.  I suggest moving the
> definition to just after nparts.
>
> - On the other hand, maybe we should just remove it completely.  I
> don't see it in any of the subsequent patches.  If it's used by the
> advanced matching code, let's leave it out of 0001 for now and add it
> back after the basic feature is committed.

No, it's not used by advanced partition matching code. It was used by
to match OIDs with the child rels to order those in the array. But now
that we are expanding in EIBO fashion, it is not useful. Should have
removed it earlier. Removed now.

>
> - Similarly, partsupfunc isn't used by the later patches either.  It
> seems it could also be removed, at least for now.

It's used by advanced partition matching code to compare bounds. It
will be required by partition pruning patch. But removed for now.

>
> - The comment for partexprs isn't very clear about how the lists
> inside the array work.  My understanding is that the lists have as
> many members as the partition key has columns/expressions.

Actually we are doing some preparation for partition-wise join here.
partexprs and nullable_partexprs are used in partition-wise join
implementation patch. I have updated prologue of RelOptInfo structure
with the comments like below

 * Note: A base relation will always have only one set of partition keys. But a
 * join relation has as many sets of partition keys as the number of joining
 * relations. The number of partition keys is given by
 * "part_scheme->partnatts". "partexprs" and "nullable_partexprs" are arrays
 * containing part_scheme->partnatts elements. Each element of the array
 * contains a list of partition key expressions. For a base relation each list
 * contains only one expression.  For a join relation each list contains at
 * most as many expressions as the joining relations. The expressions in a list
 * at a given position in the array correspond to the partition key at that
 * position. "partexprs" contains partition keys of non-nullable joining
 * relations and "nullable_partexprs" contains partition keys of nullable
 * joining relations. For a base relation only "partexprs" is populated.

Let me know this looks fine. The logic to match the partition keys of
joining relations in have_partkey_equi_join() and
match_expr_to_partition_keys() becomes simpler if we arrange the
partition key expressions as array indexed by position of partition
key and each array element as list of partition key expressions at
that position.

partition pruning might need partexprs look up relevant quals, but
nullable_partexprs doesn't have any use there. So may be we should add
nullable_partexpr to RelOptInfo as part of 0002 (partition-wise join
implementation) instead of 0001. What do you think?

>
> - I'm not entirely sure whether maintaining partexprs and
> nullable_partexprs is the right design.  If I understand correctly,
> whether or not a partexpr is nullable is really a per-RTI property,
> not a per-expression property.  You could consider something like
> "Relids nullable_rels".

That's true. However in order to decide whether an expression falls on
nullable side of a join, we will need to call pull_varnos() on it and
check the output against nullable_rels. Separating the expressions
themselves avoids that step.

>
> Comments on 0002:
>
> - The relationship between deciding to set partition scheme and
> related details and the configured value of enable_partition_wise_join
> needs some consideration.  If the only use of the partition scheme is
> partition-wise join, there's no point in setting it even for a baserel
> unless enable_partition_wise_join is set -- but if there are other
> important uses for that data, such as Amit's partition pruning work,
> then we might want to always set it.  And similarly for a join: if the
> details are only needed in the partition-wise join case, then we only
> need to set them in that case, but if there are other uses, then it's
> different.  If it turns out that setting these details for a baserel
> is useful in other cases but that it's only for a joinrel in the
> partition-wise join case, then the patch gets it exactly right.  But
> is that correct?  I'm not sure.

Partition sch

[HACKERS] Re: [COMMITTERS] pgsql: Expand partitioned table RTEs level by level, without flattening

2017-09-18 Thread Ashutosh Bapat

On Sun, Sep 17, 2017 at 7:24 AM, Robert Haas <robertmh...@gmail.com> wrote:
> Can you debug this on Monday?
>
> ...Robert
>
> Begin forwarded message:
>
> From: Andreas Seltenreich <seltenre...@gmx.de>
> Date: September 16, 2017 at 6:55:46 AM EDT
> To: Robert Haas <rh...@postgresql.org>
> Cc: pgsql-committ...@postgresql.org
> Subject: Re: [COMMITTERS] pgsql: Expand partitioned table RTEs level by
> level, without flattening
>
> Robert Haas writes:
>
> Expand partitioned table RTEs level by level, without flattening.
>
>
> testing with sqlsmith shows that the following assertion in this commit
> doesn't hold:
>
> TRAP: FailedAssertion("!(((brel)->reloptkind == RELOPT_BASEREL ||
> (brel)->reloptkind == RELOPT_OTHER_MEMBER_REL))", File: "initsplan.c", Line:
> 647)
>
> One of the simpler queries that triggers it for me:
>
>select from information_schema.user_mapping_options;

Thanks Andreas for the report. Sorry for the assertion failure.

PFA patch to fix the assertion failure.

The assertion assumed that all relations in simple_rel_array[] were
either "base" relations or "other" relations. This isn't true. The
array can contain "dead" relations as well. I have removed the
assertion and instead fixed the code to skip anything which is not
"base" or "other member rel".

I have also added a test to cover dead relations and lateral
references in join.sql.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
From 0227d3357d0b988183af439bb278e696d516bc5f Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.ba...@enterprisedb.com>
Date: Mon, 18 Sep 2017 12:16:24 +0530
Subject: [PATCH] Skip "dead" relations in create_lateral_join_info()

Prior to commit 0a480502b092195a9b25a2f0f199a21d592a9c57, the loop to propagate
lateral relation information to the child rels was skipping "dead" relations by
skipping any relation with reloptkind other than RELOPT_BASEREL. This commit
enabled lateral relation information to be percolated through relations with
reloptkind RELOPT_OTHER_MEMBER_REL. But it assumed that simple_rel_array can
have only RELOPT_OTHER_MEMBER_REL or RELOPT_BASEREL. Obviously that's not true.
Fix the loop to skip "dead" relations.

Ashutosh Bapat, reported by Andreas Seltenreich.
---
 src/backend/optimizer/plan/initsplan.c |7 +--
 src/test/regress/expected/join.out |   12 
 src/test/regress/sql/join.sql  |5 +
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index ad81f0f..9931ddd 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -632,7 +632,11 @@ create_lateral_join_info(PlannerInfo *root)
 		RelOptInfo *brel = root->simple_rel_array[rti];
 		RangeTblEntry *brte = root->simple_rte_array[rti];
 
-		if (brel == NULL)
+		/*
+		 * Skip empty slots. Also skip non-simple relations i.e. dead
+		 * relations.
+		 */
+		if (brel == NULL || !IS_SIMPLE_REL(brel))
 			continue;
 
 		/*
@@ -644,7 +648,6 @@ create_lateral_join_info(PlannerInfo *root)
 		 * therefore be marked with the appropriate lateral info so that those
 		 * children eventually get marked also.
 		 */
-		Assert(IS_SIMPLE_REL(brel));
 		Assert(brte);
 		if (brel->reloptkind == RELOPT_OTHER_MEMBER_REL &&
 			(brte->rtekind != RTE_RELATION ||
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 06a84e8..f47449b 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -4060,6 +4060,18 @@ select i8.* from int8_tbl i8 left join (select f1 from int4_tbl group by f1) i4
  Seq Scan on int8_tbl i8
 (1 row)
 
+-- check join removal with lateral references
+explain (costs off)
+select 1 from (select a.id FROM a left join b on a.b_id = b.id) q,
+			  lateral generate_series(1, q.id) gs(i) where q.id = gs.i;
+QUERY PLAN 
+---
+ Nested Loop
+   ->  Seq Scan on a
+   ->  Function Scan on generate_series gs
+ Filter: (a.id = i)
+(4 rows)
+
 rollback;
 create temp table parent (k int primary key, pd int);
 create temp table child (k int unique, cd int);
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 8b21838..d847d53 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -1336,6 +1336,11 @@ explain (costs off)
 select i8.* from int8_tbl i8 left join (select f1 from int4_tbl group by f1) i4
   on i8.q1 = i4.f1;
 
+-- check join removal with lateral references
+explain (costs off)
+select 1 from (select a.id FROM a left join b on a.b_id = b.id) q,
+			  lateral generate_series(1, q.id) gs(i) where

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-15 Thread Ashutosh Bapat

On Fri, Sep 15, 2017 at 2:09 PM, Rafia Sabih
<rafia.sa...@enterprisedb.com> wrote:
> On TPC-H benchmarking of this patch, I found a regression in Q7. It
> was taking some 1500s with the patch and some 900s without the patch.
> Please find the attached pwd_reg.zip for the output of explain analyse
> on head and with patch.
>
> The experimental settings used were,
> commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629
> patch-version = v26
>
> Server settings:
> work_mem = 1GB
> shared_buffers = 10GB
> effective_cache_size = 10GB
> max_parallel_workers_per_gather = 4
>
> Partitioning information:
> Partitioning scheme = by range
> Number of partitions in lineitem and orders table = 106
> partition key for lineitem = l_orderkey
> partition key for orders = o_orderkey

I observe that with partition-wise join patch the planner is using
GatherMerge along-with partition-wise join and on head its not using
GatherMerge. Just to make sure that its partition-wise join which is
causing regression and not GatherMerge, can you please run the query
with enable_gathermerge = false?

I see following lines explain analyze output 7_1.out without the patch
 ->  Sort  (cost=84634030.40..84638520.55 rows=1796063
width=72) (actual time=1061001.435..1061106.608 rows=437209 loops=1)
   Sort Key: n1.n_name, n2.n_name,
(date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without
time zone))
   Sort Method: quicksort  Memory: 308912kB
   ->  Hash Join  (cost=16080591.94..84447451.72
rows=1796063 width=72) (actual time=252745.701..1057447.219
rows=1749956 loops=1)
Since Sort doesn't filter any rows, we would expect it to output the
same number of rows as hash join underneath it. But the number of rows
differ in this case. I am wondering whether there's some problem with
the explain analyze output itself.

>
> Apart from these there is a regression case on a custom table, on head
> query completes in 20s and with this patch it takes 27s. Please find
> the attached .out and .sql file for the output and schema for the test
> case respectively. I have reported this case before (sometime around
> March this year) as well, but I am not sure if it was overlooked or is
> an unimportant and expected behaviour for some reason.
>

Are you talking about [1]? I have explained about the regression in
[2] and [3]. This looks like an issue with the existing costing model.

[1] 
https://www.postgresql.org/message-id/caogqiimwcjnrunj_fcdbscrtlej-clp7exfzzipe2ut71n4...@mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAFjFpRedUZPa7tKbCLEGK3u5UWdDNQoN=eyfb7ieg5d0d1p...@mail.gmail.com
[3] 
https://www.postgresql.org/message-id/cafjfprejksdcfaeuzjgd79hoetzpz5bkdxljgxr7qznrxx+...@mail.gmail.com
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] no test coverage for ALTER FOREIGN DATA WRAPPER name HANDLER ...

2017-09-14 Thread Ashutosh Bapat

On Wed, Sep 13, 2017 at 1:46 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
>>
>> Ok. May be then create_function_1.sql is the right place. Just add it
>> to the section of passing tests and annotate that it's testing
>> creating an FDW handler. Sorry for jumping back and forth.
>
> Alright, done.  Thanks for reviewing.
>

LGTM. The patch applies cleanly on the current HEAD, compiles (small
bit in regress.c requires compilation), and make check passes. Marking
this as "ready for committer".

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Log LDAP "diagnostic messages"?

2017-09-14 Thread Ashutosh Bapat

On Wed, Sep 13, 2017 at 6:58 AM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> On Tue, Sep 12, 2017 at 11:23 PM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>> On Wed, Aug 16, 2017 at 11:13 AM, Ashutosh Bapat
>> <ashutosh.ba...@enterprisedb.com> wrote:
>>> On Wed, Aug 16, 2017 at 8:44 AM, Alvaro Herrera
>>> <alvhe...@2ndquadrant.com> wrote:
>>>> Christoph Berg wrote:
>>>>> "Diagnostic message" doesn't really mean anything, and printing
>>>>> "DETAIL: Diagnostic message: " seems redundant to me. Maybe
>>>>> drop that prefix? It should be clear from the context that this is a
>>>>> message from the LDAP layer.
>>>>
>>>> I think making it visible that the message comes from LDAP (rather than
>>>> Postgres or anything else) is valuable.  How about this?
>>>>
>>>> LOG:  could not start LDAP TLS session: Protocol error
>>>> DETAIL:  LDAP diagnostics: unsupported extended operation.
>>>>
>>> +1, pretty neat.
>
> Here is a new version adopting Alvaro's wording.  I'll set this back
> to "Needs review" status.
>

Thanks for the updated patches.

Looks good to me. The patch applies cleanly on the latest HEAD,
compiles without any errors or warnings and make check passes. Marking
this as ready for committer.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] <> join selectivity estimate question

2017-09-13 Thread Ashutosh Bapat

On Thu, Sep 14, 2017 at 4:30 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.mu...@enterprisedb.com> writes:
>> On Wed, Sep 6, 2017 at 11:14 PM, Ashutosh Bapat
>> <ashutosh.ba...@enterprisedb.com> wrote:
>>> I added some "stable" tests to your patch taking inspiration from the
>>> test SQL file. I think those will be stable across machines and runs.
>>> Please let me know if those look good to you.
>
>> Hmm.  But they show actual rows, not plan->plan_rows, and although the
>> former is interesting as a sanity check the latter is the thing under
>> test here.  It seems like we don't have fine enough control of
>> EXPLAIN's output to show estimated rows but not cost.  I suppose we
>> could try to capture EXPLAIN's output somehow (plpgsql dynamic
>> execution or spool output from psql?) and then pull out just the row
>> estimates, maybe with extra rounding to cope with instability.
>
> Don't have time to think about the more general question right now,
> but as far as the testing goes, there's already precedent for filtering
> EXPLAIN output --- see explain_sq_limit() in subselect.sql.  But I'm
> dubious whether the rowcount estimate could be relied on to be perfectly
> machine-independent, even if you were hiding costs successfully.
>

Are you referring to rounding errors? We should probably add some fuzz
factor to cover the rounding errors and cause a diff when difference
in expected and reported plan rows is beyond that fuzz factor.



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] <> join selectivity estimate question

2017-09-13 Thread Ashutosh Bapat

On Thu, Sep 14, 2017 at 4:19 AM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> On Wed, Sep 6, 2017 at 11:14 PM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>> On Fri, Jul 21, 2017 at 4:10 AM, Thomas Munro
>> <thomas.mu...@enterprisedb.com> wrote:
>>> That just leaves the question of whether we should try to handle the
>>> empty RHS and single-value RHS cases using statistics.  My intuition
>>> is that we shouldn't, but I'll be happy to change my intuition and
>>> code that up if that is the feedback from planner gurus.
>>
>> Empty RHS can result from dummy relations also, which are produced by
>> constraint exclusion, so may be that's an interesting case. Single
>> value RHS may be interesting with partitioned table with all rows in a
>> given partition end up with the same partition key value. But may be
>> those are just different patches. I am not sure.
>
> Can you elaborate on the constraint exclusion case?  We don't care
> about the selectivity of an excluded relation, do we?
>

I meant, an empty RHS case doesn't necessarily need an empty table, it
could happen because of a relation excluded by constraints (see
relation_excluded_by_constraints()). So, that's not as obscure as we
would think. But it's not very frequent either. But I think we should
deal with that as a separate patch. This patch improves the estimate
for some cases, while not degrading those in other cases. So, I think
we can leave other cases for a later patch.

> Any other views on the empty and single value special cases, when
> combined with [NOT] EXISTS (SELECT ... WHERE r.something <>
> s.something)?  Looking at this again, my feeling is that they're too
> obscure to spend time on, but others may disagree.
>
>>> Please find attached a new version, and a test script I used, which
>>> shows a bunch of interesting cases.  I'll add this to the commitfest.
>>
>> I added some "stable" tests to your patch taking inspiration from the
>> test SQL file. I think those will be stable across machines and runs.
>> Please let me know if those look good to you.
>
> Hmm.  But they show actual rows, not plan->plan_rows, and although the
> former is interesting as a sanity check the latter is the thing under
> test here.

I missed this point while adopting the tests. Sorry.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Ashutosh Bapat

On Wed, Sep 13, 2017 at 12:51 PM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
> On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote
>> <langote_amit...@lab.ntt.co.jp> wrote:
>>> In this case, AcquireExecutorLocks will lock all the relations in
>>> PlannedStmt.rtable, which must include all partitioned tables of all
>>> partition trees involved in the query.  Of those, it will lock the tables
>>> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
>>> RowExclusiveLock mode.  PlannedStmt.nonleafResultRelations is a global
>>> list of all partitioned table RT indexes obtained by concatenating
>>> partitioned_rels lists of all ModifyTable nodes involved in the query
>>> (set_plan_refs does that).  We need to distinguish nonleafResultRelations,
>>> because we need to take the stronger lock on a given table before any
>>> weaker one if it happens to appear in the query as a non-result relation
>>> too, to avoid lock strength upgrade deadlock hazard.
>>
>> Hmm.  The problem with this theory in my view is that it doesn't
>> explain why InitPlan() and ExecOpenScanRelation() lock the relations
>> instead of just assuming that they are already locked either by
>> AcquireExecutorLocks or by planning.  If ExecLockNonLeafAppendTables()
>> doesn't really need to take locks, then ExecOpenScanRelation() must
>> not need to do it either.  We invented ExecLockNonLeafAppendTables()
>> on the occasion of removing the scans of those tables which would
>> previously have caused ExecOpenScanRelation() to be invoked, so as to
>> keep the locking behavior unchanged.
>>
>> AcquireExecutorLocks() looks like an odd bit of code to me.  The
>> executor itself locks result tables in InitPlan() and then everything
>> else during InitPlan() and all of the others later on while walking
>> the plan tree -- comments in InitPlan() say that this is to avoid a
>> lock upgrade hazard if a result rel is also a source rel.  But
>> AcquireExecutorLocks() has no such provision; it just locks everything
>> in RTE order.  In theory, that's a deadlock hazard of another kind, as
>> we just talked about in the context of EIBO.  In fact, expanding in
>> bound order has made the situation worse: before, expansion order and
>> locking order were the same, so maybe having AcquireExecutorLocks()
>> work in RTE order coincidentally happened to give the same result as
>> the executor code itself as long as there are no result relations.
>> But this is certainly not true any more.  I'm not sure it's worth
>> expending a lot of time on this -- it's evidently not a problem in
>> practice, or somebody probably would've complained before now.
>>
>> But that having been said, I don't think we should assume that all the
>> locks taken from the executor are worthless because plancache.c will
>> always do the job for us.  I don't know of a case where we execute a
>> saved plan without going through the plan cache, but that doesn't mean
>> that there isn't one or that there couldn't be one in the future.
>> It's not the job of these partitioning patches to whack around the way
>> we do locking in general -- they should preserve the existing behavior
>> as much as possible.  If we want to get rid of the locking in the
>> executor altogether, that's a separate discussion where, I have a
>> feeling, there will prove to be better reasons for the way things are
>> than we are right now supposing.
>>
>
> I agree that it's not the job of these patches to change the locking
> or even get rid of partitioned_rels. In order to continue returning
> partitioned_rels in Append paths esp. in the case of queries involving
> set operations and partitioned table e.g "select 1 from t1 union all
> select 2 from t1;" in which t1 is multi-level partitioned table, we
> need a fix in add_paths_to_append_rels(). The fix provided in [1] is
> correct but we will need a longer explanation of why we have to
> involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation
> is complicated. If we get rid of partitioned_rels, we don't need to
> fix that code in add_paths_to_append_rel().
>
> I suggested that [2]
> -- (excerpt from [2])
>
> Actually, the original problem that caused this discussion started
> with an assertion failure in get_partitioned_child_rels() as
> Assert(list_length(result) >= 1);
>
> This assertion fails if result is NIL when an intermediate partitioned
> table is passed. May be we should assert (result == NIL ||
> list_length(result) ==

Re: [HACKERS] no test coverage for ALTER FOREIGN DATA WRAPPER name HANDLER ...

2017-09-13 Thread Ashutosh Bapat

On Wed, Sep 13, 2017 at 1:27 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/13 16:42, Ashutosh Bapat wrote:
>> On Wed, Sep 13, 2017 at 7:49 AM, Amit Langote wrote:
>>> In the attached updated patch, I created separate .source files in
>>> src/test/regress/input and output directories called fdw_handler.source
>>> and put the test_fdw_handler function definition there.  When I had
>>> originally thought of it back when I wrote the patch, it seemed to be an
>>> overkill, because we're just normally defining a single C function there
>>> to be used in the newly added foreign_data tests.  In any case, we need to
>>> go the .source file way, because that's the only way to refer to paths to
>>> .so library when defining C language functions.
>>
>> It still looks like an overkill to add a new file to define a dummy
>> FDW handler. Why do we need to define a handler as a C function? Can't
>> we define handler as a SQL function. If we could do that we could add
>> the function definition in foreign_data.sql itself.
>
> I guess that's because the last time I tried to define the handler as a
> SQL function, I couldn't:
>
> create function test_fdw_handler() returns fdw_handler as '' language sql;
> ERROR:  SQL functions cannot return type fdw_handler
>
> fdw_handler is a pseudo-type, which neither SQL nor plpgsql function can
> return.
>
> Am I missing something?

Ok. May be then create_function_1.sql is the right place. Just add it
to the section of passing tests and annotate that it's testing
creating an FDW handler. Sorry for jumping back and forth.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] no test coverage for ALTER FOREIGN DATA WRAPPER name HANDLER ...

2017-09-13 Thread Ashutosh Bapat

On Wed, Sep 13, 2017 at 7:49 AM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/12 20:17, Ashutosh Bapat wrote:
>> On Tue, Sep 12, 2017 at 2:27 PM, Amit Langote
>> <langote_amit...@lab.ntt.co.jp> wrote:
>>> Thanks Ashutosh for taking a look at this.
>>>
>>> On 2017/09/05 21:16, Ashutosh Bapat wrote:
>>>> The patch needs a rebase.
>>>
>>> Attached rebased patch.
>>
>> Thanks for rebased patch.
>
> Thanks for the review.
>
>> We could annotate each ERROR with an explanation as to why that's an
>> error, but then this file doesn't do that for other commands, so may
>> be the patch is just fine.
>
> Agreed.  Note that this patch is just about adding the tests, not
> modifying foreigncmds.c to change error handling around HANDLER functions.

Yes. I am not concerned about foreigncmds.c but foreign_data.sql/.out

>
>> Also, I am wondering whether we should create the new handler function
>> in foreign.c similar to postgresql_fdw_validator(). The prologue has a
>> caution
>>
>> 606  * Caution: this function is deprecated, and is now meant only for 
>> testing
>> 607  * purposes, because the list of options it knows about doesn't 
>> necessarily
>> 608  * square with those known to whichever libpq instance you might be 
>> using.
>> 609  * Inquire of libpq itself, instead.
>>
>> So, may be we don't want to add it there. But adding the handler
>> function in create_function_1 doesn't seem good. If that's the correct
>> place, then at least it should be moved before " -- Things that
>> shouldn't work:"; it doesn't belong to functions that don't work.
>
> In the attached updated patch, I created separate .source files in
> src/test/regress/input and output directories called fdw_handler.source
> and put the test_fdw_handler function definition there.  When I had
> originally thought of it back when I wrote the patch, it seemed to be an
> overkill, because we're just normally defining a single C function there
> to be used in the newly added foreign_data tests.  In any case, we need to
> go the .source file way, because that's the only way to refer to paths to
> .so library when defining C language functions.

It still looks like an overkill to add a new file to define a dummy
FDW handler. Why do we need to define a handler as a C function? Can't
we define handler as a SQL function. If we could do that we could add
the function definition in foreign_data.sql itself.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Ashutosh Bapat

On Wed, Sep 13, 2017 at 12:32 PM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> Hi,
>
> Rafia had done some testing on TPCH queries using Partition-wise join
> patch along with Parallel Append patch.
>
> There, we had observed that for query 4, even though the partition
> wise joins are under a Parallel Append, the join are all non-partial.
>
> Specifically, the partition-wise join has non-partial nested loop
> joins when actually it was expected to have partial nested loop joins.
> (The difference can be seen by the observation that the outer relation
> of that join is scanned by non-parallel Bitmap Heap scan when it
> should have used Parallel Bitmap Heap Scan).
>
> Here is the detailed analysis , including where I think is the issue :
>
> https://www.postgresql.org/message-id/CAJ3gD9cZms1ND3p%3DNN%3DhDYDFt_SeKq1htMBhbj85bOmvJwY5fg%40mail.gmail.com
>
> All the TPCH results are posted in the same above mail thread.

Can you please check if the attached patch fixes the issue.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
commit 203b3083318e9da41ad614a2ccec532025877c3b
Author: Ashutosh Bapat <ashutosh.ba...@enterprisedb.com>
Date:   Tue Sep 12 17:41:54 2017 +0530

Reparamterize partial nestloop paths.

We do not create partial nested looop paths if the inner path's
parameterization is not fully covered by the outer relation. For partition-wise
join, the test fails since the inner path is parameterized by the parent of the
outer relation. Fix the test to check the parent relids instead of the child
relids and also reparameterize the inner path to be parameterized by the outer
child similar to try_nestloop_path().

    TODO: squash this patch with the reparameterization patch.

Ashutosh Bapat, per report from Rafia and analysis by Amit Khandekar

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index 91f0b1c..c8da19c 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -496,8 +496,20 @@ try_partial_nestloop_path(PlannerInfo *root,
 	if (inner_path->param_info != NULL)
 	{
 		Relids		inner_paramrels = inner_path->param_info->ppi_req_outer;
+		RelOptInfo *outerrel = outer_path->parent;
+		Relids		outerrelids;
 
-		if (!bms_is_subset(inner_paramrels, outer_path->parent->relids))
+		/*
+		 * Paths are parameterized by top-level parent(s). Any paths parameterized
+		 * by the child relations, are not added to the pathlist. Hence run
+		 * parameterization tests on the parent relids.
+		 */
+		if (outerrel->top_parent_relids)
+			outerrelids = outerrel->top_parent_relids;
+		else
+			outerrelids = outerrel->relids;
+
+		if (!bms_is_subset(inner_paramrels, outerrelids))
 			return;
 	}
 
@@ -510,6 +522,32 @@ try_partial_nestloop_path(PlannerInfo *root,
 	if (!add_partial_path_precheck(joinrel, workspace.total_cost, pathkeys))
 		return;
 
+	/*
+	 * Since result produced by a child is part of the result produced by
+	 * its topmost parent and has same properties, the parameters
+	 * representing that parent may be substituted by values from a child.
+	 * Hence expressions and hence paths using those expressions,
+	 * parameterized by a parent can be said to be parameterized by any of
+	 * its child.  For a join between child relations, if the inner path
+	 * is parameterized by the parent of the outer relation,  translate
+	 * the inner path to be parameterized by the outer child relation and
+	 * create a nestloop join path.  The translated path should have the
+	 * same costs as the original path, so cost check above should still
+	 * hold.
+	 */
+	if (PATH_PARAM_BY_PARENT(inner_path, outer_path->parent))
+	{
+		inner_path = reparameterize_path_by_child(root, inner_path,
+  outer_path->parent);
+
+		/*
+		 * If we could not translate the path, we can't create nest loop
+		 * path.
+		 */
+		if (!inner_path)
+			return;
+	}
+
 	/* Might be good enough to be worth trying, so let's try it. */
 	add_partial_path(joinrel, (Path *)
 	 create_nestloop_path(root,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Ashutosh Bapat

On Wed, Sep 13, 2017 at 11:29 AM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/12 19:56, Ashutosh Bapat wrote:
>> I think the code here expects the original parent_rte and not the one
>> we set around line 1169.
>>
>> This isn't a bug right now, since both the parent_rte s have same
>> content. But I am not sure if that will remain to be so. Here's patch
>> to fix the thinko.
>
> Instead of the new bool is_parent_partitioned, why not move the code to
> set partitioned_rels to the block where you're now setting
> is_parent_partitioned.
>
> Also, since we know this isn't a bug at the moment but will turn into one
> once we have step-wise expansion, why not include this fix in that patch
> itself?

It won't turn into a bug with step-wise expansion since every
parent_rte will have RELKIND_PARTITIONED_TABLE for a partitioned top
parent, which is used to extract the partitioned_rels. But I guess,
it's better to fix the thinko in step-wise expansion since parent_rte
itself changes.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Ashutosh Bapat

On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> In this case, AcquireExecutorLocks will lock all the relations in
>> PlannedStmt.rtable, which must include all partitioned tables of all
>> partition trees involved in the query.  Of those, it will lock the tables
>> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
>> RowExclusiveLock mode.  PlannedStmt.nonleafResultRelations is a global
>> list of all partitioned table RT indexes obtained by concatenating
>> partitioned_rels lists of all ModifyTable nodes involved in the query
>> (set_plan_refs does that).  We need to distinguish nonleafResultRelations,
>> because we need to take the stronger lock on a given table before any
>> weaker one if it happens to appear in the query as a non-result relation
>> too, to avoid lock strength upgrade deadlock hazard.
>
> Hmm.  The problem with this theory in my view is that it doesn't
> explain why InitPlan() and ExecOpenScanRelation() lock the relations
> instead of just assuming that they are already locked either by
> AcquireExecutorLocks or by planning.  If ExecLockNonLeafAppendTables()
> doesn't really need to take locks, then ExecOpenScanRelation() must
> not need to do it either.  We invented ExecLockNonLeafAppendTables()
> on the occasion of removing the scans of those tables which would
> previously have caused ExecOpenScanRelation() to be invoked, so as to
> keep the locking behavior unchanged.
>
> AcquireExecutorLocks() looks like an odd bit of code to me.  The
> executor itself locks result tables in InitPlan() and then everything
> else during InitPlan() and all of the others later on while walking
> the plan tree -- comments in InitPlan() say that this is to avoid a
> lock upgrade hazard if a result rel is also a source rel.  But
> AcquireExecutorLocks() has no such provision; it just locks everything
> in RTE order.  In theory, that's a deadlock hazard of another kind, as
> we just talked about in the context of EIBO.  In fact, expanding in
> bound order has made the situation worse: before, expansion order and
> locking order were the same, so maybe having AcquireExecutorLocks()
> work in RTE order coincidentally happened to give the same result as
> the executor code itself as long as there are no result relations.
> But this is certainly not true any more.  I'm not sure it's worth
> expending a lot of time on this -- it's evidently not a problem in
> practice, or somebody probably would've complained before now.
>
> But that having been said, I don't think we should assume that all the
> locks taken from the executor are worthless because plancache.c will
> always do the job for us.  I don't know of a case where we execute a
> saved plan without going through the plan cache, but that doesn't mean
> that there isn't one or that there couldn't be one in the future.
> It's not the job of these partitioning patches to whack around the way
> we do locking in general -- they should preserve the existing behavior
> as much as possible.  If we want to get rid of the locking in the
> executor altogether, that's a separate discussion where, I have a
> feeling, there will prove to be better reasons for the way things are
> than we are right now supposing.
>

I agree that it's not the job of these patches to change the locking
or even get rid of partitioned_rels. In order to continue returning
partitioned_rels in Append paths esp. in the case of queries involving
set operations and partitioned table e.g "select 1 from t1 union all
select 2 from t1;" in which t1 is multi-level partitioned table, we
need a fix in add_paths_to_append_rels(). The fix provided in [1] is
correct but we will need a longer explanation of why we have to
involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation
is complicated. If we get rid of partitioned_rels, we don't need to
fix that code in add_paths_to_append_rel().

I suggested that [2]
-- (excerpt from [2])

Actually, the original problem that caused this discussion started
with an assertion failure in get_partitioned_child_rels() as
Assert(list_length(result) >= 1);

This assertion fails if result is NIL when an intermediate partitioned
table is passed. May be we should assert (result == NIL ||
list_length(result) == 1) and allow that function to be called even
for intermediate partitioned partitions for which the function will
return NIL. That will leave the code in add_paths_to_append_rel()
simple. Thoughts?
--

Amit Langote agrees with this. It kind of makes the assertion lame but
keeps the code sane. What do you think?

[1] 
https://www.postgresql.org/message-id/d2f1cdcb-ebb4-76c5-e471-79348ca5d...@lab.ntt.co.jp
[2]

Re: [HACKERS] Constraint exclusion for partitioned tables

2017-09-13 Thread Ashutosh Bapat

On Wed, Sep 13, 2017 at 12:17 PM, Jeevan Chalke
<jeevan.cha...@enterprisedb.com> wrote:
>
>
> On Tue, Sep 12, 2017 at 8:12 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>
>> On Tue, Sep 12, 2017 at 7:08 AM, Jeevan Chalke
>> <jeevan.cha...@enterprisedb.com> wrote:
>> > This patch clearly improves the planning time with given conditions.
>> >
>> > To verify that, I have created a table like:
>> > create table foo(a int, b int check (b > 100), c text) partition by
>> > range(a);
>> > And then used following query to get planning time:
>> > select * from foo where b < 100;
>> >
>> > And on my local setup, I have observed that,
>> > For 16 partitions, planning time was 0.234692 ms, which reduced to
>> > 0.112948
>> > ms with this patch.
>> > For 128 partitions, planning time was 1.62305 ms, which reduced to
>> > 0.654252
>> > ms with this patch.
>> > For 1024 partitions, planning time was 18.720993 ms, which reduced to
>> > 9.667395 ms with this patch.
>> >
>> > This clearly shows an improvement in planning time.
>>
>> What about the extra cost of checking the parent when it doesn't help?
>>  In that case we will have some loss.
>>
>> I'm inclined to think that's OK, but it's something to think about.
>
>
> I have updated query like:

Thanks a lot Jeevan for all your experiments. They are very useful.

> select * from foo where b > 100;
> Which matches with the CHECK constraint, and here are the result on my local
> setup:

So, in this case, constraint exclusion fails since the WHERE condition
can not be refuted by the constraints.

>
> Time in milliseconds
> Partitions | without patch | with patch
> ---|---|
> 2  | 0.072551  | 0.074154
> 4  | 0.102537  | 0.108024
> 8  | 0.162703  | 0.175017
> 16 | 0.288589  | 0.305285
> 128|  2.7119   | 2.636247
> 1024   | 29.101347 | 29.48275
>
> So yes, as you said, it will have slight (may be negligible) overhead.
>
> This observation are from local setup and I have also seen a large standard
> deviation in the runs.

For a regular table if the constraint exclusion fails, we will waste
those many CPU cycles. But if the relation is excluded we will save
disk I/O or buffer access and time to apply the conditions on all the
rows in the relation. Given the magnitude of difference in the time to
run constraint exclusion and time for all those things, we take the
hit and run constraint exclusion always.

For a partitioned table, this patch saves the time to run constraint
exclusion on all the partitions if constraint exclusion succeeds on
the partitioned table. If constraint exclusion fails, we have wasted
CPU cycles on one run of constraint exclusion. The difference between
the time spent in the two scenarios increases with the number of
partitions. Practically, users will have a handful partitions rather
than a couple and thus running overhead of running constraint
exclusion on partitioned table would be justified given the time it
will save when CE succeeds.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] dropping partitioned tables without CASCADE

2017-09-12 Thread Ashutosh Bapat

Thanks Amit for taking care of this.

On Wed, Sep 13, 2017 at 6:31 AM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/06 19:14, Amit Langote wrote:
>> On 2017/09/06 18:46, Rushabh Lathia wrote:
>>> Okay, I have marked this as ready for committer.
>>
>> Thanks Ashutosh and Rushabh for rebasing and improving the patch.  Looks
>> good to me too.
>
> Patch needed to be rebased after the default partitions patch went in, so
> done.  Per build status on http://commitfest.cputube.org :)
>
> Thanks,
> Amit



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Log LDAP "diagnostic messages"?

2017-09-12 Thread Ashutosh Bapat

The patch needs to address some comments in the previous mails, so
marking it as "waiting for author".

On Wed, Aug 16, 2017 at 11:13 AM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
> On Wed, Aug 16, 2017 at 8:44 AM, Alvaro Herrera
> <alvhe...@2ndquadrant.com> wrote:
>> Christoph Berg wrote:
>>> Re: Thomas Munro 2017-08-10 
>>>

Re: [HACKERS] no test coverage for ALTER FOREIGN DATA WRAPPER name HANDLER ...

2017-09-12 Thread Ashutosh Bapat

On Tue, Sep 12, 2017 at 2:27 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> Thanks Ashutosh for taking a look at this.
>
> On 2017/09/05 21:16, Ashutosh Bapat wrote:
>> The patch needs a rebase.
>
> Attached rebased patch.

Thanks for rebased patch.

We could annotate each ERROR with an explanation as to why that's an
error, but then this file doesn't do that for other commands, so may
be the patch is just fine.

Also, I am wondering whether we should create the new handler function
in foreign.c similar to postgresql_fdw_validator(). The prologue has a
caution

606  * Caution: this function is deprecated, and is now meant only for testing
607  * purposes, because the list of options it knows about doesn't necessarily
608  * square with those known to whichever libpq instance you might be using.
609  * Inquire of libpq itself, instead.

So, may be we don't want to add it there. But adding the handler
function in create_function_1 doesn't seem good. If that's the correct
place, then at least it should be moved before " -- Things that
shouldn't work:"; it doesn't belong to functions that don't work.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Ashutosh Bapat

On Tue, Sep 12, 2017 at 2:35 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/12 17:53, Ashutosh Bapat wrote:
>> On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote wrote:
>>> So, we can remove partitioned_rels from (Merge)AppendPath and
>>> (Merge)Append nodes and remove ExecLockNonLeafAppendTables().
>>
>> Don't we need partitioned_rels from Append paths to be transferred to
>> ModifyTable node or we have a different way of calculating
>> nonleafResultRelations?
>
> No, we don't transfer partitioned_rels from Append path to ModifyTable
> node.  inheritance_planner(), that builds the ModifyTable path for
> UPDATE/DELETE on a partitioned table, fetches partitioned_rels from
> root->pcinfo_list itself and passes it to create_modifytable_path.  No
> Append path is involved in that case.  PlannedStmt.nonleafResultRelations
> is built by concatenating the partitioned_rels lists of all ModifyTable
> nodes appearing in the query.  It does not depend on Append's or
> AppendPath's partitioned_rels.

Ok. Thanks for the explanation.

This make me examine inheritance_planner() closely and I think I have
spotted a thinko there. In inheritance_planner() parent_rte is set to
the RTE of parent to start with and then in the loop
1132 /*
1133  * And now we can get on with generating a plan for each child table.
1134  */
1135 foreach(lc, root->append_rel_list)
1136 {
... code clipped
1165 /*
1166  * If there are securityQuals attached to the parent,
move them to the
1167  * child rel (they've already been transformed properly for that).
1168  */
1169 parent_rte = rt_fetch(parentRTindex, subroot->parse->rtable);
1170 child_rte = rt_fetch(appinfo->child_relid, subroot->parse->rtable);
1171 child_rte->securityQuals = parent_rte->securityQuals;
1172 parent_rte->securityQuals = NIL;

we set parent_rte to the one obtained from subroot->parse, which
happens to be the same (at least in contents) as original parent_rte.
Later we use this parent_rte to pull partitioned_rels outside that
loop

1371 if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
1372 {
1373 partitioned_rels = get_partitioned_child_rels(root, parentRTindex);
1374 /* The root partitioned table is included as a child rel */
1375 Assert(list_length(partitioned_rels) >= 1);
1376 }

I think the code here expects the original parent_rte and not the one
we set around line 1169.

This isn't a bug right now, since both the parent_rte s have same
content. But I am not sure if that will remain to be so. Here's patch
to fix the thinko.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

inh_planner_prte.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Ashutosh Bapat

On Tue, Sep 12, 2017 at 2:17 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
>
> That said, I noticed that we might need to be careful about what the value
> of the root parent's PlanRowMark's allMarkType field gets set to.  We need
> to make sure that it reflects markType of all partitions in the tree,
> including those that are not root parent's direct children.  Is that true
> with the proposed implementation?

Yes. We include child's allMarkTypes into parent's allMarkTypes. So,
top parent's PlanRowMarks should have all descendant's allMarkTypes,
which is not happening in the patch right now. There are two ways to
fix that.

1. Pass top parent's PlanRowMark all the way down to the leaf
partitions, so that current expand_single_inheritance_child() collects
allMarkTypes of all children correctly. But this way, PlanRowMarks of
intermediate parent does not reflect allMarkTypes of its children,
only top root records that.
2. Pass immediate parent's PlanRowMark to
expand_single_inheritance_child(), so that it records allMarkTypes of
its children. In expand_partitioned_rtentry() have following sequence

expand_single_inheritance_child(root, parentrte, parentRTindex,
parentrel, parentrc, childrel,
appinfos, , ,
);

/* If this child is itself partitioned, recurse */
if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
   {
expand_partitioned_rtentry(root, childrte, childRTindex,
   childrel, childrc, lockmode, appinfos,
   partitioned_child_rels);

/* Include child's rowmark type in parent's allMarkTypes */
parentrc->allMarkTypes |= childrc->allMarkTypes;
   }
so that we push allMarkTypes up the hierarchy.

I like the second way, since every intermediate parent records
allMarkTypes of its descendants.

Thoughts?
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Ashutosh Bapat

On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/12 16:55, Ashutosh Bapat wrote:
>> On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote wrote:
>>> So I looked at this a bit closely and came to the conclusion that we may
>>> not need to keep partitioned table RT indexes in the
>>> (Merge)Append.partitioned_rels after all, as far as execution-time locking
>>> is concerned.
>>>
>>> Consider two cases:
>>>
>>> 1. Plan is created and executed in the same transaction
>>>
>>> In this case, locks taken on the partitioned tables by the planner will
>>> suffice.
>>>
>>> 2. Plan is executed in a different transaction from the one in which it
>>>was created (a cached plan)
>>>
>>> In this case, AcquireExecutorLocks will lock all the relations in
>>> PlannedStmt.rtable, which must include all partitioned tables of all
>>> partition trees involved in the query.  Of those, it will lock the tables
>>> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
>>> RowExclusiveLock mode.  PlannedStmt.nonleafResultRelations is a global
>>> list of all partitioned table RT indexes obtained by concatenating
>>> partitioned_rels lists of all ModifyTable nodes involved in the query
>>> (set_plan_refs does that).  We need to distinguish nonleafResultRelations,
>>> because we need to take the stronger lock on a given table before any
>>> weaker one if it happens to appear in the query as a non-result relation
>>> too, to avoid lock strength upgrade deadlock hazard.
>>>
>>> Moreover, because all the tables from plannedstmt->rtable, including the
>>> partitioned tables, will be added to PlannedStmt.relationsOids, any
>>> invalidation events affecting the partitioned tables (for example,
>>> add/remove a partition) will cause the plan involving partitioned tables
>>> to be recreated.
>>>
>>> In none of this do we rely on the partitioned table RT indexes appearing
>>> in the (Merge)Append node itself.  Maybe, we should just remove
>>> partitioned_rels from (Merge)AppendPath and (Merge)Append node in a
>>> separate patch and move on.
>>>
>>> Thoughts?
>>
>> Yes, I did the same analysis (to which I refer in my earlier reply to
>> you). I too think we should just remove partitioned_rels from Append
>> paths. But then the question is those are then transferred to
>> ModifyTable node in create_modifytable_plan() and use it for something
>> else. What should we do about that code? I don't think we are really
>> using that list from ModifyTable node as well, so may be we could
>> remove it from there as well. What do you think? Does that mean
>> partitioned_rels isn't used at all in the code?
>
> No, we cannot simply get rid of partitioned_rels altogether.  We'll need
> to keep it in the ModifyTable node, because we *do* need the
> nonleafResultRelations list in PlannedStmt to distinguish partitioned
> table result relations, which set_plan_refs builds by concatenating
> partitioned_rels lists of various ModifyTable nodes of the query.  The
> PlannedStmt.nonleafResultRelations list actually has some use (which
> parallels PlannedStmt.resultRelations), but partitioned_rels list in the
> individual (Merge)Append, as it turns out, doesn't.
>
> So, we can remove partitioned_rels from (Merge)AppendPath and
> (Merge)Append nodes and remove ExecLockNonLeafAppendTables().

Don't we need partitioned_rels from Append paths to be transferred to
ModifyTable node or we have a different way of calculating
nonleafResultRelations?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Ashutosh Bapat

On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/11 21:07, Ashutosh Bapat wrote:
>> On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat
>>> <ashutosh.ba...@enterprisedb.com> wrote:
>>>> So, all partitioned partitions are getting locked correctly. Am I
>>>> missing something?
>>>
>>> That's not a valid test.  In that scenario, you're going to hold all
>>> the locks acquired by the planner, all the locks acquired by the
>>> rewriter, and all the locks acquired by the executor, but when using
>>> prepared queries, it's possible to execute the plan after the planner
>>> and rewriter locks are no longer held.
>>
>> I see the same thing when I use prepare and execute
>
> So I looked at this a bit closely and came to the conclusion that we may
> not need to keep partitioned table RT indexes in the
> (Merge)Append.partitioned_rels after all, as far as execution-time locking
> is concerned.
>
> Consider two cases:
>
> 1. Plan is created and executed in the same transaction
>
> In this case, locks taken on the partitioned tables by the planner will
> suffice.
>
> 2. Plan is executed in a different transaction from the one in which it
>was created (a cached plan)
>
> In this case, AcquireExecutorLocks will lock all the relations in
> PlannedStmt.rtable, which must include all partitioned tables of all
> partition trees involved in the query.  Of those, it will lock the tables
> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
> RowExclusiveLock mode.  PlannedStmt.nonleafResultRelations is a global
> list of all partitioned table RT indexes obtained by concatenating
> partitioned_rels lists of all ModifyTable nodes involved in the query
> (set_plan_refs does that).  We need to distinguish nonleafResultRelations,
> because we need to take the stronger lock on a given table before any
> weaker one if it happens to appear in the query as a non-result relation
> too, to avoid lock strength upgrade deadlock hazard.
>
> Moreover, because all the tables from plannedstmt->rtable, including the
> partitioned tables, will be added to PlannedStmt.relationsOids, any
> invalidation events affecting the partitioned tables (for example,
> add/remove a partition) will cause the plan involving partitioned tables
> to be recreated.
>
> In none of this do we rely on the partitioned table RT indexes appearing
> in the (Merge)Append node itself.  Maybe, we should just remove
> partitioned_rels from (Merge)AppendPath and (Merge)Append node in a
> separate patch and move on.
>
> Thoughts?

Yes, I did the same analysis (to which I refer in my earlier reply to
you). I too think we should just remove partitioned_rels from Append
paths. But then the question is those are then transferred to
ModifyTable node in create_modifytable_plan() and use it for something
else. What should we do about that code? I don't think we are really
using that list from ModifyTable node as well, so may be we could
remove it from there as well. What do you think? Does that mean
partitioned_rels isn't used at all in the code?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Ashutosh Bapat

On Tue, Sep 12, 2017 at 7:31 AM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/11 19:45, Ashutosh Bapat wrote:
>> On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote:
>>> IMHO, we should make it the responsibility of the future patch to set a
>>> child PlanRowMark's prti to the direct parent's RT index, when we actually
>>> know that it's needed for something.  We clearly know today why we need to
>>> pass the other objects like child RT entry, RT index, and Relation, so we
>>> should limit this patch to pass only those objects to the recursive call.
>>> That makes this patch a relatively easy to understand change.
>>
>> I think you are mixing two issues here 1. setting parent RTI in child
>> PlanRowMark and 2. passing immediate parent's PlanRowMark to
>> expand_single_inheritance_child().
>>
>> I have discussed 1 in my reply to Robert.
>>
>> About 2 you haven't given any particular comments to my reply. To me
>> it looks like it's this patch that introduces the notion of
>> multi-level expansion, so it's natural for this patch to pass
>> PlanRowMark in cascaded fashion similar to other structures.
>
> You patch does 2 to be able to do 1, doesn't it?  That is, to be able to
> set the child PlanRowMark's prti to the direct parent's RT index, you pass
> the immediate parent's PlanRowMark to the recursive call of
> expand_single_inheritance_child().

No. child PlanRowMark's prti is set to parentRTIndex, which is a
separate argument and is used to also set parent_relid in
AppendRelInfo.

>
>> Actually, the original problem that caused this discussion started
>> with an assertion failure in get_partitioned_child_rels() as
>> Assert(list_length(result) >= 1);
>>
>> This assertion fails if result is NIL when an intermediate partitioned
>> table is passed. May be we should assert (result == NIL ||
>> list_length(result) == 1) and allow that function to be called even
>> for intermediate partitioned partitions for which the function will
>> return NIL. That will leave the code in add_paths_to_append_rel()
>> simple. Thoughts?
>
> Yeah, I guess that could work.  We'll just have to write comments to
> describe why the Assert is written that way.
>
>>> By the way, when we call expand_single_inheritance_child() in the
>>> non-partitioned inheritance case, we should pass NULL for childrte_p,
>>> childRTindex_p, childrc_p, instead of declaring variables that won't be
>>> used.  Hence, expand_single_inheritance_child() should make those
>>> arguments optional.
>>
>> That introduces an extra "if" condition, which is costlier than an
>> assignment. We have used both the styles in the code. Previously, I
>> have got comments otherwise. So, I am not sure.
>
> OK.  expand_single_inheritance_child's header comment does not mention the
> new result fields.  Maybe add a comment describing what their role is and
> that they're not optional arguments.
>
>> I will update the patches once we have some resolution about 1. prti
>> in PlanRowMarks and 2. detection of root partitioned table in
>> add_paths_to_append_rel().
>
> OK.
>
> About 2, I somewhat agree with your proposed solution above, which might
> be simpler to explain in comments than the code I proposed.

After testing a few queries I am getting a feeling that
ExecLockNonLeafAppendTables isn't really locking anything. I will
write more about that in my reply to Robert's mail.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-11 Thread Ashutosh Bapat

On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>> So, all partitioned partitions are getting locked correctly. Am I
>> missing something?
>
> That's not a valid test.  In that scenario, you're going to hold all
> the locks acquired by the planner, all the locks acquired by the
> rewriter, and all the locks acquired by the executor, but when using
> prepared queries, it's possible to execute the plan after the planner
> and rewriter locks are no longer held.
>

I see the same thing when I use prepare and execute

Session 1
postgres=# prepare stmt as select 1 from t1 union all select 2 from t1;
PREPARE
postgres=# select pg_backend_pid();
 pg_backend_pid

  50912
(1 row)

postgres=# begin;
BEGIN
postgres=# execute stmt;
 ?column?
--
(0 rows)

Session 2
postgres=# select locktype, relation::regclass, virtualxid,
virtualtransaction, pid, mode, granted, fastpath from pg_locks;
  locktype  | relation | virtualxid | virtualtransaction |  pid  |
 mode   | granted | fastpath
+--+++---+-+-+--
 relation   | pg_locks || 4/4| 50914 |
AccessShareLock | t   | t
 virtualxid |  | 4/4| 4/4| 50914 |
ExclusiveLock   | t   | t
 relation   | t1p1p1   || 3/12   | 50912 |
AccessShareLock | t   | t
 relation   | t1p1 || 3/12   | 50912 |
AccessShareLock | t   | t
 relation   | t1   || 3/12   | 50912 |
AccessShareLock | t   | t
 virtualxid |  | 3/12   | 3/12   | 50912 |
ExclusiveLock   | t   | t
(6 rows)

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Automatic testing of patches in commit fest

2017-09-11 Thread Ashutosh Bapat

On Mon, Sep 11, 2017 at 3:11 PM, Aleksander Alekseev
<a.aleks...@postgrespro.ru> wrote:
> Hi Thomas,
>
> Great job!
>
> Here is a crazy idea. What if we write a script that would automatically
> return the patches that:
>
> 1) Are not in "Waiting on Author" status
> 2) Don't apply OR don't pass `make installcheck-world`
>
> ... to the "Waiting on Author" status and describe the problem through
> the "Add review" form on commitfest.postgresql.org? This will save a lot
> of time to the reviewers. Naturally nobody wants to spam pgsql-hackers@
> with automatic messages to often. I believe that sending such messages
> once a day would be enough.
>
> Unless there are any objections to give this idea a try I'm willing to
> write and host a corresponding script.
>
+1 that would help.



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-11 Thread Ashutosh Bapat

On Mon, Sep 11, 2017 at 2:16 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/11 16:23, Ashutosh Bapat wrote:
>> On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas <robertmh...@gmail.com> wrote:
>>> I'm a bit suspicious about the fact that there are now executor
>>> changes related to the PlanRowMarks.  If the rowmark's prti is now the
>>> intermediate parent's RT index rather than the top-parent's RT index,
>>> it'd seem like that'd matter somehow.  Maybe it doesn't, because the
>>> code that cares about prti seems to only care about whether it's
>>> different from rti.  But if that's true everywhere, then why even
>>> change this?  I think we might be well off not to tinker with things
>>> that don't need to be changed.
>>
>> In the definition of ExecRowMark, I see
>> Index   prti;   /* parent range table index, if child */
>> It just says parent, by which I take as direct parent. For
>> inheritance, which earlier flattened inheritance hierarchy, direct
>> parent was top parent. So, probably nobody thought whether a parent is
>> direct parent or top parent. But now that we have introduced that
>> concept we need to interpret this comment anew. And I think
>> interpreting it as direct parent is non-lossy.
>
> But then we also don't have anything to say about why we're making that
> change.  If you could describe what non-lossy is in this context, then
> fine.

By setting prti to the topmost parent RTI we are loosing information
that this child may be an intermediate child similar to what we did
earlier to AppendRelInfo. That's the lossy-ness in this context.

> But that we'd like to match with what we're going to do for
> AppendRelInfos does not seem to be a sufficient explanation for this change.

The purpose of this patch is to change the parent-child linkages for
partitioned table and prti is one of them. So, in fact, I am wondering
why not to change that along with AppendRelInfo.

>
>> If we set top parent's
>> index, parent RTI in AppendRelInfo and PlanRowMark would not agree.
>> So, it looks quite natural that we set the direct parent's index in
>> PlanRowMark.
>
> They would not agree, yes, but aren't they unrelated?  If we have a reason
> for them to agree, (for example, row-locking breaks in the inherited table
> case if we didn't), then we should definitely make them agree.
>
> Updating the comment for prti definition might be something that this
> patch could (should?) do, but I'm not quite sure about that too.
>

To me that looks backwards again for the reasons described above.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-11 Thread Ashutosh Bapat

On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/09 2:38, Ashutosh Bapat wrote:
>> On Fri, Sep 8, 2017 at 11:17 AM, Amit Langote wrote:
>>> I updated the patch to include just those changes.  I'm not sure about
>>> one of the Ashutosh's changes whereby the child PlanRowMark is also passed
>>> to expand_partitioned_rtentry() to use as the parent PlanRowMark.  I think
>>> the child RTE, child RT index and child Relation are fine, because they
>>> are necessary for creating AppendRelInfos in a desired way for later
>>> planning steps.  But PlanRowMarks are not processed within the planner
>>> afterwards and do not need to be marked with the immediate parent-child
>>> association in the same way that AppendRelInfos need to be.
>>
>> Passing top parent's row mark works today, since there is no
>> parent-child specific translation happening there. But if in future,
>> we introduce such a translation, row marks for indirect children in a
>> multi-level partitioned hierarchy won't be accurate. So, I think it's
>> better to pass row marks of the direct parent.
>
> IMHO, we should make it the responsibility of the future patch to set a
> child PlanRowMark's prti to the direct parent's RT index, when we actually
> know that it's needed for something.  We clearly know today why we need to
> pass the other objects like child RT entry, RT index, and Relation, so we
> should limit this patch to pass only those objects to the recursive call.
> That makes this patch a relatively easy to understand change.

I think you are mixing two issues here 1. setting parent RTI in child
PlanRowMark and 2. passing immediate parent's PlanRowMark to
expand_single_inheritance_child().

I have discussed 1 in my reply to Robert.

About 2 you haven't given any particular comments to my reply. To me
it looks like it's this patch that introduces the notion of
multi-level expansion, so it's natural for this patch to pass
PlanRowMark in cascaded fashion similar to other structures.

>
>>> I also included the changes to add_paths_to_append_rel() from my patch on
>>> the "path toward faster partition pruning" thread.  We'd need that change,
>>> because while add_paths_to_append_rel() is called for all partitioned
>>> table RTEs in a given partition tree, expand_inherited_rtentry() would
>>> have set up a PartitionedChildRelInfo only for the root parent, so
>>> get_partitioned_child_rels() would not find the same for non-root
>>> partitioned table rels and crash failing the Assert.  The changes I made
>>> are such that we call get_partitioned_child_rels() only for the parent
>>> rels that are known to correspond root partitioned tables (or as you
>>> pointed out on the thread, "the table named in the query" as opposed those
>>> added to the query as result of inheritance expansion).  In addition to
>>> the relkind check on the input RTE, it would seem that checking that the
>>> reloptkind is RELOPT_BASEREL would be enough.  But actually not, because
>>> if a partitioned table is accessed in a UNION ALL query, reloptkind even
>>> for the root partitioned table (the table named in the query) would be
>>> RELOPT_OTHER_MEMBER_REL.  The only way to confirm that the input rel is
>>> actually the root partitioned table is to check whether its parent rel is
>>> not RTE_RELATION, because the parent in case of UNION ALL Append is a
>>> RTE_SUBQUERY RT entry.
>>>
>>
>> There was a change in my 0003 patch, which fixed the crash. It checked
>> for RELOPT_BASEREL and RELKIND_PARTITIONED_TABLE. I have pulled it in
>> my 0001 patch. It no more crashes. I tried various queries involving
>> set operations and bare multi-level partitioned table scan with my
>> patch, but none of them showed any anomaly. Do you have a testcase
>> which shows problem with my patch? May be your suggestion is correct,
>> but corresponding code implementation is slightly longer than I would
>> expect. So, we should go with it, if there is corresponding testcase
>> which shows why it's needed.
>
> If we go with your patch, partitioned tables won't get locked, for
> example, in case of the following query (p is a partitioned table):
>
> select 1 from p union all select 2 from p;
>
> That's because the RelOptInfos for the two instances of p in the above
> query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL.  They are children
> of the Append corresponding to the UNION ALL subquery RTE.  So,
> partitioned_rels does not get set per your proposed code.

Session 1:
po

Re: [HACKERS] WIP: Aggregation push-down

2017-09-11 Thread Ashutosh Bapat

On Fri, Sep 8, 2017 at 7:04 PM, Antonin Houska <a...@cybertec.at> wrote:
> Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> wrote:
>
>> On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <a...@cybertec.at> wrote:
>> > Antonin Houska <a...@cybertec.at> wrote:
>> >
>> >> Antonin Houska <a...@cybertec.at> wrote:
>> >>
>> >> > This is a new version of the patch I presented in [1].
>> >>
>> >> Rebased.
>> >>
>> >> cat .git/refs/heads/master
>> >> b9a3ef55b253d885081c2d0e9dc45802cab71c7b
>> >
>> > This is another version of the patch.
>> >
>> > Besides other changes, it enables the aggregation push-down for 
>> > postgres_fdw,
>> > although only for aggregates whose transient aggregation state is equal to 
>> > the
>> > output type. For other aggregates (like avg()) the remote nodes will have 
>> > to
>> > return the transient state value in an appropriate form (maybe bytea type),
>> > which does not depend on PG version.
>>
>> Having transient aggregation state type same as output type doesn't
>> mean that we can feed the output of remote aggregation as is to the
>> finalization stage locally or finalization is not needed at all. I am
>> not sure why is that test being used to decide whether or not to push
>> an aggregate (I assume they are partial aggregates) to the foreign
>> server.
>
> I agree. This seems to be the same problem as reported in [2].
>
>> postgres_fdw doesn't have a feature to fetch partial aggregate
>> results from the foreign server. Till we do that, I don't think this
>> patch can push any partial aggregates down to the foreign server.
>
> Even if the query contains only aggregates like sum(int4), i.e. aggfinalfn
> does not exist and the transient type can be transfered from the remote server
> in textual form? (Of course there's a question is if such behaviour is
> consistent from user's perspective.)

Yes, those functions will work.

>
>> >
>> > One thing I'm not sure about is whether the patch should remove
>> > GetForeignUpperPaths function from FdwRoutine, which it essentially makes
>> > obsolete. Or should it only be deprecated so far? I understand that
>> > deprecation is important for "ordinary SQL users", but FdwRoutine is an
>> > interface for extension developers, so the policy might be different.
>>
>> I doubt if that's correct. We can do that only when we know that all
>> kinds aggregates/grouping can be pushed down the join tree, which
>> looks impossible to me. Apart from that that FdwRoutine handles all
>> kinds of upper relations like windowing, distinct, ordered which are
>> not pushed down by this set of patches.
>
> Good point.
>

I think this is where Jeevan Chalke's partition-wise aggregation
helps. It deals with the cases where your patch can not push the
aggregates down to children of join.

>> The patch maintains an extra rel target, two pathlists, partial pathlist and
>> non-partial pathlist for grouping in RelOptInfo. These two extra
>> pathlists require "grouped" argument to be passed to a number of functions.
>> Instead probably it makes sense to create a RelOptInfo for 
>> aggregation/grouping
>> and set pathlist and partial_pathlist in that RelOptInfo. That will also 
>> make a
>> place for keeping grouped estimates.
>
> If grouped paths require a separate RelOptInfo, why the existing partial paths
> do not?

partial paths produce the same targetlist and the same relation that
non-partial paths do. A RelOptInfo in planner represents a set of rows
and all paths added to that RelOptInfo produce same whole result
(parameterized paths produces the same result collectively). Grouping
paths however are producing a different result and different
targetlist, so may be it's better to have a separate RelOptInfo.

>
>> For placeholders we have two function add_placeholders_to_base_rels() and
>> add_placeholders_to_joinrel(). We have add_grouping_info_to_base_rels(), but 
>> do
>> not have corresponding function for joinrels. How do we push aggregates
>> involving two or more relations to the corresponding joinrels?
>
> add_grouping_info_to_base_rels() calls prepare_rel_for_grouping() which
> actually adds the "grouping info". For join, prepare_rel_for_grouping() is
> called from build_join_rel().

Ok.

>
> While PlaceHolderVars need to be finalized before planner starts to create
> joins, the GroupedPathInfo has to fit particular pair of joined relations.
>
> Perhaps create_grouping_i

Re: [HACKERS] [POC] hash partitioning

2017-09-11 Thread Ashutosh Bapat

On Fri, Sep 8, 2017 at 6:10 PM, amul sul <sula...@gmail.com> wrote:
> On Fri, Sep 8, 2017 at 6:45 AM, Robert Haas <robertmh...@gmail.com> wrote:
>>
>> On Mon, Sep 4, 2017 at 6:38 AM, amul sul <sula...@gmail.com> wrote:
>> > I've updated patch to use an extended hash function (Commit #
>> > 81c5e46c490e2426db243eada186995da5bb0ba7) for the partitioning.
>>
>> Committed 0001 after noticing that Jeevan Ladhe also found that change
>> convenient for default partitioning.  I made a few minor cleanups;
>> hopefully I didn't break anything.
>
>
> Thanks you.
>
> Rebased 0002 against this commit & renamed to 0001, PFA.

Given that we have default partition support now, I am wondering
whether hash partitioned tables also should have default partitions.
The way we have structured hash partitioning syntax, there can be
"holes" in partitions. Default partition would help plug those holes.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2017-09-11 Thread Ashutosh Bapat

On Thu, Sep 7, 2017 at 7:34 PM, Antonin Houska <a...@cybertec.at> wrote:
> Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> wrote:
>
>> I have fixed the issues which were marked as TODOs in the attached
>> patches. Also, I have included your test change patch in my series of
>> patches.
>
> I've noticed that partition_bounds_merge() is called twice from
> make_join_rel():
>
>  * build_join_rel -> build_joinrel_partition_info -> partition_bounds_merge
>
>  * try_partition_wise_join -> partition_bounds_merge
>
> Is this intentional, or just a thinko?
>

This is expected. partition_bounds_merge() also returns the pairs of
matching partitions. So, we have to call that function for every pair
of joining relations.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-11 Thread Ashutosh Bapat

On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Sep 8, 2017 at 1:38 PM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>>> I also confirmed that the partition-pruning patch set works fine with this
>>> patch instead of the patch on that thread with the same functionality,
>>> which I will now drop from that patch set.  Sorry about the wasted time.
>>
>> Thanks a lot. Please review the patch in the updated patchset.
>
> In set_append_rel_size(), I don't find the comment too clear (and this
> part was taken from Amit's patch, right?).  I suggest:

No, I didn't take it from Amit's patch. Both of us have different
wordings. But yours is better than both of us. Included it in the
attached patches.

>
> +/*
> + * Associate the partitioned tables which are descendents of the table
> + * named in the query with the topmost append path (i.e. the one where
> + * rel->reloptkind is RELOPT_BASEREL).  This ensures that they get 
> properly
> + * locked at execution time.
> + */
>
> I'm a bit suspicious about the fact that there are now executor
> changes related to the PlanRowMarks.  If the rowmark's prti is now the
> intermediate parent's RT index rather than the top-parent's RT index,
> it'd seem like that'd matter somehow.  Maybe it doesn't, because the
> code that cares about prti seems to only care about whether it's
> different from rti.  But if that's true everywhere, then why even
> change this?  I think we might be well off not to tinker with things
> that don't need to be changed.

In the definition of ExecRowMark, I see
Index   prti;   /* parent range table index, if child */
It just says parent, by which I take as direct parent. For
inheritance, which earlier flattened inheritance hierarchy, direct
parent was top parent. So, probably nobody thought whether a parent is
direct parent or top parent. But now that we have introduced that
concept we need to interpret this comment anew. And I think
interpreting it as direct parent is non-lossy. If we set top parent's
index, parent RTI in AppendRelInfo and PlanRowMark would not agree.
So, it looks quite natural that we set the direct parent's index in
PlanRowMark. From that POV, we aren't changing anything, we are
setting the same parent RTI in AppendRelInfo and PlanRowMark. Chaning
different parent RTIs in those two structure would be a real change.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-08 Thread Ashutosh Bapat

On Fri, Sep 8, 2017 at 12:34 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>> accumulate_append_subpath() is executed for every path instead of
>> every relation, so changing it would collect the same list multiple
>> times. Instead, I found the old way of associating all intermediate
>> partitions with the root partitioned relation work better. Here's the
>> updated patch set.
>
> When I tried out patch 0001, it crashed repeatedly during 'make check'
> because of an assertion failure in get_partitioned_child_rels.

Running "make check" on the whole patchset doesn't give that failure.
So I didn't notice the crash since I was running regression on the
whole patchset. Sorry for that. Fortunately git rebase -i allows us to
execute shell commands while applying patches, so I have set it up to
compile each patch and run regression. Hopefully that will catch such
errors in future. That process showed me that patch
0003-In-add_paths_to_append_rel-get-partitioned_rels-for-.patch fixes
that crash by calling get_partitioned_child_rels() only on the root
partitioned table for which we have set up child_rels list. Amit
Langote has a similar fix reported in his reply. So, we will discuss
it there.

> It
> seemed to me that the way the patch was refactoring
> expand_inherited_rtentry involved more code rearrangement than
> necessary, so I reverted all the code rearrangement and just kept the
> functional changes, and all the crashes went away.  (That refactoring
> also failed to initialize has_child properly.)  In so doing, I
> reintroduced the problem that the PartitionedChildRelInfo lists
> weren't getting set up correctly, but after some thought I realized
> that was just because expand_single_inheritance_child() was choosing
> between adding an RTE and adding the OID to partitioned_child_rels,
> whereas for an intermediate partitioned table it needs to do both.  So
> I inserted a trivial fix for that problem (replacing "else" with a new
> "if"-test), basically:
>
> -else
> +
> +if (childrte->relkind == RELKIND_PARTITIONED_TABLE)
>
> Please check out the attached version of the patch.  In addition to
> the above simplifications, I did some adjustments to the comments in
> various places - some just grammar and others a bit more substantive.
> And I think I broke a long line in one place, too.
>
> One thing I notice is that if I rip out the changes to initsplan.c,
> the new regression test still passes.  If it's possible to write a
> test that fails without those changes, I think it would be a good idea
> to include one in the patch.  That's certainly one of the subtler
> parts of this patch, IMHO.

Amit Langote has replied on these points as well. So, I will comment
in a reply to his reply.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Adding support for Default partition in partitioning

2017-09-07 Thread Ashutosh Bapat

On Wed, Sep 6, 2017 at 5:50 PM, Jeevan Ladhe
<jeevan.la...@enterprisedb.com> wrote:
>
>>
>> I am wondering whether we could avoid call to get_default_partition_oid()
>> in
>> the above block, thus avoiding a sys cache lookup. The sys cache search
>> shouldn't be expensive since the cache should already have that entry, but
>> still if we can avoid it, we save some CPU cycles. The default partition
>> OID is
>> stored in pg_partition_table catalog, which is looked up in
>> RelationGetPartitionKey(), a function which precedes
>> RelationGetPartitionDesc()
>> everywhere. What if that RelationGetPartitionKey() also returns the
>> default
>> partition OID and the common caller passes it to
>> RelationGetPartitionDesc()?.
>
>
> The purpose here is to cross check the relid with partdefid stored in
> catalog
> pg_partitioned_table, though its going to be the same in the parents cache,
> I
> think its better that we retrieve it from the catalog syscache.
> Further, RelationGetPartitionKey() is a macro and not a function, so
> modifying
> the existing simple macro for this reason does not sound a good idea to me.
> Having said this I am open to ideas here.

Sorry, I meant RelationBuildPartitionKey() and
RelationBuildPartitionDesc() instead of RelationGetPartitionKey() and
RelationGetPartitionDesc() resp.

>
>>
>> +/* A partition cannot be attached if there exists a default partition
>> */
>> +defaultPartOid = get_default_partition_oid(RelationGetRelid(rel));
>> +if (OidIsValid(defaultPartOid))
>> +ereport(ERROR,
>> +(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
>> + errmsg("cannot attach a new partition to table
>> \"%s\" having a default partition",
>> +RelationGetRelationName(rel;
>> get_default_partition_oid() searches the catalogs, which is not needed
>> when we
>> have relation descriptor of the partitioned table (to which a new
>> partition is
>> being attached). You should get the default partition OID from partition
>> descriptor. That will be cheaper.
>
>
> Something like following can be done here:
> /* A partition cannot be attached if there exists a default partition */
> if (partition_bound_has_default(rel->partdesc->boundinfo))
> ereport(ERROR,
> (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
>  errmsg("cannot attach a new partition to table \"%s\"
> having a default partition",
> RelationGetRelationName(rel;
>
> But, partition_bound_has_default() is defined in partition.c and not in
> partition.h. This is done that way because boundinfo is not available in
> partition.h. Further, this piece of code is removed in next patch where we
> extend default partition support to add/attach partition even when default
> partition exists. So, to me I don’t see much of the correction issue here.

If the code is being removed, I don't think we should sweat too much
about it. Sorry for the noise.

>
> Another way to get around this is, we can define another version of
> get_default_partition_oid() something like
> get_default_partition_oid_from_parent_rel()
> in partition.c which looks around in relcache instead of catalog and returns
> the
> oid of default partition, or get_default_partition_oid() accepts both parent
> OID,
> and parent ‘Relation’ rel, if rel is not null look into relcahce and return,
> else search from catalog using OID.

I think we should define a function to return default partition OID
from partition descriptor and make it extern. Define a wrapper which
accepts Relation and returns calls this function to get default
partition OID from partition descriptor. The wrapper will be called
only on an open Relation, wherever it's available.


>
>> I haven't gone through the full patch yet, so there may be more
>> comments here. There is some duplication of code in
>> check_default_allows_bound() and ValidatePartitionConstraints() to
>> scan the children of partition being validated. The difference is that
>> the first one scans the relations in the same function and the second
>> adds them to work queue. May be we could use
>> ValidatePartitionConstraints() to scan the relation or add to the
>> queue based on some input flag may be wqueue argument itself. But I
>> haven't thought through this completely. Any thoughts?
>
>
> check_default_allows_bound() is called only from DefineRelation(),
> and not for alter command. I am not really sure how can we use
> work queue for create command.


No, we shouldn't use work queue for CREATE comm

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-07 Thread Ashutosh Bapat

On Thu, Sep 7, 2017 at 4:32 PM, Antonin Houska <a...@cybertec.at> wrote:
> Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> wrote:
>
>> On Fri, Sep 1, 2017 at 6:05 PM, Antonin Houska <a...@cybertec.at> wrote:
>> >
>> >
>> >
>> > * get_partitioned_child_rels_for_join()
>> >
>> > I think the Assert() statement is easier to understand inside the loop, see
>> > the assert.diff attachment.
>
>> The assert at the end of function also checks that we have got
>> child_rels lists for all the parents passed in.
>
> Really? I can imagine that some instances of PartitionedChildRelInfo have the
> child_rels list empty, while other ones have these lists long enough to
> compensate for the empty lists.
>

That isn't true. Each child_rels list will at least have one entry.
Please see get_partitioned_child_rels().

>> >
>> >
>> > * have_partkey_equi_join()
>> >
>> > As the function handles generic join, this comment doesn't seem to me
>> > relevant:
>> >
>> > /*
>> >  * The equi-join between partition keys is strict if equi-join between
>> >  * at least one partition key is using a strict operator. See
>> >  * explanation about outer join reordering identity 3 in
>> >  * optimizer/README
>> >  */
>> > strict_op = op_strict(opexpr->opno);
>>
>> What in that comment is not exactly relevant?
>
> Basically I don't understand why you mention join reordering here. The join
> ordering questions must all have been resolved by the time
> have_partkey_equi_join() is called.

I am referring to a particular section in README which talks about the
relation between strict operator and legal join order.

>
>> >
>> > And I think the function can return true even if strict_op is false for all
>> > the operators evaluated in the loop.
>>
>> I think it does that. Do you have a case where it doesn't?
>
> Here I refer to this part of the comment above:
>
> "... if equi-join between at least one partition key is using a strict
> operator."
>
> My understanding of the code (especially match_expr_to_partition_keys) is that
> no operator actually needs to be strict as long as each operator involved in
> the join matches at least one non-nullable expression on both sides of the
> join.

I don't think so. A strict operator returns NULL when either of the
inputs is NULL. We can not say so for non-strict operators, which may
deem NULL and non-NULL arguments as equal, even though that looks
insane.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Aggregation push-down

2017-09-07 Thread Ashutosh Bapat

  * joins, because there may be no other alternative.
  */
 if (enable_hashjoin || jointype == JOIN_FULL)
-hash_inner_and_outer(root, joinrel, outerrel, innerrel,
- jointype, );
+hash_inner_and_outer(root, joinrel, outerrel, innerrel, jointype,
+ );

In the last "regression" patch, there are some code changes (mostly because of
pg_indent run). May be you want to include those in appropriate code patches.

Some quick observations using two tables t1(a int, b int) and t2(a int, b int),
populated as
insert into t1 select i, i % 5 from generate_series(1, 100) i where i % 2 = 0;
insert into t2 select i, i % 5 from generate_series(1, 100) i where i % 3 = 0;

1. The patch pushes aggregates down join in the following query
explain verbose select avg(t2.a) from t1 inner join t2 on t1.b = t2.b group by
t2.b;
but does not push the aggregates down join in the following query
explain verbose select avg(t2.a) from t1 left join t2 on t1.b = t2.b group by
t2.b;
In fact, it doesn't use the optimization for any OUTER join. I think the reason
for this behaviour is that the patch uses equivalence classes to distribute the
aggregates and grouping to base relations and OUTER equi-joins do not form
equivalence classes. But I think it should be possible to push the aggregates
down the OUTER join by adding one row for NULL values if the grouping is pushed
to the inner side. I don't see much change for outer side. This also means that
we have to choose some means other than equivalence class for propagating the
aggregates.

2. Following query throws error with these patches, but not without the
patches.
explain verbose select sum(t1.a + t2.a) from t1, t2, t2 t3 where t1.a
= t2.a
group by t2.a, t1.a;
ERROR:  ORDER/GROUP BY expression not found in targetlist

[1] 
https://www.postgresql.org/message-id/CAFjFpRejPP4K%3Dg%2B0aaq_US0YrMaZzyM%2BNUCi%3DJgwaxhMUj2Zcg%40mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] <> join selectivity estimate question

2017-09-06 Thread Ashutosh Bapat

On Fri, Jul 21, 2017 at 4:10 AM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
>
> Thanks.  Bearing all that in mind, I ran through a series of test
> scenarios and discovered that my handling for JOIN_ANTI was wrong: I
> thought that I had to deal with inverting the result, but I now see
> that that's handled elsewhere (calc_joinrel_size_estimate() I think).
> So neqjoinsel should just treat JOIN_SEMI and JOIN_ANTI exactly the
> same way.

I agree, esp. after looking at eqjoinsel_semi(), which is used for
both semi and anti joins, it becomes more clear.

>
> That just leaves the question of whether we should try to handle the
> empty RHS and single-value RHS cases using statistics.  My intuition
> is that we shouldn't, but I'll be happy to change my intuition and
> code that up if that is the feedback from planner gurus.

Empty RHS can result from dummy relations also, which are produced by
constraint exclusion, so may be that's an interesting case. Single
value RHS may be interesting with partitioned table with all rows in a
given partition end up with the same partition key value. But may be
those are just different patches. I am not sure.

>
> Please find attached a new version, and a test script I used, which
> shows a bunch of interesting cases.  I'll add this to the commitfest.

I added some "stable" tests to your patch taking inspiration from the
test SQL file. I think those will be stable across machines and runs.
Please let me know if those look good to you.



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 23e5526..4c1bae6 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2697,26 +2697,63 @@ neqjoinsel(PG_FUNCTION_ARGS)
 	Oid			eqop;
 	float8		result;
 
-	/*
-	 * We want 1 - eqjoinsel() where the equality operator is the one
-	 * associated with this != operator, that is, its negator.
-	 */
-	eqop = get_negator(operator);
-	if (eqop)
+
+	if (jointype == JOIN_SEMI || jointype == JOIN_ANTI)
 	{
-		result = DatumGetFloat8(DirectFunctionCall5(eqjoinsel,
-	PointerGetDatum(root),
-	ObjectIdGetDatum(eqop),
-	PointerGetDatum(args),
-	Int16GetDatum(jointype),
-	PointerGetDatum(sjinfo)));
+		VariableStatData leftvar;
+		VariableStatData rightvar;
+		double		nullfrac;
+		bool		reversed;
+		HeapTuple	statsTuple;
+
+		get_join_variables(root, args, sjinfo, , , );
+		statsTuple = reversed ? rightvar.statsTuple : leftvar.statsTuple;
+		if (HeapTupleIsValid(statsTuple))
+			nullfrac = ((Form_pg_statistic) GETSTRUCT(statsTuple))->stanullfrac;
+		else
+			nullfrac = 0.0;
+		ReleaseVariableStats(leftvar);
+		ReleaseVariableStats(rightvar);
+
+		/*
+		 * For semi-joins, if there is more than one distinct value in the RHS
+		 * relation then every non-null LHS row must find a row to join since
+		 * it can only be equal to one of them.  We'll assume that there is
+		 * always more than one distinct RHS value for the sake of stability,
+		 * though in theory we could have special cases for empty RHS
+		 * (selectivity = 0) and single-distinct-value RHS (selectivity =
+		 * fraction of LHS that has the same value as the single RHS value).
+		 *
+		 * For anti-joins, if we use the same assumption that there is more
+		 * than one distinct key in the RHS relation, then every non-null LHS
+		 * row must be suppressed by the anti-join leaving only nullfrac.
+		 */
+		result = 1.0 - nullfrac;
 	}
 	else
 	{
-		/* Use default selectivity (should we raise an error instead?) */
-		result = DEFAULT_EQ_SEL;
+		/*
+		 * We want 1 - eqjoinsel() where the equality operator is the one
+		 * associated with this != operator, that is, its negator.
+		 */
+		eqop = get_negator(operator);
+		if (eqop)
+		{
+			result = DatumGetFloat8(DirectFunctionCall5(eqjoinsel,
+		PointerGetDatum(root),
+		ObjectIdGetDatum(eqop),
+		PointerGetDatum(args),
+		Int16GetDatum(jointype),
+		PointerGetDatum(sjinfo)));
+		}
+		else
+		{
+			/* Use default selectivity (should we raise an error instead?) */
+			result = DEFAULT_EQ_SEL;
+		}
+		result = 1.0 - result;
 	}
-	result = 1.0 - result;
+
 	PG_RETURN_FLOAT8(result);
 }
 
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 9f4c88d..10bfb68 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -1845,6 +1845,89 @@ SELECT '' AS "xxx", *
  | 1 | 4 | one | -1
 (1 row)
 
+-- SEMI and ANTI join neq selectivity
+ANALYZE J1_TBL;
+ANALYZE J2_TBL;
+EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF)
+SELECT * FROM J1_TBL t1
+WHERE EXISTS (SELECT * FROM J2_TBL t2 WHERE t1.i <> t2.i);
+QUERY PLAN 
+--

Re: [HACKERS] dropping partitioned tables without CASCADE

2017-09-06 Thread Ashutosh Bapat

On Wed, Sep 6, 2017 at 1:06 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote:
>
> 2) Add partition to the foo;
>
> create table foo_p1 partition of foo for values in (1, 2, 3) partition by
> list (b);
>
> postgres=# \d foo
> Table "public.foo"
>  Column |  Type   | Collation | Nullable | Default
> +-+---+--+-
>  a  | integer |   |  |
>  b  | integer |   |  |
> Partition key: LIST (a)
> Number of partitions: 1 (Use \d+ to list them.)
>
> postgres=# \d+ foo
> Table "public.foo"
>  Column |  Type   | Collation | Nullable | Default | Storage | Stats target
> | Description
> +-+---+--+-+-+--+-
>  a  | integer |   |  | | plain   |
> |
>  b  | integer |   |  | | plain   |
> |
> Partition key: LIST (a)
> Partitions: foo_p1 FOR VALUES IN (1, 2, 3) has partitions
>
> Above verbose output for foo says, foo_p1 "has partitions". But if I do
>
> postgres=# \d foo_p1
>Table "public.foo_p1"
>  Column |  Type   | Collation | Nullable | Default
> +-+---+--+-
>  a  | integer |   |  |
>  b  | integer |   |  |
> Partition of: foo FOR VALUES IN (1, 2, 3)
> Partition key: LIST (b)
> Number of partitions: 0
>
> it tell "Number of partitions: 0".
>
> I feel like information is conflicting with each other. AFAIU, idea about
> adding
> "has partitions" was to let know that it's a partitioned table. So can you
> directly
> add the "is partitioned" in place of "has partitions"?
>
> Did those change in the attached patch and update regression expected
> output.
>

Looks better.

> Also run pgindent on the patch.
>

Thanks for the changes. The patch looks good to me.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] paths in partitions of a dummy partitioned table

2017-09-06 Thread Ashutosh Bapat

On Fri, Aug 25, 2017 at 10:46 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Jul 6, 2017 at 11:35 AM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>> If a partitioned table is proven dummy, set_rel_pathlist() doesn't mark the
>> partition relations dummy and thus doesn't set any (dummy) paths in the
>> partition relations. The lack of paths in the partitions means that we can
>> not use partition-wise join while joining this table with some other 
>> similarly
>> partitioned table as the partitions do not have any paths that child-joins 
>> can
>> use. This means that we can not create partition relations for such a join 
>> and
>> thus can not consider the join to be partitioned. This doesn't matter much 
>> when
>> the dummy relation is the outer relation, since the resultant join is also
>> dummy. But when the dummy relation is inner relation, the resultant join is 
>> not
>> dummy and can be considered to be partitioned with same partitioning scheme 
>> as
>> the outer relation to be joined with other similarly partitioned table. Not
>> having paths in the partitions deprives us of this future optimization.
>
> I think it's wrong for any code to be examining the path list for a
> rel marked dummy, so I would suggest approaching this from a different
> direction.

Me and Robert had an offline discussion about this. I am summarizing
it here for the sake of completeness.

A dummy relation is identified by the only dummy path that exists in
its pathlist. There is no flag in RelOptInfo which tells whether a
given relation is dummy or not, it's the dummy path which tells that.
A dummy path is an Append path with no subpaths. Planner doesn't treat
dummy relations any different from other relations when it comes to
using paths. When a dummy relation participates in a join, the dummy
path is used as one of the joining paths and converted to a Result
plan at the time of planning. So, for a partition-wise join where one
of the joining relations is dummy, its every child must have dummy
path which can be used to construct child-join paths.

But we don't need to mark partition relations dummy (if their parent
is dummy) even when it's not going to participate in partition-wise
join. The partition relations will be marked dummy when we know that
they will be required for partition-wise join. I was worried that we
might mark base relation dummy during join planning this way, but we
already have a precedence for that in add_paths_to_join_rel(). So,
shouldn't be a problem. So, I have now added a patch in partition-wise
join set to mark partition relations dummy when their parent is dummy.

> Given A LEFT JOIN B where Bk is dummy, I suggest
> constructing the path for (AB)k by taking a path from Ak and applying
> an appropriate PathTarget.  You don't really need a join path at all;
> a path for the non-dummy input is fine - and, in fact, better, since
> it will be cheaper to execute.  One problem is that it may not produce
> the correct output columns.  (AB) may drop some columns that were
> being generated by A because they were only needed to perform the
> join, and it may add additional columns taken from B.  But I think
> that if you take the default PathTarget for (AB) and replace
> references to columns of B with NULL constants, you should be able to
> apply the resulting PathTarget to a path for Ak and get a valid path
> for (AB)k.  Maybe there is some reason why that won't work exactly,
> but I think it is probably more promising to attack the problem from
> this angle than to do what you propose.  Sticking dummy joins into the
> query plan that are really just projecting out NULLs is not appealing.
>

This might help in the cases when the RelOptInfo itself is missing
e.g. missing partitions in partition matching as discussed in [1]. I
will discuss this approach on that thread.

[1] 
https://www.postgresql.org/message-id/cafjfprdjqvauev5djx3tw6pu5eq54nckadtxhx2jijg_gvb...@mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Replacing lfirst() with lfirst_node() appropriately in planner.c

2017-09-05 Thread Ashutosh Bapat

On Wed, Sep 6, 2017 at 1:32 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> writes:
>> On Tue, Sep 5, 2017 at 4:02 PM, Jeevan Chalke
>> <jeevan.cha...@enterprisedb.com> wrote:
>>> Attached patch for replacing linitial() with linital_node() appropriately in
>>> planner.c
>
>> Ok. The patch looks good to me. It compiles and make check passes.
>> Here are both the patches rebased on the latest sources.
>
> LGTM, pushed.  I also changed a couple of places that left off any cast
> of lfirst, eg:
>
> -   List   *gset = lfirst(lc);
> +   List   *gset = (List *) lfirst(lc);
>
> While this isn't really harmful, it's not per prevailing style.

+1. Thanks.

>
> BTW, I think we *could* use "lfirst_node(List, ...)" in cases where
> we know the list is supposed to be a list of objects rather than ints
> or Oids.  I didn't do anything about that observation, though.
>

IMO, it won't be apparent as to why some code uses lfirst_node(List,
...) and some code uses (List *) lfirst().

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] no test coverage for ALTER FOREIGN DATA WRAPPER name HANDLER ...

2017-09-05 Thread Ashutosh Bapat

On Wed, May 10, 2017 at 5:55 AM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/05/09 22:52, Mark Dilger wrote:
>>
>>> On May 9, 2017, at 12:18 AM, Amit Langote <langote_amit...@lab.ntt.co.jp> 
>>> wrote:
>>>
>>> Hi,
>>>
>>> On 2017/05/05 9:38, Mark Dilger wrote:
>>>> Hackers,
>>>>
>>>> just FYI, I cannot find any regression test coverage of this part of the 
>>>> grammar, not
>>>> even in the contrib/ directory or TAP tests.  I was going to submit a 
>>>> patch to help out,
>>>> and discovered it is not so easy to do, and perhaps that is why the 
>>>> coverage is missing.
>>>
>>> I think we could add the coverage by defining a dummy C FDW handler
>>> function in regress.c.  I see that some other regression tests use C
>>> functions defined there, such as test_atomic_ops().
>>>
>>> What do you think about the attached patch?  I am assuming you only meant
>>> to add tests for the code in foreigncmds.c (CREATE/ALTER FOREIGN DATA
>>> WRAPPER).
>>
>> Thank you for creating the patch.  I only see one small oversight, which is 
>> the
>> successful case of ALTER FOREIGN DATA WRAPPER ... HANDLER ... is still
>> not tested.  I added one line for that in the attached modification of your 
>> patch.
>
> Ah right, thanks.
>
> Adding this to the next commitfest as Ashutosh suggested.
>

The patch needs a rebase.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Replacing lfirst() with lfirst_node() appropriately in planner.c

2017-09-05 Thread Ashutosh Bapat

On Tue, Sep 5, 2017 at 4:02 PM, Jeevan Chalke
<jeevan.cha...@enterprisedb.com> wrote:
> Hi,
>
> lfirst_node() changes are missing for List node type and I was thinking
> about adding those. But it looks like we cannot do so as List can be a
> T_List, T_IntList, or T_OidList. So I am OK with that.

Thanks for investigating the case of T_List.

>
> Apart from this, changes look good to me. Everything works fine.
>
> As we are now doing it for lfirst(), can we also do this for linitial()?
> I did not find any usage for lsecond(), lthird(), lfourth() and llast()
> though.
>
> Attached patch for replacing linitial() with linital_node() appropriately in
> planner.c
>
Ok. The patch looks good to me. It compiles and make check passes.
Here are both the patches rebased on the latest sources.

I am marking this commitfest entry as "ready for committer".
From 25ae53d7f72a54a03ec90206c7e5579a562a121c Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.ba...@enterprisedb.com>
Date: Tue, 5 Sep 2017 16:50:58 +0530
Subject: [PATCH 1/2] Use lfirst_node() instead of lfirst() wherever suitable
 in planner.c

Ashutosh Bapat, reviewed by Jeevan Chalke
---
 src/backend/optimizer/plan/planner.c |   94 +-
 1 file changed, 47 insertions(+), 47 deletions(-)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 9662302..a1dd157 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -411,7 +411,7 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 		forboth(lp, glob->subplans, lr, glob->subroots)
 		{
 			Plan	   *subplan = (Plan *) lfirst(lp);
-			PlannerInfo *subroot = (PlannerInfo *) lfirst(lr);
+			PlannerInfo *subroot = lfirst_node(PlannerInfo, lr);
 
 			SS_finalize_plan(subroot, subplan);
 		}
@@ -430,7 +430,7 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	forboth(lp, glob->subplans, lr, glob->subroots)
 	{
 		Plan	   *subplan = (Plan *) lfirst(lp);
-		PlannerInfo *subroot = (PlannerInfo *) lfirst(lr);
+		PlannerInfo *subroot = lfirst_node(PlannerInfo, lr);
 
 		lfirst(lp) = set_plan_references(subroot, subplan);
 	}
@@ -586,7 +586,7 @@ subquery_planner(PlannerGlobal *glob, Query *parse,
 	hasOuterJoins = false;
 	foreach(l, parse->rtable)
 	{
-		RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+		RangeTblEntry *rte = lfirst_node(RangeTblEntry, l);
 
 		if (rte->rtekind == RTE_JOIN)
 		{
@@ -643,7 +643,7 @@ subquery_planner(PlannerGlobal *glob, Query *parse,
 	newWithCheckOptions = NIL;
 	foreach(l, parse->withCheckOptions)
 	{
-		WithCheckOption *wco = (WithCheckOption *) lfirst(l);
+		WithCheckOption *wco = lfirst_node(WithCheckOption, l);
 
 		wco->qual = preprocess_expression(root, wco->qual,
 		  EXPRKIND_QUAL);
@@ -663,7 +663,7 @@ subquery_planner(PlannerGlobal *glob, Query *parse,
 
 	foreach(l, parse->windowClause)
 	{
-		WindowClause *wc = (WindowClause *) lfirst(l);
+		WindowClause *wc = lfirst_node(WindowClause, l);
 
 		/* partitionClause/orderClause are sort/group expressions */
 		wc->startOffset = preprocess_expression(root, wc->startOffset,
@@ -705,7 +705,7 @@ subquery_planner(PlannerGlobal *glob, Query *parse,
 	/* Also need to preprocess expressions within RTEs */
 	foreach(l, parse->rtable)
 	{
-		RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+		RangeTblEntry *rte = lfirst_node(RangeTblEntry, l);
 		int			kind;
 		ListCell   *lcsq;
 
@@ -1080,7 +1080,7 @@ inheritance_planner(PlannerInfo *root)
 	rti = 1;
 	foreach(lc, parse->rtable)
 	{
-		RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);
+		RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);
 
 		if (rte->rtekind == RTE_SUBQUERY)
 			subqueryRTindexes = bms_add_member(subqueryRTindexes, rti);
@@ -1102,7 +1102,7 @@ inheritance_planner(PlannerInfo *root)
 	{
 		foreach(lc, root->append_rel_list)
 		{
-			AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lc);
+			AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
 
 			if (bms_is_member(appinfo->parent_relid, subqueryRTindexes) ||
 bms_is_member(appinfo->child_relid, subqueryRTindexes) ||
@@ -1130,7 +1130,7 @@ inheritance_planner(PlannerInfo *root)
 	 */
 	foreach(lc, root->append_rel_list)
 	{
-		AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lc);
+		AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
 		PlannerInfo *subroot;
 		RangeTblEntry *child_rte;
 		RelOptInfo *sub_final_rel;
@@ -1192,7 +1192,7 @@ inheritance_planner(PlannerInfo *root)
 			subroot->append_rel_list = NIL;
 			foreach(lc2, root->append_rel_list)
 			{
-AppendRelInfo *appinfo2 = (AppendRelInfo *) lfirst(lc2);
+AppendRelInfo *appinfo2 = lfirst_node(AppendRelInfo, lc2);
 
 if (bms_is_member(appinfo2->child_relid, modifiableARIindexes))
 	appinfo2 = copyObject(appinfo2);
@

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-05 Thread Ashutosh Bapat

On Tue, Sep 5, 2017 at 1:16 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/05 15:43, Ashutosh Bapat wrote:
>> Ok. Can you please answer my previous questions?
>>
>> AFAIU, the list contained RTIs of the relations, which didnt' have
>> corresponding AppendRelInfos to lock those relations. Now that we
>> create AppendRelInfos even for partitioned partitions with my 0001
>> patch, I don't think
>> we need the list to take care of the locks. Is there any other reason
>> why we maintain that list (apart from the trigger case I have raised
>> and Fujita-san says that the list is not required in that case as
>> well.)?
>
> AppendRelInfos exist within the planner (they come to be and go away
> within the planner).  Once we leave the planner, that information is gone.
>
> Executor will receive a plan node that won't contain that information:
>
> 1. Append has an appendplans field, which contains one plan tree for every
>leaf partition.  None of its fields, other than partitined_rels,
>contains the RT indexes of the partitioned tables.  Similarly in the
>case of MergeAppend.
>
> 2. ModifyTable has a resultRelations fields which contains a list of leaf
>partition RT indexes and a plans field which contains one plan tree for
>every RT index in the resultRelations list (that is a plan tree that
>will scan the particular leaf partition).  None of its fields, other
>than partitined_rels, contains the RT indexes of the partitioned
>tables.
>
> I learned over the course of developing the patch that added this
> partitioned_rels field [1] that the executor needs to identify all the
> affected tables by a given plan tree so that it could lock them.  Executor
> needs to lock them separately even if the plan itself was built after
> locking all the relevant tables [2].  For example, see
> ExecLockNonLeafAppendTables(), which will lock the tables in the
> (Merge)Append.partitioned_rels list.
>
> While I've been thinking all along that the same thing must be happening
> for RT indexes in ModifyTable.partitioned_rels list (I said so a couple of
> times on this thread), it's actually not.  Instead,
> ModifyTable.partitioned_rels of all ModifyTable nodes in a give query are
> merged into PlannedStmt.nonleafResultRelations (which happens in
> set_plan_refs) and that's where the executor finds them to lock them
> (which happens in InitPlan).
>
> So, it appears that ModifyTable.partitioned_rels is indeed unused in the
> executor.  But we still can't get rid of it from the ModifyTable node
> itself without figuring out a way (a channel) to transfer that information
> into PlannedStmt.nonleafResultRelations.

Thanks a lot for this detailed analysis. IIUC, in my 0001 patch, I am
not adding any partitioned partition other than the parent itself. But
since every partitioned partition in turn acts as parent, it appears
its own list. The list obtained by concatenating all such lists
together contains all the partitioned partition RTIs. In my patch, I
need to teach accumulate_append_subpath() to accumulate
partitioned_rels as well.

>
>> Having asked that, I think my patch shouldn't deal with removing
>> partitioned_rels lists and related structures and code.  It should be> done 
>> as a separate patch.
>
> Going back to your original email which started this discussion, it seems
> that we agree on that the PartitionedChildRelInfo node can be removed, and
> I agree that it shouldn't be done in the partitionwise-join patch series
> but as a separate patch.  As described above, we shouldn't try yet to get
> rid of the partitioned_rels list that appears in some plan nodes.

Thanks.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-05 Thread Ashutosh Bapat

On Tue, Sep 5, 2017 at 12:06 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/05 15:30, Ashutosh Bapat wrote:
>> Those changes are already part of my updated 0001 patch. Aren't they?
>> May be you should just review those and see if those are suitable for
>> you?
>
> Yeah, I think it's going to be the same patch, functionality-wise.
>
> And sorry, I didn't realize you were talking about the case after applying
> your patch on HEAD.
>

Ok. Can you please answer my previous questions?

AFAIU, the list contained RTIs of the relations, which didnt' have
corresponding AppendRelInfos to lock those relations. Now that we
create AppendRelInfos even for partitioned partitions with my 0001
patch, I don't think
we need the list to take care of the locks. Is there any other reason
why we maintain that list (apart from the trigger case I have raised
and Fujita-san says that the list is not required in that case as
well.)?

Having asked that, I think my patch shouldn't deal with removing
partitioned_rels lists and related structures and code. It should be
done as a separate patch.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-05 Thread Ashutosh Bapat

On Tue, Sep 5, 2017 at 11:54 AM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/05 13:20, Amit Langote wrote:
>> The later phase can
>> build that list from the AppendRelInfos that you mention we now [1] build.
>>
>> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30833ba154
>
> Looking at that commit again, AppendRelInfos are still not created for
> partitioned child tables.  Looking at the code in
> expand_single_inheritance_child(), which exists as of 30833ba154:
>
>
> /*
>  * Build an AppendRelInfo for this parent and child, unless the child is a
>  * partitioned table.
>  */
> if (childrte->relkind != RELKIND_PARTITIONED_TABLE && !childrte->inh)
> {
>  ...code that builds AppendRelInfo...
> }
> else
> *partitioned_child_rels = lappend_int(*partitioned_child_rels,
>   childRTindex);
>
> you can see that an AppendRelInfo won't get built for partitioned child
> tables.
>
> Also, even if the commit changed things so that the child RT entries (and
> AppendRelInfos) now get built in an order determined by depth-first
> traversal of the partition tree, the same original parent RT index is used
> to mark all AppendRelInfos, so the expansion essentially flattens the
> hierarchy.  In the updated patch I will post on the "path toward faster
> partition pruning" thread [1], I am planning to rejigger things so that
> two things start to happen:
>
> 1. For partitioned child tables, build the child RT entry with inh = true
>and also build an AppendRelInfos
>
> 2. When recursively expanding a partitioned child table in
>expand_partitioned_rtentry(), pass its new RT index as the
>parentRTindex to the recursive call of expand_partitioned_rtentry(), so
>that the resulting AppendRelInfos reflect immediate parent-child
>relationship
>
> With 1 in place, build_simple_rel() will build RelOptInfos even for
> partitioned child tables, so that for each one, we can recursively build
> an Append path.  So, instead of just one Append path for the root
> partitioned table, there is one for each partitioned table in the tree.
>
> I will be including the above described change in the partition-pruning
> patch, because the other code in that patch relies on the same and I know
> Ashuotsh has wanted that for a long time. :)

Those changes are already part of my updated 0001 patch. Aren't they?
May be you should just review those and see if those are suitable for
you?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] dropping partitioned tables without CASCADE

2017-09-04 Thread Ashutosh Bapat

On Mon, Sep 4, 2017 at 3:48 PM, Alvaro Herrera <alvhe...@alvh.no-ip.org> wrote:
>
> if (tuples > 0)
> {
> -   if (tableinfo.relkind != 
> RELKIND_PARTITIONED_TABLE)
> -   printfPQExpBuffer(, _("Number of 
> child tables: %d (Use \\d+ to list them.)"), tuples);
> -   else
> -   printfPQExpBuffer(, _("Number of 
> partitions: %d (Use \\d+ to list them.)"), tuples);
> +   printfPQExpBuffer(, _("Number of %s: %d 
> (Use \\d+ to list them.)"), ct, tuples);
> printTableAddFooter(, buf.data);
> }
>
> Please don't do this, because it breaks translatability.  I think the
> original is fine.
>

We have used this style in the "else" case of if (!verbose). So, I
just copied it. I have removed that change in the attached patch.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
From c0a153f0535c4d1dca637996d4cd5e6f62c11afe Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.ba...@enterprisedb.com>
Date: Mon, 4 Sep 2017 09:56:41 +0530
Subject: [PATCH] Improve \d+ output of a partitioned table

While displaying partitions in \d+ output of a partitioned table
annotate the partitioned partitions as "has partitions".

For a partitioned table show the number of partitions even if it's 0.

Ashutosh Bapat and Amit Langote.
---
 src/bin/psql/describe.c|   32 ++--
 src/test/regress/expected/create_table.out |   13 ++-
 src/test/regress/expected/foreign_data.out |3 +++
 src/test/regress/expected/insert.out   |   15 +
 src/test/regress/sql/create_table.sql  |2 +-
 src/test/regress/sql/insert.sql|4 
 6 files changed, 57 insertions(+), 12 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index f6049cc..5adf5a4 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2828,7 +2828,7 @@ describeOneTableDetails(const char *schemaname,
 		/* print child tables (with additional info if partitions) */
 		if (pset.sversion >= 10)
 			printfPQExpBuffer(,
-			  "SELECT c.oid::pg_catalog.regclass, pg_catalog.pg_get_expr(c.relpartbound, c.oid)"
+			  "SELECT c.oid::pg_catalog.regclass, pg_get_expr(c.relpartbound, c.oid), c.relkind"
 			  " FROM pg_catalog.pg_class c, pg_catalog.pg_inherits i"
 			  " WHERE c.oid=i.inhrelid AND i.inhparent = '%s'"
 			  " ORDER BY c.oid::pg_catalog.regclass::pg_catalog.text;", oid);
@@ -2851,7 +2851,18 @@ describeOneTableDetails(const char *schemaname,
 		else
 			tuples = PQntuples(result);
 
-		if (!verbose)
+		/*
+		 * For a partitioned table with no partitions, always print the number
+		 * of partitions as zero, even when verbose output is expected.
+		 * Otherwise, we will not print "Partitions" section for a partitioned
+		 * table without any partitions.
+		 */
+		if (tableinfo.relkind == RELKIND_PARTITIONED_TABLE && tuples == 0)
+		{
+			printfPQExpBuffer(, _("Number of partitions: %d"), tuples);
+			printTableAddFooter(, buf.data);
+		}
+		else if (!verbose)
 		{
 			/* print the number of child tables, if any */
 			if (tuples > 0)
@@ -2883,12 +2894,21 @@ describeOneTableDetails(const char *schemaname,
 }
 else
 {
+	char *partitioned_note;
+
+	if (*PQgetvalue(result, i, 2) == RELKIND_PARTITIONED_TABLE)
+		partitioned_note = " has partitions";
+	else
+		partitioned_note = "";
+
 	if (i == 0)
-		printfPQExpBuffer(, "%s: %s %s",
-		  ct, PQgetvalue(result, i, 0), PQgetvalue(result, i, 1));
+		printfPQExpBuffer(, "%s: %s %s%s",
+		  ct, PQgetvalue(result, i, 0), PQgetvalue(result, i, 1),
+		  partitioned_note);
 	else
-		printfPQExpBuffer(, "%*s  %s %s",
-		  ctw, "", PQgetvalue(result, i, 0), PQgetvalue(result, i, 1));
+		printfPQExpBuffer(, "%*s  %s %s%s",
+		  ctw, "", PQgetvalue(result, i, 0), PQgetvalue(result, i, 1),
+		  partitioned_note);
 }
 if (i < tuples - 1)
 	appendPQExpBufferChar(, ',');
diff --git a/src/test/regress/expected/create_table.out b/src/test/regress/expected/create_table.out
index babda89..a35d19e 100644
--- a/src/test/regress/expected/create_table.out
+++ b/src/test/regress/expected/create_table.out
@@ -428,13 +428,15 @@ ERROR:  cannot inherit from partitioned table "partitioned2"
  c  | text|   |  | 
  d  | text|   |  | 
 Partition key: RANGE

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 965 matches

Mail list logo