Re: speeding up planning with partitions
On Wed, May 1, 2019 at 4:08 AM Tom Lane wrote:
> OK, I tweaked that a bit and pushed both versions.

Thank you.

Regards,
Amit
Re: speeding up planning with partitions
Amit Langote writes:
> On Tue, Apr 30, 2019 at 1:26 PM Amit Langote wrote:
>> It would be nice if at least we fix the bug that directly accessed
>> partitions are not excluded with constraint_exclusion = on, without
>> removing PG 11's contortions in relation_excluded_by_constraints to
>> work around the odd requirements imposed by inheritance_planner, which
>> is causing the additional diffs in the regression expected output.

> FWIW, attached is a delta patch that applies on top of your patch for
> v11 branch that shows what may be one way to go about this.

OK, I tweaked that a bit and pushed both versions.

regards, tom lane
Re: speeding up planning with partitions
Amit Langote writes:
> Here is the patch. I've also included the patch to update the text in
> ddl.sgml regarding constraint exclusion and partition pruning.

I thought this was a bit messy. In particular, IMV the reason to have a
split between get_relation_constraints and its only caller
relation_excluded_by_constraints is to create a policy vs mechanism
separation: relation_excluded_by_constraints figures out what kinds of
constraints we need to look at, while get_relation_constraints does the
gruntwork of digging them out of the catalog data. Somebody had already
ignored this principle to the extent of putting this very-much-policy
test into get_relation_constraints:

    if (enable_partition_pruning && root->parse->commandType != CMD_SELECT)

but the way to improve that is to add another flag parameter to convey
the policy choice, not to move the whole chunk of mechanism out to the
caller.

It also struck me while looking at the code that we were being
unnecessarily stupid about non-inheritable constraints: rather than just
throwing up our hands for traditional inheritance situations, we can
still apply constraint exclusion, as long as we consider only
constraints that aren't marked ccnoinherit. (attnotnull constraints
have to be considered as if they were ccnoinherit, for ordinary tables
but not partitioned ones.)

So, I propose the attached revised patch.

I'm not sure how much of this, if anything, we should back-patch to v11.
It definitely doesn't seem like we should back-patch the improvement
just explained. I tried diking out that change, as in the v11 variant
attached, and found that this still causes quite a few other changes in
v11's expected results, most of them not for the better. So I'm thinking
that we'd better conclude that v11's ship has sailed. Its behavior is in
some ways weird, but I am not sure that anyone will appreciate our
changing it on the fourth minor release.

It's somewhat interesting that we get these other changes in v11 but
not HEAD.
I think the reason is that we reimplemented so much of
inheritance_planner in 428b260f8; that is, it seems the weird decisions
we find in relation_excluded_by_constraints are mostly there to band-aid
over the old weird behavior of inheritance_planner.

Anyway, my current thought is to apply this to HEAD and do nothing in
v11. I include the v11 patch just for amusement. (I did not check v11's
behavior outside the core regression tests; it might possibly have
additional test diffs in contrib.)

regards, tom lane

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8ddab75..84341a3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5084,10 +5084,11 @@ ANY num_sync ( .)
         Refer to  for
-        more information on using constraint exclusion and partitioning.
+        more information on using constraint exclusion to implement
+        partitioning.

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index cba2ea9..a0a7435 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -4535,24 +4535,11 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate = DATE '2008-01-01';
-     Currently, pruning of partitions during the planning of an
-     UPDATE or DELETE command is
-     implemented using the constraint exclusion method (however, it is
-     controlled by the enable_partition_pruning rather than
-     constraint_exclusion) see the following section
-     for details and caveats that apply.
-
      Execution-time partition pruning currently only occurs for the
      Append and MergeAppend node types. It is not yet implemented
      for the ModifyTable node
-     type.
-
-     Both of these behaviors are likely to be changed in a future release
-     of PostgreSQL.
+     type, but that is likely to be changed in a future release of
+     PostgreSQL.

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 0a6710c..eb6f5a3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1513,8 +1513,9 @@ inheritance_planner(PlannerInfo *root)
 		parent_rte->securityQuals = NIL;

 		/*
-		 * Mark whether we're planning a query to a partitioned table or an
-		 * inheritance parent.
+		 * HACK: setting this to a value other than INHKIND_NONE signals to
+		 * relation_excluded_by_constraints() to treat the result relation as
+		 * being an appendrel member.
 		 */
 		subroot->inhTargetKind = (rootRelation != 0) ?
 			INHKIND_PARTITIONED : INHKIND_INHERITED;

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 3301331..3215c29 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -67,7 +67,9 @@ static bool infer_collation_opclass_match(InferenceElem *elem, Relation idxRel,
 					List *idxExprs);
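The policy-vs-mechanism split argued for above (relation_excluded_by_constraints deciding *which* constraints matter, get_relation_constraints merely fetching them, with the policy passed down as flag parameters) could be sketched roughly as follows. This is a simplified standalone model, not the actual PostgreSQL code; the "Toy" struct and function names are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a table constraint; fields are invented, not the
 * real catalog representation. */
typedef struct ToyConstraint
{
	bool		ccnoinherit;	/* constraint is non-inheritable */
	bool		from_partbound; /* implied by the partition bound */
} ToyConstraint;

/*
 * Mechanism: dig out the constraints the caller asked for.  All policy
 * arrives via explicit flag parameters, in the spirit of adding another
 * flag rather than hoisting the mechanism into the caller.
 */
static int
toy_get_relation_constraints(const ToyConstraint *cons, int ncons,
							 bool include_noinherit,
							 bool include_partition)
{
	int			nkept = 0;

	for (int i = 0; i < ncons; i++)
	{
		if (cons[i].ccnoinherit && !include_noinherit)
			continue;
		if (cons[i].from_partbound && !include_partition)
			continue;
		nkept++;				/* would append to a List in reality */
	}
	return nkept;
}

/*
 * Policy: a traditional-inheritance parent must skip non-inheritable
 * constraints but can still attempt exclusion with the rest; a relation
 * whose partition constraint was already used for pruning need not
 * re-check it.
 */
static int
toy_relation_excluded_check(const ToyConstraint *cons, int ncons,
							bool is_inheritance_parent,
							bool partition_pruned_already)
{
	return toy_get_relation_constraints(cons, ncons,
										!is_inheritance_parent,
										!partition_pruned_already);
}
```

The point of the sketch is only the call shape: the caller owns the decision, the callee owns the catalog-digging loop.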
Re: speeding up planning with partitions
On Sun, Apr 28, 2019 at 8:10 AM Tom Lane wrote:
> Amit Langote writes:
> > Not sure if you'll like it but maybe we could ignore even regular
> > inheritance child targets that are proven to be empty (is_dummy_rel()) for
> > a given query during the initial SELECT planning. That way, we can avoid
> > re-running relation_excluded_by_constraints() a second time for *all*
> > child target relations.
>
> My thought was to keep traditional inheritance working more or less
> as it has. To do what you're suggesting, we'd have to move generic
> constraint-exclusion logic up into the RTE expansion phase, and I don't
> think that's a particularly great idea. I think what we should be
> doing is applying partition pruning (which is a very specialized form
> of constraint exclusion) during RTE expansion, then applying generic
> constraint exclusion in relation_excluded_by_constraints, and not
> examining partition constraints again there if we already used them.

Just to clarify, I wasn't suggesting that we change query_planner(), but
the blocks in inheritance_planner() that perform initial planning as if
the query were a SELECT and gather child target relations from that
planner run; the following consecutive blocks:

    /*
     * Before generating the real per-child-relation plans, do a cycle of
     * planning as though the query were a SELECT. ...
     */
    {
        PlannerInfo *subroot;

and:

    /*--
     * Since only one rangetable can exist in the final plan, we need to make
     * sure that it contains all the RTEs needed for any child plan. ...

    child_appinfos = NIL;
    old_child_rtis = NIL;
    new_child_rtis = NIL;
    parent_relids = bms_make_singleton(top_parentRTindex);

    foreach(lc, select_appinfos)
    {
        AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
        RangeTblEntry *child_rte;

        /* append_rel_list contains all append rels; ignore others */
        if (!bms_is_member(appinfo->parent_relid, parent_relids))
            continue;

        /* remember relevant AppendRelInfos for use below */
        child_appinfos = lappend(child_appinfos, appinfo);

I'm suggesting that we don't add the child relations that are dummy due
to constraint exclusion to child_appinfos. Maybe we'll need to combine
the two blocks so that the latter can use the PlannerInfo defined in the
former to look up the child relation to check whether it is dummy.

> > Do you want me to update my patch considering the above summary?
>
> Yes please.

I will try to get that done, hopefully by tomorrow. (Those of us who are
in Japan are on extended holidays around this time of year.)

> However, I wonder whether you're thinking differently in
> light of what you wrote in [1]:

Thanks for checking that thread.

> >>> Pruning in 10.2 works using internally generated partition constraints
> >>> (which for this purpose are same as CHECK constraints). With the new
> >>> pruning logic introduced in 11, planner no longer considers partition
> >>> constraint because it's redundant to check them in most cases, because
> >>> pruning would've selected the right set of partitions. Given that the new
> >>> pruning logic is still unable to handle the cases like above, maybe we
> >>> could change the planner to consider them, at least until we fix the
> >>> pruning logic to handle such cases.
>
> If we take that seriously then it would suggest not ignoring partition
> constraints in relation_excluded_by_constraints. However, I'm of the
> opinion that we shouldn't let temporary deficiencies in the
> partition-pruning logic drive what we do here. I don't think the set
> of cases where we could get a win by reconsidering the partition
> constraints is large enough to justify the cycles expended in doing so;
> and it'll get even smaller as pruning gets smarter.

Yeah, maybe we could get away with telling users to define equivalent
CHECK constraints for corner cases like that, although that's not really
great.

Thanks,
Amit
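The filtering suggested above could be sketched as follows — a standalone toy model with invented names, not the actual inheritance_planner code. After the initial SELECT-like planning pass has marked some child target relations dummy, those children would simply be skipped when building the list used for the real per-child planning:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-in for an appendrel child found by the initial
 * SELECT-like planning pass. */
typedef struct ToyChild
{
	int			relid;
	bool		is_dummy;		/* proven empty by constraint exclusion */
} ToyChild;

/*
 * Collect the child relids worth planning for real, dropping the dummy
 * ones so that relation_excluded_by_constraints need not run a second
 * time for them.
 */
static int
toy_gather_child_targets(const ToyChild *children, int nchildren,
						 int *out_relids)
{
	int			nkept = 0;

	for (int i = 0; i < nchildren; i++)
	{
		if (children[i].is_dummy)
			continue;			/* excluded: no per-child plan needed */
		out_relids[nkept++] = children[i].relid;
	}
	return nkept;
}
```

The real code would also need the two blocks combined, as noted above, so that the first pass's PlannerInfo is in scope when the dummy check runs.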
Re: speeding up planning with partitions
Amit Langote writes:
> On 2019/04/23 7:08, Tom Lane wrote:
>> [ a bunch of stuff ]

> Not sure if you'll like it but maybe we could ignore even regular
> inheritance child targets that are proven to be empty (is_dummy_rel()) for
> a given query during the initial SELECT planning. That way, we can avoid
> re-running relation_excluded_by_constraints() a second time for *all*
> child target relations.

My thought was to keep traditional inheritance working more or less
as it has. To do what you're suggesting, we'd have to move generic
constraint-exclusion logic up into the RTE expansion phase, and I don't
think that's a particularly great idea. I think what we should be
doing is applying partition pruning (which is a very specialized form
of constraint exclusion) during RTE expansion, then applying generic
constraint exclusion in relation_excluded_by_constraints, and not
examining partition constraints again there if we already used them.

> Do you want me to update my patch considering the above summary?

Yes please. However, I wonder whether you're thinking differently in
light of what you wrote in [1]:

>>> Pruning in 10.2 works using internally generated partition constraints
>>> (which for this purpose are same as CHECK constraints). With the new
>>> pruning logic introduced in 11, planner no longer considers partition
>>> constraint because it's redundant to check them in most cases, because
>>> pruning would've selected the right set of partitions. Given that the new
>>> pruning logic is still unable to handle the cases like above, maybe we
>>> could change the planner to consider them, at least until we fix the
>>> pruning logic to handle such cases.

If we take that seriously then it would suggest not ignoring partition
constraints in relation_excluded_by_constraints. However, I'm of the
opinion that we shouldn't let temporary deficiencies in the
partition-pruning logic drive what we do here. I don't think the set
of cases where we could get a win by reconsidering the partition
constraints is large enough to justify the cycles expended in doing so;
and it'll get even smaller as pruning gets smarter.

regards, tom lane

[1] https://www.postgresql.org/message-id/358cd54d-c018-60f8-7d76-55780eef6...@lab.ntt.co.jp
Re: speeding up planning with partitions
On 2019/04/23 7:08, Tom Lane wrote:
> Amit Langote writes:
>> On 2019/04/02 2:34, Tom Lane wrote:
>>> I'm not at all clear on what we think the interaction between
>>> enable_partition_pruning and constraint_exclusion ought to be,
>>> so I'm not sure what the appropriate resolution is here. Thoughts?
>
>> Prior to 428b260f87 (that is, in PG 11), partition pruning for UPDATE and
>> DELETE queries is realized by applying constraint exclusion to the
>> partition constraint of the target partition. The conclusion of the
>> discussion when adding the enable_partition_pruning GUC [1] was that
>> whether or not constraint exclusion is carried out (to facilitate
>> partition pruning) should be governed by the new GUC, not
>> constraint_exclusion, if only to avoid confusing users.
>
> I got back to thinking about how this ought to work.

Thanks a lot for taking the time to look at this.

> It appears to me
> that we've got half a dozen different behaviors that depend on one or both
> of these settings:
>
> 1. Use of ordinary table constraints (CHECK, NOT NULL) in baserel pruning,
> that is relation_excluded_by_constraints for baserels.
> This is enabled by constraint_exclusion = on.
>
> 2. Use of partition constraints in baserel pruning (applicable only
> when a partition is accessed directly).
> This is currently partly broken, and it's what your patch wants to
> change.

Yes. Any fix we come up with for this will need to be back-patched to
11, because it's a regression introduced in 11 when the then-new
partition pruning feature was committed (9fdb675fc).

> 3. Use of ordinary table constraints in appendrel pruning,
> that is relation_excluded_by_constraints for appendrel members.
> This is enabled by constraint_exclusion >= partition.
>
> 4. Use of partition constraints in appendrel pruning.
> This is enabled by the combination of enable_partition_pruning AND
> constraint_exclusion >= partition. However, it looks to me like this
> is now nearly if not completely useless because of #5.
>
> 5. Use of partition constraints in expand_partitioned_rtentry.
> Enabled by enable_partition_pruning (see prune_append_rel_partitions).

Right, #5 obviates #4.

> 6. Use of partition constraints in run-time partition pruning.
> This is also enabled by enable_partition_pruning, cf
> create_append_plan, create_merge_append_plan.
>
> Now in addition to what I mention above, there are assorted random
> differences in behavior depending on whether we are in an inherited
> UPDATE/DELETE or not. I consider these differences to be so bogus
> that I'm not even going to include them in this taxonomy; they should
> not exist. The UPDATE/DELETE target ought to act the same as a baserel.

The *partition* constraint of UPDATE/DELETE targets would never be
refuted by the query, because we process only those partition targets
that remain after applying partition pruning during the initial planning
of the query as if it were a SELECT. I'm saying we should distinguish
such targets as such when addressing #2.

Not sure if you'll like it but maybe we could ignore even regular
inheritance child targets that are proven to be empty (is_dummy_rel())
for a given query during the initial SELECT planning. That way, we can
avoid re-running relation_excluded_by_constraints() a second time for
*all* child target relations.

> I think this is ridiculously overcomplicated even without said random
> differences. I propose that we do the following:
>
> * Get rid of point 4 by not considering partition constraints for
> appendrel members in relation_excluded_by_constraints. It's just
> useless cycles in view of point 5, or nearly so. (Possibly there
> are corner cases where we could prove contradictions between a
> relation's partition constraints and regular constraints ... but is
> it really worth spending planner cycles to look for that?)

I guess not. If a partition's constraint contradicted its regular
constraints, there wouldn't be any data in such a partition to begin
with, no?

> * Make point 2 like point 1 by treating partition constraints for
> baserels like ordinary table constraints, ie, they are considered
> only when constraint_exclusion = on (independently of whether
> enable_partition_pruning is on).

Right, enable_partition_pruning should only apply to appendrel pruning.
If a partition is accessed directly, and is hence a baserel to the
planner, we only consider constraint_exclusion and perform exclusion
only if the setting is on. Another opinion on this is that we treat
partition constraints differently from regular constraints and ignore
the setting of constraint_exclusion, that is, always perform constraint
exclusion using partition constraints.

> * Treat an inherited UPDATE/DELETE target table as if it were an
> appendrel member for the purposes of relation_excluded_by_constraints,
> thus removing the behavioral differences between SELECT and UPDATE/DELETE.

As I mentioned above, planner encounters any given
Re: speeding up planning with partitions
Amit Langote writes:
> On 2019/04/02 2:34, Tom Lane wrote:
>> I'm not at all clear on what we think the interaction between
>> enable_partition_pruning and constraint_exclusion ought to be,
>> so I'm not sure what the appropriate resolution is here. Thoughts?

> Prior to 428b260f87 (that is, in PG 11), partition pruning for UPDATE and
> DELETE queries is realized by applying constraint exclusion to the
> partition constraint of the target partition. The conclusion of the
> discussion when adding the enable_partition_pruning GUC [1] was that
> whether or not constraint exclusion is carried out (to facilitate
> partition pruning) should be governed by the new GUC, not
> constraint_exclusion, if only to avoid confusing users.

I got back to thinking about how this ought to work. It appears to me
that we've got half a dozen different behaviors that depend on one or
both of these settings:

1. Use of ordinary table constraints (CHECK, NOT NULL) in baserel pruning,
that is relation_excluded_by_constraints for baserels.
This is enabled by constraint_exclusion = on.

2. Use of partition constraints in baserel pruning (applicable only
when a partition is accessed directly).
This is currently partly broken, and it's what your patch wants to
change.

3. Use of ordinary table constraints in appendrel pruning,
that is relation_excluded_by_constraints for appendrel members.
This is enabled by constraint_exclusion >= partition.

4. Use of partition constraints in appendrel pruning.
This is enabled by the combination of enable_partition_pruning AND
constraint_exclusion >= partition. However, it looks to me like this
is now nearly if not completely useless because of #5.

5. Use of partition constraints in expand_partitioned_rtentry.
Enabled by enable_partition_pruning (see prune_append_rel_partitions).

6. Use of partition constraints in run-time partition pruning.
This is also enabled by enable_partition_pruning, cf
create_append_plan, create_merge_append_plan.

Now in addition to what I mention above, there are assorted random
differences in behavior depending on whether we are in an inherited
UPDATE/DELETE or not. I consider these differences to be so bogus
that I'm not even going to include them in this taxonomy; they should
not exist. The UPDATE/DELETE target ought to act the same as a baserel.

I think this is ridiculously overcomplicated even without said random
differences. I propose that we do the following:

* Get rid of point 4 by not considering partition constraints for
appendrel members in relation_excluded_by_constraints. It's just
useless cycles in view of point 5, or nearly so. (Possibly there
are corner cases where we could prove contradictions between a
relation's partition constraints and regular constraints ... but is
it really worth spending planner cycles to look for that?)

* Make point 2 like point 1 by treating partition constraints for
baserels like ordinary table constraints, ie, they are considered
only when constraint_exclusion = on (independently of whether
enable_partition_pruning is on).

* Treat an inherited UPDATE/DELETE target table as if it were an
appendrel member for the purposes of relation_excluded_by_constraints,
thus removing the behavioral differences between SELECT and UPDATE/DELETE.

With this, constraint_exclusion would act pretty much as it
traditionally has, and in most cases would not have any special impact
on partitions compared to old-style inheritance. The behaviors that
enable_partition_pruning would control are expand_partitioned_rtentry
pruning and run-time pruning, neither of which have any applicability
to old-style inheritance.

Thoughts?

regards, tom lane
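Under the three-point proposal above, the decision of which constraints relation_excluded_by_constraints would examine reduces to a small table. The sketch below models just that policy as standalone C (the enum, struct, and function names are invented for illustration; this is not planner code):

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the three constraint_exclusion settings. */
typedef enum
{
	CE_OFF, CE_PARTITION, CE_ON
} ConstraintExclusionMode;

typedef struct ExclusionPolicy
{
	bool		check_table_constraints;	/* CHECK, NOT NULL */
	bool		check_partition_constraint;
} ExclusionPolicy;

/*
 * Proposed behavior: appendrel members (now including inherited
 * UPDATE/DELETE targets) use ordinary constraints at >= partition and
 * never re-check the partition constraint, since point 5 already used
 * it during expansion; directly-accessed partitions are plain baserels
 * whose partition constraint is treated like any table constraint,
 * gated by constraint_exclusion = on alone.
 */
static ExclusionPolicy
toy_exclusion_policy(ConstraintExclusionMode ce, bool appendrel_member)
{
	ExclusionPolicy p = {false, false};

	if (appendrel_member)
		p.check_table_constraints = (ce >= CE_PARTITION);
	else
	{
		p.check_table_constraints = (ce == CE_ON);
		p.check_partition_constraint = (ce == CE_ON);
	}
	return p;
}
```

Note that enable_partition_pruning does not appear at all: under the proposal it governs only expansion-time and run-time pruning, not this function.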
Re: speeding up planning with partitions
On 2019/04/11 14:03, David Rowley wrote:
> On Fri, 5 Apr 2019 at 19:50, Amit Langote wrote:
>> While we're on the topic of the relation between constraint exclusion and
>> partition pruning, I'd like to (re-) propose this documentation update
>> patch. The partitioning chapter in ddl.sgml says update/delete of
>> partitioned tables uses constraint exclusion internally to emulate
>> partition pruning, which is no longer true as of 428b260f8.
>
> Update-docs-that-update-delete-no-longer-use-cons.patch looks good to
> me. It should be changed as what the docs say is no longer true.

Thanks for the quick review. :)

Regards,
Amit
Re: speeding up planning with partitions
On Fri, 5 Apr 2019 at 19:50, Amit Langote wrote:
> While we're on the topic of the relation between constraint exclusion and
> partition pruning, I'd like to (re-) propose this documentation update
> patch. The partitioning chapter in ddl.sgml says update/delete of
> partitioned tables uses constraint exclusion internally to emulate
> partition pruning, which is no longer true as of 428b260f8.

Update-docs-that-update-delete-no-longer-use-cons.patch looks good to
me. It should be changed as what the docs say is no longer true.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: speeding up planning with partitions
On Fri, 5 Apr 2019 at 23:07, Amit Langote wrote:
>
> Hi,
>
> On 2019/04/05 18:13, Floris Van Nee wrote:
> > One unrelated thing I noticed (but I'm not sure if it's worth a separate
> > email thread) is that the changed default of jit=on in v12 doesn't work
> > very well with a large number of partitions at run-time, for which a large
> > number get excluded at run-time. A query that has an estimated cost above
> > jit_optimize_above_cost takes about 30 seconds to run (for a table with
> > 1000 partitions), because JIT is optimizing the full plan. Without JIT it's
> > barely 20ms (+400ms planning). I can give more details in a separate thread
> > if it's deemed interesting.
> >
> > Planning Time: 411.321 ms
> > JIT:
> >   Functions: 5005
> >   Options: Inlining false, Optimization true, Expressions true, Deforming true
> >   Timing: Generation 721.472 ms, Inlining 0.000 ms, Optimization 16312.195 ms, Emission 12533.611 ms, Total 29567.278 ms
>
> I've noticed a similar problem but in the context of interaction with
> parallel query mechanism. The planner, seeing that all partitions will be
> scanned (after failing to prune with clauses containing CURRENT_TIMESTAMP
> etc.), prepares a parallel plan (containing Parallel Append in this case).
> As you can imagine, parallel query initialization (Gather+workers) will
> take large amount of time relative to the time it will take to scan the
> partitions that remain after pruning (often just one).
>
> The problem in this case is that the planner is oblivious to the
> possibility of partition pruning occurring during execution, which may be
> common to both parallel query and JIT. If it wasn't oblivious, it
> would've set the cost of pruning-capable Append such that parallel query
> and/or JIT won't be invoked. We are going to have to fix that sooner or
> later.

Robert and I had a go at discussing this in [1]. Some ideas were thrown
around in the nature of contorting the Append/MergeAppend's total_cost
in a similar way to how clauselist_selectivity does its estimates for
unknown values. Perhaps it is possible to actually multiply the
total_cost by the clauselist_selectivity for the run-time pruning quals.
That's pretty crude and highly unusual, but it's probably going to give
something more sane than what's there today. The run-time prune quals
would likely need to be determined earlier than createplan.c for that to
work, though. IIRC, the reason it was done there is that, at the time,
there wasn't a need to do it per path.

I don't really have any better ideas right now, so if someone does then
maybe we should take it up on a new thread. It would be good to leave
this thread alone for unrelated things. It's long enough already.

[1] https://www.postgresql.org/message-id/CA%2BTgmobhXJGMuHxKjbaKcEJXacxVZHG4%3DhEGFfPF_FrGt37T_Q%40mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
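The crude costing idea mentioned above could look something like this back-of-envelope sketch. It is purely illustrative (invented function name, and real Append/MergeAppend costing involves far more than one multiplication):

```c
#include <assert.h>

/*
 * Scale an Append/MergeAppend total cost by an estimated selectivity of
 * its run-time pruning quals, analogous in spirit to how
 * clauselist_selectivity guesses for unknown comparison values.
 */
static double
toy_runtime_pruned_cost(double total_cost, double prune_selectivity)
{
	/* clamp, as selectivity estimates conventionally are */
	if (prune_selectivity < 0.0)
		prune_selectivity = 0.0;
	if (prune_selectivity > 1.0)
		prune_selectivity = 1.0;
	return total_cost * prune_selectivity;
}
```

With 1000 similar subplans and a selectivity estimate suggesting roughly one partition survives, the scaled cost lands near 1/1000 of the naive total — crude, but closer to the truth than charging for every subplan the executor will never touch.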
Re: speeding up planning with partitions
Hi,

On 2019/04/05 18:13, Floris Van Nee wrote:
> One unrelated thing I noticed (but I'm not sure if it's worth a separate
> email thread) is that the changed default of jit=on in v12 doesn't work very
> well with a large number of partitions at run-time, for which a large number
> get excluded at run-time. A query that has an estimated cost above
> jit_optimize_above_cost takes about 30 seconds to run (for a table with 1000
> partitions), because JIT is optimizing the full plan. Without JIT it's barely
> 20ms (+400ms planning). I can give more details in a separate thread if it's
> deemed interesting.
>
> Planning Time: 411.321 ms
> JIT:
>   Functions: 5005
>   Options: Inlining false, Optimization true, Expressions true, Deforming true
>   Timing: Generation 721.472 ms, Inlining 0.000 ms, Optimization 16312.195 ms, Emission 12533.611 ms, Total 29567.278 ms

I've noticed a similar problem, but in the context of interaction with
the parallel query mechanism. The planner, seeing that all partitions
will be scanned (after failing to prune with clauses containing
CURRENT_TIMESTAMP etc.), prepares a parallel plan (containing Parallel
Append in this case). As you can imagine, parallel query initialization
(Gather + workers) will take a large amount of time relative to the time
it will take to scan the partitions that remain after pruning (often
just one).

The problem in this case is that the planner is oblivious to the
possibility of partition pruning occurring during execution, which may
be common to both parallel query and JIT. If it weren't oblivious, it
would've set the cost of a pruning-capable Append such that parallel
query and/or JIT won't be invoked. We are going to have to fix that
sooner or later.

Thanks,
Amit
Re: speeding up planning with partitions
Thanks for the details! Indeed, the versions with now()/current_date use
run-time pruning rather than planning-time pruning. I wasn't aware of
the use of 'today' though - that could be useful in cases where we're
sure statements won't be prepared.

Previously (v10 / partly v11) it was necessary to make sure that
statements on partitioned tables were never prepared, because run-time
pruning wasn't available - using a generic plan was almost always a bad
option. Now in v12 it seems to be a tradeoff based on whether or not
run-time pruning can occur. If pruning is possible at planning time,
it's probably still better not to prepare statements, whereas if
run-time pruning has to occur, it's better to prepare them.

One unrelated thing I noticed (but I'm not sure if it's worth a separate
email thread) is that the changed default of jit=on in v12 doesn't work
very well with a large number of partitions of which a large number get
excluded at run-time. A query that has an estimated cost above
jit_optimize_above_cost takes about 30 seconds to run (for a table with
1000 partitions), because JIT is optimizing the full plan. Without JIT
it's barely 20ms (+400ms planning). I can give more details in a
separate thread if it's deemed interesting.

Planning Time: 411.321 ms
JIT:
  Functions: 5005
  Options: Inlining false, Optimization true, Expressions true, Deforming true
  Timing: Generation 721.472 ms, Inlining 0.000 ms, Optimization 16312.195 ms, Emission 12533.611 ms, Total 29567.278 ms

-Floris
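The interaction described above can be modeled very simply: the planner's total cost for the Append counts every subplan, and that whole-plan total is what gets compared against jit_optimize_above_cost, even though run-time pruning will discard most subplans before execution. A toy illustration (invented names; the 500000 threshold below is an assumed example value, not necessarily the shipped default):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sum per-partition subplan costs; the planner charges for all of
 * them, oblivious to any run-time pruning that may happen later.
 */
static double
toy_append_total_cost(const double *subplan_costs, int nsubplans)
{
	double		total = 0.0;

	for (int i = 0; i < nsubplans; i++)
		total += subplan_costs[i];
	return total;
}

/* JIT optimization is keyed off the whole plan's estimated cost. */
static bool
toy_jit_optimize_triggered(double plan_cost, double optimize_above_cost)
{
	return plan_cost > optimize_above_cost;
}
```

So 1000 modest subplans can push the estimate over the threshold and trigger expensive optimization of thousands of functions, even when the executor ends up running only one subplan.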
Re: speeding up planning with partitions
On 2019/04/02 14:50, Amit Langote wrote:
> Attached patch is only for HEAD this time. I'll post one for PG 11 (if
> you'd like) once we reach consensus on what the best thing to do here is.

While we're on the topic of the relation between constraint exclusion
and partition pruning, I'd like to (re-) propose this documentation
update patch. The partitioning chapter in ddl.sgml says update/delete of
partitioned tables uses constraint exclusion internally to emulate
partition pruning, which is no longer true as of 428b260f8.

The v2-0001 patch hasn't changed.

Thanks,
Amit

>From 336963d5f08a937c0890e794553dc23aced1fca1 Mon Sep 17 00:00:00 2001
From: amit
Date: Fri, 5 Apr 2019 15:41:11 +0900
Subject: [PATCH v3 2/2] Update docs that update/delete no longer use
 constraint exclusion

---
 doc/src/sgml/ddl.sgml | 18 ++
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 1fe27c5da9..33012939b8 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -4535,26 +4535,12 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate = DATE '2008-01-01';
      setting.
-
-     Currently, pruning of partitions during the planning of an
-     UPDATE or DELETE command is
-     implemented using the constraint exclusion method (however, it is
-     controlled by the enable_partition_pruning rather than
-     constraint_exclusion) see the following section
-     for details and caveats that apply.
-
      Execution-time partition pruning currently only occurs for the
      Append and MergeAppend node types. It is not yet implemented
      for the ModifyTable node
-     type.
-
-     Both of these behaviors are likely to be changed in a future release
-     of PostgreSQL.
+     type, but that is likely to be changed in a future release of
+     PostgreSQL.

--
2.11.0

>From 5549e6caae79259032e844812804529ffbf0d321 Mon Sep 17 00:00:00 2001
From: amit
Date: Tue, 2 Apr 2019 14:54:08 +0900
Subject: [PATCH v3 1/2] Fix partition constraint loading in planner

---
 src/backend/optimizer/plan/planner.c          |   5 +-
 src/backend/optimizer/util/plancat.c          | 106 +-
 src/test/regress/expected/partition_prune.out |  48
 src/test/regress/sql/partition_prune.sql      |  19 +
 4 files changed, 121 insertions(+), 57 deletions(-)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e2cdc83613..1f6bd142b7 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1513,8 +1513,9 @@ inheritance_planner(PlannerInfo *root)
 	parent_rte->securityQuals = NIL;

 	/*
-	 * Mark whether we're planning a query to a partitioned table or an
-	 * inheritance parent.
+	 * HACK: setting this to a value other than INHKIND_NONE signals to
+	 * relation_excluded_by_constraints() to process the result relation as
+	 * a partition; see that function for more details.
 	 */
 	subroot->inhTargetKind = (rootRelation != 0) ?
 		INHKIND_PARTITIONED : INHKIND_INHERITED;

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 3301331304..28923f805e 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -66,7 +66,7 @@ static void get_relation_foreign_keys(PlannerInfo *root, RelOptInfo *rel,
 static bool infer_collation_opclass_match(InferenceElem *elem, Relation idxRel,
 					List *idxExprs);
 static List *get_relation_constraints(PlannerInfo *root,
-					Oid relationObjectId, RelOptInfo *rel,
+					Relation relation, RelOptInfo *rel,
 					bool include_notnull);
 static List *build_index_tlist(PlannerInfo *root, IndexOptInfo *index,
 					Relation heapRelation);
@@ -1150,19 +1150,13 @@ get_relation_data_width(Oid relid, int32 *attr_widths)
  */
 static List *
 get_relation_constraints(PlannerInfo *root,
-			 Oid relationObjectId, RelOptInfo *rel,
+			 Relation relation, RelOptInfo *rel,
 			 bool include_notnull)
 {
 	List	   *result = NIL;
 	Index		varno = rel->relid;
-	Relation	relation;
 	TupleConstr *constr;

-	/*
-	 * We assume the relation has already been safely locked.
-	 */
-	relation = table_open(relationObjectId, NoLock);
-
 	constr = relation->rd_att->constr;
 	if (constr != NULL)
 	{
@@ -1242,38 +1236,6 @@
Re: speeding up planning with partitions
On 2019/04/05 12:18, David Rowley wrote: > On Fri, 5 Apr 2019 at 16:09, Amit Langote > wrote: >> Although, we still have ways >> to go in terms of scaling generic plan execution to larger partition >> counts, solution(s) for which have been proposed by David but haven't made >> it into master yet. > > Is that a reference to the last paragraph in [1]? That idea has not > gone beyond me writing that text yet! :-( It was more of a passing > comment on the only way I could think of to solve the problem. > > [1] > https://www.postgresql.org/message-id/CAKJS1f-y1HQK+VjG7=C==vGcLnzxjN8ysD5NmaN8Wh4=vsy...@mail.gmail.com Actually, I meant to refer to the following: https://commitfest.postgresql.org/22/1897/ Of course, we should pursue all available options. :) Thanks, Amit
Re: speeding up planning with partitions
On Fri, 5 Apr 2019 at 16:09, Amit Langote wrote: > Although, we still have ways > to go in terms of scaling generic plan execution to larger partition > counts, solution(s) for which have been proposed by David but haven't made > it into master yet. Is that a reference to the last paragraph in [1]? That idea has not gone beyond me writing that text yet! :-( It was more of a passing comment on the only way I could think of to solve the problem. [1] https://www.postgresql.org/message-id/CAKJS1f-y1HQK+VjG7=C==vGcLnzxjN8ysD5NmaN8Wh4=vsy...@mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: speeding up planning with partitions
On 2019/04/05 6:59, David Rowley wrote:
> On Fri, 5 Apr 2019 at 07:33, Floris Van Nee wrote:
>> I had a question about the performance of pruning of functions like now()
>> and current_date. I know these are handled differently, as they cannot be
>> excluded during the first phases of planning. However, currently, this new
>> patch makes the performance difference between the static timestamp variant
>> and now() very obvious (even more than before). Consider
>> select * from partitioned_table where ts >= now()
>> or
>> select * from partitioned_table where ts >= '2019-04-04'
>>
>> The second plans in less than a millisecond, whereas the first takes +-
>> 180ms for a table with 1000 partitions. Both end up with the same plan.
>
> The patch here only aims to improve the performance of queries to
> partitioned tables when partitions can be pruned during planning. The
> now() version of the query is unable to do that since we don't know
> what that value will be during the execution of the query. In that
> version, you're most likely seeing "Subplans Removed: ". This means
> run-time pruning did some pruning and the planner generated subplans
> for what you see plus others. Since planning for all partitions is
> still slow, you're getting a larger performance difference than
> before, but only due to the fact that the other plan is now faster to
> generate.

Yeah, the time for generating a plan for a query that *can* use pruning but
not during planning is still very much dependent on the number of
partitions, because access plans must be created for all partitions, even
if only one of those plans will actually be used and the rest pruned away
during execution.

> If you're never using prepared statements,

Or if using prepared statements is an option, the huge planning cost
mentioned above need not be paid repeatedly.
Although, we still have ways to go in terms of scaling generic plan execution to larger partition counts, solution(s) for which have been proposed by David but haven't made it into master yet. Thanks, Amit
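The prepared-statement route mentioned above can be sketched as follows (the table and column names are purely illustrative, not taken from the thread):

```sql
-- Plan once, then reuse: the parameter value is only supplied at
-- execution time, so run-time pruning selects the partitions, and the
-- per-execution planning cost is avoided once a generic plan is reused.
prepare q (timestamptz) as
  select * from partitioned_table where ts >= $1;

execute q (now());
```

Note, though, the caveat in the same message: generic-plan *execution* itself still scales poorly with large partition counts at this point, so this only helps up to a point.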
Re: speeding up planning with partitions
On Fri, 5 Apr 2019 at 07:33, Floris Van Nee wrote:
> I had a question about the performance of pruning of functions like now() and
> current_date. I know these are handled differently, as they cannot be
> excluded during the first phases of planning. However, currently, this new
> patch makes the performance difference between the static timestamp variant
> and now() very obvious (even more than before). Consider
> select * from partitioned_table where ts >= now()
> or
> select * from partitioned_table where ts >= '2019-04-04'
>
> The second plans in less than a millisecond, whereas the first takes +- 180ms
> for a table with 1000 partitions. Both end up with the same plan.

The patch here only aims to improve the performance of queries to partitioned tables when partitions can be pruned during planning. The now() version of the query is unable to do that since we don't know what that value will be during the execution of the query. In that version, you're most likely seeing "Subplans Removed: ". This means run-time pruning did some pruning and the planner generated subplans for what you see plus others. Since planning for all partitions is still slow, you're getting a larger performance difference than before, but only due to the fact that the other plan is now faster to generate.

If you're never using prepared statements, i.e., always planning right before execution, then you might want to consider using "where ts >= 'today'::timestamp". This will evaluate to the current date during parse and make the value available to the planner. You'll need to be pretty careful with that though, as if you do prepare queries or change to do that in the future then the bugs in your application could be very subtle and only do the wrong thing just after midnight on some day when the current time progresses over your partition boundary.

-- 
David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
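The three predicate variants being compared in this exchange, side by side (table and column names are only illustrative):

```sql
select * from partitioned_table where ts >= now();               -- stable function: prunable at run time only
select * from partitioned_table where ts >= '2019-04-04';        -- constant: prunable at plan time
select * from partitioned_table where ts >= 'today'::timestamp;  -- evaluated at parse time, so also prunable at
                                                                 -- plan time (but see the caveat above about
                                                                 -- prepared statements and midnight rollover)
```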
Re: speeding up planning with partitions
Hi all,

First of all I would like to thank everyone involved in this patch for their hard work on this. This is a big step forward. I've done some performance and functionality testing with the patch that was committed to master and it looks very good.

I had a question about the performance of pruning of functions like now() and current_date. I know these are handled differently, as they cannot be excluded during the first phases of planning. However, currently, this new patch makes the performance difference between the static timestamp variant and now() very obvious (even more than before). Consider

select * from partitioned_table where ts >= now()
or
select * from partitioned_table where ts >= '2019-04-04'

The second plans in less than a millisecond, whereas the first takes +- 180ms for a table with 1000 partitions. Both end up with the same plan.

I'm not too familiar with the code that handles this, but is there a possibility for improvement in this area? Or is the stage at which exclusion for now()/current_date occurs already too far in the process to make any good improvements to this? My apologies if this is considered off-topic for this patch, but I ran into this issue specifically when I was testing this patch, so I thought I'd ask here about it. I do think a large number of use-cases for tables with a large number of partitions involve a timestamp for partition key, and naturally people will start writing queries for this that use functions such as now() and current_date.

Thanks again for your work on this patch!

-Floris
Re: speeding up planning with partitions
Thanks for taking a look. On 2019/04/02 2:34, Tom Lane wrote: > Amit Langote writes: >> On 2019/03/30 0:29, Tom Lane wrote: >>> That seems like probably an independent patch --- do you want to write it? > >> Here is that patch. >> It revises get_relation_constraints() such that the partition constraint >> is loaded in only the intended cases. > > So I see the problem you're trying to solve here, but I don't like this > patch a bit, because it depends on root->inhTargetKind which IMO is a > broken bit of junk that we need to get rid of. Here is an example of > why, with this patch applied: > > regression=# create table p (a int) partition by list (a); > CREATE TABLE > regression=# create table p1 partition of p for values in (1); > CREATE TABLE > regression=# set constraint_exclusion to on; > SET > regression=# explain select * from p1 where a = 2; > QUERY PLAN > -- > Result (cost=0.00..0.00 rows=0 width=0) >One-Time Filter: false > (2 rows) > > So far so good, but watch what happens when we include the same case > in an UPDATE on some other partitioned table: > > regression=# create table prtab (a int, b int) partition by list (a); > CREATE TABLE > regression=# create table prtab2 partition of prtab for values in (2); > CREATE TABLE > regression=# explain update prtab set b=b+1 from p1 where prtab.a=p1.a and > p1.a=2; > QUERY PLAN > --- > Update on prtab (cost=0.00..82.30 rows=143 width=20) >Update on prtab2 >-> Nested Loop (cost=0.00..82.30 rows=143 width=20) > -> Seq Scan on p1 (cost=0.00..41.88 rows=13 width=10) >Filter: (a = 2) > -> Materialize (cost=0.00..38.30 rows=11 width=14) >-> Seq Scan on prtab2 (cost=0.00..38.25 rows=11 width=14) > Filter: (a = 2) > (8 rows) > > No constraint exclusion, while in v10 you get > > Update on prtab (cost=0.00..0.00 rows=0 width=0) >-> Result (cost=0.00..0.00 rows=0 width=0) > One-Time Filter: false > > The reason is that this logic supposes that root->inhTargetKind describes > *all* partitioned tables in the query, which is 
obviously wrong. > > Now maybe we could make it work by doing something like > > if (rel->reloptkind == RELOPT_BASEREL && > (root->inhTargetKind == INHKIND_NONE || > rel->relid != root->parse->resultRelation)) Ah, you're right. inhTargetKind has to be checked in conjunction with checking whether the relation is the target relation. > but I find that pretty messy, plus it's violating the concept that we > shouldn't be allowing messiness from inheritance_planner to leak into > other places. I'm afraid that we'll have to live with this particular hack as long as we have inheritance_planner(), but we maybe could somewhat reduce the extent to which the hack is spread into other planner files. How about we move the part of get_relation_constraints() that loads the partition constraint to its only caller relation_excluded_by_constraints()? If we do that, all uses of root->inhTargetKind will be confined to one place. Attached updated patch does that. > What I'd rather do is have this test just read > > if (rel->reloptkind == RELOPT_BASEREL) > > Making it be that way causes some changes in the partition_prune results, > as attached, which suggest that removing the enable_partition_pruning > check as you did wasn't such a great idea either. However, if I add > that back in, then it breaks the proposed new regression test case. > > I'm not at all clear on what we think the interaction between > enable_partition_pruning and constraint_exclusion ought to be, > so I'm not sure what the appropriate resolution is here. Thoughts? Prior to 428b260f87 (that is, in PG 11), partition pruning for UPDATE and DELETE queries is realized by applying constraint exclusion to the partition constraint of the target partition. The conclusion of the discussion when adding the enable_partition_pruning GUC [1] was that whether or not constraint exclusion is carried out (to facilitate partition pruning) should be governed by the new GUC, not constraint_exclusion, if only to avoid confusing users. 
428b260f87 has obviated the need to check enable_partition_pruning in relation_excluded_by_constraints(), because inheritance_planner() now runs the query as if it were SELECT, which does partition pruning using partprune.c, governed by the setting of enable_partition_pruning. So, there's no need to check it again in relation_excluded_by_constraints(), because we won't be consulting the partition constraint again; well, at least after applying the proposed patch. > BTW, just about all the other uses of root->inhTargetKind seem equally > broken from here; none of them are accounting for whether the rel in
Re: speeding up planning with partitions
Amit Langote writes: > On 2019/03/30 0:29, Tom Lane wrote: >> That seems like probably an independent patch --- do you want to write it? > Here is that patch. > It revises get_relation_constraints() such that the partition constraint > is loaded in only the intended cases. So I see the problem you're trying to solve here, but I don't like this patch a bit, because it depends on root->inhTargetKind which IMO is a broken bit of junk that we need to get rid of. Here is an example of why, with this patch applied: regression=# create table p (a int) partition by list (a); CREATE TABLE regression=# create table p1 partition of p for values in (1); CREATE TABLE regression=# set constraint_exclusion to on; SET regression=# explain select * from p1 where a = 2; QUERY PLAN -- Result (cost=0.00..0.00 rows=0 width=0) One-Time Filter: false (2 rows) So far so good, but watch what happens when we include the same case in an UPDATE on some other partitioned table: regression=# create table prtab (a int, b int) partition by list (a); CREATE TABLE regression=# create table prtab2 partition of prtab for values in (2); CREATE TABLE regression=# explain update prtab set b=b+1 from p1 where prtab.a=p1.a and p1.a=2; QUERY PLAN --- Update on prtab (cost=0.00..82.30 rows=143 width=20) Update on prtab2 -> Nested Loop (cost=0.00..82.30 rows=143 width=20) -> Seq Scan on p1 (cost=0.00..41.88 rows=13 width=10) Filter: (a = 2) -> Materialize (cost=0.00..38.30 rows=11 width=14) -> Seq Scan on prtab2 (cost=0.00..38.25 rows=11 width=14) Filter: (a = 2) (8 rows) No constraint exclusion, while in v10 you get Update on prtab (cost=0.00..0.00 rows=0 width=0) -> Result (cost=0.00..0.00 rows=0 width=0) One-Time Filter: false The reason is that this logic supposes that root->inhTargetKind describes *all* partitioned tables in the query, which is obviously wrong. 
Now maybe we could make it work by doing something like

if (rel->reloptkind == RELOPT_BASEREL &&
    (root->inhTargetKind == INHKIND_NONE ||
     rel->relid != root->parse->resultRelation))

but I find that pretty messy, plus it's violating the concept that we shouldn't be allowing messiness from inheritance_planner to leak into other places. What I'd rather do is have this test just read

if (rel->reloptkind == RELOPT_BASEREL)

Making it be that way causes some changes in the partition_prune results, as attached, which suggest that removing the enable_partition_pruning check as you did wasn't such a great idea either. However, if I add that back in, then it breaks the proposed new regression test case.

I'm not at all clear on what we think the interaction between enable_partition_pruning and constraint_exclusion ought to be, so I'm not sure what the appropriate resolution is here. Thoughts?

BTW, just about all the other uses of root->inhTargetKind seem equally broken from here; none of them are accounting for whether the rel in question is the query target.

regards, tom lane

diff -U3 /home/postgres/pgsql/src/test/regress/expected/partition_prune.out /home/postgres/pgsql/src/test/regress/results/partition_prune.out
--- /home/postgres/pgsql/src/test/regress/expected/partition_prune.out	2019-04-01 12:39:52.613109088 -0400
+++ /home/postgres/pgsql/src/test/regress/results/partition_prune.out	2019-04-01 13:18:02.852615395 -0400
@@ -3409,24 +3409,18 @@
 --------------------------
  Update on pp_lp
    Update on pp_lp1
-   Update on pp_lp2
    ->  Seq Scan on pp_lp1
          Filter: (a = 1)
-   ->  Seq Scan on pp_lp2
-         Filter: (a = 1)
-(7 rows)
+(4 rows)
 
 explain (costs off) delete from pp_lp where a = 1;
        QUERY PLAN
 --------------------------
  Delete on pp_lp
    Delete on pp_lp1
-   Delete on pp_lp2
    ->  Seq Scan on pp_lp1
          Filter: (a = 1)
-   ->  Seq Scan on pp_lp2
-         Filter: (a = 1)
-(7 rows)
+(4 rows)
 
 set constraint_exclusion = 'off'; -- this should not affect the result.
 explain (costs off) select * from pp_lp where a = 1;
@@ -3444,24 +3438,18 @@
 --------------------------
  Update on pp_lp
    Update on pp_lp1
-   Update on pp_lp2
    ->  Seq Scan on pp_lp1
          Filter: (a = 1)
-   ->  Seq Scan on pp_lp2
-         Filter: (a = 1)
-(7 rows)
+(4 rows)
 
 explain (costs off) delete from pp_lp where a = 1;
        QUERY PLAN
 --------------------------
  Delete on pp_lp
    Delete on pp_lp1
-   Delete on pp_lp2
    ->  Seq Scan on pp_lp1
          Filter: (a = 1)
-   ->  Seq Scan on pp_lp2
-         Filter: (a = 1)
-(7 rows)
+(4 rows)
 
 drop table pp_lp;
 -- Ensure enable_partition_prune does not affect non-partitioned tables.
Re: speeding up planning with partitions
Amit Langote writes: > On 2019/04/01 3:46, Tom Lane wrote: >> One thing that I intentionally left out of the committed patch was changes >> to stop short of scanning the whole simple_rel_array when looking only for >> baserels. I thought that had been done in a rather piecemeal fashion >> and it'd be better to address it holistically, which I've now done in the >> attached proposed patch. >> I have not done any performance testing to see if this is actually >> worth the trouble, though. Anybody want to do that? > Thanks for creating the patch. > I spent some time looking for cases where this patch would provide > recognizable benefit, but couldn't find one. Yeah, I was afraid of that. In cases where we do have a ton of otherrels, the processing that's actually needed on them would probably swamp any savings from this patch. The only place where that might possibly not be true is create_lateral_join_info, since that has nested loops that could potentially impose an O(N^2) cost. However, since your patch went in, that runs before inheritance expansion anyway. So this probably isn't worth even the minuscule cost it imposes. regards, tom lane
Re: speeding up planning with partitions
On 2019/03/30 0:29, Tom Lane wrote:
> Amit Langote writes:
>> Finally, it's not in the patch, but how about visiting
>> get_relation_constraints() for revising this block of code:
>
> That seems like probably an independent patch --- do you want to write it?

Here is that patch.

It revises get_relation_constraints() such that the partition constraint is loaded in only the intended cases. To summarize:

* PG 11 currently misses one such intended case (select * from partition), causing a *bug* that constraint exclusion fails to exclude the partition with constraint_exclusion = on

* HEAD loads the partition constraint even in some cases where 428b260f87 rendered doing that unnecessary

Thanks,
Amit

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 31a3784536..b7ae063585 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1250,11 +1250,15 @@ get_relation_constraints(PlannerInfo *root,
 	/*
 	 * Append partition predicates, if any.
 	 *
-	 * For selects, partition pruning uses the parent table's partition bound
-	 * descriptor, instead of constraint exclusion which is driven by the
-	 * individual partition's partition constraint.
+	 * If the partition is accessed indirectly via its parent table, partition
+	 * pruning is performed with the parent table's partition bound, so there
+	 * is no need to include the partition constraint in that case.  However,
+	 * if the partition is referenced directly in the query and we're not
+	 * being called from inheritance_planner(), then no partition pruning
+	 * would have occurred, so we'll include it in that case.
 	 */
-	if (enable_partition_pruning && root->parse->commandType != CMD_SELECT)
+	if (rel->reloptkind == RELOPT_BASEREL &&
+		root->inhTargetKind == INHKIND_NONE)
 	{
 		List	   *pcqual = RelationGetPartitionQual(relation);
 
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 7806ba1d47..0bc0ed8042 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3643,4 +3643,44 @@ select * from listp where a = (select 2) and b <> 10;
    ->  Result (never executed)
 (4 rows)
 
+--
+-- check that a partition directly accessed in a query is excluded with
+-- constraint_exclusion = on
+--
+-- turn off partition pruning, so that it doesn't interfere
+set enable_partition_pruning to off;
+-- constraint exclusion doesn't apply
+set constraint_exclusion to 'partition';
+explain (costs off) select * from listp1 where a = 2;
+     QUERY PLAN
+--------------------
+ Seq Scan on listp1
+   Filter: (a = 2)
+(2 rows)
+
+explain (costs off) select * from listp2 where a = 1;
+      QUERY PLAN
+-----------------------
+ Seq Scan on listp2_10
+   Filter: (a = 1)
+(2 rows)
+
+-- constraint exclusion applies
+set constraint_exclusion to 'on';
+explain (costs off) select * from listp1 where a = 2;
+        QUERY PLAN
+--------------------------
+ Result
+   One-Time Filter: false
+(2 rows)
+
+explain (costs off) select * from listp2 where a = 1;
+        QUERY PLAN
+--------------------------
+ Result
+   One-Time Filter: false
+(2 rows)
+
+reset constraint_exclusion;
+reset enable_partition_pruning;
 drop table listp;
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index 2e4d2b483d..cc3c497238 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -990,4 +990,22 @@ create table listp2_10 partition of listp2 for values in (10);
 explain (analyze, costs off, summary off, timing off)
 select * from listp where a = (select 2) and b <> 10;
 
+--
+-- check that a partition directly accessed in a query is excluded with
+-- constraint_exclusion = on
+--
+
+-- turn off partition pruning, so that it doesn't interfere
+set enable_partition_pruning to off;
+
+-- constraint exclusion doesn't apply
+set constraint_exclusion to 'partition';
+explain (costs off) select * from listp1 where a = 2;
+explain (costs off) select * from listp2 where a = 1;
+-- constraint exclusion applies
+set constraint_exclusion to 'on';
+explain (costs off) select * from listp1 where a = 2;
+explain (costs off) select * from listp2 where a = 1;
+reset constraint_exclusion;
+reset enable_partition_pruning;
 drop table listp;
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 8369e3ad62..8428fe37bb 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1269,10 +1269,14 @@ get_relation_constraints(PlannerInfo *root,
 	/*
 	 * Append partition predicates, if any.
 	 *
-	 * For selects, partition pruning uses the parent table's partition bound
-	 *
Re: speeding up planning with partitions
On 2019/04/01 3:46, Tom Lane wrote:
> One thing that I intentionally left out of the committed patch was changes
> to stop short of scanning the whole simple_rel_array when looking only for
> baserels. I thought that had been done in a rather piecemeal fashion
> and it'd be better to address it holistically, which I've now done in the
> attached proposed patch.
>
> This probably makes little if any difference in the test cases we've
> mostly focused on in this thread, since there wouldn't be very many
> otherrels anyway now that we don't create them for pruned partitions.
> However, in a case where there's a lot of partitions that we can't prune,
> this could be useful.
>
> I have not done any performance testing to see if this is actually
> worth the trouble, though. Anybody want to do that?

Thanks for creating the patch.

I spent some time looking for cases where this patch would provide recognizable benefit, but couldn't find one. As one would suspect, it's hard to notice it if only looking at the overall latency of the query, because time spent doing other things with such plans tends to be pretty huge anyway (both in the planner itself and other parts of the backend). I even devised a query on a partitioned table such that the planner has to process all partitions, but ExecInitAppend can prune all but one, thus reducing the time spent in the executor, but still wasn't able to see an improvement in the overall latency of the query due to the planner no longer looping over the long simple_rel_array.

Thanks,
Amit
Re: speeding up planning with partitions
(I've closed the CF entry: https://commitfest.postgresql.org/22/1778/)

On 2019/04/01 2:04, Tom Lane wrote:
> Amit Langote writes:
>> On Sun, Mar 31, 2019 at 11:45 AM Imai Yoshikazu wrote:
>>> Certainly, using bitmapset contributes to the performance when scanning
>>> one partition(few partitions) from large partitions.
>
>> Thanks Imai-san for testing.
>
> I tried to replicate these numbers with the code as-committed, and
> could not.

Thanks for that.

> What I get, using the same table-creation code as you
> posted and a pgbench script file like
>
> \set param random(1, :N)
> select * from rt where a = :param;
>
> is scaling like this:
>
> N       tps, range      tps, hash
>
> 2       10520.519932    10415.230400
> 8       10443.361457    10480.987665
> 32      10341.196768    10462.551167
> 128     10370.953849    10383.885128
> 512     10207.578413    10214.049394
> 1024    10042.794340    10121.683993
> 4096    8937.561825     9214.993778
> 8192    8247.614040     8486.728918
>
> If I use "-M prepared" the numbers go up a bit for lower N, but
> drop at high N:
>
> N       tps, range      tps, hash
>
> 2       11449.920527    11462.253871
> 8       11530.513146    11470.812476
> 32      11372.412999    11450.213753
> 128     11289.351596    11322.698856
> 512     11095.428451    11200.683771
> 1024    10757.646108    10805.052480
> 4096    8689.165875     8930.690887
> 8192    7301.609147     7502.806455
>
> Digging into that, it seems like the degradation with -M prepared is
> mostly in LockReleaseAll's hash_seq_search over the locallock hash table.
> What I think must be happening is that with -M prepared, at some point the
> plancache decides to try a generic plan, which causes opening/locking all
> the partitions, resulting in permanent bloat in the locallock hash table.
> We immediately go back to using custom plans, but hash_seq_search has
> more buckets to look through for the remainder of the process' lifetime.
Ah, we did find this to be a problem upthread [1] and Tsunakawa-san then even posted a patch which is being discussed at:

https://commitfest.postgresql.org/22/1993/

> I do see some cycles getting spent in apply_scanjoin_target_to_paths
> that look to be due to scanning over the long part_rels array,
> which your proposal would ameliorate. But (a) that's pretty small
> compared to other effects, and (b) IMO, apply_scanjoin_target_to_paths
> is a remarkable display of brute force inefficiency to begin with.
> I think we should see if we can't nuke that function altogether in
> favor of generating the paths with the right target the first time.

That's an option if we can make it work. Shouldn't we look at *all* of the places that have code that now looks like this:

    for (i = 0; i < rel->nparts; i++)
    {
        RelOptInfo *partrel = rel->part_rels[i];

        if (partrel == NULL)
            continue;

        ...
    }

Besides apply_scanjoin_target_to_paths(), there are:

create_partitionwise_grouping_paths()
make_partitionedrel_pruneinfo()

> BTW, the real elephant in the room is the O(N^2) cost of creating
> these tables in the first place. The runtime for the table-creation
> scripts looks like
>
> N       range       hash
>
> 2       0m0.011s    0m0.011s
> 8       0m0.015s    0m0.014s
> 32      0m0.032s    0m0.030s
> 128     0m0.132s    0m0.099s
> 512     0m0.969s    0m0.524s
> 1024    0m3.306s    0m1.442s
> 4096    0m46.058s   0m15.522s
> 8192    3m11.995s   0m58.720s
>
> This seems to be down to the expense of doing RelationBuildPartitionDesc
> to rebuild the parent's relcache entry for each child CREATE TABLE.
> Not sure we can avoid that, but maybe we should consider adopting a
> cheaper-to-read representation of partition descriptors. The fact that
> range-style entries seem to be 3X more expensive to load than hash-style
> entries is strange.

I've noticed this many times too, but never prioritized doing something about it. I'll try sometime.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/CAKJS1f-dn1hDZqObwdMrYdV7-cELJwWCPRWet6EQX_WaV8JLgw%40mail.gmail.com
Re: speeding up planning with partitions
On Sun, 31 Mar 2019 at 05:50, Robert Haas wrote:
>
> On Sat, Mar 30, 2019 at 12:16 PM Amit Langote wrote:
> > Fwiw, I had complained when reviewing the run-time pruning patch that
> > creating those maps in the planner and putting them in
> > PartitionPruneInfo might not be a good idea, but David insisted that
> > it'd be good for performance (in the context of using cached plans) to
> > compute this information during planning.
>
> Well, he's not wrong about that, I expect.

I'm aware that there have been objections to either having these arrays and/or to editing them during execution. I don't recall Amit's complaint, but I do recall Tom's. He suggested we not resequence the arrays in the executor and just maintain NULL elements in the Append/MergeAppend subplans. I did consider this when writing run-time pruning but found that performance suffers. I demonstrated this on a thread somewhere.

IIRC, I wrote this code because there was no way to translate the result of the pruning code into Append/MergeAppend subplan indexes. Robert has since added a map of Oids to allow the executor to have those details, so it perhaps would be possible to take the result of the pruning code, then look up the Oids of the partitions that survived pruning, then map those to the subplans using the array Robert added. Using the array for that wouldn't be very efficient due to a lookup being O(n) per surviving partition. Maybe it could be thrown into a hashtable to make that faster. This solution would need to take into account mixed-hierarchy Appends, e.g.

SELECT * FROM partitioned_table WHERE partkey = $1 UNION ALL SELECT * FROM something_else;

so it would likely need to be a hashtable per partitioned table. If the pruning code returned a partition whose Oid we didn't know about, then it must be from a partition that was added concurrently since the plan was built... However, that shouldn't happen today since Robert went to great lengths for it not to.
Further discussions are likely best put in their own thread. As far as I know, nothing is broken with the code today. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: speeding up planning with partitions
One thing that I intentionally left out of the committed patch was changes to stop short of scanning the whole simple_rel_array when looking only for baserels. I thought that had been done in a rather piecemeal fashion and it'd be better to address it holistically, which I've now done in the attached proposed patch. This probably makes little if any difference in the test cases we've mostly focused on in this thread, since there wouldn't be very many otherrels anyway now that we don't create them for pruned partitions. However, in a case where there's a lot of partitions that we can't prune, this could be useful. I have not done any performance testing to see if this is actually worth the trouble, though. Anybody want to do that?

regards, tom lane

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 727da33..7a9aa12 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -157,7 +157,7 @@ make_one_rel(PlannerInfo *root, List *joinlist)
 	 * Construct the all_baserels Relids set.
 	 */
 	root->all_baserels = NULL;
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
@@ -290,7 +290,7 @@ set_base_rel_sizes(PlannerInfo *root)
 {
 	Index		rti;
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *rel = root->simple_rel_array[rti];
 		RangeTblEntry *rte;
@@ -333,7 +333,7 @@ set_base_rel_pathlists(PlannerInfo *root)
 {
 	Index		rti;
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *rel = root->simple_rel_array[rti];
@@ -1994,7 +1994,7 @@ has_multiple_baserels(PlannerInfo *root)
 	int			num_base_rels = 0;
 	Index		rti;
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 61b5b11..723643c 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -828,11 +828,11 @@ generate_base_implied_equalities(PlannerInfo *root)
 	 * This is also a handy place to mark base rels (which should all exist by
 	 * now) with flags showing whether they have pending eclass joins.
 	 */
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
-		if (brel == NULL)
+		if (brel == NULL || brel->reloptkind != RELOPT_BASEREL)
 			continue;
 		brel->has_eclass_joins = has_relevant_eclass_joinclause(root, brel);
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index 9798dca..c5459b6 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -145,7 +145,7 @@ add_other_rels_to_query(PlannerInfo *root)
 {
 	int			rti;
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *rel = root->simple_rel_array[rti];
 		RangeTblEntry *rte = root->simple_rte_array[rti];
@@ -312,7 +312,7 @@ find_lateral_references(PlannerInfo *root)
 	/*
 	 * Examine all baserels (the rel array has been set up by now).
 	 */
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
@@ -460,7 +460,7 @@ create_lateral_join_info(PlannerInfo *root)
 	/*
 	 * Examine all baserels (the rel array has been set up by now).
 	 */
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
 		Relids		lateral_relids;
@@ -580,7 +580,7 @@ create_lateral_join_info(PlannerInfo *root)
 	 * The outer loop considers each baserel, and propagates its lateral
 	 * dependencies to those baserels that have a lateral dependency on it.
 	 */
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
 		Relids		outer_lateral_relids;
@@ -595,7 +595,7 @@ create_lateral_join_info(PlannerInfo *root)
 			continue;

 		/* else scan all baserels */
-		for (rti2 = 1; rti2 < root->simple_rel_array_size; rti2++)
+		for (rti2 = 1; rti2 <= root->last_base_relid; rti2++)
 		{
 			RelOptInfo *brel2 = root->simple_rel_array[rti2];
@@ -614,7 +614,7 @@ create_lateral_join_info(PlannerInfo *root)
 	 * with the set of relids of rels that reference it laterally (possibly
 	 * indirectly) --- that is, the inverse mapping of lateral_relids.
 	 */
-	for (rti = 1; rti <
Re: speeding up planning with partitions
Amit Langote writes:
> On Sun, Mar 31, 2019 at 11:45 AM Imai Yoshikazu wrote:
>> Certainly, using bitmapset contributes to the performance when scanning
>> one partition (few partitions) from large partitions.
> Thanks Imai-san for testing.

I tried to replicate these numbers with the code as-committed, and could not. What I get, using the same table-creation code as you posted and a pgbench script file like

\set param random(1, :N)
select * from rt where a = :param;

is scaling like this:

N       tps, range      tps, hash
2       10520.519932    10415.230400
8       10443.361457    10480.987665
32      10341.196768    10462.551167
128     10370.953849    10383.885128
512     10207.578413    10214.049394
1024    10042.794340    10121.683993
4096    8937.561825     9214.993778
8192    8247.614040     8486.728918

If I use "-M prepared" the numbers go up a bit for lower N, but drop at high N:

N       tps, range      tps, hash
2       11449.920527    11462.253871
8       11530.513146    11470.812476
32      11372.412999    11450.213753
128     11289.351596    11322.698856
512     11095.428451    11200.683771
1024    10757.646108    10805.052480
4096    8689.165875     8930.690887
8192    7301.609147     7502.806455

Digging into that, it seems like the degradation with -M prepared is mostly in LockReleaseAll's hash_seq_search over the locallock hash table. What I think must be happening is that with -M prepared, at some point the plancache decides to try a generic plan, which causes opening/locking all the partitions, resulting in permanent bloat in the locallock hash table. We immediately go back to using custom plans, but hash_seq_search has more buckets to look through for the remainder of the process' lifetime.

I do see some cycles getting spent in apply_scanjoin_target_to_paths that look to be due to scanning over the long part_rels array, which your proposal would ameliorate. But (a) that's pretty small compared to other effects, and (b) IMO, apply_scanjoin_target_to_paths is a remarkable display of brute force inefficiency to begin with.
I think we should see if we can't nuke that function altogether in favor of generating the paths with the right target the first time.

BTW, the real elephant in the room is the O(N^2) cost of creating these tables in the first place. The runtime for the table-creation scripts looks like

N       range       hash
2       0m0.011s    0m0.011s
8       0m0.015s    0m0.014s
32      0m0.032s    0m0.030s
128     0m0.132s    0m0.099s
512     0m0.969s    0m0.524s
1024    0m3.306s    0m1.442s
4096    0m46.058s   0m15.522s
8192    3m11.995s   0m58.720s

This seems to be down to the expense of doing RelationBuildPartitionDesc to rebuild the parent's relcache entry for each child CREATE TABLE. Not sure we can avoid that, but maybe we should consider adopting a cheaper-to-read representation of partition descriptors. The fact that range-style entries seem to be 3X more expensive to load than hash-style entries is strange.

regards, tom lane
Re: speeding up planning with partitions
On Sun, Mar 31, 2019 at 11:45 AM Imai Yoshikazu wrote:
> On 2019/03/31 1:06, Amit Langote wrote:
> > On Sun, Mar 31, 2019 at 12:11 AM Tom Lane wrote:
> >> I am curious as to why there seems to be more degradation
> >> for hash cases, as per Yoshikazu-san's results in
> >> <0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24>,
> >> but whatever's accounting for the difference probably
> >> is not that.
> >
> > I suspected it may have been the lack of bitmapsets, but maybe only
> > Imai-san could've confirmed that by applying the live_parts patch too.
>
> Yeah, I forgot to apply the live_parts patch. I did the same test again which I did for hash before.
> (BTW, thanks for committing speeding up patches!)

Thanks a lot for committing, Tom. I wish you had listed yourself as an author though. I will send the patch for get_relation_constraints() mentioned upthread tomorrow.

> [HEAD(428b260)]
> nparts   TPS
> ======   ====
> 2:       13134 (13240, 13290, 13071, 13172, 12896)
> 1024:    12627 (12489, 12635, 12716, 12732, 12562)
> 8192:    10289 (10216, 10265, 10171, 10278, 10514)
>
> [HEAD(428b260) + live_parts.diff]
> nparts   TPS
> ======   ====
> 2:       13277 (13112, 13290, 13241, 13360, 13382)
> 1024:    12821 (12930, 12849, 12909, 12700, 12716)
> 8192:    11102 (11134, 11158, 4, 10997, 11109)
>
> Degradations of performance are below.
>
> My test results from above (with live_parts, HEAD(428b260) + live_parts.diff)
> nparts   live_parts   HEAD
> ======   ==========   ====
> 2:       13277        13134
> 1024:    12821        12627
> 8192:    11102        10289
>
> 11102/13277 = 83.6 %
>
> Amit-san's test results (with live_parts)
> > nparts   v38    HEAD
> > ======   ====   ====
> > 2        2971   2969
> > 8        2980   1949
> > 32       2955   733
> > 128      2946   145
> > 512      2924   11
> > 1024     2986   3
> > 4096     2702   0
> > 8192     2531   OOM
>
> 2531/2971 = 85.2 %
>
> My test results I posted before (without live_parts)
> > nparts   v38     HEAD
> > ======   =====   =====
> > 0:       10538   10487
> > 2:       6942    7028
> > 4:       7043    5645
> > 8:       6981    3954
> > 16:      6932    2440
> > 32:      6897    1243
> > 64:      6897    309
> > 128:     6753    120
> > 256:     6727    46
> > 512:     6708    12
> > 1024:    6063    3
> > 2048:    5894    1
> > 4096:    5374    OOM
> > 8192:    4572    OOM
>
> 4572/6942 = 65.9 %
>
> Certainly, using bitmapset contributes to the performance when scanning one partition (few partitions) from large partitions.

Thanks Imai-san for testing.

Regards,
Amit
Re: speeding up planning with partitions
On 2019/03/31 1:06, Amit Langote wrote:
> On Sun, Mar 31, 2019 at 12:11 AM Tom Lane wrote:
>> Amit Langote writes:
>>> I think the performance results did prove that degradation due to
>>> those loops over part_rels becomes significant for very large
>>> partition counts. Is there a better solution than the bitmapset that
>>> you have in mind?
>>
>> Hm, I didn't see much degradation in what you posted in
>> <5c83dbca-12b5-1acf-0e85-58299e464...@lab.ntt.co.jp>.
>
> Sorry that I didn't mention the link to begin with, but I meant to point to numbers that I reported on Monday this week.
>
> https://www.postgresql.org/message-id/19f54c17-1619-b228-10e5-ca343be6a4e8%40lab.ntt.co.jp
>
> You were complaining of the bitmapset being useless overhead for small partition counts, but the numbers I get tend to suggest that any degradation in performance is within noise range, whereas the performance benefit from having them looks pretty significant for very large partition counts.
>
>> I am curious as to why there seems to be more degradation
>> for hash cases, as per Yoshikazu-san's results in
>> <0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24>,
>> but whatever's accounting for the difference probably
>> is not that.
>
> I suspected it may have been the lack of bitmapsets, but maybe only Imai-san could've confirmed that by applying the live_parts patch too.

Yeah, I forgot to apply the live_parts patch. I did the same test again which I did for hash before.
(BTW, thanks for committing speeding up patches!)

[HEAD(428b260)]
nparts   TPS
======   ====
2:       13134 (13240, 13290, 13071, 13172, 12896)
1024:    12627 (12489, 12635, 12716, 12732, 12562)
8192:    10289 (10216, 10265, 10171, 10278, 10514)

[HEAD(428b260) + live_parts.diff]
nparts   TPS
======   ====
2:       13277 (13112, 13290, 13241, 13360, 13382)
1024:    12821 (12930, 12849, 12909, 12700, 12716)
8192:    11102 (11134, 11158, 4, 10997, 11109)

Degradations of performance are below.

My test results from above (with live_parts, HEAD(428b260) + live_parts.diff)
nparts   live_parts   HEAD
======   ==========   ====
2:       13277        13134
1024:    12821        12627
8192:    11102        10289

11102/13277 = 83.6 %

Amit-san's test results (with live_parts)
> nparts   v38    HEAD
> ======   ====   ====
> 2        2971   2969
> 8        2980   1949
> 32       2955   733
> 128      2946   145
> 512      2924   11
> 1024     2986   3
> 4096     2702   0
> 8192     2531   OOM

2531/2971 = 85.2 %

My test results I posted before (without live_parts)
> nparts   v38     HEAD
> ======   =====   =====
> 0:       10538   10487
> 2:       6942    7028
> 4:       7043    5645
> 8:       6981    3954
> 16:      6932    2440
> 32:      6897    1243
> 64:      6897    309
> 128:     6753    120
> 256:     6727    46
> 512:     6708    12
> 1024:    6063    3
> 2048:    5894    1
> 4096:    5374    OOM
> 8192:    4572    OOM

4572/6942 = 65.9 %

Certainly, using bitmapset contributes to the performance when scanning one partition (few partitions) from large partitions.

Thanks
--
Imai Yoshikazu

diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 34cc7da..e847655 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1516,6 +1516,9 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		populate_joinrel_with_paths(root, child_rel1, child_rel2,
 									child_joinrel, child_sjinfo,
 									child_restrictlist);
+		if (!IS_DUMMY_REL(child_joinrel))
+			joinrel->live_parts = bms_add_member(joinrel->live_parts,
+												 cnt_parts);
 	}
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f08f1cd..9ddf42a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -7107,7 +7107,9 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	int			partition_idx;

 	/* Adjust each partition. */
-	for (partition_idx = 0; partition_idx < rel->nparts; partition_idx++)
+	partition_idx = -1;
+	while ((partition_idx = bms_next_member(rel->live_parts,
+											partition_idx)) >= 0)
 	{
 		RelOptInfo *child_rel = rel->part_rels[partition_idx];
 		AppendRelInfo **appinfos;
@@ -7115,9 +7117,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		List
Re: speeding up planning with partitions
On Sat, Mar 30, 2019 at 12:16 PM Amit Langote wrote: > Fwiw, I had complained when reviewing the run-time pruning patch that > creating those maps in the planner and putting them in > PartitionPruneInfo might not be a good idea, but David insisted that > it'd be good for performance (in the context of using cached plans) to > compute this information during planning. Well, he's not wrong about that, I expect. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: speeding up planning with partitions
On Sun, Mar 31, 2019 at 12:59 AM Robert Haas wrote: > > On Sat, Mar 30, 2019 at 11:46 AM Tom Lane wrote: > > > The only problem with PartitionPruneInfo structures of which I am > > > aware is that they rely on PartitionDesc offsets not changing. But I > > > added code in that commit in ExecCreatePartitionPruneState to handle > > > that exact problem. See also paragraph 5 of the commit message, which > > > begins with "Although in general..." > > > > Ah. Grotty, but I guess it will cover the issue. > > I suppose it is. I am a little suspicious of the decision to make > PartitionPruneInfo structures depend on PartitionDesc indexes. Fwiw, I had complained when reviewing the run-time pruning patch that creating those maps in the planner and putting them in PartitionPruneInfo might not be a good idea, but David insisted that it'd be good for performance (in the context of using cached plans) to compute this information during planning. Thanks, Amit
Re: speeding up planning with partitions
On Sun, Mar 31, 2019 at 12:11 AM Tom Lane wrote: > Amit Langote writes: > > I think the performance results did prove that degradation due to > > those loops over part_rels becomes significant for very large > > partition counts. Is there a better solution than the bitmapset that > > you have in mind? > > Hm, I didn't see much degradation in what you posted in > <5c83dbca-12b5-1acf-0e85-58299e464...@lab.ntt.co.jp>. Sorry that I didn't mention the link to begin with, but I meant to point to numbers that I reported on Monday this week. https://www.postgresql.org/message-id/19f54c17-1619-b228-10e5-ca343be6a4e8%40lab.ntt.co.jp You were complaining of the bitmapset being useless overhead for small partition counts, but the numbers I get tend to suggest that any degradation in performance is within noise range, whereas the performance benefit from having them looks pretty significant for very large partition counts. > I am curious as to why there seems to be more degradation > for hash cases, as per Yoshikazu-san's results in > <0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24>, > but whatever's accounting for the difference probably > is not that. I suspected it may have been the lack of bitmapsets, but maybe only Imai-san could've confirmed that by applying the live_parts patch too. Thanks, Amit
Re: speeding up planning with partitions
On Sat, Mar 30, 2019 at 11:46 AM Tom Lane wrote: > > The only problem with PartitionPruneInfo structures of which I am > > aware is that they rely on PartitionDesc offsets not changing. But I > > added code in that commit in ExecCreatePartitionPruneState to handle > > that exact problem. See also paragraph 5 of the commit message, which > > begins with "Although in general..." > > Ah. Grotty, but I guess it will cover the issue. I suppose it is. I am a little suspicious of the decision to make PartitionPruneInfo structures depend on PartitionDesc indexes. First, it's really non-obvious that the dependency exists, and I do not think I would have spotted it had not Alvaro pointed the problem out. Second, I wonder whether it is really a good idea in general to make a plan depend on array indexes when the array is not stored in the plan. In one sense, we do that all the time, because attnums are arguably just indexes into what is conceptually an array of attributes. However, I feel that's not quite the same, because the attnum is explicitly stored in the catalogs, and PartitionDesc array indexes are not stored anywhere, but rather are the result of a fairly complex calculation. Now I guess it's probably OK because we will probably have lots of other problems if we don't get the same answer every time we do that calculation, but it still makes me a little nervous. I would try to propose something better but I don't have a good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: speeding up planning with partitions
Robert Haas writes: > On Sat, Mar 30, 2019 at 11:11 AM Tom Lane wrote: >> Before that, though, I remain concerned that the PartitionPruneInfo >> data structure the planner is transmitting to the executor is unsafe >> against concurrent ATTACH PARTITION operations. The comment for >> PartitionedRelPruneInfo says in so many words that it's relying on >> indexes in the table's PartitionDesc; how is that not broken by >> 898e5e329? > The only problem with PartitionPruneInfo structures of which I am > aware is that they rely on PartitionDesc offsets not changing. But I > added code in that commit in ExecCreatePartitionPruneState to handle > that exact problem. See also paragraph 5 of the commit message, which > begins with "Although in general..." Ah. Grotty, but I guess it will cover the issue. regards, tom lane
Re: speeding up planning with partitions
On Sat, Mar 30, 2019 at 11:11 AM Tom Lane wrote: > Before that, though, I remain concerned that the PartitionPruneInfo > data structure the planner is transmitting to the executor is unsafe > against concurrent ATTACH PARTITION operations. The comment for > PartitionedRelPruneInfo says in so many words that it's relying on > indexes in the table's PartitionDesc; how is that not broken by > 898e5e329? The only problem with PartitionPruneInfo structures of which I am aware is that they rely on PartitionDesc offsets not changing. But I added code in that commit in ExecCreatePartitionPruneState to handle that exact problem. See also paragraph 5 of the commit message, which begins with "Although in general..." -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: speeding up planning with partitions
Amit Langote writes:
> On Sat, Mar 30, 2019 at 9:17 AM Tom Lane wrote:
>> What I propose we do about the GEQO problem is shown in 0001 attached
>> (which would need to be back-patched into v11).
>> ...
>> That's just dumb. What we *ought* to be doing in such degenerate
>> outer-join cases is just emitting the non-dummy side, ie

> Fwiw, I agree that we should fix join planning so that we get the
> ProjectionPath atop scan path of non-nullable relation instead of a
> full-fledged join path with dummy path on the nullable side. It seems
> to me that the "fix" would mostly be localized to
> try_partitionwise_join() at least insofar as detecting whether we
> should generate a join or the other plan shape is concerned, right?

Well, if we're going to do something about that, I would like to see it work for non-partition cases too, ie we're not smart about this either:

regression=# explain select * from tenk1 left join (select 1 where false) ss(x) on unique1=x;
                          QUERY PLAN
--------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..570.00 rows=1 width=248)
   Join Filter: (tenk1.unique1 = 1)
   ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=1 width=244)
   ->  Result  (cost=0.00..0.00 rows=0 width=0)
         One-Time Filter: false
(5 rows)

A general solution would presumably involve new logic in populate_joinrel_with_paths for the case where the nullable side is dummy. I'm not sure whether that leaves anything special to do in try_partitionwise_join or not. Maybe it would, since that would have a requirement to build the joinrel without any RHS input RelOptInfo, but I don't think that's the place to begin working on this.

> By the way, does it make sense to remove the tests whose output
> changes altogether and reintroduce them when we fix join planning?
> Especially, in partitionwise_aggregate.out, there are comments near
> changed plans which are no longer true.

Good point about the comments, but we shouldn't just remove those test cases; they're useful to exercise the give-up-on-partitionwise-join code paths.
I'll tweak the comments. >> 0002 attached is then the rest of the partition-planning patch; >> it doesn't need to mess with joinrels.c at all. I've addressed >> the other points discussed today in that, except for the business >> about whether we want your 0003 bitmap-of-live-partitions patch. >> I'm still inclined to think that that's not really worth it, >> especially in view of your performance results. > I think the performance results did prove that degradation due to > those loops over part_rels becomes significant for very large > partition counts. Is there a better solution than the bitmapset that > you have in mind? Hm, I didn't see much degradation in what you posted in <5c83dbca-12b5-1acf-0e85-58299e464...@lab.ntt.co.jp>. I am curious as to why there seems to be more degradation for hash cases, as per Yoshikazu-san's results in <0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24>, but whatever's accounting for the difference probably is not that. Anyway I still believe that getting rid of these sparse arrays would be a better answer. Before that, though, I remain concerned that the PartitionPruneInfo data structure the planner is transmitting to the executor is unsafe against concurrent ATTACH PARTITION operations. The comment for PartitionedRelPruneInfo says in so many words that it's relying on indexes in the table's PartitionDesc; how is that not broken by 898e5e329? regards, tom lane
Re: speeding up planning with partitions
Thanks for the new patches. On Sat, Mar 30, 2019 at 9:17 AM Tom Lane wrote: > > Amit Langote writes: > > On 2019/03/29 7:38, Tom Lane wrote: > >> 2. I seriously dislike what's been done in joinrels.c, too. That > >> really seems like a kluge (and I haven't had time to study it > >> closely). > > > Those hunks account for the fact that pruned partitions, for which we no > > longer create RangeTblEntry and RelOptInfo, may appear on the nullable > > side of an outer join. We'll need a RelOptInfo holding a dummy path, so > > that outer join paths can be created with one side of join being dummy > > result path, which are built in the patch by build_dummy_partition_rel(). > > Now that I've had a chance to look closer, there's no way I'm committing > that change in joinrels.c. If it works at all, it's accidental, because > it's breaking all sorts of data structure invariants. The business with > an AppendRelInfo that maps from the parentrel to itself is particularly > ugly; and I doubt that you can get away with assuming that > root->append_rel_array[parent->relid] is available for use for that. > (What if the parent is an intermediate partitioned table?) > > There's also the small problem of the GEQO crash. It's possible that > that could be gotten around by switching into the long-term planner > context in update_child_rel_info and build_dummy_partition_rel, but > then you're creating a memory leak across GEQO cycles. It'd be much > better to avoid touching base-relation data structures during join > planning. > > What I propose we do about the GEQO problem is shown in 0001 attached > (which would need to be back-patched into v11). This is based on the > observation that, if we know an input relation is empty, we can often > prove the join is empty and then skip building it at all. (In the > existing partitionwise-join code, the same cases are detected by > populate_joinrel_with_paths, but we do a fair amount of work before > discovering that.) 
> The cases where that's not true are where we have a pruned partition on the inside of a left join, or either side of a full join ... but frankly, what the existing code produces for those cases is not short of embarrassing:
>
>    ->  Hash Left Join
>          Hash Cond: (pagg_tab1_p1.x = y)
>          Filter: ((pagg_tab1_p1.x > 5) OR (y < 20))
>          ->  Seq Scan on pagg_tab1_p1
>                Filter: (x < 20)
>          ->  Hash
>                ->  Result
>                      One-Time Filter: false
>
> That's just dumb. What we *ought* to be doing in such degenerate outer-join cases is just emitting the non-dummy side, ie
>
>    ->  Seq Scan on pagg_tab1_p1
>          Filter: (x < 20) AND ((pagg_tab1_p1.x > 5) OR (y < 20))
>
> in this example. I would envision handling this by teaching the code to generate a path for the joinrel that's basically just a ProjectionPath atop a path for the non-dummy input rel, with the projection set up to emit nulls for the columns of the dummy side. (Note that this would be useful for outer joins against dummy rels in regular planning contexts, not only partitionwise joins.)
>
> Pending somebody doing the work for that, though, I do not have a problem with just being unable to generate partitionwise joins in such cases, so 0001 attached just changes the expected outputs for the affected regression test cases.

Fwiw, I agree that we should fix join planning so that we get the ProjectionPath atop scan path of non-nullable relation instead of a full-fledged join path with dummy path on the nullable side. It seems to me that the "fix" would mostly be localized to try_partitionwise_join() at least insofar as detecting whether we should generate a join or the other plan shape is concerned, right?

By the way, does it make sense to remove the tests whose output changes altogether and reintroduce them when we fix join planning? Especially, in partitionwise_aggregate.out, there are comments near changed plans which are no longer true.
> 0002 attached is then the rest of the partition-planning patch; > it doesn't need to mess with joinrels.c at all. I've addressed > the other points discussed today in that, except for the business > about whether we want your 0003 bitmap-of-live-partitions patch. > I'm still inclined to think that that's not really worth it, > especially in view of your performance results. I think the performance results did prove that degradation due to those loops over part_rels becomes significant for very large partition counts. Is there a better solution than the bitmapset that you have in mind? > If people are OK with this approach to solving the GEQO problem, > I think these are committable. Thanks again. Really appreciate that you are putting so much of your time into this. Regards, Amit
Re: speeding up planning with partitions
Amit Langote writes: > On 2019/03/29 7:38, Tom Lane wrote: >> 2. I seriously dislike what's been done in joinrels.c, too. That >> really seems like a kluge (and I haven't had time to study it >> closely). > Those hunks account for the fact that pruned partitions, for which we no > longer create RangeTblEntry and RelOptInfo, may appear on the nullable > side of an outer join. We'll need a RelOptInfo holding a dummy path, so > that outer join paths can be created with one side of join being dummy > result path, which are built in the patch by build_dummy_partition_rel(). Now that I've had a chance to look closer, there's no way I'm committing that change in joinrels.c. If it works at all, it's accidental, because it's breaking all sorts of data structure invariants. The business with an AppendRelInfo that maps from the parentrel to itself is particularly ugly; and I doubt that you can get away with assuming that root->append_rel_array[parent->relid] is available for use for that. (What if the parent is an intermediate partitioned table?) There's also the small problem of the GEQO crash. It's possible that that could be gotten around by switching into the long-term planner context in update_child_rel_info and build_dummy_partition_rel, but then you're creating a memory leak across GEQO cycles. It'd be much better to avoid touching base-relation data structures during join planning. What I propose we do about the GEQO problem is shown in 0001 attached (which would need to be back-patched into v11). This is based on the observation that, if we know an input relation is empty, we can often prove the join is empty and then skip building it at all. (In the existing partitionwise-join code, the same cases are detected by populate_joinrel_with_paths, but we do a fair amount of work before discovering that.) The cases where that's not true are where we have a pruned partition on the inside of a left join, or either side of a full join ... 
but frankly, what the existing code produces for those cases is nothing short of embarrassing:

         ->  Hash Left Join
               Hash Cond: (pagg_tab1_p1.x = y)
               Filter: ((pagg_tab1_p1.x > 5) OR (y < 20))
               ->  Seq Scan on pagg_tab1_p1
                     Filter: (x < 20)
               ->  Hash
                     ->  Result
                           One-Time Filter: false

That's just dumb.  What we *ought* to be doing in such degenerate outer-join cases is just emitting the non-dummy side, ie

         ->  Seq Scan on pagg_tab1_p1
               Filter: (x < 20) AND ((pagg_tab1_p1.x > 5) OR (y < 20))

in this example.  I would envision handling this by teaching the code to generate a path for the joinrel that's basically just a ProjectionPath atop a path for the non-dummy input rel, with the projection set up to emit nulls for the columns of the dummy side.  (Note that this would be useful for outer joins against dummy rels in regular planning contexts, not only partitionwise joins.)  Pending somebody doing the work for that, though, I do not have a problem with just being unable to generate partitionwise joins in such cases, so 0001 attached just changes the expected outputs for the affected regression test cases.

0002 attached is then the rest of the partition-planning patch; it doesn't need to mess with joinrels.c at all.  I've addressed the other points discussed today in that, except for the business about whether we want your 0003 bitmap-of-live-partitions patch.  I'm still inclined to think that that's not really worth it, especially in view of your performance results.

If people are OK with this approach to solving the GEQO problem, I think these are committable.

			regards, tom lane

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 56a5084..3c9d84f 100644
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*** set_append_rel_size(PlannerInfo *root, R
*** 1112,1122 ****
  		 * for partitioned child rels.
  		 *
  		 * Note: here we abuse the consider_partitionwise_join flag by setting
! 		 * it *even* for child rels that are not partitioned.  In that case,
! 		 * we set it to tell try_partitionwise_join() that it doesn't need to
! 		 * generate their targetlists and EC entries as they have already been
! 		 * generated here, as opposed to the dummy child rels for which the
! 		 * flag is left set to false so that it will generate them.
  		 */
  		if (rel->consider_partitionwise_join)
  			childrel->consider_partitionwise_join = true;
--- 1112,1122 ----
  		 * for partitioned child rels.
  		 *
  		 * Note: here we abuse the consider_partitionwise_join flag by setting
! 		 * it for child rels that are not themselves partitioned.  We do so to
! 		 * tell try_partitionwise_join() that the child rel is sufficiently
! 		 * valid to be used as a per-partition input, even if it later gets
! 		 * proven to be dummy.  (It's not
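The rewrite Tom sketches can be illustrated at the SQL level (a sketch with made-up tables t1/t2; this shows the semantics of the simplification, not the planner mechanism):

```sql
-- If the planner can prove the inner side t2 is empty (a dummy rel), then
SELECT t1.a, t2.b
FROM t1 LEFT JOIN t2 ON t1.a = t2.b;
-- is equivalent to scanning the non-dummy side alone and projecting a
-- NULL for each column of the dummy side:
SELECT t1.a, NULL::int AS b FROM t1;
```

The ProjectionPath idea is essentially the second query's shape generated directly as a path for the joinrel.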
Re: speeding up planning with partitions
Amit Langote writes: > On 2019/03/29 7:38, Tom Lane wrote: >> 2. I seriously dislike what's been done in joinrels.c, too. That >> really seems like a kluge (and I haven't had time to study it >> closely). > Those hunks account for the fact that pruned partitions, for which we no > longer create RangeTblEntry and RelOptInfo, may appear on the nullable > side of an outer join. We'll need a RelOptInfo holding a dummy path, so > that outer join paths can be created with one side of join being dummy > result path, which are built in the patch by build_dummy_partition_rel(). Just for the record, that code is completely broken: it falls over badly under GEQO. (Try running the regression tests with "alter system set geqo_threshold to 2".) However, the partitionwise join code was completely broken for GEQO before this patch, too, so I'm just going to log that as an open issue for the moment. regards, tom lane
Re: speeding up planning with partitions
I wrote: > Amit Langote writes: >> About the XXX: I think resetting inh flag is unnecessary, so we should >> just remove the line. > Possibly. I hadn't had time to follow up the XXX annotation. Now I have ... Yeah, it seems we can just drop that and leave the flag alone. We'll end up running through set_append_rel_size and finding no relevant AppendRelInfos, but that's not going to take long enough to be a problem. It seems better to have the principle that rte->inh doesn't change after subquery_planner's initial scan of the rtable, so I'll make it so. >> If we do that, we can also get rid of the following >> code in set_rel_size(): No, we can't --- that's still reachable if somebody says "SELECT FROM ONLY partitioned_table". regards, tom lane
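The "SELECT FROM ONLY partitioned_table" case mentioned above is easy to see at the SQL level (a minimal sketch):

```sql
-- A partitioned parent stores no rows itself, and ONLY excludes its
-- children, so the scan below is provably empty and can be marked dummy:
CREATE TABLE pt (a int) PARTITION BY LIST (a);
CREATE TABLE pt1 PARTITION OF pt FOR VALUES IN (1);
INSERT INTO pt VALUES (1);
SELECT * FROM ONLY pt;   -- always returns zero rows
```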
Re: speeding up planning with partitions
Amit Langote writes:
> Here are some comments on v38.

Thanks for looking it over!  I'll just reply to points worth discussing:

> -        Assert(rte->rtekind == RTE_RELATION ||
> -               rte->rtekind == RTE_SUBQUERY);
> -        add_appendrel_other_rels(root, rel, rti);
> +        if (rte->rtekind == RTE_RELATION)
> +            expand_inherited_rtentry(root, rel, rte, rti);
> +        else
> +            expand_appendrel_subquery(root, rel, rte, rti);

> Wouldn't it be a good idea to keep the Assert?

There's an Assert in expand_appendrel_subquery that what it got is an RTE_SUBQUERY, so I thought the one at the call site was redundant.  I suppose another way to do this would be like

    if (rte->rtekind == RTE_RELATION)
        expand_inherited_rtentry(root, rel, rte, rti);
    else if (rte->rtekind == RTE_SUBQUERY)
        expand_appendrel_subquery(root, rel, rte, rti);
    else
        Assert(false);

Not sure if that's better or not.  Or we could go back to the design of just having one function and letting it dispatch the case it doesn't want to the other function --- though I think I'd make expand_inherited_rtentry be the primary function, rather than the other way around as you had it in v37.

> +        forboth(lc, old_child_rtis, lc2, new_child_rtis)
> +        {
> +            int  old_child_rti = lfirst_int(lc);
> +            int  new_child_rti = lfirst_int(lc2);
> +
> +            if (old_child_rti == new_child_rti)
> +                continue;       /* nothing to do */
> +
> +            Assert(old_child_rti > new_child_rti);
> +
> +            ChangeVarNodes((Node *) child_appinfos,
> +                           old_child_rti, new_child_rti, 0);
> +        }

> This seems unnecessary?  RTEs of children of the target table should be in
> the same position in the final_rtable as they are in the select_rtable.

Well, that's what I'm not very convinced of.  I have observed that the regression tests don't reach this ChangeVarNodes call, but I think that might just be lack of test cases rather than a proof that it's impossible.  The question is whether it'd ever be possible for the update/delete target to not be the first "inh" table that gets expanded.
Since that expansion is done in RTE order, it reduces to "is the target always before any other RTE entries that could need inheritance expansion?"  Certainly that would typically be true, but I don't feel very comfortable about assuming that it must be true, when you start thinking about things like updatable views, rules, WITH queries, and so on.

It might be worth trying to devise a test case that does reach this code.  If we could convince ourselves that it's really impossible, I'd be willing to drop it in favor of putting a test-and-elog check in the earlier loop that the RTI pairs are all equal.  But I'm not willing to do it without more investigation.

> +        /* XXX wrong? */
> +        parentrte->inh = false;

> About the XXX: I think resetting inh flag is unnecessary, so we should
> just remove the line.

Possibly.  I hadn't had time to follow up the XXX annotation.

> If we do that, we can also get rid of the following
> code in set_rel_size():

> else if (rte->relkind == RELKIND_PARTITIONED_TABLE)
> {
>     /*
>      * A partitioned table without any partitions is marked as
>      * a dummy rel.
>      */
>     set_dummy_rel_pathlist(rel);
> }

Not following?  Surely we need to mark the childless parent as dummy at some point, and that seems like as good a place as any.

> Finally, it's not in the patch, but how about visiting
> get_relation_constraints() for revising this block of code:

That seems like probably an independent patch --- do you want to write it?

			regards, tom lane
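One shape that might be worth probing in the test-case hunt (purely hypothetical --- none of these names come from the patch, and it is not established that this actually reaches the remapping code) is an UPDATE whose partitioned target is preceded by another inheritance-expandable RTE in the range table:

```sql
-- Made-up tables: the idea is to get another "inh" RTE ahead of the
-- UPDATE target and then check whether the child RTIs assigned during
-- the SELECT planning pass still line up with the final rtable.
CREATE TABLE t_other (a int) PARTITION BY LIST (a);
CREATE TABLE t_other1 PARTITION OF t_other FOR VALUES IN (1);
CREATE TABLE t_target (a int, b int) PARTITION BY LIST (a);
CREATE TABLE t_target1 PARTITION OF t_target FOR VALUES IN (1);

UPDATE t_target SET b = 0
FROM (SELECT a FROM t_other) s
WHERE t_target.a = s.a;
```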
Re: speeding up planning with partitions
Here are some comments on v38.

On 2019/03/29 12:44, Amit Langote wrote:
> Thanks again for the new patch.  I'm reading it now and will send
> comments later today if I find something.

-        Assert(rte->rtekind == RTE_RELATION ||
-               rte->rtekind == RTE_SUBQUERY);
-        add_appendrel_other_rels(root, rel, rti);
+        if (rte->rtekind == RTE_RELATION)
+            expand_inherited_rtentry(root, rel, rte, rti);
+        else
+            expand_appendrel_subquery(root, rel, rte, rti);

Wouldn't it be a good idea to keep the Assert?

+ * It's possible that the RTIs we just assigned for the child rels in the
+ * final rtable are different from where they were in the SELECT query.

In the 2nd sentence, maybe you meant "...from what they were"

+        forboth(lc, old_child_rtis, lc2, new_child_rtis)
+        {
+            int  old_child_rti = lfirst_int(lc);
+            int  new_child_rti = lfirst_int(lc2);
+
+            if (old_child_rti == new_child_rti)
+                continue;       /* nothing to do */
+
+            Assert(old_child_rti > new_child_rti);
+
+            ChangeVarNodes((Node *) child_appinfos,
+                           old_child_rti, new_child_rti, 0);
+        }

This seems unnecessary?  RTEs of children of the target table should be in the same position in the final_rtable as they are in the select_rtable.  It seems that they can be added to parse->rtable simply as:

    orig_rtable_len = list_length(parse->rtable);
    parse->rtable = list_concat(parse->rtable,
                                list_copy_tail(select_rtable,
                                               orig_rtable_len));

That is, after the block of code that plans the query as SELECT.

+ * about the childrens' reltargets, they'll be made later

Should it be children's?

+        /*
+         * If the partitioned table has no partitions, treat this as the
+         * non-inheritance case.
+         */
+        if (partdesc->nparts == 0)
+        {
+            /* XXX wrong? */
+            parentrte->inh = false;
+            return;
+        }

About the XXX: I think resetting inh flag is unnecessary, so we should just remove the line.  If we do that, we can also get rid of the following code in set_rel_size():

    else if (rte->relkind == RELKIND_PARTITIONED_TABLE)
    {
        /*
         * A partitioned table without any partitions is marked as
         * a dummy rel.
         */
        set_dummy_rel_pathlist(rel);
    }

Finally, it's not in the patch, but how about visiting get_relation_constraints() for revising this block of code:

    /*
     * Append partition predicates, if any.
     *
     * For selects, partition pruning uses the parent table's partition bound
     * descriptor, instead of constraint exclusion which is driven by the
     * individual partition's partition constraint.
     */
    if (enable_partition_pruning && root->parse->commandType != CMD_SELECT)
    {
        List *pcqual = RelationGetPartitionQual(relation);

        if (pcqual)
        {
            /*
             * Run the partition quals through const-simplification similar to
             * check constraints.  We skip canonicalize_qual, though, because
             * partition quals should be in canonical form already; also,
             * since the qual is in implicit-AND format, we'd have to
             * explicitly convert it to explicit-AND format and back again.
             */
            pcqual = (List *) eval_const_expressions(root, (Node *) pcqual);

            /* Fix Vars to have the desired varno */
            if (varno != 1)
                ChangeVarNodes((Node *) pcqual, 1, varno, 0);

            result = list_concat(result, pcqual);
        }
    }

We will no longer need to load the partition constraints for "other rel" partitions, not even for UPDATE and DELETE queries.  Now, we won't load them with the patch applied, because we're cheating by first planning the query as SELECT, so that's not an issue.  But we should change the condition here to check if the input relation is a "baserel", in which case, this should still load the partition constraint so that constraint exclusion can use it when running with constraint_exclusion = on.  In fact, I recently reported [1] on -hackers that we don't load the partition constraint even if the partition is being accessed directly, which is a bug introduced in PG 11.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/9813f079-f16b-61c8-9ab7-4363cab28d80%40lab.ntt.co.jp
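The PG 11 bug Amit cites --- the partition constraint not being loaded when a partition is accessed directly --- concerns queries like the following sketch (hypothetical table names, assuming constraint_exclusion = on):

```sql
SET constraint_exclusion = on;
CREATE TABLE r (a int) PARTITION BY RANGE (a);
CREATE TABLE r1 PARTITION OF r FOR VALUES FROM (1) TO (10);
-- "a = 100" contradicts r1's implicit partition constraint
-- (a >= 1 AND a < 10), so constraint exclusion should be able to prove
-- the scan empty --- but only if the constraint is actually loaded in
-- get_relation_constraints():
EXPLAIN SELECT * FROM r1 WHERE a = 100;
```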
RE: speeding up planning with partitions
On Fri, Mar 29, 2019 at 3:45 PM, Amit Langote wrote:
> Thanks a lot for hacking on the patch.  I'm really happy with the
> direction you took for inheritance_planner, as it allows UPDATE/DELETE
> to use partition pruning.

I was astonished by Tom's awesome work and really thank him.

> Certainly.  Note that previously we'd always scan *all* hash partitions
> for UPDATE and DELETE queries, because constraint exclusion can't exclude
> hash partitions due to the shape of their partition constraint.
>
> I ran my usual benchmark with up to 8192 partitions.
>
> N: 2..8192
>
> create table rt (a int, b int, c int) partition by range (a);
> select 'create table rt' || x::text || ' partition of rt for values from ('
>   || (x)::text || ') to (' || (x+1)::text || ');'
> from generate_series(1, N) x;
> \gexec
>
> update.sql:
>
> \set param random(1, N)
> update rt set a = 0 where a = :param;
>
> pgbench -n -T 120 -f update.sql
>
> nparts    v38    HEAD
> =====================
>      2   2971    2969
>      8   2980    1949
>     32   2955     733
>    128   2946     145
>    512   2924      11
>   1024   2986       3
>   4096   2702       0
>   8192   2531     OOM
>
> Obviously, you'll get similar numbers with hash or list partitioning.

I also ran the test for hash partitioning, just to make sure.

N: 2..8192

create table ht (a int, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values with (MODULUS N, REMAINDER ' || (x)::text || ');'
from generate_series(0, N-1) x;
\gexec

update.sql:

\set param random(1, N * 100)
update ht set b = b + 1 where a = :param;

pgbench -n -T 60 -f update.sql

[updating one partition]
 nparts     v38     HEAD
 =======================
      0   10538    10487
      2    6942     7028
      4    7043     5645
      8    6981     3954
     16    6932     2440
     32    6897     1243
     64    6897      309
    128    6753      120
    256    6727       46
    512    6708       12
   1024    6063        3
   2048    5894        1
   4096    5374      OOM
   8192    4572      OOM

The performance for hash is also improved, though the rate at which performance drops off with many partitions seems higher than for range partitioning.

Thanks
--
Imai Yoshikazu
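For context on the hash numbers: the reason constraint exclusion (the mechanism HEAD-before-this-patch used for UPDATE/DELETE) can never eliminate hash partitions is the shape of their implicit partition constraint --- a sketch:

```sql
CREATE TABLE ht (a int) PARTITION BY HASH (a);
CREATE TABLE ht0 PARTITION OF ht FOR VALUES WITH (MODULUS 2, REMAINDER 0);
-- psql's \d+ ht0 shows the partition constraint, which is roughly
--   satisfies_hash_partition('<parent oid>'::oid, 2, 0, a)
-- an opaque function call that predicate_refuted_by() cannot refute
-- from a qual such as "a = 5", unlike a range partition's
-- "a >= lo AND a < hi".
```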
Re: speeding up planning with partitions
Thanks a lot for hacking on the patch.  I'm really happy with the direction you took for inheritance_planner, as it allows UPDATE/DELETE to use partition pruning.

On 2019/03/29 7:38, Tom Lane wrote:
> I've been hacking on this pretty hard over the last couple days,
> because I really didn't like the contortions you'd made to allow
> inheritance_planner to call expand_inherited_rtentry in a completely
> different context than the regular code path did.  I eventually got
> rid of that

Good riddance.

> by having inheritance_planner run one cycle of planning
> the query as if it were a SELECT, and extracting the list of unpruned
> children from that.

This is somewhat like my earlier patch that we decided not to pursue, minus all the hackery within query_planner() that was in that patch, which is great.  (I can't find the link, but David Rowley had posted a patch for allowing UPDATE/DELETE to use partition pruning in the late stages of PG 11 development, which had taken a similar approach.)

> I had to rearrange the generation of the final
> rtable a bit to make that work, but I think that inheritance_planner
> winds up somewhat cleaner and safer this way; it's making (slightly)
> fewer assumptions about how much the results of planning the child
> queries can vary.
>
> Perhaps somebody will object that that's one more planning pass than
> we had before, but I'm not very concerned, because
> (a) at least for partitioned tables that we can prune successfully,
> this should still be better than v11, since we avoid the planning
> passes for pruned children.

Certainly.  Note that previously we'd always scan *all* hash partitions for UPDATE and DELETE queries, because constraint exclusion can't exclude hash partitions due to the shape of their partition constraint.

I ran my usual benchmark with up to 8192 partitions.
N: 2..8192

create table rt (a int, b int, c int) partition by range (a);
select 'create table rt' || x::text || ' partition of rt for values from ('
  || (x)::text || ') to (' || (x+1)::text || ');'
from generate_series(1, N) x;
\gexec

update.sql:

\set param random(1, N)
update rt set a = 0 where a = :param;

pgbench -n -T 120 -f update.sql

 nparts    v38    HEAD
 =====================
      2   2971    2969
      8   2980    1949
     32   2955     733
    128   2946     145
    512   2924      11
   1024   2986       3
   4096   2702       0
   8192   2531     OOM

Obviously, you'll get similar numbers with hash or list partitioning.

> (b) inheritance_planner is horridly inefficient anyway, in that it
> has to run a near-duplicate planning pass for each child table.
> If we're concerned about its cost, we should be working to get rid of
> the function altogether, as per [1].  In the meantime, I do not want
> to contort other code to make life easier for inheritance_planner.

Agreed.

> There's still some loose ends:
>
> 1. I don't like 0003 much, and omitted it from the attached.
> I think that what we ought to be doing instead is not having holes
> in the rel_parts[] arrays to begin with, ie they should only include
> the unpruned partitions.  If we are actually encoding any important
> information in those array positions, I suspect that is broken anyway
> in view of 898e5e329: we can't assume that the association of child
> rels with particular PartitionDesc slots will hold still from planning
> to execution.

It's useful for the part_rels array to be indexed the same way as the PartitionDesc.  Firstly, because the partition pruning code returns the PartitionDesc-defined indexes of unpruned partitions.  Secondly, if the partitionwise join code decides that two partitioned tables are compatible for partitionwise joining, then it must join partitions that have identical *PartitionDesc* indexes, which it does by looping over the part_rels arrays of both sides in one loop.
Regarding the impact of 898e5e329 on this, I think it invented PartitionDirectory exactly to keep a PartitionDesc changing under us from affecting the planning or execution of a given query.  As for PartitionDesc indexes being different between planning and execution, that only affects PartitionPruneInfo, and the commit did make changes to ExecCreatePartitionPruningState to remap the old indexes of unpruned partitions in PartitionPruneInfo (as they were during planning) to the new ones.

> 2. I seriously dislike what's been done in joinrels.c, too.  That
> really seems like a kluge (and I haven't had time to study it
> closely).

Those hunks account for the fact that pruned partitions, for which we no longer create RangeTblEntry and RelOptInfo, may appear on the nullable side of an outer join.  We'll need a RelOptInfo holding a dummy path, so that outer join paths can be created with one side of the join being a dummy result path; these are built in the patch by build_dummy_partition_rel().

> 3. It's not entirely clear to me why the patch has to touch
> execPartition.c.  That implies that the planner-to-executor API
> changed, but how so, and why is there no comment update
Re: speeding up planning with partitions
I've been hacking on this pretty hard over the last couple days, because I really didn't like the contortions you'd made to allow inheritance_planner to call expand_inherited_rtentry in a completely different context than the regular code path did. I eventually got rid of that by having inheritance_planner run one cycle of planning the query as if it were a SELECT, and extracting the list of unpruned children from that. I had to rearrange the generation of the final rtable a bit to make that work, but I think that inheritance_planner winds up somewhat cleaner and safer this way; it's making (slightly) fewer assumptions about how much the results of planning the child queries can vary. Perhaps somebody will object that that's one more planning pass than we had before, but I'm not very concerned, because (a) at least for partitioned tables that we can prune successfully, this should still be better than v11, since we avoid the planning passes for pruned children. (b) inheritance_planner is horridly inefficient anyway, in that it has to run a near-duplicate planning pass for each child table. If we're concerned about its cost, we should be working to get rid of the function altogether, as per [1]. In the meantime, I do not want to contort other code to make life easier for inheritance_planner. There's still some loose ends: 1. I don't like 0003 much, and omitted it from the attached. I think that what we ought to be doing instead is not having holes in the rel_parts[] arrays to begin with, ie they should only include the unpruned partitions. If we are actually encoding any important information in those array positions, I suspect that is broken anyway in view of 898e5e329: we can't assume that the association of child rels with particular PartitionDesc slots will hold still from planning to execution. 2. I seriously dislike what's been done in joinrels.c, too. That really seems like a kluge (and I haven't had time to study it closely). 3. 
It's not entirely clear to me why the patch has to touch execPartition.c.  That implies that the planner-to-executor API changed, but how so, and why is there no comment update clarifying it?

Given the short amount of time left in this CF, there may not be time to address the first two points, and I won't necessarily insist that those be changed before committing.  I'd like at least a comment about point 3 though.

Attached is the updated patch as a single patch --- I didn't think the division into multiple patches was terribly helpful, due to the flapping in expected regression results.

			regards, tom lane

[1] https://www.postgresql.org/message-id/357.1550612...@sss.pgh.pa.us

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 9ad9035..310c715 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -7116,9 +7116,9 @@ select * from bar where f1 in (select f1 from foo) for update;
                                        QUERY PLAN
 ------------------------------------------------------------------------------------------
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+   Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
   ->  Hash Join
-         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+         Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
         Inner Unique: true
         Hash Cond: (bar.f1 = foo.f1)
         ->  Append
@@ -7128,15 +7128,15 @@ select * from bar where f1 in (select f1 from foo) for update;
               Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
               Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
         ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo.ctid, foo.f1, foo.*, foo.tableoid
               ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                     Group Key: foo.f1
                     ->  Append
                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                                 Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                           ->  Foreign Scan on public.foo2
-                                 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
+                                 Output: foo2.ctid, foo2.f1, foo2.*, foo2.tableoid
                                 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
 (23 rows)

@@ -7154,9 +7154,9 @@ select * from bar where f1 in (select f1 from foo) for share;
Re: speeding up planning with partitions
On 2019/03/28 14:03, Tom Lane wrote:
> Amit Langote writes:
>> On 2019/03/27 23:57, Tom Lane wrote:
>>> Yeah, there's something to be said for having plancat.c open each table
>>> *and store the Relation pointer in the RelOptInfo*, and then close that
>>> again at the end of planning rather than immediately.  If we can't avoid
>>> these retail table_opens without a great deal of pain, that's the
>>> direction I'd tend to go in.  However it would add some overhead, in
>>> the form of a need to traverse the RelOptInfo array an additional time.
>
>> Just to be sure, do you mean we should do that now or later (David said
>> "in the long term")?
>
> It's probably not high priority, though I wonder how much time is being
> eaten by the repeated table_opens.

It took me a couple of hours to confirm this, but it seems significantly less than the time it takes to go over the simple_rel_array one more time to close the relations.  The scan of simple_rel_array to close the relations needs to be done once per query_planner() invocation, whereas I'd hoped this could be done, say, in add_rtes_to_flat_rtable(), which has to iterate over the range table anyway.  That's because we call query_planner multiple times on the same query in a few cases, viz. build_minmax_path(), inheritance_planner().

Within query_planner(), a table is opened in expand_inherited_rtentry(), get_relation_constraints(), get_relation_data_width(), and build_physical_tlist(), of which the first two are more common.  So, we end up doing table_open() 3 times on average for any given relation, just inside query_planner().

Thanks,
Amit
Re: speeding up planning with partitions
Amit Langote writes: > On 2019/03/27 23:57, Tom Lane wrote: >> Yeah, there's something to be said for having plancat.c open each table >> *and store the Relation pointer in the RelOptInfo*, and then close that >> again at the end of planning rather than immediately. If we can't avoid >> these retail table_opens without a great deal of pain, that's the >> direction I'd tend to go in. However it would add some overhead, in >> the form of a need to traverse the RelOptInfo array an additional time. > Just to be sure, do you mean we should do that now or later (David said > "in the long term")? It's probably not high priority, though I wonder how much time is being eaten by the repeated table_opens. regards, tom lane
Re: speeding up planning with partitions
On 2019/03/27 23:57, Tom Lane wrote: > David Rowley writes: >> On Wed, 27 Mar 2019 at 18:39, Amit Langote >> wrote: >>> On 2019/03/27 14:26, David Rowley wrote: Perhaps the way to make this work, at least in the long term is to do in the planner what we did in the executor in d73f4c74dd34. > >>> Maybe you meant 9ddef36278a9? > >> Probably. > > Yeah, there's something to be said for having plancat.c open each table > *and store the Relation pointer in the RelOptInfo*, and then close that > again at the end of planning rather than immediately. If we can't avoid > these retail table_opens without a great deal of pain, that's the > direction I'd tend to go in. However it would add some overhead, in > the form of a need to traverse the RelOptInfo array an additional time. Just to be sure, do you mean we should do that now or later (David said "in the long term")? Thanks, Amit
Re: speeding up planning with partitions
David Rowley writes: > On Wed, 27 Mar 2019 at 18:39, Amit Langote > wrote: >> On 2019/03/27 14:26, David Rowley wrote: >>> Perhaps the way to make this work, at least in the long term is to do >>> in the planner what we did in the executor in d73f4c74dd34. >> Maybe you meant 9ddef36278a9? > Probably. Yeah, there's something to be said for having plancat.c open each table *and store the Relation pointer in the RelOptInfo*, and then close that again at the end of planning rather than immediately. If we can't avoid these retail table_opens without a great deal of pain, that's the direction I'd tend to go in. However it would add some overhead, in the form of a need to traverse the RelOptInfo array an additional time. regards, tom lane
Re: speeding up planning with partitions
On Wed, 27 Mar 2019 at 18:39, Amit Langote wrote: > > On 2019/03/27 14:26, David Rowley wrote: > > Perhaps the way to make this work, at least in the long term is to do > > in the planner what we did in the executor in d73f4c74dd34. > > Maybe you meant 9ddef36278a9? Probably. > What would be nice is being able to readily access Relation pointers of > all tables accessed in a query from anywhere in the planner, whereby, a > given table is opened only once. Well, yeah, that's what the commit did for the executor, so it is what I was trying to get at. > Note that Tom complained upthread that the patch is introducing > table_open()'s at random points within the planner, which is something to > avoid. Yip. :) hence my suggestion. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: speeding up planning with partitions
On 2019/03/27 14:26, David Rowley wrote: > On Wed, 27 Mar 2019 at 18:13, Amit Langote > wrote: >> >> On 2019/03/27 13:50, Amit Langote wrote: >>> On 2019/03/27 12:06, Amit Langote wrote: I wonder if I should rework inherit.c so that its internal interfaces don't pass around parent Relation, but make do with the RelOptInfo? I'll need to add tupdesc and reltype fields to RelOptInfo to go ahead with that though. >>> >>> To give more context on the last sentence, the only place we need the >>> TupleDesc for is make_inh_translation_list(). Maybe, instead of adding >>> tupdesc to RelOptInfo, we could add a List of Vars of all attributes of a >>> table. To avoid the overhead of always setting it, maybe we could set it >>> only for inheritance parent tables. Adding tupdesc to RelOptInfo might be >>> a hard sell to begin with, because it would need to be pinned and then >>> released at the end of planning. >>> >>> I'll see if I can make this work. >> >> Hmm, scratch that. We'd need to keep around the attribute names too for >> comparing with the attribute names of the children during translation. >> Maybe we could store TargetEntry's in the list instead of just Vars, but >> maybe that's too much. >> >> Furthermore, if we make make_inh_translation_list too dependent on this >> stuff, that will make it hard to use from inheritance_planner(), where we >> expand the target inheritance without having created any RelOptInfos. > > Perhaps the way to make this work, at least in the long term is to do > in the planner what we did in the executor in d73f4c74dd34. Maybe you meant 9ddef36278a9? What would be nice is being able to readily access Relation pointers of all tables accessed in a query from anywhere in the planner, whereby, a given table is opened only once. Note that Tom complained upthread that the patch is introducing table_open()'s at random points within the planner, which is something to avoid. 
> I'm not > quite sure how exactly you'd know what size to make the array, but if > it's not something that's happening for PG12, then maybe we can just > use a List, once we get array based versions of them, at least. We're going to have to expand the PlannerInfo arrays (simple_rel_*, append_rel_array, etc.) with the patches here anyway. Maybe, it will be just one more array to consider? Thanks, Amit
Re: speeding up planning with partitions
On Wed, 27 Mar 2019 at 18:13, Amit Langote wrote: > > On 2019/03/27 13:50, Amit Langote wrote: > > On 2019/03/27 12:06, Amit Langote wrote: > >> I wonder if I should rework inherit.c so that its internal interfaces > >> don't pass around parent Relation, but make do with the RelOptInfo? I'll > >> need to add tupdesc and reltype fields to RelOptInfo to go ahead with that > >> though. > > > > To give more context on the last sentence, the only place we need the > > TupleDesc for is make_inh_translation_list(). Maybe, instead of adding > > tupdesc to RelOptInfo, we could add a List of Vars of all attributes of a > > table. To avoid the overhead of always setting it, maybe we could set it > > only for inheritance parent tables. Adding tupdesc to RelOptInfo might be > > a hard sell to begin with, because it would need to be pinned and then > > released at the end of planning. > > > > I'll see if I can make this work. > > Hmm, scratch that. We'd need to keep around the attribute names too for > comparing with the attribute names of the children during translation. > Maybe we could store TargetEntry's in the list instead of just Vars, but > maybe that's too much. > > Furthermore, if we make make_inh_translation_list too dependent on this > stuff, that will make it hard to use from inheritance_planner(), where we > expand the target inheritance without having created any RelOptInfos. Perhaps the way to make this work, at least in the long term is to do in the planner what we did in the executor in d73f4c74dd34. I'm not quite sure how exactly you'd know what size to make the array, but if it's not something that's happening for PG12, then maybe we can just use a List, once we get array based versions of them, at least. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: speeding up planning with partitions
On 2019/03/27 13:50, Amit Langote wrote: > On 2019/03/27 12:06, Amit Langote wrote: >> I wonder if I should rework inherit.c so that its internal interfaces >> don't pass around parent Relation, but make do with the RelOptInfo? I'll >> need to add tupdesc and reltype fields to RelOptInfo to go ahead with that >> though. > > To give more context on the last sentence, the only place we need the > TupleDesc for is make_inh_translation_list(). Maybe, instead of adding > tupdesc to RelOptInfo, we could add a List of Vars of all attributes of a > table. To avoid the overhead of always setting it, maybe we could set it > only for inheritance parent tables. Adding tupdesc to RelOptInfo might be > a hard sell to begin with, because it would need to be pinned and then > released at the end of planning. > > I'll see if I can make this work. Hmm, scratch that. We'd need to keep around the attribute names too for comparing with the attribute names of the children during translation. Maybe we could store TargetEntry's in the list instead of just Vars, but maybe that's too much. Furthermore, if we make make_inh_translation_list too dependent on this stuff, that will make it hard to use from inheritance_planner(), where we expand the target inheritance without having created any RelOptInfos. Thanks, Amit
Re: speeding up planning with partitions
On 2019/03/27 12:06, Amit Langote wrote: > I wonder if I should rework inherit.c so that its internal interfaces > don't pass around parent Relation, but make do with the RelOptInfo? I'll > need to add tupdesc and reltype fields to RelOptInfo to go ahead with that > though. To give more context on the last sentence, the only place we need the TupleDesc for is make_inh_translation_list(). Maybe, instead of adding tupdesc to RelOptInfo, we could add a List of Vars of all attributes of a table. To avoid the overhead of always setting it, maybe we could set it only for inheritance parent tables. Adding tupdesc to RelOptInfo might be a hard sell to begin with, because it would need to be pinned and then released at the end of planning. I'll see if I can make this work. Thanks, Amit
Re: speeding up planning with partitions
On 2019/03/27 1:06, Tom Lane wrote: > Amit Langote writes: >> 0002 is a new patch to get rid of the duplicate RTE and PlanRowMark that's >> created for partitioned parent table, as it's pointless. I was planning >> to propose it later, but decided to post it now if we're modifying the >> nearby code anyway. > > Good idea, but it needs a bit more work on nearby comments, and the > logic in expand_single_inheritance_child could be simplified, and > you evidently didn't try check-world because the RTE-removal causes > cosmetic changes in postgres_fdw.out. I fixed that up and pushed > this part. Thanks. Sorry that I left a lot to be done. Regards, Amit
Re: speeding up planning with partitions
Amit Langote writes: > On 2019/03/23 6:07, Tom Lane wrote: >> I also feel like you used a dartboard while deciding where to insert the >> call in query_planner(); dropping it into the middle of a sequence of >> equivalence-class-related operations seems quite random. Maybe we could >> delay that step all the way to just before make_one_rel, since the other >> stuff in between seems to only care about baserels? For example, >> it'd certainly be better if we could run remove_useless_joins before >> otherrel expansion, so that we needn't do otherrel expansion at all >> on a table that gets thrown away as being a useless join. > create_lateral_join_info() expects otherrels to be present to propagate > the information it creates. Sure, but the actually-useful work in that function is just concerned with baserels. The only thing it does with otherrels is to propagate parents' lateral-ref info to children, which is a lot easier, cheaper, and more reliable if we let build_simple_rel do it. See the version of 0001 I just pushed. I'll look at 0003 and up tomorrow. regards, tom lane
Re: speeding up planning with partitions
Amit Langote writes: > 0002 is a new patch to get rid of the duplicate RTE and PlanRowMark that's > created for partitioned parent table, as it's pointless. I was planning > to propose it later, but decided to post it now if we're modifying the > nearby code anyway. Good idea, but it needs a bit more work on nearby comments, and the logic in expand_single_inheritance_child could be simplified, and you evidently didn't try check-world because the RTE-removal causes cosmetic changes in postgres_fdw.out. I fixed that up and pushed this part. regards, tom lane
RE: speeding up planning with partitions
Amit-san, On Tue, Mar 26, 2019 at 7:17 AM, Amit Langote wrote: > Rebased patches attached. I could only do the code review of v36-0001. On Mon, Mar 25, 2019 at 11:35 AM, Amit Langote wrote: > On 2019/03/23 6:07, Tom Lane wrote: > > Amit Langote writes: > >> [ v34 patch set ] > > > > I had a bit of a look through this. I went ahead and pushed 0008 and > > 0009, as they seem straightforward and independent of the rest. (BTW, > > 0009 makes 0003's dubious optimization in set_relation_partition_info > > quite unnecessary.) As for the rest: > > > > 0001: OK in principle, but I wonder why you implemented it by adding > > another recursive scan of the jointree rather than just iterating > > over the baserel array, as in make_one_rel() for instance. Seems > > like this way is more work, and more code that we'll have to touch > > if we ever change the jointree representation. > > I've changed this to work by iterating over baserel array. I was mostly > worried about looping over potentially many elements of baserel array that > we simply end up skipping, but considering the other patch that delays > adding inheritance children, we don't need to worry about that. > > > I also feel like you used a dartboard while deciding where to insert the > > call in query_planner(); dropping it into the middle of a sequence of > > equivalence-class-related operations seems quite random. Maybe we could > > delay that step all the way to just before make_one_rel, since the other > > stuff in between seems to only care about baserels? For example, > > it'd certainly be better if we could run remove_useless_joins before > > otherrel expansion, so that we needn't do otherrel expansion at all > > on a table that gets thrown away as being a useless join. > > create_lateral_join_info() expects otherrels to be present to propagate > the information it creates. > > I have moved add_other_rels_to_query() followed by > create_lateral_join_info() to just before make_one_rel(). 
I checked that the 0001 patch correctly modifies the things discussed above. The only problem I found is a little typo: s/propgated/propagated/ I'm sorry I'm short on time to review further for now... -- Yoshikazu Imai
RE: speeding up planning with partitions
On Fri, Mar 22, 2019 at 9:07 PM, Tom Lane wrote: > BTW, it strikes me that we could take advantage of the fact that baserels > must all appear before otherrels in the rel array, by having loops over > that array stop early if they're only interested in baserels. We could > either save the index of the last baserel, or just extend the loop logic > to fall out upon hitting an otherrel. > Seems like this'd save some cycles when there are lots of partitions, > although perhaps loops like that would be fast enough to not matter. Actually, this speeds up planning time a little when scanning a lot of otherrels, e.g. when selecting from a table with thousands of partitions. I tested as below.

* rt with 8192 partitions
* execute "select * from rt;" for 60 seconds.

[results]
HEAD: 4.19 TPS (4.06, 4.34, 4.17)
(v34 patches) + (looping over only baserels): 4.26 TPS (4.31, 4.28, 4.19)

Attached is the diff I used for this test. -- Yoshikazu Imai looping-over-last-baserel-idx.diff Description: looping-over-last-baserel-idx.diff
Re: speeding up planning with partitions
I wrote: > ... I also > don't like the undocumented way that you've got build_base_rel_tlists > working on something that's not the final tlist, and then going back > and doing more work of the same sort later. I wonder whether we can > postpone build_base_rel_tlists until after the other stuff happens, > so that it can continue to do all of the tlist-distribution work. On further reflection: we don't really need the full build_base_rel_tlists pushups for these extra variables. The existence of the original rowmark variable will be sufficient, for example, to make sure that we don't decide the parent partitioned table can be thrown away as a useless join. All we need to do is add the new Var to the parent's reltarget and adjust its attr_needed, and we can do that very late in query_planner. So now I'm thinking leave build_base_rel_tlists() where it is, and then fix up the tlist/reltarget data on the fly when we discover that new child rels need more rowmark types than we had before. (This'd be another reason to keep the tlist in root->processed_tlist throughout, likely.) regards, tom lane
Re: speeding up planning with partitions
Amit Langote writes: > [ v34 patch set ] I had a bit of a look through this. I went ahead and pushed 0008 and 0009, as they seem straightforward and independent of the rest. (BTW, 0009 makes 0003's dubious optimization in set_relation_partition_info quite unnecessary.) As for the rest: 0001: OK in principle, but I wonder why you implemented it by adding another recursive scan of the jointree rather than just iterating over the baserel array, as in make_one_rel() for instance. Seems like this way is more work, and more code that we'll have to touch if we ever change the jointree representation. I also feel like you used a dartboard while deciding where to insert the call in query_planner(); dropping it into the middle of a sequence of equivalence-class-related operations seems quite random. Maybe we could delay that step all the way to just before make_one_rel, since the other stuff in between seems to only care about baserels? For example, it'd certainly be better if we could run remove_useless_joins before otherrel expansion, so that we needn't do otherrel expansion at all on a table that gets thrown away as being a useless join. BTW, it strikes me that we could take advantage of the fact that baserels must all appear before otherrels in the rel array, by having loops over that array stop early if they're only interested in baserels. We could either save the index of the last baserel, or just extend the loop logic to fall out upon hitting an otherrel. Seems like this'd save some cycles when there are lots of partitions, although perhaps loops like that would be fast enough to not matter. 0002: I seriously dislike this from a system-structure viewpoint. For one thing it makes root->processed_tlist rather moot, since it doesn't get set till after most of the work is done. (I don't know if there are any FDWs out there that look at processed_tlist, but they'd be broken by this if so. 
It'd be better to get rid of the field if we can, so that at least such breakage is obvious. Or, maybe, use root->processed_tlist as the communication field rather than having the tlist be a param to query_planner?) I also don't like the undocumented way that you've got build_base_rel_tlists working on something that's not the final tlist, and then going back and doing more work of the same sort later. I wonder whether we can postpone build_base_rel_tlists until after the other stuff happens, so that it can continue to do all of the tlist-distribution work. 0003: I haven't studied this in great detail, but it seems like there's some pretty random things in it, like this addition in inheritance_planner: + /* grouping_planner doesn't need to see the target children. */ + subroot->append_rel_list = list_copy(orig_append_rel_list); That sure seems like a hack unrelated to the purpose of the patch ... and since list_copy is a shallow copy, I'm not even sure it's safe. Won't we be mutating the contained (and shared) AppendRelInfos as we go along? 0004 and 0005: I'm not exactly thrilled about loading more layers of overcomplication into inheritance_planner, especially not when the complication then extends out into new global planner flags with questionable semantics. We should be striving to get rid of that function, as previously discussed, and I fear this is just throwing more roadblocks in the way of doing so. Can we skip these and come back to the question after we replace inheritance_planner with expand-at-the-bottom? 0006: Seems mostly OK, but I'm not very happy about the additional table_open calls it's throwing into random places in the planner. Those can be rather expensive. Why aren't we collecting the appropriate info during the initial plancat.c consultation of the relcache? 0007: Not really sold on this; it seems like it's going to be a net negative for cases where there aren't a lot of partitions. regards, tom lane
Re: speeding up planning with partitions
On Fri, 22 Mar 2019 at 20:39, Amit Langote wrote: > The problem is that make_partitionedrel_pruneinfo() does some work which > involves allocating arrays containing nparts elements and then looping > over the part_rels array to fill values in those arrays that will be used > by run-time pruning. It does all that even *before* it has checked that > run-time pruning will be needed at all, which if it is not, it will simply > discard the result of the aforementioned work. > > Patch 0008 of 0009 rearranges the code such that we check whether we will > need run-time pruning at all before doing any of the work that's necessary > for run-time to work. I had a quick look at the make_partitionedrel_pruneinfo() code earlier before this patch appeared and I agree that something like this could be done. I've not gone over the patch in detail though. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: speeding up planning with partitions
Hi Amit, On 3/22/19 3:39 AM, Amit Langote wrote: I took a stab at this. I think rearranging the code in make_partitionedrel_pruneinfo() slightly will take care of this. The problem is that make_partitionedrel_pruneinfo() does some work which involves allocating arrays containing nparts elements and then looping over the part_rels array to fill values in those arrays that will be used by run-time pruning. It does all that even *before* it has checked that run-time pruning will be needed at all, which if it is not, it will simply discard the result of the aforementioned work. Patch 0008 of 0009 rearranges the code such that we check whether we will need run-time pruning at all before doing any of the work that's necessary for run-time to work. Yeah, this is better :) Increasing the number of partitions now leads to a TPS difference that is within the noise level. Passes check-world. Best regards, Jesper
Re: speeding up planning with partitions
Jesper, Imai-san, Thanks for testing and reporting your findings. On 2019/03/21 23:10, Imai Yoshikazu wrote: > On 2019/03/20 23:25, Jesper Pedersen wrote:> Hi, > > My tests - using hash partitions - shows that the extra time is spent in > > make_partition_pruneinfo() for the relid_subplan_map variable. > > > >64 partitions: make_partition_pruneinfo() 2.52% > > 8192 partitions: make_partition_pruneinfo() 5.43% > > > > TPS goes down ~8% between the two. The test setup is similar to the > above. > > > > Given that Tom is planning to change the List implementation [1] in 13 I > > think using the palloc0 structure is ok for 12 before trying other > > implementation options. Hmm, relid_subplan_map's size should be constant (number of partitions scanned) even as the actual partition count grows. However, looking into make_partitionedrel_pruneinfo(), it seems that it's unconditionally allocating 3 arrays that all have nparts elements: subplan_map = (int *) palloc0(nparts * sizeof(int)); subpart_map = (int *) palloc0(nparts * sizeof(int)); relid_map = (Oid *) palloc0(nparts * sizeof(int)); So, that part has got to cost more as the partition count grows. This is the code for runtime pruning, which is not exercised in our tests, so it might seem odd that we're spending any time here at all. I've been told that we have to perform at least *some* work here if only to conclude that runtime pruning is not needed and it seems that above allocations occur before making that conclusion. Maybe, that's salvageable by rearranging this code a bit. David may be in a better position to help with that. > Thanks for testing. > Yeah, after code looking, I think bottleneck seems be lurking in another > place where this patch doesn't change. I also think the patch is ok as > it is for 12, and this problem will be fixed in 13. 
If this drop in performance can be attributed to the fact that having too many partitions starts stressing other parts of the backend like stats collector, lock manager, etc. as time passes, then I agree that we'll have to tackle them later. Thanks, Amit
Re: speeding up planning with partitions
Hi Jesper, On 2019/03/20 23:25, Jesper Pedersen wrote:> Hi, > > On 3/19/19 11:15 PM, Imai, Yoshikazu wrote: >> Here the details. >> >> [creating partitioned tables (with 1024 partitions)] >> drop table if exists rt; >> create table rt (a int, b int, c int) partition by range (a); >> \o /dev/null >> select 'create table rt' || x::text || ' partition of rt for values >> from (' || >> (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, >> 1024) x; >> \gexec >> \o >> >> [select1024.sql] >> \set a random (1, 1024) >> select * from rt where a = :a; >> >> [pgbench] >> pgbench -n -f select1024.sql -T 60 >> >> > > My tests - using hash partitions - shows that the extra time is spent in > make_partition_pruneinfo() for the relid_subplan_map variable. > >64 partitions: make_partition_pruneinfo() 2.52% > 8192 partitions: make_partition_pruneinfo() 5.43% > > TPS goes down ~8% between the two. The test setup is similar to the above. > > Given that Tom is planning to change the List implementation [1] in 13 I > think using the palloc0 structure is ok for 12 before trying other > implementation options. > > perf sent off-list. > > [1] https://www.postgresql.org/message-id/24783.1551568303%40sss.pgh.pa.us > > Best regards, > Jesper > > Thanks for testing. Yeah, after looking at the code, I think the bottleneck seems to be lurking in another place that this patch doesn't change. I also think the patch is OK as it is for 12, and this problem can be fixed in 13. -- Yoshikazu Imai
Re: speeding up planning with partitions
Hi Amit, On 3/19/19 8:41 PM, Amit Langote wrote: I have fixed all. Attached updated patches. The comments in the thread have been addressed, so I have put the CF entry into Ready for Committer. Best regards, Jesper
Re: speeding up planning with partitions
Hi, On 3/19/19 11:15 PM, Imai, Yoshikazu wrote: Here the details. [creating partitioned tables (with 1024 partitions)] drop table if exists rt; create table rt (a int, b int, c int) partition by range (a); \o /dev/null select 'create table rt' || x::text || ' partition of rt for values from (' || (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 1024) x; \gexec \o [select1024.sql] \set a random (1, 1024) select * from rt where a = :a; [pgbench] pgbench -n -f select1024.sql -T 60 My tests - using hash partitions - show that the extra time is spent in make_partition_pruneinfo() for the relid_subplan_map variable. 64 partitions: make_partition_pruneinfo() 2.52% 8192 partitions: make_partition_pruneinfo() 5.43% TPS goes down ~8% between the two. The test setup is similar to the above. Given that Tom is planning to change the List implementation [1] in 13 I think using the palloc0 structure is ok for 12 before trying other implementation options. perf sent off-list. [1] https://www.postgresql.org/message-id/24783.1551568303%40sss.pgh.pa.us Best regards, Jesper
RE: speeding up planning with partitions
Amit-san, On Wed, Mar 20, 2019 at 9:07 AM, Amit Langote wrote: > On 2019/03/20 17:36, Imai, Yoshikazu wrote: > > On Wed, Mar 20, 2019 at 8:21 AM, Amit Langote wrote: > >> On 2019/03/20 12:15, Imai, Yoshikazu wrote: > >>> [select1024.sql] > >>> \set a random (1, 1024) > >>> select * from rt where a = :a; > >>> > >>> [pgbench] > >>> pgbench -n -f select1024.sql -T 60 > >> > >> Thank you. > >> > >> Could you please try with running pgbench for a bit longer than 60 > seconds? > > > > I run pgbench for 180 seconds but there are still difference. > > Thank you very much. > > > 1024: 7,004 TPS > > 8192: 5,859 TPS > > > > > > I also tested for another number of partitions by running pgbench for > 60 seconds. > > > > num of partTPS > > --- - > > 128 7,579 > > 256 7,528 > > 512 7,512 > > 1024 7,257 (7274, 7246, 7252) > > 2048 6,718 (6627, 6780, 6747) > > 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s results) > > 8192 6,008 (6018, 5999, 6007) > > > > > > I checked whether there are the process which go through the number > of partitions, but I couldn't find. I'm really wondering why this > degradation happens. > > Indeed, it's quite puzzling why. Will look into this. I don't know whether it is useful, but I noticed the usage of get_tabstat_entry increased when many partitions are scanned. -- Yoshikazu Imai
Re: speeding up planning with partitions
Imai-san, On 2019/03/20 17:36, Imai, Yoshikazu wrote: > On Wed, Mar 20, 2019 at 8:21 AM, Amit Langote wrote: >> On 2019/03/20 12:15, Imai, Yoshikazu wrote: >>> [select1024.sql] >>> \set a random (1, 1024) >>> select * from rt where a = :a; >>> >>> [pgbench] >>> pgbench -n -f select1024.sql -T 60 >> >> Thank you. >> >> Could you please try with running pgbench for a bit longer than 60 seconds? > > I run pgbench for 180 seconds but there are still difference. Thank you very much. > 1024: 7,004 TPS > 8192: 5,859 TPS > > > I also tested for another number of partitions by running pgbench for 60 > seconds. > > num of partTPS > --- - > 128 7,579 > 256 7,528 > 512 7,512 > 1024 7,257 (7274, 7246, 7252) > 2048 6,718 (6627, 6780, 6747) > 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s results) > 8192 6,008 (6018, 5999, 6007) > > > I checked whether there are the process which go through the number of > partitions, but I couldn't find. I'm really wondering why this degradation > happens. Indeed, it's quite puzzling why. Will look into this. Thanks, Amit
RE: speeding up planning with partitions
Amit-san, On Wed, Mar 20, 2019 at 8:21 AM, Amit Langote wrote: > On 2019/03/20 12:15, Imai, Yoshikazu wrote: > > Here the details. > > > > [creating partitioned tables (with 1024 partitions)] drop table if > > exists rt; create table rt (a int, b int, c int) partition by range > > (a); \o /dev/null select 'create table rt' || x::text || ' partition > > of rt for values from (' || (x)::text || ') to (' || (x+1)::text || > > ');' from generate_series(1, 1024) x; \gexec \o > > > > [select1024.sql] > > \set a random (1, 1024) > > select * from rt where a = :a; > > > > [pgbench] > > pgbench -n -f select1024.sql -T 60 > > Thank you. > > Could you please try with running pgbench for a bit longer than 60 seconds? I ran pgbench for 180 seconds but there is still a difference.

1024: 7,004 TPS
8192: 5,859 TPS

I also tested other numbers of partitions by running pgbench for 60 seconds.

num of part   TPS
-----------   -----
128    7,579
256    7,528
512    7,512
1024   7,257 (7274, 7246, 7252)
2048   6,718 (6627, 6780, 6747)
4096   6,472 (6434, 6565, 6416) (quoted from above (3)'s results)
8192   6,008 (6018, 5999, 6007)

I checked whether there is any processing that iterates over all the partitions, but I couldn't find any. I'm really wondering why this degradation happens. Yoshikazu Imai
Re: speeding up planning with partitions
Imai-san, On 2019/03/20 12:15, Imai, Yoshikazu wrote: > Here the details. > > [creating partitioned tables (with 1024 partitions)] > drop table if exists rt; > create table rt (a int, b int, c int) partition by range (a); > \o /dev/null > select 'create table rt' || x::text || ' partition of rt for values from (' || > (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 1024) x; > \gexec > \o > > [select1024.sql] > \set a random (1, 1024) > select * from rt where a = :a; > > [pgbench] > pgbench -n -f select1024.sql -T 60 Thank you. Could you please try running pgbench for a bit longer than 60 seconds? Thanks, Amit
RE: speeding up planning with partitions
Amit-san, On Wed, Mar 20, 2019 at 3:01 PM, Amit Langote wrote: > > On Wed, Mar 20, 2019 at 2:34 AM, Amit Langote wrote: > >> On 2019/03/20 11:21, Imai, Yoshikazu wrote: > >>> (4) > >>> We expect the performance does not depend on the number of > >>> partitions > >> after applying all patches, if possible. > >>> > >>> num of partTPS > >>> --- - > >>> 1024 7,257 (7274, 7246, 7252) > >>> 2048 6,718 (6627, 6780, 6747) > >>> 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s > results) > >>> 8192 6,008 (6018, 5999, 6007) > >>> > >>> It seems the performance still depend on the number of partitions. > >>> At > >> the moment, I don't have any idea what cause this problem but can we > >> improve this more? > >> > >> I've noticed [1] this kind of degradation when the server is built > >> with Asserts enabled. Did you? > >> ... > >> [1] > >> > https://www.postgresql.org/message-id/a49372b6-c044-4ac8-84ea-90ad18 > >> b1770d%40lab.ntt.co.jp > > > > No. I did test again from configuring without --enable-cassert but > problem I mentioned still happens. > > Hmm, OK. Can you describe your test setup with more details? Here are the details.

[creating partitioned tables (with 1024 partitions)]
drop table if exists rt;
create table rt (a int, b int, c int) partition by range (a);
\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' || (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 1024) x;
\gexec
\o

[select1024.sql]
\set a random (1, 1024)
select * from rt where a = :a;

[pgbench]
pgbench -n -f select1024.sql -T 60

What I noticed so far is that it also might depend on the query. I created a table with 8192 partitions and ran select statements like "select * from a = :a (which ranges from 1 to 1024)" and "select * from a = :a (which ranges from 1 to 8192)", and the results of those were different. I'll send the perf data off-list. -- Yoshikazu Imai
Re: speeding up planning with partitions
On 2019/03/20 11:51, Imai, Yoshikazu wrote: > Amit-san, > > On Wed, Mar 20, 2019 at 2:34 AM, Amit Langote wrote: >> On 2019/03/20 11:21, Imai, Yoshikazu wrote: >>> (4) >>> We expect the performance does not depend on the number of partitions >> after applying all patches, if possible. >>> >>> num of partTPS >>> --- - >>> 1024 7,257 (7274, 7246, 7252) >>> 2048 6,718 (6627, 6780, 6747) >>> 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s results) >>> 8192 6,008 (6018, 5999, 6007) >>> >>> It seems the performance still depend on the number of partitions. At >> the moment, I don't have any idea what cause this problem but can we improve >> this more? >> >> I've noticed [1] this kind of degradation when the server is built with >> Asserts enabled. Did you? >> ... >> [1] >> https://www.postgresql.org/message-id/a49372b6-c044-4ac8-84ea-90ad18 >> b1770d%40lab.ntt.co.jp > > No. I did test again from configuring without --enable-cassert but problem I > mentioned still happens. Hmm, OK. Can you describe your test setup with more details? Thanks, Amit
RE: speeding up planning with partitions
Amit-san, On Wed, Mar 20, 2019 at 2:34 AM, Amit Langote wrote: > On 2019/03/20 11:21, Imai, Yoshikazu wrote: > > (4) > > We expect the performance does not depend on the number of partitions > after applying all patches, if possible. > > > > num of partTPS > > --- - > > 1024 7,257 (7274, 7246, 7252) > > 2048 6,718 (6627, 6780, 6747) > > 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s results) > > 8192 6,008 (6018, 5999, 6007) > > > > It seems the performance still depend on the number of partitions. At > the moment, I don't have any idea what cause this problem but can we improve > this more? > > I've noticed [1] this kind of degradation when the server is built with > Asserts enabled. Did you? > ... > [1] > https://www.postgresql.org/message-id/a49372b6-c044-4ac8-84ea-90ad18 > b1770d%40lab.ntt.co.jp No. I tested again after configuring without --enable-cassert, but the problem I mentioned still happens. -- Yoshikazu Imai
Re: speeding up planning with partitions
Imai-san, On 2019/03/20 11:21, Imai, Yoshikazu wrote: > (4) > We expect the performance does not depend on the number of partitions after > applying all patches, if possible. > > num of partTPS > --- - > 1024 7,257 (7274, 7246, 7252) > 2048 6,718 (6627, 6780, 6747) > 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s results) > 8192 6,008 (6018, 5999, 6007) > > It seems the performance still depend on the number of partitions. At the > moment, I don't have any idea what cause this problem but can we improve this > more? I've noticed [1] this kind of degradation when the server is built with Asserts enabled. Did you? Thanks, Amit [1] https://www.postgresql.org/message-id/a49372b6-c044-4ac8-84ea-90ad18b1770d%40lab.ntt.co.jp
RE: speeding up planning with partitions
Amit-san, On Wed, Mar 20, 2019 at 0:42 AM, Amit Langote wrote: > On 2019/03/19 20:13, Imai, Yoshikazu wrote: > > Thanks for new patches. > > I looked over them and there are little comments. > > > > ... > > > > I have no more comments about codes other than above :) > > I have fixed all. Attached updated patches. Thanks for the quick modification. I ran the performance tests again; I did the four tests below. Judging from the results of the last test, there might be some room to improve performance further.

(1) v33-0003 slows down queries when there are inherited tables in UPDATE/DELETE's FROM clause. v33-0004 fixes this problem.
* rt with 32 partitions.
* jrt with 32 partitions.
* update rt set c = jrt.c from jrt where rt.b = jrt.b;

patch    TPS
-----    -----
master   66.8 (67.2, 66.8, 66.4)
0003     47.5 (47.2, 47.6, 47.7)
0004     68.8 (69.2, 68.9, 68.4)

It seems to fix the performance problem correctly.

(2) v33-0005 speeds up UPDATE/DELETE by removing the useless copy of child target RTEs in adjust_appendrel_attrs().
* rt with 32 partitions.
* update rt set b = b + 1;

patch    TPS
-----    -----
master   986 (999, 984, 974)
0005     1,589 (1608, 1577, 1582)

It seems to speed up performance as expected.

(3) v33-0006, 0007, 0008 speed up the case when few partitions are scanned among many partitions.
* rt with 4096 partitions, partitioned by column 'a'.
* select rt where rt.a = :a (:a ranges from 1 to 4096)

patch    TPS
-----    -----
master   22.4 (22.4, 22.5, 22.2)
0005     20.9 (20.9, 21.2, 20.6)
0006     2,951 (2958, 2958, 2937)
0007     3,141 (3146, 3141, 3136)
0008     6,472 (6434, 6565, 6416)

Certainly, these patches seem to speed up this case.

(4) We expect the performance not to depend on the number of partitions after applying all patches, if possible.

num of part   TPS
-----------   -----
1024   7,257 (7274, 7246, 7252)
2048   6,718 (6627, 6780, 6747)
4096   6,472 (6434, 6565, 6416) (quoted from above (3)'s results)
8192   6,008 (6018, 5999, 6007)

It seems the performance still depends on the number of partitions.
At the moment, I don't have any idea what causes this problem, but can we improve this more? -- Yoshikazu Imai
Re: speeding up planning with partitions
On 2019/03/20 9:49, Robert Haas wrote: > On Fri, Mar 8, 2019 at 4:18 AM Amit Langote > wrote: >> Maybe you know that range_table_mutator() spends quite a long time if >> there are many target children, but I realized there's no need for >> range_table_mutator() to copy/mutate child target RTEs. First, there's >> nothing to translate in their case. Second, copying them is not necessary >> too, because they're not modified across different planning cycles. If we >> pass to adjust_appendrel_attrs only the RTEs in the original range table >> (that is, before any child target RTEs were added), then >> range_table_mutator() has to do significantly less work and allocates lot >> less memory than before. I've broken this change into its own patch; see >> patch 0004. > > Was just glancing over 0001: > > - * every non-join RTE that is used in the query. Therefore, this routine > - * is the only place that should call build_simple_rel with reloptkind > - * RELOPT_BASEREL. (Note: build_simple_rel recurses internally to build > - * "other rel" RelOptInfos for the members of any appendrels we find here.) > + * every non-join RTE that is specified in the query. Therefore, this > + * routine is the only place that should call build_simple_rel with > + * reloptkind RELOPT_BASEREL. (Note: "other rel" RelOptInfos for the > + * members of any appendrels we find here are built later when query_planner > + * calls add_other_rels_to_query().) > > Used -> specified doesn't seem to change the meaning, so I'm not sure > what the motivation is there. Well, I thought it would clarify that now add_base_rels_to_query() only adds "baserel" RelOptInfos, that is, those for the relations that are directly mentioned in the query. Maybe: ...that is mentioned in the query. or ...that is directly mentioned in the query. ? Thanks, Amit
Re: speeding up planning with partitions
On Fri, Mar 8, 2019 at 4:18 AM Amit Langote wrote: > Maybe you know that range_table_mutator() spends quite a long time if > there are many target children, but I realized there's no need for > range_table_mutator() to copy/mutate child target RTEs. First, there's > nothing to translate in their case. Second, copying them is not necessary > too, because they're not modified across different planning cycles. If we > pass to adjust_appendrel_attrs only the RTEs in the original range table > (that is, before any child target RTEs were added), then > range_table_mutator() has to do significantly less work and allocates lot > less memory than before. I've broken this change into its own patch; see > patch 0004. Was just glancing over 0001: - * every non-join RTE that is used in the query. Therefore, this routine - * is the only place that should call build_simple_rel with reloptkind - * RELOPT_BASEREL. (Note: build_simple_rel recurses internally to build - * "other rel" RelOptInfos for the members of any appendrels we find here.) + * every non-join RTE that is specified in the query. Therefore, this + * routine is the only place that should call build_simple_rel with + * reloptkind RELOPT_BASEREL. (Note: "other rel" RelOptInfos for the + * members of any appendrels we find here are built later when query_planner + * calls add_other_rels_to_query().) Used -> specified doesn't seem to change the meaning, so I'm not sure what the motivation is there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
RE: speeding up planning with partitions
Amit-san, On Tue, Mar 19, 2019 at 6:53 AM, Amit Langote wrote: > On 2019/03/15 9:33, Imai, Yoshikazu wrote: > > On Thu, Mar 14, 2019 at 9:04 AM, Amit Langote wrote: > >>> * In inheritance_planner(), we do ChangeVarNodes() only for > >>> orig_rtable, > >> so the codes concatenating lists of append_rel_list may be able to > be > >> moved before doing ChangeVarNodes() and then the codes concatenating > >> lists of rowmarks, rtable and append_rel_list can be in one block (or > >> one bunch). > >> > >> Yeah, perhaps. I'll check. > > I'm inclined to add source_appinfos to subroot->append_rel_list after > finishing the ChangeVarNodes(subroot->append_rel_list) step, because if > there are many entries in source_appinfos that would unnecessarily make > ChangeVarNodes take longer. OK, thanks. > I've attached updated patches. In the new version, I've moved some code > from 0004 to 0005 patch, so as to avoid mixing unrelated modifications > in one patch. Especially, orig_rtable now only appears after applying > 0005. > > I appreciate your continued interest in these patches. Thanks for the new patches. I looked over them and have a few small comments. [0002] * s/regresion/regression/g (in commit message.) [0003] * I thought "inh flag is it" should be "inh flag is set"...? +* For RTE_RELATION rangetable entries whose inh flag is it, adjust the * The comments below are correct only after applying 0004. +* the query's target relation and no other. Especially, children of any +* source relations are added when the loop below calls grouping_planner +* on the *1st* child target relation. [0004] * s/contain contain/contain/ +* will contain contain references to the subquery RTEs that we've * s/find them children/find their children/ +* AppendRelInfos needed to find them children, so the next [0006] * s/recursivly/recursively/ (in commit message) I have no further comments on the code beyond the above :) -- Yoshikazu Imai
RE: speeding up planning with partitions
Amit-san, On Mon, Mar 18, 2019 at 9:56 AM, Amit Langote wrote: > On 2019/03/15 14:40, Imai, Yoshikazu wrote: > > Amit-san, > > > > I have another little comments about v31-patches. > > > > * We don't need is_first_child in inheritance_planner(). > > > > On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote: > >> On 2019/03/08 16:16, Imai, Yoshikazu wrote: > >>> I attached the diff of modification for v26-0003 patch which also > >> contains some refactoring. > >>> Please see if it is ok. > >> > >> I like the is_first_child variable which somewhat improves > >> readability, so updated the patch to use it. > > > > I noticed that detecting first child query in inheritance_planner() > is already done by "final_rtable == NIL" > > > > /* > > * If this is the first non-excluded child, its post-planning > rtable > > * becomes the initial contents of final_rtable; otherwise, > append > > * just its modified subquery RTEs to final_rtable. > > */ > > if (final_rtable == NIL) > > > > So I think we can use that instead of using is_first_child. > > That's a good suggestion. One problem I see with that is that > final_rtable is set only once we've found the first *non-dummy* child. Ah... I overlooked that. > So, if all the children except the last one are dummy, we'd end up never > reusing the source child objects. Now, if final_rtable was set for the > first child irrespective of whether it turned out to be dummy or not, > which it seems OK to do, then we can implement your suggestion. I thought you meant we could remove, or move, the code below to after the setting of final_rtable. /* * If this child rel was excluded by constraint exclusion, exclude it * from the result plan. */ if (IS_DUMMY_REL(sub_final_rel)) continue; It seems logically OK, but I also think there are cases where useless processing happens. If we execute the update statement below, final_rtable would be unnecessarily expanded by adding placeholders. * partition table rt with 1024 partitions. * partition table pt with 2 partitions.
* update rt set c = ss.c from ( select b from pt union all select b + 3 from pt) ss where a = 1024 and rt.b = ss.b; After all, I think it might be better to use is_first_child, because the meanings of is_first_child and final_rtable are different: is_first_child == true represents that we are currently processing the first child query, while final_rtable == NIL represents that we have not yet found the first non-excluded child query. -- Yoshikazu Imai
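The distinction discussed above can be sketched in a few lines of standalone C. This is not inheritance_planner()'s real state; dummy[] is a hypothetical stand-in for children excluded by constraint exclusion, and the two functions model the two ways of detecting the "first child":

```c
/*
 * Toy model of the "first child" detection discussed above.
 * NOT PostgreSQL code: dummy[i] stands in for "child i was excluded
 * by constraint exclusion" (IS_DUMMY_REL in the real planner).
 */
#include <stdbool.h>

/* An is_first_child flag fires on the very first iteration, period. */
static int
first_by_flag(const bool *dummy, int nchildren)
{
    (void) dummy;               /* excluded-or-not is irrelevant here */
    return nchildren > 0 ? 0 : -1;
}

/*
 * A "final_rtable == NIL" style sentinel only fires at the first
 * NON-dummy child, because dummy children are skipped before the
 * sentinel is ever set.
 */
static int
first_by_sentinel(const bool *dummy, int nchildren)
{
    for (int i = 0; i < nchildren; i++)
        if (!dummy[i])
            return i;
    return -1;                  /* all children dummy: never fires */
}
```

The two agree only when the first child is non-dummy; when early children are dummy (or all of them are), the sentinel-based test fires later or never, which is the problem Amit points out with reusing it for source-object setup.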
Re: speeding up planning with partitions
Imai-san, On 2019/03/15 14:40, Imai, Yoshikazu wrote: > Amit-san, > > I have another little comments about v31-patches. > > * We don't need is_first_child in inheritance_planner(). > > On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote: >> On 2019/03/08 16:16, Imai, Yoshikazu wrote: >>> I attached the diff of modification for v26-0003 patch which also >> contains some refactoring. >>> Please see if it is ok. >> >> I like the is_first_child variable which somewhat improves readability, >> so updated the patch to use it. > > I noticed that detecting first child query in inheritance_planner() is > already done by "final_rtable == NIL" > > /* > * If this is the first non-excluded child, its post-planning rtable > * becomes the initial contents of final_rtable; otherwise, append > * just its modified subquery RTEs to final_rtable. > */ > if (final_rtable == NIL) > > So I think we can use that instead of using is_first_child. That's a good suggestion. One problem I see with that is that final_rtable is set only once we've found the first *non-dummy* child. So, if all the children except the last one are dummy, we'd end up never reusing the source child objects. Now, if final_rtable was set for the first child irrespective of whether it turned out to be dummy or not, which it seems OK to do, then we can implement your suggestion. Thanks, Amit
RE: speeding up planning with partitions
Amit-san, I have another small comment about the v31 patches. * We don't need is_first_child in inheritance_planner(). On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote: > On 2019/03/08 16:16, Imai, Yoshikazu wrote: > > I attached the diff of modification for v26-0003 patch which also > contains some refactoring. > > Please see if it is ok. > > I like the is_first_child variable which somewhat improves readability, > so updated the patch to use it. I noticed that detecting the first child query in inheritance_planner() is already done by "final_rtable == NIL": /* * If this is the first non-excluded child, its post-planning rtable * becomes the initial contents of final_rtable; otherwise, append * just its modified subquery RTEs to final_rtable. */ if (final_rtable == NIL) So I think we can use that instead of using is_first_child. -- Yoshikazu Imai
RE: speeding up planning with partitions
Hi, David On Thu, Mar 14, 2019 at 9:04 AM, David Rowley wrote: > On Thu, 14 Mar 2019 at 21:35, Imai, Yoshikazu > wrote: > > 0007: > > * This changes some processes using "for loop" to using > "while(bms_next_member())" which speeds up processing when we scan few > partitions in one statement, but when we scan a lot of partitions in one > statement, its performance will likely degraded. I measured the > performance of both cases. > > I executed select statement to the table which has 4096 partitions. > > > > [scanning 1 partition] > > Without 0007 : 3,450 TPS > > With 0007: 3,723 TPS > > > > [scanning 4096 partitions] > > Without 0007 : 10.8 TPS > > With 0007: 10.5 TPS > > > > In the above result, performance degrades 3% in case of scanning 4096 > partitions compared before and after applying 0007 patch. I think when > scanning a lot of tables, executor time would be also longer, so the > increasement of planner time would be relatively smaller than it. So we > might not have to care this performance degradation. > > I think it's better to focus on the fewer partitions case due to the fact > that execution initialisation time and actual execution are likely to > take much longer when more partitions are scanned. I did some work on > run-time pruning to tune it for this case. Tom did make a similar argument > in [1] and I explained my reasoning in [2]. Thanks for quoting these threads. Actually, I recalled this argument, so I tested this just to make sure. > bms_next_member has gotten a good performance boost since then and the > cases are not exactly the same since the old version the loop in run-time > pruning checked bms_is_member(), but the fact is, we did end up tuning > for the few partitions case in the end. Wow, I didn't know that. > However, it would be good to see the performance results for > plan+execution time of say a table with 4k parts looking up a single > indexed value. 
> You could have two columns, one that's the partition key > which allows the pruning to take place, and one that's not and results > in scanning all partitions. I'll be surprised if you even notice the > difference between with and without 0007 with the latter case. So I tested the performance of plan+execution time. [set up table and indexes] create table rt (a int, b int) partition by range (a); \o /dev/null select 'create table rt' || x::text || ' partition of rt for values from (' || (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 4096) x; \gexec \o create index b_idx on rt (b); insert into rt select a, b from generate_series(1, 4096) a, generate_series(1, 1000) b; [select_indexed_values.sql] \set b random(1, 1000) select count(*) from rt where b = :b; [pgbench] pgbench -n -f select_indexed_values.sql -T 60 postgres [results] Without 0007: 3.18 TPS (3.25, 3.13, 3.15) With 0007: 3.21 TPS (3.21, 3.23, 3.18) From the results, we didn't see any performance degradation in this case. Actually, performance increased by 1% after applying 0007, but that is probably just measurement error. So, generally, we can assume that the performance difference between bms_next_member and the for loop is absorbed by other processing (execution initialisation and actual execution) when scanning many partitions. -- Yoshikazu Imai
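As a rough standalone illustration of the access pattern 0007 changes, the sketch below contrasts scanning every slot of a part_rels-style array with iterating only the set bits of a live_parts-style bitmap. This is not PostgreSQL's Bitmapset: a single 64-bit word stands in for the set, a nonzero int stands in for a non-NULL part_rels entry, and __builtin_ctzll assumes GCC/Clang:

```c
/*
 * Toy contrast of the two iteration styles discussed in the thread.
 * NOT PostgreSQL code; a uint64_t word models a small live_parts set.
 */
#include <stdint.h>

/* Old style: visit every slot, testing each (like looping part_rels). */
static int
count_dense(const int *parts, int nparts)
{
    int n = 0;
    for (int i = 0; i < nparts; i++)
        if (parts[i] != 0)      /* stands in for part_rels[i] != NULL */
            n++;
    return n;
}

/* New style: visit only set bits (like bms_next_member over live_parts). */
static int
count_sparse(uint64_t live, const int *parts)
{
    int n = 0;
    while (live != 0)
    {
        int i = __builtin_ctzll(live);  /* index of lowest set bit */
        live &= live - 1;               /* clear that bit */
        if (parts[i] != 0)
            n++;
    }
    return n;
}
```

With few live partitions, count_sparse does work proportional to the number of set bits rather than to the total partition count, which is why the sparse style wins in the 1-of-4096 case while costing only slightly more per visited member when everything is live.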
RE: speeding up planning with partitions
Amit-san, On Thu, Mar 14, 2019 at 9:04 AM, Amit Langote wrote: > > 0002: > > * I don't really know that delaying adding resjunk output columns to > the plan doesn't affect any process in the planner. From the comments > of PlanRowMark, those columns are used in only the executor so I think > delaying adding junk Vars wouldn't be harm, is that right? > > I think so. The only visible change in behavior is that "rowmark" junk > columns are now placed later in the final plan's targetlist. ok, thanks. > > 0003: > > * Is there need to do CreatePartitionDirectory() if rte->inh becomes > false? > > > > + else if (rte->rtekind == RTE_RELATION && rte->inh) > > + { > > + rte->inh = has_subclass(rte->relid); > > + > > + /* > > +* While at it, initialize the PartitionDesc > infrastructure for > > +* this query, if not done yet. > > +*/ > > + if (root->glob->partition_directory == NULL) > > + root->glob->partition_directory = > > + > CreatePartitionDirectory(CurrentMemoryContext); > > + } > > We need to have create "partition directory" before we access a > partitioned table's PartitionDesc for planning. These days, we rely > solely on PartitionDesc to determine a partitioned table's children. So, > we need to see the PartitionDesc before we can tell there are zero children > and set inh to false. In other words, we need the "partition directory" > to have been set up in advance. Ah, I see. > > 0004: > > * In addition to commit message, this patch also changes the method > of adjusting AppendRelInfos which reference to the subquery RTEs, and > new one seems logically correct. > > Do you mean I should mention it in the patch's commit message? Actually I firstly thought that it's better to mention it in the patch's commit message, but I didn't mean that here. 
I only wanted to state that I confirmed new method of adjusting AppendRelInfos seems logically correct :) > > * In inheritance_planner(), we do ChangeVarNodes() only for orig_rtable, > so the codes concatenating lists of append_rel_list may be able to be > moved before doing ChangeVarNodes() and then the codes concatenating > lists of rowmarks, rtable and append_rel_list can be in one block (or > one bunch). > > Yeah, perhaps. I'll check. > > On 2019/03/14 17:35, Imai, Yoshikazu wrote:> Amit-san, > > I have done code review of v31 patches from 0004 to 0008. > > > > 0004: > > * s/childern/children > > Will fix. > > > 0005: > > * This seems reasonable for not using a lot of memory in specific > > case, although it needs special looking of planner experts. > > Certainly. Thanks for these. > > 0006: > > * The codes initializing/setting RelOptInfo's part_rels looks like a > > bit complicated, but I didn't come up with any good design so far. > > I guess you mean in add_appendrel_other_rels(). The other option is not > have that code there and expand partitioned tables freshly for every > target child, which we got rid of in patch 0004. But we don't want to > do that. Yeah, that's right. > > 0007: > > * This changes some processes using "for loop" to using > > "while(bms_next_member())" which speeds up processing when we scan few > > partitions in one statement, but when we scan a lot of partitions in > > one statement, its performance will likely degraded. I measured the > > performance of both cases. > > I executed select statement to the table which has 4096 partitions. > > > > [scanning 1 partition] > > Without 0007 : 3,450 TPS > > With 0007: 3,723 TPS > > > > [scanning 4096 partitions] > > Without 0007 : 10.8 TPS > > With 0007: 10.5 TPS > > > > In the above result, performance degrades 3% in case of scanning 4096 > > partitions compared before and after applying 0007 patch. 
I think when > > scanning a lot of tables, executor time would be also longer, so the > > increasement of planner time would be relatively smaller than it. So > > we might not have to care this performance degradation. > > Well, live_parts bitmapset is optimized for queries scanning only few > partitions. It's true that it's slightly more expensive than a simple > for loop over part_rels, but not too much as you're also saying. Thanks for the comments. -- Yoshikazu Imai
Re: speeding up planning with partitions
On Thu, 14 Mar 2019 at 21:35, Imai, Yoshikazu wrote: > 0007: > * This changes some processes using "for loop" to using > "while(bms_next_member())" which speeds up processing when we scan few > partitions in one statement, but when we scan a lot of partitions in one > statement, its performance will likely degraded. I measured the performance > of both cases. > I executed select statement to the table which has 4096 partitions. > > [scanning 1 partition] > Without 0007 : 3,450 TPS > With 0007: 3,723 TPS > > [scanning 4096 partitions] > Without 0007 : 10.8 TPS > With 0007: 10.5 TPS > > In the above result, performance degrades 3% in case of scanning 4096 > partitions compared before and after applying 0007 patch. I think when > scanning a lot of tables, executor time would be also longer, so the > increasement of planner time would be relatively smaller than it. So we might > not have to care this performance degradation. I think it's better to focus on the fewer partitions case due to the fact that execution initialisation time and actual execution are likely to take much longer when more partitions are scanned. I did some work on run-time pruning to tune it for this case. Tom did make a similar argument in [1] and I explained my reasoning in [2]. bms_next_member has gotten a good performance boost since then and the cases are not exactly the same since the old version the loop in run-time pruning checked bms_is_member(), but the fact is, we did end up tuning for the few partitions case in the end. However, it would be good to see the performance results for plan+execution time of say a table with 4k parts looking up a single indexed value. You could have two columns, one that's the partition key which allows the pruning to take place, and one that's not and results in scanning all partitions. I'll be surprised if you even notice the difference between with and without 0007 with the latter case. 
[1] https://www.postgresql.org/message-id/16107.1542307838%40sss.pgh.pa.us [2] https://www.postgresql.org/message-id/CAKJS1f8ZnAW9VJNpJW16t5CtXSq3eAseyJXdumLaYb8DiTbhXA%40mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: speeding up planning with partitions
Imai-san, On 2019/03/13 19:34, Imai, Yoshikazu wrote: > I have done code review of v31 patches from 0001 to 0004. > I described below what I confirmed or thoughts. Thanks for checking. > 0001: This seems ok. > > 0002: > * I don't really know that delaying adding resjunk output columns to the plan > doesn't affect any process in the planner. From the comments of PlanRowMark, > those columns are used in only the executor so I think delaying adding junk > Vars wouldn't be harm, is that right? I think so. The only visible change in behavior is that "rowmark" junk columns are now placed later in the final plan's targetlist. > 0003: > * Is there need to do CreatePartitionDirectory() if rte->inh becomes false? > > + else if (rte->rtekind == RTE_RELATION && rte->inh) > + { > + rte->inh = has_subclass(rte->relid); > + > + /* > + * While at it, initialize the PartitionDesc > infrastructure for > + * this query, if not done yet. > + */ > + if (root->glob->partition_directory == NULL) > + root->glob->partition_directory = > + > CreatePartitionDirectory(CurrentMemoryContext); > + } We need to have created the "partition directory" before we access a partitioned table's PartitionDesc for planning. These days, we rely solely on PartitionDesc to determine a partitioned table's children. So, we need to see the PartitionDesc before we can tell there are zero children and set inh to false. In other words, we need the "partition directory" to have been set up in advance. > 0004: > * In addition to commit message, this patch also changes the method of > adjusting AppendRelInfos which reference to the subquery RTEs, and new one > seems logically correct. Do you mean I should mention it in the patch's commit message?
> * In inheritance_planner(), we do ChangeVarNodes() only for orig_rtable, so > the codes concatenating lists of append_rel_list may be able to be moved > before doing ChangeVarNodes() and then the codes concatenating lists of > rowmarks, rtable and append_rel_list can be in one block (or one bunch). Yeah, perhaps. I'll check. On 2019/03/14 17:35, Imai, Yoshikazu wrote:> Amit-san, > I have done code review of v31 patches from 0004 to 0008. > > 0004: > * s/childern/children Will fix. > 0005: > * This seems reasonable for not using a lot of memory in specific case, > although it needs special looking of planner experts. Certainly. > 0006: > * The codes initializing/setting RelOptInfo's part_rels looks like a bit > complicated, but I didn't come up with any good design so far. I guess you mean in add_appendrel_other_rels(). The other option is not have that code there and expand partitioned tables freshly for every target child, which we got rid of in patch 0004. But we don't want to do that. > 0007: > * This changes some processes using "for loop" to using > "while(bms_next_member())" which speeds up processing when we scan few > partitions in one statement, but when we scan a lot of partitions in one > statement, its performance will likely degraded. I measured the > performance of both cases. > I executed select statement to the table which has 4096 partitions. > > [scanning 1 partition] > Without 0007 : 3,450 TPS > With 0007: 3,723 TPS > > [scanning 4096 partitions] > Without 0007 : 10.8 TPS > With 0007: 10.5 TPS > > In the above result, performance degrades 3% in case of scanning 4096 > partitions compared before and after applying 0007 patch. I think when > scanning a lot of tables, executor time would be also longer, so the > increasement of planner time would be relatively smaller than it. So we > might not have to care this performance degradation. Well, live_parts bitmapset is optimized for queries scanning only few partitions. 
It's true that it's slightly more expensive than a simple for loop over part_rels, but not too much as you're also saying. Thanks, Amit
RE: speeding up planning with partitions
Amit-san, I have done a code review of the v31 patches from 0004 to 0008. 0004: * s/childern/children 0005: * This seems reasonable for not using a lot of memory in a specific case, although it needs a close look from planner experts. 0006: * The code initializing/setting RelOptInfo's part_rels looks a bit complicated, but I haven't come up with any good design so far. 0007: * This changes some processes using "for loop" to using "while(bms_next_member())", which speeds up processing when we scan few partitions in one statement, but when we scan a lot of partitions in one statement, its performance will likely be degraded. I measured the performance of both cases. I executed a select statement against a table which has 4096 partitions. [scanning 1 partition] Without 0007 : 3,450 TPS With 0007: 3,723 TPS [scanning 4096 partitions] Without 0007 : 10.8 TPS With 0007: 10.5 TPS In the above result, performance degrades by 3% when scanning 4096 partitions, comparing before and after applying the 0007 patch. I think when scanning a lot of tables, executor time would also be longer, so the increase in planner time would be relatively small in comparison. So we might not have to care about this performance degradation. 0008: This seems ok. -- Yoshikazu Imai
RE: speeding up planning with partitions
Amit-san, On Tue, Mar 12, 2019 at 2:34 PM, Amit Langote wrote: > Thanks for the heads up. > > On Tue, Mar 12, 2019 at 10:23 PM Jesper Pedersen > wrote: > > After applying 0004 check-world fails with the attached. CFBot agrees > [1]. > > Fixed. I had forgotten to re-run postgres_fdw tests after making a change > just before submitting. I have done a code review of the v31 patches from 0001 to 0004. I describe below what I confirmed, along with my thoughts. 0001: This seems ok. 0002: * I'm not entirely sure that delaying adding resjunk output columns to the plan doesn't affect any process in the planner. From the comments of PlanRowMark, those columns are used only in the executor, so I think delaying adding junk Vars wouldn't do harm, is that right? 0003: * Is there a need to do CreatePartitionDirectory() if rte->inh becomes false? + else if (rte->rtekind == RTE_RELATION && rte->inh) + { + rte->inh = has_subclass(rte->relid); + + /* +* While at it, initialize the PartitionDesc infrastructure for +* this query, if not done yet. +*/ + if (root->glob->partition_directory == NULL) + root->glob->partition_directory = + CreatePartitionDirectory(CurrentMemoryContext); + } 0004: * In addition to what the commit message says, this patch also changes the method of adjusting AppendRelInfos which reference the subquery RTEs, and the new one seems logically correct. * In inheritance_planner(), we do ChangeVarNodes() only for orig_rtable, so the code concatenating lists of append_rel_list may be able to be moved before doing ChangeVarNodes(), and then the code concatenating the lists of rowmarks, rtable and append_rel_list can be in one block (or one bunch). -- Yoshikazu Imai
Re: speeding up planning with partitions
Hi Amit, On 3/12/19 4:22 AM, Amit Langote wrote: I wrestled with this idea a bit and concluded that we don't have to postpone *all* of preprocess_targetlist() processing to query_planner, only the part that adds row mark "junk" Vars, because only those matter for the problem being solved. To recap, the problem is that delaying adding inheritance children (and hence their row marks if any) means we can't add "junk" columns needed to implement row marks, because which ones to add is not clear until we've seen all the children. I propose that we delay the addition of "junk" Vars to query_planner() so that it doesn't stand in the way of deferring inheritance expansion to query_planner() too. That means the order of reltarget expressions for row-marked rels will change, but I assume that's OK. At least it will be consistent for both non-inherited baserels and inherited ones. Attached updated version of the patches with the above proposal implemented by patch 0002. To summarize, the patches are as follows: 0001: make building of "other rels" a separate step that runs after deconstruct_jointree(), implemented by a new subroutine of query_planner named add_other_rels_to_query() 0002: delay adding "junk" Vars to after add_other_rels_to_query() 0003: delay inheritance expansion to add_other_rels_to_query() 0004, 0005: adjust inheritance_planner() to account for the fact that inheritance is now expanded by query_planner(), not subquery_planner() 0006: perform partition pruning while inheritance is being expanded, instead of during set_append_append_rel_size() 0007: add a 'live_parts' field to RelOptInfo to store partition indexes (not RT indexes) of unpruned partitions, which speeds up looping over part_rels array of the partitioned parent 0008: avoid copying PartitionBoundInfo struct for planning After applying 0004 check-world fails with the attached. CFBot agrees [1]. 
[1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/505107884 Best regards, Jesper diff -U3 /home/jpedersen/PostgreSQL/postgresql/contrib/postgres_fdw/expected/postgres_fdw.out /home/jpedersen/PostgreSQL/postgresql/contrib/postgres_fdw/results/postgres_fdw.out --- /home/jpedersen/PostgreSQL/postgresql/contrib/postgres_fdw/expected/postgres_fdw.out 2019-03-12 07:46:08.430690229 -0400 +++ /home/jpedersen/PostgreSQL/postgresql/contrib/postgres_fdw/results/postgres_fdw.out 2019-03-12 07:59:01.134285159 -0400 @@ -7190,8 +7190,8 @@ -- Check UPDATE with inherited target and an inherited source table explain (verbose, costs off) update bar set f2 = f2 + 100 where f1 in (select f1 from foo); - QUERY PLAN -- + QUERY PLAN +--- Update on public.bar Update on public.bar Foreign Update on public.bar2 @@ -7214,22 +7214,22 @@ Output: foo2.f1, foo2.ctid, foo2.*, foo2.tableoid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -> Hash Join - Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid + Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.* Inner Unique: true Hash Cond: (bar2.f1 = foo.f1) -> Foreign Scan on public.bar2 Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE -> Hash - Output: foo.f1, foo.ctid, foo.*, foo.tableoid + Output: foo.f1, foo.ctid, foo.* -> HashAggregate - Output: foo.f1, foo.ctid, foo.*, foo.tableoid + Output: foo.f1, foo.ctid, foo.* Group Key: foo.f1 -> Append -> Seq Scan on public.foo - Output: foo.f1, foo.ctid, foo.*, foo.tableoid + Output: foo.f1, foo.ctid, foo.* -> Foreign Scan on public.foo2 - Output: foo2.f1, foo2.ctid, foo2.*, foo2.tableoid + Output: foo2.f1, foo2.ctid, foo2.* Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 (39 rows)
RE: speeding up planning with partitions
Amit-san, On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote: > On 2019/03/08 16:16, Imai, Yoshikazu wrote: > > So I modified the code and did test to confirm memory increasement don't > happen. The test and results are below. > > > > [test] > > * Create partitioned table with 1536 partitions. > > * Execute "update rt set a = random();" > > > > [results] > > A backend uses below amount of memory in update transaction: > > > > HEAD: 807MB > > With v26-0001, 0002: 790MB > > With v26-0001, 0002, 0003: 860MB > > With v26-0003 modified: 790MB > > Can you measure with v28, or better attached v29 (about which more below)? > > > I attached the diff of modification for v26-0003 patch which also > contains some refactoring. > > Please see if it is ok. > > I like the is_first_child variable which somewhat improves readability, > so updated the patch to use it. > > Maybe you know that range_table_mutator() spends quite a long time if > there are many target children, but I realized there's no need for > range_table_mutator() to copy/mutate child target RTEs. First, there's > nothing to translate in their case. Second, copying them is not necessary > too, because they're not modified across different planning cycles. If > we pass to adjust_appendrel_attrs only the RTEs in the original range > table (that is, before any child target RTEs were added), then > range_table_mutator() has to do significantly less work and allocates > lot less memory than before. I've broken this change into its own patch; > see patch 0004. Cool! I tested with v29 patches and checked it saved a lot of memory.. HEAD: 807MB With v29-0001, 0002, 0003, 0004: 167MB Maybe planning time in this case is also improved, but I didn't check it. > but I realized there's no need for > range_table_mutator() to copy/mutate child target RTEs. First, there's > nothing to translate in their case. Second, copying them is not necessary > too, because they're not modified across different planning cycles. 
Yeah. Although I couldn't check the code in detail, from the comments below in inheritance_planner(), ISTM we need copies of the subquery RTEs but not copies of the other RTEs in each planning cycle. /* * We generate a modified instance of the original Query for each target * relation, plan that, and put all the plans into a list that will be * controlled by a single ModifyTable node. All the instances share the * same rangetable, but each instance must have its own set of subquery * RTEs within the finished rangetable because (1) they are likely to get * scribbled on during planning, and (2) it's not inconceivable that * subqueries could get planned differently in different cases. We need * not create duplicate copies of other RTE kinds, in particular not the * target relations, because they don't have either of those issues. Not * having to duplicate the target relations is important because doing so * (1) would result in a rangetable of length O(N^2) for N targets, with * at least O(N^3) work expended here; and (2) would greatly complicate * management of the rowMarks list. Thanks -- Yoshikazu Imai
RE: speeding up planning with partitions
Amit-san, On Wed, Mar 6, 2019 at 5:38 AM, Amit Langote wrote: ... > > I didn't investigate that problem, but there is another memory > > increase > issue, which is because of 0003 patch I think. I'll try to solve the latter > issue. > > Interested in details as it seems to be a separate problem. I solved this problem. I think we don't need the list_copy in the code below. + /* +* No need to copy of the RTEs themselves, but do copy the List +* structure. +*/ + subroot->parse->rtable = list_copy(rtable_with_target); This is because subroot->parse->rtable will be copied again by: subroot->parse = (Query *) adjust_appendrel_attrs(parent_root, - (Node *) parent_parse, + (Node *) subroot->parse, 1, ); So I modified the code and ran a test to confirm that the memory increase doesn't happen. The test and results are below. [test] * Create a partitioned table with 1536 partitions. * Execute "update rt set a = random();" [results] A backend uses the following amount of memory in the update transaction: HEAD: 807MB With v26-0001, 0002: 790MB With v26-0001, 0002, 0003: 860MB With v26-0003 modified: 790MB I attached the diff of the modification for the v26-0003 patch, which also contains some refactoring. Please see if it is ok. (Sorry, it is a modification for the v26 patch though the latest ones are v28.) -- Yoshikazu Imai Attachment: v26-0003-solving-memory-increasement-problem.diff
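For illustration, a shallow list copy of the kind discussed above duplicates only the list structure while sharing the element pointers, which is why it adds no protection when a later pass (here, adjust_appendrel_attrs) rebuilds the list anyway. The sketch below uses hypothetical types, not PostgreSQL's pg_list.h API:

```c
/*
 * Toy shallow list copy: new cons cells, shared payload pointers.
 * NOT PostgreSQL's List implementation; just a sketch of the semantics.
 */
#include <stdlib.h>

typedef struct Cell
{
    void       *ptr;            /* shared payload (think: an RTE) */
    struct Cell *next;
} Cell;

static Cell *
cons(void *ptr, Cell *next)
{
    Cell *c = malloc(sizeof(Cell));  /* no OOM handling in this sketch */
    c->ptr = ptr;
    c->next = next;
    return c;
}

/* Copy the list structure only; the elements themselves are not copied. */
static Cell *
list_copy_shallow(const Cell *src)
{
    Cell  *head = NULL;
    Cell **tail = &head;

    for (; src != NULL; src = src->next)
    {
        *tail = cons(src->ptr, NULL);
        tail = &(*tail)->next;
    }
    return head;
}
```

The cells of the copy are fresh allocations, but every payload pointer is shared with the source list; so per the analysis above, if the list is about to be replaced wholesale by a mutating pass, making the shallow copy first is pure overhead.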
Re: speeding up planning with partitions
Hi, On 3/5/19 5:24 AM, Amit Langote wrote: Attached an updated version. Need a rebase due to 898e5e3290. Best regards, Jesper
RE: speeding up planning with partitions
Amit-san,

On Wed, Mar 6, 2019 at 5:38 AM, Amit Langote wrote:
> On 2019/03/06 11:09, Imai, Yoshikazu wrote:
> > [0004 or 0005]
> > There is redundant processing in add_appendrel_other_rels (or
> > expand_xxx_rtentry()?).
> > I modified add_appendrel_other_rels like below and it actually worked.
> >
> > add_appendrel_other_rels(PlannerInfo *root, RangeTblEntry *rte, Index rti)
> > {
> >     ListCell   *l;
> >     RelOptInfo *rel;
> >
> >     /*
> >      * Add inheritance children to the query if not already done.  For child
> >      * tables that are themselves partitioned, their children will be added
> >      * recursively.
> >      */
> >     if (rte->rtekind == RTE_RELATION && !root->contains_inherit_children)
> >     {
> >         expand_inherited_rtentry(root, rte, rti);
> >         return;
> >     }
> >
> >     rel = find_base_rel(root, rti);
> >
> >     foreach(l, root->append_rel_list)
> >     {
> >         AppendRelInfo *appinfo = lfirst(l);
> >         Index          childRTindex = appinfo->child_relid;
> >         RangeTblEntry *childrte;
> >         RelOptInfo    *childrel;
> >
> >         if (appinfo->parent_relid != rti)
> >             continue;
> >
> >         Assert(childRTindex < root->simple_rel_array_size);
> >         childrte = root->simple_rte_array[childRTindex];
> >         Assert(childrte != NULL);
> >         build_simple_rel(root, childRTindex, rel);
> >
> >         /* Child may itself be an inherited relation. */
> >         if (childrte->inh)
> >             add_appendrel_other_rels(root, childrte, childRTindex);
> >     }
> > }
>
> If you don't initialize part_rels here, then it will break any code in
> the planner that expects a partitioned rel with a non-zero number of
> partitions to have part_rels set to non-NULL.  At the moment, that
> includes the code that implements partitioning-specific optimizations
> such as partitionwise aggregate and join, run-time pruning, etc.  No
> existing regression tests cover the cases where source inherited rels
> participate in those optimizations, so you didn't find anything that
> broke.
>
> For example, consider the following update query:
>
> update p set a = p1.a + 1 from p p1 where p1.a = (select 1);
>
> Planner will create a run-time-pruning-aware Append node for p (aliased
> p1), for which it will need to use the part_rels array.  Note that p in
> this case is a source relation which the above code initializes.
>
> Maybe we should add such regression tests.

Ah, now I understand that the code below in expand_inherited_rtentry()
initializes part_rels of child queries after the first child target, and
the part_rels of those are used in partitioning-specific optimizations.
Thanks for the explanation.

--
Yoshikazu Imai
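For readers following along, the invariant Amit describes here — any partitioned rel with partitions must have a non-NULL part_rels array, because later optimizations index into it — can be sketched with a toy Python model (illustrative only, not PostgreSQL code; all names are made up for the sketch):

```python
# Toy model (not PostgreSQL code): why skipping part_rels setup breaks
# later partitioning-specific code such as run-time pruning.

class RelOptInfo:
    def __init__(self, relid, nparts=0):
        self.relid = relid
        self.nparts = nparts      # > 0 means a partitioned rel
        self.part_rels = None     # must be filled when children are added

def expand_partitioned(rel, child_ids):
    """Add child rels AND record them in part_rels, as the planner expects."""
    rel.part_rels = [RelOptInfo(cid) for cid in child_ids]
    return rel.part_rels

def make_runtime_pruning_append(rel, live_part_indexes):
    """Model of code that indexes into part_rels (e.g. run-time pruning)."""
    assert rel.part_rels is not None, "partitioned rel lacks part_rels"
    return [rel.part_rels[i].relid for i in live_part_indexes]

p = RelOptInfo(relid=1, nparts=3)
expand_partitioned(p, [2, 3, 4])
# The Append node for "p p1" in:  update p set a = p1.a + 1 from p p1 ...
# needs part_rels to map surviving partition indexes to child rels.
print(make_runtime_pruning_append(p, [0, 2]))   # [2, 4]
```

The simplified add_appendrel_other_rels in the review corresponds to calling build_simple_rel for each child without ever filling part_rels, so the assertion in the pruning step would trip.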
Re: speeding up planning with partitions
Imai-san,

Thanks for the review.

On 2019/03/06 11:09, Imai, Yoshikazu wrote:
> Here is the code review for the previous v26 patches.
>
> [0002]
> In expand_inherited_rtentry():
>
> expand_inherited_rtentry()
> {
> ...
> +	RelOptInfo *rel = NULL;
>
> can be declared later:
>
> if (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> ...
> else
> {
>     List       *appinfos = NIL;
>     RangeTblEntry *childrte;
>     Index       childRTindex;
> +   RelOptInfo *rel = NULL;

This is fixed in v27.

> [0003]
> In inheritance_planner:
>
> +	rtable_with_target = list_copy(root->parse->rtable);
>
> can be:
>
> +	rtable_with_target = list_copy(parse->rtable);

Sure.

> [0004 or 0005]
> There is redundant processing in add_appendrel_other_rels (or
> expand_xxx_rtentry()?).
> I modified add_appendrel_other_rels like below and it actually worked.
>
> add_appendrel_other_rels(PlannerInfo *root, RangeTblEntry *rte, Index rti)
> {
>     ListCell   *l;
>     RelOptInfo *rel;
>
>     /*
>      * Add inheritance children to the query if not already done.  For child
>      * tables that are themselves partitioned, their children will be added
>      * recursively.
>      */
>     if (rte->rtekind == RTE_RELATION && !root->contains_inherit_children)
>     {
>         expand_inherited_rtentry(root, rte, rti);
>         return;
>     }
>
>     rel = find_base_rel(root, rti);
>
>     foreach(l, root->append_rel_list)
>     {
>         AppendRelInfo *appinfo = lfirst(l);
>         Index          childRTindex = appinfo->child_relid;
>         RangeTblEntry *childrte;
>         RelOptInfo    *childrel;
>
>         if (appinfo->parent_relid != rti)
>             continue;
>
>         Assert(childRTindex < root->simple_rel_array_size);
>         childrte = root->simple_rte_array[childRTindex];
>         Assert(childrte != NULL);
>         build_simple_rel(root, childRTindex, rel);
>
>         /* Child may itself be an inherited relation. */
>         if (childrte->inh)
>             add_appendrel_other_rels(root, childrte, childRTindex);
>     }
> }

If you don't initialize part_rels here, then it will break any code in
the planner that expects a partitioned rel with a non-zero number of
partitions to have part_rels set to non-NULL.  At the moment, that
includes the code that implements partitioning-specific optimizations
such as partitionwise aggregate and join, run-time pruning, etc.  No
existing regression tests cover the cases where source inherited rels
participate in those optimizations, so you didn't find anything that
broke.

For example, consider the following update query:

update p set a = p1.a + 1 from p p1 where p1.a = (select 1);

Planner will create a run-time-pruning-aware Append node for p (aliased
p1), for which it will need to use the part_rels array.  Note that p in
this case is a source relation which the above code initializes.

Maybe we should add such regression tests.

> I didn't investigate that problem, but there is another memory increase
> issue, which I think is caused by the 0003 patch.  I'll try to solve the
> latter issue.

Interested in details, as it seems to be a separate problem.

Thanks,
Amit
RE: speeding up planning with partitions
Amit-san,

On Tue, Mar 5, 2019 at 10:24 AM, Amit Langote wrote:
> On 2019/03/05 9:50, Amit Langote wrote:
> > I'll post the updated patches after diagnosing what I'm suspecting a
> > memory over-allocation bug in one of the patches.  If you configure
> > the build with --enable-cassert, you'll see that throughput decreases
> > as the number of partitions runs into many thousands, but it doesn't
> > when asserts are turned off.
>
> Attached an updated version.  This incorporates fixes for both Jesper's
> and Imai-san's review.

Thanks for updating the patches!

Here is the code review for the previous v26 patches.

[0002]
In expand_inherited_rtentry():

expand_inherited_rtentry()
{
...
+	RelOptInfo *rel = NULL;

can be declared later:

if (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
...
else
{
    List       *appinfos = NIL;
    RangeTblEntry *childrte;
    Index       childRTindex;
+   RelOptInfo *rel = NULL;

[0003]
In inheritance_planner:

+	rtable_with_target = list_copy(root->parse->rtable);

can be:

+	rtable_with_target = list_copy(parse->rtable);

[0004 or 0005]
There is redundant processing in add_appendrel_other_rels (or
expand_xxx_rtentry()?).
I modified add_appendrel_other_rels like below and it actually worked.

add_appendrel_other_rels(PlannerInfo *root, RangeTblEntry *rte, Index rti)
{
    ListCell   *l;
    RelOptInfo *rel;

    /*
     * Add inheritance children to the query if not already done.  For child
     * tables that are themselves partitioned, their children will be added
     * recursively.
     */
    if (rte->rtekind == RTE_RELATION && !root->contains_inherit_children)
    {
        expand_inherited_rtentry(root, rte, rti);
        return;
    }

    rel = find_base_rel(root, rti);

    foreach(l, root->append_rel_list)
    {
        AppendRelInfo *appinfo = lfirst(l);
        Index          childRTindex = appinfo->child_relid;
        RangeTblEntry *childrte;
        RelOptInfo    *childrel;

        if (appinfo->parent_relid != rti)
            continue;

        Assert(childRTindex < root->simple_rel_array_size);
        childrte = root->simple_rte_array[childRTindex];
        Assert(childrte != NULL);
        build_simple_rel(root, childRTindex, rel);

        /* Child may itself be an inherited relation. */
        if (childrte->inh)
            add_appendrel_other_rels(root, childrte, childRTindex);
    }
}

> and Imai-san's review.  I haven't been able to pin down the bug (or
> whatever) that makes throughput go down as the partition count
> increases, when tested with a --enable-cassert build.

I didn't investigate that problem, but there is another memory increase
issue, which I think is caused by the 0003 patch.  I'll try to solve the
latter issue.

--
Yoshikazu Imai
Re: speeding up planning with partitions
On 2019/03/06 0:57, Jesper Pedersen wrote:
> On 3/5/19 5:24 AM, Amit Langote wrote:
>> Attached an updated version.  This incorporates fixes for both Jesper's
>> and Imai-san's review.  I haven't been able to pin down the bug (or
>> whatever) that makes throughput go down as the partition count
>> increases, when tested with a --enable-cassert build.
>
> Thanks !
>
> I'm seeing the throughput going down as well, but are you sure it isn't
> just the extra calls of MemoryContextCheck you are seeing ?  A flamegraph
> diff highlights that area -- sent offline.
>
> A non-cassert build shows the same profile for 64 and 1024 partitions.

Thanks for testing.  I now see what's happening here.  I was aware that
it's MemoryContextCheck overhead, but didn't quite understand why the time
it takes should increase as the number of partitions increases, especially
given that the patch makes it so that only one partition is accessed if
that's all the query needs to do.

What I had forgotten, however, is that MemoryContextCheck checks *all*
contexts in every transaction, including the "partition descriptor"
context which stores a partitioned table's PartitionDesc.  PartitionDesc
gets bigger as the number of partitions increases, and so does the time to
check the context it's allocated in.

So, the decrease in throughput in a cassert build as the number of
partitions increases is not due to any fault of this patch series, as I
had suspected.  Phew!

Thanks,
Amit
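Amit's diagnosis — that the cassert slowdown scales with the size of the partition-descriptor context, not with how many partitions the query actually touches — can be modeled with a short sketch (illustrative Python, not PostgreSQL code; the context names and chunk counts are invented for the model):

```python
# Toy model of a MemoryContextCheck-style walk over *all* allocated chunks
# once per transaction.  Even when pruning leaves only one partition to
# plan, the per-transaction check cost grows with the PartitionDesc size.

def memory_context_check(contexts):
    """Walk every chunk in every context, as cassert builds do."""
    checked = 0
    for chunks in contexts.values():
        checked += len(chunks)       # one unit of work per chunk
    return checked

def run_transaction(nparts):
    contexts = {
        "PlannerContext": ["chunk"] * 10,           # small, roughly fixed
        "PartitionDescContext": ["bound"] * nparts, # grows with partitions
    }
    # ... plan the query; with pruning only 1 partition is accessed ...
    return memory_context_check(contexts)

# Check work per transaction grows with nparts, independent of the plan:
print(run_transaction(64), run_transaction(8192))   # 74 8202
```

This is why a non-cassert build shows the same profile for 64 and 1024 partitions: without the check, the per-transaction cost really does depend only on the partitions the query touches.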
Re: speeding up planning with partitions
On 3/5/19 5:24 AM, Amit Langote wrote:
> Attached an updated version.  This incorporates fixes for both Jesper's
> and Imai-san's review.  I haven't been able to pin down the bug (or
> whatever) that makes throughput go down as the partition count
> increases, when tested with a --enable-cassert build.

Thanks !

I'm seeing the throughput going down as well, but are you sure it isn't
just the extra calls of MemoryContextCheck you are seeing ?  A flamegraph
diff highlights that area -- sent offline.

A non-cassert build shows the same profile for 64 and 1024 partitions.

Best regards,
Jesper
Re: speeding up planning with partitions
On 2019/03/04 19:38, Amit Langote wrote:
> 2. Defer inheritance expansion to add_other_rels_to_query().  Although
> the purpose of doing this is to perform partition pruning before adding
> the children, this patch doesn't change when the pruning occurs.  It
> deals with other issues that must be taken care of due to adding
> children during query_planner instead of during subquery_planner.
> Especially, inheritance_planner now has to add the child target
> relations on its own.  Also, delaying adding children affects adding
> junk columns to the query's targetlist based on PlanRowMarks, because
> preprocess_targetlist can no longer finalize which junk columns to add
> for a "parent" PlanRowMark; that must be delayed until all child
> PlanRowMarks are added and their allMarkTypes propagated to the parent
> PlanRowMark.

I thought more on this and started wondering why we can't call
preprocess_targetlist() from query_planner() instead of from
grouping_planner().  We wouldn't have to treat parent row marks specially
if preprocess_targetlist() were called after adding other rels (and hence
all child row marks).

This will change the order in which expressions are added to baserels'
targetlists, and hence the order of expressions in their Paths'
targetlists, because the expressions contained in the targetlist
(including RETURNING) and other junk expressions will be added after the
expressions referenced in WHERE clauses, whereas the order is the reverse
today.  But if we do what's proposed above, the order will be uniform for
all cases; that is, not one order for regular table baserels and another
for inherited table baserels.

Thoughts?

Thanks,
Amit
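The ordering change Amit describes could be pictured with a tiny sketch (a toy Python model; the expression lists are illustrative stand-ins, not the planner's actual data structures):

```python
# Toy model of the order in which expressions land in a baserel's
# targetlist, before and after moving preprocess_targetlist() into
# query_planner().

where_vars = ["a"]              # expressions referenced in WHERE clauses
tlist_and_junk = ["b", "ctid"]  # targetlist (incl. RETURNING) + junk cols

# Today preprocess_targetlist() runs first, so targetlist/junk expressions
# are added before WHERE-referenced ones:
order_today = tlist_and_junk + where_vars

# With preprocess_targetlist() called after other rels are added, the
# WHERE-referenced expressions come first -- and uniformly so for plain
# and inherited baserels alike:
order_proposed = where_vars + tlist_and_junk

print(order_today)      # ['b', 'ctid', 'a']
print(order_proposed)   # ['a', 'b', 'ctid']
```

The point of the proposal is not either particular order but that one order applies to all baserels, instead of differing between the inherited and non-inherited cases.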
Re: speeding up planning with partitions
On 2019/03/05 9:50, Amit Langote wrote:
> I'll post the updated patches after diagnosing what I'm suspecting a
> memory over-allocation bug in one of the patches.  If you configure the
> build with --enable-cassert, you'll see that throughput decreases as the
> number of partitions runs into many thousands, but it doesn't when
> asserts are turned off.

Attached an updated version.  This incorporates fixes for both Jesper's
and Imai-san's review.  I haven't been able to pin down the bug (or
whatever) that makes throughput go down as the partition count increases,
when tested with a --enable-cassert build.

Thanks,
Amit

From 1e793976d1affae8aa6fa1c835752bad530132b6 Mon Sep 17 00:00:00 2001
From: Amit
Date: Sat, 2 Mar 2019 14:13:13 +0900
Subject: [PATCH v27 1/6] Build "other rels" of appendrel baserels in a
 separate step

Currently they're built in a stanza in build_simple_rel() which builds the
child rels in the same invocation of build_simple_rel that's used to build
the parent relation.  Since the query hasn't been processed to distribute
restriction clauses to individual baserels at this point, we cannot
perform partition pruning before adding the children.

The newly added add_other_rels_to_query() runs *after* query_planner has
distributed restriction clauses to base relations.  This will allow us to
use the clauses applied to a given partitioned baserel to perform
partition pruning, and to add other rels for only the unpruned partitions.
Later patches will do that.
---
 src/backend/optimizer/path/allpaths.c  |   2 +-
 src/backend/optimizer/plan/initsplan.c |  60 ++--
 src/backend/optimizer/plan/planmain.c  |  15 +++--
 src/backend/optimizer/util/relnode.c   | 101 +
 src/include/optimizer/pathnode.h       |   2 +
 src/include/optimizer/planmain.h       |   1 +
 6 files changed, 135 insertions(+), 46 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 0debac75c6..8d8a8f17d5 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1028,7 +1028,7 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 		/*
 		 * The child rel's RelOptInfo was already created during
-		 * add_base_rels_to_query.
+		 * add_other_rels_to_query.
 		 */
 		childrel = find_base_rel(root, childRTindex);
 		Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL);
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index 2afc3f1dfe..077d3203ba 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -20,6 +20,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/cost.h"
+#include "optimizer/inherit.h"
 #include "optimizer/joininfo.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
@@ -30,6 +31,7 @@
 #include "optimizer/prep.h"
 #include "optimizer/restrictinfo.h"
 #include "parser/analyze.h"
+#include "parser/parsetree.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
@@ -97,10 +99,11 @@ static void check_hashjoinable(RestrictInfo *restrictinfo);
  * jtnode.  Internally, the function recurses through the jointree.
  *
  * At the end of this process, there should be one baserel RelOptInfo for
- * every non-join RTE that is used in the query.  Therefore, this routine
- * is the only place that should call build_simple_rel with reloptkind
- * RELOPT_BASEREL.  (Note: build_simple_rel recurses internally to build
- * "other rel" RelOptInfos for the members of any appendrels we find here.)
+ * every non-join RTE that is specified in the query.  Therefore, this
+ * routine is the only place that should call build_simple_rel with
+ * reloptkind RELOPT_BASEREL.  (Note: "other rel" RelOptInfos for the
+ * members of any appendrels we find here are built later when
+ * query_planner calls add_other_rels_to_query().)
  */
 void
 add_base_rels_to_query(PlannerInfo *root, Node *jtnode)
@@ -133,6 +136,55 @@ add_base_rels_to_query(PlannerInfo *root, Node *jtnode)
 			 (int) nodeTag(jtnode));
 }
 
+/*
+ * add_other_rels_to_query
+ *
+ * Scan the query's jointree and, for each base rel that is an appendrel,
+ * create otherrel RelOptInfos of its children.
+ *
+ * At the end of this process, there should be RelOptInfos for all relations
+ * that will be scanned by the query.
+ */
+void
+add_other_rels_to_query(PlannerInfo *root, Node *jtnode)
+{
+	if (jtnode == NULL)
+		return;
+	if (IsA(jtnode, RangeTblRef))
+	{
+		int			varno = ((RangeTblRef *) jtnode)->rtindex;
+		RangeTblEntry *rte = rt_fetch(varno, root->parse->rtable);
+
+		/*
+		 * Only the parent subquery of a flattened UNION ALL and an inherited
+		 * table can have
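The two-phase structure this patch introduces — build baserels and distribute restriction clauses first, then expand appendrels so pruning can see the clauses — might be sketched like this (toy Python, not the actual planner code; the clause-matching scheme is invented for the sketch):

```python
# Toy model of the patch's reordering: restriction clauses are distributed
# to baserels *before* children are added, so the expansion step can prune
# partitions and build RelOptInfos only for survivors.

def add_base_rels(query):
    # Phase 1: one entry per baserel; children NOT added yet.
    return {rel: {"clauses": []} for rel in query["rels"]}

def distribute_clauses(baserels, clauses):
    for rel, clause in clauses:
        baserels[rel]["clauses"].append(clause)

def add_other_rels(baserels, partitions):
    # Phase 2: each baserel's clauses are known now, so prune first.
    for rel, info in list(baserels.items()):
        for part, admits in partitions.get(rel, []):
            if all(admits(c) for c in info["clauses"]):
                baserels[part] = {"clauses": []}   # build only survivors

query = {"rels": ["p"]}
baserels = add_base_rels(query)
distribute_clauses(baserels, [("p", ("a", 1))])    # WHERE a = 1
partitions = {"p": [
    ("p1", lambda c: c == ("a", 1)),   # p1's bounds admit a = 1
    ("p2", lambda c: False),           # p2 is pruned by a = 1
]}
add_other_rels(baserels, partitions)
print(sorted(baserels))   # ['p', 'p1']  -- p2 was never built
```

In the pre-patch order the children would be built during phase 1, before any clause is attached, so nothing could be pruned at that point.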
Re: speeding up planning with partitions
Imai-san,

Thanks for checking.

On 2019/03/05 15:03, Imai, Yoshikazu wrote:
> I've also done code review of the 0001 and 0002 patches so far.
>
> [0001]
> 1. Do we need to open/close a relation in add_appendrel_other_rels()?
>
> +	if (rel->part_scheme)
> 	{
> -		ListCell   *l;
> ...
> -		Assert(cnt_parts == nparts);
> +		rel->part_rels = (RelOptInfo **)
> +			palloc0(sizeof(RelOptInfo *) * rel->nparts);
> +		relation = table_open(rte->relid, NoLock);
> 	}
>
> +	if (relation)
> +		table_close(relation, NoLock);

Ah, it should be moved to another patch.  Actually, to patch 0003, which
makes it necessary to inspect the PartitionDesc.

> 2. We can refine the use of Assert in add_appendrel_other_rels(),
> changing this:
>
> +	if (rel->part_scheme)
> 	{
> ...
> +		rel->part_rels = (RelOptInfo **)
> +			palloc0(sizeof(RelOptInfo *) * rel->nparts);
> +		relation = table_open(rte->relid, NoLock);
> 	}
> ...
> +	i = 0;
> +	foreach(l, root->append_rel_list)
> +	{
> ...
> +		if (rel->part_scheme != NULL)
> +		{
> +			Assert(rel->nparts > 0 && i < rel->nparts);
> +			rel->part_rels[i] = childrel;
> +		}
> +
> +		i++;
>
> to this:
>
> +	if (rel->part_scheme)
> 	{
> ...
> 		Assert(rel->nparts > 0);
> +		rel->part_rels = (RelOptInfo **)
> +			palloc0(sizeof(RelOptInfo *) * rel->nparts);
> +		relation = table_open(rte->relid, NoLock);
> 	}
> ...
> +	i = 0;
> +	foreach(l, root->append_rel_list)
> +	{
> ...
> +		if (rel->part_scheme != NULL)
> +		{
> +			Assert(i < rel->nparts);
> +			rel->part_rels[i] = childrel;
> +		}
> +
> +		i++;

You're right.  That's what makes sense in this context.

> [0002]
> 3. If a variable like is_source_inh_expansion were used, the code would
> be easier to read.  (I might be mistaken that
> root->simple_rel_array_size > 0 means source inheritance expansion,
> though.)

root->simple_rel_array_size > 0 *does* mean source inheritance expansion,
so you're not mistaken.  As you can see, expand_inherited_rtentry is
called by inheritance_planner() to expand target inheritance and by
add_appendrel_other_rels() to expand source inheritance.  Since the
latter is called by query_planner, simple_rel_array would have been
initialized, and hence root->simple_rel_array_size > 0 is true.  Maybe
it'd be a good idea to introduce an is_source_inh_expansion variable for
clarity, as you suggest, and set it to (root->simple_rel_array_size > 0).

> 4. I didn't look very carefully, but should the introduction of
> contains_inherit_children be in the 0003 patch...?

You're right.

Thanks again for the comments.  I've made changes to my local repository.

Thanks,
Amit
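Imai-san's Assert suggestion above — check the loop-invariant condition once where it's established, and keep only the per-iteration condition inside the loop — can be illustrated with a small sketch (a Python stand-in for the C Assert placement; function and variable names are made up):

```python
# Sketch of the suggested Assert refactoring: nparts > 0 is invariant once
# part_scheme is set, so assert it once up front; only i < nparts varies
# per iteration and belongs inside the loop.

def build_part_rels(nparts, child_relids):
    part_rels = None
    if nparts:                      # models: rel->part_scheme is set
        assert nparts > 0           # invariant: checked once, not per child
        part_rels = [None] * nparts
    for i, child in enumerate(child_relids):
        if part_rels is not None:
            assert i < nparts       # per-iteration check stays in the loop
            part_rels[i] = child
    return part_rels

print(build_part_rels(3, ["c1", "c2", "c3"]))  # ['c1', 'c2', 'c3']
```

Besides avoiding a redundant check per child, this places the assertion next to the allocation whose size it guards, which is where a failure would be most informative.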