Re: [HACKERS] Runtime Partition Pruning

2020-10-13 Thread Andy Fan
On Tue, Oct 13, 2020 at 3:48 PM Amit Langote 
wrote:

> On Thu, Oct 8, 2020 at 8:25 PM Ashutosh Bapat
>  wrote:
> > On Wed, Oct 7, 2020 at 7:00 PM Andy Fan 
> wrote:
> > > I can understand Robert's idea now,  he intended to resolve both the
> > > "initial-partition-prune" case and "runtime partition prune" case
> while I always think
> > > about the former case (Amit reminded me about that at the beginning,
> but I just
> > > wake up right now..)
> > >
> > > With the Robert's idea,  I think we can do some hack at
> create_append_path,
> > > we can figure out the pruneinfo (it was done at create_append_plan
> now) at that
> > > stage, and then did a fix_append_path with such pruneinfo.  We need to
> fix not
> > > only the cost of AppendPath,  but also rows of AppendPath, For example:
> > > SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the
> > > plan is:
> > > Nest Loop
> > >Nest Loop
> > >   t1
> > >   Append (p)
> > >t2
> > >
> > > Then the rows of Append (p) will be very important.
> > >
> > > For this idea,  how to estimate how many partitions/rows can be pruned
> is a key
> > > part.  Robert said "I was thinking, rather, that if we know for
> example that
> > > we've doing pruning on partition_column = $1, then we know that only
> > > one partition will match.  That's probably a common case.  If we've
> > > got partition_column > $1, we could assume that, say, 75% of the
> > > partitions would match.  partition_column BETWEEN $1 and $2 is
> > > probably a bit more selective, so maybe we assume 50% of the
> > > partitions would match.".   I think we can't say the 75% or 50% is a
> good
> > > number,  but the keypoint may be "partition_column = $1" is a common
> > > case.  And for the others case, we probably don't make it worse.
> > >
> > > I think we need to do something here, any thoughts? Personally I'm more
> > > like this idea now.
> >
> > Yes, I think we have to do something on those lines. But I am
> > wondering why our stats machinery is failing to do that already. For
> > equality I think it's straight forward. But even for other operators
> > we should use the statistics from the partitioned table to estimate
> > the selectivity and then adjust number of scanned partitions = (number
> > of partitions * fraction of rows scanned). That might be better than
> > 50% or 75%.
>
> This is an interesting idea, that is, the idea of consulting parent
> relation's stats to guess at the number of partitions that might be
> scanned.
>
> However, we don't currently consult an inheritance parent relation's
> stats at all during "base" relation planning, which is why you don't
> see the parent relation's statistics reflected in the row count and
> costs assigned to its (Append at al) paths.  The hard-coded rule is to
> sum up children's rows and their paths' costs; see cost_append().
>
> My thinking on this was that we'd just extend that hard-coded rule to
> take into account that run-time pruning, if applicable in a given
> plan, will cause some of those child paths to be discarded.  Maybe
> Andy is headed in that direction?
>
>
Yes, that's what I am trying to do.  The minimum code in my mind is:

create_append_path(...)
{
    double  run_time_prune_est = 1.0;

    if (enable_partition_pruning && ...)
    {
        List   *partition_prune_step_ops =
            cal_prune_step_ops(rel, partitioned_rels);

        run_time_prune_est =
            estimate_runtime_prune_ratio(partition_prune_step_ops);
    }

    /* adjust the rows/costs of the AppendPath based on run_time_prune_est */
}

The difference between '=' and '<' will be handled in the function
estimate_runtime_prune_ratio.  Since I haven't made my idea runnable yet,
some data structures may change, but the overall idea is as above.


-- 
Best Regards
Andy Fan


Re: [HACKERS] Runtime Partition Pruning

2020-10-13 Thread Amit Langote
On Thu, Oct 8, 2020 at 8:25 PM Ashutosh Bapat
 wrote:
> On Wed, Oct 7, 2020 at 7:00 PM Andy Fan  wrote:
> > I can understand Robert's idea now,  he intended to resolve both the
> > "initial-partition-prune" case and "runtime partition prune" case while I 
> > always think
> > about the former case (Amit reminded me about that at the beginning, but I 
> > just
> > wake up right now..)
> >
> > With the Robert's idea,  I think we can do some hack at create_append_path,
> > we can figure out the pruneinfo (it was done at create_append_plan now) at 
> > that
> > stage, and then did a fix_append_path with such pruneinfo.  We need to fix 
> > not
> > only the cost of AppendPath,  but also rows of AppendPath, For example:
> > SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the
> > plan is:
> > Nest Loop
> >Nest Loop
> >   t1
> >   Append (p)
> >t2
> >
> > Then the rows of Append (p) will be very important.
> >
> > For this idea,  how to estimate how many partitions/rows can be pruned is a 
> > key
> > part.  Robert said "I was thinking, rather, that if we know for example that
> > we've doing pruning on partition_column = $1, then we know that only
> > one partition will match.  That's probably a common case.  If we've
> > got partition_column > $1, we could assume that, say, 75% of the
> > partitions would match.  partition_column BETWEEN $1 and $2 is
> > probably a bit more selective, so maybe we assume 50% of the
> > partitions would match.".   I think we can't say the 75% or 50% is a good
> > number,  but the keypoint may be "partition_column = $1" is a common
> > case.  And for the others case, we probably don't make it worse.
> >
> > I think we need to do something here, any thoughts? Personally I'm more
> > like this idea now.
>
> Yes, I think we have to do something on those lines. But I am
> wondering why our stats machinery is failing to do that already. For
> equality I think it's straight forward. But even for other operators
> we should use the statistics from the partitioned table to estimate
> the selectivity and then adjust number of scanned partitions = (number
> of partitions * fraction of rows scanned). That might be better than
> 50% or 75%.

This is an interesting idea, that is, consulting the parent relation's
stats to guess at the number of partitions that might be scanned.

However, we don't currently consult an inheritance parent relation's
stats at all during "base" relation planning, which is why you don't
see the parent relation's statistics reflected in the row count and
costs assigned to its (Append et al.) paths.  The hard-coded rule is to
sum up the children's rows and their paths' costs; see cost_append().

My thinking on this was that we'd just extend that hard-coded rule to
take into account that run-time pruning, if applicable in a given
plan, will cause some of those child paths to be discarded.  Maybe
Andy is headed in that direction?

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: [HACKERS] Runtime Partition Pruning

2020-10-12 Thread Andy Fan
On Mon, Oct 12, 2020 at 5:48 PM Ashutosh Bapat 
wrote:

> On Mon, Oct 12, 2020 at 7:59 AM Andy Fan  wrote:
>
> >
> > Sorry for the late reply!  Suppose we have partition defined like this:
> > p
> > - p1
> > - p2
> >
> > When you talk about "the statistics from the partitioned table", do you
> > mean the statistics from p or p1/p2?   I just confirmed there is no
> statistics
> > for p (at least pg_class.reltuples = -1),  so I think you are talking
> about
> > p1/p2.
>
> I am talking about p when I say statistics from the partitioned table.
> I see that pg_statistic row from p is well populated.
> pg_class.reltuples = -1 indicates that the heap doesn't have any rows.
> set_rel_size() sets the number of rows in the partitioned table based
> on the rows in individual unpruned partitions.
>
>
Glad to know that; thanks for the info!


> >
> > Here we are talking about partkey = $1 or  partkey = RunTimeValue.
> > so even the value can hit 1 partition only,  but since we don't know
> > the exact value, so we treat all the partition equally.  so looks
> > nothing wrong with partition level estimation.  However when we cost
> > the Append path, we need know just one of them can be hit,  then
> > we need do something there.  Both AppendPath->rows/total_cost
> > should be adjusted (That doesn't happen now).
>
> I think in this case we can safely assume that only one partition will
> remain so normalize costs considering that only one partition will
> survive.
>

Exactly.  What I am trying to do is fix this at create_append_path;
do you have different suggestions?  As for the partkey > $1 case, I
think even if we use the partition-level statistics, it would still be
a hard-coded guess, since we don't know what value $1 will be.

I have gone through the main part of the run-time partition pruning
code and hope to post a runnable patch soon.  The main idea is to fix
the rows/costs at the create_append_path stage, so any suggestion in a
different direction would be very useful.

-- 
Best Regards
Andy Fan


Re: [HACKERS] Runtime Partition Pruning

2020-10-12 Thread Ashutosh Bapat
On Mon, Oct 12, 2020 at 7:59 AM Andy Fan  wrote:

>
> Sorry for the late reply!  Suppose we have partition defined like this:
> p
> - p1
> - p2
>
> When you talk about "the statistics from the partitioned table", do you
> mean the statistics from p or p1/p2?   I just confirmed there is no statistics
> for p (at least pg_class.reltuples = -1),  so I think you are talking about
> p1/p2.

I am talking about p when I say "statistics from the partitioned table".
I see that the pg_statistic rows for p are well populated.
pg_class.reltuples = -1 indicates that the heap doesn't have any rows.
set_rel_size() sets the number of rows in the partitioned table based
on the rows in the individual unpruned partitions.

>
> Here we are talking about partkey = $1 or  partkey = RunTimeValue.
> so even the value can hit 1 partition only,  but since we don't know
> the exact value, so we treat all the partition equally.  so looks
> nothing wrong with partition level estimation.  However when we cost
> the Append path, we need know just one of them can be hit,  then
> we need do something there.  Both AppendPath->rows/total_cost
> should be adjusted (That doesn't happen now).

I think in this case we can safely assume that only one partition will
remain, so we can normalize the costs on the assumption that only one
partition survives.

-- 
Best Wishes,
Ashutosh Bapat




Re: [HACKERS] Runtime Partition Pruning

2020-10-11 Thread Andy Fan
Hi Ashutosh:

On Thu, Oct 8, 2020 at 7:25 PM Ashutosh Bapat 
wrote:

> On Wed, Oct 7, 2020 at 7:00 PM Andy Fan  wrote:
> >
> >
> >
> > On Wed, Oct 7, 2020 at 5:05 PM Andy Fan 
> wrote:
> >>
> >>
> >>
> >> On Sun, Oct 4, 2020 at 3:10 PM Andy Fan 
> wrote:
> 
> 
> 
>  Now, in my experience, the current system for custom plans vs. generic
>  plans doesn't approach the problem in this way at all, and in my
>  experience that results in some pretty terrible behavior.  It will do
>  things like form a custom plan every time because the estimated cost
>  of the custom plan is lower than the estimated cost of the generic
>  plan even though the two plans are structurally identical; only the
>  estimates differ.  It will waste gobs of CPU cycles by replanning a
>  primary key lookup 5 times just on the off chance that a lookup on the
>  primary key index isn't the best option.  But this patch isn't going
>  to fix any of that.  The best we can probably do is try to adjust the
>  costing for Append paths in some way that reflects the costs and
>  benefits of pruning.  I'm tentatively in favor of trying to do
>  something modest in that area, but I don't have a detailed proposal.
> 
> >>>
> >>> I just realized this issue recently and reported it at [1], then Amit
> pointed
> >>> me to this issue being discussed here, so I would like to continue
> this topic
> >>> here.
> >>>
> >>> I think we can split the issue into 2 issues.  One is the partition
> prune in initial
> >>> partition prune, which maybe happen in custom plan case only and caused
> >>> the above issue.  The other one happens in the "Run-Time" partition
> prune,
> >>> I admit that is an important issue to resolve as well, but looks
> harder.   So I
> >>> think we can fix the first one at first.
> >>>
> >>> ... When we count for the cost of a
> >>> generic plan, we can reduce the cost based on such information.
> >>
> >>
> >> This way doesn't work since after the initial partition prune,  not
> only the
> >> cost of the Append node should be reduced,  the cost of other plans
> should
> >> be reduced as well [1]
> >>
> >> However I think if we can use partition prune information from a custom
> plan
> >> at the cost_append_path stage,  it looks the issue can be fixed.  If
> so, the idea
> >> is similar to David's idea in [2],  however Robert didn't agree with
> this[2].
> >> Can anyone elaborate this objection?  for a partkey > $1 or BETWEEN
> cases,
> >> some real results from the past are probably better than some hard-coded
> >> assumptions IMO.
> >
> >
> > I can understand Robert's idea now,  he intended to resolve both the
> > "initial-partition-prune" case and "runtime partition prune" case while
> I always think
> > about the former case (Amit reminded me about that at the beginning, but
> I just
> > wake up right now..)
> >
> > With the Robert's idea,  I think we can do some hack at
> create_append_path,
> > we can figure out the pruneinfo (it was done at create_append_plan now)
> at that
> > stage, and then did a fix_append_path with such pruneinfo.  We need to
> fix not
> > only the cost of AppendPath,  but also rows of AppendPath, For example:
> > SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the
> > plan is:
> > Nest Loop
> >Nest Loop
> >   t1
> >   Append (p)
> >t2
> >
> > Then the rows of Append (p) will be very important.
> >
> > For this idea,  how to estimate how many partitions/rows can be pruned
> is a key
> > part.  Robert said "I was thinking, rather, that if we know for example
> that
> > we've doing pruning on partition_column = $1, then we know that only
> > one partition will match.  That's probably a common case.  If we've
> > got partition_column > $1, we could assume that, say, 75% of the
> > partitions would match.  partition_column BETWEEN $1 and $2 is
> > probably a bit more selective, so maybe we assume 50% of the
> > partitions would match.".   I think we can't say the 75% or 50% is a good
> > number,  but the keypoint may be "partition_column = $1" is a common
> > case.  And for the others case, we probably don't make it worse.
> >
> > I think we need to do something here, any thoughts? Personally I'm more
> > like this idea now.
>
> Yes, I think we have to do something on those lines. But I am
> wondering why our stats machinery is failing to do that already. For
> equality I think it's straight forward. But even for other operators
> we should use the statistics from the partitioned table to estimate
> the selectivity and then adjust number of scanned partitions = (number
> of partitions * fraction of rows scanned). That might be better than
> 50% or 75%.
>
>
Sorry for the late reply!  Suppose we have a partitioned table defined
like this:
p
- p1
- p2

When you talk about "the statistics from the partitioned table", do you
mean the statistics for p or for p1/p2?   I just confirmed there are no
statistics
for p (at least 

Re: [HACKERS] Runtime Partition Pruning

2020-10-08 Thread Ashutosh Bapat
On Wed, Oct 7, 2020 at 7:00 PM Andy Fan  wrote:
>
>
>
> On Wed, Oct 7, 2020 at 5:05 PM Andy Fan  wrote:
>>
>>
>>
>> On Sun, Oct 4, 2020 at 3:10 PM Andy Fan  wrote:



 Now, in my experience, the current system for custom plans vs. generic
 plans doesn't approach the problem in this way at all, and in my
 experience that results in some pretty terrible behavior.  It will do
 things like form a custom plan every time because the estimated cost
 of the custom plan is lower than the estimated cost of the generic
 plan even though the two plans are structurally identical; only the
 estimates differ.  It will waste gobs of CPU cycles by replanning a
 primary key lookup 5 times just on the off chance that a lookup on the
 primary key index isn't the best option.  But this patch isn't going
 to fix any of that.  The best we can probably do is try to adjust the
 costing for Append paths in some way that reflects the costs and
 benefits of pruning.  I'm tentatively in favor of trying to do
 something modest in that area, but I don't have a detailed proposal.

>>>
>>> I just realized this issue recently and reported it at [1], then Amit 
>>> pointed
>>> me to this issue being discussed here, so I would like to continue this 
>>> topic
>>> here.
>>>
>>> I think we can split the issue into 2 issues.  One is the partition prune 
>>> in initial
>>> partition prune, which maybe happen in custom plan case only and caused
>>> the above issue.  The other one happens in the "Run-Time" partition prune,
>>> I admit that is an important issue to resolve as well, but looks harder.   
>>> So I
>>> think we can fix the first one at first.
>>>
>>> ... When we count for the cost of a
>>> generic plan, we can reduce the cost based on such information.
>>
>>
>> This way doesn't work since after the initial partition prune,  not only the
>> cost of the Append node should be reduced,  the cost of other plans should
>> be reduced as well [1]
>>
>> However I think if we can use partition prune information from a custom plan
>> at the cost_append_path stage,  it looks the issue can be fixed.  If so, the 
>> idea
>> is similar to David's idea in [2],  however Robert didn't agree with this[2].
>> Can anyone elaborate this objection?  for a partkey > $1 or BETWEEN cases,
>> some real results from the past are probably better than some hard-coded
>> assumptions IMO.
>
>
> I can understand Robert's idea now,  he intended to resolve both the
> "initial-partition-prune" case and "runtime partition prune" case while I 
> always think
> about the former case (Amit reminded me about that at the beginning, but I 
> just
> wake up right now..)
>
> With the Robert's idea,  I think we can do some hack at create_append_path,
> we can figure out the pruneinfo (it was done at create_append_plan now) at 
> that
> stage, and then did a fix_append_path with such pruneinfo.  We need to fix not
> only the cost of AppendPath,  but also rows of AppendPath, For example:
> SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the
> plan is:
> Nest Loop
>Nest Loop
>   t1
>   Append (p)
>t2
>
> Then the rows of Append (p) will be very important.
>
> For this idea,  how to estimate how many partitions/rows can be pruned is a 
> key
> part.  Robert said "I was thinking, rather, that if we know for example that
> we've doing pruning on partition_column = $1, then we know that only
> one partition will match.  That's probably a common case.  If we've
> got partition_column > $1, we could assume that, say, 75% of the
> partitions would match.  partition_column BETWEEN $1 and $2 is
> probably a bit more selective, so maybe we assume 50% of the
> partitions would match.".   I think we can't say the 75% or 50% is a good
> number,  but the keypoint may be "partition_column = $1" is a common
> case.  And for the others case, we probably don't make it worse.
>
> I think we need to do something here, any thoughts? Personally I'm more
> like this idea now.

Yes, I think we have to do something on those lines. But I am
wondering why our stats machinery is failing to do that already. For
equality I think it's straightforward. But even for other operators
we should use the statistics from the partitioned table to estimate
the selectivity and then adjust the number of scanned partitions to
(number of partitions * fraction of rows scanned). That might be
better than 50% or 75%.

-- 
Best Wishes,
Ashutosh Bapat




Re: [HACKERS] Runtime Partition Pruning

2020-10-07 Thread Andy Fan
On Wed, Oct 7, 2020 at 5:05 PM Andy Fan  wrote:

>
>
> On Sun, Oct 4, 2020 at 3:10 PM Andy Fan  wrote:
>
>>
>>>
>>> Now, in my experience, the current system for custom plans vs. generic
>>> plans doesn't approach the problem in this way at all, and in my
>>> experience that results in some pretty terrible behavior.  It will do
>>> things like form a custom plan every time because the estimated cost
>>> of the custom plan is lower than the estimated cost of the generic
>>> plan even though the two plans are structurally identical; only the
>>> estimates differ.  It will waste gobs of CPU cycles by replanning a
>>> primary key lookup 5 times just on the off chance that a lookup on the
>>> primary key index isn't the best option.  But this patch isn't going
>>> to fix any of that.  The best we can probably do is try to adjust the
>>> costing for Append paths in some way that reflects the costs and
>>> benefits of pruning.  I'm tentatively in favor of trying to do
>>> something modest in that area, but I don't have a detailed proposal.
>>>
>>>
>> I just realized this issue recently and reported it at [1], then Amit
>> pointed
>> me to this issue being discussed here, so I would like to continue this
>> topic
>> here.
>>
>> I think we can split the issue into 2 issues.  One is the partition prune
>> in initial
>> partition prune, which maybe happen in custom plan case only and caused
>> the above issue.  The other one happens in the "Run-Time" partition
>> prune,
>> I admit that is an important issue to resolve as well, but looks harder.
>>  So I
>> think we can fix the first one at first.
>>
>> ... When we count for the cost of a
>> generic plan, we can reduce the cost based on such information.
>>
>
> This way doesn't work since after the initial partition prune,  not only
> the
> cost of the Append node should be reduced,  the cost of other plans should
> be reduced as well [1]
>
> However I think if we can use partition prune information from a custom
> plan
> at the cost_append_path stage,  it looks the issue can be fixed.  If so,
> the idea
> is similar to David's idea in [2],  however Robert didn't agree with
> this[2].
> Can anyone elaborate this objection?  for a partkey > $1 or BETWEEN cases,
> some real results from the past are probably better than some hard-coded
> assumptions IMO.
>

I understand Robert's idea now: he intended to resolve both the
"initial-partition-prune" case and the "runtime partition prune" case,
while I had always been thinking about the former (Amit reminded me
about that at the beginning, but I am only realizing it now).

With Robert's idea, I think we can do some hacking at
create_append_path: we can figure out the pruneinfo there (it is done
at create_append_plan now) and then fix the AppendPath with that
pruneinfo.  We need to fix not only the cost of the AppendPath, but
also its rows.  For example, given
SELECT .. FROM t1, t2, p WHERE t1.a = p.partkey AND t1.b = t2.b, if the
plan is:

Nest Loop
    Nest Loop
        t1
        Append (p)
    t2

then the rows of Append (p) will be very important.

For this idea, how to estimate how many partitions/rows can be pruned
is a key part.  Robert said: "I was thinking, rather, that if we know
for example that we're doing pruning on partition_column = $1, then we
know that only one partition will match.  That's probably a common
case.  If we've got partition_column > $1, we could assume that, say,
75% of the partitions would match.  partition_column BETWEEN $1 and $2
is probably a bit more selective, so maybe we assume 50% of the
partitions would match."  I don't think we can say 75% or 50% is a
good number, but the key point may be that "partition_column = $1" is
a common case.  And for the other cases, we probably shouldn't make
things worse.

I think we need to do something here; any thoughts?  Personally I like
this idea more now.

-- 
Best Regards
Andy Fan


Re: [HACKERS] Runtime Partition Pruning

2020-10-07 Thread Andy Fan
On Sun, Oct 4, 2020 at 3:10 PM Andy Fan  wrote:

>
>>
>> Now, in my experience, the current system for custom plans vs. generic
>> plans doesn't approach the problem in this way at all, and in my
>> experience that results in some pretty terrible behavior.  It will do
>> things like form a custom plan every time because the estimated cost
>> of the custom plan is lower than the estimated cost of the generic
>> plan even though the two plans are structurally identical; only the
>> estimates differ.  It will waste gobs of CPU cycles by replanning a
>> primary key lookup 5 times just on the off chance that a lookup on the
>> primary key index isn't the best option.  But this patch isn't going
>> to fix any of that.  The best we can probably do is try to adjust the
>> costing for Append paths in some way that reflects the costs and
>> benefits of pruning.  I'm tentatively in favor of trying to do
>> something modest in that area, but I don't have a detailed proposal.
>>
>>
> I just realized this issue recently and reported it at [1], then Amit
> pointed
> me to this issue being discussed here, so I would like to continue this
> topic
> here.
>
> I think we can split the issue into 2 issues.  One is the partition prune
> in initial
> partition prune, which maybe happen in custom plan case only and caused
> the above issue.  The other one happens in the "Run-Time" partition prune,
> I admit that is an important issue to resolve as well, but looks harder.
>  So I
> think we can fix the first one at first.
>
> ... When we count for the cost of a
> generic plan, we can reduce the cost based on such information.
>

This way doesn't work: after the initial partition prune, not only
should the cost of the Append node be reduced, but the costs of the
other plans should be reduced as well [1].

However, I think if we can use the partition prune information from a
custom plan at the cost_append_path stage, the issue can be fixed.  If
so, the idea is similar to David's idea in [2]; however, Robert didn't
agree with it [3].
Can anyone elaborate on this objection?  For the partkey > $1 or
BETWEEN cases, some real results from the past are probably better
than hard-coded assumptions, IMO.

[1]
https://www.postgresql.org/message-id/CAKU4AWrWSCFO5fh01GTnN%2B1T8K8MyVAi4Gw-TvYC-Vhx3JohUw%40mail.gmail.com

[2]
https://www.postgresql.org/message-id/CAKJS1f8q_d7_Viweeivt1eS4Q8a0WAGFbrgeX38468mVgKseTA%40mail.gmail.com

[3]
https://www.postgresql.org/message-id/CA%2BTgmoZv8sd9cKyYtHwmd_13%2BBAjkVKo%3DECe7G98tBK5Ejwatw%40mail.gmail.com

-- 
Best Regards
Andy Fan


Re: [HACKERS] Runtime Partition Pruning

2020-10-04 Thread Andy Fan
>
>
>
> Now, in my experience, the current system for custom plans vs. generic
> plans doesn't approach the problem in this way at all, and in my
> experience that results in some pretty terrible behavior.  It will do
> things like form a custom plan every time because the estimated cost
> of the custom plan is lower than the estimated cost of the generic
> plan even though the two plans are structurally identical; only the
> estimates differ.  It will waste gobs of CPU cycles by replanning a
> primary key lookup 5 times just on the off chance that a lookup on the
> primary key index isn't the best option.  But this patch isn't going
> to fix any of that.  The best we can probably do is try to adjust the
> costing for Append paths in some way that reflects the costs and
> benefits of pruning.  I'm tentatively in favor of trying to do
> something modest in that area, but I don't have a detailed proposal.
>
>
I just realized this issue recently and reported it at [1]; Amit then
pointed me to this discussion, so I would like to continue the topic
here.

I think we can split this into two issues.  One is the initial
partition prune, which may happen in the custom plan case only and
caused the above issue.  The other happens in the "run-time" partition
prune; I admit that is an important issue to resolve as well, but it
looks harder.  So I think we can fix the first one first.

The proposal for the first one is to remember how many partitions
survived plan-time pruning for a RelOptInfo in a custom plan, and to
record that information in the CachedPlanSource.  When we compute the
cost of a generic plan, we can reduce the cost based on that
information.

Any thoughts about this?  I'd be sorry if I missed an already existing
discussion on this topic.

[1]
https://www.postgresql.org/message-id/CA%2BHiwqGsP2L0BW1ad58HRSj1NouNSVHLfL5pm7%3DPBTvL0b%2B-BQ%40mail.gmail.com


-- 
Best Regards
Andy Fan


Re: [HACKERS] Runtime Partition Pruning

2019-05-29 Thread Robert Haas
On Wed, May 29, 2019 at 6:02 PM Tom Lane  wrote:
> Well, it *is* a problem.  The whole point of this discussion I think is
> to try to get better information "by default" for routine bug reports.
> So if those come from production servers without debug symbols, which
> I believe will be the usual case, then it seems likely to me that
> libunwind will produce no better results than glibc.  (But perhaps
> I'm wrong about that --- I have not experimented with libunwind.)

Sure, I agree.

> Now it's true that "install debug symbols" is less of an ask than
> "install debug symbols, *and* gdb, and make sure server core dumps are
> enabled, and then go through this arcane manual procedure next time
> you get a core dump".  But we shouldn't fool ourselves that it isn't
> an ask that's going to be hard for people with corporate policies
> against installing extra stuff on production servers.

There may be cases where that is true, but as you say, it's better
than what we have now.  Plus, what exactly is the alternative?  We
could:

- encourage packagers to install debug symbols by default (but they
might not; it might even be against policy), or
- invent our own system for generating backtraces and ignore what the
OS toolchain knows how to do (sounds painfully complex and expensive),
or
- just live with the fact that it's imperfect.

Is there a fourth option?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Runtime Partition Pruning

2019-05-29 Thread Tom Lane
Robert Haas  writes:
> On Fri, May 24, 2019 at 12:09 PM Tom Lane  wrote:
>> Is it actually better?  The basic problem with backtrace() is that it
>> only knows about global functions, and so reports call sites in static
>> functions as if they were in whatever global function physically precedes
>> the static one.  I think doing materially better requires depending on
>> debug symbols, which (at least in the Red Hat world) aren't going to
>> be there in a typical production situation.

> I don't have an opinion on glibc vs. libunwind, but I don't understand
> this argument.  If you are unlucky enough to have a production server
> that is crashing due to some hitherto-unknown bug, and if it's not
> possible to get a good backtrace without installing debugging symbols,
> then you are going to have to pick between (1) installing those
> debugging symbols and (2) getting a poor backtrace.  I don't really
> see that as a problem so much as just the way life is.

Well, it *is* a problem.  The whole point of this discussion I think is
to try to get better information "by default" for routine bug reports.
So if those come from production servers without debug symbols, which
I believe will be the usual case, then it seems likely to me that
libunwind will produce no better results than glibc.  (But perhaps
I'm wrong about that --- I have not experimented with libunwind.)

Now it's true that "install debug symbols" is less of an ask than
"install debug symbols, *and* gdb, and make sure server core dumps are
enabled, and then go through this arcane manual procedure next time
you get a core dump".  But we shouldn't fool ourselves that it isn't
an ask that's going to be hard for people with corporate policies
against installing extra stuff on production servers.

regards, tom lane




Re: [HACKERS] Runtime Partition Pruning

2019-05-29 Thread Robert Haas
On Fri, May 24, 2019 at 12:09 PM Tom Lane  wrote:
> Is it actually better?  The basic problem with backtrace() is that it
> only knows about global functions, and so reports call sites in static
> functions as if they were in whatever global function physically precedes
> the static one.  I think doing materially better requires depending on
> debug symbols, which (at least in the Red Hat world) aren't going to
> be there in a typical production situation.

I don't have an opinion on glibc vs. libunwind, but I don't understand
this argument.  If you are unlucky enough to have a production server
that is crashing due to some hitherto-unknown bug, and if it's not
possible to get a good backtrace without installing debugging symbols,
then you are going to have to pick between (1) installing those
debugging symbols and (2) getting a poor backtrace.  I don't really
see that as a problem so much as just the way life is.  You can't
expect to get good debugging output without debugging symbols, just as
you can't expect to get a clean audit without bank statements or a
clear idea of how to find your way to an unknown destination without a
map.  If you don't have the thing that contains the information that
is needed, you can't get the information; c'est la vie.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Runtime Partition Pruning

2019-05-27 Thread Andres Freund
Hi,

On 2019-05-25 00:42:39 +0200, Tomas Vondra wrote:
> On Fri, May 24, 2019 at 09:24:28AM -0700, Andres Freund wrote:
> > On 2019-05-24 12:08:57 -0400, Tom Lane wrote:
> > > Andres Freund  writes:
> > > The basic problem with backtrace() is that it
> > > only knows about global functions, and so reports call sites in static
> > > functions as if they were in whatever global function physically precedes
> > > the static one.
> > 
> > Does that depend on whether the program was compiled with
> > -fno-omit-frame-pointer? At least some distros now compile with frame
> > pointers enabled, to get reasonably fast perf profiles (at a basically
> > immeasurable slowdown, on modern-ish CPUs).
> > 
> 
> I doubt that, because if that was the case we'd not be able to get
> accurate profiles from perf, no? And AFAICS that's not the case,
irrespective of whether -fno-omit-frame-pointer is used.

I can't parse this. With perf you can get accurate call-graph profiles
if you either use -fno-omit-frame-pointer, to force frame pointers to be
present (so the call graph can cheaply be assembled during profiling),
or with dwarf (the entire stack is saved, and the unwinding is done with
dwarf at perf report time - very large), or with lbr (the CPU saves traces
of branches taken, from which call graphs can be computed, but it only
works when the stacks aren't very deep).
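For reference, the three unwinding strategies described above map directly
onto perf's --call-graph modes. A hedged sketch (the flag names come from
perf's documentation; ./a.out is a placeholder binary):

```shell
# Frame pointers: cheap at record time, but the target must be built
# with -fno-omit-frame-pointer for the graphs to be complete.
perf record --call-graph fp -- ./a.out

# DWARF: copies a chunk of each sampled stack and unwinds at report
# time; accurate without frame pointers, but data files get large.
perf record --call-graph dwarf -- ./a.out

# LBR: uses the CPU's last-branch-record registers; no recompilation
# needed, but the hardware buffer limits how deep a graph it can see.
perf record --call-graph lbr -- ./a.out

perf report
```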


> > > I think doing materially better requires depending on
> > > debug symbols, which (at least in the Red Hat world) aren't going to
> > > be there in a typical production situation.
> > 
> > It's still a lot easier to install debug symbols than to attach a
> > debugger and get a backtrace that way. Especially when the problem is
> > hard to reproduce.
> > 
> 
> Right. Debugger requires interaction with a running process, while
> having it integrated would make that unnecessary.
> 
> But I think Peter also suggested this might require the ability to
> selectively enable the backtraces, and I think he's right. I doubt we
> want to log them for every log message, right?

Well, I think that if we had PANIC, SIGSEGV/SIGBUS, and most FATALs covered,
we'd be off to a very good start. I'm not sure it's wise to give users
control over the computation of stack traces.

Greetings,

Andres Freund




Re: [HACKERS] Runtime Partition Pruning

2019-05-24 Thread Tomas Vondra

On Fri, May 24, 2019 at 09:24:28AM -0700, Andres Freund wrote:

Hi,

On 2019-05-24 12:08:57 -0400, Tom Lane wrote:

Andres Freund  writes:
> On 2019-05-24 11:34:58 -0400, Tom Lane wrote:
>> Hmm, after some digging in the archives, the closest thing I can find
>> is this thread:
>> 
https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com
>> where we discussed using libunwind instead, but people didn't like
>> the extra dependency.

> Hm, I didn't actually see that much concern about that. I still think we
> should just go for libunwind.

Is it actually better?


I've not looked in a while, but I think at some point it was.



The basic problem with backtrace() is that it
only knows about global functions, and so reports call sites in static
functions as if they were in whatever global function physically precedes
the static one.


Does that depend on whether the program was compiled with
-fno-omit-frame-pointer? At least some distros now compile with frame
pointers enabled, to get reasonably fast perf profiles (at a basically
immeasurable slowdown, on modern-ish CPUs).



I doubt that, because if that was the case we'd not be able to get
accurate profiles from perf, no? And AFAICS that's not the case,
irrespective of whether -fno-omit-frame-pointer is used.





I think doing materially better requires depending on
debug symbols, which (at least in the Red Hat world) aren't going to
be there in a typical production situation.


It's still a lot easier to install debug symbols than to attach a
debugger and get a backtrace that way. Especially when the problem is
hard to reproduce.



Right. Debugger requires interaction with a running process, while
having it integrated would make that unnecessary.

But I think Peter also suggested this might require the ability to
selectively enable the backtraces, and I think he's right. I doubt we
want to log them for every log message, right?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: [HACKERS] Runtime Partition Pruning

2019-05-24 Thread Andres Freund
Hi,

On 2019-05-24 12:08:57 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-05-24 11:34:58 -0400, Tom Lane wrote:
> >> Hmm, after some digging in the archives, the closest thing I can find
> >> is this thread:
> >> https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com
> >> where we discussed using libunwind instead, but people didn't like
> >> the extra dependency.
> 
> > Hm, I didn't actually see that much concern about that. I still think we
> > should just go for libunwind.
> 
> Is it actually better?

I've not looked in a while, but I think at some point it was.


> The basic problem with backtrace() is that it
> only knows about global functions, and so reports call sites in static
> functions as if they were in whatever global function physically precedes
> the static one.

Does that depend on whether the program was compiled with
-fno-omit-frame-pointer? At least some distros now compile with frame
pointers enabled, to get reasonably fast perf profiles (at a basically
immeasurable slowdown, on modern-ish CPUs).


> I think doing materially better requires depending on
> debug symbols, which (at least in the Red Hat world) aren't going to
> be there in a typical production situation.

It's still a lot easier to install debug symbols than to attach a
debugger and get a backtrace that way. Especially when the problem is
hard to reproduce.

Greetings,

Andres Freund




Re: [HACKERS] Runtime Partition Pruning

2019-05-24 Thread Tom Lane
Andres Freund  writes:
> On 2019-05-24 11:34:58 -0400, Tom Lane wrote:
>> Hmm, after some digging in the archives, the closest thing I can find
>> is this thread:
>> https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com
>> where we discussed using libunwind instead, but people didn't like
>> the extra dependency.

> Hm, I didn't actually see that much concern about that. I still think we
> should just go for libunwind.

Is it actually better?  The basic problem with backtrace() is that it
only knows about global functions, and so reports call sites in static
functions as if they were in whatever global function physically precedes
the static one.  I think doing materially better requires depending on
debug symbols, which (at least in the Red Hat world) aren't going to
be there in a typical production situation.

regards, tom lane




Re: [HACKERS] Runtime Partition Pruning

2019-05-24 Thread Andres Freund
Hi,

On 2019-05-24 11:34:58 -0400, Tom Lane wrote:
> I wrote:
> > Peter Eisentraut  writes:
> >> What do people think about adding something like this errbacktrace()
> >> from Álvaro's patch to core PostgreSQL?
> 
> > I think we did discuss it right after that, or somewhere nearby, and
> > concluded that the output is so imprecise that it's not really going
> > to be worth whatever portability issues we'd have to deal with.
> 
> Hmm, after some digging in the archives, the closest thing I can find
> is this thread:
> 
> https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com
> 
> where we discussed using libunwind instead, but people didn't like
> the extra dependency.

Hm, I didn't actually see that much concern about that. I still think we
should just go for libunwind. At least on debian it's likely to already
be installed:

andres@alap4:~$ apt rdepends libunwind8
libunwind8
Reverse Depends:
  Depends: libunwind-dev (= 1.2.1-9)
  Depends: linux-perf-4.16
  Depends: linux-perf-4.15
  Depends: linux-perf-4.14
  Depends: rspamd
  Depends: linux-perf-5.0
  Depends: libjulia1
  Depends: julia
  Depends: geary
  Depends: libunwind8-dbgsym (= 1.2.1-9)
  Depends: xwayland
  Depends: xvfb
  Depends: xserver-xorg-core
  Depends: xserver-xephyr
  Depends: xnest
  Depends: xdmx
  Depends: trafficserver
  Depends: tigervnc-standalone-server
  Depends: tarantool
  Depends: strace
  Depends: spring
  Depends: rspamd
  Depends: linux-perf-4.19
  Depends: libunwind-setjmp0
  Depends: libeina1a
  Depends: libjulia1
  Depends: julia
  Depends: intel-gpu-tools
  Depends: libheaptrack
  Depends: libgoogle-perftools4
  Depends: libgoogle-glog0v5
  Depends: gdnsd
  Depends: libevas1-engines-x
  Depends: libevas1

In particular strace, xserver-xorg-core and perf are reasonably likely to
already be installed.

It's also not a large library. I'd bet if we made it an optional
build-time dependency it'd get included by just about every distro.

Greetings,

Andres Freund




Re: [HACKERS] Runtime Partition Pruning

2019-05-24 Thread Tom Lane
I wrote:
> Peter Eisentraut  writes:
>> What do people think about adding something like this errbacktrace()
>> from Álvaro's patch to core PostgreSQL?

> I think we did discuss it right after that, or somewhere nearby, and
> concluded that the output is so imprecise that it's not really going
> to be worth whatever portability issues we'd have to deal with.

Hmm, after some digging in the archives, the closest thing I can find
is this thread:

https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com

where we discussed using libunwind instead, but people didn't like
the extra dependency.

However, I stand by the assertion that glibc's backtrace() is too
imprecise to be useful; I've experimented with it and despaired of
being able to tell where control had actually been.
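The imprecision is easy to reproduce with a few lines of glibc's execinfo
API (a standalone sketch, not PostgreSQL code):

```c
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

/* A static function: it has no entry in the dynamic symbol table, so
 * backtrace_symbols() attributes its frame to whichever exported symbol
 * happens to precede it in the text section -- exactly the imprecision
 * described above.  Returns the number of frames captured. */
static int dump_stack(void)
{
    void *frames[32];
    int n = backtrace(frames, 32);
    char **syms = backtrace_symbols(frames, n);

    if (syms != NULL)
    {
        for (int i = 0; i < n; i++)
            printf("%s\n", syms[i]);
        free(syms);
    }
    return n;
}
```

Building with -rdynamic exports more symbols and improves the output
somewhat, but static functions stay misattributed either way.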

regards, tom lane




Re: [HACKERS] Runtime Partition Pruning

2019-05-24 Thread Tom Lane
Peter Eisentraut  writes:
> On 2018-04-10 23:32, Alvaro Herrera wrote:
>> To figure out, I used the attached patch (not intended for application)
>> to add a backtrace to each log message, plus a couple of accusatory
>> elog() calls in relation_open and ExecSetupPartitionPruneState.

> What do people think about adding something like this errbacktrace()
> from Álvaro's patch to core PostgreSQL?  If we could devise a way to
> selectively enable it, it might be an easier way for users to provide
> backtraces from errors.

I think we did discuss it right after that, or somewhere nearby, and
concluded that the output is so imprecise that it's not really going
to be worth whatever portability issues we'd have to deal with.

I'd be all for a better version, but glibc's backtrace() just sucks,
at least given our coding style with lots of static functions.

regards, tom lane




Re: [HACKERS] Runtime Partition Pruning

2019-05-24 Thread Peter Eisentraut
On 2018-04-10 23:32, Alvaro Herrera wrote:
> To figure out, I used the attached patch (not intended for application)
> to add a backtrace to each log message, plus a couple of accusatory
> elog() calls in relation_open and ExecSetupPartitionPruneState.

What do people think about adding something like this errbacktrace()
from Álvaro's patch to core PostgreSQL?  If we could devise a way to
selectively enable it, it might be an easier way for users to provide
backtraces from errors.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Runtime Partition Pruning

2018-04-25 Thread David Rowley
On 25 April 2018 at 05:10, Alvaro Herrera  wrote:
> Pushed.  Thanks!

Thanks!

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-25 Thread David Rowley
On 25 April 2018 at 06:21, Alvaro Herrera  wrote:
> Andres Freund wrote:
>
>> What I wonder, after skimming this change, is where the relevant
>> expression context is reset?  That's not really related to this change
>> but the wider thread, I just noticed it while looking at this.
>
> Do you mean ResetExprContext?  We use the plan's context, so it should
> occur together with the plan reset itself, i.e. nodeAppend.c should do
> it somewhere.
>
> ... Hmm, it appears we don't do it anywhere there actually.

It's not immediately obvious to me why this is required.

All the expressions that are initialized here must live the entire
length of the executor run, and since they're allocated in the
ExecutorState context they'll be destroyed in FreeExecutorState().

If we do need to call ResetExprContext for these, then we'd just need
to invent a teardown for ExecSetupPartitionPruneState which would free
off the memory allocated and call ResetExprContext for all non-NULL
ExprStates in each context->exprstates. This function would need to be
called from the node's End function for any node that's set up a
PartitionPruneState.

Do we really need to do this given that the memory context these are
allocated in will be released a moment later anyway?

Just to ensure I'm not dreaming, I ran the following and couldn't see
the backend's memory consumption move.

create table lp (a int, value int) partition by list(a);
create table lp_1 partition of lp for values in(1);
create table lp_2 partition of lp for values in(2);
create function lp_value(p_a int) returns int as $$ select value from
lp where a = p_a; $$ language sql;
insert into lp values(1,10),(2,20);
select sum(lp_value(x)) from generate_Series(1,2) x,
generate_series(1,1000);

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-24 Thread Alvaro Herrera
Andres Freund wrote:

> What I wonder, after skimming this change, is where the relevant
> expression context is reset?  That's not really related to this change
> but the wider thread, I just noticed it while looking at this.

Do you mean ResetExprContext?  We use the plan's context, so it should
occur together with the plan reset itself, i.e. nodeAppend.c should do
it somewhere.

... Hmm, it appears we don't do it anywhere there actually.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-24 Thread Andres Freund
On 2018-04-19 12:04:35 +1200, David Rowley wrote:
> On 19 April 2018 at 03:13, Robert Haas  wrote:
> > 10% is more than a "slight" improvement, I'd say!  It's certainly got
> > to be worth avoiding the repeated calls to ExecInitExpr, whatever we
> > do about the memory contexts.

Yea, that seems important. Good that that got in.

What I wonder, after skimming this change, is where the relevant
expression context is reset?  That's not really related to this change
but the wider thread, I just noticed it while looking at this.

Greetings,

Andres Freund



Re: [HACKERS] Runtime Partition Pruning

2018-04-23 Thread Alvaro Herrera
Anybody wanna argue against pushing this patch now?  I'm inclined to
push it on the grounds of being closure for already committed work, but
there are possible arguments about this being new development after
feature freeze.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-18 Thread Amit Langote
Hi David.

On 2018/04/19 9:04, David Rowley wrote:
> On 19 April 2018 at 03:13, Robert Haas  wrote:
>> On Mon, Apr 16, 2018 at 10:46 PM, David Rowley
>>  wrote:
>>> The patch does happen to improve performance slightly, but that is
>>> most likely due to the caching of the ExprStates rather than the
>>> change of memory management. It's not really possible to do that with
>>> the reset unless we stored the executor's memory context in
>>> PartitionPruneContext and did a context switch back inside
>>> partkey_datum_from_expr before calling ExecInitExpr.
>>
>> 10% is more than a "slight" improvement, I'd say!  It's certainly got
>> to be worth avoiding the repeated calls to ExecInitExpr, whatever we
>> do about the memory contexts.
> 
> I've attached a patch which does just this. On benchmarking again with
> this single change performance has improved 15% over master.
> 
> Also, out of curiosity, I also checked what this performed like before
> the run-time pruning patch was committed (5c0675215). Taking the
> average of the times below, it seems without this patch the
> performance of this case has improved about 356% and about 410% with
> this patch. So, I agree, it might be worth considering.
> 
> create table p (a int, value int) partition by hash (a);
> select 'create table p'||x|| ' partition of p for values with (modulus
> 10, remainder '||x||');' from generate_series(0,9) x;
> \gexec
> create table t1 (a int);
> 
> insert into p select x,x from generate_Series(1,1000) x;
> insert into t1 select x from generate_series(1,1000) x;
> 
> create index on p(a);
> 
> set enable_hashjoin = 0;
> set enable_mergejoin = 0;
> explain analyze select count(*) from t1 inner join p on t1.a=p.a;
> 
> -- Unpatched
> Execution Time: 20413.975 ms
> Execution Time: 20232.050 ms
> Execution Time: 20229.116 ms
> 
> -- Patched
> Execution Time: 17758.111 ms
> Execution Time: 17645.151 ms
> Execution Time: 17492.260 ms
> 
> -- 5c0675215e153ba1297fd494b34af2fdebd645d1
> Execution Time: 72875.161 ms
> Execution Time: 71817.757 ms
> Execution Time: 72411.730 ms

That's neat!  Definitely agree that we should call ExecInitExpr just once
here.  The patch looks good too, except the long line.  Maybe:

@@ -1514,13 +1514,15 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
 foreach(lc3, step->exprs)
 {
 Expr   *expr = (Expr *) lfirst(lc3);
+int step_id = step->step.step_id;

 /*
  * partkey_datum_from_expr does not need an expression state
  * to evaluate a Const.
  */
 if (!IsA(expr, Const))
-context->exprstates[step->step.step_id * partnatts + keyno] = ExecInitExpr(expr, context->planstate);
+context->exprstates[step_id * partnatts + keyno] =
+ExecInitExpr(expr, context->planstate);

Thanks,
Amit




Re: [HACKERS] Runtime Partition Pruning

2018-04-18 Thread David Rowley
On 19 April 2018 at 12:04, David Rowley  wrote:
> insert into p select x,x from generate_Series(1,1000) x;
> insert into t1 select x from generate_series(1,1000) x;

Correction. These were meant to read:

insert into p select x,x from generate_Series(1,1000) x;
insert into t1 select x from generate_series(1,1000) x;

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-18 Thread David Rowley
On 19 April 2018 at 03:13, Robert Haas  wrote:
> On Mon, Apr 16, 2018 at 10:46 PM, David Rowley
>  wrote:
>> I did go and start working on a patch to test how possible this would
>> be and came up with the attached. I've left a stray
>> MemoryContextStatsDetail call in there which does indicate that
>> something is not being freed. I'm just not sure what it is yet.
>>
>> The patch does happen to improve performance slightly, but that is
>> most likely due to the caching of the ExprStates rather than the
>> change of memory management. It's not really possible to do that with
>> the reset unless we stored the executor's memory context in
>> PartitionPruneContext and did a context switch back inside
>> partkey_datum_from_expr before calling ExecInitExpr.
>
> 10% is more than a "slight" improvement, I'd say!  It's certainly got
> to be worth avoiding the repeated calls to ExecInitExpr, whatever we
> do about the memory contexts.

I've attached a patch which does just this. On benchmarking again with
this single change performance has improved 15% over master.

Also, out of curiosity, I also checked what this performed like before
the run-time pruning patch was committed (5c0675215). Taking the
average of the times below, it seems without this patch the
performance of this case has improved about 356% and about 410% with
this patch. So, I agree, it might be worth considering.

create table p (a int, value int) partition by hash (a);
select 'create table p'||x|| ' partition of p for values with (modulus
10, remainder '||x||');' from generate_series(0,9) x;
\gexec
create table t1 (a int);

insert into p select x,x from generate_Series(1,1000) x;
insert into t1 select x from generate_series(1,1000) x;

create index on p(a);

set enable_hashjoin = 0;
set enable_mergejoin = 0;
explain analyze select count(*) from t1 inner join p on t1.a=p.a;

-- Unpatched
Execution Time: 20413.975 ms
Execution Time: 20232.050 ms
Execution Time: 20229.116 ms

-- Patched
Execution Time: 17758.111 ms
Execution Time: 17645.151 ms
Execution Time: 17492.260 ms

-- 5c0675215e153ba1297fd494b34af2fdebd645d1
Execution Time: 72875.161 ms
Execution Time: 71817.757 ms
Execution Time: 72411.730 ms

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


0001-Initialize-expr-states-once-in-run-time-partition-pr.patch
Description: Binary data


Re: [HACKERS] Runtime Partition Pruning

2018-04-18 Thread Robert Haas
On Mon, Apr 16, 2018 at 10:46 PM, David Rowley
 wrote:
> I did go and start working on a patch to test how possible this would
> be and came up with the attached. I've left a stray
> MemoryContextStatsDetail call in there which does indicate that
> something is not being freed. I'm just not sure what it is yet.
>
> The patch does happen to improve performance slightly, but that is
> most likely due to the caching of the ExprStates rather than the
> change of memory management. It's not really possible to do that with
> the reset unless we stored the executor's memory context in
> PartitionPruneContext and did a context switch back inside
> partkey_datum_from_expr before calling ExecInitExpr.

10% is more than a "slight" improvement, I'd say!  It's certainly got
to be worth avoiding the repeated calls to ExecInitExpr, whatever we
do about the memory contexts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Runtime Partition Pruning

2018-04-16 Thread David Rowley
On 17 April 2018 at 14:33, Alvaro Herrera  wrote:
> David Rowley wrote:
>
>> For a while, during my review of the faster partition pruning patch I
>> was asking Amit to add pfree() calls in various places for this exact
>> reason, but in the end, I gave up and decided it was easier to just
>> create a new memory context to call the planner function from. I've
>> now forgotten the exact reason why I finally decided it was too much
>> trouble.  The pruning code now works using your step logic so perhaps
>> that reason no longer applies, although, on a quick scan of the
>> pruning code now, it seems to require that get_matching_partitions
>> performs a deep pfree of each PruneStepResult. However, there is still
>> partkey_datum_from_expr which performs ExecInitExpr, although perhaps
>> that can just be done once, and the result stashed in the
>> PartitionPruneContext.
>
> I think trying to replace a well-placed MemoryContextReset (or Delete)
> with a bunch of individual pfrees is pointless.

I agree. I think I'd sleep better at night with the context reset in
there rather than hoping we've managed to pfree everything.

I did go and start working on a patch to test how possible this would
be and came up with the attached. I've left a stray
MemoryContextStatsDetail call in there which does indicate that
something is not being freed. I'm just not sure what it is yet.

The patch does happen to improve performance slightly, but that is
most likely due to the caching of the ExprStates rather than the
change of memory management. It's not really possible to do that with
the reset unless we stored the executor's memory context in
PartitionPruneContext and did a context switch back inside
partkey_datum_from_expr before calling ExecInitExpr.

My test case was as follows:

create table p (a int, value int) partition by hash (a);
select 'create table p'||x|| ' partition of p for values with (modulus
10, remainder '||x||');' from generate_series(0,9) x;
\gexec
create table t1 (a int);

insert into p select x,x from generate_Series(1,1000) x;
insert into t1 select x from generate_series(1,1000) x;

create index on p(a);

set enable_hashjoin = 0;
set enable_mergejoin = 0;
explain analyze select count(*) from t1 inner join p on t1.a=p.a;


-- Unpatched
Execution Time: 19725.981 ms
Execution Time: 19533.655 ms
Execution Time: 19542.854 ms

-- Patched
Execution Time: 17389.537 ms
Execution Time: 17603.802 ms
Execution Time: 17618.670 ms

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


recycle_mem_part_prune.patch
Description: Binary data


Re: [HACKERS] Runtime Partition Pruning

2018-04-16 Thread Alvaro Herrera
David Rowley wrote:

> For a while, during my review of the faster partition pruning patch I
> was asking Amit to add pfree() calls in various places for this exact
> reason, but in the end, I gave up and decided it was easier to just
> create a new memory context to call the planner function from. I've
> now forgotten the exact reason why I finally decided it was too much
> trouble.  The pruning code now works using your step logic so perhaps
> that reason no longer applies, although, on a quick scan of the
> pruning code now, it seems to require that get_matching_partitions
> performs a deep pfree of each PruneStepResult. However, there is still
> partkey_datum_from_expr which performs ExecInitExpr, although perhaps
> that can just be done once, and the result stashed in the
> PartitionPruneContext.

I think trying to replace a well-placed MemoryContextReset (or Delete)
with a bunch of individual pfree() calls is pointless.
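The trade-off here is the classic region-allocation one. A toy bump arena
(plain C, deliberately much simpler than PostgreSQL's AllocSet contexts)
shows why one reset is cheaper and safer than tracking every allocation:

```c
#include <stddef.h>

/* Toy bump arena: allocation is a pointer increment, and "freeing"
 * everything is a single offset reset.  Nothing is tracked per-chunk,
 * which is why a forgotten per-allocation free cannot leak. */
typedef struct
{
    char    buf[8192];
    size_t  used;
} Arena;

static void *arena_alloc(Arena *a, size_t size)
{
    void   *p;

    size = (size + 7) & ~(size_t) 7;    /* round up to 8-byte alignment */
    if (a->used + size > sizeof(a->buf))
        return NULL;                    /* toy arena: no overflow chaining */
    p = a->buf + a->used;
    a->used += size;
    return p;
}

static void arena_reset(Arena *a)
{
    a->used = 0;                        /* releases every allocation at once */
}
```

PostgreSQL's memory contexts add chunk headers, freelists, and child
contexts on top of this idea, but MemoryContextReset plays the same role
as arena_reset here.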

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-16 Thread David Rowley
On 14 April 2018 at 05:04, Robert Haas  wrote:
> But I wonder why it's the executor's job to clean up after the
> planner, instead of adjusting the relevant planner functions to avoid
> leaking memory?

It might be possible, but it might also be risky and difficult.

For a while, during my review of the faster partition pruning patch I
was asking Amit to add pfree() calls in various places for this exact
reason, but in the end, I gave up and decided it was easier to just
create a new memory context to call the planner function from. I've
now forgotten the exact reason why I finally decided it was too much
trouble.  The pruning code now works using your step logic so perhaps
that reason no longer applies, although, on a quick scan of the
pruning code now, it seems to require that get_matching_partitions
performs a deep pfree of each PruneStepResult. However, there is still
partkey_datum_from_expr which performs ExecInitExpr, although perhaps
that can just be done once, and the result stashed in the
PartitionPruneContext.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-13 Thread Robert Haas
On Thu, Apr 12, 2018 at 6:01 PM, David Rowley
 wrote:
> On 13 April 2018 at 04:57, Robert Haas  wrote:
>> BTW, looking at ExecSetupPartitionPruneState:
>>
>> /*
>>  * Create a sub memory context which we'll use when making calls to 
>> the
>>  * query planner's function to determine which partitions will
>> match.  The
>>  * planner is not too careful about freeing memory, so we'll ensure 
>> we
>>  * call the function in this context to avoid any memory leaking in 
>> the
>>  * executor's memory context.
>>  */
>>
>> This is a sloppy cut-and-paste job, not only because somebody changed
>> one copy of the word "planner" to "executor" and left the others
>> untouched, but also because the rationale isn't really correct for the
>> executor anyway, which has memory contexts all over the place and
>> frees them all the time.  I don't know whether the context is not
>> needed at all or whether the context is needed but the rationale is
>> different, but I don't buy that explanation.
>
> The comment is written exactly as intended. Unsure which of the
> "planner"s you think should be "executor".
>
> The context is needed. I can easily produce an OOM without it.

Oh, crap.  You know, I totally misread what that comment was trying to
say.  Sorry.

But I wonder why it's the executor's job to clean up after the
planner, instead of adjusting the relevant planner functions to avoid
leaking memory?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Runtime Partition Pruning

2018-04-13 Thread Amit Langote
On 2018/04/13 14:48, Amit Langote wrote:
> On 2018/04/13 14:38, Amit Langote wrote:
>> About the specific relation_open(.., NoLock) under question, I think there
>> might be a way to address this by opening the tables with the appropriate
>> lock mode in partitioned_rels list in ExecLockNonleafAppendTables
> 
> That may have sounded a bit confusing:
> 
> I meant to say: "by opening the tables in partitioned_rels list with the
> appropriate lock mode in ExecLockNonleafAppendTables"
> 
>> Attached a PoC patch.
>
> There were a couple of unnecessary hunks, which removed in the attached.

Sorry, still a couple more were unnecessary.

Thanks,
Amit
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 11139f743d..e3b3911945 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1244,7 +1244,8 @@ adjust_partition_tlist(List *tlist, TupleConversionMap *map)
  * PartitionPruneInfo.
  */
 PartitionPruneState *
-ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
+ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo,
+Relation 
*partitioned_rels)
 {
PartitionPruningData *prunedata;
PartitionPruneState *prunestate;
@@ -1303,10 +1304,11 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
pprune->subpart_map = pinfo->subpart_map;
 
/*
-* Grab some info from the table's relcache; lock was already obtained
-* by ExecLockNonLeafAppendTables.
+* ExecLockNonLeafAppendTables already opened the relation for 
us.
 */
-   rel = relation_open(pinfo->reloid, NoLock);
+   Assert(partitioned_rels[i] != NULL);
+   rel = partitioned_rels[i];
+   Assert(RelationGetRelid(rel) == pinfo->reloid);
 
partkey = RelationGetPartitionKey(rel);
partdesc = RelationGetPartitionDesc(rel);
@@ -1336,8 +1338,6 @@ ExecSetupPartitionPruneState(PlanState *planstate, List 
*partitionpruneinfo)
prunestate->execparams = bms_add_members(prunestate->execparams,

 pinfo->execparams);
 
-   relation_close(rel, NoLock);
-
i++;
}
 
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index b963cae730..58a0961eb1 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -858,22 +858,49 @@ ShutdownExprContext(ExprContext *econtext, bool isCommit)
 /*
  * ExecLockNonLeafAppendTables
  *
- * Locks, if necessary, the tables indicated by the RT indexes contained in
- * the partitioned_rels list.  These are the non-leaf tables in the partition
- * tree controlled by a given Append or MergeAppend node.
+ * Opens and/or locks, if necessary, the tables indicated by the RT indexes
+ * contained in the partitioned_rels list.  These are the non-leaf tables in
+ * the partition tree controlled by a given Append or MergeAppend node.
  */
 void
-ExecLockNonLeafAppendTables(List *partitioned_rels, EState *estate)
+ExecLockNonLeafAppendTables(PlanState *planstate,
+   EState *estate,
+   List *partitioned_rels)
 {
PlannedStmt *stmt = estate->es_plannedstmt;
ListCell   *lc;
+   int i;
 
+   /*
+* If we're going to be performing pruning, allocate space for Relation
+* pointers to be used later when setting up partition pruning state in
+* ExecSetupPartitionPruneState.
+*/
+   if (IsA(planstate, AppendState))
+   {
+   AppendState *appendstate = (AppendState *) planstate;
+   Append *node = (Append *) planstate->plan;
+
+   if (node->part_prune_infos != NIL)
+   {
+   Assert(list_length(node->part_prune_infos) ==
+  list_length(partitioned_rels));
+   appendstate->partitioned_rels = (Relation *)
+   
palloc(sizeof(Relation) *
+  
list_length(node->part_prune_infos));
+   appendstate->num_partitioned_rels =
+  
list_length(node->part_prune_infos);
+   }
+   }
+
+   i = 0;
foreach(lc, partitioned_rels)
{
ListCell   *l;
Index   rti = lfirst_int(lc);
boolis_result_rel = false;
Oid relid = getrelid(rti, 

Re: [HACKERS] Runtime Partition Pruning

2018-04-12 Thread Amit Langote
On 2018/04/13 14:38, Amit Langote wrote:
> About the specific relation_open(.., NoLock) under question, I think there
> might be a way to address this by opening the tables with the appropriate
> lock mode in partitioned_rels list in ExecLockNonleafAppendTables

That may have sounded a bit confusing:

I meant to say: "by opening the tables in partitioned_rels list with the
appropriate lock mode in ExecLockNonleafAppendTables"

> Attached a PoC patch.
There were a couple of unnecessary hunks, which I removed in the attached.

Thanks,
Amit

Re: [HACKERS] Runtime Partition Pruning

2018-04-12 Thread Amit Langote
On 2018/04/13 1:57, Robert Haas wrote:
>> It might be possible to do something better in each module by keeping
>> an array indexed by RTI which have each entry NULL initially then on
>> first relation_open set the element in the array to that pointer.
> 
> I'm not sure that makes a lot of sense in the planner, but in the
> executor it might be a good idea.   See also
> https://www.postgresql.org/message-id/CA%2BTgmoYKToP4-adCFFRNrO21OGuH%3Dphx-fiB1dYoqksNYX6YHQ%40mail.gmail.com
> for related discussion.  I think that a coding pattern where we rely
> on relation_open(..., NoLock) is inherently dangerous -- it's too easy
> to be wrong about whether the lock is sure to have been taken.  It
> would be much better to open the relation once and hold onto the
> pointer, not just for performance reasons, but for robustness.

About the specific relation_open(.., NoLock) under question, I think there
might be a way to address this by opening the tables with the appropriate
lock mode in partitioned_rels list in ExecLockNonleafAppendTables and
keeping the Relation pointers around until ExecEndNode.  Then instead of
ExecSetupPartitionPruneState doing relation_open/close(.., NoLock), it
just reuses the one that's passed by the caller.

Attached a PoC patch.  David, thoughts?

Thanks,
Amit

Re: [HACKERS] Runtime Partition Pruning

2018-04-12 Thread David Rowley
On 13 April 2018 at 04:57, Robert Haas  wrote:
> BTW, looking at ExecSetupPartitionPruneState:
>
> /*
>  * Create a sub memory context which we'll use when making calls to the
>  * query planner's function to determine which partitions will match.  The
>  * planner is not too careful about freeing memory, so we'll ensure we
>  * call the function in this context to avoid any memory leaking in the
>  * executor's memory context.
>  */
>
> This is a sloppy cut-and-paste job, not only because somebody changed
> one copy of the word "planner" to "executor" and left the others
> untouched, but also because the rationale isn't really correct for the
> executor anyway, which has memory contexts all over the place and
> frees them all the time.  I don't know whether the context is not
> needed at all or whether the context is needed but the rationale is
> different, but I don't buy that explanation.

The comment is written exactly as intended. Unsure which of the
"planner"s you think should be "executor".

The context is needed. I can easily produce an OOM without it.
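The OOM David mentions comes from re-executing the pruning steps on every rescan inside a long-lived context. A toy accounting model in plain C (Region, leaky_prune, and rescan_many are invented stand-ins for illustration, not PostgreSQL's MemoryContext API) shows why a resettable child context keeps the peak flat:

```c
#include <assert.h>
#include <stddef.h>

/* Toy "memory context": just tracks the bytes handed out. */
typedef struct Region { size_t used; } Region;

static void region_alloc(Region *r, size_t n) { r->used += n; }
static void region_reset(Region *r) { r->used = 0; }

/* Stand-in for a planner pruning function that allocates scratch data
 * and never frees it, as described in the thread. */
static void leaky_prune(Region *cxt) { region_alloc(cxt, 1024); }

/* Re-prune on every rescan; returns peak bytes live in the context.
 * Without a per-call reset, the peak grows linearly with rescans. */
static size_t rescan_many(Region *cxt, int nrescans, int reset_each_time)
{
    size_t peak = 0;
    for (int i = 0; i < nrescans; i++)
    {
        leaky_prune(cxt);
        if (cxt->used > peak)
            peak = cxt->used;
        if (reset_each_time)
            region_reset(cxt);
    }
    return peak;
}
```

A nested-loop rescanning the Append a thousand times illustrates the difference: unreset, the context grows by every call's scratch data; reset after each pruning pass, it never holds more than one pass's worth.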



-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-12 Thread Robert Haas
On Thu, Apr 12, 2018 at 12:40 AM, David Rowley
 wrote:
> I guess the problem there would be there's nothing to say that parse
> analysis will shortly be followed by a call to the planner, and a call
> to the planner does not mean the plan is about to be executed. So I
> don't think it would be possible to keep pointers to relcache entries
> between these modules, and it would be hard to determine whose
> responsibility it would be to call relation_close().

Yeah, that's definitely a non-starter.

> It might be possible to do something better in each module by keeping
> an array indexed by RTI which have each entry NULL initially then on
> first relation_open set the element in the array to that pointer.

I'm not sure that makes a lot of sense in the planner, but in the
executor it might be a good idea.   See also
https://www.postgresql.org/message-id/CA%2BTgmoYKToP4-adCFFRNrO21OGuH%3Dphx-fiB1dYoqksNYX6YHQ%40mail.gmail.com
for related discussion.  I think that a coding pattern where we rely
on relation_open(..., NoLock) is inherently dangerous -- it's too easy
to be wrong about whether the lock is sure to have been taken.  It
would be much better to open the relation once and hold onto the
pointer, not just for performance reasons, but for robustness.

BTW, looking at ExecSetupPartitionPruneState:

/*
 * Create a sub memory context which we'll use when making calls to the
 * query planner's function to determine which partitions will match.  The
 * planner is not too careful about freeing memory, so we'll ensure we
 * call the function in this context to avoid any memory leaking in the
 * executor's memory context.
 */

This is a sloppy cut-and-paste job, not only because somebody changed
one copy of the word "planner" to "executor" and left the others
untouched, but also because the rationale isn't really correct for the
executor anyway, which has memory contexts all over the place and
frees them all the time.  I don't know whether the context is not
needed at all or whether the context is needed but the rationale is
different, but I don't buy that explanation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Runtime Partition Pruning

2018-04-11 Thread David Rowley
On 11 April 2018 at 09:32, Alvaro Herrera  wrote:
> Robert Haas wrote:
>> On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera  
>> wrote:
>
>> >> I don't get this.  The executor surely had to (and did) open all of
>> >> the relations somewhere even before this patch.
>
>> > I was worried that this coding could be seen as breaking modularity, or
>> > trying to do excessive work.  However, after looking closer at it, it
>> > doesn't really look like it's the case.  So, nevermind.
>>
>> Well what I'm saying is that it shouldn't be necessary.  If the
>> relations are being opened already and the pointers to the relcache
>> entries are being saved someplace, you shouldn't need to re-open them
>> elsewhere to get pointers to the relcache entries.
>
> I looked a bit more into this.  It turns out that we have indeed opened
> the relation before -- first in parserOpenTable (for addRangeTableEntry),
> then in expandRTE, then in QueryRewrite, then in subquery_planner, then
> in get_relation_info.
>
> So, frankly, since each module thinks it's okay to open it every once in
> a while, I'm not sure we should be terribly stressed about doing it once
> more for partition pruning.  Particularly since communicating the
> pointer seems to be quite troublesome.

I guess the problem there would be there's nothing to say that parse
analysis will shortly be followed by a call to the planner, and a call
to the planner does not mean the plan is about to be executed. So I
don't think it would be possible to keep pointers to relcache entries
between these modules, and it would be hard to determine whose
responsibility it would be to call relation_close().

It might be possible to do something better in each module by keeping an
array indexed by RTI, with each entry NULL initially; then, on the first
relation_open, set the element in the array to that pointer.

This might mean we'd save a few relation_open calls, but I don't know
if there would be a way to somehow remove the Relation from the array
on relation_close.  Having something like this might mean we could
detect lock upgrade hazards more easily, but the whole thing is a
cache on top of a cache which does seem a bit weird.  relation_open()
should be pretty cheap if the relation is already open. It's just a
hash table lookup. What is described above just changes that to an
array lookup.  It also does nothing for index_open.

However, something like the above would simplify
ExecLockNonLeafAppendTables() a bit and get rid of the O(N^2) which
checks the partition is not a result relation.
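The array-indexed-by-RTI idea can be sketched as a toy in plain C (all names below are invented for illustration; PostgreSQL's real relcache, locking, and hash table are far more involved). The point is that only the first open per RTI pays the hash-table lookup; later opens are an array access:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RTI 16

typedef struct Relation { int relid; } Relation;

static Relation backing[MAX_RTI];   /* stands in for the relcache hash */
static int      hash_lookups;       /* counts simulated hash probes */

/* Stand-in for the hash-table-backed relation_open path. */
static Relation *relcache_hash_open(int rti)
{
    hash_lookups++;                 /* the cost we'd like not to repeat */
    backing[rti].relid = rti;
    return &backing[rti];
}

/* The proposed pattern: consult the NULL-initialized RTI-indexed array
 * first; fill it on the first open for a given RTI. */
static Relation *cached_open(Relation **cache, int rti)
{
    if (cache[rti] == NULL)         /* first open for this RTI */
        cache[rti] = relcache_hash_open(rti);
    return cache[rti];              /* later opens are an array lookup */
}
```

As David notes, this is a cache on top of a cache, and it says nothing about how relation_close would evict entries or how index_open would benefit.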

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-11 Thread Amit Langote
On 2018/04/11 6:32, Alvaro Herrera wrote:
> Robert Haas wrote:
>> On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera  
>> wrote:
> 
 I don't get this.  The executor surely had to (and did) open all of
 the relations somewhere even before this patch.
> 
>>> I was worried that this coding could be seen as breaking modularity, or
>>> trying to do excessive work.  However, after looking closer at it, it
>>> doesn't really look like it's the case.  So, nevermind.
>>
>> Well what I'm saying is that it shouldn't be necessary.  If the
>> relations are being opened already and the pointers to the relcache
>> entries are being saved someplace, you shouldn't need to re-open them
>> elsewhere to get pointers to the relcache entries.
> 
> I looked a bit more into this.  It turns out that we have indeed opened
> the relation before -- first in parserOpenTable (for addRangeTableEntry),
> then in expandRTE, then in QueryRewrite, then in subquery_planner, then
> in get_relation_info.
> 
> So, frankly, since each module thinks it's okay to open it every once in
> a while, I'm not sure we should be terribly stressed about doing it once
> more for partition pruning.  Particularly since communicating the
> pointer seems to be quite troublesome.

Maybe Robert meant somewhere in the executor itself, before ExecInitAppend
or, more precisely, before ExecSetupPartitionPruneState is called.  But that
doesn't seem to be the case.

For the result relation case (insert/update/delete on a partitioned
table), we don't need to do extra relation_opens, as InitPlan opens the
needed relations when building the ResultRelInfo(s), from where they're
later accessed, for example, in ExecInitModifyTable.

However, in the Append/MergeAppend cases, we don't open the partitioned
tables at any earlier point in the executor.  InitPlan doesn't touch them.
In ExecInitAppend, the ExecLockNonLeafAppendTables call that precedes
ExecSetupPartitionPruneState does not open them; it just locks them using
LockRelationOid.

So, relation_open on partitioned tables in ExecSetupPartitionPruneState
seems to be the first time they're opened *in the executor*.
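The split Amit describes — LockRelationOid early in ExecLockNonLeafAppendTables, then relation_open(..., NoLock) later — is exactly the pattern Robert calls inherently dangerous upthread. A toy model of the hazard (invented names, not PostgreSQL code): the NoLock-style open is only safe if the earlier lock really happened, and nothing in the code path itself enforces that.

```c
#include <assert.h>
#include <stdbool.h>

#define NORELS 8

/* Which "relations" currently hold a lock. */
static bool locked[NORELS];

/* Stand-in for LockRelationOid: takes the lock up front. */
static void lock_relation(int relid) { locked[relid] = true; }

/* Stand-in for relation_open(..., NoLock): it does not take a lock,
 * it merely *trusts* that some earlier code already did.  Returns
 * whether that trust is justified for this relid. */
static bool open_no_lock_is_safe(int relid) { return locked[relid]; }
```

The failure mode is silent: if the locking step is skipped for some code path, the NoLock open still "works", which is why holding onto the pointer from a single open-with-lock is the more robust design.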

Thanks,
Amit




Re: [HACKERS] Runtime Partition Pruning

2018-04-10 Thread Alvaro Herrera
Robert Haas wrote:
> On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera  
> wrote:

> >> I don't get this.  The executor surely had to (and did) open all of
> >> the relations somewhere even before this patch.

> > I was worried that this coding could be seen as breaking modularity, or
> > trying to do excessive work.  However, after looking closer at it, it
> > doesn't really look like it's the case.  So, nevermind.
> 
> Well what I'm saying is that it shouldn't be necessary.  If the
> relations are being opened already and the pointers to the relcache
> entries are being saved someplace, you shouldn't need to re-open them
> elsewhere to get pointers to the relcache entries.

I looked a bit more into this.  It turns out that we have indeed opened
the relation before -- first in parserOpenTable (for addRangeTableEntry),
then in expandRTE, then in QueryRewrite, then in subquery_planner, then
in get_relation_info.

So, frankly, since each module thinks it's okay to open it every once in
a while, I'm not sure we should be terribly stressed about doing it once
more for partition pruning.  Particularly since communicating the
pointer seems to be quite troublesome.

To figure this out, I used the attached patch (not intended for application)
to add a backtrace to each log message, plus a couple of accusatory
elog() calls in relation_open and ExecSetupPartitionPruneState.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 5d682c8cb31f79c79c2ba4cafeb876f10b072cfc Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 10 Apr 2018 18:30:29 -0300
Subject: [PATCH] errbacktrace

---
 src/backend/utils/error/elog.c  | 59 +
 src/include/postgres_ext.h  |  1 +
 src/include/utils/elog.h|  3 ++
 src/interfaces/libpq/fe-protocol3.c |  3 ++
 4 files changed, 66 insertions(+)

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 16531f7a0f..e531d0009e 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -62,6 +62,7 @@
 #ifdef HAVE_SYSLOG
 #include <syslog.h>
 #endif
+#include <execinfo.h>
 
 #include "access/transam.h"
 #include "access/xact.h"
@@ -493,6 +494,8 @@ errfinish(int dummy,...)
pfree(edata->hint);
if (edata->context)
pfree(edata->context);
+   if (edata->backtrace)
+   pfree(edata->backtrace);
if (edata->schema_name)
pfree(edata->schema_name);
if (edata->table_name)
@@ -811,6 +814,41 @@ errmsg(const char *fmt,...)
	return 0;					/* return value does not matter */
 }
 
+#define BACKTRACE_FRAMES 100
+int
+errbacktrace(void)
+{
+	ErrorData  *edata = &errordata[errordata_stack_depth];
+   MemoryContext oldcontext;
+   void   *buf[BACKTRACE_FRAMES];
+   int nframes;
+   char  **strfrms;
+   StringInfoData errtrace;
+   int i;
+
+   recursion_depth++;
+   CHECK_STACK_DEPTH();
+   oldcontext = MemoryContextSwitchTo(edata->assoc_context);
+
+   nframes = backtrace(buf, BACKTRACE_FRAMES);
+   strfrms = backtrace_symbols(buf, nframes);
+   if (strfrms == NULL)
+   return 0;
+
+	initStringInfo(&errtrace);
+
+   /* the first frame is always errbacktrace itself, so skip it */
+   for (i = 1; i < nframes; i++)
+		appendStringInfo(&errtrace, "%s\n", strfrms[i]);
+   free(strfrms);
+
+   edata->backtrace = errtrace.data;
+
+   MemoryContextSwitchTo(oldcontext);
+   recursion_depth--;
+
+   return 0;
+}
 
 /*
  * errmsg_internal --- add a primary error message text to the current error
@@ -1522,6 +1560,8 @@ CopyErrorData(void)
newedata->hint = pstrdup(newedata->hint);
if (newedata->context)
newedata->context = pstrdup(newedata->context);
+   if (newedata->backtrace)
+   newedata->backtrace = pstrdup(newedata->backtrace);
if (newedata->schema_name)
newedata->schema_name = pstrdup(newedata->schema_name);
if (newedata->table_name)
@@ -1560,6 +1600,8 @@ FreeErrorData(ErrorData *edata)
pfree(edata->hint);
if (edata->context)
pfree(edata->context);
+   if (edata->backtrace)
+   pfree(edata->backtrace);
if (edata->schema_name)
pfree(edata->schema_name);
if (edata->table_name)
@@ -1635,6 +1677,8 @@ ThrowErrorData(ErrorData *edata)
newedata->hint = pstrdup(edata->hint);
if (edata->context)
newedata->context = pstrdup(edata->context);
+   if (edata->backtrace)
+		newedata->backtrace = pstrdup(edata->backtrace);
/* assume message_id is not available */
if (edata->schema_name)

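The errbacktrace() in the patch above builds on the glibc/libexecinfo backtrace facility. A minimal standalone sketch of just the capture step, without any of PostgreSQL's ErrorData or StringInfo plumbing (requires a platform providing <execinfo.h>, e.g. glibc; the function name here is invented):

```c
#include <assert.h>
#include <execinfo.h>
#include <stdlib.h>

#define FRAMES 100

/* Capture the current call stack.  Returns the number of frames and, on
 * success, stores a malloc'd array of symbol strings in *symbols_out;
 * the caller frees it with a single free(), per backtrace_symbols(3). */
static int capture_stack(char ***symbols_out)
{
    void   *buf[FRAMES];
    int     nframes = backtrace(buf, FRAMES);

    *symbols_out = backtrace_symbols(buf, nframes);   /* NULL on OOM */
    return nframes;
}
```

Note that without -rdynamic (or equivalent) at link time, backtrace_symbols typically reports raw addresses rather than function names, which is why the patch's output quality depends on build flags.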
Re: [HACKERS] Runtime Partition Pruning

2018-04-09 Thread Alvaro Herrera
Robert Haas wrote:
> On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera  
> wrote:
> > Robert Haas wrote:

> >> I don't get this.  The executor surely had to (and did) open all of
> >> the relations somewhere even before this patch.
> >
> > I was worried that this coding could be seen as breaking modularity, or
> > trying to do excessive work.  However, after looking closer at it, it
> > doesn't really look like it's the case.  So, nevermind.
> 
> Well what I'm saying is that it shouldn't be necessary.  If the
> relations are being opened already and the pointers to the relcache
> entries are being saved someplace, you shouldn't need to re-open them
> elsewhere to get pointers to the relcache entries.

Oh, okay.  I can look into that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-09 Thread Robert Haas
On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera  wrote:
> Robert Haas wrote:
>> On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera  
>> wrote:
>> > I had reservations about a relation_open() in the new executor code.  It
>> > seemed a bit odd; we don't have any other relation_open in the executor
>> > anywhere.  However, setting up the pruneinfo needs some stuff from
>> > relcache that I don't see a reasonable mechanism to pass through
>> > planner.  I asked Andres about it on IM and while he didn't endorse the
>> > patch in any way, his quick opinion was that "it wasn't entirely
>> > insane".  I verified that we already hold lock on the relation.
>>
>> I don't get this.  The executor surely had to (and did) open all of
>> the relations somewhere even before this patch.
>
> Yeah.
>
> I was worried that this coding could be seen as breaking modularity, or
> trying to do excessive work.  However, after looking closer at it, it
> doesn't really look like it's the case.  So, nevermind.

Well what I'm saying is that it shouldn't be necessary.  If the
relations are being opened already and the pointers to the relcache
entries are being saved someplace, you shouldn't need to re-open them
elsewhere to get pointers to the relcache entries.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Runtime Partition Pruning

2018-04-09 Thread Alvaro Herrera
Robert Haas wrote:
> On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera  
> wrote:
> > I had reservations about a relation_open() in the new executor code.  It
> > seemed a bit odd; we don't have any other relation_open in the executor
> > anywhere.  However, setting up the pruneinfo needs some stuff from
> > relcache that I don't see a reasonable mechanism to pass through
> > planner.  I asked Andres about it on IM and while he didn't endorse the
> > patch in any way, his quick opinion was that "it wasn't entirely
> > insane".  I verified that we already hold lock on the relation.
> 
> I don't get this.  The executor surely had to (and did) open all of
> the relations somewhere even before this patch.

Yeah.

I was worried that this coding could be seen as breaking modularity, or
trying to do excessive work.  However, after looking closer at it, it
doesn't really look like it's the case.  So, nevermind.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-09 Thread Robert Haas
On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera  wrote:
> I had reservations about a relation_open() in the new executor code.  It
> seemed a bit odd; we don't have any other relation_open in the executor
> anywhere.  However, setting up the pruneinfo needs some stuff from
> relcache that I don't see a reasonable mechanism to pass through
> planner.  I asked Andres about it on IM and while he didn't endorse the
> patch in any way, his quick opinion was that "it wasn't entirely
> insane".  I verified that we already hold lock on the relation.

I don't get this.  The executor surely had to (and did) open all of
the relations somewhere even before this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Runtime Partition Pruning

2018-04-07 Thread David Rowley
On 8 April 2018 at 09:13, Alvaro Herrera  wrote:
> I pushed this patch -- 0001, 0002 and 0003 only.  I did not include
> anything from 0004 and 0005; I didn't even get to the point of reading
> them, so that I could focus on the first part.

Oh great! Thank you for working on this and pushing it, especially so
during your weekend.

> While we didn't get fast pruning support for MergeAppend or the
> DELETE/UPDATE parts, I think those are valuable and recommend to
> resubmit those for PG12.

Thanks. I'll certainly be doing that.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-07 Thread Alvaro Herrera
I pushed this patch -- 0001, 0002 and 0003 only.  I did not include
anything from 0004 and 0005; I didn't even get to the point of reading
them, so that I could focus on the first part.

I did not find anything to complain about.  I made a few adjustments and
asked David to supply a paragraph for perform.sgml (the "Using EXPLAIN"
section) which is included here.  I also adopted Jesper's (actually
David's) suggestion of changing "Partitions Pruned" to "Partitions
Removed" in the EXPLAIN output.

I had reservations about a relation_open() in the new executor code.  It
seemed a bit odd; we don't have any other relation_open in the executor
anywhere.  However, setting up the pruneinfo needs some stuff from
relcache that I don't see a reasonable mechanism to pass through
planner.  I asked Andres about it on IM and while he didn't endorse the
patch in any way, his quick opinion was that "it wasn't entirely
insane".  I verified that we already hold lock on the relation.


While we didn't get fast pruning support for MergeAppend or the
DELETE/UPDATE parts, I think those are valuable and recommend resubmitting
them for PG12.

Thank you!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-07 Thread Jesper Pedersen

Hi,

On 04/07/2018 04:45 AM, David Rowley wrote:

Ok, so I've gone and done this.

PartitionPruning has become PartitionPruneState
PartitionRelPruning has become PartitionPruningData

I've changed pointers to PartitionPruneStates to be named prunestate,
sometimes having the node prefix; as_, ma_, in these cases prune and
state are separated with a _ which seems to be the general rule for
executor state struct members.

Generally, pointers to PartitionPruningData are now named pprune.
Hopefully, that's ok, as this was the name previously used for
PartitionPruning pointers.

I applied the patch to get rid of as_noop_scan in favour of using a
special as_whichplan value. There was already one special value
(INVALID_SUBPLAN_INDEX), so it seemed better to build on that rather than
inventing something new. This also means we don't have to make the
AppendState struct any wider, which seems like a good thing.

I made all the fixups which I mentioned in my review earlier and also
re-removed the resultRelation parameter from make_partition_pruneinfo.
It sneaked back into v22.

v23 is attached.



Passes check-world.

Changing the explain.c output to "Subplans Removed", as you suggested in
[1], is a good idea.


[1] 
https://www.postgresql.org/message-id/CAKJS1f99JnkbOshdV_4zoJZ96DPtKeHMHv43JRL_ZdHRkkVKCA%40mail.gmail.com


Best regards,
 Jesper



Re: [HACKERS] Runtime Partition Pruning

2018-04-07 Thread Alvaro Herrera
David Rowley wrote:
> On 8 April 2018 at 01:18, Alvaro Herrera  wrote:
> > Amit had it as "indexes" also in his original.  I wanted to avoid using
> > the "indexes" word alone, whose meaning is so overloaded.
> 
> hmm, good point.
> 
> > How about this?
> > "... which are then executed to produce a set of partitions (as indexes
> > of resultRelInfo->part_rels array) that satisfy the constraints in the
> > step".
> 
> Works for me, but with RelOptInfo rather than resultRelInfo.

Oops, sorry about that.  Pushed now, adding one line to
get_matching_partitions also.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-07 Thread David Rowley
On 8 April 2018 at 01:18, Alvaro Herrera  wrote:
> Amit had it as "indexes" also in his original.  I wanted to avoid using
> the "indexes" word alone, whose meaning is so overloaded.

hmm, good point.

> How about this?
> "... which are then executed to produce a set of partitions (as indexes
> of resultRelInfo->part_rels array) that satisfy the constraints in the
> step".

Works for me, but with RelOptInfo rather than resultRelInfo.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-07 Thread Alvaro Herrera
David Rowley wrote:

> It's not exactly wrong but:
> 
> + * are turned into a set of "pruning steps", which are then executed to
> + * produce a set of RTIs of partitions whose bounds satisfy the constraints in
> + * the step.  Partitions not in the set are said to have been pruned.
> 
> It's only prune_append_rel_partitions which is only used for the
> planner's pruning needs that converts the partition indexes to RTIs.
> Would it be better to mention that the output is partition indexes?
> Maybe:
> 
> "which are then executed to produce a set of partition indexes whose
> bounds satisfy the constraints in the step. These partition indexes
> may then be translated into RTIs", or maybe even not mention the RTIs.

Amit had it as "indexes" also in his original.  I wanted to avoid using
the "indexes" word alone, whose meaning is so overloaded.  How about
this?

"... which are then executed to produce a set of partitions (as indexes
of resultRelInfo->part_rels array) that satisfy the constraints in the
step".

Maybe "the boundinfo array" instead of part_rels, which as I understand
also uses the same indexing as the other array, and partprune mostly
works based on boundinfo anyway?

Not mentioning RTIs seems fine.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-07 Thread David Rowley
On 8 April 2018 at 00:23, Alvaro Herrera  wrote:
> I edited it as attached, to 1. avoid mentioning functionality that
> doesn't yet exist, and 2. avoid excessive internal detail (we want a
> high-level overview here), which from experience gets outdated pretty
> quickly.

It's not exactly wrong but:

+ * are turned into a set of "pruning steps", which are then executed to
+ * produce a set of RTIs of partitions whose bounds satisfy the constraints in
+ * the step.  Partitions not in the set are said to have been pruned.

It's only prune_append_rel_partitions which is only used for the
planner's pruning needs that converts the partition indexes to RTIs.
Would it be better to mention that the output is partition indexes?
Maybe:

"which are then executed to produce a set of partition indexes whose
bounds satisfy the constraints in the step. These partition indexes
may then be translated into RTIs", or maybe even not mention the RTIs.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-07 Thread Alvaro Herrera
Amit Langote wrote:

> See if the attached makes it any better.
> 
> Now I know we don't have the runtime pruning in yet, but since the
> proposed patch would extend its functionality I have included its
> description in the comment.

Thanks!

I edited it as attached, to 1. avoid mentioning functionality that
doesn't yet exist, and 2. avoid excessive internal detail (we want a
high-level overview here), which from experience gets outdated pretty
quickly.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
commit 28768290d8d9c58e76e2594a14d5b1f8465ba262
Author: Alvaro Herrera 
AuthorDate: Sat Apr 7 08:44:12 2018 -0300
CommitDate: Sat Apr 7 09:20:58 2018 -0300

Add some documentation to partprune.c

Author: Amit Langote
Discussion: https://postgr.es/m/CA+HiwqGzq4D6z=8r0ap+xhbtfcq-4ct+t2ekqje9fpm84_j...@mail.gmail.com

diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 959ee1643d..07b8057e3f 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -1,10 +1,27 @@
 /*-
  *
  * partprune.c
- * Parses clauses attempting to match them up to partition keys of a
- * given relation and generates a set of "pruning steps", which can be
- * later "executed" either from the planner or the executor to determine
- * the minimum set of partitions which match the given clauses.
+ * Support for partition pruning during query planning
+ *
+ * This module implements partition pruning using the information contained in
+ * table's partition descriptor and query clauses.
+ *
+ * During planning, clauses that can be matched to the table's partition key
+ * are turned into a set of "pruning steps", which are then executed to
+ * produce a set of RTIs of partitions whose bounds satisfy the constraints in
+ * the step.  Partitions not in the set are said to have been pruned.
+ *
+ * There are two kinds of pruning steps: a "base" pruning step, which contains
+ * information extracted from one or more clauses that are matched to the
+ * (possibly multi-column) partition key, such as the expressions whose values
+ * to match against partition bounds and operator strategy to associate to
+ * each expression.  The other kind is a "combine" pruning step, which combines
+ * the outputs of some other steps using the appropriate combination method.
+ * All steps that are constructed are executed in succession such that for any
+ * "combine" step, all of the steps whose output it depends on are executed
 + * first and their output preserved.
+ *
+ * See gen_partprune_steps_internal() for more details on step generation.
  *
  * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California


Re: [HACKERS] Runtime Partition Pruning

2018-04-06 Thread Amit Langote
On Sat, Apr 7, 2018 at 1:58 PM, Amit Langote  wrote:
> On Sat, Apr 7, 2018 at 1:31 PM, Andres Freund  wrote:
>> I've not followed this thread/feature at all, but I don't find the
>> comments atop partprune.c even remotely sufficient. Unless there's a
>> README or some such hidden somewhere?
>
> Sorry there isn't a README and I agree partprune.c's header comment
> could be improved quite a bit.
>
> Just to be clear, that's the fault of the patch that was already
> committed earlier today (9fdb675fc "Faster partition pruning"), not
> this patch, which just extends partition.c's functionality to
> implement additional planner and executor support for runtime pruning.
>
> I'm drafting a patch that expands the partprune.c comment and will post 
> shortly.

See if the attached makes it any better.

Now I know we don't have the runtime pruning in yet, but since the
proposed patch would extend its functionality I have included its
description in the comment.

Thanks,
Amit


expand-partprune-header-comment.patch
Description: Binary data


Re: [HACKERS] Runtime Partition Pruning

2018-04-06 Thread Andres Freund
On 2018-04-07 16:58:01 +1200, David Rowley wrote:
> On 7 April 2018 at 16:31, Andres Freund  wrote:
> > I've not followed this thread/feature at all, but I don't find the
> > comments atop partprune.c even remotely sufficient. Unless there's a
> > README or some such hidden somewhere?
> 
> There's not a README file. The comments for partprune.c, do you want
> this to explain more about how partition pruning works or how this
> patch uses the existing code?

Primarily the first. This isn't trivial, straightforward code.


> Probably if we need to explain more there about how pruning works then
> it should be a fixup patch to 9fdb675fc, no?

Yea, it's about that. Sorry for accidentally jumping on the wrong
thread.

Greetings,

Andres Freund



Re: [HACKERS] Runtime Partition Pruning

2018-04-06 Thread Amit Langote
On Sat, Apr 7, 2018 at 1:58 PM, David Rowley
 wrote:
> Probably if we need to explain more there about how pruning works then
> it should be a fixup patch to 9fdb675fc, no?

Yes, I just replied and working on a patch.

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

2018-04-06 Thread Amit Langote
On Sat, Apr 7, 2018 at 1:31 PM, Andres Freund  wrote:
> On 2018-04-07 13:26:51 +0900, Amit Langote wrote:
>> On Sat, Apr 7, 2018 at 11:26 AM, David Rowley
>>  wrote:
>> > Everything else looks fine from my point of view.
>>
>> Me too, although I still think having struct names PartitionPruning
>> and PartitionRelPruning is going to be a bit confusing.  We should
>> think about naming the latter to something else.  I suggested
>> PartitionPruningDispatch(Data), but David doesn't seem to like it.
>> Maybe, PartitionPruneState, because it parallels the
>> PartitionPruneInfo that comes from the planner for every partitioned
>> table in the tree.
>
> I've not followed this thread/feature at all, but I don't find the
> comments atop partprune.c even remotely sufficient. Unless there's a
> README or some such hidden somewhere?

Sorry there isn't a README and I agree partprune.c's header comment
could be improved quite a bit.

Just to be clear, that's the fault of the patch that was already
committed earlier today (9fdb675fc "Faster partition pruning"), not
this patch, which just extends partition.c's functionality to
implement additional planner and executor support for runtime pruning.

I'm drafting a patch that expands the partprune.c comment and will post shortly.

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

2018-04-06 Thread David Rowley
On 7 April 2018 at 16:31, Andres Freund  wrote:
> I've not followed this thread/feature at all, but I don't find the
> comments atop partprune.c even remotely sufficient. Unless there's a
> README or some such hidden somewhere?

There's not a README file. The comments for partprune.c, do you want
this to explain more about how partition pruning works or how this
patch uses the existing code?

Probably if we need to explain more there about how pruning works then
it should be a fixup patch to 9fdb675fc, no?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-06 Thread David Rowley
On 7 April 2018 at 16:26, Amit Langote  wrote:
> Maybe, PartitionPruneState, because it parallels the
> PartitionPruneInfo that comes from the planner for every partitioned
> table in the tree.

I like that.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-06 Thread Andres Freund
On 2018-04-07 13:26:51 +0900, Amit Langote wrote:
> On Sat, Apr 7, 2018 at 11:26 AM, David Rowley
>  wrote:
> > Everything else looks fine from my point of view.
> 
> Me too, although I still think having struct names PartitionPruning
> and PartitionRelPruning is going to be  a bit confusing.  We should
> think about naming the latter to something else.  I suggested
> PartitionPruningDispatch(Data), but David doesn't seem to like it.
> Maybe, PartitionPruneState, because it parallels the
> PartitionPruneInfo that comes from the planner for every partitioned
> table in the tree.

I've not followed this thread/feature at all, but I don't find the
comments atop partprune.c even remotely sufficient. Unless there's a
README or some such hidden somewhere?

Greetings,

Andres Freund



Re: [HACKERS] Runtime Partition Pruning

2018-04-06 Thread Amit Langote
On Sat, Apr 7, 2018 at 11:26 AM, David Rowley
 wrote:
> Everything else looks fine from my point of view.

Me too, although I still think having struct names PartitionPruning
and PartitionRelPruning is going to be a bit confusing.  We should
think about naming the latter to something else.  I suggested
PartitionPruningDispatch(Data), but David doesn't seem to like it.
Maybe, PartitionPruneState, because it parallels the
PartitionPruneInfo that comes from the planner for every partitioned
table in the tree.

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

2018-04-06 Thread David Rowley
On 7 April 2018 at 12:03, David Rowley  wrote:
> Continuing to read 0003 and 0004 now.


0003:

1. "setup" -> "set"

/* If run-time partition pruning is enabled, then setup that up now */

2. We should be able to get rid of as_noopscan and just have another
special negative value for as_whichplan.

I've attached a patch to do this.

3. I've forgotten to drop table boolvalues in the tests.

Patched attached to fix.

0004:

1. "ms_valid_subplans" -> "valid_subplans" in:

 * ms_valid_subplans for runtime pruning, valid mergeplans indexes to
 * scan.

All the other fields are not being prefixed with ms_ in these comments.

Everything else looks fine from my point of view.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


nodeAppend_get_rid_of_as_noopscan.patch
Description: Binary data


drop_table_boolvalues.patch
Description: Binary data


Re: [HACKERS] Runtime Partition Pruning

2018-04-06 Thread David Rowley
On 7 April 2018 at 10:45, David Rowley  wrote:
> I'm looking over the rebased patches now.

I've made a complete read of 0001 and 0002 so far.

Your rebase looks fine.

After the complete read, I only have the following comments:

0001:

1. missing "the" before "partition key":

* Extract Params matching partition key and record if we got any.

2. Is this the property name we're going to stick with:

ExplainPropertyInteger("Subplans Pruned", NULL, nplans - nsubnodes, es);

Other ideas are: "Subplans Removed"

3. In the following comment I've used the word "hierarchy", but maybe
we need to add the word "flattened" before it.

 * PartitionPruning - Encapsulates a hierarchy of PartitionRelPruning

4. Comment mentions "after init plan", but really we can only know the
value of an exec param during actual execution. So:

* Parameters that are safe to be used for partition pruning. execparams
* are not safe to use until after init plan.

maybe better as:

* Parameters that are safe to be used for partition pruning. execparams
* are not safe to use until the executor is running.

0002:

Looks fine. But if I was committing this, to give me confidence, I'd
want to know how the left_most_one table was generated.

I used:

#include <stdio.h>

int main(void)
{
	int i = 1;

	printf("0, ");
	while (i < 256)
	{
		printf("%d, ", 31 - __builtin_clz(i));
		if ((i & 0xf) == 0xf)
			putchar('\n');
		i++;
	}
	return 0;
}

Continuing to read 0003 and 0004 now.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-06 Thread David Rowley
On 7 April 2018 at 09:29, Alvaro Herrera  wrote:
> Alvaro Herrera wrote:
>> I rebased this series on top of the committed version of the other patch.
>> Here's v22, with no other changes than rebasing.  I did not include
>> 0005, though.
>
> Apologies, I forgot to "git add" one fixup for 0001.

0003 I think.

I'm looking over the rebased patches now.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-06 Thread Alvaro Herrera
Alvaro Herrera wrote:
> I rebased this series on top of the committed version of the other patch.
> Here's v22, with no other changes than rebasing.  I did not include
> 0005, though.

Apologies, I forgot to "git add" one fixup for 0001.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 093ceaa867..aac05917ee 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -29,7 +29,6 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/cost.h"
-#include "optimizer/partprune.h"
 #include "optimizer/paths.h"
 #include "optimizer/placeholder.h"
 #include "optimizer/plancat.h"
@@ -42,6 +41,7 @@
 #include "optimizer/var.h"
 #include "parser/parse_clause.h"
 #include "parser/parsetree.h"
+#include "partitioning/partprune.h"
 #include "utils/lsyscache.h"
 
 


Re: [HACKERS] Runtime Partition Pruning

2018-04-05 Thread Amit Langote
Hi David,

On 2018/04/06 12:27, David Rowley wrote:
> (sending my reply in parts for concurrency)
> 
> On 6 April 2018 at 14:39, Amit Langote  wrote:
>> I think we can save some space here by not having the pointers stored
>> here.  Instead of passing the pointer itself in the recursive calls to
>> find_subplans_for_extparams_recurse, et al, just pass the entire array and
>> an offset to use for the given recursive invocation.
> 
> hmm, where would those offsets be stored?  I don't want to have to do
> any linear searching to determine the offset, which is why I just
> stored the pointer to the flat array. It seems very efficient to me to
> do this. Remember that this pruning can occur per tuple in cases like
> parameterized nested loops.
> 
> Are you worried about memory consumption? It's one pointer per
> partition. I imagine there's lots more allocated for DML on a
> partitioned table as it needs to store maps to map attribute numbers.
> 
> Or are you thinking the saving of storing an array of 32-bit int
> values is better than the array of probably 64-bit pointers? So
> requires half the space?

Yeah, just copy it from the PartitionPruneInfo like you're doing for
subnodeindex:

   memcpy(partrelprune->subpartindex, pinfo->subpartindex,
  sizeof(int) * pinfo->nparts);

Instead I see ExecSetupPartitionPruning is now doing this:

  /*
   * Setup the PartitionedRelPruning's subpartprune so that we can
   * quickly find sub-PartitionedRelPruning details for any
   * sub-partitioned tables that this partitioned table contains.
   * We need to be able to find these quickly during our recursive
   * search to find all matching subnodes.
   */
   for (j = 0; j < pinfo->nparts; j++)
   {
   int subpartidx = pinfo->subpartindex[j];

   Assert(subpartidx < list_length(partitionpruneinfo));

   if (subpartidx >= 0)
   partrelprune->subpartprune[j] = &partrelprunes[subpartidx];
   else
   partrelprune->subpartprune[j] = NULL;
   }


With that in place, pass the index/offset instead of the pointer to the
next recursive invocation of find_subplans_*, along with the array
containing all PartitionedRelPruning's.

So, where you have in each of find_subplans_*:

  if (partrelprune->subnodeindex[i] >= 0)
  *validsubplans = bms_add_member(*validsubplans,
  partrelprune->subnodeindex[i]);
  else if (partrelprune->subpartprune[i] != NULL)
  find_subplans_for_allparams_recurse(partrelprune->subpartprune[i],
  validsubplans);

I'm proposing that you do:

  if (partrelprune->subnodeindex[i] >= 0)
  *validsubplans = bms_add_member(*validsubplans,
  partrelprune->subnodeindex[i]);
  else if (partrelprune->subpartindex[i] >= 0)
  find_subplans_for_allparams_recurse(all_partrelprunes,
  partrelprune->subpartindex[i],
  validsubplans);

And at the top of each of  find_subplans_*:

  PartitionedRelPruning *partrelprune = all_partrelprunes[offset];

where the signature is:

  static void
  find_subplans_for_allparams_recurse(
 PartitionRelPruning **all_partrelprune,
 int offset,
 Bitmapset **validsubplans)

The all_partrelprune above refers to the flat array in PartitionPruning.
On the first call from ExecFindMatchingSubPlans, et al, you'd pass 0 for
offset to start pruning with the root parent's PartitionedRelPruning.  All
the values contained in subnodeindex and subpartindex are indexes into the
global array (whole-tree that is) anyway and that fact would be more
apparent if we use this code structure, imho.

>> If you look at ExecFindPartition used for tuple routing, you may see that
>> there no recursion at all.  Similarly find_subplans_for_extparams_recurse,
>> et al might be able to avoid recursion if similar tricks are used.
> 
> That seems pretty different. That's looking for a single node in a
> tree, so just is following a single path from the root, it never has
> to go back up a level and look down any other paths.
> 
> What we need for the find_subplans_for_extparams_recurse is to find
> all nodes in the entire tree which match the given clause. Doing this
> without recursion would require some sort of stack so we can go back
> up a level and search again down another branch.  There are ways
> around this without using recursion, sure, but I don't think any of
> them will be quite as convenient and simple. The best I can think of
> is to palloc some stack manually and use some depth_level to track
> which element to use.  An actual stack seems more simple. I can't
> quite think of a good way to know in advance how many elements we'd
> need to palloc.

Hmm, yeah.  I just remembered that I had to give up suggesting this a
while back on this thread.  So, okay, you 

Re: [HACKERS] Runtime Partition Pruning

2018-04-05 Thread Amit Langote
Hi David.

On 2018/04/05 22:41, David Rowley wrote:
>> * make_partition_pruneinfo has a parameter resultRelations that's not used
>> anywhere
> 
> It gets used in 0005.
> 
> I guess I could only add it in 0005, but make_partition_pruneinfo is
> only used in 0003, so you could say the same about that entire
> function.
> 
> Do you think I should delay adding that parameter until the 0005 patch?

Yes, I think.

>> * In make_partition_pruneinfo()
>>
>> allsubnodeindex = palloc(sizeof(int) * root->simple_rel_array_size);
>> allsubpartindex = palloc(sizeof(int) * root->simple_rel_array_size);
>>
>> I think these arrays need to have root->simple_rel_array_size + 1
>> elements, because they're subscripted using RT indexes which start at 1.
> 
> RT indexes are always 1-based. See setup_simple_rel_arrays. It already
> sets the array size to list_length(rtable) + 1.

Oh, I missed that simple_rel_array_size itself is set to consider 1-based
RT indexes.

relnode.c:73
root->simple_rel_array_size = list_length(root->parse->rtable) + 1;

>> * Also in make_partition_pruneinfo()
>>
>> /* Initialize to -1 to indicate the rel was not found */
>> for (i = 0; i < root->simple_rel_array_size; i++)
>> {
>> allsubnodeindex[i] = -1;
>> allsubpartindex[i] = -1;
>> }
>>
>> Maybe, allocate the arrays above mentioned using palloc0 and don't do this
>> initialization.  Instead make the indexes that are stored in these start
>> with 1 and consider 0 as invalid entries.
> 
> 0 is a valid subplan index. It is possible to make this happen, but
> I'd need to subtract 1 everywhere I used the map. That does not seem
> very nice. Seems more likely to result in bugs where we might forget
> to do the - 1.

You can subtract 1 right here in make_partition_pruneinfo before setting
the values in PartitionPruneInfo's subplan_indexes and parent_indexes.
I'm only proposing to make make_partition_pruneinfo() a bit faster by not
looping over both the local map arrays setting them to -1.

IOW, I'm not saying that we emit PartitionPruneInfo nodes that contain
1-based indexes.

> Did you want this because you'd rather have the palloc0() than the for
> loop setting the array elements to -1? Or is there another reason?

Yeah, that's it.

>> * Instead of nesting PartitionedRelPruning inside another, just store them
>> in a global flat array in the PartitionPruning, like PartitionTupleRouting
>> does for PartitionDispatch of individual partitioned tables in the tree.
>>
>> typedef struct PartitionedRelPruning
>> {
>> int nparts;
>> int*subnodeindex;
>> struct PartitionedRelPruning **subpartprune;
> 
> There is a flat array in PartitionPruning. subpartprune contains
> pointers into that array. I want to have this pointer array so I can
> directly reference the flat array in PartitionPruning.
> 
> Maybe I've misunderstood what you mean here.

I think we can save some space here by not having the pointers stored
here.  Instead of passing the pointer itself in the recursive calls to
find_subplans_for_extparams_recurse, et al, just pass the entire array and
an offset to use for the given recursive invocation.

If you look at ExecFindPartition used for tuple routing, you may see that
there no recursion at all.  Similarly find_subplans_for_extparams_recurse,
et al might be able to avoid recursion if similar tricks are used.

Finally about having two different functions for different sets of Params:
can't we have just named find_subplans_for_params_recurse() and use the
appropriate one based on the value of some parameter?  I can't help but
notice the duplication of code.

Thanks,
Amit




Re: [HACKERS] Runtime Partition Pruning

2018-04-05 Thread Jesper Pedersen

Hi David,

First of all: Solid patch set with good documentation.

On 04/05/2018 09:41 AM, David Rowley wrote:

> Seems mostly fair. I'm not a fan of using the term "unpruned" though.
> I'll have a think.  The "all" is meant in terms of what exists as
> subnodes.



'included_parts' / 'excluded_parts' probably isn't better...


> subplan_indexes and parent_indexes seem like better names, I agree.



More clear.


>> * Also in make_partition_pruneinfo()
>>
>>  /* Initialize to -1 to indicate the rel was not found */
>>  for (i = 0; i < root->simple_rel_array_size; i++)
>>  {
>>  allsubnodeindex[i] = -1;
>>  allsubpartindex[i] = -1;
>>  }
>>
>> Maybe, allocate the arrays above mentioned using palloc0 and don't do this
>> initialization.  Instead make the indexes that are stored in these start
>> with 1 and consider 0 as invalid entries.
>
> 0 is a valid subplan index. It is possible to make this happen, but
> I'd need to subtract 1 everywhere I used the map. That does not seem
> very nice. Seems more likely to result in bugs where we might forget
> to do the - 1.
>
> Did you want this because you'd rather have the palloc0() than the for
> loop setting the array elements to -1? Or is there another reason?



I think that doing palloc0 would be confusing; -1 is more clear, 
especially since it is used with 'allpartindexes' which is a Bitmapset.


Doing the variable renames as Amit suggests would be good.

I ran some tests (v50_v20) (make check-world passes), w/ and w/o 
choose_custom_plan() being false, and seeing good performance results 
without running into issues.


Maybe 0005 should be expanded in partition_prune.sql with the supported 
cases to make those more clear.


Thanks !

Best regards,
 Jesper



Re: [HACKERS] Runtime Partition Pruning

2018-04-05 Thread David Rowley
On 5 April 2018 at 15:14, Amit Langote  wrote:
> On 2018/03/31 22:52, David Rowley wrote:
>> Initialising
>> Append/MergeAppend/ModifyTable nodes with fewer subnodes than were
>> originally planned did require a small change in explain.c in some
>> code that was assuming the subnode arrays were the same length as the
>> subplan list. I also ended up adding a Int property to EXPLAIN to show
>> the number of subnodes that have been removed during execution.
>> Unfortunately, there is a small corner case with this in the case
>> where all subnodes are removed it leaves EXPLAIN with no subnodes to
>> search for outer side Vars in. I didn't really see any sane way to
>> handle this in EXPLAIN, so I think the best fix for this is what I've
>> done, and that's just to always initialise at least 1 subnode, even
>> when none match the external Params. Cost-wise, I don't think this is
>> such a big deal as the cost savings here are coming from saving on
>> initialising tens or hundreds of subnodes, not 1.
>
> I have wondered about the possibility of setting a valid (non-dummy)
> targetlist in Append and MergeAppend if they're created for a partitioned
> table.  Currently they're overwritten by a dummy one using
> set_dummy_tlist_references() in set_plan_refs() citing the following reason:
>
>  * set_dummy_tlist_references
>  *Replace the targetlist of an upper-level plan node with a simple
>  *list of OUTER_VAR references to its child.
>  *
>  * This is used for plan types like Sort and Append that don't evaluate
>  * their targetlists.  Although the executor doesn't care at all what's in
>  * the tlist, EXPLAIN needs it to be realistic.
>
> In fact, when I noticed this EXPLAIN behavior I wondered if that's
> something we should have discussed when d3cc37f1d801a [1] went in.

I had a quick hack at this to see if it would work and it does seem to
on my very simple test. However, it would mean removing
set_dummy_tlist_references from more than just Append/MergeAppend:

create table listp (a int, b int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 partition of listp for values in(2);
prepare q1 (int, int) as select * from listp where a in($1,$2) order
by b limit 1;
explain execute q1(1,2);
explain execute q1(1,2);
explain execute q1(1,2);
explain execute q1(1,2);
explain execute q1(1,2);
explain (verbose, costs off) execute q1(0,0);
   QUERY PLAN
--------------------------
 Limit
   Output: listp.a, listp.b
   ->  Sort
 Output: listp.a, listp.b
 Sort Key: listp.b
 ->  Append
   Subplans Pruned: 2
(7 rows)

The downside is that if we were to do this it would mean changing the
output in cases like:

explain (verbose, costs off) (select a z, b y from listp union all
select * from listp) order by y;
  QUERY PLAN
--------------------------------------------
 Sort
   Output: z, y
   Sort Key: y
   ->  Append
 ->  Seq Scan on public.listp1
   Output: listp1.a, listp1.b
 ->  Seq Scan on public.listp2
   Output: listp2.a, listp2.b
 ->  Seq Scan on public.listp1 listp1_1
   Output: listp1_1.a, listp1_1.b
 ->  Seq Scan on public.listp2 listp2_1
   Output: listp2_1.a, listp2_1.b

Notice the sort key now refers to the alias rather than a column from
the first append child.

It sure is an interesting thought, and one I'd not considered, but I
don't think trying for something like this is going to be the path of
least resistance. It may also add quite a bit of complexity if we just
try to do it when the OUTER_VAR would lead to an Append/MergeAppend
which belongs to a partitioned table scan.

I think this idea has good merit and some form of it might be the
nicest way to allow run-time pruning of all subnodes in the future.
Perhaps we can put it on the shelf and think about it again for PG12.
However, it might not be the most interesting optimization to work on,
as I think probably run-time pruning away of all subnodes is probably
far less common than pruning some, or all but one, and the cost of
initializing the one unneeded subnode is not so big.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


set_dummy_tlist_hacks.patch
Description: Binary data


Re: [HACKERS] Runtime Partition Pruning

2018-04-04 Thread Amit Langote
Hi David.

On 2018/03/31 22:52, David Rowley wrote:
> The attached patchset is based on Amit's v45 faster partition pruning [1].
> 
> I've made a few changes since the v14 version. Since Amit's v45 patch
> now creates the partition pruning details in a data structure that can
> be copied from the planner over to the executor, it means this patch
> is now able to do much of the setup work for run-time pruning in the
> planner. Doing this now allows us to determine if run-time pruning is
> even possible at plan time, rather than the executor attempting it and
> sometimes wasting effort when it failed to find Params matching the
> partition key.
>
> Another change from the last version is that I've separated out the
> handling of exec Params and external Params.  The new patch now will
> perform a pruning step at executor startup if some external params
> match the partition key.  This is very beneficial to a PREPAREd OLTP
> type query against a partitioned table as it means we can skip
> sub-plan initialisation for all non-matching partitions.

This is quite nice.

> Initialising
> Append/MergeAppend/ModifyTable nodes with fewer subnodes than were
> originally planned did require a small change in explain.c in some
> code that was assuming the subnode arrays were the same length as the
> subplan list. I also ended up adding an Int property to EXPLAIN to show
> the number of subnodes that have been removed during execution.
> Unfortunately, there is a small corner case: when all subnodes are
> removed, EXPLAIN is left with no subnodes to search for outer side
> Vars in. I didn't really see any sane way to handle this in EXPLAIN,
> so I think the best fix for this is what I've done, and that's just to
> always initialise at least 1 subnode, even when none match the
> external Params. Cost-wise, I don't think this is such a big deal, as
> the cost savings here come from saving on initialising tens or
> hundreds of subnodes, not 1.

I have wondered about the possibility of setting a valid (non-dummy)
targetlist in Append and MergeAppend if they're created for a partitioned
table.  Currently they're overwritten by a dummy one using
set_dummy_tlist_references() in set_plan_refs() citing the following reason:

 * set_dummy_tlist_references
 *Replace the targetlist of an upper-level plan node with a simple
 *list of OUTER_VAR references to its child.
 *
 * This is used for plan types like Sort and Append that don't evaluate
 * their targetlists.  Although the executor doesn't care at all what's in
 * the tlist, EXPLAIN needs it to be realistic.

In fact, when I noticed this EXPLAIN behavior, I wondered whether it's
something we should have discussed when d3cc37f1d801a [1] went in.

> To put the new patch to the test, I tried pgbench -S -M prepared -s
> 100 with and without having modified pgbench_accounts to separate into
> 10 RANGE partitions of equal size.
> 
> A non-partitioned table was getting 12503 TPS.
> With partitioned tables, the old version of this patch was getting: 5470 TPS.
> With partitioned tables, the attached version gets 11247 TPS.
> For perspective, today's master with a partitioned table gets 4719 TPS.

Quite nice!

> So you can see it's a pretty good performance boost by skipping
> initialisation of the 9 non-matching subplans.  It's not hard to
> imagine the gains getting more significant with a larger number of
> partitions. Ideally, the performance of a partitioned table would be
> faster than the non-partitioned case, but please remember this query
> is a single row PK lookup SELECT, so is a very fast query in both
> cases. Partitioning cases should improve more as the table grows and
> indexes struggle to fit in RAM, or when many rows are being taken from
> the partition and being aggregated.
> 
> Unfortunately, the story is not quite as good for the non -M prepared
> version of the benchmark. This shows that planning time of partitioned
> table queries could still use some improvements. Amit's v45 patch
> certainly makes a dent in the planner slow performance here, but it's
> still nothing like as fast as the non-partitioned case. More work is
> required there in the future.

Certainly.  Things like the issue that we both replied to yesterday should
be addressed for starters [2].

> This patchset also contains a patch to improve the performance of
> inheritance planning of UPDATE/DELETE queries. This patch also
> implements run-time pruning for UPDATE/DELETE too. This also has a
> significant performance improvement for planning of UPDATE/DELETE
> operations on partitioned tables with a large number of partitions,
> performance is as follows:
> 
> Values in transactions per second.
> 
> Partitions = 1
> Unpatched: 7323.3
> Patched: 6573.2 (-10.24%)
> 
> Partitions = 2
> Unpatched: 6784.8
> Patched: 6377.1 (-6.01%)
> 
> Partitions = 4
> Unpatched: 5903.0
> Patched: 6106.8 (3.45%)
> 
> Partitions = 8
> Unpatched: 4582.0
> Patched: 5579.9 (21.78%)

Re: [HACKERS] Runtime Partition Pruning

2018-04-04 Thread Jesper Pedersen

Hi David,

On 04/03/2018 10:10 PM, David Rowley wrote:

The attached case doesn't trigger a generic plan, so basically all time is
spent in GetCachedPlan.


Yeah, there's still no resolution to the fact that a generic plan +
runtime pruning might be cheaper than a custom plan.  The problem is
the generic plan appears expensive to the custom vs generic plan
comparison due to it containing more Append subnodes and the run-time
pruning not being taken into account by that comparison.

There's been some discussion about this on this thread somewhere.



Forgot about that, sorry.


I think the best solution is probably the one suggested by Robert [1]
and that's to alter the Append plan's cost when run-time pruning is
enabled to try to account for the run-time pruning. This would be a
bit of a blind guess akin to what we do for clause selectivity
estimates for Params, but it's probably better than nothing, and
likely better than doing nothing.
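To make the idea concrete, here is a minimal sketch of what such a blind-guess discount might look like, using the sort of clause-shape fractions Robert suggested. The function names and numbers are assumptions for illustration only; PostgreSQL defines no such constants or functions.

```python
# Blind-guess cost discount for an Append node when run-time pruning is
# possible.  The clause-shape fractions here are hypothetical; equality
# on the partition key can match at most one partition, so it gets
# 1/nparts, akin to default clause selectivity estimates for Params.

def guessed_surviving_fraction(op: str, nparts: int) -> float:
    if op == "eq":          # partkey = $1
        return 1.0 / nparts
    if op == "range":       # partkey > $1
        return 0.75
    if op == "between":     # partkey BETWEEN $1 AND $2
        return 0.50
    return 1.0              # unknown clause shape: assume no pruning

def discounted_append_cost(total_cost: float, op: str, nparts: int) -> float:
    # Scale the summed subplan cost by the guessed surviving fraction.
    return total_cost * guessed_surviving_fraction(op, nparts)

print(discounted_append_cost(1000.0, "eq", 4))  # 250.0
```

The same fraction could also scale the Append's row estimate, which matters when the Append sits below a join, as in the nested-loop example discussed earlier in the thread.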



Yeah, something based on the number of WHERE clauses, and on whether 
the partitioning scheme has a DEFAULT / NULL partition, could help. 
Forcing choose_custom_plan() to return false does help a lot (> 400%) 
for the HASH case.


But maybe this area is best left for PG12.


Yeah, it's a bug in v46 faster partition pruning. Discussing a fix for
that with Amit over on [2].



I was running with a version of faster_part_prune_v45_fixups.patch.

Patch v49 with v18 (0001-0004) works. 0005 needs a rebase.

Thanks again,
 Jesper



Re: [HACKERS] Runtime Partition Pruning

2018-04-04 Thread Amit Langote
On 2018/04/04 16:04, David Rowley wrote:
> On 4 April 2018 at 18:27, Amit Langote  wrote:
>> I'm not sure if we've yet discussed anything that'd be related to this on
>> the faster pruning thread.
> 
hmm, yeah, I didn't really explain the context, but the report was in [1]
> [1] 
> https://www.postgresql.org/message-id/cakjs1f_shpuqdhqwjq-_p1kppqn7bjt71ypbdp_8b3rhwfq...@mail.gmail.com

Oh, I see.  Hopefully it is no longer an issue.

Thanks,
Amit




Re: [HACKERS] Runtime Partition Pruning

2018-04-04 Thread David Rowley
On 4 April 2018 at 18:27, Amit Langote  wrote:
> On 2018/04/04 11:10, David Rowley wrote:
>> On 4 April 2018 at 05:44, Jesper Pedersen  wrote:
>>> Also, I'm seeing a regression for check-world in
>>> src/test/regress/results/inherit.out
>>>
>>> ***
>>> *** 642,648 
>>>   -+---+---+-
>>>mlparted_tab_part1  | 1 | a |
>>>mlparted_tab_part2a | 2 | a |
>>> !  mlparted_tab_part2b | 2 | b | xxx
>>>mlparted_tab_part3  | 3 | a | xxx
>>>   (4 rows)
>>>
>>> --- 642,648 
>>>   -+---+---+-
>>>mlparted_tab_part1  | 1 | a |
>>>mlparted_tab_part2a | 2 | a |
>>> !  mlparted_tab_part2b | 2 | b |
>>>mlparted_tab_part3  | 3 | a | xxx
>>>   (4 rows)
>>>
>>> I'll spend some more time tomorrow.
>>
>> Yeah, it's a bug in v46 faster partition pruning. Discussing a fix for
>> that with Amit over on [2].
>
> I'm not sure if we've yet discussed anything that'd be related to this on
> the faster pruning thread.

hmm, yeah, I didn't really explain the context, but the report was in [1]

Basically, the OR clause in the following SQL fragment was overwriting
the scan_all_nonnull value:

where (mlp.a = ss.a and mlp.b = 'b') or mlp.a = 3;

The offending line was:

result->scan_all_nonnull = step_result->scan_all_nonnull;

The minimum fix would have been to change that line to:

result->scan_all_nonnull |= step_result->scan_all_nonnull;
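The overwrite-versus-accumulate difference can be seen in a toy sketch (Python stand-ins for illustration, not the real pruning structures):

```python
# Toy demonstration of why OR-step results must be combined with |=
# rather than plain assignment: with =, the last OR branch processed
# overwrites the flag set by earlier branches.

def combine_or_steps_buggy(branch_flags):
    scan_all_nonnull = False
    for flag in branch_flags:
        scan_all_nonnull = flag       # overwrites earlier branches
    return scan_all_nonnull

def combine_or_steps_fixed(branch_flags):
    scan_all_nonnull = False
    for flag in branch_flags:
        scan_all_nonnull |= flag      # sticky once any branch sets it
    return scan_all_nonnull

# First OR branch needs all non-null partitions, the second doesn't:
print(combine_or_steps_buggy([True, False]))  # False (wrong)
print(combine_or_steps_fixed([True, False]))  # True
```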

Anyway, it's all irrelevant now as that code has all changed.

[1] 
https://www.postgresql.org/message-id/cakjs1f_shpuqdhqwjq-_p1kppqn7bjt71ypbdp_8b3rhwfq...@mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-04 Thread Amit Langote
Hi David.

On 2018/04/04 11:10, David Rowley wrote:
> On 4 April 2018 at 05:44, Jesper Pedersen  wrote:
>> Also, I'm seeing a regression for check-world in
>> src/test/regress/results/inherit.out
>>
>> ***
>> *** 642,648 
>>   -+---+---+-
>>mlparted_tab_part1  | 1 | a |
>>mlparted_tab_part2a | 2 | a |
>> !  mlparted_tab_part2b | 2 | b | xxx
>>mlparted_tab_part3  | 3 | a | xxx
>>   (4 rows)
>>
>> --- 642,648 
>>   -+---+---+-
>>mlparted_tab_part1  | 1 | a |
>>mlparted_tab_part2a | 2 | a |
>> !  mlparted_tab_part2b | 2 | b |
>>mlparted_tab_part3  | 3 | a | xxx
>>   (4 rows)
>>
>> I'll spend some more time tomorrow.
> 
> Yeah, it's a bug in v46 faster partition pruning. Discussing a fix for
> that with Amit over on [2].

I'm not sure if we've yet discussed anything that'd be related to this on
the faster pruning thread.  It seems that the difference arises from
mlparted_tab_part2b not being selected for an update query that's executed
just before this test.  When I execute an equivalent select query to check
if mlparted_tab_part2b is inadvertently pruned due to the new code, I
don't see the latest faster pruning patch doing it:

explain (costs off)
select *
from mlparted_tab mlp,
(select a from some_tab union all select a+1 from some_tab) ss (a)
where (mlp.a = ss.a and mlp.b = 'b') or mlp.a = 3;
QUERY PLAN

--
 Nested Loop
   Join Filter: (((mlp.a = some_tab.a) AND (mlp.b = 'b'::bpchar)) OR
(mlp.a = 3))
   ->  Append
 ->  Seq Scan on some_tab
 ->  Seq Scan on some_tab some_tab_1
   ->  Materialize
 ->  Append
   ->  Seq Scan on mlparted_tab_part1 mlp
 Filter: ((b = 'b'::bpchar) OR (a = 3))
   ->  Seq Scan on mlparted_tab_part2b mlp_1
 Filter: ((b = 'b'::bpchar) OR (a = 3))
   ->  Seq Scan on mlparted_tab_part3 mlp_2
 Filter: ((b = 'b'::bpchar) OR (a = 3))
(13 rows)

For the original update query, constraint exclusion selects the same set
of partitions:

explain (costs off) update mlparted_tab mlp set c = 'xxx'
from (select a from some_tab union all select a+1 from some_tab) ss (a)
where (mlp.a = ss.a and mlp.b = 'b') or mlp.a = 3;
  QUERY PLAN

--
 Update on mlparted_tab mlp
   Update on mlparted_tab_part1 mlp_1
   Update on mlparted_tab_part2b mlp_2
   Update on mlparted_tab_part3 mlp_3
   ->  Nested Loop
 Join Filter: (((mlp_1.a = some_tab.a) AND (mlp_1.b =
'b'::bpchar)) OR (mlp_1.a = 3))
 ->  Append
   ->  Seq Scan on some_tab
   ->  Seq Scan on some_tab some_tab_1
 ->  Materialize
   ->  Seq Scan on mlparted_tab_part1 mlp_1
 Filter: ((b = 'b'::bpchar) OR (a = 3))
   ->  Nested Loop
 Join Filter: (((mlp_2.a = some_tab.a) AND (mlp_2.b =
'b'::bpchar)) OR (mlp_2.a = 3))
 ->  Append
   ->  Seq Scan on some_tab
   ->  Seq Scan on some_tab some_tab_1
 ->  Materialize
   ->  Seq Scan on mlparted_tab_part2b mlp_2
 Filter: ((b = 'b'::bpchar) OR (a = 3))
   ->  Nested Loop
 Join Filter: (((mlp_3.a = some_tab.a) AND (mlp_3.b =
'b'::bpchar)) OR (mlp_3.a = 3))
 ->  Append
   ->  Seq Scan on some_tab
   ->  Seq Scan on some_tab some_tab_1
 ->  Materialize
   ->  Seq Scan on mlparted_tab_part3 mlp_3
 Filter: ((b = 'b'::bpchar) OR (a = 3))
(28 rows)

What am I missing?

Thanks,
Amit




Re: [HACKERS] Runtime Partition Pruning

2018-04-03 Thread Beena Emerson
Hi David,

On Wed, Apr 4, 2018 at 7:57 AM, David Rowley
 wrote:
> On 4 April 2018 at 05:50, Beena Emerson  wrote:
>> With Amit's v46 patch,  the following query in partition_prune was
>> crashing during make check.
>> explain (analyze, costs off, summary off, timing off) execute ab_q1 (2, 2, 
>> 3);
>
> Hi Beena,
>
> Thanks for looking.
>
> Does it work correctly if you apply [1] to Amit's v46 patch before
> patching with v18 run-time partition pruning?
>
> [1] 
> https://www.postgresql.org/message-id/CAKJS1f_6%2BgXB%3DQ%2BDryeB62yW7N19sY8hH_dBSjPFjm2ifdgoCw%40mail.gmail.com
>

Thanks for working on it. make check passes when the patch [1] is also applied.

--

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Runtime Partition Pruning

2018-04-03 Thread David Rowley
On 4 April 2018 at 05:50, Beena Emerson  wrote:
> With Amit's v46 patch,  the following query in partition_prune was
> crashing during make check.
> explain (analyze, costs off, summary off, timing off) execute ab_q1 (2, 2, 3);

Hi Beena,

Thanks for looking.

Does it work correctly if you apply [1] to Amit's v46 patch before
patching with v18 run-time partition pruning?

[1] 
https://www.postgresql.org/message-id/CAKJS1f_6%2BgXB%3DQ%2BDryeB62yW7N19sY8hH_dBSjPFjm2ifdgoCw%40mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-03 Thread Beena Emerson
Hello,

On Tue, Apr 3, 2018 at 11:14 PM, Jesper Pedersen
 wrote:
> Hi David,
>
> On 03/31/2018 09:52 AM, David Rowley wrote:
>>
>> I've attached a new version of the patch.  I'm now at v18 after having
>> some versions of the patch that I didn't release which were based on
>> various versions of Amit's faster partition pruning patch.
>>
>
> Thank you for the updated patch set !
>
> I have tested this together with Amit's v46 patch.
>
>
> Also, I'm seeing a regression for check-world in
> src/test/regress/results/inherit.out
>

With Amit's v46 patch,  the following query in partition_prune was
crashing during make check.
explain (analyze, costs off, summary off, timing off) execute ab_q1 (2, 2, 3);



-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Runtime Partition Pruning

2018-04-03 Thread Jesper Pedersen

Hi David,

On 03/31/2018 09:52 AM, David Rowley wrote:

I've attached a new version of the patch.  I'm now at v18 after having
some versions of the patch that I didn't release which were based on
various versions of Amit's faster partition pruning patch.



Thank you for the updated patch set !

I have tested this together with Amit's v46 patch.

The attached case doesn't trigger a generic plan, so basically all time 
is spent in GetCachedPlan.


The standard table case (std.sql) gives:

 generic_cost = 8.4525
 avg_custom_cost = 13.4525
 total_custom_cost = 67.2625

whereas the 64 hash partition case (hash.sql) gives:

 generic_cost = 540.32
 avg_custom_cost = 175.9425
 total_custom_cost = 879.7125

I tested with pgbench -M prepared -f select.sql.
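For context, the custom-versus-generic decision these numbers feed into can be sketched roughly like this. This is a simplified model of the plancache heuristic, not the exact code; the real version also folds a planning-cost estimate into the comparison.

```python
def choose_custom_plan(num_custom_plans: int, generic_cost: float,
                       total_custom_cost: float) -> bool:
    # Simplified sketch: build custom plans for the first five executions
    # to gather an average cost, then stick with custom plans only while
    # the generic plan looks more expensive on paper.
    if num_custom_plans < 5:
        return True
    avg_custom_cost = total_custom_cost / num_custom_plans
    return generic_cost >= avg_custom_cost

# The hash-partitioned numbers above: the generic plan looks far more
# expensive because run-time pruning isn't reflected in its cost.
print(choose_custom_plan(5, 540.32, 879.7125))  # True -> custom plan wins
# The standard-table numbers: the generic plan is cheaper, so it is used.
print(choose_custom_plan(5, 8.4525, 67.2625))   # False -> generic plan
```

This is why the hash case never escapes GetCachedPlan's replanning: the generic plan's cost includes all 64 Append subnodes, so it always loses to the pruned custom plans.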

Also, I'm seeing a regression for check-world in 
src/test/regress/results/inherit.out


***
*** 642,648 
  -+---+---+-
   mlparted_tab_part1  | 1 | a |
   mlparted_tab_part2a | 2 | a |
!  mlparted_tab_part2b | 2 | b | xxx
   mlparted_tab_part3  | 3 | a | xxx
  (4 rows)

--- 642,648 
  -+---+---+-
   mlparted_tab_part1  | 1 | a |
   mlparted_tab_part2a | 2 | a |
!  mlparted_tab_part2b | 2 | b |
   mlparted_tab_part3  | 3 | a | xxx
  (4 rows)

I'll spend some more time tomorrow.

Thanks for working on this !

Best regards,
 Jesper


hash.sql
Description: application/sql


select.sql
Description: application/sql


std.sql
Description: application/sql


Re: [HACKERS] Runtime Partition Pruning

2018-03-09 Thread Jesper Pedersen

Hi David,

On 03/01/2018 08:29 PM, David Rowley wrote:

0004 fails "make check-world" due to

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 670; 1259 49954 TABLE
boolp_f jpedersen
pg_restore: [archiver (db)] could not execute query: ERROR:  syntax error at
or near "false"
LINE 24: ..." ATTACH PARTITION "public"."boolp_f" FOR VALUES IN (false);


The tests seem to have stumbled on a pg_dump bug which causes it to
produce syntax that's not valid (currently)

I should be able to stop my patch failing the test by dropping that
table, which I should have been doing anyway.



I've added that thread to the Open Items list.


Thanks for the review and in advance for the future review.

I'll delay releasing a new patch as there's some discussion over on
the faster partition pruning thread which affects this too [1]

[1] 
https://www.postgresql.org/message-id/CA+Tgmoa4D1c4roj7L8cx8gkkeBWAZD=mtcxkxtwbnslrhd3...@mail.gmail.com



Sure, 0003-0005 depend on that thread. 0002 is refactoring, so that one 
is ready.


In 0004 should we add a HASH based test case,

-- test.sql --
-- verify pruning in hash partitions
create table hashp (a int primary key, b int) partition by hash (a);
create table hashp_0 partition of hashp for values with (modulus 2, 
remainder 0);
create table hashp_1 partition of hashp for values with (modulus 2, 
remainder 1);

insert into hashp values (0,0), (1,1), (2,2), (3,3);
prepare q1 (int) as select * from hashp where a = $1;
execute q1 (1);
execute q1 (1);
execute q1 (1);
execute q1 (1);
execute q1 (1);
explain (analyze, costs off, summary off, timing off)  execute q1 (1);
explain (analyze, costs off, summary off, timing off)  execute q1 (3);
deallocate q1;
drop table hashp;
-- test.sql --

Also, should 0004 consider the "Parallel Append" case, aka

-- parallel.sql --
CREATE TABLE t1 (
a integer NOT NULL,
b integer NOT NULL
) PARTITION BY HASH (b);

CREATE TABLE t1_p00 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 0);
CREATE TABLE t1_p01 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 1);
CREATE TABLE t1_p02 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 2);
CREATE TABLE t1_p03 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 3);

INSERT INTO t1 (SELECT i, i FROM generate_series(1, 100) AS i);
PREPARE q1 (int) AS SELECT * FROM t1 WHERE a = $1;
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)  EXECUTE q1 (5432);
DEALLOCATE q1;
DROP TABLE t1;
-- parallel.sql --

Best regards,
 Jesper



Re: [HACKERS] Runtime Partition Pruning

2018-03-01 Thread David Rowley
On 2 March 2018 at 07:17, Jesper Pedersen  wrote:
> A small typo in 0001:
>
> + * leftmost_ons_pos[x] gives the bit number (0-7) of the leftmost one bit
> in a
>
> ..."_one_"...

Oops. I'll fix that.

> 0004 fails "make check-world" due to
>
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 670; 1259 49954 TABLE
> boolp_f jpedersen
> pg_restore: [archiver (db)] could not execute query: ERROR:  syntax error at
> or near "false"
> LINE 24: ..." ATTACH PARTITION "public"."boolp_f" FOR VALUES IN (false);

The tests seem to have stumbled on a pg_dump bug which causes it to
produce syntax that's not valid (currently)

I should be able to stop my patch failing the test by dropping that
table, which I should have been doing anyway.

> Do you require https://commitfest.postgresql.org/17/1410/ as well ?

I'll look at that thread and see if there's any discussion of pg_dump being broken.

> I'll look more at 0002-0005 over the coming days.

Thanks for the review and in advance for the future review.

I'll delay releasing a new patch as there's some discussion over on
the faster partition pruning thread which affects this too [1]

[1] 
https://www.postgresql.org/message-id/CA+Tgmoa4D1c4roj7L8cx8gkkeBWAZD=mtcxkxtwbnslrhd3...@mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-03-01 Thread Jesper Pedersen

Hi David,

On 03/01/2018 08:04 AM, David Rowley wrote:

I've also split the patch out a bit more into logical parts in the
hope it makes things easier to review.



A small typo in 0001:

+ * leftmost_ons_pos[x] gives the bit number (0-7) of the leftmost one 
bit in a


..."_one_"...

0004 fails "make check-world" due to

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 670; 1259 49954 TABLE 
boolp_f jpedersen
pg_restore: [archiver (db)] could not execute query: ERROR:  syntax 
error at or near "false"

LINE 24: ..." ATTACH PARTITION "public"."boolp_f" FOR VALUES IN (false);

Do you require https://commitfest.postgresql.org/17/1410/ as well ?

I'll look more at 0002-0005 over the coming days.

Thanks for working on this !

Best regards,
 Jesper



Re: [HACKERS] Runtime Partition Pruning

2018-02-25 Thread Amit Langote
(Sorry, I had mistakenly replied only to Simon on Friday)

On Fri, Feb 23, 2018 at 9:04 PM, Simon Riggs  wrote:
> On 23 February 2018 at 11:40, David Rowley  
> wrote:
>> On 23 February 2018 at 04:11, Jesper Pedersen
>>  wrote:
>>> Are UPDATE and DELETE supposed to be supported?
>>
>> To be honest, I had not even considered those.
>
> I can say that I had considered those. Handling of UPDATE and DELETE
> with partitions is significantly different, so it's not just an
> oversight, it's a different branch of the project.
>
> We need SELECT to work first and then we have a chance of making
> UPDATE/DELETE work.

+1

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

2018-02-23 Thread Simon Riggs
On 23 February 2018 at 11:40, David Rowley  wrote:
> On 23 February 2018 at 04:11, Jesper Pedersen
>  wrote:
>> Are UPDATE and DELETE supposed to be supported?
>
> To be honest, I had not even considered those.

I can say that I had considered those. Handling of UPDATE and DELETE
with partitions is significantly different, so it's not just an
oversight, it's a different branch of the project.

We need SELECT to work first and then we have a chance of making
UPDATE/DELETE work.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-02-23 Thread David Rowley
On 23 February 2018 at 04:11, Jesper Pedersen
 wrote:
> Are UPDATE and DELETE supposed to be supported?

To be honest, I had not even considered those. Without looking in
detail, I imagine it may be possible to allow this simply by setting
AppendPath->trypartitionprune in the correct cases in
inheritance_planner(). I would need to look into this in some detail
to find out for sure.

Another case which likely is simple to implement is the exact same
processing for MergeAppends. I currently see no reason why the same
pruning cannot be done for subnodes of that node type too. I've just
not done so yet. I'd rather get more sanity check reviews on the
current scope of the patch before I widen it out to other areas, but
at the same time also don't want to leave very simple things to PG12
which can easily be done in PG11. So I'll try to look at this and get
back to you, or perhaps release a new set of patches to support the
additional features.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-02-22 Thread Amit Langote
Hi David.

On 2018/02/23 0:11, David Rowley wrote:
> On 23 February 2018 at 01:15, David Rowley  
> wrote:
>> One problem that I'm facing now is down to the way I'm gathering the
>> ParamIds that match the partkeys. As you'll see from the patch I've
>> added a 'paramids' field to PartitionPruneContext and I'm populating
>> this when the clauses are being pre-processed in
>> extract_partition_clauses(). The problem is that the or_clauses are
>> not pre-processed at all, so the current patch will not properly
>> perform run-time pruning when the Params are "hidden" in OR branches.
>>
>> One way I thought to fix this was to change the clause processing to
>> create an array of PartitionClauseInfos, one for each OR branch. This
>> would also improve the performance of the run-time pruning, meaning
>> that all of the or clauses would be already matched to the partition
>> keys once, rather than having to redo that again each time a Param
>> changes its value.
>>
>> If I go and write a patch to do that, would you want it in your patch,
>> or would you rather I kept it over here?  Or perhaps you have a better
>> idea on how to solve...?
> 
> I've attached a patch which does this. For now, the patch is only
> intended to assist in the discussion of the above idea.
> 
> The patch is based on a WIP version of run-time pruning that I'm not
> quite ready to post yet, but with a small amount of work you could
> probably take it and base it on your faster partition pruning v31
> patch [1].
> 
> I ended up pulling the PartitionPruneInfo out of the
> PartitionPruneContext. This was required due to how I've now made
> extract_partition_clauses() recursively call itself. We don't want to
> overwrite the context's clauseinfo with the one from the recursive
> call. A side effect of this is that the subcontext is no longer
> required when processing the OR clauses. You only did this so that the
> context's clauseinfo was not overwritten. I also think it's better to
> separate out the inputs and outputs. Anything in context was more
> intended to be for input fields.
> 
> Let me know your thoughts about this. If you don't want it for faster
> partition pruning, then I'll probably go and tidy it up and include it
> for run-time pruning.

Thanks for the patch.

I don't have time today to look at the patch closely, but at first blush,
it seems like something I should incorporate into my own patch, because
it's restructuring the partprune.c code.  I will study the issue and your
proposed solution in the form of this restructuring more closely over the
weekend and reply (probably with a new version of my patch) on Monday.

Thanks,
Amit




Re: [HACKERS] Runtime Partition Pruning

2018-02-22 Thread David Rowley
On 23 February 2018 at 01:15, David Rowley  wrote:
> One problem that I'm facing now is down to the way I'm gathering the
> ParamIds that match the partkeys. As you'll see from the patch I've
> added a 'paramids' field to PartitionPruneContext and I'm populating
> this when the clauses are being pre-processed in
> extract_partition_clauses(). The problem is that the or_clauses are
> not pre-processed at all, so the current patch will not properly
> perform run-time pruning when the Params are "hidden" in OR branches.
>
> One way I thought to fix this was to change the clause processing to
> create an array of PartitionClauseInfos, one for each OR branch. This
> would also improve the performance of the run-time pruning, meaning
> that all of the or clauses would be already matched to the partition
> keys once, rather than having to redo that again each time a Param
> changes its value.
>
> If I go and write a patch to do that, would you want it in your patch,
> or would you rather I kept it over here?  Or perhaps you have a better
> idea on how to solve...?

Hi Amit,

I've attached a patch which does this. For now, the patch is only
intended to assist in the discussion of the above idea.

The patch is based on a WIP version of run-time pruning that I'm not
quite ready to post yet, but with a small amount of work you could
probably take it and base it on your faster partition pruning v31
patch [1].

I ended up pulling the PartitionPruneInfo out of the
PartitionPruneContext. This was required due to how I've now made
extract_partition_clauses() recursively call itself. We don't want to
overwrite the context's clauseinfo with the one from the recursive
call. A side effect of this is that the subcontext is no longer
required when processing the OR clauses. You only did this so that the
context's clauseinfo was not overwritten. I also think it's better to
separate out the inputs and outputs. Anything in context was more
intended to be for input fields.

Let me know your thoughts about this. If you don't want it for faster
partition pruning, then I'll probably go and tidy it up and include it
for run-time pruning.

[1] 
https://www.postgresql.org/message-id/00ae2273-bb6b-1287-9ebc-5459b37c9078%40lab.ntt.co.jp

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


generate_PartitionClauseInfos_for_or_clauses.patch
Description: Binary data


Re: [HACKERS] Runtime Partition Pruning

2018-02-22 Thread Jesper Pedersen

Hi David,

On 02/21/2018 04:06 AM, David Rowley wrote:

I've attached v11 of the patch.



Are UPDATE and DELETE supposed to be supported?

With

-- test.sql --
CREATE TABLE test (a integer NOT NULL, b integer) PARTITION BY HASH(a);
CREATE TABLE test_p00 PARTITION OF test FOR VALUES WITH (MODULUS 2, 
REMAINDER 0);
CREATE TABLE test_p01 PARTITION OF test FOR VALUES WITH (MODULUS 2, 
REMAINDER 1);

CREATE INDEX idx_test_a ON test (a);
CREATE INDEX idx_test_b ON test (b);

INSERT INTO test (SELECT i,i FROM generate_series(1, 100) AS i);

ANALYZE;
-- test.sql --

and

UPDATE test SET b = 1 WHERE a = ?
DELETE FROM test WHERE a = ?

both show that all partitions are scanned:

 Update on test
   Update on test_p00
   Update on test_p01
   ->  Index Scan using test_p00_a_idx on test_p00
 Index Cond: (a = 1)
   ->  Index Scan using test_p01_a_idx on test_p01
 Index Cond: (a = 1)

Using prune_v32 and runtime_v11 with conflicts resolved.

Best regards,
 Jesper



Re: [HACKERS] Runtime Partition Pruning

2018-02-22 Thread David Rowley
On 22 February 2018 at 22:31, Amit Langote
 wrote:
> Some comments:

Hi Amit,

Thanks for looking at this. I'll work through your comments and
produce a patch sometime in the near future.

One problem that I'm facing now is down to the way I'm gathering the
ParamIds that match the partkeys. As you'll see from the patch I've
added a 'paramids' field to PartitionPruneContext and I'm populating
this when the clauses are being pre-processed in
extract_partition_clauses(). The problem is that the or_clauses are
not pre-processed at all, so the current patch will not properly
perform run-time pruning when the Params are "hidden" in OR branches.

One way I thought to fix this was to change the clause processing to
create an array of PartitionClauseInfos, one for each OR branch. This
would also improve the performance of the run-time pruning, meaning
that all of the or clauses would be already matched to the partition
keys once, rather than having to redo that again each time a Param
changes its value.

If I go and write a patch to do that, would you want it in your patch,
or would you rather I kept it over here?  Or perhaps you have a better
idea on how to solve...?
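The pre-matching idea above can be sketched as follows. These structures are illustrations only, not the patch's actual PartitionClauseInfos: the expensive clause matching runs once at setup, and only the cheap evaluation step reruns when a Param changes value.

```python
# Hypothetical sketch of pre-matching OR branches against the partition
# key once, so that per-execution pruning only re-evaluates Param values.

def match_or_branches(or_branches, partkey):
    # One-time setup step: keep only the clauses comparing the partition
    # key, remembering which Param feeds each comparison.
    return [[(c["op"], c["param"]) for c in branch if c["column"] == partkey]
            for branch in or_branches]

def surviving_partitions(matched, param_values, partitions):
    # Cheap per-execution step: a partition survives if any OR branch
    # admits it under the current Param values (ranges are [lo, hi)).
    result = set()
    for branch in matched:
        for op, param in branch:
            if op == "=":
                value = param_values[param]
                result |= {name for name, (lo, hi) in partitions.items()
                           if lo <= value < hi}
    return result

matched = match_or_branches(
    [[{"column": "a", "op": "=", "param": 1}],
     [{"column": "a", "op": "=", "param": 2}]], partkey="a")
parts = {"p1": (0, 10), "p2": (10, 20)}
print(sorted(surviving_partitions(matched, {1: 5, 2: 15}, parts)))
```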

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-02-22 Thread Amit Langote
Hi David.

On 2018/02/21 18:06, David Rowley wrote:
> Other things I don't particularly like about the current patch are how
> I had to construct the partition key expressions in
> set_valid_runtime_subplans_recurse(). This pretty much uses code
> borrowed from set_baserel_partition_key_exprs(). One way around that
> would be to change the partprune.c code to deal with the
> partkey->partattrs and consume an expression from the list on attnum =
> 0. I didn't do that as I want to minimise how much I touch Amit's
> patch before it gets committed as doing so would likely just cause
> more headaches for me keeping this merged with his work. Another
> option to resolve this would be to put the partition key expressions
> into the PartitionPruneInfo.

Another way could be to refactor the code you've borrowed from
set_baserel_partition_key_exprs() into its own function that is exported
by some optimizer header.

I removed PartitionKey/Relation from signatures of various functions of my
patch to avoid having to re-heap_open() the relation per [1].

> I've attached v11 of the patch.

Some comments:

* I noticed that the patch adds a function to bitmapset.c which you might
want to extract into its own patch, like your patch to add bms_add_range()
that got committed as 84940644d [2].

* Maybe it's better to add `default: break;` in the two switches
you've added in partkey_datum_from_expr()

partprune.c: In function ‘partkey_datum_from_expr’:
partprune.c:1555:2: warning: enumeration value ‘T_Invalid’ not handled in
switch [-Wswitch]
  switch (nodeTag(expr))

partprune.c:1562:4: warning: enumeration value ‘PARAM_SUBLINK’ not handled
in switch [-Wswitch]
switch (((Param *) expr)->paramkind)

* I see that you moved PartitionClauseInfo's definition from partprune.c
to partition.h.  Isn't it better to move it to partprune.h?

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoabi-29Vs8H0xkjtYB%3DcU%2BGVCrNwPz7okpa3KsoLmdEUQ%40mail.gmail.com

[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=84940644d




Re: [HACKERS] Runtime Partition Pruning

2018-02-21 Thread David Rowley
On 21 February 2018 at 22:45, Rajkumar Raghuwanshi
 wrote:
> select count(*) from prt1 x where (x.a,x.b) in (select t1.a,t2.b from prt1
> t1,prt2 t2 where t1.a=t2.b)
> and (x.c) in (select t3.c from plt1 t3,plt2 t4 where t3.c=t4.c);

Thanks for the test case.

It seems like the x.c Param for some reason has a ParamId of 0. I need
to go learn a bit more about Params to understand why this is.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-02-21 Thread Rajkumar Raghuwanshi
On Wed, Feb 21, 2018 at 2:36 PM, David Rowley 
wrote:

> I've attached v11 of the patch.
>

Hi,

I have applied the attached patch on head
"6f1d723b6359507ef55a81617167507bc25e3e2b" over Amit's v30 patches. While
testing further I got a server crash with the below test case. Please take a
look.

CREATE TABLE prt1 (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (250);
CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (500) TO (600);
CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (250) TO (500);
INSERT INTO prt1 SELECT i, i, to_char(i, 'FM') FROM generate_series(0,
599, 2) i;
CREATE INDEX iprt1_p1_a on prt1_p1(a);
CREATE INDEX iprt1_p2_a on prt1_p2(a);
CREATE INDEX iprt1_p3_a on prt1_p3(a);
ANALYZE prt1;

CREATE TABLE prt2 (a int, b int, c varchar) PARTITION BY RANGE(b);
CREATE TABLE prt2_p1 PARTITION OF prt2 FOR VALUES FROM (0) TO (250);
CREATE TABLE prt2_p2 PARTITION OF prt2 FOR VALUES FROM (250) TO (500);
CREATE TABLE prt2_p3 PARTITION OF prt2 FOR VALUES FROM (500) TO (600);
INSERT INTO prt2 SELECT i, i, to_char(i, 'FM') FROM generate_series(0,
599, 3) i;
CREATE INDEX iprt2_p1_b on prt2_p1(b);
CREATE INDEX iprt2_p2_b on prt2_p2(b);
CREATE INDEX iprt2_p3_b on prt2_p3(b);
ANALYZE prt2;

CREATE TABLE plt1 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('', '0003',
'0004', '0010');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0001', '0005',
'0002', '0009');
CREATE TABLE plt1_p3 PARTITION OF plt1 FOR VALUES IN ('0006', '0007',
'0008', '0011');
INSERT INTO plt1 SELECT i, i, to_char(i/50, 'FM') FROM
generate_series(0, 599, 2) i;
CREATE INDEX iplt1_p1_c on plt1_p1(c);
CREATE INDEX iplt1_p2_c on plt1_p2(c);
CREATE INDEX iplt1_p3_c on plt1_p3(c);
ANALYZE plt1;

CREATE TABLE plt2 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('', '0003',
'0004', '0010');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0001', '0005',
'0002', '0009');
CREATE TABLE plt2_p3 PARTITION OF plt2 FOR VALUES IN ('0006', '0007',
'0008', '0011');
INSERT INTO plt2 SELECT i, i, to_char(i/50, 'FM') FROM
generate_series(0, 599, 3) i;
CREATE INDEX iplt2_p1_c on plt2_p1(c);
CREATE INDEX iplt2_p2_c on plt2_p2(c);
CREATE INDEX iplt2_p3_c on plt2_p3(c);
ANALYZE plt2;

select count(*) from prt1 x where (x.a,x.b) in (select t1.a,t2.b from prt1
t1,prt2 t2 where t1.a=t2.b)
and (x.c) in (select t3.c from plt1 t3,plt2 t4 where t3.c=t4.c);

/*
postgres=# select count(*) from prt1 x where (x.a,x.b) in (select t1.a,t2.b
from prt1 t1,prt2 t2 where t1.a=t2.b)
postgres-# and (x.c) in (select t3.c from plt1 t3,plt2 t4 where t3.c=t4.c);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
*/

stack-trace give below :

/*
(gdb) bt
#0  0x006ce6dc in ExecEvalParamExec (state=0x26e9ee0, op=0x26e9f78,
econtext=0x26ea390) at execExprInterp.c:
#1  0x006cc66a in ExecInterpExpr (state=0x26e9ee0,
econtext=0x26ea390, isnull=0x7ffe0f75d77f "") at execExprInterp.c:1024
#2  0x006cdd8c in ExecInterpExprStillValid (state=0x26e9ee0,
econtext=0x26ea390, isNull=0x7ffe0f75d77f "") at execExprInterp.c:1819
#3  0x007db078 in ExecEvalExprSwitchContext (state=0x26e9ee0,
econtext=0x26ea390, isNull=0x7ffe0f75d77f "") at
../../../../src/include/executor/executor.h:305
#4  0x007e2072 in evaluate_expr (expr=0x26a3cb0, result_type=25,
result_typmod=-1, result_collation=0) at clauses.c:4890
#5  0x007e588a in partkey_datum_from_expr (context=0x26d3180,
parttypid=25, expr=0x26a3cb0, value=0x7ffe0f75da00) at partprune.c:1504
#6  0x007e5243 in extract_bounding_datums (context=0x26d3180,
minimalclauses=0x7ffe0f75d900, keys=0x7ffe0f75da00) at partprune.c:1307
#7  0x007e377d in get_partitions_from_clauses (context=0x26d3180)
at partprune.c:273
#8  0x006ea2ec in set_valid_runtime_subplans_recurse
(node=0x269bf90, pinfo=0x7f6cf6765cf0, ctxcache=0x26d3158,
validsubplans=0x7ffe0f75de10) at nodeAppend.c:771
#9  0x006e9ebf in set_valid_runtime_subplans (node=0x269bf90) at
nodeAppend.c:640
#10 0x006e99b5 in choose_next_subplan_locally (node=0x269bf90) at
nodeAppend.c:426
#11 0x006e9598 in ExecAppend (pstate=0x269bf90) at nodeAppend.c:224
#12 0x006deb3a in ExecProcNodeFirst (node=0x269bf90) at
execProcnode.c:446
#13 0x006fb9ee in ExecProcNode (node=0x269bf90) at
../../../src/include/executor/executor.h:239
#14 0x006fbcc4 in ExecHashJoinImpl (pstate=0x2697808, parallel=0
'\000') at nodeHashjoin.c:262
#15 0x006fc3fd in ExecHashJoin (pstate=0x2697808) at
nodeHashjoin.c:565
#16 0x006deb3a in ExecProcNodeFirst (node=0x2697808) at
execProcnode.c:446
#17 0x0070c376 in ExecProcNode 

Re: [HACKERS] Runtime Partition Pruning

2018-02-20 Thread David Rowley
On 20 February 2018 at 23:46, Rajkumar Raghuwanshi
 wrote:
> I have applied the v10 patch on Amit's v27 over head ad7dbee36. I got "ERROR:
> partition missing from Append subplans" with the patch. On head, and with
> only Amit's patches applied, the query works fine. Please find the test case below.

Thanks for the test case. I can recreate locally. This is down to the
fact that make_partition_pruneinfo() only makes sub-PartitionPruneInfo
for actual subpaths found in the Append. Your test case happens to
match both the part_p1 and part_p2 partitions on the first level of
iteration, but since no child of part_p2 is found in
make_partition_pruneinfo, that element in the subpartindex never gets
set.

The fix might be to just remove the error and silently ignore those
cases, but I was hoping to keep that around as it might catch other
bugs. I'm just not sure yet how to do both.

I'll rebase this on Amit's latest patch and have a think about it
while doing that.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-02-20 Thread Rajkumar Raghuwanshi
On Sat, Feb 17, 2018 at 2:27 PM, David Rowley 
wrote:

> Hi,
>
> I've attached an updated patch, now at v10. v9 was short-lived due to
> the evolution of Amit's patch, which this is based on.
>
> This version is based on Amit's v27 of faster partition pruning [1]
> which can be applied atop of ad7dbee36.
>

Hi,

I have applied the v10 patch on Amit's v27 over head ad7dbee36. I got "ERROR:
partition missing from Append subplans" with the patch. On head, and with
only Amit's patches applied, the query works fine. Please find the test case below.

CREATE TABLE part ( c1 INT2, c2 DATE) PARTITION BY RANGE (c1);
CREATE TABLE part_p1 PARTITION OF part FOR VALUES FROM (0) TO (141)
PARTITION BY RANGE(c2);
CREATE TABLE part_p11 PARTITION OF part_p1 FOR VALUES FROM ('1/1/1997') TO
('2/1/1999');
CREATE TABLE part_p12 PARTITION OF part_p1 FOR VALUES FROM ('2/1/1999') TO
('2/1/2000');
CREATE TABLE part_p2 PARTITION OF part FOR VALUES FROM (141) TO (211)
PARTITION BY RANGE(c2);
CREATE TABLE part_p21 PARTITION OF part_p2 FOR VALUES FROM ('1/1/2000') TO
('2/1/2001');
CREATE TABLE part_p22 PARTITION OF part_p2 FOR VALUES FROM ('2/1/2001') TO
('2/1/2006');

INSERT INTO part VALUES (100,'1/1/1999');
INSERT INTO part VALUES (110,'1/1/1998');
INSERT INTO part VALUES (130,'1/1/2000');
INSERT INTO part VALUES (170,'1/1/2000');
INSERT INTO part VALUES (180,'1/1/2001');
INSERT INTO part VALUES (190,'1/1/2006');
INSERT INTO part VALUES (200,'1/1/2000');
INSERT INTO part VALUES (210,'1/1/2002');

postgres=# PREPARE RTP AS SELECT * FROM PART WHERE c2 BETWEEN '1/1/1998'
AND '1/1/1999';
PREPARE
postgres=# EXPLAIN execute RTP;
                              QUERY PLAN
----------------------------------------------------------------------
 Append  (cost=0.00..46.00 rows=12 width=6)
   ->  Seq Scan on part_p11  (cost=0.00..46.00 rows=12 width=6)
 Filter: ((c2 >= '1998-01-01'::date) AND (c2 <= '1999-01-01'::date))
(3 rows)

postgres=# execute RTP;
*ERROR:  partition missing from Append subplans*

deallocate RTP;
DROP TABLE part;

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: [HACKERS] Runtime Partition Pruning

2018-02-17 Thread David Rowley
Hi,

I've attached an updated patch, now at v10. v9 was short-lived due to
the evolution of Amit's patch, which this is based on.

This version is based on Amit's v27 of faster partition pruning [1]
which can be applied atop of ad7dbee36.

I've done a self review of this, but I've not yet done any final
polishing work as the faster partition pruning patch is still
evolving. I will, for example, likely need to do some more work in
nodeAppend.c to add a few more comments and probably think of better
names for a few new things that have made a first appearance in this
version of the patch.

[1] 
https://www.postgresql.org/message-id/520f8a71-286d-e36d-34cf-363fd7436...@lab.ntt.co.jp

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


runtime_prune_drowley_v10.patch
Description: Binary data


Re: [HACKERS] Runtime Partition Pruning

2018-01-09 Thread David Rowley
On Tue, Jan 9, 2018 at 2:24 PM, Beena Emerson  wrote:
> prepare abc_q1 (int, int, char) as select * from ab_c where a BETWEEN
> $1 and $2 AND b <= $3;
>
> --after 5 runs: abc_a2_b3 is not pruned.

This seems to be down to an existing bug. I'm not yet sure if it's
caused by faster_partition_prune_v17, or if it exists in master.
Basically RelOptInfo->partition_rels can contain duplicates for
relations. In your example while debugging make_partition_pruneinfo I
see:

list_nth_int(best_path->partitioned_rels,0)
1
list_nth_int(best_path->partitioned_rels,1)
3
list_nth_int(best_path->partitioned_rels,2)
8
list_nth_int(best_path->partitioned_rels,3)
13
list_nth_int(best_path->partitioned_rels,4)
3
list_nth_int(best_path->partitioned_rels,5)
8
list_nth_int(best_path->partitioned_rels,6)
13

There should only be 4 items in this list, not 7.

make_partition_pruneinfo might have been a bit naive to assume this
couldn't happen, so I've coded it to be a bit more resilient to this
happening. It'll still end up creating another sub-PartitionPruneInfo
and slotting into the same place, but it'll no longer attempt to
translate the prunequals twice... which was what was causing the
problem. I'd been a bit sloppy and assigned the output of
adjust_appendrel_attrs() back to the prunequals which is a parameter
to the function instead of assigning to a local variable like I've
done now.
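The pitfall described here, assigning a transformed value back to the input parameter inside a loop so that later iterations transform an already-transformed value, can be shown in miniature. This is toy code with made-up names, not PostgreSQL's actual adjust_appendrel_attrs() machinery:

```c
/* Hypothetical stand-in for adjust_appendrel_attrs(): translate a
 * parent attribute number by a child-specific offset. */
static int
translate(int attno, int offset)
{
    return attno + offset;
}

/* Buggy pattern: the translated value is written back into the input
 * parameter, so the second child translates an already-translated
 * value. */
static void
buggy(int qual_attno, const int offsets[], int n, int out[])
{
    for (int i = 0; i < n; i++)
    {
        qual_attno = translate(qual_attno, offsets[i]);   /* oops */
        out[i] = qual_attno;
    }
}

/* Fixed pattern: each child is translated from the original input,
 * using a local variable for the result. */
static void
fixed(int qual_attno, const int offsets[], int n, int out[])
{
    for (int i = 0; i < n; i++)
    {
        int child_attno = translate(qual_attno, offsets[i]);
        out[i] = child_attno;
    }
}
```

In the actual bug the "double translation" only surfaced once duplicates appeared in partitioned_rels, which is why it went unnoticed in the simpler test cases.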


On 9 January 2018 at 22:22, Beena Emerson  wrote:
>> postgres=# explain (analyze, costs off, summary off, timing off)
>> execute abc_q1 (1, 2);
>> ERROR:  partition missing from Append subplans

This also seems to be fixed by the above fix.

> =# explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 
> 1);
> ERROR:  operator 1057 is not a member of opfamily 1976

This seems to be broken in faster_partition_prune_v17 where in
classify_partition_bounding_keys() the code properly checks if the
clause matches the partition key for OpExpr, but fails to do the same
for ScalarArrayOpExpr. I'll report to Amit on the thread for that
patch.

I'll also investigate the duplication in RelOptInfo->partition_rels
and report that in another thread.

Can you confirm that cases 1 and 2 are working with the attached?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


runtime_prune_drowley_v8.patch
Description: Binary data


Re: [HACKERS] Runtime Partition Pruning

2018-01-09 Thread David Rowley
On 9 January 2018 at 22:22, Beena Emerson  wrote:
> ERROR:  operator 1057 is not a member of opfamily 1976

Thanks for finding these. I'm looking into the above, and the other
ones you've mentioned now.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-01-09 Thread Beena Emerson
Hi,

The mail was accidentally sent before I could complete.

On Tue, Jan 9, 2018 at 2:24 PM, Beena Emerson  wrote:
> Hello,
>
> The pruning does not work well with char type:
>
> Case 2: Case with optimizer pruning
> drop table ab_c;
>  create table ab_c (a int not null, b int) partition by list(a);
>  create table abc_a2 (b int, a int not null) partition by list(b);
>   create table abc_a2_b1 partition of abc_a2 for values in (1);
>   create table abc_a2_b2 partition of abc_a2 for values in (2);
>   create table abc_a2_b3 partition of abc_a2 for values in (3);
>   alter table ab_c attach partition abc_a2 for values in (2);
>   create table abc_a1 partition of ab_c for values in(1) partition by list 
> (b);
>   create table abc_a1_b1 partition of abc_a1 for values in (1);
>   create table abc_a1_b2 partition of abc_a1 for values in (2);
>   create table abc_a1_b3 partition of abc_a1 for values in (3);
>   create table abc_a3 partition of ab_c for values in(3) partition by list 
> (b);
>   create table abc_a3_b1 partition of abc_a3 for values in (1);
>   create table abc_a3_b2 partition of abc_a3 for values in (2);
>   create table abc_a3_b3 partition of abc_a3 for values in (3);
>   deallocate abc_q1;

Prepared statement is missing:
 prepare abc_q1 (int, int) as select a,b from ab_c where a BETWEEN $1
and $2 AND b IN (3, 2);
>
>
> =# explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 
> 1);
>   QUERY PLAN
> --
>  Append (actual rows=2 loops=1)
>->  Seq Scan on abc_a1_b2 (actual rows=1 loops=1)
>  Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
>->  Seq Scan on abc_a1_b3 (actual rows=1 loops=1)
>  Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
>->  Seq Scan on abc_a2_b2 (never executed)
>  Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
>->  Seq Scan on abc_a2_b3 (never executed)
>  Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
>->  Seq Scan on abc_a3_b2 (never executed)
>  Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
>->  Seq Scan on abc_a3_b3 (never executed)
>  Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
> (13 rows)
>
> postgres=# explain (analyze, costs off, summary off, timing off)
> execute abc_q1 (1, 2);
> ERROR:  partition missing from Append subplans

These work fine when the column order of the subpartitions is not changed.

Case 3: Optimizer pruning with char types:
Same as case1 with all subpartitions having same col order as parent.

drop table ab_c;
 create table ab_c (a int not null, b char) partition by list(a);
 create table abc_a2 ( a int not null, b char) partition by list(b);
  create table abc_a2_b1 partition of abc_a2 for values in ('1');
  create table abc_a2_b2 partition of abc_a2 for values in ('2');
  create table abc_a2_b3 partition of abc_a2 for values in ('3');
  alter table ab_c attach partition abc_a2 for values in (2);
  create table abc_a1 partition of ab_c for values in(1) partition by list (b);
  create table abc_a1_b1 partition of abc_a1 for values in ('1');
  create table abc_a1_b2 partition of abc_a1 for values in ('2');
  create table abc_a1_b3 partition of abc_a1 for values in ('3');
  create table abc_a3 partition of ab_c for values in(3) partition by list (b);
  create table abc_a3_b1 partition of abc_a3 for values in ('1');
  create table abc_a3_b2 partition of abc_a3 for values in ('2');
  create table abc_a3_b3 partition of abc_a3 for values in ('3');

deallocate abc_q1;
  prepare abc_q1 (int, int) as select a,b from ab_c where a BETWEEN $1
and $2 AND b IN ('3', '2');

-- before runtime pruning
=# explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 8);
QUERY PLAN
---
 Append (actual rows=0 loops=1)
   ->  Seq Scan on abc_a1_b2 (actual rows=0 loops=1)
 Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
   ->  Seq Scan on abc_a1_b3 (actual rows=0 loops=1)
 Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
   ->  Seq Scan on abc_a2_b2 (actual rows=0 loops=1)
 Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
   ->  Seq Scan on abc_a2_b3 (actual rows=0 loops=1)
 Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
   ->  Seq Scan on abc_a3_b2 (actual rows=0 loops=1)
 Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
   ->  Seq Scan on abc_a3_b3 (actual rows=0 loops=1)
 Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
(13 rows)

-- after 5 runs

=# explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 1);
ERROR:  operator 1057 is not a 

Re: [HACKERS] Runtime Partition Pruning

2018-01-09 Thread Beena Emerson
Hello,

On Sun, Jan 7, 2018 at 5:31 PM, David Rowley
 wrote:
> On 7 January 2018 at 00:03, David Rowley  wrote:
>> I've fixed this in the attached, but I did so by calling
>> adjust_appendrel_attrs() from the nodeAppend.c, which did, of course,
>> mean that the AppendRelInfo needed to be given to the executor. I was
>> also a bit unsure what exactly I should be doing in primnodes.h, since
>> I've put PartitionPruneInfo in there, but AppendRelInfo is not. I
>> stuck a quick declaration of AppendRelInfo in primnode.h with an XXX
>> comment so we don't forget to think about that again.
>
> Actually, this was not a very smart fix for the problem. It seems much
> better to make the prune qual part of PartitionPruneInfo and just have
> the planner translate the qual to what's required for the partition
> that the PartitionPruneInfo belongs to. This means we no longer need
> to use the Append's qual to store the prune qual and that all the
> pruning information for one partition is now neatly in a single
> struct.
>
> I've attached a patch which does things like this.


The pruning does not work well with char type:

Case: A subpartition has a different col order and the subpartitioned
col is type char.

drop table ab_c;
 create table ab_c (a int not null, b char) partition by list(a);
  create table abc_a2 (b char, a int not null) partition by list(b);
  create table abc_a2_b1 partition of abc_a2 for values in ('1');
  create table abc_a2_b2 partition of abc_a2 for values in ('2');
  create table abc_a2_b3 partition of abc_a2 for values in ('3');
  alter table ab_c attach partition abc_a2 for values in (2);
  create table abc_a1 partition of ab_c for values in(1) partition by list (b);
  create table abc_a1_b1 partition of abc_a1 for values in ('1');
  create table abc_a1_b2 partition of abc_a1 for values in ('2');
  create table abc_a1_b3 partition of abc_a1 for values in ('3');
  create table abc_a3 partition of ab_c for values in(3) partition by list (b);
  create table abc_a3_b1 partition of abc_a3 for values in ('1');
  create table abc_a3_b2 partition of abc_a3 for values in ('2');
  create table abc_a3_b3 partition of abc_a3 for values in ('3');
  deallocate abc_q1;

INSERT INTO ab_c VALUES (1,'1'), (1,'2'), (1,'3');
INSERT INTO ab_c VALUES (2,'1'), (2,'2'), (2,'3');
INSERT INTO ab_c VALUES (3,'1'), (3,'2'), (3,'3');

prepare abc_q1 (int, int, char) as select * from ab_c where a BETWEEN
$1 and $2 AND b <= $3;

--after 5 runs: abc_a2_b3 is not pruned.

# explain (analyze, costs off, summary off, timing off) execute abc_q1
(1, 2, '2');
   QUERY PLAN
-
 Append (actual rows=4 loops=1)
   ->  Seq Scan on abc_a1_b1 (actual rows=1 loops=1)
 Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a1_b2 (actual rows=1 loops=1)
 Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a1_b3 (never executed)
 Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a2_b1 (actual rows=1 loops=1)
 Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a2_b2 (actual rows=1 loops=1)
 Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a2_b3 (actual rows=0 loops=1)
 Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
 Rows Removed by Filter: 1
   ->  Seq Scan on abc_a3_b1 (never executed)
 Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a3_b2 (never executed)
 Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a3_b3 (never executed)
 Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
(20 rows)


Case 2: Case with optimizer pruning
drop table ab_c;
 create table ab_c (a int not null, b int) partition by list(a);
 create table abc_a2 (b int, a int not null) partition by list(b);
  create table abc_a2_b1 partition of abc_a2 for values in (1);
  create table abc_a2_b2 partition of abc_a2 for values in (2);
  create table abc_a2_b3 partition of abc_a2 for values in (3);
  alter table ab_c attach partition abc_a2 for values in (2);
  create table abc_a1 partition of ab_c for values in(1) partition by list (b);
  create table abc_a1_b1 partition of abc_a1 for values in (1);
  create table abc_a1_b2 partition of abc_a1 for values in (2);
  create table abc_a1_b3 partition of abc_a1 for values in (3);
  create table abc_a3 partition of ab_c for values in(3) partition by list (b);
  create table abc_a3_b1 partition of abc_a3 for values in (1);
  create table abc_a3_b2 partition of abc_a3 for values in (2);
  create table abc_a3_b3 partition of abc_a3 for values in (3);
  deallocate abc_q1;


=# explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 1);
  QUERY PLAN
--
 Append 

Re: [HACKERS] Runtime Partition Pruning

2018-01-07 Thread David Rowley
On 7 January 2018 at 00:03, David Rowley  wrote:
> I've fixed this in the attached, but I did so by calling
> adjust_appendrel_attrs() from the nodeAppend.c, which did, of course,
> mean that the AppendRelInfo needed to be given to the executor. I was
> also a bit unsure what exactly I should be doing in primnodes.h, since
> I've put PartitionPruneInfo in there, but AppendRelInfo is not. I
> stuck a quick declaration of AppendRelInfo in primnode.h with an XXX
> comment so we don't forget to think about that again.

Actually, this was not a very smart fix for the problem. It seems much
better to make the prune qual part of PartitionPruneInfo and just have
the planner translate the qual to what's required for the partition
that the PartitionPruneInfo belongs to. This means we no longer need
to use the Append's qual to store the prune qual and that all the
pruning information for one partition is now neatly in a single
struct.

I've attached a patch which does things like this.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


runtime_prune_drowley_v7.patch
Description: Binary data


Re: [HACKERS] Runtime Partition Pruning

2018-01-04 Thread Alvaro Herrera
David Rowley wrote:

> Looks like it's down to ExplainPropertyFloat() having
> machine-dependent behaviour.
> 
> On the machine that I was working with when testing this the following
> code outputs "1"
> [ sample code ]
> 
> but on your machine it must be outputting "0"?

Yeah, it does.  Thanks for updating --- I'll look at your patch
tomorrow.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


