Re: [HACKERS] asynchronous and vectorized execution

2016-09-23 Thread Amit Khandekar
On 13 September 2016 at 20:20, Robert Haas  wrote:

> On Mon, Aug 29, 2016 at 4:08 AM, Kyotaro HORIGUCHI
>  wrote:
> > [ new patches ]
>
> +/*
> + * We assume that few nodes are async-aware and async-unaware
> + * nodes cannot be reverse-dispatched from lower nodes that are
> + * async-aware. Firing of an async node that is not a descendant
> + * of the planstate will cause such reverse-dispatching to
> + * async-aware nodes, which is unexpected behavior for them.
> + *
> + * For instance, consider an async-unaware Hashjoin(OUTER, INNER)
> + * where the OUTER is running asynchronously but the Hashjoin is
> + * waiting on the async INNER during inner-hash creation. If the
> + * OUTER fires for the case, since anyone is waiting on it,
> + * ExecAsyncWaitForNode finally dispatches to the Hashjoin which
> + * is now in the middle of doing its work.
> + */
> +if (!IsParent(planstate, node))
> +continue;
>
> I'm not entirely sure that I understand this comment, but I don't
> think it's going in the right direction.  Let's start with the example
> in the second paragraph. If the hash join is async-unaware, then it
> isn't possible for the hash join to be both running the outer side of
> the join asynchronously and at the same time waiting on the inner
> side.  Once it tries to pull the first tuple from the outer side, it's
> waiting for that to finish and can't do anything else.  So, the inner
> side can't possibly get touched in any way until the outer side
> finishes.  For anything else to happen, the hash join would have to be
> async-aware.  Even if we did that, I don't think it would be right to
> kick off both sides of the hash join at the same time.  Right now, if
> the outer side turns out to be empty, we never need to build the hash
> table, and that's good.
>

I feel the !IsParent() condition is actually there to prevent the infinite
wait caused by the re-entrancy issue in ExecAsyncWaitForNode() that Kyotaro
mentioned earlier. But yes, the comment doesn't explain exactly how the hash
join can cause that re-entrancy.

I attempted to come up with a test case that reproduces the infinite wait in
ExecAsyncWaitForNode() after removing the !IsParent() check, so that nodes
from other subtrees are also included, but I couldn't reproduce it. Kyotaro,
is it possible for you to give a test case that consistently hangs if we
revert the !IsParent() check?

I was also thinking about another possibility where the same plan state node
is re-entered, as explained below.

>
> I don't think it's a good idea to wait for only nodes that are in the
> current subtree.  For example, consider a plan like this:
>
> Append
> -> Foreign Scan on a
> -> Hash Join
>   -> Foreign Scan on b
>   -> Hash
> -> Seq Scan on x
>
> Suppose Append and Foreign Scan are parallel-aware but the other nodes
> are not.  Append kicks off the Foreign Scan on a and then waits for
> the hash join to produce a tuple; the hash join kicks off the Foreign
> Scan on b and waits for it to return a tuple.  If, while we're waiting
> for the foreign scan on b, the foreign scan on a needs some attention
> - either to produce tuples, or maybe just to call PQconsumeInput() so
> that more data can be sent from the other side, I think we need to be
> able to do that.  There's no real problem here; even if the Append
> becomes result-ready before the hash join returns, that is fine.


Yes, I agree: we should be able to do this. Since we have all the waiting
events in a common estate, there's no harm in starting execution of nodes
from another subtree if we get an event from there.

But I am thinking about what would happen when this node from the other
subtree returns result_ready, then its parents are called, and then the
result gets bubbled up to the node which had already caused us to call
ExecAsyncWaitForNode() in the first place.

For example, in the above plan which you specified, suppose:
1. Hash Join has called ExecProcNode() for the child foreign scan b, and so
   is waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
2. The event wait list already has foreign scan on a, which is in a
   different subtree.
3. This foreign scan a happens to be ready, so inside
   ExecAsyncWaitForNode(), ExecDispatchNode(foreign_scan_a) is called,
   which returns with result_ready.
4. Since it returns result_ready, its parent node is now inserted in the
   callbacks array, and so its parent (Append) is executed.
5. But this Append planstate is already in the middle of executing the
   Hash Join, and is waiting for it.

Is it safe to execute the same plan state when it is already in the middle
of its execution? In other words, is the plan state re-entrant? I suspect
the new execution may even corrupt the structures with which it was already
executing.
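
To make the concern concrete, the kind of guard I am imagining looks
something like the sketch below. This is purely hypothetical: ps_executing
is an imagined new PlanState flag (it does not exist today), and
ExecProcNode() just stands in for whatever dispatch call the patch uses.

#include "postgres.h"

#include "executor/executor.h"
#include "nodes/execnodes.h"

/*
 * Hypothetical sketch, not part of any posted patch: detect re-entry of a
 * plan state that is still in the middle of producing a tuple.
 * "ps_executing" is an assumed new boolean field in PlanState.
 * (Error paths are ignored for brevity.)
 */
static TupleTableSlot *
ExecDispatchNodeGuarded(PlanState *node)
{
    TupleTableSlot *slot;

    if (node->ps_executing)
        elog(ERROR, "plan state re-entered while it is still executing");

    node->ps_executing = true;
    slot = ExecProcNode(node);  /* stand-in for the patch's dispatch call */
    node->ps_executing = false;

    return slot;
}

Whether to error out like this, or instead just skip dispatching to such a
node and keep waiting, is exactly the kind of thing I am not sure about.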

In usual cases, a tree can contain m

Re: [HACKERS] asynchronous execution

2016-09-27 Thread Amit Khandekar
On 24 September 2016 at 06:39, Robert Haas  wrote:
> Since Kyotaro Horiguchi found that my previous design had a
> system-wide performance impact due to the ExecProcNode changes, I
> decided to take a different approach here: I created an async
> infrastructure where both the requestor and the requestee have to be
> specifically modified to support parallelism, and then modified Append
> and ForeignScan to cooperate using the new interface.  Hopefully that
> means that anything other than those two nodes will suffer no
> performance impact.  Of course, it might have other problems

I see that the reason you re-designed the asynchronous execution
implementation is that the earlier implementation showed performance
degradation in local sequential and local parallel scans. But I checked
that the ExecProcNode() changes were not significant enough to cause the
degradation: it will not call ExecAsyncWaitForNode() unless that node
supports asynchrony. Do you feel there is anywhere else in the
implementation that is really causing this degradation? That previous
implementation has some issues, but they seemed solvable. We could resolve
the plan state recursion issue by explicitly making sure the same plan
state does not get called again while it is already executing.

Thanks
-Amit Khandekar




Re: [HACKERS] UPDATE of partition key

2017-10-03 Thread Amit Khandekar
On 30 September 2017 at 01:26, Robert Haas  wrote:
> On Fri, Sep 29, 2017 at 3:53 PM, Robert Haas  wrote:
>> On Fri, Sep 22, 2017 at 1:57 AM, Amit Khandekar  
>> wrote:
>>> The patch for the above change is :
>>> 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch
>>
>> Thinking about this a little more, I'm wondering about how this case
>> arises.  I think that for this patch to avoid multiple conversions,
>> we'd have to be calling map_variable_attnos on an expression and then
>> calling map_variable_attnos on that expression again.

We are not calling map_variable_attnos() twice. The first time it is
called, the ConvertRowtypeExpr node is already there if the expression is
a whole-row var; that node was already added by adjust_appendrel_attrs().
So the conversion is done by two different functions.

For ConvertRowtypeExpr, map_variable_attnos_mutator() recursively calls
itself for ConvertRowtypeExpr->arg with coerced_var=true.

>
> I guess I didn't quite finish this thought, sorry.  Maybe it's
> obvious, but the point I was going for is: why would we do that, vs.
> just converting once?

The first time a ConvertRowtypeExpr node gets added to the expression is
when adjust_appendrel_attrs() is called for each of the child tables.
Here, for each child table, when the parent parse tree is converted into
the child parse tree, the whole-row var (in the RETURNING or WITH CHECK
OPTIONS expr) is wrapped in a ConvertRowtypeExpr, so the child parse tree
(or the child WCO expr) has this ConvertRowtypeExpr node.

The second time this node is added is during update-tuple-routing in
ExecInitModifyTable(), when map_partition_varattnos() is called for
each of the partitions to convert from the first per-subplan
RETURNING/WCO expression to the RETURNING/WCO expression belonging to
the leaf partition. This second conversion happens for the leaf
partitions which are not already present in per-subplan UPDATE result
rels.

So the first conversion is from parent to child while building
per-subplan plans, and the second is from first per-subplan child to
another child for building expressions of the leaf partitions.

So suppose the root partitioned table's RETURNING expression is a whole-row
var wr(r), where r is the composite type representing the root table type.
Then, one of its UPDATE child tables will have its RETURNING expression
converted like this :
wr(r)  ===>  CRE(r) -> wr(c1)
where CRE(r) represents a ConvertRowtypeExpr of result type r, whose arg
points to wr(c1), a whole-row var of composite type c1 for the child table
c1. So this node converts from the composite type of the child table to
the composite type of the root table.

Now, when the second conversion occurs for the leaf partition (i.e. during
update-tuple-routing), the conversion looks like this :
CRE(r) -> wr(c1)  ===>  CRE(r) -> wr(c2)
But without the 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch, the
conversion would have looked like this :
CRE(r) -> wr(c1)  ===>  CRE(r) -> CRE(c1) -> wr(c2)
In short, we omit the intermediate CRE(c1) node.


While writing this down, I observed that after multi-level partition
tree expansion was introduced, the child table expressions are not
converted directly from the root. Instead, they are converted from
their immediate parent. So there is a chain of conversions : to leaf
from its parent, to that parent from its parent, and so on from the
root. Effectively, during the first conversion, there are that many
ConvertRowtypeExpr nodes one above the other already present in the
UPDATE result rel expressions. But my patch handles the optimization
only for the leaf partition conversions.

If the expression already has a CRE : CRE(rr) -> wr(r)
Parent-to-child conversion : CRE(rr) -> wr(r)  ===>  CRE(rr) -> CRE(r) -> wr(c1)
With the patch : CRE(rr) -> CRE(r) -> wr(c1)  ===>  CRE(rr) -> CRE(r) -> wr(c2)




Re: [HACKERS] Parallel Append implementation

2017-10-05 Thread Amit Khandekar
On 30 September 2017 at 19:21, Amit Kapila  wrote:
> On Wed, Sep 20, 2017 at 10:59 AM, Amit Khandekar  
> wrote:
>> On 16 September 2017 at 10:42, Amit Kapila  wrote:
>>>
>>> At a broader level, the idea is good, but I think it won't turn out
>>> exactly like that considering your below paragraph which indicates
>>> that it is okay if leader picks a partial path that is costly among
>>> other partial paths as a leader won't be locked with that.
>>> Considering this is a good design for parallel append, the question is
>>> do we really need worker and leader to follow separate strategy for
>>> choosing next path.  I think the patch will be simpler if we can come
>>> up with a way for the worker and leader to use the same strategy to
>>> pick next path to process.  How about we arrange the list of paths
>>> such that first, all partial paths will be there and then non-partial
>>> paths and probably both in decreasing order of cost.  Now, both leader
>>> and worker can start from the beginning of the list. In most cases,
>>> the leader will start at the first partial path and will only ever
>>> need to scan non-partial path if there is no other partial path left.
>>> This is not bulletproof as it is possible that some worker starts
>>> before leader in which case leader might scan non-partial path before
>>> all partial paths are finished, but I think we can avoid that as well
>>> if we are too worried about such cases.
>>
>> If there are no partial subpaths, then again the leader is likely to
>> take up the expensive subpath. And this scenario would not be
>> uncommon.
>>
>
> While thinking about how common the case of no partial subpaths would
> be, it occurred to me that as of now we always create a partial path
> for the inheritance child if it is parallel-safe and the user has not
> explicitly set the value of parallel_workers to zero (refer
> compute_parallel_worker).  So, unless you are planning to change that,
> I think it will be quite uncommon to have no partial subpaths.

There are still cases where a child rel can have a non-partial path that
is cheaper than its partial paths. Also, there can be UNION ALL queries
that have non-partial subpaths. I guess this has already been discussed
in the other replies.

>
> Few nitpicks in your latest patch:
> 1.
> @@ -298,6 +366,292 @@ ExecReScanAppend(AppendState *node)
>   if (subnode->chgParam == NULL)
>   ExecReScan(subnode);
>   }
> +
>
> Looks like a spurious line.
>
> 2.
> @@ -1285,7 +1291,11 @@ add_paths_to_append_rel(PlannerInfo *root,
> RelOptInfo *rel,
> ..
> + if (chosen_path && chosen_path != cheapest_partial_path)
> + pa_all_partial_subpaths = false;
>
> It will keep on setting pa_all_partial_subpaths as false for
> non-partial paths which don't seem to be the purpose of this variable.
> I think you want it to be set even when there is one non-partial path,
> so isn't it better to write as below or something similar:
> if (pa_nonpartial_subpaths && pa_all_partial_subpaths)
> pa_all_partial_subpaths = false;

Ok. How about removing pa_all_partial_subpaths altogether, and
instead of the below condition :

/*
* If all the child rels have partial paths, and if the above Parallel
* Append path has a mix of partial and non-partial subpaths, then consider
* another Parallel Append path which will have *all* partial subpaths.
* If enable_parallelappend is off, make this one non-parallel-aware.
*/
if (partial_subpaths_valid && !pa_all_partial_subpaths)
..

Use this condition :
if (partial_subpaths_valid && pa_nonpartial_subpaths != NIL)
..




Regarding a mix of partial and non-partial paths, I feel it always
makes sense for the leader to choose the partial path. If it chooses a
non-partial path, no other worker would be able to help finish that
path. Among the partial paths, whether it chooses the cheapest one or
expensive one does not matter, I think. We have the partial paths
unordered. So whether it starts from the last partial path or the
first partial path should not matter.

Regarding the scenario where all paths are non-partial, here is an example :
Suppose we have 4 child paths with costs 10, 5, 5, 3, and we have 2
workers plus one leader. And suppose the leader additionally takes 1/4th
of these costs to process the returned tuples.

If the leader takes the least expensive one (3) :
the 2 workers will finish 10, 5, 5 in 10 units,
and the leader simultaneously runs the plan with cost 3, so it
takes 3 + (1/4)(10 + 5 + 5 + 3) = 8.75, i.e. about 9 units.
So the total time taken by Append is 10.


Whereas if the leader takes the most expensive one (10) :
10 + (1/4)(10 + 5 + 5 + 3) = 15.75, i.e. about 16 units.
The 2 workers will finish the 2nd, 3rd and 4th plans

Re: [HACKERS] Parallel Append implementation

2017-10-05 Thread Amit Khandekar
On 6 October 2017 at 08:49, Amit Kapila  wrote:
> On Thu, Oct 5, 2017 at 4:11 PM, Amit Khandekar  wrote:
>>
>> Ok. How about removing pa_all_partial_subpaths altogether , and
>> instead of the below condition :
>>
>> /*
>> * If all the child rels have partial paths, and if the above Parallel
>> * Append path has a mix of partial and non-partial subpaths, then consider
>> * another Parallel Append path which will have *all* partial subpaths.
>> * If enable_parallelappend is off, make this one non-parallel-aware.
>> */
>> if (partial_subpaths_valid && !pa_all_partial_subpaths)
>> ..
>>
>> Use this condition :
>> if (partial_subpaths_valid && pa_nonpartial_subpaths != NIL)
>> ..
>>
>
> Sounds good to me.
>
> One minor point:
>
> + if (!node->as_padesc)
> + {
> + /*
> + */
> + if (!exec_append_seq_next(node))
> + return ExecClearTuple(node->ps.ps_ResultTupleSlot);
> + }
>
> It seems either you want to add a comment in above part of patch or
> you just left /**/ mistakenly.

Oops. Yeah, the comment wrapper remained there when I moved its
content "This is Parallel-aware append. Follow it's own logic ..." out
of the if block. Since this is too small a change for an updated
patch, I will do this along with any other changes that would be
required as the review progresses.

>
>> 
>>
>>
>> Regarding a mix of partial and non-partial paths, I feel it always
>> makes sense for the leader to choose the partial path.
>>
>
> Okay, but why not cheapest partial path?

I gave some thought to this point. Overall I feel it does not matter
which partial path the leader picks up. Right now the partial paths are
not ordered. But for the sake of non-partial paths, we are just choosing
the very last path. So in case of mixed paths, the leader will get a
partial path, but that partial path would not be the cheapest one. If we
also ordered the partial paths, the same logic would then pick up the
cheapest partial path. The question is, should we also order the partial
paths for the leader ?

The only scenario I see where the leader choosing the cheapest partial
path *might* show some benefit is if there are partial paths that need to
do some startup work using only one worker. I think parallel hash join is
one such case, where it builds the hash table; there is work towards a
parallel hash build, but I am not sure about its status. For such a plan,
if the leader starts it, it would be slow, and no other worker would be
able to help it, so its actual startup cost would be drastically
increased. (Another case is parallel bitmap heap scan, where the leader
has to do something while the other workers wait; but there, I think it's
not much work for the leader to do.) So overall, to handle such cases,
it's better for the leader to choose the cheapest path, or maybe a path
with the cheapest startup cost. We can also consider sorting the partial
paths by decreasing startup cost.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-10-11 Thread Amit Khandekar
On 9 October 2017 at 16:03, Amit Kapila  wrote:
> On Fri, Oct 6, 2017 at 12:03 PM, Amit Khandekar  
> wrote:
>> On 6 October 2017 at 08:49, Amit Kapila  wrote:
>>>
>>> Okay, but why not cheapest partial path?
>>
>> I gave some thought on this point. Overall I feel it does not matter
>> which partial path it should pick up. RIght now the partial paths are
>> not ordered. But for non-partial paths sake, we are just choosing the
>> very last path. So in case of mixed paths, leader will get a partial
>> path, but that partial path would not be the cheapest path. But if we
>> also order the partial paths, the same logic would then pick up
>> cheapest partial path. The question is, should we also order the
>> partial paths for the leader ?
>>
>> The only scenario I see where leader choosing cheapest partial path
>> *might* show some benefit, is if there are some partial paths that
>> need to do some startup work using only one worker. I think currently,
>> parallel hash join is one case where it builds the hash table, but I
>> guess here also, we support parallel hash build, but not sure about
>> the status.
>>
>
> You also need to consider how merge join is currently work in parallel
> (each worker need to perform the whole of work of right side).

Yes, here if the leader happens to take the right side, it may slow
down the overall merge join. But this seems to be a different case
than the case of high startup costs.

>  I think there could be more scenario's where the startup cost is high
> and parallel worker needs to do that work independently.

True.

>
>  For such plan, if leader starts it, it would be slow, and
>> no other worker would be able to help it, so its actual startup cost
>> would be drastically increased. (Another path is parallel bitmap heap
>> scan where the leader has to do something and the other workers wait.
>> But here, I think it's not much work for the leader to do). So
>> overall, to handle such cases, it's better for leader to choose a
>> cheapest path, or may be, a path with cheapest startup cost. We can
>> also consider sorting partial paths with decreasing startup cost.
>>
>
> Yeah, that sounds reasonable.

Attached patch sorts partial paths by descending startup cost.
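
For reference, the ordering rule boils down to a comparator like the
sketch below (just to illustrate the rule; the actual list-sorting helper
in the patch may take a slightly different comparator signature):

#include "postgres.h"

#include "nodes/relation.h"

/*
 * Illustrative comparator only: order paths by descending startup cost,
 * so the most expensive-to-start partial subpaths come first.
 */
static int
cmp_path_startup_cost_desc(const void *a, const void *b)
{
    const Path *patha = *(Path *const *) a;
    const Path *pathb = *(Path *const *) b;

    if (patha->startup_cost > pathb->startup_cost)
        return -1;
    if (patha->startup_cost < pathb->startup_cost)
        return 1;
    return 0;
}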


On 6 October 2017 at 08:49, Amit Kapila  wrote:
> On Thu, Oct 5, 2017 at 4:11 PM, Amit Khandekar  wrote:
>>
>> Ok. How about removing pa_all_partial_subpaths altogether , and
>> instead of the below condition :
>>
>> /*
>> * If all the child rels have partial paths, and if the above Parallel
>> * Append path has a mix of partial and non-partial subpaths, then consider
>> * another Parallel Append path which will have *all* partial subpaths.
>> * If enable_parallelappend is off, make this one non-parallel-aware.
>> */
>> if (partial_subpaths_valid && !pa_all_partial_subpaths)
>> ..
>>
>> Use this condition :
>> if (partial_subpaths_valid && pa_nonpartial_subpaths != NIL)
>> ..
>>
>
> Sounds good to me.

Did this. Here is the new condition I used, along with the comments
explaining it:

+    * If parallel append has not been added above, or the added one has a mix
+    * of partial and non-partial subpaths, then consider another Parallel
+    * Append path which will have *all* partial subpaths. We can add such a
+    * path only if all childrels have partial paths in the first place. This
+    * new path will be parallel-aware unless enable_parallelappend is off.
      */
-   if (partial_subpaths_valid && !pa_all_partial_subpaths)
+   if (partial_subpaths_valid &&
+       (!pa_subpaths_valid || pa_nonpartial_subpaths != NIL))

Also added some test scenarios.

On 6 October 2017 at 12:03, Amit Khandekar  wrote:
> On 6 October 2017 at 08:49, Amit Kapila  wrote:
>>
>> One minor point:
>>
>> + if (!node->as_padesc)
>> + {
>> + /*
>> + */
>> + if (!exec_append_seq_next(node))
>> + return ExecClearTuple(node->ps.ps_ResultTupleSlot);
>> + }
>>
>> It seems either you want to add a comment in above part of patch or
>> you just left /**/ mistakenly.
>
> Oops. Yeah, the comment wrapper remained there when I moved its
> content "This is Parallel-aware append. Follow it's own logic ..." out
> of the if block.

Removed the comment wrapper.


Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v17.patch
Description: Binary data



Re: [HACKERS] pgsql: Avoid coercing a whole-row variable that is already coerced

2017-10-13 Thread Amit Khandekar
Bringing here the mail thread from pgsql-committers regarding this commit:

commit 1c497fa72df7593d8976653538da3d0ab033207f
Author: Robert Haas 
Date:   Thu Oct 12 17:10:48 2017 -0400

Avoid coercing a whole-row variable that is already coerced.

Marginal efficiency and beautification hack.  I'm not sure whether
this case ever arises currently, but the pending patch for update
tuple routing will cause it to arise.

    Amit Khandekar

Discussion:
http://postgr.es/m/caj3gd9cazfppe7-wwubabpcq4_0subkipfd1+0r5_dkvnwo...@mail.gmail.com


Tom Lane  wrote:
> Robert Haas  wrote:
> > Avoid coercing a whole-row variable that is already coerced.
>
> This logic seems very strange and more than likely buggy: the
> added coerced_var flag feels like the sort of action at a distance
> that is likely to have unforeseen side effects.
>
> I'm not entirely sure what the issue is here, but if you're concerned
> about not applying two ConvertRowtypeExprs in a row, why not have the
> upper one just strip off the lower one?  We handle, eg, nested
> RelabelTypes that way.

We kind of do a similar thing. When a ConvertRowtypeExpr node is
encountered, we create a new ConvertRowtypeExpr that points to a new
var, and return this new ConvertRowtypeExpr instead of the existing
one. So we actually replace the old with a new one. But additionally,
we also want to change the vartype of the new var to
context->to_rowtype.

I think you are worried specifically about coerced_var causing an
unexpected regression in existing scenarios, such as
context->coerced_var getting set and prematurely unset in recursive
scenarios. But note that, when we call map_variable_attnos_mutator()
just after setting context->coerced_var = true,
map_variable_attnos_mutator() won't recurse further, because it is
always called with a Var, which does not have any further arguments to
process. So coerced_var won't be changed again until we return from
map_variable_attnos_mutator().

The only reason we chose to call map_variable_attnos_mutator() with a
Var is so that we can re-use the code that converts the whole-row var.

One thing we can do is: instead of calling
map_variable_attnos_mutator(), convert the var inside the if block for
"if (IsA(node, ConvertRowtypeExpr))". Please check the attached patch.
There, I have avoided the coerced_var context variable.
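
To show the shape of the idea (a rough sketch only, not the attached
patch; the context field names are simply what I assume the existing
mutator already carries around):

/* Rough sketch of the ConvertRowtypeExpr branch; not the attached patch. */
if (IsA(node, ConvertRowtypeExpr))
{
    ConvertRowtypeExpr *cre = (ConvertRowtypeExpr *) node;

    if (IsA(cre->arg, Var))
    {
        Var        *var = (Var *) cre->arg;

        if (var->varno == context->target_varno &&
            var->varattno == 0 &&
            var->varlevelsup == context->sublevels_up)
        {
            /* Convert the whole-row Var to the new rowtype right here... */
            Var        *newvar = (Var *) copyObject(var);

            newvar->vartype = context->to_rowtype;
            *context->found_whole_row = true;

            /*
             * ...and keep a single ConvertRowtypeExpr back to the original
             * result type, instead of stacking another one on top.
             */
            cre = (ConvertRowtypeExpr *) copyObject(cre);
            cre->arg = (Expr *) newvar;
            return (Node *) cre;
        }
    }
    /* otherwise fall through to expression_tree_mutator() as usual */
}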


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


handle-redundant-ConvertRowtypeExpr-node.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-10-19 Thread Amit Khandekar
On 13 October 2017 at 00:29, Robert Haas  wrote:
> On Wed, Oct 11, 2017 at 8:51 AM, Amit Khandekar  
> wrote:
>> [ new patch ]
>
> + parallel_append
> + Waiting to choose the next subplan during Parallel Append 
> plan
> + execution.
> +
> +
>
> Probably needs to update a morerows values of some earlier entry.

From what I observed in the other places, the morerows value is one
less than the number of following entries. I have changed it to 10
since it has 11 entries.

>
> +   enable_parallelappend configuration
> parameter
>
> How about enable_parallel_append?

Done.

>
> + * pa_finished : workers currently executing the subplan. A worker which
>
> The way the colon is used here is not a standard comment style for PostgreSQL.

Changed it to "pa_finished:".

>
> + * Go on to the "next" subplan. If no more subplans, return the empty
> + * slot set up for us by ExecInitAppend.
> + * Note: Parallel-aware Append follows different logic for choosing 
> the
> + * next subplan.
>
> Formatting looks wrong, and moreover I don't think this is the right
> way of handling this comment anyway.  Move the existing comment inside
> the if (!node->padesc) block and leave it unchanged; the else block
> explains the differences for parallel append.

I think the first couple of lines do apply to both parallel-append and
sequential append plans. I have moved the remaining couple of lines
inside the else block.

>
> + *ExecAppendEstimate
> + *
> + *estimates the space required to serialize Append node.
>
> Ugh, this is wrong, but I notice it follows various other
> equally-wrong comments for other parallel-aware node types. I guess
> I'll go fix that.  We are not in serializing the Append node.

I didn't clearly get this. Do you think it should be "space required to
copy the Append node into shared memory"?

>
> I do not think that it's a good idea to call
> exec_append_parallel_next() from ExecAppendInitializeDSM,
> ExecAppendReInitializeDSM, and ExecAppendInitializeWorker.  We want to
> postpone selecting which plan to run until we're actually ready to run
> that plan.  Otherwise, for example, the leader might seize a
> non-partial plan (if only such plans are included in the Parallel
> Append) when it isn't really necessary for it to do so.  If the
> workers would've reached the plans and started returning tuples to the
> leader before it grabbed a plan, oh well, too bad.  The leader's still
> claimed that plan and must now run it.
>
> I concede that's not a high-probability scenario, but I still maintain
> that it is better for processes not to claim a subplan until the last
> possible moment.  I think we need to initialize as_whichplan as
> PA_INVALID plan and then fix it when ExecProcNode() is called for the
> first time.

Done. Set as_whichplan to PA_INVALID_PLAN in
ExecAppendInitializeDSM(), ExecAppendReInitializeDSM() and
ExecAppendInitializeWorker(). Then, when ExecAppend() is called for the
first time, we notice that as_whichplan is PA_INVALID_PLAN, which means
we need to choose a plan.
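
Just so we are talking about the same thing, the first-call behaviour I
have in mind looks roughly like this (a simplified sketch of the
parallel-aware case only; PA_INVALID_PLAN and exec_append_parallel_next()
are from the patch, and the function signature is abbreviated):

#include "postgres.h"

#include "executor/executor.h"
#include "nodes/execnodes.h"

/* Simplified sketch of the parallel-aware ExecAppend() path. */
static TupleTableSlot *
ExecAppendSketch(AppendState *node)
{
    /* First call: no subplan claimed yet, so go pick one now. */
    if (node->as_whichplan == PA_INVALID_PLAN &&
        !exec_append_parallel_next(node))
        return ExecClearTuple(node->ps.ps_ResultTupleSlot);

    for (;;)
    {
        PlanState  *subnode = node->appendplans[node->as_whichplan];
        TupleTableSlot *result = ExecProcNode(subnode);

        if (!TupIsNull(result))
            return result;

        /* This subplan is exhausted; try to claim another one. */
        if (!exec_append_parallel_next(node))
            return ExecClearTuple(node->ps.ps_ResultTupleSlot);
    }
}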

>
> +if (!IsParallelWorker())
>
> This is not a great test, because it would do the wrong thing if we
> ever allowed an SQL function called from a parallel worker to run a
> parallel query of its own.  Currently that's not allowed but we might
> want to allow it someday.  What we really want to test is whether
> we're the leader for *this* query.  Maybe use a flag in the
> AppendState for that, and set it correctly in
> ExecAppendInitializeWorker.

Done. Set a new AppendState->is_parallel_worker field to true in
ExecAppendInitializeWorker().

>
> I think maybe the loop in exec_append_parallel_next should look more like 
> this:
>
> /* Pick the next plan. */
> state->as_whichplan = padesc->pa_nextplan;
> if (state->as_whichplan != PA_INVALID_PLAN)
> {
> int nextplan = state->as_whichplan;
>
> /* Mark non-partial plans done immediately so that they can't be
> picked again. */
> if (nextplan < first_partial_plan)
> padesc->pa_finished[nextplan] = true;
>
> /* Figure out what plan the next worker should pick. */
> do
> {
> /* If we've run through all the plans, loop back through
> partial plans only. */
> if (++nextplan >= state->as_nplans)
> nextplan = first_partial_plan;
>
> /* No plans remaining or tried them all?  Then give up. */
> if (nextplan == state->as_whichplan || nextplan >= state->as_nplans)
> {
> nextpla

Re: [HACKERS] UPDATE of partition key

2017-11-06 Thread Amit Khandekar
On 7 November 2017 at 00:33, Robert Haas  wrote:

> Also, +1 for Amit Langote's idea of trying to merge
> mt_perleaf_childparent_maps with mt_persubplan_childparent_maps.

Currently I am trying to see if it simplifies things if we do that. We
will be merging these arrays into one, but we are adding a new int[]
array that maps subplans to leaf partitions. Will get back with how it
looks finally.

Robert, Amit, I will get back on your other review comments.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-11-07 Thread Amit Khandekar
On 8 November 2017 at 07:55, Thomas Munro  wrote:
> On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas  wrote:
>> The changes to trigger.c still make me super-nervous.  Hey THOMAS
>> MUNRO, any chance you could review that part?
>
> Looking, but here's one silly thing that jumped out at me while
> getting started with this patch.  I cannot seem to convince my macOS
> system to agree with the expected sort order from :show_data, where
> underscores precede numbers:
>
>   part_a_10_a_20 | a | 10 | 200 |  1 |
>   part_a_1_a_10  | a |  1 |   1 |  1 |
> - part_d_1_15| b | 15 | 146 |  1 |
> - part_d_1_15| b | 16 | 147 |  2 |
>   part_d_15_20   | b | 17 | 155 | 16 |
>   part_d_15_20   | b | 19 | 155 | 19 |
> + part_d_1_15| b | 15 | 146 |  1 |
> + part_d_1_15| b | 16 | 147 |  2 |
>
> It seems that macOS (like older BSDs) just doesn't know how to sort
> Unicode and falls back to sorting the bits.  I expect that means that
> the test will also fail on any other OS with "make check
> LC_COLLATE=C".  I believe our regression tests are supposed to pass
> with a wide range of collations including C, so I wonder if this means
> we should stick a leading zero on those single digit numbers, or
> something, to stabilise the output.

I would prefer to retain the partition names. I have now added COLLATE
"C" for partname, like this :

-\set show_data 'select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'

Thomas, can you please try the attached incremental patch
regress_locale_changes.patch and check if the test passes ? The patch
is to be applied on the main v22 patch. If the test passes, I will
include these changes (also for list_parted) in the upcoming v23
patch.

Thanks
-Amit Khandekar


regress_locale_changes.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-11-09 Thread Amit Khandekar
On 9 November 2017 at 09:27, Thomas Munro  wrote:
> On Wed, Nov 8, 2017 at 5:57 PM, Amit Khandekar  wrote:
>> On 8 November 2017 at 07:55, Thomas Munro  
>> wrote:
>>> On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas  wrote:
>>>> The changes to trigger.c still make me super-nervous.  Hey THOMAS
>>>> MUNRO, any chance you could review that part?
>
> At first, it seemed quite strange to me that row triggers and
> statement triggers fire different events for the same modification.
> Row triggers see DELETE +  INSERT (necessarily because different
> tables are involved), but this fact is hidden from the target table's
> statement triggers.
>
> The alternative would be for all triggers to see consistent events and
> transitions.  Instead of having your special case code in ExecInsert
> and ExecDelete that creates the two halves of a 'synthetic' UPDATE for
> the transition tables, you'd just let the existing ExecInsert and
> ExecDelete code do its thing, and you'd need a flag to record that you
> should also fire INSERT/DELETE after statement triggers if any rows
> moved.

Yeah, I had also thought about that, but felt that change was too
invasive; for example, letting ExecARInsertTriggers() do the transition
capture even when transition_capture->tcs_update_new_table is set.

I was also thinking of having a separate function that *only* adds the
transition table rows, and calling that from ExecInsert instead of
ExecARUpdateTriggers(). But I realized that the existing
ExecARUpdateTriggers() looks like a better, more robust interface with
all its checks. It's just that calling ExecARUpdateTriggers() sounds like
we are also firing a trigger, whereas we are not firing any trigger or
saving any event; we are just adding the transition row.
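
For instance, something as thin as the wrapper below would keep
ExecARUpdateTriggers() and all its checks, and only make the intent
clearer at the ExecInsert()/ExecDelete() call sites. This is an
illustrative sketch only, assuming the current ExecARUpdateTriggers()
signature; it adds no new trigger machinery.

#include "postgres.h"

#include "commands/trigger.h"
#include "nodes/execnodes.h"

/*
 * Illustrative wrapper only: record the old/new rows of a cross-partition
 * UPDATE in the transition tables.  No row-level trigger is fired here;
 * ExecARUpdateTriggers() just captures the transition tuples when
 * transition capture is active.
 */
static void
ExecCaptureUpdateTransitionTuples(EState *estate,
                                  ResultRelInfo *resultRelInfo,
                                  ItemPointer tupleid,
                                  HeapTuple oldtuple,
                                  HeapTuple newtuple,
                                  TransitionCaptureState *transition_capture)
{
    ExecARUpdateTriggers(estate, resultRelInfo, tupleid, oldtuple,
                         newtuple, NIL, transition_capture);
}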

>
> After sleeping on this question, I am coming around to the view that
> the way you have it is right.  The distinction isn't really between
> row triggers and statement triggers, it's between triggers at
> different levels in the hierarchy.  It just so happens that we
> currently only fire target table statement triggers and leaf table row
> triggers.

Yes. And rows are there only in leaf partitions. So we have to
simulate as though the target table has these rows. Like you
mentioned, the user has to get the impression of a normal table. So we
have to do something extra to capture the rows.

> Future development ideas that seem consistent with your choice:
>
> 1.  If we ever allow row triggers with transition tables on child
> tables, then I think *their* transition tables should certainly see
> the deletes and inserts, otherwise OLD TABLE and NEW TABLE would be
> inconsistent with the OLD and NEW variables in a single trigger
> invocation.  (These were prohibited mainly due to lack of time and
> (AFAIK) limited usefulness; I think they would need probably need
> their own separate tuplestores, or possibly some kind of filtering.)

As we know, for row triggers on leaf partitions, we treat them as
normal tables, so a trigger written on a leaf partition sees only the
local changes. The trigger is unaware whether the insert is part of an
UPDATE row movement. Similarly, the transition table referenced by
that row trigger function should see only the NEW table, not the old
table.

>
> 2.  If we ever allow row triggers on partitioned tables (ie that fire
> when its children are modified), then I think their UPDATE trigger
> should probably fire when a row moves between any two (grand-)*child
> tables, just as you have it for target table statement triggers.

Yes I agree.

> It doesn't matter that the view from parent tables' triggers is
> inconsistent with the view from leaf table triggers: it's a feature
> that we 'hide' partitioning from the user to the extent we can so that
> you can treat the partitioned table just like a table.
>
> Any other views?

I think that because there is no provision for a row trigger on a
partitioned table, users who want a common trigger on a partition subtree
have no choice but to create the same trigger individually on the leaf
partitions. And that's the reason we cannot handle update row movement
with triggers without anomalies.

Thanks
-Amit Khandekar




Re: [HACKERS] Parallel Append implementation

2017-11-12 Thread Amit Khandekar
Thanks a lot, Robert, for the patch. I will have a look. I quickly tried
some aggregate queries with a partitioned pgbench_accounts table, and it
is crashing. I will get back with the fix, and any other review comments.

Thanks
-Amit Khandekar

On 9 November 2017 at 23:44, Robert Haas  wrote:
> On Sat, Oct 28, 2017 at 5:50 AM, Robert Haas  wrote:
>> No, because the Append node is *NOT* getting copied into shared
>> memory.  I have pushed a comment update to the existing functions; you
>> can use the same comment for this patch.
>
> I spent the last several days working on this patch, which had a
> number of problems both cosmetic and functional.  I think the attached
> is in better shape now, but it could certainly use some more review
> and testing since I only just finished modifying it, and I modified it
> pretty heavily.  Changes:
>
> - I fixed the "morerows" entries in the documentation.  If you had
> built the documentation the way you had it and loaded up in a web
> browser, you would have seen that the way you had it was not correct.
>
> - I moved T_AppendState to a different position in the switch inside
> ExecParallelReInitializeDSM, so as to keep that switch in the same
> order as all of the other switch statements in that file.
>
> - I rewrote the comment for pa_finished.  It previously began with
> "workers currently executing the subplan", which is not an accurate
> description. I suspect this was a holdover from a previous version of
> the patch in which this was an array of integers rather than an array
> of type bool.  I also fixed the comment in ExecAppendEstimate and
> added, removed, or rewrote various other comments as well.
>
> - I renamed PA_INVALID_PLAN to INVALID_SUBPLAN_INDEX, which I think is
> more clear and allows for the possibility that this sentinel value
> might someday be used for non-parallel-aware Append plans.
>
> - I largely rewrote the code for picking the next subplan.  A
> superficial problem with the way that you had it is that you had
> renamed exec_append_initialize_next to exec_append_seq_next but not
> updated the header comment to match.  Also, the logic was spread out
> all over the file.  There are three cases: not parallel aware, leader,
> worker.  You had the code for the first case at the top of the file
> and the other two cases at the bottom of the file and used multiple
> "if" statements to pick the right one in each case.  I replaced all
> that with a function pointer stored in the AppendState, moved the code
> so it's all together, and rewrote it in a way that I find easier to
> understand.  I also changed the naming convention.
>
> - I renamed pappend_len to pstate_len and ParallelAppendDescData to
> ParallelAppendState.  I think the use of the word "descriptor" is a
> carryover from the concept of a scan descriptor.  There's nothing
> really wrong with inventing the concept of an "append descriptor", but
> it seems more clear to just refer to shared state.
>
> - I fixed ExecAppendReInitializeDSM not to reset node->as_whichplan.
> Per commit 41b0dd987d44089dc48e9c70024277e253b396b7, that's wrong;
> instead, local state should be reset in ExecReScanAppend.  I installed
> what I believe to be the correct logic in that function instead.
>
> - I fixed list_qsort() so that it copies the type of the old list into
> the new list.  Otherwise, sorting a list of type T_IntList or
> T_OidList would turn it into just plain T_List, which is wrong.
>
> - I removed get_append_num_workers and integrated the logic into the
> callers.  This function was coded quite strangely: it assigned the
> return value of fls() to a double and then eventually rounded the
> result back to an integer.  But fls() returns an integer, so this
> doesn't make much sense.  On a related note, I made it use fls(# of
> subpaths) instead of fls(# of subpaths)+1.  Adding 1 doesn't make
> sense to me here because it leads to a decision to use 2 workers for a
> single, non-partial subpath.  I suspect both of these mistakes stem
> from thinking that fls() returns the base-2 logarithm, but in fact it
> doesn't, quite: log2(1) = 0.0 but fls(1) = 1.
>
> - In the process of making the changes described in the previous
> point, I added a couple of assertions, one of which promptly failed.
> It turns out the reason is that your patch didn't update
> accumulate_append_subpaths(), which can result in flattening
> non-partial paths from a Parallel Append into a parent Append's list
> of partial paths, which is bad.  The easiest way to fix that would be
> to just teach accumulate_append_subpaths() not to flatten a Parallel
> Append into a parent Append or MergeAppend node, 

Re: [HACKERS] asynchronous execution

2016-10-04 Thread Amit Khandekar
On 4 October 2016 at 02:30, Robert Haas  wrote:
> On Wed, Sep 28, 2016 at 12:30 AM, Amit Khandekar  
> wrote:
>> On 24 September 2016 at 06:39, Robert Haas  wrote:
>>> Since Kyotaro Horiguchi found that my previous design had a
>>> system-wide performance impact due to the ExecProcNode changes, I
>>> decided to take a different approach here: I created an async
>>> infrastructure where both the requestor and the requestee have to be
>>> specifically modified to support parallelism, and then modified Append
>>> and ForeignScan to cooperate using the new interface.  Hopefully that
>>> means that anything other than those two nodes will suffer no
>>> performance impact.  Of course, it might have other problems
>>
>> I see that the reason why you re-designed the asynchronous execution
>> implementation is because the earlier implementation showed
>> performance degradation in local sequential and local parallel scans.
>> But I checked that the ExecProcNode() changes were not that
>> significant as to cause the degradation.
>
> I think we need some testing to prove that one way or the other.  If
> you can do some - say on a plan with multiple nested loop joins with
> inner index-scans, which will call ExecProcNode() a lot - that would
> be great.  I don't think we can just rely on "it doesn't seem like it
> should be slower"
Agreed. I will come up with some tests.

> , though - ExecProcNode() is too important a function
> for us to guess at what the performance will be.

Also, parent pointers are not required in the new design. Thinking of
parent pointers, it now seems the event won't get bubbled up the tree
with the new design. But still, I think it's possible to switch over to
another asynchronous tree when some node in the current subtree is
waiting. I am not sure, though; I will think more on that.

>
> The thing I'm really worried about with either implementation is what
> happens when we start to add asynchronous capability to multiple
> nodes.  For example, if you imagine a plan like this:
>
> Append
> -> Hash Join
>   -> Foreign Scan
>   -> Hash
> -> Seq Scan
> -> Hash Join
>   -> Foreign Scan
>   -> Hash
> -> Seq Scan
>
> In order for this to run asynchronously, you need not only Append and
> Foreign Scan to be async-capable, but also Hash Join.  That's true in
> either approach.  Things are slightly better with the original
> approach, but the basic problem is there in both cases.  So it seems
> we need an approach that will make adding async capability to a node
> really cheap, which seems like it might be a problem.

Yes, we might have to deal with this.

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2016-11-17 Thread Amit Khandekar
On 19 October 2016 at 09:47, Dilip Kumar  wrote:
> On Tue, Oct 18, 2016 at 1:45 AM, Andres Freund  wrote:
>> I don't quite understand why the bitmap has to be parallel at all. As
>> far as I understand your approach as described here, the only thing that
>> needs to be shared are the iteration arrays.  Since they never need to
>> be resized and such, it seems to make a lot more sense to just add an
>> API to share those, instead of the whole underlying hash.
>
> You are right that we only share iteration arrays. But only point is
> that each entry of iteration array is just a pointer to hash entry.
> So either we need to build hash in shared memory (my current approach)
> or we need to copy each hash element at shared location (I think this
> is going to be expensive).

While the discussion on the implementation for creating a shared
TIDBitmap is going on, I am meanwhile starting to review the bitmap heap
scan part, i.e. nodeBitmapHeapscan.c, since this looks mostly independent
of the tidbitmap implementation.

> Brief design idea:
> ---
> #1. Shared TIDBitmap creation and initialization
>   First worker to see the state as parallel bitmap info as PBM_INITIAL
> become leader and set the state to PBM_INPROGRESS All other   workers
> see the state as PBM_INPROGRESS will wait for leader to complete the
> TIDBitmap.
>
> #2 At this level TIDBitmap is ready and all workers are awake.

As far as correctness is concerned, the logic where the first worker
becomes the leader while the others synchronously wait looks good.
Workers get allocated right from the beginning even though they would
stay idle for some percentage of time (5-20%?), but I guess there is
nothing we can do about that with the current parallel query
infrastructure.

In pbms_is_leader(), I didn't clearly understand the significance of
the for-loop. If it is a worker, it can call
ConditionVariablePrepareToSleep() followed by
ConditionVariableSleep(). Once it comes out of
ConditionVariableSleep(), isn't it guaranteed that the leader has
finished the bitmap? If yes, then it looks like it is not necessary to
iterate again and go back through the pbminfo->state checks. Also, with
this, the variable queuedSelf might not be needed. But I might be
missing something here. I am not sure what happens if a worker calls
ConditionVariable[Prepare]Sleep() but the leader has already called
ConditionVariableBroadcast(). Does the for-loop have something to do
with this? But that can happen even with the current for-loop, it
seems.
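
In other words, is something as simple as the sketch below sufficient,
or do spurious returns from ConditionVariableSleep() make the extra
state re-checking necessary anyway? (This is a generic sketch; the flag
and the condition variable are placeholders, not the patch's actual
fields.)

#include "postgres.h"

#include "storage/condition_variable.h"

/*
 * Generic wait sketch, not the patch's code: 'bitmap_done' and 'cv' stand
 * for whatever shared flag and ConditionVariable the patch uses to signal
 * that the leader has finished building the TIDBitmap.
 */
static void
wait_for_leader(volatile bool *bitmap_done, ConditionVariable *cv)
{
    ConditionVariablePrepareToSleep(cv);

    /* Re-check the flag in a loop, in case the sleep returns spuriously. */
    while (!*bitmap_done)
        ConditionVariableSleep(cv, 0);

    ConditionVariableCancelSleep();
}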


> #3. Bitmap processing (Iterate and process the pages).
> In this phase each worker will iterate over page and chunk array and
> select heap pages one by one. If prefetch is enable then there will be
> two iterator. Since multiple worker are iterating over same page and
> chunk array we need to have a shared iterator, so we grab a spin lock
> and iterate within a lock, so that each worker get and different page
> to process.

tbm_iterate() call under SpinLock:
For parallel tbm iteration, tbm_iterate() is called while the SpinLock is
held. Generally we try to keep the code run under a spinlock limited to a
few lines, and without any function calls. Although the tbm_iterate()
code itself looks safe under a spinlock, I was checking whether we can
squeeze SpinLockAcquire() and SpinLockRelease() closer to each other. One
thought is: in tbm_iterate(), acquire the SpinLock before the while loop
that iterates over lossy chunks. Then, if both chunk and per-page data
remain, release the spinlock just before returning (the first return
stmt). And just before scanning the bitmap of an exact page, i.e. just
after "if (iterator->spageptr < tbm->npages)", save the page handle,
increment iterator->spageptr, release the spinlock, and then use the
saved page handle to iterate over the page bitmap.

prefetch_pages() call under SpinLock:
Here again, prefetch_pages() is called while the pbminfo->prefetch_mutex
spinlock is held. Effectively, heavy functions like PrefetchBuffer()
would get called while under the spinlock, and these can even ereport().
One option is to use a mutex lock for this purpose, but I think that
would slow things down. Moreover, the complete set of prefetch pages
would be scanned by a single worker, and the others might wait for this
one. Instead, what I am thinking is: grab the pbminfo->prefetch_mutex
spinlock only while incrementing pbminfo->prefetch_pages. The rest, viz.
iterating over the prefetch pages and doing the PrefetchBuffer(), need
not be synchronised using this pbminfo->prefetch_mutex spinlock.
pbms_parallel_iterate() already has its own iterator spinlock. The only
thing is, workers may not do the actual PrefetchBuffer() sequentially.
One of them might shoot ahead and prefetch 3-4 pages while another is
lagging behind at a sequentially lower page number; but I believe this is
fine, as long as they all prefetch all the required blocks.
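
In code form, the narrower locking I am suggesting is roughly this (a
fragment-style sketch; the pbminfo fields are from the patch, while
tbm_block_for_prefetch_slot() is a made-up placeholder for "look up the
block number for prefetch slot n" via the shared prefetch iterator):

/* Fragment-style sketch of the narrower locking; see assumptions above. */
int         myslot;
BlockNumber blkno;

/* Only claiming the next prefetch slot needs the spinlock. */
SpinLockAcquire(&pbminfo->prefetch_mutex);
myslot = pbminfo->prefetch_pages++;
SpinLockRelease(&pbminfo->prefetch_mutex);

/*
 * The lookup and the actual prefetch run outside the spinlock; workers
 * may issue PrefetchBuffer() calls out of order, which should be fine as
 * long as every required block gets prefetched by someone.
 */
blkno = tbm_block_for_prefetch_slot(prefetch_iterator, myslot);
PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, blkno);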



Re: [HACKERS] Parallel Append implementation

2017-02-17 Thread Amit Khandekar
On 16 February 2017 at 20:37, Robert Haas  wrote:
> On Thu, Feb 16, 2017 at 1:34 AM, Amit Khandekar  
> wrote:
>>> What I was thinking about is something like this:
>>>
>>> 1. First, take the maximum parallel_workers value from among all the 
>>> children.
>>>
>>> 2. Second, compute log2(num_children)+1 and round up.  So, for 1
>>> child, 1; for 2 children, 2; for 3-4 children, 3; for 5-8 children, 4;
>>> for 9-16 children, 5, and so on.
>>>
>>> 3. Use as the number of parallel workers for the children the maximum
>>> of the value computed in step 1 and the value computed in step 2.
>>
>> Ah, now that I closely look at compute_parallel_worker(), I see what
>> you are getting at.
>>
>> For plain unpartitioned table, parallel_workers is calculated as
>> roughly equal to log(num_pages) (actually it is log3). So if the table
>> size is n, the workers will be log(n). So if it is partitioned into p
>> partitions of size n/p each, still the number of workers should be
>> log(n). Whereas, in the patch, it is calculated as (total of all the
>> child workers) i.e. n * log(n/p) for this case. But log(n) != p *
>> log(x/p). For e.g. log(1000) is much less than log(300) + log(300) +
>> log(300).
>>
>> That means, the way it is calculated in the patch turns out to be much
>> larger than if it were calculated using log(total of sizes of all
>> children). So I think for the step 2 above, log(total_rel_size)
>> formula seems to be appropriate. What do you think ? For
>> compute_parallel_worker(), it is actually log3 by the way.
>>
>> BTW this formula is just an extension of how parallel_workers is
>> calculated for an unpartitioned table.
>
> log(total_rel_size) would be a reasonable way to estimate workers when
> we're scanning an inheritance hierarchy, but I'm hoping Parallel
> Append is also going to apply to UNION ALL queries, where there's no
> concept of the total rel size.
Yes, Parallel Append also gets used in UNION ALL.

> For that we need something else, which
> is why the algorithm that I proposed upthread doesn't rely on it.

The log2(num_children)+1 formula which you proposed does not take into
account the number of workers for each of the subplans; that's why I am
a bit more inclined to look for some other logic. Maybe treat the
children as if they were partitions, and calculate the final number of
workers accordingly. So for 2 children with 4 and 5 workers
respectively, the Append parallel_workers would be log3(3^4 + 3^5).
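
To spell out the arithmetic, what I mean is roughly the following (an
illustrative sketch only: treat each child's worker count w as if it came
from log3 of the child's size, add up the implied pseudo-sizes, and take
log3 of the sum):

#include "postgres.h"

#include <math.h>

/*
 * Illustrative only: combine per-child worker counts the way the text
 * above describes, rounding up and capping the way
 * compute_parallel_worker() would cap it.
 */
static int
combine_append_workers(const int *child_workers, int nchildren,
                       int max_workers)
{
    double      pseudo_size = 0.0;
    int         i;

    for (i = 0; i < nchildren; i++)
        pseudo_size += pow(3.0, (double) child_workers[i]);

    return Min((int) ceil(log(pseudo_size) / log(3.0)), max_workers);
}

With this rounding, the 4-and-5-worker example above comes out to 6
workers (before any capping).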

>
>>> The decision to use fewer workers for a smaller scan isn't really
>>> because we think that using more workers will cause a regression.
>>> It's because we think it may not help very much, and because it's not
>>> worth firing up a ton of workers for a relatively small scan given
>>> that workers are a limited resource.  I think once we've got a bunch
>>> of workers started, we might as well try to use them.
>>
>> One possible side-effect I see due to this is : Other sessions might
>> not get a fair share of workers due to this. But again, there might be
>> counter argument that, because Append is now focussing all the workers
>> on a last subplan, it may finish faster, and release *all* of its
>> workers earlier.
>
> Right.  I think in general it's pretty clear that there are possible
> fairness problems with parallel query.  The first process that comes
> along seizes however many workers it thinks it should use, and
> everybody else can use whatever (if anything) is left.  In the long
> run, I think it would be cool to have a system where workers can leave
> one parallel query in progress and join a different one (or exit and
> spawn a new worker to join a different one), automatically rebalancing
> as the number of parallel queries in flight fluctuates.  But that's
> clearly way beyond anything we can do right now.  I think we should
> assume that any parallel workers our process has obtained are ours to
> use for the duration of the query, and use them as best we can.

> Note that even if the Parallel Append tells one of the workers that there
> are no more tuples and it should go away, some higher level of the
> query plan could make a different choice anyway; there might be
> another Append elsewhere in the plan tree.
Yeah, that looks good enough to justify not losing the workers.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-02-17 Thread Amit Khandekar
Ashutosh Bapat  wrote:
> Do we have any performance measurements where we see that Goal B
> performs better than Goal A, in such a situation? Do we have any
> performance measurement comparing these two approaches in other
> situations. If implementation for Goal B beats that of Goal A always,
> we can certainly implement it directly. But it may not.

I will get back with some performance numbers.

> Also, separating patches for Goal A and Goal B might make reviews easier.

Do you anyway want the patch in its current state to be split?
Right now, I am not sure exactly how you would like me to split it.

>
>>
>>>
>>> BTW, Right now, the patch does not consider non-partial paths for a
>>> child which has partial paths. Do we know, for sure, that a path
>>> containing partial paths for a child, which has it, is always going to
>>> be cheaper than the one which includes non-partial path. If not,
>>> should we build another paths which contains non-partial paths for all
>>> child relations. This sounds like a 0/1 knapsack problem.
>>
>> I didn't quite get this. We do create a non-partial Append path using
>> non-partial child paths anyways.
>
> Let's say a given child-relation has both partial and non-partial
> paths, your approach would always pick up a partial path. But now that
> parallel append can handle non-partial paths as well, it may happen
> that picking up non-partial path instead of partial one when both are
> available gives an overall better performance. Have we ruled out that
> possibility.

Yes, one Append path can contain a child c1 with a partial path, another
Append path can contain child c1 with a non-partial path, and each of
these combinations can have two more combinations for child2, and so on,
leading to too many Append paths. I think that's what you referred to as
the 0/1 knapsack problem. Right, this does not seem worth it.

I had earlier considered adding a partial Append path containing only
non-partial paths, but for some reason I had concluded that it's not
worth having this path, as its cost is most likely going to be higher
due to the presence of all the single-worker paths *and* also a Gather
above them. I should have documented the reason. Let me give this some
thought.

>>>> Let me try keeping the per-subplan max_worker info in Append path
>>>> itself, like I mentioned above. If that works, the bitmap will be
>>>> replaced by max_worker field. In case of non-partial subpath,
>>>> max_worker will be 1. (this is the same info kept in AppendState node
>>>> in the patch, but now we might need to keep it in Append path node as
>>>> well).
>>>
>>> It will be better if we can fetch that information from each subpath
>>> when creating the plan. As I have explained before, a path is minimal
>>> structure, which should be easily disposable, when throwing away the
>>> path.
>>
>> Now in the v2 patch, we store per-subplan worker count. But still, we
>> cannot use the path->parallel_workers to determine whether it's a
>> partial path. This is because even for a non-partial path, it seems
>> the parallel_workers can be non-zero. For e.g., in
>> create_subqueryscan_path(), it sets path->parallel_workers to
>> subpath->parallel_workers. But this path is added as a non-partial
>> path. So we need a separate info as to which of the subpaths in Append
>> path are partial subpaths. So in the v2 patch, I continued to use
>> Bitmapset in AppendPath. But in Append plan node, number of workers is
>> calculated using this bitmapset. Check the new function
>> get_append_num_workers().
>
> If the subpath is from childrel->partial_pathlist, then we set the
> corresponding bit in the bitmap. Now we can infer that for any path by
> checking whether that path is found in path->parent->partial_pathlist.
> always chooses the first partial path, the search in partial_pathlist
> should not affect performance. So, we can avoid maintaining a bitmap
> in the path and keep accumulating it when collapsing append paths.

Thanks. I have accordingly made these changes in the attached v4 patch.
get_append_num_workers() now uses
linitial(path->parent->partial_pathlist) to determine whether the
subpath is a partial or a non-partial path. Removed the bitmapset
field from AppendPath.
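
For illustration, here is a minimal standalone C model of the check just
described (the struct and field names are made up; this is not the patch
code): a subpath is treated as partial only if it is the first path in its
parent rel's partial_pathlist, in which case it can use its own
parallel_workers, otherwise it gets exactly one worker.

#include <stdio.h>

typedef struct PathStub PathStub;

typedef struct RelStub
{
    PathStub   *first_partial_path; /* stands in for linitial(partial_pathlist) */
} RelStub;

struct PathStub
{
    RelStub    *parent;
    int         parallel_workers;
};

/* One worker for a non-partial subpath; its own parallel_workers otherwise */
static int
subpath_max_workers(const PathStub *subpath)
{
    if (subpath->parent->first_partial_path == subpath)
        return (subpath->parallel_workers > 0) ? subpath->parallel_workers : 1;
    return 1;
}

int
main(void)
{
    RelStub     rel1 = {NULL};
    RelStub     rel2 = {NULL};
    PathStub    partial_path = {&rel1, 4};
    PathStub    nonpartial_path = {&rel2, 0};

    rel1.first_partial_path = &partial_path;    /* rel2 has no partial path */

    printf("%d %d\n",
           subpath_max_workers(&partial_path),
           subpath_max_workers(&nonpartial_path));  /* prints: 4 1 */
    return 0;
}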


> 12. cost_append() essentially adds costs of all the subpaths and then
> divides by parallel_divisor. This might work if all the subpaths are
> partial paths. But for the subpaths which are not partial, a single
> worker will incur the whole cost of that subpath. Hence just dividing
> all the total cost doesn't seem the right thing to do. We should apply
> different logic for costing non-partial subpaths and partial subpaths.

 With the current partial path costing infrastructure, it is assumed
 that a partial path node should return the average per-worker cost.
 Hence, I thought it would be best to do it in a similar way for
 Append. But let me think if we can do something. With the current
 paralleli

Re: [HACKERS] Parallel Append implementation

2017-02-17 Thread Amit Khandekar
On 16 February 2017 at 20:37, Robert Haas  wrote:

> I'm not sure that it's going to be useful to make this logic very
> complicated.  I think the most important thing is to give 1 worker to
> each plan before we give a second worker to any plan.  In general I
> think it's sufficient to assign a worker that becomes available to the
> subplan with the fewest number of workers (or one of them, if there's
> a tie) without worrying too much about the target number of workers
> for that subplan.

The reason I have considered per-subplan workers is, for instance, so
that we can respect the parallel_workers reloption set by the user for
different tables. Or, for example, if subquery1 is a big hash join
needing more workers and subquery2 is a small table requiring far fewer
workers, it seems to make sense to give more workers to subquery1.




Re: [HACKERS] UPDATE of partition key

2017-02-20 Thread Amit Khandekar
On 16 February 2017 at 20:53, Robert Haas  wrote:
> On Thu, Feb 16, 2017 at 5:47 AM, Greg Stark  wrote:
>> On 13 February 2017 at 12:01, Amit Khandekar  wrote:
>>> There are a few things that can be discussed about :
>>
>> If you do a normal update the new tuple is linked to the old one using
>> the ctid forming a chain of tuple versions. This tuple movement breaks
>> that chain.  So the question I had reading this proposal is what
>> behaviour depends on ctid and how is it affected by the ctid chain
>> being broken.
>
> I think this is a good question.
>
>> I think the concurrent update case is just a symptom of this. If you
>> try to update a row that's locked for a concurrent update you normally
>> wait until the concurrent update finishes, then follow the ctid chain
>> and recheck the where clause on the target of the link and if it still
>> matches you perform the update there.
>
> Right.  EvalPlanQual behavior, in short.
>
>> At least you do that if you have isolation_level set to
>> repeatable_read. If you have isolation level set to serializable then
>> you just fail with a serialization failure. I think that's what you
>> should do if you come across a row that's been updated with a broken
>> ctid chain even in repeatable read mode. Just fail with a
>> serialization failure and document that in partitioned tables if you
>> perform updates that move tuples between partitions then you need to
>> be ensure your updates are prepared for serialization failures.
>
> Now, this part I'm not sure about.  What's pretty clear is that,
> barring some redesign of the heap format, we can't keep the CTID chain
> intact when the tuple moves to a different relfilenode.  What's less
> clear is what to do about that.  We can either (1) give up on
> EvalPlanQual behavior in this case and act just as we would if the row
> had been deleted; no update happens.

This is what the current patch has done.

> or (2) throw a serialization
> error.  You're advocating for #2, but I'm not sure that's right,
> because:
>
> 1. It's a lot more work,
>
> 2. Your proposed implementation needs an on-disk format change that
> uses up a scarce infomask bit, and
>
> 3. It's not obvious to me that it's clearly preferable from a user
> experience standpoint.  I mean, either way the user doesn't get the
> behavior that they want.  Either they're hoping for EPQ semantics and
> they instead do a no-op update, or they're hoping for EPQ semantics
> and they instead get an ERROR.  Generally speaking, we don't throw
> serialization errors today at READ COMMITTED, so if we do so here,
> that's going to be a noticeable and perhaps unwelcome change.
>
> More opinions welcome.

I am inclined to at least have some option for the user to decide the
behaviour. In the future we can even consider support for walking
through the ctid chain across multiple relfilenodes. But till then, we
need to decide what default behaviour to keep. My inclination is more
towards erroring out in the unfortunate event where there is an UPDATE
while the row movement is happening. One option is to not get into
finding out whether the DELETE was part of a partition row movement or
was indeed a DELETE, and to always error out the UPDATE when
heap_update() returns HeapTupleUpdated, but only if the table is a
leaf partition. But this will obviously cause annoyance, because of
the chance of getting such errors when there are concurrent updates
and deletes in the same partition. We could instead keep a table-level
option for determining whether to error out or silently lose the UPDATE.

Another option I was thinking of: when the UPDATE is on a partition
key, acquire ExclusiveLock (not AccessExclusiveLock) only on that
partition, so that selects continue to execute, but UPDATE/DELETE will
wait before opening the table for a scan. An UPDATE of the partition
key is not going to be a very routine operation; it sounds more like a
DBA maintenance operation, so it does not look like it would come in
between usual transactions.




Re: [HACKERS] UPDATE of partition key

2017-03-01 Thread Amit Khandekar
licate things, as compared to always doing the setup for
>> UPDATE. Will check on that.
>
> Hmm.  ExecSetupPartitionTupleRouting(), which does significant amount of
> setup work, is fine being called in ExecInitModifyTable() in the insert
> case because there are often cases where that's a bulk-insert and hence
> cost of the setup work is amortized.  Updates, OTOH, are seldom done in a
> bulk manner.  So that might be an argument for doing it late only when
> needed.

Yes, agreed.

> But that starts to sound less attractive when one realizes that
> that will occur for every row that wants to move.

If we manage to call ExecSetupPartitionTupleRouting() during the
execution phase only once, the very first time we find that the update
requires row movement, then we can re-use the info.


One more thing I noticed is that, in the update-returning case,
ExecDelete() will also generate the RETURNING result, which we then
discard. This is a waste. We should not even process RETURNING in an
ExecDelete() called for row movement; RETURNING should be processed
only for ExecInsert().

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-03-07 Thread Amit Khandekar
On 19 February 2017 at 14:59, Robert Haas  wrote:
> On Fri, Feb 17, 2017 at 2:56 PM, Amit Khandekar  
> wrote:
>> The log2(num_children)+1 formula which you proposed does not take into
>> account the number of workers for each of the subplans, that's why I
>> am a bit more inclined to look for some other logic. May be, treat the
>> children as if they belong to partitions, and accordingly calculate
>> the final number of workers. So for 2 children with 4 and 5 workers
>> respectively, Append parallel_workers would be : log3(3^4 + 3^5) .
>
> In general this will give an answer not different by more than 1 or 2
> from my answer, and often exactly the same.  In the case you mention,
> whether we get the same answer depends on which way you round:
> log3(3^4+3^5) is 5 if you round down, 6 if you round up.
>
> My formula is more aggressive when there are many subplans that are
> not parallel or take only 1 worker, because I'll always use at least 5
> workers for an append that has 9-16 children, whereas you might use
> only 2 if you do log3(3^0+3^0+3^0+3^0+3^0+3^0+3^0+3^0+3^0).  In that
> case I like my formula better. With lots of separate children, the
> chances of being able to use as many as 5 workers seem good.  (Note
> that using 9 workers as Ashutosh seems to be proposing would be a
> waste if the different children have very unequal execution times,
> because the workers that run children with short execution times can
> be reused to run additional subplans while the long ones are still
> running.  Running a separate worker for each child only works out if
> the shortest runtime is more than 50% of the longest runtime, which
> may sometimes be true but doesn't seem like a good bet in general.)
>
> Your formula is more aggressive when you have 3 children that all use
> the same number of workers; it'll always decide on <workers per child>+1, whereas mine won't add the extra worker in that case.
> Possibly your formula is better than mine in that case, but I'm not
> sure.  If you have as many as 9 children that all want N workers, your
> formula will decide on N+2 workers, but since my formula guarantees a
> minimum of 5 workers in such cases, I'll probably be within 1 of
> whatever answer you were getting.
>

Yeah, that seems to be right in most of the cases. The only case
where your formula seems to give too few workers is something like
(2, 8, 8). For such subplans, we should allocate at least 8 workers.
It turns out that in most of the cases with my formula, the number of
Append workers allocated is just 1 more than the max per-subplan
worker count. So for (2, 1, 1, 8), it will be a fraction more than 8.
So in the patch, in addition to the log2() formula you proposed, I
have made sure that it allocates at least max(per-subplan
parallel_workers values).
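
To make the combination concrete, here is a small standalone sketch,
illustrative only: it ignores max_parallel_workers_per_gather, and the exact
rounding of the log2 figure is my assumption. It computes the log2-style
worker count and clamps it so that it is never below the largest per-subplan
parallel_workers value.

#include <stdio.h>

/* Smallest k such that 2^k >= n, i.e. ceil(log2(n)) */
static int
log2_ceil(int n)
{
    int     k = 0;

    while ((1 << k) < n)
        k++;
    return k;
}

/*
 * log2(number of subplans) + 1, but never less than the largest
 * per-subplan parallel_workers value.
 */
static int
append_parallel_workers(const int *subplan_workers, int nsubplans)
{
    int     workers = log2_ceil(nsubplans) + 1;

    for (int i = 0; i < nsubplans; i++)
        if (subplan_workers[i] > workers)
            workers = subplan_workers[i];
    return workers;
}

int
main(void)
{
    int     example[] = {2, 8, 8};

    /* log2(3) rounds up to 2, +1 gives 3, but the clamp raises it to 8 */
    printf("workers = %d\n", append_parallel_workers(example, 3));
    return 0;
}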

>
>> BTW, there is going to be some logic change in the choose-next-subplan
>> algorithm if we consider giving extra workers to subplans.
>
> I'm not sure that it's going to be useful to make this logic very
> complicated.  I think the most important thing is to give 1 worker to
> each plan before we give a second worker to any plan.  In general I
> think it's sufficient to assign a worker that becomes available to the
> subplan with the fewest number of workers (or one of them, if there's
> a tie) without worrying too much about the target number of workers
> for that subplan.

In the attached v5 patch, the logic of distributing the workers is now
kept simple: it just distributes the workers equally without
considering the per-subplan parallel_workers value. I have retained the
earlier logic of choosing the plan with the minimum number of current
workers. But now that pa_max_workers is not needed, I removed it, and
instead a partial_plans bitmapset is added in the Append node. Once a
worker picks up a non-partial subplan, it immediately changes its
pa_num_workers to -1. Whereas for partial subplans, the worker sets it
to -1 only after it finishes executing it.

Effectively, in parallel_append_next(), the check for whether a subplan
is executing with its max parallel_workers is now removed, and all code
that was using pa_max_workers is now removed.


Ashutosh Bapat  wrote:
> 10. We should probably move the parallel_safe calculation out of 
> cost_append().
> +path->parallel_safe = path->parallel_safe &&
> +  subpath->parallel_safe;
>
> 11. This check shouldn't be part of cost_append().
> +/* All child paths must have same parameterization */
> +Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));
>

Moved these two statements out of cost_append(); they are now done
separately in create_append_path().


Also, I have removed s

Re: [HACKERS] Parallel Append implementation

2017-03-09 Thread Amit Khandekar
/*
 * Increment worker count for the chosen node, if at all we found one.
 * For non-partial plans, set it to -1 instead, so that no other workers
 * run it.
 */
if (min_whichplan != PA_INVALID_PLAN)
{
    if (bms_is_member(min_whichplan,
                      ((Append *) state->ps.plan)->partial_subplans_set))
        padesc->pa_info[min_whichplan].pa_num_workers++;
    else
        padesc->pa_info[min_whichplan].pa_num_workers = -1;
}

Now, since Bitmapset field is used during execution with such
simplicity, why not have this same data structure in AppendPath, and
re-use bitmapset field in Append plan node without making a copy of
it. Otherwise, if we have two lists in AppendPath, and a bitmap in
Append, again there is going to be code for data structure conversion.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-03-10 Thread Amit Khandekar
On 10 March 2017 at 12:33, Ashutosh Bapat
 wrote:
> On Fri, Mar 10, 2017 at 11:33 AM, Ashutosh Bapat
>  wrote:
>>>
>>> But as far as code is concerned, I think the two-list approach will
>>> turn out to be less simple if we derive corresponding two different
>>> arrays in AppendState node. Handling two different arrays during
>>> execution does not look clean. Whereas, the bitmapset that I have used
>>> in Append has turned out to be very simple. I just had to do the below
>>> check (and that is the only location) to see if it's a partial or
>>> non-partial subplan. There is nowhere else any special handling for
>>> non-partial subpath.
>>>
>>> /*
>>> * Increment worker count for the chosen node, if at all we found one.
>>> * For non-partial plans, set it to -1 instead, so that no other workers
>>> * run it.
>>> */
>>> if (min_whichplan != PA_INVALID_PLAN)
>>> {
>>>if (bms_is_member(min_whichplan,
>>> ((Append*)state->ps.plan)->partial_subplans_set))
>>>padesc->pa_info[min_whichplan].pa_num_workers++;
>>>else
>>>padesc->pa_info[min_whichplan].pa_num_workers = -1;
>>> }
>>>
>>> Now, since Bitmapset field is used during execution with such
>>> simplicity, why not have this same data structure in AppendPath, and
>>> re-use bitmapset field in Append plan node without making a copy of
>>> it. Otherwise, if we have two lists in AppendPath, and a bitmap in
>>> Append, again there is going to be code for data structure conversion.
>>>
>>
>> I think there is some merit in separating out non-parallel and
>> parallel plans within the same array or outside it. The current logic
>> to assign plan to a worker looks at all the plans, unnecessarily
>> hopping over the un-parallel ones after they are given to a worker. If
>> we separate those two, we can keep assigning new workers to the
>> non-parallel plans first and then iterate over the parallel ones when
>> a worker needs a plan to execute. We might eliminate the need for
>> special value -1 for num workers. You may separate those two kinds in
>> two different arrays or within the same array and remember the
>> smallest index of a parallel plan.

Do you think we might get a performance benefit with this? I am
looking more towards logic simplicity. Non-parallel plans would most
likely be there only in the case of UNION ALL queries, not partitioned
tables. And UNION ALL queries would probably have far fewer subplans,
so there won't be too many unnecessary hops. The need for
num_workers=-1 will still be there for partial plans, because we need
to set it to -1 once a worker finishes a plan.

>
> Further to that, with this scheme and the scheme to distribute workers
> equally irrespective of the maximum workers per plan, you don't need
> to "scan" the subplans to find the one with minimum workers. If you
> treat the array of parallel plans as a circular queue, the plan to be
> assigned next to a worker will always be the plan next to the one
> which got assigned to the given worker.

> Once you have assigned workers
> to non-parallel plans, initialize a shared variable next_plan to point
> to the first parallel plan. When a worker comes asking for a plan,
> assign the plan pointed by next_plan and update it to the next plan in
> the circular queue.

At some point, this logic may stop working. Imagine plans running
with (1, 1, 1). The next worker goes to plan 1, so they run with
(2, 1, 1), and now next_plan points to plan 2. Now suppose the worker
on plan 2 finishes. It should not take plan 2 again, even though
next_plan points to 2. It should take plan 3, or whichever one is not
finished. Maybe a worker that finishes a plan should do this check
before going directly to next_plan. But if this turns out to be as
simple as finding the minimum-worker plan, we can use this logic. Will
have to check. We can anyway consider this even when we have a single
list.

>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-03-10 Thread Amit Khandekar
On 10 March 2017 at 14:05, Ashutosh Bapat
 wrote:
>> The need for
>> num_workers=-1 will still be there for partial plans, because we need
>> to set it to -1 once a worker finishes a plan.
>>
>
> IIRC, we do that so that no other workers are assigned to it when
> scanning the array of plans. But with the new scheme we don't need to
> scan the non-parallel plans for when assigning plan to workers so -1
> may not be needed. I may be wrong though.
>

Still, when a worker finishes a partial subplan, it marks it as -1,
so that no new workers pick it up, even if there are other workers
already executing it.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-03-10 Thread Amit Khandekar
After giving more thought to our discussions, I have used the
Bitmapset structure in AppendPath, as against having two lists, one
for partial and the other for non-partial paths. Attached is patch v6,
which has the required changes. So accumulate_append_subpath() now
also prepares the bitmapset containing the information about which
paths are partial paths. This is what I had done in the first version.

At this point, I have not given sufficient thought to Ashutosh's
proposal of just keeping track of the next_subplan that he mentioned,
where we just keep assigning workers to a circle of subplans in
round-robin style. But as of now, I think the approach of choosing the
minimum-worker subplan is pretty simple. So patch v6 is in a working
condition using the minimum-worker approach.

On 9 March 2017 at 07:22, Robert Haas  wrote:

> Some review:
>
> +typedef struct ParallelAppendDescData
> +{
> +    slock_t        pa_mutex;    /* mutual exclusion to choose next subplan */
> +    ParallelAppendInfo pa_info[FLEXIBLE_ARRAY_MEMBER];
> +} ParallelAppendDescData;
>
> Instead of having ParallelAppendInfo, how about just int
> pa_workers[FLEXIBLE_ARRAY_MEMBER]?  The second structure seems like
> overkill, at least for now.

I have, for now, kept the structure there, just in case we add
something to it after further discussion.

>
> +static inline void
> +exec_append_scan_first(AppendState *appendstate)
> +{
> +appendstate->as_whichplan = 0;
> +}
>
> I don't think this is buying you anything, and suggest backing it out.

This is required for sequential Append, so that we can start executing
from the first subplan.

>
> +/* Backward scan is not supported by parallel-aware plans */
> +
> Assert(!ScanDirectionIsBackward(appendstate->ps.state->es_direction));
>
> I think you could assert ScanDirectionIsForward, couldn't you?
> NoMovement, I assume, is right out.

Right. Changed.

>
> +elog(DEBUG2, "ParallelAppend : pid %d : all plans already
> finished",
> + MyProcPid);
>
> Please remove (and all similar cases also).

Removed at multiple places.

>
> + sizeof(*node->as_padesc->pa_info) * node->as_nplans);
>
> I'd use the type name instead.

Done.

>
> +for (i = 0; i < node->as_nplans; i++)
> +{
> +/*
> + * Just setting all the number of workers to 0 is enough. The logic
> + * of choosing the next plan in workers will take care of everything
> + * else.
> + */
> +padesc->pa_info[i].pa_num_workers = 0;
> +}
>
> Here I'd use memset.

Done.

>
> +return (min_whichplan == PA_INVALID_PLAN ? false : true);
>
> Maybe just return (min_whichplan != PA_INVALID_PLAN);

Done.

>
> -  childrel->cheapest_total_path);
> +
> childrel->cheapest_total_path);
>
> Unnecessary.

This call now takes more parameters, so I kept the change.
>
> +{
>  partial_subpaths = accumulate_append_subpath(partial_subpaths,
> linitial(childrel->partial_pathlist));
> +}
>
> Don't need to add braces.

Removed them.

>
> +/*
> + * Extract the first unparameterized, parallel-safe one among the
> + * child paths.
> + */
>
> Can we use get_cheapest_parallel_safe_total_inner for this, from
> a71f10189dc10a2fe422158a2c9409e0f77c6b9e?

Yes, Fixed.

>
> +if (rel->partial_pathlist != NIL &&
> +(Path *) linitial(rel->partial_pathlist) == subpath)
> +partial_subplans_set = bms_add_member(partial_subplans_set, i);
>
> This seems like a scary way to figure this out.  What if we wanted to
> build a parallel append subpath with some path other than the
> cheapest, for some reason?  I think you ought to record the decision
> that set_append_rel_pathlist makes about whether to use a partial path
> or a parallel-safe path, and then just copy it over here.

As mentioned above, used Bitmapset in AppendPath.

>
> -create_append_path(grouped_rel,
> -   paths,
> -   NULL,
> -   0);
> +create_append_path(grouped_rel, paths, NULL, 0);
>
> Unnecessary.

Now, since there was anyway a change in the number of params, I kept
the single-line call.

Please refer to attached patch version v6 for all of the above changes.


ParallelAppend_v6.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-03-13 Thread Amit Khandekar
On 10 March 2017 at 22:08, Robert Haas  wrote:
> On Fri, Mar 10, 2017 at 12:17 AM, Amit Khandekar  
> wrote:
>> I agree that the two-lists approach will consume less memory than
>> bitmapset. Keeping two lists will effectively have an extra pointer
>> field which will add up to the AppendPath size, but this size will not
>> grow with the number of subpaths, whereas the Bitmapset will grow.
>
> Sure.  You'll use about one BIT of memory per subpath.  I'm kind of
> baffled as to why we're treating this as an issue worth serious
> discussion; the amount of memory involved is clearly very small.  Even
> for an appendrel with 1000 children, that's 125 bytes of memory.
> Considering the amount of memory we're going to spend planning that
> appendrel overall, that's not significant.
Yes, I agree that we should rather consider other things like code
simplicity to determine which data structure we should use in
AppendPath.

>
> However, Ashutosh's response made me think of something: one thing is
> that we probably do want to group all of the non-partial plans at the
> beginning of the Append so that they get workers first, and put the
> partial plans afterward.  That's because the partial plans can always
> be accelerated by adding more workers as they become available, but
> the non-partial plans are just going to take as long as they take - so
> we want to start them as soon as possible.  In fact, what we might
> want to do is actually sort the non-partial paths in order of
> decreasing cost, putting the most expensive one first and the others
> in decreasing order after that - and then similarly afterward with the
> partial paths.  If we did that, we wouldn't need to store a bitmapset
> OR two separate lists.  We could just store the index of the first
> partial plan in the list.  Then you can test whether a path is partial
> by checking whether this_index >= first_partial_index.

I agree that we should preferably have the non-partial plans started
first. But I am not sure it is really worth ordering the partial plans
by cost. The reason we ended up not keeping track of the per-subplan
parallel_workers is that it would not matter much: we would just
distribute the workers equally among all the subplans regardless of
how big they are. Even if smaller plans get more workers, they will
finish faster, and workers would become available to larger subplans
sooner.

Anyway, I have given some thought to the logic of choosing the next
plan, and that is irrespective of whether the list is sorted. I have
included Ashutosh's proposal of scanning the array round-robin, as
against finding the minimum, since that method will automatically
distribute the workers evenly. Also, the logic uses a single array and
keeps track of the first partial plan. The first section of the array
is non-partial, followed by the partial plans. Below is the algorithm.
There might be corner cases which I haven't yet taken into account,
but first I wanted to get agreement on whether this looks OK to go
ahead with. Since it does not find the minimum worker count, it no
longer uses pa_num_workers. Instead it has a boolean field
painfo->pa_finished.

parallel_append_next(AppendState *state)
{
    /* Make a note of which subplan we have started with */
    initial_plan = padesc->next_plan;
    found = false;

    /*
     * Keep going to the next plan until we find an unfinished one. In the
     * process, also keep track of the first unfinished subplan. As the
     * non-partial subplans are taken one by one, the first unfinished
     * subplan will shift ahead, so that we don't have to scan those anymore.
     */
    whichplan = initial_plan;
    for (;;)
    {
        ParallelAppendInfo *painfo = &padesc->pa_info[whichplan];

        /*
         * Ignore plans that are already done processing. These also include
         * non-partial subplans which have already been taken by a worker.
         */
        if (!painfo->pa_finished)
        {
            /*
             * If this is a non-partial plan, immediately mark it finished,
             * and shift ahead first_plan.
             */
            if (whichplan < padesc->first_partial_plan)
            {
                padesc->pa_info[whichplan].pa_finished = true;
                padesc->first_plan++;
            }

            found = true;
            break;
        }

        /* Either go to the next index, or wrap around to the first
         * unfinished one */
        whichplan = goto_next_plan(whichplan, padesc->first_plan,
                                   padesc->as_nplans - 1);

        /* Have we scanned all subplans ? If yes, we are done */
        if (whichplan == initial_plan)
            break;
    }

    /* If we didn't find any plan to execute, stop executing. */
    if (!found || whichplan == PA_INVALID_PLAN)
        return false;
    else
    {
        /* Set the chosen plan, and also the next plan to be picked by
         * other workers */
        state->as_

Re: [HACKERS] Parallel Append implementation

2017-03-13 Thread Amit Khandekar
On 12 March 2017 at 19:31, Tels  wrote:
> Moin,
>
> On Sat, March 11, 2017 11:29 pm, Robert Haas wrote:
>> On Fri, Mar 10, 2017 at 6:01 AM, Tels 
>> wrote:
>>> Just a question for me to understand the implementation details vs. the
>>> strategy:
>>>
>>> Have you considered how the scheduling decision might impact performance
>>> due to "inter-plan parallelism vs. in-plan parallelism"?
>>>
>>> So what would be the scheduling strategy? And should there be a fixed
>>> one
>>> or user-influencable? And what could be good ones?
>>>
>>> A simple example:
>>>
>>> E.g. if we have 5 subplans, and each can have at most 5 workers and we
>>> have 5 workers overall.
>>>
>>> So, do we:
>>>
>>>   Assign 5 workers to plan 1. Let it finish.
>>>   Then assign 5 workers to plan 2. Let it finish.
>>>   and so on
>>>
>>> or:
>>>
>>>   Assign 1 workers to each plan until no workers are left?
>>
>> Currently, we do the first of those, but I'm pretty sure the second is
>> way better.  For example, suppose each subplan has a startup cost.  If
>> you have all the workers pile on each plan in turn, every worker pays
>> the startup cost for every subplan.  If you spread them out, then
>> subplans can get finished without being visited by all workers, and
>> then the other workers never pay those costs.  Moreover, you reduce
>> contention for spinlocks, condition variables, etc.  It's not
>> impossible to imagine a scenario where having all workers pile on one
>> subplan at a time works out better: for example, suppose you have a
>> table with lots of partitions all of which are on the same disk, and
>> it's actually one physical spinning disk, not an SSD or a disk array
>> or anything, and the query is completely I/O-bound.  Well, it could
>> be, in that scenario, that spreading out the workers is going to turn
>> sequential I/O into random I/O and that might be terrible.  In most
>> cases, though, I think you're going to be better off.  If the
>> partitions are on different spindles or if there's some slack I/O
>> capacity for prefetching, you're going to come out ahead, maybe way
>> ahead.  If you come out behind, then you're evidently totally I/O
>> bound and have no capacity for I/O parallelism; in that scenario, you
>> should probably just turn parallel query off altogether, because
>> you're not going to benefit from it.
>
> I agree with the proposition that both strategies can work well, or not,
> depending on system-setup, the tables and data layout. I'd be a bit more
> worried about turning it into the "random-io-case", but that's still just
> a feeling and guesswork.
>
> So which one will be better seems speculative, hence the question for
> benchmarking different strategies.
>
> So, I'd like to see the scheduler be out in a single place, maybe a
> function that gets called with the number of currently running workers,
> the max. number of workers to be expected, the new worker, the list of
> plans still todo, and then schedules that single worker to one of these
> plans by strategy X.
>
> That would make it easier to swap out X for Y and see how it fares,
> wouldn't it?

Yes, actually pretty much all of the scheduler logic is in one single
function, parallel_append_next().

>
>
> However, I don't think the patch needs to select the optimal strategy
> right from the beginning (if that even exists, maybe it's a mixed
> strategy), even "not so optimal" parallelism will be better than doing all
> things sequentially.
>
> Best regards,
>
> Tels



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-03-16 Thread Amit Khandekar
On 12 March 2017 at 08:50, Robert Haas  wrote:
>> However, Ashutosh's response made me think of something: one thing is
>> that we probably do want to group all of the non-partial plans at the
>> beginning of the Append so that they get workers first, and put the
>> partial plans afterward.  That's because the partial plans can always
>> be accelerated by adding more workers as they become available, but
>> the non-partial plans are just going to take as long as they take - so
>> we want to start them as soon as possible.  In fact, what we might
>> want to do is actually sort the non-partial paths in order of
>> decreasing cost, putting the most expensive one first and the others
>> in decreasing order after that - and then similarly afterward with the
>> partial paths.  If we did that, we wouldn't need to store a bitmapset
>> OR two separate lists.  We could just store the index of the first
>> partial plan in the list.  Then you can test whether a path is partial
>> by checking whether this_index >= first_partial_index.

Attached is an updated patch v7, which does the above. Now,
AppendState->subplans has all non-partial subplans followed by all
partial subplans, with the non-partial subplans in the order of
descending total cost. Also, for convenience, the AppendPath now has
similar ordering in its AppendPath->subpaths. So there is a new field
in both Append and AppendPath, first_partial_path/plan, which has
value 0 if there are no non-partial subpaths.

Also, the backend now scans in reverse, so that it does not take up
the most expensive path.

There are also some changes in the costing. Now that we know that
the very first path is the costliest non-partial path, we can use its
total cost as the total cost of the Append in case all the partial
path costs are lower.
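
In other words, the Append's total cost is clamped so that it is never
below the cost of its costliest non-partial subpath, which after the
reordering is the first subpath. A standalone sketch of just that clamp
(illustrative only, not the patch's cost_append() code):

#include <stdio.h>

/*
 * The Append cannot finish before its costliest non-partial subpath
 * (the first subpath after sorting) does.
 */
static double
append_total_cost(double partial_cost_total, double first_nonpartial_cost)
{
    return (partial_cost_total > first_nonpartial_cost) ?
        partial_cost_total : first_nonpartial_cost;
}

int
main(void)
{
    /* All partial path costs are lower, so the non-partial path dominates */
    printf("%.1f\n", append_total_cost(12.0, 20.0));   /* prints 20.0 */
    return 0;
}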

Modified/enhanced an existing test scenario in
src/test/regress/select_parallel.sql so that Parallel Append is
covered.

As suggested by Robert, since pa_info->pa_finished was the only field
in pa_info, removed the ParallelAppendDescData.pa_info structure, and
instead brought pa_info->pa_finished into ParallelAppendDescData.

>>> +static inline void
>>> +exec_append_scan_first(AppendState *appendstate)
>>> +{
>>> +appendstate->as_whichplan = 0;
>>> +}
>>>
>>> I don't think this is buying you anything, and suggest backing it out.
>>
>> This is required for sequential Append, so that we can start executing
>> from the first subplan.
>
> My point is that there's really no point in defining a static inline
> function containing one line of code.  You could just put that line of
> code in whatever places need it, which would probably be more clear.

Did the same.


ParallelAppend_v7.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-03-17 Thread Amit Khandekar
I haven't yet handled all points, but meanwhile, some of the important
points are discussed below ...

On 6 March 2017 at 15:11, Amit Langote  wrote:
>
>>> But that starts to sound less attractive when one realizes that
>>> that will occur for every row that wants to move.
>>
>> If we manage to call ExecSetupPartitionTupleRouting() during execution
>> phase only once for the very first time we find the update requires
>> row movement, then we can re-use the info.
>
> That might work, too.  But I guess we're going with initialization in
> ExecInitModifyTable().

I am more worried about this: even the UPDATEs that do not involve row
movement would do the expensive setup. So do it only once, when we find
that we need to move the row. Something like this:
ExecUpdate()
{

if (resultRelInfo->ri_PartitionCheck &&
  !ExecPartitionCheck(resultRelInfo, slot, estate))
{
  bool  already_deleted;

  ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
 &already_deleted, canSetTag);

  if (already_deleted)
return NULL;
  else
  {
/* If we haven't already built the state for INSERT
 * tuple routing, build it now */
if (!mtstate->mt_partition_dispatch_info)
{
  ExecSetupPartitionTupleRouting(
mtstate->resultRelInfo->ri_RelationDesc,
&mtstate->mt_partition_dispatch_info,
&mtstate->mt_partitions,
&mtstate->mt_partition_tupconv_maps,
&mtstate->mt_partition_tuple_slot,
&mtstate->mt_num_dispatch,
&mtstate->mt_num_partitions);
}

return ExecInsert(mtstate, slot, planSlot, NULL,
  ONCONFLICT_NONE, estate, false);
  }
}
...
}


>
>> One more thing I noticed is that, in case of update-returning, the
>> ExecDelete() will also generate result of RETURNING, which we are
>> discarding. So this is a waste. We should not even process RETURNING
>> in ExecDelete() called for row-movement. The RETURNING should be
>> processed only for ExecInsert().
>
> I wonder if it makes sense to have ExecDeleteInternal() and
> ExecInsertInternal(), which perform the core function of DELETE and
> INSERT, respectively.  Such as running triggers, checking constraints,
> etc.  The RETURNING part is controllable by the statement, so it will be
> handled by the ExecDelete() and ExecInsert(), like it is now.
>
> When called from ExecUpdate() as part of row-movement, they perform just
> the core part and leave the rest to be done by ExecUpdate() itself.

Yes, if we decide to execute only the core insert/delete operations
and skip the triggers, then there is a compelling reason to have
something like ExecDeleteInternal() and ExecInsertInternal(). In fact,
I was about to start doing the same, except for the below discussion
...

On 4 March 2017 at 12:49, Robert Haas  wrote:
> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar  
> wrote:
>> I think it does not make sense running after row triggers in case of
>> row-movement. There is no update happened on that leaf partition. This
>> reasoning can also apply to BR update triggers. But the reasons for
>> having a BR trigger and AR triggers are quite different. Generally, a
>> user needs to do some modifications to the row before getting the
>> final NEW row into the database, and hence [s]he defines a BR trigger
>> for that. And we can't just silently skip this step only because the
>> final row went into some other partition; in fact the row-movement
>> itself might depend on what the BR trigger did with the row. Whereas,
>> AR triggers are typically written for doing some other operation once
>> it is made sure the row is actually updated. In case of row-movement,
>> it is not actually updated.
>
> How about running the BR update triggers for the old partition and the
> AR update triggers for the new partition?  It seems weird to run BR
> update triggers but not AR update triggers.  Another option would be
> to run BR and AR delete triggers and then BR and AR insert triggers,
> emphasizing the choice to treat this update as a delete + insert, but
> (as Amit Kh. pointed out to me when we were in a room together this
> week) that precludes using the BEFORE trigger to modify the row.

I checked the trigger behaviour in the case of UPSERT. Here, when a
conflict is found, ExecOnConflictUpdate() is called, and then the
function returns immediately, which means the AR INSERT trigger will
not fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means
BR and AR UPDATE triggers will be fired. So in short, when an INSERT
becomes an UPDATE, BR INSE

Re: [HACKERS] Parallel Append implementation

2017-03-17 Thread Amit Khandekar
On 17 March 2017 at 01:37, Robert Haas  wrote:
> - You've added a GUC (which is good) but not documented it (which is
> bad) or added it to postgresql.conf.sample (also bad).
>
> - You've used a loop inside a spinlock-protected critical section,
> which is against project policy.  Use an LWLock; define and document a
> new builtin tranche ID.
>
> - The comment for pa_finished claims that it is the number of workers
> executing the subplan, but it's a bool, not a count; I think this
> comment is just out of date.

Yes, agreed. Will fix the above.

>
> - paths_insert_sorted_by_cost() is a hand-coded insertion sort.  Can't
> we find a way to use qsort() for this instead of hand-coding a slower
> algorithm?  I think we could just create an array of the right length,
> stick each path into it from add_paths_to_append_rel, and then qsort()
> the array based on .  Then the result can be
> turned into a list.

Yeah, I was in two minds as to whether to do the
copy-to-array-and-qsort thing, or just write the same number of lines
of code to manually do an insertion sort. Actually, I was checking
whether we already have a linked-list sort, but it seems we don't.
Will do the qsort now since it would be faster.

>
> - Maybe the new helper functions in nodeAppend.c could get names
> starting with exec_append_, to match the style of
> exec_append_initialize_next().
>
> - There's a superfluous whitespace change in add_paths_to_append_rel.

Will fix this.

>
> - The substantive changes in add_paths_to_append_rel don't look right
> either.  It's not clear why accumulate_partialappend_subpath is
> getting called even in the non-enable_parallelappend case.  I don't
> think the logic for the case where we're not generating a parallel
> append path needs to change at all.

When accumulate_partialappend_subpath() is called for a childrel with
a partial path, it works just like accumulate_append_subpath() does
when enable_parallelappend is false. That's why, for a partial child
path, the same function is called irrespective of the parallel-append
or non-parallel-append case. Maybe mentioning this in the comments
should suffice here?

>
> - When parallel append is enabled, I think add_paths_to_append_rel
> should still consider all the same paths that it does today, plus one
> extra.  The new path is a parallel append path where each subpath is
> the cheapest subpath for that childrel, whether partial or
> non-partial.  If !enable_parallelappend, or if all of the cheapest
> subpaths are partial, then skip this.  (If all the cheapest subpaths
> are non-partial, it's still potentially useful.)

In the case of all-partial childrels, the paths are *exactly* the
same as those that would have been created for
enable_parallelappend=off. The extra path is there for
enable_parallelappend=on only when one or more of the child rels do
not have partial paths. Does this make sense?

> In other words,
> don't skip consideration of parallel append just because you have a
> partial path available for every child rel; it could be

I didn't get this. Are you saying that in the patch it is getting
skipped if enable_parallelappend = off? I don't think so. For
all-partial child rels, a partial Append is always created. The only
thing is, in the case of enable_parallelappend=off, the Append path is
not parallel_aware, so it executes just like it executes today under
Gather, without being parallel-aware.

>
> - I think the way cost_append() works is not right.  What you've got
> assumes that you can just multiply the cost of a partial plan by the
> parallel divisor to recover the total cost, which is not true because
> we don't divide all elements of the plan cost by the parallel divisor
> -- only the ones that seem like they should be divided.

Yes, that was an approximation. For those subpaths for which there is
no parallel_divisor, we cannot calculate the total cost considering
the number of workers for the subpath. I feel we should consider the
per-subpath parallel_workers somehow. The Path->total_cost for a
partial path is *always* the per-worker cost, right? Just want to
confirm this assumption of mine.

> Also, it
> could be smarter about what happens with the costs of non-partial
> paths. I suggest the following algorithm instead.
>
> 1. Add up all the costs of the partial paths.  Those contribute
> directly to the final cost of the Append.  This ignores the fact that
> the Append may escalate the parallel degree, but I think we should
> just ignore that problem for now, because we have no real way of
> knowing what the impact of that is going to be.

I wanted to take the per-subpath parallel_workers into account for
the total cost of Append. Suppose the partial subpaths have per-worker
total costs (3, 3, 3) and their parallel_workers are (2, 8, 4), with 2
Append workers available. So according to what you say, the total cost
is 9. With per-subplan parallel_workers taken into account, total cost
= (3*2 + 3*8 + 3*4)/2 = 21.
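
For concreteness, a tiny standalone sketch of the arithmetic above
(illustrative only): weight each partial subpath's per-worker cost by its
own parallel_workers, then divide by the number of workers the Append
itself gets.

#include <stdio.h>

static double
weighted_append_cost(const double *per_worker_cost, const int *workers,
                     int nsubpaths, int append_workers)
{
    double  total = 0.0;

    /* Scale each per-worker cost back up by that subpath's own workers */
    for (int i = 0; i < nsubpaths; i++)
        total += per_worker_cost[i] * workers[i];
    return total / append_workers;
}

int
main(void)
{
    double  costs[] = {3, 3, 3};
    int     workers[] = {2, 8, 4};

    /* (3*2 + 3*8 + 3*4) / 2 = 21 */
    printf("%.0f\n", weighted_append_cost(costs, workers, 3, 2));
    return 0;
}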

Maybe I didn't follow exactly what you suggested

Re: [HACKERS] Parallel Append implementation

2017-03-17 Thread Amit Khandekar
On 16 March 2017 at 18:18, Ashutosh Bapat
 wrote:
> + * Check if we are already finished plans from parallel append. This
> + * can happen if all the subplans are finished when this worker
> + * has not even started returning tuples.
> + */
> +if (node->as_padesc && node->as_whichplan == PA_INVALID_PLAN)
> +return ExecClearTuple(node->ps.ps_ResultTupleSlot);
> From the comment, it looks like this condition will be encountered before the
> backend returns any tuple. But this code is part of the loop which returns the
> tuples. Shouldn't this be outside the loop? Why do we want to check a 
> condition
> for every row returned when the condition can happen only once and that too
> before returning any tuple?

The way ExecProcNode() gets called, there is no special code path
that gets called instead of ExecProcNode() when a tuple is fetched for
the first time. I mean, we cannot prevent ExecProcNode() from getting
called when as_whichplan is invalid right from the beginning.

One thing we can do is: have a special slot in AppendState->as_plan[]
which has some dummy execution node that just returns a NULL tuple,
and initially make as_whichplan point to this slot. But I think it is
not worth doing this.

We can instead reduce the if condition to:
if (node->as_whichplan == PA_INVALID_PLAN)
{
    Assert(node->as_padesc != NULL);
    return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}

BTW, the loop which you mentioned is not for returning tuples; it is
for iterating to the next subplan. Even if we take the condition out
and keep it at the beginning of ExecAppend, the issue will remain.

>
> Why do we need following code in both ExecAppendInitializeWorker() and
> ExecAppendInitializeDSM()? Both of those things happen before starting the
> actual execution, so one of those should suffice?
> +/* Choose the optimal subplan to be executed. */
> +(void) parallel_append_next(node);

ExecAppendInitializeWorker() is for the worker to attach to (and then
initialize its own local data for) the dsm area created and shared by
ExecAppendInitializeDSM() in the backend. But both the worker and the
backend need to initialize their own as_whichplan to the next subplan.

>
> There is no pa_num_worker now, so probably this should get updated. Per 
> comment
> we should also get rid of SpinLockAcquire() and SpinLockRelease()?
> + *purpose. The spinlock is used so that it does not change the
> + *pa_num_workers field while workers are choosing the next node.
Will do this.

>
> BTW, sa_finished seems to be a misnomer. The plan is not finished yet, but it
> wants no more workers. So, should it be renamed to sa_no_new_workers or
> something like that?

Actually in this context, "finished" means "we are done with this subplan".

>
> In parallel_append_next() we shouldn't need to call goto_next_plan() twice. If
> the plan indicated by pa_next_plan is finished, all the plans must have
> finished. This should be true if we set pa_next_plan to 0 at the time of
> initialization. Any worker picking up pa_next_plan will set it to the next
> valid plan. So the next worker asking for plan should pick pa_next_plan and
> set it to the next one and so on.

The current patch does not call it twice, but I might have overlooked
something. Let me know if I have.

>
> I am wondering whether goto_next_plan() can be simplified as some modulo
> arithmetic, e.g. (whichplan - first_plan)++ % (last_plan - first_plan)
> + first_plan.

Hmm. IMHO it seems like too much calculation for just shifting to the next array element.
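
Just to spell out what is being compared, here is a standalone sketch
(illustrative only) of the two equivalent ways of advancing to the next plan
index with wrap-around, assuming whichplan always lies in [first_plan,
last_plan]:

#include <stdio.h>

/* Conditional form: step ahead, wrapping from last_plan back to first_plan */
static int
goto_next_plan(int whichplan, int first_plan, int last_plan)
{
    return (whichplan < last_plan) ? whichplan + 1 : first_plan;
}

/* Modulo form of the same thing */
static int
goto_next_plan_mod(int whichplan, int first_plan, int last_plan)
{
    int     range = last_plan - first_plan + 1;

    return (whichplan - first_plan + 1) % range + first_plan;
}

int
main(void)
{
    /* Walk indexes 2..5 twice, starting at 2, with both variants */
    for (int i = 0, p = 2, q = 2; i < 8; i++)
    {
        printf("%d %d\n", p, q);
        p = goto_next_plan(p, 2, 5);
        q = goto_next_plan_mod(q, 2, 5);
    }
    return 0;
}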




Re: [HACKERS] Parallel Append implementation

2017-03-20 Thread Amit Khandekar
>> 2. Next, estimate the cost of the non-partial paths.  To do this, make
>> an array of Cost of that length and initialize all the elements to
>> zero, then add the total cost of each non-partial plan in turn to the
>> element of the array with the smallest cost, and then take the maximum
>> of the array elements as the total cost of the non-partial plans.  Add
>> this to the result from step 1 to get the total cost.
>
> So with costs (8, 5, 2), add 8 and 5 to 2 so that it becomes (8, 5,
> 15) , and so the max is 15 ? I surely am misinterpreting this.
>
> Actually, I couldn't come up with a general formula to find the
> non-partial paths total cost, given the per-subplan cost and number of
> workers. I mean, we can manually find out the total cost, but turning
> it into a formula seems quite involved. We can even do a dry-run of
> workers consuming each of the subplan slots and find the total
> time units taken, but finding some approximation seemed ok.
>
> For e.g. we can manually find total time units taken for following :
> costs (8, 2, 2, 2) with 2 workers : 8
> costs (6, 6, 4, 1) with 2 workers : 10.
> costs (6, 6, 4, 1) with 3 workers : 6.
>
> But coming up with an algorithm or a formula didn't look worth it. So I
> just did the total cost and divided it by workers. And besides that,
> took the maximum of the 1st plan cost (since it is the highest) and
> the average of total. I understand it would be too much approximation
> for some cases, but another thing is, we don't know how to take into
> account some of the workers shifting to partial workers. So the shift
> may be quite fuzzy since all workers may not shift to partial plans
> together.


For non-partial paths, I did some comparison between the actual cost
and the cost obtained by adding the per-subpath figures and dividing by
the number of workers. In the below cases, they do not differ
significantly. Here are the figures:

Case 1 :
Cost units of subpaths : 20 16 10 8 3 1.
Workers : 3
Actual total time to finish all workers : 20.
total/workers: 16.

Case 2 :
Cost units of subpaths : 20 16 10 8 3 1.
Workers : 2
Actual total time to finish all workers : 34.
total/workers: 32.

Case 3 :
Cost units of subpaths : 5 3 3 3 3
Workers : 3
Actual total time to finish all workers : 6
total/workers: 5.6

One more thing observed is that, in all of the above cases, all the
workers more or less finish at about the same time.

So this method seems to compare well with the actual cost. The average
comes out a little less than the actual. But I think what I need to
correct in the patch is to calculate separate per-worker costs for the
non-partial and the partial subpaths, and add them. This will give us
the per-worker total cost, which is what a partial Append cost should
be. I had just added all the costs together.

There can be some extreme cases such as (5, 1, 1, 1, 1, 1) with 6
workers, where it will take at least 5 units, but the average is 2. For
that we can clamp the cost to the first path's cost, so that, for
example, it does not go lower than 5 in this case.

Actually, I have devised an algorithm to calculate the exact time when
all workers finish the non-partial subpaths. But I think it does not
make sense to apply it, because it may be too much calculation cost for
hundreds of paths.

But anyways, for archival purpose, here is the algorithm :

Per-subpath cost : 20 16 10 8 3 1, with 3 workers.
After 10 units (this is minimum of 20, 16, 10), the times remaining are :
10  6  0 8 3 1
After 6 units (minimum of 10, 06, 08), the times remaining are :
4  0  0 2 3 1
After 2 units (minimum of 4, 2, 3), the times remaining are :
 2  0  0 0 1 1
After 1 units (minimum of 2, 1, 1), the times remaining are :
 1  0  0 0 0 0
After 1 units (minimum of 1, 0 , 0), the times remaining are :
 0  0  0 0 0 0
Now add up above time chunks : 10 + 6 + 2 + 1 + 1 = 20
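
For what it's worth, the same figure can be reproduced with a short
standalone simulation (illustrative only): hand each non-partial subplan,
largest first, to the currently least-loaded worker and take the maximum
load at the end; this is essentially the add-to-the-smallest-element
estimate quoted above. For the costs above (20 16 10 8 3 1 with 3 workers)
it also gives 20.

#include <stdio.h>

static double
nonpartial_append_cost(const double *costs, int ncosts, int nworkers)
{
    double  load[64] = {0};     /* assumes nworkers <= 64, for brevity */
    double  max = 0.0;

    /* Give each subplan (costs assumed in descending order) to the
     * currently least-loaded worker. */
    for (int i = 0; i < ncosts; i++)
    {
        int     min = 0;

        for (int w = 1; w < nworkers; w++)
            if (load[w] < load[min])
                min = w;
        load[min] += costs[i];
    }

    for (int w = 0; w < nworkers; w++)
        if (load[w] > max)
            max = load[w];
    return max;
}

int
main(void)
{
    double  costs[] = {20, 16, 10, 8, 3, 1};    /* descending order */

    printf("%.0f\n", nonpartial_append_cost(costs, 6, 3));  /* prints 20 */
    return 0;
}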

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] [PATCH] introduce XLogLockBlockRangeForCleanup()

2014-07-02 Thread Amit Khandekar
On 13 June 2014 14:10, Abhijit Menon-Sen  wrote:

> nbtxlog.c:btree_xlog_vacuum() contains the following comment:
>
>  * XXX we don't actually need to read the block, we just need to
>  * confirm it is unpinned. If we had a special call into the
>  * buffer manager we could optimise this so that if the block is
>  * not in shared_buffers we confirm it as unpinned.
>
> The attached patch introduces an XLogLockBlockRangeForCleanup() function
> that addresses this, as well as a "special call into the buffer manager"
> that makes it possible to lock the buffers without reading them.
>

In GetBufferWithoutRelcache(), I was wondering whether, rather than
calling PinBuffer(), we could do this:
LockBufHdr(buf);
PinBuffer_Locked(buf);
valid = ((buf->flags & BM_VALID) != 0);
then we can avoid having the new buffer access strategy BAS_DISCARD that is
introduced in this patch. And so the code changes in freelist.c would not
be necessary.

Will give further thought on the overall logic in
XLogLockBlockRangeForCleanup().


> will follow up with some performance numbers soon.
>

Yes, that would be nice.

-Amit


Re: [HACKERS] [PATCH] introduce XLogLockBlockRangeForCleanup()

2014-07-04 Thread Amit Khandekar
On 3 July 2014 16:59, Simon Riggs  wrote:

>
> I think we should say this though
>
> LockBufHdr(buf);
> valid = ((buf->flags & BM_VALID) != 0);
> if (valid)
> PinBuffer_Locked(buf);
> else
> UnlockBufHdr(buf);
>
> since otherwise we would access the buffer flags without the spinlock
> and we would leak a pin if the buffer was not valid
>

Ah right. That is essential.


Re: [HACKERS] [PATCH] introduce XLogLockBlockRangeForCleanup()

2014-07-07 Thread Amit Khandekar
On 4 July 2014 19:11, Abhijit Menon-Sen  wrote:

> Updated patch attached, thanks.
>
> Amit: what's your conclusion from the review?
>

Other than the minor comments mentioned below, I don't have any more
issues; it all looks good.

The XLogLockBlockRangeForCleanup() function header comment has the
function name spelled XLogBlockRangeForCleanup.

In GetBufferWithoutRelcache(), we can call BufferDescriptorGetBuffer(buf)
rather than BufferDescriptorGetBuffer(&BufferDescriptors[buf_id]).



> -- Abhijit
>


Re: [HACKERS] delta relations in AFTER triggers

2014-08-07 Thread Amit Khandekar
On 21 June 2014 23:36, Kevin Grittner  wrote:
> Kevin Grittner  wrote:
> I didn't change the tuplestores to TID because it seemed to me that
> it would preclude using transition relations with FDW triggers, and
> it seemed bad not to support that.  Does anyone see a way around
> that, or feel that it's OK to not support FDW triggers in this
> regard?

I think it is ok to use tuplestores for now, but as mentioned by you
somewhere else in the thread, later on we should change this to using
tids as an optimization.

>
> Does this look good otherwise, as far as it goes?

I haven't yet gone through the patch extensively, but before that, just a
few quick comments:

I see that the tuplestores for transition tables are stored per query
depth. If the DML involves a table that has multiple child tables, it
seems as though a single tuplestore would be shared among all these
tables. That means if we define such triggers using the transition table
clause for all the child tables, then the trigger function for a child
table will see tuples from the other child tables as well. Is that true?
If it is, it does not make sense.
For the FDW tuplestore, this issue does not arise, because the DML won't
involve multiple target tables, I suppose.

---

I tried to search for some SQL examples that use the REFERENCING clause
with triggers. It looks like in some database systems even the WHEN
clause of CREATE TRIGGER can refer to a transition table, just as it can
refer to the NEW and OLD row variables.

For example :
CREATE TRIGGER notify_dept
  AFTER UPDATE ON weather
  REFERENCING NEW_TABLE AS N_TABLE
  NEW AS N_ROW
  FOR EACH ROW
  WHEN ((SELECT AVG (temperature) FROM N_TABLE) > 10)
  BEGIN
notify_department(N_ROW.temperature, N_ROW.city);
  END

Above, the transition table is used to get an aggregate value over all
the changed rows. I think we do not currently support such aggregate
expressions in the WHEN clause, but with transition tables it makes more
sense to support them later if not now.




Re: [HACKERS] delta relations in AFTER triggers

2014-08-11 Thread Amit Khandekar
On 7 August 2014 19:49, Kevin Grittner  wrote:
> Amit Khandekar  wrote:
>> On 21 June 2014 23:36, Kevin Grittner  wrote:
>>> Kevin Grittner  wrote:
>>> I didn't change the tuplestores to TID because it seemed to me that
>>> it would preclude using transition relations with FDW triggers, and
>>> it seemed bad not to support that.  Does anyone see a way around
>>> that, or feel that it's OK to not support FDW triggers in this
>>> regard?
>>
>> I think it is ok to use tuplestores for now, but as mentioned by you
>> somewhere else in the thread, later on we should change this to using
>> tids as an optimization.
>
> Well, the optimization would probably be to use a tuplestore of
> tids referencing modified tuples in the base table, rather than a
> tuplestore of the data itself.  But I think we're in agreement.
Right, that's what I meant.

>> I see that the tuplestores for transition tables are stored per query
>> depth. If the
>> DML involves a table that has multiple child tables, it seems as though
>> a single tuplestore would be shared among all these tables. That means
>> if we define
>> such triggers using transition table clause for all the child tables, then
>> the trigger function for a child table will see tuples from other child 
>> tables
>> as well. Is that true ?
>
> I don't think so.  I will make a note of the concern to confirm by testing.
Thanks. I will wait for this.

>
>> I tried to google some SQLs that use REFERENCING clause with triggers.
>> It looks like in some database systems, even the WHEN clause of CREATE 
>> TRIGGER
>> can refer to a transition table, just like how it refers to NEW and
>> OLD row variables.
>>
>> For e.g. :
>> CREATE TRIGGER notify_dept
>>   AFTER UPDATE ON weather
>>   REFERENCING NEW_TABLE AS N_TABLE
>>   NEW AS N_ROW
>>   FOR EACH ROW
>>   WHEN ((SELECT AVG (temperature) FROM N_TABLE) > 10)
>>   BEGIN
>> notify_department(N_ROW.temperature, N_ROW.city);
>>   END
>>
>> Above, it is used to get an aggregate value of all the changed rows. I think
>> we do not currently support aggregate expressions in the where clause, but 
>> with
>> transition tables, it makes more sense to support it later if not now.
>
> Interesting point; I had not thought about that.  Will see if I can
> include support for that in the patch for the next CF; failing
> that; I will at least be careful to not paint myself into a corner
> where it is unduly hard to do later.
We currently do the WHEN checks while saving the AFTER trigger events,
and we also add the tuples one by one while saving those events. If and
when we support transition tables in WHEN clauses, we would need all of
these tuples to be saved *before* the first WHEN clause execution, and
that seems to demand more changes in the current trigger code.

More comments below :

---

Are we later going to extend this support to constraint triggers as
well ? I think these transition relations would make sense even for
deferred constraint triggers. I think we would need some more changes if
we want to support this, because there is no query depth while executing
deferred triggers, and so the tuplestores might be inaccessible with the
current design.

---

The following statement (and other similar ones) :
trigdesc->trig_insert_new_table |=
(TRIGGER_FOR_INSERT(tgtype) &&
TRIGGER_USES_TRANSITION_TABLE(trigger->tgnewtable)) ? true : false;

can be simplified to :

trigdesc->trig_insert_new_table |=
(TRIGGER_FOR_INSERT(tgtype) &&
TRIGGER_USES_TRANSITION_TABLE(trigger->tgnewtable));

-

AfterTriggerExecute()
{
.
/*
* Set up the tuplestore information.
*/
if (trigdesc->trig_delete_old_table || trigdesc->trig_update_old_table)
LocTriggerData.tg_olddelta =
GetCurrentTriggerDeltaTuplestore(afterTriggers->old_tuplestores);
.
}
Above, trigdesc->trig_update_old_table is true if at least one of the
table's triggers uses transition tables. So this will make the delta
table available to all of the table's triggers even if only one of them
uses transition tables. Maybe this would be solved by using
LocTriggerData.tg_trigger->tg_olddelta rather than
trigdesc->trig_update_old_table to decide whether
LocTriggerData.tg_olddelta should be assigned.

---

GetCurrentTriggerDeltaTuplestore() is now also used to get the FDW
tuplestore, so it should have a more general name.

---

#define TRIGGER_USES_TRANSITION_TABLE(namepointer) \
((namepointer) != (char *) NULL && (*(namepointer)) != '\0')
Since all other code sections assume trigger->tgoldtable to be
non-NULL, we can skip the NULL check above.
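
That is, assuming tgoldtable/tgnewtable are always non-NULL wherever the
macro is used, it could reduce to just :

#define TRIGGER_USES_TRANSITION_TABLE(namepointer) \
((*(namepointer)) != '\0')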

---


Re: [HACKERS] delta relations in AFTER triggers

2014-08-14 Thread Amit Khandekar
On 12 August 2014 20:09, Kevin Grittner  wrote:
> Amit Khandekar  wrote:
>> On 7 August 2014 19:49, Kevin Grittner  wrote:
>>> Amit Khandekar  wrote:
>
>>>> I tried to google some SQLs that use REFERENCING clause with triggers.
>>>> It looks like in some database systems, even the WHEN clause of CREATE 
>>>> TRIGGER
>>>> can refer to a transition table, just like how it refers to NEW and
>>>> OLD row variables.
>>>>
>>>> For e.g. :
>>>> CREATE TRIGGER notify_dept
>>>>   AFTER UPDATE ON weather
>>>>   REFERENCING NEW_TABLE AS N_TABLE
>>>>   NEW AS N_ROW
>>>>   FOR EACH ROW
>>>>   WHEN ((SELECT AVG (temperature) FROM N_TABLE) > 10)
>>>>   BEGIN
>>>> notify_department(N_ROW.temperature, N_ROW.city);
>>>>   END
>>>>
>>>> Above, it is used to get an aggregate value of all the changed rows. I 
>>>> think
>>>> we do not currently support aggregate expressions in the where clause, but 
>>>> with
>>>> transition tables, it makes more sense to support it later if not now.
>>>
>>> Interesting point; I had not thought about that.  Will see if I can
>>> include support for that in the patch for the next CF; failing
>>> that; I will at least be careful to not paint myself into a corner
>>> where it is unduly hard to do later.
>> We currently do the WHEN checks while saving the AFTER trigger events,
>> and also add the tuples one by one while saving the trigger events. If
>> and when we support WHEN, we would need to make all of these tuples
>> saved *before* the first WHEN clause execution, and that seems to
>> demand more changes in the current trigger code.
>
> In that case my inclination is to get things working with the less
> invasive patch that doesn't try to support transition table usage
> in WHEN clauses, and make support for that a later patch.
Agreed.
>
>> ---
>>
>> Are we later going to extend this support for constraint triggers as
>> well ? I think these variables would make sense even for deferred
>> constraint triggers. I think we would need some more changes if we
>> want to support this, because there is no query depth while executing
>> deferred triggers and so the tuplestores might be inaccessible with
>> the current design.
>
> Hmm, I would also prefer to exclude that from an initial patch, but
> this and the WHEN clause issue may influence a decision I've been
> struggling with.  This is my first non-trivial foray into the
> planner territory, and I've been struggling with how best to bridge
> the gap between where the tuplestores are *captured* in the trigger
> code and where they are referenced by name in a query and
> incorporated into a plan for the executor.  (The execution level
> itself was almost trivial; it's getting the tuplestore reference
> through the parse analysis and planning phases that is painful for
> me.
I am not sure why you think we would need to refer to the tuplestore in
the parse analysis and planner phases. It seems that we would need it
only in the execution phase. Or maybe I didn't understand your point.

> )  At one point I create a "tuplestore registry" using a
> process-local hashmap where each Tuplestorestate and its associated
> name, TupleDesc, etc. would be registered, yielding a Tsrid (based
> on Oid) to use through the parse analysis and planning steps, but
> then I ripped it back out again in favor of just passing the
> pointer to the structure which was stored in the registry; because
> the registry seemed to me to introduce more risk of memory leaks,
> references to freed memory, etc.  While it helped a little with
> abstraction, it seemed to make things more fragile.  But I'm still
> torn on this, and unsure whether such a registry is a good idea.
I feel it is OK to use direct tuplestore pointers as handles, the way
you have done in the patch. I may be biased in favour of that over the
above method of accessing a tuplestore by its name through a hash
lookup; the reason for my bias is probably one particular way in which I
see deferred constraint triggers being supported. In the after-trigger
event structure, we can store these delta tuplestore pointers as well.
This way, we don't need to worry about how to look up these tuplestores,
and we also need not worry about the mechanism that moves these events
from the deferred event list to the immediate event list in the case of
SET CONSTRAINTS. The only thing we would need to make sure of is to
clean up these tuplestores exactly where the event structures get
cleaned up.

This

Re: [HACKERS] delta relations in AFTER triggers

2014-08-14 Thread Amit Khandekar
>> The execution level
>> itself was almost trivial; it's getting the tuplestore reference
>> through the parse analysis and planning phases that is painful for
>> me.
> I am not sure why you think we would need to refer the tuplestore in
> the parse analysis and planner phases. It seems that we would need
> them only in execution phase. Or may be I didn't understand your
> point.
Ah I think I understand now. That might be because you are thinking of
having an infrastructure common to triggers and materialized views,
right ?




Re: [HACKERS] delta relations in AFTER triggers

2014-08-17 Thread Amit Khandekar
On 15 August 2014 04:04, Kevin Grittner  wrote:
> Amit Khandekar  wrote:
>
>>>> The execution level itself was almost trivial; it's getting the
>>>> tuplestore reference through the parse analysis and planning
>>>> phases that is painful for me.
>>> I am not sure why you think we would need to refer the
>>> tuplestore in the parse analysis and planner phases. It seems
>>> that we would need them only in execution phase. Or may be I
>>> didn't understand your point.
>> Ah I think I understand now. That might be because you are
>> thinking of having an infrastructure common to triggers and
>> materialized views, right ?
>
> Well, it's more immediate than that.  The identifiers in the
> trigger function are not resolved to particular objects until there
> is a request to fire the trigger.
Ah OK, you are talking about changes specific to the PL language
handlers. Yes, I agree that in the plpgsql parser (and in any PL
handler) we need to parse such table references in the SQL construct and
transform them into something else.
> At that time the parse analysis
> needs to find the name defined somewhere.  It's not defined in the
> catalogs like a table or view, and it's not defined in the query
> itself like a CTE or VALUES clause.  The names specified in trigger
> creation must be recognized as needing to resolve to the new
> TuplestoreScan, and it needs to match those to the tuplestores
> themselves.
One approach that comes to my mind is to transform such transition
table references into a RangeFunction reference in the plpgsql
parser/lexer. This RangeFunction would point to a set-returning catalog
function that returns rows from the delta tuplestore. So the tuplestore
itself would remain a black box. Once we have such a function, any
language handler can reuse the same interface.

> Row counts, costing, etc. needs to be provided so the
> optimizer can pick a good plan in what might be a complex query
> with many options.
I am also not sure about the costing, but I guess it may be possible
to supply some costs to the FunctionScan plan node.
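
Just to illustrate the shape of the interface I have in mind, below is
a very rough sketch of such a set-returning function.
LookupTriggerDeltaTuplestore() is purely hypothetical (that lookup is
exactly the part that is not designed yet), and real code would at least
also check rsinfo->allowedModes; this is only to show the idea, not
actual patch code.

#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"
#include "nodes/execnodes.h"
#include "utils/builtins.h"
#include "utils/tuplestore.h"

PG_MODULE_MAGIC;

/* hypothetical: find the delta tuplestore registered under this name */
extern Tuplestorestate *LookupTriggerDeltaTuplestore(const char *name,
                                                     TupleDesc *tupdesc);

PG_FUNCTION_INFO_V1(transition_table_rows);

Datum
transition_table_rows(PG_FUNCTION_ARGS)
{
    char       *name = text_to_cstring(PG_GETARG_TEXT_PP(0));
    ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
    TupleDesc   tupdesc;

    /* hand the existing delta tuplestore back in materialize mode */
    rsinfo->returnMode = SFRM_Materialize;
    rsinfo->setResult = LookupTriggerDeltaTuplestore(name, &tupdesc);
    rsinfo->setDesc = tupdesc;

    PG_RETURN_NULL();
}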




Re: [HACKERS] Parallel Append implementation

2017-03-22 Thread Amit Khandekar
Attached is the updated patch that handles the changes for all the
comments except the cost changes part. Details about the specific
changes are after the cost-related points discussed below.

>> I wanted to take into account per-subpath parallel_workers for total
>> cost of Append. Suppose the partial subpaths have per worker total
>> costs (3, 3, 3) and their parallel_workers are (2, 8, 4), with 2
>> Append workers available. So according to what you say, the total cost
>> is 9. With per-subplan parallel_workers taken into account, total cost
>> = (3*2 + 3*8 + 3*4)/2 = 21.
> But that case never happens, because the parallel workers for the
> append is always at least as large as the number of workers for any
> single child.

Yeah, that's right. I will use this approach for partial paths.


For non-partial paths, I was checking the following 3 options :

Option 1. Just take the sum of the total non-partial child costs and
divide it by the number of workers. It seems to get close to the
actual cost.

Option 2. Calculate the exact cost by an algorithm which I mentioned
before, pasted below for reference :
Per-subpath cost : 20 16 10 8 3 1, with 3 workers.
After 10 time units (this is minimum of first 3 i.e. 20, 16, 10), the
times remaining are :
10  6  0 8 3 1
After 6 units (minimum of 10, 06, 08), the times remaining are :
4  0  0 2 3 1
After 2 units (minimum of 4, 2, 3), the times remaining are :
 2  0  0 0 1 1
After 1 units (minimum of 2, 1, 1), the times remaining are :
 1  0  0 0 0 0
After 1 units (minimum of 1, 0 , 0), the times remaining are :
 0  0  0 0 0 0
Now add up above time chunks : 10 + 6 + 2 + 1 + 1 = 20

Option 3. Get some approximation formula like you suggested. I am also
looking for such a formula, just that some things are not clear to me.
The discussion of this is below ...
>>> 2. Next, estimate the cost of the non­partial paths.  To do this, make
>>> an array of Cost of that length and initialize all the elements to
>>> zero, then add the total cost of each non­partial plan in turn to the
>>> element of the array with the smallest cost, and then take the maximum
>>> of the array elements as the total cost of the non­partial plans.  Add
>>> this to the result from step 1 to get the total cost.
>>
>> So with costs (8, 5, 2), add 8 and 5 to 2 so that it becomes (8, 5,
>> 15) , and so the max is 15 ? I surely am misinterpreting this.
> No.  If you have costs 8, 5, and 2 and only one process, cost is 15.
> If you have two processes then for costing purposes you assume worker
> 1 will execute the first path (cost 8) and worker 2 will execute the
> other two (cost 5 + 2 = 7), so the total cost is 8.  If you have three
> workers, the cost will still be 8, because there's no way to finish
> the cost-8 path in less than 8 units of work.

So the part you suggested about adding each total cost in turn to the
smallest element; does this suggestion apply only to the 1-worker case ?
For more than one worker, are you suggesting an algorithm similar
to the one I suggested in option 2 above ? If yes, it would be great
if you could describe again how that works for multiple workers. Or
were you suggesting some simple approximate arithmetic that applies
to multiple workers ?
Like I mentioned, I would be happy to have such simple approximation
arithmetic that can be applied to the multiple-worker case. The logic
I suggested in option 2 is something we can keep as the last option,
and option 1 is also an approximation, but we would like a better
approximation than that. So I wanted to clear up my queries regarding
option 3.
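
(For what it's worth, the option 2 calculation can be written down as a
small standalone simulation; the costs and worker count below are just
the numbers from the example above, nothing from the patch.)

#include <stdio.h>

int
main(void)
{
    double  cost[] = {20, 16, 10, 8, 3, 1};    /* per-subpath total costs */
    int     nplans = sizeof(cost) / sizeof(cost[0]);
    int     nworkers = 3;
    double  elapsed = 0.0;

    for (;;)
    {
        double  min_remaining = 0.0;
        int     nrunning = 0;
        int     i;

        /* the first 'nworkers' unfinished subplans run concurrently */
        for (i = 0; i < nplans && nrunning < nworkers; i++)
        {
            if (cost[i] > 0)
            {
                if (nrunning == 0 || cost[i] < min_remaining)
                    min_remaining = cost[i];
                nrunning++;
            }
        }

        if (nrunning == 0)
            break;              /* all subplans are done */

        /* advance time until the quickest running subplan finishes */
        elapsed += min_remaining;
        nrunning = 0;
        for (i = 0; i < nplans && nrunning < nworkers; i++)
        {
            if (cost[i] > 0)
            {
                cost[i] -= min_remaining;
                nrunning++;
            }
        }
    }

    printf("estimated elapsed cost: %g\n", elapsed);    /* prints 20 */
    return 0;
}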

--

Details about all the remaining changes in the updated patch are below ...

On 20 March 2017 at 17:29, Robert Haas  wrote:
> On Fri, Mar 17, 2017 at 1:12 PM, Amit Khandekar  
> wrote:
>>> - The substantive changes in add_paths_to_append_rel don't look right
>>> either.  It's not clear why accumulate_partialappend_subpath is
>>> getting called even in the non-enable_parallelappend case.  I don't
>>> think the logic for the case where we're not generating a parallel
>>> append path needs to change at all.
>>
>> When accumulate_partialappend_subpath() is called for a childrel with
>> a partial path, it works just like accumulate_append_subpath() when
>> enable_parallelappend is false. That's why, for partial child path,
>> the same function is called irrespective of parallel-append or
>> non-parallel-append case. May be mentioning this in comments should
>> suffice here ?
>
> I don't get it.  If you can get the same effect by changing something
> or not changing it, presumably it'd be better to not change it.   We
> try not to change things just because we can; the change should be an
> improvement.

Re: [HACKERS] UPDATE of partition key

2017-03-22 Thread Amit Khandekar
On 17 March 2017 at 16:07, Amit Khandekar  wrote:
> On 6 March 2017 at 15:11, Amit Langote  wrote:
>>
>>>> But that starts to sound less attractive when one realizes that
>>>> that will occur for every row that wants to move.
>>>
>>> If we manage to call ExecSetupPartitionTupleRouting() during execution
>>> phase only once for the very first time we find the update requires
>>> row movement, then we can re-use the info.
>>
>> That might work, too.  But I guess we're going with initialization in
>> ExecInitModifyTable().
>
> I am more worried about this: even the UPDATEs that do not involve row
> movement would do the expensive setup. So do it only once when we find
> that we need to move the row. Something like this :
> ExecUpdate()
> {
> 
> if (resultRelInfo->ri_PartitionCheck &&
>   !ExecPartitionCheck(resultRelInfo, slot, estate))
> {
>   bool  already_deleted;
>
>   ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
>  &already_deleted, canSetTag);
>
>   if (already_deleted)
> return NULL;
>   else
>   {
> /* If we haven't already built the state for INSERT
>  * tuple routing, build it now */
> if (!mtstate->mt_partition_dispatch_info)
> {
>   ExecSetupPartitionTupleRouting(
> mtstate->resultRelInfo->ri_RelationDesc,
> &mtstate->mt_partition_dispatch_info,
> &mtstate->mt_partitions,
> &mtstate->mt_partition_tupconv_maps,
> &mtstate->mt_partition_tuple_slot,
> &mtstate->mt_num_dispatch,
> &mtstate->mt_num_partitions);
> }
>
> return ExecInsert(mtstate, slot, planSlot, NULL,
>   ONCONFLICT_NONE, estate, false);
>   }
> }
> ...
> }

Attached is the v2 patch, which implements the above optimization. Now,
for UPDATE, ExecSetupPartitionTupleRouting() will be called only if row
movement is needed.

We have to open an extra relation for the root partition and keep it
open, with its handle stored in mt_partition_dispatch_info[0]->reldesc.
So ExecEndModifyTable() closes this relation if it is different from
node->resultRelInfo->ri_RelationDesc. If it is the same as
node->resultRelInfo's relation, it should not be closed there, because
it gets closed as part of ExecEndPlan().
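
In code, the close logic is roughly the following (a sketch only; the
field names are the ones from the patch, so treat the details as
tentative) :

if (node->mt_partition_dispatch_info != NULL)
{
    Relation    root_rel = node->mt_partition_dispatch_info[0]->reldesc;

    /* close the root rel opened for tuple routing, if we opened it */
    if (root_rel != node->resultRelInfo->ri_RelationDesc)
        heap_close(root_rel, NoLock);
    /* else it is closed later as part of ExecEndPlan() */
}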


update-partition-key_v2.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-03-23 Thread Amit Khandekar
On 23 March 2017 at 05:55, Robert Haas  wrote:
> On Wed, Mar 22, 2017 at 4:49 AM, Amit Khandekar  
> wrote:
>> Attached is the updated patch that handles the changes for all the
>> comments except the cost changes part. Details about the specific
>> changes are after the cost-related points discussed below.
>>
>> For non-partial paths, I was checking following 3 options :
>>
>> Option 1. Just take the sum of total non-partial child costs and
>> divide it by number of workers. It seems to be getting close to the
>> actual cost.
>
> If the costs for all children are about equal, then that works fine.
> But when they are very unequal, then it's highly misleading.
>
>> Option 2. Calculate exact cost by an algorithm which I mentioned
>> before, which is pasted below for reference :
>> Per-subpath cost : 20 16 10 8 3 1, with 3 workers.
>> After 10 time units (this is minimum of first 3 i.e. 20, 16, 10), the
>> times remaining are :
>> 10  6  0 8 3 1
>> After 6 units (minimum of 10, 06, 08), the times remaining are :
>> 4  0  0 2 3 1
>> After 2 units (minimum of 4, 2, 3), the times remaining are :
>>  2  0  0 0 1 1
>> After 1 units (minimum of 2, 1, 1), the times remaining are :
>>  1  0  0 0 0 0
>> After 1 units (minimum of 1, 0 , 0), the times remaining are :
>>  0  0  0 0 0 0
>> Now add up above time chunks : 10 + 6 + 2 + 1 + 1 = 20
>

> This gives the same answer as what I was proposing

Ah I see.

> but I believe it's more complicated to compute.
Yes, a bit, particularly because in my algorithm I would have to do 'n'
subtractions each time in the case of 'n' workers. But it looked more
natural because it follows exactly the way we would calculate manually.

> The way my proposal would work in this
> case is that we would start with an array C[3] (since there are three
> workers], with all entries 0.  Logically C[i] represents the amount of
> work to be performed by worker i.  We add each path in turn to the
> worker whose array entry is currently smallest; in the case of a tie,
> just pick the first such entry.
>
> So in your example we do this:
>
> C[0] += 20;
> C[1] += 16;
> C[2] += 10;
> /* C[2] is smaller than C[0] or C[1] at this point, so we add the next
> path to C[2] */
> C[2] += 8;
> /* after the previous line, C[1] is now the smallest, so add to that
> entry next */
> C[1] += 3;
> /* now we've got C[0] = 20, C[1] = 19, C[2] = 18, so add to C[2] */
> C[2] += 1;
> /* final result: C[0] = 20, C[1] = 19, C[2] = 19 */
>
> Now we just take the highest entry that appears in any array, which in
> this case is C[0], as the total cost.

Wow. The way your final result exactly tallies with my algorithm
result is very interesting. This looks like some maths or computer
science theory that I am not aware of.
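
For reference, that assignment rule is simple enough to jot down as a
small standalone program (the costs and worker count are just the ones
from the example, and the costs happen to be listed in descending
order; this is only illustrative, not patch code) :

#include <stdio.h>

int
main(void)
{
    double  cost[] = {20, 16, 10, 8, 3, 1};     /* non-partial subpath costs */
    int     nplans = sizeof(cost) / sizeof(cost[0]);
    int     nworkers = 3;
    double  work[3] = {0, 0, 0};    /* work assigned to each worker */
    double  total = 0.0;
    int     i, w;

    for (i = 0; i < nplans; i++)
    {
        int     min_w = 0;

        /* assign each path to the worker with the least work so far */
        for (w = 1; w < nworkers; w++)
            if (work[w] < work[min_w])
                min_w = w;
        work[min_w] += cost[i];
    }

    /* the most heavily loaded worker determines the non-partial cost */
    for (w = 0; w < nworkers; w++)
        if (work[w] > total)
            total = work[w];

    printf("non-partial cost estimate: %g\n", total);   /* prints 20 */
    return 0;
}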

I am currently coding the algorithm using your method. Meanwhile
attached is a patch that takes care of your other comments, details of
which are below...

>
> In my previous review, I said that you should "define and document a
> new builtin tranche ID"; you did the first but not the second.  See
> the table in monitoring.sgml.

Yeah, I tried to search for how this was done for TBM in the source,
but I guess I didn't run the "git grep" commands correctly, so the
results did not include monitoring.sgml, and I thought maybe you meant
something else by "document".

Added changes in monitoring.sgml now.

>
> Definition of exec_append_goto_next_plan should have a line break
> after the return type, per usual PostgreSQL style rules.

Oops. Done.

>
> - * initialize to scan first subplan
> + * In case it's a sequential Append, initialize to scan first subplan.
>
> This comment is confusing because the code is executed whether it's
> parallel or not.  I think it might be better to write something like
> "initialize to scan first subplan (but note that we'll override this
> later in the case of a parallel append)"
Done.

>
>  /*
> + * Check if we are already finished plans from parallel append. This
> + * can happen if all the subplans are finished when this worker
> + * has not even started returning tuples.
> + */
> +if (node->as_padesc && node->as_whichplan == PA_INVALID_PLAN)
> +return ExecClearTuple(node->ps.ps_ResultTupleSlot);
>
> There seems to be no reason why this couldn't be hoisted out of the
> loop.  Actually, I think Ashutosh pointed this out before, but I
> didn't understand at that time what his point was.  Looking back, I
> see that he also pointed out that the as_padesc test isn't necessary,
> which is also true.

Re: [HACKERS] Parallel Append implementation

2017-03-23 Thread Amit Khandekar
On 23 March 2017 at 16:26, Amit Khandekar  wrote:
> On 23 March 2017 at 05:55, Robert Haas  wrote:
>> On Wed, Mar 22, 2017 at 4:49 AM, Amit Khandekar 
wrote:
>>> Attached is the updated patch that handles the changes for all the
>>> comments except the cost changes part. Details about the specific
>>> changes are after the cost-related points discussed below.
>>>
>>> For non-partial paths, I was checking following 3 options :
>>>
>>> Option 1. Just take the sum of total non-partial child costs and
>>> divide it by number of workers. It seems to be getting close to the
>>> actual cost.
>>
>> If the costs for all children are about equal, then that works fine.
>> But when they are very unequal, then it's highly misleading.
>>
>>> Option 2. Calculate exact cost by an algorithm which I mentioned
>>> before, which is pasted below for reference :
>>> Per-subpath cost : 20 16 10 8 3 1, with 3 workers.
>>> After 10 time units (this is minimum of first 3 i.e. 20, 16, 10), the
>>> times remaining are :
>>> 10  6  0 8 3 1
>>> After 6 units (minimum of 10, 06, 08), the times remaining are :
>>> 4  0  0 2 3 1
>>> After 2 units (minimum of 4, 2, 3), the times remaining are :
>>>  2  0  0 0 1 1
>>> After 1 units (minimum of 2, 1, 1), the times remaining are :
>>>  1  0  0 0 0 0
>>> After 1 units (minimum of 1, 0 , 0), the times remaining are :
>>>  0  0  0 0 0 0
>>> Now add up above time chunks : 10 + 6 + 2 + 1 + 1 = 20
>>
>
>> This gives the same answer as what I was proposing
>
> Ah I see.
>
>> but I believe it's more complicated to compute.
> Yes a bit, particularly because in my algorithm, I would have to do
> 'n' subtractions each time, in case of 'n' workers. But it looked more
> natural because it follows exactly the way we manually calculate.
>
>> The way my proposal would work in this
>> case is that we would start with an array C[3] (since there are three
>> workers], with all entries 0.  Logically C[i] represents the amount of
>> work to be performed by worker i.  We add each path in turn to the
>> worker whose array entry is currently smallest; in the case of a tie,
>> just pick the first such entry.
>>
>> So in your example we do this:
>>
>> C[0] += 20;
>> C[1] += 16;
>> C[2] += 10;
>> /* C[2] is smaller than C[0] or C[1] at this point, so we add the next
>> path to C[2] */
>> C[2] += 8;
>> /* after the previous line, C[1] is now the smallest, so add to that
>> entry next */
>> C[1] += 3;
>> /* now we've got C[0] = 20, C[1] = 19, C[2] = 18, so add to C[2] */
>> C[2] += 1;
>> /* final result: C[0] = 20, C[1] = 19, C[2] = 19 */
>>
>> Now we just take the highest entry that appears in any array, which in
>> this case is C[0], as the total cost.
>
> Wow. The way your final result exactly tallies with my algorithm
> result is very interesting. This looks like some maths or computer
> science theory that I am not aware.
>
> I am currently coding the algorithm using your method.

While I was coding this, I was considering whether Path->rows should
also be calculated the way the total cost is, i.e. separately for
non-partial and partial subpaths. I think for rows we can just take
total_rows divided by the number of workers for the non-partial paths,
and this approximation should suffice; it would look odd to put rows
through the same algorithm we chose for the total cost of non-partial
paths.

Meanwhile, attached is a WIP patch v10. The only change in this patch
w.r.t. the last patch (v9) is a newly defined function
append_nonpartial_cost(). I am just sending this to show what the
algorithm looks like; it is not called yet.


ParallelAppend_v10_WIP.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-03-24 Thread Amit Khandekar
On 24 March 2017 at 13:11, Rajkumar Raghuwanshi
 wrote:
> I have given patch on latest pg sources (on commit
> 457a4448732881b5008f7a3bcca76fc299075ac3). configure and make all
> install ran successfully, but initdb failed with below error.

> FailedAssertion("!(LWLockTranchesAllocated >=
> LWTRANCHE_FIRST_USER_DEFINED)", File: "lwlock.c", Line: 501)

Thanks for reporting, Rajkumar.

With the new PARALLEL_APPEND tranche ID, the LWTRANCHE_FIRST_USER_DEFINED
value has crossed 64. So we need to increase the initial size of
LWLockTrancheArray from 64 to 128. Attached is the updated patch.
patch.


ParallelAppend_v11.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-03-24 Thread Amit Khandekar
On 24 March 2017 at 00:38, Amit Khandekar  wrote:
> On 23 March 2017 at 16:26, Amit Khandekar  wrote:
>> On 23 March 2017 at 05:55, Robert Haas  wrote:
>>>
>>> So in your example we do this:
>>>
>>> C[0] += 20;
>>> C[1] += 16;
>>> C[2] += 10;
>>> /* C[2] is smaller than C[0] or C[1] at this point, so we add the next
>>> path to C[2] */
>>> C[2] += 8;
>>> /* after the previous line, C[1] is now the smallest, so add to that
>>> entry next */
>>> C[1] += 3;
>>> /* now we've got C[0] = 20, C[1] = 19, C[2] = 18, so add to C[2] */
>>> C[2] += 1;
>>> /* final result: C[0] = 20, C[1] = 19, C[2] = 19 */
>>>
>>> Now we just take the highest entry that appears in any array, which in
>>> this case is C[0], as the total cost.
>>
>> Wow. The way your final result exactly tallies with my algorithm
>> result is very interesting. This looks like some maths or computer
>> science theory that I am not aware.
>>
>> I am currently coding the algorithm using your method.
>

> While I was coding this, I was considering if Path->rows also should
> be calculated similar to total cost for non-partial subpath and total
> cost for partial subpaths. I think for rows, we can just take
> total_rows divided by workers for non-partial paths, and this
> approximation should suffice. It looks odd that it be treated with the
> same algorithm we chose for total cost for non-partial paths.

Attached is patch v12, where the Path->rows calculation for non-partial
paths is kept separate from the way the total cost is computed for
non-partial paths. rows for non-partial paths is calculated as
total_rows divided by the number of workers, as an approximation, and
then the rows of the partial paths are simply added one by one.
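
So the row estimate is just arithmetic, along these lines (not actual
patch code; total_nonpartial_rows, partial_subpaths and parallel_workers
stand for whatever the costing function actually has at hand) :

/* rows estimate for the parallel Append path */
double      append_rows = 0.0;
ListCell   *lc;

/* non-partial subpaths: spread their total row count across the workers */
append_rows += total_nonpartial_rows / parallel_workers;

/* partial subpaths already carry per-worker row estimates; just add them */
foreach(lc, partial_subpaths)
    append_rows += ((Path *) lfirst(lc))->rows;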

>
> Meanwhile, attached is a WIP patch v10. The only change in this patch
> w.r.t. the last patch (v9) is that this one has a new function defined
> append_nonpartial_cost(). Just sending this to show how the algorithm
> looks like; haven't yet called it.

Now append_nonpartial_cost() is used, and it is tested.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v12.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-03-24 Thread Amit Khandekar
Thanks Amit for your review comments. I am yet to handle all of your
comments, but meanwhile, attached is an updated patch that handles
RETURNING.

Earlier it was not working because ExecInsert() did not return any
RETURNING result. This is because the setup needed to create the
RETURNING projection info for leaf partitions is done in
ExecInitModifyTable() only in the case of INSERT. But because it is an
UPDATE operation, we have to do this explicitly as a one-time operation
when it is determined that row movement is required. This is similar to
how we do the one-time setup of mt_partition_dispatch_info. So in the
patch, I have moved this code into a new function,
ExecInitPartitionReturningProjection(), and now this is called in
ExecInitModifyTable() as well as during row movement, for ExecInsert()
to process the RETURNING clause.

Basically we need to do all that is done in ExecInitModifyTable() for
INSERT. There are a couple of other things that I suspect might need to
be done as part of the missing initialization for ExecInsert() during
row movement :
1. Junk filter handling
2. WITH CHECK OPTION

Also, ExecDelete() during row movement still returns the RETURNING
result redundantly, which I am yet to handle.

On 23 March 2017 at 07:04, Amit Langote  wrote:
> Hi Amit,
>
> Thanks for the updated patch.
>
> On 2017/03/23 3:09, Amit Khandekar wrote:
>> Attached is v2 patch which implements the above optimization.
>
> Would it be better to have at least some new tests?  Also, there are a few
> places in the documentation mentioning that such updates cause error,
> which will need to be updated.  Perhaps also add some explanatory notes
> about the mechanism (delete+insert), trigger behavior, caveats, etc.
> There were some points discussed upthread that could be mentioned in the
> documentation.

Yeah, agreed. Will do this in the subsequent patch.

>
> @@ -633,6 +634,9 @@ ExecDelete(ItemPointer tupleid,
>  HeapUpdateFailureData hufd;
>  TupleTableSlot *slot = NULL;
>
> +if (already_deleted)
> +*already_deleted = false;
> +
>
> concurrently_deleted?

Done.

>
> @@ -962,7 +969,7 @@ ExecUpdate(ItemPointer tupleid,
>  }
>  else
>  {
> -LockTupleMode lockmode;
> +LockTupleMode   lockmode;
>
> Useless hunk.
Removed.


I am yet to handle your other comments; I am still working on them, but
till then, attached is the updated patch.


update-partition-key_v3.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-03-27 Thread Amit Khandekar
On 25 March 2017 at 01:34, Amit Khandekar  wrote:
> I am yet to handle all of your comments, but meanwhile , attached is
> an updated patch, that handles RETURNING.
>
> Earlier it was not working because ExecInsert() did not return any
> RETURNING clause. This is because the setup needed to create RETURNIG
> projection info for leaf partitions is done in ExecInitModifyTable()
> only in case of INSERT. But because it is an UPDATE operation, we have
> to do this explicitly as a one-time operation when it is determined
> that row-movement is required. This is similar to how we do one-time
> setup of mt_partition_dispatch_info. So in the patch, I have moved
> this code into a new function ExecInitPartitionReturningProjection(),
> and now this is called in ExecInitModifyTable() as well as during row
> movement for ExecInsert() processing the returning clause.

> Basically we need to do all that is done in ExecInitModifyTable() for
> INSERT. There are a couple of other things that I suspect that might
> need to be done as part of the missing initialization for Execinsert()
> during row-movement :
> 1. Junk filter handling
> 2. WITH CHECK OPTION

Attached is another updated patch, v4, which does the WITH CHECK OPTION
related initialization.

So we now have the below two function calls during row movement :
/* Build WITH CHECK OPTION constraints for leaf partitions */
ExecInitPartitionWithCheckOptions(mtstate, root_rel);

/* Build a projection for each leaf partition rel. */
ExecInitPartitionReturningProjection(mtstate, root_rel);

These functions are now re-used in two places : in ExecInitModifyTable()
and in the row-movement code. Basically, whatever was not being
initialized in ExecInitModifyTable() is now done in the row-movement
code.

I have added relevant scenarios in sql/update.sql.

I checked the junk filter handling. I think there isn't anything that
needs to be done, because for INSERT all that is needed is
ExecCheckPlanOutput(), and that function is anyway already called in
ExecInitModifyTable() even for UPDATE, so we don't have to initialize
anything additional.

> Yet, ExecDelete() during row-movement is still returning the RETURNING
> result redundantly, which I am yet to handle this.
Done above. Now we have a new parameter in ExecDelete() which tells
whether to skip RETURNING.

On 23 March 2017 at 07:04, Amit Langote  wrote:
> Would it be better to have at least some new tests?
Added some more scenarios in update.sql, and also included scenarios
covering WITH CHECK OPTION on updatable views.


> Also, there are a few places in the documentation mentioning that such 
> updates cause error,
> which will need to be updated.  Perhaps also add some explanatory notes
> about the mechanism (delete+insert), trigger behavior, caveats, etc.
> There were some points discussed upthread that could be mentioned in the
> documentation.
Yeah, I agree. Documentation needs some important changes. I am still
working on them.

> +if (!mtstate->mt_partition_dispatch_info)
> +{
>
> The if (pointer == NULL) style is better perhaps.
>
> +/* root table RT index is at the head of partitioned_rels */
> +if (node->partitioned_rels)
> +{
> +Index   root_rti;
> +Oid root_oid;
> +
> +root_rti = linitial_int(node->partitioned_rels);
> +root_oid = getrelid(root_rti, estate->es_range_table);
> +root_rel = heap_open(root_oid, NoLock); /* locked by
> InitPlan */
> +}
> +else
> +root_rel = mtstate->resultRelInfo->ri_RelationDesc;
>
> Some explanatory comments here might be good, for example, explain in what
> situations node->partitioned_rels would not have been set and/or vice versa.
Added some more comments in the relevant if conditions.

>
>> Now, for
>> UPDATE, ExecSetupPartitionTupleRouting() will be called only if row
>> movement is needed.
>>
>> We have to open an extra relation for the root partition, and keep it
>> opened and its handle stored in
>> mt_partition_dispatch_info[0]->reldesc. So ExecEndModifyTable() closes
>> this if it is different from node->resultRelInfo->ri_RelationDesc. If
>> it is same as node->resultRelInfo, it should not be closed because it
>> gets closed as part of ExecEndPlan().
>
> I guess you're referring to the following hunk.  Some comments:
>
> @@ -2154,10 +2221,19 @@ ExecEndModifyTable(ModifyTableState *node)
>   * Close all the partitioned tables, leaf partitions, and their indices
>   *
>   * Remember node->mt_partition_dispatch_info[0] corresponds to the root
> - * partitioned table, which we m

Re: [HACKERS] UPDATE of partition key

2017-03-28 Thread Amit Khandekar
On 27 March 2017 at 13:05, Amit Khandekar  wrote:
>> Also, there are a few places in the documentation mentioning that such 
>> updates cause error,
>> which will need to be updated.  Perhaps also add some explanatory notes
>> about the mechanism (delete+insert), trigger behavior, caveats, etc.
>> There were some points discussed upthread that could be mentioned in the
>> documentation.
>> Yeah, I agree. Documentation needs some important changes. I am still
>> working on them.

The attached patch v5 has the required doc changes mentioned above.

In section 5.11 "Partitioning" => subsection 5 "Caveats", I have
removed the caveat about not being able to update the partition key. It
is now replaced by the caveat that an UPDATE or DELETE operation can
silently miss a row when there is a concurrent UPDATE of the
partition key happening.

UPDATE row movement behaviour is described in :
Part VI "Reference" => SQL Commands => UPDATE

> On 4 March 2017 at 12:49, Robert Haas  wrote:
>> How about running the BR update triggers for the old partition and the
>> AR update triggers for the new partition?  It seems weird to run BR
>> update triggers but not AR update triggers.  Another option would be
>> to run BR and AR delete triggers and then BR and AR insert triggers,
>> emphasizing the choice to treat this update as a delete + insert, but
>> (as Amit Kh. pointed out to me when we were in a room together this
>> week) that precludes using the BEFORE trigger to modify the row.
>
> I checked the trigger behaviour in case of UPSERT. Here, when there is
> conflict found, ExecOnConflictUpdate() is called, and then the
> function returns immediately, which means AR INSERT trigger will not
> fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR
> and AR UPDATE triggers will be fired. So in short, when an INSERT
> becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE
> and AR UPDATE also get fired. On the same lines, it makes sense in
> case of UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on
> the original table, and then the BR and AR DELETE/INSERT triggers on
> the respective tables.
>
> So the common policy can be :
> Fire the BR trigger. It can be INSERT/UPDATE/DELETE trigger depending
> upon what the statement is.
> If there is a change in the operation, according to what the operation
> is converted to (UPDATE->DELETE+INSERT or INSERT->UPDATE), all the
> respective triggers would be fired.

The current patch already behaves as per the above policy. So I have
included a description of this trigger-related behaviour in the
"Overview of Trigger Behavior" section of the docs. This has been
derived from the way the trigger behaviour for UPSERT is described in
the preceding section.


update-partition-key_v5.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-04-02 Thread Amit Khandekar
For some reason, my reply got sent only to Amit Langote instead of
reply-to-all. Below is that reply. Thanks to Amit Langote for bringing
this to my notice.

On 31 March 2017 at 16:54, Amit Khandekar  wrote:
> On 31 March 2017 at 14:04, Amit Langote  wrote:
>> On 2017/03/28 19:12, Amit Khandekar wrote:
>>> In the section 5.11 "Partitioning" => subsection 5 "Caveats", I have
>>> removed the caveat of not being able to update partition key. And it
>>> is now replaced by the caveat where an update/delete operations can
>>> silently miss a row when there is a concurrent UPDATE of partition-key
>>> happening.
>>
>> Hmm, how about just removing the "partition-changing updates are
>> disallowed" caveat from the list on the 5.11 Partitioning page and explain
>> the concurrency-related caveats on the UPDATE reference page?
>
> IMHO this caveat is better placed in Partitioning chapter to emphasize
> that it is a drawback specifically in presence of partitioning.
>
>> +If an UPDATE on a partitioned table causes a row to
>> +move to another partition, it is possible that all row-level
>> +BEFORE UPDATE triggers and all row-level
>> +BEFORE DELETE/INSERT
>> +triggers are applied on the respective partitions in a way that is
>> apparent
>> +from the final state of the updated row.
>>
>> How about dropping "it is possible that" from this sentence?
>
> What the statement means is : "It is true that all triggers are
> applied on the respective partitions; but it is possible that they are
> applied in a way that is apparent from final state of the updated
> row". So "possible" applies to "in a way that is apparent..". It
> means, the user should be aware that all the triggers can change the
> row and so the final row will be affected by all those triggers.
> Actually, we have a similar statement for UPSERT involved with
> triggers in the preceding section. I have taken the statement from
> there.
>
>>
>> +UPDATE is done by doing a DELETE 
>> on
>>
>> How about: s/is done/is performed/g
>
> Done.
>
>>
>> +triggers are not applied because the UPDATE is
>> converted
>> +to a DELETE and UPDATE.
>>
>> I think you meant DELETE and INSERT.
>
> Oops. Corrected.
>
>>
>> +if (resultRelInfo->ri_PartitionCheck &&
>> +!ExecPartitionCheck(resultRelInfo, slot, estate))
>> +{
>>
>> How about a one-line comment what this block of code does?
>
> Yes, this was needed. Added a comment.
>
>>
>> - * Check the constraints of the tuple.  Note that we pass the same
>> + * Check the constraints of the tuple. Note that we pass the same
>>
>> I think that this hunk is not necessary.  (I've heard that two spaces
>> after a sentence-ending period is not a problem [1].)
>
> Actually I accidentally removed one space, thinking that it was one of
> my own comments. Reverted back this change, since it is a needless
> hunk.
>
>>
>> + * We have already run partition constraints above, so skip them 
>> below.
>>
>> How about: s/run/checked the/g?
>
> Done.
>
>> @@ -2159,6 +2289,7 @@ ExecEndModifyTable(ModifyTableState *node)
>>  heap_close(pd->reldesc, NoLock);
>>  ExecDropSingleTupleTableSlot(pd->tupslot);
>>  }
>> +
>>  for (i = 0; i < node->mt_num_partitions; i++)
>>  {
>>  ResultRelInfo *resultRelInfo = node->mt_partitions + i;
>>
>> Needless hunk.
>
> Right. Removed.
>
>>
>> Overall, I think the patch looks good now.  Thanks again for working on it.
>
> Thanks Amit for your efforts in reviewing the patch. Attached is v6
> patch that contains above points handled.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


update-partition-key_v6.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-04-03 Thread Amit Khandekar
Thanks Andres for your review comments. I will get back to the other
comments, but meanwhile I have some queries about the particular comment
below ...

On 4 April 2017 at 10:17, Andres Freund  wrote:
> On 2017-04-03 22:13:18 -0400, Robert Haas wrote:
>> On Mon, Apr 3, 2017 at 4:17 PM, Andres Freund  wrote:
>> > Hm.  I'm not really convinced by the logic here.  Wouldn't it be better
>> > to try to compute the minimum total cost across all workers for
>> > 1..#max_workers for the plans in an iterative manner?  I.e. try to map
>> > each of the subplans to 1 (if non-partial) or N workers (partial) using
>> > some fitting algorith (e.g. always choosing the worker(s) that currently
>> > have the least work assigned).  I think the current algorithm doesn't
>> > lead to useful #workers for e.g. cases with a lot of non-partial,
>> > high-startup plans - imo a quite reasonable scenario.

I think I might not have understood this part exactly. Are you saying
we need to consider the per-subplan parallel_workers to calculate the
total number of workers for the Append ? I also didn't follow the point
about non-partial subplans. Can you please explain how many workers you
think should be expected with, say, 7 subplans out of which 3 are
non-partial subplans ?

>>
>> Well, that'd be totally unlike what we do in any other case.  We only
>> generate a Parallel Seq Scan plan for a given table with one # of
>> workers, and we cost it based on that.  We have no way to re-cost it
>> if we changed our mind later about how many workers to use.
>> Eventually, we should probably have something like what you're
>> describing here, but in general, not just for this specific case.  One
>> problem, of course, is to avoid having a larger number of workers
>> always look better than a smaller number, which with the current
>> costing model would probably happen a lot.
>
> I don't think the parallel seqscan is comparable in complexity with the
> parallel append case.  Each worker there does the same kind of work, and
> if one of them is behind, it'll just do less.  But correct sizing will
> be more important with parallel-append, because with non-partial
> subplans the work is absolutely *not* uniform.
>
> Greetings,
>
> Andres Freund



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-04-04 Thread Amit Khandekar
On 4 April 2017 at 01:47, Andres Freund  wrote:
>> +typedef struct ParallelAppendDescData
>> +{
>> + LWLock  pa_lock;/* mutual exclusion to choose 
>> next subplan */
>> + int pa_first_plan;  /* plan to choose while 
>> wrapping around plans */
>> + int pa_next_plan;   /* next plan to choose by any 
>> worker */
>> +
>> + /*
>> +  * pa_finished : workers currently executing the subplan. A worker 
>> which
>> +  * finishes a subplan should set pa_finished to true, so that no new
>> +  * worker picks this subplan. For non-partial subplan, a worker which 
>> picks
>> +  * up that subplan should immediately set to true, so as to make sure
>> +  * there are no more than 1 worker assigned to this subplan.
>> +  */
>> + boolpa_finished[FLEXIBLE_ARRAY_MEMBER];
>> +} ParallelAppendDescData;
>
>
>> +typedef ParallelAppendDescData *ParallelAppendDesc;
>
> Pointer hiding typedefs make this Andres sad.

Yeah, I was trying to be consistent with other parts of the code where
we have typedefs for both the structure and a pointer to that structure.

>
>
>
>> @@ -291,6 +362,276 @@ ExecReScanAppend(AppendState *node)
>>   if (subnode->chgParam == NULL)
>>   ExecReScan(subnode);
>>   }
>> +
>> + if (padesc)
>> + {
>> + padesc->pa_first_plan = padesc->pa_next_plan = 0;
>> + memset(padesc->pa_finished, 0, sizeof(bool) * node->as_nplans);
>> + }
>> +
>
> Is it actually guaranteed that none of the parallel workers are doing
> something at that point?

ExecReScanAppend() would be called by ExecReScanGather().
ExecReScanGather() shuts down all the parallel workers before calling
its child node (i.e. ExecReScanAppend).


>> +static bool
>> +exec_append_parallel_next(AppendState *state)
>> +{
>> + ParallelAppendDesc padesc = state->as_padesc;
>> + int whichplan;
>> + int initial_plan;
>> + int first_partial_plan = ((Append 
>> *)state->ps.plan)->first_partial_plan;
>> + boolfound;
>> +
>> + Assert(padesc != NULL);
>> +
>> + /* Backward scan is not supported by parallel-aware plans */
>> + Assert(ScanDirectionIsForward(state->ps.state->es_direction));
>> +
>> + /* The parallel leader chooses its next subplan differently */
>> + if (!IsParallelWorker())
>> + return exec_append_leader_next(state);
>
> It's a bit weird that the leader's case does is so separate, and does
> it's own lock acquisition.

Since we wanted to prevent the leader from taking the most expensive
non-partial plans first, I thought it would be better to keep its logic
simple and separate, so I could not merge it into the main logic.

>
>
>> + found = false;
>> + for (whichplan = initial_plan; whichplan != PA_INVALID_PLAN;)
>> + {
>> + /*
>> +  * Ignore plans that are already done processing. These also 
>> include
>> +  * non-partial subplans which have already been taken by a 
>> worker.
>> +  */
>> + if (!padesc->pa_finished[whichplan])
>> + {
>> + found = true;
>> + break;
>> + }
>> +
>> + /*
>> +  * Note: There is a chance that just after the child plan node 
>> is
>> +  * chosen above, some other worker finishes this node and sets
>> +  * pa_finished to true. In that case, this worker will go 
>> ahead and
>> +  * call ExecProcNode(child_node), which will return NULL tuple 
>> since it
>> +  * is already finished, and then once again this worker will 
>> try to
>> +  * choose next subplan; but this is ok : it's just an extra
>> +  * "choose_next_subplan" operation.
>> +  */
>
> IIRC not all node types are safe against being executed again when
> they've previously returned NULL.  That's why e.g. nodeMaterial.c
> contains the following blurb:
> /*
>  * If necessary, try to fetch another row from the subplan.
>  *
>  * Note: the eof_underlying state variable exists to short-circuit 
> further
>  * subplan calls.  It's not optional, unfortunately, because some plan
>  * node types are not robust about being called again when they've 
> already
>  * returned NULL.
>  */

This scenario is different from the parallel append scenario described
by my comment. A worker sets pa_finished to true only when it itself
gets a NULL tuple for a given subplan. So in
exec_append_parallel_next(), suppose a worker W1 finds a subplan with
pa_finished=false. So it chooses it. Now a different worker W2 sets
this subplan's pa_finished=true because W2 has got a NULL tuple. But
W1 hasn't yet got a NULL tuple. If it had got a NULL tuple earlier, it
would have itself set pa_finished to true, and then it would have
never again

Re: [HACKERS] UPDATE of partition key

2017-04-04 Thread Amit Khandekar
On 3 April 2017 at 17:13, Amit Langote  wrote:
> Hi Amit,
>
> Thanks for updating the patch.  Since ddl.sgml got updated on Saturday,
> patch needs a rebase.

Rebased now.

>
>> On 31 March 2017 at 16:54, Amit Khandekar  wrote:
>>> On 31 March 2017 at 14:04, Amit Langote  
>>> wrote:
>>>> On 2017/03/28 19:12, Amit Khandekar wrote:
>>>>> In the section 5.11 "Partitioning" => subsection 5 "Caveats", I have
>>>>> removed the caveat of not being able to update partition key. And it
>>>>> is now replaced by the caveat where an update/delete operations can
>>>>> silently miss a row when there is a concurrent UPDATE of partition-key
>>>>> happening.
>>>>
>>>> Hmm, how about just removing the "partition-changing updates are
>>>> disallowed" caveat from the list on the 5.11 Partitioning page and explain
>>>> the concurrency-related caveats on the UPDATE reference page?
>>>
>>> IMHO this caveat is better placed in Partitioning chapter to emphasize
>>> that it is a drawback specifically in presence of partitioning.
>
> I mean we fixed things for declarative partitioning so it's no longer a
> caveat like it is for partitioning implemented using inheritance (in that
> the former doesn't require user-defined triggers to implement
> row-movement).  Seeing the first sentence, that is:
>
> An UPDATE causes a row to move from one partition to another
> if the new value of the row fails to satisfy the implicit partition
> constraint of the original partition but there is another partition which
> can fit this row.
>
> which clearly seems to suggest that row-movement, if required, is handled
> by the system.  So it's not clear why it's in this list.  If we want to
> describe the limitations of the current implementation, we'll need to
> rephrase it a bit.

Yes I agree.

> How about something like:
> For an UPDATE that causes a row to move from one partition to
> another due the partition key being updated, the following caveats exist:
>  presence of concurrent manipulation of the row in question>

Now, with the slightly changed doc structure for partitioning in latest
master, I have added the following note at the end of section "5.10.2.
Declarative Partitioning" :

---

"Updating the partition key of a row might cause it to be moved into a
different partition where this row satisfies its partition
constraint."

---

And then in the Limitations section, I have replaced the earlier
can't-update-partition-key limitation with this new limitation as
below :

"When an UPDATE causes a row to move from one partition to another,
there is a chance that another concurrent UPDATE or DELETE misses this
row. Suppose, during the row movement, the row is still visible for
the concurrent session, and it is about to do an UPDATE or DELETE
operation on the same row. This DML operation can silently miss this
row if the row now gets deleted from the partition by the first
session as part of its UPDATE row movement. In such case, the
concurrent UPDATE/DELETE, being unaware of the row movement,
interprets that the row has just been deleted so there is nothing to
be done for this row. Whereas, in the usual case where the table is
not partitioned, or where there is no row movement, the second session
would have identified the newly updated row and carried UPDATE/DELETE
on this new row version."

---
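
To make the limitation concrete, here is a rough two-session
illustration of the scenario described above. It assumes the patch is
applied, and all the table and value names are mine, purely for
illustration:

CREATE TABLE tmeas (city text, temp int) PARTITION BY LIST (city);
CREATE TABLE tmeas_pune   PARTITION OF tmeas FOR VALUES IN ('Pune');
CREATE TABLE tmeas_mumbai PARTITION OF tmeas FOR VALUES IN ('Mumbai');
INSERT INTO tmeas VALUES ('Pune', 30);

-- Session 1:
BEGIN;
UPDATE tmeas SET city = 'Mumbai' WHERE city = 'Pune';  -- row moves to tmeas_mumbai

-- Session 2 (concurrently, before session 1 commits):
UPDATE tmeas SET temp = temp + 5 WHERE city = 'Pune';  -- blocks on the locked row

-- Session 1:
COMMIT;

-- Session 2's UPDATE now finds that the old row version has been deleted
-- from tmeas_pune (the row was moved to another partition), so it silently
-- updates zero rows instead of following the update chain to the new row
-- version.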

Further, in the Notes section of update.sgml, I have added a link to
the above limitations section, like this:

"In the case of a partitioned table, updating a row might cause it to
no longer satisfy the partition constraint of the containing
partition. In that case, if there is some other partition in the
partition tree for which this row satisfies its partition constraint,
then the row is moved to that partition. If there isn't such a
partition, an error will occur. The error will also occur when
updating a partition directly. Behind the scenes, the row movement is
actually a DELETE and INSERT operation. However, there is a
possibility that a concurrent UPDATE or DELETE on the same row may
miss this row. For details see the section Section 5.10.2.3."
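
For reference, the row movement and the error cases mentioned in that
note can be seen with something like the following (names are mine,
and this again assumes the patch is applied):

CREATE TABLE plist (a int, b text) PARTITION BY LIST (a);
CREATE TABLE plist_1 PARTITION OF plist FOR VALUES IN (1);
CREATE TABLE plist_2 PARTITION OF plist FOR VALUES IN (2);
INSERT INTO plist VALUES (1, 'one');

UPDATE plist SET a = 2 WHERE b = 'one';    -- row is moved from plist_1 to plist_2
UPDATE plist SET a = 3 WHERE b = 'one';    -- errors out: no partition accepts a = 3
UPDATE plist_2 SET a = 1 WHERE b = 'one';  -- errors out: a direct update on the
                                           -- partition does not move the row, so
                                           -- the new row violates plist_2's
                                           -- partition constraint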

>
>>>> +If an UPDATE on a partitioned table causes a row to
>>>> +move to another partition, it is possible that all row-level
>>>> +BEFORE UPDATE triggers and all 
>>>> row-level
>>>> +BEFORE DELETE/INSERT
>>>> +triggers are applied on the respective partitions in a way that is
>>>> apparent
>>>> +from the final state of the updated row.
>>>>
>>>> How about dropping "it is possible that"

Re: [HACKERS] Parallel Append implementation

2017-04-05 Thread Amit Khandekar
On 5 April 2017 at 01:43, Andres Freund  wrote:
> On 2017-04-04 08:01:32 -0400, Robert Haas wrote:
>> On Tue, Apr 4, 2017 at 12:47 AM, Andres Freund  wrote:
>> > I don't think the parallel seqscan is comparable in complexity with the
>> > parallel append case.  Each worker there does the same kind of work, and
>> > if one of them is behind, it'll just do less.  But correct sizing will
>> > be more important with parallel-append, because with non-partial
>> > subplans the work is absolutely *not* uniform.
>>
>> Sure, that's a problem, but I think it's still absolutely necessary to
>> ramp up the maximum "effort" (in terms of number of workers)
>> logarithmically.  If you just do it by costing, the winning number of
>> workers will always be the largest number that we think we'll be able
>> to put to use - e.g. with 100 branches of relatively equal cost we'll
>> pick 100 workers.  That's not remotely sane.
>
> I'm quite unconvinced that just throwing a log() in there is the best
> way to combat that.  Modeling the issue of starting more workers through
> tuple transfer, locking, startup overhead costing seems a better to me.
>
> If the goal is to compute the results of the query as fast as possible,
> and to not use more than max_parallel_per_XXX, and it's actually
> beneficial to use more workers, then we should.  Because otherwise you
> really can't use the resources available.
>
> - Andres

This is what the earlier versions of my patch had done : just add up
per-subplan parallel_workers (1 for non-partial subplan and
subpath->parallel_workers for partial subplans) and set this total as
the Append parallel_workers.

Robert had a valid point that this would be inconsistent with the
worker count that we would come up with if it were a single table with
a cost as big as the total cost of all the Append subplans. That
discussion was actually about a partitioned table versus the same
table unpartitioned, but I think the same argument applies to a union
query with non-partial plans: if we clamp down the number of workers
for a single table for a good reason, we should follow the same policy
and avoid assigning too many workers for an Append as well.

Now, I am not sure exactly why we increase the number of workers
logarithmically for a single-table parallel scan; my guess is that it
was observed that beyond a certain number of workers, adding more does
not make a significant difference.

If we try to calculate the workers based on each subplan's cost rather
than just its number of workers, I still think the total worker count
should be a *log* of the total cost, so as to be consistent with what
we do for other scans. Now, log(total_cost) does not increase
significantly with cost: for a cost of 1000 units, log3(cost) is about
6, and for a cost of 10,000 units it is about 8, i.e. just 2 more
workers. So, since it is a logarithmic value anyway, we might as well
drop the cost factor and consider only the number of workers.

But again, if in the future we drop the log() method, then the above
no longer holds. Until then, though, I think we should follow the
common strategy we have been using.

BTW, all of the above points apply only to non-partial plans. For
partial plans, what the patch does is take the highest of the
per-subplan parallel_workers and make sure that the Append's worker
count is at least as high as that value.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-04-06 Thread Amit Khandekar
On 6 April 2017 at 07:33, Andres Freund  wrote:
> On 2017-04-05 14:52:38 +0530, Amit Khandekar wrote:
>> This is what the earlier versions of my patch had done : just add up
>> per-subplan parallel_workers (1 for non-partial subplan and
>> subpath->parallel_workers for partial subplans) and set this total as
>> the Append parallel_workers.
>
> I don't think that's great, consider e.g. the case that you have one
> very expensive query, and a bunch of cheaper ones. Most of those workers
> wouldn't do much while waiting for the expensive query.  What I'm
> basically thinking we should do is something like the following
> pythonesque pseudocode:
>
> best_nonpartial_cost = -1
> best_nonpartial_nworkers = -1
>
> for numworkers in 1...#max workers:
>worker_work = [0 for x in range(0, numworkers)]
>
>nonpartial_cost += startup_cost * numworkers
>
># distribute all nonpartial tasks over workers.  Assign tasks to the
># worker with the least amount of work already performed.
>for task in all_nonpartial_subqueries:
>least_busy_worker = worker_work.smallest()
>least_busy_worker += task.total_nonpartial_cost
>
># the nonpartial cost here is the largest amount any single worker
># has to perform.
>nonpartial_cost += worker_work.largest()
>
>total_partial_cost = 0
>for task in all_partial_subqueries:
>total_partial_cost += task.total_partial_cost
>
># Compute resources needed by partial tasks. First compute how much
># cost we can distribute to workers that take shorter than the
># "busiest" worker doing non-partial tasks.
>remaining_avail_work = 0
>for i in range(0, numworkers):
>remaining_avail_work += worker_work.largest() - worker_work[i]
>
># Equally divide up remaining work over all workers
>if remaining_avail_work < total_partial_cost:
>   nonpartial_cost += (worker_work.largest - remaining_avail_work) / 
> numworkers
>
># check if this is the best number of workers
>if best_nonpartial_cost == -1 or best_nonpartial_cost > nonpartial_cost:
>   best_nonpartial_cost = nonpartial_cost
>   best_nonpartial_nworkers = numworkers
>
> Does that make sense?

Yeah, I gather that what you are trying to achieve is: allocate a
number of workers such that the total cost does not exceed the cost of
the first non-partial plan (i.e. the costliest one, because the plans
are sorted by descending cost).

So for non-partial costs such as (20, 10, 5, 2), allocate only 2
workers, because the 2nd worker will execute (10, 5, 2) while the 1st
worker executes (20).

But for costs such as (4, 4, 4, ... 20 times), the logic would give
us 20 workers, because we want to finish the Append in 4 time units;
and this is what we want to avoid when we go with the
don't-allocate-too-many-workers approach.

>
>
>> BTW all of the above points apply only for non-partial plans.
>
> Indeed. But I think that's going to be a pretty common type of plan,

Yes it is.

> especially if we get partitionwise joins.

About that I am not sure, because we already have support for parallel
joins; so wouldn't the join subpaths corresponding to all of the
partitions be partial paths? I may be wrong about that.

But if the subplans are foreign scans, then yes, all of them would be
non-partial plans. This may provoke an off-topic discussion, but here,
instead of assigning so many workers to all these foreign plans and
having all those workers wait for the results, a single asynchronous
execution node (which is still in the making) would be desirable,
because it would do the job of all these workers.
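
For what it's worth, the kind of plan shape I have in mind is a UNION
ALL over a mix of local and foreign tables, roughly like this (the
postgres_fdw setup and all names here are only illustrative):

CREATE EXTENSION postgres_fdw;
CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'node1', dbname 'postgres');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1;
CREATE FOREIGN TABLE orders_2016 (id int, amount numeric)
    SERVER shard1 OPTIONS (table_name 'orders_2016');
CREATE TABLE orders_2017 (id int, amount numeric);

-- The foreign scan can only appear as a non-partial subplan under the
-- Append, while the scan on the local table can be a partial subplan.
SELECT count(*)
FROM (SELECT * FROM orders_2016
      UNION ALL
      SELECT * FROM orders_2017) o;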

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-04-17 Thread Amit Khandekar
On 7 April 2017 at 20:35, Andres Freund  wrote:
>> But for costs such as (4, 4, 4,   20 times), the logic would give
>> us 20 workers because we want to finish the Append in 4 time units;
>> and this what we want to avoid when we go with
>> don't-allocate-too-many-workers approach.
>
> I guess, my problem is that I don't agree with that as a goal in and of
> itself.  If you actually want to run your query quickly, you *want* 20
> workers here.  The issues of backend startup overhead (already via
> parallel_setup_cost), concurrency and such cost should be modelled, not
> burried in a formula the user can't change.  If we want to make it less
> and less likely to start more workers we should make that configurable,
> not the default.
> I think there's some precedent taken from the parallel seqscan case,
> that's not actually applicable here.  Parallel seqscans have a good
> amount of shared state, both on the kernel and pg side, and that shared
> state reduces gains of increasing the number of workers.  But with
> non-partial scans such shared state largely doesn't exist.

After searching through earlier mails about parallel scan, I am not
sure whether the shared state was considered a potential factor that
might reduce parallel query gains when the worker-count calculation
for a parallel seq scan was decided. I mean, even today, if we
allocate 10 workers instead of the calculated 4 for a parallel seq
scan, they might help. I think it's just that we don't know whether
they would *always* help or whether it would sometimes regress.




Re: [HACKERS] UPDATE of partition key

2017-06-18 Thread Amit Khandekar
When I tested a partition-key update on a partitioned table having no
child partitions, it crashed. This is because there is an
Assert(mtstate->mt_num_partitions > 0) for creating the
partition-to-root map, which fails if there are no partitions under
the partitioned table. We should actually skip creating this map if
there are no partitions under the partitioned table on which the
UPDATE is run. So the attached patch has this new change to fix it
(and an appropriate additional test case added):

--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2006,15 +2006,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * descriptor of a source partition does not match the root partition
 	 * descriptor. In such case we need to convert tuples to the root partition
 	 * tuple descriptor, because the search for destination partition starts
-	 * from the root.
+	 * from the root. Skip this setup if it's not a partition key update or if
+	 * there are no partitions below this partitioned table.
 	 */
-	if (is_partitionkey_update)
+	if (is_partitionkey_update && mtstate->mt_num_partitions > 0)
 	{
 		TupleConversionMap **tup_conv_maps;
 		TupleDesc	outdesc;
 
-		Assert(mtstate->mt_num_partitions > 0);
-
 		mtstate->mt_resultrel_maps =
 			(TupleConversionMap **)
 			palloc0(sizeof(TupleConversionMap*) * nplans);
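
For reference, a minimal way to exercise the crashing scenario (the
added test case does something along these lines; the names here are
mine):

-- A partitioned table that has no partitions yet.
CREATE TABLE nopart (a int, b int) PARTITION BY RANGE (a);

-- A partition-key update on it used to hit the Assert; with the fix it
-- should simply update zero rows, since there are no rows (and nowhere
-- to route them anyway).
UPDATE nopart SET a = a + 1;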

On 15 June 2017 at 23:06, Amit Khandekar  wrote:
> On 13 June 2017 at 15:40, Amit Khandekar  wrote:
>> While rebasing my patch for the below recent commit, I realized that a
>> similar issue exists for the update-tuple-routing patch as well:
>>
>> commit 78a030a441966d91bc7e932ef84da39c3ea7d970
>> Author: Tom Lane 
>> Date:   Mon Jun 12 23:29:44 2017 -0400
>>
>> Fix confusion about number of subplans in partitioned INSERT setup.
>>
>> The above issue was about incorrectly using 'i' in
>> mtstate->mt_plans[i] during handling WITH CHECK OPTIONS in
>> ExecInitModifyTable(), where 'i' was actually meant to refer to the
>> positions in mtstate->mt_num_partitions. Actually for INSERT, there is
>> only a single plan element in mtstate->mt_plans[] array.
>>
>> Similarly, for update-tuple routing, we cannot use
>> mtstate->mt_plans[i], because 'i' refers to position in
>> mtstate->mt_partitions[] , whereas mtstate->mt_plans is not at all in
>> order of mtstate->mt_partitions; in fact mt_plans has only the plans
>> that are to be scanned on pruned partitions; so it can well be a small
>> subset of total partitions.
>>
>> I am working on an updated patch to fix the above.
>
> Attached patch v10 fixes the above. In the existing code, where it
> builds WCO constraints for each leaf partition; with the patch, that
> code now is applicable to row-movement-updates as well. So the
> assertions in the code are now updated to allow the same. Secondly,
> the mapping for each of the leaf partitions was constructed using the
> root partition attributes. Now in the patch, the
> mtstate->resultRelInfo[0] (i.e. the first resultRelInfo) is used as
> reference. So effectively, map_partition_varattnos() now represents
> not just parent-to-partition mapping, but rather, mapping between any
> two partitions/partitioned_tables. It's done this way, so that we can
> have a common WCO building code for inserts as well as updates. For
> e.g. for inserts, the first (and only) WCO belongs to
> node->nominalRelation so nominalRelation is used for
> map_partition_varattnos(), whereas for updates, first WCO belongs to
> the first resultRelInfo which is not same as nominalRelation. So in
> the patch, in both cases, we use the first resultRelInfo and the WCO
> of the first resultRelInfo for map_partition_varattnos().
>
> Similar thing is done for Returning expressions.
>
> -
>
> Another change in the patch is : for ExecInitQual() for WCO quals,
> mtstate->ps is used as parent, rather than first plan. For updates,
> first plan does not belong to the parent partition. In fact, I think
> in all cases, we should use mtstate->ps as the parent.
> mtstate->mt_plans[0] don't look like they should be considered parent
> of these expressions. May be it does not matter to which parent we
> link these quals, because there is no ReScan for ExecModifyTable().
>
> Note that for RETURNING projection expressions, we do use mtstate->ps.
>
> 
>
> There is another issue I discovered. The row-movement works fine if
> the destination leaf partition has different attribute ordering than
> the root : the 

Re: [HACKERS] UPDATE of partition key

2017-06-19 Thread Amit Khandekar
On 20 June 2017 at 03:42, Thomas Munro  wrote:
> Just a thought: If I understand correctly this new array of tuple
> conversion maps is the same as mtstate->mt_transition_tupconv_maps in
> my patch transition-tuples-from-child-tables-v11.patch (hopefully soon
> to be committed to close a PG10 open item).  In my patch I bounce
> transition tuples from child relations up to the named relation's
> triggers, and in this patch you bounce child tuples up to the named
> relation for rerouting, so the conversion requirement is the same.
> Perhaps we could consider refactoring to build a common struct member
> on demand for the row movement patch at some point in the future if it
> makes the code cleaner.

I agree; thanks for bringing this to my attention. The conversion maps
in my patch and yours do sound like they are exactly the same. And
even in the case where both update-row-movement and transition tables
are in play, the same map should serve both purposes. I will keep a
watch on your patch, and check how I can adjust mine so that I don't
have to refactor the mapping.

One difference I see is: in your patch, in ExecModifyTable() we
advance the current map position for each successive subplan, whereas
in my patch, in ExecInsert() we deduce the position of the right map
to fetch from the position of the current resultRelInfo in the
mtstate->resultRelInfo[] array. I think your way is more consistent
with the existing code.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-19 Thread Amit Khandekar
On 20 June 2017 at 03:46, Robert Haas  wrote:
> On Thu, Jun 15, 2017 at 1:36 PM, Amit Khandekar  
> wrote:
>> Attached patch v10 fixes the above. In the existing code, where it
>> builds WCO constraints for each leaf partition; with the patch, that
>> code now is applicable to row-movement-updates as well.
>
> I guess I don't see why it should work like this.  In the INSERT case,
> we must build withCheckOption objects for each partition because those
> partitions don't appear in the plan otherwise -- but in the UPDATE
> case, they're already there, so why do we need to build anything at
> all?  Similarly for RETURNING projections.  How are the things we need
> for those cases not already getting built, associated with the
> relevant resultRelInfos?  Maybe there's a concern if some children got
> pruned - they could turn out later to be the children into which
> tuples need to be routed. But the patch makes no distinction
> between possibly-pruned children and any others.

Yes, only a subset of the partitions appear in the UPDATE subplans. I
think that typically for updates, only a very small subset of the
total leaf partitions will be present in the plan; the others get
pruned. IMHO, it would not be worth having an optimization that opens
only those leaf partitions which are not already present in the
subplans. Without that optimization, we are able to re-use the INSERT
infrastructure without additional changes.


>
>> There is another issue I discovered. The row-movement works fine if
>> the destination leaf partition has different attribute ordering than
>> the root : the existing insert-tuple-routing mapping handles that. But
>> if the source partition has different ordering w.r.t. the root, it has
>> a problem : there is no mapping in the opposite direction, i.e. from
>> the leaf to root. And we require that because the tuple of source leaf
>> partition needs to be converted to root partition tuple descriptor,
>> since ExecFindPartition() starts with root.
>
> Seems reasonable, but...
>
>> To fix this, I have introduced another mapping array
>> mtstate->mt_resultrel_maps[]. This corresponds to the
>> mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping,
>> because the update result relations are pruned subset of the total
>> leaf partitions.
>
> ... I don't understand how you can *not* need a per-leaf-partition
> mapping.  I mean, maybe you only need the mapping for the *unpruned*
> leaf partitions

Yes, we need the mapping only for the unpruned leaf partitions, and
those partitions are available in the per-subplan resultRelInfo's.

> but you certainly need a separate mapping for each one of those.

You mean *each* of the leaf partitions? I didn't get why we would
need it for each one. The tuple targeted for update belongs to one of
the per-subplan resultRelInfos, and this tuple is to be routed to
another leaf partition. So the reverse mapping is for conversion from
the source resultRelInfo to the root partition. I am unable to figure
out a scenario where we would require this reverse mapping for
partitions on which the UPDATE is *not* going to be executed.

>
> It's possible to imagine driving the tuple routing off of just the
> partition key attributes, extracted from wherever they are inside the
> tuple at the current level, rather than converting to the root's tuple
> format.  However, that's not totally straightforward because there
> could be multiple levels of partitioning throughout the tree and
> different attributes might be needed at different levels.

Yes, the conversion anyway occurs at each of these levels even for
insert, specifically because there can be different partition
attributes at each level. For update, it's only one additional
conversion. But yes, this new mapping would be required for that one
single conversion.

> Moreover,
> in most cases, the mappings are going to end up being no-ops because
> the column order will be the same, so it's probably not worth
> complicating the code to try to avoid a double conversion that usually
> won't happen.

I agree.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-21 Thread Amit Khandekar
On 21 June 2017 at 00:23, Robert Haas  wrote:
> On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar  
> wrote:
>>> I guess I don't see why it should work like this.  In the INSERT case,
>>> we must build withCheckOption objects for each partition because those
>>> partitions don't appear in the plan otherwise -- but in the UPDATE
>>> case, they're already there, so why do we need to build anything at
>>> all?  Similarly for RETURNING projections.  How are the things we need
>>> for those cases not already getting built, associated with the
>>> relevant resultRelInfos?  Maybe there's a concern if some children got
>>> pruned - they could turn out later to be the children into which
>>> tuples need to be routed. But the patch makes no distinction
>>> between possibly-pruned children and any others.
>>
>> Yes, only a subset of the partitions appear in the UPDATE subplans. I
>> think typically for updates, a very small subset of the total leaf
>> partitions will be there in the plans, others would get pruned. IMHO,
>> it would not be worth having an optimization where it opens only those
>> leaf partitions which are not already there in the subplans. Without
>> the optimization, we are able to re-use the INSERT infrastructure
>> without additional changes.
>
> Well, that is possible, but certainly not guaranteed.  I mean,
> somebody could do a whole-table UPDATE, or an UPDATE that hits a
> smattering of rows in every partition;

I am not saying that it's guaranteed to be a small subset. I am saying
that it would typically be a small subset in the
update-of-partition-key case. It seems unusual for a user to cause
update-row-movement across multiple partitions at the same time.
Generally it would be an administrative task where some or all of the
rows of a partition need their partition key updated in a way that
moves them to another partition, so there would probably be a WHERE
clause narrowing the update down to that particular partition; without
the WHERE clause the update is slower anyway, and it is redundant to
scan all the other partitions.

But, point taken, that there can always be certain cases involving
multiple table partition-key updates.

> e.g. the table is partitioned on order number, and you do UPDATE
> lineitem SET product_code = 'K372B' WHERE product_code = 'K372'.

This query does not update the order number, so there is no
partition-key update here. Are you thinking that the patch generates
the per-leaf-partition WCO expressions even for an update not
involving the partition key?

>
> Leaving that aside, the point here is that you're rebuilding
> withCheckOptions and returningLists that have already been built in
> the planner.  That's bad for two reasons.  First, it's inefficient,
> especially if there are many partitions.

Yeah, I agree that this becomes more and more redundant if the update
involves more partitions.

> Second, it will amount to a functional bug if you get a
> different answer than the planner did.

Actually, the per-leaf WCOs are meant to be executed on the
destination partitions where the tuple is moved, while the WCOs
belonging to the per-subplan resultRelInfos are meant for the
resultRelInfos used for the UPDATE plans. So it should not actually
matter whether they look the same or different, because they are fired
at different objects. Now, these objects can happen to be the same
relations, though.

But in any case, it's not clear to me how the mapped WCO and the
planner's WCO would yield a different answer if they are both for the
same relation. I am possibly missing something. The planner has
already generated the withCheckOptions for each of the resultRelInfos,
and then we are using one of those to re-generate the WCO for a leaf
partition by only adjusting the attnos. If there is already a WCO
generated in the planner for that leaf partition (because that
partition was present in mtstate->resultRelInfo), then the re-built
WCO should look exactly the same as the earlier one, because they are
the same relations, and so the attnos generated in them would be the
same, since the Relation TupleDesc is the same.

> Note this comment in the existing code:
>
> /*
>  * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
>  * that we didn't build the withCheckOptionList for each partition within
>  * the planner, but simple translation of the varattnos for each partition
>  * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
>  * cases are handled above.
>  */
>
> The comment "UPDATE/DELETE cases are handled above" is referring to
> the code that initializes the WCOs generated by the planner.  You've
> modified the co

Re: [HACKERS] UPDATE of partition key

2017-06-21 Thread Amit Khandekar
On 21 June 2017 at 20:14, Robert Haas  wrote:
> On Wed, Jun 21, 2017 at 5:28 AM, Amit Langote
>  wrote:>> The comment "UPDATE/DELETE
> cases are handled above" is referring to
>>> the code that initializes the WCOs generated by the planner.  You've
>>> modified the comment in your patch, but the associated code: your
>>> updated comment says that only "DELETEs and local UPDATES are handled
>>> above", but in reality, *all* updates are still handled above.  And
>>> then they are handled again here.  Similarly for returning lists.
>>> It's certainly not OK for the comment to be inaccurate, but I think
>>> it's also bad to redo the work which the planner has already done,
>>> even if it makes the patch smaller.
>>
>> I guess this has to do with the UPDATE turning into DELETE+INSERT.  So, it
>> seems like WCOs are being initialized for the leaf partitions
>> (ResultRelInfos in the mt_partitions array) that are in turn are
>> initialized for the aforementioned INSERT.  That's why the term "...local
>> UPDATEs" in the new comment text.
>>
>> If that's true, I wonder if it makes sense to apply what would be
>> WCO_RLS_UPDATE_CHECK to a leaf partition that the tuple will be moved into
>> by calling ExecInsert()?
>
> I think we probably should apply the insert policy, just as we're
> executing the insert trigger.

Yes, the RLS quals should execute during tuple routing according to
whether it is an update or whether it has been converted to an insert.
I think the tests don't quite cover the insert part. Will check.
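
The kind of test I have in mind for the insert part is roughly the
following (names and policies are only illustrative, and this assumes
the patch):

CREATE TABLE rlsp (a int, b int) PARTITION BY LIST (a);
CREATE TABLE rlsp_1 PARTITION OF rlsp FOR VALUES IN (1);
CREATE TABLE rlsp_2 PARTITION OF rlsp FOR VALUES IN (2);
ALTER TABLE rlsp ENABLE ROW LEVEL SECURITY;
ALTER TABLE rlsp FORCE ROW LEVEL SECURITY;
CREATE POLICY p_seeall ON rlsp FOR SELECT USING (true);
CREATE POLICY p_upd    ON rlsp FOR UPDATE USING (true) WITH CHECK (true);
CREATE POLICY p_ins    ON rlsp FOR INSERT WITH CHECK (b > 0);

-- Inserting directly into the partition bypasses the parent's policies.
INSERT INTO rlsp_1 VALUES (1, -5);

-- This UPDATE moves the row from rlsp_1 to rlsp_2.  The UPDATE policy's
-- WITH CHECK (true) passes, so the row should be rejected only if the
-- INSERT policy's WITH CHECK (b > 0) is applied to the routed tuple,
-- which is the behaviour being discussed above.
UPDATE rlsp SET a = 2 WHERE a = 1;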

>
>>> Also, I feel like it's probably not correct to use the first result
>>> relation as the nominal relation for building WCOs and returning lists
>>> anyway.  I mean, if the first result relation has a different column
>>> order than the parent relation, isn't this just broken?  If it works
>>> for some reason, the comments don't explain what that reason is.
>>
>> Yep, it's more appropriate to use
>> ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow.  That
>> is, if answer to the question I raised above is positive.

From what I had checked earlier when coding that part,
rootResultRelInfo is NULL in case of inserts, unless something has
changed in later commits. That's the reason I decided to use the first
resultRelInfo.


Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-25 Thread Amit Khandekar
On 22 June 2017 at 01:41, Robert Haas  wrote:
>>> Second, it will amount to a functional bug if you get a
>>> different answer than the planner did.
>>
>> Actually, the per-leaf WCOs are meant to be executed on the
>> destination partitions where the tuple is moved, while the WCOs
>> belonging to the per-subplan resultRelInfo are meant for the
>> resultRelinfo used for the UPDATE plans. So actually it should not
>> matter whether they look same or different, because they are fired at
>> different objects. Now these objects can happen to be the same
>> relations though.
>>
>> But in any case, it's not clear to me how the mapped WCO and the
>> planner's WCO would yield a different answer if they are both the same
>> relation. I am possibly missing something. The planner has already
>> generated the withCheckOptions for each of the resultRelInfo. And then
>> we are using one of those to re-generate the WCO for a leaf partition
>> by only adjusting the attnos. If there is already a WCO generated in
>> the planner for that leaf partition (because that partition was
>> present in mtstate->resultRelInfo), then the re-built WCO should be
>> exactly look same as the earlier one, because they are the same
>> relations, and so the attnos generated in them would be same since the
>> Relation TupleDesc is the same.
>
> If the planner's WCOs and mapped WCOs are always the same, then I
> think we should try to avoid generating both.  If they can be
> different, but that's intentional and correct, then there's no
> substantive problem with the patch but the comments need to make it
> clear why we are generating both.
>
>> Actually I meant, "above works for only local updates. For
>> row-movement-updates, we need per-leaf partition WCOs, because when
>> the row is inserted into target partition, that partition may be not
>> be included in the above planner resultRelInfo, so we need WCOs for
>> all partitions". I think this said comment should be sufficient if I
>> add this in the code ?
>
> Let's not get too focused on updating the comment until we are in
> agreement about what the code ought to be doing.  I'm not clear
> whether you accept the point that the patch needs to be changed to
> avoid generating the same WCOs and returning lists in both the planner
> and the executor.

Yes, we can re-use the WCOs generated in the planner as an
optimization, since the ones we would re-generate for the same
relations will look exactly the same. The WCOs generated by the
planner (in inheritance_planner) are produced when, in
adjust_appendrel_attrs(), we change the attnos used in the query to
refer to the child RTEs, which adjusts the attnos of the WCOs of the
child RTEs. So the WCOs of the subplan resultRelInfos are actually the
parent table's WCOs, with only the attnos changed. And in
ExecInitModifyTable() we do the same thing for leaf partitions,
although using a different function, map_variable_attnos().

>
>>> Also, I feel like it's probably not correct to use the first result
>>> relation as the nominal relation for building WCOs and returning lists
>>> anyway.  I mean, if the first result relation has a different column
>>> order than the parent relation, isn't this just broken?  If it works
>>> for some reason, the comments don't explain what that reason is.

One thing I didn't mention earlier about the WCOs is that for child
rels, we don't use the WCOs defined for the child rels; we only
inherit the WCO expressions defined for the root rel. That's the
reason they are the same expressions, with only the attnos changed to
match the respective relation's tuple descriptor. If the WCOs of each
of the subplan resultRelInfos were different, then it would definitely
not be possible to use the first resultRelInfo to generate the other
leaf partition WCOs, because the WCO defined for relation A is
independent of that defined for relation B.

So, since the WCOs of all the relations are actually those of the
parent, we only need to adjust the attnos of any one of these
resultRelInfos.

For example, if the root rel's WCO is defined as "col > 5" where col
is the 4th column, the expression will look like "var_1.attno_4 > 5".
And the WCO that is generated for a subplan resultRelInfo will look
something like "var_n.attno_2 > 5" if col is the 2nd column in that
table.

All of the above logic assumes that we never use the WCOs defined for
the child relations. At least, that's how it looks from the way we
generate WCOs in ExecInitModifyTable() for INSERTs, as well as from
the code in inheritance_planner() for UPDATEs. In both these places,
we never use the WCOs defined for child tables.

So suppose we define the tables and their WCOs like this :

CREATE TABLE range_parted ( a text, b int, c int) partition by range (a, b);

ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
GRANT ALL ON range_parted TO PUBLIC ;
create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);

create table part_b_10_b_20 partition of range_parted for values from
('b', 10) to ('b', 20) partition

Re: [HACKERS] UPDATE of partition key

2017-06-28 Thread Amit Khandekar
On 26 June 2017 at 08:37, Amit Khandekar  wrote:
> On 22 June 2017 at 01:41, Robert Haas  wrote:
>>>> Second, it will amount to a functional bug if you get a
>>>> different answer than the planner did.
>>>
>>> Actually, the per-leaf WCOs are meant to be executed on the
>>> destination partitions where the tuple is moved, while the WCOs
>>> belonging to the per-subplan resultRelInfo are meant for the
>>> resultRelinfo used for the UPDATE plans. So actually it should not
>>> matter whether they look same or different, because they are fired at
>>> different objects. Now these objects can happen to be the same
>>> relations though.
>>>
>>> But in any case, it's not clear to me how the mapped WCO and the
>>> planner's WCO would yield a different answer if they are both the same
>>> relation. I am possibly missing something. The planner has already
>>> generated the withCheckOptions for each of the resultRelInfo. And then
>>> we are using one of those to re-generate the WCO for a leaf partition
>>> by only adjusting the attnos. If there is already a WCO generated in
>>> the planner for that leaf partition (because that partition was
>>> present in mtstate->resultRelInfo), then the re-built WCO should be
>>> exactly look same as the earlier one, because they are the same
>>> relations, and so the attnos generated in them would be same since the
>>> Relation TupleDesc is the same.
>>
>> If the planner's WCOs and mapped WCOs are always the same, then I
>> think we should try to avoid generating both.  If they can be
>> different, but that's intentional and correct, then there's no
>> substantive problem with the patch but the comments need to make it
>> clear why we are generating both.
>>
>>> Actually I meant, "above works for only local updates. For
>>> row-movement-updates, we need per-leaf partition WCOs, because when
>>> the row is inserted into target partition, that partition may be not
>>> be included in the above planner resultRelInfo, so we need WCOs for
>>> all partitions". I think this said comment should be sufficient if I
>>> add this in the code ?
>>
>> Let's not get too focused on updating the comment until we are in
>> agreement about what the code ought to be doing.  I'm not clear
>> whether you accept the point that the patch needs to be changed to
>> avoid generating the same WCOs and returning lists in both the planner
>> and the executor.
>
> Yes, we can re-use the WCOs generated in the planner, as an
> optimization, since those we re-generate for the same relations will
> look exactly the same. The WCOs generated by planner (in
> inheritance_planner) are generated when (in adjust_appendrel_attrs())
> we change attnos used in the query to refer to the child RTEs and this
> adjusts the attnos of the WCOs of the child RTEs. So the WCOs of
> subplan resultRelInfo are actually the parent table WCOs, but only the
> attnos changed. And in ExecInitModifyTable() we do the same thing for
> leaf partitions, although using different function
> map_variable_attnos().

In the attached patch v12, during UPDATE tuple routing setup, for each
leaf partition we now check whether it is already present in one of
the UPDATE per-subplan result rels. If it is, we re-use that one
rather than creating a new ResultRelInfo and opening the table again.

So mtstate->mt_partitions is now an array of ResultRelInfo pointers.
Each pointer points either to an UPDATE per-subplan result rel or to a
newly allocated ResultRelInfo.

For each of the leaf partitions, we have to search through the
per-subplan resultRelInfo oids to check whether there is a match. To
do this, I have created a temporary hash table which stores the oids
and ResultRelInfo pointers of the mtstate->resultRelInfo array, and
which can be used to look up each leaf partition's oid.

This patch version handles only the above discussion point. I will
follow up with the other points separately.


update-partition-key_v12.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-06-28 Thread Amit Khandekar
On 22 June 2017 at 01:57, Robert Haas  wrote:
> On Wed, Jun 21, 2017 at 1:38 PM, Amit Khandekar  
> wrote:
>>>> Yep, it's more appropriate to use
>>>> ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow.  That
>>>> is, if answer to the question I raised above is positive.
>>
>> From what I had checked earlier when coding that part,
>> rootResultRelInfo is NULL in case of inserts, unless something has
>> changed in later commits. That's the reason I decided to use the first
>> resultRelInfo.
>
> We're just going around in circles here.  Saying that you decided to
> use the first child's resultRelInfo because you didn't have a
> resultRelInfo for the parent is an explanation of why you wrote the
> code the way you did, but that doesn't make it correct.  I want to
> know why you think it's correct.

Yeah, that was just an FYI on how I decided to use the first
resultRelInfo; it was not meant to explain why using the first
resultRelInfo is correct. I have tried to explain that upthread.

>
> I think it's probably wrong, because it seems to me that if the INSERT
> code needs to use the parent's ResultRelInfo rather than the first
> child's ResultRelInfo, the UPDATE code probably needs to do the same.
> Commit d3cc37f1d801a6b5cad9bf179274a8d767f1ee50 got rid of
> resultRelInfos for non-leaf partitions, and commit
> e180c8aa8caf5c55a273d4a8e6092e77ff3cff10 added the resultRelInfo back
> for the topmost parent, because otherwise it didn't work correctly.



Regarding rootResultRelInfo: it would have been good if it were set
for both inserts and updates, but it isn't set for inserts.

For inserts:
In ExecInitModifyTable(), ModifyTableState->rootResultRelInfo remains
NULL because ModifyTable->rootResultRelIndex is -1:
/* If modifying a partitioned table, initialize the root table info */
if (node->rootResultRelIndex >= 0)
   mtstate->rootResultRelInfo = estate->es_root_result_relations +
node->rootResultRelIndex;


ModifyTable->rootResultRelIndex = -1 because it does not get set since
ModifyTable->partitioned_rels is NULL :

/*
* If the main target relation is a partitioned table, the
* following list contains the RT indexes of partitioned child
* relations including the root, which are not included in the
* above list.  We also keep RT indexes of the roots
* separately to be identified as such during the executor
* initialization.
*/
if (splan->partitioned_rels != NIL)
{
root->glob->nonleafResultRelations =
list_concat(root->glob->nonleafResultRelations,
list_copy(splan->partitioned_rels));
/* Remember where this root will be in the global list. */
splan->rootResultRelIndex = list_length(root->glob->rootResultRelations);
root->glob->rootResultRelations =
lappend_int(root->glob->rootResultRelations,
linitial_int(splan->partitioned_rels));
}

ModifyTable->partitioned_rels is NULL because inheritance_planner()
does not get called for INSERTs; instead, grouping_planner() gets
called :

subquery_planner()
{
/*
* Do the main planning.  If we have an inherited target relation, that
* needs special processing, else go straight to grouping_planner.
*/
if (parse->resultRelation && rt_fetch(parse->resultRelation,
parse->rtable)->inh)
   inheritance_planner(root);
else
   grouping_planner(root, false, tuple_fraction);

}

Above, inh is false in case of inserts.




Re: [HACKERS] UPDATE of partition key

2017-06-29 Thread Amit Khandekar
On 29 June 2017 at 07:42, Amit Langote  wrote:
> Hi Amit,
>
> On 2017/06/28 20:43, Amit Khandekar wrote:
>> In attached patch v12
>
> The patch no longer applies and fails to compile after the following
> commit was made yesterday:
>
> commit 501ed02cf6f4f60c3357775eb07578aebc912d3a
> Author: Andrew Gierth 
> Date:   Wed Jun 28 18:55:03 2017 +0100
>
> Fix transition tables for partition/inheritance.

Thanks for letting me know, Amit.

As Thomas mentioned upthread, the above commit already uses a tuple
conversion mapping from leaf partition to root partitioned table
(mt_transition_tupconv_maps), which serves the same purpose as that of
the mapping used in the update-partition-key patch during update tuple
routing (mt_resultrel_maps).

We need to try to merge these two into a general-purpose mapping array
such as mt_leaf_root_maps. I haven't done that in the rebased patch
(attached), so currently it has both these mapping fields.

For transition tables, this map is per-leaf-partition in the case of
inserts, whereas it is per-subplan result rel for updates. For
update-tuple routing, the mapping is required to be per-subplan. Now,
for update-row-movement in the presence of transition tables, we would
require both the per-subplan mapping and the per-leaf-partition
mapping, which can't be done with a single mapping field, unless we
have some way to identify which of the per-leaf partition mapping
elements belong to per-subplan rels.

So, it's not immediately possible to merge them.


update-partition-key_v12_rebased.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-06-29 Thread Amit Khandekar
On 22 June 2017 at 01:41, Robert Haas  wrote:
>>> +for (i = 0; i < num_rels; i++)
>>> +{
>>> +ResultRelInfo *resultRelInfo = &result_rels[i];
>>> +Relationrel = resultRelInfo->ri_RelationDesc;
>>> +Bitmapset *expr_attrs = NULL;
>>> +
>>> +pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
>>> +
>>> +/* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. 
>>> */
>>> +if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, 
>>> estate)))
>>> +return true;
>>> +}
>>>
>>> This seems like an awfully expensive way of performing this test.
>>> Under what circumstances could this be true for some result relations
>>> and false for others;
>>
>> One resultRelinfo can have no partition key column used in its quals,
>> but the next resultRelinfo can have quite different quals, and these
>> quals can have partition key referred. This is possible if the two of
>> them have different parents that have different partition-key columns.
>
> Hmm, true.  So if we have a table foo that is partitioned by list (a),
> and one of its children is a table bar that is partitioned by list
> (b), then we need to consider doing tuple-routing if either column a
> is modified, or if column b is modified for a partition which is a
> descendant of bar.  But visiting that only requires looking at the
> partitioned table and those children that are also partitioned, not
> all of the leaf partitions as the patch does.

The main concern is that the non-leaf partitions are not opened
(except the root), so we would need to open them in order to get the
partition key of the parents of the update result rels (or get only
the partition key atts and exprs from pg_partitioned_table).

There can be multiple approaches to finding partition key columns.

Approach 1: When there are a few update result rels and a large
partition tree, we traverse from each of the result rels to its
ancestors, opening each ancestor (via get_partition_parent()) to get
the partition key columns. For result rels having common parents, do
this only once.

Approach 2: If there are only a few partitioned tables and a large
number of update result rels, it would be easier to just open all the
partitioned tables and form the partition-key column bitmap out of all
their partition keys. If the bitmap does not contain any of the
updated columns, it is not a partition-key update. So for typical
non-partition-key updates, just opening the partitioned tables will
suffice, which would not affect the performance of normal updates.

But if the bitmap does contain updated columns, we cannot conclude
that it is a partition-key update; that could be a false positive. We
then need to further check whether the update result rels belong to
ancestors that have updated partition keys.

Approach 3: In RelationData, in a new bitmap field (rd_partcheckattrs?),
store the partition key attrs that are used in rd_partcheck. Populate
this field during generate_partition_qual().

So to conclude, I think we can do this:

Scenario 1:
Only one partitioned table, the root; all the rest are leaf
partitions. In this case, it is definitely efficient to just check the
root partition key, which will be sufficient.

Scenario 2:
There are a few non-leaf partitioned tables (3-4):
Open those tables and follow approach 2 above. If we don't find any
updated partition keys in any of them, well and good. If we do, fall
back to approach 3: for each of the update result rels, use the new
rd_partcheckattrs bitmap to determine whether it uses any of the
updated columns. This would be faster than pulling up the attrs from
the quals the way it was done in the patch.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-07-04 Thread Amit Khandekar
On 4 July 2017 at 14:38, Amit Langote  wrote:
> On 2017/07/04 17:25, Etsuro Fujita wrote:
>> On 2017/07/03 18:54, Amit Langote wrote:
>>> On 2017/07/02 20:10, Robert Haas wrote:
>>>> That
>>>> seems pretty easy to do - just have expand_inherited_rtentry() notice
>>>> that it's got a partitioned table and call
>>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
>>>> produce the list of OIDs.
>> Seems like a good idea.
>>
>>> Interesting idea.
>>>
>>> If we are going to do this, I think we may need to modify
>>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that
>>> does not do as much work.  Currently, it assumes that it's only ever
>>> called by ExecSetupPartitionTupleRouting() and hence also generates
>>> PartitionDispatchInfo objects for partitioned child tables.  We don't need
>>> that if called from within the planner.
>>>
>>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
>>> with its usage within the executor, because there is this comment:
>>>
>>>  /*
>>>   * We keep the partitioned ones open until we're done using the
>>>   * information being collected here (for example, see
>>>   * ExecEndModifyTable).
>>>   */
>>
>> Yeah, we need some refactoring work.  Is anyone working on that?
>
> I would like to take a shot at that if someone else hasn't already cooked
> up a patch.  Working on making RelationGetPartitionDispatchInfo() a
> routine callable from both within the planner and the executor should be a
> worthwhile effort.

What I am currently working on is to see whether we can call
find_all_inheritors() or find_inheritance_children() instead of
generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS().
Possibly we don't have to refactor it completely.
find_inheritance_children() needs to return the oids in canonical
order, so it needs to re-use the part of RelationBuildPartitionDesc()
where those oids are generated in that order. I am checking this part,
and am going to come up with an approach based on my findings.

Also, we need to investigate whether *always* sorting the oids in
canonical order is going to be much more expensive than the current
sorting by oid value. But I guess it won't be that expensive.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-07-04 Thread Amit Khandekar
On 4 July 2017 at 14:48, Amit Khandekar  wrote:
> On 4 July 2017 at 14:38, Amit Langote  wrote:
>> On 2017/07/04 17:25, Etsuro Fujita wrote:
>>> On 2017/07/03 18:54, Amit Langote wrote:
>>>> On 2017/07/02 20:10, Robert Haas wrote:
>>>>> That
>>>>> seems pretty easy to do - just have expand_inherited_rtentry() notice
>>>>> that it's got a partitioned table and call
>>>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
>>>>> produce the list of OIDs.
>>> Seems like a good idea.
>>>
>>>> Interesting idea.
>>>>
>>>> If we are going to do this, I think we may need to modify
>>>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that
>>>> does not do as much work.  Currently, it assumes that it's only ever
>>>> called by ExecSetupPartitionTupleRouting() and hence also generates
>>>> PartitionDispatchInfo objects for partitioned child tables.  We don't need
>>>> that if called from within the planner.
>>>>
>>>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
>>>> with its usage within the executor, because there is this comment:
>>>>
>>>>  /*
>>>>   * We keep the partitioned ones open until we're done using the
>>>>   * information being collected here (for example, see
>>>>   * ExecEndModifyTable).
>>>>   */
>>>
>>> Yeah, we need some refactoring work.  Is anyone working on that?
>>
>> I would like to take a shot at that if someone else hasn't already cooked
>> up a patch.  Working on making RelationGetPartitionDispatchInfo() a
>> routine callable from both within the planner and the executor should be a
>> worthwhile effort.
>
> What I am currently working on is to see if we can call
> find_all_inheritors() or find_inheritance_children() instead of
> generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS().
> Possibly we don't have to refactor it completely.
> find_inheritance_children() needs to return the oids in canonical
> order. So in find_inheritance_children () need to re-use part of
> RelationBuildPartitionDesc() where it generates those oids in that
> order. I am checking this part, and am going to come up with an
> approach based on findings.

The other approach is to produce the canonical ordering only in
find_all_inheritors(), by replacing the call to
find_inheritance_children() with the refactored/modified
RelationGetPartitionDispatchInfo(). But that would mean that the
callers of find_inheritance_children() would have one ordering, while
the callers of find_all_inheritors() would have a different ordering,
which raises the chance of deadlocks. That's why I think we need to
modify the common function find_inheritance_children() so that the
ordering is consistent everywhere, and then use
find_inheritance_children() or find_all_inheritors() in
RelationGetPartitionDispatchInfo(). So yes, there would be some
modifications to RelationGetPartitionDispatchInfo().

>
> Also, need to investigate whether *always* sorting the oids in
> canonical order is going to be much expensive than the current sorting
> using oids. But I guess it won't be that expensive.
>
>
> --
> Thanks,
> -Amit Khandekar
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-07-05 Thread Amit Khandekar
On 4 July 2017 at 15:23, Amit Khandekar  wrote:
> On 4 July 2017 at 14:48, Amit Khandekar  wrote:
>> On 4 July 2017 at 14:38, Amit Langote  wrote:
>>> On 2017/07/04 17:25, Etsuro Fujita wrote:
>>>> On 2017/07/03 18:54, Amit Langote wrote:
>>>>> On 2017/07/02 20:10, Robert Haas wrote:
>>>>>> That
>>>>>> seems pretty easy to do - just have expand_inherited_rtentry() notice
>>>>>> that it's got a partitioned table and call
>>>>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
>>>>>> produce the list of OIDs.
>>>> Seems like a good idea.
>>>>
>>>>> Interesting idea.
>>>>>
>>>>> If we are going to do this, I think we may need to modify
>>>>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that
>>>>> does not do as much work.  Currently, it assumes that it's only ever
>>>>> called by ExecSetupPartitionTupleRouting() and hence also generates
>>>>> PartitionDispatchInfo objects for partitioned child tables.  We don't need
>>>>> that if called from within the planner.
>>>>>
>>>>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
>>>>> with its usage within the executor, because there is this comment:
>>>>>
>>>>>  /*
>>>>>   * We keep the partitioned ones open until we're done using the
>>>>>   * information being collected here (for example, see
>>>>>   * ExecEndModifyTable).
>>>>>   */
>>>>
>>>> Yeah, we need some refactoring work.  Is anyone working on that?
>>>
>>> I would like to take a shot at that if someone else hasn't already cooked
>>> up a patch.  Working on making RelationGetPartitionDispatchInfo() a
>>> routine callable from both within the planner and the executor should be a
>>> worthwhile effort.
>>
>> What I am currently working on is to see if we can call
>> find_all_inheritors() or find_inheritance_children() instead of
>> generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS().
>> Possibly we don't have to refactor it completely.
>> find_inheritance_children() needs to return the oids in canonical
>> order. So in find_inheritance_children () need to re-use part of
>> RelationBuildPartitionDesc() where it generates those oids in that
>> order. I am checking this part, and am going to come up with an
>> approach based on findings.
>
> The other approach is to make canonical ordering only in
> find_all_inheritors() by replacing call to find_inheritance_children()
> with the refactored/modified RelationGetPartitionDispatchInfo(). But
> that would mean that the callers of find_inheritance_children() would
> have one ordering, while the callers of find_all_inheritors() would
> have a different ordering; that brings up chances of deadlocks. That's
> why I think, we need to think about modifying the common function
> find_inheritance_children(), so that we would be consistent with the
> ordering. And then use find_inheritance_children() or
> find_all_inheritors() in RelationGetPartitionDispatchInfo(). So yes,
> there would be some modifications to
> RelationGetPartitionDispatchInfo().
>
>>
>> Also, need to investigate whether *always* sorting the oids in
>> canonical order is going to be much expensive than the current sorting
>> using oids. But I guess it won't be that expensive.


Like I mentioned upthread... in expand_inherited_rtentry(), if we
replace find_all_inheritors() with something else that returns oids in
canonical order, that will change the order in which children tables
get locked, which increases the chance of deadlock. Because, then the
callers of find_all_inheritors() will lock them in one order, while
callers of expand_inherited_rtentry() will lock them in a different
order. Even in the current code, I think there is a chance of
deadlocks because RelationGetPartitionDispatchInfo() and
find_all_inheritors() have different lock ordering.

Now, to get the oids of a partitioned table children sorted by
canonical ordering, (i.e. using the partition bound values) we need to
either use the partition bounds to sort the oids like the way it is
done in RelationBuildPartitionDesc() or, open the parent table and get
its Relation->rd_partdesc->oids[] which are already sorted in
canonical order. So if we generate oids using this way in
find_all_inheritors() and find_inheritance_children(), that will
generate consistent ordering everywhere.
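
For illustration, here is a minimal sketch of that approach (not from any
attached patch; the helper name is made up), assuming the caller already
holds a lock on the parent:

/*
 * Hypothetical helper: return rel's immediate partition OIDs in
 * partition-bound (canonical) order.  rd_partdesc->oids[] is already kept
 * in that order by RelationBuildPartitionDesc().
 */
static List *
get_partition_oids_in_bound_order(Oid parentOid)
{
	Relation	parent = heap_open(parentOid, NoLock);	/* already locked */
	PartitionDesc partdesc = RelationGetPartitionDesc(parent);
	List	   *result = NIL;
	int			i;

	for (i = 0; i < partdesc->nparts; i++)
		result = lappend_oid(result, partdesc->oids[i]);

	heap_close(parent, NoLock);
	return result;
}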

Re: [HACKERS] Parallel Append implementation

2017-07-05 Thread Amit Khandekar
On 30 June 2017 at 15:10, Rafia Sabih  wrote:
>
> On Tue, Apr 4, 2017 at 12:37 PM, Amit Khandekar 
> wrote:
>>
>> Attached is an updated patch v13 that has some comments changed as per
>> your review, and also rebased on latest master.
>
>
> This is not applicable on the latest head i.e. commit --
> 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7, looks like need a rebasing.

Thanks for notifying. Attached is the rebased version of the patch.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v13_rebased.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-07-13 Thread Amit Khandekar
On 5 July 2017 at 15:12, Amit Khandekar  wrote:
> Like I mentioned upthread... in expand_inherited_rtentry(), if we
> replace find_all_inheritors() with something else that returns oids in
> canonical order, that will change the order in which child tables
> get locked, which increases the chance of deadlock. Because then the
> callers of find_all_inheritors() will lock them in one order, while
> callers of expand_inherited_rtentry() will lock them in a different
> order. Even in the current code, I think there is a chance of
> deadlocks because RelationGetPartitionDispatchInfo() and
> find_all_inheritors() have different lock ordering.
>
> Now, to get the oids of a partitioned table children sorted by
> canonical ordering, (i.e. using the partition bound values) we need to
> either use the partition bounds to sort the oids like the way it is
> done in RelationBuildPartitionDesc() or, open the parent table and get
> its Relation->rd_partdesc->oids[] which are already sorted in
> canonical order. So if we generate oids using this way in
> find_all_inheritors() and find_inheritance_children(), that will
> generate consistent ordering everywhere. But this method is quite
> expensive as compared to the way oids are generated and sorted using
> oid values in find_inheritance_children().
>
> In both expand_inherited_rtentry() and
> RelationGetPartitionDispatchInfo(), each of the child tables are
> opened.
>
> So, in both of these functions, what we can do is : call a new
> function partition_tree_walker() which does following :
> 1. Lock the children using the existing order (i.e. sorted by oid
> values) using the same function find_all_inheritors(). Rename
> find_all_inheritors() to lock_all_inheritors(... , bool return_oids)
> which returns the oid list only if requested.
> 2. And then scan through each of the partitions in canonical order, by
> opening the parent table, then opening the partition descriptor oids,
> and then doing whatever needs to be done with that partition rel.
>
> partition_tree_walker() will look something like this :
>
> void partition_tree_walker(Oid parentOid, LOCKMODE lockmode,
>                            void (*walker_func) (), void *context)
> {
> Relation parentrel;
> List *rels_list;
> ListCell *cell;
>
> (void) lock_all_inheritors(parentOid, lockmode,
>                            false /* don't generate oids */);
>
> parentrel = heap_open(parentOid, NoLock);
> rels_list = append_rel_partition_oids(NIL, parentrel);
>
> /* Scan through all partitioned rels, and at the
>  * same time append their children. */
> foreach(cell, rels_list)
> {
> /* Open partrel without locking; lock_all_inheritors() has locked it */
> Relation    partrel = heap_open(lfirst_oid(cell), NoLock);
>
> /* Append the children of a partitioned rel to the same list
>  * that we are iterating on */
> if (RelationGetPartitionDesc(partrel))
> rels_list = append_rel_partition_oids(rels_list, partrel);
>
> /*
>  * Do whatever processing needs to be done on this partrel.
>  * The walker function is free to either close the partrel
>  * or keep it open, but it needs to make sure the opened
>  * ones are closed later.
>  */
> walker_func(partrel, context);
> }
> }
>
> List *append_rel_partition_oids(List *rel_list, Relation rel)
> {
> int i;
> for (i = 0; i < rel->rd_partdesc->nparts; i++)
> rel_list = lappend_oid(rel_list, rel->rd_partdesc->oids[i]);
>
> return rel_list;
> }
>
>
> So, in expand_inherited_rtentry() the foreach(l, inhOIDs) loop will be
> replaced by partition_tree_walker(parentOid, expand_rte_walker_func)
> where expand_rte_walker_func() will do all the work done in the for
> loop for each of the partition rels.
>
> Similarly, in RelationGetPartitionDispatchInfo() the initial part
> where it uses APPEND_REL_PARTITION_OIDS() can be replaced by
> partition_tree_walker(rel, dispatch_info_walkerfunc) where
> dispatch_info_walkerfunc() will generate the oids, or may be populate
> the complete PartitionDispatchData structure. 'pd' variable can be
> passed as context to the partition_tree_walker(..., context)
>
> Generating the resultrels in canonical order by opening the tables
> using the above way wouldn't be more expensive than the existing code,
> because even currently we anyways have to open all the tables in both
> of these functions.
>

Attached is a WIP patch (make_resultrels_ordered.patch) that generates
the result rels in canonical order. This patch is kept separate from
the update-partition-key patch, and can be applied on master branch.

Re: [HACKERS] Oddity in error handling of constraint violation in ExecConstraints for partitioned tables

2017-07-23 Thread Amit Khandekar
>> On 2017/07/10 14:15, Etsuro Fujita wrote:
>> Another thing I noticed is the error handling in ExecWithCheckOptions; it
>> doesn't take any care for partition tables, so the column description in
>> the error message for WCO_VIEW_CHECK is built using the partition table's
>> rowtype, not the root table's.  But I think it'd be better to build that
>> using the root table's rowtype, like ExecConstraints, because that would
>> make the column description easier to understand since the parent view(s)
>> (from which WithCheckOptions evaluated there are created) would have been
>> defined on the root table.  This seems independent from the above issue,
>> so I created a separate patch, which I'm attaching. What do you think
>> about that?

+ if (map != NULL)
+ {
+ tuple = do_convert_tuple(tuple, map);
+ ExecStoreTuple(tuple, slot, InvalidBuffer, false);
+ }

Above, the tuple descriptor also needs to be set, since the parent and
child partitions can have different column ordering.

Due to this, the following testcase crashes :

CREATE TABLE range_parted (a text,b int, c int) partition by range (b);
CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c >
120) WITH CHECK OPTION;
create table part_a_1_a_10(b int, c int, a text);
alter table range_parted attach partition part_a_1_a_10 for values
from (1) to (10);
insert into upview values ('a', 2, 100);

Attached is a patch that sets the tuple descriptor.
Also in the patch, I have added a varchar column to one of the
partitioned tables used in updatable_views.sql, so as to cover this
scenario. Without setting the tuple descriptor, the output of the
patched updatable_views.sql shows a junk value in one of the columns of
the row in the error message emitted for the WithCheckOption violation.
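
In essence, the change is of the following shape (just a sketch, not the
exact patch hunk; "rootrel" stands for whatever variable holds the root
relation in that function):

if (map != NULL)
{
	tuple = do_convert_tuple(tuple, map);

	/*
	 * The converted tuple is in the root table's rowtype, so switch the
	 * slot to the root relation's tuple descriptor before storing the
	 * tuple; otherwise the datums would be interpreted using the leaf
	 * partition's (possibly differently ordered) descriptor.
	 */
	ExecSetSlotDescriptor(slot, RelationGetDescr(rootrel));
	ExecStoreTuple(tuple, slot, InvalidBuffer, false);
}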

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


set_slot_descriptor.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-07-23 Thread Amit Khandekar
On 13 July 2017 at 22:39, Amit Khandekar  wrote:
> Attached is a WIP patch (make_resultrels_ordered.patch) that generates
> the result rels in canonical order. This patch is kept separate from
> the update-partition-key patch, and can be applied on master branch.

Attached update-partition-key_v13.patch now contains this
make_resultrels_ordered.patch changes.

So now that the per-subplan result rels and the leaf partition oids
that are generated for tuple routing are both known to have the same
(canonical) order, in ExecSetupPartitionTupleRouting() we look up
the per-subplan result rels without the need for a hash table. Instead of
the hash table, we iterate over the leaf partition oids and at the
same time keep shifting a position over the per-subplan resultrels
whenever the resultrel at that position is found to be present in the
leaf partitions list. Since the two lists are in the same order, we
never have to rescan the portion of the lists that has already been
scanned.
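
A sketch of that position-shifting loop (all variable and helper names
here are illustrative, not the exact ones used in the patch):

int		update_rri_index = 0;	/* cursor over the per-subplan (UPDATE) result rels */
int		i;

for (i = 0; i < num_leaf_parts; i++)
{
	if (update_rri_index < num_update_rri &&
		RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) ==
		leaf_part_oids[i])
	{
		/* This leaf partition is also an UPDATE target: reuse its result rel */
		leaf_part_rri[i] = &update_rri[update_rri_index];
		update_rri_index++;
	}
	else
	{
		/* Not an UPDATE target: set up a new result rel just for routing */
		leaf_part_rri[i] = create_routing_result_rel(leaf_part_oids[i]);
	}
}

Because both lists are in canonical order, update_rri_index only ever moves
forward, so the whole matching is done in a single pass.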

I considered whether the issue behind this recent commit might be
relevant for update tuple-routing as well :
commit f81a91db4d1c2032632aa5df9fc14be24f5fe5ec
Author: Robert Haas 
Date:   Mon Jul 17 21:29:45 2017 -0400
Use a real RT index when setting up partition tuple routing.

Since we know that using a dummy 1 value for tuple routing result rels
is not correct, I am checking another possibility : Now in the
latest patch, the tuple routing partitions would have a mix of a)
existing update result-rels, and b) new partition resultrels. 'b'
resultrels would have the RT index of nominalRelation, but the
existing 'a' resultrels would have their own different RT indexes. I
suspect this might surface a similar issue to the one fixed by the
above commit, e.g. with a WITH query having UPDATE subqueries
doing tuple routing. Will check that.

This patch also has Robert's changes in the planner to decide whether
to do update tuple routing.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


update-partition-key_v13.patch
Description: Binary data



Re: [HACKERS] Oddity in error handling of constraint violation in ExecConstraints for partitioned tables

2017-07-24 Thread Amit Khandekar
On 24 July 2017 at 12:11, Amit Langote  wrote:
> Hi Amit,
>
> On 2017/07/24 14:09, Amit Khandekar wrote:
>>>> On 2017/07/10 14:15, Etsuro Fujita wrote:
>>>> Another thing I noticed is the error handling in ExecWithCheckOptions; it
>>>> doesn't take any care for partition tables, so the column description in
>>>> the error message for WCO_VIEW_CHECK is built using the partition table's
>>>> rowtype, not the root table's.  But I think it'd be better to build that
>>>> using the root table's rowtype, like ExecConstraints, because that would
>>>> make the column description easier to understand since the parent view(s)
>>>> (from which WithCheckOptions evaluated there are created) would have been
>>>> defined on the root table.  This seems independent from the above issue,
>>>> so I created a separate patch, which I'm attaching. What do you think
>>>> about that?
>>
>> + if (map != NULL)
>> + {
>> + tuple = do_convert_tuple(tuple, map);
>> + ExecStoreTuple(tuple, slot, InvalidBuffer, false);
>> + }
>>
>> Above, the tuple descriptor also needs to be set, since the parent and
>> child partitions can have different column ordering.
>>
>> Due to this, the following testcase crashes :
>>
>> CREATE TABLE range_parted (a text,b int, c int) partition by range (b);
>> CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c >
>> 120) WITH CHECK OPTION;
>> create table part_a_1_a_10(b int, c int, a text);
>> alter table range_parted attach partition part_a_1_a_10 for values
>> from (1) to (10);
>> insert into upview values ('a', 2, 100);
>
> Good catch.  Thanks for creating the patch.
>
> There are other places with similar code viz. those in ExecConstraints()
> that would need fixing.

Ah ok. I should have noticed that. Thanks.

> Attached is the updated version of your patch.
Now that this is done, any particular reason it is not done in
ExecPartitionCheck() ? I see that there is a do_convert_tuple() in
that function, again without changing the slot descriptor.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-07-25 Thread Amit Khandekar
On 25 July 2017 at 15:02, Rajkumar Raghuwanshi
 wrote:
> On Mon, Jul 24, 2017 at 11:23 AM, Amit Khandekar 
> wrote:
>>
>>
>> Attached update-partition-key_v13.patch now contains this
>> make_resultrels_ordered.patch changes.
>>
>
> I have applied the attached patch and got the below observation.
>
> Observation : if the join produces multiple output rows for a given row to be
> modified, I am seeing that it updates the row and also inserts rows into the
> target table; hence after the update, the total row count of the table has increased.

Thanks for catching this Rajkumar.

So after the row to be updated has already been moved to another partition,
when the next join output row corresponds to that same moved row,
that row is now deleted, so ExecDelete()=>heap_delete() gets
HeapTupleSelfUpdated, and this is not handled. So even when
ExecDelete() finds that the row is already deleted, we still call
ExecInsert(), so a new row is inserted. In ExecDelete(), we should
indicate that the row is already deleted. In the existing patch, there
is a parameter concurrently_deleted for ExecDelete() which indicates
that the row is concurrently deleted. I think we can use this
parameter for both of these purposes so as to avoid ExecInsert() for
both these scenarios. Will work on a patch.




Re: [HACKERS] UPDATE of partition key

2017-07-26 Thread Amit Khandekar
On 26 July 2017 at 02:37, Robert Haas  wrote:
> Is there any real benefit in this "walker" interface?  It looks to me
> like it might be simpler to just change things around so that it
> returns a list of OIDs, like find_all_inheritors, but generated
> differently.  Then if you want bound-ordering rather than
> OID-ordering, you just do this:
>
> list_free(inhOids);
> inhOids = get_partition_oids_in_bound_order(rel);
>
> That'd remove the need for some if/then logic as you've currently got
> in get_next_child().

Yes, I had considered that ; i.e., first generating just a list of
bound-ordered oids. But that consequently needs all the child tables
to be opened and closed twice; once during the list generation, and
then while expanding the partitioned table. Agreed that the second
time, heap_open() would not be that expensive because the tables would be
cached, but it would still require getting the cached relation handle
from the hash table. Since we anyway want to open the tables, it is better
to have a *next() function to go get the next partition in a fixed order.

Actually, there isn't much that the walker next() function does. Any
code that wants to traverse bound-wise can do that on its own. The
walker function is just a convenient way to make sure everyone
traverses in the same order.

Yet to go over other things including your review comments, and Amit
Langote's patch on refactoring RelationGetPartitionDispatchInfo().

> On another note, did you do anything about the suggestion Thomas made
> in 
> http://postgr.es/m/CAEepm=3sc_j1zwqdyrbu4dtfx5rhcamnnuaxrkwzfgt9m23...@mail.gmail.com
> ?

This is still pending on me; plus I think there are some more points.
I need to go over those and consolidate a list of todos.




-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] map_partition_varattnos() and whole-row vars

2017-07-28 Thread Amit Khandekar
On 28 July 2017 at 11:17, Amit Langote  wrote:
> On 2017/07/26 16:58, Amit Langote wrote:
>> Rajkumar Raghuwanshi reported [1] on the "UPDATE partition key" thread
>> that whole-row vars don't play nicely with partitioned table operations.
>>
>> For example, when used to convert WITH CHECK OPTION constraint expressions
>> and RETURNING target list expressions, it will error out if the
>> expressions contained whole-row vars.  That's a bug, because whole-row
>> vars are legal in those expressions.  I think the decision to output error
>> upon encountering a whole-row in the input expression was based on the
>> assumption that it will only ever be used to convert partition constraint
>> expressions, which cannot contain those.  So, we can relax that
>> requirement so that its other users are not bitten by it.
>>
>> Attached fixes that.
>
> Attached a revised version of the patch.
>
> Updated patch moves the if (found_whole_row) elog(ERROR) bit in
> map_partition_varattnos to the callers.  Only the callers know if
> whole-row vars are not expected in the expression it's getting converted
> from map_partition_varattnos.  Given the current message text (mentioning
> "partition key"), it didn't seem appropriate to have the elog inside this
> function, since it's callable from places where it wouldn't make sense.

create table foo (a int, b text) partition by list (a);
create table foo1 partition of foo for values in (1);
create table foo2(b text, a int) ;
alter table foo attach partition foo2 for values in (2);

postgres=# insert into foo values (1, 'hi there');
INSERT 0 1
postgres=# insert into foo values (2, 'bi there');
INSERT 0 1
postgres=# insert into foo values (2, 'error there') returning foo;
ERROR:  table row type and query-specified row type do not match
DETAIL:  Table has type text at ordinal position 1, but query expects integer.

This is due to a mismatch between the composite type tuple descriptor
of the leaf partition and the RETURNING slot's tuple descriptor.

We probably need to do what
inheritance_planner()=>adjust_appendrel_attrs() does for adjusting the
attrs in the child rel parse tree; i.e., handle some specific nodes
other than just simple Var nodes. In adjust_appendrel_attrs_mutator(),
for a whole-row Var node, it updates var->vartype to the child rel's rowtype.

I suspect that the other nodes that adjust_appendrel_attrs_mutator
handles (like JoinExpr, CurrentOfExpr, etc.) may also require similar
adjustments for our case, because WithCheckOptions (and I think even
RETURNING) can have a subquery.


>
> Thanks,
> Amit
>



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-07-28 Thread Amit Khandekar
s might be
automatically solved.



Below are the TODOS at this point :

Fix for bug reported by Rajkumar about update with join.
Do something about two separate mapping tables for Transition tables
and update tuple-routing.
GetUpdatedColumns() to be moved to header file.
More test scenarios in regression tests.
Need to check/test whether we are correctly applying insert policies
(and not update) while inserting a routed tuple.
Use getASTriggerResultRelInfo() for attrno mapping, rather than first
resultrel, for generating child WCO/RETURNING expression.
Address Robert's review comments on make_resultrel_ordered.patch.
pgindent.

[1] 
https://www.postgresql.org/message-id/d86d27ea-cc9d-5dbe-b131-e7dec4017983%40lab.ntt.co.jp

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] map_partition_varattnos() and whole-row vars

2017-08-01 Thread Amit Khandekar
var->vartype =
> context->to_rowtype;
> +   return (Node *) r;
> +   }
>
> I think we could set r->resulttype to the vartype (ie, "r->resulttype =
> newvar->vartype" instead of "r->resulttype = context->from_rowtype").

I agree.

---

Few more comments :

@@ -1240,7 +1247,7 @@ map_variable_attnos_mutator(Node *node,
 var->varlevelsup == context->sublevels_up)
 {
 /* Found a matching variable, make the substitution */

- Var*newvar = (Var *) palloc(sizeof(Var));
+ Var*newvar = copyObject(var);
  int attno = var->varattno;

 *newvar = *var;

Here, "*newvar = *var" should be removed.


---

-   result = map_partition_varattnos(result, 1, rel, parent);
+   result = map_partition_varattnos(result, 1, rel, parent,
+
  &found_whole_row);
+   /* There can never be a whole-row reference here */
+   if (found_whole_row)
+   elog(ERROR, "unexpected whole-row reference found in
partition key");

Instead of the callers of map_partition_varattnos() reporting the error, we
can have map_partition_varattnos() itself report the error. Instead of the
found_whole_row parameter of map_partition_varattnos(), we can have an
error_on_whole_row parameter. So callers who don't expect whole row,
would pass error_on_whole_row=true to map_partition_varattnos(). This
will simplify the resultant code a bit.
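
Call sites would then look something like this (a sketch; error_on_whole_row
is just the parameter name proposed above):

/* Callers where whole-row vars can never legally appear (partition constraints): */
constraintExpr = map_partition_varattnos(constraintExpr, 1, rel, parent,
										 true /* error_on_whole_row */ );

/* Callers where whole-row vars are legal (RETURNING / WITH CHECK expressions): */
returningList = map_partition_varattnos(returningList, varno, rel, parent,
										false /* error_on_whole_row */ );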

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-08-02 Thread Amit Khandekar
On 2 August 2017 at 14:38, Amit Langote  wrote:
> On 2017/07/29 2:45, Amit Khandekar wrote:
>> On 28 July 2017 at 20:10, Robert Haas  wrote:
>>> On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote wrote:
>>>> I checked that we get the same result relation order with both the
>>>> patches, but I would like to highlight a notable difference here between
>>>> the approaches taken by our patches.  In my patch, I have now taught
>>>> RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
>>>> in the tree, because we need to look at its partition descriptor to
>>>> collect partition OIDs and bounds.  We can defer locking (and opening the
>>>> relation descriptor of) leaf partitions to a point where planner has
>>>> determined that the partition will be accessed after all (not pruned),
>>>> which will be done in a separate patch of course.
>>>
>>> That's very desirable, but I believe it introduces a deadlock risk
>>> which Amit's patch avoids.  A transaction using the code you've
>>> written here is eventually going to lock all partitions, BUT it's
>>> going to move the partitioned ones to the front of the locking order
>>> vs. what find_all_inheritors would do.  So, when multi-level
>>> partitioning is in use, I think it could happen that some other
>>> transaction is accessing the table using a different code path that
>>> uses the find_all_inheritors order without modification.  If those
>>> locks conflict (e.g. query vs. DROP) then there's a deadlock risk.
>>
>> Yes, I agree. Even with single-level partitioning, the leaf partitions
>> ordered by find_all_inheritors() is by oid values, so that's also
>> going to be differently ordered.
>
> We do require to lock the parent first in any case.  Doesn't that prevent
> deadlocks by imparting an implicit order on locking by operations whose
> locks conflict.

Yes, maybe, but I am not too sure at this point. find_all_inheritors()
locks only the children, and the parent is already locked
separately. find_all_inheritors() does not require locking the
children with the same lockmode as the parent.

> Having said that, I think it would be desirable for all code paths to
> manipulate partitions in the same order.  For partitioned tables, I think
> we can make it the partition bound order by replacing all calls to
> find_all_inheritors and find_inheritance_children on partitioned table
> parents with something else that reads partition OIDs from the relcache
> (PartitionDesc) and traverses the partition tree breadth-first manner.
>
>>> Unfortunately I don't see any easy way around that problem, but maybe
>>> somebody else has an idea.
>>
>> One approach I had considered was to have find_inheritance_children()
>> itself lock the children in bound order, so that everyone will have
>> bound-ordered oids, but that would be too expensive since it requires
>> opening all partitioned tables to initialize partition descriptors. In
>> find_inheritance_children(), we get all oids without opening any
>> tables. But now that I think more of it, it's only the partitioned
>> tables that we have to open, not the leaf partitions; and furthermore,
>> I didn't see calls to find_inheritance_children() and
>> find_all_inheritors() in performance-critical code, except in
>> expand_inherited_rtentry(). All of them are in DDL commands; but yes,
>> that can change in the future.
>
> This approach more or less amounts to calling the new
> RelationGetPartitionDispatchInfo() (per my proposed patch, a version of
> which I posted upthread.)  Maybe we can add a wrapper on top, say,
> get_all_partition_oids() which throws away other things that
> RelationGetPartitionDispatchInfo() returned.  In addition it locks all the
> partitions that are returned, unlike only the partitioned ones, which is
> what RelationGetPartitionDispatchInfo() has been taught to do.

So there are three different task items here :
1. Arrange the oids in consistent order everywhere.
2. Prepare the Partition Dispatch Info data structure in the planner
as against during execution.
3. For update tuple routing, assume that the result rels are ordered
consistently to make the searching efficient.

#3 depends on #1. So for that, I have come up with a minimum set of
changes to have expand_inherited_rtentry() generate the rels in bound
order. When we do #2, we may need to redo my
changes in expand_inherited_rtentry(), but those changes are minimal. We may
even end up using the walker function in multiple places,
but right now that is not certain.

So, I thi

Re: [HACKERS] map_partition_varattnos() and whole-row vars

2017-08-02 Thread Amit Khandekar
On 2 August 2017 at 11:51, Amit Langote  wrote:
> Thanks Fuita-san and Amit for reviewing.
>
> On 2017/08/02 1:33, Amit Khandekar wrote:
>> On 1 August 2017 at 15:11, Etsuro Fujita  wrote:
>>> On 2017/07/31 18:56, Amit Langote wrote:
>>>> Yes, that's what's needed here.  So we need to teach
>>>> map_variable_attnos_mutator() to convert whole-row vars just like it's
>>>> done in adjust_appendrel_attrs_mutator().
>>>
>>>
>>> Seems reasonable.  (Though I think it might be better to do this kind of
>>> conversion in the planner, not the executor, because that would increase the
>>> efficiency of cached plans.)
>
> That's a good point, although it sounds like a bigger project that, IMHO,
> should be undertaken separately, because that would involve designing for
> considerations of expanding inheritance even in the INSERT case.
>
>> I think the work of shifting to planner should be taken as a different
>> task when we shift the preparation of DispatchInfo to the planner.
>
> Yeah, I think it'd be a good idea to do those projects together.  That is,
> doing what Fujita-san suggested and expanding partitioned tables in
> partition bound order in the planner.
>
>>>> Attached 2 patches:
>>>>
>>>> 0001: Addresses the bug that Rajkumar reported (handling whole-row vars in
>>>>WITH CHECK and RETURNING expressions at all)
>>>>
>>>> 0002: Addressed the bug that Amit reported (converting whole-row vars
>>>>that could occur in WITH CHECK and RETURNING expressions)
>>>
>>>
>>> I took a quick look at the patches.  One thing I noticed is this:
>>>
>>>  map_variable_attnos(Node *node,
>>> int target_varno, int sublevels_up,
>>> const AttrNumber *attno_map, int
>>> map_length,
>>> +   Oid from_rowtype, Oid to_rowtype,
>>> bool *found_whole_row)
>>>  {
>>> map_variable_attnos_context context;
>>> @@ -1291,6 +1318,8 @@ map_variable_attnos(Node *node,
>>> context.sublevels_up = sublevels_up;
>>> context.attno_map = attno_map;
>>> context.map_length = map_length;
>>> +   context.from_rowtype = from_rowtype;
>>> +   context.to_rowtype = to_rowtype;
>>> context.found_whole_row = found_whole_row;
>>>
>>> You added two arguments to pass to map_variable_attnos(): from_rowtype and
>>> to_rowtype.  But I'm not sure that we really need the from_rowtype argument
>>> because it's possible to get the Oid from the vartype of target whole-row
>>> Vars.  And in this:
>>>
>>> +   /*
>>> +* If the callers expects us to convert the
>>> same, do so if
>>> +* necessary.
>>> +*/
>>> +   if (OidIsValid(context->to_rowtype) &&
>>> +   OidIsValid(context->from_rowtype) &&
>>> +   context->to_rowtype !=
>>> context->from_rowtype)
>>> +   {
>>> +   ConvertRowtypeExpr *r =
>>> makeNode(ConvertRowtypeExpr);
>>> +
>>> +   r->arg = (Expr *) newvar;
>>> +   r->resulttype =
>>> context->from_rowtype;
>>> +   r->convertformat =
>>> COERCE_IMPLICIT_CAST;
>>> +   r->location = -1;
>>> +   /* Make sure the Var node has the
>>> right type ID, too */
>>> +   newvar->vartype =
>>> context->to_rowtype;
>>> +   return (Node *) r;
>>> +   }
>>>
>>> I think we could set r->resulttype to the vartype (ie, "r->resulttype =
>>> newvar->vartype" instead of "r->resulttype = context->from_rowtype").
>>
>> I agree.
>
> You're right, from_rowtype is unnecessary.
>
>> ---
>>
>> Few more comments :
>>
>> @@ -1240,7 +1247,7 @@ map_variable_

Re: [HACKERS] map_partition_varattnos() and whole-row vars

2017-08-02 Thread Amit Khandekar
On 3 August 2017 at 11:00, Amit Langote  wrote:
> Thanks for the review.
>
> On 2017/08/03 13:54, Amit Khandekar wrote:
>> On 2 August 2017 at 11:51, Amit Langote wrote:
>>> On 2017/08/02 1:33, Amit Khandekar wrote:
>>>> Instead of callers of map_partition_varattnos() reporting error, we
>>>> can have map_partition_varattnos() itself report error. Instead of the
>>>> found_whole_row parameter of map_partition_varattnos(), we can have
>>>> error_on_whole_row parameter. So callers who don't expect whole row,
>>>> would pass error_on_whole_row=true to map_partition_varattnos(). This
>>>> will simplify the resultant code a bit.
>>>
>>> So, I think it's only the callers of
>>> map_partition_varattnos() who know whether finding whole-row vars is an
>>> error and *what kind* of error.
>>
>> Ok. Got it. Thanks for the explanation.
>>
How about making found_whole_row an optional parameter ?
map_partition_varattnos() would populate it only if it's not NULL. This
way, callers who don't care about whether it is found or not don't
have to declare a variable and pass its address. In the current code,
ExecInitModifyTable() is the one that has to declare found_whole_row
>> without needing it.
>
> Sure, done.

Looks good.

> But while implementing that, I avoided changing anything in
> map_variable_attnos(), such as adding a check for not-NULLness of
> found_whole_row.  The only check added is in map_partition_varattnos().

Yes, I agree. That is anyway not relevant to the fix.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-08-04 Thread Amit Khandekar
>
> Below are the TODOS at this point :
>
> Fix for bug reported by Rajkumar about update with join.

I had explained the root issue of this bug here : [1]

Attached patch includes the fix, which is explained below.
Currently in the patch, there is a check for whether the tuple is concurrently
deleted by another session, i.e. when heap_update() returns
HeapTupleUpdated. In such a case we set the concurrently_deleted output
param to true. We should also do the same for the HeapTupleSelfUpdated
return value.

In fact, there are other places in ExecDelete() where it can return
without doing anything. For example, if a BR DELETE trigger prevents the
delete from happening, ExecBRDeleteTriggers() returns false, in which
case ExecDelete() returns.

So what the fix does is : rename the concurrently_deleted parameter to
delete_skipped so as to indicate a more general status : whether the
delete has actually happened or was skipped. And set this param to
true whenever the delete does not actually happen. This allows us to
avoid calling ExecInsert() (i.e. adding a new row) for the trigger case
as well.

Added a test scenario for the UPDATE-with-JOIN case, and also for the
TRIGGER case.
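
For reference, the intended ExecDelete() flow is roughly the following
(a sketch using the existing ExecDelete() locals; it assumes the
delete_skipped flag starts out true and is cleared only once the row is
actually deleted):

if (delete_skipped)
	*delete_skipped = true;		/* assume skipped until proven otherwise */

/* A BR DELETE trigger may tell us to skip the delete altogether */
if (resultRelInfo->ri_TrigDesc &&
	resultRelInfo->ri_TrigDesc->trig_delete_before_row &&
	!ExecBRDeleteTriggers(estate, epqstate, resultRelInfo, tupleid, oldtuple))
	return NULL;

result = heap_delete(resultRelationDesc, tupleid,
					 estate->es_output_cid,
					 estate->es_crosscheck_snapshot,
					 true /* wait for commit */ ,
					 &hufd);
switch (result)
{
	case HeapTupleSelfUpdated:
		/* Row already deleted earlier in this same command (e.g. a
		 * second join output row for a row that was already moved) */
		return NULL;

	case HeapTupleUpdated:
		/* Concurrently updated/deleted by another transaction */
		return NULL;

	case HeapTupleMayBeUpdated:
		break;

	default:
		elog(ERROR, "unrecognized heap_delete status: %u", result);
}

/* Only now has the row really been deleted */
if (delete_skipped)
	*delete_skipped = false;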

> Do something about two separate mapping tables for Transition tables
> and update tuple-routing.
On 1 July 2017 at 03:15, Thomas Munro  wrote:
> Would make sense to have a set of functions with names like
> GetConvertor{From,To}{Subplan,Leaf}(mtstate, index) which build arrays
> m_convertors_{from,to}_by_{subplan,leaf} the first time they need
> them?

This was discussed here : [2]. I think even if we have them built when
needed, still, in the presence of both tuple routing and transition tables,
we do need separate arrays. So I think rather than dynamic arrays, we
can have static arrays, but their elements will point to a shared
TupleConversionMap structure whenever possible.
As already done in the patch, in case of insert/update tuple routing, there
is a per-leaf-partition mt_transition_tupconv_maps array for
transition tables, and a separate per-subplan array mt_resultrel_maps
for update tuple routing. *But*, what I am proposing is: for the
mt_transition_tupconv_maps[] element for which the leaf partition also
exists as a per-subplan result, that array element and the
mt_resultrel_maps[] element will point to the same TupleConversionMap
structure.

This is quite similar to how we are re-using the per-subplan
resultrels for the per-leaf result rels. We will re-use the
per-subplan TupleConversionMap for the per-leaf
mt_transition_tupconv_maps[] elements.

This is not yet implemented.
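
A sketch of the proposed sharing (assuming, as above, that both kinds of
maps convert a leaf partition's rowtype to the target/root rowtype;
leaf_to_subplan_index[], leaf_part_rels[] and targetRel are illustrative
local names):

for (i = 0; i < num_leaf_parts; i++)
{
	int		subplan_idx = leaf_to_subplan_index[i]; /* -1 if not an UPDATE target */

	if (subplan_idx >= 0 && mtstate->mt_resultrel_maps != NULL)
	{
		/* Reuse the map already built for update tuple routing */
		mtstate->mt_transition_tupconv_maps[i] =
			mtstate->mt_resultrel_maps[subplan_idx];
	}
	else
	{
		mtstate->mt_transition_tupconv_maps[i] =
			convert_tuples_by_name(RelationGetDescr(leaf_part_rels[i]),
								   RelationGetDescr(targetRel),
								   gettext_noop("could not convert row type"));
	}
}

Whoever frees the maps would then have to be careful to free each shared
TupleConversionMap only once.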

> GetUpdatedColumns() to be moved to header file.

Done. I have moved it to execnodes.h

> More test scenarios in regression tests.
> Need to check/test whether we are correctly applying insert policies
> (and not update) while inserting a routed tuple.

Yet to do above two.

> Use getASTriggerResultRelInfo() for attrno mapping, rather than first
> resultrel, for generating child WCO/RETURNING expression.
>

Regarding generating child WithCheckOption and Returning expressions
using those of the root result relation, ModifyTablePath and
ModifyTable should have new fields rootReturningList (and
rootWithCheckOptions) which would be derived from
root->parse->returningList in inheritance_planner(). But then, similar
to per-subplan returningList, rootReturningList would have to pass
through set_plan_refs()=>set_returning_clause_references() which
requires the subplan targetlist to be passed. Because of this, for
rootReturningList, we require a subplan for the root partition, which is
not there currently; we have subplans only for child rels. That means
we would have to create such a plan only for the sake of generating
rootReturningList.

The other option is to do the way the patch is currently doing in the
executor by using the returningList of the first per-subplan result
rel to generate the other child returningList (and WithCheckOption).
This is working by applying map_partition_varattnos() to the first
returningList. But now that we have realized that we have to specially
handle whole-row vars, map_partition_varattnos() would need some
changes to convert whole-row vars differently for
child-rel-to-child-rel mapping. For childrel-to-childrel conversion,
the whole-row var is already wrapped by ConvertRowtypeExpr, but we
need to change its Var->vartype to the new child vartype.

I think the second option looks easier, but I am open to suggestions,
and I am myself still checking the first one.
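
Roughly, the second option amounts to the following for each leaf
partition (a sketch, assuming the 5-argument map_partition_varattnos()
from the patch under discussion; firstReturningList, firstVarno and
firstResultRel refer to the first per-subplan result rel):

List	   *leafReturningList;
bool		found_whole_row;

/*
 * Remap the first per-subplan result rel's RETURNING expressions onto
 * this leaf partition's attribute numbers.  Whole-row vars are legal
 * here; the childrel-to-childrel case additionally needs the converted
 * whole-row var's vartype changed to the leaf's rowtype, which is the
 * map_partition_varattnos() change mentioned above.
 */
leafReturningList = map_partition_varattnos(firstReturningList,
											firstVarno,
											leafRel,		/* map "to" this rel */
											firstResultRel, /* map "from" this rel */
											&found_whole_row);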

> Address Robert's review comments on make_resultrel_ordered.patch.
>
> +typedef struct ParentChild
>
> This is a pretty generic name.  Pick something more specific and informative.

I have used ChildPartitionInfo. But suggestions welcome.

>
> +static List *append_rel_partition_oids(List *rel_list, Relation rel);
>
> One could be forgiven for thinking that this function was just going
> to append OIDs, but it actually appends ParentChild structures, so I
> think the name needs work.

Renamed it to append_child_partitions().

>
> +List *append_rel_partition_oids(List *rel_list, 

Re: [HACKERS] expanding inheritance in partition bound order

2017-08-06 Thread Amit Khandekar
On 4 August 2017 at 22:55, Robert Haas  wrote:
>
> 1. Before calling RelationGetPartitionDispatchInfo, the calling code
> should use find_all_inheritors to lock all the relevant relations (or
> the planner could use find_all_inheritors to get a list of relation
> OIDs, store it in the plan in order, and then at execution time we
> visit them in that order and lock them).
>
> 2. RelationGetPartitionDispatchInfo assumes the relations are already locked.

I agree. I think overall, we should keep
RelationGetPartitionDispatchInfo() only for preparing the dispatch
info in the planner, and generate the locked oids (using
find_all_inheritors() or get_partitioned_oids() or whatever) *without*
using RelationGetPartitionDispatchInfo(), since
RelationGetPartitionDispatchInfo() is generating the pd structure
which we don't want in every expansion.

>
> 3. While we're optimizing, in the first loop inside of
> RelationGetPartitionDispatchInfo, don't call heap_open().  Instead,
> use get_rel_relkind() to see whether we've got a partitioned table; if
> so, open it.  If not, there's no need.

Yes, this way we need to open only the partitioned tables.
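
A sketch of what that first loop could then look like (inhOIDs and the
loop body are illustrative; the children are assumed to be locked already):

ListCell   *lc;

foreach(lc, inhOIDs)
{
	Oid			childOid = lfirst_oid(lc);
	Relation	childRel;

	/* Leaf partitions need not be opened at all in this pass */
	if (get_rel_relkind(childOid) != RELKIND_PARTITIONED_TABLE)
		continue;

	childRel = heap_open(childOid, NoLock); /* locked by find_all_inheritors() */
	/* ... collect childRel->rd_partdesc->oids[] and bounds here ... */
	heap_close(childRel, NoLock);
}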

> P.S. While I haven't reviewed 0002 in detail, I think the concept of
> minimizing what needs to be built in RelationGetPartitionDispatchInfo
> is a very good idea.

True. I also think RelationGetPartitionDispatchInfo() should be
called while preparing the ModifyTable plan; the PartitionDispatch
data structure returned by RelationGetPartitionDispatchInfo() should
be stored in that plan, and then the execution-time fields in
PartitionDispatch would be populated in ExecInitModifyTable().


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-08-09 Thread Amit Khandekar
On 9 August 2017 at 19:05, Robert Haas  wrote:
> On Wed, Jul 5, 2017 at 7:53 AM, Amit Khandekar  wrote:
>>> This is not applicable on the latest head i.e. commit --
>>> 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7, looks like need a rebasing.
>>
>> Thanks for notifying. Attached is the rebased version of the patch.
>
> This again needs a rebase.

Attached is the rebased version of the patch. Thanks.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v13_rebased_2.patch
Description: Binary data



Re: [HACKERS] expanding inheritance in partition bound order

2017-08-16 Thread Amit Khandekar
On 16 August 2017 at 11:06, Amit Langote  wrote:

> Attached updated patches.

Thanks Amit for the patches.

I too agree with the overall approach taken for keeping the locking
order consistent: it's best to do the locking with the existing
find_all_inheritors() since it is much cheaper than to lock them in
partition-bound order, the latter being expensive since it requires
opening partitioned tables.

> I haven't yet done anything about changing the timing of opening and
> locking leaf partitions, because it will require some more thinking about
> the required planner changes.  But the above set of patches will get us
> far enough to get leaf partition sub-plans appear in the partition bound
> order (same order as what partition tuple-routing uses in the executor).

So, I believe none of the changes done in pg_inherits.c are essential
for expanding the inheritance tree in bound order, right ? I am not
sure whether we are planning to commit these two things together or
incrementally :
1. Expand in bound order
2. Allow for locking only the partitioned tables first.

For #1, the changes in pg_inherits.c are not required (viz, keeping
partitioned tables at the head of the list, adding inhchildparted
column, etc).

If we are going to do #2 together with #1, then in the patch set there
is nothing using the capability introduced by #2. That is, there are no
callers of find_all_inheritors() that make use of the new
num_partitioned_children parameter. Also, there is no boolean
parameter for find_all_inheritors() to be used to lock only the
partitioned tables.

I feel we should think about
0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and
first get the review done for the other patches.

---

I see that RelationGetPartitionDispatchInfo() now returns quite a
small subset of what it used to return, which is good. But I feel for
expand_inherited_rtentry(), RelationGetPartitionDispatchInfo() is
still a bit heavy. We only require the oids, so the
PartitionedTableInfo data structure (including the pd->indexes array)
gets discarded.

Also, RelationGetPartitionDispatchInfo() has to call get_rel_relkind()
for each child. In expand_inherited_rtentry(), we anyway have to open
all the child tables, so we get the partition descriptors for each of
the children for free. So how about, in expand_inherited_rtentry(), we
traverse the partition tree using these descriptors similar to how it
is traversed in RelationGetPartitionDispatchInfo() ? Maybe, to avoid
code duplication for traversing, we can have a common API.

Still looking at RelationGetPartitionDispatchInfo() changes ...


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-08-17 Thread Amit Khandekar
On 16 August 2017 at 18:34, Robert Haas  wrote:
> Thanks for the benchmarking results!
>
> On Tue, Aug 15, 2017 at 11:35 PM, Rafia Sabih
>  wrote:
>> Q4 | 244 | 12 | PA and PWJ, time by only PWJ - 41
>
> 12 seconds instead of 244?  Whoa.  I find it curious that we picked a
> Parallel Append with a bunch of non-partial plans when we could've
> just as easily picked partial plans, or so it seems to me.  To put
> that another way, why did we end up with a bunch of Bitmap Heap Scans
> here instead of Parallel Bitmap Heap Scans?
>
>> Q7 | 134 | 88 | PA only
>> Q18 | 508 | 489 | PA only
>
> What's interesting in these results is that the join order changes
> quite a lot when PA is in the mix, and I don't really see why that
> should happen.

Yes, it seems hard to determine why exactly the join order changes.
Parallel Append is expected to give the benefit especially if there
are no partial subplans. But for all of the cases here, partial
subplans seem possible, and so even on HEAD it executed Partial
Append. So between a Parallel Append having partial subplans and a
Partial Append having partial subplans , the cost difference would not
be significant. Even if we assume that Parallel Append was chosen
because its cost turned out to be a bit cheaper, the actual
performance gain seems quite large as compared to the expected cost
difference. So it is even possible that the performance gain
is due to some other reasons. I will investigate this, and the
other queries.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] expanding inheritance in partition bound order

2017-08-17 Thread Amit Khandekar
On 17 August 2017 at 06:39, Amit Langote  wrote:
> Hi Amit,
>
> Thanks for the comments.
>
> On 2017/08/16 20:30, Amit Khandekar wrote:
>> On 16 August 2017 at 11:06, Amit Langote  
>> wrote:
>>
>>> Attached updated patches.
>>
>> Thanks Amit for the patches.
>>
>> I too agree with the overall approach taken for keeping the locking
>> order consistent: it's best to do the locking with the existing
>> find_all_inheritors() since it is much cheaper than to lock them in
>> partition-bound order, the later being expensive since it requires
>> opening partitioned tables.
>
> Yeah.  Per the Robert's design idea, we will always do the *locking* in
> the order determined by find_all_inheritors/find_inheritance_children.
>
>>> I haven't yet done anything about changing the timing of opening and
>>> locking leaf partitions, because it will require some more thinking about
>>> the required planner changes.  But the above set of patches will get us
>>> far enough to get leaf partition sub-plans appear in the partition bound
>>> order (same order as what partition tuple-routing uses in the executor).
>>
>> So, I believe none of the changes done in pg_inherits.c are essential
>> for expanding the inheritence tree in bound order, right ?
>
> Right.
>
> The changes to pg_inherits.c are only about recognizing partitioned tables
> in an inheritance hierarchy and putting them ahead in the returned list.
> Now that I think of it, the title of the patch (teach pg_inherits.c about
> "partitioning") sounds a bit confusing.  In particular, the patch does not
> teach it things like partition bound order, just that some tables in the
> hierarchy could be partitioned tables.
>
>> I am not
>> sure whether we are planning to commit these two things together or
>> incrementally :
>> 1. Expand in bound order
>> 2. Allow for locking only the partitioned tables first.
>>
>> For #1, the changes in pg_inherits.c are not required (viz, keeping
>> partitioned tables at the head of the list, adding inhchildparted
>> column, etc).
>
> Yes.  Changes to pg_inherits.c can be committed completely independently
> of either 1 or 2, although then there would be nobody using that capability.
>
> About 2: I think the capability to lock only the partitioned tables in
> expand_inherited_rtentry() will only be used once we have the patch to do
> the necessary planner restructuring that will allow us to defer child
> table locking to some place that is not expand_inherited_rtentry().  I am
> working on that patch now and should be able to show something soon.
>
>> If we are going to do #2 together with #1, then in the patch set there
>> is no one using the capability introduced by #2. That is, there are no
>> callers of find_all_inheritors() that make use of the new
>> num_partitioned_children parameter. Also, there is no boolean
>> parameter  for find_all_inheritors() to be used to lock only the
>> partitioned tables.
>>
>> I feel we should think about
>> 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and
>> first get the review done for the other patches.
>
> I think that makes sense.
>
>> I see that RelationGetPartitionDispatchInfo() now returns quite a
>> small subset of what it used to return, which is good. But I feel for
>> expand_inherited_rtentry(), RelationGetPartitionDispatchInfo() is
>> still a bit heavy. We only require the oids, so the
>> PartitionedTableInfo data structure (including the pd->indexes array)
>> gets discarded.
>
> Maybe we could make the output argument optional, but I don't see much
> point in being too conservative here.  Building the indexes array does not
> cost us that much and if a not-too-distant-in-future patch could use that
> information somehow, it could do so for free.

Ok, so these changes are mostly kept with some future use-cases in
mind. Otherwise, I was thinking we could just have a light-weight
function to generate the oids, and keep the current
RelationGetPartitionDispatchInfo() intact.

Anyways, some more comments :

In ExecSetupPartitionTupleRouting(), not sure why ptrinfos array is an
array of pointers. Why can't it be an array of
PartitionTupleRoutingInfo structures rather than pointers to that
structure ?

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
+ * Close all the leaf partitions and their indices.
*
Above comment needs to be shifted a bit down to the subsequent "for"
loop where it's actually applicable.

* node->mt_partition_dispatch_info[0] corresponds to the root partitioned
* table, for which w

Re: [HACKERS] expanding inheritance in partition bound order

2017-08-17 Thread Amit Khandekar
On 18 August 2017 at 01:24, Robert Haas  wrote:
> On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat
>  wrote:
>> [2] had a patch with some changes to the original patch you posted. I
>> didn't describe those changes in my mail, since they rearranged the
>> comments. Those changes are not part of this patch and you haven't
>> comments about those changes as well. If you have intentionally
>> excluded those changes, it's fine. In case, you haven't reviewed them,
>> please see if they are good to be incorporated.
>
> I took a quick look at your version but I think I like Amit's fine the
> way it is, so committed that and back-patched it to v10.
>
> I find 0002 pretty ugly as things stand.  We get a bunch of tuple maps
> that we don't really need, only to turn around and free them.  We get
> a bunch of tuple slots that we don't need, only to turn around and
> drop them.

I think in the final changes after applying all 3 patches, the
redundant tuple slot is no longer present. But ...
> We don't really need the PartitionDispatch objects either,
> except for the OIDs they contain.  There's a lot of extra stuff being
> computed here that is really irrelevant for this purpose.  I think we
> should try to clean that up somehow.
... I am of the same opinion. That's why - as I mentioned upthread - I
was thinking why not have a separate light-weight function to just
generate oids, and keep RelationGetPartitionDispatchInfo() intact, to
be used only for tuple routing.

But, I haven't yet checked Ashutosh's requirements, which suggest
that it does not help to even get the oid list.

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] expanding inheritance in partition bound order

2017-08-17 Thread Amit Khandekar
On 18 August 2017 at 10:55, Amit Langote  wrote:
> On 2017/08/18 4:54, Robert Haas wrote:
>> On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat
>>  wrote:
>>> [2] had a patch with some changes to the original patch you posted. I
>>> didn't describe those changes in my mail, since they rearranged the
>>> comments. Those changes are not part of this patch and you haven't
>>> comments about those changes as well. If you have intentionally
>>> excluded those changes, it's fine. In case, you haven't reviewed them,
>>> please see if they are good to be incorporated.
>>
>> I took a quick look at your version but I think I like Amit's fine the
>> way it is, so committed that and back-patched it to v10.
>
> Thanks for committing.
>
>> I find 0002 pretty ugly as things stand.  We get a bunch of tuple maps
>> that we don't really need, only to turn around and free them.  We get
>> a bunch of tuple slots that we don't need, only to turn around and
>> drop them.  We don't really need the PartitionDispatch objects either,
>> except for the OIDs they contain.  There's a lot of extra stuff being
>> computed here that is really irrelevant for this purpose.  I think we
>> should try to clean that up somehow.
>
> One way to do that might be to reverse the order of the remaining patches
> and put the patch to refactor RelationGetPartitionDispatchInfo() first.
> With that refactoring, PartitionDispatch itself has become much simpler in
> that it does not contain a relcache reference to be closed eventually by
> the caller, the tuple map, and the tuple table slot.  Since those things
> are required for tuple-routing, the refactoring makes
> ExecSetupPartitionTupleRouting itself create them from the (minimal)
> information returned by RelationGetPartitionDispatchInfo and ultimately
> destroy when done using it.  I kept the indexes field in
> PartitionDispatchData though, because it's essentially free to create
> while we are walking the partition tree in
> RelationGetPartitionDispatchInfo() and it seems undesirable to make the
> caller compute that information (indexes) by traversing the partition tree
> all over again, if it doesn't otherwise have to.  I am still considering
> some counter-arguments raised by Amit Khandekar about this last assertion.
>
> Thoughts?

Another approach (that I have used in the update-partition-key patch)
is to *not* generate the oids beforehand, and instead call a
partition_walker_next() function to traverse through the tree. Each
next() call would return a ChildInfo that includes the child oid,
parent oid, etc. All users of this would be guaranteed a fixed order of
oids. In the update-partition-key patch, I am opening and closing each
of the children, which, of course, we need to avoid.
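
To illustrate, the iterator-style interface would look roughly like this
(all of these names are illustrative, not taken verbatim from the posted
patch):

typedef struct ChildInfo
{
	Oid			childOid;		/* the partition itself */
	Oid			parentOid;		/* its immediate parent */
} ChildInfo;

static void
process_partition_tree(Relation rootRel, LOCKMODE lockmode)
{
	PartitionWalker walker;
	ChildInfo  *child;

	partition_walker_init(&walker, rootRel, lockmode);
	while ((child = partition_walker_next(&walker)) != NULL)
	{
		/*
		 * Children come out in one fixed (bound-wise) order, already
		 * locked by the walker; process child->childOid here.
		 */
	}
}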




-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Declarative partitioning - another take

2017-04-26 Thread Amit Khandekar
On 26 April 2017 at 00:28, Robert Haas  wrote:
> So what I'd prefer -- on
> the totally unprincipled basis that it would let us improve
> performance in the future -- if you operate on a partition directly,
> you fire the partition's triggers, but if you operate on the parent,
> only the parent's triggers fire.

I would also opt for this behaviour.

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-05-02 Thread Amit Khandekar
On 2 May 2017 at 18:17, Robert Haas  wrote:
> On Tue, Apr 4, 2017 at 7:11 AM, Amit Khandekar  wrote:
>> Attached updated patch v7 has the above changes.
>
> This no longer applies.  Please rebase.

Thanks Robert for informing about this.

My patch has a separate function for emitting an error message when a
partition constraint fails. And the recent commit c0a8ae7be3 has
changes to correct the way the tuple is formed for displaying in the
error message. Hence there were some code-level conflicts.

Attached is the rebased patch, which resolves the above conflicts.


update-partition-key_v7_rebased.patch
Description: Binary data



Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)

2017-05-04 Thread Amit Khandekar
On 5 May 2017 at 07:50, David Rowley  wrote:
>  On 3 May 2017 at 07:13, Robert Haas  wrote:
>> It is of course possible that the Parallel Seq Scan could run into
>> contention problems if the number of workers is large, but in my
>> experience there are bigger problems here.  The non-parallel Seq Scan
>> can also contend -- not of course over the shared mutex because there
>> isn't one, but over access to the blocks themselves.  Every one of
>> those blocks has a content lock and a buffer header and so on, and
>> having multiple processes accessing those things at the same time
>> scales well, but not perfectly.  The Hash node can also contend: if
>> the hash join spills to disk, you've got multiple processes reading
>> and writing to the temp directory at the same time and, of course,
>> that can be worse than just one process doing it -- sometimes much
>> worse.  It can also be better, depending on how much I/O gets
>> generated and how much I/O bandwidth you have.
>
> Yeah, I did get some time to look over the contention in Parallel Seq
> Scan a while back, and I discovered that, on the machine that I was
> testing on, the lock obtained in heap_parallelscan_nextpage() was
> causing workers to have to wait for other workers to fetch their next
> task to work on.
>
> I ended up writing the attached (which I'd not intended to post until
> some time closer to when the doors open for PG11). At the moment it's
> basically just a test patch to see how it affects things when we give
> workers a bit more to do before they come back to look for more work.
> In this case, I've just given them 10 pages to work on, instead of the
> 1 that's allocated in 9.6 and v10.
>
> A quick test on a pretty large table on a large machine shows:
>
> Unpatched:
>
> postgres=# select count(*) from a;
>count
> 
>  187400
> (1 row)
>
> Time: 5211.485 ms (00:05.211)
>
> Patched:
>
> postgres=# select count(*) from a;
>count
> 
>  187400
> (1 row)
>
> Time: 2523.983 ms (00:02.524)

The result is quite impressive.

>
> So it seems worth looking into. "a" was just a table with a single int
> column. I'm unsure as yet if there are more gains to be had for tables
> with wider tuples. I do suspect the patch will be a bigger win in
> those cases, since there's less work to do for each page, e.g. fewer
> advance aggregate calls, so likely they'll be looking for their next
> page a bit sooner.
>
> Now I'm not going to pretend that this patch is ready for the
> prime-time. I've not yet worked out how to properly report sync-scan
> locations without risking reporting later pages after reporting the
> end of the scan. What I have at the moment could cause a report to be
> missed if SYNC_SCAN_REPORT_INTERVAL is not divisible by the batch
> size. I'm also not sure how batching like this affect read-aheads, but
> at least the numbers above speak for something. Although none of the
> pages in this case came from disk.
>
> I'd had thoughts that the 10 pages wouldn't be constant, but the
> batching size would depend on the size of the relation to be scanned.
> I'd had rough ideas to just try to make about 1 million batches. Something
> like batch_pages = Max(parallel_scan->phs_nblocks / 100, 1); so
> that we only take more than 1 page if there's some decent amount to
> process. We don't want to make the batches too big as we might end up
> having to wait on slow workers at the end of a scan.

I was wondering: if we keep increasing the batch size, that might lead
to I/O contention. The larger the batch size, the higher the chance of
causing more random I/O, because the workers would be accessing disk
blocks that are far apart from each other, in parallel. So there might
be a trade-off here. (Separately, I/O contention testing needs to be
done in general, for many scenarios.)
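
Just to make that trade-off concrete, here is a minimal standalone
sketch (not David's actual patch) of what per-worker batching of the
shared block counter looks like. SharedScanDesc, WorkerBatch,
BATCH_PAGES and the use of a C11 atomic instead of the lock that
heap_parallelscan_nextpage() actually takes are all simplifications
for illustration:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int BlockNumber;

#define BATCH_PAGES 10          /* blocks claimed per visit to shared state */

typedef struct SharedScanDesc
{
    _Atomic BlockNumber next_block; /* next unclaimed block of the relation */
    BlockNumber nblocks;            /* total blocks in the relation */
} SharedScanDesc;

typedef struct WorkerBatch
{
    BlockNumber cur;                /* next block to scan in current batch */
    BlockNumber end;                /* one past the last block of the batch */
} WorkerBatch;

/* Claim a new batch of blocks; returns false once the scan is finished. */
static bool
claim_batch(SharedScanDesc *shared, WorkerBatch *batch)
{
    BlockNumber start = atomic_fetch_add(&shared->next_block,
                                         (BlockNumber) BATCH_PAGES);

    if (start >= shared->nblocks)
        return false;
    batch->cur = start;
    batch->end = start + BATCH_PAGES;
    if (batch->end > shared->nblocks)
        batch->end = shared->nblocks;
    return true;
}

/* Per-worker: next block to scan, touching shared state once per batch. */
static bool
worker_next_block(SharedScanDesc *shared, WorkerBatch *batch, BlockNumber *blk)
{
    if (batch->cur >= batch->end && !claim_batch(shared, batch))
        return false;
    *blk = batch->cur++;
    return true;
}

int
main(void)
{
    SharedScanDesc shared;
    WorkerBatch batch = {0, 0};
    BlockNumber blk;

    atomic_init(&shared.next_block, 0);
    shared.nblocks = 23;            /* pretend 23-block relation */

    /* One worker draining the scan; real workers would run concurrently. */
    while (worker_next_block(&shared, &batch, &blk))
        printf("scan block %u\n", blk);
    return 0;
}

The larger BATCH_PAGES is, the less often the shared counter is
touched, but the farther apart the chunks being read by different
workers become, which is exactly the random I/O concern above.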

I believe there are certain parallel scans (parallel bitmap heap scan?)
where the logic to find the next block itself consumes time, so the
waits become correspondingly longer.

What if we supply each worker with a sequence of blocks to be
scanned, rather than a range of blocks? Each worker would have a list
of its next few blocks, say:
w1 : 1, 5, 9, 13
w2 : 2, 6, 10, 14
w3 : 3, 7, 11, 15
w4 : .

Maybe the leader would do the accounting and store the instructions
for each of the workers at individual locations in shared memory, so
there won't be any contention while accessing them.

This may be simple and applicable for a sequential scan, but not for
other scans, for some of which it may not even be possible. But
basically I was thinking of a way to tackle both shared-memory
contention and random I/O.
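
Here is an equally simplified standalone sketch of that per-worker
block-sequence idea, using a plain stride so that each worker can
compute its next block without touching shared memory at all. Block
numbers are 0-based here, unlike the 1-based list above, and names
like WorkerScanState are made up for the illustration:

#include <stdio.h>

typedef unsigned int BlockNumber;

#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct WorkerScanState
{
    BlockNumber next_block;     /* next block this worker will scan */
    BlockNumber stride;         /* number of participating workers */
    BlockNumber nblocks;        /* total blocks in the relation */
} WorkerScanState;

/* Next block for this worker, InvalidBlockNumber when its sequence ends. */
static BlockNumber
stride_next_block(WorkerScanState *ws)
{
    BlockNumber blk;

    if (ws->next_block >= ws->nblocks)
        return InvalidBlockNumber;
    blk = ws->next_block;
    ws->next_block += ws->stride;
    return blk;
}

int
main(void)
{
    const BlockNumber nworkers = 4;
    const BlockNumber nblocks = 16;

    /* Each worker walks its own interleaved sequence, no shared counter. */
    for (BlockNumber w = 0; w < nworkers; w++)
    {
        WorkerScanState ws = {w, nworkers, nblocks};
        BlockNumber blk;

        printf("w%u :", w + 1);
        while ((blk = stride_next_block(&ws)) != InvalidBlockNumber)
            printf(" %u", blk);
        printf("\n");
    }
    return 0;
}

The leader could just as well precompute and store such sequences per
worker in shared memory, as described above. Either way, the point is
that fetching the next block no longer needs a shared lock, at the
cost of each worker's reads being interleaved rather than sequential.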

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-05-11 Thread Amit Khandekar
On 11 May 2017 at 17:23, Amit Kapila  wrote:
> On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar  
> wrote:
>> On 4 March 2017 at 12:49, Robert Haas  wrote:
>>> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar  
>>> wrote:
>>>> I think it does not make sense running after row triggers in case of
>>>> row-movement. There is no update happened on that leaf partition. This
>>>> reasoning can also apply to BR update triggers. But the reasons for
>>>> having a BR trigger and AR triggers are quite different. Generally, a
>>>> user needs to do some modifications to the row before getting the
>>>> final NEW row into the database, and hence [s]he defines a BR trigger
>>>> for that. And we can't just silently skip this step only because the
>>>> final row went into some other partition; in fact the row-movement
>>>> itself might depend on what the BR trigger did with the row. Whereas,
>>>> AR triggers are typically written for doing some other operation once
>>>> it is made sure the row is actually updated. In case of row-movement,
>>>> it is not actually updated.
>>>
>>> How about running the BR update triggers for the old partition and the
>>> AR update triggers for the new partition?  It seems weird to run BR
>>> update triggers but not AR update triggers.  Another option would be
>>> to run BR and AR delete triggers and then BR and AR insert triggers,
>>> emphasizing the choice to treat this update as a delete + insert, but
>>> (as Amit Kh. pointed out to me when we were in a room together this
>>> week) that precludes using the BEFORE trigger to modify the row.
>>>
>
> I also find the current behavior with respect to triggers quite odd.
> The two points that appears odd are (a) Executing both before row
> update and delete triggers on original partition sounds quite odd.
Note that a *before* trigger gets fired *before* the update happens.
The actual update may not even happen, depending upon what the trigger
does. And in our case the update indeed does not happen; more than
that, it is transformed into a delete + insert. So we should then also
fire the before-delete trigger.

> (b) It seems inconsistent to consider behavior for row and statement
> triggers differently

I am not sure whether we should compare row and statement triggers.
Statement triggers are anyway fired only once per statement, depending
upon whether it is an update, insert or delete. They have nothing to
do with how the rows are modified.


>
>>
>> I checked the trigger behaviour in case of UPSERT. Here, when there is
>> conflict found, ExecOnConflictUpdate() is called, and then the
>> function returns immediately, which means AR INSERT trigger will not
>> fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR
>> and AR UPDATE triggers will be fired. So in short, when an INSERT
>> becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE
>> and AR UPDATE also get fired. On the same lines, it makes sense in
>> case of UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on
>> the original table, and then the BR and AR DELETE/INSERT triggers on
>> the respective tables.
>>
>
> I am not sure if it is good idea to compare it with "Insert On
> Conflict Do Update", but  even if we want that way, I think Insert On
> Conflict is consistent in statement level triggers which means it will
> fire both Insert and Update statement level triggers (as per below
> note in docs) whereas the documentation in the patch indicates that
> this patch will only fire Update statement level triggers which is
> odd
>
> Note in docs about Insert On Conflict
> "Note that with an INSERT with an ON CONFLICT DO UPDATE clause, both
> INSERT and UPDATE statement level trigger will be fired.

I guess the reason this behaviour is kept for UPSERT is that the
statement itself suggests both: insert/update.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-05-11 Thread Amit Khandekar
On 11 May 2017 at 17:24, Amit Kapila  wrote:
> Few comments:
> 1.
> Operating directly on partition doesn't allow update to move row.
> Refer below example:
> create table t1(c1 int) partition by range(c1);
> create table t1_part_1 partition of t1 for values from (1) to (100);
> create table t1_part_2 partition of t1 for values from (100) to (200);
> insert into t1 values(generate_series(1,11));
> insert into t1 values(generate_series(110,120));
>
> postgres=# update t1_part_1 set c1=122 where c1=11;
> ERROR:  new row for relation "t1_part_1" violates partition constraint
> DETAIL:  Failing row contains (122).

Yes, as Robert said, this is expected behaviour. We move the row only
within the partition subtree that has the UPDATE's target table as its
root. In this case, that subtree is just the leaf partition itself.

>
> 3.
> +   longer satisfy the partition constraint of the containing partition. In 
> that
> +   case, if there is some other partition in the partition tree for which 
> this
> +   row satisfies its partition constraint, then the row is moved to that
> +   partition. If there isn't such a partition, an error will occur.
>
> Doesn't this error case indicate that this needs to be integrated with
> Default partition patch of Rahila or that patch needs to take care
> this error case?
> Basically, if there is no matching partition, then move it to default 
> partition.

Will have a look at this. Thanks for pointing this out.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-05-11 Thread Amit Khandekar
On 12 May 2017 at 08:30, Amit Kapila  wrote:
> On Thu, May 11, 2017 at 5:41 PM, Amit Khandekar  
> wrote:
>> On 11 May 2017 at 17:23, Amit Kapila  wrote:
>>> On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar  
>>> wrote:
>>>> On 4 March 2017 at 12:49, Robert Haas  wrote:
>>>>> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar  
>>>>> wrote:
>>>>>> I think it does not make sense running after row triggers in case of
>>>>>> row-movement. There is no update happened on that leaf partition. This
>>>>>> reasoning can also apply to BR update triggers. But the reasons for
>>>>>> having a BR trigger and AR triggers are quite different. Generally, a
>>>>>> user needs to do some modifications to the row before getting the
>>>>>> final NEW row into the database, and hence [s]he defines a BR trigger
>>>>>> for that. And we can't just silently skip this step only because the
>>>>>> final row went into some other partition; in fact the row-movement
>>>>>> itself might depend on what the BR trigger did with the row. Whereas,
>>>>>> AR triggers are typically written for doing some other operation once
>>>>>> it is made sure the row is actually updated. In case of row-movement,
>>>>>> it is not actually updated.
>>>>>
>>>>> How about running the BR update triggers for the old partition and the
>>>>> AR update triggers for the new partition?  It seems weird to run BR
>>>>> update triggers but not AR update triggers.  Another option would be
>>>>> to run BR and AR delete triggers and then BR and AR insert triggers,
>>>>> emphasizing the choice to treat this update as a delete + insert, but
>>>>> (as Amit Kh. pointed out to me when we were in a room together this
>>>>> week) that precludes using the BEFORE trigger to modify the row.
>>>>>
>>>
>>> I also find the current behavior with respect to triggers quite odd.
>>> The two points that appears odd are (a) Executing both before row
>>> update and delete triggers on original partition sounds quite odd.
>> Note that *before* trigger gets fired *before* the update happens. The
>> actual update may not even happen, depending upon what the trigger
>> does. And then in our case, the update does not happen; not just that,
>> it is transformed into delete-insert. So then we should fire
>> before-delete trigger.
>>
>
> Sure, I am aware of that point, but it doesn't seem obvious that both
> update and delete BR triggers get fired for original partition.  As a
> developer, it might be obvious to you that as you have used delete and
> insert interface, it is okay that corresponding BR/AR triggers get
> fired, however, it is not so obvious for others, rather it appears
> quite odd.

I agree that it seems a bit odd that we are firing both update and
delete triggers on the same partition. But if you look at it from the
perspective that the update => delete+insert is an operation the user
is aware of, it does not seem that odd.

> If we try to compare it with the non-partitioned update,
> there also it is internally a delete and insert operation, but we
> don't fire triggers for those.

For a non-partitioned table, the delete+insert is internal, whereas
for a partitioned table it is completely visible to the user.

>
>>> (b) It seems inconsistent to consider behavior for row and statement
>>> triggers differently
>>
>> I am not sure whether we should compare row and statement triggers.
>> Statement triggers are anyway fired only per-statement, depending upon
>> whether it is update or insert or delete. It has nothing to do with
>> how the rows are modified.
>>
>
> Okay.  The broader point I was trying to convey was that the way this
> patch defines the behavior of triggers doesn't sound good to me.  It
> appears to me that in this thread multiple people have raised points
> around trigger behavior and you should try to consider those.

I understand that there is no single solution which will provide
completely intuitive trigger behaviour. Skipping the BR delete trigger
should be fine. But then, for consistency, we should skip the BR insert
trigger as well, the theory being that the delete + insert is not
something the user issued, so we should not fire those triggers. But I
feel both should be fired, to avoid consequences that are unexpected
to the user who has installed those triggers.

The only specific concern of yours is that of firing *both* update as
well as insert triggers on the same table, right? My explanation for
this was: we ha

Re: [HACKERS] UPDATE of partition key

2017-05-11 Thread Amit Khandekar
On 12 May 2017 at 10:01, Amit Kapila  wrote:
> On Fri, May 12, 2017 at 9:27 AM, Amit Kapila  wrote:
>> On Thu, May 11, 2017 at 5:45 PM, Amit Khandekar  
>> wrote:
>>> On 11 May 2017 at 17:24, Amit Kapila  wrote:
>>>> Few comments:
>>>> 1.
>>>> Operating directly on partition doesn't allow update to move row.
>>>> Refer below example:
>>>> create table t1(c1 int) partition by range(c1);
>>>> create table t1_part_1 partition of t1 for values from (1) to (100);
>>>> create table t1_part_2 partition of t1 for values from (100) to (200);
>>>> insert into t1 values(generate_series(1,11));
>>>> insert into t1 values(generate_series(110,120));
>>>>
>>>> postgres=# update t1_part_1 set c1=122 where c1=11;
>>>> ERROR:  new row for relation "t1_part_1" violates partition constraint
>>>> DETAIL:  Failing row contains (122).
>>>
>>> Yes, as Robert said, this is expected behaviour. We move the row only
>>> within the partition subtree that has the update table as its root. In
>>> this case, it's the leaf partition.
>>>
>>
>> Okay, but what is the technical reason behind it?  Is it because the
>> current design doesn't support it or is it because of something very
>> fundamental to partitions?
No, we could do that if we decided to allow moving the row to some
table outside the partition subtree. The reason is more one of
semantics. I think the user who is running an UPDATE on a partitioned
table should not necessarily be aware of the structure of the complete
partition tree outside of the current subtree. It is always safer to
return an error instead of silently moving the data outside of the
subtree.

>>
>
> One plausible theory is that as SELECTs on partitions just return
> the rows of that partition, the update should also behave in the same way.

Yes, right. Similarly, inserts fail if we try to insert data that does
not fit into the current subtree.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-05-11 Thread Amit Khandekar
>> 3.
>> +   longer satisfy the partition constraint of the containing partition. In 
>> that
>> +   case, if there is some other partition in the partition tree for which 
>> this
>> +   row satisfies its partition constraint, then the row is moved to that
>> +   partition. If there isn't such a partition, an error will occur.
>>
>> Doesn't this error case indicate that this needs to be integrated with
>> Default partition patch of Rahila or that patch needs to take care
>> this error case?
>> Basically, if there is no matching partition, then move it to default 
>> partition.
>
> Will have a look on this. Thanks for pointing this out.

I tried update row movement with both my patch and the
default-partition patch applied, and it looks like it works as
expected:

1. When an update changes the partition key such that the row does
not fit into any of the non-default partitions, the row is moved to
the default partition.
2. If the row does fit into a non-default partition, the row moves
into that partition.
3. If a row from the default partition is updated such that it fits
into one of the non-default partitions, it moves into that partition.
We can debate whether the row should stay in the default partition or
move; I think it should be moved, since the row now has a suitable
partition.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-05-12 Thread Amit Khandekar
On 12 May 2017 at 14:56, Amit Kapila  wrote:
> I think it might be better to summarize all the options discussed
> including what the patch has and see what most people consider as
> sensible.

Yes, makes sense. Here are the options that were discussed so far for
ROW triggers :

Option 1 : (the patch follows this option)
--
BR Update trigger for source partition.
BR,AR Delete trigger for source partition.
BR,AR Insert trigger for destination partition.
No AR Update trigger.

Rationale :

The BR update trigger should be fired because that trigger can modify
the row, and that can itself result in a partition-key update even
though the UPDATE statement is not updating the partition key.

Also, fire the delete/insert triggers on the respective partitions,
since rows are actually being deleted from and inserted into them. The
AR update trigger should not be fired, because that requires an actual
update to have happened.



Option 2
--
BR Update trigger for source partition.
AR Update trigger on destination partition.
No insert/delete triggers.

Rationale :

Since it's an UPDATE statement, only update triggers should be fired.
The update ends up moving the row into another partition, so the AR
update trigger should be fired on the destination partition rather
than on the original partition.

Option 3
--
BR, AR delete triggers on source partition
BR, AR insert triggers on destination partition.

Rationale :
Since the update is converted to delete+insert, just skip the update
triggers completely.



Option 4
--
BR-AR update triggers for source partition
BR-AR insert triggers for destination partition

Rationale :
Since it is an UPDATE statement, both the BR and AR UPDATE triggers
should be fired on the original partition.
Since the update is converted to delete+insert, the corresponding
insert triggers should be fired on the destination; but since we are
already firing UPDATE triggers on the original partition, skip the
delete triggers there, otherwise both UPDATE and DELETE triggers would
get fired on the same partition.




For statement triggers, I think it should be based on the
documentation recently checked in for partitions in general.

+A statement that targets a parent table in an inheritance or partitioning
+hierarchy does not cause the statement-level triggers of affected child
+tables to be fired; only the parent table's statement-level triggers are
+fired.  However, row-level triggers of any affected child tables will be
+fired.

Based on that, for row movement as well, the statement trigger should
be fired only for the table referred to in the UPDATE statement, and
not for any child tables or for any partitions to which the rows were
moved. The doc change in this row-movement patch also matches this
behaviour.




Re: [HACKERS] UPDATE of partition key

2017-05-17 Thread Amit Khandekar
On 17 May 2017 at 17:29, Rushabh Lathia  wrote:
>
>
> On Wed, May 17, 2017 at 12:06 PM, Dilip Kumar  wrote:
>>
>> On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar 
>> wrote:
>> > Option 3
>> > 
>> >
>> > BR, AR delete triggers on source partition
>> > BR, AR insert triggers on destination partition.
>> >
>> > Rationale :
>> > Since the update is converted to delete+insert, just skip the update
>> > triggers completely.
>>
>> +1 to option3
>> Generally, BR triggers are used for updating the ROW value and AR
>> triggers to VALIDATE the row or to modify some other tables.  So it
>> seems that we can fire the triggers based on what actual operation is
>> happening at the partition level.
>>
>> For source partition, it's only the delete operation (no update
>> happened) so we fire delete triggers and for the destination only
>> insert operations so fire only inserts triggers.  That will keep the
>> things simple.  And, it will also be in sync with the actual partition
>> level delete/insert operations.
>>
>> We may argue that user might have declared only update triggers and as
>> he has executed the update operation he may expect those triggers to
>> get fired.  But, I think this behaviour can be documented with the
>> proper logic that if the user is updating the partition key then he
>> must be ready with the Delete/Insert triggers also, he can not rely
>> only upon update level triggers.
>>
>
> Right, that is my concern as well. The user might have declared only
> update triggers, and when the user executes an UPDATE he expects them
> to get called - but with option 3 that's not happening.

Yes, that's the issue with option 3. A user wants to make sure the
update triggers run, and here we are skipping the BEFORE update
triggers - triggers in which the user might even be modifying the rows.

Now regarding the AR update triggers: the user might be more
concerned with the non-partition-key columns, and an UPDATE of the
partition key would typically update only the partition key and not
the other columns. So for the typical case, it makes sense to skip the
AR UPDATE trigger. But if the UPDATE contains both partition-key and
other column updates, it makes sense to fire the AR UPDATE trigger.
One thing we could do is restrict an UPDATE from having both
partition-key and non-partition-key column updates. That way we could
always skip the AR update triggers for row-movement updates, unless
perhaps we fire AR UPDATE triggers *only* when they are created using
the "UPDATE OF <column>" clause and the column is the partition key.

Between skipping the delete/insert triggers and skipping the update
triggers, I would go for skipping the delete/insert triggers. I think
we cannot skip the BR update triggers, because that would be a
correctness issue.

From the user's perspective, I think the user would like to install a
trigger that fires if any of the child tables get modified. But
because there is no provision to install a common trigger, the user
has to install the same trigger on every child table. In that sense,
it might not matter whether we fire the AR UPDATE trigger on the old
partition or the new partition.

>
> In terms of consistency, option 1 looks better. It does the same as
> what has been implemented for UPSERT - so the user might already be
> aware of that trigger behaviour. Plus, if we document the behaviour
> then it sounds correct -
>
> - Original command was UPDATE, so BR update
> - Later found that it's a ROW movement - so BR delete followed by AR delete
> - Then Insert in new partition - so BR INSERT followed by AR Insert.
>
> But again I am not quite sure how good it will be to compare the partition
> behaviour with the UPSERT.
>
>
>
>>
>> Earlier I thought that option1 is better but later I think that this
>> can complicate the situation as we are firing first BR update then BR
>> delete and can change the row multiple time and defining such
>> behaviour can be complicated.
>>
>> --
>> Regards,
>> Dilip Kumar
>> EnterpriseDB: http://www.enterprisedb.com
>>
>>
>
>
>
>
> --
> Rushabh Lathia



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-05-24 Thread Amit Khandekar
On 18 May 2017 at 16:52, Amit Kapila  wrote:
> On Wed, May 17, 2017 at 4:05 PM, Dilip Kumar  wrote:
>> On Wed, May 17, 2017 at 3:15 PM, Amit Kapila  wrote:
>>>> Earlier I thought that option1 is better but later I think that this
>>>> can complicate the situation as we are firing first BR update then BR
>>>> delete and can change the row multiple time and defining such
>>>> behaviour can be complicated.
>>>>
>>>
>>> If we have to go by this theory, then the option you have preferred
>>> will still execute BR triggers for both delete and insert, so input
>>> row can still be changed twice.
>>
>> Yeah, right as per my theory above Option3 have the same problem.
>>
>> But after putting some more thought I realised that only for "Before
>> Update" or the "Before Insert" trigger row can be changed.
>>
>
> Before Row Delete triggers can suppress the delete operation itself
> which is kind of unintended in this case.  I think without the user
> being aware it doesn't seem advisable to execute multiple BR triggers.

By now, the majority of opinions are against two different kinds of
triggers getting fired for a single update. Amit, do you consider
option 2 a valid option? That is, fire only UPDATE triggers: BR on the
source partition, and AR on the destination partition. Do you agree
that firing the BR update trigger is essential, since it can modify
the row and even prevent the update from happening?

Also, since a user has no provision to install a common UPDATE row
trigger, (s)he installs it on each of the leaf partitions. And then,
when an update causes row movement, option 3 would end up firing
update triggers on none of the partitions. So I prefer option 2 over
option 3, i.e. make sure to fire the BR and AR update triggers.
Actually, option 2 is what Robert had proposed in the beginning.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




  1   2   >