Re: [HACKERS] Parallel Append implementation

2017-11-12 Thread Amit Khandekar
Thanks a lot, Robert, for the patch. I will have a look. I quickly tried
to test some aggregate queries with a partitioned pgbench_accounts
table, and it is crashing. I will get back with the fix and any other
review comments.

Thanks
-Amit Khandekar

On 9 November 2017 at 23:44, Robert Haas <robertmh...@gmail.com> wrote:
> On Sat, Oct 28, 2017 at 5:50 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> No, because the Append node is *NOT* getting copied into shared
>> memory.  I have pushed a comment update to the existing functions; you
>> can use the same comment for this patch.
>
> I spent the last several days working on this patch, which had a
> number of problems both cosmetic and functional.  I think the attached
> is in better shape now, but it could certainly use some more review
> and testing since I only just finished modifying it, and I modified it
> pretty heavily.  Changes:
>
> - I fixed the "morerows" entries in the documentation.  If you had
> built the documentation the way you had it and loaded up in a web
> browser, you would have seen that the way you had it was not correct.
>
> - I moved T_AppendState to a different position in the switch inside
> ExecParallelReInitializeDSM, so as to keep that switch in the same
> order as all of the other switch statements in that file.
>
> - I rewrote the comment for pa_finished.  It previously began with
> "workers currently executing the subplan", which is not an accurate
> description. I suspect this was a holdover from a previous version of
> the patch in which this was an array of integers rather than an array
> of type bool.  I also fixed the comment in ExecAppendEstimate and
> added, removed, or rewrote various other comments as well.
>
> - I renamed PA_INVALID_PLAN to INVALID_SUBPLAN_INDEX, which I think is
> more clear and allows for the possibility that this sentinel value
> might someday be used for non-parallel-aware Append plans.
>
> - I largely rewrote the code for picking the next subplan.  A
> superficial problem with the way that you had it is that you had
> renamed exec_append_initialize_next to exec_append_seq_next but not
> updated the header comment to match.  Also, the logic was spread out
> all over the file.  There are three cases: not parallel aware, leader,
> worker.  You had the code for the first case at the top of the file
> and the other two cases at the bottom of the file and used multiple
> "if" statements to pick the right one in each case.  I replaced all
> that with a function pointer stored in the AppendState, moved the code
> so it's all together, and rewrote it in a way that I find easier to
> understand.  I also changed the naming convention.
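
Roughly, that shape looks like the following (an illustrative sketch with
assumed names, not the exact patch code):

/* one strategy per case, selected once at initialization time */
typedef bool (*choose_next_subplan_fn) (AppendState *node);

/* in ExecInitAppend() / ExecAppendInitializeWorker(), sketch only: */
if (!node->ps.plan->parallel_aware)
    node->choose_next_subplan = choose_next_subplan_locally;
else if (node->is_parallel_worker)
    node->choose_next_subplan = choose_next_subplan_for_worker;
else
    node->choose_next_subplan = choose_next_subplan_for_leader;

/*
 * ExecAppend() then just calls node->choose_next_subplan(node) whenever it
 * needs to move on, with no leader/worker "if" tests of its own.
 */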
>
> - I renamed pappend_len to pstate_len and ParallelAppendDescData to
> ParallelAppendState.  I think the use of the word "descriptor" is a
> carryover from the concept of a scan descriptor.  There's nothing
> really wrong with inventing the concept of an "append descriptor", but
> it seems more clear to just refer to shared state.
>
> - I fixed ExecAppendReInitializeDSM not to reset node->as_whichplan.
> Per commit 41b0dd987d44089dc48e9c70024277e253b396b7, that's wrong;
> instead, local state should be reset in ExecReScanAppend.  I installed
> what I believe to be the correct logic in that function instead.
>
> - I fixed list_qsort() so that it copies the type of the old list into
> the new list.  Otherwise, sorting a list of type T_IntList or
> T_OidList would turn it into just plain T_List, which is wrong.
>
> - I removed get_append_num_workers and integrated the logic into the
> callers.  This function was coded quite strangely: it assigned the
> return value of fls() to a double and then eventually rounded the
> result back to an integer.  But fls() returns an integer, so this
> doesn't make much sense.  On a related note, I made it use fls(# of
> subpaths) instead of fls(# of subpaths)+1.  Adding 1 doesn't make
> sense to me here because it leads to a decision to use 2 workers for a
> single, non-partial subpath.  I suspect both of these mistakes stem
> from thinking that fls() returns the base-2 logarithm, but in fact it
> doesn't, quite: log2(1) = 0.0 but fls(1) = 1.
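
As an aside, the off-by-one is easy to see with a tiny standalone program
(nothing to do with the patch; my_fls() is just a portable stand-in for
fls()):

#include <math.h>
#include <stdio.h>

static int
my_fls(int mask)
{
    int     bit = 0;

    while (mask != 0)
    {
        mask >>= 1;
        bit++;              /* counts the position of the highest set bit, 1-based */
    }
    return bit;
}

int
main(void)
{
    for (int n = 1; n <= 8; n++)
        printf("n=%d  fls=%d  log2=%.2f\n", n, my_fls(n), log2((double) n));
    return 0;
}

This prints fls(1) = 1 against log2(1) = 0.00, fls(2) = 2 against 1.00, and
so on; that is, fls(n) is floor(log2(n)) + 1, not the logarithm itself.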
>
> - In the process of making the changes described in the previous
> point, I added a couple of assertions, one of which promptly failed.
> It turns out the reason is that your patch didn't update
> accumulate_append_subpaths(), which can result in flattening
> non-partial paths from a Parallel Append into a parent Append's list
> of partial paths, which is bad.  The easiest way to fix that would be
> to just teach accumulate_append_subpaths() not to flatten a Parallel
> Append into a parent Append or

Re: [HACKERS] UPDATE of partition key

2017-11-09 Thread Amit Khandekar
On 9 November 2017 at 09:27, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> On Wed, Nov 8, 2017 at 5:57 PM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>> On 8 November 2017 at 07:55, Thomas Munro <thomas.mu...@enterprisedb.com> 
>> wrote:
>>> On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas <robertmh...@gmail.com> wrote:
>>>> The changes to trigger.c still make me super-nervous.  Hey THOMAS
>>>> MUNRO, any chance you could review that part?
>
> At first, it seemed quite strange to me that row triggers and
> statement triggers fire different events for the same modification.
> Row triggers see DELETE +  INSERT (necessarily because different
> tables are involved), but this fact is hidden from the target table's
> statement triggers.
>
> The alternative would be for all triggers to see consistent events and
> transitions.  Instead of having your special case code in ExecInsert
> and ExecDelete that creates the two halves of a 'synthetic' UPDATE for
> the transition tables, you'd just let the existing ExecInsert and
> ExecDelete code do its thing, and you'd need a flag to record that you
> should also fire INSERT/DELETE after statement triggers if any rows
> moved.

Yeah, I had also thought about that, but felt that change was too
invasive; for example, letting ExecARInsertTriggers() do the transition
capture even when transition_capture->tcs_update_new_table is set.

I was also thinking of having a separate function to *only* add the
transition table rows, and in ExecInsert, calling that one instead of
ExecARUpdateTriggers(). But I realized that the existing
ExecARUpdateTriggers() looks like a better, more robust interface with all
its checks. It's just that calling ExecARUpdateTriggers() sounds like we
are also firing a trigger; we are not firing any trigger or saving any
event, we are just adding the transition row.

>
> After sleeping on this question, I am coming around to the view that
> the way you have it is right.  The distinction isn't really between
> row triggers and statement triggers, it's between triggers at
> different levels in the hierarchy.  It just so happens that we
> currently only fire target table statement triggers and leaf table row
> triggers.

Yes. And the rows are present only in the leaf partitions, so we have to
make it appear as though the target table has these rows. Like you
mentioned, the user has to get the impression of a normal table, so we
have to do something extra to capture the rows.

> Future development ideas that seem consistent with your choice:
>
> 1.  If we ever allow row triggers with transition tables on child
> tables, then I think *their* transition tables should certainly see
> the deletes and inserts, otherwise OLD TABLE and NEW TABLE would be
> inconsistent with the OLD and NEW variables in a single trigger
> invocation.  (These were prohibited mainly due to lack of time and
> (AFAIK) limited usefulness; I think they would probably need
> their own separate tuplestores, or possibly some kind of filtering.)

As we know, for row triggers on leaf partitions, we treat them as
normal tables, so a trigger written on a leaf partition sees only the
local changes. The trigger is unaware whether the insert is part of an
UPDATE row movement. Similarly, the transition table referenced by
that row trigger function should see only the NEW table, not the old
table.

>
> 2.  If we ever allow row triggers on partitioned tables (ie that fire
> when its children are modified), then I think their UPDATE trigger
> should probably fire when a row moves between any two (grand-)*child
> tables, just as you have it for target table statement triggers.

Yes I agree.

> It doesn't matter that the view from parent tables' triggers is
> inconsistent with the view from leaf table triggers: it's a feature
> that we 'hide' partitioning from the user to the extent we can so that
> you can treat the partitioned table just like a table.
>
> Any other views?

I think that because there is no provision for a row trigger on a
partitioned table, users who want to have a common trigger on a
partition subtree have no choice but to create the same trigger
individually on the leaf partitions. And that's the reason we cannot
handle an update row movement with triggers without anomalies.

Thanks
-Amit Khandekar




Re: [HACKERS] UPDATE of partition key

2017-11-07 Thread Amit Khandekar
On 8 November 2017 at 07:55, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> The changes to trigger.c still make me super-nervous.  Hey THOMAS
>> MUNRO, any chance you could review that part?
>
> Looking, but here's one silly thing that jumped out at me while
> getting started with this patch.  I cannot seem to convince my macOS
> system to agree with the expected sort order from :show_data, where
> underscores precede numbers:
>
>   part_a_10_a_20 | a | 10 | 200 |  1 |
>   part_a_1_a_10  | a |  1 |   1 |  1 |
> - part_d_1_15| b | 15 | 146 |  1 |
> - part_d_1_15| b | 16 | 147 |  2 |
>   part_d_15_20   | b | 17 | 155 | 16 |
>   part_d_15_20   | b | 19 | 155 | 19 |
> + part_d_1_15| b | 15 | 146 |  1 |
> + part_d_1_15| b | 16 | 147 |  2 |
>
> It seems that macOS (like older BSDs) just doesn't know how to sort
> Unicode and falls back to sorting the bits.  I expect that means that
> the test will also fail on any other OS with "make check
> LC_COLLATE=C".  I believe our regression tests are supposed to pass
> with a wide range of collations including C, so I wonder if this means
> we should stick a leading zero on those single digit numbers, or
> something, to stabilise the output.

I would prefer to retain the partition names. I have now added
COLLATE "C" for partname, like this:

-\set show_data 'select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'

Thomas, can you please try the attached incremental patch
regress_locale_changes.patch and check whether the test passes? The patch
is to be applied on top of the main v22 patch. If the test passes, I will
include these changes (also for list_parted) in the upcoming v23
patch.

Thanks
-Amit Khandekar


regress_locale_changes.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-11-06 Thread Amit Khandekar
On 7 November 2017 at 00:33, Robert Haas <robertmh...@gmail.com> wrote:

> Also, +1 for Amit Langote's idea of trying to merge
> mt_perleaf_childparent_maps with mt_persubplan_childparent_maps.

Currently I am trying to see whether it simplifies things if we do that. We
will be merging these arrays into one, but we are adding a new int[]
array that maps subplans to leaf partitions. I will get back with how it
finally looks.
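
Very roughly, the shape being considered is something like this (the field
names here are made up for illustration and are not the final layout):

/* merged array: one tuple-conversion map per leaf partition */
TupleConversionMap **mt_childparent_tupconv_maps;

/*
 * new int[] array: for subplan i, the index of its leaf partition in the
 * array above, so the per-subplan code can reuse the per-leaf maps
 */
int        *mt_subplan_partition_idx;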

Robert, Amit, I will get back on your other review comments.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-10-19 Thread Amit Khandekar
On 13 October 2017 at 00:29, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Oct 11, 2017 at 8:51 AM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> [ new patch ]
>
> + parallel_append
> + Waiting to choose the next subplan during Parallel Append 
> plan
> + execution.
> +
> +
>
> Probably needs to update a morerows values of some earlier entry.

From what I observed in the other places, the morerows value is one
less than the number of entries that follow. I have changed it to 10
since there are 11 entries.

>
> +   enable_parallelappend configuration
> parameter
>
> How about enable_parallel_append?

Done.

>
> + * pa_finished : workers currently executing the subplan. A worker which
>
> The way the colon is used here is not a standard comment style for PostgreSQL.

Changed it to "pa_finished:".

>
> + * Go on to the "next" subplan. If no more subplans, return the empty
> + * slot set up for us by ExecInitAppend.
> + * Note: Parallel-aware Append follows different logic for choosing 
> the
> + * next subplan.
>
> Formatting looks wrong, and moreover I don't think this is the right
> way of handling this comment anyway.  Move the existing comment inside
> the if (!node->padesc) block and leave it unchanged; the else block
> explains the differences for parallel append.

I think the first couple of lines do apply to both parallel-append and
sequential append plans. I have moved the remaining couple of lines
inside the else block.

>
> + *ExecAppendEstimate
> + *
> + *estimates the space required to serialize Append node.
>
> Ugh, this is wrong, but I notice it follows various other
> equally-wrong comments for other parallel-aware node types. I guess
> I'll go fix that.  We are not in serializing the Append node.

I didn't clearly get this. Do you think it should be "space required to
copy the Append node into the shared memory"?

>
> I do not think that it's a good idea to call
> exec_append_parallel_next() from ExecAppendInitializeDSM,
> ExecAppendReInitializeDSM, and ExecAppendInitializeWorker.  We want to
> postpone selecting which plan to run until we're actually ready to run
> that plan.  Otherwise, for example, the leader might seize a
> non-partial plan (if only such plans are included in the Parallel
> Append) when it isn't really necessary for it to do so.  If the
> workers would've reached the plans and started returning tuples to the
> leader before it grabbed a plan, oh well, too bad.  The leader's still
> claimed that plan and must now run it.
>
> I concede that's not a high-probability scenario, but I still maintain
> that it is better for processes not to claim a subplan until the last
> possible moment.  I think we need to initialize as_whichplan as
> PA_INVALID_PLAN and then fix it when ExecProcNode() is called for the
> first time.

Done. Set as_whichplan to PA_INVALID_PLAN in
ExecAppendInitializeDSM(), ExecAppendReInitializeDSM() and
ExecAppendInitializeWorker(). Then, when ExecAppend() is called for the
first time, we notice that as_whichplan is PA_INVALID_PLAN, which means
we need to choose the plan.
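
In other words, the first call now does something along these lines (a
sketch using the names mentioned in this thread, not the exact patch hunk):

/* in ExecAppend(), before trying to fetch any tuple */
if (node->as_whichplan == PA_INVALID_PLAN)
{
    /* nobody has picked a subplan for this participant yet; do it now */
    if (!exec_append_parallel_next(node))
        return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}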

>
> +if (!IsParallelWorker())
>
> This is not a great test, because it would do the wrong thing if we
> ever allowed an SQL function called from a parallel worker to run a
> parallel query of its own.  Currently that's not allowed but we might
> want to allow it someday.  What we really want to test is whether
> we're the leader for *this* query.  Maybe use a flag in the
> AppendState for that, and set it correctly in
> ExecAppendInitializeWorker.

Done. Set a new AppendState->is_parallel_worker field to true in
ExecAppendInitializeWorker().

>
> I think maybe the loop in exec_append_parallel_next should look more like 
> this:
>
> /* Pick the next plan. */
> state->as_whichplan = padesc->pa_nextplan;
> if (state->as_whichplan != PA_INVALID_PLAN)
> {
> int nextplan = state->as_whichplan;
>
> /* Mark non-partial plans done immediately so that they can't be
> picked again. */
> if (nextplan < first_partial_plan)
> padesc->pa_finished[nextplan] = true;
>
> /* Figure out what plan the next worker should pick. */
> do
> {
> /* If we've run through all the plans, loop back through
> partial plans only. */
> if (++nextplan >= state->as_nplans)
> nextplan = first_partial_plan;
>
> /* No plans remaining or tried them all?  Then give up. */
> if (nextplan == state->as_whichplan || nextplan >= state->as_nplans)
> {
> nextplan = PA_I

Re: [HACKERS] pgsql: Avoid coercing a whole-row variable that is already coerced

2017-10-13 Thread Amit Khandekar
Bringing over here the mail thread from pgsql-committers regarding this commit:

commit 1c497fa72df7593d8976653538da3d0ab033207f
Author: Robert Haas <rh...@postgresql.org>
Date:   Thu Oct 12 17:10:48 2017 -0400

Avoid coercing a whole-row variable that is already coerced.

Marginal efficiency and beautification hack.  I'm not sure whether
this case ever arises currently, but the pending patch for update
tuple routing will cause it to arise.

    Amit Khandekar

Discussion:
http://postgr.es/m/caj3gd9cazfppe7-wwubabpcq4_0subkipfd1+0r5_dkvnwo...@mail.gmail.com


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <rhaas(at)postgresql(dot)org> wrote:
> > Avoid coercing a whole-row variable that is already coerced.
>
> This logic seems very strange and more than likely buggy: the
> added coerced_var flag feels like the sort of action at a distance
> that is likely to have unforeseen side effects.
>
> I'm not entirely sure what the issue is here, but if you're concerned
> about not applying two ConvertRowtypeExprs in a row, why not have the
> upper one just strip off the lower one?  We handle, eg, nested
> RelabelTypes that way.

We kind of do a similar thing. When a ConvertRowtypeExpr node is
encountered, we create a new ConvertRowtypeExpr that points to a new
var, and return this new ConvertRowtypeExpr instead of the existing
one. So we actually replace the old with a new one. But additionally,
we also want to change the vartype of the new var to
context->to_rowtype.

I think you are worried specifically about coerced_var causing an
unexpected regression in existing scenarios, such as
context->coerced_var getting set and prematurely unset in recursive
scenarios. But note that when we call map_variable_attnos_mutator()
just after setting context->coerced_var = true,
map_variable_attnos_mutator() won't recurse further, because it is
always called with a Var, which does not have any further arguments to
process. So coerced_var won't be changed again until we return from
map_variable_attnos_mutator().

The only reason why we chose to call map_variable_attnos_mutator()
with a Var is so that we can re-use the code that converts the whole
row var.

One thing we can do is: instead of calling
map_variable_attnos_mutator(), convert the var inside the if block for
"if (IsA(node, ConvertRowtypeExpr))". Please check the attached patch.
There, I have avoided the coerced_var context variable.
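
For readers following along, the idea looks roughly like this inside
map_variable_attnos_mutator() (only an illustration of the approach, not
the attached patch; context->to_rowtype is the field discussed above, the
rest of the names are standard node fields):

if (IsA(node, ConvertRowtypeExpr))
{
    ConvertRowtypeExpr *cre = (ConvertRowtypeExpr *) node;

    if (IsA(cre->arg, Var) &&
        ((Var *) cre->arg)->varattno == 0)      /* whole-row Var */
    {
        /* rebuild the conversion in one step instead of stacking a new one */
        ConvertRowtypeExpr *newcre = makeNode(ConvertRowtypeExpr);
        Var        *newvar = copyObject((Var *) cre->arg);

        newvar->vartype = context->to_rowtype;  /* new child rowtype */
        newcre->arg = (Expr *) newvar;
        newcre->resulttype = cre->resulttype;
        newcre->convertformat = cre->convertformat;
        newcre->location = cre->location;
        return (Node *) newcre;
    }
}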


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


handle-redundant-ConvertRowtypeExpr-node.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-10-11 Thread Amit Khandekar
On 9 October 2017 at 16:03, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Fri, Oct 6, 2017 at 12:03 PM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> On 6 October 2017 at 08:49, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>>
>>> Okay, but why not cheapest partial path?
>>
>> I gave some thought on this point. Overall I feel it does not matter
>> which partial path it should pick up. RIght now the partial paths are
>> not ordered. But for non-partial paths sake, we are just choosing the
>> very last path. So in case of mixed paths, leader will get a partial
>> path, but that partial path would not be the cheapest path. But if we
>> also order the partial paths, the same logic would then pick up
>> cheapest partial path. The question is, should we also order the
>> partial paths for the leader ?
>>
>> The only scenario I see where leader choosing cheapest partial path
>> *might* show some benefit, is if there are some partial paths that
>> need to do some startup work using only one worker. I think currently,
>> parallel hash join is one case where it builds the hash table, but I
>> guess here also, we support parallel hash build, but not sure about
>> the status.
>>
>
> You also need to consider how merge join is currently work in parallel
> (each worker need to perform the whole of work of right side).

Yes, here if the leader happens to take the right side, it may slow
down the overall merge join. But this seems to be a different case
than the case of high startup costs.

>  I think there could be more scenario's where the startup cost is high
> and parallel worker needs to do that work independently.

True.

>
>  For such plan, if leader starts it, it would be slow, and
>> no other worker would be able to help it, so its actual startup cost
>> would be drastically increased. (Another path is parallel bitmap heap
>> scan where the leader has to do something and the other workers wait.
>> But here, I think it's not much work for the leader to do). So
>> overall, to handle such cases, it's better for leader to choose a
>> cheapest path, or may be, a path with cheapest startup cost. We can
>> also consider sorting partial paths with decreasing startup cost.
>>
>
> Yeah, that sounds reasonable.

Attached patch sorts partial paths by descending startup cost.
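
The ordering itself is just a standard descending sort on startup cost; as a
standalone illustration (not the patch's list_qsort() code):

#include <stdio.h>
#include <stdlib.h>

/* toy stand-in for a partial path: only the costs matter here */
typedef struct ToyPath
{
    const char *name;
    double      startup_cost;
    double      total_cost;
} ToyPath;

/* order by startup cost, highest first */
static int
cmp_startup_desc(const void *a, const void *b)
{
    const ToyPath *pa = (const ToyPath *) a;
    const ToyPath *pb = (const ToyPath *) b;

    if (pa->startup_cost < pb->startup_cost)
        return 1;
    if (pa->startup_cost > pb->startup_cost)
        return -1;
    return 0;
}

int
main(void)
{
    ToyPath     paths[] = {
        {"p1", 0.0, 5000.0},
        {"p2", 120.0, 200.0},
        {"p3", 40.0, 80.0},
        {"p4", 900.0, 3000.0},
    };

    qsort(paths, 4, sizeof(ToyPath), cmp_startup_desc);
    for (int i = 0; i < 4; i++)
        printf("%s  startup=%.0f  total=%.0f\n",
               paths[i].name, paths[i].startup_cost, paths[i].total_cost);
    return 0;
}

The intent, as discussed up-thread, is that workers pick from the
expensive-startup end of the list while the leader ends up with a
cheap-startup subpath.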


On 6 October 2017 at 08:49, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Oct 5, 2017 at 4:11 PM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>>
>> Ok. How about removing pa_all_partial_subpaths altogether , and
>> instead of the below condition :
>>
>> /*
>> * If all the child rels have partial paths, and if the above Parallel
>> * Append path has a mix of partial and non-partial subpaths, then consider
>> * another Parallel Append path which will have *all* partial subpaths.
>> * If enable_parallelappend is off, make this one non-parallel-aware.
>> */
>> if (partial_subpaths_valid && !pa_all_partial_subpaths)
>> ..
>>
>> Use this condition :
>> if (partial_subpaths_valid && pa_nonpartial_subpaths != NIL)
>> ..
>>
>
> Sounds good to me.

Did this. Here is the new condition I used, along with the comments
explaining it:

+* If parallel append has not been added above, or the added one has a mix
+* of partial and non-partial subpaths, then consider another Parallel
+* Append path which will have *all* partial subpaths. We can add such a
+* path only if all childrels have partial paths in the first place. This
+* new path will be parallel-aware unless enable_parallelappend is off.
 */
-   if (partial_subpaths_valid && !pa_all_partial_subpaths)
+   if (partial_subpaths_valid &&
+   (!pa_subpaths_valid || pa_nonpartial_subpaths != NIL))

Also added some test scenarios.

On 6 October 2017 at 12:03, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> On 6 October 2017 at 08:49, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>
>> One minor point:
>>
>> + if (!node->as_padesc)
>> + {
>> + /*
>> + */
>> + if (!exec_append_seq_next(node))
>> + return ExecClearTuple(node->ps.ps_ResultTupleSlot);
>> + }
>>
>> It seems either you want to add a comment in above part of patch or
>> you just left /**/ mistakenly.
>
> Oops. Yeah, the comment wrapper remained there when I moved its
> content "This is Parallel-aware append. Follow it's own logic ..." out
> of the if block.

Removed the comment wrapper.


Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v17.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-10-06 Thread Amit Khandekar
On 6 October 2017 at 08:49, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Oct 5, 2017 at 4:11 PM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>>
>> Ok. How about removing pa_all_partial_subpaths altogether , and
>> instead of the below condition :
>>
>> /*
>> * If all the child rels have partial paths, and if the above Parallel
>> * Append path has a mix of partial and non-partial subpaths, then consider
>> * another Parallel Append path which will have *all* partial subpaths.
>> * If enable_parallelappend is off, make this one non-parallel-aware.
>> */
>> if (partial_subpaths_valid && !pa_all_partial_subpaths)
>> ..
>>
>> Use this condition :
>> if (partial_subpaths_valid && pa_nonpartial_subpaths != NIL)
>> ..
>>
>
> Sounds good to me.
>
> One minor point:
>
> + if (!node->as_padesc)
> + {
> + /*
> + */
> + if (!exec_append_seq_next(node))
> + return ExecClearTuple(node->ps.ps_ResultTupleSlot);
> + }
>
> It seems either you want to add a comment in above part of patch or
> you just left /**/ mistakenly.

Oops. Yeah, the comment wrapper remained there when I moved its
content "This is Parallel-aware append. Follow it's own logic ..." out
of the if block. Since this is too small a change for an updated
patch, I will do this along with any other changes that would be
required as the review progresses.

>
>> 
>>
>>
>> Regarding a mix of partial and non-partial paths, I feel it always
>> makes sense for the leader to choose the partial path.
>>
>
> Okay, but why not cheapest partial path?

I gave some thought to this point. Overall I feel it does not matter
which partial path it should pick up. Right now the partial paths are
not ordered. But for the sake of non-partial paths, we are just choosing the
very last path. So in the case of mixed paths, the leader will get a partial
path, but that partial path would not be the cheapest path. But if we
also order the partial paths, the same logic would then pick up the
cheapest partial path. The question is, should we also order the
partial paths for the leader?

The only scenario I see where the leader choosing the cheapest partial path
*might* show some benefit is if there are some partial paths that
need to do some startup work using only one worker. I think parallel
hash join is currently one case where it builds the hash table, though I
guess there is also support for a parallel hash build; I am not sure about
its status. For such a plan, if the leader starts it, it would be slow, and
no other worker would be able to help it, so its actual startup cost
would be drastically increased. (Another path is parallel bitmap heap
scan, where the leader has to do something and the other workers wait;
but there, I think it's not much work for the leader to do.) So
overall, to handle such cases, it's better for the leader to choose the
cheapest path, or maybe a path with the cheapest startup cost. We can
also consider sorting the partial paths by decreasing startup cost.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-10-05 Thread Amit Khandekar
On 30 September 2017 at 19:21, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Wed, Sep 20, 2017 at 10:59 AM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> On 16 September 2017 at 10:42, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>>
>>> At a broader level, the idea is good, but I think it won't turn out
>>> exactly like that considering your below paragraph which indicates
>>> that it is okay if leader picks a partial path that is costly among
>>> other partial paths as a leader won't be locked with that.
>>> Considering this is a good design for parallel append, the question is
>>> do we really need worker and leader to follow separate strategy for
>>> choosing next path.  I think the patch will be simpler if we can come
>>> up with a way for the worker and leader to use the same strategy to
>>> pick next path to process.  How about we arrange the list of paths
>>> such that first, all partial paths will be there and then non-partial
>>> paths and probably both in decreasing order of cost.  Now, both leader
>>> and worker can start from the beginning of the list. In most cases,
>>> the leader will start at the first partial path and will only ever
>>> need to scan non-partial path if there is no other partial path left.
>>> This is not bulletproof as it is possible that some worker starts
>>> before leader in which case leader might scan non-partial path before
>>> all partial paths are finished, but I think we can avoid that as well
>>> if we are too worried about such cases.
>>
>> If there are no partial subpaths, then again the leader is likely to
>> take up the expensive subpath. And this scenario would not be
>> uncommon.
>>
>
> While thinking about how common the case of no partial subpaths would
> be, it occurred to me that as of now we always create a partial path
> for the inheritance child if it is parallel-safe and the user has not
> explicitly set the value of parallel_workers to zero (refer
> compute_parallel_worker).  So, unless you are planning to change that,
> I think it will be quite uncommon to have no partial subpaths.

There are still cases where a child rel can have a non-partial path cheaper
than its partial paths. Also, there can be UNION ALL queries which
could have non-partial subpaths. I guess this has already been
discussed in the other replies.

>
> Few nitpicks in your latest patch:
> 1.
> @@ -298,6 +366,292 @@ ExecReScanAppend(AppendState *node)
>   if (subnode->chgParam == NULL)
>   ExecReScan(subnode);
>   }
> +
>
> Looks like a spurious line.
>
> 2.
> @@ -1285,7 +1291,11 @@ add_paths_to_append_rel(PlannerInfo *root,
> RelOptInfo *rel,
> ..
> + if (chosen_path && chosen_path != cheapest_partial_path)
> + pa_all_partial_subpaths = false;
>
> It will keep on setting pa_all_partial_subpaths as false for
> non-partial paths which don't seem to be the purpose of this variable.
> I think you want it to be set even when there is one non-partial path,
> so isn't it better to write as below or something similar:
> if (pa_nonpartial_subpaths && pa_all_partial_subpaths)
> pa_all_partial_subpaths = false;

Ok. How about removing pa_all_partial_subpaths altogether, and
instead of the below condition:

/*
* If all the child rels have partial paths, and if the above Parallel
* Append path has a mix of partial and non-partial subpaths, then consider
* another Parallel Append path which will have *all* partial subpaths.
* If enable_parallelappend is off, make this one non-parallel-aware.
*/
if (partial_subpaths_valid && !pa_all_partial_subpaths)
..

Use this condition :
if (partial_subpaths_valid && pa_nonpartial_subpaths != NIL)
..




Regarding a mix of partial and non-partial paths, I feel it always
makes sense for the leader to choose the partial path. If it chooses a
non-partial path, no other worker would be able to help finish that
path. Among the partial paths, whether it chooses the cheapest one or
expensive one does not matter, I think. We have the partial paths
unordered. So whether it starts from the last partial path or the
first partial path should not matter.

Regarding the scenario where all paths are non-partial, here is an example:
suppose we have 4 child paths with costs 10, 5, 5, 3, and 2
workers plus one leader. And suppose the leader additionally takes
1/4th of these costs to process the returned tuples.

If the leader takes the least expensive one (3):
the 2 workers will finish 10, 5, 5 in 10 units,
and the leader simultaneously runs the plan with cost 3, so it
takes 3 + (1/4)(10 + 5 + 5 + 3) ≈ 9 units.
So the total time taken by the Append is 10.
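
For what it's worth, both cases can be worked out with a small standalone
program under the same assumptions (2 workers plus the leader, and the
leader additionally spending 1/4th of the total subplan cost on processing
the returned tuples); this only restates the arithmetic, it is not code from
the patch:

#include <stdio.h>

/* subplan costs, in arbitrary units */
static const double costs[] = {10, 5, 5, 3};
#define NPLANS 4

static double
total_cost(void)
{
    double      sum = 0;

    for (int i = 0; i < NPLANS; i++)
        sum += costs[i];
    return sum;
}

/*
 * The leader runs one subplan itself plus 1/4th of the total cost for
 * gathering tuples; the two workers split the remaining subplans, each
 * taking the next one as soon as it is free.  Returns the time at which
 * the whole Append finishes.
 */
static double
append_finish_time(int leader_pick)
{
    double      worker_time[2] = {0, 0};
    double      leader_time = costs[leader_pick] + total_cost() / 4.0;
    double      finish;

    for (int i = 0; i < NPLANS; i++)
    {
        if (i == leader_pick)
            continue;
        if (worker_time[0] <= worker_time[1])
            worker_time[0] += costs[i];
        else
            worker_time[1] += costs[i];
    }

    finish = leader_time;
    if (worker_time[0] > finish)
        finish = worker_time[0];
    if (worker_time[1] > finish)
        finish = worker_time[1];
    return finish;
}

int
main(void)
{
    printf("leader picks cheapest (3):        %.2f units\n",
           append_finish_time(3));
    printf("leader picks most expensive (10): %.2f units\n",
           append_finish_time(0));
    return 0;
}

The first case finishes at 10 units (the leader itself needs only about 9),
matching the numbers above; the second finishes at about 16 units.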


Whereas if leader takes most expensive one (10) :
10 +

Re: [HACKERS] UPDATE of partition key

2017-10-03 Thread Amit Khandekar
On 30 September 2017 at 01:26, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Sep 29, 2017 at 3:53 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Fri, Sep 22, 2017 at 1:57 AM, Amit Khandekar <amitdkhan...@gmail.com> 
>> wrote:
>>> The patch for the above change is :
>>> 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch
>>
>> Thinking about this a little more, I'm wondering about how this case
>> arises.  I think that for this patch to avoid multiple conversions,
>> we'd have to be calling map_variable_attnos on an expression and then
>> calling map_variable_attnos on that expression again.

We are not calling map_variable_attnos() twice. The first time it is
called, the ConvertRowtypeExpr node is already there if the expression
is a whole-row var; that node was already added by
adjust_appendrel_attrs(). So the conversion is done by two different
functions.

For ConvertRowtypeExpr, map_variable_attnos_mutator() recursively
calls map_variable_attnos_mutator() for ConvertRowtypeExpr->arg with
coerced_var=true.

>
> I guess I didn't quite finish this thought, sorry.  Maybe it's
> obvious, but the point I was going for is: why would we do that, vs.
> just converting once?

The first time ConvertRowtypeExpr node gets added in the expression is
when adjust_appendrel_attrs() is called for each of the child tables.
Here, for each of the child table, when the parent parse tree is
converted into the child parse tree, the whole row var (in RETURNING
or WITH CHECK OPTIONS expr) is wrapped with ConvertRowtypeExpr(), so
child parse tree (or the child WCO expr) has this ConvertRowtypeExpr
node.

The second time this node is added is during update-tuple-routing in
ExecInitModifyTable(), when map_partition_varattnos() is called for
each of the partitions to convert from the first per-subplan
RETURNING/WCO expression to the RETURNING/WCO expression belonging to
the leaf partition. This second conversion happens for the leaf
partitions which are not already present in per-subplan UPDATE result
rels.

So the first conversion is from parent to child while building
per-subplan plans, and the second is from first per-subplan child to
another child for building expressions of the leaf partitions.

So suppose the root partitioned table RETURNING expression is a whole
row var wr(r) where r is its composite type representing the root
table type.
Then, one of its UPDATE child tables will have its RETURNING
expression converted like this :
wr(r)  ===>  CRE(r) -> wr(c1)
where CRE(r) represents ConvertRowtypeExpr of result type r, which has
its arg pointing to wr(c1) which is a whole row var of composite type
c1 for the child table c1. So this node converts from composite type
of child table to composite type of root table.

Now, when the second conversion occurs for the leaf partition (i.e.
during update-tuple-routing), the conversion looks like this :
CRE(r) -> wr(c1)  ===>  CRE(r) -> wr(c2)
But W/o the 0002*ConvertRowtypeExpr*.patch the conversion would have
looked like this :
CRE(r) -> wr(c1)  ===>  CRE(r) -> CRE(c1) -> wr(c2)
In short, we omit the intermediate CRE(c1) node.


While writing this down, I observed that after multi-level partition
tree expansion was introduced, the child table expressions are not
converted directly from the root. Instead, they are converted from
their immediate parent. So there is a chain of conversions : to leaf
from its parent, to that parent from its parent, and so on from the
root. Effectively, during the first conversion, there are that many
ConvertRowtypeExpr nodes one above the other already present in the
UPDATE result rel expressions. But my patch handles the optimization
only for the leaf partition conversions.

If it already has a CRE : CRE(rr) -> wr(r)
Parent-to-child conversion : CRE(p) -> wr(r)  ===>  CRE(rr) -> CRE(r) -> wr(c1)
With patch : CRE(rr) -> CRE(r) -> wr(c1) ===> CRE(rr) -> CRE(r) -> wr(c2)




Re: [HACKERS] Parallel Append implementation

2017-09-28 Thread Amit Khandekar
On 20 September 2017 at 11:32, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> There is still the open point being
> discussed : whether to have non-parallel-aware partial Append path, or
> always have parallel-aware Append paths.

Attached is the revised patch v16. In previous versions, we used to
add a non-parallel-aware partial Append path having all partial
subpaths if the already-added Parallel Append path did not contain
all-partial subpaths. Now in the patch, when we add such an Append path
containing all partial subpaths, we make it parallel-aware (unless
enable_parallelappend is false). So in this case, there will be one
parallel-aware Append path containing one or more non-partial
subpaths, and another parallel-aware Append path
containing all-partial subpaths.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v16.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-09-28 Thread Amit Khandekar
Below are some performance figures. Overall, there does not appear to
be a noticeable difference between partition-key updates
with and without row movement (which is surprising), or between
non-partition-key updates with and without the patch.

All the values are in milliseconds.

Configuration :

shared_buffers = 8GB
maintenance_work_mem = 4GB
synchronous_commit = off
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
log_line_prefix = '%t [%p] '
max_wal_size = 5GB
max_connections = 200

The attached files were used to create a partition tree made up of 16
partitioned tables, each containing 125 partitions. The first half of the
2000 partitions is filled with 10 million rows; update row movement
moves the data to the other half of the partitions.

gen.sql : Creates the partitions.
insert.data : This data file is uploaded here [1]. Used "COPY ptab
from '$PWD/insert.data' "
index.sql : Optionally, Create index on column d.

The schema looks like this :

CREATE TABLE ptab (a date, b int, c int, d int) PARTITION BY RANGE (a, b);

CREATE TABLE ptab_1_1 PARTITION OF ptab
for values from ('1900-01-01', 1) to ('1900-01-01', 7501)
PARTITION BY range (c);
CREATE TABLE ptab_1_1_1 PARTITION OF ptab_1_1
for values from (1) to (81);
CREATE TABLE ptab_1_1_2 PARTITION OF ptab_1_1
for values from (81) to (161);
..
..
CREATE TABLE ptab_1_2 PARTITION OF ptab
for values from ('1900-01-01', 7501) to ('1900-01-01', 15001)
PARTITION BY range (c);
..
..

On 20 September 2017 at 00:06, Robert Haas <robertmh...@gmail.com> wrote:
> I wonder how much more expensive it
> is to execute UPDATE root SET a = a + 1 WHERE a = 1 on a table with
> 1000 subpartitions with this patch than without, assuming the update
> succeeds in both cases.

UPDATE query used : UPDATE ptab set d = d + 1 where d = 1; -- where d
is not a partition key of any of the partitions.
This query updates 8 rows out of 10 million rows.
With HEAD  : 2953.691 , 2862.298 , 2855.286 , 2835.879 (avg : 2876)
With Patch : 2933.719 , 2832.463 , 2749.979 , 2820.416 (avg : 2834)
(All the values are in milliseconds.)

> suppose you make a table with 1000 partitions each containing
> 10,000 tuples and update them all, and consider three scenarios: (1)
> partition key not updated but all tuples subject to non-HOT updates
> because the updated column is indexed, (2) partition key updated but
> no tuple movement required as a result, (3) partition key updated and
> all tuples move to a different partition.

Note that the following figures do not represent a consistent set;
they keep varying. For example, even though the
partition-key update without row movement appears to have taken a bit
more time with the patch than with HEAD, a new set of test runs might even
end up the other way round.

NPK  : 42089 (patch)
NPKI : 81593 (patch)
PK   : 45250 (patch) , 44944 (HEAD)
PKR  : 46701 (patch)

The above figures are in milliseconds. The explanations of the above
short-forms :

NPK :
Update of column that is not a partition-key.
UPDATE query used : UPDATE ptab set d = d + 1 ; This updates *all* rows.

NPKI :
Update of column that is not a partition-key. And this column is
indexed (Used attached file index.sql).
UPDATE query used : UPDATE ptab set d = d + 1 ; This updates *all* rows.

PK :
Update of partition key, but row movement does not occur. There are no
indexed columns.
UPDATE query used : UPDATE ptab set a = a + '1 hour'::interval ;

PKR :
Update of partition key, with all rows moved to other partitions.
There are no indexed columns.
UPDATE query used : UPDATE ptab set a = a + '2 years'::interval ;


[1] 
https://drive.google.com/open?id=0B_YJCqIAxKjeN3hMXzdDejlNYmlpWVJpaU9mWUhFRVhXTG5Z

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


gen.tar.gz
Description: GNU Zip compressed data


index.tar.gz
Description: GNU Zip compressed data



Re: [HACKERS] UPDATE of partition key

2017-09-22 Thread Amit Khandekar
On 21 September 2017 at 19:52, amul sul <sula...@gmail.com> wrote:
> On Wed, Sep 20, 2017 at 9:27 PM, Amit Khandekar <amitdkhan...@gmail.com>
> wrote:
>>
>> On 20 September 2017 at 00:06, Robert Haas <robertmh...@gmail.com> wrote:
>> > On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan...@gmail.com>
>> > wrote:
>> >> [ new patch ]
>
>
>   86 -   (event == TRIGGER_EVENT_UPDATE &&
> !trigdesc->trig_update_after_row))
>   87 +   (event == TRIGGER_EVENT_UPDATE &&
> !trigdesc->trig_update_after_row) ||
>   88 +   (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup
> == NULL)))
>   89 return;
>   90 }
>
>
> Either of oldtup or newtup will be valid at a time & vice versa.  Can we
> improve
> this check accordingly?
>
> For e.g.:
> (event == TRIGGER_EVENT_UPDATE && )(HeapTupleIsValid(oldtup) ^
> ItemPointerIsValid(newtup)

Ok, I will be doing this as below :
-  (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
+ (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL

At other places in the function, oldtup and newtup are checked for
NULL, so to be consistent, I haven't used HeapTupleIsValid.
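
Just to spell out what the XOR accepts, here is a tiny standalone truth
table (illustration only; the real check of course operates on the trigger
arguments):

#include <stdbool.h>
#include <stdio.h>

/*
 * true when exactly one of the two tuple pointers is NULL, i.e. when this
 * looks like one half of a routed UPDATE
 */
static bool
exactly_one_null(const void *oldtup, const void *newtup)
{
    return (oldtup == NULL) ^ (newtup == NULL);
}

int
main(void)
{
    int         t = 0;          /* stands in for a real tuple */

    printf("old=NULL  new=NULL  -> %d\n", exactly_one_null(NULL, NULL));
    printf("old=NULL  new=tuple -> %d\n", exactly_one_null(NULL, &t));
    printf("old=tuple new=NULL  -> %d\n", exactly_one_null(&t, NULL));
    printf("old=tuple new=tuple -> %d\n", exactly_one_null(&t, &t));
    return 0;
}

Only the two middle combinations evaluate to true, which is exactly the
"synthetic half of an UPDATE" case that the early return is meant to catch.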

Actually, it won't happen that both oldtup and newtup are NULL in
any of delete, insert, or update, but I haven't added an Assert for
this, because that has been true even on HEAD.

Will include the above minor change in the next patch when more changes come in.

>
>
>  247
>  248 +   /*
>  249 +* EDB: In case this is part of update tuple routing, put this row
> into the
>  250 +* transition NEW TABLE if we are capturing transition tables. We
> need to
>  251 +* do this separately for DELETE and INSERT because they happen on
>  252 +* different tables.
>  253 +*/
>  254 +   if (mtstate->operation == CMD_UPDATE &&
> mtstate->mt_transition_capture)
>  255 +   ExecARUpdateTriggers(estate, resultRelInfo, NULL,
>  256 +NULL,
>  257 +tuple,
>  258 +NULL,
>  259 +mtstate->mt_transition_capture);
>  260 +
>  261 list_free(recheckIndexes);
>
>  267
>  268 +   /*
>  269 +* EDB: In case this is part of update tuple routing, put this row
> into the
>  270 +* transition OLD TABLE if we are capturing transition tables. We
> need to
>  271 +* do this separately for DELETE and INSERT because they happen on
>  272 +* different tables.
>  273 +*/
>  274 +   if (mtstate->operation == CMD_UPDATE &&
> mtstate->mt_transition_capture)
>  275 +   ExecARUpdateTriggers(estate, resultRelInfo, tupleid,
>  276 +oldtuple,
>  277 +NULL,
>  278 +NULL,
>  279 +mtstate->mt_transition_capture);
>  280 +
>
> Initially, I wondered that why can't we have above code right after
> ExecInsert() & ExecIDelete() in ExecUpdate respectively?
>
> We can do that for ExecIDelete() but not easily in the ExecInsert() case,
> because ExecInsert() internally searches the correct partition's
> resultRelInfo
> for an insert and before returning to ExecUpdate resultRelInfo is restored
> to the old one.  That's why current logic seems to be reasonable for now.
> Is there anything that we can do?

Yes, resultRelInfo is different when we return from ExecInsert().
Also, I think the trigger firing and transition capture should be done
immediately after the rows are inserted; this is true for the existing code
also. Furthermore, there is a dependency between ExecARUpdateTriggers() and
ExecARInsertTriggers(): transition_capture is passed as NULL if we have
already captured the tuple in ExecARUpdateTriggers(). It looks simpler
to do all this in a single place.
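
To make that ordering concrete, the insert half ends up shaped roughly like
this (a sketch assembled from the fragments quoted above, not the actual
patch hunk):

/* in ExecInsert(), after the row has been inserted into the leaf partition */
if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
{
    /* capture the routed-in tuple as the NEW TABLE half of the UPDATE */
    ExecARUpdateTriggers(estate, resultRelInfo, NULL, NULL, tuple, NULL,
                         mtstate->mt_transition_capture);
    /* don't let the insert row triggers capture it a second time */
    transition_capture = NULL;
}
else
    transition_capture = mtstate->mt_transition_capture;

ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
                     transition_capture);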

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-09-20 Thread Amit Khandekar
On 20 September 2017 at 00:06, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> [ new patch ]
>
> This already fails to apply again.  In general, I think it would be a
> good idea to break this up into a patch series rather than have it as
> a single patch.  That would allow some bits to be applied earlier.
> The main patch will probably still be pretty big, but at least we can
> make things a little easier by getting some of the cleanup out of the
> way first.  Specific suggestions on what to break out below.
>
> If the changes to rewriteManip.c are a marginal efficiency hack and
> nothing more, then let's commit this part separately before the main
> patch.  If they're necessary for correctness, then please add a
> comment explaining why they are necessary.

Ok. Yes, I just wanted to avoid two ConvertRowtypeExpr nodes one over
the other. But that was not causing any correctness issue. I will
extract these changes into a separate patch.

>
> There appears to be no reason why the definitions of
> GetInsertedColumns() and GetUpdatedColumns() need to be moved to a
> header file as a result of this patch.  GetUpdatedColumns() was
> previously defined in trigger.c and execMain.c and, post-patch, is
> still called from only those files.  GetInsertedColumns() was, and
> remains, called only from execMain.c.  If this were needed I'd suggest
> doing it as a preparatory patch before the main patch, but it seems we
> don't need it at all.

In earlier versions of the patch, these functions were used in
nodeModifyTable.c as well. Now that those calls are not there in that
file, I will revert the changes that moved the definitions
into the header file.

>
> If I understand correctly, the reason for changing mt_partitions from
> ResultRelInfo * to ResultRelInfo ** is that, currently, all of the
> ResultRelInfos for a partitioning hierarchy are allocated as a single
> chunk, but we can't do that and also reuse the ResultRelInfos created
> during InitPlan.  I suggest that we do this as a preparatory patch.

Ok, will prepare a separate patch. Do you mean to include in that
patch the changes I did in ExecSetupPartitionTupleRouting() that
re-use the ResultRelInfo structures of per-subplan update result rels
?

> Someone could argue that this is going the wrong way and that we ought
> to instead make InitPlan() create all of the necessarily
> ResultRelInfos, but it seems to me that eventually we probably want to
> allow setting up ResultRelInfos on the fly for only those partitions
> for which we end up needing them.  The code already has some provision
> for creating ResultRelInfos on the fly - see ExecGetTriggerResultRel.
> I don't think it's this patch's job to try to apply that kind of thing
> to tuple routing, but it seems like in the long run if we're inserting
> 1 tuple into a table with 1000 partitions, or performing 1 update that
> touches the partition key, it would be best not to create
> ResultRelInfos for all 1000 partitions just for fun.

Yes makes sense.

>  But this sort of
> thing seems much easier of mt_partitions is ResultRelInfo ** rather
> than ResultRelInfo *, so I think what you have is going in the right
> direction.

Ok.

>
> + * mtstate->resultRelInfo[]. Note: We assume that if the 
> resultRelInfo
> + * does not belong to subplans, then it already matches the root 
> tuple
> + * descriptor; although there is no such known scenario where this
> + * could happen.
> + */
> +if (rootResultRelInfo != resultRelInfo &&
> +mtstate->mt_persubplan_childparent_maps != NULL &&
> +resultRelInfo >= mtstate->resultRelInfo &&
> +resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1)
> +{
> +int map_index = resultRelInfo - mtstate->resultRelInfo;
>
> I think you should Assert() that it doesn't happen instead of assuming
> that it doesn't happen.   IOW, remove the last two branches of the
> if-condition, and then add an Assert() that map_index is sane.

Ok.

>
> It is not clear to me why we need both mt_perleaf_childparent_maps and
> mt_persubplan_childparent_maps.

mt_perleaf_childparent_maps :
This is used for converting transition-captured
inserted/modified/deleted tuples from leaf to root partition, because
we need to have all the ROWS in the root partition attribute order.
This map is used only for tuples that are routed from root to leaf
partition during INSERT, or when tuples are routed from one leaf
partition to another leaf partition during update row movement. For
both of these operations, we need per-leaf maps, because during tuple
conversion, the 

Re: [HACKERS] Parallel Append implementation

2017-09-20 Thread Amit Khandekar
On 11 September 2017 at 18:55, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> 1.
>>> + else if (IsA(subpath, MergeAppendPath))
>>> + {
>>> + MergeAppendPath *mpath = (MergeAppendPath *) subpath;
>>> +
>>> + /*
>>> + * If at all MergeAppend is partial, all its child plans have to be
>>> + * partial : we don't currently support a mix of partial and
>>> + * non-partial MergeAppend subpaths.
>>> + */
>>> + if (is_partial)
>>> + return list_concat(partial_subpaths, list_copy(mpath->subpaths));
>>>
>>> In which situation partial MergeAppendPath is generated?  Can you
>>> provide one example of such path?
>>
>> Actually currently we don't support partial paths for MergeAppendPath.
>> That code just has that if condition (is_partial) but currently that
>> condition won't be true for MergeAppendPath.
>>
>
> I think then it is better to have Assert for MergeAppend.  It is
> generally not a good idea to add code which we can never hit.

Done.

> One more thing, I think you might want to update comment atop
> add_paths_to_append_rel to reflect the new type of parallel-aware
> append path being generated. In particular, I am referring to below
> part of the comment:
>
>  * Similarly it collects partial paths from
>  * non-dummy children to create partial append paths.
>  */
> static void
> add_paths_to_append_rel()
>

Added comments.

Attached is the revised patch v15. There is still the open point being
discussed: whether to have a non-parallel-aware partial Append path, or to
always have parallel-aware Append paths.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v15.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-09-19 Thread Amit Khandekar
On 16 September 2017 at 10:42, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Sep 14, 2017 at 9:41 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Mon, Sep 11, 2017 at 9:25 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> I think the patch stores only non-partial paths in decreasing order,
>>> what if partial paths having more costs follows those paths?
>>
>> The general picture here is that we don't want the leader to get stuck
>> inside some long-running operation because then it won't be available
>> to read tuples from the workers.  On the other hand, we don't want to
>> just have the leader do no work because that might be slow.  And in
> most cases, the leader will be the first participant to arrive at
>> the Append node, because of the worker startup time.  So the idea is
>> that the workers should pick expensive things first, and the leader
>> should pick cheap things first.
>>
>
> At a broader level, the idea is good, but I think it won't turn out
> exactly like that considering your below paragraph which indicates
> that it is okay if leader picks a partial path that is costly among
> other partial paths as a leader won't be locked with that.
> Considering this is a good design for parallel append, the question is
> do we really need worker and leader to follow separate strategy for
> choosing next path.  I think the patch will be simpler if we can come
> up with a way for the worker and leader to use the same strategy to
> pick next path to process.  How about we arrange the list of paths
> such that first, all partial paths will be there and then non-partial
> paths and probably both in decreasing order of cost.  Now, both leader
> and worker can start from the beginning of the list. In most cases,
> the leader will start at the first partial path and will only ever
> need to scan non-partial path if there is no other partial path left.
> This is not bulletproof as it is possible that some worker starts
> before leader in which case leader might scan non-partial path before
> all partial paths are finished, but I think we can avoid that as well
> if we are too worried about such cases.

If there are no partial subpaths, then again the leader is likely to
take up the expensive subpath, and this scenario would not be
uncommon. For this scenario at least, we anyway have to make the leader
start from the cheapest one, so we will have to maintain code for that logic.
Now, since we anyway have to maintain that logic, why not use the same
logic for the leader in all cases? Otherwise, if we try to come up with a
common logic that conditionally chooses a different next plan for the leader
or a worker, that logic would most probably get more complicated than
the current state. Also, I don't see any performance issue if the
leader is running backwards while the others are going forwards.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-09-19 Thread Amit Khandekar
On 18 September 2017 at 20:45, Dilip Kumar <dilipbal...@gmail.com> wrote:
> Please find few more comments.
>
> + * in which they appear in the PartitionDesc. Also, extract the
> + * partition key columns of the root partitioned table. Those of the
> + * child partitions would be collected during recursive expansion.
> */
> + pull_child_partition_columns(_part_cols, oldrelation, oldrelation);
> expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
>   lockmode, >append_rel_list,
> +   _part_cols,
>
> pcinfo->all_part_cols is only used in case of update, I think we can
> call pull_child_partition_columns
> only if rte has updateCols?
>
> @@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo
>
> Index parent_relid;
> List   *child_rels;
> + Bitmapset  *all_part_cols;
> } PartitionedChildRelInfo;
>
> I might be missing something, but do we really need to store
> all_part_cols inside the
> PartitionedChildRelInfo,  can't we call pull_child_partition_columns
> directly inside
> inheritance_planner whenever we realize that RTE has some updateCols
> and we want to
> check the overlap?

One extra thing we will have to do is open and close the
partitioned rels again. The idea was that we collect the bitmap
*while* we are already expanding the tree and the rel is open.
I will check whether this is feasible.

>
> +extern void partition_walker_init(PartitionWalker *walker, Relation rel);
> +extern Relation partition_walker_next(PartitionWalker *walker,
> +  Relation *parent);
> +
>
> I don't see these functions are used anywhere?
>
> +typedef struct PartitionWalker
> +{
> + List   *rels_list;
> + ListCell   *cur_cell;
> +} PartitionWalker;
> +
>
> Same as above

Yes, these were left over from the earlier implementation. Will have
them removed in the next updated patch.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Append implementation

2017-09-17 Thread Amit Khandekar
On 16 September 2017 at 11:45, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Sep 14, 2017 at 8:30 PM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> On 11 September 2017 at 18:55, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>>>
>>>
>>> How?  See, if you have four partial subpaths and two non-partial
>>> subpaths, then for parallel-aware append it considers all six paths in
>>> parallel path whereas for non-parallel-aware append it will consider
>>> just four paths and that too with sub-optimal strategy.  Can you
>>> please try to give me some example so that it will be clear.
>>
>> Suppose 4 appendrel children have costs for their cheapest partial (p)
>> and non-partial paths (np)  as shown below :
>>
>> p1=5000  np1=100
>> p2=200   np2=1000
>> p3=80   np3=2000
>> p4=3000  np4=50
>>
>> Here, following two Append paths will be generated :
>>
>> 1. a parallel-aware Append path with subpaths :
>> np1, p2, p3, np4
>>
>> 2. Partial (i.e. non-parallel-aware) Append path with all partial subpaths:
>> p1,p2,p3,p4
>>
>> Now, one thing we can do above is : Make the path#2 parallel-aware as
>> well; so both Append paths would be parallel-aware.
>>
>
> Yes, we can do that and that is what I think is probably better.  So,
> the question remains that in which case non-parallel-aware partial
> append will be required?  Basically, it is not clear to me why after
> having parallel-aware partial append we need non-parallel-aware
> version?  Are you keeping it for the sake of backward-compatibility or
> something like for cases if someone has disabled parallel append with
> the help of new guc in this patch?

Yes, one case is the enable_parallelappend GUC: if a user disables it,
we do want to add the usual non-parallel-aware partial Append path.

About backward compatibility, the concern we discussed in [1] was that
we had better continue to have the usual non-parallel-aware partial
Append path, plus an additional parallel-aware Append path containing a
mix of partial and non-partial subpaths.

But thinking again about the example above, Amit, I tend to agree
that we don't have to worry about the existing behaviour, and so we
can make path #2 parallel-aware as well.

Robert, could you please give your opinion on the paths chosen in the
above example?

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoaLRtaWdJVHfhHej2s7w1spbr6gZiZXJrM5bsz1KQ54Rw%40mail.gmail.com

>
-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Append implementation

2017-09-14 Thread Amit Khandekar
On 11 September 2017 at 18:55, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> Do you think non-parallel-aware Append
>>> will be better in any case when there is a parallel-aware append?  I
>>> mean to say let's try to create non-parallel-aware append only when
>>> parallel-aware append is not possible.
>>
>> By non-parallel-aware append, I am assuming you meant  partial
>> non-parallel-aware Append. Yes, if the parallel-aware Append path has
>> *all* partial subpaths chosen, then we do omit a partial non-parallel
>> Append path, as seen in this code in the patch :
>>
>> /*
>> * Consider non-parallel partial append path. But if the parallel append
>> * path is made out of all partial subpaths, don't create another partial
>> * path; we will keep only the parallel append path in that case.
>> */
>> if (partial_subpaths_valid && !pa_all_partial_subpaths)
>> {
>> ..
>> }
>>
>> But if the parallel-Append path has a mix of partial and non-partial
>> subpaths, then we can't really tell which of the two could be cheapest
>> until we calculate the cost. It can be that the non-parallel-aware
>> partial Append can be cheaper as well.
>>
>
> How?  See, if you have four partial subpaths and two non-partial
> subpaths, then for parallel-aware append it considers all six paths in
> parallel path whereas for non-parallel-aware append it will consider
> just four paths and that too with sub-optimal strategy.  Can you
> please try to give me some example so that it will be clear.

Suppose 4 appendrel children have costs for their cheapest partial (p)
and non-partial paths (np)  as shown below :

p1=5000  np1=100
p2=200   np2=1000
p3=80   np3=2000
p4=3000  np4=50

Here, following two Append paths will be generated :

1. a parallel-aware Append path with subpaths :
np1, p2, p3, np4

2. Partial (i.e. non-parallel-aware) Append path with all partial subpaths:
p1,p2,p3,p4

Now, one thing we can do above is make path #2 parallel-aware as
well, so that both Append paths would be parallel-aware. Are you
suggesting exactly this?

So above, what I am saying is that we can't tell which of paths #1 and
#2 is cheaper until we calculate the total cost. I didn't understand
what you meant by "non-parallel-aware append will consider only the
partial subpaths and that too with sub-optimal strategy" in the above
example. I guess you were considering a different scenario than the
above one.

Whereas, if one or more of the Append children do not have a partial
subpath in the first place, then a non-parallel-aware partial Append is
out of the question, as we both agree.
The other case where we skip the non-parallel-aware partial Append is
when all the cheapest subpaths of the parallel-aware Append path are
partial paths: we do not want parallel-aware and non-parallel-aware
Append paths both having exactly the same partial subpaths.

-

I will be addressing your other comments separately.

Thanks
-Amit Khandekar


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] expanding inheritance in partition bound order

2017-09-14 Thread Amit Khandekar
On 14 September 2017 at 06:43, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> Attached updated patch.

@@ -1222,151 +1209,130 @@ PartitionDispatch *
 RelationGetPartitionDispatchInfo(Relation rel,
                                  int *num_parted, List **leaf_part_oids)
 {
+   List   *pdlist;
    PartitionDispatchData **pd;

+   get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);

Above, pdlist is passed uninitialized. And then inside
get_partition_dispatch_recurse(), it is used here :
*pds = lappend(*pds, pd);
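
A minimal sketch of the kind of initialization I think is needed here
(this is my assumption about the intended fix, not code from the posted
patch):

	List	   *pdlist = NIL;	/* start empty, so that lappend() inside
								 * get_partition_dispatch_recurse() does not
								 * operate on an uninitialized pointer */

	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);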



pgindent says more alignment is needed. For e.g., the gettext_noop()
call below needs to be aligned:
pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
tupdesc,
gettext_noop("could not convert row type"));



Other than that, the patch looks good to me. I verified that the leaf
oids are in exactly the same order as the UPDATE subplans' output.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] expanding inheritance in partition bound order

2017-09-13 Thread Amit Khandekar
On 13 September 2017 at 15:32, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/11 18:56, Amit Langote wrote:
>> Attached updated patch does it that way for both partitioned table indexes
>> and leaf partition indexes.  Thanks for pointing it out.
>
> It seems to me we don't really need the first patch all that much.  That
> is, let's keep PartitionDispatchData the way it is for now, since we don't
> really have any need for it beside tuple-routing (EIBO as committed didn't
> need it for one).  So, let's forget about "decoupling
> RelationGetPartitionDispatchInfo() from the executor" thing for now and
> move on.
>
> So, attached is just the patch to make RelationGetPartitionDispatchInfo()
> traverse the partition tree in depth-first manner to be applied on HEAD.
>
> Thoughts?

+1. If at all we need the decoupling later for some reason, we can do
that incrementally.

Will review your latest patch by tomorrow.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Amit Khandekar
On 13 September 2017 at 13:05, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
> On Wed, Sep 13, 2017 at 12:32 PM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> Hi,
>>
>> Rafia had done some testing on TPCH queries using Partition-wise join
>> patch along with Parallel Append patch.
>>
>> There, we had observed that for query 4, even though the partition
>> wise joins are under a Parallel Append, the join are all non-partial.
>>
>> Specifically, the partition-wise join has non-partial nested loop
>> joins when actually it was expected to have partial nested loop joins.
>> (The difference can be seen by the observation that the outer relation
>> of that join is scanned by non-parallel Bitmap Heap scan when it
>> should have used Parallel Bitmap Heap Scan).
>>
>> Here is the detailed analysis , including where I think is the issue :
>>
>> https://www.postgresql.org/message-id/CAJ3gD9cZms1ND3p%3DNN%3DhDYDFt_SeKq1htMBhbj85bOmvJwY5fg%40mail.gmail.com
>>
>> All the TPCH results are posted in the same above mail thread.
>
> Can you please check if the attached patch fixes the issue.

Thanks Ashutosh. Yes, it does fix the issue. Partial Nested Loop joins
are generated now. If I see any unexpected differences in the
estimated or actual costs, I will report that in the Parallel Append
thread. As far as Partition-wise join is concerned, this issue is
solved, because Partial nested loop join does get created.

>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Amit Khandekar
Hi,

Rafia had done some testing on TPCH queries using Partition-wise join
patch along with Parallel Append patch.

There, we had observed that for query 4, even though the partition-wise
joins are under a Parallel Append, the joins are all non-partial.

Specifically, the partition-wise join has non-partial nested loop
joins when it was actually expected to have partial nested loop joins.
(The difference can be seen from the observation that the outer relation
of that join is scanned by a non-parallel Bitmap Heap Scan when it
should have used a Parallel Bitmap Heap Scan.)

Here is the detailed analysis, including where I think the issue is:

https://www.postgresql.org/message-id/CAJ3gD9cZms1ND3p%3DNN%3DhDYDFt_SeKq1htMBhbj85bOmvJwY5fg%40mail.gmail.com

All the TPCH results are posted in the same above mail thread.

Thanks
-Amit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] why not parallel seq scan for slow functions

2017-09-12 Thread Amit Khandekar
On 5 September 2017 at 14:04, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Fri, Aug 25, 2017 at 10:08 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Mon, Aug 21, 2017 at 5:08 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> (b) I have changed the costing of gather path for path target in
>>> generate_gather_paths which I am not sure is the best way. Another
>>> possibility could have been that I change the code in
>>> apply_projection_to_path as done in the previous patch and just call
>>> it from generate_gather_paths.  I have not done that because of your
>>> comment above thread ("is probably unsafe, because it might confuse
>>> code that reaches the modified-in-place path through some other
>>> pointer (e.g. code which expects the RelOptInfo's paths to still be
>>> sorted by cost).").  It is not clear to me what exactly is bothering
>>> you if we directly change costing in apply_projection_to_path.
>>
>> The point I was trying to make is that if you retroactively change the
>> cost of a path after you've already done add_path(), it's too late to
>> change your mind about whether to keep the path.  At least according
>> to my current understanding, that's the root of this problem in the
>> first place.  On top of that, add_path() and other routines that deal
>> with RelOptInfo path lists expect surviving paths to be ordered by
>> descending cost; if you frob the cost, they might not be properly
>> ordered any more.
>>
>
> Okay, now I understand your point, but I think we already change the
> cost of paths in apply_projection_to_path which is done after add_path
> for top level scan/join paths.  I think this matters a lot in case of
> Gather because the cost of computing target list can be divided among
> workers.  I have changed the patch such that parallel paths for top
> level scan/join node will be generated after pathtarget is ready.  I
> had kept the costing of path targets local to
> apply_projection_to_path() as that makes the patch simple.

I started with a quick review ... a couple of comments below :

- * If this is a baserel, consider gathering any partial paths we may have
- * created for it.  (If we tried to gather inheritance children, we could
+ * If this is a baserel and not the only rel, consider gathering any
+ * partial paths we may have created for it.  (If we tried to gather

  /* Create GatherPaths for any useful partial paths for rel */
-  generate_gather_paths(root, rel);
+  if (lev < levels_needed)
+ generate_gather_paths(root, rel, NULL);

I think at the above two places, and maybe in other places also, it's
better to mention the reason why we should generate the gather path
only if it's not the only rel.

--

-   if (rel->reloptkind == RELOPT_BASEREL)
-   generate_gather_paths(root, rel);
+   if (rel->reloptkind == RELOPT_BASEREL &&
+       root->simple_rel_array_size > 2)
+   generate_gather_paths(root, rel, NULL);

Above, in case it's a partitioned table, root->simple_rel_array_size
includes the child rels. So even if it's a simple select without any
join rel, simple_rel_array_size would be > 2, and so a gather path would
be generated here for the root table, and then again in grouping_planner().



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPDATE of partition key

2017-09-12 Thread Amit Khandekar
On 12 September 2017 at 11:57, Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>
>> But the statement level trigger function can refer to OLD TABLE and
>> NEW TABLE, which will contain all the OLD rows and NEW rows
>> respectively. So the updated rows of the partitions (including the
>> moved ones) need to be captured. So for OLD TABLE, we need to capture
>> the deleted row, and for NEW TABLE, we need to capture the inserted
>> row.
>
> Yes, I agree.  So in ExecDelete for OLD TABLE we only need to call
> ExecARUpdateTriggers which will make the entry in OLD TABLE only if
> transition table is there otherwise nothing and I guess this part
> already exists in your patch.  And, we are also calling
> ExecARDeleteTriggers and I guess that is to fire the ROW-LEVEL delete
> trigger and that is also fine.  What I don't understand is that if
> there is no "ROW- LEVEL delete trigger" and there is only a "statement
> level delete trigger" with transition table still we are making the
> entry in transition table of the delete trigger and that will never be
> used.

Hmm, ok, that might be happening, since we are calling
ExecARDeleteTriggers() with mtstate->mt_transition_capture non-NULL,
and so the deleted tuple gets captured even when there is no UPDATE
statement trigger defined, which looks redundant. Will check this.
Thanks.

>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPDATE of partition key

2017-09-11 Thread Amit Khandekar
On 11 September 2017 at 21:12, Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Thu, Sep 7, 2017 at 11:41 AM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> On 6 September 2017 at 21:47, Dilip Kumar <dilipbal...@gmail.com> wrote:
>
>> Actually, since transition tables came in, the functions like
>> ExecARUpdateTriggers() or ExecARInsertTriggers() have this additional
>> purpose of capturing transition table rows, so that the images of the
>> tables are visible when statement triggers are fired that refer to
>> these transition tables. So in the above code, these functions only
>> capture rows, they do not add any event for firing any ROW triggers.
>> AfterTriggerSaveEvent() returns without adding any event if it's
>> called only for transition capture. So even if UPDATE row triggers are
>> defined, they won't get fired in case of row movement, although the
>> updated rows would be captured if transition tables are referenced in
>> these triggers or in the statement triggers.
>>
>
> Ok then I have one more question,
>
> With transition table, we can only support statement level trigger

Yes, we don't support row triggers with transition tables if the table
is a partition.

> and for update
> statement, we are only going to execute UPDATE statement level
> trigger? so is there
> any point of making transition table entry for DELETE/INSERT trigger
> as those transition
> table will never be used.

But the statement level trigger function can refer to OLD TABLE and
NEW TABLE, which will contain all the OLD rows and NEW rows
respectively. So the updated rows of the partitions (including the
moved ones) need to be captured. So for OLD TABLE, we need to capture
the deleted row, and for NEW TABLE, we need to capture the inserted
row.

In the regression test update.sql, check how the statement trigger
trans_updatetrig prints all the updated rows, including the moved
ones.


>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Append implementation

2017-09-11 Thread Amit Khandekar
On 8 September 2017 at 19:17, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Fri, Sep 8, 2017 at 3:59 PM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>> On 7 September 2017 at 11:05, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> On Thu, Aug 31, 2017 at 12:47 PM, Amit Khandekar <amitdkhan...@gmail.com> 
>>> wrote:
>>> 3.
>>> +/* 
>>> + * exec_append_leader_next
>>> + *
>>> + * To be used only if it's a parallel leader. The backend should scan
>>> + * backwards from the last plan. This is to prevent it from taking up
>>> + * the most expensive non-partial plan, i.e. the first subplan.
>>> + * 
>>> + */
>>> +static bool
>>> +exec_append_leader_next(AppendState *state)
>>>
>>> From above explanation, it is clear that you don't want backend to
>>> pick an expensive plan for a leader, but the reason for this different
>>> treatment is not clear.
>>
>> Explained it, saying that for more workers, a leader spends more work
>> in processing the worker tuples , and less work contributing to
>> parallel processing. So it should not take expensive plans, otherwise
>> it will affect the total time to finish Append plan.
>>
>
> In that case, why can't we keep the workers also process in same
> order, what is the harm in that?

Because of the way the queuing logic works, the workers finish
earlier if they start with the expensive plans first. For e.g., with 3
plans having costs 8, 4, 4 and with 2 workers w1 and w2, they will
finish in 8 time units (w1 will finish plan 1 in 8 units, while in
parallel w2 will finish the remaining 2 plans in 8 units). Whereas if
the plans are ordered as 4, 4, 8, then the workers will finish in 12
time units (w1 and w2 will each finish one of the first two plans in 4
units, and then one of them will take up plan 3 and finish it in
another 8 units, while the other worker remains idle).
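
The arithmetic can be reproduced with a toy scheduler (plain C, not
PostgreSQL code; just an illustration of the ordering effect described
above, where each worker grabs the next subplan in list order as soon
as it becomes free):

#include <stdio.h>

/* Returns the time at which the last subplan finishes. */
static int
makespan(const int *costs, int nplans, int nworkers)
{
	int		busy_until[8] = {0};	/* assumes nworkers <= 8 */
	int		latest = 0;

	for (int i = 0; i < nplans; i++)
	{
		int		w = 0;

		/* pick the worker that frees up first */
		for (int j = 1; j < nworkers; j++)
			if (busy_until[j] < busy_until[w])
				w = j;

		busy_until[w] += costs[i];
		if (busy_until[w] > latest)
			latest = busy_until[w];
	}

	return latest;
}

int
main(void)
{
	int		desc[] = {8, 4, 4};		/* expensive plan first */
	int		asc[] = {4, 4, 8};		/* expensive plan last */

	printf("8,4,4 order: %d units\n", makespan(desc, 3, 2));	/* 8 */
	printf("4,4,8 order: %d units\n", makespan(asc, 3, 2));		/* 12 */
	return 0;
}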

> Also, the leader will always scan
> the subplans from last subplan even if all the subplans are partial
> plans.

Since we already need to have the two different code paths anyway, I
think we can just use those same code paths regardless of the kind of
subplans.

> I think this will be the unnecessary difference in the
> strategy of leader and worker especially when all paths are partial.
> I think the selection of next subplan might become simpler if we use
> the same strategy for worker and leader.

Yeah, if we had a common method for both it would have been better.
But we anyway have different logic to maintain for the two.

>
> Few more comments:
>
> 1.
> + else if (IsA(subpath, MergeAppendPath))
> + {
> + MergeAppendPath *mpath = (MergeAppendPath *) subpath;
> +
> + /*
> + * If at all MergeAppend is partial, all its child plans have to be
> + * partial : we don't currently support a mix of partial and
> + * non-partial MergeAppend subpaths.
> + */
> + if (is_partial)
> + return list_concat(partial_subpaths, list_copy(mpath->subpaths));
>
> In which situation partial MergeAppendPath is generated?  Can you
> provide one example of such path?

Actually, currently we don't support partial paths for MergeAppendPath.
That code just has the if (is_partial) condition, but currently that
condition won't be true for a MergeAppendPath.

>
> 2.
> add_paths_to_append_rel()
> {
> ..
> + /* Consider parallel append path. */
> + if (pa_subpaths_valid)
> + {
> + AppendPath *appendpath;
> + int parallel_workers;
> +
> + parallel_workers = get_append_num_workers(pa_partial_subpaths,
> +  pa_nonpartial_subpaths);
> + appendpath = create_append_path(rel, pa_nonpartial_subpaths,
> + pa_partial_subpaths,
> + NULL, parallel_workers, true,
> + partitioned_rels);
> + add_partial_path(rel, (Path *) appendpath);
> + }
> +
>   /*
> - * Consider an append of partial unordered, unparameterized partial paths.
> + * Consider non-parallel partial append path. But if the parallel append
> + * path is made out of all partial subpaths, don't create another partial
> + * path; we will keep only the parallel append path in that case.
>   */
> - if (partial_subpaths_valid)
> + if (partial_subpaths_valid && !pa_all_partial_subpaths)
>   {
>   AppendPath *appendpath;
>   ListCell   *lc;
>   int parallel_workers = 0;
>
>   /*
> - * Decide on the number of workers to request for this append path.
> - * For now, we just use the maximum value from among the members.  It
> - * might be useful to use a higher number if the Append node were
> - * smart enough to spread out the workers, but it currently isn't.
> + * To decide the number of workers, just use 

Re: [HACKERS] expanding inheritance in partition bound order

2017-09-11 Thread Amit Khandekar
Thanks Amit for the patch. I am still reviewing it, but meanwhile
below are a few comments so far ...

On 8 September 2017 at 15:53, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> [PATCH 2/2] Make RelationGetPartitionDispatch expansion order
>  depth-first
>
> This is so as it matches what the planner is doing with partitioning
> inheritance expansion.  Matching with planner order helps because
> it helps ease matching the executor's per-partition objects with
> planner-created per-partition nodes.
>
>

+   next_parted_idx += (list_length(*pds) - next_parted_idx - 1);

I think this can be replaced just by :
+   next_parted_idx = list_length(*pds) - 1;
Or, how about removing this variable next_parted_idx altogether ?
Instead, we can just do this :
pd->indexes[i] = -(1 + list_length(*pds));
If that is not possible, I may be missing something.

---

+ next_leaf_idx += (list_length(*leaf_part_oids) - next_leaf_idx);

I didn't understand why next_leaf_idx needs to be updated when the
current partrelid is partitioned. I think it should be incremented
only for leaf partitions, no? Or for that matter, again, how about
removing the variable 'next_leaf_idx' and doing this :
*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
pd->indexes[i] = list_length(*leaf_part_oids) - 1;

---

* For every partitioned table in the tree, starting with the root
* partitioned table, add its relcache entry to parted_rels, while also
* queuing its partitions (in the order in which they appear in the
* partition descriptor) to be looked at later in the same loop.  This is
* a bit tricky but works because the foreach() macro doesn't fetch the
* next list element until the bottom of the loop.

I think the above comment needs to be modified to explain the
relevant changed code. For e.g., there is no parted_rels anymore, and
the "tricky" part was there earlier because the list was being iterated
and appended to at the same time.



I couldn't see the existing comments like "Indexes corresponding to
the internal partitions are multiplied by" anywhere in the patch. I
think those comments are still valid, and important.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Append implementation

2017-09-08 Thread Amit Khandekar
On 7 September 2017 at 11:05, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Aug 31, 2017 at 12:47 PM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> The last updated patch needs a rebase. Attached is the rebased version.
>>
>
> Few comments on the first read of the patch:

Thanks !

>
> 1.
> @@ -279,6 +347,7 @@ void
>  ExecReScanAppend(AppendState *node)
>  {
>   int i;
> + ParallelAppendDesc padesc = node->as_padesc;
>
>   for (i = 0; i < node->as_nplans; i++)
>   {
> @@ -298,6 +367,276 @@ ExecReScanAppend(AppendState *node)
>   if (subnode->chgParam == NULL)
>   ExecReScan(subnode);
>   }
> +
> + if (padesc)
> + {
> + padesc->pa_first_plan = padesc->pa_next_plan = 0;
> + memset(padesc->pa_finished, 0, sizeof(bool) * node->as_nplans);
> + }
> +
>
> For rescan purpose, the parallel state should be reinitialized via
> ExecParallelReInitializeDSM.  We need to do that way mainly to avoid
> cases where sometimes in rescan machinery we don't perform rescan of
> all the nodes.  See commit 41b0dd987d44089dc48e9c70024277e253b396b7.

Right. I didn't notice this while I rebased my patch over that commit.
Fixed it. Also added an exec_append_parallel_next() call in
ExecAppendReInitializeDSM(), otherwise the next ExecAppend() in the
leader would get an uninitialized as_whichplan.
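
Roughly, the reinitialization now looks along these lines (this is
reconstructed from the description above just to show the idea, not
copied from the v14 patch):

static void
ExecAppendReInitializeDSM(AppendState *node, ParallelContext *pcxt)
{
	ParallelAppendDesc padesc = node->as_padesc;

	/* reset the shared parallel-append state for the next rescan */
	padesc->pa_first_plan = padesc->pa_next_plan = 0;
	memset(padesc->pa_finished, 0, sizeof(bool) * node->as_nplans);

	/*
	 * Choose an initial subplan for the leader, so that the next
	 * ExecAppend() sees a valid as_whichplan.
	 */
	exec_append_parallel_next(node);
}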

>
> 2.
> + * shared next_subplan counter index to start looking for unfinished plan,

Done.

>
> Here using "counter index" seems slightly confusing. I think using one
> of those will be better.

Re-worded it a bit. See whether that's what you wanted.

>
> 3.
> +/* 
> + * exec_append_leader_next
> + *
> + * To be used only if it's a parallel leader. The backend should scan
> + * backwards from the last plan. This is to prevent it from taking up
> + * the most expensive non-partial plan, i.e. the first subplan.
> + * 
> + */
> +static bool
> +exec_append_leader_next(AppendState *state)
>
> From above explanation, it is clear that you don't want backend to
> pick an expensive plan for a leader, but the reason for this different
> treatment is not clear.

Explained it, saying that with more workers, the leader does more work
processing the worker tuples, and less work contributing to the
parallel processing. So it should not take up expensive plans,
otherwise it will affect the total time to finish the Append plan.

>
> 4.
> accumulate_partialappend_subpath()
> {
> ..
> + /* Add partial subpaths, if any. */
> + return list_concat(partial_subpaths, apath_partial_paths);
> ..
> + return partial_subpaths;
> ..
> + if (is_partial)
> + return lappend(partial_subpaths, subpath);
> ..
> }
>
> In this function, instead of returning from multiple places
> partial_subpaths list, you can just return it at the end and in all
> other places just append to it if required.  That way code will look
> more clear and simpler.

Agreed. Did it that way.

>
> 5.
>  * is created to represent the case that a relation is provably empty.
> + *
>   */
>  typedef struct AppendPath
>
> Spurious line addition.
Removed.

>
> 6.
> if (!node->as_padesc)
> {
> /*
> * This is Parallel-aware append. Follow it's own logic of choosing
> * the next subplan.
> */
> if (!exec_append_seq_next(node))
>
> I think this is the case of non-parallel-aware appends, but the
> comments are indicating the opposite.

Yeah, this comment got left over there when the relevant code got
changed. Shifted that comment upwards to where it is appropriate.

Attached is the updated patch v14.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v14.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPDATE of partition key

2017-09-07 Thread Amit Khandekar
On 3 September 2017 at 17:10, Amit Khandekar <amitdkhan...@gmail.com> wrote:

> After recent commit 30833ba154, now the partitions are expanded in
> depth-first order. It didn't seem worthwhile rebasing my partition
> walker changes onto the latest code. So in the attached patch, I have
> removed all the partition walker changes. But
> RelationGetPartitionDispatchInfo() traverses in breadth-first order,
> which is different than the update result rels order (because
> inheritance expansion order is depth-first). So, in order to make the
> tuple-routing-related leaf partitions in the same order as that of the
> update result rels, we would have to make changes in
> RelationGetPartitionDispatchInfo(), which I am not sure whether it is
> going to be done as part of the thread "expanding inheritance in
> partition bound order" [1]. For now, in the attached patch, I have
> reverted back to the hash table method to find the leaf partitions in
> the update result rels.
>
> [1] 
> https://www.postgresql.org/message-id/CAJ3gD9eyudCNU6V-veMme%2BeyzfX_ey%2BgEzULMzOw26c3f9rzdg%40mail.gmail.com

As mentioned by Amit Langote in the above mail thread, he is going to
make changes so that RelationGetPartitionDispatchInfo() returns the
leaf partitions in depth-first order. Once that is done, I will remove
the hash table method for finding leaf partitions in the update result
rels, and instead use the earlier, more efficient method that takes
advantage of the fact that the update result rels and leaf partitions
are in the same order.
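
Just to illustrate why having the same order matters (the variable
names here are made up for illustration; this is not the patch code):
with both lists in the same depth-first order, one merge-style scan
replaces the per-partition hash lookups :

	int		update_rel_idx = 0;

	for (i = 0; i < num_leaf_parts; i++)
	{
		if (update_rel_idx < num_update_result_rels &&
			leaf_part_oids[i] ==
			RelationGetRelid(update_result_rels[update_rel_idx]->ri_RelationDesc))
		{
			/* this leaf partition is also an UPDATE result rel */
			leaf_part_result_rels[i] = update_result_rels[update_rel_idx];
			update_rel_idx++;
		}
		else
			leaf_part_result_rels[i] = NULL;	/* not an UPDATE target */
	}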

>
> Thanks
> -Amit Khandekar



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Append implementation

2017-09-07 Thread Amit Khandekar
On 7 September 2017 at 13:40, Rafia Sabih <rafia.sa...@enterprisedb.com> wrote:
> On Wed, Aug 30, 2017 at 5:32 PM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> Hi Rafia,
>>
>> On 17 August 2017 at 14:12, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>>> But for all of the cases here, partial
>>> subplans seem possible, and so even on HEAD it executed Partial
>>> Append. So between a Parallel Append having partial subplans and a
>>> Partial Append having partial subplans , the cost difference would not
>>> be significant. Even if we assume that Parallel Append was chosen
>>> because its cost turned out to be a bit cheaper, the actual
>>> performance gain seems quite large as compared to the expected cost
>>> difference. So it might be even possible that the performance gain
>>> might be due to some other reasons. I will investigate this, and the
>>> other queries.
>>>
>>
>> I ran all the queries that were showing performance benefits in your
>> run. But for me, the ParallelAppend benefits are shown only for plans
>> that use Partition-Wise-Join.
>>
>> For all the queries that use only PA plans but not PWJ plans, I got
>> the exact same plan for HEAD as for PA+PWJ patch, except that for the
>> later, the Append is a ParallelAppend. Whereas, for you, the plans
>> have join-order changed.
>>
>> Regarding actual costs; consequtively, for me the actual-cost are more
>> or less the same for HEAD and PA+PWJ. Whereas, for your runs, you have
>> quite different costs naturally because the plans themselves are
>> different on head versus PA+PWJ.
>>
>> My PA+PWJ plan outputs (and actual costs) match exactly what you get
>> with PA+PWJ patch. But like I said, I get the same join order and same
>> plans (and actual costs) for HEAD as well (except
>> ParallelAppend=>Append).
>>
>> May be, if you have the latest HEAD code with your setup, you can
>> yourself check some of the queries again to see if they are still
>> seeing higher costs as compared to PA ? I suspect that some changes in
>> latest code might be causing this discrepancy; because when I tested
>> some of the explains with a HEAD-branch server running with your
>> database, I got results matching PA figures.
>>
>> Attached is my explain-analyze outputs.
>>
>
> Strange. Please let me know what was the commit-id you were
> experimenting on. I think we need to investigate this a little
> further.

Sure. I think the commit was b5c75fec. It was sometime around Aug 30
when I ran the tests. But you may try on the latest head.

> Additionally I want to point that I also applied patch [1],
> which I forgot to mention before.

Yes, I had also applied that patch over PA+PWJ.

>
> [1]  
> https://www.postgresql.org/message-id/CAEepm%3D3%3DNHHko3oOzpik%2BggLy17AO%2Bpx3rGYrg3x_x05%2BBr9-A%40mail.gmail.com

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Append implementation

2017-09-05 Thread Amit Khandekar
On 30 August 2017 at 17:32, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> On 16 August 2017 at 18:34, Robert Haas <robertmh...@gmail.com> wrote:
>> Thanks for the benchmarking results!
>>
>> On Tue, Aug 15, 2017 at 11:35 PM, Rafia Sabih
>> <rafia.sa...@enterprisedb.com> wrote:
>>> Q4 | 244 | 12 | PA and PWJ, time by only PWJ - 41
>>
>> 12 seconds instead of 244?  Whoa.  I find it curious that we picked a
>> Parallel Append with a bunch of non-partial plans when we could've
>> just as easily picked partial plans, or so it seems to me.  To put
>> that another way, why did we end up with a bunch of Bitmap Heap Scans
>> here instead of Parallel Bitmap Heap Scans?
>
> Actually, the cost difference would be quite low for Parallel Append
> with partial plans and Parallel Append with non-partial plans with 2
> workers. But yes, I should take a look at why it is consistently
> taking non-partial Bitmap Heap Scan.

Here, I checked that the Partial Bitmap Heap Scan path is not getting
created in the first place; but I think it should be.

As you can see from the below plan snippet, the inner path of the join
is a parameterized Index Scan :

->  Parallel Append
 ->  Nested Loop Semi Join
   ->  Bitmap Heap Scan on orders_004
   Recheck Cond: ((o_orderdate >= '1994-01-01'::date) AND
(o_orderdate < '1994-04-01 00:00:00'::timestamp without time zone))
   ->  Bitmap Index Scan on idx_orders_orderdate_004
Index Cond: ((o_orderdate >= '1994-01-01'::date) AND
(o_orderdate < '1994-04-01 00:00:00'::timestamp without time zone))
   ->  Index Scan using idx_lineitem_orderkey_004 on lineitem_004
   Index Cond: (l_orderkey = orders_004.o_orderkey)
   Filter: (l_commitdate < l_receiptdate)

In the index condition of the inner IndexScan path, it references
partition orders_004, which is used by the outer path. So this should
satisfy the partial join path restriction concerning a parameterized
inner path: "the inner path should not refer to relations *outside* the
join path". Here, it is referring to relations *inside* the join path.
But still this join path gets rejected by try_partial_nestloop_path(),
here:

if (inner_path->param_info != NULL)
{
    Relids      inner_paramrels = inner_path->param_info->ppi_req_outer;

    if (!bms_is_subset(inner_paramrels, outer_path->parent->relids))
        return;
}

Actually, bms_is_subset() above should return true, because
inner_paramrels and the outer path's relids should both contain
orders_004. But that's not happening: inner_paramrels refers to orders,
not orders_004, and hence bms_is_subset() returns false (thereby
rejecting the partial nestloop path). I suspect this is because the
inner path is not getting reparameterized so as to refer to the child
relations. In the PWJ patch, I saw that reparameterize_path_by_child()
is called by try_nestloop_path(), but not by try_partial_nestloop_path().
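
What I have in mind is something along these lines in
try_partial_nestloop_path(), mirroring what try_nestloop_path() does in
the PWJ patch. This is only a rough sketch; in particular, the exact
signature and NULL-on-failure behaviour of reparameterize_path_by_child()
are my assumptions here, not copied from the posted patch :

	if (inner_path->param_info != NULL)
	{
		Relids		inner_paramrels = inner_path->param_info->ppi_req_outer;

		if (!bms_is_subset(inner_paramrels, outer_path->parent->relids))
		{
			/*
			 * The inner path is parameterized by the parent (orders)
			 * instead of the child (orders_004).  Try re-expressing the
			 * parameterization in terms of the child, the way
			 * try_nestloop_path() does, before giving up.
			 */
			inner_path = reparameterize_path_by_child(root, inner_path,
													  outer_path->parent);
			if (inner_path == NULL)
				return;
		}
	}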

Now, for Parallel Append, even if this partial nestloop subpath gets
created, it may or may not get chosen, depending upon the number of
workers. For e.g., if the number of workers is 6, and ParallelAppend+PWJ
runs with only 2 partitions, then the partial nested loop join would
definitely win, because we can put all 6 workers to work, whereas for a
ParallelAppend with all non-partial nested loop join subpaths, at most
only 2 workers could be allotted, one for each child. But if there are
more partitions and fewer available workers, then I think the cost
difference between the partial and non-partial join paths would not be
significant.

But the issue here is that the partial nested loop subpaths don't get
created in the first place. Looking at the above analysis, this issue
should be worked on in a different thread, not this one.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] expanding inheritance in partition bound order

2017-09-04 Thread Amit Khandekar
On 4 September 2017 at 06:34, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> Hi Amit,
>
> On 2017/09/03 16:07, Amit Khandekar wrote:
>> On 31 August 2017 at 13:06, Amit Langote <langote_amit...@lab.ntt.co.jp> 
>> wrote:
>>>> Mind you, that idea has some problems anyway in the face of default
>>>> partitions, null partitions, and list partitions which accept
>>>> non-contiguous values (e.g. one partition for 1, 3, 5; another for 2,
>>>> 4, 6).  We might need to mark the PartitionDesc to indicate whether
>>>> PartitionDesc-order is in fact bound-order in a particular instance,
>>>> or something like that.
>>>
>>> ISTM, the primary motivation for the EIBO patch at this point is to get
>>> the partitions ordered in a predictable manner so that the partition-wise
>>> join patch and update partition key patches could implement certain logic
>>> using O (n) algorithm rather than an O (n^2) one.  Neither of them depend
>>> on the actual order in the sense of, say, sticking a PathKey to the
>>> resulting Append.
>>
>> Now that the inheritance hierarchy is expanded in depth-first order,
>> RelationGetPartitionDispatchInfo() needs to be changed to arrange the
>> PartitionDispatch array and the leaf partitions in depth-first order
>> (as we know this is a requirement for the update-partition-key patch
>> for efficiently determining which of the leaf partitions are already
>> present in the update result rels).
>
> I was thinking the same.
>
>> Amit, I am not sure if you are
>> already doing this as part of the patches in this mail thread. Please
>> let me know.
>
> Actually, I had thought of changing the expansion order in
> RelationGetPartitionDispatchInfo to depth-first after Robert committed his
> patch the other day, but haven't got around to doing that yet.  Will do
> that in the updated patch (the refactoring patch) I will post sometime
> later today or tomorrow on a differently titled thread, because the EIBO
> work seems to be done.

Great, thanks. Just wanted to make sure someone is working on that,
because, as you said, it is no longer an EIBO patch. Since you are
doing that, I won't work on that.

>
>> Also, let me know if you think there will be any loss of
>> efficiency in tuple routing code if we arrange the Partition Dispatch
>> indexes in depth-first order.
>
> I don't think there will be any loss in the efficiency of the tuple
> routing code itself.  It's just that the position of the ResultRelInfos
> (of leaf partitions) and PartitionDispatch objects (of partitioned tables)
> will be different in their respective arrays, that is, pd->indexes will
> now have different values than formerly.

Ok. Good to hear that.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPDATE of partition key

2017-09-03 Thread Amit Khandekar
On 31 August 2017 at 14:15, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> Thanks Dilip. I am working on rebasing the patch. Particularly, the
> partition walker in my patch depended on the fact that all the tables
> get opened (and then closed) while creating the tuple routing info.
> But in HEAD, now only the partitioned tables get opened. So need some
> changes in my patch.
>
> The partition walker related changes are going to be inapplicable once
> the other thread [1] commits the changes for expansion of inheritence
> in bound-order, but till then I would have to rebase the partition
> walker changes over HEAD.
>
> [1] 
> https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp
>

After recent commit 30833ba154, now the partitions are expanded in
depth-first order. It didn't seem worthwhile rebasing my partition
walker changes onto the latest code. So in the attached patch, I have
removed all the partition walker changes. But
RelationGetPartitionDispatchInfo() traverses in breadth-first order,
which is different than the update result rels order (because
inheritance expansion order is depth-first). So, in order to make the
tuple-routing-related leaf partitions in the same order as that of the
update result rels, we would have to make changes in
RelationGetPartitionDispatchInfo(), which I am not sure whether it is
going to be done as part of the thread "expanding inheritance in
partition bound order" [1]. For now, in the attached patch, I have
reverted back to the hash table method to find the leaf partitions in
the update result rels.

[1] 
https://www.postgresql.org/message-id/CAJ3gD9eyudCNU6V-veMme%2BeyzfX_ey%2BgEzULMzOw26c3f9rzdg%40mail.gmail.com

Thanks
-Amit Khandekar


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] expanding inheritance in partition bound order

2017-09-03 Thread Amit Khandekar
On 31 August 2017 at 13:06, Amit Langote  wrote:
>> Mind you, that idea has some problems anyway in the face of default
>> partitions, null partitions, and list partitions which accept
>> non-contiguous values (e.g. one partition for 1, 3, 5; another for 2,
>> 4, 6).  We might need to mark the PartitionDesc to indicate whether
>> PartitionDesc-order is in fact bound-order in a particular instance,
>> or something like that.
>
> ISTM, the primary motivation for the EIBO patch at this point is to get
> the partitions ordered in a predictable manner so that the partition-wise
> join patch and update partition key patches could implement certain logic
> using O (n) algorithm rather than an O (n^2) one.  Neither of them depend
> on the actual order in the sense of, say, sticking a PathKey to the
> resulting Append.

Now that the inheritance hierarchy is expanded in depth-first order,
RelationGetPartitionDispatchInfo() needs to be changed to arrange the
PartitionDispatch array and the leaf partitions in depth-first order
(as we know this is a requirement for the update-partition-key patch
for efficiently determining which of the leaf partitions are already
present in the update result rels). Amit, I am not sure if you are
already doing this as part of the patches in this mail thread. Please
let me know. Also, let me know if you think there will be any loss of
efficiency in tuple routing code if we arrange the Partition Dispatch
indexes in depth-first order.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPDATE of partition key

2017-08-31 Thread Amit Khandekar
Thanks Dilip. I am working on rebasing the patch. Particularly, the
partition walker in my patch depended on the fact that all the tables
get opened (and then closed) while creating the tuple routing info.
But in HEAD, now only the partitioned tables get opened. So need some
changes in my patch.

The partition walker related changes are going to be inapplicable once
the other thread [1] commits the changes for expansion of inheritance
in bound-order, but till then I would have to rebase the partition
walker changes over HEAD.

[1] 
https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp


On 31 August 2017 at 12:09, Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Fri, Aug 11, 2017 at 10:44 AM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> On 4 August 2017 at 22:28, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>>>>
>
> I am planning to review and test this patch, Seems like this patch
> needs to be rebased.
>
> [dilip@localhost postgresql]$ patch -p1 <
> ../patches/update-partition-key_v15.patch
> patching file doc/src/sgml/ddl.sgml
> patching file doc/src/sgml/ref/update.sgml
> patching file doc/src/sgml/trigger.sgml
> patching file src/backend/catalog/partition.c
> Hunk #3 succeeded at 910 (offset -1 lines).
> Hunk #4 succeeded at 924 (offset -1 lines).
> Hunk #5 succeeded at 934 (offset -1 lines).
> Hunk #6 succeeded at 994 (offset -1 lines).
> Hunk #7 succeeded at 1009 with fuzz 1 (offset 3 lines).
> Hunk #8 FAILED at 1023.
> Hunk #9 succeeded at 1059 with fuzz 2 (offset 10 lines).
> Hunk #10 succeeded at 2069 (offset 2 lines).
> Hunk #11 succeeded at 2406 (offset 2 lines).
> 1 out of 11 hunks FAILED -- saving rejects to file
> src/backend/catalog/partition.c.rej
> patching file src/backend/commands/copy.c
> Hunk #2 FAILED at 1426.
> Hunk #3 FAILED at 1462.
> Hunk #4 succeeded at 2616 (offset 7 lines).
> Hunk #5 succeeded at 2726 (offset 8 lines).
> Hunk #6 succeeded at 2846 (offset 8 lines).
> 2 out of 6 hunks FAILED -- saving rejects to file
> src/backend/commands/copy.c.rej
> patching file src/backend/commands/trigger.c
> Hunk #4 succeeded at 5261 with fuzz 2.
> patching file src/backend/executor/execMain.c
> Hunk #1 succeeded at 65 (offset 1 line).
> Hunk #2 succeeded at 103 (offset 1 line).
> Hunk #3 succeeded at 1829 (offset 20 lines).
> Hunk #4 succeeded at 1860 (offset 20 lines).
> Hunk #5 succeeded at 1927 (offset 20 lines).
> Hunk #6 succeeded at 2044 (offset 21 lines).
> Hunk #7 FAILED at 3210.
> Hunk #8 FAILED at 3244.
> Hunk #9 succeeded at 3289 (offset 26 lines).
> Hunk #10 FAILED at 3340.
> Hunk #11 succeeded at 3387 (offset 29 lines).
> Hunk #12 succeeded at 3424 (offset 29 lines).
> 3 out of 12 hunks FAILED -- saving rejects to file
> src/backend/executor/execMain.c.rej
> patching file src/backend/executor/execReplication.c
> patching file src/backend/executor/nodeModifyTable.c
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Append implementation

2017-08-31 Thread Amit Khandekar
The last updated patch needs a rebase. Attached is the rebased version.

Thanks
-Amit Khandekar


ParallelAppend_v13_rebased_3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Append implementation

2017-08-30 Thread Amit Khandekar
Hi Rafia,

On 17 August 2017 at 14:12, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> But for all of the cases here, partial
> subplans seem possible, and so even on HEAD it executed Partial
> Append. So between a Parallel Append having partial subplans and a
> Partial Append having partial subplans , the cost difference would not
> be significant. Even if we assume that Parallel Append was chosen
> because its cost turned out to be a bit cheaper, the actual
> performance gain seems quite large as compared to the expected cost
> difference. So it might be even possible that the performance gain
> might be due to some other reasons. I will investigate this, and the
> other queries.
>

I ran all the queries that were showing performance benefits in your
run. But for me, the ParallelAppend benefits are shown only for plans
that use Partition-Wise-Join.

For all the queries that use only PA plans but not PWJ plans, I got
the exact same plan for HEAD as for PA+PWJ patch, except that for the
later, the Append is a ParallelAppend. Whereas, for you, the plans
have join-order changed.

Regarding actual costs; consequtively, for me the actual-cost are more
or less the same for HEAD and PA+PWJ. Whereas, for your runs, you have
quite different costs naturally because the plans themselves are
different on head versus PA+PWJ.

My PA+PWJ plan outputs (and actual costs) match exactly what you get
with the PA+PWJ patch. But like I said, I get the same join order and
same plans (and actual costs) for HEAD as well (except
ParallelAppend => Append).

Maybe, if you have the latest HEAD code with your setup, you can
check some of the queries again yourself to see if they still show
higher costs as compared to PA? I suspect that some changes in the
latest code might be causing this discrepancy, because when I tested
some of the explains with a HEAD-branch server running against your
database, I got results matching the PA figures.

Attached is my explain-analyze outputs.

On 16 August 2017 at 18:34, Robert Haas <robertmh...@gmail.com> wrote:
> Thanks for the benchmarking results!
>
> On Tue, Aug 15, 2017 at 11:35 PM, Rafia Sabih
> <rafia.sa...@enterprisedb.com> wrote:
>> Q4 | 244 | 12 | PA and PWJ, time by only PWJ - 41
>
> 12 seconds instead of 244?  Whoa.  I find it curious that we picked a
> Parallel Append with a bunch of non-partial plans when we could've
> just as easily picked partial plans, or so it seems to me.  To put
> that another way, why did we end up with a bunch of Bitmap Heap Scans
> here instead of Parallel Bitmap Heap Scans?

Actually, the cost difference would be quite low between a Parallel
Append with partial plans and a Parallel Append with non-partial plans
with 2 workers. But yes, I should take a look at why it is consistently
taking the non-partial Bitmap Heap Scan.



> Q6 | 29 | 12 | PA only

This one needs to be analysed, because here, the plan cost is the
same, but actual cost for PA is almost half the cost for HEAD. This is
the same observation for my run also.

Thanks
-Amit


PA-test-AmitKh.tar.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] expanding inheritance in partition bound order

2017-08-30 Thread Amit Khandekar
On 25 August 2017 at 23:58, Robert Haas  wrote:
>
> That just leaves indexes.  In a world where keystate, tupslot, and
> tupmap are removed from the PartitionDispatchData, you must need
> indexes or there would be no point in constructing a
> PartitionDispatchData object in the first place; any application that
> needs neither indexes nor the executor-specific stuff could just use
> the Relation directly.

But there is expand_inherited_rtentry(), which requires neither indexes
nor any executor stuff, but still needs to call
RelationGetPartitionDispatchInfo(), and so these indexes get built
unnecessarily.

Looking at the latest patch, I can see that those indexes can be
built separately in ExecSetupPartitionTupleRouting(), where they are
required, instead of in RelationGetPartitionDispatchInfo(). In the
loop that iterates through the pd[] returned from
RelationGetPartitionDispatchInfo(), we can build them using the exact
code currently used to build them in
RelationGetPartitionDispatchInfo().

In the future, if we have applications where the indexes are also
required, we could have a separate function just for building the
indexes, and then ExecSetupPartitionTupleRouting() would also call that
function.




On 21 August 2017 at 11:40, Amit Langote  wrote:
>> In ExecSetupPartitionTupleRouting(), not sure why ptrinfos array is an
>> array of pointers. Why can't it be an array of
>> PartitionTupleRoutingInfo structure  rather than pointer to that
>> structure ?
>
> AFAIK, assigning pointers is less expensive than assigning struct and we
> end up doing a lot of assigning of the members of that array to a local
> variable

I didn't get why exactly we would have to copy the structures. We
could just pass the address of ptrinfos[index], no?

My only point here was: we would not have to call palloc0() for
each element of ptrinfos. Instead, we could allocate memory for all
of the elements in a single palloc0(). We anyway have to allocate
memory for *each* of the elements.
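
To illustrate what I mean (the loop variables are only for
illustration; PartitionTupleRoutingInfo is the struct from your patch),
the difference is just between these two allocation patterns :

	/* Current approach: array of pointers, one palloc0() per element. */
	PartitionTupleRoutingInfo **ptrinfos;

	ptrinfos = (PartitionTupleRoutingInfo **)
		palloc0(num_parted * sizeof(PartitionTupleRoutingInfo *));
	for (i = 0; i < num_parted; i++)
		ptrinfos[i] = (PartitionTupleRoutingInfo *)
			palloc0(sizeof(PartitionTupleRoutingInfo));

	/* Suggested: array of structs, allocated with a single palloc0(). */
	PartitionTupleRoutingInfo *ptrinfo_array;

	ptrinfo_array = (PartitionTupleRoutingInfo *)
		palloc0(num_parted * sizeof(PartitionTupleRoutingInfo));
	/* element i is then simply &ptrinfo_array[i] */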

> in get_partition_for_tuple(), for example.  Perhaps, we could
> avoid those assignments and implement it the way you suggest.

You mean at these 2 places in get_partition_for_tuple(), right? :

1. /* start with the root partitioned table */
parent = ptrinfos[0];

2. else
parent = ptrinfos[-parent->pd->indexes[cur_index]];

At both of the above places, we could just use &ptrinfos[...] instead of
ptrinfos[...]. But I guess you meant something else.



RelationGetPartitionDispatchInfo() opens all the partitioned tables.
But then ExecSetupPartitionTupleRouting() again opens all the
parents, that is, all the partitioned tables, and closes them back.

Instead, would it be possible to do this: instead of the
PartitionDispatch->parentoid field, PartitionDispatch could have a
parentrel Relation field, which would point to the reldesc field of one
of the pds[] elements.



For me, the CopyStateData->ptrinfos and ModifyTableState.mt_ptrinfos
field names sound confusing. How about part_tuple_routing_info or just
tuple_routing_info ?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] expanding inheritance in partition bound order

2017-08-18 Thread Amit Khandekar
On 18 August 2017 at 10:55, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/08/18 4:54, Robert Haas wrote:
>> On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat
>> <ashutosh.ba...@enterprisedb.com> wrote:
>>> [2] had a patch with some changes to the original patch you posted. I
>>> didn't describe those changes in my mail, since they rearranged the
>>> comments. Those changes are not part of this patch and you haven't
>>> comments about those changes as well. If you have intentionally
>>> excluded those changes, it's fine. In case, you haven't reviewed them,
>>> please see if they are good to be incorporated.
>>
>> I took a quick look at your version but I think I like Amit's fine the
>> way it is, so committed that and back-patched it to v10.
>
> Thanks for committing.
>
>> I find 0002 pretty ugly as things stand.  We get a bunch of tuple maps
>> that we don't really need, only to turn around and free them.  We get
>> a bunch of tuple slots that we don't need, only to turn around and
>> drop them.  We don't really need the PartitionDispatch objects either,
>> except for the OIDs they contain.  There's a lot of extra stuff being
>> computed here that is really irrelevant for this purpose.  I think we
>> should try to clean that up somehow.
>
> One way to do that might be to reverse the order of the remaining patches
> and put the patch to refactor RelationGetPartitionDispatchInfo() first.
> With that refactoring, PartitionDispatch itself has become much simpler in
> that it does not contain a relcache reference to be closed eventually by
> the caller, the tuple map, and the tuple table slot.  Since those things
> are required for tuple-routing, the refactoring makes
> ExecSetupPartitionTupleRouting itself create them from the (minimal)
> information returned by RelationGetPartitionDispatchInfo and ultimately
> destroy when done using it.  I kept the indexes field in
> PartitionDispatchData though, because it's essentially free to create
> while we are walking the partition tree in
> RelationGetPartitionDispatchInfo() and it seems undesirable to make the
> caller compute that information (indexes) by traversing the partition tree
> all over again, if it doesn't otherwise have to.  I am still considering
> some counter-arguments raised by Amit Khandekar about this last assertion.
>
> Thoughts?

Another approach (which I have used in the update-partition-key patch)
is to *not* generate the oids beforehand, and instead call a
partition_walker_next() function to traverse the tree. Each call to
next() would return a ChildInfo that includes the child oid, parent
oid, etc. All users of this would be guaranteed a fixed order of oids.
In the update-partition-key patch, I am opening and closing each of the
children, which, of course, we need to avoid.
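
A rough sketch of that iterator-style interface (ChildInfo and
partition_walker_next() are names from the update-partition-key patch,
but the fields and signatures shown here are only illustrative, not the
actual ones):

/* Illustrative only: walk the partition tree in a fixed (bound) order. */
typedef struct ChildInfo
{
    Oid     childoid;       /* current child partition */
    Oid     parentoid;      /* its immediate parent */
} ChildInfo;

typedef struct PartitionWalker PartitionWalker;     /* opaque walker state */

extern PartitionWalker *partition_walker_begin(Relation rootrel);
extern ChildInfo *partition_walker_next(PartitionWalker *walker);

/* Usage: every caller sees the children in the same fixed order. */
static void
process_partitions(Relation rootrel)
{
    PartitionWalker *walker = partition_walker_begin(rootrel);
    ChildInfo  *child;

    while ((child = partition_walker_next(walker)) != NULL)
    {
        /* process child->childoid; its parent is child->parentoid */
    }
}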




-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] expanding inheritance in partition bound order

2017-08-17 Thread Amit Khandekar
On 18 August 2017 at 01:24, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Aug 17, 2017 at 8:39 AM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>> [2] had a patch with some changes to the original patch you posted. I
>> didn't describe those changes in my mail, since they rearranged the
>> comments. Those changes are not part of this patch and you haven't
>> comments about those changes as well. If you have intentionally
>> excluded those changes, it's fine. In case, you haven't reviewed them,
>> please see if they are good to be incorporated.
>
> I took a quick look at your version but I think I like Amit's fine the
> way it is, so committed that and back-patched it to v10.
>
> I find 0002 pretty ugly as things stand.  We get a bunch of tuple maps
> that we don't really need, only to turn around and free them.  We get
> a bunch of tuple slots that we don't need, only to turn around and
> drop them.

I think in the final changes after applying all 3 patches, the
redundant tuple slot is no longer present. But ...
> We don't really need the PartitionDispatch objects either,
> except for the OIDs they contain.  There's a lot of extra stuff being
> computed here that is really irrelevant for this purpose.  I think we
> should try to clean that up somehow.
... I am of the same opinion. That's why - as I mentioned upthread - I
was thinking why not have a separate light-weight function to just
generate oids, and keep RelationGetPartitionDispatchInfo() intact, to
be used only for tuple routing.

But I haven't yet checked Ashutosh's requirements, which suggest that
even getting just the oid list does not help.

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] expanding inheritance in partition bound order

2017-08-17 Thread Amit Khandekar
On 17 August 2017 at 06:39, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> Hi Amit,
>
> Thanks for the comments.
>
> On 2017/08/16 20:30, Amit Khandekar wrote:
>> On 16 August 2017 at 11:06, Amit Langote <langote_amit...@lab.ntt.co.jp> 
>> wrote:
>>
>>> Attached updated patches.
>>
>> Thanks Amit for the patches.
>>
>> I too agree with the overall approach taken for keeping the locking
>> order consistent: it's best to do the locking with the existing
>> find_all_inheritors() since it is much cheaper than to lock them in
>> partition-bound order, the later being expensive since it requires
>> opening partitioned tables.
>
> Yeah.  Per the Robert's design idea, we will always do the *locking* in
> the order determined by find_all_inheritors/find_inheritance_children.
>
>>> I haven't yet done anything about changing the timing of opening and
>>> locking leaf partitions, because it will require some more thinking about
>>> the required planner changes.  But the above set of patches will get us
>>> far enough to get leaf partition sub-plans appear in the partition bound
>>> order (same order as what partition tuple-routing uses in the executor).
>>
>> So, I believe none of the changes done in pg_inherits.c are essential
>> for expanding the inheritence tree in bound order, right ?
>
> Right.
>
> The changes to pg_inherits.c are only about recognizing partitioned tables
> in an inheritance hierarchy and putting them ahead in the returned list.
> Now that I think of it, the title of the patch (teach pg_inherits.c about
> "partitioning") sounds a bit confusing.  In particular, the patch does not
> teach it things like partition bound order, just that some tables in the
> hierarchy could be partitioned tables.
>
>> I am not
>> sure whether we are planning to commit these two things together or
>> incrementally :
>> 1. Expand in bound order
>> 2. Allow for locking only the partitioned tables first.
>>
>> For #1, the changes in pg_inherits.c are not required (viz, keeping
>> partitioned tables at the head of the list, adding inhchildparted
>> column, etc).
>
> Yes.  Changes to pg_inherits.c can be committed completely independently
> of either 1 or 2, although then there would be nobody using that capability.
>
> About 2: I think the capability to lock only the partitioned tables in
> expand_inherited_rtentry() will only be used once we have the patch to do
> the necessary planner restructuring that will allow us to defer child
> table locking to some place that is not expand_inherited_rtentry().  I am
> working on that patch now and should be able to show something soon.
>
>> If we are going to do #2 together with #1, then in the patch set there
>> is no one using the capability introduced by #2. That is, there are no
>> callers of find_all_inheritors() that make use of the new
>> num_partitioned_children parameter. Also, there is no boolean
>> parameter  for find_all_inheritors() to be used to lock only the
>> partitioned tables.
>>
>> I feel we should think about
>> 0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and
>> first get the review done for the other patches.
>
> I think that makes sense.
>
>> I see that RelationGetPartitionDispatchInfo() now returns quite a
>> small subset of what it used to return, which is good. But I feel for
>> expand_inherited_rtentry(), RelationGetPartitionDispatchInfo() is
>> still a bit heavy. We only require the oids, so the
>> PartitionedTableInfo data structure (including the pd->indexes array)
>> gets discarded.
>
> Maybe we could make the output argument optional, but I don't see much
> point in being too conservative here.  Building the indexes array does not
> cost us that much and if a not-too-distant-in-future patch could use that
> information somehow, it could do so for free.

Ok, so these changes are kept mostly with some future use-cases in
mind. Otherwise, I was thinking we could just have a light-weight
function to generate the oids, and keep the current
RelationGetPartitionDispatchInfo() intact.

Anyway, some more comments:

In ExecSetupPartitionTupleRouting(), I am not sure why the ptrinfos
array is an array of pointers. Why can't it be an array of
PartitionTupleRoutingInfo structures rather than of pointers to that
structure?

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
+ * Close all the leaf partitions and their indices.
*
The above comment needs to be moved down a bit, to the subsequent "for"
loop where it actually applies.

* node->mt_partition_dispatch_info[0] corresponds 

Re: [HACKERS] Parallel Append implementation

2017-08-17 Thread Amit Khandekar
On 16 August 2017 at 18:34, Robert Haas <robertmh...@gmail.com> wrote:
> Thanks for the benchmarking results!
>
> On Tue, Aug 15, 2017 at 11:35 PM, Rafia Sabih
> <rafia.sa...@enterprisedb.com> wrote:
>> Q4 | 244 | 12 | PA and PWJ, time by only PWJ - 41
>
> 12 seconds instead of 244?  Whoa.  I find it curious that we picked a
> Parallel Append with a bunch of non-partial plans when we could've
> just as easily picked partial plans, or so it seems to me.  To put
> that another way, why did we end up with a bunch of Bitmap Heap Scans
> here instead of Parallel Bitmap Heap Scans?
>
>> Q7 | 134 | 88 | PA only
>> Q18 | 508 | 489 | PA only
>
> What's interesting in these results is that the join order changes
> quite a lot when PA is in the mix, and I don't really see why that
> should happen.

Yes, it seems hard to determine why exactly the join order changes.
Parallel Append is expected to give a benefit especially when there
are no partial subplans. But for all of the cases here, partial
subplans seem possible, and so even on HEAD it executed Partial
Append. So between a Parallel Append having partial subplans and a
Partial Append having partial subplans, the cost difference would not
be significant. Even if we assume that Parallel Append was chosen
because its cost turned out to be a bit cheaper, the actual
performance gain seems quite large compared to the expected cost
difference. So it is even possible that the performance gain is due
to some other reason. I will investigate this, and the other queries.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] expanding inheritance in partition bound order

2017-08-16 Thread Amit Khandekar
On 16 August 2017 at 11:06, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:

> Attached updated patches.

Thanks Amit for the patches.

I too agree with the overall approach taken for keeping the locking
order consistent: it's best to do the locking with the existing
find_all_inheritors(), since that is much cheaper than locking them in
partition-bound order, the latter being expensive since it requires
opening the partitioned tables.

> I haven't yet done anything about changing the timing of opening and
> locking leaf partitions, because it will require some more thinking about
> the required planner changes.  But the above set of patches will get us
> far enough to get leaf partition sub-plans appear in the partition bound
> order (same order as what partition tuple-routing uses in the executor).

So, I believe none of the changes done in pg_inherits.c are essential
for expanding the inheritance tree in bound order, right? I am not
sure whether we are planning to commit these two things together or
incrementally:
1. Expand in bound order
2. Allow for locking only the partitioned tables first.

For #1, the changes in pg_inherits.c are not required (viz, keeping
partitioned tables at the head of the list, adding inhchildparted
column, etc).

If we are going to do #2 together with #1, then in the patch set
nothing uses the capability introduced by #2. That is, there are no
callers of find_all_inheritors() that make use of the new
num_partitioned_children parameter. Also, there is no boolean
parameter for find_all_inheritors() to be used to lock only the
partitioned tables.

I feel we should think about
0002-Teach-pg_inherits.c-a-bit-about-partitioning.patch later, and
first get the review done for the other patches.

---

I see that RelationGetPartitionDispatchInfo() now returns quite a
small subset of what it used to return, which is good. But I feel for
expand_inherited_rtentry(), RelationGetPartitionDispatchInfo() is
still a bit heavy. We only require the oids, so the
PartitionedTableInfo data structure (including the pd->indexes array)
gets discarded.

Also, RelationGetPartitionDispatchInfo() has to call get_rel_relkind()
for each child. In expand_inherited_rtentry(), we anyway have to open
all the child tables, so we get the partition descriptors for each of
the children for free. So how about, in expand_inherited_rtentry(), we
traverse the partition tree using these descriptors, similar to how it
is traversed in RelationGetPartitionDispatchInfo()? Maybe, to avoid
duplicating the traversal code, we can have a common API.

Still looking at RelationGetPartitionDispatchInfo() changes ...


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-08-09 Thread Amit Khandekar
On 9 August 2017 at 19:05, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Jul 5, 2017 at 7:53 AM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>>> This is not applicable on the latest head i.e. commit --
>>> 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7, looks like need a rebasing.
>>
>> Thanks for notifying. Attached is the rebased version of the patch.
>
> This again needs a rebase.

Attached rebased version of the patch. Thanks.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v13_rebased_2.patch
Description: Binary data



Re: [HACKERS] expanding inheritance in partition bound order

2017-08-06 Thread Amit Khandekar
On 4 August 2017 at 22:55, Robert Haas <robertmh...@gmail.com> wrote:
>
> 1. Before calling RelationGetPartitionDispatchInfo, the calling code
> should use find_all_inheritors to lock all the relevant relations (or
> the planner could use find_all_inheritors to get a list of relation
> OIDs, store it in the plan in order, and then at execution time we
> visit them in that order and lock them).
>
> 2. RelationGetPartitionDispatchInfo assumes the relations are already locked.

I agree. Overall, I think we should keep
RelationGetPartitionDispatchInfo() only for preparing the dispatch
info in the planner, and generate the locked oids (using
find_all_inheritors() or get_partitioned_oids() or whatever) *without*
using RelationGetPartitionDispatchInfo(), since
RelationGetPartitionDispatchInfo() generates the pd structure, which
we don't want in every expansion.

>
> 3. While we're optimizing, in the first loop inside of
> RelationGetPartitionDispatchInfo, don't call heap_open().  Instead,
> use get_rel_relkind() to see whether we've got a partitioned table; if
> so, open it.  If not, there's no need.

Yes, this way we need to open only the partitioned tables.

> P.S. While I haven't reviewed 0002 in detail, I think the concept of
> minimizing what needs to be built in RelationGetPartitionDispatchInfo
> is a very good idea.

True. I also think RelationGetPartitionDispatchInfo() should be
called while preparing the ModifyTable plan; the PartitionDispatch
data structure returned by RelationGetPartitionDispatchInfo() should
be stored in that plan, and then the execution-time fields in
PartitionDispatch would be populated in ExecInitModifyTable().


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-08-04 Thread Amit Khandekar
>
> Below are the TODOS at this point :
>
> Fix for bug reported by Rajkumar about update with join.

I had explained the root issue of this bug here: [1]

The attached patch includes the fix, which is explained below.
Currently the patch checks whether the tuple has been concurrently
deleted by another session, i.e. when heap_delete() returns
HeapTupleUpdated; in that case we set the concurrently_deleted output
param to true. We should also do the same for the HeapTupleSelfUpdated
return value.

In fact, there are other places where ExecDelete() can return without
doing anything. For example, if a BR DELETE trigger prevents the
delete from happening, ExecBRDeleteTriggers() returns false, in which
case ExecDelete() returns.

So what the fix does is: rename the concurrently_deleted parameter to
delete_skipped so as to indicate a more general status: whether the
delete actually happened or was skipped. The parameter reports the
delete as skipped unless the delete actually happens. This also lets
us avoid inserting a new row in the trigger case.

Added test scenario for UPDATE with JOIN case, and also TRIGGER case.
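
To make the intended interface concrete, here is a sketch of how the
caller would use it (the parameter name is as proposed above; the
surrounding code and argument lists are simplified, not the actual
patch):

/*
 * Sketch only: ExecUpdate() moving a row to another partition.
 * ExecDelete() reports through delete_skipped whether the old row was
 * actually deleted; we insert into the new partition only in that case,
 * so a skipped delete (BR trigger returned false, or the row was
 * already deleted) never produces an extra row.
 */
bool        delete_skipped;

/* Delete the row from the old partition (argument list simplified). */
ExecDelete(slot, planSlot, estate, &delete_skipped);

if (delete_skipped)
    return NULL;            /* nothing was deleted, so nothing to move */

/* Route the updated row into the new partition. */
ExecInsert(slot, planSlot, estate);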

> Do something about two separate mapping tables for Transition tables
> and update tuple-routing.
On 1 July 2017 at 03:15, Thomas Munro  wrote:
> Would make sense to have a set of functions with names like
> GetConvertor{From,To}{Subplan,Leaf}(mtstate, index) which build arrays
> m_convertors_{from,to}_by_{subplan,leaf} the first time they need
> them?

This was discussed here: [2]. I think even if we build them only when
needed, we still need separate arrays when both tuple routing and
transition tables are present. So, rather than dynamic arrays, we can
have static arrays whose elements point to a shared TupleConversionMap
structure whenever possible.
As the patch already stands, in case of insert/update tuple routing
there is a per-leaf-partition mt_transition_tupconv_maps array for
transition tables, and a separate per-subplan array mt_resultrel_maps
for update tuple routing. *But* what I am proposing is: for a
mt_transition_tupconv_maps[] element whose leaf partition also exists
as a per-subplan result rel, that array element and the corresponding
mt_resultrel_maps[] element will point to the same TupleConversionMap
structure.

This is quite similar to how we are re-using the per-subplan
resultrels for the per-leaf result rels. We will re-use the
per-subplan TupleConversionMap for the per-leaf
mt_transition_tupconv_maps[] elements.
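
A small sketch of the sharing I mean (array names as in the patch;
subplan_index() is just a hypothetical stand-in for however we locate
the matching per-subplan result rel, and leaf_rel/rootrel are likewise
placeholders):

for (i = 0; i < num_leaf_parts; i++)
{
    int     j = subplan_index(i);   /* hypothetical: -1 if leaf i has no subplan */

    if (j >= 0)
    {
        /* Leaf i is also a per-subplan result rel: share one map. */
        mtstate->mt_transition_tupconv_maps[i] =
            mtstate->mt_resultrel_maps[j];
    }
    else
    {
        /* Otherwise build the leaf-to-root map as we do today. */
        mtstate->mt_transition_tupconv_maps[i] =
            convert_tuples_by_name(RelationGetDescr(leaf_rel),
                                   RelationGetDescr(rootrel),
                                   gettext_noop("could not convert row type"));
    }
}

One thing to be careful about with such sharing is that the shared map
must then be freed only once.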

I have not implemented this yet.

> GetUpdatedColumns() to be moved to header file.

Done. I have moved it to execnodes.h.

> More test scenarios in regression tests.
> Need to check/test whether we are correctly applying insert policies
> (ant not update) while inserting a routed tuple.

The above two are yet to be done.

> Use getASTriggerResultRelInfo() for attrno mapping, rather than first
> resultrel, for generating child WCO/RETURNING expression.
>

Regarding generating the child WithCheckOption and Returning
expressions from those of the root result relation: ModifyTablePath and
ModifyTable would need new fields rootReturningList (and
rootWithCheckOptions), which would be derived from
root->parse->returningList in inheritance_planner(). But then, similar
to the per-subplan returningList, rootReturningList would have to pass
through set_plan_refs()=>set_returning_clause_references(), which
requires the subplan targetlist to be passed. Because of this, for
rootReturningList we would require a subplan for the root partition,
which is not there currently; we have subplans only for the child rels.
That means we would have to create such a plan only for the sake of
generating rootReturningList.

The other option is to do it the way the patch currently does in the
executor: use the returningList of the first per-subplan result rel to
generate the other child returningLists (and WithCheckOptions). This
works by applying map_partition_varattnos() to the first
returningList. But now that we have realized that whole-row vars need
special handling, map_partition_varattnos() would need some changes to
convert whole-row vars differently for child-rel-to-child-rel mapping.
For childrel-to-childrel conversion, the whole-row var is already
wrapped by a ConvertRowtypeExpr, but we need to change its Var->vartype
to the new child's vartype.

I think the second option looks easier, but I am open to suggestions,
and I am myself still checking the first one.

> Address Robert's review comments on make_resultrel_ordered.patch.
>
> +typedef struct ParentChild
>
> This is a pretty generic name.  Pick something more specific and informative.

I have used ChildPartitionInfo. But suggestions welcome.

>
> +static List *append_rel_partition_oids(List *rel_list, Relation rel);
>
> One could be forgiven for thinking that this function was just going
> to append OIDs, but it actually appends ParentChild structures, so I
> think the name needs work.

Renamed it to append_child_partitions().

>
> +List 

Re: [HACKERS] map_partition_varattnos() and whole-row vars

2017-08-02 Thread Amit Khandekar
On 3 August 2017 at 11:00, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> Thanks for the review.
>
> On 2017/08/03 13:54, Amit Khandekar wrote:
>> On 2 August 2017 at 11:51, Amit Langote wrote:
>>> On 2017/08/02 1:33, Amit Khandekar wrote:
>>>> Instead of callers of map_partition_varattnos() reporting error, we
>>>> can have map_partition_varattnos() itself report error. Instead of the
>>>> found_whole_row parameter of map_partition_varattnos(), we can have
>>>> error_on_whole_row parameter. So callers who don't expect whole row,
>>>> would pass error_on_whole_row=true to map_partition_varattnos(). This
>>>> will simplify the resultant code a bit.
>>>
>>> So, I think it's only the callers of
>>> map_partition_varattnos() who know whether finding whole-row vars is an
>>> error and *what kind* of error.
>>
>> Ok. Got it. Thanks for the explanation.
>>
>> How about making found_whole_row as an optional parameter ?
>> map_partition_varattnos() would populate it only if its not NULL. This
>> way, callers who don't bother about whether it is found or not don't
>> have to declare a variable and pass its address. In current code,
>> ExecInitModifyTable() is the one who has to declare found_whole_row
>> without needing it.
>
> Sure, done.

Looks good.

> But while implementing that, I avoided changing anything in
> map_variable_attnos(), such as adding a check for not-NULLness of
> found_whole_row.  The only check added is in map_partition_varattnos().

Yes, I agree. That is anyway not relevant to the fix.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] map_partition_varattnos() and whole-row vars

2017-08-02 Thread Amit Khandekar
On 2 August 2017 at 11:51, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> Thanks Fuita-san and Amit for reviewing.
>
> On 2017/08/02 1:33, Amit Khandekar wrote:
>> On 1 August 2017 at 15:11, Etsuro Fujita <fujita.ets...@lab.ntt.co.jp> wrote:
>>> On 2017/07/31 18:56, Amit Langote wrote:
>>>> Yes, that's what's needed here.  So we need to teach
>>>> map_variable_attnos_mutator() to convert whole-row vars just like it's
>>>> done in adjust_appendrel_attrs_mutator().
>>>
>>>
>>> Seems reasonable.  (Though I think it might be better to do this kind of
>>> conversion in the planner, not the executor, because that would increase the
>>> efficiency of cached plans.)
>
> That's a good point, although it sounds like a bigger project that, IMHO,
> should be undertaken separately, because that would involve designing for
> considerations of expanding inheritance even in the INSERT case.
>
>> I think the work of shifting to planner should be taken as a different
>> task when we shift the preparation of DispatchInfo to the planner.
>
> Yeah, I think it'd be a good idea to do those projects together.  That is,
> doing what Fujita-san suggested and expanding partitioned tables in
> partition bound order in the planner.
>
>>>> Attached 2 patches:
>>>>
>>>> 0001: Addresses the bug that Rajkumar reported (handling whole-row vars in
>>>>WITH CHECK and RETURNING expressions at all)
>>>>
>>>> 0002: Addressed the bug that Amit reported (converting whole-row vars
>>>>that could occur in WITH CHECK and RETURNING expressions)
>>>
>>>
>>> I took a quick look at the patches.  One thing I noticed is this:
>>>
>>>  map_variable_attnos(Node *node,
>>> int target_varno, int sublevels_up,
>>> const AttrNumber *attno_map, int
>>> map_length,
>>> +   Oid from_rowtype, Oid to_rowtype,
>>> bool *found_whole_row)
>>>  {
>>> map_variable_attnos_context context;
>>> @@ -1291,6 +1318,8 @@ map_variable_attnos(Node *node,
>>> context.sublevels_up = sublevels_up;
>>> context.attno_map = attno_map;
>>> context.map_length = map_length;
>>> +   context.from_rowtype = from_rowtype;
>>> +   context.to_rowtype = to_rowtype;
>>> context.found_whole_row = found_whole_row;
>>>
>>> You added two arguments to pass to map_variable_attnos(): from_rowtype and
>>> to_rowtype.  But I'm not sure that we really need the from_rowtype argument
>>> because it's possible to get the Oid from the vartype of target whole-row
>>> Vars.  And in this:
>>>
>>> +   /*
>>> +* If the callers expects us to convert the
>>> same, do so if
>>> +* necessary.
>>> +*/
>>> +   if (OidIsValid(context->to_rowtype) &&
>>> +   OidIsValid(context->from_rowtype) &&
>>> +   context->to_rowtype !=
>>> context->from_rowtype)
>>> +   {
>>> +   ConvertRowtypeExpr *r =
>>> makeNode(ConvertRowtypeExpr);
>>> +
>>> +   r->arg = (Expr *) newvar;
>>> +   r->resulttype =
>>> context->from_rowtype;
>>> +   r->convertformat =
>>> COERCE_IMPLICIT_CAST;
>>> +   r->location = -1;
>>> +   /* Make sure the Var node has the
>>> right type ID, too */
>>> +   newvar->vartype =
>>> context->to_rowtype;
>>> +   return (Node *) r;
>>> +   }
>>>
>>> I think we could set r->resulttype to the vartype (ie, "r->resulttype =
>>> newvar->vartype" instead of "r->resulttype = context->from_rowtype").
>>
>> I agree.
>
> You're right, from_rowtype is unnecessary.
>
>> ---
>>
>> Few more comments :
>>
>> @

Re: [HACKERS] UPDATE of partition key

2017-08-02 Thread Amit Khandekar
On 2 August 2017 at 14:38, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/07/29 2:45, Amit Khandekar wrote:
>> On 28 July 2017 at 20:10, Robert Haas <robertmh...@gmail.com> wrote:
>>> On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote wrote:
>>>> I checked that we get the same result relation order with both the
>>>> patches, but I would like to highlight a notable difference here between
>>>> the approaches taken by our patches.  In my patch, I have now taught
>>>> RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
>>>> in the tree, because we need to look at its partition descriptor to
>>>> collect partition OIDs and bounds.  We can defer locking (and opening the
>>>> relation descriptor of) leaf partitions to a point where planner has
>>>> determined that the partition will be accessed after all (not pruned),
>>>> which will be done in a separate patch of course.
>>>
>>> That's very desirable, but I believe it introduces a deadlock risk
>>> which Amit's patch avoids.  A transaction using the code you've
>>> written here is eventually going to lock all partitions, BUT it's
>>> going to move the partitioned ones to the front of the locking order
>>> vs. what find_all_inheritors would do.  So, when multi-level
>>> partitioning is in use, I think it could happen that some other
>>> transaction is accessing the table using a different code path that
>>> uses the find_all_inheritors order without modification.  If those
>>> locks conflict (e.g. query vs. DROP) then there's a deadlock risk.
>>
>> Yes, I agree. Even with single-level partitioning, the leaf partitions
>> ordered by find_all_inheritors() is by oid values, so that's also
>> going to be differently ordered.
>
> We do require to lock the parent first in any case.  Doesn't that prevent
> deadlocks by imparting an implicit order on locking by operations whose
> locks conflict.

Yes, maybe, but I am not too sure at this point. find_all_inheritors()
locks only the children, and the parent is locked separately.
find_all_inheritors() does not require locking the children with the
same lockmode as the parent.

> Having said that, I think it would be desirable for all code paths to
> manipulate partitions in the same order.  For partitioned tables, I think
> we can make it the partition bound order by replacing all calls to
> find_all_inheritors and find_inheritance_children on partitioned table
> parents with something else that reads partition OIDs from the relcache
> (PartitionDesc) and traverses the partition tree breadth-first manner.
>
>>> Unfortunately I don't see any easy way around that problem, but maybe
>>> somebody else has an idea.
>>
>> One approach I had considered was to have find_inheritance_children()
>> itself lock the children in bound order, so that everyone will have
>> bound-ordered oids, but that would be too expensive since it requires
>> opening all partitioned tables to initialize partition descriptors. In
>> find_inheritance_children(), we get all oids without opening any
>> tables. But now that I think more of it, it's only the partitioned
>> tables that we have to open, not the leaf partitions; and furthermore,
>> I didn't see calls to find_inheritance_children() and
>> find_all_inheritors() in performance-critical code, except in
>> expand_inherited_rtentry(). All of them are in DDL commands; but yes,
>> that can change in the future.
>
> This approach more or less amounts to calling the new
> RelationGetPartitionDispatchInfo() (per my proposed patch, a version of
> which I posted upthread.)  Maybe we can add a wrapper on top, say,
> get_all_partition_oids() which throws away other things that
> RelationGetPartitionDispatchInfo() returned.  In addition it locks all the
> partitions that are returned, unlike only the partitioned ones, which is
> what RelationGetPartitionDispatchInfo() has been taught to do.

So there are three different task items here:
1. Arrange the oids in a consistent order everywhere.
2. Prepare the Partition Dispatch Info data structure in the planner
rather than during execution.
3. For update tuple routing, assume that the result rels are ordered
consistently to make the searching efficient.

#3 depends on #1. So for that, I have come up with a minimal set of
changes to have expand_inherited_rtentry() generate the rels in bound
order. When we do #2, we may need to re-do my changes in
expand_inherited_rtentry(), but those changes are minimal. We may
even end up having the walker function used at multiple places,
but right now it is not ce

Re: [HACKERS] map_partition_varattnos() and whole-row vars

2017-08-01 Thread Amit Khandekar
wvar->vartype =
> context->to_rowtype;
> +   return (Node *) r;
> +   }
>
> I think we could set r->resulttype to the vartype (ie, "r->resulttype =
> newvar->vartype" instead of "r->resulttype = context->from_rowtype").

I agree.

---

Few more comments :

@@ -1240,7 +1247,7 @@ map_variable_attnos_mutator(Node *node,
 var->varlevelsup == context->sublevels_up)
 {
 /* Found a matching variable, make the substitution */

- Var*newvar = (Var *) palloc(sizeof(Var));
+ Var*newvar = copyObject(var);
  int attno = var->varattno;

 *newvar = *var;

Here, "*newvar = *var" should be removed.


---

-   result = map_partition_varattnos(result, 1, rel, parent);
+   result = map_partition_varattnos(result, 1, rel, parent,
+                                    &found_whole_row);
+   /* There can never be a whole-row reference here */
+   if (found_whole_row)
+       elog(ERROR, "unexpected whole-row reference found in partition key");

Instead of callers of map_partition_varattnos() reporting error, we
can have map_partition_varattnos() itself report error. Instead of the
found_whole_row parameter of map_partition_varattnos(), we can have
error_on_whole_row parameter. So callers who don't expect whole row,
would pass error_on_whole_row=true to map_partition_varattnos(). This
will simplify the resultant code a bit.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-07-28 Thread Amit Khandekar
automatically solved.



Below are the TODOS at this point :

Fix for bug reported by Rajkumar about update with join.
Do something about two separate mapping tables for Transition tables
and update tuple-routing.
GetUpdatedColumns() to be moved to header file.
More test scenarios in regression tests.
Need to check/test whether we are correctly applying insert policies
(and not update) while inserting a routed tuple.
Use getASTriggerResultRelInfo() for attrno mapping, rather than first
resultrel, for generating child WCO/RETURNING expression.
Address Robert's review comments on make_resultrel_ordered.patch.
pgindent.

[1] 
https://www.postgresql.org/message-id/d86d27ea-cc9d-5dbe-b131-e7dec4017983%40lab.ntt.co.jp

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] map_partition_varattnos() and whole-row vars

2017-07-28 Thread Amit Khandekar
On 28 July 2017 at 11:17, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/07/26 16:58, Amit Langote wrote:
>> Rajkumar Raghuwanshi reported [1] on the "UPDATE partition key" thread
>> that whole-row vars don't play nicely with partitioned table operations.
>>
>> For example, when used to convert WITH CHECK OPTION constraint expressions
>> and RETURNING target list expressions, it will error out if the
>> expressions contained whole-row vars.  That's a bug, because whole-row
>> vars are legal in those expressions.  I think the decision to output error
>> upon encountering a whole-row in the input expression was based on the
>> assumption that it will only ever be used to convert partition constraint
>> expressions, which cannot contain those.  So, we can relax that
>> requirement so that its other users are not bitten by it.
>>
>> Attached fixes that.
>
> Attached a revised version of the patch.
>
> Updated patch moves the if (found_whole_row) elog(ERROR) bit in
> map_partition_varattnos to the callers.  Only the callers know if
> whole-row vars are not expected in the expression it's getting converted
> from map_partition_varattnos.  Given the current message text (mentioning
> "partition key"), it didn't seem appropriate to have the elog inside this
> function, since it's callable from places where it wouldn't make sense.

create table foo (a int, b text) partition by list (a);
create table foo1 partition of foo for values in (1);
create table foo2(b text, a int) ;
alter table foo attach partition foo2 for values in (2);

postgres=# insert into foo values (1, 'hi there');
INSERT 0 1
postgres=# insert into foo values (2, 'bi there');
INSERT 0 1
postgres=# insert into foo values (2, 'error there') returning foo;
ERROR:  table row type and query-specified row type do not match
DETAIL:  Table has type text at ordinal position 1, but query expects integer.

This is because the composite type tuple descriptor of the leaf
partition doesn't match the RETURNING slot's tuple descriptor.

We probably need to do what
inheritance_planner()=>adjust_appendrel_attrs() does for adjusting the
attrs in the child rel's parse tree; i.e., handle some specific nodes
other than just simple Var nodes. In adjust_appendrel_attrs_mutator(),
for a whole-row Var node, it updates var->vartype to the child rel's
rowtype.

I suspect that the other nodes that adjust_appendrel_attrs_mutator
handles (like JoinExpr, CurrentOfExpr, etc ) may also require similar
adjustments for our case, because WithCheckOptions (and I think even
RETURNING) can have a subquery.


>
> Thanks,
> Amit
>
>



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-07-26 Thread Amit Khandekar
On 26 July 2017 at 02:37, Robert Haas <robertmh...@gmail.com> wrote:
> Is there any real benefit in this "walker" interface?  It looks to me
> like it might be simpler to just change things around so that it
> returns a list of OIDs, like find_all_inheritors, but generated
> differently.  Then if you want bound-ordering rather than
> OID-ordering, you just do this:
>
> list_free(inhOids);
> inhOids = get_partition_oids_in_bound_order(rel);
>
> That'd remove the need for some if/then logic as you've currently got
> in get_next_child().

Yes, I had considered that; i.e., first generating just a list of
bound-ordered oids. But that requires all the child tables to be
opened and closed twice: once during the list generation, and then
while expanding the partitioned table. Agreed, the second time
heap_open() would not be that expensive because the tables would be
cached, but it would still require fetching the cached relation handle
from the hash table. Since we anyway want to open the tables, it is
better to have a *next() function that fetches the next partition in a
fixed order.

Actually, there isn't much that the walker next() function does. Any
code that wants to traverse bound-wise can do that on its own. The
walker function is just a convenient way to make sure everyone
traverses in the same order.

Yet to go over other things including your review comments, and Amit
Langote's patch on refactoring RelationGetPartitionDispatchInfo().

> On another note, did you do anything about the suggestion Thomas made
> in 
> http://postgr.es/m/CAEepm=3sc_j1zwqdyrbu4dtfx5rhcamnnuaxrkwzfgt9m23...@mail.gmail.com
> ?

This is still pending on me; plus I think there are some more points.
I need to go over those and consolidate a list of todos.




-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-07-25 Thread Amit Khandekar
On 25 July 2017 at 15:02, Rajkumar Raghuwanshi
<rajkumar.raghuwan...@enterprisedb.com> wrote:
> On Mon, Jul 24, 2017 at 11:23 AM, Amit Khandekar <amitdkhan...@gmail.com>
> wrote:
>>
>>
>> Attached update-partition-key_v13.patch now contains this
>> make_resultrels_ordered.patch changes.
>>
>
> I have applied attach patch and got below observation.
>
> Observation :  if join producing multiple output rows for a given row to be
> modified. I am seeing here it is updating a row and also inserting rows in
> target table. hence after update total count of table got incremented.

Thanks for catching this Rajkumar.

So once the row to be updated has already been moved to another
partition, and the next join output row corresponds to that same row,
the row is already deleted, so ExecDelete()=>heap_delete() gets
HeapTupleSelfUpdated, and this is not handled. So even when
ExecDelete() finds that the row is already deleted, we still call
ExecInsert(), and a new row gets inserted. In ExecDelete(), we should
indicate that the row is already deleted. In the existing patch, there
is a concurrently_deleted parameter of ExecDelete() which indicates
that the row was concurrently deleted. I think we can use this
parameter for both of these purposes, so as to avoid ExecInsert() in
both of these scenarios. Will work on a patch.




Re: [HACKERS] Oddity in error handling of constraint violation in ExecConstraints for partitioned tables

2017-07-24 Thread Amit Khandekar
On 24 July 2017 at 12:11, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> Hi Amit,
>
> On 2017/07/24 14:09, Amit Khandekar wrote:
>>>> On 2017/07/10 14:15, Etsuro Fujita wrote:
>>>> Another thing I noticed is the error handling in ExecWithCheckOptions; it
>>>> doesn't take any care for partition tables, so the column description in
>>>> the error message for WCO_VIEW_CHECK is built using the partition table's
>>>> rowtype, not the root table's.  But I think it'd be better to build that
>>>> using the root table's rowtype, like ExecConstraints, because that would
>>>> make the column description easier to understand since the parent view(s)
>>>> (from which WithCheckOptions evaluated there are created) would have been
>>>> defined on the root table.  This seems independent from the above issue,
>>>> so I created a separate patch, which I'm attaching. What do you think
>>>> about that?
>>
>> + if (map != NULL)
>> + {
>> + tuple = do_convert_tuple(tuple, map);
>> + ExecStoreTuple(tuple, slot, InvalidBuffer, false);
>> + }
>>
>> Above, the tuple descriptor also needs to be set, since the parent and
>> child partitions can have different column ordering.
>>
>> Due to this, the following testcase crashes :
>>
>> CREATE TABLE range_parted (a text,b int, c int) partition by range (b);
>> CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c >
>> 120) WITH CHECK OPTION;
>> create table part_a_1_a_10(b int, c int, a text);
>> alter table range_parted attach partition part_a_1_a_10 for values
>> from (1) to (10);
>> insert into upview values ('a', 2, 100);
>
> Good catch.  Thanks for creating the patch.
>
> There are other places with similar code viz. those in ExecConstraints()
> that would need fixing.

Ah ok. I should have noticed that. Thanks.

> Attached is the updated version of your patch.
Now that this is done, is there any particular reason it is not done in
ExecPartitionCheck()? I see that there is a do_convert_tuple() in
that function, again without changing the slot descriptor.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-07-23 Thread Amit Khandekar
On 13 July 2017 at 22:39, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> Attached is a WIP patch (make_resultrels_ordered.patch) that generates
> the result rels in canonical order. This patch is kept separate from
> the update-partition-key patch, and can be applied on master branch.

Attached update-partition-key_v13.patch now contains this
make_resultrels_ordered.patch changes.

So now that the per-subplan result rels and the leaf partition oids
that are generated for tuple routing are both known to have the same
(canonical) order, ExecSetupPartitionTupleRouting() can look up the
per-subplan result rels without needing a hash table. Instead of the
hash table, we iterate over the leaf partition oids and at the same
time keep shifting a position over the per-subplan resultrels,
advancing it whenever the resultrel at that position is found in the
leaf partition list. Since the two lists are in the same order, we
never have to rescan the portion of the lists that has already been
scanned.
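
In essence this is a single merge-style pass over two lists that are in
the same order; roughly like the following (illustrative names only,
not the exact code in the patch):

/*
 * Sketch only: leaf_part_oids[] and the per-subplan result rels are
 * both in canonical (bound) order, so one forward pass matches them
 * without a hash table.
 */
int     update_rri_index = 0;   /* position over per-subplan result rels */

for (i = 0; i < num_leaf_parts; i++)
{
    if (update_rri_index < num_update_rri &&
        RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) ==
        leaf_part_oids[i])
    {
        /* This leaf already has an UPDATE result rel: reuse it. */
        partitions[i] = &update_rri[update_rri_index];
        update_rri_index++;
    }
    else
    {
        /* Not an UPDATE target: set up a new result rel for routing. */
        partitions[i] = make_routing_result_rel(leaf_part_oids[i]); /* hypothetical */
    }
}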

I considered whether the issue behind this recent commit might be
relevant for update tuple-routing as well:
commit f81a91db4d1c2032632aa5df9fc14be24f5fe5ec
Author: Robert Haas <rh...@postgresql.org>
Date:   Mon Jul 17 21:29:45 2017 -0400
Use a real RT index when setting up partition tuple routing.

Since we know that using a dummy 1 value for tuple routing result rels
is not correct, I am checking another possibility: in the latest
patch, the tuple-routing partitions are a mix of a) existing update
result rels, and b) new partition resultrels. The 'b' resultrels have
the RT index of nominalRelation, but the existing 'a' resultrels have
their own, different RT indexes. I suspect this might surface an issue
similar to the one fixed by the above commit, e.g. with a WITH query
having UPDATE subqueries doing tuple routing. Will check that.

This patch also has Robert's changes in the planner to decide whether
to do update tuple routing.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


update-partition-key_v13.patch
Description: Binary data



Re: [HACKERS] Oddity in error handling of constraint violation in ExecConstraints for partitioned tables

2017-07-23 Thread Amit Khandekar
>> On 2017/07/10 14:15, Etsuro Fujita wrote:
>> Another thing I noticed is the error handling in ExecWithCheckOptions; it
>> doesn't take any care for partition tables, so the column description in
>> the error message for WCO_VIEW_CHECK is built using the partition table's
>> rowtype, not the root table's.  But I think it'd be better to build that
>> using the root table's rowtype, like ExecConstraints, because that would
>> make the column description easier to understand since the parent view(s)
>> (from which WithCheckOptions evaluated there are created) would have been
>> defined on the root table.  This seems independent from the above issue,
>> so I created a separate patch, which I'm attaching. What do you think
>> about that?

+ if (map != NULL)
+ {
+ tuple = do_convert_tuple(tuple, map);
+ ExecStoreTuple(tuple, slot, InvalidBuffer, false);
+ }

Above, the tuple descriptor also needs to be set, since the parent and
child partitions can have different column ordering.

Due to this, the following testcase crashes :

CREATE TABLE range_parted (a text,b int, c int) partition by range (b);
CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c >
120) WITH CHECK OPTION;
create table part_a_1_a_10(b int, c int, a text);
alter table range_parted attach partition part_a_1_a_10 for values
from (1) to (10);
insert into upview values ('a', 2, 100);

Attached is a patch that sets the tuple descriptor.
Also, in the updatable_views.sql regression test, I have added a
varchar column to one of the partitioned tables used there, so as to
cover this scenario. Without setting the tuple descriptor, the output
of the patched updatable_views.sql shows a junk value in one of the
columns of the row in the error message emitted for the
WithCheckOption violation.
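
Roughly, the kind of change involved looks like this (a sketch only,
assuming the error-reporting path converts the tuple to the root
table's rowtype; rootrel is a placeholder for however the root relation
is available at that point, and the attached patch is the authoritative
fix):

if (map != NULL)
{
    tuple = do_convert_tuple(tuple, map);

    /*
     * The converted tuple is in the root table's rowtype, so switch
     * the slot's descriptor accordingly before storing the tuple;
     * otherwise a partition with different column ordering shows junk
     * values (or crashes) while building the error message.
     */
    ExecSetSlotDescriptor(slot, RelationGetDescr(rootrel));
    ExecStoreTuple(tuple, slot, InvalidBuffer, false);
}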

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


set_slot_descriptor.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-07-13 Thread Amit Khandekar
On 5 July 2017 at 15:12, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> Like I mentioned upthread... in expand_inherited_rtentry(), if we
> replace find_all_inheritors() with something else that returns oids in
> canonical order, that will change the order in which children tables
> get locked, which increases the chance of deadlock. Because, then the
> callers of find_all_inheritors() will lock them in one order, while
> callers of expand_inherited_rtentry() will lock them in a different
> order. Even in the current code, I think there is a chance of
> deadlocks because RelationGetPartitionDispatchInfo() and
> find_all_inheritors() have different lock ordering.
>
> Now, to get the oids of a partitioned table children sorted by
> canonical ordering, (i.e. using the partition bound values) we need to
> either use the partition bounds to sort the oids like the way it is
> done in RelationBuildPartitionDesc() or, open the parent table and get
> it's Relation->rd_partdesc->oids[] which are already sorted in
> canonical order. So if we generate oids using this way in
> find_all_inheritors() and find_inheritance_children(), that will
> generate consistent ordering everywhere. But this method is quite
> expensive as compared to the way oids are generated and sorted using
> oid values in find_inheritance_children().
>
> In both expand_inherited_rtentry() and
> RelationGetPartitionDispatchInfo(), each of the child tables are
> opened.
>
> So, in both of these functions, what we can do is : call a new
> function partition_tree_walker() which does following :
> 1. Lock the children using the existing order (i.e. sorted by oid
> values) using the same function find_all_inheritors(). Rename
> find_all_inheritors() to lock_all_inheritors(... , bool return_oids)
> which returns the oid list only if requested.
> 2. And then scan through each of the partitions in canonical order, by
> opening the parent table, then opening the partition descriptor oids,
> and then doing whatever needs to be done with that partition rel.
>
> partition_tree_walker() will look something like this :
>
> void partition_tree_walker(Oid parentOid, LOCKMODE lockmode,
>void (*walker_func) (), void *context)
> {
> Relation parentrel;
> List *rels_list;
> ListCell *cell;
>
> (void) lock_all_inheritors(parentOid, lockmode,
>false /* don't generate oids */);
>
> parentrel = heap_open(parentOid, NoLock);
> rels_list = append_rel_partition_oids(NIL, parentrel);
>
> /* Scan through all partitioned rels, and at the
>  * same time append their children. */
> foreach(cell, rels_list)
> {
> /* Open partrel without locking; lock_all_inheritors() has locked it 
> */
> Relationpartrel = heap_open(lfirst_oid(cell), NoLock);
>
> /* Append the children of a partitioned rel to the same list
>  * that we are iterating on */
> if (RelationGetPartitionDesc(partrel))
> rels_list = append_rel_partition_oids(rels_list, partrel);
>
> /*
>  * Do whatever processing needs to be done on this partel.
>  * The walker function is free to either close the partel
>  * or keep it opened, but it needs to make sure the opened
>  * ones are closed later
>  */
> walker_func(partrel, context);
> }
> }
>
> List *append_rel_partition_oids(List *rel_list, Relation rel)
> {
> int i;
> for (i = 0; i < rel->rd_partdesc->nparts; i++)
> rel_list = lappend_oid(rel_list, rel->rd_partdesc->oids[i]);
>
> return rel_list;
> }
>
>
> So, in expand_inherited_rtentry() the foreach(l, inhOIDs) loop will be
> replaced by partition_tree_walker(parentOid, expand_rte_walker_func)
> where expand_rte_walker_func() will do all the work done in the for
> loop for each of the partition rels.
>
> Similarly, in RelationGetPartitionDispatchInfo() the initial part
> where it uses APPEND_REL_PARTITION_OIDS() can be replaced by
> partition_tree_walker(rel, dispatch_info_walkerfunc) where
> dispatch_info_walkerfunc() will generate the oids, or may be populate
> the complete PartitionDispatchData structure. 'pd' variable can be
> passed as context to the partition_tree_walker(..., context)
>
> Generating the resultrels in canonical order by opening the tables
> using the above way wouldn't be more expensive than the existing code,
> because even currently we anyways have to open all the tables in both
> of these functions.
>

Attached is a WIP patch (make_resultrels_ordered.patch) that generates
the result rels in canonical order. This patch is kept separate from
the up

Re: [HACKERS] Parallel Append implementation

2017-07-05 Thread Amit Khandekar
On 30 June 2017 at 15:10, Rafia Sabih <rafia.sa...@enterprisedb.com> wrote:
>
> On Tue, Apr 4, 2017 at 12:37 PM, Amit Khandekar <amitdkhan...@gmail.com>
> wrote:
>>
>> Attached is an updated patch v13 that has some comments changed as per
>> your review, and also rebased on latest master.
>
>
> This is not applicable on the latest head i.e. commit --
> 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7, looks like need a rebasing.

Thanks for notifying. Attached is the rebased version of the patch.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v13_rebased.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-07-05 Thread Amit Khandekar
On 4 July 2017 at 15:23, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> On 4 July 2017 at 14:48, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>> On 4 July 2017 at 14:38, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
>>> On 2017/07/04 17:25, Etsuro Fujita wrote:
>>>> On 2017/07/03 18:54, Amit Langote wrote:
>>>>> On 2017/07/02 20:10, Robert Haas wrote:
>>>>>> That
>>>>>> seems pretty easy to do - just have expand_inherited_rtentry() notice
>>>>>> that it's got a partitioned table and call
>>>>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
>>>>>> produce the list of OIDs.
>>>> Seems like a good idea.
>>>>
>>>>> Interesting idea.
>>>>>
>>>>> If we are going to do this, I think we may need to modify
>>>>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that
>>>>> does not do as much work.  Currently, it assumes that it's only ever
>>>>> called by ExecSetupPartitionTupleRouting() and hence also generates
>>>>> PartitionDispatchInfo objects for partitioned child tables.  We don't need
>>>>> that if called from within the planner.
>>>>>
>>>>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
>>>>> with its usage within the executor, because there is this comment:
>>>>>
>>>>>  /*
>>>>>   * We keep the partitioned ones open until we're done using the
>>>>>   * information being collected here (for example, see
>>>>>   * ExecEndModifyTable).
>>>>>   */
>>>>
>>>> Yeah, we need some refactoring work.  Is anyone working on that?
>>>
>>> I would like to take a shot at that if someone else hasn't already cooked
>>> up a patch.  Working on making RelationGetPartitionDispatchInfo() a
>>> routine callable from both within the planner and the executor should be a
>>> worthwhile effort.
>>
>> What I am currently working on is to see if we can call
>> find_all_inheritors() or find_inheritance_children() instead of
>> generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS().
>> Possibly we don't have to refactor it completely.
>> find_inheritance_children() needs to return the oids in canonical
>> order. So in find_inheritance_children () need to re-use part of
>> RelationBuildPartitionDesc() where it generates those oids in that
>> order. I am checking this part, and am going to come up with an
>> approach based on findings.
>
> The other approach is to make canonical ordering only in
> find_all_inheritors() by replacing call to find_inheritance_children()
> with the refactored/modified RelationGetPartitionDispatchInfo(). But
> that would mean that the callers of find_inheritance_children() would
> have one ordering, while the callers of find_all_inheritors() would
> have a different ordering; that brings up chances of deadlocks. That's
> why I think, we need to think about modifying the common function
> find_inheritance_children(), so that we would be consistent with the
> ordering. And then use find_inheritance_children() or
> find_all_inheritors() in RelationGetPartitionDispatchInfo(). So yes,
> there would be some modifications to
> RelationGetPartitionDispatchInfo().
>
>>
>> Also, need to investigate whether *always* sorting the oids in
>> canonical order is going to be much more expensive than the current sorting
>> using oids. But I guess it won't be that expensive.


Like I mentioned upthread... in expand_inherited_rtentry(), if we
replace find_all_inheritors() with something else that returns oids in
canonical order, that will change the order in which child tables
get locked, which increases the chance of deadlock. That's because the
callers of find_all_inheritors() will lock them in one order, while
callers of expand_inherited_rtentry() will lock them in a different
order. Even in the current code, I think there is a chance of
deadlocks because RelationGetPartitionDispatchInfo() and
find_all_inheritors() have different lock ordering.

Now, to get the oids of a partitioned table's children sorted in
canonical order (i.e. using the partition bound values), we need to
either use the partition bounds to sort the oids the way it is
done in RelationBuildPartitionDesc(), or open the parent table and get
its Relation->rd_partdesc->oids[], which are already sorted in
canonical order. So if we generate oids using this way in
find_all_inheritors() and find_inheritance_ch
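
Something like the following untested sketch (the function name and its
List-returning shape are made up just for illustration, not actual patch
code) is the kind of thing I have in mind, relying on the parent's
rd_partdesc->oids[] already being in canonical order :

#include "postgres.h"
#include "access/heapam.h"
#include "catalog/partition.h"
#include "nodes/pg_list.h"
#include "utils/rel.h"

/* Return a partitioned parent's child OIDs in canonical (bound) order. */
static List *
get_children_in_bound_order(Oid parentOid, LOCKMODE lockmode)
{
	Relation	parent = heap_open(parentOid, lockmode);
	PartitionDesc partdesc = parent->rd_partdesc;
	List	   *result = NIL;
	int			i;

	Assert(partdesc != NULL);	/* caller passes a partitioned table */

	/* rd_partdesc->oids[] is already sorted using the partition bounds */
	for (i = 0; i < partdesc->nparts; i++)
		result = lappend_oid(result, partdesc->oids[i]);

	/* Keep the lock till end of transaction; just drop the reference. */
	heap_close(parent, NoLock);

	return result;
}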

Re: [HACKERS] UPDATE of partition key

2017-07-04 Thread Amit Khandekar
On 4 July 2017 at 14:48, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> On 4 July 2017 at 14:38, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
>> On 2017/07/04 17:25, Etsuro Fujita wrote:
>>> On 2017/07/03 18:54, Amit Langote wrote:
>>>> On 2017/07/02 20:10, Robert Haas wrote:
>>>>> That
>>>>> seems pretty easy to do - just have expand_inherited_rtentry() notice
>>>>> that it's got a partitioned table and call
>>>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
>>>>> produce the list of OIDs.
>>> Seems like a good idea.
>>>
>>>> Interesting idea.
>>>>
>>>> If we are going to do this, I think we may need to modify
>>>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that
>>>> does not do as much work.  Currently, it assumes that it's only ever
>>>> called by ExecSetupPartitionTupleRouting() and hence also generates
>>>> PartitionDispatchInfo objects for partitioned child tables.  We don't need
>>>> that if called from within the planner.
>>>>
>>>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
>>>> with its usage within the executor, because there is this comment:
>>>>
>>>>  /*
>>>>   * We keep the partitioned ones open until we're done using the
>>>>   * information being collected here (for example, see
>>>>   * ExecEndModifyTable).
>>>>   */
>>>
>>> Yeah, we need some refactoring work.  Is anyone working on that?
>>
>> I would like to take a shot at that if someone else hasn't already cooked
>> up a patch.  Working on making RelationGetPartitionDispatchInfo() a
>> routine callable from both within the planner and the executor should be a
>> worthwhile effort.
>
> What I am currently working on is to see if we can call
> find_all_inheritors() or find_inheritance_children() instead of
> generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS().
> Possibly we don't have to refactor it completely.
> find_inheritance_children() needs to return the oids in canonical
> order. So find_inheritance_children() needs to re-use part of
> RelationBuildPartitionDesc() where it generates those oids in that
> order. I am checking this part, and am going to come up with an
> approach based on findings.

The other approach is to make canonical ordering only in
find_all_inheritors() by replacing call to find_inheritance_children()
with the refactored/modified RelationGetPartitionDispatchInfo(). But
that would mean that the callers of find_inheritance_children() would
have one ordering, while the callers of find_all_inheritors() would
have a different ordering; that brings up chances of deadlocks. That's
why I think, we need to think about modifying the common function
find_inheritance_children(), so that we would be consistent with the
ordering. And then use find_inheritance_children() or
find_all_inheritors() in RelationGetPartitionDispatchInfo(). So yes,
there would be some modifications to
RelationGetPartitionDispatchInfo().

>
> Also, need to investigate whether *always* sorting the oids in
> canonical order is going to be much more expensive than the current sorting
> using oids. But I guess it won't be that expensive.
>
>
> --
> Thanks,
> -Amit Khandekar
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-07-04 Thread Amit Khandekar
On 4 July 2017 at 14:38, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/07/04 17:25, Etsuro Fujita wrote:
>> On 2017/07/03 18:54, Amit Langote wrote:
>>> On 2017/07/02 20:10, Robert Haas wrote:
>>>> That
>>>> seems pretty easy to do - just have expand_inherited_rtentry() notice
>>>> that it's got a partitioned table and call
>>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
>>>> produce the list of OIDs.
>> Seems like a good idea.
>>
>>> Interesting idea.
>>>
>>> If we are going to do this, I think we may need to modify
>>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that
>>> does not do as much work.  Currently, it assumes that it's only ever
>>> called by ExecSetupPartitionTupleRouting() and hence also generates
>>> PartitionDispatchInfo objects for partitioned child tables.  We don't need
>>> that if called from within the planner.
>>>
>>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
>>> with its usage within the executor, because there is this comment:
>>>
>>>  /*
>>>   * We keep the partitioned ones open until we're done using the
>>>   * information being collected here (for example, see
>>>   * ExecEndModifyTable).
>>>   */
>>
>> Yeah, we need some refactoring work.  Is anyone working on that?
>
> I would like to take a shot at that if someone else hasn't already cooked
> up a patch.  Working on making RelationGetPartitionDispatchInfo() a
> routine callable from both within the planner and the executor should be a
> worthwhile effort.

What I am currently working on is to see if we can call
find_all_inheritors() or find_inheritance_children() instead of
generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS().
Possibly we don't have to refactor it completely.
find_inheritance_children() needs to return the oids in canonical
order. So find_inheritance_children() needs to re-use part of
RelationBuildPartitionDesc() where it generates those oids in that
order. I am checking this part, and am going to come up with an
approach based on findings.

Also, need to investigate whether *always* sorting the oids in
canonical order is going to be much more expensive than the current sorting
using oids. But I guess it won't be that expensive.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-29 Thread Amit Khandekar
On 22 June 2017 at 01:41, Robert Haas <robertmh...@gmail.com> wrote:
>>> +for (i = 0; i < num_rels; i++)
>>> +{
>>> +ResultRelInfo *resultRelInfo = _rels[i];
>>> +Relation    rel = resultRelInfo->ri_RelationDesc;
>>> +Bitmapset *expr_attrs = NULL;
>>> +
>>> +pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
>>> +
>>> +/* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
>>> +if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
>>> +return true;
>>> +}
>>>
>>> This seems like an awfully expensive way of performing this test.
>>> Under what circumstances could this be true for some result relations
>>> and false for others;
>>
>> One resultRelinfo can have no partition key column used in its quals,
>> but the next resultRelinfo can have quite different quals, and these
>> quals can have partition key referred. This is possible if the two of
>> them have different parents that have different partition-key columns.
>
> Hmm, true.  So if we have a table foo that is partitioned by list (a),
> and one of its children is a table bar that is partitioned by list
> (b), then we need to consider doing tuple-routing if either column a
> is modified, or if column b is modified for a partition which is a
> descendant of bar.  But visiting that only requires looking at the
> partitioned table and those children that are also partitioned, not
> all of the leaf partitions as the patch does.

The main concern is that the non-leaf partitions are not open (except
root), so we would need to open them in order to get the partition key
of the parents of update resultrels (or get only the partition key
atts and exprs from pg_partitioned_table).

There can be multiple approaches to finding partition key columns.

Approach 1 : When there are a few update result rels and a large
partition tree, we traverse from each of the result rels up to its
ancestors, opening each ancestor (get_partition_parent()) to get
the partition key columns. For result rels having common parents, do
this only once.

Approach 2 : If there are only a few partitioned tables, and large
number of update result rels, it would be easier to just open all the
partitioned tables and form the partition key column bitmap out of all
their partition keys. If the bitmap does not have updated columns,
that's not a partition-key-update. So for typical non-partition-key
updates, just opening the partitioned tables will suffice, and so that
would not affect performance of normal updates.

But if the bitmap has updated columns, we can't conclude that it's a
partition-key-update, because that could be a false positive. We again
need to further check whether the update result rels belong to
ancestors that have updated partition keys.

Approach 3 : In RelationData, in a new bitmap field (rd_partcheckattrs
?), store partition key attrs that are used in rd_partcheck . Populate
this field during generate_partition_qual().

So to conclude, I think, we can do this :

Scenario 1 :
Only one partitioned table : the root; rest all are leaf partitions.
In this case, it is definitely efficient to just check the root
partition key, which will be sufficient.

Scenario 2 :
There are few non-leaf partitioned tables (3-4) :
Open those tables, and follow 2nd approach above: If we don't find any
updated partition-keys in any of them, well and good. If we do find,
fall back to approach 3 : For each of the update resultrels, use the
new rd_partcheckattrs bitmap to know if it uses any of the updated
columns. This would be faster than pulling up attrs from the quals
like how it was done in the patch.
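
To illustrate, with approach 3 the per-resultrel check would reduce to
something like the following rough sketch (rd_partcheckattrs is the
proposed new RelationData field, which does not exist today, and
GetUpdatedColumns() is the existing macro in nodeModifyTable.c; this is
not actual patch code) :

/* Would live in nodeModifyTable.c, where GetUpdatedColumns() is defined. */
static bool
resultrel_uses_updated_partkey(ResultRelInfo *resultRelInfo, EState *estate)
{
	Relation	rel = resultRelInfo->ri_RelationDesc;

	/* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
	return bms_overlap(rel->rd_partcheckattrs,	/* proposed new field */
					   GetUpdatedColumns(resultRelInfo, estate));
}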

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-29 Thread Amit Khandekar
On 29 June 2017 at 07:42, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> Hi Amit,
>
> On 2017/06/28 20:43, Amit Khandekar wrote:
>> In attached patch v12
>
> The patch no longer applies and fails to compile after the following
> commit was made yesterday:
>
> commit 501ed02cf6f4f60c3357775eb07578aebc912d3a
> Author: Andrew Gierth <rhodiumt...@postgresql.org>
> Date:   Wed Jun 28 18:55:03 2017 +0100
>
> Fix transition tables for partition/inheritance.

Thanks for informing Amit.

As Thomas mentioned upthread, the above commit already uses a tuple
conversion mapping from leaf partition to root partitioned table
(mt_transition_tupconv_maps), which serves the same purpose as that of
the mapping used in the update-partition-key patch during update tuple
routing (mt_resultrel_maps).

We need to try to merge these two into a general-purpose mapping array
such as mt_leaf_root_maps. I haven't done that in the rebased patch
(attached), so currently it has both these mapping fields.

For transition tables, this map is per-leaf-partition in case of
inserts, whereas it is per-subplan result rel for updates. For
update-tuple routing, the mapping is required to be per-subplan. Now,
for update-row-movement in the presence of transition tables, we would
require both per-subplan mapping as well as per-leaf-partition
mapping, which can't be done if we have a single mapping field, unless
we have some way to identify which of the per-leaf partition mapping
elements belong to per-subplan rels.

So, it's not immediately possible to merge them.


update-partition-key_v12_rebased.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-06-28 Thread Amit Khandekar
On 22 June 2017 at 01:57, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Jun 21, 2017 at 1:38 PM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>>>> Yep, it's more appropriate to use
>>>> ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow.  That
>>>> is, if answer to the question I raised above is positive.
>>
>> From what I had checked earlier when coding that part,
>> rootResultRelInfo is NULL in case of inserts, unless something has
>> changed in later commits. That's the reason I decided to use the first
>> resultRelInfo.
>
> We're just going around in circles here.  Saying that you decided to
> use the first child's resultRelInfo because you didn't have a
> resultRelInfo for the parent is an explanation of why you wrote the
> code the way you did, but that doesn't make it correct.  I want to
> know why you think it's correct.

Yeah, that was just an FYI on how I decided to use the first
resultRelInfo; it was not an explanation of why using the first
resultRelInfo is correct. I have tried to explain that upthread.

>
> I think it's probably wrong, because it seems to me that if the INSERT
> code needs to use the parent's ResultRelInfo rather than the first
> child's ResultRelInfo, the UPDATE code probably needs to do the same.
> Commit d3cc37f1d801a6b5cad9bf179274a8d767f1ee50 got rid of
> resultRelInfos for non-leaf partitions, and commit
> e180c8aa8caf5c55a273d4a8e6092e77ff3cff10 added the resultRelInfo back
> for the topmost parent, because otherwise it didn't work correctly.



Regarding rootResultRelInfo, it would have been good if it were set
for both inserts and updates, but it isn't set for inserts.

For inserts :
In ExecInitModifyTable(), ModifyTableState->rootResultRelInfo remains
NULL because ModifyTable->rootResultRelIndex is -1 :
/* If modifying a partitioned table, initialize the root table info */
if (node->rootResultRelIndex >= 0)
    mtstate->rootResultRelInfo = estate->es_root_result_relations +
        node->rootResultRelIndex;


ModifyTable->rootResultRelIndex remains -1 because it does not get set since
ModifyTable->partitioned_rels is NULL :

/*
 * If the main target relation is a partitioned table, the
 * following list contains the RT indexes of partitioned child
 * relations including the root, which are not included in the
 * above list.  We also keep RT indexes of the roots
 * separately to be identified as such during the executor
 * initialization.
 */
if (splan->partitioned_rels != NIL)
{
    root->glob->nonleafResultRelations =
        list_concat(root->glob->nonleafResultRelations,
                    list_copy(splan->partitioned_rels));
    /* Remember where this root will be in the global list. */
    splan->rootResultRelIndex = list_length(root->glob->rootResultRelations);
    root->glob->rootResultRelations =
        lappend_int(root->glob->rootResultRelations,
                    linitial_int(splan->partitioned_rels));
}

ModifyTable->partitioned_rels is NULL because inheritance_planner()
does not get called for INSERTs; instead, grouping_planner() gets
called :

subquery_planner()
{
    /*
     * Do the main planning.  If we have an inherited target relation, that
     * needs special processing, else go straight to grouping_planner.
     */
    if (parse->resultRelation &&
        rt_fetch(parse->resultRelation, parse->rtable)->inh)
        inheritance_planner(root);
    else
        grouping_planner(root, false, tuple_fraction);
}

Above, inh is false in case of inserts.




Re: [HACKERS] UPDATE of partition key

2017-06-28 Thread Amit Khandekar
On 26 June 2017 at 08:37, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> On 22 June 2017 at 01:41, Robert Haas <robertmh...@gmail.com> wrote:
>>>> Second, it will amount to a functional bug if you get a
>>>> different answer than the planner did.
>>>
>>> Actually, the per-leaf WCOs are meant to be executed on the
>>> destination partitions where the tuple is moved, while the WCOs
>>> belonging to the per-subplan resultRelInfo are meant for the
>>> resultRelinfo used for the UPDATE plans. So actually it should not
>>> matter whether they look the same or different, because they are fired at
>>> different objects. Now these objects can happen to be the same
>>> relations though.
>>>
>>> But in any case, it's not clear to me how the mapped WCO and the
>>> planner's WCO would yield a different answer if they are both the same
>>> relation. I am possibly missing something. The planner has already
>>> generated the withCheckOptions for each of the resultRelInfo. And then
>>> we are using one of those to re-generate the WCO for a leaf partition
>>> by only adjusting the attnos. If there is already a WCO generated in
>>> the planner for that leaf partition (because that partition was
>>> present in mtstate->resultRelInfo), then the re-built WCO should look
>>> exactly the same as the earlier one, because they are the same
>>> relations, and so the attnos generated in them would be same since the
>>> Relation TupleDesc is the same.
>>
>> If the planner's WCOs and mapped WCOs are always the same, then I
>> think we should try to avoid generating both.  If they can be
>> different, but that's intentional and correct, then there's no
>> substantive problem with the patch but the comments need to make it
>> clear why we are generating both.
>>
>>> Actually I meant, "above works for only local updates. For
>>> row-movement-updates, we need per-leaf partition WCOs, because when
>>> the row is inserted into target partition, that partition may be not
>>> be included in the above planner resultRelInfo, so we need WCOs for
>>> all partitions". I think this said comment should be sufficient if I
>>> add this in the code ?
>>
>> Let's not get too focused on updating the comment until we are in
>> agreement about what the code ought to be doing.  I'm not clear
>> whether you accept the point that the patch needs to be changed to
>> avoid generating the same WCOs and returning lists in both the planner
>> and the executor.
>
> Yes, we can re-use the WCOs generated in the planner, as an
> optimization, since those we re-generate for the same relations will
> look exactly the same. The WCOs generated by planner (in
> inheritance_planner) are generated when (in adjust_appendrel_attrs())
> we change attnos used in the query to refer to the child RTEs and this
> adjusts the attnos of the WCOs of the child RTEs. So the WCOs of
> subplan resultRelInfo are actually the parent table WCOs, but only the
> attnos changed. And in ExecInitModifyTable() we do the same thing for
> leaf partitions, although using different function
> map_variable_attnos().

In attached patch v12, during UPDATE tuple routing setup, for each
leaf partition, we now check if it is present already in one of the
UPDATE per-subplan resultrels. If present, we re-use it rather than
creating a new one and opening the table again.

So the mtstate->mt_partitions is now an array of ResultRelInfo
pointers. Each pointer points to either the UPDATE per-subplan result
rel, or a newly allocated ResultRelInfo.

For each of the leaf partitions, we have to search through the
per-subplan resultRelInfo oids to check if there is a match. To do
this, I have created a temporary hash table which stores oids and the
ResultRelInfo pointers of the mtstate->resultRelInfo array, and which can
be used to look up the oid of each leaf partition.
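
The lookup is conceptually like the following untested sketch (the struct
and function names here are made up, not the ones used in the patch) :

#include "postgres.h"

#include "nodes/execnodes.h"
#include "utils/hsearch.h"
#include "utils/rel.h"

/* Hash entry: one per UPDATE per-subplan result rel. */
typedef struct SubplanResultRelEntry
{
	Oid			relid;			/* hash key: result relation's OID */
	ResultRelInfo *rri;			/* its per-subplan ResultRelInfo */
} SubplanResultRelEntry;

/*
 * Build a temporary OID -> ResultRelInfo map over the UPDATE subplan
 * result rels, so that each leaf partition can be matched cheaply.
 */
static HTAB *
build_subplan_resultrel_hash(ResultRelInfo *resultRelInfo, int nplans)
{
	HASHCTL		ctl;
	HTAB	   *htab;
	int			i;

	memset(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(Oid);
	ctl.entrysize = sizeof(SubplanResultRelEntry);
	ctl.hcxt = CurrentMemoryContext;

	htab = hash_create("update subplan result rels", nplans, &ctl,
					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

	for (i = 0; i < nplans; i++)
	{
		Oid			relid = RelationGetRelid(resultRelInfo[i].ri_RelationDesc);
		bool		found;
		SubplanResultRelEntry *entry;

		entry = (SubplanResultRelEntry *)
			hash_search(htab, &relid, HASH_ENTER, &found);
		entry->rri = &resultRelInfo[i];
	}

	return htab;
}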

This patch version has handled only the above discussion point. I will
follow up with the other points separately.


update-partition-key_v12.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-06-25 Thread Amit Khandekar
On 22 June 2017 at 01:41, Robert Haas  wrote:
>>> Second, it will amount to a functional bug if you get a
>>> different answer than the planner did.
>>
>> Actually, the per-leaf WCOs are meant to be executed on the
>> destination partitions where the tuple is moved, while the WCOs
>> belonging to the per-subplan resultRelInfo are meant for the
>> resultRelinfo used for the UPDATE plans. So actually it should not
>> matter whether they look the same or different, because they are fired at
>> different objects. Now these objects can happen to be the same
>> relations though.
>>
>> But in any case, it's not clear to me how the mapped WCO and the
>> planner's WCO would yield a different answer if they are both the same
>> relation. I am possibly missing something. The planner has already
>> generated the withCheckOptions for each of the resultRelInfo. And then
>> we are using one of those to re-generate the WCO for a leaf partition
>> by only adjusting the attnos. If there is already a WCO generated in
>> the planner for that leaf partition (because that partition was
>> present in mtstate->resultRelInfo), then the re-built WCO should be
>> exactly look same as the earlier one, because they are the same
>> relations, and so the attnos generated in them would be same since the
>> Relation TupleDesc is the same.
>
> If the planner's WCOs and mapped WCOs are always the same, then I
> think we should try to avoid generating both.  If they can be
> different, but that's intentional and correct, then there's no
> substantive problem with the patch but the comments need to make it
> clear why we are generating both.
>
>> Actually I meant, "above works for only local updates. For
>> row-movement-updates, we need per-leaf partition WCOs, because when
>> the row is inserted into target partition, that partition may be not
>> be included in the above planner resultRelInfo, so we need WCOs for
>> all partitions". I think this said comment should be sufficient if I
>> add this in the code ?
>
> Let's not get too focused on updating the comment until we are in
> agreement about what the code ought to be doing.  I'm not clear
> whether you accept the point that the patch needs to be changed to
> avoid generating the same WCOs and returning lists in both the planner
> and the executor.

Yes, we can re-use the WCOs generated in the planner, as an
optimization, since those we re-generate for the same relations will
look exactly the same. The WCOs generated by planner (in
inheritance_planner) are generated when (in adjust_appendrel_attrs())
we change attnos used in the query to refer to the child RTEs and this
adjusts the attnos of the WCOs of the child RTEs. So the WCOs of
subplan resultRelInfo are actually the parent table WCOs, but only the
attnos changed. And in ExecInitModifyTable() we do the same thing for
leaf partitions, although using different function
map_variable_attnos().

>
>>> Also, I feel like it's probably not correct to use the first result
>>> relation as the nominal relation for building WCOs and returning lists
>>> anyway.  I mean, if the first result relation has a different column
>>> order than the parent relation, isn't this just broken?  If it works
>>> for some reason, the comments don't explain what that reason is.

One thing I didn't mention earlier about the WCOs is that for child
rels, we don't use the WCOs defined for the child rels. We only
inherit the WCO expressions defined for the root rel. That's the
reason they are the same expressions, with only the attnos changed to
match the respective relation's tupledesc. If the WCOs of each of the
subplan resultRelInfos were different, then it would definitely not be
possible to use the first resultRelInfo to generate the other leaf
partition WCOs, because the WCO defined for relation A is independent
of that defined for relation B.

So, since the WCOs of all the relations are actually those of the
parent, we only need to adjust the attnos of any of these
resultRelInfos.

For e.g., if the root rel WCO is defined as "col > 5" where col is the
4th column, the expression will look like "var_1.attno_4 > 5". And the
WCO that is generated for a subplan resultRelInfo will look something
like "var_n.attno_2 > 5" if col is the 2nd column in this table.

All of the above logic assumes that we never use the WCO defined for
the child relation. At least that's how it looks by looking at the way
we generate WCOs in ExecInitModifyTable() for INSERTs as well looking
at the code in inheritance_planner() for UPDATEs. At both these
places, we never use the WCOs defined for child tables.

So suppose we define the tables and their WCOs like this :

CREATE TABLE range_parted ( a text, b int, c int) partition by range (a, b);

ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
GRANT ALL ON range_parted TO PUBLIC ;
create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);

create table part_b_10_b_20 partition of range_parted for values from
('b', 

Re: [HACKERS] UPDATE of partition key

2017-06-21 Thread Amit Khandekar
On 21 June 2017 at 20:14, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Jun 21, 2017 at 5:28 AM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>>> The comment "UPDATE/DELETE cases are handled above" is referring to
>>> the code that initializes the WCOs generated by the planner.  You've
>>> modified the comment in your patch, but the associated code: your
>>> updated comment says that only "DELETEs and local UPDATES are handled
>>> above", but in reality, *all* updates are still handled above.  And
>>> then they are handled again here.  Similarly for returning lists.
>>> It's certainly not OK for the comment to be inaccurate, but I think
>>> it's also bad to redo the work which the planner has already done,
>>> even if it makes the patch smaller.
>>
>> I guess this has to do with the UPDATE turning into DELETE+INSERT.  So, it
>> seems like WCOs are being initialized for the leaf partitions
>> (ResultRelInfos in the mt_partitions array) that in turn are
>> initialized for the aforementioned INSERT.  That's why the term "...local
>> UPDATEs" in the new comment text.
>>
>> If that's true, I wonder if it makes sense to apply what would be
>> WCO_RLS_UPDATE_CHECK to a leaf partition that the tuple will be moved into
>> by calling ExecInsert()?
>
> I think we probably should apply the insert policy, just as we're
> executing the insert trigger.

Yes, the RLS quals should execute during tuple routing depending on
whether it is an update or whether it has been converted to an insert.
I think the tests don't quite test the insert part. Will check.

>
>>> Also, I feel like it's probably not correct to use the first result
>>> relation as the nominal relation for building WCOs and returning lists
>>> anyway.  I mean, if the first result relation has a different column
>>> order than the parent relation, isn't this just broken?  If it works
>>> for some reason, the comments don't explain what that reason is.
>>
>> Yep, it's more appropriate to use
>> ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow.  That
>> is, if answer to the question I raised above is positive.

From what I had checked earlier when coding that part,
rootResultRelInfo is NULL in case of inserts, unless something has
changed in later commits. That's the reason I decided to use the first
resultRelInfo.


Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-21 Thread Amit Khandekar
On 21 June 2017 at 00:23, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>>> I guess I don't see why it should work like this.  In the INSERT case,
>>> we must build withCheckOption objects for each partition because those
>>> partitions don't appear in the plan otherwise -- but in the UPDATE
>>> case, they're already there, so why do we need to build anything at
>>> all?  Similarly for RETURNING projections.  How are the things we need
>>> for those cases not already getting built, associated with the
>>> relevant resultRelInfos?  Maybe there's a concern if some children got
>>> pruned - they could turn out later to be the children into which
>>> tuples need to be routed. But the patch makes no distinction
>>> between possibly-pruned children and any others.
>>
>> Yes, only a subset of the partitions appear in the UPDATE subplans. I
>> think typically for updates, a very small subset of the total leaf
>> partitions will be there in the plans, others would get pruned. IMHO,
>> it would not be worth having an optimization where it opens only those
>> leaf partitions which are not already there in the subplans. Without
>> the optimization, we are able to re-use the INSERT infrastructure
>> without additional changes.
>
> Well, that is possible, but certainly not guaranteed.  I mean,
> somebody could do a whole-table UPDATE, or an UPDATE that hits a
> smattering of rows in every partition;

I am not saying that it's guaranteed to be a small subset. I am saying
that it would typically be a small subset for the
update-of-partition-key case. It seems unlikely that a user would
cause update-row-movement across many partitions at the same time.
Generally it would be an administrative task where some or all of the
rows of a partition need to have their partition key updated so that
they move to another partition, and so there would probably be a
WHERE clause narrowing the update down to that particular
partition, because without the WHERE clause the update is slower
anyway and it's redundant to scan all the other partitions.

But, point taken, there can always be certain cases involving
partition-key updates that span multiple partitions.

> e.g. the table is partitioned on order number, and you do UPDATE
> lineitem SET product_code = 'K372B' WHERE product_code = 'K372'.

This query does not update order number, so here there is no
partition-key-update. Are you thinking that the patch is generating
the per-leaf-partition WCO expressions even for an update not involving
a partition key ?

>
> Leaving that aside, the point here is that you're rebuilding
> withCheckOptions and returningLists that have already been built in
> the planner.  That's bad for two reasons.  First, it's inefficient,
> especially if there are many partitions.

Yeah, I agree that this becomes more and more redundant if the update
involves more partitions.

> Second, it will amount to a functional bug if you get a
> different answer than the planner did.

Actually, the per-leaf WCOs are meant to be executed on the
destination partitions where the tuple is moved, while the WCOs
belonging to the per-subplan resultRelInfo are meant for the
resultRelinfo used for the UPDATE plans. So actually it should not
matter whether they look the same or different, because they are fired at
different objects. Now these objects can happen to be the same
relations though.

But in any case, it's not clear to me how the mapped WCO and the
planner's WCO would yield a different answer if they are both the same
relation. I am possibly missing something. The planner has already
generated the withCheckOptions for each of the resultRelInfo. And then
we are using one of those to re-generate the WCO for a leaf partition
by only adjusting the attnos. If there is already a WCO generated in
the planner for that leaf partition (because that partition was
present in mtstate->resultRelInfo), then the re-built WCO should be
exactly look same as the earlier one, because they are the same
relations, and so the attnos generated in them would be same since the
Relation TupleDesc is the same.

> Note this comment in the existing code:
>
> /*
>  * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
>  * that we didn't build the withCheckOptionList for each partition within
>  * the planner, but simple translation of the varattnos for each partition
>  * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
>  * cases are handled above.
>  */
>
> The comment "UPDATE/DELETE cases are handled above" is referring to
> the code that initializes the WCOs generated by the planner.  You've
> modified the comment in your patch, but t

Re: [HACKERS] UPDATE of partition key

2017-06-20 Thread Amit Khandekar
On 20 June 2017 at 03:46, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Jun 15, 2017 at 1:36 PM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> Attached patch v10 fixes the above. With the patch, the existing code
>> that builds WCO constraints for each leaf partition is now applicable
>> to row-movement updates as well.
>
> I guess I don't see why it should work like this.  In the INSERT case,
> we must build withCheckOption objects for each partition because those
> partitions don't appear in the plan otherwise -- but in the UPDATE
> case, they're already there, so why do we need to build anything at
> all?  Similarly for RETURNING projections.  How are the things we need
> for those cases not already getting built, associated with the
> relevant resultRelInfos?  Maybe there's a concern if some children got
> pruned - they could turn out later to be the children into which
> tuples need to be routed. But the patch makes no distinction
> between possibly-pruned children and any others.

Yes, only a subset of the partitions appear in the UPDATE subplans. I
think typically for updates, a very small subset of the total leaf
partitions will be there in the plans, others would get pruned. IMHO,
it would not be worth having an optimization where it opens only those
leaf partitions which are not already there in the subplans. Without
the optimization, we are able to re-use the INSERT infrastructure
without additional changes.


>
>> There is another issue I discovered. The row-movement works fine if
>> the destination leaf partition has different attribute ordering than
>> the root : the existing insert-tuple-routing mapping handles that. But
>> if the source partition has different ordering w.r.t. the root, it has
>> a problem : there is no mapping in the opposite direction, i.e. from
>> the leaf to root. And we require that because the tuple of source leaf
>> partition needs to be converted to root partition tuple descriptor,
>> since ExecFindPartition() starts with root.
>
> Seems reasonable, but...
>
>> To fix this, I have introduced another mapping array
>> mtstate->mt_resultrel_maps[]. This corresponds to the
>> mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping,
>> because the update result relations are pruned subset of the total
>> leaf partitions.
>
> ... I don't understand how you can *not* need a per-leaf-partition
> mapping.  I mean, maybe you only need the mapping for the *unpruned*
> leaf partitions

Yes, we need the mapping only for the unpruned leaf partitions, and
those partitions are available in the per-subplan resultRelInfo's.

> but you certainly need a separate mapping for each one of those.

You mean *each* of the leaf partitions ? I didn't get why we would
need it for each one. The tuple targeted for update belongs to one of
the per-subplan resultInfos. And this tuple is to be routed to another
leaf partition. So the reverse mapping is for conversion from the
source resultRelinfo to the root partition. I am unable to figure out
a scenario where we would require this reverse mapping for partitions
on which UPDATE is *not* going to be executed.

>
> It's possible to imagine driving the tuple routing off of just the
> partition key attributes, extracted from wherever they are inside the
> tuple at the current level, rather than converting to the root's tuple
> format.  However, that's not totally straightforward because there
> could be multiple levels of partitioning throughout the tree and
> different attributes might be needed at different levels.

Yes, the conversion anyway occurs at each of these levels even for
insert, specifically because there can be different partition
attributes each time. For update, it's only one additional conversion.
But yes, this new mapping would be required for this one single
conversion.

> Moreover,
> in most cases, the mappings are going to end up being no-ops because
> the column order will be the same, so it's probably not worth
> complicating the code to try to avoid a double conversion that usually
> won't happen.

I agree.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-19 Thread Amit Khandekar
On 20 June 2017 at 03:42, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> Just a thought: If I understand correctly this new array of tuple
> conversion maps is the same as mtstate->mt_transition_tupconv_maps in
> my patch transition-tuples-from-child-tables-v11.patch (hopefully soon
> to be committed to close a PG10 open item).  In my patch I bounce
> transition tuples from child relations up to the named relation's
> triggers, and in this patch you bounce child tuples up to the named
> relation for rerouting, so the conversion requirement is the same.
> Perhaps we could consider refactoring to build a common struct member
> on demand for the row movement patch at some point in the future if it
> makes the code cleaner.

I agree; thanks for bringing this to my attention. The conversion maps
in my patch and yours do sound like they are exactly the same. And even in
case where both update-row-movement and transition tables are playing
together, the same map should serve the purpose of both. I will keep a
watch on your patch, and check how I can adjust my patch so that I
don't have to refactor the mapping.

One difference I see is : in your patch, in ExecModifyTable() we jump
the current map position for each successive subplan, whereas in my
patch, in ExecInsert() we deduce the position of the right map to be
fetched using the position of the current resultRelInfo in the
mtstate->resultRelInfo[] array. I think your way is more consistent
with the existing code.
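
For the record, the deduction in my patch is essentially just pointer
arithmetic over the resultRelInfo array, roughly like this illustrative
fragment (not the exact patch code; it assumes the patch's
mt_resultrel_maps field) :

/* Would live in nodeModifyTable.c alongside the patch's new fields. */
static inline TupleConversionMap *
get_reverse_map(ModifyTableState *mtstate, ResultRelInfo *resultRelInfo)
{
	/* Position of this result rel within mtstate->resultRelInfo[] ... */
	int			map_index = resultRelInfo - mtstate->resultRelInfo;

	/* ... picks the corresponding leaf-to-root conversion map. */
	return mtstate->mt_resultrel_maps[map_index];
}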

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-18 Thread Amit Khandekar
When I tested partition-key-update on a partitioned table having no
child partitions, it crashed. This is because there is an
Assert(mtstate->mt_num_partitions > 0) for creating the
partition-to-root map, which fails if there are no partitions under
the partitioned table. Actually we should skip creating this map if
there are no partitions under the partitioned table on which UPDATE is
run. So the attached patch has this new change to fix it (and
appropriate additional test case added) :

--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2006,15 +2006,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * descriptor of a source partition does not match the root partition
 	 * descriptor. In such case we need to convert tuples to the root partition
 	 * tuple descriptor, because the search for destination partition starts
-	 * from the root.
+	 * from the root. Skip this setup if it's not a partition key update or if
+	 * there are no partitions below this partitioned table.
 	 */
-	if (is_partitionkey_update)
+	if (is_partitionkey_update && mtstate->mt_num_partitions > 0)
 	{
 		TupleConversionMap **tup_conv_maps;
 		TupleDesc	outdesc;
 
-		Assert(mtstate->mt_num_partitions > 0);
-
 		mtstate->mt_resultrel_maps =
 			(TupleConversionMap **)
 			palloc0(sizeof(TupleConversionMap*) * nplans);

On 15 June 2017 at 23:06, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> On 13 June 2017 at 15:40, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>> While rebasing my patch for the below recent commit, I realized that a
>> similar issue exists for the update-tuple-routing patch as well :
>>
>> commit 78a030a441966d91bc7e932ef84da39c3ea7d970
>> Author: Tom Lane <t...@sss.pgh.pa.us>
>> Date:   Mon Jun 12 23:29:44 2017 -0400
>>
>> Fix confusion about number of subplans in partitioned INSERT setup.
>>
>> The above issue was about incorrectly using 'i' in
>> mtstate->mt_plans[i] during handling WITH CHECK OPTIONS in
>> ExecInitModifyTable(), where 'i' was actually meant to refer to the
>> positions in mtstate->mt_num_partitions. Actually for INSERT, there is
>> only a single plan element in mtstate->mt_plans[] array.
>>
>> Similarly, for update-tuple routing, we cannot use
>> mtstate->mt_plans[i], because 'i' refers to position in
>> mtstate->mt_partitions[] , whereas mtstate->mt_plans is not at all in
>> order of mtstate->mt_partitions; in fact mt_plans has only the plans
>> that are to be scanned on pruned partitions; so it can well be a small
>> subset of total partitions.
>>
>> I am working on an updated patch to fix the above.
>
> Attached patch v10 fixes the above. With the patch, the existing code
> that builds WCO constraints for each leaf partition is now applicable
> to row-movement updates as well. So the
> assertions in the code are now updated to allow the same. Secondly,
> the mapping for each of the leaf partitions was constructed using the
> root partition attributes. Now in the patch, the
> mtstate->resultRelInfo[0] (i.e. the first resultRelInfo) is used as
> reference. So effectively, map_partition_varattnos() now represents
> not just parent-to-partition mapping, but rather, mapping between any
> two partitions/partitioned_tables. It's done this way, so that we can
> have a common WCO building code for inserts as well as updates. For
> e.g. for inserts, the first (and only) WCO belongs to
> node->nominalRelation so nominalRelation is used for
> map_partition_varattnos(), whereas for updates, first WCO belongs to
> the first resultRelInfo which is not same as nominalRelation. So in
> the patch, in both cases, we use the first resultRelInfo and the WCO
> of the first resultRelInfo for map_partition_varattnos().
>
> Similar thing is done for Returning expressions.
>
> -
>
> Another change in the patch is : for ExecInitQual() for WCO quals,
> mtstate->ps is used as parent, rather than first plan. For updates,
> first plan does not belong to the parent partition. In fact, I think
> in all cases, we should use mtstate->ps as the parent.
> mtstate->mt_plans[0] doesn't look like it should be considered the parent
> of these expressions. Maybe it does not matter to which parent we
> link these quals, because there is no ReScan for ExecModifyTable().
>
> Note that for RETURNING projection expressions, we do use mtstate->ps.
>
> 
>
> There is another issue I discovered. The row-movement works fine if
> the destination leaf partition has dif

Re: [HACKERS] UPDATE of partition key

2017-06-15 Thread Amit Khandekar
On 13 June 2017 at 15:40, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> While rebasing my patch for the below recent commit, I realized that a
> similar issue exists for the update-tuple-routing patch as well :
>
> commit 78a030a441966d91bc7e932ef84da39c3ea7d970
> Author: Tom Lane <t...@sss.pgh.pa.us>
> Date:   Mon Jun 12 23:29:44 2017 -0400
>
> Fix confusion about number of subplans in partitioned INSERT setup.
>
> The above issue was about incorrectly using 'i' in
> mtstate->mt_plans[i] during handling WITH CHECK OPTIONS in
> ExecInitModifyTable(), where 'i' was actually meant to refer to the
> positions in mtstate->mt_num_partitions. Actually for INSERT, there is
> only a single plan element in mtstate->mt_plans[] array.
>
> Similarly, for update-tuple routing, we cannot use
> mtstate->mt_plans[i], because 'i' refers to position in
> mtstate->mt_partitions[] , whereas mtstate->mt_plans is not at all in
> order of mtstate->mt_partitions; in fact mt_plans has only the plans
> that are to be scanned on pruned partitions; so it can well be a small
> subset of total partitions.
>
> I am working on an updated patch to fix the above.

Attached patch v10 fixes the above. With the patch, the existing code
that builds WCO constraints for each leaf partition is now applicable
to row-movement updates as well. So the
assertions in the code are now updated to allow the same. Secondly,
the mapping for each of the leaf partitions was constructed using the
root partition attributes. Now in the patch, the
mtstate->resultRelInfo[0] (i.e. the first resultRelInfo) is used as
reference. So effectively, map_partition_varattnos() now represents
not just parent-to-partition mapping, but rather, mapping between any
two partitions/partitioned_tables. It's done this way, so that we can
have a common WCO building code for inserts as well as updates. For
e.g. for inserts, the first (and only) WCO belongs to
node->nominalRelation so nominalRelation is used for
map_partition_varattnos(), whereas for updates, first WCO belongs to
the first resultRelInfo which is not same as nominalRelation. So in
the patch, in both cases, we use the first resultRelInfo and the WCO
of the first resultRelInfo for map_partition_varattnos().

Similar thing is done for Returning expressions.

-

Another change in the patch is : for ExecInitQual() for WCO quals,
mtstate->ps is used as parent, rather than first plan. For updates,
first plan does not belong to the parent partition. In fact, I think
in all cases, we should use mtstate->ps as the parent.
mtstate->mt_plans[0] doesn't look like it should be considered the parent
of these expressions. Maybe it does not matter to which parent we
link these quals, because there is no ReScan for ExecModifyTable().

Note that for RETURNING projection expressions, we do use mtstate->ps.



There is another issue I discovered. The row-movement works fine if
the destination leaf partition has different attribute ordering than
the root : the existing insert-tuple-routing mapping handles that. But
if the source partition has different ordering w.r.t. the root, it has
a problem : there is no mapping in the opposite direction, i.e. from
the leaf to root. And we require that because the tuple of source leaf
partition needs to be converted to root partition tuple descriptor,
since ExecFindPartition() starts with root.

To fix this, I have introduced another mapping array
mtstate->mt_resultrel_maps[]. This corresponds to the
mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping,
because the update result relations are pruned subset of the total
leaf partitions.

So in ExecInsert, before calling ExecFindPartition(), we need to
convert the leaf partition tuple to root using this reverse mapping.
Since we need to convert the tuple here, and again after
ExecFindPartition() for the found leaf partition, I have replaced the
common code by new function ConvertPartitionTupleSlot().
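
Roughly, that shared conversion step looks like the following untested
sketch (the actual ConvertPartitionTupleSlot() in the patch may differ in
name, signature and details) :

#include "postgres.h"

#include "access/tupconvert.h"
#include "executor/tuptable.h"
#include "storage/buf.h"

/* Convert 'tuple' using 'map' (if any) and switch to the matching slot. */
static HeapTuple
convert_partition_tuple(TupleConversionMap *map, HeapTuple tuple,
						TupleTableSlot *new_slot, TupleTableSlot **p_slot)
{
	if (map != NULL)
	{
		/* Convert to the destination relation's tuple descriptor ... */
		tuple = do_convert_tuple(tuple, map);

		/* ... and store it in the slot the caller should use from now on. */
		ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
		*p_slot = new_slot;
	}

	return tuple;
}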

---

Used a new flag is_partitionkey_update in ExecInitModifyTable(), which
can be re-used in subsequent sections, rather than calling the
IsPartitionKeyUpdate() function again.

---

Some more test scenarios added that cover the above changes: basically,
partitions that have different tuple descriptors than their parents.


update-partition-key_v10.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-06-13 Thread Amit Khandekar
While rebasing my patch for the below recent commit, I realized that a
similar issue exists for the update-tuple-routing patch as well :

commit 78a030a441966d91bc7e932ef84da39c3ea7d970
Author: Tom Lane 
Date:   Mon Jun 12 23:29:44 2017 -0400

Fix confusion about number of subplans in partitioned INSERT setup.

The above issue was about incorrectly using 'i' in
mtstate->mt_plans[i] during handling WITH CHECK OPTIONS in
ExecInitModifyTable(), where 'i' was actually meant to refer to the
positions in mtstate->mt_num_partitions. Actually for INSERT, there is
only a single plan element in mtstate->mt_plans[] array.

Similarly, for update-tuple routing, we cannot use
mtstate->mt_plans[i], because 'i' refers to position in
mtstate->mt_partitions[] , whereas mtstate->mt_plans is not at all in
order of mtstate->mt_partitions; in fact mt_plans has only the plans
that are to be scanned on pruned partitions; so it can well be a small
subset of total partitions.

I am working on an updated patch to fix the above.




Re: [HACKERS] UPDATE of partition key

2017-06-09 Thread Amit Khandekar
On 9 June 2017 at 19:10, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Jun 8, 2017 at 10:40 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas <robertmh...@gmail.com> wrote:
>>>> On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila <amit.kapil...@gmail.com> 
>>>> wrote:
>>>>> As far as I understand, it is to ensure that for deleted rows, nothing
>>>>> more needs to be done.  For example, see the below check in
>>>>> ExecUpdate/ExecDelete.
>>>>> if (!ItemPointerEquals(tupleid, &hufd.ctid))
>>>>> {
>>>>> ..
>>>>> }
>>>>> ..
>>>>>
>>>>> Also a similar check in ExecLockRows.  Now for deleted rows, if the
>>>>> t_ctid wouldn't point to itself, then in the mentioned functions, we
>>>>> were not in a position to conclude that the row is deleted.
>>>>
>>>> Right, so we would have to find all such checks and change them to use
>>>> some other method to conclude that the row is deleted.  What method
>>>> would we use?
>>>
>>> I think before doing above check we can simply check if ctid.ip_blkid
>>> contains InvalidBlockNumber, then return an error.
>>
>> Hmm, OK.  That case never happens today?
>>
>
> As per my understanding that case doesn't exist.  I will verify again
> once the patch is available.  I can take a crack at it if Amit
> Khandekar is busy with something else or is not comfortable in this
> area.

Amit, I was going to have a look at this, once I finish with the other
part. I have been busy getting that done first. But your comments/help
are always welcome.

>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-09 Thread Amit Khandekar
On 2 June 2017 at 23:52, Peter Geoghegan <p...@bowt.ie> wrote:
> On Fri, Jun 2, 2017 at 10:34 AM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> Ok. I was thinking we are doing the tie-breaker because specifically
>> strcoll_l() was unexpectedly returning 0 for some cases. Now I get it,
>> that we do that to be compatible with texteq().
>
> Both of these explanations are correct, in a way. See commit 656beff.
>
>> Secondly, I was also considering if ICU especially has a way to
>> customize an ICU locale by setting some attributes which dictate
>> comparison or sorting rules for a set of characters. I mean, if there
>> is such customized ICU locale defined in the system, and we use that
>> to create PG collation, I thought we might have to strictly follow
>> those rules without a tie-breaker, so as to be 100% conformant to ICU.
>> I can't come up with an example, or may there isn't one, but , say ,
>> there is a locale which is supposed to sort only by lowest comparison
>> strength (de@strength=1 ?? ). In that case, there might be many
>> characters considered equal, but PG < operator or > operator would
>> still return true for those chars.
>
> In the terminology of the Unicode collation algorithm, PostgreSQL
> "forces deterministic comparisons" [1]. There is a lot of information
> on the details of that within the UCA spec.
>
> If we ever wanted to offer a case insensitive collation feature, then
> we wouldn't necessarily have to do the equivalent of a full strxfrm()
> when hashing, at least with collations controlled by ICU. Perhaps we
> could instead use a collator whose UCOL_STRENGTH is only UCOL_PRIMARY
> to build binary sort keys, and leave the rest to a ucol_equal() call
> (within texteq()) that has the usual UCOL_STRENGTH for the underlying
> PostgreSQL collation.
>
> I don't think it would be possible to implement case insensitive
> collations by using some pre-existing ICU collation that is case
> insensitive. Instead, an implementation might directly vary collation
> strength of any given collation to achieve case insensitivity.
> PostgreSQL would know that this collation was case insensitive, so
> regular collations wouldn't need to change their
> behavior/implementation (to use ucol_equal() within texteq(), and so
> on).

Ah ok. Understood, thanks.


Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-08 Thread Amit Khandekar
On 7 June 2017 at 20:19, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> On 7 June 2017 at 16:42, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>> The column bitmap set returned by GetUpdatedColumns() refers to
>> attribute numbers w.r.t. the root partition. And the
>> mtstate->resultRelInfo[] have attnos w.r.t. the leaf partitions. So
>> we need to do something similar to map_partition_varattnos() to change
>> the updated columns attnos to the leaf partitions
>
> I was wrong about this. Each of the mtstate->resultRelInfo[] has its
> own corresponding RangeTblEntry with its own updatedCols having attnos
> accordingly adjusted to refer its own table attributes. So we don't
> have to do the mapping; we need to get modifiedCols separately for each
> of the ResultRelInfo, rather than the root relinfo.
>
>> and walk down the
>> partition constraint expressions to find if the attnos are present
>> there.
>
> But this we will need to do.

Attached is v9 patch. This covers the two parts discussed upthread :
1. Prevent triggers from causing the row movement.
2. Setup the tuple routing in ExecInitModifyTable(), but only if a
partition key is modified. Check new function IsPartitionKeyUpdate().

Have rebased the patch to consider changes done in commit
15ce775faa428dc9 to prevent triggers from violating partition
constraints. There, for the call to ExecFindPartition() in ExecInsert,
we need to fetch the mtstate->rootResultRelInfo in case the operation
is part of update row movement. This is because the root partition is
not available in the resultRelInfo for UPDATE.


Added many more test scenarios in update.sql that cover the above.

I am yet to test the concurrency part using isolation tester.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


update-partition-key_v9.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-06-07 Thread Amit Khandekar
On 7 June 2017 at 16:42, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> The column bitmap set returned by GetUpdatedColumns() refer to
> attribute numbers w.r.t. to the root partition. And the
> mstate->resultRelInfo[] have attnos w.r.t. to the leaf partitions. So
> we need to do something similar to map_partition_varattnos() to change
> the updated columns attnos to the leaf partitions

I was wrong about this. Each of the mtstate->resultRelInfo[] entries has
its own corresponding RangeTblEntry, with its own updatedCols whose attnos
are adjusted accordingly to refer to its own table's attributes. So we
don't have to do the mapping; we just need to get the modifiedCols
separately for each ResultRelInfo, rather than from the root relinfo.

> and walk down the
> partition constraint expressions to find if the attnos are present
> there.

But this we will need to do.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-07 Thread Amit Khandekar
On 6 June 2017 at 23:52, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Jun 2, 2017 at 7:07 AM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>> So, according to that, below would be the logic :
>>
>> Run partition constraint check on the original NEW row.
>> If it succeeds :
>> {
>> Fire BR UPDATE trigger on the original partition.
>> Run partition constraint check again with the modified NEW row
>> (may be do this only if the trigger modified the partition key)
>> If it fails,
>> abort.
>> Else
>> proceed with the usual local update.
>> }
>> else
>> {
>> Fire BR UPDATE trigger on original partition.
>> Find the right partition for the modified NEW row.
>> If it is the same partition,
>> proceed with the usual local update.
>> else
>> do the row movement.
>> }
>
> Sure, that sounds about right, although the "Fire BR UPDATE trigger on
> the original partition." is the same in both branches, so I'm not
> quite sure why you have that in the "if" block.

Actually, after coding this logic, it looks a bit different. See
ExecUpdate() in the attached file trigger_related_changes.patch.



Now that we are making sure a trigger won't change the partition of the
tuple, the next thing we need to do is make sure the tuple routing setup
is done *only* if the UPDATE is modifying partition keys. Otherwise,
this will degrade normal update performance.

Below is the logic I am implementing for determining whether the
UPDATE is modifying partition keys.

In ExecInitModifyTable() ...
Call GetUpdatedColumns(mtstate->rootResultRelInfo, estate) to get
updated_columns.
For each of the updated_columns :
{
Check if the column is part of partition key quals of any of
the relations in mtstate->resultRelInfo[] array.
/*
 * mtstate->resultRelInfo[] contains exactly those leaf partitions
 * which qualify the update quals.
 */

If (it is part of partition key quals of at least one of the relations)
{
   Do ExecSetupPartitionTupleRouting() for the root partition.
   break;
}
}

A few things need to be considered :

Use Relation->rd_partcheck to get partition check quals of each of the
relations in mtstate->resultRelInfo[].

The Relation->rd_partcheck of the leaf partitions would include the
ancestors' partition quals as well. So we are good: we don't have to
explicitly fetch the upper partition constraints. Note that an UPDATE
can modify a column which is not used in the partition constraint
expressions of any of the partitions or partitioned tables in the
subtree, yet that column may be used in the partition constraint of
a partitioned table belonging to an upper subtree.

All of the relations in mtstate->resultRelInfo are already open. So we
don't need to re-open any more relations to get the partition quals.

The column bitmap set returned by GetUpdatedColumns() refers to
attribute numbers w.r.t. the root partition, while the
mtstate->resultRelInfo[] entries have attnos w.r.t. the leaf partitions.
So we need to do something similar to map_partition_varattnos() to map
the updated columns' attnos to the leaf partitions, and then walk the
partition constraint expressions to find whether those attnos are present
there.
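
To make the check above concrete, here is a minimal, self-contained C
sketch of the idea (illustration only: it does not use the real executor
structures, and names such as update_may_move_rows and constraint_cols are
invented). Each result relation's partition-constraint columns and the
UPDATE's target columns are modelled as plain bitmasks, and tuple routing
is set up only if they overlap for at least one result relation; the real
code would additionally have to account for the root-versus-leaf attno
mapping discussed above.

/*
 * Toy model of "does this UPDATE touch a partition key column?".
 * NOT PostgreSQL executor code: column sets are plain uint64 bitmasks
 * and all names here are invented for illustration.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct LeafPartition
{
    const char *name;
    uint64_t    constraint_cols;    /* attnos used in its partition constraint */
} LeafPartition;

/* Return true if any leaf's partition constraint mentions an updated column. */
static bool
update_may_move_rows(uint64_t updated_cols,
                     const LeafPartition *leaves, int nleaves)
{
    for (int i = 0; i < nleaves; i++)
        if (updated_cols & leaves[i].constraint_cols)
            return true;            /* tuple routing setup is needed */
    return false;                   /* a plain in-place update suffices */
}

int
main(void)
{
    /* bit (1 << n) stands for attribute number n; both leaves partition on c1 */
    LeafPartition leaves[] = {
        {"t1_part_1", 1u << 1},
        {"t1_part_2", 1u << 1},
    };
    uint64_t set_c2 = 1u << 2;      /* UPDATE ... SET c2 = ... */
    uint64_t set_c1 = 1u << 1;      /* UPDATE ... SET c1 = ... */

    printf("SET c2: routing needed? %d\n", update_may_move_rows(set_c2, leaves, 2)); /* 0 */
    printf("SET c1: routing needed? %d\n", update_may_move_rows(set_c1, leaves, 2)); /* 1 */
    return 0;
}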


Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


trigger_related_changes.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-06-05 Thread Amit Khandekar
On 5 June 2017 at 11:27, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Fri, Jun 2, 2017 at 4:37 PM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>> On 2 June 2017 at 01:17, Robert Haas <robertmh...@gmail.com> wrote:
>>> On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar <amitdkhan...@gmail.com> 
>>> wrote:
>>>>> Regarding the trigger issue, I can't claim to have a terribly strong
>>>>> opinion on this.  I think that practically anything we do here might
>>>>> upset somebody, but probably any halfway-reasonable thing we choose to
>>>>> do will be OK for most people.  However, there seems to be a
>>>>> discrepancy between the approach that got the most votes and the one
>>>>> that is implemented by the v8 patch, so that seems like something to
>>>>> fix.
>>>>
>>>> Yes, I have started working on updating the patch to use that approach
>>>> (BR and AR update triggers on source and destination partition
>>>> respectively, instead of delete+insert) The approach taken by the
>>>> patch (BR update + delete+insert triggers) didn't require any changes
>>>> in the way ExecDelete() and ExecInsert() were called. Now we would
>>>> require to skip the delete/insert triggers, so some flags need to be
>>>> passed to these functions,
>>>>
>
> I thought you already need to pass an additional flag for special
> handling of ctid in Delete case.

Yeah that was unavoidable.

> For Insert, a new flag needs to be
> passed and need to have a check for that in few places.

For skipping the delete and insert triggers, we need to include still
another flag, plus checks in both ExecDelete() and ExecInsert() for
skipping both the BR and AR triggers, and then, in ExecUpdate(), another
call to ExecARUpdateTriggers() before quitting.

>
>> or else have stripped down versions of
>>>> ExecDelete() and ExecInsert() which don't do other things like
>>>> RETURNING handling and firing triggers.
>>>
>>> See, that strikes me as a pretty good argument for firing the
>>> DELETE+INSERT triggers...
>>>
>>> I'm not wedded to that approach, but "what makes the code simplest?"
>>> is not a bad tiebreak, other things being equal.
>>
>> Yes, that sounds good to me.
>>
>
> I am okay if we want to go ahead with firing BR UPDATE + DELETE +
> INSERT triggers for an Update statement (when row movement happens) on
> the argument of code simplicity, but it sounds slightly odd behavior.

OK. I will keep this behaviour, which is already present in the patch. I
also feel that code simplicity can be used as a tie-breaker if we cannot
agree on a single behaviour that completely satisfies all aspects.

>
>> But I think we want to wait for other's
>> opinion because it is quite understandable that two triggers firing on
>> the same partition sounds odd.
>>
>
> Yeah, but I think we have to rely on docs in this case as behavior is
> not intuitive.

Agreed. The doc changes in the patch already explain this behaviour in
detail.

>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-02 Thread Amit Khandekar
On 2 June 2017 at 03:18, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> On Fri, Jun 2, 2017 at 9:27 AM, Peter Geoghegan <p...@bowt.ie> wrote:
>> On Thu, Jun 1, 2017 at 2:24 PM, Thomas Munro
>> <thomas.mu...@enterprisedb.com> wrote:
>>> Why should ICU be any different than the system provider in this
>>> respect?  In both cases, we have a two-level comparison: first we use
>>> the collation-aware comparison, and then as a tie breaker, we use a
>>> binary comparison.  If we didn't do a binary comparison as a
>>> tie-breaker, wouldn't the result be logically incompatible with the =
>>> operator, which does a binary comparison?

OK. I was thinking we do the tie-breaker specifically because
strcoll_l() was unexpectedly returning 0 in some cases. Now I get it:
we do that to be compatible with texteq().

Secondly, I was also considering whether ICU in particular has a way to
customize a locale by setting attributes that dictate comparison or
sorting rules for a set of characters. I mean, if such a customized ICU
locale is defined in the system and we use it to create a PG collation,
I thought we might have to strictly follow those rules without a
tie-breaker, so as to be 100% conformant to ICU. I can't come up with an
example, or maybe there isn't one, but say there is a locale which is
supposed to sort only by the lowest comparison strength
(de@strength=1 ?? ). In that case, there might be many characters
considered equal, but the PG < operator or > operator would still return
true for those characters.
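
For what it's worth, here is a small standalone ICU4C demo (separate from
PostgreSQL code; the locale and the strings are just examples, and it
assumes ICU 50 or later for ucol_strcollUTF8) showing the effect of
lowering the collator strength to UCOL_PRIMARY: "abc" and "ABC" then
compare as equal, and only a byte-wise tie-breaker such as strcmp() would
impose an order between them.

/*
 * Standalone ICU demo, not PostgreSQL code.  Build with something like:
 *     cc icu_strength.c -licui18n -licuuc
 */
#include <stdio.h>
#include <string.h>
#include <unicode/ucol.h>

int
main(void)
{
    UErrorCode  status = U_ZERO_ERROR;
    UCollator  *coll = ucol_open("en_US", &status);
    const char *a = "abc";
    const char *b = "ABC";
    UCollationResult r;

    if (U_FAILURE(status))
        return 1;

    /* Compare at the lowest (primary) strength only: case is ignored. */
    ucol_setStrength(coll, UCOL_PRIMARY);

    r = ucol_strcollUTF8(coll, a, -1, b, -1, &status);

    printf("primary-strength compare: %s\n",
           r == UCOL_EQUAL ? "equal" : "not equal");            /* equal */
    printf("strcmp() orders them:     %d\n", strcmp(a, b) > 0); /* 1: 'a' > 'A' */

    ucol_close(coll);
    return 0;
}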


-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-02 Thread Amit Khandekar
On 2 June 2017 at 01:17, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>>> Regarding the trigger issue, I can't claim to have a terribly strong
>>> opinion on this.  I think that practically anything we do here might
>>> upset somebody, but probably any halfway-reasonable thing we choose to
>>> do will be OK for most people.  However, there seems to be a
>>> discrepancy between the approach that got the most votes and the one
>>> that is implemented by the v8 patch, so that seems like something to
>>> fix.
>>
>> Yes, I have started working on updating the patch to use that approach
>> (BR and AR update triggers on source and destination partition
>> respectively, instead of delete+insert) The approach taken by the
>> patch (BR update + delete+insert triggers) didn't require any changes
>> in the way ExecDelete() and ExecInsert() were called. Now we would
>> require to skip the delete/insert triggers, so some flags need to be
>> passed to these functions, or else have stripped down versions of
>> ExecDelete() and ExecInsert() which don't do other things like
>> RETURNING handling and firing triggers.
>
> See, that strikes me as a pretty good argument for firing the
> DELETE+INSERT triggers...
>
> I'm not wedded to that approach, but "what makes the code simplest?"
> is not a bad tiebreak, other things being equal.

Yes, that sounds good to me. But I think we want to wait for others'
opinions, because it is quite understandable that two triggers firing on
the same partition sounds odd.

>
>>> In terms of the approach taken by the patch itself, it seems
>>> surprising to me that the patch only calls
>>> ExecSetupPartitionTupleRouting when an update fails the partition
>>> constraint.  Note that in the insert case, we call that function at
>>> the start of execution;
>>
>>> calling it in the middle seems to involve additional hazards;
>>> for example, is it really safe to add additional
>>> ResultRelInfos midway through the operation?
>>
>> I thought since the additional ResultRelInfos go into
>> mtstate->mt_partitions which is independent of
>> estate->es_result_relations, that should be safe.
>
> I don't know.  That sounds scary to me, but it might be OK.  Probably
> needs more study.
>
>>> Is it safe to take more locks midway through the operation?
>>
>> I can imagine some rows already updated, when other tasks like ALTER
>> TABLE or CREATE INDEX happen on other partitions which are still
>> unlocked, and then for row movement we try to lock these other
>> partitions and wait for the DDL tasks to complete. But I didn't see
>> any particular issues with that. But correct me if you suspect a
>> possible issue. One issue can be if we were able to modify the table
>> attributes, but I believe we cannot do that for inherited columns.
>
> It's just that it's very unlike what we do anywhere else.  I don't
> have a real specific idea in mind about what might totally break, but
> at a minimum it could certainly cause behavior that can't happen
> today.  Today, if you run a query on some tables, it will block
> waiting for any locks at the beginning of the query, and the query
> won't begin executing until it has all of the locks.  With this
> approach, you might block midway through; you might even deadlock
> midway through.  Maybe that's not overtly broken, but it's at least
> got the possibility of being surprising.
>
> Now, I'd actually kind of like to have behavior like this for other
> cases, too.  If we're inserting one row, can't we just lock the one
> partition into which it needs to get inserted, rather than all of
> them?  But I'm wary of introducing such behavior incidentally in a
> patch whose main goal is to allow UPDATE row movement.  Figuring out
> what could go wrong and fixing it seems like a substantial project all
> of its own.

Yes, I agree it makes sense to avoid, as far as possible, introducing
something in this patch that we haven't tried before.

>
>> The reason I thought it cannot be done at the start of the execution,
>> is because even if we know that update is not modifying the partition
>> key column, we are not certain that the final NEW row has its
>> partition key column unchanged, because of triggers. I understand it
>> might be weird for a user requiring to modify a partition key value,
>> but if a user does that, it will result in crash because we won't have
>> the partition routing setup, thinking that there is no partition key
>> column in the UPDATE.
>
&

[HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-01 Thread Amit Khandekar
While comparing two text strings using varstr_cmp(), if the *strcoll*()
call returns 0, we do a strcmp() tie-breaker, i.e. a binary comparison,
because strcoll() can return 0 for non-identical strings :

varstr_cmp()
{
...
/*
* In some locales strcoll() can claim that nonidentical strings are
* equal.  Believing that would be bad news for a number of reasons,
* so we follow Perl's lead and sort "equal" strings according to
* strcmp().
*/
if (result == 0)
result = strcmp(a1p, a2p);
...
}

But is this supposed to apply to ICU collations as well? If the
collation provider is icu, the comparison is done using
ucol_strcoll*(). I suspect that ucol_strcoll*() intentionally reports
some characters as identical, so doing strcmp() may not make
sense.

For example, if the two characters below are compared using
ucol_strcollUTF8(), it returns 0, meaning the strings are identical :
Greek Oxia : UTF-16 encoding : 0x1FFD
(http://www.fileformat.info/info/unicode/char/1ffd/index.htm)
Greek Tonos : UTF-16 encoding : 0x0384
(http://www.fileformat.info/info/unicode/char/0384/index.htm)

The characters are displayed like this :
postgres=# select (U&'\+001FFD') , (U&'\+000384') collate ucatest;
 ?column? | ?column?
--+--
 ´| ΄
(Although this example has similar looking characters, this might not
be a factor behind treating them equal)

Now since ucol_strcoll*() returns 0, these strings are always compared
using strcmp(), so 1FFD > 0384 returns true :

create collation ucatest (locale = 'en_US.UTF8', provider = 'icu');

postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
 ?column?
--
 t

Whereas, if strcmp() is skipped for ICU collations :
if (result == 0 && !(mylocale && mylocale->provider == COLLPROVIDER_ICU))
   result = strcmp(a1p, a2p);

... then the comparison using the ICU collation reports that they are identical strings :

postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
 ?column?
--
 f
(1 row)

postgres=# select (U&'\+001FFD') < (U&'\+000384') collate ucatest;
 ?column?
--
 f
(1 row)

postgres=# select (U&'\+001FFD') <= (U&'\+000384') collate ucatest;
 ?column?
--
 t


Now, I have verified that strcoll() does report 1FFD > 0384. So it
looks like the ICU API function ucol_strcoll() treats them as equal
intentionally. That's the reason I feel the
strcmp-if-strcoll-returns-0 tie-breaker might not be applicable for ICU.
But I may be wrong; please correct me if I am missing something.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-01 Thread Amit Khandekar
execution;

> calling it in the middle seems to involve additional hazards;
> for example, is it really safe to add additional
> ResultRelInfos midway through the operation?

I thought since the additional ResultRelInfos go into
mtstate->mt_partitions which is independent of
estate->es_result_relations, that should be safe.

> Is it safe to take more locks midway through the operation?

I can imagine some rows having already been updated when other tasks
like ALTER TABLE or CREATE INDEX happen on other partitions which are
still unlocked, and then, for row movement, we try to lock these other
partitions and wait for the DDL tasks to complete. But I didn't see
any particular issue with that; correct me if you suspect one. One
issue could arise if we were able to modify the table attributes, but I
believe we cannot do that for inherited columns.

> It seems like it might be a lot
> safer to decide at the beginning of the operation whether this is
> needed -- we can skip it if none of the columns involved in the
> partition key (or partition key expressions) are mentioned in the
> update.
> (There's also the issue of triggers,

The reason I thought it cannot be done at the start of execution is
that even if we know the update is not modifying the partition key
column, we are not certain that the final NEW row has its partition key
column unchanged, because of triggers. I understand it might be unusual
for a user to want to modify a partition key value this way, but if a
user does that, it will result in a crash, because we won't have the
partition routing set up, having assumed that there is no partition key
column in the UPDATE.

And we also cannot unconditionally set up the partition routing on all
updates, for performance reasons.

> I'm not sure that it's sensible to allow a trigger on an
> individual partition to reroute an update to another partition
> what if we get an infinite loop?)

You mean, if the other table has another trigger that will again route
the row back to the original partition? But this infinite-loop problem
could occur even for two normal tables, couldn't it?

>
> +if (concurrently_deleted)
> +return NULL;
>
> I don't understand the motivation for this change, and there are no
> comments explaining it that I can see.

Yeah, comments are missing, I think. I thought they were there in
ExecDelete(), but they are not.
If a concurrent delete has already deleted the row, we should not bother
moving the row; hence the above code.


> Perhaps the concurrency-related (i.e. EPQ) behavior here could be
> tested via the isolation tester.
Will check.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-05-28 Thread Amit Khandekar
On 24 May 2017 at 20:16, Amit Kapila  wrote:
> On Wed, May 24, 2017 at 8:14 PM, Amit Kapila  wrote:
>> Apart from above, there is one open issue [1]
>>
>
> Forget to mention the link, doing it now.
>
> [1] - 
> https://www.postgresql.org/message-id/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com

I am not sure right now whether setting the t_ctid of such tuples to an
invalid value would be the right option, especially because I think an
invalid t_ctid can already carry some other meaning. But maybe we
can check this more.

If we decide to error out using some way, I would be inclined towards
considering re-using some combinations of infomask bits (like
HEAP_MOVED_OFF as suggested upthread) rather than using invalid t_ctid
value.

But I think we can also take a step-by-step approach even for v11. If
we agree that it is OK to silently do the updates as long as we
document the behaviour, we can go ahead and do this, and then, as a
second step, implement error handling as a separate patch. If that
patch does not materialize, we at least have the current behaviour
documented.

Ideally, I think we would have liked to somehow make the row-movement
UPDATE itself abort if it finds any normal updates waiting for it to
finish, rather than making the normal updates fail because a row
movement occurred. But I think we will have to live with it.




Re: [HACKERS] UPDATE of partition key

2017-05-24 Thread Amit Khandekar
On 12 May 2017 at 09:27, Amit Kapila  wrote:
>
> + is_partitioned_table =
> + root_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
> +
> + if (is_partitioned_table)
> + ExecSetupPartitionTupleRouting(
> + root_rel,
> + /* Build WITH CHECK OPTION constraints for leaf partitions */
> + ExecInitPartitionWithCheckOptions(mtstate, root_rel);
> + /* Build a projection for each leaf partition rel. */
> + ExecInitPartitionReturningProjection(mtstate, root_rel);
> ..
> + /* It's not a partitioned table after all; error out. */
> + ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
>
> When we are anyway going to give error if table is not a partitioned
> table, then isn't it better to give it early when we first identify
> that.

Yeah, that's right; fixed. Moved the partitioned-table check earlier.
This also showed that there is no need for the is_partitioned_table
variable, so I adjusted the code accordingly.


> -
> +static void ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate,
> +  Relation root_rel);
> Spurious line delete.

Done.

Also rebased the patch over latest code.

Attached v8 patch.


update-partition-key_v8.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-05-24 Thread Amit Khandekar
On 18 May 2017 at 16:52, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Wed, May 17, 2017 at 4:05 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>> On Wed, May 17, 2017 at 3:15 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>>> Earlier I thought that option1 is better but later I think that this
>>>> can complicate the situation as we are firing first BR update then BR
>>>> delete and can change the row multiple time and defining such
>>>> behaviour can be complicated.
>>>>
>>>
>>> If we have to go by this theory, then the option you have preferred
>>> will still execute BR triggers for both delete and insert, so input
>>> row can still be changed twice.
>>
>> Yeah, right as per my theory above Option3 have the same problem.
>>
>> But after putting some more thought I realised that only for "Before
>> Update" or the "Before Insert" trigger row can be changed.
>>
>
> Before Row Delete triggers can suppress the delete operation itself
> which is kind of unintended in this case.  I think without the user
> being aware it doesn't seem advisable to execute multiple BR triggers.

By now, the majority of opinions are against two triggers getting fired
on a single update. Amit, do you consider option 2 a valid option?
That is, fire only UPDATE triggers: BR on the source partition, and AR
on the destination partition. Do you agree that firing the BR update
trigger is essential, since it can modify the row and even prevent the
update from happening?

Also, since a user does not have a provision to install a common
UPDATE row trigger, (s)he installs it on each of the leaf partitions.
Then, when an update causes row movement, option 3 would end up not
firing update triggers on any of the partitions. So I prefer option 2
over option 3, i.e. make sure to fire the BR and AR update triggers.
Actually, option 2 is what Robert had proposed in the beginning.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-05-18 Thread Amit Khandekar
On 17 May 2017 at 17:29, Rushabh Lathia <rushabh.lat...@gmail.com> wrote:
>
>
> On Wed, May 17, 2017 at 12:06 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>>
>> On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar <amitdkhan...@gmail.com>
>> wrote:
>> > Option 3
>> > 
>> >
>> > BR, AR delete triggers on source partition
>> > BR, AR insert triggers on destination partition.
>> >
>> > Rationale :
>> > Since the update is converted to delete+insert, just skip the update
>> > triggers completely.
>>
>> +1 to option3
>> Generally, BR triggers are used for updating the ROW value and AR
>> triggers to VALIDATE the row or to modify some other tables.  So it
>> seems that we can fire the triggers what is actual operation is
>> happening at the partition level.
>>
>> For source partition, it's only the delete operation (no update
>> happened) so we fire delete triggers and for the destination only
>> insert operations so fire only inserts triggers.  That will keep the
>> things simple.  And, it will also be in sync with the actual partition
>> level delete/insert operations.
>>
>> We may argue that user might have declared only update triggers and as
>> he has executed the update operation he may expect those triggers to
>> get fired.  But, I think this behaviour can be documented with the
>> proper logic that if the user is updating the partition key then he
>> must be ready with the Delete/Insert triggers also, he can not rely
>> only upon update level triggers.
>>
>
> Right, that is even my concern. That user might had declared only update
> triggers and when user executing UPDATE its expect it to get call - but
> with option 3 its not happening.

Yes, that's the issue with option 3. A user wants to make sure update
triggers run, and here we are skipping the BEFORE update triggers; and
the user's trigger might even modify rows.

Now, regarding the AR update triggers: the user might be more
concerned with the non-partition-key columns, and an UPDATE of the
partition key typically would update only the partition key and not
the other columns. So for the typical case, it makes sense to skip the
AR UPDATE trigger. But if the UPDATE contains both partition-key as
well as other column updates, it makes sense to fire the AR UPDATE
trigger. One thing we could do is restrict an UPDATE from having both
partition-key and non-partition-key column updates. That way we could
always skip the AR update triggers for row-movement updates, unless
maybe we fire AR UPDATE triggers *only* if they are created using
"BEFORE UPDATE OF " and the column is the partition key.

Between skipping the delete/insert triggers and skipping the update
triggers, I would go for skipping the delete/insert triggers. I think we
cannot skip BR update triggers, because that would be a correctness
issue.

From a user's perspective, I think the user would like to install a
trigger that fires if any of the child tables gets modified. But
because there is no provision to install a common trigger, the user
has to install the same trigger on every child table. In that sense,
it might not matter whether we fire the AR UPDATE trigger on the old
partition or the new partition.

>
> In term of consistency option 1 looks better. Its doing the same what
> its been implemented for the UPSERT - so that user might be already
> aware of trigger behaviour. Plus if we document the behaviour then it
> sounds correct -
>
> - Original command was UPDATE so BR update
> - Later found that its ROW movement - so BR delete followed by AR delete
> - Then Insert in new partition - so BR INSERT followed by AR Insert.
>
> But again I am not quite sure how good it will be to compare the partition
> behaviour with the UPSERT.
>
>
>
>>
>> Earlier I thought that option1 is better but later I think that this
>> can complicate the situation as we are firing first BR update then BR
>> delete and can change the row multiple time and defining such
>> behaviour can be complicated.
>>
>> --
>> Regards,
>> Dilip Kumar
>> EnterpriseDB: http://www.enterprisedb.com
>>
>>
>
>
>
>
> --
> Rushabh Lathia



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-05-12 Thread Amit Khandekar
On 12 May 2017 at 14:56, Amit Kapila  wrote:
> I think it might be better to summarize all the options discussed
> including what the patch has and see what most people consider as
> sensible.

Yes, makes sense. Here are the options that were discussed so far for
ROW triggers :

Option 1 : (the patch follows this option)
--
BR Update trigger for source partition.
BR,AR Delete trigger for source partition.
BR,AR Insert trigger for destination partition.
No AR Update trigger.

Rationale :

The BR UPDATE trigger should be fired because that trigger can modify
the row, and that can result in a partition key update even though
the UPDATE statement itself is not updating the partition key.

Also, fire the delete/insert triggers on the respective partitions, since
the rows are about to be deleted/inserted. The AR UPDATE trigger should
not be fired, because that requires an actual update to have happened.



Option 2
--
BR Update trigger for source partition.
AR Update trigger on destination partition.
No insert/delete triggers.

Rationale :

Since it's an UPDATE statement, only update triggers should be fired.
The update ends up moving the row into another partition, so AR Update
trigger should be fired on this partition rather than the original
partition.

Option 3


BR, AR delete triggers on source partition
BR, AR insert triggers on destination partition.

Rationale :
Since the update is converted to delete+insert, just skip the update
triggers completely.



Option 4


BR-AR update triggers for source partition
BR-AR insert triggers for destination partition

Rationale :
Since it is an update statement, both the BR and AR UPDATE triggers
should be fired on the original partition.
Since the update is converted to delete+insert, the corresponding
triggers should be fired; but since we are already firing the UPDATE
triggers on the original partition, skip the delete triggers, otherwise
both UPDATE and DELETE triggers would get fired on the same partition.




For statement triggers, I think it should be based on the
documentation recently checked in for partitions in general.

+A statement that targets a parent table in an inheritance or partitioning
+hierarchy does not cause the statement-level triggers of affected child
+tables to be fired; only the parent table's statement-level triggers are
+fired.  However, row-level triggers of any affected child tables will be
+fired.

Based on that, for row movement as well, the trigger should be fired
only for the table referred to in the UPDATE statement, and not for any
child tables or for any partitions to which the rows were moved. The
doc in this row-movement patch also matches this behaviour.




Re: [HACKERS] UPDATE of partition key

2017-05-11 Thread Amit Khandekar
>> 3.
>> +   longer satisfy the partition constraint of the containing partition. In 
>> that
>> +   case, if there is some other partition in the partition tree for which 
>> this
>> +   row satisfies its partition constraint, then the row is moved to that
>> +   partition. If there isn't such a partition, an error will occur.
>>
>> Doesn't this error case indicate that this needs to be integrated with
>> Default partition patch of Rahila or that patch needs to take care
>> this error case?
>> Basically, if there is no matching partition, then move it to default 
>> partition.
>
> Will have a look on this. Thanks for pointing this out.

I tried update row movement with both my patch and the default-partition
patch applied, and it looks like it works as expected :

1. When an update changes the partition key such that the row does
not fit into any of the non-default partitions, the row is moved to
the default partition.
2. If the row does fit into a non-default partition, the row moves
into that partition.
3. If a row from the default partition is updated such that it fits into
one of the non-default partitions, it moves into that partition. We can
debate whether the row should stay in the default partition or move; I
think it should be moved, since the row now has a suitable partition.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-05-11 Thread Amit Khandekar
On 12 May 2017 at 08:30, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, May 11, 2017 at 5:41 PM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> On 11 May 2017 at 17:23, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar <amitdkhan...@gmail.com> 
>>> wrote:
>>>> On 4 March 2017 at 12:49, Robert Haas <robertmh...@gmail.com> wrote:
>>>>> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan...@gmail.com> 
>>>>> wrote:
>>>>>> I think it does not make sense running after row triggers in case of
>>>>>> row-movement. There is no update happened on that leaf partition. This
>>>>>> reasoning can also apply to BR update triggers. But the reasons for
>>>>>> having a BR trigger and AR triggers are quite different. Generally, a
>>>>>> user needs to do some modifications to the row before getting the
>>>>>> final NEW row into the database, and hence [s]he defines a BR trigger
>>>>>> for that. And we can't just silently skip this step only because the
>>>>>> final row went into some other partition; in fact the row-movement
>>>>>> itself might depend on what the BR trigger did with the row. Whereas,
>>>>>> AR triggers are typically written for doing some other operation once
>>>>>> it is made sure the row is actually updated. In case of row-movement,
>>>>>> it is not actually updated.
>>>>>
>>>>> How about running the BR update triggers for the old partition and the
>>>>> AR update triggers for the new partition?  It seems weird to run BR
>>>>> update triggers but not AR update triggers.  Another option would be
>>>>> to run BR and AR delete triggers and then BR and AR insert triggers,
>>>>> emphasizing the choice to treat this update as a delete + insert, but
>>>>> (as Amit Kh. pointed out to me when we were in a room together this
>>>>> week) that precludes using the BEFORE trigger to modify the row.
>>>>>
>>>
>>> I also find the current behavior with respect to triggers quite odd.
>>> The two points that appears odd are (a) Executing both before row
>>> update and delete triggers on original partition sounds quite odd.
>> Note that *before* trigger gets fired *before* the update happens. The
>> actual update may not even happen, depending upon what the trigger
>> does. And then in our case, the update does not happen; not just that,
>> it is transformed into delete-insert. So then we should fire
>> before-delete trigger.
>>
>
> Sure, I am aware of that point, but it doesn't seem obvious that both
> update and delete BR triggers get fired for original partition.  As a
> developer, it might be obvious to you that as you have used delete and
> insert interface, it is okay that corresponding BR/AR triggers get
> fired, however, it is not so obvious for others, rather it appears
> quite odd.

I agree that it seems a bit odd that we are firing both update and
delete triggers on the same partition. But if you look at it from the
perspective that the update => delete+insert is an operation the user is
aware of, it does not seem that odd.

> If we try to compare it with the non-partitioned update,
> there also it is internally a delete and insert operation, but we
> don't fire triggers for those.

For a non-partitioned table, the delete+insert is internal, whereas
for a partitioned table it is completely visible to the user.

>
>>> (b) It seems inconsistent to consider behavior for row and statement
>>> triggers differently
>>
>> I am not sure whether we should compare row and statement triggers.
>> Statement triggers are anyway fired only per-statement, depending upon
>> whether it is update or insert or delete. It has nothing to do with
>> how the rows are modified.
>>
>
> Okay.  The broader point I was trying to convey was that the way this
> patch defines the behavior of triggers doesn't sound good to me.  It
> appears to me that in this thread multiple people have raised points
> around trigger behavior and you should try to consider those.

I understand that there is no single solution which will provide
completely intuitive trigger behaviour. Skipping the BR delete trigger
should be fine. But then, for consistency, we should skip the BR insert
trigger as well, the theory being that the delete+insert is not issued
by the user, so we should not fire those triggers. But I feel both should be
fired to avoid any consequences unexpected to the user who has
in

Re: [HACKERS] UPDATE of partition key

2017-05-11 Thread Amit Khandekar
On 12 May 2017 at 10:01, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Fri, May 12, 2017 at 9:27 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> On Thu, May 11, 2017 at 5:45 PM, Amit Khandekar <amitdkhan...@gmail.com> 
>> wrote:
>>> On 11 May 2017 at 17:24, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>>> Few comments:
>>>> 1.
>>>> Operating directly on partition doesn't allow update to move row.
>>>> Refer below example:
>>>> create table t1(c1 int) partition by range(c1);
>>>> create table t1_part_1 partition of t1 for values from (1) to (100);
>>>> create table t1_part_2 partition of t1 for values from (100) to (200);
>>>> insert into t1 values(generate_series(1,11));
>>>> insert into t1 values(generate_series(110,120));
>>>>
>>>> postgres=# update t1_part_1 set c1=122 where c1=11;
>>>> ERROR:  new row for relation "t1_part_1" violates partition constraint
>>>> DETAIL:  Failing row contains (122).
>>>
>>> Yes, as Robert said, this is expected behaviour. We move the row only
>>> within the partition subtree that has the update table as its root. In
>>> this case, it's the leaf partition.
>>>
>>
>> Okay, but what is the technical reason behind it?  Is it because the
>> current design doesn't support it or is it because of something very
>> fundamental to partitions?
No, we could do that if we decided to update some table outside the
partition subtree. The reason is more a matter of semantics. I think the
user who is running an UPDATE on a partitioned table should not
necessarily be aware of the structure of the complete partition tree
outside of the current subtree. It is always safer to return an error
instead of silently moving the data outside of the subtree.

>>
>
> One plausible theory is that as Select's on partitions just returns
> the rows of that partition, the update should also behave in same way.

Yes, right. Even inserts fail if we try to insert data that does
not fit into the current subtree.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-05-11 Thread Amit Khandekar
On 11 May 2017 at 17:24, Amit Kapila <amit.kapil...@gmail.com> wrote:
> Few comments:
> 1.
> Operating directly on partition doesn't allow update to move row.
> Refer below example:
> create table t1(c1 int) partition by range(c1);
> create table t1_part_1 partition of t1 for values from (1) to (100);
> create table t1_part_2 partition of t1 for values from (100) to (200);
> insert into t1 values(generate_series(1,11));
> insert into t1 values(generate_series(110,120));
>
> postgres=# update t1_part_1 set c1=122 where c1=11;
> ERROR:  new row for relation "t1_part_1" violates partition constraint
> DETAIL:  Failing row contains (122).

Yes, as Robert said, this is expected behaviour. We move the row only
within the partition subtree that has the updated table as its root. In
this case, that's the leaf partition.

>
> 3.
> +   longer satisfy the partition constraint of the containing partition. In 
> that
> +   case, if there is some other partition in the partition tree for which 
> this
> +   row satisfies its partition constraint, then the row is moved to that
> +   partition. If there isn't such a partition, an error will occur.
>
> Doesn't this error case indicate that this needs to be integrated with
> Default partition patch of Rahila or that patch needs to take care
> this error case?
> Basically, if there is no matching partition, then move it to default 
> partition.

Will have a look at this. Thanks for pointing it out.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-05-11 Thread Amit Khandekar
On 11 May 2017 at 17:23, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> On 4 March 2017 at 12:49, Robert Haas <robertmh...@gmail.com> wrote:
>>> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan...@gmail.com> 
>>> wrote:
>>>> I think it does not make sense running after row triggers in case of
>>>> row-movement. There is no update happened on that leaf partition. This
>>>> reasoning can also apply to BR update triggers. But the reasons for
>>>> having a BR trigger and AR triggers are quite different. Generally, a
>>>> user needs to do some modifications to the row before getting the
>>>> final NEW row into the database, and hence [s]he defines a BR trigger
>>>> for that. And we can't just silently skip this step only because the
>>>> final row went into some other partition; in fact the row-movement
>>>> itself might depend on what the BR trigger did with the row. Whereas,
>>>> AR triggers are typically written for doing some other operation once
>>>> it is made sure the row is actually updated. In case of row-movement,
>>>> it is not actually updated.
>>>
>>> How about running the BR update triggers for the old partition and the
>>> AR update triggers for the new partition?  It seems weird to run BR
>>> update triggers but not AR update triggers.  Another option would be
>>> to run BR and AR delete triggers and then BR and AR insert triggers,
>>> emphasizing the choice to treat this update as a delete + insert, but
>>> (as Amit Kh. pointed out to me when we were in a room together this
>>> week) that precludes using the BEFORE trigger to modify the row.
>>>
>
> I also find the current behavior with respect to triggers quite odd.
> The two points that appears odd are (a) Executing both before row
> update and delete triggers on original partition sounds quite odd.
Note that a *before* trigger gets fired *before* the update happens. The
actual update may not even happen, depending upon what the trigger
does. And in our case, the update does not happen; not just that,
it is transformed into delete+insert. So then we should fire the
before-delete trigger.

> (b) It seems inconsistent to consider behavior for row and statement
> triggers differently

I am not sure whether we should compare row and statement triggers.
Statement triggers are anyway fired only per-statement, depending upon
whether it is update or insert or delete. It has nothing to do with
how the rows are modified.


>
>>
>> I checked the trigger behaviour in case of UPSERT. Here, when there is
>> conflict found, ExecOnConflictUpdate() is called, and then the
>> function returns immediately, which means AR INSERT trigger will not
>> fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR
>> and AR UPDATE triggers will be fired. So in short, when an INSERT
>> becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE
>> and AR UPDATE also get fired. On the same lines, it makes sense in
>> case of UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on
>> the original table, and then the BR and AR DELETE/INSERT triggers on
>> the respective tables.
>>
>
> I am not sure if it is good idea to compare it with "Insert On
> Conflict Do Update", but  even if we want that way, I think Insert On
> Conflict is consistent in statement level triggers which means it will
> fire both Insert and Update statement level triggres (as per below
> note in docs) whereas the documentation in the patch indicates that
> this patch will only fire Update statement level triggers which is
> odd
>
> Note in docs about Insert On Conflict
> "Note that with an INSERT with an ON CONFLICT DO UPDATE clause, both
> INSERT and UPDATE statement level trigger will be fired.

I guess the reason this behaviour is kept for UPSERT is that the
statement itself suggests both operations: insert/update.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)

2017-05-05 Thread Amit Khandekar
On 5 May 2017 at 07:50, David Rowley <david.row...@2ndquadrant.com> wrote:
>  On 3 May 2017 at 07:13, Robert Haas <robertmh...@gmail.com> wrote:
>> It is of course possible that the Parallel Seq Scan could run into
>> contention problems if the number of workers is large, but in my
>> experience there are bigger problems here.  The non-parallel Seq Scan
>> can also contend -- not of course over the shared mutex because there
>> isn't one, but over access to the blocks themselves.  Every one of
>> those blocks has a content lock and a buffer header and so on, and
>> having multiple processes accessing those things at the same time
>> scales well, but not perfectly.  The Hash node can also contend: if
>> the hash join spills to disk, you've got multiple processes reading
>> and writing to the temp directory at the same time and, of course,
>> that can be worse than just one process doing it -- sometimes much
>> worse.  It can also be better, depending on how much I/O gets
>> generated and how much I/O bandwidth you have.
>
> Yeah, I did get some time to look over the contention in Parallel Seq
> Scan a while back and I discovered that on the machine that I was
> testing on. the lock obtained in heap_parallelscan_nextpage() was
> causing workers to have to wait for other workers to fetch their next
> task to work on.
>
> I ended up writing the attached (which I'd not intended to post until
> some time closer to when the doors open for PG11). At the moment it's
> basically just a test patch to see how it affects things when we give
> workers a bit more to do before they come back to look for more work.
> In this case, I've just given them 10 pages to work on, instead of the
> 1 that's allocated in 9.6 and v10.
>
> A quick test on a pretty large table on a large machine shows:
>
> Unpatched:
>
> postgres=# select count(*) from a;
>count
> 
>  187400
> (1 row)
>
> Time: 5211.485 ms (00:05.211)
>
> Patched:
>
> postgres=# select count(*) from a;
>count
> 
>  187400
> (1 row)
>
> Time: 2523.983 ms (00:02.524)

The result is quite impressive.

>
> So it seems worth looking into. "a" was just a table with a single int
> column. I'm unsure as yet if there are more gains to be had for tables
> with wider tuples. I do suspect the patch will be a bigger win in
> those cases, since there's less work to do for each page, e.g less
> advance aggregate calls, so likely they'll be looking for their next
> page a bit sooner.
>
> Now I'm not going to pretend that this patch is ready for the
> prime-time. I've not yet worked out how to properly report sync-scan
> locations without risking reporting later pages after reporting the
> end of the scan. What I have at the moment could cause a report to be
> missed if SYNC_SCAN_REPORT_INTERVAL is not divisible by the batch
> size. I'm also not sure how batching like this affect read-aheads, but
> at least the numbers above speak for something. Although none of the
> pages in this case came from disk.
>
> I'd had thoughts that the 10 pages wouldn't be constant, but the
> batching size would depend on the size of the relation to be scanned.
> I'd rough ideas to just try to make about 1 million batches. Something
> like batch_pages = Max(parallel_scan->phs_nblocks / 100, 1); so
> that we only take more than 1 page if there's some decent amount to
> process. We don't want to make the batches too big as we might end up
> having to wait on slow workers at the end of a scan.

I was wondering: if we keep increasing the batch size, that might
lead to I/O contention. I mean, the higher the batch size, the higher
the chance of causing more random I/O, because all workers would be
accessing disk blocks far away from each other in parallel. So there
might be a trade-off here. (Separately, I/O contention testing needs to
be done in general, for many scenarios.)

I believe there are certain parallel scans (parallel bitmap heap scan?)
where the logic to go to the next block consumes time, so the waits
would consequently be longer.

What if we supply each worker with a sequence of blocks to be
scanned, rather than a range of blocks? Each worker would have a list
of its next few blocks, say :
w1 : 1, 5, 9, 13
w2 : 2, 6, 10, 14
w3 : 3, 7, 11, 15.
w4 : .

Maybe the leader would do the accounting and store the
instructions for each of the workers at individual locations in shared
memory, so there won't be any contention while accessing them.

This may be simple/applicable for a sequential scan, but not for other
scans, for some of which this may not even be possible. But basically I
was thinking of a way to tackle shared memory contention as
well as random I/O.
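
Just to illustrate the shape of such an assignment, here is a trivial
standalone sketch (illustration only; the block count, worker count, and
starting block are made up, and real parallel scans hand out pages
through shared state instead):

/* Toy sketch: hand each worker a precomputed, strided list of block numbers. */
#include <stdio.h>

#define NBLOCKS  16
#define NWORKERS 4

int
main(void)
{
    for (int w = 0; w < NWORKERS; w++)
    {
        printf("w%d :", w + 1);
        /* worker w scans blocks w+1, w+1+NWORKERS, w+1+2*NWORKERS, ... */
        for (int blk = w + 1; blk <= NBLOCKS; blk += NWORKERS)
            printf(" %d", blk);
        printf("\n");
    }
    return 0;
}

This prints w1 : 1 5 9 13, w2 : 2 6 10 14, and so on, matching the layout
above.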

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-05-02 Thread Amit Khandekar
On 2 May 2017 at 18:17, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Apr 4, 2017 at 7:11 AM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>> Attached updated patch v7 has the above changes.
>
> This no longer applies.  Please rebase.

Thanks, Robert, for letting me know.

My patch has a separate function for emitting the error message when a
partition constraint fails, and the recent commit c0a8ae7be3 corrects
the way the tuple is formed for display in that error message. Hence
there were some code-level conflicts.

Attached is the rebased patch, which resolves the above conflicts.


update-partition-key_v7_rebased.patch
Description: Binary data



Re: [HACKERS] Declarative partitioning - another take

2017-04-26 Thread Amit Khandekar
On 26 April 2017 at 00:28, Robert Haas <robertmh...@gmail.com> wrote:
> So what I'd prefer -- on
> the totally unprincipled basis that it would let us improve
> performance in the future -- if you operate on a partition directly,
> you fire the partition's triggers, but if you operate on the parent,
> only the parent's triggers fire.

I would also opt for this behaviour.

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-04-18 Thread Amit Khandekar
On 7 April 2017 at 20:35, Andres Freund  wrote:
>> But for costs such as (4, 4, 4,   20 times), the logic would give
>> us 20 workers because we want to finish the Append in 4 time units;
>> and this what we want to avoid when we go with
>> don't-allocate-too-many-workers approach.
>
> I guess, my problem is that I don't agree with that as a goal in an of
> itself.  If you actually want to run your query quickly, you *want* 20
> workers here.  The issues of backend startup overhead (already via
> parallel_setup_cost), concurrency and such cost should be modelled, not
> burried in a formula the user can't change.  If we want to make it less
> and less likely to start more workers we should make that configurable,
> not the default.
> I think there's some precedent taken from the parallel seqscan case,
> that's not actually applicable here.  Parallel seqscans have a good
> amount of shared state, both on the kernel and pg side, and that shared
> state reduces gains of increasing the number of workers.  But with
> non-partial scans such shared state largely doesn't exist.

After searching through earlier mails about parallel scan, I am not
sure whether the shared state was considered a potential factor that
might reduce parallel query gains when deciding how to calculate the
number of workers for a parallel seq scan. I mean, even today, if we
allocate 10 workers instead of a calculated 4 for a parallel seq scan,
they might help. I think it's just that we don't know whether they would
*always* help or whether it would sometimes regress.




Re: [HACKERS] Parallel Append implementation

2017-04-07 Thread Amit Khandekar
On 6 April 2017 at 07:33, Andres Freund <and...@anarazel.de> wrote:
> On 2017-04-05 14:52:38 +0530, Amit Khandekar wrote:
>> This is what the earlier versions of my patch had done : just add up
>> per-subplan parallel_workers (1 for non-partial subplan and
>> subpath->parallel_workers for partial subplans) and set this total as
>> the Append parallel_workers.
>
> I don't think that's great, consider e.g. the case that you have one
> very expensive query, and a bunch of cheaper ones. Most of those workers
> wouldn't do much while waiting for the the expensive query.  What I'm
> basically thinking we should do is something like the following
> pythonesque pseudocode:
>
> best_nonpartial_cost = -1
> best_nonpartial_nworkers = -1
>
> for numworkers in 1...#max workers:
>worker_work = [0 for x in range(0, numworkers)]
>
>nonpartial_cost += startup_cost * numworkers
>
># distribute all nonpartial tasks over workers.  Assign tasks to the
># worker with the least amount of work already performed.
>for task in all_nonpartial_subqueries:
>least_busy_worker = worker_work.smallest()
>least_busy_worker += task.total_nonpartial_cost
>
># the nonpartial cost here is the largest amount any single worker
># has to perform.
>nonpartial_cost += worker_work.largest()
>
>total_partial_cost = 0
>for task in all_partial_subqueries:
>total_partial_cost += task.total_nonpartial_cost
>
># Compute resources needed by partial tasks. First compute how much
># cost we can distribute to workers that take shorter than the
># "busiest" worker doing non-partial tasks.
>remaining_avail_work = 0
>for i in range(0, numworkers):
>remaining_avail_work += worker_work.largest() - worker_work[i]
>
># Equally divide up remaining work over all workers
>if remaining_avail_work < total_partial_cost:
>   nonpartial_cost += (worker_work.largest - remaining_avail_work) / 
> numworkers
>
># check if this is the best number of workers
>if best_nonpartial_cost == -1 or best_nonpartial_cost > nonpartial_cost:
>   best_nonpartial_cost = worker_work.largest
>   best_nonpartial_nworkers = nworkers
>
> Does that make sense?

Yeah, I gather what you are trying to achieve is: allocate a number of
workers such that the total cost does not exceed the cost of the first
non-partial plan (i.e. the costliest one, because the plans are sorted
by descending cost).

So for non-partial costs such as (20, 10, 5, 2), allocate only 2
workers, because the 2nd worker will execute (10, 5, 2) while the 1st
worker executes (20).

But for costs such as (4, 4, 4, ... 20 times), the logic would give
us 20 workers because we want to finish the Append in 4 time units;
and this is what we want to avoid when we go with the
don't-allocate-too-many-workers approach.
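
To make those two examples concrete, here is a toy standalone sketch
(my own model, not the patch's costing code) of the goal of finishing no
later than the costliest non-partial subplan: find the smallest worker
count for which a greedy least-loaded assignment of the descending-sorted
costs keeps every worker's total within the largest single cost. It
reports 2 workers for (20, 10, 5, 2) and 20 workers for twenty subplans
of cost 4 each.

/* Toy model of capping the makespan at the cost of the costliest subplan. */
#include <stdio.h>

/* Greedily assign costs (sorted descending) to k workers; return the
 * largest per-worker total (the makespan). */
static double
makespan(const double *costs, int n, int k)
{
    double load[64] = {0};          /* assumes k <= 64 */
    double worst = 0;

    for (int i = 0; i < n; i++)
    {
        int least = 0;

        for (int w = 1; w < k; w++)
            if (load[w] < load[least])
                least = w;
        load[least] += costs[i];
    }
    for (int w = 0; w < k; w++)
        if (load[w] > worst)
            worst = load[w];
    return worst;
}

/* Smallest worker count whose makespan is no worse than the costliest subplan. */
static int
workers_needed(const double *costs, int n)
{
    for (int k = 1; k < n; k++)
        if (makespan(costs, n, k) <= costs[0])
            return k;
    return n;
}

int
main(void)
{
    double a[] = {20, 10, 5, 2};
    double b[] = {4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
                  4, 4, 4, 4, 4, 4, 4, 4, 4, 4};

    printf("(20, 10, 5, 2)   -> %d workers\n", workers_needed(a, 4));   /* 2 */
    printf("(4, 4, ... x 20) -> %d workers\n", workers_needed(b, 20));  /* 20 */
    return 0;
}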

>
>
>> BTW all of the above points apply only for non-partial plans.
>
> Indeed. But I think that's going to be a pretty common type of plan,

Yes it is.

> especially if we get partitionwise joins.

About that I am not sure: since we already have support for parallel
joins, wouldn't the join subpaths corresponding to all of the
partitions be partial paths? I may be wrong about that.

But if the subplans are foreign scans, then yes, all would be
non-partial plans. This may provoke an off-topic discussion, but here,
instead of assigning so many workers to all these foreign plans and
having all those workers wait for the results, a single asynchronous
execution node (which is still in the making) would be desirable,
because it would do the job of all these workers.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

