Re: [HACKERS] UPDATE of partition key

2017-11-13 Thread David Rowley
 On 10 November 2017 at 16:42, Amit Khandekar  wrote:
[ update-partition-key_v23.patch ]

Hi Amit,

Thanks for working on this. I'm looking forward to seeing this go in.

So... I've signed myself up to review the patch, and I've just had a
look at it (after first reading this entire email thread!).

Overall the patch looks like it's in quite a good shape. I think I do
agree with Robert about the UPDATE anomaly that's been discussed. I
don't think we're painting ourselves into any corner by not having
this working correctly right away. Anyone who's using some trigger
workarounds for the current lack of support for updating the partition
key is already going to have the same issues, so at least this will
save them some troubles implementing triggers and give them much
better performance. I see you've documented this fact too, which is
good.

I'm writing this email now as I've just run out of review time for today.

Here's what I noted down during my first pass:

1. Closing command tags in docs should not be abbreviated

triggers are concerned, AFTER DELETE and

This changed in c29c5789. I think Peter will be happy if you don't
abbreviate the closing tags.

2. "about to do" would read better as "about to perform"

 concurrent session, and it is about to do an UPDATE

I think this paragraph could be more clear if we identified the
sessions with a number.

Perhaps:
   Suppose session 1 is performing an UPDATE on a row's partition key,
   and meanwhile session 2 tries to perform an UPDATE or DELETE
   operation on the same row.  Session 2 can silently miss the row due
   to session 1's activity.  In such a case, session 2's UPDATE/DELETE,
   being unaware of the row movement, interprets this as the row having
   just been deleted, so there is nothing to be done for this row.
   Whereas, in the usual case where the table is not partitioned, or
   where there is no row movement, the second session would have
   identified the newly updated row and carried out the UPDATE/DELETE
   on this new row version.
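To make this concrete, here is a minimal made-up reproduction (it assumes the
patch is applied, so that the UPDATE is allowed to move the row):

CREATE TABLE tbl (a int) PARTITION BY RANGE (a);
CREATE TABLE tbl_1 PARTITION OF tbl FOR VALUES FROM (1) TO (10);
CREATE TABLE tbl_2 PARTITION OF tbl FOR VALUES FROM (10) TO (20);
INSERT INTO tbl VALUES (5);

-- Session 1: moves the row from tbl_1 to tbl_2 (internally a DELETE + INSERT).
BEGIN;
UPDATE tbl SET a = 15 WHERE a = 5;

-- Session 2, concurrently: blocks on the same row; once session 1 commits,
-- it sees the old version as deleted and silently updates zero rows instead
-- of following the update chain to the new row version.
UPDATE tbl SET a = a + 1 WHERE a = 5;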


3. Integer width. get_partition_natts returns int but we assign to int16.

int16 partnatts = get_partition_natts(key);

Confusingly, get_partition_col_attnum() returns int16 instead of AttrNumber,
but that's a pre-existing issue rather than something this patch introduces.

4. The following code could just be a single call: pull_varattnos((Node *) partexprs, 1, &child_keycols);

foreach(lc, partexprs)
{
    Node       *expr = (Node *) lfirst(lc);

    pull_varattnos(expr, 1, &child_keycols);
}

5. Triggers. Do we need a new "TG_" tag to allow trigger functions to
do something special when the DELETE/INSERT is a partition move? I have
audit tables in mind here: it may appear as though a user performed a
DELETE when they actually performed an UPDATE. Giving visibility of this
to the trigger function will allow the application to work around it.
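To make the audit-table concern concrete, here's a hedged sketch (all names are
invented for illustration). With the patch as it stands, the row triggers on
the two leaf partitions would log the moved row as a DELETE plus an INSERT,
with nothing telling the function it was really an UPDATE:

CREATE TABLE measurements (city int, reading int) PARTITION BY LIST (city);
CREATE TABLE measurements_1 PARTITION OF measurements FOR VALUES IN (1);
CREATE TABLE measurements_2 PARTITION OF measurements FOR VALUES IN (2);

CREATE TABLE audit_log (tab regclass, op text, logged_at timestamptz DEFAULT now());

CREATE FUNCTION audit_row_change() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    -- A hypothetical new "TG_" variable could let us record 'UPDATE (moved)'
    -- here instead of a misleading DELETE/INSERT pair.
    INSERT INTO audit_log(tab, op) VALUES (TG_RELID::regclass, TG_OP);
    RETURN NULL;
END;
$$;

-- Installed on each leaf partition, since row-level triggers cannot be
-- defined on the partitioned parent here.
CREATE TRIGGER audit_leaf AFTER INSERT OR UPDATE OR DELETE ON measurements_1
    FOR EACH ROW EXECUTE PROCEDURE audit_row_change();
CREATE TRIGGER audit_leaf AFTER INSERT OR UPDATE OR DELETE ON measurements_2
    FOR EACH ROW EXECUTE PROCEDURE audit_row_change();

INSERT INTO measurements VALUES (1, 100);
UPDATE measurements SET city = 2 WHERE city = 1;   -- logged as DELETE + INSERT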

6. change "row" to "a row" and "old" to "the old"

* depending on whether the event is for row being deleted from old

But to be honest, I'm having trouble parsing the comment. I think it
would be better to
say explicitly when the row will be NULL rather than "depending on
whether the event"

7. I'm confused with how this change came about. If the old comment
was correct here then the comment you're referring to here should
remain in ExecPartitionCheck(), but you're saying it's in
ExecConstraints().

/* See the comments in ExecConstraints. */

If the comment really is in ExecConstraints(), then you might want to
give an overview of what you mean, then reference ExecConstraints() if
more details are required.

8. I'm having trouble parsing this comment:

 * 'update_rri' has the UPDATE per-subplan result rels.

I think "has" should be "contains" ?

9. Also, this should likely be reworded:

 * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
 *  this is 0.

 'num_update_rri' number of elements in 'update_rri' array or zero for INSERT.

10. There should be no space before the '?'

/* Is this leaf partition present in the update resultrel ? */

11. I'm struggling to understand this comment:

* This is required when converting tuple as per root
* partition tuple descriptor.

"tuple" should probably be "the tuple", but not quite sure what you
mean by "as per root".

I may have misunderstood, but maybe it should read:

* This is required when we convert the partition's tuple to
* be compatible with the partitioned table's tuple descriptor.

12. I think "as well" would be better written as "either".

* If we didn't open the partition rel, it means we haven't
* initialized the result rel as well.

13. I'm unsure what is meant by the following comment:

* Verify result relation is a valid target for insert operation. Even
* for updates, we are doing this for tuple-routing, so again, we need
* to check the validity for insert operation.

I'm not quite sure where UPDATE comes in here as we're only checking for INSERT?

14. Use of underscores instead of camelCase.


Re: [HACKERS] UPDATE of partition key

2017-11-09 Thread Amit Khandekar
On 9 November 2017 at 09:27, Thomas Munro  wrote:
> On Wed, Nov 8, 2017 at 5:57 PM, Amit Khandekar  wrote:
>> On 8 November 2017 at 07:55, Thomas Munro  
>> wrote:
>>> On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas  wrote:
 The changes to trigger.c still make me super-nervous.  Hey THOMAS
 MUNRO, any chance you could review that part?
>
> At first, it seemed quite strange to me that row triggers and
> statement triggers fire different events for the same modification.
> Row triggers see DELETE +  INSERT (necessarily because different
> tables are involved), but this fact is hidden from the target table's
> statement triggers.
>
> The alternative would be for all triggers to see consistent events and
> transitions.  Instead of having your special case code in ExecInsert
> and ExecDelete that creates the two halves of a 'synthetic' UPDATE for
> the transition tables, you'd just let the existing ExecInsert and
> ExecDelete code do its thing, and you'd need a flag to record that you
> should also fire INSERT/DELETE after statement triggers if any rows
> moved.

Yeah, I had also thought about that, but felt that change was too
invasive. For example, letting ExecARInsertTriggers() do the transition
capture even when transition_capture->tcs_update_new_table is set.

I was also thinking of having a separate function to *only* add the
transition table rows, and in ExecInsert call that one instead of
ExecARUpdateTriggers(). But I realized that the existing
ExecARUpdateTriggers() looks like a better, more robust interface with all
its checks. It's just that calling ExecARUpdateTriggers() sounds like we
are also firing a trigger; we are not firing any trigger or saving any
event, we are just adding the transition row.

>
> After sleeping on this question, I am coming around to the view that
> the way you have it is right.  The distinction isn't really between
> row triggers and statement triggers, it's between triggers at
> different levels in the hierarchy.  It just so happens that we
> currently only fire target table statement triggers and leaf table row
> triggers.

Yes. And rows exist only in leaf partitions, so we have to make it
appear as though the target table itself has these rows. Like you
mentioned, the user has to get the impression of a normal table, so we
have to do something extra to capture the rows.
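A hedged SQL sketch of the intended behavior (names are made up; this assumes
the patch): the target table's statement-level UPDATE trigger sees the moved
row in both transition tables, even though at the leaf level the operation is
executed as a DELETE plus an INSERT.

CREATE TABLE t (a int) PARTITION BY LIST (a);
CREATE TABLE t1 PARTITION OF t FOR VALUES IN (1);
CREATE TABLE t2 PARTITION OF t FOR VALUES IN (2);
INSERT INTO t VALUES (1);

CREATE FUNCTION report_update() RETURNS trigger LANGUAGE plpgsql AS $$
DECLARE
    n_old bigint;
    n_new bigint;
BEGIN
    SELECT count(*) INTO n_old FROM old_rows;
    SELECT count(*) INTO n_new FROM new_rows;
    RAISE NOTICE 'UPDATE: % old row(s), % new row(s)', n_old, n_new;
    RETURN NULL;
END;
$$;

CREATE TRIGGER t_upd AFTER UPDATE ON t
    REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows
    FOR EACH STATEMENT EXECUTE PROCEDURE report_update();

UPDATE t SET a = 2;   -- moves the row from t1 to t2; should report 1 old, 1 new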

> Future development ideas that seem consistent with your choice:
>
> 1.  If we ever allow row triggers with transition tables on child
> tables, then I think *their* transition tables should certainly see
> the deletes and inserts, otherwise OLD TABLE and NEW TABLE would be
> inconsistent with the OLD and NEW variables in a single trigger
> invocation.  (These were prohibited mainly due to lack of time and
> (AFAIK) limited usefulness; I think they would probably need
> their own separate tuplestores, or possibly some kind of filtering.)

As we know, for row triggers on leaf partitions, we treat them as
normal tables, so a trigger written on a leaf partition sees only the
local changes. The trigger is unaware whether the insert is part of an
UPDATE row movement. Similarly, the transition table referenced by
that row trigger function should see only the NEW table, not the old
table.

>
> 2.  If we ever allow row triggers on partitioned tables (ie that fire
> when its children are modified), then I think their UPDATE trigger
> should probably fire when a row moves between any two (grand-)*child
> tables, just as you have it for target table statement triggers.

Yes I agree.

> It doesn't matter that the view from parent tables' triggers is
> inconsistent with the view from leaf table triggers: it's a feature
> that we 'hide' partitioning from the user to the extent we can so that
> you can treat the partitioned table just like a table.
>
> Any other views?

I think that because there is no provision for a row trigger on a
partitioned table, users who want a common trigger on a partition
subtree have no choice but to create the same trigger individually on
the leaf partitions. And that's the reason we cannot handle update row
movement with triggers without anomalies.
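For completeness, a hedged sketch of that workaround (names invented; it
assumes a made-up parent table tbl and a pre-existing trigger function
common_trigger_fn(), and it only handles a single level of partitions, so a
multi-level tree would need a recursive query):

DO $$
DECLARE
    part regclass;
BEGIN
    FOR part IN
        SELECT inhrelid::regclass
        FROM pg_inherits
        WHERE inhparent = 'tbl'::regclass
    LOOP
        EXECUTE format(
            'CREATE TRIGGER common_row_trg AFTER INSERT OR UPDATE OR DELETE ON %s
             FOR EACH ROW EXECUTE PROCEDURE common_trigger_fn()', part);
    END LOOP;
END;
$$;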

Thanks
-Amit Khandekar




Re: [HACKERS] UPDATE of partition key

2017-11-08 Thread Thomas Munro
On Wed, Nov 8, 2017 at 5:57 PM, Amit Khandekar  wrote:
> On 8 November 2017 at 07:55, Thomas Munro  
> wrote:
>> On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas  wrote:
>>> The changes to trigger.c still make me super-nervous.  Hey THOMAS
>>> MUNRO, any chance you could review that part?

At first, it seemed quite strange to me that row triggers and
statement triggers fire different events for the same modification.
Row triggers see DELETE +  INSERT (necessarily because different
tables are involved), but this fact is hidden from the target table's
statement triggers.

The alternative would be for all triggers to see consistent events and
transitions.  Instead of having your special case code in ExecInsert
and ExecDelete that creates the two halves of a 'synthetic' UPDATE for
the transition tables, you'd just let the existing ExecInsert and
ExecDelete code do its thing, and you'd need a flag to record that you
should also fire INSERT/DELETE after statement triggers if any rows
moved.

After sleeping on this question, I am coming around to the view that
the way you have it is right.  The distinction isn't really between
row triggers and statement triggers, it's between triggers at
different levels in the hierarchy.  It just so happens that we
currently only fire target table statement triggers and leaf table row
triggers.  Future development ideas that seem consistent with your
choice:

1.  If we ever allow row triggers with transition tables on child
tables, then I think *their* transition tables should certainly see
the deletes and inserts, otherwise OLD TABLE and NEW TABLE would be
inconsistent with the OLD and NEW variables in a single trigger
invocation.  (These were prohibited mainly due to lack of time and
(AFAIK) limited usefulness; I think they would probably need
their own separate tuplestores, or possibly some kind of filtering.)

2.  If we ever allow row triggers on partitioned tables (ie that fire
when its children are modified), then I think their UPDATE trigger
should probably fire when a row moves between any two (grand-)*child
tables, just as you have it for target table statement triggers.  It
doesn't matter that the view from parent tables' triggers is
inconsistent with the view from leaf table triggers: it's a feature
that we 'hide' partitioning from the user to the extent we can so that
you can treat the partitioned table just like a table.

Any other views?

As for the code, I haven't figured out how to break it yet, and I'm
wondering if there is some way to refactor so that ExecInsert and
ExecDelete don't have to record pseudo-UPDATE trigger events.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-11-07 Thread Thomas Munro
On Wed, Nov 8, 2017 at 5:57 PM, Amit Khandekar  wrote:
> Thomas, can you please try the attached incremental patch
> regress_locale_changes.patch and check if the test passes ? The patch
> is to be applied on the main v22 patch. If the test passes, I will
> include these changes (also for list_parted) in the upcoming v23
> patch.

That looks good.  Thanks.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-11-07 Thread Amit Khandekar
On 8 November 2017 at 07:55, Thomas Munro  wrote:
> On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas  wrote:
>> The changes to trigger.c still make me super-nervous.  Hey THOMAS
>> MUNRO, any chance you could review that part?
>
> Looking, but here's one silly thing that jumped out at me while
> getting started with this patch.  I cannot seem to convince my macOS
> system to agree with the expected sort order from :show_data, where
> underscores precede numbers:
>
>   part_a_10_a_20 | a | 10 | 200 |  1 |
>   part_a_1_a_10  | a |  1 |   1 |  1 |
> - part_d_1_15| b | 15 | 146 |  1 |
> - part_d_1_15| b | 16 | 147 |  2 |
>   part_d_15_20   | b | 17 | 155 | 16 |
>   part_d_15_20   | b | 19 | 155 | 19 |
> + part_d_1_15| b | 15 | 146 |  1 |
> + part_d_1_15| b | 16 | 147 |  2 |
>
> It seems that macOS (like older BSDs) just doesn't know how to sort
> Unicode and falls back to sorting the bits.  I expect that means that
> the test will also fail on any other OS with "make check
> LC_COLLATE=C".  I believe our regression tests are supposed to pass
> with a wide range of collations including C, so I wonder if this means
> we should stick a leading zero on those single digit numbers, or
> something, to stabilise the output.

I would prefer to retain the partition names. I have now added
COLLATE "C" for partname, like this:

-\set show_data 'select tableoid::regclass::text partname, * from
range_parted order by 1, 2, 3, 4, 5, 6'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname,
* from range_parted order by 1, 2, 3, 4, 5, 6'
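For reference, a quick standalone illustration of the collation dependence
being worked around here (not part of the patch):

SELECT name
FROM (VALUES ('part_d_1_15'), ('part_d_15_20')) AS v(name)
ORDER BY name COLLATE "C";
-- Under COLLATE "C" the digit '5' (0x35) sorts before '_' (0x5F), so
-- part_d_15_20 comes first; a linguistic collation such as en_US may weight
-- the underscore differently and give the opposite order.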

Thomas, can you please try the attached incremental patch
regress_locale_changes.patch and check if the test passes ? The patch
is to be applied on the main v22 patch. If the test passes, I will
include these changes (also for list_parted) in the upcoming v23
patch.

Thanks
-Amit Khandekar


regress_locale_changes.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-11-07 Thread Thomas Munro
On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas  wrote:
> The changes to trigger.c still make me super-nervous.  Hey THOMAS
> MUNRO, any chance you could review that part?

Looking, but here's one silly thing that jumped out at me while
getting started with this patch.  I cannot seem to convince my macOS
system to agree with the expected sort order from :show_data, where
underscores precede numbers:

  part_a_10_a_20 | a | 10 | 200 |  1 |
  part_a_1_a_10  | a |  1 |   1 |  1 |
- part_d_1_15| b | 15 | 146 |  1 |
- part_d_1_15| b | 16 | 147 |  2 |
  part_d_15_20   | b | 17 | 155 | 16 |
  part_d_15_20   | b | 19 | 155 | 19 |
+ part_d_1_15| b | 15 | 146 |  1 |
+ part_d_1_15| b | 16 | 147 |  2 |

It seems that macOS (like older BSDs) just doesn't know how to sort
Unicode and falls back to sorting the bits.  I expect that means that
the test will also fail on any other OS with "make check
LC_COLLATE=C".  I believe our regression tests are supposed to pass
with a wide range of collations including C, so I wonder if this means
we should stick a leading zero on those single digit numbers, or
something, to stabilise the output.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-11-06 Thread Amit Langote
On 2017/11/07 14:40, Amit Khandekar wrote:
> On 7 November 2017 at 00:33, Robert Haas  wrote:
> 
>> Also, +1 for Amit Langote's idea of trying to merge
>> mt_perleaf_childparent_maps with mt_persubplan_childparent_maps.
> 
> Currently I am trying to see if it simplifies things if we do that. We
> will be merging these arrays into one, but we are adding a new int[]
> array that maps subplans to leaf partitions. Will get back with how it
> looks finally.

One thing to note is that the int[] array I mentioned will be much faster
to compute than going to convert_tuples_by_name() to build the additional
maps array.

Thanks,
Amit





Re: [HACKERS] UPDATE of partition key

2017-11-06 Thread Amit Khandekar
On 7 November 2017 at 00:33, Robert Haas  wrote:

> Also, +1 for Amit Langote's idea of trying to merge
> mt_perleaf_childparent_maps with mt_persubplan_childparent_maps.

Currently I am trying to see if it simplifies things if we do that. We
will be merging these arrays into one, but we are adding a new int[]
array that maps subplans to leaf partitions. Will get back with how it
looks finally.

Robert, Amit, I will get back to your other review comments.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-11-06 Thread Robert Haas
On Wed, Oct 25, 2017 at 11:40 AM, Amit Khandekar  wrote:
> Below I have addressed the remaining review comments :

The changes to trigger.c still make me super-nervous.  Hey THOMAS
MUNRO, any chance you could review that part?

+   /* The caller must have already locked all the partitioned tables. */
+   root_rel = heap_open(root_relid, NoLock);
+   *all_part_cols = NULL;
+   foreach(lc, partitioned_rels)
+   {
+   Index   rti = lfirst_int(lc);
+   Oid relid = getrelid(rti, rtables);
+   Relation    part_rel = heap_open(relid, NoLock);
+
+   pull_child_partition_columns(part_rel, root_rel, all_part_cols);
+   heap_close(part_rel, NoLock);

I don't like the fact that we're opening and closing the relation here
just to get information on the partitioning columns.  I think it would
be better to do this someplace that already has the relation open and
store the details in the RelOptInfo.  set_relation_partition_info()
looks like the right spot.
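For illustration, a hedged sketch of what that might look like (the RelOptInfo
field name all_partattrs is an assumption for this sketch, not something in
the patch or in the tree):

/*
 * Hypothetical addition to set_relation_partition_info() in plancat.c:
 * collect the partition key columns while the relation is already open,
 * instead of re-opening it later just for this.
 */
static void
set_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
                            Relation relation)
{
    PartitionKey partkey = RelationGetPartitionKey(relation);
    int          i;

    /* ... existing code setting up rel->part_scheme etc. ... */

    for (i = 0; i < partkey->partnatts; i++)
    {
        AttrNumber  attno = partkey->partattrs[i];

        /* attno is 0 for an expression column; those are handled below */
        if (attno != InvalidAttrNumber)
            rel->all_partattrs = bms_add_member(rel->all_partattrs,
                                attno - FirstLowInvalidHeapAttributeNumber);
    }

    /* Columns referenced by partition key expressions */
    if (partkey->partexprs != NIL)
        pull_varattnos((Node *) partkey->partexprs, 1, &rel->all_partattrs);
}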

+void
+pull_child_partition_columns(Relation rel,
+Relation parent,
+Bitmapset **partcols)

This code has a lot in common with is_partition_attr().  I'm not sure
it's worth trying to unify them, but it could be done.

+ * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,

Instead of " : ", you could just write "is the".

+* For Updates, if the leaf partition is already present in the
+* per-subplan result rels, we re-use that rather than initialize a
+* new result rel. The per-subplan resultrels and the resultrels of
+* the leaf partitions are both in the same canonical order. So while

It would be good to explain the reason.  Also, Updates shouldn't be
capitalized here.

+   Assert(cur_update_rri <= update_rri + num_update_rri - 1);

Maybe just cur_update_rri < update_rri + num_update_rri, or even
cur_update_rri - update_rri < num_update_rri.

Also, +1 for Amit Langote's idea of trying to merge
mt_perleaf_childparent_maps with mt_persubplan_childparent_maps.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UPDATE of partition key

2017-11-02 Thread Amit Langote
Hi Amit.

Thanks a lot for updated patches and sorry that I couldn't get to looking
at your emails sooner.  Note that I'm replying here to both of your
emails, but looking at only the latest v22 patch.

On 2017/10/24 0:15, Amit Khandekar wrote:
> On 16 October 2017 at 08:28, Amit Langote  
> wrote:
>>
>> +   (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
>>
>> Is there some reason why a bitwise operator is used here?
> 
> That exact condition means that the function is called for transition
> capture for updated rows being moved to another partition. For this
> scenario, either the oldtup or the newtup is NULL. I wanted to exactly
> capture that condition there. I think the bitwise operator is more
> user-friendly in emphasizing the point that it is indeed an "either a
> or b, not both" condition.

I see.  In that case, since this patch adds the new condition, a note
about it in the comment just above would be good, because the situation
you describe here seems to arise only during update-tuple-routing, IIUC.

>> + * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
>> + * with one entry for every leaf partition (required to convert input tuple
>> + * based on the root table's rowtype to a leaf partition's rowtype after
>> + * tuple routing is done)
>>
>> Could this be named leaf_tupconv_maps, maybe?  It perhaps makes clear that
>> they are maps needed for "tuple conversion".  And the other field holding
>> the reverse map as leaf_rev_tupconv_maps.  Either that or use underscores
>> to separate words, but then it gets too long I guess.
> 
> In the master branch, this param is already there now, with the name
> "tup_conv_maps". In the rebased version in the earlier mail, I haven't
> changed it again. I think "tup_conv_maps" looks clear enough.

OK.

In the latest patch:

+ * 'update_rri' has the UPDATE per-subplan result rels. These are re-used
+ *  instead of allocating new ones while generating the array of all leaf
+ *  partition result rels.

Instead of:

"These are re-used instead of allocating new ones while generating the
array of all leaf partition result rels."

how about:

"There is no need to allocate a new ResultRellInfo entry for leaf
partitions for which one already exists in this array"

>> ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex
>> interface.  I guess it could simply have the following interface:
>>
>> static HeapTuple ConvertPartitionTuple(ModifyTableState *mtstate,
>>                                        HeapTuple tuple, bool is_update);
>>
>> And figure out, based on the value of is_update, which map to use and
>> which slot to set *p_new_slot to (what is now "new_slot" argument).
>> You're getting mtstate here anyway, which contains all the information you
>> need here.  It seems better to make that (selecting which map and which
>> slot) part of the function's implementation if we're having this function
>> at all, imho.  Maybe I'm missing some details there, but my point still
>> remains that we should try to put more logic in that function instead of
>> it just doing the mechanical tuple conversion.
> 
> I tried to see how the interface would look if we do that way. Here is
> how the code looks :
> 
> static TupleTableSlot *
> ConvertPartitionTupleSlot(ModifyTableState *mtstate,
> bool for_update_tuple_routing,
> int map_index,
> HeapTuple *tuple,
> TupleTableSlot *slot)
> {
>TupleConversionMap   *map;
>TupleTableSlot *new_slot;
> 
>if (for_update_tuple_routing)
>{
>   map = mtstate->mt_persubplan_childparent_maps[map_index];
>   new_slot = mtstate->mt_rootpartition_tuple_slot;
>}
>else
>{
>   map = mtstate->mt_perleaf_parentchild_maps[map_index];
>   new_slot = mtstate->mt_partition_tuple_slot;
>}
> 
>if (!map)
>   return slot;
> 
>*tuple = do_convert_tuple(*tuple, map);
> 
>/*
> * Change the partition tuple slot descriptor, as per converted tuple.
> */
>ExecSetSlotDescriptor(new_slot, map->outdesc);
>ExecStoreTuple(*tuple, new_slot, InvalidBuffer, true);
> 
>return new_slot;
> }
> 
> It looks like the interface does not simplify much, and on top of that,
> we have more lines in that function. Also, the caller anyway has to be
> aware of whether the map_index is an index into the leaf partitions or
> into the update subplans. So it is not as if the caller does not have
> to know whether the mapping should be mt_persubplan_childparent_maps
> or mt_perleaf_parentchild_maps.

Hmm, I think we should try to make it so that the caller doesn't have to
be aware of that.  And by caller I guess you mean ExecInsert(), which
should not, IMHO, be the place to introduce a lot of new logic
specific to update tuple routing.  ISTM, ModifyTableState now has one too
many 

Re: [HACKERS] UPDATE of partition key

2017-10-15 Thread Amit Langote
Hi Amit.

On 2017/10/04 22:51, Amit Khandekar wrote:
> Main patch :
> update-partition-key_v20.patch

Guess you're already working on it but the patch needs a rebase.  A couple
of hunks in the patch to execMain.c and nodeModifyTable.c fail.

Meanwhile a few comments:

+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+Relation rel,
+Relation parent)

Nitpick: don't we normally list the output argument(s) at the end?  Also,
"bitmapset" could be renamed to something that conveys what it contains?

+   if (partattno != 0)
+   child_keycols =
+   bms_add_member(child_keycols,
+  partattno - FirstLowInvalidHeapAttributeNumber);
+   }
+   foreach(lc, partexprs)
+   {

Elsewhere (in quite a few places), we don't iterate over partexprs
separately like this, although I'm not saying it is bad, just different
from other places.

+ * the transition tuplestores can be built. Furthermore, if the transition
+ *  capture is happening for UPDATEd rows being moved to another partition due
+ *  partition-key change, then this function is called once when the row is
+ *  deleted (to capture OLD row), and once when the row is inserted to another
+ *  partition (to capture NEW row). This is done separately because DELETE and
+ *  INSERT happen on different tables.

Extra space at the beginning from the 2nd line onwards.

+   (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))

Is there some reason why a bitwise operator is used here?

+ * 'update_rri' has the UPDATE per-subplan result rels.

Could you explain why they are being received as input here?

+ * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
+ * with one entry for every leaf partition (required to convert input tuple
+ * based on the root table's rowtype to a leaf partition's rowtype after
+ * tuple routing is done)

Could this be named leaf_tupconv_maps, maybe?  It perhaps makes clear that
they are maps needed for "tuple conversion".  And the other field holding
the reverse map as leaf_rev_tupconv_maps.  Either that or use underscores
to separate words, but then it gets too long I guess.


+   tuple = ConvertPartitionTupleSlot(mtstate,
+                                     mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
+

The 2nd line here seems to have gone over 80 characters.

ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex
interface.  I guess it could simply have the following interface:

static HeapTuple ConvertPartitionTuple(ModifyTableState *mtstate,
                                       HeapTuple tuple, bool is_update);

And figure out, based on the value of is_update, which map to use and
which slot to set *p_new_slot to (what is now "new_slot" argument).
You're getting mtstate here anyway, which contains all the information you
need here.  It seems better to make that (selecting which map and which
slot) part of the function's implementation if we're having this function
at all, imho.  Maybe I'm missing some details there, but my point still
remains that we should try to put more logic in that function instead of
it just doing the mechanical tuple conversion.

+ * We have already checked partition constraints above, so skip them
+ * below.

How about: ", so skip checking here."?


ISTM, the newly introduced logic in ExecSetupTransitionCaptureState() to
try to reuse the per-subplan child-to-parent map as per-leaf
child-to-parent map could be simplified a bit.  I mean the following code:

+/*
+ * But for Updates, we can share the per-subplan maps with the per-leaf
+ * maps.
+ */
+update_rri_index = 0;
+update_rri = mtstate->resultRelInfo;
+if (mtstate->mt_nplans > 0)
+cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);

-/* Choose the right set of partitions */
-if (mtstate->mt_partition_dispatch_info != NULL)
+for (i = 0; i < numResultRelInfos; ++i)
+{


How about (pseudo-code):

 j = 0;
 for (i = 0; i < n_leaf_parts; i++)
 {
     if (j < n_subplans && leaf_rri[i]->oid == subplan_rri[j]->oid)
     {
         leaf_childparent_map[i] = subplan_childparent_map[j];
         j++;
     }
     else
     {
         leaf_childparent_map[i] = new map
     }
 }

I think the above would also be useful in ExecSetupPartitionTupleRouting()
where you've added similar code to try to reuse per-subplan ResultRelInfos.


In ExecInitModifyTable(), can we try to minimize the number of places
where update_tuple_routing_needed is being set?  Currently, it's being set
in 3 places:

+bool    update_tuple_routing_needed = node->part_cols_updated;

&

+/*
+ * If this is an UPDATE and a BEFORE UPDATE trigger is present,
we may
+ * need to do update tuple routing.
+ */
+if (resultRelInfo->ri_TrigDesc &&
+resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+   

Re: [HACKERS] UPDATE of partition key

2017-10-12 Thread Amit Langote
On 2017/10/13 6:18, Robert Haas wrote:
> Is anybody still reviewing the main patch here?  (It would be good if
> the answer is "yes".)

I am going to try to look at the latest version over the weekend and early
next week.

Thanks,
Amit





Re: [HACKERS] UPDATE of partition key

2017-10-12 Thread Robert Haas
On Wed, Oct 4, 2017 at 9:51 AM, Amit Khandekar  wrote:
> Preparatory patches :
> 0001-Prepare-for-re-using-UPDATE-result-rels-during-tuple.patch
> 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch
> Main patch :
> update-partition-key_v20.patch

Committed 0001 with a few tweaks and 0002 unchanged.  Please check
whether everything looks OK.

Is anybody still reviewing the main patch here?  (It would be good if
the answer is "yes".)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UPDATE of partition key

2017-10-03 Thread Robert Haas
On Tue, Oct 3, 2017 at 8:16 AM, Amit Khandekar  wrote:
> While writing this down, I observed that after multi-level partition
> tree expansion was introduced, the child table expressions are not
> converted directly from the root. Instead, they are converted from
> their immediate parent. So there is a chain of conversions : to leaf
> from its parent, to that parent from its parent, and so on from the
> root. Effectively, during the first conversion, there are that many
> ConvertRowtypeExpr nodes one above the other already present in the
> UPDATE result rel expressions. But my patch handles the optimization
> only for the leaf partition conversions.
>
> If it already has a CRE : CRE(rr) -> wr(r)
> Parent-to-child conversion ::: CRE(p) -> wr(r)  ===>  CRE(rr) -> CRE(r) -> wr(c1)
> With patch : CRE(rr) -> CRE(r) -> wr(c1)  ===>  CRE(rr) -> CRE(r) -> wr(c2)

Maybe adjust_appendrel_attrs() should have a similar provision for
avoiding extra ConvertRowTypeExpr nodes?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UPDATE of partition key

2017-10-03 Thread Amit Khandekar
On 30 September 2017 at 01:26, Robert Haas  wrote:
> On Fri, Sep 29, 2017 at 3:53 PM, Robert Haas  wrote:
>> On Fri, Sep 22, 2017 at 1:57 AM, Amit Khandekar  
>> wrote:
>>> The patch for the above change is :
>>> 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch
>>
>> Thinking about this a little more, I'm wondering about how this case
>> arises.  I think that for this patch to avoid multiple conversions,
>> we'd have to be calling map_variable_attnos on an expression and then
>> calling map_variable_attnos on that expression again.

We are not calling map_variable_attnos() twice. The first time it is
called, the ConvertRowtypeExpr node is already there if the expression
is a whole-row var; that node was added earlier by
adjust_appendrel_attrs(). So the two conversions are done by two
different functions.

For ConvertRowtypeExpr, map_variable_attnos_mutator() recursively
calls map_variable_attnos_mutator() for ConvertRowtypeExpr->arg with
coerced_var=true.

>
> I guess I didn't quite finish this thought, sorry.  Maybe it's
> obvious, but the point I was going for is: why would we do that, vs.
> just converting once?

The first time a ConvertRowtypeExpr node gets added to the expression is
when adjust_appendrel_attrs() is called for each of the child tables.
There, for each child table, when the parent parse tree is converted
into the child parse tree, the whole-row var (in the RETURNING or WITH
CHECK OPTIONS expr) is wrapped with ConvertRowtypeExpr(), so the child
parse tree (or the child WCO expr) has this ConvertRowtypeExpr node.

The second time this node is added is during update-tuple-routing in
ExecInitModifyTable(), when map_partition_varattnos() is called for
each of the partitions to convert from the first per-subplan
RETURNING/WCO expression to the RETURNING/WCO expression belonging to
the leaf partition. This second conversion happens for the leaf
partitions which are not already present in per-subplan UPDATE result
rels.

So the first conversion is from parent to child while building
per-subplan plans, and the second is from first per-subplan child to
another child for building expressions of the leaf partitions.

So suppose the root partitioned table RETURNING expression is a whole
row var wr(r) where r is its composite type representing the root
table type.
Then, one of its UPDATE child tables will have its RETURNING
expression converted like this :
wr(r)  ===>  CRE(r) -> wr(c1)
where CRE(r) represents ConvertRowtypeExpr of result type r, which has
its arg pointing to wr(c1) which is a whole row var of composite type
c1 for the child table c1. So this node converts from composite type
of child table to composite type of root table.

Now, when the second conversion occurs for the leaf partition (i.e.
during update-tuple-routing), the conversion looks like this :
CRE(r) -> wr(c1)  ===>  CRE(r) -> wr(c2)
But without the 0002*ConvertRowtypeExpr*.patch, the conversion would have
looked like this :
CRE(r) -> wr(c1)  ===>  CRE(r) -> CRE(c1) -> wr(c2)
In short, we omit the intermediate CRE(c1) node.
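For anyone not following the notation, a small made-up example of where such a
whole-row var (and hence the ConvertRowtypeExpr wrapping) shows up; with the
patch applied, the UPDATE below also moves the row from c1 to c2, which is when
the second conversion described above happens:

CREATE TABLE r (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE c1 PARTITION OF r FOR VALUES FROM (1) TO (10);
CREATE TABLE c2 PARTITION OF r FOR VALUES FROM (10) TO (20);
INSERT INTO r VALUES (5, 'x');

-- "r" in RETURNING is the whole-row var wr(r); in each child's version of the
-- plan it is really wr(c1) or wr(c2) wrapped in a ConvertRowtypeExpr back to
-- r's rowtype.
UPDATE r SET a = a + 10 RETURNING r;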


While writing this down, I observed that after multi-level partition
tree expansion was introduced, the child table expressions are not
converted directly from the root. Instead, they are converted from
their immediate parent. So there is a chain of conversions : to leaf
from its parent, to that parent from its parent, and so on from the
root. Effectively, during the first conversion, there are that many
ConvertRowtypeExpr nodes one above the other already present in the
UPDATE result rel expressions. But my patch handles the optimization
only for the leaf partition conversions.

If it already has a CRE : CRE(rr) -> wr(r)
Parent-to-child conversion ::: CRE(p) -> wr(r)  ===>  CRE(rr) -> CRE(r) -> wr(c1)
With patch : CRE(rr) -> CRE(r) -> wr(c1)  ===>  CRE(rr) -> CRE(r) -> wr(c2)




Re: [HACKERS] UPDATE of partition key

2017-09-29 Thread Robert Haas
On Fri, Sep 22, 2017 at 1:57 AM, Amit Khandekar  wrote:
> The patch for the above change is :
> 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch

Thinking about this a little more, I'm wondering about how this case
arises.  I think that for this patch to avoid multiple conversions,
we'd have to be calling map_variable_attnos on an expression and then
calling map_variable_attnos on that expression again.

>>> If I understand correctly, the reason for changing mt_partitions from
>>> ResultRelInfo * to ResultRelInfo ** is that, currently, all of the
>>> ResultRelInfos for a partitioning hierarchy are allocated as a single
>>> chunk, but we can't do that and also reuse the ResultRelInfos created
>>> during InitPlan.  I suggest that we do this as a preparatory patch.
>>
>> Ok, will prepare a separate patch. Do you mean to include in that
>> patch the changes I did in ExecSetupPartitionTupleRouting() that
>> re-use the ResultRelInfo structures of per-subplan update result rels
>> ?
>
> Above changes are in attached
> 0001-Re-use-UPDATE-result-rels-created-in-InitPlan.patch.

No, not all of those changes.  Just the adjustments to make
ModifyTableState's mt_partitions be of type ResultRelInfo ** rather
than ResultRelInfo *, and anything closely related to that.  Not, for
example, the num_update_rri stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UPDATE of partition key

2017-09-29 Thread amul sul
On Wed, Sep 13, 2017 at 4:24 PM, amul sul  wrote:
>
>
> On Sun, Sep 10, 2017 at 8:47 AM, Amit Kapila 
> wrote:
>>
>> On Fri, Sep 8, 2017 at 4:51 PM, amul sul  wrote:
>> > On Thu, May 18, 2017 at 9:13 AM, Amit Kapila 
>> > wrote:
>> >>
>> >>  On Wed, May 17, 2017 at 5:17 PM, Robert Haas 
>> >> wrote:
>> >> > On Wed, May 17, 2017 at 6:29 AM, Amit Kapila
>> >> > 
>> >> > wrote:
>> >> >> I think we can do this even without using an additional infomask
>> >> >> bit.
>> >> >> As suggested by Greg up thread, we can set InvalidBlockId in ctid to
>> >> >> indicate such an update.
>> >> >
>> >> > Hmm.  How would that work?
>> >> >
>> >>
>> >> We can pass a flag say row_moved (or require_row_movement) to
>> >> heap_delete which will in turn set InvalidBlockId in ctid instead of
>> >> setting it to self. Then the ExecUpdate needs to check for the same
>> >> and return an error when heap_update is not successful (result !=
>> >> HeapTupleMayBeUpdated).  Can you explain what difficulty are you
>> >> envisioning?
>> >>
>> >
>> > Attaching WIP patch incorporates the above logic, although I am yet to
>> > check
>> > all the code for places which might be using ip_blkid.  I have got a
>> > small
>> > query here,
>> > do we need an error on HeapTupleSelfUpdated case as well?
>> >
>>
>> No, because that case is anyway a no-op (or error depending on whether
>> is updated/deleted by same command or later command).  Basically, even
>> if the row wouldn't have been moved to another partition, we would not
>> have allowed the command to proceed with the update.  This handling is
>> to make commands fail rather than a no-op where otherwise (when the
>> tuple is not moved to another partition) the command would have
>> succeeded.
>>
> Thank you.
>
> I've rebased the patch against Amit Khandekar's latest patch (v17_rebased_2).
> Also, I added an ip_blkid validation check in heap_get_latest_tid(),
> rewrite_heap_tuple() & rewrite_heap_tuple(), because the
> ItemPointerEquals() check alone is no longer sufficient after this patch.
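To illustrate why ItemPointerEquals() alone no longer suffices, here is a
hedged sketch only; the macro name and placement are assumptions, not the
actual patch:

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Sketch: after this change, a tuple deleted as part of a cross-partition row
 * movement carries InvalidBlockNumber in the block-id half of its t_ctid, so
 * code that follows update chains cannot rely on
 * ItemPointerEquals(&tup->t_self, &tup->t_data->t_ctid) alone to detect the
 * end of the chain; it also has to notice this "moved elsewhere" marker.
 */
#define ItemPointerIndicatesMovedPartitions(ctid) \
    (!BlockNumberIsValid(BlockIdGetBlockNumber(&(ctid)->ip_blkid)))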

FYI, I have posted this patch in a separate thread :
https://postgr.es/m/caaj_b95pkwojoyfz0bzxu8ookctvgzn6vygcnvuukeudrnf...@mail.gmail.com

Regards,
Amul




Re: [HACKERS] UPDATE of partition key

2017-09-28 Thread Amit Khandekar
Below are some performance figures. Overall, there does not appear to
be a noticeable difference between partition-key updates with and
without row movement (which is surprising), or between
non-partition-key updates with and without the patch.

All the values are in milliseconds.

Configuration :

shared_buffers = 8GB
maintenance_work_mem = 4GB
synchronous_commit = off
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
log_line_prefix = '%t [%p] '
max_wal_size = 5GB
max_connections = 200

The attached files were used to create a partition tree made up of 16
partitioned tables, each containing 125 partitions. The first half of the
2000 partitions is filled with 10 million rows. Update row movement
moves the data to the other half of the partitions.

gen.sql : Creates the partitions.
insert.data : This data file is uploaded here [1]. Used "COPY ptab
from '$PWD/insert.data' "
index.sql : Optionally, Create index on column d.

The schema looks like this :

CREATE TABLE ptab (a date, b int, c int, d int) PARTITION BY RANGE (a, b);

CREATE TABLE ptab_1_1 PARTITION OF ptab
for values from ('1900-01-01', 1) to ('1900-01-01', 7501)
PARTITION BY range (c);
CREATE TABLE ptab_1_1_1 PARTITION OF ptab_1_1
for values from (1) to (81);
CREATE TABLE ptab_1_1_2 PARTITION OF ptab_1_1
for values from (81) to (161);
..
..
CREATE TABLE ptab_1_2 PARTITION OF ptab
for values from ('1900-01-01', 7501) to ('1900-01-01', 15001)
PARTITION BY range (c);
..
..

On 20 September 2017 at 00:06, Robert Haas  wrote:
> I wonder how much more expensive it
> is to execute UPDATE root SET a = a + 1 WHERE a = 1 on a table with
> 1000 subpartitions with this patch than without, assuming the update
> succeeds in both cases.

UPDATE query used : UPDATE ptab set d = d + 1 where d = 1; -- where d
is not a partition key of any of the partitions.
This query updates 8 rows out of 10 million rows.
With HEAD  : 2953.691 , 2862.298 , 2855.286 , 2835.879 (avg : 2876)
With Patch : 2933.719 , 2832.463 , 2749.979 , 2820.416 (avg : 2834)
(All the values are in milliseconds.)

> suppose you make a table with 1000 partitions each containing
> 10,000 tuples and update them all, and consider three scenarios: (1)
> partition key not updated but all tuples subject to non-HOT updates
> because the updated column is indexed, (2) partition key updated but
> no tuple movement required as a result, (3) partition key updated and
> all tuples move to a different partition.

Note that the following figures do not represent a consistent set of
figures; they keep varying. For example, even though the
partition-key update without row movement appears to have taken a bit
more time with the patch than with HEAD, a new set of test runs might
even end up the other way round.

NPK  : 42089 (patch)
NPKI : 81593 (patch)
PK   : 45250 (patch) , 44944 (HEAD)
PKR  : 46701 (patch)

The above figures are in milliseconds. The explanations of the above
short-forms :

NPK :
Update of column that is not a partition-key.
UPDATE query used : UPDATE ptab set d = d + 1 ; This updates *all* rows.

NPKI :
Update of column that is not a partition-key. And this column is
indexed (Used attached file index.sql).
UPDATE query used : UPDATE ptab set d = d + 1 ; This updates *all* rows.

PK :
Update of partition key, but row movement does not occur. There are no
indexed columns.
UPDATE query used : UPDATE ptab set a = a + '1 hour'::interval ;

PKR :
Update of partition key, with all rows moved to other partitions.
There are no indexed columns.
UPDATE query used : UPDATE ptab set a = a + '2 years'::interval ;


[1] 
https://drive.google.com/open?id=0B_YJCqIAxKjeN3hMXzdDejlNYmlpWVJpaU9mWUhFRVhXTG5Z

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


gen.tar.gz
Description: GNU Zip compressed data


index.tar.gz
Description: GNU Zip compressed data



Re: [HACKERS] UPDATE of partition key

2017-09-22 Thread Amit Khandekar
On 21 September 2017 at 19:52, amul sul  wrote:
> On Wed, Sep 20, 2017 at 9:27 PM, Amit Khandekar 
> wrote:
>>
>> On 20 September 2017 at 00:06, Robert Haas  wrote:
>> > On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar 
>> > wrote:
>> >> [ new patch ]
>
>
>   86 -   (event == TRIGGER_EVENT_UPDATE &&
> !trigdesc->trig_update_after_row))
>   87 +   (event == TRIGGER_EVENT_UPDATE &&
> !trigdesc->trig_update_after_row) ||
>   88 +   (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup
> == NULL)))
>   89 return;
>   90 }
>
>
> Either of oldtup or newtup will be valid at a time & vice versa.  Can we
> improve
> this check accordingly?
>
> For e.g.:
> (event == TRIGGER_EVENT_UPDATE && (HeapTupleIsValid(oldtup) ^
> ItemPointerIsValid(newtup)))

Ok, I will be doing this as below :
-  (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
+ (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))

At other places in the function, oldtup and newtup are checked for
NULL, so to be consistent, I haven't used HeapTupleIsValid.

Actually, it won't happen that both oldtup and newtup are NULL in any
of delete, insert, or update, but I haven't added an Assert for this,
because that has been true even on HEAD.
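(If such an Assert were added, it could be something like the following
sketch, which is not part of the patch:)

/* Sketch only: an UPDATE event must carry at least one of the two tuples. */
Assert(event != TRIGGER_EVENT_UPDATE || oldtup != NULL || newtup != NULL);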

Will include the above minor change in the next patch when more changes come in.

>
>
>  247
>  248 +   /*
>  249 +* EDB: In case this is part of update tuple routing, put this row
> into the
>  250 +* transition NEW TABLE if we are capturing transition tables. We
> need to
>  251 +* do this separately for DELETE and INSERT because they happen on
>  252 +* different tables.
>  253 +*/
>  254 +   if (mtstate->operation == CMD_UPDATE &&
> mtstate->mt_transition_capture)
>  255 +   ExecARUpdateTriggers(estate, resultRelInfo, NULL,
>  256 +NULL,
>  257 +tuple,
>  258 +NULL,
>  259 +mtstate->mt_transition_capture);
>  260 +
>  261 list_free(recheckIndexes);
>
>  267
>  268 +   /*
>  269 +* EDB: In case this is part of update tuple routing, put this row
> into the
>  270 +* transition OLD TABLE if we are capturing transition tables. We
> need to
>  271 +* do this separately for DELETE and INSERT because they happen on
>  272 +* different tables.
>  273 +*/
>  274 +   if (mtstate->operation == CMD_UPDATE &&
> mtstate->mt_transition_capture)
>  275 +   ExecARUpdateTriggers(estate, resultRelInfo, tupleid,
>  276 +oldtuple,
>  277 +NULL,
>  278 +NULL,
>  279 +mtstate->mt_transition_capture);
>  280 +
>
> Initially, I wondered why we can't have the above code right after
> ExecInsert() & ExecDelete() in ExecUpdate() respectively.
>
> We can do that for ExecDelete() but not easily in the ExecInsert() case,
> because ExecInsert() internally searches for the correct partition's
> resultRelInfo for an insert, and before returning to ExecUpdate the
> resultRelInfo is restored to the old one.  That's why the current logic
> seems reasonable for now.  Is there anything that we can do?

Yes, resultRelInfo is different when we return from ExecInsert().
Also, I think the trigger and transition capture should be done
immediately after the rows are inserted; this is true for the existing
code also. Furthermore, there is a dependency of ExecARUpdateTriggers()
on ExecARInsertTriggers(): transition_capture is passed as NULL if we
have already captured the tuple in ExecARUpdateTriggers(). It looks
simpler to do all this in a single place.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-09-21 Thread amul sul
On Wed, Sep 20, 2017 at 9:27 PM, Amit Khandekar 
wrote:

> On 20 September 2017 at 00:06, Robert Haas  wrote:
> > On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar 
> wrote:
> >> [ new patch ]
>

  86 -   (event == TRIGGER_EVENT_UPDATE &&
!trigdesc->trig_update_after_row))
  87 +   (event == TRIGGER_EVENT_UPDATE &&
!trigdesc->trig_update_after_row) ||
  88 +   (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL ||
newtup == NULL)))
  89 return;
  90 }


Either of oldtup or newtup will be valid at a time & vice versa.  Can we
improve
this check accordingly?

For e.g.:
(event == TRIGGER_EVENT_UPDATE && (HeapTupleIsValid(oldtup) ^
ItemPointerIsValid(newtup)))


 247
 248 +   /*
 249 +* EDB: In case this is part of update tuple routing, put this row
into the
 250 +* transition NEW TABLE if we are capturing transition tables. We
need to
 251 +* do this separately for DELETE and INSERT because they happen on
 252 +* different tables.
 253 +*/
 254 +   if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
 255 +   ExecARUpdateTriggers(estate, resultRelInfo, NULL,
 256 +NULL,
 257 +tuple,
 258 +NULL,
 259 +mtstate->mt_transition_capture);
 260 +
 261 list_free(recheckIndexes);

 267
 268 +   /*
 269 +* EDB: In case this is part of update tuple routing, put this row
into the
 270 +* transition OLD TABLE if we are capturing transition tables. We
need to
 271 +* do this separately for DELETE and INSERT because they happen on
 272 +* different tables.
 273 +*/
 274 +   if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
 275 +   ExecARUpdateTriggers(estate, resultRelInfo, tupleid,
 276 +oldtuple,
 277 +NULL,
 278 +NULL,
 279 +mtstate->mt_transition_capture);
 280 +

Initially, I wondered why we can't have the above code right after
ExecInsert() & ExecDelete() in ExecUpdate() respectively.

We can do that for ExecDelete() but not easily in the ExecInsert() case,
because ExecInsert() internally searches for the correct partition's
resultRelInfo for an insert, and before returning to ExecUpdate the
resultRelInfo is restored to the old one.  That's why the current logic
seems reasonable for now.  Is there anything that we can do?

Regards,
Amul


Re: [HACKERS] UPDATE of partition key

2017-09-20 Thread Amit Khandekar
On 20 September 2017 at 00:06, Robert Haas  wrote:
> On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar  
> wrote:
>> [ new patch ]
>
> This already fails to apply again.  In general, I think it would be a
> good idea to break this up into a patch series rather than have it as
> a single patch.  That would allow some bits to be applied earlier.
> The main patch will probably still be pretty big, but at least we can
> make things a little easier by getting some of the cleanup out of the
> way first.  Specific suggestions on what to break out below.
>
> If the changes to rewriteManip.c are a marginal efficiency hack and
> nothing more, then let's commit this part separately before the main
> patch.  If they're necessary for correctness, then please add a
> comment explaining why they are necessary.

Ok. Yes, I just wanted to avoid two ConvertRowtypeExpr nodes one over
the other, but that was not causing any correctness issue. Will
extract these changes into a separate patch.

>
> There appears to be no reason why the definitions of
> GetInsertedColumns() and GetUpdatedColumns() need to be moved to a
> header file as a result of this patch.  GetUpdatedColumns() was
> previously defined in trigger.c and execMain.c and, post-patch, is
> still called from only those files.  GetInsertedColumns() was, and
> remains, called only from execMain.c.  If this were needed I'd suggest
> doing it as a preparatory patch before the main patch, but it seems we
> don't need it at all.

In earlier versions of the patch, these functions were used in
nodeModifyTable.c as well. Now that those calls are no longer in that
file, I will revert the changes that moved the definitions into the
header file.

>
> If I understand correctly, the reason for changing mt_partitions from
> ResultRelInfo * to ResultRelInfo ** is that, currently, all of the
> ResultRelInfos for a partitioning hierarchy are allocated as a single
> chunk, but we can't do that and also reuse the ResultRelInfos created
> during InitPlan.  I suggest that we do this as a preparatory patch.

Ok, will prepare a separate patch. Do you mean to include in that
patch the changes I did in ExecSetupPartitionTupleRouting() that
re-use the ResultRelInfo structures of per-subplan update result rels
?

> Someone could argue that this is going the wrong way and that we ought
> to instead make InitPlan() create all of the necessarily
> ResultRelInfos, but it seems to me that eventually we probably want to
> allow setting up ResultRelInfos on the fly for only those partitions
> for which we end up needing them.  The code already has some provision
> for creating ResultRelInfos on the fly - see ExecGetTriggerResultRel.
> I don't think it's this patch's job to try to apply that kind of thing
> to tuple routing, but it seems like in the long run if we're inserting
> 1 tuple into a table with 1000 partitions, or performing 1 update that
> touches the partition key, it would be best not to create
> ResultRelInfos for all 1000 partitions just for fun.

Yes makes sense.

>  But this sort of
> thing seems much easier of mt_partitions is ResultRelInfo ** rather
> than ResultRelInfo *, so I think what you have is going in the right
> direction.

Ok.

>
> + * mtstate->resultRelInfo[]. Note: We assume that if the 
> resultRelInfo
> + * does not belong to subplans, then it already matches the root 
> tuple
> + * descriptor; although there is no such known scenario where this
> + * could happen.
> + */
> +if (rootResultRelInfo != resultRelInfo &&
> +mtstate->mt_persubplan_childparent_maps != NULL &&
> +resultRelInfo >= mtstate->resultRelInfo &&
> +resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1)
> +{
> +int map_index = resultRelInfo - mtstate->resultRelInfo;
>
> I think you should Assert() that it doesn't happen instead of assuming
> that it doesn't happen.   IOW, remove the last two branches of the
> if-condition, and then add an Assert() that map_index is sane.

Ok.

>
> It is not clear to me why we need both mt_perleaf_childparent_maps and
> mt_persubplan_childparent_maps.

mt_perleaf_childparent_maps :
This is used for converting transition-captured
inserted/modified/deleted tuples from leaf to root partition, because
we need to have all the ROWS in the root partition attribute order.
This map is used only for tuples that are routed from root to leaf
partition during INSERT, or when tuples are routed from one leaf
partition to another leaf partition during update row movement. For
both of these operations, we need per-leaf maps, because during tuple
conversion, the source relation is among the mtstate->mt_partitions.

mt_persubplan_childparent_maps :
This is used at two places :

1. After an ExecUpdate() updates a row of a per-subplan update result
rel, we need to capture the tuple, so again we need to convert to the
root 

Re: [HACKERS] UPDATE of partition key

2017-09-19 Thread Robert Haas
On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar  wrote:
> [ new patch ]

This already fails to apply again.  In general, I think it would be a
good idea to break this up into a patch series rather than have it as
a single patch.  That would allow some bits to be applied earlier.
The main patch will probably still be pretty big, but at least we can
make things a little easier by getting some of the cleanup out of the
way first.  Specific suggestions on what to break out below.

If the changes to rewriteManip.c are a marginal efficiency hack and
nothing more, then let's commit this part separately before the main
patch.  If they're necessary for correctness, then please add a
comment explaining why they are necessary.

There appears to be no reason why the definitions of
GetInsertedColumns() and GetUpdatedColumns() need to be moved to a
header file as a result of this patch.  GetUpdatedColumns() was
previously defined in trigger.c and execMain.c and, post-patch, is
still called from only those files.  GetInsertedColumns() was, and
remains, called only from execMain.c.  If this were needed I'd suggest
doing it as a preparatory patch before the main patch, but it seems we
don't need it at all.

If I understand correctly, the reason for changing mt_partitions from
ResultRelInfo * to ResultRelInfo ** is that, currently, all of the
ResultRelInfos for a partitioning hierarchy are allocated as a single
chunk, but we can't do that and also reuse the ResultRelInfos created
during InitPlan.  I suggest that we do this as a preparatory patch.
Someone could argue that this is going the wrong way and that we ought
to instead make InitPlan() create all of the necessary
ResultRelInfos, but it seems to me that eventually we probably want to
allow setting up ResultRelInfos on the fly for only those partitions
for which we end up needing them.  The code already has some provision
for creating ResultRelInfos on the fly - see ExecGetTriggerResultRel.
I don't think it's this patch's job to try to apply that kind of thing
to tuple routing, but it seems like in the long run if we're inserting
1 tuple into a table with 1000 partitions, or performing 1 update that
touches the partition key, it would be best not to create
ResultRelInfos for all 1000 partitions just for fun.  But this sort of
thing seems much easier if mt_partitions is ResultRelInfo ** rather
than ResultRelInfo *, so I think what you have is going in the right
direction.
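
For illustration only, the data-structure change being discussed amounts
to something like the following (the field name is the one used in the
thread; this is a sketch, not the final form):

    /* Before: one contiguous array, allocated as a single chunk */
    ResultRelInfo  *mt_partitions;

    /*
     * After: an array of pointers, so that an entry can either point to a
     * freshly created ResultRelInfo or reuse a per-subplan ResultRelInfo
     * that InitPlan() already set up.
     */
    ResultRelInfo **mt_partitions;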

+ * mtstate->resultRelInfo[]. Note: We assume that if the resultRelInfo
+ * does not belong to subplans, then it already matches the root tuple
+ * descriptor; although there is no such known scenario where this
+ * could happen.
+ */
+if (rootResultRelInfo != resultRelInfo &&
+mtstate->mt_persubplan_childparent_maps != NULL &&
+resultRelInfo >= mtstate->resultRelInfo &&
+resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1)
+{
+int map_index = resultRelInfo - mtstate->resultRelInfo;

I think you should Assert() that it doesn't happen instead of assuming
that it doesn't happen.   IOW, remove the last two branches of the
if-condition, and then add an Assert() that map_index is sane.
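
In other words, the suggested shape is roughly the following (a sketch
only, not actual patch text):

    if (rootResultRelInfo != resultRelInfo &&
        mtstate->mt_persubplan_childparent_maps != NULL)
    {
        int     map_index = resultRelInfo - mtstate->resultRelInfo;

        /* resultRelInfo must be one of the per-subplan result rels */
        Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
        ...
    }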

It is not clear to me why we need both mt_perleaf_childparent_maps and
mt_persubplan_childparent_maps.

+ * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+ * update-partition-key operation, then this function is also called
+ * separately for DELETE and INSERT to capture transition table rows.
+ * In such case, either old tuple or new tuple can be NULL.

That seems pretty strange.  I don't quite see how that's going to work
correctly.  I'm skeptical about the idea that the old tuple capture
and new tuple capture can safely happen at different times.

I wonder if we should have a reloption controlling whether
update-tuple routing is enabled.  I wonder how much more expensive it
is to execute UPDATE root SET a = a + 1 WHERE a = 1 on a table with
1000 subpartitions with this patch than without, assuming the update
succeeds in both cases.

I also wonder how efficient this implementation is in general.  For
example, suppose you make a table with 1000 partitions each containing
10,000 tuples and update them all, and consider three scenarios: (1)
partition key not updated but all tuples subject to non-HOT updates
because the updated column is indexed, (2) partition key updated but
no tuple movement required as a result, (3) partition key updated and
all tuples move to a different partition.  It would be useful to
compare the times, and also to look at perf profiles and see if there
are any obvious sources of inefficiency that can be squeezed out.  It
wouldn't surprise me if tuple movement is a bit slower than the other
scenarios, but it would be nice to know how much slower and whether
the bottlenecks are anything that we can 

Re: [HACKERS] UPDATE of partition key

2017-09-19 Thread Dilip Kumar
On Tue, Sep 19, 2017 at 1:15 PM, Amit Khandekar  wrote:
> On 18 September 2017 at 20:45, Dilip Kumar  wrote:
>> Please find few more comments.
>>
>> + * in which they appear in the PartitionDesc. Also, extract the
>> + * partition key columns of the root partitioned table. Those of the
>> + * child partitions would be collected during recursive expansion.
>> */
>> + pull_child_partition_columns(_part_cols, oldrelation, oldrelation);
>> expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
>>   lockmode, >append_rel_list,
>> +   _part_cols,
>>
>> pcinfo->all_part_cols is only used in case of update, I think we can
>> call pull_child_partition_columns
>> only if rte has updateCols?
>>
>> @@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo
>>
>> Index parent_relid;
>> List   *child_rels;
>> + Bitmapset  *all_part_cols;
>> } PartitionedChildRelInfo;
>>
>> I might be missing something, but do we really need to store
>> all_part_cols inside the
>> PartitionedChildRelInfo,  can't we call pull_child_partition_columns
>> directly inside
>> inheritance_planner whenever we realize that RTE has some updateCols
>> and we want to
>> check the overlap?
>
> One thing  we will have to do extra is : Open and close the
> partitioned rels again. The idea was that we collect the bitmap
> *while* we are already expanding through the tree and the rel is open.
> Will check if this is feasible.

Oh, I see.
>
>>
>> +extern void partition_walker_init(PartitionWalker *walker, Relation rel);
>> +extern Relation partition_walker_next(PartitionWalker *walker,
>> +  Relation *parent);
>> +
>>
>> I don't see these functions are used anywhere?
>>
>> +typedef struct PartitionWalker
>> +{
>> + List   *rels_list;
>> + ListCell   *cur_cell;
>> +} PartitionWalker;
>> +
>>
>> Same as above
>
> Yes, this was left out from the earlier implementation. Will have this
> removed in the next updated patch.
Ok. I will continue my review. Thanks.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-09-19 Thread Amit Khandekar
On 18 September 2017 at 20:45, Dilip Kumar  wrote:
> Please find few more comments.
>
> + * in which they appear in the PartitionDesc. Also, extract the
> + * partition key columns of the root partitioned table. Those of the
> + * child partitions would be collected during recursive expansion.
> */
> + pull_child_partition_columns(_part_cols, oldrelation, oldrelation);
> expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
>   lockmode, >append_rel_list,
> +   _part_cols,
>
> pcinfo->all_part_cols is only used in case of update, I think we can
> call pull_child_partition_columns
> only if rte has updateCols?
>
> @@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo
>
> Index parent_relid;
> List   *child_rels;
> + Bitmapset  *all_part_cols;
> } PartitionedChildRelInfo;
>
> I might be missing something, but do we really need to store
> all_part_cols inside the
> PartitionedChildRelInfo,  can't we call pull_child_partition_columns
> directly inside
> inheritance_planner whenever we realize that RTE has some updateCols
> and we want to
> check the overlap?

One extra thing we will have to do is open and close the
partitioned rels again. The idea was that we collect the bitmap
*while* we are already expanding through the tree and the rel is open.
Will check if this is feasible.

>
> +extern void partition_walker_init(PartitionWalker *walker, Relation rel);
> +extern Relation partition_walker_next(PartitionWalker *walker,
> +  Relation *parent);
> +
>
> I don't see these functions are used anywhere?
>
> +typedef struct PartitionWalker
> +{
> + List   *rels_list;
> + ListCell   *cur_cell;
> +} PartitionWalker;
> +
>
> Same as above

Yes, this was left out from the earlier implementation. Will have this
removed in the next updated patch.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-09-18 Thread Dilip Kumar
On Mon, Sep 18, 2017 at 11:29 AM, Dilip Kumar  wrote:
> On Fri, Sep 15, 2017 at 4:55 PM, Amit Khandekar  
> wrote:
>> On 12 September 2017 at 12:39, Amit Khandekar  wrote:
>>> On 12 September 2017 at 11:57, Dilip Kumar  wrote:
 On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar  
 wrote:


>> In the attached patch, we first call ExecARUpdateTriggers(), and while
>> doing that, we first save the info that a NEW row is already captured
>> (mtstate->mt_transition_capture->tcs_update_old_table == true). If it
>> captured, we pass NULL transition_capture pointer to
>> ExecARDeleteTriggers() (and ExecARInsertTriggers) so that it does not
>> again capture an extra row.
>>
>> Modified a testcase in update.sql by including DELETE statement
>> trigger that uses transition tables.
>
> Ok, this fix looks correct to me, I will review the latest patch.

Please find a few more comments.

+ * in which they appear in the PartitionDesc. Also, extract the
+ * partition key columns of the root partitioned table. Those of the
+ * child partitions would be collected during recursive expansion.
*/
+ pull_child_partition_columns(_part_cols, oldrelation, oldrelation);
expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
  lockmode, >append_rel_list,
+   _part_cols,

pcinfo->all_part_cols is only used in case of update, I think we can
call pull_child_partition_columns
only if rte has updateCols?

@@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo

Index parent_relid;
List   *child_rels;
+ Bitmapset  *all_part_cols;
} PartitionedChildRelInfo;

I might be missing something, but do we really need to store
all_part_cols inside the
PartitionedChildRelInfo,  can't we call pull_child_partition_columns
directly inside
inheritance_planner whenever we realize that RTE has some updateCols
and we want to
check the overlap?

+extern void partition_walker_init(PartitionWalker *walker, Relation rel);
+extern Relation partition_walker_next(PartitionWalker *walker,
+  Relation *parent);
+

I don't see these functions are used anywhere?

+typedef struct PartitionWalker
+{
+ List   *rels_list;
+ ListCell   *cur_cell;
+} PartitionWalker;
+

Same as above



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-09-17 Thread Dilip Kumar
On Fri, Sep 15, 2017 at 4:55 PM, Amit Khandekar  wrote:
> On 12 September 2017 at 12:39, Amit Khandekar  wrote:
>> On 12 September 2017 at 11:57, Dilip Kumar  wrote:
>>> On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar  
>>> wrote:
>>>
> I found out that, in case when there is a DELETE statement trigger
> using transition tables, it's not only an issue of redundancy; it's a
> correctness issue. Since for transition tables both DELETE and UPDATE
> use the same old row tuplestore for capturing OLD table, that table
> gets duplicate rows: one from ExecARDeleteTriggers() and another from
> ExecARUpdateTriggers(). In presence of INSERT statement trigger using
> transition tables, both INSERT and UPDATE events have separate
> tuplestore, so duplicate rows don't show up in the UPDATE NEW table.
> But, nevertheless, we need to prevent NEW rows to be collected in the
> INSERT event tuplestore, and capture the NEW rows only in the UPDATE
> event tuplestore.
>
> In the attached patch, we first call ExecARUpdateTriggers(), and while
> doing that, we first save the info that a NEW row is already captured
> (mtstate->mt_transition_capture->tcs_update_old_table == true). If it
> captured, we pass NULL transition_capture pointer to
> ExecARDeleteTriggers() (and ExecARInsertTriggers) so that it does not
> again capture an extra row.
>
> Modified a testcase in update.sql by including DELETE statement
> trigger that uses transition tables.

Ok, this fix looks correct to me, I will review the latest patch.
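
Sketched as code, the call ordering described in the quoted text looks
roughly like this inside ExecDelete() (argument lists are abbreviated
and illustrative; the point is only which call receives the
transition-capture state):

    /* Capture the OLD row for the UPDATE's transition table first */
    if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
        ExecARUpdateTriggers(estate, resultRelInfo, tupleid,
                             NULL, NULL, NIL,
                             mtstate->mt_transition_capture);

    /*
     * The OLD row has already been captured above, so pass a NULL
     * transition-capture state here so that it is not captured a second
     * time as a DELETE.
     */
    ExecARDeleteTriggers(estate, resultRelInfo, tupleid, NULL, NULL);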

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-09-13 Thread amul sul
On Sun, Sep 10, 2017 at 8:47 AM, Amit Kapila 
wrote:

> On Fri, Sep 8, 2017 at 4:51 PM, amul sul  wrote:
> > On Thu, May 18, 2017 at 9:13 AM, Amit Kapila 
> > wrote:
> >>
> >>  On Wed, May 17, 2017 at 5:17 PM, Robert Haas 
> >> wrote:
> >> > On Wed, May 17, 2017 at 6:29 AM, Amit Kapila  >
> >> > wrote:
> >> >> I think we can do this even without using an additional infomask bit.
> >> >> As suggested by Greg up thread, we can set InvalidBlockId in ctid to
> >> >> indicate such an update.
> >> >
> >> > Hmm.  How would that work?
> >> >
> >>
> >> We can pass a flag say row_moved (or require_row_movement) to
> >> heap_delete which will in turn set InvalidBlockId in ctid instead of
> >> setting it to self. Then the ExecUpdate needs to check for the same
> >> and return an error when heap_update is not successful (result !=
> >> HeapTupleMayBeUpdated).  Can you explain what difficulty are you
> >> envisioning?
> >>
> >
> > Attaching WIP patch incorporates the above logic, although I am yet to
> check
> > all the code for places which might be using ip_blkid.  I have got a
> small
> > query here,
> > do we need an error on HeapTupleSelfUpdated case as well?
> >
>
> No, because that case is anyway a no-op (or error depending on whether
> is updated/deleted by same command or later command).  Basically, even
> if the row wouldn't have been moved to another partition, we would not
> have allowed the command to proceed with the update.  This handling is
> to make commands fail rather than a no-op where otherwise (when the
> tuple is not moved to another partition) the command would have
> succeeded.
>
Thank you.

I've rebased the patch against Amit Khandekar's latest patch
(v17_rebased_2). Also added an ip_blkid validation check in
heap_get_latest_tid() and rewrite_heap_tuple(), because an
ItemPointerEquals() check alone is no longer sufficient after this
patch.

Regards,
Amul


0002-invalidate_ctid-ip_blkid-WIP_2.patch



Re: [HACKERS] UPDATE of partition key

2017-09-12 Thread Amit Khandekar
On 12 September 2017 at 11:57, Dilip Kumar  wrote:
> On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar  
> wrote:
>
>> But the statement level trigger function can refer to OLD TABLE and
>> NEW TABLE, which will contain all the OLD rows and NEW rows
>> respectively. So the updated rows of the partitions (including the
>> moved ones) need to be captured. So for OLD TABLE, we need to capture
>> the deleted row, and for NEW TABLE, we need to capture the inserted
>> row.
>
> Yes, I agree.  So in ExecDelete for OLD TABLE we only need to call
> ExecARUpdateTriggers which will make the entry in OLD TABLE only if
> transition table is there otherwise nothing and I guess this part
> already exists in your patch.  And, we are also calling
> ExecARDeleteTriggers and I guess that is to fire the ROW-LEVEL delete
> trigger and that is also fine.  What I don't understand is that if
> there is no "ROW- LEVEL delete trigger" and there is only a "statement
> level delete trigger" with transition table still we are making the
> entry in transition table of the delete trigger and that will never be
> used.

Hmm, ok, that might be happening, since we are calling
ExecARDeleteTriggers() with mtstate->mt_transition_capture non-NULL,
and so the deleted tuple gets captured even when there is no UPDATE
statement trigger defined, which looks redundant. Will check this.
Thanks.

>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-09-12 Thread Dilip Kumar
On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar  wrote:

> But the statement level trigger function can refer to OLD TABLE and
> NEW TABLE, which will contain all the OLD rows and NEW rows
> respectively. So the updated rows of the partitions (including the
> moved ones) need to be captured. So for OLD TABLE, we need to capture
> the deleted row, and for NEW TABLE, we need to capture the inserted
> row.

Yes, I agree.  So in ExecDelete for OLD TABLE we only need to call
ExecARUpdateTriggers which will make the entry in OLD TABLE only if
transition table is there otherwise nothing and I guess this part
already exists in your patch.  And, we are also calling
ExecARDeleteTriggers and I guess that is to fire the ROW-LEVEL delete
trigger, and that is also fine.  What I don't understand is why, if
there is no ROW-LEVEL delete trigger and there is only a statement-level
delete trigger with a transition table, we still make an entry in the
delete trigger's transition table, even though that entry will never be
used.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-09-11 Thread Amit Khandekar
On 11 September 2017 at 21:12, Dilip Kumar  wrote:
> On Thu, Sep 7, 2017 at 11:41 AM, Amit Khandekar  
> wrote:
>> On 6 September 2017 at 21:47, Dilip Kumar  wrote:
>
>> Actually, since transition tables came in, the functions like
>> ExecARUpdateTriggers() or ExecARInsertTriggers() have this additional
>> purpose of capturing transition table rows, so that the images of the
>> tables are visible when statement triggers are fired that refer to
>> these transition tables. So in the above code, these functions only
>> capture rows, they do not add any event for firing any ROW triggers.
>> AfterTriggerSaveEvent() returns without adding any event if it's
>> called only for transition capture. So even if UPDATE row triggers are
>> defined, they won't get fired in case of row movement, although the
>> updated rows would be captured if transition tables are referenced in
>> these triggers or in the statement triggers.
>>
>
> Ok then I have one more question,
>
> With transition table, we can only support statement level trigger

Yes, we don't support row triggers with transition tables if the table
is a partition.

> and for update
> statement, we are only going to execute UPDATE statement level
> trigger? so is there
> any point of making transition table entry for DELETE/INSERT trigger
> as those transition
> table will never be used.

But the statement level trigger function can refer to OLD TABLE and
NEW TABLE, which will contain all the OLD rows and NEW rows
respectively. So the updated rows of the partitions (including the
moved ones) need to be captured. So for OLD TABLE, we need to capture
the deleted row, and for NEW TABLE, we need to capture the inserted
row.

In the regression test update.sql, check how the statement trigger
trans_updatetrig prints all the updated rows, including the moved
ones.


>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-09-11 Thread Dilip Kumar
On Thu, Sep 7, 2017 at 11:41 AM, Amit Khandekar  wrote:
> On 6 September 2017 at 21:47, Dilip Kumar  wrote:

> Actually, since transition tables came in, the functions like
> ExecARUpdateTriggers() or ExecARInsertTriggers() have this additional
> purpose of capturing transition table rows, so that the images of the
> tables are visible when statement triggers are fired that refer to
> these transition tables. So in the above code, these functions only
> capture rows, they do not add any event for firing any ROW triggers.
> AfterTriggerSaveEvent() returns without adding any event if it's
> called only for transition capture. So even if UPDATE row triggers are
> defined, they won't get fired in case of row movement, although the
> updated rows would be captured if transition tables are referenced in
> these triggers or in the statement triggers.
>

Ok then I have one more question,

With transition tables, we can only support statement-level triggers,
and for an UPDATE statement we are only going to execute the UPDATE
statement-level trigger. So is there any point in making transition
table entries for the DELETE/INSERT triggers, as those transition
tables will never be used?  Or am I missing something?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-09-09 Thread Amit Kapila
On Fri, Sep 8, 2017 at 4:51 PM, amul sul  wrote:
> On Thu, May 18, 2017 at 9:13 AM, Amit Kapila 
> wrote:
>>
>>  On Wed, May 17, 2017 at 5:17 PM, Robert Haas 
>> wrote:
>> > On Wed, May 17, 2017 at 6:29 AM, Amit Kapila 
>> > wrote:
>> >> I think we can do this even without using an additional infomask bit.
>> >> As suggested by Greg up thread, we can set InvalidBlockId in ctid to
>> >> indicate such an update.
>> >
>> > Hmm.  How would that work?
>> >
>>
>> We can pass a flag say row_moved (or require_row_movement) to
>> heap_delete which will in turn set InvalidBlockId in ctid instead of
>> setting it to self. Then the ExecUpdate needs to check for the same
>> and return an error when heap_update is not successful (result !=
>> HeapTupleMayBeUpdated).  Can you explain what difficulty are you
>> envisioning?
>>
>
> Attaching WIP patch incorporates the above logic, although I am yet to check
> all the code for places which might be using ip_blkid.  I have got a small
> query here,
> do we need an error on HeapTupleSelfUpdated case as well?
>

No, because that case is anyway a no-op (or an error, depending on whether
it is updated/deleted by the same command or a later command).  Basically, even
if the row wouldn't have been moved to another partition, we would not
have allowed the command to proceed with the update.  This handling is
to make commands fail rather than a no-op where otherwise (when the
tuple is not moved to another partition) the command would have
succeeded.
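
To illustrate the scheme being discussed, the caller-side check could
look something like the following sketch. It assumes, per the proposal
upthread, that heap_delete()/heap_update() store InvalidBlockNumber in
the ctid's ip_blkid when the row has been moved to another partition;
the error message and exact placement are illustrative only:

    /* e.g. in ExecUpdate()/ExecDelete(), after heap_update()/heap_delete() */
    if (result == HeapTupleUpdated &&
        !BlockNumberIsValid(BlockIdGetBlockNumber(&hufd.ctid.ip_blkid)))
        ereport(ERROR,
                (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                 errmsg("tuple to be updated was already moved to another partition due to concurrent update")));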

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-09-08 Thread amul sul
On Thu, May 18, 2017 at 9:13 AM, Amit Kapila 
wrote:

>  On Wed, May 17, 2017 at 5:17 PM, Robert Haas 
> wrote:
> > On Wed, May 17, 2017 at 6:29 AM, Amit Kapila 
> wrote:
> >> I think we can do this even without using an additional infomask bit.
> >> As suggested by Greg up thread, we can set InvalidBlockId in ctid to
> >> indicate such an update.
> >
> > Hmm.  How would that work?
> >
>
> We can pass a flag say row_moved (or require_row_movement) to
> heap_delete which will in turn set InvalidBlockId in ctid instead of
> setting it to self. Then the ExecUpdate needs to check for the same
> and return an error when heap_update is not successful (result !=
> HeapTupleMayBeUpdated).  Can you explain what difficulty are you
> envisioning?
>
>
Attaching a WIP patch that incorporates the above logic, although I am yet
to check all the code for places which might be using ip_blkid.  I have a
small query here:
do we need an error in the HeapTupleSelfUpdated case as well?

Note that patch should be applied to the top of Amit Khandekar's latest
patch(v17_rebased).

Regards,
Amul


0002-invalidate-ctid.ip_blkid-WIP.patch



Re: [HACKERS] UPDATE of partition key

2017-09-08 Thread Amit Langote
On 2017/09/08 18:57, Robert Haas wrote:
>> As mentioned by Amit Langote in the above mail thread, he is going to
>> do changes for making RelationGetPartitionDispatchInfo() return the
>> leaf partitions in depth-first order. Once that is done, I will then
>> remove the hash table method for finding leaf partitions in update
>> result rels, and instead use the earlier efficient method that takes
>> advantage of the fact that update result rels and leaf partitions are
>> in the same order.
> 
> Has he posted that patch yet?  I don't think I saw it, but maybe I
> missed something.

I will post on that thread in a moment.

Thanks,
Amit





Re: [HACKERS] UPDATE of partition key

2017-09-08 Thread Robert Haas
On Thu, Sep 7, 2017 at 6:17 AM, Amit Khandekar  wrote:
> On 3 September 2017 at 17:10, Amit Khandekar  wrote:
>> After recent commit 30833ba154, now the partitions are expanded in
>> depth-first order. It didn't seem worthwhile rebasing my partition
>> walker changes onto the latest code. So in the attached patch, I have
>> removed all the partition walker changes. But
>> RelationGetPartitionDispatchInfo() traverses in breadth-first order,
>> which is different than the update result rels order (because
>> inheritance expansion order is depth-first). So, in order to make the
>> tuple-routing-related leaf partitions in the same order as that of the
>> update result rels, we would have to make changes in
>> RelationGetPartitionDispatchInfo(), which I am not sure whether it is
>> going to be done as part of the thread "expanding inheritance in
>> partition bound order" [1]. For now, in the attached patch, I have
>> reverted back to the hash table method to find the leaf partitions in
>> the update result rels.
>>
>> [1] 
>> https://www.postgresql.org/message-id/CAJ3gD9eyudCNU6V-veMme%2BeyzfX_ey%2BgEzULMzOw26c3f9rzdg%40mail.gmail.com
>
> As mentioned by Amit Langote in the above mail thread, he is going to
> do changes for making RelationGetPartitionDispatchInfo() return the
> leaf partitions in depth-first order. Once that is done, I will then
> remove the hash table method for finding leaf partitions in update
> result rels, and instead use the earlier efficient method that takes
> advantage of the fact that update result rels and leaf partitions are
> in the same order.

Has he posted that patch yet?  I don't think I saw it, but maybe I
missed something.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UPDATE of partition key

2017-09-07 Thread Amit Khandekar
On 3 September 2017 at 17:10, Amit Khandekar  wrote:

> After recent commit 30833ba154, now the partitions are expanded in
> depth-first order. It didn't seem worthwhile rebasing my partition
> walker changes onto the latest code. So in the attached patch, I have
> removed all the partition walker changes. But
> RelationGetPartitionDispatchInfo() traverses in breadth-first order,
> which is different than the update result rels order (because
> inheritance expansion order is depth-first). So, in order to make the
> tuple-routing-related leaf partitions in the same order as that of the
> update result rels, we would have to make changes in
> RelationGetPartitionDispatchInfo(), which I am not sure whether it is
> going to be done as part of the thread "expanding inheritance in
> partition bound order" [1]. For now, in the attached patch, I have
> reverted back to the hash table method to find the leaf partitions in
> the update result rels.
>
> [1] 
> https://www.postgresql.org/message-id/CAJ3gD9eyudCNU6V-veMme%2BeyzfX_ey%2BgEzULMzOw26c3f9rzdg%40mail.gmail.com

As mentioned by Amit Langote in the above mail thread, he is going to
do changes for making RelationGetPartitionDispatchInfo() return the
leaf partitions in depth-first order. Once that is done, I will then
remove the hash table method for finding leaf partitions in update
result rels, and instead use the earlier efficient method that takes
advantage of the fact that update result rels and leaf partitions are
in the same order.

>
> Thanks
> -Amit Khandekar



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-09-06 Thread Dilip Kumar
On Mon, Sep 4, 2017 at 10:52 AM, Amit Khandekar  wrote:
> On 4 September 2017 at 07:43, Amit Kapila  wrote:
> Oops sorry. Now attached.

I have done some basic testing and initial review of the patch.  I
have some comments/doubts.  I will continue the review.

+ if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+ ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid,

For passing an invalid ItemPointer we are using InvalidOid; this seems a
bit odd to me.
Are we using a similar convention some other place? I think it would be
better to just pass 0?

--

- if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
- (event == TRIGGER_EVENT_UPDATE && update_old_table))
+ if (oldtup != NULL &&
+ ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+ (event == TRIGGER_EVENT_UPDATE && update_old_table)))
  {
  Tuplestorestate *old_tuplestore;

- Assert(oldtup != NULL);

Only for TRIGGER_EVENT_UPDATE is it possible for oldtup to be NULL,
so we have added an extra check for oldtup and removed the Assert; but
for TRIGGER_EVENT_DELETE we never expect it to be NULL.

Is it better to put the Assert outside the condition check, i.e.
Assert(oldtup != NULL || event == TRIGGER_EVENT_UPDATE)?
The same applies to newtup.
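
Expressed as code, the suggestion amounts to something like this sketch:

    /* oldtup may legitimately be NULL only for the UPDATE event */
    Assert(oldtup != NULL || event == TRIGGER_EVENT_UPDATE);
    /* likewise for newtup */
    Assert(newtup != NULL || event == TRIGGER_EVENT_UPDATE);

    if (oldtup != NULL &&
        ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
         (event == TRIGGER_EVENT_UPDATE && update_old_table)))
    {
        Tuplestorestate *old_tuplestore;
        ...
    }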

I think we should also explain in the comments why oldtup or newtup
can be NULL in the
TRIGGER_EVENT_UPDATE case.

---

+triggers affect the row being moved. As far as AFTER ROW
+triggers are concerned, AFTER DELETE and
+AFTER INSERT triggers are applied; but
+AFTER UPDATE triggers are not applied
+because the UPDATE has been converted to a
+DELETE and INSERT.

The above comment says that the AFTER ROW UPDATE trigger is not fired, but
the code below calls ExecARUpdateTriggers():

+ if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+ ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid,
+ NULL,
+ tuple,
+ NULL,
+ mtstate->mt_transition_capture);


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-09-03 Thread Amit Kapila
On Sun, Sep 3, 2017 at 5:10 PM, Amit Khandekar  wrote:
> On 31 August 2017 at 14:15, Amit Khandekar  wrote:
>> Thanks Dilip. I am working on rebasing the patch. Particularly, the
>> partition walker in my patch depended on the fact that all the tables
>> get opened (and then closed) while creating the tuple routing info.
>> But in HEAD, now only the partitioned tables get opened. So need some
>> changes in my patch.
>>
>> The partition walker related changes are going to be inapplicable once
>> the other thread [1] commits the changes for expansion of inheritence
>> in bound-order, but till then I would have to rebase the partition
>> walker changes over HEAD.
>>
>> [1] 
>> https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp
>>
>
> After recent commit 30833ba154, now the partitions are expanded in
> depth-first order. It didn't seem worthwhile rebasing my partition
> walker changes onto the latest code. So in the attached patch, I have
> removed all the partition walker changes.
>

It seems you have forgotten to attach the patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-09-03 Thread Amit Khandekar
On 31 August 2017 at 14:15, Amit Khandekar  wrote:
> Thanks Dilip. I am working on rebasing the patch. Particularly, the
> partition walker in my patch depended on the fact that all the tables
> get opened (and then closed) while creating the tuple routing info.
> But in HEAD, now only the partitioned tables get opened. So need some
> changes in my patch.
>
> The partition walker related changes are going to be inapplicable once
> the other thread [1] commits the changes for expansion of inheritence
> in bound-order, but till then I would have to rebase the partition
> walker changes over HEAD.
>
> [1] 
> https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp
>

After recent commit 30833ba154, now the partitions are expanded in
depth-first order. It didn't seem worthwhile rebasing my partition
walker changes onto the latest code. So in the attached patch, I have
removed all the partition walker changes. But
RelationGetPartitionDispatchInfo() traverses in breadth-first order,
which is different than the update result rels order (because
inheritance expansion order is depth-first). So, in order to make the
tuple-routing-related leaf partitions in the same order as that of the
update result rels, we would have to make changes in
RelationGetPartitionDispatchInfo(), which I am not sure whether it is
going to be done as part of the thread "expanding inheritance in
partition bound order" [1]. For now, in the attached patch, I have
reverted back to the hash table method to find the leaf partitions in
the update result rels.

[1] 
https://www.postgresql.org/message-id/CAJ3gD9eyudCNU6V-veMme%2BeyzfX_ey%2BgEzULMzOw26c3f9rzdg%40mail.gmail.com

Thanks
-Amit Khandekar




Re: [HACKERS] UPDATE of partition key

2017-08-31 Thread Amit Khandekar
Thanks Dilip. I am working on rebasing the patch. Particularly, the
partition walker in my patch depended on the fact that all the tables
get opened (and then closed) while creating the tuple routing info.
But in HEAD, now only the partitioned tables get opened, so some
changes are needed in my patch.

The partition walker related changes are going to be inapplicable once
the other thread [1] commits the changes for expansion of inheritance
in bound-order, but till then I would have to rebase the partition
walker changes over HEAD.

[1] 
https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2%40lab.ntt.co.jp


On 31 August 2017 at 12:09, Dilip Kumar  wrote:
> On Fri, Aug 11, 2017 at 10:44 AM, Amit Khandekar  
> wrote:
>> On 4 August 2017 at 22:28, Amit Khandekar  wrote:

>
> I am planning to review and test this patch, Seems like this patch
> needs to be rebased.
>
> [dilip@localhost postgresql]$ patch -p1 <
> ../patches/update-partition-key_v15.patch
> patching file doc/src/sgml/ddl.sgml
> patching file doc/src/sgml/ref/update.sgml
> patching file doc/src/sgml/trigger.sgml
> patching file src/backend/catalog/partition.c
> Hunk #3 succeeded at 910 (offset -1 lines).
> Hunk #4 succeeded at 924 (offset -1 lines).
> Hunk #5 succeeded at 934 (offset -1 lines).
> Hunk #6 succeeded at 994 (offset -1 lines).
> Hunk #7 succeeded at 1009 with fuzz 1 (offset 3 lines).
> Hunk #8 FAILED at 1023.
> Hunk #9 succeeded at 1059 with fuzz 2 (offset 10 lines).
> Hunk #10 succeeded at 2069 (offset 2 lines).
> Hunk #11 succeeded at 2406 (offset 2 lines).
> 1 out of 11 hunks FAILED -- saving rejects to file
> src/backend/catalog/partition.c.rej
> patching file src/backend/commands/copy.c
> Hunk #2 FAILED at 1426.
> Hunk #3 FAILED at 1462.
> Hunk #4 succeeded at 2616 (offset 7 lines).
> Hunk #5 succeeded at 2726 (offset 8 lines).
> Hunk #6 succeeded at 2846 (offset 8 lines).
> 2 out of 6 hunks FAILED -- saving rejects to file
> src/backend/commands/copy.c.rej
> patching file src/backend/commands/trigger.c
> Hunk #4 succeeded at 5261 with fuzz 2.
> patching file src/backend/executor/execMain.c
> Hunk #1 succeeded at 65 (offset 1 line).
> Hunk #2 succeeded at 103 (offset 1 line).
> Hunk #3 succeeded at 1829 (offset 20 lines).
> Hunk #4 succeeded at 1860 (offset 20 lines).
> Hunk #5 succeeded at 1927 (offset 20 lines).
> Hunk #6 succeeded at 2044 (offset 21 lines).
> Hunk #7 FAILED at 3210.
> Hunk #8 FAILED at 3244.
> Hunk #9 succeeded at 3289 (offset 26 lines).
> Hunk #10 FAILED at 3340.
> Hunk #11 succeeded at 3387 (offset 29 lines).
> Hunk #12 succeeded at 3424 (offset 29 lines).
> 3 out of 12 hunks FAILED -- saving rejects to file
> src/backend/executor/execMain.c.rej
> patching file src/backend/executor/execReplication.c
> patching file src/backend/executor/nodeModifyTable.c
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-08-31 Thread Dilip Kumar
On Fri, Aug 11, 2017 at 10:44 AM, Amit Khandekar  wrote:
> On 4 August 2017 at 22:28, Amit Khandekar  wrote:
>>>

I am planning to review and test this patch.  It seems like this patch
needs to be rebased.

[dilip@localhost postgresql]$ patch -p1 <
../patches/update-partition-key_v15.patch
patching file doc/src/sgml/ddl.sgml
patching file doc/src/sgml/ref/update.sgml
patching file doc/src/sgml/trigger.sgml
patching file src/backend/catalog/partition.c
Hunk #3 succeeded at 910 (offset -1 lines).
Hunk #4 succeeded at 924 (offset -1 lines).
Hunk #5 succeeded at 934 (offset -1 lines).
Hunk #6 succeeded at 994 (offset -1 lines).
Hunk #7 succeeded at 1009 with fuzz 1 (offset 3 lines).
Hunk #8 FAILED at 1023.
Hunk #9 succeeded at 1059 with fuzz 2 (offset 10 lines).
Hunk #10 succeeded at 2069 (offset 2 lines).
Hunk #11 succeeded at 2406 (offset 2 lines).
1 out of 11 hunks FAILED -- saving rejects to file
src/backend/catalog/partition.c.rej
patching file src/backend/commands/copy.c
Hunk #2 FAILED at 1426.
Hunk #3 FAILED at 1462.
Hunk #4 succeeded at 2616 (offset 7 lines).
Hunk #5 succeeded at 2726 (offset 8 lines).
Hunk #6 succeeded at 2846 (offset 8 lines).
2 out of 6 hunks FAILED -- saving rejects to file
src/backend/commands/copy.c.rej
patching file src/backend/commands/trigger.c
Hunk #4 succeeded at 5261 with fuzz 2.
patching file src/backend/executor/execMain.c
Hunk #1 succeeded at 65 (offset 1 line).
Hunk #2 succeeded at 103 (offset 1 line).
Hunk #3 succeeded at 1829 (offset 20 lines).
Hunk #4 succeeded at 1860 (offset 20 lines).
Hunk #5 succeeded at 1927 (offset 20 lines).
Hunk #6 succeeded at 2044 (offset 21 lines).
Hunk #7 FAILED at 3210.
Hunk #8 FAILED at 3244.
Hunk #9 succeeded at 3289 (offset 26 lines).
Hunk #10 FAILED at 3340.
Hunk #11 succeeded at 3387 (offset 29 lines).
Hunk #12 succeeded at 3424 (offset 29 lines).
3 out of 12 hunks FAILED -- saving rejects to file
src/backend/executor/execMain.c.rej
patching file src/backend/executor/execReplication.c
patching file src/backend/executor/nodeModifyTable.c

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-08-07 Thread Rajkumar Raghuwanshi
On Fri, Aug 4, 2017 at 10:28 PM, Amit Khandekar 
wrote:

> >
> > Below are the TODOS at this point :
> >
> > Fix for bug reported by Rajkumar about update with join.
>
> I had explained the root issue of this bug here : [1]
>
> Attached patch includes the fix, which is explained below.
>

Hi Amit,

I have applied the v14 patch and tested it from my side; everything looks
good to me. Attaching some test cases and an out file for reference.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


update_partition_test.out
--===
--creating test dataset
CREATE TABLE pt (a INT, b INT, c INT) PARTITION BY RANGE(a);
CREATE TABLE pt_p1 PARTITION OF pt FOR VALUES FROM (1) to (6) PARTITION BY RANGE (b);
CREATE TABLE pt_p1_p1 PARTITION OF pt_p1 FOR VALUES FROM (11) to (44);
CREATE TABLE pt_p1_p2 PARTITION OF pt_p1 FOR VALUES FROM (44) to (66);
CREATE TABLE pt_p2 PARTITION OF pt FOR VALUES FROM (6) to (11) PARTITION BY LIST (c);
CREATE TABLE pt_p2_p1 PARTITION OF pt_p2 FOR VALUES IN (666,777,888);
CREATE TABLE pt_p2_p2 PARTITION OF pt_p2 FOR VALUES IN (999,NULL);
INSERT INTO pt (a,b,c) VALUES (1,11,111),(2,22,222),(3,33,333),(4,44,444),(5,55,555);
INSERT INTO pt (a,b,c) VALUES (6,66,666),(7,77,777),(8,88,888),(9,99,999),(10,100,NULL);
--test with updating root partition
--move data within same partition,satisfying partition constraint --should pass
BEGIN;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
UPDATE pt SET b = 23 WHERE b = 11;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
ROLLBACK;
--move data within same subtree,different partition,satisfying partition constraint --should pass
BEGIN;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
UPDATE pt SET b = 45,c = 422 WHERE b = 22;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
ROLLBACK;
--move data different subtree,satisfying partition constraint --should pass
BEGIN;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
UPDATE pt SET a = 8,c=888 WHERE b = 33;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
ROLLBACK;
--move data within same partition,not satisfying partition constraint --should fail
BEGIN;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
UPDATE pt SET b = 23, a=13 WHERE b = 11;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
ROLLBACK;
--move data within same subtree,different partition,not satisfying partition constraint --should fail
BEGIN;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
UPDATE pt SET b = 45, a=14 WHERE b = 22;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
ROLLBACK;
--move data different subtree,not satisfying partition constraint --should fail
BEGIN;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
UPDATE pt SET b = 88, c=198 WHERE b = 33;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
ROLLBACK;
--
--test with updating child partition
--move data within same partition,satisfying partition constraint --should pass
BEGIN;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
UPDATE pt_p1_p1 SET b = 23 WHERE b = 11;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
ROLLBACK;
--move data within same subtree,different partition,satisfying partition constraint --should pass
BEGIN;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
UPDATE pt_p1 SET b = 45,c = 422 WHERE b = 22;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
ROLLBACK;
--move data within same subtree,different partition,satisfying partition constraint,updating leaf child --should fail
BEGIN;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
UPDATE pt_p1_p1 SET b = 45 WHERE b = 22;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
ROLLBACK;
--move data different subtree,satisfying partition constraint,updating child partition --should fail
BEGIN;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
UPDATE pt_p1 SET a = 8,c=888 WHERE b = 33;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
ROLLBACK;
--move data within same partition,not satisfying partition constraint --should fail
BEGIN;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
UPDATE pt_p1_p1 SET b = 23, a=13 WHERE b = 11;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
ROLLBACK;
--move data within same subtree,different partition,not satisfying partition constraint --should fail
BEGIN;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
UPDATE pt_p1 SET b = 45, a=14 WHERE b = 22;
SELECT TABLEOID::REGCLASS AS PARTITION_NAME,* FROM pt;
ROLLBACK;
--===
--creating test dataset
ALTER TABLE pt_p1 ADD constraint pt_p1_check check(c < 560);
ALTER TABLE pt_p1_p1 add CONSTRAINT pt_p1_p1_uk UNIQUE (c);
ALTER TABLE pt_p1_p2 ADD constraint 

Re: [HACKERS] UPDATE of partition key

2017-08-04 Thread Amit Khandekar
>
> Below are the TODOS at this point :
>
> Fix for bug reported by Rajkumar about update with join.

I had explained the root issue of this bug here : [1]

Attached patch includes the fix, which is explained below.
Currently in the patch, there is a check for whether the tuple has been
concurrently deleted by another session, i.e. when heap_update() returns
HeapTupleUpdated. In such a case we set the concurrently_deleted output
param to true. We should also do the same for the HeapTupleSelfUpdated
return value.

In fact, there are other places in ExecDelete() where it can return
without doing anything. For e.g. if a BR DELETE trigger prevents the
delete from happening, ExecBRDeleteTriggers() returns false, in which
case ExecDelete() returns.

So what the fix does is: rename the concurrently_deleted parameter to
delete_skipped so as to indicate a more general status: whether the
delete has actually happened or was skipped. And set this param to
true only after the delete happens. This allows us to avoid adding
new rows for the trigger case also.

Added test scenario for UPDATE with JOIN case, and also TRIGGER case.

> Do something about two separate mapping tables for Transition tables
> and update tuple-routing.
On 1 July 2017 at 03:15, Thomas Munro  wrote:
> Would make sense to have a set of functions with names like
> GetConvertor{From,To}{Subplan,Leaf}(mtstate, index) which build arrays
> m_convertors_{from,to}_by_{subplan,leaf} the first time they need
> them?

This was discussed here: [2]. I think even if we have them built when
needed, in the presence of both tuple routing and transition tables
we still need separate arrays. So I think rather than dynamic arrays, we
can have static arrays, but their elements will point to a shared
TupleConversionMap structure whenever possible.
As already done in the patch, in case of insert/update tuple routing,
there is a per-leaf-partition mt_transition_tupconv_maps array for
transition tables, and a separate per-subplan array mt_resultrel_maps
for update tuple routing. *But* what I am proposing is: for any
mt_transition_tupconv_maps[] element for which the leaf partition also
exists as a per-subplan result rel, that array element and the
corresponding mt_resultrel_maps[] element will point to the same
TupleConversionMap structure.

This is quite similar to how we are re-using the per-subplan
resultrels for the per-leaf result rels. We will re-use the
per-subplan TupleConversionMap for the per-leaf
mt_transition_tupconv_maps[] elements.

I have not implemented this yet.
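
A rough sketch of that sharing, using the array names mentioned above
and a hypothetical find_subplan_index_for_leaf() helper (neither the
helper nor this exact loop exists in the patch; leaf_rel and root_rel
stand for the opened leaf partition and root table):

    for (int i = 0; i < num_leaf_parts; i++)
    {
        int     subplan_index = find_subplan_index_for_leaf(mtstate, i);

        if (subplan_index >= 0)
        {
            /* Reuse the map already built for the per-subplan result rel */
            mtstate->mt_transition_tupconv_maps[i] =
                mtstate->mt_resultrel_maps[subplan_index];
        }
        else
        {
            mtstate->mt_transition_tupconv_maps[i] =
                convert_tuples_by_name(RelationGetDescr(leaf_rel),
                                       RelationGetDescr(root_rel),
                                       gettext_noop("could not convert row type"));
        }
    }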

> GetUpdatedColumns() to be moved to header file.

Done. I have moved it into execnodes.h.

> More test scenarios in regression tests.
> Need to check/test whether we are correctly applying insert policies
> (and not update) while inserting a routed tuple.

Yet to do above two.

> Use getASTriggerResultRelInfo() for attrno mapping, rather than first
> resultrel, for generating child WCO/RETURNING expression.
>

Regarding generating child WithCheckOption and Returning expressions
using those of the root result relation, ModifyTablePath and
ModifyTable should have new fields rootReturningList (and
rootWithCheckOptions) which would be derived from
root->parse->returningList in inheritance_planner(). But then, similar
to per-subplan returningList, rootReturningList would have to pass
through set_plan_refs()=>set_returning_clause_references() which
requires the subplan targetlist to be passed. Because of this, for
rootReturningList, we require a subplan for root partition, which is
not there currently; we have subplans only for child rels. That means
we would have to create such plan only for the sake of generating
rootReturningList.

The other option is to do it the way the patch currently does in the
executor: use the returningList of the first per-subplan result
rel to generate the other children's returningList (and WithCheckOption).
This works by applying map_partition_varattnos() to the first
returningList. But now that we have realized that we have to specially
handle whole-row vars, map_partition_varattnos() would need some
changes to convert whole-row vars differently for
child-rel-to-child-rel mapping. For childrel-to-childrel conversion,
the whole-row var is already wrapped by ConvertRowtypeExpr, but we
need to change its Var->vartype to the new child's vartype.

I think the second option looks easier, but I am open to suggestions,
and I am myself still checking the first one.
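
A rough sketch of the whole-row Var adjustment mentioned for the second
option (illustrative only; node is the expression-tree node being
walked, and new_child_rel stands for the child relation being mapped to):

    if (IsA(node, ConvertRowtypeExpr))
    {
        ConvertRowtypeExpr *cre = (ConvertRowtypeExpr *) node;
        Var        *var = (Var *) cre->arg;

        if (IsA(var, Var) && var->varattno == 0)
        {
            /* Point the wrapped whole-row Var at the new child's row type */
            var->vartype = RelationGetForm(new_child_rel)->reltype;
        }
    }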

> Address Robert's review comments on make_resultrel_ordered.patch.
>
> +typedef struct ParentChild
>
> This is a pretty generic name.  Pick something more specific and informative.

I have used ChildPartitionInfo. But suggestions welcome.

>
> +static List *append_rel_partition_oids(List *rel_list, Relation rel);
>
> One could be forgiven for thinking that this function was just going
> to append OIDs, but it actually appends ParentChild structures, so I
> think the name needs work.

Renamed it to append_child_partitions().

>
> +List 

Re: [HACKERS] UPDATE of partition key

2017-08-04 Thread Amit Langote
On 2017/08/02 19:49, Amit Khandekar wrote:
> On 2 August 2017 at 14:38, Amit Langote  wrote:
>>> One approach I had considered was to have find_inheritance_children()
>>> itself lock the children in bound order, so that everyone will have
>>> bound-ordered oids, but that would be too expensive since it requires
>>> opening all partitioned tables to initialize partition descriptors. In
>>> find_inheritance_children(), we get all oids without opening any
>>> tables. But now that I think more of it, it's only the partitioned
>>> tables that we have to open, not the leaf partitions; and furthermore,
>>> I didn't see calls to find_inheritance_children() and
>>> find_all_inheritors() in performance-critical code, except in
>>> expand_inherited_rtentry(). All of them are in DDL commands; but yes,
>>> that can change in the future.
>>
>> This approach more or less amounts to calling the new
>> RelationGetPartitionDispatchInfo() (per my proposed patch, a version of
>> which I posted upthread.)  Maybe we can add a wrapper on top, say,
>> get_all_partition_oids() which throws away other things that
>> RelationGetPartitionDispatchInfo() returned.  In addition it locks all the
>> partitions that are returned, unlike only the partitioned ones, which is
>> what RelationGetPartitionDispatchInfo() has been taught to do.
> 
> So there are three different task items here :
> 1. Arrange the oids in consistent order everywhere.
> 2. Prepare the Partition Dispatch Info data structure in the planner
> as against during execution.
> 3. For update tuple routing, assume that the result rels are ordered
> consistently to make the searching efficient.

That's a good breakdown.

> #3 depends on #1. So for that, I have come up with a minimum set of
> changes to have expand_inherited_rtentry() generate the rels in bound
> order. When we do #2 , it may be possible that we may need to re-do my
> changes in expand_inherited_rtentry(), but those are minimum. We may
> even end up having the walker function being used at multiple places,
> but right now it is not certain.

So AFAICS:

For performance reasons, we want the order in which leaf partition
sub-plans appear in the ModifyTable node (and subsequently leaf partition
ResultRelInfos in ModifyTableState) to be some known canonical order.  That's
because we want to map partitions in the insert tuple-routing data
structure (which appear in a known canonical order as determined by
RelationGetPartititionDispatchInfo) to those appearing in the
ModifyTableState.  That's so that we can reuse the planner-generated WCO
and RETURNING lists in the insert code path when update tuple-routing
invokes that path.

To implement that, planner should retrieve the list of leaf partition OIDs
in the same order as ExecSetupPartitionTupleRouting() retrieves them.
Because the latter calls RelationGetPartitionDispatchInfo on the root
partitioned table, maybe the planner should do that too, instead of its
current method getting OIDs using find_all_inheritors().  But it's
currently not possible due to the way RelationGetPartitionDispatchInfo()
and involved data structures are designed.

One way forward I see is to invent new interface functions:

  List *get_all_partition_oids(Oid, LOCKMODE)
  List *get_partition_oids(Oid, LOCKMODE)

that resemble find_all_inheritors() and find_inheritance_children(),
respectively, but expect that users make sure that they are called only
for partitioned tables.  Needless to mention, OIDs are returned in the
canonical order determined by that of the partition bounds and partition
tree structure.  We replace all the calls of the old interface functions
with the respective new ones.  That means expand_inherited_rtentry (among
others) now calls get_all_partition_oids() if the RTE is for a partitioned
table and find_all_inheritors() otherwise.
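
For illustration, the proposed interface and its use in
expand_inherited_rtentry() might look roughly like this (the function
names are only the proposed ones and do not exist yet; variable names
are illustrative):

    extern List *get_all_partition_oids(Oid relid, LOCKMODE lockmode);
    extern List *get_partition_oids(Oid relid, LOCKMODE lockmode);

    /* in expand_inherited_rtentry(): */
    if (rte->relkind == RELKIND_PARTITIONED_TABLE)
        inhOIDs = get_all_partition_oids(parentOID, lockmode);  /* bound order */
    else
        inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);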

> So, I think we can continue the discussion about #1 and #2 in a separate 
> thread.

I have started a new thread named "expanding inheritance in partition
bound order" and posted a couple of patches [1].

After applying those patches, you can write code for #3 without having to
worry about the concerns of partition order, which I guess you've already
done.

>>> Regarding dynamically locking specific partitions as and when needed,
>>> I think this method inherently has the issue of deadlock because the
>>> order would be random. So it feels like there is no way around other
>>> than to lock all partitions beforehand.
>>
>> I'm not sure why the order has to be random.  If and when we decide to
>> open and lock a subset of partitions for a given query, it will be done in
>> some canonical order as far as I can imagine.  Do you have some specific
>> example in mind?
> 
> Partitioned table t1 has partitions t1p1 and t1p2
> Partitioned table t2 at the same level has partitions t2p1 and t2p2
> Tuple routing causes the first row to insert into t2p2, so t2p2 is locked.
> Next insert locks t1p1 because it inserts into t1p1.
> 

Re: [HACKERS] UPDATE of partition key

2017-08-02 Thread Amit Khandekar
On 2 August 2017 at 14:38, Amit Langote  wrote:
> On 2017/07/29 2:45, Amit Khandekar wrote:
>> On 28 July 2017 at 20:10, Robert Haas  wrote:
>>> On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote wrote:
 I checked that we get the same result relation order with both the
 patches, but I would like to highlight a notable difference here between
 the approaches taken by our patches.  In my patch, I have now taught
 RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
 in the tree, because we need to look at its partition descriptor to
 collect partition OIDs and bounds.  We can defer locking (and opening the
 relation descriptor of) leaf partitions to a point where planner has
 determined that the partition will be accessed after all (not pruned),
 which will be done in a separate patch of course.
>>>
>>> That's very desirable, but I believe it introduces a deadlock risk
>>> which Amit's patch avoids.  A transaction using the code you've
>>> written here is eventually going to lock all partitions, BUT it's
>>> going to move the partitioned ones to the front of the locking order
>>> vs. what find_all_inheritors would do.  So, when multi-level
>>> partitioning is in use, I think it could happen that some other
>>> transaction is accessing the table using a different code path that
>>> uses the find_all_inheritors order without modification.  If those
>>> locks conflict (e.g. query vs. DROP) then there's a deadlock risk.
>>
>> Yes, I agree. Even with single-level partitioning, the leaf partitions
>> ordered by find_all_inheritors() is by oid values, so that's also
>> going to be differently ordered.
>
> We do require to lock the parent first in any case.  Doesn't that prevent
> deadlocks by imparting an implicit order on locking by operations whose
> locks conflict.

Yes, maybe, but I am not too sure at this point. find_all_inheritors()
locks only the children, and the parent is already locked
separately. find_all_inheritors() does not necessitate locking the
children with the same lockmode as the parent.

> Having said that, I think it would be desirable for all code paths to
> manipulate partitions in the same order.  For partitioned tables, I think
> we can make it the partition bound order by replacing all calls to
> find_all_inheritors and find_inheritance_children on partitioned table
> parents with something else that reads partition OIDs from the relcache
> (PartitionDesc) and traverses the partition tree breadth-first manner.
>
>>> Unfortunately I don't see any easy way around that problem, but maybe
>>> somebody else has an idea.
>>
>> One approach I had considered was to have find_inheritance_children()
>> itself lock the children in bound order, so that everyone will have
>> bound-ordered oids, but that would be too expensive since it requires
>> opening all partitioned tables to initialize partition descriptors. In
>> find_inheritance_children(), we get all oids without opening any
>> tables. But now that I think more of it, it's only the partitioned
>> tables that we have to open, not the leaf partitions; and furthermore,
>> I didn't see calls to find_inheritance_children() and
>> find_all_inheritors() in performance-critical code, except in
>> expand_inherited_rtentry(). All of them are in DDL commands; but yes,
>> that can change in the future.
>
> This approach more or less amounts to calling the new
> RelationGetPartitionDispatchInfo() (per my proposed patch, a version of
> which I posted upthread.)  Maybe we can add a wrapper on top, say,
> get_all_partition_oids() which throws away other things that
> RelationGetPartitionDispatchInfo() returned.  In addition it locks all the
> partitions that are returned, unlike only the partitioned ones, which is
> what RelationGetPartitionDispatchInfo() has been taught to do.

So there are three different task items here:
1. Arrange the OIDs in a consistent order everywhere.
2. Prepare the PartitionDispatchInfo data structure in the planner
rather than during execution.
3. For update tuple routing, assume that the result rels are ordered
consistently to make the searching efficient.

#3 depends on #1. So for that, I have come up with a minimal set of
changes to have expand_inherited_rtentry() generate the rels in bound
order. When we do #2, we may need to redo my changes in
expand_inherited_rtentry(), but those changes are minimal. We may
even end up using the walker function at multiple places, but right
now that is not certain.

So, I think we can continue the discussion about #1 and #2 in a separate thread.

>
>> Regarding dynamically locking specific partitions as and when needed,
>> I think this method inherently has the issue of deadlock because the
>> order would be random. So it feels like there is no way around other
>> than to lock all partitions beforehand.
>
> I'm not sure why the order has to be random.  If 

Re: [HACKERS] UPDATE of partition key

2017-08-02 Thread Amit Langote
On 2017/07/29 2:45, Amit Khandekar wrote:
> On 28 July 2017 at 20:10, Robert Haas  wrote:
>> On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote wrote:
>>> I checked that we get the same result relation order with both the
>>> patches, but I would like to highlight a notable difference here between
>>> the approaches taken by our patches.  In my patch, I have now taught
>>> RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
>>> in the tree, because we need to look at its partition descriptor to
>>> collect partition OIDs and bounds.  We can defer locking (and opening the
>>> relation descriptor of) leaf partitions to a point where planner has
>>> determined that the partition will be accessed after all (not pruned),
>>> which will be done in a separate patch of course.
>>
>> That's very desirable, but I believe it introduces a deadlock risk
>> which Amit's patch avoids.  A transaction using the code you've
>> written here is eventually going to lock all partitions, BUT it's
>> going to move the partitioned ones to the front of the locking order
>> vs. what find_all_inheritors would do.  So, when multi-level
>> partitioning is in use, I think it could happen that some other
>> transaction is accessing the table using a different code path that
>> uses the find_all_inheritors order without modification.  If those
>> locks conflict (e.g. query vs. DROP) then there's a deadlock risk.
> 
> Yes, I agree. Even with single-level partitioning, the leaf partitions
> ordered by find_all_inheritors() is by oid values, so that's also
> going to be differently ordered.

We do require locking the parent first in any case.  Doesn't that prevent
deadlocks by imparting an implicit order on locking by operations whose
locks conflict?

Having said that, I think it would be desirable for all code paths to
manipulate partitions in the same order.  For partitioned tables, I think
we can make it the partition bound order by replacing all calls to
find_all_inheritors and find_inheritance_children on partitioned table
parents with something else that reads partition OIDs from the relcache
(PartitionDesc) and traverses the partition tree in a breadth-first manner.

>> Unfortunately I don't see any easy way around that problem, but maybe
>> somebody else has an idea.
> 
> One approach I had considered was to have find_inheritance_children()
> itself lock the children in bound order, so that everyone will have
> bound-ordered oids, but that would be too expensive since it requires
> opening all partitioned tables to initialize partition descriptors. In
> find_inheritance_children(), we get all oids without opening any
> tables. But now that I think more of it, it's only the partitioned
> tables that we have to open, not the leaf partitions; and furthermore,
> I didn't see calls to find_inheritance_children() and
> find_all_inheritors() in performance-critical code, except in
> expand_inherited_rtentry(). All of them are in DDL commands; but yes,
> that can change in the future.

This approach more or less amounts to calling the new
RelationGetPartitionDispatchInfo() (per my proposed patch, a version of
which I posted upthread.)  Maybe we can add a wrapper on top, say,
get_all_partition_oids() which throws away other things that
RelationGetPartitionDispatchInfo() returned.  In addition it locks all the
partitions that are returned, unlike only the partitioned ones, which is
what RelationGetPartitionDispatchInfo() has been taught to do.

> Regarding dynamically locking specific partitions as and when needed,
> I think this method inherently has the issue of deadlock because the
> order would be random. So it feels like there is no way around other
> than to lock all partitions beforehand.

I'm not sure why the order has to be random.  If and when we decide to
open and lock a subset of partitions for a given query, it will be done in
some canonical order as far as I can imagine.  Do you have some specific
example in mind?

Thanks,
Amit





Re: [HACKERS] UPDATE of partition key

2017-07-28 Thread Amit Khandekar
On 28 July 2017 at 20:10, Robert Haas  wrote:
> On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote
>  wrote:
>> Sorry to be responding this late to the Amit's make_resultrel_ordered
>> patch itself, but I agree that we should teach the planner to *always*
>> expand partitioned tables in the partition bound order.
>
> Sounds like we have unanimous agreement on that point.

I too agree.

>
>> I checked that we get the same result relation order with both the
>> patches, but I would like to highlight a notable difference here between
>> the approaches taken by our patches.  In my patch, I have now taught
>> RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
>> in the tree, because we need to look at its partition descriptor to
>> collect partition OIDs and bounds.  We can defer locking (and opening the
>> relation descriptor of) leaf partitions to a point where planner has
>> determined that the partition will be accessed after all (not pruned),
>> which will be done in a separate patch of course.

With Amit Langote's patch, we can very well do the locking beforehand
by find_all_inheritors(), and then run
RelationGetPartitionDispatchInfo() with noLock, so as to remove the
deadlock problem. But I think we should keep these two tasks separate,
i.e. expanding the partition tree in bound order, and making
RelationGetPartitionDispatchInfo() work for the planner.

Regarding building the PartitionDispatchInfo in the planner, we should
do that only after it is known that partition columns are updated, so
it can't be done in expand_inherited_rtentry() because that would be too
soon. For planner setup, RelationGetPartitionDispatchInfo() should
just build the tupmap for each partitioned table, and then initialize
the rest of the fields like tupslot, reldesc, etc. later during
execution.

So for now, I feel we should just make the changes needed to ensure the
order is the same, and then, on top of that, separately modify
RelationGetPartitionDispatchInfo() for the planner.

>
> That's very desirable, but I believe it introduces a deadlock risk
> which Amit's patch avoids.  A transaction using the code you've
> written here is eventually going to lock all partitions, BUT it's
> going to move the partitioned ones to the front of the locking order
> vs. what find_all_inheritors would do.  So, when multi-level
> partitioning is in use, I think it could happen that some other
> transaction is accessing the table using a different code path that
> uses the find_all_inheritors order without modification.  If those
> locks conflict (e.g. query vs. DROP) then there's a deadlock risk.

Yes, I agree. Even with single-level partitioning, the leaf partitions
returned by find_all_inheritors() are ordered by OID values, so that's
also going to be ordered differently.

>
> Unfortunately I don't see any easy way around that problem, but maybe
> somebody else has an idea.

One approach I had considered was to have find_inheritance_children()
itself lock the children in bound order, so that everyone will have
bound-ordered oids, but that would be too expensive since it requires
opening all partitioned tables to initialize partition descriptors. In
find_inheritance_children(), we get all oids without opening any
tables. But now that I think more of it, it's only the partitioned
tables that we have to open, not the leaf partitions; and furthermore,
I didn't see calls to find_inheritance_children() and
find_all_inheritors() in performance-critical code, except in
expand_inherited_rtentry(). All of them are in DDL commands; but yes,
that can change in the future.

Regarding dynamically locking specific partitions as and when needed,
I think this method inherently has the issue of deadlock because the
order would be random. So it feels like there is no way around other
than to lock all partitions beforehand.



Regarding using the first resultrel for mapping RETURNING and WCO
expressions, I think we can use (a renamed) getASTriggerResultRelInfo() to
get the root result relation, and use the WCO and RETURNING expressions of
this relation to do the mapping for child rels. This way, there won't be
insert/update-specific code, and we won't need to use the first result
relation.

While checking the whole-row bug on the other thread [1], I noticed
that the RETURNING/WCO expressions for the per-subplan result rels are
formed by considering not just simple Vars, but also whole-row Vars
and other nodes. So for update tuple routing, some result rels would
have WCOs formed using adjust_appendrel_attrs(), while for others they
would be built using map_partition_varattnos(), which only considers
simple Vars. So the bug in [1] would be present for update-partition-key
as well, when the tuple is routed into a newly built resultrel. Maybe,
while fixing the bug in [1], this will get solved automatically.



Below are the TODOs at this point:

Fix for bug reported by Rajkumar about update with join.
Do 

Re: [HACKERS] UPDATE of partition key

2017-07-28 Thread Robert Haas
On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote
 wrote:
> Sorry to be responding this late to the Amit's make_resultrel_ordered
> patch itself, but I agree that we should teach the planner to *always*
> expand partitioned tables in the partition bound order.

Sounds like we have unanimous agreement on that point.  Yesterday, I
was discussing with Beena Emerson, who is working on run-time
partition pruning, that it would also be useful for that purpose, if
you're trying to prune based on a range query.

> I checked that we get the same result relation order with both the
> patches, but I would like to highlight a notable difference here between
> the approaches taken by our patches.  In my patch, I have now taught
> RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
> in the tree, because we need to look at its partition descriptor to
> collect partition OIDs and bounds.  We can defer locking (and opening the
> relation descriptor of) leaf partitions to a point where planner has
> determined that the partition will be accessed after all (not pruned),
> which will be done in a separate patch of course.

That's very desirable, but I believe it introduces a deadlock risk
which Amit's patch avoids.  A transaction using the code you've
written here is eventually going to lock all partitions, BUT it's
going to move the partitioned ones to the front of the locking order
vs. what find_all_inheritors would do.  So, when multi-level
partitioning is in use, I think it could happen that some other
transaction is accessing the table using a different code path that
uses the find_all_inheritors order without modification.  If those
locks conflict (e.g. query vs. DROP) then there's a deadlock risk.

Unfortunately I don't see any easy way around that problem, but maybe
somebody else has an idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UPDATE of partition key

2017-07-26 Thread Amit Khandekar
On 26 July 2017 at 02:37, Robert Haas  wrote:
> Is there any real benefit in this "walker" interface?  It looks to me
> like it might be simpler to just change things around so that it
> returns a list of OIDs, like find_all_inheritors, but generated
> differently.  Then if you want bound-ordering rather than
> OID-ordering, you just do this:
>
> list_free(inhOids);
> inhOids = get_partition_oids_in_bound_order(rel);
>
> That'd remove the need for some if/then logic as you've currently got
> in get_next_child().

Yes, I had considered that; i.e., first generating just a list of
bound-ordered OIDs. But that consequently requires all the child tables
to be opened and closed twice: once during the list generation, and
then again while expanding the partitioned table. Agreed that the second
time heap_open() would not be that expensive because the tables would be
cached, but it would still require getting the cached relation handle
from the hash table. Since we anyway want to open the tables, it is better
to have a *next() function that fetches the next partition in a fixed order.

Actually, there isn't much that the walker next() function does. Any
code that wants to traverse bound-wise can do that on its own. The
walker function is just a convenient way to make sure everyone
traverses in the same order.
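
As a toy, self-contained illustration of that convenience (the WIP
patch's real walker works on Relation objects and the partition
descriptor; the names and types below are stand-ins, not the patch's
actual interface):

#include <stdio.h>

typedef struct PartitionWalker
{
    const unsigned int *oids;   /* partitions in canonical (bound) order */
    int                 nparts;
    int                 next;
} PartitionWalker;

static void
partition_walker_init(PartitionWalker *w, const unsigned int *oids, int nparts)
{
    w->oids = oids;
    w->nparts = nparts;
    w->next = 0;
}

/* Returns 0 once the canonical-order traversal is exhausted. */
static unsigned int
partition_walker_next(PartitionWalker *w)
{
    return (w->next < w->nparts) ? w->oids[w->next++] : 0;
}

int
main(void)
{
    /* Stand-in for the parent's rd_partdesc->oids[] (bound order, not OID order) */
    const unsigned int bound_ordered_oids[] = {16401, 16397, 16404};
    PartitionWalker w;
    unsigned int oid;

    partition_walker_init(&w, bound_ordered_oids, 3);
    while ((oid = partition_walker_next(&w)) != 0)
        printf("visit partition with OID %u\n", oid);
    return 0;
}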

I have yet to go over other things, including your review comments and Amit
Langote's patch on refactoring RelationGetPartitionDispatchInfo().

> On another note, did you do anything about the suggestion Thomas made
> in 
> http://postgr.es/m/CAEepm=3sc_j1zwqdyrbu4dtfx5rhcamnnuaxrkwzfgt9m23...@mail.gmail.com
> ?

This is still pending on me; plus I think there are some more points.
I need to go over those and consolidate a list of todos.




-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-07-26 Thread Amit Langote
On 2017/07/25 21:55, Rajkumar Raghuwanshi wrote:
> Got one more observation :  update... returning is not working with whole
> row reference. please take a look.
> 
> postgres=# create table part (a int, b int) partition by range(a);
> CREATE TABLE
> postgres=# create table part_p1 partition of part for values from
> (minvalue) to (0);
> CREATE TABLE
> postgres=# create table part_p2 partition of part for values from (0) to
> (maxvalue);
> CREATE TABLE
> postgres=# insert into part values (10,1);
> INSERT 0 1
> postgres=# insert into part values (20,2);
> INSERT 0 1
> postgres=# update part t1 set a = b returning t1;
> ERROR:  unexpected whole-row reference found in partition key

That looks like a bug which exists in HEAD too.  I posted a patch in a
dedicated thread to address the same [1].

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/9a39df80-871e-6212-0684-f93c83be4097%40lab.ntt.co.jp





Re: [HACKERS] UPDATE of partition key

2017-07-26 Thread Etsuro Fujita

On 2017/07/26 6:07, Robert Haas wrote:

On Thu, Jul 13, 2017 at 1:09 PM, Amit Khandekar  wrote:

Attached is a WIP patch (make_resultrels_ordered.patch) that generates
the result rels in canonical order. This patch is kept separate from
the update-partition-key patch, and can be applied on master branch.


Thank you for working on this, Amit!


Hmm, I like the approach you've taken here in general,


+1 for the approach.


Is there any real benefit in this "walker" interface?  It looks to me
like it might be simpler to just change things around so that it
returns a list of OIDs, like find_all_inheritors, but generated
differently.  Then if you want bound-ordering rather than
OID-ordering, you just do this:

list_free(inhOids);
inhOids = get_partition_oids_in_bound_order(rel);

That'd remove the need for some if/then logic as you've currently got
in get_next_child().


Yeah, that would make the code much simpler, so +1 for Robert's idea.


I think we should always expand in bound order rather than only when
it's a result relation. I think for partition-wise join, we're going
to want to do it this way for all relations in the query, or at least
for all relations in the query that might possibly be able to
participate in a partition-wise join.  If there are multiple cases
that are going to need this ordering, it's hard for me to accept the
idea that it's worth the complexity of trying to keep track of when we
expanded things in one order vs. another.  There are other
applications of having things in bound order too, like MergeAppend ->
Append strength-reduction (which might not be legal anyway if there
are list partitions with multiple, non-contiguous list bounds or if
any NULL partition doesn't end up in the right place in the order, but
there will be lots of cases where it can work).


+1 for that as well.  Another benefit from that would be EXPLAIN; we
could display partitions for a partitioned table in the same order for
Append and ModifyTable (i.e., SELECT/UPDATE/DELETE), which I think would
make the EXPLAIN result much more readable.


Best regards,
Etsuro Fujita





Re: [HACKERS] UPDATE of partition key

2017-07-26 Thread Amit Langote
On 2017/07/26 6:07, Robert Haas wrote:
> On Thu, Jul 13, 2017 at 1:09 PM, Amit Khandekar  
> wrote:
>> Attached is a WIP patch (make_resultrels_ordered.patch) that generates
>> the result rels in canonical order. This patch is kept separate from
>> the update-partition-key patch, and can be applied on master branch.
>
> I suspect this isn't correct for a table that contains wCTEs, because
> there would in that case be multiple result relations.
> 
> I think we should always expand in bound order rather than only when
> it's a result relation. I think for partition-wise join, we're going
> to want to do it this way for all relations in the query, or at least
> for all relations in the query that might possibly be able to
> participate in a partition-wise join.  If there are multiple cases
> that are going to need this ordering, it's hard for me to accept the
> idea that it's worth the complexity of trying to keep track of when we
> expanded things in one order vs. another.  There are other
> applications of having things in bound order too, like MergeAppend ->
> Append strength-reduction (which might not be legal anyway if there
> are list partitions with multiple, non-contiguous list bounds or if
> any NULL partition doesn't end up in the right place in the order, but
> there will be lots of cases where it can work).

Sorry to be responding this late to Amit's make_resultrel_ordered
patch itself, but I agree that we should teach the planner to *always*
expand partitioned tables in the partition bound order.

When working on something else, I ended up writing a prerequisite patch
that refactors RelationGetPartitionDispatchInfo() to not be too tied to
its current usage for tuple routing, so that it can now be used in the
planner (for example, in expand_inherited_rtentry(), instead of
find_all_inheritors()).  If we could adopt that patch, we could focus more
closely on the update partition row movement issues on this thread, rather
than on concerns about the order in which the planner puts partitions.

I checked that we get the same result relation order with both the
patches, but I would like to highlight a notable difference here between
the approaches taken by our patches.  In my patch, I have now taught
RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
in the tree, because we need to look at its partition descriptor to
collect partition OIDs and bounds.  We can defer locking (and opening the
relation descriptor of) leaf partitions to a point where planner has
determined that the partition will be accessed after all (not pruned),
which will be done in a separate patch of course.

Sorry again that I didn't share this patch sooner.

Thanks,
Amit
From 7a22aedc7c1ae8e1568745c99cf1d11d42cf59d9 Mon Sep 17 00:00:00 2001
From: amit 
Date: Mon, 24 Jul 2017 18:59:57 +0900
Subject: [PATCH 1/3] Decouple RelationGetPartitionDispatchInfo() from executor

Currently it and the structure it generates, viz. PartitionDispatch
objects, are too coupled with the executor's tuple-routing code.  That
includes locking considerations and responsibilities for releasing
relcache references, etc.  That makes it unusable in other places such
as during planning.
---
 src/backend/catalog/partition.c| 326 +
 src/backend/commands/copy.c|  35 ++--
 src/backend/executor/execMain.c| 156 ++--
 src/backend/executor/nodeModifyTable.c |  29 ++-
 src/include/catalog/partition.h|  53 ++
 src/include/executor/executor.h|   4 +-
 src/include/nodes/execnodes.h  |  53 +-
 7 files changed, 409 insertions(+), 247 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index e20ddce2db..e07701d5e8 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -105,6 +105,24 @@ typedef struct PartitionRangeBound
 	bool		lower;		/* this is the lower (vs upper) bound */
 } PartitionRangeBound;
 
+/*---
+ * PartitionDispatchData - information of partitions of one partitioned table
+ *                         in a partition tree
+ *
+ * partkey      Partition key of the table
+ * partdesc     Partition descriptor of the table
+ * indexes      Array with partdesc->nparts members (for details on what the
+ *              individual value represents, see the comments in
+ *              RelationGetPartitionDispatchInfo())
+ *---
+ */
+typedef struct PartitionDispatchData
+{
+	PartitionKey	partkey;	/* Points into the table's relcache entry */
+	PartitionDesc	partdesc;	/* Ditto */
+	int			   *indexes;
+} PartitionDispatchData;
+
 static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
  

Re: [HACKERS] UPDATE of partition key

2017-07-25 Thread Robert Haas
On Thu, Jul 13, 2017 at 1:09 PM, Amit Khandekar  wrote:
> Attached is a WIP patch (make_resultrels_ordered.patch) that generates
> the result rels in canonical order. This patch is kept separate from
> the update-partition-key patch, and can be applied on master branch.

Hmm, I like the approach you've taken here in general, but I think it
needs cleanup.

+typedef struct ParentChild

This is a pretty generic name.  Pick something more specific and informative.

+static List *append_rel_partition_oids(List *rel_list, Relation rel);

One could be forgiven for thinking that this function was just going
to append OIDs, but it actually appends ParentChild structures, so I
think the name needs work.

+List *append_rel_partition_oids(List *rel_list, Relation rel)

Style.  Please pgindent your patches.

+#ifdef DEBUG_PRINT_OIDS
+print_oids(*leaf_part_oids);
+#endif

I'd just rip out this debug stuff once you've got this working, but if
we keep it, it certainly can't have a name as generic as print_oids()
when it's actually doing something with a list of ParentChild
structures.  Also, it prints names, not OIDs.  And DEBUG_PRINT_OIDS is
no good for the same reasons.

+if (RelationGetPartitionDesc(rel))
+walker->rels_list = append_rel_partition_oids(walker->rels_list, rel);

Every place that calls append_rel_partition_oids guards that call with
if (RelationGetPartitionDesc(...)).  It seems to me that it would be
simpler to remove those tests and instead just replace the
Assert(partdesc) inside that function with if (!partdesc) return;

Is there any real benefit in this "walker" interface?  It looks to me
like it might be simpler to just change things around so that it
returns a list of OIDs, like find_all_inheritors, but generated
differently.  Then if you want bound-ordering rather than
OID-ordering, you just do this:

list_free(inhOids);
inhOids = get_partition_oids_in_bound_order(rel);

That'd remove the need for some if/then logic as you've currently got
in get_next_child().

+is_partitioned_resultrel =
+(oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE
+ && rti == parse->resultRelation);

I suspect this isn't correct for a table that contains wCTEs, because
there would in that case be multiple result relations.

I think we should always expand in bound order rather than only when
it's a result relation. I think for partition-wise join, we're going
to want to do it this way for all relations in the query, or at least
for all relations in the query that might possibly be able to
participate in a partition-wise join.  If there are multiple cases
that are going to need this ordering, it's hard for me to accept the
idea that it's worth the complexity of trying to keep track of when we
expanded things in one order vs. another.  There are other
applications of having things in bound order too, like MergeAppend ->
Append strength-reduction (which might not be legal anyway if there
are list partitions with multiple, non-contiguous list bounds or if
any NULL partition doesn't end up in the right place in the order, but
there will be lots of cases where it can work).

On another note, did you do anything about the suggestion Thomas made
in 
http://postgr.es/m/CAEepm=3sc_j1zwqdyrbu4dtfx5rhcamnnuaxrkwzfgt9m23...@mail.gmail.com
?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UPDATE of partition key

2017-07-25 Thread Rajkumar Raghuwanshi
On Tue, Jul 25, 2017 at 3:54 PM, Amit Khandekar 
wrote:

> On 25 July 2017 at 15:02, Rajkumar Raghuwanshi
>  wrote:
> > On Mon, Jul 24, 2017 at 11:23 AM, Amit Khandekar  >
> > wrote:
> >>
> >>
> >> Attached update-partition-key_v13.patch now contains this
> >> make_resultrels_ordered.patch changes.
> >>
> >
> > I have applied attach patch and got below observation.
> >
> > Observation :  if join producing multiple output rows for a given row to
> be
> > modified. I am seeing here it is updating a row and also inserting rows
> in
> > target table. hence after update total count of table got incremented.
>
> Thanks for catching this Rajkumar.
>
> So after the row to be updated is already moved to another partition,
> when the next join output row corresponds to the same row which is
> moved, that row is now deleted, so ExecDelete()=>heap_delete() gets
> HeapTupleSelfUpdated, and this is not handled. So even when
> ExecDelete() finds that the row is already deleted, we still call
> ExecInsert(), so a new row is inserted.  In ExecDelete(), we should
> indicate that the row is already deleted. In the existing patch, there
> is a parameter concurrenty_deleted for ExecDelete() which indicates
> that the row is concurrently deleted. I think we can make this
> parameter for both of these purposes so as to avoid ExecInsert() for
> both these scenarios. Will work on a patch.
>

Thanks Amit.

Got one more observation: update ... returning is not working with a
whole-row reference. Please take a look.

postgres=# create table part (a int, b int) partition by range(a);
CREATE TABLE
postgres=# create table part_p1 partition of part for values from
(minvalue) to (0);
CREATE TABLE
postgres=# create table part_p2 partition of part for values from (0) to
(maxvalue);
CREATE TABLE
postgres=# insert into part values (10,1);
INSERT 0 1
postgres=# insert into part values (20,2);
INSERT 0 1
postgres=# update part t1 set a = b returning t1;
ERROR:  unexpected whole-row reference found in partition key


Re: [HACKERS] UPDATE of partition key

2017-07-25 Thread Amit Khandekar
On 25 July 2017 at 15:02, Rajkumar Raghuwanshi
 wrote:
> On Mon, Jul 24, 2017 at 11:23 AM, Amit Khandekar 
> wrote:
>>
>>
>> Attached update-partition-key_v13.patch now contains this
>> make_resultrels_ordered.patch changes.
>>
>
> I have applied attach patch and got below observation.
>
> Observation :  if join producing multiple output rows for a given row to be
> modified. I am seeing here it is updating a row and also inserting rows in
> target table. hence after update total count of table got incremented.

Thanks for catching this Rajkumar.

So after the row to be updated has already been moved to another partition,
when the next join output row corresponds to that same moved row, the
row is found to be already deleted, so ExecDelete()=>heap_delete() gets
HeapTupleSelfUpdated, and this is not handled. So even when
ExecDelete() finds that the row is already deleted, we still call
ExecInsert(), so a new row is inserted.  ExecDelete() should
indicate that the row is already deleted. In the existing patch, there
is a parameter concurrenty_deleted for ExecDelete() which indicates
that the row was concurrently deleted. I think we can use this
parameter for both of these purposes so as to avoid ExecInsert() in
both scenarios. Will work on a patch.
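
A minimal, self-contained sketch of that control flow (the helper names
and the plain-int "rows" are illustrative stand-ins, not the patch's
actual executor code, which works on heap tuples and result relations):

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the DELETE half: reports whether the old row was already
 * gone, e.g. removed by an earlier join output row for the same tuple. */
static void
delete_from_source_partition(int key, bool *already_deleted)
{
    *already_deleted = (key == 10);     /* pretend key 10 was already moved */
}

/* Stand-in for routing the new tuple version into the destination partition. */
static void
insert_into_destination_partition(int key)
{
    printf("inserted key %d into destination partition\n", key);
}

/* The proposed flow: skip the insert when the delete reports the row is
 * gone, playing the role of the concurrenty_deleted flag mentioned above. */
static void
move_row(int key)
{
    bool already_deleted = false;

    delete_from_source_partition(key, &already_deleted);
    if (!already_deleted)
        insert_into_destination_partition(key);
}

int
main(void)
{
    move_row(7);    /* normal row movement: delete, then insert */
    move_row(10);   /* already deleted/moved: the insert is skipped */
    return 0;
}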




Re: [HACKERS] UPDATE of partition key

2017-07-25 Thread Rajkumar Raghuwanshi
On Mon, Jul 24, 2017 at 11:23 AM, Amit Khandekar 
wrote:

>
> Attached update-partition-key_v13.patch now contains this
> make_resultrels_ordered.patch changes.
>
>
I have applied the attached patch and got the below observation.

Observation: if a join produces multiple output rows for a given row to be
modified, I am seeing that it updates the row and also inserts rows into
the target table; hence after the update the total row count of the table
gets incremented.

Below are the steps:
postgres=# create table part_upd (a int, b int) partition by range(a);
CREATE TABLE
postgres=# create table part_upd1 partition of part_upd for values from
(minvalue) to (-10);
CREATE TABLE
postgres=# create table part_upd2 partition of part_upd for values from
(-10) to (0);
CREATE TABLE
postgres=# create table part_upd3 partition of part_upd for values from (0)
to (10);
CREATE TABLE
postgres=# create table part_upd4 partition of part_upd for values from
(10) to (maxvalue);
CREATE TABLE
postgres=# insert into part_upd select i,i from generate_series(-30,30,3)i;
INSERT 0 21





postgres=# select count(*) from part_upd;
 count
-------
    21
(1 row)
postgres=#
postgres=# create table non_part_upd (a int);
CREATE TABLE
postgres=# insert into non_part_upd select i%2 from
generate_series(-30,30,5)i;
INSERT 0 13
postgres=# update part_upd t1 set a = (t2.a+10) from non_part_upd t2 where
t2.a = t1.b;
UPDATE 7





postgres=# select count(*) from part_upd;
 count
-------
    27
(1 row)
postgres=# select tableoid::regclass,* from part_upd;
 tableoid  |  a  |  b
---+-+-
 part_upd1 | -30 | -30
 part_upd1 | -27 | -27
 part_upd1 | -24 | -24
 part_upd1 | -21 | -21
 part_upd1 | -18 | -18
 part_upd1 | -15 | -15
 part_upd1 | -12 | -12
 part_upd2 |  -9 |  -9
 part_upd2 |  -6 |  -6
 part_upd2 |  -3 |  -3
 part_upd3 |   3 |   3
 part_upd3 |   6 |   6
 part_upd3 |   9 |   9
 part_upd4 |  12 |  12
 part_upd4 |  15 |  15
 part_upd4 |  18 |  18
 part_upd4 |  21 |  21
 part_upd4 |  24 |  24
 part_upd4 |  27 |  27
 part_upd4 |  30 |  30







 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
(27 rows)

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: [HACKERS] UPDATE of partition key

2017-07-23 Thread Amit Khandekar
On 13 July 2017 at 22:39, Amit Khandekar  wrote:
> Attached is a WIP patch (make_resultrels_ordered.patch) that generates
> the result rels in canonical order. This patch is kept separate from
> the update-partition-key patch, and can be applied on master branch.

Attached update-partition-key_v13.patch now contains this
make_resultrels_ordered.patch changes.

So now that the per-subplan result rels and the leaf partition OIDs
that are generated for tuple routing are both known to have the same
(canonical) order, in ExecSetupPartitionTupleRouting() we look up the
per-subplan result rels without the need for a hash table. Instead of
the hash table, we iterate over the leaf partition OIDs and at the
same time keep advancing a position over the per-subplan resultrels
whenever the resultrel at that position is found to be present in the
leaf partitions list. Since the two lists are in the same order, we
never have to rescan the portion of the lists that has already been
scanned.
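
As a minimal, self-contained illustration of that matching scheme (plain
OID arrays stand in for the real ResultRelInfo structures and partition
descriptor; none of this is the actual patch code): because both lists
are in the same canonical order, a single forward-moving cursor over the
per-subplan result rels replaces the hash-table lookup.

#include <stdio.h>

int
main(void)
{
    unsigned int leaf_oids[] = {1001, 1002, 1003, 1004, 1005}; /* all leaf partitions */
    unsigned int subplan_oids[] = {1002, 1004};   /* UPDATE's per-subplan result rels */
    int          nleaf = 5;
    int          nsubplan = 2;
    int          next_subplan = 0;  /* cursor that only ever moves forward */
    int          i;

    for (i = 0; i < nleaf; i++)
    {
        if (next_subplan < nsubplan && subplan_oids[next_subplan] == leaf_oids[i])
        {
            printf("partition %u: reuse the existing per-subplan result rel\n",
                   leaf_oids[i]);
            next_subplan++;
        }
        else
            printf("partition %u: build a new result rel for tuple routing\n",
                   leaf_oids[i]);
    }
    return 0;
}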

I considered whether the issue behind this recent commit might be
relevant for update tuple routing as well:
commit f81a91db4d1c2032632aa5df9fc14be24f5fe5ec
Author: Robert Haas 
Date:   Mon Jul 17 21:29:45 2017 -0400
Use a real RT index when setting up partition tuple routing.

Since we know that using a dummy RT index of 1 for the tuple-routing
result rels is not correct, I am checking another possibility: now in the
latest patch, the tuple-routing partitions would have a mix of a)
existing update result rels, and b) new partition resultrels. The 'b'
resultrels would have the RT index of nominalRelation, but the
existing 'a' resultrels would have their own different RT indexes. I
suspect this might surface an issue similar to the one fixed by the
above commit, e.g. with a WITH query whose UPDATE subqueries do tuple
routing. Will check that.

This patch also has Robert's changes in the planner to decide whether
to do update tuple routing.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


update-partition-key_v13.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-07-13 Thread Amit Khandekar
On 5 July 2017 at 15:12, Amit Khandekar  wrote:
> Like I mentioned upthread... in expand_inherited_rtentry(), if we
> replace find_all_inheritors() with something else that returns oids in
> canonical order, that will change the order in which children tables
> get locked, which increases the chance of deadlock. Because, then the
> callers of find_all_inheritors() will lock them in one order, while
> callers of expand_inherited_rtentry() will lock them in a different
> order. Even in the current code, I think there is a chance of
> deadlocks because RelationGetPartitionDispatchInfo() and
> find_all_inheritors() have different lock ordering.
>
> Now, to get the oids of a partitioned table children sorted by
> canonical ordering, (i.e. using the partition bound values) we need to
> either use the partition bounds to sort the oids like the way it is
> done in RelationBuildPartitionDesc() or, open the parent table and get
> it's Relation->rd_partdesc->oids[] which are already sorted in
> canonical order. So if we generate oids using this way in
> find_all_inheritors() and find_inheritance_children(), that will
> generate consistent ordering everywhere. But this method is quite
> expensive as compared to the way oids are generated and sorted using
> oid values in find_inheritance_children().
>
> In both expand_inherited_rtentry() and
> RelationGetPartitionDispatchInfo(), each of the child tables are
> opened.
>
> So, in both of these functions, what we can do is : call a new
> function partition_tree_walker() which does following :
> 1. Lock the children using the existing order (i.e. sorted by oid
> values) using the same function find_all_inheritors(). Rename
> find_all_inheritors() to lock_all_inheritors(... , bool return_oids)
> which returns the oid list only if requested.
> 2. And then scan through each of the partitions in canonical order, by
> opening the parent table, then opening the partition descriptor oids,
> and then doing whatever needs to be done with that partition rel.
>
> partition_tree_walker() will look something like this :
>
> void partition_tree_walker(Oid parentOid, LOCKMODE lockmode,
>void (*walker_func) (), void *context)
> {
> Relation parentrel;
> List *rels_list;
> ListCell *cell;
>
> (void) lock_all_inheritors(parentOid, lockmode,
>false /* don't generate oids */);
>
> parentrel = heap_open(parentOid, NoLock);
> rels_list = append_rel_partition_oids(NIL, parentrel);
>
> /* Scan through all partitioned rels, and at the
>  * same time append their children. */
> foreach(cell, rels_list)
> {
> /* Open partrel without locking; lock_all_inheritors() has locked it 
> */
> Relationpartrel = heap_open(lfirst_oid(cell), NoLock);
>
> /* Append the children of a partitioned rel to the same list
>  * that we are iterating on */
> if (RelationGetPartitionDesc(partrel))
> rels_list = append_rel_partition_oids(rels_list, partrel);
>
> /*
>  * Do whatever processing needs to be done on this partel.
>  * The walker function is free to either close the partel
>  * or keep it opened, but it needs to make sure the opened
>  * ones are closed later
>  */
> walker_func(partrel, context);
> }
> }
>
> List *append_rel_partition_oids(List *rel_list, Relation rel)
> {
> int i;
> for (i = 0; i < rel->rd_partdesc->nparts; i++)
> rel_list = lappend_oid(rel_list, rel->rd_partdesc->oids[i]);
>
> return rel_list;
> }
>
>
> So, in expand_inherited_rtentry() the foreach(l, inhOIDs) loop will be
> replaced by partition_tree_walker(parentOid, expand_rte_walker_func)
> where expand_rte_walker_func() will do all the work done in the for
> loop for each of the partition rels.
>
> Similarly, in RelationGetPartitionDispatchInfo() the initial part
> where it uses APPEND_REL_PARTITION_OIDS() can be replaced by
> partition_tree_walker(rel, dispatch_info_walkerfunc) where
> dispatch_info_walkerfunc() will generate the oids, or may be populate
> the complete PartitionDispatchData structure. 'pd' variable can be
> passed as context to the partition_tree_walker(..., context)
>
> Generating the resultrels in canonical order by opening the tables
> using the above way wouldn't be more expensive than the existing code,
> because even currently we anyways have to open all the tables in both
> of these functions.
>

Attached is a WIP patch (make_resultrels_ordered.patch) that generates
the result rels in canonical order. This patch is kept separate from
the update-partition-key patch, and can be applied on master branch.

In this patch, rather than having partition_tree_walker() called with a
context, I have provided a function partition_walker_next() using
which we iterate over all the partitions in canonical order.
partition_walker_next() will take care of appending oids from
partition 

Re: [HACKERS] UPDATE of partition key

2017-07-05 Thread Amit Khandekar
On 4 July 2017 at 15:23, Amit Khandekar  wrote:
> On 4 July 2017 at 14:48, Amit Khandekar  wrote:
>> On 4 July 2017 at 14:38, Amit Langote  wrote:
>>> On 2017/07/04 17:25, Etsuro Fujita wrote:
 On 2017/07/03 18:54, Amit Langote wrote:
> On 2017/07/02 20:10, Robert Haas wrote:
>> That
>> seems pretty easy to do - just have expand_inherited_rtentry() notice
>> that it's got a partitioned table and call
>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
>> produce the list of OIDs.
 Seems like a good idea.

> Interesting idea.
>
> If we are going to do this, I think we may need to modify
> RelationGetPartitionDispatchInfo() a bit or invent an alternative that
> does not do as much work.  Currently, it assumes that it's only ever
> called by ExecSetupPartitionTupleRouting() and hence also generates
> PartitionDispatchInfo objects for partitioned child tables.  We don't need
> that if called from within the planner.
>
> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
> with its usage within the executor, because there is this comment:
>
>  /*
>   * We keep the partitioned ones open until we're done using the
>   * information being collected here (for example, see
>   * ExecEndModifyTable).
>   */

 Yeah, we need some refactoring work.  Is anyone working on that?
>>>
>>> I would like to take a shot at that if someone else hasn't already cooked
>>> up a patch.  Working on making RelationGetPartitionDispatchInfo() a
>>> routine callable from both within the planner and the executor should be a
>>> worthwhile effort.
>>
>> What I am currently working on is to see if we can call
>> find_all_inheritors() or find_inheritance_children() instead of
>> generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS().
>> Possibly we don't have to refactor it completely.
>> find_inheritance_children() needs to return the oids in canonical
>> order. So in find_inheritance_children () need to re-use part of
>> RelationBuildPartitionDesc() where it generates those oids in that
>> order. I am checking this part, and am going to come up with an
>> approach based on findings.
>
> The other approach is to make canonical ordering only in
> find_all_inheritors() by replacing call to find_inheritance_children()
> with the refactored/modified RelationGetPartitionDispatchInfo(). But
> that would mean that the callers of find_inheritance_children() would
> have one ordering, while the callers of find_all_inheritors() would
> have a different ordering; that brings up chances of deadlocks. That's
> why I think, we need to think about modifying the common function
> find_inheritance_children(), so that we would be consistent with the
> ordering. And then use find_inheritance_children() or
> find_all_inheritors() in RelationGetPartitionDispatchInfo(). So yes,
> there would be some modifications to
> RelationGetPartitionDispatchInfo().
>
>>
>> Also, need to investigate whether *always* sorting the oids in
>> canonical order is going to be much expensive than the current sorting
>> using oids. But I guess it won't be that expensive.


Like I mentioned upthread... in expand_inherited_rtentry(), if we
replace find_all_inheritors() with something else that returns OIDs in
canonical order, that will change the order in which child tables
get locked, which increases the chance of deadlock. That is because the
callers of find_all_inheritors() will then lock them in one order, while
callers of expand_inherited_rtentry() will lock them in a different
order. Even in the current code, I think there is a chance of
deadlocks because RelationGetPartitionDispatchInfo() and
find_all_inheritors() have different lock ordering.

Now, to get the OIDs of a partitioned table's children sorted in
canonical order (i.e. using the partition bound values), we need to
either use the partition bounds to sort the OIDs the way it is
done in RelationBuildPartitionDesc(), or open the parent table and get
its Relation->rd_partdesc->oids[], which is already sorted in
canonical order. So if we generate OIDs this way in
find_all_inheritors() and find_inheritance_children(), that will
give a consistent ordering everywhere. But this method is quite
expensive compared to the way OIDs are generated and sorted by
OID value in find_inheritance_children().

In both expand_inherited_rtentry() and
RelationGetPartitionDispatchInfo(), each of the child tables is
opened.

So, in both of these functions, what we can do is: call a new
function partition_tree_walker() which does the following:
1. Lock the children in the existing order (i.e. sorted by OID
values) using the same function find_all_inheritors(). Rename
find_all_inheritors() to lock_all_inheritors(..., bool return_oids)
which 

Re: [HACKERS] UPDATE of partition key

2017-07-04 Thread Amit Khandekar
On 4 July 2017 at 14:48, Amit Khandekar  wrote:
> On 4 July 2017 at 14:38, Amit Langote  wrote:
>> On 2017/07/04 17:25, Etsuro Fujita wrote:
>>> On 2017/07/03 18:54, Amit Langote wrote:
 On 2017/07/02 20:10, Robert Haas wrote:
> That
> seems pretty easy to do - just have expand_inherited_rtentry() notice
> that it's got a partitioned table and call
> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
> produce the list of OIDs.
>>> Seems like a good idea.
>>>
 Interesting idea.

 If we are going to do this, I think we may need to modify
 RelationGetPartitionDispatchInfo() a bit or invent an alternative that
 does not do as much work.  Currently, it assumes that it's only ever
 called by ExecSetupPartitionTupleRouting() and hence also generates
 PartitionDispatchInfo objects for partitioned child tables.  We don't need
 that if called from within the planner.

 Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
 with its usage within the executor, because there is this comment:

  /*
   * We keep the partitioned ones open until we're done using the
   * information being collected here (for example, see
   * ExecEndModifyTable).
   */
>>>
>>> Yeah, we need some refactoring work.  Is anyone working on that?
>>
>> I would like to take a shot at that if someone else hasn't already cooked
>> up a patch.  Working on making RelationGetPartitionDispatchInfo() a
>> routine callable from both within the planner and the executor should be a
>> worthwhile effort.
>
> What I am currently working on is to see if we can call
> find_all_inheritors() or find_inheritance_children() instead of
> generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS().
> Possibly we don't have to refactor it completely.
> find_inheritance_children() needs to return the oids in canonical
> order. So in find_inheritance_children () need to re-use part of
> RelationBuildPartitionDesc() where it generates those oids in that
> order. I am checking this part, and am going to come up with an
> approach based on findings.

The other approach is to establish the canonical ordering only in
find_all_inheritors(), by replacing its call to find_inheritance_children()
with the refactored/modified RelationGetPartitionDispatchInfo(). But
that would mean that the callers of find_inheritance_children() would
have one ordering, while the callers of find_all_inheritors() would
have a different ordering; that raises the chance of deadlocks. That's
why I think we need to think about modifying the common function
find_inheritance_children(), so that we would be consistent with the
ordering, and then use find_inheritance_children() or
find_all_inheritors() in RelationGetPartitionDispatchInfo(). So yes,
there would be some modifications to
RelationGetPartitionDispatchInfo().

>
> Also, need to investigate whether *always* sorting the oids in
> canonical order is going to be much expensive than the current sorting
> using oids. But I guess it won't be that expensive.
>
>
> --
> Thanks,
> -Amit Khandekar
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-07-04 Thread Amit Khandekar
On 4 July 2017 at 14:38, Amit Langote  wrote:
> On 2017/07/04 17:25, Etsuro Fujita wrote:
>> On 2017/07/03 18:54, Amit Langote wrote:
>>> On 2017/07/02 20:10, Robert Haas wrote:
 That
 seems pretty easy to do - just have expand_inherited_rtentry() notice
 that it's got a partitioned table and call
 RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
 produce the list of OIDs.
>> Seems like a good idea.
>>
>>> Interesting idea.
>>>
>>> If we are going to do this, I think we may need to modify
>>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that
>>> does not do as much work.  Currently, it assumes that it's only ever
>>> called by ExecSetupPartitionTupleRouting() and hence also generates
>>> PartitionDispatchInfo objects for partitioned child tables.  We don't need
>>> that if called from within the planner.
>>>
>>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
>>> with its usage within the executor, because there is this comment:
>>>
>>>  /*
>>>   * We keep the partitioned ones open until we're done using the
>>>   * information being collected here (for example, see
>>>   * ExecEndModifyTable).
>>>   */
>>
>> Yeah, we need some refactoring work.  Is anyone working on that?
>
> I would like to take a shot at that if someone else hasn't already cooked
> up a patch.  Working on making RelationGetPartitionDispatchInfo() a
> routine callable from both within the planner and the executor should be a
> worthwhile effort.

What I am currently working on is to see if we can call
find_all_inheritors() or find_inheritance_children() instead of
generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS().
Possibly we don't have to refactor it completely.
find_inheritance_children() needs to return the OIDs in canonical
order. So in find_inheritance_children() we need to re-use the part of
RelationBuildPartitionDesc() where it generates those OIDs in that
order. I am checking this part, and am going to come up with an
approach based on the findings.

Also, we need to investigate whether *always* sorting the OIDs in
canonical order is going to be much more expensive than the current
sorting by OID values. But I guess it won't be that expensive.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-07-04 Thread Amit Langote
On 2017/07/04 17:25, Etsuro Fujita wrote:
> On 2017/07/03 18:54, Amit Langote wrote:
>> On 2017/07/02 20:10, Robert Haas wrote:
>>> That
>>> seems pretty easy to do - just have expand_inherited_rtentry() notice
>>> that it's got a partitioned table and call
>>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
>>> produce the list of OIDs.
> Seems like a good idea.
> 
>> Interesting idea.
>>
>> If we are going to do this, I think we may need to modify
>> RelationGetPartitionDispatchInfo() a bit or invent an alternative that
>> does not do as much work.  Currently, it assumes that it's only ever
>> called by ExecSetupPartitionTupleRouting() and hence also generates
>> PartitionDispatchInfo objects for partitioned child tables.  We don't need
>> that if called from within the planner.
>>
>> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
>> with its usage within the executor, because there is this comment:
>>
>>  /*
>>   * We keep the partitioned ones open until we're done using the
>>   * information being collected here (for example, see
>>   * ExecEndModifyTable).
>>   */
> 
> Yeah, we need some refactoring work.  Is anyone working on that?

I would like to take a shot at that if someone else hasn't already cooked
up a patch.  Working on making RelationGetPartitionDispatchInfo() a
routine callable from both within the planner and the executor should be a
worthwhile effort.

Thanks,
Amit





Re: [HACKERS] UPDATE of partition key

2017-07-04 Thread Etsuro Fujita

On 2017/07/03 18:54, Amit Langote wrote:

On 2017/07/02 20:10, Robert Haas wrote:



But that seems like it wouldn't be too hard to fix: let's have
expand_inherited_rtentry() expand the partitioned table in the same
order that will be used by ExecSetupPartitionTupleRouting().


That's really what I wanted when updating the patch for tuple-routing to 
foreign partitions.  (I don't understand the issue discussed here, though.)



That
seems pretty easy to do - just have expand_inherited_rtentry() notice
that it's got a partitioned table and call
RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
produce the list of OIDs.

Seems like a good idea.


Interesting idea.

If we are going to do this, I think we may need to modify
RelationGetPartitionDispatchInfo() a bit or invent an alternative that
does not do as much work.  Currently, it assumes that it's only ever
called by ExecSetupPartitionTupleRouting() and hence also generates
PartitionDispatchInfo objects for partitioned child tables.  We don't need
that if called from within the planner.

Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
with its usage within the executor, because there is this comment:

 /*
  * We keep the partitioned ones open until we're done using the
  * information being collected here (for example, see
  * ExecEndModifyTable).
  */


Yeah, we need some refactoring work.  Is anyone working on that?

Best regards,
Etsuro Fujita





Re: [HACKERS] UPDATE of partition key

2017-07-03 Thread Amit Langote
On 2017/07/02 20:10, Robert Haas wrote:
> On Fri, Jun 30, 2017 at 4:20 PM, Robert Haas  wrote:
>> I don't think the approach of building a hash table to figure out
>> which result rels have already been created is a good one.  That too
>> feels like something that the planner should be figuring out and the
>> executor should just be implementing what the planner decided.  I
>> haven't figured out exactly how that should work yet, but it seems
>> like it ought to be doable.
> 
> I was imagining when I wrote the above that the planner should somehow
> compute a list of relations that it has excluded so that the executor
> can skip building ResultRelInfos for exactly those relations, but on
> further study, that's not particularly easy to achieve and wouldn't
> really save anything anyway, because the list of OIDs is coming
> straight out of the partition descriptor, so it's pretty much free.
> However, I still think it would be a nifty idea if we could avoid
> needing the hash table to deduplicate.  The reason we need that is, I
> think, that expand_inherited_rtentry() is going to expand the
> inheritance hierarchy in whatever order the scan(s) of pg_inherits
> return the descendant tables, whereas the partition descriptor is
> going to put them in a canonical order.
> 
> But that seems like it wouldn't be too hard to fix: let's have
> expand_inherited_rtentry() expand the partitioned table in the same
> order that will be used by ExecSetupPartitionTupleRouting().  That
> seems pretty easy to do - just have expand_inherited_rtentry() notice
> that it's got a partitioned table and call
> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
> produce the list of OIDs.  Then - I think -
> ExecSetupPartitionTupleRouting() doesn't need the hash table; it can
> just scan through the return value of ExecSetupPartitionTupleRouting()
> and the list of already-created ResultRelInfo structures in parallel -
> the order must be the same, but the latter can be missing some
> elements, so it can just create the missing ones.

Interesting idea.

If we are going to do this, I think we may need to modify
RelationGetPartitionDispatchInfo() a bit or invent an alternative that
does not do as much work.  Currently, it assumes that it's only ever
called by ExecSetupPartitionTupleRouting() and hence also generates
PartitionDispatchInfo objects for partitioned child tables.  We don't need
that if called from within the planner.

Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
with its usage within the executor, because there is this comment:

/*
 * We keep the partitioned ones open until we're done using the
 * information being collected here (for example, see
 * ExecEndModifyTable).
 */
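
For illustration only, a lighter-weight variant could limit itself to
walking the partition tree and listing OIDs in some canonical order,
without opening relations or building dispatch structures.  Here is a
toy, self-contained sketch of that idea (made-up structures, and a
simple depth-first order chosen just for the example; this is not a
proposal for the actual function's signature or ordering):

#include <stdio.h>

typedef unsigned int Oid;

/* toy partition tree node: a table and its immediate partitions */
typedef struct ToyPart
{
    Oid         relid;
    int         nparts;
    struct ToyPart *parts;
} ToyPart;

/* collect OIDs in one fixed (here: depth-first) order; nothing is opened */
static void
list_partition_oids(const ToyPart *rel, Oid *out, int *n)
{
    out[(*n)++] = rel->relid;
    for (int i = 0; i < rel->nparts; i++)
        list_partition_oids(&rel->parts[i], out, n);
}

int
main(void)
{
    ToyPart     leaves1[] = {{1001, 0, NULL}, {1002, 0, NULL}};
    ToyPart     leaves2[] = {{2001, 0, NULL}};
    ToyPart     mids[] = {{100, 2, leaves1}, {200, 1, leaves2}};
    ToyPart     root = {1, 2, mids};
    Oid         oids[8];
    int         n = 0;

    list_partition_oids(&root, oids, &n);
    for (int i = 0; i < n; i++)
        printf("%u\n", oids[i]);
    return 0;
}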

Thanks,
Amit





Re: [HACKERS] UPDATE of partition key

2017-07-02 Thread Robert Haas
On Fri, Jun 30, 2017 at 4:20 PM, Robert Haas  wrote:
> I don't think the approach of building a hash table to figure out
> which result rels have already been created is a good one.  That too
> feels like something that the planner should be figuring out and the
> executor should just be implementing what the planner decided.  I
> haven't figured out exactly how that should work yet, but it seems
> like it ought to be doable.

I was imagining when I wrote the above that the planner should somehow
compute a list of relations that it has excluded so that the executor
can skip building ResultRelInfos for exactly those relations, but on
further study, that's not particularly easy to achieve and wouldn't
really save anything anyway, because the list of OIDs is coming
straight out of the partition descriptor, so it's pretty much free.
However, I still think it would be a nifty idea if we could avoid
needing the hash table to deduplicate.  The reason we need that is, I
think, that expand_inherited_rtentry() is going to expand the
inheritance hierarchy in whatever order the scan(s) of pg_inherits
return the descendant tables, whereas the partition descriptor is
going to put them in a canonical order.

But that seems like it wouldn't be too hard to fix: let's have
expand_inherited_rtentry() expand the partitioned table in the same
order that will be used by ExecSetupPartitionTupleRouting().  That
seems pretty easy to do - just have expand_inherited_rtentry() notice
that it's got a partitioned table and call
RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
produce the list of OIDs.  Then - I think -
ExecSetupPartitionTupleRouting() doesn't need the hash table; it can
just scan through the return value of ExecSetupPartitionTupleRouting()
and the list of already-created ResultRelInfo structures in parallel -
the order must be the same, but the latter can be missing some
elements, so it can just create the missing ones.
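
To make the shape of that parallel scan concrete, here is a toy,
self-contained sketch (made-up types; not PostgreSQL code).  It walks
the canonically ordered list of leaf OIDs alongside a smaller list of
already-created result rels that is assumed to be in the same order,
reusing entries that match and creating the missing ones:

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int Oid;

typedef struct ToyResultRel
{
    Oid         relid;          /* partition this result rel was built for */
} ToyResultRel;

int
main(void)
{
    /* all leaf partitions, in canonical (dispatch) order */
    Oid         leaf_oids[] = {1001, 1002, 1003, 1004, 1005};
    int         nleaves = 5;

    /* result rels already created for the plan, in the same order */
    ToyResultRel existing[] = {{1002}, {1004}};
    int         nexisting = 2;

    ToyResultRel routing[5];    /* one slot per leaf partition */
    int         next = 0;       /* cursor into existing[] */

    for (int i = 0; i < nleaves; i++)
    {
        bool        reused = (next < nexisting &&
                              existing[next].relid == leaf_oids[i]);

        if (reused)
            routing[i] = existing[next++];      /* orders match: reuse it */
        else
            routing[i].relid = leaf_oids[i];    /* missing: create a new one */

        printf("leaf %u: %s\n", routing[i].relid,
               reused ? "reused existing result rel"
                      : "created new result rel");
    }
    return 0;
}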

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UPDATE of partition key

2017-06-30 Thread Thomas Munro
On Fri, Jun 30, 2017 at 12:01 AM, Amit Khandekar  wrote:
> On 29 June 2017 at 07:42, Amit Langote  wrote:
>> Hi Amit,
>>
>> On 2017/06/28 20:43, Amit Khandekar wrote:
>>> In attached patch v12
>>
>> The patch no longer applies and fails to compile after the following
>> commit was made yesterday:
>>
>> commit 501ed02cf6f4f60c3357775eb07578aebc912d3a
>> Author: Andrew Gierth 
>> Date:   Wed Jun 28 18:55:03 2017 +0100
>>
>> Fix transition tables for partition/inheritance.
>
> Thanks for informing Amit.
>
> As Thomas mentioned upthread, the above commit already uses a tuple
> conversion mapping from leaf partition to root partitioned table
> (mt_transition_tupconv_maps), which serves the same purpose as that of
> the mapping used in the update-partition-key patch during update tuple
> routing (mt_resultrel_maps).
>
> We need to try to merge these two into a general-purpose mapping array
> such as mt_leaf_root_maps. I haven't done that in the rebased patch
> (attached), so currently it has both these mapping fields.
>
> For transition tables, this map is per-leaf-partition in case of
> inserts, whereas it is per-subplan result rel for updates. For
> update-tuple routing, the mapping is required to be per-subplan. Now,
> for update-row-movement in presence of transition tables, we would
> require both per-subplan mapping as well as per-leaf-partition
> mapping, which can't be done if we have a single mapping field, unless
> we have some way to identify which of the per-leaf partition mapping
> elements belong to per-subplan rels.
>
> So, it's not immediately possible to merge them.

Would it make sense to have a set of functions with names like
GetConvertor{From,To}{Subplan,Leaf}(mtstate, index) which build the
arrays m_convertors_{from,to}_by_{subplan,leaf} the first time they
are needed?
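
For what it's worth, a toy sketch of that build-on-first-use pattern
could look like this (standalone C with made-up types and array sizes;
only the accessor name comes from the suggestion above, nothing here is
actual executor code):

#include <stdio.h>
#include <stdlib.h>

#define NSUBPLANS 3

typedef struct ToyMap
{
    int         subplan;        /* stand-in for a TupleConversionMap */
} ToyMap;

typedef struct ToyMTState
{
    ToyMap    **convertors_from_subplan;    /* built lazily, starts NULL */
} ToyMTState;

/* Return the map for a subplan, building the whole array on first use. */
static ToyMap *
GetConvertorFromSubplan(ToyMTState *mtstate, int index)
{
    if (mtstate->convertors_from_subplan == NULL)
    {
        mtstate->convertors_from_subplan = calloc(NSUBPLANS, sizeof(ToyMap *));
        for (int i = 0; i < NSUBPLANS; i++)
        {
            mtstate->convertors_from_subplan[i] = malloc(sizeof(ToyMap));
            mtstate->convertors_from_subplan[i]->subplan = i;
        }
        printf("built convertor array on first use\n");
    }
    return mtstate->convertors_from_subplan[index];
}

int
main(void)
{
    ToyMTState  mtstate = {NULL};

    printf("map for subplan 1: %d\n",
           GetConvertorFromSubplan(&mtstate, 1)->subplan);
    printf("map for subplan 2: %d\n",
           GetConvertorFromSubplan(&mtstate, 2)->subplan);
    return 0;
}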

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-06-30 Thread Robert Haas
On Thu, Jun 29, 2017 at 3:52 PM, Amit Khandekar  wrote:
> So to conclude, I think, we can do this :
>
> Scenario 1 :
> Only one partitioned table : the root; rest all are leaf partitions.
> In this case, it is definitely efficient to just check the root
> partition key, which will be sufficient.
>
> Scenario 2 :
> There are few non-leaf partitioned tables (3-4) :
> Open those tables, and follow 2nd approach above: If we don't find any
> updated partition-keys in any of them, well and good. If we do find,
> failover to approach 3 : For each of the update resultrels, use the
> new rd_partcheckattrs bitmap to know if it uses any of the updated
> columns. This would be faster than pulling up attrs from the quals
> like how it was done in the patch.

I think we should just have the planner figure out a list of which
columns are partitioning columns either for the named relation or some
descendent, and set a flag if that set of columns overlaps the set of
columns updated.  At execution time, update tuple routing is needed if
either that flag is set or if some partition included in the plan has
a BR UPDATE trigger.  Attached is a draft patch implementing that
approach.
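
Roughly, the decision being described looks like the following toy,
self-contained sketch (plain bitmasks stand in for Bitmapsets and all
names are made up; this is only an illustration of the idea, not the
draft patch itself):

#include <stdbool.h>
#include <stdio.h>

/* one bit per column number, 1-based; toy tables of at most 31 columns */
typedef unsigned int ColSet;

#define COL(n) (1u << (n))

int
main(void)
{
    /*
     * Planner side: partition key columns of the named relation and of
     * any partitioned descendants, gathered into one set.
     */
    ColSet      partkey_cols = COL(1) | COL(4);
    ColSet      updated_cols = COL(4);      /* the SET list touches col 4 */

    /* flag recorded in the plan if any partitioning column is updated */
    bool        part_cols_updated = (partkey_cols & updated_cols) != 0;

    /* Executor side: routing is also needed if a BR UPDATE trigger exists. */
    bool        has_br_update_trigger = false;
    bool        update_tuple_routing_needed =
        part_cols_updated || has_br_update_trigger;

    printf("update tuple routing needed: %s\n",
           update_tuple_routing_needed ? "yes" : "no");
    return 0;
}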

This could be made more accurate.  Suppose table foo is partitioned
by a and some but not all of its partitions are subpartitioned by b.
If it so happens that, in a query which only updates b, constraint
exclusion eliminates all of the partitions that are subpartitioned by
b, it would be unnecessary to enable update tuple routing (unless BR
UPDATE triggers are present) but this patch will not figure that out.
I don't think that optimization is critical for the first version of
this feature; there will be a limited number of users with
asymmetrical subpartitioning setups, and if one of them has an idea
how to improve this without hurting anything else, they are free to
contribute a patch.  Other optimizations are possible too, but I don't
really see any of them as critical either.

I don't think the approach of building a hash table to figure out
which result rels have already been created is a good one.  That too
feels like something that the planner should be figuring out and the
executor should just be implementing what the planner decided.  I
haven't figured out exactly how that should work yet, but it seems
like it ought to be doable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


decide-whether-we-need-update-tuple-routing.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-06-29 Thread Amit Khandekar
On 22 June 2017 at 01:41, Robert Haas  wrote:
>>> +for (i = 0; i < num_rels; i++)
>>> +{
>>> +ResultRelInfo *resultRelInfo = &result_rels[i];
>>> +Relation    rel = resultRelInfo->ri_RelationDesc;
>>> +Bitmapset  *expr_attrs = NULL;
>>> +
>>> +pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
>>> +
>>> +/* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
>>> +if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
>>> +return true;
>>> +}
>>>
>>> This seems like an awfully expensive way of performing this test.
>>> Under what circumstances could this be true for some result relations
>>> and false for others;
>>
>> One resultRelinfo can have no partition key column used in its quals,
>> but the next resultRelinfo can have quite different quals, and these
>> quals can have partition key referred. This is possible if the two of
>> them have different parents that have different partition-key columns.
>
> Hmm, true.  So if we have a table foo that is partitioned by list (a),
> and one of its children is a table bar that is partitioned by list
> (b), then we need to consider doing tuple-routing if either column a
> is modified, or if column b is modified for a partition which is a
> descendant of bar.  But visiting that only requires looking at the
> partitioned table and those children that are also partitioned, not
> all of the leaf partitions as the patch does.

The main concern is that the non-leaf partitions are not open (except
root), so we would need to open them in order to get the partition key
of the parents of update resultrels (or get only the partition key
atts and exprs from pg_partitioned_table).

There can be multiple approaches to finding partition key columns.

Approach 1 : When there are a few update result rels and a large
partition tree, we traverse from each of the result rels up to its
ancestors, opening them (get_partition_parent()) to get the partition
key columns. For result rels having common parents, do this only once.

Approach 2 : If there are only a few partitioned tables, and large
number of update result rels, it would be easier to just open all the
partitioned tables and form the partition key column bitmap out of all
their partition keys. If the bitmap does not have updated columns,
that's not a partition-key-update. So for typical non-partition-key
updates, just opening the partitioned tables will suffice, and so that
would not affect performance of normal updates.

But if the bitmap has updated columns, we can't yet conclude that it's
a partition-key update; that could be a false positive. We again need
to further check whether the update result rels belong to ancestors
whose partition keys are updated.

Approach 3 : In RelationData, in a new bitmap field (rd_partcheckattrs
?), store partition key attrs that are used in rd_partcheck . Populate
this field during generate_partition_qual().

So to conclude, I think, we can do this :

Scenario 1 :
Only one partitioned table : the root; rest all are leaf partitions.
In this case, it is definitely efficient to just check the root
partition key, which will be sufficient.

Scenario 2 :
There are a few non-leaf partitioned tables (3-4) :
Open those tables, and follow the 2nd approach above: if we don't find
any updated partition keys in any of them, well and good. If we do
find some, fall back to approach 3 : for each of the update resultrels,
use the new rd_partcheckattrs bitmap to know if it uses any of the
updated columns. This would be faster than pulling up attrs from the
quals the way it was done in the patch.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-29 Thread Amit Khandekar
On 29 June 2017 at 07:42, Amit Langote  wrote:
> Hi Amit,
>
> On 2017/06/28 20:43, Amit Khandekar wrote:
>> In attached patch v12
>
> The patch no longer applies and fails to compile after the following
> commit was made yesterday:
>
> commit 501ed02cf6f4f60c3357775eb07578aebc912d3a
> Author: Andrew Gierth 
> Date:   Wed Jun 28 18:55:03 2017 +0100
>
> Fix transition tables for partition/inheritance.

Thanks for informing, Amit.

As Thomas mentioned upthread, the above commit already uses a tuple
conversion mapping from leaf partition to root partitioned table
(mt_transition_tupconv_maps), which serves the same purpose as that of
the mapping used in the update-partition-key patch during update tuple
routing (mt_resultrel_maps).

We need to try to merge these two into a general-purpose mapping array
such as mt_leaf_root_maps. I haven't done that in the rebased patch
(attached), so currently it has both these mapping fields.

For transition tables, this map is per-leaf-partition in case of
inserts, whereas it is per-subplan result rel for updates. For
update-tuple routing, the mapping is required to be per-subplan. Now,
for update-row-movement in presence of transition tables, we would
require both per-subplan mapping as well as per-leaf-partition
mapping, which can't be done if we have a single mapping field, unless
we have some way to identify which of the per-leaf partition mapping
elements belong to per-subplan rels.

So, it's not immediately possible to merge them.


update-partition-key_v12_rebased.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-06-28 Thread Amit Langote
Hi Amit,

On 2017/06/28 20:43, Amit Khandekar wrote:
> In attached patch v12

The patch no longer applies and fails to compile after the following
commit was made yesterday:

commit 501ed02cf6f4f60c3357775eb07578aebc912d3a
Author: Andrew Gierth 
Date:   Wed Jun 28 18:55:03 2017 +0100

Fix transition tables for partition/inheritance.

Thanks,
Amit





Re: [HACKERS] UPDATE of partition key

2017-06-28 Thread Amit Khandekar
On 22 June 2017 at 01:57, Robert Haas  wrote:
> On Wed, Jun 21, 2017 at 1:38 PM, Amit Khandekar  
> wrote:
>>>> Yep, it's more appropriate to use
>>>> ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow.  That
>>>> is, if answer to the question I raised above is positive.
>>
>> From what I had checked earlier when coding that part,
>> rootResultRelInfo is NULL in case of inserts, unless something has
>> changed in later commits. That's the reason I decided to use the first
>> resultRelInfo.
>
> We're just going around in circles here.  Saying that you decided to
> use the first child's resultRelInfo because you didn't have a
> resultRelInfo for the parent is an explanation of why you wrote the
> code the way you did, but that doesn't make it correct.  I want to
> know why you think it's correct.

Yeah, that was just an FYI on how I decided to use the first
resultRelInfo; it was not for explaining why using first resultRelInfo
is correct. So upthread, I have tried to explain.

>
> I think it's probably wrong, because it seems to me that if the INSERT
> code needs to use the parent's ResultRelInfo rather than the first
> child's ResultRelInfo, the UPDATE code probably needs to do the same.
> Commit d3cc37f1d801a6b5cad9bf179274a8d767f1ee50 got rid of
> resultRelInfos for non-leaf partitions, and commit
> e180c8aa8caf5c55a273d4a8e6092e77ff3cff10 added the resultRelInfo back
> for the topmost parent, because otherwise it didn't work correctly.



Regarding rootResultRelInfo, it would have been good if it were set
for both inserts and updates, but it isn't set for inserts.

For inserts:
In ExecInitModifyTable(), ModifyTableState->rootResultRelInfo remains
NULL because ModifyTable->rootResultRelIndex is -1:

/* If modifying a partitioned table, initialize the root table info */
if (node->rootResultRelIndex >= 0)
    mtstate->rootResultRelInfo = estate->es_root_result_relations +
        node->rootResultRelIndex;


ModifyTable->rootResultRelIndex remains -1 because it never gets set,
since ModifyTable->partitioned_rels is NULL:

/*
 * If the main target relation is a partitioned table, the
 * following list contains the RT indexes of partitioned child
 * relations including the root, which are not included in the
 * above list.  We also keep RT indexes of the roots
 * separately to be identified as such during the executor
 * initialization.
 */
if (splan->partitioned_rels != NIL)
{
    root->glob->nonleafResultRelations =
        list_concat(root->glob->nonleafResultRelations,
                    list_copy(splan->partitioned_rels));
    /* Remember where this root will be in the global list. */
    splan->rootResultRelIndex =
        list_length(root->glob->rootResultRelations);
    root->glob->rootResultRelations =
        lappend_int(root->glob->rootResultRelations,
                    linitial_int(splan->partitioned_rels));
}

ModifyTable->partitioned_rels is NULL because inheritance_planner()
does not get called for INSERTs; instead, grouping_planner() gets
called:

subquery_planner()
{
    /*
     * Do the main planning.  If we have an inherited target relation, that
     * needs special processing, else go straight to grouping_planner.
     */
    if (parse->resultRelation && rt_fetch(parse->resultRelation,
                                          parse->rtable)->inh)
        inheritance_planner(root);
    else
        grouping_planner(root, false, tuple_fraction);
}

Above, inh is false in case of inserts.




Re: [HACKERS] UPDATE of partition key

2017-06-28 Thread Amit Khandekar
On 26 June 2017 at 08:37, Amit Khandekar  wrote:
> On 22 June 2017 at 01:41, Robert Haas  wrote:
>>>> Second, it will amount to a functional bug if you get a
>>>> different answer than the planner did.
>>>
>>> Actually, the per-leaf WCOs are meant to be executed on the
>>> destination partitions where the tuple is moved, while the WCOs
>>> belonging to the per-subplan resultRelInfo are meant for the
>>> resultRelinfo used for the UPDATE plans. So actually it should not
>>> matter whether they look same or different, because they are fired at
>>> different objects. Now these objects can happen to be the same
>>> relations though.
>>>
>>> But in any case, it's not clear to me how the mapped WCO and the
>>> planner's WCO would yield a different answer if they are both the same
>>> relation. I am possibly missing something. The planner has already
>>> generated the withCheckOptions for each of the resultRelInfo. And then
>>> we are using one of those to re-generate the WCO for a leaf partition
>>> by only adjusting the attnos. If there is already a WCO generated in
>>> the planner for that leaf partition (because that partition was
>>> present in mtstate->resultRelInfo), then the re-built WCO should be
>>> exactly look same as the earlier one, because they are the same
>>> relations, and so the attnos generated in them would be same since the
>>> Relation TupleDesc is the same.
>>
>> If the planner's WCOs and mapped WCOs are always the same, then I
>> think we should try to avoid generating both.  If they can be
>> different, but that's intentional and correct, then there's no
>> substantive problem with the patch but the comments need to make it
>> clear why we are generating both.
>>
>>> Actually I meant, "above works for only local updates. For
>>> row-movement-updates, we need per-leaf partition WCOs, because when
>>> the row is inserted into target partition, that partition may be not
>>> be included in the above planner resultRelInfo, so we need WCOs for
>>> all partitions". I think this said comment should be sufficient if I
>>> add this in the code ?
>>
>> Let's not get too focused on updating the comment until we are in
>> agreement about what the code ought to be doing.  I'm not clear
>> whether you accept the point that the patch needs to be changed to
>> avoid generating the same WCOs and returning lists in both the planner
>> and the executor.
>
> Yes, we can re-use the WCOs generated in the planner, as an
> optimization, since those we re-generate for the same relations will
> look exactly the same. The WCOs generated by planner (in
> inheritance_planner) are generated when (in adjust_appendrel_attrs())
> we change attnos used in the query to refer to the child RTEs and this
> adjusts the attnos of the WCOs of the child RTEs. So the WCOs of
> subplan resultRelInfo are actually the parent table WCOs, but only the
> attnos changed. And in ExecInitModifyTable() we do the same thing for
> leaf partitions, although using different function
> map_variable_attnos().

In attached patch v12,  during UPDATE tuple routing setup, for each
leaf partition, we now check if it is present already in one of the
UPDATE per-subplan resultrels. If present, we re-use them rather than
creating a new one and opening the table again.

So mtstate->mt_partitions is now an array of ResultRelInfo
pointers. Each pointer points either to the UPDATE per-subplan result
rel, or to a newly allocated ResultRelInfo.

For each of the leaf partitions, we have to search through the
per-subplan resultRelInfo OIDs to check if there is a match. To do
this, I have created a temporary hash table which maps the OIDs to the
ResultRelInfo pointers of the mtstate->resultRelInfo array, and which
is used to look up each leaf partition's OID.
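
As a toy illustration of that lookup (standalone C; a trivial
open-addressing table stands in for the real hash table, and the types
are made up):

#include <stdio.h>

typedef unsigned int Oid;

typedef struct ToyResultRel
{
    Oid         relid;
} ToyResultRel;

#define HASHSIZE 16
static ToyResultRel *hashtab[HASHSIZE];

/* trivial open-addressing insert/lookup keyed on relid (toy only) */
static void
hash_insert(ToyResultRel *rri)
{
    unsigned int h = rri->relid % HASHSIZE;

    while (hashtab[h] != NULL)
        h = (h + 1) % HASHSIZE;
    hashtab[h] = rri;
}

static ToyResultRel *
hash_lookup(Oid relid)
{
    unsigned int h = relid % HASHSIZE;

    while (hashtab[h] != NULL)
    {
        if (hashtab[h]->relid == relid)
            return hashtab[h];
        h = (h + 1) % HASHSIZE;
    }
    return NULL;
}

int
main(void)
{
    ToyResultRel subplan_rels[] = {{1002}, {1004}}; /* UPDATE subplan rels */
    Oid         leaf_oids[] = {1001, 1002, 1003, 1004, 1005};
    ToyResultRel *partitions[5];    /* the mt_partitions analogue */
    ToyResultRel fresh[5];          /* storage for newly created ones */

    for (int i = 0; i < 2; i++)
        hash_insert(&subplan_rels[i]);

    for (int i = 0; i < 5; i++)
    {
        ToyResultRel *found = hash_lookup(leaf_oids[i]);

        if (found)
            partitions[i] = found;          /* reuse; table already open */
        else
        {
            fresh[i].relid = leaf_oids[i];
            partitions[i] = &fresh[i];      /* newly allocated result rel */
        }
        printf("leaf %u: %s\n", partitions[i]->relid,
               found ? "reused subplan result rel" : "created new result rel");
    }
    return 0;
}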

This patch version has handled only the above discussion point. I will
follow up with the other points separately.


update-partition-key_v12.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-06-25 Thread Amit Khandekar
On 22 June 2017 at 01:41, Robert Haas  wrote:
>>> Second, it will amount to a functional bug if you get a
>>> different answer than the planner did.
>>
>> Actually, the per-leaf WCOs are meant to be executed on the
>> destination partitions where the tuple is moved, while the WCOs
>> belonging to the per-subplan resultRelInfo are meant for the
>> resultRelinfo used for the UPDATE plans. So actually it should not
>> matter whether they look same or different, because they are fired at
>> different objects. Now these objects can happen to be the same
>> relations though.
>>
>> But in any case, it's not clear to me how the mapped WCO and the
>> planner's WCO would yield a different answer if they are both the same
>> relation. I am possibly missing something. The planner has already
>> generated the withCheckOptions for each of the resultRelInfo. And then
>> we are using one of those to re-generate the WCO for a leaf partition
>> by only adjusting the attnos. If there is already a WCO generated in
>> the planner for that leaf partition (because that partition was
>> present in mtstate->resultRelInfo), then the re-built WCO should be
>> exactly look same as the earlier one, because they are the same
>> relations, and so the attnos generated in them would be same since the
>> Relation TupleDesc is the same.
>
> If the planner's WCOs and mapped WCOs are always the same, then I
> think we should try to avoid generating both.  If they can be
> different, but that's intentional and correct, then there's no
> substantive problem with the patch but the comments need to make it
> clear why we are generating both.
>
>> Actually I meant, "above works for only local updates. For
>> row-movement-updates, we need per-leaf partition WCOs, because when
>> the row is inserted into target partition, that partition may be not
>> be included in the above planner resultRelInfo, so we need WCOs for
>> all partitions". I think this said comment should be sufficient if I
>> add this in the code ?
>
> Let's not get too focused on updating the comment until we are in
> agreement about what the code ought to be doing.  I'm not clear
> whether you accept the point that the patch needs to be changed to
> avoid generating the same WCOs and returning lists in both the planner
> and the executor.

Yes, we can re-use the WCOs generated in the planner, as an
optimization, since those we re-generate for the same relations will
look exactly the same. The WCOs generated by planner (in
inheritance_planner) are generated when (in adjust_appendrel_attrs())
we change attnos used in the query to refer to the child RTEs and this
adjusts the attnos of the WCOs of the child RTEs. So the WCOs of
subplan resultRelInfo are actually the parent table WCOs, but only the
attnos changed. And in ExecInitModifyTable() we do the same thing for
leaf partitions, although using different function
map_variable_attnos().

>
>>> Also, I feel like it's probably not correct to use the first result
>>> relation as the nominal relation for building WCOs and returning lists
>>> anyway.  I mean, if the first result relation has a different column
>>> order than the parent relation, isn't this just broken?  If it works
>>> for some reason, the comments don't explain what that reason is.

One thing I didn't mention earlier about the WCOs is that for child
rels, we don't use the WCOs defined for the child rels. We only
inherit the WCO expressions defined for the root rel. That's the
reason they are the same expressions, only with the attnos changed to
match the respective relation's tupledesc. If the WCOs of each of the
subplan resultRelInfos were different, then it would definitely not be
possible to use the first resultRelInfo to generate the other leaf
partition WCOs, because the WCO defined for relation A is independent
of that defined for relation B.

So, since the WCOs of all the relations are actually those of the
parent, we only need to adjust the attnos of any of these
resultRelInfos.

For example, if the root rel WCO is defined as "col > 5" where col is
the 4th column, the expression will look like "var_1.attno_4 > 5". And
the WCO that is generated for a subplan resultRelInfo will look
something like "var_n.attno_2 > 5" if col is the 2nd column in that
table.

All of the above logic assumes that we never use the WCO defined for
the child relation. At least that's how it looks from the way we
generate WCOs in ExecInitModifyTable() for INSERTs, as well as from the
code in inheritance_planner() for UPDATEs. At both these places, we
never use the WCOs defined for child tables.

So suppose we define the tables and their WCOs like this :

CREATE TABLE range_parted ( a text, b int, c int) partition by range (a, b);

ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
GRANT ALL ON range_parted TO PUBLIC ;
create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);

create table part_b_10_b_20 partition of range_parted for values from
('b', 

Re: [HACKERS] UPDATE of partition key

2017-06-21 Thread Robert Haas
On Wed, Jun 21, 2017 at 1:38 PM, Amit Khandekar  wrote:
>>> Yep, it's more appropriate to use
>>> ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow.  That
>>> is, if answer to the question I raised above is positive.
>
> From what I had checked earlier when coding that part,
> rootResultRelInfo is NULL in case of inserts, unless something has
> changed in later commits. That's the reason I decided to use the first
> resultRelInfo.

We're just going around in circles here.  Saying that you decided to
use the first child's resultRelInfo because you didn't have a
resultRelInfo for the parent is an explanation of why you wrote the
code the way you did, but that doesn't make it correct.  I want to
know why you think it's correct.

I think it's probably wrong, because it seems to me that if the INSERT
code needs to use the parent's ResultRelInfo rather than the first
child's ResultRelInfo, the UPDATE code probably needs to do the same.
Commit d3cc37f1d801a6b5cad9bf179274a8d767f1ee50 got rid of
resultRelInfos for non-leaf partitions, and commit
e180c8aa8caf5c55a273d4a8e6092e77ff3cff10 added the resultRelInfo back
for the topmost parent, because otherwise it didn't work correctly.
If every partition in the hierarchy has a different attribute
ordering, then it seems to me that it must surely matter which of
those attribute orderings we pick.  It's hard to imagine that we can
pick *either* the parent's attribute ordering *or* that of the first
child and nothing will be different - the attribute numbers inside the
returning lists and WCOs we create have got to get used somehow, so
surely it matters which attribute numbers we use, doesn't it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UPDATE of partition key

2017-06-21 Thread Robert Haas
On Wed, Jun 21, 2017 at 1:37 PM, Amit Khandekar  wrote:
>> e.g. the table is partitioned on order number, and you do UPDATE
>> lineitem SET product_code = 'K372B' WHERE product_code = 'K372'.
>
> This query does not update order number, so here there is no
> partition-key-update. Are you thinking that the patch is generating
> the per-leaf-partition WCO expressions even for a update not involving
> a partition key ?

No, it just wasn't a great example.  Sorry.

>> Second, it will amount to a functional bug if you get a
>> different answer than the planner did.
>
> Actually, the per-leaf WCOs are meant to be executed on the
> destination partitions where the tuple is moved, while the WCOs
> belonging to the per-subplan resultRelInfo are meant for the
> resultRelinfo used for the UPDATE plans. So actually it should not
> matter whether they look same or different, because they are fired at
> different objects. Now these objects can happen to be the same
> relations though.
>
> But in any case, it's not clear to me how the mapped WCO and the
> planner's WCO would yield a different answer if they are both the same
> relation. I am possibly missing something. The planner has already
> generated the withCheckOptions for each of the resultRelInfo. And then
> we are using one of those to re-generate the WCO for a leaf partition
> by only adjusting the attnos. If there is already a WCO generated in
> the planner for that leaf partition (because that partition was
> present in mtstate->resultRelInfo), then the re-built WCO should be
> exactly look same as the earlier one, because they are the same
> relations, and so the attnos generated in them would be same since the
> Relation TupleDesc is the same.

If the planner's WCOs and mapped WCOs are always the same, then I
think we should try to avoid generating both.  If they can be
different, but that's intentional and correct, then there's no
substantive problem with the patch but the comments need to make it
clear why we are generating both.

> Actually I meant, "above works for only local updates. For
> row-movement-updates, we need per-leaf partition WCOs, because when
> the row is inserted into target partition, that partition may be not
> be included in the above planner resultRelInfo, so we need WCOs for
> all partitions". I think this said comment should be sufficient if I
> add this in the code ?

Let's not get too focused on updating the comment until we are in
agreement about what the code ought to be doing.  I'm not clear
whether you accept the point that the patch needs to be changed to
avoid generating the same WCOs and returning lists in both the planner
and the executor.

>> Also, I feel like it's probably not correct to use the first result
>> relation as the nominal relation for building WCOs and returning lists
>> anyway.  I mean, if the first result relation has a different column
>> order than the parent relation, isn't this just broken?  If it works
>> for some reason, the comments don't explain what that reason is.
>
> Not sure why parent relation should come into picture. As long as the
> first result relation belongs to one of the partitions in the whole
> partition tree, we should be able to use that to build WCOs of any
> other partitions, because they have a common set of attributes having
> the same name. So we are bound to find each of the attributes of first
> resultRelInfo in the other leaf partitions during attno mapping.

Well, at least for returning lists, we've got to generate the
returning lists so that they all match the column order of the parent,
not the parent's first child.  Otherwise, for example, UPDATE
parent_table ... RETURNING * will not work correctly.  The tuples
returned by the returning clause have to have the attribute order of
parent_table, not the attribute order of parent_table's first child.
I'm not sure whether WCOs have the same issue, but it's not clear to
me why they wouldn't: they contain a qual which is an expression tree,
and presumably there are Var nodes in there someplace, and if so, then
they have varattnos that have to be right for the purpose for which
they're going to be used.

>> +for (i = 0; i < num_rels; i++)
>> +{
>> +ResultRelInfo *resultRelInfo = &result_rels[i];
>> +Relation    rel = resultRelInfo->ri_RelationDesc;
>> +Bitmapset  *expr_attrs = NULL;
>> +
>> +pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
>> +
>> +/* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
>> +if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
>> +return true;
>> +}
>>
>> This seems like an awfully expensive way of performing this test.
>> Under what circumstances could this be true for some result relations
>> and false for others;
>
> One resultRelinfo can have no partition key column used in its quals,
> but the next resultRelinfo can have quite different quals, and these

Re: [HACKERS] UPDATE of partition key

2017-06-21 Thread Amit Khandekar
On 21 June 2017 at 20:14, Robert Haas  wrote:
> On Wed, Jun 21, 2017 at 5:28 AM, Amit Langote  wrote:
>>> The comment "UPDATE/DELETE cases are handled above" is referring to
>>> the code that initializes the WCOs generated by the planner.  You've
>>> modified the comment in your patch, but the associated code: your
>>> updated comment says that only "DELETEs and local UPDATES are handled
>>> above", but in reality, *all* updates are still handled above.  And
>>> then they are handled again here.  Similarly for returning lists.
>>> It's certainly not OK for the comment to be inaccurate, but I think
>>> it's also bad to redo the work which the planner has already done,
>>> even if it makes the patch smaller.
>>
>> I guess this has to do with the UPDATE turning into DELETE+INSERT.  So, it
>> seems like WCOs are being initialized for the leaf partitions
>> (ResultRelInfos in the mt_partitions array) that are in turn are
>> initialized for the aforementioned INSERT.  That's why the term "...local
>> UPDATEs" in the new comment text.
>>
>> If that's true, I wonder if it makes sense to apply what would be
>> WCO_RLS_UPDATE_CHECK to a leaf partition that the tuple will be moved into
>> by calling ExecInsert()?
>
> I think we probably should apply the insert policy, just as we're
> executing the insert trigger.

Yes, the RLS quals should execute during tuple routing according to
whether it is an update or whether it has been converted to an insert.
I think the tests don't quite test the insert part. Will check.

>
>>> Also, I feel like it's probably not correct to use the first result
>>> relation as the nominal relation for building WCOs and returning lists
>>> anyway.  I mean, if the first result relation has a different column
>>> order than the parent relation, isn't this just broken?  If it works
>>> for some reason, the comments don't explain what that reason is.
>>
>> Yep, it's more appropriate to use
>> ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow.  That
>> is, if answer to the question I raised above is positive.

From what I had checked earlier when coding that part,
rootResultRelInfo is NULL in case of inserts, unless something has
changed in later commits. That's the reason I decided to use the first
resultRelInfo.


Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-21 Thread Amit Khandekar
On 21 June 2017 at 00:23, Robert Haas  wrote:
> On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar  
> wrote:
>>> I guess I don't see why it should work like this.  In the INSERT case,
>>> we must build withCheckOption objects for each partition because those
>>> partitions don't appear in the plan otherwise -- but in the UPDATE
>>> case, they're already there, so why do we need to build anything at
>>> all?  Similarly for RETURNING projections.  How are the things we need
>>> for those cases not already getting built, associated with the
>>> relevant resultRelInfos?  Maybe there's a concern if some children got
>>> pruned - they could turn out later to be the children into which
>>> tuples need to be routed. But the patch makes no distinction
>>> between possibly-pruned children and any others.
>>
>> Yes, only a subset of the partitions appear in the UPDATE subplans. I
>> think typically for updates, a very small subset of the total leaf
>> partitions will be there in the plans, others would get pruned. IMHO,
>> it would not be worth having an optimization where it opens only those
>> leaf partitions which are not already there in the subplans. Without
>> the optimization, we are able to re-use the INSERT infrastructure
>> without additional changes.
>
> Well, that is possible, but certainly not guaranteed.  I mean,
> somebody could do a whole-table UPDATE, or an UPDATE that hits a
> smattering of rows in every partition;

I am not saying that it's guaranteed to be a small subset. I am saying
that it would typically be a small subset for the
update-of-partition-key case. It seems weird if a user causes an
update-row-movement for multiple partitions at the same time.
Generally it would be an administrative task where some or all of the
rows of a partition need to have their partition key updated such that
they change their partition, and so there would probably be a WHERE
clause narrowing the update down to that particular partition, because
without the WHERE clause the update is anyway slower and it's
redundant to scan all the other partitions.

But, point taken, there can always be cases involving partition-key
updates that touch multiple partitions.

> e.g. the table is partitioned on order number, and you do UPDATE
> lineitem SET product_code = 'K372B' WHERE product_code = 'K372'.

This query does not update the order number, so here there is no
partition-key update. Are you thinking that the patch is generating
the per-leaf-partition WCO expressions even for an update not involving
a partition key?

>
> Leaving that aside, the point here is that you're rebuilding
> withCheckOptions and returningLists that have already been built in
> the planner.  That's bad for two reasons.  First, it's inefficient,
> especially if there are many partitions.

Yeah, I agree that this becomes more and more redundant if the update
involves more partitions.

> Second, it will amount to a functional bug if you get a
> different answer than the planner did.

Actually, the per-leaf WCOs are meant to be executed on the
destination partitions where the tuple is moved, while the WCOs
belonging to the per-subplan resultRelInfo are meant for the
resultRelinfo used for the UPDATE plans. So actually it should not
matter whether they look same or different, because they are fired at
different objects. Now these objects can happen to be the same
relations though.

But in any case, it's not clear to me how the mapped WCO and the
planner's WCO would yield a different answer if they are both for the
same relation. I am possibly missing something. The planner has already
generated the withCheckOptions for each of the resultRelInfos. And then
we are using one of those to re-generate the WCO for a leaf partition
by only adjusting the attnos. If there is already a WCO generated in
the planner for that leaf partition (because that partition was
present in mtstate->resultRelInfo), then the re-built WCO should look
exactly the same as the earlier one, because they are the same
relations, and so the attnos generated in them would be the same since
the Relation TupleDesc is the same.

> Note this comment in the existing code:
>
> /*
>  * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
>  * that we didn't build the withCheckOptionList for each partition within
>  * the planner, but simple translation of the varattnos for each partition
>  * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
>  * cases are handled above.
>  */
>
> The comment "UPDATE/DELETE cases are handled above" is referring to
> the code that initializes the WCOs generated by the planner.  You've
> modified the comment in your patch, but the associated code: your
> updated comment says that only "DELETEs and local UPDATES are handled
> above", but in reality, *all* updates are still handled above.  And

Actually I meant, "above works for only local updates. For
row-movement-updates, 

Re: [HACKERS] UPDATE of partition key

2017-06-21 Thread Robert Haas
On Wed, Jun 21, 2017 at 5:28 AM, Amit Langote  wrote:
>> The comment "UPDATE/DELETE cases are handled above" is referring to
>> the code that initializes the WCOs generated by the planner.  You've
>> modified the comment in your patch, but the associated code: your
>> updated comment says that only "DELETEs and local UPDATES are handled
>> above", but in reality, *all* updates are still handled above.  And
>> then they are handled again here.  Similarly for returning lists.
>> It's certainly not OK for the comment to be inaccurate, but I think
>> it's also bad to redo the work which the planner has already done,
>> even if it makes the patch smaller.
>
> I guess this has to do with the UPDATE turning into DELETE+INSERT.  So, it
> seems like WCOs are being initialized for the leaf partitions
> (ResultRelInfos in the mt_partitions array) that are in turn are
> initialized for the aforementioned INSERT.  That's why the term "...local
> UPDATEs" in the new comment text.
>
> If that's true, I wonder if it makes sense to apply what would be
> WCO_RLS_UPDATE_CHECK to a leaf partition that the tuple will be moved into
> by calling ExecInsert()?

I think we probably should apply the insert policy, just as we're
executing the insert trigger.

>> Also, I feel like it's probably not correct to use the first result
>> relation as the nominal relation for building WCOs and returning lists
>> anyway.  I mean, if the first result relation has a different column
>> order than the parent relation, isn't this just broken?  If it works
>> for some reason, the comments don't explain what that reason is.
>
> Yep, it's more appropriate to use
> ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow.  That
> is, if answer to the question I raised above is positive.

The questions appear to me to be independent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UPDATE of partition key

2017-06-21 Thread Amit Langote
On 2017/06/21 3:53, Robert Haas wrote:
> On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar  
> wrote:
>>> I guess I don't see why it should work like this.  In the INSERT case,
>>> we must build withCheckOption objects for each partition because those
>>> partitions don't appear in the plan otherwise -- but in the UPDATE
>>> case, they're already there, so why do we need to build anything at
>>> all?  Similarly for RETURNING projections.  How are the things we need
>>> for those cases not already getting built, associated with the
>>> relevant resultRelInfos?  Maybe there's a concern if some children got
>>> pruned - they could turn out later to be the children into which
>>> tuples need to be routed. But the patch makes no distinction
>>> between possibly-pruned children and any others.
>>
>> Yes, only a subset of the partitions appear in the UPDATE subplans. I
>> think typically for updates, a very small subset of the total leaf
>> partitions will be there in the plans, others would get pruned. IMHO,
>> it would not be worth having an optimization where it opens only those
>> leaf partitions which are not already there in the subplans. Without
>> the optimization, we are able to re-use the INSERT infrastructure
>> without additional changes.
> 
> Well, that is possible, but certainly not guaranteed.  I mean,
> somebody could do a whole-table UPDATE, or an UPDATE that hits a
> smattering of rows in every partition; e.g. the table is partitioned
> on order number, and you do UPDATE lineitem SET product_code = 'K372B'
> WHERE product_code = 'K372'.
> 
> Leaving that aside, the point here is that you're rebuilding
> withCheckOptions and returningLists that have already been built in
> the planner.  That's bad for two reasons.  First, it's inefficient,
> especially if there are many partitions.  Second, it will amount to a
> functional bug if you get a different answer than the planner did.
> Note this comment in the existing code:
> 
> /*
>  * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
>  * that we didn't build the withCheckOptionList for each partition within
>  * the planner, but simple translation of the varattnos for each partition
>  * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
>  * cases are handled above.
>  */
> 
> The comment "UPDATE/DELETE cases are handled above" is referring to
> the code that initializes the WCOs generated by the planner.  You've
> modified the comment in your patch, but the associated code: your
> updated comment says that only "DELETEs and local UPDATES are handled
> above", but in reality, *all* updates are still handled above.  And
> then they are handled again here.  Similarly for returning lists.
> It's certainly not OK for the comment to be inaccurate, but I think
> it's also bad to redo the work which the planner has already done,
> even if it makes the patch smaller.

I guess this has to do with the UPDATE turning into DELETE+INSERT.  So, it
seems like WCOs are being initialized for the leaf partitions
(ResultRelInfos in the mt_partitions array) that are in turn initialized
for the aforementioned INSERT.  That's why the term "...local
UPDATEs" in the new comment text.

If that's true, I wonder if it makes sense to apply what would be
WCO_RLS_UPDATE_CHECK to a leaf partition that the tuple will be moved into
by calling ExecInsert()?

> Also, I feel like it's probably not correct to use the first result
> relation as the nominal relation for building WCOs and returning lists
> anyway.  I mean, if the first result relation has a different column
> order than the parent relation, isn't this just broken?  If it works
> for some reason, the comments don't explain what that reason is.

Yep, it's more appropriate to use
ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow.  That
is, if answer to the question I raised above is positive.

Thanks,
Amit





Re: [HACKERS] UPDATE of partition key

2017-06-20 Thread Robert Haas
On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar  wrote:
>> I guess I don't see why it should work like this.  In the INSERT case,
>> we must build withCheckOption objects for each partition because those
>> partitions don't appear in the plan otherwise -- but in the UPDATE
>> case, they're already there, so why do we need to build anything at
>> all?  Similarly for RETURNING projections.  How are the things we need
>> for those cases not already getting built, associated with the
>> relevant resultRelInfos?  Maybe there's a concern if some children got
>> pruned - they could turn out later to be the children into which
>> tuples need to be routed. But the patch makes no distinction
>> between possibly-pruned children and any others.
>
> Yes, only a subset of the partitions appear in the UPDATE subplans. I
> think typically for updates, a very small subset of the total leaf
> partitions will be there in the plans, others would get pruned. IMHO,
> it would not be worth having an optimization where it opens only those
> leaf partitions which are not already there in the subplans. Without
> the optimization, we are able to re-use the INSERT infrastructure
> without additional changes.

Well, that is possible, but certainly not guaranteed.  I mean,
somebody could do a whole-table UPDATE, or an UPDATE that hits a
smattering of rows in every partition; e.g. the table is partitioned
on order number, and you do UPDATE lineitem SET product_code = 'K372B'
WHERE product_code = 'K372'.

Leaving that aside, the point here is that you're rebuilding
withCheckOptions and returningLists that have already been built in
the planner.  That's bad for two reasons.  First, it's inefficient,
especially if there are many partitions.  Second, it will amount to a
functional bug if you get a different answer than the planner did.
Note this comment in the existing code:

/*
 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 * that we didn't build the withCheckOptionList for each partition within
 * the planner, but simple translation of the varattnos for each partition
 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
 * cases are handled above.
 */

The comment "UPDATE/DELETE cases are handled above" is referring to
the code that initializes the WCOs generated by the planner.  You've
modified the comment in your patch, but the associated code: your
updated comment says that only "DELETEs and local UPDATES are handled
above", but in reality, *all* updates are still handled above.  And
then they are handled again here.  Similarly for returning lists.
It's certainly not OK for the comment to be inaccurate, but I think
it's also bad to redo the work which the planner has already done,
even if it makes the patch smaller.

Also, I feel like it's probably not correct to use the first result
relation as the nominal relation for building WCOs and returning lists
anyway.  I mean, if the first result relation has a different column
order than the parent relation, isn't this just broken?  If it works
for some reason, the comments don't explain what that reason is.

>> ... I don't understand how you can *not* need a per-leaf-partition
>> mapping.  I mean, maybe you only need the mapping for the *unpruned*
>> leaf partitions
>
> Yes, we need the mapping only for the unpruned leaf partitions, and
> those partitions are available in the per-subplan resultRelInfo's.

OK.

>> but you certainly need a separate mapping for each one of those.
>
> You mean *each* of the leaf partitions ? I didn't get why we would
> need it for each one. The tuple targeted for update belongs to one of
> the per-subplan resultInfos. And this tuple is to be routed to another
> leaf partition. So the reverse mapping is for conversion from the
> source resultRelinfo to the root partition. I am unable to figure out
> a scenario where we would require this reverse mapping for partitions
> on which UPDATE is *not* going to be executed.

I agree - the reverse mapping is only needed for the partitions in
which UPDATE will be executed.

Some other things:

+ * The row was already deleted by a concurrent DELETE. So we don't
+ * have anything to update.

I find this explanation, and the surrounding comments, inadequate.  It
doesn't really explain why we're doing this.  I think it should say
something like this: For a normal UPDATE, the case where the tuple has
been the subject of a concurrent UPDATE or DELETE would be handled by
the EvalPlanQual machinery, but for an UPDATE that we've translated
into a DELETE from this partition and an INSERT into some other
partition, that's not available, because CTID chains can't span
relation boundaries.  We mimic the semantics to a limited extent by
skipping the INSERT if the DELETE fails to find a tuple.  This ensures
that two concurrent attempts to UPDATE the same tuple at the same time
can't turn one tuple into two, and that 

Re: [HACKERS] UPDATE of partition key

2017-06-20 Thread Amit Khandekar
On 20 June 2017 at 03:46, Robert Haas  wrote:
> On Thu, Jun 15, 2017 at 1:36 PM, Amit Khandekar  
> wrote:
>> Attached patch v10 fixes the above. In the existing code, where it
>> builds WCO constraints for each leaf partition; with the patch, that
>> code now is applicable to row-movement-updates as well.
>
> I guess I don't see why it should work like this.  In the INSERT case,
> we must build withCheckOption objects for each partition because those
> partitions don't appear in the plan otherwise -- but in the UPDATE
> case, they're already there, so why do we need to build anything at
> all?  Similarly for RETURNING projections.  How are the things we need
> for those cases not already getting built, associated with the
> relevant resultRelInfos?  Maybe there's a concern if some children got
> pruned - they could turn out later to be the children into which
> tuples need to be routed. But the patch makes no distinction
> between possibly-pruned children and any others.

Yes, only a subset of the partitions appear in the UPDATE subplans. I
think typically for updates, a very small subset of the total leaf
partitions will be there in the plans, others would get pruned. IMHO,
it would not be worth having an optimization where it opens only those
leaf partitions which are not already there in the subplans. Without
the optimization, we are able to re-use the INSERT infrastructure
without additional changes.


>
>> There is another issue I discovered. The row-movement works fine if
>> the destination leaf partition has different attribute ordering than
>> the root : the existing insert-tuple-routing mapping handles that. But
>> if the source partition has different ordering w.r.t. the root, it has
>> a problem : there is no mapping in the opposite direction, i.e. from
>> the leaf to root. And we require that because the tuple of source leaf
>> partition needs to be converted to root partition tuple descriptor,
>> since ExecFindPartition() starts with root.
>
> Seems reasonable, but...
>
>> To fix this, I have introduced another mapping array
>> mtstate->mt_resultrel_maps[]. This corresponds to the
>> mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping,
>> because the update result relations are pruned subset of the total
>> leaf partitions.
>
> ... I don't understand how you can *not* need a per-leaf-partition
> mapping.  I mean, maybe you only need the mapping for the *unpruned*
> leaf partitions

Yes, we need the mapping only for the unpruned leaf partitions, and
those partitions are available in the per-subplan resultRelInfo's.

> but you certainly need a separate mapping for each one of those.

You mean *each* of the leaf partitions? I didn't get why we would
need it for each one. The tuple targeted for update belongs to one of
the per-subplan resultRelInfos. And this tuple is to be routed to
another leaf partition. So the reverse mapping is for conversion from
the source resultRelInfo to the root partition. I am unable to figure
out a scenario where we would require this reverse mapping for
partitions on which UPDATE is *not* going to be executed.

>
> It's possible to imagine driving the tuple routing off of just the
> partition key attributes, extracted from wherever they are inside the
> tuple at the current level, rather than converting to the root's tuple
> format.  However, that's not totally straightforward because there
> could be multiple levels of partitioning throughout the tree and
> different attributes might be needed at different levels.

Yes, the conversion anyway occurs at each of these levels even for
insert, specifically because there can be different partition
attributes each time. For update, it's only one additional conversion.
But yes, this new mapping would be required for this one single
conversion.

> Moreover,
> in most cases, the mappings are going to end up being no-ops because
> the column order will be the same, so it's probably not worth
> complicating the code to try to avoid a double conversion that usually
> won't happen.

I agree.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-19 Thread Amit Khandekar
On 20 June 2017 at 03:42, Thomas Munro  wrote:
> Just a thought: If I understand correctly this new array of tuple
> conversion maps is the same as mtstate->mt_transition_tupconv_maps in
> my patch transition-tuples-from-child-tables-v11.patch (hopefully soon
> to be committed to close a PG10 open item).  In my patch I bounce
> transition tuples from child relations up to the named relation's
> triggers, and in this patch you bounce child tuples up to the named
> relation for rerouting, so the conversion requirement is the same.
> Perhaps we could consider refactoring to build a common struct member
> on demand for the row movement patch at some point in the future if it
> makes the code cleaner.

I agree; thanks for bringing this to my attention. The conversion maps
in my patch and yours do sound like they are exactly the same. And even
in the case where both update-row-movement and transition tables are
playing together, the same map should serve both purposes. I will keep a
watch on your patch, and check how I can adjust my patch so that I
don't have to refactor the mapping.

One difference I see is: in your patch, in ExecModifyTable() we advance
the current map position for each successive subplan, whereas in my
patch, in ExecInsert() we deduce the position of the right map to be
fetched using the position of the current resultRelInfo in the
mtstate->resultRelInfo[] array. I think your way is more consistent
with the existing code.
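
(Just to illustrate what I mean by "deduce the position" -- a rough
sketch only, not the exact patch code; mt_resultrel_maps is the new
array this patch adds:)

    /*
     * The current resultRelInfo is an element of the
     * mtstate->resultRelInfo[] array, so simple pointer arithmetic
     * gives the index of the reverse-conversion map to use.
     */
    int         map_index = resultRelInfo - mtstate->resultRelInfo;

    Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
    tup_conv_map = mtstate->mt_resultrel_maps[map_index];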

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-19 Thread Robert Haas
On Thu, Jun 15, 2017 at 1:36 PM, Amit Khandekar  wrote:
> Attached patch v10 fixes the above. In the existing code, where it
> builds WCO constraints for each leaf partition; with the patch, that
> code now is applicable to row-movement-updates as well.

I guess I don't see why it should work like this.  In the INSERT case,
we must build withCheckOption objects for each partition because those
partitions don't appear in the plan otherwise -- but in the UPDATE
case, they're already there, so why do we need to build anything at
all?  Similarly for RETURNING projections.  How are the things we need
for those cases not already getting built, associated with the
relevant resultRelInfos?  Maybe there's a concern if some children got
pruned - they could turn out later to be the children into which
tuples need to be routed.  But the patch makes no distinction between
possibly-pruned children and any others.

> There is another issue I discovered. The row-movement works fine if
> the destination leaf partition has different attribute ordering than
> the root : the existing insert-tuple-routing mapping handles that. But
> if the source partition has different ordering w.r.t. the root, it has
> a problem : there is no mapping in the opposite direction, i.e. from
> the leaf to root. And we require that because the tuple of source leaf
> partition needs to be converted to root partition tuple descriptor,
> since ExecFindPartition() starts with root.

Seems reasonable, but...

> To fix this, I have introduced another mapping array
> mtstate->mt_resultrel_maps[]. This corresponds to the
> mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping,
> because the update result relations are pruned subset of the total
> leaf partitions.

... I don't understand how you can *not* need a per-leaf-partition
mapping.  I mean, maybe you only need the mapping for the *unpruned*
leaf partitions but you certainly need a separate mapping for each one
of those.

It's possible to imagine driving the tuple routing off of just the
partition key attributes, extracted from wherever they are inside the
tuple at the current level, rather than converting to the root's tuple
format.  However, that's not totally straightforward because there
could be multiple levels of partitioning throughout the tree and
different attributes might be needed at different levels.  Moreover,
in most cases, the mappings are going to end up being no-ops because
the column order will be the same, so it's probably not worth
complicating the code to try to avoid a double conversion that usually
won't happen.
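
(For reference, a no-op map is essentially free at run time:
convert_tuples_by_name() returns NULL when the descriptors already
match, so the per-row cost is just a NULL check. Roughly, with
leaf_rel and root_rel as placeholder names:)

    map = convert_tuples_by_name(RelationGetDescr(leaf_rel),
                                 RelationGetDescr(root_rel),
                                 gettext_noop("could not convert row type"));

    if (map != NULL)
        tuple = do_convert_tuple(tuple, map);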

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UPDATE of partition key

2017-06-19 Thread Thomas Munro
On Fri, Jun 16, 2017 at 5:36 AM, Amit Khandekar  wrote:
> There is another issue I discovered. The row-movement works fine if
> the destination leaf partition has different attribute ordering than
> the root : the existing insert-tuple-routing mapping handles that. But
> if the source partition has different ordering w.r.t. the root, it has
> a problem : there is no mapping in the opposite direction, i.e. from
> the leaf to root. And we require that because the tuple of source leaf
> partition needs to be converted to root partition tuple descriptor,
> since ExecFindPartition() starts with root.
>
> To fix this, I have introduced another mapping array
> mtstate->mt_resultrel_maps[]. This corresponds to the
> mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping,
> because the update result relations are pruned subset of the total
> leaf partitions.

Hi Amit & Amit,

Just a thought: If I understand correctly this new array of tuple
conversion maps is the same as mtstate->mt_transition_tupconv_maps in
my patch transition-tuples-from-child-tables-v11.patch (hopefully soon
to be committed to close a PG10 open item).  In my patch I bounce
transition tuples from child relations up to the named relation's
triggers, and in this patch you bounce child tuples up to the named
relation for rerouting, so the conversion requirement is the same.
Perhaps we could consider refactoring to build a common struct member
on demand for the row movement patch at some point in the future if it
makes the code cleaner.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-06-18 Thread Amit Khandekar
When I tested a partition-key update on a partitioned table having no
child partitions, it crashed. This is because there is an
Assert(mtstate->mt_num_partitions > 0) for creating the
partition-to-root map, which fails if there are no partitions under
the partitioned table. Actually, we should skip creating this map if
there are no partitions under the partitioned table on which the UPDATE
is run. So the attached patch has this new change to fix it (and an
appropriate additional test case added):

--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2006,15 +2006,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * descriptor of a source partition does not match the root partition
 	 * descriptor. In such case we need to convert tuples to the root partition
 	 * tuple descriptor, because the search for destination partition starts
-	 * from the root.
+	 * from the root. Skip this setup if it's not a partition key update or if
+	 * there are no partitions below this partitioned table.
 	 */
-	if (is_partitionkey_update)
+	if (is_partitionkey_update && mtstate->mt_num_partitions > 0)
 	{
 		TupleConversionMap **tup_conv_maps;
 		TupleDesc	outdesc;
 
-		Assert(mtstate->mt_num_partitions > 0);
-
 		mtstate->mt_resultrel_maps =
 			(TupleConversionMap **)
 			palloc0(sizeof(TupleConversionMap*) * nplans);

On 15 June 2017 at 23:06, Amit Khandekar  wrote:
> On 13 June 2017 at 15:40, Amit Khandekar  wrote:
>> While rebasing my patch for the below recent commit, I realized that a
>> similar issue exists for the update-tuple-routing patch as well:
>>
>> commit 78a030a441966d91bc7e932ef84da39c3ea7d970
>> Author: Tom Lane 
>> Date:   Mon Jun 12 23:29:44 2017 -0400
>>
>> Fix confusion about number of subplans in partitioned INSERT setup.
>>
>> The above issue was about incorrectly using 'i' in
>> mtstate->mt_plans[i] during handling WITH CHECK OPTIONS in
>> ExecInitModifyTable(), where 'i' was actually meant to refer to the
>> positions in mtstate->mt_num_partitions. Actually for INSERT, there is
>> only a single plan element in mtstate->mt_plans[] array.
>>
>> Similarly, for update-tuple routing, we cannot use
>> mtstate->mt_plans[i], because 'i' refers to position in
>> mtstate->mt_partitions[] , whereas mtstate->mt_plans is not at all in
>> order of mtstate->mt_partitions; in fact mt_plans has only the plans
>> that are to be scanned on pruned partitions; so it can well be a small
>> subset of total partitions.
>>
>> I am working on an updated patch to fix the above.
>
> Attached patch v10 fixes the above. In the existing code, where it
> builds WCO constraints for each leaf partition; with the patch, that
> code now is applicable to row-movement-updates as well. So the
> assertions in the code are now updated to allow the same. Secondly,
> the mapping for each of the leaf partitions was constructed using the
> root partition attributes. Now in the patch, the
> mtstate->resultRelInfo[0] (i.e. the first resultRelInfo) is used as
> reference. So effectively, map_partition_varattnos() now represents
> not just parent-to-partition mapping, but rather, mapping between any
> two partitions/partitioned_tables. It's done this way, so that we can
> have a common WCO building code for inserts as well as updates. For
> e.g. for inserts, the first (and only) WCO belongs to
> node->nominalRelation so nominalRelation is used for
> map_partition_varattnos(), whereas for updates, first WCO belongs to
> the first resultRelInfo which is not same as nominalRelation. So in
> the patch, in both cases, we use the first resultRelInfo and the WCO
> of the first resultRelInfo for map_partition_varattnos().
>
> Similar thing is done for Returning expressions.
>
> -
>
> Another change in the patch is : for ExecInitQual() for WCO quals,
> mtstate->ps is used as parent, rather than first plan. For updates,
> first plan does not belong to the parent partition. In fact, I think
> in all cases, we should use mtstate->ps as the parent.
> mtstate->mt_plans[0] don't look like they should be considered parent
> of these expressions. May be it does not matter to which parent we
> link these quals, because there is no ReScan for ExecModifyTable().
>
> Note that for RETURNING projection expressions, we do use mtstate->ps.
>
> 
>
> There is another issue I discovered. The row-movement works fine if
> the destination leaf partition has different attribute ordering than
> the root : the existing insert-tuple-routing mapping handles that. But
> if the source partition has different ordering w.r.t. the root, it has
> a problem : there is no mapping in the opposite direction, i.e. from
> the leaf to root. And we require that because the tuple of source leaf
> partition needs to be 

Re: [HACKERS] UPDATE of partition key

2017-06-15 Thread Amit Khandekar
On 13 June 2017 at 15:40, Amit Khandekar  wrote:
> While rebasing my patch for the below recent commit, I realized that a
> similar issue exists for the update-tuple-routing patch as well:
>
> commit 78a030a441966d91bc7e932ef84da39c3ea7d970
> Author: Tom Lane 
> Date:   Mon Jun 12 23:29:44 2017 -0400
>
> Fix confusion about number of subplans in partitioned INSERT setup.
>
> The above issue was about incorrectly using 'i' in
> mtstate->mt_plans[i] during handling WITH CHECK OPTIONS in
> ExecInitModifyTable(), where 'i' was actually meant to refer to the
> positions in mtstate->mt_num_partitions. Actually for INSERT, there is
> only a single plan element in mtstate->mt_plans[] array.
>
> Similarly, for update-tuple routing, we cannot use
> mtstate->mt_plans[i], because 'i' refers to position in
> mtstate->mt_partitions[] , whereas mtstate->mt_plans is not at all in
> order of mtstate->mt_partitions; in fact mt_plans has only the plans
> that are to be scanned on pruned partitions; so it can well be a small
> subset of total partitions.
>
> I am working on an updated patch to fix the above.

Attached patch v10 fixes the above. The existing code builds WCO
constraints for each leaf partition; with the patch, that code is now
applicable to row-movement updates as well. So the assertions in the
code are now updated to allow this. Secondly, the mapping for each of
the leaf partitions was constructed using the root partition attributes.
Now in the patch, mtstate->resultRelInfo[0] (i.e. the first
resultRelInfo) is used as the reference. So effectively,
map_partition_varattnos() now represents not just a parent-to-partition
mapping, but rather a mapping between any two partitions/partitioned
tables. It's done this way so that we can have common WCO-building code
for inserts as well as updates. For example, for inserts the first (and
only) WCO belongs to node->nominalRelation, so nominalRelation is used
for map_partition_varattnos(), whereas for updates the first WCO belongs
to the first resultRelInfo, which is not the same as nominalRelation. So
in the patch, in both cases, we use the first resultRelInfo and the WCO
of the first resultRelInfo for map_partition_varattnos().

A similar thing is done for the RETURNING expressions.

-

Another change in the patch is: for ExecInitQual() for the WCO quals,
mtstate->ps is used as the parent, rather than the first plan. For
updates, the first plan does not belong to the parent partition. In
fact, I think in all cases we should use mtstate->ps as the parent;
mtstate->mt_plans[0] does not look like it should be considered the
parent of these expressions. Maybe it does not matter to which parent
we link these quals, because there is no ReScan for ExecModifyTable().

Note that for RETURNING projection expressions, we do use mtstate->ps.



There is another issue I discovered. The row-movement works fine if
the destination leaf partition has different attribute ordering than
the root : the existing insert-tuple-routing mapping handles that. But
if the source partition has different ordering w.r.t. the root, it has
a problem : there is no mapping in the opposite direction, i.e. from
the leaf to root. And we require that because the tuple of source leaf
partition needs to be converted to root partition tuple descriptor,
since ExecFindPartition() starts with root.

To fix this, I have introduced another mapping array
mtstate->mt_resultrel_maps[]. This corresponds to the
mtstate->resultRelInfo[]. We don't require a per-leaf-partition mapping,
because the update result relations are a pruned subset of the total
leaf partitions.

So in ExecInsert, before calling ExecFindPartition(), we need to
convert the leaf partition tuple to the root format using this reverse
mapping. Since we need to convert the tuple here, and again after
ExecFindPartition() for the found leaf partition, I have replaced the
common code with a new function, ConvertPartitionTupleSlot().
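
To give an idea of its shape (a simplified sketch only; the function in
the attached patch differs in details):

    /*
     * Convert 'tuple' using 'map', if any, and store the result in
     * 'new_slot', making that the slot to be used from here on.
     */
    static HeapTuple
    ConvertPartitionTupleSlot(TupleConversionMap *map, HeapTuple tuple,
                              TupleTableSlot *new_slot, TupleTableSlot **p_slot)
    {
        if (map == NULL)
            return tuple;

        tuple = do_convert_tuple(tuple, map);
        ExecSetSlotDescriptor(new_slot, map->outdesc);
        ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
        *p_slot = new_slot;

        return tuple;
    }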

---

Used a new flag is_partitionkey_update in ExecInitModifyTable(), which
can be re-used in subsequent sections, rather than calling the
IsPartitionKeyUpdate() function again.

---

Some more test scenarios added that cover the above changes; basically,
partitions that have tuple descriptors different from their parents'.


update-partition-key_v10.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-06-13 Thread Amit Khandekar
While rebasing my patch for the below recent commit, I realized that a
similar issue exists for the update-tuple-routing patch as well:

commit 78a030a441966d91bc7e932ef84da39c3ea7d970
Author: Tom Lane 
Date:   Mon Jun 12 23:29:44 2017 -0400

Fix confusion about number of subplans in partitioned INSERT setup.

The above issue was about incorrectly using 'i' in
mtstate->mt_plans[i] during the handling of WITH CHECK OPTIONS in
ExecInitModifyTable(), where 'i' was actually meant to refer to the
positions in mtstate->mt_partitions[]. Actually, for INSERT there is
only a single plan element in the mtstate->mt_plans[] array.

Similarly, for update-tuple routing, we cannot use
mtstate->mt_plans[i], because 'i' refers to a position in
mtstate->mt_partitions[], whereas mtstate->mt_plans is not at all in
the order of mtstate->mt_partitions; in fact mt_plans has only the plans
that are to be scanned for the partitions remaining after pruning, so it
can well be a small subset of the total partitions.

I am working on an updated patch to fix the above.




Re: [HACKERS] UPDATE of partition key

2017-06-09 Thread Amit Kapila
On Fri, Jun 9, 2017 at 7:48 PM, Amit Khandekar  wrote:
> On 9 June 2017 at 19:10, Amit Kapila  wrote:
>> On Thu, Jun 8, 2017 at 10:40 PM, Robert Haas  wrote:
>>> On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila  wrote:

 I think before doing above check we can simply check if ctid.ip_blkid
 contains InvalidBlockNumber, then return an error.
>>>
>>> Hmm, OK.  That case never happens today?
>>>
>>
>> As per my understanding that case doesn't exist.  I will verify again
>> once the patch is available.  I can take a crack at it if Amit
>> Khandekar is busy with something else or is not comfortable in this
>> area.
>
> Amit, I was going to have a look at this, once I finish with the other
> part.
>

Sure, will wait for your patch to be available.  I can help by
reviewing the same.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-06-09 Thread Amit Khandekar
On 9 June 2017 at 19:10, Amit Kapila  wrote:
> On Thu, Jun 8, 2017 at 10:40 PM, Robert Haas  wrote:
>> On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila  wrote:
>>> On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas  wrote:
 On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila  
 wrote:
> As far as I understand, it is to ensure that for deleted rows, nothing
> more needs to be done.  For example, see the below check in
> ExecUpdate/ExecDelete.
> if (!ItemPointerEquals(tupleid, ))
> {
> ..
> }
> ..
>
> Also a similar check in ExecLockRows.  Now for deleted rows, if the
> t_ctid wouldn't point to itself, then in the mentioned functions, we
> were not in a position to conclude that the row is deleted.

 Right, so we would have to find all such checks and change them to use
 some other method to conclude that the row is deleted.  What method
 would we use?
>>>
>>> I think before doing above check we can simply check if ctid.ip_blkid
>>> contains InvalidBlockNumber, then return an error.
>>
>> Hmm, OK.  That case never happens today?
>>
>
> As per my understanding that case doesn't exist.  I will verify again
> once the patch is available.  I can take a crack at it if Amit
> Khandekar is busy with something else or is not comfortable in this
> area.

Amit, I was going to have a look at this once I finish with the other
part. I have been busy getting that done first. But your comments/help
are always welcome.

>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-09 Thread Amit Kapila
On Thu, Jun 8, 2017 at 10:40 PM, Robert Haas  wrote:
> On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila  wrote:
>> On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas  wrote:
>>> On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila  wrote:
 As far as I understand, it is to ensure that for deleted rows, nothing
 more needs to be done.  For example, see the below check in
 ExecUpdate/ExecDelete.
 if (!ItemPointerEquals(tupleid, ))
 {
 ..
 }
 ..

 Also a similar check in ExecLockRows.  Now for deleted rows, if the
 t_ctid wouldn't point to itself, then in the mentioned functions, we
 were not in a position to conclude that the row is deleted.
>>>
>>> Right, so we would have to find all such checks and change them to use
>>> some other method to conclude that the row is deleted.  What method
>>> would we use?
>>
>> I think before doing above check we can simply check if ctid.ip_blkid
>> contains InvalidBlockNumber, then return an error.
>
> Hmm, OK.  That case never happens today?
>

As per my understanding that case doesn't exist.  I will verify again
once the patch is available.  I can take a crack at it if Amit
Khandekar is busy with something else or is not comfortable in this
area.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-06-08 Thread Amit Khandekar
On 7 June 2017 at 20:19, Amit Khandekar  wrote:
> On 7 June 2017 at 16:42, Amit Khandekar  wrote:
>> The column bitmap set returned by GetUpdatedColumns() refer to
>> attribute numbers w.r.t. to the root partition. And the
>> mstate->resultRelInfo[] have attnos w.r.t. to the leaf partitions. So
>> we need to do something similar to map_partition_varattnos() to change
>> the updated columns attnos to the leaf partitions
>
> I was wrong about this. Each of the mtstate->resultRelInfo[] has its
> own corresponding RangeTblEntry with its own updatedCols having attnos
> accordingly adjusted to refer its own table attributes. So we don't
> have to do the mapping; we need to get modifedCols separately for each
> of the ResultRelInfo, rather than the root relinfo.
>
>> and walk down the
>> partition constraint expressions to find if the attnos are present
>> there.
>
> But this we will need to do.

Attached is the v9 patch. This covers the two parts discussed upthread:
1. Prevent triggers from causing the row movement.
2. Set up the tuple routing in ExecInitModifyTable(), but only if a
partition key is modified. Check the new function IsPartitionKeyUpdate().

I have rebased the patch to consider the changes done in commit
15ce775faa428dc9 to prevent triggers from violating partition
constraints. There, for the call to ExecFindPartition() in ExecInsert,
we need to fetch mtstate->rootResultRelInfo in case the operation
is part of update row movement. This is because the root partition is
not available in the resultRelInfo for UPDATE.
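
Roughly, the relevant spot in ExecInsert() now looks like this (only a
sketch; variable names differ a bit in the patch):

    /*
     * During update row movement the resultRelInfo at hand is a leaf
     * partition, so use the separately stored root ResultRelInfo for
     * finding the destination partition.
     */
    ResultRelInfo *rootRelInfo = resultRelInfo;

    if (mtstate->rootResultRelInfo != NULL)
        rootRelInfo = mtstate->rootResultRelInfo;

    leaf_part_index = ExecFindPartition(rootRelInfo,
                                        mtstate->mt_partition_dispatch_info,
                                        slot, estate);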


Added many more test scenarios in update.sql that cover the above.

I am yet to test the concurrency part using isolation tester.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


update-partition-key_v9.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-06-08 Thread Robert Haas
On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila  wrote:
> On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas  wrote:
>> On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila  wrote:
>>> As far as I understand, it is to ensure that for deleted rows, nothing
>>> more needs to be done.  For example, see the below check in
>>> ExecUpdate/ExecDelete.
>>> if (!ItemPointerEquals(tupleid, ))
>>> {
>>> ..
>>> }
>>> ..
>>>
>>> Also a similar check in ExecLockRows.  Now for deleted rows, if the
>>> t_ctid wouldn't point to itself, then in the mentioned functions, we
>>> were not in a position to conclude that the row is deleted.
>>
>> Right, so we would have to find all such checks and change them to use
>> some other method to conclude that the row is deleted.  What method
>> would we use?
>
> I think before doing above check we can simply check if ctid.ip_blkid
> contains InvalidBlockNumber, then return an error.

Hmm, OK.  That case never happens today?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UPDATE of partition key

2017-06-08 Thread Amit Kapila
On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas  wrote:
> On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila  wrote:
>> As far as I understand, it is to ensure that for deleted rows, nothing
>> more needs to be done.  For example, see the below check in
>> ExecUpdate/ExecDelete.
>> if (!ItemPointerEquals(tupleid, ))
>> {
>> ..
>> }
>> ..
>>
>> Also a similar check in ExecLockRows.  Now for deleted rows, if the
>> t_ctid wouldn't point to itself, then in the mentioned functions, we
>> were not in a position to conclude that the row is deleted.
>
> Right, so we would have to find all such checks and change them to use
> some other method to conclude that the row is deleted.  What method
> would we use?
>

I think before doing the above check, we can simply check whether
ctid.ip_blkid contains InvalidBlockNumber and, if so, return an error.
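
Something along these lines, I imagine (just a sketch of the idea, not
tested; hufd is ExecDelete's existing HeapUpdateFailureData variable):

    /*
     * If a previous UPDATE moved the row to another partition, the old
     * tuple's ctid block number would have been set to InvalidBlockNumber,
     * so detect that before the existing ItemPointerEquals() check and
     * error out instead of silently treating the row as deleted.
     */
    if (!BlockNumberIsValid(BlockIdGetBlockNumber(&(hufd.ctid.ip_blkid))))
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("tuple to be updated was already moved to another partition")));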

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-06-07 Thread Robert Haas
On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila  wrote:
> As far as I understand, it is to ensure that for deleted rows, nothing
> more needs to be done.  For example, see the below check in
> ExecUpdate/ExecDelete.
> if (!ItemPointerEquals(tupleid, ))
> {
> ..
> }
> ..
>
> Also a similar check in ExecLockRows.  Now for deleted rows, if the
> t_ctid wouldn't point to itself, then in the mentioned functions, we
> were not in a position to conclude that the row is deleted.

Right, so we would have to find all such checks and change them to use
some other method to conclude that the row is deleted.  What method
would we use?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UPDATE of partition key

2017-06-07 Thread Amit Khandekar
On 7 June 2017 at 16:42, Amit Khandekar  wrote:
> The column bitmap set returned by GetUpdatedColumns() refer to
> attribute numbers w.r.t. to the root partition. And the
> mstate->resultRelInfo[] have attnos w.r.t. to the leaf partitions. So
> we need to do something similar to map_partition_varattnos() to change
> the updated columns attnos to the leaf partitions

I was wrong about this. Each of the mtstate->resultRelInfo[] entries has
its own corresponding RangeTblEntry, with its own updatedCols whose
attnos are already adjusted to refer to its own table's attributes. So we
don't have to do the mapping; we just need to get the updated columns
separately for each ResultRelInfo, rather than from the root relinfo.

> and walk down the
> partition constraint expressions to find if the attnos are present
> there.

But this we will need to do.
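
Putting the two together, the check I have in mind is roughly the
following (an illustrative sketch only; names and details will differ in
the actual patch):

    /*
     * Return true if any column updated in some (unpruned) result relation
     * is referenced by that relation's partition constraint, in which case
     * the UPDATE may move rows and tuple routing must be set up.
     */
    static bool
    IsPartitionKeyUpdate(ModifyTableState *mtstate, EState *estate, int nplans)
    {
        int     i;

        for (i = 0; i < nplans; i++)
        {
            ResultRelInfo *rri = &mtstate->resultRelInfo[i];
            List       *partqual = RelationGetPartitionQual(rri->ri_RelationDesc);
            Bitmapset  *partattrs = NULL;

            /* Attnos used anywhere in this relation's partition constraint. */
            pull_varattnos((Node *) partqual, 1, &partattrs);

            /*
             * Both bitmapsets carry attnos offset by
             * FirstLowInvalidHeapAttributeNumber, so they can be compared
             * directly.
             */
            if (bms_overlap(GetUpdatedColumns(rri, estate), partattrs))
                return true;
        }

        return false;
    }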

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-07 Thread Amit Khandekar
On 6 June 2017 at 23:52, Robert Haas  wrote:
> On Fri, Jun 2, 2017 at 7:07 AM, Amit Khandekar  wrote:
>> So, according to that, below would be the logic :
>>
>> Run partition constraint check on the original NEW row.
>> If it succeeds :
>> {
>> Fire BR UPDATE trigger on the original partition.
>> Run partition constraint check again with the modified NEW row
>> (may be do this only if the trigger modified the partition key)
>> If it fails,
>> abort.
>> Else
>> proceed with the usual local update.
>> }
>> else
>> {
>> Fire BR UPDATE trigger on original partition.
>> Find the right partition for the modified NEW row.
>> If it is the same partition,
>> proceed with the usual local update.
>> else
>> do the row movement.
>> }
>
> Sure, that sounds about right, although the "Fire BR UPDATE trigger on
> the original partition." is the same in both branches, so I'm not
> quite sure why you have that in the "if" block.

Actually, after coding this logic, it looks a bit different. See
ExecUpdate() in the attached file trigger_related_changes.patch.



Now that we are making sure a trigger won't change the partition of the
tuple, the next thing we need to do is make sure the tuple routing setup
is done *only* if the UPDATE is modifying partition keys. Otherwise,
this will degrade normal update performance.

Below is the logic I am implementing for determining whether the
UPDATE is modifying partition keys.

In ExecInitModifyTable() ...
Call GetUpdatedColumns(mtstate->rootResultRelInfo, estate) to get
updated_columns.
For each of the updated_columns :
{
Check if the column is part of partition key quals of any of
the relations in mtstate->resultRelInfo[] array.
/*
 * mtstate->resultRelInfo[] contains exactly those leaf partitions
 * which qualify the update quals.
 */

If (it is part of partition key quals of at least one of the relations)
{
   Do ExecSetupPartitionTupleRouting() for the root partition.
   break;
}
}

A few things need to be considered:

Use Relation->rd_partcheck to get the partition check quals of each of
the relations in mtstate->resultRelInfo[].

The Relation->rd_partcheck of the leaf partitions would include the
ancestors' partition quals as well. So we are good: we don't have to
explicitly get the upper partition constraints. Note that an UPDATE
can modify a column which is not used in the partition constraint
expressions of any of the partitions or partitioned tables in the
subtree, but that column may have been used in the partition constraint
of a partitioned table belonging to the upper subtree.

All of the relations in mtstate->resultRelInfo are already open. So we
don't need to re-open any more relations to get the partition quals.

The column bitmap set returned by GetUpdatedColumns() refers to
attribute numbers w.r.t. the root partition. And the
mtstate->resultRelInfo[] relations have attnos w.r.t. the leaf
partitions. So we need to do something similar to
map_partition_varattnos() to map the updated columns' attnos to the leaf
partitions, and walk down the partition constraint expressions to find
out whether the attnos are present there.


Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


trigger_related_changes.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-06-07 Thread Amit Kapila
On Tue, Jun 6, 2017 at 11:54 PM, Robert Haas  wrote:
> On Mon, Jun 5, 2017 at 2:51 AM, Amit Kapila  wrote:
>>> Greg/Amit's idea of using the CTID field rather than an infomask bit
>>> seems like a possibly promising approach.  Not everything that needs
>>> bit-space can use the CTID field, so using it is a little less likely
>>> to conflict with something else we want to do in the future than using
>>> a precious infomask bit.  However, I'm worried about this:
>>>
>>> /* Make sure there is no forward chain link in t_ctid */
>>> tp.t_data->t_ctid = tp.t_self;
>>>
>>> The comment does not say *why* we need to make sure that there is no
>>> forward chain link, but it implies that some code somewhere in the
>>> system does or at one time did depend on no forward link existing.
>>
>> I think it is to ensure that EvalPlanQual mechanism gets invoked in
>> the right case.   The visibility routine will return HeapTupleUpdated
>> both when the tuple is deleted or updated (updated - has a newer
>> version of the tuple), so we use ctid to decide if we need to follow
>> the tuple chain for a newer version of the tuple.
>
> That would explain why need to make sure that there *is* a forward
> chain link in t_ctid for an update, but it doesn't explain why we need
> to make sure that there *isn't* a forward link for delete.
>

As far as I understand, it is to ensure that for deleted rows, nothing
more needs to be done.  For example, see the below check in
ExecUpdate/ExecDelete.
if (!ItemPointerEquals(tupleid, ))
{
..
}
..

Also a similar check in ExecLockRows.  Now for deleted rows, if the
t_ctid wouldn't point to itself, then in the mentioned functions, we
were not in a position to conclude that the row is deleted.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-06-06 Thread Robert Haas
On Mon, Jun 5, 2017 at 2:51 AM, Amit Kapila  wrote:
>> Greg/Amit's idea of using the CTID field rather than an infomask bit
>> seems like a possibly promising approach.  Not everything that needs
>> bit-space can use the CTID field, so using it is a little less likely
>> to conflict with something else we want to do in the future than using
>> a precious infomask bit.  However, I'm worried about this:
>>
>> /* Make sure there is no forward chain link in t_ctid */
>> tp.t_data->t_ctid = tp.t_self;
>>
>> The comment does not say *why* we need to make sure that there is no
>> forward chain link, but it implies that some code somewhere in the
>> system does or at one time did depend on no forward link existing.
>
> I think it is to ensure that EvalPlanQual mechanism gets invoked in
> the right case.   The visibility routine will return HeapTupleUpdated
> both when the tuple is deleted or updated (updated - has a newer
> version of the tuple), so we use ctid to decide if we need to follow
> the tuple chain for a newer version of the tuple.

That would explain why need to make sure that there *is* a forward
chain link in t_ctid for an update, but it doesn't explain why we need
to make sure that there *isn't* a forward link for delete.
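
(For comparison, the two cases in heapam.c: heap_update() makes the old
tuple's t_ctid point at the new version, while heap_delete() points it
at itself -- roughly these two existing lines:)

    /* heap_update(): link the old version to the new one */
    oldtup.t_data->t_ctid = heaptup->t_self;

    /* heap_delete(): make sure there is no forward chain link in t_ctid */
    tp.t_data->t_ctid = tp.t_self;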

> The proposed change in WARM tuple patch uses ip_posid field of CTID
> and we are planning to use ip_blkid field.  Here is the relevant text
> and code from WARM tuple patch:
>
> "Store the root line pointer of the WARM chain in the t_ctid.ip_posid
> field of the last tuple in the chain and mark the tuple header with
> HEAP_TUPLE_LATEST flag to record that fact."
>
> +#define HeapTupleHeaderSetHeapLatest(tup, offnum) \
> +do { \
> + AssertMacro(OffsetNumberIsValid(offnum)); \
> + (tup)->t_infomask2 |= HEAP_LATEST_TUPLE; \
> + ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offnum)); \
> +} while (0)
>
> For further details, refer patch 0001-Track-root-line-pointer-v23_v26
> in the below e-mail:
> https://www.postgresql.org/message-id/CABOikdOTstHK2y0rDk%2BY3Wx9HRe%2BbZtj3zuYGU%3DVngneiHo5KQ%40mail.gmail.com

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UPDATE of partition key

2017-06-06 Thread Robert Haas
On Fri, Jun 2, 2017 at 7:07 AM, Amit Khandekar  wrote:
> So, according to that, below would be the logic :
>
> Run partition constraint check on the original NEW row.
> If it succeeds :
> {
> Fire BR UPDATE trigger on the original partition.
> Run partition constraint check again with the modified NEW row
> (may be do this only if the trigger modified the partition key)
> If it fails,
> abort.
> Else
> proceed with the usual local update.
> }
> else
> {
> Fire BR UPDATE trigger on original partition.
> Find the right partition for the modified NEW row.
> If it is the same partition,
> proceed with the usual local update.
> else
> do the row movement.
> }

Sure, that sounds about right, although the "Fire BR UPDATE trigger on
the original partition." is the same in both branches, so I'm not
quite sure why you have that in the "if" block.

>> Actually, it seems like that's probably the
>> *easiest* behavior to implement.  Otherwise, you might fire triggers,
>> discover that you need to re-route the tuple, and then ... fire
>> triggers again on the new partition, which might reroute it again?
>
> Why would update BR trigger fire on the new partition ? On the new
> partition, only BR INSERT trigger would fire if at all we decide to
> fire delete+insert triggers. And insert trigger would not again cause
> the tuple to be re-routed because it's an insert.

OK, sure, that makes sense.  I guess it's really the insert case that
I was worried about -- if we have a BEFORE ROW INSERT trigger and it
changes the tuple and we reroute it, I think we'd have to fire the
BEFORE ROW INSERT trigger on the new partition, which might change the
tuple again and cause yet another reroute, and in the worst case this is
an infinite loop.  But it sounds like we're going to fix that problem --
I think correctly -- by only ever allowing the tuple to be routed
once.  If some trigger tries to change the tuple after that such that
re-routing would be required, it gets an error.  And what you are
describing here seems like it will be fine.

> But now I think you are saying, the row that is being inserted into
> the new partition might get again modified by the INSERT trigger on
> the new partition, which might in turn cause it to fail the new
> partition constraint. But in that case, it will not cause another row
> movement, because in the new partition, it's an INSERT, not an UPDATE,
> so the operation would end there, aborted.

Yeah, that's what I was worried about.  I didn't want a row movement
to be able to trigger another row movement and so on ad infinitum.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UPDATE of partition key

2017-06-05 Thread Amit Khandekar
On 5 June 2017 at 11:27, Amit Kapila  wrote:
> On Fri, Jun 2, 2017 at 4:37 PM, Amit Khandekar  wrote:
>> On 2 June 2017 at 01:17, Robert Haas  wrote:
>>> On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar  
>>> wrote:
> Regarding the trigger issue, I can't claim to have a terribly strong
> opinion on this.  I think that practically anything we do here might
> upset somebody, but probably any halfway-reasonable thing we choose to
> do will be OK for most people.  However, there seems to be a
> discrepancy between the approach that got the most votes and the one
> that is implemented by the v8 patch, so that seems like something to
> fix.

 Yes, I have started working on updating the patch to use that approach
 (BR and AR update triggers on source and destination partition
 respectively, instead of delete+insert) The approach taken by the
 patch (BR update + delete+insert triggers) didn't require any changes
 in the way ExecDelete() and ExecInsert() were called. Now we would
 require to skip the delete/insert triggers, so some flags need to be
 passed to these functions,

>
> I thought you already need to pass an additional flag for special
> handling of ctid in Delete case.

Yeah that was unavoidable.

> For Insert, a new flag needs to be
> passed and need to have a check for that in few places.

For skipping the delete and insert triggers, we need to include yet
another flag, plus checks in both ExecDelete() and ExecInsert() for
skipping both the BR and AR triggers, and then, in ExecUpdate(), an
additional call to ExecARUpdateTriggers() before quitting.

>
>> or else have stripped down versions of
 ExecDelete() and ExecInsert() which don't do other things like
 RETURNING handling and firing triggers.
>>>
>>> See, that strikes me as a pretty good argument for firing the
>>> DELETE+INSERT triggers...
>>>
>>> I'm not wedded to that approach, but "what makes the code simplest?"
>>> is not a bad tiebreak, other things being equal.
>>
>> Yes, that sounds good to me.
>>
>
> I am okay if we want to go ahead with firing BR UPDATE + DELETE +
> INSERT triggers for an Update statement (when row movement happens) on
> the argument of code simplicity, but it sounds slightly odd behavior.

OK. I will keep this behaviour that is already present in the patch. I
also feel that code simplicity can be used as a tie-breaker if we cannot
agree upon a single behaviour that completely satisfies all aspects.

>
>> But I think we want to wait for other's
>> opinion because it is quite understandable that two triggers firing on
>> the same partition sounds odd.
>>
>
> Yeah, but I think we have to rely on docs in this case as behavior is
> not intuitive.

Agreed. The doc changes in the patch have already explained this
behaviour in detail.

>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



