Re: [HACKERS] UPDATE of partition key

2018-01-24 Thread Thomas Munro
On Thu, Jan 25, 2018 at 10:39 AM, Robert Haas  wrote:
> On Mon, Jan 22, 2018 at 9:46 AM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> Tom, do you want to double-check that this fixes it for you?
>>
>> I can confirm that a valgrind run succeeded for me with the patch
>> in place.
>
> Committed.  Sorry for the delay.

FYI I'm planning to look into adding a valgrind check to the
commitfest CI thing I run so we can catch these earlier without
committer involvement.  It's super slow because of all those pesky
regression tests so I'll probably need to improve the scheduling logic
a bit to make it useful first (prioritising new patches or something,
since otherwise it'll take up to multiple days to get around to
valgrind-testing any given patch...).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] UPDATE of partition key

2018-01-24 Thread Robert Haas
On Mon, Jan 22, 2018 at 9:46 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> Tom, do you want to double-check that this fixes it for you?
>
> I can confirm that a valgrind run succeeded for me with the patch
> in place.

Committed.  Sorry for the delay.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

2018-01-22 Thread Tom Lane
Robert Haas  writes:
> Tom, do you want to double-check that this fixes it for you?

I can confirm that a valgrind run succeeded for me with the patch
in place.

regards, tom lane



Re: [HACKERS] UPDATE of partition key

2018-01-22 Thread Robert Haas
On Mon, Jan 22, 2018 at 2:44 AM, Amit Khandekar  wrote:
> Yes, right, that's what is happening. It is not happening on an Assert
> though (there is no assert in that function). It is happening when we
> try to access the array here:
>
> if (proute->subplan_partition_offsets &&
> proute->subplan_partition_offsets[subplan_index] == i)
>
> Attached is a fix, where I have introduced another field
> PartitionTupleRouting.num_subplan_partition_offsets, so that above,
> we can add another condition (subplan_index <
> proute->num_subplan_partition_offsets) in order to stop accessing the
> array once we are done with all the offset array elements.
>
> Ran the update.sql test with valgrind enabled on my laptop, and the
> valgrind output now does not show errors.

Tom, do you want to double-check that this fixes it for you?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

2018-01-21 Thread Amit Khandekar
On 22 January 2018 at 02:40, Robert Haas  wrote:
> On Sun, Jan 21, 2018 at 1:43 AM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> Committed with a bunch of mostly-cosmetic revisions.
>>
>> Buildfarm member skink has been unhappy since this patch went in.
>> Running the regression tests under valgrind easily reproduces the
>> failure.  Now, I might be wrong about which of the patches committed
>> on Friday caused the unhappiness, but the valgrind backtrace sure
>> looks like it's to do with partition routing:
>
> Yeah, that must be the fault of this patch.  We assign to
> proute->subplan_partition_offsets[update_rri_index] from
> update_rri_index = 0 .. num_update_rri, and there's an Assert() at the
> bottom of this function that checks this, so probably this is indexing
> off the end of the array.  I bet the issue happens when we find all of
> the UPDATE result rels while there are still partitions left; then,
> subplan_index will be equal to the length of the
> proute->subplan_partition_offsets array and we'll be indexing just off
> the end.

Yes, right, that's what is happening. It is not happening on an Assert
though (there is no assert in that function). It is happening when we
try to access the array here:

if (proute->subplan_partition_offsets &&
proute->subplan_partition_offsets[subplan_index] == i)

Attached is a fix, where I have introduced another field
PartitionTupleRouting.num_subplan_partition_offsets, so that above,
we can add another condition (subplan_index <
proute->num_subplan_partition_offsets) in order to stop accessing the
array once we are done with all the offset array elements.

Ran the update.sql test with valgrind enabled on my laptop, and the
valgrind output now does not show errors.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


fix_valgrind_issue.patch
Description: Binary data


Re: [HACKERS] UPDATE of partition key

2018-01-21 Thread Robert Haas
On Sun, Jan 21, 2018 at 1:43 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> Committed with a bunch of mostly-cosmetic revisions.
>
> Buildfarm member skink has been unhappy since this patch went in.
> Running the regression tests under valgrind easily reproduces the
> failure.  Now, I might be wrong about which of the patches committed
> on Friday caused the unhappiness, but the valgrind backtrace sure
> looks like it's to do with partition routing:

Yeah, that must be the fault of this patch.  We assign to
proute->subplan_partition_offsets[update_rri_index] from
update_rri_index = 0 .. num_update_rri, and there's an Assert() at the
bottom of this function that checks this, so probably this is indexing
off the end of the array.  I bet the issue happens when we find all of
the UPDATE result rels while there are still partitions left; then,
subplan_index will be equal to the length of the
proute->subplan_partition_offsets array and we'll be indexing just off
the end.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

2018-01-20 Thread Tom Lane
Robert Haas  writes:
> Committed with a bunch of mostly-cosmetic revisions.

Buildfarm member skink has been unhappy since this patch went in.
Running the regression tests under valgrind easily reproduces the
failure.  Now, I might be wrong about which of the patches committed
on Friday caused the unhappiness, but the valgrind backtrace sure
looks like it's to do with partition routing:

==00:00:05:49.683 17549== Invalid read of size 4
==00:00:05:49.683 17549==at 0x62A8BA: ExecCleanupTupleRouting (execPartition.c:483)
==00:00:05:49.683 17549==by 0x6483AA: ExecEndModifyTable (nodeModifyTable.c:2682)
==00:00:05:49.683 17549==by 0x627139: standard_ExecutorEnd (execMain.c:1604)
==00:00:05:49.683 17549==by 0x7780AF: ProcessQuery (pquery.c:206)
==00:00:05:49.683 17549==by 0x7782E4: PortalRunMulti (pquery.c:1286)
==00:00:05:49.683 17549==by 0x778AAF: PortalRun (pquery.c:799)
==00:00:05:49.683 17549==by 0x774E4C: exec_simple_query (postgres.c:1120)
==00:00:05:49.683 17549==by 0x776C17: PostgresMain (postgres.c:4143)
==00:00:05:49.683 17549==by 0x6FA419: PostmasterMain (postmaster.c:4412)
==00:00:05:49.683 17549==by 0x66E51F: main (main.c:228)
==00:00:05:49.683 17549==  Address 0xe25e298 is 2,088 bytes inside a block of size 32,768 alloc'd
==00:00:05:49.683 17549==at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==00:00:05:49.683 17549==by 0x89EB15: AllocSetAlloc (aset.c:945)
==00:00:05:49.683 17549==by 0x8A7577: palloc (mcxt.c:848)
==00:00:05:49.683 17549==by 0x671969: new_list (list.c:68)
==00:00:05:49.683 17549==by 0x672859: lappend_oid (list.c:169)
==00:00:05:49.683 17549==by 0x55330E: find_inheritance_children (pg_inherits.c:144)
==00:00:05:49.683 17549==by 0x553447: find_all_inheritors (pg_inherits.c:203)
==00:00:05:49.683 17549==by 0x62AC76: ExecSetupPartitionTupleRouting (execPartition.c:68)
==00:00:05:49.683 17549==by 0x64949D: ExecInitModifyTable (nodeModifyTable.c:2232)
==00:00:05:49.683 17549==by 0x62BBE8: ExecInitNode (execProcnode.c:174)
==00:00:05:49.683 17549==by 0x627B53: standard_ExecutorStart (execMain.c:1043)
==00:00:05:49.683 17549==by 0x778046: ProcessQuery (pquery.c:156)

(This is my local result, but skink's log looks about the same.)

regards, tom lane



Re: [HACKERS] UPDATE of partition key

2018-01-19 Thread Robert Haas
On Fri, Jan 19, 2018 at 4:37 AM, Amit Khandekar  wrote:
> Attached rebased patch.

Committed with a bunch of mostly-cosmetic revisions.  I removed the
macro you added, which has a multiple evaluation hazard, and just put
that logic back into the function.  I don't think it's likely to
matter for performance, and this way is safer.  I removed an inline
keyword from another static function as well; better to let the
compiler decide what to do.  I rearranged a few things to shorten some
long lines, too.  Aside from that I think all of the changes I made
were to comments and documentation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

2018-01-16 Thread Amit Khandekar
On 16 January 2018 at 09:17, David Rowley  wrote:
> On 16 January 2018 at 01:09, Robert Haas  wrote:
>> On Sun, Jan 14, 2018 at 6:57 AM, Amit Khandekar  
>> wrote:
>>> Even where partitions are present, in the usual case where there are
>>> no transition tables we won't require per-leaf map at all.
>>>
>>> Instead of a bool array, we can even make it a Bitmapset. But I think
>>> access would become slower as compared to array, particularly because
>>> it is going to be a heavily used function.
>>
>> It probably makes little difference -- the Bitmapset will be more
>> compact (which saves time) but involve function calls (which cost
>> time).
>
> I'm not arguing in either direction, but you'd also want to factor in
> how Bitmapsets only allocate words for the maximum stored member,
> which might mean multiple realloc() calls resulting in palloc/memcpy
> calls. The array would just be allocated in a single chunk, although
> it would be more memory and would require a memset too, however,
> that's likely much cheaper than the palloc() anyway.

Right, I agree. With a Bitmapset there would also be a function call
just to know whether a given map is required or not. Overall, I think
that because the data structure will be used heavily whenever it is set
up, it's better to make it an array. In the latest patch, I have
retained it as an array.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

2018-01-15 Thread David Rowley
On 16 January 2018 at 01:09, Robert Haas  wrote:
> On Sun, Jan 14, 2018 at 6:57 AM, Amit Khandekar  
> wrote:
>> Even where partitions are present, in the usual case where there are
>> no transition tables we won't require per-leaf map at all.
>>
>> Instead of a bool array, we can even make it a Bitmapset. But I think
>> access would become slower as compared to array, particularly because
>> it is going to be a heavily used function.
>
> It probably makes little difference -- the Bitmapset will be more
> compact (which saves time) but involve function calls (which cost
> time).

I'm not arguing in either direction, but you'd also want to factor in
how Bitmapsets only allocate words for the maximum stored member,
which might mean multiple realloc() calls resulting in palloc/memcpy
calls. The array would just be allocated in a single chunk, although
it would be more memory and would require a memset too, however,
that's likely much cheaper than the palloc() anyway.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] UPDATE of partition key

2018-01-14 Thread Amit Khandekar
On 13 January 2018 at 02:56, Robert Haas  wrote:
> On Fri, Jan 12, 2018 at 12:23 PM, Amit Khandekar  
> wrote:
>>> (1) if they need it by subplan index, first use
>>> subplan_partition_offsets to convert it to a per-leaf index
>>
>> Before that, we need to check if there *is* an offset array. If there
>> are no partitions, there is only going to be a per-subplan array,
>> there won't be an offsets array. But I guess you are saying: "do the
>> on-demand allocation only for leaf partitions; if there are no
>> partitions, the per-subplan maps will always be allocated for each of
>> the subplans from the beginning".  So if there is no offset array,
>> just return mtstate->mt_per_subplan_tupconv_maps[subplan_index]
>> without any further checks.
>
> Oops.  I forgot that there might not be partitions.  I was assuming
> that mt_per_subplan_tupconv_maps wouldn't exist at all, and we'd
> always use subplan_partition_offsets.  But that won't work in the
> inheritance case.
>
>> But after that, I am not sure why the mt_per_sub_plan_maps[] array is
>> needed. We are always going to convert the subplan index into a leaf
>> index, so the per-subplan map array will not come into the picture. Or
>> are you saying it will be allocated and used only when there are no
>> partitions?  From one of your earlier replies, you did mention trying
>> to share the maps between the two arrays, which means you were
>> considering both arrays being used at the same time.
>
> We'd use them both at the same time if we didn't have, or didn't use,
> subplan_partition_offsets, but if we have subplan_partition_offsets
> and can use it then we don't need mt_per_sub_plan_maps.
>
> I guess I'm inclined to keep mt_per_sub_plan_maps for the case where
> there are no partitions, but not use it when partitions are present.
> What do you think about that?

Even where partitions are present, in the usual case where there are
no transition tables we won't require per-leaf map at all [1]. So I
think we should keep mt_per_sub_plan_maps only for the case where
per-leaf map is not allocated. And we will not allocate
mt_per_sub_plan_maps when mt_per_leaf_maps is needed. In other words,
exactly one of the two maps will be allocated.

This is turning out to be close to what's already there in the last
patch versions: use a single map array, and an offsets array. The
difference is: in the patch I am using the *same* variable for the
two maps, whereas now we are talking about two different array
variables for maps, but only allocating one of them.

Are you ok with this? I think the thing you were against was having
a common *variable* for two purposes. But above, I am saying we have
two variables but assign a map array to only *one* of them and leave
the other unused.

-

Regarding the on-demand map allocation:
where mt_per_sub_plan_maps is allocated, we won't have the on-demand
allocation: all the maps will be allocated initially. The reason is
that the map_is_required array is only per-leaf; otherwise, again, we
would need to keep another map_is_required array for per-subplan. Maybe
we can support the on-demand stuff for subplan maps also, but only as a
separate change after we are done with update-partition-key.


-

Regarding mt_per_leaf_tupconv_required, I am thinking we can make it a
bool array and name it mt_per_leaf_map_not_required. When it is
true for a given index, it means we have already called
convert_tuples_by_name() and it returned NULL; i.e. we are
sure that the map is not required. A false value means we need to call
convert_tuples_by_name() if the map is NULL, and then set
mt_per_leaf_map_not_required to (map == NULL).

Instead of a bool array, we can even make it a Bitmapset. But I think
access would become slower as compared to array, particularly because
it is going to be a heavily used function.

-

[1] - For update-tuple-routing, only per-subplan access is required;
- For transition tables, per-subplan access is required,
  and additionally per-leaf access is required when tuples are
  update-routed
- So if both update-tuple-routing and transition tables are
  required, both of the maps are needed.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

2018-01-12 Thread Robert Haas
On Fri, Jan 12, 2018 at 12:23 PM, Amit Khandekar  wrote:
>> (1) if they need it by subplan index, first use
>> subplan_partition_offsets to convert it to a per-leaf index
>
> Before that, we need to check if there *is* an offset array. If there
> are no partitions, there is only going to be a per-subplan array,
> there won't be an offsets array. But I guess you are saying: "do the
> on-demand allocation only for leaf partitions; if there are no
> partitions, the per-subplan maps will always be allocated for each of
> the subplans from the beginning".  So if there is no offset array,
> just return mtstate->mt_per_subplan_tupconv_maps[subplan_index]
> without any further checks.

Oops.  I forgot that there might not be partitions.  I was assuming
that mt_per_subplan_tupconv_maps wouldn't exist at all, and we'd
always use subplan_partition_offsets.  But that won't work in the
inheritance case.

> But after that, I am not sure why the mt_per_sub_plan_maps[] array is
> needed. We are always going to convert the subplan index into a leaf
> index, so the per-subplan map array will not come into the picture. Or
> are you saying it will be allocated and used only when there are no
> partitions?  From one of your earlier replies, you did mention trying
> to share the maps between the two arrays, which means you were
> considering both arrays being used at the same time.

We'd use them both at the same time if we didn't have, or didn't use,
subplan_partition_offsets, but if we have subplan_partition_offsets
and can use it then we don't need mt_per_sub_plan_maps.

I guess I'm inclined to keep mt_per_sub_plan_maps for the case where
there are no partitions, but not use it when partitions are present.
What do you think about that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

2018-01-12 Thread Amit Khandekar
On 12 January 2018 at 20:24, Robert Haas  wrote:
> On Fri, Jan 12, 2018 at 5:12 AM, Amit Khandekar  
> wrote:
>> The reason why I am having map_required field inside a structure along
>> with the map, as against a separate array, is so that we can do the
>> on-demand allocation for both per-leaf array and per-subplan array.
>
> Putting the map_required field inside the structure with the map makes
> it completely silly to do the 0/1/2 thing, because the whole structure
> is going to be on the same cache line anyway.  It won't save anything
> to access the flag instead of a pointer in the same struct.

I see. Got it.

>  Also,
> the uint8 will be followed by 7 bytes of padding, because the pointer
> that follows will need to begin on an 8-byte boundary (at least, on
> 64-bit machines), so this will use more memory.
>
> What I suggest is:
>
> #define MT_CONVERSION_REQUIRED_UNKNOWN0
> #define MT_CONVERSION_REQUIRED_YES1
> #define MT_CONVERSION_REQUIRED_NO  2
>
> In ModifyTableState:
>
> uint8 *mt_per_leaf_tupconv_required;
> TupleConversionMap **mt_per_leaf_tupconv_maps;
>
> In PartitionTupleRouting:
>
> int *subplan_partition_offsets;
>
> When you initialize the ModifyTableState, do this:
>
> mtstate->mt_per_leaf_tupconv_required = palloc0(sizeof(uint8) *
> numResultRelInfos);
> mtstate->mt_per_leaf_tupconv_maps = palloc0(sizeof(TupleConversionMap
> *) * numResultRelInfos);
>

A few points below where I wanted to confirm that we are on the same page ...

> When somebody needs a map, then
>
> (1) if they need it by subplan index, first use
> subplan_partition_offsets to convert it to a per-leaf index

Before that, we need to check if there *is* an offset array. If there
are no partitions, there is only going to be a per-subplan array,
there won't be an offsets array. But I guess you are saying: "do the
on-demand allocation only for leaf partitions; if there are no
partitions, the per-subplan maps will always be allocated for each of
the subplans from the beginning".  So if there is no offset array,
just return mtstate->mt_per_subplan_tupconv_maps[subplan_index]
without any further checks.

>
> (2) then write a function that takes the per-leaf index and does this:
>
> switch (mtstate->mt_per_leaf_tupconv_required[leaf_part_index])
> {
> case MT_CONVERSION_REQUIRED_UNKNOWN:
> map = convert_tuples_by_name(...);
> if (map == NULL)
> mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
> MT_CONVERSION_REQUIRED_NO;
> else
> {
> mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
> MT_CONVERSION_REQUIRED_YES;
> mtstate->mt_per_leaf_tupconv_maps[leaf_part_index] = map;
> }
> return map;
> case MT_CONVERSION_REQUIRED_YES:
> return mtstate->mt_per_leaf_tupconv_maps[leaf_part_index];
> case MT_CONVERSION_REQUIRED_NO:
> return NULL;
> }

Yeah, right.

But after that, I am not sure why the mt_per_sub_plan_maps[] array is
needed. We are always going to convert the subplan index into a leaf
index, so the per-subplan map array will not come into the picture. Or
are you saying it will be allocated and used only when there are no
partitions?  From one of your earlier replies, you did mention trying
to share the maps between the two arrays, which means you were
considering both arrays being used at the same time.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

2018-01-12 Thread Robert Haas
On Fri, Jan 12, 2018 at 5:12 AM, Amit Khandekar  wrote:
> The reason why I am having map_required field inside a structure along
> with the map, as against a separate array, is so that we can do the
> on-demand allocation for both per-leaf array and per-subplan array.

Putting the map_required field inside the structure with the map makes
it completely silly to do the 0/1/2 thing, because the whole structure
is going to be on the same cache line anyway.  It won't save anything
to access the flag instead of a pointer in the same struct.   Also,
the uint8 will be followed by 7 bytes of padding, because the pointer
that follows will need to begin on an 8-byte boundary (at least, on
64-bit machines), so this will use more memory.

What I suggest is:

#define MT_CONVERSION_REQUIRED_UNKNOWN0
#define MT_CONVERSION_REQUIRED_YES1
#define MT_CONVERSION_REQUIRED_NO  2

In ModifyTableState:

uint8 *mt_per_leaf_tupconv_required;
TupleConversionMap **mt_per_leaf_tupconv_maps;

In PartitionTupleRouting:

int *subplan_partition_offsets;

When you initialize the ModifyTableState, do this:

mtstate->mt_per_leaf_tupconv_required = palloc0(sizeof(uint8) *
numResultRelInfos);
mtstate->mt_per_leaf_tupconv_maps = palloc0(sizeof(TupleConversionMap
*) * numResultRelInfos);

When somebody needs a map, then

(1) if they need it by subplan index, first use
subplan_partition_offsets to convert it to a per-leaf index

(2) then write a function that takes the per-leaf index and does this:

switch (mtstate->mt_per_leaf_tupconv_required[leaf_part_index])
{
case MT_CONVERSION_REQUIRED_UNKNOWN:
map = convert_tuples_by_name(...);
if (map == NULL)
mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
MT_CONVERSION_REQUIRED_NO;
else
{
mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
MT_CONVERSION_REQUIRED_YES;
mtstate->mt_per_leaf_tupconv_maps[leaf_part_index] = map;
}
return map;
case MT_CONVERSION_REQUIRED_YES:
return mtstate->mt_per_leaf_tupconv_maps[leaf_part_index];
case MT_CONVERSION_REQUIRED_NO:
return NULL;
}

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

2018-01-11 Thread Robert Haas
On Thu, Jan 11, 2018 at 6:07 AM, Amit Khandekar  wrote:
> In the first paragraph of my explanation, I was explaining why two
> Transition capture states does not look like a good idea to me :

Oh, sorry.  I didn't read what you wrote carefully enough, I guess.

I see your points.  I think that there is probably a general need for
some refactoring here.  AfterTriggerSaveEvent() got significantly more
complicated and harder to understand with the arrival of transition
tables, and this patch is adding more complexity still.  It's also
adding complexity in other places to make ExecInsert() and
ExecDelete() usable for the semi-internal DELETE/INSERT operations
being produced when we split a partition key update into a DELETE and
INSERT pair.  It would be awfully nice to have some better way to
separate out each of the different things we might or might not want
to do depending on the situation: capture old tuple, capture new
tuple, fire before triggers, fire after triggers, count processed
rows, set command tag, perform actual heap operation, update indexes,
etc.  However, I don't have a specific idea how to do it better, so
maybe we should just get this committed for now and perhaps, with more
eyes on the code, someone will have a good idea.

> Slight correction; it was suggested by Amit Langote; not by David.

Oh, OK, sorry.

> So there are two independent optimizations we are talking about :
>
> 1. Create the map only when needed.
> 2. In case of UPDATE, for partitions that take part in update scans,
> there should be a single map; there should not be two separate maps,
> one for accessing per-subplan and the other for accessing per-leaf.

These optimizations aren't completely independent.   Optimization #2
can be implemented in several different ways.  The way you've chosen
to do it is to index the same array in two different ways depending on
whether or not per-leaf indexing is needed, which I think is
unacceptable.  Another approach, which I proposed upthread, is to
always build the per-leaf mapping, but you pointed out that this could
involve doing a lot of unnecessary work in the case where most leaves
were pruned.  However, if you also implement #1, then that problem
goes away.  In other words, depending on the design you choose for #2,
you may or may not need to also implement optimization #1 to get good
performance.

To put that another way, I think Amit's idea of keeping a
subplan-offsets array is a pretty good one.  From your comments, you
do too.  But if we want to keep that, then we need a way to avoid the
expense of populating it for leaves that got pruned, except when we
are doing update row movement.  Otherwise, I don't see much choice but
to jettison the subplan-offsets array and just maintain two separate
arrays of mappings.

> Regarding the array names ...
>
> Noting all this, I feel we can go with names according to the
> structure of maps. Something like : mt_perleaf_tupconv_maps, and
> mt_persubplan_tupconv_maps. Other suggestions welcome.

I'd probably do mt_per_leaf_tupconv_maps, since inserting an
underscore between some but not all words seems strange.  But OK
otherwise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

2018-01-10 Thread Amit Khandekar
On 11 January 2018 at 10:44, David Rowley  wrote:
>>> 18. Why two RESET SESSION AUTHORIZATIONs?
>>>
>>> reset session authorization;
>>> drop trigger trig_d_1_15 ON part_d_1_15;
>>> drop function func_d_1_15();
>>> -- Policy expression contains SubPlan
>>> reset session authorization;
>>
>> The second reset is actually in a different paragraph. The reason it's
>> there is to ensure we have reset it regardless of the earlier cleanup.
>
> hmm, I was reviewing the .out file, which does not have the empty
> lines. Still seems a bit surplus.

I believe the output file does not have the blank lines present in the
.sql file. I was referring to the paragraph in the *.sql* file.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

2018-01-10 Thread David Rowley
Thanks for making those changes.

On 11 January 2018 at 04:00, Amit Khandekar  wrote:
> Yes, I understand that there won't be any update scan plans. But, with
> the modifications done in ExecInitModifyTable(), I wanted to run that
> code with this scenario where there are no partitions, to make sure it
> does not behave weirdly or crash. Any suggestions for comments, given
> this perspective ? For now, I have made the comment this way:
>
> -- Check that partition-key UPDATE works sanely on a partitioned table
> that does not have any child partitions.

Sounds good.

>> 18. Why two RESET SESSION AUTHORIZATIONs?
>>
>> reset session authorization;
>> drop trigger trig_d_1_15 ON part_d_1_15;
>> drop function func_d_1_15();
>> -- Policy expression contains SubPlan
>> reset session authorization;
>
> The second reset is actually in a different paragraph. The reason it's
> there is to ensure we have reset it regardless of the earlier cleanup.

hmm, I was reviewing the .out file, which does not have the empty
lines. Still seems a bit surplus.

> Attached v35 patch. Thanks.

Thanks. I'll try to look at it soon.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] UPDATE of partition key

2018-01-09 Thread Robert Haas
On Fri, Jan 5, 2018 at 3:25 PM, Robert Haas  wrote:
> On Fri, Jan 5, 2018 at 7:12 AM, Amit Khandekar  wrote:
>> The above patch is to be applied over the last remaining preparatory
>> patch, now named (and attached) :
>> 0001-Refactor-CheckConstraint-related-code.patch
>
> Committed that one, too.

Some more comments on the main patch:

I don't really like the fact that ExecCleanupTupleRouting() now takes
a ModifyTableState as an argument, particularly because of the way
that is using that argument.  To figure out whether a ResultRelInfo
was pre-existing or one it created, it checks whether the pointer
address of the ResultRelInfo is >= mtstate->resultRelInfo and <
mtstate->resultRelInfo + mtstate->mt_nplans.  However, that means that
ExecCleanupTupleRouting() ends up knowing about the memory allocation
pattern used by ExecInitModifyTable(), which seems like a slightly
dangerous amount of action at a distance.  I think it would be better
for the PartitionTupleRouting structure to explicitly indicate which
ResultRelInfos should be closed, for example by storing a Bitmapset
*input_partitions.  (Here, by "input", I mean "provided from the
mtstate rather than created by the PartitionTupleRouting structure";
other naming suggestions welcome.)  When
ExecSetupPartitionTupleRouting latches onto a partition, it can do
proute->input_partitions = bms_add_member(proute->input_partitions, i).
In ExecCleanupTupleRouting, it can do
if (bms_is_member(proute->input_partitions, i)) continue.

We have a test, in the regression test suite for file_fdw, which
generates the message "cannot route inserted tuples to a foreign
table".  I think we should have a similar test for the case where an
UPDATE tries to move a tuple from a regular partition to a foreign
table partition.  I'm not sure if it should fail with the same error
or a different one, but I think we should have a test that it fails
cleanly and with a nice error message of some sort.

The comment for get_partitioned_child_rels() claims that it sets
is_partition_key_update, but it really sets *is_partition_key_update.
And I think instead of "is a partition key" it should say "is used in
the partition key either of the relation whose RTI is specified or of
any child relation."  I propose "used in" instead of "is" because
there can be partition expressions, and the rest is to clarify that
child partition keys matter.

create_modifytable_path uses partColsUpdated rather than
partKeyUpdated, which actually seems like better terminology.  I
propose partKeyUpdated -> partColsUpdated everywhere.  Also, why use
is_partition_key_update for basically the same thing in some other
places?  I propose changing that to partColsUpdated as well.

The capitalization of the first comment hunk in execPartition.h is strange.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

2018-01-08 Thread David Rowley
On 4 January 2018 at 02:52, David Rowley  wrote:
> I'll try to look at the tests tomorrow and also do some testing.

I've made a pass over the tests. Again, sometimes I'm probably a bit
pedantic. The reason for that is that the tests are not that easy to
follow. Moving creation and cleanup of objects closer to where they're
used and no longer needed makes it easier to read through and verify
the tests. There are some genuine mistakes in there too.

1.

   NEW.c = NEW.c + 1; -- Make even number odd, or vice versa

This seems to be worded as if there'd only ever be one number. I think
it should be plural and read "Make even numbers odd, and vice versa"

2. The following comment does not make a huge amount of sense.

-- UPDATE with
-- partition key or non-partition columns, with different column ordering,
-- triggers.

Should "or" be "on"? Does ", triggers" mean "with triggers"?

3. The follow test tries to test a BEFORE DELETE trigger stopping a
DELETE on sub_part1, but going by the SELECT, there are no rows in
that table to stop being DELETEd.

select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
  tableoid  | a | b  | c
------------+---+----+----
 list_part1 | 2 | 52 | 50
 list_part1 | 3 |  6 | 60
 sub_part2  | 1 |  2 | 10
 sub_part2  | 1 |  2 | 70
(4 rows)

drop trigger parted_mod_b ON sub_part1 ;
-- If BR DELETE trigger prevented DELETE from happening, we should also skip
-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
create or replace function func_parted_mod_b() returns trigger as $$
begin return NULL; end $$ language plpgsql;
create trigger trig_skip_delete before delete on sub_part1
   for each row execute procedure func_parted_mod_b();
update list_parted set b = 1 where c = 70;
select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
  tableoid  | a | b  | c
------------+---+----+----
 list_part1 | 2 | 52 | 50
 list_part1 | 3 |  6 | 60
 sub_part1  | 1 |  1 | 70
 sub_part2  | 1 |  2 | 10
(4 rows)

You've added the BEFORE DELETE trigger to sub_part1, but you can see
the tuple was DELETEd from sub_part2 and INSERTed into sub_part1, so
the test is not working as you've commented.

It's probably a good idea to RAISE NOTICE 'something useful here'; in
the trigger function to verify they're actually being called in the
test.

4. I think the final drop function in the following should be before
the UPDATE FROM test. You've already done some cleanup for that test
by doing "drop trigger trig_skip_delete ON sub_part1 ;"

drop trigger trig_skip_delete ON sub_part1 ;
-- UPDATE partition-key with FROM clause. If join produces multiple output
-- rows for the same row to be modified, we should tuple-route the row
-- only once.
-- There should not be any rows inserted.
create table non_parted (id int);
insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
  tableoid  | a | b  | c
------------+---+----+----
 list_part1 | 2 |  1 | 70
 list_part1 | 2 |  2 | 10
 list_part1 | 2 | 52 | 50
 list_part1 | 3 |  6 | 60
(4 rows)

drop table non_parted;
drop function func_parted_mod_b();

Also, there's a space before the ; in the drop trigger above. Can that
be removed?

5. The following comment:

-- update to a partition should check partition bound constraint for
the new tuple.
-- If partition key is updated, the row should be moved to the appropriate
-- partition. updatable views using partitions should enforce the check options
-- for the rows that have been moved.

Can this be changed a bit? I think it's not accurate to say that an
update to a partition key causes the row to move. The row movement
only occurs when the new tuple does not match the partition bound and
another partition exists that does have a partition bound that matches
the tuple. How about:

-- When a partitioned table receives an UPDATE to the partitioned key and the
-- new values no longer meet the partition's bound, the row must be moved to
-- the correct partition for the new partition key (if one exists). We must
-- also ensure that updatable views on partitioned tables properly enforce any
-- WITH CHECK OPTION that is defined. The situation with triggers in this case
-- also requires thorough testing as partition key updates causing row
-- movement convert UPDATEs into DELETE+INSERT.

6. What does the following actually test?

-- This tests partition-key UPDATE on a partitioned table that does
-- not have any child partitions
update part_b_10_b_20 set b = b - 6;

There are no records in that partition, or anywhere in the hierarchy.
Are you just testing that there's no error? If so then the comment
should say so.

7. I think the following comment:

-- As mentioned above, the partition creation is intentionally kept in
-- descending bound order.

should instead say:

-- Create some more partitions 

Re: [HACKERS] UPDATE of partition key

2018-01-04 Thread Robert Haas
On Wed, Jan 3, 2018 at 6:29 AM, Amit Khandekar  wrote:
> ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *.
>
> Did this change in v3 version of
> 0001-Encapsulate-partition-related-info-in-a-structure.patch

I'll have to come back to some of the other open issues, but 0001 and
0005 look good to me now, so I pushed those as a single commit after
fixing a few things that pgindent didn't like.  I also think 0002 and
0003 look basically good, so I pushed those two as a single commit
also.  But the comment changes in 0003 didn't seem extensive enough to
me so I made a few more changes there along the way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

2018-01-03 Thread David Rowley
On 4 January 2018 at 02:52, David Rowley  wrote:
> I'll try to look at the tests tomorrow and also do some testing. So
> far I've only read the code and the docs.

There are a few more things I noticed on another pass I made today:

20. "carried" -> "carried out the"

+   would have identified the newly updated row and carried
+   UPDATE/DELETE on this new row

21. Extra new line

+   .
+
   

22. In copy.c CopyFrom() you have the following code:

/*
 * We might need to convert from the parent rowtype to the
 * partition rowtype.
 */
map = proute->partition_tupconv_maps[leaf_part_index];
if (map)
{
    Relation partrel = resultRelInfo->ri_RelationDesc;

    tuple = do_convert_tuple(tuple, map);

    /*
     * We must use the partition's tuple descriptor from this
     * point on.  Use a dedicated slot from this point on until
     * we're finished dealing with the partition.
     */
    slot = proute->partition_tuple_slot;
    Assert(slot != NULL);
    ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
    ExecStoreTuple(tuple, slot, InvalidBuffer, true);
}

Should this use ConvertPartitionTupleSlot() instead?

23. Why write;

last_resultRelInfo = mtstate->resultRelInfo + mtstate->mt_nplans;

when you can write;

last_resultRelInfo = &mtstate->resultRelInfo[mtstate->mt_nplans];?


24. In ExecCleanupTupleRouting(), do you think that you could just
have a special case loop for (mtstate && mtstate->operation ==
CMD_UPDATE)?

/*
* If this result rel is one of the UPDATE subplan result rels, let
* ExecEndPlan() close it. For INSERT or COPY, this does not apply
* because leaf partition result rels are always newly allocated.
*/
if (is_update &&
resultRelInfo >= first_resultRelInfo &&
resultRelInfo < last_resultRelInfo)
continue;

Something like:

if (mtstate && mtstate->operation == CMD_UPDATE)
{
    ResultRelInfo *first_resultRelInfo = mtstate->resultRelInfo;
    ResultRelInfo *last_resultRelInfo =
        &mtstate->resultRelInfo[mtstate->mt_nplans];

    for (i = 0; i < proute->num_partitions; i++)
    {
        ResultRelInfo *resultRelInfo = proute->partitions[i];

        /*
         * Leave any resultRelInfos that belong to the UPDATE's subplan
         * list.  These will be closed during executor shutdown.
         */
        if (resultRelInfo >= first_resultRelInfo &&
            resultRelInfo < last_resultRelInfo)
            continue;

        ExecCloseIndices(resultRelInfo);
        heap_close(resultRelInfo->ri_RelationDesc, NoLock);
    }
}
else
{
    for (i = 0; i < proute->num_partitions; i++)
    {
        ResultRelInfo *resultRelInfo = proute->partitions[i];

        ExecCloseIndices(resultRelInfo);
        heap_close(resultRelInfo->ri_RelationDesc, NoLock);
    }
}

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] UPDATE of partition key

2018-01-03 Thread Amit Khandekar
Robert, for tracking purpose, below I have consolidated your review
items on which we are yet to conclude. Let me know if you have more
comments on the points which I made.

--
1. ExecUpdate() needs to revert back tcs_map value changed by ExecInsert()
--

>> + /*
>> + * UPDATEs set the transition capture map only when a new subplan
>> + * is chosen.  But for INSERTs, it is set for each row. So after
>> + * INSERT, we need to revert back to the map created for UPDATE;
>> + * otherwise the next UPDATE will incorrectly use the one created
>> + * for INSERT.  So first save the one created for UPDATE.
>> + */
>> + if (mtstate->mt_transition_capture)
>> + saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>>
>> I wonder if there is some more elegant way to handle this problem.
>> Basically, the issue is that ExecInsert() is stomping on
>> mtstate->mt_transition_capture, and your solution is to save and
>> restore the value you want to have there.  But maybe we could instead
>> find a way to get ExecInsert() not to stomp on that state in the first
>> place.  It seems like the ON CONFLICT stuff handled that by adding a
>> second TransitionCaptureState pointer to ModifyTable, thus
>> mt_transition_capture and mt_oc_transition_capture.  By that
>> precedent, we could add mt_utr_transition_capture or similar, and
>> maybe that's the way to go.  It seems a bit unsatisfying, but so does
>> what you have now.
>
> In case of ON CONFLICT, if there are both INSERT and UPDATE statement
> triggers referencing transition tables, both of the triggers need to
> independently populate their own transition tables, and hence the need
> for two separate transition states : mt_transition_capture and
> mt_oc_transition_capture. But in case of update-tuple-routing, the
> INSERT statement trigger won't come into picture. So the same
> mt_transition_capture can serve the purpose of populating the
> transition table with OLD and NEW rows. So I think it would be too
> redundant, if not incorrect, to have a whole new transition state for
> update tuple routing.
>
> I will see if it turns out better to have two tcs_maps in
> TransitionCaptureState, one for update and one for insert. But this,
> on first look, does not look good.

Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and
insert_tcs_maps for UPDATE/DELETE and INSERT events respectively. So
upd_del_tcs_maps will be updated only after we start with the next
UPDATE subplan, whereas insert_tcs_maps will keep on getting updated
for each row. So in AfterTriggerSaveEvent(), upd_del_tcs_maps would be
used when the event is TRIGGER_EVENT_[UPDATE/DELETE], and
insert_tcs_maps will be used when event == TRIGGER_EVENT_INSERT. But
the issue is : even if the event is TRIGGER_EVENT_UPDATE, we don't
know whether this is caused by a normal update or as part of an insert
into new partition during partition-key-update. So blindly using
upd_del_tcs_maps is incorrect. If the event is caused by the latter, we
should use insert_tcs_maps rather than upd_del_tcs_maps. But we do not
have the information in trigger.c as to what caused this event.

So, overall, it would not work, and even if we make it work by passing
or storing some more information somewhere, the
AfterTriggerSaveEvent() logic will become too complicated.

So I can't think of anything else, but to keep the way I did, i.e.
reverting back the tcs_map once insert finishes. We do a similar thing
for reverting back the estate->es_result_relation_info.

--
2. mt_childparent_tupconv_maps is indexed by subplan or partition leaf index.
--

> + * If per-leaf map is required and the map is already created, that map
> + * has to be per-leaf. If that map is per-subplan, we won't be able to
> + * access the maps leaf-partition-wise. But if the map is per-leaf, we
> + * will be able to access the maps subplan-wise using the
> + * subplan_partition_offsets map using function
> + * tupconv_map_for_subplan().  So if the callers might need to access
> + * the map both leaf-partition-wise and subplan-wise, they should make
> + * sure that the first time this function is called, it should be
> + * called with perleaf=true so that the map created is per-leaf, not
> + * per-subplan.
>
> This sounds complicated and fragile.  It ends up meaning that
> mt_childparent_tupconv_maps is sometimes indexed by subplan number and
> sometimes by partition leaf index, which is extremely confusing and
> likely to lead to coding errors, either in this patch or in future
> ones.

Even if we always index the map by leaf partition, while accessing the
map the code still needs to be aware of whether the index number with
which we are accessing the map is the subplan number or leaf partition
number:

If the access is by subplan number, use subplan_partition_offsets to
convert to the leaf partition index. So the function
tupconv_map_for_subplan() is anyways necessary for accessing using
subplan 

Re: [HACKERS] UPDATE of partition key

2018-01-03 Thread David Rowley
> On 3 January 2018 at 11:42, Amit Khandekar  wrote:
>> [...] So it make perfect sense for
>> ExecSetupPartitionTupleRouting() to palloc and return a pointer. Sorry
>> for the noise. Will share the change in an upcoming patch version.
>> Thanks !
>
> ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *.

Thanks for changing. I've just done almost a complete review of v32.
(v33 came along a bit sooner than I thought).

I've not finished looking at the regression tests yet, but here are a
few things, some may have been changed in v33, I've not looked yet.
Also apologies in advance if anything seems nitpicky.

1. "by INSERT" -> "by an INSERT" in:

from the original partition followed by INSERT into the

2. "and INSERT" -> "and an INSERT" in:

a DELETE and INSERT. As far as

3. "due partition-key change" -> "due to the partition-key being changed" in:

 * capture is happening for UPDATEd rows being moved to another partition due
 * partition-key change, then this function is called once when the row is

4. "inserted to another" -> "inserted into another" in:

 * deleted (to capture OLD row), and once when the row is inserted to another

5. "for UPDATE event" -> "for an UPDATE event" (singular), or -> "for
UPDATE events" (plural)

* oldtup and newtup are non-NULL.  But for UPDATE event fired for

I'm unsure if you need singular or plural. It perhaps does not matter.

6. "for row" -> "for a row" in:

* movement, oldtup is NULL when the event is for row being inserted,

Likewise in:

* whereas newtup is NULL when the event is for row being deleted.

7. In the following fragment the code does not do what the comment says:

/*
* If transition tables are the only reason we're here, return. As
* mentioned above, we can also be here during update tuple routing in
* presence of transition tables, in which case this function is called
* separately for oldtup and newtup, so either can be NULL, not both.
*/
if (trigdesc == NULL ||
(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
return;

With the comment; "so either can be NULL, not both.", I'd expect a
boolean OR not an XOR.

maybe the comment is better written as:

"so we expect exactly one of them to be non-NULL"

(I know you've been discussing with Robert, so I've not checked v33 to
see if this still exists)

8. I'm struggling to make sense of this:

/*
* Save a tuple conversion map to convert a tuple routed to this
* partition from the parent's type to the partition's.
*/

Maybe it's better to write this as:

/*
* Generate a tuple conversion map to convert tuples of the parent's
* type into the partition's type.
*/

9. insert should be capitalised here and should be prefixed with "an":

/*
* Verify result relation is a valid target for insert operation. Even
* for updates, we are doing this for tuple-routing, so again, we need
* to check the validity for insert operation.
*/
CheckValidResultRel(leaf_part_rri, CMD_INSERT);

Maybe it's better to write:

/*
* Verify result relation is a valid target for an INSERT.  An UPDATE of
* a partition-key becomes a DELETE/INSERT operation, so this check is
* still required when the operation is CMD_UPDATE.
*/

10. The following code would be more clear if you replaced
mtstate->mt_transition_capture with transition_capture.

if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
&& mtstate->mt_transition_capture->tcs_update_new_table)
{
ExecARUpdateTriggers(estate, resultRelInfo, NULL,
NULL,
tuple,
NULL,
mtstate->mt_transition_capture);

/*
* Now that we have already captured NEW TABLE row, any AR INSERT
* trigger should not again capture it below. Arrange for the same.
*/
transition_capture = NULL;
}

You are, after all, doing:

transition_capture = mtstate->mt_transition_capture;

at the top of the function. There are a few other places you're also
accessing mtstate->mt_transition_capture.

11. Should tuple_deleted and process_returning be camelCase like the
other params?:

static TupleTableSlot *
ExecDelete(ModifyTableState *mtstate,
   ItemPointer tupleid,
   HeapTuple oldtuple,
   TupleTableSlot *planSlot,
   EPQState *epqstate,
   EState *estate,
   bool *tuple_deleted,
   bool process_returning,
   bool canSetTag)

12. The following comment talks about "target table descriptor", which
I think is a good term. In several other places, you mention "root",
which I take it to mean "target table".

 * This map array is required for two purposes :
 * 1. For update-tuple-routing. We need to convert the tuple from the subplan
 * result rel to the root partitioned table descriptor.
 * 2. For capturing transition tuples when the target table is a partitioned
 * table. For updates, we need to convert the tuple from subplan result rel to
 

Re: [HACKERS] UPDATE of partition key

2018-01-03 Thread Amit Khandekar
On 20 December 2017 at 11:52, Amit Khandekar  wrote:
> On 14 December 2017 at 08:11, Amit Langote  
> wrote:
>>
>> Regarding ExecSetupChildParentMap(), it seems to me that it could simply
>> be declared as
>>
>> static void ExecSetupChildParentMap(ModifyTableState *mtstate);
>>
>> Looking at the places from where it's called, it seems that you're just
>> extracting information from mtstate and passing the same for the rest of
>> its arguments.
>
> Agreed. But the last parameter per_leaf might be necessary. I will
> defer this until I address Robert's concern about the complexity of
> the related code.

Removed those parameters, but kept perleaf. The map required for
update-tuple-routing is a per-subplan one despite the presence of
partition tuple routing. And we cannot deduce from mtstate whether
update tuple routing is true. So for this case, the caller has to
explicitly specify that per-subplan map has to be created.

>>
>> tupconv_map_for_subplan() looks like it could be done as a macro.
>
> Or may be inline function. I will again defer this for similar reason
> as the above deferred item about ExecSetupChildParentMap parameters.
>

Made it inline.

Did the above changes in attached update-partition-key_v33.patch

On 3 January 2018 at 11:42, Amit Khandekar  wrote:
> On 2 January 2018 at 10:56, David Rowley  wrote:
>> I'm pretty sure Robert is suggesting that
>> ExecSetupPartitionTupleRouting pallocs the memory for the structure,
>> sets it up then returns a pointer to the new struct. That's not very
>> unusual. It seems unusual for a function to return void and modify a
>> single parameter pointer to get the value to the caller rather than
>> just to return that value.
>
> Sorry, my mistake. Earlier I somehow was under the impression that the
> callers of ExecSetupPartitionTupleRouting() already have this
> structure palloc'ed, and that they pass address of this structure. I
> now can see that both CopyStateData->partition_tuple_routing and
> ModifyTableState->mt_partition_tuple_routing are pointers, not
> structures. So it make perfect sense for
> ExecSetupPartitionTupleRouting() to palloc and return a pointer. Sorry
> for the noise. Will share the change in an upcoming patch version.
> Thanks !

ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *.

Did this change in v3 version of
0001-Encapsulate-partition-related-info-in-a-structure.patch

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


update-partition-key_v33.patch.tar.gz
Description: GNU Zip compressed data


Re: [HACKERS] UPDATE of partition key

2018-01-02 Thread Amit Khandekar
On 1 January 2018 at 21:43, Amit Khandekar  wrote:
> On 16 December 2017 at 03:09, Robert Haas  wrote:
>> + /*
>> + * UPDATEs set the transition capture map only when a new subplan
>> + * is chosen.  But for INSERTs, it is set for each row. So after
>> + * INSERT, we need to revert back to the map created for UPDATE;
>> + * otherwise the next UPDATE will incorrectly use the one created
>> + * for INSERT.  So first save the one created for UPDATE.
>> + */
>> + if (mtstate->mt_transition_capture)
>> + saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>>
>> I wonder if there is some more elegant way to handle this problem.
>> Basically, the issue is that ExecInsert() is stomping on
>> mtstate->mt_transition_capture, and your solution is to save and
>> restore the value you want to have there.  But maybe we could instead
>> find a way to get ExecInsert() not to stomp on that state in the first
>> place.  It seems like the ON CONFLICT stuff handled that by adding a
>> second TransitionCaptureState pointer to ModifyTable, thus
>> mt_transition_capture and mt_oc_transition_capture.  By that
>> precedent, we could add mt_utr_transition_capture or similar, and
>> maybe that's the way to go.  It seems a bit unsatisfying, but so does
>> what you have now.
>
> In case of ON CONFLICT, if there are both INSERT and UPDATE statement
> triggers referencing transition tables, both of the triggers need to
> independently populate their own transition tables, and hence the need
> for two separate transition states : mt_transition_capture and
> mt_oc_transition_capture. But in case of update-tuple-routing, the
> INSERT statement trigger won't come into picture. So the same
> mt_transition_capture can serve the purpose of populating the
> transition table with OLD and NEW rows. So I think it would be too
> redundant, if not incorrect, to have a whole new transition state for
> update tuple routing.
>
> I will see if it turns out better to have two tcs_maps in
> TransitionCaptureState, one for update and one for insert. But this,
> on first look, does not look good.

Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and
insert_tcs_maps for UPDATE/DELETE and INSERT events respectively. So
upd_del_tcs_maps will be updated only after we start with the next
UPDATE subplan, whereas insert_tcs_maps will keep on getting updated
for each row. So in AfterTriggerSaveEvent(), upd_del_tcs_maps would be
used when the event is TRIGGER_EVENT_[UPDATE/DELETE], and
insert_tcs_maps will be used when event == TRIGGER_EVENT_INSERT. But
the issue is : even if the event is TRIGGER_EVENT_UPDATE, we don't
know whether this is caused by a normal update or as part of an insert
into new partition during partition-key-update. So blindly using
upd_del_tcs_maps is incorrect. If the event is caused by the latter, we
should use insert_tcs_maps rather than upd_del_tcs_maps. But we do not
have the information in trigger.c as to what caused this event.

So, overall, it would not work, and even if we make it work by passing
or storing some more information somewhere, the
AfterTriggerSaveEvent() logic will become too complicated.

So I can't think of anything else, but to keep the way I did, i.e.
reverting back the tcs_map once insert finishes. We do a similar thing
for reverting back the estate->es_result_relation_info.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

2018-01-02 Thread Amit Khandekar
On 16 December 2017 at 03:09, Robert Haas  wrote:
>
> - map = ptr->partition_tupconv_maps[leaf_part_index];
> + map = ptr->parentchild_tupconv_maps[leaf_part_index];
>
> I don't think there's any reason to rename this.  In previous patch
> versions, you had multiple arrays of tuple conversion maps in this
> structure, but the refactoring eliminated that.

Done in an earlier version of the patch.

>
> Likewise, I'm not sure I get the point of mt_transition_tupconv_maps
> -> mt_childparent_tupconv_maps.  That seems like it could similarly be
> left alone.

We need to change it's name because now this map is not only used for
transition capture, but also for update-tuple-routing. Does it look ok
for you if, for readability, we keep the childparent tag ? Or else, we
can just make it "mt_tupconv_maps", but "mt_childparent_tupconv_maps"
looks more informative.

>
> + /*
> + * If transition tables are the only reason we're here, return. As
> + * mentioned above, we can also be here during update tuple routing in
> + * presence of transition tables, in which case this function is called
> + * separately for oldtup and newtup, so either can be NULL, not both.
> + */
>   if (trigdesc == NULL ||
>   (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
>   (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
> - (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
> + (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
> + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
>
> I guess this is correct, but it seems awfully fragile.  Can't we have
> some more explicit signaling about whether we're only here for
> transition tables, rather than deducing it based on exactly one of
> oldtup and newtup being NULL?

I had given a thought on this earlier. I felt, even the pre-existing
conditions like "!trigdesc->trig_update_after_row" are all indirect
ways to determine that this function is called only to capture
transition tables, and thought that it may have been better to have
separate parameter transition_table_only.

But then decided that I can continue on similar lines and add another
such condition to indicate that we are only capturing update-routed
tuples.

Instead of adding another parameter to AfterTriggerSaveEvent(), I had
also considered another approach: Put the transition-tuples-capture
logic part of AfterTriggerSaveEvent() into a helper function
CaptureTransitionTables(). In ExecInsert() and ExecDelete(), instead
of calling ExecARUpdateTriggers(), call this function
CaptureTransitionTables(). I then dropped this idea and thought rather
to call ExecARUpdateTriggers() which neatly does the required checks
and other things like locking the old tuple via GetTupleForTrigger().
So if we go by CaptureTransitionTables(), we would need to do what
ExecARUpdateTriggers() does before calling CaptureTransitionTables().
This is doable. If you think this is worth doing so as to get rid of
the "(oldtup == NULL) ^ (newtup == NULL)" condition, we can do that.

>
> + /* Initialization specific to update */
> + if (mtstate && mtstate->operation == CMD_UPDATE)
> + {
> + ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
> +
> + is_update = true;
> + update_rri = mtstate->resultRelInfo;
> + num_update_rri = list_length(node->plans);
> + }
>
> I guess I don't see why we need a separate "if" block for this.
> Neither is_update nor update_rri nor num_update_rri are used until we
> get to the block that begins with "if (is_update)".  Why not just
> change that block to test if (mtstate && mtstate->operation ==
> CMD_UPDATE)" and put the rest of these initializations inside that
> block?

Done.

>
> + int num_update_rri = 0,
> + update_rri_index = 0;
> ...
> + update_rri_index = 0;
>
> It's already 0.

Done. Retained the comment that mentions why we need to set it to 0,
and added a note in the end that we have already done this during
initialization.

>
> + leaf_part_rri = &update_rri[update_rri_index];
> ...
> + leaf_part_rri = leaf_part_arr + i;
>
> These are doing the same kind of thing, but using different styles.  I
> prefer the former style, so I'd change the second one to
> &leaf_part_arr[i]. Alternatively, you could change the first one to
> update_rri + update_rri_index.  But it's strange to see the same
> variable initialized in two different ways just a few lines apart.
>

Done. Used the first style.

>
> +static HeapTuple
> +ConvertPartitionTupleSlot(ModifyTableState *mtstate,
> +   TupleConversionMap *map,
> +   HeapTuple tuple,
> +   TupleTableSlot *new_slot,
> +   TupleTableSlot **p_my_slot)
>
> This function doesn't use the mtstate argument at all.

Removed mtstate.

>
> + * (Similarly we need to add the deleted row in OLD TABLE).  We need to do
>
> The period should be before, not after, the closing parenthesis.

Done.

>
> + * Now that we have already captured NEW TABLE row, any AR INSERT
> + * trigger should not 

Re: [HACKERS] UPDATE of partition key

2018-01-01 Thread David Rowley
On 23 December 2017 at 04:00, Amit Khandekar  wrote:
> On 15 December 2017 at 18:28, Robert Haas  wrote:
>> -PartitionDispatch **pd,
>> -ResultRelInfo ***partitions,
>> -TupleConversionMap ***tup_conv_maps,
>> -TupleTableSlot **partition_tuple_slot,
>> -int *num_parted, int *num_partitions)
>> +PartitionTupleRouting **partition_tuple_routing)
>>
>> Since we're consolidating all of ExecSetupPartitionTupleRouting's
>> output parameters into a single structure, I think it might make more
>> sense to have it just return that value.  I think it's only done with
>> output parameter today because there are so many different things
>> being produced, and we can't return them all.
>
> You mean ExecSetupPartitionTupleRouting() will return the structure
> (not pointer to structure), and the caller will get the copy of the
> structure like this ? :
>
> mtstate->mt_partition_tuple_routing =
> ExecSetupPartitionTupleRouting(mtstate, rel, node->nominalRelation, estate);
>
> I am ok with that, but just wanted to confirm if that is what you are
> saying. I don't recall seeing a structure return value in PG code, so
> not sure if it is conventional in PG to do that. Hence, I am somewhat
> inclined to keep it as output param. It also avoids a structure copy.
>
> Another way is for ExecSetupPartitionTupleRouting() to palloc this
> structure, and return its pointer, but then caller would have to
> anyway do a structure copy, so that's not convenient, and I don't
> think you are suggesting this way either.

I'm pretty sure Robert is suggesting that
ExecSetupPartitionTupleRouting pallocs the memory for the structure,
sets it up then returns a pointer to the new struct. That's not very
unusual. It seems unusual for a function to return void and modify a
single parameter pointer to get the value to the caller rather than
just to return that value.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] UPDATE of partition key

2018-01-01 Thread Amit Khandekar
On 16 December 2017 at 03:09, Robert Haas  wrote:
> started another review pass over the main patch, so here are
> some comments about that.

I am yet to address all the comments, but meanwhile, below are some
specific points  ...

> + if (!partrel)
> + {
> + /*
> + * We locked all the partitions above including the leaf
> + * partitions. Note that each of the newly opened relations in
> + * *partitions are eventually closed by the caller.
> + */
> + partrel = heap_open(leaf_oid, NoLock);
> + InitResultRelInfo(leaf_part_rri,
> +   partrel,
> +   resultRTindex,
> +   rel,
> +   estate->es_instrument);
> + }
>
> Hmm, isn't there a problem here?  Before, we opened all the relations
> here and the caller closed them all.  But now, we're only opening some
> of them.  If the caller closes them all, then they will be closing
> some that we opened and some that we didn't.  That seems quite bad,
> because the reference counts that are incremented and decremented by
> opening and closing should all end up at 0.  Maybe I'm confused
> because it seems like this would break in any scenario where even 1
> relation was already opened and surely you must have tested that
> case... but if there's some reason this works, I don't know what it
> is, and the comment doesn't tell me.

In ExecCleanupTupleRouting(), we are closing only those newly opened
partitions. We skip those which are actually part of the update result
rels.

> + /*
> + * UPDATEs set the transition capture map only when a new subplan
> + * is chosen.  But for INSERTs, it is set for each row. So after
> + * INSERT, we need to revert back to the map created for UPDATE;
> + * otherwise the next UPDATE will incorrectly use the one created
> + * for INSERT.  So first save the one created for UPDATE.
> + */
> + if (mtstate->mt_transition_capture)
> + saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>
> I wonder if there is some more elegant way to handle this problem.
> Basically, the issue is that ExecInsert() is stomping on
> mtstate->mt_transition_capture, and your solution is to save and
> restore the value you want to have there.  But maybe we could instead
> find a way to get ExecInsert() not to stomp on that state in the first
> place.  It seems like the ON CONFLICT stuff handled that by adding a
> second TransitionCaptureState pointer to ModifyTable, thus
> mt_transition_capture and mt_oc_transition_capture.  By that
> precedent, we could add mt_utr_transition_capture or similar, and
> maybe that's the way to go.  It seems a bit unsatisfying, but so does
> what you have now.

In case of ON CONFLICT, if there are both INSERT and UPDATE statement
triggers referencing transition tables, both of the triggers need to
independently populate their own transition tables, and hence the need
for two separate transition states : mt_transition_capture and
mt_oc_transition_capture. But in case of update-tuple-routing, the
INSERT statement trigger won't come into picture. So the same
mt_transition_capture can serve the purpose of populating the
transition table with OLD and NEW rows. So I think it would be too
redundant, if not incorrect, to have a whole new transition state for
update tuple routing.

I will see if it turns out better to have two tcs_maps in
TransitionCaptureState, one for update and one for insert. But this,
on first look, does not look good.

> + * If per-leaf map is required and the map is already created, that map
> + * has to be per-leaf. If that map is per-subplan, we won't be able to
> + * access the maps leaf-partition-wise. But if the map is per-leaf, we
> + * will be able to access the maps subplan-wise using the
> + * subplan_partition_offsets map using function
> + * tupconv_map_for_subplan().  So if the callers might need to access
> + * the map both leaf-partition-wise and subplan-wise, they should make
> + * sure that the first time this function is called, it should be
> + * called with perleaf=true so that the map created is per-leaf, not
> + * per-subplan.
>
> This sounds complicated and fragile.  It ends up meaning that
> mt_childparent_tupconv_maps is sometimes indexed by subplan number and
> sometimes by partition leaf index, which is extremely confusing and
> likely to lead to coding errors, either in this patch or in future
> ones.

Even if we always index the map by leaf partition, while accessing the
map the code still needs to be aware of whether the index number with
which we are accessing the map is the subplan number or leaf partition
number:

If the access is by subplan number, use subplan_partition_offsets to
convert to the leaf partition index. So the function
tupconv_map_for_subplan() is anyways necessary for accessing using
subplan index. Only thing that will change is :
tupconv_map_for_subplan() will not have to check if the map is
indexed by leaf partition or not. But that complexity is hidden in
this function alone; the outside code need not worry about that.

If the access is by 

Re: [HACKERS] UPDATE of partition key

2017-12-15 Thread Robert Haas
On Fri, Dec 15, 2017 at 7:58 AM, Robert Haas  wrote:
> Reviewing the preparatory patch:

I started another review pass over the main patch, so here are some
comments about that.  This is unfortunately not a complete review,
however.

- map = ptr->partition_tupconv_maps[leaf_part_index];
+ map = ptr->parentchild_tupconv_maps[leaf_part_index];

I don't think there's any reason to rename this.  In previous patch
versions, you had multiple arrays of tuple conversion maps in this
structure, but the refactoring eliminated that.

Likewise, I'm not sure I get the point of mt_transition_tupconv_maps
-> mt_childparent_tupconv_maps.  That seems like it could similarly be
left alone.

+ /*
+ * If transition tables are the only reason we're here, return. As
+ * mentioned above, we can also be here during update tuple routing in
+ * presence of transition tables, in which case this function is called
+ * separately for oldtup and newtup, so either can be NULL, not both.
+ */
  if (trigdesc == NULL ||
  (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
  (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
- (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+ (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+ (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))

I guess this is correct, but it seems awfully fragile.  Can't we have
some more explicit signaling about whether we're only here for
transition tables, rather than deducing it based on exactly one of
oldtup and newtup being NULL?

+ /* Initialization specific to update */
+ if (mtstate && mtstate->operation == CMD_UPDATE)
+ {
+ ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+
+ is_update = true;
+ update_rri = mtstate->resultRelInfo;
+ num_update_rri = list_length(node->plans);
+ }

I guess I don't see why we need a separate "if" block for this.
Neither is_update nor update_rri nor num_update_rri are used until we
get to the block that begins with "if (is_update)".  Why not just
change that block to test if (mtstate && mtstate->operation ==
CMD_UPDATE)" and put the rest of these initializations inside that
block?

+ int num_update_rri = 0,
+ update_rri_index = 0;
...
+ update_rri_index = 0;

It's already 0.

+ leaf_part_rri = &update_rri[update_rri_index];
...
+ leaf_part_rri = leaf_part_arr + i;

These are doing the same kind of thing, but using different styles.  I
prefer the former style, so I'd change the second one to
&leaf_part_arr[i]. Alternatively, you could change the first one to
update_rri + update_rri_index.  But it's strange to see the same
variable initialized in two different ways just a few lines apart.

+ if (!partrel)
+ {
+ /*
+ * We locked all the partitions above including the leaf
+ * partitions. Note that each of the newly opened relations in
+ * *partitions are eventually closed by the caller.
+ */
+ partrel = heap_open(leaf_oid, NoLock);
+ InitResultRelInfo(leaf_part_rri,
+   partrel,
+   resultRTindex,
+   rel,
+   estate->es_instrument);
+ }

Hmm, isn't there a problem here?  Before, we opened all the relations
here and the caller closed them all.  But now, we're only opening some
of them.  If the caller closes them all, then they will be closing
some that we opened and some that we didn't.  That seems quite bad,
because the reference counts that are incremented and decremented by
opening and closing should all end up at 0.  Maybe I'm confused
because it seems like this would break in any scenario where even 1
relation was already opened and surely you must have tested that
case... but if there's some reason this works, I don't know what it
is, and the comment doesn't tell me.

+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+   TupleConversionMap *map,
+   HeapTuple tuple,
+   TupleTableSlot *new_slot,
+   TupleTableSlot **p_my_slot)

This function doesn't use the mtstate argument at all.

+ * (Similarly we need to add the deleted row in OLD TABLE).  We need to do

The period should be before, not after, the closing parenthesis.

+ * Now that we have already captured NEW TABLE row, any AR INSERT
+ * trigger should not again capture it below. Arrange for the same.

A more American style would be something like "We've already captured
the NEW TABLE row, so make sure any AR INSERT trigger fired below
doesn't capture it again."  (Similarly for the other case.)

+ /* The delete has actually happened, so inform that to the caller */
+ if (tuple_deleted)
+ *tuple_deleted = true;

In the US, we inform the caller, not inform that to the caller.  In
other words, here the direct object of "inform" is the person or thing
getting the information (in this case, "the caller"), not the
information being conveyed (in this case, "that").  I realize your
usage is probably typical for your country...

+ Assert(mtstate->mt_is_tupconv_perpart == true);

We usually just Assert(thing_that_should_be_true), not

Re: [HACKERS] UPDATE of partition key

2017-12-15 Thread Robert Haas
On Wed, Dec 13, 2017 at 5:18 AM, Amit Khandekar  wrote:
> Amit Langote informed me off-list, - along with suggestions for
> changes - that my patch needs a rebase. Attached is the rebased
> version. I have also bumped the patch version number (now v29),
> because this has additional changes, again suggested by Amit L:
> Because ExecSetupPartitionTupleRouting() has mtstate parameter now,
> no need to pass update_rri and num_update_rri, since they can be
> retrieved from mtstate.
>
> Also, the preparatory patch is also rebased.

Reviewing the preparatory patch:

+ PartitionTupleRouting *partition_tuple_routing;
+ /* Tuple-routing support info */

Something's wrong with the formatting here.

-PartitionDispatch **pd,
-ResultRelInfo ***partitions,
-TupleConversionMap ***tup_conv_maps,
-TupleTableSlot **partition_tuple_slot,
-int *num_parted, int *num_partitions)
+PartitionTupleRouting **partition_tuple_routing)

Since we're consolidating all of ExecSetupPartitionTupleRouting's
output parameters into a single structure, I think it might make more
sense to have it just return that value.  I think it's only done with
output parameter today because there are so many different things
being produced, and we can't return them all.

+ PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;

This is just nitpicking, but I don't find "ptr" to be the greatest
variable name; it looks too much like "pointer".  Maybe we could use
"routing" or "proute" or something.

It seems to me that we could improve things here by adding a function
ExecCleanupTupleRouting(PartitionTupleRouting *) which would do the
various heap_close(), ExecDropSingleTupleTableSlot(), and
ExecCloseIndices() operations which are currently performed in
CopyFrom() and, by separate code, in ExecEndModifyTable().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UPDATE of partition key

2017-12-13 Thread Amit Langote
Thanks for the updated patches, Amit.

Some review comments.

Forgot to remove the description of update_rri and num_update_rri in the
header comment of ExecSetupPartitionTupleRouting().

-
+extern void pull_child_partition_columns(Relation rel,
+ Relation parent,
+ Bitmapset **partcols);

It seems you forgot to remove this declaration in partition.h, because I
don't find it defined or used anywhere.

I think some of the changes that are currently part of the main patch are
better taken out into their own patches, because having those diffs appear
in the main patch is kind of distracting.  Just like you now have a patch
that introduces a PartitionTupleRouting structure.  I know that leads to
too many patches, but it helps to easily tell less substantial changes
from the substantial ones.

1. Patch to rename partition_tupconv_maps to parentchild_tupconv_maps.

2. Patch that introduces has_partition_attrs() in place of
   is_partition_attr()

3. Patch to change the names of map_partition_varattnos() arguments

4. Patch that does the refactoring involving ExecConstrains(),
   ExecPartitionCheck(), and the introduction of
   ExecPartitionCheckEmitError()


Regarding ExecSetupChildParentMap(), it seems to me that it could simply
be declared as

static void ExecSetupChildParentMap(ModifyTableState *mtstate);

Looking at the places from where it's called, it seems that you're just
extracting information from mtstate and passing the same for the rest of
its arguments.

mt_is_tupconv_perpart seems like it's unnecessary.  Its function could be
fulfilled by inspecting the state of some other fields of
ModifyTableState.  For example, in the case of an update (operation ==
CMD_UPDATE), if mt_partition_tuple_routing is non-NULL, then we can always
assume that mt_childparent_tupconv_maps has entries for all partitions.
If it's NULL, then there would be only entries for partitions that have
sub-plans.

tupconv_map_for_subplan() looks like it could be done as a macro.

Thanks,
Amit




Re: [HACKERS] UPDATE of partition key

2017-11-30 Thread Amit Khandekar
While addressing Thomas's point about test scenarios not yet covered,
I observed the following ...

Suppose an UPDATE RLS policy with a WITH CHECK clause is defined on
the target table. Now in ExecUpdate(), the corresponding WCO qual gets
executed *before* the partition constraint check, as per existing
behaviour, and the qual succeeds. Then, because the partition key was
updated, the row is moved to another partition. Here, suppose there is
a BR INSERT trigger which modifies the row, and the resultant row
actually would *not* pass the UPDATE RLS policy. But for this
partition, since it is an INSERT, only INSERT RLS WCO quals are
executed.

So effectively, from the user's perspective, a row that an RLS WITH
CHECK policy was defined to reject is getting updated successfully.
This is because the policy is not checked *after* a row trigger in the
new partition is executed.

Attached is a test case that reproduces this issue.

I think, in case of row-movement, we should defer calling
ExecWithCheckOptions() until the row is being inserted using
ExecInsert(). And then in ExecInsert(), ExecWithCheckOptions() should
be called using WCO_RLS_UPDATE_CHECK rather than WCO_RLS_INSERT_CHECK
(I recall Amit Langote was of this opinion) as below :

--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -510,7 +510,9 @@ ExecInsert(ModifyTableState *mtstate,
  * we are looking for at this point.
  */
  if (resultRelInfo->ri_WithCheckOptions != NIL)
- ExecWithCheckOptions(WCO_RLS_INSERT_CHECK,
+ExecWithCheckOptions((mtstate->operation == CMD_UPDATE ?
+ WCO_RLS_UPDATE_CHECK : WCO_RLS_INSERT_CHECK),
  resultRelInfo, slot, estate);


It can be argued that since in case of triggers we always execute
INSERT row triggers for rows inserted as part of update-row-movement,
we should be consistent and execute INSERT WCOs and not UPDATE WCOs
for such rows. But note that, the row triggers we execute are defined
on the leaf partitions. But the RLS policies being executed are
defined for the target partitioned table, and not the leaf partition.
Hence it makes sense to execute them as per the original operation on
the target table. This is similar to why we execute UPDATE statement
triggers even when the row is eventually inserted into another
partition. This is because UPDATE statement trigger was defined for
the target table, not the leaf partition.

Barring any objections, I am going to send a revised patch that fixes
the above issue as described.

Thanks
-Amit Khandekar


wco_rls_issue.sql
Description: Binary data


Re: [HACKERS] UPDATE of partition key

2017-11-28 Thread Michael Paquier
On Mon, Nov 27, 2017 at 5:28 PM, Amit Khandekar  wrote:
> So, in the upcoming patch version, I am intending to include the above
> two, and if possible, Robert's idea of re-using is_partition_attr()
> for pull_child_partition_columns().

Discussions are still going on, so moved to next CF.
-- 
Michael



Re: [HACKERS] UPDATE of partition key

2017-11-27 Thread Amit Khandekar
On 24 November 2017 at 10:52, Amit Langote
 wrote:
> On 2017/11/23 21:57, Amit Khandekar wrote:
>> If we collect the partition keys in expand_partitioned_rtentry(), we
>> need to pass the root relation also, so that we can convert the
>> partition key attributes to root rel descriptor. And the other thing
>> is, maybe we can check beforehand (in expand_inherited_rtentry)
>> whether the rootrte's updatedCols is empty, which I think implies that
>> it's not an UPDATE operation. If yes, we can just skip collecting the
>> partition keys.
>
> Yeah, it seems like a good idea after all to check in
> expand_inherited_rtentry() whether the root RTE's updatedCols is non-empty
> and if so check if any of the updatedCols are partition keys.  If we find
> some, then it will suffice to just set a simple flag in the
> PartitionedChildRelInfo that will be created for the root table.  That
> should be done *after* we have visited all the tables in the partition
> tree including some that might be partitioned and hence will provide their
> partition keys.  The following block in expand_inherited_rtentry() looks
> like a good spot:
>
> if (rte->inh && partitioned_child_rels != NIL)
> {
> PartitionedChildRelInfo *pcinfo;
>
> pcinfo = makeNode(PartitionedChildRelInfo);

Yes, I am thinking about something like that. Thanks.

I am also working on your suggestion of moving the
convert-to-root-descriptor logic from ExecInsert() to ExecUpdate().

So, in the upcoming patch version, I am intending to include the above
two, and if possible, Robert's idea of re-using is_partition_attr()
for pull_child_partition_columns().


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] UPDATE of partition key

2017-11-23 Thread Amit Langote
On 2017/11/23 21:57, Amit Khandekar wrote:
> If we collect the partition keys in expand_partitioned_rtentry(), we
> need to pass the root relation also, so that we can convert the
> partition key attributes to root rel descriptor. And the other thing
> is, maybe we can check beforehand (in expand_inherited_rtentry)
> whether the rootrte's updatedCols is empty, which I think implies that
> it's not an UPDATE operation. If yes, we can just skip collecting the
> partition keys.

Yeah, it seems like a good idea after all to check in
expand_inherited_rtentry() whether the root RTE's updatedCols is non-empty
and if so check if any of the updatedCols are partition keys.  If we find
some, then it will suffice to just set a simple flag in the
PartitionedChildRelInfo that will be created for the root table.  That
should be done *after* we have visited all the tables in the partition
tree including some that might be partitioned and hence will provide their
partition keys.  The following block in expand_inherited_rtentry() looks
like a good spot:

if (rte->inh && partitioned_child_rels != NIL)
{
PartitionedChildRelInfo *pcinfo;

pcinfo = makeNode(PartitionedChildRelInfo);

Thanks,
Amit




Re: [HACKERS] UPDATE of partition key

2017-11-22 Thread Amit Khandekar
On 21 November 2017 at 17:24, Amit Khandekar  wrote:
> On 13 November 2017 at 18:25, David Rowley  
> wrote:
>>
>> 30. The following chunk of code is giving me a headache trying to
>> verify which arrays are which size:
>>
>> ExecSetupPartitionTupleRouting(rel,
>>mtstate->resultRelInfo,
>>(operation == CMD_UPDATE ? nplans : 0),
>>node->nominalRelation,
>>estate,
>>&partition_dispatch_info,
>>&partitions,
>>&partition_tupconv_maps,
>>&subplan_leaf_map,
>>&partition_tuple_slot,
>>&num_parted, &num_partitions);
>> mtstate->mt_partition_dispatch_info = partition_dispatch_info;
>> mtstate->mt_num_dispatch = num_parted;
>> mtstate->mt_partitions = partitions;
>> mtstate->mt_num_partitions = num_partitions;
>> mtstate->mt_parentchild_tupconv_maps = partition_tupconv_maps;
>> mtstate->mt_subplan_partition_offsets = subplan_leaf_map;
>> mtstate->mt_partition_tuple_slot = partition_tuple_slot;
>> mtstate->mt_root_tuple_slot = MakeTupleTableSlot();
>>
>> I know this patch is not completely responsible for it, but you're not
>> making things any better.
>>
>> Would it not be better to invent some PartitionTupleRouting struct and
>> make that struct a member of ModifyTableState and CopyState, then just
>> pass the pointer to that struct to ExecSetupPartitionTupleRouting()
>> and have it fill in the required details? I think the complexity of
>> this is already on the high end, I think you really need to do the
>> refactor before this gets any worse.
>>
>
> Ok. I am currently working on doing this change, so it is not yet included
> in the attached patch. I will send yet another revised patch for this change.

Attached incremental patch encapsulate_partinfo.patch (to be applied
over the latest v25 patch) contains changes to move all the
partition-related information into new structure
PartitionTupleRouting. Further to that, I also moved
PartitionDispatchInfo into this structure. So it looks like this :

typedef struct PartitionTupleRouting
{
PartitionDispatch *partition_dispatch_info;
int num_dispatch;
ResultRelInfo **partitions;
int num_partitions;
TupleConversionMap **parentchild_tupconv_maps;
int*subplan_partition_offsets;
TupleTableSlot *partition_tuple_slot;
TupleTableSlot *root_tuple_slot;
} PartitionTupleRouting;

So this structure now encapsulates *all* the
partition-tuple-routing-related information. So ModifyTableState now
has only one tuple-routing-related field 'mt_partition_tuple_routing'.
It is changed like this :

@@ -976,24 +976,14 @@ typedef struct ModifyTableState
 	TupleTableSlot *mt_existing;	/* slot to store existing target tuple in */
 	List	   *mt_excludedtlist;	/* the excluded pseudo relation's tlist */
 	TupleTableSlot *mt_conflproj;	/* CONFLICT ... SET ... projection target */
-	struct PartitionDispatchData **mt_partition_dispatch_info;
-	/* Tuple-routing support info */
-	int			mt_num_dispatch;	/* Number of entries in the above array */
-	int			mt_num_partitions;	/* Number of members in the following
-									 * arrays */
-	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
-	TupleTableSlot *mt_partition_tuple_slot;
-	TupleTableSlot *mt_root_tuple_slot;
+	struct PartitionTupleRouting *mt_partition_tuple_routing;
+	/* Tuple-routing support info */
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_parentchild_tupconv_maps;
-	/* Per partition map for tuple conversion from root to leaf */
 	TupleConversionMap **mt_childparent_tupconv_maps;
 	/* Per plan/partition map for tuple conversion from child to root */
 	bool		mt_is_tupconv_perpart;	/* Is the above map per-partition ? */
-	int		   *mt_subplan_partition_offsets;
-	/* Stores position of update result rels in leaf partitions */
 } ModifyTableState;

So the code in nodeModifyTable.c (and similar code in copy.c) is
accordingly adjusted to use mtstate->mt_partition_tuple_routing.

The places where we use (mtstate->mt_partition_dispatch_info != NULL)
condition to run tuple-routing code, I have replaced it with
mtstate->mt_partition_tuple_routing != NULL.

If you are ok with the incremental patch, I can extract this change
into a separate preparatory patch to be applied on PG master.

Thanks
-Amit Khandekar


encapsulate_partinfo.patch
Description: Binary data


Re: [HACKERS] UPDATE of partition key

2017-11-22 Thread Amit Langote
Thanks Amit.

Looking at the latest v25 patch.

On 2017/11/16 23:50, Amit Khandekar wrote:
> Below are the responses to both Amit's and David's comments, starting
> with Amit's.
> On 2 November 2017 at 12:40, Amit Langote  
> wrote:
>> On 2017/10/24 0:15, Amit Khandekar wrote:
>>> On 16 October 2017 at 08:28, Amit Langote  
>>> wrote:

 +   (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))

 Is there some reason why a bitwise operator is used here?
>>>
>>> That exact condition means that the function is called for transition
>>> capture for updated rows being moved to another partition. For this
>>> scenario, either the oldtup or the newtup is NULL. I wanted to exactly
>>> capture that condition there. I think the bitwise operator is more
>>> user-friendly in emphasizing the point that it is indeed an "either a
>>> or b, not both" condition.
>>
>> I see.  In that case, since this patch adds the new condition, a note
>> about it in the comment just above would be good, because the situation
>> you describe here seems to arise only during update-tuple-routing, IIUC.
> 
> Done. Please check.

Looks fine.

>> + * 'update_rri' has the UPDATE per-subplan result rels. These are re-used
>> + *  instead of allocating new ones while generating the array of all 
>> leaf
>> + *  partition result rels.
>>
>> Instead of:
>>
>> "These are re-used instead of allocating new ones while generating the
>> array of all leaf partition result rels."
>>
>> how about:
>>
>> "There is no need to allocate a new ResultRellInfo entry for leaf
>> partitions for which one already exists in this array"
> 
> Ok. I have made it like this :
> 
> + * 'update_rri' contains the UPDATE per-subplan result rels. For the
> output param
> + * 'partitions', we don't allocate new ResultRelInfo objects for
> + * leaf partitions for which they are already available
> in 'update_rri'.

Sure.

>>> It looks like the interface does not much simplify, and above that, we
>>> have more number of lines in that function. Also, the caller anyway
>>> has to be aware whether the map_index is the index into the leaf
>>> partitions or the update subplans. So it is not like the caller does
>>> not have to be aware about whether the mapping should be
>>> mt_persubplan_childparent_maps or mt_perleaf_parentchild_maps.
>>
>> Hmm, I think we should try to make it so that the caller doesn't have to
>> be aware of that.  And by caller I guess you mean ExecInsert(), which
>> should not be a place, IMHO, where to try to introduce a lot of new logic
>> specific to update tuple routing.
> 
> I think, for ExecInsert() since we have already given the job of
> routing the tuple from root partitioned table to a partition, it makes
> sense to give the function the additional job of routing the tuple
> from any partition to any partition. ExecInsert() can be looked at as
> doing this job : "insert a tuple into the right partition; the
> original tuple can belong to any partition"

Yeah, that's one way of looking at that.  But I think ExecInsert() as it
is today thinks it's got a *new* tuple and that's it.  The newly
introduced code in it, which finds out that this is not so (that the
tuple actually comes from some other partition and is really an
update turned into delete-plus-insert) and then switches to the root
partitioned table's ResultRelInfo, really belongs outside of it.
Maybe in its caller, which is ExecUpdate().  I mean why not add this code
to the block in ExecUpdate() that handles update-row-movement.

Just before calling ExecInsert() to do the re-routing seems like a good
place to do all that.  For example, try the attached incremental patch
that applies on top of yours.  I can see after applying it that diffs to
ExecInsert() are now just some refactoring ones and there are no
significant additions making it look like supporting update-row-movement
required substantial changes to how ExecInsert() itself works.

> After doing the changes for the int[] array map in the previous patch
> version, I still feel that ConvertPartitionTupleSlot() should be
> retained. It saves us some repeated lines of code.

OK.

>> You may be right, but I see for WithCheckOptions initialization
>> specifically that the non-tuple-routing code passes the actual sub-plan
>> when initializing the WCO for a given result rel.
> 
> Yes that's true. The problem with WithCheckOptions for newly allocated
> partition result rels is : we can't use a subplan for the parent
> parameter because there is no subplan for it. But I will still think
> on it a bit more (TODO).

Alright.

>>> I think you are suggesting we do it like how it's done in
>>> is_partition_attr(). Can you please let me know other places we do
>>> this same way ? I couldn't find.
>>
>> OK, not as many as I thought there would be, but there are following
>> beside