Re: [HACKERS] UPDATE of partition key
On Thu, Jan 25, 2018 at 10:39 AM, Robert Haas wrote:
> On Mon, Jan 22, 2018 at 9:46 AM, Tom Lane wrote:
>> Robert Haas writes:
>>> Tom, do you want to double-check that this fixes it for you?
>>
>> I can confirm that a valgrind run succeeded for me with the patch
>> in place.
>
> Committed. Sorry for the delay.

FYI I'm planning to look into adding a valgrind check to the commitfest CI thing I run so we can catch these earlier without committer involvement. It's super slow because of all those pesky regression tests so I'll probably need to improve the scheduling logic a bit to make it useful first (prioritising new patches or something, since otherwise it'll take up to multiple days to get around to valgrind-testing any given patch...).

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] UPDATE of partition key
On Mon, Jan 22, 2018 at 9:46 AM, Tom Lane wrote:
> Robert Haas writes:
>> Tom, do you want to double-check that this fixes it for you?
>
> I can confirm that a valgrind run succeeded for me with the patch
> in place.

Committed. Sorry for the delay.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] UPDATE of partition key
Robert Haas writes:
> Tom, do you want to double-check that this fixes it for you?

I can confirm that a valgrind run succeeded for me with the patch in place.

regards, tom lane
Re: [HACKERS] UPDATE of partition key
On Mon, Jan 22, 2018 at 2:44 AM, Amit Khandekar wrote:
> Yes, right, that's what is happening. It is not happening on an Assert
> though (there is no assert in that function). It is happening when we
> try to access the array here:
>
> if (proute->subplan_partition_offsets &&
>     proute->subplan_partition_offsets[subplan_index] == i)
>
> Attached is a fix, where I have introduced another field
> PartitionTupleRouting.num_subplan_partition_offsets, so that above,
> we can add another condition (subplan_index <
> proute->num_subplan_partition_offsets) in order to stop accessing the
> array once we are done with all the offset array elements.
>
> Ran the update.sql test with valgrind enabled on my laptop, and the
> valgrind output now does not show errors.

Tom, do you want to double-check that this fixes it for you?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] UPDATE of partition key
On 22 January 2018 at 02:40, Robert Haas wrote:
> On Sun, Jan 21, 2018 at 1:43 AM, Tom Lane wrote:
>> Robert Haas writes:
>>> Committed with a bunch of mostly-cosmetic revisions.
>>
>> Buildfarm member skink has been unhappy since this patch went in.
>> Running the regression tests under valgrind easily reproduces the
>> failure. Now, I might be wrong about which of the patches committed
>> on Friday caused the unhappiness, but the valgrind backtrace sure
>> looks like it's to do with partition routing:
>
> Yeah, that must be the fault of this patch. We assign to
> proute->subplan_partition_offsets[update_rri_index] from
> update_rri_index = 0 .. num_update_rri, and there's an Assert() at the
> bottom of this function that checks this, so probably this is indexing
> off the end of the array. I bet the issue happens when we find all of
> the UPDATE result rels while there are still partitions left; then,
> subplan_index will be equal to the length of the
> proute->subplan_partition_offsets array and we'll be indexing just off
> the end.

Yes, right, that's what is happening. It is not happening on an Assert though (there is no assert in that function). It is happening when we try to access the array here:

if (proute->subplan_partition_offsets &&
    proute->subplan_partition_offsets[subplan_index] == i)

Attached is a fix, where I have introduced another field PartitionTupleRouting.num_subplan_partition_offsets, so that above, we can add another condition (subplan_index < proute->num_subplan_partition_offsets) in order to stop accessing the array once we are done with all the offset array elements.

Ran the update.sql test with valgrind enabled on my laptop, and the valgrind output now does not show errors.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

fix_valgrind_issue.patch
Description: Binary data
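For readers following along, the shape of the guarded lookup can be sketched in isolation. This is a self-contained illustration, not the committed PostgreSQL code: the stub struct and the helper name leaf_owned_by_subplan are mine, standing in for PartitionTupleRouting and the inline check in ExecCleanupTupleRouting.

```c
#include <assert.h>
#include <stddef.h>

/* Stub standing in for the two PartitionTupleRouting fields involved. */
typedef struct PartitionTupleRoutingStub
{
    int *subplan_partition_offsets;     /* sorted leaf indexes owned by subplans */
    int  num_subplan_partition_offsets; /* the newly introduced length field */
} PartitionTupleRoutingStub;

/*
 * Return 1 if leaf partition i came from an UPDATE subplan (and so must be
 * skipped during cleanup), advancing *subplan_index past matched offsets.
 * The num_subplan_partition_offsets bound stops us reading past the end of
 * the array once every subplan offset has been consumed -- the invalid
 * read valgrind complained about.
 */
static int
leaf_owned_by_subplan(PartitionTupleRoutingStub *proute, int i,
                      int *subplan_index)
{
    if (proute->subplan_partition_offsets &&
        *subplan_index < proute->num_subplan_partition_offsets &&
        proute->subplan_partition_offsets[*subplan_index] == i)
    {
        (*subplan_index)++;
        return 1;
    }
    return 0;
}
```

With offsets {0, 2} and four leaves, the call for leaf 3 hits the new length check first and never dereferences offsets[2], which is exactly the access that used to run off the end.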
Re: [HACKERS] UPDATE of partition key
On Sun, Jan 21, 2018 at 1:43 AM, Tom Lane wrote:
> Robert Haas writes:
>> Committed with a bunch of mostly-cosmetic revisions.
>
> Buildfarm member skink has been unhappy since this patch went in.
> Running the regression tests under valgrind easily reproduces the
> failure. Now, I might be wrong about which of the patches committed
> on Friday caused the unhappiness, but the valgrind backtrace sure
> looks like it's to do with partition routing:

Yeah, that must be the fault of this patch. We assign to proute->subplan_partition_offsets[update_rri_index] from update_rri_index = 0 .. num_update_rri, and there's an Assert() at the bottom of this function that checks this, so probably this is indexing off the end of the array. I bet the issue happens when we find all of the UPDATE result rels while there are still partitions left; then, subplan_index will be equal to the length of the proute->subplan_partition_offsets array and we'll be indexing just off the end.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] UPDATE of partition key
Robert Haas writes:
> Committed with a bunch of mostly-cosmetic revisions.

Buildfarm member skink has been unhappy since this patch went in. Running the regression tests under valgrind easily reproduces the failure. Now, I might be wrong about which of the patches committed on Friday caused the unhappiness, but the valgrind backtrace sure looks like it's to do with partition routing:

==00:00:05:49.683 17549== Invalid read of size 4
==00:00:05:49.683 17549==    at 0x62A8BA: ExecCleanupTupleRouting (execPartition.c:483)
==00:00:05:49.683 17549==    by 0x6483AA: ExecEndModifyTable (nodeModifyTable.c:2682)
==00:00:05:49.683 17549==    by 0x627139: standard_ExecutorEnd (execMain.c:1604)
==00:00:05:49.683 17549==    by 0x7780AF: ProcessQuery (pquery.c:206)
==00:00:05:49.683 17549==    by 0x7782E4: PortalRunMulti (pquery.c:1286)
==00:00:05:49.683 17549==    by 0x778AAF: PortalRun (pquery.c:799)
==00:00:05:49.683 17549==    by 0x774E4C: exec_simple_query (postgres.c:1120)
==00:00:05:49.683 17549==    by 0x776C17: PostgresMain (postgres.c:4143)
==00:00:05:49.683 17549==    by 0x6FA419: PostmasterMain (postmaster.c:4412)
==00:00:05:49.683 17549==    by 0x66E51F: main (main.c:228)
==00:00:05:49.683 17549== Address 0xe25e298 is 2,088 bytes inside a block of size 32,768 alloc'd
==00:00:05:49.683 17549==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==00:00:05:49.683 17549==    by 0x89EB15: AllocSetAlloc (aset.c:945)
==00:00:05:49.683 17549==    by 0x8A7577: palloc (mcxt.c:848)
==00:00:05:49.683 17549==    by 0x671969: new_list (list.c:68)
==00:00:05:49.683 17549==    by 0x672859: lappend_oid (list.c:169)
==00:00:05:49.683 17549==    by 0x55330E: find_inheritance_children (pg_inherits.c:144)
==00:00:05:49.683 17549==    by 0x553447: find_all_inheritors (pg_inherits.c:203)
==00:00:05:49.683 17549==    by 0x62AC76: ExecSetupPartitionTupleRouting (execPartition.c:68)
==00:00:05:49.683 17549==    by 0x64949D: ExecInitModifyTable (nodeModifyTable.c:2232)
==00:00:05:49.683 17549==    by 0x62BBE8: ExecInitNode (execProcnode.c:174)
==00:00:05:49.683 17549==    by 0x627B53: standard_ExecutorStart (execMain.c:1043)
==00:00:05:49.683 17549==    by 0x778046: ProcessQuery (pquery.c:156)

(This is my local result, but skink's log looks about the same.)

regards, tom lane
Re: [HACKERS] UPDATE of partition key
On Fri, Jan 19, 2018 at 4:37 AM, Amit Khandekar wrote:
> Attached rebased patch.

Committed with a bunch of mostly-cosmetic revisions. I removed the macro you added, which has a multiple evaluation hazard, and just put that logic back into the function. I don't think it's likely to matter for performance, and this way is safer. I removed an inline keyword from another static function as well; better to let the compiler decide what to do. I rearranged a few things to shorten some long lines, too. Aside from that I think all of the changes I made were to comments and documentation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] UPDATE of partition key
On 16 January 2018 at 09:17, David Rowley wrote:
> On 16 January 2018 at 01:09, Robert Haas wrote:
>> On Sun, Jan 14, 2018 at 6:57 AM, Amit Khandekar wrote:
>>> Even where partitions are present, in the usual case where there are
>>> Instead of a bool array, we can even make it a Bitmapset. But I think
>>> access would become slower as compared to array, particularly because
>>> it is going to be a heavily used function.
>>
>> It probably makes little difference -- the Bitmapset will be more
>> compact (which saves time) but involve function calls (which cost
>> time).
>
> I'm not arguing in either direction, but you'd also want to factor in
> how Bitmapsets only allocate words for the maximum stored member,
> which might mean multiple realloc() calls resulting in palloc/memcpy
> calls. The array would just be allocated in a single chunk, although
> it would be more memory and would require a memset too, however,
> that's likely much cheaper than the palloc() anyway.

Right. I agree. And also a function call would be needed to know whether the map is required or not. Overall, I think especially because the data structure will be used heavily whenever it is set up, it's better to make it an array. In the latest patch, I have retained it as an array.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 16 January 2018 at 01:09, Robert Haas wrote:
> On Sun, Jan 14, 2018 at 6:57 AM, Amit Khandekar wrote:
>> Even where partitions are present, in the usual case where there are
>> Instead of a bool array, we can even make it a Bitmapset. But I think
>> access would become slower as compared to array, particularly because
>> it is going to be a heavily used function.
>
> It probably makes little difference -- the Bitmapset will be more
> compact (which saves time) but involve function calls (which cost
> time).

I'm not arguing in either direction, but you'd also want to factor in how Bitmapsets only allocate words for the maximum stored member, which might mean multiple realloc() calls resulting in palloc/memcpy calls. The array would just be allocated in a single chunk, although it would be more memory and would require a memset too, however, that's likely much cheaper than the palloc() anyway.

--
David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
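David's point about growth-by-maximum-member can be illustrated with a toy grow-on-demand bit set. This is a deliberately simplified stand-in for PostgreSQL's Bitmapset (the names here are invented for the sketch), counting how many growth events ascending insertions cause.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BITS_PER_WORD 64

/* Toy set that, like Bitmapset, only holds words up to the largest member. */
typedef struct
{
    uint64_t *words;
    int       nwords;
    int       reallocs;         /* counts the repalloc-style growth events */
} GrowSet;

static void
growset_add(GrowSet *s, int member)
{
    int need = member / BITS_PER_WORD + 1;

    if (need > s->nwords)
    {
        /* Each new high-water member forces a realloc + zeroing of the tail. */
        s->words = realloc(s->words, need * sizeof(uint64_t));
        memset(s->words + s->nwords, 0,
               (need - s->nwords) * sizeof(uint64_t));
        s->nwords = need;
        s->reallocs++;
    }
    s->words[member / BITS_PER_WORD] |= UINT64_C(1) << (member % BITS_PER_WORD);
}
```

Adding members 0, 100, and 500 in ascending order triggers three growth events, whereas a flat bool array sized up front is a single allocation plus one memset -- which is the trade-off being weighed above.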
Re: [HACKERS] UPDATE of partition key
On 13 January 2018 at 02:56, Robert Haas wrote:
> On Fri, Jan 12, 2018 at 12:23 PM, Amit Khandekar wrote:
>>> (1) if they need it by subplan index, first use
>>> subplan_partition_offsets to convert it to a per-leaf index
>>
>> Before that, we need to check if there *is* an offset array. If there
>> are no partitions, there is only going to be a per-subplan array,
>> there won't be an offsets array. But I guess, you are saying: "do the
>> on-demand allocation only for leaf partitions; if there are no
>> partitions, the per-subplan maps will always be allocated for each of
>> the subplans from the beginning". So if there is no offset array,
>> just return mtstate->mt_per_subplan_tupconv_maps[subplan_index]
>> without any further checks.
>
> Oops. I forgot that there might not be partitions. I was assuming
> that mt_per_subplan_tupconv_maps wouldn't exist at all, and we'd
> always use subplan_partition_offsets. But that won't work in the
> inheritance case.
>
>> But after that, I am not sure then why is mt_per_sub_plan_maps[] array
>> needed ? We are always going to convert the subplan index into leaf
>> index, so per-subplan map array will not come into picture. Or are you
>> saying, it will be allocated and used only when there are no
>> partitions ? From one of your earlier replies, you did mention about
>> trying to share the maps between the two arrays, that means you were
>> considering both arrays being used at the same time.
>
> We'd use them both at the same time if we didn't have, or didn't use,
> subplan_partition_offsets, but if we have subplan_partition_offsets
> and can use it then we don't need mt_per_sub_plan_maps.
>
> I guess I'm inclined to keep mt_per_sub_plan_maps for the case where
> there are no partitions, but not use it when partitions are present.
> What do you think about that?

Even where partitions are present, in the usual case where there are no transition tables we won't require a per-leaf map at all [1].
So I think we should keep mt_per_sub_plan_maps only for the case where the per-leaf map is not allocated. And we will not allocate mt_per_sub_plan_maps when mt_per_leaf_maps is needed. In other words, exactly one of the two maps will be allocated.

This is turning out to be close to what's already there in the last patch versions: use a single map array, and an offsets array. The difference is: in the patch I am using the *same* variable for the two maps, whereas now we are talking about two different array variables for maps, but only allocating one of them. Are you ok with this? I think the thing you were against was to have a common *variable* for two purposes. But above, I am saying we have two variables but assign a map array to only *one* of them and leave the other unused.

- Regarding the on-demand map allocation:
Where mt_per_sub_plan_maps is allocated, we won't have the on-demand allocation: all the maps will be allocated initially. The reason is that the map_is_required array is only per-leaf. Or else, again, we need to keep another map_is_required array for per-subplan. Maybe we can support the on-demand stuff for subplan maps also, but only as a separate change after we are done with update-partition-key.

- Regarding mt_per_leaf_tupconv_required:
I am thinking we can make it a bool array, and name it mt_per_leaf_map_not_required. When it is true for a given index, it means we have already called convert_tuples_by_name() and it returned NULL; i.e. it means we are sure that the map is not required. A false value means we need to call convert_tuples_by_name() if the map is NULL, and then set mt_per_leaf_map_not_required to (map == NULL).

Instead of a bool array, we can even make it a Bitmapset. But I think access would become slower as compared to an array, particularly because it is going to be a heavily used function.
[1]
- For update-tuple-routing, only per-subplan access is required.
- For transition tables, per-subplan access is required, and additionally per-leaf access is required when tuples are update-routed.
- So if both update-tuple-routing and transition tables are required, both of the maps are needed.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On Fri, Jan 12, 2018 at 12:23 PM, Amit Khandekar wrote:
>> (1) if they need it by subplan index, first use
>> subplan_partition_offsets to convert it to a per-leaf index
>
> Before that, we need to check if there *is* an offset array. If there
> are no partitions, there is only going to be a per-subplan array,
> there won't be an offsets array. But I guess, you are saying: "do the
> on-demand allocation only for leaf partitions; if there are no
> partitions, the per-subplan maps will always be allocated for each of
> the subplans from the beginning". So if there is no offset array,
> just return mtstate->mt_per_subplan_tupconv_maps[subplan_index]
> without any further checks.

Oops. I forgot that there might not be partitions. I was assuming that mt_per_subplan_tupconv_maps wouldn't exist at all, and we'd always use subplan_partition_offsets. But that won't work in the inheritance case.

> But after that, I am not sure then why is mt_per_sub_plan_maps[] array
> needed ? We are always going to convert the subplan index into leaf
> index, so per-subplan map array will not come into picture. Or are you
> saying, it will be allocated and used only when there are no
> partitions ? From one of your earlier replies, you did mention about
> trying to share the maps between the two arrays, that means you were
> considering both arrays being used at the same time.

We'd use them both at the same time if we didn't have, or didn't use, subplan_partition_offsets, but if we have subplan_partition_offsets and can use it then we don't need mt_per_sub_plan_maps.

I guess I'm inclined to keep mt_per_sub_plan_maps for the case where there are no partitions, but not use it when partitions are present. What do you think about that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] UPDATE of partition key
On 12 January 2018 at 20:24, Robert Haas wrote:
> On Fri, Jan 12, 2018 at 5:12 AM, Amit Khandekar wrote:
>> The reason why I am having map_required field inside a structure along
>> with the map, as against a separate array, is so that we can do the
>> on-demand allocation for both per-leaf array and per-subplan array.
>
> Putting the map_required field inside the structure with the map makes
> it completely silly to do the 0/1/2 thing, because the whole structure
> is going to be on the same cache line anyway. It won't save anything
> to access the flag instead of a pointer in the same struct.

I see. Got it.

> Also, the uint8 will be followed by 7 bytes of padding, because the
> pointer that follows will need to begin on an 8-byte boundary (at
> least, on 64-bit machines), so this will use more memory.
>
> What I suggest is:
>
> #define MT_CONVERSION_REQUIRED_UNKNOWN 0
> #define MT_CONVERSION_REQUIRED_YES     1
> #define MT_CONVERSION_REQUIRED_NO      2
>
> In ModifyTableState:
>
> uint8 *mt_per_leaf_tupconv_required;
> TupleConversionMap **mt_per_leaf_tupconv_maps;
>
> In PartitionTupleRouting:
>
> int *subplan_partition_offsets;
>
> When you initialize the ModifyTableState, do this:
>
> mtstate->mt_per_leaf_tupconv_required =
>     palloc0(sizeof(uint8) * numResultRelInfos);
> mtstate->mt_per_leaf_tupconv_maps =
>     palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);

A few points below where I wanted to confirm that we are on the same page ...

> When somebody needs a map, then
>
> (1) if they need it by subplan index, first use
> subplan_partition_offsets to convert it to a per-leaf index

Before that, we need to check if there *is* an offset array. If there are no partitions, there is only going to be a per-subplan array, there won't be an offsets array. But I guess, you are saying: "do the on-demand allocation only for leaf partitions; if there are no partitions, the per-subplan maps will always be allocated for each of the subplans from the beginning". So if there is no offset array, just return mtstate->mt_per_subplan_tupconv_maps[subplan_index] without any further checks.

> (2) then write a function that takes the per-leaf index and does this:
>
> switch (mtstate->mt_per_leaf_tupconv_required[leaf_part_index])
> {
>     case MT_CONVERSION_REQUIRED_UNKNOWN:
>         map = convert_tuples_by_name(...);
>         if (map == NULL)
>             mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
>                 MT_CONVERSION_REQUIRED_NO;
>         else
>         {
>             mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
>                 MT_CONVERSION_REQUIRED_YES;
>             mtstate->mt_per_leaf_tupconv_maps[leaf_part_index] = map;
>         }
>         return map;
>     case MT_CONVERSION_REQUIRED_YES:
>         return mtstate->mt_per_leaf_tupconv_maps[leaf_part_index];
>     case MT_CONVERSION_REQUIRED_NO:
>         return NULL;
> }

Yeah, right.

But after that, I am not sure then why is the mt_per_sub_plan_maps[] array needed? We are always going to convert the subplan index into a leaf index, so the per-subplan map array will not come into the picture. Or are you saying it will be allocated and used only when there are no partitions? From one of your earlier replies, you did mention about trying to share the maps between the two arrays; that means you were considering both arrays being used at the same time.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On Fri, Jan 12, 2018 at 5:12 AM, Amit Khandekar wrote:
> The reason why I am having map_required field inside a structure along
> with the map, as against a separate array, is so that we can do the
> on-demand allocation for both per-leaf array and per-subplan array.

Putting the map_required field inside the structure with the map makes it completely silly to do the 0/1/2 thing, because the whole structure is going to be on the same cache line anyway. It won't save anything to access the flag instead of a pointer in the same struct. Also, the uint8 will be followed by 7 bytes of padding, because the pointer that follows will need to begin on an 8-byte boundary (at least, on 64-bit machines), so this will use more memory.

What I suggest is:

#define MT_CONVERSION_REQUIRED_UNKNOWN 0
#define MT_CONVERSION_REQUIRED_YES     1
#define MT_CONVERSION_REQUIRED_NO      2

In ModifyTableState:

uint8 *mt_per_leaf_tupconv_required;
TupleConversionMap **mt_per_leaf_tupconv_maps;

In PartitionTupleRouting:

int *subplan_partition_offsets;

When you initialize the ModifyTableState, do this:

mtstate->mt_per_leaf_tupconv_required =
    palloc0(sizeof(uint8) * numResultRelInfos);
mtstate->mt_per_leaf_tupconv_maps =
    palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);

When somebody needs a map, then

(1) if they need it by subplan index, first use subplan_partition_offsets to convert it to a per-leaf index

(2) then write a function that takes the per-leaf index and does this:

switch (mtstate->mt_per_leaf_tupconv_required[leaf_part_index])
{
    case MT_CONVERSION_REQUIRED_UNKNOWN:
        map = convert_tuples_by_name(...);
        if (map == NULL)
            mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
                MT_CONVERSION_REQUIRED_NO;
        else
        {
            mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
                MT_CONVERSION_REQUIRED_YES;
            mtstate->mt_per_leaf_tupconv_maps[leaf_part_index] = map;
        }
        return map;
    case MT_CONVERSION_REQUIRED_YES:
        return mtstate->mt_per_leaf_tupconv_maps[leaf_part_index];
    case MT_CONVERSION_REQUIRED_NO:
        return NULL;
}

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
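Robert's switch is easy to exercise outside the executor. The following is a self-contained sketch, not PostgreSQL code: convert_tuples_by_name_stub and the other names are stand-ins for the real APIs. It demonstrates the property the tri-state flag buys: each leaf's map is built at most once, and NULL ("no conversion needed") results are cached too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MT_CONVERSION_REQUIRED_UNKNOWN 0
#define MT_CONVERSION_REQUIRED_YES     1
#define MT_CONVERSION_REQUIRED_NO      2

/* Stand-ins for TupleConversionMap and convert_tuples_by_name(). */
typedef struct TupleConversionMapStub { int dummy; } TupleConversionMapStub;

static int build_calls;         /* counts how often the expensive build runs */

static TupleConversionMapStub *
convert_tuples_by_name_stub(int leaf_part_index)
{
    static TupleConversionMapStub map;

    build_calls++;
    /* Pretend even-numbered partitions have matching descriptors (no map). */
    return (leaf_part_index % 2 == 0) ? NULL : &map;
}

/* Lazily build and cache the per-leaf map, in the shape suggested above. */
static TupleConversionMapStub *
tupconv_map_for_leaf(uint8_t *required, TupleConversionMapStub **maps,
                     int leaf_part_index)
{
    switch (required[leaf_part_index])
    {
        case MT_CONVERSION_REQUIRED_UNKNOWN:
        {
            TupleConversionMapStub *map =
                convert_tuples_by_name_stub(leaf_part_index);

            required[leaf_part_index] = (map == NULL)
                ? MT_CONVERSION_REQUIRED_NO
                : MT_CONVERSION_REQUIRED_YES;
            maps[leaf_part_index] = map;
            return map;
        }
        case MT_CONVERSION_REQUIRED_YES:
            return maps[leaf_part_index];
        default:                /* MT_CONVERSION_REQUIRED_NO */
            return NULL;
    }
}
```

Repeated lookups for the same leaf return the cached answer without calling the build function again, which is the point of distinguishing UNKNOWN from NO rather than testing the map pointer for NULL.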
Re: [HACKERS] UPDATE of partition key
On Thu, Jan 11, 2018 at 6:07 AM, Amit Khandekar wrote:
> In the first paragraph of my explanation, I was explaining why two
> Transition capture states does not look like a good idea to me:

Oh, sorry. I didn't read what you wrote carefully enough, I guess. I see your points. I think that there is probably a general need for some refactoring here. AfterTriggerSaveEvent() got significantly more complicated and harder to understand with the arrival of transition tables, and this patch is adding more complexity still. It's also adding complexity in other places to make ExecInsert() and ExecDelete() usable for the semi-internal DELETE/INSERT operations being produced when we split a partition key update into a DELETE and INSERT pair. It would be awfully nice to have some better way to separate out each of the different things we might or might not want to do depending on the situation: capture old tuple, capture new tuple, fire before triggers, fire after triggers, count processed rows, set command tag, perform actual heap operation, update indexes, etc. However, I don't have a specific idea how to do it better, so maybe we should just get this committed for now and perhaps, with more eyes on the code, someone will have a good idea.

> Slight correction; it was suggested by Amit Langote; not by David.

Oh, OK, sorry.

> So there are two independent optimizations we are talking about:
>
> 1. Create the map only when needed.
> 2. In case of UPDATE, for partitions that take part in update scans,
> there should be a single map; there should not be two separate maps,
> one for accessing per-subplan and the other for accessing per-leaf.

These optimizations aren't completely independent. Optimization #2 can be implemented in several different ways. The way you've chosen to do it is to index the same array in two different ways depending on whether per-leaf indexing is not needed, which I think is unacceptable.
Another approach, which I proposed upthread, is to always build the per-leaf mapping, but you pointed out that this could involve doing a lot of unnecessary work in the case where most leaves were pruned. However, if you also implement #1, then that problem goes away. In other words, depending on the design you choose for #2, you may or may not need to also implement optimization #1 to get good performance.

To put that another way, I think Amit's idea of keeping a subplan-offsets array is a pretty good one. From your comments, you do too. But if we want to keep that, then we need a way to avoid the expense of populating it for leaves that got pruned, except when we are doing update row movement. Otherwise, I don't see much choice but to jettison the subplan-offsets array and just maintain two separate arrays of mappings.

> Regarding the array names ...
>
> Noting all this, I feel we can go with names according to the
> structure of maps. Something like mt_perleaf_tupconv_maps and
> mt_persubplan_tupconv_maps. Other suggestions welcome.

I'd probably do mt_per_leaf_tupconv_maps, since inserting an underscore between some but not all words seems strange. But OK otherwise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
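The "exactly one of the two arrays" idea being converged on in this exchange can be pinned down in a few lines. This sketch is mine (stub types and an invented init_maps helper), not patch code; it just records the invariant: per-leaf maps exist only when tuple routing does, and per-subplan maps only otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct TupleConversionMapStub { int dummy; } TupleConversionMapStub;

/* Minimal state mirroring the "allocate exactly one array" design. */
typedef struct
{
    TupleConversionMapStub **per_subplan_maps;  /* only when no routing */
    TupleConversionMapStub **per_leaf_maps;     /* only when routing */
} MapStateStub;

static void
init_maps(MapStateStub *st, int nsubplans, int nleaves, int routing_needed)
{
    if (routing_needed)
    {
        /* Partition routing: per-leaf maps, reached via subplan offsets. */
        st->per_leaf_maps = calloc(nleaves, sizeof(TupleConversionMapStub *));
        st->per_subplan_maps = NULL;
    }
    else
    {
        /* Plain inheritance: per-subplan maps only. */
        st->per_subplan_maps = calloc(nsubplans,
                                      sizeof(TupleConversionMapStub *));
        st->per_leaf_maps = NULL;
    }
}
```

Because at most one array is ever non-NULL, callers cannot accidentally index the same storage two different ways, which is the objection to the single shared variable in the earlier patch versions.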
Re: [HACKERS] UPDATE of partition key
On 11 January 2018 at 10:44, David Rowley wrote:
>>> 18. Why two RESET SESSION AUTHORIZATIONs?
>>>
>>> reset session authorization;
>>> drop trigger trig_d_1_15 ON part_d_1_15;
>>> drop function func_d_1_15();
>>> -- Policy expression contains SubPlan
>>> reset session authorization;
>>
>> The second reset is actually in a different paragraph. The reason it's
>> there is to ensure we have reset it regardless of the earlier cleanup.
>
> hmm, I was reviewing the .out file, which does not have the empty
> lines. Still seems a bit surplus.

I believe the output file does not have the blank lines present in the .sql file. I was referring to the paragraph in the *.sql* file.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
Thanks for making those changes.

On 11 January 2018 at 04:00, Amit Khandekar wrote:
> Yes, I understand that there won't be any update scan plans. But, with
> the modifications done in ExecInitModifyTable(), I wanted to run that
> code with this scenario where there are no partitions, to make sure it
> does not behave weirdly or crash. Any suggestions for comments, given
> this perspective? For now, I have made the comment this way:
>
> -- Check that partition-key UPDATE works sanely on a partitioned table
> -- that does not have any child partitions.

Sounds good.

>> 18. Why two RESET SESSION AUTHORIZATIONs?
>>
>> reset session authorization;
>> drop trigger trig_d_1_15 ON part_d_1_15;
>> drop function func_d_1_15();
>> -- Policy expression contains SubPlan
>> reset session authorization;
>
> The second reset is actually in a different paragraph. The reason it's
> there is to ensure we have reset it regardless of the earlier cleanup.

hmm, I was reviewing the .out file, which does not have the empty lines. Still seems a bit surplus.

> Attached v35 patch. Thanks.

Thanks. I'll try to look at it soon.

--
David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] UPDATE of partition key
On Fri, Jan 5, 2018 at 3:25 PM, Robert Haas wrote:
> On Fri, Jan 5, 2018 at 7:12 AM, Amit Khandekar wrote:
>> The above patch is to be applied over the last remaining preparatory
>> patch, now named (and attached):
>> 0001-Refactor-CheckConstraint-related-code.patch
>
> Committed that one, too.

Some more comments on the main patch:

I don't really like the fact that ExecCleanupTupleRouting() now takes a ModifyTableState as an argument, particularly because of the way that it is using that argument. To figure out whether a ResultRelInfo was pre-existing or one it created, it checks whether the pointer address of the ResultRelInfo is >= mtstate->resultRelInfo and < mtstate->resultRelInfo + mtstate->mt_nplans. However, that means that ExecCleanupTupleRouting() ends up knowing about the memory allocation pattern used by ExecInitModifyTable(), which seems like a slightly dangerous amount of action at a distance. I think it would be better for the PartitionTupleRouting structure to explicitly indicate which ResultRelInfos should be closed, for example by storing a Bitmapset *input_partitions. (Here, by "input", I mean "provided from the mtstate rather than created by the PartitionTupleRouting structure"; other naming suggestions welcome.) When ExecSetupPartitionTupleRouting latches onto a partition, it can do proute->input_partitions = bms_add_member(proute->input_partitions, i). In ExecCleanupTupleRouting, it can do if (bms_is_member(i, proute->input_partitions)) continue.

We have a test, in the regression test suite for file_fdw, which generates the message "cannot route inserted tuples to a foreign table". I think we should have a similar test for the case where an UPDATE tries to move a tuple from a regular partition to a foreign table partition. I'm not sure if it should fail with the same error or a different one, but I think we should have a test that it fails cleanly and with a nice error message of some sort.
The comment for get_partitioned_child_rels() claims that it sets is_partition_key_update, but it really sets *is_partition_key_update. And I think instead of "is a partition key" it should say "is used in the partition key either of the relation whose RTI is specified or of any child relation." I propose "used in" instead of "is" because there can be partition expressions, and the rest is to clarify that child partition keys matter. create_modifytable_path uses partColsUpdated rather than partKeyUpdated, which actually seems like better terminology. I propose partKeyUpdated -> partColsUpdated everywhere. Also, why use is_partition_key_update for basically the same thing in some other places? I propose changing that to partColsUpdated as well. The capitalization of the first comment hunk in execPartition.h is strange. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
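The cleanup loop proposed above is simple enough to model with a bit mask. In this sketch a single 64-bit word stands in for PostgreSQL's Bitmapset (the `_stub` names are mine; real code would use bms_add_member and bms_is_member), showing that only the routing-created ResultRelInfos get closed while the ones borrowed from the subplans are skipped.

```c
#include <assert.h>
#include <stdint.h>

/* A single 64-bit word stands in for PostgreSQL's Bitmapset here. */
typedef uint64_t BitmapsetStub;

static BitmapsetStub
bms_add_member_stub(BitmapsetStub set, int i)
{
    return set | (UINT64_C(1) << i);
}

static int
bms_is_member_stub(int i, BitmapsetStub set)
{
    return (int) ((set >> i) & 1);
}

/*
 * Cleanup loop in the proposed shape: close only the partition
 * ResultRelInfos that tuple routing itself created, skipping the
 * ones borrowed from the ModifyTable subplans.
 */
static int
count_closed(BitmapsetStub input_partitions, int nparts)
{
    int closed = 0;

    for (int i = 0; i < nparts; i++)
    {
        if (bms_is_member_stub(i, input_partitions))
            continue;           /* borrowed from mtstate: leave it open */
        closed++;               /* created by routing: close it here */
    }
    return closed;
}
```

The set membership replaces the pointer-range comparison against mtstate->resultRelInfo, so cleanup no longer depends on how ExecInitModifyTable() laid out its allocations.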
Re: [HACKERS] UPDATE of partition key
On 4 January 2018 at 02:52, David Rowley wrote:
> I'll try to look at the tests tomorrow and also do some testing.

I've made a pass over the tests. Again, sometimes I'm probably a bit pedantic. The reason for that is that the tests are not that easy to follow. Moving creation and cleanup of objects closer to where they're used and no longer needed makes it easier to read through and verify the tests. There are some genuine mistakes in there too.

1. NEW.c = NEW.c + 1; -- Make even number odd, or vice versa

This seems to be worded as if there'd only ever be one number. I think it should be plural and read "Make even numbers odd, and vice versa"

2. The following comment does not make a huge amount of sense.

-- UPDATE with
-- partition key or non-partition columns, with different column ordering,
-- triggers.

Should "or" be "on"? Does ", triggers" mean "with triggers"?

3. The following test tries to test a BEFORE DELETE trigger stopping a DELETE on sub_part1, but going by the SELECT, there are no rows in that table to stop being DELETEd.

select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
  tableoid  | a | b  | c
------------+---+----+----
 list_part1 | 2 | 52 | 50
 list_part1 | 3 |  6 | 60
 sub_part2  | 1 |  2 | 10
 sub_part2  | 1 |  2 | 70
(4 rows)

drop trigger parted_mod_b ON sub_part1 ;
-- If BR DELETE trigger prevented DELETE from happening, we should also skip
-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
create or replace function func_parted_mod_b() returns trigger as $$
begin
  return NULL;
end $$ language plpgsql;
create trigger trig_skip_delete before delete on sub_part1
  for each row execute procedure func_parted_mod_b();
update list_parted set b = 1 where c = 70;
select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
  tableoid  | a | b  | c
------------+---+----+----
 list_part1 | 2 | 52 | 50
 list_part1 | 3 |  6 | 60
 sub_part1  | 1 |  1 | 70
 sub_part2  | 1 |  2 | 10
(4 rows)

You've added the BEFORE DELETE trigger to sub_part1, but you can see the tuple was DELETEd from sub_part2 and INSERTed into sub_part1, so the test is not working as you've commented. It's probably a good idea to RAISE NOTICE 'something useful here'; in the trigger function to verify they're actually being called in the test.

4. I think the final drop function in the following should be before the UPDATE FROM test. You've already done some cleanup for that test by doing "drop trigger trig_skip_delete ON sub_part1 ;"

drop trigger trig_skip_delete ON sub_part1 ;
-- UPDATE partition-key with FROM clause. If join produces multiple output
-- rows for the same row to be modified, we should tuple-route the row only
-- once. There should not be any rows inserted.
create table non_parted (id int);
insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
  tableoid  | a | b  | c
------------+---+----+----
 list_part1 | 2 |  1 | 70
 list_part1 | 2 |  2 | 10
 list_part1 | 2 | 52 | 50
 list_part1 | 3 |  6 | 60
(4 rows)

drop table non_parted;
drop function func_parted_mod_b();

Also, there's a space before the ; in the drop trigger above. Can that be removed?

5. The following comment:

-- update to a partition should check partition bound constraint for the new tuple.
-- If partition key is updated, the row should be moved to the appropriate
-- partition.
updatable views using partitions should enforce the check options
-- for the rows that have been moved.

Can this be changed a bit? I think it's not accurate to say that an update to a partition key causes the row to move. The row movement only occurs when the new tuple does not match the partition bound and another partition exists that does have a partition bound that matches the tuple. How about:

-- When a partitioned table receives an UPDATE to the partitioned key and the
-- new values no longer meet the partition's bound, the row must be moved to
-- the correct partition for the new partition key (if one exists). We must
-- also ensure that updatable views on partitioned tables properly enforce any
-- WITH CHECK OPTION that is defined. The situation with triggers in this case
-- also requires thorough testing as partition key updates causing row
-- movement convert UPDATEs into DELETE+INSERT.

6. What does the following actually test?

-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
update part_b_10_b_20 set b = b - 6;

There are no records in that partition, or anywhere in the hierarchy. Are you just testing that there's no error? If so then the comment should say so.

7. I think the following comment:

-- As mentioned above, the partition creation is intentionally kept in descending bound order.

should instead say:

-- Create some more partitions
Re: [HACKERS] UPDATE of partition key
On Wed, Jan 3, 2018 at 6:29 AM, Amit Khandekarwrote: > ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *. > > Did this change in v3 version of > 0001-Encapsulate-partition-related-info-in-a-structure.patch I'll have to come back to some of the other open issues, but 0001 and 0005 look good to me now, so I pushed those as a single commit after fixing a few things that pgindent didn't like. I also think 0002 and 0003 look basically good, so I pushed those two as a single commit also. But the comment changes in 0003 didn't seem extensive enough to me so I made a few more changes there along the way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] UPDATE of partition key
On 4 January 2018 at 02:52, David Rowley wrote:
> I'll try to look at the tests tomorrow and also do some testing. So
> far I've only read the code and the docs.

There are a few more things I noticed on another pass I made today:

20. "carried" -> "carried out the"

+ would have identified the newly updated row and carried
+ UPDATE/DELETE on this new row

21. Extra new line

+ .
+

22. In copy.c CopyFrom() you have the following code:

/*
 * We might need to convert from the parent rowtype to the
 * partition rowtype.
 */
map = proute->partition_tupconv_maps[leaf_part_index];
if (map)
{
    Relation partrel = resultRelInfo->ri_RelationDesc;

    tuple = do_convert_tuple(tuple, map);

    /*
     * We must use the partition's tuple descriptor from this
     * point on. Use a dedicated slot from this point on until
     * we're finished dealing with the partition.
     */
    slot = proute->partition_tuple_slot;
    Assert(slot != NULL);
    ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
    ExecStoreTuple(tuple, slot, InvalidBuffer, true);
}

Should this use ConvertPartitionTupleSlot() instead?

23. Why write:

last_resultRelInfo = mtstate->resultRelInfo + mtstate->mt_nplans;

when you can write:

last_resultRelInfo = &mtstate->resultRelInfo[mtstate->mt_nplans];?

24. In ExecCleanupTupleRouting(), do you think that you could just have a special case loop for (mtstate && mtstate->operation == CMD_UPDATE)?

/*
 * If this result rel is one of the UPDATE subplan result rels, let
 * ExecEndPlan() close it. For INSERT or COPY, this does not apply
 * because leaf partition result rels are always newly allocated.
 */
if (is_update &&
    resultRelInfo >= first_resultRelInfo &&
    resultRelInfo < last_resultRelInfo)
    continue;

Something like:

if (mtstate && mtstate->operation == CMD_UPDATE)
{
    ResultRelInfo *first_resultRelInfo = mtstate->resultRelInfo;
    ResultRelInfo *last_resultRelInfo = &mtstate->resultRelInfo[mtstate->mt_nplans];

    for (i = 0; i < proute->num_partitions; i++)
    {
        ResultRelInfo *resultRelInfo = proute->partitions[i];

        /*
         * Leave any resultRelInfos that belong to the UPDATE's subplan
         * list. These will be closed during executor shutdown.
         */
        if (resultRelInfo >= first_resultRelInfo &&
            resultRelInfo < last_resultRelInfo)
            continue;

        ExecCloseIndices(resultRelInfo);
        heap_close(resultRelInfo->ri_RelationDesc, NoLock);
    }
}
else
{
    for (i = 0; i < proute->num_partitions; i++)
    {
        ResultRelInfo *resultRelInfo = proute->partitions[i];

        ExecCloseIndices(resultRelInfo);
        heap_close(resultRelInfo->ri_RelationDesc, NoLock);
    }
}

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
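The "leave the UPDATE subplan result rels for ExecEndPlan()" test above boils down to checking whether a pointer falls inside a contiguous array by address comparison. A minimal standalone sketch of just that check, using a toy ResultRelInfo stand-in rather than PostgreSQL's real executor structs:

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for PostgreSQL's ResultRelInfo; only the address matters
 * here.  A result rel belongs to the UPDATE's subplan array exactly when
 * its address lies in the half-open range [first, first + nplans). */
typedef struct ResultRelInfo
{
    int ri_dummy;
} ResultRelInfo;

static bool
belongs_to_subplan_array(const ResultRelInfo *rri,
                         const ResultRelInfo *first, int nplans)
{
    const ResultRelInfo *last = &first[nplans];     /* one past the end */

    return rri >= first && rri < last;
}

/* Fixture used by the assertions below. */
static ResultRelInfo demo_rels[3];
```

Pointers one past the end of an array (`&first[nplans]`) are valid for comparison in C, which is why the `+ mt_nplans` / `&...[mt_nplans]` forms discussed in item 23 are interchangeable.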
Re: [HACKERS] UPDATE of partition key
Robert, for tracking purpose, below I have consolidated your review items on which we are yet to conclude. Let me know if you have more comments on the points which I made. -- 1. ExecUpdate() needs to revert back tcs_map value changed by ExecInsert() -- >> + /* >> + * UPDATEs set the transition capture map only when a new subplan >> + * is chosen. But for INSERTs, it is set for each row. So after >> + * INSERT, we need to revert back to the map created for UPDATE; >> + * otherwise the next UPDATE will incorrectly use the one created >> + * for INESRT. So first save the one created for UPDATE. >> + */ >> + if (mtstate->mt_transition_capture) >> + saved_tcs_map = mtstate->mt_transition_capture->tcs_map; >> >> I wonder if there is some more elegant way to handle this problem. >> Basically, the issue is that ExecInsert() is stomping on >> mtstate->mt_transition_capture, and your solution is to save and >> restore the value you want to have there. But maybe we could instead >> find a way to get ExecInsert() not to stomp on that state in the first >> place. It seems like the ON CONFLICT stuff handled that by adding a >> second TransitionCaptureState pointer to ModifyTable, thus >> mt_transition_capture and mt_oc_transition_capture. By that >> precedent, we could add mt_utr_transition_capture or similar, and >> maybe that's the way to go. It seems a bit unsatisfying, but so does >> what you have now. > > In case of ON CONFLICT, if there are both INSERT and UPDATE statement > triggers referencing transition tables, both of the triggers need to > independently populate their own transition tables, and hence the need > for two separate transition states : mt_transition_capture and > mt_oc_transition_capture. But in case of update-tuple-routing, the > INSERT statement trigger won't come into picture. So the same > mt_transition_capture can serve the purpose of populating the > transition table with OLD and NEW rows. 
So I think it would be too > redundant, if not incorrect, to have a whole new transition state for > update tuple routing. > > I will see if it turns out better to have two tcs_maps in > TransitionCaptureState, one for update and one for insert. But this, > on first look, does not look good. Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and insert_tcs_maps for UPDATE/DELETE and INSERT events respectively. So upd_del_tcs_maps will be updated only after we start with the next UPDATE subplan, whereas insert_tcs_maps will keep on getting updated for each row. So in AfterTriggerSaveEvent(), upd_del_tcs_maps would be used when the event is TRIGGER_EVENT_[UPDATE/DELETE], and insert_tcs_maps will be used when event == TRIGGER_EVENT_INSERT. But the issue is: even if the event is TRIGGER_EVENT_UPDATE, we don't know whether this is caused by a normal update or as part of an insert into a new partition during a partition-key update. So blindly using upd_del_tcs_maps is incorrect. If the event is caused by the latter, we should use insert_tcs_maps rather than upd_del_tcs_maps. But we do not have the information in trigger.c as to what caused this event. So, overall, it would not work, and even if we make it work by passing or storing some more information somewhere, the AfterTriggerSaveEvent() logic will become too complicated. So I can't think of anything else but to keep the way I did it, i.e. reverting back the tcs_map once the insert finishes. We do a similar thing for reverting back the estate->es_result_relation_info. -- 2. mt_childparent_tupconv_maps is indexed by subplan or partition leaf index. -- > + * If per-leaf map is required and the map is already created, that map > + * has to be per-leaf. If that map is per-subplan, we won't be able to > + * access the maps leaf-partition-wise. But if the map is per-leaf, we > + * will be able to access the maps subplan-wise using the > + * subplan_partition_offsets map using function > + * tupconv_map_for_subplan(). 
So if the callers might need to access
> + * the map both leaf-partition-wise and subplan-wise, they should make
> + * sure that the first time this function is called, it should be
> + * called with perleaf=true so that the map created is per-leaf, not
> + * per-subplan.
>
> This sounds complicated and fragile. It ends up meaning that
> mt_childparent_tupconv_maps is sometimes indexed by subplan number and
> sometimes by partition leaf index, which is extremely confusing and
> likely to lead to coding errors, either in this patch or in future
> ones.

Even if we always index the map by leaf partition, while accessing the map the code still needs to be aware of whether the index number with which we are accessing the map is the subplan number or the leaf partition number: if the access is by subplan number, use subplan_partition_offsets to convert to the leaf partition index. So the function tupconv_map_for_subplan() is anyway necessary for accessing using subplan
Re: [HACKERS] UPDATE of partition key
> On 3 January 2018 at 11:42, Amit Khandekarwrote: >> [...] So it make perfect sense for >> ExecSetupPartitionTupleRouting() to palloc and return a pointer. Sorry >> for the noise. Will share the change in an upcoming patch version. >> Thanks ! > > ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *. Thanks for changing. I've just done almost a complete review of v32. (v33 came along a bit sooner than I thought). I've not finished looking at the regression tests yet, but here are a few things, some may have been changed in v33, I've not looked yet. Also apologies in advance if anything seems nitpicky. 1. "by INSERT" -> "by an INSERT" in: from the original partition followed by INSERT into the 2. "and INSERT" -> "and an INSERT" in: a DELETE and INSERT. As far as 3. "due partition-key change" -> "due to the partition-key being changed" in: * capture is happening for UPDATEd rows being moved to another partition due * partition-key change, then this function is called once when the row is 4. "inserted to another" -> "inserted into another" in: * deleted (to capture OLD row), and once when the row is inserted to another 5. "for UPDATE event" -> "for an UPDATE event" (singular), or -> "for UPDATE events" (plural) * oldtup and newtup are non-NULL. But for UPDATE event fired for I'm unsure if you need singular or plural. It perhaps does not matter. 6. "for row" -> "for a row" in: * movement, oldtup is NULL when the event is for row being inserted, Likewise in: * whereas newtup is NULL when the event is for row being deleted. 7. In the following fragment the code does not do what the comment says: /* * If transition tables are the only reason we're here, return. As * mentioned above, we can also be here during update tuple routing in * presence of transition tables, in which case this function is called * separately for oldtup and newtup, so either can be NULL, not both. 
 */
if (trigdesc == NULL ||
    (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
    (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
    (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
    (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
    return;

With the comment "so either can be NULL, not both.", I'd expect a boolean OR, not an XOR. Maybe the comment is better written as: "so we expect exactly one of them to be non-NULL" (I know you've been discussing with Robert, so I've not checked v33 to see if this still exists)

8. I'm struggling to make sense of this:

/*
 * Save a tuple conversion map to convert a tuple routed to this
 * partition from the parent's type to the partition's.
 */

Maybe it's better to write this as:

/*
 * Generate a tuple conversion map to convert tuples of the parent's
 * type into the partition's type.
 */

9. insert should be capitalised here and should be prefixed with "an":

/*
 * Verify result relation is a valid target for insert operation. Even
 * for updates, we are doing this for tuple-routing, so again, we need
 * to check the validity for insert operation.
 */
CheckValidResultRel(leaf_part_rri, CMD_INSERT);

Maybe it's better to write:

/*
 * Verify result relation is a valid target for an INSERT. An UPDATE of
 * a partition-key becomes a DELETE/INSERT operation, so this check is
 * still required when the operation is CMD_UPDATE.
 */

10. The following code would be more clear if you replaced mtstate->mt_transition_capture with transition_capture.

if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture &&
    mtstate->mt_transition_capture->tcs_update_new_table)
{
    ExecARUpdateTriggers(estate, resultRelInfo, NULL, NULL, tuple,
                         NULL, mtstate->mt_transition_capture);

    /*
     * Now that we have already captured NEW TABLE row, any AR INSERT
     * trigger should not again capture it below. Arrange for the same.
*/ transition_capture = NULL; } You are, after all, doing: transition_capture = mtstate->mt_transition_capture; at the top of the function. There are a few other places you're also accessing mtstate->mt_transition_capture. 11. Should tuple_deleted and process_returning be camelCase like the other params?: static TupleTableSlot * ExecDelete(ModifyTableState *mtstate, ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *planSlot, EPQState *epqstate, EState *estate, bool *tuple_deleted, bool process_returning, bool canSetTag) 12. The following comment talks about "target table descriptor", which I think is a good term. In several other places, you mention "root", which I take it to mean "target table". * This map array is required for two purposes : * 1. For update-tuple-routing. We need to convert the tuple from the subplan * result rel to the root partitioned table descriptor. * 2. For capturing transition tuples when the target table is a partitioned * table. For updates, we need to convert the tuple from subplan result rel to
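Item 7's `(oldtup == NULL) ^ (newtup == NULL)` condition can be isolated into a tiny predicate. This standalone toy sketch (plain pointers instead of real HeapTuples) shows why the XOR means "exactly one of the two is NULL", which matches the suggested comment wording:

```c
#include <stdbool.h>
#include <stddef.h>

/* Each equality test yields 0 or 1, so XOR-ing them is true exactly when
 * one tuple pointer is NULL and the other is not -- the situation during
 * update-row-movement, where the function is called once with only oldtup
 * and once with only newtup. */
static bool
exactly_one_is_null(const void *oldtup, const void *newtup)
{
    return (oldtup == NULL) ^ (newtup == NULL);
}

/* Dummy object to take the address of in the assertions below. */
static int demo_tuple;
```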
Re: [HACKERS] UPDATE of partition key
On 20 December 2017 at 11:52, Amit Khandekarwrote: > On 14 December 2017 at 08:11, Amit Langote > wrote: >> >> Regarding ExecSetupChildParentMap(), it seems to me that it could simply >> be declared as >> >> static void ExecSetupChildParentMap(ModifyTableState *mtstate); >> >> Looking at the places from where it's called, it seems that you're just >> extracting information from mtstate and passing the same for the rest of >> its arguments. > > Agreed. But the last parameter per_leaf might be necessary. I will > defer this until I address Robert's concern about the complexity of > the related code. Removed those parameters, but kept perleaf. The map required for update-tuple-routing is a per-subplan one despite the presence of partition tuple routing. And we cannot deduce from mtstate whether update tuple routing is true. So for this case, the caller has to explicitly specify that per-subplan map has to be created. >> >> tupconv_map_for_subplan() looks like it could be done as a macro. > > Or may be inline function. I will again defer this for similar reason > as the above deferred item about ExecSetupChildParentMap parameters. > Made it inline. Did the above changes in attached update-partition-key_v33.patch On 3 January 2018 at 11:42, Amit Khandekar wrote: > On 2 January 2018 at 10:56, David Rowley wrote: >> I'm pretty sure Robert is suggesting that >> ExecSetupPartitionTupleRouting pallocs the memory for the structure, >> sets it up then returns a pointer to the new struct. That's not very >> unusual. It seems unusual for a function to return void and modify a >> single parameter pointer to get the value to the caller rather than >> just to return that value. > > Sorry, my mistake. Earlier I somehow was under the impression that the > callers of ExecSetupPartitionTupleRouting() already have this > structure palloc'ed, and that they pass address of this structure. 
I > now can see that both CopyStateData->partition_tuple_routing and > ModifyTableState->mt_partition_tuple_routing are pointers, not > structures. So it make perfect sense for > ExecSetupPartitionTupleRouting() to palloc and return a pointer. Sorry > for the noise. Will share the change in an upcoming patch version. > Thanks ! ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *. Did this change in v3 version of 0001-Encapsulate-partition-related-info-in-a-structure.patch -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company update-partition-key_v33.patch.tar.gz Description: GNU Zip compressed data
Re: [HACKERS] UPDATE of partition key
On 1 January 2018 at 21:43, Amit Khandekarwrote: > On 16 December 2017 at 03:09, Robert Haas wrote: >> + /* >> + * UPDATEs set the transition capture map only when a new subplan >> + * is chosen. But for INSERTs, it is set for each row. So after >> + * INSERT, we need to revert back to the map created for UPDATE; >> + * otherwise the next UPDATE will incorrectly use the one created >> + * for INESRT. So first save the one created for UPDATE. >> + */ >> + if (mtstate->mt_transition_capture) >> + saved_tcs_map = mtstate->mt_transition_capture->tcs_map; >> >> I wonder if there is some more elegant way to handle this problem. >> Basically, the issue is that ExecInsert() is stomping on >> mtstate->mt_transition_capture, and your solution is to save and >> restore the value you want to have there. But maybe we could instead >> find a way to get ExecInsert() not to stomp on that state in the first >> place. It seems like the ON CONFLICT stuff handled that by adding a >> second TransitionCaptureState pointer to ModifyTable, thus >> mt_transition_capture and mt_oc_transition_capture. By that >> precedent, we could add mt_utr_transition_capture or similar, and >> maybe that's the way to go. It seems a bit unsatisfying, but so does >> what you have now. > > In case of ON CONFLICT, if there are both INSERT and UPDATE statement > triggers referencing transition tables, both of the triggers need to > independently populate their own transition tables, and hence the need > for two separate transition states : mt_transition_capture and > mt_oc_transition_capture. But in case of update-tuple-routing, the > INSERT statement trigger won't come into picture. So the same > mt_transition_capture can serve the purpose of populating the > transition table with OLD and NEW rows. So I think it would be too > redundant, if not incorrect, to have a whole new transition state for > update tuple routing. 
> > I will see if it turns out better to have two tcs_maps in > TransitionCaptureState, one for update and one for insert. But this, > on first look, does not look good. Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and insert_tcs_maps for UPDATE/DELETE and INSERT events respectively. So upd_del_tcs_maps will be updated only after we start with the next UPDATE subplan, whereas insert_tcs_maps will keep on getting updated for each row. So in AfterTriggerSaveEvent(), upd_del_tcs_maps would be used when the event is TRIGGER_EVENT_[UPDATE/DELETE], and insert_tcs_maps will be used when event == TRIGGER_EVENT_INSERT. But the issue is: even if the event is TRIGGER_EVENT_UPDATE, we don't know whether this is caused by a normal update or as part of an insert into a new partition during a partition-key update. So blindly using upd_del_tcs_maps is incorrect. If the event is caused by the latter, we should use insert_tcs_maps rather than upd_del_tcs_maps. But we do not have the information in trigger.c as to what caused this event. So, overall, it would not work, and even if we make it work by passing or storing some more information somewhere, the AfterTriggerSaveEvent() logic will become too complicated. So I can't think of anything else but to keep the way I did it, i.e. reverting back the tcs_map once the insert finishes. We do a similar thing for reverting back the estate->es_result_relation_info. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
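The save/restore dance described above (ExecInsert() stomping on the shared tcs_map, and the UPDATE path putting its own map back afterwards) can be modelled in isolation. This is a toy sketch with simplified stand-in structs and field names, not the actual executor code:

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for TransitionCaptureState: only the field that
 * gets stomped on matters for this illustration. */
typedef struct TransitionCapture
{
    const char *tcs_map;
} TransitionCapture;

/* Models ExecInsert(): sets the map for each routed row, clobbering
 * whatever the caller had installed. */
static void
toy_exec_insert(TransitionCapture *tc)
{
    tc->tcs_map = "per-row insert map";
}

/* Models the UPDATE row-movement path: save the map created for the
 * UPDATE subplan, let the routed INSERT clobber it, then restore it so
 * the next UPDATE row uses the correct map again. */
static const char *
toy_exec_update_row_movement(TransitionCapture *tc)
{
    const char *saved_tcs_map = tc->tcs_map;    /* save UPDATE's map */

    toy_exec_insert(tc);                        /* clobbers tc->tcs_map */
    tc->tcs_map = saved_tcs_map;                /* revert after insert */
    return tc->tcs_map;
}

/* Fixture used by the assertions below. */
static TransitionCapture demo_tc = { "update map" };
```

The alternative Robert floated (a second capture-state pointer, as the ON CONFLICT code does with mt_oc_transition_capture) avoids the restore step at the cost of a second field.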
Re: [HACKERS] UPDATE of partition key
On 16 December 2017 at 03:09, Robert Haas wrote:
>
> - map = ptr->partition_tupconv_maps[leaf_part_index];
> + map = ptr->parentchild_tupconv_maps[leaf_part_index];
>
> I don't think there's any reason to rename this. In previous patch
> versions, you had multiple arrays of tuple conversion maps in this
> structure, but the refactoring eliminated that.

Done in an earlier version of the patch.

> Likewise, I'm not sure I get the point of mt_transition_tupconv_maps
> -> mt_childparent_tupconv_maps. That seems like it could similarly be
> left alone.

We need to change its name because now this map is not only used for transition capture, but also for update-tuple-routing. Does it look OK to you if, for readability, we keep the childparent tag? Or else, we can just make it "mt_tupconv_maps", but "mt_childparent_tupconv_maps" looks more informative.

> + /*
> + * If transition tables are the only reason we're here, return. As
> + * mentioned above, we can also be here during update tuple routing in
> + * presence of transition tables, in which case this function is called
> + * separately for oldtup and newtup, so either can be NULL, not both.
> + */
> if (trigdesc == NULL ||
> (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
> (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
> - (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
> + (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
> + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
>
> I guess this is correct, but it seems awfully fragile. Can't we have
> some more explicit signaling about whether we're only here for
> transition tables, rather than deducing it based on exactly one of
> oldtup and newtup being NULL?

I had given a thought on this earlier.
I felt, even the pre-existing conditions like "!trigdesc->trig_update_after_row" are all indirect ways to determine that this function is called only to capture transition tables, and thought that it may have been better to have separate parameter transition_table_only. But then decided that I can continue on similar lines and add another such condition to indicate that we are only capturing update-routed tuples. Instead of adding another parameter to AfterTriggerSaveEvent(), I had also considered another approach: Put the transition-tuples-capture logic part of AfterTriggerSaveEvent() into a helper function CaptureTransitionTables(). In ExecInsert() and ExecDelete(), instead of calling ExecARUpdateTriggers(), call this function CaptureTransitionTables(). I then dropped this idea and thought rather to call ExecARUpdateTriggers() which neatly does the required checks and other things like locking the old tuple via GetTupleForTrigger(). So if we go by CaptureTransitionTables(), we would need to do what ExecARUpdateTriggers() does before calling CaptureTransitionTables(). This is doable. If you think this is worth doing so as to get rid of the "(oldtup == NULL) ^ (newtup == NULL)" condition, we can do that. > > + /* Initialization specific to update */ > + if (mtstate && mtstate->operation == CMD_UPDATE) > + { > + ModifyTable *node = (ModifyTable *) mtstate->ps.plan; > + > + is_update = true; > + update_rri = mtstate->resultRelInfo; > + num_update_rri = list_length(node->plans); > + } > > I guess I don't see why we need a separate "if" block for this. > Neither is_update nor update_rri nor num_update_rri are used until we > get to the block that begins with "if (is_update)". Why not just > change that block to test if (mtstate && mtstate->operation == > CMD_UPDATE)" and put the rest of these initializations inside that > block? Done. > > + int num_update_rri = 0, > + update_rri_index = 0; > ... > + update_rri_index = 0; > > It's already 0. Done. 
Retained the comment that mentions why we need to set it to 0, and added a note at the end that we have already done this during initialization.

> + leaf_part_rri = &update_rri[update_rri_index];
> ...
> + leaf_part_rri = leaf_part_arr + i;
>
> These are doing the same kind of thing, but using different styles. I
> prefer the former style, so I'd change the second one to
> &leaf_part_arr[i]. Alternatively, you could change the first one to
> update_rri + update_rri_index. But it's strange to see the same
> variable initialized in two different ways just a few lines apart.

Done. Used the first style.

> +static HeapTuple
> +ConvertPartitionTupleSlot(ModifyTableState *mtstate,
> + TupleConversionMap *map,
> + HeapTuple tuple,
> + TupleTableSlot *new_slot,
> + TupleTableSlot **p_my_slot)
>
> This function doesn't use the mtstate argument at all.

Removed mtstate.

> + * (Similarly we need to add the deleted row in OLD TABLE). We need to do
>
> The period should be before, not after, the closing parenthesis.

Done.

> + * Now that we have already captured NEW TABLE row, any AR INSERT
> + * trigger should not
Re: [HACKERS] UPDATE of partition key
On 23 December 2017 at 04:00, Amit Khandekarwrote: > On 15 December 2017 at 18:28, Robert Haas wrote: >> -PartitionDispatch **pd, >> -ResultRelInfo ***partitions, >> -TupleConversionMap ***tup_conv_maps, >> -TupleTableSlot **partition_tuple_slot, >> -int *num_parted, int *num_partitions) >> +PartitionTupleRouting **partition_tuple_routing) >> >> Since we're consolidating all of ExecSetupPartitionTupleRouting's >> output parameters into a single structure, I think it might make more >> sense to have it just return that value. I think it's only done with >> output parameter today because there are so many different things >> being produced, and we can't return them all. > > You mean ExecSetupPartitionTupleRouting() will return the structure > (not pointer to structure), and the caller will get the copy of the > structure like this ? : > > mtstate->mt_partition_tuple_routing = > ExecSetupPartitionTupleRouting(mtstate, rel, node->nominalRelation, estate); > > I am ok with that, but just wanted to confirm if that is what you are > saying. I don't recall seeing a structure return value in PG code, so > not sure if it is conventional in PG to do that. Hence, I am somewhat > inclined to keep it as output param. It also avoids a structure copy. > > Another way is for ExecSetupPartitionTupleRouting() to palloc this > structure, and return its pointer, but then caller would have to > anyway do a structure copy, so that's not convenient, and I don't > think you are suggesting this way either. I'm pretty sure Robert is suggesting that ExecSetupPartitionTupleRouting pallocs the memory for the structure, sets it up then returns a pointer to the new struct. That's not very unusual. It seems unusual for a function to return void and modify a single parameter pointer to get the value to the caller rather than just to return that value. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
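The allocate-and-return convention David describes for ExecSetupPartitionTupleRouting can be sketched outside the executor. In this toy version malloc stands in for palloc and PartitionTupleRouting is reduced to two fields, so it is illustrative only:

```c
#include <stdlib.h>

/* Drastically reduced stand-in for the real PartitionTupleRouting struct. */
typedef struct PartitionTupleRouting
{
    int num_partitions;
    int num_dispatch;
} PartitionTupleRouting;

/* Instead of filling in a caller-supplied output parameter, the setup
 * function allocates the struct itself and returns a pointer to it.
 * In PostgreSQL the allocation would come from a memory context via
 * palloc; malloc is used here so the sketch is self-contained. */
static PartitionTupleRouting *
setup_partition_tuple_routing(int num_partitions, int num_dispatch)
{
    PartitionTupleRouting *proute = malloc(sizeof *proute);

    if (proute == NULL)
        return NULL;
    proute->num_partitions = num_partitions;
    proute->num_dispatch = num_dispatch;
    return proute;      /* caller frees; in PG, context reset cleans up */
}
```

Returning the struct by value, the other option Amit mentions, would copy it at every return; returning a pointer avoids the copy and matches how most PostgreSQL setup functions hand back state.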
Re: [HACKERS] UPDATE of partition key
On 16 December 2017 at 03:09, Robert Haas wrote:
> I started another review pass over the main patch, so here are
> some comments about that.

I am yet to address all the comments, but meanwhile, below are some specific points ...

> + if (!partrel)
> + {
> +     /*
> +      * We locked all the partitions above including the leaf
> +      * partitions. Note that each of the newly opened relations in
> +      * *partitions are eventually closed by the caller.
> +      */
> +     partrel = heap_open(leaf_oid, NoLock);
> +     InitResultRelInfo(leaf_part_rri,
> +                       partrel,
> +                       resultRTindex,
> +                       rel,
> +                       estate->es_instrument);
> + }
>
> Hmm, isn't there a problem here? Before, we opened all the relations
> here and the caller closed them all. But now, we're only opening some
> of them. If the caller closes them all, then they will be closing
> some that we opened and some that we didn't. That seems quite bad,
> because the reference counts that are incremented and decremented by
> opening and closing should all end up at 0. Maybe I'm confused
> because it seems like this would break in any scenario where even 1
> relation was already opened and surely you must have tested that
> case... but if there's some reason this works, I don't know what it
> is, and the comment doesn't tell me.

In ExecCleanupTupleRouting(), we close only those partitions that were newly opened. We skip those which are actually part of the UPDATE result rels.

> + /*
> +  * UPDATEs set the transition capture map only when a new subplan
> +  * is chosen. But for INSERTs, it is set for each row. So after
> +  * INSERT, we need to revert back to the map created for UPDATE;
> +  * otherwise the next UPDATE will incorrectly use the one created
> +  * for INSERT. So first save the one created for UPDATE.
> +  */
> + if (mtstate->mt_transition_capture)
> +     saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
>
> I wonder if there is some more elegant way to handle this problem.
> Basically, the issue is that ExecInsert() is stomping on
> mtstate->mt_transition_capture, and your solution is to save and
> restore the value you want to have there. But maybe we could instead
> find a way to get ExecInsert() not to stomp on that state in the first
> place. It seems like the ON CONFLICT stuff handled that by adding a
> second TransitionCaptureState pointer to ModifyTable, thus
> mt_transition_capture and mt_oc_transition_capture. By that
> precedent, we could add mt_utr_transition_capture or similar, and
> maybe that's the way to go. It seems a bit unsatisfying, but so does
> what you have now.

In case of ON CONFLICT, if there are both INSERT and UPDATE statement triggers referencing transition tables, both of the triggers need to independently populate their own transition tables, and hence the need for two separate transition states: mt_transition_capture and mt_oc_transition_capture. But in case of update-tuple-routing, the INSERT statement trigger won't come into the picture, so the same mt_transition_capture can serve the purpose of populating the transition table with OLD and NEW rows. So I think it would be too redundant, if not incorrect, to have a whole new transition state for update tuple routing.

I will see if it turns out better to have two tcs_maps in TransitionCaptureState, one for update and one for insert. But on first look, this does not look good.

> + * If per-leaf map is required and the map is already created, that map
> + * has to be per-leaf. If that map is per-subplan, we won't be able to
> + * access the maps leaf-partition-wise. But if the map is per-leaf, we
> + * will be able to access the maps subplan-wise using the
> + * subplan_partition_offsets map using function
> + * tupconv_map_for_subplan(). So if the callers might need to access
> + * the map both leaf-partition-wise and subplan-wise, they should make
> + * sure that the first time this function is called, it should be
> + * called with perleaf=true so that the map created is per-leaf, not
> + * per-subplan.
>
> This sounds complicated and fragile. It ends up meaning that
> mt_childparent_tupconv_maps is sometimes indexed by subplan number and
> sometimes by partition leaf index, which is extremely confusing and
> likely to lead to coding errors, either in this patch or in future
> ones.

Even if we always index the map by leaf partition, while accessing the map the code still needs to be aware of whether the index number with which we are accessing the map is the subplan number or the leaf partition number: if the access is by subplan number, use subplan_partition_offsets to convert it to the leaf partition index. So the function tupconv_map_for_subplan() is anyway necessary for accessing using a subplan index. The only thing that will change is that tupconv_map_for_subplan() will not have to check whether the map is indexed by leaf partition or not. But that complexity is hidden in this function alone; the outside code need not worry about it. If the access is by
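The indexing scheme under discussion — one conversion-map array that may be indexed per leaf partition, with subplan_partition_offsets translating a subplan index to a leaf index — can be sketched in isolation. The code below is a simplified standalone model (the struct and field names mirror the thread, but the real structures hold executor state, not ints):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the patch's structures (illustrative only). */
typedef struct TupleConversionMap { int id; } TupleConversionMap;

typedef struct ModifyTableStateModel
{
    TupleConversionMap **childparent_tupconv_maps; /* per-leaf or per-subplan */
    int *subplan_partition_offsets; /* subplan index -> leaf partition index */
    int  is_tupconv_perpart;        /* nonzero if indexed by leaf partition */
} ModifyTableStateModel;

/*
 * Return the child-to-parent conversion map for a given subplan.  If the
 * array is indexed per leaf partition, translate the subplan index through
 * the offsets array; otherwise index the array directly.
 */
static TupleConversionMap *
tupconv_map_for_subplan(ModifyTableStateModel *mt, int subplan_index)
{
    if (mt->is_tupconv_perpart)
        return mt->childparent_tupconv_maps[mt->subplan_partition_offsets[subplan_index]];
    return mt->childparent_tupconv_maps[subplan_index];
}
```

This is exactly the "complexity hidden in this function alone" argument: callers always pass a subplan index, and only this helper knows which way the array happens to be indexed.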
Re: [HACKERS] UPDATE of partition key
On Fri, Dec 15, 2017 at 7:58 AM, Robert Haas wrote:
> Reviewing the preparatory patch:

I started another review pass over the main patch, so here are some comments about that. This is unfortunately not a complete review, however.

- map = ptr->partition_tupconv_maps[leaf_part_index];
+ map = ptr->parentchild_tupconv_maps[leaf_part_index];

I don't think there's any reason to rename this. In previous patch versions, you had multiple arrays of tuple conversion maps in this structure, but the refactoring eliminated that. Likewise, I'm not sure I get the point of mt_transition_tupconv_maps -> mt_childparent_tupconv_maps. That seems like it could similarly be left alone.

+ /*
+  * If transition tables are the only reason we're here, return. As
+  * mentioned above, we can also be here during update tuple routing in
+  * presence of transition tables, in which case this function is called
+  * separately for oldtup and newtup, so either can be NULL, not both.
+  */
  if (trigdesc == NULL ||
      (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
      (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-     (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+     (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+     (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))

I guess this is correct, but it seems awfully fragile. Can't we have some more explicit signaling about whether we're only here for transition tables, rather than deducing it based on exactly one of oldtup and newtup being NULL?

+ /* Initialization specific to update */
+ if (mtstate && mtstate->operation == CMD_UPDATE)
+ {
+     ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+
+     is_update = true;
+     update_rri = mtstate->resultRelInfo;
+     num_update_rri = list_length(node->plans);
+ }

I guess I don't see why we need a separate "if" block for this. Neither is_update nor update_rri nor num_update_rri are used until we get to the block that begins with "if (is_update)". Why not just change that block to test "if (mtstate && mtstate->operation == CMD_UPDATE)" and put the rest of these initializations inside that block?

+ int num_update_rri = 0,
+     update_rri_index = 0;
...
+ update_rri_index = 0;

It's already 0.

+ leaf_part_rri = &update_rri[update_rri_index];
...
+ leaf_part_rri = leaf_part_arr + i;

These are doing the same kind of thing, but using different styles. I prefer the former style, so I'd change the second one to &leaf_part_arr[i]. Alternatively, you could change the first one to update_rri + update_rri_index. But it's strange to see the same variable initialized in two different ways just a few lines apart.

+ if (!partrel)
+ {
+     /*
+      * We locked all the partitions above including the leaf
+      * partitions. Note that each of the newly opened relations in
+      * *partitions are eventually closed by the caller.
+      */
+     partrel = heap_open(leaf_oid, NoLock);
+     InitResultRelInfo(leaf_part_rri,
+                       partrel,
+                       resultRTindex,
+                       rel,
+                       estate->es_instrument);
+ }

Hmm, isn't there a problem here? Before, we opened all the relations here and the caller closed them all. But now, we're only opening some of them. If the caller closes them all, then they will be closing some that we opened and some that we didn't. That seems quite bad, because the reference counts that are incremented and decremented by opening and closing should all end up at 0. Maybe I'm confused because it seems like this would break in any scenario where even 1 relation was already opened and surely you must have tested that case... but if there's some reason this works, I don't know what it is, and the comment doesn't tell me.

+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+                          TupleConversionMap *map,
+                          HeapTuple tuple,
+                          TupleTableSlot *new_slot,
+                          TupleTableSlot **p_my_slot)

This function doesn't use the mtstate argument at all.

+ * (Similarly we need to add the deleted row in OLD TABLE). We need to do

The period should be before, not after, the closing parenthesis.

+ * Now that we have already captured NEW TABLE row, any AR INSERT
+ * trigger should not again capture it below. Arrange for the same.

A more American style would be something like "We've already captured the NEW TABLE row, so make sure any AR INSERT trigger fired below doesn't capture it again." (Similarly for the other case.)

+ /* The delete has actually happened, so inform that to the caller */
+ if (tuple_deleted)
+     *tuple_deleted = true;

In the US, we inform the caller, not inform that to the caller. In other words, here the direct object of "inform" is the person or thing getting the information (in this case, "the caller"), not the information being conveyed (in this case, "that"). I realize your usage is probably typical for your country...

+ Assert(mtstate->mt_is_tupconv_perpart == true);

We usually just Assert(thing_that_should_be_true), not
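The `^` condition questioned above is easy to verify in isolation: because `==` in C yields 0 or 1, XOR-ing the two comparisons is true exactly when one of the two tuples is NULL and the other is not. A minimal standalone check (the function name here is my own, not from the patch):

```c
#include <assert.h>
#include <stddef.h>

/*
 * True iff exactly one of oldtup/newtup is NULL -- the condition the patch
 * uses to detect the transition-capture-only calls made during
 * update-tuple-routing, where each call passes only one of the two tuples.
 */
static int
exactly_one_is_null(const void *oldtup, const void *newtup)
{
    return (oldtup == NULL) ^ (newtup == NULL);
}
```

Since each `==` comparison already yields an int 0 or 1, the bitwise `^` here behaves as a logical exclusive-or, which is why the patch's condition is correct even though it looks unusual.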
Re: [HACKERS] UPDATE of partition key
On Wed, Dec 13, 2017 at 5:18 AM, Amit Khandekar wrote:
> Amit Langote informed me off-list - along with suggestions for
> changes - that my patch needs a rebase. Attached is the rebased
> version. I have also bumped the patch version number (now v29),
> because this has additional changes, again suggested by Amit L:
> because ExecSetupPartitionTupleRouting() has an mtstate parameter now,
> there is no need to pass update_rri and num_update_rri, since they can
> be retrieved from mtstate.
>
> Also, the preparatory patch is rebased.

Reviewing the preparatory patch:

+ PartitionTupleRouting *partition_tuple_routing;
+ /* Tuple-routing support info */

Something's wrong with the formatting here.

- PartitionDispatch **pd,
- ResultRelInfo ***partitions,
- TupleConversionMap ***tup_conv_maps,
- TupleTableSlot **partition_tuple_slot,
- int *num_parted, int *num_partitions)
+ PartitionTupleRouting **partition_tuple_routing)

Since we're consolidating all of ExecSetupPartitionTupleRouting's output parameters into a single structure, I think it might make more sense to have it just return that value. I think it's only done with output parameters today because there are so many different things being produced, and we can't return them all.

+ PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;

This is just nitpicking, but I don't find "ptr" to be the greatest variable name; it looks too much like "pointer". Maybe we could use "routing" or "proute" or something.

It seems to me that we could improve things here by adding a function ExecCleanupTupleRouting(PartitionTupleRouting *) which would do the various heap_close(), ExecDropSingleTupleTableSlot(), and ExecCloseIndices() operations which are currently performed in CopyFrom() and, by separate code, in ExecEndModifyTable().

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
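The suggested interface — a setup function that returns the consolidated routing state, paired with a cleanup function — can be illustrated with a stripped-down standalone model. The names follow the thread, but this is only a sketch: the real functions allocate ResultRelInfos, slots, and dispatch info rather than the ints used here.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal model of the consolidated routing state discussed in the thread. */
typedef struct PartitionTupleRoutingModel
{
    int  num_dispatch;
    int  num_partitions;
    int *subplan_partition_offsets;
} PartitionTupleRoutingModel;

/* Setup returns the struct directly instead of filling output parameters. */
static PartitionTupleRoutingModel *
ExecSetupPartitionTupleRoutingModel(int num_parted, int num_partitions)
{
    PartitionTupleRoutingModel *proute = malloc(sizeof *proute);

    proute->num_dispatch = num_parted;
    proute->num_partitions = num_partitions;
    proute->subplan_partition_offsets = calloc(num_partitions, sizeof(int));
    return proute;
}

/* One cleanup entry point, callable from both CopyFrom() and
 * ExecEndModifyTable() in the real code. */
static void
ExecCleanupTupleRoutingModel(PartitionTupleRoutingModel *proute)
{
    free(proute->subplan_partition_offsets);
    free(proute);
}
```

The design point is symmetry: whoever calls setup gets one opaque handle back and hands the same handle to cleanup, so the caller no longer needs to know which members were allocated or opened.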
Re: [HACKERS] UPDATE of partition key
Thanks for the updated patches, Amit. Some review comments:

It looks like you forgot to remove the description of update_rri and num_update_rri in the header comment of ExecSetupPartitionTupleRouting().

+extern void pull_child_partition_columns(Relation rel,
+                                         Relation parent,
+                                         Bitmapset **partcols);

It seems you forgot to remove this declaration in partition.h, because I don't find it defined or used anywhere.

I think some of the changes that are currently part of the main patch are better taken out into their own patches, because having those diffs appear in the main patch is kind of distracting. Just like you now have a patch that introduces a PartitionTupleRouting structure. I know that leads to too many patches, but it helps to easily tell less substantial changes from the substantial ones.

1. Patch to rename partition_tupconv_maps to parentchild_tupconv_maps.
2. Patch that introduces has_partition_attrs() in place of is_partition_attr().
3. Patch to change the names of map_partition_varattnos() arguments.
4. Patch that does the refactoring involving ExecConstraints(), ExecPartitionCheck(), and the introduction of ExecPartitionCheckEmitError().

Regarding ExecSetupChildParentMap(), it seems to me that it could simply be declared as

static void ExecSetupChildParentMap(ModifyTableState *mtstate);

Looking at the places from where it's called, it seems that you're just extracting information from mtstate and passing the same for the rest of its arguments.

mt_is_tupconv_perpart seems like it's unnecessary. Its function could be fulfilled by inspecting the state of some other fields of ModifyTableState. For example, in the case of an update (operation == CMD_UPDATE), if mt_partition_tuple_routing is non-NULL, then we can always assume that mt_childparent_tupconv_maps has entries for all partitions. If it's NULL, then there would only be entries for partitions that have sub-plans.

tupconv_map_for_subplan() looks like it could be done as a macro.

Thanks, Amit
Re: [HACKERS] UPDATE of partition key
While addressing Thomas's point about test scenarios not yet covered, I observed the following ...

Suppose an UPDATE RLS policy with a WITH CHECK clause is defined on the target table. Now in ExecUpdate(), the corresponding WCO qual gets executed *before* the partition constraint check, as per existing behaviour, and the qual succeeds. Then, because the partition key was updated, the row is moved to another partition. Suppose this partition has a BR INSERT trigger which modifies the row, and the resultant row actually would *not* pass the UPDATE RLS policy. But for this partition, since it is an INSERT, only the INSERT RLS WCO quals are executed.

So effectively, from a user's perspective, an RLS WITH CHECK policy that was defined to reject such an updated row lets the update succeed, because the policy is not checked *after* a row trigger in the new partition is executed. Attached is a test case that reproduces this issue.

I think, in case of row movement, we should defer calling ExecWithCheckOptions() until the row is being inserted using ExecInsert(). And then in ExecInsert(), ExecWithCheckOptions() should be called using WCO_RLS_UPDATE_CHECK rather than WCO_RLS_INSERT_CHECK (I recall Amit Langote was of this opinion), as below:

--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -510,7 +510,9 @@ ExecInsert(ModifyTableState *mtstate,
      * we are looking for at this point.
      */
     if (resultRelInfo->ri_WithCheckOptions != NIL)
-        ExecWithCheckOptions(WCO_RLS_INSERT_CHECK,
+        ExecWithCheckOptions((mtstate->operation == CMD_UPDATE ?
+                              WCO_RLS_UPDATE_CHECK : WCO_RLS_INSERT_CHECK),
                              resultRelInfo, slot, estate);

It can be argued that since in case of triggers we always execute INSERT row triggers for rows inserted as part of update row movement, we should be consistent and execute INSERT WCOs and not UPDATE WCOs for such rows. But note that the row triggers we execute are defined on the leaf partitions, whereas the RLS policies being executed are defined for the target partitioned table, not the leaf partition. Hence it makes sense to execute them as per the original operation on the target table. This is similar to why we execute UPDATE statement triggers even when the row is eventually inserted into another partition: the UPDATE statement trigger was defined for the target table, not the leaf partition.

Barring any objections, I am going to send a revised patch that fixes the above issue as described.

Thanks
-Amit Khandekar

wco_rls_issue.sql
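The proposed diff boils down to choosing the WCO kind from the top-level operation rather than from the fact that a physical INSERT is being performed. That selection can be sketched standalone (the `_M` enum names are stand-ins for PostgreSQL's CmdType and WCOKind, not the real identifiers):

```c
#include <assert.h>

typedef enum CmdTypeModel { CMD_INSERT_M, CMD_UPDATE_M } CmdTypeModel;
typedef enum WCOKindModel { WCO_RLS_INSERT_CHECK_M, WCO_RLS_UPDATE_CHECK_M } WCOKindModel;

/*
 * When ExecInsert() runs as part of update row movement, the RLS policy that
 * should apply is the UPDATE policy of the original target table, not the
 * INSERT policy -- so the WCO kind is chosen from the top-level operation.
 */
static WCOKindModel
wco_kind_for_insert(CmdTypeModel operation)
{
    return operation == CMD_UPDATE_M ? WCO_RLS_UPDATE_CHECK_M
                                     : WCO_RLS_INSERT_CHECK_M;
}
```

This mirrors the argument in the message: the check's kind follows the statement the user issued, just as statement-level triggers follow the original UPDATE.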
Re: [HACKERS] UPDATE of partition key
On Mon, Nov 27, 2017 at 5:28 PM, Amit Khandekar wrote:
> So, in the upcoming patch version, I am intending to include the above
> two, and if possible, Robert's idea of re-using is_partition_attr()
> for pull_child_partition_columns().

Discussions are still going on, so moved to next CF.

-- Michael
Re: [HACKERS] UPDATE of partition key
On 24 November 2017 at 10:52, Amit Langote wrote:
> On 2017/11/23 21:57, Amit Khandekar wrote:
>> If we collect the partition keys in expand_partitioned_rtentry(), we
>> need to pass the root relation also, so that we can convert the
>> partition key attributes to the root rel descriptor. And the other thing
>> is, maybe we can check beforehand (in expand_inherited_rtentry)
>> whether the root RTE's updatedCols is empty, which I think implies that
>> it's not an UPDATE operation. If yes, we can just skip collecting the
>> partition keys.
>
> Yeah, it seems like a good idea after all to check in
> expand_inherited_rtentry() whether the root RTE's updatedCols is non-empty
> and if so check if any of the updatedCols are partition keys. If we find
> some, then it will suffice to just set a simple flag in the
> PartitionedChildRelInfo that will be created for the root table. That
> should be done *after* we have visited all the tables in the partition
> tree, including some that might be partitioned and hence will provide their
> partition keys. The following block in expand_inherited_rtentry() looks
> like a good spot:
>
> if (rte->inh && partitioned_child_rels != NIL)
> {
>     PartitionedChildRelInfo *pcinfo;
>
>     pcinfo = makeNode(PartitionedChildRelInfo);

Yes, I am thinking about something like that. Thanks.

I am also working on your suggestion of moving the convert-to-root-descriptor logic from ExecInsert() to ExecUpdate(). So, in the upcoming patch version, I am intending to include the above two, and if possible, Robert's idea of re-using is_partition_attr() for pull_child_partition_columns().

-- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] UPDATE of partition key
On 2017/11/23 21:57, Amit Khandekar wrote:
> If we collect the partition keys in expand_partitioned_rtentry(), we
> need to pass the root relation also, so that we can convert the
> partition key attributes to the root rel descriptor. And the other thing
> is, maybe we can check beforehand (in expand_inherited_rtentry)
> whether the root RTE's updatedCols is empty, which I think implies that
> it's not an UPDATE operation. If yes, we can just skip collecting the
> partition keys.

Yeah, it seems like a good idea after all to check in expand_inherited_rtentry() whether the root RTE's updatedCols is non-empty and if so check if any of the updatedCols are partition keys. If we find some, then it will suffice to just set a simple flag in the PartitionedChildRelInfo that will be created for the root table. That should be done *after* we have visited all the tables in the partition tree, including some that might be partitioned and hence will provide their partition keys. The following block in expand_inherited_rtentry() looks like a good spot:

if (rte->inh && partitioned_child_rels != NIL)
{
    PartitionedChildRelInfo *pcinfo;

    pcinfo = makeNode(PartitionedChildRelInfo);

Thanks, Amit
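The check proposed above — skip everything when updatedCols is empty, otherwise flag the query when updatedCols overlaps any partition key column — can be modelled with plain bitmasks. PostgreSQL represents these column sets as Bitmapsets; the uint64 mask below is a simplification for illustration, with each bit standing for one attribute number.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the expand_inherited_rtentry() check discussed in the thread:
 * return nonzero iff some updated column is also a partition key column.
 * An empty updatedCols set means this is not an UPDATE (or no columns are
 * updated), so no partition-key work is needed at all.
 */
static int
updates_partition_key(uint64_t updated_cols, uint64_t partition_key_cols)
{
    if (updated_cols == 0)
        return 0;
    return (updated_cols & partition_key_cols) != 0;
}
```

In the real planner this would be a bms_overlap()-style test accumulated over every partitioned table in the tree, since each level can contribute its own partition key columns.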
Re: [HACKERS] UPDATE of partition key
On 21 November 2017 at 17:24, Amit Khandekar wrote:
> On 13 November 2017 at 18:25, David Rowley wrote:
>> 30. The following chunk of code is giving me a headache trying to
>> verify which arrays are which size:
>>
>> ExecSetupPartitionTupleRouting(rel,
>>                                mtstate->resultRelInfo,
>>                                (operation == CMD_UPDATE ? nplans : 0),
>>                                node->nominalRelation,
>>                                estate,
>>                                &partition_dispatch_info,
>>                                &partitions,
>>                                &partition_tupconv_maps,
>>                                &subplan_leaf_map,
>>                                &partition_tuple_slot,
>>                                &num_parted, &num_partitions);
>> mtstate->mt_partition_dispatch_info = partition_dispatch_info;
>> mtstate->mt_num_dispatch = num_parted;
>> mtstate->mt_partitions = partitions;
>> mtstate->mt_num_partitions = num_partitions;
>> mtstate->mt_parentchild_tupconv_maps = partition_tupconv_maps;
>> mtstate->mt_subplan_partition_offsets = subplan_leaf_map;
>> mtstate->mt_partition_tuple_slot = partition_tuple_slot;
>> mtstate->mt_root_tuple_slot = MakeTupleTableSlot();
>>
>> I know this patch is not completely responsible for it, but you're not
>> making things any better.
>>
>> Would it not be better to invent some PartitionTupleRouting struct and
>> make that struct a member of ModifyTableState and CopyState, then just
>> pass the pointer to that struct to ExecSetupPartitionTupleRouting()
>> and have it fill in the required details? I think the complexity of
>> this is already on the high end; I think you really need to do the
>> refactor before this gets any worse.
>
> Ok. I am currently working on doing this change, so it is not yet
> included in the attached patch. Will send yet another revised patch for
> this change.

The attached incremental patch encapsulate_partinfo.patch (to be applied over the latest v25 patch) contains changes to move all the partition-related information into the new structure PartitionTupleRouting. Further to that, I also moved PartitionDispatchInfo into this structure. So it looks like this:

typedef struct PartitionTupleRouting
{
    PartitionDispatch *partition_dispatch_info;
    int num_dispatch;
    ResultRelInfo **partitions;
    int num_partitions;
    TupleConversionMap **parentchild_tupconv_maps;
    int *subplan_partition_offsets;
    TupleTableSlot *partition_tuple_slot;
    TupleTableSlot *root_tuple_slot;
} PartitionTupleRouting;

So this structure now encapsulates *all* the partition-tuple-routing-related information, and ModifyTableState now has only one tuple-routing-related field, 'mt_partition_tuple_routing'. It is changed like this:

@@ -976,24 +976,14 @@ typedef struct ModifyTableState
     TupleTableSlot *mt_existing;    /* slot to store existing target tuple in */
     List *mt_excludedtlist;         /* the excluded pseudo relation's tlist */
     TupleTableSlot *mt_conflproj;   /* CONFLICT ... SET ... projection target */
-    struct PartitionDispatchData **mt_partition_dispatch_info;
-                                    /* Tuple-routing support info */
-    int mt_num_dispatch;            /* Number of entries in the above array */
-    int mt_num_partitions;          /* Number of members in the following
-                                     * arrays */
-    ResultRelInfo **mt_partitions;  /* Per partition result relation pointers */
-    TupleTableSlot *mt_partition_tuple_slot;
-    TupleTableSlot *mt_root_tuple_slot;
+    struct PartitionTupleRouting *mt_partition_tuple_routing;
+                                    /* Tuple-routing support info */
     struct TransitionCaptureState *mt_transition_capture;
                                     /* controls transition table population
                                      * for specified operation */
     struct TransitionCaptureState *mt_oc_transition_capture;
                                     /* controls transition table population
                                      * for INSERT...ON CONFLICT UPDATE */
-    TupleConversionMap **mt_parentchild_tupconv_maps;
-                                    /* Per partition map for tuple conversion
-                                     * from root to leaf */
     TupleConversionMap **mt_childparent_tupconv_maps;
                                     /* Per plan/partition map for tuple
                                      * conversion from child to root */
     bool mt_is_tupconv_perpart;     /* Is the above map per-partition ? */
-    int *mt_subplan_partition_offsets;
-                                    /* Stores position of update result rels
-                                     * in leaf partitions */
 } ModifyTableState;

So the code in nodeModifyTable.c (and similar code in copy.c) is accordingly adjusted to use mtstate->mt_partition_tuple_routing. In the places where we used the (mtstate->mt_partition_dispatch_info != NULL) condition to run tuple-routing code, I have replaced it with mtstate->mt_partition_tuple_routing != NULL.

If you are ok with the incremental patch, I can extract this change into a separate preparatory patch to be applied on PG master.

Thanks
-Amit Khandekar

encapsulate_partinfo.patch
Re: [HACKERS] UPDATE of partition key
Thanks Amit. Looking at the latest v25 patch.

On 2017/11/16 23:50, Amit Khandekar wrote:
> Below has the responses for both Amit's and David's comments, starting
> with Amit's.
> On 2 November 2017 at 12:40, Amit Langote wrote:
>> On 2017/10/24 0:15, Amit Khandekar wrote:
>>> On 16 October 2017 at 08:28, Amit Langote wrote:
>>>> + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL)))
>>>>
>>>> Is there some reason why a bitwise operator is used here?
>>>
>>> That exact condition means that the function is called for transition
>>> capture for updated rows being moved to another partition. For this
>>> scenario, either the oldtup or the newtup is NULL. I wanted to exactly
>>> capture that condition there. I think the bitwise operator is more
>>> user-friendly in emphasizing the point that it is indeed an "either a
>>> or b, not both" condition.
>>
>> I see. In that case, since this patch adds the new condition, a note
>> about it in the comment just above would be good, because the situation
>> you describe here seems to arise only during update-tuple-routing, IIUC.
>
> Done. Please check.

Looks fine.

>> + * 'update_rri' has the UPDATE per-subplan result rels. These are re-used
>> + * instead of allocating new ones while generating the array of all leaf
>> + * partition result rels.
>>
>> Instead of:
>>
>> "These are re-used instead of allocating new ones while generating the
>> array of all leaf partition result rels."
>>
>> how about:
>>
>> "There is no need to allocate a new ResultRelInfo entry for leaf
>> partitions for which one already exists in this array"
>
> Ok. I have made it like this:
>
> + * 'update_rri' contains the UPDATE per-subplan result rels. For the
> + * output param 'partitions', we don't allocate new ResultRelInfo
> + * objects for leaf partitions for which they are already available
> + * in 'update_rri'.

Sure.

>>> It looks like the interface does not simplify much, and above that, we
>>> have more lines in that function. Also, the caller anyway
>>> has to be aware whether the map_index is the index into the leaf
>>> partitions or the update subplans. So it is not like the caller does
>>> not have to be aware about whether the mapping should be
>>> mt_persubplan_childparent_maps or mt_perleaf_parentchild_maps.
>>
>> Hmm, I think we should try to make it so that the caller doesn't have to
>> be aware of that. And by caller I guess you mean ExecInsert(), which
>> should not be a place, IMHO, where to try to introduce a lot of new logic
>> specific to update tuple routing.
>
> I think, for ExecInsert(), since we have already given it the job of
> routing the tuple from the root partitioned table to a partition, it
> makes sense to give the function the additional job of routing the tuple
> from any partition to any partition. ExecInsert() can be looked at as
> doing this job: "insert a tuple into the right partition; the
> original tuple can belong to any partition".

Yeah, that's one way of looking at it. But I think ExecInsert() as it is today thinks it's got a *new* tuple and that's it. I think the newly introduced code in it to find out that it is not so (that the tuple actually comes from some other partition), that it's really the update-turned-into-delete-plus-insert, and then switch to the root partitioned table's ResultRelInfo, etc. really belongs outside of it. Maybe in its caller, which is ExecUpdate(). I mean, why not add this code to the block in ExecUpdate() that handles update row movement? Just before calling ExecInsert() to do the re-routing seems like a good place to do all that. For example, try the attached incremental patch that applies on top of yours. I can see after applying it that the diffs to ExecInsert() are now just some refactoring ones, and there are no significant additions making it look like supporting update row movement required substantial changes to how ExecInsert() itself works.

> After doing the changes for the int[] array map in the previous patch
> version, I still feel that ConvertPartitionTupleSlot() should be
> retained. It saves some repeated lines of code.

OK.

>> You may be right, but I see for WithCheckOptions initialization
>> specifically that the non-tuple-routing code passes the actual sub-plan
>> when initializing the WCO for a given result rel.
>
> Yes, that's true. The problem with WithCheckOptions for newly allocated
> partition result rels is: we can't use a subplan for the parent
> parameter because there is no subplan for it. But I will still think
> on it a bit more (TODO).

Alright.

>>> I think you are suggesting we do it like how it's done in
>>> is_partition_attr(). Can you please let me know other places we do
>>> this same way? I couldn't find any.
>>
>> OK, not as many as I thought there would be, but there are following beside