backtrace_functions emits trace for any elog
Hi, I used backtrace_functions to debug one of my ideas and found its behavior counter-intuitive and contradictory to its own docs. I think the GUC is supposed to dump a backtrace only on elog(ERROR) (should it also be done for higher levels? not sure about this), but, in fact, it does that for any log level. I have attached a patch that checks the log level before attaching the backtrace. Regards, Ilya 0001-check-elevel-before-attaching-backtrace.patch Description: Binary data
Re: Progress report of CREATE INDEX for nested partitioned tables
> On 16 Mar 2023, at 04:07, Justin Pryzby wrote: > > On Tue, Mar 14, 2023 at 06:58:14PM +0400, Ilya Gladyshev wrote: >>> The only change from the current patch is (3). (1) still calls >>> count_leaf_partitions(), but only once. I'd prefer that to rearranging >>> the progress reporting to set the TOTAL in ProcessUtilitySlow(). >> >> As for reusing TOTAL calculated outside of DefineIndex, as I can see, >> ProcessUtilitySlow is not the only call site for DefineIndex (although, I >> don’t know whether all of them need progress tracking), for instance, there >> is ALTER TABLE that calls DefineIndex to create index for constraints. So I >> feel like rearranging progress reporting will result in unnecessary code >> duplication in those call sites, so passing in an optional parameter seems >> to be easier here, if we are going to optimize it, after all. Especially if >> back-patching is a non-issue. > > Yeah. See attached. I don't like duplicating the loop. Is this really > the right direction to go ? > > I haven't verified if the child tables are locked in all the paths which > would call count_leaf_partitions(). But why is it important to lock > them for this? If they weren't locked before, that'd be a pre-existing > problem... > <0001-fix-CREATE-INDEX-progress-report-with-nested-partiti.patch> I’m not sure what the general policy on locking is, but I have checked ALTER TABLE ADD INDEX, and all the partitions seem to be locked on the first entry to DefineIndex there. All other call sites pass in the parentIndexId, which means the progress tracking machinery will not be initialized, so I think we don’t need to do locking in count_leaf_partitions(). The approach in the patch looks good to me. Some nitpicks on the patch: 1. There’s an unnecessary second call to get_rel_relkind in ProcessUtilitySlow; we can just use what’s in the variable relkind. 2.
We can also combine else and if to get one less nesting level, like this: + else if (!RELKIND_HAS_PARTITIONS(child_relkind)) 3. There was a part of the comment saying "If the index was built by calling DefineIndex() recursively, the called function is responsible for updating the progress report for built indexes."; I think it is still useful to keep it there.
Re: Progress report of CREATE INDEX for nested partitioned tables
> On 14 Mar 2023, at 18:34, Justin Pryzby wrote: > > On Mon, Mar 13, 2023 at 10:42:59AM -0400, Tom Lane wrote: >> Justin Pryzby writes: >>> On Sun, Mar 12, 2023 at 06:25:13PM -0400, Tom Lane wrote: I agree that adding such a field to IndexStmt would be a very bad idea. However, adding another parameter to DefineIndex doesn't seem like a problem. >> >>> It's a problem since this is a bug and it's desirable to backpatch a >>> fix, right ? >> >> I do not think this is important enough to justify a back-patch. > > That's fine with me, but it comes as a surprise, and it might invalidate > earlier discussions, which were working under the constraint of > maintaining a compatible ABI. > >>> Incrementing by 0 sounds terrible, since someone who has intermediate >>> partitioned tables is likely to always see 0% done. >> >> How so? The counter will increase after there's some actual work done, >> ie building an index. If there's no indexes to build then it hardly >> matters, because the command will complete in very little time. > > I misunderstood your idea as suggesting to skip progress reporting for > *every* intermediate partitioned index, and its leaves. But I guess > what you meant is to skip progress reporting when ATTACHing an > intermediate partitioned index. That seems okay, since 1) intermediate > partitioned tables are probably rare, and 2) ATTACH is fast, so the > effect is indistinguishable from querying the progress report a few > moments later. +1 that counting attached partitioned indexes as 0 is fine. > The idea would be for: > 1) TOTAL to show the number of direct and indirect leaf partitions; > 2) update progress while building direct or indirect indexes; > 3) ATTACHing intermediate partitioned tables to increment by 0; > 4) ATTACHing a direct child should continue to increment by 1, > since that common case already works as expected and shouldn't be > changed. > > The only change from the current patch is (3).
(1) still calls > count_leaf_partitions(), but only once. I'd prefer that to rearranging > the progress reporting to set the TOTAL in ProcessUtilitySlow(). > > -- > Justin As for reusing TOTAL calculated outside of DefineIndex, as I can see, ProcessUtilitySlow is not the only call site for DefineIndex (although, I don’t know whether all of them need progress tracking), for instance, there is ALTER TABLE that calls DefineIndex to create index for constraints. So I feel like rearranging progress reporting will result in unnecessary code duplication in those call sites, so passing in an optional parameter seems to be easier here, if we are going to optimize it, after all. Especially if back-patching is a non-issue.
Re: Progress report of CREATE INDEX for nested partitioned tables
> On 1 Feb 2023, at 20:27, Matthias van de Meent wrote: > > On Wed, 1 Feb 2023 at 16:53, Justin Pryzby wrote: >> >> On Wed, Feb 01, 2023 at 04:21:35PM +0100, Matthias van de Meent wrote: >>> On Wed, 1 Feb 2023 at 15:21, Ilya Gladyshev >>> wrote: >>>>> On 1 Feb 2023, at 16:01, Alvaro Herrera >>>>> wrote: >>>>> Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache >>>>> lookup for every single element therein ... this sounds slow. >>>>> >>>>> In one of the callsites, we already have the partition descriptor >>>>> available. We could just scan partdesc->is_leaf[] and add one for each >>>>> 'true' value we see there. >>>> >>>> The problem is that partdesc contains only direct children of the table >>>> and we need all the children down the inheritance tree to count the total >>>> number of leaf partitions in the first callsite. >>>> >>>>> In the other callsite, we had the table open just a few lines before the >>>>> place you call count_leaf_partitions. Maybe we can rejigger things by >>>>> examining its state before closing it: if relkind is not partitioned we >>>>> know leaf_partitions=0, and only if partitioned we count leaf partitions. >>>>> I think that would save some work. I also wonder if it's worth writing >>>>> a bespoke function for counting leaf partitions rather than relying on >>>>> find_all_inheritors. >>>> >>>> Sure, added this condition to avoid the extra work here. >>>> >>> >>>> When creating an index on a partitioned table, this column is set to >>>> - the total number of partitions on which the index is to be created. >>>> + the total number of leaf partitions on which the index is to be >>>> created or attached. >>> >>> I think we should also add a note about the (now) non-constant nature >>> of the value, something along the lines of "This value is updated as >>> we're processing and discovering partitioned tables in the partition >>> hierarchy". >> >> But the TOTAL is constant, right ?
Updating the total when being called >> recursively is the problem these patches fix. > > If that's the case, then I'm not seeing the 'fix' part of the patch. I > thought this patch was fixing the provably incorrect TOTAL value where > DONE > TOTAL due to the recursive operation overwriting the DONE/TOTAL > values instead of updating them. > > In HEAD we set TOTAL to whatever number partitioned table we're > currently processing has - regardless of whether we're the top level > statement. > With the patch we instead add the number of child relations to that > count, for which REL_HAS_STORAGE(child) -- or at least, in the v3 > posted by Ilya. Approximately immediately after updating that count we > recurse to the child relations, and that only returns once it is done > creating the indexes, so both TOTAL and DONE go up as we process more > partitions in the hierarchy. The TOTAL in the patch is set only when processing the top-level parent and it is not updated when we recurse, so yes, it is constant. From v3: @@ -1219,8 +1243,14 @@ DefineIndex(Oid relationId, RelationparentIndex; TupleDesc parentDesc; - pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL, - nparts); + if (!OidIsValid(parentIndexId)) + { + int total_parts; + + total_parts = count_leaf_partitions(relationId); + pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL, + total_parts); + } It is set to the total number of children on all levels of the hierarchy, not just the current one, so the total value doesn’t need to be updated later, because it is set to the correct value from the very beginning. It is the DONE counter that is updated, and when we attach an index of a partition that is itself a partitioned table (like a2 in your example, if it already had an index created), it will be updated by the number of children of the partition. @@ -1431,9 +1463,25 @@ DefineIndex(Oid rela
Re: Progress report of CREATE INDEX for nested partitioned tables
> On 1 Feb 2023, at 16:01, Alvaro Herrera wrote: > > Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache > lookup for every single element therein ... this sounds slow. > > In one of the callsites, we already have the partition descriptor > available. We could just scan partdesc->is_leaf[] and add one for each > 'true' value we see there. The problem is that partdesc contains only direct children of the table and we need all the children down the inheritance tree to count the total number of leaf partitions in the first callsite. > In the other callsite, we had the table open just a few lines before the > place you call count_leaf_partitions. Maybe we can rejigger things by > examining its state before closing it: if relkind is not partitioned we > know leaf_partitions=0, and only if partitioned we count leaf partitions. > I think that would save some work. I also wonder if it's worth writing > a bespoke function for counting leaf partitions rather than relying on > find_all_inheritors. Sure, added this condition to avoid the extra work here. v3-0001-create-index-progress-increment.patch Description: Binary data
Re: Progress report of CREATE INDEX for nested partitioned tables
> On 1 Feb 2023, at 08:29, Justin Pryzby wrote: > > On Tue, Jan 31, 2023 at 07:32:20PM +0400, Ilya Gladyshev wrote: >>> On 17 Jan 2023, at 23:44, Tomas Vondra >>> wrote: >>> Do we actually need the new parts_done field? I mean, we already do >>> track the value - at PROGRESS_CREATEIDX_PARTITIONS_DONE index in the >>> st_progress_param array. Can't we simply read it from there? Then we >>> would not have ABI issues with the new field added to IndexStmt. >> >> I think it’s a good approach and it could be useful outside the scope of this >> patch too. So I have attached a patch that introduces a >> pgstat_progress_incr_param function for this purpose. There’s one thing I am >> not sure about: IIUC, we can assume that the only process that can write >> into MyBEEntry of the current backend is the current backend itself, >> therefore looping to get consistent reads from this array is not required. >> Please correct me if I am wrong here. > > Thanks for the updated patch. > > I think you're right - pgstat_begin_read_activity() is used for cases > involving other backends. It ought to be safe for a process to read its > own status bits, since we know they're not also being written. > > You changed DefineIndex() to update progress for the leaf indexes when > called recursively. The caller updates the progress for "attached" > indexes, but not created ones. That allows providing fine-granularity > progress updates when using intermediate partitions, right ? (Rather > than updating the progress by more than one at a time in the case of > intermediate partitioning). > > If my understanding is right, that's subtle, and adds a bit of > complexity to the current code, so could use careful commentary. I > suggest: > > * If the index was attached, update progress for all its direct and > * indirect leaf indexes all at once.
If the index was built by calling > * DefineIndex() recursively, the called function is responsible for > * updating the progress report for built indexes. > > ... > > * If this is the top-level index, we're done. When called recursively > * for child tables, the done partition counter is incremented now, > * rather than in the caller. Yes, you are correct about the intended behavior, I added your comments to the patch. > I guess you know that there were compiler warnings (related to your > question). > https://cirrus-ci.com/task/6571212386598912 > > pgstat_progress_incr_param() could call pgstat_progress_update_param() > rather than using its own Assert() and WRITE_ACTIVITY calls. I'm not > sure which I prefer, though. > > Also, there are whitespace/tab/style issues in > pgstat_progress_incr_param(). > > -- > Justin Thank you for the review, I fixed the aforementioned issues in the v2. v2-0001-create-index-progress-increment.patch Description: Binary data
Re: Progress report of CREATE INDEX for nested partitioned tables
> On 17 Jan 2023, at 23:44, Tomas Vondra > wrote: > Do we actually need the new parts_done field? I mean, we already do > track the value - at PROGRESS_CREATEIDX_PARTITIONS_DONE index in the > st_progress_param array. Can't we simply read it from there? Then we > would not have ABI issues with the new field added to IndexStmt. I think it’s a good approach and it could be useful outside the scope of this patch too. So I have attached a patch that introduces a pgstat_progress_incr_param function for this purpose. There’s one thing I am not sure about: IIUC, we can assume that the only process that can write into MyBEEntry of the current backend is the current backend itself, therefore looping to get consistent reads from this array is not required. Please correct me if I am wrong here. 0001-create-index-progress-increment.patch Description: Binary data
Re: Progress report of CREATE INDEX for nested partitioned tables
On Sun, 2023-01-08 at 10:48 -0600, Justin Pryzby wrote: > On Sat, Dec 17, 2022 at 08:30:02AM -0600, Justin Pryzby wrote: > > We have the common problem of too many patches. > > > > https://www.postgresql.org/message-id/a15f904a70924ffa4ca25c3c744cff31e0e6e143.camel%40gmail.com > > This changes the progress reporting to show indirect children as > > "total", and adds a global variable to track recursion into > > DefineIndex(), allowing it to be incremented without the value > > being > > lost to the caller. > > > > https://www.postgresql.org/message-id/20221211063334.GB27893%40telsasoft.com > > This also counts indirect children, but only increments the > > progress > > reporting in the parent. This has the disadvantage that when > > intermediate partitions are in use, the done_partitions counter > > will > > "jump" from (say) 20 to 30 without ever hitting 21-29. > > > > https://www.postgresql.org/message-id/20221213044331.GJ27893%40telsasoft.com > > This has two alternate patches: > > - One patch changes to only update progress reporting of *direct* > > children. This is minimal, but discourages any future plan to > > track > > progress involving intermediate partitions with finer > > granularity. > > - An alternative patch adds IndexStmt.nparts_done, and allows > > reporting > > fine-grained progress involving intermediate partitions. > > > > https://www.postgresql.org/message-id/flat/039564d234fc3d014c555a7ee98be69a9e724836.ca...@gmail.com > > This also reports progress of intermediate children. The first > > patch > > does it by adding an argument to DefineIndex() (which isn't okay to > > backpatch). And an alternate patch does it by adding to IndexStmt. > > > > @committers: Is it okay to add nparts_done to IndexStmt ? > > Any hint about this ? > > This should be resolved before the "CIC on partitioned tables" patch, > which I think is otherwise done. I suggest that we move on with the IndexStmt patch and see what the committers have to say about it.
I have brushed the patch up a bit, fixing TODOs and adding docs as per our discussion above. From 490d8afa7cb952e5b3947d81d85bf9690499e8a3 Mon Sep 17 00:00:00 2001 From: Ilya Gladyshev Date: Tue, 13 Dec 2022 22:04:45 +0400 Subject: [PATCH v2] parts_done via IndexStmt --- doc/src/sgml/monitoring.sgml | 4 +-- src/backend/bootstrap/bootparse.y | 2 ++ src/backend/commands/indexcmds.c | 57 +++--- src/backend/parser/parse_utilcmd.c | 2 ++ src/include/nodes/parsenodes.h | 1 + 5 files changed, 59 insertions(+), 7 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index cf220c3bcb..2c32a92364 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -6565,7 +6565,7 @@ FROM pg_stat_get_backend_idset() AS backendid; When creating an index on a partitioned table, this column is set to - the total number of partitions on which the index is to be created. + the total number of leaf partitions on which the index is to be created or attached. This field is 0 during a REINDEX. @@ -6576,7 +6576,7 @@ FROM pg_stat_get_backend_idset() AS backendid; When creating an index on a partitioned table, this column is set to - the number of partitions on which the index has been created. + the number of leaf partitions on which the index has been created or attached. This field is 0 during a REINDEX. 
diff --git a/src/backend/bootstrap/bootparse.y b/src/backend/bootstrap/bootparse.y index 86804bb598..dad9c38d90 100644 --- a/src/backend/bootstrap/bootparse.y +++ b/src/backend/bootstrap/bootparse.y @@ -296,6 +296,7 @@ Boot_DeclareIndexStmt: stmt->concurrent = false; stmt->if_not_exists = false; stmt->reset_default_tblspc = false; + stmt->parts_done = -1; /* locks and races need not concern us in bootstrap mode */ relationId = RangeVarGetRelid(stmt->relation, NoLock, @@ -348,6 +349,7 @@ Boot_DeclareUniqueIndexStmt: stmt->concurrent = false; stmt->if_not_exists = false; stmt->reset_default_tblspc = false; + stmt->parts_done = -1; /* locks and races need not concern us in bootstrap mode */ relationId = RangeVarGetRelid(stmt->relation, NoLock, diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index 8afc006f89..be4d1c884e 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -130,6 +130,29 @@ typedef struct ReindexErrorInfo char relkind; } ReindexErrorInfo; + +/* + * Count the number of direct and indirect leaf partitions, excl
Re: Progress report of CREATE INDEX for nested partitioned tables
On Mon, 2022-12-12 at 22:43 -0600, Justin Pryzby wrote: > On Mon, Dec 12, 2022 at 11:39:23PM +0400, Ilya Gladyshev wrote: > > > > > Could you check what I've written as a counter-proposal ? > > > > I think that this might be a good solution to start with, it gives > > us the opportunity to improve the granularity later without any > > surprising changes for the end user. We could use this patch for > > previous versions and make more granular output in the latest. What > > do you think? > > Somehow, it hadn't occurred to me that my patch "lost granularity" by > incrementing the progress bar by more than one... Shoot. > > > I actually think that the progress view would be better off without > > the total number of partitions, > > Just curious - why ? We don't really know how many indexes we are going to create, unless we have some kind of preliminary "planning" stage where we accumulate all the relations that will need to have indexes created (rather than attached). And if someone wants the total, it can be calculated manually without this view; it's less user-friendly, but if we can't do it well, I would leave it up to the user. > > > With this in mind, I think your proposal to exclude catalog-only > > indexes sounds reasonable to me, but I feel like the docs are off > > in this case, because the attached indexes are not created, but we > > pretend like they are in this metric, so we should fix one or the > > other. > > I agree that the docs should indicate whether we're counting "all > partitions", "direct partitions", and whether or not that includes > partitioned partitions, or just leaf partitions. Agree. I think that the docs should also be explicit about the attached indexes, if we decide to count them in as "created". > I have another proposal: since the original patch 3.5 years ago > didn't > consider or account for sub-partitions, let's not start counting them > now.
It was never defined whether they were included or not (and I > guess that they're not common) so we can take this opportunity to > clarify the definition. I have had this thought initially, but then I thought that it's not what I would want, if I was to track progress of multi-level partitioned tables (but yeah, I guess it's pretty uncommon). In this respect, I like your initial counter-proposal more, because it leaves us room to improve this in the future. Otherwise, if we commit to reporting only top-level partitions now, I'm not sure we will have the opportunity to change this. > Alternately, if it's okay to add nparts_done to the IndexStmt, then > that's easy. Yeah, or we could add another argument to DefineIndex. I don't know if it's ok, or which option is better here in terms of compatibility and interface-wise, so I have tried both of them, see the attached patches. From 4f55526f8a534d6b2f27b6b17cadaee0370e4cbc Mon Sep 17 00:00:00 2001 From: Ilya Gladyshev Date: Tue, 13 Dec 2022 19:42:53 +0400 Subject: [PATCH] parts_done via DefineIndex args --- src/backend/commands/indexcmds.c | 52 +++- src/backend/commands/tablecmds.c | 7 +++-- src/backend/tcop/utility.c | 3 +- src/include/commands/defrem.h| 3 +- 4 files changed, 53 insertions(+), 12 deletions(-) diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index b5b860c3ab..9fa9109b9e 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -129,6 +129,29 @@ typedef struct ReindexErrorInfo charrelkind; } ReindexErrorInfo; + +/* + * Count the number of direct and indirect leaf partitions, excluding foreign + * tables. 
+ */ +static int +num_leaf_partitions(Oid relid) +{ + int nleaves = 0; + List*childs = find_all_inheritors(relid, NoLock, NULL); + ListCell *lc; + + foreach(lc, childs) + { + Oid partrelid = lfirst_oid(lc); + if (RELKIND_HAS_STORAGE(get_rel_relkind(partrelid))) + nleaves++; + } + + list_free(childs); + return nleaves; +} + /* * CheckIndexCompatible * Determine whether an existing index definition is compatible with a @@ -530,7 +553,8 @@ DefineIndex(Oid relationId, bool check_rights, bool check_not_in_use, bool skip_build, - bool quiet) + bool quiet, + int *parts_done) { boolconcurrent; char *indexRelationName; @@ -568,6 +592,7 @@ DefineIndex(Oid relationId, Oid root_save_userid; int root_save_sec_context; int ro
Re: Progress report of CREATE INDEX for nested partitioned tables
> Could you check what I've written as a counter-proposal ? I think that this might be a good solution to start with, it gives us the opportunity to improve the granularity later without any surprising changes for the end user. We could use this patch for previous versions and make more granular output in the latest. What do you think? > As long as we're changing partitions_done to include nested > sub-partitions, it seems to me like we should exclude intermediate > "catalog-only" partitioned indexes, and count only physical leaf > partitions. Should it also exclude any children with matching indexes, > which will also be catalog-only changes? Probably not. > > The docs say: > |When creating an index on a partitioned table, this column is set to the > |total number of partitions on which the index is to be created. This > |field is 0 during a REINDEX. I agree with you on catalog-only partitioned indexes, but I think that in the perfect world we should exclude all the relations where the index isn’t actually created, so that means excluding attached indexes as well. However, IMO doing it this way will require too much of a code rewrite for quite a minor feature (but we could do it, ofc). I actually think that the progress view would be better off without the total number of partitions, but I’m not sure we have this option now. With this in mind, I think your proposal to exclude catalog-only indexes sounds reasonable to me, but I feel like the docs are off in this case, because the attached indexes are not created, but we pretend like they are in this metric, so we should fix one or the other. > >> I changed current behaviour to report the total number of partitions in >> the inheritance tree and fixed recursion in the attached patch. I used >> a static variable to keep the counter to avoid ABI breakage of >> DefineIndex, so that we could backpatch this to previous versions.
> > I wrote a bunch of assertions for this, which seems to have uncovered a > similar issue with COPY progress reporting, dating to 8a4f618e7. I'm > not sure the assertions are okay. I imagine they may break other > extensions, as with file_fdw. > > -- > Justin > <0001-fix-progress-reporting-of-nested-partitioned-indexes.patch>
Progress report of CREATE INDEX for nested partitioned tables
Hi, I have noticed that progress reporting for CREATE INDEX of partitioned tables seems to be working poorly for nested partitioned tables. In particular, it overwrites total and done partitions count when it recurses down to child partitioned tables and it only reports top-level partitions. So it's not hard to see something like this during CREATE INDEX now:

postgres=# select partitions_total, partitions_done from pg_stat_progress_create_index;
 partitions_total | partitions_done
------------------+-----------------
                1 |               2
(1 row)

I changed current behaviour to report the total number of partitions in the inheritance tree and fixed recursion in the attached patch. I used a static variable to keep the counter to avoid ABI breakage of DefineIndex, so that we could backpatch this to previous versions. Thanks, Ilya Gladyshev From 342caa3d4ce273b14a6998c20c32b862d2d4c890 Mon Sep 17 00:00:00 2001 From: Ilya Gladyshev Date: Fri, 9 Dec 2022 23:17:29 +0400 Subject: [PATCH] report top parent progress for CREATE INDEX --- src/backend/commands/indexcmds.c | 36 ++-- 1 file changed, 30 insertions(+), 6 deletions(-) diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index b5b860c3ab..80557dad8d 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -129,6 +129,12 @@ typedef struct ReindexErrorInfo char relkind; } ReindexErrorInfo; +/* + * Used to report the number of partitions, + * that were processed during index creation.
+ */ +static int create_index_parts_done; + /* * CheckIndexCompatible * Determine whether an existing index definition is compatible with a @@ -1218,8 +1224,18 @@ DefineIndex(Oid relationId, Relation parentIndex; TupleDesc parentDesc; - pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL, - nparts); + if (!OidIsValid(parentIndexId)) + { +List *inhs; + +/* Reset counter of done partitions when processing root index */ +create_index_parts_done = 0; +inhs = find_all_inheritors(relationId, NoLock, NULL); +/* inheritance tree size without root itself */ +pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL, + list_length(inhs) - 1); +list_free(inhs); + } /* Make a local copy of partdesc->oids[], just for safety */ memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts); @@ -1431,8 +1447,6 @@ DefineIndex(Oid relationId, child_save_sec_context); } -pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_DONE, - i + 1); free_attrmap(attmap); } @@ -1465,13 +1479,17 @@ DefineIndex(Oid relationId, /* * Indexes on partitioned tables are not themselves built, so we're - * done here. + * done here. If it is a child partitioned table, increment done parts counter. */ AtEOXact_GUC(false, root_save_nestlevel); SetUserIdAndSecContext(root_save_userid, root_save_sec_context); table_close(rel, NoLock); if (!OidIsValid(parentIndexId)) pgstat_progress_end_command(); + else + pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_DONE, + ++create_index_parts_done); + return address; } @@ -1483,9 +1501,15 @@ DefineIndex(Oid relationId, /* Close the heap and we're done, in the non-concurrent case */ table_close(rel, NoLock); - /* If this is the top-level index, we're done. */ + /* + * If this is the top-level index, we're done, otherwise, increment + * done partition counter. 
+ */ if (!OidIsValid(parentIndexId)) pgstat_progress_end_command(); + else + pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_DONE, + ++create_index_parts_done); return address; } -- 2.30.2
Re: CREATE INDEX CONCURRENTLY on partitioned index
On Sun, 2022-12-04 at 13:09 -0600, Justin Pryzby wrote: > > This behavior is fixed. I refactored and re-implemented to use > DefineIndex() for building indexes concurrently rather than > reindexing. > That makes the patch smaller, actually, and has the added benefit of > splitting off the "Concurrently" part of DefineIndex() into a > separate > function. Nice, I think it turned out pretty concise. I played around with the patch quite a bit and didn't find any major problems; the only minor thing that I can note is that we should skip the top parent index itself in the loop so as not to increment the pg_stat counter, something like this: diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index cfab45b999..9049540b5b 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -1515,6 +1515,9 @@ DefineIndex(Oid relationId, Oid indrelid = lfirst_oid(lc); Oid tabrelid = IndexGetRelation(indrelid, false); + if (indrelid == indexRelationId) + continue; + if (RELKIND_HAS_STORAGE(get_rel_relkind(indrelid)) && !get_index_isvalid(indrelid)) { > > BTW, it causes the patch to fail to apply in cfbot when you send an > additional (002) supplementary patch without including the original > (001) patch. You can name it *.txt to avoid the issue. > > https://wiki.postgresql.org/wiki/Cfbot#Which_attachments_are_considered_to_be_patches.3F > > Thanks for looking. > My bad, didn't know about this, thanks for the link. On a side note, I noticed that reindex behaviour is strange on partitioned tables: it doesn't mark partitioned tables as valid after reindexing children. As I could understand from the code and mailing lists, this is the intended behaviour, but I can't quite understand the rationale for it; do you know why it is done this way?
Re: CREATE INDEX CONCURRENTLY on partitioned index
Hi, Thank you Justin and Alexander for working on this. I have reviewed and tested the latest patch; it works well, and the problems mentioned previously are all fixed. I like the idea of sharing code between reindex and index, but I have noticed some peculiarities as a user. The reporting is somewhat confusing, as it switches to reporting for reindex concurrently while building child indexes; this should be fixed with the simple patch I have attached. Another thing that I have noticed is that REINDEX, which is used under the hood, creates new indexes with the suffix _ccnew, and if the index building fails, the indexes that could not be built will have the name with the _ccnew suffix. This can actually be seen in your test: ERROR: could not create unique index "idxpart2_a_idx2_ccnew" I find it quite confusing and I don't think that this is the expected behavior (if it is, I think it should be documented, like it is for REINDEX). As an example of problems that it might entail, DROP INDEX will not drop all the invalid indexes in the inheritance tree, because it will leave _ccnew indexes in place, which is ok for reindex concurrently, but that's not how C-I-C works now. I think that fixing this problem requires some heavy code rewrite and I'm not quite sure how to go about it; if you have any ideas, I will be happy to try them out.
Thanks,
Ilya

From 8eb9fd7ce7d34c5c323c47b60a7f883f360ef090 Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev
Date: Sat, 3 Dec 2022 18:20:03 +0400
Subject: [PATCH] turn off reindex reporting for create

---
 src/backend/commands/indexcmds.c | 36 +++-
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a2775931e2..b3c713037f 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -3804,14 +3804,16 @@ ReindexRelationConcurrently(Oid relationOid, ReindexParams *params)
 
 		/* Don't overwrite CREATE INDEX command */
 		if (!(params->options & REINDEXOPT_REPORT_CREATE_PART))
+		{
 			pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX,
 										  idx->tableId);
-		progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
-		progress_vals[1] = 0;	/* initializing */
-		progress_vals[2] = idx->indexId;
-		progress_vals[3] = idx->amId;
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+			progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
+			progress_vals[1] = 0;	/* initializing */
+			progress_vals[2] = idx->indexId;
+			progress_vals[3] = idx->amId;
+			pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+		}
 
 		/* Choose a temporary relation name for the new index */
 		concurrentName = ChooseRelationName(get_rel_name(idx->indexId),
@@ -3967,13 +3969,15 @@ ReindexRelationConcurrently(Oid relationOid, ReindexParams *params)
 		 * table involved.  Don't overwrite CREATE INDEX command.
 		 */
 		if (!(params->options & REINDEXOPT_REPORT_CREATE_PART))
+		{
 			pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX,
 										  newidx->tableId);
-		progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
-		progress_vals[1] = PROGRESS_CREATEIDX_PHASE_BUILD;
-		progress_vals[2] = newidx->indexId;
-		progress_vals[3] = newidx->amId;
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+			progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
+			progress_vals[1] = PROGRESS_CREATEIDX_PHASE_BUILD;
+			progress_vals[2] = newidx->indexId;
+			progress_vals[3] = newidx->amId;
+			pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+		}
 
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
@@ -4033,14 +4037,16 @@ ReindexRelationConcurrently(Oid relationOid, ReindexParams *params)
 		 * table involved.  Don't overwrite CREATE INDEX command.
 		 */
 		if (!(params->options & REINDEXOPT_REPORT_CREATE_PART))
+		{
 			pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX,
 										  newidx->tableId);
-		progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
-		progress_vals[1] = PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN;
-		progress_vals[2] = newidx->indexId;
-		progress_vals[3] = newidx->amId;
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+			progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
+			progress_vals[1] = PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN;
+			progress_vals[2] = newidx->indexId;
+			progress_vals[3] = newidx->amId;
+			pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+		}
 
 		validate_index(newidx->tableId, newidx->indexId, snapshot);
-- 
2.30.2
Re: postgres_fdw binary protocol support
> On 22 Nov 2022, at 17:10, Ashutosh Bapat wrote:
>
> Hi Ilya,
>
> On Mon, Nov 21, 2022 at 8:50 PM Ilya Gladyshev wrote:
>>
>> Hi everyone,
>>
>> I have made a patch that introduces support for libpq binary protocol in postgres_fdw. The idea is simple: when a user knows that the foreign server is binary-compatible with the local one and their workload could somehow benefit from using the binary protocol, it can be switched on for a particular server or even a particular table.
>>
>
> Why do we need this feature? If it's for performance then do we have performance numbers?

Yes, it is for performance, but I am yet to do the benchmarks. My initial idea was that the binary protocol must be more efficient than text, because, as I understand it, that's the whole point of it. However, the minor tests that I have done do not prove this and I couldn't find any benchmarks for it online, so I will do further tests to find a use case for it.

> About the patch itself, I see a lot of if (binary) {} else {} blocks which are repeated. It will be good if we can add functions/macros to avoid duplication.

Yea, that's true, I have some ideas about improving it
postgres_fdw binary protocol support
Hi everyone,

I have made a patch that introduces support for the libpq binary protocol in postgres_fdw. The idea is simple: when a user knows that the foreign server is binary-compatible with the local one and their workload could somehow benefit from using the binary protocol, it can be switched on for a particular server or even a particular table.

The patch adds a new foreign server and table option 'binary_format' (off by default) and implements serialization/deserialization of query results and parameters for the binary protocol. I have tested the patch by switching the foreign servers in the postgres_fdw.sql tests to binary format; the only diff was in the text of the error for parsing an invalid integer value, so it worked as expected for the tests.

There are a few minor issues I don't like in the code, and I am yet to write the tests and docs for it. It would be great to get some feedback and understand whether this is a welcome feature before proceeding with all of the above.

Thanks,
Ilya Gladyshev

From 2cb72df03ed94d55cf51531a2d21a7d3369ae27b Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev
Date: Sat, 19 Nov 2022 17:47:49 +0400
Subject: [PATCH] postgres_fdw libpq binary proto support

---
 contrib/postgres_fdw/option.c       |   6 +-
 contrib/postgres_fdw/postgres_fdw.c | 389
 contrib/postgres_fdw/postgres_fdw.h |   1 +
 3 files changed, 338 insertions(+), 58 deletions(-)

diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index fa80ee2a55..f96cb79b42 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -125,7 +125,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 			strcmp(def->defname, "truncatable") == 0 ||
 			strcmp(def->defname, "async_capable") == 0 ||
 			strcmp(def->defname, "parallel_commit") == 0 ||
-			strcmp(def->defname, "keep_connections") == 0)
+			strcmp(def->defname, "keep_connections") == 0 ||
+			strcmp(def->defname, "binary_format") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
@@ -253,6 +254,9 @@ InitPgFdwOptions(void)
 		/* async_capable is available on both server and table */
 		{"async_capable", ForeignServerRelationId, false},
 		{"async_capable", ForeignTableRelationId, false},
+		/* binary_format is available on both server and table */
+		{"binary_format", ForeignServerRelationId, false},
+		{"binary_format", ForeignTableRelationId, false},
 		{"parallel_commit", ForeignServerRelationId, false},
 		{"keep_connections", ForeignServerRelationId, false},
 		{"password_required", UserMappingRelationId, false},
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 8d7500abfb..9344b6f5fc 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -76,6 +76,8 @@ enum FdwScanPrivateIndex
 	FdwScanPrivateRetrievedAttrs,
 	/* Integer representing the desired fetch_size */
 	FdwScanPrivateFetchSize,
+	/* Boolean flag showing whether to use binary or text libpq protocol */
+	FdwScanPrivateBinaryFormat,
 
 	/*
 	 * String describing join i.e. names of relations being joined and types
@@ -128,7 +130,8 @@ enum FdwDirectModifyPrivateIndex
 	/* Integer list of attribute numbers retrieved by RETURNING */
 	FdwDirectModifyPrivateRetrievedAttrs,
 	/* set-processed flag (as a Boolean node) */
-	FdwDirectModifyPrivateSetProcessed
+	FdwDirectModifyPrivateSetProcessed,
+	FdwDirectModifyPrivateBinaryFormat
 };
 
 /*
@@ -154,6 +157,7 @@ typedef struct PgFdwScanState
 	FmgrInfo   *param_flinfo;	/* output conversion functions for them */
 	List	   *param_exprs;	/* executable expressions for param values */
 	const char **param_values;	/* textual values of query parameters */
+	int		   *param_lengths;
 
 	/* for storing result tuples */
 	HeapTuple  *tuples;			/* array of currently-retrieved tuples */
@@ -172,6 +176,7 @@ typedef struct PgFdwScanState
 	MemoryContext temp_cxt;		/* context for per-tuple temporary data */
 
 	int			fetch_size;		/* number of tuples per fetch */
+	bool		binary_format;	/* whether to use libpq binary or text format */
 } PgFdwScanState;
 
 /*
@@ -195,6 +200,7 @@ typedef struct PgFdwModifyState
 	int			batch_size;		/* value of FDW option "batch_size" */
 	bool		has_returning;	/* is there a RETURNING clause? */
 	List	   *retrieved_attrs;	/* attr numbers retrieved by RETURNING */
+	bool		binary_format;
 
 	/* info about parameters for prepared statement */
 	AttrNumber	ctidAttno;		/* attnum of input resjunk ctid column */
@@ -225,7 +231,7 @@ typedef struct PgFdwDirectModifyState
 	bool		has_returning;	/* is there a RETURNING clause? */
 	List	   *retrieved_attrs;	/* attr numbers retrieved by RETURNING */
 	bool		set_processed;	/* do we set the command es_processed? */
-
+	bool		binary_format;
 	/* for remote query execution */
 	PGconn	   *co
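To illustrate the text-vs-binary difference that the 'binary_format' option toggles: in text format libpq ships parameters and results as strings, while in binary format it ships the type's fixed-width, network-byte-order wire encoding. The sketch below models only the int4 case in Python and is not postgres_fdw code; the helper names are made up for illustration.

```python
import struct

def int4_text(v):
    """Text-format parameter: decimal digits as ASCII bytes."""
    return str(v).encode("ascii")

def int4_binary(v):
    """Binary-format parameter: 4 bytes, network (big-endian) byte order."""
    return struct.pack("!i", v)

value = 1234567
text_wire = int4_text(value)      # b"1234567" -- 7 bytes, needs parsing
binary_wire = int4_binary(value)  # b"\x00\x12\xd6\x87" -- fixed 4 bytes

# Round-tripping the binary form is a single unpack, no string parsing.
decoded = struct.unpack("!i", binary_wire)[0]
```

The expected win is skipping datatype output/input functions on both ends, which is also why the option only makes sense when the two servers agree on the binary representation of every shipped type.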
Re: Segfault on logical replication to partitioned table with foreign children
On Wed, 2022-11-02 at 12:37 -0400, Tom Lane wrote:
> Since we're getting pretty close to the next set of back-branch releases, I went ahead and pushed a minimal fix along the lines suggested by Shi Yu. (I realized that there's a second ExecFindPartition call in worker.c that also needs a check.) We still can at leisure think about refactoring CheckSubscriptionRelkind to avoid unnecessary lookups. I think that is something we should do only in HEAD; it'll just be a marginal savings, not worth the risks of API changes in stable branches. The other loose end is whether to worry about a regression test case. I'm inclined not to bother. The only thing that isn't getting exercised is the actual ereport, which probably isn't in great need of routine testing.
>
> regards, tom lane

I agree that early checks limit some of the functionality that was available before, so I guess the only way to preserve it is to do only the last-minute checks after routing, fair enough. As for the refactoring of the premature lookup, I have done some work on that in the previous patch; I think we can just use it. I have attached a separate patch with the refactoring.
From 004c63a8eba777be739f062cdc9b7ddcf2eac253 Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev
Date: Thu, 3 Nov 2022 11:39:24 +0400
Subject: [PATCH] Delay namespace and relname lookups until error

---
 src/backend/commands/subscriptioncmds.c    | 12
 src/backend/executor/execReplication.c     | 13 ++---
 src/backend/replication/logical/relation.c |  2 +-
 src/backend/replication/logical/worker.c   |  8 ++--
 src/include/executor/executor.h            |  2 +-
 5 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f0cec2ad5e..0c3ad698eb 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -700,12 +700,14 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 		{
 			RangeVar   *rv = (RangeVar *) lfirst(lc);
 			Oid			relid;
+			Relation	sub_rel;
 
 			relid = RangeVarGetRelid(rv, AccessShareLock, false);
+			sub_rel = RelationIdGetRelation(relid);
 
 			/* Check for supported relkind. */
-			CheckSubscriptionRelkind(get_rel_relkind(relid),
-									 rv->schemaname, rv->relname);
+			CheckSubscriptionRelkind(sub_rel, rv->schemaname, rv->relname);
+			RelationClose(sub_rel);
 
 			AddSubscriptionRelState(subid, relid, table_state,
 									InvalidXLogRecPtr);
@@ -861,12 +863,14 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data,
 		{
 			RangeVar   *rv = (RangeVar *) lfirst(lc);
 			Oid			relid;
+			Relation	sub_rel;
 
 			relid = RangeVarGetRelid(rv, AccessShareLock, false);
+			sub_rel = RelationIdGetRelation(relid);
 
 			/* Check for supported relkind. */
-			CheckSubscriptionRelkind(get_rel_relkind(relid),
-									 rv->schemaname, rv->relname);
+			CheckSubscriptionRelkind(sub_rel, rv->schemaname, rv->relname);
+			RelationClose(sub_rel);
 
 			pubrel_local_oids[off++] = relid;
 
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 6014f2e248..a0a2d326db 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -649,16 +649,23 @@ CheckCmdReplicaIdentity(Relation rel, CmdType cmd)
 /*
  * Check if we support writing into specific relkind.
  *
- * The nspname and relname are only needed for error reporting.
+ * If nspname and relname are not NULL, they are used for error reporting,
+ * otherwise these values are fetched from relcache.
  */
 void
-CheckSubscriptionRelkind(char relkind, const char *nspname,
-						 const char *relname)
+CheckSubscriptionRelkind(Relation rel, const char *nspname, const char *relname)
 {
+	char		relkind = rel->rd_rel->relkind;
+
 	if (relkind != RELKIND_RELATION && relkind != RELKIND_PARTITIONED_TABLE)
+	{
+		if (relname == NULL)
+			relname = RelationGetRelationName(rel);
+		if (nspname == NULL)
+			nspname = get_namespace_name(RelationGetNamespace(rel));
 		ereport(ERROR,
 				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 				 errmsg("cannot use relation \"%s.%s\" as logical replication target",
 						nspname, relname),
 				 errdetail_relkind_not_supported(relkind)));
+	}
 }
 
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index e989047681..e98031e272 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -393,7 +393,7 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 		entry->localreloid = relid;
 
 		/* Check for supported relkind. */
-		CheckSubscriptionRelkind(entry->localrel->rd_rel->relkind,
+		CheckSubscriptionRelkind(entry->localrel,
Re: Segfault on logical replication to partitioned table with foreign children
On Mon, 2022-10-31 at 03:20 +, shiy.f...@fujitsu.com wrote:
> On Sun, Oct 30, 2022 9:39 PM Tom Lane wrote:
>>
>> What I'm wondering about is whether we can refactor this code to avoid so many usually-useless catalog lookups. Pulling the namespace name, in particular, is expensive and we generally are not going to need the result. In the child-rel case it'd be much better to pass the opened relation and let the error-check subroutine work from that. Maybe we should just do it like that at the start of the recursion, too? Or pass the relid and let the subroutine look up the names only in the error case.
>>
>> A completely different line of thought is that this doesn't seem like a terribly bulletproof fix, since children could get added to a partitioned table after we look. Maybe it'd be better to check the relkind at the last moment before we do something that depends on a child table being a relation.
>
> I agree. So maybe we can add this check in apply_handle_tuple_routing().
>
> diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
> index 5250ae7f54..e941b68e4b 100644
> --- a/src/backend/replication/logical/worker.c
> +++ b/src/backend/replication/logical/worker.c
> @@ -2176,6 +2176,10 @@ apply_handle_tuple_routing(ApplyExecutionData *edata,
>  	Assert(partrelinfo != NULL);
>  	partrel = partrelinfo->ri_RelationDesc;
> 
> +	/* Check for supported relkind. */
> +	CheckSubscriptionRelkind(partrel->rd_rel->relkind,
> +							 relmapentry->remoterel.nspname,
> +							 relmapentry->remoterel.relname);
> +
>  	/*
>  	 * To perform any of the operations below, the tuple must match the
>  	 * partition's rowtype.  Convert if needed or just copy, using a dedicated
>
> Regards,
> Shi yu

I have verified that the current patch handles the attaching of new partitions to the target partitioned table by throwing an error on an attempt to insert into a foreign table inside the logical replication worker.
I have refactored the code to minimize cache lookups, but I am yet to write the tests for this. See the attached patch for the refactored version.

From 56086185324e12e9db6add40f84bc7e60867f1d6 Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev
Date: Tue, 1 Nov 2022 01:42:44 +0400
Subject: [PATCH v2] check relkind of subscription target recursively

---
 src/backend/commands/subscriptioncmds.c    | 18
 src/backend/executor/execReplication.c     | 51 ++
 src/backend/replication/logical/relation.c |  6 +--
 src/include/executor/executor.h            |  4 +-
 4 files changed, 57 insertions(+), 22 deletions(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f0cec2ad5e..2ae89f9e03 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -700,12 +700,13 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 		{
 			RangeVar   *rv = (RangeVar *) lfirst(lc);
 			Oid			relid;
+			Relation	rel;
 
 			relid = RangeVarGetRelid(rv, AccessShareLock, false);
-
-			/* Check for supported relkind. */
-			CheckSubscriptionRelkind(get_rel_relkind(relid),
-									 rv->schemaname, rv->relname);
+			rel = RelationIdGetRelation(relid);
+			/* Check for supported relkind recursively. */
+			CheckSubscriptionRelation(rel, rv->schemaname, rv->relname);
+			RelationClose(rel);
 
 			AddSubscriptionRelState(subid, relid, table_state,
 									InvalidXLogRecPtr);
@@ -861,12 +862,13 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data,
 		{
 			RangeVar   *rv = (RangeVar *) lfirst(lc);
 			Oid			relid;
+			Relation	rel;
 
 			relid = RangeVarGetRelid(rv, AccessShareLock, false);
-
-			/* Check for supported relkind. */
-			CheckSubscriptionRelkind(get_rel_relkind(relid),
-									 rv->schemaname, rv->relname);
+			rel = RelationIdGetRelation(relid);
+			/* Check for supported relkind recursively. */
+			CheckSubscriptionRelation(rel, rv->schemaname, rv->relname);
+			RelationClose(rel);
 
 			pubrel_local_oids[off++] = relid;
 
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 6014f2e248..9da0b6b3a1 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -16,9 +16,11 @@
 
 #include "access/genam.h"
 #include "access/relscan.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xact.h"
+#include "catalog/pg_inherits.h"
 #include "commands/trigger.h
Re: Segfault on logical replication to partitioned table with foreign children
On Sun, 2022-10-30 at 16:52 +0100, Alvaro Herrera wrote:
> On 2022-Oct-28, ilya.v.gladys...@gmail.com wrote:
>
>> This will cause a segfault or raise an assert, because inserting into foreign tables via logical replication is not possible. The solution I propose is to add recursive checks of relkind for children of a target, if the target is a partitioned table.
>
> However, I imagine that the only reason we don't support this is that the code hasn't been written yet. I think it would be better to write that code, so that we don't have to raise any error at all (unless the foreign table is something that doesn't support DML, in which case we would have to raise an error). Of course, we still have to fix it in backbranches, but we can just do it as a targeted check at the moment of insert/update, not at the moment of subscription create/alter.

Sure, this patch is just a quick fix. A proper implementation of logical replication into foreign tables would be a much more difficult undertaking. I think this patch is simple enough; the checks in the patch are performed both on subscription DDL and when the relation is opened for logical replication, so it gives both early feedback and last-minute checks as well. All the code infrastructure for these kinds of checks is already in place, so I think it's a good idea to use it.

P.S. Sorry for duplicating the message, I forgot to cc the mailing list the first time.
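The recursive check discussed in this thread can be modeled compactly: a subscription target is acceptable only if it is a plain or partitioned table and, for a partitioned table, every relation in its partition tree passes the same test. The sketch below is a hypothetical Python model of that logic; the real code walks pg_inherits and inspects pg_class.relkind rather than an in-memory tree.

```python
# relkind codes as used in pg_class
RELKIND_RELATION = "r"
RELKIND_PARTITIONED_TABLE = "p"
RELKIND_FOREIGN_TABLE = "f"

class Rel:
    """Hypothetical stand-in for an opened relation and its partition children."""
    def __init__(self, name, relkind, children=()):
        self.name = name
        self.relkind = relkind
        self.children = list(children)

def check_subscription_relation(rel):
    """Return the first unsupported relation in the partition tree, or None.

    Mirrors the idea of CheckSubscriptionRelation: reject anything that is
    neither a plain table nor a partitioned table, recursing into children.
    """
    if rel.relkind not in (RELKIND_RELATION, RELKIND_PARTITIONED_TABLE):
        return rel
    for child in rel.children:
        bad = check_subscription_relation(child)
        if bad is not None:
            return bad
    return None

# The segfault scenario: a partitioned target with a foreign-table child.
players = Rel("players", RELKIND_PARTITIONED_TABLE, [
    Rel("players_0", RELKIND_RELATION),
    Rel("players_1", RELKIND_FOREIGN_TABLE),
])
offender = check_subscription_relation(players)
```

As the thread notes, this early check alone is not bulletproof, since a foreign partition can be attached after subscription DDL runs, which is why the last-minute check at tuple-routing time is also needed.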
Re: Partial aggregates pushdown
On 01.11.2021 13:30, Alexander Pyhalov wrote:
> On 2021-11-01 12:47, Peter Eisentraut wrote:
>> On 21.10.21 12:55, Alexander Pyhalov wrote:
>>> Now aggregates with internal states can be pushed down, if they are marked as pushdown safe (this flag is set to true for min/max/sum), have internal states and associated converters. Converters are called locally; they transform the aggregate result to the serialized internal representation. As converters don't have access to the internal aggregate state, partial aggregates like avg() are still not pushable.
>>
>> It seems to me that the system should be able to determine from the existing aggregate catalog entry whether an aggregate can be pushed down. For example, it could check aggtranstype != internal and similar. A separate boolean flag should not be necessary.
>
> Hi.
> I think we can't infer this property from existing flags. For example, if I have avg() with bigint[] aggtranstype, it doesn't mean we can push it down. We also couldn't decide whether a partial aggregate is safe to push down based on aggfinalfn presence (for example, it is defined for sum(numeric), but we can push it down).

I think one potential way to do it would be to allow pushing down aggregates that EITHER have a state of the same type as their return type, OR have a conversion function that converts their return value to the type of their state.
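The reason avg() is the standard counter-example here is that its finalized result loses information: the final value drops the row count, so per-server final results cannot be combined, whereas the partial transition state (sum, count) can. A minimal illustration (not PostgreSQL code; the function names are made up to mirror transfn/combinefn/finalfn roles):

```python
def avg_partial(rows):
    """Transition state for avg: (running sum, row count)."""
    return (sum(rows), len(rows))

def avg_combine(a, b):
    """Combine two partial states -- this is what pushdown must ship."""
    return (a[0] + b[0], a[1] + b[1])

def avg_final(state):
    """Finalize: the count is discarded here, irreversibly."""
    s, n = state
    return s / n

local_rows = [1, 2, 3]   # 3 rows on the local server
remote_rows = [10]       # 1 row on the foreign server

# Correct: combine partial states first, finalize once.
correct = avg_final(avg_combine(avg_partial(local_rows),
                                avg_partial(remote_rows)))

# Wrong: finalize per server, then try to combine the final values.
wrong = (avg_final(avg_partial(local_rows)) +
         avg_final(avg_partial(remote_rows))) / 2
```

Here `correct` is 16/4 = 4.0 while `wrong` is (2.0 + 10.0)/2 = 6.0, which is why a converter from the *final result* back to the transition state can only exist for aggregates whose final value still determines the state (min/max/sum), not for avg().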
Re: Partial aggregates pushdown
Hi,

On 21.10.2021 13:55, Alexander Pyhalov wrote:
> Hi. Updated patch. Now aggregates with internal states can be pushed down, if they are marked as pushdown safe (this flag is set to true for min/max/sum), have internal states and associated converters.

I don't quite understand why this is restricted only to aggregates that have 'internal' state; I feel like it should be possible for any aggregate that has a function to convert its final result back to aggregate state to be pushed down. While I couldn't come up with a useful example for this, except maybe for an aggregate whose aggfinalfn is used purely for cosmetic purposes (e.g. to format the result into a string), I still feel that it is an unnecessary restriction.

A few minor review notes on the patch:

+static List *build_conv_list(RelOptInfo *foreignrel);

This should probably be up top among the other declarations.

@@ -1433,6 +1453,48 @@ postgresGetForeignPlan(PlannerInfo *root,
 							outer_plan);
 }
 
+/*
+ * Generate attinmeta if there are some converters:
+ * they are expecxted to return BYTEA, but real input type is likely different.
+ */

Typo in the word "expec*x*ted".

@@ -139,10 +147,13 @@ typedef struct PgFdwScanState
 	 * for a foreign join scan.
 	 */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
+	AttInMetadata *rcvd_attinmeta;	/* metadata for received tuples, NULL if
+									 * there's no converters */

It looks like rcvd_attinmeta is redundant and you could use attinmeta for the conversion metadata.

The last thing: the patch needs to be rebased, it doesn't apply cleanly on top of current master.

Thanks,
Ilya Gladyshev
Re: Update of partition key on foreign server
On 2 Aug 2021, at 15:29, Ilya Gladyshev <i.gladys...@postgrespro.ru> wrote:

Hi,

I am currently looking into a partition constraint violation that occurs on update of a partition key on a foreign server. To reproduce it you can run:

On server 1 using port 5432:

create extension postgres_fdw;
create table players (id integer, name text) partition by list(name);
create table players_0 partition of players for values in ('name1');
create server neighbor foreign data wrapper postgres_fdw options (host 'localhost', port '5433', dbname 'postgres');
create foreign table players_1 partition of players for values in ('name2') server neighbor;
create user mapping for current_user server neighbor;

On server 2 using port 5433:

create extension postgres_fdw;
create server neighbor foreign data wrapper postgres_fdw options (host 'localhost', port '5432', dbname 'postgres');
create table players (id integer, name text) partition by list(name);
create table players_1 partition of players for values in ('name2');
create user mapping for current_user server neighbor;
create foreign table players_0 partition of players for values in ('name1') server neighbor;

insert into players values (1, 'name1');
update players set name='name2' where name='name1';
ERROR:  new row for relation "players_0" violates partition constraint
DETAIL:  Failing row contains (1, name2).
CONTEXT:  remote SQL command: UPDATE public.players_0 SET name = 'name2'::text WHERE ((name = 'name1'::text))

From what I have read on the mailing list, I understand that this is a known problem, but I haven't found any thread discussing it in particular. Is this something that needs fixing? If it is, I want to try to do it, but I'm wondering if there are any known caveats and looking for any tips on how to implement it.
My understanding is that this should be implemented in a similar way to how the row is routed from a local partition in ExecCrossPartitionUpdate, so the update should be replaced with a foreign delete + local/foreign insert. In addition, a direct update should be forbidden when the query modifies the partition key. I'm probably missing a lot of details (feel free to point them out), but is the general idea correct? I will be grateful for any feedback.

Thanks,
Ilya Gladyshev

I have developed a simple patch to fix this; while I'm not fully satisfied with it, it gets the job done. From what I have read in the docs, partition constraints are currently not checked for foreign tables, because they must be enforced on the remote server anyway (since there might be many other ways to access the data source besides the FDW), and thus there's no point in doing that locally. However, in the case of a foreign partition update, checking the constraints locally can be used to allow routing from remote sources, so I don't feel like this point is valid here.

In message [1] there is also a mention of the possibility of triggers existing on the foreign server, and transforming the update into delete+insert will cause these triggers to be skipped. While that is true, I feel like partition pruning already has a similar effect, as it allows skipping scans of foreign partitions. The only way I came up with to both fix the update and have the triggers fire is to use the parent partitioned table as the target for the foreign update (the FDW request would then be "UPDATE public.players …"), but while this is possible, it requires the foreign server to have the same partitioned table, which seems quite restrictive to me. Please let me know if I'm missing something or there is a better way to do it.

Thanks,
Ilya Gladyshev

[1] https://www.postgresql.org/message-id/5A93F487.4080602%40lab.ntt.co.jp

0001-Partition-key-update-in-foreign-table.patch
Description: Binary data
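The delete+insert routing described above can be sketched as a small simulation: when an update changes the partition key so that the row no longer satisfies its current partition's constraint, the row is removed from the old partition and inserted into the one whose constraint it now matches. This is a hypothetical Python model of the list-partitioned `players` example, not the actual executor logic.

```python
# Partitions of the list-partitioned "players" table, keyed by the
# partition-key value each one accepts (players_0 -> 'name1', etc.).
partitions = {
    "name1": [{"id": 1, "name": "name1"}],  # players_0
    "name2": [],                            # players_1 (foreign on server 1)
}

def update_partition_key(parts, row_id, old_name, new_name):
    """Model of a cross-partition update: delete from the source
    partition, re-route the modified row to the matching partition."""
    src = parts[old_name]
    row = next(r for r in src if r["id"] == row_id)
    if new_name == old_name:
        return  # key unchanged: an ordinary in-place update would do
    src.remove(row)                 # "foreign delete" from old partition
    row["name"] = new_name
    parts[new_name].append(row)     # local/foreign insert into new one

# update players set name='name2' where name='name1';
update_partition_key(partitions, 1, "name1", "name2")
```

A direct pushed-down UPDATE cannot do this, because the remote server only sees the leaf table `players_0` and rejects the row against that partition's constraint, which is exactly the ERROR shown in the reproduction above.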
Re: Per query FDW network stat collection
On 24.08.2021 12:19, Julien Rouhaud wrote:
> However I'm not sure that having a new "network" option is the best way for that. It seems confusing as IIUC it won't be catching all network activity (like fe/be activity, or network disk...) but only FDW activity. I think it would be better to have this information retrieved when using the verbose option rather than a new one. Similarly, I'm afraid that INSTRUMENT_NETWORK could be misleading, although I don't have any better proposal right now.

I am also doubtful about this naming. Initially, I wanted to add fe/be activity as one of the metrics, but then decided to restrict myself to FDW for now. However, I decided to leave "network" as it is, because to me it makes sense to have all the network-related metrics under a single explain option (and a single instrumentation flag, perhaps), in case more are added later. The struct fields used for collection internally state explicitly that they are meant to be used only for FDW stats, and the explain output also mentions that the displayed stats are for FDW network activity.
Per query FDW network stat collection
Hello,

I have implemented per-query network stat collection for FDWs. It is done in a similar way to how buffer and WAL stats are collected, and it can be seen with a new NETWORK option for the EXPLAIN command:

explain (analyze, network) insert into itrtest values (2, 'blah');
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Insert on itrtest  (cost=0.00..0.01 rows=0 width=0) (actual time=0.544..0.544 rows=0 loops=1)
   Network: FDW bytes sent=197 received=72, wait_time=0.689
   ->  Result  (cost=0.00..0.01 rows=1 width=36) (actual time=0.003..0.003 rows=1 loops=1)
 Planning Time: 0.025 ms
 Execution Time: 0.701 ms
(5 rows)

I am yet to add corresponding columns to pg_stat_statements and write tests and documentation, but before I go ahead with that, I would like to know what the community thinks about the patch.

Regards,
Ilya Gladyshev

From 3ffbe071480672189c2e03d7e54707c77ba58b0b Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev
Date: Mon, 23 Aug 2021 21:37:31 +0300
Subject: [PATCH] adds per query FDW network usage stats

Adds means for collecting network usage stats and outputting it in explain with NETWORK option. Implements network stats collection for postgres_fdw via adding a hook for stat collection to libpq.
---
 contrib/postgres_fdw/connection.c    | 57 -
 contrib/postgres_fdw/postgres_fdw.c  | 28 +
 src/backend/access/heap/vacuumlazy.c |  5 ++-
 src/backend/access/nbtree/nbtsort.c  | 14 ---
 src/backend/commands/explain.c       | 62
 src/backend/executor/execParallel.c  | 28 ++---
 src/backend/executor/instrument.c    | 57 +
 src/backend/utils/misc/guc.c         | 10 -
 src/include/commands/explain.h       |  1 +
 src/include/executor/execParallel.h  |  1 +
 src/include/executor/instrument.h    | 25 ---
 src/interfaces/libpq/exports.txt     |  1 +
 src/interfaces/libpq/fe-misc.c       |  2 +
 src/interfaces/libpq/fe-secure.c     |  4 ++
 src/interfaces/libpq/libpq-fe.h      |  5 ++-
 15 files changed, 271 insertions(+), 29 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 82aa14a65de..3f479a74ba1 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -571,10 +571,22 @@ void
 do_sql_command(PGconn *conn, const char *sql)
 {
 	PGresult   *res;
+	instr_time	start, duration;
+
+	if (track_fdw_wait_timing)
+		INSTR_TIME_SET_CURRENT(start);
 
 	if (!PQsendQuery(conn, sql))
 		pgfdw_report_error(ERROR, NULL, conn, false, sql);
 	res = pgfdw_get_result(conn, sql);
+
+	if (track_fdw_wait_timing)
+	{
+		INSTR_TIME_SET_CURRENT(duration);
+		INSTR_TIME_SUBTRACT(duration, start);
+		INSTR_TIME_ADD(pgNetUsage.fdw_wait_time, duration);
+	}
+
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
 		pgfdw_report_error(ERROR, res, conn, true, sql);
 	PQclear(res);
@@ -684,10 +696,14 @@ GetPrepStmtNumber(PGconn *conn)
 PGresult *
 pgfdw_exec_query(PGconn *conn, const char *query, PgFdwConnState *state)
 {
+	PGresult   *res;
+	instr_time	start, duration;
+
 	/* First, process a pending asynchronous request, if any. */
 	if (state && state->pendingAreq)
 		process_pending_request(state->pendingAreq);
 
+	if (track_fdw_wait_timing)
+		INSTR_TIME_SET_CURRENT(start);
 	/*
 	 * Submit a query.  Since we don't use non-blocking mode, this also can
 	 * block.  But its risk is relatively small, so we ignore that for now.
@@ -696,7 +712,14 @@ pgfdw_exec_query(PGconn *conn, const char *query, PgFdwConnState *state)
 		pgfdw_report_error(ERROR, NULL, conn, false, query);
 
 	/* Wait for the result. */
-	return pgfdw_get_result(conn, query);
+	res = pgfdw_get_result(conn, query);
+	if (track_fdw_wait_timing)
+	{
+		INSTR_TIME_SET_CURRENT(duration);
+		INSTR_TIME_SUBTRACT(duration, start);
+		INSTR_TIME_ADD(pgNetUsage.fdw_wait_time, duration);
+	}
+	return res;
 }
 
 /*
@@ -717,6 +740,10 @@ pgfdw_get_result(PGconn *conn, const char *query)
 	/* In what follows, do not leak any PGresults on an error. */
 	PG_TRY();
 	{
+		instr_time	start, duration;
+
+		if (track_fdw_wait_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
 		for (;;)
 		{
 			PGresult   *res;
@@ -750,6 +777,13 @@ pgfdw_get_result(PGconn *conn, const char *query)
 			PQclear(last_res);
 			last_res = res;
 		}
+
+		if (track_fdw_wait_timing)
+		{
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			INSTR_TIME_ADD(pgNetUsage.fdw_wait_time, duration);
+		}
 	}
 	PG_CATCH();
 	{
@@ -893,7 +927,18 @@ pgfdw_xact_callback(XactEvent event, void *arg)
 				 */
 				if (entry->have_prep_stmt && entry->have_error)
 				{
+					instr_time	start, duration;
+
+					if (track_fdw_wait_timing)
+						INSTR_TIME_SET_CURRENT(start);
+
 					res = PQexec(entry->conn, "DEALLOCATE