RE: Failed transaction statistics to measure the logical replication progress
Hi On Friday, March 25, 2022 2:36 PM Masahiko Sawada wrote: > On Thu, Mar 24, 2022 at 12:30 PM Amit Kapila > wrote: > > > > On Thu, Mar 3, 2022 at 8:58 AM osumi.takami...@fujitsu.com > > wrote: > > > > > > > This patch introduces two new subscription statistics columns > > (apply_commit_count and apply_rollback_count) to the > > pg_stat_subscription_stats view for counting cumulative transactions > > commits/rollbacks for a particular subscription. Now, users can > > already see the total number of xacts committed/rolled back in a > > particular database via pg_stat_database, so this can be considered > > duplicate information. > > Right. ... > > I am not sure if it is worth adding this additional information or how > > useful it will be for users. Does anyone else want to weigh in on > > this? > > > > If nobody else sees value in this then I feel it is better to mark > > this patch as Returned with feedback or Rejected. We can come back to > > it later if we see more demand for this. > > Marking as Returned with feedback makes sense to me. OK. Thank you so much for sharing your opinions, Sawada-san and Amit-san. I changed the status of this entry to "Returned with feedback" accordingly. Best Regards, Takamichi Osumi
Re: Failed transaction statistics to measure the logical replication progress
On Thu, Mar 24, 2022 at 12:30 PM Amit Kapila wrote: > > On Thu, Mar 3, 2022 at 8:58 AM osumi.takami...@fujitsu.com > wrote: > > > > This patch introduces two new subscription statistics columns > (apply_commit_count and apply_rollback_count) to the > pg_stat_subscription_stats view for counting cumulative transactions > commits/rollbacks for a particular subscription. Now, users can > already see the total number of xacts committed/rolled back in a > particular database via pg_stat_database, so this can be considered > duplicate information. Right. > OTOH, some users might be interested in the > stats for a subscription to know how many transactions are > successfully applied during replication because the information in > pg_stat_database also includes the operations that happened on the > node. I'm not sure how useful this information is in practice. What can we use this information for? IIRC the original purpose of this proposed feature is to provide a way for the users to understand the size and count of the succeeded and failed transactions. At some point, the patch includes the statistics of only the counts of commits, rollbacks, and errors. If new statistics also include the size, it might be useful to achieve the original goal. But I’m concerned that adding only apply_commit_count and apply_rollback_count ends up adding the duplicate statistics with no concrete use cases. > I am not sure if it is worth adding this additional information or how > useful it will be for users. Does anyone else want to weigh in on > this? > > If nobody else sees value in this then I feel it is better to mark > this patch as Returned with feedback or Rejected. We can come back to > it later if we see more demand for this. Marking as Returned with feedback makes sense to me. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Failed transaction statistics to measure the logical replication progress
On Thu, Mar 3, 2022 at 8:58 AM osumi.takami...@fujitsu.com wrote: > This patch introduces two new subscription statistics columns (apply_commit_count and apply_rollback_count) to the pg_stat_subscription_stats view for counting cumulative transactions commits/rollbacks for a particular subscription. Now, users can already see the total number of xacts committed/rolled back in a particular database via pg_stat_database, so this can be considered duplicate information. OTOH, some users might be interested in the stats for a subscription to know how many transactions are successfully applied during replication because the information in pg_stat_database also includes the operations that happened on the node. I am not sure if it is worth adding this additional information or how useful it will be for users. Does anyone else want to weigh in on this? If nobody else sees value in this then I feel it is better to mark this patch as Returned with feedback or Rejected. We can come back to it later if we see more demand for this. -- With Regards, Amit Kapila.
RE: Failed transaction statistics to measure the logical replication progress
On Wednesday, March 2, 2022 5:29 PM Shi, Yu/侍 雨 wrote: > A comments on the v26 patch. Thank you for checking the patch ! > > The following document about pg_stat_subscription_stats view only says that > "showing statistics about errors", should we add something about transactions > here? > > > > pg_stat_subscription_stats >pg_stat_subscription_stats > One row per subscription, showing statistics about errors. > See > pg_stat_subscription_stats for > details. > > > > > I noticed that the v24 patch has some changes about the description of this > view. Maybe we can modify to "showing statistics about errors and > transactions". You are right. Fixed. New patch v27 that incorporated your comments is shared in [1]. Kindly have a look at it. [1] - https://www.postgresql.org/message-id/TYCPR01MB837345FB72E3CE9C827AF42CED049%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Wednesday, March 2, 2022 2:18 PM Masahiko Sawada wrote: > On Wed, Mar 2, 2022 at 10:21 AM osumi.takami...@fujitsu.com > wrote: > > Also, I quickly checked other similar views(pg_stat_slru, > > pg_stat_wal_receiver) commit logs, especially when they introduce columns. > > But, I couldn't find column name validations. > > So, I feel this is aligned. > > > > I've looked at v26 patch and here are some random comments: Hi, thank you for reviewing ! > +/* determine the subscription entry */ > +Oidm_subid; > + > +PgStat_Counter apply_commit_count; > +PgStat_Counter apply_rollback_count; > > I think it's better to add the prefix "m_" to apply_commit/rollback_count for > consistency. Fixed. > --- > +/* > + * Increment the counter of commit for subscription statistics. > + */ > +static void > +subscription_stats_incr_commit(void) > +{ > +Assert(OidIsValid(subStats.subid)); > + > +subStats.apply_commit_count++; > +} > + > > I think we don't need the Assert() here since it should not be a problem even > if > subStats.subid is InvalidOid at least in this function. > > If we remove it, we can remove both subscription_stats_incr_commit() and > +subscription_stats_incr_rollback() as well. Removed the Assert() from both functions. > --- > +void > +pgstat_report_subscription_xact(bool force) { > +static TimestampTz last_report = 0; > +PgStat_MsgSubscriptionXact msg; > + > +/* Bailout early if nothing to do */ > +if (!OidIsValid(subStats.subid) || > +(subStats.apply_commit_count == 0 && > subStats.apply_rollback_count == 0)) > +return; > + > > +LogicalRepSubscriptionStats subStats = > +{ > +.subid = InvalidOid, > +.apply_commit_count = 0, > +.apply_rollback_count = 0, > +}; > > Do we need subStats.subid? I think we can pass MySubscription->oid (or > MyLogicalRepWorker->subid) to pgstat_report_subscription_xact() along > with the pointer of the statistics (subStats). That way, we don't need to > expose > subStats. Removed the subStats.subid. Also, now I pass the oid to the pgstat_report_subscription_xact with the pointer of the statistics. > Also, I think it's better to add "Xact" or something to the struct name. For > example, SubscriptionXactStats. Renamed. > --- > + > +typedef struct LogicalRepSubscriptionStats { > +Oidsubid; > + > +int64 apply_commit_count; > +int64 apply_rollback_count; > +} LogicalRepSubscriptionStats; > > We need a description for this struct. > > Probably it is better to declare it in logicalworker.h instead so that > pgstat.c > includes it instead of worker_internal.h? worker_internal.h is the header file > shared by logical replication workers such as apply worker, tablesync worker, > and launcher. So it might not be advisable to include it in pgstat.c. Changed the definition place to logicalworker.h and added some explanations for it. Attached the updated v27. Best Regards, Takamichi Osumi v27-0001-Extend-pg_stat_subscription_stats-to-include-gen.patch Description: v27-0001-Extend-pg_stat_subscription_stats-to-include-gen.patch
RE: Failed transaction statistics to measure the logical replication progress
Hi, A comments on the v26 patch. The following document about pg_stat_subscription_stats view only says that "showing statistics about errors", should we add something about transactions here? pg_stat_subscription_statspg_stat_subscription_stats One row per subscription, showing statistics about errors. See pg_stat_subscription_stats for details. I noticed that the v24 patch has some changes about the description of this view. Maybe we can modify to "showing statistics about errors and transactions". Regards, Shi yu
Re: Failed transaction statistics to measure the logical replication progress
On Wed, Mar 2, 2022 at 10:21 AM osumi.takami...@fujitsu.com wrote: > > > Also, I quickly checked other similar views(pg_stat_slru, > pg_stat_wal_receiver) > commit logs, especially when they introduce columns. > But, I couldn't find column name validations. > So, I feel this is aligned. > I've looked at v26 patch and here are some random comments: +/* determine the subscription entry */ +Oidm_subid; + +PgStat_Counter apply_commit_count; +PgStat_Counter apply_rollback_count; I think it's better to add the prefix "m_" to apply_commit/rollback_count for consistency. --- +/* + * Increment the counter of commit for subscription statistics. + */ +static void +subscription_stats_incr_commit(void) +{ +Assert(OidIsValid(subStats.subid)); + +subStats.apply_commit_count++; +} + I think we don't need the Assert() here since it should not be a problem even if subStats.subid is InvalidOid at least in this function. If we remove it, we can remove both subscription_stats_incr_commit() and +subscription_stats_incr_rollback() as well. --- +void +pgstat_report_subscription_xact(bool force) +{ +static TimestampTz last_report = 0; +PgStat_MsgSubscriptionXact msg; + +/* Bailout early if nothing to do */ +if (!OidIsValid(subStats.subid) || +(subStats.apply_commit_count == 0 && subStats.apply_rollback_count == 0)) +return; + +LogicalRepSubscriptionStats subStats = +{ +.subid = InvalidOid, +.apply_commit_count = 0, +.apply_rollback_count = 0, +}; Do we need subStats.subid? I think we can pass MySubscription->oid (or MyLogicalRepWorker->subid) to pgstat_report_subscription_xact() along with the pointer of the statistics (subStats). That way, we don't need to expose subStats. Also, I think it's better to add "Xact" or something to the struct name. For example, SubscriptionXactStats. --- + +typedef struct LogicalRepSubscriptionStats +{ +Oidsubid; + +int64 apply_commit_count; +int64 apply_rollback_count; +} LogicalRepSubscriptionStats; We need a description for this struct. Probably it is better to declare it in logicalworker.h instead so that pgstat.c includes it instead of worker_internal.h? worker_internal.h is the header file shared by logical replication workers such as apply worker, tablesync worker, and launcher. So it might not be advisable to include it in pgstat.c. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
RE: Failed transaction statistics to measure the logical replication progress
On Tuesday, March 1, 2022 4:12 PM Peter Smith wrote: > Please see below my review comments for v25. > > == > > 1. Commit message > > Introduce cumulative columns of transactions of logical replication subscriber > to the pg_stat_subscription_stats view. > > "cumulative columns of transactions" sounds a bit strange to me. > > SUGGESTED > Introduce 2 new subscription statistics columns (apply_commit_count, and > apply_rollback_count) to the pg_stat_subscription_stats view for counting > cumulative transaction commits/rollbacks. Fixed. > ~~~ > > 2. doc/src/sgml/monitoring.sgml - bug > > The new SGML s have been added in the wrong place! > > I don't think this renders like you expect it does. Please regenerate the > help to > see for yourself. Fixed. > ~~~ > > 3. doc/src/sgml/monitoring.sgml - wording > > + > + Number of transactions rollbacked in this subscription. Both > + ROLLBACK of transaction streamed as > in-progress > + transaction and ROLLBACK PREPARED > increment this > + counter. > + > > BEFORE > Number of transactions rollbacked in this subscription. > > SUGGESTED > Number of transaction rollbacks in this subscription. Fixed. > ~~~ > > 4. doc/src/sgml/monitoring.sgml - wording > > + > + Number of transactions rollbacked in this subscription. Both > + ROLLBACK of transaction streamed as > in-progress > + transaction and ROLLBACK PREPARED > increment this > + counter. > + > > Trying to distinguish between the ROLLBACK of a transaction and of a > streamed in-progress transaction seems to have made this description too > complicated. I don't think the user even cares/knows about this > (in-progress) distinction. So, I think this should just be written more simply > (like the COMMIT part was) > > BEFORE > Both ROLLBACK of transaction streamed as > in-progress transaction and ROLLBACK > PREPARED increment this counter. > > SUGGESTED > Both ROLLBACK and ROLLBACK > PREPARED increment this counter. Fixed. > ~~~ > > 5. Question - column names. > > Just curious why the columns are called "apply_commit_count" and > "apply_rollback_count"? Specifically, what extra meaning do those names have > versus just calling them "commit_count" and "rollback_count"? I think there's possibility that we'll have counters for tablesync commit for example. So, the name prefix avoids the overlap between the possible names. > ~~~ > > 6. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact > > @@ -3421,6 +3425,60 @@ pgstat_send_slru(void) } > > /* -- > + * pgstat_report_subscription_xact() - > + * > + * Send a subscription transaction stats to the collector. > + * The statistics are cleared upon sending. > + * > + * 'force' is true only when the subscription worker process exits. > + * -- > + */ > +void > +pgstat_report_subscription_xact(bool force) > > 6a. > I think this comment should be worded more like the other > pgstat_report_subscption_XXX comments > > BEFORE > Send a subscription transaction stats to the collector. > > SUGGESTED > Tell the collector about subscriptions transaction stats. Fixed. > 6b. > + * 'force' is true only when the subscription worker process exits. > > I thought this comment should just describe what the 'force' param actually > does in this function; not the scenario about who calls it... Fixed. > ~~~ > > 7. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact > > I think the entire function maybe should be relocated to be nearby the other > pgstat_report_subscription_XXX functions in the source. I placed the pgstat_report_subscription_xact below pgstat_report_subscription_drop. Meanwhile, pgstat_recv_subscription_xact, another new function in pgstat.c, is already placed below pgstat_recv_subscription_error, so I kept it as it is. > ~~~ > > 8. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact > > + /* > + * This function can be called even if nothing at all has happened. In > + * this case, there's no need to go forward. > + */ > > Too much information. Clearly, it is possible for this function to be called > for this > case otherwise this code would not exist in the first place :) IMO the comment > can be much simpler but still say all it needs to. > > BEFORE > This function can be called even if nothing at all has happened. In this case, > there's no need to go forward. > SUGGESTED > Bailout early if nothing to do. Fixed. > ~~~ > > 9. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact > > + if (subStats.subid == InvalidOid || > + (subStats.apply_commit_count == 0 && subStats.apply_rollback_count == > + 0)) return; > > Maybe using !OisIsValid(subStats.subid) is better? Fixed. > ~~~ > > 10. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact > > + /* > + * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL > + * msec since we last sent one to avoid
Re: Failed transaction statistics to measure the logical replication progress
Please see below my review comments for v25. == 1. Commit message Introduce cumulative columns of transactions of logical replication subscriber to the pg_stat_subscription_stats view. "cumulative columns of transactions" sounds a bit strange to me. SUGGESTED Introduce 2 new subscription statistics columns (apply_commit_count, and apply_rollback_count) to the pg_stat_subscription_stats view for counting cumulative transaction commits/rollbacks. ~~~ 2. doc/src/sgml/monitoring.sgml - bug The new SGML s have been added in the wrong place! I don't think this renders like you expect it does. Please regenerate the help to see for yourself. ~~~ 3. doc/src/sgml/monitoring.sgml - wording + + Number of transactions rollbacked in this subscription. Both + ROLLBACK of transaction streamed as in-progress + transaction and ROLLBACK PREPARED increment this + counter. + BEFORE Number of transactions rollbacked in this subscription. SUGGESTED Number of transaction rollbacks in this subscription. ~~~ 4. doc/src/sgml/monitoring.sgml - wording + + Number of transactions rollbacked in this subscription. Both + ROLLBACK of transaction streamed as in-progress + transaction and ROLLBACK PREPARED increment this + counter. + Trying to distinguish between the ROLLBACK of a transaction and of a streamed in-progress transaction seems to have made this description too complicated. I don’t think the user even cares/knows about this (in-progress) distinction. So, I think this should just be written more simply (like the COMMIT part was) BEFORE Both ROLLBACK of transaction streamed as in-progress transaction and ROLLBACK PREPARED increment this counter. SUGGESTED Both ROLLBACK and ROLLBACK PREPARED increment this counter. ~~~ 5. Question - column names. Just curious why the columns are called "apply_commit_count" and "apply_rollback_count"? Specifically, what extra meaning do those names have versus just calling them "commit_count" and "rollback_count"? ~~~ 6. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact @@ -3421,6 +3425,60 @@ pgstat_send_slru(void) } /* -- + * pgstat_report_subscription_xact() - + * + * Send a subscription transaction stats to the collector. + * The statistics are cleared upon sending. + * + * 'force' is true only when the subscription worker process exits. + * -- + */ +void +pgstat_report_subscription_xact(bool force) 6a. I think this comment should be worded more like the other pgstat_report_subscption_XXX comments BEFORE Send a subscription transaction stats to the collector. SUGGESTED Tell the collector about subscriptions transaction stats. 6b. + * 'force' is true only when the subscription worker process exits. I thought this comment should just describe what the 'force' param actually does in this function; not the scenario about who calls it... ~~~ 7. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact I think the entire function maybe should be relocated to be nearby the other pgstat_report_subscription_XXX functions in the source. ~~~ 8. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact + /* + * This function can be called even if nothing at all has happened. In + * this case, there's no need to go forward. + */ Too much information. Clearly, it is possible for this function to be called for this case otherwise this code would not exist in the first place :) IMO the comment can be much simpler but still say all it needs to. BEFORE This function can be called even if nothing at all has happened. In this case, there's no need to go forward. SUGGESTED Bailout early if nothing to do. ~~~ 9. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact + if (subStats.subid == InvalidOid || + (subStats.apply_commit_count == 0 && subStats.apply_rollback_count == 0)) + return; Maybe using !OisIsValid(subStats.subid) is better? ~~~ 10. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact + /* + * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL + * msec since we last sent one to avoid overloading the stats + * collector. + */ SUGGESTED (2 sentences instead of 1) Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL msec since we last sent one. This is to avoid overloading the stats collector. ~~~ 11. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact + if (!force) + { + TimestampTz now = GetCurrentTimestamp(); + + /* + * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL + * msec since we last sent one to avoid overloading the stats + * collector. + */ + if (!TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL)) + return; + last_report = now; + } (Yeah, I know there is similar code in this module but 2 wrongs do not make a right) I think logically it is better to put the 'now' and the 'last_report' outside this if (!force)
RE: Failed transaction statistics to measure the logical replication progress
On Friday, February 25, 2022 7:58 AM osumi.takami...@fujitsu.com wrote: > Kindly have a look at v24. Hi. The recent commit(7a85073) has redesigned the view pg_stat_subscription_workers and now we have pg_stat_subscription_stats. Therefore, I rebased my patch so that my statistics patch can be applied on top of the HEAD. In the process of this rebase, I had to drop one column that stored error count for unique errors(which was incremented after confirming the error is not the same as previous one), because the commit tentatively removes the same error check mechanism. Therefore, this patch has apply_commit_count, apply_rollback_count only. I slightly changed minor changes as well so that those can become more aligned. Kindly please have a look at the patch. Best Regards, Takamichi Osumi v25-0001-Extend-pg_stat_subscription_stats-to-include-gen.patch Description: v25-0001-Extend-pg_stat_subscription_stats-to-include-gen.patch
RE: Failed transaction statistics to measure the logical replication progress
On Wednesday, February 23, 2022 3:30 PM Amit Kapila > On Tue, Feb 22, 2022 at 6:45 AM tanghy.f...@fujitsu.com > wrote: > > > > I found a problem when using it. When a replication workers exits, the > > transaction stats should be sent to stats collector if they were not > > sent before because it didn't reach PGSTAT_STAT_INTERVAL. But I saw > > that the stats weren't updated as expected. > > > > I looked into it and found that the replication worker would send the > > transaction stats (if any) before it exits. But it got invalid subid > > in pgstat_send_subworker_xact_stats(), which led to the following result: > > > > postgres=# select pg_stat_get_subscription_worker(0, null); > > pg_stat_get_subscription_worker > > - > > (0,,2,0,00,"",) > > (1 row) > > > > I think that's because subid has already been cleaned when trying to > > send the stats. I printed the value of before_shmem_exit_list, the > > functions in this list would be called in shmem_exit() when the worker > > exits. > > logicalrep_worker_onexit() would clean up the worker info (including > > subid), and > > pgstat_shutdown_hook() would send stats if any. > > logicalrep_worker_onexit() was called before calling > pgstat_shutdown_hook(). > > > > Yeah, I think that is a problem and maybe we can think of solving it by > sending > the stats via logicalrep_worker_onexit before subid is cleared but not sure > that > is a good idea. I feel we need to go back to the idea of v21 for sending stats > instead of using pgstat_report_stat. > I think the one thing which we could improve is to avoid trying to send it > each > time before receiving each message by walrcv_receive and rather try to send it > before we try to wait (WaitLatchOrSocket). > Trying after each message doesn't seem to be required and could lead to some > overhead as well. What do you think? I agree. Fixed. Kindly have a look at v24 shared in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB8373A3E1BE237BAF38185BF2ED3D9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Thursday, February 24, 2022 11:07 AM Masahiko Sawada wrote: > I have some comments on v23 patch: > > @@ -66,6 +66,12 @@ typedef struct LogicalRepWorker > TimestampTz last_recv_time; > XLogRecPtr reply_lsn; > TimestampTz reply_time; > + > + /* > +* Transaction statistics of subscription worker > +*/ > + int64 commit_count; > + int64 abort_count; > } LogicalRepWorker; > > I think that adding these statistics to the struct whose data is allocated on > the > shared memory is not a good idea since they don't need to be shared. We might > want to add more statistics for subscriptions such as insert_count and > update_count in the future. I think it's better to track these statistics in > local > memory either in worker.c or pgstat.c. Fixed. > +/* -- > + * pgstat_report_subworker_xact_end() - > + * > + * Update the statistics of subscription worker and have > + * pgstat_report_stat send a message to stats collector > + * after count increment. > + * -- > + */ > +void > +pgstat_report_subworker_xact_end(bool is_commit) { > +if (is_commit) > +MyLogicalRepWorker->commit_count++; > +else > +MyLogicalRepWorker->abort_count++; > +} > > It's slightly odd and it seems unnecessary to me that we modify fields of > MyLogicalRepWorker in pgstat.c. Although this function has “report” > in its name but it just increments the counter. I think we can do that in > worker.c. Fixed. Also, I made the timing adjustment logic back and now have the independent one as Amit-san suggested in [1]. Kindly have a look at v24. [1] - https://www.postgresql.org/message-id/CAA4eK1LWYc15%3DASj1tMTEFsXtxu%3D02aGoMwq9YanUVr9-QMhdQ%40mail.gmail.com Best Regards, Takamichi Osumi v24-0001-Extend-pg_stat_subscription_workers-to-include-g.patch Description: v24-0001-Extend-pg_stat_subscription_workers-to-include-g.patch
Re: Failed transaction statistics to measure the logical replication progress
On Mon, Feb 21, 2022 at 12:45 PM osumi.takami...@fujitsu.com wrote: > > On Saturday, February 19, 2022 12:00 AM osumi.takami...@fujitsu.com > wrote: > > On Friday, February 18, 2022 3:34 PM Tang, Haiying/唐 海英 > > wrote: > > > On Wed, Jan 12, 2022 8:35 PM osumi.takami...@fujitsu.com > > > wrote: > > > 4) I noticed that the abort_count doesn't include aborted streaming > > > transactions. > > > Should we take this case into consideration? > > Hmm, we can add this into this column, when there's no objection. > > I'm not sure but someone might say those should be separate columns. > I've addressed this point in a new v23 patch, > since there was no opinion on this so far. > > Kindly have a look at the attached one. > I have some comments on v23 patch: @@ -66,6 +66,12 @@ typedef struct LogicalRepWorker TimestampTz last_recv_time; XLogRecPtr reply_lsn; TimestampTz reply_time; + + /* +* Transaction statistics of subscription worker +*/ + int64 commit_count; + int64 abort_count; } LogicalRepWorker; I think that adding these statistics to the struct whose data is allocated on the shared memory is not a good idea since they don't need to be shared. We might want to add more statistics for subscriptions such as insert_count and update_count in the future. I think it's better to track these statistics in local memory either in worker.c or pgstat.c. +/* -- + * pgstat_report_subworker_xact_end() - + * + * Update the statistics of subscription worker and have + * pgstat_report_stat send a message to stats collector + * after count increment. + * -- + */ +void +pgstat_report_subworker_xact_end(bool is_commit) +{ +if (is_commit) +MyLogicalRepWorker->commit_count++; +else +MyLogicalRepWorker->abort_count++; +} It's slightly odd and it seems unnecessary to me that we modify fields of MyLogicalRepWorker in pgstat.c. Although this function has “report” in its name but it just increments the counter. I think we can do that in worker.c. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Failed transaction statistics to measure the logical replication progress
On Tue, Feb 22, 2022 at 6:45 AM tanghy.f...@fujitsu.com wrote: > > I found a problem when using it. When a replication workers exits, the > transaction stats should be sent to stats collector if they were not sent > before > because it didn't reach PGSTAT_STAT_INTERVAL. But I saw that the stats weren't > updated as expected. > > I looked into it and found that the replication worker would send the > transaction stats (if any) before it exits. But it got invalid subid in > pgstat_send_subworker_xact_stats(), which led to the following result: > > postgres=# select pg_stat_get_subscription_worker(0, null); > pg_stat_get_subscription_worker > - > (0,,2,0,00,"",) > (1 row) > > I think that's because subid has already been cleaned when trying to send the > stats. I printed the value of before_shmem_exit_list, the functions in this > list > would be called in shmem_exit() when the worker exits. > logicalrep_worker_onexit() would clean up the worker info (including subid), > and > pgstat_shutdown_hook() would send stats if any. logicalrep_worker_onexit() > was > called before calling pgstat_shutdown_hook(). > Yeah, I think that is a problem and maybe we can think of solving it by sending the stats via logicalrep_worker_onexit before subid is cleared but not sure that is a good idea. I feel we need to go back to the idea of v21 for sending stats instead of using pgstat_report_stat. I think the one thing which we could improve is to avoid trying to send it each time before receiving each message by walrcv_receive and rather try to send it before we try to wait (WaitLatchOrSocket). Trying after each message doesn't seem to be required and could lead to some overhead as well. What do you think? -- With Regards, Amit Kapila.
RE: Failed transaction statistics to measure the logical replication progress
On Tuesday, February 22, 2022 10:15 AM Tang, Haiying/唐 海英 wrote: > On Mon, Feb 21, 2022 11:46 AM osumi.takami...@fujitsu.com > wrote: > > > > On Saturday, February 19, 2022 12:00 AM osumi.takami...@fujitsu.com > > wrote: > > > On Friday, February 18, 2022 3:34 PM Tang, Haiying/唐 海英 > > > wrote: > > > > On Wed, Jan 12, 2022 8:35 PM osumi.takami...@fujitsu.com > > > > wrote: > > > > 4) I noticed that the abort_count doesn't include aborted > > > > streaming transactions. > > > > Should we take this case into consideration? > > > Hmm, we can add this into this column, when there's no objection. > > > I'm not sure but someone might say those should be separate columns. > > I've addressed this point in a new v23 patch, since there was no > > opinion on this so far. > > > > Kindly have a look at the attached one. > > > > Thanks for updating the patch. > > I found a problem when using it. When a replication workers exits, the > transaction stats should be sent to stats collector if they were not sent > before > because it didn't reach PGSTAT_STAT_INTERVAL. But I saw that the stats > weren't updated as expected. > > I looked into it and found that the replication worker would send the > transaction > stats (if any) before it exits. But it got invalid subid in > pgstat_send_subworker_xact_stats(), which led to the following result: > > postgres=# select pg_stat_get_subscription_worker(0, null); > pg_stat_get_subscription_worker > - > (0,,2,0,00,"",) > (1 row) > > I think that's because subid has already been cleaned when trying to send the > stats. I printed the value of before_shmem_exit_list, the functions in this > list > would be called in shmem_exit() when the worker exits. > logicalrep_worker_onexit() would clean up the worker info (including subid), > and > pgstat_shutdown_hook() would send stats if any. logicalrep_worker_onexit() > was called before calling pgstat_shutdown_hook(). > > (gdb) p before_shmem_exit_list > $1 = {{function = 0xa88f1e , arg = 0}, {function = > 0xb619e7 , arg = 0}, {function = 0xb07b5c > , arg = 0}, { > function = 0xabdd93 , arg = 0}, {function = > 0xe30c89 , arg = 0}, {function = 0x0, arg = 0} times>} > > Maybe we should make some modification to fix it. Thank you for letting me know this issue. I'll investigate this and will report the result. Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Mon, Feb 21, 2022 11:46 AM osumi.takami...@fujitsu.com wrote: > > On Saturday, February 19, 2022 12:00 AM osumi.takami...@fujitsu.com > wrote: > > On Friday, February 18, 2022 3:34 PM Tang, Haiying/唐 海英 > > wrote: > > > On Wed, Jan 12, 2022 8:35 PM osumi.takami...@fujitsu.com > > > wrote: > > > 4) I noticed that the abort_count doesn't include aborted streaming > > > transactions. > > > Should we take this case into consideration? > > Hmm, we can add this into this column, when there's no objection. > > I'm not sure but someone might say those should be separate columns. > I've addressed this point in a new v23 patch, > since there was no opinion on this so far. > > Kindly have a look at the attached one. > Thanks for updating the patch. I found a problem when using it. When a replication workers exits, the transaction stats should be sent to stats collector if they were not sent before because it didn't reach PGSTAT_STAT_INTERVAL. But I saw that the stats weren't updated as expected. I looked into it and found that the replication worker would send the transaction stats (if any) before it exits. But it got invalid subid in pgstat_send_subworker_xact_stats(), which led to the following result: postgres=# select pg_stat_get_subscription_worker(0, null); pg_stat_get_subscription_worker - (0,,2,0,00,"",) (1 row) I think that's because subid has already been cleaned when trying to send the stats. I printed the value of before_shmem_exit_list, the functions in this list would be called in shmem_exit() when the worker exits. logicalrep_worker_onexit() would clean up the worker info (including subid), and pgstat_shutdown_hook() would send stats if any. logicalrep_worker_onexit() was called before calling pgstat_shutdown_hook(). (gdb) p before_shmem_exit_list $1 = {{function = 0xa88f1e , arg = 0}, {function = 0xb619e7 , arg = 0}, {function = 0xb07b5c , arg = 0}, { function = 0xabdd93 , arg = 0}, {function = 0xe30c89 , arg = 0}, {function = 0x0, arg = 0} } Maybe we should make some modification to fix it. Regards, Tang
RE: Failed transaction statistics to measure the logical replication progress
On Monday, February 21, 2022 6:06 PM Monday, February 21, 2022 6:06 PM wrote: > On Mon, Feb 21, 2022 11:46 AM Osumi, Takamichi/大墨 昂道 > wrote: > >I've addressed this point in a new v23 patch, since there was no opinion on > this so far. > >Kindly have a look at the attached one. > Thanks for updating the patch. Here is a comment: > > In function apply_handle_stream_abort: > @@ -1217,6 +1219,7 @@ apply_handle_stream_abort(StringInfo s) > { > set_apply_error_context_xact(xid, 0); > stream_cleanup_files(MyLogicalRepWorker->subid, xid); > + pgstat_report_subworker_xact_end(false); > } > else > { > > I think there is a problem here, pgstat_report_stat is not invoked here. > While the other three places where function > pgstat_report_subworker_xact_end is invoked, the function pgstat_report_stat > is invoked. > Do we need to invoke pgstat_report_stat in apply_handle_stream_abort? Hi, I had tested this case before I posted the latest patch v23. It works when I call pg_stat_report_stat by other transaction. But, if we want to add pgstat_report_stat here, I need to investigate the impact of the addition. I'll check it and let you know. Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Mon, Feb 21, 2022 11:46 AM Osumi, Takamichi/大墨 昂道 wrote: >I've addressed this point in a new v23 patch, since there was no opinion on >this so far. >Kindly have a look at the attached one. Thanks for updating the patch. Here is a comment: In function apply_handle_stream_abort: @@ -1217,6 +1219,7 @@ apply_handle_stream_abort(StringInfo s) { set_apply_error_context_xact(xid, 0); stream_cleanup_files(MyLogicalRepWorker->subid, xid); + pgstat_report_subworker_xact_end(false); } else { I think there is a problem here, pgstat_report_stat is not invoked here. While the other three places where function pgstat_report_subworker_xact_end is invoked, the function pgstat_report_stat is invoked. Do we need to invoke pgstat_report_stat in apply_handle_stream_abort? Regards, Wang wei
RE: Failed transaction statistics to measure the logical replication progress
On Saturday, February 19, 2022 12:00 AM osumi.takami...@fujitsu.com wrote: > On Friday, February 18, 2022 3:34 PM Tang, Haiying/唐 海英 > wrote: > > On Wed, Jan 12, 2022 8:35 PM osumi.takami...@fujitsu.com > > wrote: > > 4) I noticed that the abort_count doesn't include aborted streaming > > transactions. > > Should we take this case into consideration? > Hmm, we can add this into this column, when there's no objection. > I'm not sure but someone might say those should be separate columns. I've addressed this point in a new v23 patch, since there was no opinion on this so far. Kindly have a look at the attached one. Best Regards, Takamichi Osumi v23-0001-Extend-pg_stat_subscription_workers-to-include-g.patch Description: v23-0001-Extend-pg_stat_subscription_workers-to-include-g.patch
RE: Failed transaction statistics to measure the logical replication progress
On Friday, February 18, 2022 3:34 PM Tang, Haiying/唐 海英 wrote: > On Wed, Jan 12, 2022 8:35 PM osumi.takami...@fujitsu.com > wrote: > > > > The attached v21 has a couple of other minor updates like a > > modification of error message text. > > > > > > Thanks for updating the patch. Here are some comments. Thank you for your reivew ! > 1) I saw the following description about pg_stat_subscription_workers view in > existing doc: > >The pg_stat_subscription_workers view will > contain >one row per subscription worker on which errors have occurred, ... > > It only says "which errors have occurred", maybe we should also mention > transactions here, right? I wrote about this statistics in the next line but as you pointed out, separating the description into two sentences wasn't good idea. Fixed. > 2) > /* -- > + * pgstat_send_subworker_xact_stats() - > + * > + * Send a subworker's transaction stats to the collector. > + * The statistics are cleared upon return. > > Should "The statistics are cleared upon return" changed to "The statistics are > cleared upon sending"? Because if it doesn't reach PGSTAT_STAT_INTERVAL > and the transaction stats are not sent, the function will return without > clearing > out statistics. Now, the purpose of this function has become purely to send a message and whenever it's called, the function clears the saved stats. So, I skipped this comments now. > 3) > + Assert(command == LOGICAL_REP_MSG_COMMIT || > +command == LOGICAL_REP_MSG_STREAM_COMMIT || > +command == LOGICAL_REP_MSG_COMMIT_PREPARED > || > +command == > LOGICAL_REP_MSG_ROLLBACK_PREPARED); > + > + switch (command) > + { > + case LOGICAL_REP_MSG_COMMIT: > + case LOGICAL_REP_MSG_STREAM_COMMIT: > + case LOGICAL_REP_MSG_COMMIT_PREPARED: > + MyLogicalRepWorker->commit_count++; > + break; > + case LOGICAL_REP_MSG_ROLLBACK_PREPARED: > + MyLogicalRepWorker->abort_count++; > + break; > + default: > + ereport(ERROR, > + errmsg("invalid logical message type > for transaction statistics of subscription")); > + break; > + } > > I'm not sure that do we need the assert, because it will report an error > later if > command is an invalid value. The Assert has been removed, along with the switch branches now. Since there was an adivce that we should call this from apply_handle_commit_internal and from that function, if we don't want to change this function's argument, all we need to do is to pass a boolean value that indicates the stats is commit_count or abort_count. Kindly have a look at the updated version. > 4) I noticed that the abort_count doesn't include aborted streaming > transactions. > Should we take this case into consideration? Hmm, we can add this into this column, when there's no objection. I'm not sure but someone might say those should be separate columns. The new patch v22 is shared in [2]. [2] - https://www.postgresql.org/message-id/TYCPR01MB83737C689F8F310C87C19C1EED379%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Friday, February 18, 2022 8:11 PM Amit Kapila wrote: > On Fri, Feb 18, 2022 at 2:04 PM osumi.takami...@fujitsu.com > wrote: > > > > On Thursday, February 17, 2022 6:45 PM Amit Kapila > wrote: > > > On Thu, Feb 17, 2022 at 3:13 PM Amit Kapila > > > > > > wrote: > > > > Can't we use pgstat_report_stat() here? Basically, you can update > > > > xact completetion counters during apply, and then from > > > > pgstat_report_stat(), you can invoke a logical replication worker > > > > stats-related function. > > > > > > > > > > If we can do this then we can save the logic this patch is trying to > > > introduce for PGSTAT_STAT_INTERVAL. > > Hi, I've encounter a couple of questions during my modification, following > your advice. > > > > In the pgstat_report_stat, we refer to the return value of > > GetCurrentTransactionStopTimestamp to compare the time different from > the last time. > > (In my previous patch, I used GetCurrentTimestamp) > > > > This time is updated in apply_handle_commit_internal's > CommitTransactionCommand for the apply worker. > > Then, if I update the subscription worker > > stats(commit_count/abort_count) immediately after this > > CommitTransactionCommand and immediately call pgstat_report_stat in the > apply_handle_commit_internal, the time difference becomes too small (falls > short of PGSTAT_STAT_INTERVAL). > > Also, the time of GetCurrentTransactionStopTimestamp is not updated > > even when I keep calling pgstat_report_stat repeatedly. > > Then, IIUC the next possible timing that message of commit_count or > > abort_count is sent to the stats collector would become the time when > > we execute another transaction by the apply worker and update the time > > for GetCurrentTransactionStopTimestamp > > and rerun pgstat_report_stat again. > > > > I think but same is true in the case of the transaction in the backend where > we > increment commit counter via AtEOXact_PgStat after updating the transaction > time. After that, we call pgstat_report_stat() at later point. How is this > case > different? > > > So, if we keep GetCurrentTransactionStopTimestamp without change, an > > update of stats depends on another new subsequent transaction if we > simply merge those. > > (this leads to users cannot see the latest stats information ?) > > > > I think this should be okay as these don't need to be accurate. > > > At least, I got a test failure because of this function for streaming > > commit case because it uses poll_query_until to wait for stats update. > > > > I feel it is not a good idea to wait for the accurate update of these > counters. Ah, then I had wrote tests based on totally wrong direction and made noises for it. Sorry for that. I don't see tests for existing xact_commit/rollback count, so I'll follow the same way. Attached a new patch that addresses three major improvements I've got so far as comments. 1. skip increment of stats counter by empty transaction, on the subscriber side (except for commit prepared) 2. utilize the existing pgstat_report_stat, instead of having a similar logic newly. 3. remove the wrong tests. Best Regards, Takamichi Osumi v22-0001-Extend-pg_stat_subscription_workers-to-include-g.patch Description: v22-0001-Extend-pg_stat_subscription_workers-to-include-g.patch
Re: Failed transaction statistics to measure the logical replication progress
On Fri, Feb 18, 2022 at 2:04 PM osumi.takami...@fujitsu.com wrote: > > On Thursday, February 17, 2022 6:45 PM Amit Kapila > wrote: > > On Thu, Feb 17, 2022 at 3:13 PM Amit Kapila > > wrote: > > > Can't we use pgstat_report_stat() here? Basically, you can update xact > > > completetion counters during apply, and then from > > > pgstat_report_stat(), you can invoke a logical replication worker > > > stats-related function. > > > > > > > If we can do this then we can save the logic this patch is trying to > > introduce for > > PGSTAT_STAT_INTERVAL. > Hi, I've encounter a couple of questions during my modification, following > your advice. > > In the pgstat_report_stat, we refer to the return value of > GetCurrentTransactionStopTimestamp to compare the time different from the > last time. > (In my previous patch, I used GetCurrentTimestamp) > > This time is updated in apply_handle_commit_internal's > CommitTransactionCommand for the apply worker. > Then, if I update the subscription worker stats(commit_count/abort_count) > immediately after > this CommitTransactionCommand and immediately call pgstat_report_stat in the > apply_handle_commit_internal, > the time difference becomes too small (falls short of PGSTAT_STAT_INTERVAL). > Also, the time of GetCurrentTransactionStopTimestamp is not updated > even when I keep calling pgstat_report_stat repeatedly. > Then, IIUC the next possible timing that message of commit_count or > abort_count > is sent to the stats collector would become the time when we execute another > transaction > by the apply worker and update the time for GetCurrentTransactionStopTimestamp > and rerun pgstat_report_stat again. > I think but same is true in the case of the transaction in the backend where we increment commit counter via AtEOXact_PgStat after updating the transaction time. After that, we call pgstat_report_stat() at later point. How is this case different? > So, if we keep GetCurrentTransactionStopTimestamp without change, > an update of stats depends on another new subsequent transaction if we simply > merge those. > (this leads to users cannot see the latest stats information ?) > I think this should be okay as these don't need to be accurate. > At least, I got a test failure because of this function for streaming commit > case > because it uses poll_query_until to wait for stats update. > I feel it is not a good idea to wait for the accurate update of these counters. -- With Regards, Amit Kapila.
RE: Failed transaction statistics to measure the logical replication progress
On Thursday, February 17, 2022 6:45 PM Amit Kapila wrote: > On Thu, Feb 17, 2022 at 3:13 PM Amit Kapila > wrote: > > > > On Tue, Jan 4, 2022 at 5:22 PM osumi.takami...@fujitsu.com > > wrote: > > > > > > On Friday, December 31, 2021 10:12 AM Tang, Haiying/唐 海英 > wrote: > > > > 4) > > > > +void > > > > +pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker, > > > > +bool > > > > +force) { > > > > + static TimestampTz last_report = 0; > > > > + PgStat_MsgSubWorkerXactEnd msg; > > > > + > > > > + if (!force) > > > > + { > > > > ... > > > > + if (!TimestampDifferenceExceeds(last_report, now, > > > > PGSTAT_STAT_INTERVAL)) > > > > + return; > > > > + last_report = now; > > > > + } > > > > + > > > > ... > > > > + if (repWorker->commit_count == 0 && repWorker->abort_count > > > > + == > > > > 0) > > > > + return; > > > > ... > > > > > > > > I think it's better to check commit_count and abort_count first, > > > > then check if reach PGSTAT_STAT_INTERVAL. > > > > Otherwise if commit_count and abort_count are 0, it is possible > > > > that the value of last_report has been updated but it didn't send > > > > stats in fact. In this case, last_report is not the real time that send > > > > last > message. > > > Yeah, agreed. This fix is right in terms of the variable name aspect. > > > > > > > Can't we use pgstat_report_stat() here? Basically, you can update xact > > completetion counters during apply, and then from > > pgstat_report_stat(), you can invoke a logical replication worker > > stats-related function. > > > > If we can do this then we can save the logic this patch is trying to > introduce for > PGSTAT_STAT_INTERVAL. Hi, I've encounter a couple of questions during my modification, following your advice. In the pgstat_report_stat, we refer to the return value of GetCurrentTransactionStopTimestamp to compare the time different from the last time. (In my previous patch, I used GetCurrentTimestamp) This time is updated in apply_handle_commit_internal's CommitTransactionCommand for the apply worker. Then, if I update the subscription worker stats(commit_count/abort_count) immediately after this CommitTransactionCommand and immediately call pgstat_report_stat in the apply_handle_commit_internal, the time difference becomes too small (falls short of PGSTAT_STAT_INTERVAL). Also, the time of GetCurrentTransactionStopTimestamp is not updated even when I keep calling pgstat_report_stat repeatedly. Then, IIUC the next possible timing that message of commit_count or abort_count is sent to the stats collector would become the time when we execute another transaction by the apply worker and update the time for GetCurrentTransactionStopTimestamp and rerun pgstat_report_stat again. So, if we keep GetCurrentTransactionStopTimestamp without change, an update of stats depends on another new subsequent transaction if we simply merge those. (this leads to users cannot see the latest stats information ?) At least, I got a test failure because of this function for streaming commit case because it uses poll_query_until to wait for stats update. On the other hand, replacing GetCurrentTransactionStopTimestamp with GetCurrentTimestamp in case of apply worker looks have another negative impact. If we do so, it becomes possible that we go into the code to scan TabStatusArray with PgStat_TableStatus's trans with non-null values, because of the timing change. I might be able to avoid this kind of assert failure if I write the message send function of this patch before other existing functions to send various type of messages and return if the process is apply worker in pgstat_report_stat. But, I can't be convinced that this way of modification is OK. What did you think about those issues ? Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Wed, Jan 12, 2022 8:35 PM osumi.takami...@fujitsu.com wrote: > > The attached v21 has a couple of other minor updates > like a modification of error message text. > > Thanks for updating the patch. Here are some comments. 1) I saw the following description about pg_stat_subscription_workers view in existing doc: The pg_stat_subscription_workers view will contain one row per subscription worker on which errors have occurred, ... It only says "which errors have occurred", maybe we should also mention transactions here, right? 2) /* -- + * pgstat_send_subworker_xact_stats() - + * + * Send a subworker's transaction stats to the collector. + * The statistics are cleared upon return. Should "The statistics are cleared upon return" changed to "The statistics are cleared upon sending"? Because if it doesn't reach PGSTAT_STAT_INTERVAL and the transaction stats are not sent, the function will return without clearing out statistics. 3) + Assert(command == LOGICAL_REP_MSG_COMMIT || + command == LOGICAL_REP_MSG_STREAM_COMMIT || + command == LOGICAL_REP_MSG_COMMIT_PREPARED || + command == LOGICAL_REP_MSG_ROLLBACK_PREPARED); + + switch (command) + { + case LOGICAL_REP_MSG_COMMIT: + case LOGICAL_REP_MSG_STREAM_COMMIT: + case LOGICAL_REP_MSG_COMMIT_PREPARED: + MyLogicalRepWorker->commit_count++; + break; + case LOGICAL_REP_MSG_ROLLBACK_PREPARED: + MyLogicalRepWorker->abort_count++; + break; + default: + ereport(ERROR, + errmsg("invalid logical message type for transaction statistics of subscription")); + break; + } I'm not sure that do we need the assert, because it will report an error later if command is an invalid value. 4) I noticed that the abort_count doesn't include aborted streaming transactions. Should we take this case into consideration? Regards, Tang
Re: Failed transaction statistics to measure the logical replication progress
On Thu, Feb 17, 2022 at 3:13 PM Amit Kapila wrote: > > On Tue, Jan 4, 2022 at 5:22 PM osumi.takami...@fujitsu.com > wrote: > > > > On Friday, December 31, 2021 10:12 AM Tang, Haiying/唐 海英 > > wrote: > > > 4) > > > +void > > > +pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker, bool > > > +force) { > > > + static TimestampTz last_report = 0; > > > + PgStat_MsgSubWorkerXactEnd msg; > > > + > > > + if (!force) > > > + { > > > ... > > > + if (!TimestampDifferenceExceeds(last_report, now, > > > PGSTAT_STAT_INTERVAL)) > > > + return; > > > + last_report = now; > > > + } > > > + > > > ... > > > + if (repWorker->commit_count == 0 && repWorker->abort_count == > > > 0) > > > + return; > > > ... > > > > > > I think it's better to check commit_count and abort_count first, then > > > check if > > > reach PGSTAT_STAT_INTERVAL. > > > Otherwise if commit_count and abort_count are 0, it is possible that the > > > value > > > of last_report has been updated but it didn't send stats in fact. In this > > > case, > > > last_report is not the real time that send last message. > > Yeah, agreed. This fix is right in terms of the variable name aspect. > > > > Can't we use pgstat_report_stat() here? Basically, you can update xact > completetion counters during apply, and then from > pgstat_report_stat(), you can invoke a logical replication worker > stats-related function. > If we can do this then we can save the logic this patch is trying to introduce for PGSTAT_STAT_INTERVAL. -- With Regards, Amit Kapila.
Re: Failed transaction statistics to measure the logical replication progress
On Tue, Jan 4, 2022 at 5:22 PM osumi.takami...@fujitsu.com wrote: > > On Friday, December 31, 2021 10:12 AM Tang, Haiying/唐 海英 > wrote: > > 4) > > +void > > +pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker, bool > > +force) { > > + static TimestampTz last_report = 0; > > + PgStat_MsgSubWorkerXactEnd msg; > > + > > + if (!force) > > + { > > ... > > + if (!TimestampDifferenceExceeds(last_report, now, > > PGSTAT_STAT_INTERVAL)) > > + return; > > + last_report = now; > > + } > > + > > ... > > + if (repWorker->commit_count == 0 && repWorker->abort_count == > > 0) > > + return; > > ... > > > > I think it's better to check commit_count and abort_count first, then check > > if > > reach PGSTAT_STAT_INTERVAL. > > Otherwise if commit_count and abort_count are 0, it is possible that the > > value > > of last_report has been updated but it didn't send stats in fact. In this > > case, > > last_report is not the real time that send last message. > Yeah, agreed. This fix is right in terms of the variable name aspect. > Can't we use pgstat_report_stat() here? Basically, you can update xact completetion counters during apply, and then from pgstat_report_stat(), you can invoke a logical replication worker stats-related function. -- With Regards, Amit Kapila.
Re: Failed transaction statistics to measure the logical replication progress
On Wed, Jan 12, 2022 at 6:04 PM osumi.takami...@fujitsu.com wrote: > > On Thursday, December 23, 2021 6:37 PM Wang, Wei/王 威 > wrote: > > On Wednesday, December 22, 2021 10:30 PM osumi.takami...@fujitsu.com > > wrote: > > > On Wednesday, December 22, 2021 8:38 PM I wrote: > > > > Do we expect these commit counts which come from empty transactions ? > > > This is another issue discussed in [1] where the patch in the thread > > > is a work in progress, I think. > > > .. > > > IMHO, the conclusion is we are currently in the middle of fixing the > > > behavior. > > > > Thank you for telling me this. > > After applying v19-* and > > v15-0001-Skip-empty-transactions-for-logical-replication.patch, > > I retested v19-* patches. The result of previous case looks good to me. > > > > But the results of following cases are also similar to previous unexpected > > result > > which increases commit_count or abort_count unexpectedly. > > [1] > > (Based on environment in the previous example, set TWO_PHASE=true) > > [Publisher] begin; insert into replica_test1 values(1,'1'); prepare > > transaction > > 'id'; commit prepared 'id'; > > > > In subscriber side, the commit_count of two records(sub1 and sub2) is > > increased. > > > > [2] > > (Based on environment in the previous example, set STREAMING=on) > > [Publisher] begin; INSERT INTO replica_test1 SELECT i, md5(i::text) FROM > > generate_series(1, 5000) s(i); commit; > > > > In subscriber side, the commit_count of two records(sub1 and sub2) is > > increased. > > > > [3] > > (Based on environment in the previous example, set TWO_PHASE=true) > > [Publisher] begin; insert into replica_test1 values(1,'1'); prepare > > transaction > > 'id'; rollback prepared 'id'; > > > > In subscriber side, the abort_count of two records(sub1 and sub2) is > > increased. > > > > I think the problem maybe is the patch you mentioned > > (Skip-empty-transactions-for-logical-replication.patch) is not finished yet. > > Share this information here. > Hi, thank you for your report. > > Yes. As the patch's commit message mentions, the patch doesn't > cover streaming and two phase transactions. > > In the above reply, I said that this was an independent issue > and we were in the middle of the modification of the behavior, > but empty transaction turned out to be harmful enough for this feature. > Isn't that because of this patch? I mean the patch is reporting count even when during apply we haven't started any transaction. In particular, if you would have reported stats from apply_handle_commit_internal() when IsTransactionState() returns true, then it shouldn't have updated the stats for an empty transaction. Similarly, I see for rollback_prepared, you don't need to report stats if there is no prepared transaction to rollback. I think for commit_prepare case, we can't do much for empty xacts but why would that be a problem of this patch? I think as far as this patch goes, it reports the number of completed xacts (committed/aborted) in the subscription worker, so, it doesn't need to handle any special cases like empty transactions. -- With Regards, Amit Kapila.
RE: Failed transaction statistics to measure the logical replication progress
On Thursday, December 23, 2021 6:37 PM Wang, Wei/王 威 wrote: > On Wednesday, December 22, 2021 10:30 PM osumi.takami...@fujitsu.com > wrote: > > On Wednesday, December 22, 2021 8:38 PM I wrote: > > > Do we expect these commit counts which come from empty transactions ? > > This is another issue discussed in [1] where the patch in the thread > > is a work in progress, I think. > > .. > > IMHO, the conclusion is we are currently in the middle of fixing the > > behavior. > > Thank you for telling me this. > After applying v19-* and > v15-0001-Skip-empty-transactions-for-logical-replication.patch, > I retested v19-* patches. The result of previous case looks good to me. > > But the results of following cases are also similar to previous unexpected > result > which increases commit_count or abort_count unexpectedly. > [1] > (Based on environment in the previous example, set TWO_PHASE=true) > [Publisher] begin; insert into replica_test1 values(1,'1'); prepare > transaction > 'id'; commit prepared 'id'; > > In subscriber side, the commit_count of two records(sub1 and sub2) is > increased. > > [2] > (Based on environment in the previous example, set STREAMING=on) > [Publisher] begin; INSERT INTO replica_test1 SELECT i, md5(i::text) FROM > generate_series(1, 5000) s(i); commit; > > In subscriber side, the commit_count of two records(sub1 and sub2) is > increased. > > [3] > (Based on environment in the previous example, set TWO_PHASE=true) > [Publisher] begin; insert into replica_test1 values(1,'1'); prepare > transaction > 'id'; rollback prepared 'id'; > > In subscriber side, the abort_count of two records(sub1 and sub2) is > increased. > > I think the problem maybe is the patch you mentioned > (Skip-empty-transactions-for-logical-replication.patch) is not finished yet. > Share this information here. Hi, thank you for your report. Yes. As the patch's commit message mentions, the patch doesn't cover streaming and two phase transactions. In the above reply, I said that this was an independent issue and we were in the middle of the modification of the behavior, but empty transaction turned out to be harmful enough for this feature. As far as I searched, the current logical replication sends every transaction that is unrelated to publication specification. It means that in a common scenario where some parts of tables are replicated but others are not, the subscription statistics will be buried by the updates by the empty transactions for the latter, which damages this patch's value greatly. Therefore, I included the description in the documentation reported by you. The attached v21 has a couple of other minor updates like a modification of error message text. Best Regards, Takamichi Osumi v21-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch Description: v21-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch v21-0002-Extend-pg_stat_subscription_workers-to-include-g.patch Description: v21-0002-Extend-pg_stat_subscription_workers-to-include-g.patch
RE: Failed transaction statistics to measure the logical replication progress
On Friday, December 31, 2021 10:12 AM Tang, Haiying/唐 海英 wrote: > On Wednesday, December 22, 2021 6:14 PM osumi.takami...@fujitsu.com > wrote: > > > > Attached the new patch v19. > > > > Thanks for your patch. I think it's better if you could add this patch to the > commitfest. > Here are some comments: Thank you for your review ! I've created one entry in the next commitfest for this patch [1] > > 1) > + commit_count bigint > + > + > + Number of transactions successfully applied in this subscription. > + Both COMMIT and COMMIT PREPARED increment this counter. > + > + > ... > > I think the commands (like COMMIT, COMMIT PREPARED ...) can be > surrounded with " ", thoughts? Makes sense to me. Fixed. Note that to the user perspective, we should write only COMMIT and COMMIT PREPARED in the documentation. Thus, I don't list up other commands. I wrapped ROLLBACK PREPARED for abort_count as well. > 2) > +extern void pgstat_report_subworker_xact_end(LogicalRepWorker > *repWorker, > + >LogicalRepMsgType command, > + >bool bforce); > > Should "bforce" be "force"? Fixed the typo. > 3) > + * This should be called before the call of process_syning_tables() so > + to not > > "process_syning_tables()" should be "process_syncing_tables()". Fixed. > 4) > +void > +pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker, bool > +force) { > + static TimestampTz last_report = 0; > + PgStat_MsgSubWorkerXactEnd msg; > + > + if (!force) > + { > ... > + if (!TimestampDifferenceExceeds(last_report, now, > PGSTAT_STAT_INTERVAL)) > + return; > + last_report = now; > + } > + > ... > + if (repWorker->commit_count == 0 && repWorker->abort_count == > 0) > + return; > ... > > I think it's better to check commit_count and abort_count first, then check if > reach PGSTAT_STAT_INTERVAL. > Otherwise if commit_count and abort_count are 0, it is possible that the value > of last_report has been updated but it didn't send stats in fact. In this > case, > last_report is not the real time that send last message. Yeah, agreed. This fix is right in terms of the variable name aspect. The only scenario that we can take advantage of the previous implementation of v19's pgstat_send_subworker_xact_stats() should be a case where we execute a bunch of commit-like logical replication apply messages within PGSTAT_STAT_INTERVAL intensively and continuously for long period, because we check "repWorker->commit_count == 0 && repWorker->abort_count == 0" just once before calling pgstat_send() in this case. *But*, this scenario didn't look reasonable. In other words, the way to call TimestampDifferenceExceeds() only if there's any need of update for commit_count/abort_count looks less heavy. Accordingly, I've fixed it as you suggested. Also, I modified some comments in pgstat_send_subworker_xact_stats() for this change. Kindly have a look at the v20 shared in [2]. [1] - https://commitfest.postgresql.org/37/3504/ [2] - https://www.postgresql.org/message-id/TYCPR01MB8373AB2AE1A6EC7B9E012519ED4A9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Monday, January 3, 2022 2:46 PM Hou, Zhijie/侯 志杰 wrote: > On Wednesday, December 22, 2021 6:14 PM Osumi, Takamichi > wrote: > > Attached the new patch v19. > Hi, > > Thanks for updating the patch. > > --- a/src/include/pgstat.h > +++ b/src/include/pgstat.h > @@ -15,6 +15,7 @@ > #include "portability/instr_time.h" > #include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */ > #include "replication/logicalproto.h" > +#include "replication/worker_internal.h" > > I noticed that the patch includes "worker_internal.h " in pgstat.h. > I think it might be better to only include this file in pgstat.c. > And it seems we can access MyLogicalRepWorker directly in the following > functions instead of passing a parameter. > > +extern void pgstat_report_subworker_xact_end(LogicalRepWorker > *repWorker, > + >LogicalRepMsgType command, > + >bool bforce); > +extern void pgstat_send_subworker_xact_stats(LogicalRepWorker > *repWorker, > + >bool force); Hi, thank you for your review ! Both are fixed. Additionally, I modified related comments, the header comment of pgstat_send_subworker_xact_stats, by the change. One other new improvements in v20 is to have removed the 2nd argument of pgstat_send_subworker_xact_stats(). When we called it from commit-like operations, I passed 'false' without exceptions in v19 and noticed that could be removed. Also, there's a really minor alignment in the source codes. In pgstat.c, I placed pgstat_report_subworker_xact_end() after pgstat_report_subworker_error(), and pgstat_send_subworker_xact_stats() after pgstat_send_subscription_purge() and so on like the order of PgstatCollectorMain() and PgStat_Msg definition, because my patch's new functions are added after the existing functions for error handling functions of subscription workers. Lastly, I changed the error report in pgstat_report_subworker_xact_end() so that it can be more understandable easily and a bit modern. Kindly have a look at the attached version. Best Regards, Takamichi Osumi v20-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch Description: v20-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch v20-0002-Extend-pg_stat_subscription_workers-to-include-g.patch Description: v20-0002-Extend-pg_stat_subscription_workers-to-include-g.patch
RE: Failed transaction statistics to measure the logical replication progress
On Wednesday, December 22, 2021 6:14 PM Osumi, Takamichi wrote: > Attached the new patch v19. Hi, Thanks for updating the patch. --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -15,6 +15,7 @@ #include "portability/instr_time.h" #include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */ #include "replication/logicalproto.h" +#include "replication/worker_internal.h" I noticed that the patch includes "worker_internal.h " in pgstat.h. I think it might be better to only include this file in pgstat.c. And it seems we can access MyLogicalRepWorker directly in the following functions instead of passing a parameter. +extern void pgstat_report_subworker_xact_end(LogicalRepWorker *repWorker, + LogicalRepMsgType command, + bool bforce); +extern void pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker, + bool force); Best regards, Hou zj
RE: Failed transaction statistics to measure the logical replication progress
On Wednesday, December 22, 2021 6:14 PM osumi.takami...@fujitsu.com wrote: > > Attached the new patch v19. > Thanks for your patch. I think it's better if you could add this patch to the commitfest. Here are some comments: 1) + commit_count bigint + + + Number of transactions successfully applied in this subscription. + Both COMMIT and COMMIT PREPARED increment this counter. + + ... I think the commands (like COMMIT, COMMIT PREPARED ...) can be surrounded with " ", thoughts? 2) +extern void pgstat_report_subworker_xact_end(LogicalRepWorker *repWorker, + LogicalRepMsgType command, + bool bforce); Should "bforce" be "force"? 3) + * This should be called before the call of process_syning_tables() so to not "process_syning_tables()" should be "process_syncing_tables()". 4) +void +pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker, bool force) +{ + static TimestampTz last_report = 0; + PgStat_MsgSubWorkerXactEnd msg; + + if (!force) + { ... + if (!TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL)) + return; + last_report = now; + } + ... + if (repWorker->commit_count == 0 && repWorker->abort_count == 0) + return; ... I think it's better to check commit_count and abort_count first, then check if reach PGSTAT_STAT_INTERVAL. Otherwise if commit_count and abort_count are 0, it is possible that the value of last_report has been updated but it didn't send stats in fact. In this case, last_report is not the real time that send last message. Regards, Tang
RE: Failed transaction statistics to measure the logical replication progress
On Wednesday, December 22, 2021 10:30 PM osumi.takami...@fujitsu.com wrote: > On Wednesday, December 22, 2021 8:38 PM I wrote: > > Do we expect these commit counts which come from empty transactions ? > This is another issue discussed in [1] > where the patch in the thread is a work in progress, I think. > .. > IMHO, the conclusion is we are currently in the middle of fixing the behavior. Thank you for telling me this. After applying v19-* and v15-0001-Skip-empty-transactions-for-logical-replication.patch, I retested v19-* patches. The result of previous case looks good to me. But the results of following cases are also similar to previous unexpected result which increases commit_count or abort_count unexpectedly. [1] (Based on environment in the previous example, set TWO_PHASE=true) [Publisher] begin; insert into replica_test1 values(1,'1'); prepare transaction 'id'; commit prepared 'id'; In subscriber side, the commit_count of two records(sub1 and sub2) is increased. [2] (Based on environment in the previous example, set STREAMING=on) [Publisher] begin; INSERT INTO replica_test1 SELECT i, md5(i::text) FROM generate_series(1, 5000) s(i); commit; In subscriber side, the commit_count of two records(sub1 and sub2) is increased. [3] (Based on environment in the previous example, set TWO_PHASE=true) [Publisher] begin; insert into replica_test1 values(1,'1'); prepare transaction 'id'; rollback prepared 'id'; In subscriber side, the abort_count of two records(sub1 and sub2) is increased. I think the problem maybe is the patch you mentioned (Skip-empty-transactions-for-logical-replication.patch) is not finished yet. Share this information here. Regards, Wang wei
RE: Failed transaction statistics to measure the logical replication progress
On Wednesday, December 22, 2021 9:38 PM Wang, Wei/王 威 wrote: > I have a question on the v19-0002 patch: > > When I tested for this patch, I found pg_stat_subscription_workers has some > unexpected data. > For example: > [Publisher] > create table replica_test1(a int, b text); create publication pub1 for table > replica_test1; create table replica_test2(a int, b text); create publication > pub2 > for table replica_test2; > > [Subscriber] > create table replica_test1(a int, b text); create subscription sub1 CONNECTION > 'dbname=postgres' publication pub1; create table replica_test2(a int, b text); > create subscription sub2 CONNECTION 'dbname=postgres' publication pub2; > > [Publisher] > insert into replica_test1 values(1,'1'); > > [Subscriber] > select * from pg_stat_subscription_workers; > > -[ RECORD 1 ]--+-- > Subid | 16389 > subname | sub1 > subrelid | > commit_count | 1 > ... > -[ RECORD 2 ]--+-- > subid | 16395 > subname | sub2 > subrelid | > commit_count | 1 > ... > > I originally expected only one record for "sub1". > > I think the reason is apply_handle_commit() always invoke > pgstat_report_subworker_xact_end(). > But when we insert data to replica_test1 in publish side: > [In the publish] > pub1's walsender1 will send three messages((LOGICAL_REP_MSG_BEGIN, > LOGICAL_REP_MSG_INSERT and LOGICAL_REP_MSG_COMMIT)) > to sub1's apply worker1. > pub2's walsender2 will also send two messages(LOGICAL_REP_MSG_BEGIN > and LOGICAL_REP_MSG_COMMIT) to sub2's apply worker2. Because > inserted table is not published by pub2. > > [In the subscription] > sub1's apply worker1 receive LOGICAL_REP_MSG_COMMIT, > so invoke pgstat_report_subworker_xact_end to increase > commit_count of sub1's stats. > sub2's apply worker2 receive LOGICAL_REP_MSG_COMMIT, > it will do the same action to increase commit_count of sub2's stats. > > Do we expect these commit counts which come from empty transactions ? Hi, thank you so much for your test ! This is another issue discussed in [1] where the patch in the thread is a work in progress, I think. The point you reported will bring a lot of confusion for the users, to interpret the results of the subscription stats values, if those patches including the subscription stats patch will not get committed together (like in the same version). I've confirmed that HEAD applied with v19-* and v15-0001-Skip-empty-transactions-for-logical-replication.patch on top of v19-* showed only one record, as you expected like below, although all patches are not finished yet. postgres=# select * from pg_stat_subscription_workers; -[ RECORD 1 ]--+-- subid | 16389 subname| sub1 subrelid | commit_count | 1 abort_count| 0 error_count| 0 IMHO, the conclusion is we are currently in the middle of fixing the behavior. [1] - https://www.postgresql.org/message-id/CAFPTHDbVLWxpfnwMxJcXq703gLXciXHE83hwKQ_0OTCZ6oLCjg%40mail.gmail.com Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Mon, Dec 22, 2021 at 6:14 PM osumi.takami...@fujitsu.com wrote: > > Attached the new patch v19. > I have a question on the v19-0002 patch: When I tested for this patch, I found pg_stat_subscription_workers has some unexpected data. For example: [Publisher] create table replica_test1(a int, b text); create publication pub1 for table replica_test1; create table replica_test2(a int, b text); create publication pub2 for table replica_test2; [Subscriber] create table replica_test1(a int, b text); create subscription sub1 CONNECTION 'dbname=postgres' publication pub1; create table replica_test2(a int, b text); create subscription sub2 CONNECTION 'dbname=postgres' publication pub2; [Publisher] insert into replica_test1 values(1,'1'); [Subscriber] select * from pg_stat_subscription_workers; -[ RECORD 1 ]--+-- Subid | 16389 subname | sub1 subrelid| commit_count| 1 ... -[ RECORD 2 ]--+-- subid | 16395 subname | sub2 subrelid| commit_count| 1 ... I originally expected only one record for "sub1". I think the reason is apply_handle_commit() always invoke pgstat_report_subworker_xact_end(). But when we insert data to replica_test1 in publish side: [In the publish] pub1's walsender1 will send three messages((LOGICAL_REP_MSG_BEGIN, LOGICAL_REP_MSG_INSERT and LOGICAL_REP_MSG_COMMIT)) to sub1's apply worker1. pub2's walsender2 will also send two messages(LOGICAL_REP_MSG_BEGIN and LOGICAL_REP_MSG_COMMIT) to sub2's apply worker2. Because inserted table is not published by pub2. [In the subscription] sub1's apply worker1 receive LOGICAL_REP_MSG_COMMIT, so invoke pgstat_report_subworker_xact_end to increase commit_count of sub1's stats. sub2's apply worker2 receive LOGICAL_REP_MSG_COMMIT, it will do the same action to increase commit_count of sub2's stats. Do we expect these commit counts which come from empty transactions ? Regards, Wang wei
RE: Failed transaction statistics to measure the logical replication progress
On Tuesday, December 21, 2021 6:00 PM Greg Nancarrow wrote: > Some review comments on the v18 patches: Thank you for your review ! > v18-0002 > > doc/src/sgml/monitoring.sgml > (1) tablesync worker stats? > > Shouldn't the comment below only mention the apply worker? (since we're no > longer recording stats of the tablesync worker) > > + Number of transactions that failed to be applied by the table > + sync worker or main apply worker in this subscription. This > + counter is updated after confirming the error is not same as > + the previous one. > + > > Also, it should say "... the error is not the same as the previous one." Fixed. > src/backend/catalog/system_views.sql > (2) pgstat_report_subworker_xact_end() > > Fix typo and some wording: > > BEFORE: > + * This should be called before the call of process_syning_tables() > + not to > AFTER: > + * This should be called before the call of > process_syncing_tables(), so to not Fixed. > src/backend/postmaster/pgstat.c > (3) pgstat_send_subworker_xact_stats() > > BEFORE: > + * Send a subworker transaction stats to the collector. > AFTER: > + * Send a subworker's transaction stats to the collector. Fixed. > (4) > Wouldn't it be best for: > > + if (!TimestampDifferenceExceeds(last_report, now, > + PGSTAT_STAT_INTERVAL)) > > to be: > > + if (last_report != 0 && !TimestampDifferenceExceeds(last_report, > now, PGSTAT_STAT_INTERVAL)) > > ? I'm not sure which is better and I never have strong objection to your idea but currently I prefer the previous code because we don't need to add one extra condition (last_report != 0) in the function called really frequently and the optimization to avoid calling TimestampDifferenceExceeds works just once with your change, I'd say. We call pgstat_send_subworker_xact_stats() in the LogicalRepApplyLoop's loop. For the apply worker, this should be the first call for normal operation, before any call of the apply_dispatch(and subsequent commit-like functions which calls pgstat_send_subworker_xact_stats() in the end). In the first call, existing v18's codes (without your suggested change) just initializes 'last_report' variable because of the diff between 0 and 'now' and returns because of no stats values in commit_count and abort_count in the function. After this, 'last_report' will not be 0 for the apply worker. On the other hand, in the case I add your change, in the first call of pgstat_send_subworker_xact_stats(), similarly 'last_report' is initialized but without one call of TimestampDifferenceExceeds(), which might be the optimization effect and the function returns with no stats again. Here the 'last_report' will not be 0 after this. But, then we'll have to check the condition in the apply worker in the every loop. Besides, after the first setting of 'last_report', every call of the pgstat_send_subworker_xact_stats() calculates the time subtraction. This means the one skipped call of the function looks less worth in the case of frequent calls of the function. So, I'm not sure it' a good idea to incorporate this change. Kindly let me know if I miss something. At present, I'll keep the code as it is. > (5) pgstat_send_subworker_xact_stats() > > I think that the comment: > > + * Clear out the statistics buffer, so it can be re-used. > > should instead say: > > + * Clear out the supplied statistics. > > because the current comment infers that repWorker is pointed at the > MyLogicalRepWorker buffer (which it might be but it shouldn't be relying on > that) Also, I think that the function header should mention something like: > "The supplied repWorker statistics are cleared upon return, to assist re-use > by > the caller." Fixed. Attached the new patch v19. Best Regards, Takamichi Osumi v19-0002-Extend-pg_stat_subscription_workers-to-include-g.patch Description: v19-0002-Extend-pg_stat_subscription_workers-to-include-g.patch v19-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch Description: v19-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch
Re: Failed transaction statistics to measure the logical replication progress
On Mon, Dec 20, 2021 at 8:40 PM osumi.takami...@fujitsu.com wrote: > > Updated the patch to address your concern. > Some review comments on the v18 patches: v18-0002 doc/src/sgml/monitoring.sgml (1) tablesync worker stats? Shouldn't the comment below only mention the apply worker? (since we're no longer recording stats of the tablesync worker) + Number of transactions that failed to be applied by the table + sync worker or main apply worker in this subscription. This + counter is updated after confirming the error is not same as + the previous one. + Also, it should say "... the error is not the same as the previous one." src/backend/catalog/system_views.sql (2) pgstat_report_subworker_xact_end() Fix typo and some wording: BEFORE: + * This should be called before the call of process_syning_tables() not to AFTER: + * This should be called before the call of process_syncing_tables(), so to not src/backend/postmaster/pgstat.c (3) pgstat_send_subworker_xact_stats() BEFORE: + * Send a subworker transaction stats to the collector. AFTER: + * Send a subworker's transaction stats to the collector. (4) Wouldn't it be best for: + if (!TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL)) to be: + if (last_report != 0 && !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL)) ? (5) pgstat_send_subworker_xact_stats() I think that the comment: + * Clear out the statistics buffer, so it can be re-used. should instead say: + * Clear out the supplied statistics. because the current comment infers that repWorker is pointed at the MyLogicalRepWorker buffer (which it might be but it shouldn't be relying on that) Also, I think that the function header should mention something like: "The supplied repWorker statistics are cleared upon return, to assist re-use by the caller." Regards, Greg Nancarrow Fujitsu Australia
RE: Failed transaction statistics to measure the logical replication progress
Friday, December 17, 2021 2:03 PM Kyotaro Horiguchi wrote: > It sends stats packets at every commit-like operation on apply workers. The > current pgstat is so smart that it refrain from sending stats packets at too > high > frequency. We already suffer frequent stats packets so apply workers need to > bahave the same way. > > That is, the new stats numbers are once accumulated locally then the > accumulated numbers are sent to stats collector by pgstat_report_stat. Hi, Horiguchi-san. I felt your point is absolutely right ! Updated the patch to address your concern. Best Regards, Takamichi Osumi v18-0002-Extend-pg_stat_subscription_workers-to-include-g.patch Description: v18-0002-Extend-pg_stat_subscription_workers-to-include-g.patch v18-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch Description: v18-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch
Re: Failed transaction statistics to measure the logical replication progress
At Thu, 16 Dec 2021 11:36:46 +, "osumi.takami...@fujitsu.com" wrote in > On Thursday, December 16, 2021 4:00 PM I wrote: > > Thank you, everyone for confirming the direction. > > I'll follow the consensus of the community and fix the patch, including > > other > > comments. > > I'll treat only the stats for apply workers. > Hi, created a new version v17 according to the recent discussion > with changes to address other review comments. > > Kindly have a look at it. It sends stats packets at every commit-like operation on apply workers. The current pgstat is so smart that it refrain from sending stats packets at too high frequency. We already suffer frequent stats packets so apply workers need to bahave the same way. That is, the new stats numbers are once accumulated locally then the accumulated numbers are sent to stats collector by pgstat_report_stat. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: Failed transaction statistics to measure the logical replication progress
On Thursday, December 16, 2021 8:37 PM I wrote: > Hi, created a new version v17 according to the recent discussion with changes > to address other review comments. FYI, in v17 I've removed one part of commit message about spool file statistics on the subscriber. My intention is just to make the patches more committable shape. Although I deleted it, I'd say still there would be some room for the discussion of the necessity. It's because to begin with, we are interested in the disk writes (for the logical replication, pg_stat_replication_slots is an example), and secondly there can be a scenario that if the user of logical replication dislikes and wants to suppress unnecessary writes of file on the subscriber (STREAM ABORT causes truncate of file with changes, IIUC) they can increase the logical_decoding_work_mem on the publisher. I'll postpone this discussion, till it becomes necessary or will abandon this idea, if it's rejected. Anyway, I detached the discussion by removing it from the commit message. Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Tuesday, December 14, 2021 12:45 AM vignesh C wrote: > Thanks for the updated patch, few comments: > 1) Can we change this: > /* > +* Report the success of table sync as one commit to consolidate all > +* transaction stats into one record. > +*/ > + pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid, > + > LOGICAL_REP_MSG_COMMIT); > + > To: > /* Report the success of table sync */ > + pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid, > + > LOGICAL_REP_MSG_COMMIT); > + This function call that the table sync worker reports an update of stats has been removed according to the recent discussion. > 2) Typo: ealier should be earlier > + /* > +* Report ealier than the call of process_syncing_tables() not > to miss an > +* increment of commit_count in case it leads to the process exit. See > +* process_syncing_tables_for_apply(). > +*/ > + pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid, > + > LOGICAL_REP_MSG_COMMIT); > + Thanks ! Fixed. > 3) Should we add an Assert for subwentry: > + /* > +* If this is a new error reported by table sync worker, > consolidate this > +* error count into the entry of apply worker, by swapping the stats > +* entries. > +*/ > + if (OidIsValid(msg->m_subrelid)) > + subwentry = pgstat_get_subworker_entry(dbentry, > + msg->m_subid, > + > InvalidOid, true); > + subwentry->error_count++; The latest implementation doesn't require the call of pgstat_get_subworker_entry(). So, I skipped. > 4) Can we slightly change it to :We can change it: > +# Check the update of stats counters. > +confirm_transaction_stats_update( > + $node_subscriber, > + 'commit_count = 1', > + 'the commit_count increment by table sync'); > + > +confirm_transaction_stats_update( > + $node_subscriber, > + 'error_count = 1', > + 'the error_count increment by table sync'); > to: > +# Check updation of subscription worker transaction count statistics. > +confirm_transaction_stats_update( > + $node_subscriber, > + 'commit_count = 1', > + 'check table sync worker commit count is updated'); > + > +confirm_transaction_stats_update( > + $node_subscriber, > + 'error_count = 1', > + 'check table sync worker error count is updated'); I've removed the corresponding tests for table sync workers in the patch. But, I adopted the comment suggestion partly for the tests of the apply worker. On the other hand, I didn't fix the 3rd arguments of confirm_transaction_stats_update(). It needs to be a noun, because it's connected to another string "Timed out while waiting for ". in the function. See the definition of the function. The new patch v17 is shared in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB83734A7A0596AC7ADB0DCB51ED779%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Monday, December 13, 2021 6:19 PM Amit Kapila wrote: > On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com > wrote: > > > > Few questions and comments: > > 1. > The pg_stat_subscription_workers view will > contain > one row per subscription worker on which errors have occurred, for workers > applying logical replication changes and workers handling the initial data > - copy of the subscribed tables. The statistics entry is removed when the > - corresponding subscription is dropped. > + copy of the subscribed tables. Also, the row corresponding to the apply > + worker shows all transaction statistics of both types of workers on the > + subscription. The statistics entry is removed when the corresponding > + subscription is dropped. > > Why did you choose to show stats for both types of workers in one row? Now, the added stats show only the statistics of apply worker as we agreed. > 2. > + PGSTAT_MTYPE_SUBWORKERXACTEND, > } StatMsgType; > > I don't think we comma with the last message type. Fixed. > 3. > + Oid m_subrelid; > + > + /* necessary to determine column to increment */ LogicalRepMsgType > + m_command; > + > +} PgStat_MsgSubWorkerXactEnd; > > Is m_subrelid used in this patch? If not, why did you keep it? I think if you > choose to show separate stats for table sync and apply worker then probably it > will be used. Removed. > 4. > /* > + * Cumulative transaction statistics of subscription worker */ > + PgStat_Counter commit_count; PgStat_Counter error_count; > + PgStat_Counter abort_count; > + > > I think it is better to keep the order of columns as commit_count, > abort_count, > error_count in the entire patch. Fixed. The new patch is shared in [1]. [1] - https://www.postgresql.org/message-id/TYCPR01MB83734A7A0596AC7ADB0DCB51ED779%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Thursday, December 16, 2021 4:00 PM I wrote: > Thank you, everyone for confirming the direction. > I'll follow the consensus of the community and fix the patch, including other > comments. > I'll treat only the stats for apply workers. Hi, created a new version v17 according to the recent discussion with changes to address other review comments. Kindly have a look at it. Best Regards, Takamichi Osumi v17-0002-Extend-pg_stat_subscription_workers-to-include-g.patch Description: v17-0002-Extend-pg_stat_subscription_workers-to-include-g.patch v17-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch Description: v17-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch
Re: Failed transaction statistics to measure the logical replication progress
On Tue, Dec 14, 2021 at 1:28 PM Amit Kapila wrote: > > If we fear a large number of entries for such workers then won't it be > better to show the value of these stats only for apply workers. I > think normally the table sync workers perform only copy operation or > maybe a fixed number of xacts, so, one might not be interested in the > transaction stats of these workers. I find merging only specific stats > of two different types of workers confusing. > > What do others think about this? > I think it might be OK to NOT include the transaction stats of the tablesync workers, but my understanding (and slight concern) is that currently there is potentially some overlap in the work done by the tablesync and apply workers - perhaps the small patch (see [1]) proposed by Peter Smith could also be considered, in order to make that distinction of work clearer, and the stats more meaningful? [1] https://www.postgresql.org/message-id/flat/cahut+pt39pbqs0sxt9rmm89ayizoq0kw46yzskkzwk8z5ho...@mail.gmail.com Regards, Greg Nancarrow Fujitsu Australia
RE: Failed transaction statistics to measure the logical replication progress
On Wednesday, December 15, 2021 9:52 PM vignesh C wrote: > On Tue, Dec 14, 2021 at 7:58 AM Amit Kapila > wrote: > > > > On Mon, Dec 13, 2021 at 5:48 PM osumi.takami...@fujitsu.com > > wrote: > > > > > > On Monday, December 13, 2021 6:19 PM Amit Kapila > wrote: > > > > On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com > > > > wrote: > > > > > > > > Few questions and comments: > > > Thank you for your comments ! > > > > > > > > > > > 1. > > > > The pg_stat_subscription_workers view > > > > will contain > > > > one row per subscription worker on which errors have occurred, for > workers > > > > applying logical replication changes and workers handling the > > > > initial > data > > > > - copy of the subscribed tables. The statistics entry is removed > when the > > > > - corresponding subscription is dropped. > > > > + copy of the subscribed tables. Also, the row corresponding to the > apply > > > > + worker shows all transaction statistics of both types of workers on > the > > > > + subscription. The statistics entry is removed when the > corresponding > > > > + subscription is dropped. > > > > > > > > Why did you choose to show stats for both types of workers in one row? > > > This is because if we have hundreds or thousands of tables for table > > > sync, we need to create many entries to cover them and store the entries > > > for > all tables. > > > > > > > If we fear a large number of entries for such workers then won't it be > > better to show the value of these stats only for apply workers. I > > think normally the table sync workers perform only copy operation or > > maybe a fixed number of xacts, so, one might not be interested in the > > transaction stats of these workers. I find merging only specific stats > > of two different types of workers confusing. > > > > What do others think about this? > > We can remove the table sync workers transaction stats count to avoid > confusion, take care of the documentation changes too accordingly. Hi, apologies for my late reply. Thank you, everyone for confirming the direction. I'll follow the consensus of the community and fix the patch, including other comments. I'll treat only the stats for apply workers. Best Regards, Takamichi Osumi
Re: Failed transaction statistics to measure the logical replication progress
On Tue, Dec 14, 2021 at 7:58 AM Amit Kapila wrote: > > On Mon, Dec 13, 2021 at 5:48 PM osumi.takami...@fujitsu.com > wrote: > > > > On Monday, December 13, 2021 6:19 PM Amit Kapila > > wrote: > > > On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com > > > wrote: > > > > > > Few questions and comments: > > Thank you for your comments ! > > > > > > > > 1. > > > The pg_stat_subscription_workers view will > > > contain > > > one row per subscription worker on which errors have occurred, for > > > workers > > > applying logical replication changes and workers handling the initial > > > data > > > - copy of the subscribed tables. The statistics entry is removed when > > > the > > > - corresponding subscription is dropped. > > > + copy of the subscribed tables. Also, the row corresponding to the > > > apply > > > + worker shows all transaction statistics of both types of workers on > > > the > > > + subscription. The statistics entry is removed when the corresponding > > > + subscription is dropped. > > > > > > Why did you choose to show stats for both types of workers in one row? > > This is because if we have hundreds or thousands of tables for table sync, > > we need to create many entries to cover them and store the entries for all > > tables. > > > > If we fear a large number of entries for such workers then won't it be > better to show the value of these stats only for apply workers. I > think normally the table sync workers perform only copy operation or > maybe a fixed number of xacts, so, one might not be interested in the > transaction stats of these workers. I find merging only specific stats > of two different types of workers confusing. > > What do others think about this? We can remove the table sync workers transaction stats count to avoid confusion, take care of the documentation changes too accordingly. Regards, Vignesh
Re: Failed transaction statistics to measure the logical replication progress
On Tue, Dec 14, 2021 at 11:28 AM Amit Kapila wrote: > > On Mon, Dec 13, 2021 at 5:48 PM osumi.takami...@fujitsu.com > wrote: > > > > On Monday, December 13, 2021 6:19 PM Amit Kapila > > wrote: > > > On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com > > > wrote: > > > > > > Few questions and comments: > > Thank you for your comments ! > > > > > > > > 1. > > > The pg_stat_subscription_workers view will > > > contain > > > one row per subscription worker on which errors have occurred, for > > > workers > > > applying logical replication changes and workers handling the initial > > > data > > > - copy of the subscribed tables. The statistics entry is removed when > > > the > > > - corresponding subscription is dropped. > > > + copy of the subscribed tables. Also, the row corresponding to the > > > apply > > > + worker shows all transaction statistics of both types of workers on > > > the > > > + subscription. The statistics entry is removed when the corresponding > > > + subscription is dropped. > > > > > > Why did you choose to show stats for both types of workers in one row? > > This is because if we have hundreds or thousands of tables for table sync, > > we need to create many entries to cover them and store the entries for all > > tables. > > > > If we fear a large number of entries for such workers then won't it be > better to show the value of these stats only for apply workers. I > think normally the table sync workers perform only copy operation or > maybe a fixed number of xacts, so, one might not be interested in the > transaction stats of these workers. I find merging only specific stats > of two different types of workers confusing. > > What do others think about this? I understand the concern to have a large number of entries but I agree that merging only specific stats would confuse users. As Amit suggested, it'd be better to show only apply workers' transaction stats. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
RE: Failed transaction statistics to measure the logical replication progress
On Tues, Dec 14, 2021 10:28 AM Amit Kapila wrote: > On Mon, Dec 13, 2021 at 5:48 PM osumi.takami...@fujitsu.com > wrote: > > > > On Monday, December 13, 2021 6:19 PM Amit Kapila > wrote: > > > On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com > > > wrote: > > > > > > Few questions and comments: > > Thank you for your comments ! > > > > > > > > 1. > > > one row per subscription worker on which errors have occurred, for > > > workers > > > applying logical replication changes and workers handling the initial > > > data > > > - copy of the subscribed tables. The statistics entry is removed when > > > the > > > - corresponding subscription is dropped. > > > + copy of the subscribed tables. Also, the row corresponding to the > > > apply > > > + worker shows all transaction statistics of both types of workers on > > > the > > > + subscription. The statistics entry is removed when the corresponding > > > + subscription is dropped. > > > > > > Why did you choose to show stats for both types of workers in one row? > > This is because if we have hundreds or thousands of tables for table sync, > > we need to create many entries to cover them and store the entries for all > > tables. > > > > If we fear a large number of entries for such workers then won't it be > better to show the value of these stats for apply workers. I > think normally the table sync workers perform only copy operation or > maybe a fixed number of xacts, so, one might not be interested in the > transaction stats of these workers. I find merging only specific stats > of two different types of workers confusing. > > What do others think about this? Personally, I agreed that merging two types of stats into one row might not be a good idea. And the xact stats of table sync workers are usually less interesting than the apply worker's, So, it's seems acceptable to me if we show stats only for apply workers and document about this. Best regards, Hou zj
Re: Failed transaction statistics to measure the logical replication progress
On Mon, Dec 13, 2021 at 5:48 PM osumi.takami...@fujitsu.com wrote: > > On Monday, December 13, 2021 6:19 PM Amit Kapila > wrote: > > On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com > > wrote: > > > > Few questions and comments: > Thank you for your comments ! > > > > > 1. > > The pg_stat_subscription_workers view will > > contain > > one row per subscription worker on which errors have occurred, for > > workers > > applying logical replication changes and workers handling the initial > > data > > - copy of the subscribed tables. The statistics entry is removed when the > > - corresponding subscription is dropped. > > + copy of the subscribed tables. Also, the row corresponding to the apply > > + worker shows all transaction statistics of both types of workers on the > > + subscription. The statistics entry is removed when the corresponding > > + subscription is dropped. > > > > Why did you choose to show stats for both types of workers in one row? > This is because if we have hundreds or thousands of tables for table sync, > we need to create many entries to cover them and store the entries for all > tables. > If we fear a large number of entries for such workers then won't it be better to show the value of these stats only for apply workers. I think normally the table sync workers perform only copy operation or maybe a fixed number of xacts, so, one might not be interested in the transaction stats of these workers. I find merging only specific stats of two different types of workers confusing. What do others think about this? -- With Regards, Amit Kapila.
Re: Failed transaction statistics to measure the logical replication progress
On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com wrote: > > On Monday, December 6, 2021 11:27 PM vignesh C wrote: > > Thanks for the updated patch, few comments: > Thank you for your review ! > > > 1) We can keep the documentation similar to mention the count includes both > > table sync worker / main apply worker in case of commit_count/error_count > > and abort_count to keep it consistent. > > + commit_count bigint > > + > > + > > + Number of transactions successfully applied in this subscription. > > + COMMIT and COMMIT PREPARED increments this counter. > > + > > + > > + > > + > > + > > + error_count bigint > > + > > + > > + Number of transactions that failed to be applied by the table > > + sync worker or main apply worker in this subscription. > > + > > + > > + > > + > > + > > + abort_count bigint > > + > > + > > + Number of transactions aborted in this subscription. > > + ROLLBACK PREPARED increments this counter. > > + > > + > Yeah, you are right. Fixed. > Note that abort_count is not used by table sync worker. > > > > 2) Can this be changed: > > + /* > > +* If this is a new error reported by table sync worker, > > consolidate this > > +* error count into the entry of apply worker. > > +*/ > > + if (OidIsValid(msg->m_subrelid)) > > + { > > + /* Gain the apply worker stats */ > > + subwentry = pgstat_get_subworker_entry(dbentry, > > + msg->m_subid, > > + > > InvalidOid, true); > > + subwentry->error_count++; > > + } > > + else > > + subwentry->error_count++; /* increment the apply > > worker's counter. */ > > To: > > + /* > > +* If this is a new error reported by table sync worker, > > consolidate this > > +* error count into the entry of apply worker. > > +*/ > > + if (OidIsValid(msg->m_subrelid)) > > + /* Gain the apply worker stats */ > > + subwentry = pgstat_get_subworker_entry(dbentry, > > + msg->m_subid, > > + > > InvalidOid, true); > > + > > + subwentry->error_count++; /* increment the apply > > worker's counter. */ > Your suggestion looks better. > Also, I fixed some comments of this part > so that we don't need to add a separate comment at the bottom > for the increment of the apply worker. > > > > 3) Since both 026_worker_stats and 027_worker_xact_stats.pl are testing > > pg_stat_subscription_workers, can we move the tests to 026_worker_stats.pl. > > If possible the error_count validation can be combined with the existing > > tests. > > diff --git a/src/test/subscription/t/027_worker_xact_stats.pl > > b/src/test/subscription/t/027_worker_xact_stats.pl > > new file mode 100644 > > index 000..31dbea1 > > --- /dev/null > > +++ b/src/test/subscription/t/027_worker_xact_stats.pl > > @@ -0,0 +1,162 @@ > > + > > +# Copyright (c) 2021, PostgreSQL Global Development Group > > + > > +# Tests for subscription worker statistics during apply. > > +use strict; > > +use warnings; > > +use PostgreSQL::Test::Cluster; > > +use PostgreSQL::Test::Utils; > > +use Test::More tests => 1; > > + > > +# Create publisher node > Right. I've integrated my tests with 026_worker_stats.pl. > I think error_count validations are combined as you suggested. > Another change I did is to introduce one function > to contribute to better readability of the stats tests. > > Here, the 026_worker_stats.pl didn't look aligned by > pgperltidy. This is not a serious issue at all. > Yet, when I ran pgperltidy, the existing codes > that required adjustments came into my patch. > Therefore, I made a separate part for this. Thanks for the updated patch, few comments: 1) Can we change this: /* +* Report the success of table sync as one commit to consolidate all +* transaction stats into one record. +*/ + pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid, + LOGICAL_REP_MSG_COMMIT); + To: /* Report the success of table sync */ + pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid, + LOGICAL_REP_MSG_COMMIT); + 2) Typo: ealier should be earlier + /* +* Report ealier than the call of process_syncing_tables() not to miss an +* increment of commit_count in case it leads to the process exit. See +* process_syncing_tables_for_apply(). +*/ + pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid, + LOGICAL_REP_MSG_COMMIT); + 3) Should we add an Assert for subwentry: + /* +* If this is a new error reported by table sync worker, consolidate this +* error count into the entry of apply worker, by swapping the stats +* entries. +*/ + if (OidIsValid(msg->m_subrelid)) + subwentry =
RE: Failed transaction statistics to measure the logical replication progress
On Monday, December 13, 2021 6:19 PM Amit Kapila wrote: > On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com > wrote: > > Few questions and comments: Thank you for your comments ! > > 1. > The pg_stat_subscription_workers view will > contain > one row per subscription worker on which errors have occurred, for workers > applying logical replication changes and workers handling the initial data > - copy of the subscribed tables. The statistics entry is removed when the > - corresponding subscription is dropped. > + copy of the subscribed tables. Also, the row corresponding to the apply > + worker shows all transaction statistics of both types of workers on the > + subscription. The statistics entry is removed when the corresponding > + subscription is dropped. > > Why did you choose to show stats for both types of workers in one row? This is because if we have hundreds or thousands of tables for table sync, we need to create many entries to cover them and store the entries for all tables. > 2. > + PGSTAT_MTYPE_SUBWORKERXACTEND, > } StatMsgType; > > I don't think we comma with the last message type. > 4. > /* > + * Cumulative transaction statistics of subscription worker */ > + PgStat_Counter commit_count; PgStat_Counter error_count; > + PgStat_Counter abort_count; > + > > I think it is better to keep the order of columns as commit_count, > abort_count, > error_count in the entire patch. Okay, I'll fix both points in the next version. > 3. > + Oid m_subrelid; > + > + /* necessary to determine column to increment */ LogicalRepMsgType > + m_command; > + > +} PgStat_MsgSubWorkerXactEnd; > > Is m_subrelid used in this patch? If not, why did you keep it? Absolutely, this was a mistake when I took the decision to merge both stats of table sync and apply worker. > I think if you choose > to show separate stats for table sync and apply worker then probably it will > be > used. Yeah, I'll fix this. Of course, after I could confirm that the idea for merging the two types of workers stats was acceptable for you and others. Best Regards, Takamichi Osumi
Re: Failed transaction statistics to measure the logical replication progress
On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com wrote: > Few questions and comments: 1. The pg_stat_subscription_workers view will contain one row per subscription worker on which errors have occurred, for workers applying logical replication changes and workers handling the initial data - copy of the subscribed tables. The statistics entry is removed when the - corresponding subscription is dropped. + copy of the subscribed tables. Also, the row corresponding to the apply + worker shows all transaction statistics of both types of workers on the + subscription. The statistics entry is removed when the corresponding + subscription is dropped. Why did you choose to show stats for both types of workers in one row? 2. + PGSTAT_MTYPE_SUBWORKERXACTEND, } StatMsgType; I don't think we comma with the last message type. 3. + Oid m_subrelid; + + /* necessary to determine column to increment */ + LogicalRepMsgType m_command; + +} PgStat_MsgSubWorkerXactEnd; Is m_subrelid used in this patch? If not, why did you keep it? I think if you choose to show separate stats for table sync and apply worker then probably it will be used. 4. /* + * Cumulative transaction statistics of subscription worker + */ + PgStat_Counter commit_count; + PgStat_Counter error_count; + PgStat_Counter abort_count; + I think it is better to keep the order of columns as commit_count, abort_count, error_count in the entire patch. -- With Regards, Amit Kapila.
RE: Failed transaction statistics to measure the logical replication progress
On Monday, December 6, 2021 11:27 PM vignesh C wrote: > Thanks for the updated patch, few comments: Thank you for your review ! > 1) We can keep the documentation similar to mention the count includes both > table sync worker / main apply worker in case of commit_count/error_count > and abort_count to keep it consistent. > + commit_count bigint > + > + > + Number of transactions successfully applied in this subscription. > + COMMIT and COMMIT PREPARED increments this counter. > + > + > + > + > + > + error_count bigint > + > + > + Number of transactions that failed to be applied by the table > + sync worker or main apply worker in this subscription. > + > + > + > + > + > + abort_count bigint > + > + > + Number of transactions aborted in this subscription. > + ROLLBACK PREPARED increments this counter. > + > + Yeah, you are right. Fixed. Note that abort_count is not used by table sync worker. > 2) Can this be changed: > + /* > +* If this is a new error reported by table sync worker, > consolidate this > +* error count into the entry of apply worker. > +*/ > + if (OidIsValid(msg->m_subrelid)) > + { > + /* Gain the apply worker stats */ > + subwentry = pgstat_get_subworker_entry(dbentry, > + msg->m_subid, > + > InvalidOid, true); > + subwentry->error_count++; > + } > + else > + subwentry->error_count++; /* increment the apply > worker's counter. */ > To: > + /* > +* If this is a new error reported by table sync worker, > consolidate this > +* error count into the entry of apply worker. > +*/ > + if (OidIsValid(msg->m_subrelid)) > + /* Gain the apply worker stats */ > + subwentry = pgstat_get_subworker_entry(dbentry, > + msg->m_subid, > + > InvalidOid, true); > + > + subwentry->error_count++; /* increment the apply > worker's counter. */ Your suggestion looks better. Also, I fixed some comments of this part so that we don't need to add a separate comment at the bottom for the increment of the apply worker. > 3) Since both 026_worker_stats and 027_worker_xact_stats.pl are testing > pg_stat_subscription_workers, can we move the tests to 026_worker_stats.pl. > If possible the error_count validation can be combined with the existing > tests. > diff --git a/src/test/subscription/t/027_worker_xact_stats.pl > b/src/test/subscription/t/027_worker_xact_stats.pl > new file mode 100644 > index 000..31dbea1 > --- /dev/null > +++ b/src/test/subscription/t/027_worker_xact_stats.pl > @@ -0,0 +1,162 @@ > + > +# Copyright (c) 2021, PostgreSQL Global Development Group > + > +# Tests for subscription worker statistics during apply. > +use strict; > +use warnings; > +use PostgreSQL::Test::Cluster; > +use PostgreSQL::Test::Utils; > +use Test::More tests => 1; > + > +# Create publisher node Right. I've integrated my tests with 026_worker_stats.pl. I think error_count validations are combined as you suggested. Another change I did is to introduce one function to contribute to better readability of the stats tests. Here, the 026_worker_stats.pl didn't look aligned by pgperltidy. This is not a serious issue at all. Yet, when I ran pgperltidy, the existing codes that required adjustments came into my patch. Therefore, I made a separate part for this. Best Regards, Takamichi Osumi v16-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch Description: v16-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch v16-0002-Extend-pg_stat_subscription_workers-to-include-g.patch Description: v16-0002-Extend-pg_stat_subscription_workers-to-include-g.patch
Re: Failed transaction statistics to measure the logical replication progress
On Sat, Dec 4, 2021 at 6:32 PM osumi.takami...@fujitsu.com wrote: > > On Friday, December 3, 2021 3:12 PM vignesh C wrote: > > Thanks for the updated patch. > > Currently we are storing the commit count, error_count and abort_count for > > each table of the table sync operation. If we have thousands of tables, we > > will > > be storing the information for each of the tables. > > Shouldn't we be storing the consolidated information in this case. > > diff --git a/src/backend/replication/logical/tablesync.c > > b/src/backend/replication/logical/tablesync.c > > index f07983a..02e9486 100644 > > --- a/src/backend/replication/logical/tablesync.c > > +++ b/src/backend/replication/logical/tablesync.c > > @@ -1149,6 +1149,11 @@ copy_table_done: > > MyLogicalRepWorker->relstate_lsn = *origin_startpos; > > SpinLockRelease(>relmutex); > > > > + /* Report the success of table sync. */ > > + pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid, > > + > > MyLogicalRepWorker->relid, > > + > > 0 /* no logical message type */ ); > Okay. > > I united all stats into that of apply worker. > In line with this change, I fixed the TAP tests as well > to cover the updates of stats done by table sync workers. > > Also, during my self-review, I noticed that > I should call pgstat_report_subworker_xact_end() before > process_syncing_tables() because it can lead to process > exit, which results in missing one increment of the stats columns. > I noted this point in a comment as well. Thanks for the updated patch, few comments: 1) We can keep the documentation similar to mention the count includes both table sync worker / main apply worker in case of commit_count/error_count and abort_count to keep it consistent. + commit_count bigint + + + Number of transactions successfully applied in this subscription. + COMMIT and COMMIT PREPARED increments this counter. + + + + + + error_count bigint + + + Number of transactions that failed to be applied by the table + sync worker or main apply worker in this subscription. + + + + + + abort_count bigint + + + Number of transactions aborted in this subscription. + ROLLBACK PREPARED increments this counter. + + 2) Can this be changed: + /* +* If this is a new error reported by table sync worker, consolidate this +* error count into the entry of apply worker. +*/ + if (OidIsValid(msg->m_subrelid)) + { + /* Gain the apply worker stats */ + subwentry = pgstat_get_subworker_entry(dbentry, msg->m_subid, + InvalidOid, true); + subwentry->error_count++; + } + else + subwentry->error_count++; /* increment the apply worker's counter. */ To: + /* +* If this is a new error reported by table sync worker, consolidate this +* error count into the entry of apply worker. +*/ + if (OidIsValid(msg->m_subrelid)) + /* Gain the apply worker stats */ + subwentry = pgstat_get_subworker_entry(dbentry, msg->m_subid, + InvalidOid, true); + + subwentry->error_count++; /* increment the apply worker's counter. */ 3) Since both 026_worker_stats and 027_worker_xact_stats.pl are testing pg_stat_subscription_workers, can we move the tests to 026_worker_stats.pl. If possible the error_count validation can be combined with the existing tests. diff --git a/src/test/subscription/t/027_worker_xact_stats.pl b/src/test/subscription/t/027_worker_xact_stats.pl new file mode 100644 index 000..31dbea1 --- /dev/null +++ b/src/test/subscription/t/027_worker_xact_stats.pl @@ -0,0 +1,162 @@ + +# Copyright (c) 2021, PostgreSQL Global Development Group + +# Tests for subscription worker statistics during apply. +use strict; +use warnings; +use PostgreSQL::Test::Cluster; +use PostgreSQL::Test::Utils; +use Test::More tests => 1; + +# Create publisher node Regards, Vignesh
RE: Failed transaction statistics to measure the logical replication progress
On Friday, December 3, 2021 3:12 PM vignesh C wrote: > Thanks for the updated patch. > Currently we are storing the commit count, error_count and abort_count for > each table of the table sync operation. If we have thousands of tables, we > will > be storing the information for each of the tables. > Shouldn't we be storing the consolidated information in this case. > diff --git a/src/backend/replication/logical/tablesync.c > b/src/backend/replication/logical/tablesync.c > index f07983a..02e9486 100644 > --- a/src/backend/replication/logical/tablesync.c > +++ b/src/backend/replication/logical/tablesync.c > @@ -1149,6 +1149,11 @@ copy_table_done: > MyLogicalRepWorker->relstate_lsn = *origin_startpos; > SpinLockRelease(>relmutex); > > + /* Report the success of table sync. */ > + pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid, > + > MyLogicalRepWorker->relid, > + > 0 /* no logical message type */ ); Okay. I united all stats into that of apply worker. In line with this change, I fixed the TAP tests as well to cover the updates of stats done by table sync workers. Also, during my self-review, I noticed that I should call pgstat_report_subworker_xact_end() before process_syncing_tables() because it can lead to process exit, which results in missing one increment of the stats columns. I noted this point in a comment as well. Best Regards, Takamichi Osumi v15-0001-Extend-pg_stat_subscription_workers-to-include-g.patch Description: v15-0001-Extend-pg_stat_subscription_workers-to-include-g.patch
Re: Failed transaction statistics to measure the logical replication progress
On Wed, Dec 1, 2021 at 3:04 PM osumi.takami...@fujitsu.com wrote: > > On Friday, November 19, 2021 11:11 PM Masahiko Sawada > wrote: > > Besides that, I’m not sure how useful commit_bytes, abort_bytes, and > > error_bytes are. I originally thought these statistics track the size of > > received > > data, i.g., how much data is transferred from the publisher and processed on > > the subscriber. But what the view currently has is how much memory is used > > in > > the subscription worker. The subscription worker emulates > > ReorderBufferChangeSize() on the subscriber side but, as the comment of > > update_apply_change_size() mentions, the size in the view is not accurate: > ... > > I guess that the purpose of these values is to compare them to total_bytes, > > stream_byte, and spill_bytes but if the calculation is not accurate, does > > it mean > > that the more stats are updated, the more the stats will be getting > > inaccurate? > Thanks for your comment ! > > I tried to solve your concerns about byte columns but there are really > difficult issues to solve. > For example, to begin with the messages of apply worker are different from > those of > reorder buffer. > > Therefore, I decided to split the previous patch and make counter columns go > first. > v14 was checked by pgperltidy and pgindent. > > This patch can be applied to the PG whose commit id is after 8d74fc9 > (introduction of > pg_stat_subscription_workers). Thanks for the updated patch. Currently we are storing the commit count, error_count and abort_count for each table of the table sync operation. If we have thousands of tables, we will be storing the information for each of the tables. Shouldn't we be storing the consolidated information in this case. diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c index f07983a..02e9486 100644 --- a/src/backend/replication/logical/tablesync.c +++ b/src/backend/replication/logical/tablesync.c @@ -1149,6 +1149,11 @@ copy_table_done: MyLogicalRepWorker->relstate_lsn = *origin_startpos; SpinLockRelease(>relmutex); + /* Report the success of table sync. */ + pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid, + MyLogicalRepWorker->relid, + 0 /* no logical message type */ ); postgres=# select * from pg_stat_subscription_workers ; subid | subname | subrelid | commit_count | error_count | abort_count | last_error_relid | last_error_command | last_error_xid | last_error_count | last_error_message | last_error_time ---+-+--+--+-+-+--+++--++- 16411 | sub1|16387 |1 | 0 | 0 | ||| 0 || 16411 | sub1|16396 |1 | 0 | 0 | ||| 0 || 16411 | sub1|16390 |1 | 0 | 0 | ||| 0 || 16411 | sub1|16393 |1 | 0 | 0 | ||| 0 || 16411 | sub1|16402 |1 | 0 | 0 | ||| 0 || 16411 | sub1|16408 |1 | 0 | 0 | ||| 0 || 16411 | sub1|16384 |1 | 0 | 0 | ||| 0 || 16411 | sub1|16399 |1 | 0 | 0 | ||| 0 || 16411 | sub1|16405 |1 | 0 | 0 | ||| 0 || (9 rows) Regards, Vignesh
RE: Failed transaction statistics to measure the logical replication progress
On Friday, November 19, 2021 11:11 PM Masahiko Sawada wrote: > Besides that, I’m not sure how useful commit_bytes, abort_bytes, and > error_bytes are. I originally thought these statistics track the size of > received > data, i.g., how much data is transferred from the publisher and processed on > the subscriber. But what the view currently has is how much memory is used in > the subscription worker. The subscription worker emulates > ReorderBufferChangeSize() on the subscriber side but, as the comment of > update_apply_change_size() mentions, the size in the view is not accurate: ... > I guess that the purpose of these values is to compare them to total_bytes, > stream_byte, and spill_bytes but if the calculation is not accurate, does it > mean > that the more stats are updated, the more the stats will be getting > inaccurate? Thanks for your comment ! I tried to solve your concerns about byte columns but there are really difficult issues to solve. For example, to begin with the messages of apply worker are different from those of reorder buffer. Therefore, I decided to split the previous patch and make counter columns go first. v14 was checked by pgperltidy and pgindent. This patch can be applied to the PG whose commit id is after 8d74fc9 (introduction of pg_stat_subscription_workers). Best Regards, Takamichi Osumi v14-0001-Extend-pg_stat_subscription_workers-to-include-g.patch Description: v14-0001-Extend-pg_stat_subscription_workers-to-include-g.patch
Re: Failed transaction statistics to measure the logical replication progress
On Tue, Nov 23, 2021 at 3:21 PM Greg Nancarrow wrote: > > On Sat, Nov 20, 2021 at 1:11 AM Masahiko Sawada wrote: > > > > I'm concerned that these new names will introduce confusion; if we > > have last_error_relid, last_error_command, last_error_message, > > last_error_time, and last_error_xid, I think users might think that > > first_error_time is the timestamp at which an error occurred for the > > first time in the subscription worker. > > You mean you think users might think "first_error_time" is the > timestamp at which the last_error first occurred (rather than the > timestamp of the first of any type of error that occurred) on that > worker? I felt that "first_error_time" is the timestamp of the first of any type of error that occurred on the worker. > > > ... Also, I'm not sure > > last_error_count is not clear to me (it looks like showing something > > count but the only "last" one?). > > It's the number of times that the last_error has occurred. > Unless it's some kind of transient error, that might get resolved > without intervention, logical replication will get stuck in a loop > retrying and the last error will occur again and again, hence the > count of how many times that has happened. > Maybe there's not much benefit in counting different errors prior to > the last error? The name "last_error_count" is somewhat clear to me now. I had felt that since the last error refers to *one* error that occurred last and it’s odd there is the count of it. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Failed transaction statistics to measure the logical replication progress
On Sat, Nov 20, 2021 at 3:25 PM Amit Kapila wrote: > > On Fri, Nov 19, 2021 at 7:41 PM Masahiko Sawada wrote: > > > > On Thu, Nov 18, 2021 at 12:26 PM Amit Kapila > > wrote: > > > > > > On Wed, Nov 17, 2021 at 7:12 PM osumi.takami...@fujitsu.com > > > wrote: > > > > > > > > On Wednesday, November 17, 2021 10:00 PM Amit Kapila > > > > wrote: > > > > > > > > > > Can you please tell us why you think the names in your proposed patch > > > > > are > > > > > better than the existing names proposed in Sawada-San's patch? Is it > > > > > because > > > > > those fields always contain the information of the last or latest > > > > > error that > > > > > occurred in the corresponding subscription worker? > > > > This is one reason. > > > > > > > > Another big reason comes from the final alignment when we list up all > > > > columns of both patches. > > > > The patches in this thread is trying to introduce a column that > > > > indicates > > > > cumulative count of error to show all error counts that the worker got > > > > in the past. > > > > > > > > > > Okay, I see your point and it makes sense to rename columns after > > > these other stats. I am not able to come up with any better names than > > > what is being used here. Sawada-San, do you agree with this, or do let > > > us know if you have any better ideas? > > > > > > > I'm concerned that these new names will introduce confusion; if we > > have last_error_relid, last_error_command, last_error_message, > > last_error_time, and last_error_xid, I think users might think that > > first_error_time is the timestamp at which an error occurred for the > > first time in the subscription worker. > > > > Isn't to some extent that confusion already exists because of > last_error_time column? > > > Also, I'm not sure > > last_error_count is not clear to me (it looks like showing something > > count but the only "last" one?). > > > > I feel if all the error related columns have "last_error_" as a prefix > then it should not be that confusing? > > > An alternative idea would be to add > > total_error_count by this patch, resulting in having both error_count > > and total_error_count. Regarding commit_count and abort_count, I > > personally think xact_commit and xact_rollback would be better since > > they’re more consistent with pg_stat_database view, although we might > > already have discussed that. > > > > Even if we decide to change the column names to > xact_commit/xact_rollback, I think with additional non-error columns > it will be clear to add 'error' in column names corresponding to error > columns, and last_error_* seems to be consistent with what we have in > pg_stat_archiver (last_failed_wal, last_failed_time). Okay, I agree that last_error_* columns will be consistent. > Your point > related to first_error_time has merit and I don't have a better answer > for it. I think it is just a convenience column and we are not sure > whether that will be required in practice so maybe we can drop that > column and come back to it later once we get some field feedback on > this view? +1 Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Failed transaction statistics to measure the logical replication progress
On Sat, Nov 20, 2021 at 1:11 AM Masahiko Sawada wrote: > > I'm concerned that these new names will introduce confusion; if we > have last_error_relid, last_error_command, last_error_message, > last_error_time, and last_error_xid, I think users might think that > first_error_time is the timestamp at which an error occurred for the > first time in the subscription worker. You mean you think users might think "first_error_time" is the timestamp at which the last_error first occurred (rather than the timestamp of the first of any type of error that occurred) on that worker? > ... Also, I'm not sure > last_error_count is not clear to me (it looks like showing something > count but the only "last" one?). It's the number of times that the last_error has occurred. Unless it's some kind of transient error, that might get resolved without intervention, logical replication will get stuck in a loop retrying and the last error will occur again and again, hence the count of how many times that has happened. Maybe there's not much benefit in counting different errors prior to the last error? Regards, Greg Nancarrow Fujitsu Australia
Re: Failed transaction statistics to measure the logical replication progress
On Fri, Nov 19, 2021 at 7:41 PM Masahiko Sawada wrote: > > On Thu, Nov 18, 2021 at 12:26 PM Amit Kapila wrote: > > > > On Wed, Nov 17, 2021 at 7:12 PM osumi.takami...@fujitsu.com > > wrote: > > > > > > On Wednesday, November 17, 2021 10:00 PM Amit Kapila > > > wrote: > > > > > > > > Can you please tell us why you think the names in your proposed patch > > > > are > > > > better than the existing names proposed in Sawada-San's patch? Is it > > > > because > > > > those fields always contain the information of the last or latest error > > > > that > > > > occurred in the corresponding subscription worker? > > > This is one reason. > > > > > > Another big reason comes from the final alignment when we list up all > > > columns of both patches. > > > The patches in this thread is trying to introduce a column that indicates > > > cumulative count of error to show all error counts that the worker got in > > > the past. > > > > > > > Okay, I see your point and it makes sense to rename columns after > > these other stats. I am not able to come up with any better names than > > what is being used here. Sawada-San, do you agree with this, or do let > > us know if you have any better ideas? > > > > I'm concerned that these new names will introduce confusion; if we > have last_error_relid, last_error_command, last_error_message, > last_error_time, and last_error_xid, I think users might think that > first_error_time is the timestamp at which an error occurred for the > first time in the subscription worker. Also, I'm not sure > last_error_count is not clear to me (it looks like showing something > count but the only "last" one?). An alternative idea would be to add > total_error_count by this patch, resulting in having both error_count > and total_error_count. Regarding commit_count and abort_count, I > personally think xact_commit and xact_rollback would be better since > they’re more consistent with pg_stat_database view, although we might > already have discussed that. > > Besides that, I’m not sure how useful commit_bytes, abort_bytes, and > error_bytes are. I originally thought these statistics track the size > of received data, i.g., how much data is transferred from the > publisher and processed on the subscriber. But what the view currently > has is how much memory is used in the subscription worker. The > subscription worker emulates ReorderBufferChangeSize() on the > subscriber side but, as the comment of update_apply_change_size() > mentions, the size in the view is not accurate: > > + * The byte size of transaction on the publisher is calculated by > + * ReorderBufferChangeSize() based on the ReorderBufferChange structure. > + * But on the subscriber, consumed resources are not same as the > + * publisher's decoding processsing and required to be computed in > + * different way. Therefore, the exact same byte size is not restored on > + * the subscriber usually. > > Also, it seems to take into account the size of FlushPosition that is > not taken into account by ReorderBufferChangeSize(). Let's keep the size calculation similar to the publisher side to avoid any confusion, we can try to keep it the same as ReorderBufferChangeSize wherever possible. Regards, Vignesh
Re: Failed transaction statistics to measure the logical replication progress
On Fri, Nov 19, 2021 at 7:41 PM Masahiko Sawada wrote: > > On Thu, Nov 18, 2021 at 12:26 PM Amit Kapila wrote: > > > > On Wed, Nov 17, 2021 at 7:12 PM osumi.takami...@fujitsu.com > > wrote: > > > > > > On Wednesday, November 17, 2021 10:00 PM Amit Kapila > > > wrote: > > > > > > > > Can you please tell us why you think the names in your proposed patch > > > > are > > > > better than the existing names proposed in Sawada-San's patch? Is it > > > > because > > > > those fields always contain the information of the last or latest error > > > > that > > > > occurred in the corresponding subscription worker? > > > This is one reason. > > > > > > Another big reason comes from the final alignment when we list up all > > > columns of both patches. > > > The patches in this thread is trying to introduce a column that indicates > > > cumulative count of error to show all error counts that the worker got in > > > the past. > > > > > > > Okay, I see your point and it makes sense to rename columns after > > these other stats. I am not able to come up with any better names than > > what is being used here. Sawada-San, do you agree with this, or do let > > us know if you have any better ideas? > > > > I'm concerned that these new names will introduce confusion; if we > have last_error_relid, last_error_command, last_error_message, > last_error_time, and last_error_xid, I think users might think that > first_error_time is the timestamp at which an error occurred for the > first time in the subscription worker. > Isn't to some extent that confusion already exists because of last_error_time column? > Also, I'm not sure > last_error_count is not clear to me (it looks like showing something > count but the only "last" one?). > I feel if all the error related columns have "last_error_" as a prefix then it should not be that confusing? > An alternative idea would be to add > total_error_count by this patch, resulting in having both error_count > and total_error_count. Regarding commit_count and abort_count, I > personally think xact_commit and xact_rollback would be better since > they’re more consistent with pg_stat_database view, although we might > already have discussed that. > Even if we decide to change the column names to xact_commit/xact_rollback, I think with additional non-error columns it will be clear to add 'error' in column names corresponding to error columns, and last_error_* seems to be consistent with what we have in pg_stat_archiver (last_failed_wal, last_failed_time). Your point related to first_error_time has merit and I don't have a better answer for it. I think it is just a convenience column and we are not sure whether that will be required in practice so maybe we can drop that column and come back to it later once we get some field feedback on this view? -- With Regards, Amit Kapila.
Re: Failed transaction statistics to measure the logical replication progress
On Thu, Nov 18, 2021 at 12:26 PM Amit Kapila wrote: > > On Wed, Nov 17, 2021 at 7:12 PM osumi.takami...@fujitsu.com > wrote: > > > > On Wednesday, November 17, 2021 10:00 PM Amit Kapila > > wrote: > > > > > > Can you please tell us why you think the names in your proposed patch are > > > better than the existing names proposed in Sawada-San's patch? Is it > > > because > > > those fields always contain the information of the last or latest error > > > that > > > occurred in the corresponding subscription worker? > > This is one reason. > > > > Another big reason comes from the final alignment when we list up all > > columns of both patches. > > The patches in this thread is trying to introduce a column that indicates > > cumulative count of error to show all error counts that the worker got in > > the past. > > > > Okay, I see your point and it makes sense to rename columns after > these other stats. I am not able to come up with any better names than > what is being used here. Sawada-San, do you agree with this, or do let > us know if you have any better ideas? > I'm concerned that these new names will introduce confusion; if we have last_error_relid, last_error_command, last_error_message, last_error_time, and last_error_xid, I think users might think that first_error_time is the timestamp at which an error occurred for the first time in the subscription worker. Also, I'm not sure last_error_count is not clear to me (it looks like showing something count but the only "last" one?). An alternative idea would be to add total_error_count by this patch, resulting in having both error_count and total_error_count. Regarding commit_count and abort_count, I personally think xact_commit and xact_rollback would be better since they’re more consistent with pg_stat_database view, although we might already have discussed that. Besides that, I’m not sure how useful commit_bytes, abort_bytes, and error_bytes are. I originally thought these statistics track the size of received data, i.g., how much data is transferred from the publisher and processed on the subscriber. But what the view currently has is how much memory is used in the subscription worker. The subscription worker emulates ReorderBufferChangeSize() on the subscriber side but, as the comment of update_apply_change_size() mentions, the size in the view is not accurate: + * The byte size of transaction on the publisher is calculated by + * ReorderBufferChangeSize() based on the ReorderBufferChange structure. + * But on the subscriber, consumed resources are not same as the + * publisher's decoding processsing and required to be computed in + * different way. Therefore, the exact same byte size is not restored on + * the subscriber usually. Also, it seems to take into account the size of FlushPosition that is not taken into account by ReorderBufferChangeSize(). I guess that the purpose of these values is to compare them to total_bytes, stream_byte, and spill_bytes but if the calculation is not accurate, does it mean that the more stats are updated, the more the stats will be getting inaccurate? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
RE: Failed transaction statistics to measure the logical replication progress
On Thursday, November 18, 2021 12:26 PM Amit Kapila wrote: > BTW, I think the way you are computing error_count in > pgstat_recv_subworker_error() doesn't seem correct to me because it will > accumulate the counter/bytes for the same error again and again. > You might want to update these counters after we have checked that the > received error is not the same as the previous one. Thank you for your comments ! This is addressed by v13 patchset [1] [1] - https://www.postgresql.org/message-id/TYCPR01MB8373533A5C24BDDA516DA7E1ED9B9%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Thursday, November 18, 2021 8:35 PM vignesh C wrote: > On Tue, Nov 16, 2021 at 6:04 PM osumi.takami...@fujitsu.com > wrote: > > > > On Monday, November 15, 2021 9:14 PM I wrote: > > > I've conducted some update for this. > > > (The rebased part is only C code and checked by pgindent) > > I'll update my patches since a new skip xid patch has been shared in > > [1]. > > > > This version includes some minor renames of functions that are related > > to transaction sizes. > > Thanks for the updated patch, Few comments: Thank you for checking the patches ! > 1) since pgstat_get_subworker_prepared_txn is called from only one place and > create is passed as true, we can remove create function parameter or the > function could be removed. > + * Return subscription worker entry with the given subscription OID and > + * gid. > + * -- > + */ > +static PgStat_SW_PreparedXactEntry * > +pgstat_get_subworker_prepared_txn(Oid databaseid, Oid subid, > + char > *gid, bool create) > +{ > + PgStat_StatDBEntry *dbentry; > + PgStat_SW_PreparedXactKey key; Removed the parameter. > 2) Include subworker prepared transactions also > /* > * Don't create tables/functions/subworkers hashtables for > * uninteresting databases. > */ > if (onlydb != InvalidOid) > { > if (dbbuf.databaseid != onlydb && > dbbuf.databaseid != InvalidOid) > break; > } Fixed. > 3) Similarly it should be mentioned in: > reset_dbentry_counters function header, pgstat_read_db_statsfile function > header and pgstat_get_db_entry function comments. Fixed. > 4) I felt we can remove "COMMIT of streaming transaction", since only commit > and commit prepared are the user operations. Shall we change it to "COMMIT > and COMMIT PREPARED will increment this counter." > + commit_count bigint > + > + > + Number of transactions successfully applied in this subscription. > + COMMIT, COMMIT of streaming transaction and COMMIT > PREPARED increments > + this counter. > + > + You are right ! Fixed. > 5) PgStat_SW_PreparedXactEntry should be before > PgStat_SW_PreparedXactKey PgStat_StatSubWorkerEntry > PgStat_StatSubWorkerKey > +PgStat_SW_PreparedXactKey > +PgStat_SW_PreparedXactEntry > PgStat_StatTabEntry > PgStat_SubXactStatus Fixed. > 6) This change is not required > @@ -293,6 +306,7 @@ static inline void cleanup_subxact_info(void); static > void stream_cleanup_files(Oid subid, TransactionId xid); static void > stream_open_file(Oid subid, TransactionId xid, bool first); static void > stream_write_change(char action, StringInfo s); > + > static void stream_close_file(void); Removed. Other changes are 1. refined the commit message of v13-0003*. 2. made the additional comment for ApplyErrorCallbackArg simple. 3. wrote more explanations about update_apply_change_size() as comment. 4. changed the behavior of pgstat_recv_subworker_error so that it can store stats info only once per error. 5. added one simple test for PREPARE and COMMIT PREPARED. This used v23 skip xid patch [1]. (I will remove v13-0001* when the column names are fixed and Sawada-san starts to take care of the column name definitions) [1] - https://www.postgresql.org/message-id/CAD21AoA5jupM6O%3DpYsyfaxQ1aMX-en8%3DQNgpW6KfXsg7_CS0CQ%40mail.gmail.com Best Regards, Takamichi Osumi v13-0001-Rename-existing-columns-of-pg_stat_subscription_.patch Description: v13-0001-Rename-existing-columns-of-pg_stat_subscription_.patch v13-0002-Export-PartitionTupleRouting-for-transaction-siz.patch Description: v13-0002-Export-PartitionTupleRouting-for-transaction-siz.patch v13-0003-Extend-pg_stat_subscription_workers-to-include-g.patch Description: v13-0003-Extend-pg_stat_subscription_workers-to-include-g.patch
Re: Failed transaction statistics to measure the logical replication progress
On Tue, Nov 16, 2021 at 6:04 PM osumi.takami...@fujitsu.com wrote: > > On Monday, November 15, 2021 9:14 PM I wrote: > > I've conducted some update for this. > > (The rebased part is only C code and checked by pgindent) > I'll update my patches since a new skip xid patch > has been shared in [1]. > > This version includes some minor renames of functions > that are related to transaction sizes. Thanks for the updated patch, Few comments: 1) since pgstat_get_subworker_prepared_txn is called from only one place and create is passed as true, we can remove create function parameter or the function could be removed. + * Return subscription worker entry with the given subscription OID and + * gid. + * -- + */ +static PgStat_SW_PreparedXactEntry * +pgstat_get_subworker_prepared_txn(Oid databaseid, Oid subid, + char *gid, bool create) +{ + PgStat_StatDBEntry *dbentry; + PgStat_SW_PreparedXactKey key; 2) Include subworker prepared transactions also /* * Don't create tables/functions/subworkers hashtables for * uninteresting databases. */ if (onlydb != InvalidOid) { if (dbbuf.databaseid != onlydb && dbbuf.databaseid != InvalidOid) break; } 3) Similarly it should be mentioned in: reset_dbentry_counters function header, pgstat_read_db_statsfile function header and pgstat_get_db_entry function comments. 4) I felt we can remove "COMMIT of streaming transaction", since only commit and commit prepared are the user operations. Shall we change it to "COMMIT and COMMIT PREPARED will increment this counter." + commit_count bigint + + + Number of transactions successfully applied in this subscription. + COMMIT, COMMIT of streaming transaction and COMMIT PREPARED increments + this counter. + + 5) PgStat_SW_PreparedXactEntry should be before PgStat_SW_PreparedXactKey PgStat_StatSubWorkerEntry PgStat_StatSubWorkerKey +PgStat_SW_PreparedXactKey +PgStat_SW_PreparedXactEntry PgStat_StatTabEntry PgStat_SubXactStatus 6) This change is not required @@ -293,6 +306,7 @@ static inline void cleanup_subxact_info(void); static void stream_cleanup_files(Oid subid, TransactionId xid); static void stream_open_file(Oid subid, TransactionId xid, bool first); static void stream_write_change(char action, StringInfo s); + static void stream_close_file(void); Regards, Vignesh
Re: Failed transaction statistics to measure the logical replication progress
On Wed, Nov 17, 2021 at 7:12 PM osumi.takami...@fujitsu.com wrote: > > On Wednesday, November 17, 2021 10:00 PM Amit Kapila > wrote: > > > > Can you please tell us why you think the names in your proposed patch are > > better than the existing names proposed in Sawada-San's patch? Is it because > > those fields always contain the information of the last or latest error that > > occurred in the corresponding subscription worker? > This is one reason. > > Another big reason comes from the final alignment when we list up all columns > of both patches. > The patches in this thread is trying to introduce a column that indicates > cumulative count of error to show all error counts that the worker got in the > past. > Okay, I see your point and it makes sense to rename columns after these other stats. I am not able to come up with any better names than what is being used here. Sawada-San, do you agree with this, or do let us know if you have any better ideas? BTW, I think the way you are computing error_count in pgstat_recv_subworker_error() doesn't seem correct to me because it will accumulate the counter/bytes for the same error again and again. You might want to update these counters after we have checked that the received error is not the same as the previous one. -- With Regards, Amit Kapila.
RE: Failed transaction statistics to measure the logical replication progress
On Wednesday, November 17, 2021 10:00 PM Amit Kapila wrote: > On Wed, Nov 17, 2021 at 9:44 AM osumi.takami...@fujitsu.com > wrote: > > > > On Wednesday, November 17, 2021 12:19 PM Masahiko Sawada > wrote: > > > On Tue, Nov 16, 2021 at 9:34 PM osumi.takami...@fujitsu.com > > > wrote: > > > > > > > > On Monday, November 15, 2021 9:14 PM I wrote: > > > > > I've conducted some update for this. > > > > > (The rebased part is only C code and checked by pgindent) > > > > I'll update my patches since a new skip xid patch has been shared > > > > in [1]. > > > > > > > > This version includes some minor renames of functions that are > > > > related to transaction sizes. > > > > > > I've looked at v12-0001 patch. Here are some comments: > > Thank you for paying attention to this thread ! > > > > > > > - TupleDescInitEntry(tupdesc, (AttrNumber) 3, "relid", > > > + TupleDescInitEntry(tupdesc, (AttrNumber) 3, "last_error_relid", > > >OIDOID, -1, 0); > > > - TupleDescInitEntry(tupdesc, (AttrNumber) 4, "command", > > > + TupleDescInitEntry(tupdesc, (AttrNumber) 4, > > > + "last_error_command", > > >TEXTOID, -1, 0); > > > - TupleDescInitEntry(tupdesc, (AttrNumber) 5, "xid", > > > + TupleDescInitEntry(tupdesc, (AttrNumber) 5, "last_error_xid", > > >XIDOID, -1, 0); > > > - TupleDescInitEntry(tupdesc, (AttrNumber) 6, "error_count", > > > + TupleDescInitEntry(tupdesc, (AttrNumber) 6, "last_error_count", > > >INT8OID, -1, 0); > > > - TupleDescInitEntry(tupdesc, (AttrNumber) 7, "error_message", > > > + TupleDescInitEntry(tupdesc, (AttrNumber) 7, > > > + "last_error_message", > > > > > > If renaming column names clarifies those meanings, the above changes > > > should be included into my patch that introduces > > > pg_stat_subscription_workers view? > > Right. > > > At first, your column names of pg_stat_subscription_workers look > > totally OK to me by itself and I thought I should take care of those > > renaming at > the commit timing of my stats patches. > > > > Can you please tell us why you think the names in your proposed patch are > better than the existing names proposed in Sawada-San's patch? Is it because > those fields always contain the information of the last or latest error that > occurred in the corresponding subscription worker? This is one reason. Another big reason comes from the final alignment when we list up all columns of both patches. The patches in this thread is trying to introduce a column that indicates cumulative count of error to show all error counts that the worker got in the past. In this thread, after two or three improvements, this column name has reached to a simple one 'error_count' (aligned with 'commit_count' and 'abort_count'). Then we need to differentiate what this thread's patch is trying to introduce and what skip xid patch is introducing. > If so, I am not very sure if that is a good reason to increase the length of > most of > the column names but if you and others feel that is helpful then it is better > to do > it as part of Sawada-San's patch. Yes. On this point, it was a mistake that I handled that changes. It'd be better that Sawada-san takes care of them once the names have been fixed. Best Regards, Takamichi Osumi
Re: Failed transaction statistics to measure the logical replication progress
On Wed, Nov 17, 2021 at 9:44 AM osumi.takami...@fujitsu.com wrote: > > On Wednesday, November 17, 2021 12:19 PM Masahiko Sawada > wrote: > > On Tue, Nov 16, 2021 at 9:34 PM osumi.takami...@fujitsu.com > > wrote: > > > > > > On Monday, November 15, 2021 9:14 PM I wrote: > > > > I've conducted some update for this. > > > > (The rebased part is only C code and checked by pgindent) > > > I'll update my patches since a new skip xid patch has been shared in > > > [1]. > > > > > > This version includes some minor renames of functions that are related > > > to transaction sizes. > > > > I've looked at v12-0001 patch. Here are some comments: > Thank you for paying attention to this thread ! > > > > - TupleDescInitEntry(tupdesc, (AttrNumber) 3, "relid", > > + TupleDescInitEntry(tupdesc, (AttrNumber) 3, "last_error_relid", > >OIDOID, -1, 0); > > - TupleDescInitEntry(tupdesc, (AttrNumber) 4, "command", > > + TupleDescInitEntry(tupdesc, (AttrNumber) 4, "last_error_command", > >TEXTOID, -1, 0); > > - TupleDescInitEntry(tupdesc, (AttrNumber) 5, "xid", > > + TupleDescInitEntry(tupdesc, (AttrNumber) 5, "last_error_xid", > >XIDOID, -1, 0); > > - TupleDescInitEntry(tupdesc, (AttrNumber) 6, "error_count", > > + TupleDescInitEntry(tupdesc, (AttrNumber) 6, "last_error_count", > >INT8OID, -1, 0); > > - TupleDescInitEntry(tupdesc, (AttrNumber) 7, "error_message", > > + TupleDescInitEntry(tupdesc, (AttrNumber) 7, "last_error_message", > > > > If renaming column names clarifies those meanings, the above changes should > > be included into my patch that introduces pg_stat_subscription_workers > > view? Right. > At first, your column names of pg_stat_subscription_workers look totally OK > to me by itself > and I thought I should take care of those renaming at the commit timing of my > stats patches. > Can you please tell us why you think the names in your proposed patch are better than the existing names proposed in Sawada-San's patch? Is it because those fields always contain the information of the last or latest error that occurred in the corresponding subscription worker? If so, I am not very sure if that is a good reason to increase the length of most of the column names but if you and others feel that is helpful then it is better to do it as part of Sawada-San's patch. -- With Regards, Amit Kapila.
RE: Failed transaction statistics to measure the logical replication progress
On Wednesday, November 17, 2021 12:19 PM Masahiko Sawada wrote: > On Tue, Nov 16, 2021 at 9:34 PM osumi.takami...@fujitsu.com > wrote: > > > > On Monday, November 15, 2021 9:14 PM I wrote: > > > I've conducted some update for this. > > > (The rebased part is only C code and checked by pgindent) > > I'll update my patches since a new skip xid patch has been shared in > > [1]. > > > > This version includes some minor renames of functions that are related > > to transaction sizes. > > I've looked at v12-0001 patch. Here are some comments: Thank you for paying attention to this thread ! > - TupleDescInitEntry(tupdesc, (AttrNumber) 3, "relid", > + TupleDescInitEntry(tupdesc, (AttrNumber) 3, "last_error_relid", >OIDOID, -1, 0); > - TupleDescInitEntry(tupdesc, (AttrNumber) 4, "command", > + TupleDescInitEntry(tupdesc, (AttrNumber) 4, "last_error_command", >TEXTOID, -1, 0); > - TupleDescInitEntry(tupdesc, (AttrNumber) 5, "xid", > + TupleDescInitEntry(tupdesc, (AttrNumber) 5, "last_error_xid", >XIDOID, -1, 0); > - TupleDescInitEntry(tupdesc, (AttrNumber) 6, "error_count", > + TupleDescInitEntry(tupdesc, (AttrNumber) 6, "last_error_count", >INT8OID, -1, 0); > - TupleDescInitEntry(tupdesc, (AttrNumber) 7, "error_message", > + TupleDescInitEntry(tupdesc, (AttrNumber) 7, "last_error_message", > > If renaming column names clarifies those meanings, the above changes should > be included into my patch that introduces pg_stat_subscription_workers > view? At first, your column names of pg_stat_subscription_workers look totally OK to me by itself and I thought I should take care of those renaming at the commit timing of my stats patches. But, if you agree with the new names above and fixing your patch doesn't bother you, I'd appreciate your help ! > I think that exporting PartitionTupleRouting should not be done in the one > patch together with renaming the view columns. There is not relevance > between them at all. If it's used by v12-0002 patch, I think it should be > included > in that patch or in another separate patch. Yes, it's used by v12-0002. Absolutely you are right. When you update the patch like above, I would like to make it independent. Best Regards, Takamichi Osumi
Re: Failed transaction statistics to measure the logical replication progress
On Tue, Nov 16, 2021 at 9:34 PM osumi.takami...@fujitsu.com wrote: > > On Monday, November 15, 2021 9:14 PM I wrote: > > I've conducted some update for this. > > (The rebased part is only C code and checked by pgindent) > I'll update my patches since a new skip xid patch > has been shared in [1]. > > This version includes some minor renames of functions > that are related to transaction sizes. I've looked at v12-0001 patch. Here are some comments: - TupleDescInitEntry(tupdesc, (AttrNumber) 3, "relid", + TupleDescInitEntry(tupdesc, (AttrNumber) 3, "last_error_relid", OIDOID, -1, 0); - TupleDescInitEntry(tupdesc, (AttrNumber) 4, "command", + TupleDescInitEntry(tupdesc, (AttrNumber) 4, "last_error_command", TEXTOID, -1, 0); - TupleDescInitEntry(tupdesc, (AttrNumber) 5, "xid", + TupleDescInitEntry(tupdesc, (AttrNumber) 5, "last_error_xid", XIDOID, -1, 0); - TupleDescInitEntry(tupdesc, (AttrNumber) 6, "error_count", + TupleDescInitEntry(tupdesc, (AttrNumber) 6, "last_error_count", INT8OID, -1, 0); - TupleDescInitEntry(tupdesc, (AttrNumber) 7, "error_message", + TupleDescInitEntry(tupdesc, (AttrNumber) 7, "last_error_message", If renaming column names clarifies those meanings, the above changes should be included into my patch that introduces pg_stat_subscription_workers view? --- I think that exporting PartitionTupleRouting should not be done in the one patch together with renaming the view columns. There is not relevance between them at all. If it's used by v12-0002 patch, I think it should be included in that patch or in another separate patch. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
RE: Failed transaction statistics to measure the logical replication progress
On Monday, November 15, 2021 9:14 PM I wrote: > I've conducted some update for this. > (The rebased part is only C code and checked by pgindent) I'll update my patches since a new skip xid patch has been shared in [1]. This version includes some minor renames of functions that are related to transaction sizes. [1] - https://www.postgresql.org/message-id/CAD21AoA5jupM6O%3DpYsyfaxQ1aMX-en8%3DQNgpW6KfXsg7_CS0CQ%40mail.gmail.com Best Regards, Takamichi Osumi v12-0001-Rename-existing-columns-of-pg_stat_subscription_.patch Description: v12-0001-Rename-existing-columns-of-pg_stat_subscription_.patch v12-0002-Extend-pg_stat_subscription_workers-to-include-g.patch Description: v12-0002-Extend-pg_stat_subscription_workers-to-include-g.patch
RE: Failed transaction statistics to measure the logical replication progress
On Monday, November 15, 2021 6:28 PM I wrote: > On Thursday, November 11, 2021 9:47 PM vignesh C > wrote: > > Few more comments: ... > Fixed. Also, its name was too long > when aligned with other PgStat_StatDBEntry memebers. > Thus I renamed it as subworkers_preparedsizes. > > This depends on v21 in [1] Hi There was some minor conflict error by the change of v22 in [1]. I've conducted some update for this. (The rebased part is only C code and checked by pgindent) [1] - https://www.postgresql.org/message-id/CAD21AoCE72vKXJp99f4xRw7Mh5ve-Z2roe21gP8Y82_CxXKvbg%40mail.gmail.com Best Regards, Takamichi Osumi v11-0001-Rename-existing-columns-of-pg_stat_subscription_.patch Description: v11-0001-Rename-existing-columns-of-pg_stat_subscription_.patch v11-0002-Extend-pg_stat_subscription_workers-to-include-g.patch Description: v11-0002-Extend-pg_stat_subscription_workers-to-include-g.patch
RE: Failed transaction statistics to measure the logical replication progress
On Wednesday, November 10, 2021 6:13 PM I wrote: > On Wednesday, November 10, 2021 3:43 PM Dilip Kumar > wrote: > > On Tue, Nov 9, 2021 at 5:05 PM osumi.takami...@fujitsu.com > > wrote: > > > On Tuesday, November 9, 2021 12:08 PM Greg Nancarrow > > wrote: > > > > On Fri, Nov 5, 2021 at 7:11 PM osumi.takami...@fujitsu.com > > > > wrote: > > > > > > > > > > > > > I did a quick scan through the latest v8 patch and noticed the > > > > following > > things: > > > I appreciate your review ! > > I have reviewed some part of the patch and I have a few comments > I really appreciate your attention and review. ... > > 3. > > +{ > > +size += *extra_data->stream_write_len; > > +add_apply_error_context_xact_size(size); > > +return; > > +} > > > > From apply_handle_insert(), we are calling update_apply_change_size(), > > and inside this function we are dereferencing > *extra_data->stream_write_len. > > Basically, stream_write_len is in integer pointer and the caller > > hasn't allocated memory for that and inside update_apply_change_size, > > we are directly dereferencing the pointer, how this can be correct. ... > I'll just delete the top part that handles streaming bytes calculation in the > update_apply_change_size(). > It's because now that there is a specific structure to recognize each > streaming > xid and save transaction size there, which makes the top part in question > useless. > > > And I also see that in the > > whole patch stream_write_len, is never used as lvalue so without > > storing anything into this why are we trying to use this as rvalue > > here? This is clearly an issue. > As described above, I'll fix this part and related codes mainly streaming > related > codes in the next version. Removed this deadcodes. Please have a look at [1] [1] - https://www.postgresql.org/message-id/TYCPR01MB8373FEB287F733C81C1E4D42ED989%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Wednesday, November 10, 2021 7:13 PM vignesh C wrote: > On Tue, Nov 9, 2021 at 5:05 PM osumi.takami...@fujitsu.com > wrote: > > Yes. I've rebased and updated the patch, paying attention to this point. > > Attached the updated version. > > Thanks for the updated patch, few comments: > 1) you could rename PgStat_StatSubWorkerPreparedXact to > PgStat_SW_PreparedXactKey or a simpler name which includes key and > similarly change PgStat_StatSubWorkerPreparedXactSize to > PgStat_SW_PreparedXactEntry > > +/* prepared transaction */ > +typedef struct PgStat_StatSubWorkerPreparedXact { > + Oid subid; > + chargid[GIDSIZE]; > +} PgStat_StatSubWorkerPreparedXact; > + > +typedef struct PgStat_StatSubWorkerPreparedXactSize > +{ > + PgStat_StatSubWorkerPreparedXact key; /* hash key */ > + > + Oid subid; > + chargid[GIDSIZE]; > + PgStat_Counter xact_size; > +} PgStat_StatSubWorkerPreparedXactSize; > + Fixed. Adopted your suggested names. > 2) You can change prepared_size to sw_prepared_xact_entry or > prepared_xact_entry since it is a hash entry with few fields > + if (subWorkerPreparedXactSizeHash) > + { > + PgStat_StatSubWorkerPreparedXactSize *prepared_size; > + > + hash_seq_init(, subWorkerPreparedXactSizeHash); > + while((prepared_size = > (PgStat_StatSubWorkerPreparedXactSize *) hash_seq_search()) != > NULL) > + { > + fputc('P', fpout); > + rc = fwrite(prepared_size, > sizeof(PgStat_StatSubWorkerPreparedXactSize), 1, fpout); > + (void) rc; /* we'll check > for error with ferror */ > + } I preferred prepared_xact_entry. Fixed. > 3) This need to be indented > -w.relid, > -w.command, > -w.xid, > -w.error_count, > -w.error_message, > -w.last_error_time > + w.commit_count, > + w.commit_bytes, > + w.error_count, > + w.error_bytes, > + w.abort_count, > + w.abort_bytes, > + w.last_error_relid, > + w.last_error_command, > + w.last_error_xid, > + w.last_error_count, > + w.last_error_message, > + w.last_error_time Fixed. > 4) Instead of adding a function to calculate the size, can we move > PartitionTupleRouting from c file to the header file and use sizeof at the > caller > function? > +/* > + * PartitionTupleRoutingSize - exported to calculate total data size > + * of logical replication mesage apply, because this is one of the > + * ApplyExecutionData struct members. > + */ > +size_t > +PartitionTupleRoutingSize(void) > +{ > + return sizeof(PartitionTupleRouting); } Fixed. > 5) You could run pgindent and pgperltidy for the code and test code to fix the > indent issues. > + > subWorkerPreparedXactSizeHash = hash_create("Subscription worker stats > of prepared txn", > + > > PGSTAT_SUBWORKER_HASH_SIZE, > + > _ctl, > + > HASH_ELEM | HASH_STRINGS | > HASH_CONTEXT); > +# There's no entry at the beginning > +my $result = $node_subscriber->safe_psql('postgres', > +"SELECT count(*) FROM pg_stat_subscription_workers;"); is($result, > +q(0), 'no entry for transaction stats yet'); Conducted pgindent and pgperltidy. > 6) Few places you have used strlcpy and few places you have used memcpy, > you can keep it consistent: > + msg.m_command = command; > + strlcpy(msg.m_gid, gid, sizeof(msg.m_gid)); > + msg.m_xact_bytes = xact_size; > > + key.subid = subid; > + memcpy(key.gid, gid, sizeof(key.gid)); > + action = (create ? HASH_ENTER : HASH_FIND); Fixed. I used strlcpy for new additional functions I made. An exception is pgstat_read_db_statsfile(). In this function, we use memcpy() consistently in other places. Please have a look at [1] [1] - https://www.postgresql.org/message-id/TYCPR01MB8373FEB287F733C81C1E4D42ED989%40TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Thursday, November 11, 2021 9:47 PM vignesh C wrote: > Few more comments: > 1) Here the tuple length is not considered in the calculation, else it will > always > show the fixed size for any size tuple. Ex varchar insert with 1 byte or > varchar > insert with 100's of bytes. So I feel we should include the tuple length in > the > calculation. > + case LOGICAL_REP_MSG_INSERT: > + case LOGICAL_REP_MSG_UPDATE: > + case LOGICAL_REP_MSG_DELETE: > + Assert(extra_data != NULL); > + > + /* > +* Compute size based on ApplyExecutionData. > +* The size of LogicalRepRelMapEntry can be > skipped because > +* it is obtained from hash_search in > logicalrep_rel_open. > +*/ > + size += sizeof(ApplyExecutionData) + sizeof(EState) > + > + sizeof(ResultRelInfo) + > + sizeof(ResultRelInfo); > + > + /* > +* Add some extra size if the target relation > is partitioned. > +* PartitionTupleRouting isn't exported. > Therefore, call the > +* function that returns its size instead. > +*/ > + if > (extra_data->relmapentry->localrel->rd_rel->relkind == > RELKIND_PARTITIONED_TABLE) > + size += sizeof(ModifyTableState) + > PartitionTupleRoutingSize(); > + break; Thanks a lot ! Fixed. > 2) Can this be part of PgStat_StatDBEntry, similar to tables, functions and > subworkers. It might be more appropriate to have it there instead of having > another global variable. > + * Stats of prepared transactions should be displayed > + * at either commit prepared or rollback prepared time, even when it's > + * after the server restart. We have the apply worker send those > +statistics > + * to the stats collector at prepare time and the startup process > +restore > + * those at restart if necessary. > + */ > +static HTAB *subWorkerPreparedXactSizeHash = NULL; > + > +/* Fixed. Also, its name was too long when aligned with other PgStat_StatDBEntry memebers. Thus I renamed it as subworkers_preparedsizes. This depends on v21 in [1] [1] - https://www.postgresql.org/message-id/CAD21AoAkd4YSoQUUFfpcrYOtkPRbninaw3sD0qc77nLW6Q89gg%40mail.gmail.com Best Regards, Takamichi Osumi v10-0001-Rename-existing-columns-of-pg_stat_subscription_.patch Description: v10-0001-Rename-existing-columns-of-pg_stat_subscription_.patch v10-0002-Extend-pg_stat_subscription_workers-to-include-g.patch Description: v10-0002-Extend-pg_stat_subscription_workers-to-include-g.patch
Re: Failed transaction statistics to measure the logical replication progress
On Wed, Nov 10, 2021 at 3:43 PM vignesh C wrote: > > On Tue, Nov 9, 2021 at 5:05 PM osumi.takami...@fujitsu.com > wrote: > > Yes. I've rebased and updated the patch, paying attention to this point. > > Attached the updated version. > > Thanks for the updated patch, few comments: > 6) Few places you have used strlcpy and few places you have used > memcpy, you can keep it consistent: > + msg.m_command = command; > + strlcpy(msg.m_gid, gid, sizeof(msg.m_gid)); > + msg.m_xact_bytes = xact_size; > > + key.subid = subid; > + memcpy(key.gid, gid, sizeof(key.gid)); > + action = (create ? HASH_ENTER : HASH_FIND); Few more comments: 1) Here the tuple length is not considered in the calculation, else it will always show the fixed size for any size tuple. Ex varchar insert with 1 byte or varchar insert with 100's of bytes. So I feel we should include the tuple length in the calculation. + case LOGICAL_REP_MSG_INSERT: + case LOGICAL_REP_MSG_UPDATE: + case LOGICAL_REP_MSG_DELETE: + Assert(extra_data != NULL); + + /* +* Compute size based on ApplyExecutionData. +* The size of LogicalRepRelMapEntry can be skipped because +* it is obtained from hash_search in logicalrep_rel_open. +*/ + size += sizeof(ApplyExecutionData) + sizeof(EState) + + sizeof(ResultRelInfo) + sizeof(ResultRelInfo); + + /* +* Add some extra size if the target relation is partitioned. +* PartitionTupleRouting isn't exported. Therefore, call the +* function that returns its size instead. +*/ + if (extra_data->relmapentry->localrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) + size += sizeof(ModifyTableState) + PartitionTupleRoutingSize(); + break; 2) Can this be part of PgStat_StatDBEntry, similar to tables, functions and subworkers. It might be more appropriate to have it there instead of having another global variable. + * Stats of prepared transactions should be displayed + * at either commit prepared or rollback prepared time, even when it's + * after the server restart. We have the apply worker send those statistics + * to the stats collector at prepare time and the startup process restore + * those at restart if necessary. + */ +static HTAB *subWorkerPreparedXactSizeHash = NULL; + +/* Regards, Vignesh
Re: Failed transaction statistics to measure the logical replication progress
On Tue, Nov 9, 2021 at 5:05 PM osumi.takami...@fujitsu.com wrote: > Yes. I've rebased and updated the patch, paying attention to this point. > Attached the updated version. Thanks for the updated patch, few comments: 1) you could rename PgStat_StatSubWorkerPreparedXact to PgStat_SW_PreparedXactKey or a simpler name which includes key and similarly change PgStat_StatSubWorkerPreparedXactSize to PgStat_SW_PreparedXactEntry +/* prepared transaction */ +typedef struct PgStat_StatSubWorkerPreparedXact +{ + Oid subid; + chargid[GIDSIZE]; +} PgStat_StatSubWorkerPreparedXact; + +typedef struct PgStat_StatSubWorkerPreparedXactSize +{ + PgStat_StatSubWorkerPreparedXact key; /* hash key */ + + Oid subid; + chargid[GIDSIZE]; + PgStat_Counter xact_size; +} PgStat_StatSubWorkerPreparedXactSize; + 2) You can change prepared_size to sw_prepared_xact_entry or prepared_xact_entry since it is a hash entry with few fields + if (subWorkerPreparedXactSizeHash) + { + PgStat_StatSubWorkerPreparedXactSize *prepared_size; + + hash_seq_init(, subWorkerPreparedXactSizeHash); + while((prepared_size = (PgStat_StatSubWorkerPreparedXactSize *) hash_seq_search()) != NULL) + { + fputc('P', fpout); + rc = fwrite(prepared_size, sizeof(PgStat_StatSubWorkerPreparedXactSize), 1, fpout); + (void) rc; /* we'll check for error with ferror */ + } 3) This need to be indented -w.relid, -w.command, -w.xid, -w.error_count, -w.error_message, -w.last_error_time + w.commit_count, + w.commit_bytes, + w.error_count, + w.error_bytes, + w.abort_count, + w.abort_bytes, + w.last_error_relid, + w.last_error_command, + w.last_error_xid, + w.last_error_count, + w.last_error_message, + w.last_error_time 4) Instead of adding a function to calculate the size, can we move PartitionTupleRouting from c file to the header file and use sizeof at the caller function? +/* + * PartitionTupleRoutingSize - exported to calculate total data size + * of logical replication mesage apply, because this is one of the + * ApplyExecutionData struct members. + */ +size_t +PartitionTupleRoutingSize(void) +{ + return sizeof(PartitionTupleRouting); +} 5) You could run pgindent and pgperltidy for the code and test code to fix the indent issues. + subWorkerPreparedXactSizeHash = hash_create("Subscription worker stats of prepared txn", + PGSTAT_SUBWORKER_HASH_SIZE, + _ctl, + HASH_ELEM | HASH_STRINGS | HASH_CONTEXT); +# There's no entry at the beginning +my $result = $node_subscriber->safe_psql('postgres', +"SELECT count(*) FROM pg_stat_subscription_workers;"); +is($result, q(0), 'no entry for transaction stats yet'); 6) Few places you have used strlcpy and few places you have used memcpy, you can keep it consistent: + msg.m_command = command; + strlcpy(msg.m_gid, gid, sizeof(msg.m_gid)); + msg.m_xact_bytes = xact_size; + key.subid = subid; + memcpy(key.gid, gid, sizeof(key.gid)); + action = (create ? HASH_ENTER : HASH_FIND); Regards, Vignesh
RE: Failed transaction statistics to measure the logical replication progress
On Wednesday, November 10, 2021 3:43 PM Dilip Kumar wrote: > On Tue, Nov 9, 2021 at 5:05 PM osumi.takami...@fujitsu.com > wrote: > > On Tuesday, November 9, 2021 12:08 PM Greg Nancarrow > wrote: > > > On Fri, Nov 5, 2021 at 7:11 PM osumi.takami...@fujitsu.com > > > wrote: > > > > > > > > > > I did a quick scan through the latest v8 patch and noticed the following > things: > > I appreciate your review ! > I have reviewed some part of the patch and I have a few comments I really appreciate your attention and review. > 1. > + > + > + error_count bigint > + > + > + Number of transactions that failed to be applied by the table > + sync worker or main apply worker in this subscription. > + > + > > The error_count, should be number of transaction failed to applied? or it > should > be number of error? I thought those were same and currently it gets incremented when an error of apply occurs. Then, it equals to the number of total error. May I have the case when we get different values between those two ? I can be missing something. > 2. > + > + > + error_bytes bigint > + > > How different is error_bytes from the abort_bytes? By the error_bytes, you can see the consumed resources that were acquired during apply but the applying processing stopped by some error. On the other hand, abort_bytes displays bytes used for ROLLBACK PREPARED and stream_abort processing. That's what I intended. > 3. > +{ > +size += *extra_data->stream_write_len; > +add_apply_error_context_xact_size(size); > +return; > +} > > From apply_handle_insert(), we are calling update_apply_change_size(), and > inside this function we are dereferencing *extra_data->stream_write_len. > Basically, stream_write_len is in integer pointer and the caller hasn't > allocated > memory for that and inside update_apply_change_size, we are directly > dereferencing the pointer, how this can be correct. I'm so sorry to make you confused. I'll just delete the top part that handles streaming bytes calculation in the update_apply_change_size(). It's because now that there is a specific structure to recognize each streaming xid and save transaction size there, which makes the top part in question useless. > And I also see that in the > whole patch stream_write_len, is never used as lvalue so without storing > anything into this why are we trying to use this as rvalue here? This is > clearly > an issue. As described above, I'll fix this part and related codes mainly streaming related codes in the next version. Best Regards, Takamichi Osumi
Re: Failed transaction statistics to measure the logical replication progress
On Tue, Nov 9, 2021 at 5:05 PM osumi.takami...@fujitsu.com wrote: > > On Tuesday, November 9, 2021 12:08 PM Greg Nancarrow > wrote: > > On Fri, Nov 5, 2021 at 7:11 PM osumi.takami...@fujitsu.com > > wrote: > > > > > > > I did a quick scan through the latest v8 patch and noticed the following > > things: > I appreciate your review ! > I have reviewed some part of the patch and I have a few comments 1. + + + error_count bigint + + + Number of transactions that failed to be applied by the table + sync worker or main apply worker in this subscription. + + The error_count, should be number of transaction failed to applied? or it should be number of error? 2. + + + error_bytes bigint + How different is error_bytes from the abort_bytes? 3. +{ +size += *extra_data->stream_write_len; +add_apply_error_context_xact_size(size); +return; +} >From apply_handle_insert(), we are calling update_apply_change_size(), and inside this function we are dereferencing *extra_data->stream_write_len. Basically, stream_write_len is in integer pointer and the caller hasn't allocated memory for that and inside update_apply_change_size, we are directly dereferencing the pointer, how this can be correct. And I also see that in the whole patch stream_write_len, is never used as lvalue so without storing anything into this why are we trying to use this as rvalue here? This is clearly an issue. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
RE: Failed transaction statistics to measure the logical replication progress
On Tuesday, November 9, 2021 8:35 PM I wrote: > Yes. I've rebased and updated the patch, paying attention to this point. > Attached the updated version. Forgot to note one thing. This is based on the skip xid v20 shared in [1] [1] - https://www.postgresql.org/message-id/CAD21AoAT42mhcqeB1jPfRL1%2BEUHbZk8MMY_fBgsyZvJeKNpG%2Bw%40mail.gmail.com Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Monday, November 8, 2021 3:12 PM vignesh C wrote: > On Fri, Nov 5, 2021 at 1:42 PM osumi.takami...@fujitsu.com > wrote: > > Lastly, I removed one unnecessary test that checked publisher's stats > > in the TAP tests. > > Also I introduced ApplyTxnExtraData structure to remove void* argument > > of update_apply_change_size that might worsen the readability of codes > > in the previous version. > > Thanks for the updated patch. Thanks you for checking my patch ! > Few comments: > 1) You could remove LogicalRepPreparedTxnData, > LogicalRepCommitPreparedTxnData & LogicalRepRollbackPreparedTxnData > and change it to char *gid to reduce the function parameter and simiplify the > assignment: > + */ > +void > +pgstat_report_subworker_twophase_xact(Oid subid, LogicalRepMsgType > +command, > + >PgStat_Counter xact_size, > + >LogicalRepPreparedTxnData *prepare_data, > + >LogicalRepCommitPreparedTxnData *commit_data, > + >LogicalRepRollbackPreparedTxnData *rollback_data) Fixed. > 2) Shouldn't this change be part of skip xid patch? > - TupleDescInitEntry(tupdesc, (AttrNumber) 4, "command", > + TupleDescInitEntry(tupdesc, (AttrNumber) 10, > + "last_error_command", >TEXTOID, -1, 0); > - TupleDescInitEntry(tupdesc, (AttrNumber) 5, "xid", > + TupleDescInitEntry(tupdesc, (AttrNumber) 11, "last_error_xid", >XIDOID, -1, 0); > - TupleDescInitEntry(tupdesc, (AttrNumber) 6, "error_count", > + TupleDescInitEntry(tupdesc, (AttrNumber) 12, "last_error_count", >INT8OID, -1, 0); > - TupleDescInitEntry(tupdesc, (AttrNumber) 7, "error_message", > + TupleDescInitEntry(tupdesc, (AttrNumber) 13, > + "last_error_message", >TEXTOID, -1, 0); > - TupleDescInitEntry(tupdesc, (AttrNumber) 8, "last_error_time", > + TupleDescInitEntry(tupdesc, (AttrNumber) 14, "last_error_time", Hmm, I didn't think so. Those renames are necessary to make exisiting columns of skip xid separate from newly-introduced xact stats. That means, original names of skip xid columns in v20 by itself are fine and the renames are needed only when this patch gets committed. At present, we cannot guarantee that this patch will be committed so I'd like to take care of those renames. > 3) This newly added structures should be added to typedefs.list: > ApplyTxnExtraData > XactSizeEntry > PgStat_MsgSubWorkerXactEnd > PgStat_MsgSubWorkerTwophaseXact > PgStat_StatSubWorkerPreparedXact > PgStat_StatSubWorkerPreparedXactSize Added. > 4) We are not sending the transaction size in case of table sync, is this > intentional, if so I felt we should document this in > pg_stat_subscription_workers > + /* Report the success of table sync. */ > + pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid, > + > MyLogicalRepWorker->relid, > + > 0 /* no logical message type */, > + > 0 /* xact size */); Right. Updated the doc description. I added the description in a way that the bytes stats are only for apply worker. > 5) pg_stat_subscription_workers has a lot of columns, if we can reduce the > column size the readability will improve, like xact_commit_count to > commit_count, xact_commit_bytes to commit_bytes, etc > +w.xact_commit_count, > +w.xact_commit_bytes, > +w.xact_error_count, > +w.xact_error_bytes, > +w.xact_abort_count, > +w.xact_abort_bytes, It makes sense. Those can be somehow redundant. Tentatively, I renamed only columns' names exported to the users. This is because changing internal data structure as well (e.g. removing the PgStat_StatSubWorkerEntry's prefixes) causes duplication name of 'error_count' members and changing such an internal data structure of skip xid part will have a huge impact of other parts. Kindly imagine a case that we add 'last_' prefix to the all error statistics representing an error of the structure. If you aren't satisfied with this change, please let me know. Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Tuesday, November 9, 2021 12:08 PM Greg Nancarrow wrote: > On Fri, Nov 5, 2021 at 7:11 PM osumi.takami...@fujitsu.com > wrote: > > > > I did a quick scan through the latest v8 patch and noticed the following > things: I appreciate your review ! > src/backend/postmaster/pgstat.c > > (1) pgstat_recv_subworker_twophase_xact() > The copying from msg->m_gid to key.gid does not seem to be correct. > strlen() is being called on a junk value, since key.gid has not been assigned > yet. > It should be changed as follows: > > BEFORE: > + strlcpy(key.gid, msg->m_gid, strlen(key.gid)); > AFTER: > + strlcpy(key.gid, msg->m_gid, sizeof(key.gid)); Fixed. > (2) pgstat_get_subworker_prepared_txn() > Similar to above, strlen() usage is not correct, and should use > strlcpy() instead of memcpy(). > > BEFORE: > + memcpy(key.gid, gid, strlen(key.gid)); > AFTER: > + strlcpy(key.gid, gid, sizeof(key.gid)); Fixed. > (3) stats_reset > Note that the "stats_reset" column has been removed from the > pg_stat_subscription_workers view in the underlying latest v20 patch. Yes. I've rebased and updated the patch, paying attention to this point. Attached the updated version. Best Regards, Takamichi Osumi extend_xact_stats_of_subscription_worker_v9.patch Description: extend_xact_stats_of_subscription_worker_v9.patch
Re: Failed transaction statistics to measure the logical replication progress
On Fri, Nov 5, 2021 at 7:11 PM osumi.takami...@fujitsu.com wrote: > I did a quick scan through the latest v8 patch and noticed the following things: src/backend/postmaster/pgstat.c (1) pgstat_recv_subworker_twophase_xact() The copying from msg->m_gid to key.gid does not seem to be correct. strlen() is being called on a junk value, since key.gid has not been assigned yet. It should be changed as follows: BEFORE: + strlcpy(key.gid, msg->m_gid, strlen(key.gid)); AFTER: + strlcpy(key.gid, msg->m_gid, sizeof(key.gid)); (2) pgstat_get_subworker_prepared_txn() Similar to above, strlen() usage is not correct, and should use strlcpy() instead of memcpy(). BEFORE: + memcpy(key.gid, gid, strlen(key.gid)); AFTER: + strlcpy(key.gid, gid, sizeof(key.gid)); (3) stats_reset Note that the "stats_reset" column has been removed from the pg_stat_subscription_workers view in the underlying latest v20 patch. Regards, Greg Nancarrow Fujitsu Australia
Re: Failed transaction statistics to measure the logical replication progress
On Fri, Nov 5, 2021 at 7:11 PM osumi.takami...@fujitsu.com wrote: > > I'm not sure about the last part. > > additionally increase the available subscriber memory, > Which GUC parameter did you mean by this ? > Could we point out and enalrge the memory size only for > subscriber's apply processing intentionally ? > I incorporated (7) except for this last part. > Will revise according to your reply. > I might have misinterpreted your original description, so I'll re-review that in your latest patch. As a newer version (v20) of the prerequisite patch was posted a day ago, it looks like your patch needs to be rebased against that (as it currently applies on top of the v19 version only). Regards, Greg Nancarrow Fujitsu Australia
Re: Failed transaction statistics to measure the logical replication progress
On Fri, Nov 5, 2021 at 1:42 PM osumi.takami...@fujitsu.com wrote: > > On Thursday, November 4, 2021 9:54 AM Greg Nancarrow > wrote: > > On Tue, Nov 2, 2021 at 12:18 AM osumi.takami...@fujitsu.com > > wrote: > > > > > > On Thursday, October 28, 2021 11:19 PM I wrote: > > > > I've created a new patch that extends pg_stat_subscription_workers > > > > to include other transaction statistics. > > > > > > > > Note that this patch depends on v18 patch-set in [1]... > > > Rebased based on the v19 in [1]. > > > Also, fixed documentation a little and made tests tidy. > > > FYI, the newly included TAP test(027_worker_xact_stats.pl) is stable > > > because I checked that 100 times of its execution in a tight loop all > > > passed. > > > > > > > I have done some basic testing of the patch and have some initial review > > comments: > Thanks for your review ! > > > (1) I think this patch needs to be in "git format-patch" format, with a > > proper > > commit message that describes the purpose of the patch and the functionality > > it adds, and any high-level design points (something like the overview > > given in > > the initial post, and accounting for the subsequent discussion points and > > updated functionality). > Fixed. > > > (2) doc/src/sgml/monitoring.sgml > > There are some grammatical issues in the current description. I suggest > > changing it to something like: > > BEFORE: > > + At least one row per subscription, showing about > > transaction statistics and error summary that > > AFTER: > > + At least one row per subscription, showing transaction > > statistics and information about errors that > Fixed. > > > (2) doc/src/sgml/monitoring.sgml > > The current description seems a little confusing. > > Per subscription, it shows the transaction statistics and any last error > > info from > > tablesync/apply workers? If this is the case, I'd suggest the following > > change: > > > > BEFORE: > > + one row per subscription for transaction statistics and summary of the > > last > > + error reported by workers applying logical replication changes and > > workers > > + handling the initial data copy of the subscribed tables. > > AFTER: > > + one row per subscription, showing corresponding transaction statistics > > and > > + information about the last error reported by workers applying > > logical replication > > + changes or by workers handling the initial data copy of the > > subscribed tables. > Fixed. > > > (3) xact_commit > > I think that the "xact_commit" column should be named "xact_commit_count" > > or "xact_commits". > > Similarly, I think "xact_error" should be named "xact_error_count" or > > "xact_errors", and "xact_aborts" should be named "xact_abort_count" or > > "xact_aborts". > I prefered *_count. Renamed. > > > (4) xact_commit_bytes > > > > + Amount of transactions data successfully applied in this > > subscription. > > + Consumed memory for xact_commit is displayed. > > > > I find this description a bit confusing. "Consumed memory for xact_commit" > > seems different to "transactions data". > > Could the description be something like: Amount of data (in bytes) > > successfully > > applied in this subscription, across "xact_commit_count" > > transactions. > Fixed. > > > (5) > > I'd suggest some minor rewording for the following: > > > > BEFORE: > > + Number of transactions failed to be applied and caught by table > > + sync worker or main apply worker in this subscription. > > AFTER: > > + Number of transactions that failed to be applied by the table > > + sync worker or main apply worker in this subscription. > Fixed. > > > (6) xact_error_bytes > > Again, it's a little confusing referring to "consumed memory" here. > > How about rewording this, something like: > > > > BEFORE: > > + Amount of transactions data unsuccessfully applied in this > > subscription. > > + Consumed memory that past failed transaction used is displayed. > > AFTER: > > + Amount of data (in bytes) unsuccessfully applied in this > > subscription by the last failed transaction. > xact_error_bytes (and other bytes columns as well) is cumulative > so when a new error happens, the size of this new bytes would be > added to the same. So here we shouldn't mention just the last error. > I simply applied your previous comments of 'xact_commit_bytes' > to 'xact_error_bytes' description. > > > (7) > > The additional information provided for "xact_abort_bytes" needs some > > rewording, something like: > > > > BEFORE: > > + Increase logical_decoding_work_mem on the > > publisher > > + so that it exceeds the size of whole streamed transaction > > + to suppress unnecessary consumed network bandwidth in addition to > > change > > + in memory of the subscriber, if unexpected amount of streamed > > transactions > > + are aborted. > > AFTER: > > + In order to suppress unnecessary consumed network bandwidth, > > increase > > +
RE: Failed transaction statistics to measure the logical replication progress
On Thursday, November 4, 2021 9:54 AM Greg Nancarrow wrote: > On Tue, Nov 2, 2021 at 12:18 AM osumi.takami...@fujitsu.com > wrote: > > > > On Thursday, October 28, 2021 11:19 PM I wrote: > > > I've created a new patch that extends pg_stat_subscription_workers > > > to include other transaction statistics. > > > > > > Note that this patch depends on v18 patch-set in [1]... > > Rebased based on the v19 in [1]. > > Also, fixed documentation a little and made tests tidy. > > FYI, the newly included TAP test(027_worker_xact_stats.pl) is stable > > because I checked that 100 times of its execution in a tight loop all > > passed. > > > > I have done some basic testing of the patch and have some initial review > comments: Thanks for your review ! > (1) I think this patch needs to be in "git format-patch" format, with a proper > commit message that describes the purpose of the patch and the functionality > it adds, and any high-level design points (something like the overview given > in > the initial post, and accounting for the subsequent discussion points and > updated functionality). Fixed. > (2) doc/src/sgml/monitoring.sgml > There are some grammatical issues in the current description. I suggest > changing it to something like: > BEFORE: > + At least one row per subscription, showing about > transaction statistics and error summary that > AFTER: > + At least one row per subscription, showing transaction > statistics and information about errors that Fixed. > (2) doc/src/sgml/monitoring.sgml > The current description seems a little confusing. > Per subscription, it shows the transaction statistics and any last error info > from > tablesync/apply workers? If this is the case, I'd suggest the following > change: > > BEFORE: > + one row per subscription for transaction statistics and summary of the > last > + error reported by workers applying logical replication changes and > workers > + handling the initial data copy of the subscribed tables. > AFTER: > + one row per subscription, showing corresponding transaction statistics > and > + information about the last error reported by workers applying > logical replication > + changes or by workers handling the initial data copy of the > subscribed tables. Fixed. > (3) xact_commit > I think that the "xact_commit" column should be named "xact_commit_count" > or "xact_commits". > Similarly, I think "xact_error" should be named "xact_error_count" or > "xact_errors", and "xact_aborts" should be named "xact_abort_count" or > "xact_aborts". I prefered *_count. Renamed. > (4) xact_commit_bytes > > + Amount of transactions data successfully applied in this subscription. > + Consumed memory for xact_commit is displayed. > > I find this description a bit confusing. "Consumed memory for xact_commit" > seems different to "transactions data". > Could the description be something like: Amount of data (in bytes) > successfully > applied in this subscription, across "xact_commit_count" > transactions. Fixed. > (5) > I'd suggest some minor rewording for the following: > > BEFORE: > + Number of transactions failed to be applied and caught by table > + sync worker or main apply worker in this subscription. > AFTER: > + Number of transactions that failed to be applied by the table > + sync worker or main apply worker in this subscription. Fixed. > (6) xact_error_bytes > Again, it's a little confusing referring to "consumed memory" here. > How about rewording this, something like: > > BEFORE: > + Amount of transactions data unsuccessfully applied in this > subscription. > + Consumed memory that past failed transaction used is displayed. > AFTER: > + Amount of data (in bytes) unsuccessfully applied in this > subscription by the last failed transaction. xact_error_bytes (and other bytes columns as well) is cumulative so when a new error happens, the size of this new bytes would be added to the same. So here we shouldn't mention just the last error. I simply applied your previous comments of 'xact_commit_bytes' to 'xact_error_bytes' description. > (7) > The additional information provided for "xact_abort_bytes" needs some > rewording, something like: > > BEFORE: > + Increase logical_decoding_work_mem on the > publisher > + so that it exceeds the size of whole streamed transaction > + to suppress unnecessary consumed network bandwidth in addition to > change > + in memory of the subscriber, if unexpected amount of streamed > transactions > + are aborted. > AFTER: > + In order to suppress unnecessary consumed network bandwidth, > increase > + logical_decoding_work_mem on the publisher so > that it > + exceeds the size of the whole streamed transaction, and > additionally increase > + the available subscriber memory, if an unexpected amount of > streamed transactions > + are aborted. I'm not sure about the last part. > additionally
Re: Failed transaction statistics to measure the logical replication progress
On Tue, Nov 2, 2021 at 12:18 AM osumi.takami...@fujitsu.com wrote: > > On Thursday, October 28, 2021 11:19 PM I wrote: > > I've created a new patch that extends pg_stat_subscription_workers to > > include > > other transaction statistics. > > > > Note that this patch depends on v18 patch-set in [1]... > Rebased based on the v19 in [1]. > Also, fixed documentation a little and made tests tidy. > FYI, the newly included TAP test(027_worker_xact_stats.pl) is stable > because I checked that 100 times of its execution in a tight loop all passed. > I have done some basic testing of the patch and have some initial review comments: (1) I think this patch needs to be in "git format-patch" format, with a proper commit message that describes the purpose of the patch and the functionality it adds, and any high-level design points (something like the overview given in the initial post, and accounting for the subsequent discussion points and updated functionality). (2) doc/src/sgml/monitoring.sgml There are some grammatical issues in the current description. I suggest changing it to something like: BEFORE: + At least one row per subscription, showing about transaction statistics and error summary that AFTER: + At least one row per subscription, showing transaction statistics and information about errors that (2) doc/src/sgml/monitoring.sgml The current description seems a little confusing. Per subscription, it shows the transaction statistics and any last error info from tablesync/apply workers? If this is the case, I'd suggest the following change: BEFORE: + one row per subscription for transaction statistics and summary of the last + error reported by workers applying logical replication changes and workers + handling the initial data copy of the subscribed tables. AFTER: + one row per subscription, showing corresponding transaction statistics and + information about the last error reported by workers applying logical replication + changes or by workers handling the initial data copy of the subscribed tables. (3) xact_commit I think that the "xact_commit" column should be named "xact_commit_count" or "xact_commits". Similarly, I think "xact_error" should be named "xact_error_count" or "xact_errors", and "xact_aborts" should be named "xact_abort_count" or "xact_aborts". (4) xact_commit_bytes + Amount of transactions data successfully applied in this subscription. + Consumed memory for xact_commit is displayed. I find this description a bit confusing. "Consumed memory for xact_commit" seems different to "transactions data". Could the description be something like: Amount of data (in bytes) successfully applied in this subscription, across "xact_commit_count" transactions. (5) I'd suggest some minor rewording for the following: BEFORE: + Number of transactions failed to be applied and caught by table + sync worker or main apply worker in this subscription. AFTER: + Number of transactions that failed to be applied by the table + sync worker or main apply worker in this subscription. (6) xact_error_bytes Again, it's a little confusing referring to "consumed memory" here. How about rewording this, something like: BEFORE: + Amount of transactions data unsuccessfully applied in this subscription. + Consumed memory that past failed transaction used is displayed. AFTER: + Amount of data (in bytes) unsuccessfully applied in this subscription by the last failed transaction. (7) The additional information provided for "xact_abort_bytes" needs some rewording, something like: BEFORE: + Increase logical_decoding_work_mem on the publisher + so that it exceeds the size of whole streamed transaction + to suppress unnecessary consumed network bandwidth in addition to change + in memory of the subscriber, if unexpected amount of streamed transactions + are aborted. AFTER: + In order to suppress unnecessary consumed network bandwidth, increase + logical_decoding_work_mem on the publisher so that it + exceeds the size of the whole streamed transaction, and additionally increase + the available subscriber memory, if an unexpected amount of streamed transactions + are aborted. (8) Suggested update: BEFORE: + * Tell the collector that worker transaction has finished without problem. AFTER: + * Tell the collector that the worker transaction has successfully completed. (9) src/backend/postmaster/pgstat.c I think that the GID copying is unnecessarily copying the whole GID buffer or using an additional strlen(). It should be changed to use strlcpy() to match other code: BEFORE: + /* get the gid for this two phase operation */ + if (command == LOGICAL_REP_MSG_PREPARE || + command == LOGICAL_REP_MSG_STREAM_PREPARE) + memcpy(msg.m_gid, prepare_data->gid, GIDSIZE); + else if (command == LOGICAL_REP_MSG_COMMIT_PREPARED) + memcpy(msg.m_gid, commit_data->gid, GIDSIZE); + else /* rollback
RE: Failed transaction statistics to measure the logical replication progress
On Monday, November 1, 2021 10:18 PM I wrote: > On Thursday, October 28, 2021 11:19 PM I wrote: > > I've created a new patch that extends pg_stat_subscription_workers to > > include other transaction statistics. > > > > Note that this patch depends on v18 patch-set in [1]... > Rebased based on the v19 in [1]. I'd like to add one more explanation about the latest patches for review. On Thursday, October 28, 2021 11:19 PM I wrote: > (2) > Updates of stats on the view at either commit prepared or rollback prepared > time. > This means we don't lost prepared transaction size even after server restart > and user can see the results of two phase operation at those timings. There is another reason I had to treat 'prepare' message like above. Any reviewer might think that sending prepared bytes to stats collector and making the bytes survive over the restart is too much. But, we don't know if the prepared transaction is processed by commit prepared or rollback prepared at 'prepare' time. An execution of commit prepared needs to update the column of xact_commit and xact_commit_bytes while rollback prepared does the column of xact_abort and xact_abort_bytes where this patch is introducing. Therefore, even when we can calculate the bytes of prepared txn at prepare time, it's *not* possible to add the transaction bytes to either stats column of bytes sizes and clean up the bytes. Then, there was a premise that we should not lose the prepared txn bytes by the shutdown so my patch has become the implementation described above. Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Thursday, October 28, 2021 11:19 PM I wrote: > I've created a new patch that extends pg_stat_subscription_workers to include > other transaction statistics. > > Note that this patch depends on v18 patch-set in [1]... Rebased based on the v19 in [1]. Also, fixed documentation a little and made tests tidy. FYI, the newly included TAP test(027_worker_xact_stats.pl) is stable because I checked that 100 times of its execution in a tight loop all passed. [1] - https://www.postgresql.org/message-id/CAD21AoDY-9_x819F_m1_wfCVXXFJrGiSmR2MfC9Nw4nW8Om0qA%40mail.gmail.com Best Regards, Takamichi Osumi extend_xact_stats_of_subscription_worker_v7.patch Description: extend_xact_stats_of_subscription_worker_v7.patch
RE: Failed transaction statistics to measure the logical replication progress
On Thursday, October 21, 2021 10:19 AM Masahiko Sawada wrote: > On Mon, Oct 18, 2021 at 7:03 PM Amit Kapila > wrote: > > > > On Thu, Oct 14, 2021 at 9:23 AM houzj.f...@fujitsu.com > > wrote: > > > > > > On Thursday, September 30, 2021 12:15 PM Amit Kapila > > > > > > > > > > > These all views are related to untransmitted to the collector but > > > > what we really need is a view similar to pg_stat_archiver or > > > > pg_stat_bgwriter which gives information about background workers. > > > > Now, the problem as I see is if we go that route then > > > > pg_stat_subscription will no longer remain dynamic view and one > > > > might consider that as a compatibility break. The other idea I > > > > have shared is that we display these stats under the new view > > > > introduced by Sawada-San's patch [1] and probably rename that view > > > > as pg_stat_subscription_worker where all the stats (xact info and > > > > last failure information) about each worker will be displayed. Do > > > > you have any opinion on that idea or do you see any problem with it? > > > > > > Personally, I think it seems reasonable to merge the xact stat into > > > the view from sawada-san's patch. > > > > > > One problem I noticed is that pg_stat_subscription_error currently > > > have a 'count' column which show how many times the last error > > > happened. The xact stat here also have a similar value 'xact_error'. > > > I think we might need to rename it or merge them into one in some way. > > > > > > Besides, if we decide to merge xact stat into > > > pg_stat_subscription_error, some column seems need to be renamed. > Maybe like: > > > error_message => Last_error_message, command=> > last_error_command.. > > > > > > > Don't you think that keeping the view name as > > pg_stat_subscription_error would be a bit confusing if it has to > > display xact_info? Isn't it better to change it to > > pg_stat_subscription_worker or some other worker-specific generic > > name? > > I agree that it'd be better to include xact info to > pg_stat_subscription_errors > view rather than making pg_stat_subscription a cumulative view. It would be > more simple and consistent. ... >I'll change the view name in the next version patch. Thanks a lot, Sawasa-san. I've created a new patch that extends pg_stat_subscription_workers to include other transaction statistics. Note that this patch depends on v18 patch-set in [1] and needs to be after the perl modules' namespace changes conducted recently by commit b3b4d8e68ae83f432f43f035c7eb481ef93e1583. There are other several major changes compared to the previous version. (1) Addressing several streaming transactions running in parallel on the pub that can send unexpected order of partial data demarcated stream start and stream stop. Even if one of them aborts, now bytes calculation should be correct. (2) Updates of stats on the view at either commit prepared or rollback prepared time. This means we don't lost prepared transaction size even after server restart and user can see the results of two phase operation at those timings. (3) Addition of TAP tests. (4) Renames of other existing columns so that those meanings are more apparent in align with other new stats. Some important details are written in the comments of the attached patch. [1] - https://www.postgresql.org/message-id/CAD21AoCTtQgfy57AxB4q8KUOpRH8rkHN%3Ds_9p9Pvno_XoBK5wg%40mail.gmail.com Best Regards, Takamichi Osumi extend_xact_stats_of_subscription_worker_v6.patch Description: extend_xact_stats_of_subscription_worker_v6.patch
Re: Failed transaction statistics to measure the logical replication progress
On Thu, Oct 21, 2021 at 6:49 AM Masahiko Sawada wrote: > > On Mon, Oct 18, 2021 at 7:03 PM Amit Kapila wrote: > > > > On Thu, Oct 14, 2021 at 9:23 AM houzj.f...@fujitsu.com > > wrote: > > > > > > On Thursday, September 30, 2021 12:15 PM Amit Kapila > > > > > > > > > > > These all views are related to untransmitted to the collector but what > > > > we really need is a view similar to pg_stat_archiver or > > > > pg_stat_bgwriter which gives information about background workers. > > > > Now, the problem as I see is if we go that route then > > > > pg_stat_subscription will no longer remain dynamic view and one might > > > > consider that as a compatibility break. The other idea I have shared > > > > is that we display these stats under the new view introduced by > > > > Sawada-San's patch [1] and probably rename that view as > > > > pg_stat_subscription_worker where all the stats (xact info and last > > > > failure information) about each worker will be displayed. Do you have > > > > any opinion on that idea or do you see any problem with it? > > > > > > Personally, I think it seems reasonable to merge the xact stat into the > > > view from > > > sawada-san's patch. > > > > > > One problem I noticed is that pg_stat_subscription_error > > > currently have a 'count' column which show how many times the last error > > > happened. The xact stat here also have a similar value 'xact_error'. I > > > think we > > > might need to rename it or merge them into one in some way. > > > > > > Besides, if we decide to merge xact stat into pg_stat_subscription_error, > > > some column > > > seems need to be renamed. Maybe like: > > > error_message => Last_error_message, command=> last_error_command.. > > > > > > > Don't you think that keeping the view name as > > pg_stat_subscription_error would be a bit confusing if it has to > > display xact_info? Isn't it better to change it to > > pg_stat_subscription_worker or some other worker-specific generic > > name? > > I agree that it'd be better to include xact info to > pg_stat_subscription_errors view rather than making > pg_stat_subscription a cumulative view. It would be more simple and > consistent. > > The user might want to reset only either error stats or xact stats but > we can do that by passing a value to the reset function. For example, > pg_stat_reset_subscription_worker(10, 20, ‘error') for resetting only > error stats whereas pg_stat_reset_subscription_worker(10, 20, ‘xact’) > for resetting only xact stats etc (passing NULL or omitting the third > argument means resetting all stats). > Sounds reasonable. > I'll change the view name in the > next version patch. > Thanks, that will be helpful. -- With Regards, Amit Kapila.
Re: Failed transaction statistics to measure the logical replication progress
On Mon, Oct 18, 2021 at 7:03 PM Amit Kapila wrote: > > On Thu, Oct 14, 2021 at 9:23 AM houzj.f...@fujitsu.com > wrote: > > > > On Thursday, September 30, 2021 12:15 PM Amit Kapila > > > > > > > > These all views are related to untransmitted to the collector but what > > > we really need is a view similar to pg_stat_archiver or > > > pg_stat_bgwriter which gives information about background workers. > > > Now, the problem as I see is if we go that route then > > > pg_stat_subscription will no longer remain dynamic view and one might > > > consider that as a compatibility break. The other idea I have shared > > > is that we display these stats under the new view introduced by > > > Sawada-San's patch [1] and probably rename that view as > > > pg_stat_subscription_worker where all the stats (xact info and last > > > failure information) about each worker will be displayed. Do you have > > > any opinion on that idea or do you see any problem with it? > > > > Personally, I think it seems reasonable to merge the xact stat into the > > view from > > sawada-san's patch. > > > > One problem I noticed is that pg_stat_subscription_error > > currently have a 'count' column which show how many times the last error > > happened. The xact stat here also have a similar value 'xact_error'. I > > think we > > might need to rename it or merge them into one in some way. > > > > Besides, if we decide to merge xact stat into pg_stat_subscription_error, > > some column > > seems need to be renamed. Maybe like: > > error_message => Last_error_message, command=> last_error_command.. > > > > Don't you think that keeping the view name as > pg_stat_subscription_error would be a bit confusing if it has to > display xact_info? Isn't it better to change it to > pg_stat_subscription_worker or some other worker-specific generic > name? I agree that it'd be better to include xact info to pg_stat_subscription_errors view rather than making pg_stat_subscription a cumulative view. It would be more simple and consistent. The user might want to reset only either error stats or xact stats but we can do that by passing a value to the reset function. For example, pg_stat_reset_subscription_worker(10, 20, ‘error') for resetting only error stats whereas pg_stat_reset_subscription_worker(10, 20, ‘xact’) for resetting only xact stats etc (passing NULL or omitting the third argument means resetting all stats). I'll change the view name in the next version patch. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
RE: Failed transaction statistics to measure the logical replication progress
On Monday, October 18, 2021 11:52 AM Hou, Zhijie/侯 志杰 wrote: > On Thursday, October 14, 2021 2:13 PM Osumi, Takamichi wrote: > > On Thursday, October 14, 2021 12:54 PM Hou, > Zhijie wrote: > > > On Thursday, September 30, 2021 12:15 PM Amit Kapila > > > > On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰wrote: > > > > > > > > > > On Tues, Sep 28, 2021 6:05 PM Amit Kapila > > > > > > > > wrote: > > > > > > > > > > > > Can't we keep the current and new stats both in-memory and > > > > > > persist on disk? So, the persistent stats data will be used to > > > > > > fill the in-memory counters after restarting of workers, > > > > > > otherwise, we will always refer to in-memory values. > > > > > > > > > > I think this approach works, but I have another concern about it. > > > > > > > > > > The current pg_stat_subscription view is listed as "Dynamic > > > > > Statistics Views" in the document, the data in it seems about > > > > > the worker process, and the view data shows what the current > > > > > worker did. But if we keep the new xact stat persist, then it's > > > > > not what the current worker did, it looks more related to the > > > > > subscription historic data. > > > > > > > > > > > > > I see your point. > > > > > > > > > Adding a new view seems resonalble, but it will bring another > > > > > subscription related view which might be too much. OTOH, I can > > > > > see there are already some different views[1] including xact > > > > > stat, maybe adding another one is accepatble ? > > > > > > > > > > > > > These all views are related to untransmitted to the collector but > > > > what we really need is a view similar to pg_stat_archiver or > > > > pg_stat_bgwriter which gives information about background workers. > > > > Now, the problem as I see is if we go that route then > > > > pg_stat_subscription will no longer remain dynamic view and one > > > > might consider that as a compatibility break. The other idea I > > > > have shared is that we display these stats under the new view > > > > introduced by Sawada-San's patch [1] and probably rename that view > > > > as pg_stat_subscription_worker where all the stats (xact info and > > > > last failure information) about each worker will be displayed. Do > > > > you have any opinion on that idea or do you see any problem with it? > > > > > > Personally, I think it seems reasonable to merge the xact stat into > > > the view from sawada-san's patch. > > > > > > One problem I noticed is that pg_stat_subscription_error currently > > > have a 'count' column which show how many times the last error > > > happened. The xact stat here also have a similar value 'xact_error'. > > > I think we might need to rename it or merge them into one in some way. > > > > > > Besides, if we decide to merge xact stat into > > > pg_stat_subscription_error, some column seems need to be renamed. > > Maybe like: > > > error_message => Last_error_message, command=> > last_error_command.. > > Yeah, we must make them distinguished clearly. > > > > I guessed that you are concerned about amount of renaming codes that > > could be a bit large or you come up with a necessity to consider the > > all column names of the pg_stat_subscription_worker together all at > > once in advance. > > > > It's because my instant impression is, when we go with the current > > xact stats column definitions (xact_commit, xact_commit_bytes, > > xact_error, xact_error_bytes, xact_abort, xact_abort_bytes), the > > renaming problem can be solved if I write one additional patch or > > extend the main patch of xact stats to handle renaming. > > (This can work to keep both threads independent from each other). > > > > Did you have some concern that cannot be handled by this way ? > Hi, > > Currently, I don't find some unsolvable issues in this approach. Okay. Glad to hear that. Then, I can restart my implementation with this direction. Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Monday, October 18, 2021 6:04 PM Amit Kapila wrote: > On Thu, Oct 14, 2021 at 9:23 AM houzj.f...@fujitsu.com > wrote: > > > > On Thursday, September 30, 2021 12:15 PM Amit Kapila > > > > > > > > These all views are related to untransmitted to the collector but > > > what we really need is a view similar to pg_stat_archiver or > > > pg_stat_bgwriter which gives information about background workers. > > > Now, the problem as I see is if we go that route then > > > pg_stat_subscription will no longer remain dynamic view and one > > > might consider that as a compatibility break. The other idea I have > > > shared is that we display these stats under the new view introduced > > > by Sawada-San's patch [1] and probably rename that view as > > > pg_stat_subscription_worker where all the stats (xact info and last > > > failure information) about each worker will be displayed. Do you > > > have any opinion on that idea or do you see any problem with it? > > > > Personally, I think it seems reasonable to merge the xact stat into > > the view from sawada-san's patch. > > > > One problem I noticed is that pg_stat_subscription_error currently > > have a 'count' column which show how many times the last error > > happened. The xact stat here also have a similar value 'xact_error'. I > > think we might need to rename it or merge them into one in some way. > > > > Besides, if we decide to merge xact stat into > > pg_stat_subscription_error, some column seems need to be renamed. > Maybe like: > > error_message => Last_error_message, command=> last_error_command.. > > > > Don't you think that keeping the view name as pg_stat_subscription_error > would be a bit confusing if it has to display xact_info? Isn't it better to > change it > to pg_stat_subscription_worker or some other worker-specific generic name? Yes, I agreed that rename the view to pg_stat_subscription_worker or some other worker-specific generic name is better if we decide to move forward with this approach. Best regards, Hou zj
Re: Failed transaction statistics to measure the logical replication progress
On Thu, Oct 14, 2021 at 9:23 AM houzj.f...@fujitsu.com wrote: > > On Thursday, September 30, 2021 12:15 PM Amit Kapila > > > > These all views are related to untransmitted to the collector but what > > we really need is a view similar to pg_stat_archiver or > > pg_stat_bgwriter which gives information about background workers. > > Now, the problem as I see is if we go that route then > > pg_stat_subscription will no longer remain dynamic view and one might > > consider that as a compatibility break. The other idea I have shared > > is that we display these stats under the new view introduced by > > Sawada-San's patch [1] and probably rename that view as > > pg_stat_subscription_worker where all the stats (xact info and last > > failure information) about each worker will be displayed. Do you have > > any opinion on that idea or do you see any problem with it? > > Personally, I think it seems reasonable to merge the xact stat into the view > from > sawada-san's patch. > > One problem I noticed is that pg_stat_subscription_error > currently have a 'count' column which show how many times the last error > happened. The xact stat here also have a similar value 'xact_error'. I think > we > might need to rename it or merge them into one in some way. > > Besides, if we decide to merge xact stat into pg_stat_subscription_error, > some column > seems need to be renamed. Maybe like: > error_message => Last_error_message, command=> last_error_command.. > Don't you think that keeping the view name as pg_stat_subscription_error would be a bit confusing if it has to display xact_info? Isn't it better to change it to pg_stat_subscription_worker or some other worker-specific generic name? -- With Regards, Amit Kapila.
RE: Failed transaction statistics to measure the logical replication progress
On Thursday, October 14, 2021 2:13 PM Osumi, Takamichi wrote: > On Thursday, October 14, 2021 12:54 PM Hou, Zhijie > wrote: > > On Thursday, September 30, 2021 12:15 PM Amit Kapila > > > On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰wrote: > > > > > > > > On Tues, Sep 28, 2021 6:05 PM Amit Kapila > > > > > > wrote: > > > > > > > > > > Can't we keep the current and new stats both in-memory and > > > > > persist on disk? So, the persistent stats data will be used to > > > > > fill the in-memory counters after restarting of workers, > > > > > otherwise, we will always refer to in-memory values. > > > > > > > > I think this approach works, but I have another concern about it. > > > > > > > > The current pg_stat_subscription view is listed as "Dynamic > > > > Statistics Views" in > > > > the document, the data in it seems about the worker process, and > > > > the view data > > > > shows what the current worker did. But if we keep the new xact > > > > stat persist, then it's not what the current worker did, it looks > > > > more related to the subscription historic data. > > > > > > > > > > I see your point. > > > > > > > Adding a new view seems resonalble, but it will bring another > > > > subscription related view which might be too much. OTOH, I can see > > > > there are already some > > > > different views[1] including xact stat, maybe adding another one > > > > is accepatble ? > > > > > > > > > > These all views are related to untransmitted to the collector but > > > what we really need is a view similar to pg_stat_archiver or > > > pg_stat_bgwriter which gives information about background workers. > > > Now, the problem as I see is if we go that route then > > > pg_stat_subscription will no longer remain dynamic view and one > > > might consider that as a compatibility break. The other idea I have > > > shared is that we display these stats under the new view introduced > > > by Sawada-San's patch [1] and probably rename that view as > > > pg_stat_subscription_worker where all the stats (xact info and last > > > failure information) about each worker will be displayed. Do you > > > have any opinion on that idea or do you see any problem with it? > > > > Personally, I think it seems reasonable to merge the xact stat into > > the view from sawada-san's patch. > > > > One problem I noticed is that pg_stat_subscription_error currently > > have a 'count' column which show how many times the last error > > happened. The xact stat here also have a similar value 'xact_error'. I > > think we might need to rename it or merge them into one in some way. > > > > Besides, if we decide to merge xact stat into > > pg_stat_subscription_error, some column seems need to be renamed. > Maybe like: > > error_message => Last_error_message, command=> last_error_command.. > Yeah, we must make them distinguished clearly. > > I guessed that you are concerned about > amount of renaming codes that could be a bit large or you come up with a > necessity to consider the all column names of the pg_stat_subscription_worker > together all at once in advance. > > It's because my instant impression is, > when we go with the current xact stats column definitions (xact_commit, > xact_commit_bytes, xact_error, xact_error_bytes, xact_abort, > xact_abort_bytes), the renaming problem can be solved if I write one > additional patch or extend the main patch of xact stats to handle renaming. > (This can work to keep both threads independent from each other). > > Did you have some concern that cannot be handled by this way ? Hi, Currently, I don't find some unsolvable issues in this approach. Best regards, Hou zj
RE: Failed transaction statistics to measure the logical replication progress
On Thursday, October 14, 2021 12:54 PM Hou, Zhijie/侯 志杰 wrote: > On Thursday, September 30, 2021 12:15 PM Amit Kapila > > > On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰 > wrote: > > > > > > On Tues, Sep 28, 2021 6:05 PM Amit Kapila > wrote: > > > > > > > > Can't we keep the current and new stats both in-memory and persist > > > > on disk? So, the persistent stats data will be used to fill the > > > > in-memory counters after restarting of workers, otherwise, we will > > > > always refer to in-memory values. > > > > > > I think this approach works, but I have another concern about it. > > > > > > The current pg_stat_subscription view is listed as "Dynamic Statistics > Views" > > in > > > the document, the data in it seems about the worker process, and the > > > view > > data > > > shows what the current worker did. But if we keep the new xact stat > > > persist, then it's not what the current worker did, it looks more > > > related to the subscription historic data. > > > > > > > I see your point. > > > > > Adding a new view seems resonalble, but it will bring another > > > subscription related view which might be too much. OTOH, I can see > > > there are already > > some > > > different views[1] including xact stat, maybe adding another one is > > accepatble ? > > > > > > > These all views are related to untransmitted to the collector but what > > we really need is a view similar to pg_stat_archiver or > > pg_stat_bgwriter which gives information about background workers. > > Now, the problem as I see is if we go that route then > > pg_stat_subscription will no longer remain dynamic view and one might > > consider that as a compatibility break. The other idea I have shared > > is that we display these stats under the new view introduced by > > Sawada-San's patch [1] and probably rename that view as > > pg_stat_subscription_worker where all the stats (xact info and last > > failure information) about each worker will be displayed. Do you have > > any opinion on that idea or do you see any problem with it? > > Personally, I think it seems reasonable to merge the xact stat into the view > from > sawada-san's patch. > > One problem I noticed is that pg_stat_subscription_error currently have a > 'count' column which show how many times the last error happened. The xact > stat here also have a similar value 'xact_error'. I think we might need to > rename > it or merge them into one in some way. > > Besides, if we decide to merge xact stat into pg_stat_subscription_error, some > column seems need to be renamed. Maybe like: > error_message => Last_error_message, command=> last_error_command.. Yeah, we must make them distinguished clearly. I guessed that you are concerned about amount of renaming codes that could be a bit large or you come up with a necessity to consider the all column names of the pg_stat_subscription_worker together all at once in advance. It's because my instant impression is, when we go with the current xact stats column definitions (xact_commit, xact_commit_bytes, xact_error, xact_error_bytes, xact_abort, xact_abort_bytes), the renaming problem can be solved if I write one additional patch or extend the main patch of xact stats to handle renaming. (This can work to keep both threads independent from each other). Did you have some concern that cannot be handled by this way ? Best Regards, Takamichi Osumi
RE: Failed transaction statistics to measure the logical replication progress
On Thursday, September 30, 2021 12:15 PM Amit Kapila > On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰 > wrote: > > > > On Tues, Sep 28, 2021 6:05 PM Amit Kapila wrote: > > > > > > Can't we keep the current and new stats both in-memory and persist on > > > disk? So, the persistent stats data will be used to fill the in-memory > > > counters after restarting of workers, otherwise, we will always refer > > > to in-memory values. > > > > I think this approach works, but I have another concern about it. > > > > The current pg_stat_subscription view is listed as "Dynamic Statistics > > Views" > in > > the document, the data in it seems about the worker process, and the view > data > > shows what the current worker did. But if we keep the new xact stat persist, > > then it's not what the current worker did, it looks more related to the > > subscription historic data. > > > > I see your point. > > > Adding a new view seems resonalble, but it will bring another subscription > > related view which might be too much. OTOH, I can see there are already > some > > different views[1] including xact stat, maybe adding another one is > accepatble ? > > > > These all views are related to untransmitted to the collector but what > we really need is a view similar to pg_stat_archiver or > pg_stat_bgwriter which gives information about background workers. > Now, the problem as I see is if we go that route then > pg_stat_subscription will no longer remain dynamic view and one might > consider that as a compatibility break. The other idea I have shared > is that we display these stats under the new view introduced by > Sawada-San's patch [1] and probably rename that view as > pg_stat_subscription_worker where all the stats (xact info and last > failure information) about each worker will be displayed. Do you have > any opinion on that idea or do you see any problem with it? Personally, I think it seems reasonable to merge the xact stat into the view from sawada-san's patch. One problem I noticed is that pg_stat_subscription_error currently have a 'count' column which show how many times the last error happened. The xact stat here also have a similar value 'xact_error'. I think we might need to rename it or merge them into one in some way. Besides, if we decide to merge xact stat into pg_stat_subscription_error, some column seems need to be renamed. Maybe like: error_message => Last_error_message, command=> last_error_command.. Best regards, Hou zj
RE: Failed transaction statistics to measure the logical replication progress
On Thursday, September 30, 2021 8:13 PM Amit Kapila wrote: > On Thu, Sep 30, 2021 at 1:02 PM Osumi, Takamichi/大墨 昂道 > wrote: > > > > On Thursday, September 30, 2021 1:15 PM Amit Kapila > wrote: > > > On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰 > > > wrote: > > > > > > > > On Tues, Sep 28, 2021 6:05 PM Amit Kapila > > > > > > > wrote: > > > > Adding a new view seems resonalble, but it will bring another > > > > subscription related view which might be too much. OTOH, I can see > > > > there are already some different views[1] including xact stat, > > > > maybe adding > > > another one is accepatble ? > > > These all views are related to untransmitted to the collector but > > > what we really need is a view similar to pg_stat_archiver or > > > pg_stat_bgwriter which gives information about background workers. > > > Now, the problem as I see is if we go that route then > > > pg_stat_subscription will no longer remain dynamic view and one > > > might consider that as a compatibility break. The other idea I have > > > shared is that we display these stats under the new view introduced > > > by Sawada-San's patch [1] and probably rename that view as > > > pg_stat_subscription_worker where all the stats (xact info and last > > > failure > > > information) about each worker will be displayed. > > Sorry, all the stats in pg_stat_subscription_worker view ? > > > > There was a discussion that > > the xact info should be displayed from pg_stat_subscription with > > existing stats in the same (which will be changed to persist), but > > when your above proposal comes true, the list of > > pg_stat_subscription_worker's columns will be something like below > > (when I list up major columns). > > > > - subid, subrelid and some other relation attributes required > > - 5 stats values moved from pg_stat_subscription > > received_lsn, last_msg_send_time, last_msg_receipt_time, > > latest_end_lsn, latest_end_time > > > > If we go with the view as pg_stat_subscription_worker, we don't need to move > the existing stats of pg_stat_subscription into it. Thank you for clarifying this point. I understand. Best Regards, Takamichi Osumi