RE: Failed transaction statistics to measure the logical replication progress

2022-03-28 Thread osumi.takami...@fujitsu.com
Hi

On Friday, March 25, 2022 2:36 PM Masahiko Sawada  wrote:
> On Thu, Mar 24, 2022 at 12:30 PM Amit Kapila 
> wrote:
> >
> > On Thu, Mar 3, 2022 at 8:58 AM osumi.takami...@fujitsu.com
> >  wrote:
> > >
> >
> > This patch introduces two new subscription statistics columns
> > (apply_commit_count and apply_rollback_count) to the
> > pg_stat_subscription_stats view for counting cumulative transactions
> > commits/rollbacks for a particular subscription. Now, users can
> > already see the total number of xacts committed/rolled back in a
> > particular database via pg_stat_database, so this can be considered
> > duplicate information.
> 
> Right.
...
> > I am not sure if it is worth adding this additional information or how
> > useful it will be for users. Does anyone else want to weigh in on
> > this?
> >
> > If nobody else sees value in this then I feel it is better to mark
> > this patch as Returned with feedback or Rejected. We can come back to
> > it later if we see more demand for this.
> 
> Marking as Returned with feedback makes sense to me.
OK. Thank you so much for sharing your opinions, Sawada-san and Amit-san.

I changed the status of this entry to "Returned with feedback" accordingly.



Best Regards,
Takamichi Osumi



Re: Failed transaction statistics to measure the logical replication progress

2022-03-24 Thread Masahiko Sawada
On Thu, Mar 24, 2022 at 12:30 PM Amit Kapila  wrote:
>
> On Thu, Mar 3, 2022 at 8:58 AM osumi.takami...@fujitsu.com
>  wrote:
> >
>
> This patch introduces two new subscription statistics columns
> (apply_commit_count and apply_rollback_count) to the
> pg_stat_subscription_stats view for counting cumulative transactions
> commits/rollbacks for a particular subscription. Now, users can
> already see the total number of xacts committed/rolled back in a
> particular database via pg_stat_database, so this can be considered
> duplicate information.

Right.

> OTOH, some users might be interested in the
> stats for a subscription to know how many transactions are
> successfully applied during replication because the information in
> pg_stat_database also includes the operations that happened on the
> node.

I'm not sure how useful this information is in practice. What can we
use this information for?

IIRC the original purpose of this proposed feature is to provide a way
for the users to understand the size and count of the succeeded and
failed transactions. At some point, the patch includes the statistics
of only the counts of commits, rollbacks, and errors. If new
statistics also include the size, it might be useful to achieve the
original goal. But I’m concerned that adding only apply_commit_count
and apply_rollback_count ends up adding the duplicate statistics with
no concrete use cases.

> I am not sure if it is worth adding this additional information or how
> useful it will be for users. Does anyone else want to weigh in on
> this?
>
> If nobody else sees value in this then I feel it is better to mark
> this patch as Returned with feedback or Rejected. We can come back to
> it later if we see more demand for this.

Marking as Returned with feedback makes sense to me.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Failed transaction statistics to measure the logical replication progress

2022-03-23 Thread Amit Kapila
On Thu, Mar 3, 2022 at 8:58 AM osumi.takami...@fujitsu.com
 wrote:
>

This patch introduces two new subscription statistics columns
(apply_commit_count and apply_rollback_count) to the
pg_stat_subscription_stats view for counting cumulative transactions
commits/rollbacks for a particular subscription. Now, users can
already see the total number of xacts committed/rolled back in a
particular database via pg_stat_database, so this can be considered
duplicate information. OTOH, some users might be interested in the
stats for a subscription to know how many transactions are
successfully applied during replication because the information in
pg_stat_database also includes the operations that happened on the
node.

I am not sure if it is worth adding this additional information or how
useful it will be for users. Does anyone else want to weigh in on
this?

If nobody else sees value in this then I feel it is better to mark
this patch as Returned with feedback or Rejected. We can come back to
it later if we see more demand for this.

-- 
With Regards,
Amit Kapila.




RE: Failed transaction statistics to measure the logical replication progress

2022-03-02 Thread osumi.takami...@fujitsu.com
On Wednesday, March 2, 2022 5:29 PM Shi, Yu/侍 雨  wrote:
> A comments on the v26 patch.
Thank you for checking the patch !

> 
> The following document about pg_stat_subscription_stats view only says that
> "showing statistics about errors", should we add something about transactions
> here?
> 
>  
> 
> pg_stat_subscription_stats >pg_stat_subscription_stats
>   One row per subscription, showing statistics about errors.
>   See 
>   pg_stat_subscription_stats for
> details.
>   
>  
> 
> 
> I noticed that the v24 patch has some changes about the description of this
> view. Maybe we can modify to "showing statistics about errors and
> transactions".
You are right. Fixed.

New patch v27 that incorporated your comments is shared in [1].
Kindly have a look at it.


[1] - 
https://www.postgresql.org/message-id/TYCPR01MB837345FB72E3CE9C827AF42CED049%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2022-03-02 Thread osumi.takami...@fujitsu.com
On Wednesday, March 2, 2022 2:18 PM Masahiko Sawada  
wrote:
> On Wed, Mar 2, 2022 at 10:21 AM osumi.takami...@fujitsu.com
>  wrote:
> > Also, I quickly checked other similar views(pg_stat_slru,
> > pg_stat_wal_receiver) commit logs, especially when they introduce columns.
> > But, I couldn't find column name validations.
> > So, I feel this is aligned.
> >
> 
> I've looked at v26 patch and here are some random comments:
Hi, thank you for reviewing !


> +/* determine the subscription entry */
> +Oidm_subid;
> +
> +PgStat_Counter apply_commit_count;
> +PgStat_Counter apply_rollback_count;
> 
> I think it's better to add the prefix "m_" to apply_commit/rollback_count for
> consistency.
Fixed.

 
> ---
> +/*
> + * Increment the counter of commit for subscription statistics.
> + */
> +static void
> +subscription_stats_incr_commit(void)
> +{
> +Assert(OidIsValid(subStats.subid));
> +
> +subStats.apply_commit_count++;
> +}
> +
> 
> I think we don't need the Assert() here since it should not be a problem even 
> if
> subStats.subid is InvalidOid at least in this function.
> 
> If we remove it, we can remove both subscription_stats_incr_commit() and
> +subscription_stats_incr_rollback() as well.
Removed the Assert() from both functions.


> ---
> +void
> +pgstat_report_subscription_xact(bool force) {
> +static TimestampTz last_report = 0;
> +PgStat_MsgSubscriptionXact msg;
> +
> +/* Bailout early if nothing to do */
> +if (!OidIsValid(subStats.subid) ||
> +(subStats.apply_commit_count == 0 &&
> subStats.apply_rollback_count == 0))
> +return;
> +
> 
> +LogicalRepSubscriptionStats subStats =
> +{
> +.subid = InvalidOid,
> +.apply_commit_count = 0,
> +.apply_rollback_count = 0,
> +};
> 
> Do we need subStats.subid? I think we can pass MySubscription->oid (or
> MyLogicalRepWorker->subid) to pgstat_report_subscription_xact() along
> with the pointer of the statistics (subStats). That way, we don't need to 
> expose
> subStats.
Removed the subStats.subid. Also, now I pass the oid to the
pgstat_report_subscription_xact with the pointer of the statistics.

> Also, I think it's better to add "Xact" or something to the struct name. For
> example, SubscriptionXactStats.
Renamed.

> ---
> +
> +typedef struct LogicalRepSubscriptionStats {
> +Oidsubid;
> +
> +int64  apply_commit_count;
> +int64  apply_rollback_count;
> +} LogicalRepSubscriptionStats;
> 
> We need a description for this struct.
> 
> Probably it is better to declare it in logicalworker.h instead so that 
> pgstat.c
> includes it instead of worker_internal.h? worker_internal.h is the header file
> shared by logical replication workers such as apply worker,  tablesync worker,
> and launcher. So it might not be advisable to include it in pgstat.c.
Changed the definition place to logicalworker.h
and added some explanations for it.

Attached the updated v27.


Best Regards,
Takamichi Osumi



v27-0001-Extend-pg_stat_subscription_stats-to-include-gen.patch
Description:  v27-0001-Extend-pg_stat_subscription_stats-to-include-gen.patch


RE: Failed transaction statistics to measure the logical replication progress

2022-03-02 Thread shiy.f...@fujitsu.com
Hi,

A comments on the v26 patch.

The following document about pg_stat_subscription_stats view only says that
"showing statistics about errors", should we add something about transactions
here? 

 
  
pg_stat_subscription_statspg_stat_subscription_stats
  One row per subscription, showing statistics about errors.
  See 
  pg_stat_subscription_stats for details.
  
 


I noticed that the v24 patch has some changes about the description of this
view. Maybe we can modify to "showing statistics about errors and transactions".

Regards,
Shi yu


Re: Failed transaction statistics to measure the logical replication progress

2022-03-01 Thread Masahiko Sawada
On Wed, Mar 2, 2022 at 10:21 AM osumi.takami...@fujitsu.com
 wrote:
>
>
> Also, I quickly checked other similar views(pg_stat_slru, 
> pg_stat_wal_receiver)
> commit logs, especially when they introduce columns.
> But, I couldn't find column name validations.
> So, I feel this is aligned.
>

I've looked at v26 patch and here are some random comments:

+/* determine the subscription entry */
+Oidm_subid;
+
+PgStat_Counter apply_commit_count;
+PgStat_Counter apply_rollback_count;

I think it's better to add the prefix "m_" to
apply_commit/rollback_count for consistency.

---
+/*
+ * Increment the counter of commit for subscription statistics.
+ */
+static void
+subscription_stats_incr_commit(void)
+{
+Assert(OidIsValid(subStats.subid));
+
+subStats.apply_commit_count++;
+}
+

I think we don't need the Assert() here since it should not be a
problem even if subStats.subid is InvalidOid at least in this
function.

If we remove it, we can remove both subscription_stats_incr_commit()
and +subscription_stats_incr_rollback() as well.

---
+void
+pgstat_report_subscription_xact(bool force)
+{
+static TimestampTz last_report = 0;
+PgStat_MsgSubscriptionXact msg;
+
+/* Bailout early if nothing to do */
+if (!OidIsValid(subStats.subid) ||
+(subStats.apply_commit_count == 0 &&
subStats.apply_rollback_count == 0))
+return;
+

+LogicalRepSubscriptionStats subStats =
+{
+.subid = InvalidOid,
+.apply_commit_count = 0,
+.apply_rollback_count = 0,
+};

Do we need subStats.subid? I think we can pass MySubscription->oid (or
MyLogicalRepWorker->subid) to pgstat_report_subscription_xact() along
with the pointer of the statistics (subStats). That way, we don't need
to expose subStats.

Also, I think it's better to add "Xact" or something to the struct
name. For example, SubscriptionXactStats.

---
+
+typedef struct LogicalRepSubscriptionStats
+{
+Oidsubid;
+
+int64  apply_commit_count;
+int64  apply_rollback_count;
+} LogicalRepSubscriptionStats;

We need a description for this struct.

Probably it is better to declare it in logicalworker.h instead so that
pgstat.c includes it instead of worker_internal.h? worker_internal.h
is the header file shared by logical replication workers such as apply
worker,  tablesync worker, and launcher. So it might not be advisable
to include it in pgstat.c.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




RE: Failed transaction statistics to measure the logical replication progress

2022-03-01 Thread osumi.takami...@fujitsu.com
On Tuesday, March 1, 2022 4:12 PM Peter Smith  wrote:
> Please see below my review comments for v25.
> 
> ==
> 
> 1. Commit message
> 
> Introduce cumulative columns of transactions of logical replication subscriber
> to the pg_stat_subscription_stats view.
> 
> "cumulative columns of transactions" sounds a bit strange to me.
> 
> SUGGESTED
> Introduce 2 new subscription statistics columns (apply_commit_count, and
> apply_rollback_count) to the pg_stat_subscription_stats view for counting
> cumulative transaction commits/rollbacks.
Fixed.



> ~~~
> 
> 2. doc/src/sgml/monitoring.sgml - bug
> 
> The new SGML s have been added in the wrong place!
> 
> I don't think this renders like you expect it does. Please regenerate the 
> help to
> see for yourself.
Fixed.


> ~~~
> 
> 3. doc/src/sgml/monitoring.sgml - wording
> 
> +  
> +   Number of transactions rollbacked in this subscription. Both
> +   ROLLBACK of transaction streamed as
> in-progress
> +   transaction and ROLLBACK PREPARED
> increment this
> +   counter.
> +  
> 
> BEFORE
> Number of transactions rollbacked in this subscription.
> 
> SUGGESTED
> Number of transaction rollbacks in this subscription.
Fixed.


> ~~~
> 
> 4. doc/src/sgml/monitoring.sgml - wording
> 
> +  
> +   Number of transactions rollbacked in this subscription. Both
> +   ROLLBACK of transaction streamed as
> in-progress
> +   transaction and ROLLBACK PREPARED
> increment this
> +   counter.
> +  
> 
> Trying to distinguish between the ROLLBACK of a transaction and of a
> streamed in-progress transaction seems to have made this description too
> complicated. I don't think the user even cares/knows about this
> (in-progress) distinction. So, I think this should just be written more simply
> (like the COMMIT part was)
> 
> BEFORE
> Both  ROLLBACK of transaction streamed as
> in-progress  transaction and ROLLBACK
> PREPARED increment this counter.
> 
> SUGGESTED
> Both  ROLLBACK and ROLLBACK
> PREPARED increment this counter.
Fixed.


> ~~~
> 
> 5. Question - column names.
> 
> Just curious why the columns are called "apply_commit_count" and
> "apply_rollback_count"? Specifically, what extra meaning do those names have
> versus just calling them "commit_count" and "rollback_count"?
I think there's possibility that we'll have counters
for tablesync commit for example. So, the name prefix avoids
the overlap between the possible names.



> ~~~
> 
> 6. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact
> 
> @@ -3421,6 +3425,60 @@ pgstat_send_slru(void)  }
> 
>  /* --
> + * pgstat_report_subscription_xact() -
> + *
> + * Send a subscription transaction stats to the collector.
> + * The statistics are cleared upon sending.
> + *
> + * 'force' is true only when the subscription worker process exits.
> + * --
> + */
> +void
> +pgstat_report_subscription_xact(bool force)
> 
> 6a.
> I think this comment should be worded more like the other
> pgstat_report_subscption_XXX comments
> 
> BEFORE
> Send a subscription transaction stats to the collector.
> 
> SUGGESTED
> Tell the collector about subscriptions transaction stats.
Fixed.



> 6b.
> + * 'force' is true only when the subscription worker process exits.
> 
> I thought this comment should just describe what the 'force' param actually
> does in this function; not the scenario about who calls it...
Fixed.



> ~~~
> 
> 7. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact
> 
> I think the entire function maybe should be relocated to be nearby the other
> pgstat_report_subscription_XXX functions in the source.
I placed the pgstat_report_subscription_xact below 
pgstat_report_subscription_drop.
Meanwhile, pgstat_recv_subscription_xact, another new function in pgstat.c,
is already placed below pgstat_recv_subscription_error, so I kept it as it is.



> ~~~
> 
> 8. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact
> 
> + /*
> + * This function can be called even if nothing at all has happened. In
> + * this case, there's no need to go forward.
> + */
> 
> Too much information. Clearly, it is possible for this function to be called 
> for this
> case otherwise this code would not exist in the first place :) IMO the comment
> can be much simpler but still say all it needs to.
> 
> BEFORE
> This function can be called even if nothing at all has happened. In this case,
> there's no need to go forward.
> SUGGESTED
> Bailout early if nothing to do.
Fixed.


> ~~~
> 
> 9. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact
> 
> + if (subStats.subid == InvalidOid ||
> + (subStats.apply_commit_count == 0 && subStats.apply_rollback_count ==
> + 0)) return;
> 
> Maybe using !OisIsValid(subStats.subid) is better?
Fixed.


> ~~~
> 
> 10. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact
> 
> + /*
> + * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
> + * msec since we last sent one to avoid 

Re: Failed transaction statistics to measure the logical replication progress

2022-02-28 Thread Peter Smith
Please see below my review comments for v25.

==

1. Commit message

Introduce cumulative columns of transactions of
logical replication subscriber to the pg_stat_subscription_stats view.

"cumulative columns of transactions" sounds a bit strange to me.

SUGGESTED
Introduce 2 new subscription statistics columns (apply_commit_count,
and apply_rollback_count) to the pg_stat_subscription_stats view for
counting cumulative transaction commits/rollbacks.

~~~

2. doc/src/sgml/monitoring.sgml - bug

The new SGML s have been added in the wrong place!

I don't think this renders like you expect it does. Please regenerate
the help to see for yourself.

~~~

3. doc/src/sgml/monitoring.sgml - wording

+  
+   Number of transactions rollbacked in this subscription. Both
+   ROLLBACK of transaction streamed as in-progress
+   transaction and ROLLBACK PREPARED increment this
+   counter.
+  

BEFORE
Number of transactions rollbacked in this subscription.

SUGGESTED
Number of transaction rollbacks in this subscription.

~~~

4. doc/src/sgml/monitoring.sgml - wording

+  
+   Number of transactions rollbacked in this subscription. Both
+   ROLLBACK of transaction streamed as in-progress
+   transaction and ROLLBACK PREPARED increment this
+   counter.
+  

Trying to distinguish between the ROLLBACK of a transaction and of a
streamed in-progress transaction seems to have made this description
too complicated. I don’t think the user even cares/knows about this
(in-progress) distinction. So, I think this should just be written
more simply (like the COMMIT part was)

BEFORE
Both  ROLLBACK of transaction streamed as
in-progress  transaction and ROLLBACK PREPARED
increment this counter.

SUGGESTED
Both  ROLLBACK and ROLLBACK
PREPARED increment this counter.

~~~

5. Question - column names.

Just curious why the columns are called "apply_commit_count" and
"apply_rollback_count"? Specifically, what extra meaning do those
names have versus just calling them "commit_count" and
"rollback_count"?

~~~

6. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact

@@ -3421,6 +3425,60 @@ pgstat_send_slru(void)
 }

 /* --
+ * pgstat_report_subscription_xact() -
+ *
+ * Send a subscription transaction stats to the collector.
+ * The statistics are cleared upon sending.
+ *
+ * 'force' is true only when the subscription worker process exits.
+ * --
+ */
+void
+pgstat_report_subscription_xact(bool force)

6a.
I think this comment should be worded more like the other
pgstat_report_subscption_XXX comments

BEFORE
Send a subscription transaction stats to the collector.

SUGGESTED
Tell the collector about subscriptions transaction stats.

6b.
+ * 'force' is true only when the subscription worker process exits.

I thought this comment should just describe what the 'force' param
actually does in this function; not the scenario about who calls it...

~~~

7. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact

I think the entire function maybe should be relocated to be nearby the
other pgstat_report_subscription_XXX functions in the source.

~~~

8. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact

+ /*
+ * This function can be called even if nothing at all has happened. In
+ * this case, there's no need to go forward.
+ */

Too much information. Clearly, it is possible for this function to be
called for this case otherwise this code would not exist in the first
place :)
IMO the comment can be much simpler but still say all it needs to.

BEFORE
This function can be called even if nothing at all has happened. In
this case, there's no need to go forward.
SUGGESTED
Bailout early if nothing to do.

~~~

9. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact

+ if (subStats.subid == InvalidOid ||
+ (subStats.apply_commit_count == 0 && subStats.apply_rollback_count == 0))
+ return;

Maybe using !OisIsValid(subStats.subid) is better?

~~~

10. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact

+ /*
+ * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+ * msec since we last sent one to avoid overloading the stats
+ * collector.
+ */

SUGGESTED (2 sentences instead of 1)
Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
msec since we last sent one. This is to avoid overloading the stats
collector.

~~~

11. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact

+ if (!force)
+ {
+ TimestampTz now = GetCurrentTimestamp();
+
+ /*
+ * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+ * msec since we last sent one to avoid overloading the stats
+ * collector.
+ */
+ if (!TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
+ return;
+ last_report = now;
+ }

(Yeah, I know there is similar code in this module but 2 wrongs do not
make a right)

I think logically it is better to put the 'now' and the 'last_report'
outside this if (!force) 

RE: Failed transaction statistics to measure the logical replication progress

2022-02-28 Thread osumi.takami...@fujitsu.com
On Friday, February 25, 2022 7:58 AM osumi.takami...@fujitsu.com 
 wrote:
> Kindly have a look at v24.
Hi.

The recent commit(7a85073) has redesigned the view pg_stat_subscription_workers
and now we have pg_stat_subscription_stats. Therefore, I rebased my patch
so that my statistics patch can be applied on top of the HEAD.

In the process of this rebase, I had to drop one column
that stored error count for unique errors(which was
incremented after confirming the error is not the same as previous
one), because the commit tentatively removes the same error check
mechanism.

Therefore, this patch has apply_commit_count, apply_rollback_count only.
I slightly changed minor changes as well so that those can
become more aligned.

Kindly please have a look at the patch.

Best Regards,
Takamichi Osumi



v25-0001-Extend-pg_stat_subscription_stats-to-include-gen.patch
Description:  v25-0001-Extend-pg_stat_subscription_stats-to-include-gen.patch


RE: Failed transaction statistics to measure the logical replication progress

2022-02-24 Thread osumi.takami...@fujitsu.com
On Wednesday, February 23, 2022 3:30 PM Amit Kapila 
> On Tue, Feb 22, 2022 at 6:45 AM tanghy.f...@fujitsu.com
>  wrote:
> >
> > I found a problem when using it. When a replication workers exits, the
> > transaction stats should be sent to stats collector if they were not
> > sent before because it didn't reach PGSTAT_STAT_INTERVAL. But I saw
> > that the stats weren't updated as expected.
> >
> > I looked into it and found that the replication worker would send the
> > transaction stats (if any) before it exits. But it got invalid subid
> > in pgstat_send_subworker_xact_stats(), which led to the following result:
> >
> > postgres=# select pg_stat_get_subscription_worker(0, null);
> > pg_stat_get_subscription_worker
> > -
> >  (0,,2,0,00,"",)
> > (1 row)
> >
> > I think that's because subid has already been cleaned when trying to
> > send the stats. I printed the value of before_shmem_exit_list, the
> > functions in this list would be called in shmem_exit() when the worker 
> > exits.
> > logicalrep_worker_onexit() would clean up the worker info (including
> > subid), and
> > pgstat_shutdown_hook() would send stats if any.
> > logicalrep_worker_onexit() was called before calling
> pgstat_shutdown_hook().
> >
> 
> Yeah, I think that is a problem and maybe we can think of solving it by 
> sending
> the stats via logicalrep_worker_onexit before subid is cleared but not sure 
> that
> is a good idea. I feel we need to go back to the idea of v21 for sending stats
> instead of using pgstat_report_stat.
> I think the one thing which we could improve is to avoid trying to send it 
> each
> time before receiving each message by walrcv_receive and rather try to send it
> before we try to wait (WaitLatchOrSocket).
> Trying after each message doesn't seem to be required and could lead to some
> overhead as well. What do you think?
I agree. Fixed.

Kindly have a look at v24 shared in [1].

[1] - 
https://www.postgresql.org/message-id/TYCPR01MB8373A3E1BE237BAF38185BF2ED3D9%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2022-02-24 Thread osumi.takami...@fujitsu.com
On Thursday, February 24, 2022 11:07 AM Masahiko Sawada  
wrote:
> I have some comments on v23 patch:
> 
> @@ -66,6 +66,12 @@ typedef struct LogicalRepWorker
> TimestampTz last_recv_time;
> XLogRecPtr  reply_lsn;
> TimestampTz reply_time;
> +
> +   /*
> +* Transaction statistics of subscription worker
> +*/
> +   int64   commit_count;
> +   int64   abort_count;
>  } LogicalRepWorker;
> 
> I think that adding these statistics to the struct whose data is allocated on 
> the
> shared memory is not a good idea since they don't need to be shared. We might
> want to add more statistics for subscriptions such as insert_count and
> update_count in the future. I think it's better to track these statistics in 
> local
> memory either in worker.c or pgstat.c.
Fixed.

> +/* --
> + * pgstat_report_subworker_xact_end() -
> + *
> + *  Update the statistics of subscription worker and have
> + *  pgstat_report_stat send a message to stats collector
> + *  after count increment.
> + * --
> + */
> +void
> +pgstat_report_subworker_xact_end(bool is_commit) {
> +if (is_commit)
> +MyLogicalRepWorker->commit_count++;
> +else
> +MyLogicalRepWorker->abort_count++;
> +}
> 
> It's slightly odd and it seems unnecessary to me that we modify fields of
> MyLogicalRepWorker in pgstat.c. Although this function has “report”
> in its name but it just increments the counter. I think we can do that in 
> worker.c.
Fixed.


Also, I made the timing adjustment logic
back and now have the independent one as Amit-san suggested in [1].

Kindly have a look at v24.


[1] - 
https://www.postgresql.org/message-id/CAA4eK1LWYc15%3DASj1tMTEFsXtxu%3D02aGoMwq9YanUVr9-QMhdQ%40mail.gmail.com


Best Regards,
Takamichi Osumi



v24-0001-Extend-pg_stat_subscription_workers-to-include-g.patch
Description:  v24-0001-Extend-pg_stat_subscription_workers-to-include-g.patch


Re: Failed transaction statistics to measure the logical replication progress

2022-02-23 Thread Masahiko Sawada
On Mon, Feb 21, 2022 at 12:45 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Saturday, February 19, 2022 12:00 AM osumi.takami...@fujitsu.com 
>  wrote:
> > On Friday, February 18, 2022 3:34 PM Tang, Haiying/唐 海英
> >  wrote:
> > > On Wed, Jan 12, 2022 8:35 PM osumi.takami...@fujitsu.com
> > >  wrote:
> > > 4) I noticed that the abort_count doesn't include aborted streaming
> > > transactions.
> > > Should we take this case into consideration?
> > Hmm, we can add this into this column, when there's no objection.
> > I'm not sure but someone might say those should be separate columns.
> I've addressed this point in a new v23 patch,
> since there was no opinion on this so far.
>
> Kindly have a look at the attached one.
>

I have some comments on v23 patch:

@@ -66,6 +66,12 @@ typedef struct LogicalRepWorker
TimestampTz last_recv_time;
XLogRecPtr  reply_lsn;
TimestampTz reply_time;
+
+   /*
+* Transaction statistics of subscription worker
+*/
+   int64   commit_count;
+   int64   abort_count;
 } LogicalRepWorker;

I think that adding these statistics to the struct whose data is
allocated on the shared memory is not a good idea since they don't
need to be shared. We might want to add more statistics for
subscriptions such as insert_count and update_count in the future. I
think it's better to track these statistics in local memory either in
worker.c or pgstat.c.

+/* --
+ * pgstat_report_subworker_xact_end() -
+ *
+ *  Update the statistics of subscription worker and have
+ *  pgstat_report_stat send a message to stats collector
+ *  after count increment.
+ * --
+ */
+void
+pgstat_report_subworker_xact_end(bool is_commit)
+{
+if (is_commit)
+MyLogicalRepWorker->commit_count++;
+else
+MyLogicalRepWorker->abort_count++;
+}

It's slightly odd and it seems unnecessary to me that we modify fields
of MyLogicalRepWorker in pgstat.c. Although this function has “report”
in its name but it just increments the counter. I think we can do that
in worker.c.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Failed transaction statistics to measure the logical replication progress

2022-02-22 Thread Amit Kapila
On Tue, Feb 22, 2022 at 6:45 AM tanghy.f...@fujitsu.com
 wrote:
>
> I found a problem when using it. When a replication workers exits, the
> transaction stats should be sent to stats collector if they were not sent 
> before
> because it didn't reach PGSTAT_STAT_INTERVAL. But I saw that the stats weren't
> updated as expected.
>
> I looked into it and found that the replication worker would send the
> transaction stats (if any) before it exits. But it got invalid subid in
> pgstat_send_subworker_xact_stats(), which led to the following result:
>
> postgres=# select pg_stat_get_subscription_worker(0, null);
>  pg_stat_get_subscription_worker
> -
>  (0,,2,0,00,"",)
> (1 row)
>
> I think that's because subid has already been cleaned when trying to send the
> stats. I printed the value of before_shmem_exit_list, the functions in this 
> list
> would be called in shmem_exit() when the worker exits.
> logicalrep_worker_onexit() would clean up the worker info (including subid), 
> and
> pgstat_shutdown_hook() would send stats if any.  logicalrep_worker_onexit() 
> was
> called before calling pgstat_shutdown_hook().
>

Yeah, I think that is a problem and maybe we can think of solving it
by sending the stats via logicalrep_worker_onexit before subid is
cleared but not sure that is a good idea. I feel we need to go back to
the idea of v21 for sending stats instead of using pgstat_report_stat.
I think the one thing which we could improve is to avoid trying to
send it each time before receiving each message by walrcv_receive and
rather try to send it before we try to wait (WaitLatchOrSocket).
Trying after each message doesn't seem to be required and could lead
to some overhead as well. What do you think?

-- 
With Regards,
Amit Kapila.




RE: Failed transaction statistics to measure the logical replication progress

2022-02-21 Thread osumi.takami...@fujitsu.com
On Tuesday, February 22, 2022 10:15 AM Tang, Haiying/唐 海英 
 wrote:
> On Mon, Feb 21, 2022 11:46 AM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Saturday, February 19, 2022 12:00 AM osumi.takami...@fujitsu.com
> >  wrote:
> > > On Friday, February 18, 2022 3:34 PM Tang, Haiying/唐 海英
> > >  wrote:
> > > > On Wed, Jan 12, 2022 8:35 PM osumi.takami...@fujitsu.com
> > > >  wrote:
> > > > 4) I noticed that the abort_count doesn't include aborted
> > > > streaming transactions.
> > > > Should we take this case into consideration?
> > > Hmm, we can add this into this column, when there's no objection.
> > > I'm not sure but someone might say those should be separate columns.
> > I've addressed this point in a new v23 patch, since there was no
> > opinion on this so far.
> >
> > Kindly have a look at the attached one.
> >
> 
> Thanks for updating the patch.
> 
> I found a problem when using it. When a replication workers exits, the
> transaction stats should be sent to stats collector if they were not sent 
> before
> because it didn't reach PGSTAT_STAT_INTERVAL. But I saw that the stats
> weren't updated as expected.
> 
> I looked into it and found that the replication worker would send the 
> transaction
> stats (if any) before it exits. But it got invalid subid in
> pgstat_send_subworker_xact_stats(), which led to the following result:
> 
> postgres=# select pg_stat_get_subscription_worker(0, null);
> pg_stat_get_subscription_worker
> -
>  (0,,2,0,00,"",)
> (1 row)
> 
> I think that's because subid has already been cleaned when trying to send the
> stats. I printed the value of before_shmem_exit_list, the functions in this 
> list
> would be called in shmem_exit() when the worker exits.
> logicalrep_worker_onexit() would clean up the worker info (including subid),
> and
> pgstat_shutdown_hook() would send stats if any.  logicalrep_worker_onexit()
> was called before calling pgstat_shutdown_hook().
> 
> (gdb) p before_shmem_exit_list
> $1 = {{function = 0xa88f1e , arg = 0}, {function =
> 0xb619e7 , arg = 0}, {function = 0xb07b5c
> , arg = 0}, {
> function = 0xabdd93 , arg = 0}, {function =
> 0xe30c89 , arg = 0}, {function = 0x0, arg = 0}  times>}
> 
> Maybe we should make some modification to fix it.
Thank you for letting me know this issue.
I'll investigate this and will report the result.


Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2022-02-21 Thread tanghy.f...@fujitsu.com
On Mon, Feb 21, 2022 11:46 AM osumi.takami...@fujitsu.com 
 wrote:
> 
> On Saturday, February 19, 2022 12:00 AM osumi.takami...@fujitsu.com
>  wrote:
> > On Friday, February 18, 2022 3:34 PM Tang, Haiying/唐 海英
> >  wrote:
> > > On Wed, Jan 12, 2022 8:35 PM osumi.takami...@fujitsu.com
> > >  wrote:
> > > 4) I noticed that the abort_count doesn't include aborted streaming
> > > transactions.
> > > Should we take this case into consideration?
> > Hmm, we can add this into this column, when there's no objection.
> > I'm not sure but someone might say those should be separate columns.
> I've addressed this point in a new v23 patch,
> since there was no opinion on this so far.
> 
> Kindly have a look at the attached one.
> 

Thanks for updating the patch.

I found a problem when using it. When a replication workers exits, the
transaction stats should be sent to stats collector if they were not sent before
because it didn't reach PGSTAT_STAT_INTERVAL. But I saw that the stats weren't
updated as expected.

I looked into it and found that the replication worker would send the 
transaction stats (if any) before it exits. But it got invalid subid in
pgstat_send_subworker_xact_stats(), which led to the following result:

postgres=# select pg_stat_get_subscription_worker(0, null);
 pg_stat_get_subscription_worker
-
 (0,,2,0,00,"",)
(1 row)

I think that's because subid has already been cleaned when trying to send the
stats. I printed the value of before_shmem_exit_list, the functions in this list
would be called in shmem_exit() when the worker exits.
logicalrep_worker_onexit() would clean up the worker info (including subid), and
pgstat_shutdown_hook() would send stats if any.  logicalrep_worker_onexit() was
called before calling pgstat_shutdown_hook().

(gdb) p before_shmem_exit_list
$1 = {{function = 0xa88f1e , arg = 0}, {function = 
0xb619e7 , arg = 0}, {function = 0xb07b5c 
, arg = 0}, {
function = 0xabdd93 , arg = 0}, {function = 
0xe30c89 , arg = 0}, {function = 0x0, arg = 0} }

Maybe we should make some modification to fix it.

Regards,
Tang


RE: Failed transaction statistics to measure the logical replication progress

2022-02-21 Thread osumi.takami...@fujitsu.com
On Monday, February 21, 2022 6:06 PM Monday, February 21, 2022 6:06 PM wrote:
> On Mon, Feb 21, 2022 11:46 AM Osumi, Takamichi/大墨 昂道
>  wrote:
> >I've addressed this point in a new v23 patch, since there was no opinion on
> this so far.
> >Kindly have a look at the attached one.
> Thanks for updating the patch. Here is a comment:
> 
> In function apply_handle_stream_abort:
> @@ -1217,6 +1219,7 @@ apply_handle_stream_abort(StringInfo s)
>   {
>   set_apply_error_context_xact(xid, 0);
>   stream_cleanup_files(MyLogicalRepWorker->subid, xid);
> + pgstat_report_subworker_xact_end(false);
>   }
>   else
>   {
> 
> I think there is a problem here, pgstat_report_stat is not invoked here.
> While the other three places where function
> pgstat_report_subworker_xact_end is invoked, the function pgstat_report_stat
> is invoked.
> Do we need to invoke pgstat_report_stat in apply_handle_stream_abort?
Hi,

I had tested this case before I posted the latest patch v23.
It works when I call pg_stat_report_stat by other transaction.

But, if we want to add pgstat_report_stat here,
I need to investigate the impact of the addition.
I'll check it and let you know.


Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2022-02-21 Thread wangw.f...@fujitsu.com
On Mon, Feb 21, 2022 11:46 AM Osumi, Takamichi/大墨 昂道 
 wrote:
>I've addressed this point in a new v23 patch, since there was no opinion on 
>this so far.
>Kindly have a look at the attached one.
Thanks for updating the patch. Here is a comment:

In function apply_handle_stream_abort:
@@ -1217,6 +1219,7 @@ apply_handle_stream_abort(StringInfo s)
{
set_apply_error_context_xact(xid, 0);
stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+   pgstat_report_subworker_xact_end(false);
}
else
{

I think there is a problem here, pgstat_report_stat is not invoked here.
While the other three places where function pgstat_report_subworker_xact_end is
invoked, the function pgstat_report_stat is invoked.
Do we need to invoke pgstat_report_stat in apply_handle_stream_abort?

Regards,
Wang wei


RE: Failed transaction statistics to measure the logical replication progress

2022-02-20 Thread osumi.takami...@fujitsu.com
On Saturday, February 19, 2022 12:00 AM osumi.takami...@fujitsu.com 
 wrote:
> On Friday, February 18, 2022 3:34 PM Tang, Haiying/唐 海英
>  wrote:
> > On Wed, Jan 12, 2022 8:35 PM osumi.takami...@fujitsu.com
> >  wrote:
> > 4) I noticed that the abort_count doesn't include aborted streaming
> > transactions.
> > Should we take this case into consideration?
> Hmm, we can add this into this column, when there's no objection.
> I'm not sure but someone might say those should be separate columns.
I've addressed this point in a new v23 patch,
since there was no opinion on this so far.

Kindly have a look at the attached one.

Best Regards,
Takamichi Osumi



v23-0001-Extend-pg_stat_subscription_workers-to-include-g.patch
Description:  v23-0001-Extend-pg_stat_subscription_workers-to-include-g.patch


RE: Failed transaction statistics to measure the logical replication progress

2022-02-18 Thread osumi.takami...@fujitsu.com
On Friday, February 18, 2022 3:34 PM Tang, Haiying/唐 海英 
 wrote:
> On Wed, Jan 12, 2022 8:35 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > The attached v21 has a couple of other minor updates like a
> > modification of error message text.
> >
> >
> 
> Thanks for updating the patch. Here are some comments.
Thank you for your reivew !



> 1) I saw the following description about pg_stat_subscription_workers view in
> existing doc:
> 
>The pg_stat_subscription_workers view will
> contain
>one row per subscription worker on which errors have occurred, ...
> 
> It only says  "which errors have occurred", maybe we should also mention
> transactions here, right?
I wrote about this statistics in the next line but as you pointed out,
separating the description into two sentences wasn't good idea.
Fixed.



> 2)
> /* --
> + * pgstat_send_subworker_xact_stats() -
> + *
> + *   Send a subworker's transaction stats to the collector.
> + *   The statistics are cleared upon return.
> 
> Should "The statistics are cleared upon return" changed to "The statistics are
> cleared upon sending"? Because if it doesn't reach PGSTAT_STAT_INTERVAL
> and the transaction stats are not sent, the function will return without 
> clearing
> out statistics.
Now, the purpose of this function has become purely
to send a message and whenever it's called, the function
clears the saved stats. So, I skipped this comments now.


> 3)
> + Assert(command == LOGICAL_REP_MSG_COMMIT ||
> +command == LOGICAL_REP_MSG_STREAM_COMMIT ||
> +command == LOGICAL_REP_MSG_COMMIT_PREPARED
> ||
> +command ==
> LOGICAL_REP_MSG_ROLLBACK_PREPARED);
> +
> + switch (command)
> + {
> + case LOGICAL_REP_MSG_COMMIT:
> + case LOGICAL_REP_MSG_STREAM_COMMIT:
> + case LOGICAL_REP_MSG_COMMIT_PREPARED:
> + MyLogicalRepWorker->commit_count++;
> + break;
> + case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
> + MyLogicalRepWorker->abort_count++;
> + break;
> + default:
> + ereport(ERROR,
> + errmsg("invalid logical message type
> for transaction statistics of subscription"));
> + break;
> + }
> 
> I'm not sure that do we need the assert, because it will report an error 
> later if
> command is an invalid value.
The Assert has been removed, along with the switch branches now.
Since there was an adivce that we should call this from 
apply_handle_commit_internal
and from that function, if we don't want to change this function's argument,
all we need to do is to pass a boolean value that indicates the stats is
commit_count or abort_count. Kindly have a look at the updated version.


> 4) I noticed that the abort_count doesn't include aborted streaming
> transactions.
> Should we take this case into consideration?
Hmm, we can add this into this column, when there's no objection.
I'm not sure but someone might say those should be separate columns.

The new patch v22 is shared in [2].

[2] - 
https://www.postgresql.org/message-id/TYCPR01MB83737C689F8F310C87C19C1EED379%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2022-02-18 Thread osumi.takami...@fujitsu.com
On Friday, February 18, 2022 8:11 PM Amit Kapila  
wrote:
> On Fri, Feb 18, 2022 at 2:04 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Thursday, February 17, 2022 6:45 PM Amit Kapila
>  wrote:
> > > On Thu, Feb 17, 2022 at 3:13 PM Amit Kapila
> > > 
> > > wrote:
> > > > Can't we use pgstat_report_stat() here? Basically, you can update
> > > > xact completetion counters during apply, and then from
> > > > pgstat_report_stat(), you can invoke a logical replication worker
> > > > stats-related function.
> > > >
> > >
> > > If we can do this then we can save the logic this patch is trying to
> > > introduce for PGSTAT_STAT_INTERVAL.
> > Hi, I've encounter a couple of questions during my modification, following
> your advice.
> >
> > In the pgstat_report_stat, we refer to the return value of
> > GetCurrentTransactionStopTimestamp to compare the time different from
> the last time.
> > (In my previous patch, I used GetCurrentTimestamp)
> >
> > This time is updated in apply_handle_commit_internal's
> CommitTransactionCommand for the apply worker.
> > Then, if I update the subscription worker
> > stats(commit_count/abort_count) immediately after this
> > CommitTransactionCommand and immediately call pgstat_report_stat in the
> apply_handle_commit_internal, the time difference becomes too small (falls
> short of PGSTAT_STAT_INTERVAL).
> > Also, the time of GetCurrentTransactionStopTimestamp is not updated
> > even when I keep calling pgstat_report_stat repeatedly.
> > Then, IIUC the next possible timing that message of commit_count or
> > abort_count is sent to the stats collector would become the time when
> > we execute another transaction by the apply worker and update the time
> > for GetCurrentTransactionStopTimestamp
> > and rerun pgstat_report_stat again.
> >
> 
> I think but same is true in the case of the transaction in the backend where 
> we
> increment commit counter via AtEOXact_PgStat after updating the transaction
> time. After that, we call pgstat_report_stat() at later point. How is this 
> case
> different?
> 
> > So, if we keep GetCurrentTransactionStopTimestamp without change, an
> > update of stats depends on another new subsequent transaction if we
> simply merge those.
> > (this leads to users cannot see the latest stats information ?)
> >
> 
> I think this should be okay as these don't need to be accurate.
> 
> > At least, I got a test failure because of this function for streaming
> > commit case because it uses poll_query_until to wait for stats update.
> >
> 
> I feel it is not a good idea to wait for the accurate update of these 
> counters.
Ah, then I had wrote tests based on totally wrong direction and made noises for 
it.
Sorry for that. I don't see tests for existing xact_commit/rollback count,
so I'll follow the same way.

Attached a new patch that addresses three major improvements I've got so far as 
comments.
1. skip increment of stats counter by empty transaction, on the subscriber side
   (except for commit prepared)
2. utilize the existing pgstat_report_stat, instead of having a similar logic 
newly.
3. remove the wrong tests.


Best Regards,
Takamichi Osumi



v22-0001-Extend-pg_stat_subscription_workers-to-include-g.patch
Description:  v22-0001-Extend-pg_stat_subscription_workers-to-include-g.patch


Re: Failed transaction statistics to measure the logical replication progress

2022-02-18 Thread Amit Kapila
On Fri, Feb 18, 2022 at 2:04 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Thursday, February 17, 2022 6:45 PM Amit Kapila  
> wrote:
> > On Thu, Feb 17, 2022 at 3:13 PM Amit Kapila 
> > wrote:
> > > Can't we use pgstat_report_stat() here? Basically, you can update xact
> > > completetion counters during apply, and then from
> > > pgstat_report_stat(), you can invoke a logical replication worker
> > > stats-related function.
> > >
> >
> > If we can do this then we can save the logic this patch is trying to 
> > introduce for
> > PGSTAT_STAT_INTERVAL.
> Hi, I've encounter a couple of questions during my modification, following 
> your advice.
>
> In the pgstat_report_stat, we refer to the return value of
> GetCurrentTransactionStopTimestamp to compare the time different from the 
> last time.
> (In my previous patch, I used GetCurrentTimestamp)
>
> This time is updated in apply_handle_commit_internal's 
> CommitTransactionCommand for the apply worker.
> Then, if I update the subscription worker stats(commit_count/abort_count) 
> immediately after
> this CommitTransactionCommand and immediately call pgstat_report_stat in the 
> apply_handle_commit_internal,
> the time difference becomes too small (falls short of PGSTAT_STAT_INTERVAL).
> Also, the time of GetCurrentTransactionStopTimestamp is not updated
> even when I keep calling pgstat_report_stat repeatedly.
> Then, IIUC the next possible timing that message of commit_count or 
> abort_count
> is sent to the stats collector would become the time when we execute another 
> transaction
> by the apply worker and update the time for GetCurrentTransactionStopTimestamp
> and rerun pgstat_report_stat again.
>

I think but same is true in the case of the transaction in the backend
where we increment commit counter via AtEOXact_PgStat after updating
the transaction time. After that, we call pgstat_report_stat() at
later point. How is this case different?

> So, if we keep GetCurrentTransactionStopTimestamp without change,
> an update of stats depends on another new subsequent transaction if we simply 
> merge those.
> (this leads to users cannot see the latest stats information ?)
>

I think this should be okay as these don't need to be accurate.

> At least, I got a test failure because of this function for streaming commit 
> case
> because it uses poll_query_until to wait for stats update.
>

I feel it is not a good idea to wait for the accurate update of these counters.

-- 
With Regards,
Amit Kapila.




RE: Failed transaction statistics to measure the logical replication progress

2022-02-18 Thread osumi.takami...@fujitsu.com
On Thursday, February 17, 2022 6:45 PM Amit Kapila  
wrote:
> On Thu, Feb 17, 2022 at 3:13 PM Amit Kapila 
> wrote:
> >
> > On Tue, Jan 4, 2022 at 5:22 PM osumi.takami...@fujitsu.com
> >  wrote:
> > >
> > > On Friday, December 31, 2021 10:12 AM Tang, Haiying/唐 海英
>  wrote:
> > > > 4)
> > > > +void
> > > > +pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker,
> > > > +bool
> > > > +force) {
> > > > + static TimestampTz last_report = 0;
> > > > + PgStat_MsgSubWorkerXactEnd msg;
> > > > +
> > > > + if (!force)
> > > > + {
> > > > ...
> > > > + if (!TimestampDifferenceExceeds(last_report, now,
> > > > PGSTAT_STAT_INTERVAL))
> > > > + return;
> > > > + last_report = now;
> > > > + }
> > > > +
> > > > ...
> > > > + if (repWorker->commit_count == 0 && repWorker->abort_count
> > > > + ==
> > > > 0)
> > > > + return;
> > > > ...
> > > >
> > > > I think it's better to check commit_count and abort_count first,
> > > > then check if reach PGSTAT_STAT_INTERVAL.
> > > > Otherwise if commit_count and abort_count are 0, it is possible
> > > > that the value of last_report has been updated but it didn't send
> > > > stats in fact. In this case, last_report is not the real time that send 
> > > > last
> message.
> > > Yeah, agreed. This fix is right in terms of the variable name aspect.
> > >
> >
> > Can't we use pgstat_report_stat() here? Basically, you can update xact
> > completetion counters during apply, and then from
> > pgstat_report_stat(), you can invoke a logical replication worker
> > stats-related function.
> >
> 
> If we can do this then we can save the logic this patch is trying to 
> introduce for
> PGSTAT_STAT_INTERVAL.
Hi, I've encounter a couple of questions during my modification, following your 
advice.

In the pgstat_report_stat, we refer to the return value of
GetCurrentTransactionStopTimestamp to compare the time different from the last 
time.
(In my previous patch, I used GetCurrentTimestamp)

This time is updated in apply_handle_commit_internal's CommitTransactionCommand 
for the apply worker.
Then, if I update the subscription worker stats(commit_count/abort_count) 
immediately after
this CommitTransactionCommand and immediately call pgstat_report_stat in the 
apply_handle_commit_internal,
the time difference becomes too small (falls short of PGSTAT_STAT_INTERVAL).
Also, the time of GetCurrentTransactionStopTimestamp is not updated
even when I keep calling pgstat_report_stat repeatedly.
Then, IIUC the next possible timing that message of commit_count or abort_count
is sent to the stats collector would become the time when we execute another 
transaction
by the apply worker and update the time for GetCurrentTransactionStopTimestamp
and rerun pgstat_report_stat again.

So, if we keep GetCurrentTransactionStopTimestamp without change,
an update of stats depends on another new subsequent transaction if we simply 
merge those.
(this leads to users cannot see the latest stats information ?)
At least, I got a test failure because of this function for streaming commit 
case
because it uses poll_query_until to wait for stats update.

On the other hand, replacing GetCurrentTransactionStopTimestamp with
GetCurrentTimestamp in case of apply worker looks have another negative impact.
If we do so, it becomes possible that we go into the code to scan 
TabStatusArray with
PgStat_TableStatus's trans with non-null values, because of the timing change.

I might be able to avoid this kind of assert failure if I write the message send
function of this patch before other existing functions to send various type of 
messages
and return if the process is apply worker in pgstat_report_stat.
But, I can't be convinced that this way of modification is OK.

What did you think about those issues ?

Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2022-02-17 Thread tanghy.f...@fujitsu.com
On Wed, Jan 12, 2022 8:35 PM osumi.takami...@fujitsu.com 
 wrote:
> 
> The attached v21 has a couple of other minor updates
> like a modification of error message text.
> 
> 

Thanks for updating the patch. Here are some comments.

1) I saw the following description about pg_stat_subscription_workers view in
existing doc:

   The pg_stat_subscription_workers view will contain
   one row per subscription worker on which errors have occurred, ...

It only says  "which errors have occurred", maybe we should also mention
transactions here, right?

2)
/* --
+ * pgstat_send_subworker_xact_stats() -
+ *
+ * Send a subworker's transaction stats to the collector.
+ * The statistics are cleared upon return.

Should "The statistics are cleared upon return" changed to "The statistics are
cleared upon sending"? Because if it doesn't reach PGSTAT_STAT_INTERVAL and the
transaction stats are not sent, the function will return without clearing out
statistics.

3)
+   Assert(command == LOGICAL_REP_MSG_COMMIT ||
+  command == LOGICAL_REP_MSG_STREAM_COMMIT ||
+  command == LOGICAL_REP_MSG_COMMIT_PREPARED ||
+  command == LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+   switch (command)
+   {
+   case LOGICAL_REP_MSG_COMMIT:
+   case LOGICAL_REP_MSG_STREAM_COMMIT:
+   case LOGICAL_REP_MSG_COMMIT_PREPARED:
+   MyLogicalRepWorker->commit_count++;
+   break;
+   case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+   MyLogicalRepWorker->abort_count++;
+   break;
+   default:
+   ereport(ERROR,
+   errmsg("invalid logical message type 
for transaction statistics of subscription"));
+   break;
+   }

I'm not sure that do we need the assert, because it will report an error later
if command is an invalid value.

4) I noticed that the abort_count doesn't include aborted streaming 
transactions.
Should we take this case into consideration?

Regards,
Tang



Re: Failed transaction statistics to measure the logical replication progress

2022-02-17 Thread Amit Kapila
On Thu, Feb 17, 2022 at 3:13 PM Amit Kapila  wrote:
>
> On Tue, Jan 4, 2022 at 5:22 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Friday, December 31, 2021 10:12 AM Tang, Haiying/唐 海英 
> >  wrote:
> > > 4)
> > > +void
> > > +pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker, bool
> > > +force) {
> > > + static TimestampTz last_report = 0;
> > > + PgStat_MsgSubWorkerXactEnd msg;
> > > +
> > > + if (!force)
> > > + {
> > > ...
> > > + if (!TimestampDifferenceExceeds(last_report, now,
> > > PGSTAT_STAT_INTERVAL))
> > > + return;
> > > + last_report = now;
> > > + }
> > > +
> > > ...
> > > + if (repWorker->commit_count == 0 && repWorker->abort_count ==
> > > 0)
> > > + return;
> > > ...
> > >
> > > I think it's better to check commit_count and abort_count first, then 
> > > check if
> > > reach PGSTAT_STAT_INTERVAL.
> > > Otherwise if commit_count and abort_count are 0, it is possible that the 
> > > value
> > > of last_report has been updated but it didn't send stats in fact. In this 
> > > case,
> > > last_report is not the real time that send last message.
> > Yeah, agreed. This fix is right in terms of the variable name aspect.
> >
>
> Can't we use pgstat_report_stat() here? Basically, you can update xact
> completetion counters during apply, and then from
> pgstat_report_stat(), you can invoke a logical replication worker
> stats-related function.
>

If we can do this then we can save the logic this patch is trying to
introduce for PGSTAT_STAT_INTERVAL.

-- 
With Regards,
Amit Kapila.




Re: Failed transaction statistics to measure the logical replication progress

2022-02-17 Thread Amit Kapila
On Tue, Jan 4, 2022 at 5:22 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Friday, December 31, 2021 10:12 AM Tang, Haiying/唐 海英 
>  wrote:
> > 4)
> > +void
> > +pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker, bool
> > +force) {
> > + static TimestampTz last_report = 0;
> > + PgStat_MsgSubWorkerXactEnd msg;
> > +
> > + if (!force)
> > + {
> > ...
> > + if (!TimestampDifferenceExceeds(last_report, now,
> > PGSTAT_STAT_INTERVAL))
> > + return;
> > + last_report = now;
> > + }
> > +
> > ...
> > + if (repWorker->commit_count == 0 && repWorker->abort_count ==
> > 0)
> > + return;
> > ...
> >
> > I think it's better to check commit_count and abort_count first, then check 
> > if
> > reach PGSTAT_STAT_INTERVAL.
> > Otherwise if commit_count and abort_count are 0, it is possible that the 
> > value
> > of last_report has been updated but it didn't send stats in fact. In this 
> > case,
> > last_report is not the real time that send last message.
> Yeah, agreed. This fix is right in terms of the variable name aspect.
>

Can't we use pgstat_report_stat() here? Basically, you can update xact
completetion counters during apply, and then from
pgstat_report_stat(), you can invoke a logical replication worker
stats-related function.

-- 
With Regards,
Amit Kapila.




Re: Failed transaction statistics to measure the logical replication progress

2022-02-17 Thread Amit Kapila
On Wed, Jan 12, 2022 at 6:04 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Thursday, December 23, 2021 6:37 PM Wang, Wei/王 威  
> wrote:
> > On Wednesday, December 22, 2021 10:30 PM osumi.takami...@fujitsu.com
> >  wrote:
> > > On Wednesday, December 22, 2021 8:38 PM I wrote:
> > > > Do we expect these commit counts which come from empty transactions ?
> > > This is another issue discussed in [1] where the patch in the thread
> > > is a work in progress, I think.
> > > ..
> > > IMHO, the conclusion is we are currently in the middle of fixing the 
> > > behavior.
> >
> > Thank you for telling me this.
> > After applying v19-* and
> > v15-0001-Skip-empty-transactions-for-logical-replication.patch,
> > I retested v19-* patches. The result of previous case looks good to me.
> >
> > But the results of following cases are also similar to previous unexpected 
> > result
> > which increases commit_count or abort_count unexpectedly.
> > [1]
> > (Based on environment in the previous example, set TWO_PHASE=true)
> > [Publisher] begin; insert into replica_test1 values(1,'1'); prepare 
> > transaction
> > 'id'; commit prepared 'id';
> >
> > In subscriber side, the commit_count of two records(sub1 and sub2) is
> > increased.
> >
> > [2]
> > (Based on environment in the previous example, set STREAMING=on)
> > [Publisher] begin; INSERT INTO replica_test1 SELECT i, md5(i::text) FROM
> > generate_series(1, 5000) s(i); commit;
> >
> > In subscriber side, the commit_count of two records(sub1 and sub2) is
> > increased.
> >
> > [3]
> > (Based on environment in the previous example, set TWO_PHASE=true)
> > [Publisher] begin; insert into replica_test1 values(1,'1'); prepare 
> > transaction
> > 'id'; rollback prepared 'id';
> >
> > In subscriber side, the abort_count of two records(sub1 and sub2) is
> > increased.
> >
> > I think the problem maybe is the patch you mentioned
> > (Skip-empty-transactions-for-logical-replication.patch) is not finished yet.
> > Share this information here.
> Hi, thank you for your report.
>
> Yes. As the patch's commit message mentions, the patch doesn't
> cover streaming and two phase transactions.
>
> In the above reply, I said that this was an independent issue
> and we were in the middle of the modification of the behavior,
> but empty transaction turned out to be harmful enough for this feature.
>

Isn't that because of this patch? I mean the patch is reporting count
even when during apply we haven't started any transaction. In
particular, if you would have reported stats from
apply_handle_commit_internal() when IsTransactionState() returns true,
then it shouldn't have updated the stats for an empty transaction.
Similarly, I see for rollback_prepared, you don't need to report stats
if there is no prepared transaction to rollback. I think for
commit_prepare case, we can't do much for empty xacts but why would
that be a problem of this patch? I think as far as this patch goes, it
reports the number of completed xacts (committed/aborted) in the
subscription worker, so, it doesn't need to handle any special cases
like empty transactions.

-- 
With Regards,
Amit Kapila.




RE: Failed transaction statistics to measure the logical replication progress

2022-01-12 Thread osumi.takami...@fujitsu.com
On Thursday, December 23, 2021 6:37 PM Wang, Wei/王 威  
wrote:
> On Wednesday, December 22, 2021 10:30 PM osumi.takami...@fujitsu.com
>  wrote:
> > On Wednesday, December 22, 2021 8:38 PM I wrote:
> > > Do we expect these commit counts which come from empty transactions ?
> > This is another issue discussed in [1] where the patch in the thread
> > is a work in progress, I think.
> > ..
> > IMHO, the conclusion is we are currently in the middle of fixing the 
> > behavior.
> 
> Thank you for telling me this.
> After applying v19-* and
> v15-0001-Skip-empty-transactions-for-logical-replication.patch,
> I retested v19-* patches. The result of previous case looks good to me.
> 
> But the results of following cases are also similar to previous unexpected 
> result
> which increases commit_count or abort_count unexpectedly.
> [1]
> (Based on environment in the previous example, set TWO_PHASE=true)
> [Publisher] begin; insert into replica_test1 values(1,'1'); prepare 
> transaction
> 'id'; commit prepared 'id';
> 
> In subscriber side, the commit_count of two records(sub1 and sub2) is
> increased.
> 
> [2]
> (Based on environment in the previous example, set STREAMING=on)
> [Publisher] begin; INSERT INTO replica_test1 SELECT i, md5(i::text) FROM
> generate_series(1, 5000) s(i); commit;
> 
> In subscriber side, the commit_count of two records(sub1 and sub2) is
> increased.
> 
> [3]
> (Based on environment in the previous example, set TWO_PHASE=true)
> [Publisher] begin; insert into replica_test1 values(1,'1'); prepare 
> transaction
> 'id'; rollback prepared 'id';
> 
> In subscriber side, the abort_count of two records(sub1 and sub2) is
> increased.
> 
> I think the problem maybe is the patch you mentioned
> (Skip-empty-transactions-for-logical-replication.patch) is not finished yet.
> Share this information here.
Hi, thank you for your report.

Yes. As the patch's commit message mentions, the patch doesn't
cover streaming and two phase transactions.

In the above reply, I said that this was an independent issue
and we were in the middle of the modification of the behavior,
but empty transaction turned out to be harmful enough for this feature.
As far as I searched, the current logical replication sends
every transaction that is unrelated to publication specification.
It means that in a common scenario where some parts of tables are
replicated but others are not, the subscription statistics will
be buried by the updates by the empty transactions for the latter,
which damages this patch's value greatly.

Therefore, I included the description in the documentation
reported by you.

The attached v21 has a couple of other minor updates
like a modification of error message text.


Best Regards,
Takamichi Osumi



v21-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch
Description:  v21-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch


v21-0002-Extend-pg_stat_subscription_workers-to-include-g.patch
Description:  v21-0002-Extend-pg_stat_subscription_workers-to-include-g.patch


RE: Failed transaction statistics to measure the logical replication progress

2022-01-04 Thread osumi.takami...@fujitsu.com
On Friday, December 31, 2021 10:12 AM Tang, Haiying/唐 海英 
 wrote:
> On Wednesday, December 22, 2021 6:14 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > Attached the new patch v19.
> >
> 
> Thanks for your patch. I think it's better if you could add this patch to the
> commitfest.
> Here are some comments:
Thank you for your review !
I've created one entry in the next commitfest for this patch [1]

> 
> 1)
> +   commit_count bigint
> +  
> +  
> +   Number of transactions successfully applied in this subscription.
> +   Both COMMIT and COMMIT PREPARED increment this counter.
> +  
> + 
> ...
> 
> I think the commands (like COMMIT, COMMIT PREPARED ...) can be
> surrounded with " ", thoughts?
Makes sense to me. Fixed.

Note that to the user perspective,
we should write only COMMIT and COMMIT PREPARED in the documentation.
Thus, I don't list up other commands.

I wrapped ROLLBACK PREPARED for abort_count as well.
 
> 2)
> +extern void pgstat_report_subworker_xact_end(LogicalRepWorker
> *repWorker,
> +
>LogicalRepMsgType command,
> +
>bool bforce);
> 
> Should "bforce" be "force"?
Fixed the typo.

> 3)
> + *  This should be called before the call of process_syning_tables() so
> + to not
> 
> "process_syning_tables()" should be "process_syncing_tables()".
Fixed.
 
> 4)
> +void
> +pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker, bool
> +force) {
> + static TimestampTz last_report = 0;
> + PgStat_MsgSubWorkerXactEnd msg;
> +
> + if (!force)
> + {
> ...
> + if (!TimestampDifferenceExceeds(last_report, now,
> PGSTAT_STAT_INTERVAL))
> + return;
> + last_report = now;
> + }
> +
> ...
> + if (repWorker->commit_count == 0 && repWorker->abort_count ==
> 0)
> + return;
> ...
> 
> I think it's better to check commit_count and abort_count first, then check if
> reach PGSTAT_STAT_INTERVAL.
> Otherwise if commit_count and abort_count are 0, it is possible that the value
> of last_report has been updated but it didn't send stats in fact. In this 
> case,
> last_report is not the real time that send last message.
Yeah, agreed. This fix is right in terms of the variable name aspect.

The only scenario that we can take advantage of the previous implementation of
v19's pgstat_send_subworker_xact_stats() should be a case where
we execute a bunch of commit-like logical replication apply messages
within PGSTAT_STAT_INTERVAL intensively and continuously for long period,
because we check "repWorker->commit_count == 0 && repWorker->abort_count == 0"
just once before calling pgstat_send() in this case.
*But*, this scenario didn't look reasonable. In other words,
the way to call TimestampDifferenceExceeds() only if there's any need of
update for commit_count/abort_count looks less heavy.
Accordingly, I've fixed it as you suggested.
Also, I modified some comments in pgstat_send_subworker_xact_stats() for this 
change.

Kindly have a look at the v20 shared in [2].

[1] - https://commitfest.postgresql.org/37/3504/
[2] - 
https://www.postgresql.org/message-id/TYCPR01MB8373AB2AE1A6EC7B9E012519ED4A9%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2022-01-04 Thread osumi.takami...@fujitsu.com
On Monday, January 3, 2022 2:46 PM Hou, Zhijie/侯 志杰  
wrote:
> On Wednesday, December 22, 2021 6:14 PM Osumi, Takamichi
>  wrote:
> > Attached the new patch v19.
> Hi,
> 
> Thanks for updating the patch.
> 
> --- a/src/include/pgstat.h
> +++ b/src/include/pgstat.h
> @@ -15,6 +15,7 @@
>  #include "portability/instr_time.h"
>  #include "postmaster/pgarch.h"   /* for MAX_XFN_CHARS */
>  #include "replication/logicalproto.h"
> +#include "replication/worker_internal.h"
> 
> I noticed that the patch includes "worker_internal.h " in pgstat.h.
> I think it might be better to only include this file in pgstat.c.
> And it seems we can access MyLogicalRepWorker directly in the following
> functions instead of passing a parameter.
> 
> +extern void pgstat_report_subworker_xact_end(LogicalRepWorker
> *repWorker,
> +
>LogicalRepMsgType command,
> +
>bool bforce);
> +extern void pgstat_send_subworker_xact_stats(LogicalRepWorker
> *repWorker,
> +
>bool force);
Hi, thank you for your review !

Both are fixed. Additionally, I modified
related comments, the header comment of pgstat_send_subworker_xact_stats,
by the change.

One other new improvements in v20 is to have removed the 2nd argument of
pgstat_send_subworker_xact_stats(). When we called it from
commit-like operations, I passed 'false' without exceptions in v19
and noticed that could be removed.

Also, there's a really minor alignment in the source codes.
In pgstat.c, I placed pgstat_report_subworker_xact_end() after 
pgstat_report_subworker_error(),
and pgstat_send_subworker_xact_stats() after pgstat_send_subscription_purge() 
and so on
like the order of PgstatCollectorMain() and PgStat_Msg definition, because
my patch's new functions are added after the existing functions
for error handling functions of subscription workers.

Lastly, I changed the error report in pgstat_report_subworker_xact_end()
so that it can be more understandable easily and a bit modern.

Kindly have a look at the attached version.

Best Regards,
Takamichi Osumi



v20-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch
Description:  v20-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch


v20-0002-Extend-pg_stat_subscription_workers-to-include-g.patch
Description:  v20-0002-Extend-pg_stat_subscription_workers-to-include-g.patch


RE: Failed transaction statistics to measure the logical replication progress

2022-01-02 Thread houzj.f...@fujitsu.com
On Wednesday, December 22, 2021 6:14 PM Osumi, Takamichi 
 wrote:
> Attached the new patch v19.
Hi,

Thanks for updating the patch.

--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -15,6 +15,7 @@
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */
 #include "replication/logicalproto.h"
+#include "replication/worker_internal.h"

I noticed that the patch includes "worker_internal.h " in pgstat.h.
I think it might be better to only include this file in pgstat.c.

And it seems we can access MyLogicalRepWorker directly in the
following functions instead of passing a parameter.

+extern void pgstat_report_subworker_xact_end(LogicalRepWorker *repWorker,
+   
 LogicalRepMsgType command,
+   
 bool bforce);
+extern void pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker,
+   
 bool force);

Best regards,
Hou zj


RE: Failed transaction statistics to measure the logical replication progress

2021-12-30 Thread tanghy.f...@fujitsu.com
On Wednesday, December 22, 2021 6:14 PM osumi.takami...@fujitsu.com 
 wrote:
> 
> Attached the new patch v19.
> 

Thanks for your patch. I think it's better if you could add this patch to the 
commitfest.
Here are some comments:

1)
+   commit_count bigint
+  
+  
+   Number of transactions successfully applied in this subscription.
+   Both COMMIT and COMMIT PREPARED increment this counter.
+  
+ 
...

I think the commands (like COMMIT, COMMIT PREPARED ...) can be surrounded with
" ", thoughts?

2)
+extern void pgstat_report_subworker_xact_end(LogicalRepWorker *repWorker,
+   
 LogicalRepMsgType command,
+   
 bool bforce);

Should "bforce" be "force"?

3)
+ *  This should be called before the call of process_syning_tables() so to not

"process_syning_tables()" should be "process_syncing_tables()".

4)
+void
+pgstat_send_subworker_xact_stats(LogicalRepWorker *repWorker, bool force)
+{
+   static TimestampTz last_report = 0;
+   PgStat_MsgSubWorkerXactEnd msg;
+
+   if (!force)
+   {
...
+   if (!TimestampDifferenceExceeds(last_report, now, 
PGSTAT_STAT_INTERVAL))
+   return;
+   last_report = now;
+   }
+
...
+   if (repWorker->commit_count == 0 && repWorker->abort_count == 0)
+   return;
...

I think it's better to check commit_count and abort_count first, then check if
reach PGSTAT_STAT_INTERVAL.
Otherwise if commit_count and abort_count are 0, it is possible that the value
of last_report has been updated but it didn't send stats in fact. In this case,
last_report is not the real time that send last message.

Regards,
Tang


RE: Failed transaction statistics to measure the logical replication progress

2021-12-23 Thread wangw.f...@fujitsu.com
On Wednesday, December 22, 2021 10:30 PM
osumi.takami...@fujitsu.com  wrote:
> On Wednesday, December 22, 2021 8:38 PM I wrote:
> > Do we expect these commit counts which come from empty transactions ?
> This is another issue discussed in [1]
> where the patch in the thread is a work in progress, I think.
> ..
> IMHO, the conclusion is we are currently in the middle of fixing the behavior.

Thank you for telling me this.
After applying v19-* and 
v15-0001-Skip-empty-transactions-for-logical-replication.patch, 
I retested v19-* patches. The result of previous case looks good to me.

But the results of following cases are also similar to previous unexpected 
result
which increases commit_count or abort_count unexpectedly.
[1]
(Based on environment in the previous example, set TWO_PHASE=true)
[Publisher]
begin;
insert into replica_test1 values(1,'1');
prepare transaction 'id';
commit prepared 'id';

In subscriber side, the commit_count of two records(sub1 and sub2) is increased.

[2]
(Based on environment in the previous example, set STREAMING=on)
[Publisher]
begin;
INSERT INTO replica_test1 SELECT i, md5(i::text) FROM generate_series(1, 5000) 
s(i);
commit;

In subscriber side, the commit_count of two records(sub1 and sub2) is increased.

[3]
(Based on environment in the previous example, set TWO_PHASE=true)
[Publisher]
begin;
insert into replica_test1 values(1,'1');
prepare transaction 'id';
rollback prepared 'id';

In subscriber side, the abort_count of two records(sub1 and sub2) is increased.

I think the problem maybe is the patch you mentioned
(Skip-empty-transactions-for-logical-replication.patch) is not finished yet.
Share this information here.


Regards,
Wang wei


RE: Failed transaction statistics to measure the logical replication progress

2021-12-22 Thread osumi.takami...@fujitsu.com
On Wednesday, December 22, 2021 9:38 PM Wang, Wei/王 威  
wrote:
> I have a question on the v19-0002 patch:
> 
> When I tested for this patch, I found pg_stat_subscription_workers has some
> unexpected data.
> For example:
> [Publisher]
> create table replica_test1(a int, b text); create publication pub1 for table
> replica_test1; create table replica_test2(a int, b text); create publication 
> pub2
> for table replica_test2;
> 
> [Subscriber]
> create table replica_test1(a int, b text); create subscription sub1 CONNECTION
> 'dbname=postgres' publication pub1; create table replica_test2(a int, b text);
> create subscription sub2 CONNECTION 'dbname=postgres' publication pub2;
> 
> [Publisher]
> insert into replica_test1 values(1,'1');
> 
> [Subscriber]
> select * from pg_stat_subscription_workers;
> 
> -[ RECORD 1 ]--+--
> Subid | 16389
> subname   | sub1
> subrelid  |
> commit_count  | 1
> ...
> -[ RECORD 2 ]--+--
> subid | 16395
> subname   | sub2
> subrelid  |
> commit_count  | 1
> ...
> 
> I originally expected only one record for "sub1".
> 
> I think the reason is apply_handle_commit() always invoke
> pgstat_report_subworker_xact_end().
> But when we insert data to replica_test1 in publish side:
> [In the publish]
> pub1's walsender1 will send three messages((LOGICAL_REP_MSG_BEGIN,
> LOGICAL_REP_MSG_INSERT and LOGICAL_REP_MSG_COMMIT))
> to sub1's apply worker1.
> pub2's walsender2 will also send two messages(LOGICAL_REP_MSG_BEGIN
> and LOGICAL_REP_MSG_COMMIT) to sub2's apply worker2. Because
> inserted table is not published by pub2.
> 
> [In the subscription]
> sub1's apply worker1 receive LOGICAL_REP_MSG_COMMIT,
>   so invoke pgstat_report_subworker_xact_end to increase
> commit_count of sub1's stats.
> sub2's apply worker2 receive LOGICAL_REP_MSG_COMMIT,
>   it will do the same action to increase commit_count of sub2's stats.
> 
> Do we expect these commit counts which come from empty transactions ?
Hi, thank you so much for your test !


This is another issue discussed in [1]
where the patch in the thread is a work in progress, I think.

The point you reported will bring a lot of confusion for the users,
to interpret the results of the subscription stats values,
if those patches including the subscription stats patch will not get committed
together (like in the same version).

I've confirmed that HEAD applied with v19-* and
v15-0001-Skip-empty-transactions-for-logical-replication.patch
on top of v19-* showed only one record, as you expected like below,
although all patches are not finished yet.

postgres=# select * from pg_stat_subscription_workers;
-[ RECORD 1 ]--+--
subid  | 16389
subname| sub1
subrelid   | 
commit_count   | 1
abort_count| 0
error_count| 0


IMHO, the conclusion is we are currently in the middle of fixing the behavior.

[1] - 
https://www.postgresql.org/message-id/CAFPTHDbVLWxpfnwMxJcXq703gLXciXHE83hwKQ_0OTCZ6oLCjg%40mail.gmail.com


Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2021-12-22 Thread wangw.f...@fujitsu.com
On Mon, Dec 22, 2021 at 6:14 PM osumi.takami...@fujitsu.com 
 wrote:
>
> Attached the new patch v19.
>

I have a question on the v19-0002 patch:

When I tested for this patch, I found pg_stat_subscription_workers has some 
unexpected data.
For example:
[Publisher]
create table replica_test1(a int, b text); create publication pub1 for table 
replica_test1;
create table replica_test2(a int, b text); create publication pub2 for table 
replica_test2;

[Subscriber]
create table replica_test1(a int, b text); create subscription sub1 CONNECTION 
'dbname=postgres' publication pub1;
create table replica_test2(a int, b text); create subscription sub2 CONNECTION 
'dbname=postgres' publication pub2;

[Publisher]
insert into replica_test1 values(1,'1');

[Subscriber]
select * from pg_stat_subscription_workers;

-[ RECORD 1 ]--+--
Subid   | 16389
subname | sub1
subrelid|
commit_count| 1
...
-[ RECORD 2 ]--+--
subid   | 16395
subname | sub2
subrelid|
commit_count| 1
...

I originally expected only one record for "sub1". 

I think the reason is apply_handle_commit() always invoke 
pgstat_report_subworker_xact_end().
But when we insert data to replica_test1 in publish side:
[In the publish]
pub1's walsender1 will send three messages((LOGICAL_REP_MSG_BEGIN, 
LOGICAL_REP_MSG_INSERT and LOGICAL_REP_MSG_COMMIT))
to sub1's apply worker1.
pub2's walsender2 will also send two messages(LOGICAL_REP_MSG_BEGIN and 
LOGICAL_REP_MSG_COMMIT)
to sub2's apply worker2. Because inserted table is not published by pub2.

[In the subscription]
sub1's apply worker1 receive LOGICAL_REP_MSG_COMMIT,
so invoke pgstat_report_subworker_xact_end to increase commit_count of 
sub1's stats.
sub2's apply worker2 receive LOGICAL_REP_MSG_COMMIT,
it will do the same action to increase commit_count of sub2's stats.

Do we expect these commit counts which come from empty transactions ?

Regards,
Wang wei


RE: Failed transaction statistics to measure the logical replication progress

2021-12-22 Thread osumi.takami...@fujitsu.com
On Tuesday, December 21, 2021 6:00 PM Greg Nancarrow  
wrote:
> Some review comments on the v18 patches:
Thank you for your review !

> v18-0002
> 
> doc/src/sgml/monitoring.sgml
> (1) tablesync worker stats?
> 
> Shouldn't the comment below only mention the apply worker? (since we're no
> longer recording stats of the tablesync worker)
> 
> +   Number of transactions that failed to be applied by the table
> +   sync worker or main apply worker in this subscription. This
> +   counter is updated after confirming the error is not same as
> +   the previous one.
> +   
> 
> Also, it should say "... the error is not the same as the previous one."
Fixed.
 
> src/backend/catalog/system_views.sql
> (2) pgstat_report_subworker_xact_end()
> 
> Fix typo and some wording:
> 
> BEFORE:
> + *  This should be called before the call of process_syning_tables()
> + not to
> AFTER:
> + *  This should be called before the call of
> process_syncing_tables(), so to not
Fixed.

> src/backend/postmaster/pgstat.c
> (3) pgstat_send_subworker_xact_stats()
> 
> BEFORE:
> + * Send a subworker transaction stats to the collector.
> AFTER:
> + * Send a subworker's transaction stats to the collector.
Fixed.

> (4)
> Wouldn't it be best for:
> 
> +   if (!TimestampDifferenceExceeds(last_report, now,
> + PGSTAT_STAT_INTERVAL))
> 
> to be:
> 
> +   if (last_report != 0 && !TimestampDifferenceExceeds(last_report,
> now, PGSTAT_STAT_INTERVAL))
> 
> ?
I'm not sure which is better and
I never have strong objection to your idea but currently
I prefer the previous code because we don't need to
add one extra condition (last_report != 0) in the function called really 
frequently
and the optimization to avoid calling TimestampDifferenceExceeds works just once
with your change, I'd say.

We call pgstat_send_subworker_xact_stats() in the LogicalRepApplyLoop's loop.
For the apply worker, this should be the first call for normal operation,
before any call of the apply_dispatch(and subsequent commit-like functions which
calls pgstat_send_subworker_xact_stats() in the end).

In the first call, existing v18's codes (without your suggested change) just 
initializes
'last_report' variable because of the diff between 0 and 'now' and returns
because of no stats values in commit_count and abort_count in the function.
After this, 'last_report' will not be 0 for the apply worker.

On the other hand, in the case I add your change, in the first call of
pgstat_send_subworker_xact_stats(), similarly 'last_report' is initialized but
without one call of TimestampDifferenceExceeds(), which might be the 
optimization
effect and the function returns with no stats again. Here the 'last_report' will
not be 0 after this. But, then we'll have to check the condition in the apply 
worker
in the every loop. Besides, after the first setting of 'last_report',
every call of the pgstat_send_subworker_xact_stats() calculates the time 
subtraction.
This means the one skipped call of the function looks less worth in the case of 
frequent
calls of the function. So, I'm not sure it' a good idea to incorporate this 
change.

Kindly let me know if I miss something.
At present, I'll keep the code as it is.

> (5) pgstat_send_subworker_xact_stats()
> 
> I think that the comment:
> 
> +   * Clear out the statistics buffer, so it can be re-used.
> 
> should instead say:
> 
> +   * Clear out the supplied statistics.
> 
> because the current comment infers that repWorker is pointed at the
> MyLogicalRepWorker buffer (which it might be but it shouldn't be relying on
> that) Also, I think that the function header should mention something like:
> "The supplied repWorker statistics are cleared upon return, to assist re-use 
> by
> the caller."
Fixed.

Attached the new patch v19.

Best Regards,
Takamichi Osumi



v19-0002-Extend-pg_stat_subscription_workers-to-include-g.patch
Description:  v19-0002-Extend-pg_stat_subscription_workers-to-include-g.patch


v19-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch
Description:  v19-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch


Re: Failed transaction statistics to measure the logical replication progress

2021-12-21 Thread Greg Nancarrow
On Mon, Dec 20, 2021 at 8:40 PM osumi.takami...@fujitsu.com
 wrote:
>
> Updated the patch to address your concern.
>

Some review comments on the v18 patches:

v18-0002

doc/src/sgml/monitoring.sgml
(1) tablesync worker stats?

Shouldn't the comment below only mention the apply worker? (since
we're no longer recording stats of the tablesync worker)

+   Number of transactions that failed to be applied by the table
+   sync worker or main apply worker in this subscription. This
+   counter is updated after confirming the error is not same as
+   the previous one.
+   

Also, it should say "... the error is not the same as the previous one."


src/backend/catalog/system_views.sql
(2) pgstat_report_subworker_xact_end()

Fix typo and some wording:

BEFORE:
+ *  This should be called before the call of process_syning_tables() not to
AFTER:
+ *  This should be called before the call of
process_syncing_tables(), so to not


src/backend/postmaster/pgstat.c
(3) pgstat_send_subworker_xact_stats()

BEFORE:
+ * Send a subworker transaction stats to the collector.
AFTER:
+ * Send a subworker's transaction stats to the collector.

(4)
Wouldn't it be best for:

+   if (!TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))

to be:

+   if (last_report != 0 && !TimestampDifferenceExceeds(last_report,
now, PGSTAT_STAT_INTERVAL))

?

(5) pgstat_send_subworker_xact_stats()

I think that the comment:

+   * Clear out the statistics buffer, so it can be re-used.

should instead say:

+   * Clear out the supplied statistics.

because the current comment infers that repWorker is pointed at the
MyLogicalRepWorker buffer (which it might be but it shouldn't be
relying on that)
Also, I think that the function header should mention something like:
"The supplied repWorker statistics are cleared upon return, to assist
re-use by the caller."


Regards,
Greg Nancarrow
Fujitsu Australia




RE: Failed transaction statistics to measure the logical replication progress

2021-12-20 Thread osumi.takami...@fujitsu.com
Friday, December 17, 2021 2:03 PM Kyotaro Horiguchi  
wrote:
> It sends stats packets at every commit-like operation on apply workers.  The
> current pgstat is so smart that it refrain from sending stats packets at too 
> high
> frequency.  We already suffer frequent stats packets so apply workers need to
> bahave the same way.
> 
> That is, the new stats numbers are once accumulated locally then the
> accumulated numbers are sent to stats collector by pgstat_report_stat.
Hi, Horiguchi-san.

I felt your point is absolutely right !
Updated the patch to address your concern.

Best Regards,
Takamichi Osumi



v18-0002-Extend-pg_stat_subscription_workers-to-include-g.patch
Description:  v18-0002-Extend-pg_stat_subscription_workers-to-include-g.patch


v18-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch
Description:  v18-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch


Re: Failed transaction statistics to measure the logical replication progress

2021-12-16 Thread Kyotaro Horiguchi
At Thu, 16 Dec 2021 11:36:46 +, "osumi.takami...@fujitsu.com" 
 wrote in 
> On Thursday, December 16, 2021 4:00 PM I wrote:
> > Thank you, everyone for confirming the direction.
> > I'll follow the consensus of the community and fix the patch, including 
> > other
> > comments.
> > I'll treat only the stats for apply workers.
> Hi, created a new version v17 according to the recent discussion
> with changes to address other review comments.
> 
> Kindly have a look at it.

It sends stats packets at every commit-like operation on apply
workers.  The current pgstat is so smart that it refrain from sending
stats packets at too high frequency.  We already suffer frequent stats
packets so apply workers need to bahave the same way.

That is, the new stats numbers are once accumulated locally then the
accumulated numbers are sent to stats collector by pgstat_report_stat.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: Failed transaction statistics to measure the logical replication progress

2021-12-16 Thread osumi.takami...@fujitsu.com
On Thursday, December 16, 2021 8:37 PM I wrote:
> Hi, created a new version v17 according to the recent discussion with changes
> to address other review comments.
FYI, in v17 I've removed one part of commit message
about spool file statistics on the subscriber.
My intention is just to make the patches more committable shape.

Although I deleted it, I'd say still there would be some room
for the discussion of the necessity. It's because to begin with,
we are interested in the disk writes (for the logical replication,
pg_stat_replication_slots is an example), and secondly there can
be a scenario that if the user of logical replication dislikes and wants
to suppress unnecessary writes of file on the subscriber
(STREAM ABORT causes truncate of file with changes, IIUC)
they can increase the logical_decoding_work_mem on the publisher.
I'll postpone this discussion, till it becomes necessary
or will abandon this idea, if it's rejected. Anyway,
I detached the discussion by removing it from the commit message.

Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2021-12-16 Thread osumi.takami...@fujitsu.com
On Tuesday, December 14, 2021 12:45 AM vignesh C  wrote:
> Thanks for the updated patch, few comments:
> 1) Can we change this:
> /*
> +* Report the success of table sync as one commit to consolidate all
> +* transaction stats into one record.
> +*/
> +   pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
> +
>   LOGICAL_REP_MSG_COMMIT);
> +
> To:
> /* Report the success of table sync */
> +   pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
> +
>   LOGICAL_REP_MSG_COMMIT);
> +
This function call that the table sync worker reports
an update of stats has been removed according to the recent discussion.


> 2) Typo: ealier should be earlier
> +   /*
> +* Report ealier than the call of process_syncing_tables() not
> to miss an
> +* increment of commit_count in case it leads to the process exit. See
> +* process_syncing_tables_for_apply().
> +*/
> +   pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
> +
>   LOGICAL_REP_MSG_COMMIT);
> +
Thanks ! Fixed.

 
> 3) Should we add an Assert for subwentry:
> +   /*
> +* If this is a new error reported by table sync worker,
> consolidate this
> +* error count into the entry of apply worker, by swapping the stats
> +* entries.
> +*/
> +   if (OidIsValid(msg->m_subrelid))
> +   subwentry = pgstat_get_subworker_entry(dbentry,
> + msg->m_subid,
> +
> InvalidOid, true);
> +   subwentry->error_count++;
The latest implementation doesn't require
the call of pgstat_get_subworker_entry().
So, I skipped.

> 4) Can we slightly change it to :We can change it:
> +# Check the update of stats counters.
> +confirm_transaction_stats_update(
> +   $node_subscriber,
> +   'commit_count = 1',
> +   'the commit_count increment by table sync');
> +
> +confirm_transaction_stats_update(
> +   $node_subscriber,
> +   'error_count = 1',
> +   'the error_count increment by table sync');
> to:
> +# Check updation of subscription worker transaction count statistics.
> +confirm_transaction_stats_update(
> +   $node_subscriber,
> +   'commit_count = 1',
> +   'check table sync worker commit count is updated');
> +
> +confirm_transaction_stats_update(
> +   $node_subscriber,
> +   'error_count = 1',
> +   'check table sync worker error count is updated');
I've removed the corresponding tests for table sync workers in the patch. 

But, I adopted the comment suggestion partly for the tests of the apply worker.
On the other hand, I didn't fix the 3rd arguments of 
confirm_transaction_stats_update().
It needs to be a noun, because it's connected to another string
"Timed out while waiting for ". in the function.
See the definition of the function.

The new patch v17 is shared in [1].

[1] - 
https://www.postgresql.org/message-id/TYCPR01MB83734A7A0596AC7ADB0DCB51ED779%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2021-12-16 Thread osumi.takami...@fujitsu.com
On Monday, December 13, 2021 6:19 PM Amit Kapila  
wrote:
> On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> 
> Few questions and comments:
> 
> 1.
> The pg_stat_subscription_workers view will
> contain
> one row per subscription worker on which errors have occurred, for workers
> applying logical replication changes and workers handling the initial data
> -   copy of the subscribed tables.  The statistics entry is removed when the
> -   corresponding subscription is dropped.
> +   copy of the subscribed tables. Also, the row corresponding to the apply
> +   worker shows all transaction statistics of both types of workers on the
> +   subscription. The statistics entry is removed when the corresponding
> +   subscription is dropped.
> 
> Why did you choose to show stats for both types of workers in one row?
Now, the added stats show only the statistics of apply worker
as we agreed.


> 2.
> + PGSTAT_MTYPE_SUBWORKERXACTEND,
>  } StatMsgType;
> 
> I don't think we comma with the last message type.
Fixed.

 
> 3.
> + Oid m_subrelid;
> +
> + /* necessary to determine column to increment */ LogicalRepMsgType
> + m_command;
> +
> +} PgStat_MsgSubWorkerXactEnd;
> 
> Is m_subrelid used in this patch? If not, why did you keep it? I think if you
> choose to show separate stats for table sync and apply worker then probably it
> will be used.
Removed.


> 4.
>   /*
> + * Cumulative transaction statistics of subscription worker */
> + PgStat_Counter commit_count; PgStat_Counter error_count;
> + PgStat_Counter abort_count;
> +
> 
> I think it is better to keep the order of columns as commit_count, 
> abort_count,
> error_count in the entire patch.
Fixed.


The new patch is shared in [1].

[1] - 
https://www.postgresql.org/message-id/TYCPR01MB83734A7A0596AC7ADB0DCB51ED779%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2021-12-16 Thread osumi.takami...@fujitsu.com
On Thursday, December 16, 2021 4:00 PM I wrote:
> Thank you, everyone for confirming the direction.
> I'll follow the consensus of the community and fix the patch, including other
> comments.
> I'll treat only the stats for apply workers.
Hi, created a new version v17 according to the recent discussion
with changes to address other review comments.

Kindly have a look at it.

Best Regards,
Takamichi Osumi



v17-0002-Extend-pg_stat_subscription_workers-to-include-g.patch
Description:  v17-0002-Extend-pg_stat_subscription_workers-to-include-g.patch


v17-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch
Description:  v17-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch


Re: Failed transaction statistics to measure the logical replication progress

2021-12-15 Thread Greg Nancarrow
On Tue, Dec 14, 2021 at 1:28 PM Amit Kapila  wrote:
>
> If we fear a large number of entries for such workers then won't it be
> better to show the value of these stats only for apply workers. I
> think normally the table sync workers perform only copy operation or
> maybe a fixed number of xacts, so, one might not be interested in the
> transaction stats of these workers. I find merging only specific stats
> of two different types of workers confusing.
>
> What do others think about this?
>

I think it might be OK to NOT include the transaction stats of the
tablesync workers, but my understanding (and slight concern) is that
currently there is potentially some overlap in the work done by the
tablesync and apply workers - perhaps the small patch (see [1]) proposed by
Peter Smith could also be considered, in order to make that distinction of
work clearer, and the stats more meaningful?


[1]
https://www.postgresql.org/message-id/flat/cahut+pt39pbqs0sxt9rmm89ayizoq0kw46yzskkzwk8z5ho...@mail.gmail.com


Regards,
Greg Nancarrow
Fujitsu Australia


RE: Failed transaction statistics to measure the logical replication progress

2021-12-15 Thread osumi.takami...@fujitsu.com
On Wednesday, December 15, 2021 9:52 PM vignesh C  wrote:
> On Tue, Dec 14, 2021 at 7:58 AM Amit Kapila 
> wrote:
> >
> > On Mon, Dec 13, 2021 at 5:48 PM osumi.takami...@fujitsu.com
> >  wrote:
> > >
> > > On Monday, December 13, 2021 6:19 PM Amit Kapila
>  wrote:
> > > > On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com
> > > >  wrote:
> > > >
> > > > Few questions and comments:
> > > Thank you for your comments !
> > >
> > > > 
> > > > 1.
> > > > The pg_stat_subscription_workers view
> > > > will contain
> > > > one row per subscription worker on which errors have occurred, for
> workers
> > > > applying logical replication changes and workers handling the 
> > > > initial
> data
> > > > -   copy of the subscribed tables.  The statistics entry is removed
> when the
> > > > -   corresponding subscription is dropped.
> > > > +   copy of the subscribed tables. Also, the row corresponding to the
> apply
> > > > +   worker shows all transaction statistics of both types of workers on
> the
> > > > +   subscription. The statistics entry is removed when the
> corresponding
> > > > +   subscription is dropped.
> > > >
> > > > Why did you choose to show stats for both types of workers in one row?
> > > This is because if we have hundreds or thousands of tables for table
> > > sync, we need to create many entries to cover them and store the entries 
> > > for
> all tables.
> > >
> >
> > If we fear a large number of entries for such workers then won't it be
> > better to show the value of these stats only for apply workers. I
> > think normally the table sync workers perform only copy operation or
> > maybe a fixed number of xacts, so, one might not be interested in the
> > transaction stats of these workers. I find merging only specific stats
> > of two different types of workers confusing.
> >
> > What do others think about this?
> 
> We can remove the table sync workers transaction stats count to avoid
> confusion, take care of the documentation changes too accordingly.
Hi, apologies for my late reply.

Thank you, everyone for confirming the direction.
I'll follow the consensus of the community
and fix the patch, including other comments.
I'll treat only the stats for apply workers.


Best Regards,
Takamichi Osumi




Re: Failed transaction statistics to measure the logical replication progress

2021-12-15 Thread vignesh C
On Tue, Dec 14, 2021 at 7:58 AM Amit Kapila  wrote:
>
> On Mon, Dec 13, 2021 at 5:48 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Monday, December 13, 2021 6:19 PM Amit Kapila  
> > wrote:
> > > On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com
> > >  wrote:
> > >
> > > Few questions and comments:
> > Thank you for your comments !
> >
> > > 
> > > 1.
> > > The pg_stat_subscription_workers view will
> > > contain
> > > one row per subscription worker on which errors have occurred, for 
> > > workers
> > > applying logical replication changes and workers handling the initial 
> > > data
> > > -   copy of the subscribed tables.  The statistics entry is removed when 
> > > the
> > > -   corresponding subscription is dropped.
> > > +   copy of the subscribed tables. Also, the row corresponding to the 
> > > apply
> > > +   worker shows all transaction statistics of both types of workers on 
> > > the
> > > +   subscription. The statistics entry is removed when the corresponding
> > > +   subscription is dropped.
> > >
> > > Why did you choose to show stats for both types of workers in one row?
> > This is because if we have hundreds or thousands of tables for table sync,
> > we need to create many entries to cover them and store the entries for all 
> > tables.
> >
>
> If we fear a large number of entries for such workers then won't it be
> better to show the value of these stats only for apply workers. I
> think normally the table sync workers perform only copy operation or
> maybe a fixed number of xacts, so, one might not be interested in the
> transaction stats of these workers. I find merging only specific stats
> of two different types of workers confusing.
>
> What do others think about this?

We can remove the table sync workers transaction stats count to avoid
confusion, take care of the documentation changes too accordingly.

Regards,
Vignesh




Re: Failed transaction statistics to measure the logical replication progress

2021-12-14 Thread Masahiko Sawada
On Tue, Dec 14, 2021 at 11:28 AM Amit Kapila  wrote:
>
> On Mon, Dec 13, 2021 at 5:48 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Monday, December 13, 2021 6:19 PM Amit Kapila  
> > wrote:
> > > On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com
> > >  wrote:
> > >
> > > Few questions and comments:
> > Thank you for your comments !
> >
> > > 
> > > 1.
> > > The pg_stat_subscription_workers view will
> > > contain
> > > one row per subscription worker on which errors have occurred, for 
> > > workers
> > > applying logical replication changes and workers handling the initial 
> > > data
> > > -   copy of the subscribed tables.  The statistics entry is removed when 
> > > the
> > > -   corresponding subscription is dropped.
> > > +   copy of the subscribed tables. Also, the row corresponding to the 
> > > apply
> > > +   worker shows all transaction statistics of both types of workers on 
> > > the
> > > +   subscription. The statistics entry is removed when the corresponding
> > > +   subscription is dropped.
> > >
> > > Why did you choose to show stats for both types of workers in one row?
> > This is because if we have hundreds or thousands of tables for table sync,
> > we need to create many entries to cover them and store the entries for all 
> > tables.
> >
>
> If we fear a large number of entries for such workers then won't it be
> better to show the value of these stats only for apply workers. I
> think normally the table sync workers perform only copy operation or
> maybe a fixed number of xacts, so, one might not be interested in the
> transaction stats of these workers. I find merging only specific stats
> of two different types of workers confusing.
>
> What do others think about this?

I understand the concern to have a large number of entries but I agree
that merging only specific stats would confuse users. As Amit
suggested, it'd be better to show only apply workers' transaction
stats.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




RE: Failed transaction statistics to measure the logical replication progress

2021-12-14 Thread houzj.f...@fujitsu.com
On Tues, Dec 14, 2021 10:28 AM Amit Kapila  wrote:
> On Mon, Dec 13, 2021 at 5:48 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Monday, December 13, 2021 6:19 PM Amit Kapila
>  wrote:
> > > On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com
> > >  wrote:
> > >
> > > Few questions and comments:
> > Thank you for your comments !
> >
> > > 
> > > 1.
> > > one row per subscription worker on which errors have occurred, for 
> > > workers
> > > applying logical replication changes and workers handling the initial 
> > > data
> > > -   copy of the subscribed tables.  The statistics entry is removed when 
> > > the
> > > -   corresponding subscription is dropped.
> > > +   copy of the subscribed tables. Also, the row corresponding to the 
> > > apply
> > > +   worker shows all transaction statistics of both types of workers on 
> > > the
> > > +   subscription. The statistics entry is removed when the corresponding
> > > +   subscription is dropped.
> > >
> > > Why did you choose to show stats for both types of workers in one row?
> > This is because if we have hundreds or thousands of tables for table sync,
> > we need to create many entries to cover them and store the entries for all
> > tables.
> >
> 
> If we fear a large number of entries for such workers then won't it be
> better to show the value of these stats for apply workers. I
> think normally the table sync workers perform only copy operation or
> maybe a fixed number of xacts, so, one might not be interested in the
> transaction stats of these workers. I find merging only specific stats
> of two different types of workers confusing.
> 
> What do others think about this?

Personally, I agreed that merging two types of stats into one row might not be
a good idea. And the xact stats of table sync workers are usually less
interesting than the apply worker's, So, it's seems acceptable to me if we
show stats only for apply workers and document about this.

Best regards,
Hou zj


Re: Failed transaction statistics to measure the logical replication progress

2021-12-13 Thread Amit Kapila
On Mon, Dec 13, 2021 at 5:48 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Monday, December 13, 2021 6:19 PM Amit Kapila  
> wrote:
> > On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com
> >  wrote:
> >
> > Few questions and comments:
> Thank you for your comments !
>
> > 
> > 1.
> > The pg_stat_subscription_workers view will
> > contain
> > one row per subscription worker on which errors have occurred, for 
> > workers
> > applying logical replication changes and workers handling the initial 
> > data
> > -   copy of the subscribed tables.  The statistics entry is removed when the
> > -   corresponding subscription is dropped.
> > +   copy of the subscribed tables. Also, the row corresponding to the apply
> > +   worker shows all transaction statistics of both types of workers on the
> > +   subscription. The statistics entry is removed when the corresponding
> > +   subscription is dropped.
> >
> > Why did you choose to show stats for both types of workers in one row?
> This is because if we have hundreds or thousands of tables for table sync,
> we need to create many entries to cover them and store the entries for all 
> tables.
>

If we fear a large number of entries for such workers then won't it be
better to show the value of these stats only for apply workers. I
think normally the table sync workers perform only copy operation or
maybe a fixed number of xacts, so, one might not be interested in the
transaction stats of these workers. I find merging only specific stats
of two different types of workers confusing.

What do others think about this?

-- 
With Regards,
Amit Kapila.




Re: Failed transaction statistics to measure the logical replication progress

2021-12-13 Thread vignesh C
On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Monday, December 6, 2021 11:27 PM vignesh C  wrote:
> > Thanks for the updated patch, few comments:
> Thank you for your review !
>
> > 1) We can keep the documentation similar to mention the count includes both
> > table sync worker / main apply worker in case of commit_count/error_count
> > and abort_count to keep it consistent.
> > +   commit_count bigint
> > +  
> > +  
> > +   Number of transactions successfully applied in this subscription.
> > +   COMMIT and COMMIT PREPARED increments this counter.
> > +  
> > + 
> > +
> > + 
> > +  
> > +   error_count bigint
> > +  
> > +  
> > +   Number of transactions that failed to be applied by the table
> > +   sync worker or main apply worker in this subscription.
> > +   
> > + 
> > +
> > + 
> > +  
> > +   abort_count bigint
> > +  
> > +  
> > +   Number of transactions aborted in this subscription.
> > +   ROLLBACK PREPARED increments this counter.
> > +  
> > + 
> Yeah, you are right. Fixed.
> Note that abort_count is not used by table sync worker.
>
>
> > 2) Can this be changed:
> > +   /*
> > +* If this is a new error reported by table sync worker,
> > consolidate this
> > +* error count into the entry of apply worker.
> > +*/
> > +   if (OidIsValid(msg->m_subrelid))
> > +   {
> > +   /* Gain the apply worker stats */
> > +   subwentry = pgstat_get_subworker_entry(dbentry,
> > + msg->m_subid,
> > +
> > InvalidOid, true);
> > +   subwentry->error_count++;
> > +   }
> > +   else
> > +   subwentry->error_count++;   /* increment the apply
> > worker's counter. */
> > To:
> > +   /*
> > +* If this is a new error reported by table sync worker,
> > consolidate this
> > +* error count into the entry of apply worker.
> > +*/
> > +   if (OidIsValid(msg->m_subrelid))
> > +   /* Gain the apply worker stats */
> > +   subwentry = pgstat_get_subworker_entry(dbentry,
> > + msg->m_subid,
> > +
> > InvalidOid, true);
> > +
> > +   subwentry->error_count++;   /* increment the apply
> > worker's counter. */
> Your suggestion looks better.
> Also, I fixed some comments of this part
> so that we don't need to add a separate comment at the bottom
> for the increment of the apply worker.
>
>
> > 3) Since both 026_worker_stats and 027_worker_xact_stats.pl are testing
> > pg_stat_subscription_workers, can we move the tests to 026_worker_stats.pl.
> > If possible the error_count validation can be combined with the existing 
> > tests.
> > diff --git a/src/test/subscription/t/027_worker_xact_stats.pl
> > b/src/test/subscription/t/027_worker_xact_stats.pl
> > new file mode 100644
> > index 000..31dbea1
> > --- /dev/null
> > +++ b/src/test/subscription/t/027_worker_xact_stats.pl
> > @@ -0,0 +1,162 @@
> > +
> > +# Copyright (c) 2021, PostgreSQL Global Development Group
> > +
> > +# Tests for subscription worker statistics during apply.
> > +use strict;
> > +use warnings;
> > +use PostgreSQL::Test::Cluster;
> > +use PostgreSQL::Test::Utils;
> > +use Test::More tests => 1;
> > +
> > +# Create publisher node
> Right. I've integrated my tests with 026_worker_stats.pl.
> I think error_count validations are combined as you suggested.
> Another change I did is to introduce one function
> to contribute to better readability of the stats tests.
>
> Here, the 026_worker_stats.pl didn't look aligned by
> pgperltidy. This is not a serious issue at all.
> Yet, when I ran pgperltidy, the existing codes
> that required adjustments came into my patch.
> Therefore, I made a separate part for this.

Thanks for the updated patch, few comments:
1) Can we change this:
/*
+* Report the success of table sync as one commit to consolidate all
+* transaction stats into one record.
+*/
+   pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
+
  LOGICAL_REP_MSG_COMMIT);
+
To:
/* Report the success of table sync */
+   pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
+
  LOGICAL_REP_MSG_COMMIT);
+

2) Typo: ealier should be earlier
+   /*
+* Report ealier than the call of process_syncing_tables() not
to miss an
+* increment of commit_count in case it leads to the process exit. See
+* process_syncing_tables_for_apply().
+*/
+   pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
+
  LOGICAL_REP_MSG_COMMIT);
+

3) Should we add an Assert for subwentry:
+   /*
+* If this is a new error reported by table sync worker,
consolidate this
+* error count into the entry of apply worker, by swapping the stats
+* entries.
+*/
+   if (OidIsValid(msg->m_subrelid))
+   subwentry = 

RE: Failed transaction statistics to measure the logical replication progress

2021-12-13 Thread osumi.takami...@fujitsu.com
On Monday, December 13, 2021 6:19 PM Amit Kapila  
wrote:
> On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com
>  wrote:
> 
> Few questions and comments:
Thank you for your comments !

> 
> 1.
> The pg_stat_subscription_workers view will
> contain
> one row per subscription worker on which errors have occurred, for workers
> applying logical replication changes and workers handling the initial data
> -   copy of the subscribed tables.  The statistics entry is removed when the
> -   corresponding subscription is dropped.
> +   copy of the subscribed tables. Also, the row corresponding to the apply
> +   worker shows all transaction statistics of both types of workers on the
> +   subscription. The statistics entry is removed when the corresponding
> +   subscription is dropped.
> 
> Why did you choose to show stats for both types of workers in one row?
This is because if we have hundreds or thousands of tables for table sync,
we need to create many entries to cover them and store the entries for all 
tables.


> 2.
> + PGSTAT_MTYPE_SUBWORKERXACTEND,
>  } StatMsgType;
> 
> I don't think we comma with the last message type.
> 4.
>   /*
> + * Cumulative transaction statistics of subscription worker */
> + PgStat_Counter commit_count; PgStat_Counter error_count;
> + PgStat_Counter abort_count;
> +
> 
> I think it is better to keep the order of columns as commit_count, 
> abort_count,
> error_count in the entire patch.
Okay, I'll fix both points in the next version.

 
> 3.
> + Oid m_subrelid;
> +
> + /* necessary to determine column to increment */ LogicalRepMsgType
> + m_command;
> +
> +} PgStat_MsgSubWorkerXactEnd;
> 
> Is m_subrelid used in this patch? If not, why did you keep it?
Absolutely, this was a mistake when I took the decision to merge both stats
of table sync and apply worker.

> I think if you choose
> to show separate stats for table sync and apply worker then probably it will 
> be
> used.
Yeah, I'll fix this. Of course, after I could confirm that the idea for merging 
the
two types of workers stats was acceptable for you and others.

Best Regards,
Takamichi Osumi




Re: Failed transaction statistics to measure the logical replication progress

2021-12-13 Thread Amit Kapila
On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com
 wrote:
>

Few questions and comments:

1.
The pg_stat_subscription_workers view will contain
one row per subscription worker on which errors have occurred, for workers
applying logical replication changes and workers handling the initial data
-   copy of the subscribed tables.  The statistics entry is removed when the
-   corresponding subscription is dropped.
+   copy of the subscribed tables. Also, the row corresponding to the apply
+   worker shows all transaction statistics of both types of workers on the
+   subscription. The statistics entry is removed when the corresponding
+   subscription is dropped.

Why did you choose to show stats for both types of workers in one row?

2.
+ PGSTAT_MTYPE_SUBWORKERXACTEND,
 } StatMsgType;

I don't think we comma with the last message type.

3.
+ Oid m_subrelid;
+
+ /* necessary to determine column to increment */
+ LogicalRepMsgType m_command;
+
+} PgStat_MsgSubWorkerXactEnd;

Is m_subrelid used in this patch? If not, why did you keep it? I think
if you choose to show separate stats for table sync and apply worker
then probably it will be used.

4.
  /*
+ * Cumulative transaction statistics of subscription worker
+ */
+ PgStat_Counter commit_count;
+ PgStat_Counter error_count;
+ PgStat_Counter abort_count;
+

I think it is better to keep the order of columns as commit_count,
abort_count, error_count in the entire patch.

-- 
With Regards,
Amit Kapila.




RE: Failed transaction statistics to measure the logical replication progress

2021-12-07 Thread osumi.takami...@fujitsu.com
On Monday, December 6, 2021 11:27 PM vignesh C  wrote:
> Thanks for the updated patch, few comments:
Thank you for your review !

> 1) We can keep the documentation similar to mention the count includes both
> table sync worker / main apply worker in case of commit_count/error_count
> and abort_count to keep it consistent.
> +   commit_count bigint
> +  
> +  
> +   Number of transactions successfully applied in this subscription.
> +   COMMIT and COMMIT PREPARED increments this counter.
> +  
> + 
> +
> + 
> +  
> +   error_count bigint
> +  
> +  
> +   Number of transactions that failed to be applied by the table
> +   sync worker or main apply worker in this subscription.
> +   
> + 
> +
> + 
> +  
> +   abort_count bigint
> +  
> +  
> +   Number of transactions aborted in this subscription.
> +   ROLLBACK PREPARED increments this counter.
> +  
> + 
Yeah, you are right. Fixed.
Note that abort_count is not used by table sync worker.

 
> 2) Can this be changed:
> +   /*
> +* If this is a new error reported by table sync worker,
> consolidate this
> +* error count into the entry of apply worker.
> +*/
> +   if (OidIsValid(msg->m_subrelid))
> +   {
> +   /* Gain the apply worker stats */
> +   subwentry = pgstat_get_subworker_entry(dbentry,
> + msg->m_subid,
> +
> InvalidOid, true);
> +   subwentry->error_count++;
> +   }
> +   else
> +   subwentry->error_count++;   /* increment the apply
> worker's counter. */
> To:
> +   /*
> +* If this is a new error reported by table sync worker,
> consolidate this
> +* error count into the entry of apply worker.
> +*/
> +   if (OidIsValid(msg->m_subrelid))
> +   /* Gain the apply worker stats */
> +   subwentry = pgstat_get_subworker_entry(dbentry,
> + msg->m_subid,
> +
> InvalidOid, true);
> +
> +   subwentry->error_count++;   /* increment the apply
> worker's counter. */
Your suggestion looks better.
Also, I fixed some comments of this part
so that we don't need to add a separate comment at the bottom
for the increment of the apply worker.

 
> 3) Since both 026_worker_stats and 027_worker_xact_stats.pl are testing
> pg_stat_subscription_workers, can we move the tests to 026_worker_stats.pl.
> If possible the error_count validation can be combined with the existing 
> tests.
> diff --git a/src/test/subscription/t/027_worker_xact_stats.pl
> b/src/test/subscription/t/027_worker_xact_stats.pl
> new file mode 100644
> index 000..31dbea1
> --- /dev/null
> +++ b/src/test/subscription/t/027_worker_xact_stats.pl
> @@ -0,0 +1,162 @@
> +
> +# Copyright (c) 2021, PostgreSQL Global Development Group
> +
> +# Tests for subscription worker statistics during apply.
> +use strict;
> +use warnings;
> +use PostgreSQL::Test::Cluster;
> +use PostgreSQL::Test::Utils;
> +use Test::More tests => 1;
> +
> +# Create publisher node
Right. I've integrated my tests with 026_worker_stats.pl.
I think error_count validations are combined as you suggested.
Another change I did is to introduce one function
to contribute to better readability of the stats tests.

Here, the 026_worker_stats.pl didn't look aligned by
pgperltidy. This is not a serious issue at all.
Yet, when I ran pgperltidy, the existing codes
that required adjustments came into my patch.
Therefore, I made a separate part for this.

Best Regards,
Takamichi Osumi



v16-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch
Description:  v16-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch


v16-0002-Extend-pg_stat_subscription_workers-to-include-g.patch
Description:  v16-0002-Extend-pg_stat_subscription_workers-to-include-g.patch


Re: Failed transaction statistics to measure the logical replication progress

2021-12-06 Thread vignesh C
On Sat, Dec 4, 2021 at 6:32 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Friday, December 3, 2021 3:12 PM vignesh C  wrote:
> > Thanks for the updated patch.
> > Currently we are storing the commit count, error_count and abort_count for
> > each table of the table sync operation. If we have thousands of tables, we 
> > will
> > be storing the information for each of the tables.
> > Shouldn't we be storing the consolidated information in this case.
> > diff --git a/src/backend/replication/logical/tablesync.c
> > b/src/backend/replication/logical/tablesync.c
> > index f07983a..02e9486 100644
> > --- a/src/backend/replication/logical/tablesync.c
> > +++ b/src/backend/replication/logical/tablesync.c
> > @@ -1149,6 +1149,11 @@ copy_table_done:
> > MyLogicalRepWorker->relstate_lsn = *origin_startpos;
> > SpinLockRelease(>relmutex);
> >
> > +   /* Report the success of table sync. */
> > +   pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
> > +
> >   MyLogicalRepWorker->relid,
> > +
> >   0 /* no logical message type */ );
> Okay.
>
> I united all stats into that of apply worker.
> In line with this change, I fixed the TAP tests as well
> to cover the updates of stats done by table sync workers.
>
> Also, during my self-review, I noticed that
> I should call pgstat_report_subworker_xact_end() before
> process_syncing_tables() because it can lead to process
> exit, which results in missing one increment of the stats columns.
> I noted this point in a comment as well.

Thanks for the updated patch, few comments:
1) We can keep the documentation similar to mention the count includes
both table sync worker / main apply worker in case of
commit_count/error_count and abort_count to keep it consistent.
+   commit_count bigint
+  
+  
+   Number of transactions successfully applied in this subscription.
+   COMMIT and COMMIT PREPARED increments this counter.
+  
+ 
+
+ 
+  
+   error_count bigint
+  
+  
+   Number of transactions that failed to be applied by the table
+   sync worker or main apply worker in this subscription.
+   
+ 
+
+ 
+  
+   abort_count bigint
+  
+  
+   Number of transactions aborted in this subscription.
+   ROLLBACK PREPARED increments this counter.
+  
+ 

2) Can this be changed:
+   /*
+* If this is a new error reported by table sync worker,
consolidate this
+* error count into the entry of apply worker.
+*/
+   if (OidIsValid(msg->m_subrelid))
+   {
+   /* Gain the apply worker stats */
+   subwentry = pgstat_get_subworker_entry(dbentry, msg->m_subid,
+
InvalidOid, true);
+   subwentry->error_count++;
+   }
+   else
+   subwentry->error_count++;   /* increment the apply
worker's counter. */
To:
+   /*
+* If this is a new error reported by table sync worker,
consolidate this
+* error count into the entry of apply worker.
+*/
+   if (OidIsValid(msg->m_subrelid))
+   /* Gain the apply worker stats */
+   subwentry = pgstat_get_subworker_entry(dbentry, msg->m_subid,
+
InvalidOid, true);
+
+   subwentry->error_count++;   /* increment the apply
worker's counter. */

3) Since both 026_worker_stats and 027_worker_xact_stats.pl are
testing pg_stat_subscription_workers, can we move the tests to
026_worker_stats.pl. If possible the error_count validation can be
combined with the existing tests.
diff --git a/src/test/subscription/t/027_worker_xact_stats.pl
b/src/test/subscription/t/027_worker_xact_stats.pl
new file mode 100644
index 000..31dbea1
--- /dev/null
+++ b/src/test/subscription/t/027_worker_xact_stats.pl
@@ -0,0 +1,162 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Tests for subscription worker statistics during apply.
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More tests => 1;
+
+# Create publisher node

Regards,
Vignesh




RE: Failed transaction statistics to measure the logical replication progress

2021-12-04 Thread osumi.takami...@fujitsu.com
On Friday, December 3, 2021 3:12 PM vignesh C  wrote:
> Thanks for the updated patch.
> Currently we are storing the commit count, error_count and abort_count for
> each table of the table sync operation. If we have thousands of tables, we 
> will
> be storing the information for each of the tables.
> Shouldn't we be storing the consolidated information in this case.
> diff --git a/src/backend/replication/logical/tablesync.c
> b/src/backend/replication/logical/tablesync.c
> index f07983a..02e9486 100644
> --- a/src/backend/replication/logical/tablesync.c
> +++ b/src/backend/replication/logical/tablesync.c
> @@ -1149,6 +1149,11 @@ copy_table_done:
> MyLogicalRepWorker->relstate_lsn = *origin_startpos;
> SpinLockRelease(>relmutex);
> 
> +   /* Report the success of table sync. */
> +   pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
> +
>   MyLogicalRepWorker->relid,
> +
>   0 /* no logical message type */ );
Okay.

I united all stats into that of apply worker.
In line with this change, I fixed the TAP tests as well
to cover the updates of stats done by table sync workers.

Also, during my self-review, I noticed that
I should call pgstat_report_subworker_xact_end() before
process_syncing_tables() because it can lead to process
exit, which results in missing one increment of the stats columns.
I noted this point in a comment as well.

Best Regards,
Takamichi Osumi



v15-0001-Extend-pg_stat_subscription_workers-to-include-g.patch
Description:  v15-0001-Extend-pg_stat_subscription_workers-to-include-g.patch


Re: Failed transaction statistics to measure the logical replication progress

2021-12-02 Thread vignesh C
On Wed, Dec 1, 2021 at 3:04 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Friday, November 19, 2021 11:11 PM Masahiko Sawada  
> wrote:
> > Besides that, I’m not sure how useful commit_bytes, abort_bytes, and
> > error_bytes are. I originally thought these statistics track the size of 
> > received
> > data, i.g., how much data is transferred from the publisher and processed on
> > the subscriber. But what the view currently has is how much memory is used 
> > in
> > the subscription worker. The subscription worker emulates
> > ReorderBufferChangeSize() on the subscriber side but, as the comment of
> > update_apply_change_size() mentions, the size in the view is not accurate:
> ...
> > I guess that the purpose of these values is to compare them to total_bytes,
> > stream_byte, and spill_bytes but if the calculation is not accurate, does 
> > it mean
> > that the more stats are updated, the more the stats will be getting 
> > inaccurate?
> Thanks for your comment !
>
> I tried to solve your concerns about byte columns but there are really 
> difficult issues to solve.
> For example, to begin with the messages of apply worker are different from 
> those of
> reorder buffer.
>
> Therefore, I decided to split the previous patch and make counter columns go 
> first.
> v14 was checked by pgperltidy and pgindent.
>
> This patch can be applied to the PG whose commit id is after 8d74fc9 
> (introduction of
> pg_stat_subscription_workers).

Thanks for the updated patch.
Currently we are storing the commit count, error_count and abort_count
for each table of the table sync operation. If we have thousands of
tables, we will be storing the information for each of the tables.
Shouldn't we be storing the consolidated information in this case.
diff --git a/src/backend/replication/logical/tablesync.c
b/src/backend/replication/logical/tablesync.c
index f07983a..02e9486 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1149,6 +1149,11 @@ copy_table_done:
MyLogicalRepWorker->relstate_lsn = *origin_startpos;
SpinLockRelease(>relmutex);

+   /* Report the success of table sync. */
+   pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
+
  MyLogicalRepWorker->relid,
+
  0 /* no logical message type */ );

postgres=# select * from pg_stat_subscription_workers ;
 subid | subname | subrelid | commit_count | error_count | abort_count
| last_error_relid | last_error_command | last_error_xid |
last_error_count | last_error_message | last_error_time
---+-+--+--+-+-+--+++--++-
 16411 | sub1|16387 |1 |   0 |   0
|  |||
   0 ||
 16411 | sub1|16396 |1 |   0 |   0
|  |||
   0 ||
 16411 | sub1|16390 |1 |   0 |   0
|  |||
   0 ||
 16411 | sub1|16393 |1 |   0 |   0
|  |||
   0 ||
 16411 | sub1|16402 |1 |   0 |   0
|  |||
   0 ||
 16411 | sub1|16408 |1 |   0 |   0
|  |||
   0 ||
 16411 | sub1|16384 |1 |   0 |   0
|  |||
   0 ||
 16411 | sub1|16399 |1 |   0 |   0
|  |||
   0 ||
 16411 | sub1|16405 |1 |   0 |   0
|  |||
   0 ||
(9 rows)

Regards,
Vignesh




RE: Failed transaction statistics to measure the logical replication progress

2021-12-01 Thread osumi.takami...@fujitsu.com
On Friday, November 19, 2021 11:11 PM Masahiko Sawada  
wrote:
> Besides that, I’m not sure how useful commit_bytes, abort_bytes, and
> error_bytes are. I originally thought these statistics track the size of 
> received
> data, i.g., how much data is transferred from the publisher and processed on
> the subscriber. But what the view currently has is how much memory is used in
> the subscription worker. The subscription worker emulates
> ReorderBufferChangeSize() on the subscriber side but, as the comment of
> update_apply_change_size() mentions, the size in the view is not accurate:
...
> I guess that the purpose of these values is to compare them to total_bytes,
> stream_byte, and spill_bytes but if the calculation is not accurate, does it 
> mean
> that the more stats are updated, the more the stats will be getting 
> inaccurate?
Thanks for your comment !

I tried to solve your concerns about byte columns but there are really 
difficult issues to solve.
For example, to begin with the messages of apply worker are different from 
those of
reorder buffer.
 
Therefore, I decided to split the previous patch and make counter columns go 
first.
v14 was checked by pgperltidy and pgindent.

This patch can be applied to the PG whose commit id is after 8d74fc9 
(introduction of
pg_stat_subscription_workers).

Best Regards,
Takamichi Osumi



v14-0001-Extend-pg_stat_subscription_workers-to-include-g.patch
Description:  v14-0001-Extend-pg_stat_subscription_workers-to-include-g.patch


Re: Failed transaction statistics to measure the logical replication progress

2021-11-23 Thread Masahiko Sawada
On Tue, Nov 23, 2021 at 3:21 PM Greg Nancarrow  wrote:
>
> On Sat, Nov 20, 2021 at 1:11 AM Masahiko Sawada  wrote:
> >
> > I'm concerned that these new names will introduce confusion; if we
> > have last_error_relid, last_error_command, last_error_message,
> > last_error_time, and last_error_xid, I think users might think that
> > first_error_time is the timestamp at which an error occurred for the
> > first time in the subscription worker.
>
> You mean you think users might think "first_error_time" is the
> timestamp at which the last_error first occurred (rather than the
> timestamp of the first of any type of error that occurred) on that
> worker?

I felt that "first_error_time" is the timestamp of the first of any
type of error that occurred on the worker.

>
> > ... Also, I'm not sure
> > last_error_count is not clear to me (it looks like showing something
> > count but the only "last" one?).
>
> It's the number of times that the last_error has occurred.
> Unless it's some kind of transient error, that might get resolved
> without intervention, logical replication will get stuck in a loop
> retrying and the last error will occur again and again, hence the
> count of how many times that has happened.
> Maybe there's not much benefit in counting different errors prior to
> the last error?

The name "last_error_count" is somewhat clear to me now. I had felt
that since the last error refers to *one* error that occurred last and
it’s odd there is the count of it.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Failed transaction statistics to measure the logical replication progress

2021-11-23 Thread Masahiko Sawada
On Sat, Nov 20, 2021 at 3:25 PM Amit Kapila  wrote:
>
> On Fri, Nov 19, 2021 at 7:41 PM Masahiko Sawada  wrote:
> >
> > On Thu, Nov 18, 2021 at 12:26 PM Amit Kapila  
> > wrote:
> > >
> > > On Wed, Nov 17, 2021 at 7:12 PM osumi.takami...@fujitsu.com
> > >  wrote:
> > > >
> > > > On Wednesday, November 17, 2021 10:00 PM Amit Kapila 
> > > >  wrote:
> > > > >
> > > > > Can you please tell us why you think the names in your proposed patch 
> > > > > are
> > > > > better than the existing names proposed in Sawada-San's patch? Is it 
> > > > > because
> > > > > those fields always contain the information of the last or latest 
> > > > > error that
> > > > > occurred in the corresponding subscription worker?
> > > > This is one reason.
> > > >
> > > > Another big reason comes from the final alignment when we list up all 
> > > > columns of both patches.
> > > > The patches in this thread is trying to introduce a column that 
> > > > indicates
> > > > cumulative count of error to show all error counts that the worker got 
> > > > in the past.
> > > >
> > >
> > > Okay, I see your point and it makes sense to rename columns after
> > > these other stats. I am not able to come up with any better names than
> > > what is being used here. Sawada-San, do you agree with this, or do let
> > > us know if you have any better ideas?
> > >
> >
> > I'm concerned that these new names will introduce confusion; if we
> > have last_error_relid, last_error_command, last_error_message,
> > last_error_time, and last_error_xid, I think users might think that
> > first_error_time is the timestamp at which an error occurred for the
> > first time in the subscription worker.
> >
>
> Isn't to some extent that confusion already exists because of
> last_error_time column?
>
> > Also, I'm not sure
> > last_error_count is not clear to me (it looks like showing something
> > count but the only "last" one?).
> >
>
> I feel if all the error related columns have "last_error_" as a prefix
> then it should not be that confusing?
>
> > An alternative idea would be to add
> > total_error_count by this patch, resulting in having both error_count
> > and total_error_count. Regarding commit_count and abort_count, I
> > personally think xact_commit and xact_rollback would be better since
> > they’re more consistent with pg_stat_database view, although we might
> > already have discussed that.
> >
>
> Even if we decide to change the column names to
> xact_commit/xact_rollback, I think with additional non-error columns
> it will be clear to add 'error' in column names corresponding to error
> columns, and last_error_* seems to be consistent with what we have in
> pg_stat_archiver (last_failed_wal, last_failed_time).

Okay, I agree that last_error_* columns will be consistent.

> Your point
> related to first_error_time has merit and I don't have a better answer
> for it. I think it is just a convenience column and we are not sure
> whether that will be required in practice so maybe we can drop that
> column and come back to it later once we get some field feedback on
> this view?

+1

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Failed transaction statistics to measure the logical replication progress

2021-11-22 Thread Greg Nancarrow
On Sat, Nov 20, 2021 at 1:11 AM Masahiko Sawada  wrote:
>
> I'm concerned that these new names will introduce confusion; if we
> have last_error_relid, last_error_command, last_error_message,
> last_error_time, and last_error_xid, I think users might think that
> first_error_time is the timestamp at which an error occurred for the
> first time in the subscription worker.

You mean you think users might think "first_error_time" is the
timestamp at which the last_error first occurred (rather than the
timestamp of the first of any type of error that occurred) on that
worker?

> ... Also, I'm not sure
> last_error_count is not clear to me (it looks like showing something
> count but the only "last" one?).

It's the number of times that the last_error has occurred.
Unless it's some kind of transient error, that might get resolved
without intervention, logical replication will get stuck in a loop
retrying and the last error will occur again and again, hence the
count of how many times that has happened.
Maybe there's not much benefit in counting different errors prior to
the last error?


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Failed transaction statistics to measure the logical replication progress

2021-11-21 Thread vignesh C
On Fri, Nov 19, 2021 at 7:41 PM Masahiko Sawada  wrote:
>
> On Thu, Nov 18, 2021 at 12:26 PM Amit Kapila  wrote:
> >
> > On Wed, Nov 17, 2021 at 7:12 PM osumi.takami...@fujitsu.com
> >  wrote:
> > >
> > > On Wednesday, November 17, 2021 10:00 PM Amit Kapila 
> > >  wrote:
> > > >
> > > > Can you please tell us why you think the names in your proposed patch 
> > > > are
> > > > better than the existing names proposed in Sawada-San's patch? Is it 
> > > > because
> > > > those fields always contain the information of the last or latest error 
> > > > that
> > > > occurred in the corresponding subscription worker?
> > > This is one reason.
> > >
> > > Another big reason comes from the final alignment when we list up all 
> > > columns of both patches.
> > > The patches in this thread is trying to introduce a column that indicates
> > > cumulative count of error to show all error counts that the worker got in 
> > > the past.
> > >
> >
> > Okay, I see your point and it makes sense to rename columns after
> > these other stats. I am not able to come up with any better names than
> > what is being used here. Sawada-San, do you agree with this, or do let
> > us know if you have any better ideas?
> >
>
> I'm concerned that these new names will introduce confusion; if we
> have last_error_relid, last_error_command, last_error_message,
> last_error_time, and last_error_xid, I think users might think that
> first_error_time is the timestamp at which an error occurred for the
> first time in the subscription worker. Also, I'm not sure
> last_error_count is not clear to me (it looks like showing something
> count but the only "last" one?). An alternative idea would be to add
> total_error_count by this patch, resulting in having both error_count
> and total_error_count. Regarding commit_count and abort_count, I
> personally think xact_commit and xact_rollback would be better since
> they’re more consistent with pg_stat_database view, although we might
> already have discussed that.
>
> Besides that, I’m not sure how useful commit_bytes, abort_bytes, and
> error_bytes are. I originally thought these statistics track the size
> of received data, i.g., how much data is transferred from the
> publisher and processed on the subscriber. But what the view currently
> has is how much memory is used in the subscription worker. The
> subscription worker emulates ReorderBufferChangeSize() on the
> subscriber side but, as the comment of update_apply_change_size()
> mentions, the size in the view is not accurate:
>
> + * The byte size of transaction on the publisher is calculated by
> + * ReorderBufferChangeSize() based on the ReorderBufferChange structure.
> + * But on the subscriber, consumed resources are not same as the
> + * publisher's decoding processsing and required to be computed in
> + * different way. Therefore, the exact same byte size is not restored on
> + * the subscriber usually.
>
> Also, it seems to take into account the size of FlushPosition that is
> not taken into account by ReorderBufferChangeSize().

Let's keep the size calculation similar to the publisher side to avoid
any confusion, we can try to keep it the same as
ReorderBufferChangeSize wherever possible.

Regards,
Vignesh




Re: Failed transaction statistics to measure the logical replication progress

2021-11-19 Thread Amit Kapila
On Fri, Nov 19, 2021 at 7:41 PM Masahiko Sawada  wrote:
>
> On Thu, Nov 18, 2021 at 12:26 PM Amit Kapila  wrote:
> >
> > On Wed, Nov 17, 2021 at 7:12 PM osumi.takami...@fujitsu.com
> >  wrote:
> > >
> > > On Wednesday, November 17, 2021 10:00 PM Amit Kapila 
> > >  wrote:
> > > >
> > > > Can you please tell us why you think the names in your proposed patch 
> > > > are
> > > > better than the existing names proposed in Sawada-San's patch? Is it 
> > > > because
> > > > those fields always contain the information of the last or latest error 
> > > > that
> > > > occurred in the corresponding subscription worker?
> > > This is one reason.
> > >
> > > Another big reason comes from the final alignment when we list up all 
> > > columns of both patches.
> > > The patches in this thread is trying to introduce a column that indicates
> > > cumulative count of error to show all error counts that the worker got in 
> > > the past.
> > >
> >
> > Okay, I see your point and it makes sense to rename columns after
> > these other stats. I am not able to come up with any better names than
> > what is being used here. Sawada-San, do you agree with this, or do let
> > us know if you have any better ideas?
> >
>
> I'm concerned that these new names will introduce confusion; if we
> have last_error_relid, last_error_command, last_error_message,
> last_error_time, and last_error_xid, I think users might think that
> first_error_time is the timestamp at which an error occurred for the
> first time in the subscription worker.
>

Isn't to some extent that confusion already exists because of
last_error_time column?

> Also, I'm not sure
> last_error_count is not clear to me (it looks like showing something
> count but the only "last" one?).
>

I feel if all the error related columns have "last_error_" as a prefix
then it should not be that confusing?

> An alternative idea would be to add
> total_error_count by this patch, resulting in having both error_count
> and total_error_count. Regarding commit_count and abort_count, I
> personally think xact_commit and xact_rollback would be better since
> they’re more consistent with pg_stat_database view, although we might
> already have discussed that.
>

Even if we decide to change the column names to
xact_commit/xact_rollback, I think with additional non-error columns
it will be clear to add 'error' in column names corresponding to error
columns, and last_error_* seems to be consistent with what we have in
pg_stat_archiver (last_failed_wal, last_failed_time). Your point
related to first_error_time has merit and I don't have a better answer
for it. I think it is just a convenience column and we are not sure
whether that will be required in practice so maybe we can drop that
column and come back to it later once we get some field feedback on
this view?

-- 
With Regards,
Amit Kapila.




Re: Failed transaction statistics to measure the logical replication progress

2021-11-19 Thread Masahiko Sawada
On Thu, Nov 18, 2021 at 12:26 PM Amit Kapila  wrote:
>
> On Wed, Nov 17, 2021 at 7:12 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Wednesday, November 17, 2021 10:00 PM Amit Kapila 
> >  wrote:
> > >
> > > Can you please tell us why you think the names in your proposed patch are
> > > better than the existing names proposed in Sawada-San's patch? Is it 
> > > because
> > > those fields always contain the information of the last or latest error 
> > > that
> > > occurred in the corresponding subscription worker?
> > This is one reason.
> >
> > Another big reason comes from the final alignment when we list up all 
> > columns of both patches.
> > The patches in this thread is trying to introduce a column that indicates
> > cumulative count of error to show all error counts that the worker got in 
> > the past.
> >
>
> Okay, I see your point and it makes sense to rename columns after
> these other stats. I am not able to come up with any better names than
> what is being used here. Sawada-San, do you agree with this, or do let
> us know if you have any better ideas?
>

I'm concerned that these new names will introduce confusion; if we
have last_error_relid, last_error_command, last_error_message,
last_error_time, and last_error_xid, I think users might think that
first_error_time is the timestamp at which an error occurred for the
first time in the subscription worker. Also, I'm not sure
last_error_count is not clear to me (it looks like showing something
count but the only "last" one?). An alternative idea would be to add
total_error_count by this patch, resulting in having both error_count
and total_error_count. Regarding commit_count and abort_count, I
personally think xact_commit and xact_rollback would be better since
they’re more consistent with pg_stat_database view, although we might
already have discussed that.

Besides that, I’m not sure how useful commit_bytes, abort_bytes, and
error_bytes are. I originally thought these statistics track the size
of received data, i.g., how much data is transferred from the
publisher and processed on the subscriber. But what the view currently
has is how much memory is used in the subscription worker. The
subscription worker emulates ReorderBufferChangeSize() on the
subscriber side but, as the comment of update_apply_change_size()
mentions, the size in the view is not accurate:

+ * The byte size of transaction on the publisher is calculated by
+ * ReorderBufferChangeSize() based on the ReorderBufferChange structure.
+ * But on the subscriber, consumed resources are not same as the
+ * publisher's decoding processsing and required to be computed in
+ * different way. Therefore, the exact same byte size is not restored on
+ * the subscriber usually.

Also, it seems to take into account the size of FlushPosition that is
not taken into account by ReorderBufferChangeSize().

I guess that the purpose of these values is to compare them to
total_bytes, stream_byte, and spill_bytes but if the calculation is
not accurate, does it mean that the more stats are updated, the more
the stats will be getting inaccurate?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




RE: Failed transaction statistics to measure the logical replication progress

2021-11-18 Thread osumi.takami...@fujitsu.com
On Thursday, November 18, 2021 12:26 PM Amit Kapila  
wrote:
> BTW, I think the way you are computing error_count in
> pgstat_recv_subworker_error() doesn't seem correct to me because it will
> accumulate the counter/bytes for the same error again and again.
> You might want to update these counters after we have checked that the
> received error is not the same as the previous one.
Thank you for your comments !
This is addressed by v13 patchset [1]


[1] - 
https://www.postgresql.org/message-id/TYCPR01MB8373533A5C24BDDA516DA7E1ED9B9%40TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2021-11-18 Thread osumi.takami...@fujitsu.com
On Thursday, November 18, 2021 8:35 PM vignesh C  wrote:
> On Tue, Nov 16, 2021 at 6:04 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Monday, November 15, 2021 9:14 PM I wrote:
> > > I've conducted some update for this.
> > > (The rebased part is only C code and checked by pgindent)
> > I'll update my patches since a new skip xid patch has been shared in
> > [1].
> >
> > This version includes some minor renames of functions that are related
> > to transaction sizes.
> 
> Thanks for the updated patch, Few comments:
Thank you for checking the patches !


> 1) since pgstat_get_subworker_prepared_txn is called from only one place and
> create is passed as true, we can remove create function parameter or the
> function could be removed.
> + * Return subscription worker entry with the given subscription OID and
> + * gid.
> + * --
> + */
> +static PgStat_SW_PreparedXactEntry *
> +pgstat_get_subworker_prepared_txn(Oid databaseid, Oid subid,
> + char
> *gid, bool create)
> +{
> +   PgStat_StatDBEntry *dbentry;
> +   PgStat_SW_PreparedXactKey key;
Removed the parameter.


>  2) Include subworker prepared transactions also
>   /*
> * Don't create tables/functions/subworkers hashtables for
> * uninteresting databases.
> */
> if (onlydb != InvalidOid)
> {
> if (dbbuf.databaseid != onlydb &&
> dbbuf.databaseid != InvalidOid)
> break;
> }
Fixed.


> 3) Similarly it should be mentioned in:
> reset_dbentry_counters function header, pgstat_read_db_statsfile function
> header and pgstat_get_db_entry function comments.
Fixed.

 
> 4) I felt we can remove "COMMIT of streaming transaction", since only commit
> and commit prepared are the user operations. Shall we change it to "COMMIT
> and COMMIT PREPARED will increment this counter."
> +   commit_count bigint
> +  
> +  
> +   Number of transactions successfully applied in this subscription.
> +   COMMIT, COMMIT of streaming transaction and COMMIT
> PREPARED increments
> +   this counter.
> +  
> + 
You are right ! Fixed.

 
> 5) PgStat_SW_PreparedXactEntry should be before
> PgStat_SW_PreparedXactKey  PgStat_StatSubWorkerEntry
> PgStat_StatSubWorkerKey
> +PgStat_SW_PreparedXactKey
> +PgStat_SW_PreparedXactEntry
>  PgStat_StatTabEntry
>  PgStat_SubXactStatus
Fixed.

> 6)  This change is not required
> @@ -293,6 +306,7 @@ static inline void cleanup_subxact_info(void);  static
> void stream_cleanup_files(Oid subid, TransactionId xid);  static void
> stream_open_file(Oid subid, TransactionId xid, bool first);  static void
> stream_write_change(char action, StringInfo s);
> +
>  static void stream_close_file(void);
Removed.


Other changes are
1. refined the commit message of v13-0003*.
2. made the additional comment for ApplyErrorCallbackArg simple.
3. wrote more explanations about update_apply_change_size() as comment.
4. changed the behavior of pgstat_recv_subworker_error so that
   it can store stats info only once per error.
5. added one simple test for PREPARE and COMMIT PREPARED.

This used v23 skip xid patch [1].
(I will remove v13-0001* when the column names are fixed
and Sawada-san starts to take care of the column name definitions)

[1] - 
https://www.postgresql.org/message-id/CAD21AoA5jupM6O%3DpYsyfaxQ1aMX-en8%3DQNgpW6KfXsg7_CS0CQ%40mail.gmail.com

Best Regards,
Takamichi Osumi



v13-0001-Rename-existing-columns-of-pg_stat_subscription_.patch
Description:  v13-0001-Rename-existing-columns-of-pg_stat_subscription_.patch


v13-0002-Export-PartitionTupleRouting-for-transaction-siz.patch
Description:  v13-0002-Export-PartitionTupleRouting-for-transaction-siz.patch


v13-0003-Extend-pg_stat_subscription_workers-to-include-g.patch
Description:  v13-0003-Extend-pg_stat_subscription_workers-to-include-g.patch


Re: Failed transaction statistics to measure the logical replication progress

2021-11-18 Thread vignesh C
On Tue, Nov 16, 2021 at 6:04 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Monday, November 15, 2021 9:14 PM I wrote:
> > I've conducted some update for this.
> > (The rebased part is only C code and checked by pgindent)
> I'll update my patches since a new skip xid patch
> has been shared in [1].
>
> This version includes some minor renames of functions
> that are related to transaction sizes.

Thanks for the updated patch, Few comments:
1) since pgstat_get_subworker_prepared_txn is called from only one
place and create is passed as true, we can remove create function
parameter or the function could be removed.
+ * Return subscription worker entry with the given subscription OID and
+ * gid.
+ * --
+ */
+static PgStat_SW_PreparedXactEntry *
+pgstat_get_subworker_prepared_txn(Oid databaseid, Oid subid,
+ char
*gid, bool create)
+{
+   PgStat_StatDBEntry *dbentry;
+   PgStat_SW_PreparedXactKey key;

 2) Include subworker prepared transactions also
  /*
* Don't create tables/functions/subworkers hashtables for
* uninteresting databases.
*/
if (onlydb != InvalidOid)
{
if (dbbuf.databaseid != onlydb &&
dbbuf.databaseid != InvalidOid)
break;
}

3) Similarly it should be mentioned in:
reset_dbentry_counters function header, pgstat_read_db_statsfile
function header and pgstat_get_db_entry function comments.

4) I felt we can remove "COMMIT of streaming transaction", since only
commit and commit prepared are the user operations. Shall we change it
to "COMMIT and COMMIT PREPARED will increment this counter."
+   commit_count bigint
+  
+  
+   Number of transactions successfully applied in this subscription.
+   COMMIT, COMMIT of streaming transaction and COMMIT PREPARED increments
+   this counter.
+  
+ 

5) PgStat_SW_PreparedXactEntry should be before PgStat_SW_PreparedXactKey
 PgStat_StatSubWorkerEntry
 PgStat_StatSubWorkerKey
+PgStat_SW_PreparedXactKey
+PgStat_SW_PreparedXactEntry
 PgStat_StatTabEntry
 PgStat_SubXactStatus

6)  This change is not required
@@ -293,6 +306,7 @@ static inline void cleanup_subxact_info(void);
 static void stream_cleanup_files(Oid subid, TransactionId xid);
 static void stream_open_file(Oid subid, TransactionId xid, bool first);
 static void stream_write_change(char action, StringInfo s);
+
 static void stream_close_file(void);

Regards,
Vignesh




Re: Failed transaction statistics to measure the logical replication progress

2021-11-17 Thread Amit Kapila
On Wed, Nov 17, 2021 at 7:12 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Wednesday, November 17, 2021 10:00 PM Amit Kapila 
>  wrote:
> >
> > Can you please tell us why you think the names in your proposed patch are
> > better than the existing names proposed in Sawada-San's patch? Is it because
> > those fields always contain the information of the last or latest error that
> > occurred in the corresponding subscription worker?
> This is one reason.
>
> Another big reason comes from the final alignment when we list up all columns 
> of both patches.
> The patches in this thread is trying to introduce a column that indicates
> cumulative count of error to show all error counts that the worker got in the 
> past.
>

Okay, I see your point and it makes sense to rename columns after
these other stats. I am not able to come up with any better names than
what is being used here. Sawada-San, do you agree with this, or do let
us know if you have any better ideas?

BTW, I think the way you are computing error_count in
pgstat_recv_subworker_error() doesn't seem correct to me because it
will accumulate the counter/bytes for the same error again and again.
You might want to update these counters after we have checked that the
received error is not the same as the previous one.

-- 
With Regards,
Amit Kapila.




RE: Failed transaction statistics to measure the logical replication progress

2021-11-17 Thread osumi.takami...@fujitsu.com
On Wednesday, November 17, 2021 10:00 PM Amit Kapila  
wrote:
> On Wed, Nov 17, 2021 at 9:44 AM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Wednesday, November 17, 2021 12:19 PM Masahiko Sawada
>  wrote:
> > > On Tue, Nov 16, 2021 at 9:34 PM osumi.takami...@fujitsu.com
> > >  wrote:
> > > >
> > > > On Monday, November 15, 2021 9:14 PM I wrote:
> > > > > I've conducted some update for this.
> > > > > (The rebased part is only C code and checked by pgindent)
> > > > I'll update my patches since a new skip xid patch has been shared
> > > > in [1].
> > > >
> > > > This version includes some minor renames of functions that are
> > > > related to transaction sizes.
> > >
> > > I've looked at v12-0001 patch. Here are some comments:
> > Thank you for paying attention to this thread !
> >
> >
> > > -   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "relid",
> > > +   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "last_error_relid",
> > >OIDOID, -1, 0);
> > > -   TupleDescInitEntry(tupdesc, (AttrNumber) 4, "command",
> > > +   TupleDescInitEntry(tupdesc, (AttrNumber) 4,
> > > + "last_error_command",
> > >TEXTOID, -1, 0);
> > > -   TupleDescInitEntry(tupdesc, (AttrNumber) 5, "xid",
> > > +   TupleDescInitEntry(tupdesc, (AttrNumber) 5, "last_error_xid",
> > >XIDOID, -1, 0);
> > > -   TupleDescInitEntry(tupdesc, (AttrNumber) 6, "error_count",
> > > +   TupleDescInitEntry(tupdesc, (AttrNumber) 6, "last_error_count",
> > >INT8OID, -1, 0);
> > > -   TupleDescInitEntry(tupdesc, (AttrNumber) 7, "error_message",
> > > +   TupleDescInitEntry(tupdesc, (AttrNumber) 7,
> > > + "last_error_message",
> > >
> > > If renaming column names clarifies those meanings, the above changes
> > > should be included into my patch that introduces
> > > pg_stat_subscription_workers view?
> 
> Right.
> 
> > At first, your column names of pg_stat_subscription_workers look
> > totally OK to me by itself and I thought I should take care of those 
> > renaming at
> the commit timing of my stats patches.
> >
> 
> Can you please tell us why you think the names in your proposed patch are
> better than the existing names proposed in Sawada-San's patch? Is it because
> those fields always contain the information of the last or latest error that
> occurred in the corresponding subscription worker?
This is one reason. 

Another big reason comes from the final alignment when we list up all columns 
of both patches.
The patches in this thread is trying to introduce a column that indicates
cumulative count of error to show all error counts that the worker got in the 
past.
In this thread, after two or three improvements, this column name has reached to
a simple one 'error_count' (aligned with 'commit_count' and 'abort_count').
Then we need to differentiate what this thread's patch is trying to introduce 
and
what skip xid patch is introducing.

> If so, I am not very sure if that is a good reason to increase the length of 
> most of
> the column names but if you and others feel that is helpful then it is better 
> to do
> it as part of Sawada-San's patch.
Yes. On this point, it was a mistake that I handled that changes.
It'd be better that Sawada-san takes care of them once the names have been 
fixed.


Best Regards,
Takamichi Osumi



Re: Failed transaction statistics to measure the logical replication progress

2021-11-17 Thread Amit Kapila
On Wed, Nov 17, 2021 at 9:44 AM osumi.takami...@fujitsu.com
 wrote:
>
> On Wednesday, November 17, 2021 12:19 PM Masahiko Sawada 
>  wrote:
> > On Tue, Nov 16, 2021 at 9:34 PM osumi.takami...@fujitsu.com
> >  wrote:
> > >
> > > On Monday, November 15, 2021 9:14 PM I wrote:
> > > > I've conducted some update for this.
> > > > (The rebased part is only C code and checked by pgindent)
> > > I'll update my patches since a new skip xid patch has been shared in
> > > [1].
> > >
> > > This version includes some minor renames of functions that are related
> > > to transaction sizes.
> >
> > I've looked at v12-0001 patch. Here are some comments:
> Thank you for paying attention to this thread !
>
>
> > -   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "relid",
> > +   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "last_error_relid",
> >OIDOID, -1, 0);
> > -   TupleDescInitEntry(tupdesc, (AttrNumber) 4, "command",
> > +   TupleDescInitEntry(tupdesc, (AttrNumber) 4, "last_error_command",
> >TEXTOID, -1, 0);
> > -   TupleDescInitEntry(tupdesc, (AttrNumber) 5, "xid",
> > +   TupleDescInitEntry(tupdesc, (AttrNumber) 5, "last_error_xid",
> >XIDOID, -1, 0);
> > -   TupleDescInitEntry(tupdesc, (AttrNumber) 6, "error_count",
> > +   TupleDescInitEntry(tupdesc, (AttrNumber) 6, "last_error_count",
> >INT8OID, -1, 0);
> > -   TupleDescInitEntry(tupdesc, (AttrNumber) 7, "error_message",
> > +   TupleDescInitEntry(tupdesc, (AttrNumber) 7, "last_error_message",
> >
> > If renaming column names clarifies those meanings, the above changes should
> > be included into my patch that introduces pg_stat_subscription_workers
> > view?

Right.

> At first, your column names of pg_stat_subscription_workers look totally OK 
> to me by itself
> and I thought I should take care of those renaming at the commit timing of my 
> stats patches.
>

Can you please tell us why you think the names in your proposed patch
are better than the existing names proposed in Sawada-San's patch? Is
it because those fields always contain the information of the last or
latest error that occurred in the corresponding subscription worker?
If so, I am not very sure if that is a good reason to increase the
length of most of the column names but if you and others feel that is
helpful then it is better to do it as part of Sawada-San's patch.

-- 
With Regards,
Amit Kapila.




RE: Failed transaction statistics to measure the logical replication progress

2021-11-16 Thread osumi.takami...@fujitsu.com
On Wednesday, November 17, 2021 12:19 PM Masahiko Sawada 
 wrote:
> On Tue, Nov 16, 2021 at 9:34 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Monday, November 15, 2021 9:14 PM I wrote:
> > > I've conducted some update for this.
> > > (The rebased part is only C code and checked by pgindent)
> > I'll update my patches since a new skip xid patch has been shared in
> > [1].
> >
> > This version includes some minor renames of functions that are related
> > to transaction sizes.
> 
> I've looked at v12-0001 patch. Here are some comments:
Thank you for paying attention to this thread !


> -   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "relid",
> +   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "last_error_relid",
>OIDOID, -1, 0);
> -   TupleDescInitEntry(tupdesc, (AttrNumber) 4, "command",
> +   TupleDescInitEntry(tupdesc, (AttrNumber) 4, "last_error_command",
>TEXTOID, -1, 0);
> -   TupleDescInitEntry(tupdesc, (AttrNumber) 5, "xid",
> +   TupleDescInitEntry(tupdesc, (AttrNumber) 5, "last_error_xid",
>XIDOID, -1, 0);
> -   TupleDescInitEntry(tupdesc, (AttrNumber) 6, "error_count",
> +   TupleDescInitEntry(tupdesc, (AttrNumber) 6, "last_error_count",
>INT8OID, -1, 0);
> -   TupleDescInitEntry(tupdesc, (AttrNumber) 7, "error_message",
> +   TupleDescInitEntry(tupdesc, (AttrNumber) 7, "last_error_message",
> 
> If renaming column names clarifies those meanings, the above changes should
> be included into my patch that introduces pg_stat_subscription_workers
> view?
At first, your column names of pg_stat_subscription_workers look totally OK to 
me by itself
and I thought I should take care of those renaming at the commit timing of my 
stats patches.
But, if you agree with the new names above and fixing your patch doesn't
bother you, I'd appreciate your help !

> I think that exporting PartitionTupleRouting should not be done in the one
> patch together with renaming the view columns. There is not relevance
> between them at all. If it's used by v12-0002 patch, I think it should be 
> included
> in that patch or in another separate patch.
Yes, it's used by v12-0002.

Absolutely you are right. When you update the patch like above,
I would like to make it independent.


Best Regards,
Takamichi Osumi




Re: Failed transaction statistics to measure the logical replication progress

2021-11-16 Thread Masahiko Sawada
On Tue, Nov 16, 2021 at 9:34 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Monday, November 15, 2021 9:14 PM I wrote:
> > I've conducted some update for this.
> > (The rebased part is only C code and checked by pgindent)
> I'll update my patches since a new skip xid patch
> has been shared in [1].
>
> This version includes some minor renames of functions
> that are related to transaction sizes.

I've looked at v12-0001 patch. Here are some comments:

-   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "relid",
+   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "last_error_relid",
   OIDOID, -1, 0);
-   TupleDescInitEntry(tupdesc, (AttrNumber) 4, "command",
+   TupleDescInitEntry(tupdesc, (AttrNumber) 4, "last_error_command",
   TEXTOID, -1, 0);
-   TupleDescInitEntry(tupdesc, (AttrNumber) 5, "xid",
+   TupleDescInitEntry(tupdesc, (AttrNumber) 5, "last_error_xid",
   XIDOID, -1, 0);
-   TupleDescInitEntry(tupdesc, (AttrNumber) 6, "error_count",
+   TupleDescInitEntry(tupdesc, (AttrNumber) 6, "last_error_count",
   INT8OID, -1, 0);
-   TupleDescInitEntry(tupdesc, (AttrNumber) 7, "error_message",
+   TupleDescInitEntry(tupdesc, (AttrNumber) 7, "last_error_message",

If renaming column names clarifies those meanings, the above changes
should be included into my patch that introduces
pg_stat_subscription_workers view?

---
I think that exporting PartitionTupleRouting should not be done in the
one patch together with renaming the view columns. There is not
relevance between them at all. If it's used by v12-0002 patch, I think
it should be included in that patch or in another separate patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




RE: Failed transaction statistics to measure the logical replication progress

2021-11-16 Thread osumi.takami...@fujitsu.com
On Monday, November 15, 2021 9:14 PM I wrote:
> I've conducted some update for this.
> (The rebased part is only C code and checked by pgindent)
I'll update my patches since a new skip xid patch
has been shared in [1].

This version includes some minor renames of functions
that are related to transaction sizes.

[1] - 
https://www.postgresql.org/message-id/CAD21AoA5jupM6O%3DpYsyfaxQ1aMX-en8%3DQNgpW6KfXsg7_CS0CQ%40mail.gmail.com

Best Regards,
Takamichi Osumi



v12-0001-Rename-existing-columns-of-pg_stat_subscription_.patch
Description:  v12-0001-Rename-existing-columns-of-pg_stat_subscription_.patch


v12-0002-Extend-pg_stat_subscription_workers-to-include-g.patch
Description:  v12-0002-Extend-pg_stat_subscription_workers-to-include-g.patch


RE: Failed transaction statistics to measure the logical replication progress

2021-11-15 Thread osumi.takami...@fujitsu.com
On Monday, November 15, 2021 6:28 PM I wrote:
> On Thursday, November 11, 2021 9:47 PM vignesh C 
> wrote:
> > Few more comments:
...
> Fixed. Also, its name was too long
> when aligned with other PgStat_StatDBEntry memebers.
> Thus I renamed it as subworkers_preparedsizes.
> 
> This depends on v21 in [1]
Hi

There was some minor conflict error by the change of v22 in [1].
I've conducted some update for this.
(The rebased part is only C code and checked by pgindent)

[1] - 
https://www.postgresql.org/message-id/CAD21AoCE72vKXJp99f4xRw7Mh5ve-Z2roe21gP8Y82_CxXKvbg%40mail.gmail.com


Best Regards,
Takamichi Osumi



v11-0001-Rename-existing-columns-of-pg_stat_subscription_.patch
Description:  v11-0001-Rename-existing-columns-of-pg_stat_subscription_.patch


v11-0002-Extend-pg_stat_subscription_workers-to-include-g.patch
Description:  v11-0002-Extend-pg_stat_subscription_workers-to-include-g.patch


RE: Failed transaction statistics to measure the logical replication progress

2021-11-15 Thread osumi.takami...@fujitsu.com
On Wednesday, November 10, 2021 6:13 PM I wrote:
> On Wednesday, November 10, 2021 3:43 PM Dilip Kumar
>  wrote:
> > On Tue, Nov 9, 2021 at 5:05 PM osumi.takami...@fujitsu.com
> >  wrote:
> > > On Tuesday, November 9, 2021 12:08 PM Greg Nancarrow
> >  wrote:
> > > > On Fri, Nov 5, 2021 at 7:11 PM osumi.takami...@fujitsu.com
> > > >  wrote:
> > > > >
> > > >
> > > > I did a quick scan through the latest v8 patch and noticed the
> > > > following
> > things:
> > > I appreciate your review !
> > I have reviewed some part of the patch and I have a few comments
> I really appreciate your attention and review.
...
> > 3.
> > +{
> > +size += *extra_data->stream_write_len;
> > +add_apply_error_context_xact_size(size);
> > +return;
> > +}
> >
> > From apply_handle_insert(), we are calling update_apply_change_size(),
> > and inside this function we are dereferencing
> *extra_data->stream_write_len.
> > Basically, stream_write_len is in integer pointer and the caller
> > hasn't allocated memory for that and inside update_apply_change_size,
> > we are directly dereferencing the pointer, how this can be correct.
...
> I'll just delete the top part that handles streaming bytes calculation in the
> update_apply_change_size().
> It's because now that there is a specific structure to recognize each 
> streaming
> xid and save transaction size there, which makes the top part in question
> useless.
> 
> > And I also see that in the
> > whole patch stream_write_len, is never used as lvalue so without
> > storing anything into this why are we trying to use this as rvalue
> > here?  This is clearly an issue.
> As described above, I'll fix this part and related codes mainly streaming 
> related
> codes in the next version.
Removed this deadcodes.

Please have a look at [1]

[1] - 
https://www.postgresql.org/message-id/TYCPR01MB8373FEB287F733C81C1E4D42ED989%40TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2021-11-15 Thread osumi.takami...@fujitsu.com
On Wednesday, November 10, 2021 7:13 PM vignesh C  wrote:
> On Tue, Nov 9, 2021 at 5:05 PM osumi.takami...@fujitsu.com
>  wrote:
> > Yes. I've rebased and updated the patch, paying attention to this point.
> > Attached the updated version.
> 
> Thanks for the updated patch, few comments:
> 1) you could rename PgStat_StatSubWorkerPreparedXact to
> PgStat_SW_PreparedXactKey or a simpler name which includes key and
> similarly change PgStat_StatSubWorkerPreparedXactSize to
> PgStat_SW_PreparedXactEntry
> 
> +/* prepared transaction */
> +typedef struct PgStat_StatSubWorkerPreparedXact {
> +   Oid subid;
> +   chargid[GIDSIZE];
> +} PgStat_StatSubWorkerPreparedXact;
> +
> +typedef struct PgStat_StatSubWorkerPreparedXactSize
> +{
> +   PgStat_StatSubWorkerPreparedXact key; /* hash key */
> +
> +   Oid subid;
> +   chargid[GIDSIZE];
> +   PgStat_Counter xact_size;
> +} PgStat_StatSubWorkerPreparedXactSize;
> +
Fixed. Adopted your suggested names.


> 2) You can change prepared_size to sw_prepared_xact_entry or
> prepared_xact_entry since it is a hash entry with few fields
> +   if (subWorkerPreparedXactSizeHash)
> +   {
> +   PgStat_StatSubWorkerPreparedXactSize *prepared_size;
> +
> +   hash_seq_init(, subWorkerPreparedXactSizeHash);
> +   while((prepared_size =
> (PgStat_StatSubWorkerPreparedXactSize *) hash_seq_search()) !=
> NULL)
> +   {
> +   fputc('P', fpout);
> +   rc = fwrite(prepared_size,
> sizeof(PgStat_StatSubWorkerPreparedXactSize), 1, fpout);
> +   (void) rc;  /* we'll check
> for error with ferror */
> +   }
I preferred prepared_xact_entry. Fixed.


> 3) This need to be indented
> -w.relid,
> -w.command,
> -w.xid,
> -w.error_count,
> -w.error_message,
> -w.last_error_time
> +   w.commit_count,
> +   w.commit_bytes,
> +   w.error_count,
> +   w.error_bytes,
> +   w.abort_count,
> +   w.abort_bytes,
> +   w.last_error_relid,
> +   w.last_error_command,
> +   w.last_error_xid,
> +   w.last_error_count,
> +   w.last_error_message,
> +   w.last_error_time
Fixed.


> 4) Instead of adding a function to calculate the size, can we move
> PartitionTupleRouting from c file to the header file and use sizeof at the 
> caller
> function?
> +/*
> + * PartitionTupleRoutingSize - exported to calculate total data size
> + * of logical replication mesage apply, because this is one of the
> + * ApplyExecutionData struct members.
> + */
> +size_t
> +PartitionTupleRoutingSize(void)
> +{
> +   return sizeof(PartitionTupleRouting); }
Fixed.


> 5) You could run pgindent and pgperltidy for the code and test code to fix the
> indent issues.
> +
> subWorkerPreparedXactSizeHash = hash_create("Subscription worker stats
> of prepared txn",
> +
> 
> PGSTAT_SUBWORKER_HASH_SIZE,
> +
>  _ctl,
> +
>  HASH_ELEM | HASH_STRINGS |
> HASH_CONTEXT);
> +# There's no entry at the beginning
> +my $result = $node_subscriber->safe_psql('postgres',
> +"SELECT count(*) FROM pg_stat_subscription_workers;"); is($result,
> +q(0), 'no entry for transaction stats yet');
Conducted pgindent and pgperltidy.


> 6) Few places you have used strlcpy and few places you have used memcpy,
> you can keep it consistent:
> +   msg.m_command = command;
> +   strlcpy(msg.m_gid, gid, sizeof(msg.m_gid));
> +   msg.m_xact_bytes = xact_size;
> 
> +   key.subid = subid;
> +   memcpy(key.gid, gid, sizeof(key.gid));
> +   action = (create ? HASH_ENTER : HASH_FIND);
Fixed. I used strlcpy for new additional functions I made.
An exception is pgstat_read_db_statsfile().
In this function, we use memcpy() consistently in other places.

Please have a look at [1]

[1] - 
https://www.postgresql.org/message-id/TYCPR01MB8373FEB287F733C81C1E4D42ED989%40TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2021-11-15 Thread osumi.takami...@fujitsu.com
On Thursday, November 11, 2021 9:47 PM vignesh C  wrote:
> Few more comments:
> 1) Here the tuple length is not considered in the calculation, else it will 
> always
> show the fixed size for any size tuple. Ex varchar insert with 1 byte or 
> varchar
> insert with 100's of bytes. So I feel we should include the tuple length in 
> the
> calculation.
> +   case LOGICAL_REP_MSG_INSERT:
> +   case LOGICAL_REP_MSG_UPDATE:
> +   case LOGICAL_REP_MSG_DELETE:
> +   Assert(extra_data != NULL);
> +
> +   /*
> +* Compute size based on ApplyExecutionData.
> +* The size of LogicalRepRelMapEntry can be
> skipped because
> +* it is obtained from hash_search in
> logicalrep_rel_open.
> +*/
> +   size += sizeof(ApplyExecutionData) + sizeof(EState)
> +
> +   sizeof(ResultRelInfo) +
> + sizeof(ResultRelInfo);
> +
> +   /*
> +* Add some extra size if the target relation
> is partitioned.
> +* PartitionTupleRouting isn't exported.
> Therefore, call the
> +* function that returns its size instead.
> +*/
> +   if
> (extra_data->relmapentry->localrel->rd_rel->relkind ==
> RELKIND_PARTITIONED_TABLE)
> +   size += sizeof(ModifyTableState) +
> PartitionTupleRoutingSize();
> +   break;
Thanks a lot ! Fixed.


> 2) Can this be part of PgStat_StatDBEntry, similar to tables, functions and
> subworkers. It might be more appropriate to have it there instead of having
> another global variable.
> + * Stats of prepared transactions should be displayed
> + * at either commit prepared or rollback prepared time, even when it's
> + * after the server restart. We have the apply worker send those
> +statistics
> + * to the stats collector at prepare time and the startup process
> +restore
> + * those at restart if necessary.
> + */
> +static HTAB *subWorkerPreparedXactSizeHash = NULL;
> +
> +/*
Fixed. Also, its name was too long
when aligned with other PgStat_StatDBEntry memebers.
Thus I renamed it as subworkers_preparedsizes.

This depends on v21 in [1]

[1] - 
https://www.postgresql.org/message-id/CAD21AoAkd4YSoQUUFfpcrYOtkPRbninaw3sD0qc77nLW6Q89gg%40mail.gmail.com

Best Regards,
Takamichi Osumi



v10-0001-Rename-existing-columns-of-pg_stat_subscription_.patch
Description:  v10-0001-Rename-existing-columns-of-pg_stat_subscription_.patch


v10-0002-Extend-pg_stat_subscription_workers-to-include-g.patch
Description:  v10-0002-Extend-pg_stat_subscription_workers-to-include-g.patch


Re: Failed transaction statistics to measure the logical replication progress

2021-11-11 Thread vignesh C
On Wed, Nov 10, 2021 at 3:43 PM vignesh C  wrote:
>
> On Tue, Nov 9, 2021 at 5:05 PM osumi.takami...@fujitsu.com
>  wrote:
> > Yes. I've rebased and updated the patch, paying attention to this point.
> > Attached the updated version.
>
> Thanks for the updated patch, few comments:
> 6) Few places you have used strlcpy and few places you have used
> memcpy, you can keep it consistent:
> +   msg.m_command = command;
> +   strlcpy(msg.m_gid, gid, sizeof(msg.m_gid));
> +   msg.m_xact_bytes = xact_size;
>
> +   key.subid = subid;
> +   memcpy(key.gid, gid, sizeof(key.gid));
> +   action = (create ? HASH_ENTER : HASH_FIND);

Few more comments:
1) Here the tuple length is not considered in the calculation, else it
will always show the fixed size for any size tuple. Ex varchar insert
with 1 byte or varchar insert with 100's of bytes. So I feel we should
include the tuple length in the calculation.
+   case LOGICAL_REP_MSG_INSERT:
+   case LOGICAL_REP_MSG_UPDATE:
+   case LOGICAL_REP_MSG_DELETE:
+   Assert(extra_data != NULL);
+
+   /*
+* Compute size based on ApplyExecutionData.
+* The size of LogicalRepRelMapEntry can be
skipped because
+* it is obtained from hash_search in
logicalrep_rel_open.
+*/
+   size += sizeof(ApplyExecutionData) + sizeof(EState) +
+   sizeof(ResultRelInfo) + sizeof(ResultRelInfo);
+
+   /*
+* Add some extra size if the target relation
is partitioned.
+* PartitionTupleRouting isn't exported.
Therefore, call the
+* function that returns its size instead.
+*/
+   if
(extra_data->relmapentry->localrel->rd_rel->relkind ==
RELKIND_PARTITIONED_TABLE)
+   size += sizeof(ModifyTableState) +
PartitionTupleRoutingSize();
+   break;

2) Can this be part of PgStat_StatDBEntry, similar to tables,
functions and subworkers. It might be more appropriate to have it
there instead of having another global variable.
+ * Stats of prepared transactions should be displayed
+ * at either commit prepared or rollback prepared time, even when it's
+ * after the server restart. We have the apply worker send those statistics
+ * to the stats collector at prepare time and the startup process restore
+ * those at restart if necessary.
+ */
+static HTAB *subWorkerPreparedXactSizeHash = NULL;
+
+/*

Regards,
Vignesh




Re: Failed transaction statistics to measure the logical replication progress

2021-11-10 Thread vignesh C
On Tue, Nov 9, 2021 at 5:05 PM osumi.takami...@fujitsu.com
 wrote:
> Yes. I've rebased and updated the patch, paying attention to this point.
> Attached the updated version.

Thanks for the updated patch, few comments:
1) you could rename PgStat_StatSubWorkerPreparedXact to
PgStat_SW_PreparedXactKey or a simpler name which includes key and
similarly change PgStat_StatSubWorkerPreparedXactSize to
PgStat_SW_PreparedXactEntry

+/* prepared transaction */
+typedef struct PgStat_StatSubWorkerPreparedXact
+{
+   Oid subid;
+   chargid[GIDSIZE];
+} PgStat_StatSubWorkerPreparedXact;
+
+typedef struct PgStat_StatSubWorkerPreparedXactSize
+{
+   PgStat_StatSubWorkerPreparedXact key; /* hash key */
+
+   Oid subid;
+   chargid[GIDSIZE];
+   PgStat_Counter xact_size;
+} PgStat_StatSubWorkerPreparedXactSize;
+

2) You can change prepared_size to sw_prepared_xact_entry or
prepared_xact_entry since it is a hash entry with few fields
+   if (subWorkerPreparedXactSizeHash)
+   {
+   PgStat_StatSubWorkerPreparedXactSize *prepared_size;
+
+   hash_seq_init(, subWorkerPreparedXactSizeHash);
+   while((prepared_size =
(PgStat_StatSubWorkerPreparedXactSize *) hash_seq_search()) !=
NULL)
+   {
+   fputc('P', fpout);
+   rc = fwrite(prepared_size,
sizeof(PgStat_StatSubWorkerPreparedXactSize), 1, fpout);
+   (void) rc;  /* we'll check
for error with ferror */
+   }

3) This need to be indented
-w.relid,
-w.command,
-w.xid,
-w.error_count,
-w.error_message,
-w.last_error_time
+   w.commit_count,
+   w.commit_bytes,
+   w.error_count,
+   w.error_bytes,
+   w.abort_count,
+   w.abort_bytes,
+   w.last_error_relid,
+   w.last_error_command,
+   w.last_error_xid,
+   w.last_error_count,
+   w.last_error_message,
+   w.last_error_time

4) Instead of adding a function to calculate the size, can we move
PartitionTupleRouting from c file to the header file and use sizeof at
the caller function?
+/*
+ * PartitionTupleRoutingSize - exported to calculate total data size
+ * of logical replication mesage apply, because this is one of the
+ * ApplyExecutionData struct members.
+ */
+size_t
+PartitionTupleRoutingSize(void)
+{
+   return sizeof(PartitionTupleRouting);
+}

5) You could run pgindent and pgperltidy for the code and test code to
fix the indent issues.
+
subWorkerPreparedXactSizeHash = hash_create("Subscription worker stats
of prepared txn",
+
 PGSTAT_SUBWORKER_HASH_SIZE,
+
 _ctl,
+
 HASH_ELEM | HASH_STRINGS |
HASH_CONTEXT);

+# There's no entry at the beginning
+my $result = $node_subscriber->safe_psql('postgres',
+"SELECT count(*) FROM pg_stat_subscription_workers;");
+is($result, q(0), 'no entry for transaction stats yet');

6) Few places you have used strlcpy and few places you have used
memcpy, you can keep it consistent:
+   msg.m_command = command;
+   strlcpy(msg.m_gid, gid, sizeof(msg.m_gid));
+   msg.m_xact_bytes = xact_size;

+   key.subid = subid;
+   memcpy(key.gid, gid, sizeof(key.gid));
+   action = (create ? HASH_ENTER : HASH_FIND);

Regards,
Vignesh




RE: Failed transaction statistics to measure the logical replication progress

2021-11-10 Thread osumi.takami...@fujitsu.com
On Wednesday, November 10, 2021 3:43 PM Dilip Kumar  
wrote:
> On Tue, Nov 9, 2021 at 5:05 PM osumi.takami...@fujitsu.com
>  wrote:
> > On Tuesday, November 9, 2021 12:08 PM Greg Nancarrow
>  wrote:
> > > On Fri, Nov 5, 2021 at 7:11 PM osumi.takami...@fujitsu.com
> > >  wrote:
> > > >
> > >
> > > I did a quick scan through the latest v8 patch and noticed the following
> things:
> > I appreciate your review !
> I have reviewed some part of the patch and I have a few comments
I really appreciate your attention and review.

> 1.
> + 
> +  
> +   error_count bigint
> +  
> +  
> +   Number of transactions that failed to be applied by the table
> +   sync worker or main apply worker in this subscription.
> +   
> + 
> 
> The error_count, should be number of transaction failed to applied? or it 
> should
> be number of error?
I thought those were same and currently it gets incremented when an error of 
apply occurs.
Then, it equals to the number of total error. May I have the case when we
get different values between those two ? I can be missing something.

> 2.
> + 
> +  
> +   error_bytes bigint
> +  
> 
> How different is error_bytes from the abort_bytes?
By the error_bytes, you can see the consumed resources that
were acquired during apply but the applying processing stopped by some error.
On the other hand, abort_bytes displays bytes used for ROLLBACK PREPARED
and stream_abort processing. That's what I intended.


> 3.
> +{
> +size += *extra_data->stream_write_len;
> +add_apply_error_context_xact_size(size);
> +return;
> +}
> 
> From apply_handle_insert(), we are calling update_apply_change_size(), and
> inside this function we are dereferencing *extra_data->stream_write_len.
> Basically, stream_write_len is in integer pointer and the caller hasn't 
> allocated
> memory for that and inside update_apply_change_size, we are directly
> dereferencing the pointer, how this can be correct. 
I'm so sorry to make you confused.

I'll just delete the top part that handles streaming
bytes calculation in the update_apply_change_size().
It's because now that there is a specific structure to recognize each streaming 
xid
and save transaction size there, which makes the top part in question useless.

> And I also see that in the
> whole patch stream_write_len, is never used as lvalue so without storing
> anything into this why are we trying to use this as rvalue here?  This is 
> clearly
> an issue.
As described above, I'll fix this part and
related codes mainly streaming related codes in the next version.


Best Regards,
Takamichi Osumi



Re: Failed transaction statistics to measure the logical replication progress

2021-11-09 Thread Dilip Kumar
On Tue, Nov 9, 2021 at 5:05 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Tuesday, November 9, 2021 12:08 PM Greg Nancarrow  
> wrote:
> > On Fri, Nov 5, 2021 at 7:11 PM osumi.takami...@fujitsu.com
> >  wrote:
> > >
> >
> > I did a quick scan through the latest v8 patch and noticed the following 
> > things:
> I appreciate your review !
>

I have reviewed some part of the patch and I have a few comments

1.
+ 
+  
+   error_count bigint
+  
+  
+   Number of transactions that failed to be applied by the table
+   sync worker or main apply worker in this subscription.
+   
+ 

The error_count, should be number of transaction failed to applied? or
it should be number of error?

2.
+ 
+  
+   error_bytes bigint
+  

How different is error_bytes from the abort_bytes?

3.
+{
+size += *extra_data->stream_write_len;
+add_apply_error_context_xact_size(size);
+return;
+}

>From apply_handle_insert(), we are calling update_apply_change_size(),
and inside this function we are dereferencing
*extra_data->stream_write_len.  Basically, stream_write_len is in
integer pointer and the caller hasn't allocated memory for that and
inside update_apply_change_size, we are directly dereferencing the
pointer, how this can be correct.  And I also see that in the whole
patch stream_write_len, is never used as lvalue so without storing
anything into this why are we trying to use this as rvalue here?  This
is clearly an issue.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: Failed transaction statistics to measure the logical replication progress

2021-11-09 Thread osumi.takami...@fujitsu.com
On Tuesday, November 9, 2021 8:35 PM I wrote:
> Yes. I've rebased and updated the patch, paying attention to this point.
> Attached the updated version.
Forgot to note one thing.
This is based on the skip xid v20 shared in [1]

[1] - 
https://www.postgresql.org/message-id/CAD21AoAT42mhcqeB1jPfRL1%2BEUHbZk8MMY_fBgsyZvJeKNpG%2Bw%40mail.gmail.com

Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2021-11-09 Thread osumi.takami...@fujitsu.com
On Monday, November 8, 2021 3:12 PM vignesh C  wrote:
> On Fri, Nov 5, 2021 at 1:42 PM osumi.takami...@fujitsu.com
>  wrote:
> > Lastly, I removed one unnecessary test that checked publisher's stats
> > in the TAP tests.
> > Also I introduced ApplyTxnExtraData structure to remove void* argument
> > of update_apply_change_size that might worsen the readability of codes
> > in the previous version.
> 
> Thanks for the updated patch.
Thanks you for checking my patch !


> Few comments:
> 1)  You could remove LogicalRepPreparedTxnData,
> LogicalRepCommitPreparedTxnData & LogicalRepRollbackPreparedTxnData
> and change it to char *gid to reduce the function parameter and simiplify the
> assignment:
> + */
> +void
> +pgstat_report_subworker_twophase_xact(Oid subid, LogicalRepMsgType
> +command,
> +
>PgStat_Counter xact_size,
> +
>LogicalRepPreparedTxnData *prepare_data,
> +
>LogicalRepCommitPreparedTxnData *commit_data,
> +
>LogicalRepRollbackPreparedTxnData *rollback_data)
Fixed. 

 
> 2) Shouldn't this change be part of skip xid patch?
> -   TupleDescInitEntry(tupdesc, (AttrNumber) 4, "command",
> +   TupleDescInitEntry(tupdesc, (AttrNumber) 10,
> + "last_error_command",
>TEXTOID, -1, 0);
> -   TupleDescInitEntry(tupdesc, (AttrNumber) 5, "xid",
> +   TupleDescInitEntry(tupdesc, (AttrNumber) 11, "last_error_xid",
>XIDOID, -1, 0);
> -   TupleDescInitEntry(tupdesc, (AttrNumber) 6, "error_count",
> +   TupleDescInitEntry(tupdesc, (AttrNumber) 12, "last_error_count",
>INT8OID, -1, 0);
> -   TupleDescInitEntry(tupdesc, (AttrNumber) 7, "error_message",
> +   TupleDescInitEntry(tupdesc, (AttrNumber) 13,
> + "last_error_message",
>TEXTOID, -1, 0);
> -   TupleDescInitEntry(tupdesc, (AttrNumber) 8, "last_error_time",
> +   TupleDescInitEntry(tupdesc, (AttrNumber) 14, "last_error_time",
Hmm, I didn't think so. Those renames are necessary
to make exisiting columns of skip xid separate from newly-introduced xact stats.
That means, original names of skip xid columns in v20 by itself are fine
and the renames are needed only when this patch gets committed.
At present, we cannot guarantee that this patch will be committed
so I'd like to take care of those renames.

 
> 3) This newly added structures should be added to typedefs.list:
> ApplyTxnExtraData
> XactSizeEntry
> PgStat_MsgSubWorkerXactEnd
> PgStat_MsgSubWorkerTwophaseXact
> PgStat_StatSubWorkerPreparedXact
> PgStat_StatSubWorkerPreparedXactSize
Added.

> 4) We are not sending the transaction size in case of table sync, is this
> intentional, if so I felt we should document this in
> pg_stat_subscription_workers
> +   /* Report the success of table sync. */
> +   pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
> +
>   MyLogicalRepWorker->relid,
> +
>   0 /* no logical message type */,
> +
>   0 /* xact size */);
Right. Updated the doc description.
I added the description in a way that the bytes
stats are only for apply worker.

 
> 5) pg_stat_subscription_workers has a lot of columns, if we can reduce the
> column size the readability will improve, like xact_commit_count to
> commit_count, xact_commit_bytes to commit_bytes, etc
> +w.xact_commit_count,
> +w.xact_commit_bytes,
> +w.xact_error_count,
> +w.xact_error_bytes,
> +w.xact_abort_count,
> +w.xact_abort_bytes,
It makes sense. Those can be somehow redundant.

Tentatively, I renamed only columns' names exported to the users.
This is because changing internal data structure as well (e.g. removing
the PgStat_StatSubWorkerEntry's prefixes) causes duplication name
of 'error_count' members and changing such an internal data structure
of skip xid part will have a huge impact of other parts.
Kindly imagine a case that we add 'last_' prefix to the
all error statistics representing an error of the structure.
If you aren't satisfied with this change, please let me know.

Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2021-11-09 Thread osumi.takami...@fujitsu.com
On Tuesday, November 9, 2021 12:08 PM Greg Nancarrow  
wrote:
> On Fri, Nov 5, 2021 at 7:11 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> 
> I did a quick scan through the latest v8 patch and noticed the following 
> things:
I appreciate your review !

> src/backend/postmaster/pgstat.c
> 
> (1) pgstat_recv_subworker_twophase_xact()
> The copying from msg->m_gid to key.gid does not seem to be correct.
> strlen() is being called on a junk value, since key.gid has not been assigned 
> yet.
> It should be changed as follows:
> 
> BEFORE:
> + strlcpy(key.gid, msg->m_gid, strlen(key.gid));
> AFTER:
> + strlcpy(key.gid, msg->m_gid, sizeof(key.gid));
Fixed.
 
> (2) pgstat_get_subworker_prepared_txn()
> Similar to above, strlen() usage is not correct, and should use
> strlcpy() instead of memcpy().
> 
> BEFORE:
> + memcpy(key.gid, gid, strlen(key.gid));
> AFTER:
> + strlcpy(key.gid, gid, sizeof(key.gid));
Fixed.

 
> (3) stats_reset
> Note that the "stats_reset" column has been removed from the
> pg_stat_subscription_workers view in the underlying latest v20 patch.
Yes. I've rebased and updated the patch, paying attention to this point.
Attached the updated version.


Best Regards,
Takamichi Osumi



extend_xact_stats_of_subscription_worker_v9.patch
Description: extend_xact_stats_of_subscription_worker_v9.patch


Re: Failed transaction statistics to measure the logical replication progress

2021-11-08 Thread Greg Nancarrow
On Fri, Nov 5, 2021 at 7:11 PM osumi.takami...@fujitsu.com
 wrote:
>

I did a quick scan through the latest v8 patch and noticed the following things:

src/backend/postmaster/pgstat.c

(1) pgstat_recv_subworker_twophase_xact()
The copying from msg->m_gid to key.gid does not seem to be correct.
strlen() is being called on a junk value, since key.gid has not been
assigned yet.
It should be changed as follows:

BEFORE:
+ strlcpy(key.gid, msg->m_gid, strlen(key.gid));
AFTER:
+ strlcpy(key.gid, msg->m_gid, sizeof(key.gid));


(2) pgstat_get_subworker_prepared_txn()
Similar to above, strlen() usage is not correct, and should use
strlcpy() instead of memcpy().

BEFORE:
+ memcpy(key.gid, gid, strlen(key.gid));
AFTER:
+ strlcpy(key.gid, gid, sizeof(key.gid));

(3) stats_reset
Note that the "stats_reset" column has been removed from the
pg_stat_subscription_workers view in the underlying latest v20 patch.


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Failed transaction statistics to measure the logical replication progress

2021-11-08 Thread Greg Nancarrow
On Fri, Nov 5, 2021 at 7:11 PM osumi.takami...@fujitsu.com
 wrote:
>
> I'm not sure about the last part.
> > additionally increase the available subscriber memory,
> Which GUC parameter did you mean by this ?
> Could we point out and enalrge the memory size only for
> subscriber's apply processing intentionally ?
> I incorporated (7) except for this last part.
> Will revise according to your reply.
>
I might have misinterpreted your original description, so I'll
re-review that in your latest patch.

As a newer version (v20) of the prerequisite patch was posted a day
ago, it looks like your patch needs to be rebased against that (as it
currently applies on top of the v19 version only).


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Failed transaction statistics to measure the logical replication progress

2021-11-07 Thread vignesh C
On Fri, Nov 5, 2021 at 1:42 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Thursday, November 4, 2021 9:54 AM  Greg Nancarrow  
> wrote:
> > On Tue, Nov 2, 2021 at 12:18 AM osumi.takami...@fujitsu.com
> >  wrote:
> > >
> > > On Thursday, October 28, 2021 11:19 PM I wrote:
> > > > I've created a new patch that extends pg_stat_subscription_workers
> > > > to include other transaction statistics.
> > > >
> > > > Note that this patch depends on v18 patch-set in [1]...
> > > Rebased based on the v19 in [1].
> > > Also, fixed documentation a little and made tests tidy.
> > > FYI, the newly included TAP test(027_worker_xact_stats.pl) is stable
> > > because I checked that 100 times of its execution in a tight loop all 
> > > passed.
> > >
> >
> > I have done some basic testing of the patch and have some initial review
> > comments:
> Thanks for your review !
>
> > (1) I think this patch needs to be in "git format-patch" format, with a 
> > proper
> > commit message that describes the purpose of the patch and the functionality
> > it adds, and any high-level design points (something like the overview 
> > given in
> > the initial post, and accounting for the subsequent discussion points and
> > updated functionality).
> Fixed.
>
> > (2) doc/src/sgml/monitoring.sgml
> > There are some grammatical issues in the current description. I suggest
> > changing it to something like:
> > BEFORE:
> > +  At least one row per subscription, showing about
> > transaction statistics and error summary that
> > AFTER:
> > +  At least one row per subscription, showing transaction
> > statistics and information about errors that
> Fixed.
>
> > (2) doc/src/sgml/monitoring.sgml
> > The current description seems a little confusing.
> > Per subscription, it shows the transaction statistics and any last error 
> > info from
> > tablesync/apply workers? If this is the case, I'd suggest the following 
> > change:
> >
> > BEFORE:
> > +   one row per subscription for transaction statistics and summary of the
> > last
> > +   error reported by workers applying logical replication changes and
> > workers
> > +   handling the initial data copy of the subscribed tables.
> > AFTER:
> > +   one row per subscription, showing corresponding transaction statistics
> > and
> > +   information about the last error reported by workers applying
> > logical replication
> > +   changes or by workers handling the initial data copy of the
> > subscribed tables.
> Fixed.
>
> > (3) xact_commit
> > I think that the "xact_commit" column should be named "xact_commit_count"
> > or "xact_commits".
> > Similarly, I think "xact_error" should be named "xact_error_count" or
> > "xact_errors", and "xact_aborts" should be named "xact_abort_count" or
> > "xact_aborts".
> I prefered *_count. Renamed.
>
> > (4) xact_commit_bytes
> >
> > +   Amount of transactions data successfully applied in this 
> > subscription.
> > +   Consumed memory for xact_commit is displayed.
> >
> > I find this description a bit confusing. "Consumed memory for xact_commit"
> > seems different to "transactions data".
> > Could the description be something like: Amount of data (in bytes) 
> > successfully
> > applied in this subscription, across "xact_commit_count"
> > transactions.
> Fixed.
>
> > (5)
> > I'd suggest some minor rewording for the following:
> >
> > BEFORE:
> > +   Number of transactions failed to be applied and caught by table
> > +   sync worker or main apply worker in this subscription.
> > AFTER:
> > +   Number of transactions that failed to be applied by the table
> > +   sync worker or main apply worker in this subscription.
> Fixed.
>
> > (6) xact_error_bytes
> > Again, it's a little confusing referring to "consumed memory" here.
> > How about rewording this, something like:
> >
> > BEFORE:
> > +   Amount of transactions data unsuccessfully applied in this
> > subscription.
> > +   Consumed memory that past failed transaction used is displayed.
> > AFTER:
> > +   Amount of data (in bytes) unsuccessfully applied in this
> > subscription by the last failed transaction.
> xact_error_bytes (and other bytes columns as well) is cumulative
> so when a new error happens, the size of this new bytes would be
> added to the same. So here we shouldn't mention just the last error.
> I simply applied your previous comments of 'xact_commit_bytes'
> to 'xact_error_bytes' description.
>
> > (7)
> > The additional information provided for "xact_abort_bytes" needs some
> > rewording, something like:
> >
> > BEFORE:
> > +   Increase logical_decoding_work_mem on the
> > publisher
> > +   so that it exceeds the size of whole streamed transaction
> > +   to suppress unnecessary consumed network bandwidth in addition to
> > change
> > +   in memory of the subscriber, if unexpected amount of streamed
> > transactions
> > +   are aborted.
> > AFTER:
> > +   In order to suppress unnecessary consumed network bandwidth,
> > increase
> > +

RE: Failed transaction statistics to measure the logical replication progress

2021-11-05 Thread osumi.takami...@fujitsu.com
On Thursday, November 4, 2021 9:54 AM  Greg Nancarrow  
wrote:
> On Tue, Nov 2, 2021 at 12:18 AM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Thursday, October 28, 2021 11:19 PM I wrote:
> > > I've created a new patch that extends pg_stat_subscription_workers
> > > to include other transaction statistics.
> > >
> > > Note that this patch depends on v18 patch-set in [1]...
> > Rebased based on the v19 in [1].
> > Also, fixed documentation a little and made tests tidy.
> > FYI, the newly included TAP test(027_worker_xact_stats.pl) is stable
> > because I checked that 100 times of its execution in a tight loop all 
> > passed.
> >
> 
> I have done some basic testing of the patch and have some initial review
> comments:
Thanks for your review !

> (1) I think this patch needs to be in "git format-patch" format, with a proper
> commit message that describes the purpose of the patch and the functionality
> it adds, and any high-level design points (something like the overview given 
> in
> the initial post, and accounting for the subsequent discussion points and
> updated functionality).
Fixed.

> (2) doc/src/sgml/monitoring.sgml
> There are some grammatical issues in the current description. I suggest
> changing it to something like:
> BEFORE:
> +  At least one row per subscription, showing about
> transaction statistics and error summary that
> AFTER:
> +  At least one row per subscription, showing transaction
> statistics and information about errors that
Fixed.

> (2) doc/src/sgml/monitoring.sgml
> The current description seems a little confusing.
> Per subscription, it shows the transaction statistics and any last error info 
> from
> tablesync/apply workers? If this is the case, I'd suggest the following 
> change:
> 
> BEFORE:
> +   one row per subscription for transaction statistics and summary of the
> last
> +   error reported by workers applying logical replication changes and
> workers
> +   handling the initial data copy of the subscribed tables.
> AFTER:
> +   one row per subscription, showing corresponding transaction statistics
> and
> +   information about the last error reported by workers applying
> logical replication
> +   changes or by workers handling the initial data copy of the
> subscribed tables.
Fixed.

> (3) xact_commit
> I think that the "xact_commit" column should be named "xact_commit_count"
> or "xact_commits".
> Similarly, I think "xact_error" should be named "xact_error_count" or
> "xact_errors", and "xact_aborts" should be named "xact_abort_count" or
> "xact_aborts".
I prefered *_count. Renamed.

> (4) xact_commit_bytes
> 
> +   Amount of transactions data successfully applied in this subscription.
> +   Consumed memory for xact_commit is displayed.
> 
> I find this description a bit confusing. "Consumed memory for xact_commit"
> seems different to "transactions data".
> Could the description be something like: Amount of data (in bytes) 
> successfully
> applied in this subscription, across "xact_commit_count"
> transactions.
Fixed.

> (5)
> I'd suggest some minor rewording for the following:
> 
> BEFORE:
> +   Number of transactions failed to be applied and caught by table
> +   sync worker or main apply worker in this subscription.
> AFTER:
> +   Number of transactions that failed to be applied by the table
> +   sync worker or main apply worker in this subscription.
Fixed.

> (6) xact_error_bytes
> Again, it's a little confusing referring to "consumed memory" here.
> How about rewording this, something like:
> 
> BEFORE:
> +   Amount of transactions data unsuccessfully applied in this
> subscription.
> +   Consumed memory that past failed transaction used is displayed.
> AFTER:
> +   Amount of data (in bytes) unsuccessfully applied in this
> subscription by the last failed transaction.
xact_error_bytes (and other bytes columns as well) is cumulative
so when a new error happens, the size of this new bytes would be
added to the same. So here we shouldn't mention just the last error.
I simply applied your previous comments of 'xact_commit_bytes'
to 'xact_error_bytes' description.

> (7)
> The additional information provided for "xact_abort_bytes" needs some
> rewording, something like:
> 
> BEFORE:
> +   Increase logical_decoding_work_mem on the
> publisher
> +   so that it exceeds the size of whole streamed transaction
> +   to suppress unnecessary consumed network bandwidth in addition to
> change
> +   in memory of the subscriber, if unexpected amount of streamed
> transactions
> +   are aborted.
> AFTER:
> +   In order to suppress unnecessary consumed network bandwidth,
> increase
> +   logical_decoding_work_mem on the publisher so
> that it
> +   exceeds the size of the whole streamed transaction, and
> additionally increase
> +   the available subscriber memory, if an unexpected amount of
> streamed transactions
> +   are aborted.

I'm not sure about the last part.
> additionally 

Re: Failed transaction statistics to measure the logical replication progress

2021-11-03 Thread Greg Nancarrow
On Tue, Nov 2, 2021 at 12:18 AM osumi.takami...@fujitsu.com
 wrote:
>
> On Thursday, October 28, 2021 11:19 PM I wrote:
> > I've created a new patch that extends pg_stat_subscription_workers to 
> > include
> > other transaction statistics.
> >
> > Note that this patch depends on v18 patch-set in [1]...
> Rebased based on the v19 in [1].
> Also, fixed documentation a little and made tests tidy.
> FYI, the newly included TAP test(027_worker_xact_stats.pl) is stable
> because I checked that 100 times of its execution in a tight loop all passed.
>

I have done some basic testing of the patch and have some initial
review comments:

(1) I think this patch needs to be in "git format-patch" format, with
a proper commit message that describes the purpose of the patch and
the functionality it adds, and any high-level design points (something
like the overview given in the initial post, and accounting for the
subsequent discussion points and updated functionality).

(2) doc/src/sgml/monitoring.sgml
There are some grammatical issues in the current description. I
suggest changing it to something like:

BEFORE:
+  At least one row per subscription, showing about
transaction statistics and error summary that
AFTER:
+  At least one row per subscription, showing transaction
statistics and information about errors that

(2) doc/src/sgml/monitoring.sgml
The current description seems a little confusing.
Per subscription, it shows the transaction statistics and any last
error info from tablesync/apply workers? If this is the case, I'd
suggest the following change:

BEFORE:
+   one row per subscription for transaction statistics and summary of the last
+   error reported by workers applying logical replication changes and workers
+   handling the initial data copy of the subscribed tables.
AFTER:
+   one row per subscription, showing corresponding transaction statistics and
+   information about the last error reported by workers applying
logical replication
+   changes or by workers handling the initial data copy of the
subscribed tables.

(3) xact_commit
I think that the "xact_commit" column should be named
"xact_commit_count" or "xact_commits".
Similarly, I think "xact_error" should be named "xact_error_count" or
"xact_errors", and "xact_aborts" should be named "xact_abort_count" or
"xact_aborts".

(4) xact_commit_bytes

+   Amount of transactions data successfully applied in this subscription.
+   Consumed memory for xact_commit is displayed.

I find this description a bit confusing. "Consumed memory for
xact_commit" seems different to "transactions data".
Could the description be something like: Amount of data (in bytes)
successfully applied in this subscription, across "xact_commit_count"
transactions.

(5)
I'd suggest some minor rewording for the following:

BEFORE:
+   Number of transactions failed to be applied and caught by table
+   sync worker or main apply worker in this subscription.
AFTER:
+   Number of transactions that failed to be applied by the table
+   sync worker or main apply worker in this subscription.

(6) xact_error_bytes
Again, it's a little confusing referring to "consumed memory" here.
How about rewording this, something like:

BEFORE:
+   Amount of transactions data unsuccessfully applied in this subscription.
+   Consumed memory that past failed transaction used is displayed.
AFTER:
+   Amount of data (in bytes) unsuccessfully applied in this
subscription by the last failed transaction.

(7)
The additional information provided for "xact_abort_bytes" needs some
rewording, something like:

BEFORE:
+   Increase logical_decoding_work_mem on the publisher
+   so that it exceeds the size of whole streamed transaction
+   to suppress unnecessary consumed network bandwidth in addition to change
+   in memory of the subscriber, if unexpected amount of streamed
transactions
+   are aborted.
AFTER:
+   In order to suppress unnecessary consumed network bandwidth, increase
+   logical_decoding_work_mem on the publisher so that it
+   exceeds the size of the whole streamed transaction, and
additionally increase
+   the available subscriber memory, if an unexpected amount of
streamed transactions
+   are aborted.

(8)
Suggested update:

BEFORE:
+ *  Tell the collector that worker transaction has finished without problem.
AFTER:
+ *  Tell the collector that the worker transaction has successfully completed.

(9) src/backend/postmaster/pgstat.c
I think that the GID copying is unnecessarily copying the whole GID
buffer or using an additional strlen().
It should be changed to use strlcpy() to match other code:

BEFORE:
+ /* get the gid for this two phase operation */
+ if (command == LOGICAL_REP_MSG_PREPARE ||
+   command == LOGICAL_REP_MSG_STREAM_PREPARE)
+   memcpy(msg.m_gid, prepare_data->gid, GIDSIZE);
+ else if (command == LOGICAL_REP_MSG_COMMIT_PREPARED)
+   memcpy(msg.m_gid, commit_data->gid, GIDSIZE);
+ else /* rollback 

RE: Failed transaction statistics to measure the logical replication progress

2021-11-02 Thread osumi.takami...@fujitsu.com
On Monday, November 1, 2021 10:18 PM I wrote:
> On Thursday, October 28, 2021 11:19 PM I wrote:
> > I've created a new patch that extends pg_stat_subscription_workers to
> > include other transaction statistics.
> >
> > Note that this patch depends on v18 patch-set in [1]...
> Rebased based on the v19 in [1].
I'd like to add one more explanation about the latest patches for review.


On Thursday, October 28, 2021 11:19 PM I wrote:
> (2)
> Updates of stats on the view at either commit prepared or rollback prepared
> time.
> This means we don't lost prepared transaction size even after server restart
> and user can see the results of two phase operation at those timings.
There is another reason I had to treat 'prepare' message like above.
Any reviewer might think that sending prepared bytes to stats collector
and making the bytes survive over the restart is too much.

But, we don't know if the prepared transaction is processed
by commit prepared or rollback prepared at 'prepare' time.
An execution of commit prepared needs to update the column of xact_commit and
xact_commit_bytes while rollback prepared does the column of xact_abort
and xact_abort_bytes where this patch is introducing.

Therefore, even when we can calculate the bytes of prepared txn at prepare time,
it's *not* possible to add the transaction bytes to either stats column of
bytes sizes and clean up the bytes. Then, there was a premise
that we should not lose the prepared txn bytes by the shutdown
so my patch has become the implementation described above.


Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2021-11-01 Thread osumi.takami...@fujitsu.com
On Thursday, October 28, 2021 11:19 PM I wrote:
> I've created a new patch that extends pg_stat_subscription_workers to include
> other transaction statistics.
> 
> Note that this patch depends on v18 patch-set in [1]...
Rebased based on the v19 in [1].
Also, fixed documentation a little and made tests tidy.
FYI, the newly included TAP test(027_worker_xact_stats.pl) is stable
because I checked that 100 times of its execution in a tight loop all passed.


[1] - 
https://www.postgresql.org/message-id/CAD21AoDY-9_x819F_m1_wfCVXXFJrGiSmR2MfC9Nw4nW8Om0qA%40mail.gmail.com


Best Regards,
Takamichi Osumi



extend_xact_stats_of_subscription_worker_v7.patch
Description: extend_xact_stats_of_subscription_worker_v7.patch


RE: Failed transaction statistics to measure the logical replication progress

2021-10-28 Thread osumi.takami...@fujitsu.com
On Thursday, October 21, 2021 10:19 AM Masahiko Sawada  
wrote:
> On Mon, Oct 18, 2021 at 7:03 PM Amit Kapila 
> wrote:
> >
> > On Thu, Oct 14, 2021 at 9:23 AM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Thursday, September 30, 2021 12:15 PM Amit Kapila
> > > 
> > > >
> > > > These all views are related to untransmitted to the collector but
> > > > what we really need is a view similar to pg_stat_archiver or
> > > > pg_stat_bgwriter which gives information about background workers.
> > > > Now, the problem as I see is if we go that route then
> > > > pg_stat_subscription will no longer remain dynamic view and one
> > > > might consider that as a compatibility break. The other idea I
> > > > have shared is that we display these stats under the new view
> > > > introduced by Sawada-San's patch [1] and probably rename that view
> > > > as pg_stat_subscription_worker where all the stats (xact info and
> > > > last failure information) about each worker will be displayed. Do
> > > > you have any opinion on that idea or do you see any problem with it?
> > >
> > > Personally, I think it seems reasonable to merge the xact stat into
> > > the view from sawada-san's patch.
> > >
> > > One problem I noticed is that pg_stat_subscription_error currently
> > > have a 'count' column which show how many times the last error
> > > happened. The xact stat here also have a similar value 'xact_error'.
> > > I think we might need to rename it or merge them into one in some way.
> > >
> > > Besides, if we decide to merge xact stat into
> > > pg_stat_subscription_error, some column seems need to be renamed.
> Maybe like:
> > > error_message => Last_error_message, command=>
> last_error_command..
> > >
> >
> > Don't you think that keeping the view name as
> > pg_stat_subscription_error would be a bit confusing if it has to
> > display xact_info? Isn't it better to change it to
> > pg_stat_subscription_worker or some other worker-specific generic
> > name?
> 
> I agree that it'd be better to include xact info to 
> pg_stat_subscription_errors
> view rather than making pg_stat_subscription a cumulative view. It would be
> more simple and consistent.
... 
>I'll change the view name in the next version patch.
Thanks a lot, Sawasa-san.

I've created a new patch that extends pg_stat_subscription_workers
to include other transaction statistics.

Note that this patch depends on v18 patch-set in [1]
and needs to be after the perl modules' namespace changes
conducted recently by commit b3b4d8e68ae83f432f43f035c7eb481ef93e1583.

There are other several major changes compared to the previous version.

(1)
Addressing several streaming transactions running in parallel on the pub that
can send unexpected order of partial data demarcated stream start and stream 
stop.
Even if one of them aborts, now bytes calculation should be correct.

(2)
Updates of stats on the view at either commit prepared or rollback prepared 
time.
This means we don't lost prepared transaction size even after server restart
and user can see the results of two phase operation at those timings.

(3)
Addition of TAP tests.

(4)
Renames of other existing columns so that those meanings are more apparent
in align with other new stats.

Some important details are written in the comments of the attached patch.

[1] - 
https://www.postgresql.org/message-id/CAD21AoCTtQgfy57AxB4q8KUOpRH8rkHN%3Ds_9p9Pvno_XoBK5wg%40mail.gmail.com

Best Regards,
Takamichi Osumi



extend_xact_stats_of_subscription_worker_v6.patch
Description: extend_xact_stats_of_subscription_worker_v6.patch


Re: Failed transaction statistics to measure the logical replication progress

2021-10-21 Thread Amit Kapila
On Thu, Oct 21, 2021 at 6:49 AM Masahiko Sawada  wrote:
>
> On Mon, Oct 18, 2021 at 7:03 PM Amit Kapila  wrote:
> >
> > On Thu, Oct 14, 2021 at 9:23 AM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Thursday, September 30, 2021 12:15 PM Amit Kapila 
> > > 
> > > >
> > > > These all views are related to untransmitted to the collector but what
> > > > we really need is a view similar to pg_stat_archiver or
> > > > pg_stat_bgwriter which gives information about background workers.
> > > > Now, the problem as I see is if we go that route then
> > > > pg_stat_subscription will no longer remain dynamic view and one might
> > > > consider that as a compatibility break. The other idea I have shared
> > > > is that we display these stats under the new view introduced by
> > > > Sawada-San's patch [1] and probably rename that view as
> > > > pg_stat_subscription_worker where all the stats (xact info and last
> > > > failure information) about each worker will be displayed. Do you have
> > > > any opinion on that idea or do you see any problem with it?
> > >
> > > Personally, I think it seems reasonable to merge the xact stat into the 
> > > view from
> > > sawada-san's patch.
> > >
> > > One problem I noticed is that pg_stat_subscription_error
> > > currently have a 'count' column which show how many times the last error
> > > happened. The xact stat here also have a similar value 'xact_error'. I 
> > > think we
> > > might need to rename it or merge them into one in some way.
> > >
> > > Besides, if we decide to merge xact stat into pg_stat_subscription_error, 
> > > some column
> > > seems need to be renamed. Maybe like:
> > > error_message => Last_error_message, command=> last_error_command..
> > >
> >
> > Don't you think that keeping the view name as
> > pg_stat_subscription_error would be a bit confusing if it has to
> > display xact_info? Isn't it better to change it to
> > pg_stat_subscription_worker or some other worker-specific generic
> > name?
>
> I agree that it'd be better to include xact info to
> pg_stat_subscription_errors view rather than making
> pg_stat_subscription a cumulative view. It would be more simple and
> consistent.
>
> The user might want to reset only either error stats or xact stats but
> we can do that by passing a value to the reset function. For example,
> pg_stat_reset_subscription_worker(10, 20, ‘error') for resetting only
> error stats whereas pg_stat_reset_subscription_worker(10, 20, ‘xact’)
> for resetting only xact stats etc (passing NULL or omitting the third
> argument means resetting all stats).
>

Sounds reasonable.

> I'll change the view name in the
> next version patch.
>

Thanks, that will be helpful.

-- 
With Regards,
Amit Kapila.




Re: Failed transaction statistics to measure the logical replication progress

2021-10-20 Thread Masahiko Sawada
On Mon, Oct 18, 2021 at 7:03 PM Amit Kapila  wrote:
>
> On Thu, Oct 14, 2021 at 9:23 AM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Thursday, September 30, 2021 12:15 PM Amit Kapila 
> > 
> > >
> > > These all views are related to untransmitted to the collector but what
> > > we really need is a view similar to pg_stat_archiver or
> > > pg_stat_bgwriter which gives information about background workers.
> > > Now, the problem as I see is if we go that route then
> > > pg_stat_subscription will no longer remain dynamic view and one might
> > > consider that as a compatibility break. The other idea I have shared
> > > is that we display these stats under the new view introduced by
> > > Sawada-San's patch [1] and probably rename that view as
> > > pg_stat_subscription_worker where all the stats (xact info and last
> > > failure information) about each worker will be displayed. Do you have
> > > any opinion on that idea or do you see any problem with it?
> >
> > Personally, I think it seems reasonable to merge the xact stat into the 
> > view from
> > sawada-san's patch.
> >
> > One problem I noticed is that pg_stat_subscription_error
> > currently have a 'count' column which show how many times the last error
> > happened. The xact stat here also have a similar value 'xact_error'. I 
> > think we
> > might need to rename it or merge them into one in some way.
> >
> > Besides, if we decide to merge xact stat into pg_stat_subscription_error, 
> > some column
> > seems need to be renamed. Maybe like:
> > error_message => Last_error_message, command=> last_error_command..
> >
>
> Don't you think that keeping the view name as
> pg_stat_subscription_error would be a bit confusing if it has to
> display xact_info? Isn't it better to change it to
> pg_stat_subscription_worker or some other worker-specific generic
> name?

I agree that it'd be better to include xact info to
pg_stat_subscription_errors view rather than making
pg_stat_subscription a cumulative view. It would be more simple and
consistent.

The user might want to reset only either error stats or xact stats but
we can do that by passing a value to the reset function. For example,
pg_stat_reset_subscription_worker(10, 20, ‘error') for resetting only
error stats whereas pg_stat_reset_subscription_worker(10, 20, ‘xact’)
for resetting only xact stats etc (passing NULL or omitting the third
argument means resetting all stats). I'll change the view name in the
next version patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




RE: Failed transaction statistics to measure the logical replication progress

2021-10-18 Thread osumi.takami...@fujitsu.com
On Monday, October 18, 2021 11:52 AM Hou, Zhijie/侯 志杰  
wrote:
> On Thursday, October 14, 2021 2:13 PM Osumi, Takamichi wrote:
> > On Thursday, October 14, 2021 12:54 PM Hou,
> Zhijie wrote:
> > > On Thursday, September 30, 2021 12:15 PM Amit Kapila
> > > > On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰wrote:
> > > > >
> > > > > On Tues, Sep 28, 2021 6:05 PM Amit Kapila
> > > > > 
> > > wrote:
> > > > > >
> > > > > > Can't we keep the current and new stats both in-memory and
> > > > > > persist on disk? So, the persistent stats data will be used to
> > > > > > fill the in-memory counters after restarting of workers,
> > > > > > otherwise, we will always refer to in-memory values.
> > > > >
> > > > > I think this approach works, but I have another concern about it.
> > > > >
> > > > > The current pg_stat_subscription view is listed as "Dynamic
> > > > > Statistics Views" in the document, the data in it seems about
> > > > > the worker process, and the view data shows what the current
> > > > > worker did. But if we keep the new xact stat persist, then it's
> > > > > not what the current worker did, it looks more related to the
> > > > > subscription historic data.
> > > > >
> > > >
> > > > I see your point.
> > > >
> > > > > Adding a new view seems resonalble, but it will bring another
> > > > > subscription related view which might be too much. OTOH, I can
> > > > > see there are already some different views[1] including xact
> > > > > stat, maybe adding another one is accepatble ?
> > > > >
> > > >
> > > > These all views are related to untransmitted to the collector but
> > > > what we really need is a view similar to pg_stat_archiver or
> > > > pg_stat_bgwriter which gives information about background workers.
> > > > Now, the problem as I see is if we go that route then
> > > > pg_stat_subscription will no longer remain dynamic view and one
> > > > might consider that as a compatibility break. The other idea I
> > > > have shared is that we display these stats under the new view
> > > > introduced by Sawada-San's patch [1] and probably rename that view
> > > > as pg_stat_subscription_worker where all the stats (xact info and
> > > > last failure information) about each worker will be displayed. Do
> > > > you have any opinion on that idea or do you see any problem with it?
> > >
> > > Personally, I think it seems reasonable to merge the xact stat into
> > > the view from sawada-san's patch.
> > >
> > > One problem I noticed is that pg_stat_subscription_error currently
> > > have a 'count' column which show how many times the last error
> > > happened. The xact stat here also have a similar value 'xact_error'.
> > > I think we might need to rename it or merge them into one in some way.
> > >
> > > Besides, if we decide to merge xact stat into
> > > pg_stat_subscription_error, some column seems need to be renamed.
> > Maybe like:
> > > error_message => Last_error_message, command=>
> last_error_command..
> > Yeah, we must make them distinguished clearly.
> >
> > I guessed that you are concerned about amount of renaming codes that
> > could be a bit large or you come up with a necessity to consider the
> > all column names of the pg_stat_subscription_worker together all at
> > once in advance.
> >
> > It's because my instant impression is, when we go with the current
> > xact stats column definitions (xact_commit, xact_commit_bytes,
> > xact_error, xact_error_bytes, xact_abort, xact_abort_bytes), the
> > renaming problem can be solved if I write one additional patch or
> > extend the main patch of xact stats to handle renaming.
> > (This can work to keep both threads independent from each other).
> >
> > Did you have some concern that cannot be handled by this way ?
> Hi,
> 
> Currently, I don't find some unsolvable issues in this approach.
Okay. Glad to hear that.
Then, I can restart my implementation with this direction.


Best Regards,
Takamichi Osumi
 

RE: Failed transaction statistics to measure the logical replication progress

2021-10-18 Thread houzj.f...@fujitsu.com
On Monday, October 18, 2021 6:04 PM Amit Kapila  wrote:
> On Thu, Oct 14, 2021 at 9:23 AM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Thursday, September 30, 2021 12:15 PM Amit Kapila
> > 
> > >
> > > These all views are related to untransmitted to the collector but
> > > what we really need is a view similar to pg_stat_archiver or
> > > pg_stat_bgwriter which gives information about background workers.
> > > Now, the problem as I see is if we go that route then
> > > pg_stat_subscription will no longer remain dynamic view and one
> > > might consider that as a compatibility break. The other idea I have
> > > shared is that we display these stats under the new view introduced
> > > by Sawada-San's patch [1] and probably rename that view as
> > > pg_stat_subscription_worker where all the stats (xact info and last
> > > failure information) about each worker will be displayed. Do you
> > > have any opinion on that idea or do you see any problem with it?
> >
> > Personally, I think it seems reasonable to merge the xact stat into
> > the view from sawada-san's patch.
> >
> > One problem I noticed is that pg_stat_subscription_error currently
> > have a 'count' column which show how many times the last error
> > happened. The xact stat here also have a similar value 'xact_error'. I
> > think we might need to rename it or merge them into one in some way.
> >
> > Besides, if we decide to merge xact stat into
> > pg_stat_subscription_error, some column seems need to be renamed.
> Maybe like:
> > error_message => Last_error_message, command=> last_error_command..
> >
> 
> Don't you think that keeping the view name as pg_stat_subscription_error
> would be a bit confusing if it has to display xact_info? Isn't it better to 
> change it
> to pg_stat_subscription_worker or some other worker-specific generic name?

Yes, I agreed that rename the view to pg_stat_subscription_worker or some
other worker-specific generic name is better if we decide to move forward
with this approach.

Best regards,
Hou zj


Re: Failed transaction statistics to measure the logical replication progress

2021-10-18 Thread Amit Kapila
On Thu, Oct 14, 2021 at 9:23 AM houzj.f...@fujitsu.com
 wrote:
>
> On Thursday, September 30, 2021 12:15 PM Amit Kapila 
> >
> > These all views are related to untransmitted to the collector but what
> > we really need is a view similar to pg_stat_archiver or
> > pg_stat_bgwriter which gives information about background workers.
> > Now, the problem as I see is if we go that route then
> > pg_stat_subscription will no longer remain dynamic view and one might
> > consider that as a compatibility break. The other idea I have shared
> > is that we display these stats under the new view introduced by
> > Sawada-San's patch [1] and probably rename that view as
> > pg_stat_subscription_worker where all the stats (xact info and last
> > failure information) about each worker will be displayed. Do you have
> > any opinion on that idea or do you see any problem with it?
>
> Personally, I think it seems reasonable to merge the xact stat into the view 
> from
> sawada-san's patch.
>
> One problem I noticed is that pg_stat_subscription_error
> currently have a 'count' column which show how many times the last error
> happened. The xact stat here also have a similar value 'xact_error'. I think 
> we
> might need to rename it or merge them into one in some way.
>
> Besides, if we decide to merge xact stat into pg_stat_subscription_error, 
> some column
> seems need to be renamed. Maybe like:
> error_message => Last_error_message, command=> last_error_command..
>

Don't you think that keeping the view name as
pg_stat_subscription_error would be a bit confusing if it has to
display xact_info? Isn't it better to change it to
pg_stat_subscription_worker or some other worker-specific generic
name?

-- 
With Regards,
Amit Kapila.




RE: Failed transaction statistics to measure the logical replication progress

2021-10-17 Thread houzj.f...@fujitsu.com
On Thursday, October 14, 2021 2:13 PM Osumi, Takamichi wrote:
> On Thursday, October 14, 2021 12:54 PM Hou, Zhijie 
> wrote:
> > On Thursday, September 30, 2021 12:15 PM Amit Kapila
> > > On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰wrote:
> > > >
> > > > On Tues, Sep 28, 2021 6:05 PM Amit Kapila
> > > > 
> > wrote:
> > > > >
> > > > > Can't we keep the current and new stats both in-memory and
> > > > > persist on disk? So, the persistent stats data will be used to
> > > > > fill the in-memory counters after restarting of workers,
> > > > > otherwise, we will always refer to in-memory values.
> > > >
> > > > I think this approach works, but I have another concern about it.
> > > >
> > > > The current pg_stat_subscription view is listed as "Dynamic
> > > > Statistics Views" in
> > > > the document, the data in it seems about the worker process, and
> > > > the view data
> > > > shows what the current worker did. But if we keep the new xact
> > > > stat persist, then it's not what the current worker did, it looks
> > > > more related to the subscription historic data.
> > > >
> > >
> > > I see your point.
> > >
> > > > Adding a new view seems resonalble, but it will bring another
> > > > subscription related view which might be too much. OTOH, I can see
> > > > there are already some
> > > > different views[1] including xact stat, maybe adding another one
> > > > is accepatble ?
> > > >
> > >
> > > These all views are related to untransmitted to the collector but
> > > what we really need is a view similar to pg_stat_archiver or
> > > pg_stat_bgwriter which gives information about background workers.
> > > Now, the problem as I see is if we go that route then
> > > pg_stat_subscription will no longer remain dynamic view and one
> > > might consider that as a compatibility break. The other idea I have
> > > shared is that we display these stats under the new view introduced
> > > by Sawada-San's patch [1] and probably rename that view as
> > > pg_stat_subscription_worker where all the stats (xact info and last
> > > failure information) about each worker will be displayed. Do you
> > > have any opinion on that idea or do you see any problem with it?
> >
> > Personally, I think it seems reasonable to merge the xact stat into
> > the view from sawada-san's patch.
> >
> > One problem I noticed is that pg_stat_subscription_error currently
> > have a 'count' column which show how many times the last error
> > happened. The xact stat here also have a similar value 'xact_error'. I
> > think we might need to rename it or merge them into one in some way.
> >
> > Besides, if we decide to merge xact stat into
> > pg_stat_subscription_error, some column seems need to be renamed.
> Maybe like:
> > error_message => Last_error_message, command=> last_error_command..
> Yeah, we must make them distinguished clearly.
> 
> I guessed that you are concerned about
> amount of renaming codes that could be a bit large or you come up with a
> necessity to consider the all column names of the pg_stat_subscription_worker
> together all at once in advance.
> 
> It's because my instant impression is,
> when we go with the current xact stats column definitions (xact_commit,
> xact_commit_bytes, xact_error, xact_error_bytes, xact_abort,
> xact_abort_bytes), the renaming problem can be solved if I write one
> additional patch or extend the main patch of xact stats to handle renaming.
> (This can work to keep both threads independent from each other).
> 
> Did you have some concern that cannot be handled by this way ?
Hi,

Currently, I don't find some unsolvable issues in this approach.

Best regards,
Hou zj



RE: Failed transaction statistics to measure the logical replication progress

2021-10-14 Thread osumi.takami...@fujitsu.com
On Thursday, October 14, 2021 12:54 PM Hou, Zhijie/侯 志杰 
 wrote:
> On Thursday, September 30, 2021 12:15 PM Amit Kapila
> 
> > On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰
>  wrote:
> > >
> > > On Tues, Sep 28, 2021 6:05 PM Amit Kapila 
> wrote:
> > > >
> > > > Can't we keep the current and new stats both in-memory and persist
> > > > on disk? So, the persistent stats data will be used to fill the
> > > > in-memory counters after restarting of workers, otherwise, we will
> > > > always refer to in-memory values.
> > >
> > > I think this approach works, but I have another concern about it.
> > >
> > > The current pg_stat_subscription view is listed as "Dynamic Statistics
> Views"
> > in
> > > the document, the data in it seems about the worker process, and the
> > > view
> > data
> > > shows what the current worker did. But if we keep the new xact stat
> > > persist, then it's not what the current worker did, it looks more
> > > related to the subscription historic data.
> > >
> >
> > I see your point.
> >
> > > Adding a new view seems resonalble, but it will bring another
> > > subscription related view which might be too much. OTOH, I can see
> > > there are already
> > some
> > > different views[1] including xact stat, maybe adding another one is
> > accepatble ?
> > >
> >
> > These all views are related to untransmitted to the collector but what
> > we really need is a view similar to pg_stat_archiver or
> > pg_stat_bgwriter which gives information about background workers.
> > Now, the problem as I see is if we go that route then
> > pg_stat_subscription will no longer remain dynamic view and one might
> > consider that as a compatibility break. The other idea I have shared
> > is that we display these stats under the new view introduced by
> > Sawada-San's patch [1] and probably rename that view as
> > pg_stat_subscription_worker where all the stats (xact info and last
> > failure information) about each worker will be displayed. Do you have
> > any opinion on that idea or do you see any problem with it?
> 
> Personally, I think it seems reasonable to merge the xact stat into the view 
> from
> sawada-san's patch.
> 
> One problem I noticed is that pg_stat_subscription_error currently have a
> 'count' column which show how many times the last error happened. The xact
> stat here also have a similar value 'xact_error'. I think we might need to 
> rename
> it or merge them into one in some way.
> 
> Besides, if we decide to merge xact stat into pg_stat_subscription_error, some
> column seems need to be renamed. Maybe like:
> error_message => Last_error_message, command=> last_error_command..
Yeah, we must make them distinguished clearly.

I guessed that you are concerned about
amount of renaming codes that could be a bit large
or you come up with a necessity to consider
the all column names of the pg_stat_subscription_worker
together all at once in advance.

It's because my instant impression is,
when we go with the current xact stats column definitions
(xact_commit, xact_commit_bytes, xact_error, xact_error_bytes,
xact_abort, xact_abort_bytes), the renaming problem can be solved
if I write one additional patch or extend the main patch of
xact stats to handle renaming.
(This can work to keep both threads independent
from each other).

Did you have some concern that cannot be handled by this way ?


Best Regards,
Takamichi Osumi



RE: Failed transaction statistics to measure the logical replication progress

2021-10-13 Thread houzj.f...@fujitsu.com
On Thursday, September 30, 2021 12:15 PM Amit Kapila 
> On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰  
> wrote:
> >
> > On Tues, Sep 28, 2021 6:05 PM Amit Kapila  wrote:
> > >
> > > Can't we keep the current and new stats both in-memory and persist on
> > > disk? So, the persistent stats data will be used to fill the in-memory
> > > counters after restarting of workers, otherwise, we will always refer
> > > to in-memory values.
> >
> > I think this approach works, but I have another concern about it.
> >
> > The current pg_stat_subscription view is listed as "Dynamic Statistics 
> > Views"
> in
> > the document, the data in it seems about the worker process, and the view
> data
> > shows what the current worker did. But if we keep the new xact stat persist,
> > then it's not what the current worker did, it looks more related to the
> > subscription historic data.
> >
> 
> I see your point.
> 
> > Adding a new view seems resonalble, but it will bring another subscription
> > related view which might be too much. OTOH, I can see there are already
> some
> > different views[1] including xact stat, maybe adding another one is
> accepatble ?
> >
> 
> These all views are related to untransmitted to the collector but what
> we really need is a view similar to pg_stat_archiver or
> pg_stat_bgwriter which gives information about background workers.
> Now, the problem as I see is if we go that route then
> pg_stat_subscription will no longer remain dynamic view and one might
> consider that as a compatibility break. The other idea I have shared
> is that we display these stats under the new view introduced by
> Sawada-San's patch [1] and probably rename that view as
> pg_stat_subscription_worker where all the stats (xact info and last
> failure information) about each worker will be displayed. Do you have
> any opinion on that idea or do you see any problem with it?

Personally, I think it seems reasonable to merge the xact stat into the view 
from
sawada-san's patch.

One problem I noticed is that pg_stat_subscription_error
currently have a 'count' column which show how many times the last error
happened. The xact stat here also have a similar value 'xact_error'. I think we
might need to rename it or merge them into one in some way.

Besides, if we decide to merge xact stat into pg_stat_subscription_error, some 
column
seems need to be renamed. Maybe like:
error_message => Last_error_message, command=> last_error_command..

Best regards,
Hou zj


RE: Failed transaction statistics to measure the logical replication progress

2021-10-01 Thread osumi.takami...@fujitsu.com
On Thursday, September 30, 2021 8:13 PM Amit Kapila  
wrote:
> On Thu, Sep 30, 2021 at 1:02 PM Osumi, Takamichi/大墨 昂道
>  wrote:
> >
> > On Thursday, September 30, 2021 1:15 PM Amit Kapila
>  wrote:
> > > On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰
> > >  wrote:
> > > >
> > > > On Tues, Sep 28, 2021 6:05 PM Amit Kapila
> > > > 
> > > wrote:
> > > > Adding a new view seems resonalble, but it will bring another
> > > > subscription related view which might be too much. OTOH, I can see
> > > > there are already some different views[1] including xact stat,
> > > > maybe adding
> > > another one is accepatble ?
> > > These all views are related to untransmitted to the collector but
> > > what we really need is a view similar to pg_stat_archiver or
> > > pg_stat_bgwriter which gives information about background workers.
> > > Now, the problem as I see is if we go that route then
> > > pg_stat_subscription will no longer remain dynamic view and one
> > > might consider that as a compatibility break. The other idea I have
> > > shared is that we display these stats under the new view introduced
> > > by Sawada-San's patch [1] and probably rename that view as
> > > pg_stat_subscription_worker where all the stats (xact info and last
> > > failure
> > > information) about each worker will be displayed.
> > Sorry, all the stats in pg_stat_subscription_worker view ?
> >
> > There was a discussion that
> > the xact info should be displayed from pg_stat_subscription with
> > existing stats in the same (which will be changed to persist), but
> > when your above proposal comes true, the list of
> > pg_stat_subscription_worker's columns will be something like below
> > (when I list up major columns).
> >
> > - subid, subrelid and some other relation attributes required
> > - 5 stats values moved from pg_stat_subscription
> >   received_lsn, last_msg_send_time, last_msg_receipt_time,
> >   latest_end_lsn, latest_end_time
> >
> 
> If we go with the view as pg_stat_subscription_worker, we don't need to move
> the existing stats of pg_stat_subscription into it.
Thank you for clarifying this point. I understand.


Best Regards,
Takamichi Osumi



  1   2   >