Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)

2022-07-26 Thread Bharath Rupireddy
On Fri, May 13, 2022 at 6:41 PM Robert Haas  wrote:
>
> On Fri, Apr 29, 2022 at 5:11 AM Bharath Rupireddy
>  wrote:
> > Here's the rebased v9 patch.
>
> This seems like it has enormous overlap with the existing
> functionality that we have from log_startup_progress_interval.
>
> I think that facility is also better-designed than this one. It prints
> out a message based on elapsed time, whereas this patch prints out a
> message based progress through the WAL. That means that if WAL replay
> isn't actually advancing for some reason, you just won't get any log
> messages and you don't know whether it's advancing slowly or not at
> all or the server is just hung. With that facility you can distinguish
> those cases.
>
> Also, if for some reason we do think that amount of WAL replayed is
> the right metric, rather than time, why would we only allow high=1
> segment and low=128 segments, rather than say any number of MB or GB
> that the user would like to configure?
>
> I suggest that if log_startup_progress_interval doesn't meet your
> needs here, we should try to understand why not and maybe enhance it,
> instead of adding a separate facility.

After thinking for a while, I agree with Robert and others that we
could leverage the existing log_startup_progress_interval mechanism
for reporting which WAL file currently is being replayed. I added
current TLI (helps to construct the WAL file name from current LSN)
and current WAL file source (helps to know where the WAL files was
fetched from) to the existing "redo in progress, elapsed time:..."
message. This very well serves the purpose of identifying the issues
such as the restore command taking a lot of time (>
log_startup_progress_interval for instance), WAL replay rate on
standby or primary for long recoveries and so on.

However, ereport_startup_progress isn't enabled on standby to not let
it bloat the server logs. I believe the "redo in progress, elapsed
time:..." message can provide some important info/metric for standby
too and there's no way for the users to enable it on standbys today.
For instance, users can know how well the standby fares in replaying,
they can figure this out, by looking at two or more such messages. If
enabled, with default value of 10 sec for
log_startup_progress_interval, the standby can emit 8640 messages per
day which is too much. I'm not sure if we are okay to change the
default value of log_startup_progress_interval to say 1min or 5min so
that 1440 messages are emitted. In production environments, typically
users may or may not be interested if recovery takes just 10sec, but
they really are interested if it takes in the order of minutes.
Basically, I would like to enable "redo in progress, elapsed time:..."
message for standbys too.

Thoughts?

PSA v10 patch with enhanced "redo in progress, elapsed time:..."
message. Note that it's not a final patch though.

Regards,
Bharath Rupireddy.


v10-0001-Add-WAL-recovery-info-to-startup-progress-log-me.patch
Description: Binary data


Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)

2022-07-18 Thread Nathan Bossart
On Fri, May 13, 2022 at 09:10:52AM -0400, Robert Haas wrote:
> I suggest that if log_startup_progress_interval doesn't meet your
> needs here, we should try to understand why not and maybe enhance it,
> instead of adding a separate facility.

+1.  AFAICT it should be possible to make the log_startup_progress_interval
machinery extensible enough to be reused for several other tasks that can
take a while but have little visibility.

Since this thread has been inactive for over 2 months, I'm marking the
commitfest entry as Waiting on Author.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)

2022-05-17 Thread Robert Haas
On Tue, May 17, 2022 at 1:32 AM Bharath Rupireddy
 wrote:
> In a production environment (of course with a better management of
> server logs) one can set log_wal_traffic to "high" and emit the
> required info to answer some of the customer questions like - "How far
> the server is in recovery? How much time recovery of each WAL file
> approximately took? How much time will it take to recover all the WAL
> files? What's the rate of recovery - time per WAL file? etc."
>
> Whereas ereport_startup_progress facility will help to emit log
> messages only if "some" operation takes longer than set
> log_startup_progress_interval time which may not serve the above
> purpose.

I think you are misunderstanding what an "operation" is here. If you
take the time to test out that facility, I think you will find that it
is quite capable of answering all of those questions.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)

2022-05-16 Thread Bharath Rupireddy
On Fri, May 13, 2022 at 6:41 PM Robert Haas  wrote:
>
> On Fri, Apr 29, 2022 at 5:11 AM Bharath Rupireddy
>  wrote:
> > Here's the rebased v9 patch.
>
> This seems like it has enormous overlap with the existing
> functionality that we have from log_startup_progress_interval.
>
> I think that facility is also better-designed than this one. It prints
> out a message based on elapsed time, whereas this patch prints out a
> message based progress through the WAL. That means that if WAL replay
> isn't actually advancing for some reason, you just won't get any log
> messages and you don't know whether it's advancing slowly or not at
> all or the server is just hung. With that facility you can distinguish
> those cases.
>
> Also, if for some reason we do think that amount of WAL replayed is
> the right metric, rather than time, why would we only allow high=1
> segment and low=128 segments, rather than say any number of MB or GB
> that the user would like to configure?
>
> I suggest that if log_startup_progress_interval doesn't meet your
> needs here, we should try to understand why not and maybe enhance it,
> instead of adding a separate facility.

Thanks Robert!

In a production environment (of course with a better management of
server logs) one can set log_wal_traffic to "high" and emit the
required info to answer some of the customer questions like - "How far
the server is in recovery? How much time recovery of each WAL file
approximately took? How much time will it take to recover all the WAL
files? What's the rate of recovery - time per WAL file? etc."

Whereas ereport_startup_progress facility will help to emit log
messages only if "some" operation takes longer than set
log_startup_progress_interval time which may not serve the above
purpose.

Actually, IMO a generic GUC log_file_processing_traffic = {none,
medium, high} to help server emit logs for all the critical file
processing operations - WAL file recovery (as proposed here in this
thread), temp file processing during server startup or restarts
(log_startup_progress_interval can't be used here as postmaster
doesn't register for any timeouts) [1], snapshot and mapping file
processing during checkpoint, temp relation files, removing old WAL
files and so on.

Thoughts?

[1] 
https://www.postgresql.org/message-id/CALj2ACW-ELOF5JT2zPavs95wbZ0BrLPrqvSZ7Ac%2BpjxCkmXtEQ%40mail.gmail.com

Regards,
Bharath Rupireddy.




Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)

2022-05-13 Thread Robert Haas
On Fri, Apr 29, 2022 at 5:11 AM Bharath Rupireddy
 wrote:
> Here's the rebased v9 patch.

This seems like it has enormous overlap with the existing
functionality that we have from log_startup_progress_interval.

I think that facility is also better-designed than this one. It prints
out a message based on elapsed time, whereas this patch prints out a
message based progress through the WAL. That means that if WAL replay
isn't actually advancing for some reason, you just won't get any log
messages and you don't know whether it's advancing slowly or not at
all or the server is just hung. With that facility you can distinguish
those cases.

Also, if for some reason we do think that amount of WAL replayed is
the right metric, rather than time, why would we only allow high=1
segment and low=128 segments, rather than say any number of MB or GB
that the user would like to configure?

I suggest that if log_startup_progress_interval doesn't meet your
needs here, we should try to understand why not and maybe enhance it,
instead of adding a separate facility.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)

2022-05-13 Thread Bharath Rupireddy
On Thu, May 5, 2022 at 2:07 PM Alvaro Herrera  wrote:
>
> On 2022-May-05, Bharath Rupireddy wrote:
>
> > On Fri, Apr 29, 2022 at 4:11 PM Alvaro Herrera  
> > wrote:
> > >
> > > Did we ever consider the idea of using a new pg_stat_wal_activity_progress
> > > view or something like that, using the backend_progress.c functionality?
> > > I don't see it mentioned in the thread.
> >
> > IMO, progress reporting works well on a running server and at the
> > moment. The WAL recovery/replay can happen even before the server
> > opens up for connections
>
> It's definitely true that you wouldn't be able to use it for when the
> server is not accepting connections.
>
> > and the progress report view can't be used
> > for later analysis like how much time the restoring WAL files from
> > archive location took
>
> This is true too -- progress report doesn't store historical data, only
> current status.
>
> > and also the WAL file names can't be reported in progress reporting
> > mechanism
>
> Also true.
>
> > (only integers columns, of course if required we can add text columns
> > to pg_stat_get_progress_info).
>
> Yeah, I don't think adding text columns is terribly easy, because the
> whole point of the progress reporting infrastructure is that it can be
> updated very cheaply as atomic operations, and if you want to transmit
> text columns, that's no longer possible.
>
> > Having the recovery info in server logs might help.
>
> I suppose it might.
>
> > I think reporting a long-running file processing operation (removing
> > or syncing) within postgres is a generic problem (for snapshot,
> > mapping, temporary (pgsql_tmp), temp relation files, old WAL file
> > processing, WAL file processing during recovery etc.) and needs to be
> > solved
>
> I agree up to here.
>
> > in two ways: 1) logging progress into server logs (which helps
> > for analysis and report when the server isn't available for
> > connections, crash recovery), a generic GUC
> > log_file_processing_traffic = {none, medium, high} might help here
> > (also proposed in [1]) and 2) pg_stat_file_processing_progress
> > (extending progress reporting pg_stat_get_progress_info to have few
> > text columns for current file name and directory path).
>
> I think using the server log to store telemetry data is not a great fit.
> It bloats the log files and can be so slow as to block other operations
> in the server.  Server logs should normally be considered critical info
> that's not okay to lose; telemetry tends to be of secondary importance
> and in a pinch you can drop a few messages without hurting too much.
>
> We've done moderately okay so far with having some system views where
> some telemetry readings can be obtained, but there several drawbacks to
> that approach that we should at some point solve.  My opinion on this is
> that we need to bite the bullet and develop separate infrastructure for
> reporting server metrics.
>
> I just
> think that we should look into a new mechanism going forward.

I completely agree that we must have new ways to report important
server metrics in an easily consumable fashion - it could be something
like server computing metrics (time taken, the file being processed
etc.) important/time-taking operations such as execution of archive
command, restore command, recovery, checkpoint, old WAL files
removal/recycling, processing of various files - snapshot, mapping,
pgsql_tmp files, temp relation files etc. and writing these metrics to
a file or a stat table on the database itself (this table can have an
automatic mechanism of clean up or limit on number of rows or size of
the table so that the clients can read those metrics and use it in
whichever way they want. To start with, a simple structure of the
metrics can be {OPERATION text, START_TIME timestamp, END_TIME
timestamp, few text, double and/or integer flexible columns that each
operation can use to store metrics}.

I can start a discussion in a separate thread to seek more thoughts on
the server metrics part.

> That said, I'm not opposed to having a patch somewhat as posted.

Thanks. I'm thinking if we should have a generic GUC
log_file_processing_traffic = {none, medium, high} that might help
reporting other time taking file processing operations such as [1].

[1] 
https://www.postgresql.org/message-id/CALj2ACW-ELOF5JT2zPavs95wbZ0BrLPrqvSZ7Ac%2BpjxCkmXtEQ%40mail.gmail.com

Regards,
Bharath Rupireddy.




Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)

2022-05-05 Thread Alvaro Herrera
On 2022-May-05, Bharath Rupireddy wrote:

> On Fri, Apr 29, 2022 at 4:11 PM Alvaro Herrera  
> wrote:
> >
> > Did we ever consider the idea of using a new pg_stat_wal_activity_progress
> > view or something like that, using the backend_progress.c functionality?
> > I don't see it mentioned in the thread.
> 
> IMO, progress reporting works well on a running server and at the
> moment. The WAL recovery/replay can happen even before the server
> opens up for connections

It's definitely true that you wouldn't be able to use it for when the
server is not accepting connections.

> and the progress report view can't be used
> for later analysis like how much time the restoring WAL files from
> archive location took

This is true too -- progress report doesn't store historical data, only
current status.

> and also the WAL file names can't be reported in progress reporting
> mechanism

Also true.

> (only integers columns, of course if required we can add text columns
> to pg_stat_get_progress_info).

Yeah, I don't think adding text columns is terribly easy, because the
whole point of the progress reporting infrastructure is that it can be
updated very cheaply as atomic operations, and if you want to transmit
text columns, that's no longer possible.

> Having the recovery info in server logs might help.

I suppose it might.

> I think reporting a long-running file processing operation (removing
> or syncing) within postgres is a generic problem (for snapshot,
> mapping, temporary (pgsql_tmp), temp relation files, old WAL file
> processing, WAL file processing during recovery etc.) and needs to be
> solved

I agree up to here.

> in two ways: 1) logging progress into server logs (which helps
> for analysis and report when the server isn't available for
> connections, crash recovery), a generic GUC
> log_file_processing_traffic = {none, medium, high} might help here
> (also proposed in [1]) and 2) pg_stat_file_processing_progress
> (extending progress reporting pg_stat_get_progress_info to have few
> text columns for current file name and directory path).

I think using the server log to store telemetry data is not a great fit.
It bloats the log files and can be so slow as to block other operations
in the server.  Server logs should normally be considered critical info
that's not okay to lose; telemetry tends to be of secondary importance
and in a pinch you can drop a few messages without hurting too much.

We've done moderately okay so far with having some system views where
some telemetry readings can be obtained, but there several drawbacks to
that approach that we should at some point solve.  My opinion on this is
that we need to bite the bullet and develop separate infrastructure for
reporting server metrics.

That said, I'm not opposed to having a patch somewhat as posted.  I just
think that we should look into a new mechanism going forward.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)

2022-05-05 Thread Bharath Rupireddy
On Fri, Apr 29, 2022 at 4:11 PM Alvaro Herrera  wrote:
>
> Did we ever consider the idea of using a new pg_stat_wal_activity_progress
> view or something like that, using the backend_progress.c functionality?
> I don't see it mentioned in the thread.

IMO, progress reporting works well on a running server and at the
moment. The WAL recovery/replay can happen even before the server
opens up for connections and the progress report view can't be used
for later analysis like how much time the restoring WAL files from
archive location took and also the WAL file names can't be reported in
progress reporting mechanism (only integers columns, of course if
required we can add text columns to pg_stat_get_progress_info). Having
the recovery info in server logs might help.

> It's not an *exact* fit, since this is not about one "command" being
> executed by a "backend", but it seems a near enough fit, and it would
> unobtrusive enough that we don't need to worry about the progress-update
> rate or have a GUC to determine how frequently to update (with a gauge
> that's heavily workload-dependant and is likely to make little sense to
> some users -- consider differently-sized WAL segments, for one thing).

I think reporting a long-running file processing operation (removing
or syncing) within postgres is a generic problem (for snapshot,
mapping, temporary (pgsql_tmp), temp relation files, old WAL file
processing, WAL file processing during recovery etc.) and needs to be
solved in two ways: 1) logging progress into server logs (which helps
for analysis and report when the server isn't available for
connections, crash recovery), a generic GUC
log_file_processing_traffic = {none, medium, high} might help here
(also proposed in [1]) and 2) pg_stat_file_processing_progress
(extending progress reporting pg_stat_get_progress_info to have few
text columns for current file name and directory path).

Thoughts?

Regards,
Bharath Rupireddy.




Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)

2022-04-29 Thread Alvaro Herrera
Did we ever consider the idea of using a new pg_stat_wal_activity_progress
view or something like that, using the backend_progress.c functionality?
I don't see it mentioned in the thread.

It's not an *exact* fit, since this is not about one "command" being
executed by a "backend", but it seems a near enough fit, and it would
unobtrusive enough that we don't need to worry about the progress-update
rate or have a GUC to determine how frequently to update (with a gauge
that's heavily workload-dependant and is likely to make little sense to
some users -- consider differently-sized WAL segments, for one thing).

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)

2022-04-29 Thread Bharath Rupireddy
On Sun, Feb 27, 2022 at 12:50 PM Bharath Rupireddy
 wrote:
>
> On Wed, Dec 29, 2021 at 6:50 AM Bharath Rupireddy
>  wrote:
> >
> > On Thu, Dec 9, 2021 at 9:28 PM Alvaro Herrera  
> > wrote:
> > >
> > > Maybe some tunable like
> > > log_wal_traffic = { none, medium, high }
> > > where "none" is current behavior of no noise, "medium" gets (say) once
> > > every 256 segments (e.g., when going from XFF to (X+1)00), "high" gets
> > > you one message per segment.
> >
> > On Fri, Dec 24, 2021 at 7:19 PM Justin Pryzby  wrote:
> > >
> > > If you're talking about a new feature that uses the infrastructre from 
> > > 9ce3,
> > > but is controlled by a separate GUC like log_wal_traffic, that could be 
> > > okay.
> >
> > Thanks for the suggestion. I've added a new GUC log_wal_traffic as
> > suggested above. PSA  v7 patch.
>
> Here's the rebased v8 patch, please review it.
>
> https://commitfest.postgresql.org/37/3443/

Here's the rebased v9 patch.

Regards,
Bharath Rupireddy.


v9-0001-Add-WAL-recovery-messages-with-log_wal_traffic-GU.patch
Description: Binary data


Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)

2022-02-26 Thread Bharath Rupireddy
On Wed, Dec 29, 2021 at 6:50 AM Bharath Rupireddy
 wrote:
>
> On Thu, Dec 9, 2021 at 9:28 PM Alvaro Herrera  wrote:
> >
> > Maybe some tunable like
> > log_wal_traffic = { none, medium, high }
> > where "none" is current behavior of no noise, "medium" gets (say) once
> > every 256 segments (e.g., when going from XFF to (X+1)00), "high" gets
> > you one message per segment.
>
> On Fri, Dec 24, 2021 at 7:19 PM Justin Pryzby  wrote:
> >
> > If you're talking about a new feature that uses the infrastructre from 9ce3,
> > but is controlled by a separate GUC like log_wal_traffic, that could be 
> > okay.
>
> Thanks for the suggestion. I've added a new GUC log_wal_traffic as
> suggested above. PSA  v7 patch.

Here's the rebased v8 patch, please review it.

https://commitfest.postgresql.org/37/3443/

Regards,
Bharath Rupireddy.


v8-0001-Add-WAL-recovery-messages-with-log_wal_traffic-GU.patch
Description: Binary data