Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
On Fri, May 13, 2022 at 6:41 PM Robert Haas wrote: > > On Fri, Apr 29, 2022 at 5:11 AM Bharath Rupireddy > wrote: > > Here's the rebased v9 patch. > > This seems like it has enormous overlap with the existing > functionality that we have from log_startup_progress_interval. > > I think that facility is also better-designed than this one. It prints > out a message based on elapsed time, whereas this patch prints out a > message based progress through the WAL. That means that if WAL replay > isn't actually advancing for some reason, you just won't get any log > messages and you don't know whether it's advancing slowly or not at > all or the server is just hung. With that facility you can distinguish > those cases. > > Also, if for some reason we do think that amount of WAL replayed is > the right metric, rather than time, why would we only allow high=1 > segment and low=128 segments, rather than say any number of MB or GB > that the user would like to configure? > > I suggest that if log_startup_progress_interval doesn't meet your > needs here, we should try to understand why not and maybe enhance it, > instead of adding a separate facility. After thinking for a while, I agree with Robert and others that we could leverage the existing log_startup_progress_interval mechanism for reporting which WAL file currently is being replayed. I added current TLI (helps to construct the WAL file name from current LSN) and current WAL file source (helps to know where the WAL files was fetched from) to the existing "redo in progress, elapsed time:..." message. This very well serves the purpose of identifying the issues such as the restore command taking a lot of time (> log_startup_progress_interval for instance), WAL replay rate on standby or primary for long recoveries and so on. However, ereport_startup_progress isn't enabled on standby to not let it bloat the server logs. I believe the "redo in progress, elapsed time:..." message can provide some important info/metric for standby too and there's no way for the users to enable it on standbys today. For instance, users can know how well the standby fares in replaying, they can figure this out, by looking at two or more such messages. If enabled, with default value of 10 sec for log_startup_progress_interval, the standby can emit 8640 messages per day which is too much. I'm not sure if we are okay to change the default value of log_startup_progress_interval to say 1min or 5min so that 1440 messages are emitted. In production environments, typically users may or may not be interested if recovery takes just 10sec, but they really are interested if it takes in the order of minutes. Basically, I would like to enable "redo in progress, elapsed time:..." message for standbys too. Thoughts? PSA v10 patch with enhanced "redo in progress, elapsed time:..." message. Note that it's not a final patch though. Regards, Bharath Rupireddy. v10-0001-Add-WAL-recovery-info-to-startup-progress-log-me.patch Description: Binary data
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
On Fri, May 13, 2022 at 09:10:52AM -0400, Robert Haas wrote: > I suggest that if log_startup_progress_interval doesn't meet your > needs here, we should try to understand why not and maybe enhance it, > instead of adding a separate facility. +1. AFAICT it should be possible to make the log_startup_progress_interval machinery extensible enough to be reused for several other tasks that can take a while but have little visibility. Since this thread has been inactive for over 2 months, I'm marking the commitfest entry as Waiting on Author. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
On Tue, May 17, 2022 at 1:32 AM Bharath Rupireddy wrote: > In a production environment (of course with a better management of > server logs) one can set log_wal_traffic to "high" and emit the > required info to answer some of the customer questions like - "How far > the server is in recovery? How much time recovery of each WAL file > approximately took? How much time will it take to recover all the WAL > files? What's the rate of recovery - time per WAL file? etc." > > Whereas ereport_startup_progress facility will help to emit log > messages only if "some" operation takes longer than set > log_startup_progress_interval time which may not serve the above > purpose. I think you are misunderstanding what an "operation" is here. If you take the time to test out that facility, I think you will find that it is quite capable of answering all of those questions. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
On Fri, May 13, 2022 at 6:41 PM Robert Haas wrote: > > On Fri, Apr 29, 2022 at 5:11 AM Bharath Rupireddy > wrote: > > Here's the rebased v9 patch. > > This seems like it has enormous overlap with the existing > functionality that we have from log_startup_progress_interval. > > I think that facility is also better-designed than this one. It prints > out a message based on elapsed time, whereas this patch prints out a > message based progress through the WAL. That means that if WAL replay > isn't actually advancing for some reason, you just won't get any log > messages and you don't know whether it's advancing slowly or not at > all or the server is just hung. With that facility you can distinguish > those cases. > > Also, if for some reason we do think that amount of WAL replayed is > the right metric, rather than time, why would we only allow high=1 > segment and low=128 segments, rather than say any number of MB or GB > that the user would like to configure? > > I suggest that if log_startup_progress_interval doesn't meet your > needs here, we should try to understand why not and maybe enhance it, > instead of adding a separate facility. Thanks Robert! In a production environment (of course with a better management of server logs) one can set log_wal_traffic to "high" and emit the required info to answer some of the customer questions like - "How far the server is in recovery? How much time recovery of each WAL file approximately took? How much time will it take to recover all the WAL files? What's the rate of recovery - time per WAL file? etc." Whereas ereport_startup_progress facility will help to emit log messages only if "some" operation takes longer than set log_startup_progress_interval time which may not serve the above purpose. Actually, IMO a generic GUC log_file_processing_traffic = {none, medium, high} to help server emit logs for all the critical file processing operations - WAL file recovery (as proposed here in this thread), temp file processing during server startup or restarts (log_startup_progress_interval can't be used here as postmaster doesn't register for any timeouts) [1], snapshot and mapping file processing during checkpoint, temp relation files, removing old WAL files and so on. Thoughts? [1] https://www.postgresql.org/message-id/CALj2ACW-ELOF5JT2zPavs95wbZ0BrLPrqvSZ7Ac%2BpjxCkmXtEQ%40mail.gmail.com Regards, Bharath Rupireddy.
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
On Fri, Apr 29, 2022 at 5:11 AM Bharath Rupireddy wrote: > Here's the rebased v9 patch. This seems like it has enormous overlap with the existing functionality that we have from log_startup_progress_interval. I think that facility is also better-designed than this one. It prints out a message based on elapsed time, whereas this patch prints out a message based progress through the WAL. That means that if WAL replay isn't actually advancing for some reason, you just won't get any log messages and you don't know whether it's advancing slowly or not at all or the server is just hung. With that facility you can distinguish those cases. Also, if for some reason we do think that amount of WAL replayed is the right metric, rather than time, why would we only allow high=1 segment and low=128 segments, rather than say any number of MB or GB that the user would like to configure? I suggest that if log_startup_progress_interval doesn't meet your needs here, we should try to understand why not and maybe enhance it, instead of adding a separate facility. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
On Thu, May 5, 2022 at 2:07 PM Alvaro Herrera wrote: > > On 2022-May-05, Bharath Rupireddy wrote: > > > On Fri, Apr 29, 2022 at 4:11 PM Alvaro Herrera > > wrote: > > > > > > Did we ever consider the idea of using a new pg_stat_wal_activity_progress > > > view or something like that, using the backend_progress.c functionality? > > > I don't see it mentioned in the thread. > > > > IMO, progress reporting works well on a running server and at the > > moment. The WAL recovery/replay can happen even before the server > > opens up for connections > > It's definitely true that you wouldn't be able to use it for when the > server is not accepting connections. > > > and the progress report view can't be used > > for later analysis like how much time the restoring WAL files from > > archive location took > > This is true too -- progress report doesn't store historical data, only > current status. > > > and also the WAL file names can't be reported in progress reporting > > mechanism > > Also true. > > > (only integers columns, of course if required we can add text columns > > to pg_stat_get_progress_info). > > Yeah, I don't think adding text columns is terribly easy, because the > whole point of the progress reporting infrastructure is that it can be > updated very cheaply as atomic operations, and if you want to transmit > text columns, that's no longer possible. > > > Having the recovery info in server logs might help. > > I suppose it might. > > > I think reporting a long-running file processing operation (removing > > or syncing) within postgres is a generic problem (for snapshot, > > mapping, temporary (pgsql_tmp), temp relation files, old WAL file > > processing, WAL file processing during recovery etc.) and needs to be > > solved > > I agree up to here. > > > in two ways: 1) logging progress into server logs (which helps > > for analysis and report when the server isn't available for > > connections, crash recovery), a generic GUC > > log_file_processing_traffic = {none, medium, high} might help here > > (also proposed in [1]) and 2) pg_stat_file_processing_progress > > (extending progress reporting pg_stat_get_progress_info to have few > > text columns for current file name and directory path). > > I think using the server log to store telemetry data is not a great fit. > It bloats the log files and can be so slow as to block other operations > in the server. Server logs should normally be considered critical info > that's not okay to lose; telemetry tends to be of secondary importance > and in a pinch you can drop a few messages without hurting too much. > > We've done moderately okay so far with having some system views where > some telemetry readings can be obtained, but there several drawbacks to > that approach that we should at some point solve. My opinion on this is > that we need to bite the bullet and develop separate infrastructure for > reporting server metrics. > > I just > think that we should look into a new mechanism going forward. I completely agree that we must have new ways to report important server metrics in an easily consumable fashion - it could be something like server computing metrics (time taken, the file being processed etc.) important/time-taking operations such as execution of archive command, restore command, recovery, checkpoint, old WAL files removal/recycling, processing of various files - snapshot, mapping, pgsql_tmp files, temp relation files etc. and writing these metrics to a file or a stat table on the database itself (this table can have an automatic mechanism of clean up or limit on number of rows or size of the table so that the clients can read those metrics and use it in whichever way they want. To start with, a simple structure of the metrics can be {OPERATION text, START_TIME timestamp, END_TIME timestamp, few text, double and/or integer flexible columns that each operation can use to store metrics}. I can start a discussion in a separate thread to seek more thoughts on the server metrics part. > That said, I'm not opposed to having a patch somewhat as posted. Thanks. I'm thinking if we should have a generic GUC log_file_processing_traffic = {none, medium, high} that might help reporting other time taking file processing operations such as [1]. [1] https://www.postgresql.org/message-id/CALj2ACW-ELOF5JT2zPavs95wbZ0BrLPrqvSZ7Ac%2BpjxCkmXtEQ%40mail.gmail.com Regards, Bharath Rupireddy.
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
On 2022-May-05, Bharath Rupireddy wrote: > On Fri, Apr 29, 2022 at 4:11 PM Alvaro Herrera > wrote: > > > > Did we ever consider the idea of using a new pg_stat_wal_activity_progress > > view or something like that, using the backend_progress.c functionality? > > I don't see it mentioned in the thread. > > IMO, progress reporting works well on a running server and at the > moment. The WAL recovery/replay can happen even before the server > opens up for connections It's definitely true that you wouldn't be able to use it for when the server is not accepting connections. > and the progress report view can't be used > for later analysis like how much time the restoring WAL files from > archive location took This is true too -- progress report doesn't store historical data, only current status. > and also the WAL file names can't be reported in progress reporting > mechanism Also true. > (only integers columns, of course if required we can add text columns > to pg_stat_get_progress_info). Yeah, I don't think adding text columns is terribly easy, because the whole point of the progress reporting infrastructure is that it can be updated very cheaply as atomic operations, and if you want to transmit text columns, that's no longer possible. > Having the recovery info in server logs might help. I suppose it might. > I think reporting a long-running file processing operation (removing > or syncing) within postgres is a generic problem (for snapshot, > mapping, temporary (pgsql_tmp), temp relation files, old WAL file > processing, WAL file processing during recovery etc.) and needs to be > solved I agree up to here. > in two ways: 1) logging progress into server logs (which helps > for analysis and report when the server isn't available for > connections, crash recovery), a generic GUC > log_file_processing_traffic = {none, medium, high} might help here > (also proposed in [1]) and 2) pg_stat_file_processing_progress > (extending progress reporting pg_stat_get_progress_info to have few > text columns for current file name and directory path). I think using the server log to store telemetry data is not a great fit. It bloats the log files and can be so slow as to block other operations in the server. Server logs should normally be considered critical info that's not okay to lose; telemetry tends to be of secondary importance and in a pinch you can drop a few messages without hurting too much. We've done moderately okay so far with having some system views where some telemetry readings can be obtained, but there several drawbacks to that approach that we should at some point solve. My opinion on this is that we need to bite the bullet and develop separate infrastructure for reporting server metrics. That said, I'm not opposed to having a patch somewhat as posted. I just think that we should look into a new mechanism going forward. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
On Fri, Apr 29, 2022 at 4:11 PM Alvaro Herrera wrote: > > Did we ever consider the idea of using a new pg_stat_wal_activity_progress > view or something like that, using the backend_progress.c functionality? > I don't see it mentioned in the thread. IMO, progress reporting works well on a running server and at the moment. The WAL recovery/replay can happen even before the server opens up for connections and the progress report view can't be used for later analysis like how much time the restoring WAL files from archive location took and also the WAL file names can't be reported in progress reporting mechanism (only integers columns, of course if required we can add text columns to pg_stat_get_progress_info). Having the recovery info in server logs might help. > It's not an *exact* fit, since this is not about one "command" being > executed by a "backend", but it seems a near enough fit, and it would > unobtrusive enough that we don't need to worry about the progress-update > rate or have a GUC to determine how frequently to update (with a gauge > that's heavily workload-dependant and is likely to make little sense to > some users -- consider differently-sized WAL segments, for one thing). I think reporting a long-running file processing operation (removing or syncing) within postgres is a generic problem (for snapshot, mapping, temporary (pgsql_tmp), temp relation files, old WAL file processing, WAL file processing during recovery etc.) and needs to be solved in two ways: 1) logging progress into server logs (which helps for analysis and report when the server isn't available for connections, crash recovery), a generic GUC log_file_processing_traffic = {none, medium, high} might help here (also proposed in [1]) and 2) pg_stat_file_processing_progress (extending progress reporting pg_stat_get_progress_info to have few text columns for current file name and directory path). Thoughts? Regards, Bharath Rupireddy.
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
Did we ever consider the idea of using a new pg_stat_wal_activity_progress view or something like that, using the backend_progress.c functionality? I don't see it mentioned in the thread. It's not an *exact* fit, since this is not about one "command" being executed by a "backend", but it seems a near enough fit, and it would unobtrusive enough that we don't need to worry about the progress-update rate or have a GUC to determine how frequently to update (with a gauge that's heavily workload-dependant and is likely to make little sense to some users -- consider differently-sized WAL segments, for one thing). -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
On Sun, Feb 27, 2022 at 12:50 PM Bharath Rupireddy wrote: > > On Wed, Dec 29, 2021 at 6:50 AM Bharath Rupireddy > wrote: > > > > On Thu, Dec 9, 2021 at 9:28 PM Alvaro Herrera > > wrote: > > > > > > Maybe some tunable like > > > log_wal_traffic = { none, medium, high } > > > where "none" is current behavior of no noise, "medium" gets (say) once > > > every 256 segments (e.g., when going from XFF to (X+1)00), "high" gets > > > you one message per segment. > > > > On Fri, Dec 24, 2021 at 7:19 PM Justin Pryzby wrote: > > > > > > If you're talking about a new feature that uses the infrastructre from > > > 9ce3, > > > but is controlled by a separate GUC like log_wal_traffic, that could be > > > okay. > > > > Thanks for the suggestion. I've added a new GUC log_wal_traffic as > > suggested above. PSA v7 patch. > > Here's the rebased v8 patch, please review it. > > https://commitfest.postgresql.org/37/3443/ Here's the rebased v9 patch. Regards, Bharath Rupireddy. v9-0001-Add-WAL-recovery-messages-with-log_wal_traffic-GU.patch Description: Binary data
Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
On Wed, Dec 29, 2021 at 6:50 AM Bharath Rupireddy wrote: > > On Thu, Dec 9, 2021 at 9:28 PM Alvaro Herrera wrote: > > > > Maybe some tunable like > > log_wal_traffic = { none, medium, high } > > where "none" is current behavior of no noise, "medium" gets (say) once > > every 256 segments (e.g., when going from XFF to (X+1)00), "high" gets > > you one message per segment. > > On Fri, Dec 24, 2021 at 7:19 PM Justin Pryzby wrote: > > > > If you're talking about a new feature that uses the infrastructre from 9ce3, > > but is controlled by a separate GUC like log_wal_traffic, that could be > > okay. > > Thanks for the suggestion. I've added a new GUC log_wal_traffic as > suggested above. PSA v7 patch. Here's the rebased v8 patch, please review it. https://commitfest.postgresql.org/37/3443/ Regards, Bharath Rupireddy. v8-0001-Add-WAL-recovery-messages-with-log_wal_traffic-GU.patch Description: Binary data