[HACKERS] Re: [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby
On Thu, Jul 07, 2016 at 03:38:26PM +0900, Michael Paquier wrote: > On Thu, Jul 7, 2016 at 12:57 AM, Marco Nenciarini >wrote: > > After further analysis, the issue is that we retrieve the starttli from > > the ControlFile structure, but it was using ThisTimeLineID when writing > > the backup label. > > > > I've attached a very simple patch that fixes it. > > ThisTimeLineID is always set at 0 on purpose on a standby, so we > cannot rely on it (well it is set temporarily when recycling old > segments). At recovery when parsing the backup_label file there is no > actual use of the start segment name, so that's only a cosmetic > change. But surely it would be better to get that fixed, because > that's useful for debugging. > > While looking at your patch, I thought that it would have been > tempting to use GetXLogReplayRecPtr() to get the timeline ID when in > recovery, but what we really want to know here is the timeline of the > last REDO pointer, which is starttli, and that's more consistent with > the fact that we use startpoint when writing the backup_label file. In > short, +1 for this fix. > > I am adding that in the list of open items, adding Magnus in CC whose > commit for non-exclusive backups is at the origin of this defect. [Action required within 72 hours. This is a generic notification.] The above-described topic is currently a PostgreSQL 9.6 open item. Magnus, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a 9.6 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within 72 hours of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping 9.6rc1. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] http://www.postgresql.org/message-id/20160527025039.ga447...@tornado.leadboat.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby
On 08/07/16 13:10, Michael Paquier wrote: > On Fri, Jul 8, 2016 at 6:40 PM, Marco Nenciarini >wrote: >> The resulting backup is working perfectly, because Postgres has no use >> for pg_stop_backup LSN, but this can confuse any tool that uses the stop >> LSN to figure out which WAL files are needed by the backup (in this case >> the only file needed is the one containing the start checkpoint). >> >> After some discussion with Álvaro, my proposal is to avoid that by >> returning the stoppoint as the maximum between the startpoint and the >> min_recovery_end_location, in case of backup from the standby. > > You are facing a pattern similar to the problem reported already on > this thread by Horiguchi-san: > http://www.postgresql.org/message-id/20160609.215558.118976703.horiguchi.kyot...@lab.ntt.co.jp > And it seems to me that you are jumping to an incorrect conclusion, > what we'd want to do is to update a bit more aggressively the minimum > recovery point in cases on a node in recovery in the case where no > buffers are flushed by other backends. > Yes, it is exactly the same bug. My proposal was based on the assumption that it were only a cosmetic issue, but given that it can trigger errors, I agree that the right solution is to advance the minimum recovery point in that case. Regards, Marco -- Marco Nenciarini - 2ndQuadrant Italy PostgreSQL Training, Services and Support marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it signature.asc Description: OpenPGP digital signature
[HACKERS] Re: [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby
On 07/07/16 08:38, Michael Paquier wrote: > On Thu, Jul 7, 2016 at 12:57 AM, Marco Nenciarini >wrote: >> After further analysis, the issue is that we retrieve the starttli from >> the ControlFile structure, but it was using ThisTimeLineID when writing >> the backup label. >> >> I've attached a very simple patch that fixes it. > > ThisTimeLineID is always set at 0 on purpose on a standby, so we > cannot rely on it (well it is set temporarily when recycling old > segments). At recovery when parsing the backup_label file there is no > actual use of the start segment name, so that's only a cosmetic > change. But surely it would be better to get that fixed, because > that's useful for debugging. > > While looking at your patch, I thought that it would have been > tempting to use GetXLogReplayRecPtr() to get the timeline ID when in > recovery, but what we really want to know here is the timeline of the > last REDO pointer, which is starttli, and that's more consistent with > the fact that we use startpoint when writing the backup_label file. In > short, +1 for this fix. > > I am adding that in the list of open items, adding Magnus in CC whose > commit for non-exclusive backups is at the origin of this defect. > While we were testing the patch we noticed another behavior that is not strictly a bug, but can confuse backup tools: To quickly produce some WAL files we were executing a series of pg_switch_xlog+CHECKPOINT, and we noticed that doing a backup from a standby after that results in a startpoint higher than the stoppoint. Let me show it on a brand new master/replica cluster (master is port 5496, replica is 6496). The script is attached. --- You are now connected to database "postgres" as user "postgres" via socket in "/tmp" at port "5496". SELECT pg_is_in_recovery(); -[ RECORD 1 ]-+-- pg_is_in_recovery | f CHECKPOINT; CHECKPOINT SELECT pg_switch_xlog(); -[ RECORD 1 ]--+-- pg_switch_xlog | 0/3E8 CHECKPOINT; CHECKPOINT SELECT pg_switch_xlog(); -[ RECORD 1 ]--+-- pg_switch_xlog | 0/4E8 You are now connected to database "postgres" as user "postgres" via socket in "/tmp" at port "6496". SELECT pg_is_in_recovery(); -[ RECORD 1 ]-+-- pg_is_in_recovery | t SELECT pg_start_backup('tst backup',TRUE,FALSE); -[ RECORD 1 ]---+-- pg_start_backup | 0/428 SELECT * FROM pg_stop_backup(FALSE); -[ RECORD 1 ]- lsn| 0/2F8 labelfile | START WAL LOCATION: 0/428 (file 0004)+ | CHECKPOINT LOCATION: 0/460 + | BACKUP METHOD: streamed + | BACKUP FROM: standby + | START TIME: 2016-07-08 10:46:55 CEST + | LABEL: tst backup+ | spcmapfile | SELECT * FROM pg_control_checkpoint(); -[ RECORD 1 ]+- checkpoint_location | 0/460 prior_location | 0/260 redo_location| 0/428 redo_wal_file| 00010004 timeline_id | 1 prev_timeline_id | 1 full_page_writes | t next_xid | 0:865 next_oid | 12670 next_multixact_id| 1 next_multi_offset| 0 oldest_xid | 858 oldest_xid_dbid | 1 oldest_active_xid| 865 oldest_multi_xid | 1 oldest_multi_dbid| 1 oldest_commit_ts_xid | 865 newest_commit_ts_xid | 865 checkpoint_time | 2016-07-08 10:46:55+02 SELECT * FROM pg_control_recovery(); -[ RECORD 1 ]-+-- min_recovery_end_location | 0/2F8 min_recovery_end_timeline | 1 backup_start_location | 0/0 backup_end_location | 0/0 end_of_backup_record_required | f --- In particular, the pg_start_backup LSN is 0/428 and the pg_stop_backup LSN is 0/2F8. The same issue is present when you do a backup using pg_basebackup: --- transaction log start point: 0/828 on timeline 1 pg_basebackup: starting background WAL receiver 22244/22244 kB (100%), 1/1 tablespace transaction log end point: 0/2F8 pg_basebackup: waiting for background process to finish streaming ... pg_basebackup: base backup completed --- The resulting backup is working perfectly, because Postgres has no use for pg_stop_backup LSN, but this can confuse any tool that uses the stop LSN to figure out which WAL files are needed by the backup (in this case the only file needed is the one containing the start checkpoint). After some discussion with Álvaro, my proposal is to avoid that by returning the stoppoint as the
[HACKERS] Re: [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby
On 06/07/16 17:41, Marco Nenciarini wrote: > On 06/07/16 17:37, Marco Nenciarini wrote: >> Hi, >> >> On 06/07/16 17:07, francesco.cano...@2ndquadrant.it wrote: >>> The following bug has been logged on the website: >>> >>> Bug reference: 14230 >>> Logged by: Francesco Canovai >>> Email address: francesco.cano...@2ndquadrant.it >>> PostgreSQL version: 9.6beta2 >>> Operating system: Linux >>> Description: >>> >>> I'm taking a concurrent backup from a standby in PostgreSQL beta2 and I get >>> the wrong timeline from pg_stop_backup(false). >>> >>> This is what I'm doing: >>> >>> 1) I set up an environment with a primary server and a replica in streaming >>> replication. >>> >>> 2) On the replica, I run >>> >>> postgres=# SELECT pg_start_backup('test_backup', true, false); >>> pg_start_backup >>> - >>> 0/3000A00 >>> (1 row) >>> >>> 3) When I run pg_stop_backup, it returns a start wal location belonging to a >>> file with timeline 0. >>> >>> postgres=# SELECT pg_stop_backup(false); >>> pg_stop_backup >>> >>> --- >>> (0/3000AE0,"START WAL LOCATION: 0/3000A00 (file >>> 0003)+ >>> CHECKPOINT LOCATION: 0/3000A38 >>> + >>> BACKUP METHOD: streamed >>> + >>> BACKUP FROM: standby >>> + >>> START TIME: 2016-07-06 16:44:31 CEST >>> + >>> LABEL: test_backup >>> + >>> ","") >>> (1 row) >>> >>> The timeline returned is fine (is 1) when running the same commands on the >>> master. >>> >>> An incorrect backup label doesn't prevent PostgreSQL from starting up, but >>> it affects the tools using that information. >>> >>> >> >> The issue here is that the do_pg_stop_backup function uses the >> ThisTimeLineID variable that is not valid on standbys. >> >> I think that it should read it from >> ControlFile->checkPointCopy.ThisTimeLineID as we do in do_pg_start_backup. >> > > No, that's not the solution. > > The backup_label is generated during the do_pg_start_backup call, so > also the copy in ControlFile->checkPointCopy.ThisTimeLineID is > uninitialized. > After further analysis, the issue is that we retrieve the starttli from the ControlFile structure, but it was using ThisTimeLineID when writing the backup label. I've attached a very simple patch that fixes it. Regards, Marco -- Marco Nenciarini - 2ndQuadrant Italy PostgreSQL Training, Services and Support marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e4645a3..aecede1 100644 *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** do_pg_start_backup(const char *backupids *** 9974,9980 } while (!gotUniqueStartpoint); XLByteToSeg(startpoint, _logSegNo); ! XLogFileName(xlogfilename, ThisTimeLineID, _logSegNo); /* * Construct tablespace_map file --- 9974,9980 } while (!gotUniqueStartpoint); XLByteToSeg(startpoint, _logSegNo); ! XLogFileName(xlogfilename, starttli, _logSegNo); /* * Construct tablespace_map file signature.asc Description: OpenPGP digital signature
[HACKERS] Re: [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby
On 06/07/16 17:37, Marco Nenciarini wrote: > Hi, > > On 06/07/16 17:07, francesco.cano...@2ndquadrant.it wrote: >> The following bug has been logged on the website: >> >> Bug reference: 14230 >> Logged by: Francesco Canovai >> Email address: francesco.cano...@2ndquadrant.it >> PostgreSQL version: 9.6beta2 >> Operating system: Linux >> Description: >> >> I'm taking a concurrent backup from a standby in PostgreSQL beta2 and I get >> the wrong timeline from pg_stop_backup(false). >> >> This is what I'm doing: >> >> 1) I set up an environment with a primary server and a replica in streaming >> replication. >> >> 2) On the replica, I run >> >> postgres=# SELECT pg_start_backup('test_backup', true, false); >> pg_start_backup >> - >> 0/3000A00 >> (1 row) >> >> 3) When I run pg_stop_backup, it returns a start wal location belonging to a >> file with timeline 0. >> >> postgres=# SELECT pg_stop_backup(false); >> pg_stop_backup >> >> --- >> (0/3000AE0,"START WAL LOCATION: 0/3000A00 (file >> 0003)+ >> CHECKPOINT LOCATION: 0/3000A38 >> + >> BACKUP METHOD: streamed >> + >> BACKUP FROM: standby >> + >> START TIME: 2016-07-06 16:44:31 CEST >> + >> LABEL: test_backup >> + >> ","") >> (1 row) >> >> The timeline returned is fine (is 1) when running the same commands on the >> master. >> >> An incorrect backup label doesn't prevent PostgreSQL from starting up, but >> it affects the tools using that information. >> >> > > The issue here is that the do_pg_stop_backup function uses the > ThisTimeLineID variable that is not valid on standbys. > > I think that it should read it from > ControlFile->checkPointCopy.ThisTimeLineID as we do in do_pg_start_backup. > No, that's not the solution. The backup_label is generated during the do_pg_start_backup call, so also the copy in ControlFile->checkPointCopy.ThisTimeLineID is uninitialized. Regards, Marco -- Marco Nenciarini - 2ndQuadrant Italy PostgreSQL Training, Services and Support marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it signature.asc Description: OpenPGP digital signature
[HACKERS] Re: [BUGS] BUG #14230: Wrong timeline returned by pg_stop_backup on a standby
Hi, On 06/07/16 17:07, francesco.cano...@2ndquadrant.it wrote: > The following bug has been logged on the website: > > Bug reference: 14230 > Logged by: Francesco Canovai > Email address: francesco.cano...@2ndquadrant.it > PostgreSQL version: 9.6beta2 > Operating system: Linux > Description: > > I'm taking a concurrent backup from a standby in PostgreSQL beta2 and I get > the wrong timeline from pg_stop_backup(false). > > This is what I'm doing: > > 1) I set up an environment with a primary server and a replica in streaming > replication. > > 2) On the replica, I run > > postgres=# SELECT pg_start_backup('test_backup', true, false); > pg_start_backup > - > 0/3000A00 > (1 row) > > 3) When I run pg_stop_backup, it returns a start wal location belonging to a > file with timeline 0. > > postgres=# SELECT pg_stop_backup(false); > pg_stop_backup > > --- > (0/3000AE0,"START WAL LOCATION: 0/3000A00 (file > 0003)+ > CHECKPOINT LOCATION: 0/3000A38 > + > BACKUP METHOD: streamed > + > BACKUP FROM: standby > + > START TIME: 2016-07-06 16:44:31 CEST > + > LABEL: test_backup > + > ","") > (1 row) > > The timeline returned is fine (is 1) when running the same commands on the > master. > > An incorrect backup label doesn't prevent PostgreSQL from starting up, but > it affects the tools using that information. > > The issue here is that the do_pg_stop_backup function uses the ThisTimeLineID variable that is not valid on standbys. I think that it should read it from ControlFile->checkPointCopy.ThisTimeLineID as we do in do_pg_start_backup. Regards, Marco -- Marco Nenciarini - 2ndQuadrant Italy PostgreSQL Training, Services and Support marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it signature.asc Description: OpenPGP digital signature