Re: PITR promote bug: Checkpointer writes to older timeline
I wrote: > It sure looks like recovering a prepared > transaction creates a transient state in which a new backend will > compute a broken snapshot. Oh, after further digging this is the same issue discussed here: https://www.postgresql.org/message-id/flat/20210422203603.fdnh3fu2mmfp2iov%40alap3.anarazel.de regards, tom lane
Re: PITR promote bug: Checkpointer writes to older timeline
I wrote: > Buildfarm member hornet just reported a failure in this test: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet=2021-06-27%2013%3A40%3A57 > It's not clear whether this is a problem with the test case or an > actual server bug, but I'm leaning to the latter theory. My gut > feel is it's some problem in the "snapshot scalability" work. It > doesn't look the same as the known open issue, but maybe related? Hmm, the plot thickens. I scraped the buildfarm logs for similar-looking assertion failures back to last August, when the snapshot scalability patches went in. The first such failure is not until 2021-03-24 (see attachment), and they all look to be triggered by 023_pitr_prepared_xact.pl. It sure looks like recovering a prepared transaction creates a transient state in which a new backend will compute a broken snapshot. regards, tom lane sysname| branch | snapshot | stage | l ---++-+---+--- calliphoridae | HEAD | 2021-03-24 06:50:09 | recoveryCheck | TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "/mnt/resource/andres/bf/calliphoridae/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/procarray.c", Line: 2463, PID: 3890215) francolin | HEAD | 2021-03-29 16:21:58 | recoveryCheck | TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "/mnt/resource/andres/bf/francolin/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/procarray.c", Line: 2463, PID: 1861665) moonjelly | HEAD | 2021-04-01 15:25:38 | recoveryCheck | TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2463, PID: 2345153) francolin | HEAD | 2021-04-07 12:30:08 | recoveryCheck | TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "/mnt/resource/andres/bf/francolin/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/procarray.c", Line: 2468, PID: 3257637) fairywren | HEAD | 2021-04-20 03:04:04 | recoveryCheck | TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "C:/tools/msys64/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/procarray.c", Line: 2094, PID: 94824) mantid| HEAD | 2021-04-25 10:07:06 | recoveryCheck | TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2094, PID: 2820886) thorntail | HEAD | 2021-04-29 07:18:09 | recoveryCheck | TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/procarray.c", Line: 2094, PID: 3099560) mantid| HEAD | 2021-05-03 13:07:06 | recoveryCheck | TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2094, PID: 1163004) mantid| HEAD | 2021-05-10 01:07:07 | recoveryCheck | TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2468, PID: 2812704) hornet| HEAD | 2021-06-27 13:40:57 | recoveryCheck | TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2492, PID: 11862234) (10 rows)
Re: PITR promote bug: Checkpointer writes to older timeline
Michael Paquier writes: > I have been working on that over the last couple of days, and applied > a fix down to 10. One thing that I did not like in the test was the > use of compare() to check if the contents of the WAL segment before > and after the timeline jump remained the same as this would have been > unstable with any concurrent activity. Instead, I have added a phase > at the end of the test with an extra checkpoint and recovery triggered > once, which is enough to reproduce the PANIC reported at the top of > the thread. Buildfarm member hornet just reported a failure in this test: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet=2021-06-27%2013%3A40%3A57 the critical bit being 2021-06-27 17:35:46.504 UTC [11862234:1] [unknown] LOG: connection received: host=[local] 2021-06-27 17:35:46.505 UTC [18350260:12] LOG: recovering prepared transaction 734 from shared memory TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2492, PID: 11862234) 2021-06-27 17:35:46.511 UTC [14876838:4] LOG: database system is ready to accept connections It's not clear whether this is a problem with the test case or an actual server bug, but I'm leaning to the latter theory. My gut feel is it's some problem in the "snapshot scalability" work. It doesn't look the same as the known open issue, but maybe related? regards, tom lane
Re: PITR promote bug: Checkpointer writes to older timeline
On Thu, Mar 18, 2021 at 12:56:12PM +0900, Michael Paquier wrote: > I was looking at uses of ThisTimeLineID in the wild, and could not > find it getting checked or used actually in backend-side code that > involved the WAL reader facility. Even if it brings confidence, it > does not mean that it is not used somewhere :/ I have been working on that over the last couple of days, and applied a fix down to 10. One thing that I did not like in the test was the use of compare() to check if the contents of the WAL segment before and after the timeline jump remained the same as this would have been unstable with any concurrent activity. Instead, I have added a phase at the end of the test with an extra checkpoint and recovery triggered once, which is enough to reproduce the PANIC reported at the top of the thread. I'll look into clarifying the use of ThisTimeLineID within the those WAL reader callbacks, because this is really bug-prone in the long term... This requires some coordination with the recent work aimed at adding some logical decoding support in standbys, though. -- Michael signature.asc Description: PGP signature
Re: PITR promote bug: Checkpointer writes to older timeline
On Wed, Mar 17, 2021 at 05:09:50PM +0900, Michael Paquier wrote: > Currently with HEAD and back branches, nothing would be broken as > logical contexts cannot exist in recovery. Still it would be easy > to miss the new behavior for anybody attempting to work more on this > feature in the future if we change read_local_xlog_page to not update > ThisTimeLineID anymore. Resetting the value of ThisTimeLineID in > read_local_xlog_page() does not seem completely right either with this > argument, as they could be some custom code relying on the existing > behavior of read_local_xlog_page() to maintain ThisTimeLineID. I was looking at uses of ThisTimeLineID in the wild, and could not find it getting checked or used actually in backend-side code that involved the WAL reader facility. Even if it brings confidence, it does not mean that it is not used somewhere :/ > Hmmm. I am wondering whether the best answer for the moment would not > be to save and reset ThisTimeLineID just in XlogReadTwoPhaseData(), as > that's the local change that uses read_local_xlog_page(). And attached is the patch able to achieve that. At least it is simple, and does not break the actual assumptions this callback relies on. This is rather weak though if there are errors as this is out of critical sections, still the disease is worse. I have double-checked all the existing backend code that uses XLogReadRecord(), and did not notice any code paths with issues similar to this one. > The state of the code is really confusing on HEAD, and I'd like to > think that the best thing we could do in the long-term is to make the > logical decoding path not rely on ThisTimeLineID at all and decouple > all that, putting the code in a state sane enough so as we don't > finish with similar errors as what is discussed on this thread. That > would be a work for a different patch though, not for stable > branches. And seeing some slot and advancing functions update > directly ThisTimeLineID is no good either.. However, I'd like to think that we should completely untie the dependency to ThisTimeLineID in any page read callbacks in core in the long-term, and potentially clean up any assumptions behind timeline jumps while in recovery for logical contexts as that cannot happen. At this stage of the 14 dev cycle, that would be material for 15~, but I also got to wonder if there is work going on to support logical decoding on standbys, in particular if this would really rely on ThisTimeLineID. Thoughts are welcome. -- Michael diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index 6023e7c16f..80be6cc7a4 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -1316,7 +1316,11 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok) * twophase files and ReadTwoPhaseFile should be used instead. * * Note clearly that this function can access WAL during normal operation, - * similarly to the way WALSender or Logical Decoding would do. + * similarly to the way WALSender or Logical Decoding would do. While + * accessing WAL, read_local_xlog_page() may change ThisTimeLineID, + * particularly if this routine is called for the end-of-recovery checkpoint + * in the checkpointer itself, so save its existing value and restore it + * once done. */ static void XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len) @@ -1324,6 +1328,7 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len) XLogRecord *record; XLogReaderState *xlogreader; char *errormsg; + TimeLineID save_currtli = ThisTimeLineID; xlogreader = XLogReaderAllocate(wal_segment_size, NULL, XL_ROUTINE(.page_read = _local_xlog_page, @@ -1338,6 +1343,14 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len) XLogBeginRead(xlogreader, lsn); record = XLogReadRecord(xlogreader, ); + + /* + * Restore immediately the timeline where it was previously, as + * read_local_xlog_page() could have changed it if the record was read + * while recovery was finishing or if the timeline has jumped in-between. + */ + ThisTimeLineID = save_currtli; + if (record == NULL) ereport(ERROR, (errcode_for_file_access(), diff --git a/src/test/recovery/t/022_pitr_prepared_xact.pl b/src/test/recovery/t/022_pitr_prepared_xact.pl new file mode 100644 index 00..3a7907b2a0 --- /dev/null +++ b/src/test/recovery/t/022_pitr_prepared_xact.pl @@ -0,0 +1,86 @@ +# Test for point-in-time-recovery (PITR) with prepared transactions +use strict; +use warnings; +use PostgresNode; +use TestLib; +use Test::More tests => 2; +use File::Compare; + +# Initialize and start primary node with WAL archiving +my $node_primary = get_new_node('primary'); +$node_primary->init(has_archiving => 1); +$node_primary->append_conf('postgresql.conf', qq{ +max_wal_senders = 10 +wal_level = 'replica' +max_prepared_transactions = 10}); +$node_primary->start; + +# Take backup +my $backup_name = 'my_backup';
Re: PITR promote bug: Checkpointer writes to older timeline
On Mon, Mar 15, 2021 at 04:38:08PM +0900, Michael Paquier wrote: > On Mon, Mar 15, 2021 at 03:01:09PM +0900, Kyotaro Horiguchi wrote: >> Logical decoding stuff is (I think) designed to turn any backend into >> a walsender, which may need to maintain ThisTimeLineID. It seems to >> me that logical decoding stuff indents to maintain ThisTimeLineID of >> such backends at reading a WAL record. logical_read_xlog_page also >> updates ThisTimeLineID and pg_logical_slot_get_changes_guts(), >> pg_replication_slot_advance() (and maybe other functions) updates >> ThisTimeLineID. So it is natural that local_read_xlog_page() updates >> it since it is intended to be used used in logical decoding plugins. > > Logical decoding contexts cannot be created while in recovery as per > CheckLogicalDecodingRequirements(), and as mentioned not everything > is > in place to allow that. FWIW, I think that it is just confusing for > pg_replication_slot_advance() and pg_logical_slot_get_changes_guts() > to update it, and we just look for the latest value each time it is > necessary when reading a new WAL page. Studying some history today, having read_local_xlog_page() directly update ThisTimeLineID has been extensively discussed here back in 2017 to attempt to introduce logical decoding on standbys (1148e22a): https://www.postgresql.org/message-id/CAMsr%2BYEVmBJ%3DdyLw%3D%2BkTihmUnGy5_EW4Mig5T0maieg_Zu%3DXCg%40mail.gmail.com Currently with HEAD and back branches, nothing would be broken as logical contexts cannot exist in recovery. Still it would be easy to miss the new behavior for anybody attempting to work more on this feature in the future if we change read_local_xlog_page to not update ThisTimeLineID anymore. Resetting the value of ThisTimeLineID in read_local_xlog_page() does not seem completely right either with this argument, as they could be some custom code relying on the existing behavior of read_local_xlog_page() to maintain ThisTimeLineID. Hmmm. I am wondering whether the best answer for the moment would not be to save and reset ThisTimeLineID just in XlogReadTwoPhaseData(), as that's the local change that uses read_local_xlog_page(). The state of the code is really confusing on HEAD, and I'd like to think that the best thing we could do in the long-term is to make the logical decoding path not rely on ThisTimeLineID at all and decouple all that, putting the code in a state sane enough so as we don't finish with similar errors as what is discussed on this thread. That would be a work for a different patch though, not for stable branches. And seeing some slot and advancing functions update directly ThisTimeLineID is no good either.. Any thoughts? -- Michael signature.asc Description: PGP signature
Re: PITR promote bug: Checkpointer writes to older timeline
On Mon, Mar 15, 2021 at 03:01:09PM +0900, Kyotaro Horiguchi wrote: > Logical decoding stuff is (I think) designed to turn any backend into > a walsender, which may need to maintain ThisTimeLineID. It seems to > me that logical decoding stuff indents to maintain ThisTimeLineID of > such backends at reading a WAL record. logical_read_xlog_page also > updates ThisTimeLineID and pg_logical_slot_get_changes_guts(), > pg_replication_slot_advance() (and maybe other functions) updates > ThisTimeLineID. So it is natural that local_read_xlog_page() updates > it since it is intended to be used used in logical decoding plugins. Logical decoding contexts cannot be created while in recovery as per CheckLogicalDecodingRequirements(), and as mentioned not everything is in place to allow that. FWIW, I think that it is just confusing for pg_replication_slot_advance() and pg_logical_slot_get_changes_guts() to update it, and we just look for the latest value each time it is necessary when reading a new WAL page. > It is initialized by IndentifySystem(). And the logical walsender > intends to maintain ThisTimeLineID by subsequent calls to > GetStandbyFlushRecPtr(), which happen in logical_read_xlog_page(). I don't understand this part about logical_read_xlog_page(), though. Do you mean a different routine or a different code path? > I agree that it's better that the replay TLI is stored in a separate > variable. It is what I was complained on in the previous mails. (It > might not have been so obvious, though..) Okay. I understood that this was what you implied. > I don't think there's any acutual user of the function for the > purpose, but.. Anyawy if we remove the update of ThisTimeLineID from > read_local_xlog_page, I think we should remove or rewrite the > following comment for the function. It no longer works as written in > the catchphrase. Who knows. We cannot know all the users of this code, and the API is public. > > * Public because it would likely be very helpful for someone writing another > > * output method outside walsender, e.g. in a bgworker. I don't see a reason to remove this comment as this routine can still be useful for many purposes. What kind of rewording do you have in mind? -- Michael signature.asc Description: PGP signature
Re: PITR promote bug: Checkpointer writes to older timeline
At Sun, 14 Mar 2021 17:59:59 +0900, Michael Paquier wrote in > On Thu, Mar 04, 2021 at 05:10:36PM +0900, Kyotaro Horiguchi wrote: > > read_local_xlog_page is *designed* to maintain ThisTimeLineID. > > Currently it doesn't seem utilized but I think it's sufficiently > > reasonable that the function maintains ThisTimeLineID. > > I don't quite follow this line of thoughts. ThisTimeLineID is > designed to remain 0 while recovery is running in most processes > (at the close exception of a WAL sender with a cascading setup, The reason for the "0" is they just aren't interested in the value. Checkpointer temporarily uses it then restore to 0 soon. > physical or logical, of course), so why is there any business for > read_local_xlog_page() to touch this field at all while in recovery to > begin with? Logical decoding stuff is (I think) designed to turn any backend into a walsender, which may need to maintain ThisTimeLineID. It seems to me that logical decoding stuff indents to maintain ThisTimeLineID of such backends at reading a WAL record. logical_read_xlog_page also updates ThisTimeLineID and pg_logical_slot_get_changes_guts(), pg_replication_slot_advance() (and maybe other functions) updates ThisTimeLineID. So it is natural that local_read_xlog_page() updates it since it is intended to be used used in logical decoding plugins. > I equally find confusing that XLogReadDetermineTimeline() relies on a > specific value of ThisTimeLineID in its own logic, while it clearly > states that all its callers have to read the current active TLI > beforehand. So I think that the existing logic is pretty weak, and > that resetting the field is an incorrect approach? It seems to me It is initialized by IndentifySystem(). And the logical walsender intends to maintain ThisTimeLineID by subsequent calls to GetStandbyFlushRecPtr(), which happen in logical_read_xlog_page(). > that we had better not change ThisTimeLineID actively while in > recovery in this code path and just let others do the job, like > RecoveryInProgress() once recovery finishes, or > GetStandbyFlushRecPtr() for a WAL sender. And finally, we should > store the current TLI used for replay in a separate variable that gets > passed down to the XLogReadDetermineTimeline() as argument. I agree that it's better that the replay TLI is stored in a separate variable. It is what I was complained on in the previous mails. (It might not have been so obvious, though..) > While going through it, I have simplified a bit the proposed TAP tests > (thanks for replacing the sleep() call, Soumyadeep. This would have > made the test slower for nothing on fast machines, and it would cause > failures on very slow machines). > > The attached fixes the original issue for me, keeping all the records > in their correct timeline. And I have not been able to break > cascading setups. If it happens that such cases actually break, we > have holes in our existing test coverage that should be improved. I > cannot see anything fancy missing on this side, though. > > Any thoughts? I don't think there's any acutual user of the function for the purpose, but.. Anyawy if we remove the update of ThisTimeLineID from read_local_xlog_page, I think we should remove or rewrite the following comment for the function. It no longer works as written in the catchphrase. > * Public because it would likely be very helpful for someone writing another > * output method outside walsender, e.g. in a bgworker. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: PITR promote bug: Checkpointer writes to older timeline
On Thu, Mar 04, 2021 at 05:10:36PM +0900, Kyotaro Horiguchi wrote: > read_local_xlog_page is *designed* to maintain ThisTimeLineID. > Currently it doesn't seem utilized but I think it's sufficiently > reasonable that the function maintains ThisTimeLineID. I don't quite follow this line of thoughts. ThisTimeLineID is designed to remain 0 while recovery is running in most processes (at the close exception of a WAL sender with a cascading setup, physical or logical, of course), so why is there any business for read_local_xlog_page() to touch this field at all while in recovery to begin with? I equally find confusing that XLogReadDetermineTimeline() relies on a specific value of ThisTimeLineID in its own logic, while it clearly states that all its callers have to read the current active TLI beforehand. So I think that the existing logic is pretty weak, and that resetting the field is an incorrect approach? It seems to me that we had better not change ThisTimeLineID actively while in recovery in this code path and just let others do the job, like RecoveryInProgress() once recovery finishes, or GetStandbyFlushRecPtr() for a WAL sender. And finally, we should store the current TLI used for replay in a separate variable that gets passed down to the XLogReadDetermineTimeline() as argument. While going through it, I have simplified a bit the proposed TAP tests (thanks for replacing the sleep() call, Soumyadeep. This would have made the test slower for nothing on fast machines, and it would cause failures on very slow machines). The attached fixes the original issue for me, keeping all the records in their correct timeline. And I have not been able to break cascading setups. If it happens that such cases actually break, we have holes in our existing test coverage that should be improved. I cannot see anything fancy missing on this side, though. Any thoughts? -- Michael diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h index 9ac602b674..34d6286d75 100644 --- a/src/include/access/xlogutils.h +++ b/src/include/access/xlogutils.h @@ -56,7 +56,9 @@ extern void wal_segment_open(XLogReaderState *state, extern void wal_segment_close(XLogReaderState *state); extern void XLogReadDetermineTimeline(XLogReaderState *state, - XLogRecPtr wantPage, uint32 wantLength); + XLogRecPtr wantPage, + uint32 wantLength, + TimeLineID readtli); extern void WALReadRaiseError(WALReadError *errinfo); diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index d17d660f46..9695653bc2 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -658,6 +658,13 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum, * wantLength to the amount of the page that will be read, up to * XLOG_BLCKSZ. If the amount to be read isn't known, pass XLOG_BLCKSZ. * + * readtli is the current timeline as found by the caller of this routine. + * When not in recovery, this should be ThisTimeLineID. As ThisTimeLineID + * remains set to 0 for most processes while in recovery, the caller ought + * to provide the timeline number given as a result of GetXLogReplayRecPtr() + * instead (for a WAL sender this would actually be ThisTimeLineID as the + * field gets updated in a cascading WAL sender). + * * We switch to an xlog segment from the new timeline eagerly when on a * historical timeline, as soon as we reach the start of the xlog segment * containing the timeline switch. The server copied the segment to the new @@ -679,12 +686,13 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum, * * The caller must also make sure it doesn't read past the current replay * position (using GetXLogReplayRecPtr) if executing in recovery, so it - * doesn't fail to notice that the current timeline became historical. The - * caller must also update ThisTimeLineID with the result of - * GetXLogReplayRecPtr and must check RecoveryInProgress(). + * doesn't fail to notice that the current timeline became historical. */ void -XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength) +XLogReadDetermineTimeline(XLogReaderState *state, + XLogRecPtr wantPage, + uint32 wantLength, + TimeLineID readtli) { const XLogRecPtr lastReadPage = (state->seg.ws_segno * state->segcxt.ws_segsize + state->segoff); @@ -712,12 +720,12 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa * just carry on. (Seeking backwards requires a check to make sure the * older page isn't on a prior timeline). * - * ThisTimeLineID might've become historical since we last looked, but the + * readtli might've become historical since we last looked, but the * caller is required not to read past the flush limit it saw at the time * it looked up the timeline. There's nothing we can do about it if * StartupXLOG()
Re: PITR promote bug: Checkpointer writes to older timeline
Hello, PFA version 2 of the TAP test. I removed the non-deterministic sleep and introduced retries until the WAL segment is archived and promotion is complete. Some additional tidying up too. Regards, Soumyadeep (VMware) diff --git a/src/test/recovery/t/022_pitr_prepared_xact.pl b/src/test/recovery/t/022_pitr_prepared_xact.pl new file mode 100644 index 000..8738fa39d4d --- /dev/null +++ b/src/test/recovery/t/022_pitr_prepared_xact.pl @@ -0,0 +1,83 @@ +# Test for point-in-time-recovery (PITR) with prepared transactions +use strict; +use warnings; +use PostgresNode; +use TestLib; +use Test::More tests => 2; +use File::Compare; + +# Initialize and start primary node with WAL archiving +my $node_primary = get_new_node('primary'); +$node_primary->init(has_archiving => 1); +$node_primary->append_conf('postgresql.conf', "max_wal_senders = 10"); +$node_primary->append_conf('postgresql.conf', "wal_level = 'replica'"); +$node_primary->append_conf('postgresql.conf', + "max_prepared_transactions = 10"); +$node_primary->start; + +# Take backup +my $backup_name = 'my_backup'; +$node_primary->backup($backup_name); + +# Initialize node for PITR targeting a specific restore point +my $node_pitr = get_new_node('node_pitr'); +$node_pitr->init_from_backup( + $node_primary, $backup_name, + standby => 0, + has_restoring => 1); +$node_pitr->append_conf('postgresql.conf', "max_prepared_transactions = 10"); +$node_pitr->append_conf('postgresql.conf', "recovery_target_name = 'rp'"); +$node_pitr->append_conf('postgresql.conf', + "recovery_target_action = 'promote'"); + +# Workload with a prepared transaction and the target restore point +$node_primary->psql( + 'postgres', qq{ +CREATE TABLE foo(i int); +BEGIN; +INSERT INTO foo VALUES(1); +PREPARE TRANSACTION 'fooinsert'; +SELECT pg_create_restore_point('rp'); +INSERT INTO foo VALUES(2); +}); + +# Find next WAL segment to be archived +my $walfile_to_be_archived = $node_primary->safe_psql('postgres', + "SELECT pg_walfile_name(pg_current_wal_lsn());"); + +# Make WAL segment eligible for archival +$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal()'); + +# Wait until the WAL segment has been archived +my $archive_wait_query = + "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;"; +$node_primary->poll_query_until('postgres', $archive_wait_query) + or die "Timed out while waiting for WAL segment to be archived"; +my $last_archived_wal_file = $walfile_to_be_archived; + +# Now start the PITR node +$node_pitr->start; + +# Wait until the PITR node exits recovery +$node_pitr->poll_query_until('postgres', "SELECT pg_is_in_recovery() = 'f';") + or die "Timed out while waiting for PITR promotion"; + +# Ensure that we didn't write to the older timeline during PITR promotion by +# checking that the last archived WAL segment was not overwritten during recovery +my $archive_dir = $node_primary->archive_dir; +my $archive_wal_file_path = "$archive_dir/$last_archived_wal_file"; +my $node_pitr_data= $node_pitr->data_dir; +my $local_wal_file_path = "$node_pitr_data/pg_wal/$last_archived_wal_file"; +is(compare($archive_wal_file_path, $local_wal_file_path), + qq(0), "check if the last archived WAL file was overwritten"); + +# Commit the prepared transaction in the latest timeline and check the result. +# There should only be one row (from the prepared transaction). The row from +# the INSERT after the restore point should not show up, since our recovery +# target preceded said INSERT +$node_pitr->psql( + 'postgres', qq{ +COMMIT PREPARED 'fooinsert'; +}); +my $result = $node_pitr->safe_psql('postgres', "SELECT * FROM foo;"); +is($result, qq(1), "check table contents after COMMIT PREPARED");
Re: PITR promote bug: Checkpointer writes to older timeline
Hey all, I took a stab at a quick and dirty TAP test (my first ever). So it can probably be improved a lot. Please take a look. On Thu, Mar 04, 2021 at 10:28:31AM +0900, Kyotaro Horiguchi wrote: > 2. Restore ThisTimeLineID after calling XLogReadRecord() in the > *caller* side. This is what came up to me first but I don't like > this, too, but I don't find better fix. way. +1 to this patch [1]. The above TAP test passes with this patch applied. [1] https://www.postgresql.org/message-id/attachment/119972/dont_change_thistimelineid.patch Regards, Soumyadeep Regards, Soumyadeep diff --git a/src/test/recovery/t/022_pitr_prepared_xact.pl b/src/test/recovery/t/022_pitr_prepared_xact.pl new file mode 100644 index 000..b8d5146bb9d --- /dev/null +++ b/src/test/recovery/t/022_pitr_prepared_xact.pl @@ -0,0 +1,69 @@ +# Test for timeline switch +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 2; +use File::Compare; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize primary node +my $node_primary = get_new_node('primary'); +$node_primary->init(has_archiving => 1); +$node_primary->append_conf('postgresql.conf', "max_prepared_transactions = 10"); +$node_primary->append_conf('postgresql.conf', "max_wal_senders = 10"); +$node_primary->append_conf('postgresql.conf', "wal_level = 'replica'"); +$node_primary->start; + +# Take backup +my $backup_name = 'my_backup'; +$node_primary->backup($backup_name); + +my $node_pitr = get_new_node('node_pitr'); +$node_pitr->init_from_backup($node_primary, $backup_name, +standby => 0, has_restoring => 1); +$node_pitr->append_conf('postgresql.conf', "max_prepared_transactions = 10"); +$node_pitr->append_conf('postgresql.conf', "recovery_target_name = 'rp'"); +$node_pitr->append_conf('postgresql.conf', "recovery_target_action = 'promote'"); + +# Workload with prepared transaction +$node_primary->psql( +'postgres', qq{ +CREATE TABLE foo(i int); +BEGIN; +INSERT INTO foo VALUES(1); +PREPARE TRANSACTION 'fooinsert'; +SELECT pg_create_restore_point('rp'); +INSERT INTO foo VALUES(2); +SELECT pg_switch_wal(); +}); + +# Sleep 5s to ensure that the WAL has been archived. +# probably can be replaced by a wait +usleep(500); + +$node_pitr->start; + +my $last_archived_wal_file_name = $node_primary->safe_psql('postgres', +"SELECT last_archived_wal FROM pg_stat_archiver;"); + +# Ensure that we don't write to the older timeline during PITR promotion by +# ensuring that the last archived WAL file was not overwritten during recovery. +my $archive_dir = $node_primary->archive_dir; +my $archive_wal_file_path = "${archive_dir}/${last_archived_wal_file_name}"; +my $node_pitr_data = $node_pitr->data_dir; +my $local_wal_file_path = "${node_pitr_data}/pg_wal/${last_archived_wal_file_name}"; +is(compare($archive_wal_file_path, $local_wal_file_path), qq(0), "Check if the last archived WAL file was overwritten"); + +$node_pitr->psql( +'postgres', qq{ +COMMIT PREPARED 'fooinsert'; +}); + +my $foo_select_result = $node_pitr->safe_psql('postgres', +"SELECT * FROM foo;"); + +is($foo_select_result, qq(1), "Check foo select result");
Re: PITR promote bug: Checkpointer writes to older timeline
At Thu, 04 Mar 2021 16:17:34 +0900 (JST), Kyotaro Horiguchi wrote in > At Thu, 4 Mar 2021 11:18:42 +0900, Michael Paquier > wrote in > > I have not looked in details at the solutions proposed here, but could > > it be possible to have a TAP test at least please? Seeing the script > > from the top of the thread, it should not be difficult to do so. I > > would put that in a file different than 009_twophase.pl, within > > src/test/recovery/. Isn't 004_timeline_switch.pl the place? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: PITR promote bug: Checkpointer writes to older timeline
At Thu, 4 Mar 2021 14:57:13 +0900, Fujii Masao wrote in > > read_local_xlog_page() works as a part of logical decoding and has > > responsibility to update ThisTimeLineID properly. As the comment in > > the function, it is the proper place to update ThisTimeLineID since we > > miss a timeline change if we check it earlier and the function uses > > the value just after. So we cannot change that behavior of the > > function. That is, neither of them doesn't seem to be the right fix. > > Could you tell me what actual issue happens if read_local_xlog_page() > resets > ThisTimeLineID at the end? Some replication slot-related functions > that use > read_local_xlog_page() can be executed even during recovery. For > example, > you mean that, when timeline swithes during recovery, those functions > behave incorrectly if ThisTimeLineID is reset? The most significant point for me was I'm not fully convinced that we can safely (or validly) remove the fucntion to maintain the variable from read_local_xlog_page. > * RecoveryInProgress() will update ThisTimeLineID when it first > * notices recovery finishes, so we only have to maintain it for the > * local process until recovery ends. read_local_xlog_page is *designed* to maintain ThisTimeLineID. Currently it doesn't seem utilized but I think it's sufficiently reasonable that the function maintains ThisTimeLineID. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: PITR promote bug: Checkpointer writes to older timeline
At Thu, 4 Mar 2021 11:18:42 +0900, Michael Paquier wrote in > On Thu, Mar 04, 2021 at 10:28:31AM +0900, Kyotaro Horiguchi wrote: > > read_local_xlog_page() works as a part of logical decoding and has > > responsibility to update ThisTimeLineID properly. As the comment in > > the function, it is the proper place to update ThisTimeLineID since we > > miss a timeline change if we check it earlier and the function uses > > the value just after. So we cannot change that behavior of the > > function. That is, neither of them doesn't seem to be the right fix. > > > > The confusion here is that there's two ThisTimeLineID's here. The > > previous TLI for read and the next TLI to write. Most part of the > > function is needed to read a 2pc recrod so the ways we can take here > > is: > > > > 1. Somehow tell the function not to update ThisTimeLineID in specific > > cases. This can be done by xlogreader private data but this doesn't > > seem reasonable. > > > > 2. Restore ThisTimeLineID after calling XLogReadRecord() in the > > *caller* side. This is what came up to me first but I don't like > > this, too, but I don't find better fix. way. > > I have not looked in details at the solutions proposed here, but could > it be possible to have a TAP test at least please? Seeing the script > from the top of the thread, it should not be difficult to do so. I > would put that in a file different than 009_twophase.pl, within > src/test/recovery/. Year, agreed. It is needed as the final patch. That situation is easily caused. I'm not sure how to detect the corruption yet, though. I'll consider that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: PITR promote bug: Checkpointer writes to older timeline
On 2021/03/04 10:28, Kyotaro Horiguchi wrote: At Wed, 3 Mar 2021 14:56:25 -0800, Soumyadeep Chakraborty wrote in On 2021/03/03 17:46, Heikki Linnakangas wrote: I think it should be reset even earlier, inside XlogReadTwoPhaseData() probably. With your patch, doesn't the LogStandbySnapshot() call just above where you're ressetting ThisTimeLineID also write a WAL record with incorrect timeline? Agreed. Right. On Wed, Mar 3, 2021 at 1:04 AM Fujii Masao wrote: Even better, can we avoid setting ThisTimeLineID in XlogReadTwoPhaseData() in the first place? Or isn't it better to reset ThisTimeLineID in read_local_xlog_page(), i.e., prevent read_local_xlog_page() from changing ThisTimeLineID? I'm not sure if that's possible, though.. In the future other functions that calls read_local_xlog_page() during the promotion may appear. Fixing the issue outside read_local_xlog_page() may cause those functions to get the same issue. I agree. We should fix the issue in read_local_xlog_page(). I have attached two different patches which do so: saved_ThisTimeLineID.patch and pass_ThisTimeLineID.patch. The former saves the value of the ThisTimeLineID before it gets changed in read_local_xlog_page() and resets it after ThisTimeLineID has been used later on in the code (by XLogReadDetermineTimeline()). The latter removes occurrences of ThisTimeLineID from XLogReadDetermineTimeline() and introduces an argument currTLI to XLogReadDetermineTimeline() to be used in its stead. read_local_xlog_page() works as a part of logical decoding and has responsibility to update ThisTimeLineID properly. As the comment in the function, it is the proper place to update ThisTimeLineID since we miss a timeline change if we check it earlier and the function uses the value just after. So we cannot change that behavior of the function. That is, neither of them doesn't seem to be the right fix. Could you tell me what actual issue happens if read_local_xlog_page() resets ThisTimeLineID at the end? Some replication slot-related functions that use read_local_xlog_page() can be executed even during recovery. For example, you mean that, when timeline swithes during recovery, those functions behave incorrectly if ThisTimeLineID is reset? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: PITR promote bug: Checkpointer writes to older timeline
On Thu, Mar 04, 2021 at 10:28:31AM +0900, Kyotaro Horiguchi wrote: > read_local_xlog_page() works as a part of logical decoding and has > responsibility to update ThisTimeLineID properly. As the comment in > the function, it is the proper place to update ThisTimeLineID since we > miss a timeline change if we check it earlier and the function uses > the value just after. So we cannot change that behavior of the > function. That is, neither of them doesn't seem to be the right fix. > > The confusion here is that there's two ThisTimeLineID's here. The > previous TLI for read and the next TLI to write. Most part of the > function is needed to read a 2pc recrod so the ways we can take here > is: > > 1. Somehow tell the function not to update ThisTimeLineID in specific > cases. This can be done by xlogreader private data but this doesn't > seem reasonable. > > 2. Restore ThisTimeLineID after calling XLogReadRecord() in the > *caller* side. This is what came up to me first but I don't like > this, too, but I don't find better fix. way. I have not looked in details at the solutions proposed here, but could it be possible to have a TAP test at least please? Seeing the script from the top of the thread, it should not be difficult to do so. I would put that in a file different than 009_twophase.pl, within src/test/recovery/. -- Michael signature.asc Description: PGP signature
Re: PITR promote bug: Checkpointer writes to older timeline
At Wed, 3 Mar 2021 14:56:25 -0800, Soumyadeep Chakraborty wrote in > On 2021/03/03 17:46, Heikki Linnakangas wrote: > > > I think it should be reset even earlier, inside XlogReadTwoPhaseData() > > probably. With your patch, doesn't the LogStandbySnapshot() call just > > above where you're ressetting ThisTimeLineID also write a WAL record > > with incorrect timeline? > > Agreed. Right. > On Wed, Mar 3, 2021 at 1:04 AM Fujii Masao > wrote: > > > > Even better, can we avoid setting ThisTimeLineID in > > > XlogReadTwoPhaseData() in the first place? > > > > > > > > Or isn't it better to reset ThisTimeLineID in read_local_xlog_page(), i.e., > > prevent read_local_xlog_page() from changing ThisTimeLineID? I'm not > > sure if that's possible, though.. In the future other functions that calls > > read_local_xlog_page() during the promotion may appear. Fixing the issue > > outside read_local_xlog_page() may cause those functions to get > > the same issue. > > I agree. We should fix the issue in read_local_xlog_page(). I have > attached two different patches which do so: > saved_ThisTimeLineID.patch and pass_ThisTimeLineID.patch. > > The former saves the value of the ThisTimeLineID before it gets changed > in read_local_xlog_page() and resets it after ThisTimeLineID has been > used later on in the code (by XLogReadDetermineTimeline()). > > The latter removes occurrences of ThisTimeLineID from > XLogReadDetermineTimeline() and introduces an argument currTLI to > XLogReadDetermineTimeline() to be used in its stead. read_local_xlog_page() works as a part of logical decoding and has responsibility to update ThisTimeLineID properly. As the comment in the function, it is the proper place to update ThisTimeLineID since we miss a timeline change if we check it earlier and the function uses the value just after. So we cannot change that behavior of the function. That is, neither of them doesn't seem to be the right fix. The confusion here is that there's two ThisTimeLineID's here. The previous TLI for read and the next TLI to write. Most part of the function is needed to read a 2pc recrod so the ways we can take here is: 1. Somehow tell the function not to update ThisTimeLineID in specific cases. This can be done by xlogreader private data but this doesn't seem reasonable. 2. Restore ThisTimeLineID after calling XLogReadRecord() in the *caller* side. This is what came up to me first but I don't like this, too, but I don't find better fix. way. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index 80d2d20d6c..ab9ac0cd5a 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -1324,6 +1324,7 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len) XLogRecord *record; XLogReaderState *xlogreader; char *errormsg; + TimeLineID currtli = ThisTimeLineID; xlogreader = XLogReaderAllocate(wal_segment_size, NULL, XL_ROUTINE(.page_read = _local_xlog_page, @@ -1358,6 +1359,12 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len) memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader)); XLogReaderFree(xlogreader); + + /* + * This function reads a record not as a part of replication so we don't + * want to let XLogReadRecord to change ThisTimeLineID. + */ + ThisTimeLineID = currtli; }
Re: PITR promote bug: Checkpointer writes to older timeline
On 2021/03/03 17:46, Heikki Linnakangas wrote: > I think it should be reset even earlier, inside XlogReadTwoPhaseData() > probably. With your patch, doesn't the LogStandbySnapshot() call just > above where you're ressetting ThisTimeLineID also write a WAL record > with incorrect timeline? Agreed. On Wed, Mar 3, 2021 at 1:04 AM Fujii Masao wrote: > > Even better, can we avoid setting ThisTimeLineID in XlogReadTwoPhaseData() > > in the first place? > > > > Or isn't it better to reset ThisTimeLineID in read_local_xlog_page(), i.e., > prevent read_local_xlog_page() from changing ThisTimeLineID? I'm not > sure if that's possible, though.. In the future other functions that calls > read_local_xlog_page() during the promotion may appear. Fixing the issue > outside read_local_xlog_page() may cause those functions to get > the same issue. I agree. We should fix the issue in read_local_xlog_page(). I have attached two different patches which do so: saved_ThisTimeLineID.patch and pass_ThisTimeLineID.patch. The former saves the value of the ThisTimeLineID before it gets changed in read_local_xlog_page() and resets it after ThisTimeLineID has been used later on in the code (by XLogReadDetermineTimeline()). The latter removes occurrences of ThisTimeLineID from XLogReadDetermineTimeline() and introduces an argument currTLI to XLogReadDetermineTimeline() to be used in its stead. Regards, Soumyadeep diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index d17d660f460..a96fd236eeb 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -680,11 +680,12 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum, * The caller must also make sure it doesn't read past the current replay * position (using GetXLogReplayRecPtr) if executing in recovery, so it * doesn't fail to notice that the current timeline became historical. The - * caller must also update ThisTimeLineID with the result of + * caller must also ensure that currTLI is updated with the result of * GetXLogReplayRecPtr and must check RecoveryInProgress(). */ void -XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength) +XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, + uint32 wantLength, TimeLineID currTLI) { const XLogRecPtr lastReadPage = (state->seg.ws_segno * state->segcxt.ws_segsize + state->segoff); @@ -712,12 +713,12 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa * just carry on. (Seeking backwards requires a check to make sure the * older page isn't on a prior timeline). * - * ThisTimeLineID might've become historical since we last looked, but the + * currTLI might be historical, but the * caller is required not to read past the flush limit it saw at the time * it looked up the timeline. There's nothing we can do about it if * StartupXLOG() renames it to .partial concurrently. */ - if (state->currTLI == ThisTimeLineID && wantPage >= lastReadPage) + if (state->currTLI == currTLI && wantPage >= lastReadPage) { Assert(state->currTLIValidUntil == InvalidXLogRecPtr); return; @@ -729,7 +730,7 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa * the current segment we can just keep reading. */ if (state->currTLIValidUntil != InvalidXLogRecPtr && - state->currTLI != ThisTimeLineID && + state->currTLI != currTLI && state->currTLI != 0 && ((wantPage + wantLength) / state->segcxt.ws_segsize) < (state->currTLIValidUntil / state->segcxt.ws_segsize)) @@ -752,7 +753,7 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa * We need to re-read the timeline history in case it's been changed * by a promotion or replay from a cascaded replica. */ - List *timelineHistory = readTimeLineHistory(ThisTimeLineID); + List *timelineHistory = readTimeLineHistory(currTLI); XLogRecPtr endOfSegment; endOfSegment = ((wantPage / state->segcxt.ws_segsize) + 1) * @@ -830,7 +831,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, { XLogRecPtr read_upto, loc; - TimeLineID tli; + TimeLineID tli = ThisTimeLineID; int count; WALReadError errinfo; @@ -850,8 +851,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, if (!RecoveryInProgress()) read_upto = GetFlushRecPtr(); else - read_upto = GetXLogReplayRecPtr(); - tli = ThisTimeLineID; + read_upto = GetXLogReplayRecPtr(); /* * Check which timeline to get the record from. @@ -877,9 +877,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, * standby whose primary gets promoted while we're decoding, so a * one-off ERROR isn't too bad. */ - XLogReadDetermineTimeline(state, targetPagePtr, reqLen); + XLogReadDetermineTimeline(state, targetPagePtr, reqLen, tli); - if
Re: PITR promote bug: Checkpointer writes to older timeline
On 2021/03/03 17:46, Heikki Linnakangas wrote: On 03/03/2021 08:47, Kyotaro Horiguchi wrote: At Tue, 2 Mar 2021 17:56:03 -0800, Soumyadeep Chakraborty wrote in When there are prepared transactions in an older timeline, in the checkpointer, a call to CheckPointTwoPhase() and subsequently to XlogReadTwoPhaseData() and subsequently to read_local_xlog_page() leads to the following line: read_upto = GetXLogReplayRecPtr(); GetXLogReplayRecPtr() will change ThisTimeLineID to 1, in order to read the two phase WAL records in the older timeline. This variable will remain unchanged and the checkpointer ends up writing the checkpoint record into the older WAL segment (when XLogBeginInsert() is called within CreateCheckPoint(), the value is still 1). The value is not synchronized as even if RecoveryInProgress() is called, xlogctl->SharedRecoveryState is not RECOVERY_STATE_DONE (SharedRecoveryInProgress = true in older versions) as the startup process waits for the checkpointer inside RequestCheckpoint() (since recovery_target_action='promote' involves a non-fast promotion). Thus, InitXLOGAccess() is not called and the value of ThisTimeLineID is not updated before the checkpoint record write. Since 1148e22a82e, GetXLogReplayRecPtr() is called with ThisTimeLineID instead of a local variable, within read_local_xlog_page(). Confusing... PFA a small patch that fixes the problem by explicitly calling InitXLOGAccess() in CheckPointTwoPhase(), after the two phase state data is read, in order to update ThisTimeLineID to the latest timeline. It is okay to call InitXLOGAccess() as it is lightweight and would mostly be a no-op. It is correct that read_local_xlog_page() changes ThisTimeLineID, but InitXLOGAccess() is correctly called in CreateCheckPoint: | /* | * An end-of-recovery checkpoint is created before anyone is allowed to | * write WAL. To allow us to write the checkpoint record, temporarily | * enable XLogInsertAllowed. (This also ensures ThisTimeLineID is | * initialized, which we need here and in AdvanceXLInsertBuffer.) | */ | if (flags & CHECKPOINT_END_OF_RECOVERY) | LocalSetXLogInsertAllowed(); It seems to e suficcient to recover ThisTimeLineID from the checkpoint record to be written, as attached? I think it should be reset even earlier, inside XlogReadTwoPhaseData() probably. With your patch, doesn't the LogStandbySnapshot() call just above where you're ressetting ThisTimeLineID also write a WAL record with incorrect timeline? Even better, can we avoid setting ThisTimeLineID in XlogReadTwoPhaseData() in the first place? Or isn't it better to reset ThisTimeLineID in read_local_xlog_page(), i.e., prevent read_local_xlog_page() from changing ThisTimeLineID? I'm not sure if that's possible, though.. In the future other functions that calls read_local_xlog_page() during the promotion may appear. Fixing the issue outside read_local_xlog_page() may cause those functions to get the same issue. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: PITR promote bug: Checkpointer writes to older timeline
On 03/03/2021 08:47, Kyotaro Horiguchi wrote: At Tue, 2 Mar 2021 17:56:03 -0800, Soumyadeep Chakraborty wrote in When there are prepared transactions in an older timeline, in the checkpointer, a call to CheckPointTwoPhase() and subsequently to XlogReadTwoPhaseData() and subsequently to read_local_xlog_page() leads to the following line: read_upto = GetXLogReplayRecPtr(); GetXLogReplayRecPtr() will change ThisTimeLineID to 1, in order to read the two phase WAL records in the older timeline. This variable will remain unchanged and the checkpointer ends up writing the checkpoint record into the older WAL segment (when XLogBeginInsert() is called within CreateCheckPoint(), the value is still 1). The value is not synchronized as even if RecoveryInProgress() is called, xlogctl->SharedRecoveryState is not RECOVERY_STATE_DONE (SharedRecoveryInProgress = true in older versions) as the startup process waits for the checkpointer inside RequestCheckpoint() (since recovery_target_action='promote' involves a non-fast promotion). Thus, InitXLOGAccess() is not called and the value of ThisTimeLineID is not updated before the checkpoint record write. Since 1148e22a82e, GetXLogReplayRecPtr() is called with ThisTimeLineID instead of a local variable, within read_local_xlog_page(). Confusing... PFA a small patch that fixes the problem by explicitly calling InitXLOGAccess() in CheckPointTwoPhase(), after the two phase state data is read, in order to update ThisTimeLineID to the latest timeline. It is okay to call InitXLOGAccess() as it is lightweight and would mostly be a no-op. It is correct that read_local_xlog_page() changes ThisTimeLineID, but InitXLOGAccess() is correctly called in CreateCheckPoint: | /* |* An end-of-recovery checkpoint is created before anyone is allowed to |* write WAL. To allow us to write the checkpoint record, temporarily |* enable XLogInsertAllowed. (This also ensures ThisTimeLineID is |* initialized, which we need here and in AdvanceXLInsertBuffer.) |*/ | if (flags & CHECKPOINT_END_OF_RECOVERY) | LocalSetXLogInsertAllowed(); It seems to e suficcient to recover ThisTimeLineID from the checkpoint record to be written, as attached? I think it should be reset even earlier, inside XlogReadTwoPhaseData() probably. With your patch, doesn't the LogStandbySnapshot() call just above where you're ressetting ThisTimeLineID also write a WAL record with incorrect timeline? Even better, can we avoid setting ThisTimeLineID in XlogReadTwoPhaseData() in the first place? - Heikki
Re: PITR promote bug: Checkpointer writes to older timeline
At Tue, 2 Mar 2021 17:56:03 -0800, Soumyadeep Chakraborty wrote in > Hello hackers, > > We came across an issue where the checkpointer writes to the older > timeline while a promotion is ongoing after reaching the recovery point > in a PITR, when there are prepared transactions before the recovery > point. We came across this issue first in REL_12_STABLE and saw that it > also exists in devel. Good Catch! I can reproduce that. > When there are prepared transactions in an older timeline, in the > checkpointer, a call to CheckPointTwoPhase() and subsequently to > XlogReadTwoPhaseData() and subsequently to read_local_xlog_page() leads > to the following line: > > read_upto = GetXLogReplayRecPtr(); > > GetXLogReplayRecPtr() will change ThisTimeLineID to 1, in order to read > the two phase WAL records in the older timeline. This variable will > remain unchanged and the checkpointer ends up writing the checkpoint > record into the older WAL segment (when XLogBeginInsert() is called > within CreateCheckPoint(), the value is still 1). The value is not > synchronized as even if RecoveryInProgress() is called, > xlogctl->SharedRecoveryState is not RECOVERY_STATE_DONE > (SharedRecoveryInProgress = true in older versions) as the startup > process waits for the checkpointer inside RequestCheckpoint() (since > recovery_target_action='promote' involves a non-fast promotion). Thus, > InitXLOGAccess() is not called and the value of ThisTimeLineID is not > updated before the checkpoint record write. > > Since 1148e22a82e, GetXLogReplayRecPtr() is called with ThisTimeLineID > instead of a local variable, within read_local_xlog_page(). > > PFA a small patch that fixes the problem by explicitly calling > InitXLOGAccess() in CheckPointTwoPhase(), after the two phase state data > is read, in order to update ThisTimeLineID to the latest timeline. It is > okay to call InitXLOGAccess() as it is lightweight and would mostly be > a no-op. It is correct that read_local_xlog_page() changes ThisTimeLineID, but InitXLOGAccess() is correctly called in CreateCheckPoint: | /* |* An end-of-recovery checkpoint is created before anyone is allowed to |* write WAL. To allow us to write the checkpoint record, temporarily |* enable XLogInsertAllowed. (This also ensures ThisTimeLineID is |* initialized, which we need here and in AdvanceXLInsertBuffer.) |*/ | if (flags & CHECKPOINT_END_OF_RECOVERY) | LocalSetXLogInsertAllowed(); It seems to e suficcient to recover ThisTimeLineID from the checkpoint record to be written, as attached? regareds. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 377afb8732..06d3ce676e 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9088,6 +9088,13 @@ CreateCheckPoint(int flags) if (!shutdown && XLogStandbyInfoActive()) LogStandbySnapshot(); + /* + * ThisTimeLineID may have moved so far so we need to restore it so that + * the checkpoint record goes to the right timeline. (Currently + * CheckPointTwoPhase() is known to have a chance to change it.) + */ + ThisTimeLineID = checkPoint.ThisTimeLineID; + START_CRIT_SECTION(); /*
PITR promote bug: Checkpointer writes to older timeline
Hello hackers, We came across an issue where the checkpointer writes to the older timeline while a promotion is ongoing after reaching the recovery point in a PITR, when there are prepared transactions before the recovery point. We came across this issue first in REL_12_STABLE and saw that it also exists in devel. Setup: PFA a minimal reproduction script repro.txt. After running the script, we notice that the checkpointer has written the end-of-recovery shutdown checkpoint in the previous timeline (tli = 1), i.e. it wrote into the WAL segment 00010003 instead of writing to the WAL segment 00020003, causing it to overwrite WAL records past the recovery point (please see attached diff output file waldiff.diff) in 00010003. Also, performing a subsequent shutdown on the recovered server may cause the next shutdown checkpoint record to be written, again, to the previous timeline, i.e. to 00010003. A subsequent server start will fail as the starup process will be unable to find the checkpoint in the latest timeline (00020003) and we will get: ... LOG: invalid record length at 0/3016FB8: wante d 24, got 0 LOG: invalid primary checkpoint record PANIC: could not locate a valid checkpoint record ... RCA: When there are prepared transactions in an older timeline, in the checkpointer, a call to CheckPointTwoPhase() and subsequently to XlogReadTwoPhaseData() and subsequently to read_local_xlog_page() leads to the following line: read_upto = GetXLogReplayRecPtr(); GetXLogReplayRecPtr() will change ThisTimeLineID to 1, in order to read the two phase WAL records in the older timeline. This variable will remain unchanged and the checkpointer ends up writing the checkpoint record into the older WAL segment (when XLogBeginInsert() is called within CreateCheckPoint(), the value is still 1). The value is not synchronized as even if RecoveryInProgress() is called, xlogctl->SharedRecoveryState is not RECOVERY_STATE_DONE (SharedRecoveryInProgress = true in older versions) as the startup process waits for the checkpointer inside RequestCheckpoint() (since recovery_target_action='promote' involves a non-fast promotion). Thus, InitXLOGAccess() is not called and the value of ThisTimeLineID is not updated before the checkpoint record write. Since 1148e22a82e, GetXLogReplayRecPtr() is called with ThisTimeLineID instead of a local variable, within read_local_xlog_page(). PFA a small patch that fixes the problem by explicitly calling InitXLOGAccess() in CheckPointTwoPhase(), after the two phase state data is read, in order to update ThisTimeLineID to the latest timeline. It is okay to call InitXLOGAccess() as it is lightweight and would mostly be a no-op. Regards, Soumyadeep, Kevin and Jimmy VMWare #! /bin/sh # Primary setup /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data echo "archive_mode = on archive_command = 'cp %p /tmp/archive/%f' max_prepared_transactions = 10" >> /usr/local/pgsql/data/postgresql.conf /usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l /tmp/logfile_p start # Secondary setup /usr/local/pgsql/bin/pg_basebackup -D /usr/local/pgsql/data_s -X stream && touch /usr/local/pgsql/data_s/recovery.signal echo "port = 6432 restore_command = 'cp /tmp/archive/%f %p' recovery_target_name = 'rp' recovery_target_action = 'promote' max_prepared_transactions = 20" >> /usr/local/pgsql/data_s/postgresql.conf # Workload echo "CREATE TABLE foo(i int); BEGIN; INSERT INTO foo VALUES(1); PREPARE TRANSACTION 'fooinsert'; SELECT pg_create_restore_point('rp'); INSERT INTO foo VALUES(2); SELECT pg_switch_wal(); " > /tmp/workload.sql /usr/local/pgsql/bin/psql postgres -f /tmp/workload.sql /usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data_s -l /tmp/logfile_s start sleep 5 /usr/local/pgsql/bin/pg_waldump /tmp/archive/00010003 > /tmp/archive_00010003 /usr/local/pgsql/bin/pg_waldump /usr/local/pgsql/data_s/pg_wal/00010003 > /tmp/standby_00010003 diff -u /tmp/archive_00010003 /tmp/standby_00010003 > /tmp/waldiff.diff cat /tmp/waldiff.diff From dbf5a9f6899bedf28b482fc03a4a2b0571e92e9b Mon Sep 17 00:00:00 2001 From: Soumyadeep Chakraborty Date: Tue, 2 Mar 2021 17:41:20 -0800 Subject: [PATCH 1/1] Prevent checkpointer from writing to older timeline Co-authored-by: Kevin Yeap --- src/backend/access/transam/twophase.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index 80d2d20d6cc..081aee6217e 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -1740,6 +1740,16 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon) } LWLockRelease(TwoPhaseStateLock); + if (serialized_xacts > 0) + { + /* + * In order to ensure that we write the checkpoint record to the latest + * timeline, refresh ThisTimeLineID.