Re: PITR promote bug: Checkpointer writes to older timeline

2021-06-27 Thread Tom Lane
I wrote:
> It sure looks like recovering a prepared
> transaction creates a transient state in which a new backend will
> compute a broken snapshot.

Oh, after further digging this is the same issue discussed here:

https://www.postgresql.org/message-id/flat/20210422203603.fdnh3fu2mmfp2iov%40alap3.anarazel.de

regards, tom lane




Re: PITR promote bug: Checkpointer writes to older timeline

2021-06-27 Thread Tom Lane
I wrote:
> Buildfarm member hornet just reported a failure in this test:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet=2021-06-27%2013%3A40%3A57
> It's not clear whether this is a problem with the test case or an
> actual server bug, but I'm leaning to the latter theory.  My gut
> feel is it's some problem in the "snapshot scalability" work.  It
> doesn't look the same as the known open issue, but maybe related?

Hmm, the plot thickens.  I scraped the buildfarm logs for similar-looking
assertion failures back to last August, when the snapshot scalability
patches went in.  The first such failure is not until 2021-03-24
(see attachment), and they all look to be triggered by
023_pitr_prepared_xact.pl.  It sure looks like recovering a prepared
transaction creates a transient state in which a new backend will
compute a broken snapshot.

regards, tom lane

sysname| branch |  snapshot   | stage | 

  l 
  
---++-+---+---
 calliphoridae | HEAD   | 2021-03-24 06:50:09 | recoveryCheck | TRAP: 
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", 
File: 
"/mnt/resource/andres/bf/calliphoridae/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/procarray.c",
 Line: 2463, PID: 3890215)
 francolin | HEAD   | 2021-03-29 16:21:58 | recoveryCheck | TRAP: 
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", 
File: 
"/mnt/resource/andres/bf/francolin/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/procarray.c",
 Line: 2463, PID: 1861665)
 moonjelly | HEAD   | 2021-04-01 15:25:38 | recoveryCheck | TRAP: 
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", 
File: "procarray.c", Line: 2463, PID: 2345153)
 francolin | HEAD   | 2021-04-07 12:30:08 | recoveryCheck | TRAP: 
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", 
File: 
"/mnt/resource/andres/bf/francolin/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/procarray.c",
 Line: 2468, PID: 3257637)
 fairywren | HEAD   | 2021-04-20 03:04:04 | recoveryCheck | TRAP: 
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", 
File: 
"C:/tools/msys64/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/procarray.c",
 Line: 2094, PID: 94824)
 mantid| HEAD   | 2021-04-25 10:07:06 | recoveryCheck | TRAP: 
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", 
File: "procarray.c", Line: 2094, PID: 2820886)
 thorntail | HEAD   | 2021-04-29 07:18:09 | recoveryCheck | TRAP: 
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", 
File: 
"/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/procarray.c",
 Line: 2094, PID: 3099560)
 mantid| HEAD   | 2021-05-03 13:07:06 | recoveryCheck | TRAP: 
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", 
File: "procarray.c", Line: 2094, PID: 1163004)
 mantid| HEAD   | 2021-05-10 01:07:07 | recoveryCheck | TRAP: 
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", 
File: "procarray.c", Line: 2468, PID: 2812704)
 hornet| HEAD   | 2021-06-27 13:40:57 | recoveryCheck | TRAP: 
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", 
File: "procarray.c", Line: 2492, PID: 11862234)
(10 rows)



Re: PITR promote bug: Checkpointer writes to older timeline

2021-06-27 Thread Tom Lane
Michael Paquier  writes:
> I have been working on that over the last couple of days, and applied
> a fix down to 10.  One thing that I did not like in the test was the
> use of compare() to check if the contents of the WAL segment before
> and after the timeline jump remained the same as this would have been
> unstable with any concurrent activity.  Instead, I have added a phase
> at the end of the test with an extra checkpoint and recovery triggered
> once, which is enough to reproduce the PANIC reported at the top of
> the thread.

Buildfarm member hornet just reported a failure in this test:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet=2021-06-27%2013%3A40%3A57

the critical bit being

2021-06-27 17:35:46.504 UTC [11862234:1] [unknown] LOG:  connection received: 
host=[local]
2021-06-27 17:35:46.505 UTC [18350260:12] LOG:  recovering prepared transaction 
734 from shared memory
TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, 
RecentXmin)", File: "procarray.c", Line: 2492, PID: 11862234)
2021-06-27 17:35:46.511 UTC [14876838:4] LOG:  database system is ready to 
accept connections

It's not clear whether this is a problem with the test case or an
actual server bug, but I'm leaning to the latter theory.  My gut
feel is it's some problem in the "snapshot scalability" work.  It
doesn't look the same as the known open issue, but maybe related?

regards, tom lane




Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-21 Thread Michael Paquier
On Thu, Mar 18, 2021 at 12:56:12PM +0900, Michael Paquier wrote:
> I was looking at uses of ThisTimeLineID in the wild, and could not
> find it getting checked or used actually in backend-side code that
> involved the WAL reader facility.  Even if it brings confidence, it
> does not mean that it is not used somewhere :/

I have been working on that over the last couple of days, and applied
a fix down to 10.  One thing that I did not like in the test was the
use of compare() to check if the contents of the WAL segment before
and after the timeline jump remained the same as this would have been
unstable with any concurrent activity.  Instead, I have added a phase
at the end of the test with an extra checkpoint and recovery triggered
once, which is enough to reproduce the PANIC reported at the top of
the thread.

I'll look into clarifying the use of ThisTimeLineID within the those
WAL reader callbacks, because this is really bug-prone in the long
term...  This requires some coordination with the recent work aimed at
adding some logical decoding support in standbys, though.
--
Michael


signature.asc
Description: PGP signature


Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-17 Thread Michael Paquier
On Wed, Mar 17, 2021 at 05:09:50PM +0900, Michael Paquier wrote:
> Currently with HEAD and back branches, nothing would be broken as
> logical contexts cannot exist in recovery.  Still it would be easy
> to miss the new behavior for anybody attempting to work more on this
> feature in the future if we change read_local_xlog_page to not update
> ThisTimeLineID anymore.  Resetting the value of ThisTimeLineID in
> read_local_xlog_page() does not seem completely right either with this
> argument, as they could be some custom code relying on the existing
> behavior of read_local_xlog_page() to maintain ThisTimeLineID.

I was looking at uses of ThisTimeLineID in the wild, and could not
find it getting checked or used actually in backend-side code that
involved the WAL reader facility.  Even if it brings confidence, it
does not mean that it is not used somewhere :/

> Hmmm.  I am wondering whether the best answer for the moment would not
> be to save and reset ThisTimeLineID just in XlogReadTwoPhaseData(), as
> that's the local change that uses read_local_xlog_page().

And attached is the patch able to achieve that.  At least it is
simple, and does not break the actual assumptions this callback relies
on.  This is rather weak though if there are errors as this is out of
critical sections, still the disease is worse.  I have double-checked
all the existing backend code that uses XLogReadRecord(), and did not
notice any code paths with issues similar to this one.

> The state of the code is really confusing on HEAD, and I'd like to
> think that the best thing we could do in the long-term is to make the
> logical decoding path not rely on ThisTimeLineID at all and decouple
> all that, putting the code in a state sane enough so as we don't
> finish with similar errors as what is discussed on this thread.  That
> would be a work for a different patch though, not for stable
> branches.  And seeing some slot and advancing functions update
> directly ThisTimeLineID is no good either..

However, I'd like to think that we should completely untie the
dependency to ThisTimeLineID in any page read callbacks in core in the
long-term, and potentially clean up any assumptions behind timeline
jumps while in recovery for logical contexts as that cannot happen.
At this stage of the 14 dev cycle, that would be material for 15~, but
I also got to wonder if there is work going on to support logical
decoding on standbys, in particular if this would really rely on
ThisTimeLineID.

Thoughts are welcome.
--
Michael
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c16f..80be6cc7a4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1316,7 +1316,11 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
  * twophase files and ReadTwoPhaseFile should be used instead.
  *
  * Note clearly that this function can access WAL during normal operation,
- * similarly to the way WALSender or Logical Decoding would do.
+ * similarly to the way WALSender or Logical Decoding would do.  While
+ * accessing WAL, read_local_xlog_page() may change ThisTimeLineID,
+ * particularly if this routine is called for the end-of-recovery checkpoint
+ * in the checkpointer itself, so save its existing value and restore it
+ * once done.
  */
 static void
 XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
@@ -1324,6 +1328,7 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
+	TimeLineID	save_currtli = ThisTimeLineID;
 
 	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
 	XL_ROUTINE(.page_read = _local_xlog_page,
@@ -1338,6 +1343,14 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 
 	XLogBeginRead(xlogreader, lsn);
 	record = XLogReadRecord(xlogreader, );
+
+	/*
+	 * Restore immediately the timeline where it was previously, as
+	 * read_local_xlog_page() could have changed it if the record was read
+	 * while recovery was finishing or if the timeline has jumped in-between.
+	 */
+	ThisTimeLineID = save_currtli;
+
 	if (record == NULL)
 		ereport(ERROR,
 (errcode_for_file_access(),
diff --git a/src/test/recovery/t/022_pitr_prepared_xact.pl b/src/test/recovery/t/022_pitr_prepared_xact.pl
new file mode 100644
index 00..3a7907b2a0
--- /dev/null
+++ b/src/test/recovery/t/022_pitr_prepared_xact.pl
@@ -0,0 +1,86 @@
+# Test for point-in-time-recovery (PITR) with prepared transactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use File::Compare;
+
+# Initialize and start primary node with WAL archiving
+my $node_primary = get_new_node('primary');
+$node_primary->init(has_archiving => 1);
+$node_primary->append_conf('postgresql.conf', qq{
+max_wal_senders = 10
+wal_level = 'replica'
+max_prepared_transactions = 10});
+$node_primary->start;
+
+# Take backup
+my $backup_name = 'my_backup';

Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-17 Thread Michael Paquier
On Mon, Mar 15, 2021 at 04:38:08PM +0900, Michael Paquier wrote:
> On Mon, Mar 15, 2021 at 03:01:09PM +0900, Kyotaro Horiguchi wrote:
>> Logical decoding stuff is (I think) designed to turn any backend into
>> a walsender, which may need to maintain ThisTimeLineID.  It seems to
>> me that logical decoding stuff indents to maintain ThisTimeLineID of
>> such backends at reading a WAL record. logical_read_xlog_page also
>> updates ThisTimeLineID and pg_logical_slot_get_changes_guts(),
>> pg_replication_slot_advance() (and maybe other functions) updates
>> ThisTimeLineID. So it is natural that local_read_xlog_page() updates
>> it since it is intended to be used used in logical decoding plugins.
>
> Logical decoding contexts cannot be created while in recovery as per
> CheckLogicalDecodingRequirements(), and as mentioned not everything
> is
> in place to allow that.  FWIW, I think that it is just confusing for
> pg_replication_slot_advance() and pg_logical_slot_get_changes_guts()
> to update it, and we just look for the latest value each time it is
> necessary when reading a new WAL page.

Studying some history today, having read_local_xlog_page() directly
update ThisTimeLineID has been extensively discussed here back in 2017
to attempt to introduce logical decoding on standbys (1148e22a):
https://www.postgresql.org/message-id/CAMsr%2BYEVmBJ%3DdyLw%3D%2BkTihmUnGy5_EW4Mig5T0maieg_Zu%3DXCg%40mail.gmail.com

Currently with HEAD and back branches, nothing would be broken as
logical contexts cannot exist in recovery.  Still it would be easy
to miss the new behavior for anybody attempting to work more on this
feature in the future if we change read_local_xlog_page to not update
ThisTimeLineID anymore.  Resetting the value of ThisTimeLineID in
read_local_xlog_page() does not seem completely right either with this
argument, as they could be some custom code relying on the existing
behavior of read_local_xlog_page() to maintain ThisTimeLineID.

Hmmm.  I am wondering whether the best answer for the moment would not
be to save and reset ThisTimeLineID just in XlogReadTwoPhaseData(), as
that's the local change that uses read_local_xlog_page().

The state of the code is really confusing on HEAD, and I'd like to
think that the best thing we could do in the long-term is to make the
logical decoding path not rely on ThisTimeLineID at all and decouple
all that, putting the code in a state sane enough so as we don't
finish with similar errors as what is discussed on this thread.  That
would be a work for a different patch though, not for stable
branches.  And seeing some slot and advancing functions update
directly ThisTimeLineID is no good either..

Any thoughts?
--
Michael


signature.asc
Description: PGP signature


Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-15 Thread Michael Paquier
On Mon, Mar 15, 2021 at 03:01:09PM +0900, Kyotaro Horiguchi wrote:
> Logical decoding stuff is (I think) designed to turn any backend into
> a walsender, which may need to maintain ThisTimeLineID.  It seems to
> me that logical decoding stuff indents to maintain ThisTimeLineID of
> such backends at reading a WAL record. logical_read_xlog_page also
> updates ThisTimeLineID and pg_logical_slot_get_changes_guts(),
> pg_replication_slot_advance() (and maybe other functions) updates
> ThisTimeLineID. So it is natural that local_read_xlog_page() updates
> it since it is intended to be used used in logical decoding plugins.

Logical decoding contexts cannot be created while in recovery as per
CheckLogicalDecodingRequirements(), and as mentioned not everything is
in place to allow that.  FWIW, I think that it is just confusing for
pg_replication_slot_advance() and pg_logical_slot_get_changes_guts()
to update it, and we just look for the latest value each time it is
necessary when reading a new WAL page.

> It is initialized by IndentifySystem(). And the logical walsender
> intends to maintain ThisTimeLineID by subsequent calls to
> GetStandbyFlushRecPtr(), which happen in logical_read_xlog_page().

I don't understand this part about logical_read_xlog_page(), though.
Do you mean a different routine or a different code path?

> I agree that it's better that the replay TLI is stored in a separate
> variable. It is what I was complained on in the previous mails. (It
> might not have been so obvious, though..)

Okay.  I understood that this was what you implied.

> I don't think there's any acutual user of the function for the
> purpose, but..  Anyawy if we remove the update of ThisTimeLineID from
> read_local_xlog_page, I think we should remove or rewrite the
> following comment for the function.  It no longer works as written in
> the catchphrase.

Who knows.  We cannot know all the users of this code, and the API is
public.

> > * Public because it would likely be very helpful for someone writing another
> > * output method outside walsender, e.g. in a bgworker.

I don't see a reason to remove this comment as this routine can still
be useful for many purposes.  What kind of rewording do you have in
mind?
--
Michael


signature.asc
Description: PGP signature


Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-15 Thread Kyotaro Horiguchi
At Sun, 14 Mar 2021 17:59:59 +0900, Michael Paquier  wrote 
in 
> On Thu, Mar 04, 2021 at 05:10:36PM +0900, Kyotaro Horiguchi wrote:
> > read_local_xlog_page is *designed* to maintain ThisTimeLineID.
> > Currently it doesn't seem utilized but I think it's sufficiently
> > reasonable that the function maintains ThisTimeLineID.
> 
> I don't quite follow this line of thoughts.  ThisTimeLineID is
> designed to remain 0 while recovery is running in most processes
> (at the close exception of a WAL sender with a cascading setup,

The reason for the "0" is they just aren't interested in the value.
Checkpointer temporarily uses it then restore to 0 soon.

> physical or logical, of course), so why is there any business for
> read_local_xlog_page() to touch this field at all while in recovery to
> begin with?

Logical decoding stuff is (I think) designed to turn any backend into
a walsender, which may need to maintain ThisTimeLineID.  It seems to
me that logical decoding stuff indents to maintain ThisTimeLineID of
such backends at reading a WAL record. logical_read_xlog_page also
updates ThisTimeLineID and pg_logical_slot_get_changes_guts(),
pg_replication_slot_advance() (and maybe other functions) updates
ThisTimeLineID. So it is natural that local_read_xlog_page() updates
it since it is intended to be used used in logical decoding plugins.

> I equally find confusing that XLogReadDetermineTimeline() relies on a
> specific value of ThisTimeLineID in its own logic, while it clearly
> states that all its callers have to read the current active TLI
> beforehand.  So I think that the existing logic is pretty weak, and
> that resetting the field is an incorrect approach?  It seems to me

It is initialized by IndentifySystem(). And the logical walsender
intends to maintain ThisTimeLineID by subsequent calls to
GetStandbyFlushRecPtr(), which happen in logical_read_xlog_page().

> that we had better not change ThisTimeLineID actively while in
> recovery in this code path and just let others do the job, like
> RecoveryInProgress() once recovery finishes, or
> GetStandbyFlushRecPtr() for a WAL sender.  And finally, we should
> store the current TLI used for replay in a separate variable that gets
> passed down to the XLogReadDetermineTimeline() as argument.

I agree that it's better that the replay TLI is stored in a separate
variable. It is what I was complained on in the previous mails. (It
might not have been so obvious, though..)

> While going through it, I have simplified a bit the proposed TAP tests
> (thanks for replacing the sleep() call, Soumyadeep.  This would have
> made the test slower for nothing on fast machines, and it would cause
> failures on very slow machines).
> 
> The attached fixes the original issue for me, keeping all the records
> in their correct timeline.  And I have not been able to break
> cascading setups.  If it happens that such cases actually break, we
> have holes in our existing test coverage that should be improved.  I
> cannot see anything fancy missing on this side, though.
> 
> Any thoughts?

I don't think there's any acutual user of the function for the
purpose, but..  Anyawy if we remove the update of ThisTimeLineID from
read_local_xlog_page, I think we should remove or rewrite the
following comment for the function.  It no longer works as written in
the catchphrase.

> * Public because it would likely be very helpful for someone writing another
> * output method outside walsender, e.g. in a bgworker.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-14 Thread Michael Paquier
On Thu, Mar 04, 2021 at 05:10:36PM +0900, Kyotaro Horiguchi wrote:
> read_local_xlog_page is *designed* to maintain ThisTimeLineID.
> Currently it doesn't seem utilized but I think it's sufficiently
> reasonable that the function maintains ThisTimeLineID.

I don't quite follow this line of thoughts.  ThisTimeLineID is
designed to remain 0 while recovery is running in most processes
(at the close exception of a WAL sender with a cascading setup,
physical or logical, of course), so why is there any business for
read_local_xlog_page() to touch this field at all while in recovery to
begin with?

I equally find confusing that XLogReadDetermineTimeline() relies on a
specific value of ThisTimeLineID in its own logic, while it clearly
states that all its callers have to read the current active TLI
beforehand.  So I think that the existing logic is pretty weak, and
that resetting the field is an incorrect approach?  It seems to me
that we had better not change ThisTimeLineID actively while in
recovery in this code path and just let others do the job, like
RecoveryInProgress() once recovery finishes, or
GetStandbyFlushRecPtr() for a WAL sender.  And finally, we should
store the current TLI used for replay in a separate variable that gets
passed down to the XLogReadDetermineTimeline() as argument.

While going through it, I have simplified a bit the proposed TAP tests
(thanks for replacing the sleep() call, Soumyadeep.  This would have
made the test slower for nothing on fast machines, and it would cause
failures on very slow machines).

The attached fixes the original issue for me, keeping all the records
in their correct timeline.  And I have not been able to break
cascading setups.  If it happens that such cases actually break, we
have holes in our existing test coverage that should be improved.  I
cannot see anything fancy missing on this side, though.

Any thoughts?
--
Michael
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 9ac602b674..34d6286d75 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -56,7 +56,9 @@ extern void wal_segment_open(XLogReaderState *state,
 extern void wal_segment_close(XLogReaderState *state);
 
 extern void XLogReadDetermineTimeline(XLogReaderState *state,
-	  XLogRecPtr wantPage, uint32 wantLength);
+	  XLogRecPtr wantPage,
+	  uint32 wantLength,
+	  TimeLineID readtli);
 
 extern void WALReadRaiseError(WALReadError *errinfo);
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index d17d660f46..9695653bc2 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -658,6 +658,13 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
  * wantLength to the amount of the page that will be read, up to
  * XLOG_BLCKSZ. If the amount to be read isn't known, pass XLOG_BLCKSZ.
  *
+ * readtli is the current timeline as found by the caller of this routine.
+ * When not in recovery, this should be ThisTimeLineID.  As ThisTimeLineID
+ * remains set to 0 for most processes while in recovery, the caller ought
+ * to provide the timeline number given as a result of GetXLogReplayRecPtr()
+ * instead (for a WAL sender this would actually be ThisTimeLineID as the
+ * field gets updated in a cascading WAL sender).
+ *
  * We switch to an xlog segment from the new timeline eagerly when on a
  * historical timeline, as soon as we reach the start of the xlog segment
  * containing the timeline switch.  The server copied the segment to the new
@@ -679,12 +686,13 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
  *
  * The caller must also make sure it doesn't read past the current replay
  * position (using GetXLogReplayRecPtr) if executing in recovery, so it
- * doesn't fail to notice that the current timeline became historical. The
- * caller must also update ThisTimeLineID with the result of
- * GetXLogReplayRecPtr and must check RecoveryInProgress().
+ * doesn't fail to notice that the current timeline became historical.
  */
 void
-XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
+XLogReadDetermineTimeline(XLogReaderState *state,
+		  XLogRecPtr wantPage,
+		  uint32 wantLength,
+		  TimeLineID readtli)
 {
 	const XLogRecPtr lastReadPage = (state->seg.ws_segno *
 	 state->segcxt.ws_segsize + state->segoff);
@@ -712,12 +720,12 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	 * just carry on. (Seeking backwards requires a check to make sure the
 	 * older page isn't on a prior timeline).
 	 *
-	 * ThisTimeLineID might've become historical since we last looked, but the
+	 * readtli might've become historical since we last looked, but the
 	 * caller is required not to read past the flush limit it saw at the time
 	 * it looked up the timeline. There's nothing we can do about it if
 	 * StartupXLOG() 

Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-13 Thread Soumyadeep Chakraborty
Hello,

PFA version 2 of the TAP test. I removed the non-deterministic sleep
and introduced retries until the WAL segment is archived and promotion
is complete. Some additional tidying up too.

Regards,
Soumyadeep (VMware)
diff --git a/src/test/recovery/t/022_pitr_prepared_xact.pl b/src/test/recovery/t/022_pitr_prepared_xact.pl
new file mode 100644
index 000..8738fa39d4d
--- /dev/null
+++ b/src/test/recovery/t/022_pitr_prepared_xact.pl
@@ -0,0 +1,83 @@
+# Test for point-in-time-recovery (PITR) with prepared transactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use File::Compare;
+
+# Initialize and start primary node with WAL archiving
+my $node_primary = get_new_node('primary');
+$node_primary->init(has_archiving => 1);
+$node_primary->append_conf('postgresql.conf', "max_wal_senders = 10");
+$node_primary->append_conf('postgresql.conf', "wal_level = 'replica'");
+$node_primary->append_conf('postgresql.conf',
+	"max_prepared_transactions = 10");
+$node_primary->start;
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Initialize node for PITR targeting a specific restore point
+my $node_pitr = get_new_node('node_pitr');
+$node_pitr->init_from_backup(
+	$node_primary, $backup_name,
+	standby   => 0,
+	has_restoring => 1);
+$node_pitr->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_pitr->append_conf('postgresql.conf', "recovery_target_name = 'rp'");
+$node_pitr->append_conf('postgresql.conf',
+	"recovery_target_action = 'promote'");
+
+# Workload with a prepared transaction and the target restore point
+$node_primary->psql(
+	'postgres', qq{
+CREATE TABLE foo(i int);
+BEGIN;
+INSERT INTO foo VALUES(1);
+PREPARE TRANSACTION 'fooinsert';
+SELECT pg_create_restore_point('rp');
+INSERT INTO foo VALUES(2);
+});
+
+# Find next WAL segment to be archived
+my $walfile_to_be_archived = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn());");
+
+# Make WAL segment eligible for archival
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Wait until the WAL segment has been archived
+my $archive_wait_query =
+  "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;";
+$node_primary->poll_query_until('postgres', $archive_wait_query)
+  or die "Timed out while waiting for WAL segment to be archived";
+my $last_archived_wal_file = $walfile_to_be_archived;
+
+# Now start the PITR node
+$node_pitr->start;
+
+# Wait until the PITR node exits recovery
+$node_pitr->poll_query_until('postgres', "SELECT pg_is_in_recovery() = 'f';")
+  or die "Timed out while waiting for PITR promotion";
+
+# Ensure that we didn't write to the older timeline during PITR promotion by
+# checking that the last archived WAL segment was not overwritten during recovery
+my $archive_dir   = $node_primary->archive_dir;
+my $archive_wal_file_path = "$archive_dir/$last_archived_wal_file";
+my $node_pitr_data= $node_pitr->data_dir;
+my $local_wal_file_path   = "$node_pitr_data/pg_wal/$last_archived_wal_file";
+is(compare($archive_wal_file_path, $local_wal_file_path),
+	qq(0), "check if the last archived WAL file was overwritten");
+
+# Commit the prepared transaction in the latest timeline and check the result.
+# There should only be one row (from the prepared transaction). The row from
+# the INSERT after the restore point should not show up, since our recovery
+# target preceded said INSERT
+$node_pitr->psql(
+	'postgres', qq{
+COMMIT PREPARED 'fooinsert';
+});
+my $result = $node_pitr->safe_psql('postgres', "SELECT * FROM foo;");
+is($result, qq(1), "check table contents after COMMIT PREPARED");


Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-04 Thread Soumyadeep Chakraborty
Hey all,

I took a stab at a quick and dirty TAP test (my first ever). So it
can probably be improved a lot. Please take a look.

On Thu, Mar 04, 2021 at 10:28:31AM +0900, Kyotaro Horiguchi wrote:

> 2. Restore ThisTimeLineID after calling XLogReadRecord() in the
>   *caller* side.  This is what came up to me first but I don't like
>   this, too, but I don't find better fix.  way.

+1 to this patch [1].
The above TAP test passes with this patch applied.

[1] 
https://www.postgresql.org/message-id/attachment/119972/dont_change_thistimelineid.patch

Regards,
Soumyadeep

Regards,
Soumyadeep
diff --git a/src/test/recovery/t/022_pitr_prepared_xact.pl b/src/test/recovery/t/022_pitr_prepared_xact.pl
new file mode 100644
index 000..b8d5146bb9d
--- /dev/null
+++ b/src/test/recovery/t/022_pitr_prepared_xact.pl
@@ -0,0 +1,69 @@
+# Test for timeline switch
+use strict;
+use warnings;
+use File::Path qw(rmtree);
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use File::Compare;
+use Time::HiRes qw(usleep);
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+$node_primary->init(has_archiving => 1);
+$node_primary->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_primary->append_conf('postgresql.conf', "max_wal_senders = 10");
+$node_primary->append_conf('postgresql.conf', "wal_level = 'replica'");
+$node_primary->start;
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+my $node_pitr = get_new_node('node_pitr');
+$node_pitr->init_from_backup($node_primary, $backup_name,
+standby => 0, has_restoring => 1);
+$node_pitr->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_pitr->append_conf('postgresql.conf', "recovery_target_name = 'rp'");
+$node_pitr->append_conf('postgresql.conf', "recovery_target_action = 'promote'");
+
+# Workload with prepared transaction
+$node_primary->psql(
+'postgres', qq{
+CREATE TABLE foo(i int);
+BEGIN;
+INSERT INTO foo VALUES(1);
+PREPARE TRANSACTION 'fooinsert';
+SELECT pg_create_restore_point('rp');
+INSERT INTO foo VALUES(2);
+SELECT pg_switch_wal();
+});
+
+# Sleep 5s to ensure that the WAL has been archived.
+# probably can be replaced by a wait
+usleep(500);
+
+$node_pitr->start;
+
+my $last_archived_wal_file_name = $node_primary->safe_psql('postgres',
+"SELECT last_archived_wal FROM pg_stat_archiver;");
+
+# Ensure that we don't write to the older timeline during PITR promotion by
+# ensuring that the last archived WAL file was not overwritten during recovery.
+my $archive_dir = $node_primary->archive_dir;
+my $archive_wal_file_path = "${archive_dir}/${last_archived_wal_file_name}";
+my $node_pitr_data = $node_pitr->data_dir;
+my $local_wal_file_path = "${node_pitr_data}/pg_wal/${last_archived_wal_file_name}";
+is(compare($archive_wal_file_path, $local_wal_file_path), qq(0), "Check if the last archived WAL file was overwritten");
+
+$node_pitr->psql(
+'postgres', qq{
+COMMIT PREPARED 'fooinsert';
+});
+
+my $foo_select_result = $node_pitr->safe_psql('postgres',
+"SELECT * FROM foo;");
+
+is($foo_select_result, qq(1), "Check foo select result");


Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-04 Thread Kyotaro Horiguchi
At Thu, 04 Mar 2021 16:17:34 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Thu, 4 Mar 2021 11:18:42 +0900, Michael Paquier  
> wrote in 
> > I have not looked in details at the solutions proposed here, but could
> > it be possible to have a TAP test at least please?  Seeing the script
> > from the top of the thread, it should not be difficult to do so.  I
> > would put that in a file different than 009_twophase.pl, within
> > src/test/recovery/.

Isn't 004_timeline_switch.pl the place?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-04 Thread Kyotaro Horiguchi
At Thu, 4 Mar 2021 14:57:13 +0900, Fujii Masao  
wrote in 
> > read_local_xlog_page() works as a part of logical decoding and has
> > responsibility to update ThisTimeLineID properly.  As the comment in
> > the function, it is the proper place to update ThisTimeLineID since we
> > miss a timeline change if we check it earlier and the function uses
> > the value just after. So we cannot change that behavior of the
> > function.  That is, neither of them doesn't seem to be the right fix.
> 
> Could you tell me what actual issue happens if read_local_xlog_page()
> resets
> ThisTimeLineID at the end? Some replication slot-related functions
> that use
> read_local_xlog_page() can be executed even during recovery. For
> example,
> you mean that, when timeline swithes during recovery, those functions
> behave incorrectly if ThisTimeLineID is reset?

The most significant point for me was I'm not fully convinced that we
can safely (or validly) remove the fucntion to maintain the variable
from read_local_xlog_page.

> * RecoveryInProgress() will update ThisTimeLineID when it first
> * notices recovery finishes, so we only have to maintain it for the
> * local process until recovery ends.

read_local_xlog_page is *designed* to maintain ThisTimeLineID.
Currently it doesn't seem utilized but I think it's sufficiently
reasonable that the function maintains ThisTimeLineID.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-03 Thread Kyotaro Horiguchi
At Thu, 4 Mar 2021 11:18:42 +0900, Michael Paquier  wrote 
in 
> On Thu, Mar 04, 2021 at 10:28:31AM +0900, Kyotaro Horiguchi wrote:
> > read_local_xlog_page() works as a part of logical decoding and has
> > responsibility to update ThisTimeLineID properly.  As the comment in
> > the function, it is the proper place to update ThisTimeLineID since we
> > miss a timeline change if we check it earlier and the function uses
> > the value just after. So we cannot change that behavior of the
> > function.  That is, neither of them doesn't seem to be the right fix.
> > 
> > The confusion here is that there's two ThisTimeLineID's here. The
> > previous TLI for read and the next TLI to write.  Most part of the
> > function is needed to read a 2pc recrod so the ways we can take here
> > is:
> > 
> > 1. Somehow tell the function not to update ThisTimeLineID in specific
> >   cases. This can be done by xlogreader private data but this doesn't
> >   seem reasonable.
> > 
> > 2. Restore ThisTimeLineID after calling XLogReadRecord() in the
> >   *caller* side.  This is what came up to me first but I don't like
> >   this, too, but I don't find better fix.  way.
> 
> I have not looked in details at the solutions proposed here, but could
> it be possible to have a TAP test at least please?  Seeing the script
> from the top of the thread, it should not be difficult to do so.  I
> would put that in a file different than 009_twophase.pl, within
> src/test/recovery/.

Year, agreed. It is needed as the final patch. That situation is
easily caused. I'm not sure how to detect the corruption yet, though.
I'll consider that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-03 Thread Fujii Masao




On 2021/03/04 10:28, Kyotaro Horiguchi wrote:

At Wed, 3 Mar 2021 14:56:25 -0800, Soumyadeep Chakraborty 
 wrote in

On 2021/03/03 17:46, Heikki Linnakangas wrote:


I think it should be reset even earlier, inside XlogReadTwoPhaseData()
probably. With your patch, doesn't the LogStandbySnapshot() call just
above where you're ressetting ThisTimeLineID also write a WAL record
with incorrect timeline?


Agreed.


Right.


On Wed, Mar 3, 2021 at 1:04 AM Fujii Masao  wrote:


Even better, can we avoid setting ThisTimeLineID in XlogReadTwoPhaseData() in 
the first place?




Or isn't it better to reset ThisTimeLineID in read_local_xlog_page(), i.e.,
prevent read_local_xlog_page() from changing ThisTimeLineID? I'm not
sure if that's possible, though.. In the future other functions that calls
read_local_xlog_page() during the promotion may appear. Fixing the issue
outside read_local_xlog_page() may cause those functions to get
the same issue.


I agree. We should fix the issue in read_local_xlog_page(). I have
attached two different patches which do so:
saved_ThisTimeLineID.patch and pass_ThisTimeLineID.patch.

The former saves the value of the ThisTimeLineID before it gets changed
in read_local_xlog_page() and resets it after ThisTimeLineID has been
used later on in the code (by XLogReadDetermineTimeline()).

The latter removes occurrences of ThisTimeLineID from
XLogReadDetermineTimeline() and introduces an argument currTLI to
XLogReadDetermineTimeline() to be used in its stead.


read_local_xlog_page() works as a part of logical decoding and has
responsibility to update ThisTimeLineID properly.  As the comment in
the function, it is the proper place to update ThisTimeLineID since we
miss a timeline change if we check it earlier and the function uses
the value just after. So we cannot change that behavior of the
function.  That is, neither of them doesn't seem to be the right fix.


Could you tell me what actual issue happens if read_local_xlog_page() resets
ThisTimeLineID at the end? Some replication slot-related functions that use
read_local_xlog_page() can be executed even during recovery. For example,
you mean that, when timeline swithes during recovery, those functions
behave incorrectly if ThisTimeLineID is reset?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-03 Thread Michael Paquier
On Thu, Mar 04, 2021 at 10:28:31AM +0900, Kyotaro Horiguchi wrote:
> read_local_xlog_page() works as a part of logical decoding and has
> responsibility to update ThisTimeLineID properly.  As the comment in
> the function, it is the proper place to update ThisTimeLineID since we
> miss a timeline change if we check it earlier and the function uses
> the value just after. So we cannot change that behavior of the
> function.  That is, neither of them doesn't seem to be the right fix.
> 
> The confusion here is that there's two ThisTimeLineID's here. The
> previous TLI for read and the next TLI to write.  Most part of the
> function is needed to read a 2pc recrod so the ways we can take here
> is:
> 
> 1. Somehow tell the function not to update ThisTimeLineID in specific
>   cases. This can be done by xlogreader private data but this doesn't
>   seem reasonable.
> 
> 2. Restore ThisTimeLineID after calling XLogReadRecord() in the
>   *caller* side.  This is what came up to me first but I don't like
>   this, too, but I don't find better fix.  way.

I have not looked in details at the solutions proposed here, but could
it be possible to have a TAP test at least please?  Seeing the script
from the top of the thread, it should not be difficult to do so.  I
would put that in a file different than 009_twophase.pl, within
src/test/recovery/.
--
Michael


signature.asc
Description: PGP signature


Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-03 Thread Kyotaro Horiguchi
At Wed, 3 Mar 2021 14:56:25 -0800, Soumyadeep Chakraborty 
 wrote in 
> On 2021/03/03 17:46, Heikki Linnakangas wrote:
> 
> > I think it should be reset even earlier, inside XlogReadTwoPhaseData()
> > probably. With your patch, doesn't the LogStandbySnapshot() call just
> > above where you're ressetting ThisTimeLineID also write a WAL record
> > with incorrect timeline?
> 
> Agreed.

Right.

> On Wed, Mar 3, 2021 at 1:04 AM Fujii Masao  
> wrote:
> 
> > > Even better, can we avoid setting ThisTimeLineID in 
> > > XlogReadTwoPhaseData() in the first place?
> >
> >
> >
> > Or isn't it better to reset ThisTimeLineID in read_local_xlog_page(), i.e.,
> > prevent read_local_xlog_page() from changing ThisTimeLineID? I'm not
> > sure if that's possible, though.. In the future other functions that calls
> > read_local_xlog_page() during the promotion may appear. Fixing the issue
> > outside read_local_xlog_page() may cause those functions to get
> > the same issue.
> 
> I agree. We should fix the issue in read_local_xlog_page(). I have
> attached two different patches which do so:
> saved_ThisTimeLineID.patch and pass_ThisTimeLineID.patch.
> 
> The former saves the value of the ThisTimeLineID before it gets changed
> in read_local_xlog_page() and resets it after ThisTimeLineID has been
> used later on in the code (by XLogReadDetermineTimeline()).
> 
> The latter removes occurrences of ThisTimeLineID from
> XLogReadDetermineTimeline() and introduces an argument currTLI to
> XLogReadDetermineTimeline() to be used in its stead.

read_local_xlog_page() works as a part of logical decoding and has
responsibility to update ThisTimeLineID properly.  As the comment in
the function, it is the proper place to update ThisTimeLineID since we
miss a timeline change if we check it earlier and the function uses
the value just after. So we cannot change that behavior of the
function.  That is, neither of them doesn't seem to be the right fix.

The confusion here is that there's two ThisTimeLineID's here. The
previous TLI for read and the next TLI to write.  Most part of the
function is needed to read a 2pc recrod so the ways we can take here
is:

1. Somehow tell the function not to update ThisTimeLineID in specific
  cases. This can be done by xlogreader private data but this doesn't
  seem reasonable.

2. Restore ThisTimeLineID after calling XLogReadRecord() in the
  *caller* side.  This is what came up to me first but I don't like
  this, too, but I don't find better fix.  way.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20d6c..ab9ac0cd5a 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1324,6 +1324,7 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
+	TimeLineID	currtli = ThisTimeLineID;
 
 	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
 	XL_ROUTINE(.page_read = _local_xlog_page,
@@ -1358,6 +1359,12 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader));
 
 	XLogReaderFree(xlogreader);
+
+	/*
+	 * This function reads a record not as a part of replication so we don't
+	 * want to let XLogReadRecord to change ThisTimeLineID.
+	 */
+	ThisTimeLineID = currtli;
 }
 
 


Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-03 Thread Soumyadeep Chakraborty
On 2021/03/03 17:46, Heikki Linnakangas wrote:

> I think it should be reset even earlier, inside XlogReadTwoPhaseData()
> probably. With your patch, doesn't the LogStandbySnapshot() call just
> above where you're ressetting ThisTimeLineID also write a WAL record
> with incorrect timeline?

Agreed.

On Wed, Mar 3, 2021 at 1:04 AM Fujii Masao  wrote:

> > Even better, can we avoid setting ThisTimeLineID in XlogReadTwoPhaseData() 
> > in the first place?
>
>
>
> Or isn't it better to reset ThisTimeLineID in read_local_xlog_page(), i.e.,
> prevent read_local_xlog_page() from changing ThisTimeLineID? I'm not
> sure if that's possible, though.. In the future other functions that calls
> read_local_xlog_page() during the promotion may appear. Fixing the issue
> outside read_local_xlog_page() may cause those functions to get
> the same issue.

I agree. We should fix the issue in read_local_xlog_page(). I have
attached two different patches which do so:
saved_ThisTimeLineID.patch and pass_ThisTimeLineID.patch.

The former saves the value of the ThisTimeLineID before it gets changed
in read_local_xlog_page() and resets it after ThisTimeLineID has been
used later on in the code (by XLogReadDetermineTimeline()).

The latter removes occurrences of ThisTimeLineID from
XLogReadDetermineTimeline() and introduces an argument currTLI to
XLogReadDetermineTimeline() to be used in its stead.

Regards,
Soumyadeep
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index d17d660f460..a96fd236eeb 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -680,11 +680,12 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
  * The caller must also make sure it doesn't read past the current replay
  * position (using GetXLogReplayRecPtr) if executing in recovery, so it
  * doesn't fail to notice that the current timeline became historical. The
- * caller must also update ThisTimeLineID with the result of
+ * caller must also ensure that currTLI is updated with the result of
  * GetXLogReplayRecPtr and must check RecoveryInProgress().
  */
 void
-XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
+XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage,
+		  uint32 wantLength, TimeLineID currTLI)
 {
 	const XLogRecPtr lastReadPage = (state->seg.ws_segno *
 	 state->segcxt.ws_segsize + state->segoff);
@@ -712,12 +713,12 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	 * just carry on. (Seeking backwards requires a check to make sure the
 	 * older page isn't on a prior timeline).
 	 *
-	 * ThisTimeLineID might've become historical since we last looked, but the
+	 * currTLI might be historical, but the
 	 * caller is required not to read past the flush limit it saw at the time
 	 * it looked up the timeline. There's nothing we can do about it if
 	 * StartupXLOG() renames it to .partial concurrently.
 	 */
-	if (state->currTLI == ThisTimeLineID && wantPage >= lastReadPage)
+	if (state->currTLI == currTLI && wantPage >= lastReadPage)
 	{
 		Assert(state->currTLIValidUntil == InvalidXLogRecPtr);
 		return;
@@ -729,7 +730,7 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	 * the current segment we can just keep reading.
 	 */
 	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
-		state->currTLI != ThisTimeLineID &&
+		state->currTLI != currTLI &&
 		state->currTLI != 0 &&
 		((wantPage + wantLength) / state->segcxt.ws_segsize) <
 		(state->currTLIValidUntil / state->segcxt.ws_segsize))
@@ -752,7 +753,7 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 		 * We need to re-read the timeline history in case it's been changed
 		 * by a promotion or replay from a cascaded replica.
 		 */
-		List	   *timelineHistory = readTimeLineHistory(ThisTimeLineID);
+		List	   *timelineHistory = readTimeLineHistory(currTLI);
 		XLogRecPtr	endOfSegment;
 
 		endOfSegment = ((wantPage / state->segcxt.ws_segsize) + 1) *
@@ -830,7 +831,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 {
 	XLogRecPtr	read_upto,
 loc;
-	TimeLineID	tli;
+	TimeLineID	tli = ThisTimeLineID;
 	int			count;
 	WALReadError errinfo;
 
@@ -850,8 +851,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		if (!RecoveryInProgress())
 			read_upto = GetFlushRecPtr();
 		else
-			read_upto = GetXLogReplayRecPtr();
-		tli = ThisTimeLineID;
+			read_upto = GetXLogReplayRecPtr();
 
 		/*
 		 * Check which timeline to get the record from.
@@ -877,9 +877,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		 * standby whose primary gets promoted while we're decoding, so a
 		 * one-off ERROR isn't too bad.
 		 */
-		XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+		XLogReadDetermineTimeline(state, targetPagePtr, reqLen, tli);
 
-		if 

Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-03 Thread Fujii Masao




On 2021/03/03 17:46, Heikki Linnakangas wrote:

On 03/03/2021 08:47, Kyotaro Horiguchi wrote:

At Tue, 2 Mar 2021 17:56:03 -0800, Soumyadeep Chakraborty 
 wrote in

When there are prepared transactions in an older timeline, in the
checkpointer, a call to CheckPointTwoPhase() and subsequently to
XlogReadTwoPhaseData() and subsequently to read_local_xlog_page() leads
to the following line:

read_upto = GetXLogReplayRecPtr();

GetXLogReplayRecPtr() will change ThisTimeLineID to 1, in order to read
the two phase WAL records in the older timeline. This variable will
remain unchanged and the checkpointer ends up writing the checkpoint
record into the older WAL segment (when XLogBeginInsert() is called
within CreateCheckPoint(), the value is still 1). The value is not
synchronized as even if RecoveryInProgress() is called,
xlogctl->SharedRecoveryState is not RECOVERY_STATE_DONE
(SharedRecoveryInProgress = true in older versions) as the startup
process waits for the checkpointer inside RequestCheckpoint() (since
recovery_target_action='promote' involves a non-fast promotion). Thus,
InitXLOGAccess() is not called and the value of ThisTimeLineID is not
updated before the checkpoint record write.

Since 1148e22a82e, GetXLogReplayRecPtr() is called with ThisTimeLineID
instead of a local variable, within read_local_xlog_page().


Confusing...


PFA a small patch that fixes the problem by explicitly calling
InitXLOGAccess() in CheckPointTwoPhase(), after the two phase state data
is read, in order to update ThisTimeLineID to the latest timeline. It is
okay to call InitXLOGAccess() as it is lightweight and would mostly be
a no-op.


It is correct that read_local_xlog_page() changes ThisTimeLineID, but
InitXLOGAccess() is correctly called in CreateCheckPoint:

|    /*
| * An end-of-recovery checkpoint is created before anyone is allowed to
| * write WAL. To allow us to write the checkpoint record, temporarily
| * enable XLogInsertAllowed.  (This also ensures ThisTimeLineID is
| * initialized, which we need here and in AdvanceXLInsertBuffer.)
| */
|    if (flags & CHECKPOINT_END_OF_RECOVERY)
|    LocalSetXLogInsertAllowed();

It seems to e suficcient to recover ThisTimeLineID from the checkpoint
record to be written, as attached?


I think it should be reset even earlier, inside XlogReadTwoPhaseData() 
probably. With your patch, doesn't the LogStandbySnapshot() call just above 
where you're ressetting ThisTimeLineID also write a WAL record with incorrect 
timeline?

Even better, can we avoid setting ThisTimeLineID in XlogReadTwoPhaseData() in 
the first place?


Or isn't it better to reset ThisTimeLineID in read_local_xlog_page(), i.e.,
prevent read_local_xlog_page() from changing ThisTimeLineID? I'm not
sure if that's possible, though.. In the future other functions that calls
read_local_xlog_page() during the promotion may appear. Fixing the issue
outside read_local_xlog_page() may cause those functions to get
the same issue.

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-03 Thread Heikki Linnakangas

On 03/03/2021 08:47, Kyotaro Horiguchi wrote:

At Tue, 2 Mar 2021 17:56:03 -0800, Soumyadeep Chakraborty 
 wrote in

When there are prepared transactions in an older timeline, in the
checkpointer, a call to CheckPointTwoPhase() and subsequently to
XlogReadTwoPhaseData() and subsequently to read_local_xlog_page() leads
to the following line:

read_upto = GetXLogReplayRecPtr();

GetXLogReplayRecPtr() will change ThisTimeLineID to 1, in order to read
the two phase WAL records in the older timeline. This variable will
remain unchanged and the checkpointer ends up writing the checkpoint
record into the older WAL segment (when XLogBeginInsert() is called
within CreateCheckPoint(), the value is still 1). The value is not
synchronized as even if RecoveryInProgress() is called,
xlogctl->SharedRecoveryState is not RECOVERY_STATE_DONE
(SharedRecoveryInProgress = true in older versions) as the startup
process waits for the checkpointer inside RequestCheckpoint() (since
recovery_target_action='promote' involves a non-fast promotion). Thus,
InitXLOGAccess() is not called and the value of ThisTimeLineID is not
updated before the checkpoint record write.

Since 1148e22a82e, GetXLogReplayRecPtr() is called with ThisTimeLineID
instead of a local variable, within read_local_xlog_page().


Confusing...


PFA a small patch that fixes the problem by explicitly calling
InitXLOGAccess() in CheckPointTwoPhase(), after the two phase state data
is read, in order to update ThisTimeLineID to the latest timeline. It is
okay to call InitXLOGAccess() as it is lightweight and would mostly be
a no-op.


It is correct that read_local_xlog_page() changes ThisTimeLineID, but
InitXLOGAccess() is correctly called in CreateCheckPoint:

|   /*
|* An end-of-recovery checkpoint is created before anyone is allowed to
|* write WAL. To allow us to write the checkpoint record, temporarily
|* enable XLogInsertAllowed.  (This also ensures ThisTimeLineID is
|* initialized, which we need here and in AdvanceXLInsertBuffer.)
|*/
|   if (flags & CHECKPOINT_END_OF_RECOVERY)
|   LocalSetXLogInsertAllowed();

It seems to e suficcient to recover ThisTimeLineID from the checkpoint
record to be written, as attached?


I think it should be reset even earlier, inside XlogReadTwoPhaseData() 
probably. With your patch, doesn't the LogStandbySnapshot() call just 
above where you're ressetting ThisTimeLineID also write a WAL record 
with incorrect timeline?


Even better, can we avoid setting ThisTimeLineID in 
XlogReadTwoPhaseData() in the first place?


- Heikki




Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-02 Thread Kyotaro Horiguchi
At Tue, 2 Mar 2021 17:56:03 -0800, Soumyadeep Chakraborty 
 wrote in 
> Hello hackers,
> 
> We came across an issue where the checkpointer writes to the older
> timeline while a promotion is ongoing after reaching the recovery point
> in a PITR, when there are prepared transactions before the recovery
> point. We came across this issue first in REL_12_STABLE and saw that it
> also exists in devel.

Good Catch!  I can reproduce that.

> When there are prepared transactions in an older timeline, in the
> checkpointer, a call to CheckPointTwoPhase() and subsequently to
> XlogReadTwoPhaseData() and subsequently to read_local_xlog_page() leads
> to the following line:
> 
> read_upto = GetXLogReplayRecPtr();
> 
> GetXLogReplayRecPtr() will change ThisTimeLineID to 1, in order to read
> the two phase WAL records in the older timeline. This variable will
> remain unchanged and the checkpointer ends up writing the checkpoint
> record into the older WAL segment (when XLogBeginInsert() is called
> within CreateCheckPoint(), the value is still 1). The value is not
> synchronized as even if RecoveryInProgress() is called,
> xlogctl->SharedRecoveryState is not RECOVERY_STATE_DONE
> (SharedRecoveryInProgress = true in older versions) as the startup
> process waits for the checkpointer inside RequestCheckpoint() (since
> recovery_target_action='promote' involves a non-fast promotion). Thus,
> InitXLOGAccess() is not called and the value of ThisTimeLineID is not
> updated before the checkpoint record write.
> 
> Since 1148e22a82e, GetXLogReplayRecPtr() is called with ThisTimeLineID
> instead of a local variable, within read_local_xlog_page().
> 
> PFA a small patch that fixes the problem by explicitly calling
> InitXLOGAccess() in CheckPointTwoPhase(), after the two phase state data
> is read, in order to update ThisTimeLineID to the latest timeline. It is
> okay to call InitXLOGAccess() as it is lightweight and would mostly be
> a no-op.

It is correct that read_local_xlog_page() changes ThisTimeLineID, but
InitXLOGAccess() is correctly called in CreateCheckPoint:

|   /*
|* An end-of-recovery checkpoint is created before anyone is allowed to
|* write WAL. To allow us to write the checkpoint record, temporarily
|* enable XLogInsertAllowed.  (This also ensures ThisTimeLineID is
|* initialized, which we need here and in AdvanceXLInsertBuffer.)
|*/
|   if (flags & CHECKPOINT_END_OF_RECOVERY)
|   LocalSetXLogInsertAllowed();

It seems to e suficcient to recover ThisTimeLineID from the checkpoint
record to be written, as attached?

regareds.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 377afb8732..06d3ce676e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9088,6 +9088,13 @@ CreateCheckPoint(int flags)
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
 
+	/*
+	 * ThisTimeLineID may have moved so far so we need to restore it so that
+	 * the checkpoint record goes to the right timeline.  (Currently
+	 * CheckPointTwoPhase() is known to have a chance to change it.)
+	 */
+	ThisTimeLineID = checkPoint.ThisTimeLineID;
+
 	START_CRIT_SECTION();
 
 	/*


PITR promote bug: Checkpointer writes to older timeline

2021-03-02 Thread Soumyadeep Chakraborty
Hello hackers,

We came across an issue where the checkpointer writes to the older
timeline while a promotion is ongoing after reaching the recovery point
in a PITR, when there are prepared transactions before the recovery
point. We came across this issue first in REL_12_STABLE and saw that it
also exists in devel.

Setup:
PFA a minimal reproduction script repro.txt.

After running the script, we notice that the checkpointer has written
the end-of-recovery shutdown checkpoint in the previous timeline (tli =
1), i.e. it wrote into the WAL segment 00010003 instead
of writing to the WAL segment 00020003, causing it to
overwrite WAL records past the recovery point (please see attached diff
output file waldiff.diff) in 00010003.

Also, performing a subsequent shutdown on the recovered server may cause
the next shutdown checkpoint record to be written, again, to the
previous timeline, i.e. to 00010003. A subsequent server
start will fail as the starup process will be unable to find the
checkpoint in the latest timeline (00020003) and we will
get:

...
LOG:  invalid record length at 0/3016FB8: wante
d 24, got 0
LOG:  invalid primary checkpoint record
PANIC:  could not locate a valid checkpoint record
...

RCA:

When there are prepared transactions in an older timeline, in the
checkpointer, a call to CheckPointTwoPhase() and subsequently to
XlogReadTwoPhaseData() and subsequently to read_local_xlog_page() leads
to the following line:

read_upto = GetXLogReplayRecPtr();

GetXLogReplayRecPtr() will change ThisTimeLineID to 1, in order to read
the two phase WAL records in the older timeline. This variable will
remain unchanged and the checkpointer ends up writing the checkpoint
record into the older WAL segment (when XLogBeginInsert() is called
within CreateCheckPoint(), the value is still 1). The value is not
synchronized as even if RecoveryInProgress() is called,
xlogctl->SharedRecoveryState is not RECOVERY_STATE_DONE
(SharedRecoveryInProgress = true in older versions) as the startup
process waits for the checkpointer inside RequestCheckpoint() (since
recovery_target_action='promote' involves a non-fast promotion). Thus,
InitXLOGAccess() is not called and the value of ThisTimeLineID is not
updated before the checkpoint record write.

Since 1148e22a82e, GetXLogReplayRecPtr() is called with ThisTimeLineID
instead of a local variable, within read_local_xlog_page().

PFA a small patch that fixes the problem by explicitly calling
InitXLOGAccess() in CheckPointTwoPhase(), after the two phase state data
is read, in order to update ThisTimeLineID to the latest timeline. It is
okay to call InitXLOGAccess() as it is lightweight and would mostly be
a no-op.

Regards,
Soumyadeep, Kevin and Jimmy
VMWare
#! /bin/sh

# Primary setup

/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
echo "archive_mode = on
archive_command = 'cp %p /tmp/archive/%f'
max_prepared_transactions = 10" >> /usr/local/pgsql/data/postgresql.conf

/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l /tmp/logfile_p start

# Secondary setup

/usr/local/pgsql/bin/pg_basebackup -D /usr/local/pgsql/data_s -X stream && 
touch /usr/local/pgsql/data_s/recovery.signal
echo "port = 6432
restore_command = 'cp /tmp/archive/%f %p'
recovery_target_name = 'rp'
recovery_target_action = 'promote'
max_prepared_transactions = 20" >> /usr/local/pgsql/data_s/postgresql.conf

# Workload
echo "CREATE TABLE foo(i int);
BEGIN;
INSERT INTO foo VALUES(1);
PREPARE TRANSACTION 'fooinsert';
SELECT pg_create_restore_point('rp');
INSERT INTO foo VALUES(2);
SELECT pg_switch_wal();
" > /tmp/workload.sql
/usr/local/pgsql/bin/psql postgres -f /tmp/workload.sql

/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data_s -l /tmp/logfile_s start

sleep 5

/usr/local/pgsql/bin/pg_waldump /tmp/archive/00010003 > 
/tmp/archive_00010003
/usr/local/pgsql/bin/pg_waldump 
/usr/local/pgsql/data_s/pg_wal/00010003 > 
/tmp/standby_00010003

diff -u /tmp/archive_00010003 
/tmp/standby_00010003 > /tmp/waldiff.diff
cat /tmp/waldiff.diff
From dbf5a9f6899bedf28b482fc03a4a2b0571e92e9b Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Tue, 2 Mar 2021 17:41:20 -0800
Subject: [PATCH 1/1] Prevent checkpointer from writing to older timeline

Co-authored-by: Kevin Yeap 
---
 src/backend/access/transam/twophase.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20d6cc..081aee6217e 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1740,6 +1740,16 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 	}
 	LWLockRelease(TwoPhaseStateLock);
 
+	if (serialized_xacts > 0)
+	{
+		/*
+		 * In order to ensure that we write the checkpoint record to the latest
+		 * timeline, refresh ThisTimeLineID.