Re: Add basic tests for the low-level backup method.

2024-03-15 Thread Michael Paquier
On Fri, Mar 15, 2024 at 06:37:35PM +1300, David Steele wrote:
> This seems like a reasonable explanation to me.

FYI, drongo has just passed the test.  fairywren uses TAP, does not
run the recovery tests.
--
Michael


signature.asc
Description: PGP signature


Re: Add basic tests for the low-level backup method.

2024-03-14 Thread David Steele

On 3/15/24 18:32, Michael Paquier wrote:

On Fri, Mar 15, 2024 at 06:23:15PM +1300, David Steele wrote:

Well, this is what we recommend in the docs, i.e. using bin mode to save
backup_label, so it seems OK to me.


Indeed, I didn't notice that this is actually documented, so what I
did took the right angle.  French flair, perhaps..


This seems like a reasonable explanation to me.

-David




Re: Add basic tests for the low-level backup method.

2024-03-14 Thread Michael Paquier
On Fri, Mar 15, 2024 at 06:23:15PM +1300, David Steele wrote:
> Well, this is what we recommend in the docs, i.e. using bin mode to save
> backup_label, so it seems OK to me.

Indeed, I didn't notice that this is actually documented, so what I
did took the right angle.  French flair, perhaps..
--
Michael


signature.asc
Description: PGP signature


Re: Add basic tests for the low-level backup method.

2024-03-14 Thread Michael Paquier
On Fri, Mar 15, 2024 at 08:38:47AM +0900, Michael Paquier wrote:
> That's why these tests are not that easy, they can be racy.  I've run
> the test 5~10 times in the CI this time to gain more confidence, and
> saw zero failures with the stability fixes in place including Windows.
> I've applied it now, as I can still monitor the buildfarm for a few
> more days.  Let's see what happens, but that should be better.

So, it looks like the buildfarm is clear.  sidewinder has reported a
green state, and the recent runs of the CFbot across all the patches
are looking stable as well on all platforms.  There are still a few
buildfarm members on Windows that will take time more time before
running.
--
Michael


signature.asc
Description: PGP signature


Re: Add basic tests for the low-level backup method.

2024-03-14 Thread David Steele

On 3/15/24 12:38, Michael Paquier wrote:

On Fri, Mar 15, 2024 at 09:40:38AM +1300, David Steele wrote:

Is the missing test in meson the reason we did not see test failures for
Windows in CI?


The test has to be listed in src/test/recovery/meson.build or the CI
would ignore it.


Right -- I will keep this in mind for the future.


The second LOG is something that can be acted on.  I've added some
debugging to the parsing of the backup_label file in the backend, and
noticed that the first fscanf() for START WAL LOCATION is failing
because the last %c is detected as \r rather than \n.  Tweaking the
contents stored from pg_backend_stop() with a sed won't help, because
the issue is that we write the CRLFs with append_to_file, and the
startup process cannot cope with that.  The simplest method I can
think of is to use binmode, as of the attached.


Yeah, that makes sense.


I am wondering if there is a better trick here that would not require
changes in the backend to make the backup_label parsing more flexible,
though.


Well, this is what we recommend in the docs, i.e. using bin mode to save 
backup_label, so it seems OK to me.



I am attaching an updated patch with all that fixed, which is stable
in the CI and any tests I've run.  Do you have any comments about


These changes look good to me. Sure wish we had an easier to way to test
commits in the build farm.


That's why these tests are not that easy, they can be racy.  I've run
the test 5~10 times in the CI this time to gain more confidence, and
saw zero failures with the stability fixes in place including Windows.
I've applied it now, as I can still monitor the buildfarm for a few
more days.  Let's see what happens, but that should be better.


At least sidewinder is happy now -- and the build farm in general as far 
as I can see.


Thank you for your help on this!
-David




Re: Add basic tests for the low-level backup method.

2024-03-14 Thread Michael Paquier
On Fri, Mar 15, 2024 at 09:40:38AM +1300, David Steele wrote:
> Is the missing test in meson the reason we did not see test failures for
> Windows in CI?

The test has to be listed in src/test/recovery/meson.build or the CI
would ignore it.

>> The second LOG is something that can be acted on.  I've added some
>> debugging to the parsing of the backup_label file in the backend, and
>> noticed that the first fscanf() for START WAL LOCATION is failing
>> because the last %c is detected as \r rather than \n.  Tweaking the
>> contents stored from pg_backend_stop() with a sed won't help, because
>> the issue is that we write the CRLFs with append_to_file, and the
>> startup process cannot cope with that.  The simplest method I can
>> think of is to use binmode, as of the attached.
> 
> Yeah, that makes sense.

I am wondering if there is a better trick here that would not require
changes in the backend to make the backup_label parsing more flexible,
though.

>> I am attaching an updated patch with all that fixed, which is stable
>> in the CI and any tests I've run.  Do you have any comments about
> 
> These changes look good to me. Sure wish we had an easier to way to test
> commits in the build farm.

That's why these tests are not that easy, they can be racy.  I've run
the test 5~10 times in the CI this time to gain more confidence, and
saw zero failures with the stability fixes in place including Windows.
I've applied it now, as I can still monitor the buildfarm for a few
more days.  Let's see what happens, but that should be better.
--
Michael


signature.asc
Description: PGP signature


Re: Add basic tests for the low-level backup method.

2024-03-14 Thread David Steele

On 3/14/24 20:00, Michael Paquier wrote:

On Thu, Mar 14, 2024 at 09:12:52AM +1300, David Steele wrote:

I think you are right that the start message is better since it can only
appear once when the backup_label is found. The completed message could in
theory appear after a restart, though the backup_label must have been found
at some point.


So, I've given a try to this patch with 99b4a63bef94, to note that
sidewinder failed because of a timing issue on Windows: the recovery
of the node without backup_label, expected to fail, would try to
backup the last segment it has replayed, because, as it has no
backup_label, it behaves like the primary.  It would try to use the
same archive location as the primary, leading to a conflict failure on
Windows.  This one was easy to fix, by overwritting postgresql.conf on
the node to not do archiving.


Hmmm, I wonder why this did not show up in the Windows tests on CI?


Following that, I've noticed a second race condition: we don't wait
for the segment after pg_switch_wal() to be archived.  This one can be
easily avoided with a poll on pg_stat_archiver.


Ugh, yeah, good change.


After that, also because I've initially managed to, cough, forget an
update of meson.build to list the new test, I've noticed a third
failure on Windows for the case of the node that has a backup_label.
Here is one of the failures:
https://cirrus-ci.com/task/5245341683941376


Is the missing test in meson the reason we did not see test failures for 
Windows in CI?



regress_log_042_low_level_backup and
042_low_level_backup_replica_success.log have all the information
needed, that can be summarized like that:
The system cannot find the file specified.
2024-03-14 06:02:37.670 GMT [560][startup] FATAL:  invalid data in file 
"backup_label"

The first message is something new to me, that seems to point to a
corruption failure of the file system.  Why don't we see something
similar in other tests, then?  Leaving that aside..

The second LOG is something that can be acted on.  I've added some
debugging to the parsing of the backup_label file in the backend, and
noticed that the first fscanf() for START WAL LOCATION is failing
because the last %c is detected as \r rather than \n.  Tweaking the
contents stored from pg_backend_stop() with a sed won't help, because
the issue is that we write the CRLFs with append_to_file, and the
startup process cannot cope with that.  The simplest method I can
think of is to use binmode, as of the attached.


Yeah, that makes sense.


It is the first time that we'd take the contents received from a
BackgroundPsql and write them to a file parsed by the backend, so
perhaps we should try to do that in a more general way, but I'm not
sure how, tbh, and the case of this test is special while adding
handling for \r when reading the backup_label got discussed in the
past but we were OK with what we are doing now on HEAD.


I think it makes sense to leave the parsing code as is and make the 
change in the test. If we add more tests to this module we'll probably 
need a function to avoid repeating code.



On top of all that, note that I have removed remove_tree as I am not
sure if this would be OK in all the buildfarm animals, added a quit()
for BackgroundPsql, moved queries to use less BackgroundPsql, as well
as a few other things like avoiding the hardcoded segment names.
meson.build is.. Cough.. Updated now.


OK.


I am attaching an updated patch with all that fixed, which is stable
in the CI and any tests I've run.  Do you have any comments about


These changes look good to me. Sure wish we had an easier to way to test 
commits in the build farm.


Regards,
-David




Re: Add basic tests for the low-level backup method.

2024-03-14 Thread Michael Paquier
On Thu, Mar 14, 2024 at 09:12:52AM +1300, David Steele wrote:
> I think you are right that the start message is better since it can only
> appear once when the backup_label is found. The completed message could in
> theory appear after a restart, though the backup_label must have been found
> at some point.

So, I've given a try to this patch with 99b4a63bef94, to note that
sidewinder failed because of a timing issue on Windows: the recovery
of the node without backup_label, expected to fail, would try to
backup the last segment it has replayed, because, as it has no
backup_label, it behaves like the primary.  It would try to use the
same archive location as the primary, leading to a conflict failure on
Windows.  This one was easy to fix, by overwritting postgresql.conf on
the node to not do archiving.

Following that, I've noticed a second race condition: we don't wait
for the segment after pg_switch_wal() to be archived.  This one can be
easily avoided with a poll on pg_stat_archiver.

After that, also because I've initially managed to, cough, forget an
update of meson.build to list the new test, I've noticed a third
failure on Windows for the case of the node that has a backup_label.
Here is one of the failures:
https://cirrus-ci.com/task/5245341683941376

regress_log_042_low_level_backup and
042_low_level_backup_replica_success.log have all the information
needed, that can be summarized like that:
The system cannot find the file specified.
2024-03-14 06:02:37.670 GMT [560][startup] FATAL:  invalid data in file 
"backup_label"

The first message is something new to me, that seems to point to a
corruption failure of the file system.  Why don't we see something
similar in other tests, then?  Leaving that aside..

The second LOG is something that can be acted on.  I've added some
debugging to the parsing of the backup_label file in the backend, and
noticed that the first fscanf() for START WAL LOCATION is failing
because the last %c is detected as \r rather than \n.  Tweaking the
contents stored from pg_backend_stop() with a sed won't help, because
the issue is that we write the CRLFs with append_to_file, and the
startup process cannot cope with that.  The simplest method I can
think of is to use binmode, as of the attached.

It is the first time that we'd take the contents received from a
BackgroundPsql and write them to a file parsed by the backend, so
perhaps we should try to do that in a more general way, but I'm not
sure how, tbh, and the case of this test is special while adding
handling for \r when reading the backup_label got discussed in the
past but we were OK with what we are doing now on HEAD.

On top of all that, note that I have removed remove_tree as I am not
sure if this would be OK in all the buildfarm animals, added a quit()
for BackgroundPsql, moved queries to use less BackgroundPsql, as well
as a few other things like avoiding the hardcoded segment names.
meson.build is.. Cough.. Updated now.

I am attaching an updated patch with all that fixed, which is stable
in the CI and any tests I've run.  Do you have any comments about
that?
--
Michael
From 5799e7e48f02cd2ed1d8b2afb88001884c173855 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 14 Mar 2024 15:57:58 +0900
Subject: [PATCH v3] Add basic TAP tests for the low-level backup method

There are currently no tests for the low-level backup method where
pg_backup_start() and pg_backup_stop() are involved while taking a
file-system backup.  The tests introduced in this commit rely on a
background psql process to make sure that the backup is taken while the
session doing the SQL start and stop calls remains alive.

Two cases are checked here with the backup taken:
- Recovery without a backup_label, leading to a corrupted state.
- Recovery with a backup_label, with a consistent state reached.
Both cases cross-check some patterns in the logs generated when running
recovery.

Author: David Steele
Discussion: https://postgr.es/m/f20fcc82-dadb-478d-beb4-1e2ffb0ac...@pgmasters.net
---
 src/test/recovery/meson.build   |   1 +
 src/test/recovery/t/042_low_level_backup.pl | 144 
 2 files changed, 145 insertions(+)
 create mode 100644 src/test/recovery/t/042_low_level_backup.pl

diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index c67249500e..b1eb77b1ec 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -50,6 +50,7 @@ tests += {
   't/039_end_of_wal.pl',
   't/040_standby_failover_slots_sync.pl',
   't/041_checkpoint_at_promote.pl',
+  't/042_low_level_backup.pl',
 ],
   },
 }
diff --git a/src/test/recovery/t/042_low_level_backup.pl b/src/test/recovery/t/042_low_level_backup.pl
new file mode 100644
index 00..0a75a8624a
--- /dev/null
+++ b/src/test/recovery/t/042_low_level_backup.pl
@@ -0,0 +1,144 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test low-level backup method by using pg_backup_start() and 

Re: Add basic tests for the low-level backup method.

2024-03-13 Thread David Steele

On 3/13/24 19:15, Michael Paquier wrote:

On Wed, Mar 13, 2024 at 01:12:28PM +1300, David Steele wrote:


Not sure what to look for here. There are no distinct messages for crash
recovery. Perhaps there should be?


The closest thing I can think of here would be "database system was
not properly shut down; automatic recovery in progress" as we don't
have InArchiveRecovery, after checking that the canary is missing.  If
you don't like this suggestion, feel free to say so, of course :)


That works for me. I think I got it confused with "database system was 
interrupted..." when I was looking at the success vs. fail logs.



Sure, I added a check for the new log message when recovering with a
backup_label.


+ok($node_replica->log_contains('completed backup recovery with redo LSN'),
+   'verify backup recovery performed with backup_label');

Okay for this choice.  I was thinking first about "starting backup
recovery with redo LSN", closer to the area where the backup_label is
read.


I think you are right that the start message is better since it can only 
appear once when the backup_label is found. The completed message could 
in theory appear after a restart, though the backup_label must have been 
found at some point.


Regards,
-DavidFrom 91f8e8a390713760116990be05d96f8a7e0d1bde Mon Sep 17 00:00:00 2001
From: David Steele 
Date: Wed, 13 Mar 2024 20:02:40 +
Subject: Add basic tests for the low-level backup method.

There are currently no tests for the low-level backup method. pg_backup_start()
and pg_backup_stop() are called but not exercised in any real fashion.

There is a lot more that can be done, but this at least provides some basic
tests and provides a place for future improvement.
---
 src/test/recovery/t/042_low_level_backup.pl | 114 
 1 file changed, 114 insertions(+)
 create mode 100644 src/test/recovery/t/042_low_level_backup.pl

diff --git a/src/test/recovery/t/042_low_level_backup.pl 
b/src/test/recovery/t/042_low_level_backup.pl
new file mode 100644
index 00..76bac7e324
--- /dev/null
+++ b/src/test/recovery/t/042_low_level_backup.pl
@@ -0,0 +1,114 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test low-level backup method by using pg_backup_start() and pg_backup_stop()
+# to create backups.
+
+use strict;
+use warnings;
+
+use File::Copy qw(copy);
+use File::Path qw(remove_tree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Start primary node with archiving
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(has_archiving => 1, allows_streaming => 1);
+$node_primary->start;
+
+# Start backup
+my $backup_name = 'backup1';
+my $psql = $node_primary->background_psql('postgres');
+
+$psql->query_safe("SET client_min_messages TO WARNING");
+$psql->set_query_timer_restart();
+$psql->query_safe("select pg_backup_start('test label')");
+
+# Copy files
+my $backup_dir = $node_primary->backup_dir() . '/' . $backup_name;
+
+PostgreSQL::Test::RecursiveCopy::copypath(
+   $node_primary->data_dir(), $backup_dir);
+
+# Cleanup some files/paths that should not be in the backup. There is no
+# attempt to all the exclusions done by pg_basebackup here, in part because
+# they are not required, but also to keep the test simple.
+#
+# Also remove pg_control because it needs to be copied later and it will be
+# obvious if the copy fails.
+unlink("$backup_dir/postmaster.pid")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.pid");
+unlink("$backup_dir/postmaster.opts")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.opts");
+unlink("$backup_dir/global/pg_control")
+   or BAIL_OUT("unable to unlink $backup_dir/global/pg_control");
+
+remove_tree("$backup_dir/pg_wal", {keep_root => 1})
+   or BAIL_OUT("unable to unlink contents of $backup_dir/pg_wal");
+
+# Create a table that will be used to verify that recovery started at the
+# correct location, rather than the location recorded in pg_control
+$psql->query_safe("create table canary (id int)");
+
+# Advance the checkpoint location in pg_control past the location where the
+# backup started. Switch the WAL to make it really obvious that the location
+# is different and to put the checkpoint in a new WAL segment.
+$psql->query_safe("select pg_switch_wal()");
+$psql->query_safe("checkpoint");
+
+# Copy pg_control last so it contains the new checkpoint
+copy($node_primary->data_dir() . '/global/pg_control',
+   "$backup_dir/global/pg_control")
+   or BAIL_OUT("unable to copy global/pg_control");
+
+# Stop backup and get backup_label
+my $backup_label = $psql->query_safe("select labelfile from pg_backup_stop()");
+
+# Rather than writing out backup_label, try to recover the backup without
+# backup_label to demonstrate that recovery will not work correctly without it,
+# i.e. the canary table will be missing and the cluster will be corrupt. 
Provide
+# only the WAL segment that recovery will 

Re: Add basic tests for the low-level backup method.

2024-03-13 Thread Michael Paquier
On Wed, Mar 13, 2024 at 01:12:28PM +1300, David Steele wrote:
> On 2/29/24 16:55, Michael Paquier wrote:
>> +unlink("$backup_dir/postmaster.pid")
>> +or BAIL_OUT("unable to unlink $backup_dir/postmaster.pid");
>> +unlink("$backup_dir/postmaster.opts")
>> +or BAIL_OUT("unable to unlink $backup_dir/postmaster.opts");
>> +unlink("$backup_dir/global/pg_control")
>> +or BAIL_OUT("unable to unlink $backup_dir/global/pg_control");
>> 
>> RELCACHE_INIT_FILENAME as well?
> 
> I'm not trying to implement the full exclusion list here, just enough to get
> the test working. Since exclusions are optional according to the docs I
> don't think we need them for a valid tests.

Okay.  Fine by me at the end.

>> +# Rather than writing out backup_label, try to recover the backup without
>> +# backup_label to demonstrate that recovery will not work correctly without 
>> it,
>> +# i.e. the canary table will be missing and the cluster will be corrupt. 
>> Provide
>> +# only the WAL segment that recovery will think it needs.
>> 
>> Okay, why not.  No objections to this addition.  I am a bit surprised
>> that this is not scanning some of the logs produced by the startup
>> process for particular patterns.
> 
> Not sure what to look for here. There are no distinct messages for crash
> recovery. Perhaps there should be?

The closest thing I can think of here would be "database system was
not properly shut down; automatic recovery in progress" as we don't
have InArchiveRecovery, after checking that the canary is missing.  If
you don't like this suggestion, feel free to say so, of course :)

>> +# Save backup_label into the backup directory and recover using the 
>> primary's
>> +# archive. This time recovery will succeed and the canary table will be
>> +# present.
>> 
>> Here are well, I think that we should add some log_contains() with
>> pre-defined patterns to show that recovery has completed the way we
>> want it with a backup_label up to the end-of-backup record.
> 
> Sure, I added a check for the new log message when recovering with a
> backup_label.

+ok($node_replica->log_contains('completed backup recovery with redo LSN'),
+   'verify backup recovery performed with backup_label');

Okay for this choice.  I was thinking first about "starting backup
recovery with redo LSN", closer to the area where the backup_label is
read.
--
Michael


signature.asc
Description: PGP signature


Re: Add basic tests for the low-level backup method.

2024-03-12 Thread David Steele

On 2/29/24 16:55, Michael Paquier wrote:

On Thu, Feb 29, 2024 at 10:30:52AM +1300, David Steele wrote:

On 11/12/23 08:21, David Steele wrote:

As promised in [1], attached are some basic tests for the low-level
backup method.


Added to the 2024-03 CF.


There is already 040_standby_failover_slots_sync.pl in recovery/ that
uses the number of your test script.  You may want to bump it, that's
a nit.


Renamed to 042_low_level_backup.pl.


+unlink("$backup_dir/postmaster.pid")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.pid");
+unlink("$backup_dir/postmaster.opts")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.opts");
+unlink("$backup_dir/global/pg_control")
+   or BAIL_OUT("unable to unlink $backup_dir/global/pg_control");

RELCACHE_INIT_FILENAME as well?


I'm not trying to implement the full exclusion list here, just enough to 
get the test working. Since exclusions are optional according to the 
docs I don't think we need them for a valid tests.



+# Rather than writing out backup_label, try to recover the backup without
+# backup_label to demonstrate that recovery will not work correctly without it,
+# i.e. the canary table will be missing and the cluster will be corrupt. 
Provide
+# only the WAL segment that recovery will think it needs.

Okay, why not.  No objections to this addition.  I am a bit surprised
that this is not scanning some of the logs produced by the startup
process for particular patterns.


Not sure what to look for here. There are no distinct messages for crash 
recovery. Perhaps there should be?



+# Save backup_label into the backup directory and recover using the primary's
+# archive. This time recovery will succeed and the canary table will be
+# present.

Here are well, I think that we should add some log_contains() with
pre-defined patterns to show that recovery has completed the way we
want it with a backup_label up to the end-of-backup record.


Sure, I added a check for the new log message when recovering with a 
backup_label.


Regards,
-DavidFrom 076429a578ece3c213525220a4961cb5b56c190a Mon Sep 17 00:00:00 2001
From: David Steele 
Date: Wed, 13 Mar 2024 00:01:55 +
Subject: Add basic tests for the low-level backup method.

There are currently no tests for the low-level backup method. pg_backup_start()
and pg_backup_stop() are called but not exercised in any real fashion.

There is a lot more that can be done, but this at least provides some basic
tests and provides a place for future improvement.
---
 src/test/recovery/t/042_low_level_backup.pl | 108 
 1 file changed, 108 insertions(+)
 create mode 100644 src/test/recovery/t/042_low_level_backup.pl

diff --git a/src/test/recovery/t/042_low_level_backup.pl 
b/src/test/recovery/t/042_low_level_backup.pl
new file mode 100644
index 00..22668bdc0d
--- /dev/null
+++ b/src/test/recovery/t/042_low_level_backup.pl
@@ -0,0 +1,108 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test low-level backup method by using pg_backup_start() and pg_backup_stop()
+# to create backups.
+
+use strict;
+use warnings;
+
+use File::Copy qw(copy);
+use File::Path qw(remove_tree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Start primary node with archiving
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(has_archiving => 1, allows_streaming => 1);
+$node_primary->start;
+
+# Start backup
+my $backup_name = 'backup1';
+my $psql = $node_primary->background_psql('postgres');
+
+$psql->query_safe("SET client_min_messages TO WARNING");
+$psql->set_query_timer_restart();
+$psql->query_safe("select pg_backup_start('test label')");
+
+# Copy files
+my $backup_dir = $node_primary->backup_dir() . '/' . $backup_name;
+
+PostgreSQL::Test::RecursiveCopy::copypath(
+   $node_primary->data_dir(), $backup_dir);
+
+# Cleanup some files/paths that should not be in the backup. There is no
+# attempt to all the exclusions done by pg_basebackup here, in part because
+# they are not required, but also to keep the test simple.
+#
+# Also remove pg_control because it needs to be copied later and it will be
+# obvious if the copy fails.
+unlink("$backup_dir/postmaster.pid")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.pid");
+unlink("$backup_dir/postmaster.opts")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.opts");
+unlink("$backup_dir/global/pg_control")
+   or BAIL_OUT("unable to unlink $backup_dir/global/pg_control");
+
+remove_tree("$backup_dir/pg_wal", {keep_root => 1})
+   or BAIL_OUT("unable to unlink contents of $backup_dir/pg_wal");
+
+# Create a table that will be used to verify that recovery started at the
+# correct location, rather than the location recorded in pg_control
+$psql->query_safe("create table canary (id int)");
+
+# Advance the checkpoint location in pg_control past the location where the
+# backup started. Switch the WAL to make it really 

Re: Add basic tests for the low-level backup method.

2024-02-28 Thread Michael Paquier
On Thu, Feb 29, 2024 at 10:30:52AM +1300, David Steele wrote:
> On 11/12/23 08:21, David Steele wrote:
>> As promised in [1], attached are some basic tests for the low-level
>> backup method.
> 
> Added to the 2024-03 CF.

There is already 040_standby_failover_slots_sync.pl in recovery/ that
uses the number of your test script.  You may want to bump it, that's
a nit.

+unlink("$backup_dir/postmaster.pid")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.pid");
+unlink("$backup_dir/postmaster.opts")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.opts");
+unlink("$backup_dir/global/pg_control")
+   or BAIL_OUT("unable to unlink $backup_dir/global/pg_control");

RELCACHE_INIT_FILENAME as well?

+# Rather than writing out backup_label, try to recover the backup without
+# backup_label to demonstrate that recovery will not work correctly without it,
+# i.e. the canary table will be missing and the cluster will be corrupt. 
Provide
+# only the WAL segment that recovery will think it needs.

Okay, why not.  No objections to this addition.  I am a bit surprised
that this is not scanning some of the logs produced by the startup
process for particular patterns.

+# Save backup_label into the backup directory and recover using the primary's
+# archive. This time recovery will succeed and the canary table will be
+# present. 

Here are well, I think that we should add some log_contains() with
pre-defined patterns to show that recovery has completed the way we
want it with a backup_label up to the end-of-backup record.
--
Michael


signature.asc
Description: PGP signature


Re: Add basic tests for the low-level backup method.

2024-02-28 Thread David Steele

On 11/12/23 08:21, David Steele wrote:


As promised in [1], attached are some basic tests for the low-level 
backup method.


Added to the 2024-03 CF.

Regards,
-David