Re: trying again to get incremental backup

2024-04-26 Thread Robert Haas
On Thu, Apr 25, 2024 at 6:44 PM Thom Brown  wrote:
> I would like to query the following:
>
> --tablespace-mapping=olddir=newdir
>
> Relocates the tablespace in directory olddir to newdir during the backup. 
> olddir is the absolute path of the tablespace as it exists in the first 
> backup specified on the command line, and newdir is the absolute path to use 
> for the tablespace in the reconstructed backup.
>
> The first backup specified on the command line will be the regular, full, 
> non-incremental backup.  But if a tablespace was introduced subsequently, it 
> would only appear in an incremental backup.  Wouldn't this then mean that a 
> mapping would need to be provided based on the path to the tablespace of that 
> incremental backup's copy?

Yes. Tomas Vondra found the same issue, which I have fixed in
1713e3d6cd393fcc1d4873e75c7fa1f6c7023d75.
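
For illustration, a hypothetical pg_combinebackup invocation of the kind
being discussed (directory names are invented; the option shape and the
order of backups on the command line follow the documentation quoted
above):

$ pg_combinebackup -o /restore/data \
    --tablespace-mapping=/srv/tblspc/reports=/restore/tblspc/reports \
    /backups/full /backups/incr.1 /backups/incr.2

Here /srv/tblspc/reports stands for a tablespace created after the full
backup was taken, so it only appears in the incremental backups, which is
exactly the case the question and the fix above are concerned with.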

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2024-04-25 Thread Thom Brown
On Wed, 3 Jan 2024 at 15:10, Robert Haas  wrote:

> On Fri, Dec 22, 2023 at 12:00 AM Alexander Lakhin 
> wrote:
> > My quick experiment shows that that TimestampDifferenceMilliseconds call
> > always returns zero, due to its arguments being swapped.
>
> Thanks. Tom already changed the unsigned -> int stuff in a separate
> commit, so I just pushed the fixes to PrepareForIncrementalBackup,
> both the one I had before, and swapping the arguments to
> TimestampDifferenceMilliseconds
>

I would like to query the following:

--tablespace-mapping=olddir=newdir

Relocates the tablespace in directory olddir to newdir during the
backup. olddir is the absolute path of the tablespace as it exists in the
first backup specified on the command line, and newdir is the absolute path
to use for the tablespace in the reconstructed backup.

The first backup specified on the command line will be the regular, full,
non-incremental backup.  But if a tablespace was introduced subsequently,
it would only appear in an incremental backup.  Wouldn't this then mean
that a mapping would need to be provided based on the path to the
tablespace of that incremental backup's copy?

Regards

Thom


Re: trying again to get incremental backup

2024-01-03 Thread Robert Haas
On Fri, Dec 22, 2023 at 12:00 AM Alexander Lakhin  wrote:
> My quick experiment shows that that TimestampDifferenceMilliseconds call
> always returns zero, due to its arguments being swapped.

Thanks. Tom already changed the unsigned -> int stuff in a separate
commit, so I just pushed the fixes to PrepareForIncrementalBackup,
both the one I had before, and swapping the arguments to
TimestampDifferenceMilliseconds.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2024-01-02 Thread Robert Haas
On Wed, Dec 27, 2023 at 10:36 AM Nathan Bossart
 wrote:
> On Wed, Dec 27, 2023 at 09:11:02AM -0500, Robert Haas wrote:
> > Thanks. I don't think there's a real bug, but I pushed a fix, same as
> > what you had.
>
> Thanks!  I also noticed that WALSummarizerLock probably needs a mention in
> wait_event_names.txt.

Fixed.

It seems like it would be good if there were an automated cross-check
between lwlocknames.txt and wait_event_names.txt.
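
A rough sketch of what such a cross-check could look like as a shell
one-liner (the file paths and formats are assumptions, and it assumes
each lwlocknames.txt entry ends in "Lock" while the corresponding wait
event name in wait_event_names.txt is the same string without that
suffix, which may not hold for every entry):

$ comm -23 \
    <(sed -n 's/^\([A-Za-z]*\)Lock[[:space:]].*/\1/p' src/backend/storage/lmgr/lwlocknames.txt | sort) \
    <(grep -oE '^[A-Za-z]+' src/backend/utils/activity/wait_event_names.txt | sort)

Anything it prints would be a lock defined in lwlocknames.txt with no
matching entry in wait_event_names.txt.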

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-12-27 Thread Nathan Bossart
On Wed, Dec 27, 2023 at 09:11:02AM -0500, Robert Haas wrote:
> Thanks. I don't think there's a real bug, but I pushed a fix, same as
> what you had.

Thanks!  I also noticed that WALSummarizerLock probably needs a mention in
wait_event_names.txt.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: trying again to get incremental backup

2023-12-27 Thread Robert Haas
On Sat, Dec 23, 2023 at 4:51 PM Nathan Bossart  wrote:
> My compiler has the following complaint:
>
> ../postgresql/src/backend/postmaster/walsummarizer.c: In function 
> ‘GetOldestUnsummarizedLSN’:
> ../postgresql/src/backend/postmaster/walsummarizer.c:540:32: error: 
> ‘unsummarized_lsn’ may be used uninitialized in this function 
> [-Werror=maybe-uninitialized]
>   540 |  WalSummarizerCtl->pending_lsn = unsummarized_lsn;
>   |  ~~^~

Thanks. I don't think there's a real bug, but I pushed a fix, same as
what you had.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-12-23 Thread Nathan Bossart
My compiler has the following complaint:

../postgresql/src/backend/postmaster/walsummarizer.c: In function 
‘GetOldestUnsummarizedLSN’:
../postgresql/src/backend/postmaster/walsummarizer.c:540:32: error: 
‘unsummarized_lsn’ may be used uninitialized in this function 
[-Werror=maybe-uninitialized]
  540 |  WalSummarizerCtl->pending_lsn = unsummarized_lsn;
  |  ~~^~

I haven't looked closely to see whether there is actually a problem here,
but the attached patch at least resolves the warning.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 9b5d3cdeb0..0cf6bbe59d 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -438,7 +438,7 @@ GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
 	LWLockMode	mode = reset_pending_lsn ? LW_EXCLUSIVE : LW_SHARED;
 	int			n;
 	List	   *tles;
-	XLogRecPtr	unsummarized_lsn;
+	XLogRecPtr	unsummarized_lsn = InvalidXLogRecPtr;
 	TimeLineID	unsummarized_tli = 0;
 	bool		should_make_exact = false;
 	List	   *existing_summaries;


Re: trying again to get incremental backup

2023-12-21 Thread Alexander Lakhin

21.12.2023 23:43, Robert Haas wrote:

There are also two deadcode.DeadStores complaints from clang. First one is
about:
  /*
   * Align the wait time to prevent drift. This doesn't really matter,
   * but we'd like the warnings about how long we've been waiting to say
   * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
   * drifting to something that is not a multiple of ten.
   */
  timeout_in_ms -=
  TimestampDifferenceMilliseconds(current_time, initial_time) %
  timeout_in_ms;
It looks like this timeout is really not used.

Oops. It should be. See attached.


My quick experiment shows that that TimestampDifferenceMilliseconds call
always returns zero, due to its arguments being swapped.

The other changes look good to me.

Thank you!

Best regards,
Alexander




Re: trying again to get incremental backup

2023-12-21 Thread Robert Haas
On Thu, Dec 21, 2023 at 10:00 AM Alexander Lakhin  wrote:
> Please look at the attached patch; it corrects all 29 items ("recods"
> fixed in two places), but maybe you find some substitutions wrong...

Thanks, committed with a few additions.

> I've also observed that those commits introduced new warnings:
> $ CC=gcc-12 CPPFLAGS="-Wtype-limits" ./configure -q && make -s -j8
> reconstruct.c: In function ‘read_bytes’:
> reconstruct.c:511:24: warning: comparison of unsigned expression in ‘< 0’ is 
> always false [-Wtype-limits]
>511 | if (rb < 0)
>|^
> reconstruct.c: In function ‘write_reconstructed_file’:
> reconstruct.c:650:40: warning: comparison of unsigned expression in ‘< 0’ is 
> always false [-Wtype-limits]
>650 | if (rb < 0)
>|^
> reconstruct.c:662:32: warning: comparison of unsigned expression in ‘< 0’ is 
> always false [-Wtype-limits]
>662 | if (wb < 0)

Oops. I think the variables should be type int. See attached.

> There are also two deadcode.DeadStores complaints from clang. First one is
> about:
>  /*
>   * Align the wait time to prevent drift. This doesn't really matter,
>   * but we'd like the warnings about how long we've been waiting to 
> say
>   * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
>   * drifting to something that is not a multiple of ten.
>   */
>  timeout_in_ms -=
>  TimestampDifferenceMilliseconds(current_time, initial_time) %
>  timeout_in_ms;
> It looks like this timeout is really not used.

Oops. It should be. See attached.

> And the minor one (similar to many existing, maybe doesn't deserve fixing):
> walsummarizer.c:808:5: warning: Value stored to 'summary_end_lsn' is never 
> read [deadcode.DeadStores]
>  summary_end_lsn = 
> private_data->read_upto;
>  ^ ~~~

It kind of surprises me that this is dead, but it seems best to keep
it there to be on the safe side, in case some change to the logic
renders it not dead in the future.

> >> Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a
> >> comment above redo_pointer_at_last_summary_removal declaration, but
> >> perhaps it should say about removing summaries instead?
> > Wow, yeah. Thanks, will fix.
>
> Thank you for paying attention to it!

I'll fix this next.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


fix-ib-thinkos.patch
Description: Binary data


Re: trying again to get incremental backup

2023-12-21 Thread Alexander Lakhin

21.12.2023 15:07, Robert Haas wrote:

On Wed, Dec 20, 2023 at 11:00 PM Alexander Lakhin  wrote:

I've found several typos/inconsistencies introduced with 174c48050 and
dc2123400. Maybe you would want to fix them, while on it?:

That's an impressively long list of mistakes in something I thought
I'd been careful about. Sigh.

I don't suppose you could provide these corrections in the form of a
patch? I don't really want to run these sed commands across the entire
tree and then try to figure out what's what...


Please look at the attached patch; it corrects all 29 items ("recods"
fixed in two places), but maybe you find some substitutions wrong...

I've also observed that those commits introduced new warnings:
$ CC=gcc-12 CPPFLAGS="-Wtype-limits" ./configure -q && make -s -j8
reconstruct.c: In function ‘read_bytes’:
reconstruct.c:511:24: warning: comparison of unsigned expression in ‘< 0’ is 
always false [-Wtype-limits]
  511 | if (rb < 0)
  |    ^
reconstruct.c: In function ‘write_reconstructed_file’:
reconstruct.c:650:40: warning: comparison of unsigned expression in ‘< 0’ is 
always false [-Wtype-limits]
  650 | if (rb < 0)
  |    ^
reconstruct.c:662:32: warning: comparison of unsigned expression in ‘< 0’ is 
always false [-Wtype-limits]
  662 | if (wb < 0)

There are also two deadcode.DeadStores complaints from clang. First one is
about:
    /*
 * Align the wait time to prevent drift. This doesn't really matter,
 * but we'd like the warnings about how long we've been waiting to say
 * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
 * drifting to something that is not a multiple of ten.
 */
    timeout_in_ms -=
    TimestampDifferenceMilliseconds(current_time, initial_time) %
    timeout_in_ms;
It looks like this timeout is really not used.

And the minor one (similar to many existing, maybe doesn't deserve fixing):
walsummarizer.c:808:5: warning: Value stored to 'summary_end_lsn' is never read 
[deadcode.DeadStores]
    summary_end_lsn = 
private_data->read_upto;
    ^ ~~~


Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a
comment above redo_pointer_at_last_summary_removal declaration, but
perhaps it should say about removing summaries instead?

Wow, yeah. Thanks, will fix.


Thank you for paying attention to it!

Best regards,
Alexander

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 7c183a5cfd..e411ddbf45 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -213,7 +213,7 @@ PostgreSQL documentation
 
  
   -i old_manifest_file
-  --incremental=old_meanifest_file
+  --incremental=old_manifest_file
   

 Performs an incremental
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
index e1cb31607e..8a0a600c2b 100644
--- a/doc/src/sgml/ref/pg_combinebackup.sgml
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -83,7 +83,7 @@ PostgreSQL documentation
   

 The -n/--dry-run option instructs
-pg_cominebackup to figure out what would be done
+pg_combinebackup to figure out what would be done
 without actually creating the target directory or any output files.
 It is particularly useful in combination with --debug.

diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
index 1e5a5ac33a..42bbe564e2 100644
--- a/src/backend/backup/basebackup_incremental.c
+++ b/src/backend/backup/basebackup_incremental.c
@@ -158,7 +158,7 @@ CreateIncrementalBackupInfo(MemoryContext mcxt)
 
 /*
  * Before taking an incremental backup, the caller must supply the backup
- * manifest from a prior backup. Each chunk of manifest data recieved
+ * manifest from a prior backup. Each chunk of manifest data received
  * from the client should be passed to this function.
  */
 void
@@ -462,7 +462,7 @@ PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
 			++deadcycles;
 
 		/*
-		 * If we've managed to wait for an entire minute withot the WAL
+		 * If we've managed to wait for an entire minute without the WAL
 		 * summarizer absorbing a single WAL record, error out; probably
 		 * something is wrong.
 		 *
@@ -473,7 +473,7 @@ PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
 		 * likely to catch a reasonable number of the things that can go wrong
 		 * in practice (e.g. the summarizer process is completely hung, say
 		 * because somebody hooked up a debugger to it or something) without
-		 * giving up too quickly when the sytem is just slow.
+		 * giving up too quickly when the system is just slow.
 		 */
 		if (deadcycles >= 6)
 			

Re: trying again to get incremental backup

2023-12-21 Thread Robert Haas
On Wed, Dec 20, 2023 at 11:00 PM Alexander Lakhin  wrote:
> I've found several typos/inconsistencies introduced with 174c48050 and
> dc2123400. Maybe you would want to fix them, while on it?:

That's an impressively long list of mistakes in something I thought
I'd been careful about. Sigh.

I don't suppose you could provide these corrections in the form of a
patch? I don't really want to run these sed commands across the entire
tree and then try to figure out what's what...

> Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a
> comment above redo_pointer_at_last_summary_removal declaration, but
> perhaps it should say about removing summaries instead?

Wow, yeah. Thanks, will fix.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-12-20 Thread Alexander Lakhin

Hello Robert,

20.12.2023 23:56, Robert Haas wrote:

On Wed, Dec 20, 2023 at 8:11 AM Jakub Wartak
 wrote:

the v15 patchset (posted yesterday) test results are GOOD:

All right. I committed the main two patches, dropped the
for-testing-only patch, and added a simple test to the remaining
pg_walsummary patch. That needs more work, but here's what I have as
of now.


I've found several typos/inconsistencies introduced with 174c48050 and
dc2123400. Maybe you would want to fix them, while on it?:
s/arguent/argument/;
s/BlkRefTableEntry/BlockRefTableEntry/;
s/BlockRefTablEntry/BlockRefTableEntry/;
s/Caonicalize/Canonicalize/;
s/Checksum_Algorithm/Checksum-Algorithm/;
s/corresonding/corresponding/;
s/differenly/differently/;
s/excessing/excessive/;
s/ exta / extra /;
s/hexademical/hexadecimal/;
s/initally/initially/;
s/MAXGPATH/MAXPGPATH/;
s/overrreacting/overreacting/;
s/old_meanifest_file/old_manifest_file/;
s/pg_cominebackup/pg_combinebackup/;
s/pg_tblpc/pg_tblspc/;
s/pointrs/pointers/;
s/Recieve/Receive/;
s/recieved/received/;
s/ recod / record /;
s/ recods / records /;
s/substntially/substantially/;
s/sumamry/summary/;
s/summry/summary/;
s/synchronizaton/synchronization/;
s/sytem/system/;
s/withot/without/;
s/Woops/Whoops/;
s/xlograder/xlogreader/;

Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a
comment above redo_pointer_at_last_summary_removal declaration, but
perhaps it should say about removing summaries instead?

Best regards,
Alexander




Re: trying again to get incremental backup

2023-12-20 Thread Robert Haas
On Wed, Dec 20, 2023 at 8:11 AM Jakub Wartak
 wrote:
> the v15 patchset (posted yesterday) test results are GOOD:

All right. I committed the main two patches, dropped the
for-testing-only patch, and added a simple test to the remaining
pg_walsummary patch. That needs more work, but here's what I have as
of now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v17-0001-Add-new-pg_walsummary-tool.patch
Description: Binary data


Re: trying again to get incremental backup

2023-12-20 Thread Jakub Wartak
Hi Robert,

On Tue, Dec 19, 2023 at 9:36 PM Robert Haas  wrote:
>
> On Fri, Dec 15, 2023 at 5:36 AM Jakub Wartak
>  wrote:
> > I've played with initdb/pg_upgrade (17->17) and i don't get DBID
> > mismatch (of course they do differ after initdb), but i get this
> > instead:
> >
> >  $ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432
> > --incremental /tmp/incr1.before.upgrade/backup_manifest
> > WARNING:  aborting backup due to backend exiting before pg_backup_stop
> > was called
> > pg_basebackup: error: could not initiate base backup: ERROR:  timeline
> > 2 found in manifest, but not in this server's history
> > pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"
> >
> > Also in the manifest I don't see DBID ?
> > Maybe it's a nuisance and all I'm trying to see is that if an
> > automated cronjob with pg_basebackup --incremental hits a freshly
> > upgraded cluster, that error message without errhint() is going to
> > scare some Junior DBAs.
>
> Yeah. I think we should add the system identifier to the manifest, but
> I think that should be left for a future project, as I don't think the
> lack of it is a good reason to stop all progress here. When we have
> that, we can give more reliable error messages about system mismatches
> at an earlier stage. Unfortunately, I don't think that the timeline
> messages you're seeing here are going to apply in every case: suppose
> you have two unrelated servers that are both on timeline 1. I think
> you could use a base backup from one of those servers and use it as
> the basis for the incremental from the other, and I think that if you
> did it right you might fail to hit any sanity check that would block
> that. pg_combinebackup will realize there's a problem, because it has
> the whole cluster to work with, not just the manifest, and will notice
> the mismatching system identifiers, but that's kind of late to find
> out that you made a big mistake. However, right now, it's the best we
> can do.
>

OK, understood.

> > The incrementals are being generated , but just for the first (0)
> > segment of the relation?
>
> I committed the first two patches from the series I posted yesterday.
> The first should fix this, and the second relocates parse_manifest.c.
> That patch hasn't changed in a while and seems unlikely to attract
> major objections. There's no real reason to commit it until we're
> ready to move forward with the main patches, but I think we're very
> close to that now, so I did.
>
> Here's a rebase for cfbot.

the v15 patchset (posted yesterday) test results are GOOD:

1. make check-world - GOOD
2. cfbot was GOOD
3. the devel/master bug present in
parse_filename_for_nontemp_relation() seems to be gone (in local
testing)
4. some further tests:
test_across_wallevelminimal.sh - GOOD
test_incr_after_timelineincrease.sh - GOOD
test_incr_on_standby_after_promote.sh - GOOD
test_many_incrementals_dbcreate.sh - GOOD
test_many_incrementals.sh - GOOD
test_multixact.sh - GOOD
test_pending_2pc.sh - GOOD
test_reindex_and_vacuum_full.sh - GOOD
test_repro_assert.sh
test_standby_incr_just_backup.sh - GOOD
test_stuck_walsum.sh - GOOD
test_truncaterollback.sh - GOOD
test_unlogged_table.sh - GOOD
test_full_pri__incr_stby__restore_on_pri.sh - GOOD
test_full_pri__incr_stby__restore_on_stby.sh - GOOD
test_full_stby__incr_stby__restore_on_pri.sh - GOOD
test_full_stby__incr_stby__restore_on_stby.sh - GOOD

5. the more real-world pgbench test with localized segment writes
using `\set aid random_exponential...` [1] indicates much greater
efficiency in terms of backup space use now, du -sm shows:

210229  /backups/backups/full
250 /backups/backups/incr.1
255 /backups/backups/incr.2
[..]
348 /backups/backups/incr.13
408 /backups/backups/incr.14 // latest(20th of Dec on 10:40)
6673    /backups/archive/

The DB size was as reported by \l+ 205GB.
That pgbench was running for ~27h (19th Dec 08:39 -> 20th Dec 11:30)
with slow 100 TPS (-R), so no insane amounts of WAL.
Time to reconstruct 14 chained incremental backups was 45mins
(pg_combinebackup -o /var/lib/postgres/17/data /backups/backups/full
/backups/backups/incr.1 (..)  /backups/backups/incr.14).
DB after recovering was OK and working fine.

-J.




Re: trying again to get incremental backup

2023-12-18 Thread Robert Haas
On Fri, Dec 15, 2023 at 6:58 AM Peter Eisentraut  wrote:
> A separate bikeshedding topic: The GUC "summarize_wal", could that be
> "wal_something" instead?  (wal_summarize? wal_summarizer?)  It would be
> nice if these settings names group together a bit, both with existing
> wal_* ones and also with the new ones you are adding
> (wal_summary_keep_time).

Yeah, this is highly debatable, so bikeshed away. IMHO, the question
here is whether we care more about (1) having the name of the GUC
sound nice grammatically or (2) having the GUC begin with the same
string as other, related GUCs. I think that Tom Lane tends to prefer
the former, and probably some other people do too, while some other
people tend to prefer the latter. Ideally it would be possible to
satisfy both goals at once here, but everything I thought about that
started with "wal" sounded too awkward for me to like it; hence the
current choice of name. But if there's consensus on something else, so
be it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-12-18 Thread Robert Haas
On Fri, Dec 15, 2023 at 6:53 AM Peter Eisentraut  wrote:
> The first fixes up some things in nls.mk related to a file move.  The
> second is some cleanup because some function you are using has been
> removed in the meantime; you probably found that yourself while rebasing.

Incorporated these. As you guessed,
MemoryContextResetAndDeleteChildren -> MemoryContextReset had already
been done locally.

> The pg_walsummary patch doesn't have a nls.mk, but you also comment that
> it doesn't have tests yet, so I assume it's not considered complete yet
> anyway.

I think this was more of a case of me just not realizing that I should
add that. I'll add something simple to the next version, but I'm not
very good at this NLS stuff.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-12-18 Thread Peter Eisentraut

Another set of comments, about the patch that adds pg_combinebackup:

Make sure all the options are listed in a consistent order.  We have 
lately changed everything to be alphabetical.  This includes:


- reference page pg_combinebackup.sgml

- long_options listing

- getopt_long() argument

- subsequent switch

- (--help output, but it looks ok as is)

Also, in pg_combinebackup.sgml, the option --sync-method is listed as if 
it does not take an argument, but it does.






Re: trying again to get incremental backup

2023-12-15 Thread Peter Eisentraut
A separate bikeshedding topic: The GUC "summarize_wal", could that be 
"wal_something" instead?  (wal_summarize? wal_summarizer?)  It would be 
nice if these settings names group together a bit, both with existing 
wal_* ones and also with the new ones you are adding 
(wal_summary_keep_time).






Re: trying again to get incremental backup

2023-12-15 Thread Peter Eisentraut

I have a couple of quick fixes here.

The first fixes up some things in nls.mk related to a file move.  The 
second is some cleanup because some function you are using has been 
removed in the meantime; you probably found that yourself while rebasing.


The pg_walsummary patch doesn't have a nls.mk, but you also comment that 
it doesn't have tests yet, so I assume it's not considered complete yet 
anyway.
From 04aae4ee91ddd1d4ce061c36a99b0fa18bdd98ec Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 14 Dec 2023 12:50:33 +0100
Subject: [PATCH 2/6] fixup! Move src/bin/pg_verifybackup/parse_manifest.c into
 src/common.

---
 src/bin/pg_verifybackup/nls.mk | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_verifybackup/nls.mk b/src/bin/pg_verifybackup/nls.mk
index eba73a2c05..9e6a6049ba 100644
--- a/src/bin/pg_verifybackup/nls.mk
+++ b/src/bin/pg_verifybackup/nls.mk
@@ -1,10 +1,10 @@
 # src/bin/pg_verifybackup/nls.mk
 CATALOG_NAME = pg_verifybackup
 GETTEXT_FILES= $(FRONTEND_COMMON_GETTEXT_FILES) \
-   parse_manifest.c \
pg_verifybackup.c \
../../common/fe_memutils.c \
-   ../../common/jsonapi.c
+   ../../common/jsonapi.c \
+   ../../common/parse_manifest.c
 GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS) \
json_manifest_parse_failure:2 \
error_cb:2 \
-- 
2.43.0

From 25211044687a629e632ef0a2bfad30acea337266 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 14 Dec 2023 18:32:29 +0100
Subject: [PATCH 4/6] fixup! Add a new WAL summarizer process.

---
 src/backend/backup/meson.build | 2 +-
 src/backend/postmaster/walsummarizer.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..5d4ebe3ebe 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -13,5 +13,5 @@ backend_sources += files(
   'basebackup_throttle.c',
   'basebackup_zstd.c',
   'walsummary.c',
-  'walsummaryfuncs.c'
+  'walsummaryfuncs.c',
 )
diff --git a/src/backend/postmaster/walsummarizer.c 
b/src/backend/postmaster/walsummarizer.c
index 7c840c36b3..9fa155349e 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -290,7 +290,7 @@ WalSummarizerMain(void)
FlushErrorState();
 
/* Flush any leaked data in the top-level context */
-   MemoryContextResetAndDeleteChildren(context);
+   MemoryContextReset(context);
 
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -342,7 +342,7 @@ WalSummarizerMain(void)
XLogRecPtr  end_of_summary_lsn;
 
/* Flush any leaked data in the top-level context */
-   MemoryContextResetAndDeleteChildren(context);
+   MemoryContextReset(context);
 
/* Process any signals received recently. */
HandleWalSummarizerInterrupts();
-- 
2.43.0



Re: trying again to get incremental backup

2023-12-15 Thread Jakub Wartak
Hi Robert,

On Wed, Dec 13, 2023 at 2:16 PM Robert Haas  wrote:
>
 >
> > > not even in case of an intervening
> > > timeline switch. So, all of the errors in this function are warning
> > > you that you've done something that you really should not have done.
> > > In this particular case, you've either (1) manually removed the
> > > timeline history file, and not just any timeline history file but the
> > > one for a timeline for a backup that you still intend to use as the
> > > basis for taking an incremental backup or (2) tried to use a full
> > > backup taken from one server as the basis for an incremental backup on
> > > a completely different server that happens to share the same system
> > > identifier, e.g. because you promoted two standbys derived from the
> > > same original primary and then tried to use a full backup taken on one
> > > as the basis for an incremental backup taken on the other.
> > >
> >
> > Okay, but please consider two other possibilities:
> >
> > (3) I had a corrupted DB where I've fixed it by running pg_resetwal
> > and some cronjob just a day later attempted to take incremental and
> > failed with that error.
> >
> > (4) I had pg_upgraded (which calls pg_resetwal on fresh initdb
> > directory) the DB where I had cronjob that just failed with this error
> >
> > I bet that (4) is going to happen more often than (1), (2) , which
> > might trigger users to complain on forums, support tickets.
>
> Hmm. In case (4), I was thinking that you'd get a complaint about the
> database system identifier not matching. I'm not actually sure that's
> what would happen, though, now that you mention it.
>

I've played with initdb/pg_upgrade (17->17) and i don't get DBID
mismatch (of course they do differ after initdb), but i get this
instead:

 $ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432
--incremental /tmp/incr1.before.upgrade/backup_manifest
WARNING:  aborting backup due to backend exiting before pg_backup_stop
was called
pg_basebackup: error: could not initiate base backup: ERROR:  timeline
2 found in manifest, but not in this server's history
pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"

Also in the manifest I don't see DBID ?
Maybe it's a nuisance and all I'm trying to see is that if an
automated cronjob with pg_basebackup --incremental hits a freshly
upgraded cluster, that error message without errhint() is going to
scare some Junior DBAs.

> > LGTM, all quick tests work from my end too. BTW: I have also scheduled
> > the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
> > test. I'll let you know how it went.
>
> Awesome, thank you so much.

OK, so pgbench -i -s 14440 and pgbench -P 1 -R 100 -c 8 -T 259200 did
generate pretty large incrementals (so I had to abort it due to lack
of space, I was expecting to see smaller incrementals so it took too
much space). I initially suspected that the problem lies in the normal
distribution of `\set aid random(1, 10 * :scale)` for tpcbb that
UPDATEs on big pgbench_accounts.

$ du -sm /backups/backups/* /backups/archive/
216205  /backups/backups/full
215207  /backups/backups/incr.1
216706  /backups/backups/incr.2
102273  /backups/archive/

So I verified the recoverability yesterday anyway - the
pg_combinebackup "full incr.1 incr.2" took 44 minutes and later
archive WAL recovery and promotion SUCCEEDED. The 8-way parallel seqscan
for sum(abalance) on the pgbench_accounts and other tables worked
fine. The pg_combinebackup was using 15-20% CPU (mostly on %sys),
while performing mostly 60-80MB/s separately for both reads and writes
(it's slow, but it's due to maxed out sequential I/O of the Premium on a
small SSD on Azure).

So i've launched another improved test (to force more localized
UPDATEs) to see the more real-world space-effectiveness of the
incremental backup:

\set aid random_exponential(1, 10 * :scale, 8)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

But then... (and i have verified the low-IDs for :aid above).. same
has happened:

backups/backups$ du -sm /backups/backups/*
210229  /backups/backups/full
208299  /backups/backups/incr.1
208351  /backups/backups/incr.2

# pgbench_accounts has relfilenodeid 16486
postgres@jw-test-1:/backups/backups$ for L in 5 10 15 30 100 161 173
174 175  ; do md5sum full/base/5/16486.$L ./incr.1/base/5/16486.$L
./incr.2/base/5/16486.$L /var/lib/postgres/17/data/base/5/16486.$L ;
echo; done
005c6bbb40fca3c1a0a819376ef0e793  full/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793  ./incr.1/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793  ./incr.2/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793  /var/lib/postgres/17/data/base/5/16486.5

[.. all the checksums match (!) for the above $L..]


Re: trying again to get incremental backup

2023-12-13 Thread Robert Haas
On Wed, Dec 13, 2023 at 5:39 AM Jakub Wartak
 wrote:
> > I can't exactly say that such a hint would be inaccurate, but I think
> > the impulse to add it here is misguided. One of my design goals for
> > this system is to make it so that you never have to take a new
> > incremental backup "just because,"
>
> Did you mean take a new full backup here?

Yes, apologies for the typo.

> > not even in case of an intervening
> > timeline switch. So, all of the errors in this function are warning
> > you that you've done something that you really should not have done.
> > In this particular case, you've either (1) manually removed the
> > timeline history file, and not just any timeline history file but the
> > one for a timeline for a backup that you still intend to use as the
> > basis for taking an incremental backup or (2) tried to use a full
> > backup taken from one server as the basis for an incremental backup on
> > a completely different server that happens to share the same system
> > identifier, e.g. because you promoted two standbys derived from the
> > same original primary and then tried to use a full backup taken on one
> > as the basis for an incremental backup taken on the other.
> >
>
> Okay, but please consider two other possibilities:
>
> (3) I had a corrupted DB where I've fixed it by running pg_resetwal
> and some cronjob just a day later attempted to take incremental and
> failed with that error.
>
> (4) I had pg_upgraded (which calls pg_resetwal on fresh initdb
> directory) the DB where I had cronjob that just failed with this error
>
> I bet that (4) is going to happen more often than (1), (2) , which
> might trigger users to complain on forums, support tickets.

Hmm. In case (4), I was thinking that you'd get a complaint about the
database system identifier not matching. I'm not actually sure that's
what would happen, though, now that you mention it.

In case (3), I think you would get an error about missing WAL summary files.

> > Huzzah, the cfbot likes the patch set now. Here's a new version with
> > the promised fix for your non-reproducible issue. Let's see whether
> > you and cfbot still like this version.
>
> LGTM, all quick tests work from my end too. BTW: I have also scheduled
> the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
> test. I'll let you know how it went.

Awesome, thank you so much.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-12-13 Thread Jakub Wartak
Hi Robert,

On Mon, Dec 11, 2023 at 6:08 PM Robert Haas  wrote:
>
> On Fri, Dec 8, 2023 at 5:02 AM Jakub Wartak
>  wrote:
> > While we are at it, maybe around the below in PrepareForIncrementalBackup()
> >
> > if (tlep[i] == NULL)
> > ereport(ERROR,
> >
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >  errmsg("timeline %u found in
> > manifest, but not in this server's history",
> > range->tli)));
> >
> > we could add
> >
> > errhint("You might need to start a new full backup instead of
> > incremental one")
> >
> > ?
>
> I can't exactly say that such a hint would be inaccurate, but I think
> the impulse to add it here is misguided. One of my design goals for
> this system is to make it so that you never have to take a new
> incremental backup "just because,"

Did you mean take a new full backup here?

> not even in case of an intervening
> timeline switch. So, all of the errors in this function are warning
> you that you've done something that you really should not have done.
> In this particular case, you've either (1) manually removed the
> timeline history file, and not just any timeline history file but the
> one for a timeline for a backup that you still intend to use as the
> basis for taking an incremental backup or (2) tried to use a full
> backup taken from one server as the basis for an incremental backup on
> a completely different server that happens to share the same system
> identifier, e.g. because you promoted two standbys derived from the
> same original primary and then tried to use a full backup taken on one
> as the basis for an incremental backup taken on the other.
>

Okay, but please consider two other possibilities:

(3) I had a corrupted DB where I've fixed it by running pg_resetwal
and some cronjob just a day later attempted to take incremental and
failed with that error.

(4) I had pg_upgraded (which calls pg_resetwal on fresh initdb
directory) the DB where I had cronjob that just failed with this error

I bet that (4) is going to happen more often than (1), (2) , which
might trigger users to complain on forums, support tickets.

> > > I have a fix for this locally, but I'm going to hold off on publishing
> > > a new version until either there's a few more things I can address all
> > > at once, or until Thomas commits the ubsan fix.
> > >
> >
> > Great, I cannot get it to fail again today, it had to be some dirty
> > state of the testing env. BTW: Thomas has pushed that ubsan fix.
>
> Huzzah, the cfbot likes the patch set now. Here's a new version with
> the promised fix for your non-reproducible issue. Let's see whether
> you and cfbot still like this version.

LGTM, all quick tests work from my end too. BTW: I have also scheduled
the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
test. I'll let you know how it went.

-J.




Re: trying again to get incremental backup

2023-12-10 Thread Dilip Kumar
On Mon, Dec 11, 2023 at 11:44 AM Dilip Kumar  wrote:
>
> On Tue, Dec 5, 2023 at 11:40 PM Robert Haas  wrote:
> >
> > On Mon, Dec 4, 2023 at 3:58 PM Robert Haas  wrote:
> > > Considering all this, what I'm inclined to do is go and put
> > > UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
> > > accordingly. But first: does anybody see more problems here that I may
> > > have missed?
> >
> > OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a
> > long comment explaining why that's believed to be necessary and
> > sufficient. I committed 0001 and 0002 from the previous series also,
> > since it doesn't seem like anyone has further comments on those
> > renamings.
>
> I have done some testing on standby, but I am facing some issues,
> although things are working fine on the primary.  As shown below test
> [1] standby is reporting some errors that the manifest requires WAL from
> 0/6F8, but this backup starts at 0/628.  Then I tried to look
> into the manifest file of the full backup and it shows contents as
> below[0].  Actually from this WARNING and ERROR, I am not clear what
> is the problem, I understand that full backup ends at  "0/6F8" so
> for the next incremental backup we should be looking for a summary
> that has WAL starting at "0/6F8" and we do have those WALs.  In
> fact, the error message is saying "this backup starts at 0/628"
> which is before  "0/6F8" so what's the issue?
>
> [0]
> "WAL-Ranges": [
> { "Timeline": 1, "Start-LSN": "0/628", "End-LSN": "0/6F8" }
>
>
> [1]
> -- test on primary
> dilipkumar@dkmac bin % ./pg_basebackup -D d
> dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest
>
> -- cleanup the backup directory
> dilipkumar@dkmac bin % rm -rf d
> dilipkumar@dkmac bin % rm -rf d1
>
> --test on standby
> dilipkumar@dkmac bin % ./pg_basebackup -D d -p 5433
> dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest -p 5433
>
> WARNING:  aborting backup due to backend exiting before pg_backup_stop
> was called
> pg_basebackup: error: could not initiate base backup: ERROR:  manifest
> requires WAL from final timeline 1 ending at 0/6F8, but this
> backup starts at 0/628
> pg_basebackup: removing data directory "d1"

Jakub pinged me offlist and pointed me to the thread[1] where it is
already explained so I think we can ignore this.

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoYuC27_ToGtTTNyHgpn_eJmdqrmhJ93bAbinkBtXsWHaA%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-12-10 Thread Dilip Kumar
On Tue, Dec 5, 2023 at 11:40 PM Robert Haas  wrote:
>
> On Mon, Dec 4, 2023 at 3:58 PM Robert Haas  wrote:
> > Considering all this, what I'm inclined to do is go and put
> > UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
> > accordingly. But first: does anybody see more problems here that I may
> > have missed?
>
> OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a
> long comment explaining why that's believed to be necessary and
> sufficient. I committed 0001 and 0002 from the previous series also,
> since it doesn't seem like anyone has further comments on those
> renamings.

I have done some testing on standby, but I am facing some issues,
although things are working fine on the primary.  As shown below test
[1] standby is reporting some errors that the manifest requires WAL from
0/6F8, but this backup starts at 0/628.  Then I tried to look
into the manifest file of the full backup and it shows contents as
below[0].  Actually from this WARNING and ERROR, I am not clear what
is the problem, I understand that full backup ends at  "0/6F8" so
for the next incremental backup we should be looking for a summary
that has WAL starting at "0/6F8" and we do have those WALs.  In
fact, the error message is saying "this backup starts at 0/628"
which is before  "0/6F8" so what's the issue?

[0]
"WAL-Ranges": [
{ "Timeline": 1, "Start-LSN": "0/628", "End-LSN": "0/6F8" }


[1]
-- test on primary
dilipkumar@dkmac bin % ./pg_basebackup -D d
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest

-- cleanup the backup directory
dilipkumar@dkmac bin % rm -rf d
dilipkumar@dkmac bin % rm -rf d1

--test on standby
dilipkumar@dkmac bin % ./pg_basebackup -D d -p 5433
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest -p 5433

WARNING:  aborting backup due to backend exiting before pg_backup_stop
was called
pg_basebackup: error: could not initiate base backup: ERROR:  manifest
requires WAL from final timeline 1 ending at 0/6F8, but this
backup starts at 0/628
pg_basebackup: removing data directory "d1"


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-12-08 Thread Jakub Wartak
On Thu, Dec 7, 2023 at 4:15 PM Robert Haas  wrote:

Hi Robert,

> On Thu, Dec 7, 2023 at 9:42 AM Jakub Wartak
>  wrote:
> > Comment: I was wondering if it wouldn't make some sense to teach
> pg_resetwal to actually delete all WAL summaries after any
> > WAL/controlfile alteration?
>
> I thought that this was a good idea so I decided to go implement it,
> only to discover that it was already part of the patch set ... did you
> find some case where it doesn't work as expected? The code looks like
> this:

Ah, my bad, with a fresh mind and coffee the error message makes it
clear and of course it did reset the summaries properly.

While we are at it, maybe around the below in PrepareForIncrementalBackup()

if (tlep[i] == NULL)
ereport(ERROR,

(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 errmsg("timeline %u found in
manifest, but not in this server's history",
range->tli)));

we could add

errhint("You might need to start a new full backup instead of
incremental one")

?

> > test_pending_2pc.sh - getting GOOD on most recent runs, but several
> > times during early testing (probably due to my own mishaps), I've been
> > hit by Abort/TRAP. I'm still investigating and trying to reproduce
> > those ones. TRAP: failed Assert("summary_end_lsn >=
> > WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940
>
> I have a fix for this locally, but I'm going to hold off on publishing
> a new version until either there's a few more things I can address all
> at once, or until Thomas commits the ubsan fix.
>

Great, I cannot get it to fail again today, it had to be some dirty
state of the testing env. BTW: Thomas has pushed that ubsan fix.

-J.




Re: trying again to get incremental backup

2023-12-07 Thread Robert Haas
On Thu, Dec 7, 2023 at 9:42 AM Jakub Wartak
 wrote:
> Comment: I was wondering if it wouldn't make some sense to teach
> pg_resetwal to actually delete all WAL summaries after any
> WAL/controlfile alteration?

I thought that this was a good idea so I decided to go implement it,
only to discover that it was already part of the patch set ... did you
find some case where it doesn't work as expected? The code looks like
this:

RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
KillExistingWALSummaries();
WriteEmptyXLOG();

> test_pending_2pc.sh - getting GOOD on most recent runs, but several
> times during early testing (probably due to my own mishaps), I've been
> hit by Abort/TRAP. I'm still investigating and trying to reproduce
> those ones. TRAP: failed Assert("summary_end_lsn >=
> WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940

I have a fix for this locally, but I'm going to hold off on publishing
a new version until either there's a few more things I can address all
at once, or until Thomas commits the ubsan fix.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-12-07 Thread Jakub Wartak
On Tue, Dec 5, 2023 at 7:11 PM Robert Haas  wrote:

[..v13 patchset]

The results with v13 patchset are following:

* - requires checkpoint on primary when doing incremental on standby
when it's too idle, this was explained by Robert in [1], something AKA
too-fast-incremental backup due to testing-scenario:

test_across_wallevelminimal.sh - GOOD
test_many_incrementals_dbcreate.sh - GOOD
test_many_incrementals.sh - GOOD
test_multixact.sh - GOOD
test_reindex_and_vacuum_full.sh - GOOD
test_standby_incr_just_backup.sh - GOOD*
test_truncaterollback.sh - GOOD
test_unlogged_table.sh - GOOD
test_full_pri__incr_stby__restore_on_pri.sh - GOOD
test_full_pri__incr_stby__restore_on_stby.sh - GOOD
test_full_stby__incr_stby__restore_on_pri.sh - GOOD*
test_full_stby__incr_stby__restore_on_stby.sh - GOOD*
test_incr_on_standby_after_promote.sh - GOOD*
test_incr_after_timelineincrease.sh (pg_ctl stop, pg_resetwal -l
0002000E ..., pg_ctl start, pg_basebackup
--incremental) - GOOD, I've got:
pg_basebackup: error: could not initiate base backup: ERROR:
timeline 1 found in manifest, but not in this server's history
Comment: I was wondering if it wouldn't make some sense to teach
pg_resetwal to actually delete all WAL summaries after any
WAL/controlfile alteration?

test_stuck_walsummary.sh (pkill -STOP walsumm) - GOOD:

> This version also improves (at least, IMHO) the way that we wait for
> WAL summarization to finish. Previously, you either caught up fully
> within 60 seconds or you died. I didn't like that, because it seemed
> like some people would get timeouts when the operation was slowly
> progressing and would eventually succeed. So what this version does
> is:

WARNING:  still waiting for WAL summarization through 0/AD8
after 10 seconds
DETAIL:  Summarization has reached 0/828 on disk and 0/8F8
in memory.
[..]
pg_basebackup: error: could not initiate base backup: ERROR:  WAL
summarization is not progressing
DETAIL:  Summarization is needed through 0/AD8, but is stuck
at 0/828 on disk and 0/8F8 in memory.
Comment2: looks good to me!

test_pending_2pc.sh - getting GOOD on most recent runs, but several
times during early testing (probably due to my own mishaps), I've been
hit by Abort/TRAP. I'm still investigating and trying to reproduce
those ones. TRAP: failed Assert("summary_end_lsn >=
WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940

Regards,
-J.

[1] -  
https://www.postgresql.org/message-id/CA%2BTgmoYuC27_ToGtTTNyHgpn_eJmdqrmhJ93bAbinkBtXsWHaA%40mail.gmail.com




Re: trying again to get incremental backup

2023-12-04 Thread Robert Haas
On Thu, Nov 30, 2023 at 9:33 AM Robert Haas  wrote:
> 0005: Incremental backup itself. Changes:
> - Remove UPLOAD_MANIFEST replication command and instead add
> INCREMENTAL_WAL_RANGE replication command.

Unfortunately, I think this change is going to need to be reverted.
Jakub reported out a problem to me off-list, which I think boils down
to this: take a full backup on the primary. create a database on the
primary. now take an incremental backup on the standby using the full
backup from the master as the prior backup. What happens at this point
depends on how far replay has progressed on the standby. I think there
are three scenarios: (1) If replay has not yet reached a checkpoint
later than the one at which the full backup began, then taking the
incremental backup will fail. This is correct, because it makes no
sense to take an incremental backup that goes backwards in time, and
it's pointless to take one that goes forwards but not far enough to
reach the next checkpoint, as you won't save anything. (2) If replay
has progressed far enough that the redo pointer is now beyond the
CREATE DATABASE record, then everything is fine. (3) But if the redo
pointer for the backup is a later checkpoint than the one from which
the full backup started, but also before the CREATE DATABASE record,
then the new database's files exist on disk, but are not mentioned in
the WAL summary, which covers all LSNs from the start of the prior
backup to the start of this one. Here, the start of the backup is
basically the LSN from which replay will start, and since the database
was created after that, those changes aren't in the WAL summary. This
means that we think the file is unchanged since the prior backup, and
so back up no blocks at all. But now we have an incremental file for a
relation for which no full file is present in the prior backup, and
we're in big trouble.

If my analysis is correct, this bug should be new in v12. In v11 and
prior, I think that we always included every file that didn't appear
in the prior manifest in full. I didn't really quite know why I was
doing that, which is why I was willing to rip it out and thus remove
the need for the manifest, but now I think it was actually preventing
exactly this problem. This issue, in general, is files that get
created after the start of the backup. By that time, the WAL summary
that drives the backup has already been built, so it doesn't know
anything about the new files. That would be fine if we either (A)
omitted those new files from the backup completely, since replay would
recreate them or (B) backed them up in full, so that there was nothing
relying on them being there in the earlier backup. But an incremental
backup of such a file is no good.

Then I started worrying about whether there were problems in cases
where a file was dropped and recreated with the same name. I *think*
it's OK. If file F is dropped and recreated after being copied into
the full backup but before being copied into the incremental backup,
then there are basically two cases. First, F might be dropped before
the start LSN of the incremental backup; if so, we'll know from the
WAL summary that the limit block is 0 and back up the whole thing.
Second, F might be dropped after the start LSN of the incremental
backup and before it's actually coped. In that case, we'll not know
when backing up the file that it was dropped and recreated, so we'll
back it up incrementally as if that hadn't happened. That's OK as long
as reconstruction doesn't fail, because WAL replay will again drop and
recreate F. And I think reconstruction won't fail: blocks that are in
the incremental file will be taken from there, blocks in the prior
backup file will be taken from there, and blocks in neither place will
be zero-filled. The result is logically incoherent, but replay will
nuke the file anyway, so whatever.

It bugs me a bit that we don't obey the WAL-before-data rule with file
creation, e.g. RelationCreateStorage does smgrcreate() and then
log_smgrcreate(). So in theory we could see a file on disk for which
nothing has been logged yet; it could even happen that the file gets
created before the start LSN of the backup and the log record gets
written afterward. It seems like it would be far more comfortable to
swap the order there, so that if it's on disk, it's definitely in the
WAL. But I haven't yet been able to think of a scenario in which the
current ordering causes a real problem. If we back up a stray file in
full (or, hypothetically, if we skipped it entirely) then nothing will
happen that can't already happen today with full backup; any problems
we end up having are, I think, not new problems. It's only when we
back up a file incrementally that we need to be careful, and the
analysis is basically the same as before ... whatever we put into an
incremental file will cause *something* to get reconstructed except
when there's no prior file at all. Having the manifest for the prior
backup lets us avoid the 

Re: trying again to get incremental backup

2023-11-29 Thread Robert Haas
On Wed, Nov 15, 2023 at 9:14 AM Jakub Wartak
 wrote:
> so I've spent some time playing still with patchset v8 (without the
> 6/6 testing patch related to wal_level=minimal), with the exception of
> - patchset v9 - marked otherwise.

Thanks, as usual, for that.

> 2. Usability thing: I hit the timeout hard: "This backup requires WAL
> to be summarized up to 0/9D8, but summarizer has only reached
> 0/0." with summarize_wal=off (default) but apparently this is in TODO.
> Looks like an important usability thing.

All right. I'd sort of forgotten about the need to address that issue,
but apparently, I need to re-remember.
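
For anyone hitting the same thing: the prerequisite is that WAL
summarization must be enabled before the incremental backup is
requested. A minimal sketch, assuming the GUC can be set with ALTER
SYSTEM and picked up on a reload:

$ psql -c "ALTER SYSTEM SET summarize_wal = on"
$ psql -c "SELECT pg_reload_conf()"

With summarize_wal left at its default of off, pg_basebackup
--incremental fails with the "requires WAL to be summarized" error
quoted above.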

> 5. On v8 i've finally played a little bit with standby(s) and this
> patchset with couple of basic scenarios while mixing source of the
> backups:
>
> a. full on standby, incr1 on standby, full db restore (incl. incr1) on standby
> # sometimes i'm getting spurious error like those when doing
> incrementals on standby with -c fast :
> 2023-11-15 13:49:05.721 CET [10573] LOG:  recovery restart point
> at 0/A28
> 2023-11-15 13:49:07.591 CET [10597] WARNING:  aborting backup due
> to backend exiting before pg_backup_stop was called
> 2023-11-15 13:49:07.591 CET [10597] ERROR:  manifest requires WAL
> from final timeline 1 ending at 0/AF8, but this backup starts at
> 0/A28
> 2023-11-15 13:49:07.591 CET [10597] STATEMENT:  BASE_BACKUP (
> INCREMENTAL,  LABEL 'pg_basebackup base backup',  PROGRESS,
> CHECKPOINT 'fast',  WAIT 0,  MANIFEST 'yes',  TARGET 'client')
> # when you retry the same pg_basebackup it goes fine (looks like
> CHECKPOINT on standby/restartpoint <-> summarizer disconnect, I'll dig
> deeper tomorrow. It seems that issuing "CHECKPOINT; pg_sleep(1);"
> against primary just before pg_basebackup --incr on standby
> workarounds it)
>
> b. full on primary, incr1 on standby, full db restore (incl. incr1) on
> standby # WORKS
> c. full on standby, incr1 on standby, full db restore (incl. incr1) on
> primary # WORKS*
> d. full on primary, incr1 on standby, full db restore (incl. incr1) on
> primary # WORKS*
>
> * - needs pg_promote() due to the controlfile having standby bit +
> potential fiddling with postgresql.auto.conf as it is having
> primary_connstring GUC.

Well, "manifest requires WAL from final timeline 1 ending at
0/AF8, but this backup starts at 0/A28" is a valid complaint,
not a spurious error. It's essentially saying that WAL replay for this
incremental backup would have to begin at a location that is earlier
than where replay for the earlier backup would have to end while
recovering that backup. It's almost like you're trying to go backwards
in time, with the incremental happening before the full backup instead
of after it. I think the reason this is happening is that when you
take a backup, recovery has to start from the previous checkpoint. On
the primary, we perform a new checkpoint and plan to start recovery
from it. But on a standby, we can't perform a new checkpoint, since we
can't write WAL, so we arrange for recovery of the backup to begin
from the most recent checkpoint. And if you do two backups on the
standby in a row without much happening in the middle, then the most
recent checkpoint will be the same for both. And that I think is
what's resulting in this error, because the end of the backup follows
the start of the backup, so if two consecutive backups have the same
start, then the start of the second one will precede the end of the
first one.

One thing that's interesting to note here is that there is no point in
performing an incremental backup under these circumstances. You would
accrue no advantage over just letting replay continue further from the
full backup. The whole point of an incremental backup is that it lets
you "fast forward" your older backup -- you could have just replayed
all the WAL from the older backup until you got to the start LSN of
the newer backup, but reconstructing a backup that can start replay
from the newer LSN directly is, hopefully, quicker than replaying all
of that WAL. But in this scenario, you're starting from the same
checkpoint no matter what -- the amount of WAL replay required to
reach any given LSN will be unchanged. So storing an incremental
backup would be strictly a loss.
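
To make the "fast forward" idea concrete, the basic cycle looks roughly
like this (paths are illustrative; the commands follow the usage shown
elsewhere in this thread, and summarize_wal must already be on):

$ pg_basebackup -c fast -D /backups/full
$ # ... time passes, WAL gets summarized ...
$ pg_basebackup -c fast -D /backups/incr.1 \
    --incremental /backups/full/backup_manifest
$ pg_combinebackup -o /restore/data /backups/full /backups/incr.1

The pg_combinebackup step reconstructs a data directory whose replay
starts from the incremental backup's start LSN, rather than one that has
to replay all of the WAL generated since the full backup.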

Another interesting point to consider is that you could also get this
complaint by doing something like taking the full backup from the
primary and then trying to take an incremental backup from a standby,
maybe even a time-delayed standby that's far behind the primary. In
that case, you would really be trying to take an incremental backup
before you actually took the full backup, as far as LSN time goes.

I'm not quite sure what to do about any of this. I think the error is
correct and accurate, but understanding what it means and why it's
happening and what to do about it is probably going to be difficult
for people. Possibly we should have documentation that talks you
through all of this. Or possibly 

Re: trying again to get incremental backup

2023-11-27 Thread Robert Haas
On Thu, Nov 23, 2023 at 11:18 PM Thomas Munro  wrote:
> Robert pinged me to see if I had any ideas.

Thanks, Thomas.

> The reason it fails on Windows is because there is a special code path
> for WIN32 in the patch's src/bin/pg_combinebackup/copy_file.c, but it
> is incomplete: it returns early without feeding the data into the
> checksum, so all the checksums have the same initial and bogus value.
> I commented that part out so it took the normal path like Unix, and it
> passed.

Yikes, that's embarrassing. Thanks for running it down. There is logic
in the caller to figure out whether we need to recompute the checksum
or can reuse one we already have, but copy_file() doesn't understand
that it should take the slow path if a new checksum computation is
required.
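
For illustration, a minimal sketch of the shape such a fix could take,
assuming copy_file() receives the checksum context from the caller (the
helper names copy_file_copyfile and copy_file_blocks are illustrative, not
confirmed APIs):

    static void
    copy_file(const char *src, const char *dst, pg_checksum_context *checksum_ctx)
    {
    #ifdef WIN32
        /* The OS-level fast copy never feeds the copied bytes into
         * checksum_ctx, so it is only safe when no checksum is wanted. */
        if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
        {
            copy_file_copyfile(src, dst);   /* assumed CopyFile() wrapper */
            return;
        }
    #endif
        /* Slow path: read and rewrite blocks, updating the checksum as we go. */
        copy_file_blocks(src, dst, checksum_ctx);   /* assumed helper */
    }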

> The reason it fails on Linux 32 bit with -fsanitize is because this
> has uncovered a bug in xlogreader.c, which overflows a 32 bit pointer
> when doing a size test that could easily be changed to non-overflowing
> formulation.  AFAICS it is not a live bug because it comes to the
> right conclusion without dereferencing the pointer due to other checks,
> but the sanitizer is not wrong to complain about it and I will post a
> patch to fix that in a new thread.  With the draft patch I am testing,
> the sanitizer is happy and this passes too.

Thanks so much.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-11-23 Thread Thomas Munro
On Wed, Nov 22, 2023 at 3:14 AM Jakub Wartak
 wrote:
> CFbot failed on two hosts this time, I haven't looked at the details
> yet (https://cirrus-ci.com/task/6425149646307328 -> end of EOL? ->
> LOG:  WAL summarizer process (PID 71511) was terminated by signal 6:
> Aborted?)

Robert pinged me to see if I had any ideas.

The reason it fails on Windows is because there is a special code path
for WIN32 in the patch's src/bin/pg_combinebackup/copy_file.c, but it
is incomplete: it returns early without feeding the data into the
checksum, so all the checksums have the same initial and bogus value.
I commented that part out so it took the normal path like Unix, and it
passed.

The reason it fails on Linux 32 bit with -fsanitize is because this
has uncovered a bug in xlogreader.c, which overflows a 32 bit pointer
when doing a size test that could easily be changed to non-overflowing
formulation.  AFAICS it is not a live bug because it comes to the
right conclusion without dereferencing the pointer due to other checks,
but the sanitizer is not wrong to complain about it and I will post a
patch to fix that in a new thread.  With the draft patch I am testing,
the sanitizer is happy and this passes too.
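
For reference, the usual shape of that kind of fix is to compare lengths
instead of pointer sums, e.g. (illustrative, not the exact xlogreader.c
code):

    /* Overflow-prone on 32 bit: "ptr + len" itself can wrap around. */
    if (ptr + len > end_of_buffer)
        return false;

    /* Equivalent but overflow-safe: both sides stay within the buffer. */
    if (len > (size_t) (end_of_buffer - ptr))
        return false;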




Re: trying again to get incremental backup

2023-11-21 Thread Jakub Wartak
On Mon, Nov 20, 2023 at 4:43 PM Robert Haas  wrote:
>
> On Fri, Nov 17, 2023 at 5:01 AM Alvaro Herrera  
> wrote:
> > I made a pass over pg_combinebackup for NLS.  I propose the attached
> > patch.
>
> This doesn't quite compile for me so I changed a few things and
> incorporated it. Hopefully I didn't mess anything up.
>
> Here's v11.
[..]

> I wish I had better ideas about how to thoroughly test this. [..]

Hopefully the below add some confidence, I've done some further
quick(?) checks today and results are good:

make check-world #GOOD
test_full_pri__incr_stby__restore_on_pri.sh #GOOD
test_full_pri__incr_stby__restore_on_stby.sh #GOOD*
test_full_stby__incr_stby__restore_on_pri.sh #GOOD
test_full_stby__incr_stby__restore_on_stby.sh #GOOD*
test_many_incrementals_dbcreate.sh #GOOD
test_many_incrementals.sh #GOOD
test_multixact.sh #GOOD
test_pending_2pc.sh #GOOD
test_reindex_and_vacuum_full.sh #GOOD
test_truncaterollback.sh #GOOD
test_unlogged_table.sh #GOOD
test_across_wallevelminimal.sh # GOOD(expected failure, that
walsummaries are off during walminimal and incremental cannot be
taken--> full needs to be taken after wal_level=minimal)

CFbot failed on two hosts this time, I haven't looked at the details
yet (https://cirrus-ci.com/task/6425149646307328 -> end of EOL? ->
LOG:  WAL summarizer process (PID 71511) was terminated by signal 6:
Aborted?)

The remaining test idea is to have a longer running DB under stress
test in more real-world conditions and try to recover using chained
incremental backups (one such test was carried out on patchset v6 and
the result was good back then).

-J.




Re: trying again to get incremental backup

2023-11-20 Thread Robert Haas
On Mon, Nov 20, 2023 at 2:10 PM Robert Haas  wrote:
> > Is this meant to support multiple timelines each with non-overlapping
> > adjacent ranges, rather than multiple non-adjacent ranges?
>
> Correct. I don't see how non-adjacent LSN ranges could ever be a
> useful thing, but adjacent ranges on different timelines are useful.

Thinking about this a bit more, there are a couple of things we could
do here in terms of syntax. One idea is to give up on having a
separate MANIFEST-WAL-RANGE command for each range and instead just
cram everything into either a single command:

MANIFEST-WAL-RANGES {tli} {startlsn} {endlsn}...

Or even into a single option to the BASE_BACKUP command:

BASE_BACKUP yadda yadda INCREMENTAL 'tli@startlsn-endlsn,...'

Or, since we expect adjacent, non-overlapping ranges, you could even
arrange to elide the duplicated boundary LSNs, e.g.

MANIFEST-WAL-RANGES {{tli} {lsn}}... {final-lsn}

Or

BASE_BACKUP yadda yadda INCREMENTAL 'tli@lsn,...,final-lsn'

I'm not sure what's best here. Trying to trim out the duplicated
boundary LSNs feels a bit like rearrangement for the sake of
rearrangement, but maybe it isn't really.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-11-20 Thread Robert Haas
On Mon, Nov 20, 2023 at 2:03 PM Alvaro Herrera  wrote:
> That sounds good to me.  Not having to parse the manifest server-side
> sounds like a win, as does saving the transfer, for the cases where the
> manifest is large.

OK. I'll look into this next week, hopefully.

> Is this meant to support multiple timelines each with non-overlapping
> adjacent ranges, rather than multiple non-adjacent ranges?

Correct. I don't see how non-adjacent LSN ranges could ever be a
useful thing, but adjacent ranges on different timelines are useful.

> Do I have it right that you want to rewrite this bit before considering
> this ready to commit?

For sure. I don't think this is the only thing that needs to be
revised before commit, but it's definitely *a* thing that needs to be
revised before commit.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-11-20 Thread Alvaro Herrera
On 2023-Nov-16, Robert Haas wrote:

> On Thu, Nov 16, 2023 at 12:23 PM Alvaro Herrera  
> wrote:

> > I don't understand this point.  Currently, the protocol is that
> > UPLOAD_MANIFEST is used to send the manifest prior to requesting the
> > backup.  You seem to be saying that you're thinking of removing support
> > for UPLOAD_MANIFEST and instead just give the LSN as an option to the
> > BASE_BACKUP command?
> 
> I don't think I'd want to do exactly that, because then you could only
> send one LSN, and I do think we want to send a set of LSN ranges with
> the corresponding TLI for each. I was thinking about dumping
> UPLOAD_MANIFEST and instead having a command like:
> 
> INCREMENTAL_WAL_RANGE 1 2/462AC48 2/462C698
> 
> The client would execute this command one or more times before
> starting an incremental backup.

That sounds good to me.  Not having to parse the manifest server-side
sounds like a win, as does saving the transfer, for the cases where the
manifest is large.

Is this meant to support multiple timelines each with non-overlapping
adjacent ranges, rather than multiple non-adjacent ranges?

Do I have it right that you want to rewrite this bit before considering
this ready to commit?

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No nos atrevemos a muchas cosas porque son difíciles,
pero son difíciles porque no nos atrevemos a hacerlas" (Séneca)




Re: trying again to get incremental backup

2023-11-17 Thread Alvaro Herrera
I made a pass over pg_combinebackup for NLS.  I propose the attached
patch.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Right now the sectors on the hard disk run clockwise, but I heard a rumor that
you can squeeze 0.2% more throughput by running them counterclockwise.
It's worth the effort. Recommended."  (Gerry Pourwelle)
From a385542ff03514885fa4e84b0485e51cdcdd04bd Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 17 Nov 2023 10:51:36 +0100
Subject: [PATCH] do NLS for pg_combinebackup

---
 src/bin/pg_combinebackup/backup_label.c | 34 +++--
 src/bin/pg_combinebackup/nls.mk | 11 +++
 src/bin/pg_combinebackup/pg_combinebackup.c | 24 ++-
 src/bin/pg_combinebackup/reconstruct.c  |  9 --
 4 files changed, 52 insertions(+), 26 deletions(-)
 create mode 100644 src/bin/pg_combinebackup/nls.mk

diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
index 2a62aa6fad..922e00854d 100644
--- a/src/bin/pg_combinebackup/backup_label.c
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -63,18 +63,18 @@ parse_backup_label(char *filename, StringInfo buf,
 		if (line_starts_with(s, e, "START WAL LOCATION: ", ))
 		{
 			if (!parse_lsn(s, e, start_lsn, ))
-pg_fatal("%s: could not parse START WAL LOCATION",
-		 filename);
+pg_fatal("%s: could not parse %s",
+		 filename, "START WAL LOCATION");
 			if (c >= e || *c != ' ')
-pg_fatal("%s: improper terminator for START WAL LOCATION",
-		 filename);
+pg_fatal("%s: improper terminator for %s",
+		 filename, "START WAL LOCATION");
 			found |= 1;
 		}
 		else if (line_starts_with(s, e, "START TIMELINE: ", ))
 		{
 			if (!parse_tli(s, e, start_tli))
-pg_fatal("%s: could not parse TLI for START TIMELINE",
-		 filename);
+pg_fatal("%s: could not parse TLI for %s",
+		 filename, "START TIMELINE");
 			if (*start_tli == 0)
 pg_fatal("%s: invalid TLI", filename);
 			found |= 2;
@@ -82,18 +82,18 @@ parse_backup_label(char *filename, StringInfo buf,
 		else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", ))
 		{
 			if (!parse_lsn(s, e, previous_lsn, ))
-pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
-		 filename);
+pg_fatal("%s: could not parse %s",
+		 filename, "INCREMENTAL FROM LSN");
 			if (c >= e || *c != '\n')
-pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
-		 filename);
+pg_fatal("%s: improper terminator for %s",
+		 filename, "INCREMENTAL FROM LSN");
 			found |= 4;
 		}
 		else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", ))
 		{
 			if (!parse_tli(s, e, previous_tli))
-pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
-		 filename);
+pg_fatal("%s: could not parse %s",
+		 filename, "INCREMENTAL FROM TLI");
 			if (*previous_tli == 0)
 pg_fatal("%s: invalid TLI", filename);
 			found |= 8;
@@ -103,13 +103,15 @@ parse_backup_label(char *filename, StringInfo buf,
 	}
 
 	if ((found & 1) == 0)
-		pg_fatal("%s: could not find START WAL LOCATION", filename);
+		pg_fatal("%s: could not find %s", filename, "START WAL LOCATION");
 	if ((found & 2) == 0)
-		pg_fatal("%s: could not find START TIMELINE", filename);
+		pg_fatal("%s: could not find %s", filename, "START TIMELINE");
 	if ((found & 4) != 0 && (found & 8) == 0)
-		pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+		pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM LSN", "INCREMENTAL FROM TLI");
 	if ((found & 8) != 0 && (found & 4) == 0)
-		pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+		pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM TLI", "INCREMENTAL FROM LSN");
 }
 
 /*
diff --git a/src/bin/pg_combinebackup/nls.mk b/src/bin/pg_combinebackup/nls.mk
new file mode 100644
index 00..c8e59d1d00
--- /dev/null
+++ b/src/bin/pg_combinebackup/nls.mk
@@ -0,0 +1,11 @@
+# src/bin/pg_combinebackup/nls.mk
+CATALOG_NAME = pg_combinebackup
+GETTEXT_FILES= $(FRONTEND_COMMON_GETTEXT_FILES) \
+		   backup_label.c \
+		   copy_file.c \
+		   load_manifest.c \
+		   pg_combinebackup.c \
+		   reconstruct.c \
+		   write_manifest.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS= $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 7bf56e57ae..618b5dd7f6 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -521,29 +521,33 @@ check_control_files(int n_backups, char **backup_dirs)
 	{
 		ControlFileData *control_file;
 		bool		crc_ok;
+		char	   *controlpath;
 
-		pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+		controlpath = psprintf("%s/%s", backup_dirs[i], "global/pg_control");
+
+		pg_log_debug("reading \"%s\"", controlpath);
 		control_file = 

Re: trying again to get incremental backup

2023-11-16 Thread Robert Haas
On Thu, Nov 16, 2023 at 12:34 PM Alvaro Herrera  wrote:
> Putting those two thoughts together, I think pg_basebackup with
> --progress could tell you "still waiting for the summary file up to LSN
> %X/%X to appear, and the walsummarizer is currently handling lsn %X/%X"
> or something like that.  This would probably require two concurrent
> connections, one to run BASE_BACKUP and another to inquire server state;
> but this should be easy enough to integrate together with parallel
> basebackup later.

I had similar thoughts, except I was thinking it would be better to
have the warnings be generated on the server side. That would save the
need for a second libpq connection, which would be good, because I
think adding that would result in a pretty large increase in
complexity and some not-so-great user-visible consequences. In fact,
my latest thought is to just remove the timeout altogether, and emit
warnings like this:

WARNING:  still waiting for WAL summarization to reach %X/%X after %d
seconds, currently at %X/%X

We could emit that every 30 seconds or so until either the situation
resolves itself or the user hits ^C. I think that would be good enough
here. If we want, the interval between messages can be a GUC, but I
don't know how much real need there will be to tailor that.
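
A rough sketch of what such a loop could look like (illustrative only;
GetLatestSummarizedLSN() and WAIT_EVENT_WAL_SUMMARY_WAIT are assumed names,
not existing APIs):

    XLogRecPtr  summarized_lsn;
    int         waited_seconds = 0;

    for (;;)
    {
        summarized_lsn = GetLatestSummarizedLSN();  /* assumed accessor */
        if (summarized_lsn >= backup_start_lsn)
            break;

        if (waited_seconds > 0 && waited_seconds % 30 == 0)
            ereport(WARNING,
                    (errmsg("still waiting for WAL summarization to reach %X/%X "
                            "after %d seconds, currently at %X/%X",
                            LSN_FORMAT_ARGS(backup_start_lsn), waited_seconds,
                            LSN_FORMAT_ARGS(summarized_lsn))));

        /* Sleep about a second at a time so interrupts are handled promptly.
         * (WAIT_EVENT_WAL_SUMMARY_WAIT is an assumed wait-event name.) */
        (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         1000L, WAIT_EVENT_WAL_SUMMARY_WAIT);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
        waited_seconds++;
    }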

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-11-16 Thread Alvaro Herrera
On 2023-Nov-16, Alvaro Herrera wrote:

> On 2023-Oct-04, Robert Haas wrote:

> > - Right now, I have a hard-coded 60 second timeout for WAL
> > summarization. If you try to take an incremental backup and the WAL
> > summaries you need don't show up within 60 seconds, the backup times
> > out. I think that's a reasonable default, but should it be
> > configurable? If yes, should that be a GUC or, perhaps better, a
> > pg_basebackup option?
> 
> I'd rather have a way for the server to provide diagnostics on why the
> summaries aren't being produced.  Maybe a server running under valgrind
> is going to fail and need a longer one, but otherwise a hardcoded
> timeout seems sufficient.
> 
> You did say later that you thought summary files would just go from one
> checkpoint to the next.  So the only question is at what point the file
> for the last checkpoint (i.e. from the previous one up to the one
> requested by pg_basebackup) is written.  If walsummarizer keeps almost
> the complete state in memory and just waits for the checkpoint record to
> write it, then it's probably okay.

On 2023-Nov-16, Alvaro Herrera wrote:

> On 2023-Nov-16, Robert Haas wrote:
> 
> > On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera  
> > wrote:

> > > It's not clear to me if WalSummarizerCtl->pending_lsn if fulfilling some
> > > purpose or it's just a leftover from prior development.  I see it's only
> > > read in an assertion ... Maybe if we think this cross-check is
> > > important, it should be turned into an elog?  Otherwise, I'd remove it.
> > 
> > I've been thinking about that. One thing I'm not quite sure about
> > though is introspection. Maybe there should be a function that shows
> > summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
> > should expose pending_lsn too.
> 
> True.

Putting those two thoughts together, I think pg_basebackup with
--progress could tell you "still waiting for the summary file up to LSN
%X/%X to appear, and the walsummarizer is currently handling lsn %X/%X"
or something like that.  This would probably require two concurrent
connections, one to run BASE_BACKUP and another to inquire server state;
but this should be easy enough to integrate together with parallel
basebackup later.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: trying again to get incremental backup

2023-11-16 Thread Alvaro Herrera
On 2023-Nov-16, Robert Haas wrote:

> On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera  
> wrote:
> > I meant code like this
> >
> > memcpy(&key, rlocator, sizeof(RelFileLocator));
> > key.forknum = forknum;
> > entry = blockreftable_lookup(brtab->hash, key);
> 
> Ah, I hadn't thought about that. Another way of handling that might be
> to add = {0} to the declaration of key. But I can do the initializer
> thing too if you think it's better. I'm not sure if there's an
> argument that the initializer might optimize better.

I think the {0} initializer is good enough, given a comment to indicate
why.
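
That is, something along these lines (a sketch; it assumes the key's members
are named rlocator and forknum):

    BlockRefTableKey key = {0}; /* zeroes any padding bytes too, which matters
                                 * because SH_EQUAL and hashing memcmp() the
                                 * whole struct */

    key.rlocator = *rlocator;   /* member names assumed */
    key.forknum = forknum;
    entry = blockreftable_lookup(brtab->hash, key);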

> > It's not clear to me if WalSummarizerCtl->pending_lsn if fulfilling some
> > purpose or it's just a leftover from prior development.  I see it's only
> > read in an assertion ... Maybe if we think this cross-check is
> > important, it should be turned into an elog?  Otherwise, I'd remove it.
> 
> I've been thinking about that. One thing I'm not quite sure about
> though is introspection. Maybe there should be a function that shows
> summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
> should expose pending_lsn too.

True.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: trying again to get incremental backup

2023-11-16 Thread Alvaro Herrera
On 2023-Oct-04, Robert Haas wrote:

> - I would like some feedback on the generation of WAL summary files.
> Right now, I have it enabled by default, and summaries are kept for a
> week. That means that, with no additional setup, you can take an
> incremental backup as long as the reference backup was taken in the
> last week. File removal is governed by mtimes, so if you change the
> mtimes of your summary files or whack your system clock around, weird
> things might happen. But obviously this might be inconvenient. Some
> people might not want WAL summary files to be generated at all because
> they don't care about incremental backup, and other people might want
> them retained for longer, and still other people might want them to be
> not removed automatically or removed automatically based on some
> criteria other than mtime. I don't really know what's best here. I
> don't think the default policy that the patches implement is
> especially terrible, but it's just something that I made up and I
> don't have any real confidence that it's wonderful. One point to be
> consider here is that, if WAL summarization is enabled, checkpoints
> can't remove WAL that isn't summarized yet. Mostly that's not a
> problem, I think, because the WAL summarizer is pretty fast. But it
> could increase disk consumption for some people. I don't think that we
> need to worry about the summaries themselves being a problem in terms
> of space consumption; at least in all the cases I've tested, they're
> just not very big.

So, wal_summary is no longer turned on by default, I think following a
comment from Peter E.  I think this is a good decision, as we're only
going to need them on servers from which incremental backups are going
to be taken, which is a strict subset of all servers; and furthermore,
people that need them are going to realize that very easily, while if we
went the other way around most people would not realize that they need to
turn them off to save some resource consumption.

Granted, the amount of resources additionally used is probably not very
big.  But since it can be changed with a reload not restart, it doesn't
seem problematic.

... oh, I just noticed that this patch now fails to compile because of
the MemoryContextResetAndDeleteChildren removal.

(Typo in the pg_walsummary manpage: "since WAL summary files primary
exist" -> "primarily")

> - On a related note, I haven't yet tested this on a standby, which is
> a thing that I definitely need to do. I don't know of a reason why it
> shouldn't be possible for all of this machinery to work on a standby
> just as it does on a primary, but then we need the WAL summarizer to
> run there too, which could end up being a waste if nobody ever tries
> to take an incremental backup. I wonder how that should be reflected
> in the configuration. We could do something like what we've done for
> archive_mode, where on means "only on if this is a primary" and you
> have to say always if you want it to run on standbys as well ... but
> I'm not sure if that's a design pattern that we really want to
> replicate into more places. I'd be somewhat inclined to just make
> whatever configuration parameters we need to configure this thing on
> the primary also work on standbys, and you can set each server up as
> you please. But I'm open to other suggestions.

I think it should default to off in primary and standby, and the user
has to enable it in whichever server they want to take backups from.

> - We need to settle the question of whether to send the whole backup
> manifest to the server or just the LSN. In a previous attempt at
> incremental backup, we decided the whole manifest was necessary,
> because flat-copying files could make new data show up with old LSNs.
> But that version of the patch set was trying to find modified blocks
> by checking their LSNs individually, not by summarizing WAL. And since
> the operations that flat-copy files are WAL-logged, the WAL summary
> approach seems to eliminate that problem - maybe an LSN (and the
> associated TLI) is good enough now. This also relates to Jakub's
> question about whether this machinery could be used to fast-forward a
> standby, which is not exactly a base backup but  ... perhaps close
> enough? I'm somewhat inclined to believe that we can simplify to an
> LSN and TLI; however, if we do that, then we'll have big problems if
> later we realize that we want the manifest for something after all. So
> if anybody thinks that there's a reason to keep doing what the patch
> does today -- namely, upload the whole manifest to the server --
> please speak up.

I don't understand this point.  Currently, the protocol is that
UPLOAD_MANIFEST is used to send the manifest prior to requesting the
backup.  You seem to be saying that you're thinking of removing support
for UPLOAD_MANIFEST and instead just give the LSN as an option to the
BASE_BACKUP command?

> - It's regrettable that we don't have incremental JSON parsing;

We now do 

Re: trying again to get incremental backup

2023-11-16 Thread Robert Haas
On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera  wrote:
> I meant code like this
>
> memcpy(&key, rlocator, sizeof(RelFileLocator));
> key.forknum = forknum;
> entry = blockreftable_lookup(brtab->hash, key);

Ah, I hadn't thought about that. Another way of handling that might be
to add = {0} to the declaration of key. But I can do the initializer
thing too if you think it's better. I'm not sure if there's an
argument that the initializer might optimize better.

> An output pointer, you mean :-)  (Should it be const?)

I'm bad at const, but that seems to work, so sure.

> When the return value is BACK_UP_FILE_FULLY, it's not clear what happens
> to these output values; we modify some, but why?  Maybe they should be
> left alone?  In that case, the "if size == 0" test should move a couple
> of lines up, in the brtentry == NULL block.

OK.

> BTW, you could do the qsort() after deciding to backup the file fully if
> more than 90% needs to be replaced.

OK.

> BTW, in sendDir() why do
>  lookup_path = pstrdup(pathbuf + basepathlen + 1);
> when you could do
>  lookup_path = pstrdup(tarfilename);
> ?

No reason, changed.

> > If we want to inject more underscores here, my vote is to go all the
> > way and make it per_wal_range_cb.
>
> +1

Will look into this.

> Yeah, I just think that endless stream of hex chars are hard to read,
> and I've found myself following digits in the screen with my fingers in
> order to parse file names.  I guess you could say thousands separators
> for regular numbers aren't needed either, but we do have them for
> readability sake.

Sigh.

> I think a new section in chapter 30 "Reliability and the Write-Ahead
> Log" is warranted.  It would explain the summarization process, what the
> summary files are used for, and the deletion mechanism.  I can draft
> something if you want.

Sure, if you want to take a crack at it, that's great.

> It's not clear to me if WalSummarizerCtl->pending_lsn if fulfilling some
> purpose or it's just a leftover from prior development.  I see it's only
> read in an assertion ... Maybe if we think this cross-check is
> important, it should be turned into an elog?  Otherwise, I'd remove it.

I've been thinking about that. One thing I'm not quite sure about
though is introspection. Maybe there should be a function that shows
summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
should expose pending_lsn too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-11-16 Thread Robert Haas
On Tue, Nov 14, 2023 at 8:12 AM Alvaro Herrera  wrote:
> 0001 looks OK to push, and since it stands on its own I would get it out
> of the way soon rather than waiting for the rest of the series to be
> further reviewed.

All right, done.

> 0003:
> AmWalSummarizerProcess() is unused.  Remove?

The intent seems to be to have one of these per enum value, whether it
gets used or not. Some of the others aren't used, either.

> MaybeWalSummarizer() is called on each ServerLoop() in postmaster.c?
> This causes a function call to be emitted every time through.  That
> looks odd.  All other process starts have some triggering condition.

I'm not sure how much this matters, really. I would expect that the
function call overhead here wouldn't be very noticeable. Generally I
think that when ServerLoop returns from WaitEventSetWait it's going to
be because we need to fork a process. That's pretty expensive compared
to a function call. If we can iterate through this loop lots of times
without doing any real work then it might matter, but I feel like
that's probably not the case, and probably something we would want to
fix if it were the case.

Now, I could nevertheless move some of the triggering conditions in
MaybeStartWalSummarizer(), but moving, say, just the summarize_wal
condition wouldn't be enough to avoid having MaybeStartWalSummarizer()
called repeatedly when there was no work to do, because summarize_wal
could be true and the summarizer could already be running. Similarly, if I
move just the WalSummarizerPID == 0 condition, the function gets
called repeatedly without doing anything when summarize_wal = off. So
at a minimum you have to move both of those if you care about avoiding
the function call overhead, and then you have to wonder if you care
about the corner cases where the function would be called repeatedly
for no gain even then.

Another approach would be to make the function static inline rather
than just static. Or we could delete the whole function and just
duplicate the logic it contains at both call sites. Personally I'm
inclined to just leave it how it is in the absence of some evidence
that there's a real problem here. It's nice to have all the triggering
conditions in one place with nothing duplicated.
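
For reference, the shape under discussion is roughly this (a sketch; the
exact triggering conditions and the StartWalSummarizer() call are
illustrative, not the actual postmaster.c code):

    static void
    MaybeStartWalSummarizer(void)
    {
        /* All triggering conditions live here, so the call itself is cheap. */
        if (summarize_wal && WalSummarizerPID == 0 &&
            pmState == PM_RUN && Shutdown == NoShutdown)
            WalSummarizerPID = StartWalSummarizer();    /* assumed starter */
    }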

> GetOldestUnsummarizedLSN uses while(true), but WaitForWalSummarization
> and SummarizeWAL use while(1). Maybe settle on one style?

OK.

> 0004:
> in PrepareForIncrementalBackup(), the logic that determines
> earliest_wal_range_tli and latest_wal_range_tli looks pretty weird.  I
> think it works fine if there's a single timeline, but not otherwise.
> Or maybe the trick is that it relies on timelines returned by
> readTimeLineHistory being sorted backwards?  If so, maybe add a comment
> about that somewhere; I don't think other callers of readTimeLineHistory
> make that assumption.

It does indeed rely on that assumption, and the comment at the top of
the for (i = 0; i < num_wal_ranges; ++i) loop explains that. Note also
the comment just below that begins "If we found this TLI in the
server's history". I agree with you that this logic looks strange, and
it's possible that there's some better way to encode the idea than
what I've done here, but I think it might be just that the particular
calculation we're trying to do here is strange. It's almost easier to
understand the logic if you start by reading the sanity checks
("manifest requires WAL from initial timeline %u starting at %X/%X,
but that timeline begins at %X/%X" et. al.), look at the triggering
conditions for those, and then work upward to see how
earliest/latest_wal_range_tli get set, and then look up from there to
see how saw_earliest/latest_wal_range_tli are used in computing those
values.

We do rely on the ordering assumption elsewhere. For example, in
XLogFileReadAnyTLI, see if (tli < curFileTLI) break. We also use it to
set expectedTLEs, which is documented to have this property. And
AddWALInfoToBackupManifest relies on it too; see the comment "Because
the timeline history file lists newer timelines before older ones" in
AddWALInfoToBackupManifest. We're not entirely consistent about this,
e.g., unlike XLogFileReadAnyTLI, tliInHistory() and
tliOfPointInHistory() don't have an early exit provision, but we do
use it some places.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-11-16 Thread Alvaro Herrera
On 2023-Nov-13, Robert Haas wrote:

> On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera  
> wrote:

> > Also, it would be good to provide and use a
> > function to initialize a BlockRefTableKey from the RelFileNode and
> > forknum components, and ensure that any padding bytes are zeroed.
> > Otherwise it's not going to be a great hash key.  On my platform there
> > aren't any (padding bytes), but I think it's unwise to rely on that.
> 
> I'm having trouble understanding the second part of this suggestion.
> Note that in a frontend context, SH_RAW_ALLOCATOR is pg_malloc0, and
> in a backend context, we get the default, which is
> MemoryContextAllocZero. Maybe there's some case this doesn't cover,
> though?

I meant code like this

memcpy(&key, rlocator, sizeof(RelFileLocator));
key.forknum = forknum;
entry = blockreftable_lookup(brtab->hash, key);

where any padding bytes in "key" could have arbitrary values, because
they're not initialized.  So I'd have a (maybe static inline) function
  BlockRefTableKeyInit(&key, rlocator, forknum)
that fills it in for you.
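
Something like this, perhaps (a sketch; the member names are assumed):

static inline void
BlockRefTableKeyInit(BlockRefTableKey *key, const RelFileLocator *rlocator,
                     ForkNumber forknum)
{
    /* Zero the whole struct first so any padding bytes are deterministic. */
    memset(key, 0, sizeof(BlockRefTableKey));
    key->rlocator = *rlocator;  /* member names assumed */
    key->forknum = forknum;
}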

Note:
  #define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
AFAICT the new simplehash uses in this patch series are the only ones
that use memcmp() as SH_EQUAL, so we don't necessarily have precedent on
lack of padding bytes initialization in existing uses of simplehash.

> > These forward struct declarations are not buying you anything, I'd
> > remove them:
> 
> I've had problems from time to time when I don't do this. I'll remove
> it here, but I'm not convinced that it's always useless.

Well, certainly there are places where they are necessary.

> > I don't much like the way header files in src/bin/pg_combinebackup files
> > are structured.  Particularly, causing a simplehash to be "instantiated"
> > just because load_manifest.h is included seems poised to cause pain.  I
> > think there should be a file with the basic struct declarations (no
> > simplehash); and then maybe since both pg_basebackup and
> > pg_combinebackup seem to need the same simplehash, create a separate
> > header file containing just that..  But, did you notice that anything
> > that includes reconstruct.h will instantiate the simplehash stuff,
> > because it includes load_manifest.h?  It may be unwise to have the
> > simplehash in a header file.  Maybe just declare it in each .c file that
> > needs it.  The duplication is not that large.
> 
> I think that I did this correctly.

Oh, I hadn't grokked that we had this SH_SCOPE thing and a separate
SH_DECLARE for it being extern.  OK, please ignore that.

> > Why leave unnamed arguments in function declarations?
> 
> I mean, I've changed it now, but I don't think it's worth getting too
> excited about.

Well, we did get into consistency arguments on this point previously.  I
agree it's not *terribly* important, but on thread
https://www.postgresql.org/message-id/flat/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg%40mail.gmail.com
people got really worked up about this stuff.

> > In GetFileBackupMethod(), which arguments are in and which are out?
> > The comment doesn't say, and it's not obvious why we pass both the file
> > path as well as the individual constituent pieces for it.
> 
> The header comment does document which values are potentially set on
> return. I guess I thought it was clear enough that the stuff not
> documented to be output parameters was input parameters. Most of them
> aren't even pointers, so they have to be input parameters. The only
> exception is 'path', which I have some difficulty thinking that anyone
> is going to imagine to be an input pointer.

An output pointer, you mean :-)  (Should it be const?)

When the return value is BACK_UP_FILE_FULLY, it's not clear what happens
to these output values; we modify some, but why?  Maybe they should be
left alone?  In that case, the "if size == 0" test should move a couple
of lines up, in the brtentry == NULL block.

BTW, you could do the qsort() after deciding to backup the file fully if
more than 90% needs to be replaced.

BTW, in sendDir() why do
 lookup_path = pstrdup(pathbuf + basepathlen + 1);
when you could do
 lookup_path = pstrdup(tarfilename);
?

> > There are two functions named record_manifest_details_for_file() in
> > different programs.
> 
> I had trouble figuring out how to name this stuff. I did notice the
> awkwardness, but surely nobody can think that two functions with the
> same name in different binaries can be actually the same function.

Of course not, but when cscope-jumping around, it is weird.

> If we want to inject more underscores here, my vote is to go all the
> way and make it per_wal_range_cb.

+1

> > In walsummarizer.c, HandleWalSummarizerInterrupts is called in
> > summarizer_read_local_xlog_page but SummarizeWAL() doesn't do that.
> > Maybe it should?
> 
> I replaced all the CHECK_FOR_INTERRUPTS() in that file with
> HandleWalSummarizerInterrupts(). Does that seem right?

Looks to be 

Re: trying again to get incremental backup

2023-11-15 Thread Jakub Wartak
Hi Robert,

[..spotted the v9 patchset..]

so I've spent some more time playing with patchset v8 (without the
6/6 testing patch related to wal_level=minimal); anything checked
against patchset v9 instead is marked as such.

1. At compile time there were 2 warnings about shadowed variables (at
least with gcc version 10.2.1), but in v9 that is fixed:

blkreftable.c: In function ‘WriteBlockRefTable’:
blkreftable.c:520:24: warning: declaration of ‘brtentry’ shadows a
previous local [-Wshadow=compatible-local]
walsummarizer.c: In function ‘SummarizeWAL’:
walsummarizer.c:833:36: warning: declaration of ‘private_data’ shadows
a previous local [-Wshadow=compatible-local]

2. Usability thing: I hit the timeout hard: "This backup requires WAL
to be summarized up to 0/9D8, but summarizer has only reached
0/0." with summarize_wal=off (default) but apparently this in TODO.
Looks like an important usability thing.

3. I've verified that if the DB was in wal_level=minimal even
temporarily (and thus summarization was disabled) it is impossible to
take an incremental backup:

pg_basebackup: error: could not initiate base backup: ERROR:  WAL
summaries are required on timeline 1 from 0/7D8 to 0/1028, but
the summaries for that timeline and LSN range are incomplete
DETAIL:  The first unsummarized LSN is this range is 0/D04AE88.

4. As we have discussed off list, there's is (was) this
pg_combinebackup bug in v8's reconstruct_from_incremental_file() where
it was unable to realize that - in case of combining multiple
incremental backups - it should stop looking for the previous instance
of the full file (while it was fine with v6 of the patchset). I've
checked it on v9 - it is good now.

5. On v8 I've finally played a little bit with standby(s) and this
patchset with a couple of basic scenarios while mixing the source of
the backups:

a. full on standby, incr1 on standby, full db restore (incl. incr1) on standby
# sometimes i'm getting spurious error like those when doing
incrementals on standby with -c fast :
2023-11-15 13:49:05.721 CET [10573] LOG:  recovery restart point
at 0/A28
2023-11-15 13:49:07.591 CET [10597] WARNING:  aborting backup due
to backend exiting before pg_backup_stop was called
2023-11-15 13:49:07.591 CET [10597] ERROR:  manifest requires WAL
from final timeline 1 ending at 0/AF8, but this backup starts at
0/A28
2023-11-15 13:49:07.591 CET [10597] STATEMENT:  BASE_BACKUP (
INCREMENTAL,  LABEL 'pg_basebackup base backup',  PROGRESS,
CHECKPOINT 'fast',  WAIT 0,  MANIFEST 'yes',  TARGET 'client')
# when you retry the same pg_basebackup it goes fine (looks like
CHECKPOINT on standby/restartpoint <-> summarizer disconnect, I'll dig
deeper tomorrow. It seems that issuing "CHECKPOINT; pg_sleep(1);"
against primary just before pg_basebackup --incr on standby
workarounds it)

b. full on primary, incr1 on standby, full db restore (incl. incr1) on
standby # WORKS
c. full on standby, incr1 on standby, full db restore (incl. incr1) on
primary # WORKS*
d. full on primary, incr1 on standby, full db restore (incl. incr1) on
primary # WORKS*

* - needs pg_promote() due to the controlfile having standby bit +
potential fiddling with postgresql.auto.conf as it is having
primary_connstring GUC.

6. Sci-fi-mode-on: I was wondering about the dangers of e.g. having a
more recent pg_basebackup (e.g. from pg18 one day) running against
pg17 with respect to this incremental backup capability. Is it going
to be safe? (currently there seem to be no safeguards against such
use) Or should those things (core, pg_basebackup) be running in
version lock step?

Regards,
-J.




Re: trying again to get incremental backup

2023-11-14 Thread Alvaro Herrera
0001 looks OK to push, and since it stands on its own I would get it out
of the way soon rather than waiting for the rest of the series to be
further reviewed.


0002:
This moves bin/pg_verifybackup/parse_manifest.c to
common/parse_manifest.c, where it's not clear that it's for backup
manifests (wasn't a problem in the previous location).  I wonder if
we're going to have anything else called "manifest", in which case I
propose to rename the file to make it clear that this is about backup
manifests -- maybe parse_bkp_manifest.c.

This patch looks pretty uncontroversial, but there's no point in going
further with this one until followup patches are closer to commit.


0003:
AmWalSummarizerProcess() is unused.  Remove?

MaybeWalSummarizer() is called on each ServerLoop() in postmaster.c?
This causes a function call to be emitted every time through.  That
looks odd.  All other process starts have some triggering condition.  

GetOldestUnsummarizedLSN uses while(true), but WaitForWalSummarization
and SummarizeWAL use while(1). Maybe settle on one style?

Still reading this one.


0004:
in PrepareForIncrementalBackup(), the logic that determines
earliest_wal_range_tli and latest_wal_range_tli looks pretty weird.  I
think it works fine if there's a single timeline, but not otherwise.
Or maybe the trick is that it relies on timelines returned by
readTimeLineHistory being sorted backwards?  If so, maybe add a comment
about that somewhere; I don't think other callers of readTimeLineHistory
make that assumption.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Postgres is bloatware by design: it was built to house
 PhD theses." (Joey Hellerstein, SIGMOD annual conference 2002)




Re: trying again to get incremental backup

2023-11-14 Thread Dilip Kumar
On Tue, Nov 14, 2023 at 2:10 AM Robert Haas  wrote:
>
> On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera  
> wrote:
> > Great stuff you got here.  I'm doing a first pass trying to grok the
> > whole thing for more substantive comments, but in the meantime here are
> > some cosmetic ones.
>
> Thanks, thanks, and thanks.
>
> I've fixed some things that you mentioned in the attached version.
> Other comments below.

Here are some more comments based on what I have read so far, mostly
cosmetic comments.

1.
+ * summary file yet, then stoppng doesn't make any sense, and we
+ * should wait until the next stop point instead.

Typo /stoppng/stopping

2.
+ /* Close temporary file and shut down xlogreader. */
+ FileClose(io.file);
+

We have already freed the xlogreader so the second part of the comment
is not valid.

3.
+ /*
+ * If a relation fork is truncated on disk, there is in point in
+ * tracking anything about block modifications beyond the truncation
+ * point.


Typo. /there is in point/ there is no point

4.
+/*
+ * Special handling for WAL recods with RM_XACT_ID.
+ */

/recods/records

5.

+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+forknum, 0);
+ }
+ }

For SmgrCreate and Truncate I understand setting the 'limit block' but
why for commit/abort?  I think it would be better to add some comments
here.

6.
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)

The comments say "private_data->read_upto to the lowest LSN that is
not known to be safe" but is it really the lowest LSN? I think it is
the highest LSN that is known to be safe for that TLI, no?

7.
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();

I was just wondering whether removing old summaries should be the job
of the wal summarizer or it should be the job of the checkpointer, I
mean while removing the old wals it can also check and remove the old
summaries?  Anyway, it's just a question and I do not have a strong
opinion on this.

8.
+ /*
+ * Whether we we removed the file or not, we need not consider it
+ * again.
+ */

Typo /Whether we we removed/ Whether we removed

9.
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+   ForkNumber forknum, BlockNumber *limit_block)

If this function is already returning 'BlockRefTableEntry' then why
would it need to set an extra '*limit_block' out parameter which it is
actually reading from the entry itself?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-11-13 Thread Dilip Kumar
On Tue, Nov 14, 2023 at 12:52 AM Robert Haas  wrote:
>
> On Fri, Nov 10, 2023 at 6:27 AM Dilip Kumar  wrote:
> > - I think 0001 looks good improvement irrespective of the patch series.
>
> OK, perhaps that can be independently committed, then, if nobody objects.
>
> Thanks for the review; I've fixed a bunch of things that you
> mentioned. I'll just comment on the ones I haven't yet done anything
> about below.
>
> > 2.
> > +  > xreflabel="wal_summarize_keep_time">
> > +  wal_summarize_keep_time 
> > (boolean)
> > +  
> > +   wal_summarize_keep_time
> > configuration parameter
> > +  
> >
> > I feel the name of the GUC should be either wal_summarizer_keep_time
> > or wal_summaries_keep_time, I mean either we should refer to the
> > summarizer process or to the WAL summaries files.
>
> How about wal_summary_keep_time?

Yes, that looks perfect to me.

> > 6.
> > + * If the whole range of LSNs is covered, returns true, otherwise false.
> > + * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
> > + * if there are no WAL summary files in the input list, or to the first LSN
> > + * in the range that is not covered by a WAL summary file in the input 
> > list.
> > + */
> > +bool
> > +WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
> >
> > I did not see the usage of this function, but I think if the whole
> > range is not covered why not keep the behavior uniform w.r.t. what we
> > set for '*missing_lsn',  I mean suppose there is no file then
> > missing_lsn is the start_lsn because a very first LSN is missing.
>
> It's used later in the patch series. I think the way that I have it
> makes for a more understandable error message.

Okay

> > 8.
> > +/*
> > + * Comparator to sort a List of WalSummaryFile objects by start_lsn.
> > + */
> > +static int
> > +ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
> > +{
>
> I'm not sure what needs fixing here.

I think I copy-pasted it by mistake, just ignore it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-11-13 Thread Robert Haas
On Fri, Nov 10, 2023 at 6:27 AM Dilip Kumar  wrote:
> - I think 0001 looks good improvement irrespective of the patch series.

OK, perhaps that can be independently committed, then, if nobody objects.

Thanks for the review; I've fixed a bunch of things that you
mentioned. I'll just comment on the ones I haven't yet done anything
about below.

> 2.
> +  xreflabel="wal_summarize_keep_time">
> +  wal_summarize_keep_time (boolean)
> +  
> +   wal_summarize_keep_time
> configuration parameter
> +  
>
> I feel the name of the GUC should be either wal_summarizer_keep_time
> or wal_summaries_keep_time, I mean either we should refer to the
> summarizer process or to the WAL summaries files.

How about wal_summary_keep_time?

> 6.
> + * If the whole range of LSNs is covered, returns true, otherwise false.
> + * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
> + * if there are no WAL summary files in the input list, or to the first LSN
> + * in the range that is not covered by a WAL summary file in the input list.
> + */
> +bool
> +WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
>
> I did not see the usage of this function, but I think if the whole
> range is not covered why not keep the behavior uniform w.r.t. what we
> set for '*missing_lsn',  I mean suppose there is no file then
> missing_lsn is the start_lsn because a very first LSN is missing.

It's used later in the patch series. I think the way that I have it
makes for a more understandable error message.

> 8.
> +/*
> + * Comparator to sort a List of WalSummaryFile objects by start_lsn.
> + */
> +static int
> +ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
> +{

I'm not sure what needs fixing here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-11-13 Thread Alvaro Herrera
Great stuff you got here.  I'm doing a first pass trying to grok the
whole thing for more substantive comments, but in the meantime here are
some cosmetic ones.

I got the following warnings, both valid:

../../../../pgsql/source/master/src/common/blkreftable.c: In function 
'WriteBlockRefTable':
../../../../pgsql/source/master/src/common/blkreftable.c:520:45: warning: 
declaration of 'brtentry' shadows a previous local [-Wshadow=compatible-local]
  520 | BlockRefTableEntry *brtentry;
  | ^~~~
../../../../pgsql/source/master/src/common/blkreftable.c:492:37: note: shadowed 
declaration is here
  492 | BlockRefTableEntry *brtentry;
  | ^~~~

../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c: In 
function 'SummarizeWAL':
../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c:833:57:
 warning: declaration of 'private_data' shadows a previous local 
[-Wshadow=compatible-local]
  833 | SummarizerReadLocalXLogPrivate *private_data;
  | ^~~~
../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c:709:41:
 note: shadowed declaration is here
  709 | SummarizerReadLocalXLogPrivate *private_data;
  | ^~~~

In blkreftable.c, I think the definition of SH_EQUAL should have an
outer layer of parentheses.  Also, it would be good to provide and use a
function to initialize a BlockRefTableKey from the RelFileNode and
forknum components, and ensure that any padding bytes are zeroed.
Otherwise it's not going to be a great hash key.  On my platform there
aren't any (padding bytes), but I think it's unwise to rely on that.

I don't think SummarizerReadLocalXLogPrivate->waited is used for
anything.  Could be removed AFAICS, unless you're foreseen adding
something that uses it.

These forward struct declarations are not buying you anything, I'd
remove them:

diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
index 70d6c072d7..316e67122c 100644
--- a/src/include/common/blkreftable.h
+++ b/src/include/common/blkreftable.h
@@ -29,10 +29,7 @@
 /* Magic number for serialization file format. */
 #define BLOCKREFTABLE_MAGIC0x652b137b
 
-struct BlockRefTable;
-struct BlockRefTableEntry;
-struct BlockRefTableReader;
-struct BlockRefTableWriter;
+/* Struct definitions appear in blkreftable.c */
 typedef struct BlockRefTable BlockRefTable;
 typedef struct BlockRefTableEntry BlockRefTableEntry;
 typedef struct BlockRefTableReader BlockRefTableReader;


and backup_label.h doesn't know about TimeLineID, so it needs this:

diff --git a/src/bin/pg_combinebackup/backup_label.h 
b/src/bin/pg_combinebackup/backup_label.h
index 08d6ed67a9..3af7ea274c 100644
--- a/src/bin/pg_combinebackup/backup_label.h
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -12,6 +12,7 @@
 #ifndef BACKUP_LABEL_H
 #define BACKUP_LABEL_H
 
+#include "access/xlogdefs.h"
 #include "common/checksum_helper.h"
 #include "lib/stringinfo.h"
 

I don't much like the way header files in src/bin/pg_combinebackup files
are structured.  Particularly, causing a simplehash to be "instantiated"
just because load_manifest.h is included seems poised to cause pain.  I
think there should be a file with the basic struct declarations (no
simplehash); and then maybe since both pg_basebackup and
pg_combinebackup seem to need the same simplehash, create a separate
header file containing just that..  But, did you notice that anything
that includes reconstruct.h will instantiate the simplehash stuff,
because it includes load_manifest.h?  It may be unwise to have the
simplehash in a header file.  Maybe just declare it in each .c file that
needs it.  The duplication is not that large.

I'll see if I can understand the way all these headers are needed to
propose some other arrangement.

I see this idea of having "struct FooBar;" immediately followed by
"typedef struct FooBar FooBar;" which I mentioned from blkreftable.h
occurs in other places as well (JsonManifestParseContext in
parse_manifest.h, maybe others?).  Was this pattern cargo-culted from
somewhere?  Perhaps we have other places to clean up.


Why leave unnamed arguments in function declarations?  For example, in

static void manifest_process_file(JsonManifestParseContext *,
  char *pathname,
  size_t size,
  pg_checksum_type checksum_type,
  int checksum_length,
  uint8 *checksum_payload);
the first argument lacks a name.  Is this just an oversight, I hope?


In GetFileBackupMethod(), which arguments are in and which are out?
The comment doesn't say, and it's not obvious why we pass both the file

Re: trying again to get incremental backup

2023-11-10 Thread Dilip Kumar
On Tue, Nov 7, 2023 at 2:06 AM Robert Haas  wrote:
>
> On Mon, Oct 30, 2023 at 2:46 PM Andres Freund  wrote:
> > After playing with this for a while, I don't see a reason for 
> > wal_summarize_mb
> > from a memory usage POV at least.
>
> Here's v8. Changes:

Review comments, based on what I reviewed so far.

- I think 0001 looks good improvement irrespective of the patch series.

- review 0003
1.
+   be enabled either on a primary or on a standby. WAL summarization can
+   cannot be enabled when wal_level is set to
+   minimal.

Grammatical error
"WAL summarization can cannot" -> WAL summarization cannot

2.
+ 
+  wal_summarize_keep_time (boolean)
+  
+   wal_summarize_keep_time
configuration parameter
+  

I feel the name of the GUC should be either wal_summarizer_keep_time
or wal_summaries_keep_time, I mean either we should refer to the
summarizer process or to the WAL summaries files.

3.

+XLogGetOldestSegno(TimeLineID tli)
+{
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, _tli, _segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */

Some of the single-line comments end with a full stop whereas others
do not, so better to be consistent.

4.

+ * If start_lsn != InvalidXLogRecPtr, only summaries that end before the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)


Instead of "If start_lsn != InvalidXLogRecPtr, only summaries that end
before the" it should be "If start_lsn != InvalidXLogRecPtr, only
summaries that end after the" because only if the summary files are
Ending after the start_lsn then it will have some overlapping and we
need to return them if ending before start lsn then those files are
not overlapping at all, right?
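
In other words, the overlap filter presumably wants to be roughly (a
sketch, using ws for the WalSummaryFile being considered):

/* A summary [ws->start_lsn, ws->end_lsn) overlaps the requested range iff
 * it ends after start_lsn and starts before end_lsn. */
if ((XLogRecPtrIsInvalid(start_lsn) || ws->end_lsn > start_lsn) &&
    (XLogRecPtrIsInvalid(end_lsn) || ws->start_lsn < end_lsn))
    result = lappend(result, ws);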

5.
In FilterWalSummaries() header also the comment is wrong same as for
GetWalSummaries() function.

6.
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,

I did not see the usage of this function, but I think if the whole
range is not covered why not keep the behavior uniform w.r.t. what we
set for '*missing_lsn',  I mean suppose there is no file then
missing_lsn is the start_lsn because a very first LSN is missing.

7.
+ nbytes = FileRead(io->file, data, length, io->filepos,
+   WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));

/could not write file/ could not read file

8.
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
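
To sketch what I mean in point 4 above (made-up types only, not the
patch's actual code): a summary covering [s_start, s_end) is of interest
for a requested range only if it ends after the range starts and starts
before the range ends, treating 0 like InvalidXLogRecPtr:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* does a summary covering [s_start, s_end) overlap [start_lsn, end_lsn)? */
static bool
summary_overlaps(uint64_t s_start, uint64_t s_end,
                 uint64_t start_lsn, uint64_t end_lsn)
{
    if (start_lsn != 0 && s_end <= start_lsn)
        return false;           /* ends before the range of interest begins */
    if (end_lsn != 0 && s_start >= end_lsn)
        return false;           /* starts after the range of interest ends */
    return true;
}

int
main(void)
{
    /* summary covering 100..200 vs. requested range 150..300: overlaps */
    printf("%d\n", summary_overlaps(100, 200, 150, 300));
    /* summary covering 100..200 vs. requested range 200..300: does not */
    printf("%d\n", summary_overlaps(100, 200, 200, 300));
    return 0;
}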


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-11-03 Thread Robert Haas
On Wed, Nov 1, 2023 at 8:57 AM Jakub Wartak
 wrote:
> Thanks for answering! It all sounds like this
> resync-standby-using-primary-incrbackup idea isn't fit for the current
> pg_combinebackup, but rather for a new tool hopefully in future. It
> could take the current LSN from stuck standby, calculate manifest on
> the lagged and offline standby (do we need to calculate manifest
> Checksum in that case? I cannot find code for it), deliver it via
> "UPLOAD_MANIFEST" to primary and start fetching and applying the
> differences while doing some form of copy-on-write from old & incoming
> incrbackup data to "$relfilenodeid.new" and then durable_unlink() old
> one and durable_rename("$relfilenodeid.new", "$relfilenodeid". Would
> it still be possible in theory? (it could use additional safeguards
> like rename controlfile when starting and just before ending to
> additionally block startup if it hasn't finished). Also it looks as
> per comment nearby struct IncrementalBackupInfo.manifest_files that
> even checksums are just more for safeguarding rather than core
> implementation (?)
>
> What I meant in the initial idea is not to hinder current efforts,
> but to ask whether the current design would stand in the way of such a
> cool new addition in the future?

Hmm, interesting idea. I think something like that could be made to
work. My first thought was that it would sort of suck to have to
compute a manifest as a precondition of doing this, but then I started
to think maybe it wouldn't, really. I mean, you'd have to scan the
local directory tree and collect all the filenames so that you could
remove any files that are no longer present in the current version of
the data directory which the incremental backup would send to you. If
you're already doing that, the additional cost of generating a
manifest isn't that high, at least if you don't include checksums,
which aren't required. On the other hand, if you didn't need to send
the server a manifest and just needed to send the required WAL ranges,
that would be even cheaper. I'll spend some more time thinking about
this next week.

> As per earlier test [1], I've already tried to simulate that in
> incrbackuptests-0.1.tgz/test_across_wallevelminimal.sh , but that
> worked (but that was with CTAS-wal-minimal-optimization -> new
> relfilenodeOID is used for CTAS which got included in the incremental
> backup as it's new file) Even retested that with Your v7 patch with
> asserts, same. When simulating with "BEGIN; TRUNCATE nightmare; COPY
> nightmare FROM '/tmp/copy.out'; COMMIT;" on wal_level=minimal it still
> recovers using incremental backup because the WAL contains:

TRUNCATE itself is always WAL-logged, but data added to the relation
in the same transaction as the TRUNCATE isn't always WAL-logged (but
sometimes it is, depending on the relation size). So the failure case
wouldn't be missing the TRUNCATE but missing some data-containing
blocks within the relation shortly after it was created or truncated.
I think what I need to do here is avoid summarizing WAL that was
generated under wal_level=minimal. The walsummarizer process should
just refuse to emit summaries for any such WAL.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-11-01 Thread Jakub Wartak
On Mon, Oct 30, 2023 at 6:46 PM Robert Haas  wrote:
>
> On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak
>  wrote:
> > If that is still an area open for discussion: wouldn't it be better to
> > just specify LSN as it would allow resyncing standby across major lag
> > where the WAL to replay would be enormous? Given that we had
> > primary->standby where standby would be stuck on some LSN, right now
> > it would be:
> > 1) calculate backup manifest of desynced 10TB standby (how? using
> > which tool?)  - even if possible, that means reading 10TB of data
> > instead of just putting a number, isn't it?
> > 2) backup primary with such incremental backup >= LSN
> > 3) copy the incremental backup to standby
> > 4) apply it to the impaired standby
> > 5) restart the WAL replay
>
> As you may be able to tell from the flurry of posts and new patch
> sets, I'm trying hard to sort out the remaining open items that
> pertain to this patch set, and I'm now back to thinking about this
> one.
>
> TL;DR: I think the idea has some potential, but there are some
> pitfalls that I'm not sure how to address.
>
> I spent some time looking at how we currently use the data from the
> backup manifest. Currently, we do two things with it. First, when
> we're backing up each file, we check whether it's present in the
> backup manifest and, if not, we back it up in full. This actually
> feels fairly poor. If it makes any difference at all, then presumably
> the underlying algorithm is buggy and needs to be fixed. Maybe that
> should be ripped out altogether or turned into some kind of sanity
> check that causes a big explosion if it fails. Second, we check
> whether the WAL ranges reported by the client match up with the
> timeline history of the server (see PrepareForIncrementalBackup). This
> set of sanity checks seems fairly important to me, and I'd regret
> discarding them. I think there's some possibility that they might
> catch user error, like where somebody promotes multiple standbys and
> maybe they even get the same timeline on more than one of them, and
> then confusion might ensue.
[..]

> Another practical problem here is that, right now, pg_combinebackup
> doesn't have an in-place mode. It knows how to take a bunch of input
> backups and write out an output backup, but that output backup needs
> to go into a new, fresh directory (or directories plural, if there are
> user-defined tablespaces). I had previously considered adding such a
> mode, but the idea I had at the time wouldn't have worked for this
> case. I imagined that someone might want to run "pg_combinebackup
> --in-place full incr" and clobber the contents of the incr directory
> with the output, basically discarding the incremental backup you took
> in favor of a full backup that could have been taken at the same point
> in time.
[..]

Thanks for answering! It all sounds like this
resync-standby-using-primary-incrbackup idea isn't fit for the current
pg_combinebackup, but rather for a new tool hopefully in future. It
could take the current LSN from stuck standby, calculate manifest on
the lagged and offline standby (do we need to calculate manifest
Checksum in that case? I cannot find code for it), deliver it via
"UPLOAD_MANIFEST" to primary and start fetching and applying the
differences while doing some form of copy-on-write from old & incoming
incrbackup data to "$relfilenodeid.new" and then durable_unlink() old
one and durable_rename("$relfilenodeid.new", "$relfilenodeid". Would
it still be possible in theory? (it could use additional safeguards
like rename controlfile when starting and just before ending to
additionally block startup if it hasn't finished). Also it looks as
per comment nearby struct IncrementalBackupInfo.manifest_files that
even checksums are just more for safeguarding rather than core
implementation (?)

What I meant in the initial idea is not to hinder current efforts,
but to ask whether the current design would stand in the way of such a
cool new addition in the future?

> One thing I also realized when thinking about this is that you could
> probably hose yourself with the patch set as it stands today by taking
> a full backup, downgrading to wal_level=minimal for a while, doing
> some WAL-skipping operations, upgrading to a higher WAL-level again,
> and then taking an incremental backup. I think the solution to that is
> probably for the WAL summarizer to refuse to run if wal_level=minimal.
> Then there would be a gap in the summary files which an incremental
> backup attempt would detect.

As per earlier test [1], I've already tried to simulate that in
incrbackuptests-0.1.tgz/test_across_wallevelminimal.sh , but that
worked (but that was with CTAS-wal-minimal-optimization -> new
relfilenodeOID is used for CTAS which got included in the incremental
backup as it's new file) Even retested that with Your v7 patch with
asserts, same. When simulating with "BEGIN; TRUNCATE nightmare; COPY
nightmare FROM '/tmp/copy.out'; COMMIT;" on 

Re: trying again to get incremental backup

2023-10-30 Thread Robert Haas
On Mon, Oct 30, 2023 at 2:46 PM Andres Freund  wrote:
> After playing with this for a while, I don't see a reason for wal_summarize_mb
> from a memory usage POV at least.

Cool! Thanks for testing.

> I wonder if there are use cases that might like to consume WAL summaries
> before the next checkpoint? For those wal_summarize_mb likely wouldn't be a
> good control, but they might want to request a summary file to be created at
> some point?

It's possible. I actually think it's even more likely that there are
use cases that will also want the WAL summarized, but in some
different way. For example, you might want a summary that would give
you the LSN or approximate LSN where changes to a certain block
occurred. Such a summary would be way bigger than these summaries and
therefore, at least IMHO, a lot less useful for incremental backup,
but it could be really useful for something else. Or you might want
summaries that focus on something other than which blocks got changed,
like what relations were created or destroyed, or only changes to
certain kinds of relations or relation forks, or whatever. In a way,
you can even think of logical decoding as a kind of WAL summarization,
just with a very different set of goals from this one. I won't be too
surprised if the next hacker wants something that is different enough
from what this does that it doesn't make sense to share mechanism, but
if by chance they want the same thing but dumped a bit more
frequently, well, that can be done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-10-30 Thread Andres Freund
Hi,

On 2023-10-30 10:45:03 -0400, Robert Haas wrote:
> On Tue, Oct 24, 2023 at 12:08 PM Robert Haas  wrote:
> > Note that whether to remove summaries is a separate question from
> > whether to generate them in the first place. Right now, I have
> > wal_summarize_mb controlling whether they get generated in the first
> > place, but as I noted in another recent email, that isn't an entirely
> > satisfying solution.
> 
> I did some more research on this. My conclusion is that I should
> remove wal_summarize_mb and just have a GUC summarize_wal = on|off
> that controls whether the summarizer runs at all. There will be one
> summary file per checkpoint, no matter how far apart checkpoints are
> or how large the summary gets. Below I'll explain the reasoning; let
> me know if you disagree.

> What I describe above would be a bad plan if it were realistically
> possible for a summary file to get so large that it might run the
> machine out of memory either when producing it or when trying to make
> use of it for an incremental backup. This seems to be a somewhat
> difficult scenario to create. So far, I haven't been able to generate
> WAL summary files more than a few tens of megabytes in size, even when
> summarizing 50+ GB of WAL per summary file. One reason why it's hard
> to produce large summary files is because, for a single relation fork,
> the WAL summary size converges to 1 bit per modified block when the
> number of modified blocks is large. This means that, even if you have
> a terabyte sized relation, you're looking at no more than perhaps 20MB
> of summary data no matter how much of it gets modified. Now, somebody
> could have a 30TB relation and then if they modify the whole thing
> they could have the better part of a gigabyte of summary data for that
> relation, but if you've got a 30TB table you probably have enough
> memory that that's no big deal.

I'm not particularly worried about the rewriting-30TB-table case - that'd also
generate >= 30TB of WAL most of the time. Which realistically is going to
trigger a few checkpoints, even on very big instances.


> But, what if you have multiple relations? I initialized pgbench with a
> scale factor of 3 and also with 3 partitions and did a 1-hour
> run. I got 4 checkpoints during that time and each one produced an
> approximately 16MB summary file.

Hm, I assume the pgbench run will be fairly massively bottlenecked on IO, due
to having to read data from disk, lots of full page writes and having to write
out lots of data?  I.e. we won't do all that many transactions during the 1h?


> To get a 1GB+ WAL summary file, you'd need to modify millions of relation
> forks, maybe tens of millions, and most installations aren't even going to
> have that many relation forks, let alone be modifying them all frequently.

I tried to find bad cases for a bit - and I am not worried. I wrote a pgbench
script to create 10k single-row relations in each script, ran that with 96
clients, checkpointed, and ran a pgbench script that updated the single row in
each table.

After creation of the relation WAL summarizer uses
LOG:  level: 1; Wal Summarizer: 378433680 total in 43 blocks; 5628936 free (66 chunks); 372804744 used
and creates a 26MB summary file.

After checkpoint & updates WAL summarizer uses:
LOG:  level: 1; Wal Summarizer: 369205392 total in 43 blocks; 5864536 free (26 chunks); 363340856 used
and creates a 26MB summary file.

Sure, 350MB ain't nothing, but simply just executing \dt in the database
created by this makes the backend use 260MB after. Which isn't going away,
whereas WAL summarizer drops its memory usage soon after.


> But I think that's sufficiently niche that the current patch shouldn't
> concern itself with such cases. If we find that they're common enough
> to worry about, we might eventually want to do something to mitigate
> them, but whether that thing looks anything like wal_summarize_mb
> seems pretty unclear. So I conclude that it's a mistake to include
> that GUC as currently designed and propose to replace it with a
> Boolean as described above.

After playing with this for a while, I don't see a reason for wal_summarize_mb
from a memory usage POV at least.

I wonder if there are use cases that might like to consume WAL summaries
before the next checkpoint? For those wal_summarize_mb likely wouldn't be a
good control, but they might want to request a summary file to be created at
some point?

Greetings,

Andres Freund




Re: trying again to get incremental backup

2023-10-30 Thread Robert Haas
On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak
 wrote:
> If that is still an area open for discussion: wouldn't it be better to
> just specify LSN as it would allow resyncing standby across major lag
> where the WAL to replay would be enormous? Given that we had
> primary->standby where standby would be stuck on some LSN, right now
> it would be:
> 1) calculate backup manifest of desynced 10TB standby (how? using
> which tool?)  - even if possible, that means reading 10TB of data
> instead of just putting a number, isn't it?
> 2) backup primary with such incremental backup >= LSN
> 3) copy the incremental backup to standby
> 4) apply it to the impaired standby
> 5) restart the WAL replay

As you may be able to tell from the flurry of posts and new patch
sets, I'm trying hard to sort out the remaining open items that
pertain to this patch set, and I'm now back to thinking about this
one.

TL;DR: I think the idea has some potential, but there are some
pitfalls that I'm not sure how to address.

I spent some time looking at how we currently use the data from the
backup manifest. Currently, we do two things with it. First, when
we're backing up each file, we check whether it's present in the
backup manifest and, if not, we back it up in full. This actually
feels fairly poor. If it makes any difference at all, then presumably
the underlying algorithm is buggy and needs to be fixed. Maybe that
should be ripped out altogether or turned into some kind of sanity
check that causes a big explosion if it fails. Second, we check
whether the WAL ranges reported by the client match up with the
timeline history of the server (see PrepareForIncrementalBackup). This
set of sanity checks seems fairly important to me, and I'd regret
discarding them. I think there's some possibility that they might
catch user error, like where somebody promotes multiple standbys and
maybe they even get the same timeline on more than one of them, and
then confusion might ensue. I also think that there's a real
possibility that they might make it easier to track down bugs in my
code, even if those bugs aren't necessarily timeline-related. If (or
more realistically when) somebody ends up with a corrupted cluster
after running pg_combinebackup, we're going to need to figure out
whether that corruption is the result of bugs (and if so where they
are) or user error (and if so what it was). The most obvious ways of
ending up with a corrupted cluster are (1) taking an incremental
backup against a prior backup that is not in the history of the server
from which the backup is taken or (2) combining an incremental backup
with the wrong prior backup, so whatever sanity checks we can have that
will tend to prevent those kinds of mistakes seem like a really good
idea.

And those kinds of checks seem relevant here, too. Consider that it
wouldn't be valid to use pg_combinebackup to fast-forward a standby
server if the incremental backup's backup-end-LSN preceded the standby
server's minimum recovery point. Imagine that you have a standby whose
last checkpoint's redo location was at LSN 2/48. Being the
enterprising DBA that you are, you make a note of that LSN and go take
an incremental backup based on it. You then stop the standby server
and try to apply the incremental backup to fast-forward the standby.
Well, it's possible that in the meanwhile the standby actually caught
up, and now has a minimum recovery point that follows the
backup-end-LSN of your incremental backup. In that case, you can't
legally use that incremental backup to fast-forward that standby, but
no code I've yet written would be smart enough to figure that out. Or,
maybe you (or some other DBA on your team) got really excited and
actually promoted that standby meanwhile, and now it's not even on the
same timeline any more. In the "normal" case where you take an
incremental backup based on an earlier base backup, these kinds of
problems are detectable, and it seems to me that if we want to enable
this kind of use case, it would be pretty smart to have a plan to
detect similar mistakes here. I don't, currently, but maybe there is
one.

Another practical problem here is that, right now, pg_combinebackup
doesn't have an in-place mode. It knows how to take a bunch of input
backups and write out an output backup, but that output backup needs
to go into a new, fresh directory (or directories plural, if there are
user-defined tablespaces). I had previously considered adding such a
mode, but the idea I had at the time wouldn't have worked for this
case. I imagined that someone might want to run "pg_combinebackup
--in-place full incr" and clobber the contents of the incr directory
with the output, basically discarding the incremental backup you took
in favor of a full backup that could have been taken at the same point
in time. But here, you'd want to clobber the *first* input to
pg_combinebackup, not the last one, so if we want to add something
like this, the UI needs some thought.

One thing that I find 

Re: trying again to get incremental backup

2023-10-30 Thread Robert Haas
While reviewing this thread today, I realized that I never responded
to this email. That was inadvertent; my apologies.

On Wed, Jun 14, 2023 at 4:34 PM Matthias van de Meent
 wrote:
> Nice, I like this idea.

Cool.

> Skimming through the 7th patch, I see claims that FSM is not fully
> WAL-logged and thus shouldn't be tracked, and so it indeed doesn't
> track those changes.
> I disagree with that decision: we now have support for custom resource
> managers, which may use the various forks for other purposes than
> those used in PostgreSQL right now. It would be a shame if data is
> lost because of the backup tool ignoring forks because the PostgreSQL
> project itself doesn't have post-recovery consistency guarantees in
> that fork. So, unless we document that WAL-logged changes in the FSM
> fork are actually not recoverable from backup, regardless of the type
> of contents, we should still keep track of the changes in the FSM fork
> and include the fork in our backups or only exclude those FSM updates
> that we know are safe to ignore.

I'm not sure what to do about this problem. I don't think any data
would be *lost* in the scenario that you mention; what I think would
happen is that the FSM forks would be backed up in their entirety even
if they were owned by some other table AM or index AM that was
WAL-logging all changes to whatever it was storing in that fork. So I
think that there is not a correctness issue here but rather an
efficiency issue.

It would still be nice to fix that somehow, but I don't see how to do
it. It would be easy to make the WAL summarizer stop treating the FSM
as a special case, but there's no way for basebackup_incremental.c to
know whether a particular relation fork is for the heap AM or some
other AM that handles WAL-logging differently. It can't for example
examine pg_class; it's not connected to any database, let alone every
database. So we have to either trust that the WAL for the FSM is
correct and complete in all cases, or assume that it isn't in any
case. And the former doesn't seem like a safe or wise assumption given
how the heap AM works.

I think the reality here is unfortunately that we're missing a lot of
important infrastructure to really enable a multi-table-AM world. The
heap AM, and every other table AM, should include a metapage so we can
tell what we're looking at just by examining the disk files. Relation
forks don't scale and should be replaced with some better system that
does. We should have at least two table AMs in core that are fully
supported and do truly useful things. Until some of that stuff (and
probably a bunch of other things) get sorted out, out-of-core AMs are
going to have to remain second-class citizens to some degree.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-10-30 Thread Robert Haas
On Tue, Oct 24, 2023 at 12:08 PM Robert Haas  wrote:
> Note that whether to remove summaries is a separate question from
> whether to generate them in the first place. Right now, I have
> wal_summarize_mb controlling whether they get generated in the first
> place, but as I noted in another recent email, that isn't an entirely
> satisfying solution.

I did some more research on this. My conclusion is that I should
remove wal_summarize_mb and just have a GUC summarize_wal = on|off
that controls whether the summarizer runs at all. There will be one
summary file per checkpoint, no matter how far apart checkpoints are
or how large the summary gets. Below I'll explain the reasoning; let
me know if you disagree.

What I describe above would be a bad plan if it were realistically
possible for a summary file to get so large that it might run the
machine out of memory either when producing it or when trying to make
use of it for an incremental backup. This seems to be a somewhat
difficult scenario to create. So far, I haven't been able to generate
WAL summary files more than a few tens of megabytes in size, even when
summarizing 50+ GB of WAL per summary file. One reason why it's hard
to produce large summary files is because, for a single relation fork,
the WAL summary size converges to 1 bit per modified block when the
number of modified blocks is large. This means that, even if you have
a terabyte sized relation, you're looking at no more than perhaps 20MB
of summary data no matter how much of it gets modified. Now, somebody
could have a 30TB relation and then if they modify the whole thing
they could have the better part of a gigabyte of summary data for that
relation, but if you've got a 30TB table you probably have enough
memory that that's no big deal.
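
Just to sanity-check those numbers, here is a rough back-of-the-envelope
calculation (a standalone illustration only, not code from the patch),
assuming the default 8kB block size and roughly 1 bit of summary data
per modified block:

#include <stdio.h>

int
main(void)
{
    double  rel_bytes = 1024.0 * 1024 * 1024 * 1024;    /* 1TB relation */
    double  blocks = rel_bytes / 8192;                  /* ~134 million blocks */
    double  summary_bytes = blocks / 8;                 /* ~1 bit per modified block */

    printf("~%.0f blocks, ~%.1f MB of summary data\n",
           blocks, summary_bytes / (1024 * 1024));      /* ~16 MB */
    return 0;
}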

But, what if you have multiple relations? I initialized pgbench with a
scale factor of 3 and also with 3 partitions and did a 1-hour
run. I got 4 checkpoints during that time and each one produced an
approximately 16MB summary file. The efficiency here drops
considerably. For example, one of the files is 16495398 bytes and
records information on 7498403 modified blocks, which works out to
about 2.2 bytes per modified block. That's more than an order of
magnitude worse than what I got in the single-relation case, where the
summary file didn't even use two *bits* per modified block. But here
again, the file just isn't that big in absolute terms. To get a 1GB+
WAL summary file, you'd need to modify millions of relation forks,
maybe tens of millions, and most installations aren't even going to
have that many relation forks, let alone be modifying them all
frequently.

My conclusion here is that it's pretty hard to have a database where
WAL summarization is going to use too much memory. I wouldn't be
terribly surprised if there are some extreme cases where it happens,
but those databases probably aren't great candidates for incremental
backup anyway. They're probably databases with millions of relations
and frequent, widely-scattered modifications to those relations. And
if you have that kind of high turnover rate then incremental backup
isn't going to be as helpful anyway, so there's probably no reason to
enable WAL summarization in the first place. Maybe if you have that
plus in the same database cluster you have a 100TB of completely
static data that is never modified, and if you also do all of this on
a pretty small machine, then you can find a case where incremental
backup would have worked well but for the memory consumed by WAL
summarization.

But I think that's sufficiently niche that the current patch shouldn't
concern itself with such cases. If we find that they're common enough
to worry about, we might eventually want to do something to mitigate
them, but whether that thing looks anything like wal_summarize_mb
seems pretty unclear. So I conclude that it's a mistake to include
that GUC as currently designed and propose to replace it with a
Boolean as described above.

Comments?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-10-26 Thread Robert Haas
On Thu, Oct 26, 2023 at 6:59 AM Andrew Dunstan  wrote:
> Because we won't be removing the RD parser.

Ah, OK.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-10-26 Thread Andrew Dunstan



On 2023-10-25 We 15:19, Robert Haas wrote:

On Wed, Oct 25, 2023 at 3:17 PM Andrew Dunstan  wrote:

OK, I'll go with that. It will actually be a bit less invasive than the
patch I posted.

Why's that?



Because we won't be removing the RD parser.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: trying again to get incremental backup

2023-10-25 Thread Robert Haas
On Wed, Oct 25, 2023 at 3:17 PM Andrew Dunstan  wrote:
> OK, I'll go with that. It will actually be a bit less invasive than the
> patch I posted.

Why's that?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-10-25 Thread Andrew Dunstan



On 2023-10-25 We 11:24, Robert Haas wrote:

On Wed, Oct 25, 2023 at 10:33 AM Andrew Dunstan  wrote:

I'm not too worried about the maintenance burden.

That said, I agree that JSON might not be the best format for backup
manifests, but maybe that ship has sailed.

I think it's a decision we could walk back if we had a good enough
reason, but it would be nicer if we didn't have to, because what we
have right now is working. If we change it for no real reason, we
might introduce new bugs, and at least in theory, incompatibility with
third-party tools that parse the existing format. If you think we can
live with the additional complexity in the JSON parsing stuff, I'd
rather go that way.



OK, I'll go with that. It will actually be a bit less invasive than the 
patch I posted.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: trying again to get incremental backup

2023-10-25 Thread Robert Haas
On Wed, Oct 25, 2023 at 10:33 AM Andrew Dunstan  wrote:
> I'm not too worried about the maintenance burden.
>
> That said, I agree that JSON might not be the best format for backup
> manifests, but maybe that ship has sailed.

I think it's a decision we could walk back if we had a good enough
reason, but it would be nicer if we didn't have to, because what we
have right now is working. If we change it for no real reason, we
might introduce new bugs, and at least in theory, incompatibility with
third-party tools that parse the existing format. If you think we can
live with the additional complexity in the JSON parsing stuff, I'd
rather go that way.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-10-25 Thread Andrew Dunstan



On 2023-10-25 We 09:05, Robert Haas wrote:

On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan  wrote:

Robert asked me to work on this quite some time ago, and most of this
work was done last year.

Here's my WIP for an incremental JSON parser. It works and passes all
the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book.
The reason I haven't posted it before is that it's about 50% slower in
pure parsing speed than the current recursive descent parser in my
testing. I've tried various things to make it faster, but haven't made
much impact. One of my colleagues is going to take a fresh look at it,
but maybe someone on the list can see where we can save some cycles.

If we can't make it faster, I guess we could use the RD parser for
non-incremental cases and only use the non-RD parser for incremental,
although that would be a bit sad. However, I don't think we can make the
RD parser suitable for incremental parsing - there's too much state
involved in the call stack.

Yeah, this is exactly why I didn't want to use JSON for the backup
manifest in the first place. Parsing such a manifest incrementally is
complicated. If we'd gone with my original design where the manifest
consisted of a bunch of lines each of which could be parsed
separately, we'd already have incremental parsing and wouldn't be
faced with these difficult trade-offs.

Unfortunately, I'm not in a good position either to figure out how to
make your prototype faster, or to evaluate how painful it is to keep
both in the source tree. It's probably worth considering how likely it
is that we'd be interested in incremental JSON parsing in other cases.
Maintaining two JSON parsers is probably not a lot of fun regardless,
but if each of them gets used for a bunch of things, that feels less
bad than if one of them gets used for a bunch of things and the other
one only ever gets used for backup manifests. Would we be interested
in JSON-format database dumps? Incrementally parsing JSON LOBs? Either
seems tenuous, but those are examples of the kind of thing that could
make us happy to have incremental JSON parsing as a general facility.

If nobody's very excited by those kinds of use cases, then this just
boils down to whether we want to (a) accept that users with very large
numbers of relation files won't be able to use pg_verifybackup or
incremental backup, (b) accept that we're going to maintain a second
JSON parser just to enable that use case and with no other benefit, or
(c) undertake to change the manifest format to something that is
straightforward to parse incrementally. I think (a) is reasonable
short term, but at some point I think we should do better. I'm not
really that enthused about (c) because it means more work for me and
possibly more arguing, but if (b) is going to cause a lot of hassle
then we might need to consider it.



I'm not too worried about the maintenance burden. The RD routines were 
added in March 2013 (commit a570c98d7fa) and have hardly changed since 
then. The new code is not ground-breaking - it's just a different (and 
fairly well known) way of doing the same thing. I'd be happier if we 
could make it faster, but maybe it's just a fact that keeping an 
explicit stack, which is how this works, is slower.
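
To illustrate what I mean by keeping the state in an explicit stack
rather than in the call stack (just a sketch of the general shape, not
code from the WIP patch), something like this can stop at any chunk
boundary and resume later, which is the property incremental parsing
needs:

#include <stdbool.h>
#include <stdio.h>

typedef struct ToyParser
{
    char    stack[64];      /* open '{' / '[' containers */
    int     depth;
} ToyParser;

/* consume one chunk; state survives in the struct, not the call stack */
static bool
toy_feed(ToyParser *p, const char *chunk)
{
    for (const char *c = chunk; *c; c++)
    {
        if (*c == '{' || *c == '[')
        {
            if (p->depth >= (int) sizeof(p->stack))
                return false;   /* too deeply nested for the toy */
            p->stack[p->depth++] = *c;
        }
        else if (*c == '}' || *c == ']')
        {
            if (p->depth == 0)
                return false;   /* closing with nothing open */
            if ((*c == '}') != (p->stack[--p->depth] == '{'))
                return false;   /* mismatched container */
        }
    }
    return true;                /* chunk done; more input may follow */
}

int
main(void)
{
    ToyParser   p = {0};

    /* the input can arrive in pieces of any size */
    toy_feed(&p, "{\"Files\": [");
    toy_feed(&p, "{}, {}]");
    toy_feed(&p, "}");
    printf("document balanced: %s\n", p.depth == 0 ? "yes" : "no");
    return 0;
}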


I wouldn't at all be surprised if there were other good uses for 
incremental JSON parsing, including some you've identified.


That said, I agree that JSON might not be the best format for backup 
manifests, but maybe that ship has sailed.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: trying again to get incremental backup

2023-10-25 Thread Robert Haas
On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan  wrote:
> Robert asked me to work on this quite some time ago, and most of this
> work was done last year.
>
> Here's my WIP for an incremental JSON parser. It works and passes all
> the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book.
> The reason I haven't posted it before is that it's about 50% slower in
> pure parsing speed than the current recursive descent parser in my
> testing. I've tried various things to make it faster, but haven't made
> much impact. One of my colleagues is going to take a fresh look at it,
> but maybe someone on the list can see where we can save some cycles.
>
> If we can't make it faster, I guess we could use the RD parser for
> non-incremental cases and only use the non-RD parser for incremental,
> although that would be a bit sad. However, I don't think we can make the
> RD parser suitable for incremental parsing - there's too much state
> involved in the call stack.

Yeah, this is exactly why I didn't want to use JSON for the backup
manifest in the first place. Parsing such a manifest incrementally is
complicated. If we'd gone with my original design where the manifest
consisted of a bunch of lines each of which could be parsed
separately, we'd already have incremental parsing and wouldn't be
faced with these difficult trade-offs.

Unfortunately, I'm not in a good position either to figure out how to
make your prototype faster, or to evaluate how painful it is to keep
both in the source tree. It's probably worth considering how likely it
is that we'd be interested in incremental JSON parsing in other cases.
Maintaining two JSON parsers is probably not a lot of fun regardless,
but if each of them gets used for a bunch of things, that feels less
bad than if one of them gets used for a bunch of things and the other
one only ever gets used for backup manifests. Would we be interested
in JSON-format database dumps? Incrementally parsing JSON LOBs? Either
seems tenuous, but those are examples of the kind of thing that could
make us happy to have incremental JSON parsing as a general facility.

If nobody's very excited by those kinds of use cases, then this just
boils down to whether we want to (a) accept that users with very large
numbers of relation files won't be able to use pg_verifybackup or
incremental backup, (b) accept that we're going to maintain a second
JSON parser just to enable that use case and with no other benefit, or
(c) undertake to change the manifest format to something that is
straightforward to parse incrementally. I think (a) is reasonable
short term, but at some point I think we should do better. I'm not
really that enthused about (c) because it means more work for me and
possibly more arguing, but if (b) is going to cause a lot of hassle
then we might need to consider it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-10-25 Thread Andrew Dunstan


On 2023-10-24 Tu 12:08, Robert Haas wrote:



It looks like each file entry in the manifest takes about 150 bytes, so
1 GB would allow for 1024**3/150 = 7158278 files.  That seems fine for now?

I suspect a few people have more files than that. They'll just have to
wait to use this feature until we get incremental JSON parsing (or
undo the decision to use JSON for the manifest).



Robert asked me to work on this quite some time ago, and most of this 
work was done last year.


Here's my WIP for an incremental JSON parser. It works and passes all 
the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book. 
The reason I haven't posted it before is that it's about 50% slower in 
pure parsing speed than the current recursive descent parser in my 
testing. I've tried various things to make it faster, but haven't made 
much impact. One of my colleagues is going to take a fresh look at it, 
but maybe someone on the list can see where we can save some cycles.


If we can't make it faster, I guess we could use the RD parser for 
non-incremental cases and only use the non-RD parser for incremental, 
although that would be a bit sad. However, I don't think we can make the 
RD parser suitable for incremental parsing - there's too much state 
involved in the call stack.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


json_incremental_parser-2023-09-25.patch.gz
Description: application/gzip


Re: trying again to get incremental backup

2023-10-24 Thread Robert Haas
On Tue, Oct 24, 2023 at 10:53 AM Peter Eisentraut  wrote:
> The easiest answer is to have it off by default.  Let people figure out
> what works for them.  There are various factors like storage, network,
> server performance, RTO that will determine what combination of full
> backup, incremental backup, and WAL replay will satisfy someone's
> requirements.  I suppose tests could be set up to determine this to some
> degree.  But we could also start slow and let people figure it out
> themselves.  When pg_basebackup was added, it was also disabled by default.
>
> If we think that 7d is a good setting, then I would suggest to consider,
> like 10d.  Otherwise, if you do a weekly incremental backup and you have
> a time change or a hiccup of some kind one day, you lose your backup
> sequence.
>
> Another possible answer is, like, 400 days?  Because why not?  What is a
> reasonable upper limit for this?

In concept, I don't think this should even be time-based. What you
should do is remove WAL summaries once you know that you've taken as
many incremental backups that might use them as you're ever going to
do. But PostgreSQL itself doesn't have any way of knowing what your
intended backup patterns are. If your incremental backup fails on
Monday night and you run it manually on Tuesday morning, you might
still rerun it as an incremental backup. If it fails every night for a
month and you finally realize and decide to intervene manually, maybe
you want a new full backup at that point. It's been a month. But on
the other hand maybe you don't. There's no time-based answer to this
question that is really correct, and I think it's quite possible that
your backup software might want to shut off time-based deletion
altogether and make its own decisions about when to nuke summaries.
However, I also don't think that's a great default setting. It could
easily lead to people wasting a bunch of disk space for no reason.

As far as the 7d value, I figured that nightly incremental backups
would be fairly common. If we think weekly incremental backups would
be common, then pushing this out to 10d would make sense. While
there's no reason you couldn't take an annual incremental backup, and
thus want a 400d value, it seems like a pretty niche use case.

Note that whether to remove summaries is a separate question from
whether to generate them in the first place. Right now, I have
wal_summarize_mb controlling whether they get generated in the first
place, but as I noted in another recent email, that isn't an entirely
satisfying solution.

> It looks like each file entry in the manifest takes about 150 bytes, so
> 1 GB would allow for 1024**3/150 = 7158278 files.  That seems fine for now?

I suspect a few people have more files than that. They'll just have to
wait to use this feature until we get incremental JSON parsing (or
undo the decision to use JSON for the manifest).

> The current user experience of pg_basebackup is that it waits possibly a
> long time for a checkpoint, and there is an option to make it go faster,
> but there is no timeout AFAICT.  Is this substantially different?  Could
> we just let it wait forever?

We could. I installed the timeout because the first versions of the
feature were buggy, and I didn't like having my tests hang forever
with no indication of what had gone wrong. At least in my experience
so far, the time spent waiting for WAL summarization is typically
quite short, because only the WAL that needs to be summarized is
whatever was emitted since the last time it woke up up through the
start LSN of the backup. That's probably not much, and the next time
the summarizer wakes up, the file should appear within moments. So,
it's a little different from the checkpoint case, where long waits are
expected.

> Also, does waiting for checkpoint and WAL summarization happen in
> parallel?  If so, what if it starts a checkpoint that might take 15 min
> to complete, and then after 60 seconds it kicks you off because the WAL
> summarization isn't ready.  That might be wasteful.

It is not parallel. The trouble is, we don't really have any way to
know whether WAL summarization is going to fail for whatever reason.
We don't expect that to happen, but if somebody changes the
permissions on the WAL summary directory or attaches gdb to the WAL
summarizer process or something of that sort, it might.

We could check at the outset whether we seem to be really far behind
and emit a warning. For instance, if we're 1TB behind on WAL
summarization when the checkpoint is requested, chances are something
is busted and we're probably not going to catch up any time soon. We
could warn the user about that and let them make their own decision
about whether to cancel. But, that idea won't help in unattended
operation, and the threshold for "really far behind" is not very
clear. It might be better to wait until we get more experience with
how things actually fail before doing too much engineering here, but
I'm also open to 

Re: trying again to get incremental backup

2023-10-24 Thread Peter Eisentraut

On 04.10.23 22:08, Robert Haas wrote:

- I would like some feedback on the generation of WAL summary files.
Right now, I have it enabled by default, and summaries are kept for a
week. That means that, with no additional setup, you can take an
incremental backup as long as the reference backup was taken in the
last week. File removal is governed by mtimes, so if you change the
mtimes of your summary files or whack your system clock around, weird
things might happen. But obviously this might be inconvenient. Some
people might not want WAL summary files to be generated at all because
they don't care about incremental backup, and other people might want
them retained for longer, and still other people might want them to be
not removed automatically or removed automatically based on some
criteria other than mtime. I don't really know what's best here. I
don't think the default policy that the patches implement is
especially terrible, but it's just something that I made up and I
don't have any real confidence that it's wonderful.


The easiest answer is to have it off by default.  Let people figure out 
what works for them.  There are various factors like storage, network, 
server performance, RTO that will determine what combination of full 
backup, incremental backup, and WAL replay will satisfy someone's 
requirements.  I suppose tests could be set up to determine this to some 
degree.  But we could also start slow and let people figure it out 
themselves.  When pg_basebackup was added, it was also disabled by default.


If we think that 7d is a good setting, then I would suggest to consider, 
like 10d.  Otherwise, if you do a weekly incremental backup and you have 
a time change or a hiccup of some kind one day, you lose your backup 
sequence.


Another possible answer is, like, 400 days?  Because why not?  What is a 
reasonable upper limit for this?



- It's regrettable that we don't have incremental JSON parsing; I
think that means anyone who has a backup manifest that is bigger than
1GB can't use this feature. However, that's also a problem for the
existing backup manifest feature, and as far as I can see, we have no
complaints about it. So maybe people just don't have databases with
enough relations for that to be much of a live issue yet. I'm inclined
to treat this as a non-blocker,


It looks like each file entry in the manifest takes about 150 bytes, so 
1 GB would allow for 1024**3/150 = 7158278 files.  That seems fine for now?



- Right now, I have a hard-coded 60 second timeout for WAL
summarization. If you try to take an incremental backup and the WAL
summaries you need don't show up within 60 seconds, the backup times
out. I think that's a reasonable default, but should it be
configurable? If yes, should that be a GUC or, perhaps better, a
pg_basebackup option?


The current user experience of pg_basebackup is that it waits possibly a 
long time for a checkpoint, and there is an option to make it go faster, 
but there is no timeout AFAICT.  Is this substantially different?  Could 
we just let it wait forever?


Also, does waiting for checkpoint and WAL summarization happen in 
parallel?  If so, what if it starts a checkpoint that might take 15 min 
to complete, and then after 60 seconds it kicks you off because the WAL 
summarization isn't ready.  That might be wasteful.



- I'm curious what people think about the pg_walsummary tool that is
included in 0006. I think it's going to be fairly important for
debugging, but it does feel a little bit bad to add a new binary for
something pretty niche.


This seems fine.

Is the WAL summary file format documented anywhere in your patch set 
yet?  My only thought was, maybe the file format could be human-readable 
(more like backup_label) to avoid this.  But maybe not.






Re: trying again to get incremental backup

2023-10-24 Thread Robert Haas
On Mon, Oct 23, 2023 at 7:56 PM David Steele  wrote:
> > I also think a lot of the use of the low-level API is
> > driven by it being just too darn slow to copy the whole database, and
> > incremental backup can help with that in some circumstances.
>
> I would argue that restore performance is *more* important than backup
> performance and this patch is a step backward in that regard. Backups
> will be faster and less space will be used in the repository, but
> restore performance is going to suffer. If the deltas are very small the
> difference will probably be negligible, but as the deltas get large (and
> especially if there are a lot of them) the penalty will be more noticeable.

I think an awful lot depends here on whether the repository is local
or remote. If you have filesystem access to wherever the backups are
stored anyway, I don't think that using pg_combinebackup to write out
a new data directory is going to be much slower than copying one data
directory from the repository to wherever you'd actually use the
backup. It may be somewhat slower because we do need to access some
data in every involved backup, but I don't think it should be vastly
slower because we don't have to read every backup in its entirety. For
each file, we read the (small) header of the newest incremental file
and every incremental file that precedes it until we find a full file.
Then, we construct a map of which blocks need to be read from which
sources and read only the required blocks from each source. If all the
blocks are coming from a single file (because there are no incrementals
for a certain file, or they contain no blocks) then we just copy the
entire source file in one shot, which can be optimized using the same
tricks we use elsewhere. Inevitably, this is going to read more data
and do more random I/O than just a flat copy of a directory, but it's
not terrible. The overall amount of I/O should be a lot closer to the
size of the output directory than to the sum of the sizes of the input
directories.
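
To make that a bit more concrete, here is a toy sketch of the per-file
idea with entirely made-up in-memory structures (the real pg_combinebackup
reads incremental-file headers and does block-level I/O, so don't read
too much into the details): for each block, the newest backup in the
chain that contains it wins, and the full backup is the fallback.

#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 8

typedef struct BackupFile
{
    bool    is_full;                /* full copy, or incremental? */
    bool    has_block[NBLOCKS];     /* which blocks this backup holds */
    char    block[NBLOCKS];         /* toy one-byte "block" contents */
} BackupFile;

int
main(void)
{
    /* chain[0] is the full backup; chain[1..2] are incrementals, oldest first */
    BackupFile  chain[3] = {0};
    int         source[NBLOCKS];    /* block -> index of backup supplying it */

    chain[0].is_full = true;
    for (int b = 0; b < NBLOCKS; b++)
    {
        chain[0].has_block[b] = true;
        chain[0].block[b] = 'F';
    }
    /* first incremental modified blocks 1 and 4; second modified block 4 only */
    chain[1].has_block[1] = chain[1].has_block[4] = true;
    chain[1].block[1] = chain[1].block[4] = 'a';
    chain[2].has_block[4] = true;
    chain[2].block[4] = 'b';

    /* build the map: walk back from the newest incremental to the full file */
    for (int b = 0; b < NBLOCKS; b++)
    {
        source[b] = 0;              /* default: read from the full backup */
        for (int i = 2; i >= 1; i--)
        {
            if (chain[i].has_block[b])
            {
                source[b] = i;
                break;
            }
        }
    }

    /* "write" the reconstructed file using only the required sources */
    for (int b = 0; b < NBLOCKS; b++)
        printf("block %d comes from backup %d ('%c')\n",
               b, source[b], chain[source[b]].block[b]);
    return 0;
}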

Now, if the repository is remote, and you have to download all of
those backups first, and then run pg_combinebackup on them afterward,
that is going to be unpleasant, unless the incremental backups are all
quite small. Possibly this could be addressed by teaching
pg_combinebackup to do things like accessing data over HTTP and SSH,
and relatedly, looking inside tarfiles without needing them unpacked.
For now, I've left those as ideas for future improvement, but I think
potentially they could address some of your concerns here. A
difficulty is that there are a lot of protocols that people might want
to use to push bytes around, and it might be hard to keep up with the
march of progress.

I do agree, though, that there's no such thing as a free lunch. I
wouldn't recommend to anyone that they plan to restore from a chain of
100 incremental backups. Not only might it be slow, but the
opportunities for something to go wrong are magnified. Even if you've
automated everything well enough that there's no human error involved,
what if you've got a corrupted file somewhere? Maybe that's not likely
in absolute terms, but the more files you've got, the more likely it
becomes. What I'd suggest someone do instead is periodically do
pg_combinebackup full_reference_backup oldest_incremental -o
new_full_reference_backup; rm -rf full_reference_backup; mv
new_full_reference_backup full_reference_backup. The new full
reference backup is intended to still be usable for restoring
incrementals based on the incremental it replaced. I hope that, if
people use the feature well, this should limit the need for really
long backup chains. I am sure, though, that some people will use it
poorly. Maybe there's room for more documentation on this topic.

> I was concerned with the difficulty of trying to stage the correct
> backups for pg_combinebackup, not whether it would recognize that the
> needed data was not available and then error appropriately. The latter
> is surmountable within pg_combinebackup but the former is left up to the
> user.

Indeed.

> One note regarding the patches. I feel like
> v5-0005-Prototype-patch-for-incremental-backup should be split to have
> the WAL summarizer as one patch and the changes to base backup as a
> separate patch.
>
> It might not be useful to commit one without the other, but it would
> make for an easier read. Just my 2c.

Yeah, maybe so. I'm not quite ready to commit to doing that split as
of this writing but I will think about it and possibly do it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-10-23 Thread David Steele

On 10/23/23 11:44, Robert Haas wrote:

On Fri, Oct 20, 2023 at 11:30 AM David Steele  wrote:


I don't plan to stand in your way on this feature. I'm reviewing what
patches I can out of courtesy and to be sure that nothing adjacent to
your work is being affected. My apologies if my reviews are not meeting
your expectations, but I am contributing as my time constraints allow.


Sorry, I realize reading this response that I probably didn't do a
very good job writing that email and came across sounding like a jerk.
Possibly, I actually am a jerk. Whether it just sounded like it or I
actually am, I apologize. 


That was the way it came across, though I prefer to think it was 
unintentional. I certainly understand how frustrating dealing with a 
large and uncertain patch can be. Either way, I appreciate the apology.


Now onward...


But your last paragraph here gets at my real
question, which is whether you were going to try to block the feature.
I recognize that we have different priorities when it comes to what
would make most sense to implement in PostgreSQL, and that's OK, or at
least, it's OK with me. 


This seem perfectly natural to me.


I also don't have any particular expectation
about how much you should review the patch or in what level of detail,
and I have sincerely appreciated your feedback thus far. If you are
able to continue to provide more, that's great, and if that's not,
well, you're not obligated. What I was concerned about was whether
that review was a precursor to a vigorous attempt to keep the main
patch from getting committed, because if that was going to be the
case, then I'd like to surface that conflict sooner rather than later.
It sounds like that's not an issue, which is great.


Overall I would say I'm not strongly for or against the patch. I think 
it will be very difficult to use in a manual fashion, but automation is 
the way to go in general, so that's not necessarily an argument against.


However, this is an area of great interest to me so I do want to at 
least make sure nothing is being impacted adjacent to the main goal of 
this patch. So far I have seen no sign of that, but that has been a 
primary goal of my reviews.



At the risk of drifting into the fraught question of what I *should*
be implementing rather than the hopefully-less-fraught question of
whether what I am actually implementing is any good, I see incremental
backup as a way of removing some of the use cases for the low-level
backup API. If you said "but people still will have lots of reasons to
use it," I would agree; and if you said "people can still screw things
up with pg_basebackup," I would also agree. Nonetheless, most of the
disasters I've personally seen have stemmed from the use of the
low-level API rather than from the use of pg_basebackup, though there
are exceptions. 


This all makes sense to me.


I also think a lot of the use of the low-level API is
driven by it being just too darn slow to copy the whole database, and
incremental backup can help with that in some circumstances. 


I would argue that restore performance is *more* important than backup 
performance and this patch is a step backward in that regard. Backups 
will be faster and less space will be used in the repository, but 
restore performance is going to suffer. If the deltas are very small the 
difference will probably be negligible, but as the deltas get large (and 
especially if there are a lot of them) the penalty will be more noticeable.



Also, I
have worked fairly hard to try to make sure that if you misuse
pg_combinebackup, or fail to use it altogether, you'll get an error
rather than silent data corruption. I would be interested to hear
about scenarios where the checks that I've implemented can be defeated
by something that is plausibly described as stupidity rather than
malice. I'm not sure we can fix all such cases, but I'm very alert to
the horror that will befall me if user error looks exactly like a bug
in the code. For my own sanity, we have to be able to distinguish
those cases. 


I was concerned with the difficulty of trying to stage the correct 
backups for pg_combinebackup, not whether it would recognize that the 
needed data was not available and then error appropriately. The latter 
is surmountable within pg_combinebackup but the former is left up to the 
user.



Moreover, we also need to be able to distinguish
backup-time bugs from reassembly-time bugs, which is why I've got the
pg_walsummary tool, and why pg_combinebackup has the ability to emit
fairly detailed debugging output. I anticipate those things being
useful in investigating bug reports when they show up. I won't be too
surprised if it turns out that more work on sanity-checking and/or
debugging tools is needed, but I think your concern about people
misusing stuff is bang on target and I really want to do whatever we
can to avoid that when possible and detect it when it happens.


The ability of users to misuse tools is, of course, 

Re: trying again to get incremental backup

2023-10-23 Thread Robert Haas
On Fri, Oct 20, 2023 at 11:30 AM David Steele  wrote:
> Then I'm fine with it as is.

OK, thanks.

> In my view this feature puts the cart way before the horse. I'd think
> higher priority features might be parallelism, a backup repository,
> expiration management, archiving, or maybe even a restore command.
>
> It seems the only goal here is to make pg_basebackup a tool for external
> backup software to use, which might be OK, but I don't believe this
> feature really advances pg_basebackup as a usable piece of stand-alone
> software. If people really think that start/stop backup is too
> complicated an interface how are they supposed to track page
> incrementals and get them to a place where pg_combinebackup can put them
> backup together? If automation is required to use this feature,
> shouldn't pg_basebackup implement that automation?
>
> I have plenty of thoughts about the implementation as well, but I have a
> lot on my plate right now and I don't have time to get into it.
>
> I don't plan to stand in your way on this feature. I'm reviewing what
> patches I can out of courtesy and to be sure that nothing adjacent to
> your work is being affected. My apologies if my reviews are not meeting
> your expectations, but I am contributing as my time constraints allow.

Sorry, I realize reading this response that I probably didn't do a
very good job writing that email and came across sounding like a jerk.
Possibly, I actually am a jerk. Whether it just sounded like it or I
actually am, I apologize. But your last paragraph here gets at my real
question, which is whether you were going to try to block the feature.
I recognize that we have different priorities when it comes to what
would make most sense to implement in PostgreSQL, and that's OK, or at
least, it's OK with me. I also don't have any particular expectation
about how much you should review the patch or in what level of detail,
and I have sincerely appreciated your feedback thus far. If you are
able to continue to provide more, that's great, and if that's not,
well, you're not obligated. What I was concerned about was whether
that review was a precursor to a vigorous attempt to keep the main
patch from getting committed, because if that was going to be the
case, then I'd like to surface that conflict sooner rather than later.
It sounds like that's not an issue, which is great.

At the risk of drifting into the fraught question of what I *should*
be implementing rather than the hopefully-less-fraught question of
whether what I am actually implementing is any good, I see incremental
backup as a way of removing some of the use cases for the low-level
backup API. If you said "but people still will have lots of reasons to
use it," I would agree; and if you said "people can still screw things
up with pg_basebackup," I would also agree. Nonetheless, most of the
disasters I've personally seen have stemmed from the use of the
low-level API rather than from the use of pg_basebackup, though there
are exceptions. I also think a lot of the use of the low-level API is
driven by it being just too darn slow to copy the whole database, and
incremental backup can help with that in some circumstances. Also, I
have worked fairly hard to try to make sure that if you misuse
pg_combinebackup, or fail to use it altogether, you'll get an error
rather than silent data corruption. I would be interested to hear
about scenarios where the checks that I've implemented can be defeated
by something that is plausibly described as stupidity rather than
malice. I'm not sure we can fix all such cases, but I'm very alert to
the horror that will befall me if user error looks exactly like a bug
in the code. For my own sanity, we have to be able to distinguish
those cases. Moreover, we also need to be able to distinguish
backup-time bugs from reassembly-time bugs, which is why I've got the
pg_walsummary tool, and why pg_combinebackup has the ability to emit
fairly detailed debugging output. I anticipate those things being
useful in investigating bug reports when they show up. I won't be too
surprised if it turns out that more work on sanity-checking and/or
debugging tools is needed, but I think your concern about people
misusing stuff is bang on target and I really want to do whatever we
can to avoid that when possible and detect it when it happens.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-10-20 Thread David Steele

On 10/19/23 16:00, Robert Haas wrote:

On Thu, Oct 19, 2023 at 3:18 PM David Steele  wrote:

0001 looks pretty good to me. The only thing I find a little troublesome
is the repeated construction of file names with/without segment numbers
in ResetUnloggedRelationsInDbspaceDir(), .e.g.:

+   if (segno == 0)
+   snprintf(dstpath, sizeof(dstpath), "%s/%u",
+dbspacedirname, relNumber);
+   else
+   snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+dbspacedirname, relNumber, 
segno);


If this happened three times I'd definitely want a helper function, but
even with two I think it would be a bit nicer.


Personally I think that would make the code harder to read rather than
easier. I agree that repeating code isn't great, but this is a
relatively brief idiom and pretty self-explanatory. If other people
agree with you I can change it, but to me it's not an improvement.


Then I'm fine with it as is.


0002 is definitely a good idea. FWIW pgBackRest does this conversion but
also errors if it does not succeed. We have never seen a report of this
error happening in the wild, so I think it must be pretty rare if it
does happen.


Cool, but ... how about the main patch set? It's nice to get some of
these refactoring bits and pieces out of the way, but if I spend the
effort to work out what I think are the right answers to the remaining
design questions for the main patch set and then find out after I've
done all that that you have massive objections, I'm going to be
annoyed. I've been trying to get this feature into PostgreSQL for
years, and if I don't succeed this time, I want the reason to be
something better than "well, I didn't find out that David disliked X
until five minutes before I was planning to type 'git push'."


I simply have not had time to look at the main patch set in any detail.


I'm not really concerned about detailed bug-hunting in the main
patches just yet. The time for that will come. But if you have views
on how to resolve the design questions that I mentioned in a couple of
emails back, or intend to advocate vigorously against the whole
concept for some reason, let's try to sort that out sooner rather than
later.


In my view this feature puts the cart way before the horse. I'd think 
higher priority features might be parallelism, a backup repository, 
expiration management, archiving, or maybe even a restore command.


It seems the only goal here is to make pg_basebackup a tool for external 
backup software to use, which might be OK, but I don't believe this 
feature really advances pg_basebackup as a usable piece of stand-alone 
software. If people really think that start/stop backup is too 
complicated an interface, how are they supposed to track page 
incrementals and get them to a place where pg_combinebackup can put them 
back together? If automation is required to use this feature, 
shouldn't pg_basebackup implement that automation?


I have plenty of thoughts about the implementation as well, but I have a 
lot on my plate right now and I don't have time to get into it.


I don't plan to stand in your way on this feature. I'm reviewing what 
patches I can out of courtesy and to be sure that nothing adjacent to 
your work is being affected. My apologies if my reviews are not meeting 
your expectations, but I am contributing as my time constraints allow.


Regards,
-David




Re: trying again to get incremental backup

2023-10-20 Thread Jakub Wartak
Hi Robert,

On Wed, Oct 4, 2023 at 10:09 PM Robert Haas  wrote:
>
> On Tue, Oct 3, 2023 at 2:21 PM Robert Haas  wrote:
> > Here's a new patch set, also addressing Jakub's observation that
> > MINIMUM_VERSION_FOR_WAL_SUMMARIES needed updating.
>
> Here's yet another new version.[..]

Okay, so some more good news - related to patch version #4. A
not-so-tiny stress test consisting of a pgbench run for 24h straight
(with incremental backups every 2h, on top of an initial full backup),
followed by two PITRs (one not using the incremental backups and one
using them, to illustrate the performance point - and potentially spot any
errors in between). In both cases it worked fine. Pgbench has this
behaviour that it doesn't cause space growth over time - it produces
lots of WAL instead. Some stats:

START DBSIZE: ~3.3GB (pgbench -i -s 200 --partitions=8)
END DBSIZE: ~4.3GB
RUN DURATION: 24h (pgbench -P 1 -R 100 -T 86400)
WALARCHIVES-24h: 77GB
FULL-DB-BACKUP-SIZE: 3.4GB
INCREMENTAL-BACKUP-11-SIZE: 3.5GB
Env: Azure VM D4s (4VCPU), Debian 11, gcc 10.2, normal build (asserts
and debug disabled)
The increments were taken every 2h just to see if they would fail for
any reason - they did not.

PITR RTO RESULTS (copy/pg_combinebackup time + recovery time):
1. time to restore from fullbackup (+ recovery of 24h WAL[77GB]): 53s
+ 4640s =~ 78min
2. time to restore from fullbackup+incremental backup from 2h ago (+
recovery of 2h WAL [5.4GB]): 68s + 190s =~ 4min18s

I could probably pre-populate the DB with 1TB of cold data (not to be
touched by pgbench at all), just for the sake of argument, and that
could probably demonstrate how space-efficient the
incremental backup can be, but most of the time would be wasted on
copying the 1TB here...

> - I would like some feedback on the generation of WAL summary files.
> Right now, I have it enabled by default, and summaries are kept for a
> week. That means that, with no additional setup, you can take an
> incremental backup as long as the reference backup was taken in the
> last week.

I've just noticed one thing while recovery is in progress: is
summarization working during recovery - in the background - an
expected behaviour? I'm wondering about that because, after a freshly
restored and recovered DB, one would need to create a *new* full
backup, and only from that point would new summaries have any use?
Sample log:

2023-10-20 11:10:02.288 UTC [64434] LOG:  restored log file
"000100020022" from archive
2023-10-20 11:10:02.599 UTC [64434] LOG:  restored log file
"000100020023" from archive
2023-10-20 11:10:02.769 UTC [64446] LOG:  summarized WAL on TLI 1 from
2/139B1130 to 2/239B1518
2023-10-20 11:10:02.923 UTC [64434] LOG:  restored log file
"000100020024" from archive
2023-10-20 11:10:03.193 UTC [64434] LOG:  restored log file
"000100020025" from archive
2023-10-20 11:10:03.345 UTC [64432] LOG:  restartpoint starting: wal
2023-10-20 11:10:03.407 UTC [64446] LOG:  summarized WAL on TLI 1 from
2/239B1518 to 2/25B609D0
2023-10-20 11:10:03.521 UTC [64434] LOG:  restored log file
"000100020026" from archive
2023-10-20 11:10:04.429 UTC [64434] LOG:  restored log file
"000100020027" from archive

>
> - On a related note, I haven't yet tested this on a standby, which is
> a thing that I definitely need to do. I don't know of a reason why it
> shouldn't be possible for all of this machinery to work on a standby
> just as it does on a primary, but then we need the WAL summarizer to
> run there too, which could end up being a waste if nobody ever tries
> to take an incremental backup. I wonder how that should be reflected
> in the configuration. We could do something like what we've done for
> archive_mode, where on means "only on if this is a primary" and you
> have to say always if you want it to run on standbys as well ... but
> I'm not sure if that's a design pattern that we really want to
> replicate into more places. I'd be somewhat inclined to just make
> whatever configuration parameters we need to configure this thing on
> the primary also work on standbys, and you can set each server up as
> you please. But I'm open to other suggestions.

I'll try to play with some standby restores in future, stay tuned.

Regards,
-J.




Re: trying again to get incremental backup

2023-10-19 Thread Robert Haas
On Thu, Oct 19, 2023 at 3:18 PM David Steele  wrote:
> 0001 looks pretty good to me. The only thing I find a little troublesome
> is the repeated construction of file names with/without segment numbers
> in ResetUnloggedRelationsInDbspaceDir(), .e.g.:
>
> +   if (segno == 0)
> +   snprintf(dstpath, sizeof(dstpath), "%s/%u",
> +dbspacedirname, relNumber);
> +   else
> +   snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
> +dbspacedirname, relNumber, 
> segno);
>
>
> If this happened three times I'd definitely want a helper function, but
> even with two I think it would be a bit nicer.

Personally I think that would make the code harder to read rather than
easier. I agree that repeating code isn't great, but this is a
relatively brief idiom and pretty self-explanatory. If other people
agree with you I can change it, but to me it's not an improvement.

> 0002 is definitely a good idea. FWIW pgBackRest does this conversion but
> also errors if it does not succeed. We have never seen a report of this
> error happening in the wild, so I think it must be pretty rare if it
> does happen.

Cool, but ... how about the main patch set? It's nice to get some of
these refactoring bits and pieces out of the way, but if I spend the
effort to work out what I think are the right answers to the remaining
design questions for the main patch set and then find out after I've
done all that that you have massive objections, I'm going to be
annoyed. I've been trying to get this feature into PostgreSQL for
years, and if I don't succeed this time, I want the reason to be
something better than "well, I didn't find out that David disliked X
until five minutes before I was planning to type 'git push'."

I'm not really concerned about detailed bug-hunting in the main
patches just yet. The time for that will come. But if you have views
on how to resolve the design questions that I mentioned in a couple of
emails back, or intend to advocate vigorously against the whole
concept for some reason, let's try to sort that out sooner rather than
later.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-10-19 Thread David Steele

On 10/19/23 12:05, Robert Haas wrote:

On Wed, Oct 4, 2023 at 4:08 PM Robert Haas  wrote:

Clearly there's a good amount of stuff to sort out here, but we've
still got quite a bit of time left before feature freeze so I'd like
to have a go at it. Please let me know your thoughts, if you have any.


Apparently, nobody has any thoughts, but here's an updated patch set
anyway. The main change, other than rebasing, is that I did a bunch
more documentation work on the main patch (0005). I'm much happier
with it now, although I expect it may need more adjustments here and
there as outstanding design questions get settled.

After some thought, I think that it should be fine to commit 0001 and
0002 as independent refactoring patches, and I plan to go ahead and do
that pretty soon unless somebody objects.


0001 looks pretty good to me. The only thing I find a little troublesome 
is the repeated construction of file names with/without segment numbers 
in ResetUnloggedRelationsInDbspaceDir(), .e.g.:


+   if (segno == 0)
+   snprintf(dstpath, sizeof(dstpath), "%s/%u",
+dbspacedirname, relNumber);
+   else
+   snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+dbspacedirname, relNumber, 
segno);


If this happened three times I'd definitely want a helper function, but 
even with two I think it would be a bit nicer.
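
For illustration, the sort of helper I mean could look something like this 
(the name and exact signature are made up, nothing here beyond snprintf):

static void
unlogged_relation_dstpath(char *dstpath, size_t dstlen,
                          const char *dbspacedirname,
                          unsigned relNumber, unsigned segno)
{
    if (segno == 0)
        snprintf(dstpath, dstlen, "%s/%u", dbspacedirname, relNumber);
    else
        snprintf(dstpath, dstlen, "%s/%u.%u",
                 dbspacedirname, relNumber, segno);
}

Then both call sites collapse to a single line each.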


0002 is definitely a good idea. FWIW pgBackRest does this conversion but 
also errors if it does not succeed. We have never seen a report of this 
error happening in the wild, so I think it must be pretty rare if it 
does happen.


Regards,
-David




Re: trying again to get incremental backup

2023-10-04 Thread Robert Haas
On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak
 wrote:
> all those basic tests had GOOD results. Please find attached. I'll try
> to schedule some more realistic (in terms of workload and sizes) tests
> in a couple of days + maybe have some fun with cross-backup-and-restores
> across standbys.

That's awesome! Thanks for testing! This can definitely benefit from
any amount of beating on it that people wish to do. It's a complex,
delicate area that risks data loss.

> If that is still an area open for discussion: wouldn't it be better to
> just specify an LSN, as that would allow resyncing a standby across a major
> lag where the WAL to replay would be enormous? Given that we had a
> primary->standby setup where the standby was stuck on some LSN, right now
> it would be:
> 1) calculate backup manifest of desynced 10TB standby (how? using
> which tool?)  - even if possible, that means reading 10TB of data
> instead of just putting a number, isn't it?
> 2) backup primary with such incremental backup >= LSN
> 3) copy the incremental backup to standby
> 4) apply it to the impaired standby
> 5) restart the WAL replay

Hmm. I wonder if this would even be a safe procedure. I admit that I
can't quite see a problem with it, but sometimes I'm kind of dumb.

> Also maybe it's too early to ask, but wouldn't it be nice if we could
> have a future option in pg_combinebackup to avoid double writes when
> used from restore hosts (right now we first need to reconstruct the
> original datadir from full and incremental backups on the host hosting
> the backups and then TRANSFER it again to the target host?). So something
> like that could work well from restorehost: pg_combinebackup
> /tmp/backup1 /tmp/incbackup2 /tmp/incbackup3 -O tar -o - | ssh
> dbserver 'tar xvf -C /path/to/restored/cluster - ' . The bad thing is
> that such a pipe prevents parallelism from day 1 and I'm afraid I do
> not have a better easy idea on how to have both at the same time in
> the long term.

I don't think it's too early to ask for this, but I do think it's too
early for you to get it. ;-)

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-10-03 Thread Robert Haas
On Fri, Sep 1, 2023 at 10:30 AM Robert Haas  wrote:
> > No objections to 0001/0002.
>
> Cool.

Nobody else objected either, so I went ahead and committed those. I'll
rebase the rest of the patches on top of the latest master and repost,
hopefully after addressing some of the other review comments from
Dilip and Jakub.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-09-28 Thread Jakub Wartak
On Wed, Aug 30, 2023 at 4:50 PM Robert Haas  wrote:
[..]

I've played a little bit more with this second batch of patches on
e8d74ad625f7344f6b715254d3869663c1569a51 @ 31Aug (days before the wait
events refactor):

test_across_wallevelminimal.sh
test_many_incrementals_dbcreate.sh
test_many_incrementals.sh
test_multixact.sh
test_pending_2pc.sh
test_reindex_and_vacuum_full.sh
test_truncaterollback.sh
test_unlogged_table.sh

all those basic tests had GOOD results. Please find attached. I'll try
to schedule some more realistic (in terms of workload and sizes) tests
in a couple of days + maybe have some fun with cross-backup-and-restores
across standbys. As per the earlier doubt: the raw wal_level = minimal
situation shouldn't be a concern, sadly, because it requires
max_wal_senders==0, while pg_basebackup requires it above 0 (due to
"FATAL:  number of requested standby connections exceeds
max_wal_senders (currently 0)").

I also wanted to introduce corruption into the pg_walsummaries files, but
later saw in the code that they are already covered by a CRC32, cool.

In v07:
> +#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 16

17 ?

> A related design question is whether we should really be sending the
> whole backup manifest to the server at all. If it turns out that we
> don't really need anything except for the LSN of the previous backup,
> we could send that one piece of information instead of everything. On
> the other hand, if we need the list of files from the previous backup,
> then sending the whole manifest makes sense.

If that is still an area open for discussion: wouldn't it be better to
just specify an LSN, as that would allow resyncing a standby across a major
lag where the WAL to replay would be enormous? Given that we had a
primary->standby setup where the standby was stuck on some LSN, right now
it would be:
1) calculate backup manifest of desynced 10TB standby (how? using
which tool?)  - even if possible, that means reading 10TB of data
instead of just putting a number, isn't it?
2) backup primary with such incremental backup >= LSN
3) copy the incremental backup to standby
4) apply it to the impaired standby
5) restart the WAL replay

> - We only know how to operate on directories, not tar files. I thought
> about that when working on pg_verifybackup as well, but I didn't do
> anything about it. It would be nice to go back and make that tool work
> on tar-format backups, and this one, too. I don't think there would be
> a whole lot of point trying to operate on compressed tar files because
> you need random access and that seems hard on a compressed file, but
> on uncompressed files it seems at least theoretically doable. I'm not
> sure whether anyone would care that much about this, though, even
> though it does sound pretty cool.

Also maybe it's too early to ask, but wouldn't it be nice if we could
have a future option in pg_combinebackup to avoid double writes when
used from restore hosts (right now we first need to reconstruct the
original datadir from full and incremental backups on the host hosting
the backups and then TRANSFER it again to the target host?). So something
like that could work well from restorehost: pg_combinebackup
/tmp/backup1 /tmp/incbackup2 /tmp/incbackup3 -O tar -o - | ssh
dbserver 'tar xvf -C /path/to/restored/cluster - ' . The bad thing is
that such a pipe prevents parallelism from day 1 and I'm afraid I do
not have a better easy idea on how to have both at the same time in
the long term.

-J.


incrbackuptests-0.1.tgz
Description: application/compressed


Re: trying again to get incremental backup

2023-09-12 Thread Dilip Kumar
On Wed, Aug 30, 2023 at 9:20 PM Robert Haas  wrote:
>
> Meanwhile, here's a rebased set of patches. The somewhat-primitive
> attempts at writing tests are in 0009, but they don't work, for the
> reasons explained above. I think I'd probably like to go ahead and
> commit 0001 and 0002 soon if there are no objections, since I think
> those are good refactorings independently of the rest of this.
>

I have started reading the patch today. I haven't yet completed one
pass, but here are my comments on 0007:

1.

+ BlockNumber relative_block_numbers[RELSEG_SIZE];

This is close to 400kB of memory, so I think it is better to palloc it
instead of keeping it on the stack.
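
I.e. just a sketch of what I mean:

    BlockNumber *relative_block_numbers;

    relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
    /* ... build and use the block list while sending the file ... */
    pfree(relative_block_numbers);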

2.
  /*
  * Try to parse the directory name as an unsigned integer.
  *
- * Tablespace directories should be positive integers that can
- * be represented in 32 bits, with no leading zeroes or trailing
+ * Tablespace directories should be positive integers that can be
+ * represented in 32 bits, with no leading zeroes or trailing
  * garbage. If we come across a name that doesn't meet those
  * criteria, skip it.

Unrelated code refactoring hunk

3.
+typedef struct
+{
+ const char *filename;
+ pg_checksum_context *checksum_ctx;
+ bbsink*sink;
+ size_t bytes_sent;
+} FileChunkContext;

This structure is not used anywhere.

4.
+ * If the file is to be set incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks

/If the file is to be set incrementally/If the file is to be sent incrementally

5.
- while (bytes_done < statbuf->st_size)
+ while (1)
  {
- size_t remaining = statbuf->st_size - bytes_done;
+ /*

I do not really like this change, because after removing this you have
put 2 independent checks for sending the full file [1] and sending it
incrementally [2].  Actually, for sending incrementally,
'statbuf->st_size' is computed from 'num_incremental_blocks' itself,
so why don't we keep this breaking condition in the while loop itself?
That way we can avoid these two separate conditions (see the sketch
after the excerpts below).

[1]
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;

[2]
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
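
Roughly what I have in mind, as a control-flow sketch only (the actual
reads/sends are elided, variable names as in the patch):

    while (bytes_done < statbuf->st_size)
    {
        if (incremental_blocks != NULL)
        {
            /*
             * Read and send block incremental_blocks[ibindex++]; st_size
             * was derived from num_incremental_blocks, so the loop
             * condition above already covers this case.
             */
        }
        else
        {
            /* Read and send the next sequential chunk of the file. */
        }
        /* ... advance bytes_done by the number of bytes just sent ... */
    }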


6.
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;

It would be better to add some comments for these structures, e.g. along
the lines sketched below.
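
For example, something along these lines (the comment text is only my
guess from the field names, not taken from the patch):

typedef struct
{
    TimeLineID  tli;        /* timeline this range of WAL belongs to */
    XLogRecPtr  start_lsn;  /* where the range of WAL to read begins */
    XLogRecPtr  end_lsn;    /* where the range of WAL to read ends */
} backup_wal_range;

typedef struct
{
    uint32      status;     /* how this file is to be handled in the backup */
    const char *path;       /* file path (presumably relative to the data dir) */
    size_t      size;       /* file size in bytes */
} backup_file_entry;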

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-09-05 Thread Robert Haas
On Mon, Sep 4, 2023 at 8:42 AM Dilip Kumar  wrote:
> Can't we think of comparing at the block level, like we can compare
> each block but ignore the content of the hole?

We could do that, but I don't think that's a full solution. I think
I'd end up having to reimplement the equivalent of heap_mask,
btree_mask, et. al. in Perl, which doesn't seem very reasonable. It's
fairly complicated logic even written in C, and doing the right thing
in Perl would be more complex, I think, because it wouldn't have
access to all the same #defines which depend on things like word size
and Endianness and stuff. If we want to allow this sort of comparison,
I feel we should think of changing the C code in some way to make it
work reliably rather than try to paper over the problems in Perl.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-09-04 Thread Dilip Kumar
On Wed, Aug 30, 2023 at 9:20 PM Robert Haas  wrote:

> Unless someone has a brilliant idea that I lack, this suggests to me
> that this whole line of testing is a dead end. I can, of course, write
> tests that compare clusters *logically* -- do the correct relations
> exist, are they accessible, do they have the right contents?

Can't we think of comparing at the block level, like we can compare
each block but ignore the content of the hole?
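
Just to make the idea concrete, a rough C sketch of the hole-skipping part
(assuming the standard page layout from bufpage.h; the real comparison
would of course need to handle more than just the hole):

static bool
pages_equal_ignoring_hole(char *a, char *b)
{
    PageHeader  pha = (PageHeader) a;
    PageHeader  phb = (PageHeader) b;

    if (pha->pd_lower != phb->pd_lower ||
        pha->pd_upper != phb->pd_upper)
        return false;

    /* everything before the hole must match ... */
    if (memcmp(a, b, pha->pd_lower) != 0)
        return false;

    /* ... and everything after the hole, skipping the hole itself */
    return memcmp(a + pha->pd_upper, b + pha->pd_upper,
                  BLCKSZ - pha->pd_upper) == 0;
}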

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-09-01 Thread Robert Haas
Hey, thanks for the reply.

On Thu, Aug 31, 2023 at 6:50 PM David Steele  wrote:
> pg_subtrans, at least, can be ignored since it is excluded from the
> backup and not required for recovery.

I agree...

> Welcome to the club!

Thanks for the welcome, but being a member feels *terrible*. :-)

> I do not. My conclusion back then was that validating a physical
> comparison would be nearly impossible without changes to Postgres to
> make the primary and standby match via replication. Which, by the way, I
> still think would be a great idea. In principle, at least. Replay is
> already a major bottleneck and anything that makes it slower will likely
> not be very popular.

Fair point. But maybe the bigger issue is the work involved. I don't
think zeroing the hole in all cases would likely be that expensive,
but finding everything that can cause the standby to diverge from the
primary and fixing all of it sounds like an unpleasant amount of
effort. Still, it's good to know that I'm not missing something
obvious.

> No objections to 0001/0002.

Cool.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-08-31 Thread David Steele

Hi Robert,

On 8/30/23 10:49, Robert Haas wrote:

In the limited time that I've had to work on this project lately, I've
been trying to come up with a test case for this feature -- and since
I've gotten completely stuck, I thought it might be time to post and
see if anyone else has a better idea. I thought a reasonable test case
would be: Do a full backup. Change some stuff. Do an incremental
backup. Restore both backups and perform replay to the same LSN. Then
compare the files on disk. But I cannot make this work. The first
problem I ran into was that replay of the full backup does a
restartpoint, while the replay of the incremental backup does not.
That results in, for example, pg_subtrans having different contents.


pg_subtrans, at least, can be ignored since it is excluded from the 
backup and not required for recovery.



I'm not sure whether it can also result in data files having different
contents: are changes that we replayed following the last restartpoint
guaranteed to end up on disk when the server is shut down? It wasn't
clear to me that this is the case. I thought maybe I could get both
servers to perform a restartpoint at the same location by shutting
down the primary and then replaying through the shutdown checkpoint,
but that doesn't work because the primary doesn't finish archiving
before shutting down. After some more fiddling I settled (at least for
research purposes) on having the restored backups PITR and promote,
instead of PITR and pause, so that we're guaranteed a checkpoint. But
that just caused me to run into a far worse problem: replay on the
standby doesn't actually create a state that is byte-for-byte
identical to the one that exists on the primary. I quickly discovered
that in my test case, I was ending up with different contents in the
"hole" of a block wherein a tuple got updated. Replay doesn't think
it's important to make the hole end up with the same contents on all
machines that replay the WAL, so I end up with one server that has
more junk in there than the other one and the tests fail.


This is pretty much what I discovered when investigating backup from 
standby back in 2016. My (ultimately unsuccessful) efforts to find a 
clean delta resulted in [1] as I systematically excluded directories 
that are not required for recovery and will not be synced between a 
primary and standby.


FWIW Heikki also made similar attempts at this before me (back then I 
found the thread but I doubt I could find it again) and arrived at 
similar results. We discussed this in person and figured out that we had 
come to more or less the same conclusion. Welcome to the club!



Unless someone has a brilliant idea that I lack, this suggests to me
that this whole line of testing is a dead end. I can, of course, write
tests that compare clusters *logically* -- do the correct relations
exist, are they accessible, do they have the right contents? But I
feel like it would be easy to have bugs that escape detection in such
a test but would be detected by a physical comparison of the clusters.


Agreed, though a matching logical result is still very compelling.


However, such a comparison can only be conducted if either (a) there's
some way to set up the test so that byte-for-byte identical clusters
can be expected or (b) there's some way to perform the comparison that
can distinguish between expected, harmless differences and unexpected,
problematic differences. And at the moment my conclusion is that
neither (a) nor (b) exists. Does anyone think otherwise?


I do not. My conclusion back then was that validating a physical 
comparison would be nearly impossible without changes to Postgres to 
make the primary and standby match via replication. Which, by the way, I 
still think would be a great idea. In principle, at least. Replay is 
already a major bottleneck and anything that makes it slower will likely 
not be very popular.


This would also be great for WAL, since last time I tested the same WAL 
segment can be different between the primary and standby because the 
unused (and recycled) portion at the end is not zeroed as it is on the 
primary (but logically they do match). I would be very happy if somebody 
told me that my info is out of date here and this has been fixed. But 
when I looked at the code it was incredibly tricky to do this because of 
how WAL is replicated.



Meanwhile, here's a rebased set of patches. The somewhat-primitive
attempts at writing tests are in 0009, but they don't work, for the
reasons explained above. I think I'd probably like to go ahead and
commit 0001 and 0002 soon if there are no objections, since I think
those are good refactorings independently of the rest of this.


No objections to 0001/0002.

Regards,
-David

[1] 
http://git.postgresql.org/pg/commitdiff/6ad8ac6026287e3ccbc4d606b6ab6116ccc0eec8





Re: trying again to get incremental backup

2023-06-19 Thread Andres Freund
Hi,

On 2023-06-19 09:46:12 -0400, Robert Haas wrote:
> On Wed, Jun 14, 2023 at 4:40 PM Andres Freund  wrote:
> > > But I'm not sure that's a great approach, because that LSN gap might be
> > > large and then we're duplicating a lot of work that the summarizer has
> > > probably already done most of.
> >
> > I guess that really depends on what the summary granularity is. If you 
> > create
> > a separate summary every 32MB or so, recomputing just the required range
> > shouldn't be too bad.
> 
> Yeah, but I don't think that's the right approach, for two reasons.
> First, one of the things I'm rather worried about is what happens when
> the WAL distance between the prior backup and the incremental backup
> is large. It could be a terabyte. If we have a WAL summary for every
> 32MB of WAL, that's 32k files we have to read, and I'm concerned
> that's too many. Maybe it isn't, but it's something that has really
> been weighing on my mind as I've been thinking through the design
> questions here.

It doesn't have to be a separate file - you could easily summarize ranges
at a higher granularity, storing multiple ranges into a single file with a
coarser naming pattern.


> The files are really very small, and having to open a bazillion tiny little
> files to get the job done sounds lame. Second, I don't see what problem it
> actually solves. Why not just signal the summarizer to write out the
> accumulated data to a file instead of re-doing the work ourselves? Or else
> adopt the WAL-record-at-the-redo-pointer approach, and then the whole thing
> is moot?

The one point for a relatively grainy summarization scheme that I see is that
it would pave the way for using the WAL summary data for other purposes in the
future. That could be done orthogonal to the other solutions to the redo
pointer issues.

Other potential use cases:

- only restore parts of a base backup that aren't going to be overwritten by
  WAL replay
- reconstructing database contents from WAL after data loss
- more efficient pg_rewind
- more efficient prefetching during WAL replay


Greetings,

Andres Freund




Re: trying again to get incremental backup

2023-06-19 Thread Robert Haas
On Wed, Jun 14, 2023 at 4:40 PM Andres Freund  wrote:
> > But I'm not sure that's a great approach, because that LSN gap might be
> > large and then we're duplicating a lot of work that the summarizer has
> > probably already done most of.
>
> I guess that really depends on what the summary granularity is. If you create
> a separate summary every 32MB or so, recomputing just the required range
> shouldn't be too bad.

Yeah, but I don't think that's the right approach, for two reasons.
First, one of the things I'm rather worried about is what happens when
the WAL distance between the prior backup and the incremental backup
is large. It could be a terabyte. If we have a WAL summary for every
32MB of WAL, that's 32k files we have to read, and I'm concerned
that's too many. Maybe it isn't, but it's something that has really
been weighing on my mind as I've been thinking through the design
questions here. The files are really very small, and having to open a
bazillion tiny little files to get the job done sounds lame. Second, I
don't see what problem it actually solves. Why not just signal the
summarizer to write out the accumulated data to a file instead of
re-doing the work ourselves? Or else adopt the
WAL-record-at-the-redo-pointer approach, and then the whole thing is
moot?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-06-15 Thread Dilip Kumar
On Thu, Jun 15, 2023 at 2:11 AM Andres Freund  wrote:
>

> > I'm not really sure. I expect Dilip would be happy to post his patch,
> > and if you'd be willing to have a look at it and express your concerns
> > or lack thereof, that would be super valuable.
>
> Will do. Adding me to CC: might help, I have a backlog unfortunately :(.

Thanks, I have posted it here[1]

[1] 
https://www.postgresql.org/message-id/CAFiTN-s-K%3DmVA%3DHPr_VoU-5bvyLQpNeuzjq1ebPJMEfCJZKFsg%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-06-14 Thread Andres Freund
Hi,

On 2023-06-14 16:10:38 -0400, Robert Haas wrote:
> On Wed, Jun 14, 2023 at 3:47 PM Andres Freund  wrote:
> > Could we just recompute the WAL summary for the [redo, end of chunk] for the
> > relevant summary file?
> 
> I'm not understanding how that would help. If we were going to compute
> a WAL summary on the fly rather than waiting for one to show up on
> disk, what we'd want is [end of last WAL summary that does exist on
> disk, redo].

Oh, right.


> But I'm not sure that's a great approach, because that LSN gap might be
> large and then we're duplicating a lot of work that the summarizer has
> probably already done most of.

I guess that really depends on what the summary granularity is. If you create
a separate summary every 32MB or so, recomputing just the required range
shouldn't be too bad.


> > FWIW, I like the idea of a special WAL record at that point, independent of
> > this feature. It wouldn't be a meaningful overhead compared to the cost of a
> > checkpoint, and it seems like it'd be quite useful for debugging. But I can
> > see uses going beyond that - we occasionally have been discussing 
> > associating
> > additional data with redo points, and that'd be a lot easier to deal with
> > during recovery with such a record.
> >
> > I don't really see a performance and concurrency angle right now - what are
> > you wondering about?
> 
> I'm not really sure. I expect Dilip would be happy to post his patch,
> and if you'd be willing to have a look at it and express your concerns
> or lack thereof, that would be super valuable.

Will do. Adding me to CC: might help, I have a backlog unfortunately :(.


Greetings,

Andres Freund




Re: trying again to get incremental backup

2023-06-14 Thread Matthias van de Meent
On Wed, 14 Jun 2023 at 20:47, Robert Haas  wrote:
>
> A few years ago, I sketched out a design for incremental backup, but
> no patch for incremental backup ever got committed. Instead, the whole
> thing evolved into a project to add backup manifests, which are nice,
> but not as nice as incremental backup would be. So I've decided to
> have another go at incremental backup itself. Attached are some WIP
> patches.

Nice, I like this idea.

> Let me summarize the design and some open questions and
> problems with it that I've discovered. I welcome problem reports and
> test results from others, as well.

Skimming through the 7th patch, I see claims that FSM is not fully
WAL-logged and thus shouldn't be tracked, and so it indeed doesn't
track those changes.
I disagree with that decision: we now have support for custom resource
managers, which may use the various forks for other purposes than
those used in PostgreSQL right now. It would be a shame if data is
lost because of the backup tool ignoring forks because the PostgreSQL
project itself doesn't have post-recovery consistency guarantees in
that fork. So, unless we document that WAL-logged changes in the FSM
fork are actually not recoverable from backup, regardless of the type
of contents, we should still keep track of the changes in the FSM fork
and include the fork in our backups or only exclude those FSM updates
that we know are safe to ignore.

Kind regards,

Matthias van de Meent
Neon, Inc.




Re: trying again to get incremental backup

2023-06-14 Thread Robert Haas
On Wed, Jun 14, 2023 at 3:47 PM Andres Freund  wrote:
> I assume this is "solely" required for keeping the incremental backups as
> small as possible, rather than being required for correctness?

I believe so. I want to spend some more time thinking about this to
make sure I'm not missing anything.

> Could we just recompute the WAL summary for the [redo, end of chunk] for the
> relevant summary file?

I'm not understanding how that would help. If we were going to compute
a WAL summary on the fly rather than waiting for one to show up on
disk, what we'd want is [end of last WAL summary that does exist on
disk, redo]. But I'm not sure that's a great approach, because that
LSN gap might be large and then we're duplicating a lot of work that
the summarizer has probably already done most of.

> FWIW, I like the idea of a special WAL record at that point, independent of
> this feature. It wouldn't be a meaningful overhead compared to the cost of a
> checkpoint, and it seems like it'd be quite useful for debugging. But I can
> see uses going beyond that - we occasionally have been discussing associating
> additional data with redo points, and that'd be a lot easier to deal with
> during recovery with such a record.
>
> I don't really see a performance and concurrency angle right now - what are
> you wondering about?

I'm not really sure. I expect Dilip would be happy to post his patch,
and if you'd be willing to have a look at it and express your concerns
or lack thereof, that would be super valuable.

> > Another thing that I'm not too sure about is: what happens if we find
> > a relation file on disk that doesn't appear in the backup_manifest for
> > the previous backup and isn't mentioned in the WAL summaries either?
>
> Wouldn't that commonly happen for unlogged relations at least?
>
> I suspect there's also other ways to end up with such additional files,
> e.g. by crashing during the creation of a new relation.

Yeah, this needs some more careful thought.

> > A few less-serious problems with the patch:
> >
> > - We don't have an incremental JSON parser, so if you have a
> > backup_manifest>1GB, pg_basebackup --incremental is going to fail.
> > That's also true of the existing code in pg_verifybackup, and for the
> > same reason. I talked to Andrew Dunstan at one point about adapting
> > our JSON parser to support incremental parsing, and he had a patch for
> > that, but I think he found some problems with it and I'm not sure what
> > the current status is.
>
> As a stopgap measure, can't we just use the relevant flag to allow larger
> allocations?

I'm not sure that's a good idea, but theoretically, yes. We can also
just choose to accept the limitation that your data directory can't be
too darn big if you want to use this feature. But getting incremental
JSON parsing would be better.

Not having the manifest in JSON would be an even better solution, but
regrettably I did not win that argument.

> That seems like a feature for the future...

Sure.

> I don't know the tar format well, but my understanding is that it doesn't have
> a "central metadata" portion. I.e. doing something like this would entail
> scanning the tar file sequentially, skipping file contents?  And wouldn't you
> have to create an entirely new tar file for the modified output? That kind of
> makes it not so incremental ;)
>
> IOW, I'm not sure it's worth bothering about this ever, and certainly doesn't
> seem worth bothering about now. But I might just be missing something.

Oh, yeah, it's just an idle thought. I'll get to it when I get to it,
or else I won't.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-06-14 Thread Andres Freund
Hi,

On 2023-06-14 14:46:48 -0400, Robert Haas wrote:
> A few years ago, I sketched out a design for incremental backup, but
> no patch for incremental backup ever got committed. Instead, the whole
> thing evolved into a project to add backup manifests, which are nice,
> but not as nice as incremental backup would be. So I've decided to
> have another go at incremental backup itself. Attached are some WIP
> patches. Let me summarize the design and some open questions and
> problems with it that I've discovered. I welcome problem reports and
> test results from others, as well.

Cool!


> I originally had the idea of summarizing a certain number of MB of WAL
> per WAL summary file, and so I added a GUC wal_summarize_mb for that
> purpose. But then I realized that actually, you really want WAL
> summary file boundaries to line up with possible redo points, because
> when you do an incremental backup, you need a summary that stretches
> from the redo point of the checkpoint written at the start of the
> prior backup to the redo point of the checkpoint written at the start
> of the current backup. The block modifications that happen in that
> range of WAL records are the ones that need to be included in the
> incremental.

I assume this is "solely" required for keeping the incremental backups as
small as possible, rather than being required for correctness?


> Unfortunately, there's no indication in the WAL itself
> that you've reached a redo point, but I wrote code that tries to
> notice when we've reached the redo point stored in shared memory and
> stops the summary there. But I eventually realized that's not good
> enough either, because if summarization zooms past the redo point
> before noticing the updated redo point in shared memory, then the
> backup sat around waiting for the next summary file to be generated so
> it had enough summaries to proceed with the backup, while the
> summarizer was in no hurry to finish up the current file and just sat
> there waiting for more WAL to be generated. Eventually the incremental
> backup would just time out. I tried to fix that by making it so that
> if somebody's waiting for a summary file to be generated, they can let
> the summarizer know about that and it can write a summary file ending
> at the LSN up to which it has read and then begin a new file from
> there. That seems to fix the hangs, but now I've got three
> overlapping, interconnected systems for deciding where to end the
> current summary file, and maybe that's OK, but I have a feeling there
> might be a better way.

Could we just recompute the WAL summary for the [redo, end of chunk] for the
relevant summary file?


> Dilip had an interesting potential solution to this problem, which was
> to always emit a special WAL record at the redo pointer. That is, when
> we fix the redo pointer for the checkpoint record we're about to
> write, also insert a WAL record there. That way, when the summarizer
> reaches that sentinel record, it knows it should stop the summary just
> before. I'm not sure whether this approach is viable, especially from
> a performance and concurrency perspective, and I'm not sure whether
> people here would like it, but it does seem like it would make things
> a whole lot simpler for this patch set.

FWIW, I like the idea of a special WAL record at that point, independent of
this feature. It wouldn't be a meaningful overhead compared to the cost of a
checkpoint, and it seems like it'd be quite useful for debugging. But I can
see uses going beyond that - we occasionally have been discussing associating
additional data with redo points, and that'd be a lot easier to deal with
during recovery with such a record.
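
To make that concrete, the record itself could be about as simple as this,
emitted at the point where the redo pointer is fixed (the record name here
is made up, and whether it needs any payload is an open question - just a
sketch):

    XLogBeginInsert();
    /* hypothetical no-payload marker record for the redo point */
    (void) XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);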

I don't really see a performance and concurrency angle right now - what are
you wondering about?


> Another thing that I'm not too sure about is: what happens if we find
> a relation file on disk that doesn't appear in the backup_manifest for
> the previous backup and isn't mentioned in the WAL summaries either?

Wouldn't that commonly happen for unlogged relations at least?

I suspect there's also other ways to end up with such additional files,
e.g. by crashing during the creation of a new relation.


> A few less-serious problems with the patch:
> 
> - We don't have an incremental JSON parser, so if you have a
> backup_manifest>1GB, pg_basebackup --incremental is going to fail.
> That's also true of the existing code in pg_verifybackup, and for the
> same reason. I talked to Andrew Dunstan at one point about adapting
> our JSON parser to support incremental parsing, and he had a patch for
> that, but I think he found some problems with it and I'm not sure what
> the current status is.

As a stopgap measure, can't we just use the relevant flag to allow larger
allocations?
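
I.e. something along these lines when slurping the manifest (just a
sketch; manifest_size is illustrative, and I'm assuming the flag meant
here is MCXT_ALLOC_HUGE):

    /* allow the manifest buffer to exceed MaxAllocSize */
    buf = palloc_extended(manifest_size + 1, MCXT_ALLOC_HUGE);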


> - The patch does support differential backup, aka an incremental atop
> another incremental. There's no particular limit to how long a chain
> of backups can be. However, pg_combinebackup currently