Re: pg_combinebackup does not detect missing files

2024-04-23 Thread David Steele

On 4/22/24 23:53, Robert Haas wrote:

On Sun, Apr 21, 2024 at 8:47 PM David Steele  wrote:

I figured that wouldn't be particularly meaningful, and
that's pretty much the only kind of validation that's even
theoretically possible without a bunch of extra overhead, since we
compute checksums on entire files rather than, say, individual blocks.
And you could really only do it for the final backup in the chain,
because you should end up accessing all of those files, but the same
is not true for the predecessor backups. So it's a very weak form of
verification.


I don't think it is weak if you can verify that the output is exactly as
expected, i.e. all files are present and have the correct contents.


I don't understand what you mean here. I thought we were in agreement
that verifying contents would cost a lot more. The verification that
we can actually do without much cost can only check for missing files
in the most recent backup, which is quite weak. pg_verifybackup is
available if you want more comprehensive verification and you're
willing to pay the cost of it.


I simply meant that it is *possible* to verify the output of 
pg_combinebackup without explicitly verifying all the backups. There 
would be overhead, yes, but it would be less than verifying each backup 
individually. For my 2c that efficiency would make it worth doing 
verification in pg_combinebackup, with perhaps a switch to turn it off 
if the user is confident in their sources.



I think it is a worthwhile change and we are still a month away from
beta1. We'll see if anyone disagrees.


I don't plan to press forward with this in this release unless we get
a couple of +1s from disinterested parties. We're now two weeks after
feature freeze and this is design behavior, not a bug. Perhaps the
design should have been otherwise, but two weeks after feature freeze
is not the time to debate that.


It doesn't appear that anyone but me is terribly concerned about 
verification, even in this weak form, so probably best to hold this 
patch until the next release. As you say, it is late in the game.


Regards,
-David




Re: pg_combinebackup does not detect missing files

2024-04-21 Thread David Steele

On 4/20/24 01:47, Robert Haas wrote:

On Thu, Apr 18, 2024 at 7:36 PM David Steele  wrote:

Sure -- pg_combinebackup would only need to verify the data that it
uses. I'm not suggesting that it should do an exhaustive verify of every
single backup in the chain. Though I can see how it sounded that way
since with pg_verifybackup that would pretty much be your only choice.

The beauty of doing verification in pg_combinebackup is that it can do a
lot less than running pg_verifybackup against every backup but still get
a valid result. All we care about is that the output is correct -- if
there is corruption in an unused part of an earlier backup
pg_combinebackup doesn't need to care about that.

As far as I can see, pg_combinebackup already checks most of the boxes.
The only thing I know that it can't do is detect missing files and that
doesn't seem like too big a thing to handle.


Hmm, that's an interesting perspective. I've always been very
skeptical of doing verification only around missing files and not
anything else. 


Yeah, me too. There should also be some verification of the file 
contents themselves but now I can see we don't have that. For instance, 
I can do something like this:


cp test/backup/incr1/base/1/3598 test/backup/incr1/base/1/2336

And pg_combinebackup runs without complaint. Maybe missing files are 
more likely than corrupted files, but it would still be nice to check 
for both.



I figured that wouldn't be particularly meaningful, and
that's pretty much the only kind of validation that's even
theoretically possible without a bunch of extra overhead, since we
compute checksums on entire files rather than, say, individual blocks.
And you could really only do it for the final backup in the chain,
because you should end up accessing all of those files, but the same
is not true for the predecessor backups. So it's a very weak form of
verification.


I don't think it is weak if you can verify that the output is exactly as 
expected, i.e. all files are present and have the correct contents.


But in this case it would be nice to at least know that the source files 
on disk are valid (which pg_verifybackup does). Without block checksums 
it is hard to know if the final output is correct or not.
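
For instance, continuing the copy example above, verifying the source
backups should flag the damaged file in the incremental (a rough sketch;
the exact message wording may differ, and this assumes manifest checksums
were not disabled):

$ pg_verifybackup test/backup/full
$ pg_verifybackup test/backup/incr1
pg_verifybackup: error: checksum mismatch for file "base/1/2336"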



But I looked into it and I think you're correct that, if you restrict
the scope in the way that you suggest, we can do it without much
additional code, or much additional run-time. The cost is basically
that, instead of only looking for a backup_manifest entry when we
think we can reuse its checksum, we need to do a lookup for every
single file in the final input directory. Then, after processing all
such files, we need to iterate over the hash table one more time and
see what files were never touched. That seems like an acceptably low
cost to me. So, here's a patch.


I tested the patch and it works, but there is some noise from WAL files 
since they are not in the manifest:


$ pg_combinebackup test/backup/full test/backup/incr1 -o test/backup/combine
pg_combinebackup: warning: "pg_wal/00010008" is present 
on disk but not in the manifest
pg_combinebackup: error: "base/1/3596" is present in the manifest but 
not on disk


Maybe it would be better to omit this warning since it could get very 
noisy depending on the number of WAL segments generated during backup.


Though I do find it a bit odd that WAL files are not in the source 
backup manifests but do end up in the manifest after a pg_combinebackup. 
It doesn't seem harmful, just odd.



I do think there's some chance that this will encourage people to
believe that pg_combinebackup is better at finding problems than it
really is or ever will be, 


Given that pg_combinebackup is not verifying checksums, you are probably 
right.



and I also question whether it's right to
keep changing stuff after feature freeze. But I have a feeling most
people here are going to think this is worth including in 17. Let's
see what others say.


I think it is a worthwhile change and we are still a month away from 
beta1. We'll see if anyone disagrees.


Regards,
-David




Re: pg_combinebackup does not detect missing files

2024-04-18 Thread David Steele

On 4/19/24 00:50, Robert Haas wrote:

On Wed, Apr 17, 2024 at 7:09 PM David Steele  wrote:


Fair enough. I accept that your reasoning is not random, but I'm still
not very satisfied that the user needs to run a separate and rather
expensive process to do the verification when pg_combinebackup already
has the necessary information at hand. My guess is that most users will
elect to skip verification.


I think you're probably right that a lot of people will skip it; I'm
just less convinced than you are that it's a bad thing. It's not a
*great* thing if people skip it, but restore time is actually just
about the worst time to find out that you have a problem with your
backups. I think users would be better served by verifying stored
backups periodically when they *don't* need to restore them. 


Agreed, running verify regularly is a good idea, but in my experience 
most users are only willing to run verify once they suspect (or know) 
there is an issue. It's a pretty expensive process depending on how many 
backups you have and where they are stored.
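
For those who are willing, a periodic check does not need to be anything
fancy. A minimal sketch, assuming plain-format backups kept under
test/backup/ (the paths are illustrative):

$ for dir in test/backup/full test/backup/incr1 test/backup/incr2; do
      pg_verifybackup "$dir" || echo "verification failed: $dir"
  done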


Also, saying that we have all of the information that we need to do the
verification is only partially true:

- we do have to parse the manifest anyway, but we don't have to
compute checksums anyway, and I think that cost can be significant
even for CRC-32C and much more significant for any of the SHA variants

- we don't need to read all of the files in all of the backups. if
there's a newer full, the corresponding file in older backups, whether
full or incremental, need not be read

- incremental files other than the most recent only need to be read to
the extent that we need their data; if some of the same blocks have
been changed again, we can economize

How much you save because of these effects is pretty variable. Best
case, you have a 2-backup chain with no manifest checksums, and all
that verification has to do, beyond what you'd otherwise need to do, is
walk each older directory tree in toto and cross-check which files
exist against the manifest. That's probably cheap enough that nobody
would be too fussed. Worst case, you have a 10-backup (or whatever)
chain with SHA512 checksums and, say, a 50% turnover rate. In that
case, I think having verification happen automatically could be a
pretty major hit, both in terms of I/O and CPU. If your database is
1TB, it's ~5.5TB of read I/O (because one 1TB full backup and 9 0.5TB
incrementals) instead of ~1TB of read I/O, plus the checksumming.

Now, obviously you can still feel that it's totally worth it, or that
someone in that situation shouldn't even be using incremental backups,
and it's a value judgement, so fair enough. But my guess is that the
efforts that this implementation makes to minimize the amount of I/O
required for a restore are going to be important for a lot of people.


Sure -- pg_combinebackup would only need to verify the data that it 
uses. I'm not suggesting that it should do an exhaustive verify of every 
single backup in the chain. Though I can see how it sounded that way 
since with pg_verifybackup that would pretty much be your only choice.


The beauty of doing verification in pg_combinebackup is that it can do a 
lot less than running pg_verifybackup against every backup but still get 
a valid result. All we care about is that the output is correct -- if 
there is corruption in an unused part of an earlier backup 
pg_combinebackup doesn't need to care about that.


As far as I can see, pg_combinebackup already checks most of the boxes. 
The only thing I know that it can't do is detect missing files and that 
doesn't seem like too big a thing to handle.


Regards,
-David




Re: pg_combinebackup fails on file named INCREMENTAL.*

2024-04-18 Thread David Steele

On 4/19/24 01:10, Robert Haas wrote:

On Wed, Apr 17, 2024 at 7:56 PM David Steele  wrote:

Thanks! I've tested this and it works as advertised.

Ideally I'd want an error on backup if there is a similar file in any
data directories that would cause an error on combine, but I admit that
it is vanishingly rare for users to put files anywhere but the root of
PGDATA, so this approach seems OK to me.


OK, committed. Thanks for the review.


Excellent, thank you!

Regards,
-David




Re: pg_combinebackup fails on file named INCREMENTAL.*

2024-04-17 Thread David Steele

On 4/18/24 00:14, Robert Haas wrote:

On Tue, Apr 16, 2024 at 9:25 AM Robert Haas  wrote:

On Mon, Apr 15, 2024 at 10:12 PM David Steele  wrote:

Anyway, I think it should be fixed or documented as a caveat since it
causes a hard failure on restore.


Alright, I'll look into this.


Here's a patch.


Thanks! I've tested this and it works as advertised.

Ideally I'd want an error on backup if there is a similar file in any 
data directories that would cause an error on combine, but I admit that 
it is vanishingly rare for users to put files anywhere but the root of 
PGDATA, so this approach seems OK to me.


Regards,
-David




Re: pg_combinebackup does not detect missing files

2024-04-17 Thread David Steele

On 4/18/24 01:03, Robert Haas wrote:

On Tue, Apr 16, 2024 at 7:25 PM David Steele  wrote:

But it will not go out of its way to perform checks that are unrelated
to its documented purpose. And I don't think it should, especially if
we have another tool that already does that.


I'm not averse to having some kind of statement in the documentation
along the lines of "Note that pg_combinebackup does not attempt to
verify that the individual backups are intact; for that, use
pg_verifybackup."


I think we should do this at a minimum.


Here is a patch to do that.


I think here:

+   pg_basebackup only attempts to verify

you mean:

+   pg_combinebackup only attempts to verify

Otherwise this looks good to me.


Especially given how pg_combinebackup works, backups are going to
undergo a lot of user manipulation (pushing to and pulling from storage,
decompressing, untarring, etc.) and I think that means we should take
extra care.


We are in agreement on that point, if nothing else. I am terrified of
users having problems with pg_combinebackup and me not being able to
tell whether those problems are due to user error, Robert error, or
something else. I put a lot of effort into detecting dumb things that
I thought a user might do, and a lot of effort into debugging output
so that when things do go wrong anyway, we have a reasonable chance of
figuring out exactly where they went wrong. We do seem to have a
philosophical difference about what the scope of those checks ought to
be, and I don't really expect what I wrote above to convince you that
my position is correct, but perhaps it will convince you that I have a
thoughtful position, as opposed to just having done stuff at random.


Fair enough. I accept that your reasoning is not random, but I'm still 
not very satisfied that the user needs to run a separate and rather 
expensive process to do the verification when pg_combinebackup already 
has the necessary information at hand. My guess is that most users will 
elect to skip verification.


At least now they'll have the information they need to make an informed 
choice.


Regards,
-David




Re: pg_combinebackup does not detect missing files

2024-04-16 Thread David Steele

On 4/16/24 23:50, Robert Haas wrote:

On Wed, Apr 10, 2024 at 9:36 PM David Steele  wrote:

I've been playing around with the incremental backup feature trying to
get a sense of how it can be practically used. One of the first things I
always try is to delete random files and see what happens.

You can delete pretty much anything you want from the most recent
incremental backup (not the manifest) and it will not be detected.


Sure, but you can also delete anything you want from the most recent
non-incremental backup and it will also not be detected. There's no
reason at all to hold incremental backup to a higher standard than we
do in general.


Except that we are running pg_combinebackup on the incremental, which 
the user might reasonably expect to check backup integrity. It actually 
does a bunch of integrity checks -- but not this one.



Maybe the answer here is to update the docs to specify that
pg_verifybackup should be run on all backup directories before
pg_combinebackup is run. Right now that is not at all clear.


I don't want to make those kinds of prescriptive statements. If you
want to verify the backups that you use as input to pg_combinebackup,
you can use pg_verifybackup to do that, but it's not a requirement.
I'm not averse to having some kind of statement in the documentation
along the lines of "Note that pg_combinebackup does not attempt to
verify that the individual backups are intact; for that, use
pg_verifybackup." 


I think we should do this at a minimum.


But I think it should be blindingly obvious to
everyone that you can't go whacking around the inputs to a program and
expect to get perfectly good output. I know it isn't blindingly
obvious to everyone, which is why I'm not averse to adding something
like what I just mentioned, and maybe it wouldn't be a bad idea to
document in a few other places that you shouldn't randomly remove
files from the data directory of your live cluster, either, because
people seem to keep doing it, but really, the expectation that you
can't just blow files away and expect good things to happen afterward
should hardly need to be stated.


And yet, we see it all the time.


I think it's very easy to go overboard with warnings of this type.
Weird stuff comes to me all the time because people call me when the
weird stuff happens, and I'm guessing that your experience is similar.
But my actual personal experience, as opposed to the cases reported to
me by others, practically never features files evaporating into the
ether. 


Same -- if it happens at all it is very rare. Virtually every time I am 
able to track down the cause of missing files it is because the user 
deleted them, usually to "save space" or because they "did not seem 
important".


But given that this occurrence is pretty common in my experience, I 
think it is smart to mitigate against it, rather than just take it on 
faith that the user hasn't done anything destructive.


Especially given how pg_combinebackup works, backups are going to 
undergo a lot of user manipulation (pushing to and pulling from storage, 
decompressing, untarring, etc.) and I think that means we should take 
extra care.


Regards,
-David




Re: pg_combinebackup fails on file named INCREMENTAL.*

2024-04-15 Thread David Steele

On 4/16/24 06:33, Robert Haas wrote:

On Sun, Apr 14, 2024 at 11:53 PM David Steele  wrote:

Since incremental backup is using INCREMENTAL as a keyword (rather than
storing incremental info in the manifest) it is vulnerable to any file
in PGDATA with the pattern INCREMENTAL.*.


Yeah, that's true. I'm not greatly concerned about this, because I
think anyone who creates a file called INCREMENTAL.CONFIG is just
being perverse. 


Well, it's INCREMENTAL.* and you might be surprised (though I doubt it) 
at what users will do. We've been caught out by wacky file names (and 
database names) a few times.



However, we could ignore those files just in subtrees
where they're not expected to occur i.e. only pay attention to them
under base, global, and pg_tblspc. I didn't do this because I thought
someone might eventually want to do something like incremental backup
of SLRU files, and then it would be annoying if there were a bunch of
random pathname restrictions. But if we're really concerned about
this, I think it would be a reasonable fix; you're not really supposed
to decide to store your configuration files in directories that exist
for the purpose of storing relation data files.


I think it would be reasonable to restrict what can be put in base/ and 
global/ but users generally feel free to create whatever they want in 
the root of PGDATA, despite being strongly encouraged not to.


Anyway, I think it should be fixed or documented as a caveat since it 
causes a hard failure on restore.



We could fix the issue by forbidding this file pattern in PGDATA, i.e.
error when it is detected during pg_basebackup, but in my view it would
be better (at least eventually) to add incremental info to the manifest.
That would also allow us to skip storing zero-length files and
incremental stubs (with no changes) as files.


I did think about this, and I do like the idea of being able to elide
incremental stubs. If you have a tremendous number of relations and
very few of them have changed at all, the incremental stubs might take
up a significant percentage of the space consumed by the incremental
backup, which kind of sucks, even if the incremental backup is still
quite small compared to the original database. And, like you, I had
the idea of trying to use the backup_manifest to do it.

But ... I didn't really end up feeling very comfortable with it. Right
now, the backup manifest is something we only use to verify the
integrity of the backup. If we were to do this, it would become a
critical part of the backup. 


For my 2c the manifest is absolutely a critical part of the backup. I'm 
having a hard time even seeing why we have the --no-manifest option for 
pg_combinebackup at all. If the manifest is missing who knows what else 
might be missing as well? If present, why wouldn't we use it?


I know Tomas added some optimizations that work best with --no-manifest 
but if we can eventually read compressed tars (which I expect to be the 
general case) then those optimizations are not very useful.



I don't think I like the idea that
removing the backup_manifest should be allowed to, in effect, corrupt
the backup. But I think I dislike even more the idea that the data
that is used to verify the backup gets mushed together with the backup
data itself. Maybe in practice it's fine, but it doesn't seem very
conceptually clean.


I don't think this is a problem. The manifest can do more than store 
verification info, IMO.



There are also some practical considerations that made me not want to
go in this direction: we'd need to make sure that the relevant
information from the backup manifest was available to the
reconstruction process at the right part of the code. When iterating
over a directory, it would need to consider all of the files actually
present in that directory plus any "hallucinated" incremental stubs
from the manifest. I didn't feel confident of my ability to implement
that in the available time without messing anything up.


I think it would be better to iterate over the manifest than files in a 
directory. In any case we still need to fix [1], which presents more or 
less the same problem.



I think we should consider other possible designs here. For example,
instead of including this in the manifest, we could ship one
INCREMENTAL.STUBS file per directory that contains a list of all of
the incremental stubs that should be created in that directory. My
guess is that something like this will turn out to be way simpler to
code.


So we'd store a mini manifest per directory rather than just put the 
info in the main manifest? Sounds like unnecessary complexity to me.


Regards,
-David

[1] 
https://www.postgresql.org/message-id/flat/9badd24d-5bd9-4c35-ba85-4c38a2feb73e%40pgmasters.net





pg_combinebackup fails on file named INCREMENTAL.*

2024-04-14 Thread David Steele

Hackers,

Since incremental backup is using INCREMENTAL as a keyword (rather than 
storing incremental info in the manifest) it is vulnerable to any file 
in PGDATA with the pattern INCREMENTAL.*.


For example:

$ pg_basebackup -c fast -D test/backup/full -F plain
$ touch test/data/INCREMENTAL.CONFIG
$ /home/dev/test/pg/bin/pg_basebackup -c fast -D test/backup/incr1 -F 
plain -i /home/dev/test/backup/full/backup_manifest


$ /home/dev/test/pg/bin/pg_combinebackup test/backup/full 
test/backup/incr1 -o test/backup/combine
pg_combinebackup: error: could not read file 
"test/backup/incr1/INCREMENTAL.CONFIG": read only 0 of 4 bytes

pg_combinebackup: removing output directory "test/backup/combine"

This works anywhere in PGDATA as far as I can see, e.g.

$ touch test/data/base/1/INCREMENTAL.1

Or just by dropping a new file into the incremental backup:

$ touch test/backup/incr1/INCREMENTAL.x
$ /home/dev/test/pg/bin/pg_combinebackup test/backup/full 
test/backup/incr1 -o test/backup/combine
pg_combinebackup: error: could not read file 
"test/backup/incr1/INCREMENTAL.x": read only 0 of 4 bytes

pg_combinebackup: removing output directory "test/backup/combine"

We could fix the issue by forbidding this file pattern in PGDATA, i.e. 
error when it is detected during pg_basebackup, but in my view it would 
be better (at least eventually) to add incremental info to the manifest. 
That would also allow us to skip storing zero-length files and 
incremental stubs (with no changes) as files.


Regards,
-David




Re: post-freeze damage control

2024-04-14 Thread David Steele

On 4/13/24 21:02, Tomas Vondra wrote:

On 4/13/24 01:23, David Steele wrote:


Even for the summarizer, though, I do worry about the complexity of
maintaining it over time. It seems like it would be very easy to
introduce a bug and have it go unnoticed until it causes problems in the
field. A lot of testing was done outside of the test suite for this
feature and I'm not sure if we can rely on that focus with every release.



I'm not sure there's a simpler way to implement this. I haven't really
worked on that part (not until the CoW changes a couple weeks ago), but
I think Robert was very conscious of the complexity.

I don't expect this code to change very often, but I agree it's
not great to rely on testing outside the regular regression test suite.
But I'm not sure how much more we can do, really - for example my
testing was very much "randomized stress testing" with a lot of data and
long runs, looking for unexpected stuff. That's not something we could
do in the usual regression tests, I think.

But if you have suggestions how to extend the testing ...


Doing stress testing in the regular test suite is obviously a problem 
due to runtime, but it would still be great to see tests for issues that 
were found during external stress testing.


For example, the issue you and Jakub found was fixed in 55a5ee30 but 
there is no accompanying test and no existing test was broken by the change.



For me an incremental approach would be to introduce the WAL summarizer
first. There are already plenty of projects that do page-level
incremental (WAL-G, pg_probackup, pgBackRest) and could help shake out
the bugs. Then introduce the client tools later when they are more
robust. Or, release the client tools now but mark them as experimental
or something so people know that changes are coming and they don't get
blindsided by that in the next release. Or, at the very least, make the
caveats very clear so users can make an informed choice.



I don't think introducing just the summarizer, without any client tools,
would really work. How would we even test the summarizer, for example?
If the only users of that code are external tools, we'd do only some
very rudimentary tests. But the more complex tests would happen in the
external tools, which means it wouldn't be covered by cfbot, buildfarm
and so on. Considering the external tools are likely a bit behind, it's
not clear to me how I would do the stress testing, for example.

IMHO we should aim to have in-tree clients when possible, even if some
external tools can do more advanced stuff etc.

This however reminds me of my question: does the summarizer provide the
right interface(s) for the external tools? One option is to do pg_basebackup
and then parse the incremental files, but is that suitable for the
external tools, or should there be a more convenient way?


Running a pg_basebackup to get the incremental changes would not be at 
all satisfactory. Luckily there are the 
pg_wal_summary_contents()/pg_available_wal_summaries() functions, which 
seem to provide the required information. I have not played with them 
much but I think they will do the trick.


They are pretty awkward to work with since they are essentially 
time-series data but what you'd really want, I think, is the ability to 
get page changes for a particular relfileid/segment.
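
For example, getting a per-relation view means joining and aggregating
the two functions yourself, roughly like this (a sketch based on the
documented signatures, so treat the column names as assumptions):

$ psql -Xc "
    SELECT c.reldatabase, c.relfilenode, c.relforknumber,
           count(DISTINCT c.relblocknumber) AS changed_blocks
      FROM pg_available_wal_summaries() s,
           pg_wal_summary_contents(s.tli, s.start_lsn, s.end_lsn) c
     WHERE NOT c.is_limit_block
     GROUP BY c.reldatabase, c.relfilenode, c.relforknumber
     ORDER BY changed_blocks DESC;"

And even then it scans every available summary rather than just the
range you care about.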


Regards,
-David




Re: post-freeze damage control

2024-04-12 Thread David Steele

On 4/12/24 22:27, Tomas Vondra wrote:



On 4/12/24 08:42, David Steele wrote:

On 4/11/24 20:26, Tomas Vondra wrote:

On 4/11/24 03:52, David Steele wrote:

On 4/11/24 10:23, Tom Kincaid wrote:


The extensive Beta process we have can be used to build confidence we
need in a feature that has extensive review and currently has no known
issues or outstanding objections.


I did have objections, here [1] and here [2]. I think the complexity,
space requirements, and likely performance issues involved in restores
are going to be a real problem for users. Some of these can be addressed
in future releases, but I can't escape the feeling that what we are
releasing here is half-baked.


I do not think it's half-baked. I certainly agree there are limitations,
and there's all kinds of bells and whistles we could add, but I think
the fundamental infrastructure is correct and a meaningful step forward.
Would I wish it to handle .tar for example? Sure I would. But I think
it's something we can add in the future - if we require all of this to
happen in a single release, it'll never happen.


I'm not sure that I really buy this argument, anyway. It is not uncommon
for significant features to spend years in development before they are
committed. This feature went from first introduction to commit in just
over six months. Obviously Robert had been working on it for a while,
but for a feature this large six months is a sprint.



Sure, but it's also not uncommon for significant features to be
developed incrementally, over multiple releases, introducing the basic
infrastructure first, and then expanding the capabilities later. I'd
cite logical decoding/replication and parallel query as examples of this
approach.

It's possible there's some fundamental flaw in the WAL summarization?
Sure, I can't rule that out, although I find it unlikely. Could there be
bugs? Sure, that's possible, but that applies to all code.

But it seems to me all the comments are about the client side, not about
the infrastructure. Which is fair, I certainly agree it'd be nice to
handle more use cases with less effort, but I still think the patch is a
meaningful step forward.


Yes, my comments are all about the client code. I like the 
implementation of the WAL summarizer a lot. I don't think there is a 
fundamental flaw in the design, either, but I wouldn't be surprised if 
there are bugs. That's life in software development biz.


Even for the summarizer, though, I do worry about the complexity of 
maintaining it over time. It seems like it would be very easy to 
introduce a bug and have it go unnoticed until it causes problems in the 
field. A lot of testing was done outside of the test suite for this 
feature and I'm not sure if we can rely on that focus with every release.


For me an incremental approach would be to introduce the WAL summarizer 
first. There are already plenty of projects that do page-level 
incremental (WAL-G, pg_probackup, pgBackRest) and could help shake out 
the bugs. Then introduce the client tools later when they are more 
robust. Or, release the client tools now but mark them as experimental 
or something so people know that changes are coming and they don't get 
blindsided by that in the next release. Or, at the very least, make the 
caveats very clear so users can make an informed choice.


Regards,
-David




Re: post-freeze damage control

2024-04-12 Thread David Steele

On 4/12/24 22:12, Tomas Vondra wrote:

On 4/11/24 23:48, David Steele wrote:

On 4/11/24 20:26, Tomas Vondra wrote:


FWIW that discussion also mentions stuff that I think the feature should
not do. In particular, I don't think the ambition was (or should be) to
make pg_basebackup into a stand-alone tool. I always saw pg_basebackup
more as an interface to "backup steps" correctly rather than a complete
backup solution that'd manage backup registry, retention, etc.


Right -- this is exactly my issue. pg_basebackup was never easy to use
as a backup solution and this feature makes it significantly more
complicated. Complicated enough that it would be extremely difficult for
most users to utilize in a meaningful way.


Perhaps, I agree we could/should try to do better job to do backups, no
argument there. But I still don't quite see why introducing such
infrastructure to "manage backups" should be up to the patch adding
incremental backups. I see it as something to build on top of
pg_basebackup/pg_combinebackup, not into those tools.


I'm not saying that managing backups needs to be part of pg_basebackup, 
but I am saying without that it is not a complete backup solution. 
Therefore introducing advanced features that the user then has to figure 
out how to manage puts a large burden on them. Implementing 
pg_combinebackup inefficiently out of the gate just makes their life harder.



But they'll try because it is a new pg_basebackup feature and they'll
assume it is there to be used. Maybe it would be a good idea to make it
clear in the documentation that significant tooling will be required to
make it work.


Sure, I'm not against making it clearer pg_combinebackup is not a
complete backup solution, and documenting the existing restrictions.


Let's do that then. I think it would make sense to add caveats to the 
pg_combinebackup docs including space requirements, being explicit about 
the format required (e.g. plain), and also possible mitigation with COW 
filesystems.


Regards,
-David




Re: Add notes to pg_combinebackup docs

2024-04-12 Thread David Steele




On 4/12/24 22:40, Tomas Vondra wrote:

On 4/12/24 11:50, David Steele wrote:

On 4/12/24 19:09, Magnus Hagander wrote:

On Fri, Apr 12, 2024 at 12:14 AM David Steele wrote:

  > But yeah, having to keep the backups as expanded directories is not
  > great, I'd love to have .tar. Not necessarily because of the disk space
  > (in my experience the compression in filesystems works quite well for
  > this purpose), but mostly because it's more compact and allows working
  > with backups as a single piece of data (e.g. it's much clearer what the
  > checksum of a single .tar is, compared to a directory).

But again, object stores are commonly used for backup these days and
billing is based on data stored rather than any compression that can be
done on the data. Of course, you'd want to store the compressed tars in
the object store, but that does mean storing an expanded copy somewhere
to do pg_combinebackup.

Object stores are definitely getting more common. I wish they were
getting a lot more common than they actually are, because they
simplify a lot.  But they're in my experience still very far from
being a majority.


I see it the other way, especially the last few years. The majority seem
to be object stores, followed closely by NFS. Directly mounted storage
on the backup host appears to be rarer.



One thing I'd mention is that not having built-in support for .tar and
.tgz backups does not mean it's impossible to use pg_combinebackup with
archives. You can mount them using e.g. "ratarmount" and then use that
as source directories for pg_combinebackup.

It's not entirely friction-less because AFAICS it's necessary to do the
backup in plain format and then do the .tar to have the expected "flat"
directory structure (and not manifest + 2x tar). But other than that it
seems to work fine (based on my limited testing).


Well, that's certainly convoluted and doesn't really help a lot in terms 
of space consumption, it just shifts the additional space required to 
the backup side. I doubt this is something we'd be willing to add to our 
documentation so it would be up to the user to figure out and script.
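
For reference, what the user would have to script looks roughly like this
(a sketch, assuming each backup was taken in plain format and then tarred
so the archive has the flat layout described above; the names are
illustrative):

$ mkdir -p /mnt/full /mnt/incr1
$ ratarmount full.tar /mnt/full
$ ratarmount incr1.tar /mnt/incr1
$ pg_combinebackup /mnt/full /mnt/incr1 -o test/backup/combine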



FWIW the "archivemount" performs terribly, so adding this capability
into pg_combinebackup is clearly far from trivial.


I imagine this would perform pretty badly. And yes, doing it efficiently 
is not trivial but certainly doable. Scanning the tar file and matching 
to entries in the manifest is one way, but I would prefer to store the 
offsets into the tar file in the manifest then assemble an ordered list 
of work to do on each tar file. But of course the latter requires a 
manifest-centric approach, which is not what we have right now.


Regards,
-David




Re: Add notes to pg_combinebackup docs

2024-04-12 Thread David Steele

On 4/12/24 19:09, Magnus Hagander wrote:
On Fri, Apr 12, 2024 at 12:14 AM David Steele wrote:

OK, sure, but if the plan is to make it practical later doesn't that
make the feature something to be avoided now?


That could be said for any feature. When we shipped streaming 
replication, the plan was to support synchronous in the future. Should 
we not have shipped it, or told people to avoid it?


This doesn't seem like a great example. Synchronous rep is by far the 
more used mode in my experience. I actively dissuade people from using 
sync rep because of the downsides. More people think they need it than 
actually need it.



 > However, who says this has to be the filesystem the Postgres instance
 > runs on? Who in their right mind puts backups on the same volume as the
 > instance anyway? At which point it can be a different filesystem, even
 > if it's not ideal for running the database.

My experience is these days backups are generally placed in object
stores. Sure, people are still using NFS but admins rarely have much
control over those volumes. They may or may not be COW filesystems.


If it's mounted through NFS I assume pg_combinebackup won't actually be 
able to use the COW features? Or does that actually work through NFS?


Pretty sure it won't work via NFS, but I was wrong about XFS, so...

Mounted LUNs on a SAN I find more common today though, and there it 
would do a fine job.


Huh, interesting. This is a case I almost never see anymore.


 > All of this also depends on how people do the restore. With the CoW
 > stuff they can do a quick (and small) copy on the backup server, and
 > then copy the result to the actual instance. Or they can do restore on
 > the target directly (e.g. by mounting a r/o volume with backups), in
 > which case the CoW won't really help.

And again, this all requires a significant amount of setup and tooling.
Obviously I believe good backup requires effort but doing this right
gets very complicated due to the limitations of the tool.

It clearly needs to be documented that there are space needs. But 
temporarily getting space for something like that is not very 
complicated in most environments. But you do have to be aware of it.


We find many environments ridiculously tight on space. There is a 
constant fight with customers/users to even get the data/WAL volumes 
sized correctly.


For small databases it is probably not an issue, but this feature really 
shines with very large databases.


Generally speaking it's already the case that the "restore experience" 
with pg_basebackup is far from great. We don't have a "pg_baserestore". 
You still have to deal with archive_command and restore_command, which 
we all know can be easy to get wrong. I don't see how this is 
fundamentally worse than that.


I pretty much agree with this statement. pg_basebackup is already hard 
to use effectively. Now it is just optionally harder.


Personally, I tend to recommend that "if you want PITR and thus need to 
mess with archive_command etc, you should use a backup tool like 
pg_backrest. If you're fine with just daily backups or whatnot, use 
pg_basebackup". The incremental backup story fits somewhere in between, 
but I'd still say this is (today) primarily a tool directed at those 
that don't need full PITR.


Yeah, there are certainly cases where PITR is not required, but they 
still seem to be in the minority. PITR cannot be disabled for the most 
recent backup in pgBackRest and we've had few complaints about that overall.



 > But yeah, having to keep the backups as expanded directories is not
 > great, I'd love to have .tar. Not necessarily because of the disk space
 > (in my experience the compression in filesystems works quite well for
 > this purpose), but mostly because it's more compact and allows working
 > with backups as a single piece of data (e.g. it's much clearer what the
 > checksum of a single .tar is, compared to a directory).

But again, object stores are commonly used for backup these days and
billing is based on data stored rather than any compression that can be
done on the data. Of course, you'd want to store the compressed tars in
the object store, but that does mean storing an expanded copy somewhere
to do pg_combinebackup.

Object stores are definitely getting more common. I wish they were 
getting a lot more common than they actually are, because they simplify 
a lot.  But they're in my experience still very far from being a majority.


I see it the other way, especially the last few years. The majority seem 
to be object stores, followed closely by NFS. Directly mounted storage 
on the backup host appears to be rarer.



But if the argument is that all this can/will be fixed in the future, I
guess the smart thing for users to do is wait a few releases for 
incremental backups to become a practical feature.

Re: post-freeze damage control

2024-04-12 Thread David Steele

On 4/11/24 20:26, Tomas Vondra wrote:

On 4/11/24 03:52, David Steele wrote:

On 4/11/24 10:23, Tom Kincaid wrote:


The extensive Beta process we have can be used to build confidence we
need in a feature that has extensive review and currently has no known
issues or outstanding objections.


I did have objections, here [1] and here [2]. I think the complexity,
space requirements, and likely performance issues involved in restores
are going to be a real problem for users. Some of these can be addressed
in future releases, but I can't escape the feeling that what we are
releasing here is half-baked.


I do not think it's half-baked. I certainly agree there are limitations,
and there's all kinds of bells and whistles we could add, but I think
the fundamental infrastructure is correct and a meaningful step forward.
Would I wish it to handle .tar for example? Sure I would. But I think
it's something we can add in the future - if we require all of this to
happen in a single release, it'll never happen.


I'm not sure that I really buy this argument, anyway. It is not uncommon 
for significant features to spend years in development before they are 
committed. This feature went from first introduction to commit in just 
over six months. Obviously Robert had been working on it for a while, 
but for a feature this large six months is a sprint.


Regards,
-David




Re: post-freeze damage control

2024-04-11 Thread David Steele




On 4/12/24 12:15, Robert Haas wrote:

On Thu, Apr 11, 2024 at 5:48 PM David Steele  wrote:

But they'll try because it is a new pg_basebackup feature and they'll
assume it is there to be used. Maybe it would be a good idea to make it
clear in the documentation that significant tooling will be required to
make it work.


I don't agree with that idea. LOTS of what we ship takes a significant
amount of effort to make it work. You may well need a connection
pooler. You may well need a failover manager which may or may not be
separate from your connection pooler. You need a backup tool. You need
a replication management tool which may or may not be separate from
your backup tool and may or may not be separate from your failover
tool. You probably need various out-of-core connections for the
programming languages you need. You may need a management tool, and
you probably need a monitoring tool. Some of the tools you might
choose to do all that stuff themselves have a whole bunch of complex
dependencies. It's a mess.


The difference here is you *can* use Postgres without a connection 
pooler (I have many times) or failover (if downtime is acceptable) but 
most people would agree that you really *need* backup.


The backup tool should be clear and easy to use or misery will 
inevitably result. pg_basebackup is difficult enough to use and automate 
because it has no notion of a repository, no expiration, and no WAL 
handling just to name a few things. Now there is an even more advanced 
feature that is even harder to use. So, no, I really don't think this 
feature is practically usable by the vast majority of end users.



Now, if someone were to say that we ought to talk about these issues
in our documentation and maybe give people some ideas about how to get
started, I would likely be in favor of that, modulo the small
political problem that various people would want their solution to be
the canonical one to which everyone gets referred. But I think it's
wrong to pretend like this feature is somehow special, that it's
somehow more raw or unfinished than tons of other things. I actually
think it's significantly *better* than a lot of other things. If we
add a disclaimer to the documentation saying "hey, this new
incremental backup feature is half-finished garbage!", and meanwhile
the documentation still says "hey, you can use cp as your
archive_command," then we have completely lost our minds.


Fair point on cp, but that just points to an overall lack in our 
documentation and built-in backup/recovery tools in general.



I also think that you're being more negative about this than the facts
justify. As I said to several colleagues today, I *fully* acknowledge
that you have a lot more practical experience in this area than I do,
and a bunch of good ideas. I was really pleased to see you talking
about how it would be good if these tools worked on tar files - and I
completely agree, and I hope that will happen, and I hope to help in
making that happen. I think there are a bunch of other problems too,
only some of which I can guess at. However, I think saying that this
feature is not realistically intended to be used by end-users or that
they will not be able to do so is over the top, and is actually kind
of insulting. 


It is not meant to be insulting, but I still believe it to be true. 
After years of working with users on backup problems I think I have a 
pretty good bead on what the vast majority of admins are capable of 
and/or willing to do. Making this feature work is pretty high above that 
bar.


If the primary motivation is to provide a feature that can be integrated 
with third party tools, as Tomas suggests, then I guess usability is 
somewhat moot. But you are insisting that is not the case and I just 
don't see it that way.



There has been more enthusiasm for this feature on this
mailing list and elsewhere than I've gotten for anything I've
developed in years. And I don't think that's because all of the people
who have expressed enthusiasm are silly geese who don't understand how
terrible it is.


No doubt there is enthusiasm. It's a great feature to have. In 
particular I think the WAL summarizer is cool. But I do think the 
shortcomings are significant and that will become very apparent when 
people start to implement. The last minute effort to add COW support is 
an indication of problems that people will see in the field.


Further, I do think some less than ideal design decisions were made. In 
particular, I think sidelining manifests, i.e. making them optional, is 
not a good choice. This has led directly to the issue we see in [1]. If 
we require a manifest to make an incremental backup, why make it 
optional for combine?


This same design decision has led us to have "marker files" for 
zero-length files and unchanged files, which just seems extremely 
wasteful when these could be noted in the manifest. There are good 
reasons for writing e

Re: Add notes to pg_combinebackup docs

2024-04-11 Thread David Steele




On 4/11/24 20:51, Tomas Vondra wrote:

On 4/11/24 02:01, David Steele wrote:


I have a hard time seeing this feature as being very useful, especially
for large databases, until pg_combinebackup works on tar (and compressed
tar). Right now restoring an incremental requires at least twice the
space of the original cluster, which is going to take a lot of users by
surprise.


I do agree it'd be nice if pg_combinebackup worked with .tar directly,
without having to extract the directories first. No argument there, but
as I said in the other thread, I believe that's something we can add
later. That's simply how incremental development works.


OK, sure, but if the plan is to make it practical later doesn't that 
make the feature something to be avoided now?



I know you have made some improvements here for COW filesystems, but my
experience is that Postgres is generally not run on such filesystems,
though that is changing a bit.


I'd say XFS is a pretty common choice, for example. And it's one of the
filesystems that work great with pg_combinebackup.


XFS has certainly advanced more than I was aware.


However, who says this has to be the filesystem the Postgres instance
runs on? Who in their right mind put backups on the same volume as the
instance anyway? At which point it can be a different filesystem, even
if it's not ideal for running the database.


My experience is these days backups are generally placed in object 
stores. Sure, people are still using NFS but admins rarely have much 
control over those volumes. They may or may not be COW filesystems.



FWIW I think it's fine to tell users that to minimize the disk space
requirements, they should use a CoW filesystem and --copy-file-range.
The docs don't say that currently, that's true.


That would probably be a good addition to the docs.
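
Something along these lines, perhaps (a sketch; on a CoW filesystem such
as XFS with reflink or btrfs, the data blocks reused from the inputs
should not consume additional space):

$ pg_combinebackup --copy-file-range test/backup/full test/backup/incr1 \
      -o test/backup/combine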


All of this also depends on how people do the restore. With the CoW
stuff they can do a quick (and small) copy on the backup server, and
then copy the result to the actual instance. Or they can do restore on
the target directly (e.g. by mounting a r/o volume with backups), in
which case the CoW won't really help.


And again, this all requires a significant amount of setup and tooling. 
Obviously I believe good backup requires effort but doing this right 
gets very complicated due to the limitations of the tool.



But yeah, having to keep the backups as expanded directories is not
great, I'd love to have .tar. Not necessarily because of the disk space
(in my experience the compression in filesystems works quite well for
this purpose), but mostly because it's more compact and allows working
with backups as a single piece of data (e.g. it's much clearer what the
checksum of a single .tar is, compared to a directory).


But again, object stores are commonly used for backup these days and 
billing is based on data stored rather than any compression that can be 
done on the data. Of course, you'd want to store the compressed tars in 
the object store, but that does mean storing an expanded copy somewhere 
to do pg_combinebackup.


But if the argument is that all this can/will be fixed in the future, I 
guess the smart thing for users to do is wait a few releases for 
incremental backups to become a practical feature.


Regards,
-David




Re: post-freeze damage control

2024-04-11 Thread David Steele

On 4/11/24 20:26, Tomas Vondra wrote:


On 4/11/24 03:52, David Steele wrote:

On 4/11/24 10:23, Tom Kincaid wrote:


The extensive Beta process we have can be used to build confidence we
need in a feature that has extensive review and currently has no known
issues or outstanding objections.


I did have objections, here [1] and here [2]. I think the complexity,
space requirements, and likely performance issues involved in restores
are going to be a real problem for users. Some of these can be addressed
in future releases, but I can't escape the feeling that what we are
releasing here is half-baked.


I haven't been part of those discussions, and that part of the thread is
a couple months old already, so I'll share my view here instead.

I do not think it's half-baked. I certainly agree there are limitations,
and there's all kinds of bells and whistles we could add, but I think
the fundamental infrastructure is correct and a meaningful step forward.
Would I wish it to handle .tar for example? Sure I would. But I think
it's something we can add in the future - if we require all of this to
happen in a single release, it'll never happen.


Fair enough, but the current release is extremely limited and it would 
be best if that was well understood by users.



FWIW that discussion also mentions stuff that I think the feature should
not do. In particular, I don't think the ambition was (or should be) to
make pg_basebackup into a stand-alone tool. I always saw pg_basebackup
more as an interface to "backup steps" correctly rather than a complete
backup solution that'd manage backup registry, retention, etc.


Right -- this is exactly my issue. pg_basebackup was never easy to use 
as a backup solution and this feature makes it significantly more 
complicated. Complicated enough that it would be extremely difficult for 
most users to utilize in a meaningful way.


But they'll try because it is a new pg_basebackup feature and they'll 
assume it is there to be used. Maybe it would be a good idea to make it 
clear in the documentation that significant tooling will be required to 
make it work.


Regards,
-David




Re: post-freeze damage control

2024-04-10 Thread David Steele

On 4/11/24 10:23, Tom Kincaid wrote:


The extensive Beta process we have can be used to build confidence we 
need in a feature that has extensive review and currently has no known 
issues or outstanding objections.


I did have objections, here [1] and here [2]. I think the complexity, 
space requirements, and likely performance issues involved in restores 
are going to be a real problem for users. Some of these can be addressed 
in future releases, but I can't escape the feeling that what we are 
releasing here is half-baked.


Also, there are outstanding issues here [3] and now here [4].

Regards,
-David

[1] 
https://www.postgresql.org/message-id/590e3017-da1f-4af6-9bf0-1679511ca7e5%40pgmasters.net
[2] 
https://www.postgresql.org/message-id/11b38a96-6ded-4668-b772-40f992132797%40pgmasters.net
[3] 
https://www.postgresql.org/message-id/flat/05fb32c9-18d8-4f72-9af3-f41576c33119%40pgmasters.net#bb04b896f0f0147c10cee944a1391c1e
[4] 
https://www.postgresql.org/message-id/flat/9badd24d-5bd9-4c35-ba85-4c38a2feb73e%40pgmasters.net





pg_combinebackup does not detect missing files

2024-04-10 Thread David Steele

Hackers,

I've been playing around with the incremental backup feature trying to 
get a sense of how it can be practically used. One of the first things I 
always try is to delete random files and see what happens.


You can delete pretty much anything you want from the most recent 
incremental backup (not the manifest) and it will not be detected.


For example:

$ pg_basebackup -c fast -D test/backup/full -F plain
$ pg_basebackup -c fast -D test/backup/incr1 -F plain -i 
/home/dev/test/backup/full/backup_manifest


$ rm test/backup/incr1/base/1/INCREMENTAL.2675
$ rm test/backup/incr1/base/1/826
$ /home/dev/test/pg/bin/pg_combinebackup test/backup/full 
test/backup/incr1 -o test/backup/combine


$ ls test/backup/incr1/base/1/2675
No such file or directory
$ ls test/backup/incr1/base/1/826
No such file or directory

I can certainly use verify to check the backups individually:

$ /home/dev/test/pg/bin/pg_verifybackup /home/dev/test/backup/incr1
pg_verifybackup: error: "base/1/INCREMENTAL.2675" is present in the 
manifest but not on disk
pg_verifybackup: error: "base/1/826" is present in the manifest but not 
on disk


But I can't detect this issue if I use verify on the combined backup:

$ /home/dev/test/pg/bin/pg_verifybackup /home/dev/test/backup/combine
backup successfully verified

Maybe the answer here is to update the docs to specify that 
pg_verifybackup should be run on all backup directories before 
pg_combinebackup is run. Right now that is not at all clear.


In fact the docs say, "pg_combinebackup will attempt to verify that the 
backups you specify form a legal backup chain". Which I guess just means 
the chain is verified and not the files, but it seems easy to misinterpret.


Overall I think it is an issue that the combine is being driven from the 
most recent incremental directory rather than from the manifest.


Regards,
-David




Re: Add notes to pg_combinebackup docs

2024-04-10 Thread David Steele

On 4/9/24 19:44, Tomas Vondra wrote:


On 4/9/24 09:59, Martín Marqués wrote:

Hello,

While doing some work/research on the new incremental backup feature
some limitations were not listed in the docs. Mainly the fact that
pg_combinebackup works with plain format and not tar.



Right. The docs mostly imply this by talking about output directory and
backup directories, but making it more explicit would not hurt.

FWIW it'd be great if we could make incremental backups work with tar
format in the future too. People probably don't want to keep around the
expanded data directory, and extracting everything before combining the
backups is not very convenient. Reading and writing the tar would make
this simpler.


I have a hard time seeing this feature as being very useful, especially 
for large databases, until pg_combinebackup works on tar (and compressed 
tar). Right now restoring an incremental requires at least twice the 
space of the original cluster, which is going to take a lot of users by 
surprise.


I know you have made some improvements here for COW filesystems, but my 
experience is that Postgres is generally not run on such filesystems, 
though that is changing a bit.



Around the same time, Tomas Vondra tested incremental backups with a
cluster where he enabled checksums after taking the previous full
backup. After combining the backups the synthetic backup had pages
with checksums and other pages without checksums which ended in
checksum errors.


I'm not sure just documenting this limitation is sufficient. We can't
make the incremental backups work in this case (it's as if someone
messes with cluster without writing stuff into WAL), but I think we
should do better than silently producing (seemingly) corrupted backups.

I say seemingly, because the backup is actually fine, the only problem
is it has checksums enabled in the controlfile, but the pages from the
full backup (and the early incremental backups) have no checksums.

What we could do is detect this in pg_combinebackup, and either just
disable checksums with a warning and hint to maybe enable them again. Or
maybe just print that the user needs to disable them.

I was thinking maybe we could detect this while taking the backups, and
force taking a full backup if checksums got enabled since the last
backup. But we can't do that because we only have the manifest from the
last backup, and the manifest does not include info about checksums.


I'd say making a new full backup is the right thing to do in this case. 
It should be easy enough to store the checksum state of the cluster in 
the manifest.
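
In the meantime, as an illustration only (not something pg_combinebackup 
does today), the mismatch can at least be detected and worked around by 
hand. The directory name here is just an example:

$ pg_controldata test/backup/combine | grep checksum
Data page checksum version:           1

# If the older backups in the chain were taken before checksums were
# enabled, one manual workaround is to disable them on the combined
# directory (which must not be running):
$ pg_checksums --disable -D test/backup/combine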


Regards,
-David




Re: post-freeze damage control

2024-04-10 Thread David Steele




On 4/10/24 09:50, Michael Paquier wrote:

On Wed, Apr 10, 2024 at 09:29:38AM +1000, David Steele wrote:

Even so, only keeping WAL for the last backup is a dangerous move in any
case. Lots of things can happen to a backup (other than bugs in the
software) so keeping WAL back to the last full (or for all backups) is
always an excellent idea.


Yeah, that's an excellent practice, which is why I'm less worried for
this feature.  The docs at [1] caution about "not to remove earlier
backups if they might be needed when restoring later incremental
backups".  Like Alvaro said, should we insist a bit more about the WAL
retention part in this section of the docs, down to the last full
backup?


I think that would make sense in general. But if we are doing it because 
we lack confidence in the incremental backup feature maybe that's a sign 
that the feature should be released as experimental (or not released at 
all).


Regards,
-David




Re: post-freeze damage control

2024-04-09 Thread David Steele

On 4/10/24 01:59, Andrey M. Borodin wrote:



On 9 Apr 2024, at 18:45, Alvaro Herrera  wrote:

Maybe we should explicitly advise users to not delete that WAL from
their archives, until pg_combinebackup is hammered a bit more.


As a backup tool maintainer, I always refer to the out-of-the-box Postgres 
tools as a bulletproof alternative.
I really would like to stick to this reputation and not discredit these tools.


+1.

Even so, only keeping WAL for the last backup is a dangerous move in any 
case. Lots of things can happen to a backup (other than bugs in the 
software) so keeping WAL back to the last full (or for all backups) is 
always an excellent idea.
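
For what it's worth, that is the default behavior in pgBackRest. A sketch of 
the relevant configuration, with the retention value being purely an example:

# keep two full backups plus the WAL required to restore any retained backup
repo1-retention-full=2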


Regards,
-David




Re: Possibility to disable `ALTER SYSTEM`

2024-03-20 Thread David Steele

On 3/20/24 22:30, Michael Banck wrote:


On Tue, Mar 19, 2024 at 10:51:50AM -0400, Tom Lane wrote:

Heikki Linnakangas  writes:

Perhaps we could make that even better with a GUC though. I propose a
GUC called 'configuration_managed_externally = true / false'. If you set
it to true, we prevent ALTER SYSTEM and make the error message more
definitive:



postgres=# ALTER SYSTEM SET wal_level TO minimal;
ERROR:  configuration is managed externally



As a bonus, if that GUC is set, we could even check at server startup
that all the configuration files are not writable by the postgres user,
and print a warning or refuse to start up if they are.


I like this idea.  The "bonus" is not optional though, because
setting the files' ownership/permissions is the only way to be
sure that the prohibition is even a little bit bulletproof.


Isn't this going to break pgbackrest restore then, which (AIUI, and was
mentioned upthread) writes recovery configs into postgresql.auto.conf?
Or do I misunderstand the proposal? I think it would be awkward if only
root users are able to run pgbackrest restore. I have added David to the
CC list to make him aware of this, in case he was not following this
thread.


It doesn't sound like people are in favor of requiring read-only 
permissions for postgresql.auto.conf, but in any case it would not be a 
big issue for pgBackRest or other backup solutions as far as I can see.


pgBackRest stores all permissions and ownership so a restore by the user 
will bring everything back just as it was. Restoring as root sounds bad 
on the face of it, but for managed environments like k8s it would not be 
all that unusual.


There is also the option of restoring and then modifying permissions 
later, or in pgBackRest using the --type=preserve option to leave 
postgresql.auto.conf as it is. Permissions could also be updated before 
the backup tool is run and then set back.


Since this feature is intended for managed environments, scripting these 
kinds of changes should be pretty easy and not a barrier to using any 
backup tool.
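
For example (purely illustrative -- the stanza name and paths are made up), 
a restore wrapper in such an environment might be as simple as:

# restore, which writes recovery settings into postgresql.auto.conf
pgbackrest --stanza=main restore

# then re-apply the permissions/ownership the external configuration
# management expects
chmod 0440 "$PGDATA"/postgresql.auto.conf
chown root: "$PGDATA"/postgresql.auto.conf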


Regards,
-David




Re: Add basic tests for the low-level backup method.

2024-03-14 Thread David Steele

On 3/15/24 18:32, Michael Paquier wrote:

On Fri, Mar 15, 2024 at 06:23:15PM +1300, David Steele wrote:

Well, this is what we recommend in the docs, i.e. using bin mode to save
backup_label, so it seems OK to me.


Indeed, I didn't notice that this is actually documented, so what I
did took the right angle.  French flair, perhaps..


This seems like a reasonable explanation to me.

-David




Re: Add basic tests for the low-level backup method.

2024-03-14 Thread David Steele

On 3/15/24 12:38, Michael Paquier wrote:

On Fri, Mar 15, 2024 at 09:40:38AM +1300, David Steele wrote:

Is the missing test in meson the reason we did not see test failures for
Windows in CI?


The test has to be listed in src/test/recovery/meson.build or the CI
would ignore it.


Right -- I will keep this in mind for the future.


The second LOG is something that can be acted on.  I've added some
debugging to the parsing of the backup_label file in the backend, and
noticed that the first fscanf() for START WAL LOCATION is failing
because the last %c is detected as \r rather than \n.  Tweaking the
contents stored from pg_backup_stop() with a sed won't help, because
the issue is that we write the CRLFs with append_to_file, and the
startup process cannot cope with that.  The simplest method I can
think of is to use binmode, as of the attached.


Yeah, that makes sense.


I am wondering if there is a better trick here that would not require
changes in the backend to make the backup_label parsing more flexible,
though.


Well, this is what we recommend in the docs, i.e. using bin mode to save 
backup_label, so it seems OK to me.



I am attaching an updated patch with all that fixed, which is stable
in the CI and any tests I've run.  Do you have any comments about


These changes look good to me. Sure wish we had an easier way to test
commits in the build farm.


That's why these tests are not that easy, they can be racy.  I've run
the test 5~10 times in the CI this time to gain more confidence, and
saw zero failures with the stability fixes in place including Windows.
I've applied it now, as I can still monitor the buildfarm for a few
more days.  Let's see what happens, but that should be better.


At least sidewinder is happy now -- and the build farm in general as far 
as I can see.


Thank you for your help on this!
-David




Re: Add basic tests for the low-level backup method.

2024-03-14 Thread David Steele

On 3/14/24 20:00, Michael Paquier wrote:

On Thu, Mar 14, 2024 at 09:12:52AM +1300, David Steele wrote:

I think you are right that the start message is better since it can only
appear once when the backup_label is found. The completed message could in
theory appear after a restart, though the backup_label must have been found
at some point.


So, I've given a try to this patch with 99b4a63bef94, to note that
sidewinder failed because of a timing issue on Windows: the recovery
of the node without backup_label, expected to fail, would try to
backup the last segment it has replayed, because, as it has no
backup_label, it behaves like the primary.  It would try to use the
same archive location as the primary, leading to a conflict failure on
Windows.  This one was easy to fix, by overwriting postgresql.conf on
the node to not do archiving.


Hmmm, I wonder why this did not show up in the Windows tests on CI?


Following that, I've noticed a second race condition: we don't wait
for the segment after pg_switch_wal() to be archived.  This one can be
easily avoided with a poll on pg_stat_archiver.


Ugh, yeah, good change.


After that, also because I've initially managed to, cough, forget an
update of meson.build to list the new test, I've noticed a third
failure on Windows for the case of the node that has a backup_label.
Here is one of the failures:
https://cirrus-ci.com/task/5245341683941376


Is the missing test in meson the reason we did not see test failures for 
Windows in CI?



regress_log_042_low_level_backup and
042_low_level_backup_replica_success.log have all the information
needed, that can be summarized like that:
The system cannot find the file specified.
2024-03-14 06:02:37.670 GMT [560][startup] FATAL:  invalid data in file 
"backup_label"

The first message is something new to me, that seems to point to a
corruption failure of the file system.  Why don't we see something
similar in other tests, then?  Leaving that aside..

The second LOG is something that can be acted on.  I've added some
debugging to the parsing of the backup_label file in the backend, and
noticed that the first fscanf() for START WAL LOCATION is failing
because the last %c is detected as \r rather than \n.  Tweaking the
contents stored from pg_backup_stop() with a sed won't help, because
the issue is that we write the CRLFs with append_to_file, and the
startup process cannot cope with that.  The simplest method I can
think of is to use binmode, as of the attached.


Yeah, that makes sense.


It is the first time that we'd take the contents received from a
BackgroundPsql and write them to a file parsed by the backend, so
perhaps we should try to do that in a more general way, but I'm not
sure how, tbh, and the case of this test is special while adding
handling for \r when reading the backup_label got discussed in the
past but we were OK with what we are doing now on HEAD.


I think it makes sense to leave the parsing code as is and make the 
change in the test. If we add more tests to this module we'll probably 
need a function to avoid repeating code.



On top of all that, note that I have removed remove_tree as I am not
sure if this would be OK in all the buildfarm animals, added a quit()
for BackgroundPsql, moved queries to use less BackgroundPsql, as well
as a few other things like avoiding the hardcoded segment names.
meson.build is.. Cough.. Updated now.


OK.


I am attaching an updated patch with all that fixed, which is stable
in the CI and any tests I've run.  Do you have any comments about


These changes look good to me. Sure wish we had an easier way to test 
commits in the build farm.


Regards,
-David




Re: Add basic tests for the low-level backup method.

2024-03-13 Thread David Steele

On 3/13/24 19:15, Michael Paquier wrote:

On Wed, Mar 13, 2024 at 01:12:28PM +1300, David Steele wrote:


Not sure what to look for here. There are no distinct messages for crash
recovery. Perhaps there should be?


The closest thing I can think of here would be "database system was
not properly shut down; automatic recovery in progress" as we don't
have InArchiveRecovery, after checking that the canary is missing.  If
you don't like this suggestion, feel free to say so, of course :)


That works for me. I think I got it confused with "database system was 
interrupted..." when I was looking at the success vs. fail logs.



Sure, I added a check for the new log message when recovering with a
backup_label.


+ok($node_replica->log_contains('completed backup recovery with redo LSN'),
+   'verify backup recovery performed with backup_label');

Okay for this choice.  I was thinking first about "starting backup
recovery with redo LSN", closer to the area where the backup_label is
read.


I think you are right that the start message is better since it can only 
appear once when the backup_label is found. The completed message could 
in theory appear after a restart, though the backup_label must have been 
found at some point.


Regards,
-David

From 91f8e8a390713760116990be05d96f8a7e0d1bde Mon Sep 17 00:00:00 2001
From: David Steele 
Date: Wed, 13 Mar 2024 20:02:40 +
Subject: Add basic tests for the low-level backup method.

There are currently no tests for the low-level backup method. pg_backup_start()
and pg_backup_stop() are called but not exercised in any real fashion.

There is a lot more that can be done, but this at least provides some basic
tests and provides a place for future improvement.
---
 src/test/recovery/t/042_low_level_backup.pl | 114 
 1 file changed, 114 insertions(+)
 create mode 100644 src/test/recovery/t/042_low_level_backup.pl

diff --git a/src/test/recovery/t/042_low_level_backup.pl 
b/src/test/recovery/t/042_low_level_backup.pl
new file mode 100644
index 00..76bac7e324
--- /dev/null
+++ b/src/test/recovery/t/042_low_level_backup.pl
@@ -0,0 +1,114 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test low-level backup method by using pg_backup_start() and pg_backup_stop()
+# to create backups.
+
+use strict;
+use warnings;
+
+use File::Copy qw(copy);
+use File::Path qw(remove_tree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Start primary node with archiving
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(has_archiving => 1, allows_streaming => 1);
+$node_primary->start;
+
+# Start backup
+my $backup_name = 'backup1';
+my $psql = $node_primary->background_psql('postgres');
+
+$psql->query_safe("SET client_min_messages TO WARNING");
+$psql->set_query_timer_restart();
+$psql->query_safe("select pg_backup_start('test label')");
+
+# Copy files
+my $backup_dir = $node_primary->backup_dir() . '/' . $backup_name;
+
+PostgreSQL::Test::RecursiveCopy::copypath(
+   $node_primary->data_dir(), $backup_dir);
+
+# Cleanup some files/paths that should not be in the backup. There is no
+# attempt to replicate all the exclusions done by pg_basebackup here, in part because
+# they are not required, but also to keep the test simple.
+#
+# Also remove pg_control because it needs to be copied later and it will be
+# obvious if the copy fails.
+unlink("$backup_dir/postmaster.pid")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.pid");
+unlink("$backup_dir/postmaster.opts")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.opts");
+unlink("$backup_dir/global/pg_control")
+   or BAIL_OUT("unable to unlink $backup_dir/global/pg_control");
+
+remove_tree("$backup_dir/pg_wal", {keep_root => 1})
+   or BAIL_OUT("unable to unlink contents of $backup_dir/pg_wal");
+
+# Create a table that will be used to verify that recovery started at the
+# correct location, rather than the location recorded in pg_control
+$psql->query_safe("create table canary (id int)");
+
+# Advance the checkpoint location in pg_control past the location where the
+# backup started. Switch the WAL to make it really obvious that the location
+# is different and to put the checkpoint in a new WAL segment.
+$psql->query_safe("select pg_switch_wal()");
+$psql->query_safe("checkpoint");
+
+# Copy pg_control last so it contains the new checkpoint
+copy($node_primary->data_dir() . '/global/pg_control',
+   "$backup_dir/global/pg_control")
+   or BAIL_OUT("unable to copy global/pg_control");
+
+# Stop backup and get backup_label
+my $backup_label = $psql->query_safe("select labelfile from pg_backup_stop()");
+
+# Rather than writing out backu

Re: Add basic tests for the low-level backup method.

2024-03-12 Thread David Steele

On 2/29/24 16:55, Michael Paquier wrote:

On Thu, Feb 29, 2024 at 10:30:52AM +1300, David Steele wrote:

On 11/12/23 08:21, David Steele wrote:

As promised in [1], attached are some basic tests for the low-level
backup method.


Added to the 2024-03 CF.


There is already 040_standby_failover_slots_sync.pl in recovery/ that
uses the number of your test script.  You may want to bump it, that's
a nit.


Renamed to 042_low_level_backup.pl.


+unlink("$backup_dir/postmaster.pid")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.pid");
+unlink("$backup_dir/postmaster.opts")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.opts");
+unlink("$backup_dir/global/pg_control")
+   or BAIL_OUT("unable to unlink $backup_dir/global/pg_control");

RELCACHE_INIT_FILENAME as well?


I'm not trying to implement the full exclusion list here, just enough to 
get the test working. Since exclusions are optional according to the 
docs I don't think we need them for a valid test.



+# Rather than writing out backup_label, try to recover the backup without
+# backup_label to demonstrate that recovery will not work correctly without it,
+# i.e. the canary table will be missing and the cluster will be corrupt. Provide
+# only the WAL segment that recovery will think it needs.

Okay, why not.  No objections to this addition.  I am a bit surprised
that this is not scanning some of the logs produced by the startup
process for particular patterns.


Not sure what to look for here. There are no distinct messages for crash 
recovery. Perhaps there should be?



+# Save backup_label into the backup directory and recover using the primary's
+# archive. This time recovery will succeed and the canary table will be
+# present.

Here are well, I think that we should add some log_contains() with
pre-defined patterns to show that recovery has completed the way we
want it with a backup_label up to the end-of-backup record.


Sure, I added a check for the new log message when recovering with a 
backup_label.


Regards,
-David

From 076429a578ece3c213525220a4961cb5b56c190a Mon Sep 17 00:00:00 2001
From: David Steele 
Date: Wed, 13 Mar 2024 00:01:55 +
Subject: Add basic tests for the low-level backup method.

There are currently no tests for the low-level backup method. pg_backup_start()
and pg_backup_stop() are called but not exercised in any real fashion.

There is a lot more that can be done, but this at least provides some basic
tests and provides a place for future improvement.
---
 src/test/recovery/t/042_low_level_backup.pl | 108 
 1 file changed, 108 insertions(+)
 create mode 100644 src/test/recovery/t/042_low_level_backup.pl

diff --git a/src/test/recovery/t/042_low_level_backup.pl 
b/src/test/recovery/t/042_low_level_backup.pl
new file mode 100644
index 00..22668bdc0d
--- /dev/null
+++ b/src/test/recovery/t/042_low_level_backup.pl
@@ -0,0 +1,108 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test low-level backup method by using pg_backup_start() and pg_backup_stop()
+# to create backups.
+
+use strict;
+use warnings;
+
+use File::Copy qw(copy);
+use File::Path qw(remove_tree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Start primary node with archiving
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(has_archiving => 1, allows_streaming => 1);
+$node_primary->start;
+
+# Start backup
+my $backup_name = 'backup1';
+my $psql = $node_primary->background_psql('postgres');
+
+$psql->query_safe("SET client_min_messages TO WARNING");
+$psql->set_query_timer_restart();
+$psql->query_safe("select pg_backup_start('test label')");
+
+# Copy files
+my $backup_dir = $node_primary->backup_dir() . '/' . $backup_name;
+
+PostgreSQL::Test::RecursiveCopy::copypath(
+   $node_primary->data_dir(), $backup_dir);
+
+# Cleanup some files/paths that should not be in the backup. There is no
+# attempt to replicate all the exclusions done by pg_basebackup here, in part because
+# they are not required, but also to keep the test simple.
+#
+# Also remove pg_control because it needs to be copied later and it will be
+# obvious if the copy fails.
+unlink("$backup_dir/postmaster.pid")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.pid");
+unlink("$backup_dir/postmaster.opts")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.opts");
+unlink("$backup_dir/global/pg_control")
+   or BAIL_OUT("unable to unlink $backup_dir/global/pg_control");
+
+remove_tree("$backup_dir/pg_wal", {keep_root => 1})
+   or BAIL_OUT("unable to unlink contents of $backup_dir/pg_wal");
+
+# Create a table that will be used to verify that recovery started at the
+# correct location, rather than the location recorded 

Re: Add checkpoint/redo LSNs to recovery errors.

2024-03-09 Thread David Steele

On 2/29/24 16:42, Michael Paquier wrote:

On Thu, Feb 29, 2024 at 10:53:15AM +1300, David Steele wrote:

This patch adds checkpoint/redo LSNs to recovery error messages where they
may be useful for debugging.


I've scanned a bit of xlogrecovery.c, and I have spotted a couple of places that
could gain more information.

 ereport(PANIC,
 (errmsg("invalid redo record in shutdown checkpoint")));
[...]
 ereport(PANIC,
 (errmsg("invalid redo in checkpoint record")));
These two could mention CheckPointLoc and checkPoint.redo.

 ereport(PANIC,
 (errmsg("invalid next transaction ID")));
Perhaps some XID information could be added here?

 ereport(FATAL,
 (errmsg("WAL ends before consistent recovery point")));
[...]
 ereport(FATAL,
 (errmsg("WAL ends before end of online backup"),

These two are in xlog.c, and don't mention backupStartPoint for the
first one.  Perhaps there's a point in adding some information about
EndOfLog and LocalMinRecoveryPoint as well?


For now I'd like to just focus on these three messages that are related 
to a missing backup_label or a misconfiguration of restore_command when 
backup_label is present.


No doubt there are many other recovery log messages that could be 
improved, but I'd rather not bog down the patch.


Regards,
-David




Re: Add recovery to pg_control and remove backup_label

2024-03-09 Thread David Steele

On 1/29/24 12:28, David Steele wrote:

On 1/28/24 19:11, Michael Paquier wrote:

On Fri, Jan 26, 2024 at 06:27:30PM +0530, vignesh C wrote:

Please post an updated version for the same.

[1] - http://cfbot.cputube.org/patch_46_3511.log


With the recent introduction of incremental backups that depend on
backup_label and the rather negative feedback received, I think that
it would be better to return this entry as RwF for now.  What do you
think?


I've been thinking it makes little sense to update the patch. It would 
be a lot of work with all the new changes for incremental backup and 
since Andres and Robert appear to be very against the idea, I doubt it 
would be worth the effort.


I've had a new idea which may revive this patch. The basic idea is to 
keep backup_label but also return a copy of pg_control from 
pg_backup_stop(). This copy of pg_control would be safe from tears and 
have a backupLabelRequired field set (as Andres suggested) so recovery 
cannot proceed without the backup label.


So, everything will continue to work as it does now. But, backup 
software can be enhanced to write the improved pg_control that is 
guaranteed not to be torn and has protection against a missing backup label.


Of course, pg_basebackup will write the new backupLabelRequired field 
into pg_control, but this way third party software can also gain 
advantages from the new field.


Thoughts?

Regards,
-David




Add checkpoint/redo LSNs to recovery errors.

2024-02-28 Thread David Steele

Hackers,

This patch adds checkpoint/redo LSNs to recovery error messages where 
they may be useful for debugging.


When backup_label is present the LSNs are already output in a log 
message, but it still seems like a good idea to repeat them.


When backup_label is not present, the checkpoint LSN is not logged 
unless backup recovery is in progress or the checkpoint is found. In the 
case where a backup is restored but the backup_label is missing, the 
checkpoint LSN is not logged so it is useful for debugging to have it in 
the error message.


Regards,
-David

From 9678d0cd928c58e4a3d038c8aa4c191f5bdddbe7 Mon Sep 17 00:00:00 2001
From: David Steele 
Date: Wed, 28 Feb 2024 21:45:39 +
Subject: Add checkpoint/redo LSNs to recovery errors.

When backup_label is present the LSNs are already output in a log message, but
it still seems like a good idea to repeat them.

When backup_label is not present, the checkpoint LSN is not logged unless backup
recovery is in progress or the checkpoint is found. In the case where a backup
is restored but the backup_label is missing, the checkpoint LSN is not logged so
it is useful for debugging to have it in the error message.
---
 src/backend/access/transam/xlogrecovery.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c 
b/src/backend/access/transam/xlogrecovery.c
index d73a49b3e8..baf1765f5e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -647,7 +647,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool 
*wasShutdown_ptr,
if (!ReadRecord(xlogprefetcher, LOG, false,

checkPoint.ThisTimeLineID))
ereport(FATAL,
-   (errmsg("could not find 
redo location referenced by checkpoint record"),
+   (errmsg("could not find 
redo location %X/%X referenced by checkpoint record at %X/%X",
+   
LSN_FORMAT_ARGS(checkPoint.redo), LSN_FORMAT_ARGS(CheckPointLoc)),
 errhint("If you are 
restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" 
and add required recovery options.\n"
 "If 
you are not restoring from a backup, try removing the file 
\"%s/backup_label\".\n"
 "Be 
careful: removing \"%s/backup_label\" will result in a corrupt cluster if 
restoring from a backup.",
@@ -657,7 +658,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool 
*wasShutdown_ptr,
else
{
ereport(FATAL,
-   (errmsg("could not locate required 
checkpoint record"),
+   (errmsg("could not locate required 
checkpoint record at %X/%X",
+   
LSN_FORMAT_ARGS(CheckPointLoc)),
 errhint("If you are restoring from a 
backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required 
recovery options.\n"
 "If you are not 
restoring from a backup, try removing the file \"%s/backup_label\".\n"
 "Be careful: removing 
\"%s/backup_label\" will result in a corrupt cluster if restoring from a 
backup.",
@@ -791,7 +793,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool 
*wasShutdown_ptr,
 * simplify processing around checkpoints.
 */
ereport(PANIC,
-   (errmsg("could not locate a valid 
checkpoint record")));
+   (errmsg("could not locate a valid 
checkpoint record at %X/%X",
+   
LSN_FORMAT_ARGS(CheckPointLoc;
}
	memcpy(&checkPoint, XLogRecGetData(xlogreader), 
sizeof(CheckPoint));
wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == 
XLOG_CHECKPOINT_SHUTDOWN);
-- 
2.34.1



Re: Add basic tests for the low-level backup method.

2024-02-28 Thread David Steele

On 11/12/23 08:21, David Steele wrote:


As promised in [1], attached are some basic tests for the low-level 
backup method.


Added to the 2024-03 CF.

Regards,
-David




Re: Use of backup_label not noted in log

2024-01-29 Thread David Steele

On 1/28/24 20:09, Michael Paquier wrote:

On Fri, Jan 26, 2024 at 12:08:46PM +0900, Michael Paquier wrote:

Well, I'm OK with this consensus on 1d35f705e if folks think this is
useful enough for all the stable branches.


I have done that down to REL_15_STABLE for now as this is able to
apply cleanly there.  Older branches have a lack of information here,
actually, because read_backup_label() does not return the TLI
retrieved from the start WAL segment, so we don't have the whole
package of information.


I took a pass at this on PG14 and things definitely look a lot different 
back there. Not only is the timeline missing, but there are two sections 
of code for ending a backup, one for standby backup and one for primary.


I'm satisfied with the back patches as they stand, unless anyone else 
wants to have a look.


Regards,
-David




Re: Add recovery to pg_control and remove backup_label

2024-01-28 Thread David Steele

On 1/28/24 19:11, Michael Paquier wrote:

On Fri, Jan 26, 2024 at 06:27:30PM +0530, vignesh C wrote:

Please post an updated version for the same.

[1] - http://cfbot.cputube.org/patch_46_3511.log


With the recent introduction of incremental backups that depend on
backup_label and the rather negative feedback received, I think that
it would be better to return this entry as RwF for now.  What do you
think?


I've been thinking it makes little sense to update the patch. It would 
be a lot of work with all the new changes for incremental backup and 
since Andres and Robert appear to be very against the idea, I doubt it 
would be worth the effort.


I have withdrawn the patch.

Regards,
-David




Re: Use of backup_label not noted in log

2024-01-26 Thread David Steele




On 1/25/24 20:52, Michael Paquier wrote:

On Thu, Jan 25, 2024 at 08:56:52AM -0400, David Steele wrote:

I would still advocate for a back patch here. It is frustrating to get logs
from users that just say:

LOG:  invalid checkpoint record
PANIC:  could not locate a valid checkpoint record

It would be very helpful to know what the checkpoint record LSN was in this
case.


Yes, I've pested over this one in the past when debugging corruption
issues.  To me, this would just mean to append to the PANIC an "at
%X/%X", but perhaps you have more in mind for these code paths?


I think adding the LSN to the panic message would be a good change for HEAD.

However, that still would not take the place of the additional messages 
in 1d35f705e showing that the LSN came from a backup_label.


Regards,
-David




Re: Use of backup_label not noted in log

2024-01-25 Thread David Steele

On 1/25/24 17:42, Tom Lane wrote:

David Steele  writes:

Another thing to note here -- knowing the LSN is important but also
knowing that backup recovery was attempted (i.e. backup_label exists) is
really crucial. Knowing both just saves so much time in back and forth
debugging.



It appears the tally for back patching is:



For: Andres, David, Michael B
Not Sure: Robert, Laurenz, Michael P



It seems at least nobody is dead set against it.


We're talking about 1d35f705e, right?  That certainly looks harmless
and potentially useful.  I'm +1 for back-patching.


That's the one. If we were modifying existing messages I would be 
against it, but new, infrequent (but oh so helpful) messages seem fine.


Regards,
-David




Re: Use of backup_label not noted in log

2024-01-25 Thread David Steele

On 1/25/24 09:29, Michael Banck wrote:

Hi,

On Thu, Jan 25, 2024 at 08:56:52AM -0400, David Steele wrote:

I would still advocate for a back patch here. It is frustrating to get logs
from users that just say:

LOG:  invalid checkpoint record
PANIC:  could not locate a valid checkpoint record

It would be very helpful to know what the checkpoint record LSN was in this
case.


I agree.


Another thing to note here -- knowing the LSN is important but also 
knowing that backup recovery was attempted (i.e. backup_label exists) is 
really crucial. Knowing both just saves so much time in back and forth 
debugging.


It appears the tally for back patching is:

For: Andres, David, Michael B
Not Sure: Robert, Laurenz, Michael P

It seems at least nobody is dead set against it.

Regards,
-David




Re: Use of backup_label not noted in log

2024-01-25 Thread David Steele

On 1/25/24 04:12, Michael Paquier wrote:

On Mon, Jan 22, 2024 at 04:36:27PM +0900, Michael Paquier wrote:

+   if (ControlFile->backupStartPoint != InvalidXLogRecPtr)

Nit 1: I would use XLogRecPtrIsInvalid here.

+   ereport(LOG,
+   (errmsg("completed backup recovery with redo LSN %X/%X",
+   LSN_FORMAT_ARGS(oldBackupStartPoint;

Nit 2: How about adding backupEndPoint in this LOG?  That would give:
"completed backup recovery with redo LSN %X/%X and end LSN %X/%X".


Hearing nothing, I've just applied a version of the patch with these
two modifications on HEAD.  If this needs tweaks, just let me know.


I had planned to update the patch this morning -- so thanks for doing 
that. I think having the end point in the message makes perfect sense.


I would still advocate for a back patch here. It is frustrating to get 
logs from users that just say:


LOG:  invalid checkpoint record
PANIC:  could not locate a valid checkpoint record

It would be very helpful to know what the checkpoint record LSN was in 
this case.
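
Today about the best we can do is ask the user to run something like this 
against the restored data directory and compare it to what is actually in 
their WAL archive (the values shown are just an example):

$ pg_controldata $PGDATA | grep -E 'checkpoint location|REDO location'
Latest checkpoint location:           5/D4000060
Latest checkpoint's REDO location:    5/D4000028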


Regards,
-David




Re: Use of backup_label not noted in log

2024-01-19 Thread David Steele

On 11/20/23 15:08, David Steele wrote:

On 11/20/23 14:27, Andres Freund wrote:

I've wondered whether it's worth also adding an explicit message just after
ReachedEndOfBackup(), but it seems far less urgent due to the existing
"consistent recovery state reached at %X/%X" message.


I think the current message is sufficient, but what do you have in mind?


Well, the consistency message is emitted after every restart. Whereas a single
instance only should go through backup recovery once. So it seems worthwhile
to differentiate the two in log messages.


Ah, right. That works for me, then.


Any status on this patch? If we do back patch it would be nice to see 
this in the upcoming minor releases. I'm in favor of a back patch, as I 
think this is minimally invasive and would be very useful for debugging 
recovery issues.


I like the phrasing you demonstrated in [1] but it doesn't seem like 
there's a new patch for that, so I have attached one.


Happy to do whatever else I can to get this across the line.

Regards,
-David

---

[1] 
https://www.postgresql.org/message-id/20231120183633.c4lhoq4hld4u56dd%40awork3.anarazel.de

diff --git a/src/backend/access/transam/xlogrecovery.c 
b/src/backend/access/transam/xlogrecovery.c
index 1b48d7171a..c0918e6c75 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -603,6 +603,22 @@ InitWalRecovery(ControlFileData *ControlFile, bool 
*wasShutdown_ptr,
if (StandbyModeRequested)
EnableStandbyMode();
 
+   /*
+* Omitting backup_label when creating a new replica, PITR node 
etc.
+* unfortunately is a common cause of corruption. Logging that
+* backup_label was used makes it a bit easier to exclude that 
as the
+* cause of observed corruption.
+*
+* Do so before we try to read the checkpoint record (which can 
fail),
+* as otherwise it can be hard to understand why a checkpoint 
other
+* than ControlFile->checkPoint is used.
+*/
+   ereport(LOG,
+   (errmsg("starting backup recovery with redo LSN 
%X/%X, checkpoint LSN %X/%X, on timeline ID %u",
+   LSN_FORMAT_ARGS(RedoStartLSN),
+   LSN_FORMAT_ARGS(CheckPointLoc),
+   CheckPointTLI)));
+
/*
 * When a backup_label file is present, we want to roll forward 
from
 * the checkpoint it identifies, rather than using pg_control.
@@ -742,6 +758,16 @@ InitWalRecovery(ControlFileData *ControlFile, bool 
*wasShutdown_ptr,
EnableStandbyMode();
}
 
+   /*
+* For the same reason as when starting up with backup_label 
present,
+* emit a log message when we continue initializing from a base
+* backup.
+*/
+   if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
+   ereport(LOG,
+   (errmsg("restarting backup recovery 
with redo LSN %X/%X",
+   
LSN_FORMAT_ARGS(ControlFile->backupStartPoint;
+
/* Get the last valid checkpoint record. */
CheckPointLoc = ControlFile->checkPoint;
CheckPointTLI = ControlFile->checkPointCopy.ThisTimeLineID;
@@ -2155,6 +2181,8 @@ CheckRecoveryConsistency(void)
if (!XLogRecPtrIsInvalid(backupEndPoint) &&
backupEndPoint <= lastReplayedEndRecPtr)
{
+   XLogRecPtr oldBackupStartPoint = backupStartPoint;
+
elog(DEBUG1, "end of backup reached");
 
/*
@@ -2165,6 +2193,10 @@ CheckRecoveryConsistency(void)
backupStartPoint = InvalidXLogRecPtr;
backupEndPoint = InvalidXLogRecPtr;
backupEndRequired = false;
+
+   ereport(LOG,
+   (errmsg("completed backup recovery with redo 
LSN %X/%X",
+   
LSN_FORMAT_ARGS(oldBackupStartPoint;
}
 
/*


Re: Detecting some cases of missing backup_label

2023-12-21 Thread David Steele

On 12/21/23 07:37, Andres Freund wrote:


On 2023-12-20 13:11:37 -0400, David Steele wrote:

I've run this through a bunch of scenarios (in my head) with parallel
backups and it does seem to hold up.

I think we'd need to write the state file before XLOG_BACKUP_START just in
case. Seems better to have an extra state file rather than have one be
missing.


That'd very significantly weaken the approach, afaict, because "external" base
base backup could end up copying those files. The whole point is to detect
broken procedures, so relying on such files being excluded from the base
backup seems like a bad idea.

I also see no need to do so - because we'd only verify that a backup start has
been replayed when replaying XLOG_BACKUP_STOP there's no danger in not
creating the files during XLOG_BACKUP_START, but doing so just before logging
the XLOG_BACKUP_STOP.


Ugh, I meant XLOG_BACKUP_STOP. So sounds like we are on the same page.


Probably we'd want to exclude *all* state files from backups, though.


I don't think so - I think we want the opposite? As noted above, I think in a
safety net like this we shouldn't assume that backup procedures were followed
correctly.


Fair enough.


Seems like in various PITR scenarios it could be hard to determine when to
remove them.


Why? I think we can basically remove the files when:

a) after the checkpoint during which XLOG_BACKUP_STOP was replayed - I think
we already have the infrastructure to queue file deletions that we can hook
into
b) when replaying a shutdown checkpoint / after creation of a shutdown
checkpoint


I thought about this some more. I *think* any state files a backup can 
see would have to be for XLOG_BACKUP_STOP records generated during the 
backup and they would get removed before the cluster had recovered to 
consistency.


I'd still prefer to exclude state files from the backup, but I agree 
there is no actual need to do so.


Regards,
-David




Re: Detecting some cases of missing backup_label

2023-12-20 Thread David Steele

On 12/18/23 10:39, Stephen Frost wrote:

Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:

* Andres Freund (and...@anarazel.de) wrote:

I recently mentioned to Robert (and also Heikki earlier), that I think I see a
way to detect an omitted backup_label in a relevant subset of the cases (it'd
apply to the pg_control as well, if we moved to that).  Robert encouraged me
to share the idea, even though it does not provide complete protection.


That would certainly be nice.


The subset I think we can address is the following:

a) An omitted backup_label would lead to corruption, i.e. without the
backup_label we won't start recovery at the right position. Obviously it'd
be better to also catch a wrong procedure when it'd not cause corruption -
perhaps my idea can be extended to handle that, with a small bit of
overhead.

b) The backup has been taken from a primary. Unfortunately that probably can't
be addressed - but the vast majority of backups are taken from a primary,
so I think it's still a worthwhile protection.


Agreed that this is a worthwhile set to try and address, even if we
can't address other cases.


Here's my approach

1) We add a XLOG_BACKUP_START WAL record when starting a base backup on a
primary, emitted just *after* the checkpoint completed

2) When replaying a base backup start record, we create a state file that
includes the corresponding LSN in the filename

3) On the primary, the state file for XLOG_BACKUP_START is *not* created at
that time. Instead the state file is created during pg_backup_stop().

4) When replaying a XLOG_BACKUP_END record, we verify that the state file
created by XLOG_BACKUP_START is present, and error out if not.  Backups
that started before the redo LSN from backup_label are ignored
(necessitates remembering that LSN, but we've been discussing that anyway).


Because the backup state file on the primary is only created during
pg_backup_stop(), a copy of the data directory taken between pg_backup_start()
and pg_backup_stop() does *not* contain the corresponding "backup state
file". Because of this, an omitted backup_label is detected if recovery does
not start early enough - recovery won't encounter the XLOG_BACKUP_START record
and thus would not create the state file, leading to an error in 4).


While I see the idea here, I think, doesn't it end up being an issue if
things happen like this:

pg_backup_start -> XLOG_BACKUP_START WAL written -> new checkpoint
happens -> pg_backup_stop -> XLOG_BACKUP_STOP WAL written -> crash

In that scenario, we'd go back to the new checkpoint (the one *after*
the checkpoint that happened before we wrote XLOG_BACKUP_START), start
replaying, and then hit the XLOG_BACKUP_STOP and then error out, right?
Even though we're actually doing crash recovery and everything should be
fine as long as we replay all of the WAL.


Andres and I discussed this in person at PGConf.eu and the idea is that
if we find a XLOG_BACKUP_STOP record then we can check if the state file
was written out and if so then we can conclude that we are doing crash
recovery and not restoring from a backup and therefore we don't error
out.  This also implies that we don't consider PG to be recovered at the
XLOG_BACKUP_STOP point, if the state file exists, but instead we have to
be sure to replay all WAL that's been written.  Perhaps we even
explicitly refuse to use restore_command in this case?

We do error out if we hit a XLOG_BACKUP_STOP and the state file
doesn't exist, as that implies that we started replaying from a point
after a XLOG_BACKUP_START record was written but are working from a copy
of the data directory which didn't include the state file.

Of course, we need to actually implement and test these different cases
to make sure it all works but I'm at least feeling better about the idea
and wanted to share that here.


I've run this through a bunch of scenarios (in my head) with parallel 
backups and it does seem to hold up.


I think we'd need to write the state file before XLOG_BACKUP_START just 
in case. Seems better to have an extra state file rather than have one 
be missing.


As you say, we'll need to store redo for the last recovered backup in 
pg_control. I guess it would be OK to remove that when the cluster is 
promoted. As long as recovery is going on, it seems like it would always be 
possible to hit an XLOG_BACKUP_STOP for an even longer-running backup.


I'm a little worried about what happens if a state file goes missing, 
but I guess that could be true of any file in PGDATA.


Probably we'd want to exclude *all* state files from backups, though. 
Seems like in various PITR scenarios it could be hard to determine when 
to remove them.


Regards,
-David




Re: Add recovery to pg_control and remove backup_label

2023-11-21 Thread David Steele

On 11/21/23 16:00, Andres Freund wrote:

Hi,

On 2023-11-21 14:48:59 -0400, David Steele wrote:

I'd not call 7.06->4.77 or 6.76->4.77 "virtually free".


OK, but how does that look with compression


With compression it's obviously somewhat different - but that part is done in
parallel, potentially on a different machine with client side compression,
whereas I think right now the checksumming is single-threaded, on the server
side.


Ah, yes, that's certainly a bottleneck.


With parallel server side compression, it's still 20% slower with the default
checksumming than none. With client side it's 15%.


Yeah, that still seems a lot. But to a large extent it sounds like a 
limitation of the current implementation.



-- to a remote location?


I think this one unfortunately makes checksums a bigger issue, not a smaller
one. The network interaction piece is single-threaded, adding another
significant use of CPU onto the same thread means that you are hit harder by
using substantial amount of CPU for checksumming in the same thread.

Once you go beyond the small instances, you have plenty network bandwidth in
cloud environments. We top out well before the network on bigger instances.


Uncompressed backup to local storage doesn't seem very realistic. With gzip
compression we measure SHA1 checksums at about 5% of total CPU.


IMO using gzip is basically infeasible for non-toy sized databases today. I
think we're doing our users a disservice by defaulting to it in a bunch of
places. Even if another default exposes them to difficulty due to potentially
using a different compiled binary with fewer supported compression methods -
that's gonna be very rare in practice.


Yeah, I don't use gzip anymore, but there are still some platforms that 
do not provide zstd (at least not easily) and lz4 compresses less. One 
thing people do seem to have is a lot of cores.



I can't overstate how valuable checksums are in finding corruption,
especially in long-lived backups.


I agree!  But I think we need faster checksum algorithms or a faster
implementation of the existing ones. And probably default to something faster
once we have it.


We've been using xxHash to generate checksums for our block-level 
incremental backups and it is seriously fast, written by the same guy who did 
zstd and lz4.
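
As a rough illustration, assuming the xxhsum utility that ships with xxHash 
is installed (timings will obviously vary by hardware and file size):

$ time sha1sum /path/to/1GB-relation-segment
$ time xxhsum -H1 /path/to/1GB-relation-segment    # XXH64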


Regards,
-David




Re: Add recovery to pg_control and remove backup_label

2023-11-21 Thread David Steele

On 11/21/23 13:59, Andres Freund wrote:


On 2023-11-21 13:41:15 -0400, David Steele wrote:

On 11/20/23 16:41, Andres Freund wrote:


On 2023-11-20 15:56:19 -0400, David Steele wrote:

I understand this is an option -- but does it need to be? What is the
benefit of excluding the manifest?


It's not free to create the manifest, particularly if checksums are enabled.


It's virtually free, even with the basic CRCs.


Huh?





I'd not call 7.06->4.77 or 6.76->4.77 "virtually free".


OK, but how does that look with compression -- to a remote location? 
Uncompressed backup to local storage doesn't seem very realistic. With 
gzip compression we measure SHA1 checksums at about 5% of total CPU. 
Obviously that goes up with zstd or lz4, but parallelism helps offset 
that cost, at least in clock time.


I can't overstate how valuable checksums are in finding corruption, 
especially in long-lived backups.



Anyway, would you really want a backup without a manifest? How would you
know something is missing? In particular, for page incremental how do you
know something is new (but not WAL logged) if there is no manifest? Is the
plan to just recopy anything not WAL logged with each incremental?


Shrug. If you just want to create a new standby by copying the primary, I
don't think creating and then validating the manifest buys you much. Long term
backups are a different story, particularly if data files are stored
individually, rather than in a single checksummed file.


Fine, but you are probably not using page incremental if just using 
pg_basebackup to create a standby. With page incremental, at least one 
of the backups will already exist, which argues for a manifest.



Also, for external backups, there's no manifest...


There certainly is a manifest for many external backup solutions. Not having
a manifest is just running with scissors, backup-wise.


You mean that you have an external solution gin up a backup manifest? I fail
to see how that's relevant here?


Just saying that for external backups there *is* often a manifest and it 
is a good thing to have.


Regards,
-David




Re: Add recovery to pg_control and remove backup_label

2023-11-21 Thread David Steele

On 11/20/23 16:37, Andres Freund wrote:


On 2023-11-20 11:11:13 -0500, Robert Haas wrote:

I think we need more votes to make a change this big. I have a
concern, which I think I've expressed before, that we keep whacking
around the backup APIs, and that has a cost which is potentially
larger than the benefits.


+1.  The amount of whacking around in this area has been substantial, and it's
hard for operators to keep up. And realistically, with data sizes today, the
pressure to do basebackups with disk snapshots etc is not going to shrink.


True enough, but disk snapshots aren't really backups in themselves, in 
most scenarios, because they reside on the same storage as the cluster. 
Of course, snapshots can be exported, but that's also expensive.


I see snapshots as an adjunct to backups -- a safe backup offsite 
somewhere for DR and snapshots for day to day operations. Even so, 
managing snapshots as backups is harder than people think. It is easy to 
get wrong and end up with silent corruption.



Leaving that concern aside, I am still on the fence about this proposal. I
think it does decrease the chance of getting things wrong in the
streaming-basebackup case. But for external backups, it seems almost
universally worse (with the exception of the torn pg_control issue, that we
also can address otherwise):


Why universally worse? The software stores pg_control instead of backup 
label. The changes to pg_basebackup were pretty trivial and the changes 
to external backup are pretty much the same, at least in my limited 
sample of one.


And I don't believe we have a satisfactory solution to the torn 
pg_control issue yet. Certainly it has not been committed and Thomas has 
shown enthusiasm for this approach, to the point of hoping it could be 
back patched (it can't).



It doesn't reduce the risk of getting things wrong, you can still omit placing
a file into the data directory and get silent corruption as a consequence. In
addition, it's harder to see when looking at a base backup whether the process
was right or not, because now the good and bad state look the same if you just
look on the filesystem level!


This is one of the reasons I thought writing just the first 512 bytes of 
pg_control would be valuable. It would give an easy indicator that 
pg_control came from a backup. Michael was not in favor of conflating 
that change with this patch -- but I still think it's a good idea.



Then there's the issue of making ad-hoc debugging harder by not having a
human readable file with information anymore, including when looking at the
history, via backup_label.old.


Yeah, you'd need to use pg_controldata instead. But as Michael has 
suggested, we could also write backup_label as backup_info so there is 
human-readable information available.



Given that, I wonder if what we should do is to just add a new field to
pg_control that says "error out if backup_label does not exist", that we set
when creating a streaming base backup


I'm not in favor of a change only accessible to pg_basebackup or 
external software that can manipulate pg_control.


Regards,
-David




Re: Add recovery to pg_control and remove backup_label

2023-11-21 Thread David Steele

On 11/20/23 16:41, Andres Freund wrote:


On 2023-11-20 15:56:19 -0400, David Steele wrote:

I understand this is an option -- but does it need to be? What is the
benefit of excluding the manifest?


It's not free to create the manifest, particularly if checksums are enabled.


It's virtually free, even with the basic CRCs. Anyway, would you really 
want a backup without a manifest? How would you know something is 
missing? In particular, for page incremental how do you know something 
is new (but not WAL logged) if there is no manifest? Is the plan to just 
recopy anything not WAL logged with each incremental?



Also, for external backups, there's no manifest...


There certainly is a manifest for many external backup solutions. Not 
having a manifest is just running with scissors, backup-wise.


Regards,
-David




Re: Add recovery to pg_control and remove backup_label

2023-11-21 Thread David Steele

On 11/21/23 12:41, Andres Freund wrote:


On 2023-11-21 07:42:42 -0400, David Steele wrote:

On 11/20/23 19:58, Andres Freund wrote:

On 2023-11-21 08:52:08 +0900, Michael Paquier wrote:

On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote:

Given that, I wonder if what we should do is to just add a new field to
pg_control that says "error out if backup_label does not exist", that we set
when creating a streaming base backup


That would mean that one still needs to take an extra step to update a
control file with this byte set, which is something you had a concern
with in terms of compatibility when it comes to external backup
solutions because more steps are necessary to take a backup, no?


I was thinking we'd just set it in the pg_basebackup style path, and we'd
error out if it's set and backup_label is present. But we'd still use
backup_label without the pg_control flag set.

So it'd just provide a cross-check that backup_label was not removed for
pg_basebackup style backup, but wouldn't do anything for external backups. But
imo the proposal to just use pg_control doesn't actually do anything for
external backups either - which is why I think my proposal would achieve as
much, for a much lower price.


I'm not sure why you think the patch under discussion doesn't do anything
for external backups. It provides the same benefits to both pg_basebackup
and external backups, i.e. they both receive the updated version of
pg_control.


Sure. They also receive a backup_label today. If an external solution forgets
to replace pg_control copied as part of the filesystem copy, they won't get an
error after the removal of backup_label, just like they don't get one today if
they don't put backup_label in the data directory.  Given that users don't do
the right thing with backup_label today, why can we rely on them doing the
right thing with pg_control?


I think reliable backup software does the right thing with backup_label, 
but if the user starts getting errors on recovery they then decide to 
remove backup_label. I know we can't do much about bad backup software, 
but we can at least make this a bit more resistant to user error after 
the fact.


It doesn't help that one of our hints suggests removing backup_label. In 
highly automated systems, the user might not even know they just 
restored from a backup. They are only in the loop because the restore 
failed and they are trying to figure out what is going wrong. When they 
remove backup_label the cluster comes up just fine. Victory!


This is the scenario I've seen most often -- not the backup/restore 
process getting it wrong but the user removing backup_label on their own 
initiative. And because it yields such a positive result, at least 
initially, they remember in the future that the thing to do is to remove 
backup_label whenever they see the error.


If they only have pg_control, then their only choice is to get it right 
or run pg_resetwal. Most users have no knowledge of pg_resetwal so it 
will take them longer to get there. Also, I think that tool makes it 
pretty clear that corruption will result and the only thing to do is a 
logical dump and restore after using it.


There are plenty of ways a user can mess things up. What I'd like to 
prevent is the appearance of everything being OK when in fact they have 
corrupted their cluster. That's the situation we have now with 
backup_label. Is this new solution perfect? No, but I do think it checks 
several boxes, and is a worthwhile improvement.


Regards,
-David




Re: Add recovery to pg_control and remove backup_label

2023-11-21 Thread David Steele

On 11/20/23 19:58, Andres Freund wrote:


On 2023-11-21 08:52:08 +0900, Michael Paquier wrote:

On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote:

Given that, I wonder if what we should do is to just add a new field to
pg_control that says "error out if backup_label does not exist", that we set
when creating a streaming base backup


That would mean that one still needs to take an extra step to update a
control file with this byte set, which is something you had a concern
with in terms of compatibility when it comes to external backup
solutions because more steps are necessary to take a backup, no?


I was thinking we'd just set it in the pg_basebackup style path, and we'd
error out if it's set and backup_label is present. But we'd still use
backup_label without the pg_control flag set.

So it'd just provide a cross-check that backup_label was not removed for
pg_basebackup style backup, but wouldn't do anything for external backups. But
imo the proposal to just use pg_control doesn't actually do anything for
external backups either - which is why I think my proposal would achieve as
much, for a much lower price.


I'm not sure why you think the patch under discussion doesn't do 
anything for external backups. It provides the same benefits to both 
pg_basebackup and external backups, i.e. they both receive the updated 
version of pg_control.


I really dislike the idea of pg_basebackup having a special mechanism 
for making recovery safer that is not generally available to external 
backup software. It might be easy enough for some (e.g. pgBackRest) to 
manipulate pg_control but would be out of reach for most.


Regards,
-David





Re: Use of backup_label not noted in log

2023-11-21 Thread David Steele

On 11/20/23 23:54, Michael Paquier wrote:

On Mon, Nov 20, 2023 at 03:31:20PM -0400, David Steele wrote:


I still wonder if we need "base backup" in the messages? That sort of
implies (at least to me) you used pg_basebackup but that may not be the
case.


Or just s/base backup/backup/?


That's what I meant but did not explain very well.

Regards,
-David




Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread David Steele

On 11/20/23 15:47, Robert Haas wrote:

On Mon, Nov 20, 2023 at 2:41 PM David Steele  wrote:

I can't see why a backup would continue to be valid without a manifest
-- that's not very standard for backup software. If you have the
critical info in backup_label, you can't afford to lose that, so why
should backup_manifest be any different?


I mean, you can run pg_basebackup --no-manifest.


Maybe this would be a good thing to disable for page incremental. With 
all the work being done by pg_combinebackup, it seems like it would be a 
good idea to be able to verify the final result?


I understand this is an option -- but does it need to be? What is the 
benefit of excluding the manifest?


Regards,
-David




Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread David Steele

On 11/20/23 14:44, Robert Haas wrote:

On Mon, Nov 20, 2023 at 12:54 PM David Steele  wrote:

Another thing we could do is explicitly error if we see backup_label in
PGDATA during recovery. That's just a few lines of code so would not be
a big deal to maintain. This error would only be visible on restore, so
it presumes that backup software is being tested.


I think that if we do decide to adopt this proposal, that would be a
smart precaution.


I'd be OK with it -- what do you think, Michael? Would this be enough 
that we would not need to rename the functions, or should we just go 
with the rename?



A little hard to add to the tests, I think, since they are purely
informational, i.e. not pushed up by the parser. Maybe we could just
grep for the fields?


Hmm. Or should they be pushed up by the parser?


We could do that. I started on that road, but it's a lot of code for 
fields that aren't used. I think it would be better if the parser also 
loaded a data structure that represented the manifest. Seems to me 
there's a lot of duplicated code between pg_verifybackup and 
pg_combinebackup the way it is now.



I think these fields would be handled the same as the rest of the fields
in backup_label: returned from pg_backup_stop() and also stored in
backup_manifest. Third-party software can do as they like with them and
pg_combinebackup can just read from backup_manifest.


I think that would be a bad plan, because this is critical
information, and a backup manifest is not a thing that you're required
to have. It's not a natural fit at all. We don't want to create a
situation where if you nuke the backup_manifest then the server
forgets that what it has is an incremental backup rather than a usable
data directory.


I can't see why a backup would continue to be valid without a manifest 
-- that's not very standard for backup software. If you have the 
critical info in backup_label, you can't afford to lose that, so why 
should backup_manifest be any different?


Regards,
-David




Re: Use of backup_label not noted in log

2023-11-20 Thread David Steele

On 11/20/23 15:03, Andres Freund wrote:

On 2023-11-20 11:35:15 +0100, Laurenz Albe wrote:


If we add a message for starting with "backup_label", shouldn't
we also add a corresponding message for starting from a checkpoint
found in the control file?  If you see that in a problem report,
you immediately know what is going on.


Maybe - the reason I hesitate on that is that emitting an additional log
message when starting from a base backup just adds something "once in the
lifetime of a node". Whereas emitting something every start obviously doesn't
impose any limit.


Hmm, yeah, that would be a bit much.


Here's the state with my updated patch, when starting up from a base backup:

LOG:  starting PostgreSQL 17devel on x86_64-linux, compiled by gcc-14.0.0, 
64-bit
LOG:  listening on IPv6 address "::1", port 5441
LOG:  listening on IPv4 address "127.0.0.1", port 5441
LOG:  listening on Unix socket "/tmp/.s.PGSQL.5441"
LOG:  database system was interrupted; last known up at 2023-11-20 10:55:49 PST
LOG:  starting recovery from base backup with redo LSN E/AFF07F20, checkpoint 
LSN E/B01B17F0, on timeline ID 1
LOG:  entering standby mode
LOG:  redo starts at E/AFF07F20
LOG:  completed recovery from base backup with redo LSN E/AFF07F20
LOG:  consistent recovery state reached at E/B420FC80

Besides the phrasing and the additional log message (I have no opinion about
whether it should be backpatched or not), I used %u for TimelineID as
appropriate, and added a comma before "on timeline".


I still wonder if we need "base backup" in the messages? That sort of 
implies (at least to me) you used pg_basebackup but that may not be the 
case.


FWIW, I also prefer "backup recovery" over "recovery from backup". 
"recovery from backup" reads fine here, but it gets more awkward when 
you want to say something like "recovery from backup settings". In that 
case, I think "backup recovery settings" reads better. Not important for 
this patch, maybe, but the recovery in pg_control patch went the other 
way and I definitely think it makes sense to keep them consistent, 
whichever way we go.


Other than that, looks good for HEAD. Whether we back patch or not is 
another question, of course.


Regards,
-David





Re: Use of backup_label not noted in log

2023-11-20 Thread David Steele

On 11/20/23 14:27, Andres Freund wrote:

Hi,

On 2023-11-19 14:28:12 -0400, David Steele wrote:

On 11/18/23 17:49, Andres Freund wrote:

On 2023-11-18 10:01:42 -0800, Andres Freund wrote:
Not enamored with the phrasing of the log messages, but here's a prototype:

When starting up with backup_label present:
LOG:  starting from base backup with redo LSN A/34100028, checkpoint LSN 
A/34100080 on timeline ID 1


I'd prefer something like:

LOG:  starting backup recovery with redo...



When restarting before reaching the end of the backup, but after backup_label
has been removed:
LOG:  continuing to start from base backup with redo LSN A/34100028
LOG:  entering standby mode
LOG:  redo starts at A/3954B958


And here:

LOG:  restarting backup recovery with redo...


I like it.


Cool.


I've wondered whether it's worth also adding an explicit message just after
ReachedEndOfBackup(), but it seems far less urgent due to the existing
"consistent recovery state reached at %X/%X" message.


I think the current message is sufficient, but what do you have in mind?


Well, the consistency message is emitted after every restart. Whereas a single
instance only should go through backup recovery once. So it seems worthwhile
to differentiate the two in log messages.


Ah, right. That works for me, then.

Regards,
-David




Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread David Steele




On 11/20/23 12:11, Robert Haas wrote:

On Sun, Nov 19, 2023 at 8:16 PM Michael Paquier  wrote:

(I am not exactly sure how, but we've lost pgsql-hackers on the way
when you sent v5.  Now added back in CC with the two latest patches
you've proposed attached.)

Here is a short summary of what has been missed by the lists:
- I've commented that the patch should not create, not show up in
fields returned by the SQL functions, or stream control files with a size
of 512B, just stick to 8kB.  If this is worth changing this should be
applied consistently across the board including initdb, discussed on
its own thread.
- The backup-related fields in the control file are reset at the end
of recovery.  I've suggested to not do that to keep a trace of what
was happening during recovery.  The latest version of the patch resets
the fields.
- With the backup_label file gone, we lose some information in the
backups themselves, which is not good.  Instead, you have suggested an
approach where this data is added to the backup manifest, meaning that
no information would be lost, particularly useful for self-contained
backups.  The fields planned to be added to the backup manifest are:
-- The start and end time of the backup, the end timestamp being
useful to know when stop time can be used for PITR.
-- The backup label.
I've agreed that it may be the best thing to do at this end to not
lose any data related to the removal of the backup_label file.


I think we need more votes to make a change this big. I have a
concern, which I think I've expressed before, that we keep whacking
around the backup APIs, and that has a cost which is potentially
larger than the benefits. 


From my perspective it's not that big a change for backup software but 
it does bring a lot of benefits, including fixing an outstanding bug in 
Postgres, i.e. reading pg_control without getting a torn copy.



The last time we changed the API, we changed
pg_stop_backup to pg_backup_stop, but this doesn't do that, and I
wonder if that's OK. Even if it is, do we really want to change this
API around again after such a short time?


This is a good point. We could just rename again, but not sure what 
names to go for this time. OTOH if the backup software is selecting 
fields then they will get an error because the names have changed. If 
the software is grabbing fields by position then they'll get a 
valid-looking result (even if querying by position is a terrible idea).


Another thing we could do is explicitly error if we see backup_label in 
PGDATA during recovery. That's just a few lines of code so would not be 
a big deal to maintain. This error would only be visible on restore, so 
it presumes that backup software is being tested.


Maybe just a rename to something like pg_backup_begin/end would be the 
way to go.



That said, I don't have an intrinsic problem with moving this
information from the backup_label to the backup_manifest file since it
is purely informational. I do think there should perhaps be some
additions to the test cases, though.


A little hard to add to the tests, I think, since they are purely 
informational, i.e. not pushed up by the parser. Maybe we could just 
grep for the fields?



I am concerned about the interaction of this proposal with incremental
backup. When you take an incremental backup, you get something that
looks a lot like a usable data directory but isn't. To prevent that
from causing avoidable disasters, the present version of the patch
adds INCREMENTAL FROM LSN and INCREMENTAL FROM TLI fields to the
backup_label. pg_combinebackup knows to look for those fields, and the
server knows that if they are present it should refuse to start. With
this change, though, I suppose those fields would end up in
pg_control. But that does not feel entirely great, because we have a
goal of keeping the amount of real data in pg_control below 512 bytes,
the traditional sector size, and this adds another 12 bytes of stuff
to that file that currently doesn't need to be there. I feel like
that's kind of a problem.


I think these fields would be handled the same as the rest of the fields 
in backup_label: returned from pg_backup_stop() and also stored in 
backup_manifest. Third-party software can do as they like with them and 
pg_combinebackup can just read from backup_manifest.


As for the pg_control file -- it might be best to give it a different 
name for backups that are not essentially copies of PGDATA. On the other 
hand, pgBackRest has included pg_control in incremental backups since 
day one and we've never had a user mistakenly do a manual restore of one 
and cause a problem (though manual restores are not the norm). Still, 
probably can't hurt to be a bit careful.



But my main point here is ... if we have a few more senior hackers
weigh in and vote in favor of this change, well then that's one thing.
But IMHO a discussion that's mostly between 2 people is not nearly a
strong enough consensus to justify this amount of 

Re: Use of backup_label not noted in log

2023-11-20 Thread David Steele

On 11/20/23 12:24, Robert Haas wrote:

On Mon, Nov 20, 2023 at 5:35 AM Laurenz Albe  wrote:

I can accept that adding log messages to back branches is ok.
Perhaps I am too nervous about things like that, because as an extension
developer I have been bitten too often by ABI breaks in minor releases
in the past.


I think that adding a log message to the back branches would probably
make my life better not worse, because when people do strange things
and then send me the log messages to figure out what the heck
happened, it would be there, and I'd have a clue. However, the world
doesn't revolve around me. I can imagine users getting spooked by a
new message that they've never seen before, and I think that risk
should be considered. There are good reasons for keeping the
back-branches stable, and as you said before, this isn't a bug fix.


Personally I think that the value of the information outweighs the 
weirdness of a new message appearing.



I do also think it is worth considering how this proposal interacts
with the proposal to remove backup_label. If that proposal goes
through, then this proposal is obsolete, I believe. 


Not at all. I don't even think the messages will need to be reworded, or 
not much since they don't mention backup_label.



But if this is a
good idea, does that mean that's not a good idea? Or would we try to
make the pg_control which that patch would drop in place have some
internal difference which we could use to drive a similar log message?


The recovery in pg_control patch has all the same recovery info stored, 
so similar (or the same) log message would still be appropriate.



Maybe we should, because knowing whether or not the user followed the
backup procedure correctly would indeed be a big help and it would be
regrettable to gain that capability only to lose it again...


The info is certainly valuable and we wouldn't lose it, unless there is 
something I'm not getting.


Regards,
-David




Re: Use of backup_label not noted in log

2023-11-20 Thread David Steele

On 11/20/23 06:35, Laurenz Albe wrote:


If we add a message for starting with "backup_label", shouldn't
we also add a corresponding message for starting from a checkpoint
found in the control file?  If you see that in a problem report,
you immediately know what is going on.


+1. It is easier to detect the presence of a message than the absence of 
one.


Regards,
-David




Re: Use of backup_label not noted in log

2023-11-20 Thread David Steele

[Resending since I accidentally replied off-list]

On 11/18/23 17:49, Andres Freund wrote:

On 2023-11-18 10:01:42 -0800, Andres Freund wrote:

What about adding it to the "redo starts at" message, something like

   redo starts at 12/12345678, taken from control file

or

   redo starts at 12/12345678, taken from backup label


I think it'd make sense to log use of backup_label earlier than that - the
locations from backup_label might end up not being available in the archive,
the primary or locally, and we'll error out with "could not locate a valid
checkpoint record".

I'd probably just do it within the if (read_backup_label()) block in
InitWalRecovery(), *before* the ReadCheckpointRecord().


Not enamored with the phrasing of the log messages, but here's a prototype:

When starting up with backup_label present:
LOG:  starting from base backup with redo LSN A/34100028, checkpoint LSN 
A/34100080 on timeline ID 1


I'd prefer something like:

LOG:  starting backup recovery with redo...


When restarting before reaching the end of the backup, but after backup_label
has been removed:
LOG:  continuing to start from base backup with redo LSN A/34100028
LOG:  entering standby mode
LOG:  redo starts at A/3954B958


And here:

LOG:  restarting backup recovery with redo...


Note that the LSN in the "continuing" case is the one the backup started at,
not where recovery will start.

I've wondered whether it's worth also adding an explicit message just after
ReachedEndOfBackup(), but it seems far less urgent due to the existing
"consistent recovery state reached at %X/%X" message.


I think the current message is sufficient, but what do you have in mind?


We are quite inconsistent about how we spell LSNs. Sometimes with LSN
preceding, sometimes not. Sometimes with (LSN). Etc.


Well, this could be improved in HEAD for sure.


I do like the idea of expanding the "redo starts at" message
though. E.g. including minRecoveryLSN, ControlFile->backupStartPoint,
ControlFile->backupEndPoint would provide information about when the node
might become consistent.


+1


Playing around with this a bit, I'm wondering if we instead should remove that
message, and emit something more informative earlier on. If there's a problem,
you kinda want the information before we finally get to the loop in
PerformWalRecovery(). If e.g. there's no WAL you'll only get
LOG:  invalid checkpoint record
PANIC:  could not locate a valid checkpoint record

which is delightfully lacking in details.


I've been thinking about improving this myself. It would probably also 
help a lot to hint that restore_command may be missing or not returning 
results (but also not erroring). But there are a bunch of ways to get to 
this message so we'd need to be careful.



There also are some other oddities:

If the primary is down when starting up, and we'd need WAL from the primary
for the first record, the "redo start at" message is delayed until that
happens, because we emit the message not before we read the first record, but
after. That's just plain odd.


Agreed. Moving it up would be better.


And sometimes we'll start referencing the LSN at which we are starting
recovery before the "redo starts at" message. If e.g. we shut down
at a restart point, we'll emit

   LOG:  consistent recovery state reached at ...
before
   LOG:  redo starts at ...


Huh, I haven't seen that one. Definitely confusing.


But that's all clearly just material for HEAD.


Absolutely. I've been thinking about some of this as well, but want to 
see if we can remove the backup label first so we don't have to rework a 
bunch of stuff.


Of course, that shouldn't stop you from proceeding. I'm sure anything 
you are thinking of here could be adapted.


Regards,
-David




Re: Add recovery to pg_control and remove backup_label

2023-11-20 Thread David Steele

On 11/19/23 21:15, Michael Paquier wrote:

(I am not exactly sure how, but we've lost pgsql-hackers on the way
when you sent v5.  Now added back in CC with the two latest patches
you've proposed attached.)


Ugh, I must have hit reply instead of reply all. It's a rookie error and 
you hate to see it.



Here is a short summary of what has been missed by the lists:
- I've commented that the patch should not create, not show up in
fields returned by the SQL functions, or stream control files with a size
of 512B, just stick to 8kB.  If this is worth changing this should be
applied consistently across the board including initdb, discussed on
its own thread.
- The backup-related fields in the control file are reset at the end
of recovery.  I've suggested to not do that to keep a trace of what
was happening during recovery.  The latest version of the patch resets
the fields.
- With the backup_label file gone, we lose some information in the
backups themselves, which is not good.  Instead, you have suggested an
approach where this data is added to the backup manifest, meaning that
no information would be lost, particularly useful for self-contained
backups.  The fields planned to be added to the backup manifest are:
-- The start and end time of the backup, the end timestamp being
useful to know when stop time can be used for PITR.
-- The backup label.
I've agreed that it may be the best thing to do at this end to not
lose any data related to the removal of the backup_label file.


This looks right to me.


On Sun, Nov 19, 2023 at 02:14:32PM -0400, David Steele wrote:

On 11/15/23 20:03, Michael Paquier wrote:

As the label is only an informational field, the parsing added to
pg_verifybackup is not really needed because it is used nowhere in the
validation process, so keeping the logic simpler would be the way to
go IMO.  This is contrary to the WAL range for example, where start
and end LSNs are used for validation with a pg_waldump command.
Robert, any comments about the addition of the label in the manifest?


I'm sure Robert will comment on this when he gets the time, but for now I
have backed off on passing the new info to pg_verifybackup and added
start/stop time.


FWIW, I'm OK with the bits for the backup manifest as presented.  So
if there are no remarks and/or no objections, I'd like to apply it but
let's give some room to others to comment on that as there's been a gap
in the emails exchanged on pgsql-hackers.  I hope that the summary
I've posted above covers everything.  So let's see about doing
something around the middle of next week.  With Thanksgiving in the
US, a lot of folks will not have the time to monitor what's happening
on this thread.


Timing sounds good to me.



+  The end time for the backup. This is when the backup was stopped in
+  PostgreSQL and represents the earliest time
+  that can be used for time-based Point-In-Time Recovery.

This one is actually a very good point.  We'd lost this capacity with
the backup_label file gone without the end timestamps in the control
file.


Yeah, the end time is very important for recovery scenarios. We 
definitely need that recorded somewhere.



I've noticed on the other thread the remark about being less
aggressive with the fields related to recovery in the control file, so
I assume that this patch should leave the fields be after the end of
recovery from the start and only rely on backupRecoveryRequired to
decide if the recovery should use the fields or not:
https://www.postgresql.org/message-id/241ccde1-1928-4ba2-a0bb-5350f7b191a8@pgmasters.net

+   ControlFile->backupCheckPoint = InvalidXLogRecPtr;
ControlFile->backupStartPoint = InvalidXLogRecPtr;
+   ControlFile->backupStartPointTLI = 0;
ControlFile->backupEndPoint = InvalidXLogRecPtr;
+   ControlFile->backupFromStandby = false;
ControlFile->backupEndRequired = false;

Still, I get the temptation of being consistent with the current style
on HEAD to reset everything, as well..


I'd rather reset everything for now (as we do now) and think about 
keeping these values as a separate patch. It may be that we don't want 
to keep all of them, or we need a separate flag to say recovery was 
completed. We are accumulating a lot of booleans here, maybe we need a 
state var (recoveryRequired, recoveryInProgress, recoveryComplete) and 
then define which other vars are valid in each state.


Regards,
-David




Re: Use of backup_label not noted in log

2023-11-18 Thread David Steele

On 11/17/23 01:41, Laurenz Albe wrote:

On Thu, 2023-11-16 at 20:18 -0800, Andres Freund wrote:

I've often had to analyze what caused corruption in PG instances, where the
symptoms match not having had backup_label in place when bringing on the
node. However that's surprisingly hard - the only log messages that indicate
use of backup_label are at DEBUG1.

Given how crucial use of backup_label is and how frequently people do get it
wrong, I think we should add a LOG message - it's not like use of backup_label
is a frequent thing in the life of a postgres instance and is going to swamp
the log.  And I think we should backpatch that addition.


+1

I am not sure about the backpatch: it is not a bug, and we should not wantonly
introduce new log messages in a minor release.  Some monitoring system may
get confused.

What about adding it to the "redo starts at" message, something like

   redo starts at 12/12345678, taken from control file

or

   redo starts at 12/12345678, taken from backup label


I think a backpatch is OK as long as it is a separate message, but I 
like your idea of adding to the "redo starts" message going forward.


I know this isn't really a bug, but not being able to tell where 
recovery information came from seems like a major omission in the logging.


Regards,
-David




Re: Use of backup_label not noted in log

2023-11-18 Thread David Steele

On 11/17/23 00:18, Andres Freund wrote:


I've often had to analyze what caused corruption in PG instances, where the
symptoms match not having had backup_label in place when bringing on the
node. However that's surprisingly hard - the only log messages that indicate
use of backup_label are at DEBUG1.

Given how crucial use of backup_label is and how frequently people do get it
wrong, I think we should add a LOG message - it's not like use of backup_label
is a frequent thing in the life of a postgres instance and is going to swamp
the log.  And I think we should backpatch that addition.


+1 for the message and I think a backpatch is fine as long as it is a 
new message. If monitoring systems can't handle an unrecognized message 
then that feels like a problem on their part.



Medium term I think we should go further, and leave evidence in pg_control
about the last use of ControlFile->backupStartPoint, instead of resetting it.


Michael also thinks this is a good idea.


I realize that there's a discussion about removing backup_label - but I think
that's fairly orthogonal. Should we go with the pg_control approach, we should
still emit a useful message when starting in a state that's "equivalent" to
having used the backup_label.


Agreed, this new message could easily be adapted to the recovery in 
pg_control patch.


Regards,
-David




Add basic tests for the low-level backup method.

2023-11-11 Thread David Steele

Hackers,

As promised in [1], attached are some basic tests for the low-level 
backup method.


There are currently no tests for the low-level backup method. 
pg_backup_start() and pg_backup_stop() are called but not exercised in 
any real fashion.


There is a lot more that can be done, but this at least supplies some 
basic tests and provides a place for future improvement.


Regards,
-David

[1] 
https://www.postgresql.org/message-id/2daf8adc-8db7-4204-a7f2-a7e94e2bfa4b%40pgmasters.net

From b1a10ab4ab912e6f174206fcb8ce67496f93df83 Mon Sep 17 00:00:00 2001
From: David Steele 
Date: Sat, 11 Nov 2023 19:12:49 +
Subject: Add basic tests for the low-level backup method.

There are currently no tests for the low-level backup method. pg_backup_start()
and pg_backup_stop() are called but not exercised in any real fashion.

There is a lot more that can be done, but this at least provides some basic
tests and provides a place for future improvement.
---
 src/test/recovery/t/040_low_level_backup.pl | 101 
 1 file changed, 101 insertions(+)
 create mode 100644 src/test/recovery/t/040_low_level_backup.pl

diff --git a/src/test/recovery/t/040_low_level_backup.pl 
b/src/test/recovery/t/040_low_level_backup.pl
new file mode 100644
index 000..3f30cb6b59c
--- /dev/null
+++ b/src/test/recovery/t/040_low_level_backup.pl
@@ -0,0 +1,101 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test low-level backup method by using pg_backup_start() and pg_backup_stop()
+# to create backups.
+
+use strict;
+use warnings;
+
+use File::Copy qw(copy);
+use File::Path qw(remove_tree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Start primary node with archiving
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(has_archiving => 1, allows_streaming => 1);
+$node_primary->start;
+
+# Start backup
+my $backup_name = 'backup1';
+my $psql = $node_primary->background_psql('postgres');
+
+$psql->query_safe("SET client_min_messages TO WARNING");
+$psql->set_query_timer_restart();
+$psql->query_safe("select pg_backup_start('test label')");
+
+# Copy files
+my $backup_dir = $node_primary->backup_dir() . '/' . $backup_name;
+
+PostgreSQL::Test::RecursiveCopy::copypath(
+   $node_primary->data_dir(), $backup_dir);
+
+# Cleanup files/paths that should not be in the backup. Also remove pg_control
+# because it needs to be copied later and it will be obvious if the copy fails.
+unlink("$backup_dir/postmaster.pid")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.pid");
+unlink("$backup_dir/postmaster.opts")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.opts");
+unlink("$backup_dir/global/pg_control")
+   or BAIL_OUT("unable to unlink $backup_dir/global/pg_control");
+
+remove_tree("$backup_dir/pg_wal", {keep_root => 1})
+   or BAIL_OUT("unable to unlink contents of $backup_dir/pg_wal");
+
+# Create a table that will be used to verify that recovery started at the
+# correct location, rather than the location recorded in pg_control
+$psql->query_safe("create table canary (id int)");
+
+# Advance the checkpoint location in pg_control past the location where the
+# backup started. Switch the WAL to make it really obvious that the location
+# is different and to put the checkpoint in a new WAL segment.
+$psql->query_safe("select pg_switch_wal()");
+$psql->query_safe("checkpoint");
+
+# Copy pg_control last so it contains the new checkpoint
+copy($node_primary->data_dir() . '/global/pg_control',
+   "$backup_dir/global/pg_control")
+   or BAIL_OUT("unable to copy global/pg_control");
+
+# Stop backup and get backup_label
+my $backup_label = $psql->query_safe("select labelfile from pg_backup_stop()");
+
+# Rather than writing out backup_label, try to recover the backup without
+# backup_label to demonstrate that recovery will not work correctly without it,
+# i.e. the canary table will be missing and the cluster will be corrupt. Provide
+# only the WAL segment that recovery will think it needs.
+#
+# The point of this test is to explicitly demonstrate that backup_label is
+# being used in a later test to get the correct recovery info.
+my $node_replica = PostgreSQL::Test::Cluster->new('replica_fail');
+$node_replica->init_from_backup($node_primary, $backup_name);
+my $canary_query = "select count(*) from pg_class where relname = 'canary'";
+
+copy($node_primary->archive_dir() . '/00010003',
+   $node_replica->data_dir() . '/pg_wal/00010003')
+   or BAIL_OUT("unable to copy 00010003");
+
+$node_replica->start;
+
+ok($node_replica->safe_psql('postgres', $canary_quer

Re: Add recovery to pg_control and remove backup_label

2023-11-10 Thread David Steele

On 11/10/23 00:37, Michael Paquier wrote:

On Tue, Nov 07, 2023 at 05:20:27PM +0900, Michael Paquier wrote:

On Mon, Nov 06, 2023 at 05:39:02PM -0400, David Steele wrote:
I've retested today, and miss the failure.  I'll let you know if I see
this again.


I've done a few more dozen runs, and still nothing.  I am wondering
what this disturbance was.


OK, hopefully it was just a blip.


If we set these fields where backup_label was renamed, the logic would not
be exactly the same since pg_control won't be updated until the next time
through the loop. Since the fields should be updated before
UpdateControlFile() I thought it made sense to keep all the updates
together.

Overall I think it is simpler, and we don't need to acquire a lock on
ControlFile.


What you are proposing is the same as what we already do for
backupEndRequired or backupStartPoint in the control file when
initializing recovery, so objection withdrawn.


Thomas included this change in his pg_basebackup changes so I did the same.
Maybe wait a bit before we split this out? Seems like a pretty small
change...


Seems like a pretty good argument for refactoring that now, and let
any other patches rely on it.  Would you like to send a separate
patch?


The split still looks worth doing seen from here, so I am switching
the patch as WoA for now.


This has been split out.


Actually, grouping backupStartPointTLI and minRecoveryPointTLI should
reduce more the size with some alignment magic, no?


I thought about this, but it seemed to me that existing fields had been
positioned to make the grouping logical rather than to optimize alignment,
e.g. minRecoveryPointTLI. Ideally that would have been placed near
backupEndRequired (or vice versa). But if the general opinion is to
rearrange for alignment, I'm OK with that.


I've not tested, but it looks like moving backupStartPointTLI after
backupEndPoint should shave 8 bytes, if you want to maintain a more
coherent group for the LSNs.


OK, I have moved backupStartPointTLI.


+* backupFromStandby indicates that the backup was taken on a standby. It is
+* require to initialize recovery and set to false afterwards.
s/require/required/.


Fixed.


The term "backup recovery", that we've never used in the tree until
now as far as I know.  Perhaps this recovery method should just be
referred as "recovery from backup"?


Well, "backup recovery" is less awkward, I think. For instance "backup 
recovery field" vs "recovery from backup field".



By the way, there is another thing that this patch has forgotten: the
SQL functions that display data from the control file.  Shouldn't
pg_control_recovery() be extended with the new fields?  These fields
may be less critical than the other ones related to recovery, but I
suspect that showing them can become handy at least for debugging and
monitoring purposes.


I guess that depends on whether we reset them or not (discussion below). 
Right now they would not be visible since by the time the user could log 
on they would be reset.



Something in this area is that backupRecoveryRequired is the switch
controlling if the fields set by the recovery initialization.  Could
it be actually useful to leave the other fields as they are and only
reset backupRecoveryRequired before the first control file update?
This would leave a trace of the backup history directly in the control
file.


Since the other recovery fields are cleared in ReachedEndOfBackup() this 
would be a change from what we do now.


None of these fields are ever visible (with the exception of 
minRecoveryPoint/TLI) since they are reset when the database becomes 
consistent and before logons are allowed. Viewing them with 
pg_controldata makes sense, but I'm not sure pg_control_recovery() does.


In fact, are backup_start_lsn, backup_end_lsn, and 
end_of_backup_record_required ever non-zero when logged onto Postgres? 
Maybe I'm missing something?
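
For example (using the existing function on a current cluster, nothing from 
this patch), by the time you can log on and run the query the fields have 
already been cleared:

  -- backup-related fields are reset before connections are allowed, so this
  -- typically returns 0/0, 0/0 and false on any cluster that accepts logons
  SELECT backup_start_lsn,
         backup_end_lsn,
         end_of_backup_record_required
    FROM pg_control_recovery();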



What about pg_resetwal and RewriteControlFile()?  Shouldn't these
recovery fields be reset as well?


Done.


git diff --check is complaining a bit.


Fixed.

New patches attached based on eb81e8e790.

Regards,
-David

From d241a0e8aa589b3abf66331f8a3af0aabe87c214 Mon Sep 17 00:00:00 2001
From: David Steele 
Date: Fri, 10 Nov 2023 17:50:54 +
Subject: Allow content size to be passed to sendFileWithContent().

sendFileWithContent() previously got the content length by using strlen(),
but it is also possible to pass binary content. Use len == -1 to indicate
that strlen() should be use to get the content length, otherwise honor the
value in len.
---
 src/backend/backup/basebackup.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b537f462197..f216b588422 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -94,7 +94,7 @@ static bool ver

Re: Add recovery to pg_control and remove backup_label

2023-11-06 Thread David Steele

On 11/6/23 02:35, Michael Paquier wrote:

On Sun, Nov 05, 2023 at 01:45:39PM -0400, David Steele wrote:

Rebased on 151ffcf6.


I like this patch a lot.  Even if the backup_label file is removed, we
still have all the debug information from the backup history file,
thanks to its LABEL, BACKUP METHOD and BACKUP FROM, so no information
is lost.  It does a 1:1 replacement of the contents parsed from the
backup_label needed by recovery by fetching them from the control
file.  Sounds like a straight-forward change to me.


That's the plan, at least!


The patch is failing the recovery test 039_end_of_wal.pl.  Could you
look at the failure?


I'm not seeing this failure, and CI seems happy [1]. Can you give 
details of the error message?



  /* Build and save the contents of the backup history file */
-history_file = build_backup_content(state, true);
+history_file = build_backup_content(state);

build_backup_content() sounds like an incorrect name if it is a
routine only used to build the contents of backup history files.


Good point, I have renamed this to build_backup_history_content().


Why is there nothing updated in src/bin/pg_controldata/?


Oops, added.


+/* Clear fields used to initialize recovery */
+ControlFile->backupCheckPoint = InvalidXLogRecPtr;
+ControlFile->backupStartPointTLI = 0;
+ControlFile->backupRecoveryRequired = false;
+ControlFile->backupFromStandby = false;

These variables in the control file are cleaned up when the
backup_label file was read previously, but backup_label is renamed to
backup_label.old a bit later than that.  Your logic looks correct seen
from here, but shouldn't these variables be set much later, aka just
*after* UpdateControlFile().  This gap between the initialization of
the control file and the in-memory reset makes the code quite brittle,
IMO.


If we set these fields where backup_label was renamed, the logic would 
not be exactly the same since pg_control won't be updated until the next 
time through the loop. Since the fields should be updated before 
UpdateControlFile() I thought it made sense to keep all the updates 
together.


Overall I think it is simpler, and we don't need to acquire a lock on 
ControlFile.



-   basebackup_progress_wait_wal_archive();
-   do_pg_backup_stop(backup_state, !opt->nowait);

Why is that moved?


do_pg_backup_stop() generates the updated pg_control so it needs to run 
before we transmit pg_control.



-The backup label
-file includes the label string you gave to 
pg_backup_start,
-as well as the time at which pg_backup_start was run, 
and
-the name of the starting WAL file.  In case of confusion it is therefore
-possible to look inside a backup file and determine exactly which
-backup session the dump file came from.  The tablespace map file includes
+The tablespace map file includes

It may be worth mentioning that the backup history file holds this
information on the primary's pg_wal, as well.


OK, reworded.


The changes in sendFileWithContent() may be worth a patch of its own.


Thomas included this change in his pg_basebackup changes so I did the 
same. Maybe wait a bit before we split this out? Seems like a pretty 
small change...



--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -146,6 +146,9 @@ typedef struct ControlFileData
@@ -160,14 +163,25 @@ typedef struct ControlFileData
  XLogRecPtrminRecoveryPoint;
  TimeLineIDminRecoveryPointTLI;
+XLogRecPtrbackupCheckPoint;
  XLogRecPtrbackupStartPoint;
+TimeLineIDbackupStartPointTLI;
  XLogRecPtrbackupEndPoint;
+bool backupRecoveryRequired;
+bool backupFromStandby;

This increases the size of the control file from 296B to 312B with an
8-byte alignment, as far as I can see.  The size of the control file
has been always a sensitive subject especially with the hard limit of
PG_CONTROL_MAX_SAFE_SIZE.  Well, the point of this patch is that this
is the price to pay to prevent users from doing something stupid with
a removal of a backup_label when they should not.  Do others have an
opinion about this increase in size?

Actually, grouping backupStartPointTLI and minRecoveryPointTLI should
reduce more the size with some alignment magic, no?


I thought about this, but it seemed to me that existing fields had been 
positioned to make the grouping logical rather than to optimize 
alignment, e.g. minRecoveryPointTLI. Ideally that would have been placed 
near backupEndRequired (or vice versa). But if the general opinion is to 
rearrange for alignment, I'm OK with that.



backupRecoveryRequired in the control file is switched to false for
pg_rewind and true for streamed backups.  My gut feeling is telling me
that this should be OK, as out-of-core tools would need an upgrade if
they relied on the backup_label file anyway.  I can see that this
change

Re: Add recovery to pg_control and remove backup_label

2023-11-06 Thread David Steele

On 11/6/23 01:05, Michael Paquier wrote:

On Fri, Oct 27, 2023 at 10:10:42AM -0400, David Steele wrote:

We are still planning to address this issue in the back branches.


FWIW, redesigning the backend code in charge of doing base backups in
the back branches is out of scope.  Based on a read of the proposed
patch, it includes catalog changes which would require a catversion
bump, so that's not going to work anyway.


I did not mean this patch -- rather some variation of what Thomas has 
been working on, more than likely.


Regards,
-David




Re: Add recovery to pg_control and remove backup_label

2023-11-05 Thread David Steele

Rebased on 151ffcf6.

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae54..6be8fb902c5 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -935,19 +935,20 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
  ready to archive.
 
 
- pg_backup_stop will return one row with three
- values. The second of these fields should be written to a file named
- backup_label in the root directory of the backup. The
- third field should be written to a file named
- tablespace_map unless the field is empty. These 
files are
+ pg_backup_stop returns the
+ pg_control file, which must be stored in the
+ global directory of the backup. It also returns the
+ tablespace_map file, which should be written in the
+ root directory of the backup unless the field is empty. These files are
  vital to the backup working and must be written byte for byte without
- modification, which may require opening the file in binary mode.
+ modification, which will require opening the file in binary mode.
 


 
  Once the WAL segment files active during the backup are archived, you are
- done.  The file identified by pg_backup_stop's first 
return
+ done.  The file identified by pg_backup_stop's
+ lsn return
  value is the last segment that is required to form a complete set of
  backup files.  On a primary, if archive_mode is 
enabled and the
  wait_for_archive parameter is true,
@@ -1013,7 +1014,15 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);

 

-You should, however, omit from the backup the files within the
+You must exclude global/pg_control from your backup
+and put the contents of the pg_control_file column
+returned from pg_backup_stop in your backup at
+global/pg_control. This file contains the information
+required to safely recover.
+   
+
+   
+You should also omit from the backup the files within the
 cluster's pg_wal/ subdirectory.  This
 slight adjustment is worthwhile because it reduces the risk
 of mistakes when restoring.  This is easy to arrange if
@@ -1062,12 +1071,7 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);

 

-The backup label
-file includes the label string you gave to 
pg_backup_start,
-as well as the time at which pg_backup_start was run, 
and
-the name of the starting WAL file.  In case of confusion it is therefore
-possible to look inside a backup file and determine exactly which
-backup session the dump file came from.  The tablespace map file includes
+The tablespace map file includes
 the symbolic link names as they exist in the directory
 pg_tblspc/ and the full path of each symbolic link.
 These files are not merely for your information; their presence and
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index a6fcac0824a..6fd2eb8cc50 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26815,7 +26815,10 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 
free (88 chunks); 1029560
   label text
   , fast boolean
)
-pg_lsn
+record
+( lsn pg_lsn,
+timeline_id int8,
+start timestamptz )


 Prepares the server to begin an on-line backup.  The only required
@@ -26827,6 +26830,13 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 
free (88 chunks); 1029560
 as possible.  This forces an immediate checkpoint which will cause a
 spike in I/O operations, slowing any concurrently executing queries.

+   
+The result columns contain information about the start of the backup
+and can be ignored: the lsn column holds the
+starting write-ahead log location, the
+timeline_id column holds the starting timeline,
+and the stop column holds the starting 
timestamp.
+   

 This function is restricted to superusers by default, but other users
 can be granted EXECUTE to run the function.
@@ -26842,13 +26852,15 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 
622360 free (88 chunks); 1029560
   wait_for_archive 
boolean
)
 record
-( lsn pg_lsn,
-labelfile text,
-spcmapfile text )
+( pg_control_file text,
+tablespace_map_file text,
+lsn pg_lsn,
+timeline_id int8,
+stop timestamptz )


 Finishes performing an on-line backup.  The desired contents of the
-backup label file and the tablespace map file are returned as part of
+pg_control file and the tablespace map file are returned as part of
 the result of the function and must be written to files in the
 backup area.  These files must not be written to the live data 
directory
 (doing so will cause PostgreSQL to fail to 

Re: Add recovery to pg_control and remove backup_label

2023-11-05 Thread David Steele

On 10/27/23 13:45, David G. Johnston wrote:


Let me modify that to make it a bit more clear, I actually wouldn't care 
if pg_backup_end outputs an entire binary pg_control file as part of the 
SQL resultset.


My proposal would be to, in addition, place in the temporary directory 
on the server, Postgres-written versions of pg_control and 
tablespace_map as part of the pg_backup_end processing.  The client 
software would then have a choice.  Write the contents of the SQL 
resultset to newly created binary mode files in the destination, or, 
copy the server-written files from the temporary directory to the 
destination.


That said, I'm starting to dislike that idea myself.  It only really 
makes sense if the files could be placed in the data directory but that 
isn't doable given concurrent backups and not wanting to place the 
source server into an inconsistent state.


Pretty much the conclusion I have come to myself over the years.

Regards,
-David





Re: Add recovery to pg_control and remove backup_label

2023-10-27 Thread David Steele

On 10/26/23 17:27, David G. Johnston wrote:
On Thu, Oct 26, 2023 at 2:02 PM David Steele  wrote:


Are we planning on dealing with torn writes in the back branches in some 
way or are we just throwing in the towel and saying the old method is 
too error-prone to exist/retain 


We are still planning to address this issue in the back branches.

and therefore the goal of the v17 
changes is to not only provide a better way but also to ensure the old 
way no longer works?  It seems sufficient to change the output signature 
of pg_backup_stop to accomplish that goal though I am pondering whether 
an explicit check and error for seeing the backup_label file would be 
warranted.


Well, if the backup tool is just copying the second column of output to 
the backup_label, then it won't break. Of course in that case, restores 
won't work correctly but you would not get an error. Testing would show 
that it is not working properly and backup tools should certainly be tested.


Even so, I'm OK with an explicit check for backup_label. Let's see what 
others think.


If we are going to solve the torn writes problem completely then while I 
agree the new way is superior, implementing it doesn't have to mean 
existing tools built to produce backup_label and rely upon the 
pg_control in the data directory need to be forcibly broken.


It is a pretty easy update to any backup software that supports 
non-exclusive backup. I was able to make the changes to pgBackRest in 
less than an hour. We've made major changes to backup and restore in 
almost every major version of PostgreSQL for a while: non-exclusive 
backup in 9.6, dir renames in 10, variable WAL size in 11, new recovery 
location in 12, hard recovery target errors in 13, and changes to 
non-exclusive backup and removal of exclusive backup in 15. In 17 we are 
already looking at new page and segment sizes.



I know that outputting pg_control as bytea is going to be a bit
controversial. Software that is using psql get run pg_backup_stop()
could use encode() to get pg_control as text and then decode it later.
Alternately, we could update ReadControlFile() to recognize a
base64-encoded pg_control file. I'm not sure dealing with binary
data is
that much of a problem, though, and if the backup software gets it
wrong
then recovery will fail on an invalid pg_control file.

Can we not figure out some way to place the relevant files onto the 
server somewhere so that a simple "cp" command would work?  Have 
pg_backup_stop return paths instead of contents, those paths being 
"$TEMP_DIR"//pg_control.conf (and 
tablespace_map)


Nobody has been able to figure this out, and some of us have been 
thinking about it for years. It just doesn't seem possible to reliably 
tell the difference between a cluster that was copied and one that 
simply crashed.


If cp is really the backup tool being employed, I would recommend using 
pg_basebackup. cp has flaws that could lead to corruption, and of course 
does not at all take into account the archive required to make a backup 
consistent, directories to be excluded, the order of copying pg_control 
on backup from standby, etc., etc.


Backup/restore is not a simple endeavor and we don't do anyone favors 
pretending that it is.


Regards,
-David




Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-10-27 Thread David Steele

On 10/27/23 03:22, Michael Paquier wrote:

On Mon, Oct 16, 2023 at 02:54:35PM +0900, Michael Paquier wrote:

On Sat, Oct 14, 2023 at 03:45:33PM -0400, David Steele wrote:

On 9/28/23 19:59, Michael Paquier wrote:

Another idea I had was to force the creation of recovery.signal by
pg_basebackup even if -R is not used.  All the reports we've seen with
people getting confused came from pg_basebackup that enforces no
configuration.


This change makes it more obvious if configuration is missing (since you'll
get an error), however +1 for adding this to pg_basebackup.


Looking at the streaming APIs of pg_basebackup, it looks like this
would be a matter of using bbstreamer_inject_file() to inject an empty
file into the stream.  Still something seems to be off once
compression methods are involved..  Hmm.  I am not sure.  Well, this
could always be done as a patch independant of this one, under a
separate discussion.  There are extra arguments about whether it would
be a good idea to add a recovery.signal even when taking a backup from
a standby, and do that only in 17~.


Hmm.  On this specific point, it would actually be much simpler to
force recovery.signal to be in the contents streamed to a BASE_BACKUP.


That sounds like the right plan to me. Nice and simple.


This does not step on your proposal at [1], though, because you'd
still require a .signal file for recovery as far as I understand :/

[1]: 
https://www.postgresql.org/message-id/2daf8adc-8db7-4204-a7f2-a7e94e2bf...@pgmasters.net


Yes.


Would folks be OK to move on with the patch of this thread at the end?
I am attempting a last-call kind of thing.


I'm still +1 for the patch as it stands.

Regards,
-David




Add recovery to pg_control and remove backup_label

2023-10-26 Thread David Steele

Hackers,

This was originally proposed in [1] but that thread went through a 
number of different proposals so it seems better to start anew.


The basic idea here is to simplify and harden recovery by getting rid of 
backup_label and storing recovery information directly in pg_control. 
Instead of backup software copying pg_control from PGDATA, it stores an 
updated version that is returned from pg_backup_stop(). I believe this 
is better for the following reasons:


* The user can no longer remove backup_label and get what looks like a 
successful restore (while almost certainly causing corruption). If 
pg_control is removed the cluster will not start. The user may try 
pg_resetwal, but I think that tool makes it pretty clear that corruption 
will result from its use. We could also modify pg_resetwal to complain 
if recovery info is present in pg_control.


* We don't need to worry about backup software seeing a torn copy of 
pg_control, since Postgres can safely read it out of memory and provide 
a valid copy via pg_backup_stop(). This solves [2] without needing to 
write pg_control via a temp file, which may affect performance on a 
standby. Unfortunately, this solution cannot be back patched.


* For backup from standby, we no longer need to instruct the backup 
software to copy pg_control last. In fact the backup software should not 
copy pg_control from PGDATA at all.


Since backup_label is now gone, the fields that used to be in 
backup_label are now provided as columns returned from pg_backup_start() 
and pg_backup_stop() and the backup history file is still written to the 
archive. For pg_basebackup we would have the option of writing the 
fields into the JSON manifest, storing them to a file (e.g. 
backup.info), or just ignoring them. None of the fields are required for 
recovery but backup software may be very interested in them.
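
To illustrate the new flow, a low-level backup driven through psql would look 
roughly like this. This is only a sketch -- the column names are the ones 
proposed in the attached patch and the label is just an example:

  -- start the backup; the returned columns are informational only
  SELECT lsn, timeline_id, start
    FROM pg_backup_start('example label', fast => false);

  -- copy PGDATA, excluding global/pg_control, the contents of pg_wal,
  -- postmaster.pid, etc.

  -- stop the backup; store pg_control_file at global/pg_control in the
  -- backup and tablespace_map_file at tablespace_map if it is not empty
  SELECT pg_control_file, tablespace_map_file, lsn, timeline_id, stop
    FROM pg_backup_stop(wait_for_archive => true);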


I updated pg_rewind but I'm not very confident in the tests. When I 
removed backup_label processing, but before I updated pg_rewind to write 
recovery info into pg_control, all the rewind tests passed.


This patch highlights the fact that we still have no tests for the 
low-level backup method. I modified pgBackRest to work with this patch 
and the entire test suite ran without any issues, but in-core tests 
would be good to have. I'm planning to work on those myself as a 
separate patch.


This patch would also make the proposal in [3] obsolete since there is 
no need to rename backup_label if it is gone.


I know that outputting pg_control as bytea is going to be a bit 
controversial. Software that is using psql to run pg_backup_stop() 
could use encode() to get pg_control as text and then decode it later. 
Alternately, we could update ReadControlFile() to recognize a 
base64-encoded pg_control file. I'm not sure dealing with binary data is 
that much of a problem, though, and if the backup software gets it wrong 
then recovery will fail on an invalid pg_control file.
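
As a sketch of the encode() route (again assuming the bytea pg_control_file 
column name from the attached patch):

  -- capture pg_control as base64 so it survives text-mode capture in psql
  SELECT lsn,
         encode(pg_control_file, 'base64') AS pg_control_base64,
         tablespace_map_file
    FROM pg_backup_stop(wait_for_archive => true);

  -- later, decode(pg_control_base64, 'base64') yields the original bytes,
  -- which must be written unmodified to global/pg_control in the backup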


Lastly, I think there are improvements to be made in recovery that go 
beyond this patch. I originally set out to load the recovery info into 
*just* the existing fields in pg_control but it required so many changes 
to recovery that I decided it was too dangerous to do all in one patch. 
This patch very much takes the "backup_label in pg_control" approach, 
though I reused fields where possible. The added fields, e.g. 
backupRecoveryRequested, also allow us to keep the user experience 
pretty much the same in terms of messages and errors.


Thoughts?

Regards,
-David

[1] 
https://postgresql.org/message-id/1330cb48-4e47-03ca-f2fb-b144b49514d8%40pgmasters.net
[2] 
https://postgresql.org/message-id/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
[3] 
https://postgresql.org/message-id/eb3d1aae-1a75-bcd3-692a-38729423168f%40pgmasters.net

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae54..6be8fb902c5 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -935,19 +935,20 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
  ready to archive.
 
 
- pg_backup_stop will return one row with three
- values. The second of these fields should be written to a file named
- backup_label in the root directory of the backup. The
- third field should be written to a file named
- tablespace_map unless the field is empty. These 
files are
+ pg_backup_stop returns the
+ pg_control file, which must be stored in the
+ global directory of the backup. It also returns the
+ tablespace_map file, which should be written in the
+ root directory of the backup unless the field is empty. These files are
  vital to the backup working and must be written byte for byte without
- modification, which may require opening the file in binary mode.
+ modification, which will require opening the file in binary mode.
 


 
  Once the WAL segment files active during the backup are archived, you 

Re: Remove dead code in pg_ctl.c

2023-10-25 Thread David Steele

On 10/25/23 17:30, Nathan Bossart wrote:

On Wed, Oct 25, 2023 at 03:02:01PM -0500, Nathan Bossart wrote:

On Wed, Oct 25, 2023 at 02:53:31PM -0400, David Steele wrote:

It looks like this code was missed in 39969e2a when exclusive backup was
removed.


Indeed.  I'll plan on committing this shortly.


Committed.


Thank you, Nathan!

-David




Remove dead code in pg_ctl.c

2023-10-25 Thread David Steele

Hackers,

It looks like this code was missed in 39969e2a when exclusive backup was 
removed.


Regards,
-David

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 807d7023a99..4099d240e03 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -96,7 +96,6 @@ static time_t start_time;
 static char postopts_file[MAXPGPATH];
 static char version_file[MAXPGPATH];
 static char pid_file[MAXPGPATH];
-static char backup_file[MAXPGPATH];
 static char promote_file[MAXPGPATH];
 static char logrotate_file[MAXPGPATH];
 
@@ -2447,7 +2446,6 @@ main(int argc, char **argv)
snprintf(postopts_file, MAXPGPATH, "%s/postmaster.opts", 
pg_data);
snprintf(version_file, MAXPGPATH, "%s/PG_VERSION", pg_data);
snprintf(pid_file, MAXPGPATH, "%s/postmaster.pid", pg_data);
-   snprintf(backup_file, MAXPGPATH, "%s/backup_label", pg_data);
 
/*
 * Set mask based on PGDATA permissions,


Re: Add trailing commas to enum definitions

2023-10-23 Thread David Steele

On 10/23/23 17:04, Tom Lane wrote:

Nathan Bossart  writes:

 From a long-term perspective, I
think standardizing on the trailing comma style will actually improve
git-blame because patches won't need to add a comma to the previous line
when adding a value.


Yeah, that's a good point.  I had been leaning towards "this is
unnecessary churn", but with that idea I'm now +1.


+1 from me.

-David





Re: trying again to get incremental backup

2023-10-23 Thread David Steele

On 10/23/23 11:44, Robert Haas wrote:

On Fri, Oct 20, 2023 at 11:30 AM David Steele  wrote:


I don't plan to stand in your way on this feature. I'm reviewing what
patches I can out of courtesy and to be sure that nothing adjacent to
your work is being affected. My apologies if my reviews are not meeting
your expectations, but I am contributing as my time constraints allow.


Sorry, I realize reading this response that I probably didn't do a
very good job writing that email and came across sounding like a jerk.
Possibly, I actually am a jerk. Whether it just sounded like it or I
actually am, I apologize. 


That was the way it came across, though I prefer to think it was 
unintentional. I certainly understand how frustrating dealing with a 
large and uncertain patch can be. Either way, I appreciate the apology.


Now onward...


But your last paragraph here gets at my real
question, which is whether you were going to try to block the feature.
I recognize that we have different priorities when it comes to what
would make most sense to implement in PostgreSQL, and that's OK, or at
least, it's OK with me. 


This seem perfectly natural to me.


I also don't have any particular expectation
about how much you should review the patch or in what level of detail,
and I have sincerely appreciated your feedback thus far. If you are
able to continue to provide more, that's great, and if that's not,
well, you're not obligated. What I was concerned about was whether
that review was a precursor to a vigorous attempt to keep the main
patch from getting committed, because if that was going to be the
case, then I'd like to surface that conflict sooner rather than later.
It sounds like that's not an issue, which is great.


Overall I would say I'm not strongly for or against the patch. I think 
it will be very difficult to use in a manual fashion, but automation is 
the way to go in general so that's not necessarily an argument against.


However, this is an area of great interest to me so I do want to at 
least make sure nothing is being impacted adjacent to the main goal of 
this patch. So far I have seen no sign of that, but that has been a 
primary goal of my reviews.



At the risk of drifting into the fraught question of what I *should*
be implementing rather than the hopefully-less-fraught question of
whether what I am actually implementing is any good, I see incremental
backup as a way of removing some of the use cases for the low-level
backup API. If you said "but people still will have lots of reasons to
use it," I would agree; and if you said "people can still screw things
up with pg_basebackup," I would also agree. Nonetheless, most of the
disasters I've personally seen have stemmed from the use of the
low-level API rather than from the use of pg_basebackup, though there
are exceptions. 


This all makes sense to me.


I also think a lot of the use of the low-level API is
driven by it being just too darn slow to copy the whole database, and
incremental backup can help with that in some circumstances. 


I would argue that restore performance is *more* important than backup 
performance and this patch is a step backward in that regard. Backups 
will be faster and less space will be used in the repository, but 
restore performance is going to suffer. If the deltas are very small the 
difference will probably be negligible, but as the deltas get large (and 
especially if there are a lot of them) the penalty will be more noticeable.



Also, I
have worked fairly hard to try to make sure that if you misuse
pg_combinebackup, or fail to use it altogether, you'll get an error
rather than silent data corruption. I would be interested to hear
about scenarios where the checks that I've implemented can be defeated
by something that is plausibly described as stupidity rather than
malice. I'm not sure we can fix all such cases, but I'm very alert to
the horror that will befall me if user error looks exactly like a bug
in the code. For my own sanity, we have to be able to distinguish
those cases. 


I was concerned with the difficulty of trying to stage the correct 
backups for pg_combinebackup, not whether it would recognize that the 
needed data was not available and then error appropriately. The latter 
is surmountable within pg_combinebackup but the former is left up to the 
user.



Moreover, we also need to be able to distinguish
backup-time bugs from reassembly-time bugs, which is why I've got the
pg_walsummary tool, and why pg_combinebackup has the ability to emit
fairly detailed debugging output. I anticipate those things being
useful in investigating bug reports when they show up. I won't be too
surprised if it turns out that more work on sanity-checking and/or
debugging tools is needed, but I think your concern about people
misusing stuff is bang on target and I really want to do whatever we
can to avoid that when possible and detect it when it happens.


The ability of users to misuse

Re: trying again to get incremental backup

2023-10-20 Thread David Steele

On 10/19/23 16:00, Robert Haas wrote:

On Thu, Oct 19, 2023 at 3:18 PM David Steele  wrote:

0001 looks pretty good to me. The only thing I find a little troublesome
is the repeated construction of file names with/without segment numbers
in ResetUnloggedRelationsInDbspaceDir(), e.g.:

+   if (segno == 0)
+   snprintf(dstpath, sizeof(dstpath), "%s/%u",
+dbspacedirname, relNumber);
+   else
+   snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+dbspacedirname, relNumber, 
segno);


If this happened three times I'd definitely want a helper function, but
even with two I think it would be a bit nicer.


Personally I think that would make the code harder to read rather than
easier. I agree that repeating code isn't great, but this is a
relatively brief idiom and pretty self-explanatory. If other people
agree with you I can change it, but to me it's not an improvement.


Then I'm fine with it as is.


0002 is definitely a good idea. FWIW pgBackRest does this conversion but
also errors if it does not succeed. We have never seen a report of this
error happening in the wild, so I think it must be pretty rare if it
does happen.


Cool, but ... how about the main patch set? It's nice to get some of
these refactoring bits and pieces out of the way, but if I spend the
effort to work out what I think are the right answers to the remaining
design questions for the main patch set and then find out after I've
done all that that you have massive objections, I'm going to be
annoyed. I've been trying to get this feature into PostgreSQL for
years, and if I don't succeed this time, I want the reason to be
something better than "well, I didn't find out that David disliked X
until five minutes before I was planning to type 'git push'."


I simply have not had time to look at the main patch set in any detail.


I'm not really concerned about detailed bug-hunting in the main
patches just yet. The time for that will come. But if you have views
on how to resolve the design questions that I mentioned in a couple of
emails back, or intend to advocate vigorously against the whole
concept for some reason, let's try to sort that out sooner rather than
later.


In my view this feature puts the cart way before the horse. I'd think 
higher priority features might be parallelism, a backup repository, 
expiration management, archiving, or maybe even a restore command.


It seems the only goal here is to make pg_basebackup a tool for external 
backup software to use, which might be OK, but I don't believe this 
feature really advances pg_basebackup as a usable piece of stand-alone 
software. If people really think that start/stop backup is too 
complicated an interface, how are they supposed to track page 
incrementals and get them to a place where pg_combinebackup can put them 
back together? If automation is required to use this feature, 
shouldn't pg_basebackup implement that automation?
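
Just to make the manual workflow concrete -- and assuming the eventual 
command line looks roughly like this (flag names are my reading of the 
current patch set and may well change) -- the user is on the hook for 
tracking the whole chain and feeding it to pg_combinebackup in order:

  pg_basebackup -D /backups/full
  pg_basebackup -D /backups/incr1 --incremental=/backups/full/backup_manifest
  pg_basebackup -D /backups/incr2 --incremental=/backups/incr1/backup_manifest
  pg_combinebackup /backups/full /backups/incr1 /backups/incr2 -o /restore/pgdata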


I have plenty of thoughts about the implementation as well, but I have a 
lot on my plate right now and I don't have time to get into it.


I don't plan to stand in your way on this feature. I'm reviewing what 
patches I can out of courtesy and to be sure that nothing adjacent to 
your work is being affected. My apologies if my reviews are not meeting 
your expectations, but I am contributing as my time constraints allow.


Regards,
-David




Re: trying again to get incremental backup

2023-10-19 Thread David Steele

On 10/19/23 12:05, Robert Haas wrote:

On Wed, Oct 4, 2023 at 4:08 PM Robert Haas  wrote:

Clearly there's a good amount of stuff to sort out here, but we've
still got quite a bit of time left before feature freeze so I'd like
to have a go at it. Please let me know your thoughts, if you have any.


Apparently, nobody has any thoughts, but here's an updated patch set
anyway. The main change, other than rebasing, is that I did a bunch
more documentation work on the main patch (0005). I'm much happier
with it now, although I expect it may need more adjustments here and
there as outstanding design questions get settled.

After some thought, I think that it should be fine to commit 0001 and
0002 as independent refactoring patches, and I plan to go ahead and do
that pretty soon unless somebody objects.


0001 looks pretty good to me. The only thing I find a little troublesome 
is the repeated construction of file names with/without segment numbers 
in ResetUnloggedRelationsInDbspaceDir(), e.g.:


+   if (segno == 0)
+   snprintf(dstpath, sizeof(dstpath), "%s/%u",
+dbspacedirname, relNumber);
+   else
+   snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+dbspacedirname, relNumber, 
segno);


If this happened three times I'd definitely want a helper function, but 
even with two I think it would be a bit nicer.


0002 is definitely a good idea. FWIW pgBackRest does this conversion but 
also errors if it does not succeed. We have never seen a report of this 
error happening in the wild, so I think it must be pretty rare if it 
does happen.


Regards,
-David




Re: The danger of deleting backup_label

2023-10-19 Thread David Steele

On 10/19/23 10:56, Robert Haas wrote:

On Thu, Oct 19, 2023 at 10:43 AM David Steele  wrote:

What I meant here (but said badly) is that in the case of snapshot
backups, the backup_label and tablespace_map will likely need to be
stored somewhere off the server since they can't be part of the
snapshot, perhaps in a key store. In that case the backup software would
still need to read the files from wherever we stored them and correctly
handle them when storing elsewhere. If you were moving the files to say,
S3, a similar thing needs to happen. In general, I think a locally
mounted filesystem is very unlikely to be the final destination for
these files, and if it is then probably pg_basebackup is your friend.


I mean, writing those tiny little files locally and then uploading
them should be fine in a case like that. It still reduces the surface
for mistakes. And you could also have --backup-label='| whatever' or
something if you wanted. The point is that right now we're asking
people to pull this information out of a query result, and that means
people are trying to do it by calling out to psql, and that is a GREAT
way to screw up the escaping or the newlines or whatever. I don't
think the mistakes people are making here are being made by people
using Perl and DBD::Pg, or Python and psycopg2, or C and libpq.
They're being made by people who are trying to shell script their way
through it, which entails using psql, which makes screwing it up a
breeze.


OK, I see what you mean and I agree. Better documentation might be the 
answer here, but I doubt that psql is a good tool for starting/stopping 
backup and I'm not sure we want to encourage it.


Regards,
-David




Re: The danger of deleting backup_label

2023-10-19 Thread David Steele

On 10/19/23 10:24, Robert Haas wrote:

On Wed, Oct 18, 2023 at 7:15 PM David Steele  wrote:

(b) be stored someplace
else,


I don't think the additional fields *need* to be stored anywhere at all,
at least not by us. We can provide them as output from pg_backup_stop()
and the caller can do as they please. None of those fields are part of
the restore process.


Not sure which fields we're talking about here. I agree that if
they're not really needed, we can return them and the user can keep
them or discard them as they wish. But in that case you might also ask
why bother even returning them.


I'm specifically talking about START TIME and LABEL. They are currently 
stored in backup_label but not used for recovery. START TIMELINE is also 
not used, except as a cross check against START WAL LOCATION.


I'd still like to see most or all of these fields exposed through 
pg_backup_stop(). The user can choose to store them or not, but none of 
them will be required for recovery.
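
For reference, a typical backup_label looks something like this (values 
invented for illustration); START TIME and LABEL are the purely 
informational lines I'm talking about:

  START WAL LOCATION: 0/2000028 (file 000000010000000000000002)
  CHECKPOINT LOCATION: 0/2000060
  BACKUP METHOD: streamed
  BACKUP FROM: primary
  START TIME: 2023-10-19 10:00:00 UTC
  LABEL: nightly
  START TIMELINE: 1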



pg_llbackup -d $CONNTR --backup-label=PATH --tablespace-map=PATH
--copy-data-directory=SHELLCOMMAND


I think in most cases where this would be useful the user should just be
using pg_basebackup. If the backup is trying to use snapshots, then
backup_label needs to be stored outside the snapshot and we won't be
able to easily help.


Right, the idea of the above was that you would specify paths for the
backup label and the tablespace map that were outside of the snapshot
directory in that kind of case. But you couldn't screw up the line
endings or whatever because pg_llbackup would take care of that aspect
of it for you.


What I meant here (but said badly) is that in the case of snapshot 
backups, the backup_label and tablespace_map will likely need to be 
stored somewhere off the server since they can't be part of the 
snapshot, perhaps in a key store. In that case the backup software would 
still need to read the files from wherever we stored them and correctly 
handle them when storing elsewhere. If you were moving the files to say, 
S3, a similar thing needs to happen. In general, I think a locally 
mounted filesystem is very unlikely to be the final destination for 
these files, and if it is then probably pg_basebackup is your friend.


Regards,
-David





Re: The danger of deleting backup_label

2023-10-18 Thread David Steele

On 10/18/23 08:39, Robert Haas wrote:

On Tue, Oct 17, 2023 at 4:17 PM David Steele  wrote:

Given that the above can't be back patched, I'm thinking we don't need
backup_label at all going forward. We just write the values we need for
recovery into pg_control and return *that* from pg_backup_stop() and
tell the user to store it with their backup. We already have "These
files are vital to the backup working and must be written byte for byte
without modification, which may require opening the file in binary
mode." in the documentation so dealing with pg_control should not be a
problem. pg_control also has a CRC so we will know if it gets munged.


Yeah, I was thinking about this kind of idea, too. I think it might be
a good idea, although I'm not completely certain about that, either.





First, anything that is stored in the backup_label but not the control
file has to (a) move into the control file, 


I'd rather avoid this.


(b) be stored someplace
else, 


I don't think the additional fields *need* to be stored anywhere at all, 
at least not by us. We can provide them as output from pg_backup_stop() 
and the caller can do as they please. None of those fields are part of 
the restore process.


or (c) be eliminated as a concept. 


I think we should keep them as above since I don't believe they hurt 
anything.



We're likely to get
complaints about (a), especially if the data in question is anything
big. Any proposal to do (b) risks undermining the whole theory under
which this is a good proposal, namely that removing backup_label gives
us one less thing to worry about. So that brings us to (c).
Personally, I would lose very little sleep if the LABEL field died and
never came back, and I wouldn't miss START TIME and STOP TIME either,
but somebody else might well feel differently. I don't think it's
trivial to get rid of BACKUP METHOD, as there unfortunately seems to
be code that depends on knowing the difference between BACKUP FROM:
streamed and BACKUP FROM: pg_rewind. I suspect that BACKUP FROM:
primary/standby might have the same issue, but I'm not sure. 


BACKUP FROM: primary/standby we can definitely handle, and BACKUP 
METHOD: pg_rewind just means backupEndRequired is false. That will need 
to be written by pg_rewind (instead of writing backup_label).



STOP
TIMELINE could be a problem too. I think that if somebody could do
some rejiggering to eliminate some of the differences between the
cases here, that could be really good general cleanup irrespective of
what we decide about this proposal, and moving some things in to
pg_control is probably reasonable too. For instance, it would seem
crazy to me to argue that storing the backup end location in the
control file is OK, but storing the backup end TLI there would not be
OK. 


We have a place in pg_control for the end TLI (minRecoveryPointTLI). 
That's only valid for backup from standby since a primary backup can 
never change timelines.



But the point here is that there's probably a good deal of careful
thinking that would need to be done here about exactly where all of
the stuff that currently exists in the backup_label file but not in
pg_control needs to end up.


Agreed.


Second, right now, the stuff that we return at the end of a backup is
all text data. With this proposal, it becomes binary data. I entirely
realize that people should only be doing these kinds of backups using
automated tools that that those automated tools should be perfectly
capable of handling binary data without garbling anything. But that's
about as realistic as supposing that people won't instantly remove the
backup_label file the moment it seems like it will solve some problem,
even when the directions clearly state that this should only be done
in some other situation that is not the one the user is facing. 


Well, we do specify that backup_label and tablespace_map should be saved 
byte for byte, but we've already seen users mess this up in the past and 
add \r characters that made the files unreadable.


If it can be done wrong, it will be done wrong by somebody.


It
just occurred to me that one thing we could do to improve the user
experience here is offer some kind of command-line utility to assist
with taking a low-level backup. This could be done even if we don't
proceed with this proposal e.g.

pg_llbackup -d $CONNTR --backup-label=PATH --tablespace-map=PATH
--copy-data-directory=SHELLCOMMAND

I don't know for sure how much that would help, but I wonder if it
might actually help quite a bit, because right now people do things
like use psql in a shell script to try to juggle a database connection
and then in some other part of the shell script do the data copying.
But it is quite easy to screw up the error handling or the psql
session lifetime or something like that, and this would maybe provide
a nicer interface. 


I think in most cases where this would be useful the user should just be 
using pg_basebackup. If the backup is trying to use

Re: The danger of deleting backup_label

2023-10-18 Thread David Steele

On 10/17/23 22:13, Kyotaro Horiguchi wrote:

At Tue, 17 Oct 2023 16:16:42 -0400, David Steele  wrote in

Given that the above can't be back patched, I'm thinking we don't need
backup_label at all going forward. We just write the values we need
for recovery into pg_control and return *that* from pg_backup_stop()
and tell the user to store it with their backup. We already have
"These files are vital to the backup working and must be written byte
for byte without modification, which may require opening the file in
binary mode." in the documentation so dealing with pg_control should
not be a problem. pg_control also has a CRC so we will know if it gets
munged.


I'm somewhat perplexed regarding the objective of this thread.

This thread began with the intent of preventing users from removing
the backup_label from a backup. At the beginning, the proposal aimed
to achieve this by injecting an invalid value to pg_control file
located in the generated backup. However, this (and previous) proposal
seems to deviate from that initial objective. It now eliminates the
need to be concerned about the pg_control version that is copied into
the generated backup. However, if someone removes the backup_label
from a backup, the initial concerns could still surface.


Yeah, the discussion has moved around quite a bit, but the goal remains 
the same: to make Postgres error when it does not have the information 
it needs to proceed with recovery. Right now, if you delete backup_label, 
recovery appears to complete successfully, silently corrupting the database.


In the proposal as it stands now there would be no backup_label at all, 
so no danger of removing it.


We have also gotten a bit sidetracked by our hope to use this proposal 
to address torn reads of pg_control during the backup, at least in HEAD.


Regards,
-David




Re: Rename backup_label to recovery_control

2023-10-18 Thread David Steele

On 10/18/23 03:07, Peter Eisentraut wrote:

On 16.10.23 17:15, David Steele wrote:

I also do wonder with recovery_control is really a better name. Maybe
I just have backup_label too firmly stuck in my head, but is what that
file does really best described as recovery control? I'm not so sure
about that.


The thing it does that describes it as "recovery control" in my view 
is that it contains the LSN where Postgres must start recovery (plus 
TLI, backup method, etc.). There is some other informational stuff in 
there, but the important fields are all about ensuring consistent 
recovery.


At the end of the day the entire point of backup *is* recovery and 
users will interact with this file primarily in recovery scenarios.


Maybe "restore" is better than "recovery", since recovery also happens 
separate from backups, but restoring is something you do with a backup 
(and there is also restore_command etc.).


I would not object to restore (there is restore_command) but I do think 
of what PostgreSQL does as "recovery" as opposed to "restore", which 
comes before the recovery. Recovery is used a lot in the docs and there 
is also recovery.signal.


But based on the discussion in [1] I think we might be able to do away 
with backup_label entirely, which would make this change moot.


Regards,
-David

[1] 
https://www.postgresql.org/message-id/0f948866-7caf-0759-d53c-93c3e266ec3f%40pgmasters.net





Re: The danger of deleting backup_label

2023-10-17 Thread David Steele

On 10/14/23 11:30, David Steele wrote:

On 10/12/23 10:19, David Steele wrote:

On 10/11/23 18:10, Thomas Munro wrote:


As Stephen mentioned[1], we could perhaps also complain if both backup
label and control file exist, and then hint that the user should
remove the *control file* (not the backup label!).  I had originally
suggested we would just overwrite the control file, but by explicitly
complaining about it we would also bring the matter to tool/script
authors' attention, ie that they shouldn't be backing that file up, or
should be removing it in a later step if they copy everything.  He
also mentions that there doesn't seem to be anything stopping us from
back-patching changes to the backup label contents if we go this way.
I don't have a strong opinion on that and we could leave the question
for later.


I'm worried about the possibility of back patching this unless the 
solution comes out to be simpler than I think, and that rarely comes to 
pass. Surely throwing errors on something that is currently valid 
(i.e. backup_label and pg_control both present) is not something we can 
back patch.


But perhaps there is a simpler, acceptable solution we could back 
patch (transparent to all parties except Postgres) and then a more 
advanced solution we could go forward with.


I guess I had better get busy on this.


Attached is a very POC attempt at something along these lines that could 
be back patched. I stopped when I realized excluding pg_control from the 
backup is a necessity to make this work (else we still could end up with 
a torn copy of pg_control even if there is a good copy elsewhere). To 
enumerate the back patch issues as I see them:


Given that the above can't be back patched, I'm thinking we don't need 
backup_label at all going forward. We just write the values we need for 
recovery into pg_control and return *that* from pg_backup_stop() and 
tell the user to store it with their backup. We already have "These 
files are vital to the backup working and must be written byte for byte 
without modification, which may require opening the file in binary 
mode." in the documentation so dealing with pg_control should not be a 
problem. pg_control also has a CRC so we will know if it gets munged.


It doesn't really matter where/how they store pg_control as long as it 
is written back into PGDATA before the cluster starts. If 
backupEndRequired, etc., are set appropriately then recovery will do the 
right thing when it starts, just as now if PG crashes after it has 
renamed backup_label but before recovery to consistency has completed.
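
As a hypothetical sketch (paths illustrative), a restore under this 
proposal could look like the following, assuming the backup tool stashed 
the pg_control returned by pg_backup_stop() next to the file-level backup:

  # Restore the file-level backup, then put pg_control back in place
  # before the cluster is started.
  rsync -a /backups/mybackup/pgdata/ "$PGDATA"/
  cp /backups/mybackup/pg_control "$PGDATA"/global/pg_control

  # Configure recovery and start; backupEndRequired etc. in pg_control
  # then make recovery do the right thing.
  echo "restore_command = 'cp /backups/wal_archive/%f %p'" >> "$PGDATA"/postgresql.auto.conf
  touch "$PGDATA"/recovery.signal
  pg_ctl -D "$PGDATA" start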


We can still enforce the presence of recovery.signal by checking 
backupEndRequired if that's something we want but it seems like 
backupEndRequired would be enough. I'm fine either way.


Another thing we can do here is make backup from standby easier. The 
user won't need to worry about *when* pg_control is copied. We can just 
write the ideal min recovery point into pg_control.


Any informational data currently in backup_label can be returned as 
columns (like the start/end lsn is now).


This makes the patch much less invasive and while it definitely, 
absolutely cannot be back patched, it seems like a good way forward.


This is the direction I'm planning to work on patch-wise but I'd like to 
hear people's thoughts.


Regards,
-David




Re: Improving Physical Backup/Restore within the Low Level API

2023-10-17 Thread David Steele

On 10/17/23 14:28, Robert Haas wrote:

On Mon, Oct 16, 2023 at 5:21 PM David G. Johnston
 wrote:

But no, by default, and probably so far as pg_basebackup is concerned, a server 
crash during backup results in requiring outside intervention in order to get 
the server to restart.


Others may differ, but I think such a proposal is dead on arrival. As
Laurenz says, that's just reinventing one of the main problems with
exclusive backup mode.


I concur -- this proposal resurrects the issues we had with exclusive 
backups without solving the issues being debated elsewhere, e.g. torn 
reads of pg_control or users removing backup_label when they should not.


Regards,
-David




Re: The danger of deleting backup_label

2023-10-16 Thread David Steele

On 10/16/23 15:06, Robert Haas wrote:

On Mon, Oct 16, 2023 at 1:00 PM David Steele  wrote:

After some agonizing (we hope) they decide to delete backup_label and,
wow, it just works! So now they merrily go on their way with a corrupted
cluster. They also remember for the next time that deleting backup_label
is definitely a good procedure.

The idea behind this patch is that deleting backup_label would produce a
hard error because pg_control would be missing as well (if the backup
software did its job). If both pg_control and backup_label are present
(but pg_control has not been loaded with the contents of backup_label,
i.e. it is the running copy from the backup cluster) we can also error.


I mean, I think we're just going in circles, here. I did and do
understand, but I didn't and don't agree. You're hypothesizing a user
who is willing to do ONE thing that they shouldn't do during backup
restoration (namely, remove backup_label) but who won't be willing to
do a SECOND thing that they shouldn't do during backup restoration
(namely, run pg_resetwal). 


In my experience the first case is much more likely than the second. 
Your experience may vary.


Anyway, I think they are pretty different. Deleting backup label appears 
to give a perfectly valid restore. Running pg_resetwal is more clearly 
(I think) the nuclear solution.



I understand that you feel differently, and that's fine, but I don't
think our disagreement here stems from me being confused. I just ...
don't agree.


Fair enough, we don't agree.

Regards,
-David




Re: Rename backup_label to recovery_control

2023-10-16 Thread David Steele

On 10/16/23 12:12, Robert Haas wrote:

On Mon, Oct 16, 2023 at 12:06 PM Michael Banck  wrote:

Not sure what to do about this, but as people/companies start moving to
15, I am afraid we will get people complaining about this. I think
having exclusive mode still be the default for pg_start_backup() (albeit
deprecated) in one release and then dropping it in the next was too
fast.


I completely agree, and I said so at the time, but got shouted down. I
think the argument that exclusive backups were breaking anything at
all was very weak. Nobody was being forced to use them, and they broke
nothing for people who didn't.


My argument then (and now) is that exclusive backup prevented us from 
making material improvements in backup and recovery. It was complicated, 
duplicative (in code and docs), and entirely untested.


So you are correct that it was only dangerous to the people who were 
using it (even if they did not know they were), but it was also a 
barrier to progress.


Regards,
-David




Re: Rename backup_label to recovery_control

2023-10-16 Thread David Steele




On 10/16/23 12:06, Michael Banck wrote:

On Mon, Oct 16, 2023 at 11:15:53AM -0400, David Steele wrote:

On 10/16/23 10:19, Robert Haas wrote:

We got rid of exclusive backup mode. We replaced pg_start_backup
with pg_backup_start.


I do think this was an improvement. For example it allows us to do
[1], which I believe is a better overall solution to the problem of
torn reads of pg_control. With exclusive backup we would not have this
option.


Well maybe, but it also means that any other 3rd party (i.e. not
Postgres-specific) backup tool seems to only support Postgres up until
version 14, as they cannot deal with non-exclusive mode - they are used
to a simple pre/post-script approach.


I'd be curious to know what enterprise solutions currently depend on 
this method. At the very least they'd need to manage a WAL archive since 
copying pg_wal is not a safe thing to do (without a snapshot), so it's 
not just a matter of using start/stop scripts. And you'd probably want 
PITR, etc.
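
For reference, a minimal sketch of what "managing a WAL archive" means 
here, on top of any pre/post snapshot scripts (paths illustrative; the 
archive_command is the standard documentation example):

  # postgresql.conf
  archive_mode = on
  archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'

  # ...plus expiring archived WAL and a matching restore_command for PITR.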



Not sure what to do about this, but as people/companies start moving to
15, I am afraid we will get people complaining about this. I think
having exclusive mode still be the default for pg_start_backup() (albeit
deprecated) in one release and then dropping it in the next was too
fast.


But lots of companies are on PG15 and lots of hosting providers support 
it, apparently with no issues. Perhaps the companies you are referring 
to are lagging in adoption (a pretty common scenario) but I still see no 
evidence that there is a big problem looming.


Exclusive backup was deprecated for six releases, which should have been 
ample time to switch over. All the backup solutions I am familiar with 
have supported non-exclusive backup for years.



Or is somebody helping those "enterprise" backup solutions along in
implementing non-exclusive Postgres backups?


I couldn't say, but there are many examples in open source projects of 
how to do this. Somebody (Laurenz, I believe) also wrote a shell script 
to simulate exclusive backup behavior for those that want to continue 
using it. Not what I would recommend, but he showed that it was possible.


Regards,
-David




Re: The danger of deleting backup_label

2023-10-16 Thread David Steele

On 10/16/23 12:25, Robert Haas wrote:

On Mon, Oct 16, 2023 at 11:45 AM David Steele  wrote:

Hmmm, the reason to back patch this is that it would fix [1], which sure
looks like a problem to me even if it is not a "bug". We can certainly
require backup software to retry pg_control until the checksum is valid
but that seems like a pretty big ask, even considering how complicated
backup is.


That seems like a problem with pg_control not being written atomically
when the standby server is updating it during recovery, rather than a
problem with backup_label not being used at the start of recovery.
Unless I'm confused.


You are not confused. But the fact that it can practically be read as 
torn at all on the standby is a function of how rapidly it is being 
written to update min recovery point. Writing it atomically via a temp 
file may affect performance in this area, but only while a backup is in 
progress, so the cost would be recovery running a bit more slowly during 
a backup.


I don't have proof of this because I don't have any hosts large enough 
to test the theory.



Right now the user can remove backup_label and get a "successful"
restore and not realize that they have just corrupted their cluster.
This is independent of the backup/restore tool doing all the right things.


I don't think it's independent of that at all.


I think it is. Imagine the user does backup/recovery using their 
favorite tool and everything is fine. But due to some misconfiguration 
or a problem in the WAL archive, they get this error:


FATAL:  could not locate required checkpoint record
2023-10-16 16:42:50.132 UTC HINT:  If you are restoring from a backup, 
touch "/home/dev/test/data/recovery.signal" or 
"/home/dev/test/data/standby.signal" and add required recovery options.
If you are not restoring from a backup, try removing the file 
"/home/dev/test/data/backup_label".
Be careful: removing "/home/dev/test/data/backup_label" will 
result in a corrupt cluster if restoring from a backup.


I did this by setting restore_command=false, but it could just as easily 
be the actual restore command that returns false due to a variety of 
reasons. The user has no idea what "could not locate required checkpoint 
record" means and if there is enough automation they may not even 
realize they just restored from a backup.
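
Roughly, the reproduction looks like this (paths illustrative; any 
failing restore_command produces the same FATAL):

  # Restore a low-level backup into $PGDATA first, then:
  echo "restore_command = 'false'" >> "$PGDATA"/postgresql.auto.conf
  touch "$PGDATA"/recovery.signal
  pg_ctl -D "$PGDATA" start

  # Startup fails with "could not locate required checkpoint record".
  # Deleting backup_label makes it "work" -- and silently corrupts the cluster.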


After some agonizing (we hope) they decide to delete backup_label and, 
wow, it just works! So now they merrily go on their way with a corrupted 
cluster. They also remember for the next time that deleting backup_label 
is definitely a good procedure.


The idea behind this patch is that deleting backup_label would produce a 
hard error because pg_control would be missing as well (if the backup 
software did its job). If both pg_control and backup_label are present 
(but pg_control has not been loaded with the contents of backup_label, 
i.e. it is the running copy from the backup cluster) we can also error.


It's not perfect, because they could backup (or restore) pg_control but 
not backup_label, but we are narrowing the cases where something can go 
wrong and they have silent corruption, especially if their 
backup/restore software follows the directions.


Regards,
-David




Re: The danger of deleting backup_label

2023-10-16 Thread David Steele

On 10/16/23 10:55, Robert Haas wrote:

On Sat, Oct 14, 2023 at 11:33 AM David Steele  wrote:

All of this is fixable in HEAD, but seems incredibly dangerous to back
patch. Even so, I have attached the patch in case somebody sees an
opportunity that I do not.


I really do not think we should be even thinking about back-patching
something like this. It's clearly not a bug fix, although I'm sure
that someone can try to characterize it that way, if they want to make
the well-worn argument that any behavior they don't like is a bug. But
that's a pretty lame argument. Usage errors on the part of users are
not bugs, even if we've coded the software in such a way as to make
those errors more likely.


Hmmm, the reason to back patch this is that it would fix [1], which sure 
looks like a problem to me even if it is not a "bug". We can certainly 
require backup software to retry pg_control until the checksum is valid 
but that seems like a pretty big ask, even considering how complicated 
backup is.


I investigated this as a solution to [1] because it seemed like a better 
solution that what we have in [1]. But once I got into the weeds it was 
obvious that this wasn't going to be something we could back patch.



I think what we ought to be talking about is whether a change like
this is a good idea even in master. I don't think it's a terrible
idea, but I'm also not sure that it's a good idea. The problem is that
if you're doing the right thing with your backup_label, then this is
unnecessary, and if you're doing the wrong thing, then why should you
do the right thing about this? 


First and foremost, this solves the problem of pg_control being torn 
when read by the backup software. It can't be torn if it is not there.


There are also other advantages -- we can massage pg_control before 
writing it back out. This already happens, but right now there is also a 
copy of the (prior) running pg_control in place that can mess up the 
process.



I mean, admittedly you can't just
ignore a fatal error, but I think people will just run pg_resetwal,
which is even worse than starting from the wrong checkpoint. 


If you start from the last checkpoint (which is what will generally be 
stored in pg_control) then the effect is pretty similar.



I feel
like in cases where a customer I'm working with has a bad backup,
their entire focus is on doing something to that backup to get a
running system back, whatever it takes. It's already too late at that
point to fix the backup procedure - they only have the backups they
have. You could hope people would do test restores before disaster
strikes, but people who are that prepared are probably running a real
backup tool and will never have this problem in the first place.


Right now the user can remove backup_label and get a "successful" 
restore and not realize that they have just corrupted their cluster. 
This is independent of the backup/restore tool doing all the right things.


My goal here is to narrow the options to try and make it so there is 
*one* valid procedure that will work. For this patch the idea is that 
they *must* start Postgres to get a valid pg_control from the 
backup_label. Any other action leads to a fatal error.


Note that the current patch is very WIP and does not actually do 
everything I'm talking about here. I was just trying to see if it could 
be used to solve the problem in [1]. It can't.



Perhaps that's all too pessimistic. I don't know. Certainly, other
people can have experiences that are different than mine. But I feel
like I struggle to think of a case where this would have prevented a
bad outcome, and that makes me wonder whether it's really a good idea
to complicate the system.


I'm specifically addressing cases like those that came up (twice!) in 
[2]. This is the main place I see people stumbling these days. If even 
hackers can make this mistake then we should do better.


Regards,
-David

[1] 
https://www.postgresql.org/message-id/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
[2] 
https://www.postgresql.org/message-id/flat/CAM_vCudkSjr7NsNKSdjwtfAm9dbzepY6beZ5DP177POKy8%3D2aw%40mail.gmail.com#746e492bfcd2667635634f1477a61288





Re: Rename backup_label to recovery_control

2023-10-16 Thread David Steele

On 10/16/23 10:19, Robert Haas wrote:

On Sat, Oct 14, 2023 at 2:22 PM David Steele  wrote:

I was recently discussing the complexities of dealing with pg_control
and backup_label with some hackers at PGConf NYC, when David Christensen
commented that backup_label was not a very good name since it gives the
impression of being informational and therefore something the user can
delete. In fact, we see this happen quite a lot, and there have been
some other discussions about it recently, see [1] and [2]. I bounced the
idea of a rename off various hackers at the conference and in general
people seemed to think it was a good idea.


Personally, I feel like this is an area where we keep moving the parts
around but I'm not sure we're really getting to anything better. We
got rid of recovery.conf. 


I agree that this was not an improvement. I was fine with bringing the 
recovery options into the GUC fold but never really liked forcing them 
into postgresql.auto.conf. But I lost that argument.



We got rid of exclusive backup mode. We
replaced pg_start_backup with pg_backup_start. 


I do think this was an improvement. For example it allows us to do [1], 
which I believe is a better overall solution to the problem of torn 
reads of pg_control. With exclusive backup we would not have this option.



It feels like every
other release or so we whack something around here, but I'm not
convinced that any of it is really making much of an impact. If
there's been any decrease in people screwing up their backups, I
haven't noticed it.


It's pretty subjective, but I feel much the same way. However, I think 
the *areas* where people are messing up are changing as we remove 
obstacles, and I feel like we should address them. backup_label has 
always been a bit of a problem -- basically, deciding whether or not it 
should be deleted.


With the removal of exclusive backup we removed the only valid use case 
(I think) for removing backup_label manually. Now, it should probably 
never be removed manually, so we need to make adjustments to make that 
clearer to the user, also see [1].


Better messaging may also help, and I am also thinking about that.


To be fair, I will grant that renaming pg_clog to pg_xact_status and
pg_xlog to pg_wal does seem to have reduced the incidence of people
nuking those directories, at least IME. So maybe this change would
help too, for similar reasons. But I'm still concerned that we're
doing too much superficial tinkering in this area. Breaking
compatibility is not without cost.


True enough, but ISTM that we have gotten few (if any) actual complaints 
outside of hackers speculating that there will be complaints. For the 
various maintainers of backup software this is just business as usual. 
The changes to pg_basebackup are also pretty trivial.



I also do wonder with recovery_control is really a better name. Maybe
I just have backup_label too firmly stuck in my head, but is what that
file does really best described as recovery control? I'm not so sure
about that.


The thing it does that describes it as "recovery control" in my view is 
that it contains the LSN where Postgres must start recovery (plus TLI, 
backup method, etc.). There is some other informational stuff in there, 
but the important fields are all about ensuring consistent recovery.


At the end of the day the entire point of backup *is* recovery and users 
will interact with this file primarily in recovery scenarios.


Regards,
-David

---

[1] 
https://www.postgresql.org/message-id/1330cb48-4e47-03ca-f2fb-b144b49514d8%40pgmasters.net





Re: Rename backup_label to recovery_control

2023-10-16 Thread David Steele

On 10/16/23 00:26, Kyotaro Horiguchi wrote:

At Mon, 16 Oct 2023 13:16:42 +0900 (JST), Kyotaro Horiguchi 
 wrote in

Just an idea in a slightly different direction, but I'm wondering if
we can simply merge the content of backup_label into the control file.
The file is 8192 bytes long, yet only 256 bytes are used. As a result,
we anticipate no overhead.  Such a configuration would forcibly prevent
users from removing the backup information.


On second thought, that would break the case of file-system level
backups, which require backup information separately from control
data.


Exactly -- but we do have a proposal to do the opposite and embed 
pg_control into backup_label [1] (or hopefully recovery_control).


Regards,
-David

[1] 
https://www.postgresql.org/message-id/1330cb48-4e47-03ca-f2fb-b144b49514d8%40pgmasters.net





Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-10-14 Thread David Steele

On 9/28/23 19:59, Michael Paquier wrote:

On Thu, Sep 28, 2023 at 04:23:42PM -0400, David Steele wrote:


So overall, +1 for Michael's patch, though I have only read through it and
not tested it yet.


Reviews, thoughts and opinions are welcome.


OK, I have now reviewed and tested the patch and it looks good to me. I 
stopped short of marking this RfC since there are other reviewers in the 
mix.


I dislike that we need to repeat:

OwnLatch(&XLogRecoveryCtl->recoveryWakeupLatch);

But I see the logic behind why you did it and there's no better way to 
do it as far as I can see.



One comment, though, if we are going to require recovery.signal when
backup_label is present, should it just be implied? Why error and force the
user to create it?


That's one thing I was considering, but I also cannot convince myself
that this is the best option because the presence of recovery.signal
or standby.standby (if both, standby.signal takes priority) makes it
clear what type of recovery is wanted at disk level.  I'd be OK if
folks think that this is a sensible consensus, as well, even if I
don't really agree with it.


I'm OK with keeping it as required for now.


Another idea I had was to force the creation of recovery.signal by
pg_basebackup even if -R is not used.  All the reports we've seen with
people getting confused came from pg_basebackup that enforces no
configuration.


This change makes it more obvious when configuration is missing (since 
you'll get an error), but either way, +1 for adding this to pg_basebackup.



A last thing, that had better be covered in a separate thread and
patch, is about validateRecoveryParameters().  These days, I'd like to
think that it may be OK to lift at least the restriction on
restore_command being required if we are doing recovery to ease the
case of self-contained backups (aka the case where all the WAL needed
to reach a consistent point is in pg_wal/ or its tarball)


Hmmm, I'm not sure about this. I'd prefer that users explicitly set 
restore_command=/bin/false if fetching WAL from pg_wal is what they 
really intend, rather than making it the default.


Regards,
-David




Rename backup_label to recovery_control

2023-10-14 Thread David Steele

Hackers,

I was recently discussing the complexities of dealing with pg_control 
and backup_label with some hackers at PGConf NYC, when David Christensen 
commented that backup_label was not a very good name since it gives the 
impression of being informational and therefore something the user can 
delete. In fact, we see this happen quite a lot, and there have been 
some other discussions about it recently, see [1] and [2]. I bounced the 
idea of a rename off various hackers at the conference and in general 
people seemed to think it was a good idea.


Attached is a patch to rename backup_label to recovery_control. The 
purpose is to make it more obvious that the file should not be deleted. 
I'm open to other names, e.g. recovery.control. That makes the naming 
distinct from tablespace_map, which is perhaps a good thing, but is also 
more likely to be confused with recovery.signal.


I did a pretty straight-forward search and replace on comments and 
documentation with only light editing. If this seems like a good idea 
and we choose a final name, I'll do a more thorough pass through the 
comments and docs to try and make the usage more consistent.


Note that there is one usage of backup label that remains, i.e. the text 
that the user can set to describe the backup.


Regards,
-David

[1] 
https://www.postgresql.org/message-id/1330cb48-4e47-03ca-f2fb-b144b49514d8%40pgmasters.net
[2] 
https://www.postgresql.org/message-id/flat/CAM_vCudkSjr7NsNKSdjwtfAm9dbzepY6beZ5DP177POKy8%3D2aw%40mail.gmail.com#746e492bfcd2667635634f1477a61288

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae54..958755f9eee 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -937,7 +937,7 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
 
  pg_backup_stop will return one row with three
  values. The second of these fields should be written to a file named
- backup_label in the root directory of the backup. The
+ recovery_control in the root directory of the 
backup. The
  third field should be written to a file named
  tablespace_map unless the field is empty. These 
files are
  vital to the backup working and must be written byte for byte without
@@ -1062,7 +1062,7 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);

 

-The backup label
+The recovery control
 file includes the label string you gave to 
pg_backup_start,
 as well as the time at which pg_backup_start was run, 
and
 the name of the starting WAL file.  In case of confusion it is therefore
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index affd1254bb7..b4f158606cd 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26763,12 +26763,12 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 
622360 free (88 chunks); 1029560
)
 record
 ( lsn pg_lsn,
-labelfile text,
+recoveryctlfile text,
 spcmapfile text )


 Finishes performing an on-line backup.  The desired contents of the
-backup label file and the tablespace map file are returned as part of
+recovery control file and the tablespace map file are returned as part 
of
 the result of the function and must be written to files in the
 backup area.  These files must not be written to the live data 
directory
 (doing so will cause PostgreSQL to fail to restart in the event of a
@@ -26803,7 +26803,7 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 
free (88 chunks); 1029560
 The result of the function is a single record.
 The lsn column holds the backup's ending
 write-ahead log location (which again can be ignored).  The second
-column returns the contents of the backup label file, and the third
+column returns the contents of the recovery control file, and the third
 column returns the contents of the tablespace map file.  These must be
 stored as part of the backup and are required as part of the restore
 process.
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 8ed39fb..4af18dae589 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -400,7 +400,7 @@ GRANT EXECUTE ON function 
pg_catalog.pg_read_binary_file(text, bigint, bigint, b
   pg_serial/, pg_snapshots/,
   pg_stat_tmp/, and pg_subtrans/
   are omitted from the data copied from the source cluster. The files
-  backup_label,
+  recovery_control,
   tablespace_map,
   pg_internal.init,
   postmaster.opts, and
@@ -410,7 +410,7 @@ GRANT EXECUTE ON function 
pg_catalog.pg_read_binary_file(text, bigint, bigint, b
 
 
  
-  Create a backup_label file to begin WAL replay at
+  Create a recovery_control file to begin WAL replay 
at
   the checkpoint created at failover and configure the
   pg_control file 

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-10-14 Thread David Steele

On 10/13/23 10:40, David Steele wrote:

On 10/12/23 19:15, Michael Paquier wrote:

On Thu, Oct 12, 2023 at 10:41:39AM -0400, David Steele wrote:

After some more thought, I think we could massage the "pg_control in
backup_label" method into something that could be back patched, with 
more
advanced features (e.g. error on backup_label and pg_control both 
present on

initial cluster start) saved for HEAD.


I doubt that anything changed in this area would be in the
backpatchable zone, particularly as it would involve protocol changes
within the replication commands, so I'd recommend to focus on HEAD.


I can't see why there would be any protocol changes, but perhaps I am 
missing something?


One thing that does have to change, however, is the ordering of 
backup_label in the base tar file. Right now it is at the beginning but 
we need it to be at the end like pg_control is now.


Well, no protocol changes, but overall this does not seem like a 
direction that would be even remotely back-patchable. See [1] for details.


For back branches that puts us back to committing some form of 0001 and 
0003. I'm still worried about the performance implications of 0001 on a 
standby when in backup mode, but I don't have any better ideas.


If we do commit 0001 and 0003 to the back branches I'd still like to 
hold off on HEAD to see if we can do something better there.


Regards,
-David

[1] 
https://www.postgresql.org/message-id/e05f20f9-6f91-9a70-efab-9a2ae472e65d%40pgmasters.net





Re: The danger of deleting backup_label

2023-10-14 Thread David Steele

On 10/12/23 10:19, David Steele wrote:

On 10/11/23 18:10, Thomas Munro wrote:


As Stephen mentioned[1], we could perhaps also complain if both backup
label and control file exist, and then hint that the user should
remove the *control file* (not the backup label!).  I had originally
suggested we would just overwrite the control file, but by explicitly
complaining about it we would also bring the matter to tool/script
authors' attention, ie that they shouldn't be backing that file up, or
should be removing it in a later step if they copy everything.  He
also mentions that there doesn't seem to be anything stopping us from
back-patching changes to the backup label contents if we go this way.
I don't have a strong opinion on that and we could leave the question
for later.


I'm worried about the possibility of back patching this unless the 
solution comes out to be simpler than I think, and that rarely comes to 
pass. Surely throwing errors on something that is currently valid (i.e. 
backup_label and pg_control both present) is not something we can back 
patch.


But perhaps there is a simpler, acceptable solution we could back patch 
(transparent to all parties except Postgres) and then a more advanced 
solution we could go forward with.


I guess I had better get busy on this.


Attached is a very POC attempt at something along these lines that could 
be back patched. I stopped when I realized excluding pg_control from the 
backup is a necessity to make this work (else we still could end up with 
a torn copy of pg_control even if there is a good copy elsewhere). To 
enumerate the back patch issues as I see them:


1) We still need a copy of pg_control in order to get Postgres to start 
and that copy might be torn (pretty much where we are now). We can 
handle this easily in pg_basebackup but most backup software will not be 
able to do so. In my opinion teaching Postgres to start without 
pg_control is too big a change to possibly back patch.


2) We need to deal with backups made with a prior *minor* version that 
did not include pg_control in the backup_label. Doable, but...


3) We need to move backup_label to the end of the main pg_basebackup 
tar, which could cause unforeseen breakage for tools.


4) This patch is less efficient for backups taken from standby because 
it will overwrite pg_control on restart and force replay back to the 
original MinRecoveryPoint.


5) We still need a solution for exclusive backup (still valid in PG <= 
14). Doable but it would still have the weakness of 1.


All of this is fixable in HEAD, but seems incredibly dangerous to back 
patch. Even so, I have attached the patch in case somebody sees an 
opportunity that I do not.


Regards,
-David

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c0e4ca50899..e4a4478af75 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -74,6 +74,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "port/atomics.h"
+#include "port/pg_crc32c.h"
 #include "port/pg_iovec.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
@@ -8691,6 +8692,21 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
                         "and should not be used. "
                         "Try taking another online backup.")));
 
+    /*
+     * Create a copy of control data to be stored in the backup label.
+     * Recalculate the CRC so recovery can validate the contents but do not
+     * bother with the timestamp since that will be applied before it is
+     * written by recovery.
+     */
+    LWLockAcquire(ControlFileLock, LW_SHARED);
+    state->controlFile = *ControlFile;
+    LWLockRelease(ControlFileLock);
+
+    INIT_CRC32C(state->controlFile.crc);
+    COMP_CRC32C(state->controlFile.crc, (char *) &state->controlFile,
+                offsetof(ControlFileData, crc));
+    FIN_CRC32C(state->controlFile.crc);
+
     /*
      * During recovery, we don't write an end-of-backup record. We assume that
      * pg_control was backed up last and its minimum recovery point can be
@@ -8741,11 +8757,8 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
                         "Enable full_page_writes and run CHECKPOINT on the primary, "
                         "and then try an online backup again.")));
 
-
-    LWLockAcquire(ControlFileLock, LW_SHARED);
-    state->stoppoint = ControlFile->minRecoveryPoint;
-    state->stoptli = ControlFile->minRecoveryPointTLI;
-    LWLockRelease(ControlFileLock);
+    state->stoppoint = state->controlFile.minRecoveryPoint;
+    state->stoptli = state->controlFile.minRecoveryPointTLI;

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-10-13 Thread David Steele

On 10/12/23 19:15, Michael Paquier wrote:

On Thu, Oct 12, 2023 at 10:41:39AM -0400, David Steele wrote:

After some more thought, I think we could massage the "pg_control in
backup_label" method into something that could be back patched, with more
advanced features (e.g. error on backup_label and pg_control both present on
initial cluster start) saved for HEAD.


I doubt that anything changed in this area would be in the
backpatchable zone, particularly as it would involve protocol changes
within the replication commands, so I'd recommend to focus on HEAD.


I can't see why there would be any protocol changes, but perhaps I am 
missing something?


One thing that does have to change, however, is the ordering of 
backup_label in the base tar file. Right now it is at the beginning but 
we need it to be at the end like pg_control is now.


I'm working up a POC patch now and hope to have something today or 
tomorrow. I think it makes sense to at least have a look at an 
alternative solution before going forward.


Regards,
-David





Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-10-12 Thread David Steele

On 10/12/23 09:58, David Steele wrote:

On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote:

I'm planning to push 0002 (retries in frontend programs, which is
where this thread began) and 0004 (add missing locks to SQL
functions), including back-patches as far as 12, in a day or so.

I'll abandon the others for now, since we're now thinking bigger[1]
for backups, side stepping the problem.


FWIW, 0003 looks like a low-risk improvement seen from here, so I'd be
OK to use it at least for now on HEAD before seeing where the other
discussions lead.  0004 would be OK if applied to v11, as well, but I
also agree that it is not a big deal to let this branch be as it is
now at this stage if you feel strongly this way.


Agreed on 0002 and 0004, though I don't really think a back patch of 
0004 to 11 is necessary. I'd feel differently if there was a single 
field report of this issue.


I would prefer to hold off on applying 0003 to HEAD until we see how [1] 
pans out.


Having said that, I have a hard time seeing [1] as being something we 
could back patch. The manipulation of backup_label is simple enough, but 
starting a cluster without pg_control is definitely going to change some 
things. Also, the requirement that backup software skip copying 
pg_control after a minor release is not OK.




After some more thought, I think we could massage the "pg_control in 
backup_label" method into something that could be back patched, with 
more advanced features (e.g. error on backup_label and pg_control both 
present on initial cluster start) saved for HEAD.


Regards,
-David




Re: The danger of deleting backup_label

2023-10-12 Thread David Steele




On 10/11/23 18:22, Michael Paquier wrote:

On Tue, Oct 10, 2023 at 05:06:45PM -0400, David Steele wrote:

That fails because there is a check to make sure the checkpoint is valid
when pg_control is loaded. Another possibility is to use a special LSN like
we use for unlogged tables. Anything >= 24 and < WAL segment size will work
fine.


Do we have any reason to do that in the presence of a backup_label
file anyway?  We'll know the LSN of the checkpoint based on what the
base backup wants us to use.  Using a fake-still-rather-valid value
for the LSN in the control file to bypass this check does not address
the issue you are pointing at: it is just avoiding this check.  A
reasonable answer would be, IMO, to just not do this check at all
based on the control file in this case.


Yeah, that's fair. And it looks like we are leaning towards excluding 
pg_control from the backup entirely, so the point is probably moot.



If the contents of the control file are tweaked before sending it
through a BASE_BACKUP, it would cover more than just pg_basebackup.
Switching the way the control file is sent with new contents in
sendFileWithContent() rather than sendFile() would be one way, for
instance..


Good point, and that makes this even more compelling. If we include
pg_control into backup_label then there is no need to modify pg_control (as
above) -- we can just exclude it from the backup entirely. That will
certainly require some rejigging in recovery but seems worth it for backup
solutions that can't easily modify pg_control. The C-based solutions can do
this pretty easily but it is a pretty high bar for anyone else.


I have little idea about that, but I guess that you are referring to
backrest here.


Sure, pgBackRest, but there are other backup solutions written in C. My 
point is really that we should not depend on backup solutions being able 
to manipulate C structs. It looks like the solution we are working 
towards would not require that.


Regards,
-David




Re: The danger of deleting backup_label

2023-10-12 Thread David Steele

Hi Thomas,

On 10/11/23 18:10, Thomas Munro wrote:


Even though I spent a whole bunch of time trying to figure out how to
make concurrent reads of the control file sufficiently atomic for
backups (pg_basebackup and low level filesystem tools), and we
explored multiple avenues with varying results, and finally came up
with something that basically works pretty well... actually I just
hate all of that stuff, and I'm hoping to be able to just withdraw
https://commitfest.postgresql.org/45/4025/ and chalk it all up to
discovery/education and call *this* thread the real outcome of that
preliminary work.

So I'm +1 on the idea of putting a control file image into the backup
label and I'm happy that you're looking into it.


Well, hopefully this thread will *at least* be the solution going 
forward. Not sure about a back patch yet, see below...



We could just leave the control file out of the base backup
completely, as you said, removing a whole foot-gun.  


That's the plan.


People following
the 'low level' instructions will still get a copy of the control file
from the filesystem, and I don't see any reliable way to poison that
file without also making it so that a crash wouldn't also be prevented
from recovering.  I have wondered about putting extra "fingerprint"
information into the control file such as the file's path and inode
number etc, so that you can try to distinguish between a control file
written by PostgreSQL, and a control file copied somewhere else, but
that all feels too fragile, and at the end of the day, people
following the low level backup instructions had better follow the low
level backup instructions (hopefully via the intermediary of an
excellent external backup tool).


Not sure about the inode idea, because it seems OK for people to move a 
cluster elsewhere under a variety of circumstances. I do have an idea 
about how to mark a cluster in "recovery to consistency" mode, but not 
quite sure how to atomically turn that off at the end of recovery to 
consistency. I have some ideas I'll work on though.



As Stephen mentioned[1], we could perhaps also complain if both backup
label and control file exist, and then hint that the user should
remove the *control file* (not the backup label!).  I had originally
suggested we would just overwrite the control file, but by explicitly
complaining about it we would also bring the matter to tool/script
authors' attention, ie that they shouldn't be backing that file up, or
should be removing it in a later step if they copy everything.  He
also mentions that there doesn't seem to be anything stopping us from
back-patching changes to the backup label contents if we go this way.
I don't have a strong opinion on that and we could leave the question
for later.


I'm worried about the possibility of back patching this unless the 
solution comes out to be simpler than I think, and that rarely comes to 
pass. Surely throwing errors on something that is currently valid (i.e. 
backup_label and pg_control both present) would not be acceptable in a 
back patch.


But perhaps there is a simpler, acceptable solution we could back patch 
(transparent to all parties except Postgres) and then a more advanced 
solution we could go forward with.


I guess I had better get busy on this.

Regards,
-David

[1] 
https://www.postgresql.org/message-id/ZL69NXjCNG%2BWHCqG%40tamriel.snowman.net





Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-10-12 Thread David Steele




On 10/11/23 21:10, Michael Paquier wrote:

On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote:

I'm planning to push 0002 (retries in frontend programs, which is
where this thread began) and 0004 (add missing locks to SQL
functions), including back-patches as far as 12, in a day or so.

I'll abandon the others for now, since we're now thinking bigger[1]
for backups, side stepping the problem.


FWIW, 0003 looks like a low-risk improvement seen from here, so I'd be
OK to use it at least for now on HEAD before seeing where the other
discussions lead.  0004 would be OK if applied to v11, as well, but I
also agree that it is not a big deal to let this branch be as it is
now at this stage if you feel strongly this way.


Agreed on 0002 and 0004, though I don't really think a back patch of 
0004 to 11 is necessary. I'd feel differently if there was a single 
field report of this issue.


I would prefer to hold off on applying 0003 to HEAD until we see how [1] 
pans out.


Having said that, I have a hard time seeing [1] as being something we 
could back patch. The manipulation of backup_label is simple enough, but 
starting a cluster without pg_control is definitely going to change some 
things. Also, the requirement that backup software skip copying 
pg_control after a minor release is not OK.


If we do back patch 0001, is 0003 really needed? Surely if 0001 works 
with other backup software it would work fine for pg_basebackup.


Regards,
-David

[1] 
https://www.postgresql.org/message-id/flat/1330cb48-4e47-03ca-f2fb-b144b49514d8%40pgmasters.net





Re: The danger of deleting backup_label

2023-10-10 Thread David Steele

On 9/28/23 22:30, Michael Paquier wrote:

On Thu, Sep 28, 2023 at 05:14:22PM -0400, David Steele wrote:


Recovery worked perfectly as long as backup_label was present and failed
hard when it was not:

LOG:  invalid primary checkpoint record
PANIC:  could not locate a valid checkpoint record

It's not a very good message, but at least the foot gun has been removed. We
could use this as a special value to give a better message, and maybe use
something a bit more unique like 0xFADEFADE (or whatever) as the
value.


Why not just InvalidXLogRecPtr?


That fails because there is a check to make sure the checkpoint is valid 
when pg_control is loaded. Another possibility is to use a special LSN 
like we use for unlogged tables. Anything >= 24 and < WAL segment size 
will work fine.
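
To make that concrete, here is a minimal sketch of what I mean 
(hypothetical names and value on my part, not taken from any patch in 
this thread):

    /*
     * Point the checkpoint field in the pg_control copy written to the
     * backup at a value that passes the basic sanity check (>= 24 and
     * below the WAL segment size) but can never resolve to a real
     * checkpoint record.
     */
    #define FAKE_CHECKPOINT_LSN ((XLogRecPtr) 0xDEAD)

    controlFileCopy.checkPoint = FAKE_CHECKPOINT_LSN;

Starting from such a backup without backup_label would then fail hard 
with the "could not locate a valid checkpoint record" PANIC shown above, 
while a recovery driven by backup_label takes its checkpoint location 
from backup_label instead.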



This is all easy enough for pg_basebackup to do, but will certainly be
non-trivial for most backup software to implement. In [2] we have discussed
perhaps returning pg_control from pg_backup_stop() for the backup software
to save, or it could become part of the backup_label (encoded as hex or
base64, presumably). I prefer the latter as this means less work for the
backup software (except for the need to exclude pg_control from the backup).

I don't have a patch for this yet because I did not test this idea using
pg_basebackup, but I'll be happy to work up a patch if there is interest.


If the contents of the control file are tweaked before sending it
through a BASE_BACKUP, it would cover more than just pg_basebackup.
Switching the way the control file is sent with new contents in
sendFileWithContent() rather than sendFile() would be one way, for
instance..


Good point, and that makes this even more compelling. If we include 
pg_control into backup_label then there is no need to modify pg_control 
(as above) -- we can just exclude it from the backup entirely. That will 
certainly require some rejigging in recovery but seems worth it for 
backup solutions that can't easily modify pg_control. The C-based 
solutions can do this pretty easily but it is a pretty high bar for 
anyone else.
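
To make that a bit more concrete, the rough shape I have in mind (the 
field name and encoding here are purely illustrative, nothing is 
settled) is one extra entry at the end of backup_label, e.g.:

CONTROL DATA: base64:<pg_control image captured at pg_backup_stop()>

Recovery would decode that entry and write it out as global/pg_control 
before replay begins, which is what lets us exclude pg_control from the 
backup entirely.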


Regards,
-David




Re: how to manage Cirrus on personal repository

2023-09-29 Thread David Steele




On 9/29/23 18:02, Thomas Munro wrote:

On Sat, Sep 30, 2023 at 3:35 AM Nazir Bilal Yavuz  wrote:

On Fri, 29 Sept 2023 at 13:24, Peter Eisentraut  wrote:

I have a private repository on GitHub where I park PostgreSQL
development work.  I also have Cirrus activated on that repository, to
build those branches, using the existing Cirrus configuration.

Now under the new system of limited credits that started in September,
this blew through the credits about half-way through the month.


[Replying to wrong person because I never saw Peter's original email,
something must be jammed in the intertubes...]

It's annoying, but on the bright side... if you're making it halfway
through the month, I think that means there's a chance you'd have
enough credit if we can depessimise the known problems with TAP query
execution[1], and there are some more obviously stupid things too
(sleep/poll for replication progress where PostgreSQL should offer an
event-based wait-for-replay, running all the tests when only docs
changed, running the regression tests fewer times, ...).


I also have Cirrus set up for my cloned Postgres repo, and it is annoying 
that it runs CI whenever I push up new commits that I pulled from 
git.p.o.


My strategy for this (on other projects) is to require branches that 
need CI to end in -ci. Since I use multiple CI services, I further allow 
-cic (Cirrus), -cig (GitHub Actions), etc. PRs are also included. That 
way, nothing runs through CI unless I want it to.


Here's an example of how this works on Cirrus:

# Build the branch if it is a pull request, or ends in -ci/-cic (-cic targets only Cirrus CI)
only_if: $CIRRUS_PR != '' || $CIRRUS_BRANCH =~ '.*-ci$' || $CIRRUS_BRANCH =~ '.*-cic$'
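
The rough equivalent for GitHub Actions (untested sketch, just to show 
the same branch-name filter) would be something like:

on:
  pull_request:
  push:
    branches:
      - '**-ci'
      - '**-cig'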


Regards,
-David



