Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
Hello Robert, 17.10.2023 17:39, Robert Haas wrote: On Tue, Oct 17, 2023 at 4:57 AM Aleksander Alekseev wrote: v11-0001 and v11-0002 LGTM too. Cool. Seems we are all in agreement, so committed these. Thanks! Please look at the following sentence added by the commit: ... to issue manual VACUUM commands on the tables where relminxid is oldest. Isn't relminxid a typo there? Best regards, Alexander
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Tue, Oct 17, 2023 at 9:39 PM Robert Haas wrote: > > Cool. Seems we are all in agreement, so committed these. Thanks! Thank you for getting this across the finish line!
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Tue, Oct 17, 2023 at 4:57 AM Aleksander Alekseev wrote: > v11-0001 and v11-0002 LGTM too. Cool. Seems we are all in agreement, so committed these. Thanks! -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
Hi, > > LGTM, except for one small detail: I noticed that you misspelled > > "translations" in the commit message. > > Oops. Fixed locally. v11-0001 and v11-0002 LGTM too. IMO "to assign a XID" sounds better than "to generate a XID". -- Best regards, Aleksander Alekseev
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Mon, Oct 16, 2023 at 3:46 PM Peter Geoghegan wrote: > On Mon, Oct 16, 2023 at 11:06 AM Robert Haas wrote: > > I propose to commit these changes only to master. I have included a > > fairly long paragraph about that in the commit message for 0002. > > LGTM, except for one small detail: I noticed that you misspelled > "translations" in the commit message. Oops. Fixed locally. > Thanks for getting this over the line Sure thing. I'm glad we're finally doing something about it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Mon, Oct 16, 2023 at 11:06 AM Robert Haas wrote: > I propose to commit these changes only to master. I have included a > fairly long paragraph about that in the commit message for 0002. LGTM, except for one small detail: I noticed that you misspelled "translations" in the commit message. Thanks for getting this over the line -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Fri, Oct 13, 2023 at 5:03 AM Aleksander Alekseev wrote: > > Thanks for working on this. I will be relieved once this is finally > > taken care of. > > +1, and many thanks for your attention to the patchset and all the details! Cool. I committed that and back-patched to v14, and here's the rest. 0001 makes the terminology change that I proposed earlier, and then 0002 is the remainder of what was in the previous patch set that wasn't covered by what I committed already, with a few adjustments. In particular, I preferred to stick with "avoid" rather than changing to "prevent," and I thought it was clearer to refer to "failures" plural rather than "failure" collective. These are arguable decisions, though. I propose to commit these changes only to master. I have included a fairly long paragraph about that in the commit message for 0002. -- Robert Haas EDB: http://www.enterprisedb.com v11-0002-Reword-messages-about-impending-M-XID-exhaustion.patch Description: Binary data v11-0001-Talk-about-assigning-rather-than-generating-new-.patch Description: Binary data
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
Hi, > Those all make sense to me. > > > [...] > > Of course. Your general approach seems wise. > > Thanks for working on this. I will be relieved once this is finally > taken care of. +1, and many thanks for your attention to the patchset and all the details! -- Best regards, Aleksander Alekseev
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Thu, Oct 12, 2023 at 1:10 PM Robert Haas wrote: > On Thu, Oct 12, 2023 at 12:01 PM Peter Geoghegan wrote: > > No objections from me. > > Here is a doc-only patch that I think could be back-patched as far as > emergency mode exists. It combines all of the wording changes to the > documentation from v1-v3 of the previous version, but without changing > the message text that is quoted in the documentation, and without > adding more instances of similar message texts to the documentation, > and with a bunch of additional hacking by me. It's a bit weird that we're effectively saying "pay no attention to that terrible HINT"...but I get it. The simple fact is that the docs were written in a way that allowed misinformation to catch on -- the damage that needs to be undone isn't exactly limited to the docs themselves. Your choice to not backpatch the changes to the log messages makes a lot more sense, now that I see that I see the wider context built by this preparatory patch. Arguably, it would be counterproductive to pretend that we didn't make this mistake on the backbranches. Better to own the mistake. > Some things I changed: > > - I made it so that the MXID section refers back to the XID section > instead of duplicating it, but with a short list of differences. > - I weakened the existing claim that says you must be a superuser or > VACUUM definitely won't fix it to say instead that you SHOULD run > VACUUM as the superuser, because the former is false and the latter is > true. > - I made the list of steps for recovering more explicit. > - I split out the bit about running autovacuum in the affected > database into a separate step to be performed after VACUUM for > continued good operation, rather than a necessary ingredient in > recovery, because it isn't. > - A bit of other minor rejiggering. Those all make sense to me. > I'm not forgetting about the rest of the proposed patch set, or the > change I proposed earlier. I'm just posting this much now because this > is how far I got today, and it would be useful to get comments before > I go further. I think the residual portion of the patch set not > included in this documentation patch will be quite small, and I think > that's a good thing, but again, I don't intend to blow that off. Of course. Your general approach seems wise. Thanks for working on this. I will be relieved once this is finally taken care of. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Thu, Oct 12, 2023 at 12:01 PM Peter Geoghegan wrote: > No objections from me. Here is a doc-only patch that I think could be back-patched as far as emergency mode exists. It combines all of the wording changes to the documentation from v1-v3 of the previous version, but without changing the message text that is quoted in the documentation, and without adding more instances of similar message texts to the documentation, and with a bunch of additional hacking by me. Some things I changed: - I made it so that the MXID section refers back to the XID section instead of duplicating it, but with a short list of differences. - I weakened the existing claim that says you must be a superuser or VACUUM definitely won't fix it to say instead that you SHOULD run VACUUM as the superuser, because the former is false and the latter is true. - I made the list of steps for recovering more explicit. - I split out the bit about running autovacuum in the affected database into a separate step to be performed after VACUUM for continued good operation, rather than a necessary ingredient in recovery, because it isn't. - A bit of other minor rejiggering. I'm not forgetting about the rest of the proposed patch set, or the change I proposed earlier. I'm just posting this much now because this is how far I got today, and it would be useful to get comments before I go further. I think the residual portion of the patch set not included in this documentation patch will be quite small, and I think that's a good thing, but again, I don't intend to blow that off. -- Robert Haas EDB: http://www.enterprisedb.com v10-0001-Update-the-documentation-on-recovering-from-M-XI.patch Description: Binary data
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Thu, Oct 12, 2023 at 8:54 AM Robert Haas wrote: > - I find the use of the word "generate" in error messages slightly > odd. I think it's reasonable given the existing precedent, but the > word I would have picked is "assign", which I see is what Aleksander > actually had in v1. How would people feel about changing the two > existing messages that say "database is not accepting commands that > generate new MultiXactIds to avoid wraparound data loss ..." to use > "assign" instead, and then make the new messages match that? WFM. > - I think that 0002 needs a bit of wordsmithing. I will work on that. > In particular, I don't like this sentence: "It increases downtime, > makes monitoring impossible, disables replication, bypasses safeguards > against wraparound, etc." While there's nothing untrue there, it feels > more like a sentence from a pgsql-hackers email where most people > participating in the discussion understand the general contours of the > problem already than like polished documentation that really lays > things out methodically. I agree. > - I'm somewhat inclined to have a go at restructuring these patches a > bit so that some of the documentation changes can potentially be > back-patched without back-patching the message changes. Even if we > eventually decide to back-patch everything or nothing, there are > wording adjustments spread across all 3 patches that seem somewhat > independent of the changes to the server messages. I think it would be > clearer to have one patch that is mostly about documentation wording > changes, and a second one that is about changing the server messages > and then making documentation changes that are directly dependent on > those message changes. And I might also be inclined to back-patch the > former patch as far as it makes sense to do so, while leaving the > latter one master-only. No objections from me. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Wed, Oct 4, 2023 at 8:07 AM Peter Geoghegan wrote: > If you're willing to take over as committer here, I'll let the issue > of backpatching go. > > I only ask that you note why you've not backpatched in the commit message. Will do, but see also the last point below. I have looked over these patches in some detail and here are my thoughts: - I find the use of the word "generate" in error messages slightly odd. I think it's reasonable given the existing precedent, but the word I would have picked is "assign", which I see is what Aleksander actually had in v1. How would people feel about changing the two existing messages that say "database is not accepting commands that generate new MultiXactIds to avoid wraparound data loss ..." to use "assign" instead, and then make the new messages match that? - I think that 0002 needs a bit of wordsmithing. I will work on that. In particular, I don't like this sentence: "It increases downtime, makes monitoring impossible, disables replication, bypasses safeguards against wraparound, etc." While there's nothing untrue there, it feels more like a sentence from a pgsql-hackers email where most people participating in the discussion understand the general contours of the problem already than like polished documentation that really lays things out methodically. - I'm somewhat inclined to have a go at restructuring these patches a bit so that some of the documentation changes can potentially be back-patched without back-patching the message changes. Even if we eventually decide to back-patch everything or nothing, there are wording adjustments spread across all 3 patches that seem somewhat independent of the changes to the server messages. I think it would be clearer to have one patch that is mostly about documentation wording changes, and a second one that is about changing the server messages and then making documentation changes that are directly dependent on those message changes. And I might also be inclined to back-patch the former patch as far as it makes sense to do so, while leaving the latter one master-only. Comments? -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Mon, Oct 2, 2023 at 1:25 PM Robert Haas wrote: > I'm also pretty happy with these patches and would like to see at > least 0001 and 0002 committed, and probably 0003 as well. I am, > however, -1 on back-patching. Perhaps that is overly cautious, but I > don't like changing existing messages in back-branches. It will break > translations, and potentially monitoring scripts, etc. > > If John's not available to take this forward, I can volunteer as > substitute committer, unless Peter or Peter would like to handle it. If you're willing to take over as committer here, I'll let the issue of backpatching go. I only ask that you note why you've not backpatched in the commit message. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Mon, Oct 2, 2023 at 11:52 AM Pavel Borisov wrote: > FWIW I think the patch is still in good shape and worth to be committed. I'm also pretty happy with these patches and would like to see at least 0001 and 0002 committed, and probably 0003 as well. I am, however, -1 on back-patching. Perhaps that is overly cautious, but I don't like changing existing messages in back-branches. It will break translations, and potentially monitoring scripts, etc. If John's not available to take this forward, I can volunteer as substitute committer, unless Peter or Peter would like to handle it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
Hi! On Mon, 2 Oct 2023 at 03:34, Peter Geoghegan wrote: > > On Sun, Oct 1, 2023 at 11:46 AM Peter Eisentraut wrote: > > What is the status of this patch discussion now? It had been set as > > Ready for Committer for some months. Do these recent discussions > > invalidate that? Does it need more discussion? > > I don't think that recent discussion invalidated anything. I meant to > follow-up on investigating the extent to which anything could hold up > OldestMXact without also holding up OldestXmin/removable cutoff, but > that doesn't seem essential. > > This patch does indeed seem "ready for committer". John? > > -- > Peter Geoghegan FWIW I think the patch is still in good shape and worth to be committed. Regards, Pavel Borisov Supabase
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Sun, Oct 1, 2023 at 11:46 AM Peter Eisentraut wrote: > What is the status of this patch discussion now? It had been set as > Ready for Committer for some months. Do these recent discussions > invalidate that? Does it need more discussion? I don't think that recent discussion invalidated anything. I meant to follow-up on investigating the extent to which anything could hold up OldestMXact without also holding up OldestXmin/removable cutoff, but that doesn't seem essential. This patch does indeed seem "ready for committer". John? -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On 20.09.23 05:41, Peter Geoghegan wrote: On Sun, May 14, 2023 at 6:06 PM Peter Geoghegan wrote: BTW, Google cloud already just instruct their users to ignore the xidStopLimit HINT about single user mode: https://cloud.google.com/sql/docs/postgres/txid-wraparound I read this just today, and was reminded of this thread: https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-under-the-hood-adaptive-autovacuum It reads: "1. Transaction ID wraparound: PostgreSQL transaction IDs or XIDs are 32-bit unsigned integers that are assigned to each transaction and also get incremented. When they reach their maximum value, it would wrap around to zero (similar to a ring buffer) and can lead to data corruption." What is the status of this patch discussion now? It had been set as Ready for Committer for some months. Do these recent discussions invalidate that? Does it need more discussion?
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Sun, May 14, 2023 at 6:06 PM Peter Geoghegan wrote: > BTW, Google cloud already just instruct their users to ignore the > xidStopLimit HINT about single user mode: > > https://cloud.google.com/sql/docs/postgres/txid-wraparound I read this just today, and was reminded of this thread: https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-under-the-hood-adaptive-autovacuum It reads: "1. Transaction ID wraparound: PostgreSQL transaction IDs or XIDs are 32-bit unsigned integers that are assigned to each transaction and also get incremented. When they reach their maximum value, it would wrap around to zero (similar to a ring buffer) and can lead to data corruption." -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Fri, May 12, 2023 at 9:14 PM John Naylor wrote: > Attached is v9, which is mostly editing the steps for restoring normal > operation, which are in 0003 now but will be squashed into 0002. Comments to > polish the wording welcome. I'll try to get you more feedback on this soon. BTW, Google cloud already just instruct their users to ignore the xidStopLimit HINT about single user mode: https://cloud.google.com/sql/docs/postgres/txid-wraparound I checked with archive.org. This directive to just ignore the HINT appeared for the first time no later than December 2021. Fixing this in Postgres is long overdue. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
Attached is v9, which is mostly editing the steps for restoring normal operation, which are in 0003 now but will be squashed into 0002. Comments to polish the wording welcome. -- John Naylor EDB: http://www.enterprisedb.com From 0e9d6ea72216b196d37de4629736c0cf34e7cd5c Mon Sep 17 00:00:00 2001 From: John Naylor Date: Sat, 13 May 2023 11:03:40 +0700 Subject: [PATCH v9 3/3] Rough draft of complete steps to recover from (M)XID generation failure TODO: squash with previous --- doc/src/sgml/maintenance.sgml | 61 ++- 1 file changed, 60 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml index 45d6cd1815..fee88cb647 100644 --- a/doc/src/sgml/maintenance.sgml +++ b/doc/src/sgml/maintenance.sgml @@ -675,7 +675,25 @@ HINT: Execute a database-wide VACUUM in that database. In this condition any transactions already in progress can continue, but only read-only transactions can be started. Operations that modify database records or truncate relations will fail. -The VACUUM command can still be run normally to recover. +The VACUUM command can still be run normally. +To restore normal operation: + + + + Commit or rollback any prepared transactions. + + + Terminate any sessions that might have open transactions. + + + Drop any old replication slots. + + + Ensure autovacuum is running, and execute a database-wide VACUUM. +To reduce the time required, it as also possible to issue manual VACUUM +commands on the tables where relminxid is oldest. + + @@ -761,6 +779,47 @@ HINT: Execute a database-wide VACUUM in that database. have the oldest multixact-age. Both of these kinds of aggressive scans will occur even if autovacuum is nominally disabled. + + +Similar to the XID case, if autovacuum fails to clear old MXIDs from a table, the +system will begin to emit warning messages like this when the database's +oldest MXIDs reach forty million transactions from the wraparound point: + + +WARNING: database "mydb" must be vacuumed within 39985967 transactions +HINT: To prevent MultiXactId generation failure, execute a database-wide VACUUM in that database. + + +(A manual VACUUM should fix the problem, as suggested by the +hint; but note that the VACUUM must be performed by a +superuser, else it will fail to process system catalogs and thus not +be able to advance the database's datfrozenxid.) +If these warnings are ignored, the system will refuse to generate new +MXIDs once there are fewer than three million left until wraparound: + + +ERROR: database is not accepting commands that generate new MultiXactIds to avoid wraparound data loss in database "mydb" +HINT: Execute a database-wide VACUUM in that database. + + + + +To restore normal operation: + + + Commit or rollback each prepared transaction that might appear in a multixact. + + + Resolve each transaction that might appear in a multixact. + + + Ensure autovacuum is running, and execute a database-wide VACUUM. +To reduce the time required, it as also possible to issue manual VACUUM +commands on the tables where relminmxid is oldest. + + + + -- 2.39.2 From 267609882a0be2764bc33fc289c0a962d47643c4 Mon Sep 17 00:00:00 2001 From: John Naylor Date: Fri, 28 Apr 2023 16:08:33 +0700 Subject: [PATCH v9 1/3] Correct outdated docs and messages regarding XID/MXID limits Previously, when approaching xidStopLimit or xidWrapLimit, log messages would warn against a "database shutdown", and when it reached those limits claimed that it "is not accepting commands". This language originated in commit 60b2444cc when the xidStopLimit was added in 2005. At that time, even a trivial SELECT would have failed. Commit 295e63983d in 2007 introduced virtual transaction IDs, which allowed actual XIDs to be allocated lazily when it is necessary to do so, such as when modifying database records. Since then, the behavior at these limits is merely to refuse to allocate new XIDs, so read-only queries can continue to be initiated. The "database shutdown" message was also copied to the message warning for multiWarnLimit when it was added. This has been wrong for a very long time, so backpatch to all supported branches. Aleksander Alekseev, with some editing by me Reviewed by Pavel Borisov and Peter Geoghegan Discussion: https://postgr.es/m/caj7c6tm2d277u2wh8x78kg8ph3tduqebv3_jcjqakyqfhcf...@mail.gmail.com --- doc/src/sgml/maintenance.sgml | 22 -- src/backend/access/transam/multixact.c | 4 ++-- src/backend/access/transam/varsup.c| 12 ++-- 3 files changed, 20 insertions(+), 18 deletions(-) diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml index 9cf9d030a8..48d43cb339
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Thu, May 4, 2023 at 12:59 AM Peter Geoghegan wrote: > I'd quite like to drop this topic, and get on with the work at hand. I'd be grateful, and the other points you made are, of course, valid. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Wed, May 3, 2023 at 12:30 AM John Naylor wrote: > I went to go find the phrase that I thought I was reacted to, and ... > nothing. I am also baffled. My comment was inexcusable. I'd quite like to drop this topic, and get on with the work at hand. But before I do that, I ask you to consider one thing: if you were mistaken about my words (or their intent) on this occasion, isn't it also possible that it wasn't the first time? I never had the opportunity to sit down to talk with you face to face before now. If things had been different (if we managed to talk at one of the PGCons before COVID, say), then maybe this incident would have happened in just the same way. I can't help but think that some face time would have prevented the whole episode, though. You have every right to dislike me on a personal level, of course, but if you do then I'd very much prefer that it be due to one of my actual flaws. I'm not a petty man -- I don't resent the success of others. I've always thought that you do rather good work. Plus I'm just not in the habit of obstructing things that I directly benefit from. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Wed, May 3, 2023 at 10:04 AM Peter Geoghegan wrote: > > That's not the only reason, though. I also genuinely don't have the > foggiest notion what was behind what you said. In particular, this > part still makes zero sense to me: > > "Claim that others are holding you back, and then try to move the > goalposts in their work" I went to go find the phrase that I thought I was reacted to, and ... nothing. I am also baffled. My comment was inexcusable. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Tue, May 2, 2023 at 6:46 PM John Naylor wrote: > It might least be good for readability to gloss over the warning and only > quote the MXID limit error message, but we'll have to see how it looks. Apparently you expect me to join you in pretending that you didn't lambast my review on this thread less than 24 hours ago [1]. I happen to believe that this particular patch is of great strategic importance, so I'll admit that I thought about it for a second. But just a second -- I have more self-respect than that. That's not the only reason, though. I also genuinely don't have the foggiest notion what was behind what you said. In particular, this part still makes zero sense to me: "Claim that others are holding you back, and then try to move the goalposts in their work" Let me get this straight: "Moving the goalposts of their work" refers to something *I* did to *you*, on *this* thread...right? If anything, I'm biased in *favor* of this patch. The fact that you seem to think that I was being obstructionist just doesn't make any sense to me, at all. I really don't know where to go from there. I'm not so much upset as baffled. [1] https://postgr.es/m/CAFBsxsGJMp43QO2cLAh0==ueYVL35pbbEHeXZ0cnZkU=q8s...@mail.gmail.com -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Tue, May 2, 2023 at 9:55 AM Peter Geoghegan wrote: > > On Mon, May 1, 2023 at 5:34 AM John Naylor wrote: > > 0003 is now a quick-and-dirty attempt at that, only in the docs. The MXID part is mostly copy-pasted from the XID part, just to get something together. I'd like to abbreviate that somehow. > > Yeah, the need to abbreviate statements about MultiXact IDs by saying > that they work analogously to XIDs in some particular respect > is...another thing that makes this tricky. Then it sounds like they should stay separate. A direct copy-paste is not good for style, so I will add things like: - If for some reason autovacuum fails to clear old MXIDs from a table, the + As in the case with XIDs, it is possible for autovacuum to fail to [...] It might least be good for readability to gloss over the warning and only quote the MXID limit error message, but we'll have to see how it looks. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Mon, May 1, 2023 at 7:55 PM Peter Geoghegan wrote: > Obviously there are certain things that can hold back OldestMXact by a > wildly excessive amount. But I don't think that there is anything that > can hold back OldestMXact by a wildly excessive amount that won't more > or less do the same thing to OldestXmin. Actually, it's probably possible for a transaction that only has a virtual transaction ID to call MultiXactIdSetOldestVisible(), which will then have the effect of holding back OldestMXact without also holding back OldestXmin (in READ COMMITTED mode). Will have to check to make sure, but that won't happen today. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Mon, May 1, 2023 at 5:34 AM John Naylor wrote: > Well, since you have a placeholder "xidStopLimit mode" in your other patch, > I'll confine my response to there. Inventing "modes" seems like an awkward > thing to backpatch, not to mention moving the goalposts. My modest goal here > is quite limited: to stop lying to our users about "not accepting commands", > and change tragically awful advice into sensible advice. I can't argue with that. > Here's my new idea: > > -HINT: To avoid a database shutdown, [...] > +HINT: To prevent XID generation failure, [...] > > Actually, I like "allocation" better, but the v8 patch now has "generation" > simply because one MXID message already has "generate" and I did it that way > before thinking too hard. I'd be okay with either one as long as it's > consistent. WFM. > Granted. Whatever form your rewrite ends up in, it could make a lot of sense > to then backpatch a few localized corrections. I wouldn't even object to > including a few substitutions of s/wraparound failure/allocation failure/ > where appropriate. Let's see how that shakes out first. Makes sense. > > > I think the docs would do well to have ordered steps for recovering from > > > both XID and MXID exhaustion. > > > > I had planned to address this with my ongoing work on the "Routine > > Vacuuming" docs, but I think that you're right about the necessity of > > addressing it as part of this patch. > > 0003 is now a quick-and-dirty attempt at that, only in the docs. The MXID > part is mostly copy-pasted from the XID part, just to get something together. > I'd like to abbreviate that somehow. Yeah, the need to abbreviate statements about MultiXact IDs by saying that they work analogously to XIDs in some particular respect is...another thing that makes this tricky. I don't think that Multis are fundamentally different to XIDs. I believe that the process through which VACUUM establishes its OldestMXact cutoff can be pessimistic compared to OldestXmin, but I don't think that it changes the guidance you'll need to give here. VACUUM should always be able to advance relminmxid right up until OldestMXact, if that's what the user insists on. For example, VACUUM FREEZE sometimes allocates new Multis, just to be able to do that. Obviously there are certain things that can hold back OldestMXact by a wildly excessive amount. But I don't think that there is anything that can hold back OldestMXact by a wildly excessive amount that won't more or less do the same thing to OldestXmin. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Mon, May 1, 2023 at 2:30 AM Peter Geoghegan wrote: > > On Sat, Apr 29, 2023 at 7:30 PM John Naylor > wrote: > > How about > > > > -HINT: To avoid a database shutdown, [...] > > +HINT: To prevent XID exhaustion, [...] > > > > ...and "MXID", respectively? We could explain in the docs that vacuum and read-only queries still work "when XIDs have been exhausted", etc. > > I think that that particular wording works in this example -- we *are* > avoiding XID exhaustion. But it still doesn't really address my > concern -- at least not on its own. I think that we need a term for > xidStopLimit mode (and perhaps multiStopLimit) itself. This is a > discrete state/mode that is associated with a specific mechanism. Well, since you have a placeholder "xidStopLimit mode" in your other patch, I'll confine my response to there. Inventing "modes" seems like an awkward thing to backpatch, not to mention moving the goalposts. My modest goal here is quite limited: to stop lying to our users about "not accepting commands", and change tragically awful advice into sensible advice. Here's my new idea: -HINT: To avoid a database shutdown, [...] +HINT: To prevent XID generation failure, [...] Actually, I like "allocation" better, but the v8 patch now has "generation" simply because one MXID message already has "generate" and I did it that way before thinking too hard. I'd be okay with either one as long as it's consistent. > > (I should probably also add in the commit message that the "shutdown" in the message was carried over to MXIDs when they arrived also in 2005). Done > > > Separately, there is a need to update a couple of other places to use > > > this new terminology. The documentation for vacuum_sailsafe_age and > > > vacuum_multixact_failsafe_age refer to "system-wide transaction ID > > > wraparound failure", which seems less than ideal, even without your > > > patch. > > > > Right, I'll have a look. Looking now, I'm even less inclined to invent new terminology in back branches. > As you know, there is a more general problem with the use of the term > "wraparound" in the docs, and in the system itself (in places like > pg_stat_activity). Even the very basic terminology in this area is > needlessly scary. Terms like "VACUUM (to prevent wraparound)" are > uncomfortably close to "a race against time to avoid data corruption". > The system isn't ever supposed to corrupt data, even if misconfigured > (unless the misconfiguration is very low-level, such as "fsync=off"). > Users should be able to take that much for granted. Granted. Whatever form your rewrite ends up in, it could make a lot of sense to then backpatch a few localized corrections. I wouldn't even object to including a few substitutions of s/wraparound failure/allocation failure/ where appropriate. Let's see how that shakes out first. > > I think the docs would do well to have ordered steps for recovering from both XID and MXID exhaustion. > > I had planned to address this with my ongoing work on the "Routine > Vacuuming" docs, but I think that you're right about the necessity of > addressing it as part of this patch. 0003 is now a quick-and-dirty attempt at that, only in the docs. The MXID part is mostly copy-pasted from the XID part, just to get something together. I'd like to abbreviate that somehow. -- John Naylor EDB: http://www.enterprisedb.com From 469d0e7123a16386e300a85a6bb08109e283b65c Mon Sep 17 00:00:00 2001 From: John Naylor Date: Sat, 29 Apr 2023 14:23:50 +0700 Subject: [PATCH v8 2/3] Stop telling users to run VACUUM in a single-user mode Single-user mode is almost always the worst thing to reach for in a VACUUM emergency: * Restarting in single user mode requires a shutdown checkpoint * The user interface is completely different, and awful * The buffer cache is completely cold * The checkpointer, background writer and WAL writer are not running * Without checkpoints WAL segments can not be rotated and reused * Replication is not running, so after VACUUM is done and database is started in normal mode, there is a huge backlog to replicate * pg_stat_progress_vacuum is not available so there is no indication of when the command will complete * VACUUM VERBOSE doesn't work - there is no output from single-user mode vacuum, with or without VERBOSE If that weren't enough, it's also unsafe because the wraparound limits are not enforced. It is by no means impossible to corrupt the database by mistake, such as by a user running VACUUM FULL because it sounds better. As mentioned in commit X, the system is perfectly capable of accepting commands when reaching xidStopLimit. Most VACUUM operations will work normally, with one exception: A new XID is required when truncating the relation if wal_level is above "minimal". As of v14 the failsafe mechanism disables truncation some time before reaching xidStopLimit, so this is not an issue in practice. By remaining in multi-user mode, users still have read-only access to their
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Sat, Apr 29, 2023 at 7:30 PM John Naylor wrote: > How about > > -HINT: To avoid a database shutdown, [...] > +HINT: To prevent XID exhaustion, [...] > > ...and "MXID", respectively? We could explain in the docs that vacuum and > read-only queries still work "when XIDs have been exhausted", etc. I think that that particular wording works in this example -- we *are* avoiding XID exhaustion. But it still doesn't really address my concern -- at least not on its own. I think that we need a term for xidStopLimit mode (and perhaps multiStopLimit) itself. This is a discrete state/mode that is associated with a specific mechanism. I'd like to emphasize the purpose of xidStopLimit (over when xidStopLimit happens) in choosing this user-facing name. As you know, the point of xidStopLimit mode is to give autovacuum the chance to catch up with managing the XID space through freezing: the available supply of XIDs doesn't meet present demand, and hasn't for some time, so it finally came to this. Even if we had true 64-bit XIDs we'd probably still need something similar -- there would still have to be *some* point that allowing the "freezing deficit" to continue to grow just wasn't tenable. If a person consistently spends more than they take in, their "initial bankroll" isn't necessarily relevant. If our ~2.1 billion XID "bankroll" wasn't enough to avoid xidStopLimit, why would we expect 8 billion or 20 billion XIDs to have been enough? I'm thinking of a user-facing name for xidStopLimit along the lines of "emergency XID allocation restoration mode" (admittedly that's quite a mouthful). Something that carries the implication of "imbalance". The system was configured in a way that turned out to be unsustainable. The system was therefore forced to "restore sustainability" using the only tool that remained. This is closely related to the failsafe. As bad as xidStopLimit is, it won't always be the end of the world -- much depends on individual application requirements. > (I should probably also add in the commit message that the "shutdown" in the > message was carried over to MXIDs when they arrived also in 2005). > > > Separately, there is a need to update a couple of other places to use > > this new terminology. The documentation for vacuum_sailsafe_age and > > vacuum_multixact_failsafe_age refer to "system-wide transaction ID > > wraparound failure", which seems less than ideal, even without your > > patch. > > Right, I'll have a look. As you know, there is a more general problem with the use of the term "wraparound" in the docs, and in the system itself (in places like pg_stat_activity). Even the very basic terminology in this area is needlessly scary. Terms like "VACUUM (to prevent wraparound)" are uncomfortably close to "a race against time to avoid data corruption". The system isn't ever supposed to corrupt data, even if misconfigured (unless the misconfiguration is very low-level, such as "fsync=off"). Users should be able to take that much for granted. I don't expect either of us to address that problem in the short term -- the term "wraparound" is too baked-in for it to be okay to just remove it overnight. But, it could still make sense for your patch (or my own) to fully own the fact that "wraparound" is actually a misnomer. At least when used in contexts like "to prevent wraparound" (xidStopLimit actually "prevents wraparound", though we shouldn't say anything about it in a place of prominence). Let's reassure users that they should continue to take "we won't corrupt your data for no good reason" for granted. > I think the docs would do well to have ordered steps for recovering from both > XID and MXID exhaustion. I had planned to address this with my ongoing work on the "Routine Vacuuming" docs, but I think that you're right about the necessity of addressing it as part of this patch. These extra steps will be required whenever the problem is a leaked prepared transaction, or something along those lines. That is increasingly likely to turn out to be the underlying cause of entering xidStopLimit, given the work we've done on VACUUM over the years. I still think that "imbalance" is the right way to frame discussion of xidStopLimit. After all, autovacuum/VACUUM will still spin its wheels in a futile effort to "restore balance". So it's kinda still about restoring imbalance IMV. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Sun, Apr 30, 2023 at 4:15 AM Peter Geoghegan wrote: > > On Sat, Apr 29, 2023 at 1:09 AM John Naylor > wrote: > > I made some small changes and wrote a suitably comprehensive commit message. I separated the possible uses for single-user mode into a separate paragraph in the "Note:" , moved the justification for the 3-million xid margin there, and restored the link to how to run it (I already mentioned we still need this info, but didn't notice this part didn't make it back in). > > I notice that you've called xidStopLimit "read-only mode" in the docs. > I think that it makes sense that you wouldn't use the term > xidStopLimit here, but I'm not sure about this terminology, either. It > seems to me that it should be something quite specific, since we're > talking about a very specific mechanism. Whatever it is, It shouldn't > contain the word "wraparound". How about -HINT: To avoid a database shutdown, [...] +HINT: To prevent XID exhaustion, [...] ...and "MXID", respectively? We could explain in the docs that vacuum and read-only queries still work "when XIDs have been exhausted", etc. (I should probably also add in the commit message that the "shutdown" in the message was carried over to MXIDs when they arrived also in 2005). > Separately, there is a need to update a couple of other places to use > this new terminology. The documentation for vacuum_sailsafe_age and > vacuum_multixact_failsafe_age refer to "system-wide transaction ID > wraparound failure", which seems less than ideal, even without your > patch. Right, I'll have a look. > Do we need two new names? One for xidStopLimit, another for > multiStopLimit? The latter really can't be called read-only mode. Thanks for that correction. Somewhat related to the now-postponed 0003: I think the docs would do well to have ordered steps for recovering from both XID and MXID exhaustion. The previous practice of shutting down had the side-effect of e.g. rolling back all in-progress transactions whose XID appeared in a MXID but if you remain in normal mode there is a bit more to check. Manual VACUUM will warn about "cutoff for removing and freezing tuples is far in the past", but the docs should be clear on this. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Sat, Apr 29, 2023 at 1:09 AM John Naylor wrote: > Looks good to me. I'm strongly in favor of this. It's most unfortunate that it took this long. > I've just made some small edits for v7 and wrote a commit message to explain > how we got here. This can be backpatched all the way, as it's simply a > correction. +1 to backpatching at least back until v14. After all, it wouldn't make any sense for us to not backpatch to 14; the whole justification for doing this was in no way influenced by anything that happened since the failsafe went in. I'm also in favor of backpatching to 11, 12, and 13 -- though I acknowledge that that's more of a judgement call. As you note in comments in the draft patch, the story with these earlier releases (especially 11) is a little more complicated for users. > I made some small changes and wrote a suitably comprehensive commit message. > I separated the possible uses for single-user mode into a separate paragraph > in the "Note:" , moved the justification for the 3-million xid margin there, > and restored the link to how to run it (I already mentioned we still need > this info, but didn't notice this part didn't make it back in). I notice that you've called xidStopLimit "read-only mode" in the docs. I think that it makes sense that you wouldn't use the term xidStopLimit here, but I'm not sure about this terminology, either. It seems to me that it should be something quite specific, since we're talking about a very specific mechanism. Whatever it is, It shouldn't contain the word "wraparound". Separately, there is a need to update a couple of other places to use this new terminology. The documentation for vacuum_sailsafe_age and vacuum_multixact_failsafe_age refer to "system-wide transaction ID wraparound failure", which seems less than ideal, even without your patch. Do we need two new names? One for xidStopLimit, another for multiStopLimit? The latter really can't be called read-only mode. > 0003: > > It really needs a more comprehensive change, and just making a long hint even > longer doesn't seem worth doing. I'd like to set that aside and come back to > it. I've left it out of the attached set. Yeah, 0003 can be treated as independent work IMV. -- Peter Geoghegan
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Tue, Apr 4, 2023 at 8:08 PM Aleksander Alekseev wrote: > [v6] 0001: Looks good to me. I've just made some small edits for v7 and wrote a commit message to explain how we got here. This can be backpatched all the way, as it's simply a correction. I do want to test on v11 first just for completeness. (The reality has already been tested by others back to 9.6 but there's no substitute for trying it yourself). I hope to commit soon after that. 0002: I've been testing wraparound using the v3 convenience function in [1] to reach xidStopLimit: -- reduce log spam alter system set log_min_messages = error; alter system set client_min_messages = error; -- restart -- no actual replication, just for testing dropping it select pg_create_physical_replication_slot('foo', true, false); create table t (i int); BEGIN; insert into t values(1); PREPARE TRANSACTION 'trx_id_pin'; -- get to xidStopLimit select consume_xids(1*1000*1000*1000); insert into t values(1); select consume_xids(1*1000*1000*1000); insert into t values(1); select consume_xids( 140*1000*1000); insert into t values(1); select consume_xids(10*1000*1000); SELECT datname, age(datfrozenxid) FROM pg_database; -- works just fine select pg_drop_replication_slot('foo'); COMMIT PREPARED 'trx_id_pin'; -- watch autovacuum take care of it automatically: SELECT datname, age(datfrozenxid) FROM pg_database; The consume_xids function builds easily on PG14, but before that it would need a bit of work because data types changed. That coincidentally was the first version to include the failsafe, which is convenient in this scenario. I'd like to do testing on PG12/13 before commit, which would require turning off truncation in the command (and can also be made faster by turning off index cleanup), but I'm also okay with going ahead with just going back to PG14 at first. That also safest. I made some small changes and wrote a suitably comprehensive commit message. I separated the possible uses for single-user mode into a separate paragraph in the "Note:" , moved the justification for the 3-million xid margin there, and restored the link to how to run it (I already mentioned we still need this info, but didn't notice this part didn't make it back in). 0003: It really needs a more comprehensive change, and just making a long hint even longer doesn't seem worth doing. I'd like to set that aside and come back to it. I've left it out of the attached set. [1] https://www.postgresql.org/message-id/CAD21AoBZ3t%2BfRtVamQTA%2BwBJaapFUY1dfP08-rxsQ%2BfouPvgKg%40mail.gmail.com -- John Naylor EDB: http://www.enterprisedb.com From f5fa3605effd624b263ff3e5dbb8b3e5c8992dba Mon Sep 17 00:00:00 2001 From: John Naylor Date: Sat, 29 Apr 2023 14:23:50 +0700 Subject: [PATCH v7 2/2] Stop telling users to run VACUUM in a single-user mode Single-user mode is almost always the worst thing to reach for in a VACUUM emergency: * Restarting in single user mode requires a shutdown checkpoint * The user interface is completely different, and awful * The buffer cache is completely cold * The checkpointer, background writer and WAL writer are not running * Without checkpoints WAL segments can not be rotated and reused * Replication is not running, so after VACUUM is done and database is started in normal mode, there is a huge backlog to replicate * pg_stat_progress_vacuum is not available so there is no indication of when the command will complete * VACUUM VERBOSE doesn't work - there is no output from single-user mode vacuum, with or without VERBOSE If that weren't enough, it's also unsafe because the wraparound limits are not enforced. It is by no means impossible to corrupt the database by mistake, such as by a user running VACUUM FULL because it sounds better. As mentioned in commit X, the system is perfectly capable of accepting commands when reaching xidStopLimit. Most VACUUM operations will work normally, with one exception: A new XID is required when truncating the relation if wal_level is above "minimal". As of v14 the failsafe mechanism disables truncation some time before reaching xidStopLimit, so this is not an issue in practice. By remaining in multi-user mode, users still have read-only access to their database, they can use parallelism, they can use command line utilities like vacuumdb, and they can remotely access the database using normal clients. The only reason to restart in single-user mode is to DROP or TRUNCATE a table, when it is suspected that doing that would be faster than vacuuming. Also add an explicit note warning against using single-user mode. Backpatch to v14, which is the first version with the VACUUM failsafe. XXX We should consider v12-13 with "VACUUM (TRUNCATE off, INDEX_CLEANUP off);", but it's not yet convenient to get to xidStopLimit before v14. XXX We could consider v11 with careful instructions about redoing VACUUMs in single-user mode when truncation is necessary. Aleksander Alekseev, with some adjustments by me
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Tue, 4 Apr 2023 at 17:08, Aleksander Alekseev wrote: > > Hi, > > > The proposed changes are in patchset v5. > > Pavel, John, thanks for your feedback. > > > I've looked into the patches v4. > > For 0001: > > I think long "not accepting commands that generate" is equivalent to > > more concise "can't generate". > > Frankly I don't think this is a good change for this particular CF > entry and it violates the consistency with multixact.c. Additionally > the new message is not accurate. The DBMS _can_ generate new XIDs, > quite a few of them actually. It merely refuses to do so. > > > For all: > > In a errhint's list what _might_ be done I think AND is a little bit > > better that OR as the word _might_ means that each of the proposals in > > the list is a probable, not a sure one. > > Well, that's debatable... IMO "or" makes a bit more sense since the > user knows better whether he/she needs to kill a long-running > transaction, or run VACUUM, or maybe do both. "And" would imply that > the user needs to do all of it, which is not necessarily true. Since > originally it was "or" I suggest we keep it as is for now. > > All in all I would prefer keeping the focus on the particular problem > the patch tries to address. > > > For 0003: > > I think double mentioning of Vacuum at each errhist i.e.: "Execute a > > database-wide VACUUM in that database" and "...or run a manual > > database-wide VACUUM." are redundant. The advice "Ensure that > > autovacuum is progressing,..." is also not needed after advice to > > "Execute a database-wide VACUUM in that database". > > [...] > > > Okay, great. For v4-0003: > > > > Each hint mentions vacuum twice, because it's adding the vacuum message at > > the end, but not removing it from the beginning. The vacuum string should > > be on its own line, since we will have to modify that for back branches > > (skip indexes and truncation). > > My bad. Fixed. > > > I'm also having second thoughts about "Ensure that autovacuum is > > progressing" in the hint. That might be better in the docs, if we decide to > > go ahead with adding a specific checklist there. > > OK, done. > > > In vacuum.c: > > > > errhint("Close open transactions soon to avoid wraparound problems.\n" > > - "You might also need to commit or roll back old prepared transactions, or > > drop stale replication slots."))); > > + "You might also need to commit or roll back old prepared transactions, > > drop stale replication slots, or kill long-running sessions. Ensure that > > autovacuum is progressing, or run a manual database-wide VACUUM."))); > > > > This appears in vacuum_get_cutoffs(), which is called by vacuum and > > cluster, and the open transactions were already mentioned, so this is not > > the place for this change. > > Fixed. > > > v4-0002: > > > > - errhint("Stop the postmaster and vacuum that database in single-user > > mode.\n" > > + errhint("VACUUM that database.\n" > > > > Further in the spirit of consistency, the mulixact path already has > > "Execute a database-wide VACUUM in that database.\n", and that seems like > > better wording. > > Agree. Fixed. Alexander, Ok, nice! I think it could be moved to committer then. Pavel.
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
Hi, > The proposed changes are in patchset v5. Pavel, John, thanks for your feedback. > I've looked into the patches v4. > For 0001: > I think long "not accepting commands that generate" is equivalent to > more concise "can't generate". Frankly I don't think this is a good change for this particular CF entry and it violates the consistency with multixact.c. Additionally the new message is not accurate. The DBMS _can_ generate new XIDs, quite a few of them actually. It merely refuses to do so. > For all: > In a errhint's list what _might_ be done I think AND is a little bit > better that OR as the word _might_ means that each of the proposals in > the list is a probable, not a sure one. Well, that's debatable... IMO "or" makes a bit more sense since the user knows better whether he/she needs to kill a long-running transaction, or run VACUUM, or maybe do both. "And" would imply that the user needs to do all of it, which is not necessarily true. Since originally it was "or" I suggest we keep it as is for now. All in all I would prefer keeping the focus on the particular problem the patch tries to address. > For 0003: > I think double mentioning of Vacuum at each errhist i.e.: "Execute a > database-wide VACUUM in that database" and "...or run a manual > database-wide VACUUM." are redundant. The advice "Ensure that > autovacuum is progressing,..." is also not needed after advice to > "Execute a database-wide VACUUM in that database". > [...] > Okay, great. For v4-0003: > > Each hint mentions vacuum twice, because it's adding the vacuum message at > the end, but not removing it from the beginning. The vacuum string should be > on its own line, since we will have to modify that for back branches (skip > indexes and truncation). My bad. Fixed. > I'm also having second thoughts about "Ensure that autovacuum is progressing" > in the hint. That might be better in the docs, if we decide to go ahead with > adding a specific checklist there. OK, done. > In vacuum.c: > > errhint("Close open transactions soon to avoid wraparound problems.\n" > - "You might also need to commit or roll back old prepared transactions, or > drop stale replication slots."))); > + "You might also need to commit or roll back old prepared transactions, drop > stale replication slots, or kill long-running sessions. Ensure that > autovacuum is progressing, or run a manual database-wide VACUUM."))); > > This appears in vacuum_get_cutoffs(), which is called by vacuum and cluster, > and the open transactions were already mentioned, so this is not the place > for this change. Fixed. > v4-0002: > > - errhint("Stop the postmaster and vacuum that database in single-user > mode.\n" > + errhint("VACUUM that database.\n" > > Further in the spirit of consistency, the mulixact path already has "Execute > a database-wide VACUUM in that database.\n", and that seems like better > wording. Agree. Fixed. -- Best regards, Aleksander Alekseev v6-0001-Correct-the-docs-and-messages-about-preventing-XI.patch Description: Binary data v6-0002-Don-t-recommend-running-VACUUM-in-a-single-user-m.patch Description: Binary data v6-0003-Modify-the-hints-about-preventing-XID-wraparound.patch Description: Binary data
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
Hi! I've looked into the patches v4. For 0001: I think long "not accepting commands that generate" is equivalent to more concise "can't generate". For 0003: I think double mentioning of Vacuum at each errhist i.e.: "Execute a database-wide VACUUM in that database" and "...or run a manual database-wide VACUUM." are redundant. The advice "Ensure that autovacuum is progressing,..." is also not needed after advice to "Execute a database-wide VACUUM in that database". For all: In a errhint's list what _might_ be done I think AND is a little bit better that OR as the word _might_ means that each of the proposals in the list is a probable, not a sure one. The proposed changes are in patchset v5. Kind regards, Pavel Borisov, Supabase. v5-0002-This-recommendation-is-outdated-for-some-time-now.patch Description: Binary data v5-0001-Correct-the-docs-and-messages-about-preventing-XI.patch Description: Binary data v5-0003-Modify-the-hints-about-preventing-XID-wraparound.patch Description: Binary data
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Mon, Apr 3, 2023 at 7:33 PM Aleksander Alekseev wrote: > > Yes, the exact same text as it appeared in the [2] thread above, which prompted Robert's comment I quoted. I take the point to mean: All of these things need to be taken care of *first*, before vacuuming, so the hint should order things so that it is clear. > > > > > Please let me know if you think > > > we should also add a suggestion to kill long-running sessions, etc. > > > > +1 for also adding that. > > OK, done. I included this change as a separate patch. It can be > squashed with another one if necessary. Okay, great. For v4-0003: Each hint mentions vacuum twice, because it's adding the vacuum message at the end, but not removing it from the beginning. The vacuum string should be on its own line, since we will have to modify that for back branches (skip indexes and truncation). I'm also having second thoughts about "Ensure that autovacuum is progressing" in the hint. That might be better in the docs, if we decide to go ahead with adding a specific checklist there. In vacuum.c: errhint("Close open transactions soon to avoid wraparound problems.\n" - "You might also need to commit or roll back old prepared transactions, or drop stale replication slots."))); + "You might also need to commit or roll back old prepared transactions, drop stale replication slots, or kill long-running sessions. Ensure that autovacuum is progressing, or run a manual database-wide VACUUM."))); This appears in vacuum_get_cutoffs(), which is called by vacuum and cluster, and the open transactions were already mentioned, so this is not the place for this change. > This particular wording was chosen for consistency with multixact.c: > > ``` > errmsg("database is not accepting commands that generate new > MultiXactIds to avoid wraparound data loss in database \"%s\"", > ``` Okay, I didn't look into that -- seems like a good precedent. v4-0002: - errhint("Stop the postmaster and vacuum that database in single-user mode.\n" + errhint("VACUUM that database.\n" Further in the spirit of consistency, the mulixact path already has "Execute a database-wide VACUUM in that database.\n", and that seems like better wording. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
Hi John, Many thanks for all the great feedback! > Okay, the changes look good. To go further, I think we need to combine into > two patches, one with 0001-0003 and one with 0004: > > 1. Correct false statements about "shutdown" etc. This should contain changes > that can safely be patched all the way to v11. > 2. Change bad advice (single-user mode) into good advice. We can target head > first, and then try to adopt as far back as we safely can (and test). Done. > > > In swapping this topic back in my head, I also saw [2] where Robert > > > suggested > > > > > > "that old prepared transactions and stale replication > > > slots should be emphasized more prominently. Maybe something like: > > > > > > HINT: Commit or roll back old prepared transactions, drop stale > > > replication slots, or kill long-running sessions. > > > Ensure that autovacuum is progressing, or run a manual database-wide > > > VACUUM." > > > > It looks like the hint regarding replication slots was added at some > > point. Currently we have: > > > > ``` > > errhint( [...] > > "You might also need to commit or roll back old prepared > > transactions, or drop stale replication slots."))); > > ``` > > Yes, the exact same text as it appeared in the [2] thread above, which > prompted Robert's comment I quoted. I take the point to mean: All of these > things need to be taken care of *first*, before vacuuming, so the hint should > order things so that it is clear. > > > Please let me know if you think > > we should also add a suggestion to kill long-running sessions, etc. > > +1 for also adding that. OK, done. I included this change as a separate patch. It can be squashed with another one if necessary. While on it, I noticed that multixact.c still talks about a "shutdown". I made corresponding changes in 0001. > - errmsg("database is not accepting commands to avoid wraparound data loss in > database \"%s\"", > + errmsg("database is not accepting commands that generate new XIDs to avoid > wraparound data loss in database \"%s\"", > > I'm not quite on board with the new message, but welcome additional opinions. > For one, it's a bit longer and now ambiguous. I also bet that "generate XIDs" > doesn't really communicate anything useful. The people who understand > exactly what that means, and what the consequences are, are unlikely to let > the system get near wraparound in the first place, and might even know enough > to ignore the hint. > > I'm thinking we might need to convey something about "writes". While it's > less technically correct, I imagine it's more useful. Remember, many users > have it drilled in their heads that they need to drop immediately to > single-user mode. I'd like to give some idea of what they can and cannot do. This particular wording was chosen for consistency with multixact.c: ``` errmsg("database is not accepting commands that generate new MultiXactIds to avoid wraparound data loss in database \"%s\"", ``` The idea of using "writes" is sort of OK, but note that the same message will appear for a query like: ``` SELECT pg_current_xact_id(); ``` ... which doesn't do writes. The message will be misleading in this case. On top of that, although a PostgreSQL user may not be aware of MultiXactIds, arguably there are many users that are aware of XIDs. Not to mention the fact that XIDs are well documented. I didn't make this change in v4 since it seems to be controversial and probably not the highest priority at the moment. I suggest we discuss it separately. > I propose for discussion that 0004 should show in the docs all the queries > for finding prepared xacts, repl slots etc. If we ever show the info at > runtime, we can dispense with the queries, but there seems to be no urgency > for that... Good idea. > + Previously it was required to stop the postmaster and VACUUM the > database > + in a single-user mode. There is no need to use a single-user mode > anymore. > > I think we need to go further and actively warn against it: It's slow, > impossible to monitor, disables replication and disables safeguards against > wraparound. (Other bad things too, but these are easily understandable for > users) > > Maybe mention also that it's main use in wraparound situations is for a way > to perform DROPs and TRUNCATEs if that would help speed up resolution. Fixed. -- Best regards, Aleksander Alekseev v4-0001-Correct-the-docs-and-messages-about-preventing-XI.patch Description: Binary data v4-0002-Don-t-recommend-running-VACUUM-in-a-single-user-m.patch Description: Binary data v4-0003-Modify-the-hints-about-preventing-XID-wraparound.patch Description: Binary data
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Tue, Mar 21, 2023 at 6:44 PM Aleksander Alekseev < aleksan...@timescale.com> wrote: Okay, the changes look good. To go further, I think we need to combine into two patches, one with 0001-0003 and one with 0004: 1. Correct false statements about "shutdown" etc. This should contain changes that can safely be patched all the way to v11. 2. Change bad advice (single-user mode) into good advice. We can target head first, and then try to adopt as far back as we safely can (and test). (...and future work, so not part of the CF here) 3. Tell the user what caused the problem, instead of saying "go figure it out yourself". > > In swapping this topic back in my head, I also saw [2] where Robert suggested > > > > "that old prepared transactions and stale replication > > slots should be emphasized more prominently. Maybe something like: > > > > HINT: Commit or roll back old prepared transactions, drop stale > > replication slots, or kill long-running sessions. > > Ensure that autovacuum is progressing, or run a manual database-wide VACUUM." > > It looks like the hint regarding replication slots was added at some > point. Currently we have: > > ``` > errhint( [...] > "You might also need to commit or roll back old prepared > transactions, or drop stale replication slots."))); > ``` Yes, the exact same text as it appeared in the [2] thread above, which prompted Robert's comment I quoted. I take the point to mean: All of these things need to be taken care of *first*, before vacuuming, so the hint should order things so that it is clear. > Please let me know if you think > we should also add a suggestion to kill long-running sessions, etc. +1 for also adding that. - errmsg("database is not accepting commands to avoid wraparound data loss in database \"%s\"", + errmsg("database is not accepting commands that generate new XIDs to avoid wraparound data loss in database \"%s\"", I'm not quite on board with the new message, but welcome additional opinions. For one, it's a bit longer and now ambiguous. I also bet that "generate XIDs" doesn't really communicate anything useful. The people who understand exactly what that means, and what the consequences are, are unlikely to let the system get near wraparound in the first place, and might even know enough to ignore the hint. I'm thinking we might need to convey something about "writes". While it's less technically correct, I imagine it's more useful. Remember, many users have it drilled in their heads that they need to drop immediately to single-user mode. I'd like to give some idea of what they can and cannot do. + Previously it was required to stop the postmaster and VACUUM the database + in a single-user mode. There is no need to use a single-user mode anymore. I think we need to go further and actively warn against it: It's slow, impossible to monitor, disables replication and disables safeguards against wraparound. (Other bad things too, but these are easily understandable for users) Maybe mention also that it's main use in wraparound situations is for a way to perform DROPs and TRUNCATEs if that would help speed up resolution. I propose for discussion that 0004 should show in the docs all the queries for finding prepared xacts, repl slots etc. If we ever show the info at runtime, we can dispense with the queries, but there seems to be no urgency for that... -- John Naylor EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
Hi John, > Thanks for picking up this badly-needed topic again! Many thanks for the review! > 0001: > > +In this condition the system can still execute read-only transactions. > +The active transactions will continue to execute and will be able to > +commit. > > This is ambiguous. I'd first say that any transactions already started can > continue, and then say that only new read-only transactions can be started. Fixed. > 0004: > > -HINT: Stop the postmaster and vacuum that database in single-user mode. > +HINT: VACUUM or VACUUM FREEZE that database. > > VACUUM FREEZE is worse and should not be mentioned, since it does unnecessary > work. Emergency vacuum is not school -- you don't get extra credit for doing > unnecessary work. Fixed. > Also, we may consider adding a boxed NOTE warning specifically against > single-user mode, especially if this recommendation will change in at least > some minor releases so people may not hear about it. See also [1]. Done. > - * If we're past xidStopLimit, refuse to execute transactions, unless > - * we are running in single-user mode (which gives an escape hatch > - * to the DBA who somehow got past the earlier defenses). > + * If we're past xidStopLimit, refuse to allocate new XIDs. > > This patch doesn't completely get rid of the need for single-user mode, so it > should keep all information about it. If a DBA wanted to e.g. drop or > truncate a table to save vacuum time, it is still possible to do it in > single-user mode, so the escape hatch is still useful. Fixed. > In swapping this topic back in my head, I also saw [2] where Robert suggested > > "that old prepared transactions and stale replication > slots should be emphasized more prominently. Maybe something like: > > HINT: Commit or roll back old prepared transactions, drop stale > replication slots, or kill long-running sessions. > Ensure that autovacuum is progressing, or run a manual database-wide VACUUM." It looks like the hint regarding replication slots was added at some point. Currently we have: ``` errhint( [...] "You might also need to commit or roll back old prepared transactions, or drop stale replication slots."))); ``` So I choose to keep it as is for now. Please let me know if you think we should also add a suggestion to kill long-running sessions, etc. -- Best regards, Aleksander Alekseev v3-0001-Correct-the-docs-about-preventing-XID-wraparound.patch Description: Binary data v3-0002-Fix-the-message-in-case-of-approaching-xidWrapLim.patch Description: Binary data v3-0003-Fix-the-message-in-case-of-exceeding-xidWarnLimit.patch Description: Binary data v3-0004-Don-t-recommend-running-VACUUM-in-a-single-user-m.patch Description: Binary data
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
Thanks for picking up this badly-needed topic again! I was irresponsible last year and let it fall off my radar, but I'm looking at the patches, as well as revisiting discussions from the last four (!?) years that didn't lead to action. 0001: +In this condition the system can still execute read-only transactions. +The active transactions will continue to execute and will be able to +commit. This is ambiguous. I'd first say that any transactions already started can continue, and then say that only new read-only transactions can be started. 0004: -HINT: Stop the postmaster and vacuum that database in single-user mode. +HINT: VACUUM or VACUUM FREEZE that database. VACUUM FREEZE is worse and should not be mentioned, since it does unnecessary work. Emergency vacuum is not school -- you don't get extra credit for doing unnecessary work. Also, we may consider adding a boxed NOTE warning specifically against single-user mode, especially if this recommendation will change in at least some minor releases so people may not hear about it. See also [1]. - * If we're past xidStopLimit, refuse to execute transactions, unless - * we are running in single-user mode (which gives an escape hatch - * to the DBA who somehow got past the earlier defenses). + * If we're past xidStopLimit, refuse to allocate new XIDs. This patch doesn't completely get rid of the need for single-user mode, so it should keep all information about it. If a DBA wanted to e.g. drop or truncate a table to save vacuum time, it is still possible to do it in single-user mode, so the escape hatch is still useful. In swapping this topic back in my head, I also saw [2] where Robert suggested "that old prepared transactions and stale replication slots should be emphasized more prominently. Maybe something like: HINT: Commit or roll back old prepared transactions, drop stale replication slots, or kill long-running sessions. Ensure that autovacuum is progressing, or run a manual database-wide VACUUM." That sounds like a good direction to me. There is more we could do here to make the message more specific [3][4][5], but the patches here are in the right direction. Note for possible backpatching: It seems straightforward to go back to PG14, which has the failsafe, but we should have better testing in place first. There is a patch in this CF to make it easier to get close to wraparound, so I'll look at what it does as well. [1] https://www.postgresql.org/message-id/CA%2BTgmoadjx%2Br8_gGbbnNifL6vEyjZntiQRPzyixrUihvtZ5jdQ%40mail.gmail.com [2] https://www.postgresql.org/message-id/CA+Tgmob1QCMJrHwRBK8HZtGsr+6cJANRQw2mEgJ9e=d+z7c...@mail.gmail.com [3] https://www.postgresql.org/message-id/20190504023015.5mgpbl27tld4irw5%40alap3.anarazel.de [4] https://www.postgresql.org/message-id/20220204013539.qdegpqzvayq3d4y2%40alap3.anarazel.de [5] https://www.postgresql.org/message-id/20220220045757.GA3733812%40rfd.leadboat.com -- John Naylor EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound (stop telling users to "vacuum that database in single-user mode")
On Mon, Jan 16, 2023 at 03:50:57PM +0300, Aleksander Alekseev wrote: > Hi hackers, > > > The proposed patchset changes the documentation and the error messages > > accordingly, making them less misleading. 0001 corrects the > > documentation but doesn't touch the code. 0002 and 0003 correct the > > messages shown when approaching xidWrapLimit and xidWarnLimit > > accordingly. > > A colleague of mine, Oleksii Kliukin, pointed out that the > recommendation about running VACUUM in a single-user mode is also > outdated, as it was previously reported in [1]. I didn't believe it at > first and decided to double-check: and again at: https://www.postgresql.org/message-id/flat/CA%2BTgmoYPfofQmRtUan%3DA3aWE9wFsJaOFr%2BW_ys2pPkNPr-2FZw%40mail.gmail.com#e7dd25fdcd171c5775f3f9e3f86b2082 > Unfortunately the [1] discussion went nowhere. likewise... > So I figured it would be appropriate to add corresponding changes to > the proposed patchset since it's relevant and is registered in the CF > app already. PFA patchset v2 which now also includes 0004. > > [1]: > https://www.postgresql.org/message-id/flat/CAMT0RQTmRj_Egtmre6fbiMA9E2hM3BsLULiV8W00stwa3URvzA%40mail.gmail.com I suggest to resend this with a title like the 2021 thread [1] (I was unable to find this just now when I looked) | doc: stop telling users to "vacuum that database in single-user mode" And copy the participants of the previous two iterations of this thread. -- Justin
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
Hi hackers, > The proposed patchset changes the documentation and the error messages > accordingly, making them less misleading. 0001 corrects the > documentation but doesn't touch the code. 0002 and 0003 correct the > messages shown when approaching xidWrapLimit and xidWarnLimit > accordingly. A colleague of mine, Oleksii Kliukin, pointed out that the recommendation about running VACUUM in a single-user mode is also outdated, as it was previously reported in [1]. I didn't believe it at first and decided to double-check: ``` =# select * from phonebook; id | name | phone +-+--- 1 | Alex| 123 5 | Charlie | 789 2 | Bob | 456 6 | Ololo | 789 (4 rows) =# insert into phonebook values (7, 'Trololo', 987); ERROR: database is not accepting commands to avoid wraparound data loss in database "template1" HINT: Stop the postmaster and vacuum that database in single-user mode. You might also need to commit or roll back old prepared transactions, or drop stale replication slots. =# VACUUM FREEZE; VACUUM =# insert into phonebook values (7, 'Trololo', 987); INSERT 0 1 =# SELECT current_setting('wal_level'); current_setting - logical ``` Unfortunately the [1] discussion went nowhere. So I figured it would be appropriate to add corresponding changes to the proposed patchset since it's relevant and is registered in the CF app already. PFA patchset v2 which now also includes 0004. [1]: https://www.postgresql.org/message-id/flat/CAMT0RQTmRj_Egtmre6fbiMA9E2hM3BsLULiV8W00stwa3URvzA%40mail.gmail.com -- Best regards, Aleksander Alekseev v2-0001-Correct-the-docs-about-preventing-XID-wraparound.patch Description: Binary data v2-0003-Fix-the-message-in-case-of-exceeding-xidWarnLimit.patch Description: Binary data v2-0004-Don-t-recommend-running-VACUUM-in-a-single-user-m.patch Description: Binary data v2-0002-Fix-the-message-in-case-of-approaching-xidWrapLim.patch Description: Binary data
[PATCH] Clarify the behavior of the system when approaching XID wraparound
Hi hackers, While playing with 64-bit XIDs [1] my attention was drawn by the following statement in the docs [2]: """ If these warnings are ignored, the system will shut down and refuse to start any new transactions once there are fewer than three million transactions left until wraparound. """ I decided to check this. Unfortunately it can't be done easily e.g. by modifying ShmemVariableCache->nextXid in gdb, because the system will PANIC with something like "could not access status of transaction 12345". Hopefully [3] will change the situation someday. Meanwhile I choose the hard way. In one session I did: ``` CREATE TABLE phonebook( "id" SERIAL PRIMARY KEY NOT NULL, "name" NAME NOT NULL, "phone" INT NOT NULL); BEGIN; INSERT INTO phonebook VALUES (1, 'Alex', 123); -- don't commit! ``` Then I did the following: ``` echo "SELECT pg_current_xact_id();" > t.sql pgbench -j 8 -c 8 -f t.sql -T 86400 eax ``` After 20-24 hours on the typical hardware (perhaps faster if only I didn't forget to use `synchronous_commit = off`) pgbench will use up the XID pool. The old tuples can't be frozen because the transaction we created in the beginning is still in progress. So now we can observe what actually happens when the system reaches xidStopLimit. Firstly, the system doesn't shutdown as the documentation says. Secondly, it executes new transactions just fine as long as these transactions don't allocate new XIDs. XIDs are allocated not for every transaction but rather lazily, when needed (see backend_xid in pg_stat_activity). A transaction doesn't need an assigned XID for checking the visibility of the tuples. Rather it uses xmin horizon, and only when using an isolation level above READ COMMITTED, see backend_xmin in pg_stat_activity. Assigning a xmin horizon doesn't increase nextXid. As a result, PostgreSQL can still execute read-only transactions even after reaching xidStopLimit. Similarly to how it can do this on hot standby replicas without having conflicts with the leader server. Thirdly, if there was a transaction created before reaching xidStopLimit, it will continue to execute after reaching xidStopLimit, and it can be successfully committed. All in all, the actual behavior is far from "system shutdown" and "refusing to start any new transactions". It's closer to entering read-only mode, similarly to what hot standbys allow to do. The proposed patchset changes the documentation and the error messages accordingly, making them less misleading. 0001 corrects the documentation but doesn't touch the code. 0002 and 0003 correct the messages shown when approaching xidWrapLimit and xidWarnLimit accordingly. Thoughts? [1]: https://commitfest.postgresql.org/41/3594/ [2]: https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND [3]: https://commitfest.postgresql.org/41/3729/ -- Best regards, Aleksander Alekseev v1-0003-Fix-the-message-in-case-of-exceeding-xidWarnLimit.patch Description: Binary data v1-0002-Fix-the-message-in-case-of-approaching-xidWrapLim.patch Description: Binary data v1-0001-Correct-the-docs-about-preventing-XID-wraparound.patch Description: Binary data