Re: [Patch] ALTER SYSTEM READ ONLY
Hi, On Thu, Jul 28, 2022 at 4:05 AM Jacob Champion wrote: > > On Fri, Apr 8, 2022 at 7:27 AM Amul Sul wrote: > > Attached is rebase version for the latest maste head(#891624f0ec). > > Hi Amul, > > I'm going through past CF triage emails today; I noticed that this > patch dropped out of the commitfest when you withdrew it in January, > but it hasn't been added back with the most recent patchset you > posted. Was that intended, or did you want to re-register it for > review? > Yes, there is a plan to re-register it again but not anytime soon, once we start to rework the design. Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
On Fri, Apr 8, 2022 at 7:27 AM Amul Sul wrote: > Attached is rebase version for the latest maste head(#891624f0ec). Hi Amul, I'm going through past CF triage emails today; I noticed that this patch dropped out of the commitfest when you withdrew it in January, but it hasn't been added back with the most recent patchset you posted. Was that intended, or did you want to re-register it for review? --Jacob
Re: [Patch] ALTER SYSTEM READ ONLY
On Sat, Apr 23, 2022 at 1:34 PM Bharath Rupireddy wrote: > > On Mon, Mar 15, 2021 at 12:56 PM Amul Sul wrote: > > > > > > It is a very minor change, so I rebased the patch. Please take a look, if > > > that works for you. > > > > > > > Thanks, I am getting one more failure for the vacuumlazy.c. on the > > latest master head(d75288fb27b), I fixed that in attached version. > > Thanks Amul! I haven't looked at the whole thread, I may be repeating > things here, please bear with me. > Np, thanks for looking into it. > 1) Is the pg_prohibit_wal() only user sets the wal prohibit mode? Or > do we still allow via 'ALTER SYSTEM READ ONLY/READ WRITE'? If not, I > think the patches still have ALTER SYSTEM READ ONLY references. Could you please point me to what those references are? I didn't find any in the v45 version. > 2) IIUC, the idea of this patch is not to generate any new WAL when > set as default_transaction_read_only and transaction_read_only can't > guarantee that? No. Complete WAL write should be disabled, in other words XLogInsert() should be restricted. > 3) IMO, the function name pg_prohibit_wal doesn't look good where it > also allows one to set WAL writes, how about the following functions - > pg_prohibit_wal or pg_disallow_wal_{generation, inserts} or > pg_allow_wal or pg_allow_wal_{generation, inserts} without any > arguments and if needed a common function > pg_set_wal_generation_state(read-only/read-write) something like that? There are already similar suggestions before too, but none of that finalized yet, there are other more challenges that need to be handled, so we can keep this work at last. > 4) It looks like only the checkpointer is setting the WAL prohibit > state? Is there a strong reason for that? Why can't the backend take a > lock on prohibit state in shared memory and set it and let the > checkpointer read it and block itself from writing WAL? Once WAL prohibited state transition is initiated and should be completed, there is no fallback. What if the backed exit before the complete transition? Similarly, even if the checkpointer exits, that will be restarted again and will complete the state transition. > 5) Is SIGUSR1 (which is multiplexed) being sent without a "reason" to > checkpointer? Why? Simply want to wake up the checkpointer process without asking for specific work in the handle function. Another suitable choice will be SIGINT, we can choose that too if needed. > 6) What happens for long-running or in-progress transactions if > someone prohibits WAL in the midst of them? Do these txns fail? Or do > we say that we will allow them to run to completion? Or do we fail > those txns at commit time? One might use this feature to say not let > server go out of disk space, but if we allow in-progress txns to > generate/write WAL, then how can one achieve that with this feature? > Say, I monitor my server in such a way that at 90% of disk space, > prohibit WAL to avoid server crash. But if this feature allows > in-progress txns to generate WAL, then the server may still crash? Read-only transactions will be allowed to continue, and if that transaction tries to write or any other transaction that has performed any writes already then the session running that transaction will be terminated -- the design is described in the first mail of this thread. > 7) What are the other use-cases (I can think of - to avoid out of disk > crashes, block/freeze writes to database when the server is > compromised) with this feature? Any usages during/before failover, > promotion or after it? The important use case is for failover to avoid split-brain situations. > 8) Is there a strong reason that we've picked up conditional variable > wal_prohibit_cv over mutex/lock for updating WALProhibit shared > memory? I am not sure how that can be done using mutex or lock. > 9) Any tests that you are planning to add? Yes, we can. I have added very sophisticated tests that cover most of my code changes, but that is not enough for such critical code changes, have a lot of chances of improvement and adding more tests for this module as well as other parts e.g. some missing coverage of gin, gists, brin, core features where this patch is adding checks, etc. Any help will be greatly appreciated. Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
On Mon, Mar 15, 2021 at 12:56 PM Amul Sul wrote: > > > > It is a very minor change, so I rebased the patch. Please take a look, if > > that works for you. > > > > Thanks, I am getting one more failure for the vacuumlazy.c. on the > latest master head(d75288fb27b), I fixed that in attached version. Thanks Amul! I haven't looked at the whole thread, I may be repeating things here, please bear with me. 1) Is the pg_prohibit_wal() only user sets the wal prohibit mode? Or do we still allow via 'ALTER SYSTEM READ ONLY/READ WRITE'? If not, I think the patches still have ALTER SYSTEM READ ONLY references. 2) IIUC, the idea of this patch is not to generate any new WAL when set as default_transaction_read_only and transaction_read_only can't guarantee that? 3) IMO, the function name pg_prohibit_wal doesn't look good where it also allows one to set WAL writes, how about the following functions - pg_prohibit_wal or pg_disallow_wal_{generation, inserts} or pg_allow_wal or pg_allow_wal_{generation, inserts} without any arguments and if needed a common function pg_set_wal_generation_state(read-only/read-write) something like that? 4) It looks like only the checkpointer is setting the WAL prohibit state? Is there a strong reason for that? Why can't the backend take a lock on prohibit state in shared memory and set it and let the checkpointer read it and block itself from writing WAL? 5) Is SIGUSR1 (which is multiplexed) being sent without a "reason" to checkpointer? Why? 6) What happens for long-running or in-progress transactions if someone prohibits WAL in the midst of them? Do these txns fail? Or do we say that we will allow them to run to completion? Or do we fail those txns at commit time? One might use this feature to say not let server go out of disk space, but if we allow in-progress txns to generate/write WAL, then how can one achieve that with this feature? Say, I monitor my server in such a way that at 90% of disk space, prohibit WAL to avoid server crash. But if this feature allows in-progress txns to generate WAL, then the server may still crash? 7) What are the other use-cases (I can think of - to avoid out of disk crashes, block/freeze writes to database when the server is compromised) with this feature? Any usages during/before failover, promotion or after it? 8) Is there a strong reason that we've picked up conditional variable wal_prohibit_cv over mutex/lock for updating WALProhibit shared memory? 9) Any tests that you are planning to add? Regards, Bharath Rupireddy.
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Nov 17, 2021 at 6:20 PM Amul Sul wrote: > > On Wed, Nov 17, 2021 at 4:07 PM Amul Sul wrote: > > > > On Wed, Nov 17, 2021 at 11:13 AM Amul Sul wrote: > > > > > > On Sat, Nov 13, 2021 at 2:18 AM Robert Haas wrote: > > > > > > > > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul wrote: > > > > > Attached is the rebased version of refactoring as well as the > > > > > pg_prohibit_wal feature patches for the latest master head (commit # > > > > > 39a3105678a). > > > > > > > > I spent a lot of time today studying 0002, and specifically the > > > > question of whether EndOfLog must be the same as > > > > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as > > > > XLogCtl->replayEndTLI. > > > > > > > > The answer to the former question is "no" because, if we don't enter > > > > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do > > > > enter redo, then I think it has to be the same unless something very > > > > weird happens. EndOfLog gets set like this: > > > > > > > > XLogBeginRead(xlogreader, LastRec); > > > > record = ReadRecord(xlogreader, PANIC, false, replayTLI); > > > > EndOfLog = EndRecPtr; > > > > > > > > In every case that exists in our regression tests, EndRecPtr is the > > > > same before these three lines of code as it is afterward. However, if > > > > you test with recovery_target=immediate, you can get it to be > > > > different, because in that case we drop out of the redo loop after > > > > calling recoveryStopsBefore() rather than after calling > > > > recoveryStopsAfter(). Similarly I'm fairly sure that if you use > > > > recovery_target_inclusive=off you can likewise get it to be different > > > > (though I discovered the hard way that recovery_target_inclusive=off > > > > is ignored when you use recovery_target_name). It seems like a really > > > > bad thing that neither recovery_target=immediate nor > > > > recovery_target_inclusive=off have any tests, and I think we ought to > > > > add some. > > > > > > > > > > recovery/t/003_recovery_targets.pl has test for > > > recovery_target=immediate but not for recovery_target_inclusive=off, we > > > can add that for recovery_target_lsn, recovery_target_time, and > > > recovery_target_xid case only where it affects. > > > > > > > Anyway, in effect, these three lines of code have the effect of > > > > backing up the xlogreader by one record when we stop before rather > > > > than after a record that we're replaying. What that means is that > > > > EndOfLog is going to be the end+1 of the last record that we actually > > > > replayed. There might be one more record that we read but did not > > > > replay, and that record won't impact the value we end up with in > > > > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last > > > > record that we actually replayed. To put that another way, there's no > > > > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr > > > > and before we change LastRec. So in the cases where > > > > XLogCtl->replayEndRecPtr gets initialized at all, it can only be > > > > different from EndOfLog if something different happens when we re-read > > > > the last-replayed WAL record than what happened when we read it the > > > > first time. That seems unlikely, and would be unfortunate it if it did > > > > happen. I am inclined to think that it might be better not to reread > > > > the record at all, though. > > > > > > There are two reasons that the record is reread; first, one that you > > > have just explained where the redo loop drops out due to > > > recoveryStopsBefore() and another one is that InRecovery is false. > > > > > > In the formal case at the end, redo while-loop does read a new record > > > which in effect updates EndRecPtr and when we breaks the loop, we do > > > reach the place where we do reread record -- where we do read the > > > record (i.e. LastRec) before the record that redo loop has read and > > > which correctly sets EndRecPtr. In the latter case, definitely, we > > > don't need any adjustment to EndRecPtr. > > > > > > So technically one case needs reread but that is also not needed, we > > > have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we > > > do not need to reread the record, but EndOfLog and EndOfLogTLI should > > > be set conditionally something like: > > > > > > if (InRecovery) > > > { > > > EndOfLog = XLogCtl->lastReplayedEndRecPtr; > > > EndOfLogTLI = XLogCtl->lastReplayedTLI; > > > } > > > else > > > { > > > EndOfLog = EndRecPtr; > > > EndOfLogTLI = replayTLI; > > > } > > > > > > > As far as this patch goes, I think we need > > > > a solution that doesn't involve fetching EndOfLog from a variable > > > > that's only sometimes initialized and then not doing anything with it > > > > except in the cases where it was initialized. > > > > > > > > > > Another reason could be EndOfLog changes further in the following case: > > > > > > /* > > > * Actually, if WAL ended in an
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Nov 17, 2021 at 4:07 PM Amul Sul wrote: > > On Wed, Nov 17, 2021 at 11:13 AM Amul Sul wrote: > > > > On Sat, Nov 13, 2021 at 2:18 AM Robert Haas wrote: > > > > > > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul wrote: > > > > Attached is the rebased version of refactoring as well as the > > > > pg_prohibit_wal feature patches for the latest master head (commit # > > > > 39a3105678a). > > > > > > I spent a lot of time today studying 0002, and specifically the > > > question of whether EndOfLog must be the same as > > > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as > > > XLogCtl->replayEndTLI. > > > > > > The answer to the former question is "no" because, if we don't enter > > > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do > > > enter redo, then I think it has to be the same unless something very > > > weird happens. EndOfLog gets set like this: > > > > > > XLogBeginRead(xlogreader, LastRec); > > > record = ReadRecord(xlogreader, PANIC, false, replayTLI); > > > EndOfLog = EndRecPtr; > > > > > > In every case that exists in our regression tests, EndRecPtr is the > > > same before these three lines of code as it is afterward. However, if > > > you test with recovery_target=immediate, you can get it to be > > > different, because in that case we drop out of the redo loop after > > > calling recoveryStopsBefore() rather than after calling > > > recoveryStopsAfter(). Similarly I'm fairly sure that if you use > > > recovery_target_inclusive=off you can likewise get it to be different > > > (though I discovered the hard way that recovery_target_inclusive=off > > > is ignored when you use recovery_target_name). It seems like a really > > > bad thing that neither recovery_target=immediate nor > > > recovery_target_inclusive=off have any tests, and I think we ought to > > > add some. > > > > > > > recovery/t/003_recovery_targets.pl has test for > > recovery_target=immediate but not for recovery_target_inclusive=off, we > > can add that for recovery_target_lsn, recovery_target_time, and > > recovery_target_xid case only where it affects. > > > > > Anyway, in effect, these three lines of code have the effect of > > > backing up the xlogreader by one record when we stop before rather > > > than after a record that we're replaying. What that means is that > > > EndOfLog is going to be the end+1 of the last record that we actually > > > replayed. There might be one more record that we read but did not > > > replay, and that record won't impact the value we end up with in > > > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last > > > record that we actually replayed. To put that another way, there's no > > > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr > > > and before we change LastRec. So in the cases where > > > XLogCtl->replayEndRecPtr gets initialized at all, it can only be > > > different from EndOfLog if something different happens when we re-read > > > the last-replayed WAL record than what happened when we read it the > > > first time. That seems unlikely, and would be unfortunate it if it did > > > happen. I am inclined to think that it might be better not to reread > > > the record at all, though. > > > > There are two reasons that the record is reread; first, one that you > > have just explained where the redo loop drops out due to > > recoveryStopsBefore() and another one is that InRecovery is false. > > > > In the formal case at the end, redo while-loop does read a new record > > which in effect updates EndRecPtr and when we breaks the loop, we do > > reach the place where we do reread record -- where we do read the > > record (i.e. LastRec) before the record that redo loop has read and > > which correctly sets EndRecPtr. In the latter case, definitely, we > > don't need any adjustment to EndRecPtr. > > > > So technically one case needs reread but that is also not needed, we > > have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we > > do not need to reread the record, but EndOfLog and EndOfLogTLI should > > be set conditionally something like: > > > > if (InRecovery) > > { > > EndOfLog = XLogCtl->lastReplayedEndRecPtr; > > EndOfLogTLI = XLogCtl->lastReplayedTLI; > > } > > else > > { > > EndOfLog = EndRecPtr; > > EndOfLogTLI = replayTLI; > > } > > > > > As far as this patch goes, I think we need > > > a solution that doesn't involve fetching EndOfLog from a variable > > > that's only sometimes initialized and then not doing anything with it > > > except in the cases where it was initialized. > > > > > > > Another reason could be EndOfLog changes further in the following case: > > > > /* > > * Actually, if WAL ended in an incomplete record, skip the parts that > > * made it through and start writing after the portion that persisted. > > * (It's critical to first write an OVERWRITE_CONTRECORD message, which > > * we'll do as soon as we're open for writing new WAL.) >
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Nov 17, 2021 at 11:13 AM Amul Sul wrote: > > On Sat, Nov 13, 2021 at 2:18 AM Robert Haas wrote: > > > > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul wrote: > > > Attached is the rebased version of refactoring as well as the > > > pg_prohibit_wal feature patches for the latest master head (commit # > > > 39a3105678a). > > > > I spent a lot of time today studying 0002, and specifically the > > question of whether EndOfLog must be the same as > > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as > > XLogCtl->replayEndTLI. > > > > The answer to the former question is "no" because, if we don't enter > > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do > > enter redo, then I think it has to be the same unless something very > > weird happens. EndOfLog gets set like this: > > > > XLogBeginRead(xlogreader, LastRec); > > record = ReadRecord(xlogreader, PANIC, false, replayTLI); > > EndOfLog = EndRecPtr; > > > > In every case that exists in our regression tests, EndRecPtr is the > > same before these three lines of code as it is afterward. However, if > > you test with recovery_target=immediate, you can get it to be > > different, because in that case we drop out of the redo loop after > > calling recoveryStopsBefore() rather than after calling > > recoveryStopsAfter(). Similarly I'm fairly sure that if you use > > recovery_target_inclusive=off you can likewise get it to be different > > (though I discovered the hard way that recovery_target_inclusive=off > > is ignored when you use recovery_target_name). It seems like a really > > bad thing that neither recovery_target=immediate nor > > recovery_target_inclusive=off have any tests, and I think we ought to > > add some. > > > > recovery/t/003_recovery_targets.pl has test for > recovery_target=immediate but not for recovery_target_inclusive=off, we > can add that for recovery_target_lsn, recovery_target_time, and > recovery_target_xid case only where it affects. > > > Anyway, in effect, these three lines of code have the effect of > > backing up the xlogreader by one record when we stop before rather > > than after a record that we're replaying. What that means is that > > EndOfLog is going to be the end+1 of the last record that we actually > > replayed. There might be one more record that we read but did not > > replay, and that record won't impact the value we end up with in > > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last > > record that we actually replayed. To put that another way, there's no > > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr > > and before we change LastRec. So in the cases where > > XLogCtl->replayEndRecPtr gets initialized at all, it can only be > > different from EndOfLog if something different happens when we re-read > > the last-replayed WAL record than what happened when we read it the > > first time. That seems unlikely, and would be unfortunate it if it did > > happen. I am inclined to think that it might be better not to reread > > the record at all, though. > > There are two reasons that the record is reread; first, one that you > have just explained where the redo loop drops out due to > recoveryStopsBefore() and another one is that InRecovery is false. > > In the formal case at the end, redo while-loop does read a new record > which in effect updates EndRecPtr and when we breaks the loop, we do > reach the place where we do reread record -- where we do read the > record (i.e. LastRec) before the record that redo loop has read and > which correctly sets EndRecPtr. In the latter case, definitely, we > don't need any adjustment to EndRecPtr. > > So technically one case needs reread but that is also not needed, we > have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we > do not need to reread the record, but EndOfLog and EndOfLogTLI should > be set conditionally something like: > > if (InRecovery) > { > EndOfLog = XLogCtl->lastReplayedEndRecPtr; > EndOfLogTLI = XLogCtl->lastReplayedTLI; > } > else > { > EndOfLog = EndRecPtr; > EndOfLogTLI = replayTLI; > } > > > As far as this patch goes, I think we need > > a solution that doesn't involve fetching EndOfLog from a variable > > that's only sometimes initialized and then not doing anything with it > > except in the cases where it was initialized. > > > > Another reason could be EndOfLog changes further in the following case: > > /* > * Actually, if WAL ended in an incomplete record, skip the parts that > * made it through and start writing after the portion that persisted. > * (It's critical to first write an OVERWRITE_CONTRECORD message, which > * we'll do as soon as we're open for writing new WAL.) > */ > if (!XLogRecPtrIsInvalid(missingContrecPtr)) > { > Assert(!XLogRecPtrIsInvalid(abortedRecPtr)); > EndOfLog = missingContrecPtr; > } > > Now only solution that I can think is to copy EndOfLog (so > EndOfLogTLI) into shared memory. > > >
Re: [Patch] ALTER SYSTEM READ ONLY
On Sat, Nov 13, 2021 at 2:18 AM Robert Haas wrote: > > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul wrote: > > Attached is the rebased version of refactoring as well as the > > pg_prohibit_wal feature patches for the latest master head (commit # > > 39a3105678a). > > I spent a lot of time today studying 0002, and specifically the > question of whether EndOfLog must be the same as > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as > XLogCtl->replayEndTLI. > > The answer to the former question is "no" because, if we don't enter > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do > enter redo, then I think it has to be the same unless something very > weird happens. EndOfLog gets set like this: > > XLogBeginRead(xlogreader, LastRec); > record = ReadRecord(xlogreader, PANIC, false, replayTLI); > EndOfLog = EndRecPtr; > > In every case that exists in our regression tests, EndRecPtr is the > same before these three lines of code as it is afterward. However, if > you test with recovery_target=immediate, you can get it to be > different, because in that case we drop out of the redo loop after > calling recoveryStopsBefore() rather than after calling > recoveryStopsAfter(). Similarly I'm fairly sure that if you use > recovery_target_inclusive=off you can likewise get it to be different > (though I discovered the hard way that recovery_target_inclusive=off > is ignored when you use recovery_target_name). It seems like a really > bad thing that neither recovery_target=immediate nor > recovery_target_inclusive=off have any tests, and I think we ought to > add some. > recovery/t/003_recovery_targets.pl has test for recovery_target=immediate but not for recovery_target_inclusive=off, we can add that for recovery_target_lsn, recovery_target_time, and recovery_target_xid case only where it affects. > Anyway, in effect, these three lines of code have the effect of > backing up the xlogreader by one record when we stop before rather > than after a record that we're replaying. What that means is that > EndOfLog is going to be the end+1 of the last record that we actually > replayed. There might be one more record that we read but did not > replay, and that record won't impact the value we end up with in > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last > record that we actually replayed. To put that another way, there's no > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr > and before we change LastRec. So in the cases where > XLogCtl->replayEndRecPtr gets initialized at all, it can only be > different from EndOfLog if something different happens when we re-read > the last-replayed WAL record than what happened when we read it the > first time. That seems unlikely, and would be unfortunate it if it did > happen. I am inclined to think that it might be better not to reread > the record at all, though. There are two reasons that the record is reread; first, one that you have just explained where the redo loop drops out due to recoveryStopsBefore() and another one is that InRecovery is false. In the formal case at the end, redo while-loop does read a new record which in effect updates EndRecPtr and when we breaks the loop, we do reach the place where we do reread record -- where we do read the record (i.e. LastRec) before the record that redo loop has read and which correctly sets EndRecPtr. In the latter case, definitely, we don't need any adjustment to EndRecPtr. So technically one case needs reread but that is also not needed, we have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we do not need to reread the record, but EndOfLog and EndOfLogTLI should be set conditionally something like: if (InRecovery) { EndOfLog = XLogCtl->lastReplayedEndRecPtr; EndOfLogTLI = XLogCtl->lastReplayedTLI; } else { EndOfLog = EndRecPtr; EndOfLogTLI = replayTLI; } > As far as this patch goes, I think we need > a solution that doesn't involve fetching EndOfLog from a variable > that's only sometimes initialized and then not doing anything with it > except in the cases where it was initialized. > Another reason could be EndOfLog changes further in the following case: /* * Actually, if WAL ended in an incomplete record, skip the parts that * made it through and start writing after the portion that persisted. * (It's critical to first write an OVERWRITE_CONTRECORD message, which * we'll do as soon as we're open for writing new WAL.) */ if (!XLogRecPtrIsInvalid(missingContrecPtr)) { Assert(!XLogRecPtrIsInvalid(abortedRecPtr)); EndOfLog = missingContrecPtr; } Now only solution that I can think is to copy EndOfLog (so EndOfLogTLI) into shared memory. > As for EndOfLogTLI, I'm afraid I don't think that's the same thing as > XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't > think the regression tests contain any scenarios where we run recovery > and the values end up different. However, I
Re: [Patch] ALTER SYSTEM READ ONLY
On Mon, Nov 8, 2021 at 8:20 AM Amul Sul wrote: > Attached is the rebased version of refactoring as well as the > pg_prohibit_wal feature patches for the latest master head (commit # > 39a3105678a). I spent a lot of time today studying 0002, and specifically the question of whether EndOfLog must be the same as XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as XLogCtl->replayEndTLI. The answer to the former question is "no" because, if we don't enter redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do enter redo, then I think it has to be the same unless something very weird happens. EndOfLog gets set like this: XLogBeginRead(xlogreader, LastRec); record = ReadRecord(xlogreader, PANIC, false, replayTLI); EndOfLog = EndRecPtr; In every case that exists in our regression tests, EndRecPtr is the same before these three lines of code as it is afterward. However, if you test with recovery_target=immediate, you can get it to be different, because in that case we drop out of the redo loop after calling recoveryStopsBefore() rather than after calling recoveryStopsAfter(). Similarly I'm fairly sure that if you use recovery_target_inclusive=off you can likewise get it to be different (though I discovered the hard way that recovery_target_inclusive=off is ignored when you use recovery_target_name). It seems like a really bad thing that neither recovery_target=immediate nor recovery_target_inclusive=off have any tests, and I think we ought to add some. Anyway, in effect, these three lines of code have the effect of backing up the xlogreader by one record when we stop before rather than after a record that we're replaying. What that means is that EndOfLog is going to be the end+1 of the last record that we actually replayed. There might be one more record that we read but did not replay, and that record won't impact the value we end up with in EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last record that we actually replayed. To put that another way, there's no way to exit the main redo loop after we set XLogCtl->replayEndRecPtr and before we change LastRec. So in the cases where XLogCtl->replayEndRecPtr gets initialized at all, it can only be different from EndOfLog if something different happens when we re-read the last-replayed WAL record than what happened when we read it the first time. That seems unlikely, and would be unfortunate it if it did happen. I am inclined to think that it might be better not to reread the record at all, though. As far as this patch goes, I think we need a solution that doesn't involve fetching EndOfLog from a variable that's only sometimes initialized and then not doing anything with it except in the cases where it was initialized. As for EndOfLogTLI, I'm afraid I don't think that's the same thing as XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't think the regression tests contain any scenarios where we run recovery and the values end up different. However, I think that the code sets EndOfLogTLI to the TLI of the last WAL file that we read, and I think XLogCtl->replayEndTLI gets set to the timeline from which that WAL record originated. So imagine that we are looking for WAL that ought to be in 00010003 but we don't find it; instead we find 00020003 because our recovery target timeline is 2, or something that has 2 in its history. We will read the WAL for timeline 1 from this file which has timeline 2 in the file name. I think if recovery ends in this file before the timeline switch, these values will be different. I did not try to construct a test case for this today due to not having enough time, so it's possible that I'm wrong about this, but that's how it looks to me from the code. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Mon, Oct 25, 2021 at 8:15 PM Robert Haas wrote: > > On Mon, Oct 25, 2021 at 3:05 AM Amul Sul wrote: > > Ok, did the same in the attached 0001 patch. > > > > There is no harm in calling LocalSetXLogInsertAllowed() calling > > multiple times, but the problem I can see is that with this patch user > > is allowed to call LocalSetXLogInsertAllowed() at the time it is > > supposed not to be called e.g. when LocalXLogInsertAllowed = 0; > > WAL writes are explicitly disabled. > > I've pushed 0001 and 0002 but I reversed the order of them and made a > few other edits. > Thank you! I have rebased the remaining patches on top of the latest master head (commit # e63ce9e8d6a). In addition to that, I did the additional changes to 0002 where I haven't included the change that tries to remove arguments of CleanupAfterArchiveRecovery() in the previous version. Because if we want to use XLogCtl->replayEndTLI and XLogCtl->replayEndRecPtr to replace EndOfLogTLI and EndOfLog arguments respectively, then we also need to consider the case where EndOfLog is changing if the abort-record does exist. That can be decided only in XLogAcceptWrite() before the shared memory value related to abort-record is going to be clear. Regards, Amul From 14bfb4a91c8935efd17b50a0a0687101481b15c7 Mon Sep 17 00:00:00 2001 From: Amul Sul Date: Thu, 30 Sep 2021 06:29:06 -0400 Subject: [PATCH v40 2/2] Remove dependencies on startup-process specifical variables. To make XLogAcceptWrites(), need to dependency on few global and local variable spcific to startup process. Global variables are abortedRecPtr, missingContrecPtr, ArchiveRecoveryRequested and LocalPromoteIsTriggered, whereas LocalPromoteIsTriggered can be accessed in any other process using existing PromoteIsTriggered(). abortedRecPtr & ArchiveRecoveryRequested is made accessible by copying into shared memory. missingContrecPtr can get from the existing shared memory values through XLogCtl->lastSegSwitchLSN, which is not going to change until we use it. That changes only when the current WAL segment gets full, there won't be any WAL write until that point. XLogAcceptWrites() accepts two argument as EndOfLogTLI and EndOfLog which are local to StartupXLOG(). Instead of passing as an argument XLogCtl->replayEndTLI and XLogCtl->replayEndRecPtr from the shared memory can be used as an replacement to EndOfLogTLI and EndOfLog respectively. EndOfLog will be changed if the abort record is exists and in that case, the missingContrecPtr point will be considered as the end of WAL since the further part going to be skipped anyway. EndOfLogTLI in the StartupXLOG() is the timeline ID of the last record that xlogreader reads, but this xlogreader was simply re-fetching the last record which we have replied in redo loop if it was in recovery, if not in recovery, we don't need to worry since this value is needed only in case of ArchiveRecoveryRequested = true, which implicitly forces redo and sets XLogCtl->replayEndTLI value. --- src/backend/access/transam/xlog.c | 82 +-- 1 file changed, 66 insertions(+), 16 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 8d72e1967d4..10159b3312b 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -669,6 +669,13 @@ typedef struct XLogCtlData */ bool SharedPromoteIsTriggered; + /* + * SharedArchiveRecoveryRequested exports the value of the + * ArchiveRecoveryRequested flag to be share which is otherwise valid only + * in the startup process. + */ + bool SharedArchiveRecoveryRequested; + /* * WalWriterSleeping indicates whether the WAL writer is currently in * low-power mode (and hence should be nudged if an async commit occurs). @@ -718,6 +725,13 @@ typedef struct XLogCtlData /* timestamp of last COMMIT/ABORT record replayed (or being replayed) */ TimestampTz recoveryLastXTime; + /* + * SharedAbortedRecPtr exports abortedRecPtr to be shared with another + * process to write OVERWRITE_CONTRECORD message, if WAL writes are not + * permitted in the current process which reads that. + */ + XLogRecPtr SharedAbortedRecPtr; + /* * timestamp of when we started replaying the current chunk of WAL data, * only relevant for replication or archive recovery @@ -940,7 +954,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force); static XLogRecord *ReadRecord(XLogReaderState *xlogreader, int emode, bool fetching_ckpt); static void CheckRecoveryConsistency(void); -static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog); +static bool XLogAcceptWrites(void); static bool PerformRecoveryXLogAction(void); static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int whichChkpt, bool report); @@ -5268,7 +5282,9 @@ XLOGShmemInit(void) XLogCtl->SharedHotStandbyActive = false; XLogCtl->InstallXLogFileSegmentActive = false;
Re: [Patch] ALTER SYSTEM READ ONLY
On 10/25/21, 1:33 PM, "Robert Haas" wrote: > On Mon, Oct 25, 2021 at 3:14 PM Bossart, Nathan wrote: >> My compiler is complaining about oldXLogAllowed possibly being used >> uninitialized in CreateCheckPoint(). AFAICT it can just be initially >> set to zero to silence this warning because it will, in fact, be >> initialized properly when it is used. > > Hmm, I guess I could have foreseen that, had I been a little bit > smarter than I am. I have committed a change to initialize it to 0 as > you propose. Thanks! Nathan
Re: [Patch] ALTER SYSTEM READ ONLY
On Mon, Oct 25, 2021 at 3:14 PM Bossart, Nathan wrote: > My compiler is complaining about oldXLogAllowed possibly being used > uninitialized in CreateCheckPoint(). AFAICT it can just be initially > set to zero to silence this warning because it will, in fact, be > initialized properly when it is used. Hmm, I guess I could have foreseen that, had I been a little bit smarter than I am. I have committed a change to initialize it to 0 as you propose. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On 10/25/21, 7:50 AM, "Robert Haas" wrote: > I've pushed 0001 and 0002 but I reversed the order of them and made a > few other edits. My compiler is complaining about oldXLogAllowed possibly being used uninitialized in CreateCheckPoint(). AFAICT it can just be initially set to zero to silence this warning because it will, in fact, be initialized properly when it is used. Nathan
Re: [Patch] ALTER SYSTEM READ ONLY
On Mon, Oct 25, 2021 at 3:05 AM Amul Sul wrote: > Ok, did the same in the attached 0001 patch. > > There is no harm in calling LocalSetXLogInsertAllowed() calling > multiple times, but the problem I can see is that with this patch user > is allowed to call LocalSetXLogInsertAllowed() at the time it is > supposed not to be called e.g. when LocalXLogInsertAllowed = 0; > WAL writes are explicitly disabled. I've pushed 0001 and 0002 but I reversed the order of them and made a few other edits. I don't really see the issue you mention here as a problem. There's only one place where we set LocalXLogInsertAllowed = 0, and I don't know that we'll ever have another one. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, Oct 19, 2021 at 3:50 AM Robert Haas wrote: > > On Mon, Oct 18, 2021 at 9:54 AM Amul Sul wrote: > > I tried this in the attached version, but I'm a bit skeptical with > > changes that are needed for CreateCheckPoint(), those don't seem to be > > clean. > > Yeah, that doesn't look great. I don't think it's entirely correct, > actually, because surely you want LocalXLogInsertAllowed = 0 to be > executed even if !IsPostmasterEnvironment. It's only > LocalXLogInsertAllowed = -1 that we would want to have depend on > IsPostmasterEnvironment. But that's pretty ugly too: I guess the > reason it has to be like is that, if it does that unconditionally, it > will overwrite the temporary value of 1 set by the caller, which will > then cause problems when the caller tries to XLogReportParameters(). > > I think that problem goes away if we drive the decision off of shared > state rather than a local variable, but I agree that it's otherwise a > bit tricky to untangle. One idea might be to have > LocalSetXLogInsertAllowed return the old value. Then we could use the > same kind of coding we do when switching memory contexts, where we > say: > > oldcontext = MemoryContextSwitchTo(something); > // do stuff > MemoryContextSwitchTo(oldcontext); > > Here we could maybe do: > > oldxlallowed = LocalSetXLogInsertAllowed(); > // do stuff > XLogInsertAllowed = oldxlallowed; > Ok, did the same in the attached 0001 patch. There is no harm in calling LocalSetXLogInsertAllowed() calling multiple times, but the problem I can see is that with this patch user is allowed to call LocalSetXLogInsertAllowed() at the time it is supposed not to be called e.g. when LocalXLogInsertAllowed = 0; WAL writes are explicitly disabled. > That way, instead of CreateCheckPoint() knowing under what > circumstances the caller might have changed the value, it only knows > that some callers might have already changed the value. That seems > better. > > > I agree that calling InitXLOGAccess() from RecoveryInProgress() is not > > good, but I am not sure about calling it from XLogInsertAllowed() > > either, perhaps, both are status check function and general > > expectations might be that status checking functions are not going > > change and/or initialize the system state. InitXLOGAccess() should > > get called from the very first WAL write operation if needed, but if > > we don't want to do that, then I would prefer to call InitXLOGAccess() > > from XLogInsertAllowed() instead of RecoveryInProgress(). > > Well, that's a fair point, too, but it might not be safe to, say, move > this to XLogBeginInsert(). Like, imagine that there's a hypothetical > piece of code that looks like this: > > if (RecoveryInProgress()) > ereport(ERROR, errmsg("can't do that in recovery"))); > > // do something here that depends on ThisTimeLineID or > wal_segment_size or RedoRecPtr > > XLogBeginInsert(); > > lsn = XLogInsert(...); > > Such code would work correctly the way things are today, but if the > InitXLOGAccess() call were deferred until XLogBeginInsert() time, then > it would fail. > > I was curious whether this is just a theoretical problem. It turns out > that it's not. I wrote a couple of just-for-testing patches, which I > attach here. The first one just adjusts things so that we'll fail an > assertion if we try to make use of ThisTimeLineID before we've set it > to a legal value. I had to exempt two places from these checks just > for 'make check-world' to pass; these are shown in the patch, and one > or both of them might be existing bugs -- or maybe not, I haven't > looked too deeply. The second one then adjusts the patch to pretend > that ThisTimeLineID is not necessarily valid just because we've called > InitXLOGAccess() but that it is valid after XLogBeginInsert(). With > that change, I find about a dozen places where, apparently, the early > call to InitXLOGAccess() is critical to getting ThisTimeLineID > adjusted in time. So apparently a change of this type is not entirely > trivial. And this is just a quick test, and just for one of the three > things that get initialized here. > > On the other hand, just moving it to XLogInsertAllowed() isn't > risk-free either and would likely require adjusting some of the same > places I found with this test. So I guess if we want to do something > like this we need more study. > Yeah, that requires a lot of energy and time -- not done anything related to this in the attached version. Please have a look at the attached version where the 0001 patch does change LocalSetXLogInsertAllowed() as said before. 0002 patch moves XLogReportParameters() closer to other wal write operations and removes unnecessary LocalSetXLogInsertAllowed() calls. 0003 is code movements adds XLogAcceptWrites() function same as the before, and 0004 patch tries to remove the dependency. 0004 patch could change w.r.t. decision that is going to be made for the patch that I posted[1] to remove abortedRecPtr global variable. For now, I
Re: [Patch] ALTER SYSTEM READ ONLY
On Mon, Oct 18, 2021 at 9:54 AM Amul Sul wrote: > I tried this in the attached version, but I'm a bit skeptical with > changes that are needed for CreateCheckPoint(), those don't seem to be > clean. Yeah, that doesn't look great. I don't think it's entirely correct, actually, because surely you want LocalXLogInsertAllowed = 0 to be executed even if !IsPostmasterEnvironment. It's only LocalXLogInsertAllowed = -1 that we would want to have depend on IsPostmasterEnvironment. But that's pretty ugly too: I guess the reason it has to be like is that, if it does that unconditionally, it will overwrite the temporary value of 1 set by the caller, which will then cause problems when the caller tries to XLogReportParameters(). I think that problem goes away if we drive the decision off of shared state rather than a local variable, but I agree that it's otherwise a bit tricky to untangle. One idea might be to have LocalSetXLogInsertAllowed return the old value. Then we could use the same kind of coding we do when switching memory contexts, where we say: oldcontext = MemoryContextSwitchTo(something); // do stuff MemoryContextSwitchTo(oldcontext); Here we could maybe do: oldxlallowed = LocalSetXLogInsertAllowed(); // do stuff XLogInsertAllowed = oldxlallowed; That way, instead of CreateCheckPoint() knowing under what circumstances the caller might have changed the value, it only knows that some callers might have already changed the value. That seems better. > I agree that calling InitXLOGAccess() from RecoveryInProgress() is not > good, but I am not sure about calling it from XLogInsertAllowed() > either, perhaps, both are status check function and general > expectations might be that status checking functions are not going > change and/or initialize the system state. InitXLOGAccess() should > get called from the very first WAL write operation if needed, but if > we don't want to do that, then I would prefer to call InitXLOGAccess() > from XLogInsertAllowed() instead of RecoveryInProgress(). Well, that's a fair point, too, but it might not be safe to, say, move this to XLogBeginInsert(). Like, imagine that there's a hypothetical piece of code that looks like this: if (RecoveryInProgress()) ereport(ERROR, errmsg("can't do that in recovery"))); // do something here that depends on ThisTimeLineID or wal_segment_size or RedoRecPtr XLogBeginInsert(); lsn = XLogInsert(...); Such code would work correctly the way things are today, but if the InitXLOGAccess() call were deferred until XLogBeginInsert() time, then it would fail. I was curious whether this is just a theoretical problem. It turns out that it's not. I wrote a couple of just-for-testing patches, which I attach here. The first one just adjusts things so that we'll fail an assertion if we try to make use of ThisTimeLineID before we've set it to a legal value. I had to exempt two places from these checks just for 'make check-world' to pass; these are shown in the patch, and one or both of them might be existing bugs -- or maybe not, I haven't looked too deeply. The second one then adjusts the patch to pretend that ThisTimeLineID is not necessarily valid just because we've called InitXLOGAccess() but that it is valid after XLogBeginInsert(). With that change, I find about a dozen places where, apparently, the early call to InitXLOGAccess() is critical to getting ThisTimeLineID adjusted in time. So apparently a change of this type is not entirely trivial. And this is just a quick test, and just for one of the three things that get initialized here. On the other hand, just moving it to XLogInsertAllowed() isn't risk-free either and would likely require adjusting some of the same places I found with this test. So I guess if we want to do something like this we need more study. -- Robert Haas EDB: http://www.enterprisedb.com 0002-Pretend-it-s-not-valid-until-XLogBeginInsert.patch Description: Binary data 0001-Test-code-to-see-whether-we-have-always-properly-ini.patch Description: Binary data
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Oct 14, 2021 at 11:10 PM Robert Haas wrote: > > On Tue, Oct 12, 2021 at 8:18 AM Amul Sul wrote: > > In the attached version I have fixed this issue by restoring > > missingContrecPtr. > > > > To handle abortedRecPtr and missingContrecPtr newly added global > > variables thought the commit # ff9f111bce24, we don't need to store > > them in the shared memory separately, instead, we need a flag that > > indicates a broken record found previously, at the end of recovery, so > > that we can overwrite contrecord. > > > > The missingContrecPtr is assigned to the EndOfLog, and we have handled > > EndOfLog previously in the 0004 patch, and the abortedRecPtr is the > > same as the lastReplayedEndRecPtr, AFAICS. I have added an assert to > > ensure that the lastReplayedEndRecPtr value is the same as the > > abortedRecPtr, but I think that is not needed, we can go ahead and > > write an overwrite-contrecord starting at lastReplayedEndRecPtr. > > I thought that it made sense to commit 0001 and 0002 at this point, so > I have done that. I think that the treatment of missingContrecPtr and > abortedRecPtr may need more thought yet, so at least for that reason I > don't think it's a good idea to proceed with 0004 yet. 0003 is just > code movement so I guess that can be committed whenever we're > confident that we know exactly which things we want to end up inside > XLogAcceptWrites(). > Ok. > I do have a few ideas after studying this a bit more: > > - I wonder whether, in addition to moving a few things later as 0002 > did, we also ought to think about moving one thing earlier, > specifically XLogReportParameters(). Right now, we have, I believe, > four things that write WAL at the end of recovery: > CreateOverwriteContrecordRecord(), UpdateFullPageWrites(), > PerformRecoveryXLogAction(), and XLogReportParameters(). As the code > is structured now, we do the first three of those things, and then do > a bunch of other stuff inside CleanupAfterArchiveRecovery() like > running recovery_end_command, and removing non-parent xlog files, and > archiving the partial segment, and then come back and do the fourth > one. Is there any good reason for that? If not, I think doing them all > together would be cleaner, and would propose to reverse the order of > CleanupAfterArchiveRecovery() and XLogReportParameters(). > Yes, that can be done. > - If we did that, then I would further propose to adjust things so > that we remove the call to LocalSetXLogInsertAllowed() and the > assignment LocalXLogInsertAllowed = -1 from inside > CreateEndOfRecoveryRecord(), the LocalXLogInsertAllowed = -1 from just > after UpdateFullPageWrites(), and the call to > LocalSetXLogInsertAllowed() just before XLogReportParameters(). > Instead, just let the call to LocalSetXLogInsertAllowed() right before > CreateOverwriteContrecordRecord() remain in effect. There doesn't seem > to be much point in flipping that switch off and on again, and the > fact that we have been doing so is in my view just evidence that > StartupXLOG() doesn't do a very good job of getting related code all > into one place. > Currently there are three places that are calling LocalSetXLogInsertAllowed() and resetting that LocalXLogInsertAllowed flag as StartupXLOG(), CreateEndOfRecoveryRecord() and the CreateCheckPoint(). By doing the aforementioned code rearrangement we can get rid of frequent calls from StartupXLOG() and can completely remove the need for it in CreateEndOfRecoveryRecord() since that gets called only from StartupXLOG() directly. Whereas CreateCheckPoint() too gets called from StartupXLOG() when it is running in a standalone backend only, at that time we don't need to call LocalSetXLogInsertAllowed() but if that running in the Checkpointer process then we need that. I tried this in the attached version, but I'm a bit skeptical with changes that are needed for CreateCheckPoint(), those don't seem to be clean. I am wondering if we could completely remove the need to end of recovery checkpoint as proposed in [1], that would get rid of CHECKPOINT_END_OF_RECOVERY operation and the LocalSetXLogInsertAllowed() requirement in CreateCheckPoint(), and after that, we were not expecting checkpoint operation in recovery. If we could do that then we would have LocalSetXLogInsertAllowed() only at one place i.e. in StartupXLOG (...and in the future in XLogAcceptWrites()) -- the code that runs only once in a lifetime of the server and the kludge that the attached patch doing for CreateCheckPoint() will not be needed. > - It seems really tempting to invent a fourth RecoveryState value that > indicates that we are done with REDO but not yet in production, and > maybe also to rename RecoveryState to something like WALState. I'm > thinking of something like WAL_STATE_CRASH_RECOVERY, > WAL_STATE_ARCHIVE_RECOVERY, WAL_STATE_REDO_COMPLETE, and > WAL_STATE_PRODUCTION. Then, instead of having > LocalSetXLogInsertAllowed(), we could teach XLogInsertAllowed() that > the startup
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, Oct 12, 2021 at 8:18 AM Amul Sul wrote: > In the attached version I have fixed this issue by restoring > missingContrecPtr. > > To handle abortedRecPtr and missingContrecPtr newly added global > variables thought the commit # ff9f111bce24, we don't need to store > them in the shared memory separately, instead, we need a flag that > indicates a broken record found previously, at the end of recovery, so > that we can overwrite contrecord. > > The missingContrecPtr is assigned to the EndOfLog, and we have handled > EndOfLog previously in the 0004 patch, and the abortedRecPtr is the > same as the lastReplayedEndRecPtr, AFAICS. I have added an assert to > ensure that the lastReplayedEndRecPtr value is the same as the > abortedRecPtr, but I think that is not needed, we can go ahead and > write an overwrite-contrecord starting at lastReplayedEndRecPtr. I thought that it made sense to commit 0001 and 0002 at this point, so I have done that. I think that the treatment of missingContrecPtr and abortedRecPtr may need more thought yet, so at least for that reason I don't think it's a good idea to proceed with 0004 yet. 0003 is just code movement so I guess that can be committed whenever we're confident that we know exactly which things we want to end up inside XLogAcceptWrites(). I do have a few ideas after studying this a bit more: - I wonder whether, in addition to moving a few things later as 0002 did, we also ought to think about moving one thing earlier, specifically XLogReportParameters(). Right now, we have, I believe, four things that write WAL at the end of recovery: CreateOverwriteContrecordRecord(), UpdateFullPageWrites(), PerformRecoveryXLogAction(), and XLogReportParameters(). As the code is structured now, we do the first three of those things, and then do a bunch of other stuff inside CleanupAfterArchiveRecovery() like running recovery_end_command, and removing non-parent xlog files, and archiving the partial segment, and then come back and do the fourth one. Is there any good reason for that? If not, I think doing them all together would be cleaner, and would propose to reverse the order of CleanupAfterArchiveRecovery() and XLogReportParameters(). - If we did that, then I would further propose to adjust things so that we remove the call to LocalSetXLogInsertAllowed() and the assignment LocalXLogInsertAllowed = -1 from inside CreateEndOfRecoveryRecord(), the LocalXLogInsertAllowed = -1 from just after UpdateFullPageWrites(), and the call to LocalSetXLogInsertAllowed() just before XLogReportParameters(). Instead, just let the call to LocalSetXLogInsertAllowed() right before CreateOverwriteContrecordRecord() remain in effect. There doesn't seem to be much point in flipping that switch off and on again, and the fact that we have been doing so is in my view just evidence that StartupXLOG() doesn't do a very good job of getting related code all into one place. - It seems really tempting to invent a fourth RecoveryState value that indicates that we are done with REDO but not yet in production, and maybe also to rename RecoveryState to something like WALState. I'm thinking of something like WAL_STATE_CRASH_RECOVERY, WAL_STATE_ARCHIVE_RECOVERY, WAL_STATE_REDO_COMPLETE, and WAL_STATE_PRODUCTION. Then, instead of having LocalSetXLogInsertAllowed(), we could teach XLogInsertAllowed() that the startup process and the checkpointer are allowed to insert WAL when the state is WAL_STATE_REDO_COMPLETE, but other processes only once we reach WAL_STATE_PRODUCTION. We would set WAL_STATE_REDO_COMPLETE where we now call LocalSetXLogInsertAllowed(). It's necessary to include the checkpointer, or at least I think it is, because PerformRecoveryXLogAction() might call RequestCheckpoint(), and that's got to work. If we did this, then I think it would also solve another problem which the overall patch set has to address somehow. Say that we eventually move responsibility for the to-be-created XLogAcceptWrites() function from the startup process to the checkpointer, as proposed. The checkpointer needs to know when to call it ... and the answer with this change is simple: when we reach WAL_STATE_REDO_COMPLETE, it's time! But this idea is not completely problem-free. I spent some time poking at it and I think it's a little hard to come up with a satisfying way to code XLogInsertAllowed(). Right now that function calls RecoveryInProgress(), and if RecoveryInProgress() decides that recovery is no longer in progress, it calls InitXLOGAccess(). However, that presumes that the only reason you'd call RecoveryInProgress() is to figure out whether you should write WAL, which I don't think is really true, and it also means that, when the wal state is WAL_STATE_REDO_COMPLETE, RecoveryInProgress() would need to return true in the checkpointer and startup process and false everywhere else, which does not sound like a great idea. It seems fine to say that xlog insertion is allowed in some processes but not others, because not
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Oct 7, 2021 at 6:21 PM Amul Sul wrote: > > On Thu, Oct 7, 2021 at 5:56 AM Jaime Casanova > wrote: > > > > On Tue, Oct 05, 2021 at 04:11:58PM +0530, Amul Sul wrote: > > >On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia > > > wrote: > > > > > > > > I tried to apply the patch on the master branch head and it's failing > > > > with conflicts. > > > > > > > > > > Thanks, Rushabh, for the quick check, I have attached a rebased version > > > for the > > > latest master head commit # f6b5d05ba9a. > > > > > > > Hi, > > > > I got this error while executing "make check" on src/test/recovery: > > > > """ > > t/026_overwrite_contrecord.pl 1/3 # poll_query_until timed out > > executing this query: > > # SELECT '0/201D4D8'::pg_lsn <= pg_last_wal_replay_lsn() > > # expecting this output: > > # t > > # last actual query output: > > # f > > # with stderr: > > # Looks like your test exited with 29 just after 1. > > t/026_overwrite_contrecord.pl Dubious, test returned 29 (wstat > > 7424, 0x1d00) > > Failed 2/3 subtests > > > > Test Summary Report > > --- > > t/026_overwrite_contrecord.pl (Wstat: 7424 Tests: 1 Failed: 0) > > Non-zero exit status: 29 > > Parse errors: Bad plan. You planned 3 tests but ran 1. > > Files=26, Tests=279, 400 wallclock secs ( 0.27 usr 0.10 sys + 73.78 cusr > > 59.66 csys = 133.81 CPU) > > Result: FAIL > > make: *** [Makefile:23: check] Error 1 > > """ > > > > Thanks for the reporting problem, I am working on it. The cause of > failure is that v37_0004 patch clearing the missingContrecPtr global > variable before CreateOverwriteContrecordRecord() execution, which it > shouldn't. > In the attached version I have fixed this issue by restoring missingContrecPtr. To handle abortedRecPtr and missingContrecPtr newly added global variables thought the commit # ff9f111bce24, we don't need to store them in the shared memory separately, instead, we need a flag that indicates a broken record found previously, at the end of recovery, so that we can overwrite contrecord. The missingContrecPtr is assigned to the EndOfLog, and we have handled EndOfLog previously in the 0004 patch, and the abortedRecPtr is the same as the lastReplayedEndRecPtr, AFAICS. I have added an assert to ensure that the lastReplayedEndRecPtr value is the same as the abortedRecPtr, but I think that is not needed, we can go ahead and write an overwrite-contrecord starting at lastReplayedEndRecPtr. Regards, Amul From 5bf021226d9742a6fefbcb33e54f7ef044d8fbcc Mon Sep 17 00:00:00 2001 From: Amul Sul Date: Thu, 30 Sep 2021 06:29:06 -0400 Subject: [PATCH v38 4/4] Remove dependencies on startup-process specifical variables. To make XLogAcceptWrites(), need to dependency on few global and local variable spcific to startup process. Global variables are abortedRecPtr, missingContrecPtr, ArchiveRecoveryRequested and LocalPromoteIsTriggered, whereas LocalPromoteIsTriggered can be accessed in any other process using existing PromoteIsTriggered(). ArchiveRecoveryRequested is made accessible by copying into shared memory. abortedRecPtr and missingContrecPtr can get from the existing shared memory values but for that, we need a flag indicating that broken records was found previously and OVERWRITE_CONTRECORD message needs to be written when WAL writes permitted, added a flag for the same. XLogAcceptWrites() accepts two argument as EndOfLogTLI and EndOfLog which are local to StartupXLOG(). Instead of passing as an argument XLogCtl->replayEndTLI and XLogCtl->lastSegSwitchLSN from the shared memory can be used as an replacement to EndOfLogTLI and EndOfLog respectively. XLogCtl->lastSegSwitchLSN is not going to change until we use it. That changes only when the current WAL segment gets full which never going to happen because of two reasons, first WAL writes are disabled for other processes until XLogAcceptWrites() finishes and other reasons before use of lastSegSwitchLSN, XLogAcceptWrites() is writes fix size wal records as full-page write and record for either recovery end or checkpoint which not going to fill up the 16MB wal segment. EndOfLogTLI in the StartupXLOG() is the timeline ID of the last record that xlogreader reads, but this xlogreader was simply re-fetching the last record which we have replied in redo loop if it was in recovery, if not in recovery, we don't need to worry since this value is needed only in case of ArchiveRecoveryRequested = true, which implicitly forces redo and sets XLogCtl->replayEndTLI value. --- src/backend/access/transam/xlog.c | 90 --- 1 file changed, 72 insertions(+), 18 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index cdfec248081..b9596ca005c 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -668,6 +668,13 @@ typedef struct XLogCtlData */ bool SharedPromoteIsTriggered; + /* + * SharedArchiveRecoveryRequested exports the
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Oct 7, 2021 at 5:56 AM Jaime Casanova wrote: > > On Tue, Oct 05, 2021 at 04:11:58PM +0530, Amul Sul wrote: > >On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia > > wrote: > > > > > > I tried to apply the patch on the master branch head and it's failing > > > with conflicts. > > > > > > > Thanks, Rushabh, for the quick check, I have attached a rebased version for > > the > > latest master head commit # f6b5d05ba9a. > > > > Hi, > > I got this error while executing "make check" on src/test/recovery: > > """ > t/026_overwrite_contrecord.pl 1/3 # poll_query_until timed out > executing this query: > # SELECT '0/201D4D8'::pg_lsn <= pg_last_wal_replay_lsn() > # expecting this output: > # t > # last actual query output: > # f > # with stderr: > # Looks like your test exited with 29 just after 1. > t/026_overwrite_contrecord.pl Dubious, test returned 29 (wstat 7424, > 0x1d00) > Failed 2/3 subtests > > Test Summary Report > --- > t/026_overwrite_contrecord.pl (Wstat: 7424 Tests: 1 Failed: 0) > Non-zero exit status: 29 > Parse errors: Bad plan. You planned 3 tests but ran 1. > Files=26, Tests=279, 400 wallclock secs ( 0.27 usr 0.10 sys + 73.78 cusr > 59.66 csys = 133.81 CPU) > Result: FAIL > make: *** [Makefile:23: check] Error 1 > """ > Thanks for the reporting problem, I am working on it. The cause of failure is that v37_0004 patch clearing the missingContrecPtr global variable before CreateOverwriteContrecordRecord() execution, which it shouldn't. Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, Oct 05, 2021 at 04:11:58PM +0530, Amul Sul wrote: >On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia > wrote: > > > > I tried to apply the patch on the master branch head and it's failing > > with conflicts. > > > > Thanks, Rushabh, for the quick check, I have attached a rebased version for > the > latest master head commit # f6b5d05ba9a. > Hi, I got this error while executing "make check" on src/test/recovery: """ t/026_overwrite_contrecord.pl 1/3 # poll_query_until timed out executing this query: # SELECT '0/201D4D8'::pg_lsn <= pg_last_wal_replay_lsn() # expecting this output: # t # last actual query output: # f # with stderr: # Looks like your test exited with 29 just after 1. t/026_overwrite_contrecord.pl Dubious, test returned 29 (wstat 7424, 0x1d00) Failed 2/3 subtests Test Summary Report --- t/026_overwrite_contrecord.pl (Wstat: 7424 Tests: 1 Failed: 0) Non-zero exit status: 29 Parse errors: Bad plan. You planned 3 tests but ran 1. Files=26, Tests=279, 400 wallclock secs ( 0.27 usr 0.10 sys + 73.78 cusr 59.66 csys = 133.81 CPU) Result: FAIL make: *** [Makefile:23: check] Error 1 """ -- Jaime Casanova Director de Servicios Profesionales SystemGuards - Consultores de PostgreSQL
Re: [Patch] ALTER SYSTEM READ ONLY
On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia wrote: > > > > On Fri, Oct 1, 2021 at 2:29 AM Robert Haas wrote: >> >> On Thu, Sep 30, 2021 at 7:59 AM Amul Sul wrote: >> > To find the value of InRecovery after we clear it, patch still uses >> > ControlFile's DBState, but now the check condition changed to a more >> > specific one which is less confusing. >> > >> > In casual off-list discussion, the point was made to check >> > SharedRecoveryState to find out the InRecovery value afterward, and >> > check that using RecoveryInProgress(). But we can't depend on >> > SharedRecoveryState because at the start it gets initialized to >> > RECOVERY_STATE_CRASH irrespective of InRecovery that happens later. >> > Therefore, we can't use RecoveryInProgress() which always returns >> > true if SharedRecoveryState != RECOVERY_STATE_DONE. >> >> Uh, this change has crept into 0002, but it should be in 0004 with the >> rest of the changes to remove dependencies on variables specific to >> the startup process. Like I said before, we should really be trying to >> separate code movement from functional changes. Well, I have to replace the InRecovery flag in that patch since we are moving code past to the point where the InRecovery flag gets cleared. If I don't do, then the 0002 patch would be wrong since InRecovery is always false, and behaviour won't be the same as it was before that patch. >> Also, 0002 doesn't >> actually apply for me. Did you generate these patches with 'git >> format-patch'? >> >> [rhaas pgsql]$ patch -p1 < >> ~/Downloads/v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch >> patching file src/backend/access/transam/xlog.c >> Hunk #1 succeeded at 889 (offset 9 lines). >> Hunk #2 succeeded at 939 (offset 12 lines). >> Hunk #3 succeeded at 5734 (offset 37 lines). >> Hunk #4 succeeded at 8038 (offset 70 lines). >> Hunk #5 succeeded at 8248 (offset 70 lines). >> [rhaas pgsql]$ patch -p1 < >> ~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch >> patching file src/backend/access/transam/xlog.c >> Reversed (or previously applied) patch detected! Assume -R? [n] >> Apply anyway? [n] y >> Hunk #1 FAILED at 7954. >> Hunk #2 succeeded at 8079 (offset 70 lines). >> 1 out of 2 hunks FAILED -- saving rejects to file >> src/backend/access/transam/xlog.c.rej >> [rhaas pgsql]$ git reset --hard >> HEAD is now at b484ddf4d2 Treat ETIMEDOUT as indicating a >> non-recoverable connection failure. >> [rhaas pgsql]$ patch -p1 < >> ~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch >> patching file src/backend/access/transam/xlog.c >> Reversed (or previously applied) patch detected! Assume -R? [n] >> Apply anyway? [n] >> Skipping patch. >> 2 out of 2 hunks ignored -- saving rejects to file >> src/backend/access/transam/xlog.c.rej >> > > I tried to apply the patch on the master branch head and it's failing > with conflicts. > Thanks, Rushabh, for the quick check, I have attached a rebased version for the latest master head commit # f6b5d05ba9a. > Later applied patch on below commit and it got applied cleanly: > > commit 7d1aa6bf1c27bf7438179db446f7d1e72ae093d0 > Author: Tom Lane > Date: Mon Sep 27 18:48:01 2021 -0400 > > Re-enable contrib/bloom's TAP tests. > > rushabh@rushabh:postgresql$ git apply > v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch > rushabh@rushabh:postgresql$ git apply > v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch > rushabh@rushabh:postgresql$ git apply > v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch > v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:34: space > before tab in indent. > /* > v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:38: space > before tab in indent. > */ > v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:39: space > before tab in indent. > Insert->fullPageWrites = lastFullPageWrites; > warning: 3 lines add whitespace errors. > rushabh@rushabh:postgresql$ git apply > v36-0004-Remove-dependencies-on-startup-process-specifica.patch > > There are whitespace errors on patch 0003. > Fixed. >> >> It seems to me that the approach you're pursuing here can't work, >> because the long-term goal is to get to a place where, if the system >> starts up read-only, XLogAcceptWrites() might not be called until >> later, after StartupXLOG() has exited. But in that case the control >> file state would be DB_IN_PRODUCTION. But my idea of using >> RecoveryInProgress() won't work either, because we set >> RECOVERY_STATE_DONE just after we set DB_IN_PRODUCTION. Put >> differently, the question we want to answer is not "are we in recovery >> now?" but "did we perform recovery?". After studying the code a bit, I >> think a good test might be >> !XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr). If InRecovery >> gets set to true, then we're certain to enter the if (InRecovery) >> block that contains the main redo loop. And
Re: [Patch] ALTER SYSTEM READ ONLY
On Fri, Oct 1, 2021 at 2:29 AM Robert Haas wrote: > On Thu, Sep 30, 2021 at 7:59 AM Amul Sul wrote: > > To find the value of InRecovery after we clear it, patch still uses > > ControlFile's DBState, but now the check condition changed to a more > > specific one which is less confusing. > > > > In casual off-list discussion, the point was made to check > > SharedRecoveryState to find out the InRecovery value afterward, and > > check that using RecoveryInProgress(). But we can't depend on > > SharedRecoveryState because at the start it gets initialized to > > RECOVERY_STATE_CRASH irrespective of InRecovery that happens later. > > Therefore, we can't use RecoveryInProgress() which always returns > > true if SharedRecoveryState != RECOVERY_STATE_DONE. > > Uh, this change has crept into 0002, but it should be in 0004 with the > rest of the changes to remove dependencies on variables specific to > the startup process. Like I said before, we should really be trying to > separate code movement from functional changes. Also, 0002 doesn't > actually apply for me. Did you generate these patches with 'git > format-patch'? > > [rhaas pgsql]$ patch -p1 < > ~/Downloads/v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch > patching file src/backend/access/transam/xlog.c > Hunk #1 succeeded at 889 (offset 9 lines). > Hunk #2 succeeded at 939 (offset 12 lines). > Hunk #3 succeeded at 5734 (offset 37 lines). > Hunk #4 succeeded at 8038 (offset 70 lines). > Hunk #5 succeeded at 8248 (offset 70 lines). > [rhaas pgsql]$ patch -p1 < > ~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch > patching file src/backend/access/transam/xlog.c > Reversed (or previously applied) patch detected! Assume -R? [n] > Apply anyway? [n] y > Hunk #1 FAILED at 7954. > Hunk #2 succeeded at 8079 (offset 70 lines). > 1 out of 2 hunks FAILED -- saving rejects to file > src/backend/access/transam/xlog.c.rej > [rhaas pgsql]$ git reset --hard > HEAD is now at b484ddf4d2 Treat ETIMEDOUT as indicating a > non-recoverable connection failure. > [rhaas pgsql]$ patch -p1 < > ~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch > patching file src/backend/access/transam/xlog.c > Reversed (or previously applied) patch detected! Assume -R? [n] > Apply anyway? [n] > Skipping patch. > 2 out of 2 hunks ignored -- saving rejects to file > src/backend/access/transam/xlog.c.rej > > I tried to apply the patch on the master branch head and it's failing with conflicts. Later applied patch on below commit and it got applied cleanly: commit 7d1aa6bf1c27bf7438179db446f7d1e72ae093d0 Author: Tom Lane Date: Mon Sep 27 18:48:01 2021 -0400 Re-enable contrib/bloom's TAP tests. rushabh@rushabh:postgresql$ git apply v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch rushabh@rushabh:postgresql$ git apply v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch rushabh@rushabh:postgresql$ git apply v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:34: space before tab in indent. /* v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:38: space before tab in indent. */ v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:39: space before tab in indent. Insert->fullPageWrites = lastFullPageWrites; warning: 3 lines add whitespace errors. rushabh@rushabh:postgresql$ git apply v36-0004-Remove-dependencies-on-startup-process-specifica.patch There are whitespace errors on patch 0003. > It seems to me that the approach you're pursuing here can't work, > because the long-term goal is to get to a place where, if the system > starts up read-only, XLogAcceptWrites() might not be called until > later, after StartupXLOG() has exited. But in that case the control > file state would be DB_IN_PRODUCTION. But my idea of using > RecoveryInProgress() won't work either, because we set > RECOVERY_STATE_DONE just after we set DB_IN_PRODUCTION. Put > differently, the question we want to answer is not "are we in recovery > now?" but "did we perform recovery?". After studying the code a bit, I > think a good test might be > !XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr). If InRecovery > gets set to true, then we're certain to enter the if (InRecovery) > block that contains the main redo loop. And that block unconditionally > does XLogCtl->lastReplayedEndRecPtr = XLogCtl->replayEndRecPtr. I > think that replayEndRecPtr can't be 0 because it's supposed to > represent the record we're pretending to have last replayed, as > explained by the comments. And while lastReplayedEndRecPtr will get > updated later as we replay more records, I think it will never be set > back to 0. It's only going to increase, as we replay more records. On > the other hand if InRecovery = false then we'll never change it, and > it seems that it starts out as 0. > > I was hoping to have more time today to comment on 0004, but the
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Sep 30, 2021 at 7:59 AM Amul Sul wrote: > To find the value of InRecovery after we clear it, patch still uses > ControlFile's DBState, but now the check condition changed to a more > specific one which is less confusing. > > In casual off-list discussion, the point was made to check > SharedRecoveryState to find out the InRecovery value afterward, and > check that using RecoveryInProgress(). But we can't depend on > SharedRecoveryState because at the start it gets initialized to > RECOVERY_STATE_CRASH irrespective of InRecovery that happens later. > Therefore, we can't use RecoveryInProgress() which always returns > true if SharedRecoveryState != RECOVERY_STATE_DONE. Uh, this change has crept into 0002, but it should be in 0004 with the rest of the changes to remove dependencies on variables specific to the startup process. Like I said before, we should really be trying to separate code movement from functional changes. Also, 0002 doesn't actually apply for me. Did you generate these patches with 'git format-patch'? [rhaas pgsql]$ patch -p1 < ~/Downloads/v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch patching file src/backend/access/transam/xlog.c Hunk #1 succeeded at 889 (offset 9 lines). Hunk #2 succeeded at 939 (offset 12 lines). Hunk #3 succeeded at 5734 (offset 37 lines). Hunk #4 succeeded at 8038 (offset 70 lines). Hunk #5 succeeded at 8248 (offset 70 lines). [rhaas pgsql]$ patch -p1 < ~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch patching file src/backend/access/transam/xlog.c Reversed (or previously applied) patch detected! Assume -R? [n] Apply anyway? [n] y Hunk #1 FAILED at 7954. Hunk #2 succeeded at 8079 (offset 70 lines). 1 out of 2 hunks FAILED -- saving rejects to file src/backend/access/transam/xlog.c.rej [rhaas pgsql]$ git reset --hard HEAD is now at b484ddf4d2 Treat ETIMEDOUT as indicating a non-recoverable connection failure. [rhaas pgsql]$ patch -p1 < ~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch patching file src/backend/access/transam/xlog.c Reversed (or previously applied) patch detected! Assume -R? [n] Apply anyway? [n] Skipping patch. 2 out of 2 hunks ignored -- saving rejects to file src/backend/access/transam/xlog.c.rej It seems to me that the approach you're pursuing here can't work, because the long-term goal is to get to a place where, if the system starts up read-only, XLogAcceptWrites() might not be called until later, after StartupXLOG() has exited. But in that case the control file state would be DB_IN_PRODUCTION. But my idea of using RecoveryInProgress() won't work either, because we set RECOVERY_STATE_DONE just after we set DB_IN_PRODUCTION. Put differently, the question we want to answer is not "are we in recovery now?" but "did we perform recovery?". After studying the code a bit, I think a good test might be !XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr). If InRecovery gets set to true, then we're certain to enter the if (InRecovery) block that contains the main redo loop. And that block unconditionally does XLogCtl->lastReplayedEndRecPtr = XLogCtl->replayEndRecPtr. I think that replayEndRecPtr can't be 0 because it's supposed to represent the record we're pretending to have last replayed, as explained by the comments. And while lastReplayedEndRecPtr will get updated later as we replay more records, I think it will never be set back to 0. It's only going to increase, as we replay more records. On the other hand if InRecovery = false then we'll never change it, and it seems that it starts out as 0. I was hoping to have more time today to comment on 0004, but the day seems to have gotten away from me. One quick thought is that it looks a bit strange to be getting EndOfLog from GetLastSegSwitchData() which returns lastSegSwitchLSN while getting EndOfLogTLI from replayEndTLI ... because there's also replayEndRecPtr, which seems to go with replayEndTLI. It feels like we should use a source for the TLI that clearly matches the source for the corresponding LSN, unless there's some super-good reason to do otherwise. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Fri, Sep 24, 2021 at 5:07 PM Amul Sul wrote: > > On Thu, Sep 23, 2021 at 11:56 PM Robert Haas wrote: > > > > On Mon, Sep 20, 2021 at 11:20 AM Amul Sul wrote: > > > Ok, understood, I have separated my changes into 0001 and 0002 patch, > > > and the refactoring patches start from 0003. > > > > I think it would be better in the other order, with the refactoring > > patches at the beginning of the series. > > > > Ok, will do that. I did this other way to minimize the diff e.g. > deletion diff of RecoveryXlogAction enum and > DetermineRecoveryXlogAction(), etc. > I have reversed the patch order. Now refactoring patches will be first, and the patch that removes the dependencies on global & local variables will be the last. I did the necessary modification in the refactoring patches too e.g. removed DetermineRecoveryXlogAction() and RecoveryXlogAction enum which is no longer needed (thanks to commit # 1d919de5eb3fffa7cc9479ed6d2915fb89794459 to make code simple). To find the value of InRecovery after we clear it, patch still uses ControlFile's DBState, but now the check condition changed to a more specific one which is less confusing. In casual off-list discussion, the point was made to check SharedRecoveryState to find out the InRecovery value afterward, and check that using RecoveryInProgress(). But we can't depend on SharedRecoveryState because at the start it gets initialized to RECOVERY_STATE_CRASH irrespective of InRecovery that happens later. Therefore, we can't use RecoveryInProgress() which always returns true if SharedRecoveryState != RECOVERY_STATE_DONE. I am posting only refactoring patches for now. Regards, Amul From 730e8331fefc882b4cab7112adf0f4d8da1ea831 Mon Sep 17 00:00:00 2001 From: Amul Sul Date: Thu, 30 Sep 2021 06:29:06 -0400 Subject: [PATCH v36 4/4] Remove dependencies on startup-process specifical variables. To make XLogAcceptWrites(), need to dependency on few global and local variable spcific to startup process. Global variables are ArchiveRecoveryRequested and LocalPromoteIsTriggered, whereas LocalPromoteIsTriggered can be accessed in any other process using existing PromoteIsTriggered(). ArchiveRecoveryRequested is made accessible by copying into shared memory. XLogAcceptWrites() accepts two argument as EndOfLogTLI and EndOfLog which are local to StartupXLOG(). Instead of passing as an argument XLogCtl->replayEndTLI and XLogCtl->lastSegSwitchLSN from the shared memory can be used as an replacement to EndOfLogTLI and EndOfLog respectively. XLogCtl->lastSegSwitchLSN is not going to change until we use it. That changes only when the current WAL segment gets full which never going to happen because of two reasons, first WAL writes are disabled for other processes until XLogAcceptWrites() finishes and other reasons before use of lastSegSwitchLSN, XLogAcceptWrites() is writes fix size wal records as full-page write and record for either recovery end or checkpoint which not going to fill up the 16MB wal segment. EndOfLogTLI in the StartupXLOG() is the timeline ID of the last record that xlogreader reads, but this xlogreader was simply re-fetching the last record which we have replied in redo loop if it was in recovery, if not in recovery, we don't need to worry since this value is needed only in case of ArchiveRecoveryRequested = true, which implicitly forces redo and sets XLogCtl->replayEndTLI value. --- src/backend/access/transam/xlog.c | 36 ++- 1 file changed, 26 insertions(+), 10 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 91cdd7d9ff2..5b4e5ac379f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -659,6 +659,13 @@ typedef struct XLogCtlData */ bool SharedPromoteIsTriggered; + /* + * SharedArchiveRecoveryRequested exports the value of the + * ArchiveRecoveryRequested flag to be share which is otherwise valid only + * in the startup process. + */ + bool SharedArchiveRecoveryRequested; + /* * WalWriterSleeping indicates whether the WAL writer is currently in * low-power mode (and hence should be nudged if an async commit occurs). @@ -880,8 +887,7 @@ static MemoryContext walDebugCxt = NULL; static void readRecoverySignalFile(void); static void validateRecoveryParameters(void); static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog); -static void CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, - XLogRecPtr EndOfLog); +static void CleanupAfterArchiveRecovery(void); static bool recoveryStopsBefore(XLogReaderState *record); static bool recoveryStopsAfter(XLogReaderState *record); static char *getRecoveryStopReason(void); @@ -927,7 +933,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force); static XLogRecord *ReadRecord(XLogReaderState *xlogreader, int emode, bool fetching_ckpt); static void CheckRecoveryConsistency(void); -static bool XLogAcceptWrites(TimeLineID
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Sep 23, 2021 at 11:56 PM Robert Haas wrote: > > On Mon, Sep 20, 2021 at 11:20 AM Amul Sul wrote: > > Ok, understood, I have separated my changes into 0001 and 0002 patch, > > and the refactoring patches start from 0003. > > I think it would be better in the other order, with the refactoring > patches at the beginning of the series. > Ok, will do that. I did this other way to minimize the diff e.g. deletion diff of RecoveryXlogAction enum and DetermineRecoveryXlogAction(), etc. > > In the 0001 patch, I have copied ArchiveRecoveryRequested to shared > > memory as said previously. Coping ArchiveRecoveryRequested value to > > shared memory is not really interesting, and I think somehow we should > > reuse existing variable, (perhaps, with some modification of the > > information it can store, e.g. adding one more enum value for > > SharedRecoveryState or something else), thinking on the same. > > > > In addition to that, I tried to turn down the scope of > > ArchiveRecoveryRequested global variable. Now, this is a static > > variable, and the scope is limited to xlog.c file like > > LocalXLogInsertAllowed and can be accessed through the newly added > > function ArchiveRecoveryIsRequested() (like PromoteIsTriggered()). Let > > me know what you think about the approach. > > I'm not sure yet whether I like this or not, but it doesn't seem like > a terrible idea. You spelled UNKNOWN wrong, though, which does seem > like a terrible idea. :-) "acccsed" is not correct either. > > Also, the new comments for ArchiveRecoveryRequested / > ARCHIVE_RECOVERY_REQUEST_* are really not very clear. All you did is > substitute the new terminology into the existing comment, but that > means that the purpose of the new "unknown" value is not at all clear. > Ok, will fix those typos and try to improve the comments. > Consider the following two patch fragments: > > + * SharedArchiveRecoveryRequested indicates whether an archive recovery is > + * requested. Protected by info_lck. > ... > + * Remember archive recovery request in shared memory state. A lock is not > + * needed since we are the only ones who updating this. > > These two comments directly contradict each other. > Okay, the first comment is not clear enough, I will fix that too. I meant we don't need the lock now since we are the only one updating at this point. > + SpinLockAcquire(>info_lck); > + XLogCtl->SharedArchiveRecoveryRequested = false; > + ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN; > + SpinLockRelease(>info_lck); > > This seems odd to me. In the first place, there doesn't seem to be any > value in clearing this -- we're just expending extra CPU cycles to get > rid of a value that wouldn't be used anyway. In the second place, if > somehow someone checked the value after this point, with this code, > they might get the wrong answer, whereas if you just deleted this, > they would get the right answer. > Previously, this flag was only valid in the startup process. But now it will be valid for all the processes and will stay until the whole server gets restarted. I don't want anybody to use this flag after the cleanup point and just be sure I am explicitly cleaning this. By the way, I also don't expect we should go with this approach. I proposed this by referring to the PromoteIsTriggered() implementation, but IMO, it is better to have something else since we just want to perform archive cleanup operation, and most of the work related to archive recovery was done inside the StartupXLOG(). Rather than the proposed design, I was thinking of adding one or two more RecoveryState enums. And while skipping XLogAcceptsWrite() set XLogCtl->SharedRecoveryState appropriately, so that we can easily identify that the archive recovery was requested previously and now, we need to perform its pending cleanup operation. Thoughts? > > In 0002 patch is a mixed one where I tried to remove the dependencies > > on global variables and local variables belonging to StartupXLOG(). I > > am still worried about the InRecovery value that needs to be deduced > > afterward inside XLogAcceptWrites(). Currently, relying on > > ControlFile->state != DB_SHUTDOWNED check but I think that will not be > > good for ASRO where we plan to skip XLogAcceptWrites() work only and > > let the StartupXLOG() do the rest of the work as it is where it will > > going to update ControlFile's DBState to DB_IN_PRODUCTION, then we > > might need some ugly kludge to call PerformRecoveryXLogAction() in > > checkpointer irrespective of DBState, which makes me a bit > > uncomfortable. > > I think that replacing the if (InRecovery) test with if > (ControlFile->state != DB_SHUTDOWNED) is probably just plain wrong. I > mean, there are three separate places where we set InRecovery = true. > One of those executes if ControlFile->state != DB_SHUTDOWNED, matching > what you have here, but it also can happen if checkPoint.redo < > RecPtr, or if read_backup_label is true and
Re: [Patch] ALTER SYSTEM READ ONLY
On Mon, Sep 20, 2021 at 11:20 AM Amul Sul wrote: > Ok, understood, I have separated my changes into 0001 and 0002 patch, > and the refactoring patches start from 0003. I think it would be better in the other order, with the refactoring patches at the beginning of the series. > In the 0001 patch, I have copied ArchiveRecoveryRequested to shared > memory as said previously. Coping ArchiveRecoveryRequested value to > shared memory is not really interesting, and I think somehow we should > reuse existing variable, (perhaps, with some modification of the > information it can store, e.g. adding one more enum value for > SharedRecoveryState or something else), thinking on the same. > > In addition to that, I tried to turn down the scope of > ArchiveRecoveryRequested global variable. Now, this is a static > variable, and the scope is limited to xlog.c file like > LocalXLogInsertAllowed and can be accessed through the newly added > function ArchiveRecoveryIsRequested() (like PromoteIsTriggered()). Let > me know what you think about the approach. I'm not sure yet whether I like this or not, but it doesn't seem like a terrible idea. You spelled UNKNOWN wrong, though, which does seem like a terrible idea. :-) "acccsed" is not correct either. Also, the new comments for ArchiveRecoveryRequested / ARCHIVE_RECOVERY_REQUEST_* are really not very clear. All you did is substitute the new terminology into the existing comment, but that means that the purpose of the new "unknown" value is not at all clear. Consider the following two patch fragments: + * SharedArchiveRecoveryRequested indicates whether an archive recovery is + * requested. Protected by info_lck. ... + * Remember archive recovery request in shared memory state. A lock is not + * needed since we are the only ones who updating this. These two comments directly contradict each other. + SpinLockAcquire(>info_lck); + XLogCtl->SharedArchiveRecoveryRequested = false; + ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN; + SpinLockRelease(>info_lck); This seems odd to me. In the first place, there doesn't seem to be any value in clearing this -- we're just expending extra CPU cycles to get rid of a value that wouldn't be used anyway. In the second place, if somehow someone checked the value after this point, with this code, they might get the wrong answer, whereas if you just deleted this, they would get the right answer. > In 0002 patch is a mixed one where I tried to remove the dependencies > on global variables and local variables belonging to StartupXLOG(). I > am still worried about the InRecovery value that needs to be deduced > afterward inside XLogAcceptWrites(). Currently, relying on > ControlFile->state != DB_SHUTDOWNED check but I think that will not be > good for ASRO where we plan to skip XLogAcceptWrites() work only and > let the StartupXLOG() do the rest of the work as it is where it will > going to update ControlFile's DBState to DB_IN_PRODUCTION, then we > might need some ugly kludge to call PerformRecoveryXLogAction() in > checkpointer irrespective of DBState, which makes me a bit > uncomfortable. I think that replacing the if (InRecovery) test with if (ControlFile->state != DB_SHUTDOWNED) is probably just plain wrong. I mean, there are three separate places where we set InRecovery = true. One of those executes if ControlFile->state != DB_SHUTDOWNED, matching what you have here, but it also can happen if checkPoint.redo < RecPtr, or if read_backup_label is true and ReadCheckpointRecord returns non-NULL. Now maybe you're going to tell me that in those other two cases we can't reach here anyway, but I don't see off-hand why that should be true, and even if it is true, it seems like kind of a fragile thing to rely on. I think we need to rely on something in shared memory that is more explicitly connected to the question of whether we are in recovery. The other part of this patch has to do with whether we can use the return value of GetLastSegSwitchData as a substitute for relying on EndOfLog. Now as you have it, you end up creating a local variable called EndOfLog that shadows another such variable in an outer scope, which probably would not make anyone who noticed things in such a state very happy. However, that will naturally get fixed if you reorder the patches as per above, so let's turn to the central question: is this a good way of getting EndOfLog? The value that would be in effect at the time this code is executed is set here: XLogBeginRead(xlogreader, LastRec); record = ReadRecord(xlogreader, PANIC, false); EndOfLog = EndRecPtr; Subsequently we do this: /* start the archive_timeout timer and LSN running */ XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL); XLogCtl->lastSegSwitchLSN = EndOfLog; So at that point the value that GetLastSegSwitchData() would return has to match what's in the existing variable. But later XLogWrite() will change the value. So the question boils down to whether
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Sep 22, 2021 at 7:33 PM Mark Dilger wrote: > > > > > On Sep 22, 2021, at 6:39 AM, Amul Sul wrote: > > > > Yes, that is a bit longer, here is the snip from v35-0010 patch > > Right, that's longer, and only tests one interaction. The isolation spec I > posted upthread tests multiple interactions between the session which uses > cursors and the system going read-only. Whether the session using a cursor > gets a FATAL, just an ERROR, or neither depends on where it is in the process > of opening, using, closing and committing. I think that's interesting. If > the implementation of the ALTER SESSION READ ONLY feature were to change in > such a way as, for example, to make the attempt to open the cursor be a FATAL > error, you'd see a change in the test output. > Agreed. Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
> On Sep 22, 2021, at 6:39 AM, Amul Sul wrote: > > Yes, that is a bit longer, here is the snip from v35-0010 patch Right, that's longer, and only tests one interaction. The isolation spec I posted upthread tests multiple interactions between the session which uses cursors and the system going read-only. Whether the session using a cursor gets a FATAL, just an ERROR, or neither depends on where it is in the process of opening, using, closing and committing. I think that's interesting. If the implementation of the ALTER SESSION READ ONLY feature were to change in such a way as, for example, to make the attempt to open the cursor be a FATAL error, you'd see a change in the test output. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Sep 22, 2021 at 6:59 PM Mark Dilger wrote: > > > > > On Sep 22, 2021, at 6:14 AM, Amul Sul wrote: > > > >> Attached patch v34-0010 adds a test of cursors opened FOR UPDATE > >> interacting with a system that is set read-only by a different session. > >> The expected output is worth reviewing to see how this plays out. I don't > >> see anything in there which is obviously wrong, but some of it is a bit > >> clunky. For example, by the time the client sees an error "FATAL: WAL is > >> now prohibited", the system may already have switched back to read-write. > >> Also, it is a bit strange to get one of these errors on an attempted > >> ROLLBACK. Once again, not wrong as such, but clunky. > >> > > > > Can't we do the same in the TAP test? If the intention is only to test > > session termination when the system changes to WAL are prohibited then > > that I have added in the latest version, but that test does not > > reinitiate the same connection again, I think that is not possible > > there too. > > Perhaps you can point me to a TAP test that does this in a concise fashion. > When I tried writing a TAP test for this, it was much longer than the > equivalent isolation test spec. > Yes, that is a bit longer, here is the snip from v35-0010 patch: +my $psql_timeout = IPC::Run::timer(60); +my ($mysession_stdin, $mysession_stdout, $mysession_stderr) = ('', '', ''); +my $mysession = IPC::Run::start( + [ + 'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d', + $node_primary->connstr('postgres') + ], + '<', + \$mysession_stdin, + '>', + \$mysession_stdout, + '2>', + \$mysession_stderr, + $psql_timeout); + +# Write in transaction and get backend pid +$mysession_stdin .= q[ +BEGIN; +INSERT INTO tab VALUES(7); +SELECT $$value-7-inserted-into-tab$$; +]; +$mysession->pump until $mysession_stdout =~ /value-7-inserted-into-tab[\r\n]$/; +like($mysession_stdout, qr/value-7-inserted-into-tab/, + 'started write transaction in a session'); +$mysession_stdout = ''; +$mysession_stderr = ''; + +# Change to WAL prohibited +$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)'); +is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on', + 'server is changed to wal prohibited by another session'); + +# Try to commit open write transaction. +$mysession_stdin .= q[ +COMMIT; +]; +$mysession->pump; +like($mysession_stderr, qr/FATAL: WAL is now prohibited/, + 'session with open write transaction is terminated'); Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
> On Sep 22, 2021, at 6:14 AM, Amul Sul wrote: > >> Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting >> with a system that is set read-only by a different session. The expected >> output is worth reviewing to see how this plays out. I don't see anything >> in there which is obviously wrong, but some of it is a bit clunky. For >> example, by the time the client sees an error "FATAL: WAL is now >> prohibited", the system may already have switched back to read-write. Also, >> it is a bit strange to get one of these errors on an attempted ROLLBACK. >> Once again, not wrong as such, but clunky. >> > > Can't we do the same in the TAP test? If the intention is only to test > session termination when the system changes to WAL are prohibited then > that I have added in the latest version, but that test does not > reinitiate the same connection again, I think that is not possible > there too. Perhaps you can point me to a TAP test that does this in a concise fashion. When I tried writing a TAP test for this, it was much longer than the equivalent isolation test spec. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Sep 15, 2021 at 4:34 AM Mark Dilger wrote: > > > > > On Jun 16, 2020, at 6:55 AM, amul sul wrote: > > > > (2) if the session is idle, we also need the top-level abort > > record to be written immediately, but can't send an error to the client > > until the next > > command is issued without losing wire protocol synchronization. For now, we > > just use > > FATAL to kill the session; maybe this can be improved in the future. > > Andres, > > I'd like to have a patch that tests the impact of a vacuum running for xid > wraparound purposes, blocked on a pinned page held by the cursor, when > another session disables WAL. It would be very interesting to test how the > vacuum handles that specific change. I have not figured out the cleanest way > to do this, though, as we don't as a project yet have a standard way of > setting up xid exhaustion in a regression test, do we? The closest I saw to > it was your work in [1], but that doesn't seem to have made much headway > recently, and is designed for the TAP testing infrastructure, which isn't > useable from inside an isolation test. Do you have a suggestion how best to > continue developing out the test infrastructure? > > > Amul, > > The most obvious way to test how your ALTER SYSTEM READ ONLY feature > interacts with concurrent sessions is using the isolation tester in > src/test/isolation/, but as it stands now, the first permutation that gets a > FATAL causes the test to abort and all subsequent permutations to not run. > Attached patch v34-0009 fixes that. > Interesting. > Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting > with a system that is set read-only by a different session. The expected > output is worth reviewing to see how this plays out. I don't see anything in > there which is obviously wrong, but some of it is a bit clunky. For example, > by the time the client sees an error "FATAL: WAL is now prohibited", the > system may already have switched back to read-write. Also, it is a bit > strange to get one of these errors on an attempted ROLLBACK. Once again, not > wrong as such, but clunky. > Can't we do the same in the TAP test? If the intention is only to test session termination when the system changes to WAL are prohibited then that I have added in the latest version, but that test does not reinitiate the same connection again, I think that is not possible there too. Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Sep 15, 2021 at 9:38 PM Robert Haas wrote: > > On Wed, Sep 15, 2021 at 10:32 AM Robert Haas wrote: > > Putting these changes into 0001 seems to make no sense. It seems like > > they should be part of 0003, or maybe a new 0004 patch. > > After looking at this a little bit more, I think it's really necessary > to separate out all of your changes into separate patches at least for > initial review. It's particularly important to separate code movement > changes from other kinds of changes. 0001 was just moving code before, > and so was 0002, but now both are making other changes, which is not > easy to see from looking at the 'git diff' output. For that reason > it's not so easy to understand exactly what you've changed here and > analyze it. > Ok, understood, I have separated my changes into 0001 and 0002 patch, and the refactoring patches start from 0003. In the 0001 patch, I have copied ArchiveRecoveryRequested to shared memory as said previously. Coping ArchiveRecoveryRequested value to shared memory is not really interesting, and I think somehow we should reuse existing variable, (perhaps, with some modification of the information it can store, e.g. adding one more enum value for SharedRecoveryState or something else), thinking on the same. In addition to that, I tried to turn down the scope of ArchiveRecoveryRequested global variable. Now, this is a static variable, and the scope is limited to xlog.c file like LocalXLogInsertAllowed and can be accessed through the newly added function ArchiveRecoveryIsRequested() (like PromoteIsTriggered()). Let me know what you think about the approach. In 0002 patch is a mixed one where I tried to remove the dependencies on global variables and local variables belonging to StartupXLOG(). I am still worried about the InRecovery value that needs to be deduced afterward inside XLogAcceptWrites(). Currently, relying on ControlFile->state != DB_SHUTDOWNED check but I think that will not be good for ASRO where we plan to skip XLogAcceptWrites() work only and let the StartupXLOG() do the rest of the work as it is where it will going to update ControlFile's DBState to DB_IN_PRODUCTION, then we might need some ugly kludge to call PerformRecoveryXLogAction() in checkpointer irrespective of DBState, which makes me a bit uncomfortable. > I poked around a little bit at these patches, looking for > perhaps-interesting global variables upon which the code called from > XLogAcceptWrites() would depend with your patches applied. The most > interesting ones seem to be (1) ThisTimeLineID, which you mentioned > and which may be fine but I'm not totally convinced yet, (2) > LocalXLogInsertAllowed, which is probably not broken but I'm thinking > we may want to redesign that mechanism somehow to make it cleaner, and Thanks for the off-list detailed explanation on this. For somebody else who might be reading this, the concern here is (not really a concern, it is a good thing to improve) the LocalSetXLogInsertAllowed() function call, is a kind of hack that enables wal writes irrespective of RecoveryInProgress() for a shorter period. E.g. see following code in StartupXLOG: " LocalSetXLogInsertAllowed(); UpdateFullPageWrites(); LocalXLogInsertAllowed = -1; /* * If any of the critical GUCs have changed, log them before we allow * backends to write WAL. */ LocalSetXLogInsertAllowed(); XLogReportParameters(); " Instead of explicitly enabling wal insert, somehow that implicitly allowed for the startup process and/or the checkpointer doing the first checkpoint and/or wal writes after the recovery. Well, the current LocalSetXLogInsertAllowed() mechanism is not really harming anything or bad and does not necessarily need to change but it would be nice if we were able to come up with something much cleaner, bug-free, and 100% perfect enough design. (Hope I am not missing anything from the discussion). > (3) CheckpointStats, which is called from RemoveXlogFile which is > called from RemoveNonParentXlogFiles which is called from > CleanupAfterArchiveRecovery which is called from XLogAcceptWrites. > This last one is actually pretty weird already in the existing code. > It sort of looks like RemoveXlogFile() only expects to be called from > the checkpointer (or a standalone backend) so that it can update > CheckpointStats and have that just work, but actually it's also called > from the startup process when a timeline switch happens. I don't know > whether the fact that the increments to ckpt_segs_recycled get lost in > that case should be considered an intentional behavior that should be > preserved or an inadvertent mistake. > Maybe I could be wrong, but I think that is intentional. It removes pre-allocated or bogus files of the old timeline which are not supposed to be considered in stats. The comments for CheckpointStatsData might not be clear but comment at the calling RemoveNonParentXlogFiles() place inside StartupXLOG hints the same: " /* *
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Sep 15, 2021 at 10:32 AM Robert Haas wrote: > Putting these changes into 0001 seems to make no sense. It seems like > they should be part of 0003, or maybe a new 0004 patch. After looking at this a little bit more, I think it's really necessary to separate out all of your changes into separate patches at least for initial review. It's particularly important to separate code movement changes from other kinds of changes. 0001 was just moving code before, and so was 0002, but now both are making other changes, which is not easy to see from looking at the 'git diff' output. For that reason it's not so easy to understand exactly what you've changed here and analyze it. I poked around a little bit at these patches, looking for perhaps-interesting global variables upon which the code called from XLogAcceptWrites() would depend with your patches applied. The most interesting ones seem to be (1) ThisTimeLineID, which you mentioned and which may be fine but I'm not totally convinced yet, (2) LocalXLogInsertAllowed, which is probably not broken but I'm thinking we may want to redesign that mechanism somehow to make it cleaner, and (3) CheckpointStats, which is called from RemoveXlogFile which is called from RemoveNonParentXlogFiles which is called from CleanupAfterArchiveRecovery which is called from XLogAcceptWrites. This last one is actually pretty weird already in the existing code. It sort of looks like RemoveXlogFile() only expects to be called from the checkpointer (or a standalone backend) so that it can update CheckpointStats and have that just work, but actually it's also called from the startup process when a timeline switch happens. I don't know whether the fact that the increments to ckpt_segs_recycled get lost in that case should be considered an intentional behavior that should be preserved or an inadvertent mistake. So I think you've covered most of the necessary things here, with probably some more discussion needed on whether you've done the right things... -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Sep 15, 2021 at 6:49 AM Amul Sul wrote: > Initially, I thought to > use SharedRecoveryState which is always set to RECOVERY_STATE_ARCHIVE, > if the archive recovery requested. But there is another case where > SharedRecoveryState could be RECOVERY_STATE_ARCHIVE irrespective of > ArchiveRecoveryRequested value, that is the presence of a backup label > file. Right, there's a difference between whether archive recovery has been *requested* and whether it is actually *happening*. > If we want to use SharedRecoveryState, we need one more state > which could differentiate between ArchiveRecoveryRequested and the > backup label file presence case. To move ahead, I have copied > ArchiveRecoveryRequested into shared memory and it will be cleared > once archive cleanup is finished. With all these changes, we could get > rid of xlogaction argument and DetermineRecoveryXlogAction() function. > Could move its logic to PerformRecoveryXLogAction() directly. Putting these changes into 0001 seems to make no sense. It seems like they should be part of 0003, or maybe a new 0004 patch. > And for ThisTimeLineID, I don't think we need to do anything since this > value is available with all the backend as per the following comment: > " > /* > * ThisTimeLineID will be same in all backends --- it identifies current > * WAL timeline for the database system. > */ > TimeLineID ThisTimeLineID = 0; I'm not sure I find that argument totally convincing. The two variables aren't assigned at exactly the same places in the code, nonwithstanding the comment. I'm not saying you're wrong. I'm just saying I don't believe it just because the comment says so. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
, On Sat, Jul 24, 2021 at 1:33 AM Robert Haas wrote: > > On Thu, Jun 17, 2021 at 1:23 AM Amul Sul wrote: > > Attached is rebase for the latest master head. Also, I added one more > > refactoring code that deduplicates the code setting database state in the > > control file. The same code set the database state is also needed for this > > feature. > > I started studying 0001 today and found that it rearranged the order > of operations in StartupXLOG() more than I was expecting. It does, as > per previous discussions, move a bunch of things to the place where we > now call XLogParamters(). But, unsatisfyingly, InRecovery = false and > XLogReaderFree() then have to move down even further. Since the goal > here is to get to a situation where we sometimes XLogAcceptWrites() > after InRecovery = false, it didn't seem nice for this refactoring > patch to still end up with a situation where this stuff happens while > InRecovery = true. In fact, with the patch, the amount of code that > runs with InRecovery = true actually *increases*, which is not what I > think should be happening here. That's why the patch ends up having to > adjust SetMultiXactIdLimit to not Assert(!InRecovery). > > And then I started to wonder how this was ever going to work as part > of the larger patch set, because as you have it here, > XLogAcceptWrites() takes arguments XLogReaderState *xlogreader, > XLogRecPtr EndOfLog, and TimeLineID EndOfLogTLI and if the > checkpointer is calling that at a later time after the user issues > pg_prohibit_wal(false), it's going to have none of those things. So I > had a quick look at that part of the code and found this in > checkpointer.c: > > XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0); > > For those following along from home, the additional "true" is a bool > needChkpt argument added to XLogAcceptWrites() by 0003. Well, none of > this is very satisfying. The whole purpose of passing the xlogreader > is so we can figure out whether we need a checkpoint (never mind the > question of whether the existing algorithm for determining that is > really sensible) but now we need a second argument that basically > serves the same purpose since one of the two callers to this function > won't have an xlogreader. And then we're passing the EndOfLog and > EndOfLogTLI as dummy values which seems like it's probably just > totally wrong, but if for some reason it works correctly there sure > don't seem to be any comments explaining why. > > So I started doing a bit of hacking myself and ended up with the > attached, which I think is not completely the right thing yet but I > think it's better than your version. I split this into three parts. > 0001 splits up the logic that currently decides whether to write an > end-of-recovery record or a checkpoint record and if the latter how > the checkpoint ought to be performed into two functions. > DetermineRecoveryXlogAction() figures out what we want to do, and > PerformRecoveryXlogAction() does it. It also moves the code to run > recovery_end_command and related stuff into a new function > CleanupAfterArchiveRecovery(). 0002 then builds on this by postponing > UpdateFullPageWrites(), PerformRecoveryXLogAction(), and > CleanupAfterArchiveRecovery() to just before we > XLogReportParameters(). Because of the refactoring done by 0001, this > is only a small amount of code movement. Because of the separation > between DetermineRecoveryXlogAction() and PerformRecoveryXlogAction(), > the latter doesn't need the xlogreader. So we can do > DetermineRecoveryXlogAction() at the same time as now, while the > xlogreader is available, and then we don't need it later when we > PerformRecoveryXlogAction(), because we already know what we need to > know. I think this is all fine as far as it goes. > > My 0003 is where I see some lingering problems. It creates > XLogAcceptWrites(), moves the appropriate stuff there, and doesn't > need the xlogreader. But it doesn't really solve the problem of how > checkpointer.c would be able to call this function with proper > arguments. It is at least better in not needing two arguments to > decide what to do, but how is checkpointer.c supposed to know what to > pass for xlogaction? Worse yet, how is checkpointer.c supposed to know > what to pass for EndOfLogTLI and EndOfLog? Actually, EndOfLog doesn't > seem too problematic, because that value has been stored in four (!) > places inside XLogCtl by this code: > > LogwrtResult.Write = LogwrtResult.Flush = EndOfLog; > > XLogCtl->LogwrtResult = LogwrtResult; > > XLogCtl->LogwrtRqst.Write = EndOfLog; > XLogCtl->LogwrtRqst.Flush = EndOfLog; > > Presumably we could relatively easily change things around so that we > finish one of those values ... probably one of the "write" values .. > back out of XLogCtl instead of passing it as a parameter. That would > work just as well from the checkpointer as from the startup process, > and there seems to be no way for the value to change until
Re: [Patch] ALTER SYSTEM READ ONLY
> On Jun 16, 2020, at 6:55 AM, amul sul wrote: > > (2) if the session is idle, we also need the top-level abort > record to be written immediately, but can't send an error to the client until > the next > command is issued without losing wire protocol synchronization. For now, we > just use > FATAL to kill the session; maybe this can be improved in the future. Andres, I'd like to have a patch that tests the impact of a vacuum running for xid wraparound purposes, blocked on a pinned page held by the cursor, when another session disables WAL. It would be very interesting to test how the vacuum handles that specific change. I have not figured out the cleanest way to do this, though, as we don't as a project yet have a standard way of setting up xid exhaustion in a regression test, do we? The closest I saw to it was your work in [1], but that doesn't seem to have made much headway recently, and is designed for the TAP testing infrastructure, which isn't useable from inside an isolation test. Do you have a suggestion how best to continue developing out the test infrastructure? Amul, The most obvious way to test how your ALTER SYSTEM READ ONLY feature interacts with concurrent sessions is using the isolation tester in src/test/isolation/, but as it stands now, the first permutation that gets a FATAL causes the test to abort and all subsequent permutations to not run. Attached patch v34-0009 fixes that. Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting with a system that is set read-only by a different session. The expected output is worth reviewing to see how this plays out. I don't see anything in there which is obviously wrong, but some of it is a bit clunky. For example, by the time the client sees an error "FATAL: WAL is now prohibited", the system may already have switched back to read-write. Also, it is a bit strange to get one of these errors on an attempted ROLLBACK. Once again, not wrong as such, but clunky. v34-0009-Make-isolationtester-handle-closed-sessions.patch Description: Binary data v34-0010-Test-ALTER-SYSTEM-READ-ONLY-against-cursors.patch Description: Binary data [1] https://www.postgresql.org/message-id/flat/CAP4vRV5gEHFLB7NwOE6_dyHAeVfkvqF8Z_g5GaCQZNgBAE0Frw%40mail.gmail.com#e10861372aec22119b66756ecbac581c — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Patch] ALTER SYSTEM READ ONLY
On Fri, Sep 10, 2021 at 1:16 PM Mark Dilger wrote: > uses "immediately" and "will kill the running transaction" which reenforced > the impression that this mechanism is heavier handed than it is. It's intended to be just as immediate as e.g. pg_cancel_backend() and pg_terminate_backend(), which work just the same way, but not any more so. I guess we could look at how things are worded in those cases. >From a user perspective such things are usually pretty immediate, but not as immediate as firing a signal handler. Computers are fast.[1] -- Robert Haas EDB: http://www.enterprisedb.com [1] https://www.youtube.com/watch?v=6xijhqU8r2A
Re: [Patch] ALTER SYSTEM READ ONLY
> On Sep 10, 2021, at 9:56 AM, Robert Haas wrote: > > I think the relevant question here is not "could a signal handler > fire?" but "can we hit a CHECK_FOR_INTERRUPTS()?". If the relevant > question is the former, then there's no hope of ever making it work > because there's always a race condition. But the signal handler is > only setting flags whose only effect is to make a subsequent > CHECK_FOR_INTERRUPTS() do something, so it doesn't really matter when > the signal handler can run, but when CHECK_FOR_INTERRUPTS() can call > ProcessInterrupts(). Ok, that makes more sense. I was reviewing the code after first reviewing the documentation changes, which lead me to believe the system was designed to respond more quickly than that: +WAL prohibited is a read-only system state. Any permitted user can call +pg_prohibit_wal function to forces the system into +a WAL prohibited mode where insert write ahead log will be prohibited until +the same function executed to change that state to read-write. Like Hot and +Otherwise, it will be off. When the user requests WAL +prohibited state, at that moment if any existing session is already running +a transaction, and that transaction has already been performed or planning +to perform wal write operations then the session running that transaction +will be terminated. "forces the system" in the first part, and "at that moment ... that transaction will be terminated" sounds heavier handed than something which merely sets a flag asking the backend to exit. I was reading that as more immediate and then trying to figure out how the signal handling could possibly work, and failing to see how. The README: +Any +backends which receive WAL prohibited system state transition barrier interrupt +need to stop WAL writing immediately. For barrier absorption the backed(s) will +kill the running transaction which has valid XID indicates that the transaction +has performed and/or planning WAL write. uses "immediately" and "will kill the running transaction" which reenforced the impression that this mechanism is heavier handed than it is. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Patch] ALTER SYSTEM READ ONLY
On Fri, Sep 10, 2021 at 12:20 PM Mark Dilger wrote: > A better example may be found in ginmetapage.c: > > needwal = RelationNeedsWAL(indexrel); > if (needwal) > { > CheckWALPermitted(); > computeLeafRecompressWALData(leaf); > } > > /* Apply changes to page */ > START_CRIT_SECTION(); Yeah, that looks sketchy. Why not move CheckWALPermitted() down a line? > Even if CheckWALPermitted is assumed to be close enough to atomic to not be a > problem (I don't agree), that argument can't be made here, as > computeLeafRecompressWALData is not trivial and signals could easily be > processed while it is running. I think the relevant question here is not "could a signal handler fire?" but "can we hit a CHECK_FOR_INTERRUPTS()?". If the relevant question is the former, then there's no hope of ever making it work because there's always a race condition. But the signal handler is only setting flags whose only effect is to make a subsequent CHECK_FOR_INTERRUPTS() do something, so it doesn't really matter when the signal handler can run, but when CHECK_FOR_INTERRUPTS() can call ProcessInterrupts(). -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
> On Sep 10, 2021, at 8:42 AM, Mark Dilger wrote: > > Take for example a code stanza from heapam.c: > >if (needwal) >CheckWALPermitted(); > >/* NO EREPORT(ERROR) from here till changes are logged */ >START_CRIT_SECTION(); > > Now, I know that interrupts won't be processed after starting the critical > section, but I can see plain as day that an interrupt might get processed > *during* CheckWALPermitted, since that function isn't atomic. A better example may be found in ginmetapage.c: needwal = RelationNeedsWAL(indexrel); if (needwal) { CheckWALPermitted(); computeLeafRecompressWALData(leaf); } /* Apply changes to page */ START_CRIT_SECTION(); Even if CheckWALPermitted is assumed to be close enough to atomic to not be a problem (I don't agree), that argument can't be made here, as computeLeafRecompressWALData is not trivial and signals could easily be processed while it is running. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Patch] ALTER SYSTEM READ ONLY
> On Sep 10, 2021, at 7:36 AM, Amul Sul wrote: > >> v33-0005 >> >> This patch makes bool XLogInsertAllowed() more complicated than before. The >> result used to depend mostly on the value of LocalXLogInsertAllowed except >> that when that value was negative, the result was determined by >> RecoveryInProgress(). There was an arcane rule that LocalXLogInsertAllowed >> must have the non-negative values binary coercible to boolean "true" and >> "false", with the basis for that rule being the coding of >> XLogInsertAllowed(). Now that the function is more complicated, this rule >> seems even more arcane. Can we change the logic to not depend on casting an >> integer to bool? >> > > We can't use a boolean variable because LocalXLogInsertAllowed > represents three states as, 1 means "wal is allowed'', 0 for "wal is > disallowed", and -1 is for "need to check". I'm complaining that we're using an integer rather than an enum. I'm ok if we define it so that WAL_ALLOWABLE_UNKNOWN = -1, WAL_DISALLOWED = 0, WAL_ALLOWED = 1 or such, but the logic of the function has gotten complicated enough that having to remember which number represents which logical condition has become a (small) mental burden. Given how hard the WAL code is to read and fully grok, I'd rather avoid any unnecessary burden, even small ones. >> The new function CheckWALPermitted() seems to test the current state of >> variables but not lock any of them, and the new function comment says: >> > > CheckWALPermitted() calls XLogInsertAllowed() does check the > LocalXLogInsertAllowed flag which is local to that process only, and > nobody else reads that concurrently. > >> /* >> * In opposite to the above assertion if a transaction doesn't have valid XID >> * (e.g. VACUUM) then it won't be killed while changing the system state to >> WAL >> * prohibited. Therefore, we need to explicitly error out before entering >> into >> * the critical section. >> */ >> >> This suggests to me that a vacuum process can check whether wal is >> prohibited, then begin a critical section which needs wal to be allowed, and >> concurrently somebody else might disable wal without killing the vacuum >> process. I'm given to wonder what horrors await when the vacuum process >> does something that needs to be wal logged but cannot be. Does it trigger a >> panic? I don't like the idea that calling pg_prohibit_wal durning a vacuum >> might panic the cluster. If there is some reason this is not a problem, I >> think the comment should explain it. In particular, why is it sufficient to >> check whether wal is prohibited before entering the critical section and not >> necessary to be sure it remains allowed through the lifetime of that >> critical section? >> > > Hm, interrupts absorption are disabled inside the critical section. > The wal prohibited state for that process (here vacuum) will never get > set until it sees the interrupts & the system will not be said wal > prohibited until every process sees that interrupts. I am not sure we > should explain the characteristics of the critical section at this > place, if want, we can add a brief saying that inside the critical > section we should not worry about the state change which never happens > because interrupts are disabled there. I think the fact that interrupts are disabled during critical sections is understood, so there is no need to mention that. The problem is that the method for taking the system read-only is less generally known, and readers of other sections of code need to jump to the definition of CheckWALPermitted to read the comments and understand what it does. Take for example a code stanza from heapam.c: if (needwal) CheckWALPermitted(); /* NO EREPORT(ERROR) from here till changes are logged */ START_CRIT_SECTION(); Now, I know that interrupts won't be processed after starting the critical section, but I can see plain as day that an interrupt might get processed *during* CheckWALPermitted, since that function isn't atomic. It might happen after the check is meaningfully finished but before the function actually returns. So I'm not inclined to believe that the way this all works is dependent on interrupts being blocked. So I think, maybe this is all protected by some other scheme. But what? It's not clear from the code comments for CheckWALPermitted, so I'm left having to reverse engineer the system to understand it. One interpretation is that the signal handler will exit() my backend if it receives a signal saying that the system is going read-only, so there is no race condition. But then why the call to CheckWALPermitted()? If this interpretation were correct, we'd happily enter the critical section without checking, secure in the knowledge that as long as we haven't exited yet, all is ok. Another interpretation is that the whole thing is just a performance trick. Maybe we're ok with the idea that we will
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Sep 9, 2021 at 11:12 PM Mark Dilger wrote: > > Thank you, for looking at the patch. Please see my reply inline below: > > > On Sep 8, 2021, at 6:44 AM, Amul Sul wrote: > > > > Here is the rebased version. > > v33-0004 > > This patch moves the include of "catalog/pg_control.h" from transam/xlog.c > into access/xlog.h, making pg_control.h indirectly included from a much > larger set of files. Maybe that's ok. I don't know. But it seems you are > doing this merely to get the symbol (not even the definition) for struct > DBState. I'd recommend rearranging the code so this isn't necessary, but > otherwise you'd at least want to remove the now redundant includes of > catalog/pg_control.h from xlogdesc.c, xloginsert.c, auth-scram.c, > postmaster.c, misc/pg_controldata.c, and pg_controldata/pg_controldata.c. > Yes, you are correct, xlog.h is included in more than 150 files. I was wondering if we can have a forward declaration instead of including pg_control.h (e.g. The same way struct XLogRecData was declared in xlog.h). Perhaps, DBState is enum & I don't see we have done the same for enum elsewhere as we are doing for structures, but that seems to be fine, IMO. Earlier, I was unsure before preparing this patch, but since that makes sense (I assumed) and minimizes duplications, can we go ahead and post separately with the same change in StartupXLOG() which I have skipped for the same reason mentioned in patch commit-msg. > v33-0005 > > This patch makes bool XLogInsertAllowed() more complicated than before. The > result used to depend mostly on the value of LocalXLogInsertAllowed except > that when that value was negative, the result was determined by > RecoveryInProgress(). There was an arcane rule that LocalXLogInsertAllowed > must have the non-negative values binary coercible to boolean "true" and > "false", with the basis for that rule being the coding of > XLogInsertAllowed(). Now that the function is more complicated, this rule > seems even more arcane. Can we change the logic to not depend on casting an > integer to bool? > We can't use a boolean variable because LocalXLogInsertAllowed represents three states as, 1 means "wal is allowed'', 0 for "wal is disallowed", and -1 is for "need to check". > The code comment change in autovacuum.c introduces a non-grammatical > sentence: "First, the system is not read only i.e. wal writes permitted". > > The function comment in checkpointer.c reads more like it toggles the system > into allowing something, rather than actually doing that same something: > "SendSignalToCheckpointer allows a process to send a signal to the checkpoint > process". > > The new code comment in ipci.c contains a typo, but more importantly, it > doesn't impart any knowledge beyond what a reader of the function name could > already surmise. Perhaps the comment can better clarify what is happening: > "Set up wal probibit shared state" > > The new code comment in sync.c copies and changes a nearby comment but drops > part of the verb phrase: "As in ProcessSyncRequests, we don't want to stop > wal prohibit change requests". The nearby comment reads "stop absorbing". I > think this one should read "stop processing". This same comment is used > again below. Then a third comment reads "For the same reason mentioned > previously for the wal prohibit state change request check." That third > comment is too glib. > > tcop/utility.c needlessly includes "access/walprohibit.h" > > wait_event.h extends enum WaitEventIO with new values > WAIT_EVENT_WALPROHIBIT_STATE and WAIT_EVENT_WALPROHIBIT_STATE_CHANGE. I > don't find the difference between these two names at all clear. Waiting for > a state change is clear enough. But how is waiting on a state different? > > xlog.h defines a new enum. I don't find any of it clear; not the comment, > nor the name of the enum, nor the names of the values: > > /* State of work that enables wal writes */ > typedef enum XLogAcceptWritesState > { > XLOG_ACCEPT_WRITES_PENDING = 0, /* initial state, not started */ > XLOG_ACCEPT_WRITES_SKIPPED, /* skipped wal writes */ > XLOG_ACCEPT_WRITES_DONE /* wal writes are enabled */ > } XLogAcceptWritesState; > > This enum seems to have been written from the point of view of someone who > already knew what it was for. It needs to be written in a way that will be > clear to people who have no idea what it is for. > > v33-0006: > > The new code comments in brin.c and elsewhere should use the verb "require" > rather than "have", otherwise "building indexes" reads as a noun phrase > rather than as a gerund: /* Building indexes will have an XID */ > Will try to think about the pointed code comments for the improvements. > The new function CheckWALPermitted() seems to test the current state of > variables but not lock any of them, and the new function comment says: > CheckWALPermitted() calls XLogInsertAllowed() does check the
Re: [Patch] ALTER SYSTEM READ ONLY
> On Sep 9, 2021, at 11:21 AM, Robert Haas wrote: > > They have to check whether WAL has become prohibited > and error out if so, and they need to do so before entering the > critical section - because if the problem were detected for the first > time inside the critical section it would escalate to a PANIC, which > we do not want. But that is the part that is still not clear. Should the comment say that a concurrent change to prohibit wal after the current process checks but before the current process exists the critical section will result in a panic? What is unclear about the comment is that it implies that a check before the critical section is sufficient, but ordinarily one would expect a lock to be held and the check-and-lock dance to carefully avoid any race condition. If somehow this is safe, the logic for why it is safe should be spelled out. If not, a mia culpa saying, "hey, were not terribly safe about this" should be explicit in the comment. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Sep 9, 2021 at 1:42 PM Mark Dilger wrote: > v33-0006: > > The new code comments in brin.c and elsewhere should use the verb "require" > rather than "have", otherwise "building indexes" reads as a noun phrase > rather than as a gerund: /* Building indexes will have an XID */ Honestly that sentence doesn't sound very clear even with a different verb. > This suggests to me that a vacuum process can check whether wal is > prohibited, then begin a critical section which needs wal to be allowed, and > concurrently somebody else might disable wal without killing the vacuum > process. I'm given to wonder what horrors await when the vacuum process does > something that needs to be wal logged but cannot be. Does it trigger a > panic? I don't like the idea that calling pg_prohibit_wal durning a vacuum > might panic the cluster. If there is some reason this is not a problem, I > think the comment should explain it. In particular, why is it sufficient to > check whether wal is prohibited before entering the critical section and not > necessary to be sure it remains allowed through the lifetime of that critical > section? The idea here is that if a transaction already has an XID assigned, we have to kill it off before we can declare the system read-only, because it will definitely write WAL when the transaction ends: either a commit record, or an abort record, but definitely something. So cases where we write WAL without necessarily having an XID require special handling. They have to check whether WAL has become prohibited and error out if so, and they need to do so before entering the critical section - because if the problem were detected for the first time inside the critical section it would escalate to a PANIC, which we do not want. Places where we're guaranteed to have an XID - e.g. inserting a heap tuple - don't need a run-time check before entering the critical section, because the code can't be reached in the first place if the system is WAL-read-only. > Why is this function defined to take a boolean such that > pg_prohibit_wal(true) means to prohibit wal and pg_prohibit_wal(false) means > to allow wal. Wouldn't a different function named pg_allow_wal() make it > more clear? This also would be a better interface if taking the system > read-only had a timeout as I suggested above, as such a timeout parameter > when allowing wal is less clearly useful. Hmm, I find pg_prohibit_wal(true/false) better than pg_prohibit_wal() and pg_allow_wal(), and would prefer pg_prohibit_wal(true/false, timeout) over pg_prohibit_wal(timeout) and pg_allow_wal(), because I think then once you find that one function you know how to do everything about that feature, whereas the other way you need to find both functions to have the whole story. That said, I can see why somebody else might prefer something else. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
> On Sep 8, 2021, at 6:44 AM, Amul Sul wrote: > > Here is the rebased version. v33-0004 This patch moves the include of "catalog/pg_control.h" from transam/xlog.c into access/xlog.h, making pg_control.h indirectly included from a much larger set of files. Maybe that's ok. I don't know. But it seems you are doing this merely to get the symbol (not even the definition) for struct DBState. I'd recommend rearranging the code so this isn't necessary, but otherwise you'd at least want to remove the now redundant includes of catalog/pg_control.h from xlogdesc.c, xloginsert.c, auth-scram.c, postmaster.c, misc/pg_controldata.c, and pg_controldata/pg_controldata.c. v33-0005 This patch makes bool XLogInsertAllowed() more complicated than before. The result used to depend mostly on the value of LocalXLogInsertAllowed except that when that value was negative, the result was determined by RecoveryInProgress(). There was an arcane rule that LocalXLogInsertAllowed must have the non-negative values binary coercible to boolean "true" and "false", with the basis for that rule being the coding of XLogInsertAllowed(). Now that the function is more complicated, this rule seems even more arcane. Can we change the logic to not depend on casting an integer to bool? The code comment change in autovacuum.c introduces a non-grammatical sentence: "First, the system is not read only i.e. wal writes permitted". The function comment in checkpointer.c reads more like it toggles the system into allowing something, rather than actually doing that same something: "SendSignalToCheckpointer allows a process to send a signal to the checkpoint process". The new code comment in ipci.c contains a typo, but more importantly, it doesn't impart any knowledge beyond what a reader of the function name could already surmise. Perhaps the comment can better clarify what is happening: "Set up wal probibit shared state" The new code comment in sync.c copies and changes a nearby comment but drops part of the verb phrase: "As in ProcessSyncRequests, we don't want to stop wal prohibit change requests". The nearby comment reads "stop absorbing". I think this one should read "stop processing". This same comment is used again below. Then a third comment reads "For the same reason mentioned previously for the wal prohibit state change request check." That third comment is too glib. tcop/utility.c needlessly includes "access/walprohibit.h" wait_event.h extends enum WaitEventIO with new values WAIT_EVENT_WALPROHIBIT_STATE and WAIT_EVENT_WALPROHIBIT_STATE_CHANGE. I don't find the difference between these two names at all clear. Waiting for a state change is clear enough. But how is waiting on a state different? xlog.h defines a new enum. I don't find any of it clear; not the comment, nor the name of the enum, nor the names of the values: /* State of work that enables wal writes */ typedef enum XLogAcceptWritesState { XLOG_ACCEPT_WRITES_PENDING = 0, /* initial state, not started */ XLOG_ACCEPT_WRITES_SKIPPED, /* skipped wal writes */ XLOG_ACCEPT_WRITES_DONE /* wal writes are enabled */ } XLogAcceptWritesState; This enum seems to have been written from the point of view of someone who already knew what it was for. It needs to be written in a way that will be clear to people who have no idea what it is for. v33-0006: The new code comments in brin.c and elsewhere should use the verb "require" rather than "have", otherwise "building indexes" reads as a noun phrase rather than as a gerund: /* Building indexes will have an XID */ The new function CheckWALPermitted() seems to test the current state of variables but not lock any of them, and the new function comment says: /* * In opposite to the above assertion if a transaction doesn't have valid XID * (e.g. VACUUM) then it won't be killed while changing the system state to WAL * prohibited. Therefore, we need to explicitly error out before entering into * the critical section. */ This suggests to me that a vacuum process can check whether wal is prohibited, then begin a critical section which needs wal to be allowed, and concurrently somebody else might disable wal without killing the vacuum process. I'm given to wonder what horrors await when the vacuum process does something that needs to be wal logged but cannot be. Does it trigger a panic? I don't like the idea that calling pg_prohibit_wal durning a vacuum might panic the cluster. If there is some reason this is not a problem, I think the comment should explain it. In particular, why is it sufficient to check whether wal is prohibited before entering the critical section and not necessary to be sure it remains allowed through the lifetime of that critical section? v33-0007: I don't really like what the documentation has to say about pg_prohibit_wal. Why should pg_prohibit_wal differ from other signal sending functions in whether it returns a
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, 7 Sep 2021 at 8:43 PM, Mark Dilger wrote: > > > > On Aug 31, 2021, at 5:15 AM, Amul Sul wrote: > > > > Attached is the rebased version for the latest master head. > > Hi Amul! > > Could you please rebase again? > Ok will do that tomorrow, thanks. Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
> On Aug 31, 2021, at 5:15 AM, Amul Sul wrote: > > Attached is the rebased version for the latest master head. Hi Amul! Could you please rebase again? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, Aug 31, 2021 at 8:16 AM Amul Sul wrote: > Attached is the rebased version for the latest master head. Also, > added tap tests to test some part of this feature and a separate patch > to test recovery_end_command execution. It looks like you haven't given any thought to writing that in a way that will work on Windows? > What is usual practice, can have a few tests in TAP and a few in > pg_regress for the same feature? Sure, there's no problem with that. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
Hi, On Thu, Jul 29, 2021 at 9:46 PM Robert Haas wrote: > On Wed, Jul 28, 2021 at 7:33 AM Amul Sul wrote: > > I was too worried about how I could miss that & after thinking more > > about that, I realized that the operation for ArchiveRecoveryRequested > > is never going to be skipped in the startup process and that never > > left for the checkpoint process to do that later. That is the reason > > that assert was added there. > > > > When ArchiveRecoveryRequested, the server will no longer be in > > the wal prohibited mode, we implicitly change the state to > > wal-permitted. Here is the snip from the 0003 patch: > > Ugh, OK. That makes sense, but I'm still not sure that I like it. I've > kind of been wondering: why not have XLogAcceptWrites() be the > responsibility of the checkpointer all the time, in every case? That > would require fixing some more things, and this is one of them, but > then it would be consistent, which means that any bugs would be likely > to get found and fixed. If calling XLogAcceptWrites() from the > checkpointer is some funny case that only happens when the system > crashes while WAL is prohibited, then we might fail to notice that we > have a bug. > > This is especially true given that we have very little test coverage > in this area. Andres was ranting to me about this earlier this week, > and I wasn't sure he was right, but then I noticed that we have > exactly zero tests in the entire source tree that make use of > recovery_end_command. We really need a TAP test for that, I think. > It's too scary to do much reorganization of the code without having > any tests at all for the stuff we're moving around. Likewise, we're > going to need TAP tests for the stuff that is specific to this patch. > For example, we should have a test that crashes the server while it's > read only, brings it back up, checks that we still can't write WAL, > then re-enables WAL, and checks that we now can write WAL. There are > probably a bunch of other things that we should test, too. > Hi, I have been testing “ALTER SYSTEM READ ONLY” and wrote a few tap test cases for this feature. Please find the test case(Draft version) attached herewith, to be applied on top of the v30 patch by Amul. Kindly have a review and let me know the required changes. -- With Regards, Prabhat Kumar Sahu EnterpriseDB: http://www.enterprisedb.com prohibitwal-tap-test.patch Description: Binary data
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Jul 28, 2021 at 7:33 AM Amul Sul wrote: > I was too worried about how I could miss that & after thinking more > about that, I realized that the operation for ArchiveRecoveryRequested > is never going to be skipped in the startup process and that never > left for the checkpoint process to do that later. That is the reason > that assert was added there. > > When ArchiveRecoveryRequested, the server will no longer be in > the wal prohibited mode, we implicitly change the state to > wal-permitted. Here is the snip from the 0003 patch: Ugh, OK. That makes sense, but I'm still not sure that I like it. I've kind of been wondering: why not have XLogAcceptWrites() be the responsibility of the checkpointer all the time, in every case? That would require fixing some more things, and this is one of them, but then it would be consistent, which means that any bugs would be likely to get found and fixed. If calling XLogAcceptWrites() from the checkpointer is some funny case that only happens when the system crashes while WAL is prohibited, then we might fail to notice that we have a bug. This is especially true given that we have very little test coverage in this area. Andres was ranting to me about this earlier this week, and I wasn't sure he was right, but then I noticed that we have exactly zero tests in the entire source tree that make use of recovery_end_command. We really need a TAP test for that, I think. It's too scary to do much reorganization of the code without having any tests at all for the stuff we're moving around. Likewise, we're going to need TAP tests for the stuff that is specific to this patch. For example, we should have a test that crashes the server while it's read only, brings it back up, checks that we still can't write WAL, then re-enables WAL, and checks that we now can write WAL. There are probably a bunch of other things that we should test, too. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Jul 29, 2021 at 4:47 PM Dilip Kumar wrote: > > On Wed, Jul 28, 2021 at 5:03 PM Amul Sul wrote: > > > > I was too worried about how I could miss that & after thinking more > > about that, I realized that the operation for ArchiveRecoveryRequested > > is never going to be skipped in the startup process and that never > > left for the checkpoint process to do that later. That is the reason > > that assert was added there. > > > > When ArchiveRecoveryRequested, the server will no longer be in > > the wal prohibited mode, we implicitly change the state to > > wal-permitted. Here is the snip from the 0003 patch: > > > > @@ -6614,13 +6629,30 @@ StartupXLOG(void) > > (errmsg("starting archive recovery"))); > > } > > > > - /* > > - * Take ownership of the wakeup latch if we're going to sleep during > > - * recovery. > > - */ > > if (ArchiveRecoveryRequested) > > + { > > + /* > > + * Take ownership of the wakeup latch if we're going to sleep during > > + * recovery. > > + */ > > OwnLatch(>recoveryWakeupLatch); > > > > + /* > > + * Since archive recovery is requested, we cannot be in a wal prohibited > > + * state. > > + */ > > + if (ControlFile->wal_prohibited) > > + { > > + /* No need to hold ControlFileLock yet, we aren't up far enough */ > > + ControlFile->wal_prohibited = false; > > + ControlFile->time = (pg_time_t) time(NULL); > > + UpdateControlFile(); > > + > > Is there some reason why we are forcing 'wal_prohibited' to off if we > are doing archive recovery? It might have already been discussed, but > I could not find it on a quick look into the thread. > Here is: https://postgr.es/m/CA+TgmoZ=cctbaxxmtyzogxegqzoz9smkbwrdpsacpjvfcgc...@mail.gmail.com Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Jul 28, 2021 at 5:03 PM Amul Sul wrote: > > I was too worried about how I could miss that & after thinking more > about that, I realized that the operation for ArchiveRecoveryRequested > is never going to be skipped in the startup process and that never > left for the checkpoint process to do that later. That is the reason > that assert was added there. > > When ArchiveRecoveryRequested, the server will no longer be in > the wal prohibited mode, we implicitly change the state to > wal-permitted. Here is the snip from the 0003 patch: > > @@ -6614,13 +6629,30 @@ StartupXLOG(void) > (errmsg("starting archive recovery"))); > } > > - /* > - * Take ownership of the wakeup latch if we're going to sleep during > - * recovery. > - */ > if (ArchiveRecoveryRequested) > + { > + /* > + * Take ownership of the wakeup latch if we're going to sleep during > + * recovery. > + */ > OwnLatch(>recoveryWakeupLatch); > > + /* > + * Since archive recovery is requested, we cannot be in a wal prohibited > + * state. > + */ > + if (ControlFile->wal_prohibited) > + { > + /* No need to hold ControlFileLock yet, we aren't up far enough */ > + ControlFile->wal_prohibited = false; > + ControlFile->time = (pg_time_t) time(NULL); > + UpdateControlFile(); > + Is there some reason why we are forcing 'wal_prohibited' to off if we are doing archive recovery? It might have already been discussed, but I could not find it on a quick look into the thread. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Jul 28, 2021 at 4:37 PM Amul Sul wrote: > > On Wed, Jul 28, 2021 at 2:26 AM Robert Haas wrote: > > > > On Fri, Jul 23, 2021 at 4:03 PM Robert Haas wrote: > > > My 0003 is where I see some lingering problems. It creates > > > XLogAcceptWrites(), moves the appropriate stuff there, and doesn't > > > need the xlogreader. But it doesn't really solve the problem of how > > > checkpointer.c would be able to call this function with proper > > > arguments. It is at least better in not needing two arguments to > > > decide what to do, but how is checkpointer.c supposed to know what to > > > pass for xlogaction? Worse yet, how is checkpointer.c supposed to know > > > what to pass for EndOfLogTLI and EndOfLog? > > > > On further study, I found another problem: the way my patch set leaves > > things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which > > will not be correctly initialized in any process other than the > > startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog) > > would just be skipped. Your 0001 seems to have the same problem. You > > added Assert(AmStartupProcess()) to the inside of the if > > (ArchiveRecoveryRequested) block, but that doesn't fix anything. > > Outside the startup process, ArchiveRecoveryRequested will always be > > false, but the point is that the associated stuff should be done if > > ArchiveRecoveryRequested would have been true in the startup process. > > Both of our patch sets leave things in a state where that would never > > happen, which is not good. Unless I'm missing something, it seems like > > maybe you didn't test your patches to verify that, when the > > XLogAcceptWrites() call comes from the checkpointer, all the same > > things happen that would have happened had it been called from the > > startup process. That would be a really good thing to have tested > > before posting your patches. > > > > My bad, I am extremely sorry about that. I usually do test my patches, > but somehow I failed to test this change due to manually testing the > whole ASRO feature and hurrying in posting the newest version. > > I will try to be more careful next time. > I was too worried about how I could miss that & after thinking more about that, I realized that the operation for ArchiveRecoveryRequested is never going to be skipped in the startup process and that never left for the checkpoint process to do that later. That is the reason that assert was added there. When ArchiveRecoveryRequested, the server will no longer be in the wal prohibited mode, we implicitly change the state to wal-permitted. Here is the snip from the 0003 patch: @@ -6614,13 +6629,30 @@ StartupXLOG(void) (errmsg("starting archive recovery"))); } - /* - * Take ownership of the wakeup latch if we're going to sleep during - * recovery. - */ if (ArchiveRecoveryRequested) + { + /* + * Take ownership of the wakeup latch if we're going to sleep during + * recovery. + */ OwnLatch(>recoveryWakeupLatch); + /* + * Since archive recovery is requested, we cannot be in a wal prohibited + * state. + */ + if (ControlFile->wal_prohibited) + { + /* No need to hold ControlFileLock yet, we aren't up far enough */ + ControlFile->wal_prohibited = false; + ControlFile->time = (pg_time_t) time(NULL); + UpdateControlFile(); + + ereport(LOG, + (errmsg("clearing WAL prohibition because the system is in archive recovery"))); + } + } + > > As far as EndOfLogTLI is concerned, there are, somewhat annoyingly, > > several TLIs stored in XLogCtl. None of them seem to be precisely the > > same thing as EndLogTLI, but I am hoping that replayEndTLI is close > > enough. I found out pretty quickly through testing that replayEndTLI > > isn't always valid -- it ends up 0 if we don't enter recovery. That's > > not really a problem, though, because we only need it to be valid if > > ArchiveRecoveryRequested. The code that initializes and updates it > > seems to run whenever InRecovery = true, and ArchiveRecoveryRequested > > = true will force InRecovery = true. So it looks to me like > > replayEndTLI will always be initialized in the cases where we need a > > value. It's not yet entirely clear to me if it has to have the same > > value as EndOfLogTLI. I find this code comment quite mysterious: > > > > /* > > * EndOfLogTLI is the TLI in the filename of the XLOG segment containing > > * the end-of-log. It could be different from the timeline that EndOfLog > > * nominally belongs to, if there was a timeline switch in that segment, > > * and we were reading the old WAL from a segment belonging to a higher > > * timeline. > > */ > > EndOfLogTLI = xlogreader->seg.ws_tli; > > > > The thing is, if we were reading old WAL from a segment belonging to a > > higher timeline, wouldn't we have switched to that new timeline? > > AFAIUC, by browsing the code, yes, we are switching to the new > timeline. Along with lastReplayedTLI, lastReplayedEndRecPtr is also > the same as the
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Jul 28, 2021 at 2:26 AM Robert Haas wrote: > > On Fri, Jul 23, 2021 at 4:03 PM Robert Haas wrote: > > My 0003 is where I see some lingering problems. It creates > > XLogAcceptWrites(), moves the appropriate stuff there, and doesn't > > need the xlogreader. But it doesn't really solve the problem of how > > checkpointer.c would be able to call this function with proper > > arguments. It is at least better in not needing two arguments to > > decide what to do, but how is checkpointer.c supposed to know what to > > pass for xlogaction? Worse yet, how is checkpointer.c supposed to know > > what to pass for EndOfLogTLI and EndOfLog? > > On further study, I found another problem: the way my patch set leaves > things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which > will not be correctly initialized in any process other than the > startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog) > would just be skipped. Your 0001 seems to have the same problem. You > added Assert(AmStartupProcess()) to the inside of the if > (ArchiveRecoveryRequested) block, but that doesn't fix anything. > Outside the startup process, ArchiveRecoveryRequested will always be > false, but the point is that the associated stuff should be done if > ArchiveRecoveryRequested would have been true in the startup process. > Both of our patch sets leave things in a state where that would never > happen, which is not good. Unless I'm missing something, it seems like > maybe you didn't test your patches to verify that, when the > XLogAcceptWrites() call comes from the checkpointer, all the same > things happen that would have happened had it been called from the > startup process. That would be a really good thing to have tested > before posting your patches. > My bad, I am extremely sorry about that. I usually do test my patches, but somehow I failed to test this change due to manually testing the whole ASRO feature and hurrying in posting the newest version. I will try to be more careful next time. > As far as EndOfLogTLI is concerned, there are, somewhat annoyingly, > several TLIs stored in XLogCtl. None of them seem to be precisely the > same thing as EndLogTLI, but I am hoping that replayEndTLI is close > enough. I found out pretty quickly through testing that replayEndTLI > isn't always valid -- it ends up 0 if we don't enter recovery. That's > not really a problem, though, because we only need it to be valid if > ArchiveRecoveryRequested. The code that initializes and updates it > seems to run whenever InRecovery = true, and ArchiveRecoveryRequested > = true will force InRecovery = true. So it looks to me like > replayEndTLI will always be initialized in the cases where we need a > value. It's not yet entirely clear to me if it has to have the same > value as EndOfLogTLI. I find this code comment quite mysterious: > > /* > * EndOfLogTLI is the TLI in the filename of the XLOG segment containing > * the end-of-log. It could be different from the timeline that EndOfLog > * nominally belongs to, if there was a timeline switch in that segment, > * and we were reading the old WAL from a segment belonging to a higher > * timeline. > */ > EndOfLogTLI = xlogreader->seg.ws_tli; > > The thing is, if we were reading old WAL from a segment belonging to a > higher timeline, wouldn't we have switched to that new timeline? AFAIUC, by browsing the code, yes, we are switching to the new timeline. Along with lastReplayedTLI, lastReplayedEndRecPtr is also the same as the EndOfLog that we needed when ArchiveRecoveryRequested is true. I went through the original commit 7cbee7c0a1db and the thread[1] but didn't find any related discussion for that. > Suppose we want WAL segment 246 from TLI 1, but we don't have that > segment on TLI 1, only TLI 2. Well, as far as I know, for us to use > the TLI 2 version, we'd need to have TLI 2 in the history of the > recovery_target_timeline. And if that is the case, then we would have > to replay through the record where the timeline changes. And if we do > that, then the discrepancy postulated by the comment cannot still > exist by the time we reach this code, because this code is only > reached after we finish WAL redo. So I'm baffled as to how this can > happen, but considering how many cases there are in this code, I sure > can't promise that it doesn't. The fact that we have few tests for any > of this doesn't help either. I am not an expert in this area, but will try to spend some more time on understanding and testing. 1] postgr.es/m/555dd101.7080...@iki.fi Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
On Fri, Jul 23, 2021 at 4:03 PM Robert Haas wrote: > My 0003 is where I see some lingering problems. It creates > XLogAcceptWrites(), moves the appropriate stuff there, and doesn't > need the xlogreader. But it doesn't really solve the problem of how > checkpointer.c would be able to call this function with proper > arguments. It is at least better in not needing two arguments to > decide what to do, but how is checkpointer.c supposed to know what to > pass for xlogaction? Worse yet, how is checkpointer.c supposed to know > what to pass for EndOfLogTLI and EndOfLog? On further study, I found another problem: the way my patch set leaves things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which will not be correctly initialized in any process other than the startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog) would just be skipped. Your 0001 seems to have the same problem. You added Assert(AmStartupProcess()) to the inside of the if (ArchiveRecoveryRequested) block, but that doesn't fix anything. Outside the startup process, ArchiveRecoveryRequested will always be false, but the point is that the associated stuff should be done if ArchiveRecoveryRequested would have been true in the startup process. Both of our patch sets leave things in a state where that would never happen, which is not good. Unless I'm missing something, it seems like maybe you didn't test your patches to verify that, when the XLogAcceptWrites() call comes from the checkpointer, all the same things happen that would have happened had it been called from the startup process. That would be a really good thing to have tested before posting your patches. As far as EndOfLogTLI is concerned, there are, somewhat annoyingly, several TLIs stored in XLogCtl. None of them seem to be precisely the same thing as EndLogTLI, but I am hoping that replayEndTLI is close enough. I found out pretty quickly through testing that replayEndTLI isn't always valid -- it ends up 0 if we don't enter recovery. That's not really a problem, though, because we only need it to be valid if ArchiveRecoveryRequested. The code that initializes and updates it seems to run whenever InRecovery = true, and ArchiveRecoveryRequested = true will force InRecovery = true. So it looks to me like replayEndTLI will always be initialized in the cases where we need a value. It's not yet entirely clear to me if it has to have the same value as EndOfLogTLI. I find this code comment quite mysterious: /* * EndOfLogTLI is the TLI in the filename of the XLOG segment containing * the end-of-log. It could be different from the timeline that EndOfLog * nominally belongs to, if there was a timeline switch in that segment, * and we were reading the old WAL from a segment belonging to a higher * timeline. */ EndOfLogTLI = xlogreader->seg.ws_tli; The thing is, if we were reading old WAL from a segment belonging to a higher timeline, wouldn't we have switched to that new timeline? Suppose we want WAL segment 246 from TLI 1, but we don't have that segment on TLI 1, only TLI 2. Well, as far as I know, for us to use the TLI 2 version, we'd need to have TLI 2 in the history of the recovery_target_timeline. And if that is the case, then we would have to replay through the record where the timeline changes. And if we do that, then the discrepancy postulated by the comment cannot still exist by the time we reach this code, because this code is only reached after we finish WAL redo. So I'm baffled as to how this can happen, but considering how many cases there are in this code, I sure can't promise that it doesn't. The fact that we have few tests for any of this doesn't help either. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Jun 17, 2021 at 1:23 AM Amul Sul wrote: > Attached is rebase for the latest master head. Also, I added one more > refactoring code that deduplicates the code setting database state in the > control file. The same code set the database state is also needed for this > feature. I started studying 0001 today and found that it rearranged the order of operations in StartupXLOG() more than I was expecting. It does, as per previous discussions, move a bunch of things to the place where we now call XLogParamters(). But, unsatisfyingly, InRecovery = false and XLogReaderFree() then have to move down even further. Since the goal here is to get to a situation where we sometimes XLogAcceptWrites() after InRecovery = false, it didn't seem nice for this refactoring patch to still end up with a situation where this stuff happens while InRecovery = true. In fact, with the patch, the amount of code that runs with InRecovery = true actually *increases*, which is not what I think should be happening here. That's why the patch ends up having to adjust SetMultiXactIdLimit to not Assert(!InRecovery). And then I started to wonder how this was ever going to work as part of the larger patch set, because as you have it here, XLogAcceptWrites() takes arguments XLogReaderState *xlogreader, XLogRecPtr EndOfLog, and TimeLineID EndOfLogTLI and if the checkpointer is calling that at a later time after the user issues pg_prohibit_wal(false), it's going to have none of those things. So I had a quick look at that part of the code and found this in checkpointer.c: XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0); For those following along from home, the additional "true" is a bool needChkpt argument added to XLogAcceptWrites() by 0003. Well, none of this is very satisfying. The whole purpose of passing the xlogreader is so we can figure out whether we need a checkpoint (never mind the question of whether the existing algorithm for determining that is really sensible) but now we need a second argument that basically serves the same purpose since one of the two callers to this function won't have an xlogreader. And then we're passing the EndOfLog and EndOfLogTLI as dummy values which seems like it's probably just totally wrong, but if for some reason it works correctly there sure don't seem to be any comments explaining why. So I started doing a bit of hacking myself and ended up with the attached, which I think is not completely the right thing yet but I think it's better than your version. I split this into three parts. 0001 splits up the logic that currently decides whether to write an end-of-recovery record or a checkpoint record and if the latter how the checkpoint ought to be performed into two functions. DetermineRecoveryXlogAction() figures out what we want to do, and PerformRecoveryXlogAction() does it. It also moves the code to run recovery_end_command and related stuff into a new function CleanupAfterArchiveRecovery(). 0002 then builds on this by postponing UpdateFullPageWrites(), PerformRecoveryXLogAction(), and CleanupAfterArchiveRecovery() to just before we XLogReportParameters(). Because of the refactoring done by 0001, this is only a small amount of code movement. Because of the separation between DetermineRecoveryXlogAction() and PerformRecoveryXlogAction(), the latter doesn't need the xlogreader. So we can do DetermineRecoveryXlogAction() at the same time as now, while the xlogreader is available, and then we don't need it later when we PerformRecoveryXlogAction(), because we already know what we need to know. I think this is all fine as far as it goes. My 0003 is where I see some lingering problems. It creates XLogAcceptWrites(), moves the appropriate stuff there, and doesn't need the xlogreader. But it doesn't really solve the problem of how checkpointer.c would be able to call this function with proper arguments. It is at least better in not needing two arguments to decide what to do, but how is checkpointer.c supposed to know what to pass for xlogaction? Worse yet, how is checkpointer.c supposed to know what to pass for EndOfLogTLI and EndOfLog? Actually, EndOfLog doesn't seem too problematic, because that value has been stored in four (!) places inside XLogCtl by this code: LogwrtResult.Write = LogwrtResult.Flush = EndOfLog; XLogCtl->LogwrtResult = LogwrtResult; XLogCtl->LogwrtRqst.Write = EndOfLog; XLogCtl->LogwrtRqst.Flush = EndOfLog; Presumably we could relatively easily change things around so that we finish one of those values ... probably one of the "write" values .. back out of XLogCtl instead of passing it as a parameter. That would work just as well from the checkpointer as from the startup process, and there seems to be no way for the value to change until after XLogAcceptWrites() has been called, so it seems fine. But that doesn't help for the other arguments. What I'm thinking is that we should just arrange to store EndOfLogTLI and xlogaction into XLogCtl also, and then
Re: [Patch] ALTER SYSTEM READ ONLY
On Mon, May 17, 2021 at 11:48 AM Amul Sul wrote: > > On Sat, May 15, 2021 at 3:12 PM Dilip Kumar wrote: > > > > On Thu, May 13, 2021 at 2:56 PM Dilip Kumar wrote: > > > > > > Great thanks. I will review the remaining patch soon. > > > > I have reviewed v28-0003, and I have some comments on this. > > > > === > > @@ -126,9 +127,14 @@ XLogBeginInsert(void) > > Assert(mainrdata_last == (XLogRecData *) _head); > > Assert(mainrdata_len == 0); > > > > +/* > > + * WAL permission must have checked before entering the critical > > section. > > + * Otherwise, WAL prohibited error will force system panic. > > + */ > > +Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || > > !CritSectionCount); > > + > > /* cross-check on whether we should be here or not */ > > -if (!XLogInsertAllowed()) > > -elog(ERROR, "cannot make new WAL entries during recovery"); > > +CheckWALPermitted(); > > > > We must not call CheckWALPermitted inside the critical section, > > instead if we are here we must be sure that > > WAL is permitted, so better put an assert. Even if that is ensured by > > some other mean then also I don't > > see any reason for calling this error generating function. > > > > I understand that we should not have an error inside a critical section but > this check is not wrong. Patch has enough checking so that errors due to WAL > prohibited state must not hit in the critical section, see assert just before > CheckWALPermitted(). Before entering into the critical section, we do have an > explicit WAL prohibited check. And to make sure that check has been done for > all current critical section for the wal writes, we have aforesaid assert > checking, for more detail on this please have a look at the "WAL prohibited > system state" section of src/backend/access/transam/README added in 0004 > patch. > This assertion also ensures that future development does not miss the WAL > prohibited state check before entering into a newly added critical section for > WAL writes. I think we need CheckWALPermitted(); check, in XLogBeginInsert() function because if XLogBeginInsert() maybe called outside critical section e.g. pg_truncate_visibility_map() then we should error out. So this check make sense to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, May 13, 2021 at 2:56 PM Dilip Kumar wrote: > > Great thanks. I will review the remaining patch soon. I have reviewed v28-0003, and I have some comments on this. === @@ -126,9 +127,14 @@ XLogBeginInsert(void) Assert(mainrdata_last == (XLogRecData *) _head); Assert(mainrdata_len == 0); +/* + * WAL permission must have checked before entering the critical section. + * Otherwise, WAL prohibited error will force system panic. + */ +Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount); + /* cross-check on whether we should be here or not */ -if (!XLogInsertAllowed()) -elog(ERROR, "cannot make new WAL entries during recovery"); +CheckWALPermitted(); We must not call CheckWALPermitted inside the critical section, instead if we are here we must be sure that WAL is permitted, so better put an assert. Even if that is ensured by some other mean then also I don't see any reason for calling this error generating function. === +CheckWALPermitted(void) +{ +if (!XLogInsertAllowed()) +ereport(ERROR, +(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION), + errmsg("system is now read only"))); + system is now read only -> wal is prohibited (in error message) === - * We can't write WAL in recovery mode, so there's no point trying to + * We can't write WAL during read-only mode, so there's no point trying to during read-only mode -> if WAL is prohibited or WAL recovery in progress (add recovery in progress and also modify read-only to wal prohibited) === +if (!XLogInsertAllowed()) { GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED); -GUC_check_errmsg("cannot set transaction read-write mode during recovery"); +GUC_check_errmsg("cannot set transaction read-write mode while system is read only"); return false; } system is read only -> WAL is prohibited === I think that's all, I have to say about 0003. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, May 13, 2021 at 2:54 PM Amul Sul wrote: > > On Thu, May 13, 2021 at 12:36 PM Dilip Kumar wrote: > > > > On Wed, May 12, 2021 at 5:55 PM Amul Sul wrote: > > > > > > > Thanks for the updated patch, while going through I noticed this comment. > > > > + /* > > + * WAL prohibit state changes not allowed during recovery except the crash > > + * recovery case. > > + */ > > + PreventCommandDuringRecovery("pg_prohibit_wal()"); > > > > Why do we need to allow state change during recovery? Do you still > > need it after the latest changes you discussed here, I mean now > > XLogAcceptWrites() being called before sending barrier to backends. > > So now we are not afraid that the backend will write WAL before we > > call XLogAcceptWrites(). So now IMHO, we don't need to keep the > > system in recovery until pg_prohibit_wal(false) is called, right? > > > > Your understanding is correct, and the previous patch also does the same, but > the code comment is wrong. Fixed in the attached version, also rebased for > the > latest master head. Sorry for the confusion. Great thanks. I will review the remaining patch soon. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, May 12, 2021 at 5:55 PM Amul Sul wrote: > Thanks for the updated patch, while going through I noticed this comment. + /* + * WAL prohibit state changes not allowed during recovery except the crash + * recovery case. + */ + PreventCommandDuringRecovery("pg_prohibit_wal()"); Why do we need to allow state change during recovery? Do you still need it after the latest changes you discussed here, I mean now XLogAcceptWrites() being called before sending barrier to backends. So now we are not afraid that the backend will write WAL before we call XLogAcceptWrites(). So now IMHO, we don't need to keep the system in recovery until pg_prohibit_wal(false) is called, right? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, May 13, 2021 at 2:26 AM Robert Haas wrote: > > On Wed, May 12, 2021 at 1:39 AM Dilip Kumar wrote: > > Your idea makes sense, but IMHO, if we are first writing > > XLogAcceptWrites() and then pushing out the barrier, then I don't > > understand the meaning of having state #4. I mean whenever any > > backend receives the barrier the system will always be in state #5. > > So what do we want to do with state #4? > > Well, if you don't have that, how does the checkpointer know that it's > supposed to push out the barrier? > > You and Amul both seem to want to merge states #4 and #5. But how to > make that work? Basically what you are both saying is that, after we > move into the "going read-write" state, backends aren't immediately > told that they can write WAL, but have to keep checking back. But this > could be expensive. If you have one state that means that the > checkpointer has been requested to run XLogAcceptWrites() and push out > a barrier, and another state to mean that it has done so, then you > avoid that. Maybe that overhead wouldn't be large anyway, but it seems > like it's only necessary because you're trying to merge two states > which, from a logical point of view, are separate. I don't have an objection to having 5 states, just wanted to understand your reasoning. So it makes sense to me. Thanks. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, May 12, 2021 at 1:39 AM Dilip Kumar wrote: > Your idea makes sense, but IMHO, if we are first writing > XLogAcceptWrites() and then pushing out the barrier, then I don't > understand the meaning of having state #4. I mean whenever any > backend receives the barrier the system will always be in state #5. > So what do we want to do with state #4? Well, if you don't have that, how does the checkpointer know that it's supposed to push out the barrier? You and Amul both seem to want to merge states #4 and #5. But how to make that work? Basically what you are both saying is that, after we move into the "going read-write" state, backends aren't immediately told that they can write WAL, but have to keep checking back. But this could be expensive. If you have one state that means that the checkpointer has been requested to run XLogAcceptWrites() and push out a barrier, and another state to mean that it has done so, then you avoid that. Maybe that overhead wouldn't be large anyway, but it seems like it's only necessary because you're trying to merge two states which, from a logical point of view, are separate. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, May 11, 2021 at 11:54 PM Robert Haas wrote: > > On Tue, May 11, 2021 at 11:17 AM Amul Sul wrote: > > I think I have much easier solution than this, will post that with update > > version patch set tomorrow. > > I don't know what you have in mind, but based on this discussion, it > seems to me that we should just have 5 states instead of 4: > > 1. WAL is permitted. > 2. WAL is being prohibited but some backends may not know about the change > yet. > 3. WAL is prohibited. > 4. WAL is in the process of being permitted but XLogAcceptWrites() may > not have been called yet. > 5. WAL is in the process of being permitted and XLogAcceptWrites() has > been called but some backends may not know about the change yet. > > If we're in state #3 and someone does pg_prohibit_wal(false) then we > enter state #4. The checkpointer calls XLogAcceptWrites(), moves us to > state #5, and pushes out a barrier. Then it waits for the barrier to > be absorbed and, when it has been, it moves us to state #1. Then if > someone does pg_prohibit_wal(true) we move to state #2. The > checkpointer pushes out a barrier and waits for it to be absorbed. > Then it calls XLogFlush() and afterward moves us to state #3. > > We can have any (reasonable) number of states that we want. There's > nothing magical about 4. Your idea makes sense, but IMHO, if we are first writing XLogAcceptWrites() and then pushing out the barrier, then I don't understand the meaning of having state #4. I mean whenever any backend receives the barrier the system will always be in state #5. So what do we want to do with state #4? Is it just to make the state machine better? I mean in the checkpoint process, we don't need separate "if checks" whether the XLogAcceptWrites() is called or not, instead we can just rely on the state, if it is #4 then we have to call XLogAcceptWrites(). If so then I think it's okay to have an additional state, just wanted to know what idea you had in mind? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, May 11, 2021 at 11:17 AM Amul Sul wrote: > I think I have much easier solution than this, will post that with update > version patch set tomorrow. I don't know what you have in mind, but based on this discussion, it seems to me that we should just have 5 states instead of 4: 1. WAL is permitted. 2. WAL is being prohibited but some backends may not know about the change yet. 3. WAL is prohibited. 4. WAL is in the process of being permitted but XLogAcceptWrites() may not have been called yet. 5. WAL is in the process of being permitted and XLogAcceptWrites() has been called but some backends may not know about the change yet. If we're in state #3 and someone does pg_prohibit_wal(false) then we enter state #4. The checkpointer calls XLogAcceptWrites(), moves us to state #5, and pushes out a barrier. Then it waits for the barrier to be absorbed and, when it has been, it moves us to state #1. Then if someone does pg_prohibit_wal(true) we move to state #2. The checkpointer pushes out a barrier and waits for it to be absorbed. Then it calls XLogFlush() and afterward moves us to state #3. We can have any (reasonable) number of states that we want. There's nothing magical about 4. I also entirely agree with Dilip that we should do some renaming to get rid of the read-write/read-only terminology, now that this is no longer part of the syntax. In fact I made the exact same point in my last review. The WALPROHIBIT_STATE_* constants are just one thing of many that needs to be included in that renaming. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, 11 May 2021 at 7:50 PM, Dilip Kumar wrote: > On Tue, May 11, 2021 at 6:56 PM Amul Sul wrote: > > > > On Tue, May 11, 2021 at 6:48 PM Dilip Kumar > wrote: > > > > > > On Tue, May 11, 2021 at 4:50 PM Amul Sul wrote: > > > > > > > > On Tue, May 11, 2021 at 4:13 PM Dilip Kumar > wrote: > > > > > I might be missing something, but assume the behavior should be > like this > > > > > > > > > > 1. If the state is getting changed from > WALPROHIBIT_STATE_READ_WRITE > > > > > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process > > > > > the barrier, we can immediately abort any read-write > transaction(and > > > > > stop allowing WAL writing), because once we ensure that all session > > > > > has responded that now they have no read-write transaction then we > can > > > > > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to > > > > > WALPROHIBIT_STATE_READ_ONLY. > > > > > > > > > > > > > Yes, that's what the current patch is doing from the first patch > version. > > > > > > > > > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY -> > > > > > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the > backend > > > > > to consider the system as read-write, instead, we should wait until > > > > > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE. > > > > > > > > > > > > > I am sure that only not enough will have the same issue where > > > > LocalXLogInsertAllowed gets set the same as the read-only as > described in > > > > my previous reply. > > > > > > Okay, but while browsing the code I do not see any direct if condition > > > based on the "LocalXLogInsertAllowed" variable, can you point me to > > > some references? > > > I only see one if check on this variable and that is in > > > XLogInsertAllowed() function, but now in XLogInsertAllowed() function, > > > you are already checking IsWALProhibited. No? > > > > > > > I am not sure I understood this. Where am I checking IsWALProhibited()? > > > > IsWALProhibited() is called by XLogInsertAllowed() once when > > LocalXLogInsertAllowed is in a reset state, and that result will be > > cached in LocalXLogInsertAllowed and will be used in the subsequent > > XLogInsertAllowed() call. > > Okay, got what you were trying to say. But that can be easily > fixable, I mean if the state is WALPROHIBIT_STATE_GOING_READ_WRITE > then what we can do is don't allow to write the WAL but let's not set > the LocalXLogInsertAllowed to 0. So until we are in the intermediate > state WALPROHIBIT_STATE_GOING_READ_WRITE, we will always have to rely > on GetWALProhibitState(), I know this will add a performance penalty > but this is for the short period until we are in the intermediate > state. After that as soon as it will set to > WALPROHIBIT_STATE_READ_WRITE then the XLogInsertAllowed() will set > LocalXLogInsertAllowed to 1. I think I have much easier solution than this, will post that with update version patch set tomorrow. Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, May 11, 2021 at 6:56 PM Amul Sul wrote: > > On Tue, May 11, 2021 at 6:48 PM Dilip Kumar wrote: > > > > On Tue, May 11, 2021 at 4:50 PM Amul Sul wrote: > > > > > > On Tue, May 11, 2021 at 4:13 PM Dilip Kumar wrote: > > > > I might be missing something, but assume the behavior should be like > > > > this > > > > > > > > 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE > > > > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process > > > > the barrier, we can immediately abort any read-write transaction(and > > > > stop allowing WAL writing), because once we ensure that all session > > > > has responded that now they have no read-write transaction then we can > > > > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to > > > > WALPROHIBIT_STATE_READ_ONLY. > > > > > > > > > > Yes, that's what the current patch is doing from the first patch version. > > > > > > > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY -> > > > > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend > > > > to consider the system as read-write, instead, we should wait until > > > > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE. > > > > > > > > > > I am sure that only not enough will have the same issue where > > > LocalXLogInsertAllowed gets set the same as the read-only as described in > > > my previous reply. > > > > Okay, but while browsing the code I do not see any direct if condition > > based on the "LocalXLogInsertAllowed" variable, can you point me to > > some references? > > I only see one if check on this variable and that is in > > XLogInsertAllowed() function, but now in XLogInsertAllowed() function, > > you are already checking IsWALProhibited. No? > > > > I am not sure I understood this. Where am I checking IsWALProhibited()? > > IsWALProhibited() is called by XLogInsertAllowed() once when > LocalXLogInsertAllowed is in a reset state, and that result will be > cached in LocalXLogInsertAllowed and will be used in the subsequent > XLogInsertAllowed() call. Okay, got what you were trying to say. But that can be easily fixable, I mean if the state is WALPROHIBIT_STATE_GOING_READ_WRITE then what we can do is don't allow to write the WAL but let's not set the LocalXLogInsertAllowed to 0. So until we are in the intermediate state WALPROHIBIT_STATE_GOING_READ_WRITE, we will always have to rely on GetWALProhibitState(), I know this will add a performance penalty but this is for the short period until we are in the intermediate state. After that as soon as it will set to WALPROHIBIT_STATE_READ_WRITE then the XLogInsertAllowed() will set LocalXLogInsertAllowed to 1. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, May 11, 2021 at 6:48 PM Dilip Kumar wrote: > > On Tue, May 11, 2021 at 4:50 PM Amul Sul wrote: > > > > On Tue, May 11, 2021 at 4:13 PM Dilip Kumar wrote: > > > I might be missing something, but assume the behavior should be like this > > > > > > 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE > > > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process > > > the barrier, we can immediately abort any read-write transaction(and > > > stop allowing WAL writing), because once we ensure that all session > > > has responded that now they have no read-write transaction then we can > > > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to > > > WALPROHIBIT_STATE_READ_ONLY. > > > > > > > Yes, that's what the current patch is doing from the first patch version. > > > > > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY -> > > > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend > > > to consider the system as read-write, instead, we should wait until > > > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE. > > > > > > > I am sure that only not enough will have the same issue where > > LocalXLogInsertAllowed gets set the same as the read-only as described in > > my previous reply. > > Okay, but while browsing the code I do not see any direct if condition > based on the "LocalXLogInsertAllowed" variable, can you point me to > some references? > I only see one if check on this variable and that is in > XLogInsertAllowed() function, but now in XLogInsertAllowed() function, > you are already checking IsWALProhibited. No? > I am not sure I understood this. Where am I checking IsWALProhibited()? IsWALProhibited() is called by XLogInsertAllowed() once when LocalXLogInsertAllowed is in a reset state, and that result will be cached in LocalXLogInsertAllowed and will be used in the subsequent XLogInsertAllowed() call. Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, May 11, 2021 at 4:50 PM Amul Sul wrote: > > On Tue, May 11, 2021 at 4:13 PM Dilip Kumar wrote: > > I might be missing something, but assume the behavior should be like this > > > > 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE > > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process > > the barrier, we can immediately abort any read-write transaction(and > > stop allowing WAL writing), because once we ensure that all session > > has responded that now they have no read-write transaction then we can > > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to > > WALPROHIBIT_STATE_READ_ONLY. > > > > Yes, that's what the current patch is doing from the first patch version. > > > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY -> > > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend > > to consider the system as read-write, instead, we should wait until > > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE. > > > > I am sure that only not enough will have the same issue where > LocalXLogInsertAllowed gets set the same as the read-only as described in > my previous reply. Okay, but while browsing the code I do not see any direct if condition based on the "LocalXLogInsertAllowed" variable, can you point me to some references? I only see one if check on this variable and that is in XLogInsertAllowed() function, but now in XLogInsertAllowed() function, you are already checking IsWALProhibited. No? > > Other than this point, I think the state names READ_ONLY, READ_WRITE > > are a bit confusing no? because actually, these states represent > > whether WAL is allowed or not, but READ_ONLY, READ_WRITE seems like we > > are putting the system under a Read-only state. For example, if you > > are doing some write operation on an unlogged table will be allowed, I > > guess because that will not generate the WAL until you commit (because > > commit generates WAL) right? so practically, we are just blocking the > > WAL, not the write operation. > > > > This read-only and read-write are the wal prohibited states though we > are using for read-only/read-write system in the discussion and the > complete macro name is WALPROHIBIT_STATE_READ_ONLY and > WALPROHIBIT_STATE_READ_WRITE, I am not sure why that would make > implementation confusing. Fine, I am not too particular about these names. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, May 11, 2021 at 4:13 PM Dilip Kumar wrote: > > On Tue, May 11, 2021 at 3:38 PM Amul Sul wrote: > > > > On Tue, May 11, 2021 at 2:26 PM Dilip Kumar wrote: > > > > > > On Tue, May 11, 2021 at 2:16 PM Amul Sul wrote: > > > > > > > I get why you think that, I wasn't very precise in briefing the problem. > > > > > > > > Any new backend that gets connected right after the shared memory > > > > state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by > > > > default allowed to do the WAL writes. Such backends can perform write > > > > operation before the checkpointer does the XLogAcceptWrites(). > > > > > > Okay, make sense now. But my next question is why do we allow backends > > > to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we > > > wait until the shared memory state is changed to > > > WALPROHIBIT_STATE_READ_WRITE? > > > > > > > Ok, good question. > > > > Now let's first try to understand the Checkpointer's work. > > > > When Checkpointer sees the wal prohibited state is an in-progress state, > > then > > it first emits the global barrier and waits until all backers absorb that. > > After that it set the final requested WAL prohibit state. > > > > When other backends absorb those barriers then appropriate action is taken > > (e.g. abort the read-write transaction if moving to read-only) by them. > > Also, > > LocalXLogInsertAllowed flags get reset in it and that backend needs to call > > XLogInsertAllowed() to get the right value for it, which further decides WAL > > writes permitted or prohibited. > > > > Consider an example that the system is trying to change to read-write and > > for > > that wal prohibited state is set to WALPROHIBIT_STATE_GOING_READ_WRITE > > before > > Checkpointer starts its work. If we want to treat that system as read-only > > for > > the WALPROHIBIT_STATE_GOING_READ_WRITE state as well. Then we might need to > > think about the behavior of the backend that has absorbed the barrier and > > reset > > the LocalXLogInsertAllowed flag. That backend eventually going to call > > XLogInsertAllowed() to get the actual value for it and by seeing the current > > state as WALPROHIBIT_STATE_GOING_READ_WRITE, it will set > > LocalXLogInsertAllowed > > again same as it was before for the read-only state. > > I might be missing something, but assume the behavior should be like this > > 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process > the barrier, we can immediately abort any read-write transaction(and > stop allowing WAL writing), because once we ensure that all session > has responded that now they have no read-write transaction then we can > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to > WALPROHIBIT_STATE_READ_ONLY. > Yes, that's what the current patch is doing from the first patch version. > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY -> > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend > to consider the system as read-write, instead, we should wait until > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE. > I am sure that only not enough will have the same issue where LocalXLogInsertAllowed gets set the same as the read-only as described in my previous reply. > So your problem is that on receiving the barrier we need to call > LocalXLogInsertAllowed() from the backend, but how does that matter? > you can still make IsWALProhibited() return true. > Note that LocalXLogInsertAllowed is a local flag for a backend, not a function, and in the server code at every place, we don't rely on IsWALProhibited() instead we do rely on LocalXLogInsertAllowed flags before wal writes and that check made via XLogInsertAllowed(). > I don't know the complete code so I might be missing something but at > least that is what I would expect from the design POV. > > > Other than this point, I think the state names READ_ONLY, READ_WRITE > are a bit confusing no? because actually, these states represent > whether WAL is allowed or not, but READ_ONLY, READ_WRITE seems like we > are putting the system under a Read-only state. For example, if you > are doing some write operation on an unlogged table will be allowed, I > guess because that will not generate the WAL until you commit (because > commit generates WAL) right? so practically, we are just blocking the > WAL, not the write operation. > This read-only and read-write are the wal prohibited states though we are using for read-only/read-write system in the discussion and the complete macro name is WALPROHIBIT_STATE_READ_ONLY and WALPROHIBIT_STATE_READ_WRITE, I am not sure why that would make implementation confusing. Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, May 11, 2021 at 3:38 PM Amul Sul wrote: > > On Tue, May 11, 2021 at 2:26 PM Dilip Kumar wrote: > > > > On Tue, May 11, 2021 at 2:16 PM Amul Sul wrote: > > > > > I get why you think that, I wasn't very precise in briefing the problem. > > > > > > Any new backend that gets connected right after the shared memory > > > state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by > > > default allowed to do the WAL writes. Such backends can perform write > > > operation before the checkpointer does the XLogAcceptWrites(). > > > > Okay, make sense now. But my next question is why do we allow backends > > to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we > > wait until the shared memory state is changed to > > WALPROHIBIT_STATE_READ_WRITE? > > > > Ok, good question. > > Now let's first try to understand the Checkpointer's work. > > When Checkpointer sees the wal prohibited state is an in-progress state, then > it first emits the global barrier and waits until all backers absorb that. > After that it set the final requested WAL prohibit state. > > When other backends absorb those barriers then appropriate action is taken > (e.g. abort the read-write transaction if moving to read-only) by them. Also, > LocalXLogInsertAllowed flags get reset in it and that backend needs to call > XLogInsertAllowed() to get the right value for it, which further decides WAL > writes permitted or prohibited. > > Consider an example that the system is trying to change to read-write and for > that wal prohibited state is set to WALPROHIBIT_STATE_GOING_READ_WRITE before > Checkpointer starts its work. If we want to treat that system as read-only > for > the WALPROHIBIT_STATE_GOING_READ_WRITE state as well. Then we might need to > think about the behavior of the backend that has absorbed the barrier and > reset > the LocalXLogInsertAllowed flag. That backend eventually going to call > XLogInsertAllowed() to get the actual value for it and by seeing the current > state as WALPROHIBIT_STATE_GOING_READ_WRITE, it will set > LocalXLogInsertAllowed > again same as it was before for the read-only state. I might be missing something, but assume the behavior should be like this 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process the barrier, we can immediately abort any read-write transaction(and stop allowing WAL writing), because once we ensure that all session has responded that now they have no read-write transaction then we can safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to WALPROHIBIT_STATE_READ_ONLY. 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY -> WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend to consider the system as read-write, instead, we should wait until the shared state is changed to WALPROHIBIT_STATE_READ_WRITE. So your problem is that on receiving the barrier we need to call LocalXLogInsertAllowed() from the backend, but how does that matter? you can still make IsWALProhibited() return true. I don't know the complete code so I might be missing something but at least that is what I would expect from the design POV. Other than this point, I think the state names READ_ONLY, READ_WRITE are a bit confusing no? because actually, these states represent whether WAL is allowed or not, but READ_ONLY, READ_WRITE seems like we are putting the system under a Read-only state. For example, if you are doing some write operation on an unlogged table will be allowed, I guess because that will not generate the WAL until you commit (because commit generates WAL) right? so practically, we are just blocking the WAL, not the write operation. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, May 11, 2021 at 2:26 PM Dilip Kumar wrote: > > On Tue, May 11, 2021 at 2:16 PM Amul Sul wrote: > > > I get why you think that, I wasn't very precise in briefing the problem. > > > > Any new backend that gets connected right after the shared memory > > state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by > > default allowed to do the WAL writes. Such backends can perform write > > operation before the checkpointer does the XLogAcceptWrites(). > > Okay, make sense now. But my next question is why do we allow backends > to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we > wait until the shared memory state is changed to > WALPROHIBIT_STATE_READ_WRITE? > Ok, good question. Now let's first try to understand the Checkpointer's work. When Checkpointer sees the wal prohibited state is an in-progress state, then it first emits the global barrier and waits until all backers absorb that. After that it set the final requested WAL prohibit state. When other backends absorb those barriers then appropriate action is taken (e.g. abort the read-write transaction if moving to read-only) by them. Also, LocalXLogInsertAllowed flags get reset in it and that backend needs to call XLogInsertAllowed() to get the right value for it, which further decides WAL writes permitted or prohibited. Consider an example that the system is trying to change to read-write and for that wal prohibited state is set to WALPROHIBIT_STATE_GOING_READ_WRITE before Checkpointer starts its work. If we want to treat that system as read-only for the WALPROHIBIT_STATE_GOING_READ_WRITE state as well. Then we might need to think about the behavior of the backend that has absorbed the barrier and reset the LocalXLogInsertAllowed flag. That backend eventually going to call XLogInsertAllowed() to get the actual value for it and by seeing the current state as WALPROHIBIT_STATE_GOING_READ_WRITE, it will set LocalXLogInsertAllowed again same as it was before for the read-only state. Now the question is when this value should get reset again so that backend can be read-write? We are done with a barrier and that backend never going to come back to read-write again. One solution, I think, is to set the final state before emitting the barrier but as per the current design that should get set after all barrier processing. Let's see what Robert says on this. Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, May 11, 2021 at 2:16 PM Amul Sul wrote: > I get why you think that, I wasn't very precise in briefing the problem. > > Any new backend that gets connected right after the shared memory > state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by > default allowed to do the WAL writes. Such backends can perform write > operation before the checkpointer does the XLogAcceptWrites(). Okay, make sense now. But my next question is why do we allow backends to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we wait until the shared memory state is changed to WALPROHIBIT_STATE_READ_WRITE? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, May 11, 2021 at 11:33 AM Dilip Kumar wrote: > > On Mon, May 10, 2021 at 10:25 PM Amul Sul wrote: > > > > Yes, we don't want any write slip in before UpdateFullPageWrites(). > > Recently[1], we have decided to let the Checkpointed process call > > XLogAcceptWrites() unconditionally. > > > > Here problem is that when a backend executes the > > pg_prohibit_wal(false) function to make the system read-write, the wal > > prohibited state is set to inprogress(ie. > > WALPROHIBIT_STATE_GOING_READ_WRITE) and then Checkpointer is signaled. > > Next, Checkpointer will convey this system change to all existing > > backends using a global barrier, and after that final wal prohibited > > state is set to the read-write(i.e. WALPROHIBIT_STATE_READ_WRITE). > > While Checkpointer is in the progress of conveying this global > > barrier, any new backend can connect at that time and can write a new > > record because the inprogress read-write state is equivalent to the > > final read-write state iff LocalXLogInsertAllowed != 0 for that > > backend. And, that new record could slip in before or in between > > records to be written by XLogAcceptWrites(). > > > > 1] > > http://postgr.es/m/CA+TgmoZYQN=rcye-ixwnjdvmaoh+7jaqsif3u2k8xqxipba...@mail.gmail.com > > But, IIUC, once the state is set to WALPROHIBIT_STATE_GOING_READ_WRITE > and signaled to the checkpointer. The checkpointer should first call > XLogAcceptWrites and then it should inform other backends through the > global barrier? Are we worried that if we have written the WAL in > XLogAcceptWrites but later if we could not set the state to > WALPROHIBIT_STATE_READ_WRITE? Then maybe we can inform all the > backend first but before setting the state to > WALPROHIBIT_STATE_READ_WRITE, we can call XLogAcceptWrites? > I get why you think that, I wasn't very precise in briefing the problem. Any new backend that gets connected right after the shared memory state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by default allowed to do the WAL writes. Such backends can perform write operation before the checkpointer does the XLogAcceptWrites(). Also, possible that a backend could connect at the same time checkpointer performing XLogAcceptWrites() and can write a wal. So, having XLogAcceptWrites() before does not really solve my concern. Note that the previous patch XLogAcceptWrites() does get called before global barrier emission. Please let me know if it is not yet cleared to you, thanks. Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
On Mon, May 10, 2021 at 10:25 PM Amul Sul wrote: > > Yes, we don't want any write slip in before UpdateFullPageWrites(). > Recently[1], we have decided to let the Checkpointed process call > XLogAcceptWrites() unconditionally. > > Here problem is that when a backend executes the > pg_prohibit_wal(false) function to make the system read-write, the wal > prohibited state is set to inprogress(ie. > WALPROHIBIT_STATE_GOING_READ_WRITE) and then Checkpointer is signaled. > Next, Checkpointer will convey this system change to all existing > backends using a global barrier, and after that final wal prohibited > state is set to the read-write(i.e. WALPROHIBIT_STATE_READ_WRITE). > While Checkpointer is in the progress of conveying this global > barrier, any new backend can connect at that time and can write a new > record because the inprogress read-write state is equivalent to the > final read-write state iff LocalXLogInsertAllowed != 0 for that > backend. And, that new record could slip in before or in between > records to be written by XLogAcceptWrites(). > > 1] > http://postgr.es/m/CA+TgmoZYQN=rcye-ixwnjdvmaoh+7jaqsif3u2k8xqxipba...@mail.gmail.com But, IIUC, once the state is set to WALPROHIBIT_STATE_GOING_READ_WRITE and signaled to the checkpointer. The checkpointer should first call XLogAcceptWrites and then it should inform other backends through the global barrier? Are we worried that if we have written the WAL in XLogAcceptWrites but later if we could not set the state to WALPROHIBIT_STATE_READ_WRITE? Then maybe we can inform all the backend first but before setting the state to WALPROHIBIT_STATE_READ_WRITE, we can call XLogAcceptWrites? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Mon, May 10, 2021 at 9:21 PM Robert Haas wrote: > > On Sun, May 9, 2021 at 1:26 AM Amul Sul wrote: > > The state in the control file also gets cleared. Though, after > > clearing in memory the state patch doesn't really do the immediate > > change to the control file, it relies on the next UpdateControlFile() > > to do that. > > But when will that happen? If you're relying on some very nearby code, > that might be OK, but perhaps a comment is in order. If you're just > thinking it's going to happen eventually, I think that's not good > enough. > Ok. > > Regarding log message I think I have skipped that intentionally, to > > avoid confusing log as "system is now read write" when we do start as > > hot-standby which is not really read-write. > > I think the message should not be phrased that way. In fact, I think > now that we've moved to calling this pg_prohibit_wal() rather than > ALTER SYSTEM READ ONLY, a lot of messages need to be rethought, and > maybe some comments and function names as well. Perhaps something > like: > > system is read only -> WAL is now prohibited > system is read write -> WAL is no longer prohibited > > And then for this particular case, maybe something like: > > clearing WAL prohibition because the system is in archive recovery > Ok, thanks for the suggestions. > > > The second part of this proposal was: > > > > > > "2. Create a new function with a name like XLogAcceptWrites(). Move the > > > following things from StartupXLOG() into that function: (1) the call > > > to UpdateFullPageWrites(), (2) the following block of code that does > > > either CreateEndOfRecoveryRecord() or RequestCheckpoint() or > > > CreateCheckPoint(), (3) the next block of code that runs > > > recovery_end_command, (4) the call to XLogReportParameters(), and (5) > > > the call to CompleteCommitTsInitialization(). Call the new function > > > from the place where we now call XLogReportParameters(). This would > > > mean that (1)-(3) happen later than they do now, which might require > > > some adjustments." > > > > > > Now you moved that code, but you also moved (6) > > > CompleteCommitTsInitialization(), (7) setting the control file to > > > DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and > > > (9) requesting a checkpoint if we were just promoted. That's not what > > > was proposed. One result of this is that the server now thinks it's in > > > recovery even after the startup process has exited. > > > RecoveryInProgress() is still returning true everywhere. But that is > > > inconsistent with what Andres and I were recommending in > > > http://postgr.es/m/CA+TgmoZYQN=rcye-ixwnjdvmaoh+7jaqsif3u2k8xqxipba...@mail.gmail.com > > > > Regarding modified approach, I tried to explain that why I did > > this in > > http://postgr.es/m/CAAJ_b96Yb4jaW6oU1bVYEBaf=tq-ql+mmt1exfwvnzer7xr...@mail.gmail.com > > I am not able to understand what problem you are seeing there. If > we're in crash recovery, then nobody can connect to the database, so > there can't be any concurrent activity. If we're in archive recovery, > we now clear the WAL-is-prohibited flag so that we will go read-write > directly at the end of recovery. We can and should refuse any effort > to call pg_prohibit_wal() during recovery. If we reached the end of > crash recovery and are now permitting read-only connections, why would > anyone be able to write WAL before the system has been changed to > read-write? If that can happen, it's a bug, not a reason to change the > design. > > Maybe your concern here is about ordering: the process that is going > to run XLogAcceptWrites() needs to allow xlog writes locally before we > tell other backends that they also can xlog writes; otherwise, some > other records could slip in before UpdateFullPageWrites() and similar > have run, which we probably don't want. But that's why > LocalSetXLogInsertAllowed() was invented, and if it doesn't quite do > what we need in this situation, we should be able to tweak it so it > does. > Yes, we don't want any write slip in before UpdateFullPageWrites(). Recently[1], we have decided to let the Checkpointed process call XLogAcceptWrites() unconditionally. Here problem is that when a backend executes the pg_prohibit_wal(false) function to make the system read-write, the wal prohibited state is set to inprogress(ie. WALPROHIBIT_STATE_GOING_READ_WRITE) and then Checkpointer is signaled. Next, Checkpointer will convey this system change to all existing backends using a global barrier, and after that final wal prohibited state is set to the read-write(i.e. WALPROHIBIT_STATE_READ_WRITE). While Checkpointer is in the progress of conveying this global barrier, any new backend can connect at that time and can write a new record because the inprogress read-write state is equivalent to the final read-write state iff LocalXLogInsertAllowed != 0 for that backend. And, that new record could slip in before or in between records to be written by XLogAcceptWrites().
Re: [Patch] ALTER SYSTEM READ ONLY
On Sun, May 9, 2021 at 1:26 AM Amul Sul wrote: > The state in the control file also gets cleared. Though, after > clearing in memory the state patch doesn't really do the immediate > change to the control file, it relies on the next UpdateControlFile() > to do that. But when will that happen? If you're relying on some very nearby code, that might be OK, but perhaps a comment is in order. If you're just thinking it's going to happen eventually, I think that's not good enough. > Regarding log message I think I have skipped that intentionally, to > avoid confusing log as "system is now read write" when we do start as > hot-standby which is not really read-write. I think the message should not be phrased that way. In fact, I think now that we've moved to calling this pg_prohibit_wal() rather than ALTER SYSTEM READ ONLY, a lot of messages need to be rethought, and maybe some comments and function names as well. Perhaps something like: system is read only -> WAL is now prohibited system is read write -> WAL is no longer prohibited And then for this particular case, maybe something like: clearing WAL prohibition because the system is in archive recovery > > The second part of this proposal was: > > > > "2. Create a new function with a name like XLogAcceptWrites(). Move the > > following things from StartupXLOG() into that function: (1) the call > > to UpdateFullPageWrites(), (2) the following block of code that does > > either CreateEndOfRecoveryRecord() or RequestCheckpoint() or > > CreateCheckPoint(), (3) the next block of code that runs > > recovery_end_command, (4) the call to XLogReportParameters(), and (5) > > the call to CompleteCommitTsInitialization(). Call the new function > > from the place where we now call XLogReportParameters(). This would > > mean that (1)-(3) happen later than they do now, which might require > > some adjustments." > > > > Now you moved that code, but you also moved (6) > > CompleteCommitTsInitialization(), (7) setting the control file to > > DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and > > (9) requesting a checkpoint if we were just promoted. That's not what > > was proposed. One result of this is that the server now thinks it's in > > recovery even after the startup process has exited. > > RecoveryInProgress() is still returning true everywhere. But that is > > inconsistent with what Andres and I were recommending in > > http://postgr.es/m/CA+TgmoZYQN=rcye-ixwnjdvmaoh+7jaqsif3u2k8xqxipba...@mail.gmail.com > > Regarding modified approach, I tried to explain that why I did > this in > http://postgr.es/m/CAAJ_b96Yb4jaW6oU1bVYEBaf=tq-ql+mmt1exfwvnzer7xr...@mail.gmail.com I am not able to understand what problem you are seeing there. If we're in crash recovery, then nobody can connect to the database, so there can't be any concurrent activity. If we're in archive recovery, we now clear the WAL-is-prohibited flag so that we will go read-write directly at the end of recovery. We can and should refuse any effort to call pg_prohibit_wal() during recovery. If we reached the end of crash recovery and are now permitting read-only connections, why would anyone be able to write WAL before the system has been changed to read-write? If that can happen, it's a bug, not a reason to change the design. Maybe your concern here is about ordering: the process that is going to run XLogAcceptWrites() needs to allow xlog writes locally before we tell other backends that they also can xlog writes; otherwise, some other records could slip in before UpdateFullPageWrites() and similar have run, which we probably don't want. But that's why LocalSetXLogInsertAllowed() was invented, and if it doesn't quite do what we need in this situation, we should be able to tweak it so it does. If your concern is something else, can you spell it out for me again because I'm not getting it? -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Fri, May 7, 2021 at 1:23 AM Robert Haas wrote: > > On Mon, Apr 12, 2021 at 10:04 AM Amul Sul wrote: > > Rebased again. > > I started to look at this today, and didn't get very far, but I have a > few comments. The main one is that I don't think this patch implements > the design proposed in > https://www.postgresql.org/message-id/CA+TgmoZ=cctbaxxmtyzogxegqzoz9smkbwrdpsacpjvfcgc...@mail.gmail.com > > The first part of that proposal said this: > > "1. If the server starts up and is read-only and > ArchiveRecoveryRequested, clear the read-only state in memory and also > in the control file, log a message saying that this has been done, and > proceed. This makes some other cases simpler to deal with." > > As I read it, the patch clears the read-only state in memory, does not > clear it in the control file, and does not log a message. > The state in the control file also gets cleared. Though, after clearing in memory the state patch doesn't really do the immediate change to the control file, it relies on the next UpdateControlFile() to do that. Regarding log message I think I have skipped that intentionally, to avoid confusing log as "system is now read write" when we do start as hot-standby which is not really read-write. > The second part of this proposal was: > > "2. Create a new function with a name like XLogAcceptWrites(). Move the > following things from StartupXLOG() into that function: (1) the call > to UpdateFullPageWrites(), (2) the following block of code that does > either CreateEndOfRecoveryRecord() or RequestCheckpoint() or > CreateCheckPoint(), (3) the next block of code that runs > recovery_end_command, (4) the call to XLogReportParameters(), and (5) > the call to CompleteCommitTsInitialization(). Call the new function > from the place where we now call XLogReportParameters(). This would > mean that (1)-(3) happen later than they do now, which might require > some adjustments." > > Now you moved that code, but you also moved (6) > CompleteCommitTsInitialization(), (7) setting the control file to > DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and > (9) requesting a checkpoint if we were just promoted. That's not what > was proposed. One result of this is that the server now thinks it's in > recovery even after the startup process has exited. > RecoveryInProgress() is still returning true everywhere. But that is > inconsistent with what Andres and I were recommending in > http://postgr.es/m/CA+TgmoZYQN=rcye-ixwnjdvmaoh+7jaqsif3u2k8xqxipba...@mail.gmail.com > Regarding modified approach, I tried to explain that why I did this in http://postgr.es/m/CAAJ_b96Yb4jaW6oU1bVYEBaf=tq-ql+mmt1exfwvnzer7xr...@mail.gmail.com > I also noticed that 0001 does not compile without 0002, so the > separation into multiple patches is not clean. I would actually > suggest that the first patch in the series should just create > XLogAcceptWrites() with the minimum amount of adjustment to make that > work. That would potentially let us commit that change independently, > which would be good, because then if we accidentally break something, > it'll be easier to pin down to that particular change instead of being > mixed with everything else this needs to change. > Ok, I will try in the next version. Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
On Mon, Apr 12, 2021 at 10:04 AM Amul Sul wrote: > Rebased again. I started to look at this today, and didn't get very far, but I have a few comments. The main one is that I don't think this patch implements the design proposed in https://www.postgresql.org/message-id/CA+TgmoZ=cctbaxxmtyzogxegqzoz9smkbwrdpsacpjvfcgc...@mail.gmail.com The first part of that proposal said this: "1. If the server starts up and is read-only and ArchiveRecoveryRequested, clear the read-only state in memory and also in the control file, log a message saying that this has been done, and proceed. This makes some other cases simpler to deal with." As I read it, the patch clears the read-only state in memory, does not clear it in the control file, and does not log a message. The second part of this proposal was: "2. Create a new function with a name like XLogAcceptWrites(). Move the following things from StartupXLOG() into that function: (1) the call to UpdateFullPageWrites(), (2) the following block of code that does either CreateEndOfRecoveryRecord() or RequestCheckpoint() or CreateCheckPoint(), (3) the next block of code that runs recovery_end_command, (4) the call to XLogReportParameters(), and (5) the call to CompleteCommitTsInitialization(). Call the new function from the place where we now call XLogReportParameters(). This would mean that (1)-(3) happen later than they do now, which might require some adjustments." Now you moved that code, but you also moved (6) CompleteCommitTsInitialization(), (7) setting the control file to DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and (9) requesting a checkpoint if we were just promoted. That's not what was proposed. One result of this is that the server now thinks it's in recovery even after the startup process has exited. RecoveryInProgress() is still returning true everywhere. But that is inconsistent with what Andres and I were recommending in http://postgr.es/m/CA+TgmoZYQN=rcye-ixwnjdvmaoh+7jaqsif3u2k8xqxipba...@mail.gmail.com I also noticed that 0001 does not compile without 0002, so the separation into multiple patches is not clean. I would actually suggest that the first patch in the series should just create XLogAcceptWrites() with the minimum amount of adjustment to make that work. That would potentially let us commit that change independently, which would be good, because then if we accidentally break something, it'll be easier to pin down to that particular change instead of being mixed with everything else this needs to change. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Mon, Apr 5, 2021 at 11:02 AM Amul Sul wrote: > > Attached is the rebase version for the latest master head(commit # > 9f6f1f9b8e6). Some minor comments on 0001: Isn't it "might not be running"? + errdetail("Checkpointer might not running."), Isn't it "Try again after sometime"? + errhint("Try after sometime again."))); Can we have ereport(DEBUG1 just to be consistent(although it doesn't make any difference from elog(DEBUG1) with the new log messages introduced in the patch? +elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change"); With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
Hi all, While testing this feature with v20-patch, the server is crashing with below steps. Steps to reproduce: 1. Configure master-slave replication setup. 2. Connect to Slave. 3. Execute below statements, it will crash the server: SELECT pg_prohibit_wal(true); SELECT pg_prohibit_wal(false); -- Slave: postgres=# select pg_is_in_recovery(); pg_is_in_recovery --- t (1 row) postgres=# SELECT pg_prohibit_wal(true); pg_prohibit_wal - (1 row) postgres=# SELECT pg_prohibit_wal(false); WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !?> -- Below are the stack trace: [prabhat@localhost bin]$ gdb -q -c /tmp/data_slave/core.35273 postgres Reading symbols from /home/prabhat/PG/PGsrcNew/postgresql/inst/bin/postgres...done. [New LWP 35273] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `postgres: checkpointer '. Program terminated with signal 6, Aborted. #0 0x7fa876233387 in raise () from /lib64/libc.so.6 Missing separate debuginfos, use: debuginfo-install glibc-2.17-317.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-50.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libselinux-2.5-15.el7.x86_64 openssl-libs-1.0.2k-21.el7_9.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64 (gdb) bt #0 0x7fa876233387 in raise () from /lib64/libc.so.6 #1 0x7fa876234a78 in abort () from /lib64/libc.so.6 #2 0x00aea31c in ExceptionalCondition (conditionName=0xb8c998 "ThisTimeLineID != 0 || IsBootstrapProcessingMode()", errorType=0xb8956d "FailedAssertion", fileName=0xb897c0 "xlog.c", lineNumber=8611) at assert.c:69 #3 0x00588eb5 in InitXLOGAccess () at xlog.c:8611 #4 0x00588ae6 in LocalSetXLogInsertAllowed () at xlog.c:8483 #5 0x005881bb in XLogAcceptWrites (needChkpt=true, xlogreader=0x0, EndOfLog=0, EndOfLogTLI=0) at xlog.c:8008 #6 0x005751ed in ProcessWALProhibitStateChangeRequest () at walprohibit.c:361 #7 0x0088c69f in CheckpointerMain () at checkpointer.c:355 #8 0x0059d7db in AuxiliaryProcessMain (argc=2, argv=0x7ffd1290d060) at bootstrap.c:455 #9 0x0089fc5f in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5416 #10 0x0089f782 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5128 #11 #12 0x7fa8762f2983 in __select_nocancel () from /lib64/libc.so.6 #13 0x0089b511 in ServerLoop () at postmaster.c:1700 #14 0x0089af00 in PostmasterMain (argc=5, argv=0x15b8460) at postmaster.c:1408 #15 0x0079c23a in main (argc=5, argv=0x15b8460) at main.c:209 (gdb) kindly let me know if you need more inputs on this. On Mon, Mar 15, 2021 at 12:56 PM Amul Sul wrote: > On Sun, Mar 14, 2021 at 11:51 PM Ibrar Ahmed > wrote: > > > > On Tue, Mar 9, 2021 at 3:31 PM Amul Sul wrote: > >> > >> On Thu, Mar 4, 2021 at 11:02 PM Amul Sul wrote: > >> > > >> > On Wed, Mar 3, 2021 at 8:56 PM Robert Haas > wrote: > >> > > > >> > > On Tue, Mar 2, 2021 at 7:22 AM Dilip Kumar > wrote: > >[] > > > > One of the patch > (v18-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch) from the > latest patchset does not apply successfully. > > > > http://cfbot.cputube.org/patch_32_2602.log > > > > === applying patch > ./v18-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch > > > > Hunk #15 succeeded at 2604 (offset -13 lines). > > 1 out of 15 hunks FAILED -- saving rejects to file > src/backend/access/nbtree/nbtpage.c.rej > > patching file src/backend/access/spgist/spgdoinsert.c > > > > It is a very minor change, so I rebased the patch. Please take a look, > if that works for you. > > > > Thanks, I am getting one more failure for the vacuumlazy.c. on the > latest master head(d75288fb27b), I fixed that in attached version. > > Regards, > Amul > -- With Regards, Prabhat Kumar Sahu EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Mar 3, 2021 at 8:56 PM Robert Haas wrote: > > On Tue, Mar 2, 2021 at 7:22 AM Dilip Kumar wrote: > > Why do we need to move promote related code in XLogAcceptWrites? > > IMHO, this promote related handling should be in StartupXLOG only. > > That will look cleaner. > > The key design question here, at least in my mind, is what exactly > happens after prohibit-WAL + system-crash + recovery-finishes. We > clearly can't write the checkpoint or end-of-recovery record and > proceed with business as usual, but are we still in recovery? Either > (1) we are technically still in recovery, stopping just short of > entering normal running, and will emerge from recovery when WAL is > permitted again; or (2) we have technically finished recovery, but > deferred some of the actions that would normally occur at that time > until a later point. Maybe this is academic distinction as much as > anything, but the idea is if we choose #1 then we should do as little > as possible at the point when recovery finishes and defer as much as > possible until we actually enter normal running; whereas if we choose > #2 we should do as much as possible at the point when recovery > finishes and defer only those things which absolutely have to be > deferred. That said, I and I think also Andres are voting for #2. > > But if we go that way, that precludes what you are proposing here. If > we picked #1 then it would be natural for the startup process to > remain active and the control file update to be postponed until WAL > writes are re-enabled; but under model #2 we want, if possible, for > the startup process to exit and the control file update to happen > normally, and only the writing of the actual WAL records to be > deferred. Maybe I did not put my point clearly, let me clarify that. First, I was also inclined that it should work like #2. And, if it works like #2 then I would assume that the code goes in XLogAcceptWrites function should be minimal, only those part which we want to execute after the system is back to read-write mode. So basically, the XLogAcceptWrites should only keep the code that is common code which we want to execute at the end of the StartupXLog if the system is normal or we want to execute when the system is back to read-write if it was read only. So my point was all the uncommon code that we have moved into XLogAcceptWrites should be kept inside the StartupXLog function only. So I think the promotion-related code doesn't belong to the XLogAcceptWrites function. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Mar 3, 2021 at 7:56 PM Dilip Kumar wrote: > > On Wed, Mar 3, 2021 at 4:50 PM Amul Sul wrote: > > > > On Wed, Mar 3, 2021 at 12:08 PM Dilip Kumar wrote: > > > > > Yes, it is possible to allow wal temporarily for itself by setting > > LocalXLogInsertAllowed, but when we request Checkpointer for the > > end-of-recovery > > checkpoint, the first thing it will do is that wal prohibit state transition > > then recovery-end-checkpoint. > > > > Also, allowing WAL write in read-only (WAL prohibited state) mode is against > > this feature principle. > > So IIUC before the checkpoint change the state in the control file we > anyway inform other backend and then they are allowed to write the WAL > is the right? If that is true then what is the problem in first doing > the pending post-recovery process and then informing the backend about > the state change. I mean we are in a process of changing the state to > read-write so why it is necessary to inform all the backend before we > can write WAL? Are we afraid that after we write the WAL and if there > is some failure before we make the system read-write then it will > break the principle of the feature, I mean eventually system will stay > read-only but we wrote the WAL? If so then currently, are we assuring > that once we inform the backend and backend are allowed to write the > WAL there are no chances of failure and the system state is guaranteed > to be changed. If all that is true then I will take my point back. > The wal prohibit state transition handling code is integrated into various places of the checkpointer process so that it can pick state changes as soon as possible. Before informing other backends we can do UpdateFullPageWrites() but when we try to next the end-of-recovery checkpoint write operation, the Checkpointer will hit ProcessWALProhibitStateChangeRequest() first which will try for the wal prohibit transition state completion and then write the checkpoint record. Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, Mar 2, 2021 at 7:22 AM Dilip Kumar wrote: > Why do we need to move promote related code in XLogAcceptWrites? > IMHO, this promote related handling should be in StartupXLOG only. > That will look cleaner. The key design question here, at least in my mind, is what exactly happens after prohibit-WAL + system-crash + recovery-finishes. We clearly can't write the checkpoint or end-of-recovery record and proceed with business as usual, but are we still in recovery? Either (1) we are technically still in recovery, stopping just short of entering normal running, and will emerge from recovery when WAL is permitted again; or (2) we have technically finished recovery, but deferred some of the actions that would normally occur at that time until a later point. Maybe this is academic distinction as much as anything, but the idea is if we choose #1 then we should do as little as possible at the point when recovery finishes and defer as much as possible until we actually enter normal running; whereas if we choose #2 we should do as much as possible at the point when recovery finishes and defer only those things which absolutely have to be deferred. That said, I and I think also Andres are voting for #2. But if we go that way, that precludes what you are proposing here. If we picked #1 then it would be natural for the startup process to remain active and the control file update to be postponed until WAL writes are re-enabled; but under model #2 we want, if possible, for the startup process to exit and the control file update to happen normally, and only the writing of the actual WAL records to be deferred. What I find much odder, looking at the present patch, is that PerformPendingStartupOperations() gets called from pg_prohibit_wal() rather than by the checkpointer. If the checkpointer is the process that is in charge of coordinating the change between a read-only state and a read-write state, then it ought to also do this. I also think that the PerformPendingStartupOperations() wrapper is unnecessary. Just invert the sense of the XLogCtl flag: xlogAllowWritesDone, rather than startupCrashRecoveryPending, and have XLogAcceptWrites() set it (and return without doing anything if it's already set). Then the checkpointer can just call the function unconditionally whenever we go read-write, and for a bonus we will have much better naming consistency, rather than calling the same thing "xlog accept writes" in one place, "pending startup operations" in another, and "startup crash recovery pending" in a third. Since this feature is basically no longer "alter system read only" but rather "pg_prohibit_wal" I think we also ought to rename the GUC, system_is_read_only -> wal_prohibited. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Mar 3, 2021 at 4:50 PM Amul Sul wrote: > > On Wed, Mar 3, 2021 at 12:08 PM Dilip Kumar wrote: > > > Yes, it is possible to allow wal temporarily for itself by setting > LocalXLogInsertAllowed, but when we request Checkpointer for the > end-of-recovery > checkpoint, the first thing it will do is that wal prohibit state transition > then recovery-end-checkpoint. > > Also, allowing WAL write in read-only (WAL prohibited state) mode is against > this feature principle. So IIUC before the checkpoint change the state in the control file we anyway inform other backend and then they are allowed to write the WAL is the right? If that is true then what is the problem in first doing the pending post-recovery process and then informing the backend about the state change. I mean we are in a process of changing the state to read-write so why it is necessary to inform all the backend before we can write WAL? Are we afraid that after we write the WAL and if there is some failure before we make the system read-write then it will break the principle of the feature, I mean eventually system will stay read-only but we wrote the WAL? If so then currently, are we assuring that once we inform the backend and backend are allowed to write the WAL there are no chances of failure and the system state is guaranteed to be changed. If all that is true then I will take my point back. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Mar 3, 2021 at 12:08 PM Dilip Kumar wrote: > > On Tue, Mar 2, 2021 at 9:01 PM Dilip Kumar wrote: > > > > > > > We don't want that to happen in cases where previous > > > recovery-end-checkpoint is > > > skipped in startup. We want Checkpointer first to convey the barrier to > > > all > > > backends but, the backend shouldn't write wal until the Checkpointer > > > writes > > > recovery-end-checkpoint record. > > > > > > To refrain these backends from writing WAL I think we should keep the > > > server in > > > crash recovery mode until UpdateFullPageWrites(), > > > end-of-recovery-checkpoint, and XLogReportParameters() are performed. > > I did not read the code for this, but let me ask something about this > case. Why do we want checkpointer to convey the barrier to all the > backend before completing the end of recovery checkpoint and other > stuff? Is it because the system is still in WAL prohibited state? Consider the previous case, where the user wants to change the system to read-write. When a permitted user executes pg_prohibit_wal(false), the wal prohibited state in shared memory updated to GOING_READ_WRITE which is the transition state and then waits until the transition state completes and the final state (i.e. READ_WRITE) gets updated in shared memory. To set the final set is a job of the Checkpointer process. We have integrated code into the Checkpointer process such that if it sees wal prohibit transition state then it completes that as soon as possible by doing necessary steps i.e. emitting super barriers, then update the final wal prohibited state in shared memory and in control file. > Is > it possible that as soon as we get the pg_prohibit_wal(false) request > the receiving backend start allowing the WAL writing for itself and > finish the all post-recovery pending work and then inform the > checkpointer to inform all other backends? > Yes, it is possible to allow wal temporarily for itself by setting LocalXLogInsertAllowed, but when we request Checkpointer for the end-of-recovery checkpoint, the first thing it will do is that wal prohibit state transition then recovery-end-checkpoint. Also, allowing WAL write in read-only (WAL prohibited state) mode is against this feature principle. Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, Mar 2, 2021 at 9:01 PM Dilip Kumar wrote: > > > > We don't want that to happen in cases where previous > > recovery-end-checkpoint is > > skipped in startup. We want Checkpointer first to convey the barrier to all > > backends but, the backend shouldn't write wal until the Checkpointer writes > > recovery-end-checkpoint record. > > > > To refrain these backends from writing WAL I think we should keep the > > server in > > crash recovery mode until UpdateFullPageWrites(), > > end-of-recovery-checkpoint, and XLogReportParameters() are performed. I did not read the code for this, but let me ask something about this case. Why do we want checkpointer to convey the barrier to all the backend before completing the end of recovery checkpoint and other stuff? Is it because the system is still in WAL prohibited state? Is it possible that as soon as we get the pg_prohibit_wal(false) request the receiving backend start allowing the WAL writing for itself and finish the all post-recovery pending work and then inform the checkpointer to inform all other backends? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, Mar 2, 2021 at 7:54 PM Amul Sul wrote: > XLogAcceptWrites() tried to club all the WAL write operations that happen at > the > end of StartupXLOG(). The only exception is that promotion checkpoint. Okay, I was expecting that XLogAcceptWrites should have all the WAL write-related operations which should either executed at the end of StartupXLOG() if the system is not read-only or after the system is set back to read-write. But promotion-related code is completely irrelevant when it is executed from PerformPendingStartupOperations. So I am not entirely sure that whether we want to keep those stuff in XLogAcceptWrites. > > That will look cleaner. > > I think it would be better to move the promotion checkpoint call inside > XLogAcceptWrites() as well. So that we can say XLogAcceptWrites() is a part of > StartupXLOG() does the required WAL writes. Thoughts? Okay so if we want to keep all the WAL write inside XLogAcceptWrites including promotion-related stuff then +1 for moving this also inside XLogAcceptWrites. > > > > > > 1] > > > http://postgr.es/m/CA+TgmoZ=cctbaxxmtyzogxegqzoz9smkbwrdpsacpjvfcgc...@mail.gmail.com > > > 2] > > > http://postgr.es/m/caaj_b97xx-nqrym_uxzecph9asgomrordnhrg1n51fdcdwo...@mail.gmail.com > > > > 2. > > I did not clearly understand your concern in point [2], because of > > which you have to postpone RECOVERY_STATE_DONE untill system is set > > back to read-write. Can you explain this? > > > > Sure, for that let me explain how this transition to read-write occurs. When > a > backend executes a function (ie. pg_prohibit_wal(false)) to make the system > read-write then that system state changes will be conveyed by the Checkpointer > process to all existing backends using global barrier and while Checkpointer > in > the progress of conveying this barrier, any existing backends who might have > absorbed barriers can write new records. > > We don't want that to happen in cases where previous recovery-end-checkpoint > is > skipped in startup. We want Checkpointer first to convey the barrier to all > backends but, the backend shouldn't write wal until the Checkpointer writes > recovery-end-checkpoint record. > > To refrain these backends from writing WAL I think we should keep the server > in > crash recovery mode until UpdateFullPageWrites(), > end-of-recovery-checkpoint, and XLogReportParameters() are performed. Thanks for the explanation. Now, I understand the problem, however, I am not sure that whether keeping the system in recovery is the best way to solve this but as of now I don't have anything better to suggest, and immediately I couldn’t think of any problem with this solution. But I will think about this again. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Tue, Mar 2, 2021 at 5:52 PM Dilip Kumar wrote: > > On Fri, Feb 19, 2021 at 5:43 PM Amul Sul wrote: > > > > In the attached version I have made the changes accordingly what Robert has > > summarised in his previous mail[1]. > > > > In addition to that, I also move the code that updates the control file to > > XLogAcceptWrites() which will also get skipped when the system is read-only > > (wal > > prohibited). The system will be in the crash recovery, and that will > > change once we do the end-of-recovery checkpoint and the WAL writes > > operation > > which we were skipping from startup. The benefit of keeping the system in > > recovery mode is that it fixes my concern[2] where other backends could > > connect > > and write wal records while we were changing the system to read-write. Now, > > no > > other backends allow a wal write; UpdateFullPageWrites(), end-of-recovery > > checkpoint, and XLogReportParameters() operations will be performed in the > > same > > sequence as it is in the startup while changing the system to read-write. > > I was looking into the changes espcially recovery related problem, I > have a few questions > > 1. > +static bool > +XLogAcceptWrites(bool needChkpt, bool bgwriterLaunched, > + bool localPromoteIsTriggered, XLogReaderState *xlogreader, > + bool archiveRecoveryRequested, TimeLineID endOfLogTLI, > + XLogRecPtr endOfLog, TimeLineID thisTimeLineID) > +{ > +boolpromoted = false; > + > +/* > . > +if (localPromoteIsTriggered) > { > -checkPointLoc = ControlFile->checkPoint; > +XLogRecord *record; > > ... > +record = ReadCheckpointRecord(xlogreader, > + ControlFile->checkPoint, > + 1, false); > if (record != NULL) > { > promoted = true; > ... > CreateEndOfRecoveryRecord(); > } > > Why do we need to move promote related code in XLogAcceptWrites? > IMHO, this promote related handling should be in StartupXLOG only. XLogAcceptWrites() tried to club all the WAL write operations that happen at the end of StartupXLOG(). The only exception is that promotion checkpoint. > That will look cleaner. I think it would be better to move the promotion checkpoint call inside XLogAcceptWrites() as well. So that we can say XLogAcceptWrites() is a part of StartupXLOG() does the required WAL writes. Thoughts? > > > > > 1] > > http://postgr.es/m/CA+TgmoZ=cctbaxxmtyzogxegqzoz9smkbwrdpsacpjvfcgc...@mail.gmail.com > > 2] > > http://postgr.es/m/caaj_b97xx-nqrym_uxzecph9asgomrordnhrg1n51fdcdwo...@mail.gmail.com > > 2. > I did not clearly understand your concern in point [2], because of > which you have to postpone RECOVERY_STATE_DONE untill system is set > back to read-write. Can you explain this? > Sure, for that let me explain how this transition to read-write occurs. When a backend executes a function (ie. pg_prohibit_wal(false)) to make the system read-write then that system state changes will be conveyed by the Checkpointer process to all existing backends using global barrier and while Checkpointer in the progress of conveying this barrier, any existing backends who might have absorbed barriers can write new records. We don't want that to happen in cases where previous recovery-end-checkpoint is skipped in startup. We want Checkpointer first to convey the barrier to all backends but, the backend shouldn't write wal until the Checkpointer writes recovery-end-checkpoint record. To refrain these backends from writing WAL I think we should keep the server in crash recovery mode until UpdateFullPageWrites(), end-of-recovery-checkpoint, and XLogReportParameters() are performed. Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
On Fri, Feb 19, 2021 at 5:43 PM Amul Sul wrote: > > In the attached version I have made the changes accordingly what Robert has > summarised in his previous mail[1]. > > In addition to that, I also move the code that updates the control file to > XLogAcceptWrites() which will also get skipped when the system is read-only > (wal > prohibited). The system will be in the crash recovery, and that will > change once we do the end-of-recovery checkpoint and the WAL writes operation > which we were skipping from startup. The benefit of keeping the system in > recovery mode is that it fixes my concern[2] where other backends could > connect > and write wal records while we were changing the system to read-write. Now, no > other backends allow a wal write; UpdateFullPageWrites(), end-of-recovery > checkpoint, and XLogReportParameters() operations will be performed in the > same > sequence as it is in the startup while changing the system to read-write. I was looking into the changes espcially recovery related problem, I have a few questions 1. +static bool +XLogAcceptWrites(bool needChkpt, bool bgwriterLaunched, + bool localPromoteIsTriggered, XLogReaderState *xlogreader, + bool archiveRecoveryRequested, TimeLineID endOfLogTLI, + XLogRecPtr endOfLog, TimeLineID thisTimeLineID) +{ +boolpromoted = false; + +/* . +if (localPromoteIsTriggered) { -checkPointLoc = ControlFile->checkPoint; +XLogRecord *record; ... +record = ReadCheckpointRecord(xlogreader, + ControlFile->checkPoint, + 1, false); if (record != NULL) { promoted = true; ... CreateEndOfRecoveryRecord(); } Why do we need to move promote related code in XLogAcceptWrites? IMHO, this promote related handling should be in StartupXLOG only. That will look cleaner. > > 1] > http://postgr.es/m/CA+TgmoZ=cctbaxxmtyzogxegqzoz9smkbwrdpsacpjvfcgc...@mail.gmail.com > 2] > http://postgr.es/m/caaj_b97xx-nqrym_uxzecph9asgomrordnhrg1n51fdcdwo...@mail.gmail.com 2. I did not clearly understand your concern in point [2], because of which you have to postpone RECOVERY_STATE_DONE untill system is set back to read-write. Can you explain this? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
Hi, On 2021-02-16 17:11:06 -0500, Robert Haas wrote: > I can't promise that what I'm about to write is an entirely faithful > representation of what he said, but hopefully it's not so far off that > he gets mad at me or something. Seems accurate - and also I'm way too tired that I'd be mad ;) > 1. If the server starts up and is read-only and > ArchiveRecoveryRequested, clear the read-only state in memory and also > in the control file, log a message saying that this has been done, and > proceed. This makes some other cases simpler to deal with. It seems also to make sense from a behaviour POV to me: Imagine a "smooth" planned failover with ASRO: 1) ASRO on primary 2) promote standby 3) edit primary config to include primary_conninfo, add standby.signal 4) restart "read only primary" There's not really any spot in which it'd be useful to do disable ASRO, right? But 4) should make the node a normal standby. Greetings, Andres Freund
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Jan 28, 2021 at 7:17 AM Amul Sul wrote: > I am still on this. The things that worried me here are the wal records > sequence > being written in the startup process -- UpdateFullPageWrites() generate record > just before the recovery check-point record and XLogReportParameters() just > after that but before any other backend could write any wal record. We might > also need to follow the same sequence while changing the system to read-write. I was able to chat with Andres about this topic for a while today and he made some proposals which seemed pretty good to me. I can't promise that what I'm about to write is an entirely faithful representation of what he said, but hopefully it's not so far off that he gets mad at me or something. 1. If the server starts up and is read-only and ArchiveRecoveryRequested, clear the read-only state in memory and also in the control file, log a message saying that this has been done, and proceed. This makes some other cases simpler to deal with. 2. Create a new function with a name like XLogAcceptWrites(). Move the following things from StartupXLOG() into that function: (1) the call to UpdateFullPageWrites(), (2) the following block of code that does either CreateEndOfRecoveryRecord() or RequestCheckpoint() or CreateCheckPoint(), (3) the next block of code that runs recovery_end_command, (4) the call to XLogReportParameters(), and (5) the call to CompleteCommitTsInitialization(). Call the new function from the place where we now call XLogReportParameters(). This would mean that (1)-(3) happen later than they do now, which might require some adjustments. 3. If the system is starting up read only (and the read-only state didn't get cleared because of #1 above) then don't call XLogAcceptWrites() at the end of StartupXLOG() and instead have the checkpointer do it later when the system is going read-write for the first time. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Jan 21, 2021 at 9:47 AM Amul Sul wrote: > It is not like that, let me explain. When a user backend requests to alter WAL > prohibit state by using ASRO/ASRW DDL with the previous patch or calling > pg_alter_wal_prohibit_state() then WAL prohibit state in shared memory will be > set to the transition state i.e. going-read-only or going-read-write if it is > not already. If another backend trying to request the same alteration to the > wal prohibit state then nothing going to be changed in shared memory but that > backend needs to wait until the transition to the final wal prohibited state > completes. If a backend tries to request for the opposite state than the > previous which is in progress then it will see an error as "system state > transition to read only/write is already in progress". At a time only one > transition state can be set. Hrm. Well, then that needs to be abundantly clear in the relevant comments. > Now, I am a little bit concerned about the current function name. How about > pg_set_wal_prohibit_state(bool) name or have two functions as > pg_set_wal_prohibit_state(void) and pg_unset_wal_prohibit_state(void) or any > other suggestions? How about pg_prohibit_wal(true|false)? > > It seems odd that ProcessWALProhibitStateChangeRequest() returns > > without doing anything if !AmCheckpointerProcess(), rather than having > > that be an Assert(). Why is it like that? > > Like AbsorbSyncRequests(). Well, that can be called not from the checkpointer, according to the comments. Specifically from the postmaster, I guess. Again, comments please. > If you think that there will be panic when UpdateFullPageWrites() and/or > XLogReportParameters() tries to write WAL since the shared memory state for > WAL > prohibited is set then it is not like that. For those functions, WAL write is > explicitly enabled by calling LocalSetXLogInsertAllowed(). > > I was under the impression that there won't be any problem if we allow the > writing WAL to UpdateFullPageWrites() and XLogReportParameters(). It can be > considered as an exception since it is fine that this WAL record is not > streamed > to standby while graceful failover, I may be wrong though. I don't think that's OK. I mean, the purpose of the feature is to prohibit WAL. If it doesn't do that, I believe it will fail to satisfy the principle of least surprise. > I am not clear on this part. In the attached version I am sending SIGUSR1 > instead of SIGINT, which works for me. OK. > The attached version does not address all your comments, I'll continue my work > on that. Some thoughts on this version: +/* Extract last two bits */ +#defineWALPROHIBIT_CURRENT_STATE(stateGeneration) \ + ((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1))) +#defineWALPROHIBIT_NEXT_STATE(stateGeneration) \ + WALPROHIBIT_CURRENT_STATE((stateGeneration + 1)) This is really confusing. First, the comment looks like it applies to both based on how it is positioned, but that's clearly not true. Second, the naming is really hard to understand. Third, there don't seem to be comments explaining the theory of what is going on here. Fourth, stateGeneration refers not to which generation of state we've got here but to the combination of the state and the generation. However, it's not clear that we ever really use the generation for anything. I think that the direction you went with this is somewhat different from what I had in mind. That may be OK, but let me just explain the difference. We both had in mind the idea that the low two bits of the state would represent the current state and the upper bits would represent the state generation. However, I wasn't necessarily imagining that the only supported operation was making the combined value go up by 1. For instance, I had thought that perhaps the effect of trying to go read-only when we're in the middle of going read-write would be to cancel the previous operation and start the new one. What you have instead is that it errors out. So in your model a change always has to finish before the next one can start, which in turn means that the sequence is completely linear. In my idea the state+generation might go from say 1 to 7, because trying to go read-write would cancel the previous attempt to go read-only and replace it with an attempt to go the other direction, and from 7 we might go to to 9 if somebody now tries to go read-only again before that finishes. In your model, there's never any sort of cancellation of that kind, so you can only go 0->1->2->3->4->5->6->7->8->9 etc. One disadvantage of the way you've got it from a user perspective is that if I'm writing a tool, I might get an error telling me that the state change I'm trying to make is already in progress, and then I have to retry. With the other design, I might attempt a state change and have it fail because the change can't be completed, but I won't ever fail because I attempt a state change and it can't be started
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Jan 14, 2021 at 6:29 AM Amul Sul wrote: > To move development, testing, and the review forward, I have commented out the > code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and > added the changes for the checkpointer so that system read-write state change > request can be processed as soon as possible, as suggested by Robert[1]. I am extremely doubtful about SetWALProhibitState()'s claim that "The final state can only be requested by the checkpointer or by the single-user so that there will be no chance that the server is already in the desired final state." It seems like there is an obvious race condition: CompleteWALProhibitChange() is called with a cur_state_gen argument which embeds the last state we saw, but there's nothing to keep it from changing between the time we saw it and the time that function calls SetWALProhibitState(), is there? We aren't holding any lock. It seems to me that SetWALProhibitState() needs to be rewritten to avoid this assumption. On a related note, SetWALProhibitState() has only two callers. One passes is_final_state as true, and the other as false: it's never a variable. The two cases are handled mostly differently. This doesn't seem good. A lot of the logic in this function should probably be moved to the calling sites, especially because it's almost certainly wrong for this function to be basing what it does on the *current* WAL prohibit state rather than the WAL prohibit state that was in effect at the time we made the decision to call this function in the first place. As I mentioned in the previous paragraph, that's a built-in race condition. To put that another way, this function should NOT feel free to call GetWALProhibitStateGen(). I don't really see why we should have both an SQL callable function pg_alter_wal_prohibit_state() and also a DDL command for this. If we're going to go with a functional interface, and I guess the idea of that is to make it so GRANT EXECUTE works, then why not just get rid of the DDL? RequestWALProhibitChange() doesn't look very nice. It seems like it's basically the second half of pg_alter_wal_prohibit_state(), not being called from anywhere else. It doesn't seem to add anything to separate it out like this; the interface between the two is not especially clean. It seems odd that ProcessWALProhibitStateChangeRequest() returns without doing anything if !AmCheckpointerProcess(), rather than having that be an Assert(). Why is it like that? I think WALProhibitStateShmemInit() would probably look more similar to other functions if it did if (found) { stuff; } rather than if (!found) return; stuff; -- but I might be wrong about the existing precedent. The SetLastCheckPointSkipped() and LastCheckPointIsSkipped() stuff seems confusingly-named, because we have other reasons for skipping a checkpoint that are not what we're talking about here. I think this is talking about whether we've performed a checkpoint after recovery, and the naming should reflect that. But I think there's something else wrong with the design, too: why is this protected by a spinlock? I have questions in both directions. On the one hand, I wonder why we need any kind of lock at all. On the other hand, if we do need a lock, I wonder why a spinlock that protects only the setting and clearing of the flag and nothing else is sufficient. There are zero comments explaining what the idea behind this locking regime is, and I can't understand why it should be correct. In fact, I think this area needs a broader rethink. Like, the way you integrated that stuff into StartupXLog(), it sure looks to me like we might skip the checkpoint but still try to write other WAL records. Before we reach the offending segment of code, we call UpdateFullPageWrites(). Afterwards, we call XLogReportParameters(). Both of those are going to potentially write WAL. I guess you could argue that's OK, on the grounds that neither function is necessarily going to log anything, but I don't think I believe that. If I make my server read only, take the OS down, change some GUCs, and then start it again, I don't expect it to PANIC. Also, I doubt that it's OK to skip the checkpoint as this code does and then go ahead and execute recovery_end_command and update the control file anyway. It sure looks like the existing code is written with the assumption that the checkpoint happens before those other things. One idea I just had was: suppose that, if the system is READ ONLY, we don't actually exit recovery right away, and the startup process doesn't exit. Instead we just sit there and wait for the system to be made read-write again before doing anything else. But then if hot_standby=false, there's no way for someone to execute a ALTER SYSTEM READ WRITE and/or pg_alter_wal_prohibit_state(), which seems bad. So perhaps we need to let in regular connections *as if* the system were read-write while postponing not just the end-of-recovery checkpoint but also the other associated
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Jan 14, 2021 at 6:29 AM Amul Sul wrote: > To move development, testing, and the review forward, I have commented out the > code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and > added the changes for the checkpointer so that system read-write state change > request can be processed as soon as possible, as suggested by Robert[1]. > > I have started a new thread[2] to understand the need for the CheckpointLock > in > CreateCheckPoint() function. Until then we can continue work on this feature > by > skipping CheckpointLock in CreateCheckPoint(), and therefore the 0003 patch is > marked WIP. Based on the favorable review comment from Andres upthread and also your feedback, I committed 0001. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [Patch] ALTER SYSTEM READ ONLY
On Mon, Dec 14, 2020 at 11:28 AM Amul Sul wrote: > > On Thu, Dec 10, 2020 at 6:04 AM Andres Freund wrote: > > > > Hi, > > > > On 2020-12-09 16:13:06 -0500, Robert Haas wrote: > > > That's not good. On a typical busy system, a system is going to be in > > > the middle of a checkpoint most of the time, and the checkpoint will > > > take a long time to finish - maybe minutes. > > > > Or hours, even. Due to the cost of FPWs it can make a lot of sense to > > reduce the frequency of that cost... > > > > > > > We want this feature to respond within milliseconds or a few seconds, > > > not minutes. So we need something better here. > > > > Indeed. > > > > > > > I'm inclined to think > > > that we should try to CompleteWALProhibitChange() at the same places > > > we AbsorbSyncRequests(). We know from experience that bad things > > > happen if we fail to absorb sync requests in a timely fashion, so we > > > probably have enough calls to AbsorbSyncRequests() to make sure that > > > we always do that work in a timely fashion. So, if we do this work in > > > the same place, then it will also be done in a timely fashion. > > > > Sounds sane, without having looked in detail. > > > > Understood & agreed that we need to change the system state as soon as > possible. > > I can see AbsorbSyncRequests() is called from 4 routing as > CheckpointWriteDelay(), ProcessSyncRequests(), SyncPostCheckpoint() and > CheckpointerMain(). Out of 4, the first three executes with an interrupt is > on > hod which will cause a problem when we do emit barrier and wait for those > barriers absorption by all the process including itself and will cause an > infinite wait. I think that can be fixed by teaching > WaitForProcSignalBarrier(), > do not wait on self to absorb barrier. Let that get absorbed at a later point > in time when the interrupt is resumed. I assumed that we cannot do barrier > processing right away since there could be other barriers (maybe in the > future) > including ours that should not process while the interrupt is on hold. > CreateCheckPoint() holds CheckpointLock LW at start and releases at the end which puts interrupt on hold. This kinda surprising that we were holding this lock and putting interrupt on hots for a long time. We do need that CheckpointLock just to ensure that one checkpoint happens at a time. Can't we do something easy to ensure that instead of the lock? Probably holding off interrupts for so long doesn't seem to be a good idea. Thoughts/Suggestions? Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Dec 10, 2020 at 6:04 AM Andres Freund wrote: > > Hi, > > On 2020-12-09 16:13:06 -0500, Robert Haas wrote: > > That's not good. On a typical busy system, a system is going to be in > > the middle of a checkpoint most of the time, and the checkpoint will > > take a long time to finish - maybe minutes. > > Or hours, even. Due to the cost of FPWs it can make a lot of sense to > reduce the frequency of that cost... > > > > We want this feature to respond within milliseconds or a few seconds, > > not minutes. So we need something better here. > > Indeed. > > > > I'm inclined to think > > that we should try to CompleteWALProhibitChange() at the same places > > we AbsorbSyncRequests(). We know from experience that bad things > > happen if we fail to absorb sync requests in a timely fashion, so we > > probably have enough calls to AbsorbSyncRequests() to make sure that > > we always do that work in a timely fashion. So, if we do this work in > > the same place, then it will also be done in a timely fashion. > > Sounds sane, without having looked in detail. > Understood & agreed that we need to change the system state as soon as possible. I can see AbsorbSyncRequests() is called from 4 routing as CheckpointWriteDelay(), ProcessSyncRequests(), SyncPostCheckpoint() and CheckpointerMain(). Out of 4, the first three executes with an interrupt is on hod which will cause a problem when we do emit barrier and wait for those barriers absorption by all the process including itself and will cause an infinite wait. I think that can be fixed by teaching WaitForProcSignalBarrier(), do not wait on self to absorb barrier. Let that get absorbed at a later point in time when the interrupt is resumed. I assumed that we cannot do barrier processing right away since there could be other barriers (maybe in the future) including ours that should not process while the interrupt is on hold. > > > I'm not 100% sure whether that introduces any other problems. > > Certainly, we're not going to be able to finish the checkpoint once > > we've gone read-only, so we'll fail when we try to write the WAL > > record for that, or maybe earlier if there's anything else that tries > > to write WAL. Either the checkpoint needs to error out, like any other > > attempt to write WAL, and we can attempt a new checkpoint if and when > > we go read/write, or else we need to finish writing stuff out to disk > > but not actually write the checkpoint completion record (or any other > > WAL) unless and until the system goes back into read/write mode - and > > then at that point the previously-started checkpoint will finish > > normally. The latter seems better if we can make it work, but the > > former is probably also acceptable. What you've got right now is not. > > I mostly wonder which of those two has which implications for how many > FPWs we need to redo. Presumably stalling but not cancelling the current > checkpoint is better? > Also, I like to uphold this idea of stalling a checkpointer's work in the middle instead of canceling it. But here, we need to take care of shutdown requests and death of postmaster cases that can cancel this stalling. If that happens we need to make sure that no unwanted wal insertion happens afterward and for that LocalXLogInsertAllowed flag needs to be updated correctly since the wal prohibits barrier processing was skipped for the checkpointer since it emits that barrier as mentioned above. Regards, Amul
Re: [Patch] ALTER SYSTEM READ ONLY
Hi, On 2020-12-09 16:13:06 -0500, Robert Haas wrote: > That's not good. On a typical busy system, a system is going to be in > the middle of a checkpoint most of the time, and the checkpoint will > take a long time to finish - maybe minutes. Or hours, even. Due to the cost of FPWs it can make a lot of sense to reduce the frequency of that cost... > We want this feature to respond within milliseconds or a few seconds, > not minutes. So we need something better here. Indeed. > I'm inclined to think > that we should try to CompleteWALProhibitChange() at the same places > we AbsorbSyncRequests(). We know from experience that bad things > happen if we fail to absorb sync requests in a timely fashion, so we > probably have enough calls to AbsorbSyncRequests() to make sure that > we always do that work in a timely fashion. So, if we do this work in > the same place, then it will also be done in a timely fashion. Sounds sane, without having looked in detail. > I'm not 100% sure whether that introduces any other problems. > Certainly, we're not going to be able to finish the checkpoint once > we've gone read-only, so we'll fail when we try to write the WAL > record for that, or maybe earlier if there's anything else that tries > to write WAL. Either the checkpoint needs to error out, like any other > attempt to write WAL, and we can attempt a new checkpoint if and when > we go read/write, or else we need to finish writing stuff out to disk > but not actually write the checkpoint completion record (or any other > WAL) unless and until the system goes back into read/write mode - and > then at that point the previously-started checkpoint will finish > normally. The latter seems better if we can make it work, but the > former is probably also acceptable. What you've got right now is not. I mostly wonder which of those two has which implications for how many FPWs we need to redo. Presumably stalling but not cancelling the current checkpoint is better? Greetings, Andres Freund
Re: [Patch] ALTER SYSTEM READ ONLY
On 2020-11-20 11:23:44 -0500, Robert Haas wrote: > Andres, do you like the new loop better? I do!
Re: [Patch] ALTER SYSTEM READ ONLY
On Sat, Sep 12, 2020 at 1:23 AM Amul Sul wrote: > > So, if we're in the middle of a paced checkpoint with a large > > checkpoint_timeout - a sensible real world configuration - we'll not > > process ASRO until that checkpoint is over? That seems very much not > > practical. What am I missing? > > Yes, the process doing ASRO will wait until that checkpoint is over. That's not good. On a typical busy system, a system is going to be in the middle of a checkpoint most of the time, and the checkpoint will take a long time to finish - maybe minutes. We want this feature to respond within milliseconds or a few seconds, not minutes. So we need something better here. I'm inclined to think that we should try to CompleteWALProhibitChange() at the same places we AbsorbSyncRequests(). We know from experience that bad things happen if we fail to absorb sync requests in a timely fashion, so we probably have enough calls to AbsorbSyncRequests() to make sure that we always do that work in a timely fashion. So, if we do this work in the same place, then it will also be done in a timely fashion. I'm not 100% sure whether that introduces any other problems. Certainly, we're not going to be able to finish the checkpoint once we've gone read-only, so we'll fail when we try to write the WAL record for that, or maybe earlier if there's anything else that tries to write WAL. Either the checkpoint needs to error out, like any other attempt to write WAL, and we can attempt a new checkpoint if and when we go read/write, or else we need to finish writing stuff out to disk but not actually write the checkpoint completion record (or any other WAL) unless and until the system goes back into read/write mode - and then at that point the previously-started checkpoint will finish normally. The latter seems better if we can make it work, but the former is probably also acceptable. What you've got right now is not. -- Robert Haas EDB: http://www.enterprisedb.com