subject:"\[Patch\] ALTER SYSTEM READ ONLY"

Re: [Patch] ALTER SYSTEM READ ONLY

2022-07-27 Thread Amul Sul

Hi,

On Thu, Jul 28, 2022 at 4:05 AM Jacob Champion  wrote:
>
> On Fri, Apr 8, 2022 at 7:27 AM Amul Sul  wrote:
> > Attached is rebase version for the latest maste head(#891624f0ec).
>
> Hi Amul,
>
> I'm going through past CF triage emails today; I noticed that this
> patch dropped out of the commitfest when you withdrew it in January,
> but it hasn't been added back with the most recent patchset you
> posted. Was that intended, or did you want to re-register it for
> review?
>

Yes, there is a plan to re-register it again but not anytime soon,
once we start to rework the design.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2022-07-27 Thread Jacob Champion

On Fri, Apr 8, 2022 at 7:27 AM Amul Sul  wrote:
> Attached is rebase version for the latest maste head(#891624f0ec).

Hi Amul,

I'm going through past CF triage emails today; I noticed that this
patch dropped out of the commitfest when you withdrew it in January,
but it hasn't been added back with the most recent patchset you
posted. Was that intended, or did you want to re-register it for
review?

--Jacob

Re: [Patch] ALTER SYSTEM READ ONLY

2022-04-26 Thread Amul Sul

On Sat, Apr 23, 2022 at 1:34 PM Bharath Rupireddy
 wrote:
>
> On Mon, Mar 15, 2021 at 12:56 PM Amul Sul  wrote:
> > >
> > > It is a very minor change, so I rebased the patch. Please take a look, if 
> > > that works for you.
> > >
> >
> > Thanks, I am getting one more failure for the vacuumlazy.c. on the
> > latest master head(d75288fb27b), I fixed that in attached version.
>
> Thanks Amul! I haven't looked at the whole thread, I may be repeating
> things here, please bear with me.
>

Np, thanks for looking into it.

> 1) Is the pg_prohibit_wal() only user sets the wal prohibit mode? Or
> do we still allow via 'ALTER SYSTEM READ ONLY/READ WRITE'? If not, I
> think the patches still have ALTER SYSTEM READ ONLY references.

Could you please point me to what those references are? I didn't find
any in the v45 version.

> 2) IIUC, the idea of this patch is not to generate any new WAL when
> set as default_transaction_read_only and transaction_read_only can't
> guarantee that?

No. Complete WAL write should be disabled, in other words XLogInsert()
should be restricted.

> 3) IMO, the function name pg_prohibit_wal doesn't look good where it
> also allows one to set WAL writes, how about the following functions -
> pg_prohibit_wal or pg_disallow_wal_{generation, inserts} or
> pg_allow_wal or pg_allow_wal_{generation, inserts} without any
> arguments and if needed a common function
> pg_set_wal_generation_state(read-only/read-write) something like that?

There are already similar suggestions before too, but none of that
finalized yet, there are other more challenges that need to be
handled, so we can keep this work at last.

> 4) It looks like only the checkpointer is setting the WAL prohibit
> state? Is there a strong reason for that? Why can't the backend take a
> lock on prohibit state in shared memory and set it and let the
> checkpointer read it and block itself from writing WAL?

Once WAL prohibited state transition is initiated and should be
completed, there is no fallback. What if the backed exit before the
complete transition? Similarly, even if the checkpointer exits,
that will be restarted again and will complete the state transition.

> 5) Is SIGUSR1 (which is multiplexed) being sent without a "reason" to
> checkpointer? Why?

Simply want to wake up the checkpointer process without asking for
specific work in the handle function. Another suitable choice will be
SIGINT, we can choose that too if needed.

> 6) What happens for long-running or in-progress transactions if
> someone prohibits WAL in the midst of them? Do these txns fail? Or do
> we say that we will allow them to run to completion? Or do we fail
> those txns at commit time? One might use this feature to say not let
> server go out of disk space, but if we allow in-progress txns to
> generate/write WAL, then how can one achieve that with this feature?
> Say, I monitor my server in such a way that at 90% of disk space,
> prohibit WAL to avoid server crash. But if this feature allows
> in-progress txns to generate WAL, then the server may still crash?

Read-only transactions will be allowed to continue, and if that
transaction tries to write or any other transaction that has performed
any writes already then the session running that transaction will be
terminated -- the design is described in the first mail of this
thread.

> 7) What are the other use-cases (I can think of - to avoid out of disk
> crashes, block/freeze writes to database when the server is
> compromised) with this feature? Any usages during/before failover,
> promotion or after it?

The important use case is for failover to avoid split-brain situations.

> 8) Is there a strong reason that we've picked up conditional variable
> wal_prohibit_cv over mutex/lock for updating WALProhibit shared
> memory?

I am not sure how that can be done using mutex or lock.

> 9) Any tests that you are planning to add?

Yes, we can. I have added very sophisticated tests that cover most of
my code changes, but that is not enough for such critical code
changes, have a lot of chances of improvement and adding more tests
for this module as well as other parts e.g. some missing coverage of
gin, gists, brin, core features where this patch is adding checks, etc.
Any help will be greatly appreciated.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2022-04-23 Thread Bharath Rupireddy

On Mon, Mar 15, 2021 at 12:56 PM Amul Sul  wrote:
> >
> > It is a very minor change, so I rebased the patch. Please take a look, if 
> > that works for you.
> >
>
> Thanks, I am getting one more failure for the vacuumlazy.c. on the
> latest master head(d75288fb27b), I fixed that in attached version.

Thanks Amul! I haven't looked at the whole thread, I may be repeating
things here, please bear with me.

1) Is the pg_prohibit_wal() only user sets the wal prohibit mode? Or
do we still allow via 'ALTER SYSTEM READ ONLY/READ WRITE'? If not, I
think the patches still have ALTER SYSTEM READ ONLY references.
2) IIUC, the idea of this patch is not to generate any new WAL when
set as default_transaction_read_only and transaction_read_only can't
guarantee that?
3) IMO, the function name pg_prohibit_wal doesn't look good where it
also allows one to set WAL writes, how about the following functions -
pg_prohibit_wal or pg_disallow_wal_{generation, inserts} or
pg_allow_wal or pg_allow_wal_{generation, inserts} without any
arguments and if needed a common function
pg_set_wal_generation_state(read-only/read-write) something like that?
4) It looks like only the checkpointer is setting the WAL prohibit
state? Is there a strong reason for that? Why can't the backend take a
lock on prohibit state in shared memory and set it and let the
checkpointer read it and block itself from writing WAL?
5) Is SIGUSR1 (which is multiplexed) being sent without a "reason" to
checkpointer? Why?
6) What happens for long-running or in-progress transactions if
someone prohibits WAL in the midst of them? Do these txns fail? Or do
we say that we will allow them to run to completion? Or do we fail
those txns at commit time? One might use this feature to say not let
server go out of disk space, but if we allow in-progress txns to
generate/write WAL, then how can one achieve that with this feature?
Say, I monitor my server in such a way that at 90% of disk space,
prohibit WAL to avoid server crash. But if this feature allows
in-progress txns to generate WAL, then the server may still crash?
7) What are the other use-cases (I can think of - to avoid out of disk
crashes, block/freeze writes to database when the server is
compromised) with this feature? Any usages during/before failover,
promotion or after it?
8) Is there a strong reason that we've picked up conditional variable
wal_prohibit_cv over mutex/lock for updating WALProhibit shared
memory?
9) Any tests that you are planning to add?

Regards,
Bharath Rupireddy.

Re: [Patch] ALTER SYSTEM READ ONLY

2021-11-23 Thread Amul Sul

   On Wed, Nov 17, 2021 at 6:20 PM Amul Sul  wrote:
>
>   On Wed, Nov 17, 2021 at 4:07 PM Amul Sul  wrote:
> >
> > On Wed, Nov 17, 2021 at 11:13 AM Amul Sul  wrote:
> > >
> > > On Sat, Nov 13, 2021 at 2:18 AM Robert Haas  wrote:
> > > >
> > > > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul  wrote:
> > > > > Attached is the rebased version of refactoring as well as the
> > > > > pg_prohibit_wal feature patches for the latest master head (commit #
> > > > > 39a3105678a).
> > > >
> > > > I spent a lot of time today studying 0002, and specifically the
> > > > question of whether EndOfLog must be the same as
> > > > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
> > > > XLogCtl->replayEndTLI.
> > > >
> > > > The answer to the former question is "no" because, if we don't enter
> > > > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
> > > > enter redo, then I think it has to be the same unless something very
> > > > weird happens. EndOfLog gets set like this:
> > > >
> > > > XLogBeginRead(xlogreader, LastRec);
> > > > record = ReadRecord(xlogreader, PANIC, false, replayTLI);
> > > > EndOfLog = EndRecPtr;
> > > >
> > > > In every case that exists in our regression tests, EndRecPtr is the
> > > > same before these three lines of code as it is afterward. However, if
> > > > you test with recovery_target=immediate, you can get it to be
> > > > different, because in that case we drop out of the redo loop after
> > > > calling recoveryStopsBefore() rather than after calling
> > > > recoveryStopsAfter(). Similarly I'm fairly sure that if you use
> > > > recovery_target_inclusive=off you can likewise get it to be different
> > > > (though I discovered the hard way that recovery_target_inclusive=off
> > > > is ignored when you use recovery_target_name). It seems like a really
> > > > bad thing that neither recovery_target=immediate nor
> > > > recovery_target_inclusive=off have any tests, and I think we ought to
> > > > add some.
> > > >
> > >
> > > recovery/t/003_recovery_targets.pl has test for
> > > recovery_target=immediate but not for recovery_target_inclusive=off, we
> > > can add that for recovery_target_lsn, recovery_target_time, and
> > > recovery_target_xid case only where it affects.
> > >
> > > > Anyway, in effect, these three lines of code have the effect of
> > > > backing up the xlogreader by one record when we stop before rather
> > > > than after a record that we're replaying. What that means is that
> > > > EndOfLog is going to be the end+1 of the last record that we actually
> > > > replayed. There might be one more record that we read but did not
> > > > replay, and that record won't impact the value we end up with in
> > > > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
> > > > record that we actually replayed. To put that another way, there's no
> > > > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
> > > > and before we change LastRec. So in the cases where
> > > > XLogCtl->replayEndRecPtr gets initialized at all, it can only be
> > > > different from EndOfLog if something different happens when we re-read
> > > > the last-replayed WAL record than what happened when we read it the
> > > > first time. That seems unlikely, and would be unfortunate it if it did
> > > > happen. I am inclined to think that it might be better not to reread
> > > > the record at all, though.
> > >
> > > There are two reasons that the record is reread; first, one that you
> > > have just explained where the redo loop drops out due to
> > > recoveryStopsBefore() and another one is that InRecovery is false.
> > >
> > > In the formal case at the end, redo while-loop does read a new record
> > > which in effect updates EndRecPtr and when we breaks the loop, we do
> > > reach the place where we do reread record -- where we do read the
> > > record (i.e. LastRec) before the record that redo loop has read and
> > > which correctly sets EndRecPtr. In the latter case, definitely, we
> > > don't need any adjustment to EndRecPtr.
> > >
> > > So technically one case needs reread but that is also not needed, we
> > > have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we
> > > do not need to reread the record, but EndOfLog and EndOfLogTLI should
> > > be set conditionally something like:
> > >
> > > if (InRecovery)
> > > {
> > > EndOfLog = XLogCtl->lastReplayedEndRecPtr;
> > > EndOfLogTLI = XLogCtl->lastReplayedTLI;
> > > }
> > > else
> > > {
> > > EndOfLog = EndRecPtr;
> > > EndOfLogTLI = replayTLI;
> > > }
> > >
> > > > As far as this patch goes, I think we need
> > > > a solution that doesn't involve fetching EndOfLog from a variable
> > > > that's only sometimes initialized and then not doing anything with it
> > > > except in the cases where it was initialized.
> > > >
> > >
> > > Another reason could be EndOfLog changes further in the following case:
> > >
> > > /*
> > >  * Actually, if WAL ended in an

Re: [Patch] ALTER SYSTEM READ ONLY

2021-11-17 Thread Amul Sul

  On Wed, Nov 17, 2021 at 4:07 PM Amul Sul  wrote:
>
> On Wed, Nov 17, 2021 at 11:13 AM Amul Sul  wrote:
> >
> > On Sat, Nov 13, 2021 at 2:18 AM Robert Haas  wrote:
> > >
> > > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul  wrote:
> > > > Attached is the rebased version of refactoring as well as the
> > > > pg_prohibit_wal feature patches for the latest master head (commit #
> > > > 39a3105678a).
> > >
> > > I spent a lot of time today studying 0002, and specifically the
> > > question of whether EndOfLog must be the same as
> > > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
> > > XLogCtl->replayEndTLI.
> > >
> > > The answer to the former question is "no" because, if we don't enter
> > > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
> > > enter redo, then I think it has to be the same unless something very
> > > weird happens. EndOfLog gets set like this:
> > >
> > > XLogBeginRead(xlogreader, LastRec);
> > > record = ReadRecord(xlogreader, PANIC, false, replayTLI);
> > > EndOfLog = EndRecPtr;
> > >
> > > In every case that exists in our regression tests, EndRecPtr is the
> > > same before these three lines of code as it is afterward. However, if
> > > you test with recovery_target=immediate, you can get it to be
> > > different, because in that case we drop out of the redo loop after
> > > calling recoveryStopsBefore() rather than after calling
> > > recoveryStopsAfter(). Similarly I'm fairly sure that if you use
> > > recovery_target_inclusive=off you can likewise get it to be different
> > > (though I discovered the hard way that recovery_target_inclusive=off
> > > is ignored when you use recovery_target_name). It seems like a really
> > > bad thing that neither recovery_target=immediate nor
> > > recovery_target_inclusive=off have any tests, and I think we ought to
> > > add some.
> > >
> >
> > recovery/t/003_recovery_targets.pl has test for
> > recovery_target=immediate but not for recovery_target_inclusive=off, we
> > can add that for recovery_target_lsn, recovery_target_time, and
> > recovery_target_xid case only where it affects.
> >
> > > Anyway, in effect, these three lines of code have the effect of
> > > backing up the xlogreader by one record when we stop before rather
> > > than after a record that we're replaying. What that means is that
> > > EndOfLog is going to be the end+1 of the last record that we actually
> > > replayed. There might be one more record that we read but did not
> > > replay, and that record won't impact the value we end up with in
> > > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
> > > record that we actually replayed. To put that another way, there's no
> > > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
> > > and before we change LastRec. So in the cases where
> > > XLogCtl->replayEndRecPtr gets initialized at all, it can only be
> > > different from EndOfLog if something different happens when we re-read
> > > the last-replayed WAL record than what happened when we read it the
> > > first time. That seems unlikely, and would be unfortunate it if it did
> > > happen. I am inclined to think that it might be better not to reread
> > > the record at all, though.
> >
> > There are two reasons that the record is reread; first, one that you
> > have just explained where the redo loop drops out due to
> > recoveryStopsBefore() and another one is that InRecovery is false.
> >
> > In the formal case at the end, redo while-loop does read a new record
> > which in effect updates EndRecPtr and when we breaks the loop, we do
> > reach the place where we do reread record -- where we do read the
> > record (i.e. LastRec) before the record that redo loop has read and
> > which correctly sets EndRecPtr. In the latter case, definitely, we
> > don't need any adjustment to EndRecPtr.
> >
> > So technically one case needs reread but that is also not needed, we
> > have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we
> > do not need to reread the record, but EndOfLog and EndOfLogTLI should
> > be set conditionally something like:
> >
> > if (InRecovery)
> > {
> > EndOfLog = XLogCtl->lastReplayedEndRecPtr;
> > EndOfLogTLI = XLogCtl->lastReplayedTLI;
> > }
> > else
> > {
> > EndOfLog = EndRecPtr;
> > EndOfLogTLI = replayTLI;
> > }
> >
> > > As far as this patch goes, I think we need
> > > a solution that doesn't involve fetching EndOfLog from a variable
> > > that's only sometimes initialized and then not doing anything with it
> > > except in the cases where it was initialized.
> > >
> >
> > Another reason could be EndOfLog changes further in the following case:
> >
> > /*
> >  * Actually, if WAL ended in an incomplete record, skip the parts that
> >  * made it through and start writing after the portion that persisted.
> >  * (It's critical to first write an OVERWRITE_CONTRECORD message, which
> >  * we'll do as soon as we're open for writing new WAL.)
>

Re: [Patch] ALTER SYSTEM READ ONLY

2021-11-17 Thread Amul Sul

On Wed, Nov 17, 2021 at 11:13 AM Amul Sul  wrote:
>
> On Sat, Nov 13, 2021 at 2:18 AM Robert Haas  wrote:
> >
> > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul  wrote:
> > > Attached is the rebased version of refactoring as well as the
> > > pg_prohibit_wal feature patches for the latest master head (commit #
> > > 39a3105678a).
> >
> > I spent a lot of time today studying 0002, and specifically the
> > question of whether EndOfLog must be the same as
> > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
> > XLogCtl->replayEndTLI.
> >
> > The answer to the former question is "no" because, if we don't enter
> > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
> > enter redo, then I think it has to be the same unless something very
> > weird happens. EndOfLog gets set like this:
> >
> > XLogBeginRead(xlogreader, LastRec);
> > record = ReadRecord(xlogreader, PANIC, false, replayTLI);
> > EndOfLog = EndRecPtr;
> >
> > In every case that exists in our regression tests, EndRecPtr is the
> > same before these three lines of code as it is afterward. However, if
> > you test with recovery_target=immediate, you can get it to be
> > different, because in that case we drop out of the redo loop after
> > calling recoveryStopsBefore() rather than after calling
> > recoveryStopsAfter(). Similarly I'm fairly sure that if you use
> > recovery_target_inclusive=off you can likewise get it to be different
> > (though I discovered the hard way that recovery_target_inclusive=off
> > is ignored when you use recovery_target_name). It seems like a really
> > bad thing that neither recovery_target=immediate nor
> > recovery_target_inclusive=off have any tests, and I think we ought to
> > add some.
> >
>
> recovery/t/003_recovery_targets.pl has test for
> recovery_target=immediate but not for recovery_target_inclusive=off, we
> can add that for recovery_target_lsn, recovery_target_time, and
> recovery_target_xid case only where it affects.
>
> > Anyway, in effect, these three lines of code have the effect of
> > backing up the xlogreader by one record when we stop before rather
> > than after a record that we're replaying. What that means is that
> > EndOfLog is going to be the end+1 of the last record that we actually
> > replayed. There might be one more record that we read but did not
> > replay, and that record won't impact the value we end up with in
> > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
> > record that we actually replayed. To put that another way, there's no
> > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
> > and before we change LastRec. So in the cases where
> > XLogCtl->replayEndRecPtr gets initialized at all, it can only be
> > different from EndOfLog if something different happens when we re-read
> > the last-replayed WAL record than what happened when we read it the
> > first time. That seems unlikely, and would be unfortunate it if it did
> > happen. I am inclined to think that it might be better not to reread
> > the record at all, though.
>
> There are two reasons that the record is reread; first, one that you
> have just explained where the redo loop drops out due to
> recoveryStopsBefore() and another one is that InRecovery is false.
>
> In the formal case at the end, redo while-loop does read a new record
> which in effect updates EndRecPtr and when we breaks the loop, we do
> reach the place where we do reread record -- where we do read the
> record (i.e. LastRec) before the record that redo loop has read and
> which correctly sets EndRecPtr. In the latter case, definitely, we
> don't need any adjustment to EndRecPtr.
>
> So technically one case needs reread but that is also not needed, we
> have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we
> do not need to reread the record, but EndOfLog and EndOfLogTLI should
> be set conditionally something like:
>
> if (InRecovery)
> {
> EndOfLog = XLogCtl->lastReplayedEndRecPtr;
> EndOfLogTLI = XLogCtl->lastReplayedTLI;
> }
> else
> {
> EndOfLog = EndRecPtr;
> EndOfLogTLI = replayTLI;
> }
>
> > As far as this patch goes, I think we need
> > a solution that doesn't involve fetching EndOfLog from a variable
> > that's only sometimes initialized and then not doing anything with it
> > except in the cases where it was initialized.
> >
>
> Another reason could be EndOfLog changes further in the following case:
>
> /*
>  * Actually, if WAL ended in an incomplete record, skip the parts that
>  * made it through and start writing after the portion that persisted.
>  * (It's critical to first write an OVERWRITE_CONTRECORD message, which
>  * we'll do as soon as we're open for writing new WAL.)
>  */
> if (!XLogRecPtrIsInvalid(missingContrecPtr))
> {
> Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
> EndOfLog = missingContrecPtr;
> }
>
> Now only solution that I can think is to copy EndOfLog (so
> EndOfLogTLI) into shared memory.
>
> >

Re: [Patch] ALTER SYSTEM READ ONLY

2021-11-16 Thread Amul Sul

On Sat, Nov 13, 2021 at 2:18 AM Robert Haas  wrote:
>
> On Mon, Nov 8, 2021 at 8:20 AM Amul Sul  wrote:
> > Attached is the rebased version of refactoring as well as the
> > pg_prohibit_wal feature patches for the latest master head (commit #
> > 39a3105678a).
>
> I spent a lot of time today studying 0002, and specifically the
> question of whether EndOfLog must be the same as
> XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
> XLogCtl->replayEndTLI.
>
> The answer to the former question is "no" because, if we don't enter
> redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
> enter redo, then I think it has to be the same unless something very
> weird happens. EndOfLog gets set like this:
>
> XLogBeginRead(xlogreader, LastRec);
> record = ReadRecord(xlogreader, PANIC, false, replayTLI);
> EndOfLog = EndRecPtr;
>
> In every case that exists in our regression tests, EndRecPtr is the
> same before these three lines of code as it is afterward. However, if
> you test with recovery_target=immediate, you can get it to be
> different, because in that case we drop out of the redo loop after
> calling recoveryStopsBefore() rather than after calling
> recoveryStopsAfter(). Similarly I'm fairly sure that if you use
> recovery_target_inclusive=off you can likewise get it to be different
> (though I discovered the hard way that recovery_target_inclusive=off
> is ignored when you use recovery_target_name). It seems like a really
> bad thing that neither recovery_target=immediate nor
> recovery_target_inclusive=off have any tests, and I think we ought to
> add some.
>

recovery/t/003_recovery_targets.pl has test for
recovery_target=immediate but not for recovery_target_inclusive=off, we
can add that for recovery_target_lsn, recovery_target_time, and
recovery_target_xid case only where it affects.

> Anyway, in effect, these three lines of code have the effect of
> backing up the xlogreader by one record when we stop before rather
> than after a record that we're replaying. What that means is that
> EndOfLog is going to be the end+1 of the last record that we actually
> replayed. There might be one more record that we read but did not
> replay, and that record won't impact the value we end up with in
> EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
> record that we actually replayed. To put that another way, there's no
> way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
> and before we change LastRec. So in the cases where
> XLogCtl->replayEndRecPtr gets initialized at all, it can only be
> different from EndOfLog if something different happens when we re-read
> the last-replayed WAL record than what happened when we read it the
> first time. That seems unlikely, and would be unfortunate it if it did
> happen. I am inclined to think that it might be better not to reread
> the record at all, though.

There are two reasons that the record is reread; first, one that you
have just explained where the redo loop drops out due to
recoveryStopsBefore() and another one is that InRecovery is false.

In the formal case at the end, redo while-loop does read a new record
which in effect updates EndRecPtr and when we breaks the loop, we do
reach the place where we do reread record -- where we do read the
record (i.e. LastRec) before the record that redo loop has read and
which correctly sets EndRecPtr. In the latter case, definitely, we
don't need any adjustment to EndRecPtr.

So technically one case needs reread but that is also not needed, we
have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we
do not need to reread the record, but EndOfLog and EndOfLogTLI should
be set conditionally something like:

if (InRecovery)
{
EndOfLog = XLogCtl->lastReplayedEndRecPtr;
EndOfLogTLI = XLogCtl->lastReplayedTLI;
}
else
{
EndOfLog = EndRecPtr;
EndOfLogTLI = replayTLI;
}

> As far as this patch goes, I think we need
> a solution that doesn't involve fetching EndOfLog from a variable
> that's only sometimes initialized and then not doing anything with it
> except in the cases where it was initialized.
>

Another reason could be EndOfLog changes further in the following case:

/*
 * Actually, if WAL ended in an incomplete record, skip the parts that
 * made it through and start writing after the portion that persisted.
 * (It's critical to first write an OVERWRITE_CONTRECORD message, which
 * we'll do as soon as we're open for writing new WAL.)
 */
if (!XLogRecPtrIsInvalid(missingContrecPtr))
{
Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
EndOfLog = missingContrecPtr;
}

Now only solution that I can think is to copy EndOfLog (so
EndOfLogTLI) into shared memory.

> As for EndOfLogTLI, I'm afraid I don't think that's the same thing as
> XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't
> think the regression tests contain any scenarios where we run recovery
> and the values end up different. However, I

Re: [Patch] ALTER SYSTEM READ ONLY

2021-11-12 Thread Robert Haas

On Mon, Nov 8, 2021 at 8:20 AM Amul Sul  wrote:
> Attached is the rebased version of refactoring as well as the
> pg_prohibit_wal feature patches for the latest master head (commit #
> 39a3105678a).

I spent a lot of time today studying 0002, and specifically the
question of whether EndOfLog must be the same as
XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
XLogCtl->replayEndTLI.

The answer to the former question is "no" because, if we don't enter
redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
enter redo, then I think it has to be the same unless something very
weird happens. EndOfLog gets set like this:

XLogBeginRead(xlogreader, LastRec);
record = ReadRecord(xlogreader, PANIC, false, replayTLI);
EndOfLog = EndRecPtr;

In every case that exists in our regression tests, EndRecPtr is the
same before these three lines of code as it is afterward. However, if
you test with recovery_target=immediate, you can get it to be
different, because in that case we drop out of the redo loop after
calling recoveryStopsBefore() rather than after calling
recoveryStopsAfter(). Similarly I'm fairly sure that if you use
recovery_target_inclusive=off you can likewise get it to be different
(though I discovered the hard way that recovery_target_inclusive=off
is ignored when you use recovery_target_name). It seems like a really
bad thing that neither recovery_target=immediate nor
recovery_target_inclusive=off have any tests, and I think we ought to
add some.

Anyway, in effect, these three lines of code have the effect of
backing up the xlogreader by one record when we stop before rather
than after a record that we're replaying. What that means is that
EndOfLog is going to be the end+1 of the last record that we actually
replayed. There might be one more record that we read but did not
replay, and that record won't impact the value we end up with in
EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
record that we actually replayed. To put that another way, there's no
way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
and before we change LastRec. So in the cases where
XLogCtl->replayEndRecPtr gets initialized at all, it can only be
different from EndOfLog if something different happens when we re-read
the last-replayed WAL record than what happened when we read it the
first time. That seems unlikely, and would be unfortunate it if it did
happen. I am inclined to think that it might be better not to reread
the record at all, though. As far as this patch goes, I think we need
a solution that doesn't involve fetching EndOfLog from a variable
that's only sometimes initialized and then not doing anything with it
except in the cases where it was initialized.

As for EndOfLogTLI, I'm afraid I don't think that's the same thing as
XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't
think the regression tests contain any scenarios where we run recovery
and the values end up different. However, I think that the code sets
EndOfLogTLI to the TLI of the last WAL file that we read, and I think
XLogCtl->replayEndTLI gets set to the timeline from which that WAL
record originated. So imagine that we are looking for WAL that ought
to be in 00010003 but we don't find it; instead we
find 00020003 because our recovery target timeline is
2, or something that has 2 in its history. We will read the WAL for
timeline 1 from this file which has timeline 2 in the file name. I
think if recovery ends in this file before the timeline switch, these
values will be different. I did not try to construct a test case for
this today due to not having enough time, so it's possible that I'm
wrong about this, but that's how it looks to me from the code.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-10-26 Thread Amul Sul

On Mon, Oct 25, 2021 at 8:15 PM Robert Haas  wrote:
>
> On Mon, Oct 25, 2021 at 3:05 AM Amul Sul  wrote:
> > Ok, did the same in the attached 0001 patch.
> >
> > There is no harm in calling LocalSetXLogInsertAllowed() calling
> > multiple times, but the problem I can see is that with this patch user
> > is allowed to call LocalSetXLogInsertAllowed() at the time it is
> > supposed not to be called e.g. when LocalXLogInsertAllowed = 0;
> > WAL writes are explicitly disabled.
>
> I've pushed 0001 and 0002 but I reversed the order of them and made a
> few other edits.
>

Thank you!

I have rebased the remaining patches on top of the latest master head
(commit # e63ce9e8d6a).

In addition to that, I did the additional changes to 0002 where I
haven't included the change that tries to remove arguments of
CleanupAfterArchiveRecovery() in the previous version. Because if we
want to use XLogCtl->replayEndTLI and XLogCtl->replayEndRecPtr to
replace EndOfLogTLI and EndOfLog arguments respectively, then we also
need to consider the case where EndOfLog is changing if the
abort-record does exist. That can be decided only in XLogAcceptWrite()
before the shared memory value related to abort-record is going to be
clear.

Regards,
Amul
From 14bfb4a91c8935efd17b50a0a0687101481b15c7 Mon Sep 17 00:00:00 2001
From: Amul Sul 
Date: Thu, 30 Sep 2021 06:29:06 -0400
Subject: [PATCH v40 2/2] Remove dependencies on startup-process specifical
 variables.

To make XLogAcceptWrites(), need to dependency on few global and local
variable spcific to startup process.

Global variables are abortedRecPtr, missingContrecPtr,
ArchiveRecoveryRequested and LocalPromoteIsTriggered, whereas
LocalPromoteIsTriggered can be accessed in any other process using
existing PromoteIsTriggered().  abortedRecPtr & ArchiveRecoveryRequested
is made accessible by copying into shared memory.  missingContrecPtr
can get from the existing shared memory values through
XLogCtl->lastSegSwitchLSN, which is not going to change until we use
it. That changes only when the current WAL segment gets full, there
won't be any WAL write until that point.

XLogAcceptWrites() accepts two argument as EndOfLogTLI and EndOfLog
which are local to StartupXLOG(). Instead of passing as an argument
XLogCtl->replayEndTLI and XLogCtl->replayEndRecPtr from the shared
memory can be used as an replacement to EndOfLogTLI and EndOfLog
respectively. EndOfLog will be changed if the abort record is exists
and in that case, the missingContrecPtr point will be considered as
the end of WAL since the further part going to be skipped anyway.

EndOfLogTLI in the StartupXLOG() is the timeline ID of the last record
that xlogreader reads, but this xlogreader was simply re-fetching the
last record which we have replied in redo loop if it was in recovery,
if not in recovery, we don't need to worry since this value is needed
only in case of ArchiveRecoveryRequested = true, which implicitly
forces redo and sets XLogCtl->replayEndTLI value.
---
 src/backend/access/transam/xlog.c | 82 +--
 1 file changed, 66 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8d72e1967d4..10159b3312b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -669,6 +669,13 @@ typedef struct XLogCtlData
 	 */
 	bool		SharedPromoteIsTriggered;

+	/*
+	 * SharedArchiveRecoveryRequested exports the value of the
+	 * ArchiveRecoveryRequested flag to be share which is otherwise valid only
+	 * in the startup process.
+	 */
+	bool		SharedArchiveRecoveryRequested;
+
 	/*
 	 * WalWriterSleeping indicates whether the WAL writer is currently in
 	 * low-power mode (and hence should be nudged if an async commit occurs).
@@ -718,6 +725,13 @@ typedef struct XLogCtlData
 	/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
 	TimestampTz recoveryLastXTime;

+	/*
+	 * SharedAbortedRecPtr exports abortedRecPtr to be shared with another
+	 * process to write OVERWRITE_CONTRECORD message, if WAL writes are not
+	 * permitted in the current process which reads that.
+	 */
+	XLogRecPtr	SharedAbortedRecPtr;
+
 	/*
 	 * timestamp of when we started replaying the current chunk of WAL data,
 	 * only relevant for replication or archive recovery
@@ -940,7 +954,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 			  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
-static bool XLogAcceptWrites(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog);
+static bool XLogAcceptWrites(void);
 static bool PerformRecoveryXLogAction(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 		XLogRecPtr RecPtr, int whichChkpt, bool report);
@@ -5268,7 +5282,9 @@ XLOGShmemInit(void)
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->InstallXLogFileSegmentActive = false;

Re: [Patch] ALTER SYSTEM READ ONLY

2021-10-25 Thread Bossart, Nathan

On 10/25/21, 1:33 PM, "Robert Haas"  wrote:
> On Mon, Oct 25, 2021 at 3:14 PM Bossart, Nathan  wrote:
>> My compiler is complaining about oldXLogAllowed possibly being used
>> uninitialized in CreateCheckPoint().  AFAICT it can just be initially
>> set to zero to silence this warning because it will, in fact, be
>> initialized properly when it is used.
>
> Hmm, I guess I could have foreseen that, had I been a little bit
> smarter than I am. I have committed a change to initialize it to 0 as
> you propose.

Thanks!

Nathan

Re: [Patch] ALTER SYSTEM READ ONLY

2021-10-25 Thread Robert Haas

On Mon, Oct 25, 2021 at 3:14 PM Bossart, Nathan  wrote:
> My compiler is complaining about oldXLogAllowed possibly being used
> uninitialized in CreateCheckPoint().  AFAICT it can just be initially
> set to zero to silence this warning because it will, in fact, be
> initialized properly when it is used.

Hmm, I guess I could have foreseen that, had I been a little bit
smarter than I am. I have committed a change to initialize it to 0 as
you propose.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-10-25 Thread Bossart, Nathan

On 10/25/21, 7:50 AM, "Robert Haas"  wrote:
> I've pushed 0001 and 0002 but I reversed the order of them and made a
> few other edits.

My compiler is complaining about oldXLogAllowed possibly being used
uninitialized in CreateCheckPoint().  AFAICT it can just be initially
set to zero to silence this warning because it will, in fact, be
initialized properly when it is used.

Nathan

Re: [Patch] ALTER SYSTEM READ ONLY

2021-10-25 Thread Robert Haas

On Mon, Oct 25, 2021 at 3:05 AM Amul Sul  wrote:
> Ok, did the same in the attached 0001 patch.
>
> There is no harm in calling LocalSetXLogInsertAllowed() calling
> multiple times, but the problem I can see is that with this patch user
> is allowed to call LocalSetXLogInsertAllowed() at the time it is
> supposed not to be called e.g. when LocalXLogInsertAllowed = 0;
> WAL writes are explicitly disabled.

I've pushed 0001 and 0002 but I reversed the order of them and made a
few other edits.

I don't really see the issue you mention here as a problem. There's
only one place where we set LocalXLogInsertAllowed = 0, and I don't
know that we'll ever have another one.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-10-25 Thread Amul Sul

On Tue, Oct 19, 2021 at 3:50 AM Robert Haas  wrote:
>
> On Mon, Oct 18, 2021 at 9:54 AM Amul Sul  wrote:
> > I tried this in the attached version, but I'm a bit skeptical with
> > changes that are needed for CreateCheckPoint(), those don't seem to be
> > clean.
>
> Yeah, that doesn't look great. I don't think it's entirely correct,
> actually, because surely you want LocalXLogInsertAllowed = 0 to be
> executed even if !IsPostmasterEnvironment. It's only
> LocalXLogInsertAllowed = -1 that we would want to have depend on
> IsPostmasterEnvironment. But that's pretty ugly too: I guess the
> reason it has to be like is that, if it does that unconditionally, it
> will overwrite the temporary value of 1 set by the caller, which will
> then cause problems when the caller tries to XLogReportParameters().
>
> I think that problem goes away if we drive the decision off of shared
> state rather than a local variable, but I agree that it's otherwise a
> bit tricky to untangle. One idea might be to have
> LocalSetXLogInsertAllowed return the old value. Then we could use the
> same kind of coding we do when switching memory contexts, where we
> say:
>
> oldcontext = MemoryContextSwitchTo(something);
> // do stuff
> MemoryContextSwitchTo(oldcontext);
>
> Here we could maybe do:
>
> oldxlallowed = LocalSetXLogInsertAllowed();
> // do stuff
> XLogInsertAllowed = oldxlallowed;
>

Ok, did the same in the attached 0001 patch.

There is no harm in calling LocalSetXLogInsertAllowed() calling
multiple times, but the problem I can see is that with this patch user
is allowed to call LocalSetXLogInsertAllowed() at the time it is
supposed not to be called e.g. when LocalXLogInsertAllowed = 0;
WAL writes are explicitly disabled.

> That way, instead of CreateCheckPoint() knowing under what
> circumstances the caller might have changed the value, it only knows
> that some callers might have already changed the value. That seems
> better.
>
> > I agree that calling InitXLOGAccess() from RecoveryInProgress() is not
> > good, but I am not sure about calling it from XLogInsertAllowed()
> > either, perhaps, both are status check function and general
> > expectations might be that status checking functions are not going
> > change and/or initialize the system state. InitXLOGAccess() should
> > get called from the very first WAL write operation if needed, but if
> > we don't want to do that, then I would prefer to call InitXLOGAccess()
> > from XLogInsertAllowed() instead of RecoveryInProgress().
>
> Well, that's a fair point, too, but it might not be safe to, say, move
> this to XLogBeginInsert(). Like, imagine that there's a hypothetical
> piece of code that looks like this:
>
> if (RecoveryInProgress())
> ereport(ERROR, errmsg("can't do that in recovery")));
>
> // do something here that depends on ThisTimeLineID or
> wal_segment_size or RedoRecPtr
>
> XLogBeginInsert();
> 
> lsn = XLogInsert(...);
>
> Such code would work correctly the way things are today, but if the
> InitXLOGAccess() call were deferred until XLogBeginInsert() time, then
> it would fail.
>
> I was curious whether this is just a theoretical problem. It turns out
> that it's not. I wrote a couple of just-for-testing patches, which I
> attach here. The first one just adjusts things so that we'll fail an
> assertion if we try to make use of ThisTimeLineID before we've set it
> to a legal value. I had to exempt two places from these checks just
> for 'make check-world' to pass; these are shown in the patch, and one
> or both of them might be existing bugs -- or maybe not, I haven't
> looked too deeply. The second one then adjusts the patch to pretend
> that ThisTimeLineID is not necessarily valid just because we've called
> InitXLOGAccess() but that it is valid after XLogBeginInsert(). With
> that change, I find about a dozen places where, apparently, the early
> call to InitXLOGAccess() is critical to getting ThisTimeLineID
> adjusted in time. So apparently a change of this type is not entirely
> trivial. And this is just a quick test, and just for one of the three
> things that get initialized here.
>
> On the other hand, just moving it to XLogInsertAllowed() isn't
> risk-free either and would likely require adjusting some of the same
> places I found with this test. So I guess if we want to do something
> like this we need more study.
>

Yeah, that requires a lot of energy and time -- not done anything
related to this in the attached version.

Please have a look at the attached version where the 0001 patch does
change LocalSetXLogInsertAllowed() as said before. 0002 patch moves
XLogReportParameters() closer to other wal write operations and
removes unnecessary LocalSetXLogInsertAllowed() calls. 0003 is code
movements adds XLogAcceptWrites() function same as the before, and
0004 patch tries to remove the dependency. 0004 patch could change
w.r.t. decision that is going to be made for the patch that I
posted[1] to remove abortedRecPtr global variable. For now, I

Re: [Patch] ALTER SYSTEM READ ONLY

2021-10-18 Thread Robert Haas

On Mon, Oct 18, 2021 at 9:54 AM Amul Sul  wrote:
> I tried this in the attached version, but I'm a bit skeptical with
> changes that are needed for CreateCheckPoint(), those don't seem to be
> clean.

Yeah, that doesn't look great. I don't think it's entirely correct,
actually, because surely you want LocalXLogInsertAllowed = 0 to be
executed even if !IsPostmasterEnvironment. It's only
LocalXLogInsertAllowed = -1 that we would want to have depend on
IsPostmasterEnvironment. But that's pretty ugly too: I guess the
reason it has to be like is that, if it does that unconditionally, it
will overwrite the temporary value of 1 set by the caller, which will
then cause problems when the caller tries to XLogReportParameters().

I think that problem goes away if we drive the decision off of shared
state rather than a local variable, but I agree that it's otherwise a
bit tricky to untangle. One idea might be to have
LocalSetXLogInsertAllowed return the old value. Then we could use the
same kind of coding we do when switching memory contexts, where we
say:

oldcontext = MemoryContextSwitchTo(something);
// do stuff
MemoryContextSwitchTo(oldcontext);

Here we could maybe do:

oldxlallowed = LocalSetXLogInsertAllowed();
// do stuff
XLogInsertAllowed = oldxlallowed;

That way, instead of CreateCheckPoint() knowing under what
circumstances the caller might have changed the value, it only knows
that some callers might have already changed the value. That seems
better.

> I agree that calling InitXLOGAccess() from RecoveryInProgress() is not
> good, but I am not sure about calling it from XLogInsertAllowed()
> either, perhaps, both are status check function and general
> expectations might be that status checking functions are not going
> change and/or initialize the system state. InitXLOGAccess() should
> get called from the very first WAL write operation if needed, but if
> we don't want to do that, then I would prefer to call InitXLOGAccess()
> from XLogInsertAllowed() instead of RecoveryInProgress().

Well, that's a fair point, too, but it might not be safe to, say, move
this to XLogBeginInsert(). Like, imagine that there's a hypothetical
piece of code that looks like this:

if (RecoveryInProgress())
ereport(ERROR, errmsg("can't do that in recovery")));

// do something here that depends on ThisTimeLineID or
wal_segment_size or RedoRecPtr

XLogBeginInsert();

lsn = XLogInsert(...);

Such code would work correctly the way things are today, but if the
InitXLOGAccess() call were deferred until XLogBeginInsert() time, then
it would fail.

I was curious whether this is just a theoretical problem. It turns out
that it's not. I wrote a couple of just-for-testing patches, which I
attach here. The first one just adjusts things so that we'll fail an
assertion if we try to make use of ThisTimeLineID before we've set it
to a legal value. I had to exempt two places from these checks just
for 'make check-world' to pass; these are shown in the patch, and one
or both of them might be existing bugs -- or maybe not, I haven't
looked too deeply. The second one then adjusts the patch to pretend
that ThisTimeLineID is not necessarily valid just because we've called
InitXLOGAccess() but that it is valid after XLogBeginInsert(). With
that change, I find about a dozen places where, apparently, the early
call to InitXLOGAccess() is critical to getting ThisTimeLineID
adjusted in time. So apparently a change of this type is not entirely
trivial. And this is just a quick test, and just for one of the three
things that get initialized here.

On the other hand, just moving it to XLogInsertAllowed() isn't
risk-free either and would likely require adjusting some of the same
places I found with this test. So I guess if we want to do something
like this we need more study.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

0002-Pretend-it-s-not-valid-until-XLogBeginInsert.patch
Description: Binary data

0001-Test-code-to-see-whether-we-have-always-properly-ini.patch
Description: Binary data

Re: [Patch] ALTER SYSTEM READ ONLY

2021-10-18 Thread Amul Sul

On Thu, Oct 14, 2021 at 11:10 PM Robert Haas  wrote:
>
> On Tue, Oct 12, 2021 at 8:18 AM Amul Sul  wrote:
> > In the attached version I have fixed this issue by restoring 
> > missingContrecPtr.
> >
> > To handle abortedRecPtr and missingContrecPtr newly added global
> > variables thought the commit # ff9f111bce24, we don't need to store
> > them in the shared memory separately, instead, we need a flag that
> > indicates a broken record found previously, at the end of recovery, so
> > that we can overwrite contrecord.
> >
> > The missingContrecPtr is assigned to the EndOfLog, and we have handled
> > EndOfLog previously in the 0004 patch, and the abortedRecPtr is the
> > same as the lastReplayedEndRecPtr, AFAICS.  I have added an assert to
> > ensure that the lastReplayedEndRecPtr value is the same as the
> > abortedRecPtr, but I think that is not needed, we can go ahead and
> > write an overwrite-contrecord starting at lastReplayedEndRecPtr.
>
> I thought that it made sense to commit 0001 and 0002 at this point, so
> I have done that. I think that the treatment of missingContrecPtr and
> abortedRecPtr may need more thought yet, so at least for that reason I
> don't think it's a good idea to proceed with 0004 yet. 0003 is just
> code movement so I guess that can be committed whenever we're
> confident that we know exactly which things we want to end up inside
> XLogAcceptWrites().
>

Ok.

> I do have a few ideas after studying this a bit more:
>
> - I wonder whether, in addition to moving a few things later as 0002
> did, we also ought to think about moving one thing earlier,
> specifically XLogReportParameters(). Right now, we have, I believe,
> four things that write WAL at the end of recovery:
> CreateOverwriteContrecordRecord(), UpdateFullPageWrites(),
> PerformRecoveryXLogAction(), and XLogReportParameters(). As the code
> is structured now, we do the first three of those things, and then do
> a bunch of other stuff inside CleanupAfterArchiveRecovery() like
> running recovery_end_command, and removing non-parent xlog files, and
> archiving the partial segment, and then come back and do the fourth
> one. Is there any good reason for that? If not, I think doing them all
> together would be cleaner, and would propose to reverse the order of
> CleanupAfterArchiveRecovery() and XLogReportParameters().
>

Yes, that can be done.

> - If we did that, then I would further propose to adjust things so
> that we remove the call to LocalSetXLogInsertAllowed() and the
> assignment LocalXLogInsertAllowed = -1 from inside
> CreateEndOfRecoveryRecord(), the LocalXLogInsertAllowed = -1 from just
> after UpdateFullPageWrites(), and the call to
> LocalSetXLogInsertAllowed() just before XLogReportParameters().
> Instead, just let the call to LocalSetXLogInsertAllowed() right before
> CreateOverwriteContrecordRecord() remain in effect. There doesn't seem
> to be much point in flipping that switch off and on again, and the
> fact that we have been doing so is in my view just evidence that
> StartupXLOG() doesn't do a very good job of getting related code all
> into one place.
>

Currently there are three places that are calling
LocalSetXLogInsertAllowed() and resetting that LocalXLogInsertAllowed
flag as StartupXLOG(), CreateEndOfRecoveryRecord() and the
CreateCheckPoint().  By doing the aforementioned code rearrangement we
can get rid of frequent calls from StartupXLOG() and can completely
remove the need for it in CreateEndOfRecoveryRecord() since that gets
called only from StartupXLOG() directly. Whereas CreateCheckPoint()
too gets called from StartupXLOG() when it is running in a standalone
backend only, at that time we don't need to call
LocalSetXLogInsertAllowed()  but if that running in the Checkpointer
process then we need that.

I tried this in the attached version, but I'm a bit skeptical with
changes that are needed for CreateCheckPoint(), those don't seem to be
clean. I am wondering if we could completely remove the need to end of
recovery checkpoint as proposed in [1], that would get rid of
CHECKPOINT_END_OF_RECOVERY operation and the
LocalSetXLogInsertAllowed() requirement in CreateCheckPoint(), and
after that, we were not expecting checkpoint operation in recovery. If
we could do that then we would have LocalSetXLogInsertAllowed() only
at one place i.e. in StartupXLOG (...and in the future in
XLogAcceptWrites()) -- the code that runs only once in a lifetime of
the server and the kludge that the attached patch doing for
CreateCheckPoint() will not be needed.

> - It seems really tempting to invent a fourth RecoveryState value that
> indicates that we are done with REDO but not yet in production, and
> maybe also to rename RecoveryState to something like WALState. I'm
> thinking of something like WAL_STATE_CRASH_RECOVERY,
> WAL_STATE_ARCHIVE_RECOVERY, WAL_STATE_REDO_COMPLETE, and
> WAL_STATE_PRODUCTION. Then, instead of having
> LocalSetXLogInsertAllowed(), we could teach XLogInsertAllowed() that
> the startup

Re: [Patch] ALTER SYSTEM READ ONLY

2021-10-14 Thread Robert Haas

On Tue, Oct 12, 2021 at 8:18 AM Amul Sul  wrote:
> In the attached version I have fixed this issue by restoring 
> missingContrecPtr.
>
> To handle abortedRecPtr and missingContrecPtr newly added global
> variables thought the commit # ff9f111bce24, we don't need to store
> them in the shared memory separately, instead, we need a flag that
> indicates a broken record found previously, at the end of recovery, so
> that we can overwrite contrecord.
>
> The missingContrecPtr is assigned to the EndOfLog, and we have handled
> EndOfLog previously in the 0004 patch, and the abortedRecPtr is the
> same as the lastReplayedEndRecPtr, AFAICS.  I have added an assert to
> ensure that the lastReplayedEndRecPtr value is the same as the
> abortedRecPtr, but I think that is not needed, we can go ahead and
> write an overwrite-contrecord starting at lastReplayedEndRecPtr.

I thought that it made sense to commit 0001 and 0002 at this point, so
I have done that. I think that the treatment of missingContrecPtr and
abortedRecPtr may need more thought yet, so at least for that reason I
don't think it's a good idea to proceed with 0004 yet. 0003 is just
code movement so I guess that can be committed whenever we're
confident that we know exactly which things we want to end up inside
XLogAcceptWrites().

I do have a few ideas after studying this a bit more:

- I wonder whether, in addition to moving a few things later as 0002
did, we also ought to think about moving one thing earlier,
specifically XLogReportParameters(). Right now, we have, I believe,
four things that write WAL at the end of recovery:
CreateOverwriteContrecordRecord(), UpdateFullPageWrites(),
PerformRecoveryXLogAction(), and XLogReportParameters(). As the code
is structured now, we do the first three of those things, and then do
a bunch of other stuff inside CleanupAfterArchiveRecovery() like
running recovery_end_command, and removing non-parent xlog files, and
archiving the partial segment, and then come back and do the fourth
one. Is there any good reason for that? If not, I think doing them all
together would be cleaner, and would propose to reverse the order of
CleanupAfterArchiveRecovery() and XLogReportParameters().

- If we did that, then I would further propose to adjust things so
that we remove the call to LocalSetXLogInsertAllowed() and the
assignment LocalXLogInsertAllowed = -1 from inside
CreateEndOfRecoveryRecord(), the LocalXLogInsertAllowed = -1 from just
after UpdateFullPageWrites(), and the call to
LocalSetXLogInsertAllowed() just before XLogReportParameters().
Instead, just let the call to LocalSetXLogInsertAllowed() right before
CreateOverwriteContrecordRecord() remain in effect. There doesn't seem
to be much point in flipping that switch off and on again, and the
fact that we have been doing so is in my view just evidence that
StartupXLOG() doesn't do a very good job of getting related code all
into one place.

- It seems really tempting to invent a fourth RecoveryState value that
indicates that we are done with REDO but not yet in production, and
maybe also to rename RecoveryState to something like WALState. I'm
thinking of something like WAL_STATE_CRASH_RECOVERY,
WAL_STATE_ARCHIVE_RECOVERY, WAL_STATE_REDO_COMPLETE, and
WAL_STATE_PRODUCTION. Then, instead of having
LocalSetXLogInsertAllowed(), we could teach XLogInsertAllowed() that
the startup process and the checkpointer are allowed to insert WAL
when the state is WAL_STATE_REDO_COMPLETE, but other processes only
once we reach WAL_STATE_PRODUCTION. We would set
WAL_STATE_REDO_COMPLETE where we now call LocalSetXLogInsertAllowed().
It's necessary to include the checkpointer, or at least I think it is,
because PerformRecoveryXLogAction() might call RequestCheckpoint(),
and that's got to work. If we did this, then I think it would also
solve another problem which the overall patch set has to address
somehow. Say that we eventually move responsibility for the
to-be-created XLogAcceptWrites() function from the startup process to
the checkpointer, as proposed. The checkpointer needs to know when to
call it ... and the answer with this change is simple: when we reach
WAL_STATE_REDO_COMPLETE, it's time!

But this idea is not completely problem-free. I spent some time poking
at it and I think it's a little hard to come up with a satisfying way
to code XLogInsertAllowed(). Right now that function calls
RecoveryInProgress(), and if RecoveryInProgress() decides that
recovery is no longer in progress, it calls InitXLOGAccess(). However,
that presumes that the only reason you'd call RecoveryInProgress() is
to figure out whether you should write WAL, which I don't think is
really true, and it also means that, when the wal state is
WAL_STATE_REDO_COMPLETE, RecoveryInProgress() would need to return
true in the checkpointer and startup process and false everywhere
else, which does not sound like a great idea. It seems fine to say
that xlog insertion is allowed in some processes but not others,
because not

Re: [Patch] ALTER SYSTEM READ ONLY

2021-10-12 Thread Amul Sul

On Thu, Oct 7, 2021 at 6:21 PM Amul Sul  wrote:
>
> On Thu, Oct 7, 2021 at 5:56 AM Jaime Casanova
>  wrote:
> >
> > On Tue, Oct 05, 2021 at 04:11:58PM +0530, Amul Sul wrote:
> > >On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia
> > >  wrote:
> > > >
> > > > I tried to apply the patch on the master branch head and it's failing
> > > > with conflicts.
> > > >
> > >
> > > Thanks, Rushabh, for the quick check, I have attached a rebased version 
> > > for the
> > > latest master head commit # f6b5d05ba9a.
> > >
> >
> > Hi,
> >
> > I got this error while executing "make check" on src/test/recovery:
> >
> > """
> > t/026_overwrite_contrecord.pl  1/3 # poll_query_until timed out 
> > executing this query:
> > # SELECT '0/201D4D8'::pg_lsn <= pg_last_wal_replay_lsn()
> > # expecting this output:
> > # t
> > # last actual query output:
> > # f
> > # with stderr:
> > # Looks like your test exited with 29 just after 1.
> > t/026_overwrite_contrecord.pl  Dubious, test returned 29 (wstat 
> > 7424, 0x1d00)
> > Failed 2/3 subtests
> >
> > Test Summary Report
> > ---
> > t/026_overwrite_contrecord.pl  (Wstat: 7424 Tests: 1 Failed: 0)
> >   Non-zero exit status: 29
> >   Parse errors: Bad plan.  You planned 3 tests but ran 1.
> > Files=26, Tests=279, 400 wallclock secs ( 0.27 usr  0.10 sys + 73.78 cusr 
> > 59.66 csys = 133.81 CPU)
> > Result: FAIL
> > make: *** [Makefile:23: check] Error 1
> > """
> >
>
> Thanks for the reporting problem, I am working on it. The cause of
> failure is that v37_0004 patch clearing the missingContrecPtr global
> variable before CreateOverwriteContrecordRecord() execution, which it
> shouldn't.
>

In the attached version I have fixed this issue by restoring missingContrecPtr.

To handle abortedRecPtr and missingContrecPtr newly added global
variables thought the commit # ff9f111bce24, we don't need to store
them in the shared memory separately, instead, we need a flag that
indicates a broken record found previously, at the end of recovery, so
that we can overwrite contrecord.

The missingContrecPtr is assigned to the EndOfLog, and we have handled
EndOfLog previously in the 0004 patch, and the abortedRecPtr is the
same as the lastReplayedEndRecPtr, AFAICS.  I have added an assert to
ensure that the lastReplayedEndRecPtr value is the same as the
abortedRecPtr, but I think that is not needed, we can go ahead and
write an overwrite-contrecord starting at lastReplayedEndRecPtr.

Regards,
Amul
From 5bf021226d9742a6fefbcb33e54f7ef044d8fbcc Mon Sep 17 00:00:00 2001
From: Amul Sul 
Date: Thu, 30 Sep 2021 06:29:06 -0400
Subject: [PATCH v38 4/4] Remove dependencies on startup-process specifical
 variables.

To make XLogAcceptWrites(), need to dependency on few global and local
variable spcific to startup process.

Global variables are abortedRecPtr, missingContrecPtr,
ArchiveRecoveryRequested and LocalPromoteIsTriggered, whereas
LocalPromoteIsTriggered can be accessed in any other process using
existing PromoteIsTriggered().  ArchiveRecoveryRequested is made
accessible by copying into shared memory.  abortedRecPtr and
missingContrecPtr can get from the existing shared memory values but
for that, we need a flag indicating that broken records was found
previously and OVERWRITE_CONTRECORD message needs to be written when
WAL writes permitted, added a flag for the same.

XLogAcceptWrites() accepts two argument as EndOfLogTLI and EndOfLog
which are local to StartupXLOG(). Instead of passing as an argument
XLogCtl->replayEndTLI and XLogCtl->lastSegSwitchLSN from the shared
memory can be used as an replacement to EndOfLogTLI and EndOfLog
respectively.  XLogCtl->lastSegSwitchLSN is not going to change until
we use it. That changes only when the current WAL segment gets full
which never going to happen because of two reasons, first WAL writes
are disabled for other processes until XLogAcceptWrites() finishes and
other reasons before use of lastSegSwitchLSN, XLogAcceptWrites() is
writes fix size wal records as full-page write and record for either
recovery end or checkpoint which not going to fill up the 16MB wal
segment.

EndOfLogTLI in the StartupXLOG() is the timeline ID of the last record
that xlogreader reads, but this xlogreader was simply re-fetching the
last record which we have replied in redo loop if it was in recovery,
if not in recovery, we don't need to worry since this value is needed
only in case of ArchiveRecoveryRequested = true, which implicitly
forces redo and sets XLogCtl->replayEndTLI value.
---
 src/backend/access/transam/xlog.c | 90 ---
 1 file changed, 72 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cdfec248081..b9596ca005c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -668,6 +668,13 @@ typedef struct XLogCtlData
 	 */
 	bool		SharedPromoteIsTriggered;

+	/*
+	 * SharedArchiveRecoveryRequested exports the

Re: [Patch] ALTER SYSTEM READ ONLY

2021-10-07 Thread Amul Sul

On Thu, Oct 7, 2021 at 5:56 AM Jaime Casanova
 wrote:
>
> On Tue, Oct 05, 2021 at 04:11:58PM +0530, Amul Sul wrote:
> >On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia
> >  wrote:
> > >
> > > I tried to apply the patch on the master branch head and it's failing
> > > with conflicts.
> > >
> >
> > Thanks, Rushabh, for the quick check, I have attached a rebased version for 
> > the
> > latest master head commit # f6b5d05ba9a.
> >
>
> Hi,
>
> I got this error while executing "make check" on src/test/recovery:
>
> """
> t/026_overwrite_contrecord.pl  1/3 # poll_query_until timed out 
> executing this query:
> # SELECT '0/201D4D8'::pg_lsn <= pg_last_wal_replay_lsn()
> # expecting this output:
> # t
> # last actual query output:
> # f
> # with stderr:
> # Looks like your test exited with 29 just after 1.
> t/026_overwrite_contrecord.pl  Dubious, test returned 29 (wstat 7424, 
> 0x1d00)
> Failed 2/3 subtests
>
> Test Summary Report
> ---
> t/026_overwrite_contrecord.pl  (Wstat: 7424 Tests: 1 Failed: 0)
>   Non-zero exit status: 29
>   Parse errors: Bad plan.  You planned 3 tests but ran 1.
> Files=26, Tests=279, 400 wallclock secs ( 0.27 usr  0.10 sys + 73.78 cusr 
> 59.66 csys = 133.81 CPU)
> Result: FAIL
> make: *** [Makefile:23: check] Error 1
> """
>

Thanks for the reporting problem, I am working on it. The cause of
failure is that v37_0004 patch clearing the missingContrecPtr global
variable before CreateOverwriteContrecordRecord() execution, which it
shouldn't.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2021-10-06 Thread Jaime Casanova

On Tue, Oct 05, 2021 at 04:11:58PM +0530, Amul Sul wrote:
>On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia
>  wrote:
> >
> > I tried to apply the patch on the master branch head and it's failing
> > with conflicts.
> >
> 
> Thanks, Rushabh, for the quick check, I have attached a rebased version for 
> the
> latest master head commit # f6b5d05ba9a.
> 

Hi,

I got this error while executing "make check" on src/test/recovery:

"""
t/026_overwrite_contrecord.pl  1/3 # poll_query_until timed out 
executing this query:
# SELECT '0/201D4D8'::pg_lsn <= pg_last_wal_replay_lsn()
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Looks like your test exited with 29 just after 1.
t/026_overwrite_contrecord.pl  Dubious, test returned 29 (wstat 7424, 
0x1d00)
Failed 2/3 subtests 

Test Summary Report
---
t/026_overwrite_contrecord.pl  (Wstat: 7424 Tests: 1 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 3 tests but ran 1.
Files=26, Tests=279, 400 wallclock secs ( 0.27 usr  0.10 sys + 73.78 cusr 59.66 
csys = 133.81 CPU)
Result: FAIL
make: *** [Makefile:23: check] Error 1
"""



-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL

Re: [Patch] ALTER SYSTEM READ ONLY

2021-10-05 Thread Amul Sul

   On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia
 wrote:
>
>
>
> On Fri, Oct 1, 2021 at 2:29 AM Robert Haas  wrote:
>>
>> On Thu, Sep 30, 2021 at 7:59 AM Amul Sul  wrote:
>> > To find the value of InRecovery after we clear it, patch still uses
>> > ControlFile's DBState, but now the check condition changed to a more
>> > specific one which is less confusing.
>> >
>> > In casual off-list discussion, the point was made to check
>> > SharedRecoveryState to find out the InRecovery value afterward, and
>> > check that using RecoveryInProgress().  But we can't depend on
>> > SharedRecoveryState because at the start it gets initialized to
>> > RECOVERY_STATE_CRASH irrespective of InRecovery that happens later.
>> > Therefore, we can't use RecoveryInProgress() which always returns
>> > true if SharedRecoveryState != RECOVERY_STATE_DONE.
>>
>> Uh, this change has crept into 0002, but it should be in 0004 with the
>> rest of the changes to remove dependencies on variables specific to
>> the startup process. Like I said before, we should really be trying to
>> separate code movement from functional changes.

Well, I have to replace the InRecovery flag in that patch since we are
moving code past to the point where the InRecovery flag gets cleared.
If I don't do, then the 0002 patch would be wrong since InRecovery is
always false, and behaviour won't be the same as it was before that
patch.

>> Also, 0002 doesn't
>> actually apply for me. Did you generate these patches with 'git
>> format-patch'?
>>
>> [rhaas pgsql]$ patch -p1 <
>> ~/Downloads/v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
>> patching file src/backend/access/transam/xlog.c
>> Hunk #1 succeeded at 889 (offset 9 lines).
>> Hunk #2 succeeded at 939 (offset 12 lines).
>> Hunk #3 succeeded at 5734 (offset 37 lines).
>> Hunk #4 succeeded at 8038 (offset 70 lines).
>> Hunk #5 succeeded at 8248 (offset 70 lines).
>> [rhaas pgsql]$ patch -p1 <
>> ~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
>> patching file src/backend/access/transam/xlog.c
>> Reversed (or previously applied) patch detected!  Assume -R? [n]
>> Apply anyway? [n] y
>> Hunk #1 FAILED at 7954.
>> Hunk #2 succeeded at 8079 (offset 70 lines).
>> 1 out of 2 hunks FAILED -- saving rejects to file
>> src/backend/access/transam/xlog.c.rej
>> [rhaas pgsql]$ git reset --hard
>> HEAD is now at b484ddf4d2 Treat ETIMEDOUT as indicating a
>> non-recoverable connection failure.
>> [rhaas pgsql]$ patch -p1 <
>> ~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
>> patching file src/backend/access/transam/xlog.c
>> Reversed (or previously applied) patch detected!  Assume -R? [n]
>> Apply anyway? [n]
>> Skipping patch.
>> 2 out of 2 hunks ignored -- saving rejects to file
>> src/backend/access/transam/xlog.c.rej
>>
>
> I tried to apply the patch on the master branch head and it's failing
> with conflicts.
>

Thanks, Rushabh, for the quick check, I have attached a rebased version for the
latest master head commit # f6b5d05ba9a.

> Later applied patch on below commit and it got applied cleanly:
>
> commit 7d1aa6bf1c27bf7438179db446f7d1e72ae093d0
> Author: Tom Lane 
> Date:   Mon Sep 27 18:48:01 2021 -0400
>
> Re-enable contrib/bloom's TAP tests.
>
> rushabh@rushabh:postgresql$ git apply 
> v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
> rushabh@rushabh:postgresql$ git apply 
> v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
> rushabh@rushabh:postgresql$ git apply 
> v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch
> v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:34: space 
> before tab in indent.
>   /*
> v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:38: space 
> before tab in indent.
>   */
> v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:39: space 
> before tab in indent.
>   Insert->fullPageWrites = lastFullPageWrites;
> warning: 3 lines add whitespace errors.
> rushabh@rushabh:postgresql$ git apply 
> v36-0004-Remove-dependencies-on-startup-process-specifica.patch
>
> There are whitespace errors on patch 0003.
>

Fixed.

>>
>> It seems to me that the approach you're pursuing here can't work,
>> because the long-term goal is to get to a place where, if the system
>> starts up read-only, XLogAcceptWrites() might not be called until
>> later, after StartupXLOG() has exited. But in that case the control
>> file state would be DB_IN_PRODUCTION. But my idea of using
>> RecoveryInProgress() won't work either, because we set
>> RECOVERY_STATE_DONE just after we set DB_IN_PRODUCTION. Put
>> differently, the question we want to answer is not "are we in recovery
>> now?" but "did we perform recovery?". After studying the code a bit, I
>> think a good test might be
>> !XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr). If InRecovery
>> gets set to true, then we're certain to enter the if (InRecovery)
>> block that contains the main redo loop. And

Re: [Patch] ALTER SYSTEM READ ONLY

2021-10-04 Thread Rushabh Lathia

On Fri, Oct 1, 2021 at 2:29 AM Robert Haas  wrote:

> On Thu, Sep 30, 2021 at 7:59 AM Amul Sul  wrote:
> > To find the value of InRecovery after we clear it, patch still uses
> > ControlFile's DBState, but now the check condition changed to a more
> > specific one which is less confusing.
> >
> > In casual off-list discussion, the point was made to check
> > SharedRecoveryState to find out the InRecovery value afterward, and
> > check that using RecoveryInProgress().  But we can't depend on
> > SharedRecoveryState because at the start it gets initialized to
> > RECOVERY_STATE_CRASH irrespective of InRecovery that happens later.
> > Therefore, we can't use RecoveryInProgress() which always returns
> > true if SharedRecoveryState != RECOVERY_STATE_DONE.
>
> Uh, this change has crept into 0002, but it should be in 0004 with the
> rest of the changes to remove dependencies on variables specific to
> the startup process. Like I said before, we should really be trying to
> separate code movement from functional changes. Also, 0002 doesn't
> actually apply for me. Did you generate these patches with 'git
> format-patch'?
>
> [rhaas pgsql]$ patch -p1 <
> ~/Downloads/v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
> patching file src/backend/access/transam/xlog.c
> Hunk #1 succeeded at 889 (offset 9 lines).
> Hunk #2 succeeded at 939 (offset 12 lines).
> Hunk #3 succeeded at 5734 (offset 37 lines).
> Hunk #4 succeeded at 8038 (offset 70 lines).
> Hunk #5 succeeded at 8248 (offset 70 lines).
> [rhaas pgsql]$ patch -p1 <
> ~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
> patching file src/backend/access/transam/xlog.c
> Reversed (or previously applied) patch detected!  Assume -R? [n]
> Apply anyway? [n] y
> Hunk #1 FAILED at 7954.
> Hunk #2 succeeded at 8079 (offset 70 lines).
> 1 out of 2 hunks FAILED -- saving rejects to file
> src/backend/access/transam/xlog.c.rej
> [rhaas pgsql]$ git reset --hard
> HEAD is now at b484ddf4d2 Treat ETIMEDOUT as indicating a
> non-recoverable connection failure.
> [rhaas pgsql]$ patch -p1 <
> ~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
> patching file src/backend/access/transam/xlog.c
> Reversed (or previously applied) patch detected!  Assume -R? [n]
> Apply anyway? [n]
> Skipping patch.
> 2 out of 2 hunks ignored -- saving rejects to file
> src/backend/access/transam/xlog.c.rej
>
>
I tried to apply the patch on the master branch head and it's failing
with conflicts.

Later applied patch on below commit and it got applied cleanly:

commit 7d1aa6bf1c27bf7438179db446f7d1e72ae093d0
Author: Tom Lane 
Date:   Mon Sep 27 18:48:01 2021 -0400

Re-enable contrib/bloom's TAP tests.

rushabh@rushabh:postgresql$ git apply
v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
rushabh@rushabh:postgresql$ git apply
v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
rushabh@rushabh:postgresql$ git apply
v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch
v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:34: space
before tab in indent.
  /*
v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:38: space
before tab in indent.
  */
v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:39: space
before tab in indent.
  Insert->fullPageWrites = lastFullPageWrites;
warning: 3 lines add whitespace errors.
rushabh@rushabh:postgresql$ git apply
v36-0004-Remove-dependencies-on-startup-process-specifica.patch

There are whitespace errors on patch 0003.


> It seems to me that the approach you're pursuing here can't work,
> because the long-term goal is to get to a place where, if the system
> starts up read-only, XLogAcceptWrites() might not be called until
> later, after StartupXLOG() has exited. But in that case the control
> file state would be DB_IN_PRODUCTION. But my idea of using
> RecoveryInProgress() won't work either, because we set
> RECOVERY_STATE_DONE just after we set DB_IN_PRODUCTION. Put
> differently, the question we want to answer is not "are we in recovery
> now?" but "did we perform recovery?". After studying the code a bit, I
> think a good test might be
> !XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr). If InRecovery
> gets set to true, then we're certain to enter the if (InRecovery)
> block that contains the main redo loop. And that block unconditionally
> does XLogCtl->lastReplayedEndRecPtr = XLogCtl->replayEndRecPtr. I
> think that replayEndRecPtr can't be 0 because it's supposed to
> represent the record we're pretending to have last replayed, as
> explained by the comments. And while lastReplayedEndRecPtr will get
> updated later as we replay more records, I think it will never be set
> back to 0. It's only going to increase, as we replay more records. On
> the other hand if InRecovery = false then we'll never change it, and
> it seems that it starts out as 0.
>
> I was hoping to have more time today to comment on 0004, but the

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-30 Thread Robert Haas

On Thu, Sep 30, 2021 at 7:59 AM Amul Sul  wrote:
> To find the value of InRecovery after we clear it, patch still uses
> ControlFile's DBState, but now the check condition changed to a more
> specific one which is less confusing.
>
> In casual off-list discussion, the point was made to check
> SharedRecoveryState to find out the InRecovery value afterward, and
> check that using RecoveryInProgress().  But we can't depend on
> SharedRecoveryState because at the start it gets initialized to
> RECOVERY_STATE_CRASH irrespective of InRecovery that happens later.
> Therefore, we can't use RecoveryInProgress() which always returns
> true if SharedRecoveryState != RECOVERY_STATE_DONE.

Uh, this change has crept into 0002, but it should be in 0004 with the
rest of the changes to remove dependencies on variables specific to
the startup process. Like I said before, we should really be trying to
separate code movement from functional changes. Also, 0002 doesn't
actually apply for me. Did you generate these patches with 'git
format-patch'?

[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
patching file src/backend/access/transam/xlog.c
Hunk #1 succeeded at 889 (offset 9 lines).
Hunk #2 succeeded at 939 (offset 12 lines).
Hunk #3 succeeded at 5734 (offset 37 lines).
Hunk #4 succeeded at 8038 (offset 70 lines).
Hunk #5 succeeded at 8248 (offset 70 lines).
[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
patching file src/backend/access/transam/xlog.c
Reversed (or previously applied) patch detected!  Assume -R? [n]
Apply anyway? [n] y
Hunk #1 FAILED at 7954.
Hunk #2 succeeded at 8079 (offset 70 lines).
1 out of 2 hunks FAILED -- saving rejects to file
src/backend/access/transam/xlog.c.rej
[rhaas pgsql]$ git reset --hard
HEAD is now at b484ddf4d2 Treat ETIMEDOUT as indicating a
non-recoverable connection failure.
[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
patching file src/backend/access/transam/xlog.c
Reversed (or previously applied) patch detected!  Assume -R? [n]
Apply anyway? [n]
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file
src/backend/access/transam/xlog.c.rej

It seems to me that the approach you're pursuing here can't work,
because the long-term goal is to get to a place where, if the system
starts up read-only, XLogAcceptWrites() might not be called until
later, after StartupXLOG() has exited. But in that case the control
file state would be DB_IN_PRODUCTION. But my idea of using
RecoveryInProgress() won't work either, because we set
RECOVERY_STATE_DONE just after we set DB_IN_PRODUCTION. Put
differently, the question we want to answer is not "are we in recovery
now?" but "did we perform recovery?". After studying the code a bit, I
think a good test might be
!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr). If InRecovery
gets set to true, then we're certain to enter the if (InRecovery)
block that contains the main redo loop. And that block unconditionally
does XLogCtl->lastReplayedEndRecPtr = XLogCtl->replayEndRecPtr. I
think that replayEndRecPtr can't be 0 because it's supposed to
represent the record we're pretending to have last replayed, as
explained by the comments. And while lastReplayedEndRecPtr will get
updated later as we replay more records, I think it will never be set
back to 0. It's only going to increase, as we replay more records. On
the other hand if InRecovery = false then we'll never change it, and
it seems that it starts out as 0.

I was hoping to have more time today to comment on 0004, but the day
seems to have gotten away from me. One quick thought is that it looks
a bit strange to be getting EndOfLog from GetLastSegSwitchData() which
returns lastSegSwitchLSN while getting EndOfLogTLI from replayEndTLI
... because there's also replayEndRecPtr, which seems to go with
replayEndTLI. It feels like we should use a source for the TLI that
clearly matches the source for the corresponding LSN, unless there's
some super-good reason to do otherwise.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-30 Thread Amul Sul

On Fri, Sep 24, 2021 at 5:07 PM Amul Sul  wrote:
>
> On Thu, Sep 23, 2021 at 11:56 PM Robert Haas  wrote:
> >
> > On Mon, Sep 20, 2021 at 11:20 AM Amul Sul  wrote:
> > > Ok, understood, I have separated my changes into 0001 and 0002 patch,
> > > and the refactoring patches start from 0003.
> >
> > I think it would be better in the other order, with the refactoring
> > patches at the beginning of the series.
> >
>
> Ok, will do that. I did this other way to minimize the diff e.g.
> deletion diff of RecoveryXlogAction enum and
> DetermineRecoveryXlogAction(), etc.
>

I have reversed the patch order. Now refactoring patches will be
first, and the patch that removes the dependencies on global & local
variables will be the last. I did the necessary modification in the
refactoring patches too e.g. removed DetermineRecoveryXlogAction() and
RecoveryXlogAction enum which is no longer needed (thanks to commit #
1d919de5eb3fffa7cc9479ed6d2915fb89794459 to make code simple).

To find the value of InRecovery after we clear it, patch still uses
ControlFile's DBState, but now the check condition changed to a more
specific one which is less confusing.

In casual off-list discussion, the point was made to check
SharedRecoveryState to find out the InRecovery value afterward, and
check that using RecoveryInProgress().  But we can't depend on
SharedRecoveryState because at the start it gets initialized to
RECOVERY_STATE_CRASH irrespective of InRecovery that happens later.
Therefore, we can't use RecoveryInProgress() which always returns
true if SharedRecoveryState != RECOVERY_STATE_DONE.

I am posting only refactoring patches for now.

Regards,
Amul
From 730e8331fefc882b4cab7112adf0f4d8da1ea831 Mon Sep 17 00:00:00 2001
From: Amul Sul 
Date: Thu, 30 Sep 2021 06:29:06 -0400
Subject: [PATCH v36 4/4] Remove dependencies on startup-process specifical
 variables.

To make XLogAcceptWrites(), need to dependency on few global and local
variable spcific to startup process.

Global variables are ArchiveRecoveryRequested and
LocalPromoteIsTriggered, whereas LocalPromoteIsTriggered can be
accessed in any other process using existing PromoteIsTriggered().
ArchiveRecoveryRequested is made accessible by copying into shared
memory.

XLogAcceptWrites() accepts two argument as EndOfLogTLI and EndOfLog
which are local to StartupXLOG(). Instead of passing as an argument
XLogCtl->replayEndTLI and XLogCtl->lastSegSwitchLSN from the shared
memory can be used as an replacement to EndOfLogTLI and EndOfLog
respectively.  XLogCtl->lastSegSwitchLSN is not going to change until
we use it. That changes only when the current WAL segment gets full
which never going to happen because of two reasons, first WAL writes
are disabled for other processes until XLogAcceptWrites() finishes and
other reasons before use of lastSegSwitchLSN, XLogAcceptWrites() is
writes fix size wal records as full-page write and record for either
recovery end or checkpoint which not going to fill up the 16MB wal
segment.

EndOfLogTLI in the StartupXLOG() is the timeline ID of the last record
that xlogreader reads, but this xlogreader was simply re-fetching the
last record which we have replied in redo loop if it was in recovery,
if not in recovery, we don't need to worry since this value is needed
only in case of ArchiveRecoveryRequested = true, which implicitly
forces redo and sets XLogCtl->replayEndTLI value.
---
 src/backend/access/transam/xlog.c | 36 ++-
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 91cdd7d9ff2..5b4e5ac379f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -659,6 +659,13 @@ typedef struct XLogCtlData
 	 */
 	bool		SharedPromoteIsTriggered;

+	/*
+	 * SharedArchiveRecoveryRequested exports the value of the
+	 * ArchiveRecoveryRequested flag to be share which is otherwise valid only
+	 * in the startup process.
+	 */
+	bool		SharedArchiveRecoveryRequested;
+
 	/*
 	 * WalWriterSleeping indicates whether the WAL writer is currently in
 	 * low-power mode (and hence should be nudged if an async commit occurs).
@@ -880,8 +887,7 @@ static MemoryContext walDebugCxt = NULL;
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
-static void CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI,
-		XLogRecPtr EndOfLog);
+static void CleanupAfterArchiveRecovery(void);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static char *getRecoveryStopReason(void);
@@ -927,7 +933,7 @@ static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 			  int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
-static bool XLogAcceptWrites(TimeLineID

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-24 Thread Amul Sul

On Thu, Sep 23, 2021 at 11:56 PM Robert Haas  wrote:
>
> On Mon, Sep 20, 2021 at 11:20 AM Amul Sul  wrote:
> > Ok, understood, I have separated my changes into 0001 and 0002 patch,
> > and the refactoring patches start from 0003.
>
> I think it would be better in the other order, with the refactoring
> patches at the beginning of the series.
>

Ok, will do that. I did this other way to minimize the diff e.g.
deletion diff of RecoveryXlogAction enum and
DetermineRecoveryXlogAction(), etc.

> > In the 0001 patch, I have copied ArchiveRecoveryRequested to shared
> > memory as said previously. Coping ArchiveRecoveryRequested value to
> > shared memory is not really interesting, and I think somehow we should
> > reuse existing variable, (perhaps, with some modification of the
> > information it can store, e.g. adding one more enum value for
> > SharedRecoveryState or something else), thinking on the same.
> >
> > In addition to that, I tried to turn down the scope of
> > ArchiveRecoveryRequested global variable. Now, this is a static
> > variable, and the scope is limited to xlog.c file like
> > LocalXLogInsertAllowed and can be accessed through the newly added
> > function ArchiveRecoveryIsRequested() (like PromoteIsTriggered()). Let
> > me know what you think about the approach.
>
> I'm not sure yet whether I like this or not, but it doesn't seem like
> a terrible idea. You spelled UNKNOWN wrong, though, which does seem
> like a terrible idea. :-) "acccsed" is not correct either.
>
> Also, the new comments for ArchiveRecoveryRequested /
> ARCHIVE_RECOVERY_REQUEST_* are really not very clear. All you did is
> substitute the new terminology into the existing comment, but that
> means that the purpose of the new "unknown" value is not at all clear.
>

Ok, will fix those typos and try to improve the comments.

> Consider the following two patch fragments:
>
> + * SharedArchiveRecoveryRequested indicates whether an archive recovery is
> + * requested. Protected by info_lck.
> ...
> + * Remember archive recovery request in shared memory state.  A lock is not
> + * needed since we are the only ones who updating this.
>
> These two comments directly contradict each other.
>

Okay, the first comment is not clear enough, I will fix that too. I
meant we don't need the lock now since we are the only one updating at
this point.

> + SpinLockAcquire(>info_lck);
> + XLogCtl->SharedArchiveRecoveryRequested = false;
> + ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN;
> + SpinLockRelease(>info_lck);
>
> This seems odd to me. In the first place, there doesn't seem to be any
> value in clearing this -- we're just expending extra CPU cycles to get
> rid of a value that wouldn't be used anyway. In the second place, if
> somehow someone checked the value after this point, with this code,
> they might get the wrong answer, whereas if you just deleted this,
> they would get the right answer.
>

Previously, this flag was only valid in the startup process. But now
it will be valid for all the processes and will stay until the whole
server gets restarted. I don't want anybody to use this flag after the
cleanup point and just be sure I am explicitly cleaning this.

By the way, I also don't expect we should go with this approach.  I
proposed this by referring to the PromoteIsTriggered() implementation,
but IMO, it is better to have something else since we just want to perform
archive cleanup operation, and most of the work related to archive
recovery was done inside the StartupXLOG().

Rather than the proposed design, I was thinking of adding one or two
more RecoveryState enums. And while skipping XLogAcceptsWrite() set
XLogCtl->SharedRecoveryState appropriately, so that we can easily
identify that the archive recovery was requested previously and now,
we need to perform its pending cleanup operation. Thoughts?

> > In 0002 patch is a mixed one where I tried to remove the dependencies
> > on global variables and local variables belonging to StartupXLOG(). I
> > am still worried about the InRecovery value that needs to be deduced
> > afterward inside XLogAcceptWrites().  Currently, relying on
> > ControlFile->state != DB_SHUTDOWNED check but I think that will not be
> > good for ASRO where we plan to skip XLogAcceptWrites() work only and
> > let the StartupXLOG() do the rest of the work as it is where it will
> > going to update ControlFile's DBState to DB_IN_PRODUCTION, then we
> > might need some ugly kludge to call PerformRecoveryXLogAction() in
> > checkpointer irrespective of DBState, which makes me a bit
> > uncomfortable.
>
> I think that replacing the if (InRecovery) test with if
> (ControlFile->state != DB_SHUTDOWNED) is probably just plain wrong. I
> mean, there are three separate places where we set InRecovery = true.
> One of those executes if ControlFile->state != DB_SHUTDOWNED, matching
> what you have here, but it also can happen if checkPoint.redo <
> RecPtr, or if read_backup_label is true and

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-23 Thread Robert Haas

On Mon, Sep 20, 2021 at 11:20 AM Amul Sul  wrote:
> Ok, understood, I have separated my changes into 0001 and 0002 patch,
> and the refactoring patches start from 0003.

I think it would be better in the other order, with the refactoring
patches at the beginning of the series.

> In the 0001 patch, I have copied ArchiveRecoveryRequested to shared
> memory as said previously. Coping ArchiveRecoveryRequested value to
> shared memory is not really interesting, and I think somehow we should
> reuse existing variable, (perhaps, with some modification of the
> information it can store, e.g. adding one more enum value for
> SharedRecoveryState or something else), thinking on the same.
>
> In addition to that, I tried to turn down the scope of
> ArchiveRecoveryRequested global variable. Now, this is a static
> variable, and the scope is limited to xlog.c file like
> LocalXLogInsertAllowed and can be accessed through the newly added
> function ArchiveRecoveryIsRequested() (like PromoteIsTriggered()). Let
> me know what you think about the approach.

I'm not sure yet whether I like this or not, but it doesn't seem like
a terrible idea. You spelled UNKNOWN wrong, though, which does seem
like a terrible idea. :-) "acccsed" is not correct either.

Also, the new comments for ArchiveRecoveryRequested /
ARCHIVE_RECOVERY_REQUEST_* are really not very clear. All you did is
substitute the new terminology into the existing comment, but that
means that the purpose of the new "unknown" value is not at all clear.

Consider the following two patch fragments:

+ * SharedArchiveRecoveryRequested indicates whether an archive recovery is
+ * requested. Protected by info_lck.
...
+ * Remember archive recovery request in shared memory state.  A lock is not
+ * needed since we are the only ones who updating this.

These two comments directly contradict each other.

+ SpinLockAcquire(>info_lck);
+ XLogCtl->SharedArchiveRecoveryRequested = false;
+ ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN;
+ SpinLockRelease(>info_lck);

This seems odd to me. In the first place, there doesn't seem to be any
value in clearing this -- we're just expending extra CPU cycles to get
rid of a value that wouldn't be used anyway. In the second place, if
somehow someone checked the value after this point, with this code,
they might get the wrong answer, whereas if you just deleted this,
they would get the right answer.

> In 0002 patch is a mixed one where I tried to remove the dependencies
> on global variables and local variables belonging to StartupXLOG(). I
> am still worried about the InRecovery value that needs to be deduced
> afterward inside XLogAcceptWrites().  Currently, relying on
> ControlFile->state != DB_SHUTDOWNED check but I think that will not be
> good for ASRO where we plan to skip XLogAcceptWrites() work only and
> let the StartupXLOG() do the rest of the work as it is where it will
> going to update ControlFile's DBState to DB_IN_PRODUCTION, then we
> might need some ugly kludge to call PerformRecoveryXLogAction() in
> checkpointer irrespective of DBState, which makes me a bit
> uncomfortable.

I think that replacing the if (InRecovery) test with if
(ControlFile->state != DB_SHUTDOWNED) is probably just plain wrong. I
mean, there are three separate places where we set InRecovery = true.
One of those executes if ControlFile->state != DB_SHUTDOWNED, matching
what you have here, but it also can happen if checkPoint.redo <
RecPtr, or if read_backup_label is true and ReadCheckpointRecord
returns non-NULL. Now maybe you're going to tell me that in those
other two cases we can't reach here anyway, but I don't see off-hand
why that should be true, and even if it is true, it seems like kind of
a fragile thing to rely on. I think we need to rely on something in
shared memory that is more explicitly connected to the question of
whether we are in recovery.

The other part of this patch has to do with whether we can use the
return value of GetLastSegSwitchData as a substitute for relying on
EndOfLog. Now as you have it, you end up creating a local variable
called EndOfLog that shadows another such variable in an outer scope,
which probably would not make anyone who noticed things in such a
state very happy. However, that will naturally get fixed if you
reorder the patches as per above, so let's turn to the central
question: is this a good way of getting EndOfLog? The value that would
be in effect at the time this code is executed is set here:

XLogBeginRead(xlogreader, LastRec);
record = ReadRecord(xlogreader, PANIC, false);
EndOfLog = EndRecPtr;

Subsequently we do this:

/* start the archive_timeout timer and LSN running */
XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
XLogCtl->lastSegSwitchLSN = EndOfLog;

So at that point the value that GetLastSegSwitchData() would return
has to match what's in the existing variable. But later XLogWrite()
will change the value. So the question boils down to whether

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-22 Thread Amul Sul

On Wed, Sep 22, 2021 at 7:33 PM Mark Dilger
 wrote:
>
>
>
> > On Sep 22, 2021, at 6:39 AM, Amul Sul  wrote:
> >
> > Yes, that is a bit longer, here is the snip from v35-0010 patch
>
> Right, that's longer, and only tests one interaction.  The isolation spec I 
> posted upthread tests multiple interactions between the session which uses 
> cursors and the system going read-only.  Whether the session using a cursor 
> gets a FATAL, just an ERROR, or neither depends on where it is in the process 
> of opening, using, closing and committing.  I think that's interesting.  If 
> the implementation of the ALTER SESSION READ ONLY feature were to change in 
> such a way as, for example, to make the attempt to open the cursor be a FATAL 
> error, you'd see a change in the test output.
>

Agreed.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-22 Thread Mark Dilger

> On Sep 22, 2021, at 6:39 AM, Amul Sul  wrote:
> 
> Yes, that is a bit longer, here is the snip from v35-0010 patch

Right, that's longer, and only tests one interaction.  The isolation spec I 
posted upthread tests multiple interactions between the session which uses 
cursors and the system going read-only.  Whether the session using a cursor 
gets a FATAL, just an ERROR, or neither depends on where it is in the process 
of opening, using, closing and committing.  I think that's interesting.  If the 
implementation of the ALTER SESSION READ ONLY feature were to change in such a 
way as, for example, to make the attempt to open the cursor be a FATAL error, 
you'd see a change in the test output.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-22 Thread Amul Sul

On Wed, Sep 22, 2021 at 6:59 PM Mark Dilger
 wrote:
>
>
>
> > On Sep 22, 2021, at 6:14 AM, Amul Sul  wrote:
> >
> >> Attached patch v34-0010 adds a test of cursors opened FOR UPDATE 
> >> interacting with a system that is set read-only by a different session.  
> >> The expected output is worth reviewing to see how this plays out.  I don't 
> >> see anything in there which is obviously wrong, but some of it is a bit 
> >> clunky.  For example, by the time the client sees an error "FATAL:  WAL is 
> >> now prohibited", the system may already have switched back to read-write.  
> >> Also, it is a bit strange to get one of these errors on an attempted 
> >> ROLLBACK.  Once again, not wrong as such, but clunky.
> >>
> >
> > Can't we do the same in the TAP test? If the intention is only to test
> > session termination when the system changes to WAL are prohibited then
> > that I have added in the latest version, but that test does not
> > reinitiate the same connection again, I think that is not possible
> > there too.
>
> Perhaps you can point me to a TAP test that does this in a concise fashion.  
> When I tried writing a TAP test for this, it was much longer than the 
> equivalent isolation test spec.
>

Yes, that is a bit longer, here is the snip from v35-0010 patch:

+my $psql_timeout = IPC::Run::timer(60);
+my ($mysession_stdin, $mysession_stdout, $mysession_stderr) = ('', '', '');
+my $mysession = IPC::Run::start(
+ [
+ 'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+ $node_primary->connstr('postgres')
+ ],
+ '<',
+ \$mysession_stdin,
+ '>',
+ \$mysession_stdout,
+ '2>',
+ \$mysession_stderr,
+ $psql_timeout);
+
+# Write in transaction and get backend pid
+$mysession_stdin .= q[
+BEGIN;
+INSERT INTO tab VALUES(7);
+SELECT $$value-7-inserted-into-tab$$;
+];
+$mysession->pump until $mysession_stdout =~ /value-7-inserted-into-tab[\r\n]$/;
+like($mysession_stdout, qr/value-7-inserted-into-tab/,
+ 'started write transaction in a session');
+$mysession_stdout = '';
+$mysession_stderr = '';
+
+# Change to WAL prohibited
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+ 'server is changed to wal prohibited by another session');
+
+# Try to commit open write transaction.
+$mysession_stdin .= q[
+COMMIT;
+];
+$mysession->pump;
+like($mysession_stderr, qr/FATAL:  WAL is now prohibited/,
+ 'session with open write transaction is terminated');

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-22 Thread Mark Dilger




> On Sep 22, 2021, at 6:14 AM, Amul Sul  wrote:
> 
>> Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting 
>> with a system that is set read-only by a different session.  The expected 
>> output is worth reviewing to see how this plays out.  I don't see anything 
>> in there which is obviously wrong, but some of it is a bit clunky.  For 
>> example, by the time the client sees an error "FATAL:  WAL is now 
>> prohibited", the system may already have switched back to read-write.  Also, 
>> it is a bit strange to get one of these errors on an attempted ROLLBACK.  
>> Once again, not wrong as such, but clunky.
>> 
> 
> Can't we do the same in the TAP test? If the intention is only to test
> session termination when the system changes to WAL are prohibited then
> that I have added in the latest version, but that test does not
> reinitiate the same connection again, I think that is not possible
> there too.

Perhaps you can point me to a TAP test that does this in a concise fashion.  
When I tried writing a TAP test for this, it was much longer than the 
equivalent isolation test spec.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-22 Thread Amul Sul

On Wed, Sep 15, 2021 at 4:34 AM Mark Dilger
 wrote:
>
>
>
> > On Jun 16, 2020, at 6:55 AM, amul sul  wrote:
> >
> > (2) if the session is idle, we also need the top-level abort
> > record to be written immediately, but can't send an error to the client 
> > until the next
> > command is issued without losing wire protocol synchronization. For now, we 
> > just use
> > FATAL to kill the session; maybe this can be improved in the future.
>
> Andres,
>
> I'd like to have a patch that tests the impact of a vacuum running for xid 
> wraparound purposes, blocked on a pinned page held by the cursor, when 
> another session disables WAL.  It would be very interesting to test how the 
> vacuum handles that specific change.  I have not figured out the cleanest way 
> to do this, though, as we don't as a project yet have a standard way of 
> setting up xid exhaustion in a regression test, do we?  The closest I saw to 
> it was your work in [1], but that doesn't seem to have made much headway 
> recently, and is designed for the TAP testing infrastructure, which isn't 
> useable from inside an isolation test.  Do you have a suggestion how best to 
> continue developing out the test infrastructure?
>
>
> Amul,
>
> The most obvious way to test how your ALTER SYSTEM READ ONLY feature 
> interacts with concurrent sessions is using the isolation tester in 
> src/test/isolation/, but as it stands now, the first permutation that gets a 
> FATAL causes the test to abort and all subsequent permutations to not run.  
> Attached patch v34-0009 fixes that.
>

Interesting.

> Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting 
> with a system that is set read-only by a different session.  The expected 
> output is worth reviewing to see how this plays out.  I don't see anything in 
> there which is obviously wrong, but some of it is a bit clunky.  For example, 
> by the time the client sees an error "FATAL:  WAL is now prohibited", the 
> system may already have switched back to read-write.  Also, it is a bit 
> strange to get one of these errors on an attempted ROLLBACK.  Once again, not 
> wrong as such, but clunky.
>

Can't we do the same in the TAP test? If the intention is only to test
session termination when the system changes to WAL are prohibited then
that I have added in the latest version, but that test does not
reinitiate the same connection again, I think that is not possible
there too.


Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-20 Thread Amul Sul

On Wed, Sep 15, 2021 at 9:38 PM Robert Haas  wrote:
>
> On Wed, Sep 15, 2021 at 10:32 AM Robert Haas  wrote:
> > Putting these changes into 0001 seems to make no sense. It seems like
> > they should be part of 0003, or maybe a new 0004 patch.
>
> After looking at this a little bit more, I think it's really necessary
> to separate out all of your changes into separate patches at least for
> initial review. It's particularly important to separate code movement
> changes from other kinds of changes. 0001 was just moving code before,
> and so was 0002, but now both are making other changes, which is not
> easy to see from looking at the 'git diff' output. For that reason
> it's not so easy to understand exactly what you've changed here and
> analyze it.
>

Ok, understood, I have separated my changes into 0001 and 0002 patch,
and the refactoring patches start from 0003.

In the 0001 patch, I have copied ArchiveRecoveryRequested to shared
memory as said previously. Coping ArchiveRecoveryRequested value to
shared memory is not really interesting, and I think somehow we should
reuse existing variable, (perhaps, with some modification of the
information it can store, e.g. adding one more enum value for
SharedRecoveryState or something else), thinking on the same.

In addition to that, I tried to turn down the scope of
ArchiveRecoveryRequested global variable. Now, this is a static
variable, and the scope is limited to xlog.c file like
LocalXLogInsertAllowed and can be accessed through the newly added
function ArchiveRecoveryIsRequested() (like PromoteIsTriggered()). Let
me know what you think about the approach.

In 0002 patch is a mixed one where I tried to remove the dependencies
on global variables and local variables belonging to StartupXLOG(). I
am still worried about the InRecovery value that needs to be deduced
afterward inside XLogAcceptWrites().  Currently, relying on
ControlFile->state != DB_SHUTDOWNED check but I think that will not be
good for ASRO where we plan to skip XLogAcceptWrites() work only and
let the StartupXLOG() do the rest of the work as it is where it will
going to update ControlFile's DBState to DB_IN_PRODUCTION, then we
might need some ugly kludge to call PerformRecoveryXLogAction() in
checkpointer irrespective of DBState, which makes me a bit
uncomfortable.

> I poked around a little bit at these patches, looking for
> perhaps-interesting global variables upon which the code called from
> XLogAcceptWrites() would depend with your patches applied. The most
> interesting ones seem to be (1) ThisTimeLineID, which you mentioned
> and which may be fine but I'm not totally convinced yet, (2)
> LocalXLogInsertAllowed, which is probably not broken but I'm thinking
> we may want to redesign that mechanism somehow to make it cleaner, and

Thanks for the off-list detailed explanation on this.

For somebody else who might be reading this, the concern here is (not
really a concern, it is a good thing to improve) the
LocalSetXLogInsertAllowed() function call, is a kind of hack that
enables wal writes irrespective of RecoveryInProgress() for a shorter
period. E.g. see following code in StartupXLOG:

"
  LocalSetXLogInsertAllowed();
  UpdateFullPageWrites();
  LocalXLogInsertAllowed = -1;

  /*
   * If any of the critical GUCs have changed, log them before we allow
   * backends to write WAL.
   */
  LocalSetXLogInsertAllowed();
  XLogReportParameters();
"

Instead of explicitly enabling wal insert, somehow that implicitly
allowed for the startup process and/or the checkpointer doing the
first checkpoint and/or wal writes after the recovery. Well, the
current LocalSetXLogInsertAllowed() mechanism is not really harming
anything or bad and does not necessarily need to change but it would
be nice if we were able to come up with something much cleaner,
bug-free, and 100% perfect enough design.

(Hope I am not missing anything from the discussion).

> (3) CheckpointStats, which is called from RemoveXlogFile which is
> called from RemoveNonParentXlogFiles which is called from
> CleanupAfterArchiveRecovery which is called from XLogAcceptWrites.
> This last one is actually pretty weird already in the existing code.
> It sort of looks like RemoveXlogFile() only expects to be called from
> the checkpointer (or a standalone backend) so that it can update
> CheckpointStats and have that just work, but actually it's also called
> from the startup process when a timeline switch happens. I don't know
> whether the fact that the increments to ckpt_segs_recycled get lost in
> that case should be considered an intentional behavior that should be
> preserved or an inadvertent mistake.
>

Maybe I could be wrong, but I think that is intentional.  It removes
pre-allocated or bogus files of the old timeline which are not
supposed to be considered in stats. The comments for
CheckpointStatsData might not be clear but comment at the calling
RemoveNonParentXlogFiles() place inside StartupXLOG hints the same:

"
/*
 *

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-15 Thread Robert Haas

On Wed, Sep 15, 2021 at 10:32 AM Robert Haas  wrote:
> Putting these changes into 0001 seems to make no sense. It seems like
> they should be part of 0003, or maybe a new 0004 patch.

After looking at this a little bit more, I think it's really necessary
to separate out all of your changes into separate patches at least for
initial review. It's particularly important to separate code movement
changes from other kinds of changes. 0001 was just moving code before,
and so was 0002, but now both are making other changes, which is not
easy to see from looking at the 'git diff' output. For that reason
it's not so easy to understand exactly what you've changed here and
analyze it.

I poked around a little bit at these patches, looking for
perhaps-interesting global variables upon which the code called from
XLogAcceptWrites() would depend with your patches applied. The most
interesting ones seem to be (1) ThisTimeLineID, which you mentioned
and which may be fine but I'm not totally convinced yet, (2)
LocalXLogInsertAllowed, which is probably not broken but I'm thinking
we may want to redesign that mechanism somehow to make it cleaner, and
(3) CheckpointStats, which is called from RemoveXlogFile which is
called from RemoveNonParentXlogFiles which is called from
CleanupAfterArchiveRecovery which is called from XLogAcceptWrites.
This last one is actually pretty weird already in the existing code.
It sort of looks like RemoveXlogFile() only expects to be called from
the checkpointer (or a standalone backend) so that it can update
CheckpointStats and have that just work, but actually it's also called
from the startup process when a timeline switch happens. I don't know
whether the fact that the increments to ckpt_segs_recycled get lost in
that case should be considered an intentional behavior that should be
preserved or an inadvertent mistake.

So I think you've covered most of the necessary things here, with
probably some more discussion needed on whether you've done the right
things...

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-15 Thread Robert Haas

On Wed, Sep 15, 2021 at 6:49 AM Amul Sul  wrote:
>  Initially, I thought to
> use SharedRecoveryState which is always set to RECOVERY_STATE_ARCHIVE,
> if  the archive recovery requested. But there is another case where
> SharedRecoveryState could be RECOVERY_STATE_ARCHIVE irrespective of
> ArchiveRecoveryRequested value, that is the presence of a backup label
> file.

Right, there's a difference between whether archive recovery has been
*requested* and whether it is actually *happening*.

> If we want to use SharedRecoveryState, we need one more state
> which could differentiate between ArchiveRecoveryRequested and the
> backup label file presence case.  To move ahead, I have copied
> ArchiveRecoveryRequested into shared memory and it will be cleared
> once archive cleanup is finished. With all these changes, we could get
> rid of xlogaction argument and DetermineRecoveryXlogAction() function.
> Could move its logic to PerformRecoveryXLogAction() directly.

Putting these changes into 0001 seems to make no sense. It seems like
they should be part of 0003, or maybe a new 0004 patch.

> And for ThisTimeLineID, I don't think we need to do anything since this
> value is available with all the backend as per the following comment:
> "
> /*
>  * ThisTimeLineID will be same in all backends --- it identifies current
>  * WAL timeline for the database system.
>  */
> TimeLineID  ThisTimeLineID = 0;

I'm not sure I find that argument totally convincing. The two
variables aren't assigned at exactly the same places in the code,
nonwithstanding the comment. I'm not saying you're wrong. I'm just
saying I don't believe it just because the comment says so.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-15 Thread Amul Sul

, On Sat, Jul 24, 2021 at 1:33 AM Robert Haas  wrote:
>
> On Thu, Jun 17, 2021 at 1:23 AM Amul Sul  wrote:
> > Attached is rebase for the latest master head.  Also, I added one more
> > refactoring code that deduplicates the code setting database state in the
> > control file. The same code set the database state is also needed for this
> > feature.
>
> I started studying 0001 today and found that it rearranged the order
> of operations in StartupXLOG() more than I was expecting. It does, as
> per previous discussions, move a bunch of things to the place where we
> now call XLogParamters(). But, unsatisfyingly, InRecovery = false and
> XLogReaderFree() then have to move down even further. Since the goal
> here is to get to a situation where we sometimes XLogAcceptWrites()
> after InRecovery = false, it didn't seem nice for this refactoring
> patch to still end up with a situation where this stuff happens while
> InRecovery = true. In fact, with the patch, the amount of code that
> runs with InRecovery = true actually *increases*, which is not what I
> think should be happening here. That's why the patch ends up having to
> adjust SetMultiXactIdLimit to not Assert(!InRecovery).
>
> And then I started to wonder how this was ever going to work as part
> of the larger patch set, because as you have it here,
> XLogAcceptWrites() takes arguments XLogReaderState *xlogreader,
> XLogRecPtr EndOfLog, and TimeLineID EndOfLogTLI and if the
> checkpointer is calling that at a later time after the user issues
> pg_prohibit_wal(false), it's going to have none of those things. So I
> had a quick look at that part of the code and found this in
> checkpointer.c:
>
> XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
>
> For those following along from home, the additional "true" is a bool
> needChkpt argument added to XLogAcceptWrites() by 0003. Well, none of
> this is very satisfying. The whole purpose of passing the xlogreader
> is so we can figure out whether we need a checkpoint (never mind the
> question of whether the existing algorithm for determining that is
> really sensible) but now we need a second argument that basically
> serves the same purpose since one of the two callers to this function
> won't have an xlogreader. And then we're passing the EndOfLog and
> EndOfLogTLI as dummy values which seems like it's probably just
> totally wrong, but if for some reason it works correctly there sure
> don't seem to be any comments explaining why.
>
> So I started doing a bit of hacking myself and ended up with the
> attached, which I think is not completely the right thing yet but I
> think it's better than your version. I split this into three parts.
> 0001 splits up the logic that currently decides whether to write an
> end-of-recovery record or a checkpoint record and if the latter how
> the checkpoint ought to be performed into two functions.
> DetermineRecoveryXlogAction() figures out what we want to do, and
> PerformRecoveryXlogAction() does it. It also moves the code to run
> recovery_end_command and related stuff into a new function
> CleanupAfterArchiveRecovery(). 0002 then builds on this by postponing
> UpdateFullPageWrites(), PerformRecoveryXLogAction(), and
> CleanupAfterArchiveRecovery() to just before we
> XLogReportParameters(). Because of the refactoring done by 0001, this
> is only a small amount of code movement. Because of the separation
> between DetermineRecoveryXlogAction() and PerformRecoveryXlogAction(),
> the latter doesn't need the xlogreader. So we can do
> DetermineRecoveryXlogAction() at the same time as now, while the
> xlogreader is available, and then we don't need it later when we
> PerformRecoveryXlogAction(), because we already know what we need to
> know. I think this is all fine as far as it goes.
>
> My 0003 is where I see some lingering problems. It creates
> XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
> need the xlogreader. But it doesn't really solve the problem of how
> checkpointer.c would be able to call this function with proper
> arguments. It is at least better in not needing two arguments to
> decide what to do, but how is checkpointer.c supposed to know what to
> pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
> what to pass for EndOfLogTLI and EndOfLog? Actually, EndOfLog doesn't
> seem too problematic, because that value has been stored in four (!)
> places inside XLogCtl by this code:
>
> LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
>
> XLogCtl->LogwrtResult = LogwrtResult;
>
> XLogCtl->LogwrtRqst.Write = EndOfLog;
> XLogCtl->LogwrtRqst.Flush = EndOfLog;
>
> Presumably we could relatively easily change things around so that we
> finish one of those values ... probably one of the "write" values ..
> back out of XLogCtl instead of passing it as a parameter. That would
> work just as well from the checkpointer as from the startup process,
> and there seems to be no way for the value to change until

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-14 Thread Mark Dilger

> On Jun 16, 2020, at 6:55 AM, amul sul  wrote:
> 
> (2) if the session is idle, we also need the top-level abort
> record to be written immediately, but can't send an error to the client until 
> the next
> command is issued without losing wire protocol synchronization. For now, we 
> just use
> FATAL to kill the session; maybe this can be improved in the future.

Andres,

I'd like to have a patch that tests the impact of a vacuum running for xid 
wraparound purposes, blocked on a pinned page held by the cursor, when another 
session disables WAL.  It would be very interesting to test how the vacuum 
handles that specific change.  I have not figured out the cleanest way to do 
this, though, as we don't as a project yet have a standard way of setting up 
xid exhaustion in a regression test, do we?  The closest I saw to it was your 
work in [1], but that doesn't seem to have made much headway recently, and is 
designed for the TAP testing infrastructure, which isn't useable from inside an 
isolation test.  Do you have a suggestion how best to continue developing out 
the test infrastructure?

Amul,

The most obvious way to test how your ALTER SYSTEM READ ONLY feature interacts 
with concurrent sessions is using the isolation tester in src/test/isolation/, 
but as it stands now, the first permutation that gets a FATAL causes the test 
to abort and all subsequent permutations to not run.  Attached patch v34-0009 
fixes that.

Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting 
with a system that is set read-only by a different session.  The expected 
output is worth reviewing to see how this plays out.  I don't see anything in 
there which is obviously wrong, but some of it is a bit clunky.  For example, 
by the time the client sees an error "FATAL:  WAL is now prohibited", the 
system may already have switched back to read-write.  Also, it is a bit strange 
to get one of these errors on an attempted ROLLBACK.  Once again, not wrong as 
such, but clunky.

v34-0009-Make-isolationtester-handle-closed-sessions.patch
Description: Binary data

v34-0010-Test-ALTER-SYSTEM-READ-ONLY-against-cursors.patch
Description: Binary data

[1] 
https://www.postgresql.org/message-id/flat/CAP4vRV5gEHFLB7NwOE6_dyHAeVfkvqF8Z_g5GaCQZNgBAE0Frw%40mail.gmail.com#e10861372aec22119b66756ecbac581c

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-10 Thread Robert Haas

On Fri, Sep 10, 2021 at 1:16 PM Mark Dilger
 wrote:
> uses "immediately" and "will kill the running transaction" which reenforced 
> the impression that this mechanism is heavier handed than it is.

It's intended to be just as immediate as e.g. pg_cancel_backend() and
pg_terminate_backend(), which work just the same way, but not any more
so. I guess we could look at how things are worded in those cases.
>From a user perspective such things are usually pretty immediate, but
not as immediate as firing a signal handler. Computers are fast.[1]

-- 
Robert Haas
EDB: http://www.enterprisedb.com

[1] https://www.youtube.com/watch?v=6xijhqU8r2A

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-10 Thread Mark Dilger

> On Sep 10, 2021, at 9:56 AM, Robert Haas  wrote:
> 
> I think the relevant question here is not "could a signal handler
> fire?" but "can we hit a CHECK_FOR_INTERRUPTS()?". If the relevant
> question is the former, then there's no hope of ever making it work
> because there's always a race condition. But the signal handler is
> only setting flags whose only effect is to make a subsequent
> CHECK_FOR_INTERRUPTS() do something, so it doesn't really matter when
> the signal handler can run, but when CHECK_FOR_INTERRUPTS() can call
> ProcessInterrupts().

Ok, that makes more sense.  I was reviewing the code after first reviewing the 
documentation changes, which lead me to believe the system was designed to 
respond more quickly than that:

+WAL prohibited is a read-only system state. Any permitted user can call
+pg_prohibit_wal function to forces the system into
+a WAL prohibited mode where insert write ahead log will be prohibited until
+the same function executed to change that state to read-write. Like Hot

and

+Otherwise, it will be off.  When the user requests WAL
+prohibited state, at that moment if any existing session is already running
+a transaction, and that transaction has already been performed or planning
+to perform wal write operations then the session running that transaction
+will be terminated.

"forces the system" in the first part, and "at that moment ... that transaction 
will be terminated" sounds heavier handed than something which merely sets a 
flag asking the backend to exit.  I was reading that as more immediate and then 
trying to figure out how the signal handling could possibly work, and failing 
to see how.

The README:

+Any
+backends which receive WAL prohibited system state transition barrier interrupt
+need to stop WAL writing immediately.  For barrier absorption the backed(s) 
will
+kill the running transaction which has valid XID indicates that the transaction
+has performed and/or planning WAL write.

uses "immediately" and "will kill the running transaction" which reenforced the 
impression that this mechanism is heavier handed than it is.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-10 Thread Robert Haas

On Fri, Sep 10, 2021 at 12:20 PM Mark Dilger
 wrote:
> A better example may be found in ginmetapage.c:
>
> needwal = RelationNeedsWAL(indexrel);
> if (needwal)
> {
> CheckWALPermitted();
> computeLeafRecompressWALData(leaf);
> }
>
> /* Apply changes to page */
> START_CRIT_SECTION();

Yeah, that looks sketchy. Why not move CheckWALPermitted() down a line?

> Even if CheckWALPermitted is assumed to be close enough to atomic to not be a 
> problem (I don't agree), that argument can't be made here, as 
> computeLeafRecompressWALData is not trivial and signals could easily be 
> processed while it is running.

I think the relevant question here is not "could a signal handler
fire?" but "can we hit a CHECK_FOR_INTERRUPTS()?". If the relevant
question is the former, then there's no hope of ever making it work
because there's always a race condition. But the signal handler is
only setting flags whose only effect is to make a subsequent
CHECK_FOR_INTERRUPTS() do something, so it doesn't really matter when
the signal handler can run, but when CHECK_FOR_INTERRUPTS() can call
ProcessInterrupts().

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-10 Thread Mark Dilger




> On Sep 10, 2021, at 8:42 AM, Mark Dilger  wrote:
> 
> Take for example a code stanza from heapam.c:
> 
>if (needwal)
>CheckWALPermitted();
> 
>/* NO EREPORT(ERROR) from here till changes are logged */
>START_CRIT_SECTION();
> 
> Now, I know that interrupts won't be processed after starting the critical 
> section, but I can see plain as day that an interrupt might get processed 
> *during* CheckWALPermitted, since that function isn't atomic. 

A better example may be found in ginmetapage.c:

needwal = RelationNeedsWAL(indexrel);
if (needwal)
{
CheckWALPermitted();
computeLeafRecompressWALData(leaf);
}

/* Apply changes to page */
START_CRIT_SECTION();

Even if CheckWALPermitted is assumed to be close enough to atomic to not be a 
problem (I don't agree), that argument can't be made here, as 
computeLeafRecompressWALData is not trivial and signals could easily be 
processed while it is running.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-10 Thread Mark Dilger

> On Sep 10, 2021, at 7:36 AM, Amul Sul  wrote:
> 
>> v33-0005
>> 
>> This patch makes bool XLogInsertAllowed() more complicated than before.  The 
>> result used to depend mostly on the value of LocalXLogInsertAllowed except 
>> that when that value was negative, the result was determined by 
>> RecoveryInProgress(). There was an arcane rule that LocalXLogInsertAllowed 
>> must have the non-negative values binary coercible to boolean "true" and 
>> "false", with the basis for that rule being the coding of 
>> XLogInsertAllowed().  Now that the function is more complicated, this rule 
>> seems even more arcane.  Can we change the logic to not depend on casting an 
>> integer to bool?
>> 
> 
> We can't use a boolean variable because LocalXLogInsertAllowed
> represents three states as, 1 means "wal is allowed'', 0 for "wal is
> disallowed", and -1 is for "need to check".

I'm complaining that we're using an integer rather than an enum.  I'm ok if we 
define it so that WAL_ALLOWABLE_UNKNOWN = -1, WAL_DISALLOWED = 0, WAL_ALLOWED = 
1 or such, but the logic of the function has gotten complicated enough that 
having to remember which number represents which logical condition has become a 
(small) mental burden.  Given how hard the WAL code is to read and fully grok, 
I'd rather avoid any unnecessary burden, even small ones.

>> The new function CheckWALPermitted() seems to test the current state of 
>> variables but not lock any of them, and the new function comment says:
>> 
> 
> CheckWALPermitted() calls XLogInsertAllowed() does check the
> LocalXLogInsertAllowed flag which is local to that process only, and
> nobody else reads that concurrently.
> 
>> /*
>> * In opposite to the above assertion if a transaction doesn't have valid XID
>> * (e.g. VACUUM) then it won't be killed while changing the system state to 
>> WAL
>> * prohibited.  Therefore, we need to explicitly error out before entering 
>> into
>> * the critical section.
>> */
>> 
>> This suggests to me that a vacuum process can check whether wal is 
>> prohibited, then begin a critical section which needs wal to be allowed, and 
>> concurrently somebody else might disable wal without killing the vacuum 
>> process.  I'm given to wonder what horrors await when the vacuum process 
>> does something that needs to be wal logged but cannot be.  Does it trigger a 
>> panic?  I don't like the idea that calling pg_prohibit_wal durning a vacuum 
>> might panic the cluster.  If there is some reason this is not a problem, I 
>> think the comment should explain it.  In particular, why is it sufficient to 
>> check whether wal is prohibited before entering the critical section and not 
>> necessary to be sure it remains allowed through the lifetime of that 
>> critical section?
>> 
> 
> Hm, interrupts absorption are disabled inside the critical section.
> The wal prohibited state for that process (here vacuum) will never get
> set until it sees the interrupts & the system will not be said wal
> prohibited until every process sees that interrupts. I am not sure we
> should explain the characteristics of the critical section at this
> place, if want, we can add a brief saying that inside the critical
> section we should not worry about the state change which never happens
> because interrupts are disabled there.

I think the fact that interrupts are disabled during critical sections is 
understood, so there is no need to mention that.  The problem is that the 
method for taking the system read-only is less generally known, and readers of 
other sections of code need to jump to the definition of CheckWALPermitted to 
read the comments and understand what it does.  Take for example a code stanza 
from heapam.c:

if (needwal)
CheckWALPermitted();

/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();

Now, I know that interrupts won't be processed after starting the critical 
section, but I can see plain as day that an interrupt might get processed 
*during* CheckWALPermitted, since that function isn't atomic.  It might happen 
after the check is meaningfully finished but before the function actually 
returns.  So I'm not inclined to believe that the way this all works is 
dependent on interrupts being blocked.  So I think, maybe this is all protected 
by some other scheme.  But what?  It's not clear from the code comments for 
CheckWALPermitted, so I'm left having to reverse engineer the system to 
understand it.

One interpretation is that the signal handler will exit() my backend if it 
receives a signal saying that the system is going read-only, so there is no 
race condition.  But then why the call to CheckWALPermitted()?  If this 
interpretation were correct, we'd happily enter the critical section without 
checking, secure in the knowledge that as long as we haven't exited yet, all is 
ok.

Another interpretation is that the whole thing is just a performance trick.  
Maybe we're ok with the idea that we will

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-10 Thread Amul Sul

On Thu, Sep 9, 2021 at 11:12 PM Mark Dilger
 wrote:
>
>

Thank you, for looking at the patch.  Please see my reply inline below:

>
> > On Sep 8, 2021, at 6:44 AM, Amul Sul  wrote:
> >
> > Here is the rebased version.
>
> v33-0004
>
> This patch moves the include of "catalog/pg_control.h" from transam/xlog.c 
> into access/xlog.h, making pg_control.h indirectly included from a much 
> larger set of files.  Maybe that's ok.  I don't know.  But it seems you are 
> doing this merely to get the symbol (not even the definition) for struct 
> DBState.  I'd recommend rearranging the code so this isn't necessary, but 
> otherwise you'd at least want to remove the now redundant includes of 
> catalog/pg_control.h from xlogdesc.c, xloginsert.c, auth-scram.c, 
> postmaster.c, misc/pg_controldata.c, and pg_controldata/pg_controldata.c.
>

Yes, you are correct, xlog.h is included in more than 150 files. I was
wondering if we can have a forward declaration instead of including
pg_control.h (e.g. The same way struct XLogRecData was declared in
xlog.h). Perhaps, DBState is enum & I don't see we have done the same
for enum elsewhere as we are doing for structures, but that seems to
be fine, IMO.

Earlier, I was unsure before preparing this patch, but since that
makes sense (I assumed) and minimizes duplications, can we go ahead
and post separately with the same change in StartupXLOG() which I have
skipped for the same reason mentioned in patch commit-msg.

> v33-0005
>
> This patch makes bool XLogInsertAllowed() more complicated than before.  The 
> result used to depend mostly on the value of LocalXLogInsertAllowed except 
> that when that value was negative, the result was determined by 
> RecoveryInProgress().  There was an arcane rule that LocalXLogInsertAllowed 
> must have the non-negative values binary coercible to boolean "true" and 
> "false", with the basis for that rule being the coding of 
> XLogInsertAllowed().  Now that the function is more complicated, this rule 
> seems even more arcane.  Can we change the logic to not depend on casting an 
> integer to bool?
>

We can't use a boolean variable because LocalXLogInsertAllowed
represents three states as, 1 means "wal is allowed'', 0 for "wal is
disallowed", and -1 is for "need to check".

> The code comment change in autovacuum.c introduces a non-grammatical 
> sentence: "First, the system is not read only i.e. wal writes permitted".
>
> The function comment in checkpointer.c reads more like it toggles the system 
> into allowing something, rather than actually doing that same something: 
> "SendSignalToCheckpointer allows a process to send a signal to the checkpoint 
> process".
>
> The new code comment in ipci.c contains a typo, but more importantly, it 
> doesn't impart any knowledge beyond what a reader of the function name could 
> already surmise.  Perhaps the comment can better clarify what is happening: 
> "Set up wal probibit shared state"
>
> The new code comment in sync.c copies and changes a nearby comment but drops 
> part of the verb phrase:  "As in ProcessSyncRequests, we don't want to stop 
> wal prohibit change requests".  The nearby comment reads "stop absorbing".  I 
> think this one should read "stop processing".  This same comment is used 
> again below.   Then a third comment reads "For the same reason mentioned 
> previously for the wal prohibit state change request check."  That third 
> comment is too glib.
>
> tcop/utility.c needlessly includes "access/walprohibit.h"
>
> wait_event.h extends enum WaitEventIO with new values 
> WAIT_EVENT_WALPROHIBIT_STATE and WAIT_EVENT_WALPROHIBIT_STATE_CHANGE.  I 
> don't find the difference between these two names at all clear.  Waiting for 
> a state change is clear enough.  But how is waiting on a state different?
>
> xlog.h defines a new enum.  I don't find any of it clear; not the comment, 
> nor the name of the enum, nor the names of the values:
>
> /* State of work that enables wal writes */
> typedef enum XLogAcceptWritesState
> {
> XLOG_ACCEPT_WRITES_PENDING = 0, /* initial state, not started */
> XLOG_ACCEPT_WRITES_SKIPPED, /* skipped wal writes */
> XLOG_ACCEPT_WRITES_DONE /* wal writes are enabled */
> } XLogAcceptWritesState;
>
> This enum seems to have been written from the point of view of someone who 
> already knew what it was for.  It needs to be written in a way that will be 
> clear to people who have no idea what it is for.
>
> v33-0006:
>
> The new code comments in brin.c and elsewhere should use the verb "require" 
> rather than "have", otherwise "building indexes" reads as a noun phrase 
> rather than as a gerund: /* Building indexes will have an XID */
>

Will try to think about the pointed code comments for the improvements.

> The new function CheckWALPermitted() seems to test the current state of 
> variables but not lock any of them, and the new function comment says:
>

CheckWALPermitted() calls XLogInsertAllowed() does check the

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-09 Thread Mark Dilger

> On Sep 9, 2021, at 11:21 AM, Robert Haas  wrote:
> 
> They have to check whether WAL has become prohibited
> and error out if so, and they need to do so before entering the
> critical section - because if the problem were detected for the first
> time inside the critical section it would escalate to a PANIC, which
> we do not want.

But that is the part that is still not clear.  Should the comment say that a 
concurrent change to prohibit wal after the current process checks but before 
the current process exists the critical section will result in a panic?  What 
is unclear about the comment is that it implies that a check before the 
critical section is sufficient, but ordinarily one would expect a lock to be 
held and the check-and-lock dance to carefully avoid any race condition.  If 
somehow this is safe, the logic for why it is safe should be spelled out.  If 
not, a mia culpa saying, "hey, were not terribly safe about this" should be 
explicit in the comment.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-09 Thread Robert Haas

On Thu, Sep 9, 2021 at 1:42 PM Mark Dilger  wrote:
> v33-0006:
>
> The new code comments in brin.c and elsewhere should use the verb "require" 
> rather than "have", otherwise "building indexes" reads as a noun phrase 
> rather than as a gerund: /* Building indexes will have an XID */

Honestly that sentence doesn't sound very clear even with a different verb.

> This suggests to me that a vacuum process can check whether wal is 
> prohibited, then begin a critical section which needs wal to be allowed, and 
> concurrently somebody else might disable wal without killing the vacuum 
> process.  I'm given to wonder what horrors await when the vacuum process does 
> something that needs to be wal logged but cannot be.  Does it trigger a 
> panic?  I don't like the idea that calling pg_prohibit_wal durning a vacuum 
> might panic the cluster.  If there is some reason this is not a problem, I 
> think the comment should explain it.  In particular, why is it sufficient to 
> check whether wal is prohibited before entering the critical section and not 
> necessary to be sure it remains allowed through the lifetime of that critical 
> section?

The idea here is that if a transaction already has an XID assigned, we
have to kill it off before we can declare the system read-only,
because it will definitely write WAL when the transaction ends: either
a commit record, or an abort record, but definitely something. So
cases where we write WAL without necessarily having an XID require
special handling. They have to check whether WAL has become prohibited
and error out if so, and they need to do so before entering the
critical section - because if the problem were detected for the first
time inside the critical section it would escalate to a PANIC, which
we do not want. Places where we're guaranteed to have an XID - e.g.
inserting a heap tuple - don't need a run-time check before entering
the critical section, because the code can't be reached in the first
place if the system is WAL-read-only.

> Why is this function defined to take a boolean such that 
> pg_prohibit_wal(true) means to prohibit wal and pg_prohibit_wal(false) means 
> to allow wal.  Wouldn't a different function named pg_allow_wal() make it 
> more clear?  This also would be a better interface if taking the system 
> read-only had a timeout as I suggested above, as such a timeout parameter 
> when allowing wal is less clearly useful.

Hmm, I find pg_prohibit_wal(true/false) better than pg_prohibit_wal()
and pg_allow_wal(), and would prefer pg_prohibit_wal(true/false,
timeout) over pg_prohibit_wal(timeout) and pg_allow_wal(), because I
think then once you find that one function you know how to do
everything about that feature, whereas the other way you need to find
both functions to have the whole story. That said, I can see why
somebody else might prefer something else.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-09 Thread Mark Dilger

> On Sep 8, 2021, at 6:44 AM, Amul Sul  wrote:
> 
> Here is the rebased version.

v33-0004 

This patch moves the include of "catalog/pg_control.h" from transam/xlog.c into 
access/xlog.h, making pg_control.h indirectly included from a much larger set 
of files.  Maybe that's ok.  I don't know.  But it seems you are doing this 
merely to get the symbol (not even the definition) for struct DBState.  I'd 
recommend rearranging the code so this isn't necessary, but otherwise you'd at 
least want to remove the now redundant includes of catalog/pg_control.h from 
xlogdesc.c, xloginsert.c, auth-scram.c, postmaster.c, misc/pg_controldata.c, 
and pg_controldata/pg_controldata.c.

v33-0005 

This patch makes bool XLogInsertAllowed() more complicated than before.  The 
result used to depend mostly on the value of LocalXLogInsertAllowed except that 
when that value was negative, the result was determined by 
RecoveryInProgress().  There was an arcane rule that LocalXLogInsertAllowed 
must have the non-negative values binary coercible to boolean "true" and 
"false", with the basis for that rule being the coding of XLogInsertAllowed().  
Now that the function is more complicated, this rule seems even more arcane.  
Can we change the logic to not depend on casting an integer to bool?

The code comment change in autovacuum.c introduces a non-grammatical sentence: 
"First, the system is not read only i.e. wal writes permitted".

The function comment in checkpointer.c reads more like it toggles the system 
into allowing something, rather than actually doing that same something: 
"SendSignalToCheckpointer allows a process to send a signal to the checkpoint 
process".

The new code comment in ipci.c contains a typo, but more importantly, it 
doesn't impart any knowledge beyond what a reader of the function name could 
already surmise.  Perhaps the comment can better clarify what is happening: 
"Set up wal probibit shared state"

The new code comment in sync.c copies and changes a nearby comment but drops 
part of the verb phrase:  "As in ProcessSyncRequests, we don't want to stop wal 
prohibit change requests".  The nearby comment reads "stop absorbing".  I think 
this one should read "stop processing".  This same comment is used again below. 
  Then a third comment reads "For the same reason mentioned previously for the 
wal prohibit state change request check."  That third comment is too glib.

tcop/utility.c needlessly includes "access/walprohibit.h"

wait_event.h extends enum WaitEventIO with new values 
WAIT_EVENT_WALPROHIBIT_STATE and WAIT_EVENT_WALPROHIBIT_STATE_CHANGE.  I don't 
find the difference between these two names at all clear.  Waiting for a state 
change is clear enough.  But how is waiting on a state different?

xlog.h defines a new enum.  I don't find any of it clear; not the comment, nor 
the name of the enum, nor the names of the values:

/* State of work that enables wal writes */
typedef enum XLogAcceptWritesState
{
XLOG_ACCEPT_WRITES_PENDING = 0, /* initial state, not started */
XLOG_ACCEPT_WRITES_SKIPPED, /* skipped wal writes */
XLOG_ACCEPT_WRITES_DONE /* wal writes are enabled */
} XLogAcceptWritesState; 

This enum seems to have been written from the point of view of someone who 
already knew what it was for.  It needs to be written in a way that will be 
clear to people who have no idea what it is for.

v33-0006:

The new code comments in brin.c and elsewhere should use the verb "require" 
rather than "have", otherwise "building indexes" reads as a noun phrase rather 
than as a gerund: /* Building indexes will have an XID */

The new function CheckWALPermitted() seems to test the current state of 
variables but not lock any of them, and the new function comment says:

/*
 * In opposite to the above assertion if a transaction doesn't have valid XID
 * (e.g. VACUUM) then it won't be killed while changing the system state to WAL
 * prohibited.  Therefore, we need to explicitly error out before entering into
 * the critical section.
 */

This suggests to me that a vacuum process can check whether wal is prohibited, 
then begin a critical section which needs wal to be allowed, and concurrently 
somebody else might disable wal without killing the vacuum process.  I'm given 
to wonder what horrors await when the vacuum process does something that needs 
to be wal logged but cannot be.  Does it trigger a panic?  I don't like the 
idea that calling pg_prohibit_wal durning a vacuum might panic the cluster.  If 
there is some reason this is not a problem, I think the comment should explain 
it.  In particular, why is it sufficient to check whether wal is prohibited 
before entering the critical section and not necessary to be sure it remains 
allowed through the lifetime of that critical section?

v33-0007:

I don't really like what the documentation has to say about pg_prohibit_wal.  
Why should pg_prohibit_wal differ from other signal sending functions in 
whether it returns a

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-07 Thread Amul Sul

On Tue, 7 Sep 2021 at 8:43 PM, Mark Dilger 
wrote:

>
>
> > On Aug 31, 2021, at 5:15 AM, Amul Sul  wrote:
> >
> > Attached is the rebased version for the latest master head.
>
> Hi Amul!
>
> Could you please rebase again?
>

Ok will do that tomorrow, thanks.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-07 Thread Mark Dilger




> On Aug 31, 2021, at 5:15 AM, Amul Sul  wrote:
> 
> Attached is the rebased version for the latest master head. 

Hi Amul!

Could you please rebase again?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [Patch] ALTER SYSTEM READ ONLY

2021-09-02 Thread Robert Haas

On Tue, Aug 31, 2021 at 8:16 AM Amul Sul  wrote:
> Attached is the rebased version for the latest master head. Also,
> added tap tests to test some part of this feature and a separate patch
> to test recovery_end_command execution.

It looks like you haven't given any thought to writing that in a way
that will work on Windows?

> What is usual practice, can have a few tests in TAP and a few in
> pg_regress for the same feature?

Sure, there's no problem with that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-07-30 Thread Prabhat Sahu

Hi,

On Thu, Jul 29, 2021 at 9:46 PM Robert Haas  wrote:

> On Wed, Jul 28, 2021 at 7:33 AM Amul Sul  wrote:
> > I was too worried about how I could miss that & after thinking more
> > about that, I realized that the operation for ArchiveRecoveryRequested
> > is never going to be skipped in the startup process and that never
> > left for the checkpoint process to do that later. That is the reason
> > that assert was added there.
> >
> > When ArchiveRecoveryRequested, the server will no longer be in
> > the wal prohibited mode, we implicitly change the state to
> > wal-permitted. Here is the snip from the 0003 patch:
>
> Ugh, OK. That makes sense, but I'm still not sure that I like it. I've
> kind of been wondering: why not have XLogAcceptWrites() be the
> responsibility of the checkpointer all the time, in every case? That
> would require fixing some more things, and this is one of them, but
> then it would be consistent, which means that any bugs would be likely
> to get found and fixed. If calling XLogAcceptWrites() from the
> checkpointer is some funny case that only happens when the system
> crashes while WAL is prohibited, then we might fail to notice that we
> have a bug.
>
> This is especially true given that we have very little test coverage
> in this area. Andres was ranting to me about this earlier this week,
> and I wasn't sure he was right, but then I noticed that we have
> exactly zero tests in the entire source tree that make use of
> recovery_end_command. We really need a TAP test for that, I think.
> It's too scary to do much reorganization of the code without having
> any tests at all for the stuff we're moving around. Likewise, we're
> going to need TAP tests for the stuff that is specific to this patch.
> For example, we should have a test that crashes the server while it's
> read only, brings it back up, checks that we still can't write WAL,
> then re-enables WAL, and checks that we now can write WAL. There are
> probably a bunch of other things that we should test, too.
>

Hi,

I have been testing “ALTER SYSTEM READ ONLY” and wrote a few tap test cases
for this feature.
Please find the test case(Draft version) attached herewith, to be applied
on top of the v30 patch by Amul.
Kindly have a review and let me know the required changes.
-- 

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com


prohibitwal-tap-test.patch
Description: Binary data

Re: [Patch] ALTER SYSTEM READ ONLY

2021-07-29 Thread Robert Haas

On Wed, Jul 28, 2021 at 7:33 AM Amul Sul  wrote:
> I was too worried about how I could miss that & after thinking more
> about that, I realized that the operation for ArchiveRecoveryRequested
> is never going to be skipped in the startup process and that never
> left for the checkpoint process to do that later. That is the reason
> that assert was added there.
>
> When ArchiveRecoveryRequested, the server will no longer be in
> the wal prohibited mode, we implicitly change the state to
> wal-permitted. Here is the snip from the 0003 patch:

Ugh, OK. That makes sense, but I'm still not sure that I like it. I've
kind of been wondering: why not have XLogAcceptWrites() be the
responsibility of the checkpointer all the time, in every case? That
would require fixing some more things, and this is one of them, but
then it would be consistent, which means that any bugs would be likely
to get found and fixed. If calling XLogAcceptWrites() from the
checkpointer is some funny case that only happens when the system
crashes while WAL is prohibited, then we might fail to notice that we
have a bug.

This is especially true given that we have very little test coverage
in this area. Andres was ranting to me about this earlier this week,
and I wasn't sure he was right, but then I noticed that we have
exactly zero tests in the entire source tree that make use of
recovery_end_command. We really need a TAP test for that, I think.
It's too scary to do much reorganization of the code without having
any tests at all for the stuff we're moving around. Likewise, we're
going to need TAP tests for the stuff that is specific to this patch.
For example, we should have a test that crashes the server while it's
read only, brings it back up, checks that we still can't write WAL,
then re-enables WAL, and checks that we now can write WAL. There are
probably a bunch of other things that we should test, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-07-29 Thread Amul Sul

On Thu, Jul 29, 2021 at 4:47 PM Dilip Kumar  wrote:
>
> On Wed, Jul 28, 2021 at 5:03 PM Amul Sul  wrote:
> >
> > I was too worried about how I could miss that & after thinking more
> > about that, I realized that the operation for ArchiveRecoveryRequested
> > is never going to be skipped in the startup process and that never
> > left for the checkpoint process to do that later. That is the reason
> > that assert was added there.
> >
> > When ArchiveRecoveryRequested, the server will no longer be in
> > the wal prohibited mode, we implicitly change the state to
> > wal-permitted. Here is the snip from the 0003 patch:
> >
> > @@ -6614,13 +6629,30 @@ StartupXLOG(void)
> >   (errmsg("starting archive recovery")));
> >   }
> >
> > - /*
> > - * Take ownership of the wakeup latch if we're going to sleep during
> > - * recovery.
> > - */
> >   if (ArchiveRecoveryRequested)
> > + {
> > + /*
> > + * Take ownership of the wakeup latch if we're going to sleep during
> > + * recovery.
> > + */
> >   OwnLatch(>recoveryWakeupLatch);
> >
> > + /*
> > + * Since archive recovery is requested, we cannot be in a wal prohibited
> > + * state.
> > + */
> > + if (ControlFile->wal_prohibited)
> > + {
> > + /* No need to hold ControlFileLock yet, we aren't up far enough */
> > + ControlFile->wal_prohibited = false;
> > + ControlFile->time = (pg_time_t) time(NULL);
> > + UpdateControlFile();
> > +
>
> Is there some reason why we are forcing 'wal_prohibited' to off if we
> are doing archive recovery?  It might have already been discussed, but
> I could not find it on a quick look into the thread.
>

Here is: 
https://postgr.es/m/CA+TgmoZ=cctbaxxmtyzogxegqzoz9smkbwrdpsacpjvfcgc...@mail.gmail.com

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2021-07-29 Thread Dilip Kumar

On Wed, Jul 28, 2021 at 5:03 PM Amul Sul  wrote:
>
> I was too worried about how I could miss that & after thinking more
> about that, I realized that the operation for ArchiveRecoveryRequested
> is never going to be skipped in the startup process and that never
> left for the checkpoint process to do that later. That is the reason
> that assert was added there.
>
> When ArchiveRecoveryRequested, the server will no longer be in
> the wal prohibited mode, we implicitly change the state to
> wal-permitted. Here is the snip from the 0003 patch:
>
> @@ -6614,13 +6629,30 @@ StartupXLOG(void)
>   (errmsg("starting archive recovery")));
>   }
>
> - /*
> - * Take ownership of the wakeup latch if we're going to sleep during
> - * recovery.
> - */
>   if (ArchiveRecoveryRequested)
> + {
> + /*
> + * Take ownership of the wakeup latch if we're going to sleep during
> + * recovery.
> + */
>   OwnLatch(>recoveryWakeupLatch);
>
> + /*
> + * Since archive recovery is requested, we cannot be in a wal prohibited
> + * state.
> + */
> + if (ControlFile->wal_prohibited)
> + {
> + /* No need to hold ControlFileLock yet, we aren't up far enough */
> + ControlFile->wal_prohibited = false;
> + ControlFile->time = (pg_time_t) time(NULL);
> + UpdateControlFile();
> +

Is there some reason why we are forcing 'wal_prohibited' to off if we
are doing archive recovery?  It might have already been discussed, but
I could not find it on a quick look into the thread.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-07-28 Thread Amul Sul

On Wed, Jul 28, 2021 at 4:37 PM Amul Sul  wrote:
>
> On Wed, Jul 28, 2021 at 2:26 AM Robert Haas  wrote:
> >
> > On Fri, Jul 23, 2021 at 4:03 PM Robert Haas  wrote:
> > > My 0003 is where I see some lingering problems. It creates
> > > XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
> > > need the xlogreader. But it doesn't really solve the problem of how
> > > checkpointer.c would be able to call this function with proper
> > > arguments. It is at least better in not needing two arguments to
> > > decide what to do, but how is checkpointer.c supposed to know what to
> > > pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
> > > what to pass for EndOfLogTLI and EndOfLog?
> >
> > On further study, I found another problem: the way my patch set leaves
> > things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which
> > will not be correctly initialized in any process other than the
> > startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog)
> > would just be skipped. Your 0001 seems to have the same problem. You
> > added Assert(AmStartupProcess()) to the inside of the if
> > (ArchiveRecoveryRequested) block, but that doesn't fix anything.
> > Outside the startup process, ArchiveRecoveryRequested will always be
> > false, but the point is that the associated stuff should be done if
> > ArchiveRecoveryRequested would have been true in the startup process.
> > Both of our patch sets leave things in a state where that would never
> > happen, which is not good. Unless I'm missing something, it seems like
> > maybe you didn't test your patches to verify that, when the
> > XLogAcceptWrites() call comes from the checkpointer, all the same
> > things happen that would have happened had it been called from the
> > startup process. That would be a really good thing to have tested
> > before posting your patches.
> >
>
> My bad, I am extremely sorry about that. I usually do test my patches,
> but somehow I failed to test this change due to manually testing the
> whole ASRO feature and hurrying in posting the newest version.
>
> I will try to be more careful next time.
>

I was too worried about how I could miss that & after thinking more
about that, I realized that the operation for ArchiveRecoveryRequested
is never going to be skipped in the startup process and that never
left for the checkpoint process to do that later. That is the reason
that assert was added there.

When ArchiveRecoveryRequested, the server will no longer be in
the wal prohibited mode, we implicitly change the state to
wal-permitted. Here is the snip from the 0003 patch:

@@ -6614,13 +6629,30 @@ StartupXLOG(void)
  (errmsg("starting archive recovery")));
  }

- /*
- * Take ownership of the wakeup latch if we're going to sleep during
- * recovery.
- */
  if (ArchiveRecoveryRequested)
+ {
+ /*
+ * Take ownership of the wakeup latch if we're going to sleep during
+ * recovery.
+ */
  OwnLatch(>recoveryWakeupLatch);

+ /*
+ * Since archive recovery is requested, we cannot be in a wal prohibited
+ * state.
+ */
+ if (ControlFile->wal_prohibited)
+ {
+ /* No need to hold ControlFileLock yet, we aren't up far enough */
+ ControlFile->wal_prohibited = false;
+ ControlFile->time = (pg_time_t) time(NULL);
+ UpdateControlFile();
+
+ ereport(LOG,
+ (errmsg("clearing WAL prohibition because the system is in archive
recovery")));
+ }
+ }
+


> > As far as EndOfLogTLI is concerned, there are, somewhat annoyingly,
> > several TLIs stored in XLogCtl. None of them seem to be precisely the
> > same thing as EndLogTLI, but I am hoping that replayEndTLI is close
> > enough. I found out pretty quickly through testing that replayEndTLI
> > isn't always valid -- it ends up 0 if we don't enter recovery. That's
> > not really a problem, though, because we only need it to be valid if
> > ArchiveRecoveryRequested. The code that initializes and updates it
> > seems to run whenever InRecovery = true, and ArchiveRecoveryRequested
> > = true will force InRecovery = true. So it looks to me like
> > replayEndTLI will always be initialized in the cases where we need a
> > value. It's not yet entirely clear to me if it has to have the same
> > value as EndOfLogTLI. I find this code comment quite mysterious:
> >
> > /*
> >  * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
> >  * the end-of-log. It could be different from the timeline that EndOfLog
> >  * nominally belongs to, if there was a timeline switch in that segment,
> >  * and we were reading the old WAL from a segment belonging to a higher
> >  * timeline.
> >  */
> > EndOfLogTLI = xlogreader->seg.ws_tli;
> >
> > The thing is, if we were reading old WAL from a segment belonging to a
> > higher timeline, wouldn't we have switched to that new timeline?
>
> AFAIUC, by browsing the code, yes, we are switching to the new
> timeline.  Along with lastReplayedTLI, lastReplayedEndRecPtr is also
> the same as the

Re: [Patch] ALTER SYSTEM READ ONLY

2021-07-28 Thread Amul Sul

On Wed, Jul 28, 2021 at 2:26 AM Robert Haas  wrote:
>
> On Fri, Jul 23, 2021 at 4:03 PM Robert Haas  wrote:
> > My 0003 is where I see some lingering problems. It creates
> > XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
> > need the xlogreader. But it doesn't really solve the problem of how
> > checkpointer.c would be able to call this function with proper
> > arguments. It is at least better in not needing two arguments to
> > decide what to do, but how is checkpointer.c supposed to know what to
> > pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
> > what to pass for EndOfLogTLI and EndOfLog?
>
> On further study, I found another problem: the way my patch set leaves
> things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which
> will not be correctly initialized in any process other than the
> startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog)
> would just be skipped. Your 0001 seems to have the same problem. You
> added Assert(AmStartupProcess()) to the inside of the if
> (ArchiveRecoveryRequested) block, but that doesn't fix anything.
> Outside the startup process, ArchiveRecoveryRequested will always be
> false, but the point is that the associated stuff should be done if
> ArchiveRecoveryRequested would have been true in the startup process.
> Both of our patch sets leave things in a state where that would never
> happen, which is not good. Unless I'm missing something, it seems like
> maybe you didn't test your patches to verify that, when the
> XLogAcceptWrites() call comes from the checkpointer, all the same
> things happen that would have happened had it been called from the
> startup process. That would be a really good thing to have tested
> before posting your patches.
>

My bad, I am extremely sorry about that. I usually do test my patches,
but somehow I failed to test this change due to manually testing the
whole ASRO feature and hurrying in posting the newest version.

I will try to be more careful next time.

> As far as EndOfLogTLI is concerned, there are, somewhat annoyingly,
> several TLIs stored in XLogCtl. None of them seem to be precisely the
> same thing as EndLogTLI, but I am hoping that replayEndTLI is close
> enough. I found out pretty quickly through testing that replayEndTLI
> isn't always valid -- it ends up 0 if we don't enter recovery. That's
> not really a problem, though, because we only need it to be valid if
> ArchiveRecoveryRequested. The code that initializes and updates it
> seems to run whenever InRecovery = true, and ArchiveRecoveryRequested
> = true will force InRecovery = true. So it looks to me like
> replayEndTLI will always be initialized in the cases where we need a
> value. It's not yet entirely clear to me if it has to have the same
> value as EndOfLogTLI. I find this code comment quite mysterious:
>
> /*
>  * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
>  * the end-of-log. It could be different from the timeline that EndOfLog
>  * nominally belongs to, if there was a timeline switch in that segment,
>  * and we were reading the old WAL from a segment belonging to a higher
>  * timeline.
>  */
> EndOfLogTLI = xlogreader->seg.ws_tli;
>
> The thing is, if we were reading old WAL from a segment belonging to a
> higher timeline, wouldn't we have switched to that new timeline?

AFAIUC, by browsing the code, yes, we are switching to the new
timeline.  Along with lastReplayedTLI, lastReplayedEndRecPtr is also
the same as the EndOfLog that we needed when ArchiveRecoveryRequested
is true.

I went through the original commit 7cbee7c0a1db and the thread[1] but
didn't find any related discussion for that.

> Suppose we want WAL segment 246 from TLI 1, but we don't have that
> segment on TLI 1, only TLI 2. Well, as far as I know, for us to use
> the TLI 2 version, we'd need to have TLI 2 in the history of the
> recovery_target_timeline. And if that is the case, then we would have
> to replay through the record where the timeline changes. And if we do
> that, then the discrepancy postulated by the comment cannot still
> exist by the time we reach this code, because this code is only
> reached after we finish WAL redo. So I'm baffled as to how this can
> happen, but considering how many cases there are in this code, I sure
> can't promise that it doesn't. The fact that we have few tests for any
> of this doesn't help either.

I am not an expert in this area, but will try to spend some more time
on understanding and testing.

1] postgr.es/m/555dd101.7080...@iki.fi

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2021-07-27 Thread Robert Haas

On Fri, Jul 23, 2021 at 4:03 PM Robert Haas  wrote:
> My 0003 is where I see some lingering problems. It creates
> XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
> need the xlogreader. But it doesn't really solve the problem of how
> checkpointer.c would be able to call this function with proper
> arguments. It is at least better in not needing two arguments to
> decide what to do, but how is checkpointer.c supposed to know what to
> pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
> what to pass for EndOfLogTLI and EndOfLog?

On further study, I found another problem: the way my patch set leaves
things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which
will not be correctly initialized in any process other than the
startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog)
would just be skipped. Your 0001 seems to have the same problem. You
added Assert(AmStartupProcess()) to the inside of the if
(ArchiveRecoveryRequested) block, but that doesn't fix anything.
Outside the startup process, ArchiveRecoveryRequested will always be
false, but the point is that the associated stuff should be done if
ArchiveRecoveryRequested would have been true in the startup process.
Both of our patch sets leave things in a state where that would never
happen, which is not good. Unless I'm missing something, it seems like
maybe you didn't test your patches to verify that, when the
XLogAcceptWrites() call comes from the checkpointer, all the same
things happen that would have happened had it been called from the
startup process. That would be a really good thing to have tested
before posting your patches.

As far as EndOfLogTLI is concerned, there are, somewhat annoyingly,
several TLIs stored in XLogCtl. None of them seem to be precisely the
same thing as EndLogTLI, but I am hoping that replayEndTLI is close
enough. I found out pretty quickly through testing that replayEndTLI
isn't always valid -- it ends up 0 if we don't enter recovery. That's
not really a problem, though, because we only need it to be valid if
ArchiveRecoveryRequested. The code that initializes and updates it
seems to run whenever InRecovery = true, and ArchiveRecoveryRequested
= true will force InRecovery = true. So it looks to me like
replayEndTLI will always be initialized in the cases where we need a
value. It's not yet entirely clear to me if it has to have the same
value as EndOfLogTLI. I find this code comment quite mysterious:

/*
 * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
 * the end-of-log. It could be different from the timeline that EndOfLog
 * nominally belongs to, if there was a timeline switch in that segment,
 * and we were reading the old WAL from a segment belonging to a higher
 * timeline.
 */
EndOfLogTLI = xlogreader->seg.ws_tli;

The thing is, if we were reading old WAL from a segment belonging to a
higher timeline, wouldn't we have switched to that new timeline?
Suppose we want WAL segment 246 from TLI 1, but we don't have that
segment on TLI 1, only TLI 2. Well, as far as I know, for us to use
the TLI 2 version, we'd need to have TLI 2 in the history of the
recovery_target_timeline. And if that is the case, then we would have
to replay through the record where the timeline changes. And if we do
that, then the discrepancy postulated by the comment cannot still
exist by the time we reach this code, because this code is only
reached after we finish WAL redo. So I'm baffled as to how this can
happen, but considering how many cases there are in this code, I sure
can't promise that it doesn't. The fact that we have few tests for any
of this doesn't help either.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-07-23 Thread Robert Haas

On Thu, Jun 17, 2021 at 1:23 AM Amul Sul  wrote:
> Attached is rebase for the latest master head.  Also, I added one more
> refactoring code that deduplicates the code setting database state in the
> control file. The same code set the database state is also needed for this
> feature.

I started studying 0001 today and found that it rearranged the order
of operations in StartupXLOG() more than I was expecting. It does, as
per previous discussions, move a bunch of things to the place where we
now call XLogParamters(). But, unsatisfyingly, InRecovery = false and
XLogReaderFree() then have to move down even further. Since the goal
here is to get to a situation where we sometimes XLogAcceptWrites()
after InRecovery = false, it didn't seem nice for this refactoring
patch to still end up with a situation where this stuff happens while
InRecovery = true. In fact, with the patch, the amount of code that
runs with InRecovery = true actually *increases*, which is not what I
think should be happening here. That's why the patch ends up having to
adjust SetMultiXactIdLimit to not Assert(!InRecovery).

And then I started to wonder how this was ever going to work as part
of the larger patch set, because as you have it here,
XLogAcceptWrites() takes arguments XLogReaderState *xlogreader,
XLogRecPtr EndOfLog, and TimeLineID EndOfLogTLI and if the
checkpointer is calling that at a later time after the user issues
pg_prohibit_wal(false), it's going to have none of those things. So I
had a quick look at that part of the code and found this in
checkpointer.c:

XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);

For those following along from home, the additional "true" is a bool
needChkpt argument added to XLogAcceptWrites() by 0003. Well, none of
this is very satisfying. The whole purpose of passing the xlogreader
is so we can figure out whether we need a checkpoint (never mind the
question of whether the existing algorithm for determining that is
really sensible) but now we need a second argument that basically
serves the same purpose since one of the two callers to this function
won't have an xlogreader. And then we're passing the EndOfLog and
EndOfLogTLI as dummy values which seems like it's probably just
totally wrong, but if for some reason it works correctly there sure
don't seem to be any comments explaining why.

So I started doing a bit of hacking myself and ended up with the
attached, which I think is not completely the right thing yet but I
think it's better than your version. I split this into three parts.
0001 splits up the logic that currently decides whether to write an
end-of-recovery record or a checkpoint record and if the latter how
the checkpoint ought to be performed into two functions.
DetermineRecoveryXlogAction() figures out what we want to do, and
PerformRecoveryXlogAction() does it. It also moves the code to run
recovery_end_command and related stuff into a new function
CleanupAfterArchiveRecovery(). 0002 then builds on this by postponing
UpdateFullPageWrites(), PerformRecoveryXLogAction(), and
CleanupAfterArchiveRecovery() to just before we
XLogReportParameters(). Because of the refactoring done by 0001, this
is only a small amount of code movement. Because of the separation
between DetermineRecoveryXlogAction() and PerformRecoveryXlogAction(),
the latter doesn't need the xlogreader. So we can do
DetermineRecoveryXlogAction() at the same time as now, while the
xlogreader is available, and then we don't need it later when we
PerformRecoveryXlogAction(), because we already know what we need to
know. I think this is all fine as far as it goes.

My 0003 is where I see some lingering problems. It creates
XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
need the xlogreader. But it doesn't really solve the problem of how
checkpointer.c would be able to call this function with proper
arguments. It is at least better in not needing two arguments to
decide what to do, but how is checkpointer.c supposed to know what to
pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
what to pass for EndOfLogTLI and EndOfLog? Actually, EndOfLog doesn't
seem too problematic, because that value has been stored in four (!)
places inside XLogCtl by this code:

LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;

XLogCtl->LogwrtResult = LogwrtResult;

XLogCtl->LogwrtRqst.Write = EndOfLog;
XLogCtl->LogwrtRqst.Flush = EndOfLog;

Presumably we could relatively easily change things around so that we
finish one of those values ... probably one of the "write" values ..
back out of XLogCtl instead of passing it as a parameter. That would
work just as well from the checkpointer as from the startup process,
and there seems to be no way for the value to change until after
XLogAcceptWrites() has been called, so it seems fine. But that doesn't
help for the other arguments. What I'm thinking is that we should just
arrange to store EndOfLogTLI and xlogaction into XLogCtl also, and
then

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-17 Thread Dilip Kumar

On Mon, May 17, 2021 at 11:48 AM Amul Sul  wrote:
>
> On Sat, May 15, 2021 at 3:12 PM Dilip Kumar  wrote:
> >
> > On Thu, May 13, 2021 at 2:56 PM Dilip Kumar  wrote:
> > >
> > > Great thanks.  I will review the remaining patch soon.
> >
> > I have reviewed v28-0003, and I have some comments on this.
> >
> > ===
> > @@ -126,9 +127,14 @@ XLogBeginInsert(void)
> >  Assert(mainrdata_last == (XLogRecData *) _head);
> >  Assert(mainrdata_len == 0);
> >
> > +/*
> > + * WAL permission must have checked before entering the critical 
> > section.
> > + * Otherwise, WAL prohibited error will force system panic.
> > + */
> > +Assert(walpermit_checked_state != WALPERMIT_UNCHECKED ||
> > !CritSectionCount);
> > +
> >  /* cross-check on whether we should be here or not */
> > -if (!XLogInsertAllowed())
> > -elog(ERROR, "cannot make new WAL entries during recovery");
> > +CheckWALPermitted();
> >
> > We must not call CheckWALPermitted inside the critical section,
> > instead if we are here we must be sure that
> > WAL is permitted, so better put an assert.  Even if that is ensured by
> > some other mean then also I don't
> > see any reason for calling this error generating function.
> >
>
> I understand that we should not have an error inside a critical section but
> this check is not wrong. Patch has enough checking so that errors due to WAL
> prohibited state must not hit in the critical section, see assert just before
> CheckWALPermitted().  Before entering into the critical section, we do have an
> explicit WAL prohibited check. And to make sure that check has been done for
> all current critical section for the wal writes, we have aforesaid assert
> checking, for more detail on this please have a look at the "WAL prohibited
> system state" section of src/backend/access/transam/README added in 0004 
> patch.
> This assertion also ensures that future development does not miss the WAL
> prohibited state check before entering into a newly added critical section for
> WAL writes.

I think we need CheckWALPermitted(); check, in XLogBeginInsert()
function because if XLogBeginInsert() maybe called outside critical
section e.g. pg_truncate_visibility_map() then we should error out.
So this check make sense to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-15 Thread Dilip Kumar

On Thu, May 13, 2021 at 2:56 PM Dilip Kumar  wrote:
>
> Great thanks.  I will review the remaining patch soon.

I have reviewed v28-0003, and I have some comments on this.

===
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
 Assert(mainrdata_last == (XLogRecData *) _head);
 Assert(mainrdata_len == 0);

+/*
+ * WAL permission must have checked before entering the critical section.
+ * Otherwise, WAL prohibited error will force system panic.
+ */
+Assert(walpermit_checked_state != WALPERMIT_UNCHECKED ||
!CritSectionCount);
+
 /* cross-check on whether we should be here or not */
-if (!XLogInsertAllowed())
-elog(ERROR, "cannot make new WAL entries during recovery");
+CheckWALPermitted();

We must not call CheckWALPermitted inside the critical section,
instead if we are here we must be sure that
WAL is permitted, so better put an assert.  Even if that is ensured by
some other mean then also I don't
see any reason for calling this error generating function.

===

+CheckWALPermitted(void)
+{
+if (!XLogInsertAllowed())
+ereport(ERROR,
+(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+ errmsg("system is now read only")));
+

system is now read only ->  wal is prohibited (in error message)

===

- * We can't write WAL in recovery mode, so there's no point trying to
+ * We can't write WAL during read-only mode, so there's no point trying to

during read-only mode -> if WAL is prohibited or WAL recovery in
progress (add recovery in progress and also modify read-only to wal
prohibited)

===

+if (!XLogInsertAllowed())
 {
 GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-GUC_check_errmsg("cannot set transaction read-write mode
during recovery");
+GUC_check_errmsg("cannot set transaction read-write mode
while system is read only");
 return false;
 }

system is read only -> WAL is prohibited

===

I think that's all, I have to say about 0003.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-13 Thread Dilip Kumar

On Thu, May 13, 2021 at 2:54 PM Amul Sul  wrote:
>
> On Thu, May 13, 2021 at 12:36 PM Dilip Kumar  wrote:
> >
> > On Wed, May 12, 2021 at 5:55 PM Amul Sul  wrote:
> > >
> >
> > Thanks for the updated patch, while going through I noticed this comment.
> >
> > + /*
> > + * WAL prohibit state changes not allowed during recovery except the crash
> > + * recovery case.
> > + */
> > + PreventCommandDuringRecovery("pg_prohibit_wal()");
> >
> > Why do we need to allow state change during recovery?  Do you still
> > need it after the latest changes you discussed here, I mean now
> > XLogAcceptWrites() being called before sending barrier to backends.
> > So now we are not afraid that the backend will write WAL before we
> > call XLogAcceptWrites().  So now IMHO, we don't need to keep the
> > system in recovery until pg_prohibit_wal(false) is called, right?
> >
>
> Your understanding is correct, and the previous patch also does the same, but
> the code comment is wrong.  Fixed in the attached version, also rebased for 
> the
> latest master head. Sorry for the confusion.

Great thanks.  I will review the remaining patch soon.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-13 Thread Dilip Kumar

On Wed, May 12, 2021 at 5:55 PM Amul Sul  wrote:
>

Thanks for the updated patch, while going through I noticed this comment.

+ /*
+ * WAL prohibit state changes not allowed during recovery except the crash
+ * recovery case.
+ */
+ PreventCommandDuringRecovery("pg_prohibit_wal()");

Why do we need to allow state change during recovery?  Do you still
need it after the latest changes you discussed here, I mean now
XLogAcceptWrites() being called before sending barrier to backends.
So now we are not afraid that the backend will write WAL before we
call XLogAcceptWrites().  So now IMHO, we don't need to keep the
system in recovery until pg_prohibit_wal(false) is called, right?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-13 Thread Dilip Kumar

On Thu, May 13, 2021 at 2:26 AM Robert Haas  wrote:
>
> On Wed, May 12, 2021 at 1:39 AM Dilip Kumar  wrote:
> > Your idea makes sense, but IMHO, if we are first writing
> > XLogAcceptWrites() and then pushing out the barrier, then I don't
> > understand the meaning of having state #4.  I mean whenever any
> > backend receives the barrier the system will always be in state #5.
> > So what do we want to do with state #4?
>
> Well, if you don't have that, how does the checkpointer know that it's
> supposed to push out the barrier?
>
> You and Amul both seem to want to merge states #4 and #5. But how to
> make that work? Basically what you are both saying is that, after we
> move into the "going read-write" state, backends aren't immediately
> told that they can write WAL, but have to keep checking back. But this
> could be expensive. If you have one state that means that the
> checkpointer has been requested to run XLogAcceptWrites() and push out
> a barrier, and another state to mean that it has done so, then you
> avoid that. Maybe that overhead wouldn't be large anyway, but it seems
> like it's only necessary because you're trying to merge two states
> which, from a logical point of view, are separate.

I don't have an objection to having 5 states, just wanted to
understand your reasoning.  So it makes sense to me.  Thanks.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-12 Thread Robert Haas

On Wed, May 12, 2021 at 1:39 AM Dilip Kumar  wrote:
> Your idea makes sense, but IMHO, if we are first writing
> XLogAcceptWrites() and then pushing out the barrier, then I don't
> understand the meaning of having state #4.  I mean whenever any
> backend receives the barrier the system will always be in state #5.
> So what do we want to do with state #4?

Well, if you don't have that, how does the checkpointer know that it's
supposed to push out the barrier?

You and Amul both seem to want to merge states #4 and #5. But how to
make that work? Basically what you are both saying is that, after we
move into the "going read-write" state, backends aren't immediately
told that they can write WAL, but have to keep checking back. But this
could be expensive. If you have one state that means that the
checkpointer has been requested to run XLogAcceptWrites() and push out
a barrier, and another state to mean that it has done so, then you
avoid that. Maybe that overhead wouldn't be large anyway, but it seems
like it's only necessary because you're trying to merge two states
which, from a logical point of view, are separate.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-11 Thread Dilip Kumar

On Tue, May 11, 2021 at 11:54 PM Robert Haas  wrote:
>
> On Tue, May 11, 2021 at 11:17 AM Amul Sul  wrote:
> > I think I have much easier solution than this, will post that with update 
> > version patch set tomorrow.
>
> I don't know what you have in mind, but based on this discussion, it
> seems to me that we should just have 5 states instead of 4:
>
> 1. WAL is permitted.
> 2. WAL is being prohibited but some backends may not know about the change 
> yet.
> 3. WAL is prohibited.
> 4. WAL is in the process of being permitted but XLogAcceptWrites() may
> not have been called yet.
> 5. WAL is in the process of being permitted and XLogAcceptWrites() has
> been called but some backends may not know about the change yet.
>
> If we're in state #3 and someone does pg_prohibit_wal(false) then we
> enter state #4. The checkpointer calls XLogAcceptWrites(), moves us to
> state #5, and pushes out a barrier. Then it waits for the barrier to
> be absorbed and, when it has been, it moves us to state #1. Then if
> someone does pg_prohibit_wal(true) we move to state #2. The
> checkpointer pushes out a barrier and waits for it to be absorbed.
> Then it calls XLogFlush() and afterward moves us to state #3.
>
> We can have any (reasonable) number of states that we want. There's
> nothing magical about 4.

Your idea makes sense, but IMHO, if we are first writing
XLogAcceptWrites() and then pushing out the barrier, then I don't
understand the meaning of having state #4.  I mean whenever any
backend receives the barrier the system will always be in state #5.
So what do we want to do with state #4?

Is it just to make the state machine better?  I mean in the checkpoint
process, we don't need separate "if checks" whether the
XLogAcceptWrites() is called or not, instead we can just rely on the
state, if it is #4 then we have to call XLogAcceptWrites().  If so
then I think it's okay to have an additional state, just wanted to
know what idea you had in mind?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-11 Thread Robert Haas

On Tue, May 11, 2021 at 11:17 AM Amul Sul  wrote:
> I think I have much easier solution than this, will post that with update 
> version patch set tomorrow.

I don't know what you have in mind, but based on this discussion, it
seems to me that we should just have 5 states instead of 4:

1. WAL is permitted.
2. WAL is being prohibited but some backends may not know about the change yet.
3. WAL is prohibited.
4. WAL is in the process of being permitted but XLogAcceptWrites() may
not have been called yet.
5. WAL is in the process of being permitted and XLogAcceptWrites() has
been called but some backends may not know about the change yet.

If we're in state #3 and someone does pg_prohibit_wal(false) then we
enter state #4. The checkpointer calls XLogAcceptWrites(), moves us to
state #5, and pushes out a barrier. Then it waits for the barrier to
be absorbed and, when it has been, it moves us to state #1. Then if
someone does pg_prohibit_wal(true) we move to state #2. The
checkpointer pushes out a barrier and waits for it to be absorbed.
Then it calls XLogFlush() and afterward moves us to state #3.

We can have any (reasonable) number of states that we want. There's
nothing magical about 4.

I also entirely agree with Dilip that we should do some renaming to
get rid of the read-write/read-only terminology, now that this is no
longer part of the syntax. In fact I made the exact same point in my
last review. The WALPROHIBIT_STATE_* constants are just one thing of
many that needs to be included in that renaming.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-11 Thread Amul Sul

On Tue, 11 May 2021 at 7:50 PM, Dilip Kumar  wrote:

> On Tue, May 11, 2021 at 6:56 PM Amul Sul  wrote:
> >
> > On Tue, May 11, 2021 at 6:48 PM Dilip Kumar 
> wrote:
> > >
> > > On Tue, May 11, 2021 at 4:50 PM Amul Sul  wrote:
> > > >
> > > > On Tue, May 11, 2021 at 4:13 PM Dilip Kumar 
> wrote:
> > > > > I might be missing something, but assume the behavior should be
> like this
> > > > >
> > > > > 1. If the state is getting changed from
> WALPROHIBIT_STATE_READ_WRITE
> > > > > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
> > > > > the barrier, we can immediately abort any read-write
> transaction(and
> > > > > stop allowing WAL writing), because once we ensure that all session
> > > > > has responded that now they have no read-write transaction then we
> can
> > > > > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
> > > > > WALPROHIBIT_STATE_READ_ONLY.
> > > > >
> > > >
> > > > Yes, that's what the current patch is doing from the first patch
> version.
> > > >
> > > > > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
> > > > > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the
> backend
> > > > > to consider the system as read-write, instead, we should wait until
> > > > > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.
> > > > >
> > > >
> > > > I am sure that only not enough will have the same issue where
> > > > LocalXLogInsertAllowed gets set the same as the read-only as
> described in
> > > > my previous reply.
> > >
> > > Okay, but while browsing the code I do not see any direct if condition
> > > based on the "LocalXLogInsertAllowed" variable, can you point me to
> > > some references?
> > > I only see one if check on this variable and that is in
> > > XLogInsertAllowed() function, but now in XLogInsertAllowed() function,
> > > you are already checking IsWALProhibited.  No?
> > >
> >
> > I am not sure I understood this. Where am I checking IsWALProhibited()?
> >
> > IsWALProhibited() is called by XLogInsertAllowed() once when
> > LocalXLogInsertAllowed is in a reset state, and that result will be
> > cached in LocalXLogInsertAllowed and will be used in the subsequent
> > XLogInsertAllowed() call.
>
> Okay, got what you were trying to say.  But that can be easily
> fixable, I mean if the state is WALPROHIBIT_STATE_GOING_READ_WRITE
> then what we can do is don't allow to write the WAL but let's not set
> the LocalXLogInsertAllowed to 0.  So until we are in the intermediate
> state WALPROHIBIT_STATE_GOING_READ_WRITE, we will always have to rely
> on GetWALProhibitState(), I know this will add a performance penalty
> but this is for the short period until we are in the intermediate
> state.  After that as soon as it will set to
> WALPROHIBIT_STATE_READ_WRITE then the XLogInsertAllowed() will set
> LocalXLogInsertAllowed to 1.


I think I have much easier solution than this, will post that with update
version patch set tomorrow.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-11 Thread Dilip Kumar

On Tue, May 11, 2021 at 6:56 PM Amul Sul  wrote:
>
> On Tue, May 11, 2021 at 6:48 PM Dilip Kumar  wrote:
> >
> > On Tue, May 11, 2021 at 4:50 PM Amul Sul  wrote:
> > >
> > > On Tue, May 11, 2021 at 4:13 PM Dilip Kumar  wrote:
> > > > I might be missing something, but assume the behavior should be like 
> > > > this
> > > >
> > > > 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
> > > > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
> > > > the barrier, we can immediately abort any read-write transaction(and
> > > > stop allowing WAL writing), because once we ensure that all session
> > > > has responded that now they have no read-write transaction then we can
> > > > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
> > > > WALPROHIBIT_STATE_READ_ONLY.
> > > >
> > >
> > > Yes, that's what the current patch is doing from the first patch version.
> > >
> > > > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
> > > > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
> > > > to consider the system as read-write, instead, we should wait until
> > > > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.
> > > >
> > >
> > > I am sure that only not enough will have the same issue where
> > > LocalXLogInsertAllowed gets set the same as the read-only as described in
> > > my previous reply.
> >
> > Okay, but while browsing the code I do not see any direct if condition
> > based on the "LocalXLogInsertAllowed" variable, can you point me to
> > some references?
> > I only see one if check on this variable and that is in
> > XLogInsertAllowed() function, but now in XLogInsertAllowed() function,
> > you are already checking IsWALProhibited.  No?
> >
>
> I am not sure I understood this. Where am I checking IsWALProhibited()?
>
> IsWALProhibited() is called by XLogInsertAllowed() once when
> LocalXLogInsertAllowed is in a reset state, and that result will be
> cached in LocalXLogInsertAllowed and will be used in the subsequent
> XLogInsertAllowed() call.

Okay, got what you were trying to say.  But that can be easily
fixable, I mean if the state is WALPROHIBIT_STATE_GOING_READ_WRITE
then what we can do is don't allow to write the WAL but let's not set
the LocalXLogInsertAllowed to 0.  So until we are in the intermediate
state WALPROHIBIT_STATE_GOING_READ_WRITE, we will always have to rely
on GetWALProhibitState(), I know this will add a performance penalty
but this is for the short period until we are in the intermediate
state.  After that as soon as it will set to
WALPROHIBIT_STATE_READ_WRITE then the XLogInsertAllowed() will set
LocalXLogInsertAllowed to 1.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-11 Thread Amul Sul

On Tue, May 11, 2021 at 6:48 PM Dilip Kumar  wrote:
>
> On Tue, May 11, 2021 at 4:50 PM Amul Sul  wrote:
> >
> > On Tue, May 11, 2021 at 4:13 PM Dilip Kumar  wrote:
> > > I might be missing something, but assume the behavior should be like this
> > >
> > > 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
> > > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
> > > the barrier, we can immediately abort any read-write transaction(and
> > > stop allowing WAL writing), because once we ensure that all session
> > > has responded that now they have no read-write transaction then we can
> > > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
> > > WALPROHIBIT_STATE_READ_ONLY.
> > >
> >
> > Yes, that's what the current patch is doing from the first patch version.
> >
> > > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
> > > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
> > > to consider the system as read-write, instead, we should wait until
> > > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.
> > >
> >
> > I am sure that only not enough will have the same issue where
> > LocalXLogInsertAllowed gets set the same as the read-only as described in
> > my previous reply.
>
> Okay, but while browsing the code I do not see any direct if condition
> based on the "LocalXLogInsertAllowed" variable, can you point me to
> some references?
> I only see one if check on this variable and that is in
> XLogInsertAllowed() function, but now in XLogInsertAllowed() function,
> you are already checking IsWALProhibited.  No?
>

I am not sure I understood this. Where am I checking IsWALProhibited()?

IsWALProhibited() is called by XLogInsertAllowed() once when
LocalXLogInsertAllowed is in a reset state, and that result will be
cached in LocalXLogInsertAllowed and will be used in the subsequent
XLogInsertAllowed() call.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-11 Thread Dilip Kumar

On Tue, May 11, 2021 at 4:50 PM Amul Sul  wrote:
>
> On Tue, May 11, 2021 at 4:13 PM Dilip Kumar  wrote:
> > I might be missing something, but assume the behavior should be like this
> >
> > 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
> > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
> > the barrier, we can immediately abort any read-write transaction(and
> > stop allowing WAL writing), because once we ensure that all session
> > has responded that now they have no read-write transaction then we can
> > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
> > WALPROHIBIT_STATE_READ_ONLY.
> >
>
> Yes, that's what the current patch is doing from the first patch version.
>
> > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
> > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
> > to consider the system as read-write, instead, we should wait until
> > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.
> >
>
> I am sure that only not enough will have the same issue where
> LocalXLogInsertAllowed gets set the same as the read-only as described in
> my previous reply.

Okay, but while browsing the code I do not see any direct if condition
based on the "LocalXLogInsertAllowed" variable, can you point me to
some references?
I only see one if check on this variable and that is in
XLogInsertAllowed() function, but now in XLogInsertAllowed() function,
you are already checking IsWALProhibited.  No?


> > Other than this point, I think the state names READ_ONLY, READ_WRITE
> > are a bit confusing no? because actually, these states represent
> > whether WAL is allowed or not, but READ_ONLY, READ_WRITE seems like we
> > are putting the system under a Read-only state.  For example, if you
> > are doing some write operation on an unlogged table will be allowed, I
> > guess because that will not generate the WAL until you commit (because
> > commit generates WAL) right? so practically, we are just blocking the
> > WAL, not the write operation.
> >
>
> This read-only and read-write are the wal prohibited states though we
> are using for read-only/read-write system in the discussion and the
> complete macro name is WALPROHIBIT_STATE_READ_ONLY and
> WALPROHIBIT_STATE_READ_WRITE, I am not sure why that would make
> implementation confusing.

Fine, I am not too particular about these names.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-11 Thread Amul Sul

On Tue, May 11, 2021 at 4:13 PM Dilip Kumar  wrote:
>
> On Tue, May 11, 2021 at 3:38 PM Amul Sul  wrote:
> >
> > On Tue, May 11, 2021 at 2:26 PM Dilip Kumar  wrote:
> > >
> > > On Tue, May 11, 2021 at 2:16 PM Amul Sul  wrote:
> > >
> > > > I get why you think that, I wasn't very precise in briefing the problem.
> > > >
> > > > Any new backend that gets connected right after the shared memory
> > > > state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by
> > > > default allowed to do the WAL writes.  Such backends can perform write
> > > > operation before the checkpointer does the XLogAcceptWrites().
> > >
> > > Okay, make sense now. But my next question is why do we allow backends
> > > to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we
> > > wait until the shared memory state is changed to
> > > WALPROHIBIT_STATE_READ_WRITE?
> > >
> >
> > Ok, good question.
> >
> > Now let's first try to understand the Checkpointer's work.
> >
> > When Checkpointer sees the wal prohibited state is an in-progress state, 
> > then
> > it first emits the global barrier and waits until all backers absorb that.
> > After that it set the final requested WAL prohibit state.
> >
> > When other backends absorb those barriers then appropriate action is taken
> > (e.g. abort the read-write transaction if moving to read-only) by them. 
> > Also,
> > LocalXLogInsertAllowed flags get reset in it and that backend needs to call
> > XLogInsertAllowed() to get the right value for it, which further decides WAL
> > writes permitted or prohibited.
> >
> > Consider an example that the system is trying to change to read-write and 
> > for
> > that wal prohibited state is set to WALPROHIBIT_STATE_GOING_READ_WRITE 
> > before
> > Checkpointer starts its work.  If we want to treat that system as read-only 
> > for
> > the WALPROHIBIT_STATE_GOING_READ_WRITE state as well. Then we might need to
> > think about the behavior of the backend that has absorbed the barrier and 
> > reset
> > the LocalXLogInsertAllowed flag.  That backend eventually going to call
> > XLogInsertAllowed() to get the actual value for it and by seeing the current
> > state as WALPROHIBIT_STATE_GOING_READ_WRITE, it will set 
> > LocalXLogInsertAllowed
> > again same as it was before for the read-only state.
>
> I might be missing something, but assume the behavior should be like this
>
> 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
> -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
> the barrier, we can immediately abort any read-write transaction(and
> stop allowing WAL writing), because once we ensure that all session
> has responded that now they have no read-write transaction then we can
> safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
> WALPROHIBIT_STATE_READ_ONLY.
>

Yes, that's what the current patch is doing from the first patch version.

> 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
> WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
> to consider the system as read-write, instead, we should wait until
> the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.
>

I am sure that only not enough will have the same issue where
LocalXLogInsertAllowed gets set the same as the read-only as described in
my previous reply.

> So your problem is that on receiving the barrier we need to call
> LocalXLogInsertAllowed() from the backend, but how does that matter?
> you can still make IsWALProhibited() return true.
>

Note that LocalXLogInsertAllowed is a local flag for a backend, not a
function, and in the server code at every place, we don't rely on
IsWALProhibited() instead we do rely on LocalXLogInsertAllowed
flags before wal writes and that check made via XLogInsertAllowed().

> I don't know the complete code so I might be missing something but at
> least that is what I would expect from the design POV.
>
>
> Other than this point, I think the state names READ_ONLY, READ_WRITE
> are a bit confusing no? because actually, these states represent
> whether WAL is allowed or not, but READ_ONLY, READ_WRITE seems like we
> are putting the system under a Read-only state.  For example, if you
> are doing some write operation on an unlogged table will be allowed, I
> guess because that will not generate the WAL until you commit (because
> commit generates WAL) right? so practically, we are just blocking the
> WAL, not the write operation.
>

This read-only and read-write are the wal prohibited states though we
are using for read-only/read-write system in the discussion and the
complete macro name is WALPROHIBIT_STATE_READ_ONLY and
WALPROHIBIT_STATE_READ_WRITE, I am not sure why that would make
implementation confusing.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-11 Thread Dilip Kumar

On Tue, May 11, 2021 at 3:38 PM Amul Sul  wrote:
>
> On Tue, May 11, 2021 at 2:26 PM Dilip Kumar  wrote:
> >
> > On Tue, May 11, 2021 at 2:16 PM Amul Sul  wrote:
> >
> > > I get why you think that, I wasn't very precise in briefing the problem.
> > >
> > > Any new backend that gets connected right after the shared memory
> > > state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by
> > > default allowed to do the WAL writes.  Such backends can perform write
> > > operation before the checkpointer does the XLogAcceptWrites().
> >
> > Okay, make sense now. But my next question is why do we allow backends
> > to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we
> > wait until the shared memory state is changed to
> > WALPROHIBIT_STATE_READ_WRITE?
> >
>
> Ok, good question.
>
> Now let's first try to understand the Checkpointer's work.
>
> When Checkpointer sees the wal prohibited state is an in-progress state, then
> it first emits the global barrier and waits until all backers absorb that.
> After that it set the final requested WAL prohibit state.
>
> When other backends absorb those barriers then appropriate action is taken
> (e.g. abort the read-write transaction if moving to read-only) by them. Also,
> LocalXLogInsertAllowed flags get reset in it and that backend needs to call
> XLogInsertAllowed() to get the right value for it, which further decides WAL
> writes permitted or prohibited.
>
> Consider an example that the system is trying to change to read-write and for
> that wal prohibited state is set to WALPROHIBIT_STATE_GOING_READ_WRITE before
> Checkpointer starts its work.  If we want to treat that system as read-only 
> for
> the WALPROHIBIT_STATE_GOING_READ_WRITE state as well. Then we might need to
> think about the behavior of the backend that has absorbed the barrier and 
> reset
> the LocalXLogInsertAllowed flag.  That backend eventually going to call
> XLogInsertAllowed() to get the actual value for it and by seeing the current
> state as WALPROHIBIT_STATE_GOING_READ_WRITE, it will set 
> LocalXLogInsertAllowed
> again same as it was before for the read-only state.

I might be missing something, but assume the behavior should be like this

1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
-> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
the barrier, we can immediately abort any read-write transaction(and
stop allowing WAL writing), because once we ensure that all session
has responded that now they have no read-write transaction then we can
safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
WALPROHIBIT_STATE_READ_ONLY.

2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
to consider the system as read-write, instead, we should wait until
the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.

So your problem is that on receiving the barrier we need to call
LocalXLogInsertAllowed() from the backend, but how does that matter?
you can still make IsWALProhibited() return true.

I don't know the complete code so I might be missing something but at
least that is what I would expect from the design POV.


Other than this point, I think the state names READ_ONLY, READ_WRITE
are a bit confusing no? because actually, these states represent
whether WAL is allowed or not, but READ_ONLY, READ_WRITE seems like we
are putting the system under a Read-only state.  For example, if you
are doing some write operation on an unlogged table will be allowed, I
guess because that will not generate the WAL until you commit (because
commit generates WAL) right? so practically, we are just blocking the
WAL, not the write operation.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-11 Thread Amul Sul

On Tue, May 11, 2021 at 2:26 PM Dilip Kumar  wrote:
>
> On Tue, May 11, 2021 at 2:16 PM Amul Sul  wrote:
>
> > I get why you think that, I wasn't very precise in briefing the problem.
> >
> > Any new backend that gets connected right after the shared memory
> > state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by
> > default allowed to do the WAL writes.  Such backends can perform write
> > operation before the checkpointer does the XLogAcceptWrites().
>
> Okay, make sense now. But my next question is why do we allow backends
> to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we
> wait until the shared memory state is changed to
> WALPROHIBIT_STATE_READ_WRITE?
>

Ok, good question.

Now let's first try to understand the Checkpointer's work.

When Checkpointer sees the wal prohibited state is an in-progress state, then
it first emits the global barrier and waits until all backers absorb that.
After that it set the final requested WAL prohibit state.

When other backends absorb those barriers then appropriate action is taken
(e.g. abort the read-write transaction if moving to read-only) by them. Also,
LocalXLogInsertAllowed flags get reset in it and that backend needs to call
XLogInsertAllowed() to get the right value for it, which further decides WAL
writes permitted or prohibited.

Consider an example that the system is trying to change to read-write and for
that wal prohibited state is set to WALPROHIBIT_STATE_GOING_READ_WRITE before
Checkpointer starts its work.  If we want to treat that system as read-only for
the WALPROHIBIT_STATE_GOING_READ_WRITE state as well. Then we might need to
think about the behavior of the backend that has absorbed the barrier and reset
the LocalXLogInsertAllowed flag.  That backend eventually going to call
XLogInsertAllowed() to get the actual value for it and by seeing the current
state as WALPROHIBIT_STATE_GOING_READ_WRITE, it will set LocalXLogInsertAllowed
again same as it was before for the read-only state.

Now the question is when this value should get reset again so that backend can
be read-write? We are done with a barrier and that backend never going to come
back to read-write again.

One solution, I think, is to set the final state before emitting the barrier
but as per the current design that should get set after all barrier processing.
Let's see what Robert says on this.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-11 Thread Dilip Kumar

On Tue, May 11, 2021 at 2:16 PM Amul Sul  wrote:

> I get why you think that, I wasn't very precise in briefing the problem.
>
> Any new backend that gets connected right after the shared memory
> state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by
> default allowed to do the WAL writes.  Such backends can perform write
> operation before the checkpointer does the XLogAcceptWrites().

Okay, make sense now. But my next question is why do we allow backends
to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we
wait until the shared memory state is changed to
WALPROHIBIT_STATE_READ_WRITE?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-11 Thread Amul Sul

On Tue, May 11, 2021 at 11:33 AM Dilip Kumar  wrote:
>
> On Mon, May 10, 2021 at 10:25 PM Amul Sul  wrote:
> >
> > Yes, we don't want any write slip in before UpdateFullPageWrites().
> > Recently[1], we have decided to let the Checkpointed process call
> > XLogAcceptWrites() unconditionally.
> >
> > Here problem is that when a backend executes the
> > pg_prohibit_wal(false) function to make the system read-write, the wal
> > prohibited state is set to inprogress(ie.
> > WALPROHIBIT_STATE_GOING_READ_WRITE) and then Checkpointer is signaled.
> > Next, Checkpointer will convey this system change to all existing
> > backends using a global barrier, and after that final wal prohibited
> > state is set to the read-write(i.e. WALPROHIBIT_STATE_READ_WRITE).
> > While Checkpointer is in the progress of conveying this global
> > barrier,  any new backend can connect at that time and can write a new
> > record because the inprogress read-write state is equivalent to the
> > final read-write state iff LocalXLogInsertAllowed != 0 for that
> > backend.  And, that new record could slip in before or in between
> > records to be written by XLogAcceptWrites().
> >
> > 1] 
> > http://postgr.es/m/CA+TgmoZYQN=rcye-ixwnjdvmaoh+7jaqsif3u2k8xqxipba...@mail.gmail.com
>
> But, IIUC, once the state is set to WALPROHIBIT_STATE_GOING_READ_WRITE
> and signaled to the checkpointer.  The checkpointer should first call
> XLogAcceptWrites and then it should inform other backends through the
> global barrier?  Are we worried that if we have written the WAL in
> XLogAcceptWrites but later if we could not set the state to
> WALPROHIBIT_STATE_READ_WRITE?  Then maybe we can inform all the
> backend first but before setting the state to
> WALPROHIBIT_STATE_READ_WRITE, we can call XLogAcceptWrites?
>

I get why you think that, I wasn't very precise in briefing the problem.

Any new backend that gets connected right after the shared memory
state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by
default allowed to do the WAL writes.  Such backends can perform write
operation before the checkpointer does the XLogAcceptWrites(). Also,
possible that a backend could connect at the same time checkpointer
performing XLogAcceptWrites() and can write a wal.

So, having XLogAcceptWrites() before does not really solve my concern.
Note that the previous patch XLogAcceptWrites() does get called before
global barrier emission.

Please let me know if it is not yet cleared to you, thanks.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-11 Thread Dilip Kumar

On Mon, May 10, 2021 at 10:25 PM Amul Sul  wrote:
>
> Yes, we don't want any write slip in before UpdateFullPageWrites().
> Recently[1], we have decided to let the Checkpointed process call
> XLogAcceptWrites() unconditionally.
>
> Here problem is that when a backend executes the
> pg_prohibit_wal(false) function to make the system read-write, the wal
> prohibited state is set to inprogress(ie.
> WALPROHIBIT_STATE_GOING_READ_WRITE) and then Checkpointer is signaled.
> Next, Checkpointer will convey this system change to all existing
> backends using a global barrier, and after that final wal prohibited
> state is set to the read-write(i.e. WALPROHIBIT_STATE_READ_WRITE).
> While Checkpointer is in the progress of conveying this global
> barrier,  any new backend can connect at that time and can write a new
> record because the inprogress read-write state is equivalent to the
> final read-write state iff LocalXLogInsertAllowed != 0 for that
> backend.  And, that new record could slip in before or in between
> records to be written by XLogAcceptWrites().
>
> 1] 
> http://postgr.es/m/CA+TgmoZYQN=rcye-ixwnjdvmaoh+7jaqsif3u2k8xqxipba...@mail.gmail.com

But, IIUC, once the state is set to WALPROHIBIT_STATE_GOING_READ_WRITE
and signaled to the checkpointer.  The checkpointer should first call
XLogAcceptWrites and then it should inform other backends through the
global barrier?  Are we worried that if we have written the WAL in
XLogAcceptWrites but later if we could not set the state to
WALPROHIBIT_STATE_READ_WRITE?  Then maybe we can inform all the
backend first but before setting the state to
WALPROHIBIT_STATE_READ_WRITE, we can call XLogAcceptWrites?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-10 Thread Amul Sul

On Mon, May 10, 2021 at 9:21 PM Robert Haas  wrote:
>
> On Sun, May 9, 2021 at 1:26 AM Amul Sul  wrote:
> > The state in the control file also gets cleared. Though, after
> > clearing in memory the state patch doesn't really do the immediate
> > change to the control file, it relies on the next UpdateControlFile()
> > to do that.
>
> But when will that happen? If you're relying on some very nearby code,
> that might be OK, but perhaps a comment is in order. If you're just
> thinking it's going to happen eventually, I think that's not good
> enough.
>

Ok.

> > Regarding log message I think I have skipped that intentionally, to
> > avoid confusing log as "system is now read write" when we do start as
> > hot-standby which is not really read-write.
>
> I think the message should not be phrased that way. In fact, I think
> now that we've moved to calling this pg_prohibit_wal() rather than
> ALTER SYSTEM READ ONLY, a lot of messages need to be rethought, and
> maybe some comments and function names as well. Perhaps something
> like:
>
> system is read only -> WAL is now prohibited
> system is read write -> WAL is no longer prohibited
>
> And then for this particular case, maybe something like:
>
> clearing WAL prohibition because the system is in archive recovery
>

Ok, thanks for the suggestions.

> > > The second part of this proposal was:
> > >
> > > "2. Create a new function with a name like XLogAcceptWrites(). Move the
> > > following things from StartupXLOG() into that function: (1) the call
> > > to UpdateFullPageWrites(), (2) the following block of code that does
> > > either CreateEndOfRecoveryRecord() or RequestCheckpoint() or
> > > CreateCheckPoint(), (3) the next block of code that runs
> > > recovery_end_command, (4) the call to XLogReportParameters(), and (5)
> > > the call to CompleteCommitTsInitialization(). Call the new function
> > > from the place where we now call XLogReportParameters(). This would
> > > mean that (1)-(3) happen later than they do now, which might require
> > > some adjustments."
> > >
> > > Now you moved that code, but you also moved (6)
> > > CompleteCommitTsInitialization(), (7) setting the control file to
> > > DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and
> > > (9) requesting a checkpoint if we were just promoted. That's not what
> > > was proposed. One result of this is that the server now thinks it's in
> > > recovery even after the startup process has exited.
> > > RecoveryInProgress() is still returning true everywhere. But that is
> > > inconsistent with what Andres and I were recommending in
> > > http://postgr.es/m/CA+TgmoZYQN=rcye-ixwnjdvmaoh+7jaqsif3u2k8xqxipba...@mail.gmail.com
> >
> > Regarding modified approach, I tried to explain that why I did
> > this in 
> > http://postgr.es/m/CAAJ_b96Yb4jaW6oU1bVYEBaf=tq-ql+mmt1exfwvnzer7xr...@mail.gmail.com
>
> I am not able to understand what problem you are seeing there. If
> we're in crash recovery, then nobody can connect to the database, so
> there can't be any concurrent activity. If we're in archive recovery,
> we now clear the WAL-is-prohibited flag so that we will go read-write
> directly at the end of recovery. We can and should refuse any effort
> to call pg_prohibit_wal() during recovery. If we reached the end of
> crash recovery and are now permitting read-only connections, why would
> anyone be able to write WAL before the system has been changed to
> read-write? If that can happen, it's a bug, not a reason to change the
> design.
>
> Maybe your concern here is about ordering: the process that is going
> to run XLogAcceptWrites() needs to allow xlog writes locally before we
> tell other backends that they also can xlog writes; otherwise, some
> other records could slip in before UpdateFullPageWrites() and similar
> have run, which we probably don't want. But that's why
> LocalSetXLogInsertAllowed() was invented, and if it doesn't quite do
> what we need in this situation, we should be able to tweak it so it
> does.
>

Yes, we don't want any write slip in before UpdateFullPageWrites().
Recently[1], we have decided to let the Checkpointed process call
XLogAcceptWrites() unconditionally.

Here problem is that when a backend executes the
pg_prohibit_wal(false) function to make the system read-write, the wal
prohibited state is set to inprogress(ie.
WALPROHIBIT_STATE_GOING_READ_WRITE) and then Checkpointer is signaled.
Next, Checkpointer will convey this system change to all existing
backends using a global barrier, and after that final wal prohibited
state is set to the read-write(i.e. WALPROHIBIT_STATE_READ_WRITE).
While Checkpointer is in the progress of conveying this global
barrier,  any new backend can connect at that time and can write a new
record because the inprogress read-write state is equivalent to the
final read-write state iff LocalXLogInsertAllowed != 0 for that
backend.  And, that new record could slip in before or in between
records to be written by XLogAcceptWrites().

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-10 Thread Robert Haas

On Sun, May 9, 2021 at 1:26 AM Amul Sul  wrote:
> The state in the control file also gets cleared. Though, after
> clearing in memory the state patch doesn't really do the immediate
> change to the control file, it relies on the next UpdateControlFile()
> to do that.

But when will that happen? If you're relying on some very nearby code,
that might be OK, but perhaps a comment is in order. If you're just
thinking it's going to happen eventually, I think that's not good
enough.

> Regarding log message I think I have skipped that intentionally, to
> avoid confusing log as "system is now read write" when we do start as
> hot-standby which is not really read-write.

I think the message should not be phrased that way. In fact, I think
now that we've moved to calling this pg_prohibit_wal() rather than
ALTER SYSTEM READ ONLY, a lot of messages need to be rethought, and
maybe some comments and function names as well. Perhaps something
like:

system is read only -> WAL is now prohibited
system is read write -> WAL is no longer prohibited

And then for this particular case, maybe something like:

clearing WAL prohibition because the system is in archive recovery

> > The second part of this proposal was:
> >
> > "2. Create a new function with a name like XLogAcceptWrites(). Move the
> > following things from StartupXLOG() into that function: (1) the call
> > to UpdateFullPageWrites(), (2) the following block of code that does
> > either CreateEndOfRecoveryRecord() or RequestCheckpoint() or
> > CreateCheckPoint(), (3) the next block of code that runs
> > recovery_end_command, (4) the call to XLogReportParameters(), and (5)
> > the call to CompleteCommitTsInitialization(). Call the new function
> > from the place where we now call XLogReportParameters(). This would
> > mean that (1)-(3) happen later than they do now, which might require
> > some adjustments."
> >
> > Now you moved that code, but you also moved (6)
> > CompleteCommitTsInitialization(), (7) setting the control file to
> > DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and
> > (9) requesting a checkpoint if we were just promoted. That's not what
> > was proposed. One result of this is that the server now thinks it's in
> > recovery even after the startup process has exited.
> > RecoveryInProgress() is still returning true everywhere. But that is
> > inconsistent with what Andres and I were recommending in
> > http://postgr.es/m/CA+TgmoZYQN=rcye-ixwnjdvmaoh+7jaqsif3u2k8xqxipba...@mail.gmail.com
>
> Regarding modified approach, I tried to explain that why I did
> this in 
> http://postgr.es/m/CAAJ_b96Yb4jaW6oU1bVYEBaf=tq-ql+mmt1exfwvnzer7xr...@mail.gmail.com

I am not able to understand what problem you are seeing there. If
we're in crash recovery, then nobody can connect to the database, so
there can't be any concurrent activity. If we're in archive recovery,
we now clear the WAL-is-prohibited flag so that we will go read-write
directly at the end of recovery. We can and should refuse any effort
to call pg_prohibit_wal() during recovery. If we reached the end of
crash recovery and are now permitting read-only connections, why would
anyone be able to write WAL before the system has been changed to
read-write? If that can happen, it's a bug, not a reason to change the
design.

Maybe your concern here is about ordering: the process that is going
to run XLogAcceptWrites() needs to allow xlog writes locally before we
tell other backends that they also can xlog writes; otherwise, some
other records could slip in before UpdateFullPageWrites() and similar
have run, which we probably don't want. But that's why
LocalSetXLogInsertAllowed() was invented, and if it doesn't quite do
what we need in this situation, we should be able to tweak it so it
does.

If your concern is something else, can you spell it out for me again
because I'm not getting it?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-08 Thread Amul Sul

On Fri, May 7, 2021 at 1:23 AM Robert Haas  wrote:
>
> On Mon, Apr 12, 2021 at 10:04 AM Amul Sul  wrote:
> > Rebased again.
>
> I started to look at this today, and didn't get very far, but I have a
> few comments. The main one is that I don't think this patch implements
> the design proposed in
> https://www.postgresql.org/message-id/CA+TgmoZ=cctbaxxmtyzogxegqzoz9smkbwrdpsacpjvfcgc...@mail.gmail.com
>
> The first part of that proposal said this:
>
> "1. If the server starts up and is read-only and
> ArchiveRecoveryRequested, clear the read-only state in memory and also
> in the control file, log a message saying that this has been done, and
> proceed. This makes some other cases simpler to deal with."
>
> As I read it, the patch clears the read-only state in memory, does not
> clear it in the control file, and does not log a message.
>

The state in the control file also gets cleared. Though, after
clearing in memory the state patch doesn't really do the immediate
change to the control file, it relies on the next UpdateControlFile()
to do that.

Regarding log message I think I have skipped that intentionally, to
avoid confusing log as "system is now read write" when we do start as
hot-standby which is not really read-write.

> The second part of this proposal was:
>
> "2. Create a new function with a name like XLogAcceptWrites(). Move the
> following things from StartupXLOG() into that function: (1) the call
> to UpdateFullPageWrites(), (2) the following block of code that does
> either CreateEndOfRecoveryRecord() or RequestCheckpoint() or
> CreateCheckPoint(), (3) the next block of code that runs
> recovery_end_command, (4) the call to XLogReportParameters(), and (5)
> the call to CompleteCommitTsInitialization(). Call the new function
> from the place where we now call XLogReportParameters(). This would
> mean that (1)-(3) happen later than they do now, which might require
> some adjustments."
>
> Now you moved that code, but you also moved (6)
> CompleteCommitTsInitialization(), (7) setting the control file to
> DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and
> (9) requesting a checkpoint if we were just promoted. That's not what
> was proposed. One result of this is that the server now thinks it's in
> recovery even after the startup process has exited.
> RecoveryInProgress() is still returning true everywhere. But that is
> inconsistent with what Andres and I were recommending in
> http://postgr.es/m/CA+TgmoZYQN=rcye-ixwnjdvmaoh+7jaqsif3u2k8xqxipba...@mail.gmail.com
>

Regarding modified approach, I tried to explain that why I did
this in 
http://postgr.es/m/CAAJ_b96Yb4jaW6oU1bVYEBaf=tq-ql+mmt1exfwvnzer7xr...@mail.gmail.com

> I also noticed that 0001 does not compile without 0002, so the
> separation into multiple patches is not clean. I would actually
> suggest that the first patch in the series should just create
> XLogAcceptWrites() with the minimum amount of adjustment to make that
> work. That would potentially let us commit that change independently,
> which would be good, because then if we accidentally break something,
> it'll be easier to pin down to that particular change instead of being
> mixed with everything else this needs to change.
>

Ok, I will try in the next version.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-06 Thread Robert Haas

On Mon, Apr 12, 2021 at 10:04 AM Amul Sul  wrote:
> Rebased again.

I started to look at this today, and didn't get very far, but I have a
few comments. The main one is that I don't think this patch implements
the design proposed in
https://www.postgresql.org/message-id/CA+TgmoZ=cctbaxxmtyzogxegqzoz9smkbwrdpsacpjvfcgc...@mail.gmail.com

The first part of that proposal said this:

"1. If the server starts up and is read-only and
ArchiveRecoveryRequested, clear the read-only state in memory and also
in the control file, log a message saying that this has been done, and
proceed. This makes some other cases simpler to deal with."

As I read it, the patch clears the read-only state in memory, does not
clear it in the control file, and does not log a message.

The second part of this proposal was:

"2. Create a new function with a name like XLogAcceptWrites(). Move the
following things from StartupXLOG() into that function: (1) the call
to UpdateFullPageWrites(), (2) the following block of code that does
either CreateEndOfRecoveryRecord() or RequestCheckpoint() or
CreateCheckPoint(), (3) the next block of code that runs
recovery_end_command, (4) the call to XLogReportParameters(), and (5)
the call to CompleteCommitTsInitialization(). Call the new function
from the place where we now call XLogReportParameters(). This would
mean that (1)-(3) happen later than they do now, which might require
some adjustments."

Now you moved that code, but you also moved (6)
CompleteCommitTsInitialization(), (7) setting the control file to
DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and
(9) requesting a checkpoint if we were just promoted. That's not what
was proposed. One result of this is that the server now thinks it's in
recovery even after the startup process has exited.
RecoveryInProgress() is still returning true everywhere. But that is
inconsistent with what Andres and I were recommending in
http://postgr.es/m/CA+TgmoZYQN=rcye-ixwnjdvmaoh+7jaqsif3u2k8xqxipba...@mail.gmail.com

I also noticed that 0001 does not compile without 0002, so the
separation into multiple patches is not clean. I would actually
suggest that the first patch in the series should just create
XLogAcceptWrites() with the minimum amount of adjustment to make that
work. That would potentially let us commit that change independently,
which would be good, because then if we accidentally break something,
it'll be easier to pin down to that particular change instead of being
mixed with everything else this needs to change.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-04-05 Thread Bharath Rupireddy

On Mon, Apr 5, 2021 at 11:02 AM Amul Sul  wrote:
>
> Attached is the rebase version for the latest master head(commit # 
> 9f6f1f9b8e6).

Some minor comments on 0001:
Isn't it "might not be running"?
+ errdetail("Checkpointer might not running."),

Isn't it  "Try again after sometime"?
+ errhint("Try after sometime again.")));

Can we have ereport(DEBUG1 just to be consistent(although it doesn't
make any difference from elog(DEBUG1) with the new log messages
introduced in the patch?
+elog(DEBUG1, "waiting for backends to adopt requested WAL
prohibit state change");

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-03-19 Thread Prabhat Sahu

Hi all,
While testing this feature with v20-patch, the server is crashing with
below steps.

Steps to reproduce:
1. Configure master-slave replication setup.
2. Connect to Slave.
3. Execute below statements, it will crash the server:
SELECT pg_prohibit_wal(true);
SELECT pg_prohibit_wal(false);

-- Slave:
postgres=# select pg_is_in_recovery();
 pg_is_in_recovery
---
 t
(1 row)

postgres=# SELECT pg_prohibit_wal(true);
 pg_prohibit_wal
-

(1 row)

postgres=# SELECT pg_prohibit_wal(false);
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>

-- Below are the stack trace:
[prabhat@localhost bin]$ gdb -q -c /tmp/data_slave/core.35273 postgres
Reading symbols from
/home/prabhat/PG/PGsrcNew/postgresql/inst/bin/postgres...done.
[New LWP 35273]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: checkpointer
'.
Program terminated with signal 6, Aborted.
#0  0x7fa876233387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install
glibc-2.17-317.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64
krb5-libs-1.15.1-50.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64
libgcc-4.8.5-44.el7.x86_64 libselinux-2.5-15.el7.x86_64
openssl-libs-1.0.2k-21.el7_9.x86_64 pcre-8.32-17.el7.x86_64
zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x7fa876233387 in raise () from /lib64/libc.so.6
#1  0x7fa876234a78 in abort () from /lib64/libc.so.6
#2  0x00aea31c in ExceptionalCondition (conditionName=0xb8c998
"ThisTimeLineID != 0 || IsBootstrapProcessingMode()",
errorType=0xb8956d "FailedAssertion", fileName=0xb897c0 "xlog.c",
lineNumber=8611) at assert.c:69
#3  0x00588eb5 in InitXLOGAccess () at xlog.c:8611
#4  0x00588ae6 in LocalSetXLogInsertAllowed () at xlog.c:8483
#5  0x005881bb in XLogAcceptWrites (needChkpt=true, xlogreader=0x0,
EndOfLog=0, EndOfLogTLI=0) at xlog.c:8008
#6  0x005751ed in ProcessWALProhibitStateChangeRequest () at
walprohibit.c:361
#7  0x0088c69f in CheckpointerMain () at checkpointer.c:355
#8  0x0059d7db in AuxiliaryProcessMain (argc=2,
argv=0x7ffd1290d060) at bootstrap.c:455
#9  0x0089fc5f in StartChildProcess (type=CheckpointerProcess) at
postmaster.c:5416
#10 0x0089f782 in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5128
#11 
#12 0x7fa8762f2983 in __select_nocancel () from /lib64/libc.so.6
#13 0x0089b511 in ServerLoop () at postmaster.c:1700
#14 0x0089af00 in PostmasterMain (argc=5, argv=0x15b8460) at
postmaster.c:1408
#15 0x0079c23a in main (argc=5, argv=0x15b8460) at main.c:209
(gdb)

kindly let me know if you need more inputs on this.

On Mon, Mar 15, 2021 at 12:56 PM Amul Sul  wrote:

> On Sun, Mar 14, 2021 at 11:51 PM Ibrar Ahmed 
> wrote:
> >
> > On Tue, Mar 9, 2021 at 3:31 PM Amul Sul  wrote:
> >>
> >> On Thu, Mar 4, 2021 at 11:02 PM Amul Sul  wrote:
> >> >
> >> > On Wed, Mar 3, 2021 at 8:56 PM Robert Haas 
> wrote:
> >> > >
> >> > > On Tue, Mar 2, 2021 at 7:22 AM Dilip Kumar 
> wrote:
> >[]
> >
> > One of the patch
> (v18-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch) from the
> latest patchset does not apply successfully.
> >
> > http://cfbot.cputube.org/patch_32_2602.log
> >
> > === applying patch
> ./v18-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch
> >
> > Hunk #15 succeeded at 2604 (offset -13 lines).
> > 1 out of 15 hunks FAILED -- saving rejects to file
> src/backend/access/nbtree/nbtpage.c.rej
> > patching file src/backend/access/spgist/spgdoinsert.c
> >
> > It is a very minor change, so I rebased the patch. Please take a look,
> if that works for you.
> >
>
> Thanks, I am getting one more failure for the vacuumlazy.c. on the
> latest master head(d75288fb27b), I fixed that in attached version.
>
> Regards,
> Amul
>


-- 

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-03-08 Thread Dilip Kumar

On Wed, Mar 3, 2021 at 8:56 PM Robert Haas  wrote:
>
> On Tue, Mar 2, 2021 at 7:22 AM Dilip Kumar  wrote:
> > Why do we need to move promote related code in XLogAcceptWrites?
> > IMHO, this promote related handling should be in StartupXLOG only.
> > That will look cleaner.
>
> The key design question here, at least in my mind, is what exactly
> happens after prohibit-WAL + system-crash + recovery-finishes. We
> clearly can't write the checkpoint or end-of-recovery record and
> proceed with business as usual, but are we still in recovery? Either
> (1) we are technically still in recovery, stopping just short of
> entering normal running, and will emerge from recovery when WAL is
> permitted again; or (2) we have technically finished recovery, but
> deferred some of the actions that would normally occur at that time
> until a later point. Maybe this is academic distinction as much as
> anything, but the idea is if we choose #1 then we should do as little
> as possible at the point when recovery finishes and defer as much as
> possible until we actually enter normal running; whereas if we choose
> #2 we should do as much as possible at the point when recovery
> finishes and defer only those things which absolutely have to be
> deferred. That said, I and I think also Andres are voting for #2.
>
> But if we go that way, that precludes what you are proposing here. If
> we picked #1 then it would be natural for the startup process to
> remain active and the control file update to be postponed until WAL
> writes are re-enabled; but under model #2 we want, if possible, for
> the startup process to exit and the control file update to happen
> normally, and only the writing of the actual WAL records to be
> deferred.

Maybe I did not put my point clearly,  let me clarify that.  First, I
was also inclined that it should work like #2.  And, if it works like
#2 then I would assume that the code goes in XLogAcceptWrites function
should be minimal, only those part which we want to execute after the
system is back to read-write mode.  So basically, the XLogAcceptWrites
should only keep the code that is common code which we want to execute
at the end of the StartupXLog if the system is normal or we want to
execute when the system is back to read-write if it was read only.  So
my point was all the uncommon code that we have moved into
XLogAcceptWrites should be kept inside the StartupXLog function only.
So I think the promotion-related code doesn't belong to the
XLogAcceptWrites function.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-03-04 Thread Amul Sul

On Wed, Mar 3, 2021 at 7:56 PM Dilip Kumar  wrote:
>
> On Wed, Mar 3, 2021 at 4:50 PM Amul Sul  wrote:
> >
> > On Wed, Mar 3, 2021 at 12:08 PM Dilip Kumar  wrote:
> > >
> > Yes, it is possible to allow wal temporarily for itself by setting
> > LocalXLogInsertAllowed, but when we request Checkpointer for the 
> > end-of-recovery
> > checkpoint, the first thing it will do is that wal prohibit state transition
> > then recovery-end-checkpoint.
> >
> > Also, allowing WAL write in read-only (WAL prohibited state) mode is against
> > this feature principle.
>
> So IIUC before the checkpoint change the state in the control file we
> anyway inform other backend and then they are allowed to write the WAL
> is the right?  If that is true then what is the problem in first doing
> the pending post-recovery process and then informing the backend about
> the state change.  I mean we are in a process of changing the state to
> read-write so why it is necessary to inform all the backend before we
> can write WAL?  Are we afraid that after we write the WAL and if there
> is some failure before we make the system read-write then it will
> break the principle of the feature, I mean eventually system will stay
> read-only but we wrote the WAL?  If so then currently, are we assuring
> that once we inform the backend and backend are allowed to write the
> WAL there are no chances of failure and the system state is guaranteed
> to be changed.  If all that is true then I will take my point back.
>

The wal prohibit state transition handling code is integrated into various
places of the checkpointer process so that it can pick state changes as soon as
possible. Before informing other backends we can do UpdateFullPageWrites() but
when we try to next the end-of-recovery checkpoint write operation, the
Checkpointer will hit ProcessWALProhibitStateChangeRequest() first which will
try for the wal prohibit transition state completion and then  write
the checkpoint
record.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2021-03-03 Thread Robert Haas

On Tue, Mar 2, 2021 at 7:22 AM Dilip Kumar  wrote:
> Why do we need to move promote related code in XLogAcceptWrites?
> IMHO, this promote related handling should be in StartupXLOG only.
> That will look cleaner.

The key design question here, at least in my mind, is what exactly
happens after prohibit-WAL + system-crash + recovery-finishes. We
clearly can't write the checkpoint or end-of-recovery record and
proceed with business as usual, but are we still in recovery? Either
(1) we are technically still in recovery, stopping just short of
entering normal running, and will emerge from recovery when WAL is
permitted again; or (2) we have technically finished recovery, but
deferred some of the actions that would normally occur at that time
until a later point. Maybe this is academic distinction as much as
anything, but the idea is if we choose #1 then we should do as little
as possible at the point when recovery finishes and defer as much as
possible until we actually enter normal running; whereas if we choose
#2 we should do as much as possible at the point when recovery
finishes and defer only those things which absolutely have to be
deferred. That said, I and I think also Andres are voting for #2.

But if we go that way, that precludes what you are proposing here. If
we picked #1 then it would be natural for the startup process to
remain active and the control file update to be postponed until WAL
writes are re-enabled; but under model #2 we want, if possible, for
the startup process to exit and the control file update to happen
normally, and only the writing of the actual WAL records to be
deferred.

What I find much odder, looking at the present patch, is that
PerformPendingStartupOperations() gets called from pg_prohibit_wal()
rather than by the checkpointer. If the checkpointer is the process
that is in charge of coordinating the change between a read-only state
and a read-write state, then it ought to also do this. I also think
that the PerformPendingStartupOperations() wrapper is unnecessary.
Just invert the sense of the XLogCtl flag: xlogAllowWritesDone, rather
than startupCrashRecoveryPending, and have XLogAcceptWrites() set it
(and return without doing anything if it's already set). Then the
checkpointer can just call the function unconditionally whenever we go
read-write, and for a bonus we will have much better naming
consistency, rather than calling the same thing "xlog accept writes"
in one place, "pending startup operations" in another, and "startup
crash recovery pending" in a third.

Since this feature is basically no longer "alter system read only" but
rather "pg_prohibit_wal" I think we also ought to rename the GUC,
system_is_read_only -> wal_prohibited.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-03-03 Thread Dilip Kumar

On Wed, Mar 3, 2021 at 4:50 PM Amul Sul  wrote:
>
> On Wed, Mar 3, 2021 at 12:08 PM Dilip Kumar  wrote:
> >
> Yes, it is possible to allow wal temporarily for itself by setting
> LocalXLogInsertAllowed, but when we request Checkpointer for the 
> end-of-recovery
> checkpoint, the first thing it will do is that wal prohibit state transition
> then recovery-end-checkpoint.
>
> Also, allowing WAL write in read-only (WAL prohibited state) mode is against
> this feature principle.

So IIUC before the checkpoint change the state in the control file we
anyway inform other backend and then they are allowed to write the WAL
is the right?  If that is true then what is the problem in first doing
the pending post-recovery process and then informing the backend about
the state change.  I mean we are in a process of changing the state to
read-write so why it is necessary to inform all the backend before we
can write WAL?  Are we afraid that after we write the WAL and if there
is some failure before we make the system read-write then it will
break the principle of the feature, I mean eventually system will stay
read-only but we wrote the WAL?  If so then currently, are we assuring
that once we inform the backend and backend are allowed to write the
WAL there are no chances of failure and the system state is guaranteed
to be changed.  If all that is true then I will take my point back.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-03-03 Thread Amul Sul

On Wed, Mar 3, 2021 at 12:08 PM Dilip Kumar  wrote:
>
> On Tue, Mar 2, 2021 at 9:01 PM Dilip Kumar  wrote:
>
> > >
> > > We don't want that to happen in cases where previous 
> > > recovery-end-checkpoint is
> > > skipped in startup. We want Checkpointer first to convey the barrier to 
> > > all
> > > backends but, the backend shouldn't write wal until the Checkpointer 
> > > writes
> > > recovery-end-checkpoint record.
> > >
> > > To refrain these backends from writing WAL I think we should keep the 
> > > server in
> > > crash recovery mode until UpdateFullPageWrites(),
> > > end-of-recovery-checkpoint, and XLogReportParameters() are performed.
>
> I did not read the code for this, but let me ask something about this
> case.  Why do we want checkpointer to convey the barrier to all the
> backend before completing the end of recovery checkpoint and other
> stuff? Is it because the system is still in WAL prohibited state?

Consider the previous case, where the user wants to change the system to
read-write. When a permitted user executes pg_prohibit_wal(false), the wal
prohibited state in shared memory updated to GOING_READ_WRITE
which is the transition state and then waits until the transition state
completes and the final state (i.e. READ_WRITE) gets updated
in shared memory. To set the final set is a job of the Checkpointer process.

We have integrated code into the Checkpointer process such that if it sees wal
prohibit transition state then it completes that as soon as possible by doing
necessary steps i.e. emitting super barriers, then update the final wal
prohibited state in shared memory and in control file.

> Is
> it possible that as soon as we get the pg_prohibit_wal(false) request
> the receiving backend start allowing the WAL writing for itself and
> finish the all post-recovery pending work and then inform the
> checkpointer to inform all other backends?
>

Yes, it is possible to allow wal temporarily for itself by setting
LocalXLogInsertAllowed, but when we request Checkpointer for the end-of-recovery
checkpoint, the first thing it will do is that wal prohibit state transition
then recovery-end-checkpoint.

Also, allowing WAL write in read-only (WAL prohibited state) mode is against
this feature principle.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2021-03-02 Thread Dilip Kumar

On Tue, Mar 2, 2021 at 9:01 PM Dilip Kumar  wrote:

> >
> > We don't want that to happen in cases where previous 
> > recovery-end-checkpoint is
> > skipped in startup. We want Checkpointer first to convey the barrier to all
> > backends but, the backend shouldn't write wal until the Checkpointer writes
> > recovery-end-checkpoint record.
> >
> > To refrain these backends from writing WAL I think we should keep the 
> > server in
> > crash recovery mode until UpdateFullPageWrites(),
> > end-of-recovery-checkpoint, and XLogReportParameters() are performed.

I did not read the code for this, but let me ask something about this
case.  Why do we want checkpointer to convey the barrier to all the
backend before completing the end of recovery checkpoint and other
stuff? Is it because the system is still in WAL prohibited state?  Is
it possible that as soon as we get the pg_prohibit_wal(false) request
the receiving backend start allowing the WAL writing for itself and
finish the all post-recovery pending work and then inform the
checkpointer to inform all other backends?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-03-02 Thread Dilip Kumar

On Tue, Mar 2, 2021 at 7:54 PM Amul Sul  wrote:
> XLogAcceptWrites() tried to club all the WAL write operations that happen at 
> the
> end of StartupXLOG(). The only exception is that promotion checkpoint.

Okay, I was expecting that XLogAcceptWrites should have all the WAL
write-related operations which should either executed at the end of
StartupXLOG() if the system is not read-only or after the system is
set back to read-write.  But promotion-related code is completely
irrelevant when it is executed from PerformPendingStartupOperations.
So I am not entirely sure that whether we want to keep those stuff in
XLogAcceptWrites.

> > That will look cleaner.
>
> I think it would be better to move the promotion checkpoint call inside
> XLogAcceptWrites() as well. So that we can say XLogAcceptWrites() is a part of
> StartupXLOG() does the required WAL writes. Thoughts?

Okay so if we want to keep all the WAL write inside XLogAcceptWrites
including promotion-related stuff then +1 for moving this also inside
XLogAcceptWrites.

> > >
> > > 1] 
> > > http://postgr.es/m/CA+TgmoZ=cctbaxxmtyzogxegqzoz9smkbwrdpsacpjvfcgc...@mail.gmail.com
> > > 2] 
> > > http://postgr.es/m/caaj_b97xx-nqrym_uxzecph9asgomrordnhrg1n51fdcdwo...@mail.gmail.com
> >
> > 2.
> > I did not clearly understand your concern in point [2], because of
> > which you have to postpone RECOVERY_STATE_DONE untill system is set
> > back to read-write.  Can you explain this?
> >
>
> Sure, for that let me explain how this transition to read-write occurs.  When 
> a
> backend executes a function (ie. pg_prohibit_wal(false)) to make the system
> read-write then that system state changes will be conveyed by the Checkpointer
> process to all existing backends using global barrier and while Checkpointer 
> in
> the progress of conveying this barrier, any existing backends who might have
> absorbed barriers can write new records.
>
> We don't want that to happen in cases where previous recovery-end-checkpoint 
> is
> skipped in startup. We want Checkpointer first to convey the barrier to all
> backends but, the backend shouldn't write wal until the Checkpointer writes
> recovery-end-checkpoint record.
>
> To refrain these backends from writing WAL I think we should keep the server 
> in
> crash recovery mode until UpdateFullPageWrites(),
> end-of-recovery-checkpoint, and XLogReportParameters() are performed.

Thanks for the explanation.  Now, I understand the problem, however, I
am not sure that whether keeping the system in recovery is the best
way to solve this but as of now I don't have anything better to
suggest, and immediately I couldn’t think of any problem with this
solution.  But I will think about this again.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-03-02 Thread Amul Sul

On Tue, Mar 2, 2021 at 5:52 PM Dilip Kumar  wrote:
>
> On Fri, Feb 19, 2021 at 5:43 PM Amul Sul  wrote:
> >
> > In the attached version I have made the changes accordingly what Robert has
> > summarised in his previous mail[1].
> >
> > In addition to that, I also move the code that updates the control file to
> > XLogAcceptWrites() which will also get skipped when the system is read-only 
> > (wal
> > prohibited).  The system will be in the crash recovery, and that will
> > change once we do the end-of-recovery checkpoint and the WAL writes 
> > operation
> > which we were skipping from startup.  The benefit of keeping the system in
> > recovery mode is that it fixes my concern[2] where other backends could 
> > connect
> > and write wal records while we were changing the system to read-write. Now, 
> > no
> > other backends allow a wal write; UpdateFullPageWrites(), end-of-recovery
> > checkpoint, and XLogReportParameters() operations will be performed in the 
> > same
> > sequence as it is in the startup while changing the system to read-write.
>
> I was looking into the changes espcially recovery related problem,  I
> have a few questions
>
> 1.
> +static bool
> +XLogAcceptWrites(bool needChkpt, bool bgwriterLaunched,
> + bool localPromoteIsTriggered, XLogReaderState *xlogreader,
> + bool archiveRecoveryRequested, TimeLineID endOfLogTLI,
> + XLogRecPtr endOfLog, TimeLineID thisTimeLineID)
> +{
> +boolpromoted = false;
> +
> +/*
> .
> +if (localPromoteIsTriggered)
>  {
> -checkPointLoc = ControlFile->checkPoint;
> +XLogRecord *record;
>
> ...
> +record = ReadCheckpointRecord(xlogreader,
> +  ControlFile->checkPoint,
> +  1, false);
>  if (record != NULL)
>  {
>  promoted = true;
> ...
> CreateEndOfRecoveryRecord();
> }
>
> Why do we need to move promote related code in XLogAcceptWrites?
> IMHO, this promote related handling should be in StartupXLOG only.

XLogAcceptWrites() tried to club all the WAL write operations that happen at the
end of StartupXLOG(). The only exception is that promotion checkpoint.

> That will look cleaner.

I think it would be better to move the promotion checkpoint call inside
XLogAcceptWrites() as well. So that we can say XLogAcceptWrites() is a part of
StartupXLOG() does the required WAL writes. Thoughts?

>
> >
> > 1] 
> > http://postgr.es/m/CA+TgmoZ=cctbaxxmtyzogxegqzoz9smkbwrdpsacpjvfcgc...@mail.gmail.com
> > 2] 
> > http://postgr.es/m/caaj_b97xx-nqrym_uxzecph9asgomrordnhrg1n51fdcdwo...@mail.gmail.com
>
> 2.
> I did not clearly understand your concern in point [2], because of
> which you have to postpone RECOVERY_STATE_DONE untill system is set
> back to read-write.  Can you explain this?
>

Sure, for that let me explain how this transition to read-write occurs.  When a
backend executes a function (ie. pg_prohibit_wal(false)) to make the system
read-write then that system state changes will be conveyed by the Checkpointer
process to all existing backends using global barrier and while Checkpointer in
the progress of conveying this barrier, any existing backends who might have
absorbed barriers can write new records.

We don't want that to happen in cases where previous recovery-end-checkpoint is
skipped in startup. We want Checkpointer first to convey the barrier to all
backends but, the backend shouldn't write wal until the Checkpointer writes
recovery-end-checkpoint record.

To refrain these backends from writing WAL I think we should keep the server in
crash recovery mode until UpdateFullPageWrites(),
end-of-recovery-checkpoint, and XLogReportParameters() are performed.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2021-03-02 Thread Dilip Kumar

On Fri, Feb 19, 2021 at 5:43 PM Amul Sul  wrote:
>
> In the attached version I have made the changes accordingly what Robert has
> summarised in his previous mail[1].
>
> In addition to that, I also move the code that updates the control file to
> XLogAcceptWrites() which will also get skipped when the system is read-only 
> (wal
> prohibited).  The system will be in the crash recovery, and that will
> change once we do the end-of-recovery checkpoint and the WAL writes operation
> which we were skipping from startup.  The benefit of keeping the system in
> recovery mode is that it fixes my concern[2] where other backends could 
> connect
> and write wal records while we were changing the system to read-write. Now, no
> other backends allow a wal write; UpdateFullPageWrites(), end-of-recovery
> checkpoint, and XLogReportParameters() operations will be performed in the 
> same
> sequence as it is in the startup while changing the system to read-write.

I was looking into the changes espcially recovery related problem,  I
have a few questions

1.
+static bool
+XLogAcceptWrites(bool needChkpt, bool bgwriterLaunched,
+ bool localPromoteIsTriggered, XLogReaderState *xlogreader,
+ bool archiveRecoveryRequested, TimeLineID endOfLogTLI,
+ XLogRecPtr endOfLog, TimeLineID thisTimeLineID)
+{
+boolpromoted = false;
+
+/*
.
+if (localPromoteIsTriggered)
 {
-checkPointLoc = ControlFile->checkPoint;
+XLogRecord *record;

...
+record = ReadCheckpointRecord(xlogreader,
+  ControlFile->checkPoint,
+  1, false);
 if (record != NULL)
 {
 promoted = true;
...
CreateEndOfRecoveryRecord();
}

Why do we need to move promote related code in XLogAcceptWrites?
IMHO, this promote related handling should be in StartupXLOG only.
That will look cleaner.

>
> 1] 
> http://postgr.es/m/CA+TgmoZ=cctbaxxmtyzogxegqzoz9smkbwrdpsacpjvfcgc...@mail.gmail.com
> 2] 
> http://postgr.es/m/caaj_b97xx-nqrym_uxzecph9asgomrordnhrg1n51fdcdwo...@mail.gmail.com

2.
I did not clearly understand your concern in point [2], because of
which you have to postpone RECOVERY_STATE_DONE untill system is set
back to read-write.  Can you explain this?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-02-16 Thread Andres Freund

Hi,

On 2021-02-16 17:11:06 -0500, Robert Haas wrote:
> I can't promise that what I'm about to write is an entirely faithful
> representation of what he said, but hopefully it's not so far off that
> he gets mad at me or something.

Seems accurate - and also I'm way too tired that I'd be mad ;)


> 1. If the server starts up and is read-only and
> ArchiveRecoveryRequested, clear the read-only state in memory and also
> in the control file, log a message saying that this has been done, and
> proceed. This makes some other cases simpler to deal with.

It seems also to make sense from a behaviour POV to me: Imagine a
"smooth" planned failover with ASRO:
1) ASRO on primary
2) promote standby
3) edit primary config to include primary_conninfo, add standby.signal
4) restart "read only primary"

There's not really any spot in which it'd be useful to do disable ASRO,
right? But 4) should make the node a normal standby.

Greetings,

Andres Freund

Re: [Patch] ALTER SYSTEM READ ONLY

2021-02-16 Thread Robert Haas

On Thu, Jan 28, 2021 at 7:17 AM Amul Sul  wrote:
> I am still on this. The things that worried me here are the wal records 
> sequence
> being written in the startup process -- UpdateFullPageWrites() generate record
> just before the recovery check-point record and XLogReportParameters() just
> after that but before any other backend could write any wal record. We might
> also need to follow the same sequence while changing the system to read-write.

I was able to chat with Andres about this topic for a while today and
he made some proposals which seemed pretty good to me. I can't promise
that what I'm about to write is an entirely faithful representation of
what he said, but hopefully it's not so far off that he gets mad at me
or something.

1. If the server starts up and is read-only and
ArchiveRecoveryRequested, clear the read-only state in memory and also
in the control file, log a message saying that this has been done, and
proceed. This makes some other cases simpler to deal with.

2. Create a new function with a name like XLogAcceptWrites(). Move the
following things from StartupXLOG() into that function: (1) the call
to UpdateFullPageWrites(), (2) the following block of code that does
either CreateEndOfRecoveryRecord() or RequestCheckpoint() or
CreateCheckPoint(), (3) the next block of code that runs
recovery_end_command, (4) the call to XLogReportParameters(), and (5)
the call to CompleteCommitTsInitialization(). Call the new function
from the place where we now call XLogReportParameters(). This would
mean that (1)-(3) happen later than they do now, which might require
some adjustments.

3. If the system is starting up read only (and the read-only state
didn't get cleared because of #1 above) then don't call
XLogAcceptWrites() at the end of StartupXLOG() and instead have the
checkpointer do it later when the system is going read-write for the
first time.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2021-01-25 Thread Robert Haas

On Thu, Jan 21, 2021 at 9:47 AM Amul Sul  wrote:
> It is not like that, let me explain. When a user backend requests to alter WAL
> prohibit state by using ASRO/ASRW DDL with the previous patch or calling
> pg_alter_wal_prohibit_state() then WAL prohibit state in shared memory will be
> set to the transition state i.e. going-read-only or going-read-write if it is
> not already.  If another backend trying to request the same alteration to the
> wal prohibit state then nothing going to be changed in shared memory but that
> backend needs to wait until the transition to the final wal prohibited state
> completes.  If a backend tries to request for the opposite state than the
> previous which is in progress then it will see an error as "system state
> transition to read only/write is already in progress".  At a time only one
> transition state can be set.

Hrm. Well, then that needs to be abundantly clear in the relevant comments.

> Now, I am a little bit concerned about the current function name. How about
> pg_set_wal_prohibit_state(bool) name or have two functions as
> pg_set_wal_prohibit_state(void) and pg_unset_wal_prohibit_state(void) or any
> other suggestions?

How about pg_prohibit_wal(true|false)?

> > It seems odd that ProcessWALProhibitStateChangeRequest() returns
> > without doing anything if !AmCheckpointerProcess(), rather than having
> > that be an Assert(). Why is it like that?
>
> Like AbsorbSyncRequests().

Well, that can be called not from the checkpointer, according to the
comments.  Specifically from the postmaster, I guess.  Again, comments
please.

> If you think that there will be panic when UpdateFullPageWrites() and/or
> XLogReportParameters() tries to write WAL since the shared memory state for 
> WAL
> prohibited is set then it is not like that.  For those functions, WAL write is
> explicitly enabled by calling LocalSetXLogInsertAllowed().
>
> I was under the impression that there won't be any problem if we allow the
> writing WAL to UpdateFullPageWrites() and XLogReportParameters().  It can be
> considered as an exception since it is fine that this WAL record is not 
> streamed
> to standby while graceful failover, I may be wrong though.

I don't think that's OK. I mean, the purpose of the feature is to
prohibit WAL. If it doesn't do that, I believe it will fail to satisfy
the principle of least surprise.

> I am not clear on this part. In the attached version I am sending SIGUSR1
> instead of SIGINT, which works for me.

OK.

> The attached version does not address all your comments, I'll continue my work
> on that.

Some thoughts on this version:

+/* Extract last two bits */
+#defineWALPROHIBIT_CURRENT_STATE(stateGeneration)  \
+   ((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1)))
+#defineWALPROHIBIT_NEXT_STATE(stateGeneration) \
+   WALPROHIBIT_CURRENT_STATE((stateGeneration + 1))

This is really confusing. First, the comment looks like it applies to
both based on how it is positioned, but that's clearly not true.
Second, the naming is really hard to understand. Third, there don't
seem to be comments explaining the theory of what is going on here.
Fourth, stateGeneration refers not to which generation of state we've
got here but to the combination of the state and the generation.
However, it's not clear that we ever really use the generation for
anything.

I think that the direction you went with this is somewhat different
from what I had in mind. That may be OK, but let me just explain the
difference. We both had in mind the idea that the low two bits of the
state would represent the current state and the upper bits would
represent the state generation. However, I wasn't necessarily
imagining that the only supported operation was making the combined
value go up by 1. For instance, I had thought that perhaps the effect
of trying to go read-only when we're in the middle of going read-write
would be to cancel the previous operation and start the new one. What
you have instead is that it errors out. So in your model a change
always has to finish before the next one can start, which in turn
means that the sequence is completely linear. In my idea the
state+generation might go from say 1 to 7, because trying to go
read-write would cancel the previous attempt to go read-only and
replace it with an attempt to go the other direction, and from 7 we
might go to to 9 if somebody now tries to go read-only again before
that finishes. In your model, there's never any sort of cancellation
of that kind, so you can only go 0->1->2->3->4->5->6->7->8->9 etc.

One disadvantage of the way you've got it from a user perspective is
that if I'm writing a tool, I might get an error telling me that the
state change I'm trying to make is already in progress, and then I
have to retry. With the other design, I might attempt a state change
and have it fail because the change can't be completed, but I won't
ever fail because I attempt a state change and it can't be started

Re: [Patch] ALTER SYSTEM READ ONLY

2021-01-19 Thread Robert Haas

On Thu, Jan 14, 2021 at 6:29 AM Amul Sul  wrote:
> To move development, testing, and the review forward, I have commented out the
> code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and
> added the changes for the checkpointer so that system read-write state change
> request can be processed as soon as possible, as suggested by Robert[1].

I am extremely doubtful about SetWALProhibitState()'s claim that "The
final state can only be requested by the checkpointer or by the
single-user so that there will be no chance that the server is already
in the desired final state." It seems like there is an obvious race
condition: CompleteWALProhibitChange() is called with a cur_state_gen
argument which embeds the last state we saw, but there's nothing to
keep it from changing between the time we saw it and the time that
function calls SetWALProhibitState(), is there? We aren't holding any
lock. It seems to me that SetWALProhibitState() needs to be rewritten
to avoid this assumption.

On a related note, SetWALProhibitState() has only two callers. One
passes is_final_state as true, and the other as false: it's never a
variable. The two cases are handled mostly differently. This doesn't
seem good. A lot of the logic in this function should probably be
moved to the calling sites, especially because it's almost certainly
wrong for this function to be basing what it does on the *current* WAL
prohibit state rather than the WAL prohibit state that was in effect
at the time we made the decision to call this function in the first
place. As I mentioned in the previous paragraph, that's a built-in
race condition. To put that another way, this function should NOT feel
free to call GetWALProhibitStateGen().

I don't really see why we should have both an SQL callable function
pg_alter_wal_prohibit_state() and also a DDL command for this. If
we're going to go with a functional interface, and I guess the idea of
that is to make it so GRANT EXECUTE works, then why not just get rid
of the DDL?

RequestWALProhibitChange() doesn't look very nice. It seems like it's
basically the second half of pg_alter_wal_prohibit_state(), not being
called from anywhere else. It doesn't seem to add anything to separate
it out like this; the interface between the two is not especially
clean.

It seems odd that ProcessWALProhibitStateChangeRequest() returns
without doing anything if !AmCheckpointerProcess(), rather than having
that be an Assert(). Why is it like that?

I think WALProhibitStateShmemInit() would probably look more similar
to other functions if it did if (found) { stuff; } rather than if
(!found) return; stuff; -- but I might be wrong about the existing
precedent.

The SetLastCheckPointSkipped() and LastCheckPointIsSkipped() stuff
seems confusingly-named, because we have other reasons for skipping a
checkpoint that are not what we're talking about here. I think this is
talking about whether we've performed a checkpoint after recovery, and
the naming should reflect that. But I think there's something else
wrong with the design, too: why is this protected by a spinlock? I
have questions in both directions. On the one hand, I wonder why we
need any kind of lock at all. On the other hand, if we do need a lock,
I wonder why a spinlock that protects only the setting and clearing of
the flag and nothing else is sufficient. There are zero comments
explaining what the idea behind this locking regime is, and I can't
understand why it should be correct.

In fact, I think this area needs a broader rethink. Like, the way you
integrated that stuff into StartupXLog(), it sure looks to me like we
might skip the checkpoint but still try to write other WAL records.
Before we reach the offending segment of code, we call
UpdateFullPageWrites(). Afterwards, we call XLogReportParameters().
Both of those are going to potentially write WAL. I guess you could
argue that's OK, on the grounds that neither function is necessarily
going to log anything, but I don't think I believe that. If I make my
server read only, take the OS down, change some GUCs, and then start
it again, I don't expect it to PANIC.

Also, I doubt that it's OK to skip the checkpoint as this code does
and then go ahead and execute recovery_end_command and update the
control file anyway. It sure looks like the existing code is written
with the assumption that the checkpoint happens before those other
things. One idea I just had was: suppose that, if the system is READ
ONLY, we don't actually exit recovery right away, and the startup
process doesn't exit. Instead we just sit there and wait for the
system to be made read-write again before doing anything else. But
then if hot_standby=false, there's no way for someone to execute a
ALTER SYSTEM READ WRITE and/or pg_alter_wal_prohibit_state(), which
seems bad. So perhaps we need to let in regular connections *as if*
the system were read-write while postponing not just the
end-of-recovery checkpoint but also the other associated

Re: [Patch] ALTER SYSTEM READ ONLY

2021-01-18 Thread Robert Haas

On Thu, Jan 14, 2021 at 6:29 AM Amul Sul  wrote:
> To move development, testing, and the review forward, I have commented out the
> code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and
> added the changes for the checkpointer so that system read-write state change
> request can be processed as soon as possible, as suggested by Robert[1].
>
> I have started a new thread[2] to understand the need for the CheckpointLock 
> in
> CreateCheckPoint() function. Until then we can continue work on this feature 
> by
> skipping CheckpointLock in CreateCheckPoint(), and therefore the 0003 patch is
> marked WIP.

Based on the favorable review comment from Andres upthread and also
your feedback, I committed 0001.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

2020-12-14 Thread Amul Sul

On Mon, Dec 14, 2020 at 11:28 AM Amul Sul  wrote:
>
> On Thu, Dec 10, 2020 at 6:04 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2020-12-09 16:13:06 -0500, Robert Haas wrote:
> > > That's not good. On a typical busy system, a system is going to be in
> > > the middle of a checkpoint most of the time, and the checkpoint will
> > > take a long time to finish - maybe minutes.
> >
> > Or hours, even. Due to the cost of FPWs it can make a lot of sense to
> > reduce the frequency of that cost...
> >
> >
> > > We want this feature to respond within milliseconds or a few seconds,
> > > not minutes. So we need something better here.
> >
> > Indeed.
> >
> >
> > > I'm inclined to think
> > > that we should try to CompleteWALProhibitChange() at the same places
> > > we AbsorbSyncRequests(). We know from experience that bad things
> > > happen if we fail to absorb sync requests in a timely fashion, so we
> > > probably have enough calls to AbsorbSyncRequests() to make sure that
> > > we always do that work in a timely fashion. So, if we do this work in
> > > the same place, then it will also be done in a timely fashion.
> >
> > Sounds sane, without having looked in detail.
> >
>
> Understood & agreed that we need to change the system state as soon as 
> possible.
>
> I can see AbsorbSyncRequests() is called from 4 routing as
> CheckpointWriteDelay(), ProcessSyncRequests(), SyncPostCheckpoint() and
> CheckpointerMain().  Out of 4, the first three executes with an interrupt is 
> on
> hod which will cause a problem when we do emit barrier and wait for those
> barriers absorption by all the process including itself and will cause an
> infinite wait. I think that can be fixed by teaching 
> WaitForProcSignalBarrier(),
> do not wait on self to absorb barrier.  Let that get absorbed at a later point
> in time when the interrupt is resumed.  I assumed that we cannot do barrier
> processing right away since there could be other barriers (maybe in the 
> future)
> including ours that should not process while the interrupt is on hold.
>

CreateCheckPoint() holds CheckpointLock LW at start and releases at the end
which puts interrupt on hold.  This kinda surprising that we were holding this
lock and putting interrupt on hots for a long time.  We do need that
CheckpointLock just to ensure that one checkpoint happens at a time. Can't we do
something easy to ensure that instead of the lock? Probably holding off
interrupts for so long doesn't seem to be a good idea. Thoughts/Suggestions?

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2020-12-13 Thread Amul Sul

On Thu, Dec 10, 2020 at 6:04 AM Andres Freund  wrote:
>
> Hi,
>
> On 2020-12-09 16:13:06 -0500, Robert Haas wrote:
> > That's not good. On a typical busy system, a system is going to be in
> > the middle of a checkpoint most of the time, and the checkpoint will
> > take a long time to finish - maybe minutes.
>
> Or hours, even. Due to the cost of FPWs it can make a lot of sense to
> reduce the frequency of that cost...
>
>
> > We want this feature to respond within milliseconds or a few seconds,
> > not minutes. So we need something better here.
>
> Indeed.
>
>
> > I'm inclined to think
> > that we should try to CompleteWALProhibitChange() at the same places
> > we AbsorbSyncRequests(). We know from experience that bad things
> > happen if we fail to absorb sync requests in a timely fashion, so we
> > probably have enough calls to AbsorbSyncRequests() to make sure that
> > we always do that work in a timely fashion. So, if we do this work in
> > the same place, then it will also be done in a timely fashion.
>
> Sounds sane, without having looked in detail.
>

Understood & agreed that we need to change the system state as soon as possible.

I can see AbsorbSyncRequests() is called from 4 routing as
CheckpointWriteDelay(), ProcessSyncRequests(), SyncPostCheckpoint() and
CheckpointerMain().  Out of 4, the first three executes with an interrupt is on
hod which will cause a problem when we do emit barrier and wait for those
barriers absorption by all the process including itself and will cause an
infinite wait. I think that can be fixed by teaching WaitForProcSignalBarrier(),
do not wait on self to absorb barrier.  Let that get absorbed at a later point
in time when the interrupt is resumed.  I assumed that we cannot do barrier
processing right away since there could be other barriers (maybe in the future)
including ours that should not process while the interrupt is on hold.

>
> > I'm not 100% sure whether that introduces any other problems.
> > Certainly, we're not going to be able to finish the checkpoint once
> > we've gone read-only, so we'll fail when we try to write the WAL
> > record for that, or maybe earlier if there's anything else that tries
> > to write WAL. Either the checkpoint needs to error out, like any other
> > attempt to write WAL, and we can attempt a new checkpoint if and when
> > we go read/write, or else we need to finish writing stuff out to disk
> > but not actually write the checkpoint completion record (or any other
> > WAL) unless and until the system goes back into read/write mode - and
> > then at that point the previously-started checkpoint will finish
> > normally. The latter seems better if we can make it work, but the
> > former is probably also acceptable. What you've got right now is not.
>
> I mostly wonder which of those two has which implications for how many
> FPWs we need to redo. Presumably stalling but not cancelling the current
> checkpoint is better?
>

Also, I like to uphold this idea of stalling a checkpointer's work in the middle
instead of canceling it. But here, we need to take care of shutdown requests and
death of postmaster cases that can cancel this stalling.  If that happens we
need to make sure that no unwanted wal insertion happens afterward and for that
LocalXLogInsertAllowed flag needs to be updated correctly since the wal
prohibits barrier processing was skipped for the checkpointer since it emits
that barrier as mentioned above.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2020-12-09 Thread Andres Freund

Hi,

On 2020-12-09 16:13:06 -0500, Robert Haas wrote:
> That's not good. On a typical busy system, a system is going to be in
> the middle of a checkpoint most of the time, and the checkpoint will
> take a long time to finish - maybe minutes.

Or hours, even. Due to the cost of FPWs it can make a lot of sense to
reduce the frequency of that cost...


> We want this feature to respond within milliseconds or a few seconds,
> not minutes. So we need something better here.

Indeed.


> I'm inclined to think
> that we should try to CompleteWALProhibitChange() at the same places
> we AbsorbSyncRequests(). We know from experience that bad things
> happen if we fail to absorb sync requests in a timely fashion, so we
> probably have enough calls to AbsorbSyncRequests() to make sure that
> we always do that work in a timely fashion. So, if we do this work in
> the same place, then it will also be done in a timely fashion.

Sounds sane, without having looked in detail.


> I'm not 100% sure whether that introduces any other problems.
> Certainly, we're not going to be able to finish the checkpoint once
> we've gone read-only, so we'll fail when we try to write the WAL
> record for that, or maybe earlier if there's anything else that tries
> to write WAL. Either the checkpoint needs to error out, like any other
> attempt to write WAL, and we can attempt a new checkpoint if and when
> we go read/write, or else we need to finish writing stuff out to disk
> but not actually write the checkpoint completion record (or any other
> WAL) unless and until the system goes back into read/write mode - and
> then at that point the previously-started checkpoint will finish
> normally. The latter seems better if we can make it work, but the
> former is probably also acceptable. What you've got right now is not.

I mostly wonder which of those two has which implications for how many
FPWs we need to redo. Presumably stalling but not cancelling the current
checkpoint is better?

Greetings,

Andres Freund

Re: [Patch] ALTER SYSTEM READ ONLY

2020-12-09 Thread Andres Freund

On 2020-11-20 11:23:44 -0500, Robert Haas wrote:
> Andres, do you like the new loop better?

I do!

Re: [Patch] ALTER SYSTEM READ ONLY

2020-12-09 Thread Robert Haas

On Sat, Sep 12, 2020 at 1:23 AM Amul Sul  wrote:
> > So, if we're in the middle of a paced checkpoint with a large
> > checkpoint_timeout - a sensible real world configuration - we'll not
> > process ASRO until that checkpoint is over?  That seems very much not
> > practical. What am I missing?
>
> Yes, the process doing ASRO will wait until that checkpoint is over.

That's not good. On a typical busy system, a system is going to be in
the middle of a checkpoint most of the time, and the checkpoint will
take a long time to finish - maybe minutes. We want this feature to
respond within milliseconds or a few seconds, not minutes. So we need
something better here. I'm inclined to think that we should try to
CompleteWALProhibitChange() at the same places we
AbsorbSyncRequests(). We know from experience that bad things happen
if we fail to absorb sync requests in a timely fashion, so we probably
have enough calls to AbsorbSyncRequests() to make sure that we always
do that work in a timely fashion. So, if we do this work in the same
place, then it will also be done in a timely fashion.

I'm not 100% sure whether that introduces any other problems.
Certainly, we're not going to be able to finish the checkpoint once
we've gone read-only, so we'll fail when we try to write the WAL
record for that, or maybe earlier if there's anything else that tries
to write WAL. Either the checkpoint needs to error out, like any other
attempt to write WAL, and we can attempt a new checkpoint if and when
we go read/write, or else we need to finish writing stuff out to disk
but not actually write the checkpoint completion record (or any other
WAL) unless and until the system goes back into read/write mode - and
then at that point the previously-started checkpoint will finish
normally. The latter seems better if we can make it work, but the
former is probably also acceptable. What you've got right now is not.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

1 2 >

1 - 100 of 161 matches

Mail list logo