Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-05 Thread Andres Freund
Hi,

On 2017-06-05 15:30:38 +0900, Michael Paquier wrote:
> + * This will trigger walsenders to send the remaining WAL, prevent them from
> + * accepting further commands. After that they'll wait till the last WAL is
> + * written.
> s/prevent/preventing/?
> I would rephrase the last sentence a bit:
> "After that each WAL sender will wait until the end-of-checkpoint
> record has been flushed on the receiver side."

I didn't like your proposed phrasing much, but I aggree that what I had
wasn't good either.  Tried to improve it.

Thanks for the review.


I pushed this series, this should resolve the issue in this thread
entirely, and should fix a good chunk of the issues in the 'walsender
and parallelism' thread.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-05 Thread Michael Paquier
On Tue, Jun 6, 2017 at 9:47 AM, Andres Freund  wrote:
> On 2017-06-05 15:30:38 +0900, Michael Paquier wrote:
>> I think that it would be interesting to be able to
>> trigger a feedback message using SIGHUP in WAL receivers, refactoring
>> at the same time SIGHUP handling for WAL receivers. It is possible for
>> example to abuse SIGHUP in autovacuum for cost parameters.
>
> Could you clarify a bit here, I can't follow?  Do you think it's
> actually a good idea to combine that with the largely mechanical patch?

Sort of. The thought here is to be able to trigger
XLogWalRcvSendReply() using a SIGHUP, even if force_reply is not
enforced. But looking again at the code, XLogWalRcvSendReply() is
processed only if data is received so sending multiple times the same
message to server would be pointless. Still, don't you think that it
would be helpful to wake up the WAL receiver at will on SIGHUP by
setting its latch? XLogWalRcvSendHSFeedback() could be triggered at
will using that.

ProcessWalRcvInterrupts() could include the checks for SIGHUP by the way...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-05 Thread Andres Freund
On 2017-06-05 15:30:38 +0900, Michael Paquier wrote:
> I have looked at all those patches. The set looks solid to me.

Thanks!


> Here are some comments about 0003.
> +   /*
> +* Have WalSndLoop() terminate the connection in an orderly
> +* manner, after writing out all the pending data.
> +*/
> +   if (got_STOPPING)
> +   got_SIGUSR2 = true;
> I think that for correctness the state of the WAL sender should be
> switched to WALSNDSTATE_STOPPING in XLogSendLogical() as well.

No, that would be wrong.  If we switched here, checkpointer would finish
waiting, even though XLogSendLogical() might get called again.  That
e.g. could happen the TCP socket was full, and XLogSendLogical() gets
called again.


> A more appropriate name would be ConfigReloadPending perhaps?

Hm, ok.


> 0005 looks like a fine one-liner to me.
> 
> For 0006, you could include as well the removal of worker_spi_sighup()
> in the refactoring.

Ok.  I'll leave that patch for now, since I think it's probably better
to apply it only to master once v10 branched off.


> I think that it would be interesting to be able to
> trigger a feedback message using SIGHUP in WAL receivers, refactoring
> at the same time SIGHUP handling for WAL receivers. It is possible for
> example to abuse SIGHUP in autovacuum for cost parameters.

Could you clarify a bit here, I can't follow?  Do you think it's
actually a good idea to combine that with the largely mechanical patch?

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-05 Thread Michael Paquier
On Mon, Jun 5, 2017 at 10:29 AM, Andres Freund  wrote:
> On 2017-06-02 17:20:23 -0700, Andres Freund wrote:
>> Attached is a *preliminary* patch series implementing this.  I've first
>> reverted the previous patch, as otherwise backpatchable versions of the
>> necessary patches would get too complicated, due to the signals used and
>> such.

That makes sense.

> I went again through this, and the only real thing I found that there
> was a leftover prototype in walsender.h.  I've in interim worked on
> backpatch versions of that series, annoying conflicts, but nothing
> really problematic.  The only real difference is adding SetLatch() calls
> to HandleWalSndInitStopping() < 9.6, and guarding SetLatch with an if <
> 9.5.
>
> As an additional patch (based on one by Petr), even though it more
> belongs to
> http://archives.postgresql.org/message-id/20170421014030.fdzvvvbrz4nckrow%40alap3.anarazel.de
> attached is a patch unifying SIGHUP between normal and walsender
> backends.  This needs to be backpatched all the way.  I've also attached
> a second patch, again based on Petr's, that unifies SIGHUP handling
> across all the remaining backends, but that's something that probably
> more appropriate for v11, although I'm still tempted to commit it
> earlier.

I have looked at all those patches. The set looks solid to me.

0001 and 0002 are straight-forward things. It makes sense to unify the
SIGUSR1 handling.

Here are some comments about 0003.

+ * This will trigger walsenders to send the remaining WAL, prevent them from
+ * accepting further commands. After that they'll wait till the last WAL is
+ * written.
s/prevent/preventing/?
I would rephrase the last sentence a bit:
"After that each WAL sender will wait until the end-of-checkpoint
record has been flushed on the receiver side."

+   /*
+* Have WalSndLoop() terminate the connection in an orderly
+* manner, after writing out all the pending data.
+*/
+   if (got_STOPPING)
+   got_SIGUSR2 = true;
I think that for correctness the state of the WAL sender should be
switched to WALSNDSTATE_STOPPING in XLogSendLogical() as well.

About 0004... This definitely meritates a backpatch, PostgresMain() is
taken by WAL senders as well when executing queries.

-   if (got_SIGHUP)
+   if (ConfigRereadPending)
{
-   got_SIGHUP = false;
+   ConfigRereadPending = false;
A more appropriate name would be ConfigReloadPending perhaps?

0005 looks like a fine one-liner to me.

For 0006, you could include as well the removal of worker_spi_sighup()
in the refactoring. I think that it would be interesting to be able to
trigger a feedback message using SIGHUP in WAL receivers, refactoring
at the same time SIGHUP handling for WAL receivers. It is possible for
example to abuse SIGHUP in autovacuum for cost parameters.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-04 Thread Andres Freund
Hi,

On 2017-06-05 10:31:12 +0900, Michael Paquier wrote:
> On Mon, Jun 5, 2017 at 10:29 AM, Andres Freund  wrote:
> > Michael, Peter, Fujii, is either of you planning to review this? I'm
> > planning to commit this tomorrow morning PST, unless somebody protest
> > till then.
> 
> Yes, I am. It would be nice if you could let me 24 hours to look at it
> in details.

Sure.  Could you let me know when you're done?

Noah, I might thus not be able to resolve most of "Query handling in
Walsender is somewhat broken" by tomorrow, but it might end up being
Tuesday.  Even after that there'll be a remaining item after all these.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-04 Thread Michael Paquier
On Mon, Jun 5, 2017 at 10:29 AM, Andres Freund  wrote:
> Michael, Peter, Fujii, is either of you planning to review this? I'm
> planning to commit this tomorrow morning PST, unless somebody protest
> till then.

Yes, I am. It would be nice if you could let me 24 hours to look at it
in details.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-04 Thread Andres Freund
On 2017-06-02 17:20:23 -0700, Andres Freund wrote:
> Attached is a *preliminary* patch series implementing this.  I've first
> reverted the previous patch, as otherwise backpatchable versions of the
> necessary patches would get too complicated, due to the signals used and
> such.

I went again through this, and the only real thing I found that there
was a leftover prototype in walsender.h.  I've in interim worked on
backpatch versions of that series, annoying conflicts, but nothing
really problematic.  The only real difference is adding SetLatch() calls
to HandleWalSndInitStopping() < 9.6, and guarding SetLatch with an if <
9.5.

As an additional patch (based on one by Petr), even though it more
belongs to
http://archives.postgresql.org/message-id/20170421014030.fdzvvvbrz4nckrow%40alap3.anarazel.de
attached is a patch unifying SIGHUP between normal and walsender
backends.  This needs to be backpatched all the way.  I've also attached
a second patch, again based on Petr's, that unifies SIGHUP handling
across all the remaining backends, but that's something that probably
more appropriate for v11, although I'm still tempted to commit it
earlier.

Michael, Peter, Fujii, is either of you planning to review this? I'm
planning to commit this tomorrow morning PST, unless somebody protest
till then.

- Andres
>From 39f95c9e85811d6759a29b293adc97567d895d69 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 2 Jun 2017 14:14:34 -0700
Subject: [PATCH 1/6] Revert "Prevent panic during shutdown checkpoint"

This reverts commit 086221cf6b1727c2baed4703c582f657b7c5350e, which
was made to master only.

The approach implemented in the above commit has some issues.  While
those could easily be fixed incrementally, doing so would make
backpatching considerably harder, so instead first revert this patch.

Discussion: https://postgr.es/m/20170602002912.tqlwn4gymzlxp...@alap3.anarazel.de
---
 doc/src/sgml/monitoring.sgml|   5 -
 src/backend/access/transam/xlog.c   |   6 --
 src/backend/postmaster/postmaster.c |   7 +-
 src/backend/replication/walsender.c | 143 
 src/include/replication/walsender.h |   1 -
 src/include/replication/walsender_private.h |   3 +-
 6 files changed, 24 insertions(+), 141 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 79ca45a156..5640c0d84a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1690,11 +1690,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
backup: This WAL sender is sending a backup.
   
  
- 
-  
-   stopping: This WAL sender is stopping.
-  
- 

  
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 399822d3fe..35ee7d1cb6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8324,12 +8324,6 @@ ShutdownXLOG(int code, Datum arg)
 	ereport(IsPostmasterEnvironment ? LOG : NOTICE,
 			(errmsg("shutting down")));
 
-	/*
-	 * Wait for WAL senders to be in stopping state.  This prevents commands
-	 * from writing new WAL.
-	 */
-	WalSndWaitStopping();
-
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	else
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 35b4ec88d3..5c79b1e40d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2918,7 +2918,7 @@ reaper(SIGNAL_ARGS)
  * Waken walsenders for the last time. No regular backends
  * should be around anymore.
  */
-SignalChildren(SIGINT);
+SignalChildren(SIGUSR2);
 
 pmState = PM_SHUTDOWN_2;
 
@@ -3656,9 +3656,7 @@ PostmasterStateMachine(void)
 /*
  * If we get here, we are proceeding with normal shutdown. All
  * the regular children are gone, and it's time to tell the
- * checkpointer to do a shutdown checkpoint. All WAL senders
- * are told to switch to a stopping state so that the shutdown
- * checkpoint can go ahead.
+ * checkpointer to do a shutdown checkpoint.
  */
 Assert(Shutdown > NoShutdown);
 /* Start the checkpointer if not running */
@@ -3667,7 +3665,6 @@ PostmasterStateMachine(void)
 /* And tell it to shut down */
 if (CheckpointerPID != 0)
 {
-	SignalSomeChildren(SIGUSR2, BACKEND_TYPE_WALSND);
 	signal_child(CheckpointerPID, SIGUSR2);
 	pmState = PM_SHUTDOWN;
 }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 49cce38880..aa705e5b35 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -24,14 +24,11 @@
  * are treated as not a crash but approximately normal termination;
  * the walsender will exit quickly without sending any more XLOG records.
  *
- * If the 

Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-02 Thread Andres Freund
On 2017-06-01 17:29:12 -0700, Andres Freund wrote:
> On 2017-06-02 08:38:51 +0900, Michael Paquier wrote:
> > On Fri, Jun 2, 2017 at 7:05 AM, Andres Freund  wrote:
> > > I'm a unhappy how this is reusing SIGINT for WalSndLastCycleHandler.
> > > Normally INT is used cancel interrupts, and since walsender is now also
> > > working as a normal backend, this overlap is bad.  Even for plain
> > > walsender backend this seems bad, because now non-superusers replication
> > > users will terminate replication connections when they do
> > > pg_cancel_backend().  For replication=dbname users it's especially bad
> > > because there can be completely legitimate uses of pg_cancel_backend().
> >
> > Signals for WAL senders are set in WalSndSignals() which uses SIG_IGN
> > for SIGINT now in ~9.6, and StatementCancelHandler does not get set up
> > for a non-am_walsender backend. Am I missing something?
> 
> Yes, but nothing in those observeration actually addresses my point?
> 
> Some points:
> 
> 1) 086221cf6b1727c2baed4703c582f657b7c5350e changes things so walsender
>backends use SIGINT for WalSndLastCycleHandler(), which is now
>triggerable by pg_cancel_backend(). Especially for logical rep
>walsenders it's not absurd sending that.
> 2) Walsenders now can run normal queries.
> 3) Afaict 086221cf6b1727c2baed4703c582f657b7c5350e doesn't really
>address the PANIC problem for database connected walsenders at all,
>because it doesn't even cancel non-replication commands.  I.e. an
>already running query can just continue to run.  Which afaict just
>entirely breaks shutdown.  If the connection is idle, or running a
>query, we'll just wait forever.
> 4) the whole logic introduced in the above commit doesn't actually
>appear to handle logical decoding senders properly - wasn't the whole
>issue at hand that those can write WAL in some case?  But
>nevertheless WalSndWaitForWal() does a
>WalSndSetState(WALSNDSTATE_STOPPING); *and then continues decoding
>and waiting* - which seems to obviate the entire point of that commit.
> 
> I'm working on a patch rejiggering things so:
> 
> a) upon shutdown checkpointer (so we can use procsignal), not
>postmaster, sends PROCSIG_WALSND_INIT_STOPPING to all walsenders (so
>we don't have to use up a normal signal handler)
> b) Upon reception walsenders immediately exit if !replication_active,
>otherwise sets got_STOPPING
> c) Logical decoding will finish if got_STOPPING is sent, *WITHOUT*, as
>currently done, moving to WALSNDSTATE_STOPPING.  Not yet quite sure
>how to best handle the WALSNDSTATE_STOPPING transition in WalSndLoop().
> d) Once all remaining walsenders are in stopping state, postmaster sends
>SIGUSR2 to trigger shutdown (basically as before)
> 
> Does that seem to make sense?

Attached is a *preliminary* patch series implementing this.  I've first
reverted the previous patch, as otherwise backpatchable versions of the
necessary patches would get too complicated, due to the signals used and
such.

This also fixes several of the issues from the somewhat related thread at
http://archives.postgresql.org/message-id/20170421014030.fdzvvvbrz4nckrow%40alap3.anarazel.de

I'm not perfectly happy with the use of XLogBackgroundFlush() but we
don't currently expose anything else to flush all pending WAL afaics -
it's not too bad either.  Without that we can end up waiting forever
while if the last XLogInserts are done by an asynchronously committing
backend or the relevant backends exited before getting to flush out
their records, because walwriter has already been shut down at that
point.

Comments?

- Andres
>From 187f99cdf98886be954cd1edda275c51b83da5ef Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 2 Jun 2017 14:14:34 -0700
Subject: [PATCH 1/4] Revert "Prevent panic during shutdown checkpoint"

This reverts commit 086221cf6b1727c2baed4703c582f657b7c5350e, which
was made to master only.

The approach implemented in the above commit has some issues.  While
this could easily be fixed incrementally, doing so would make
backpatching considerably harder, so instead first revert this patch.

Discussion: https://postgr.es/m/20170602002912.tqlwn4gymzlxp...@alap3.anarazel.de
---
 doc/src/sgml/monitoring.sgml|   5 -
 src/backend/access/transam/xlog.c   |   6 --
 src/backend/postmaster/postmaster.c |   7 +-
 src/backend/replication/walsender.c | 143 
 src/include/replication/walsender.h |   1 -
 src/include/replication/walsender_private.h |   3 +-
 6 files changed, 24 insertions(+), 141 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 79ca45a156..5640c0d84a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1690,11 +1690,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
backup: This 

Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-01 Thread Robert Haas
On Thu, Jun 1, 2017 at 6:05 PM, Andres Freund  wrote:
> I'm a unhappy how this is reusing SIGINT for WalSndLastCycleHandler.
> Normally INT is used cancel interrupts, and since walsender is now also
> working as a normal backend, this overlap is bad.

Yep, that's bad.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-01 Thread Andres Freund
On 2017-06-02 10:05:21 +0900, Michael Paquier wrote:
> On Fri, Jun 2, 2017 at 9:29 AM, Andres Freund  wrote:
> > On 2017-06-02 08:38:51 +0900, Michael Paquier wrote:
> >> On Fri, Jun 2, 2017 at 7:05 AM, Andres Freund  wrote:
> >> > I'm a unhappy how this is reusing SIGINT for WalSndLastCycleHandler.
> >> > Normally INT is used cancel interrupts, and since walsender is now also
> >> > working as a normal backend, this overlap is bad.  Even for plain
> >> > walsender backend this seems bad, because now non-superusers replication
> >> > users will terminate replication connections when they do
> >> > pg_cancel_backend().  For replication=dbname users it's especially bad
> >> > because there can be completely legitimate uses of pg_cancel_backend().
> >>
> >> Signals for WAL senders are set in WalSndSignals() which uses SIG_IGN
> >> for SIGINT now in ~9.6, and StatementCancelHandler does not get set up
> >> for a non-am_walsender backend. Am I missing something?
> >
> > Yes, but nothing in those observeration actually addresses my point?
> 
> I am still confused by your previous email, which, at least it seems
> to me, implies that logical WAL senders have been working correctly
> with query cancellations. Now SIGINT is just ignored, which means that
> pg_cancel_backend() has never worked for WAL senders until now, and
> this behavior has not changed with 086221c. So there is no new
> breakage introduced by this commit. I get your point to reuse SIGINT
> for query cancellations though, but that's a new feature.

The issue is that the commit made a non-existant feature
(pg_cancel_backend() to walsenders) into a broken one (pg_cancel_backend
terminates walsenders).  Additionally v10 added something new
(walsenders executing SQL), and that will need at least some signal
handling fixes - hard to do if e.g. SIGINT is reused for something else.

> > a) upon shutdown checkpointer (so we can use procsignal), not
> >postmaster, sends PROCSIG_WALSND_INIT_STOPPING to all walsenders (so
> >we don't have to use up a normal signal handler)
> 
> You'll need a second one that wakes up the latch of the WAL senders to
> send more WAL records.

Don't think so, procsignal_sigusr1_handler serves quite well for that.
There's nearby discussion that we need to do so anyway, to fix recovery
conflict interrupts, parallelism interrupts and such.


> > b) Upon reception walsenders immediately exit if !replication_active,
> >otherwise sets got_STOPPING
> 
> Okay, that's what happens now anyway, any new replication command
> received results in an error. I actually prefer the way of doing in
> HEAD, which at least reports an error.

Err, no. What happens right now is that plainly nothing is done if a
connection is idle or busy executing things.  Only if a new command is
sent we error out - that makes very little sense.


> > c) Logical decoding will finish if got_STOPPING is sent, *WITHOUT*, as
> >currently done, moving to WALSNDSTATE_STOPPING.  Not yet quite sure
> >how to best handle the WALSNDSTATE_STOPPING transition in WalSndLoop().
> 
> Wouldn't it make sense to have the logical receivers be able to
> receive WAL up to the end of checkpoint record?

Yea, that's what I'm doing.  For that we really only need to change the
check in WalSndWaitForWal() check of got_SIGINT to got_STOPPING, and add
a XLogSendLogical() check in the WalSndCaughtUp if() that sets
got_SIGUSR2 *without* setting WALSNDSTATE_STOPPING (otherwise we'd
possibly continue to trigger wal records until the send buffer is
emptied).

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-01 Thread Michael Paquier
On Fri, Jun 2, 2017 at 9:29 AM, Andres Freund  wrote:
> On 2017-06-02 08:38:51 +0900, Michael Paquier wrote:
>> On Fri, Jun 2, 2017 at 7:05 AM, Andres Freund  wrote:
>> > I'm a unhappy how this is reusing SIGINT for WalSndLastCycleHandler.
>> > Normally INT is used cancel interrupts, and since walsender is now also
>> > working as a normal backend, this overlap is bad.  Even for plain
>> > walsender backend this seems bad, because now non-superusers replication
>> > users will terminate replication connections when they do
>> > pg_cancel_backend().  For replication=dbname users it's especially bad
>> > because there can be completely legitimate uses of pg_cancel_backend().
>>
>> Signals for WAL senders are set in WalSndSignals() which uses SIG_IGN
>> for SIGINT now in ~9.6, and StatementCancelHandler does not get set up
>> for a non-am_walsender backend. Am I missing something?
>
> Yes, but nothing in those observeration actually addresses my point?

I am still confused by your previous email, which, at least it seems
to me, implies that logical WAL senders have been working correctly
with query cancellations. Now SIGINT is just ignored, which means that
pg_cancel_backend() has never worked for WAL senders until now, and
this behavior has not changed with 086221c. So there is no new
breakage introduced by this commit. I get your point to reuse SIGINT
for query cancellations though, but that's a new feature.

> Some points:
>
> 1) 086221cf6b1727c2baed4703c582f657b7c5350e changes things so walsender
>backends use SIGINT for WalSndLastCycleHandler(), which is now
>triggerable by pg_cancel_backend(). Especially for logical rep
>walsenders it's not absurd sending that.
> 2) Walsenders now can run normal queries.
> 3) Afaict 086221cf6b1727c2baed4703c582f657b7c5350e doesn't really
>address the PANIC problem for database connected walsenders at all,
>because it doesn't even cancel non-replication commands.  I.e. an
>already running query can just continue to run.  Which afaict just
>entirely breaks shutdown.  If the connection is idle, or running a
>query, we'll just wait forever.
> 4) the whole logic introduced in the above commit doesn't actually
>appear to handle logical decoding senders properly - wasn't the whole
>issue at hand that those can write WAL in some case?  But
>nevertheless WalSndWaitForWal() does a
>WalSndSetState(WALSNDSTATE_STOPPING); *and then continues decoding
>and waiting* - which seems to obviate the entire point of that commit.
>
> I'm working on a patch rejiggering things so:
>
> a) upon shutdown checkpointer (so we can use procsignal), not
>postmaster, sends PROCSIG_WALSND_INIT_STOPPING to all walsenders (so
>we don't have to use up a normal signal handler)

You'll need a second one that wakes up the latch of the WAL senders to
send more WAL records.

> b) Upon reception walsenders immediately exit if !replication_active,
>otherwise sets got_STOPPING

Okay, that's what happens now anyway, any new replication command
received results in an error. I actually prefer the way of doing in
HEAD, which at least reports an error.

> c) Logical decoding will finish if got_STOPPING is sent, *WITHOUT*, as
>currently done, moving to WALSNDSTATE_STOPPING.  Not yet quite sure
>how to best handle the WALSNDSTATE_STOPPING transition in WalSndLoop().

Wouldn't it make sense to have the logical receivers be able to
receive WAL up to the end of checkpoint record?

> d) Once all remaining walsenders are in stopping state, postmaster sends
>SIGUSR2 to trigger shutdown (basically as before)

OK.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-01 Thread Andres Freund
On 2017-06-02 08:38:51 +0900, Michael Paquier wrote:
> On Fri, Jun 2, 2017 at 7:05 AM, Andres Freund  wrote:
> > I'm a unhappy how this is reusing SIGINT for WalSndLastCycleHandler.
> > Normally INT is used cancel interrupts, and since walsender is now also
> > working as a normal backend, this overlap is bad.  Even for plain
> > walsender backend this seems bad, because now non-superusers replication
> > users will terminate replication connections when they do
> > pg_cancel_backend().  For replication=dbname users it's especially bad
> > because there can be completely legitimate uses of pg_cancel_backend().
>
> Signals for WAL senders are set in WalSndSignals() which uses SIG_IGN
> for SIGINT now in ~9.6, and StatementCancelHandler does not get set up
> for a non-am_walsender backend. Am I missing something?

Yes, but nothing in those observeration actually addresses my point?

Some points:

1) 086221cf6b1727c2baed4703c582f657b7c5350e changes things so walsender
   backends use SIGINT for WalSndLastCycleHandler(), which is now
   triggerable by pg_cancel_backend(). Especially for logical rep
   walsenders it's not absurd sending that.
2) Walsenders now can run normal queries.
3) Afaict 086221cf6b1727c2baed4703c582f657b7c5350e doesn't really
   address the PANIC problem for database connected walsenders at all,
   because it doesn't even cancel non-replication commands.  I.e. an
   already running query can just continue to run.  Which afaict just
   entirely breaks shutdown.  If the connection is idle, or running a
   query, we'll just wait forever.
4) the whole logic introduced in the above commit doesn't actually
   appear to handle logical decoding senders properly - wasn't the whole
   issue at hand that those can write WAL in some case?  But
   nevertheless WalSndWaitForWal() does a
   WalSndSetState(WALSNDSTATE_STOPPING); *and then continues decoding
   and waiting* - which seems to obviate the entire point of that commit.

I'm working on a patch rejiggering things so:

a) upon shutdown checkpointer (so we can use procsignal), not
   postmaster, sends PROCSIG_WALSND_INIT_STOPPING to all walsenders (so
   we don't have to use up a normal signal handler)
b) Upon reception walsenders immediately exit if !replication_active,
   otherwise sets got_STOPPING
c) Logical decoding will finish if got_STOPPING is sent, *WITHOUT*, as
   currently done, moving to WALSNDSTATE_STOPPING.  Not yet quite sure
   how to best handle the WALSNDSTATE_STOPPING transition in WalSndLoop().
d) Once all remaining walsenders are in stopping state, postmaster sends
   SIGUSR2 to trigger shutdown (basically as before)

Does that seem to make sense?

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-01 Thread Michael Paquier
On Fri, Jun 2, 2017 at 7:05 AM, Andres Freund  wrote:
> I'm a unhappy how this is reusing SIGINT for WalSndLastCycleHandler.
> Normally INT is used cancel interrupts, and since walsender is now also
> working as a normal backend, this overlap is bad.  Even for plain
> walsender backend this seems bad, because now non-superusers replication
> users will terminate replication connections when they do
> pg_cancel_backend().  For replication=dbname users it's especially bad
> because there can be completely legitimate uses of pg_cancel_backend().

Signals for WAL senders are set in WalSndSignals() which uses SIG_IGN
for SIGINT now in ~9.6, and StatementCancelHandler does not get set up
for a non-am_walsender backend. Am I missing something?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-01 Thread Andres Freund
On 2017-05-05 10:50:11 -0400, Peter Eisentraut wrote:
> On 5/5/17 01:26, Michael Paquier wrote:
> > The only code path doing HOT-pruning and generating WAL is
> > heap_page_prune(). Do you think that we need to worry about FPWs as
> > well?
> > 
> > Attached is an updated patch, which also forbids the run of any
> > replication commands when the stopping state is reached.
> 
> I have committed this without the HOT pruning change.  That can be
> considered separately, and I think it could use another round of
> thinking about it.

I'm a unhappy how this is reusing SIGINT for WalSndLastCycleHandler.
Normally INT is used cancel interrupts, and since walsender is now also
working as a normal backend, this overlap is bad.  Even for plain
walsender backend this seems bad, because now non-superusers replication
users will terminate replication connections when they do
pg_cancel_backend().  For replication=dbname users it's especially bad
because there can be completely legitimate uses of pg_cancel_backend().

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-05-27 Thread Michael Paquier
On Fri, May 26, 2017 at 4:47 PM, Peter Eisentraut
 wrote:
> On 5/26/17 14:16, Michael Paquier wrote:
>> So, now that the last round of minor releases has happened and that
>> some dust has settled on this patch, shouldn't there be a backpatch?
>> If yes, do you need patches for all branches? This problems goes down
>> to 9.2 anyway as BASE_BACKUP can generate end-of-backup records.
>
> Yes, this could be backpatched now.  It looks like it will need a bit of
> fiddling to get it into all the backbranches.  If you want to give it a
> closer look, go ahead please.

Attached are patches for 9.2~9.6. There are a couple of conflicts
across each version. Particularly in 9.2, I have made the choice to
not rename walsender_ready_to_stop to got_SIGINT as this is used as
well in basebackup.c to make clearer the code. In 9.3~ the use of this
flag is located only within walsender.c.
-- 
Michael


walsender-shutdown-96.patch
Description: Binary data


walsender-shutdown-95.patch
Description: Binary data


walsender-shutdown-93.patch
Description: Binary data


walsender-shutdown-94.patch
Description: Binary data


walsender-shutdown-92.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-05-26 Thread Peter Eisentraut
On 5/26/17 14:16, Michael Paquier wrote:
> So, now that the last round of minor releases has happened and that
> some dust has settled on this patch, shouldn't there be a backpatch?
> If yes, do you need patches for all branches? This problems goes down
> to 9.2 anyway as BASE_BACKUP can generate end-of-backup records.

Yes, this could be backpatched now.  It looks like it will need a bit of
fiddling to get it into all the backbranches.  If you want to give it a
closer look, go ahead please.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-05-26 Thread Michael Paquier
On Sat, May 6, 2017 at 6:40 AM, Michael Paquier
 wrote:
> Agreed. Just adding an ERROR message in XLogInsert() is not going to
> help much as this leads also to PANIC for critical sections :(
> So a patch really needs to be a no-op for all WAL-related operations
> within the WAL sender, and that will be quite invasive I am afraid.
>
>> I will move the open item to "Older Bugs" now, because the user
>> experience regression, so to speak, in version 10 has been addressed.
>> (This could be a backpatching candidate, but I am not planning on it for
>> next week's releases in any case.)
>
> No issues with all that.

So, now that the last round of minor releases has happened and that
some dust has settled on this patch, shouldn't there be a backpatch?
If yes, do you need patches for all branches? This problems goes down
to 9.2 anyway as BASE_BACKUP can generate end-of-backup records.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-05-06 Thread Michael Paquier
On Fri, May 5, 2017 at 11:50 PM, Peter Eisentraut
 wrote:
> On 5/5/17 01:26, Michael Paquier wrote:
>> The only code path doing HOT-pruning and generating WAL is
>> heap_page_prune(). Do you think that we need to worry about FPWs as
>> well?
>>
>> Attached is an updated patch, which also forbids the run of any
>> replication commands when the stopping state is reached.
>
> I have committed this without the HOT pruning change.  That can be
> considered separately, and I think it could use another round of
> thinking about it.

Agreed. Just adding an ERROR message in XLogInsert() is not going to
help much as this leads also to PANIC for critical sections :(
So a patch really needs to be a no-op for all WAL-related operations
within the WAL sender, and that will be quite invasive I am afraid.

> I will move the open item to "Older Bugs" now, because the user
> experience regression, so to speak, in version 10 has been addressed.
> (This could be a backpatching candidate, but I am not planning on it for
> next week's releases in any case.)

No issues with all that.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-05-05 Thread Peter Eisentraut
On 5/5/17 01:26, Michael Paquier wrote:
> The only code path doing HOT-pruning and generating WAL is
> heap_page_prune(). Do you think that we need to worry about FPWs as
> well?
> 
> Attached is an updated patch, which also forbids the run of any
> replication commands when the stopping state is reached.

I have committed this without the HOT pruning change.  That can be
considered separately, and I think it could use another round of
thinking about it.

I will move the open item to "Older Bugs" now, because the user
experience regression, so to speak, in version 10 has been addressed.

(This could be a backpatching candidate, but I am not planning on it for
next week's releases in any case.)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-05-05 Thread Michael Paquier
On Fri, May 5, 2017 at 5:33 PM, Pavan Deolasee  wrote:
>
>
> On Fri, May 5, 2017 at 10:56 AM, Michael Paquier 
> wrote:
>>
>> On Wed, May 3, 2017 at 12:25 AM, Peter Eisentraut
>>
>>
>> >>> Can we prevent HOT pruning during logical decoding?
>> >>
>> >> It does not sound much difficult to do, couldn't you just make it a
>> >> no-op with am_walsender?
>> >
>> > That's my hope.
>>
>> The only code path doing HOT-pruning and generating WAL is
>> heap_page_prune(). Do you think that we need to worry about FPWs as
>> well?
>
>
> IMO the check should go inside heap_page_prune_opt(). Do we need to worry
> about wal_log_hints or checksums producing WAL because of hint bit updates?
> While I haven't read the thread, I am assuming if HOT pruning can happen,
> surely hint bits can get set too.

Yeah, that's as well what I am worrying about. Experts of logical
decoding will correct me, but it seems to me that we have to cover all
the cases where heap scans can generate WAL.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-05-05 Thread Pavan Deolasee
On Fri, May 5, 2017 at 10:56 AM, Michael Paquier 
wrote:

> On Wed, May 3, 2017 at 12:25 AM, Peter Eisentraut
>
>
> >>> Can we prevent HOT pruning during logical decoding?
> >>
> >> It does not sound much difficult to do, couldn't you just make it a
> >> no-op with am_walsender?
> >
> > That's my hope.
>
> The only code path doing HOT-pruning and generating WAL is
> heap_page_prune(). Do you think that we need to worry about FPWs as
> well?
>

IMO the check should go inside heap_page_prune_opt(). Do we need to worry
about wal_log_hints or checksums producing WAL because of hint bit updates?
While I haven't read the thread, I am assuming if HOT pruning can happen,
surely hint bits can get set too.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-05-04 Thread Michael Paquier
On Wed, May 3, 2017 at 12:25 AM, Peter Eisentraut
 wrote:
> On 5/2/17 10:08, Michael Paquier wrote:
>> On Tue, May 2, 2017 at 9:30 PM, Peter Eisentraut
>>  wrote:
>>> On 5/2/17 03:11, Petr Jelinek wrote:
 logical decoding can theoretically
 do HOT pruning (even if the chance is really small) so it's not safe to
 start logical replication either.
>>>
>>> This seems a bit impossible to resolve.  On the one hand, we want to
>>> allow streaming until after the shutdown checkpoint.  On the other hand,
>>> streaming itself might produce new WAL.
>>
>> It would be nice to split things into two:
>> - patch 1 adding the signal handling that wins a backpatch.
>> - patch 2 fixing the side cases with logical decoding.
>
> The side cases with logical decoding are also not new and would need
> backpatching, AIUI.

Okay, I thought that there was some new concept part of logical
replication here.

>>> Can we prevent HOT pruning during logical decoding?
>>
>> It does not sound much difficult to do, couldn't you just make it a
>> no-op with am_walsender?
>
> That's my hope.

The only code path doing HOT-pruning and generating WAL is
heap_page_prune(). Do you think that we need to worry about FPWs as
well?

Attached is an updated patch, which also forbids the run of any
replication commands when the stopping state is reached.
-- 
Michael
From c8a44b6f84926712b7b6f2b36b4f13b0d1b41977 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 5 May 2017 14:10:16 +0900
Subject: [PATCH] Prevent panic during shutdown checkpoint

When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and throws
a PANIC if so.  At that point, only walsenders are still active, so one
might think this could not happen, but WAL senders can generate WAL
in the context of certain replication commands that can be run during
the shutdown checkpoint:
- certain variants of CREATE_REPLICATION_SLOT.
- BASE_BACKUP and the backend end WAL record.
- START_REPLICATION and logical decoding, able to do HOT-pruning.

To fix this, divide the walsender shutdown into two phases.  First, the
postmaster sends a SIGUSR2 signal to all walsenders.  The walsenders
then put themselves into the "stopping" state.  In this state, they
reject any commands that might generate WAL.  The checkpointer waits for
all walsenders to reach this state before proceeding with the shutdown
checkpoint.  After the shutdown checkpoint is done, the postmaster sends
SIGINT (previously unused) to the walsenders.  This triggers the
existing shutdown behavior of sending out the shutdown checkpointer and
then terminating.

Author: Michael Paquier 
Reported-by: Fujii Masao 
---
 doc/src/sgml/monitoring.sgml|   5 +
 src/backend/access/heap/pruneheap.c |   9 ++
 src/backend/access/transam/xlog.c   |   6 ++
 src/backend/postmaster/postmaster.c |   7 +-
 src/backend/replication/walsender.c | 143 
 src/include/replication/walsender.h |   1 +
 src/include/replication/walsender_private.h |   3 +-
 7 files changed, 150 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2a83671b53..80d12b26d7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1690,6 +1690,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
backup: This WAL sender is sending a backup.
   
  
+ 
+  
+   stopping: This WAL sender is stopping.
+  
+ 

  
 
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index d69a266c36..d510649b18 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -22,6 +22,7 @@
 #include "catalog/catalog.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walsender.h"
 #include "storage/bufmgr.h"
 #include "utils/snapmgr.h"
 #include "utils/rel.h"
@@ -189,6 +190,14 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 	PruneState	prstate;
 
 	/*
+	 * Do nothing in the presence of a WAL sender. This code path can be
+	 * taken when doing logical decoding, and it is better to avoid WAL
+	 * generation as this interferes with shutdown checkpoints.
+	 */
+	if (am_walsender)
+		return ndeleted;
+
+	/*
 	 * Our strategy is to scan the page and make lists of items to change,
 	 * then apply the changes within a critical section.  This keeps as much
 	 * logic as possible out of the critical section, and also ensures that
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a89d99838a..5d6f8b75b8 100644
--- a/src/backend/access/transam/xlog.c
+++ 

Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-05-02 Thread Peter Eisentraut
On 5/2/17 10:08, Michael Paquier wrote:
> On Tue, May 2, 2017 at 9:30 PM, Peter Eisentraut
>  wrote:
>> On 5/2/17 03:11, Petr Jelinek wrote:
>>> logical decoding can theoretically
>>> do HOT pruning (even if the chance is really small) so it's not safe to
>>> start logical replication either.
>>
>> This seems a bit impossible to resolve.  On the one hand, we want to
>> allow streaming until after the shutdown checkpoint.  On the other hand,
>> streaming itself might produce new WAL.
> 
> It would be nice to split things into two:
> - patch 1 adding the signal handling that wins a backpatch.
> - patch 2 fixing the side cases with logical decoding.

The side cases with logical decoding are also not new and would need
backpatching, AIUI.

>> Can we prevent HOT pruning during logical decoding?
> 
> It does not sound much difficult to do, couldn't you just make it a
> no-op with am_walsender?

That's my hope.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-05-02 Thread Michael Paquier
On Tue, May 2, 2017 at 9:30 PM, Peter Eisentraut
 wrote:
> On 5/2/17 03:11, Petr Jelinek wrote:
>> logical decoding can theoretically
>> do HOT pruning (even if the chance is really small) so it's not safe to
>> start logical replication either.
>
> This seems a bit impossible to resolve.  On the one hand, we want to
> allow streaming until after the shutdown checkpoint.  On the other hand,
> streaming itself might produce new WAL.

It would be nice to split things into two:
- patch 1 adding the signal handling that wins a backpatch.
- patch 2 fixing the side cases with logical decoding.

> Can we prevent HOT pruning during logical decoding?

It does not sound much difficult to do, couldn't you just make it a
no-op with am_walsender?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-05-02 Thread Michael Paquier
On Tue, May 2, 2017 at 9:27 PM, Peter Eisentraut
 wrote:
> On 5/2/17 03:43, Michael Paquier wrote:
>>> I don't think the code covers all because a) the SQL queries are not
>>> covered at all that I can see and b) logical decoding can theoretically
>>> do HOT pruning (even if the chance is really small) so it's not safe to
>>> start logical replication either.
>>
>> Ahhh. So START_REPLICATION can also now generate WAL. Good to know.
>
> And just looking at pg_current_wal_location(), running BASE_BACKUP also
> advances the WAL.

Indeed. I forgot the backup end record and the segment switch. We are
good for a backpatch down to 9.2 here.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-05-02 Thread Peter Eisentraut
On 5/2/17 03:11, Petr Jelinek wrote:
> logical decoding can theoretically
> do HOT pruning (even if the chance is really small) so it's not safe to
> start logical replication either.

This seems a bit impossible to resolve.  On the one hand, we want to
allow streaming until after the shutdown checkpoint.  On the other hand,
streaming itself might produce new WAL.

Can we prevent HOT pruning during logical decoding?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-05-02 Thread Peter Eisentraut
On 5/2/17 03:43, Michael Paquier wrote:
>> I don't think the code covers all because a) the SQL queries are not
>> covered at all that I can see and b) logical decoding can theoretically
>> do HOT pruning (even if the chance is really small) so it's not safe to
>> start logical replication either.
> 
> Ahhh. So START_REPLICATION can also now generate WAL. Good to know.

And just looking at pg_current_wal_location(), running BASE_BACKUP also
advances the WAL.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-05-02 Thread Michael Paquier
On Tue, May 2, 2017 at 4:11 PM, Petr Jelinek
 wrote:
> On 02/05/17 05:35, Michael Paquier wrote:
>> On Tue, May 2, 2017 at 7:07 AM, Peter Eisentraut
>>  wrote:
>>> On 4/25/17 21:47, Michael Paquier wrote:
 Attached is an updated patch to reflect that.
>>>
>>> I edited this a bit, here is a new version.
>>
>> Thanks, looks fine for me.
>>
>>> A variant approach would be to prohibit *all* new commands after
>>> entering the "stopping" state, just let running commands run.  That way
>>> we don't have to pick which individual commands are at risk.  I'm not
>>> sure that we have covered everything here.
>>
>> It seems to me that everything is covered. We are taking about
>> creation and dropping of slots here, where standby snapshots can be
>> created and SQL queries can be run when doing a tablesync meaning that
>> FPWs could be taken in the context of the WAL sender. Blocking all
>> commands would be surely safer I agree, but I see no reason to block
>> things more than necessary.
>>
>
> I don't think the code covers all because a) the SQL queries are not
> covered at all that I can see and b) logical decoding can theoretically
> do HOT pruning (even if the chance is really small) so it's not safe to
> start logical replication either.

Ahhh. So START_REPLICATION can also now generate WAL. Good to know.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-05-02 Thread Petr Jelinek
On 02/05/17 05:35, Michael Paquier wrote:
> On Tue, May 2, 2017 at 7:07 AM, Peter Eisentraut
>  wrote:
>> On 4/25/17 21:47, Michael Paquier wrote:
>>> Attached is an updated patch to reflect that.
>>
>> I edited this a bit, here is a new version.
> 
> Thanks, looks fine for me.
> 
>> A variant approach would be to prohibit *all* new commands after
>> entering the "stopping" state, just let running commands run.  That way
>> we don't have to pick which individual commands are at risk.  I'm not
>> sure that we have covered everything here.
> 
> It seems to me that everything is covered. We are taking about
> creation and dropping of slots here, where standby snapshots can be
> created and SQL queries can be run when doing a tablesync meaning that
> FPWs could be taken in the context of the WAL sender. Blocking all
> commands would be surely safer I agree, but I see no reason to block
> things more than necessary.
> 

I don't think the code covers all because a) the SQL queries are not
covered at all that I can see and b) logical decoding can theoretically
do HOT pruning (even if the chance is really small) so it's not safe to
start logical replication either.

I wonder if this whole prevent thing should just be called
unconditionally on walsender that's connected to database
(am_db_walsender), because in the WAL logging will only happen there -
CREATE_REPLICATION_SLOT PHYSICAL will not WAL log and
CREATE_REPLICATION_SLOT LOGICAL can't be run without being connected to
db, neither can logical decoding and SQL queries, so that coves all.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-05-01 Thread Michael Paquier
On Tue, May 2, 2017 at 7:07 AM, Peter Eisentraut
 wrote:
> On 4/25/17 21:47, Michael Paquier wrote:
>> Attached is an updated patch to reflect that.
>
> I edited this a bit, here is a new version.

Thanks, looks fine for me.

> A variant approach would be to prohibit *all* new commands after
> entering the "stopping" state, just let running commands run.  That way
> we don't have to pick which individual commands are at risk.  I'm not
> sure that we have covered everything here.

It seems to me that everything is covered. We are taking about
creation and dropping of slots here, where standby snapshots can be
created and SQL queries can be run when doing a tablesync meaning that
FPWs could be taken in the context of the WAL sender. Blocking all
commands would be surely safer I agree, but I see no reason to block
things more than necessary.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-05-01 Thread Peter Eisentraut
On 4/25/17 21:47, Michael Paquier wrote:
> Attached is an updated patch to reflect that.

I edited this a bit, here is a new version.

A variant approach would be to prohibit *all* new commands after
entering the "stopping" state, just let running commands run.  That way
we don't have to pick which individual commands are at risk.  I'm not
sure that we have covered everything here.

More reviews please.  Also, this is a possible backpatching candidate.

Also, if someone has a test script for reproducing the original issue,
please share it, or run it against this patch.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v3-0001-Prevent-panic-during-shutdown-checkpoint.patch
Description: invalid/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-25 Thread Michael Paquier
On Wed, Apr 26, 2017 at 3:17 AM, Peter Eisentraut
 wrote:
> On 4/21/17 00:11, Michael Paquier wrote:
>> Hmm. I have been actually looking at this solution and I am having
>> doubts regarding its robustness. In short this would need to be
>> roughly a two-step process:
>> - In PostmasterStateMachine(), SIGUSR2 is sent to the checkpoint to
>> make it call ShutdownXLOG(). Prior doing that, a first signal should
>> be sent to all the WAL senders with
>> SignalSomeChildren(BACKEND_TYPE_WALSND). SIGUSR2 or SIGINT could be
>> used.
>> - At reception of this signal, all WAL senders switch to a stopping
>> state, refusing commands that can generate WAL.
>> - Checkpointer looks at the state of all WAL senders, looping with a
>> sleep call of a couple of ms, refusing to launch the shutdown
>> checkpoint as long as all WAL senders have not switched to the
>> stopping state.
>> - In reaper(), once checkpointer is confirmed as stopped, signal again
>> the WAL senders, and tell them to perform the last loop.
>
> Yeah that looks like a reasonable approach.
>
> I'm not sure why in your patch you process got_SIGUSR2 in
> WalSndErrorCleanup() instead of in the main loop.

Yes I was hesitating about this one when hacking it. Thinking an extra
time, the similar check in StartReplication() should also not use
got_SIGUSR2 to give the WAL sender a chance to do more work while the
shutdown checkpoint is running as it could take minutes.

Attached is an updated patch to reflect that.
-- 
Michael


walsender-chkpt-v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-25 Thread Peter Eisentraut
On 4/21/17 00:11, Michael Paquier wrote:
> Hmm. I have been actually looking at this solution and I am having
> doubts regarding its robustness. In short this would need to be
> roughly a two-step process:
> - In PostmasterStateMachine(), SIGUSR2 is sent to the checkpoint to
> make it call ShutdownXLOG(). Prior doing that, a first signal should
> be sent to all the WAL senders with
> SignalSomeChildren(BACKEND_TYPE_WALSND). SIGUSR2 or SIGINT could be
> used.
> - At reception of this signal, all WAL senders switch to a stopping
> state, refusing commands that can generate WAL.
> - Checkpointer looks at the state of all WAL senders, looping with a
> sleep call of a couple of ms, refusing to launch the shutdown
> checkpoint as long as all WAL senders have not switched to the
> stopping state.
> - In reaper(), once checkpointer is confirmed as stopped, signal again
> the WAL senders, and tell them to perform the last loop.

Yeah that looks like a reasonable approach.

I'm not sure why in your patch you process got_SIGUSR2 in
WalSndErrorCleanup() instead of in the main loop.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-23 Thread Michael Paquier
On Sun, Apr 23, 2017 at 10:15 AM, Petr Jelinek
 wrote:
> On 21/04/17 06:11, Michael Paquier wrote:
>> On Fri, Apr 21, 2017 at 12:29 AM, Peter Eisentraut
>>  wrote:
>> Hmm. I have been actually looking at this solution and I am having
>> doubts regarding its robustness. In short this would need to be
>> roughly a two-step process:
>> - In PostmasterStateMachine(), SIGUSR2 is sent to the checkpoint to
>> make it call ShutdownXLOG(). Prior doing that, a first signal should
>> be sent to all the WAL senders with
>> SignalSomeChildren(BACKEND_TYPE_WALSND). SIGUSR2 or SIGINT could be
>> used.
>> - At reception of this signal, all WAL senders switch to a stopping
>> state, refusing commands that can generate WAL.
>> - Checkpointer looks at the state of all WAL senders, looping with a
>> sleep call of a couple of ms, refusing to launch the shutdown
>> checkpoint as long as all WAL senders have not switched to the
>> stopping state.
>> - In reaper(), once checkpointer is confirmed as stopped, signal again
>> the WAL senders, and tell them to perform the last loop.

OK, I have been hacking that, finishing with the attached. In the
attached I am using SIGUSR2 to instruct the WAL senders to prepare for
stopping, and SIGINT to handle the last WAL flush loop. The shutdown
checkpoint moves on only if all active WAL senders are marked with a
STOPPING state. Reviews as welcome.

>> After that, I got a second, more simple idea.
>> CheckpointerShmem->ckpt_flags holds the information about checkpoints
>> currently running, so we could have the WAL senders look at this data
>> and prevent any commands generating WAL. The checkpointer may be
>> already stopped at the moment the WAL senders finish their loop, so we
>> need also to check if the checkpointer is running or not on those code
>> paths. Such safeguards may actually be enough for the problem of this
>> thread. Thoughts?
>>
>
> Hmm but how do we handle statements that are already in progress by the
> time ckpt_flags changes?

Yup, this does not handle well race conditions.
-- 
Michael


walsender-chkpt-v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-22 Thread Petr Jelinek
On 21/04/17 06:11, Michael Paquier wrote:
> On Fri, Apr 21, 2017 at 12:29 AM, Peter Eisentraut
>  wrote:
>> On 4/20/17 07:52, Petr Jelinek wrote:
>>> On 20/04/17 05:57, Michael Paquier wrote:
 2nd thoughts here... Ah now I see your point. True that there is no
 way to ensure that an unwanted command is not running when SIGUSR2 is
 received as the shutdown checkpoint may have already begun. Here is an
 idea: add a new state in WalSndState, say WALSNDSTATE_STOPPING, and
 the shutdown checkpoint does not run as long as all WAL senders still
 running do not reach such a state.
>>>
>>> +1 to this solution
>>
>> Michael, can you attempt to supply a patch?
> 
> Hmm. I have been actually looking at this solution and I am having
> doubts regarding its robustness. In short this would need to be
> roughly a two-step process:
> - In PostmasterStateMachine(), SIGUSR2 is sent to the checkpoint to
> make it call ShutdownXLOG(). Prior doing that, a first signal should
> be sent to all the WAL senders with
> SignalSomeChildren(BACKEND_TYPE_WALSND). SIGUSR2 or SIGINT could be
> used.
> - At reception of this signal, all WAL senders switch to a stopping
> state, refusing commands that can generate WAL.
> - Checkpointer looks at the state of all WAL senders, looping with a
> sleep call of a couple of ms, refusing to launch the shutdown
> checkpoint as long as all WAL senders have not switched to the
> stopping state.
> - In reaper(), once checkpointer is confirmed as stopped, signal again
> the WAL senders, and tell them to perform the last loop.
> 
> After that, I got a second, more simple idea.
> CheckpointerShmem->ckpt_flags holds the information about checkpoints
> currently running, so we could have the WAL senders look at this data
> and prevent any commands generating WAL. The checkpointer may be
> already stopped at the moment the WAL senders finish their loop, so we
> need also to check if the checkpointer is running or not on those code
> paths. Such safeguards may actually be enough for the problem of this
> thread. Thoughts?
> 

Hmm but how do we handle statements that are already in progress by the
time ckpt_flags changes?

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-20 Thread Michael Paquier
On Fri, Apr 21, 2017 at 12:29 AM, Peter Eisentraut
 wrote:
> On 4/20/17 07:52, Petr Jelinek wrote:
>> On 20/04/17 05:57, Michael Paquier wrote:
>>> 2nd thoughts here... Ah now I see your point. True that there is no
>>> way to ensure that an unwanted command is not running when SIGUSR2 is
>>> received as the shutdown checkpoint may have already begun. Here is an
>>> idea: add a new state in WalSndState, say WALSNDSTATE_STOPPING, and
>>> the shutdown checkpoint does not run as long as all WAL senders still
>>> running do not reach such a state.
>>
>> +1 to this solution
>
> Michael, can you attempt to supply a patch?

Hmm. I have been actually looking at this solution and I am having
doubts regarding its robustness. In short this would need to be
roughly a two-step process:
- In PostmasterStateMachine(), SIGUSR2 is sent to the checkpoint to
make it call ShutdownXLOG(). Prior doing that, a first signal should
be sent to all the WAL senders with
SignalSomeChildren(BACKEND_TYPE_WALSND). SIGUSR2 or SIGINT could be
used.
- At reception of this signal, all WAL senders switch to a stopping
state, refusing commands that can generate WAL.
- Checkpointer looks at the state of all WAL senders, looping with a
sleep call of a couple of ms, refusing to launch the shutdown
checkpoint as long as all WAL senders have not switched to the
stopping state.
- In reaper(), once checkpointer is confirmed as stopped, signal again
the WAL senders, and tell them to perform the last loop.

After that, I got a second, more simple idea.
CheckpointerShmem->ckpt_flags holds the information about checkpoints
currently running, so we could have the WAL senders look at this data
and prevent any commands generating WAL. The checkpointer may be
already stopped at the moment the WAL senders finish their loop, so we
need also to check if the checkpointer is running or not on those code
paths. Such safeguards may actually be enough for the problem of this
thread. Thoughts?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-20 Thread Peter Eisentraut
On 4/20/17 07:52, Petr Jelinek wrote:
> On 20/04/17 05:57, Michael Paquier wrote:
>> 2nd thoughts here... Ah now I see your point. True that there is no
>> way to ensure that an unwanted command is not running when SIGUSR2 is
>> received as the shutdown checkpoint may have already begun. Here is an
>> idea: add a new state in WalSndState, say WALSNDSTATE_STOPPING, and
>> the shutdown checkpoint does not run as long as all WAL senders still
>> running do not reach such a state.
> 
> +1 to this solution

Michael, can you attempt to supply a patch?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-20 Thread Petr Jelinek
On 20/04/17 05:57, Michael Paquier wrote:
> On Thu, Apr 20, 2017 at 12:40 PM, Michael Paquier
>  wrote:
>> On Thu, Apr 20, 2017 at 4:57 AM, Peter Eisentraut
>>  wrote:
>>> I think the problem with a signal-based solution is that there is no
>>> feedback.  Ideally, you would wait for all walsenders to acknowledge the
>>> receipt of SIGUSR2 (or similar) and only then proceed with the shutdown
>>> checkpoint.
>>
>> Are you sure that it is necessary to go to such extent? Why wouldn't
>> it be enough to prevent any replication commands generating WAL to run
>> when the WAL sender knows that the postmaster is in shutdown mode?
> 
> 2nd thoughts here... Ah now I see your point. True that there is no
> way to ensure that an unwanted command is not running when SIGUSR2 is
> received as the shutdown checkpoint may have already begun. Here is an
> idea: add a new state in WalSndState, say WALSNDSTATE_STOPPING, and
> the shutdown checkpoint does not run as long as all WAL senders still
> running do not reach such a state.
> 

+1 to this solution

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-19 Thread Michael Paquier
On Thu, Apr 20, 2017 at 12:40 PM, Michael Paquier
 wrote:
> On Thu, Apr 20, 2017 at 4:57 AM, Peter Eisentraut
>  wrote:
>> I think the problem with a signal-based solution is that there is no
>> feedback.  Ideally, you would wait for all walsenders to acknowledge the
>> receipt of SIGUSR2 (or similar) and only then proceed with the shutdown
>> checkpoint.
>
> Are you sure that it is necessary to go to such extent? Why wouldn't
> it be enough to prevent any replication commands generating WAL to run
> when the WAL sender knows that the postmaster is in shutdown mode?

2nd thoughts here... Ah now I see your point. True that there is no
way to ensure that an unwanted command is not running when SIGUSR2 is
received as the shutdown checkpoint may have already begun. Here is an
idea: add a new state in WalSndState, say WALSNDSTATE_STOPPING, and
the shutdown checkpoint does not run as long as all WAL senders still
running do not reach such a state.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-19 Thread Michael Paquier
On Thu, Apr 20, 2017 at 4:57 AM, Peter Eisentraut
 wrote:
> On 4/19/17 01:45, Michael Paquier wrote:
>> On Tue, Apr 18, 2017 at 3:27 AM, Peter Eisentraut
>>  wrote:
>>> I'd imagine the postmaster would tell the walsender that it has started
>>> shutdown, and then the walsender would reject $certain_things.  But I
>>> don't see an existing way for the walsender to know that shutdown has
>>> been initiated.  SIGINT is still free ...
>>
>> The WAL sender receives SIGUSR2 from the postmaster when shutdown is
>> initiated, so why not just rely on that and issue an ERROR when a
>> client attempts to create or drop a new slot, setting up
>> walsender_ready_to_stop unconditionally? It seems to me that the issue
>> here is the delay between the moment SIGTERM is acknowledged by the
>> WAL sender and the moment CREATE_SLOT is treated. An idea with the
>> attached...
>
> I think the problem with a signal-based solution is that there is no
> feedback.  Ideally, you would wait for all walsenders to acknowledge the
> receipt of SIGUSR2 (or similar) and only then proceed with the shutdown
> checkpoint.

Are you sure that it is necessary to go to such extent? Why wouldn't
it be enough to prevent any replication commands generating WAL to run
when the WAL sender knows that the postmaster is in shutdown mode?
-- 
Michael
VMware vCenter Server
www.vmware.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-19 Thread Peter Eisentraut
On 4/19/17 01:45, Michael Paquier wrote:
> On Tue, Apr 18, 2017 at 3:27 AM, Peter Eisentraut
>  wrote:
>> I'd imagine the postmaster would tell the walsender that it has started
>> shutdown, and then the walsender would reject $certain_things.  But I
>> don't see an existing way for the walsender to know that shutdown has
>> been initiated.  SIGINT is still free ...
> 
> The WAL sender receives SIGUSR2 from the postmaster when shutdown is
> initiated, so why not just rely on that and issue an ERROR when a
> client attempts to create or drop a new slot, setting up
> walsender_ready_to_stop unconditionally? It seems to me that the issue
> here is the delay between the moment SIGTERM is acknowledged by the
> WAL sender and the moment CREATE_SLOT is treater. An idea with the
> attached...

I think the problem with a signal-based solution is that there is no
feedback.  Ideally, you would wait for all walsenders to acknowledge the
receipt of SIGUSR2 (or similar) and only then proceed with the shutdown
checkpoint.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-18 Thread Michael Paquier
On Tue, Apr 18, 2017 at 3:27 AM, Peter Eisentraut
 wrote:
> I'd imagine the postmaster would tell the walsender that it has started
> shutdown, and then the walsender would reject $certain_things.  But I
> don't see an existing way for the walsender to know that shutdown has
> been initiated.  SIGINT is still free ...

The WAL sender receives SIGUSR2 from the postmaster when shutdown is
initiated, so why not just rely on that and issue an ERROR when a
client attempts to create or drop a new slot, setting up
walsender_ready_to_stop unconditionally? It seems to me that the issue
here is the delay between the moment SIGTERM is acknowledged by the
WAL sender and the moment CREATE_SLOT is treater. An idea with the
attached...

> The alternative of shutting down logical walsenders earlier also doesn't
> look straightforward, since the postmaster doesn't know directly what
> kind of walsender a certain process is.  So you'd also need additional
> signal types or something there.

Yup, but is a switchover between a publisher and a subscriber
something that can happen?
-- 
Michael


walsender-shutdown-fix.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-17 Thread Peter Eisentraut
On 4/17/17 12:30, Andres Freund wrote:
 So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot()
 and which generates WAL record about snapshot of running transactions.
>>>
>>> Erroring out in these cases sounds easy enough.  Wonder if there's not a
>>> bigger problem with WAL records generated e.g. by HOT pruning or such,
>>> during decoding.  Not super likely, but would probably hit exactly the
>>> same, no?
>>
>> Sounds possible, yes. Sounds like that's going to be nontrivial to fix
>> though.
>>
>> Another problem is that queries can run on walsender now. But that
>> should be possible to detect and shutdown just like backend.
> 
> This sounds like a case for s/PANIC/ERROR|FATAL/ to me...

I'd imagine the postmaster would tell the walsender that it has started
shutdown, and then the walsender would reject $certain_things.  But I
don't see an existing way for the walsender to know that shutdown has
been initiated.  SIGINT is still free ...

The alternative of shutting down logical walsenders earlier also doesn't
look straightforward, since the postmaster doesn't know directly what
kind of walsender a certain process is.  So you'd also need additional
signal types or something there.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-17 Thread Andres Freund
On 2017-04-17 18:28:16 +0200, Petr Jelinek wrote:
> On 17/04/17 18:02, Andres Freund wrote:
> > On 2017-04-15 02:33:59 +0900, Fujii Masao wrote:
> >> On Fri, Apr 14, 2017 at 10:33 PM, Petr Jelinek
> >>  wrote:
> >>> On 12/04/17 15:55, Fujii Masao wrote:
>  Hi,
> 
>  When I shut down the publisher while I repeated creating and dropping
>  the subscription in the subscriber, the publisher emitted the following
>  PANIC error during shutdown checkpoint.
> 
>  PANIC:  concurrent transaction log activity while database system is
>  shutting down
> 
>  The cause of this problem is that walsender for logical replication can
>  generate WAL records even during shutdown checkpoint.
> 
>  Firstly walsender keeps running until shutdown checkpoint finishes
>  so that all the WAL including shutdown checkpoint record can be
>  replicated to the standby. This was safe because previously walsender
>  could not generate WAL records. However this assumption became
>  invalid because of logical replication. That is, currenty walsender for
>  logical replication can generate WAL records, for example, by executing
>  CREATE_REPLICATION_SLOT command. This is an oversight in
>  logical replication patch, I think.
> >>>
> >>> Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree
> >>> that the issue with walsender still exist (since we now allow normal SQL
> >>> to run there) but I think it's important to identify what exactly causes
> >>> the WAL activity in your case
> >>
> >> At least in my case, the following CREATE_REPLICATION_SLOT command
> >> generated WAL record.
> >>
> >> BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
> >> CREATE_REPLICATION_SLOT testslot TEMPORARY LOGICAL pgoutput 
> >> USE_SNAPSHOT;
> >>
> >> Here is the pg_waldump output of the WAL record that 
> >> CREATE_REPLICATION_SLOT
> >> generated.
> >>
> >> rmgr: Standby len (rec/tot): 24/50, tx:  0,
> >> lsn: 0/01601438, prev 0/01601400, desc: RUNNING_XACTS nextXid 692
> >> latestCompletedXid 691 oldestRunningXid 692
> >>
> >> So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot()
> >> and which generates WAL record about snapshot of running transactions.
> > 
> > Erroring out in these cases sounds easy enough.  Wonder if there's not a
> > bigger problem with WAL records generated e.g. by HOT pruning or such,
> > during decoding.  Not super likely, but would probably hit exactly the
> > same, no?
> > 
> 
> Sounds possible, yes. Sounds like that's going to be nontrivial to fix
> though.
> 
> Another problem is that queries can run on walsender now. But that
> should be possible to detect and shutdown just like backend.

This sounds like a case for s/PANIC/ERROR|FATAL/ to me...


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-17 Thread Petr Jelinek
On 17/04/17 18:02, Andres Freund wrote:
> On 2017-04-15 02:33:59 +0900, Fujii Masao wrote:
>> On Fri, Apr 14, 2017 at 10:33 PM, Petr Jelinek
>>  wrote:
>>> On 12/04/17 15:55, Fujii Masao wrote:
 Hi,

 When I shut down the publisher while I repeated creating and dropping
 the subscription in the subscriber, the publisher emitted the following
 PANIC error during shutdown checkpoint.

 PANIC:  concurrent transaction log activity while database system is
 shutting down

 The cause of this problem is that walsender for logical replication can
 generate WAL records even during shutdown checkpoint.

 Firstly walsender keeps running until shutdown checkpoint finishes
 so that all the WAL including shutdown checkpoint record can be
 replicated to the standby. This was safe because previously walsender
 could not generate WAL records. However this assumption became
 invalid because of logical replication. That is, currenty walsender for
 logical replication can generate WAL records, for example, by executing
 CREATE_REPLICATION_SLOT command. This is an oversight in
 logical replication patch, I think.
>>>
>>> Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree
>>> that the issue with walsender still exist (since we now allow normal SQL
>>> to run there) but I think it's important to identify what exactly causes
>>> the WAL activity in your case
>>
>> At least in my case, the following CREATE_REPLICATION_SLOT command
>> generated WAL record.
>>
>> BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
>> CREATE_REPLICATION_SLOT testslot TEMPORARY LOGICAL pgoutput USE_SNAPSHOT;
>>
>> Here is the pg_waldump output of the WAL record that CREATE_REPLICATION_SLOT
>> generated.
>>
>> rmgr: Standby len (rec/tot): 24/50, tx:  0,
>> lsn: 0/01601438, prev 0/01601400, desc: RUNNING_XACTS nextXid 692
>> latestCompletedXid 691 oldestRunningXid 692
>>
>> So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot()
>> and which generates WAL record about snapshot of running transactions.
> 
> Erroring out in these cases sounds easy enough.  Wonder if there's not a
> bigger problem with WAL records generated e.g. by HOT pruning or such,
> during decoding.  Not super likely, but would probably hit exactly the
> same, no?
> 

Sounds possible, yes. Sounds like that's going to be nontrivial to fix
though.

Another problem is that queries can run on walsender now. But that
should be possible to detect and shutdown just like backend.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-17 Thread Andres Freund
On 2017-04-15 02:33:59 +0900, Fujii Masao wrote:
> On Fri, Apr 14, 2017 at 10:33 PM, Petr Jelinek
>  wrote:
> > On 12/04/17 15:55, Fujii Masao wrote:
> >> Hi,
> >>
> >> When I shut down the publisher while I repeated creating and dropping
> >> the subscription in the subscriber, the publisher emitted the following
> >> PANIC error during shutdown checkpoint.
> >>
> >> PANIC:  concurrent transaction log activity while database system is
> >> shutting down
> >>
> >> The cause of this problem is that walsender for logical replication can
> >> generate WAL records even during shutdown checkpoint.
> >>
> >> Firstly walsender keeps running until shutdown checkpoint finishes
> >> so that all the WAL including shutdown checkpoint record can be
> >> replicated to the standby. This was safe because previously walsender
> >> could not generate WAL records. However this assumption became
> >> invalid because of logical replication. That is, currenty walsender for
> >> logical replication can generate WAL records, for example, by executing
> >> CREATE_REPLICATION_SLOT command. This is an oversight in
> >> logical replication patch, I think.
> >
> > Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree
> > that the issue with walsender still exist (since we now allow normal SQL
> > to run there) but I think it's important to identify what exactly causes
> > the WAL activity in your case
> 
> At least in my case, the following CREATE_REPLICATION_SLOT command
> generated WAL record.
> 
> BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
> CREATE_REPLICATION_SLOT testslot TEMPORARY LOGICAL pgoutput USE_SNAPSHOT;
> 
> Here is the pg_waldump output of the WAL record that CREATE_REPLICATION_SLOT
> generated.
> 
> rmgr: Standby len (rec/tot): 24/50, tx:  0,
> lsn: 0/01601438, prev 0/01601400, desc: RUNNING_XACTS nextXid 692
> latestCompletedXid 691 oldestRunningXid 692
> 
> So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot()
> and which generates WAL record about snapshot of running transactions.

Erroring out in these cases sounds easy enough.  Wonder if there's not a
bigger problem with WAL records generated e.g. by HOT pruning or such,
during decoding.  Not super likely, but would probably hit exactly the
same, no?

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-14 Thread Petr Jelinek
On 14/04/17 21:05, Peter Eisentraut wrote:
> On 4/14/17 14:23, Petr Jelinek wrote:
>> Ah yes looking at the code, it does exactly that (on master only). Means
>> that backport will be necessary.
> 
> I think these two sentences are contradicting each other.
> 

Hehe, didn't realize master will be taken as master branch, I meant
master as in not standby :)

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-14 Thread Peter Eisentraut
On 4/14/17 14:23, Petr Jelinek wrote:
> Ah yes looking at the code, it does exactly that (on master only). Means
> that backport will be necessary.

I think these two sentences are contradicting each other.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-14 Thread Petr Jelinek
On 14/04/17 19:33, Fujii Masao wrote:
> On Fri, Apr 14, 2017 at 10:33 PM, Petr Jelinek
>  wrote:
>> On 12/04/17 15:55, Fujii Masao wrote:
>>> Hi,
>>>
>>> When I shut down the publisher while I repeated creating and dropping
>>> the subscription in the subscriber, the publisher emitted the following
>>> PANIC error during shutdown checkpoint.
>>>
>>> PANIC:  concurrent transaction log activity while database system is
>>> shutting down
>>>
>>> The cause of this problem is that walsender for logical replication can
>>> generate WAL records even during shutdown checkpoint.
>>>
>>> Firstly walsender keeps running until shutdown checkpoint finishes
>>> so that all the WAL including shutdown checkpoint record can be
>>> replicated to the standby. This was safe because previously walsender
>>> could not generate WAL records. However this assumption became
>>> invalid because of logical replication. That is, currenty walsender for
>>> logical replication can generate WAL records, for example, by executing
>>> CREATE_REPLICATION_SLOT command. This is an oversight in
>>> logical replication patch, I think.
>>
>> Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree
>> that the issue with walsender still exist (since we now allow normal SQL
>> to run there) but I think it's important to identify what exactly causes
>> the WAL activity in your case
> 
> At least in my case, the following CREATE_REPLICATION_SLOT command
> generated WAL record.
> 
> BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
> CREATE_REPLICATION_SLOT testslot TEMPORARY LOGICAL pgoutput USE_SNAPSHOT;
> 
> Here is the pg_waldump output of the WAL record that CREATE_REPLICATION_SLOT
> generated.
> 
> rmgr: Standby len (rec/tot): 24/50, tx:  0,
> lsn: 0/01601438, prev 0/01601400, desc: RUNNING_XACTS nextXid 692
> latestCompletedXid 691 oldestRunningXid 692
> 
> So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot()
> and which generates WAL record about snapshot of running transactions.
> 

Ah yes looking at the code, it does exactly that (on master only). Means
that backport will be necessary.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-14 Thread Fujii Masao
On Fri, Apr 14, 2017 at 10:33 PM, Petr Jelinek
 wrote:
> On 12/04/17 15:55, Fujii Masao wrote:
>> Hi,
>>
>> When I shut down the publisher while I repeated creating and dropping
>> the subscription in the subscriber, the publisher emitted the following
>> PANIC error during shutdown checkpoint.
>>
>> PANIC:  concurrent transaction log activity while database system is
>> shutting down
>>
>> The cause of this problem is that walsender for logical replication can
>> generate WAL records even during shutdown checkpoint.
>>
>> Firstly walsender keeps running until shutdown checkpoint finishes
>> so that all the WAL including shutdown checkpoint record can be
>> replicated to the standby. This was safe because previously walsender
>> could not generate WAL records. However this assumption became
>> invalid because of logical replication. That is, currenty walsender for
>> logical replication can generate WAL records, for example, by executing
>> CREATE_REPLICATION_SLOT command. This is an oversight in
>> logical replication patch, I think.
>
> Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree
> that the issue with walsender still exist (since we now allow normal SQL
> to run there) but I think it's important to identify what exactly causes
> the WAL activity in your case

At least in my case, the following CREATE_REPLICATION_SLOT command
generated WAL record.

BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
CREATE_REPLICATION_SLOT testslot TEMPORARY LOGICAL pgoutput USE_SNAPSHOT;

Here is the pg_waldump output of the WAL record that CREATE_REPLICATION_SLOT
generated.

rmgr: Standby len (rec/tot): 24/50, tx:  0,
lsn: 0/01601438, prev 0/01601400, desc: RUNNING_XACTS nextXid 692
latestCompletedXid 691 oldestRunningXid 692

So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot()
and which generates WAL record about snapshot of running transactions.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-14 Thread Petr Jelinek
On 12/04/17 15:55, Fujii Masao wrote:
> Hi,
> 
> When I shut down the publisher while I repeated creating and dropping
> the subscription in the subscriber, the publisher emitted the following
> PANIC error during shutdown checkpoint.
> 
> PANIC:  concurrent transaction log activity while database system is
> shutting down
> 
> The cause of this problem is that walsender for logical replication can
> generate WAL records even during shutdown checkpoint.
> 
> Firstly walsender keeps running until shutdown checkpoint finishes
> so that all the WAL including shutdown checkpoint record can be
> replicated to the standby. This was safe because previously walsender
> could not generate WAL records. However this assumption became
> invalid because of logical replication. That is, currenty walsender for
> logical replication can generate WAL records, for example, by executing
> CREATE_REPLICATION_SLOT command. This is an oversight in
> logical replication patch, I think.

Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree
that the issue with walsender still exist (since we now allow normal SQL
to run there) but I think it's important to identify what exactly causes
the WAL activity in your case - if it's one of the replication commands
and not SQL then we'll need to backpatch any solution we come up with to
9.4, if it's not replication command, we only need to fix 10.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-13 Thread Michael Paquier
On Fri, Apr 14, 2017 at 3:03 AM, Fujii Masao  wrote:
> On Thu, Apr 13, 2017 at 12:36 PM, Michael Paquier
>  wrote:
>> On Thu, Apr 13, 2017 at 12:28 PM, Fujii Masao  wrote:
>>> On Thu, Apr 13, 2017 at 5:25 AM, Peter Eisentraut
>>>  wrote:
 On 4/12/17 09:55, Fujii Masao wrote:
> To fix this issue, we should terminate walsender for logical replication
> before shutdown checkpoint starts. Of course walsender for physical
> replication still needs to keep running until shutdown checkpoint ends,
> though.

 Can we turn it into a kind of read-only or no-new-commands mode instead,
 so it can keep streaming but not accept any new actions?
>>>
>>> So we make walsenders switch to that mode and wait for all the 
>>> already-ongoing
>>> their "write" commands to finish, and then we start a shutdown checkpoint?
>>> This is an idea, but seems a bit complicated. ISTM that it's simpler to
>>> terminate only walsenders for logical rep before shutdown checkpoint.
>>
>> Perhaps my memory is failing me here... But in clean shutdowns we do
>> shut down WAL senders after the checkpoint has completed so as we are
>> sure that they have flushed the LSN corresponding to the checkpoint
>> ending, right?
>
> Yes.
>
>> Why introducing an inconsistency for logical workers?
>> It seems to me that logical workers should fail under the same rules.
>
> Could you tell me why? You think that even walsender for logical rep
> needs to stream the shutdown checkpoint WAL record to the subscriber?
> I was not thinking that's true.

For physical replication, the property to wait that standbys have
flushed the LSN of the shutdown checkpoint can be important for
switchovers. For example, with a primary and a standby, it is possible
to stop cleanly the master, promote the standby, and then connect back
to the cluster the old primary as a standby to the now-new primary
with the guarantee that both are in a consistent state. It seems to me
that having similar guarantees for logical replication is important.

Now, I have not reviewed the code of logirep in details at the level
of Peter, Petr or yourself...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-13 Thread Fujii Masao
On Thu, Apr 13, 2017 at 12:36 PM, Michael Paquier
 wrote:
> On Thu, Apr 13, 2017 at 12:28 PM, Fujii Masao  wrote:
>> On Thu, Apr 13, 2017 at 5:25 AM, Peter Eisentraut
>>  wrote:
>>> On 4/12/17 09:55, Fujii Masao wrote:
 To fix this issue, we should terminate walsender for logical replication
 before shutdown checkpoint starts. Of course walsender for physical
 replication still needs to keep running until shutdown checkpoint ends,
 though.
>>>
>>> Can we turn it into a kind of read-only or no-new-commands mode instead,
>>> so it can keep streaming but not accept any new actions?
>>
>> So we make walsenders switch to that mode and wait for all the 
>> already-ongoing
>> their "write" commands to finish, and then we start a shutdown checkpoint?
>> This is an idea, but seems a bit complicated. ISTM that it's simpler to
>> terminate only walsenders for logical rep before shutdown checkpoint.
>
> Perhaps my memory is failing me here... But in clean shutdowns we do
> shut down WAL senders after the checkpoint has completed so as we are
> sure that they have flushed the LSN corresponding to the checkpoint
> ending, right?

Yes.

> Why introducing an inconsistency for logical workers?
> It seems to me that logical workers should fail under the same rules.

Could you tell me why? You think that even walsender for logical rep
needs to stream the shutdown checkpoint WAL record to the subscriber?
I was not thinking that's true.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-12 Thread Michael Paquier
On Thu, Apr 13, 2017 at 12:28 PM, Fujii Masao  wrote:
> On Thu, Apr 13, 2017 at 5:25 AM, Peter Eisentraut
>  wrote:
>> On 4/12/17 09:55, Fujii Masao wrote:
>>> To fix this issue, we should terminate walsender for logical replication
>>> before shutdown checkpoint starts. Of course walsender for physical
>>> replication still needs to keep running until shutdown checkpoint ends,
>>> though.
>>
>> Can we turn it into a kind of read-only or no-new-commands mode instead,
>> so it can keep streaming but not accept any new actions?
>
> So we make walsenders switch to that mode and wait for all the already-ongoing
> their "write" commands to finish, and then we start a shutdown checkpoint?
> This is an idea, but seems a bit complicated. ISTM that it's simpler to
> terminate only walsenders for logical rep before shutdown checkpoint.

Perhaps my memory is failing me here... But in clean shutdowns we do
shut down WAL senders after the checkpoint has completed so as we are
sure that they have flushed the LSN corresponding to the checkpoint
ending, right? Why introducing an inconsistency for logical workers?
It seems to me that logical workers should fail under the same rules.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-12 Thread Fujii Masao
On Thu, Apr 13, 2017 at 5:25 AM, Peter Eisentraut
 wrote:
> On 4/12/17 09:55, Fujii Masao wrote:
>> To fix this issue, we should terminate walsender for logical replication
>> before shutdown checkpoint starts. Of course walsender for physical
>> replication still needs to keep running until shutdown checkpoint ends,
>> though.
>
> Can we turn it into a kind of read-only or no-new-commands mode instead,
> so it can keep streaming but not accept any new actions?

So we make walsenders switch to that mode and wait for all the already-ongoing
their "write" commands to finish, and then we start a shutdown checkpoint?
This is an idea, but seems a bit complicated. ISTM that it's simpler to
terminate only walsenders for logical rep before shutdown checkpoint.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-12 Thread Peter Eisentraut
On 4/12/17 09:55, Fujii Masao wrote:
> To fix this issue, we should terminate walsender for logical replication
> before shutdown checkpoint starts. Of course walsender for physical
> replication still needs to keep running until shutdown checkpoint ends,
> though.

Can we turn it into a kind of read-only or no-new-commands mode instead,
so it can keep streaming but not accept any new actions?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-04-12 Thread Fujii Masao
Hi,

When I shut down the publisher while I repeated creating and dropping
the subscription in the subscriber, the publisher emitted the following
PANIC error during shutdown checkpoint.

PANIC:  concurrent transaction log activity while database system is
shutting down

The cause of this problem is that walsender for logical replication can
generate WAL records even during shutdown checkpoint.

Firstly walsender keeps running until shutdown checkpoint finishes
so that all the WAL including shutdown checkpoint record can be
replicated to the standby. This was safe because previously walsender
could not generate WAL records. However this assumption became
invalid because of logical replication. That is, currenty walsender for
logical replication can generate WAL records, for example, by executing
CREATE_REPLICATION_SLOT command. This is an oversight in
logical replication patch, I think.

To fix this issue, we should terminate walsender for logical replication
before shutdown checkpoint starts. Of course walsender for physical
replication still needs to keep running until shutdown checkpoint ends,
though.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers