[HACKERS] Streaming replication - unable to stop the standby

2010-05-03 Thread Stefan Kaltenbrunner
I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems 
quite easy to end up in a situation where you have a standby that seems 
to be stuck in:


$ psql -p 5433
psql: FATAL:  the database system is shutting down

but not not actually shuting down ever. I ran into that a few times now 
(mostly because I'm trying to chase a recovery issue I hit during 
earlier testing) by simply having the master iterate between a pgbench 
run and idle while simple doing pg_ctl restart in a loop on the standby.
I do vaguely recall some discussions of that but I thought the issue git 
settled somehow?



Stefan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication - unable to stop the standby

2010-05-03 Thread Tom Lane
Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
 I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems 
 quite easy to end up in a situation where you have a standby that seems 
 to be stuck in:

 $ psql -p 5433
 psql: FATAL:  the database system is shutting down

 but not not actually shuting down ever. I ran into that a few times now 
 (mostly because I'm trying to chase a recovery issue I hit during 
 earlier testing) by simply having the master iterate between a pgbench 
 run and idle while simple doing pg_ctl restart in a loop on the standby.
 I do vaguely recall some discussions of that but I thought the issue git 
 settled somehow?

Hm, I haven't pushed this hard but pg_ctl stop seems to stop the
standby for me.  Which subprocesses of the slave postmaster are still
around?  Could you attach to them with gdb and get stack traces?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication - unable to stop the standby

2010-05-03 Thread Robert Haas
On Mon, May 3, 2010 at 2:04 PM, Stefan Kaltenbrunner
ste...@kaltenbrunner.cc wrote:
 I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems quite
 easy to end up in a situation where you have a standby that seems to be
 stuck in:

 $ psql -p 5433
 psql: FATAL:  the database system is shutting down

 but not not actually shuting down ever. I ran into that a few times now
 (mostly because I'm trying to chase a recovery issue I hit during earlier
 testing) by simply having the master iterate between a pgbench run and
 idle while simple doing pg_ctl restart in a loop on the standby.
 I do vaguely recall some discussions of that but I thought the issue git
 settled somehow?

Yes - I thought it was too.  Specifically, I thought I fixed it.  The
default mode is 'smart' shutdown, just as it is on the primary, so it
won't shut down until all clients have disconnected, but it should
work provided you don't leave a session somewhere.  Can you describe
steps to reproduce?

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication - unable to stop the standby

2010-05-03 Thread Stefan Kaltenbrunner

Tom Lane wrote:

Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems 
quite easy to end up in a situation where you have a standby that seems 
to be stuck in:



$ psql -p 5433
psql: FATAL:  the database system is shutting down


but not not actually shuting down ever. I ran into that a few times now 
(mostly because I'm trying to chase a recovery issue I hit during 
earlier testing) by simply having the master iterate between a pgbench 
run and idle while simple doing pg_ctl restart in a loop on the standby.
I do vaguely recall some discussions of that but I thought the issue git 
settled somehow?


Hm, I haven't pushed this hard but pg_ctl stop seems to stop the
standby for me.  Which subprocesses of the slave postmaster are still
around?  Could you attach to them with gdb and get stack traces?


it is not always failing to shut down - it only fails sometimes - I have 
not exactly pinpointed yet what it is causing this but the standby is in 
a weird state now:


* the master is currently idle
* the standby has no connections at all

logs from the standby:

FATAL:  the database system is shutting down
FATAL:  the database system is shutting down
FATAL:  replication terminated by primary server
LOG:  restored log file 000100190054 from archive
cp: cannot stat `/mnt/space/wal-archive/000100190055': No 
such file or directory

LOG:  record with zero length at 19/5578
cp: cannot stat `/mnt/space/wal-archive/000100190055': No 
such file or directory
FATAL:  could not connect to the primary server: could not connect to 
server: Connection refused

Is the server running on host localhost and accepting
TCP/IP connections on port 5432?
could not connect to server: Connection refused
Is the server running on host localhost and accepting
TCP/IP connections on port 5432?

cp: cannot stat `/mnt/space/wal-archive/000100190055': No 
such file or directory
cp: cannot stat `/mnt/space/wal-archive/000100190055': No 
such file or directory

LOG:  streaming replication successfully connected to primary
FATAL:  the database system is shutting down


the first two FATAL: the database system is shutting down are from me 
trying to connect using psql after i noticed that pg_ctl failed to 
shutdown the slave.
The next thing I tried was restarting the master - which lead to the 
following logs and the standby noticing that and reconnecting but you 
cannot actually connect...


process tree for the standby is:

29523 pts/2S  0:00 /home/postgres9/pginst/bin/postgres -D 
/mnt/space/pgdata_standby
29524 ?Ss 0:06  \_ postgres: startup process   waiting for 
000100190055
29529 ?Ss 0:00  \_ postgres: writer process 

29835 ?Ss 0:00  \_ postgres: wal receiver process 
streaming 19/5578




Stefan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication - unable to stop the standby

2010-05-03 Thread Stefan Kaltenbrunner

Robert Haas wrote:

On Mon, May 3, 2010 at 2:04 PM, Stefan Kaltenbrunner
ste...@kaltenbrunner.cc wrote:

I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems quite
easy to end up in a situation where you have a standby that seems to be
stuck in:

$ psql -p 5433
psql: FATAL:  the database system is shutting down

but not not actually shuting down ever. I ran into that a few times now
(mostly because I'm trying to chase a recovery issue I hit during earlier
testing) by simply having the master iterate between a pgbench run and
idle while simple doing pg_ctl restart in a loop on the standby.
I do vaguely recall some discussions of that but I thought the issue git
settled somehow?


Yes - I thought it was too.  Specifically, I thought I fixed it.  The
default mode is 'smart' shutdown, just as it is on the primary, so it
won't shut down until all clients have disconnected, but it should
work provided you don't leave a session somewhere.  Can you describe
steps to reproduce?


well this is basically master and standby on the same box - with the 
master doing short pgbench interleaved with a sleep 20, the standby is 
doing nothing in terms of queries and just executing pg_ctl restart in a 
loop(simulating a typical maintainance reboot of a standby).



Stefan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication - unable to stop the standby

2010-05-03 Thread Robert Haas
On Mon, May 3, 2010 at 2:22 PM, Stefan Kaltenbrunner
ste...@kaltenbrunner.cc wrote:
 Tom Lane wrote:

 Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:

 I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems quite
 easy to end up in a situation where you have a standby that seems to be
 stuck in:

 $ psql -p 5433
 psql: FATAL:  the database system is shutting down

 but not not actually shuting down ever. I ran into that a few times now
 (mostly because I'm trying to chase a recovery issue I hit during earlier
 testing) by simply having the master iterate between a pgbench run and
 idle while simple doing pg_ctl restart in a loop on the standby.
 I do vaguely recall some discussions of that but I thought the issue git
 settled somehow?

 Hm, I haven't pushed this hard but pg_ctl stop seems to stop the
 standby for me.  Which subprocesses of the slave postmaster are still
 around?  Could you attach to them with gdb and get stack traces?

 it is not always failing to shut down - it only fails sometimes - I have not
 exactly pinpointed yet what it is causing this but the standby is in a weird
 state now:

 * the master is currently idle
 * the standby has no connections at all

 logs from the standby:

 FATAL:  the database system is shutting down
 FATAL:  the database system is shutting down
 FATAL:  replication terminated by primary server
 LOG:  restored log file 000100190054 from archive
 cp: cannot stat `/mnt/space/wal-archive/000100190055': No such
 file or directory
 LOG:  record with zero length at 19/5578
 cp: cannot stat `/mnt/space/wal-archive/000100190055': No such
 file or directory
 FATAL:  could not connect to the primary server: could not connect to
 server: Connection refused
                Is the server running on host localhost and accepting
                TCP/IP connections on port 5432?
        could not connect to server: Connection refused
                Is the server running on host localhost and accepting
                TCP/IP connections on port 5432?

 cp: cannot stat `/mnt/space/wal-archive/000100190055': No such
 file or directory
 cp: cannot stat `/mnt/space/wal-archive/000100190055': No such
 file or directory
 LOG:  streaming replication successfully connected to primary
 FATAL:  the database system is shutting down


 the first two FATAL: the database system is shutting down are from me
 trying to connect using psql after i noticed that pg_ctl failed to shutdown
 the slave.
 The next thing I tried was restarting the master - which lead to the
 following logs and the standby noticing that and reconnecting but you cannot
 actually connect...

 process tree for the standby is:

 29523 pts/2    S      0:00 /home/postgres9/pginst/bin/postgres -D
 /mnt/space/pgdata_standby
 29524 ?        Ss     0:06  \_ postgres: startup process   waiting for
 000100190055
 29529 ?        Ss     0:00  \_ postgres: writer process
 29835 ?        Ss     0:00  \_ postgres: wal receiver process streaming
 19/5578

uninformed-speculation

Hmm.  When I committed that patch to fix smart shutdown on the
standby, we discussed the fact that the startup process can't simply
release its locks and die at shutdown time because the locks it holds
prevent other backends from seeing the database in an inconsistent
state.  Therefore, if we were to terminate recovery as soon as the
smart shutdown request is received, we might never complete, because a
backend might be waiting on a lock that will never get released.  If
that's really a danger scenario, then it follows that we might also
fail to shut down if we can't connect to the primary, because we might
not be able to replay enough WAL to release the locks the remaining
backends are waiting for.  That sort of looks like what is happening
to you, except based on your test scenario I can't figure out where
this came from:

FATAL:  replication terminated by primary server

/uninformed-speculation

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication - unable to stop the standby

2010-05-03 Thread Stefan Kaltenbrunner

Robert Haas wrote:

On Mon, May 3, 2010 at 2:22 PM, Stefan Kaltenbrunner
ste...@kaltenbrunner.cc wrote:

Tom Lane wrote:

Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:

I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems quite
easy to end up in a situation where you have a standby that seems to be
stuck in:
$ psql -p 5433
psql: FATAL:  the database system is shutting down
but not not actually shuting down ever. I ran into that a few times now
(mostly because I'm trying to chase a recovery issue I hit during earlier
testing) by simply having the master iterate between a pgbench run and
idle while simple doing pg_ctl restart in a loop on the standby.
I do vaguely recall some discussions of that but I thought the issue git
settled somehow?

Hm, I haven't pushed this hard but pg_ctl stop seems to stop the
standby for me.  Which subprocesses of the slave postmaster are still
around?  Could you attach to them with gdb and get stack traces?

it is not always failing to shut down - it only fails sometimes - I have not
exactly pinpointed yet what it is causing this but the standby is in a weird
state now:

* the master is currently idle
* the standby has no connections at all

logs from the standby:

FATAL:  the database system is shutting down
FATAL:  the database system is shutting down
FATAL:  replication terminated by primary server
LOG:  restored log file 000100190054 from archive
cp: cannot stat `/mnt/space/wal-archive/000100190055': No such
file or directory
LOG:  record with zero length at 19/5578
cp: cannot stat `/mnt/space/wal-archive/000100190055': No such
file or directory
FATAL:  could not connect to the primary server: could not connect to
server: Connection refused
   Is the server running on host localhost and accepting
   TCP/IP connections on port 5432?
   could not connect to server: Connection refused
   Is the server running on host localhost and accepting
   TCP/IP connections on port 5432?

cp: cannot stat `/mnt/space/wal-archive/000100190055': No such
file or directory
cp: cannot stat `/mnt/space/wal-archive/000100190055': No such
file or directory
LOG:  streaming replication successfully connected to primary
FATAL:  the database system is shutting down


the first two FATAL: the database system is shutting down are from me
trying to connect using psql after i noticed that pg_ctl failed to shutdown
the slave.
The next thing I tried was restarting the master - which lead to the
following logs and the standby noticing that and reconnecting but you cannot
actually connect...

process tree for the standby is:

29523 pts/2S  0:00 /home/postgres9/pginst/bin/postgres -D
/mnt/space/pgdata_standby
29524 ?Ss 0:06  \_ postgres: startup process   waiting for
000100190055
29529 ?Ss 0:00  \_ postgres: writer process
29835 ?Ss 0:00  \_ postgres: wal receiver process streaming
19/5578


uninformed-speculation

Hmm.  When I committed that patch to fix smart shutdown on the
standby, we discussed the fact that the startup process can't simply
release its locks and die at shutdown time because the locks it holds
prevent other backends from seeing the database in an inconsistent
state.  Therefore, if we were to terminate recovery as soon as the
smart shutdown request is received, we might never complete, because a
backend might be waiting on a lock that will never get released.  If
that's really a danger scenario, then it follows that we might also
fail to shut down if we can't connect to the primary, because we might
not be able to replay enough WAL to release the locks the remaining
backends are waiting for.  That sort of looks like what is happening
to you, except based on your test scenario I can't figure out where
this came from:

FATAL:  replication terminated by primary server


as I said before I restarted the master at that point, the standby 
logged the above, restored 000100190054 from the archive, 
tried reconnecting and logged the connection refused. a few seconds 
later the master was up again and the standby succeeded in reconnecting.



Stefan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication - unable to stop the standby

2010-05-03 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Hmm.  When I committed that patch to fix smart shutdown on the
 standby, we discussed the fact that the startup process can't simply
 release its locks and die at shutdown time because the locks it holds
 prevent other backends from seeing the database in an inconsistent
 state.  Therefore, if we were to terminate recovery as soon as the
 smart shutdown request is received, we might never complete, because a
 backend might be waiting on a lock that will never get released.  If
 that's really a danger scenario, then it follows that we might also
 fail to shut down if we can't connect to the primary, because we might
 not be able to replay enough WAL to release the locks the remaining
 backends are waiting for.  That sort of looks like what is happening
 to you, except based on your test scenario I can't figure out where
 this came from:

 FATAL:  replication terminated by primary server

I suspect you have it right, because my experiments where the standby
did shut down correctly were all done with an idle master.

Seems like we could go ahead and forcibly kill the startup process *once
all the standby backends are gone*.  There is then no need to worry
about not releasing locks, and re-establishing a consistent state when
we later restart is logic that we have to have anyway.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication - unable to stop the standby

2010-05-03 Thread Robert Haas
On Mon, May 3, 2010 at 2:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Hmm.  When I committed that patch to fix smart shutdown on the
 standby, we discussed the fact that the startup process can't simply
 release its locks and die at shutdown time because the locks it holds
 prevent other backends from seeing the database in an inconsistent
 state.  Therefore, if we were to terminate recovery as soon as the
 smart shutdown request is received, we might never complete, because a
 backend might be waiting on a lock that will never get released.  If
 that's really a danger scenario, then it follows that we might also
 fail to shut down if we can't connect to the primary, because we might
 not be able to replay enough WAL to release the locks the remaining
 backends are waiting for.  That sort of looks like what is happening
 to you, except based on your test scenario I can't figure out where
 this came from:

 FATAL:  replication terminated by primary server

 I suspect you have it right, because my experiments where the standby
 did shut down correctly were all done with an idle master.

 Seems like we could go ahead and forcibly kill the startup process *once
 all the standby backends are gone*.  There is then no need to worry
 about not releasing locks, and re-establishing a consistent state when
 we later restart is logic that we have to have anyway.

That's exactly what we already do.  The problem is that smart shutdown
doesn't actually kill off the standby backends - it waits for them to
exit on their own.  Except, if they're blocking on a lock that's never
going to get released, then they never do.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers