Re: [GENERAL] the database system is shutting down - terminating walsender process due to replication timeout

2017-11-06 Thread Zarko Aleksic
Thanks for you reply. In the meantime we figured out what the issue was. We had 
a virtual IP that was being released from the master server just as the master 
would try to stream the last checkpoint. This caused wal_sender process to hang 
until the timeout value was reached. Technically speaking, it was a networking 
problem, but not on the switch/router, rather on the host it self. Stopping the 
replication from standby to the (movable) master VIP was the solution that we 
went with.


Best regards,
Zarko


From: Tom Lane 
Sent: Wednesday, November 1, 2017 10:25 AM
To: Zarko Aleksic
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] the database system is shutting down - terminating 
walsender process due to replication timeout

Zarko Aleksic  writes:
> I'm looking for a bit of help understanding a particular behavior we are 
> seeing with our PostgreSQL 9.6. After issuing a service shutdown command with 
> "systemctl stop" on RHEL 7 our PostgreSQL instance started behaving weirdly. 
> For the first time it wouldn't shutdown so easily / quickly.
> From the logs we could see that standby nodes that were trying to connect 
> were rejected due to database being shutdown. After wal_sender_timeout and 
> wal_receiver_timeout (default 60s) were reached the database finally shut 
> down. It seems that walsender process was preventing the shutdown of the 
> master database - until timeout was reached, a behavior we didn't experience 
> before.

9.6.what?

There were several possibly-relevant bug fixes in 9.6.3 and 9.6.4,
notably this one:

Author: Tom Lane 
Branch: master Release: REL_10_BR [fca85f8ef] 2017-06-30 12:00:15 -0400
Branch: REL9_6_STABLE Release: REL9_6_4 [e9d4aa594] 2017-06-30 12:00:03 -0400
Branch: REL9_5_STABLE Release: REL9_5_8 [446914f6b] 2017-06-30 12:00:03 -0400
Branch: REL9_4_STABLE Release: REL9_4_13 [5aa8db014] 2017-06-30 12:00:03 -0400

Fix walsender to exit promptly if client requests shutdown.

It's possible for WalSndWaitForWal to be asked to wait for WAL that doesn't
exist yet.  That's fine, in fact it's the normal situation if we're caught
up; but when the client requests shutdown we should not keep waiting.
The previous coding could wait indefinitely if the source server was idle.

In passing, improve the rather weak comments in this area, and slightly
rearrange some related code for better readability.

Back-patch to 9.4 where this code was introduced.

Discussion: https://postgr.es/m/14154.1498781...@sss.pgh.pa.us

I think that would only apply if the walsender's client had tried to
disconnect at the same time you were doing the master-server shutdown,
but maybe that's what happened.

There is still work going on around the walsender timeout, so maybe
what you hit is an as-yet-unresolved bug, but in any case you should
be keeping up with minor releases.

regards, tom lane


[GENERAL] the database system is shutting down - terminating walsender process due to replication timeout

2017-11-01 Thread Zarko Aleksic
Greetings everyone,


I'm looking for a bit of help understanding a particular behavior we are seeing 
with our PostgreSQL 9.6. After issuing a service shutdown command with 
"systemctl stop" on RHEL 7 our PostgreSQL instance started behaving weirdly. 
For the first time it wouldn't shutdown so easily / quickly.


>From the logs we could see that standby nodes that were trying to connect were 
>rejected due to database being shutdown. After wal_sender_timeout and 
>wal_receiver_timeout (default 60s) were reached the database finally shut 
>down. It seems that walsender process was preventing the shutdown of the 
>master database - until timeout was reached, a behavior we didn't experience 
>before.


Does anyone know why would this happen?


We have 1 standby node in our primary site (same subnet as master DB), and two 
standbys in a remote site. WAL archiving is enabled to the remote site with 
rsync command that worked normally during this time and generally completes 
within a couple of seconds - definitely less than a minute. So we kind of ruled 
out WAL archiving.

Would shutting down remote site standbys prevent this kind of delay, they 
usually have a couple of seconds lag in terms of replication (pg_xlog) location?


Here is the postgres log from master that was being shutdown.

2017-10-26 22:04:01 CDT [1701]: [6-1]  user=,db= LOG:  received fast shutdown 
request
2017-10-26 22:04:01 CDT [1701]: [7-1]  user=,db= LOG:  aborting any active 
transactions
2017-10-26 22:04:01 CDT [1711]: [2-1]  user=,db= LOG:  dbms_aq launcher 
shutting down
2017-10-26 22:04:01 CDT [1708]: [2-1]  user=,db= LOG:  autovacuum launcher 
shutting down
2017-10-26 22:04:01 CDT [1705]: [9971-1]  user=,db= LOG:  shutting down
2017-10-26 22:04:01 CDT [1705]: [9972-1]  user=,db= LOG:  checkpoint starting: 
shutdown immediate
2017-10-26 22:04:01 CDT [1705]: [9973-1]  user=,db= LOG:  checkpoint complete: 
wrote 54 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 
recycled; write=0.017 s, sync=0.001 s, total=0.026 s; sync files=12, 
longest=0.000 s, average=0.000 s; distance=1100 kB, estimate=3307 kB
2017-10-26 22:04:01 CDT [1705]: [9974-1]  user=,db= LOG:  database system is 
shut down

2017-10-26 22:04:36 CDT [8763]: [1-1] "Local site standby IP"(49606) 
user=replication_user ,db=[unknown] FATAL:  the database system is shutting down

[OMITTED - a bunch of "FATAL:  the database system is shutting down" messages]

2017-10-26 22:05:00 CDT [2669]: [1-1] "Disaster recovery site standby IP" 
(55498) user=replication_user,db=[unknown] LOG:  terminating walsender process 
due to replication timeout

2017-10-26 22:05:00 CDT [1465]: [1-1] "Disaster recovery site standby 
IP"(36948) user=replication_user,db=[unknown] LOG:  terminating walsender 
process due to replication timeout
2017-10-26 22:05:01 CDT [1701]: [8-1]  user=,db= LOG:  database system is shut 
down

Standbys had the same message regarding walreceiver being terminated due to 
replication timeout.

Any help is greatly appreciated.

Thanks,
Zarko



[GENERAL] the database system is shutting down - terminating walsender process due to replication timeout

2017-11-01 Thread Zarko Aleksic

Greetings everyone,


I'm looking for a bit of help understanding a particular behavior we are seeing 
with our PostgreSQL 9.6. After issuing a service shutdown command with 
"systemctl stop" on RHEL 7 our PostgreSQL instance started behaving weirdly. 
For the first time it wouldn't shutdown so easily / quickly.


>From the logs we could see that standby nodes that were trying to connect were 
>rejected due to database being shutdown. After wal_sender_timeout and 
>wal_receiver_timeout (default 60s) were reached the database finally shut 
>down. It seems that walsender process was preventing the shutdown of the 
>master database - until timeout was reached, a behavior we didn't experience 
>before.


Does anyone know why would this happen?


We have 1 standby node in our primary site (same subnet as master DB), and two 
standbys in a remote site. WAL archiving is enabled to the remote site with 
rsync command that worked normally during this time and generally completes 
within a couple of seconds - definitely less than a minute. So we kind of ruled 
out WAL archiving.

Would shutting down remote site standbys prevent this kind of delay, they 
usually have a couple of seconds lag in terms of replication (pg_xlog) location?


Here is the postgres log from master that was being shutdown.

2017-10-26 22:04:01 CDT [1701]: [6-1]  user=,db= LOG:  received fast shutdown 
request
2017-10-26 22:04:01 CDT [1701]: [7-1]  user=,db= LOG:  aborting any active 
transactions
2017-10-26 22:04:01 CDT [1711]: [2-1]  user=,db= LOG:  dbms_aq launcher 
shutting down
2017-10-26 22:04:01 CDT [1708]: [2-1]  user=,db= LOG:  autovacuum launcher 
shutting down
2017-10-26 22:04:01 CDT [1705]: [9971-1]  user=,db= LOG:  shutting down
2017-10-26 22:04:01 CDT [1705]: [9972-1]  user=,db= LOG:  checkpoint starting: 
shutdown immediate
2017-10-26 22:04:01 CDT [1705]: [9973-1]  user=,db= LOG:  checkpoint complete: 
wrote 54 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 
recycled; write=0.017 s, sync=0.001 s, total=0.026 s; sync files=12, 
longest=0.000 s, average=0.000 s; distance=1100 kB, estimate=3307 kB
2017-10-26 22:04:01 CDT [1705]: [9974-1]  user=,db= LOG:  database system is 
shut down

2017-10-26 22:04:36 CDT [8763]: [1-1] "Local site standby IP"(49606) 
user=replication_user ,db=[unknown] FATAL:  the database system is shutting down

[OMITTED - a bunch of "FATAL:  the database system is shutting down" messages]

2017-10-26 22:05:00 CDT [2669]: [1-1] "Disaster recovery site standby IP" 
(55498) user=replication_user,db=[unknown] LOG:  terminating walsender process 
due to replication timeout

2017-10-26 22:05:00 CDT [1465]: [1-1] "Disaster recovery site standby 
IP"(36948) user=replication_user,db=[unknown] LOG:  terminating walsender 
process due to replication timeout
2017-10-26 22:05:01 CDT [1701]: [8-1]  user=,db= LOG:  database system is shut 
down

Standbys had the same message regarding walreceiver being terminated due to 
replication timeout.

Any help is greatly appreciated.

Thanks,
Zarko



[GENERAL] the database system is shutting down - terminating walsender process due to replication timeout

2017-10-30 Thread Zarko Aleksic
Greetings everyone,


I'm looking for a bit of help understanding a particular behavior we are seeing 
with our PostgreSQL 9.6. After issuing a service shutdown command with 
"systemctl stop" on RHEL 7 our PostgreSQL instance started behaving weirdly. 
For the first time it wouldn't shutdown so easily / quickly.


>From the logs we could see that standby nodes that were trying to connect were 
>rejected due to database being shutdown. After wal_sender_timeout and 
>wal_receiver_timeout (default 60s) were reached the database finally shut 
>down. It seems that walsender process was preventing the shutdown of the 
>master database - until timeout was reached, a behavior we didn't experience 
>before.


Does anyone know why would this happen?


We have 1 standby node in our primary site (same subnet as master DB), and two 
standbys in a remote site. WAL archiving is enabled to the remote site with 
rsync command that worked normally during this time and generally completes 
within a couple of seconds - definitely less than a minute. So we kind of ruled 
out WAL archiving.

Would shutting down remote site standbys prevent this kind of delay, they 
usually have a couple of seconds lag in terms of replication (pg_xlog) location?


Here is the postgres log from master that was being shutdown.

2017-10-26 22:04:01 CDT [1701]: [6-1]  user=,db= LOG:  received fast shutdown 
request
2017-10-26 22:04:01 CDT [1701]: [7-1]  user=,db= LOG:  aborting any active 
transactions
2017-10-26 22:04:01 CDT [1711]: [2-1]  user=,db= LOG:  dbms_aq launcher 
shutting down
2017-10-26 22:04:01 CDT [1708]: [2-1]  user=,db= LOG:  autovacuum launcher 
shutting down
2017-10-26 22:04:01 CDT [1705]: [9971-1]  user=,db= LOG:  shutting down
2017-10-26 22:04:01 CDT [1705]: [9972-1]  user=,db= LOG:  checkpoint starting: 
shutdown immediate
2017-10-26 22:04:01 CDT [1705]: [9973-1]  user=,db= LOG:  checkpoint complete: 
wrote 54 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 
recycled; write=0.017 s, sync=0.001 s, total=0.026 s; sync files=12, 
longest=0.000 s, average=0.000 s; distance=1100 kB, estimate=3307 kB
2017-10-26 22:04:01 CDT [1705]: [9974-1]  user=,db= LOG:  database system is 
shut down

2017-10-26 22:04:36 CDT [8763]: [1-1] "Local site standby IP"(49606) 
user=replication_user ,db=[unknown] FATAL:  the database system is shutting down

[OMITTED - a bunch of "FATAL:  the database system is shutting down" messages]

2017-10-26 22:05:00 CDT [2669]: [1-1] "Disaster recovery site standby IP" 
(55498) user=replication_user,db=[unknown] LOG:  terminating walsender process 
due to replication timeout

2017-10-26 22:05:00 CDT [1465]: [1-1] "Disaster recovery site standby 
IP"(36948) user=replication_user,db=[unknown] LOG:  terminating walsender 
process due to replication timeout
2017-10-26 22:05:01 CDT [1701]: [8-1]  user=,db= LOG:  database system is shut 
down

Standbys had the same message regarding walreceiver being terminated due to 
replication timeout.

Any help is greatly appreciated.

Thanks,
Zarko