Re: [HACKERS] Standby receiving part of missing WAL segment

2015-02-12 Thread Robert Haas
On Wed, Feb 11, 2015 at 12:55 PM, Thom Brown t...@linux.com wrote:
 Today I witnessed a situation which appears to have gone down like this:

 - The primary server starting streaming WAL data from segment 00A8 to the
 standby
 - The standby server started receiving that data
 - Before 00A8 is finished, the wal sender process dies on the primary, but
 the archiver process continues, and 00A8 ends up being archived as usual
 - The primary continues to generate WAL and cleans up old WAL files from
 pg_xlog until 00A8 is gone.
 - The primary is restarted and the wal sender process is back up and running
 - The standby says waiting for 00A8, which it can no longer get from the
 primary
 - 00A8 is in the standby's archive directory, but the standby is waiting for
 the rest of the segment from the primary via streaming replication, so
 doesn't check the archive
 - The standby is restarted
 - The standby goes back into recovery and eventually replays 00A8 and
 continues as normal.

 Should the standby be able to get feedback from the primary that the
 requested segment is no longer available, and therefore know to check its
 archive?

Last time I played around with this, if the standby requested a
segment from the master that was no longer present there, the standby
would immediately get an ERROR, which it seems like would get you out
of trouble.  I wonder why that didn't happen in your case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standby receiving part of missing WAL segment

2015-02-12 Thread Robert Haas
On Thu, Feb 12, 2015 at 9:56 AM, Thom Brown t...@linux.com wrote:
 On 12 February 2015 at 13:56, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Feb 11, 2015 at 12:55 PM, Thom Brown t...@linux.com wrote:
  Today I witnessed a situation which appears to have gone down like this:
 
  - The primary server starting streaming WAL data from segment 00A8 to
  the
  standby
  - The standby server started receiving that data
  - Before 00A8 is finished, the wal sender process dies on the primary,
  but
  the archiver process continues, and 00A8 ends up being archived as usual
  - The primary continues to generate WAL and cleans up old WAL files from
  pg_xlog until 00A8 is gone.
  - The primary is restarted and the wal sender process is back up and
  running
  - The standby says waiting for 00A8, which it can no longer get from
  the
  primary
  - 00A8 is in the standby's archive directory, but the standby is waiting
  for
  the rest of the segment from the primary via streaming replication, so
  doesn't check the archive
  - The standby is restarted
  - The standby goes back into recovery and eventually replays 00A8 and
  continues as normal.
 
  Should the standby be able to get feedback from the primary that the
  requested segment is no longer available, and therefore know to check
  its
  archive?

 Last time I played around with this, if the standby requested a
 segment from the master that was no longer present there, the standby
 would immediately get an ERROR, which it seems like would get you out
 of trouble.  I wonder why that didn't happen in your case.


 Yeah, I've tried recreating this like so:

 - Primary streams to standby like usual
 - Kill -9 primary then change its port and bring it back up
 - Create traffic on primary until it no longer has the WAL file the standby
 wants, but has archived it
 - Change the port of the primary back to what the standby is trying to talk
 to

 But before it gets to that 4th point, the standby has gone to the archive
 for the rest of it:

 cp: cannot stat ‘/tmp/walarch/0001000600C8’: No such file or
 directory
 2015-02-12 14:47:52 GMT [8280]: [1-1] user=,db=,client= FATAL:  could not
 connect to the primary server: could not connect to server: Connection
 refused
 Is the server running on host 127.0.0.1 and accepting
 TCP/IP connections on port 5488?

 cp: cannot stat ‘/tmp/walarch/0001000600C8’: No such file or
 directory
 2015-02-12 14:47:57 GMT [8283]: [1-1] user=,db=,client= FATAL:  could not
 connect to the primary server: could not connect to server: Connection
 refused
 Is the server running on host 127.0.0.1 and accepting
 TCP/IP connections on port 5488?

 2015-02-12 14:48:02 GMT [8202]: [6-1] user=,db=,client= LOG:  restored log
 file 0001000600C8 from archive

 I don't suppose this is something that was buggy in 9.3.1?

I don't know.  I don't remember a bug of that type, but I might have
missed it.  If you see the scenario again, it might help to pull a
stack trace from the master and the standby.  That might tell us
whether the master or the standby is to blame, and where the offender
is wedged.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standby receiving part of missing WAL segment

2015-02-12 Thread Thom Brown
On 12 February 2015 at 13:56, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Feb 11, 2015 at 12:55 PM, Thom Brown t...@linux.com wrote:
  Today I witnessed a situation which appears to have gone down like this:
 
  - The primary server starting streaming WAL data from segment 00A8 to the
  standby
  - The standby server started receiving that data
  - Before 00A8 is finished, the wal sender process dies on the primary,
 but
  the archiver process continues, and 00A8 ends up being archived as usual
  - The primary continues to generate WAL and cleans up old WAL files from
  pg_xlog until 00A8 is gone.
  - The primary is restarted and the wal sender process is back up and
 running
  - The standby says waiting for 00A8, which it can no longer get from
 the
  primary
  - 00A8 is in the standby's archive directory, but the standby is waiting
 for
  the rest of the segment from the primary via streaming replication, so
  doesn't check the archive
  - The standby is restarted
  - The standby goes back into recovery and eventually replays 00A8 and
  continues as normal.
 
  Should the standby be able to get feedback from the primary that the
  requested segment is no longer available, and therefore know to check its
  archive?

 Last time I played around with this, if the standby requested a
 segment from the master that was no longer present there, the standby
 would immediately get an ERROR, which it seems like would get you out
 of trouble.  I wonder why that didn't happen in your case.


Yeah, I've tried recreating this like so:

- Primary streams to standby like usual
- Kill -9 primary then change its port and bring it back up
- Create traffic on primary until it no longer has the WAL file the standby
wants, but has archived it
- Change the port of the primary back to what the standby is trying to talk
to

But before it gets to that 4th point, the standby has gone to the archive
for the rest of it:

cp: cannot stat ‘/tmp/walarch/0001000600C8’: No such file or
directory
2015-02-12 14:47:52 GMT [8280]: [1-1] user=,db=,client= FATAL:  could not
connect to the primary server: could not connect to server: Connection
refused
Is the server running on host 127.0.0.1 and accepting
TCP/IP connections on port 5488?

cp: cannot stat ‘/tmp/walarch/0001000600C8’: No such file or
directory
2015-02-12 14:47:57 GMT [8283]: [1-1] user=,db=,client= FATAL:  could not
connect to the primary server: could not connect to server: Connection
refused
Is the server running on host 127.0.0.1 and accepting
TCP/IP connections on port 5488?

2015-02-12 14:48:02 GMT [8202]: [6-1] user=,db=,client= LOG:  restored log
file 0001000600C8 from archive

I don't suppose this is something that was buggy in 9.3.1?

Thom


[HACKERS] Standby receiving part of missing WAL segment

2015-02-11 Thread Thom Brown
Hi,

Today I witnessed a situation which appears to have gone down like this:

- The primary server starting streaming WAL data from segment 00A8 to the
standby
- The standby server started receiving that data
- Before 00A8 is finished, the wal sender process dies on the primary, but
the archiver process continues, and 00A8 ends up being archived as usual
- The primary continues to generate WAL and cleans up old WAL files from
pg_xlog until 00A8 is gone.
- The primary is restarted and the wal sender process is back up and running
- The standby says waiting for 00A8, which it can no longer get from the
primary
- 00A8 is in the standby's archive directory, but the standby is waiting
for the rest of the segment from the primary via streaming replication, so
doesn't check the archive
- The standby is restarted
- The standby goes back into recovery and eventually replays 00A8 and
continues as normal.

Should the standby be able to get feedback from the primary that the
requested segment is no longer available, and therefore know to check its
archive?

Or should it check the archive anyway if it hasn't received any further WAL
data via the streaming replication connection after a certain amount of
time?

At the moment, the standby gets stuck forever in this situation, even
though it has access to the WAL it needs.

Thom