Re: [HACKERS] loss of transactions in streaming replication

2011-11-30 Thread Robert Haas
On Mon, Oct 24, 2011 at 8:40 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Fri, Oct 21, 2011 at 12:01 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Oct 20, 2011 at 9:51 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Oct 20, 2011 at 1:05 AM, Robert Haas robertmh...@gmail.com wrote:
 OK, so this is an artifact of the changes to make libpq communication
 bidirectional.  But I'm still confused about where the error is coming
 from.  In your OP, you wrote In 9.2dev and 9.1, when walreceiver
 detects an error while sending data to WAL stream, it always emits
 ERROR even if there are data available in the receive buffer.  So
 that implied to me that this is only going to trigger if you have a
 shutdown together with an awkwardly-timed error.  But your scenario
 for reproducing this problem doesn't seem to involve an error.

 Yes, my scenario doesn't cause any real error. My original description was
 misleading. The following would be closer to the truth:

    In 9.2dev and 9.1, when walreceiver detects the termination of 
 replication
    connection while sending data to WAL stream, it always emits ERROR
    even if there are data available in the receive buffer.

 Ah, OK.  I think I now agree that this is a bug and that we should fix
 and back-patch.

 The patch that I posted before is well-formed enough to be adopted?

Does this still need to be worked on?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] loss of transactions in streaming replication

2011-10-24 Thread Fujii Masao
On Fri, Oct 21, 2011 at 12:01 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Oct 20, 2011 at 9:51 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Oct 20, 2011 at 1:05 AM, Robert Haas robertmh...@gmail.com wrote:
 OK, so this is an artifact of the changes to make libpq communication
 bidirectional.  But I'm still confused about where the error is coming
 from.  In your OP, you wrote In 9.2dev and 9.1, when walreceiver
 detects an error while sending data to WAL stream, it always emits
 ERROR even if there are data available in the receive buffer.  So
 that implied to me that this is only going to trigger if you have a
 shutdown together with an awkwardly-timed error.  But your scenario
 for reproducing this problem doesn't seem to involve an error.

 Yes, my scenario doesn't cause any real error. My original description was
 misleading. The following would be closer to the truth:

    In 9.2dev and 9.1, when walreceiver detects the termination of 
 replication
    connection while sending data to WAL stream, it always emits ERROR
    even if there are data available in the receive buffer.

 Ah, OK.  I think I now agree that this is a bug and that we should fix
 and back-patch.

The patch that I posted before is well-formed enough to be adopted?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] loss of transactions in streaming replication

2011-10-20 Thread Fujii Masao
On Thu, Oct 20, 2011 at 1:05 AM, Robert Haas robertmh...@gmail.com wrote:
 OK, so this is an artifact of the changes to make libpq communication
 bidirectional.  But I'm still confused about where the error is coming
 from.  In your OP, you wrote In 9.2dev and 9.1, when walreceiver
 detects an error while sending data to WAL stream, it always emits
 ERROR even if there are data available in the receive buffer.  So
 that implied to me that this is only going to trigger if you have a
 shutdown together with an awkwardly-timed error.  But your scenario
 for reproducing this problem doesn't seem to involve an error.

Yes, my scenario doesn't cause any real error. My original description was
misleading. The following would be closer to the truth:

In 9.2dev and 9.1, when walreceiver detects the termination of replication
connection while sending data to WAL stream, it always emits ERROR
even if there are data available in the receive buffer.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] loss of transactions in streaming replication

2011-10-20 Thread Robert Haas
On Thu, Oct 20, 2011 at 9:51 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Oct 20, 2011 at 1:05 AM, Robert Haas robertmh...@gmail.com wrote:
 OK, so this is an artifact of the changes to make libpq communication
 bidirectional.  But I'm still confused about where the error is coming
 from.  In your OP, you wrote In 9.2dev and 9.1, when walreceiver
 detects an error while sending data to WAL stream, it always emits
 ERROR even if there are data available in the receive buffer.  So
 that implied to me that this is only going to trigger if you have a
 shutdown together with an awkwardly-timed error.  But your scenario
 for reproducing this problem doesn't seem to involve an error.

 Yes, my scenario doesn't cause any real error. My original description was
 misleading. The following would be closer to the truth:

    In 9.2dev and 9.1, when walreceiver detects the termination of replication
    connection while sending data to WAL stream, it always emits ERROR
    even if there are data available in the receive buffer.

Ah, OK.  I think I now agree that this is a bug and that we should fix
and back-patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] loss of transactions in streaming replication

2011-10-19 Thread Fujii Masao
On Wed, Oct 19, 2011 at 11:28 AM, Robert Haas robertmh...@gmail.com wrote:
 Convince me.  :-)

Yeah, I try.

 My reading of the situation is that you're talking about a problem
 that will only occur if, while the master is in the process of
 shutting down, a network error occurs.

No. This happens even if a network error doesn't occur. I can
reproduce the issue by doing the following:

1. Set up streaming replication master and standby with archive
   setting.
2. Run pgbench -i
3. Shuts down the master with fast mode.

Then I can see that the latest WAL file in the master's pg_xlog
doesn't exist in the standby's one. The WAL record which was
lost was the shutdown checkpoint one.

When smart or fast shutdown is requested, the master tries to
write and send the WAL switch (if archiving is enabled) and
shutdown checkpoint record. Because of the problem I described,
the WAL switch record arrives at the standby but the shutdown
checkpoint does not.

 I am not sure it's a good idea
 to convolute the code to handle that case, because (1) there are going
 to be many similar situations where nothing within our power is
 sufficient to prevent WAL from failing to make it to the standby and

Shutting down the master is not a rare case. So I think it's worth
doing something.

 (2) for this marginal improvement, you're giving up including
 PQerrorMessage(streamConn) in the error message that ultimately gets
 omitted, which seems like a substantial regression as far as
 debuggability is concerned.

I think that it's possible to include PQerrorMessage() in the error
message. Will change the patch.

 Even if we do decide that we want the
 change in behavior, I see no compelling reason to back-patch it.
 Stable releases are supposed to be stable, not change behavior because
 we thought of something we like better than what we originally
 released.

The original behavior, in 9.0, is that all outstanding WAL are
replicated to the standby when the master shuts down normally.
But ISTM the behavior was changed unexpectedly in 9.1. So
I think that it should be back-patched to 9.1 to revert the behavior
to the original.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] loss of transactions in streaming replication

2011-10-19 Thread Fujii Masao
On Wed, Oct 19, 2011 at 3:31 PM, Fujii Masao masao.fu...@gmail.com wrote:
 (2) for this marginal improvement, you're giving up including
 PQerrorMessage(streamConn) in the error message that ultimately gets
 omitted, which seems like a substantial regression as far as
 debuggability is concerned.

 I think that it's possible to include PQerrorMessage() in the error
 message. Will change the patch.

Attached is the updated version of the patch. When walreceiver fails to
send data to WAL stream, it emits WARNING with the message including
PQerrorMessage(), and also it emits the following DETAIL message:

Walreceiver process will be terminated after all available data
have been received from WAL stream.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
*** a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
--- b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
***
*** 49,55  static char *recvBuf = NULL;
  static bool libpqrcv_connect(char *conninfo, XLogRecPtr startpoint);
  static bool libpqrcv_receive(int timeout, unsigned char *type,
   char **buffer, int *len);
! static void libpqrcv_send(const char *buffer, int nbytes);
  static void libpqrcv_disconnect(void);
  
  /* Prototypes for private functions */
--- 49,55 
  static bool libpqrcv_connect(char *conninfo, XLogRecPtr startpoint);
  static bool libpqrcv_receive(int timeout, unsigned char *type,
   char **buffer, int *len);
! static bool libpqrcv_send(const char *buffer, int nbytes);
  static void libpqrcv_disconnect(void);
  
  /* Prototypes for private functions */
***
*** 404,417  libpqrcv_receive(int timeout, unsigned char *type, char **buffer, int *len)
  /*
   * Send a message to XLOG stream.
   *
!  * ereports on error.
   */
! static void
  libpqrcv_send(const char *buffer, int nbytes)
  {
  	if (PQputCopyData(streamConn, buffer, nbytes) = 0 ||
  		PQflush(streamConn))
! 		ereport(ERROR,
  (errmsg(could not send data to WAL stream: %s,
! 		PQerrorMessage(streamConn;
  }
--- 404,429 
  /*
   * Send a message to XLOG stream.
   *
!  * Return true if successfully, false otherwise.
   */
! static bool
  libpqrcv_send(const char *buffer, int nbytes)
  {
+ 	/*
+ 	 * Even if we could not send data to WAL stream, we don't emit ERROR
+ 	 * just yet. There might be still data available in the receive buffer. We
+ 	 * emit ERROR after all available data have been received and flushed to
+ 	 * disk. Otherwise, such outstanding data would be lost.
+ 	 */
  	if (PQputCopyData(streamConn, buffer, nbytes) = 0 ||
  		PQflush(streamConn))
! 	{
! 		ereport(WARNING,
  (errmsg(could not send data to WAL stream: %s,
! 		PQerrorMessage(streamConn)),
!  errdetail(Walreceiver process will be terminated after all available data have been received from WAL stream.)));
! 		return false;
! 	}
! 
! 	return true;
  }
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
***
*** 96,101  static struct
--- 96,104 
  static StandbyReplyMessage reply_message;
  static StandbyHSFeedbackMessage feedback_message;
  
+ /* Did previous attempt to send data to WAL stream fail? */
+ static bool walrcv_send_error = false;
+ 
  /*
   * About SIGTERM handling:
   *
***
*** 328,333  WalReceiverMain(void)
--- 331,345 
  		else
  		{
  			/*
+ 			 * If we didn't receive anything new after we had failed to send data
+ 			 * to WAL stream, we can guarantee that there is no data available in
+ 			 * the receive buffer. So we at last emit ERROR.
+ 			 */
+ 			if (walrcv_send_error)
+ ereport(ERROR,
+ 		(errmsg(terminating walreceiver process due to failure to send data to WAL stream)));
+ 
+ 			/*
  			 * We didn't receive anything new, but send a status update to the
  			 * master anyway, to report any progress in applying WAL.
  			 */
***
*** 627,632  XLogWalRcvSendReply(void)
--- 639,651 
  	   wal_receiver_status_interval * 1000))
  		return;
  
+ 	/*
+ 	 * If previous attempt to send data to WAL stream has failed, there is
+ 	 * nothing to do here because this attempt would fail again.
+ 	 */
+ 	if (walrcv_send_error)
+ 		return;
+ 
  	/* Construct a new message */
  	reply_message.write = LogstreamResult.Write;
  	reply_message.flush = LogstreamResult.Flush;
***
*** 641,647  XLogWalRcvSendReply(void)
  	/* Prepend with the message type and send it. */
  	buf[0] = 'r';
  	memcpy(buf[1], reply_message, sizeof(StandbyReplyMessage));
! 	walrcv_send(buf, sizeof(StandbyReplyMessage) + 1);
  }
  
  /*
--- 660,666 
  	/* Prepend with the message type and send it. */
  	buf[0] = 'r';
  	memcpy(buf[1], reply_message, sizeof(StandbyReplyMessage));
! 	walrcv_send_error = !walrcv_send(buf, sizeof(StandbyReplyMessage) + 1);
  }
  
  /*
***
*** 682,687  XLogWalRcvSendHSFeedback(void)
--- 701,714 
  		return;

Re: [HACKERS] loss of transactions in streaming replication

2011-10-19 Thread Robert Haas
On Wed, Oct 19, 2011 at 2:31 AM, Fujii Masao masao.fu...@gmail.com wrote:
 My reading of the situation is that you're talking about a problem
 that will only occur if, while the master is in the process of
 shutting down, a network error occurs.

 No. This happens even if a network error doesn't occur. I can
 reproduce the issue by doing the following:

 1. Set up streaming replication master and standby with archive
   setting.
 2. Run pgbench -i
 3. Shuts down the master with fast mode.

 Then I can see that the latest WAL file in the master's pg_xlog
 doesn't exist in the standby's one. The WAL record which was
 lost was the shutdown checkpoint one.

 When smart or fast shutdown is requested, the master tries to
 write and send the WAL switch (if archiving is enabled) and
 shutdown checkpoint record. Because of the problem I described,
 the WAL switch record arrives at the standby but the shutdown
 checkpoint does not.

Oh, that's not good.

 The original behavior, in 9.0, is that all outstanding WAL are
 replicated to the standby when the master shuts down normally.
 But ISTM the behavior was changed unexpectedly in 9.1. So
 I think that it should be back-patched to 9.1 to revert the behavior
 to the original.

Which commit broke this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] loss of transactions in streaming replication

2011-10-19 Thread Fujii Masao
On Wed, Oct 19, 2011 at 9:44 PM, Robert Haas robertmh...@gmail.com wrote:
 The original behavior, in 9.0, is that all outstanding WAL are
 replicated to the standby when the master shuts down normally.
 But ISTM the behavior was changed unexpectedly in 9.1. So
 I think that it should be back-patched to 9.1 to revert the behavior
 to the original.

 Which commit broke this?

d3d414696f39e2b57072fab3dd4fa11e465be4ed
b186523fd97ce02ffbb7e21d5385a047deeef4f6

The former introduced problematic libpqrcv_send() (which was my mistake...),
and the latter is the first user of it.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] loss of transactions in streaming replication

2011-10-19 Thread Robert Haas
On Wed, Oct 19, 2011 at 10:41 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Oct 19, 2011 at 9:44 PM, Robert Haas robertmh...@gmail.com wrote:
 The original behavior, in 9.0, is that all outstanding WAL are
 replicated to the standby when the master shuts down normally.
 But ISTM the behavior was changed unexpectedly in 9.1. So
 I think that it should be back-patched to 9.1 to revert the behavior
 to the original.

 Which commit broke this?

 d3d414696f39e2b57072fab3dd4fa11e465be4ed
 b186523fd97ce02ffbb7e21d5385a047deeef4f6

 The former introduced problematic libpqrcv_send() (which was my mistake...),
 and the latter is the first user of it.

OK, so this is an artifact of the changes to make libpq communication
bidirectional.  But I'm still confused about where the error is coming
from.  In your OP, you wrote In 9.2dev and 9.1, when walreceiver
detects an error while sending data to WAL stream, it always emits
ERROR even if there are data available in the receive buffer.  So
that implied to me that this is only going to trigger if you have a
shutdown together with an awkwardly-timed error.  But your scenario
for reproducing this problem doesn't seem to involve an error.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] loss of transactions in streaming replication

2011-10-18 Thread Robert Haas
On Fri, Oct 14, 2011 at 7:51 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Oct 13, 2011 at 10:08 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Oct 12, 2011 at 10:29 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Oct 12, 2011 at 5:45 AM, Fujii Masao masao.fu...@gmail.com wrote:
 In 9.2dev and 9.1, when walreceiver detects an error while sending data to
 WAL stream, it always emits ERROR even if there are data available in the
 receive buffer. This might lead to loss of transactions because such
 remaining data are not received by walreceiver :(

 Won't it just reconnect?

 Yes if the master is running normally. OTOH, if the master is not running 
 (i.e.,
 failover case), the standby cannot receive again the data which it failed to
 receive.

 I found this issue when I shut down the master. When the master shuts down,
 it sends the shutdown checkpoint record, but I found that the standby failed
 to receive it.

 Patch attached.

 The patch changes walreceiver so that it doesn't emit ERROR just yet even
 if it fails to send data to WAL stream. Then, after all available data have 
 been
 received and flushed to the disk, it emits ERROR.

 If the patch is OK, it should be backported to v9.1.

Convince me.  :-)

My reading of the situation is that you're talking about a problem
that will only occur if, while the master is in the process of
shutting down, a network error occurs.  I am not sure it's a good idea
to convolute the code to handle that case, because (1) there are going
to be many similar situations where nothing within our power is
sufficient to prevent WAL from failing to make it to the standby and
(2) for this marginal improvement, you're giving up including
PQerrorMessage(streamConn) in the error message that ultimately gets
omitted, which seems like a substantial regression as far as
debuggability is concerned.  Even if we do decide that we want the
change in behavior, I see no compelling reason to back-patch it.
Stable releases are supposed to be stable, not change behavior because
we thought of something we like better than what we originally
released.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] loss of transactions in streaming replication

2011-10-14 Thread Fujii Masao
On Thu, Oct 13, 2011 at 10:08 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Oct 12, 2011 at 10:29 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Oct 12, 2011 at 5:45 AM, Fujii Masao masao.fu...@gmail.com wrote:
 In 9.2dev and 9.1, when walreceiver detects an error while sending data to
 WAL stream, it always emits ERROR even if there are data available in the
 receive buffer. This might lead to loss of transactions because such
 remaining data are not received by walreceiver :(

 Won't it just reconnect?

 Yes if the master is running normally. OTOH, if the master is not running 
 (i.e.,
 failover case), the standby cannot receive again the data which it failed to
 receive.

 I found this issue when I shut down the master. When the master shuts down,
 it sends the shutdown checkpoint record, but I found that the standby failed
 to receive it.

Patch attached.

The patch changes walreceiver so that it doesn't emit ERROR just yet even
if it fails to send data to WAL stream. Then, after all available data have been
received and flushed to the disk, it emits ERROR.

If the patch is OK, it should be backported to v9.1.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
*** a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
--- b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
***
*** 49,55  static char *recvBuf = NULL;
  static bool libpqrcv_connect(char *conninfo, XLogRecPtr startpoint);
  static bool libpqrcv_receive(int timeout, unsigned char *type,
   char **buffer, int *len);
! static void libpqrcv_send(const char *buffer, int nbytes);
  static void libpqrcv_disconnect(void);
  
  /* Prototypes for private functions */
--- 49,55 
  static bool libpqrcv_connect(char *conninfo, XLogRecPtr startpoint);
  static bool libpqrcv_receive(int timeout, unsigned char *type,
   char **buffer, int *len);
! static bool libpqrcv_send(const char *buffer, int nbytes);
  static void libpqrcv_disconnect(void);
  
  /* Prototypes for private functions */
***
*** 404,417  libpqrcv_receive(int timeout, unsigned char *type, char **buffer, int *len)
  /*
   * Send a message to XLOG stream.
   *
!  * ereports on error.
   */
! static void
  libpqrcv_send(const char *buffer, int nbytes)
  {
  	if (PQputCopyData(streamConn, buffer, nbytes) = 0 ||
  		PQflush(streamConn))
! 		ereport(ERROR,
! (errmsg(could not send data to WAL stream: %s,
! 		PQerrorMessage(streamConn;
  }
--- 404,428 
  /*
   * Send a message to XLOG stream.
   *
!  * Return true if successfully, false otherwise.
   */
! static bool
  libpqrcv_send(const char *buffer, int nbytes)
  {
+ 	/*
+ 	 * Even if we could not send data to WAL stream, we don't emit ERROR
+ 	 * just yet. There might be still data available in the receive buffer. We
+ 	 * emit ERROR after all available data have been received and flushed to
+ 	 * disk. Otherwise, such outstanding data would be lost.
+ 	 *
+ 	 * XXX: Should the result of PQerrorMessage() be returned so that the
+ 	 * caller can report it? This doesn't seem worthwhile because in most cases,
+ 	 * before reporting that, we will get another error and emit ERROR while
+ 	 * trying to process all available data.
+ 	 */
  	if (PQputCopyData(streamConn, buffer, nbytes) = 0 ||
  		PQflush(streamConn))
! 		return false;
! 
! 	return true;
  }
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
***
*** 96,101  static struct
--- 96,104 
  static StandbyReplyMessage reply_message;
  static StandbyHSFeedbackMessage feedback_message;
  
+ /* Did previous attempt to send data to WAL stream fail? */
+ static bool walrcv_send_error = false;
+ 
  /*
   * About SIGTERM handling:
   *
***
*** 328,333  WalReceiverMain(void)
--- 331,345 
  		else
  		{
  			/*
+ 			 * If we didn't receive anything new after we had failed to send data
+ 			 * to WAL stream, we can guarantee that there is no data available in
+ 			 * the receive buffer. So we at last emit ERROR.
+ 			 */
+ 			if (walrcv_send_error)
+ ereport(ERROR,
+ 		(errmsg(could not send data to WAL stream)));
+ 
+ 			/*
  			 * We didn't receive anything new, but send a status update to the
  			 * master anyway, to report any progress in applying WAL.
  			 */
***
*** 627,632  XLogWalRcvSendReply(void)
--- 639,651 
  	   wal_receiver_status_interval * 1000))
  		return;
  
+ 	/*
+ 	 * If previous attempt to send data to WAL stream has failed, there is
+ 	 * nothing to do here because this attempt would fail again.
+ 	 */
+ 	if (walrcv_send_error)
+ 		return;
+ 
  	/* Construct a new message */
  	reply_message.write = LogstreamResult.Write;
  	reply_message.flush = LogstreamResult.Flush;
***
*** 641,647  XLogWalRcvSendReply(void)
  	/* Prepend with the message type and send it. */
  	buf[0] = 'r';
  	memcpy(buf[1], 

Re: [HACKERS] loss of transactions in streaming replication

2011-10-12 Thread Robert Haas
On Wed, Oct 12, 2011 at 5:45 AM, Fujii Masao masao.fu...@gmail.com wrote:
 In 9.2dev and 9.1, when walreceiver detects an error while sending data to
 WAL stream, it always emits ERROR even if there are data available in the
 receive buffer. This might lead to loss of transactions because such
 remaining data are not received by walreceiver :(

Won't it just reconnect?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] loss of transactions in streaming replication

2011-10-12 Thread Fujii Masao
On Wed, Oct 12, 2011 at 10:29 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Oct 12, 2011 at 5:45 AM, Fujii Masao masao.fu...@gmail.com wrote:
 In 9.2dev and 9.1, when walreceiver detects an error while sending data to
 WAL stream, it always emits ERROR even if there are data available in the
 receive buffer. This might lead to loss of transactions because such
 remaining data are not received by walreceiver :(

 Won't it just reconnect?

Yes if the master is running normally. OTOH, if the master is not running (i.e.,
failover case), the standby cannot receive again the data which it failed to
receive.

I found this issue when I shut down the master. When the master shuts down,
it sends the shutdown checkpoint record, but I found that the standby failed
to receive it.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers