Re: [HACKERS] CREATE REPLICATION SLOT fails on a timeout

2014-05-30 Thread Steve Singer

On 05/28/2014 06:41 PM, Andres Freund wrote:

Hi,


Pushed a fix for it. I am pretty sure it will, but could you still test
that it fixes your problem?

Thanks!


The fix seems to work (I am no longer getting the timeout on slot creation)

Thanks




Andres Freund





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CREATE REPLICATION SLOT fails on a timeout

2014-05-28 Thread Andres Freund
Hi,

On 2014-05-17 01:34:25 +0200, Andres Freund wrote:
 On 2014-05-16 17:02:33 -0400, Steve Singer wrote:
  I don't think that's going to cut it though. The creation can take
  longer than whatever wal_sender_timeout is set to (when there's lots of
  longrunning transactions). I think checking whether last_reply_timestamp
  = 0 during timeout checking is more robust.
 
  That makes sense.
  A patch that does that is attached.
 
 I've attached a heavily revised version of that patch. Didn't touch all
 the necessary places...
 
 Survives creation of logical replication slots under 'intensive
 pressure', with a wal_sender_timeout=10ms.
 
 Thanks for the bugreport!

Pushed a fix for it. I am pretty sure it will, but could you still test
that it fixes your problem?

Thanks!

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] CREATE REPLICATION SLOT fails on a timeout

2014-05-16 Thread Steve Singer
I am finding that my logical walsender connections are being terminated 
due to a timeout on the CREATE REPLICATION SLOT command. with 
terminating walsender process due to replication timeout


Below is the stack trace when this happens

#3  0x0067df28 in WalSndCheckTimeOut 
(now=now@entry=453585463823871) at walsender.c:1748
#4  0x0067eedc in WalSndWaitForWal (loc=691727888) at 
walsender.c:1216
#5  logical_read_xlog_page (state=optimized out, 
targetPagePtr=691724288, reqLen=optimized out,
targetRecPtr=optimized out, cur_page=0x18476e0 }\320\005, 
pageTLI=optimized out) at walsender.c:741
#6  0x004f41bf in ReadPageInternal (state=state@entry=0x17aa140, 
pageptr=pageptr@entry=691724288,

reqLen=reqLen@entry=3600) at xlogreader.c:525
#7  0x004f454a in XLogReadRecord (state=0x17aa140, 
RecPtr=691727856, RecPtr@entry=0,

errormsg=errormsg@entry=0x7fff7f668c28) at xlogreader.c:228
#8  0x00675e5c in DecodingContextFindStartpoint 
(ctx=ctx@entry=0x17a0358) at logical.c:460
#9  0x00680f16 in CreateReplicationSlot (cmd=0x1798b50) at 
walsender.c:800

#10 exec_replication_command (
cmd_string=cmd_string@entry=0x17f1478 CREATE_REPLICATION_SLOT 
\slon_4_1\ LOGICAL \slony1_funcs.2.2.0\)

at walsender.c:1291
#11 0x006bf4a1 in PostgresMain (argc=optimized out, 
argv=argv@entry=0x177db50, dbname=0x177db30 test1,


(gdb) p last_reply_timestamp
$1 = 0


I propose the attached patch sets last_reply_timestamp to now() it 
starts processing a command.  Since receiving a command is hearing 
something from the client.



Steve

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
new file mode 100644
index 5c11d68..56a2f10
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
*** exec_replication_command(const char *cmd
*** 1276,1281 
--- 1276,1282 
    parse_rc;
  
  	cmd_node = replication_parse_result;
+ 	last_reply_timestamp = GetCurrentTimestamp();
  
  	switch (cmd_node-type)
  	{

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CREATE REPLICATION SLOT fails on a timeout

2014-05-16 Thread Andres Freund
Hi,

On 2014-05-16 16:37:16 -0400, Steve Singer wrote:
 I am finding that my logical walsender connections are being terminated due
 to a timeout on the CREATE REPLICATION SLOT command. with terminating
 walsender process due to replication timeout
 
 Below is the stack trace when this happens
 
 #3  0x0067df28 in WalSndCheckTimeOut (now=now@entry=453585463823871)
 at walsender.c:1748
 #4  0x0067eedc in WalSndWaitForWal (loc=691727888) at
 walsender.c:1216
 ...
 #9  0x00680f16 in CreateReplicationSlot (cmd=0x1798b50) at
 walsender.c:800
 #10 exec_replication_command ()  at walsender.c:1291
 #11 0x006bf4a1 in PostgresMain (argc=optimized out,
 argv=argv@entry=0x177db50, dbname=0x177db30 test1,
 
 (gdb) p last_reply_timestamp
 $1 = 0
 
 
 I propose the attached patch sets last_reply_timestamp to now() it starts
 processing a command.  Since receiving a command is hearing something from
 the client.

Hm. Yes, that's a problem.

 diff --git a/src/backend/replication/walsender.c 
 b/src/backend/replication/walsender.c
 new file mode 100644
 index 5c11d68..56a2f10
 *** a/src/backend/replication/walsender.c
 --- b/src/backend/replication/walsender.c
 *** exec_replication_command(const char *cmd
 *** 1276,1281 
 --- 1276,1282 
 parse_rc;
   
   cmd_node = replication_parse_result;
 + last_reply_timestamp = GetCurrentTimestamp();
   
   switch (cmd_node-type)
   {

I don't think that's going to cut it though. The creation can take
longer than whatever wal_sender_timeout is set to (when there's lots of
longrunning transactions). I think checking whether last_reply_timestamp
= 0 during timeout checking is more robust.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CREATE REPLICATION SLOT fails on a timeout

2014-05-16 Thread Steve Singer

On 05/16/2014 04:43 PM, Andres Freund wrote:

Hi,

I don't think that's going to cut it though. The creation can take
longer than whatever wal_sender_timeout is set to (when there's lots of
longrunning transactions). I think checking whether last_reply_timestamp
= 0 during timeout checking is more robust.

Greetings,

Andres Freund





That makes sense.
A patch that does that is attached.

Steve
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
new file mode 100644
index 5c11d68..d23f06b
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
*** WalSndCheckTimeOut(TimestampTz now)
*** 1738,1744 
  	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
  		  wal_sender_timeout);
  
! 	if (wal_sender_timeout  0  now = timeout)
  	{
  		/*
  		 * Since typically expiration of replication timeout means
--- 1738,1744 
  	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
  		  wal_sender_timeout);
  
! 	if (last_reply_timestamp  0  wal_sender_timeout  0  now = timeout)
  	{
  		/*
  		 * Since typically expiration of replication timeout means

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CREATE REPLICATION SLOT fails on a timeout

2014-05-16 Thread Andres Freund
On 2014-05-16 17:02:33 -0400, Steve Singer wrote:
 I don't think that's going to cut it though. The creation can take
 longer than whatever wal_sender_timeout is set to (when there's lots of
 longrunning transactions). I think checking whether last_reply_timestamp
 = 0 during timeout checking is more robust.

 That makes sense.
 A patch that does that is attached.

I've attached a heavily revised version of that patch. Didn't touch all
the necessary places...

Survives creation of logical replication slots under 'intensive
pressure', with a wal_sender_timeout=10ms.

Thanks for the bugreport!

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
From 324d74e16dd8ee2a0fa977d92a269fd43a746196 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Sat, 17 May 2014 01:22:01 +0200
Subject: [PATCH] Don't pay heed to wal_sender_timeout while creating a
 decoding slot.

Sometimes CREATE_REPLICATION_SLOT ... LOGICAL ... needs to wait for
futher WAL using WalSndWaitForWal(). That used to always respect
wal_sender_timeout and kill the session when waiting long enough
because no feedback/ping messages can be sent while the slot is still
being created.
Add the notion that last_reply_timestamp = 0 means that the walsender
currently doesn't need timeout processing and add checks for it in a
couple of places.

Bugreport and initial patch by Steve Singer, revised by Andres Freund.
---
 src/backend/replication/walsender.c | 25 +
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5c11d68..90394ce 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -148,9 +148,10 @@ static StringInfoData reply_message;
 static StringInfoData tmpbuf;
 
 /*
- * Timestamp of the last receipt of the reply from the standby.
+ * Timestamp of the last receipt of the reply from the standby. Set to 0 if a
+ * the process currently shouldn't be killed by wal_sender_timeout.
  */
-static TimestampTz last_reply_timestamp;
+static TimestampTz last_reply_timestamp = 0;
 
 /* Have we sent a heartbeat message asking for reply, since last reply? */
 static bool waiting_for_ping_response = false;
@@ -796,6 +797,14 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		logical_read_xlog_page,
 		WalSndPrepareWrite, WalSndWriteData);
 
+		/*
+		 * Signal that we don't need the timeout mechanism. We're just
+		 * creating the replication slot and don't yet accept feedback
+		 * messages or send keepalives. As we possibly need to wait for
+		 * further WAL the walsender would possibly be killed too soon.
+		 */
+		last_reply_timestamp = 0;
+
 		/* build initial snapshot, might take a while */
 		DecodingContextFindStartpoint(ctx);
 
@@ -1693,7 +1702,7 @@ WalSndComputeSleeptime(TimestampTz now)
 {
 	long		sleeptime = 1;		/* 10 s */
 
-	if (wal_sender_timeout  0)
+	if (wal_sender_timeout  0  last_reply_timestamp  0)
 	{
 		TimestampTz wakeup_time;
 		long		sec_to_timeout;
@@ -1735,6 +1744,10 @@ WalSndCheckTimeOut(TimestampTz now)
 {
 	TimestampTz timeout;
 
+	/* don't bail out if we're doing something that doesn't require timeouts */
+	if (last_reply_timestamp = 0)
+		return;
+
 	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
 		  wal_sender_timeout);
 
@@ -2879,7 +2892,11 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 {
 	TimestampTz ping_time;
 
-	if (wal_sender_timeout = 0)
+	/*
+	 * Don't send keepalive messages if timeouts are globally disabled or
+	 * we're doing something not partaking in timeouts.
+	 */
+	if (wal_sender_timeout = 0 || last_reply_timestamp = 0)
 		return;
 
 	if (waiting_for_ping_response)
-- 
2.0.0.rc2.4.g1dc51c6.dirty


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers