Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-23 Thread Fujii Masao
On Fri, Feb 20, 2015 at 9:33 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Fri, Feb 20, 2015 at 9:12 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Feb 10, 2015 at 10:32 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Mon, Feb 9, 2015 at 8:29 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:
 - * Wait for more WAL to arrive. Time out after 5 seconds,
 - * like when polling the archive, to react to a trigger
 - * file promptly.
 + * Wait for more WAL to arrive. Time out after the amount of
 + * time specified by wal_retrieve_retry_interval, like
 + * when polling the archive, to react to a trigger file promptly.
   */
  WaitLatch(XLogCtl->recoveryWakeupLatch,
            WL_LATCH_SET | WL_TIMEOUT,
 -          5000L);
 +          wal_retrieve_retry_interval * 1000L);

 This change can prevent the startup process from reacting to
 a trigger file. Imagine the case where a large interval is set
 and the user wants to promote the standby by using the trigger file
 instead of pg_ctl promote. I think that the sleep time should be 5s
 if the interval is set to more than 5s. Thought?

 I disagree here. It is interesting to accelerate the check of WAL
 availability from a source in some cases for replication, but the
 opposite is true as well, as mentioned by Alexey at the beginning of
 the thread, to reduce the number of requests when requesting WAL
 archives from external storage such as AWS. Hence a correct solution
 would be to check periodically for the trigger file with a maximum
 one-time wait of 5s to ensure backward-compatible behavior. We could
 reduce it to 1s or something like that as well.

 You seem to have misunderstood the code in question. Or I'm missing something.
 The timeout of the WaitLatch is just the interval to check for the trigger file
 while waiting for more WAL to arrive from streaming replication. Not related to
 the retry time to restore WAL from the archive.

 [Re-reading the code...]
 Aah, yes, you are right. Sorry for the noise. Let's wait for a
 maximum of 5s then. I also noticed that the previous patch capped the
 wait at 5s. A loop should have been used if it were a plain sleep, but
 now that the patch uses a latch this limit does not make much sense...
 The updated patch is attached.

 On second thought, the interval to check the trigger file is very different
 from the wait time between retries to retrieve WAL. So it seems strange and
 even confusing to control them with one parameter. If we really want to make
 the interval for the trigger file configurable, we should invent a new GUC
 for it. But I don't think that it's worth doing that. If someone wants to
 react to the trigger file more promptly for fast promotion, he or she can
 basically use the pg_ctl promote command instead. Thought?

 Hm, OK.

 Attached is the updated version of the patch. I changed the parameter so that
 it doesn't affect the interval of checking the trigger file.

 -static pg_time_t last_fail_time = 0;
 -pg_time_t	now;
 +TimestampTz now = GetCurrentTimestamp();
 +TimestampTz last_fail_time = now;

 I reverted the code here as it was. I don't think GetCurrentTimestamp() needs
 to be called for each WaitForWALToBecomeAvailable().

 +WaitLatch(XLogCtl->recoveryWakeupLatch,
 +  WL_LATCH_SET | WL_TIMEOUT,
 +  wait_time / 1000);

 Don't we need to specify WL_POSTMASTER_DEATH flag here? Added.

 Yeah, though I am wondering why this was not added after 89fd72cb.

 +{wal_retrieve_retry_interval, PGC_SIGHUP, WAL_SETTINGS,

 WAL_SETTINGS should be REPLICATION_STANDBY? Changed.

 Sure.

So I pushed the patch.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-23 Thread Michael Paquier
On Mon, Feb 23, 2015 at 8:57 PM, Fujii Masao masao.fu...@gmail.com wrote:

 So I pushed the patch.


Thank you.
-- 
Michael


Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-20 Thread Michael Paquier
On Fri, Feb 20, 2015 at 9:12 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Feb 10, 2015 at 10:32 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Mon, Feb 9, 2015 at 8:29 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:
 - * Wait for more WAL to arrive. Time out after 5 seconds,
 - * like when polling the archive, to react to a trigger
 - * file promptly.
 + * Wait for more WAL to arrive. Time out after the amount of
 + * time specified by wal_retrieve_retry_interval, like
 + * when polling the archive, to react to a trigger file promptly.
   */
  WaitLatch(XLogCtl->recoveryWakeupLatch,
            WL_LATCH_SET | WL_TIMEOUT,
 -          5000L);
 +          wal_retrieve_retry_interval * 1000L);

 This change can prevent the startup process from reacting to
 a trigger file. Imagine the case where a large interval is set
 and the user wants to promote the standby by using the trigger file
 instead of pg_ctl promote. I think that the sleep time should be 5s
 if the interval is set to more than 5s. Thought?

 I disagree here. It is interesting to accelerate the check of WAL
 availability from a source in some cases for replication, but the
 opposite is true as well, as mentioned by Alexey at the beginning of
 the thread, to reduce the number of requests when requesting WAL
 archives from external storage such as AWS. Hence a correct solution
 would be to check periodically for the trigger file with a maximum
 one-time wait of 5s to ensure backward-compatible behavior. We could
 reduce it to 1s or something like that as well.

 You seem to have misunderstood the code in question. Or I'm missing something.
 The timeout of the WaitLatch is just the interval to check for the trigger file
 while waiting for more WAL to arrive from streaming replication. Not related to
 the retry time to restore WAL from the archive.

 [Re-reading the code...]
 Aah, yes, you are right. Sorry for the noise. Let's wait for a
 maximum of 5s then. I also noticed that the previous patch capped the
 wait at 5s. A loop should have been used if it were a plain sleep, but
 now that the patch uses a latch this limit does not make much sense...
 The updated patch is attached.

 On second thought, the interval to check the trigger file is very different
 from the wait time between retries to retrieve WAL. So it seems strange and
 even confusing to control them with one parameter. If we really want to make
 the interval for the trigger file configurable, we should invent a new GUC
 for it. But I don't think that it's worth doing that. If someone wants to
 react to the trigger file more promptly for fast promotion, he or she can
 basically use the pg_ctl promote command instead. Thought?

Hm, OK.

 Attached is the updated version of the patch. I changed the parameter so that
 it doesn't affect the interval of checking the trigger file.

 -static pg_time_t last_fail_time = 0;
 -pg_time_t	now;
 +TimestampTz now = GetCurrentTimestamp();
 +TimestampTz last_fail_time = now;

 I reverted the code here as it was. I don't think GetCurrentTimestamp() needs
 to be called for each WaitForWALToBecomeAvailable().

 +WaitLatch(XLogCtl->recoveryWakeupLatch,
 +  WL_LATCH_SET | WL_TIMEOUT,
 +  wait_time / 1000);

 Don't we need to specify WL_POSTMASTER_DEATH flag here? Added.

Yeah, though I am wondering why this was not added after 89fd72cb.

 +{wal_retrieve_retry_interval, PGC_SIGHUP, WAL_SETTINGS,

 WAL_SETTINGS should be REPLICATION_STANDBY? Changed.

Sure.
-- 
Michael




Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-20 Thread Fujii Masao
On Tue, Feb 10, 2015 at 10:32 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Mon, Feb 9, 2015 at 8:29 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:
 - * Wait for more WAL to arrive. Time out after 5 seconds,
 - * like when polling the archive, to react to a trigger
 - * file promptly.
 + * Wait for more WAL to arrive. Time out after the amount of
 + * time specified by wal_retrieve_retry_interval, like
 + * when polling the archive, to react to a trigger file promptly.
   */
  WaitLatch(XLogCtl->recoveryWakeupLatch,
            WL_LATCH_SET | WL_TIMEOUT,
 -          5000L);
 +          wal_retrieve_retry_interval * 1000L);

 This change can prevent the startup process from reacting to
 a trigger file. Imagine the case where a large interval is set
 and the user wants to promote the standby by using the trigger file
 instead of pg_ctl promote. I think that the sleep time should be 5s
 if the interval is set to more than 5s. Thought?

 I disagree here. It is interesting to accelerate the check of WAL
 availability from a source in some cases for replication, but the
 opposite is true as well, as mentioned by Alexey at the beginning of
 the thread, to reduce the number of requests when requesting WAL
 archives from external storage such as AWS. Hence a correct solution
 would be to check periodically for the trigger file with a maximum
 one-time wait of 5s to ensure backward-compatible behavior. We could
 reduce it to 1s or something like that as well.

 You seem to have misunderstood the code in question. Or I'm missing something.
 The timeout of the WaitLatch is just the interval to check for the trigger file
 while waiting for more WAL to arrive from streaming replication. Not related to
 the retry time to restore WAL from the archive.

 [Re-reading the code...]
 Aah, yes, you are right. Sorry for the noise. Let's wait for a
 maximum of 5s then. I also noticed that the previous patch capped the
 wait at 5s. A loop should have been used if it were a plain sleep, but
 now that the patch uses a latch this limit does not make much sense...
 The updated patch is attached.

On second thought, the interval to check the trigger file is very different
from the wait time between retries to retrieve WAL. So it seems strange and
even confusing to control them with one parameter. If we really want to make
the interval for the trigger file configurable, we should invent a new GUC
for it. But I don't think that it's worth doing that. If someone wants to
react to the trigger file more promptly for fast promotion, he or she can
basically use the pg_ctl promote command instead. Thought?

Attached is the updated version of the patch. I changed the parameter so that
it doesn't affect the interval of checking the trigger file.

-static pg_time_t last_fail_time = 0;
-pg_time_t	now;
+TimestampTz now = GetCurrentTimestamp();
+TimestampTz last_fail_time = now;

I reverted the code here as it was. I don't think GetCurrentTimestamp() needs
to be called for each WaitForWALToBecomeAvailable().

+WaitLatch(XLogCtl->recoveryWakeupLatch,
+  WL_LATCH_SET | WL_TIMEOUT,
+  wait_time / 1000);

Don't we need to specify WL_POSTMASTER_DEATH flag here? Added.

+{wal_retrieve_retry_interval, PGC_SIGHUP, WAL_SETTINGS,

WAL_SETTINGS should be REPLICATION_STANDBY? Changed.

Regards,

-- 
Fujii Masao
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 2985,2990  include_dir 'conf.d'
--- 2985,3008 
   </listitem>
   </varlistentry>
  
 +  <varlistentry id="guc-wal-retrieve-retry-interval" xreflabel="wal_retrieve_retry_interval">
 +   <term><varname>wal_retrieve_retry_interval</varname> (<type>integer</type>)
 +   <indexterm>
 +    <primary><varname>wal_retrieve_retry_interval</> configuration parameter</primary>
 +   </indexterm>
 +   </term>
 +   <listitem>
 +    <para>
 +     Specify the amount of time, in milliseconds, to wait when WAL is not
 +     available from any sources (streaming replication,
 +     local <filename>pg_xlog</> or WAL archive) before retrying to
 +     retrieve WAL.  This parameter can only be set in the
 +     <filename>postgresql.conf</> file or on the server command line.
 +     The default value is 5 seconds.
 +    </para>
 +   </listitem>
 +  </varlistentry>
 + 
   </variablelist>
  </sect2>
 </sect1>
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 93,98  int			sync_method = DEFAULT_SYNC_METHOD;
--- 93,99 
  int			wal_level = 

Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-10 Thread Michael Paquier
On Mon, Feb 9, 2015 at 8:29 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:
 - * Wait for more WAL to arrive. Time out after 5 seconds,
 - * like when polling the archive, to react to a trigger
 - * file promptly.
 + * Wait for more WAL to arrive. Time out after the amount of
 + * time specified by wal_retrieve_retry_interval, like
 + * when polling the archive, to react to a trigger file promptly.
   */
  WaitLatch(XLogCtl->recoveryWakeupLatch,
            WL_LATCH_SET | WL_TIMEOUT,
 -          5000L);
 +          wal_retrieve_retry_interval * 1000L);

 This change can prevent the startup process from reacting to
 a trigger file. Imagine the case where a large interval is set
 and the user wants to promote the standby by using the trigger file
 instead of pg_ctl promote. I think that the sleep time should be 5s
 if the interval is set to more than 5s. Thought?

 I disagree here. It is interesting to accelerate the check of WAL
 availability from a source in some cases for replication, but the
 opposite is true as well, as mentioned by Alexey at the beginning of
 the thread, to reduce the number of requests when requesting WAL
 archives from external storage such as AWS. Hence a correct solution
 would be to check periodically for the trigger file with a maximum
 one-time wait of 5s to ensure backward-compatible behavior. We could
 reduce it to 1s or something like that as well.

 You seem to have misunderstood the code in question. Or I'm missing something.
 The timeout of the WaitLatch is just the interval to check for the trigger file
 while waiting for more WAL to arrive from streaming replication. Not related to
 the retry time to restore WAL from the archive.

[Re-reading the code...]
Aah, yes, you are right. Sorry for the noise. Let's wait for a
maximum of 5s then. I also noticed that the previous patch capped the
wait at 5s. A loop should have been used if it were a plain sleep, but
now that the patch uses a latch this limit does not make much sense...
The updated patch is attached.
Regards,
-- 
Michael
From 9b7e3bb32c744b328a0d99db3040cadfcba606aa Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 19 Jan 2015 16:08:48 +0900
Subject: [PATCH] Add wal_retrieve_retry_interval

This parameter helps control when WAL availability is checked while a node
is in recovery, particularly when successive failures happen while fetching
WAL archives or fetching WAL records from a streaming source.

Default value is 5s.
---
 doc/src/sgml/config.sgml  | 17 ++
 src/backend/access/transam/xlog.c | 46 +++
 src/backend/utils/misc/guc.c  | 12 +++
 src/backend/utils/misc/postgresql.conf.sample |  3 ++
 src/include/access/xlog.h |  1 +
 5 files changed, 66 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6bcb106..d82b26a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2985,6 +2985,23 @@ include_dir 'conf.d'
  </listitem>
 </varlistentry>

+ <varlistentry id="guc-wal-retrieve-retry-interval" xreflabel="wal_retrieve_retry_interval">
+  <term><varname>wal_retrieve_retry_interval</varname> (<type>integer</type>)
+  <indexterm>
+   <primary><varname>wal_retrieve_retry_interval</> configuration parameter</primary>
+  </indexterm>
+  </term>
+  <listitem>
+   <para>
+    Specify the amount of time to wait when WAL is not available from
+    any sources (streaming replication, local <filename>pg_xlog</> or
+    WAL archive) before retrying to retrieve WAL.  This parameter can
+    only be set in the <filename>postgresql.conf</> file or on the
+    server command line.  The default value is 5 seconds.
+   </para>
+  </listitem>
+ </varlistentry>
+
  </variablelist>
 </sect2>
</sect1>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 629a457..1f9c3c4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -93,6 +93,7 @@ int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
 int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
+int			wal_retrieve_retry_interval = 5000;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -10340,8 +10341,8 @@ static bool
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			bool fetching_ckpt, XLogRecPtr tliRecPtr)
 {
-	static pg_time_t last_fail_time = 0;
-	pg_time_t	now;
+	TimestampTz now = 

Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-09 Thread Fujii Masao
On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:
 - * Wait for more WAL to arrive. Time out after 5 seconds,
 - * like when polling the archive, to react to a trigger
 - * file promptly.
 + * Wait for more WAL to arrive. Time out after the amount of
 + * time specified by wal_retrieve_retry_interval, like
 + * when polling the archive, to react to a trigger file promptly.
   */
  WaitLatch(XLogCtl->recoveryWakeupLatch,
            WL_LATCH_SET | WL_TIMEOUT,
 -          5000L);
 +          wal_retrieve_retry_interval * 1000L);

 This change can prevent the startup process from reacting to
 a trigger file. Imagine the case where a large interval is set
 and the user wants to promote the standby by using the trigger file
 instead of pg_ctl promote. I think that the sleep time should be 5s
 if the interval is set to more than 5s. Thought?

 I disagree here. It is interesting to accelerate the check of WAL
 availability from a source in some cases for replication, but the
 opposite is true as well, as mentioned by Alexey at the beginning of
 the thread, to reduce the number of requests when requesting WAL
 archives from external storage such as AWS. Hence a correct solution
 would be to check periodically for the trigger file with a maximum
 one-time wait of 5s to ensure backward-compatible behavior. We could
 reduce it to 1s or something like that as well.

You seem to have misunderstood the code in question. Or I'm missing something.
The timeout of the WaitLatch is just the interval to check for the trigger file
while waiting for more WAL to arrive from streaming replication. Not related to
the retry time to restore WAL from the archive.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-07 Thread Michael Paquier
On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:
 timestamp.c:1708: warning: implicit declaration of function
 'HandleStartupProcInterrupts'

 I got the above warning at the compilation.

 +pg_usleep(wait_time);
 +HandleStartupProcInterrupts();
 +total_time -= wait_time;

 It seems strange to call the startup-process-specific function in
 the generic function like TimestampSleepDifference().

I changed the sleep to a WaitLatch and moved all the logic back to
xlog.c, removing at the same time the stuff in timestamp.c, because it
seems strange to add a dependency on even the latch code there.

  * 5. Sleep 5 seconds, and loop back to 1.

 In WaitForWALToBecomeAvailable(), the above comment should
 be updated.

Done.

 - * Wait for more WAL to arrive. Time out after 5 seconds,
 - * like when polling the archive, to react to a trigger
 - * file promptly.
 + * Wait for more WAL to arrive. Time out after the amount of
 + * time specified by wal_retrieve_retry_interval, like
 + * when polling the archive, to react to a trigger file promptly.
   */
  WaitLatch(XLogCtl->recoveryWakeupLatch,
            WL_LATCH_SET | WL_TIMEOUT,
 -          5000L);
 +          wal_retrieve_retry_interval * 1000L);

 This change can prevent the startup process from reacting to
 a trigger file. Imagine the case where a large interval is set
 and the user wants to promote the standby by using the trigger file
 instead of pg_ctl promote. I think that the sleep time should be 5s
 if the interval is set to more than 5s. Thought?

I disagree here. It is interesting to accelerate the check of WAL
availability from a source in some cases for replication, but the
opposite is true as well, as mentioned by Alexey at the beginning of
the thread, to reduce the number of requests when requesting WAL
archives from external storage such as AWS. Hence a correct solution
would be to check periodically for the trigger file with a maximum
one-time wait of 5s to ensure backward-compatible behavior. We could
reduce it to 1s or something like that as well.

 +# specifies an optional internal to wait for WAL to become available when

 s/internal/interval?

This part has been removed in the new patch, as the parameter is now defined as a GUC.


 +This parameter specifies the amount of time to wait when a failure
 +occurred when reading WAL from a source (be it via streaming
 +replication, local <filename>pg_xlog</> or WAL archive) for a node
 +in standby mode, or when WAL is expected from a source.

 At least to me, it seems not easy to understand what the parameter is
 from this description. What about the following, for example?
 This parameter specifies the amount of time to wait when WAL is not
 available from any sources (streaming replication, local
 <filename>pg_xlog</> or WAL archive) before retrying to retrieve WAL

OK, done.

 Isn't it better to place this parameter in postgresql.conf rather than
 recovery.conf? I'd like to change the value of this parameter without
 restarting the server.

Yes, agreed. Done so with a SIGHUP guc.
Regards,
-- 
Michael
From 594db45eda43c0abc13ad0dbca81d6ed3f4650e7 Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 19 Jan 2015 16:08:48 +0900
Subject: [PATCH] Add wal_retrieve_retry_interval

This parameter helps control when WAL availability is checked while a node
is in recovery, particularly when successive failures happen while fetching
WAL archives or fetching WAL records from a streaming source.

Default value is 5s.
---
 doc/src/sgml/config.sgml  | 17 +
 src/backend/access/transam/xlog.c | 91 ++-
 src/backend/utils/misc/guc.c  | 12 
 src/backend/utils/misc/postgresql.conf.sample |  3 +
 src/include/access/xlog.h |  1 +
 5 files changed, 93 insertions(+), 31 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6bcb106..d82b26a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2985,6 +2985,23 @@ include_dir 'conf.d'
  </listitem>
 </varlistentry>

+ <varlistentry id="guc-wal-retrieve-retry-interval" xreflabel="wal_retrieve_retry_interval">
+  <term><varname>wal_retrieve_retry_interval</varname> (<type>integer</type>)
+  <indexterm>
+   <primary><varname>wal_retrieve_retry_interval</> configuration parameter</primary>
+  </indexterm>
+  </term>
+  <listitem>
+   <para>
+    Specify the amount of time to wait when WAL is not available from
+    any sources (streaming replication, local <filename>pg_xlog</> or
+    WAL archive) before retrying to retrieve WAL.  This parameter can
+    only be set in the <filename>postgresql.conf</> file or on the
+   

Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-06 Thread Fujii Masao
On Fri, Feb 6, 2015 at 1:23 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Thu, Feb 5, 2015 at 11:58 PM, Michael Paquier wrote:
 An updated patch is attached.
 I just noticed that the patch I sent was incorrect:
 - Parameter name was still wal_availability_check_interval and not
 wal_retrieve_retry_interval
 - Documentation was incorrect.
 Please use the patch attached instead for further review.

Thanks for updating the patch!

timestamp.c:1708: warning: implicit declaration of function
'HandleStartupProcInterrupts'

I got the above warning at the compilation.

+pg_usleep(wait_time);
+HandleStartupProcInterrupts();
+total_time -= wait_time;

It seems strange to call the startup-process-specific function in
the generic function like TimestampSleepDifference().

 * 5. Sleep 5 seconds, and loop back to 1.

In WaitForWALToBecomeAvailable(), the above comment should
be updated.

- * Wait for more WAL to arrive. Time out after 5 seconds,
- * like when polling the archive, to react to a trigger
- * file promptly.
+ * Wait for more WAL to arrive. Time out after the amount of
+ * time specified by wal_retrieve_retry_interval, like
+ * when polling the archive, to react to a trigger file promptly.
  */
 WaitLatch(XLogCtl->recoveryWakeupLatch,
           WL_LATCH_SET | WL_TIMEOUT,
-          5000L);
+          wal_retrieve_retry_interval * 1000L);

This change can prevent the startup process from reacting to
a trigger file. Imagine the case where a large interval is set
and the user wants to promote the standby by using the trigger file
instead of pg_ctl promote. I think that the sleep time should be 5s
if the interval is set to more than 5s. Thought?

+# specifies an optional internal to wait for WAL to become available when

s/internal/interval?

+This parameter specifies the amount of time to wait when a failure
+occurred when reading WAL from a source (be it via streaming
+replication, local <filename>pg_xlog</> or WAL archive) for a node
+in standby mode, or when WAL is expected from a source.

At least to me, it seems not easy to understand what the parameter is
from this description. What about the following, for example?

This parameter specifies the amount of time to wait when WAL is not
available from any sources (streaming replication, local
<filename>pg_xlog</> or WAL archive) before retrying to retrieve WAL

Isn't it better to place this parameter in postgresql.conf rather than
recovery.conf? I'd like to change the value of this parameter without
restarting the server.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-05 Thread Michael Paquier
On Thu, Feb 5, 2015 at 11:58 PM, Michael Paquier wrote:
 An updated patch is attached.
I just noticed that the patch I sent was incorrect:
- Parameter name was still wal_availability_check_interval and not
wal_retrieve_retry_interval
- Documentation was incorrect.
Please use the patch attached instead for further review.
-- 
Michael
From 06a4d3d1f5fe4362d7be1404cbf0b45b74fea69f Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 19 Jan 2015 16:08:48 +0900
Subject: [PATCH] Add wal_retrieve_retry_interval

This parameter helps control when WAL availability is checked while a node
is in recovery, particularly when successive failures happen while fetching
WAL archives or fetching WAL records from a streaming source.

Default value is 5s.
---
 doc/src/sgml/recovery-config.sgml   | 17 +
 src/backend/access/transam/recovery.conf.sample |  9 +
 src/backend/access/transam/xlog.c   | 47 +
 src/backend/utils/adt/timestamp.c   | 38 
 src/include/utils/timestamp.h   |  2 ++
 5 files changed, 99 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index 0c64ff2..d4babbd 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -145,6 +145,23 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p'  # Windows
  </listitem>
 </varlistentry>

+ <varlistentry id="wal-retrieve-retry-interval" xreflabel="wal_retrieve_retry_interval">
+  <term><varname>wal_retrieve_retry_interval</varname> (<type>integer</type>)
+  <indexterm>
+    <primary><varname>wal_retrieve_retry_interval</> recovery parameter</primary>
+  </indexterm>
+  </term>
+  <listitem>
+   <para>
+    This parameter specifies the amount of time to wait when a failure
+    occurred when reading WAL from a source (be it via streaming
+    replication, local <filename>pg_xlog</> or WAL archive) for a node
+    in standby mode, or when WAL is expected from a source. Default
+    value is <literal>5s</>.
+   </para>
+  </listitem>
+ </varlistentry>
+
 </variablelist>

  </sect1>
diff --git a/src/backend/access/transam/recovery.conf.sample b/src/backend/access/transam/recovery.conf.sample
index b777400..458308c 100644
--- a/src/backend/access/transam/recovery.conf.sample
+++ b/src/backend/access/transam/recovery.conf.sample
@@ -58,6 +58,15 @@
 #
 #recovery_end_command = ''
 #
+#
+# wal_retrieve_retry_interval
+#
+# specifies an optional internal to wait for WAL to become available when
+# a failure occurred when reading WAL from a source for a node in standby
+# mode, or when WAL is expected from a source.
+#
+#wal_retrieve_retry_interval = '5s'
+#
 #---
 # RECOVERY TARGET PARAMETERS
 #---
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 629a457..111e53d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -235,6 +235,7 @@ static TimestampTz recoveryTargetTime;
 static char *recoveryTargetName;
 static int	recovery_min_apply_delay = 0;
 static TimestampTz recoveryDelayUntilTime;
+static int	wal_retrieve_retry_interval = 5000;
 
 /* options taken from recovery.conf for XLOG streaming */
 static bool StandbyModeRequested = false;
@@ -4881,6 +4882,26 @@ readRecoveryCommandFile(void)
 					(errmsg_internal("trigger_file = '%s'",
 									 TriggerFile)));
 		}
+		else if (strcmp(item->name, "wal_retrieve_retry_interval") == 0)
+		{
+			const char *hintmsg;
+
+			if (!parse_int(item->value, &wal_retrieve_retry_interval, GUC_UNIT_MS,
+						   &hintmsg))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("parameter \"%s\" requires a temporal value",
+								"wal_retrieve_retry_interval"),
+						 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+			ereport(DEBUG2,
+					(errmsg_internal("wal_retrieve_retry_interval = '%s'", item->value)));
+
+			if (wal_retrieve_retry_interval <= 0)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("\"%s\" must have a value strictly positive",
+								"wal_retrieve_retry_interval")));
+		}
 		else if (strcmp(item->name, "recovery_min_apply_delay") == 0)
 		{
 			const char *hintmsg;
@@ -10340,8 +10361,8 @@ static bool
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			bool fetching_ckpt, XLogRecPtr tliRecPtr)
 {
-	static pg_time_t last_fail_time = 0;
-	pg_time_t	now;
+	TimestampTz now = GetCurrentTimestamp();
+	TimestampTz	last_fail_time = now;
 
 	/*---
 	 * Standby mode is implemented by a state machine:
@@ -10490,15 +10511,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	 * machine, so we've exhausted all the options for
 	 * obtaining the requested WAL. We're 

Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-05 Thread Michael Paquier
Fujii Masao wrote:
 +TimestampDifference(start_time, stop_time, &secs, &microsecs);
 +pg_usleep(interval_msec * 1000L - (1000000L * secs + 1L * microsecs));

 What if a large interval is set and a signal arrives during the sleep?
 I'm afraid that a signal cannot terminate the sleep on some platforms.
 This problem exists even now because pg_usleep is used, but the sleep
 time is just 5 seconds, so it's not so bad. But the patch allows a user to
 set a large sleep time.

Yes, and I thought that we could live with that for this patch... Now
that you mention it, something similar to what recoveryPausesHere() does
would be quite good to ease the shutdown of an interrupted process, even
more so than now. So let's do that.

 Shouldn't we use WaitLatch or split the pg_usleep like recoveryPausesHere() 
 does?

I'd vote for the latter, as we would still need to calculate a
timestamp difference in any case, so it feels easier to do that in
the new single API as this patch does (a routine useful for plugins as
well). Now I will not fight it if people think that using
recoveryWakeupLatch is better.

An updated patch is attached. This patch also contains a fix for
something that was mentioned upthread but not added in the latest version:
wal_availability_check_interval should be used when waiting for a WAL
record from a stream, and for a segment when fetching from archives.
The last version did only the former, not the latter.
-- 
Michael
From 9c3a7fa35538993c1345a40fe0973332a76bdb81 Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 19 Jan 2015 16:08:48 +0900
Subject: [PATCH] Add wal_availability_check_interval

This parameter helps control when WAL availability is checked while a node
is in recovery, particularly when successive failures happen while fetching
WAL archives, or when fetching WAL records from a streaming source.

Default value is 5s.
---
 doc/src/sgml/recovery-config.sgml   | 21 +++
 src/backend/access/transam/recovery.conf.sample | 10 ++
 src/backend/access/transam/xlog.c   | 47 +
 src/backend/utils/adt/timestamp.c   | 38 
 src/include/utils/timestamp.h   |  2 ++
 5 files changed, 104 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index 0c64ff2..7fd51ce 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -145,6 +145,27 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p'  # Windows
   </listitem>
  </varlistentry>
 
+ <varlistentry id="wal-availability-check-interval" xreflabel="wal_availability_check_interval">
+  <term><varname>wal_availability_check_interval</varname> (<type>integer</type>)
+  <indexterm>
+    <primary><varname>wal_availability_check_interval</> recovery parameter</primary>
+  </indexterm>
+  </term>
+  <listitem>
+   <para>
+    This parameter specifies the amount of time to wait when
+    WAL is not available for a node in recovery. Default value is
+    <literal>5s</>.
+   </para>
+   <para>
+    A node in recovery will wait for this amount of time if
+    <varname>restore_command</> returns nonzero exit status code when
+    fetching new WAL segment files from archive or when a WAL receiver
+    is not able to fetch a WAL record when using streaming replication.
+   </para>
+  </listitem>
+ </varlistentry>
+
 </variablelist>

  </sect1>
diff --git a/src/backend/access/transam/recovery.conf.sample b/src/backend/access/transam/recovery.conf.sample
index b777400..70d3946 100644
--- a/src/backend/access/transam/recovery.conf.sample
+++ b/src/backend/access/transam/recovery.conf.sample
@@ -58,6 +58,16 @@
 #
 #recovery_end_command = ''
 #
+#
+# wal_availability_check_interval
+#
+# specifies an optional interval to check for availability of WAL when
+# recovering a node. This interval of time represents the frequency of
+# retries if a previous invocation of restore_command returned a nonzero
+# exit status code or if a walreceiver did not completely stream a WAL record.
+#
+#wal_availability_check_interval = '5s'
+#
#---------------------------------------------------------------------------
# RECOVERY TARGET PARAMETERS
#---------------------------------------------------------------------------
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 629a457..4f4efca 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -235,6 +235,7 @@ static TimestampTz recoveryTargetTime;
 static char *recoveryTargetName;
 static int	recovery_min_apply_delay = 0;
 static TimestampTz recoveryDelayUntilTime;
+static int	wal_availability_check_interval = 5000;
 
 /* options taken from recovery.conf for XLOG streaming */
 static bool StandbyModeRequested = false;
@@ -4881,6 +4882,26 @@ readRecoveryCommandFile(void)
 	(errmsg_internal(trigger_file = 

Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-04 Thread Fujii Masao
On Tue, Jan 20, 2015 at 12:57 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Tue, Jan 20, 2015 at 1:56 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2015-01-19 17:16:11 +0900, Michael Paquier wrote:
 On Mon, Jan 19, 2015 at 4:10 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  On Sat, Jan 17, 2015 at 2:44 AM, Andres Freund and...@2ndquadrant.com 
  wrote:
  Not this patch's fault, but I'm getting a bit tired seeing the above open
  coded. How about adding a function that does the sleeping based on a
  timestamptz and a ms interval?
  You mean in plugins, right? I don't recall seeing similar patterns in
  other code paths of backend. But I think that we can use something
  like that in timestamp.c then because we need to leverage that between
  two timestamps, the last failure and now():
  TimestampSleepDifference(start_time, stop_time, internal_ms);
  Perhaps you have something else in mind?
 
  Attached is an updated patch.

 Actually I came with better than last patch by using a boolean flag as
 return value of TimestampSleepDifference and use
 TimestampDifferenceExceeds directly inside it.

 Subject: [PATCH] Add wal_availability_check_interval to control WAL fetching
  on failure

 I think that name isn't very good. And it isn't very accurate
 either.
 How about wal_retrieve_retry_interval? Not very nice, but I think it's
 still better than the above.
 Fine for me.

 + <varlistentry id="wal-availability-check-interval" xreflabel="wal_availability_check_interval">
 +  <term><varname>wal_availability_check_interval</varname> (<type>integer</type>)
 +  <indexterm>
 +    <primary><varname>wal_availability_check_interval</> recovery parameter</primary>
 +  </indexterm>
 +  </term>
 +  <listitem>
 +   <para>
 +    This parameter specifies the amount of time to wait when
 +    WAL is not available for a node in recovery. Default value is
 +    <literal>5s</>.
 +   </para>
 +   <para>
 +    A node in recovery will wait for this amount of time if
 +    <varname>restore_command</> returns nonzero exit status code when
 +    fetching new WAL segment files from archive or when a WAL receiver
 +    is not able to fetch a WAL record when using streaming replication.
 +   </para>
 +  </listitem>
 + </varlistentry>
 +
  </variablelist>

 Walreceiver doesn't wait that amount, but rather how long the connection
 is intact. And restore_command may or may not retry.
 I changed this paragraph as follows:
 +This parameter specifies the amount of time to wait when a failure
 +occurred when reading WAL from a source (be it via streaming
 +replication, local <filename>pg_xlog</> or WAL archive) for a node
 +in standby mode, or when WAL is expected from a source. Default
 +value is <literal>5s</>.
 Pondering more about this patch, I am getting the feeling that it is
 not much of a win to explain in detail in the docs each failure mode
 depending on the source from which WAL is taken. But it is essential
 to mention that the wait is done only for a node using standby mode.

 Btw, I noticed two extra things that I think should be done for consistency:
 - We should use as well wal_retrieve_retry_interval when waiting for
 WAL to arrive after checking for the failures:
 /*
 - * Wait for more WAL to arrive. Time out after 5 seconds,
 - * like when polling the archive, to react to a trigger
 - * file promptly.
 + * Wait for more WAL to arrive. Time out after
 + * wal_retrieve_retry_interval ms, like when polling the archive,
 + * to react to a trigger file promptly.
   */
 
  WaitLatch(&XLogCtl->recoveryWakeupLatch,
            WL_LATCH_SET | WL_TIMEOUT,
 -          5000L);
 +          wal_retrieve_retry_interval * 1L);

 Updated patch attached.

+TimestampDifference(start_time, stop_time, &secs, &microsecs);
+pg_usleep(interval_msec * 1000L - (1000000L * secs + 1L * microsecs));

What if a large interval is set and a signal arrives during the sleep?
I'm afraid that a signal cannot terminate the sleep on some platforms.
This problem exists even now because pg_usleep is used, but the sleep
time is just 5 seconds, so it's not so bad. But the patch allows a user to
set a large sleep time. Shouldn't we use WaitLatch or split the pg_usleep
like recoveryPausesHere() does?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-01-19 Thread Michael Paquier
On Tue, Jan 20, 2015 at 1:56 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2015-01-19 17:16:11 +0900, Michael Paquier wrote:
 On Mon, Jan 19, 2015 at 4:10 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  On Sat, Jan 17, 2015 at 2:44 AM, Andres Freund and...@2ndquadrant.com 
  wrote:
  Not this patch's fault, but I'm getting a bit tired seeing the above open
  coded. How about adding a function that does the sleeping based on a
  timestamptz and a ms interval?
  You mean in plugins, right? I don't recall seeing similar patterns in
  other code paths of backend. But I think that we can use something
  like that in timestamp.c then because we need to leverage that between
  two timestamps, the last failure and now():
  TimestampSleepDifference(start_time, stop_time, internal_ms);
  Perhaps you have something else in mind?
 
  Attached is an updated patch.

 Actually I came with better than last patch by using a boolean flag as
 return value of TimestampSleepDifference and use
 TimestampDifferenceExceeds directly inside it.

 Subject: [PATCH] Add wal_availability_check_interval to control WAL fetching
  on failure

 I think that name isn't very good. And it isn't very accurate
 either.
 How about wal_retrieve_retry_interval? Not very nice, but I think it's
 still better than the above.
Fine for me.

 + <varlistentry id="wal-availability-check-interval" xreflabel="wal_availability_check_interval">
 +  <term><varname>wal_availability_check_interval</varname> (<type>integer</type>)
 +  <indexterm>
 +    <primary><varname>wal_availability_check_interval</> recovery parameter</primary>
 +  </indexterm>
 +  </term>
 +  <listitem>
 +   <para>
 +    This parameter specifies the amount of time to wait when
 +    WAL is not available for a node in recovery. Default value is
 +    <literal>5s</>.
 +   </para>
 +   <para>
 +    A node in recovery will wait for this amount of time if
 +    <varname>restore_command</> returns nonzero exit status code when
 +    fetching new WAL segment files from archive or when a WAL receiver
 +    is not able to fetch a WAL record when using streaming replication.
 +   </para>
 +  </listitem>
 + </varlistentry>
 +
  </variablelist>

 Walreceiver doesn't wait that amount, but rather how long the connection
 is intact. And restore_command may or may not retry.
I changed this paragraph as follows:
+This parameter specifies the amount of time to wait when a failure
+occurred when reading WAL from a source (be it via streaming
+replication, local <filename>pg_xlog</> or WAL archive) for a node
+in standby mode, or when WAL is expected from a source. Default
+value is <literal>5s</>.
Pondering more about this patch, I am getting the feeling that it is
not much of a win to explain in detail in the docs each failure mode
depending on the source from which WAL is taken. But it is essential
to mention that the wait is done only for a node using standby mode.

Btw, I noticed two extra things that I think should be done for consistency:
- We should use as well wal_retrieve_retry_interval when waiting for
WAL to arrive after checking for the failures:
/*
- * Wait for more WAL to arrive. Time out after 5 seconds,
- * like when polling the archive, to react to a trigger
- * file promptly.
+ * Wait for more WAL to arrive. Time out after
+ * wal_retrieve_retry_interval ms, like when polling the archive,
+ * to react to a trigger file promptly.
  */
 WaitLatch(&XLogCtl->recoveryWakeupLatch,
           WL_LATCH_SET | WL_TIMEOUT,
-          5000L);
+          wal_retrieve_retry_interval * 1L);

Updated patch attached.
-- 
Michael
From e07949030a676b033f4a563cfaac4687bcc37dca Mon Sep 17 00:00:00 2001
From: Michael Paquier michael@otacoo.com
Date: Mon, 19 Jan 2015 16:08:48 +0900
Subject: [PATCH] Add wal_retrieve_retry_interval to control WAL retrieval at
 recovery

This parameter allows control of the amount of time a process will wait
for WAL from a source when a failure has occurred on a standby node, and
of the amount of time to wait when WAL is expected from a source.
Default value is 5s.
---
 doc/src/sgml/recovery-config.sgml   | 17 +
 src/backend/access/transam/recovery.conf.sample |  9 +
 src/backend/access/transam/xlog.c   | 50 +
 src/backend/utils/adt/timestamp.c   | 24 
 src/include/utils/timestamp.h   |  2 +
 5 files changed, 87 insertions(+), 15 deletions(-)

diff --git 

Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-01-19 Thread Michael Paquier
On Mon, Jan 19, 2015 at 4:10 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Sat, Jan 17, 2015 at 2:44 AM, Andres Freund and...@2ndquadrant.com wrote:
 Not this patch's fault, but I'm getting a bit tired seeing the above open
 coded. How about adding a function that does the sleeping based on a
 timestamptz and a ms interval?
 You mean in plugins, right? I don't recall seeing similar patterns in
 other code paths of backend. But I think that we can use something
 like that in timestamp.c then because we need to leverage that between
 two timestamps, the last failure and now():
 TimestampSleepDifference(start_time, stop_time, internal_ms);
 Perhaps you have something else in mind?

 Attached is an updated patch.
Actually I came up with something better than the last patch by using a
boolean flag as the return value of TimestampSleepDifference and using
TimestampDifferenceExceeds directly inside it.
Regards,
-- 
Michael
From d77429197838c0ad9d378aba08137d4ca0b384fd Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 19 Jan 2015 16:08:48 +0900
Subject: [PATCH] Add wal_availability_check_interval to control WAL fetching
 on failure

Default value is 5s.
---
 doc/src/sgml/recovery-config.sgml   | 21 +
 src/backend/access/transam/recovery.conf.sample | 10 +++
 src/backend/access/transam/xlog.c   | 39 ++---
 src/backend/utils/adt/timestamp.c   | 24 +++
 src/include/utils/timestamp.h   |  2 ++
 5 files changed, 86 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index 0c64ff2..7fd51ce 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -145,6 +145,27 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p'  # Windows
   </listitem>
  </varlistentry>
 
+ <varlistentry id="wal-availability-check-interval" xreflabel="wal_availability_check_interval">
+  <term><varname>wal_availability_check_interval</varname> (<type>integer</type>)
+  <indexterm>
+    <primary><varname>wal_availability_check_interval</> recovery parameter</primary>
+  </indexterm>
+  </term>
+  <listitem>
+   <para>
+    This parameter specifies the amount of time to wait when
+    WAL is not available for a node in recovery. Default value is
+    <literal>5s</>.
+   </para>
+   <para>
+    A node in recovery will wait for this amount of time if
+    <varname>restore_command</> returns nonzero exit status code when
+    fetching new WAL segment files from archive or when a WAL receiver
+    is not able to fetch a WAL record when using streaming replication.
+   </para>
+  </listitem>
+ </varlistentry>
+
 </variablelist>

  </sect1>
diff --git a/src/backend/access/transam/recovery.conf.sample b/src/backend/access/transam/recovery.conf.sample
index b777400..70d3946 100644
--- a/src/backend/access/transam/recovery.conf.sample
+++ b/src/backend/access/transam/recovery.conf.sample
@@ -58,6 +58,16 @@
 #
 #recovery_end_command = ''
 #
+#
+# wal_availability_check_interval
+#
+# specifies an optional interval to check for availability of WAL when
+# recovering a node. This interval of time represents the frequency of
+# retries if a previous invocation of restore_command returned a nonzero
+# exit status code or if a walreceiver did not completely stream a WAL record.
+#
+#wal_availability_check_interval = '5s'
+#
#---------------------------------------------------------------------------
# RECOVERY TARGET PARAMETERS
#---------------------------------------------------------------------------
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 629a457..39b4e1c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -235,6 +235,7 @@ static TimestampTz recoveryTargetTime;
 static char *recoveryTargetName;
 static int	recovery_min_apply_delay = 0;
 static TimestampTz recoveryDelayUntilTime;
+static int	wal_availability_check_interval = 5000;
 
 /* options taken from recovery.conf for XLOG streaming */
 static bool StandbyModeRequested = false;
@@ -4881,6 +4882,26 @@ readRecoveryCommandFile(void)
 					(errmsg_internal("trigger_file = '%s'",
 									 TriggerFile)));
 		}
+		else if (strcmp(item->name, "wal_availability_check_interval") == 0)
+		{
+			const char *hintmsg;
+
+			if (!parse_int(item->value, &wal_availability_check_interval, GUC_UNIT_MS,
+						   &hintmsg))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("parameter \"%s\" requires a temporal value",
+								"wal_availability_check_interval"),
+						 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+			ereport(DEBUG2,
+	(errmsg_internal("wal_availability_check_interval = '%s'", item->value)));
+
+			if (wal_availability_check_interval <= 0)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("\"%s\" must have a value strictly positive",

Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-01-19 Thread Andres Freund
On 2015-01-19 17:16:11 +0900, Michael Paquier wrote:
 On Mon, Jan 19, 2015 at 4:10 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  On Sat, Jan 17, 2015 at 2:44 AM, Andres Freund and...@2ndquadrant.com 
  wrote:
  Not this patch's fault, but I'm getting a bit tired seeing the above open
  coded. How about adding a function that does the sleeping based on a
  timestamptz and a ms interval?
  You mean in plugins, right? I don't recall seeing similar patterns in
  other code paths of backend. But I think that we can use something
  like that in timestamp.c then because we need to leverage that between
  two timestamps, the last failure and now():
  TimestampSleepDifference(start_time, stop_time, internal_ms);
  Perhaps you have something else in mind?
 
  Attached is an updated patch.

 Actually I came with better than last patch by using a boolean flag as
 return value of TimestampSleepDifference and use
 TimestampDifferenceExceeds directly inside it.

 Subject: [PATCH] Add wal_availability_check_interval to control WAL fetching
  on failure

I think that name isn't very good. And it isn't very accurate
either.

How about wal_retrieve_retry_interval? Not very nice, but I think it's
still better than the above.


 + <varlistentry id="wal-availability-check-interval" xreflabel="wal_availability_check_interval">
 +  <term><varname>wal_availability_check_interval</varname> (<type>integer</type>)
 +  <indexterm>
 +    <primary><varname>wal_availability_check_interval</> recovery parameter</primary>
 +  </indexterm>
 +  </term>
 +  <listitem>
 +   <para>
 +    This parameter specifies the amount of time to wait when
 +    WAL is not available for a node in recovery. Default value is
 +    <literal>5s</>.
 +   </para>
 +   <para>
 +    A node in recovery will wait for this amount of time if
 +    <varname>restore_command</> returns nonzero exit status code when
 +    fetching new WAL segment files from archive or when a WAL receiver
 +    is not able to fetch a WAL record when using streaming replication.
 +   </para>
 +  </listitem>
 + </varlistentry>
 +
  </variablelist>

Walreceiver doesn't wait that amount, but rather how long the connection
is intact. And restore_command may or may not retry.

 	/*---
 	 * Standby mode is implemented by a state machine:
 @@ -10490,15 +10511,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	 * machine, so we've exhausted all the options for
 	 * obtaining the requested WAL. We're going to loop back
 	 * and retry from the archive, but if it hasn't been long
 -	 * since last attempt, sleep 5 seconds to avoid
 -	 * busy-waiting.
 +	 * since last attempt, sleep the amount of time specified
 +	 * by wal_availability_check_interval to avoid busy-waiting.
 	 */
 -	now = (pg_time_t) time(NULL);
 -	if ((now - last_fail_time) < 5)
 -	{
 -		pg_usleep(1000000L * (5 - (now - last_fail_time)));
 -		now = (pg_time_t) time(NULL);
 -	}
 +	now = GetCurrentTimestamp();
 +	if (TimestampSleepDifference(last_fail_time, now,
 +								 wal_availability_check_interval))
 +		now = GetCurrentTimestamp();

Not bad, that's much easier to read imo.

Greetings,

Andres Freund




Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-01-18 Thread Michael Paquier
On Mon, Jan 19, 2015 at 4:10 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Sat, Jan 17, 2015 at 2:44 AM, Andres Freund and...@2ndquadrant.com wrote:
 Am I missing something or does this handle/affect streaming failures
 just the same? That might not be a bad idea, because the current
 interval is far too long for e.g. a sync replication setup. But it
 certainly makes the name a complete misnomer.
 Hm.
Sorry for the typo here, I meant that I agree with that. But I am sure
you guessed it...
-- 
Michael




Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-01-18 Thread Michael Paquier
On Sat, Jan 17, 2015 at 2:44 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2015-01-05 17:16:43 +0900, Michael Paquier wrote:
 + <varlistentry id="restore-command-retry-interval" xreflabel="restore_command_retry_interval">
 +  <term><varname>restore_command_retry_interval</varname> (<type>integer</type>)
 +  <indexterm>
 +    <primary><varname>restore_command_retry_interval</> recovery parameter</primary>
 +  </indexterm>
 +  </term>
 +  <listitem>
 +   <para>
 +    If <varname>restore_command</> returns nonzero exit status code, retry
 +    command after the interval of time specified by this parameter,
 +    measured in milliseconds if no unit is specified. Default value is
 +    <literal>5s</>.
 +   </para>
 +   <para>
 +    This command is particularly useful for warm standby configurations
 +    to leverage the amount of retries done to restore a WAL segment file
 +    depending on the activity of a master node.
 +   </para>

 Don't really like this paragraph. What's "leveraging the amount of
 retries done" supposed to mean?
Actually I have reworked the whole paragraph with the renaming of the
parameter. Hopefully that's more clear.

 +# restore_command_retry_interval
 +#
 +# specifies an optional retry interval for restore_command command, if
 +# previous command returned nonzero exit status code. This can be useful
 +# to increase or decrease the number of calls of restore_command.
 It's not really the number but the frequency, right?
Sure.

 +	else if (strcmp(item->name, "restore_command_retry_interval") == 0)
 +	{
 +		const char *hintmsg;
 +
 +		if (!parse_int(item->value, &restore_command_retry_interval, GUC_UNIT_MS,
 +					   &hintmsg))
 +			ereport(ERROR,
 +					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 +					 errmsg("parameter \"%s\" requires a temporal value",
 +							"restore_command_retry_interval"),
 +					 hintmsg ? errhint("%s", _(hintmsg)) : 0));

 "temporal value" sounds odd to my ears. I'd rather keep in line with
 what guc.c outputs in such a case.
Yeah, I have been pondering about that a bit, and I think that
wal_availability_check_interval is the closest thing, as we want to
check whether WAL is available at the record level for a walreceiver
and at the segment level for a restore_command.

 Am I missing something or does this handle/affect streaming failures
 just the same? That might not be a bad idea, because the current
 interval is far too long for e.g. a sync replication setup. But it
 certainly makes the name a complete misnomer.
Hm.

 Not this patch's fault, but I'm getting a bit tired seeing the above open
 coded. How about adding a function that does the sleeping based on a
 timestamptz and a ms interval?
You mean in plugins, right? I don't recall seeing similar patterns in
other code paths of backend. But I think that we can use something
like that in timestamp.c then because we need to leverage that between
two timestamps, the last failure and now():
TimestampSleepDifference(start_time, stop_time, interval_ms);
Perhaps you have something else in mind?

Attached is an updated patch.
-- 
Michael
From 0206c417f5b28eadb5f9d67ee0643320d7c0b621 Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 19 Jan 2015 16:08:48 +0900
Subject: [PATCH] Add wal_availability_check_interval to control WAL fetching
 on failure

Default value is 5s.
---
 doc/src/sgml/recovery-config.sgml   | 21 +
 src/backend/access/transam/recovery.conf.sample | 10 +++
 src/backend/access/transam/xlog.c   | 39 -
 src/backend/utils/adt/timestamp.c   | 19 
 src/include/utils/timestamp.h   |  2 ++
 5 files changed, 83 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index 0c64ff2..7fd51ce 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -145,6 +145,27 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p'  # Windows
   </listitem>
  </varlistentry>
 
+ <varlistentry id="wal-availability-check-interval" xreflabel="wal_availability_check_interval">
+  <term><varname>wal_availability_check_interval</varname> (<type>integer</type>)
+  <indexterm>
+    <primary><varname>wal_availability_check_interval</> recovery parameter</primary>
+  </indexterm>
+  </term>
+  <listitem>
+   <para>
+    This parameter specifies the amount of time to wait when
+    WAL is not available for a node in recovery. Default value is
+    <literal>5s</>.
+   </para>
+   <para>
+    A node in recovery will wait for this amount of time if
+    <varname>restore_command</> returns 

Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-01-16 Thread Andres Freund
Hi,

On 2015-01-05 17:16:43 +0900, Michael Paquier wrote:
 + <varlistentry id="restore-command-retry-interval" xreflabel="restore_command_retry_interval">
 +  <term><varname>restore_command_retry_interval</varname> (<type>integer</type>)
 +  <indexterm>
 +    <primary><varname>restore_command_retry_interval</> recovery parameter</primary>
 +  </indexterm>
 +  </term>
 +  <listitem>
 +   <para>
 +    If <varname>restore_command</> returns nonzero exit status code, retry
 +    command after the interval of time specified by this parameter,
 +    measured in milliseconds if no unit is specified. Default value is
 +    <literal>5s</>.
 +   </para>
 +   <para>
 +    This command is particularly useful for warm standby configurations
 +    to leverage the amount of retries done to restore a WAL segment file
 +    depending on the activity of a master node.
 +   </para>

Don't really like this paragraph. What's "leveraging the amount of
retries done" supposed to mean?

 +# restore_command_retry_interval
 +#
 +# specifies an optional retry interval for restore_command command, if
 +# previous command returned nonzero exit status code. This can be useful
 +# to increase or decrease the number of calls of restore_command.

It's not really the number but the frequency, right?

 +	else if (strcmp(item->name, "restore_command_retry_interval") == 0)
 +	{
 +		const char *hintmsg;
 +
 +		if (!parse_int(item->value, &restore_command_retry_interval, GUC_UNIT_MS,
 +					   &hintmsg))
 +			ereport(ERROR,
 +					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 +					 errmsg("parameter \"%s\" requires a temporal value",
 +							"restore_command_retry_interval"),
 +					 hintmsg ? errhint("%s", _(hintmsg)) : 0));

"temporal value" sounds odd to my ears. I'd rather keep in line with
what guc.c outputs in such a case.

 +	now = GetCurrentTimestamp();
 +	if (!TimestampDifferenceExceeds(last_fail_time, now,
 +									restore_command_retry_interval))
 	{
 -		pg_usleep(1000000L * (5 - (now - last_fail_time)));
 -		now = (pg_time_t) time(NULL);
 +		long	secs;
 +		int		microsecs;
 +
 +		TimestampDifference(last_fail_time, now, &secs, &microsecs);
 +		pg_usleep(restore_command_retry_interval * 1000L -
 +				  (1000000L * secs + 1L * microsecs));
 +		now = GetCurrentTimestamp();
 	}
 	last_fail_time = now;
 	currentSource = XLOG_FROM_ARCHIVE;

Am I missing something or does this handle/affect streaming failures
just the same? That might not be a bad idea, because the current
interval is far too long for e.g. a sync replication setup. But it
certainly makes the name a complete misnomer.

Not this patch's fault, but I'm getting a bit tired of seeing the above open
coded. How about adding a function that does the sleeping based on a
timestamptz and a ms interval?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-01-05 Thread Michael Paquier
On Wed, Dec 31, 2014 at 12:22 AM, Alexey Vasiliev leopard...@inbox.ru wrote:
 Tue, 30 Dec 2014 21:39:33 +0900 from Michael Paquier 
 michael.paqu...@gmail.com:
 As I understand it, now = (pg_time_t) time(NULL); returns time in seconds,
 which is why I multiply the value by 1000 to compare it with
 restore_command_retry_interval in milliseconds.
This way of doing it is incorrect: you would get a wait time of 1s even
for retries lower than 1s. In order to get the feature working
correctly with sufficient comparison granularity, you need to
use TimestampTz for the tracking of the current and last failure times
and to use the TimestampTz APIs for difference calculations.

 I am not sure about small retry intervals; in my cases I need an interval
 bigger than 5 seconds (20-40 seconds). Right now I limit this value to be
 bigger than 100 milliseconds.
Other people may disagree here, but I don't see any reason to put
restrictions on this interval of time.

Attached is a patch fixing the feature to use TimestampTz, updating as
well the docs and recovery.conf.sample which was incorrect, on top of
other things I noticed here and there.

Alexey, does this new patch look fine for you?
-- 
Michael
From 9bb71e39b218885466a53c005a87361d4ae889fa Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 5 Jan 2015 17:10:13 +0900
Subject: [PATCH] Add restore_command_retry_interval to control retries of
 restore_command

restore_command_retry_interval can be specified in recovery.conf to control
the interval of time between retries of restore_command when failures occur.
---
 doc/src/sgml/recovery-config.sgml   | 21 +
 src/backend/access/transam/recovery.conf.sample |  9 ++
 src/backend/access/transam/xlog.c   | 42 -
 3 files changed, 64 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index 0c64ff2..0488ae3 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -145,6 +145,27 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p'  # Windows
   </listitem>
  </varlistentry>
 
+ varlistentry id=restore-command-retry-interval xreflabel=restore_command_retry_interval
+  termvarnamerestore_command_retry_interval/varname (typeinteger/type)
+  indexterm
+primaryvarnamerestore_command_retry_interval/ recovery parameter/primary
+  /indexterm
+  /term
+  listitem
+   para
+If varnamerestore_command/ returns nonzero exit status code, retry
+command after the interval of time specified by this parameter,
+measured in milliseconds if no unit is specified. Default value is
+literal5s/.
+   /para
+   para
+This command is particularly useful for warm standby configurations
+to leverage the amount of retries done to restore a WAL segment file
+depending on the activity of a master node.
+   /para
+  /listitem
+ /varlistentry
+
 /variablelist
 
   /sect1
diff --git a/src/backend/access/transam/recovery.conf.sample b/src/backend/access/transam/recovery.conf.sample
index b777400..8699a33 100644
--- a/src/backend/access/transam/recovery.conf.sample
+++ b/src/backend/access/transam/recovery.conf.sample
@@ -58,6 +58,15 @@
 #
 #recovery_end_command = ''
 #
+#
+# restore_command_retry_interval
+#
+# specifies an optional retry interval for restore_command, if the
+# previous command returned a nonzero exit status code. This can be useful
+# to increase or decrease the number of calls of restore_command.
+#
+#restore_command_retry_interval = '5s'
+#
 #---
 # RECOVERY TARGET PARAMETERS
 #---
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5cc7e47..1944e6f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -235,6 +235,7 @@ static TimestampTz recoveryTargetTime;
 static char *recoveryTargetName;
 static int	recovery_min_apply_delay = 0;
 static TimestampTz recoveryDelayUntilTime;
+static int	restore_command_retry_interval = 5000;
 
 /* options taken from recovery.conf for XLOG streaming */
 static bool StandbyModeRequested = false;
@@ -4881,6 +4882,26 @@ readRecoveryCommandFile(void)
 					(errmsg_internal("trigger_file = '%s'",
 									 TriggerFile)));
 		}
+		else if (strcmp(item->name, "restore_command_retry_interval") == 0)
+		{
+			const char *hintmsg;
+
+			if (!parse_int(item->value, &restore_command_retry_interval, GUC_UNIT_MS,
+						   &hintmsg))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("parameter \"%s\" requires a temporal value",
+								"restore_command_retry_interval"),
+						 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+			ereport(DEBUG2,
+					(errmsg_internal("restore_command_retry_interval = '%s'",
+									 item->value)));

Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-01-05 Thread Michael Paquier
On Mon, Jan 5, 2015 at 10:39 PM, Alexey Vasiliev leopard...@inbox.ru wrote:
 Hello. Thanks for the help. Yes, the new patch looks fine!
OK cool. Let's get an opinion from a committer then.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2014-12-30 Thread Alexey Vasiliev
 Hello.

Thanks, I understand — I was looking at another part of the code. Hopefully I 
have now made the right changes.

To avoid modifying the current pg_usleep calculation, I changed the 
restore_command_retry_interval value to seconds (not milliseconds). In this 
case, the minimum value is 1 second.


On Mon, 29 Dec 2014 00:15:03 +0900, Michael Paquier michael.paqu...@gmail.com wrote:
On Sat, Dec 27, 2014 at 3:42 AM, Alexey Vasiliev  leopard...@inbox.ru  wrote:
 Thanks for suggestions.

 Patch updated.

Cool, thanks. I just had an extra look at it.

+This is useful, if I using for restore of wal logs some
+external storage (like AWS S3) and no matter what the slave database
+will lag behind the master. The problem, what for each request to
+AWS S3 need to pay, what is why for N nodes, which try to get next
+wal log each 5 seconds will be bigger price, than for example each
+30 seconds.
I reworked this portion of the docs; it is rather incorrect, as the
documentation should not use first-person subjects, and I don't
believe that referencing any commercial products is a good thing in
this context.

+# specifies an optional timeout after nonzero code of restore_command.
+# This can be useful to increase/decrease number of a restore_command calls.
This is still referring to a timeout. That's not good. And the name of
the parameter at the top of this comment block is missing.

+static int restore_command_retry_interval = 5000L;
I think that it would be more appropriate to set that to 5000 and
multiply by 1L. I am also wondering about a better lower bound, like
100ms, to avoid abuse of this feature in the retries?

+   ereport(ERROR,
+           (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+            errmsg("\"%s\" must be bigger zero",
+                   "restore_command_retry_interval")));
I'd rather rewrite that to must have a strictly positive value.

-     * Wait for more WAL to arrive. Time out after 5 seconds,
+     * Wait for more WAL to arrive. Time out after
+     * restore_command_retry_interval (5 seconds by default),
      * like when polling the archive, to react to a trigger
      * file promptly.
      */
     WaitLatch(&XLogCtl->recoveryWakeupLatch,
               WL_LATCH_SET | WL_TIMEOUT,
-              5000L);
+              restore_command_retry_interval);
I should have noticed earlier, but in its current state your patch
actually does not work. What you are doing here is tuning the time
the process waits for WAL from the stream. In your case, what you want
to control is the retry time for restore_command in archive recovery,
no?
-- 
Michael




-- 
Alexey Vasiliev
diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index ef78bc0..38420a5 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -145,6 +145,26 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p'  # Windows
   </listitem>
  </varlistentry>
 
+ <varlistentry id="restore-command-retry-interval" xreflabel="restore_command_retry_interval">
+  <term><varname>restore_command_retry_interval</varname> (<type>integer</type>)
+  <indexterm>
+    <primary><varname>restore_command_retry_interval</> recovery parameter</primary>
+  </indexterm>
+  </term>
+  <listitem>
+   <para>
+    If <varname>restore_command</> returns nonzero exit status code, retry
+    command after the interval of time specified by this parameter.
+    Default value is <literal>5s</>.
+   </para>
+   <para>
+    This is useful, if I using for restore of wal logs some
+    external storage and no matter what the slave database
+    will lag behind the master.
+   </para>
+  </listitem>
+ </varlistentry>
+
 </variablelist>
 
   </sect1>
diff --git a/src/backend/access/transam/recovery.conf.sample b/src/backend/access/transam/recovery.conf.sample
index b777400..5b63f60 100644
--- a/src/backend/access/transam/recovery.conf.sample
+++ b/src/backend/access/transam/recovery.conf.sample
@@ -58,6 +58,11 @@
 #
 #recovery_end_command = ''
 #
+# specifies an optional retry interval of restore_command command, if previous return nonzero exit status code.
+# This can be useful to increase/decrease number of a restore_command calls.
+#
+#restore_command_retry_interval = 5s
+#
 #---
 # RECOVERY TARGET PARAMETERS
 

Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2014-12-30 Thread Michael Paquier
On Tue, Dec 30, 2014 at 9:33 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Tue, Dec 30, 2014 at 9:10 PM, Alexey Vasiliev leopard...@inbox.ru wrote:
 To avoid modifying the current pg_usleep calculation, I changed the
 restore_command_retry_interval value to seconds (not milliseconds). In this
 case, the minimum value is 1 second.
 Er, what is the problem with not changing 100L to 1000L? The unit of
 your parameter is ms AFAIK.
Of course I meant in the previous version of the patch, not the current
one. Wouldn't it be useful to allow, for example, retry intervals on
the order of 100ms~300ms in some cases?
-- 
Michael




Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2014-12-28 Thread Michael Paquier
On Sat, Dec 27, 2014 at 3:42 AM, Alexey Vasiliev leopard...@inbox.ru wrote:
 Thanks for suggestions.

 Patch updated.

Cool, thanks. I just had an extra look at it.

+This is useful, if I using for restore of wal logs some
+external storage (like AWS S3) and no matter what the slave database
+will lag behind the master. The problem, what for each request to
+AWS S3 need to pay, what is why for N nodes, which try to get next
+wal log each 5 seconds will be bigger price, than for example each
+30 seconds.
I reworked this portion of the docs; it is rather incorrect, as the
documentation should not use first-person subjects, and I don't
believe that referencing any commercial products is a good thing in
this context.

+# specifies an optional timeout after nonzero code of restore_command.
+# This can be useful to increase/decrease number of a restore_command calls.
This is still referring to a timeout. That's not good. And the name of
the parameter at the top of this comment block is missing.

+static int restore_command_retry_interval = 5000L;
I think that it would be more appropriate to set that to 5000 and
multiply by 1L. I am also wondering about a better lower bound, like
100ms, to avoid abuse of this feature in the retries?

+   ereport(ERROR,
+           (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+            errmsg("\"%s\" must be bigger zero",
+                   "restore_command_retry_interval")));
I'd rather rewrite that to must have a strictly positive value.

-     * Wait for more WAL to arrive. Time out after 5 seconds,
+     * Wait for more WAL to arrive. Time out after
+     * restore_command_retry_interval (5 seconds by default),
      * like when polling the archive, to react to a trigger
      * file promptly.
      */
     WaitLatch(&XLogCtl->recoveryWakeupLatch,
               WL_LATCH_SET | WL_TIMEOUT,
-              5000L);
+              restore_command_retry_interval);
I should have noticed earlier, but in its current state your patch
actually does not work. What you are doing here is tuning the time
the process waits for WAL from the stream. In your case, what you want
to control is the retry time for restore_command in archive recovery,
no?
-- 
Michael




[HACKERS] Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2014-12-26 Thread Alexey Vasiliev
 Thanks for suggestions.

Patch updated.


On Mon, 22 Dec 2014 12:07:06 +0900, Michael Paquier michael.paqu...@gmail.com wrote:
On Tue, Nov 4, 2014 at 6:25 AM, Alexey Vasiliev  leopard...@inbox.ru  wrote:
 Added new patch.
Seems useful to me to be able to tune this interval of time.

I would simply reword the documentation as follows:
If varnamerestore_command/ returns nonzero exit status code, retry
command after the interval of time specified by this parameter.
Default value is literal5s/.

Also, I think that it would be a good idea to error out if this
parameter has a value of, let's say, less than 1s, instead of doing a
check for a positive value in the waiting latch. On top of that, the
default value of restore_command_retry_interval should be 5000 and not
0 to simplify the code.
-- 
Michael




-- 
Alexey Vasiliev
diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index ef78bc0..53ccf13 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -145,6 +145,29 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p'  # Windows
   </listitem>
  </varlistentry>
 
+ <varlistentry id="restore-command-retry-interval" xreflabel="restore_command_retry_interval">
+  <term><varname>restore_command_retry_interval</varname> (<type>integer</type>)
+  <indexterm>
+    <primary><varname>restore_command_retry_interval</> recovery parameter</primary>
+  </indexterm>
+  </term>
+  <listitem>
+   <para>
+    If <varname>restore_command</> returns nonzero exit status code, retry
+    command after the interval of time specified by this parameter.
+    Default value is <literal>5s</>.
+   </para>
+   <para>
+    This is useful, if I using for restore of wal logs some
+    external storage (like AWS S3) and no matter what the slave database
+    will lag behind the master. The problem, what for each request to
+    AWS S3 need to pay, what is why for N nodes, which try to get next
+    wal log each 5 seconds will be bigger price, than for example each
+    30 seconds.
+   </para>
+  </listitem>
+ </varlistentry>
+
 </variablelist>
 
   </sect1>
diff --git a/src/backend/access/transam/recovery.conf.sample b/src/backend/access/transam/recovery.conf.sample
index b777400..7ed6f3b 100644
--- a/src/backend/access/transam/recovery.conf.sample
+++ b/src/backend/access/transam/recovery.conf.sample
@@ -58,6 +58,11 @@
 #
 #recovery_end_command = ''
 #
+# specifies an optional timeout after nonzero code of restore_command.
+# This can be useful to increase/decrease number of a restore_command calls.
+#
+#restore_command_retry_interval = 5s
+#
 #---
 # RECOVERY TARGET PARAMETERS
 #---
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e54..02c55a8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -235,6 +235,7 @@ static TimestampTz recoveryTargetTime;
 static char *recoveryTargetName;
 static int	recovery_min_apply_delay = 0;
 static TimestampTz recoveryDelayUntilTime;
+static int 	restore_command_retry_interval = 5000L;
 
 /* options taken from recovery.conf for XLOG streaming */
 static bool StandbyModeRequested = false;
@@ -4881,6 +4882,28 @@ readRecoveryCommandFile(void)
 					(errmsg_internal("trigger_file = '%s'",
 									 TriggerFile)));
 		}
+		else if (strcmp(item->name, "restore_command_retry_interval") == 0)
+		{
+			const char *hintmsg;
+
+			if (!parse_int(item->value, &restore_command_retry_interval, GUC_UNIT_MS,
+						   &hintmsg))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("parameter \"%s\" requires a temporal value",
+								"restore_command_retry_interval"),
+						 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+			ereport(DEBUG2,
+					(errmsg_internal("restore_command_retry_interval = '%s'", item->value)));
+
+			if (restore_command_retry_interval <= 0)
+			{
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("\"%s\" must be bigger zero",
+							"restore_command_retry_interval")));
+			}
+		}
 		else if (strcmp(item-name, recovery_min_apply_delay) == 0)
 		{
 			const char *hintmsg;
@@ -10658,13 +10681,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	}
 
 	/*
-	 * Wait for more WAL to arrive. Time out after 5 seconds,
+	 * Wait for more WAL to arrive. Time out after
+	 * restore_command_retry_interval (5 seconds by default),
 	 * like when polling the archive, to react to a trigger
 	 * file promptly.
 	 */
 	WaitLatch(&XLogCtl->recoveryWakeupLatch,
 			  WL_LATCH_SET | WL_TIMEOUT,
-			  5000L);
+			  restore_command_retry_interval);

Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2014-11-04 Thread Andres Freund
Hi,

On 2014-11-03 14:04:00 +0300, Alexey Vasiliev wrote:
 *  What the patch does in a short paragraph: This patch should add option 
 recovery_timeout, which help to control timeout of restore_command nonzero 
 status code. Right now default value is 5 seconds. This is useful, if I using 
 for restore of wal logs some external storage (like AWS S3) and no matter 
 what the slave database will lag behind the master. The problem, what for 
 each request to AWS S3 need to pay, what is why for N nodes, which try to get 
 next wal log each 5 seconds will be bigger price, than for example each 30 
 seconds. Before I do this in this way:  if ! (/usr/local/bin/envdir 
 /etc/wal-e.d/env /usr/local/bin/wal-e wal-fetch %f %p); then sleep 60; fi 
 . But in this case restart/stop database slower.

Without saying that the feature is unneccessary, wouldn't this better be
solved by using streaming rep most of the time?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2014-11-03 Thread Peter Eisentraut
On 11/3/14 6:04 AM, Alexey Vasiliev wrote:
  3. What the patch does in a short paragraph: This patch should add
 option recovery_timeout, which help to control timeout of
 restore_command nonzero status code. Right now default value is 5
 seconds. This is useful, if I using for restore of wal logs some
 external storage (like AWS S3) and no matter what the slave database
 will lag behind the master. The problem, what for each request to
 AWS S3 need to pay, what is why for N nodes, which try to get next
 wal log each 5 seconds will be bigger price, than for example each
 30 seconds.

That seems useful.  I would include something about this use case in the
documentation.

 This is my first patch. I am not sure about name of option. Maybe it
 should called recovery_nonzero_timeout.

The option name had me confused.  At first I though this is the time
after which a running restore_command invocation gets killed.  I think a
more precise description might be restore_command_retry_interval.





Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2014-11-03 Thread Fabrízio de Royes Mello
On Mon, Nov 3, 2014 at 7:25 PM, Alexey Vasiliev leopard...@inbox.ru wrote:




 On Mon, 3 Nov 2014 19:18:51 -0200, Fabrízio de Royes Mello
 fabriziome...@gmail.com wrote:

   Should I change my patch and send it by email? And also, as I
   understand, I should change the message ID for
   https://commitfest.postgresql.org/action/patch_view?id=1636 , shouldn't I?

  You should just send another version of your patch by email and add a
  new comment to the commitfest using the Patch comment type.


 Added new patch.


And you should add the patch to an open commitfest, not to one in progress.
The next open one is 2014-12 [1].


Regards,


[1] https://commitfest.postgresql.org/action/commitfest_view?id=25

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog: http://fabriziomello.github.io
 Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
 Github: http://github.com/fabriziomello