Re: Delay of standby shutdown

2020-11-16 Thread Fujii Masao




On 2020/11/13 2:58, Soumyadeep Chakraborty wrote:

Thanks! Marking this as ready for committer.


Pushed. Thanks!

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Delay of standby shutdown

2020-11-12 Thread Soumyadeep Chakraborty
Thanks! Marking this as ready for committer.

Regards,
Soumyadeep (VMware)




Re: Delay of standby shutdown

2020-11-11 Thread Fujii Masao



On 2020/11/04 9:35, Soumyadeep Chakraborty wrote:

Hello Fujii,

On Thu, Sep 17, 2020 at 6:49 AM Fujii Masao  wrote:

As far as I understand after debugging, the problem is as follows:


Yes.




1. After the primary is stopped, and the standby startup process is
waiting inside:

(void) WaitLatch(>recoveryWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT |
WL_EXIT_ON_PM_DEATH,
wait_time,
WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);

it receives SIGTERM on account of stopping the standby and it leads to
the WaitLatch call returning with WL_LATCH_SET.

2. WaitForWALToBecomeAvailable() then will return true after calling
XLogFileReadAnyTLI() and eventually, XLogReadRecord() will return NULL
since there is no new WAL to read, which means ReadRecord() will loop
back and perform another XLogReadRecord().

3. The additional XLogReadRecord() will lead to a 5s wait inside
WaitForWALToBecomeAvailable() - a different WaitLatch() call this time:

/*
  * Wait for more WAL to arrive. Time out after 5 seconds
  * to react to a trigger file promptly and to check if the
  * WAL receiver is still active.
  */
(void) WaitLatch(>recoveryWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT |
WL_EXIT_ON_PM_DEATH,
5000L, WAIT_EVENT_RECOVERY_WAL_STREAM);

4. And then eventually, the code will handle interrupts here inside
WaitForWALToBecomeAvailable() after the 5s wait:

/*
  * This possibly-long loop needs to handle interrupts of startup
  * process.
  */
HandleStartupProcInterrupts();


To avoid this issue, I think that ReadRecord() should call
HandleStartupProcInterrupts() whenever looping back to retry.


Alternatively, perhaps we can place it inside
WaitForWALToBecomeAvailable() (which already has a similar call),
since it is more semantically aligned to checking for interrupts, rather
than ReadRecord()? Like this:


Yes. Your approach looks better to me.
Attached is the updated version of the patch implementing that.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index aa63f37615..7d97b96e72 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12225,6 +12225,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool 
randAccess,

 WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);
ResetLatch(MyLatch);
now = GetCurrentTimestamp();
+
+   /* Handle interrupt signals of 
startup process */
+   HandleStartupProcInterrupts();
}
last_fail_time = now;
currentSource = XLOG_FROM_ARCHIVE;


Re: Delay of standby shutdown

2020-11-03 Thread Soumyadeep Chakraborty
Hello Fujii,

On Thu, Sep 17, 2020 at 6:49 AM Fujii Masao  wrote:

As far as I understand after debugging, the problem is as follows:

1. After the primary is stopped, and the standby startup process is
waiting inside:

(void) WaitLatch(>recoveryWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT |
WL_EXIT_ON_PM_DEATH,
wait_time,
WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);

it receives SIGTERM on account of stopping the standby and it leads to
the WaitLatch call returning with WL_LATCH_SET.

2. WaitForWALToBecomeAvailable() then will return true after calling
XLogFileReadAnyTLI() and eventually, XLogReadRecord() will return NULL
since there is no new WAL to read, which means ReadRecord() will loop
back and perform another XLogReadRecord().

3. The additional XLogReadRecord() will lead to a 5s wait inside
WaitForWALToBecomeAvailable() - a different WaitLatch() call this time:

/*
 * Wait for more WAL to arrive. Time out after 5 seconds
 * to react to a trigger file promptly and to check if the
 * WAL receiver is still active.
 */
(void) WaitLatch(>recoveryWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT |
WL_EXIT_ON_PM_DEATH,
5000L, WAIT_EVENT_RECOVERY_WAL_STREAM);

4. And then eventually, the code will handle interrupts here inside
WaitForWALToBecomeAvailable() after the 5s wait:

/*
 * This possibly-long loop needs to handle interrupts of startup
 * process.
 */
HandleStartupProcInterrupts();

> To avoid this issue, I think that ReadRecord() should call
> HandleStartupProcInterrupts() whenever looping back to retry.

Alternatively, perhaps we can place it inside
WaitForWALToBecomeAvailable() (which already has a similar call),
since it is more semantically aligned to checking for interrupts, rather
than ReadRecord()? Like this:

diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index 52a67b117015..b05cf6c7c219 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12249,6 +12249,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr
RecPtr, bool randAccess,

WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);
 ResetLatch(>recoveryWakeupLatch);
 now = GetCurrentTimestamp();
+/*
+ * Check for interrupts
+ */
+HandleStartupProcInterrupts();
 }
 last_fail_time = now;
 currentSource = XLOG_FROM_ARCHIVE;

It also has the advantage of being in a slightly less "hot" code path
as compared to it being where you suggested (the location you suggested
is executed infinitely when a standby is not connected to a primary and
there is no more WAL to read)


Regards,

Soumyadeep and Georgios




Delay of standby shutdown

2020-09-17 Thread Fujii Masao

Hi,

When I shut down the standby server just after setting up replication
environment and shutting down the primary, I found that it took
five seconds to complete the shutdown of the standby server. Note that
no WAL was generated between the setup of replication and the server
shutdown, in this case. That is, basically neither WAL replay nor
shutdown restart point didn't cause this issue, I think. Here are
the log messages in that standby. I have observed this issue some times
so far.

2020-09-17 21:29:54 JST [postmaster] sby1 LOG:  received fast shutdown request
2020-09-17 21:29:54 JST [postmaster] sby1 LOG:  aborting any active transactions
2020-09-17 21:29:59 JST [checkpointer] sby1 LOG:  shutting down
2020-09-17 21:29:59 JST [checkpointer] sby1 LOG:  restartpoint starting: 
shutdown immediate
2020-09-17 21:29:59 JST [checkpointer] sby1 LOG:  restartpoint complete: wrote 
0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.014 s, 
sync=0.000 s, total=0.022 s; sync files=0, longest=0.000 s, average=0.000 s; 
distance=16384 kB, estimate=16384 kB
2020-09-17 21:29:59 JST [checkpointer] sby1 LOG:  recovery restart point at 
0/560
2020-09-17 21:29:59 JST [postmaster] sby1 LOG:  database system is shut down

Per my analysis, this issue could happen in the following way.

1. In WaitForWALToBecomeAvailable(), the startup proesss receives
the shutdown request during sleeping wal_retrieve_retry_interval.
2. In ReadRecord(), the startup process thinks no valid record available
and loops back to retry reading WAL from stream.
3. In WaitForWALToBecomeAvailable(), the startup proesss tries to
start up walreceiver and then waits for recoveryWakeupLatch.
4. After the timeout (5 seconds is used for that wait for recoveryWakeupLatch)
passes, HandleStartupProcInterrupts() is called and the shutdown request
is processed.

So waiting at #3 for recoveryWakeupLatch caused the delay of
the standby shutdown.

To avoid this issue, I think that ReadRecord() should call
HandleStartupProcInterrupts() whenever looping back to retry.
Patch attached. Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 61754312e2..faf8d9c4f0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4456,15 +4456,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 */
lastSourceFailed = false;
currentSource = XLOG_FROM_ANY;
-
-   continue;
}
-
/* In standby mode, loop back to retry. Otherwise, give 
up. */
-   if (StandbyMode && !CheckForStandbyTrigger())
-   continue;
-   else
+   else if (!StandbyMode || CheckForStandbyTrigger())
return NULL;
+
+   /*
+* This possibly-long loop needs to handle interrupts 
of startup
+* process.
+*/
+   HandleStartupProcInterrupts();
}
}
 }