Re: [HACKERS] Some problems about cascading replication
On Tue, Aug 16, 2011 at 11:56 PM, Heikki Linnakangas wrote:
> I think there's a race condition here. If a walsender is just starting up,
> it might not have registered itself as a walsender yet. It's actually been
> there before this patch to suppress the log message.

Right. To address this problem, I changed the patch so that a "dead-end"
walsender (i.e., a cascading walsender that is running even though recovery
is no longer in progress) always emits the log message. This change can
cause duplicate log messages if standby promotion is requested while
multiple walsenders, including a "dead-end" one, are running. But since
this is unlikely to happen, I don't think it's worth writing code to
suppress those duplicate log messages. Comments?

I attached the updated version of the patch.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index c0a32a3..90dad2c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2328,10 +2328,11 @@ reaper(SIGNAL_ARGS)
 			 * XXX should avoid the need for disconnection. When we do,
 			 * am_cascading_walsender should be replaced with RecoveryInProgress()
 			 */
-			if (max_wal_senders > 0)
+			if (max_wal_senders > 0 && CountChildren(BACKEND_TYPE_WALSND) > 0)
 			{
 				ereport(LOG,
-						(errmsg("terminating all walsender processes to force cascaded standby(s) to update timeline and reconnect")));
+						(errmsg("terminating all walsender processes to force cascaded "
+								"standby(s) to update timeline and reconnect")));
 				SignalSomeChildren(SIGUSR2, BACKEND_TYPE_WALSND);
 			}
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0eadf64..813c998 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -368,6 +368,35 @@ StartReplication(StartReplicationCmd *cmd)
 	SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
 
 	/*
+	 * When promoting a cascading standby, postmaster sends SIGUSR2 to
+	 * any cascading walsenders to kill them. But there is a corner-case where
+	 * such walsender fails to receive SIGUSR2 and survives a standby promotion
+	 * unexpectedly. This happens when postmaster sends SIGUSR2 before
+	 * the walsender marks itself as a WAL sender, because postmaster sends
+	 * SIGUSR2 to only the processes marked as a WAL sender.
+	 *
+	 * To avoid this corner-case, if recovery is NOT in progress even though
+	 * the walsender is cascading one, we do the same thing as SIGUSR2 signal
+	 * handler does, i.e., set walsender_ready_to_stop to true. Which causes
+	 * the walsender to end later.
+	 *
+	 * When terminating cascading walsenders, usually postmaster writes
+	 * the log message announcing the terminations. But there is a race condition
+	 * here. If there is no walsender except this process before reaching here,
+	 * postmaster thinks that there is no walsender and suppresses that
+	 * log message. To handle this case, we always emit that log message here.
+	 * This might cause duplicate log messages, but which is less likely to happen,
+	 * so it's not worth writing some code to suppress them.
+	 */
+	if (am_cascading_walsender && !RecoveryInProgress())
+	{
+		ereport(LOG,
+				(errmsg("terminating walsender process to force cascaded standby "
+						"to update timeline and reconnect")));
+		walsender_ready_to_stop = true;
+	}
+
+	/*
 	 * We assume here that we're logging enough information in the WAL for
 	 * log-shipping, since this is checked in PostmasterMain().
 	 *

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
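The duplicate-message trade-off described in this message can be tallied with a small model. This is only a sketch under stated assumptions: toy_promotion_log_count is a hypothetical helper, not PostgreSQL code. It assumes max_wal_senders > 0, that the postmaster logs once when it counts at least one registered walsender, and that every "dead-end" walsender logs once on its own.

```c
/*
 * Hypothetical tally of "terminating ... walsender" log lines written
 * during one standby promotion. Not part of the patch.
 *
 * registered: walsenders already marked as WAL senders when the
 *             postmaster handles the promotion (assumes max_wal_senders > 0)
 * dead_end:   cascading walsenders that mark themselves only after
 *             recovery has ended, so they missed SIGUSR2
 */
static int
toy_promotion_log_count(int registered, int dead_end)
{
	int		logs = 0;

	/* Postmaster side: one message if it counted any walsender. */
	if (registered > 0)
		logs++;

	/* Walsender side: each dead-end walsender always logs by itself. */
	logs += dead_end;

	return logs;
}
```

With a lone dead-end walsender the model yields exactly one message, which is the case the updated patch is meant to cover. With registered walsenders plus a dead-end one it yields two, which is the duplicate that is judged not worth suppressing.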
Re: [HACKERS] Some problems about cascading replication
On 16.08.2011 16:25, Simon Riggs wrote:
> On Tue, Aug 16, 2011 at 9:55 AM, Fujii Masao wrote:
>> When I tested PITR on git master with max_wal_senders > 0,
>> I found that the following inappropriate log message was always
>> output even though cascading replication is not in progress. Attached
>> patch fixes this problem.
>>
>> LOG: terminating all walsender processes to force cascaded
>> standby(s) to update timeline and reconnect
>>
>> When making the patch, I found another problem with cascading
>> replication: when promoting a cascading standby, postmaster sends
>> SIGUSR2 to any cascading walsenders to kill them. But there is a
>> corner case where such a walsender fails to receive SIGUSR2 and
>> unexpectedly survives the standby promotion. This happens when
>> postmaster sends SIGUSR2 before the walsender marks itself as
>> a WAL sender, because postmaster sends SIGUSR2 only to the
>> processes marked as WAL senders.
>>
>> To avoid the corner case, I changed walsender so that it checks
>> again whether recovery is in progress after marking itself
>> as a WAL sender. If recovery is not in progress even though the
>> walsender is a cascading one, it does the same thing as the SIGUSR2
>> signal handler does, and then exits later. Attached patch also includes
>> this fix.
>
> Looks like valid problems and appropriate fixes to me.
>
> Will commit.

I think there's a race condition here. If a walsender is just starting up,
it might not have registered itself as a walsender yet. It's actually been
there before this patch to suppress the log message.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
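One way to picture the race raised here, as a sketch only: the patch guards the log message with CountChildren(BACKEND_TYPE_WALSND) > 0, and a walsender that is still starting up is not yet counted. The ToySlot type and toy_* functions below are hypothetical stand-ins invented for this illustration; they do not reproduce the postmaster's real child bookkeeping.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical child slot; not the postmaster's real data structure. */
typedef struct ToySlot
{
	bool	in_use;					/* a child process occupies the slot */
	bool	registered_walsender;	/* child has marked itself as a WAL sender */
} ToySlot;

/* Counts only children that have already registered as walsenders. */
static int
toy_count_walsenders(const ToySlot *slots, size_t n)
{
	int		count = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (slots[i].in_use && slots[i].registered_walsender)
			count++;
	}
	return count;
}

/* Mirrors the shape of the patched condition in reaper(). */
static bool
toy_postmaster_would_log(int max_wal_senders, const ToySlot *slots, size_t n)
{
	return max_wal_senders > 0 && toy_count_walsenders(slots, n) > 0;
}

/* One walsender exists; 'registered' says whether it has marked itself yet. */
static bool
toy_log_emitted_for_single_walsender(bool registered)
{
	ToySlot	slots[1] = {{true, registered}};

	return toy_postmaster_would_log(5, slots, 1);
}
```

In this model a walsender that is running but not yet registered makes the count zero, so the message is skipped even though a walsender is about to be affected by the promotion.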
Re: [HACKERS] Some problems about cascading replication
On Tue, Aug 16, 2011 at 9:55 AM, Fujii Masao wrote:
> When I tested PITR on git master with max_wal_senders > 0,
> I found that the following inappropriate log message was always
> output even though cascading replication is not in progress. Attached
> patch fixes this problem.
>
> LOG: terminating all walsender processes to force cascaded
> standby(s) to update timeline and reconnect
>
> When making the patch, I found another problem with cascading
> replication: when promoting a cascading standby, postmaster sends
> SIGUSR2 to any cascading walsenders to kill them. But there is a
> corner case where such a walsender fails to receive SIGUSR2 and
> unexpectedly survives the standby promotion. This happens when
> postmaster sends SIGUSR2 before the walsender marks itself as
> a WAL sender, because postmaster sends SIGUSR2 only to the
> processes marked as WAL senders.
>
> To avoid the corner case, I changed walsender so that it checks
> again whether recovery is in progress after marking itself
> as a WAL sender. If recovery is not in progress even though the
> walsender is a cascading one, it does the same thing as the SIGUSR2
> signal handler does, and then exits later. Attached patch also includes
> this fix.

Looks like valid problems and appropriate fixes to me.

Will commit.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] Some problems about cascading replication
Hi,

When I tested PITR on git master with max_wal_senders > 0,
I found that the following inappropriate log message was always
output even though cascading replication is not in progress. Attached
patch fixes this problem.

LOG: terminating all walsender processes to force cascaded
standby(s) to update timeline and reconnect

When making the patch, I found another problem with cascading
replication: when promoting a cascading standby, postmaster sends
SIGUSR2 to any cascading walsenders to kill them. But there is a
corner case where such a walsender fails to receive SIGUSR2 and
unexpectedly survives the standby promotion. This happens when
postmaster sends SIGUSR2 before the walsender marks itself as
a WAL sender, because postmaster sends SIGUSR2 only to the
processes marked as WAL senders.

To avoid the corner case, I changed walsender so that it checks
again whether recovery is in progress after marking itself
as a WAL sender. If recovery is not in progress even though the
walsender is a cascading one, it does the same thing as the SIGUSR2
signal handler does, and then exits later. Attached patch also includes
this fix.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index c0a32a3..2ec39dd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2328,7 +2328,7 @@ reaper(SIGNAL_ARGS)
 			 * XXX should avoid the need for disconnection. When we do,
 			 * am_cascading_walsender should be replaced with RecoveryInProgress()
 			 */
-			if (max_wal_senders > 0)
+			if (max_wal_senders > 0 && CountChildren(BACKEND_TYPE_WALSND) > 0)
 			{
 				ereport(LOG,
 						(errmsg("terminating all walsender processes to force cascaded standby(s) to update timeline and reconnect")));
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0eadf64..ef6894c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -368,6 +368,22 @@ StartReplication(StartReplicationCmd *cmd)
 	SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
 
 	/*
+	 * When promoting a cascading standby, postmaster sends SIGUSR2 to
+	 * any cascading walsenders to kill them. But there is a corner-case where
+	 * such walsender fails to receive SIGUSR2 and survives a standby promotion
+	 * unexpectedly. This happens when postmaster sends SIGUSR2 before
+	 * the walsender marks itself as a WAL sender, because postmaster sends
+	 * SIGUSR2 to only the processes marked as a WAL sender.
+	 *
+	 * To avoid this corner-case, if recovery is NOT in progress even though
+	 * the walsender is cascading one, we do the same thing as SIGUSR2 signal
+	 * handler does, i.e., set walsender_ready_to_stop to true. Which causes
+	 * the walsender to end later.
+	 */
+	if (am_cascading_walsender && !RecoveryInProgress())
+		walsender_ready_to_stop = true;
+
+	/*
 	 * We assume here that we're logging enough information in the WAL for
 	 * log-shipping, since this is checked in PostmasterMain().
 	 *
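The ordering problem described in this message can be walked through with a small, self-contained model. It is only a sketch: it uses no PostgreSQL code and no real signals. ToyChild and the toy_* functions are hypothetical names, and setting ready_to_stop directly stands in for the SIGUSR2 handler setting walsender_ready_to_stop.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one child process; not a PostgreSQL struct. */
typedef struct ToyChild
{
	bool	marked_walsender;	/* has it marked itself as a WAL sender? */
	bool	ready_to_stop;		/* models walsender_ready_to_stop */
	bool	cascading;			/* models am_cascading_walsender */
} ToyChild;

/* Models the postmaster signalling only children marked as WAL senders. */
static int
toy_signal_walsenders(ToyChild *children, size_t n)
{
	int		signalled = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (children[i].marked_walsender)
		{
			children[i].ready_to_stop = true;	/* effect of the handler */
			signalled++;
		}
	}
	return signalled;
}

/* Models the walsender marking itself and, with the fix, rechecking. */
static void
toy_start_replication(ToyChild *child, bool recovery_in_progress, bool apply_fix)
{
	child->marked_walsender = true;

	/* The recheck from the patch: cascading, but recovery already ended. */
	if (apply_fix && child->cascading && !recovery_in_progress)
		child->ready_to_stop = true;
}

/*
 * Replays the bad ordering: the promotion signal is sent first, and the
 * walsender marks itself afterwards. Returns true if it survives.
 */
static bool
toy_survives_promotion(bool apply_fix)
{
	ToyChild	child = {false, false, true};

	toy_signal_walsenders(&child, 1);					/* sent too early */
	toy_start_replication(&child, false, apply_fix);	/* recovery is over */
	return !child.ready_to_stop;
}
```

Without the recheck the modelled walsender is never told to stop, which corresponds to the survivor described above. With the recheck it stops itself right after marking, regardless of whether the signal reached it.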