Re: [HACKERS] Crash on promotion when recovery.conf is renamed
On Mon, Oct 2, 2017 at 8:13 AM, Daniel Gustafsson wrote: > I’ve moved this to the next CF, but since this no longer applies cleanly I’ve > reset it to Waiting for author. Thanks Daniel for the reminder. Attached are rebased patches. This thing rots easily... -- Michael 0001-Change-detection-of-corrupted-2PC-files-as-FATAL.patch Description: Binary data 0002-Minimize-window-between-history-file-and-end-of-reco.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
> On 02 Sep 2017, at 14:17, Michael Paquier wrote: > > On Fri, Sep 1, 2017 at 7:17 PM, Thomas Munro > wrote: >> On Sat, Aug 26, 2017 at 7:32 AM, Robert Haas wrote: >>> On Sat, Apr 8, 2017 at 10:05 AM, David Steele wrote: On 3/28/17 1:21 AM, Tsunakawa, Takayuki wrote: > Thanks, marked as ready for committer, as the code is good and Alexander > reported the test success. This bug has been moved to CF 2017-07. >>> >>> This bug fix has been pending in "Ready for Committer" state for about >>> 4.5 months. Three committers (Magnus, Heikki, Tom) have contributed >>> to the thread to date. Maybe one of them would like to commit this? >> >> In the meantime its bits have begun to rot. Michael, could you please >> rebase? > > Thanks for the reminder, Thomas. The 2PC code triggered during > recovery has been largely reworked in PG10, explaining the conflicts. > As this has been some time since I touched this patch, I had again a > look at its logic and could not find any problems around it. So > attached are a rebased versions for both 0001 and 0002, I have updated > the messages as well to be more in-line with what is in HEAD for > corrupted entries. I’ve moved this to the next CF, but since this no longer applies cleanly I’ve reset it to Waiting for author. cheers ./daniel -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
On Fri, Sep 1, 2017 at 7:17 PM, Thomas Munro wrote: > On Sat, Aug 26, 2017 at 7:32 AM, Robert Haas wrote: >> On Sat, Apr 8, 2017 at 10:05 AM, David Steele wrote: >>> On 3/28/17 1:21 AM, Tsunakawa, Takayuki wrote: Thanks, marked as ready for committer, as the code is good and Alexander reported the test success. >>> >>> This bug has been moved to CF 2017-07. >> >> This bug fix has been pending in "Ready for Committer" state for about >> 4.5 months. Three committers (Magnus, Heikki, Tom) have contributed >> to the thread to date. Maybe one of them would like to commit this? > > In the meantime its bits have begun to rot. Michael, could you please rebase? Thanks for the reminder, Thomas. The 2PC code triggered during recovery has been largely reworked in PG10, explaining the conflicts. As this has been some time since I touched this patch, I had again a look at its logic and could not find any problems around it. So attached are a rebased versions for both 0001 and 0002, I have updated the messages as well to be more in-line with what is in HEAD for corrupted entries. -- Michael From 2f3a16c0c0cb12f8bfef2d58656352c46a681393 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Sat, 2 Sep 2017 21:15:04 +0900 Subject: [PATCH 2/2] Minimize window between history file and end-of-recovery record Once a standby node is promoted, this makes the assignment of the new timeline number booked earlier as the history file gets archived immediately. This way the other nodes are aware that this new timeline number is taken and should not be assigned to other nodes. The window between which the history file is archived and the end-of-recovery record is written cannot be zeroed, but this way it is minimized as much as possible. The new order of actions prevents as well a corrupted data directory on failure. --- src/backend/access/transam/xlog.c | 27 ++- 1 file changed, 18 insertions(+), 9 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 80fe2c60e6..abc0cec3d2 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -7449,6 +7449,24 @@ StartupXLOG(void) else snprintf(reason, sizeof(reason), "no recovery target specified"); + /* + * We are now done reading the old WAL. Turn off archive fetching + * if it was active, and make a writable copy of the last WAL segment. + * (Note that we also have a copy of the last block of the old WAL + * in readBuf; we will use that below.) + */ + exitArchiveRecovery(EndOfLogTLI, EndOfLog); + + /* + * Write the timeline history file, and have it archived. After this + * point (or rather, as soon as the file is archived), the timeline + * will appear as "taken" in the WAL archive and to any standby servers. + * If we crash before actually switching to the new timeline, standby + * servers will nevertheless think that we switched to the new timeline, + * and will try to connect to the new timeline. To minimize the window + * for that, try to do as little as possible between here and writing the + * end-of-recovery record. + */ writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI, EndRecPtr, reason); } @@ -7457,15 +7475,6 @@ StartupXLOG(void) XLogCtl->ThisTimeLineID = ThisTimeLineID; XLogCtl->PrevTimeLineID = PrevTimeLineID; - /* - * We are now done reading the old WAL. Turn off archive fetching if it - * was active, and make a writable copy of the last WAL segment. (Note - * that we also have a copy of the last block of the old WAL in readBuf; - * we will use that below.) - */ - if (ArchiveRecoveryRequested) - exitArchiveRecovery(EndOfLogTLI, EndOfLog); - /* * Prepare to write WAL starting at EndOfLog location, and init xlog * buffer cache using the block containing the last record from the -- 2.14.1 From 12ab1765f7ea032db217f2081f9561c00d1d96df Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Sat, 2 Sep 2017 21:07:21 +0900 Subject: [PATCH 1/2] Change detection of corrupted 2PC files as FATAL When scanning 2PC files, be it for initializing a hot standby node or finding the XID range at the end of recovery, bumping on a corrupted 2PC file is reported as a WARNING and are blindly corrupted. If it happened that the corrupted file was actually legit, there is actually a risk of corruption, so switch that to a hard failure. --- src/backend/access/transam/twophase.c | 27 +++ src/backend/access/transam/xlog.c | 10 +++--- 2 files changed, 18 insertions(+), 19 deletions(-) diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index ba03d9687e..4b4f2449f8 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -1778,6 +1778,10 @@ restoreTwoPhaseData(void) * write a WAL entry, and so there might be no evidence in WAL of those * subxact XIDs. * + * On corrupted two-phase fil
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
On Sat, Aug 26, 2017 at 7:32 AM, Robert Haas wrote: > On Sat, Apr 8, 2017 at 10:05 AM, David Steele wrote: >> On 3/28/17 1:21 AM, Tsunakawa, Takayuki wrote: >>> Thanks, marked as ready for committer, as the code is good and Alexander >>> reported the test success. >> >> This bug has been moved to CF 2017-07. > > This bug fix has been pending in "Ready for Committer" state for about > 4.5 months. Three committers (Magnus, Heikki, Tom) have contributed > to the thread to date. Maybe one of them would like to commit this? In the meantime its bits have begun to rot. Michael, could you please rebase? Thanks! -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
On Sat, Apr 8, 2017 at 10:05 AM, David Steele wrote: > On 3/28/17 1:21 AM, Tsunakawa, Takayuki wrote: >> From: Michael Paquier [mailto:michael.paqu...@gmail.com] >>> Okay. I got the message, and I agree with what you say here. You are right >>> by the way, the error messages just use "two-phase file" and not "two-phase >>> STATE file", missed that previously. >>> -- >> Thanks, marked as ready for committer, as the code is good and Alexander >> reported the test success. > > This bug has been moved to CF 2017-07. This bug fix has been pending in "Ready for Committer" state for about 4.5 months. Three committers (Magnus, Heikki, Tom) have contributed to the thread to date. Maybe one of them would like to commit this? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
On 3/28/17 1:21 AM, Tsunakawa, Takayuki wrote: > From: Michael Paquier [mailto:michael.paqu...@gmail.com] >> Okay. I got the message, and I agree with what you say here. You are right >> by the way, the error messages just use "two-phase file" and not "two-phase >> STATE file", missed that previously. >> -- > Thanks, marked as ready for committer, as the code is good and Alexander > reported the test success. This bug has been moved to CF 2017-07. -- -David da...@pgmasters.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
From: Michael Paquier [mailto:michael.paqu...@gmail.com] > Okay. I got the message, and I agree with what you say here. You are right > by the way, the error messages just use "two-phase file" and not "two-phase > STATE file", missed that previously. > -- Thanks, marked as ready for committer, as the code is good and Alexander reported the test success. Regards Takayuki Tsunakawa -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
On Tue, Mar 28, 2017 at 1:33 PM, Tsunakawa, Takayuki wrote: > I get the impression that DATA_CORRUPTED means the table data is corrupted, > because there's an error code named INDEX_CORRUPTED. I have interpreted that as the other way around, aka DATA_CORRUPTED could be used as well to 2PC files :) But grepping around it seems that you are grabbing the meaning better than I do, ERRCODE_DATA_CORRUPTED is only used now for relation pages or large pages. > Anyway, I don't think this patch needs to attach an error code because: > * Currently, other interlal files (not tables or indexes) seem to use > INTERNAL_ERROR (XX000). For example, see ReadControlFile() in xlog.c and > pgstat_read_statsfiles() in pgstat.c. > * It doesn't seem that the user needs distinction. I don't object to > providing a specific error code for this case, but if the patch needs a > specific error code to be committed, I'd like to know how that's useful (e.g. > how it affects the recovery action the user takes.) > So, I'd like to mark the patch as ready for committer when ereport() and > errmsg() are on separate lines and the message changes to "two-phase state > file." Okay. I got the message, and I agree with what you say here. You are right by the way, the error messages just use "two-phase file" and not "two-phase STATE file", missed that previously. -- Michael 0001-Change-detection-of-corrupted-2PC-files-as-FATAL.patch Description: Binary data 0002-Minimize-window-between-history-file-and-end-of-reco.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
From: pgsql-hackers-ow...@postgresql.org > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Michael Paquier > While I agree with that in general, we are taking about 2PC files that are > on disk in patch 0001, so I would think that in this case > ERRCODE_DATA_CORRUPTED is the most adapted (or > ERRCODE_TWOPHASE_CORRUPTED?). > > The other WARNING messages refer to stale files of already committed > transactions, which are not actually corrupted. What about in this case > having a ERRCODE_TWOPHASE_INVALID? > > Updated patches are attached, I did not change the WARNING portion though > as I am not sure what's the consensus on the matter. > I get the impression that DATA_CORRUPTED means the table data is corrupted, because there's an error code named INDEX_CORRUPTED. Anyway, I don't think this patch needs to attach an error code because: * Currently, other interlal files (not tables or indexes) seem to use INTERNAL_ERROR (XX000). For example, see ReadControlFile() in xlog.c and pgstat_read_statsfiles() in pgstat.c. * It doesn't seem that the user needs distinction. I don't object to providing a specific error code for this case, but if the patch needs a specific error code to be committed, I'd like to know how that's useful (e.g. how it affects the recovery action the user takes.) So, I'd like to mark the patch as ready for committer when ereport() and errmsg() are on separate lines and the message changes to "two-phase state file." Regards Takayuki Tsunakawa -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
On Mon, Mar 27, 2017 at 11:20 PM, Tom Lane wrote: > Alexander Korotkov writes: >> On Mon, Mar 27, 2017 at 4:48 PM, Tom Lane wrote: >>> Actually, the *real* problem with that coding is it lacks a SQLSTATE >>> (errcode call). The only places where it's acceptable to leave that >>> out are for internal "can't happen" cases, which this surely isn't. > >> Surrounding code also has ereports lacking SQLSTATE. And that isn't "can't >> happen" case as well. >> Should we consider fixing them? > > Yup. Just remember that the default is > XX000EERRCODE_INTERNAL_ERROR internal_error > > If that's not how you want the error case reported, you need an errcode() > call. > > We might need more ERRCODEs than are there now, if none of the existing > ones seem to fit these cases. There's already ERRCODE_DATA_CORRUPTED > and ERRCODE_INDEX_CORRUPTED; maybe we need ERRCODE_WAL_CORRUPTED, for > example? While I agree with that in general, we are taking about 2PC files that are on disk in patch 0001, so I would think that in this case ERRCODE_DATA_CORRUPTED is the most adapted (or ERRCODE_TWOPHASE_CORRUPTED?). The other WARNING messages refer to stale files of already committed transactions, which are not actually corrupted. What about in this case having a ERRCODE_TWOPHASE_INVALID? Updated patches are attached, I did not change the WARNING portion though as I am not sure what's the consensus on the matter. -- Michael 0001-Change-detection-of-corrupted-2PC-files-as-FATAL.patch Description: Binary data 0002-Minimize-window-between-history-file-and-end-of-reco.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
From: pgsql-hackers-ow...@postgresql.org > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane > "Tsunakawa, Takayuki" writes: > > All other places in twophase.c and most places in other files put ereport() > and errmsg() on separate lines. I think it would be better to align with > surrounding code. > > > + ereport(FATAL, (errmsg("corrupted > two-phase file \"%s\"", > > Actually, the *real* problem with that coding is it lacks a SQLSTATE (errcode > call). The only places where it's acceptable to leave that out are for > internal "can't happen" cases, which this surely isn't. Oh, I overlooked it. But... > Yup. Just remember that the default is > XX000EERRCODE_INTERNAL_ERROR internal_error > > If that's not how you want the error case reported, you need an errcode() > call. > > We might need more ERRCODEs than are there now, if none of the existing > ones seem to fit these cases. There's already ERRCODE_DATA_CORRUPTED and > ERRCODE_INDEX_CORRUPTED; maybe we need ERRCODE_WAL_CORRUPTED, for > example? I'd be always happy if the error code is more specific, but maybe that would be a separate patch. WAL corruption message so far doesn't accompany a specific error code like this in xlog.c: /* * We only end up here without a message when XLogPageRead() * failed - in that case we already logged something. In * StandbyMode that only happens if we have been triggered, so we * shouldn't loop anymore in that case. */ if (errormsg) ereport(emode_for_corrupt_record(emode, RecPtr ? RecPtr : EndRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); Regards Takayuki Tsunakawa -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
On Mon, Mar 27, 2017 at 11:26 AM, Tsunakawa, Takayuki < tsunakawa.ta...@jp.fujitsu.com> wrote: > From: pgsql-hackers-ow...@postgresql.org > > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Michael Paquier > > Moved to CF 2017-03. Both patches still apply. > > Sorry to be late for reviewing this, but done now. The patch applied, > make check passed, and the code looks almost good. I could successfully > perform a simple archive recovery. Finally, I broke the 2pc state file > while the server is down, and could confirm that the server failed to start > as expected, emitting a FATAL message. Worked nicely. > I've tested moving recovery.conf away case which was originally reported by Magnus. When I'm trying to promote standby when recovery.conf is moved away, I get FATAL error. waiting for server to promote2017-03-27 17:12:38.225 MSK [30307] LOG: received promote request 2017-03-27 17:12:38.225 MSK [30311] FATAL: terminating walreceiver process due to administrator command 2017-03-27 17:12:38.225 MSK [30307] LOG: redo done at 0/328 2017-03-27 17:12:38.226 MSK [30307] LOG: selected new timeline ID: 2 2017-03-27 17:12:38.253 MSK [30307] FATAL: could not open file "recovery.conf": No such file or directory 2017-03-27 17:12:38.253 MSK [30306] LOG: startup process (PID 30307) exited with exit code 1 2017-03-27 17:12:38.253 MSK [30306] LOG: terminating any other active server processes 2017-03-27 17:12:38.256 MSK [30306] LOG: database system is shut down ... stopped waiting server is still promoting If I try to start it after crash, it becomes a master on timeline 1. Just like in case we deleted recovery.conf while server was shut down. waiting for server to start2017-03-27 17:16:54.186 MSK [30413] LOG: listening on IPv6 address "::1", port 5430 2017-03-27 17:16:54.186 MSK [30413] LOG: listening on IPv6 address "fe80::1%lo0", port 5430 2017-03-27 17:16:54.186 MSK [30413] LOG: listening on IPv4 address "127.0.0.1", port 5430 2017-03-27 17:16:54.187 MSK [30413] LOG: listening on Unix socket "/tmp/.s.PGSQL.5430" 2017-03-27 17:16:54.195 MSK [30414] LOG: database system was interrupted while in recovery at log time 2017-03-27 17:10:23 MSK 2017-03-27 17:16:54.195 MSK [30414] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. 2017-03-27 17:16:54.218 MSK [30414] LOG: database system was not properly shut down; automatic recovery in progress 2017-03-27 17:16:54.219 MSK [30414] LOG: redo starts at 0/260 2017-03-27 17:16:54.219 MSK [30414] LOG: consistent recovery state reached at 0/360 2017-03-27 17:16:54.219 MSK [30414] LOG: invalid record length at 0/360: wanted 24, got 0 2017-03-27 17:16:54.219 MSK [30414] LOG: redo done at 0/328 2017-03-27 17:16:54.224 MSK [30413] LOG: database system is ready to accept connections done server started # select pg_is_in_recovery(); pg_is_in_recovery ─── f (1 row) # select pg_walfile_name(pg_current_wal_location()); pg_walfile_name ── 00010003 (1 row) If instead I put recovery back and start server, then streaming replication continues normally. waiting for server to start2017-03-27 17:32:16.887 MSK [30783] LOG: listening on IPv6 address "::1", port 5430 2017-03-27 17:32:16.887 MSK [30783] LOG: listening on IPv6 address "fe80::1%lo0", port 5430 2017-03-27 17:32:16.887 MSK [30783] LOG: listening on IPv4 address "127.0.0.1", port 5430 2017-03-27 17:32:16.888 MSK [30783] LOG: listening on Unix socket "/tmp/.s.PGSQL.5430" 2017-03-27 17:32:16.897 MSK [30784] LOG: database system was interrupted while in recovery at log time 2017-03-27 17:28:05 MSK 2017-03-27 17:32:16.897 MSK [30784] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. 2017-03-27 17:32:16.914 MSK [30784] LOG: entering standby mode 2017-03-27 17:32:16.916 MSK [30784] LOG: redo starts at 0/828 2017-03-27 17:32:16.916 MSK [30784] LOG: consistent recovery state reached at 0/960 2017-03-27 17:32:16.916 MSK [30784] LOG: invalid record length at 0/960: wanted 24, got 0 2017-03-27 17:32:16.916 MSK [30783] LOG: database system is ready to accept read only connections 2017-03-27 17:32:16.920 MSK [30788] LOG: started streaming WAL from primary at 0/900 on timeline 1 done server started # select pg_is_in_recovery(); pg_is_in_recovery ─── t (1 row) Thus, I think patch is working as expected in this case. Also, I'd like to notice that it passes check-world without warnings on my laptop. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
Alexander Korotkov writes: > On Mon, Mar 27, 2017 at 4:48 PM, Tom Lane wrote: >> Actually, the *real* problem with that coding is it lacks a SQLSTATE >> (errcode call). The only places where it's acceptable to leave that >> out are for internal "can't happen" cases, which this surely isn't. > Surrounding code also has ereports lacking SQLSTATE. And that isn't "can't > happen" case as well. > Should we consider fixing them? Yup. Just remember that the default is XX000EERRCODE_INTERNAL_ERROR internal_error If that's not how you want the error case reported, you need an errcode() call. We might need more ERRCODEs than are there now, if none of the existing ones seem to fit these cases. There's already ERRCODE_DATA_CORRUPTED and ERRCODE_INDEX_CORRUPTED; maybe we need ERRCODE_WAL_CORRUPTED, for example? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
On Mon, Mar 27, 2017 at 4:48 PM, Tom Lane wrote: > "Tsunakawa, Takayuki" writes: > > All other places in twophase.c and most places in other files put > ereport() and errmsg() on separate lines. I think it would be better to > align with surrounding code. > > > + ereport(FATAL, (errmsg("corrupted > two-phase file \"%s\"", > > Actually, the *real* problem with that coding is it lacks a SQLSTATE > (errcode call). The only places where it's acceptable to leave that > out are for internal "can't happen" cases, which this surely isn't. > Surrounding code also has ereports lacking SQLSTATE. And that isn't "can't happen" case as well. if (ControlFile->backupEndRequired) > ereport(FATAL, > (errmsg("WAL ends before end of online backup"), > errhint("All WAL generated while online backup was taken must be > available at recovery."))); > else if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint)) > ereport(FATAL, > (errmsg("WAL ends before end of online backup"), > errhint("Online backup started with pg_start_backup() must be ended with > pg_stop_backup(), and all WAL up to that point must be available at > recovery."))); > else > ereport(FATAL, > (errmsg("WAL ends before consistent recovery point"))); Should we consider fixing them? -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
"Tsunakawa, Takayuki" writes: > All other places in twophase.c and most places in other files put ereport() > and errmsg() on separate lines. I think it would be better to align with > surrounding code. > + ereport(FATAL, (errmsg("corrupted two-phase > file \"%s\"", Actually, the *real* problem with that coding is it lacks a SQLSTATE (errcode call). The only places where it's acceptable to leave that out are for internal "can't happen" cases, which this surely isn't. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
From: pgsql-hackers-ow...@postgresql.org > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Michael Paquier > Moved to CF 2017-03. Both patches still apply. Sorry to be late for reviewing this, but done now. The patch applied, make check passed, and the code looks almost good. I could successfully perform a simple archive recovery. Finally, I broke the 2pc state file while the server is down, and could confirm that the server failed to start as expected, emitting a FATAL message. Worked nicely. Just two cosmetic points: (1) Other places use "two-phase state file", not "two-phase file". (2) All other places in twophase.c and most places in other files put ereport() and errmsg() on separate lines. I think it would be better to align with surrounding code. + ereport(FATAL, (errmsg("corrupted two-phase file \"%s\"", Regards Takayuki Tsunakawa -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
From: David Steele [mailto:da...@pgmasters.net] > Any idea when you'll have a chance to review? I'll do it by early next week. Regards Takayuki Tsunakawa -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
On 12/20/16 2:54 AM, Michael Paquier wrote: > On Sat, Dec 17, 2016 at 9:23 PM, Magnus Hagander wrote: >> On Fri, Dec 16, 2016 at 7:08 AM, Michael Paquier >> wrote: >>> Looking at PrescanPreparedTransactions(), I am thinking as well that it >>> would >>> be better to get a hard failure when bumping on a corrupted 2PC file. >>> Future >>> files are one thing, but corrupted files should be treated more carefully. >> >> >> Again without looking at it, I agree (so much easier that way :P). Ignoring >> corruption is generally a bad idea. Failing hard makes the user notice the >> error, and makes it possible to initiate recovery from a standby or from >> backups or something, or to *intentionally* remove/clear/ignore it. > > And I am finishing with the two patches attached: > - 0001 changes the 2PC checks so as corrupted entries are FATAL. > PreScanPreparedTransaction is used when a hot standby is initialized. > In this case a failure protects the range of XIDs generated, > potentially saving from corruption of data. At the end of recovery, > this is done before any on-disk actions are taken. > - 0002 is the thing that Heikki has sent previously to minimize the > window between end-of-recovery record write and timeline history file > archiving. > > I am attaching that to next CF. This patch still applies cleanly and compiles at cccbdde. Any idea when you'll have a chance to review? -- -David da...@pgmasters.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
On Tue, Dec 20, 2016 at 4:54 PM, Michael Paquier wrote: > I am attaching that to next CF. Moved to CF 2017-03. Both patches still apply. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
On Sat, Dec 17, 2016 at 9:23 PM, Magnus Hagander wrote: > On Fri, Dec 16, 2016 at 7:08 AM, Michael Paquier > wrote: >> Looking at PrescanPreparedTransactions(), I am thinking as well that it >> would >> be better to get a hard failure when bumping on a corrupted 2PC file. >> Future >> files are one thing, but corrupted files should be treated more carefully. > > > Again without looking at it, I agree (so much easier that way :P). Ignoring > corruption is generally a bad idea. Failing hard makes the user notice the > error, and makes it possible to initiate recovery from a standby or from > backups or something, or to *intentionally* remove/clear/ignore it. And I am finishing with the two patches attached: - 0001 changes the 2PC checks so as corrupted entries are FATAL. PreScanPreparedTransaction is used when a hot standby is initialized. In this case a failure protects the range of XIDs generated, potentially saving from corruption of data. At the end of recovery, this is done before any on-disk actions are taken. - 0002 is the thing that Heikki has sent previously to minimize the window between end-of-recovery record write and timeline history file archiving. I am attaching that to next CF. -- Michael From 0e816454c57b851f4aa2743ea776b1e604a27d77 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 20 Dec 2016 16:47:35 +0900 Subject: [PATCH 2/2] Minimize window between history file and end-of-recovery record Once a standby node is promoted, this makes the assignment of the new timeline number booked earlier as the history file gets archived immediately. This way the other nodes are aware that this new timeline number is taken and should not be assigned to other nodes. The window between which the history file is archived and the end-of-recovery record is written cannot be zeroed, but this way it is minimized as much as possible. The new order of actions prevents as well a corrupted data directory on failure. --- src/backend/access/transam/xlog.c | 27 ++- 1 file changed, 18 insertions(+), 9 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index ea2c523e6e..f23feb47e8 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -7210,6 +7210,24 @@ StartupXLOG(void) else snprintf(reason, sizeof(reason), "no recovery target specified"); + /* + * We are now done reading the old WAL. Turn off archive fetching if it + * was active, and make a writable copy of the last WAL segment. (Note + * that we also have a copy of the last block of the old WAL in readBuf; + * we will use that below.) + */ + exitArchiveRecovery(EndOfLogTLI, EndOfLog); + + /* + * Write the timeline history file, and have it archived. After this + * point (or rather, as soon as the file is archived), the timeline + * will appear as "taken" in the WAL archive and to any standby servers. + * If we crash before actually switching to the new timeline, standby + * servers will nevertheless think that we switched to the new timeline, + * and will try to connect to the new timeline. To minimize the window + * for that, try to do as little as possible between here and writing the + * end-of-recovery record. + */ writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI, EndRecPtr, reason); } @@ -7219,15 +7237,6 @@ StartupXLOG(void) XLogCtl->PrevTimeLineID = PrevTimeLineID; /* - * We are now done reading the old WAL. Turn off archive fetching if it - * was active, and make a writable copy of the last WAL segment. (Note - * that we also have a copy of the last block of the old WAL in readBuf; - * we will use that below.) - */ - if (ArchiveRecoveryRequested) - exitArchiveRecovery(EndOfLogTLI, EndOfLog); - - /* * Prepare to write WAL starting at EndOfLog position, and init xlog * buffer cache using the block containing the last record from the * previous incarnation. -- 2.11.0 From 4037db778a4353fc79b15eb683a5b9e3f648db24 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 20 Dec 2016 16:41:44 +0900 Subject: [PATCH 1/2] Change detection of corrupted 2PC files as FATAL When scanning 2PC files, be it for initializing a hot standby node or finding the XID range at the end of recovery, bumping on a corrupted 2PC file is reported as a WARNING and are blindly corrupted. If it happened that the corrupted file was actually legit, there is actually a risk of corruption, so switch that to a hard failure. --- src/backend/access/transam/twophase.c | 23 --- src/backend/access/transam/xlog.c | 10 +++--- 2 files changed, 15 insertions(+), 18 deletions(-) diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index 5415604993..9cbcb83062 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -1672,6 +1672,10 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon) * write a
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
On Fri, Dec 16, 2016 at 7:08 AM, Michael Paquier wrote: > On Thu, Dec 15, 2016 at 11:25:10AM +0200, Heikki Linnakangas wrote: > > On 12/15/2016 10:44 AM, Magnus Hagander wrote: > > > I wonder if there might be more corner cases like this, but in this > > > particular one it seems easy enough to just say that failing to rename > > > recovery.conf because it didn't exist is safe. > > > > Yeah. It's unexpected though, so I think erroring out is quite > reasonable. > > If the recovery.conf file went missing, who knows what else is wrong. > > i'd rather not mess up with the exiting behavior and just complain to the > pilot to not do that. This enters in the category of not removing the > backup_label file after taking a backup... > I'm happy to say that people shouldn't do that. However, we have a system that is unnecessarily fragile, by making something like that so easy to break. This could equally happen in other ways. For example, someone could accidentally provision a recovery.conf with the wrong permissions. Reducing the fragility of such an important part of the system is a big improvement, IMO. Of course, we have to be extra careful when touching those parts. Bottom line, I'm perfectly fine with failing in such a scenario. I'm not fine with leaving the user with a corrupt cluster if we can avoid it. As for your comparison, we fixed the backup_label part by adding support for taking backups without using the fragile system. So it's not like we didn't recognize the problem. > > > > But in the case of failing to rename recovery.conf for example because > of > > > permissions errors, we don't want to ignore it. But we also really > don't > > > want to end up with this kind of inconsistent data directory IMO. I > don't > > > know that code well enough to suggest how to fix it though -- hoping > for > > > input for someone who knows it closer? > > > > Hmm. Perhaps we should write the timeline history file only after > renaming > > recovery.conf. In general, we should keep the window between writing the > > timeline history file and writing the end-of-recovery record as small as > > possible. I don't think we can eliminate it completely, but it makes > sense > > to minimize it. Something like the attached (completely untested). > I haven't looked at the code either, but the reason is definitely solid - let's keep the time as short as possible. In particular, keeping it recoverable for things that are caused by humans, because they will keep making mistakes. And editing recovery.conf is a normal thing for a DBA to do (sure, not removing it -- but it's one of the few files in the data directory that the DBA is actually supposed to handle manually, so that increases the risk). And also by doing things in an order that will at least make it less likely to end up corrupt if people manage to do it anyway. I did some tests. And after a lookup it looks like a good thing to book > the timeline number at an earlier step and let other nodes know about > it. So +1 on it. > > > Looking at PrescanPreparedTransactions(), I am thinking as well that it > would > be better to get a hard failure when bumping on a corrupted 2PC file. > Future > files are one thing, but corrupted files should be treated more carefully. > Again without looking at it, I agree (so much easier that way :P). Ignoring corruption is generally a bad idea. Failing hard makes the user notice the error, and makes it possible to initiate recovery from a standby or from backups or something, or to *intentionally* remove/clear/ignore it. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
On Thu, Dec 15, 2016 at 11:25:10AM +0200, Heikki Linnakangas wrote: > On 12/15/2016 10:44 AM, Magnus Hagander wrote: > > I wonder if there might be more corner cases like this, but in this > > particular one it seems easy enough to just say that failing to rename > > recovery.conf because it didn't exist is safe. > > Yeah. It's unexpected though, so I think erroring out is quite reasonable. > If the recovery.conf file went missing, who knows what else is wrong. i'd rather not mess up with the exiting behavior and just complain to the pilot to not do that. This enters in the category of not removing the backup_label file after taking a backup... > > But in the case of failing to rename recovery.conf for example because of > > permissions errors, we don't want to ignore it. But we also really don't > > want to end up with this kind of inconsistent data directory IMO. I don't > > know that code well enough to suggest how to fix it though -- hoping for > > input for someone who knows it closer? > > Hmm. Perhaps we should write the timeline history file only after renaming > recovery.conf. In general, we should keep the window between writing the > timeline history file and writing the end-of-recovery record as small as > possible. I don't think we can eliminate it completely, but it makes sense > to minimize it. Something like the attached (completely untested). I did some tests. And after a lookup it looks like a good thing to book the timeline number at an earlier step and let other nodes know about it. So +1 on it. Looking at PrescanPreparedTransactions(), I am thinking as well that it would be better to get a hard failure when bumping on a corrupted 2PC file. Future files are one thing, but corrupted files should be treated more carefully. -- Michael signature.asc Description: PGP signature
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
On 12/15/2016 10:44 AM, Magnus Hagander wrote: I wonder if there might be more corner cases like this, but in this particular one it seems easy enough to just say that failing to rename recovery.conf because it didn't exist is safe. Yeah. It's unexpected though, so I think erroring out is quite reasonable. If the recovery.conf file went missing, who knows what else is wrong. But in the case of failing to rename recovery.conf for example because of permissions errors, we don't want to ignore it. But we also really don't want to end up with this kind of inconsistent data directory IMO. I don't know that code well enough to suggest how to fix it though -- hoping for input for someone who knows it closer? Hmm. Perhaps we should write the timeline history file only after renaming recovery.conf. In general, we should keep the window between writing the timeline history file and writing the end-of-recovery record as small as possible. I don't think we can eliminate it completely, but it makes sense to minimize it. Something like the attached (completely untested). - Heikki reorder-end-of-archive-recovery-actions-1.patch Description: invalid/octet-stream -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Crash on promotion when recovery.conf is renamed
I had a system where the recovery.conf file was renamed "out of the way" at some point, and then the system was promoted. This is obviously operator error, but it seems like something we should handle. What happens now is that the non-existance of recovery.conf is a FATAL error. I wonder if it should just be a WARNING, at least in the case of ENOENT? What happens is this. Log output: 2016-12-15 09:36:46.265 CET [25437] LOG: received promote request 2016-12-15 09:36:46.265 CET [25438] FATAL: terminating walreceiver process due to administrator command mha@mha-laptop:~/postgresql/inst/head$ 2016-12-15 09:36:46.265 CET [25437] LOG: invalid record length at 0/5015168: wanted 24, got 0 2016-12-15 09:36:46.265 CET [25437] LOG: redo done at 0/5015130 2016-12-15 09:36:46.265 CET [25437] LOG: last completed transaction was at log time 2016-12-15 09:36:19.27125+01 2016-12-15 09:36:46.276 CET [25437] LOG: selected new timeline ID: 2 2016-12-15 09:36:46.429 CET [25437] FATAL: could not open file "recovery.conf": No such file or directory 2016-12-15 09:36:46.429 CET [25436] LOG: startup process (PID 25437) exited with exit code 1 2016-12-15 09:36:46.429 CET [25436] LOG: terminating any other active server processes 2016-12-15 09:36:46.429 CET [25456] WARNING: terminating connection because of crash of another server process 2016-12-15 09:36:46.429 CET [25456] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2016-12-15 09:36:46.429 CET [25456] HINT: In a moment you should be able to reconnect to the database and repeat your command. 2016-12-15 09:36:46.431 CET [25436] LOG: database system is shut down So we can see it switches to timeline 2. Looking in pg_wal (or pg_xlog -- customer system was on 9.5, but this is reproducible in HEAD): -rw--- 1 mha mha 16777216 Dec 15 09:36 00010004 -rw--- 1 mha mha 16777216 Dec 15 09:36 00010005 -rw--- 1 mha mha 16777216 Dec 15 09:36 00020005 -rw--- 1 mha mha 41 Dec 15 09:36 0002.history However, according to pg_controldata, we are still on timeline 1: Latest checkpoint location: 0/460 Prior checkpoint location:0/460 Latest checkpoint's REDO location:0/428 Latest checkpoint's REDO WAL file:00010004 Latest checkpoint's TimeLineID: 1 Latest checkpoint's PrevTimeLineID: 1 .. Minimum recovery ending location: 0/5015168 Min recovery ending loc's timeline: 1 But since we have a history file for timeline 2 in the data directory (and neatly archived), this data directory isn't consistent with that. Meaning that for example any other standbys that you try to connect to this cluster will simply fail, because they try to join up on timeline 2 which doesn't actually exist. I wonder if there might be more corner cases like this, but in this particular one it seems easy enough to just say that failing to rename recovery.conf because it didn't exist is safe. But in the case of failing to rename recovery.conf for example because of permissions errors, we don't want to ignore it. But we also really don't want to end up with this kind of inconsistent data directory IMO. I don't know that code well enough to suggest how to fix it though -- hoping for input for someone who knows it closer? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/