Re: [HACKERS] initdb -S and tablespaces
Andres Freund writes: > On 2015-05-08 22:08:31 -0400, Robert Haas wrote: >> Of course, even the last one isn't totally bullet-proof. Suppose one >> backend fails to absorb the new setting for some reason... > I've a hard time worrying much about that one... You should. At the very least, whatever recipe we write for changing fsync safely has to include a clause like "wait for all postmaster children to have absorbed the new fsync setting". The facts that (a) this could be a long time and (b) there's no easy way to be entirely certain about when it's done don't make it something you should ignore. I wonder whether we should change fsync to be PGC_POSTMASTER and then document the safe procedure as requiring a postmaster restart. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
On 2015-05-08 22:08:31 -0400, Robert Haas wrote: > That seems a bit better. I think it's really important, if we're > going to start to try to make fsync=off anything other than a toy, I think it's long past that. I've seen many, many people use it during initial data loading. > that we document really clearly the circumstances in which it is or is > not safe: Yea, we really should have done that a long time ago. > - If you crash while fsync=off, your cluster may be corrupted. HW crash, right? > - If you crash while fsync=on, but it was off at the last checkpoint, > your cluster may be corrupted. > - If you turn fsync=off, do stuff, turn fsync=on, and checkpoint > successfully, a subsequent crash should not corrupt anything. Yep. > Of course, even the last one isn't totally bullet-proof. Suppose one > backend fails to absorb the new setting for some reason... I've a hard time worrying much about that one... Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
On Fri, May 8, 2015 at 7:53 PM, Andres Freund wrote: > On 2015-05-04 14:23:16 -0400, Robert Haas wrote: >> On Fri, May 1, 2015 at 10:41 AM, Abhijit Menon-Sen >> wrote: >> > As for the non-backpatchable 0002, I agree with Andres that it should be >> > included in 9.5; but I take it you're still not convinced? >> >> No, I'm not convinced. That patch will protect you in one extremely >> specific scenario: you turn off fsync, do some stuff, shut down, turn >> fsync back on again, and start up. > > Hm. ISTM it'd not be hard to actually make it safe in nearly all > situations. What about tracking the last checkpoint's fsync setting and > do a fsync_pgdata() in the checkpointer at the end of a checkpointer if > the previous setting was off and the current one is on? Combined with > doing so at startup if the settings changed between runs, that should > give pretty decent protection. And seems fairly simple to implement. That seems a bit better. I think it's really important, if we're going to start to try to make fsync=off anything other than a toy, that we document really clearly the circumstances in which it is or is not safe: - If you crash while fsync=off, your cluster may be corrupted. - If you crash while fsync=on, but it was off at the last checkpoint, your cluster may be corrupted. - If you turn fsync=off, do stuff, turn fsync=on, and checkpoint successfully, a subsequent crash should not corrupt anything. Of course, even the last one isn't totally bullet-proof. Suppose one backend fails to absorb the new setting for some reason... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
On 2015-05-04 14:23:16 -0400, Robert Haas wrote: > On Fri, May 1, 2015 at 10:41 AM, Abhijit Menon-Sen > wrote: > > As for the non-backpatchable 0002, I agree with Andres that it should be > > included in 9.5; but I take it you're still not convinced? > > No, I'm not convinced. That patch will protect you in one extremely > specific scenario: you turn off fsync, do some stuff, shut down, turn > fsync back on again, and start up. Hm. ISTM it'd not be hard to actually make it safe in nearly all situations. What about tracking the last checkpoint's fsync setting and do a fsync_pgdata() in the checkpointer at the end of a checkpointer if the previous setting was off and the current one is on? Combined with doing so at startup if the settings changed between runs, that should give pretty decent protection. And seems fairly simple to implement. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
On Tue, May 5, 2015 at 6:26 AM, David Rowley wrote: > On 5 May 2015 at 06:23, Robert Haas wrote: >> >> OK, committed and back-patched. > > There's a couple of problems with this that the windows buildfarm members > are not happy with. > > The attached patch seems to work locally. Thanks. I think the open() stuff should be fixed by using BasicOpenFile() rather than introducing support for the two-argument form of open(). I'll push a fix shortly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
On 5 May 2015 at 06:23, Robert Haas wrote: > > OK, committed and back-patched. > There's a couple of problems with this that the windows buildfarm members are not happy with. The attached patch seems to work locally. Regards David Rowley fsync_win32_fix.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
On Fri, May 1, 2015 at 10:41 AM, Abhijit Menon-Sen wrote: > At 2015-05-01 09:57:28 -0400, robertmh...@gmail.com wrote: >> >> If you don't object to this version, I'll commit it. > > Looks fine to me, thank you. OK, committed and back-patched. > As for the non-backpatchable 0002, I agree with Andres that it should be > included in 9.5; but I take it you're still not convinced? No, I'm not convinced. That patch will protect you in one extremely specific scenario: you turn off fsync, do some stuff, shut down, turn fsync back on again, and start up. But it won't protect you if you crash while fsync is off, or after you shut down with fsync=off and before you restart with fsync=on. And there's no documentation change here that would help anyone distinguish between the situations in which they are protected and the situations in which they are not protected. Without that, a lot of people are going to get this wrong. As an alternative, how about fsync=shutdown parameter? This could be documented to fsync the data directory at shutdown. It could document that there is a risk of corruption if the server crashes, but that the database is OK if shut down cleanly. fsync=off could document that you must run initdb --sync-only after shutting down, else you are unsafe. I'm not wedded to any particular solution, but an undocumented hack that some people will manage to use safely some of the time doesn't seem good enough to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
At 2015-05-01 09:57:28 -0400, robertmh...@gmail.com wrote: > > If you don't object to this version, I'll commit it. Looks fine to me, thank you. As for the non-backpatchable 0002, I agree with Andres that it should be included in 9.5; but I take it you're still not convinced? Should I add that to the CF separately for discussion, or what? -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
Hi, I agree that splitting the patch into two separate ones is a good one. On 2015-05-01 09:57:28 -0400, Robert Haas wrote: > If you don't object to this version, I'll commit it. I believe this > part *should* be back-patched, but Tom seemed to disagree, for reasons > I'm not really clear on. This is a potential data corrupting bug as > legitimate as any other, so a back-patch seems right to me. Agreed. Especially for WAL files this seems to be a pretty clear correctness issue to me. I unsurprisingly think the other patch is a good idea too. But it's clearly *not* something for the back branches. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
On Fri, May 1, 2015 at 8:42 AM, Abhijit Menon-Sen wrote: > At 2015-05-01 08:10:16 -0400, robertmh...@gmail.com wrote: >> It seems to me that, at a minimum, it would be good to split those >> controversial and definitely not-back-patchable changes into their >> own patch. > > OK, split here (0002*). > >> I do mind putting it into xlog.c instead of some place that's actually >> appropriate. > > OK, moved to storage/file/fd.c (0001*). Here's a revised version of your 0001 patch which I am comfortable with. I changed some of the comments, and I moved the fsync_pgdata() call slightly later, so that we don't do a (possibly long) set of fsyncs before printing out the first log message that tells the user what is going on. If you don't object to this version, I'll commit it. I believe this part *should* be back-patched, but Tom seemed to disagree, for reasons I'm not really clear on. This is a potential data corrupting bug as legitimate as any other, so a back-patch seems right to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company fsync-pgdata-rmh.patch Description: binary/octet-stream -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
At 2015-05-01 08:10:16 -0400, robertmh...@gmail.com wrote: > > It seems to me that, at a minimum, it would be good to split those > controversial and definitely not-back-patchable changes into their > own patch. OK, split here (0002*). > I do mind putting it into xlog.c instead of some place that's actually > appropriate. OK, moved to storage/file/fd.c (0001*). -- Abhijit >From 088b80eb0825339eca688e4347a4ef547edcadbb Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen Date: Thu, 6 Nov 2014 00:45:56 +0530 Subject: Recursively fsync PGDATA at startup after a crash This is so that we don't lose older unflushed writes in the event of a power failure after crash recovery, where more recent writes are preserved. See 20140918083148.ga17...@alap3.anarazel.de for details. --- src/backend/access/transam/xlog.c | 49 + src/backend/storage/file/fd.c | 112 ++ src/include/storage/fd.h | 2 + 3 files changed, 163 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 6cf4415..084174d 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -845,6 +845,8 @@ static void WALInsertLockAcquireExclusive(void); static void WALInsertLockRelease(void); static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt); +static void fsync_pgdata(char *datadir); + /* * Insert an XLOG record represented by an already-constructed chain of data * chunks. This is a low-level routine; to construct the WAL record header @@ -5878,6 +5880,25 @@ StartupXLOG(void) ereport(FATAL, (errmsg("control file contains invalid data"))); + /* + * If we need to perform crash recovery, we issue an fsync on the + * data directory and its contents to try to ensure that any data + * written before the crash are flushed to disk. Otherwise a power + * failure in the near future might cause earlier unflushed writes + * to be lost, even though more recent data written to disk from + * here on would be persisted. + * + * We also do this if the control file indicates that fsync was + * disabled at some point while the server was running earlier. + */ + + if (enableFsync && + (ControlFile->state != DB_SHUTDOWNED && + ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)) + { + fsync_pgdata(data_directory); + } + if (ControlFile->state == DB_SHUTDOWNED) { /* This is the expected case, so don't be chatty in standalone mode */ @@ -11123,3 +11144,31 @@ SetWalWriterSleeping(bool sleeping) XLogCtl->WalWriterSleeping = sleeping; SpinLockRelease(&XLogCtl->info_lck); } + +/* + * Issue fsync recursively on PGDATA and all its contents. + */ +static void +fsync_pgdata(char *datadir) +{ + if (!enableFsync) + return; + + /* + * If possible, hint to the kernel that we're soon going to fsync + * the data directory and its contents. + */ +#if defined(HAVE_SYNC_FILE_RANGE) || \ + (defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)) + walkdir(datadir, pre_sync_fname); +#endif + + /* + * Now we do the fsync()s in the same order. + * + * It's important to fsync the destination directory itself as individual + * file fsyncs don't guarantee that the directory entry for the file is + * synced. + */ + walkdir(datadir, fsync_fname); +} diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index f796717..aba12ca 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -2439,3 +2439,115 @@ looks_like_temp_rel_name(const char *name) return false; return true; } + +/* + * Hint to the OS that it should get ready to fsync() this file. + * + * Adapted from pre_sync_fname in initdb.c + */ +void +pre_sync_fname(char *fname, bool isdir) +{ + int fd; + + fd = open(fname, O_RDONLY | PG_BINARY); + + /* + * Some OSs don't allow us to open directories at all (Windows returns + * EACCES) + */ + if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES)) + return; + + if (fd < 0) + ereport(FATAL, +(errmsg("could not open file \"%s\" before fsync", + fname))); + + pg_flush_data(fd, 0, 0); + + close(fd); +} + +/* + * walkdir: recursively walk a directory, applying the action to each + * regular file and directory (including the named directory itself) + * and following symbolic links. + */ +void +walkdir(char *path, void (*action) (char *fname, bool isdir)) +{ + DIR *dir; + struct dirent *de; + + dir = AllocateDir(path); + while ((de = ReadDir(dir, path)) != NULL) + { + char subpath[MAXPGPATH]; + struct stat fst; + + CHECK_FOR_INTERRUPTS(); + + if (strcmp(de->d_name, ".") == 0 || + strcmp(de->d_name, "..") == 0) + continue; + + snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name); + + if (lstat(subpath, &fst) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not stat file \"%s\": %m", subpath))); + + if (S_ISREG(fst.st_mode)) + (*action) (subpath, false); + else if (S_ISDIR(fst.
Re: [HACKERS] initdb -S and tablespaces
On Thu, Apr 30, 2015 at 11:29 PM, Abhijit Menon-Sen wrote: >> 2. I don't know why it's part of this patch. > > In 20150115133245.gg5...@awork2.anarazel.de, Andres explained his > rationale as follows: > > «What I am thinking of is that, currently, if you start the server > for initial loading with fsync=off, and then restart it, you're open > to data loss. So when the current config file setting is changed > from off to on, we should fsync the data directory. Even if there > was no crash restart.» That's awfully clever, but I'm not sure I like the idea of trying to be that clever. I think if users temporarily disable fsync, they should be responsible for using initdb -S after if that is needed in their situation, and this should be documented. It seems to me that, at a minimum, it would be good to split those controversial and definitely not-back-patchable changes into their own patch. >> Also, it seems awfully unfortunate to me that we're duplicating a >> whole pile of code into xlog.c here. > > I have pointed out and discussed the duplication several times. I did it > this way only because we were considering backporting the changes, and > at the time it seemed better to do this and fix the duplication in a > separate patch. As I've mentioned a few times, I don't mind duplicating the code if we have frontend and backend versions that are materially different. But I do mind putting it into xlog.c instead of some place that's actually appropriate. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
At 2015-04-30 16:56:17 -0700, t...@sss.pgh.pa.us wrote: > > As for the notion that this needs to be back-patched, I would say no. Not even just the "fsync after crash" part? I could separate that out from the control file changes and try to eliminate the duplication. I think that would be worth back-patching, at least. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
At 2015-04-30 15:37:44 -0400, robertmh...@gmail.com wrote: > > 1. It doesn't do that. As soon as we fsync the data directory, we >reset the flag. That's not what "ever disabled" means to me. Could you suggest an acceptable alternative wording? I can't immediately think of anything better than "disabled since the last restart". That is conditional on our resetting the flag, which we will do only if fsync is enabled at startup. So it's true, but not the whole truth. > 2. I don't know why it's part of this patch. In 20150115133245.gg5...@awork2.anarazel.de, Andres explained his rationale as follows: «What I am thinking of is that, currently, if you start the server for initial loading with fsync=off, and then restart it, you're open to data loss. So when the current config file setting is changed from off to on, we should fsync the data directory. Even if there was no crash restart.» That's what I tried to implement. > Also, it seems awfully unfortunate to me that we're duplicating a > whole pile of code into xlog.c here. I have pointed out and discussed the duplication several times. I did it this way only because we were considering backporting the changes, and at the time it seemed better to do this and fix the duplication in a separate patch. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
Robert Haas writes: > On Thu, Apr 30, 2015 at 6:44 PM, Alvaro Herrera > wrote: >> Ah, so that's not the duplicate code that I was remembering -- I think >> it's walkdir() or something like that, which is in initdb IIRC. > Yeah, walkdir() is there too. But if we're going to add that to the > backend, I think it should go in src/backend/storage/file, not > src/backend/access/transam. Agreed that .../transam is a pretty horrid choice; but maybe we should be thinking about putting it in src/common, so there's only one copy? As for the notion that this needs to be back-patched, I would say no. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
On Thu, Apr 30, 2015 at 6:44 PM, Alvaro Herrera wrote: > Robert Haas wrote: >> On Thu, Apr 30, 2015 at 6:18 PM, Alvaro Herrera >> wrote: >> >> Also, it seems awfully unfortunate to me that we're duplicating a >> >> whole pile of code into xlog.c here. Maybe there's no way to avoid >> >> the code duplication, but pre_sync_fname() seems like it'd more >> >> naturally go in fd.c than here. I dunno where walkdir should go, but >> >> again, not in xlog.c. >> > >> > Hm, there's an interest in backpatching this as a bugfix, if I >> > understand correctly; hence the duplicated code. We could remove the >> > duplicity later with a refactoring patch in master only. >> >> That seems pretty silly. If we going to add pre_sync_fname() to every >> branch, we should add it to the same (correct) file in all of them, >> not put it in xlog.c in the back-branches and fd.c in master. > > Ah, so that's not the duplicate code that I was remembering -- I think > it's walkdir() or something like that, which is in initdb IIRC. Yeah, walkdir() is there too. But if we're going to add that to the backend, I think it should go in src/backend/storage/file, not src/backend/access/transam. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
Robert Haas wrote: > On Thu, Apr 30, 2015 at 6:18 PM, Alvaro Herrera > wrote: > >> Also, it seems awfully unfortunate to me that we're duplicating a > >> whole pile of code into xlog.c here. Maybe there's no way to avoid > >> the code duplication, but pre_sync_fname() seems like it'd more > >> naturally go in fd.c than here. I dunno where walkdir should go, but > >> again, not in xlog.c. > > > > Hm, there's an interest in backpatching this as a bugfix, if I > > understand correctly; hence the duplicated code. We could remove the > > duplicity later with a refactoring patch in master only. > > That seems pretty silly. If we going to add pre_sync_fname() to every > branch, we should add it to the same (correct) file in all of them, > not put it in xlog.c in the back-branches and fd.c in master. Ah, so that's not the duplicate code that I was remembering -- I think it's walkdir() or something like that, which is in initdb IIRC. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
On Thu, Apr 30, 2015 at 6:18 PM, Alvaro Herrera wrote: >> Also, it seems awfully unfortunate to me that we're duplicating a >> whole pile of code into xlog.c here. Maybe there's no way to avoid >> the code duplication, but pre_sync_fname() seems like it'd more >> naturally go in fd.c than here. I dunno where walkdir should go, but >> again, not in xlog.c. > > Hm, there's an interest in backpatching this as a bugfix, if I > understand correctly; hence the duplicated code. We could remove the > duplicity later with a refactoring patch in master only. That seems pretty silly. If we going to add pre_sync_fname() to every branch, we should add it to the same (correct) file in all of them, not put it in xlog.c in the back-branches and fd.c in master. > However, now that you mention a pg_control flag, it becomes evident to > me that a change to pg_control cannot be back-patched ... Indeed. But I think we can solve that problem by just ripping that part out. Unless I'm missing something, it's not really doing anything useful. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
Robert Haas wrote: > Also, it seems awfully unfortunate to me that we're duplicating a > whole pile of code into xlog.c here. Maybe there's no way to avoid > the code duplication, but pre_sync_fname() seems like it'd more > naturally go in fd.c than here. I dunno where walkdir should go, but > again, not in xlog.c. Hm, there's an interest in backpatching this as a bugfix, if I understand correctly; hence the duplicated code. We could remove the duplicity later with a refactoring patch in master only. However, now that you mention a pg_control flag, it becomes evident to me that a change to pg_control cannot be back-patched ... -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
On Thu, Apr 16, 2015 at 9:24 AM, Abhijit Menon-Sen wrote: > Here's a variation of the earlier patch that follows all links in > PGDATA. Does this look more like what you had in mind? I'm really confused by the additional control-file field. It is documented as indicating whether fsync was ever disabled while the server was running. But: 1. It doesn't do that. As soon as we fsync the data directory, we reset the flag. That's not what "ever disabled" means to me. 2. I don't know why it's part of this patch. Tracking whether fsync was ever disabled could be useful forensically, but isn't related to fsync-ing the data directory after a crash, so I dunno why we'd put that in this patch. Tracking whether fsync was disabled recently, as the patch actually does, doesn't seem to be of any obvious value in preventing corruption either. Also, it seems awfully unfortunate to me that we're duplicating a whole pile of code into xlog.c here. Maybe there's no way to avoid the code duplication, but pre_sync_fname() seems like it'd more naturally go in fd.c than here. I dunno where walkdir should go, but again, not in xlog.c. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
Hi. Here's a variation of the earlier patch that follows all links in PGDATA. Does this look more like what you had in mind? -- Abhijit >From d86888b0d2f5a3a57027d26ce050a3bbb58670d3 Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen Date: Thu, 6 Nov 2014 00:45:56 +0530 Subject: Recursively fsync PGDATA at startup if needed It's needed if we need to perform crash recovery or if fsync was disabled at some point while the server was running earlier (and we must store that information in the control file). This is so that we don't lose older unflushed writes in the event of a power failure after crash recovery, where more recent writes are preserved. See 20140918083148.ga17...@alap3.anarazel.de for details. --- src/backend/access/transam/xlog.c | 168 src/bin/pg_controldata/pg_controldata.c | 2 + src/include/catalog/pg_control.h| 8 +- 3 files changed, 177 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 2580996..263d2d0 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -835,6 +835,8 @@ static void WALInsertLockAcquireExclusive(void); static void WALInsertLockRelease(void); static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt); +static void fsync_pgdata(char *datadir); + /* * Insert an XLOG record represented by an already-constructed chain of data * chunks. This is a low-level routine; to construct the WAL record header @@ -4811,6 +4813,7 @@ BootStrapXLOG(void) ControlFile->checkPoint = checkPoint.redo; ControlFile->checkPointCopy = checkPoint; ControlFile->unloggedLSN = 1; + ControlFile->fsync_disabled = false; /* Set important parameter values for use when replaying WAL */ ControlFile->MaxConnections = MaxConnections; @@ -5868,6 +5871,27 @@ StartupXLOG(void) ereport(FATAL, (errmsg("control file contains invalid data"))); + /* + * If we need to perform crash recovery, we issue an fsync on the + * data directory and its contents to try to ensure that any data + * written before the crash are flushed to disk. Otherwise a power + * failure in the near future might cause earlier unflushed writes + * to be lost, even though more recent data written to disk from + * here on would be persisted. + * + * We also do this if the control file indicates that fsync was + * disabled at some point while the server was running earlier. + */ + + if (enableFsync && + (ControlFile->fsync_disabled || + (ControlFile->state != DB_SHUTDOWNED && + ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY))) + { + fsync_pgdata(data_directory); + ControlFile->fsync_disabled = false; + } + if (ControlFile->state == DB_SHUTDOWNED) { /* This is the expected case, so don't be chatty in standalone mode */ @@ -8236,6 +8260,8 @@ CreateCheckPoint(int flags) /* crash recovery should always recover to the end of WAL */ ControlFile->minRecoveryPoint = InvalidXLogRecPtr; ControlFile->minRecoveryPointTLI = 0; + if (!enableFsync) + ControlFile->fsync_disabled = true; /* * Persist unloggedLSN value. It's reset on crash recovery, so this goes @@ -11107,3 +11133,145 @@ SetWalWriterSleeping(bool sleeping) XLogCtl->WalWriterSleeping = sleeping; SpinLockRelease(&XLogCtl->info_lck); } + +/* + * Hint to the OS that it should get ready to fsync() this file. + * + * Adapted from pre_sync_fname in initdb.c + */ +static void +pre_sync_fname(char *fname, bool isdir) +{ + int fd; + + fd = open(fname, O_RDONLY | PG_BINARY); + + /* + * Some OSs don't allow us to open directories at all (Windows returns + * EACCES) + */ + if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES)) + return; + + if (fd < 0) + ereport(FATAL, +(errmsg("could not open file \"%s\" before fsync", + fname))); + + pg_flush_data(fd, 0, 0); + + close(fd); +} + +/* + * walkdir: recursively walk a directory, applying the action to each + * regular file and directory (including the named directory itself) + * and following symbolic links. + */ +static void +walkdir(char *path, void (*action) (char *fname, bool isdir)) +{ + DIR *dir; + struct dirent *de; + + dir = AllocateDir(path); + while ((de = ReadDir(dir, path)) != NULL) + { + char subpath[MAXPGPATH]; + struct stat fst; + + CHECK_FOR_INTERRUPTS(); + + if (strcmp(de->d_name, ".") == 0 || + strcmp(de->d_name, "..") == 0) + continue; + + snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name); + + if (lstat(subpath, &fst) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not stat file \"%s\": %m", subpath))); + + if (S_ISREG(fst.st_mode)) + (*action) (subpath, false); + else if (S_ISDIR(fst.st_mode)) + walkdir(subpath, action); +#ifndef WIN32 + else if (S_ISLNK(fst.st_mode)) +#else + else if (pg_win32_is_junction(subpath)) +#endif + { +#if defined(HAVE_READLINK) || defined(WIN32) + char linkpath[MAXPGPATH]; + int len; + str
Re: [HACKERS] initdb -S and tablespaces
At 2015-04-15 11:40:34 +0530, a...@2ndquadrant.com wrote: > > Still, thanks to the example in basebackup.c, I've got most of a patch > that should follow any symlinks in PGDATA. I notice that src/bin/pg_rewind/copy_fetch.c has a traverse_datadir() function that does what we want (but it recurses into symlinks only inside pg_tblspc). -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
At 2015-04-06 10:16:36 -0300, alvhe...@2ndquadrant.com wrote: > > Well, we have many things that can be set as symlinks; some you can > have initdb create the links for you (initdb --xlogdir), others you > can move manually. Looking at sendDir() in backend/replication/basebackup.c, I notice that the only places where it expects/allows symlinks are for pg_xlog and in pg_tblspc. Still, thanks to the example in basebackup.c, I've got most of a patch that should follow any symlinks in PGDATA. I just need to test a little more before I post it. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
Abhijit Menon-Sen wrote: Hi, > At 2015-04-03 13:32:32 -0300, alvhe...@2ndquadrant.com wrote: > > I also noticed that walkdir() thinks it is completely agnostic on > > what the passed action is; except that the comment at the bottom talks > > about how fsync on directories is important, which seems out of place. > > Yes. The function behaves as documented, but the comment is clearly too > specific. I'm not sure where to put it. I could make walkdir() NOT do > it, and instead do it in the caller with the same comment. Thoughts? I think it's enough to state in the function comment that the action is applied to the top element too. Maybe add the fsync comment on the callsite. > > Hmm ... Actually, since surely we must follow symlinks everywhere, > > why do we have to do this separately for pg_tblspc? Shouldn't that > > link-following occur automatically when walking PGDATA in the first > > place? > > I'm not sure about that (and that's why I've not attached an updated > patch here). The original idea was to follow only those links that we > expect to be in PGDATA. > > I suppose it would be easier in terms of the code to follow all links, > but I don't know if it's the right thing. If that's what you think we > should do, I can post a simplified patch. Well, we have many things that can be set as symlinks; some you can have initdb create the links for you (initdb --xlogdir), others you can move manually. I think not following all links might lead to impossible-to- detect bugs such as failing to fsync new pgdata subdirectories we add in the future, for example. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
Hi Álvaro. Thanks for taking a look at the patch. At 2015-04-03 13:32:32 -0300, alvhe...@2ndquadrant.com wrote: > > But then, since the name is already telling us that it's all about > PGDATA, maybe we can remove the "recursively" part of the name? Sure, that makes sense too. Since you and Andres both like «fsync_pgdata(data_directory)», I'll change it accordingly. > I also noticed that walkdir() thinks it is completely agnostic on > what the passed action is; except that the comment at the bottom talks > about how fsync on directories is important, which seems out of place. Yes. The function behaves as documented, but the comment is clearly too specific. I'm not sure where to put it. I could make walkdir() NOT do it, and instead do it in the caller with the same comment. Thoughts? > Hmm ... Actually, since surely we must follow symlinks everywhere, > why do we have to do this separately for pg_tblspc? Shouldn't that > link-following occur automatically when walking PGDATA in the first > place? I'm not sure about that (and that's why I've not attached an updated patch here). The original idea was to follow only those links that we expect to be in PGDATA. I suppose it would be easier in terms of the code to follow all links, but I don't know if it's the right thing. If that's what you think we should do, I can post a simplified patch. > Maybe fsync_recursively should Assert() that it's only being called > with enableFsync=on; or perhaps we can have it return early if it's > unset. I prefer the latter. Will change. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
Abhijit Menon-Sen wrote: > At 2015-01-15 14:32:45 +0100, and...@2ndquadrant.com wrote: > Patch attached. > > Changes: > > 1. Renamed perform_fsync to fsync_recursively (otherwise it would read >"fsync_pgdata(pg_data)") Okay, but as far as I can tell this function is very specific to PGDATA; you couldn't use it in any other directory (or pg_tblspc would be missing and that would cause an error, right?) Therefore I think it would make sense to have the name reflect this; maybe fsync_datadir_recursively(data_directory) or fsync_pgdata_recursively(data_directory) would work? But then, since the name is already telling us that it's all about PGDATA, maybe we can remove the "recursively" part of the name? Not sure about any of this; other opinions? I also noticed that walkdir() thinks it is completely agnostic on what the passed action is; except that the comment at the bottom talks about how fsync on directories is important, which seems out of place. I wonder about walktblspc_links() too. Seems to me that that function is pretty much the same as walkdir(); would it work to add a flag to the latter to change the behavior in whatever way needs to be changed, and remove the former? Hmm ... Actually, since surely we must follow symlinks everywhere, why do we have to do this separately for pg_tblspc? Shouldn't that link-following occur automatically when walking PGDATA in the first place? > 2. Added ControlData->fsync_disabled to record whether fsync was ever >disabled while the server was running (tested in CreateCheckPoint) > 3. Run fsync_recursively at startup only if fsync is enabled > 4. Run it if we're doing crash recovery, or fsync was disabled > 5. Use pg_flush_data in pre_sync_fname > 6. Issue fsync on directories too > 7. Tested that it works if pg_xlog is a symlink (no changes). > > (In short, everything you mentioned in your earlier mail.) > > Note that I set ControlData->fsync_disabled to false in BootstrapXLOG, > but it gets set to true during a later CreateCheckPoint(). This means > we run fsync again at startup after initdb. I'm not sure what to do > about that. This all looks reasonable to me. I just noticed, though, that the fd.c routines test enableFsync and do nothing if it's not enabled; but fsync_recursively goes to all the trouble of doing stuff even if disabled, and the actions are skipped later; the enableFsync check is then responsibility of the caller. This seems a bit prone to later confusion. Maybe fsync_recursively should Assert() that it's only being called with enableFsync=on; or perhaps we can have it return early if it's unset. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
At 2015-01-15 14:32:45 +0100, and...@2ndquadrant.com wrote: > > What I am thinking of is that, currently, if you start the server for > initial loading with fsync=off, and then restart it, you're open to > data loss. So when the current config file setting is changed from off > to on, we should fsync the data directory. Even if there was no crash > restart. Patch attached. Changes: 1. Renamed perform_fsync to fsync_recursively (otherwise it would read "fsync_pgdata(pg_data)") 2. Added ControlData->fsync_disabled to record whether fsync was ever disabled while the server was running (tested in CreateCheckPoint) 3. Run fsync_recursively at startup only if fsync is enabled 4. Run it if we're doing crash recovery, or fsync was disabled 5. Use pg_flush_data in pre_sync_fname 6. Issue fsync on directories too 7. Tested that it works if pg_xlog is a symlink (no changes). (In short, everything you mentioned in your earlier mail.) Note that I set ControlData->fsync_disabled to false in BootstrapXLOG, but it gets set to true during a later CreateCheckPoint(). This means we run fsync again at startup after initdb. I'm not sure what to do about that. Is this about what you had in mind? -- Abhijit >From bb2b5f130525dd44a10eab6829b9802b8f6f7eed Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen Date: Thu, 6 Nov 2014 00:45:56 +0530 Subject: Recursively fsync PGDATA at startup if needed It's needed if we need to perform crash recovery or if fsync was disabled at some point while the server was running earlier (and we must store that information in the control file). This is so that we don't lose older unflushed writes in the event of a power failure after crash recovery, where more recent writes are preserved. See 20140918083148.ga17...@alap3.anarazel.de for details. --- src/backend/access/transam/xlog.c | 171 src/bin/pg_controldata/pg_controldata.c | 2 + src/include/catalog/pg_control.h| 8 +- 3 files changed, 180 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 71cbe0e..75a6862 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -833,6 +833,12 @@ static void WALInsertLockAcquireExclusive(void); static void WALInsertLockRelease(void); static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt); +static void pre_sync_fname(char *fname, bool isdir); +static void walkdir(char *path, void (*action) (char *fname, bool isdir)); +static void walktblspc_links(char *path, + void (*action) (char *fname, bool isdir)); +static void fsync_recursively(char *pg_data); + /* * Insert an XLOG record represented by an already-constructed chain of data * chunks. This is a low-level routine; to construct the WAL record header @@ -4721,6 +4727,7 @@ BootStrapXLOG(void) ControlFile->checkPoint = checkPoint.redo; ControlFile->checkPointCopy = checkPoint; ControlFile->unloggedLSN = 1; + ControlFile->fsync_disabled = false; /* Set important parameter values for use when replaying WAL */ ControlFile->MaxConnections = MaxConnections; @@ -5787,6 +5794,27 @@ StartupXLOG(void) ereport(FATAL, (errmsg("control file contains invalid data"))); + /* + * If we need to perform crash recovery, we issue an fsync on the + * data directory and its contents to try to ensure that any data + * written before the crash are flushed to disk. Otherwise a power + * failure in the near future might cause earlier unflushed writes + * to be lost, even though more recent data written to disk from + * here on would be persisted. + * + * We also do this if the control file indicates that fsync was + * disabled at some point while the server was running earlier. + */ + + if (enableFsync && + (ControlFile->fsync_disabled || + (ControlFile->state != DB_SHUTDOWNED && + ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY))) + { + fsync_recursively(data_directory); + ControlFile->fsync_disabled = false; + } + if (ControlFile->state == DB_SHUTDOWNED) { /* This is the expected case, so don't be chatty in standalone mode */ @@ -8137,6 +8165,8 @@ CreateCheckPoint(int flags) /* crash recovery should always recover to the end of WAL */ ControlFile->minRecoveryPoint = InvalidXLogRecPtr; ControlFile->minRecoveryPointTLI = 0; + if (!enableFsync) + ControlFile->fsync_disabled = true; /* * Persist unloggedLSN value. It's reset on crash recovery, so this goes @@ -11008,3 +11038,144 @@ SetWalWriterSleeping(bool sleeping) XLogCtl->WalWriterSleeping = sleeping; SpinLockRelease(&XLogCtl->info_lck); } + +/* + * Hint to the OS that it should get ready to fsync() this file. + * + * Adapted from pre_sync_fname in initdb.c + */ +static void +pre_sync_fname(char *fname, bool isdir) +{ + int fd; + + fd = open(fname, O_RDONLY | PG_BINARY); + + /* + * Some OSs don't allow us to open directories at all (Windows returns + * EACCES) + */ + if (fd
Re: [HACKERS] initdb -S and tablespaces
On 2015-01-15 11:02:43 +0530, Abhijit Menon-Sen wrote: > At 2015-01-14 11:59:08 +0100, and...@2ndquadrant.com wrote: > > > > > + if (ControlFile->state != DB_SHUTDOWNED && > > > + ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY) > > > + perform_fsync(data_directory); > > > + > > > > a) Please think of a slightly more descriptive name than perform_fsync > > OK. (I just copied the name initdb uses, because at the time I was still > thinking in terms of a later patch moving this to src/common.) What do > you think of fsync_recursively? fsync_pgdata? I like fsync_pgdata/datadir or something. Note that I think you'll have to check/handle pg_xlog being a symlink - we explicitly support that as a usecase... > > c) I'm wondering if we should add fsync to the control file and also > >perform an fsync if the last shutdown was clear, but fsync was > >disabled. > > Explain? "Add fsync to the control file" means store the value of the > fsync GUC setting in the control file? Yes. > And would the fsync you mention be dependent on the setting, or unconditional? What I am thinking of is that, currently, if you start the server for initial loading with fsync=off, and then restart it, you're open to data loss. So when the current config file setting is changed from off to on, we should fsync the data directory. Even if there was no crash restart. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
At 2015-01-14 11:59:08 +0100, and...@2ndquadrant.com wrote: > > > + if (ControlFile->state != DB_SHUTDOWNED && > > + ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY) > > + perform_fsync(data_directory); > > + > > a) Please think of a slightly more descriptive name than perform_fsync OK. (I just copied the name initdb uses, because at the time I was still thinking in terms of a later patch moving this to src/common.) What do you think of fsync_recursively? fsync_pgdata? I think fsync_recursively(data_directory) reads well. > b) I think we should check here whether fsync is enabled. OK, will do. > c) I'm wondering if we should add fsync to the control file and also >perform an fsync if the last shutdown was clear, but fsync was >disabled. Explain? "Add fsync to the control file" means store the value of the fsync GUC setting in the control file? And would the fsync you mention be dependent on the setting, or unconditional? > > +static void > > +pre_sync_fname(char *fname, bool isdir) > > +{ > > this essentially already exists in the backend inparts. C.f. > pg_flush_data. OK, I missed that. I'll rework the patch to use it. > Theoretically you should also invoke fsync on directories. OK. > > +* We need to name the parent of PGDATA. get_parent_directory() isn't > > +* enough here, because it can result in an empty string. > > +*/ > > + snprintf(pdir, MAXPGPATH, "%s/..", pg_data); > > + canonicalize_path(pdir); > > Hm. Why is this neded? Just syncing . should work? Not sure, will investigate. Thanks. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
Hi, On 2014-11-06 17:56:53 +0530, Abhijit Menon-Sen wrote: > + /* > + * If we need to perform crash recovery, we issue an fsync on the > + * data directory and its contents to try to ensure that any data > + * written before the crash are flushed to disk. Otherwise a power > + * failure in the near future might cause earlier unflushed writes > + * to be lost, even though more recent data written to disk from > + * here on would be persisted. > + */ > + > + if (ControlFile->state != DB_SHUTDOWNED && > + ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY) > + perform_fsync(data_directory); > + a) Please think of a slightly more descriptive name than perform_fsync b) I think we should check here whether fsync is enabled. c) I'm wondering if we should add fsync to the control file and also perform an fsync if the last shutdown was clear, but fsync was disabled. > if (ControlFile->state == DB_SHUTDOWNED) > { > /* This is the expected case, so don't be chatty in standalone > mode */ > @@ -11262,3 +11281,168 @@ SetWalWriterSleeping(bool sleeping) > XLogCtl->WalWriterSleeping = sleeping; > SpinLockRelease(&XLogCtl->info_lck); > } > + > +/* > + * Hint to the OS that it should get ready to fsync() this file. > + * > + * Adapted from pre_sync_fname in initdb.c > + */ > +static void > +pre_sync_fname(char *fname, bool isdir) > +{ this essentially already exists in the backend inparts. C.f. pg_flush_data. > +/* > + * walkdir: recursively walk a directory, applying the action to each > + * regular file and directory (including the named directory itself). > + * > + * Adapted from copydir() in copydir.c. > + */ > +static void > +walkdir(char *path, void (*action) (char *fname, bool isdir)) > +{ > + DIR*dir; > + struct dirent *de; > + > + dir = AllocateDir(path); > + while ((de = ReadDir(dir, path)) != NULL) > + { > + charsubpath[MAXPGPATH]; > + struct stat fst; > + > + CHECK_FOR_INTERRUPTS(); > + > + if (strcmp(de->d_name, ".") == 0 || > + strcmp(de->d_name, "..") == 0) > + continue; > + > + snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name); > + > + if (lstat(subpath, &fst) < 0) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not stat file \"%s\": > %m", subpath))); > + > + if (S_ISDIR(fst.st_mode)) > + walkdir(subpath, action); > + else if (S_ISREG(fst.st_mode)) > + (*action) (subpath, false); Theoretically you should also invoke fsync on directories. > +/* > + * Issue fsync recursively on PGDATA and all its contents, including the > + * links under pg_tblspc. > + * > + * Adapted from perform_fsync in initdb.c > + */ > +static void > +perform_fsync(char *pg_data) > +{ > + charpdir[MAXPGPATH]; > + charpg_tblspc[MAXPGPATH]; > + > + /* > + * We need to name the parent of PGDATA. get_parent_directory() isn't > + * enough here, because it can result in an empty string. > + */ > + snprintf(pdir, MAXPGPATH, "%s/..", pg_data); > + canonicalize_path(pdir); Hm. Why is this neded? Just syncing . should work? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
On 2014-10-30 14:30:28 +0530, Abhijit Menon-Sen wrote: > Here's a proposed patch to initdb to make initdb -S fsync everything > under pg_tblspc. It introduces a new function that calls walkdir on > every entry under pg_tblspc. This is only one approach: I could have > also changed walkdir to follow links, but that would have required a > bunch of #ifdefs for Windows (because it doesn't have symlinks), and I > guessed a separate function with two calls might be easier to patch into > back branches. I've tested this patch under various conditions on Linux, > but it could use some testing on Windows. I've pushed this. The windows buildfarm animals that run pg_upgrade (and thus --sync-only) will have to tell us whether there's a problem. I sure hope there's none... Thanks for that patch! Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
At 2014-10-30 14:30:27 +0530, a...@2ndquadrant.com wrote: > > Here's a proposed patch to initdb to make initdb -S fsync everything > under pg_tblspc. Oops, I meant to include the corresponding patch to xlog.c to do the same at startup. It's based on the initdb patch, but modified to not use fprintf/exit_nicely and so on. (Note that this was written in a single chunk to aid backpatching. There's no attempt made to share code in this set of patches.) Now attached. -- Abhijit >From ae91da4df7ee60e6f83c98801e979929442f588a Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen Date: Thu, 6 Nov 2014 00:45:56 +0530 Subject: If we need to perform crash recovery, fsync PGDATA recursively This is so that we don't lose older unflushed writes in the event of a power failure after crash recovery, where more recent writes are preserved. See 20140918083148.ga17...@alap3.anarazel.de for details. --- src/backend/access/transam/xlog.c | 184 ++ 1 file changed, 184 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index ab04380..ef95f64 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -830,6 +830,12 @@ static void WALInsertLockAcquireExclusive(void); static void WALInsertLockRelease(void); static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt); +static void pre_sync_fname(char *fname, bool isdir); +static void walkdir(char *path, void (*action) (char *fname, bool isdir)); +static void walktblspc_links(char *path, + void (*action) (char *fname, bool isdir)); +static void perform_fsync(char *pg_data); + /* * Insert an XLOG record having the specified RMID and info bytes, * with the body of the record being the data chunk(s) described by @@ -5981,6 +5987,19 @@ StartupXLOG(void) ereport(FATAL, (errmsg("control file contains invalid data"))); + /* + * If we need to perform crash recovery, we issue an fsync on the + * data directory and its contents to try to ensure that any data + * written before the crash are flushed to disk. Otherwise a power + * failure in the near future might cause earlier unflushed writes + * to be lost, even though more recent data written to disk from + * here on would be persisted. + */ + + if (ControlFile->state != DB_SHUTDOWNED && + ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY) + perform_fsync(data_directory); + if (ControlFile->state == DB_SHUTDOWNED) { /* This is the expected case, so don't be chatty in standalone mode */ @@ -11262,3 +11281,168 @@ SetWalWriterSleeping(bool sleeping) XLogCtl->WalWriterSleeping = sleeping; SpinLockRelease(&XLogCtl->info_lck); } + +/* + * Hint to the OS that it should get ready to fsync() this file. + * + * Adapted from pre_sync_fname in initdb.c + */ +static void +pre_sync_fname(char *fname, bool isdir) +{ +#if defined(HAVE_SYNC_FILE_RANGE) || \ + (defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)) + int fd; + + fd = open(fname, O_RDONLY | PG_BINARY); + + /* + * Some OSs don't allow us to open directories at all (Windows returns + * EACCES) + */ + if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES)) + return; + + if (fd < 0) + ereport(FATAL, +(errmsg("could not open file \"%s\" before fsync", + fname))); + + /* + * Prefer sync_file_range, else use posix_fadvise. We ignore any error + * here since this operation is only a hint anyway. + */ +#if defined(HAVE_SYNC_FILE_RANGE) + sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE); +#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED) + posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED); +#endif + + close(fd); +#endif +} + +/* + * walkdir: recursively walk a directory, applying the action to each + * regular file and directory (including the named directory itself). + * + * Adapted from copydir() in copydir.c. + */ +static void +walkdir(char *path, void (*action) (char *fname, bool isdir)) +{ + DIR *dir; + struct dirent *de; + + dir = AllocateDir(path); + while ((de = ReadDir(dir, path)) != NULL) + { + char subpath[MAXPGPATH]; + struct stat fst; + + CHECK_FOR_INTERRUPTS(); + + if (strcmp(de->d_name, ".") == 0 || + strcmp(de->d_name, "..") == 0) + continue; + + snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name); + + if (lstat(subpath, &fst) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not stat file \"%s\": %m", subpath))); + + if (S_ISDIR(fst.st_mode)) + walkdir(subpath, action); + else if (S_ISREG(fst.st_mode)) + (*action) (subpath, false); + } + FreeDir(dir); + + /* + * It's important to fsync the destination directory itself as individual + * file fsyncs don't guarantee that the directory entry for the file is + * synced. Recent versions of ext4 have made the window much wider but + * it's been an issue for ext3 and other filesystems in the past. + */ + (*action) (path, true); +} + +/* + * walktblspc_links: call walkdir on each ent
Re: [HACKERS] initdb -S and tablespaces
At 2014-09-29 11:54:10 +0200, and...@2ndquadrant.com wrote: > > On 2014-09-29 14:09:01 +0530, Abhijit Menon-Sen wrote: > > > > I just noticed that initdb -S ("Safely write all database files to disk > > and exit") does (only) the following in perform_fsync: > > > > pre_sync_fname(pdir, true); > > walkdir(pg_data, pre_sync_fname); > > > > fsync_fname(pdir, true); > > walkdir(pg_data, fsync_fname); > > > > walkdir() reads the directory and calls itself recursively for S_ISDIR > > entries, or calls the function for S_ISREG entries… which means it > > doesn't follow links. > > > > Which means it doesn't fsync the contents of tablespaces. > > Which means at least pg_upgrade is unsafe right > now... c.f. 630cd14426dc1daf85163ad417f3a224eb4ac7b0. Here's a proposed patch to initdb to make initdb -S fsync everything under pg_tblspc. It introduces a new function that calls walkdir on every entry under pg_tblspc. This is only one approach: I could have also changed walkdir to follow links, but that would have required a bunch of #ifdefs for Windows (because it doesn't have symlinks), and I guessed a separate function with two calls might be easier to patch into back branches. I've tested this patch under various conditions on Linux, but it could use some testing on Windows. -- Abhijit diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index dc1f1df..4fdc9eb 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -218,6 +218,7 @@ static char **filter_lines_with_token(char **lines, const char *token); static char **readfile(const char *path); static void writefile(char *path, char **lines); static void walkdir(char *path, void (*action) (char *fname, bool isdir)); +static void walktblspc_links(char *path, void (*action) (char *fname, bool isdir)); static void pre_sync_fname(char *fname, bool isdir); static void fsync_fname(char *fname, bool isdir); static FILE *popen_check(const char *command, const char *mode); @@ -588,6 +589,53 @@ walkdir(char *path, void (*action) (char *fname, bool isdir)) } /* + * walktblspc_links: call walkdir on each entry under the given + * pg_tblspc directory, or do nothing if pg_tblspc doesn't exist. + */ +static void +walktblspc_links(char *path, void (*action) (char *fname, bool isdir)) +{ + DIR *dir; + struct dirent *direntry; + char subpath[MAXPGPATH]; + + dir = opendir(path); + if (dir == NULL) + { + if (errno == ENOENT) + return; + fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"), +progname, path, strerror(errno)); + exit_nicely(); + } + + while (errno = 0, (direntry = readdir(dir)) != NULL) + { + if (strcmp(direntry->d_name, ".") == 0 || + strcmp(direntry->d_name, "..") == 0) + continue; + + snprintf(subpath, MAXPGPATH, "%s/%s", path, direntry->d_name); + + walkdir(subpath, action); + } + + if (errno) + { + fprintf(stderr, _("%s: could not read directory \"%s\": %s\n"), +progname, path, strerror(errno)); + exit_nicely(); + } + + if (closedir(dir)) + { + fprintf(stderr, _("%s: could not close directory \"%s\": %s\n"), +progname, path, strerror(errno)); + exit_nicely(); + } +} + +/* * Hint to the OS that it should get ready to fsync() this file. */ static void @@ -2377,6 +2425,7 @@ static void perform_fsync(void) { char pdir[MAXPGPATH]; + char pg_tblspc[MAXPGPATH]; fputs(_("syncing data to disk ... "), stdout); fflush(stdout); @@ -2398,6 +2447,11 @@ perform_fsync(void) /* then recursively through the directory */ walkdir(pg_data, pre_sync_fname); + /* now do the same thing for everything under pg_tblspc */ + + snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pg_data); + walktblspc_links(pg_tblspc, pre_sync_fname); + /* * Now, do the fsync()s in the same order. */ @@ -2408,6 +2462,8 @@ perform_fsync(void) /* then recursively through the directory */ walkdir(pg_data, fsync_fname); + walktblspc_links(pg_tblspc, fsync_fname); + check_ok(); } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
At 2014-09-29 12:59:09 +0200, and...@2ndquadrant.com wrote: > > So I'm afraid at least in a first patch it'll need to be a bit of > duplication. Fixing initdb's code back to 9.3 and the backend all > the way back to 9.0. OK, thanks, I'll submit two separate patches then. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
On 2014-09-29 15:43:32 +0530, Abhijit Menon-Sen wrote: > At 2014-09-29 11:54:10 +0200, and...@2ndquadrant.com wrote: > > > > Note that the perform_fsync() *was* ok for its original purpose in > > initdb. At the end of initdb there's no relevant tablespaces. But if > > used *after* pg_upgrade, that's not necessarily the case. > > Right. > > So, since I'm writing a function to fsync everything inside PGDATA > anyway, it makes sense to call it both from initdb and StartupXLOG. > It'll do what initdb -S now does, plus follow links in pg_tblspc. > > Any suggestions about where to put such a function? (I was looking at > backend/utils/init, but I'm not sure that's a good place for this.) That can't work unfortunately. Both frontend and backend code need to execute it... I'm not sure it's realistic to handle both cases the same. The error handling, opening files/directories, and all will be different. It'll also make backpatching hard :(. So I'm afraid at least in a first patch it'll need to be a bit of duplication. Fixing initdb's code back to 9.3 and the backend all the way back to 9.0. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
At 2014-09-29 11:54:10 +0200, and...@2ndquadrant.com wrote: > > Note that the perform_fsync() *was* ok for its original purpose in > initdb. At the end of initdb there's no relevant tablespaces. But if > used *after* pg_upgrade, that's not necessarily the case. Right. So, since I'm writing a function to fsync everything inside PGDATA anyway, it makes sense to call it both from initdb and StartupXLOG. It'll do what initdb -S now does, plus follow links in pg_tblspc. Any suggestions about where to put such a function? (I was looking at backend/utils/init, but I'm not sure that's a good place for this.) -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb -S and tablespaces
On 2014-09-29 14:09:01 +0530, Abhijit Menon-Sen wrote: > Hi. > > I just noticed that initdb -S ("Safely write all database files to disk > and exit") does (only) the following in perform_fsync: > > pre_sync_fname(pdir, true); > walkdir(pg_data, pre_sync_fname); > > fsync_fname(pdir, true); > walkdir(pg_data, fsync_fname); > > walkdir() reads the directory and calls itself recursively for S_ISDIR > entries, or calls the function for S_ISREG entries… which means it > doesn't follow links. > > Which means it doesn't fsync the contents of tablespaces. Which means at least pg_upgrade is unsafe right now... c.f. 630cd14426dc1daf85163ad417f3a224eb4ac7b0. Note that the perform_fsync() *was* ok for its original purpose in initdb. At the end of initdb there's no relevant tablespaces. But if used *after* pg_upgrade, that's not necessarily the case. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] initdb -S and tablespaces
Hi. I just noticed that initdb -S ("Safely write all database files to disk and exit") does (only) the following in perform_fsync: pre_sync_fname(pdir, true); walkdir(pg_data, pre_sync_fname); fsync_fname(pdir, true); walkdir(pg_data, fsync_fname); walkdir() reads the directory and calls itself recursively for S_ISDIR entries, or calls the function for S_ISREG entries… which means it doesn't follow links. Which means it doesn't fsync the contents of tablespaces. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers